arXiv 论文速递

2025-10-31 03:28
Snapshot: 20251031_0328
FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Authors: Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
First: 2025-10-29T17:58:14+00:00 · Latest: 2025-10-29T17:58:14+00:00
Abstract
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
中文标题/摘要
标题:FreeArt3D:无需训练的3D可动物体生成方法利用3D扩散
3D可动物体在机器人学、AR/VR和动画等领域中至关重要。最近对这类物体建模的方法要么依赖于需要密集视角监督的优化重建管道,要么依赖于生成前馈模型,这些模型生成粗略的几何近似,往往忽略了表面纹理。相比之下,静态3D物体的开放世界生成已经取得了显著成功,尤其是随着原生3D扩散模型(如Trellis)的出现。然而,将这些方法扩展到可动物体,通过训练原生3D扩散模型,面临着重大挑战。在本文中,我们提出了FreeArt3D,这是一种无需训练的可动3D物体生成框架。FreeArt3D 不是针对有限的可动数据训练新模型,而是重新利用一个预先训练好的静态3D扩散模型(例如Trellis)作为强大的形状先验。它将Score Distillation Sampling (SDS) 扩展到3D到4D领域,将可动性视为额外的生成维度。给定不同可动状态下的少量图像,FreeArt3D 联合优化物体的几何形状、纹理和可动参数,无需特定任务的训练或访问大规模可动数据集。我们的方法生成高保真几何形状和纹理,准确预测潜在的运动结构,并在多种物体类别中表现出良好的泛化能力。尽管遵循单个实例优化范式,FreeArt3D 完成时间仅需几分钟,并且在质量和多功能性方面显著优于先前的先进方法。
Summary / 总结
FreeArt3D is a training-free framework for generating articulated 3D objects. It repurposes a pre-trained static 3D diffusion model as a shape prior and extends Score Distillation Sampling to the 3D-to-4D domain. Given images of an object in different articulation states, FreeArt3D optimizes the object's geometry, texture, and articulation parameters without requiring additional training. The method produces high-fidelity geometry and textures, accurately predicts kinematic structures, and generalizes well across various object categories, outperforming previous approaches in both quality and versatility.
FreeArt3D 是一个无需训练的可动3D物体生成框架。它将预训练的静态3D扩散模型重新用作形状先验,并将Score Distillation Sampling扩展到3D到4D领域。给定物体在不同可动状态下的图像,FreeArt3D 无需额外训练即可优化物体的几何形状、纹理和可动参数。该方法生成高保真的几何形状和纹理,准确预测运动结构,在多种物体类别上泛化良好,并在质量和多功能性方面优于之前的方法。
ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
Authors: Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp
First: 2025-10-29T16:32:26+00:00 · Latest: 2025-10-29T16:32:26+00:00
Abstract
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
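The difference between the two navigation actions named in the abstract can be illustrated with a minimal sketch. It assumes pages are plain strings standing in for page images, a 1-based page index, and a naive keyword-overlap `search`; the names `Document`, `fetch`, and `search` are hypothetical and are not taken from the ALDEN code.

```python
from dataclasses import dataclass


@dataclass
class Document:
    pages: list[str]  # assumption: page text stands in for page images


def fetch(doc: Document, index: int):
    """Return the page at a 1-based index, or None if the index is out of range."""
    if 1 <= index <= len(doc.pages):
        return doc.pages[index - 1]
    return None


def search(doc: Document, query: str, top_k: int = 1) -> list[int]:
    """Rank pages by keyword overlap with the query; return 1-based page indices."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(page.lower().split())), i + 1)
        for i, page in enumerate(doc.pages)
    ]
    # Highest overlap first; ties broken by page order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [index for score, index in scored[:top_k] if score > 0]


doc = Document(["intro and overview", "results table for revenue", "appendix"])
hits = search(doc, "revenue table")  # content-based lookup
page = fetch(doc, 3)                 # direct access, e.g. after reading "see page 3"
```

Here `search` needs a content match, while `fetch` serves references that name a page directly, which is the kind of document structure the abstract says the fetch action exploits.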
中文标题/摘要
标题:ALDEN:在长文档中进行主动导航和证据收集的强化学习
视觉语言模型(VLMs)在解读图文丰富的图像方面表现出色,但在处理长篇复杂文档时却遇到困难,这些文档需要对分布在多页上的信息进行分析和整合。现有方法通常依赖固定的推理模板或刚性的处理流程,这迫使VLMs处于被动角色,影响了效率和泛化能力。我们提出了Active Long-DocumEnt Navigation (ALDEN),这是一种多轮次的强化学习框架,能够微调VLMs作为能够主动导航长图文文档的交互式代理。ALDEN引入了一种新的获取动作,可以直接通过索引访问页面,补充了经典的搜索动作,并更好地利用了文档结构。为了进行密集的过程监督和高效的训练,我们提出了一种基于规则的跨层次奖励机制,提供了轮次级和标记级的信号。为了解决由长文档中的大量视觉标记引起的训练不稳定性问题,我们进一步提出了一种视觉语义锚定机制,在训练过程中分别对视觉和文本表示施加双重路径的KL散度约束,以稳定它们。ALDEN在三个开源数据集构建的语料库上进行训练,实现了五个长文档基准测试中的最佳性能。总体而言,ALDEN标志着从被动文档阅读向能够自主导航和在长图文文档中进行推理的代理的一步跨越,提供了一条通往更准确和高效的长文档理解的稳健路径。
Summary / 总结
ALDEN is a multi-turn reinforcement learning framework that enhances VLMs to act as interactive agents for navigating and gathering evidence from long, visually complex documents. It introduces a fetch action to access document pages by index and a rule-based cross-level reward for efficient training. ALDEN also includes a visual-semantic anchoring mechanism to stabilize training with long documents. The model achieves state-of-the-art performance on five long-document benchmarks, demonstrating its effectiveness in active navigation and reasoning across such documents.
ALDEN 是一个多轮强化学习框架,增强 VLMs 使其作为交互式代理,用于导航和从长且视觉复杂的文档中收集证据。它引入了通过索引访问文档页面的 fetch 动作,并提出了一种基于规则的跨层级奖励以实现高效的训练。ALDEN 还包含一种视觉语义锚定机制,以在处理长文档时稳定训练。该模型在五个长文档基准测试中达到了最先进的性能,展示了其在主动导航和跨长文档推理方面的有效性。
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Authors: Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
First: 2025-10-29T15:20:10+00:00 · Latest: 2025-10-29T15:20:10+00:00
Comments: 13 pages, 6 figures
Abstract
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps. Furthermore, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We also evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
中文标题/摘要
标题:别让你的VLA失明:为OOD泛化对齐视觉表示
视觉-语言-行动(VLA)模型的成功得益于预训练的视觉-语言模型(VLMs)能够赋予智能体可转移的世界知识和视觉-语言(VL)定位,为具有更广泛泛化能力的行动模型奠定了基础。然而,当这些VLMs适应行动模态时,尚不清楚它们的原始VL表示和知识在多大程度上得到了保留。在本文中,我们系统研究了VLA微调期间表示保留情况,表明简单的行动微调会导致视觉表示的退化。为了表征和测量这些影响,我们探测了VLA的隐藏表示并分析了注意力图,进一步设计了一系列对比VLA模型与其对应VLMs的目标任务和方法,以隔离由行动微调引起的VL能力变化。我们还评估了一系列视觉表示对齐策略,并引入了一种简单而有效的方法,该方法减轻了退化并提高了对离分布(OOD)场景的泛化能力。综上所述,我们的分析阐明了行动微调与VL表示退化之间的权衡,并强调了恢复继承的VL能力的实用方法。代码已公开:https://blind-vla-paper.github.io
Summary / 总结
This paper investigates how action fine-tuning of Vision-Language-Action (VLA) models affects their visual representations and generalization. By probing hidden representations, analyzing attention maps, and contrasting VLA models with their counterpart VLMs, the study shows that naive action fine-tuning degrades visual representations. To address this, the authors evaluate several alignment strategies and propose a simple method for aligning visual representations that mitigates the degradation and improves out-of-distribution (OOD) generalization.
该研究探讨了对Vision-Language-Action (VLA)模型进行动作微调对其视觉表示和泛化能力的影响。研究发现,简单的动作微调会降低视觉表示的质量,导致在未见过的数据上的表现变差。作者提出了一种方法来对齐视觉表示,这可以缓解这种退化并提高未见过数据上的性能。分析揭示了动作微调与视觉表示质量之间的权衡,并提供了保留继承的视觉和语言能力的实用方法。
FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
Authors: Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn
Venue: NeurIPS 2025
First: 2025-06-26T18:51:04+00:00 · Latest: 2025-10-29T14:46:17+00:00
Comments: Accepted by NeurIPS 2025 - main track. Project page: https://focus-mllm-vqa.github.io/
Abstract
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details still remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3-6.5x less compute.
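The third step above (proposing and ranking regions from a relevance map) can be sketched roughly as follows. This is an illustration under stated assumptions, not the FOCUS implementation: the relevance map is taken as a given 2D grid of scores (its computation from the KV cache is not shown), proposals are fixed-size sliding windows, and the ranking score is the mean relevance inside a window. The function name `rank_regions` is hypothetical.

```python
def rank_regions(relevance, win_h, win_w, stride=1):
    """Score every win_h x win_w window by mean relevance and sort best first.

    relevance: 2D list of floats (assumed to be precomputed).
    Returns a list of (score, (top, left, height, width)) tuples.
    """
    rows, cols = len(relevance), len(relevance[0])
    scored = []
    for top in range(0, rows - win_h + 1, stride):
        for left in range(0, cols - win_w + 1, stride):
            total = sum(
                relevance[r][c]
                for r in range(top, top + win_h)
                for c in range(left, left + win_w)
            )
            scored.append((total / (win_h * win_w), (top, left, win_h, win_w)))
    scored.sort(key=lambda s: -s[0])
    return scored


# Toy 4x4 map with a small high-relevance patch in the middle right.
relevance = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.8],
    [0.0, 0.0, 0.7, 0.9],
    [0.0, 0.0, 0.0, 0.0],
]
best_score, best_box = rank_regions(relevance, 2, 2)[0]
```

In the final step, the top-ranked box would be cropped from the image and passed to the model together with the question.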
中文标题/摘要
标题:FOCUS:内部MLLM表示用于高效细粒度视觉问答
虽然多模态大型语言模型(MLLMs)在图像-文本输入方面提供了强大的感知和推理能力,但视觉问答(VQA)专注于小图像细节仍然是一项挑战。尽管视觉裁剪技术似乎很有前景,但最近的方法有几个局限性:需要针对特定任务进行微调、由于缺乏信息的穷举搜索导致效率低下,或者与高效的注意力实现不兼容。我们通过提出一种无需训练的视觉裁剪方法FOCUS来解决这些不足,该方法利用MLLM内部表示来指导对最相关图像区域的搜索。这一过程分为四个步骤:首先,我们识别VQA提示中的目标对象;其次,我们使用键值(KV)缓存计算对象相关性图;第三,我们根据该图提出并排名相关图像区域;最后,我们使用排名最高的区域执行细粒度VQA任务。由于这种有信息的搜索策略,FOCUS在四个细粒度VQA数据集和三种类型的MLLMs上均表现出色。它在准确性和效率上均优于三种流行的视觉裁剪方法,并且与表现最佳的基线ZoomEye相当,但所需计算量仅为其1/6.5至1/3。
Summary / 总结
FOCUS addresses the challenge of fine-grained visual question answering by proposing a training-free visual cropping method that leverages internal representations of Multimodal Large Language Models (MLLMs). It identifies the target object(s) in the VQA prompt, computes an object relevance map using the key-value cache, proposes and ranks relevant image regions, and performs the VQA task using the top-ranked region. FOCUS outperforms three popular visual cropping methods in both accuracy and efficiency and matches the best-performing baseline while requiring significantly less compute resources.
研究旨在通过解决现有视觉裁剪方法的局限性,提高细粒度的视觉问答(VQA)性能。FOCUS方法利用多模态大型语言模型(MLLMs)的内部表示来引导搜索相关图像区域,无需进行微调。FOCUS在准确性和效率上均优于三种流行的视觉裁剪方法,并且在使用显著较少的计算资源的同时,达到了最佳基线的性能。
More than a Moment: Towards Coherent Sequences of Audio Descriptions
Authors: Eshika Khandelwal, Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Andrew Zisserman, Gül Varol, Makarand Tapaswi
First: 2025-10-29T12:06:42+00:00 · Latest: 2025-10-29T12:06:42+00:00
Abstract
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
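The two-stage idea (several candidates per interval, then auto-regressive selection conditioned on what was already chosen) can be sketched as below. The abstract does not state the selection criterion, so this sketch substitutes a simple stand-in: pick the candidate with the least word overlap with the most recently selected descriptions. The Jaccard score, the `history` window, and the function names are assumptions made for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two descriptions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def select_sequence(candidates_per_interval, history=2):
    """Greedy left-to-right selection: each choice depends on earlier choices."""
    chosen = []
    for candidates in candidates_per_interval:
        context = chosen[-history:]

        def redundancy(candidate):
            # Highest similarity to any recently selected description.
            return max((jaccard(candidate, prev) for prev in context), default=0.0)

        chosen.append(min(candidates, key=redundancy))
    return chosen


candidates = [
    ["A man enters the kitchen"],
    ["A man enters the kitchen", "He opens the fridge"],
    ["He opens the fridge", "She waves from the door"],
]
sequence = select_sequence(candidates)
```

Generating each description independently would be free to repeat "A man enters the kitchen" in the second interval; conditioning on the running sequence is what removes that repetition in this toy case.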
中文标题/摘要
标题:超越瞬间:朝向连贯的音频描述序列
音频描述(ADs)传达屏幕上的关键信息,使视障观众能够跟随视频。为了有效,ADs 必须形成一个连贯的序列,帮助听众可视化正在展开的场景,而不是描述孤立的时刻。然而,大多数自动方法会独立生成每个 AD,通常导致重复且不连贯的描述。为了解决这个问题,我们提出了一种无需训练的方法 CoherentAD,该方法首先为每个 AD 时间间隔生成多个候选描述,然后在序列中进行自回归选择,以形成一个连贯且富有信息量的叙述。为了全面评估 AD 序列,我们引入了一个序列级度量 StoryRecall,该度量衡量预测的 ADs 如何准确传达真实叙述,同时还包括重复度量,以捕捉连续 AD 输出之间的冗余。我们的方法生成了连贯且增强叙述理解的 AD 序列,优于依赖独立生成的先前方法。
Summary / 总结
The research aims to improve audio descriptions (ADs) for visually impaired audiences by ensuring they form a coherent sequence rather than isolated moments. The proposed method, CoherentAD, generates multiple candidate descriptions for each AD time interval and then selects them in an auto-regressive manner to create a coherent and informative narrative. Evaluation metrics include StoryRecall, which assesses how well the predicted ADs convey the ground truth narrative, and repetition metrics. The method outperforms previous approaches that generate ADs independently, producing more coherent and narrative-rich sequences.
研究旨在通过确保音频描述(ADs)形成连贯的序列而非孤立的时刻,来改善视障观众的音频描述。提出的CoherentAD方法为每个AD时间间隔生成多个候选描述,并通过自回归方式选择它们以形成连贯的叙述。评估指标包括StoryRecall(衡量预测的ADs对真实叙述的传达程度)以及衡量连续描述冗余性的重复度量。该方法通过生成更连贯和信息丰富的AD序列,优于之前的依赖独立生成的方法。
Improving Temporal Consistency and Fidelity at Inference-time in Perceptual Video Restoration by Zero-shot Image-based Diffusion Models
Authors: Nasrin Rahimi, A. Murat Tekalp
First: 2025-10-29T11:40:06+00:00 · Latest: 2025-10-29T11:40:06+00:00
Abstract
Diffusion models have emerged as powerful priors for single-image restoration, but their application to zero-shot video restoration suffers from temporal inconsistencies due to the stochastic nature of sampling and complexity of incorporating explicit temporal modeling. In this work, we address the challenge of improving temporal coherence in video restoration using zero-shot image-based diffusion models without retraining or modifying their architecture. We propose two complementary inference-time strategies: (1) Perceptual Straightening Guidance (PSG) based on the neuroscience-inspired perceptual straightening hypothesis, which steers the diffusion denoising process towards smoother temporal evolution by incorporating a curvature penalty in a perceptual space to improve temporal perceptual scores, such as Fréchet Video Distance (FVD) and perceptual straightness; and (2) Multi-Path Ensemble Sampling (MPES), which aims at reducing stochastic variation by ensembling multiple diffusion trajectories to improve fidelity (distortion) scores, such as PSNR and SSIM, without sacrificing sharpness. Together, these training-free techniques provide a practical path toward temporally stable high-fidelity perceptual video restoration using large pretrained diffusion models. We performed extensive experiments over multiple datasets and degradation types, systematically evaluating each strategy to understand their strengths and limitations. Our results show that while PSG enhances temporal naturalness, particularly in case of temporal blur, MPES consistently improves fidelity and spatio-temporal perception-distortion trade-off across all tasks.
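As a rough illustration of the two quantities involved, the sketch below measures the curvature of a frame trajectory in some feature space and averages several sampled outputs. It is written under assumptions: curvature is taken as the mean angle between consecutive displacement vectors (the usual definition in the perceptual straightening literature), the feature vectors are assumed to be given, and the ensemble is a plain element-wise mean. The abstract does not specify the perceptual space or the ensembling rule, and the function names are hypothetical.

```python
import math


def mean_curvature(features):
    """Mean angle in radians between consecutive displacement vectors.

    features: list of equally sized feature vectors, one per frame.
    A straight trajectory gives 0; sharper turns give larger values.
    """
    diffs = [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(features, features[1:])]
    angles = []
    for d0, d1 in zip(diffs, diffs[1:]):
        n0 = math.sqrt(sum(x * x for x in d0))
        n1 = math.sqrt(sum(x * x for x in d1))
        if n0 == 0.0 or n1 == 0.0:
            continue  # a static step has no direction
        cos = sum(x * y for x, y in zip(d0, d1)) / (n0 * n1)
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return sum(angles) / len(angles) if angles else 0.0


def ensemble_mean(samples):
    """Element-wise mean of several equally shaped flattened frames."""
    return [sum(values) / len(values) for values in zip(*samples)]


straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
penalty_straight = mean_curvature(straight)
penalty_bent = mean_curvature(bent)
averaged = ensemble_mean([[1.0, 3.0], [3.0, 5.0]])
```

A guidance term in the spirit of PSG would push sampling toward lower `mean_curvature`, while averaging independent samples reduces variance across stochastic runs, which is the intuition behind improved PSNR and SSIM.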
中文标题/摘要
标题:通过零样本图像导向扩散模型在推断时提高感知视频恢复的时间一致性和保真度
扩散模型已成为单图像恢复的强大先验,但在零样本视频恢复中的应用由于采样的随机性和显式时间建模的复杂性而受到时间不一致性的困扰。在本文中,我们通过不重新训练或修改其架构,使用零样本图像导向扩散模型来解决视频恢复中提高时间连贯性的挑战。我们提出了两种互补的推断时策略:(1) 基于神经科学启发的感知平直假设的感知平直引导 (PSG),通过在感知空间中引入曲率惩罚来引导扩散去噪过程朝更平滑的时间演化方向进行,以改善时间感知得分,如弗雷歇视频距离 (FVD) 和感知平直度;(2) 多路径集成采样 (MPES),旨在通过集成多个扩散轨迹来减少随机性,以提高保真度 (失真) 得分,如 PSNR 和 SSIM,而不牺牲清晰度。这些无需训练的技术为使用大型预训练扩散模型进行时间稳定高保真感知视频恢复提供了实际途径。我们在多个数据集和退化类型上进行了广泛的实验,系统地评估了每种策略以了解其优缺点。我们的结果表明,虽然 PSG 在时间模糊的情况下增强了时间自然度,但 MPES 在所有任务中一致地提高了保真度和时空感知-失真权衡。
Summary / 总结
This paper addresses the challenge of temporal inconsistency in zero-shot video restoration using diffusion models. It introduces two inference-time strategies: Perceptual Straightening Guidance (PSG) and Multi-Path Ensemble Sampling (MPES). PSG improves temporal perceptual scores by incorporating a curvature penalty in a perceptual space, while MPES reduces stochastic variation by ensembling multiple diffusion trajectories. The experiments demonstrate that PSG enhances temporal naturalness, especially in cases of temporal blur, and MPES consistently improves fidelity and spatio-temporal perception-distortion trade-off across various tasks.
该研究针对零样本视频恢复中扩散模型存在的时间不一致性问题,提出了两种推理时策略:感知平直引导(PSG)通过在感知空间中引入曲率惩罚来增强时间连贯性,以及多路径集成采样(MPES)通过集成多个扩散轨迹来减少随机变化并提高保真度分数。实验结果表明,PSG特别在时间模糊的情况下提高了时间自然性,而MPES则在所有任务中一致地提高了保真度和时空感知-失真权衡。
Instance-Level Composed Image Retrieval
Authors: Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, Giorgos Tolias
Venue: NeurIPS 2025
First: 2025-10-29T10:57:59+00:00 · Latest: 2025-10-29T10:57:59+00:00
Comments: NeurIPS 2025
Abstract
The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge (comparable to retrieval among more than 40M random distractors) through a semi-automated selection of hard negatives. To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art not only on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
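One simple way to get the late-fusion behaviour described above is to multiply the two similarity scores after rescaling them, so that an image must score well on both the visual and the textual query to rank highly. The sketch below uses that choice as an assumption; the abstract does not state the fusion rule BASIC uses, and `min_max` and `fuse` are hypothetical names.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all 0.5."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(image_sims, text_sims):
    """Multiplicative late fusion of per-image similarity scores.

    image_sims[i]: similarity of database image i to the query image.
    text_sims[i]:  similarity of database image i to the query text.
    """
    return [a * b for a, b in zip(min_max(image_sims), min_max(text_sims))]


# Image 0 matches only the query image, image 2 only the query text,
# image 1 matches both reasonably well.
image_sims = [0.9, 0.8, 0.1]
text_sims = [0.1, 0.7, 0.9]
fused = fuse(image_sims, text_sims)
best = max(range(len(fused)), key=fused.__getitem__)
```

With an additive fusion, images 0 and 2 would stay competitive on the strength of a single modality; the product drives their fused score toward zero, which matches the down-weighting the abstract describes.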
中文标题/摘要
标题:实例级合成图像检索
合成图像检索(CIR),一种流行的图像检索研究方向,由于缺乏高质量的训练和评估数据而受到限制。我们引入了一个新的评估数据集i-CIR,与现有数据集不同,它专注于实例级类定义。目标是检索包含与视觉查询相同特定对象的图像,这些图像在文本查询定义的多种修改下呈现。该数据集的设计和编纂过程使其保持紧凑,以促进未来研究,同时通过半自动选择硬负例,保持其挑战性与对超过4000万随机干扰项的检索相当。为了克服获得干净、多样且合适的训练数据的挑战,我们利用预训练的视觉-语言模型(VLMs)采用了一种无需训练的方法BASIC。该方法分别估计查询-图像到图像和查询-文本到图像的相似性,并进行后期融合,以增加同时满足两个查询的图像的权重,同时降低仅与其中一个查询高度相似的图像的权重。每个单独的相似性进一步通过一组简单直观的组件进行改进。BASIC在i-CIR上以及现有遵循语义级类定义的CIR数据集上均达到了新的最佳性能。项目页面:https://vrg.fel.cvut.cz/icir/
Summary / 总结
The research addresses the limitation of existing composed image retrieval datasets by introducing i-CIR, a new dataset focusing on instance-level object retrieval. The method, BASIC, uses pre-trained vision-and-language models to estimate similarities between queries and images, performing late fusion to prioritize images that satisfy both visual and textual queries. BASIC achieves state-of-the-art performance on i-CIR and other CIR datasets with semantic-level class definitions.
研究旨在通过解决高质量训练和评估数据不足的问题,推进组合图像检索(CIR)的发展。它引入了一个新的数据集i-CIR,专注于实例级别的类别定义,以检索包含与视觉查询相同特定对象的图像,这些图像在文本查询定义的各种修改下呈现。方法BASIC利用预训练的视觉-语言模型分别估计查询-图像和查询-文本相似性,进行后期融合以优先考虑同时满足两个查询的图像。BASIC在i-CIR和现有基于语义级别的CIR数据集上均达到了新的最佳性能。
4-Doodle: Text to 3D Sketches that Move!
Authors: Hao Chen, Jiaqi Wang, Yonggang Qi, Ke Li, Kaiyue Pang, Yi-Zhe Song
First: 2025-10-29T09:33:29+00:00 · Latest: 2025-10-29T09:33:29+00:00
Abstract
We present a novel task: text-to-3D sketch animation, which aims to bring freeform sketches to life in dynamic 3D space. Unlike prior works focused on photorealistic content generation, we target sparse, stylized, and view-consistent 3D vector sketches, a lightweight and interpretable medium well-suited for visual communication and prototyping. However, this task is very challenging: (i) no paired dataset exists for text and 3D (or 4D) sketches; (ii) sketches require structural abstraction that is difficult to model with conventional 3D representations like NeRFs or point clouds; and (iii) animating such sketches demands temporal coherence and multi-view consistency, which current pipelines do not address. Therefore, we propose 4-Doodle, the first training-free framework for generating dynamic 3D sketches from text. It leverages pretrained image and video diffusion models through a dual-space distillation scheme: one space captures multi-view-consistent geometry using differentiable Bézier curves, while the other encodes motion dynamics via temporally-aware priors. Unlike prior work (e.g., DreamFusion), which optimizes from a single view per step, our multi-view optimization ensures structural alignment and avoids view ambiguity, critical for sparse sketches. Furthermore, we introduce a structure-aware motion module that separates shape-preserving trajectories from deformation-aware changes, enabling expressive motion such as flipping, rotation, and articulated movement. Extensive experiments show that our method produces temporally realistic and structurally stable 3D sketch animations, outperforming existing baselines in both fidelity and controllability. We hope this work serves as a step toward more intuitive and accessible 4D content creation.
中文标题/摘要
标题:4-Doodle:从文本到动态3D草图
我们提出了一项新颖的任务:文本到3D草图动画,旨在将自由形式的草图在动态3D空间中赋予生命。与以往专注于生成逼真内容的工作不同,我们旨在生成稀疏、风格化且视角一致的3D矢量草图,这是一种轻量级且易于解释的媒介,非常适合视觉交流和原型设计。然而,这项任务非常具有挑战性:(i)不存在文本和3D(或4D)草图的配对数据集;(ii)草图需要结构抽象,这很难用传统的3D表示法(如NeRF或点云)建模;(iii)动画此类草图需要时间连贯性和多视角一致性,而当前的管道并未解决这些问题。因此,我们提出了4-Doodle,这是第一个无需训练即可生成动态3D草图的框架。它通过一种双空间蒸馏方案利用预训练的图像和视频扩散模型:一个空间使用可微Bezier曲线捕捉多视角一致的几何结构,而另一个则通过时间感知的先验编码运动动力学。与以往工作(例如DreamFusion)从每一步的一个视角优化不同,我们的多视角优化确保了结构对齐并避免了视角歧义,这对于稀疏草图至关重要。此外,我们引入了一种结构感知的运动模块,将形状保持的轨迹与变形感知的变化分离,从而实现诸如翻转、旋转和关节运动等富有表现力的运动。大量实验表明,我们的方法生成了时间上真实且结构上稳定的3D草图动画,在保真度和可控性方面均优于现有基线。我们希望这项工作能够成为更直观和易用的4D内容创作的一个步骤。
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Authors: Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Venue: NeurIPS 2025
First: 2025-10-29T08:21:59+00:00 · Latest: 2025-10-29T08:21:59+00:00
Comments: 10 pages, 5 figures, 14 tables, NeurIPS 2025
Abstract
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
中文标题/摘要
标题:LangHOPS:基于语言的层次开放词汇部件分割
我们提出了LangHOPS,这是第一个基于多模态大型语言模型(MLLM)的开放词汇对象部件实例分割框架。给定一张图像,LangHOPS 可以联合检测和分割来自开放词汇候选类别的层次对象和部件实例。与依赖启发式或可学习视觉分组的先前方法不同,我们的方法将对象部件层次结构扎根于语言空间。它将 MLLM 集成到对象部件解析管道中,利用其丰富的知识和推理能力,并在层次结构内链接多粒度概念。我们在多个具有挑战性的场景中评估了LangHOPS,包括领域内和跨数据集对象部件实例分割以及零样本语义分割。LangHOPS 达到了最先进的技术水平,在 PartImageNet 数据集上,领域内平均精度(AP)提高了 5.5%,跨数据集提高了 4.8%,在 ADE20K 中未见过的对象部件上,mIOU 提高了 2.5%。消融研究进一步验证了语言扎根层次结构和 MLLM 驱动部件查询细化策略的有效性。代码将在此发布。
Summary / 总结
LangHOPS is a framework that uses a Multimodal Large Language Model to perform open-vocabulary object-part instance segmentation. It can detect and segment hierarchical object and part instances from various categories in an image. LangHOPS outperforms previous methods by 5.5% in Average Precision in-domain and 4.8% in cross-dataset scenarios on PartImageNet, and achieves 2.5% mIOU improvement in zero-shot semantic segmentation on ADE20K. Ablation studies confirm the effectiveness of its language-grounded hierarchy and part query refinement strategy.
LangHOPS 是一个使用多模态大型语言模型进行开放词汇对象部件实例分割的框架。它可以检测和分割图像中不同类别的层次化对象和部件实例。在 PartImageNet 数据集上,LangHOPS 的平均精度(AP)在域内设置中超过先前方法 5.5%,在跨数据集设置中超过 4.8%,并在 ADE20K 的零样本语义分割上实现了 2.5% 的 mIOU 提升。消融研究证实了其语言导向的层次结构和部件查询精炼策略的有效性。
Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models
Authors: Juan Ren, Mark Dras, Usman Naseem
First: 2025-10-29T05:23:24+00:00 · Latest: 2025-10-29T05:23:24+00:00
Abstract
Agentic methods have emerged as a powerful and autonomous paradigm that enhances reasoning, collaboration, and adaptive control, enabling systems to coordinate and independently solve complex tasks. We extend this paradigm to safety alignment by introducing Agentic Moderation, a model-agnostic framework that leverages specialised agents to defend multimodal systems against jailbreak attacks. Unlike prior approaches that apply as a static layer over inputs or outputs and provide only binary classifications (safe or unsafe), our method integrates dynamic, cooperative agents, including Shield, Responder, Evaluator, and Reflector, to achieve context-aware and interpretable moderation. Extensive experiments across five datasets and four representative Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF), and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable, and well-balanced safety performance. By harnessing the flexibility and reasoning capacity of agentic architectures, Agentic Moderation provides modular, scalable, and fine-grained safety enforcement, highlighting the broader potential of agentic systems as a foundation for automated safety governance.
中文标题/摘要
标题:代理调节:多代理设计以实现更安全的视觉-语言模型
代理方法已成为一种强大且自主的范式,增强了推理、协作和自适应控制能力,使系统能够协调并独立解决复杂任务。我们通过引入代理调节将这一范式扩展到安全性对齐,这是一种模型无关的框架,利用专门的代理来防御多模态系统免受脱逃攻击。与先前仅在输入或输出上作为静态层应用并仅提供二元分类(安全或不安全)的方法不同,我们的方法整合了动态、协作的代理,包括盾牌、响应者、评估者和反射器,以实现上下文感知和可解释的调节。在五个数据集和四个代表性大型视觉-语言模型(LVLMs)上的广泛实验表明,我们的方法将攻击成功率(ASR)降低了7-19%,保持了稳定的不跟随率(NF),并将拒绝率(RR)提高了4-20%,实现了稳健、可解释且平衡的安全性能。通过利用代理架构的灵活性和推理能力,代理调节提供了模块化、可扩展和精细的安全执行,突显了代理系统作为自动化安全治理基础的更广泛潜力。
Summary / 总结
The research introduces Agentic Moderation, a model-agnostic framework that uses specialized agents to defend multimodal systems against jailbreak attacks. Unlike previous methods that provide binary classifications, this approach integrates dynamic agents to achieve context-aware and interpretable moderation. Experiments show a 7-19% reduction in Attack Success Rate, a stable Non-Following Rate, and a 4-20% increase in Refusal Rate, demonstrating robust, interpretable, and balanced safety performance.
研究旨在通过引入Agentic Moderation框架来增强视觉语言模型的安全性,该框架使用专门的代理来防御脱逃攻击。该方法整合了动态协作的代理,如Shield、Responder、Evaluator和Reflector,以实现上下文感知和可解释的调节。实验结果显示,该方法将攻击成功率降低了7-19%,保持了稳定的Non-Following Rate,并将拒绝率提高了4-20%,展示了稳健、可解释和平衡的安全性能。
Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Authors: Enming Zhang, Peizhe Gong, Xingyuan Dai, Min Huang, Yisheng Lv, Qinghai Miao
First: 2025-03-09T07:53:19+00:00 · Latest: 2025-10-29T04:35:35+00:00
Abstract
Ensuring the safety of vision-language models (VLMs) in autonomous driving systems is of paramount importance, yet existing research has largely focused on conventional benchmarks rather than safety-critical evaluation. In this work, we present SCD-Bench (Safety Cognition Driving Benchmark), a novel framework specifically designed to assess the safety cognition capabilities of VLMs within interactive driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving Annotation), a semi-automated labeling system whose output is further refined through review by experts in autonomous driving. To facilitate scalable and consistent evaluation, we also propose an automated assessment pipeline leveraging large language models, which demonstrates over 98% agreement with human expert judgments. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task, comprising 324.35K high-quality samples. Through extensive experiments, we show that models trained on SCD-Training exhibit marked improvements not only on SCD-Bench, but also on general and domain-specific benchmarks, offering a new perspective on enhancing safety-aware interactions in vision-language systems for autonomous driving.
中文标题/摘要
标题:视觉语言模型在自动驾驶中的安全认知能力评估
确保视觉语言模型(VLMs)在自动驾驶系统中的安全性至关重要,但现有研究主要集中在传统基准上,而非安全关键评估。在本工作中,我们提出了SCD-Bench(安全认知驾驶基准),这是一种专门设计用于评估VLMs在交互驾驶场景中安全认知能力的新框架。为了解决数据标注的可扩展性挑战,我们引入了ADA(自动驾驶标注),这是一个半自动化标注系统,并通过领域专家的专业审查进一步优化。为了实现可扩展和一致的评估,我们还提出了一种基于大型语言模型的自动化评估流水线,该流水线与人类专家判断的一致率超过98%。在解决VLMs与驾驶环境中的安全认知对齐的更广泛挑战时,我们构建了SCD-Training,这是首个针对此任务的大规模数据集,包含324.35K(约32.4万)个高质量样本。通过广泛的实验,我们表明,使用SCD-Training训练的模型不仅在SCD-Bench上表现出显著改进,还在通用和特定领域的基准上也表现出改进,为增强视觉语言系统在自动驾驶中的安全意识交互提供了新的视角。
Summary / 总结
This work introduces SCD-Bench, a novel framework to evaluate the safety cognition capabilities of vision-language models (VLMs) in autonomous driving scenarios. It addresses the scalability challenge through ADA, a semi-automated labeling system, and proposes an automated assessment pipeline with over 98% agreement with human judgments. The study also constructs SCD-Training, a large-scale dataset for safety cognition training, which leads to significant improvements in both SCD-Bench and general benchmarks, enhancing safety-aware interactions in VLMs for autonomous driving.
该研究引入了SCD-Bench,这是一种新型框架,用于评估视觉-语言模型(VLMs)在自动驾驶场景中的安全认知能力。它通过ADA半自动化标注系统和使用大型语言模型的自动化评估流水线来解决可扩展性挑战。研究还提出了SCD-Training,这是一个大规模数据集,用于训练VLMs,并展示了在该数据集上训练的模型不仅在SCD-Bench,还在通用和特定领域基准测试中表现出显著改进,从而增强了VLMs在自动驾驶中的安全感知交互。
Revisiting Reconstruction-based AI-generated Image Detection: A Geometric Perspective
Authors: Wan Jiang, Jing Yan, Ruixuan Zhang, Xiaojing Chen, Changtao Miao, Zhe Li, Chenhao Lin, Yunfeng Diao, Richang Hong
First: 2025-10-29T03:45:03+00:00 · Latest: 2025-10-29T03:45:03+00:00
Abstract
The rise of generative Artificial Intelligence (AI) has made detecting AI-generated images a critical challenge for ensuring authenticity. Existing reconstruction-based methods lack theoretical foundations and rely on empirical heuristics, limiting interpretability and reliability. In this paper, we introduce the Jacobian-Spectral Lower Bound for reconstruction error from a geometric perspective, showing that real images off the reconstruction manifold exhibit a non-trivial error lower bound, while generated images on the manifold have near-zero error. Furthermore, we reveal the limitations of existing methods that rely on static reconstruction error from a single pass. These methods often fail when some real images exhibit lower error than generated ones. This counterintuitive behavior reduces detection accuracy and requires data-specific threshold tuning, limiting their applicability in real-world scenarios. To address these challenges, we propose ReGap, a training-free method that computes dynamic reconstruction error by leveraging structured editing operations to introduce controlled perturbations. This enables measuring error changes before and after editing, improving detection accuracy by enhancing error separation. Experimental results show that our method outperforms existing baselines, exhibits robustness to common post-processing operations, and generalizes effectively across diverse conditions.
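The contrast between a static, single-pass reconstruction error and the dynamic error described above can be written down in a few lines. This is a minimal sketch under assumptions: `reconstruct` and `edit` are placeholder callables (in practice a generative model's reconstruction pass and a structured editing operation), images are flat lists of floats, and the error is an L2 distance. The abstract does not say which direction of change indicates a generated image, so the sketch only returns the gap and applies no decision rule.

```python
import math


def reconstruction_error(x, reconstruct):
    """Static error: L2 distance between an input and its reconstruction."""
    r = reconstruct(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, r)))


def dynamic_error(x, reconstruct, edit):
    """Change in reconstruction error caused by a controlled edit."""
    before = reconstruction_error(x, reconstruct)
    after = reconstruction_error(edit(x), reconstruct)
    return after - before


# Toy stand-ins (hypothetical): the "manifold" is the axis where the second
# coordinate is zero, and the edit shifts every coordinate by 0.1.
def toy_reconstruct(x):
    return [x[0], 0.0]


def toy_edit(x):
    return [v + 0.1 for v in x]


static = reconstruction_error([1.0, 0.0], toy_reconstruct)
gap = dynamic_error([1.0, 0.0], toy_reconstruct, toy_edit)
```

A detector built on the static value alone needs a dataset-specific threshold; a detector built on the gap compares each image with an edited version of itself, which is the property the abstract credits for better error separation.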
中文标题/摘要
标题:从几何视角重访基于重建的AI生成图像检测
生成型人工智能的兴起使得检测AI生成的图像成为确保真实性的关键挑战。现有的基于重建的方法缺乏理论基础,依赖于经验直觉,限制了可解释性和可靠性。在本文中,我们从几何视角引入了雅可比谱下界来衡量重建误差,表明真实图像在重建流形之外表现出非平凡的误差下界,而流形上的生成图像则具有接近零的误差。此外,我们揭示了现有方法依赖单一通过静态重建误差的局限性。这些方法在某些真实图像的误差低于生成图像时往往会失效,这种反直觉的行为降低了检测准确性,并要求针对特定数据进行阈值调整,限制了其在实际场景中的应用。为了解决这些挑战,我们提出了ReGap,这是一种无需训练的方法,通过利用结构化编辑操作引入可控扰动来计算动态重建误差。这使得在编辑前后测量误差变化成为可能,通过增强误差分离来提高检测准确性。实验结果表明,我们的方法优于现有基线,对常见的后处理操作具有鲁棒性,并且在多种条件下具有良好的泛化能力。
Summary / 总结
This paper addresses the challenge of detecting AI-generated images by revisiting reconstruction-based methods from a geometric perspective. It introduces the Jacobian-Spectral Lower Bound to show that real images off the reconstruction manifold have a non-trivial error lower bound, while generated images on the manifold have near-zero error. The authors reveal the limitations of existing methods that rely on static reconstruction error and propose ReGap, a training-free method that computes dynamic reconstruction error using structured editing operations. Experimental results demonstrate that ReGap outperforms existing baselines, is robust to common post-processing operations, and generalizes well across diverse conditions.
本文从几何角度重新审视基于重建的方法,以解决检测AI生成图像的挑战。它引入了Jacobian-Spectral Lower Bound,表明真实图像具有非平凡的错误下限,而生成图像的错误接近零。论文还指出现有方法依赖静态重建误差的局限性。为克服这些局限性,作者提出了ReGap,这是一种无需训练的方法,通过结构化编辑操作计算动态重建误差,从而提高检测准确性。实验结果表明,ReGap在检测准确性上优于现有方法,并且对常见的后处理操作具有鲁棒性。
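The idea of measuring the change in reconstruction error before and after an edit can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not specify the editing operation, the reconstruction model, or the scoring rule, so `toy_reconstruct`, `toy_edit`, and the simple difference used as the score are all hypothetical stand-ins.

```python
import math


def l2_error(a, b):
    """Euclidean distance between two equally sized pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dynamic_reconstruction_gap(image, reconstruct, edit):
    """Change in reconstruction error caused by a controlled edit (assumed score)."""
    static_error = l2_error(image, reconstruct(image))    # single-pass error
    edited = edit(image)                                   # controlled perturbation
    edited_error = l2_error(edited, reconstruct(edited))  # error after editing
    return edited_error - static_error


# Hypothetical stand-ins: the "manifold" is the first axis,
# and the "edit" halves every pixel value.
def toy_reconstruct(v):
    return [v[0]] + [0.0] * (len(v) - 1)


def toy_edit(v):
    return [0.5 * x for x in v]


on_manifold_gap = dynamic_reconstruction_gap([2.0, 0.0], toy_reconstruct, toy_edit)
off_manifold_gap = dynamic_reconstruction_gap([2.0, 3.0], toy_reconstruct, toy_edit)
```

In this toy setup the on-manifold point keeps zero error under the edit while the off-manifold point's error shifts, which is the kind of before/after contrast a static single-pass error cannot express. Which direction the gap moves for real and generated images in practice depends on the actual editor and reconstructor.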
RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation
Authors: Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, Si Liu
Venue: NeurIPS 2025
First: 2025-06-07T06:15:49+00:00 · Latest: 2025-10-29T03:38:36+00:00
Comments: 25 pages, 18 figures, Accepted by NeurIPS 2025
Abstract
Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs' strengths in semantic reasoning and long-horizon planning. These System 2 capabilities, characterized by deliberative, goal-directed thinking, remain underexplored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
中文标题/摘要
标题:RoboCerebra:大规模长时机器人操作评估基准
近期视觉-语言模型(VLMs)的进步使指令驱动的机器人系统具备了更好的泛化能力。然而,现有大多数工作集中在反应型System 1策略上,未能充分利用VLMs在语义推理和长时规划方面的优势。这些代表深思熟虑、目标导向思考的System 2能力因当前基准的有限时间尺度和结构复杂性而未得到充分探索。为解决这一问题,我们引入了RoboCerebra,一个用于评估长时机器人操作高级推理的基准。RoboCerebra包括:(1) 一个大规模模拟数据集,包含扩展的任务时序和家庭环境中的多样化子任务序列;(2) 一个分层框架,结合了高层VLM规划器和低层视觉-语言-行动(VLA)控制器;(3) 一个通过结构化的System 1-System 2交互来评估规划、反思和记忆的评估协议。数据集通过自上而下的管道构建,其中GPT生成任务指令并将其分解为子任务序列。人类操作者在模拟中执行子任务,生成具有动态物体变化的高质量轨迹。与先前的基准相比,RoboCerebra具有更长的动作序列和更密集的注释。我们进一步将最先进的VLMs作为System 2模块进行基准测试,并在关键认知维度上分析其性能,推动更强大和通用的机器人规划器的发展。
Summary / 总结
RoboCerebra is a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. It includes a large-scale simulation dataset with extended task horizons and diverse subtask sequences, a hierarchical framework combining a high-level VLM planner with a low-level VLA controller, and an evaluation protocol targeting planning, reflection, and memory. Compared to previous benchmarks, RoboCerebra features longer action sequences and denser annotations. The benchmark is used to assess state-of-the-art VLMs as System 2 modules and analyze their performance across cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
RoboCerebra 是一个用于评估长期机器人操作中高级推理能力的基准。它包括一个具有扩展任务时序和多样化子任务序列的大规模模拟数据集、一个结合高级 VLM 计划器和低级 VLA 控制器的分层框架,以及一个针对规划、反思和记忆的评估协议。与之前的基准相比,RoboCerebra 具有更长的动作序列和更密集的注释。该基准用于评估最先进的 VLM 作为 System 2 模块,并分析其在认知维度上的表现,从而推动更强大和通用的机器人规划器的发展。
Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Authors: Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim
Venue: NeurIPS 2025
First: 2025-10-29T01:58:35+00:00 · Latest: 2025-10-29T01:58:35+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
中文标题/摘要
标题:视觉多样性与区域感知提示学习在零样本HOI检测中的应用
零样本人类-物体交互检测旨在在图像中定位人类和物体,并识别它们之间的交互,即使在训练期间未见过特定的动词-物体配对。最近的研究表明,使用预训练的视觉-语言模型(如CLIP)进行提示学习可以取得有希望的结果,这些模型可以在共享嵌入空间中将自然语言提示与视觉特征对齐。然而,现有的方法仍然无法处理交互的视觉复杂性,包括(1)同类别视觉多样性,即同一动词的不同实例在不同的姿态和上下文中出现,以及(2)跨类别视觉纠缠,即不同的动词产生视觉上相似的模式。为了解决这些挑战,我们提出了VDRP,一种视觉多样性和区域感知提示学习框架。首先,我们引入了一种视觉多样性感知的提示学习策略,将组内视觉方差注入到上下文嵌入中。我们进一步应用高斯扰动,以鼓励提示捕捉动词的多种视觉变化。其次,我们从人类、物体和联合区域检索区域特定的概念。这些概念用于增强视觉多样性感知的提示嵌入,生成区域感知的提示,从而增强动词级别的区分能力。在HICO-DET基准测试上的实验表明,我们的方法在四种零样本评估设置下均达到了最先进的性能,有效地解决了同类别多样性和跨类别视觉纠缠的问题。代码可在https://github.com/mlvlab/VDRP/获取。
Summary / 总结
The paper addresses the challenges of zero-shot Human-Object Interaction (HOI) detection, particularly visual complexity due to intra-class diversity and inter-class entanglement. It introduces VDRP, a framework that includes a visual diversity-aware prompt learning strategy and region-aware prompts. Experiments show that VDRP outperforms existing methods on the HICO-DET benchmark, effectively handling both intra-class and inter-class visual complexities.
论文针对零样本Human-Object Interaction (HOI)检测中的视觉复杂性挑战,特别是同类别视觉多样性与跨类别视觉纠缠。提出了VDRP框架,包括视觉多样性感知的提示学习策略和区域感知提示。VDRP增强了模型处理多样视觉模式和动词级区分的能力,在HICO-DET基准测试中达到了最先进的性能。
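The Gaussian perturbation step described in the abstract can be illustrated with a small sketch. It assumes the context embedding and the group-wise visual variance are plain per-dimension lists and that the noise is zero-mean with a standard deviation equal to the square root of that variance; the abstract does not give the exact injection rule, and `perturb_context` and its `scale` parameter are hypothetical names.

```python
import math
import random


def perturb_context(context, group_variance, rng, scale=1.0):
    """Add zero-mean Gaussian noise whose per-dimension std follows the group variance."""
    return [
        c + scale * math.sqrt(v) * rng.gauss(0.0, 1.0)
        for c, v in zip(context, group_variance)
    ]


rng = random.Random(0)
context = [1.0, 2.0, 3.0]
# A dimension with zero variance is left untouched; higher variance means larger jitter.
noisy = perturb_context(context, [0.0, 0.01, 0.25], rng)
```

Sampling a fresh perturbation at each training step would expose the prompt to a spread of embeddings around the same verb, which is one plausible reading of encouraging prompts to capture diverse visual variations.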
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Authors: Jiaqi Wang, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou
First: 2025-05-22T16:13:29+00:00 · Latest: 2025-10-29T01:19:12+00:00
Comments: camera ready revision
Abstract
Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
中文标题/摘要
标题:思考还是不思考?通过强化学习实现选择性推理的视觉-语言模型
强化学习(RL)已被证明是提高视觉-语言模型(VLMs)推理能力的有效后训练策略。最近的GRPO方法鼓励模型在回答前生成完整的推理轨迹,导致了更多的令牌使用和计算成本。受人类思考过程的启发——人们在遇到简单问题时会跳过推理,而在需要时会仔细思考,我们探索如何使VLMs首先决定何时需要推理。为此,我们提出了TON,一种两阶段训练策略:(i) 一个带有简单有效的“思考删除”操作的监督微调(SFT)阶段,其中随机用空想法替换推理轨迹,这引入了一种思考或不思考的格式,作为选择性推理的冷启动;(ii) 一个GRPO阶段,使模型能够自由探索何时思考或不思考,同时最大化任务感知的结果奖励。实验结果表明,与vanilla GRPO相比,TON可以将完成长度最多减少90%,而不会牺牲性能甚至提高性能。在LLM(GSM8K)、VLM(CLEVR、Super-CLEVR、GeoQA)和Agentic(AITZ)任务中的一致评估涵盖了从3B到7B模型的各种推理难度,结果显示模型随着训练的进行逐渐学会了绕过不必要的推理步骤。这些发现为RL方法中的人类级推理模式指明了道路。我们的代码可在https://github.com/kokolerk/TON获取。
Summary / 总结
This paper proposes TON, a two-stage training strategy for vision-language models to enable selective reasoning. The first stage uses supervised fine-tuning with a 'thought dropout' operation to introduce a think-or-not format. The second stage employs Group Relative Policy Optimization to allow models to decide when to reason. Experiments show that TON can reduce completion length by up to 90% compared to vanilla GRPO, without sacrificing performance and sometimes even improving it. This method helps models learn to bypass unnecessary reasoning steps, moving closer to human-like reasoning patterns.
本文提出了一种名为TON的两阶段训练策略,旨在使视觉-语言模型(VLMs)能够进行选择性推理。第一阶段使用监督微调和“思考跳过”操作引入思考或不思考的格式。第二阶段采用组相对策略优化(GRPO)让模型决定何时推理。实验表明,TON可以将完成长度最多减少90%,同时在各种任务中保持或提升性能。这表明随着训练的进行,模型学会了跳过不必要的推理步骤,符合人类的推理模式。
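The 'thought dropout' operation in the SFT stage can be sketched as a data-preparation step. This is an illustration under assumptions rather than the released code: the dictionary fields, the `<think>` tag format, and the `drop_prob` parameter are hypothetical, and the abstract only states that reasoning traces are randomly replaced with empty thoughts.

```python
import random


def thought_dropout(examples, drop_prob, seed=0):
    """Return a copy of the SFT examples with some reasoning traces emptied."""
    rng = random.Random(seed)
    out = []
    for ex in examples:
        # With probability drop_prob, replace the trace with an empty thought.
        thought = "" if rng.random() < drop_prob else ex["thought"]
        out.append({"question": ex["question"], "thought": thought, "answer": ex["answer"]})
    return out


def format_target(example):
    """Serialize one example; the tag names are an assumed format."""
    return f"<think>{example['thought']}</think>{example['answer']}"


examples = [
    {"question": "2 + 2?", "thought": "Add the two numbers.", "answer": "4"},
    {"question": "Color of the cube?", "thought": "Locate the cube, read its color.", "answer": "red"},
]
sft_data = thought_dropout(examples, drop_prob=0.5)
targets = [format_target(ex) for ex in sft_data]
```

Because both full and empty traces appear in the same format, the model sees that skipping the thought is a valid output, which is what lets the later GRPO stage explore when to think.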
DRIP: Dynamic patch Reduction via Interpretable Pooling
Authors: Yusen Peng, Sachin Kumar
First: 2025-10-29T01:10:28+00:00 · Latest: 2025-10-29T01:10:28+00:00
Abstract
Recent advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, because pretraining is large-scale and hence expensive, efficiency concerns have discouraged researchers from attempting to pretrain a vision-language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
中文标题/摘要
标题:DRIP: 动态可解释池化缩减patches
近年来,视觉-语言模型的进步,包括对比预训练和指令调优,极大地推动了多模态人工智能的前沿。然而,由于大规模预训练成本高昂,效率问题阻碍了研究人员从头开始预训练视觉语言模型的努力。在本工作中,我们提出了动态可解释池化缩减patches (DRIP),该方法适应输入图像并在视觉编码器的深层动态合并tokens。我们在从零开始训练ImageNet和CLIP对比预训练上的结果表明,在保持相当的分类/零样本性能的同时,显著减少了GFLOP。为了进一步验证我们提出的方法,我们在一个大型生物学数据集上进行持续预训练,将其影响扩展到科学领域。
Summary / 总结
The research aims to address the efficiency concerns in pretraining vision-language models by proposing DRIP, which dynamically merges tokens in deeper layers of a visual encoder. The method achieves a significant reduction in GFLOPs while maintaining comparable performance on ImageNet training from scratch and CLIP contrastive pretraining. Additionally, continual pretraining on a large biology dataset further validates the method's effectiveness in scientific domains.
论文提出了DRIP方法,该方法在视觉编码器的深层动态合并tokens以减少计算成本,同时保持性能。实验结果显示,DRIP在ImageNet从零开始训练和CLIP对比预训练中实现了显著的GFLOP减少,且分类和零样本性能相当。进一步在大规模生物学数据集上的持续预训练验证了该方法在科学领域的有效性。
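To make input-dependent token merging concrete, here is a toy sketch. The abstract does not describe DRIP's pooling rule, so this example substitutes a simple stand-in technique: consecutive patch tokens are averaged whenever their cosine similarity exceeds a threshold. The function names and the threshold are hypothetical and should not be read as the paper's method.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(group):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(group) for col in zip(*group)]


def merge_adjacent_tokens(tokens, threshold):
    """Average runs of consecutive tokens that are similar to their predecessor."""
    if not tokens:
        return []
    merged, group = [], [tokens[0]]
    for tok in tokens[1:]:
        if cosine(group[-1], tok) >= threshold:
            group.append(tok)            # similar: extend the current run
        else:
            merged.append(mean_vector(group))
            group = [tok]                # dissimilar: start a new run
    merged.append(mean_vector(group))
    return merged


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
reduced = merge_adjacent_tokens(tokens, threshold=0.9)
```

The number of output tokens depends on the image content: a uniform region collapses to few tokens and a detailed one keeps more, which is where the compute savings in later layers would come from.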
Efficient License Plate Recognition via Pseudo-Labeled Supervision with Grounding DINO and YOLOv8
Authors: Zahra Ebrahimi Vargoorani, Amir Mohammad Ghoreyshi, Ching Yee Suen
First: 2025-10-28T23:21:00+00:00 · Latest: 2025-10-28T23:21:00+00:00
Comments: 6 pages, 8 figures. Presented at 2025 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), August 31 - September 3, 2025, Istanbul, Turkey
Abstract
Developing a highly accurate automatic license plate recognition (ALPR) system is challenging due to environmental factors such as lighting, rain, and dust. Additional difficulties include high vehicle speeds, varying camera angles, and low-quality or low-resolution images. ALPR is vital in traffic control, parking, vehicle tracking, toll collection, and law enforcement applications. This paper proposes a deep learning strategy using YOLOv8 for license plate detection and recognition tasks. This method seeks to enhance the performance of the model using datasets from Ontario, Quebec, California, and New York State. It achieved an impressive recall rate of 94% on the dataset from the Center for Pattern Recognition and Machine Intelligence (CENPARMI) and 91% on the UFPR-ALPR dataset. In addition, our method follows a semi-supervised learning framework, combining a small set of manually labeled data with pseudo-labels generated by Grounding DINO to train our detection model. Grounding DINO, a powerful vision-language model, automatically annotates many images with bounding boxes for license plates, thereby minimizing the reliance on labor-intensive manual labeling. By integrating human-verified and model-generated annotations, we can scale our dataset efficiently while maintaining label quality, which significantly enhances the training process and overall model performance. Furthermore, we report character error rates for both datasets, providing additional insight into system performance.
中文标题/摘要
标题:通过Grounding DINO和YOLOv8的伪标签监督实现高效的车牌识别
由于环境因素如光照、雨和尘土的影响,开发高精度的自动车牌识别系统(ALPR)具有挑战性。此外,高速行驶的车辆、不同的摄像头角度以及低质量或低分辨率的图像也增加了难度。ALPR在交通控制、停车、车辆跟踪、收费和执法等领域至关重要。本文提出了一种使用YOLOv8进行车牌检测和识别任务的深度学习策略。该方法利用来自安大略省、魁北克省、加利福尼亚州和纽约州的数据集,提高了模型性能。在模式识别与机器智能中心(CENPARMI)的数据集上,该方法实现了94%的召回率,在UFPR-ALPR数据集上实现了91%的召回率。此外,该方法采用半监督学习框架,结合少量的手动标注数据和由Grounding DINO生成的伪标签来训练检测模型。Grounding DINO是一种强大的视觉-语言模型,能够自动为许多图像标注车牌的边界框,从而减少了对劳动密集型手动标注的依赖。通过整合人工验证和模型生成的注释,我们可以高效地扩展数据集,同时保持标签质量,这显著提高了训练过程和整体模型性能。此外,还报告了两个数据集的字符错误率,提供了系统性能的额外见解。
Summary / 总结
This paper addresses the challenges of developing an accurate automatic license plate recognition system, particularly in varying environmental conditions. It proposes a deep learning approach using YOLOv8 for license plate detection and recognition, leveraging datasets from Ontario, Quebec, California, and New York State. The method employs a semi-supervised learning framework, combining manually labeled data with pseudo-labels generated by Grounding DINO, achieving a recall rate of 94% on the CENPARMI dataset and 91% on the UFPR-ALPR dataset. Character error rates are also reported, providing further insights into the system's performance.
本文针对在不同环境条件下开发高精度自动车牌识别系统面临的挑战,提出了一种使用YOLOv8进行车牌检测和识别的深度学习方法,利用来自安大略省、魁北克省、加利福尼亚州和纽约州的数据集。该方法采用半监督学习框架,结合人工标注数据和由Grounding DINO生成的伪标签,分别在CENPARMI数据集和UFPR-ALPR数据集上实现了94%和91%的召回率。还报告了字符错误率,进一步提供了系统性能的见解。
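The semi-supervised data construction can be sketched as a merge of human-verified boxes with model-generated ones. This is a minimal illustration with assumptions stated up front: the data layout, the rule that manual labels take priority, and the `min_confidence` filter are hypothetical choices, and no real Grounding DINO or YOLOv8 API is called here.

```python
def build_training_set(manual_labels, pseudo_labels, min_confidence=0.5):
    """Merge human-verified boxes with confident model-generated boxes.

    manual_labels: {image_id: [box, ...]}
    pseudo_labels: {image_id: [(box, confidence), ...]}
    Boxes are (x_min, y_min, x_max, y_max) tuples.
    """
    training = {img: list(boxes) for img, boxes in manual_labels.items()}
    for img, detections in pseudo_labels.items():
        if img in training:
            continue  # assumed policy: a human annotation overrides pseudo-labels
        kept = [box for box, conf in detections if conf >= min_confidence]
        if kept:
            training[img] = kept  # images with no confident box are skipped
    return training


manual = {"a.jpg": [(0, 0, 10, 5)]}
pseudo = {
    "a.jpg": [((1, 1, 9, 4), 0.9)],
    "b.jpg": [((2, 2, 8, 6), 0.8), ((0, 0, 1, 1), 0.2)],
    "c.jpg": [((0, 0, 1, 1), 0.1)],
}
train_set = build_training_set(manual, pseudo)
```

The resulting dictionary could then be exported to whatever annotation format the detector expects; the confidence cut-off trades dataset size against label noise.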
Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models
Authors: Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan
First: 2024-12-11T18:40:16+00:00 · Latest: 2025-10-28T22:43:29+00:00
Abstract
Physical reasoning remains a significant challenge for Vision-Language Models (VLMs). This limitation arises from an inability to translate learned knowledge into predictions about physical behavior. Although continual fine-tuning can mitigate this issue, it is expensive for large models and impractical to perform repeatedly for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a modular framework where specialized smaller VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts to enhance the reasoning capabilities of larger VLMs. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs also show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes.
中文标题/摘要
标题:物理情境构建者:视觉语言模型中物理推理的模块化框架
物理推理仍然是视觉语言模型(VLMs)的一个重大挑战。这一限制源于无法将学到的知识转化为对物理行为的预测。尽管持续微调可以缓解这一问题,但对于大型模型来说成本高昂,且反复为每个任务进行持续微调是不切实际的。因此,需要创建模块化和可扩展的方法来教授VLMs物理推理。为此,我们引入了物理情境构建者(PCBs),这是一种模块化框架,其中专门的小型VLMs被微调以生成详细的物理场景描述。这些描述可以作为物理上下文,增强大型VLMs的推理能力。PCBs使视觉感知与推理分离,使我们能够分析它们在物理理解中的相对贡献。我们在CLEVRER和倒塌塔数据集上进行了实验,该数据集包含模拟和真实场景的稳定性检测,证明PCBs提供了显著的性能提升,复杂物理推理任务的平均准确率提高了13.8%。值得注意的是,PCBs还展示了强大的模拟到现实世界的迁移能力,成功地从模拟训练数据推广到真实场景。
Summary / 总结
The research addresses the challenge of physical reasoning in Vision-Language Models (VLMs) by introducing Physics Context Builders (PCBs), a modular framework that fine-tunes specialized smaller VLMs to generate detailed physical scene descriptions. These descriptions are used to enhance the reasoning capabilities of larger VLMs. Experiments on CLEVRER and the Falling Tower dataset show that PCBs improve average accuracy by up to 13.8% on complex physical reasoning tasks and demonstrate strong Sim2Real transfer, effectively generalizing from simulated to real-world scenes.
研究通过引入物理情境构建器(PCBs)模块化框架来解决视觉-语言模型(VLMs)中的物理推理问题。PCBs涉及对专门的小型VLM进行微调,生成详细的物理场景描述,然后用于增强大型VLM的推理能力。在CLEVRER和Falling Tower数据集上的实验表明,PCBs在复杂物理推理任务中的平均准确率提高了最多13.8%,并且展示了强大的Sim2Real迁移能力,能够从模拟数据有效推广到真实场景。
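The separation of perception from reasoning can be sketched as a two-stage call. This is an illustrative outline only: `context_builder` and `reasoner` are hypothetical callables standing in for the fine-tuned small VLM and the larger VLM, and the prompt template is an assumption rather than the one used in the paper.

```python
def answer_with_physical_context(image, question, context_builder, reasoner):
    """Describe the scene with a small model, then let a larger model reason over it."""
    description = context_builder(image)  # perception: detailed physical scene description
    prompt = (
        "Scene description:\n"
        f"{description}\n\n"
        f"Question: {question}"
    )
    return reasoner(image, prompt)        # reasoning: larger model sees image plus context


# Stub models for demonstration; real systems would wrap actual VLM calls.
def stub_builder(image):
    return "Three blocks are stacked; the top block overhangs to the left."


def stub_reasoner(image, prompt):
    return "unstable" if "overhangs" in prompt else "stable"


result = answer_with_physical_context("tower.png", "Will the tower fall?", stub_builder, stub_reasoner)
```

Because the two stages communicate only through text, either one can be swapped or evaluated in isolation, which is what allows their relative contributions to be analyzed.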
SCOUT: A Lightweight Framework for Scenario Coverage Assessment in Autonomous Driving
Authors: Anil Yildiz, Sarah M. Thornton, Carl Hildebrandt, Sreeja Roy-Singh, Mykel J. Kochenderfer
First: 2025-10-28T20:31:19+00:00 · Latest: 2025-10-28T20:31:19+00:00
Abstract
Assessing scenario coverage is crucial for evaluating the robustness of autonomous agents, yet existing methods rely on expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). These approaches are impractical for large-scale deployment due to cost and efficiency constraints. To address these shortcomings, we propose SCOUT (Scenario Coverage Oversight and Understanding Tool), a lightweight surrogate model designed to predict scenario coverage labels directly from an agent's latent sensor representations. SCOUT is trained through a distillation process, learning to approximate LVLM-generated coverage labels while eliminating the need for continuous LVLM inference or human annotation. By leveraging precomputed perception features, SCOUT avoids redundant computations and enables fast, scalable scenario coverage estimation. We evaluate our method across a large dataset of real-life autonomous navigation scenarios, demonstrating that it maintains high accuracy while significantly reducing computational cost. Our results show that SCOUT provides an effective and practical alternative for large-scale coverage analysis. While its performance depends on the quality of LVLM-generated training labels, SCOUT represents a major step toward efficient scenario coverage oversight in autonomous systems.
中文标题/摘要
标题:SCOUT:自动驾驶场景覆盖评估的轻量级框架
场景覆盖评估对于评估自主代理的鲁棒性至关重要,但现有方法依赖昂贵的人工注释或计算密集型大型视觉-语言模型(LVLM)。这些方法由于成本和效率限制,在大规模部署中不切实际。为解决这些不足,我们提出SCOUT(场景覆盖监督和理解工具),这是一种轻量级的替代模型,旨在直接从代理的潜在传感器表示中预测场景覆盖标签。SCOUT通过蒸馏过程进行训练,学习近似LVLM生成的覆盖标签,同时消除持续LVLM推理或人工注释的需要。通过利用预计算的感知特征,SCOUT避免了冗余计算,实现了快速、可扩展的场景覆盖估计。我们在大量实际自主导航场景数据集上评估了我们的方法,证明它在保持高精度的同时显著降低了计算成本。我们的结果表明,SCOUT为大规模覆盖分析提供了一种有效且实用的替代方案。尽管其性能取决于LVLM生成的训练标签的质量,但SCOUT代表了自主系统中高效场景覆盖监督的一个重要进步。
Summary / 总结
The research aims to develop a cost-effective and efficient method for assessing scenario coverage in autonomous driving systems. SCOUT, a lightweight framework, predicts scenario coverage labels directly from an agent's latent sensor representations using a distillation process. This approach avoids the need for expensive human annotations or computationally intensive Large Vision-Language Models (LVLMs). Experiments show that SCOUT maintains high accuracy while significantly reducing computational costs, making it a practical alternative for large-scale coverage analysis.
研究旨在开发一种低成本且高效的场景覆盖评估方法,以评估自动驾驶系统的鲁棒性。SCOUT 是一个轻量级框架,通过从代理的潜传感器表示直接预测场景覆盖标签的蒸馏过程来实现。这种方法避免了连续大型视觉-语言模型推理或人工注释的需要,从而实现更快、更可扩展的场景覆盖估计。实验表明,SCOUT 在保持高准确性的同时显著降低了计算成本,为大规模覆盖分析提供了有效替代方案。
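The distillation setup can be illustrated with a tiny surrogate trained on teacher labels. This sketch assumes a single binary coverage label and uses logistic regression over precomputed feature vectors as the lightweight model; the abstract does not specify SCOUT's architecture or loss, so the model choice, function names, and hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_surrogate(features, teacher_labels, lr=0.5, steps=200):
    """Fit logistic regression to labels produced offline by a teacher model."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, teacher_labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i] / n  # gradient of mean cross-entropy
            grad_b += err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Cheap inference from features alone, with no teacher call."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


# Toy precomputed features and teacher-generated labels.
features = [[-2.0], [-1.0], [1.0], [2.0]]
teacher_labels = [0, 0, 1, 1]
w, b = train_surrogate(features, teacher_labels)
predictions = [predict(w, b, x) for x in features]
```

The teacher is queried once to label the training set; afterwards only the surrogate runs, so its accuracy is bounded by the quality of those teacher labels, as the abstract notes.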
Finding Culture-Sensitive Neurons in Vision-Language Models
Authors: Xiutian Zhao, Rochelle Choenni, Rohit Saxena, Ivan Titov
First: 2025-10-28T20:14:37+00:00 · Latest: 2025-10-28T20:14:37+00:00
Comments: 22 pages, 13 figures
Abstract
Despite their impressive performance, vision-language models (VLMs) still struggle on culturally situated inputs. To understand how VLMs process culturally grounded information, we study the presence of culture-sensitive neurons, i.e., neurons whose activations show preferential sensitivity to inputs associated with particular cultural contexts. We examine whether such neurons are important for culturally diverse visual question answering and where they are located. Using the CVQA benchmark, we identify neurons of culture selectivity and perform causal tests by deactivating the neurons flagged by different identification methods. Experiments on three VLMs across 25 cultural groups demonstrate the existence of neurons whose ablation disproportionately harms performance on questions about the corresponding cultures, while having minimal effects on others. Moreover, we propose a new margin-based selector, Contrastive Activation Selection (CAS), and show that it outperforms existing probability- and entropy-based methods in identifying culture-sensitive neurons. Finally, our layer-wise analyses reveal that such neurons tend to cluster in certain decoder layers. Overall, our findings shed new light on the internal organization of multimodal representations.
中文标题/摘要
标题:在视觉-语言模型中寻找文化敏感神经元
尽管视觉-语言模型(VLMs)在性能上表现出色,但在处理文化背景信息时仍然存在困难。为了理解VLMs如何处理文化背景信息,我们研究了文化敏感神经元的存在,即对特定文化背景相关输入表现出偏好敏感性的神经元。我们研究了这些神经元是否对文化多样性的视觉问答任务很重要,以及它们位于何处。使用CVQA基准,我们识别出具有文化选择性的神经元,并通过不同识别方法标记的神经元进行因果测试。在三个VLMs上对25个文化群体进行的实验表明,存在一类神经元,其消除会不成比例地损害对应文化问题的表现,而对其他问题的影响则很小。此外,我们提出了一种新的基于边距的选择器——对比激活选择(CAS),并证明它在识别文化敏感神经元方面优于现有的基于概率和熵的方法。最后,逐层分析表明,这类神经元倾向于集中在某些解码器层。总体而言,我们的发现为多模态表示的内部组织提供了新的见解。
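A margin-based selection followed by ablation can be sketched as below. The abstract does not define the CAS score, so this example assumes one plausible margin: a neuron's mean activation on the target culture minus its highest mean activation on any other culture. The function names and data layout are hypothetical, and the toy numbers are not from the paper.

```python
def select_by_margin(mean_activation, target, k):
    """Pick the k neurons with the largest target-vs-best-other activation margin.

    mean_activation: {culture: [mean activation of each neuron]}
    """
    n = len(mean_activation[target])
    scored = []
    for j in range(n):
        best_other = max(
            acts[j] for culture, acts in mean_activation.items() if culture != target
        )
        scored.append((mean_activation[target][j] - best_other, j))
    scored.sort(key=lambda t: (-t[0], t[1]))  # largest margin first, index breaks ties
    return [j for _, j in scored[:k]]


def ablate(activations, neuron_ids):
    """Deactivate the selected neurons by zeroing their activations."""
    drop = set(neuron_ids)
    return [0.0 if j in drop else a for j, a in enumerate(activations)]


mean_activation = {
    "A": [0.9, 0.2, 0.5],
    "B": [0.1, 0.3, 0.5],
    "C": [0.2, 0.1, 0.4],
}
selected = select_by_margin(mean_activation, target="A", k=2)
ablated = ablate([0.9, 0.2, 0.5], selected)
```

In a causal test of this kind, one would compare accuracy on the target culture's questions against accuracy on the others after ablation; a disproportionate drop on the target is the signal the abstract describes.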
TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
Authors: André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins
First: 2025-10-22T17:02:48+00:00 · Latest: 2025-10-28T19:23:06+00:00
Comments: 15 pages, 7 figures, submitted to arXiv October 2025. All models, datasets, and training code will be released at https://huggingface.co/collections/utter-project/towervision
Abstract
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.
中文标题/摘要
标题:TowerVision:理解并改进视觉语言模型中的多语言性
尽管在视觉语言模型(VLMs)方面取得了显著进展,但大多数现有工作都遵循以英语为中心的设计过程,限制了它们在多语言环境中的有效性。在本研究中,我们提供了一项全面的经验性研究,分析了多种多语言设计选择的影响,如训练数据组成、编码器选择和文本骨干。结果是TowerVision,一个基于多语言文本模型Tower+的多语言VLM家族,适用于图像文本和视频文本任务。TowerVision在多个跨模态多语言基准测试中取得了竞争力的表现,并在文化背景任务和跨模态翻译方面表现出特别的优势。通过在微调过程中结合视觉和文化背景,我们的模型在ALM-Bench和Multi30K(图像任务)以及ViMUL-Bench(视频任务)上超过了现有在更大数据集上训练的方法。除了模型外,我们还发布了VisionBlocks,一个高质量、精选的视觉语言数据集。我们的研究结果表明,多语言视觉语言训练数据显著提高了跨语言泛化能力——无论是从高资源语言到未充分代表的语言,还是反之亦然——并且指令调优的大规模语言模型并不总是最佳的初始化点。为了支持进一步的研究,我们将在https://huggingface.co/collections/utter-project/towervision上公开发布所有模型、数据和训练配方。
VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Authors: Md. Mahfuzur Rahman, Kishor Datta Gupta, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Roy George
First: 2025-09-25T21:21:00+00:00 · Latest: 2025-10-28T18:57:29+00:00
Comments: 29 pages, 40 figures, 3 algorithms
Abstract
Immediate damage assessment is essential after natural catastrophes, yet conventional manual evaluation techniques are slow and dangerous. Although satellite and unmanned aerial vehicle (UAV) photos offer extensive perspectives of impacted regions, current computer vision methodologies generally yield only classification labels or segmentation masks, which constrains their capacity to deliver thorough situational understanding. We introduce the Vision Language Caption Enhancer (VLCE), a multimodal system designed to produce comprehensive, contextually-informed explanations of disaster imagery. VLCE employs a dual-architecture approach: a CNN-LSTM model with a ResNet50 backbone pretrained on EuroSat satellite imagery for the xBD dataset, and a Vision Transformer (ViT) model pretrained on UAV pictures for the RescueNet dataset. Both systems utilize external semantic knowledge from ConceptNet and WordNet to expand vocabulary coverage and improve description accuracy. We assess VLCE in comparison to leading vision-language models (LLaVA and QwenVL) utilizing CLIPScore for semantic alignment and InfoMetIC for caption informativeness. Experimental findings indicate that VLCE markedly surpasses baseline models, attaining a maximum of 95.33% on InfoMetIC while preserving competitive semantic alignment. Our dual-architecture system demonstrates significant potential for improving disaster damage assessment by automating the production of actionable, information-dense descriptions from satellite and drone photos.
中文标题/摘要
标题:VLCE:一种增强知识框架的灾害评估图像描述
自然灾害发生后,立即进行损害评估至关重要;然而,传统的手工评估技术既缓慢又危险。尽管卫星和无人机照片提供了受灾区域的广泛视角,但当前的计算机视觉方法通常只能提供分类标签或分割掩码,限制了它们提供全面情况理解的能力。我们介绍了Vision Language Caption Enhancer (VLCE),这是一种多模态系统,旨在生成全面且上下文相关的灾难图像解释。VLCE采用双架构方法:一个基于预训练在EuroSat卫星图像上的ResNet50骨干的CNN-LSTM模型,用于xBD数据集,以及一个基于预训练在无人机图像上的Vision Transformer (ViT)模型,用于RescueNet数据集。两个系统均利用ConceptNet和WordNet的外部语义知识来扩展词汇覆盖范围并提高描述准确性。我们使用CLIPScore进行语义对齐和InfoMetIC进行描述信息量评估,将VLCE与领先的语言-视觉模型(LLaVA和QwenVL)进行比较。实验结果表明,VLCE显著优于基线模型,在InfoMetIC上达到最高95.33%的同时保持了竞争力的语义对齐。我们的双架构系统展示了通过自动化生成来自卫星和无人机照片的可操作、信息密集型描述来提高灾害损害评估的巨大潜力。
Summary / 总结
The research aims to enhance immediate damage assessment after natural disasters by developing a multimodal system, VLCE, which provides comprehensive explanations of disaster imagery. VLCE uses a dual-architecture approach combining a CNN-LSTM model and a Vision Transformer, both pretrained on specific datasets, and integrates external semantic knowledge to improve description accuracy. Experimental results show that VLCE outperforms existing models, achieving 95.33% on caption informativeness while maintaining strong semantic alignment.
研究旨在通过开发多模态系统VLCE来增强自然灾害后的即时损害评估,该系统能够提供灾难图像的全面解释。VLCE采用结合CNN-LSTM模型和Vision Transformer的双架构,并在特定数据集上预训练,同时整合外部语义知识以提高描述准确性。实验结果显示,VLCE在描述信息量方面优于现有模型,达到95.33%,同时保持了较强的语义对齐。
Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning
Authors: Nitin Rai, Daeun Choi, Nathan S. Boyd, Arnold W. Schumann
First: 2025-10-28T17:16:47+00:00 · Latest: 2025-10-28T17:16:47+00:00
Comments: 26 pages, 8 figures, and 2 tables
Abstract
Site-specific disease management (SSDM) in crops has advanced rapidly through machine and deep learning (ML and DL) for real-time computer vision. Research evolved from handcrafted feature extraction to large-scale automated feature learning. With foundation models (FMs), crop disease datasets are now processed in fundamentally new ways. Unlike traditional neural networks, FMs integrate visual and textual data, interpret symptoms in text, reason about symptom-management relationships, and support interactive QA for growers and educators. Adaptive and imitation learning in robotics further enables field-based disease management. This review screened approximately 40 articles on FM applications for SSDM, focusing on large-language models (LLMs) and vision-language models (VLMs), and discussing their role in adaptive learning (AL), reinforcement learning (RL), and digital twin frameworks for targeted spraying. Key findings: (a) FMs are gaining traction with surging literature in 2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL and AL are still nascent for smart spraying; (d) digital twins with RL can simulate targeted spraying virtually; (e) addressing the sim-to-real gap is critical for real-world deployment; (f) human-robot collaboration remains limited, especially in human-in-the-loop approaches where robots detect early symptoms and humans validate uncertain cases; (g) multi-modal FMs with real-time feedback will drive next-gen SSDM. For updates and resources, or to contribute papers, code, or datasets, visit https://github.com/nitin-dominic/AgriPathogenDatabase.
中文标题/摘要
标题:精准农业中特定地点病害和害虫管理的进步:从推理驱动的基础模型到适应性、反馈学习
作物的特定地点病害管理(SSDM)通过机器学习和深度学习(ML和DL)的实时计算机视觉技术得到了迅速发展。研究从手工特征提取演进到大规模自动化特征学习。借助基础模型(FMs),作物病害数据集现在以根本不同的方式处理。与传统神经网络不同,FMs整合了视觉和文本数据,解释文本中的症状,推理症状管理关系,并支持种植者和教育者的交互式问答。机器人领域的适应性和模仿学习进一步使田间病害管理成为可能。本文综述了约40篇关于FMs在SSDM应用的文章,重点关注大型语言模型(LLMs)和视觉语言模型(VLMs),讨论了它们在适应性学习(AL)、强化学习(RL)和数字孪生框架中的作用,用于精准喷洒。主要发现:(a)FMs在2023-24年因文献激增而受到关注;(b)VLMs超越了LLMs,出版物增加了5-10倍;(c)智能喷洒领域的RL和AL仍处于初级阶段;(d)带有RL的数字孪生可以虚拟模拟精准喷洒;(e)解决模拟与现实之间的差距对于实际部署至关重要;(f)人机协作仍然有限,尤其是在人类在环方法中,机器人检测早期症状,人类验证不确定的案例;(g)具有实时反馈的多模态FMs将推动下一代SSDM。欲获取更新、资源和贡献,请访问https://github.com/nitin-dominic/AgriPathogenDatabase,提交论文、代码或数据集。
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Authors: Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng
Venue: NeurIPS 2025 poster
First: 2025-10-26T14:36:15+00:00 · Latest: 2025-10-28T16:57:22+00:00
Comments: NeurIPS 2025 poster
Abstract
Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree, which utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
中文标题/摘要
标题:VADTree:基于层次粒度感知树的无训练视频异常检测
视频异常检测(VAD)专注于识别视频中的异常。 监督方法需要大量领域内的训练数据,并且无法为异常提供清晰的解释。相比之下,无训练方法利用大型预训练模型的知识储备和语言互动性来检测异常。然而,当前固定长度的时间窗口采样方法难以准确捕捉具有不同时间跨度的异常。因此,我们提出了VADTree,利用层次粒度感知树(HGTree)结构进行灵活的VAD采样。VADTree利用预训练的通用事件边界检测(GEBD)模型来表征潜在的异常事件边界。具体来说,VADTree基于边界置信度将视频分解为通用事件节点,并进行自适应粗细层次结构构建和冗余去除以构建HGTree。然后,将多维先验注入视觉语言模型(VLMs)以增强节点级别的异常感知,并通过大型语言模型(LLMs)实现通用事件节点的异常推理。最后,使用跨簇节点相关方法整合多粒度异常评分。在三个具有挑战性的数据集上的广泛实验表明,VADTree在无训练设置中实现了最先进的性能,同时大幅减少了采样的视频片段数量。代码将在https://github.com/wenlongli10/VADTree上提供。
Summary / 总结
VADTree is proposed to address the limitations of training-free video anomaly detection methods by utilizing a Hierarchical Granularity-Aware Tree (HGTree) structure. It decomposes videos into generic event nodes and constructs an HGTree for adaptive sampling. VADTree leverages a pre-trained Generic Event Boundary Detection (GEBD) model to detect potential anomaly event boundaries and integrates multi-granularity anomaly scores through inter-cluster node correlation. Experiments on three datasets show that VADTree outperforms existing training-free methods while significantly reducing the number of sampled video segments.
VADTree 提出了一种 Hierarchical Granularity-Aware Tree (HGTree) 结构,用于灵活地在视频异常检测中进行采样,利用预训练的 Generic Event Boundary Detection (GEBD) 模型来识别潜在的异常事件边界。VADTree 将视频分解为通用事件节点,并通过自适应粗细层次结构和冗余去除构建 HGTree。它使用视觉语言模型增强节点级别的异常感知,并通过大型语言模型实现通用事件节点的异常推理。实验表明,VADTree 在训练免费设置中优于现有方法,并且使用更少的视频片段样本。
TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Authors: Jiahao Wang, Mingyue Cheng, Qingyang Mao, Yitong Zhou, Daoyu Wang, Qi Liu, Feiyang Xu, Xin Li
First: 2024-11-24T07:02:32+00:00 · Latest: 2025-10-28T16:23:53+00:00
Abstract
Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with the semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is very difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 representative public datasets from the UEA archive verify the superiority of TableTime.
中文标题/摘要
标题:TableTime:将多变量时间序列分类重新定义为大型语言模型无需训练的表理解
大型语言模型(LLMs)在多变量时间序列分类(MTSC)中展示了其有效性。有效适应LLMs对于MTSC需要信息性的数据表示。现有的基于LLM的方法直接从零开始在LLM的潜在空间中编码时间序列嵌入,以与LLM的语义空间对齐。尽管这些方法有效,但我们发现它们隐藏了三个内在瓶颈:(1)它们难以以无损方式编码时间性和通道特定信息,这两种信息都是多变量时间序列的关键组成部分;(2)学习到的表示空间与LLM的语义空间对齐非常困难;(3)它们需要特定任务的重新训练,这既耗费计算资源又劳动密集。为了弥合这些差距,我们提出了TableTime,将其重新定义为一个表理解任务。具体来说,TableTime 引入了以下策略:(1)将多变量时间序列转换为表格形式,从而最大限度地减少信息损失;(2)以文本格式表示表格时间序列,以实现自然与LLM语义空间的对齐;(3)设计一个推理框架,结合上下文文本信息、邻域协助、多路径推理和问题分解,以增强LLM的推理能力和实现零样本分类。在UEA存档中的10个公开代表性数据集上进行的广泛实验验证了TableTime的优越性。
Summary / 总结
TableTime reformulates multivariate time series classification as a table understanding task, addressing limitations of existing methods by converting time series into tabular form, representing them in text format, and integrating a reasoning framework. Experiments on 10 datasets show TableTime's superior performance in zero-shot classification.
TableTime 将多变量时间序列分类重新表述为表格理解任务,通过将时间序列转换为表格形式、以文本形式表示以与 LLM 的语义空间对齐,并结合推理框架来解决现有方法的局限性。在 UEA 存档的 10 个数据集上的实验验证了其优越性。
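The first two TableTime strategies (tabular conversion, then text serialization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `series_to_table_text`, the pipe-delimited layout, and the fixed decimal precision are all assumptions made for illustration, and the reasoning framework (neighborhood assistance, multi-path inference, problem decomposition) is not covered.

```python
from typing import Sequence


def series_to_table_text(
    series: Sequence[Sequence[float]],
    channel_names: Sequence[str],
    precision: int = 2,
) -> str:
    """Serialize a multivariate time series as a text table.

    `series` is indexed as series[channel][timestep]. Each output row is one
    timestep and each column is one channel, so both the temporal order and
    the channel identity stay explicit in the text. (Hypothetical layout.)
    """
    if not series:
        raise ValueError("series must contain at least one channel")
    if len(series) != len(channel_names):
        raise ValueError("one name per channel is required")
    n_steps = len(series[0])
    if any(len(channel) != n_steps for channel in series):
        raise ValueError("all channels must have the same length")

    # Header row: a timestep index column followed by one column per channel.
    rows = ["| t | " + " | ".join(channel_names) + " |"]
    for t in range(n_steps):
        cells = " | ".join(f"{channel[t]:.{precision}f}" for channel in series)
        rows.append(f"| {t} | {cells} |")
    return "\n".join(rows)


if __name__ == "__main__":
    # Two channels, two timesteps.
    print(series_to_table_text([[1.0, 2.0], [3.5, 4.25]], ["x", "y"]))
```

The resulting string could then be placed in an LLM prompt together with the candidate class labels; how the prompt is built is left open here because the abstract does not specify it.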
Superpowering Open-Vocabulary Object Detectors for X-ray Vision
Authors: Pablo Garcia-Fernandez, Lorenzo Vaquero, Mingxuan Liu, Feng Xue, Daniel Cores, Nicu Sebe, Manuel Mucientes, Elisa Ricci
Venue: ICCV 2025
First: 2025-03-21T11:54:16+00:00 · Latest: 2025-10-28T15:20:36+00:00
Comments: Accepted at ICCV 2025
Abstract
Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.
中文标题/摘要
标题:为X射线视觉增强开放词汇对象检测
开放词汇对象检测(OvOD)有望通过使系统能够识别X射线扫描中的任何物品来革新安全筛查。然而,由于数据稀缺性和模态差距,开发适用于X射线成像的高效OvOD模型面临独特挑战,这阻碍了直接采用基于RGB的解决方案。为克服这些限制,我们提出了一种无需训练的框架RAXO,该框架重新利用现成的RGB OvOD检测器以实现稳健的X射线检测。RAXO使用双源检索策略构建高质量的X射线类别描述符。它从网络收集相关RGB图像并通过一种新颖的X射线材料转移机制丰富它们,从而消除对标注数据库的需求。这些视觉描述符取代了OvOD中的基于文本的分类,利用模态内特征距离实现稳健检测。大量实验表明,RAXO持续提升OvOD性能,相对于基础检测器平均mAP提升高达17.0个百分点。为了进一步支持这一新兴领域的研究,我们还引入了DET-COMPASS基准,该基准包含超过300个对象类别的边界框注释,使OvOD在X射线中的大规模评估成为可能。代码和数据集可在:https://github.com/PAGF188/RAXO/ 获取。
Summary / 总结
The paper addresses the challenge of open-vocabulary object detection (OvOD) in X-ray imaging, proposing RAXO, a training-free framework that repurposes existing RGB OvOD detectors. RAXO uses a dual-source retrieval strategy to gather and enrich relevant RGB images, which are then transformed into X-ray class descriptors. Experiments show that RAXO consistently improves OvOD performance, achieving an average mAP increase of up to 17.0 points over base detectors. Additionally, the authors introduce DET-COMPASS, a new benchmark with bounding box annotations for over 300 object categories, facilitating large-scale evaluation of OvOD in X-ray imaging.
研究旨在通过解决数据稀缺性和模态差异问题,提升X射线成像中的开放词汇对象检测(OvOD)性能。RAXO是一种无需训练的框架,通过双源检索策略收集和丰富相关RGB图像,并将其转化为X射线类描述符,从而提高OvOD性能,平均mAP提升17.0个百分点。此外,还引入了DET-COMPASS基准,包含超过300个对象类别的边界框注释,用于评估X射线中的OvOD。代码和数据集可在https://github.com/PAGF188/RAXO获取。
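The idea of replacing text-based classification with visual class descriptors compared by intra-modal feature distance can be sketched as a nearest-descriptor lookup. The sketch below is an assumption-laden illustration, not RAXO's code: the names `cosine` and `classify_region`, the use of cosine similarity, and taking the maximum over a class's descriptors are all hypothetical choices, and the feature vectors are assumed to come from some image encoder that is not shown.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_region(
    region_feature: Sequence[float],
    class_descriptors: Dict[str, List[Sequence[float]]],
) -> Tuple[Optional[str], float]:
    """Assign a detected region to the class with the closest visual descriptor.

    `class_descriptors` maps a class label to one or more visual feature
    vectors for that class. Scoring a class by its best-matching descriptor
    is an assumption of this sketch.
    """
    best_label: Optional[str] = None
    best_score = -1.0
    for label, descriptors in class_descriptors.items():
        score = max(
            (cosine(region_feature, d) for d in descriptors), default=-1.0
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


if __name__ == "__main__":
    descriptors = {"knife": [[1.0, 0.0]], "bottle": [[0.0, 1.0]]}
    print(classify_region([0.9, 0.1], descriptors))
```

Because both the region feature and the descriptors are visual, the comparison stays within one modality, which is the property the abstract credits for robust detection.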
Iterative Critique-Refine Framework for Enhancing LLM Personalization
Authors: Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed
First: 2025-10-28T14:36:22+00:00 · Latest: 2025-10-28T14:36:22+00:00
Abstract
Personalized text generation requires models not only to produce coherent text but also to align with a target user's style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM, also conditioned on the same profile, provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
中文标题/摘要
标题:迭代批判-完善框架以增强LLM个性化
个性化文本生成不仅要求模型生成连贯的文本,还要求与目标用户的风格、语气和主题焦点相一致。现有的检索增强方法,如LaMP和PGraphRAG,通过用户和邻居历史丰富个人资料,但它们仅停留在生成阶段,往往导致输出在语气、主题或风格上漂移。我们提出了一种名为PerFine的统一、无需训练的批判-完善框架,通过迭代、基于个人资料的反馈来增强个性化。在每次迭代中,LLM生成器根据检索到的个人资料生成草稿,而同样基于同一个人资料的批判LLM则提供结构化的反馈,包括语气、词汇、句子结构和主题性。生成器随后进行修订,而一种新颖的淘汰策略保留了每次迭代中更强的草稿。我们还研究了在推理阶段的其他策略,如Best-of-N和主题提取,以平衡质量和效率。在Yelp、Goodreads和Amazon数据集上,PerFine在GEval上的一致性改进超过了PGraphRAG,稳态改进在3-5次完善迭代中持续进行,并且随着批判者规模的增加而可扩展。这些结果表明,事后、基于个人资料的反馈为个性化LLM生成提供了一种强大的范式,该范式既无需训练又模型无关。
Summary / 总结
The research aims to improve personalized text generation by ensuring coherence and alignment with user style and preferences. PerFine, a critique-refine framework, iteratively refines drafts based on structured feedback from a critic model, enhancing personalization. Across datasets, PerFine outperforms PGraphRAG, showing consistent improvements and scalability with larger critic models.
研究旨在通过确保连贯性和与用户风格和偏好的一致性来提升个性化文本生成。PerFine是一种批判-修正框架,通过基于批评模型的结构化反馈迭代修正草案,增强个性化。在多个数据集上,PerFine优于PGraphRAG,显示出一致的改进和随着批评模型规模增加的可扩展性。
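The generate, critique, revise, and knockout loop described in the PerFine abstract can be sketched with the models abstracted as plain callables. Everything named here is hypothetical: `critique_refine`, the callable signatures, and in particular the `prefer` function, since the abstract does not say how the stronger draft is chosen.

```python
from typing import Callable


def critique_refine(
    profile: str,
    task: str,
    generate: Callable[[str, str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str, str], str],
    prefer: Callable[[str, str, str], str],
    iterations: int = 3,
) -> str:
    """Iteratively refine a draft with profile-grounded feedback.

    generate(profile, task)                  -> initial draft
    critique(profile, draft)                 -> structured feedback text
    revise(profile, task, draft, feedback)   -> revised draft
    prefer(profile, incumbent, candidate)    -> whichever draft is stronger

    The `prefer` step stands in for the knockout strategy: the incumbent
    draft is replaced only when the revision is judged stronger.
    """
    best = generate(profile, task)
    for _ in range(iterations):
        feedback = critique(profile, best)
        candidate = revise(profile, task, best, feedback)
        best = prefer(profile, best, candidate)
    return best


if __name__ == "__main__":
    # Stub callables standing in for LLM calls.
    result = critique_refine(
        profile="likes short, upbeat reviews",
        task="review a cafe",
        generate=lambda profile, task: "draft",
        critique=lambda profile, draft: "be more upbeat",
        revise=lambda profile, task, draft, feedback: draft + "!",
        prefer=lambda profile, old, new: new if len(new) > len(old) else old,
    )
    print(result)
```

Passing the profile to both the critic and the preference step mirrors the abstract's point that feedback is conditioned on the same retrieved profile as generation.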
Mano Technical Report
Authors: Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
First: 2025-09-22T03:13:58+00:00 · Latest: 2025-10-28T14:31:14+00:00
Abstract
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
中文标题/摘要
标题:Mano 技术报告
图形用户界面(GUI)是人机交互的主要媒介,但由于视觉元素的复杂性、动态环境以及多步推理的需求,自动化GUI交互仍然具有挑战性。现有的基于视觉-语言模型(VLMs)的方法往往受到分辨率有限、领域不匹配和序列决策能力不足的限制。为了解决这些问题,我们提出了一种名为Mano的稳健的GUI代理,该代理基于在大量网络和计算机系统数据上预训练的多模态基础模型构建。我们的方法结合了一个新颖的模拟环境以生成高保真数据、三阶段训练流程(监督微调、离线强化学习和在线强化学习)以及一个验证模块以实现错误恢复。Mano在多个GUI基准测试中表现出最先进的性能,包括Mind2Web和OSWorld,显著提高了成功率和操作准确性。我们的工作为强化学习与VLMs的有效集成提供了新的见解,强调了领域特定数据、迭代训练和整体奖励设计的重要性。
Summary / 总结
The research addresses the challenges of automating GUI interactions by proposing Mano, a robust GUI agent based on a multi-modal foundation model. Mano integrates a simulated environment for data generation, a three-stage training pipeline, and a verification module for error recovery. The agent shows superior performance on GUI benchmarks, improving success rate and operational accuracy compared to existing methods.
研究通过提出基于多模态基础模型的Robust GUI代理Mano来解决自动化图形用户界面交互的挑战。Mano整合了数据生成的模拟环境、三阶段训练管道和错误恢复验证模块。该代理在Mind2Web和OSWorld等基准测试中表现出最先进的性能,提高了成功率和操作准确性。
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Authors: Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong
First: 2025-10-28T13:22:39+00:00 · Latest: 2025-10-28T13:22:39+00:00
Comments: work in progress
Abstract
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
中文标题/摘要
标题:OS-Sentinel:通过混合验证在现实工作流中提升移动GUI代理的安全性
由视觉-语言模型(VLMs)驱动的计算机使用代理在操作数字环境如移动平台方面展示了类似人类的能力。尽管这些代理在推进数字自动化方面具有巨大潜力,但它们进行不安全操作的可能性,如系统破坏和隐私泄露,引发了重大担忧。在移动环境复杂且庞大的操作空间中检测这些安全问题是一项艰巨的挑战,目前仍严重未被探索。为了建立移动代理安全研究的基础,我们引入了MobileRisk-Live,一个动态沙盒环境,附带一个包含现实轨迹和细粒度注释的安全检测基准。在此基础上,我们提出了OS-Sentinel,一种新颖的混合安全检测框架,该框架将形式验证器与基于VLM的上下文评估器相结合,用于检测系统级违规并评估上下文风险和代理行为。实验结果显示,OS-Sentinel在多个指标上比现有方法提高了10%-30%。进一步的分析提供了关键见解,促进了更安全、更可靠的自主移动代理的发展。
Summary / 总结
The paper introduces OS-Sentinel, a hybrid safety detection framework for mobile GUI agents powered by Vision-Language Models (VLMs). Motivated by the potential for unsafe operations like system compromise and privacy leakage, the framework combines a Formal Verifier for explicit system-level violations with a VLM-based Contextual Judge for contextual risk assessment. Experiments demonstrate that OS-Sentinel outperforms existing approaches by 10%-30% across multiple metrics, contributing to safer and more reliable autonomous mobile agents.
论文提出了OS-Sentinel,这是一种结合形式验证器和基于VLM的上下文评估器的混合安全检测框架,用于由Vision-Language Models(VLMs)驱动的移动GUI代理。该框架旨在解决系统级违规和上下文风险评估问题,以应对系统妥协和隐私泄露等潜在风险。实验结果显示,OS-Sentinel在多个指标上比现有方法提高了10%-30%,有助于开发更安全和可靠的自主移动代理。
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Authors: Xuannan Liu, Zekun Li, Zheqi He, Peipei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, Ran He
Venue: NeurIPS 2025
First: 2025-05-17T05:06:38+00:00 · Latest: 2025-10-28T12:44:07+00:00
Comments: Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page: https://liuxuannan.github.io/Video-SafetyBench.github.io/
Abstract
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
中文标题/摘要
标题:Video-SafetyBench:视频LVLM安全性评估基准
随着大型视觉-语言模型(LVLMs)的广泛应用,潜在恶意输入可能引发的安全问题日益凸显。然而,现有的多模态安全性评估主要关注由静态图像输入暴露的模型漏洞,忽视了视频中的时间动态可能带来的独特安全风险。为弥补这一差距,我们提出了Video-SafetyBench,这是首个旨在评估视频-文本攻击下LVLMs安全性的全面基准。它包含2,264个视频-文本配对,覆盖48个细粒度的不安全类别,每个配对包含一个合成视频和一个有害查询或一个看似无害但实际上与视频结合后会引发有害行为的良性查询。为了生成适合安全评估的语义准确视频,我们设计了一个可控的流水线,将视频语义分解为主题图像(显示什么)和运动文本(如何移动),两者共同指导生成与查询相关的视频。为了有效评估不确定或边缘有害输出,我们提出了RJScore,这是一种新颖的基于LLM的度量标准,结合了法官模型的信心和人类对决策阈值的校准。大量实验表明,良性查询视频组合的平均攻击成功率达到了67.2%,揭示了LVLMs对视频诱导攻击的一致性漏洞。我们相信Video-SafetyBench将促进未来基于视频的安全性评估和防御策略的研究。
Summary / 总结
The paper introduces Video-SafetyBench, a benchmark for evaluating the safety of Large Vision-Language Models (LVLMs) under video-text attacks. It includes 2,264 video-text pairs covering 48 unsafe categories, with each pair consisting of a synthesized video and either a harmful or benign query. The study finds that benign-query videos achieve an average attack success rate of 67.2%, highlighting LVLM vulnerabilities to video-induced attacks. A novel RJScore metric is proposed to evaluate uncertain or borderline harmful outputs. This work aims to advance research on video-based safety evaluation and defense strategies.
论文介绍了Video-SafetyBench,这是一个用于评估大型视觉语言模型(LVLM)在视频文本攻击下的安全性基准。它包含2,264个视频-查询对,覆盖48个不安全类别,每对包括一个合成视频和一个有害或无害的查询。研究发现,无害查询视频的平均攻击成功率达到了67.2%,揭示了LVLM对视频诱导攻击的一致性漏洞。作者提出了RJScore,这是一种用于评估不确定或边缘有害输出的新颖度量方法。