RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
Authors: Yash Jangir, Yidi Zhang, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki
First: 2025-10-27T17:41:38+00:00 · Latest: 2025-10-27T17:41:38+00:00
Comments: Website: https://robotarenainf.github.io
Abstract
The pursuit of robot generalists - instructable agents capable of performing
diverse tasks across diverse environments - demands rigorous and scalable
evaluation. Yet real-world testing of robot policies remains fundamentally
constrained: it is labor-intensive, slow, unsafe at scale, and difficult to
reproduce. Existing simulation benchmarks are similarly limited, as they train
and test policies within the same synthetic domains and cannot assess models
trained from real-world demonstrations or alternative simulation environments.
As policies expand in scope and complexity, these barriers only intensify,
since defining "success" in robotics often hinges on nuanced human judgments of
execution quality. In this paper, we introduce a new benchmarking framework
that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale
simulated environments augmented with online human feedback. Leveraging
advances in vision-language models, 2D-to-3D generative modeling, and
differentiable rendering, our approach automatically converts video
demonstrations from widely used robot datasets into simulated counterparts.
Within these digital twins, we assess VLA policies using both automated
VLM-guided scoring and scalable human preference judgments collected from
crowdworkers, transforming human involvement from tedious scene setup,
resetting, and safety supervision into lightweight preference comparisons. To
measure robustness, we systematically perturb simulated environments along
multiple axes, such as textures and object placements, stress-testing policy
generalization under controlled variation. The result is a continuously
evolving, reproducible, and scalable benchmark for real-world trained robot
manipulation policies, addressing a critical missing capability in today's
robotics landscape.
中文标题/摘要
标题:RobotArena $\infty$: 通过实到模拟转换实现可扩展的机器人基准测试
机器人通才——能够跨不同环境执行多种任务的可指导代理——的追求需要严格的可扩展评估。然而,机器人策略的实地测试仍然受到根本性的限制:它劳动密集、速度慢、在大规模下不安全且难以复制。现有的模拟基准测试同样有限,因为它们在相同的合成领域内训练和测试策略,无法评估从真实世界演示或替代模拟环境中训练的模型。随着策略的范围和复杂性扩大,这些障碍只会加剧,因为机器人中的“成功”往往依赖于对执行质量的微妙人类判断。在本文中,我们介绍了一种新的基准测试框架,通过将跨模态评估转移到增强有人工反馈的大规模模拟环境中来克服这些挑战。利用视觉语言模型、2D到3D生成建模和可微渲染的最新进展,我们的方法自动将广泛使用的机器人数据集中的视频演示转换为模拟对应物。在这些数字双胞胎中,我们使用自动化的VLM指导评分和从众包工人收集的大规模人类偏好判断来评估跨模态策略,将人类参与从繁琐的场景设置、重置和安全监督转变为轻量级的偏好比较。为了衡量鲁棒性,我们系统地沿多个轴线(如纹理和物体放置)对模拟环境进行扰动,对策略在受控变化下的泛化能力进行压力测试。结果是一个不断演进、可复制且可扩展的基准测试,用于真实世界训练的机器人操作策略,填补了当今机器人领域的一项关键缺失能力。
Summary / 总结
This paper introduces RobotArena $\infty$, a new benchmarking framework for evaluating versatile robot policies. It addresses the limitations of real-world testing by leveraging large-scale simulated environments with online human feedback. The method uses advances in vision-language models and 2D-to-3D generative modeling to convert real-world robot demonstrations into simulated counterparts. Key findings include the ability to assess policies across diverse and perturbed environments, providing a robust and scalable evaluation tool for real-world trained robot manipulation policies.
本文介绍了RobotArena $\infty$,这是一种新的机器人策略评估框架,旨在克服现实世界测试的局限性,通过大规模模拟环境和在线人类反馈来实现。该方法利用视觉语言模型和2D到3D生成建模技术将实际机器人演示转换为模拟对应物。主要发现包括能够在多种多样且经过扰动的环境中评估策略,提供了一个用于实际训练机器人操作策略的稳健且可扩展的评估工具。
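Illustrative sketch, not taken from the paper: the abstract above says crowdworkers supply pairwise preference comparisons but does not say how those votes are turned into a ranking. One common choice, assumed here, is a Bradley-Terry model fitted with minorization-maximization updates. The policy names and vote counts are hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs using MM updates."""
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    policies = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        policies.update((winner, loser))

    strength = {p: 1.0 for p in policies}
    for _ in range(iterations):
        updated = {}
        for p in policies:
            # Sum n_pq / (s_p + s_q) over every opponent q that p has faced.
            denom = 0.0
            for q in policies:
                if q == p:
                    continue
                n = pair_counts.get(frozenset((p, q)), 0)
                if n:
                    denom += n / (strength[p] + strength[q])
            updated[p] = wins[p] / denom if denom > 0 else strength[p]
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength


# Hypothetical votes: each pair of policies was compared ten times.
comparisons = (
    [("policy_a", "policy_b")] * 7 + [("policy_b", "policy_a")] * 3
    + [("policy_b", "policy_c")] * 6 + [("policy_c", "policy_b")] * 4
    + [("policy_a", "policy_c")] * 8 + [("policy_c", "policy_a")] * 2
)
strengths = bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

The strengths are normalized to sum to one, so they can be read as relative preference shares rather than absolute success rates.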
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis
Authors: Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, Yao Zhu
First: 2024-11-25T15:40:47+00:00 · Latest: 2025-10-27T17:31:22+00:00
Comments: Updated author formatting; no substantive changes
Abstract
Diffusion models have achieved impressive success in generating
photorealistic images, but challenges remain in ensuring precise semantic
alignment with input prompts. Optimizing the initial noisy latent offers a more
efficient alternative to modifying model architectures or prompt engineering
for improving semantic alignment. A recent approach, InitNo, refines the
initial noisy latent by leveraging attention maps; however, these maps capture
only limited information, and the effectiveness of InitNo is highly dependent
on the initial starting point, as it tends to converge on a local optimum near
this point. To address this, this paper proposes leveraging the language
comprehension capabilities of large vision-language models (LVLMs) to guide the
optimization of the initial noisy latent, and introduces the Noise Diffusion
process, which updates the noisy latent to generate semantically faithful
images while preserving distribution consistency. Furthermore, we provide a
theoretical analysis of the condition under which the update improves semantic
faithfulness. Experimental results demonstrate the effectiveness and
adaptability of our framework, consistently enhancing semantic alignment across
various diffusion models. The code is available at
https://github.com/Bomingmiao/NoiseDiffusion.
中文标题/摘要
标题:噪声扩散在文本到图像合成中增强语义忠实性
扩散模型在生成逼真图像方面取得了显著成功,但在确保与输入提示精确语义对齐方面仍面临挑战。通过优化初始噪声潜在变量,提供了一种更有效的替代方法,以改进语义对齐,而不是修改模型架构或提示工程。最新的方法InitNo通过利用注意力图来细化初始噪声潜在变量;然而,这些图只能捕捉有限的信息,InitNo的效果高度依赖于初始起点,因为它倾向于在这一点附近收敛到局部最优。为此,本文提出利用大型视觉语言模型(LVLM)的语言理解能力来指导初始噪声潜在变量的优化,并引入了噪声扩散过程,该过程更新噪声潜在变量以生成语义忠实的图像,同时保持分布一致性。此外,我们还对更新条件下的语义忠实性改进进行了理论分析。实验结果表明,我们的框架的有效性和适应性,能够在各种扩散模型中一致提高语义对齐。代码可在https://github.com/Bomingmiao/NoiseDiffusion获取。
Summary / 总结
This paper addresses the challenge of ensuring precise semantic alignment in text-to-image synthesis using diffusion models. It proposes leveraging the language comprehension capabilities of large vision-language models to optimize the initial noisy latent, introducing a Noise Diffusion process that updates the noisy latent to generate semantically faithful images while maintaining distribution consistency. Experiments show that this approach effectively enhances semantic alignment across different diffusion models.
本文旨在通过利用大型视觉语言模型的语言理解能力优化初始噪声潜变量,解决文本到图像合成中精确语义对齐的问题。提出了噪声扩散过程,该过程更新噪声潜变量以生成语义忠实的图像,同时保持分布一致性。实验结果表明,该方法能够有效提高不同扩散模型中的语义对齐。
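Illustrative sketch, not the authors' algorithm: the abstract above describes updating the initial noisy latent under guidance while preserving distribution consistency, without giving the update rule. The sketch assumes a variance-preserving mix, z' = sqrt(1 - a^2) z + a * eps, which keeps each proposal standard normal when z is. The `score_fn` argument is a hypothetical stand-in for an LVLM-based faithfulness score.

```python
import math
import random


def variance_preserving_step(latent, step_size, rng):
    """Mix the latent with fresh Gaussian noise so unit variance is preserved."""
    keep = math.sqrt(1.0 - step_size ** 2)
    return [keep * z + step_size * rng.gauss(0.0, 1.0) for z in latent]


def refine_latent(latent, score_fn, steps=20, candidates=4, step_size=0.3, seed=0):
    """Greedy search: keep a proposal only when the (hypothetical) score improves."""
    rng = random.Random(seed)
    best, best_score = list(latent), score_fn(latent)
    for _ in range(steps):
        for _ in range(candidates):
            proposal = variance_preserving_step(best, step_size, rng)
            score = score_fn(proposal)
            if score > best_score:
                best, best_score = proposal, score
    return best, best_score


def toy_score(latent):
    """Placeholder score; a real system would query a vision-language model."""
    return sum(latent[:4])


rng = random.Random(1)
initial = [rng.gauss(0.0, 1.0) for _ in range(16)]
refined, refined_score = refine_latent(initial, toy_score)
print(toy_score(initial) <= refined_score)
```

Each individual proposal is marginally standard normal, but selecting among proposals by score does shift the distribution. The paper's theoretical analysis concerns when an update improves faithfulness, and this sketch does not reproduce it.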
FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time
Authors: Yaoli Liu, Yao-Xiang Ding, Kun Zhou
First: 2025-10-27T16:54:08+00:00 · Latest: 2025-10-27T16:54:08+00:00
Abstract
This paper proposes FreeFuse, a novel training-free approach for
multi-subject text-to-image generation through automatic fusion of multiple
subject LoRAs. In contrast to existing methods that either focus on
pre-inference LoRA weight merging or rely on segmentation models and complex
techniques like noise blending to isolate LoRA outputs, our key insight is that
context-aware dynamic subject masks can be automatically derived from
cross-attention layer weights. Mathematical analysis shows that directly
applying these masks to LoRA outputs during inference well approximates the
case where the subject LoRA is integrated into the diffusion model and used
individually for the masked region. FreeFuse demonstrates superior practicality
and efficiency as it requires no additional training, no modification to LoRAs,
no auxiliary models, and no user-defined prompt templates or region
specifications. Instead, it only requires users to provide the LoRA
activation words for seamless integration into standard workflows. Extensive
experiments validate that FreeFuse outperforms existing approaches in both
generation quality and usability under the multi-subject generation tasks. The
project page is at https://future-item.github.io/FreeFuse/
中文标题/摘要
标题:FreeFuse:通过测试时自动掩码的多主题LoRA融合
本文提出FreeFuse,这是一种无需训练的新型方法,用于通过自动融合多个主题LoRA进行多主题文本到图像生成。与现有方法要么专注于推理前的LoRA权重合并,要么依赖分割模型和噪声混合等复杂技术来隔离LoRA输出不同,我们的关键洞察是,可以从跨注意力层权重中自动推导出上下文感知的动态主题掩码。数学分析表明,在推理时直接应用这些掩码到LoRA输出可以很好地近似于将主题LoRA集成到扩散模型并在掩码区域单独使用的情况。FreeFuse展示了更高的实用性和效率,因为它不需要额外的训练,不需要修改LoRA,不需要辅助模型,也不需要用户定义的提示模板或区域指定。相反,它只需要用户提供LoRA激活词即可无缝集成到标准工作流程中。广泛的实验验证了在多主题生成任务中,FreeFuse在生成质量和易用性方面均优于现有方法。项目页面位于https://future-item.github.io/FreeFuse/
Summary / 总结
FreeFuse is a training-free approach for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs using context-aware dynamic subject masks derived from cross-attention layer weights. This method avoids the need for pre-inference LoRA weight merging, segmentation models, or complex noise blending techniques. Experimental results show that FreeFuse outperforms existing methods in both generation quality and usability for multi-subject tasks, requiring only LoRA activation words for seamless integration into standard workflows without additional training or modifications.
FreeFuse 是一种无需训练的方法,用于通过从交叉注意力层权重中自动提取上下文感知的动态主体掩码来融合多个主体 LoRA。这种方法避免了预推理 LoRA 权重合并、分割模型或复杂技术的需求。实验表明,FreeFuse 在多主体生成任务中在生成质量和易用性方面优于现有方法,无需额外训练或对 LoRA 进行修改。
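Illustrative sketch, not the authors' implementation: the abstract above says subject masks are derived from cross-attention weights and applied to LoRA outputs during inference. The sketch assumes one attention score per subject per spatial position, a hard argmax assignment, and scalar features per position. All three are simplifications chosen for readability.

```python
def subject_masks(attention):
    """Assign each spatial position to the subject with the highest attention."""
    num_subjects = len(attention)
    num_positions = len(attention[0])
    masks = [[0] * num_positions for _ in range(num_subjects)]
    for i in range(num_positions):
        winner = max(range(num_subjects), key=lambda s: attention[s][i])
        masks[winner][i] = 1
    return masks


def fuse(base, lora_deltas, masks):
    """Add each subject's LoRA delta to the base output only inside its mask."""
    return [
        base[i] + sum(m[i] * d[i] for m, d in zip(masks, lora_deltas))
        for i in range(len(base))
    ]


# Hypothetical attention from four positions to two subjects' activation words.
attention = [
    [0.9, 0.8, 0.1, 0.2],  # subject 0
    [0.1, 0.2, 0.9, 0.7],  # subject 1
]
masks = subject_masks(attention)
fused = fuse([0.0, 0.0, 0.0, 0.0], [[1.0] * 4, [5.0] * 4], masks)
print(masks)  # [[1, 1, 0, 0], [0, 0, 1, 1]]
print(fused)  # [1.0, 1.0, 5.0, 5.0]
```

Because the masks partition the positions, each region receives exactly one subject's LoRA contribution, which is the isolation property the abstract attributes to masking.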
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Authors: Walid Bousselham, Hilde Kuehne, Cordelia Schmid
First: 2025-10-27T16:32:12+00:00 · Latest: 2025-10-27T16:32:12+00:00
Comments: www.walidbousselham.com/VOLD/
Abstract
Training vision-language models (VLMs) for complex reasoning remains a
challenging task, partly due to the scarcity of high-quality image-text reasoning
data. Conversely, text-based reasoning resources are abundant and scalable, but
it is still an open question how to leverage them for VLM reasoning. To
address this problem, we propose VOLD, a framework to transfer reasoning
capabilities from text-only teacher models to VLM student models. To this end,
VOLD combines reinforcement learning via Group Relative Policy Optimization
(GRPO) with on-policy distillation, which allows the student reasoning traces
to be guided by the teacher model, resulting in a significant gain over using
GRPO alone. We further show that a cold-start alignment is essential for an
effective transfer during the online training phase in this scenario and that
without sufficient distributional alignment between teacher and student,
on-policy distillation fails to provide meaningful guidance. We evaluate VOLD
across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and
LogicVista, showing that VOLD outperforms the baseline model significantly and
improves over the state of the art by a margin. Our ablation shows the
importance of a cold-start alignment via SFT for on-policy distillation with a
text-only teacher.
中文标题/摘要
标题:VOLD:通过策略优化蒸馏从语言模型向视觉语言模型的知识迁移
训练视觉语言模型(VLMs)进行复杂推理仍然是一个具有挑战性的任务,主要是由于高质量图像文本推理数据的稀缺。相反,基于文本的推理资源丰富且可扩展,但如何利用它们来增强VLM推理仍然是一个开放的问题。为了解决这个问题,我们提出了VOLD,这是一种将仅文本教师模型的推理能力转移到VLM学生模型的框架。为此,VOLD结合了组相对策略优化(GRPO)的强化学习与策略优化蒸馏,这使得学生推理轨迹能够受到教师模型的引导,从而显著优于单独使用GRPO。我们进一步表明,在这种场景下的在线训练阶段,冷启动对齐对于有效的知识迁移至关重要,如果没有足够的分布对齐,策略优化蒸馏将无法提供有意义的指导。我们在包括MMMU-Pro、MathVision、MathVista和LogicVista在内的多种基准上评估了VOLD,结果显示VOLD显著优于基线模型,并且在某些方面超越了现有技术。我们的消融实验表明,通过仅文本教师进行的SFT对于策略优化蒸馏的冷启动对齐至关重要。
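Illustrative sketch of the GRPO ingredient named in the abstract above: Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same prompt. The population standard deviation and the `eps` constant are assumptions (implementations differ), and how VOLD combines this signal with the teacher's distillation signal is not specified in the abstract, so it is not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses to one prompt (1 = correct answer).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

Correct responses get positive advantages and incorrect ones negative, without a learned value function. If every response in a group earns the same reward, all advantages are zero and the group contributes no gradient.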
On the Faithfulness of Visual Thinking: Measurement and Enhancement
Authors: Zujing Liu, Junwen Pan, Qi She, Yuan Gao, Guisong Xia
First: 2025-10-27T16:15:54+00:00 · Latest: 2025-10-27T16:15:54+00:00
Abstract
Recent large vision-language models (LVLMs) can generate vision-text
multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning
(RFT). However, we observe that the visual information incorporated in MCoT is
often inaccurate, though the traces still yield correct answers, indicating a lack of
faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to
the RL reward in RFT, which solely incentivizes the format of interleaved
vision-text cues, i.e., it encourages the model to incorporate visual information
into its text reasoning steps without considering the correctness of the visual
information. In this paper, we first probe the faithfulness of MCoT by
measuring how much the prediction changes when its visual and textual thoughts
are intervened on. Surprisingly, the model's predictions remain nearly unchanged
under visual intervention but change significantly under textual intervention,
indicating that the visual evidence is largely ignored. To further analyze
visual information, we introduce an automated LVLM-based evaluation metric that
quantifies the faithfulness of visual cues from two perspectives: reliability
and sufficiency. Our evaluation reveals that the visual information in current
MCoT traces is simultaneously unreliable and insufficient. To address this
issue, we propose a novel MCoT learning strategy termed Sufficient-Component
Cause Model (SCCM) learning. This approach encourages the MCoT to generate
sufficient yet minimal visual components that are independently capable of
leading to correct answers. We note that the proposed SCCM is annotation-free
and compatible with various RFT for MCoT in a plug-and-play manner. Empirical
results demonstrate that SCCM consistently improves the visual faithfulness
across a suite of fine-grained perception and reasoning benchmarks. Code is
available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.
中文标题/摘要
标题:视觉思维的忠实性:测量与增强
近期的大规模视觉-语言模型(LVLMs)在强化微调(RFT)后可以生成视觉-文本多模态链式思考(MCoT)痕迹。然而,我们观察到MCoT中包含的视觉信息往往不准确,尽管仍然能给出正确答案,这表明MCoT推理过程缺乏忠实性。我们将这种不忠实归因于RFT中的RL奖励,它仅激励交错的视觉-文本提示格式,即它鼓励模型在文本推理步骤中加入视觉信息而不考虑视觉信息的准确性。在本文中,我们首先通过测量在干预视觉和文本思考时预测的变化来探究MCoT的忠实性。令人惊讶的是,在视觉干预下模型的预测几乎没有变化,但在文本干预下则显著变化,这表明视觉证据被大量忽视。为了进一步分析视觉信息,我们引入了一个基于LVLM的自动化评估指标,从可靠性和充分性两个角度量化视觉提示的忠实性。我们的评估揭示了当前MCoT痕迹中的视觉信息同时是不可靠和不充分的。为了解决这一问题,我们提出了一种新的MCoT学习策略,称为充分组件因果模型(SCCM)学习。该方法鼓励MCoT生成充分且最小的视觉组件,这些组件独立地能够导致正确答案。我们注意到,所提出的SCCM是无注释的,并且以插拔式方式兼容各种MCoT的RFT。实验证明,SCCM在一系列细粒度感知和推理基准上一致地提高了视觉忠实性。代码可在https://github.com/EugeneLiu01/Faithful_Thinking_with_Image获取。
Summary / 总结
This paper investigates the faithfulness of visual thinking in large vision-language models (LVLMs) after reinforcement fine-tuning (RFT). The study finds that while the models generate correct answers, the visual information used in their reasoning process is often inaccurate. To address this, the authors propose a new evaluation metric and a learning strategy called Sufficient-Component Cause Model (SCCM) to enhance the reliability and sufficiency of visual cues in the models' reasoning process, leading to improved visual faithfulness across various benchmarks.
本文研究了大型视觉语言模型(LVLM)在强化微调(RFT)后的视觉思考的忠实性。研究发现,尽管模型生成的答案是正确的,但用于推理的视觉信息往往是不准确的。为了解决这一问题,作者提出了一种新的评估指标和学习策略,称为充分组件因果模型(SCCM),以增强模型推理过程中视觉提示的可靠性和充分性,从而在各种细粒度感知和推理基准测试中提高了视觉忠实性。
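Illustrative sketch of the intervention probe described in the abstract above: faithfulness is probed by how much predictions change when the visual or textual thoughts are intervened on. The sketch assumes the simplest possible measure, the fraction of examples whose final answer flips. The answer lists are hypothetical and the paper's exact metric may differ.

```python
def answer_change_rate(original, intervened):
    """Fraction of examples whose predicted answer differs after intervention."""
    if not original or len(original) != len(intervened):
        raise ValueError("need two non-empty answer lists of equal length")
    changed = sum(a != b for a, b in zip(original, intervened))
    return changed / len(original)


# Hypothetical multiple-choice predictions on four examples.
original = ["A", "B", "C", "D"]
after_visual_intervention = ["A", "B", "C", "A"]
after_textual_intervention = ["B", "B", "A", "A"]

print(answer_change_rate(original, after_visual_intervention))   # 0.25
print(answer_change_rate(original, after_textual_intervention))  # 0.75
```

A low change rate under visual intervention paired with a high rate under textual intervention is the pattern the abstract reports as evidence that visual cues are largely ignored.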
Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models
Authors: Julius Vetter, Manuel Gloeckler, Daniel Gedon, Jakob H. Macke
First: 2025-04-24T15:29:39+00:00 · Latest: 2025-10-27T16:14:05+00:00
Abstract
Simulation-based inference (SBI) offers a flexible and general approach to
performing Bayesian inference: In SBI, a neural network is trained on synthetic
data simulated from a model and used to rapidly infer posterior distributions
for observed data. A key goal for SBI is to achieve accurate inference with as
few simulations as possible, especially for expensive simulators. In this work,
we address this challenge by repurposing recent probabilistic foundation models
for tabular data: We show how tabular foundation models -- specifically TabPFN
-- can be used as pre-trained autoregressive conditional density estimators for
SBI. We propose Neural Posterior Estimation with Prior-data Fitted Networks
(NPE-PFN) and show that it is competitive with current SBI approaches in terms
of accuracy for both benchmark tasks and two complex scientific inverse
problems. Crucially, it often substantially outperforms them in terms of
simulation efficiency, sometimes requiring orders of magnitude fewer
simulations. NPE-PFN eliminates the need for inference network selection,
training, and hyperparameter tuning. We also show that it exhibits superior
robustness to model misspecification and can be scaled to simulation budgets
that exceed the context size limit of TabPFN. NPE-PFN provides a new direction
for SBI, where training-free, general-purpose inference models offer efficient,
easy-to-use, and flexible solutions for a wide range of stochastic inverse
problems.
中文标题/摘要
标题:使用表格基础模型进行轻松高效的贝叶斯推断
基于模拟的推断(SBI)提供了一种灵活且通用的方法来进行贝叶斯推断:在SBI中,神经网络在从模型模拟的合成数据上进行训练,并用于快速推断观测数据的后验分布。SBI的关键目标是在尽可能少的模拟次数下实现准确的推断,尤其是对于昂贵的模拟器。在本文中,我们通过重新利用最近的概率表格基础模型来应对这一挑战:我们展示了如何使用表格基础模型——特别是TabPFN——作为预训练的自回归条件密度估计器来进行SBI。我们提出了神经后验估计与先验-数据拟合网络(NPE-PFN),并展示了它在基准任务和两个复杂的科学逆问题中与当前SBI方法在准确性方面具有竞争力。至关重要的是,它在模拟效率方面通常表现更优,有时需要的数量级更少的模拟次数。NPE-PFN消除了推断网络选择、训练和超参数调整的需要。我们还展示了它在模型误指定时表现出更优的鲁棒性,并且可以扩展到超过TabPFN上下文大小限制的模拟预算。NPE-PFN为SBI提供了一个新方向,在此方向上,无需训练的通用推断模型提供了高效、易于使用且灵活的解决方案,适用于广泛的随机逆问题。
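Illustrative sketch of the simulation-budget problem the abstract above targets. This is rejection ABC, a classic simulation-based baseline, swapped in here because the abstract does not describe the TabPFN interface. It is not NPE-PFN. The toy simulator, prior range, and tolerance are all assumptions.

```python
import random


def simulator(theta, rng):
    """Toy stochastic model: the observation is the parameter plus Gaussian noise."""
    return theta + rng.gauss(0.0, 0.5)


def rejection_abc(observed, num_simulations, tolerance, seed=0):
    """Keep prior draws whose simulated data land close to the observation."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(num_simulations):
        theta = rng.uniform(-3.0, 3.0)  # draw from a flat prior
        if abs(simulator(theta, rng) - observed) < tolerance:
            accepted.append(theta)
    return accepted


samples = rejection_abc(observed=1.0, num_simulations=5000, tolerance=0.1)
posterior_mean = sum(samples) / len(samples)
print(len(samples), round(posterior_mean, 2))
```

Only a small fraction of the 5000 simulations is accepted, which is the inefficiency that amortized estimators try to avoid by learning from every simulated pair.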
Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
Authors: Pengfei Cai, Yan Song, Qing Gu, Nan Jiang, Haoyu Song, Ian McLoughlin
Venue: MM 2025
First: 2025-07-22T08:24:01+00:00 · Latest: 2025-10-27T15:55:48+00:00
Comments: Accepted by MM 2025
Abstract
Most existing sound event detection (SED) algorithms operate under a
closed-set assumption, restricting their detection capabilities to predefined
classes. While recent efforts have explored language-driven zero-shot SED by
exploiting audio-language models, their performance is still far from
satisfactory due to the lack of fine-grained alignment and cross-modal feature
fusion. In this work, we propose the Detect Any Sound Model (DASM), a
query-based framework for open-vocabulary SED guided by multi-modal queries.
DASM formulates SED as a frame-level retrieval task, where audio features are
matched against query vectors derived from text or audio prompts. To support
this formulation, DASM introduces a dual-stream decoder that explicitly
decouples event recognition and temporal localization: a cross-modality event
decoder performs query-feature fusion and determines the presence of sound
events at the clip-level, while a context network models temporal dependencies
for frame-level localization. Additionally, an inference-time attention masking
strategy is proposed to leverage semantic relations between base and novel
classes, substantially enhancing generalization to novel classes. Experiments
on the AudioSet Strong dataset demonstrate that DASM effectively balances
localization accuracy with generalization to novel classes, outperforming
CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in
the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot
evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the
supervised CRNN baseline. The project page is available at
https://cai525.github.io/Transformer4SED/demo_page/DASM/.
中文标题/摘要
标题:检测任何声音:多模态查询指导下的开放词汇声事件检测
大多数现有的声事件检测(SED)算法基于封闭集假设,限制了其检测能力仅限于预定义的类别。虽然最近的努力通过利用音频-语言模型探索了语言驱动的零样本SED,但由于细粒度对齐和跨模态特征融合的缺乏,其性能仍然不尽如人意。在本工作中,我们提出了检测任何声音模型(DASM),这是一种由多模态查询指导的基于查询的框架,用于开放词汇的SED。DASM将SED公式化为一个帧级检索任务,其中音频特征与从文本或音频提示派生的查询向量进行匹配。为了支持这种公式化,DASM引入了一种双流解码器,明确地将事件识别和时间定位分离:跨模态事件解码器执行查询特征融合,并在片段级确定声音事件的存在,而上下文网络则建模时间依赖性以进行帧级定位。此外,在推理时提出了一种注意力掩蔽策略,以利用基本类和新颖类之间的语义关系,显著增强了对新颖类别的泛化能力。在AudioSet Strong数据集上的实验表明,DASM有效地平衡了定位精度与对新颖类别的泛化能力,在开放词汇设置中优于CLAP基线方法(+7.8 PSDS),在封闭集设置中优于基线方法(+6.9 PSDS)。此外,在DESED的跨数据集零样本评估中,DASM的PSDS1得分为42.2,甚至超过了监督的CRNN基线。项目页面可在https://cai525.github.io/Transformer4SED/demo_page/DASM/获取。
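Illustrative sketch of SED as frame-level retrieval, as formulated in the abstract above: per-frame audio features are matched against a query vector derived from a text or audio prompt. The sketch assumes cosine similarity with a fixed threshold as the matching rule. DASM's dual-stream decoder and attention masking are not reproduced, and the two-dimensional features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_frames(frame_features, query, threshold):
    """Mark each frame as active when it is similar enough to the query."""
    return [cosine(f, query) >= threshold for f in frame_features]


def active_segments(flags, hop_seconds):
    """Merge consecutive active frames into (onset, offset) pairs in seconds."""
    segments, start = [], None
    for i, on in enumerate(flags):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start * hop_seconds, i * hop_seconds))
            start = None
    if start is not None:
        segments.append((start * hop_seconds, len(flags) * hop_seconds))
    return segments


frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.2]]
flags = detect_frames(frames, query=[1.0, 0.0], threshold=0.8)
print(active_segments(flags, hop_seconds=0.5))  # [(0.0, 1.0), (2.0, 2.5)]
```

Because the class is defined by the query vector rather than a fixed output layer, swapping in a new query is all it takes to search for a class that was never seen in training.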
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Authors: Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Yuheng Li, Konstantinos Psounis, Xiaofeng Yang
First: 2025-03-18T06:12:38+00:00 · Latest: 2025-10-27T15:33:53+00:00
Abstract
Vision-language models (VLMs) have achieved impressive progress in natural
image reasoning, yet their potential in medical imaging remains underexplored.
Medical vision-language tasks demand precise understanding and clinically
coherent answers, which are difficult to achieve due to the complexity of
medical data and the scarcity of high-quality expert annotations. These
challenges limit the effectiveness of conventional supervised fine-tuning (SFT)
and Chain-of-Thought (CoT) strategies that work well in general domains. To
address these challenges, we propose Med-R1, a reinforcement learning
(RL)-enhanced vision-language model designed to improve generalization and
reliability in medical reasoning. Built on the DeepSeek strategy, Med-R1 adopts
Group Relative Policy Optimization (GRPO) to encourage reward-guided learning
beyond static annotations. We comprehensively evaluate Med-R1 across eight
distinct medical imaging modalities. Med-R1 achieves a 29.94% improvement in
average accuracy over its base model Qwen2-VL-2B, and even outperforms
Qwen2-VL-72B, a model with 36x more parameters. To assess cross-task
generalization, we further evaluate Med-R1 on five question types. Med-R1
outperforms Qwen2-VL-2B by 32.06% in question-type generalization, also
surpassing Qwen2-VL-72B. We further explore the thinking process in Med-R1, a
crucial component for the success of DeepSeek-R1. Our results show that
omitting intermediate rationales (No-Thinking-Med-R1) not only improves
in-domain and cross-domain generalization with less training, but also
challenges the assumption that more reasoning always helps. These findings
suggest that in medical VQA, it is not reasoning itself, but its quality and
domain alignment, that determine effectiveness. Together, these results
highlight that RL improves medical reasoning and generalization, enabling
efficient and reliable VLMs for real-world deployment.
中文标题/摘要
标题:Med-R1:强化学习在通用医疗推理中的视觉-语言模型
视觉-语言模型(VLMs)在自然图像推理方面取得了显著进展,但在医学成像领域的潜力尚未得到充分探索。医学视觉-语言任务需要精确的理解和临床连贯的答案,由于医学数据的复杂性和高质量专家注释的稀缺性,这些目标难以实现。这些挑战限制了传统监督微调(SFT)和思维链(CoT)策略的有效性,这些策略在通用领域表现良好。为了解决这些挑战,我们提出了Med-R1,这是一种基于DeepSeek策略的强化学习(RL)增强视觉-语言模型,旨在提高医疗推理的泛化能力和可靠性。Med-R1采用组相对策略优化(GRPO)鼓励基于奖励的学习,超越静态注释。我们全面评估了Med-R1在八种不同的医学成像模态中的表现。Med-R1在平均准确率上比其基础模型Qwen2-VL-2B提高了29.94%,甚至在参数量为Qwen2-VL-72B的36倍的情况下也表现更优。为了评估跨任务泛化能力,我们进一步在五种问题类型上评估了Med-R1。Med-R1在问题类型泛化上比Qwen2-VL-2B提高了32.06%,也超过了Qwen2-VL-72B。我们进一步探讨了Med-R1的思维过程,这是Deepseek-R1成功的关键组成部分。我们的结果显示,省略中间推理(No-Thinking-Med-R1)不仅在训练量更少的情况下提高了领域内和跨领域的泛化能力,还挑战了更多推理总是有益的假设。这些发现表明,在医学VQA中,决定有效性的不是推理本身,而是其质量和领域对齐。总之,这些结果强调了RL在提高医疗推理和泛化方面的作用,使视觉-语言模型能够高效可靠地应用于实际部署。
Summary / 总结
The research aims to enhance the medical reasoning capabilities of vision-language models by leveraging reinforcement learning. Med-R1, built on the DeepSeek strategy, uses Group Relative Policy Optimization to improve generalization and reliability in medical imaging tasks. Med-R1 improves over its base model Qwen2-VL-2B by 29.94% in average accuracy across eight imaging modalities and by 32.06% in question-type generalization across five question types, and it also surpasses the much larger Qwen2-VL-72B in both settings, demonstrating the effectiveness of reinforcement learning in medical VQA.
研究旨在通过解决复杂医学数据和专家标注稀缺的问题,提升视觉语言模型在医学推理方面的能力。Med-R1 是一种强化学习增强的模型,使用组相对策略优化来提高泛化能力和可靠性。Med-R1 在平均准确率和问题类型泛化方面分别比基础模型 Qwen2-VL-2B 提高了 29.94% 和 32.06%,并且在这两方面都超过了 Qwen2-VL-72B。研究还发现,省略中间推理可以增强领域内和跨领域的泛化能力,表明推理的质量和领域对齐是决定有效性的重要因素。
GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains
Authors: Chun Wang, Xiaojun Ye, Xiaoran Pan, Zihao Pan, Haofan Wang, Yiren Song
First: 2025-05-24T13:48:57+00:00 · Latest: 2025-10-27T12:39:19+00:00
Abstract
Recent advances in Visual Language Models (VLMs) have demonstrated
exceptional performance in visual reasoning tasks. However, geo-localization
presents unique challenges, requiring the extraction of multigranular visual
cues from images and their integration with external world knowledge for
systematic reasoning. Current approaches to geo-localization tasks often lack
robust reasoning mechanisms and explainability, limiting their effectiveness.
To address these limitations, we propose the Geo Reason Enhancement (GRE)
Suite, a novel framework that augments VLMs with structured reasoning chains
for accurate and interpretable location inference. The GRE Suite is
systematically developed across three key dimensions: dataset, model, and
benchmark. First, we introduce GRE30K, a high-quality geo-localization
reasoning dataset designed to facilitate fine-grained visual and contextual
analysis. Next, we present the GRE model, which employs a multi-stage reasoning
strategy to progressively infer scene attributes, local details, and semantic
features, thereby narrowing down potential geographic regions with enhanced
precision. Finally, we construct the Geo Reason Evaluation Benchmark
(GREval-Bench), a comprehensive evaluation framework that assesses VLMs across
diverse urban, natural, and landmark scenes to measure both coarse-grained
(e.g., country, continent) and fine-grained (e.g., city, street) localization
performance. Experimental results demonstrate that GRE significantly
outperforms existing methods across all granularities of geo-localization
tasks, underscoring the efficacy of reasoning-augmented VLMs in complex
geographic inference. Code and data will be released at
https://github.com/Thorin215/GRE.
中文标题/摘要
标题:GRE 套件:通过微调视觉语言模型和增强的推理链进行地理定位推断
近期视觉语言模型(VLMs)在视觉推理任务中的表现极为出色。然而,地理定位任务提出了独特的挑战,需要从图像中提取多层次的视觉线索,并将其与外部世界知识整合进行系统推理。当前的地理定位方法往往缺乏稳健的推理机制和可解释性,限制了其有效性。为解决这些限制,我们提出了地理推理增强(GRE)套件,这是一种新颖的框架,通过结构化的推理链增强VLMs,以实现准确和可解释的位置推断。GRE套件在三个关键维度上系统地开发:数据集、模型和基准测试。首先,我们引入了GRE30K,这是一个高质量的地理定位推理数据集,旨在促进细粒度的视觉和上下文分析。其次,我们介绍了GRE模型,该模型采用多阶段推理策略逐步推断场景属性、局部细节和语义特征,从而通过增强的精度缩小潜在地理区域。最后,我们构建了地理推理评估基准(GREval-Bench),这是一个全面的评估框架,评估VLMs在各种城市、自然和地标场景中的表现,以衡量粗粒度(例如,国家、大陆)和细粒度(例如,城市、街道)的定位性能。实验结果表明,GRE在所有地理定位任务的各个粒度级别上显著优于现有方法,突显了增强推理的VLMs在复杂地理推理中的有效性。代码和数据将在https://github.com/Thorin215/GRE/发布。
Summary / 总结
The GRE Suite is a framework that enhances Visual Language Models (VLMs) with structured reasoning chains for accurate geo-localization. It includes GRE30K, a high-quality dataset for fine-grained visual and contextual analysis, the GRE model with a multi-stage reasoning strategy, and GREval-Bench, a comprehensive evaluation framework. Experimental results show that GRE outperforms existing methods in all levels of geo-localization tasks.
GRE Suite 通过将结构化推理链与细调的视觉语言模型(VLMs)结合,解决了当前地理定位方法的局限性。它引入了 GRE30K,一个高质量的数据集,用于细粒度的视觉和上下文分析,并提出了 GRE 模型,该模型使用多阶段推理策略以提高精度。GRE Suite 还包括 GREval-Bench,一个全面的评估框架。实验结果表明,GRE 在所有地理定位任务的粒度上都优于现有方法。
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization
Authors: Xinhai Wang, Shu Yang, Liangyu Wang, Lin Zhang, Huanyi Xie, Lijie Hu, Di Wang
First: 2025-10-27T12:24:14+00:00 · Latest: 2025-10-27T12:24:14+00:00
Abstract
Circuit discovery, which involves identifying sparse and task-relevant
subnetworks in pre-trained language models, is a cornerstone of mechanistic
interpretability. Automated Circuit Discovery (ACDC) has emerged as a pivotal
methodology in circuit discovery, but its application to large language models
is severely limited by computational inefficiency and prohibitively high memory
requirements. Although several accelerated approaches have been proposed, they
primarily rely on linear approximations to ACDC, which significantly
compromises analytical faithfulness. Our proposed method for accelerating
automated circuit discovery, Per Attention Head Quantization (PAHQ), takes a
fundamentally different approach by optimizing the efficiency of each
individual patching operation. PAHQ leverages a fundamental alignment between
activation patching and mixed-precision quantization (MPQ): interpretability
analysis through patching essentially performs targeted ablation studies.
Therefore, we can maintain high precision exclusively for investigated
components while safely reducing precision elsewhere in the network.
PAHQ-accelerated ACDC reduces runtime by up to 80% and memory consumption by
up to 30% compared to unaccelerated ACDC while maintaining faithfulness.
Importantly, our method readily integrates with existing edge-based circuit
discovery techniques by modifying the attention computation mechanism. This
training-free approach provides a practical and novel pathway for accelerating
mechanistic interpretability methods. Our code is available at
https://github.com/626619403/PAHQ.
中文标题/摘要
标题:PAHQ:通过混合精度推断优化加速自动电路发现
电路发现涉及在预训练语言模型中识别稀疏且任务相关的子网络,是机制可解释性的基石。自动电路发现(ACDC)已成为电路发现的关键方法,但其在大型语言模型中的应用受到计算效率低下和内存需求过高的严重限制。尽管提出了一些加速方法,但它们主要依赖于ACDC的线性近似,这显著牺牲了分析的忠实性。我们提出了一种加速自动电路发现的方法——注意头量化(PAHQ),它通过优化每个单独补丁操作的效率采取了根本不同的方法。PAHQ 利用了激活补丁和混合精度量化(MPQ)之间的基本对齐:通过补丁进行的解释性分析实际上执行了有针对性的消融研究。因此,我们可以在保持高精度的同时,安全地在网络的其他部分降低精度。PAHQ 加速的 ACDC 相比未加速的 ACDC 运行时间最多可减少 80%,内存消耗最多可减少 30%,同时保持忠实性。重要的是,我们的方法可以轻松与现有的基于边缘的电路发现技术集成,通过修改注意力计算机制。这种无需训练的方法为加速机制可解释性方法提供了一种实用且新颖的途径。我们的代码可在 https://github.com/626619403/PAHQ 获取。
Summary / 总结
PAHQ accelerates automated circuit discovery by optimizing each patching operation through mixed-precision inference, reducing runtime by up to 80% and memory consumption by up to 30% while maintaining analytical faithfulness. This method integrates with existing edge-based circuit discovery techniques by modifying the attention computation mechanism, offering a practical solution for mechanistic interpretability methods.
PAHQ通过优化每个剪接操作来实现混合精度推理,从而将运行时间最多减少80%,内存消耗最多减少30%,同时保持分析的忠实性。该方法通过修改注意力计算机制与现有的边缘基电路发现技术集成,为机制可解释性方法提供了一个实用的解决方案。
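Illustrative sketch of the mixed-precision idea in the abstract above: keep high precision only for the component under investigation and reduce precision elsewhere. The sketch simulates low precision by rounding values onto a symmetric integer grid. It does not use real low-precision kernels, and the head names and values are hypothetical.

```python
def fake_quantize(values, num_bits=8):
    """Round values onto a symmetric grid with 2**(num_bits - 1) - 1 positive levels."""
    levels = 2 ** (num_bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]


def mixed_precision_heads(head_outputs, investigated, num_bits=8):
    """Leave investigated heads untouched and quantize every other head."""
    return {
        name: list(vals) if name in investigated else fake_quantize(vals, num_bits)
        for name, vals in head_outputs.items()
    }


# Hypothetical per-head activations; only head "L0H1" is being patched.
head_outputs = {
    "L0H0": [0.70, -0.35, 0.11],
    "L0H1": [0.123456, -0.654321, 0.5],
    "L0H2": [1.5, 0.25, -0.9],
}
mixed = mixed_precision_heads(head_outputs, investigated={"L0H1"}, num_bits=4)
print(mixed["L0H1"] == head_outputs["L0H1"])
```

The rounding error in each quantized head is bounded by half a grid step, while the investigated head is reproduced exactly. This is a toy version of the trade-off the abstract describes.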
A Video Is Not Worth a Thousand Words
Authors: Sam Pollard, Michael Wray
First: 2025-10-27T12:15:02+00:00 · Latest: 2025-10-27T12:15:02+00:00
Abstract
As we become increasingly dependent on vision language models (VLMs) to
answer questions about the world around us, there is a significant amount of
research devoted to increasing both the difficulty of video question answering
(VQA) datasets, and the context lengths of the models that they evaluate. The
reliance on large language models as backbones has led to concerns about
potential text dominance, and the exploration of interactions between
modalities is underdeveloped. How do we measure whether we're heading in the
right direction, with the complexity that multi-modal models introduce? We
propose a joint method of computing both feature attributions and modality
scores based on Shapley values, where both the features and modalities are
arbitrarily definable. Using these metrics, we compare six VLMs of
varying context lengths on four representative datasets, focusing on
multiple-choice VQA. In particular, we consider video frames and whole textual
elements as equal features in the hierarchy, and the multiple-choice VQA task
as an interaction between three modalities: video, question and answer. Our
results demonstrate a dependence on text and show that the multiple-choice VQA
task devolves into a model's ability to ignore distractors. Code available at
https://github.com/sjpollard/a-video-is-not-worth-a-thousand-words.
中文标题/摘要
标题:视频不值千言万语
随着我们越来越依赖视觉语言模型(VLMs)来回答我们周围世界的问题,对视频问答(VQA)数据集的难度和模型上下文长度的研究显著增加。依赖大型语言模型作为骨干引发了关于潜在文本主导性的担忧,而不同模态之间交互的研究尚不充分。我们如何衡量多模态模型引入的复杂性,以确定我们是否在正确的方向上前进?我们提出了一种基于Shapley值同时计算特征归因和模态得分的联合方法,其中特征和模态都可以任意定义。使用这些指标,我们在4个代表性数据集上比较了6个不同上下文长度的VLM模型,重点关注多项选择VQA任务。特别是,我们将视频帧和整个文本元素视为层次结构中的平等特征,并将多项选择VQA任务视为视频、问题和答案之间三种模态的交互。我们的结果表明了对文本的依赖性,并显示了多项选择VQA任务退化为模型忽略干扰项的能力。代码可在https://github.com/sjpollard/a-video-is-not-worth-a-thousand-words/ 获取。
Summary / 总结
This study addresses the reliance on large language models in vision-language models (VLMs) for video question answering (VQA), which raises concerns about text dominance. The authors propose a joint method using Shapley values to compute feature attributions and modality scores, comparing six VLM models of different context lengths on four datasets. The results indicate a significant dependence on textual information and suggest that the multiple-choice VQA task primarily tests a model's ability to ignore distractors.
该研究关注视觉语言模型(VLMs)在视频问答(VQA)中对大型语言模型的依赖,这引发了文本主导性的担忧。作者提出了一种联合方法,使用Shapley值计算特征归因和模态得分,比较了六个不同上下文长度的VLM模型在四个数据集上的表现。结果显示,模型对文本信息的依赖性很强,并且多项选择VQA任务主要测试模型忽略干扰项的能力。
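Illustrative sketch of Shapley-based modality scores, following the abstract above, which treats multiple-choice VQA as an interaction between three modalities: video, question, and answer. The sketch computes exact Shapley values by averaging marginal contributions over all orderings. The accuracy table is hypothetical, not a result from the paper, and the paper's own estimator may differ.

```python
from itertools import permutations


def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical accuracy when only the listed modalities are visible to the model.
accuracy = {
    frozenset(): 0.25,
    frozenset({"video"}): 0.25,
    frozenset({"question"}): 0.30,
    frozenset({"answer"}): 0.45,
    frozenset({"video", "question"}): 0.35,
    frozenset({"video", "answer"}): 0.50,
    frozenset({"question", "answer"}): 0.65,
    frozenset({"video", "question", "answer"}): 0.75,
}
scores = shapley_values(["video", "question", "answer"], accuracy.__getitem__)
print({m: round(s, 3) for m, s in scores.items()})
```

The scores sum to the gap between full-input accuracy and the empty-input baseline, so each one reads as that modality's share of the improvement. In this made-up table the textual modalities take most of the credit, the kind of text dominance the abstract reports.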
Training-Free In-Context Forensic Chain for Image Manipulation Detection and Localization
Authors: Rui Chen, Bin Liu, Changtao Miao, Xinghao Wang, Yi Li, Tao Gong, Qi Chu, Nenghai Yu
First: 2025-10-11T08:42:31+00:00 · Latest: 2025-10-27T11:57:33+00:00
Abstract
Advances in image tampering pose serious security threats, underscoring the
need for effective image manipulation localization (IML). While supervised IML
achieves strong performance, it depends on costly pixel-level annotations.
Existing weakly supervised or training-free alternatives often underperform and
lack interpretability. We propose the In-Context Forensic Chain (ICFC), a
training-free framework that leverages multi-modal large language models
(MLLMs) for interpretable IML tasks. ICFC integrates an objectified rule
construction with adaptive filtering to build a reliable knowledge base and a
multi-step progressive reasoning pipeline that mirrors expert forensic
workflows from coarse proposals to fine-grained forensics results. This design
enables systematic exploitation of MLLM reasoning for image-level
classification, pixel-level localization, and text-level interpretability.
Across multiple benchmarks, ICFC not only surpasses state-of-the-art
training-free methods but also achieves competitive or superior performance
compared to weakly and fully supervised approaches.
中文标题/摘要
标题:无需训练的上下文法医链用于图像篡改检测与定位
图像篡改技术的进步带来了严重的安全威胁,突显了有效图像篡改定位(IML)的必要性。虽然监督IML能够取得优异性能,但它依赖于昂贵的像素级注释。现有的弱监督或无需训练的替代方法往往表现不佳且缺乏可解释性。我们提出了一种无需训练的框架——上下文法医链(ICFC),该框架利用多模态大型语言模型(MLLMs)进行可解释的IML任务。ICFC 结合了对象化规则构建与自适应过滤,构建了一个可靠的知识库和多步渐进推理管道,该管道模仿了从粗略提案到精细法医结果的专家法医工作流程。此设计使MLLM推理系统化地应用于图像级分类、像素级定位和文本级可解释性。在多个基准测试中,ICFC 不仅超越了最先进的无需训练方法,而且在弱监督和完全监督方法中也取得了竞争性或更优的性能。
Summary / 总结
The paper addresses image manipulation localization (IML) by proposing the In-Context Forensic Chain (ICFC), a training-free framework that leverages multi-modal large language models (MLLMs) for interpretable IML. ICFC combines objectified rule construction with adaptive filtering to build a reliable knowledge base, and adds a multi-step progressive reasoning pipeline that moves from coarse proposals to fine-grained forensic results. Across multiple benchmarks, it surpasses state-of-the-art training-free methods and achieves competitive or superior performance compared to weakly and fully supervised approaches.
研究旨在通过开发一种无需训练的框架来解决图像篡改带来的安全威胁,即图像篡改定位(IML)。In-Context Forensic Chain (ICFC) 利用多模态大型语言模型构建可解释的知识库和多步推理管道。该方法在多个基准测试中不仅超越了现有的无需训练的方法,而且在弱监督和完全监督方法中也取得了竞争性或更优的性能。
Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
Authors: Alois Thomas, Maya Varma, Jean-Benoit Delbrouck, Curtis P. Langlotz
First: 2025-10-27T11:08:05+00:00 · Latest: 2025-10-27T11:08:05+00:00
Abstract
Automating radiology report generation with Large Vision-Language Models
(LVLMs) holds great potential, yet these models often produce clinically
critical hallucinations, posing serious risks. Existing hallucination detection
methods frequently lack the necessary sentence-level granularity or robust
generalization across different LVLM generators. We introduce a novel approach:
a sentence-level Process Reward Model (PRM) adapted for this vision-language
task. Our PRM predicts the factual correctness of each generated sentence,
conditioned on clinical context and preceding text. When fine-tuned on
MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM
outperforms existing verification techniques, demonstrating, for instance,
relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in
AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods
reliant on internal model states, our PRM demonstrates strong generalization to
an unseen LVLM. We further show its practical utility: PRM scores effectively
filter low-quality reports, improving F1-CheXbert scores by 4.5% (when
discarding the worst 10% of reports). Moreover, when guiding a novel weighted
best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative
improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for
BERTScore. These results demonstrate that a lightweight, context-aware PRM
provides a model-agnostic safety layer for clinical LVLMs without access to
internal activations.
中文标题/摘要
标题:过程奖励模型在LVLM放射学报告句级验证中的应用
使用大型视觉-语言模型(LVLM)自动化生成放射学报告具有巨大潜力,但这些模型经常产生临床关键的幻觉,带来严重风险。现有的幻觉检测方法通常缺乏必要的句级粒度或在不同LVLM生成器之间稳健的泛化能力。我们提出了一种新颖的方法:一种适应于视觉-语言任务的句级过程奖励模型(PRM)。我们的PRM在临床背景和前文文本的条件下,预测每个生成句子的事实正确性。在使用MIMIC-CXR和弱监督标签微调后,一个轻量级的0.5B参数PRM优于现有验证技术,例如,在一个LVLM输出上,马修相关系数相对提高了7.5%,AUROC提高了1.8%,超过了强大的白盒基线。与依赖于内部模型状态的方法不同,我们的PRM在未见过的LVLM上表现出强大的泛化能力。我们进一步展示了其实际用途:PRM分数有效地过滤低质量报告,提高F1-CheXbert分数4.5%(当丢弃最差的10%报告时)。此外,在MIMIC-CXR测试集上引导一种新的加权最佳N选择过程时,我们的PRM在临床指标上分别提高了7.4%的F1-CheXbert和0.6%的BERTScore。这些结果表明,一个轻量级、上下文感知的PRM为无内部激活访问的临床LVLM提供了一种模型无关的安全层。
Summary / 总结
This study addresses hallucinations in radiology reports generated by Large Vision-Language Models (LVLMs) by introducing a sentence-level Process Reward Model (PRM). The PRM predicts the factual correctness of each generated sentence, conditioned on clinical context and preceding text. Fine-tuned on MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM outperforms existing verification techniques, with relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in AUROC over strong white-box baselines on outputs from one LVLM. The PRM also generalizes strongly to an unseen LVLM and effectively filters low-quality reports, improving F1-CheXbert scores by 4.5% when the worst 10% of reports are discarded. Additionally, when guiding a weighted best-of-N selection process on the MIMIC-CXR test set, it yields relative improvements of 7.4% for F1-CheXbert and 0.6% for BERTScore.
该研究通过引入句子级过程奖励模型(PRM)来解决大型视觉-语言模型(LVLM)生成的放射学报告中的幻觉问题。PRM 预测每个句子的事实正确性,考虑临床上下文和前文文本。在 MIMIC-CXR 上使用弱监督标签微调后,PRM 在多项指标上优于现有验证技术,如 Matthews 相关系数和 AUROC。PRM 还展示了对未见过的 LVLM 的强大泛化能力,并有效过滤低质量报告,提高 F1-CheXbert 得分 4.5%。此外,在 MIMIC-CXR 测试集上的加权最佳-of-N 选择过程中,它在临床指标上分别提高了 7.4% 的 F1-CheXbert 和 0.6% 的 BERTScore。
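The report-filtering step mentioned in the entry above (discarding the worst 10% of reports by PRM score) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how sentence-level scores are aggregated into a report-level score, so the mean used in `report_score` is a guess, and the `reports` structure and its field names are hypothetical.

```python
def report_score(sentence_scores):
    """Aggregate per-sentence correctness probabilities into one score.

    Using the mean is an assumption; the source does not specify the aggregation.
    """
    return sum(sentence_scores) / len(sentence_scores)


def filter_worst(reports, fraction=0.10):
    """Drop the lowest-scoring `fraction` of reports, keeping the original order."""
    n_drop = int(len(reports) * fraction)
    ranked = sorted(
        range(len(reports)),
        key=lambda i: report_score(reports[i]["sentence_scores"]),
    )
    dropped = set(ranked[:n_drop])
    return [r for i, r in enumerate(reports) if i not in dropped]


# Hypothetical data: ten reports, one of which has clearly lower sentence scores.
reports = [{"id": i, "sentence_scores": [0.9, 0.8]} for i in range(10)]
reports[3]["sentence_scores"] = [0.2, 0.4]

kept = filter_worst(reports, fraction=0.10)
kept_ids = [r["id"] for r in kept]
print(kept_ids)
```

Other aggregations (for example the minimum sentence score, which penalizes a single hallucinated sentence more heavily) would slot into `report_score` without changing the rest of the sketch.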
DecoDINO: 3D Human-Scene Contact Prediction with Semantic Classification
Authors: Lukas Bierling, Davide Pasero, Fleur Dolmans, Helia Ghasemi, Angelo Broere
First: 2025-10-27T10:46:22+00:00 · Latest: 2025-10-27T10:46:22+00:00
Abstract
Accurate vertex-level contact prediction between humans and surrounding
objects is a prerequisite for high-fidelity human-object interaction models
used in robotics, AR/VR, and behavioral simulation. DECO was the first
in-the-wild estimator for this task but is limited to binary contact maps and
struggles with soft surfaces, occlusions, children, and false-positive foot
contacts. We address these issues and introduce DecoDINO, a three-branch
network based on DECO's framework. It uses two DINOv2 ViT-g/14 encoders,
class-balanced loss weighting to reduce bias, and patch-level cross-attention
for improved local reasoning. Vertex features are finally passed through a
lightweight MLP with a softmax to assign semantic contact labels. We also
tested a vision-language model (VLM) to integrate text features, but the
simpler architecture performed better and was used instead. On the DAMON
benchmark, DecoDINO (i) raises the binary-contact F1 score by 7$\%$, (ii)
halves the geodesic error, and (iii) augments predictions with object-level
semantic labels. Ablation studies show that LoRA fine-tuning and the dual
encoders are key to these improvements. DecoDINO outperformed the challenge
baseline in both tasks of the DAMON Challenge. Our code is available at
https://github.com/DavidePasero/deco/tree/main.
中文标题/摘要
标题:DecoDINO:人体与场景接触的3D语义分类预测
准确的顶点级人体与周围物体接触预测是用于机器人技术、AR/VR和行为模拟的高保真人体物体交互模型的前提。DECO是首个野外接触预测估计器,但仅限于二元接触图,并且在软表面、遮挡、儿童和假阳性脚接触方面存在困难。我们解决了这些问题并引入了基于DECO框架的三支网络DecoDINO。它使用两个DINOv2 ViT-g/14编码器、类别平衡损失加权以减少偏差,并在补丁级别使用交叉注意力以提高局部推理能力。最终的顶点特征通过一个轻量级MLP和softmax传递,以分配语义接触标签。我们还测试了一种视觉语言模型(VLM)以整合文本特征,但更简单的架构表现更好并被采用。在DAMON基准测试中,DecoDINO (i) 将二元接触F1分数提高了7%,(ii) 将地线误差减半,并 (iii) 通过对象级别的语义标签增强预测。消融研究显示,LoRA微调和双编码器是这些改进的关键。DecoDINO在DAMON挑战的两个任务中均优于挑战基线。我们的代码可在https://github.com/DavidePasero/deco/tree/main/ 获取。
Summary / 总结
The research aims to improve vertex-level contact prediction between humans and surrounding objects, a prerequisite for high-fidelity human-object interaction models in robotics, AR/VR, and behavioral simulation. DecoDINO, a three-branch network based on DECO's framework, uses two DINOv2 ViT-g/14 encoders, class-balanced loss weighting to reduce bias, and patch-level cross-attention for improved local reasoning. On the DAMON benchmark, DecoDINO raises the binary-contact F1 score by 7%, halves the geodesic error, and augments predictions with object-level semantic labels. Ablation studies indicate that LoRA fine-tuning and the dual encoders are key to these improvements, and DecoDINO outperformed the challenge baseline in both tasks of the DAMON Challenge.
研究旨在提高人类与物体之间顶点级接触预测的准确性,应用于机器人技术、AR/VR和行为模拟。DecoDINO基于DECO框架,采用DINOv2 ViT-g/14编码器、类平衡损失加权和补丁级交叉注意力,增强局部推理。该模型在二元接触F1分数上提高了7%,将地线误差减半,并提供物体级别的语义标签。消融研究显示,LoRA微调和双编码器是这些改进的关键,DecoDINO在DAMON基准测试和挑战基准测试中均优于基线模型。
Evaluation of Vision-LLMs in Surveillance Video
Authors: Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense
Venue: NeurIPS 2025 Workshop (poster)
First: 2025-10-27T10:27:02+00:00 · Latest: 2025-10-27T10:27:02+00:00
Comments: Accepted as poster in the NeurIPS 2025 Workshop on Space in Vision,
Language, and Embodied AI
Abstract
The widespread use of cameras in our society has created an overwhelming
amount of video data, far exceeding the capacity for human monitoring. This
presents a critical challenge for public safety and security, as the timely
detection of anomalous or criminal events is crucial for effective response and
prevention. The ability for an embodied agent to recognize unexpected events is
fundamentally tied to its capacity for spatial reasoning. This paper
investigates the spatial reasoning of vision-language models (VLMs) by framing
anomalous action recognition as a zero-shot, language-grounded task, addressing
the embodied perception challenge of interpreting dynamic 3D scenes from sparse
2D video. Specifically, we investigate whether small, pre-trained vision-LLMs
can act as spatially-grounded, zero-shot anomaly detectors by converting video
into text descriptions and scoring labels via textual entailment. We evaluate
four open models on UCF-Crime and RWF-2000 under prompting and
privacy-preserving conditions. Few-shot exemplars can improve accuracy for some
models, but may increase false positives, and privacy filters (especially
full-body GAN transforms) introduce inconsistencies that degrade accuracy.
These results chart where current vision-LLMs succeed (simple, spatially
salient events) and where they falter (noisy spatial cues, identity
obfuscation). Looking forward, we outline concrete paths to strengthen spatial
grounding without task-specific training: structure-aware prompts, lightweight
spatial memory across clips, scene-graph or 3D-pose priors during description,
and privacy methods that preserve action-relevant geometry. This positions
zero-shot, language-grounded pipelines as adaptable building blocks for
embodied, real-world video understanding. Our implementation for evaluating
VLMs is publicly available at:
https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
中文标题/摘要
标题:监控视频中视觉大模型的评估
社会上广泛使用摄像头产生了大量的视频数据,远远超过了人类监控的能力。这对公共安全和安全构成了严峻挑战,因为及时检测异常或犯罪事件对于有效响应和预防至关重要。一个具身代理识别意外事件的能力与其空间推理能力密切相关。本文通过将异常动作识别框定为零样本、语言导向的任务,研究了视觉语言模型(VLMs)的空间推理能力,解决了从稀疏2D视频中解释动态3D场景的具身感知挑战。具体而言,我们研究了小型预训练视觉-LLMs是否可以作为空间导向的零样本异常检测器,通过将视频转换为文本描述并使用文本蕴含评分标签。我们在UCF-Crime和RWF-2000上评估了四个开源模型,在提示和隐私保护条件下。少量示例可以提高某些模型的准确性,但可能会增加假阳性率,而隐私过滤器——尤其是全身GAN变换——引入的不一致性会降低准确性。这些结果指出了当前视觉-LLMs在简单、空间显著事件上成功而在嘈杂的空间线索和身份模糊上失败的地方。展望未来,我们概述了无需特定任务训练即可加强空间定位的具体路径:结构感知提示、轻量级跨片段的空间记忆、描述期间的场景图或3D姿态先验以及保留动作相关几何的隐私方法。这将零样本、语言导向的流水线定位为具身、现实世界视频理解的可适应构建块。我们用于评估VLMs的实现已公开发布于:https://github.com/pascalbenschopTU/VLLM_AnomalyRecognition
Summary / 总结
This paper evaluates the spatial reasoning capabilities of vision-language models (VLMs) in recognizing anomalous actions in surveillance videos. It frames the task as a zero-shot, language-grounded challenge, converting videos into text descriptions and using textual entailment to score labels. The study finds that while few-shot exemplars can improve accuracy, they may also increase false positives. Privacy-preserving methods, such as full-body GAN transforms, introduce inconsistencies that degrade accuracy. The research highlights the current strengths and weaknesses of VLMs in spatial reasoning and suggests concrete paths to improve their performance without task-specific training.
本文评估了视觉语言模型(VLMs)在监控视频中识别异常行为时的空间推理能力。它将任务定义为零样本、语言导向的挑战,将视频转换为文本描述,并使用文本蕴含来评分标签。研究发现,虽然少量样本可以提高准确性,但也可能增加误报。隐私保护方法,如全身GAN变换,会引入不一致性,从而降低准确性。研究指出了VLMs在空间推理方面的当前优势和不足,并提出了具体的改进路径,无需特定任务训练。
Finding 3D Scene Analogies with Multimodal Foundation Models
Authors: Junho Kim, Young Min Kim
Venue: RSS 2025
First: 2025-10-27T10:23:31+00:00 · Latest: 2025-10-27T10:23:31+00:00
Comments: Accepted to FM4RoboPlan workshop at RSS 2025
Abstract
Connecting current observations with prior experiences helps robots adapt and
plan in new, unseen 3D environments. Recently, 3D scene analogies have been
proposed to connect two 3D scenes, which are smooth maps that align scene
regions with common spatial relationships. These maps enable detailed transfer
of trajectories or waypoints, potentially supporting demonstration transfer for
imitation learning or task plan transfer across scenes. However, existing
methods for the task require additional training and fixed object vocabularies.
In this work, we propose to use multimodal foundation models for finding 3D
scene analogies in a zero-shot, open-vocabulary setting. Central to our
approach is a hybrid neural representation of scenes that consists of a sparse
graph based on vision-language model features and a feature field derived from
3D shape foundation models. 3D scene analogies are then found in a
coarse-to-fine manner, by first aligning the graph and refining the
correspondence with feature fields. Our method can establish accurate
correspondences between complex scenes, and we showcase applications in
trajectory and waypoint transfer.
中文标题/摘要
标题:使用多模态基础模型在3D场景中寻找类比
将当前观察与先前经验联系起来有助于机器人在新的未见过的3D环境中适应和规划。最近,提出了3D场景类比来连接两个3D场景,它们是平滑的地图,将具有共同空间关系的场景区域对齐。这些地图使轨迹或航点的详细转移成为可能,可能支持模仿学习中的演示转移或场景间任务计划的转移。然而,现有方法需要额外的训练和固定的物体词汇表。在本文中,我们提出使用多模态基础模型在零样本、开放词汇表设置中寻找3D场景类比。我们方法的核心是一种基于视觉-语言模型特征的稀疏图和从3D形状基础模型派生的功能场的混合神经表示。然后,通过首先对齐图并用功能场细化对应关系,以粗到细的方式寻找3D场景类比。我们的方法可以建立复杂场景之间的准确对应关系,并展示了轨迹和航点转移的应用。
Summary / 总结
This work aims to enable robots to adapt and plan in new 3D environments by finding 3D scene analogies using multimodal foundation models. The method employs a hybrid neural representation combining a sparse graph from vision-language model features and a feature field from 3D shape foundation models. It aligns scenes in a coarse-to-fine manner, achieving accurate correspondences between complex scenes and supporting applications like trajectory and waypoint transfer.
该研究旨在通过使用多模态基础模型找到3D场景类比,使机器人能够适应和规划新的3D环境。方法采用结合视觉-语言模型特征的稀疏图和3D形状基础模型特征字段的混合神经表示。它以粗到细的方式对齐场景,并能准确地在复杂场景之间转移轨迹和航点,支持模仿学习或场景间任务计划转移的应用。
Attention! Your Vision Language Model Could Be Maliciously Manipulated
Authors: Xiaosen Wang, Shaokang Wang, Zhijin Ge, Yuyang Luo, Shudong Zhang
Venue: NeurIPS 2025
First: 2025-05-26T12:38:58+00:00 · Latest: 2025-10-27T09:54:32+00:00
Comments: NeurIPS 2025
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable success in
understanding complex real-world scenarios and supporting data-driven
decision-making processes. However, VLMs exhibit significant vulnerability
against adversarial examples, either text or image, which can lead to various
adversarial outcomes, e.g., jailbreaking, hijacking, and hallucination. In
this work, we empirically and theoretically demonstrate that VLMs are
particularly susceptible to image-based adversarial examples, where
imperceptible perturbations can precisely manipulate each output token. To this
end, we propose a novel attack called Vision-language model Manipulation Attack
(VMA), which integrates first-order and second-order momentum optimization
techniques with a differentiable transformation mechanism to effectively
optimize the adversarial perturbation. Notably, VMA can be a double-edged
sword: it can be leveraged to implement various attacks, such as jailbreaking,
hijacking, privacy breaches, Denial-of-Service, and the generation of sponge
examples, while simultaneously enabling the injection of watermarks for
copyright protection. Extensive empirical evaluations substantiate the efficacy
and generalizability of VMA across diverse scenarios and datasets. Code is
available at https://github.com/Trustworthy-AI-Group/VMA.
中文标题/摘要
标题:注意!您的视觉语言模型可能被恶意操控
大型视觉-语言模型(VLMs)在理解和处理复杂现实场景以及支持数据驱动决策方面取得了显著成功。然而,VLMs 对对抗样本(无论是文本还是图像)表现出显著的脆弱性,这可能导致各种对抗结果,例如脱狱、劫持和幻觉等。在本文中,我们通过实证和理论证明,VLMs 特别容易受到基于图像的对抗样本的影响,其中不可感知的扰动可以精确地操纵每个输出标记。为此,我们提出了一种名为视觉语言模型操控攻击(VMA)的新攻击方法,该方法结合了一阶和二阶动量优化技术以及可微变换机制,以有效优化对抗扰动。值得注意的是,VMA 可以是一把双刃剑:它可以被用来实施各种攻击,如脱狱、劫持、隐私泄露、服务拒绝和海绵样本生成等,同时还可以用于版权保护中的水印注入。广泛的实证评估证实了 VMA 在不同场景和数据集中的有效性和普适性。代码可在 https://github.com/Trustworthy-AI-Group/VMA 获取。
Summary / 总结
This paper addresses the vulnerability of Vision-Language Models (VLMs) to adversarial examples, particularly imperceptible image-based perturbations. It introduces an attack called Vision-language model Manipulation Attack (VMA), which combines first-order and second-order momentum optimization with a differentiable transformation mechanism to optimize the adversarial perturbation. The study demonstrates that VMA can implement various attacks, such as jailbreaking, hijacking, and privacy breaches, and can also inject watermarks for copyright protection. Extensive evaluations confirm VMA's efficacy and generalizability across diverse scenarios and datasets.
该研究探讨了视觉语言模型(VLMs)对基于图像的对抗攻击的脆弱性。作者提出了一种名为VMA的新攻击方法,利用一阶和二阶动量优化技术对VLM输出进行不可感知的操纵。VMA可以用于多种恶意目的,如越狱和隐私泄露,同时也可用于通过水印进行版权保护。广泛的实验验证了VMA的有效性和在不同场景和数据集上的普适性。
Can Less Precise Be More Reliable? A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy
Authors: Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Fabio Arnez, Chokri Mraidha
First: 2025-09-25T13:54:34+00:00 · Latest: 2025-10-27T09:18:44+00:00
Comments: Preprint, under peer review
Abstract
The powerful zero-shot generalization capabilities of vision-language models
(VLMs) like CLIP have enabled new paradigms for safety-related tasks such as
out-of-distribution (OOD) detection. However, additional aspects crucial for
the computationally efficient and reliable deployment of CLIP are still
overlooked. In particular, the impact of quantization on CLIP's performance
beyond accuracy remains underexplored. This work presents a large-scale
evaluation of quantization on CLIP models, assessing not only in-distribution
accuracy but a comprehensive suite of reliability metrics and revealing
counterintuitive results driven by pre-training source. We demonstrate that
quantization consistently improves calibration for typically underconfident
pre-trained models, while often degrading it for overconfident variants.
Intriguingly, this degradation in calibration does not preclude gains in other
reliability metrics; we find that OOD detection can still improve for these
same poorly calibrated models. Furthermore, we identify specific
quantization-aware training (QAT) methods that yield simultaneous gains in
zero-shot accuracy, calibration, and OOD robustness, challenging the view of a
strict efficiency-performance trade-off. These findings offer critical insights
for navigating the multi-objective problem of deploying efficient, reliable,
and robust VLMs by utilizing quantization beyond its conventional role.
中文标题/摘要
标题:少精确度是否更可靠?CLIP 超越准确性的量化影响的系统评估
视觉语言模型(VLMs)如CLIP的强大零样本泛化能力已为安全相关任务,如离分布(OOD)检测,开辟了新的范式。然而,对于CLIP高效且可靠的部署至关重要的其他方面仍然被忽视。特别是,量化对CLIP性能的影响,超越准确性,仍被广泛忽视。本研究对量化对CLIP模型的影响进行了大规模评估,不仅评估了分布内准确性,还评估了全面的可靠性指标,揭示了由预训练来源驱动的反直觉结果。我们证明量化一致地提高了通常欠自信的预训练模型的校准,同时经常降低过度自信变体的校准。有趣的是,这种校准的降低并不妨碍其他可靠性指标的提升;我们发现,这些校准不佳的模型的离分布检测仍然可以改善。此外,我们确定了特定的量化感知训练(QAT)方法,这些方法同时提高了零样本准确性、校准和离分布鲁棒性,挑战了效率与性能之间严格权衡的观点。这些发现为利用量化超越其传统角色部署高效、可靠和鲁棒的VLMs提供了关键见解。
Summary / 总结
This study evaluates the impact of quantization on CLIP models beyond accuracy, assessing a comprehensive suite of reliability metrics in addition to in-distribution accuracy. It finds that quantization consistently improves calibration for typically underconfident pre-trained models but often degrades it for overconfident variants, although OOD detection can still improve for these poorly calibrated models. Specific quantization-aware training (QAT) methods are identified that simultaneously improve zero-shot accuracy, calibration, and OOD robustness, challenging the view of a strict efficiency-performance trade-off.
这项研究评估了量化对CLIP模型的影响,不仅限于准确性,还关注可靠性指标。研究发现,量化可以提高欠自信模型的校准,但会降低过自信模型的校准,然而仍能提升OOD检测。此外,研究还识别出特定的量化感知训练方法,能够同时提升零样本准确性、校准和OOD鲁棒性,挑战了效率与性能之间的权衡。这为部署高效、可靠和鲁棒的VLM提供了关键见解。
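To make the calibration claim in the entry above concrete, here is a minimal sketch of one common calibration metric, the expected calibration error (ECE), together with a signed confidence gap that distinguishes overconfident from underconfident predictions. The abstract does not name the specific metrics the authors use, so the choice of ECE, the equal-width binning, and the toy inputs are all assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(accuracy - avg_conf)
    return ece


def confidence_gap(confidences, correct):
    """Mean confidence minus accuracy: positive means overconfident, negative underconfident."""
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return sum(confidences) / len(confidences) - accuracy


# Hypothetical predictions: both models are right 3 times out of 4.
labels_ok = [True, True, True, False]
overconfident = [0.95, 0.95, 0.95, 0.95]
underconfident = [0.55, 0.55, 0.55, 0.55]

ece_over = expected_calibration_error(overconfident, labels_ok)
ece_under = expected_calibration_error(underconfident, labels_ok)
gap_over = confidence_gap(overconfident, labels_ok)
gap_under = confidence_gap(underconfident, labels_ok)
print(ece_over, ece_under, gap_over, gap_under)
```

Both toy models have the same ECE here, which is why a signed quantity such as `confidence_gap` is needed to tell which direction a model is miscalibrated in before asking how quantization shifts it.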
Revisiting Multimodal Positional Encoding in Vision-Language Models
Authors: Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
First: 2025-10-27T08:00:46+00:00 · Latest: 2025-10-27T08:00:46+00:00
Comments: 16 pages
Abstract
Multimodal position encoding is essential for vision-language models, yet
it has received little systematic investigation. We conduct a comprehensive
analysis of multimodal Rotary Positional
Embedding (RoPE) by examining its two core components: position design and
frequency allocation. Through extensive experiments, we identify three key
guidelines: positional coherence, full frequency utilization, and preservation
of textual priors, ensuring unambiguous layout, rich representation, and
faithful transfer from the pre-trained LLM. Based on these insights, we propose
Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and
plug-and-play variants that require no architectural changes. Our methods
consistently outperform existing approaches across diverse benchmarks, with
significant improvements in both general and fine-grained multimodal
understanding. Code will be available at
https://github.com/JJJYmmm/Multimodal-RoPEs.
中文标题/摘要
标题:重新审视视觉-语言模型中的多模态位置编码
多模态位置编码对于视觉-语言模型至关重要,但对其系统的调查研究却很少。我们对多模态旋转位置嵌入(RoPE)进行了全面分析,考察了其两个核心组成部分:位置设计和频率分配。通过大量实验,我们确定了三个关键指导原则:位置一致性、频率充分利用和文本先验的保留,以确保布局明确、表示丰富和从预训练的大语言模型中忠实转移。基于这些见解,我们提出了多头RoPE(MHRoPE)和MRoPE-交错(MRoPE-I)两种简单且即插即用的变体,无需进行架构更改。我们的方法在各种基准测试中始终优于现有方法,显著提高了通用和细粒度多模态理解。代码将在https://github.com/JJJYmmm/Multimodal-RoPEs上提供。
Summary / 总结
The paper revisits multimodal position encoding in vision-language models, focusing on multimodal Rotary Positional Embedding (RoPE). By analyzing its two core components, position design and frequency allocation, the authors identify three guidelines: positional coherence, full frequency utilization, and preservation of textual priors. Based on these, they propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two plug-and-play variants that require no architectural changes. These methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding.
论文重新审视了视觉-语言模型中的多模态位置编码,集中在旋转位置嵌入(RoPE)上。通过对RoPE组件的分析,作者提出了多头RoPE(MHRoPE)和MRoPE-交错(MRoPE-I),这些方法增强了位置一致性、频率利用和文本先验。这些方法在各种基准测试中提高了多模态理解能力,展示了在一般和细粒度任务中的显著性能提升,且无需改变架构。
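For readers unfamiliar with the mechanism analyzed in the entry above, the following is a minimal sketch of standard one-dimensional RoPE, showing the per-pair frequencies and the relative-position property of the resulting dot product. It does not implement MHRoPE or MRoPE-I: how those variants design positions and allocate frequencies across modalities is not described in the abstract, so only the textbook single-axis formulation is shown, with the usual base of 10000 assumed.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """One rotation frequency per consecutive pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(vec), base)):
        a, b = vec[2 * i], vec[2 * i + 1]
        angle = position * theta
        out.append(a * math.cos(angle) - b * math.sin(angle))
        out.append(a * math.sin(angle) + b * math.cos(angle))
    return out


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# Hypothetical query/key vectors with head_dim = 4.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset (here 2), not on absolute positions.
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 12), apply_rope(k, 10))
print(score_a, score_b)
```

Frequency allocation, in this picture, is the question of which entries of `rope_frequencies` are driven by which position axis when a token carries more than one coordinate (for example time, height, and width for visual tokens).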
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
Authors: Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound
First: 2025-02-28T01:09:18+00:00 · Latest: 2025-10-27T07:31:06+00:00
Comments: Accepted by CVPR2025
Abstract
Zero-shot object counting aims to count instances of arbitrary object
categories specified by text descriptions. Existing methods typically rely on
vision-language models like CLIP, but often exhibit limited sensitivity to text
prompts. We present T2ICount, a diffusion-based framework that leverages rich
prior knowledge and fine-grained visual understanding from pretrained diffusion
models. While one-step denoising ensures efficiency, it leads to weakened text
sensitivity. To address this challenge, we propose a Hierarchical Semantic
Correction Module that progressively refines text-image feature alignment, and
a Representational Regional Coherence Loss that provides reliable supervision
signals by leveraging the cross-attention maps extracted from the denoising
U-Net. Furthermore, we observe that current benchmarks mainly focus on majority
objects in images, potentially masking models' text sensitivity. To address
this, we contribute a challenging re-annotated subset of FSC147 for better
evaluation of text-guided counting ability. Extensive experiments demonstrate
that our method achieves superior performance across different benchmarks. Code
is available at https://github.com/cha15yq/T2ICount.
中文标题/摘要
标题:T2ICount:增强零样本计数的跨模态理解
零样本对象计数旨在根据文本描述统计任意对象类别的实例数量。现有方法通常依赖于像CLIP这样的视觉语言模型,但往往对文本提示的敏感性有限。我们提出了T2ICount,这是一种基于扩散的框架,利用预训练扩散模型丰富的先验知识和精细的视觉理解。虽然一步去噪确保了效率,但会导致文本敏感性减弱。为了解决这一挑战,我们提出了一种分层语义校正模块,逐步细化文本-图像特征对齐,并提出了一种表示区域一致性损失,通过利用从去噪U-Net中提取的跨注意力图提供可靠的监督信号。此外,我们观察到当前基准主要关注图像中的主要对象,可能掩盖了模型的文本敏感性。为了解决这一问题,我们贡献了一个具有挑战性的重新注释的FSC147子集,以更好地评估文本引导的计数能力。广泛的实验表明,我们的方法在不同基准上取得了优越的性能。代码可在https://github.com/cha15yq/T2ICount获取。
Summary / 总结
T2ICount aims to enhance zero-shot object counting by addressing the limited text sensitivity of existing methods, which typically rely on vision-language models like CLIP. It is a diffusion-based framework that uses a Hierarchical Semantic Correction Module to progressively refine text-image feature alignment and a Representational Regional Coherence Loss to provide reliable supervision from the cross-attention maps of the denoising U-Net. Experiments show that T2ICount achieves superior performance across different benchmarks. The authors also contribute a re-annotated subset of FSC147 to better evaluate text-guided counting ability.
T2ICount 是一种基于扩散的框架,旨在增强零样本对象计数中的跨模态理解。该方法利用预训练的扩散模型来整合丰富的先验知识和精细的视觉理解。该方法包括层次语义校正模块和表示区域一致性损失,以提高文本敏感性。实验表明,T2ICount 在各种基准测试中表现出色。还贡献了一个重新注释的 FSC147 子集,以更好地评估文本引导的计数能力。
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
First: 2025-06-05T07:26:34+00:00 · Latest: 2025-10-27T07:27:25+00:00
Comments: Project page: https://youngwanlee.github.io/holisafe
Abstract
Despite emerging efforts to enhance the safety of Vision-Language Models
(VLMs), current approaches face two main shortcomings. 1) Existing
safety-tuning datasets and benchmarks only partially consider how image-text
interactions can yield harmful content, often overlooking contextually unsafe
outcomes from seemingly benign pairs. This narrow coverage leaves VLMs
vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely
primarily on data-centric tuning, with limited architectural innovations to
intrinsically strengthen safety. We address these gaps by introducing a
holistic safety dataset and benchmark, HoliSafe, that spans all five
safe/unsafe image-text combinations, providing a more robust basis for both
training and evaluation (HoliSafe-Bench). We further propose a novel modular
framework for enhancing VLM safety with a visual guard module (VGM) designed to
assess the harmfulness of input images for VLMs. This module endows VLMs with a
dual functionality: they not only learn to generate safer responses but can
also provide an interpretable harmfulness classification to justify their
refusal decisions. A significant advantage of this approach is its modularity;
the VGM is designed as a plug-in component, allowing for seamless integration
with diverse pre-trained VLMs across various scales. Experiments show that
Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety
performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench
itself reveals critical vulnerabilities in existing VLM models. We hope that
HoliSafe and VGM will spur further research into robust and interpretable VLM
safety, expanding future avenues for multimodal alignment.
中文标题/摘要
标题:HoliSafe:视觉语言模型的全面安全基准和建模
尽管已经出现了增强视觉语言模型(VLMs)安全性的努力,但当前的方法存在两个主要不足。1)现有的安全调优数据集和基准仅部分考虑了图像-文本交互可能导致有害内容的问题,经常忽视看似无害的配对所引发的上下文不安全结果。这种狭窄的覆盖范围使VLMs在未见配置中容易受到脱狱攻击。2)先前的方法主要依赖于数据驱动的调优,缺乏对内在增强安全性的架构创新。我们通过引入一个全面的安全数据集和基准——HoliSafe,跨越所有五种安全/不安全的图像-文本组合,为训练和评估提供了更坚实的基础(HoliSafe-Bench)。我们还提出了一种新的模块化框架,通过视觉守护模块(VGM)增强VLM的安全性,该模块旨在评估输入图像对VLM的有害性。该模块赋予VLMs双重功能:它们不仅学习生成更安全的响应,还可以提供可解释的有害性分类,以证明其拒绝决策的合理性。这种方法的一个显著优势是其模块化;VGM被设计为插件组件,可以无缝集成到各种规模的预训练VLMs中。实验表明,使用VGM训练的Safe-VLM在多个VLM基准上实现了最先进的安全性能。此外,HoliSafe-Bench本身揭示了现有VLM模型中的关键漏洞。我们希望HoliSafe和VGM能够激发更多关于稳健和可解释的VLM安全性的研究,扩展未来多模态对齐的途径。
Summary / 总结
The research aims to enhance the safety of Vision-Language Models (VLMs) by addressing the limitations of existing safety-tuning datasets and methods. The authors introduce HoliSafe, a holistic safety dataset and benchmark that spans all five safe/unsafe image-text combinations, and a modular framework with a visual guard module (VGM) that assesses the harmfulness of input images. Experiments show that Safe-VLM with VGM, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks, while HoliSafe-Bench reveals critical vulnerabilities in existing VLMs. The authors hope this work will spur further research into robust and interpretable VLM safety.
研究旨在通过解决现有安全调优数据集和方法的局限性,提升视觉语言模型(VLM)的安全性。作者引入了HoliSafe,这是一个涵盖所有安全/不安全图像-文本组合的综合安全数据集和基准,并提出了一种模块化框架,其中包含一个视觉防护模块(VGM),用于评估输入图像的有害性。实验表明,使用VGM增强的VLM在多个VLM基准测试中实现了最先进的安全性性能,并揭示了现有模型中的关键漏洞。这项工作旨在推动对稳健和可解释的VLM安全性的进一步研究。
3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks
Authors: Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, Zuozhu Liu
Venue: NeurIPS 2025
First: 2025-06-11T09:55:42+00:00 · Latest: 2025-10-27T07:14:11+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Medical Visual Question Answering (Med-VQA) holds significant potential for
clinical decision support, yet existing efforts primarily focus on 2D imaging
with limited task diversity. This paper presents 3D-RAD, a large-scale dataset
designed to advance 3D Med-VQA using radiology CT scans. The 3D-RAD dataset
encompasses six diverse VQA tasks: anomaly detection, image observation,
medical computation, existence detection, static temporal diagnosis, and
longitudinal temporal diagnosis. It supports both open- and closed-ended
questions while introducing complex reasoning challenges, including
computational tasks and multi-stage temporal analysis, to enable comprehensive
benchmarking. Extensive evaluations demonstrate that existing vision-language
models (VLMs), especially medical VLMs, exhibit limited generalization,
particularly in multi-temporal tasks, underscoring the challenges of real-world
3D diagnostic reasoning. To drive future advancements, we release a
high-quality training set 3D-RAD-T of 136,195 expert-aligned samples, showing
that fine-tuning on this dataset could significantly enhance model performance.
Our dataset and code, aiming to catalyze multimodal medical AI research and
establish a robust foundation for 3D medical visual understanding, are publicly
available at https://github.com/Tang-xiaoxiao/3D-RAD.
中文标题/摘要
标题:3D-RAD:一种综合性的3D放射学Med-VQA数据集,包含多时相分析和多样化的诊断任务
医学视觉问答(Med-VQA)在临床决策支持方面具有巨大潜力,但现有努力主要集中在2D成像且任务多样性有限。本文介绍了3D-RAD,这是一个大型数据集,旨在使用放射学CT扫描推进3D Med-VQA。3D-RAD数据集涵盖了六种不同的VQA任务:异常检测、图像观察、医学计算、存在检测、静态时相诊断和纵向时相诊断。它支持开放式和封闭式问题,引入了复杂的推理挑战,包括计算任务和多阶段时相分析,以实现全面的基准测试。广泛评估表明,现有的视觉-语言模型(VLMs),尤其是医学VLMs在多时相任务中的泛化能力有限,突显了现实世界3D诊断推理的挑战。为了推动未来的发展,我们发布了高质量的训练集3D-RAD-T,包含136,195个专家对齐样本,表明在该数据集上进行微调可以显著提高模型性能。我们的数据集和代码旨在促进多模态医学AI研究,并为3D医学视觉理解建立坚实的基础,已公开发布于https://github.com/Tang-xiaoxiao/3D-RAD。
Summary / 总结
3D-RAD is a large-scale dataset for 3D Med-VQA built on radiology CT scans, covering six diverse VQA tasks and supporting both open- and closed-ended questions. It introduces complex reasoning challenges such as computational tasks and multi-stage temporal analysis. Evaluations show that existing vision-language models, especially medical ones, exhibit limited generalization, particularly in multi-temporal tasks, underscoring the challenges of real-world 3D diagnostic reasoning. The authors also release a training set, 3D-RAD-T, of 136,195 expert-aligned samples and show that fine-tuning on it could significantly enhance model performance.
3D-RAD 是一个大规模的 3D 医学影像 Med-VQA 数据集,通过包含异常检测、医学计算和多时态分析等多种任务,解决了现有 2D 影像数据集的局限性。该数据集支持开放和封闭问题,并表明当前的视觉-语言模型,尤其是医学模型,在多时态任务上表现不佳。通过在包含 136,195 个高质量样本的 3D-RAD-T 训练集上进行微调,可以显著提高模型性能,突显了对 3D 诊断推理更好模型的需求。
M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
Authors: Huixuan Zhang, Xiaojun Wan
First: 2025-10-27T05:32:50+00:00 · Latest: 2025-10-27T05:32:50+00:00
Abstract
Text-to-image models are known to struggle with generating images that
perfectly align with textual prompts. Several previous studies have focused on
evaluating image-text alignment in text-to-image generation. However, these
evaluations either address overly simple scenarios, especially overlooking the
difficulty of prompts with multiple different instances belonging to the same
category, or they introduce metrics that do not correlate well with human
evaluation. In this study, we introduce M$^3$T2IBench, a large-scale,
multi-category, multi-instance, multi-relation benchmark, along with an
object-detection-based evaluation metric, $AlignScore$, which aligns closely
with human evaluation. Our findings reveal that current open-source
text-to-image models perform poorly on this challenging benchmark.
Additionally, we propose the Revise-Then-Enforce approach to enhance image-text
alignment. This training-free post-editing method demonstrates improvements in
image-text alignment across a broad range of diffusion models. \footnote{Our
code and data have been released in supplementary material and will be made
publicly available after the paper is accepted.}
中文标题/摘要
标题:M$^{3}$T2IBench:大规模多类别、多实例、多关系文本到图像基准
文本到图像模型在生成与文本提示完美对齐的图像方面存在困难。先前的一些研究主要集中在评估文本到图像生成中的图像-文本对齐。然而,这些评估要么仅涉及过于简单的场景,尤其是忽略了同一类别中多个不同实例的提示难度,要么引入的度量标准与人类评估的相关性较差。在本研究中,我们引入了M$^3$T2IBench,这是一个大规模、多类别、多实例、多关系的基准,并且引入了基于对象检测的评估指标$AlignScore$,该指标与人类评估高度一致。我们的研究发现,当前的开源文本到图像模型在这一具有挑战性的基准上表现不佳。此外,我们提出了Revise-Then-Enforce方法以增强图像-文本对齐。这种无需训练的后编辑方法在一系列扩散模型中展示了图像-文本对齐的改进。\footnote{我们的代码和数据已在补充材料中发布,并将在论文被接受后公开。}
Summary / 总结
The study introduces M$^3$T2IBench, a benchmark for evaluating text-to-image models that addresses the challenge of generating images that align well with complex textual prompts. It includes multi-category, multi-instance, and multi-relation scenarios and introduces an $AlignScore$ metric that correlates with human evaluation. The findings show that current text-to-image models struggle with this benchmark, and the Revise-Then-Enforce approach, a training-free post-editing method, improves image-text alignment across various diffusion models.
研究引入了M$^3$T2IBench基准,用于评估文本到图像模型,特别针对生成与复杂文本提示高度一致的图像的挑战。该基准包含多类别、多实例和多关系场景,并引入了与人类评价高度相关的$AlignScore$评估指标。研究发现,当前的文本到图像模型在这一基准上表现不佳,而一种无需训练的后编辑方法——Revise-Then-Enforce——能够提升各种扩散模型中的图像与文本的一致性。
Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System
Authors: Haokun Liu, Zhaoqi Ma, Yunong Li, Junichiro Sugihara, Yicheng Chen, Jinjie Li, Moju Zhao
Venue: Advanced Intelligent Systems, Oct. 2025
First: 2025-06-05T13:27:41+00:00 · Latest: 2025-10-27T04:26:01+00:00
Comments: 18 pages, 10 figures
Abstract
Heterogeneous multirobot systems show great potential in complex tasks
requiring coordinated hybrid cooperation. However, existing methods that rely
on static or task-specific models often lack generalizability across diverse
tasks and dynamic environments. This highlights the need for generalizable
intelligence that can bridge high-level reasoning with low-level execution
across heterogeneous agents. To address this, we propose a hierarchical
multimodal framework that integrates a prompted large language model (LLM) with
a fine-tuned vision-language model (VLM). At the system level, the LLM performs
hierarchical task decomposition and constructs a global semantic map, while the
VLM provides semantic perception and object localization, where the proposed
GridMask significantly enhances the VLM's spatial accuracy for reliable
fine-grained manipulation. The aerial robot leverages this global map to
generate semantic paths and guide the ground robot's local navigation and
manipulation, ensuring robust coordination even in target-absent or ambiguous
scenarios. We validate the framework through extensive simulation and
real-world experiments on long-horizon object arrangement tasks, demonstrating
zero-shot adaptability, robust semantic navigation, and reliable manipulation
in dynamic environments. To the best of our knowledge, this work presents the
first heterogeneous aerial-ground robotic system that integrates VLM-based
perception with LLM-driven reasoning for global high-level task planning and
execution.
中文标题/摘要
标题:层次语言模型在空中-地面机器人系统中用于语义导航和操作
异构多机器人系统在需要协调混合合作的复杂任务中显示出巨大的潜力。然而,现有的依赖于静态或任务特定模型的方法往往缺乏在多样任务和动态环境中的普适性。这突显了需要一种能够将高层推理与异构代理的低层执行相结合的一般智能。为了解决这一问题,我们提出了一种层次多模态框架,该框架将提示的大语言模型(LLM)与微调的视觉-语言模型(VLM)结合起来。在系统层面,LLM执行层次任务分解并构建全局语义地图,而VLM提供语义感知和物体定位,其中提出的GridMask显著提高了VLM的空间准确性,以实现可靠的细粒度操作。空中机器人利用这一全局地图生成语义路径,并指导地面机器人的局部导航和操作,即使在目标缺失或模糊的场景中也能确保稳健的协调。我们通过广泛的仿真和现实世界实验验证了该框架,该实验在长期目标排列任务中展示了零样本适应性、稳健的语义导航和可靠的动态环境操作。据我们所知,这项工作首次将基于VLM的感知与由LLM驱动的推理结合,用于全局高层任务规划和执行。
Summary / 总结
The research aims to enhance the generalizability of multirobot systems for complex tasks through a hierarchical multimodal framework. This framework integrates a large language model for high-level task decomposition and a vision-language model for semantic perception and localization. The system constructs a global semantic map and uses it to guide the aerial robot in generating semantic paths and the ground robot in local navigation and manipulation. Experiments show zero-shot adaptability, robust semantic navigation, and reliable manipulation in dynamic environments, marking the first integration of VLM-based perception with LLM-driven reasoning for heterogeneous aerial-ground robotic systems.
研究旨在增强异构多机器人系统在复杂任务中的通用性。提出了一种分层多模态框架,结合了提示的大语言模型和微调的视觉语言模型。LLM进行任务分解并构建全局语义地图,而VLM提供语义感知和物体定位,其中GridMask提高了空间准确性。该框架通过广泛的仿真和实际实验,在动态环境中展示了零样本适应性和稳健的性能,用于对象排列任务。
SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency
Authors: Quanjian Song, Donghao Zhou, Jingyu Lin, Fei Shen, Jiaze Wang, Xiaowei Hu, Cunjian Chen, Pheng-Ann Heng
Venue: NeurIPS 2025
First: 2025-10-27T04:19:22+00:00 · Latest: 2025-10-27T04:19:22+00:00
Comments: Accepted by NeurIPS 2025; Project Page:
https://lulupig12138.github.io/SceneDecorator
Abstract
Recent text-to-image models have revolutionized image generation, but they
still struggle with maintaining concept consistency across generated images.
While existing works focus on character consistency, they often overlook the
crucial role of scenes in storytelling, which restricts their creativity in
practice. This paper introduces scene-oriented story generation, addressing two
key challenges: (i) scene planning, where current methods fail to ensure
scene-level narrative coherence by relying solely on text descriptions, and
(ii) scene consistency, which remains largely unexplored in terms of
maintaining scene consistency across multiple stories. We propose
SceneDecorator, a training-free framework that employs VLM-Guided Scene
Planning to ensure narrative coherence across different scenes in a
``global-to-local'' manner, and Long-Term Scene-Sharing Attention to maintain
long-term scene consistency and subject diversity across generated stories.
Extensive experiments demonstrate the superior performance of SceneDecorator,
highlighting its potential to unleash creativity in the fields of arts, films,
and games.
中文标题/摘要
标题:SceneDecorator:面向场景导向的故事生成与场景规划及一致性
近期的文本到图像模型已经革新了图像生成,但它们仍然难以在生成的图像中保持概念一致性。虽然现有工作主要关注角色一致性,但往往忽视了场景在叙事中的关键作用,这在实践中限制了其创造力。本文提出了面向场景导向的故事生成,解决两个关键挑战:(i)场景规划,当前方法仅依赖文本描述无法确保场景层面的叙事连贯性;(ii)场景一致性,这一方面在多故事场景一致性保持方面尚未得到充分探索。我们提出了SceneDecorator,这是一种无需训练的框架,通过VLM引导的场景规划确保不同场景间的叙事连贯性,并通过长期场景共享注意力保持生成故事中的长期场景一致性和主题多样性。大量实验表明,SceneDecorator在性能上优于现有方法,突显了其在艺术、电影和游戏领域的潜在创造力。
Summary / 总结
The paper aims to enhance scene-oriented story generation by addressing scene planning and scene consistency. It introduces SceneDecorator, a framework that uses VLM-Guided Scene Planning for narrative coherence and Long-Term Scene-Sharing Attention for maintaining scene consistency and subject diversity. Experiments show that SceneDecorator outperforms existing methods in ensuring narrative and scene consistency, making it a promising tool for creative arts, films, and games.
本文针对现有文本到图像模型在保持场景一致性和叙事连贯性方面的局限性,提出了SceneDecorator框架,该框架利用VLM引导的场景规划和长期场景共享注意力机制,确保不同场景的叙事连贯性和长期场景一致性。实验表明,SceneDecorator在艺术、电影和游戏等领域具有优越的表现和潜力。
VoMP: Predicting Volumetric Mechanical Property Fields
Authors: Rishit Dagli, Donglai Xiang, Vismay Modi, Charles Loop, Clement Fuji Tsang, Anka He Chen, Anita Hu, Gavriel State, David I. W. Levin, Maria Shugrina
First: 2025-10-27T03:56:25+00:00 · Latest: 2025-10-27T03:56:25+00:00
Comments: hi-res paper and other details at:
https://research.nvidia.com/labs/sil/projects/vomp
Abstract
Physical simulation relies on spatially-varying mechanical properties, often
laboriously hand-crafted. VoMP is a feed-forward method trained to predict
Young's modulus ($E$), Poisson's ratio ($\nu$), and density ($\rho$) throughout
the volume of 3D objects, in any representation that can be rendered and
voxelized. VoMP aggregates per-voxel multi-view features and passes them to our
trained Geometry Transformer to predict per-voxel material latent codes. These
latents reside on a manifold of physically plausible materials, which we learn
from a real-world dataset, guaranteeing the validity of decoded per-voxel
materials. To obtain object-level training data, we propose an annotation
pipeline combining knowledge from segmented 3D datasets, material databases,
and a vision-language model, along with a new benchmark. Experiments show that
VoMP estimates accurate volumetric properties, far outperforming prior art in
accuracy and speed.
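
Many elasticity solvers consume material parameters in Lamé form rather than as ($E$, $\nu$). As a minimal sketch of how a predicted per-voxel field might be handed to such a solver, the snippet below applies the standard isotropic linear-elasticity conversion. The `VoxelMaterial` container, the function names, and the sample values are illustrative assumptions and are not taken from VoMP.

```python
from dataclasses import dataclass


@dataclass
class VoxelMaterial:
    """Hypothetical per-voxel record; not VoMP's actual data structure."""
    youngs_modulus: float  # E, in Pa
    poisson_ratio: float   # nu, dimensionless
    density: float         # rho, in kg/m^3


def lame_parameters(E: float, nu: float) -> tuple[float, float]:
    """Standard isotropic conversion from (E, nu) to Lame's (lambda, mu)."""
    if E <= 0.0:
        raise ValueError("Young's modulus must be positive")
    if not -1.0 < nu < 0.5:
        raise ValueError("Poisson's ratio must lie in (-1, 0.5)")
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    mu = E / (2.0 * (1.0 + nu))
    return lam, mu


def voxel_mass(material: VoxelMaterial, voxel_size: float) -> float:
    """Mass of one cubic voxel with edge length voxel_size (metres)."""
    return material.density * voxel_size ** 3


# Illustrative values only (roughly rubber-like), not taken from the paper.
sample = VoxelMaterial(youngs_modulus=1.0e6, poisson_ratio=0.45, density=1100.0)
lam, mu = lame_parameters(sample.youngs_modulus, sample.poisson_ratio)
```

The range check on $\nu$ mirrors the usual physical-plausibility constraint for isotropic materials: the conversion diverges as $\nu$ approaches 0.5, which is one reason a method would want its decoded materials to stay within a valid region.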
中文标题/摘要
标题:VoMP:预测体积机械属性场
物理模拟依赖于空间变化的机械属性,通常需要手工制作。VoMP 是一种前馈方法,用于训练预测 3D 对象体积中各处的杨氏模量($E$)、泊松比($\nu$)和密度($\rho$),适用于任何可以渲染和体素化的表示。VoMP 汇聚了体素级别的多视图特征,并将其传递给我们的训练几何变换器以预测体素级别的材料潜在代码。这些潜在变量位于我们从真实世界数据集中学习到的物理上合理的材料流形上,确保解码的体素级别材料的有效性。为了获得对象级别的训练数据,我们提出了一种结合分割的 3D 数据集知识、材料数据库和视觉语言模型的新注释管道,以及一个新的基准。实验表明,VoMP 估计了准确的体积属性,其准确性和速度远超先前方法。
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Authors: Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi
Venue: NeurIPS 2025
First: 2025-05-23T16:44:21+00:00 · Latest: 2025-10-27T03:45:34+00:00
Comments: Accepted at NeurIPS 2025 Datasets and Benchmarks Track
Abstract
Recent progress in Large Vision-Language Models (LVLMs) has enabled promising
applications in medical tasks, such as report generation and visual question
answering. However, existing benchmarks focus mainly on the final diagnostic
answer, offering limited insight into whether models engage in clinically
meaningful reasoning. To address this, we present CheXStruct and CXReasonBench,
a structured pipeline and benchmark built on the publicly available
MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of
intermediate reasoning steps directly from chest X-rays, such as segmenting
anatomical regions, deriving anatomical landmarks and diagnostic measurements,
computing diagnostic indices, and applying clinical thresholds. CXReasonBench
leverages this pipeline to evaluate whether models can perform clinically valid
reasoning steps and to what extent they can learn from structured guidance,
enabling fine-grained and transparent assessment of diagnostic reasoning. The
benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases,
each paired with up to 4 visual inputs, and supports multi-path, multi-stage
evaluation including visual grounding via anatomical region selection and
diagnostic measurements. Even the strongest of 12 evaluated LVLMs struggle with
structured reasoning and generalization, often failing to link abstract
knowledge with anatomically grounded visual interpretation. The code is
available at https://github.com/ttumyche/CXReasonBench
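
The measure, index, threshold chain described above can be illustrated with the cardiothoracic ratio, a textbook chest X-ray index. This is a sketch under stated assumptions: the abstract does not list the specific indices or thresholds the benchmark uses, the 0.5 cutoff is only a common convention for PA views, and the function names and landmark coordinates are hypothetical.

```python
def horizontal_extent(xs):
    """Width spanned by a set of landmark x-coordinates (pixels)."""
    return max(xs) - min(xs)


def cardiothoracic_ratio(heart_xs, thorax_xs):
    """Diagnostic index: maximal cardiac width / maximal thoracic width."""
    thorax_width = horizontal_extent(thorax_xs)
    if thorax_width <= 0:
        raise ValueError("thoracic width must be positive")
    return horizontal_extent(heart_xs) / thorax_width


def exceeds_threshold(ctr, threshold=0.5):
    """Apply a clinical threshold; 0.5 is an assumed, conventional cutoff."""
    return ctr > threshold


# Hypothetical landmark coordinates standing in for segmentation output.
heart_xs = [410, 455, 620, 700]
thorax_xs = [180, 240, 760, 820]
ctr = cardiothoracic_ratio(heart_xs, thorax_xs)
enlarged = exceeds_threshold(ctr)
```

Splitting the computation into measurement, index, and threshold functions keeps each intermediate value inspectable, which is the kind of step-level check a structured reasoning benchmark can score separately from the final answer.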
中文标题/摘要
标题:CXReasonBench:胸部X光结构化诊断推理基准
大型视觉-语言模型(LVLMs)的最新进展在医学任务中取得了令人鼓舞的应用,如报告生成和视觉问答。然而,现有的基准主要关注最终的诊断答案,未能提供模型是否进行临床相关推理的深入见解。为了解决这一问题,我们提出了CheXStruct和CXReasonBench,一个基于公开的MIMIC-CXR-JPG数据集构建的结构化管道和基准。CheXStruct自动从胸部X光片中推导出一系列中间推理步骤,如分割解剖区域、推导解剖标志和诊断测量、计算诊断指数以及应用临床阈值。CXReasonBench利用此管道评估模型是否能够执行临床有效的推理步骤,以及在多路径、多阶段评估中,模型从结构化指导中学习的程度,从而实现诊断推理的精细和透明评估。基准包括18,988个问答对,涉及12项诊断任务和1,200个案例,每个案例配有多达4个视觉输入,并支持包括解剖区域选择和诊断测量在内的视觉定位的多路径、多阶段评估。即使在评估的12个最强的LVLMs中,也难以进行结构化推理和泛化,经常无法将抽象知识与解剖学基础的视觉解释联系起来。代码可在https://github.com/ttumyche/CXReasonBench获取。
Summary / 总结
The research aims to evaluate structured diagnostic reasoning in chest X-rays using Large Vision-Language Models (LVLMs). The method involves creating CheXStruct, an automated pipeline that generates intermediate reasoning steps from chest X-rays, and CXReasonBench, a benchmark with 18,988 QA pairs across 12 diagnostic tasks. Key findings show that even the strongest LVLMs struggle with structured reasoning and generalization, often failing to connect abstract knowledge with visual interpretation grounded in anatomy.
CXReasonBench 是一个用于评估大型视觉-语言模型在胸部X光片上进行结构化诊断推理的基准。它基于MIMIC-CXR-JPG数据集,并自动生成一系列中间推理步骤,如分割解剖区域和计算诊断指标。基准包括18,988个问答对,涵盖12个诊断任务和1,200个案例,即使最强的LVLMs在结构化推理和泛化方面也存在问题,往往无法将抽象知识与解剖学基础的视觉解释联系起来。
VALA: Learning Latent Anchors for Training-Free and Temporally Consistent
Authors: Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao
First: 2025-10-27T03:44:11+00:00 · Latest: 2025-10-27T03:44:11+00:00
Abstract
Recent advances in training-free video editing have enabled lightweight and
precise cross-frame generation by leveraging pre-trained text-to-image
diffusion models. However, existing methods often rely on heuristic frame
selection to maintain temporal consistency during DDIM inversion, which
introduces manual bias and reduces the scalability of end-to-end inference. In
this paper, we propose~\textbf{VALA} (\textbf{V}ariational \textbf{A}lignment
for \textbf{L}atent \textbf{A}nchors), a variational alignment module that
adaptively selects key frames and compresses their latent features into
semantic anchors for consistent video editing. To learn meaningful assignments,
VALA proposes a variational framework with a contrastive learning objective.
Therefore, it can transform cross-frame latent representations into compressed
latent anchors that preserve both content and temporal coherence. Our method
can be fully integrated into training-free text-to-image based video editing
models. Extensive experiments on real-world video editing benchmarks show that
VALA achieves state-of-the-art performance in inversion fidelity, editing
quality, and temporal consistency, while offering improved efficiency over
prior methods.
中文标题/摘要
标题:VALA:学习潜在锚点以实现无需训练和时间一致的训练
近期在无需训练的视频编辑方面的进展通过利用预训练的文本到图像扩散模型,实现了轻量级和精确的跨帧生成。然而,现有方法通常依赖启发式帧选择来在DDIM反向过程中保持时间一致性,这引入了人为偏见并降低了端到端推理的可扩展性。在本文中,我们提出了**VALA**(**V**ariational **A**lignment for **L**atent **A**nchors),一种变分对齐模块,该模块自适应地选择关键帧并将它们的潜在特征压缩成语义锚点,以实现一致的视频编辑。为了学习有意义的分配,VALA 提出了一种变分框架,具有对比学习目标。因此,它可以将跨帧的潜在表示转换为压缩的潜在锚点,同时保留内容和时间连贯性。我们的方法可以完全集成到基于文本到图像的无需训练的视频编辑模型中。在现实世界的视频编辑基准上的广泛实验表明,VALA 在反向保真度、编辑质量和时间一致性方面均达到最先进的性能,同时比先前的方法提供了更高的效率。
Summary / 总结
The research aims to improve training-free video editing by addressing the issue of temporal consistency in cross-frame generation. VALA, a variational alignment module, is proposed to adaptively select key frames and compress their latent features into semantic anchors, ensuring consistent video editing. Experiments demonstrate that VALA outperforms previous methods in inversion fidelity, editing quality, and temporal consistency, while maintaining efficiency.
研究旨在通过解决跨帧生成中的时间一致性问题,改进无训练视频编辑。提出了一个变分对齐模块VALA,该模块能够自适应地选择关键帧并将它们的潜在特征压缩成语义锚点,以确保视频编辑的一致性。实验表明,VALA在反转保真度、编辑质量和时间一致性方面优于先前的方法,同时保持了效率。
FAME: Fairness-aware Attention-modulated Video Editing
Authors: Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Zhidong Li, Longbing Cao
First: 2025-10-27T03:34:15+00:00 · Latest: 2025-10-27T03:34:15+00:00
Abstract
Training-free video editing (VE) models tend to fall back on gender
stereotypes when rendering profession-related prompts. We propose \textbf{FAME}
for \textit{Fairness-aware Attention-modulated Video Editing} that mitigates
profession-related gender biases while preserving prompt alignment and temporal
consistency for coherent VE. We derive fairness embeddings from existing
minority representations by softly injecting debiasing tokens into the text
encoder. Simultaneously, FAME integrates fairness modulation into both temporal
self-attention and prompt-to-region cross-attention to mitigate the motion
corruption and temporal inconsistency caused by directly introducing fairness
cues. For temporal self-attention, FAME introduces a region-constrained
attention mask combined with time-decay weighting, which enhances intra-region
coherence while suppressing irrelevant inter-region interactions. For
cross-attention, it reweights token-to-region matching scores by incorporating
fairness-sensitive similarity masks derived from debiasing prompt embeddings.
Together, these modulations keep fairness-sensitive semantics tied to the right
visual regions and prevent temporal drift across frames. Extensive experiments
on a new fairness-oriented VE benchmark, \textit{FairVE}, demonstrate that FAME
achieves stronger fairness alignment and semantic fidelity, surpassing existing
VE baselines.
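
One way to picture a region-constrained mask combined with time-decay weighting is as an additive bias on temporal self-attention scores. The sketch below is an assumption-laden illustration, not FAME's implementation: it uses a hard mask (negative infinity) between tokens of different regions, a linear penalty `decay * |t_i - t_j|` within a region, and hypothetical names throughout. The abstract does not specify the exact form of either term.

```python
import math


def attention_bias(frames, regions, decay=0.5):
    """Additive bias: -inf across regions, -decay * |dt| within a region."""
    n = len(frames)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if regions[i] != regions[j]:
                bias[i][j] = -math.inf  # assumed hard block between regions
            else:
                bias[i][j] = -decay * abs(frames[i] - frames[j])
    return bias


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(scores, frames, regions, decay=0.5):
    """Add the bias to raw scores, then normalise each query row."""
    bias = attention_bias(frames, regions, decay)
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


# Four tokens: two regions observed in two frames (toy example).
frames = [0, 0, 1, 1]
regions = ["face", "background", "face", "background"]
scores = [[0.0] * 4 for _ in range(4)]
weights = modulated_attention(scores, frames, regions)
```

With uniform raw scores, each token attends only to tokens in its own region, and more strongly to the token in the same frame than to the one a frame away.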
中文标题/摘要
标题:FAME:公平意识的注意力调制视频编辑
无需训练的视频编辑(VE)模型在渲染职业相关提示时往往会依赖性别刻板印象。我们提出了\textbf{FAME}(公平意识的注意力调制视频编辑),以减轻职业相关的性别偏见,同时保持提示对齐和时间一致性,以实现连贯的VE。我们通过在文本编码器中柔和地注入去偏见标记,从现有的少数群体表示中推导出公平性嵌入。同时,FAME 将公平性调制整合到时间自注意力和提示到区域交叉注意力中,以减轻直接引入公平性线索导致的运动失真和时间不一致性。对于时间自注意力,FAME 引入了结合时间衰减加权的区域约束注意力掩码,这增强了区域内的连贯性并抑制了无关的跨区域交互。对于交叉注意力,它通过结合来自去偏见提示嵌入的公平性敏感相似度掩码重新加权标记,以区域匹配得分。这些调制共同保持了公平性敏感的语义与正确的视觉区域相关联,并防止了帧间的时间漂移。在新的公平性导向的VE基准\textit{FairVE}上的广泛实验表明,FAME 在公平性对齐和语义保真度方面表现更优,超越了现有的VE基线。
Summary / 总结
FAME is proposed to address gender biases in training-free video editing models when handling profession-related prompts. It introduces fairness embeddings and modulates attention to preserve prompt alignment and temporal consistency. Experiments on the FairVE benchmark show that FAME outperforms existing baselines in achieving stronger fairness alignment and semantic fidelity.
FAME旨在解决训练-free 视频编辑模型在处理职业相关提示时出现的性别偏见问题。它通过从少数群体表示中提取公平性嵌入,并将公平性调制整合到时间自注意力和提示到区域交叉注意力中,来保持提示对齐和时间一致性。在FairVE基准上的实验结果表明,FAME在公平性对齐和语义保真度方面优于现有视频编辑基线。
GOOD: Training-Free Guided Diffusion Sampling for Out-of-Distribution Detection
Authors: Xin Gao, Jiyao Liu, Guanghao Li, Yueming Lyu, Jianxiong Gao, Weichen Yu, Ningsheng Xu, Liang Wang, Caifeng Shan, Ziwei Liu, Chenyang Si
First: 2025-10-20T03:58:46+00:00 · Latest: 2025-10-27T02:58:39+00:00
Comments: 28 pages, 16 figures, conference
Abstract
Recent advancements have explored text-to-image diffusion models for
synthesizing out-of-distribution (OOD) samples, substantially enhancing the
performance of OOD detection. However, existing approaches typically rely on
perturbing text-conditioned embeddings, resulting in semantic instability and
insufficient shift diversity, which limit generalization to realistic OOD. To
address these challenges, we propose GOOD, a novel and flexible framework that
directly guides diffusion sampling trajectories towards OOD regions using
off-the-shelf in-distribution (ID) classifiers. GOOD incorporates dual-level
guidance: (1) Image-level guidance, based on the gradient of the log partition
to reduce input likelihood, drives samples toward low-density regions in pixel
space. (2) Feature-level guidance, derived from k-NN distance in the
classifier's latent space, promotes sampling in feature-sparse regions. Hence,
this dual-guidance design enables more controllable and diverse OOD sample
generation. Additionally, we introduce a unified OOD score that adaptively
combines image and feature discrepancies, enhancing detection robustness. We
perform thorough quantitative and qualitative analyses to evaluate the
effectiveness of GOOD, demonstrating that training with samples generated by
GOOD can notably enhance OOD detection performance.
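
The two quantities behind the dual-level guidance can be sketched in a few lines. This is a hedged illustration only: it assumes that "log partition" refers to the log-sum-exp of the ID classifier's logits (a common energy-based reading), it computes the scores rather than their gradients, and it omits the diffusion sampling loop and the adaptive score combination entirely. The function names and toy data are hypothetical.

```python
import math


def log_partition(logits):
    """Numerically stable log-sum-exp of classifier logits."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def knn_distance(feature, id_features, k=2):
    """Euclidean distance from `feature` to its k-th nearest ID feature."""
    if not 1 <= k <= len(id_features):
        raise ValueError("k must be between 1 and the number of ID features")
    dists = sorted(math.dist(feature, ref) for ref in id_features)
    return dists[k - 1]


# Toy in-distribution features on the corners of a unit square.
id_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
near = knn_distance((0.5, 0.5), id_features)  # inside the ID cluster
far = knn_distance((4.0, 4.0), id_features)   # in a feature-sparse region
```

Under this reading, image-level guidance would push samples toward a lower `log_partition` value, and feature-level guidance would push them toward a larger `knn_distance`, such as the `far` point above.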
中文标题/摘要
标题:GOOD:无需训练的引导扩散采样以检测分布外样本
最近的研究探索了文本到图像的扩散模型,用于生成分布外(OOD)样本,显著提升了OOD检测的性能。然而,现有方法通常依赖于扰动文本条件嵌入,导致语义不稳定性和不足的语义变化多样性,这限制了其对现实世界OOD的泛化能力。为了解决这些挑战,我们提出了一种名为GOOD的新颖且灵活的框架,该框架直接使用现成的在分布(ID)分类器引导扩散采样轨迹向OOD区域。GOOD结合了双层引导:(1)基于对数分区梯度的图像级引导,降低输入似然性,驱动样本向像素空间中的低密度区域移动。(2)基于分类器潜在空间中k-NN距离的特征级引导,促进在特征稀疏区域的采样。因此,这种双引导设计能够实现更可控和多样的OOD样本生成。此外,我们引入了一种统一的OOD评分,能够自适应地结合图像和特征差异,增强检测鲁棒性。我们进行了详尽的定量和定性分析,以评估GOOD的有效性,证明使用GOOD生成的样本进行训练可以显著提升OOD检测性能。
Summary / 总结
The paper proposes GOOD, a training-free framework for generating out-of-distribution (OOD) samples using off-the-shelf in-distribution classifiers. It employs dual-level guidance: image-level guidance based on the gradient of log partition to reduce input likelihood, and feature-level guidance derived from k-NN distance in the classifier's latent space. This approach enhances OOD sample generation's controllability and diversity. Additionally, a unified OOD score combining image and feature discrepancies is introduced to improve detection robustness. Experimental results show that using samples generated by GOOD can significantly enhance OOD detection performance.
研究旨在通过文本到图像的扩散模型生成异常分布(OOD)样本,以提高OOD检测性能。GOOD是一种新颖的框架,通过使用现成的在分布分类器进行双重指导:基于对数分区梯度的图像级指导和来自k-NN距离的特征级指导,直接引导扩散采样向OOD区域。这种方法增强了OOD样本生成的可控性和多样性,从而提高了OOD检测性能。定量和定性分析表明,使用GOOD生成的样本进行训练可以显著提高OOD检测的鲁棒性。