Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles
Authors: Dong Lao, Yuxiang Zhang, Haniyeh Ehsani Oskouie, Yangchao Wu, Alex Wong, Stefano Soatto
First: 2025-10-03T17:57:25+00:00 · Latest: 2025-10-03T17:57:25+00:00
Abstract
We propose a test-time defense mechanism against adversarial attacks:
imperceptible image perturbations that significantly alter the predictions of a
model. Unlike existing methods that rely on feature filtering or smoothing,
which can lead to information loss, we propose to "combat noise with noise" by
leveraging stochastic resonance to enhance robustness while minimizing
information loss. Our approach introduces small translational perturbations to
the input image, aligns the transformed feature embeddings, and aggregates them
before mapping back to the original reference image. This can be expressed in a
closed-form formula, which can be deployed on diverse existing network
architectures without introducing additional network modules or fine-tuning for
specific attack types. The resulting method is entirely training-free,
architecture-agnostic, and attack-agnostic. Empirical results show
state-of-the-art robustness on image classification and, for the first time,
establish a generic test-time defense for dense prediction tasks, including
stereo matching and optical flow, highlighting the method's versatility and
practicality. Specifically, relative to clean (unperturbed) performance, our
method recovers up to 68.1% of the accuracy loss on image classification, 71.9%
on stereo matching, and 29.2% on optical flow under various types of
adversarial attacks.
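The pipeline described in the abstract (translate, embed, align, aggregate) can be illustrated with a minimal sketch. This is not the authors' closed-form formula or code: the 1-D signal, the circular shift, and `toy_features` (a stride-2 pooling stand-in for a network) are assumptions chosen only to make the mechanics visible.

```python
def shift(signal, t):
    """Circularly translate a 1-D signal by t samples (positive = right)."""
    n = len(signal)
    return [signal[(i - t) % n] for i in range(n)]


def toy_features(signal):
    """Hypothetical stand-in for a network (assumption, not the paper's model):
    stride-2 average pooling, then nearest-neighbour upsampling.
    Expects an even-length signal."""
    pooled = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return [pooled[i // 2] for i in range(len(signal))]


def latent_ensemble(signal, offsets):
    """Translate the input, embed it, undo the translation in feature space,
    and average the aligned embeddings."""
    n = len(signal)
    total = [0.0] * n
    for t in offsets:
        feats = toy_features(shift(signal, t))  # embed the translated input
        aligned = shift(feats, -t)              # map back to the reference frame
        for i in range(n):
            total[i] += aligned[i]
    return [v / len(offsets) for v in total]


x = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
single_pass = toy_features(x)
ensemble = latent_ensemble(x, offsets=[0, 1])
```

In this toy, even offsets land on the same pooling grid and add nothing, while odd offsets sample a different grid, so the average carries detail that a single pass discards.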
中文标题/摘要
标题:通过潜在集合的随机共振实现对抗攻击的测试时防御
我们提出了一种对抗攻击的测试时防御机制:不可感知的图像扰动,这些扰动可以显著改变模型的预测。与依赖于特征过滤或平滑的现有方法不同,这些方法可能导致信息丢失,我们提出通过利用随机共振来“以噪制噪”,从而增强鲁棒性并最小化信息丢失。我们的方法在输入图像上引入了小的平移扰动,对变换后的特征嵌入进行对齐,并在映射回原始参考图像之前进行聚合。这可以表示为一个闭式公式,可以在各种现有的网络架构上部署,而无需引入额外的网络模块或针对特定攻击类型进行微调。该方法完全无需训练,架构无关,攻击无关。实验证明,该方法在图像分类上的鲁棒性达到了最先进的水平,并首次为密集预测任务(包括立体匹配和光流)建立了通用的测试时防御,突显了该方法的通用性和实用性。具体而言,与干净(未扰动)的性能相比,我们的方法在不同类型的对抗攻击下,分别恢复了图像分类68.1%的准确率损失、立体匹配71.9%的准确率损失和光流29.2%的准确率损失。
Summary / 总结
The paper proposes a training-free test-time defense against adversarial attacks that uses stochastic resonance to enhance robustness while minimizing information loss. The method applies small translational perturbations to the input image, aligns and aggregates the resulting feature embeddings, and maps them back to the original reference image. It achieves state-of-the-art robustness on image classification and extends to dense prediction tasks such as stereo matching and optical flow, recovering up to 68.1%, 71.9%, and 29.2% of the accuracy loss on the three tasks, respectively.
该论文提出了一种针对对抗攻击的测试时防御机制,通过在输入图像上引入小的平移扰动,并利用随机共振增强鲁棒性而不损失信息。该方法将对齐并聚合变换的特征嵌入,然后将其映射回原始图像,适用于各种网络架构和攻击类型。实验表明,在各种类型的对抗攻击下,该方法能显著恢复准确率,特别是在图像分类、立体匹配和光学流任务中。
Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning
Authors: Yilun Hao, Yongchao Chen, Chuchu Fan, Yang Zhang
First: 2025-10-03T16:57:01+00:00 · Latest: 2025-10-03T16:57:01+00:00
Comments: 30 pages, 5 figures, 5 tables
Abstract
Vision Language Models (VLMs) show strong potential for visual planning but
struggle with precise spatial and long-horizon reasoning. In contrast, Planning
Domain Definition Language (PDDL) planners excel at long-horizon formal
planning, but cannot interpret visual inputs. Recent works combine these
complementary advantages by enabling VLMs to turn visual planning problems into
PDDL files for formal planning. However, while VLMs can generate PDDL problem
files satisfactorily, they struggle to accurately generate the PDDL domain
files, which describe all the planning rules. As a result, prior methods rely
on human experts to predefine domain files or on constant environment access
for refinement. We propose VLMFP, a Dual-VLM-guided framework that can
autonomously generate both PDDL problem and domain files for formal visual
planning. VLMFP introduces two VLMs to ensure reliable PDDL file generation: A
SimVLM that simulates action consequences based on input rule descriptions, and
a GenVLM that generates and iteratively refines PDDL files by comparing the
PDDL and SimVLM execution results. VLMFP unleashes multiple levels of
generalizability: The same generated PDDL domain file works for all the
different instances under the same problem, and VLMs generalize to different
problems with varied appearances and rules. We evaluate VLMFP with 6 grid-world
domains and test its generalization to unseen instances, appearance, and game
rules. On average, SimVLM accurately describes 95.5% and 82.6% of scenarios,
simulates 85.5% and 87.8% of action sequences, and correctly judges goal
reaching in 82.4% and 85.6% of cases for seen and unseen appearances,
respectively. With the guidance of SimVLM, VLMFP generates PDDL files that
yield valid plans for 70.0% and 54.1% of unseen instances in seen and unseen
appearances, respectively. Project page:
https://sites.google.com/view/vlmfp.
中文标题/摘要
标题:模拟到规则:一种用于正式视觉规划的双VLM框架
视觉语言模型(VLMs)在视觉规划方面显示出强大的潜力,但在精确的空间和长时序推理方面存在困难。相比之下,规划领域定义语言(PDDL)规划器在长时序正式规划方面表现出色,但无法解释视觉输入。最近的研究通过使VLMs能够将视觉规划问题转换为PDDL文件来进行正式规划,结合了这些互补的优势。然而,虽然VLMs可以很好地生成PDDL问题文件,但在准确生成描述所有规划规则的PDDL领域文件方面存在困难。因此,先前的方法依赖于人类专家预先定义领域文件或不断访问环境进行改进。我们提出了VLMFP,这是一种双VLM引导框架,可以自主生成正式视觉规划的PDDL问题文件和领域文件。VLMFP引入了两个VLMs以确保可靠的PDDL文件生成:一个SimVLM根据输入规则描述模拟动作后果,一个GenVLM通过比较PDDL和SimVLM执行结果生成并迭代改进PDDL文件。VLMFP释放了多个层次的泛化能力:生成的PDDL领域文件适用于同一问题下的所有不同实例,VLMs可以泛化到具有不同外观和规则的不同问题。我们使用6个网格世界领域评估了VLMFP,并测试了其对未见过的实例、外观和游戏规则的泛化能力。平均而言,SimVLM准确描述了95.5%、82.6%的场景,模拟了85.5%、87.8%的动作序列,并判断了82.4%、85.6%的目标达成情况,分别针对已见过和未见过的外观。在SimVLM的指导下,VLMFP可以生成PDDL文件以达到70.0%、54.1%的有效计划,分别针对已见过和未见过的实例。
Summary / 总结
The research aims to improve visual planning by combining the strengths of Vision Language Models (VLMs) and Planning Domain Definition Language (PDDL) planners. VLMFP, a Dual-VLM framework, autonomously generates both PDDL problem and domain files for visual planning. It uses a SimVLM to simulate action consequences and a GenVLM to generate and iteratively refine PDDL files by comparing PDDL and SimVLM execution results. Experiments on six grid-world domains show that SimVLM accurately describes 95.5% and 82.6% of scenarios and simulates 85.5% and 87.8% of action sequences for seen and unseen appearances, respectively, and that VLMFP produces valid plans for 70.0% and 54.1% of unseen instances in seen and unseen appearances, respectively.
研究旨在通过结合视觉语言模型(VLM)和规划领域定义语言(PDDL)规划器的优势来改进视觉规划。VLMFP是一种双VLM框架,能够自主生成视觉规划中的PDDL问题文件和领域文件。该框架使用SimVLM模拟动作后果,并使用GenVLM生成和迭代优化PDDL文件。实验结果显示,SimVLM能够准确模拟85.5%到95.5%的场景和动作序列,而VLMFP在未见过的实例和未见过的外观中生成有效计划的成功率分别为70.0%和54.1%。
SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
Authors: Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
First: 2025-10-03T16:32:02+00:00 · Latest: 2025-10-03T16:32:02+00:00
Abstract
Spine disorders affect 619 million people globally and are a leading cause of
disability, yet AI-assisted diagnosis remains limited by the lack of
level-aware, multimodal datasets. Clinical decision-making for spine disorders
requires sophisticated reasoning across X-ray, CT, and MRI at specific
vertebral levels. However, progress has been constrained by the absence of
traceable, clinically-grounded instruction data and standardized,
spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem
co-designed with practicing spine surgeons. It features SpineMed-450k, the
first large-scale dataset explicitly designed for vertebral-level reasoning
across imaging modalities with over 450,000 instruction instances, and
SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is
curated from diverse sources, including textbooks, guidelines, open datasets,
and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline
with a two-stage LLM generation method (draft and revision) to ensure
high-quality, traceable data for question-answering, multi-turn consultations,
and report generation. SpineBench evaluates models on clinically salient axes,
including level identification, pathology assessment, and surgical planning.
Our comprehensive evaluation of several recently advanced large vision-language
models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained,
level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k
demonstrates consistent and significant improvements across all tasks.
Clinician assessments confirm the diagnostic clarity and practical utility of
our model's outputs.
中文标题/摘要
标题:SpineBench:基于SpineMed-450k语料库的临床相关、分级驱动基准
脊椎疾病影响全球6.19亿人,是导致残疾的主要原因之一,但AI辅助诊断受限于缺乏分级意识的多模态数据集。脊椎疾病的临床决策需要在特定椎体水平上对X光、CT和MRI进行复杂的推理。然而,由于缺乏可追溯的、临床依据的数据和标准化的脊椎特定基准,进展受限。为了解决这一问题,我们引入了SpineMed,一个与实践中的脊椎外科医生共同设计的生态系统。它包括SpineMed-450k,这是首个明确为跨成像模态的椎体级推理设计的大规模数据集,包含超过45万个指令实例,以及SpineBench,一个临床依据的评估框架。SpineMed-450k从多种来源收集,包括教科书、指南、开放数据集和约1000个匿名医院病例,使用临床医生在环的管道和两阶段LLM生成方法(草案和修订)来确保高质量、可追溯的数据,用于问题回答、多轮咨询和报告生成。SpineBench在临床相关轴上评估模型,包括椎体识别、病理评估和手术规划。我们对SpineBench上几种最近先进的大型视觉-语言模型的全面评估揭示了它们在细粒度、椎体特定推理方面的系统性弱点。相比之下,我们基于SpineMed-450k微调的模型在所有任务上都表现出一致且显著的改进。临床医生评估证实了我们模型输出的诊断清晰度和实用价值。
Summary / 总结
The research addresses the limitation of AI-assisted diagnosis in spine disorders due to the lack of level-aware, multimodal datasets. It introduces SpineMed-450k, a dataset of over 450,000 instruction instances for vertebral-level reasoning across X-ray, CT, and MRI, and SpineBench, a clinically grounded evaluation framework. Evaluating several recent large vision-language models on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning, whereas the authors' model fine-tuned on SpineMed-450k improves consistently across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of that model's outputs.
研究旨在解决由于缺乏层次意识的多模态数据集而导致脊椎疾病的人工智能辅助诊断受限的问题。研究引入了SpineMed-450k,这是一个用于跨影像模态的椎体层次推理的大规模数据集,以及SpineBench,一个临床导向的评估框架。研究评估了几种先进的大型视觉-语言模型在SpineBench上的表现,并发现基于SpineMed-450k微调的模型在所有任务中表现出一致且显著的改进,临床评估确认了该模型输出的诊断清晰度和实用价值。
Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields
Authors: Zhiting Mei, Ola Shorinwa, Anirudha Majumdar
First: 2025-10-03T15:32:56+00:00 · Latest: 2025-10-03T15:32:56+00:00
Abstract
Semantic distillation in radiance fields has spurred significant advances in
open-vocabulary robot policies, e.g., in manipulation and navigation, founded
on pretrained semantics from large vision models. While prior work has
demonstrated the effectiveness of visual-only semantic features (e.g., DINO and
CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit
of geometry-grounding in distilled fields remains an open question. In
principle, visual-geometry features seem very promising for spatial tasks such
as pose estimation, prompting the question: Do geometry-grounded semantic
features offer an edge in distilled fields? Specifically, we ask three critical
questions: First, does spatial-grounding produce higher-fidelity geometry-aware
semantic features? We find that image features from geometry-grounded backbones
contain finer structural details compared to their counterparts. Secondly, does
geometry-grounding improve semantic object localization? We observe no
significant difference in this task. Thirdly, does geometry-grounding enable
higher-accuracy radiance field inversion? Given the limitations of prior work
and their lack of semantics integration, we propose a novel framework SPINE for
inverting radiance fields without an initial guess, consisting of two core
components: coarse inversion using distilled semantics, and fine inversion
using photometric-based optimization. Surprisingly, we find that the pose
estimation accuracy decreases with geometry-grounded features. Our results
suggest that visual-only features offer greater versatility for a broader range
of downstream tasks, although geometry-grounded features contain more geometric
detail. Notably, our findings underscore the necessity of future research on
effective strategies for geometry-grounding that augment the versatility and
performance of pretrained semantic features.
中文标题/摘要
标题:几何与视觉的交汇:重访蒸馏领域中的预训练语义
辐射场中的语义蒸馏在开放词汇机器人策略中取得了显著进展,例如在操作和导航中的应用,这些策略基于大型视觉模型中的预训练语义。尽管先前的工作已经证明了仅基于视觉的语义特征(例如DINO和CLIP)在Gaussian散点图和神经辐射场中的有效性,但在蒸馏领域中几何约束的潜在益处仍然是一个开放的问题。原则上,视觉-几何特征对于空间任务(如姿态估计)非常有前景,这引发了这样一个问题:几何约束的语义特征在蒸馏领域中是否具有优势?具体来说,我们提出了三个关键问题:首先,几何约束是否能产生更高保真的几何感知语义特征?我们发现,来自几何约束主干的图像特征包含比其对应物更精细的结构细节。其次,几何约束是否能改善语义对象定位?我们没有观察到显著差异。第三,几何约束是否能提高辐射场的重建精度?鉴于先前工作的局限性和缺乏语义集成,我们提出了一种名为SPINE的新框架,用于在没有初始猜测的情况下反转辐射场,该框架由两个核心组件组成:使用蒸馏语义进行粗略反转,以及使用基于光度的优化进行精细反转。令人惊讶的是,我们发现使用几何约束特征的姿态估计精度降低了。我们的结果表明,仅基于视觉的特征对于更广泛的下游任务具有更大的灵活性,尽管几何约束特征包含更多的几何细节。值得注意的是,我们的发现强调了未来研究中有效几何约束策略的必要性,这些策略可以增强预训练语义特征的灵活性和性能。
Summary / 总结
This study investigates whether geometry-grounded semantic features offer an edge over visual-only features when distilled into radiance fields. Geometry-grounded backbones yield finer structural detail, but they bring no significant improvement in semantic object localization and reduce pose estimation accuracy. To test the latter, the authors propose SPINE, a framework for inverting radiance fields without an initial guess via coarse inversion with distilled semantics followed by fine photometric optimization. They conclude that visual-only features are more versatile across downstream tasks, even though geometry-grounded features contain more geometric detail.
研究探讨了几何导向语义特征在蒸馏场中的影响,是否这些特征能提升空间任务的表现。研究发现,虽然几何导向特征提供了更精细的结构细节,并未显著提高物体定位的准确性,但也没有增强辐射场反演的精度。研究引入了SPINE框架,用于在没有初始猜测的情况下反演辐射场,并得出结论认为,视觉仅特征对于各种下游任务更具通用性,尽管几何导向特征提供了更多的几何细节。
Less is More: Lean yet Powerful Vision-Language Model for Autonomous Driving
Authors: Sheng Yang, Tong Zhan, Guancheng Chen, Yanfeng Lu, Jian Wang
First: 2025-09-29T05:14:18+00:00 · Latest: 2025-10-03T15:30:05+00:00
Abstract
In this work, we reconceptualize autonomous driving as a generalized language
and formulate the trajectory planning task as next waypoint prediction. We
introduce Max-V1, a novel framework for one-stage end-to-end autonomous
driving. Our framework presents a single-pass generation paradigm that aligns
with the inherent sequentiality of driving. This approach leverages the
generative capacity of the VLM (Vision-Language Model) to enable end-to-end
trajectory prediction directly from front-view camera input. The efficacy of
this method is underpinned by a principled supervision strategy derived from
statistical modeling. This provides a well-defined learning objective, which
makes the framework well suited to mastering complex driving policies through
imitation learning from large-scale expert demonstrations. Empirically, our
method achieves state-of-the-art performance on the nuScenes dataset,
delivering an overall improvement of over 30% compared to prior baselines.
Furthermore, it exhibits superior generalization performance on cross-domain
datasets acquired from diverse vehicles, demonstrating notable potential for
cross-vehicle robustness and adaptability. Due to these empirical strengths,
this work introduces a model enabling fundamental driving behaviors, laying the
foundation for the development of more capable self-driving agents. Code will
be available upon publication.
中文标题/摘要
标题:少即是多:精简而强大的视觉语言模型在自动驾驶中的应用
在本研究中,我们将自动驾驶重新概念化为一种通用的语言任务,并将轨迹规划任务形式化为下一个航点预测。我们引入了Max-V1,一种新颖的一阶段端到端自动驾驶框架。该框架采用了一次生成的范式,与驾驶的固有顺序性相契合。该方法利用VLM(视觉语言模型)的生成能力,直接从前视摄像头输入中进行端到端的轨迹预测。该方法的有效性基于从统计建模中得出的原则性监督策略,这为通过大规模专家演示的模仿学习掌握复杂的驾驶策略提供了明确的学习目标。实证上,我们的方法在nuScenes数据集上达到了最先进的性能,相比之前的基线提高了超过30%的整体性能。此外,它在不同车辆采集的跨域数据集上表现出优越的泛化性能,展示了跨车辆鲁棒性和适应性的显著潜力。由于这些实证优势,本研究引入了一个能够实现基本驾驶行为的模型,为开发更强大的自动驾驶代理奠定了基础。代码将在发表后提供。
Summary / 总结
This work reimagines autonomous driving as a generalized language task, formulating trajectory planning as next waypoint prediction. It introduces Max-V1, a one-stage end-to-end framework that uses a Vision-Language Model to predict trajectories directly from front-view camera input in a single pass. The method achieves state-of-the-art performance on the nuScenes dataset, with an overall improvement of over 30% compared to prior baselines, and generalizes well to cross-domain datasets collected from different vehicles.
本文将自动驾驶重新定义为语言任务,将轨迹规划任务表述为下一个航点预测。引入了Max-V1,这是一种端到端的一阶段框架,利用视觉语言模型直接从摄像头输入中进行轨迹预测。该方法由一个原理性的监督策略引导,实现了在nuScenes数据集上的最新性能,相比之前的方法提高了超过30%。此外,它在不同车辆的数据集上表现出强大的泛化能力,显示出强大的鲁棒性和适应性潜力。
Toward a Holistic Evaluation of Robustness in CLIP Models
Authors: Weijie Tu, Weijian Deng, Tom Gedeon
Venue: NeurIPS
First: 2024-10-02T13:26:17+00:00 · Latest: 2025-10-03T12:59:41+00:00
Comments: Accepted to IEEE TPAMI, extension of NeurIPS'23 work: A Closer Look
at the Robustness of Contrastive Language-Image Pre-Training (CLIP)
Abstract
Contrastive Language-Image Pre-training (CLIP) models have shown significant
potential, particularly in zero-shot classification across diverse distribution
shifts. Building on existing evaluations of overall classification robustness,
this work aims to provide a more comprehensive assessment of CLIP by
introducing several new perspectives. First, we investigate their robustness to
variations in specific visual factors. Second, we assess two critical safety
objectives--confidence uncertainty and out-of-distribution detection--beyond
mere classification accuracy. Third, we evaluate the finesse with which CLIP
models bridge the image and text modalities. Fourth, we extend our examination
to 3D awareness in CLIP models, moving beyond traditional 2D image
understanding. Finally, we explore the interaction between vision and language
encoders within modern large multimodal models (LMMs) that utilize CLIP as the
visual backbone, focusing on how this interaction impacts classification
robustness. In each aspect, we consider the impact of six factors on CLIP
models: model architecture, training distribution, training set size,
fine-tuning, contrastive loss, and test-time prompts. Our study uncovers
several previously unknown insights into CLIP. For instance, the architecture
of the visual encoder in CLIP plays a significant role in their robustness
against 3D corruption. CLIP models tend to exhibit a bias towards shape when
making predictions. Moreover, this bias tends to diminish after fine-tuning on
ImageNet. Vision-language models like LLaVA, leveraging the CLIP vision
encoder, could exhibit benefits in classification performance for challenging
categories over CLIP alone. Our findings are poised to offer valuable guidance
for enhancing the robustness and reliability of CLIP models.
Summary / 总结
This work evaluates CLIP models' robustness from multiple perspectives, including their sensitivity to specific visual factors, safety objectives such as confidence uncertainty and out-of-distribution detection, modality bridging, 3D awareness, and the interaction between vision and language encoders in large multimodal models. In each case it examines six factors: model architecture, training distribution, training set size, fine-tuning, contrastive loss, and test-time prompts. Key findings include the significant role of the visual encoder architecture in robustness to 3D corruption, a shape bias in predictions that tends to diminish after fine-tuning on ImageNet, and the observation that vision-language models such as LLaVA, which build on the CLIP vision encoder, could classify challenging categories better than CLIP alone.
这项研究通过考察CLIP模型对视觉变化的敏感性、评估安全目标如置信不确定性及异常分布检测、探索其在图像和文本模态间的桥梁作用,以及探索3D感知和视觉与语言编码器在现代大型多模态模型中的相互作用。关键发现包括视觉编码器架构在3D鲁棒性中的重要作用、预测中的形状偏见在微调后会减弱,以及在挑战性类别中视觉语言模型如LLaVA可能表现出优于CLIP的分类性能。
Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights
Authors: Daphne Tsolissou, Theofanis Ganitidis, Konstantinos Mitsis, Stergios Christodoulidis, Maria Vakalopoulou, Konstantina Nikita
First: 2025-10-03T11:48:12+00:00 · Latest: 2025-10-03T11:48:12+00:00
Abstract
Reliable risk assessment for carotid atheromatous disease remains a major
clinical challenge, as it requires integrating diverse clinical and imaging
information in a manner that is transparent and interpretable to clinicians.
This study investigates the potential of state-of-the-art and recent large
vision-language models (LVLMs) for multimodal carotid plaque assessment by
integrating ultrasound imaging (USI) with structured clinical, demographic,
laboratory, and protein biomarker data. A framework that simulates realistic
diagnostic scenarios through interview-style question sequences is proposed,
comparing a range of open-source LVLMs, including both general-purpose and
medically tuned models. Zero-shot experiments reveal that, despite their
capabilities, not all LVLMs can accurately identify imaging modality and
anatomy, and all of them perform poorly at risk classification. To address
this limitation, LLaVa-NeXT-Vicuna is adapted to the ultrasound domain using
low-rank adaptation (LoRA), resulting in substantial improvements in stroke
risk stratification. The integration of multimodal tabular data in the form of
text further enhances specificity and balanced accuracy, yielding competitive
performance compared to prior convolutional neural network (CNN) baselines
trained on the same dataset. Our findings highlight both the promise and
limitations of LVLMs in ultrasound-based cardiovascular risk prediction,
underscoring the importance of multimodal integration, model calibration, and
domain adaptation for clinical translation.
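Low-rank adaptation keeps the pretrained weight matrix frozen and learns only a low-rank update to it. The sketch below shows that general idea in plain Python; the matrix sizes, the `alpha` scaling value, and the function names are illustrative assumptions, not details of the LLaVa-NeXT-Vicuna adaptation in this study.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A.
    w: frozen (d_out x d_in), a: trainable (r x d_in), b: trainable (d_out x r)."""
    r = len(a)                # rank of the update
    delta = matmul(b, a)      # low-rank update, same shape as w
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_params(d_out, d_in, r):
    """Parameters trained by the low-rank update versus full fine-tuning."""
    return {"lora": r * (d_out + d_in), "full": d_out * d_in}


w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # frozen pretrained weight (toy values)
a = [[0.5, -1.0, 2.0]]                   # rank-1 factor A
b_zero = [[0.0], [0.0]]                  # B starts at zero, so the update starts at zero
b_trained = [[1.0], [2.0]]               # hypothetical values after some training

adapted_at_init = lora_weight(w, a, b_zero, alpha=8)
adapted_later = lora_weight(w, a, b_trained, alpha=2)
```

Starting `B` at zero means the adapted model initially reproduces the pretrained one, and only `r * (d_out + d_in)` values are trained per adapted matrix.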
中文标题/摘要
标题:使用大型视觉-语言模型进行多模态颈动脉风险分层:基准测试、微调及临床见解
可靠的颈动脉粥样硬化疾病风险评估仍然是一个主要的临床挑战,因为它需要以透明和可解释的方式整合多种临床和影像信息。本研究探讨了最先进的和近期大型视觉-语言模型(LVLMs)在结合超声成像(USI)和结构化的临床、人口统计、实验室和蛋白质生物标志物数据进行颈动脉斑块评估方面的潜力。提出了一种通过访谈式问题序列模拟现实诊断场景的框架,比较了多种开源LVLMs,包括通用和医学调优模型。零样本实验表明,尽管非常强大,但并非所有LVLMs都能准确识别影像模态和解剖结构,而它们在准确的风险分类方面表现不佳。为解决这一局限性,通过低秩适应(LoRA)将LLaVa-NeXT-Vicuna 调整到超声领域,显著提高了中风风险分层。以文本形式整合的多模态表格数据进一步提高了特异性和平衡准确率,与在相同数据集上训练的卷积神经网络(CNN)基线相比,性能具有竞争力。我们的研究结果突显了LVLMs在基于超声的心血管风险预测中的潜力和局限性,强调了多模态整合、模型校准和领域适应对于临床转化的重要性。
Summary / 总结
This study explores the use of large vision-language models (LVLMs) for carotid plaque assessment by integrating ultrasound imaging with structured clinical, demographic, laboratory, and protein biomarker data. It compares a range of open-source LVLMs in zero-shot, interview-style diagnostic scenarios and finds that not all of them can accurately identify imaging modality and anatomy, while all of them perform poorly at risk classification. Adapting LLaVa-NeXT-Vicuna to the ultrasound domain with low-rank adaptation (LoRA) substantially improves stroke risk stratification. Adding multimodal tabular data as text further improves specificity and balanced accuracy, giving performance competitive with prior CNN baselines trained on the same dataset. The study highlights the potential and limitations of LVLMs in ultrasound-based cardiovascular risk prediction, emphasizing multimodal integration, model calibration, and domain adaptation for clinical use.
该研究探索了将大型视觉-语言模型(LVLMs)用于通过将超声成像与临床和人口统计学数据结合来进行颈动脉斑块评估的方法。研究比较了各种开源LVLMs,发现尽管它们非常强大,但都无法准确识别成像模态和解剖结构或进行风险分类。为了改善中风风险分层,使用低秩适应(LoRA)对LLaVa-NeXT-Vicuna进行了调整,从而提高了性能。进一步整合多模态表格数据进一步增强了模型的特异性和平衡准确性,显示出与之前基于卷积神经网络(CNN)的基线相比具有竞争力的性能。研究强调了LVLMs在基于超声成像的心血管风险预测中的潜力和局限性,强调了多模态整合和领域适应对于临床应用的重要性。
Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting
Authors: Nikoo Naghavian, Mostafa Tavassolipour
Venue: NeurIPS 2025
First: 2025-10-03T11:36:02+00:00 · Latest: 2025-10-03T11:36:02+00:00
Comments: Accepted to the NeurIPS 2025 Workshop on Reliable ML from Unreliable
Data
Abstract
Vision-language models like CLIP demonstrate impressive zero-shot
generalization but remain highly vulnerable to adversarial attacks. In this
work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot
robustness in vision-language models. CAW consists of two components: (1) a
Confidence-Aware loss that prioritizes uncertain adversarial examples by
scaling the KL divergence between clean and adversarial predictions, and (2) a
feature alignment regularization that preserves semantic consistency by
minimizing the distance between frozen and fine-tuned image encoder features on
adversarial inputs. These components work jointly to improve both clean and
robust accuracy without sacrificing generalization. Extensive experiments on
TinyImageNet and 14 additional datasets show that CAW outperforms recent
methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while
using less memory.
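The two components described above can be sketched for a single example as follows. The abstract does not give the exact formula, so the uncertainty weight (one minus the top adversarial probability), the squared-distance regularizer, and the `lam` trade-off are assumptions for illustration rather than the CAW objective itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def caw_style_loss(clean_logits, adv_logits, frozen_feat, tuned_feat, lam=1.0):
    """Confidence-weighted KL term plus a feature-alignment penalty."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    # Assumed weighting: less confident adversarial predictions count more.
    weight = 1.0 - max(p_adv)
    kl = kl_divergence(p_clean, p_adv)
    # Assumed regularizer: squared distance between frozen and fine-tuned
    # image-encoder features computed on the adversarial input.
    align = sum((f - t) ** 2 for f, t in zip(frozen_feat, tuned_feat))
    return weight * kl + lam * align


no_shift = caw_style_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
shifted = caw_style_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 0.0], [0.5, 0.5])
```

When the adversarial prediction and features match the clean ones, both terms vanish; any divergence in predictions or drift in features raises the loss.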
中文标题/摘要
标题:视觉语言模型的零样本鲁棒性通过信心感知加权提升
像CLIP这样的视觉-语言模型在零样本泛化方面表现出色,但仍然高度易受对抗攻击的影响。在本工作中,我们提出了信心感知加权(CAW)以增强视觉-语言模型的零样本鲁棒性。CAW 包含两个组件:(1)信心感知损失,通过按比例调整干净和对抗预测之间的KL散度来优先处理不确定的对抗样本,(2)特征对齐正则化,通过在对抗输入上最小化冻结和微调图像编码器特征之间的距离来保持语义一致性。这些组件共同作用,提高干净和鲁棒准确性,而不牺牲泛化能力。在TinyImageNet和14个额外数据集上的广泛实验表明,在AutoAttack等强大攻击下,CAW 在性能上优于PMG-AFT和TGA-ZSR 等最近的方法,同时使用更少的内存。
Summary / 总结
This work addresses the vulnerability of vision-language models like CLIP to adversarial attacks by proposing Confidence-Aware Weighting (CAW), which includes a Confidence-Aware loss and feature alignment regularization. The method enhances both clean and robust accuracy without compromising generalization. Experiments show that CAW outperforms recent methods like PMG-AFT and TGA-ZSR under strong attacks such as AutoAttack, while requiring less memory.
本文针对视觉-语言模型如CLIP在对抗攻击面前的脆弱性,尽管它们在零样本泛化方面表现出色。提出了信心感知加权(CAW),包括信心感知损失和特征对齐正则化。该方法优先处理不确定的对抗样本,并保持语义一致性,从而提高干净和鲁棒准确性。实验表明,CAW在强攻击如AutoAttack下优于最近的方法,同时使用更少的内存。
Understanding Adversarial Transfer: Why Representation-Space Attacks Fail Where Data-Space Attacks Succeed
Authors: Isha Gupta, Rylan Schaeffer, Joshua Kazdan, Ken Ziyu Liu, Sanmi Koyejo
First: 2025-10-01T22:10:58+00:00 · Latest: 2025-10-03T11:35:23+00:00
Abstract
The field of adversarial robustness has long established that adversarial
examples can successfully transfer between image classifiers and that text
jailbreaks can successfully transfer between language models (LMs). However, a
pair of recent studies reported being unable to successfully transfer image
jailbreaks between vision-language models (VLMs). To explain this striking
difference, we propose a fundamental distinction regarding the transferability
of attacks against machine learning models: attacks in the input data-space can
transfer, whereas attacks in model representation space do not, at least not
without geometric alignment of representations. We then provide theoretical and
empirical evidence of this hypothesis in four different settings. First, we
mathematically prove this distinction in a simple setting where two networks
compute the same input-output map but via different representations. Second, we
construct representation-space attacks against image classifiers that are as
successful as well-known data-space attacks, but fail to transfer. Third, we
construct representation-space attacks against LMs that successfully jailbreak
the attacked models but again fail to transfer. Fourth, we construct data-space
attacks against VLMs that successfully transfer to new VLMs, and we show that
representation space attacks can transfer when VLMs' latent geometries are
sufficiently aligned in post-projector space. Our work reveals that adversarial
transfer is not an inherent property of all attacks but contingent on their
operational domain - the shared data-space versus models' unique representation
spaces - a critical insight for building more robust models.
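The first setting, two networks that compute the same input-output map through different representations, can be illustrated with a toy example. This is a hedged sketch rather than the paper's proof: the two-layer linear networks, the specific matrices, and the 90-degree rotation `R` are all assumptions chosen so the arithmetic is exact.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def forward(net, x, hidden_delta=None):
    """Two-layer linear network; hidden_delta perturbs the hidden representation."""
    first, second = net
    h = matvec(first, x)
    if hidden_delta is not None:
        h = [hi + di for hi, di in zip(h, hidden_delta)]
    return matvec(second, h)


W1 = [[1, 2], [0, 1]]
W2 = [[1, 0]]
R = [[0, -1], [1, 0]]        # rotates the hidden space by 90 degrees
R_INV = [[0, 1], [-1, 0]]

net_a = (W1, W2)
net_b = (matmul(R, W1), matmul(W2, R_INV))   # same function, rotated representation

x = [3, 4]
clean = (forward(net_a, x), forward(net_b, x))

# Data-space attack: perturb the shared input.
x_adv = [4, 4]
data_attack = (forward(net_a, x_adv), forward(net_b, x_adv))

# Representation-space attack: perturb the hidden vector along net_a's readout.
dh = [1, 0]
rep_attack = (forward(net_a, x, dh), forward(net_b, x, dh))
```

The input perturbation moves both outputs identically because the input space is shared, while the hidden perturbation built for `net_a` leaves `net_b` unchanged because its readout direction has been rotated away.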
中文标题/摘要
标题:理解对抗迁移:为什么数据空间攻击成功而表示空间攻击失败
对抗鲁棒性领域长期以来已经确立,对抗样本可以在图像分类器之间成功迁移,文本突破也可以在语言模型(LMs)之间成功迁移。然而,最近的两项研究报道无法成功在视觉语言模型(VLMs)之间迁移图像突破。为了解释这种显著差异,我们提出了一种基本的区别,即针对机器学习模型的攻击在传输方面的不同:数据空间中的攻击可以传输,而表示空间中的攻击则不行,至少在没有表示的几何对齐的情况下不行。然后我们在四个不同的场景中提供了这一假设的理论和实验证据。首先,我们在一个简单的场景中通过数学证明了这种区别,即两个网络计算相同的输入输出映射但通过不同的表示。其次,我们构建了针对图像分类器的表示空间攻击,这些攻击与众所周知的数据空间攻击一样成功,但无法传输。第三,我们构建了针对LMs的表示空间攻击,这些攻击成功突破了被攻击的模型,但同样无法传输。第四,我们构建了针对VLMs的数据空间攻击,这些攻击成功迁移到新的VLMs,我们展示了当VLMs的潜在几何结构在后投影空间中充分对齐时,表示空间攻击可以传输。我们的工作揭示了对抗迁移并不是所有攻击的固有属性,而是取决于它们的操作领域——共享的数据空间与模型的独特表示空间——这是构建更稳健模型的一个关键见解。
Summary / 总结
This paper investigates why representation-space attacks fail to transfer between models while data-space attacks succeed. The authors propose that attacks in the input data-space can transfer, whereas attacks in the model representation space do not, unless the representations are geometrically aligned. They provide theoretical and empirical evidence in four settings: a mathematical proof for two networks that compute the same input-output map via different representations; representation-space attacks on image classifiers that succeed but fail to transfer; representation-space attacks on LMs that jailbreak the attacked models but again fail to transfer; and data-space attacks on VLMs that do transfer to new VLMs, together with representation-space attacks that transfer only when the VLMs' latent geometries are sufficiently aligned in post-projector space.
该论文探讨了为什么代表空间攻击在模型之间无法转移,而数据空间攻击可以成功转移。它提出了数据空间和代表空间攻击之间的区别,表明数据空间攻击可以转移,而代表空间攻击则需要几何对齐才能转移。研究在四个设置中提供了理论和实验证据,证明了在简单设置中的区别,构建了代表空间攻击使其无法转移,并展示了当潜在几何结构在后投影空间中对齐时,数据空间攻击可以转移。这项工作强调了操作域对于对抗转移的重要性。
Training-Free Out-Of-Distribution Segmentation With Foundation Models
Authors: Laith Nayal, Hadi Salloum, Ahmad Taha, Yaroslav Kholodov, Alexander Gasnikov
First: 2025-10-03T11:27:40+00:00 · Latest: 2025-10-03T11:27:40+00:00
Comments: 12 pages, 5 figures, 2 tables, ICOMP 2025
Abstract
Detecting unknown objects in semantic segmentation is crucial for
safety-critical applications such as autonomous driving. Large vision
foundation models, including DINOv2, InternImage, and CLIP, have advanced
visual representation learning by providing rich features that generalize well
across diverse tasks. While their strength in closed-set semantic tasks is
established, their capability to detect out-of-distribution (OoD) regions in
semantic segmentation remains underexplored. In this work, we investigate
whether foundation models fine-tuned on segmentation datasets can inherently
distinguish in-distribution (ID) from OoD regions without any outlier
supervision. We propose a simple, training-free approach that utilizes features
from the InternImage backbone and applies K-Means clustering alongside
confidence thresholding on raw decoder logits to identify OoD clusters. Our
method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77
on the ADE-OoD benchmark with InternImage-L, surpassing several supervised
and unsupervised baselines. These results suggest a promising direction for
generic OoD segmentation methods that require minimal assumptions or additional
data.
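The clustering-plus-thresholding idea can be sketched as below. It is a minimal illustration under stated assumptions, not the authors' pipeline: features and logits are toy per-pixel lists rather than InternImage outputs, the confidence score is taken to be the maximum raw logit, and a cluster is flagged when its mean score falls below a hypothetical `threshold`.

```python
def kmeans(points, init_centroids, iters=10):
    """Plain K-Means with fixed initial centroids; returns a label per point."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def ood_mask(features, logits, init_centroids, threshold):
    """Flag every pixel whose feature cluster has low mean max-logit confidence."""
    labels = kmeans(features, init_centroids)
    flagged = set()
    for k in range(len(init_centroids)):
        scores = [max(l) for l, lab in zip(logits, labels) if lab == k]
        if scores and sum(scores) / len(scores) < threshold:
            flagged.add(k)
    return [lab in flagged for lab in labels]


features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]   # toy per-pixel features
logits = [[9.0, 1.0], [8.0, 0.0], [1.0, 1.2], [0.5, 1.0]]     # toy per-pixel class logits
mask = ood_mask(features, logits, init_centroids=[[0.0, 0.0], [5.0, 5.0]], threshold=3.0)
```

Scoring whole clusters rather than single pixels lets a few confident pixels inside an unfamiliar region inherit the cluster's verdict, which is one plausible reason to pair clustering with thresholding.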
Summary / 总结
This work addresses the challenge of detecting unknown objects in semantic segmentation, crucial for safety-critical applications like autonomous driving. It proposes a training-free approach that takes features from the InternImage backbone and applies K-Means clustering with confidence thresholding on raw decoder logits to identify out-of-distribution regions, without any outlier supervision. With InternImage-L, the method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77 on ADE-OoD, surpassing several supervised and unsupervised baselines.
该研究解决了在语义分割中检测未知物体的问题,这对于自动驾驶等安全应用至关重要。利用如InternImage等基础模型的丰富特征,该方法无需额外训练即可识别出分布外(OoD)区域。提出的方法通过在原始解码器logits上应用K-Means聚类和置信度阈值化来区分在分布(ID)区域和OoD区域。该方法在RoadAnomaly和ADE-OoD基准测试中取得了与多种监督和无监督基线相当甚至更好的结果。
ViLBias: Detecting and Reasoning about Bias in Multimodal Content
Authors: Shaina Raza, Caesar Saleh, Azib Farooq, Emrul Hasan, Franklin Ogidi, Maximus Powers, Veronica Chatrath, Marcelo Lotif, Karanpal Sekhon, Roya Javadi, Haad Zahid, Anam Zahid, Vahid Reza Khazaie, Zhenyu Yu
First: 2024-12-22T15:05:30+00:00 · Latest: 2025-10-03T11:22:35+00:00
Comments: Under review
Abstract
Detecting bias in multimodal news requires models that reason over
text-image pairs, not just classify text. In response, we present ViLBias, a
VQA-style benchmark and framework for detecting and reasoning about bias in
multimodal news. The dataset comprises 40,945 text-image pairs from diverse
outlets, each annotated with a bias label and concise rationale using a
two-stage LLM-as-annotator pipeline with hierarchical majority voting and
human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large
Language Models (LLMs), and Vision-Language Models (VLMs) across closed-ended
classification and open-ended reasoning (oVQA), and compare parameter-efficient
tuning strategies. Results show that incorporating images alongside text
improves detection accuracy by 3-5%, and that LLMs/VLMs better capture subtle
framing and text-image inconsistencies than SLMs. Parameter-efficient methods
(LoRA/QLoRA/Adapters) recover 97-99% of full fine-tuning performance with
<5% trainable parameters. For oVQA, reasoning accuracy spans 52-79% and
faithfulness 68-89%, both improved by instruction tuning; closed accuracy
correlates strongly with reasoning (r = 0.91). ViLBias offers a scalable
benchmark and strong baselines for multimodal bias detection and rationale
quality.
中文标题/摘要
标题:ViLBias:检测和推理多模态内容中的偏见
检测多模态新闻中的偏见需要模型能够在文本-图像对上进行推理,而不仅仅是对文本进行分类。为此,我们提出了ViLBias,这是一种基于VQA风格的基准和框架,用于检测和推理多模态新闻中的偏见。数据集包含来自不同来源的40,945个文本-图像对,每个对都用两级LLM作为注释者管道和分层多数投票进行标注,并通过人工在环验证。我们评估了小型语言模型(SLMs)、大型语言模型(LLMs)和视觉-语言模型(VLMs)在封闭分类和开放推理(oVQA)中的表现,并比较了参数高效调优策略。结果显示,将图像与文本结合使用可提高检测准确性3-5%,并且LLMs/VLMs比SLMs更好地捕捉到了微妙的框架和文本-图像不一致。参数高效方法(LoRA/QLoRA/适配器)在不到5%的可训练参数下恢复了97-99%的全量微调性能。对于oVQA,推理准确率在52-79%之间,忠实度在68-89%之间,两者都通过指令调优得到了提升;封闭准确度与推理的相关性很强(r = 0.91)。ViLBias提供了一个可扩展的基准和多模态偏见检测以及推理质量的强基线。
Summary / 总结
The research aims to detect bias in multimodal news by developing a VQA-style benchmark called ViLBias, comprising 40,945 annotated text-image pairs. It evaluates SLMs, LLMs, and VLMs on closed-ended classification and open-ended reasoning tasks, showing that incorporating images improves detection accuracy by 3-5% and that LLMs/VLMs capture subtle framing and text-image inconsistencies better than SLMs. Parameter-efficient tuning methods recover 97-99% of full fine-tuning performance with fewer than 5% trainable parameters. In open-ended tasks, reasoning accuracy ranges from 52% to 79% and faithfulness from 68% to 89%, and instruction tuning improves both.
研究旨在通过开发名为ViLBias的VQA风格基准来检测多模态新闻中的偏见,评估了SLMs、LLMs和VLMs在封闭式分类和开放式推理任务上的表现。关键发现表明,使用图像与文本结合可以提高3-5%的偏见检测准确性,LLMs/VLMs在捕捉微妙的框架和文本-图像不一致方面优于SLMs。参数高效调优方法可以恢复接近完全微调的性能,且可调参数少于5%。开放式任务的推理准确率范围为52-79%,忠实度为68-89%,指令调优可以提升这两方面的表现。
NCV: A Node-Wise Consistency Verification Approach for Low-Cost Structured Error Localization in LLM Reasoning
Authors: Yulong Zhang, Li Wang, Wei Du, Peilin Li, Yuqin Dai, Zhiyuan Zhao, Lingyong Fang, Ziniu Liu, Ru Zhang, Huijia Zhu, Gongshen Liu
First: 2025-10-03T08:48:04+00:00 · Latest: 2025-10-03T08:48:04+00:00
Abstract
Verifying multi-step reasoning in large language models is difficult due to
imprecise error localization and high token costs. Existing methods either
assess entire reasoning chains, suffering attention dilution, or rely on
expensive multi-sampling. We introduce Node-wise Consistency Verification
(NCV), a training-free framework that recasts verification as lightweight
binary consistency checks at the node level. By decomposing the chain of
thought into interconnected verification nodes, NCV precisely localizes errors
and avoids unnecessary long-form generation. Experiments demonstrate that our
approach enhances interpretability and efficiency, presenting a scalable
solution for reliable LLM reasoning verification. On public datasets, NCV
achieves a 10% to 25% improvement in F1 scores over baselines while utilizing
6x to 58x fewer tokens than traditional methods like CoT-based
verifiers.
中文标题/摘要
标题:NCV:一种低成本结构化错误定位的节点级一致性验证方法
由于精确的错误定位不精确和高token成本,大型语言模型中的多步推理验证非常困难。现有方法要么评估整个推理链,导致注意力稀释,要么依赖昂贵的多采样。我们提出了节点级一致性验证(NCV),这是一种无需训练的框架,将验证重新定义为节点级别的轻量级二元一致性检查。通过将推理链分解为相互连接的验证节点,NCV能够精确定位错误并避免不必要的长形式生成。实验表明,我们的方法提高了可解释性和效率,提供了一种可扩展的可靠LLM推理验证解决方案。在公共数据集上,NCV在F1分数上比基线方法提高了10%到25%,而使用的token数量仅为传统方法(如基于CoT的验证器)的六分之一到五十八分之一。
Summary / 总结
The research addresses the difficulty of verifying multi-step reasoning in large language models by proposing Node-wise Consistency Verification (NCV), a training-free approach that performs lightweight binary consistency checks at the node level. The method decomposes the reasoning chain into interconnected verification nodes, which allows precise error localization and avoids unnecessary long-form generation. On public datasets, NCV improves F1 scores by 10% to 25% over baselines while using 6 to 58 times fewer tokens than traditional methods such as CoT-based verifiers.
研究旨在通过解决精确错误定位不准确和高token成本的问题,提高大型语言模型多步推理的验证。引入了节点一致性验证(NCV)方法,这是一种无需训练的框架,通过节点级别的轻量级二元一致性检查来执行验证。该方法将推理链分解为相互连接的验证节点,从而实现精确的错误定位并减少长形式生成的需求。实验结果表明,NCV提高了可解释性和效率,相比基线方法在F1分数上提高了高达25%,同时使用了比传统方法如基于CoT的验证器少得多的token。
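The node-level idea in the NCV entry above can be illustrated with a small sketch. This is not the authors' implementation: the `Node` structure, the `check` callback (a stand-in for a short yes/no query to an LLM), and the toy arithmetic chain are all hypothetical and chosen only to show how per-node binary checks localize the first faulty step.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One verification node: a single claim plus the nodes it depends on."""
    node_id: str
    claim: str
    depends_on: List[str] = field(default_factory=list)


def localize_first_error(
    nodes: List[Node],
    check: Callable[[Node, List[Node]], bool],
) -> Optional[str]:
    """Run one binary consistency check per node, in chain order.

    `check` is a hypothetical stand-in for a yes/no LLM query. It sees the
    node and only the nodes it depends on, never the whole chain.
    Returns the id of the first inconsistent node, or None if all pass.
    """
    by_id: Dict[str, Node] = {n.node_id: n for n in nodes}
    for node in nodes:
        premises = [by_id[d] for d in node.depends_on]
        if not check(node, premises):
            return node.node_id
    return None


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def toy_arithmetic_check(node: Node, premises: List[Node]) -> bool:
    """Toy checker for claims of the form 'a op b = c' (assumed format)."""
    a, op, b, _, c = node.claim.split()
    return OPS[op](int(a), int(b)) == int(c)


chain = [
    Node("n1", "12 * 3 = 36"),
    Node("n2", "36 + 7 = 44", depends_on=["n1"]),  # deliberate error
    Node("n3", "44 - 4 = 40", depends_on=["n2"]),
]
first_bad = localize_first_error(chain, toy_arithmetic_check)
print(first_bad)
```

Each check here inspects one short claim, which is the property the abstract credits for the token savings; a real checker would replace `toy_arithmetic_check` with a model call.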
Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis
Authors: Feng Yuan, Yifan Gao, Yuehua Ye, Haoyue Li, Xin Gao
First: 2025-10-03T08:47:17+00:00 · Latest: 2025-10-03T08:47:17+00:00
Comments: ICLR2026 under review
Abstract
Cross-modal medical image synthesis research focuses on reconstructing
missing imaging modalities from available ones to support clinical diagnosis.
Driven by clinical necessities for flexible modality reconstruction, we explore
K to N medical generation, where three critical challenges emerge: How can we
model the heterogeneous contributions of different modalities to various target
tasks? How can we ensure fusion quality control to prevent degradation from
noisy information? How can we maintain modality identity consistency in
multi-output generation? Driven by these clinical necessities, and drawing
inspiration from SAM2's sequential frame paradigm and clinicians' progressive
workflow of incrementally adding and selectively integrating multi-modal
information, we treat multi-modal medical data as sequential frames with
quality-driven selection mechanisms. Our key idea is to "learn" adaptive
weights for each modality-task pair and "memorize" beneficial fusion patterns
through progressive enhancement. To achieve this, we design three collaborative
modules: PreWeightNet for global contribution assessment, ThresholdNet for
adaptive filtering, and EffiWeightNet for effective weight computation.
Meanwhile, to maintain modality identity consistency, we propose the Causal
Modality Identity Module (CMIM) that establishes causal constraints between
generated images and target modality descriptions using vision-language
modeling. Extensive experimental results demonstrate that our proposed Med-K2N
outperforms state-of-the-art methods by significant margins on multiple
benchmarks. Source code is available.
中文标题/摘要
标题:Med-K2N: 灵活的K到N模态转换方法在医学图像合成中的应用
跨模态医学图像合成研究旨在从现有模态重建缺失的成像模态,以支持临床诊断。受灵活模态重建的临床需求驱动,我们探索了K到N医学生成,其中三个关键挑战浮现:如何建模不同模态对各种目标任务的异质贡献?如何确保融合质量控制以防止从噪声信息中退化?如何在多输出生成中保持模态身份一致性?受这些临床需求的驱动,以及从SAM2的顺序帧范式和临床医生逐步增加和选择性整合多模态信息的渐进工作流程中汲取灵感,我们将多模态医学数据视为具有质量驱动选择机制的顺序帧。我们的核心思想是“学习”每个模态-任务对的自适应权重,并通过渐进增强“记忆”有益的融合模式。为了实现这一点,我们设计了三个协作模块:PreWeightNet进行全局贡献评估,ThresholdNet进行自适应过滤,EffiWeightNet进行有效权重计算。同时,为了保持模态身份一致性,我们提出了因果模态身份模块(CMIM),该模块使用视觉-语言建模在生成图像和目标模态描述之间建立因果约束。广泛的实验结果表明,我们提出的Med-K2N在多个基准上显著优于现有最先进的方法。源代码已公开。
Summary / 总结
The research aims to reconstruct missing imaging modalities from available ones for clinical diagnosis, addressing the need for flexible modality reconstruction. The method involves treating multi-modal medical data as sequential frames and using three collaborative modules: PreWeightNet, ThresholdNet, and EffiWeightNet, to assess global contributions, filter information adaptively, and compute effective weights. The proposed Med-K2N model outperforms existing methods on multiple benchmarks by significantly improving fusion quality and maintaining modality identity consistency.
研究旨在从现有影像模态中重建缺失的影像模态以支持临床诊断,满足灵活的模态重建需求。方法将多模态医学数据视为连续帧,并使用三个协作模块:PreWeightNet、ThresholdNet 和 EffiWeightNet,来评估全局贡献、适配性过滤和有效权重计算。提出的 Med-K2N 模型在多个基准上显著优于现有方法,通过提高融合质量和保持模态身份一致性来实现这一目标。
MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding
Authors: Jingyuan Deng, Yujiu Yang
First: 2025-10-03T07:59:16+00:00 · Latest: 2025-10-03T07:59:16+00:00
Comments: accepted to emnlp2025 findings
Abstract
Large vision-language models (LVLMs) have shown remarkable performance in
visual-language understanding for downstream multimodal tasks. While their
capabilities are improving, problems emerge simultaneously. Among those
problems, hallucinations, where LVLMs generate content that contradicts their
input visual and text contents, have attracted much attention. Many approaches
have been proposed to deal with this issue,
such as contrastive decoding and attention manipulation. However, contrastive
decoding methods struggle in constructing appropriate contrastive samples, and
attention manipulation methods are highly sensitive, lacking stability. In this
work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach
utilizes the "image heads" in LVLMs, masking them to construct contrastive
samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and
Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The
results demonstrate that MaskCD effectively alleviates the phenomenon of
hallucinations and retains the general capabilities of LVLMs. Corresponding
resources could be found at: https://github.com/Deng-Jingyuan/MaskCD .
中文标题/摘要
标题:MaskCD:通过图像头部遮蔽对比解码减轻LVLM幻觉
大型视觉-语言模型(LVLMs)在下游多模态任务中的视觉-语言理解方面表现出色。尽管其能力在不断提高,但同时也出现了问题。这些问题中,幻觉引起了广泛关注,指的是LVLMs生成与其输入的视觉和文本内容相矛盾的内容。已经提出了许多方法来解决这一问题,例如对比解码和注意力操作。然而,对比解码方法在构建适当的对比样本方面存在困难,而注意力操作方法则高度敏感,缺乏稳定性。在本工作中,我们提出了图像头部遮蔽对比解码(MaskCD)。我们的方法利用LVLM中的“图像头部”,对其进行遮蔽以构建对比样本用于对比解码。我们在LLaVA-1.5-7b和Qwen-VL-7b上对MaskCD进行了评估,使用了诸如CHAIR、POPE、AMBER和MME等多种基准。结果表明,MaskCD有效地缓解了幻觉现象,并保留了LVLMs的一般能力。相关资源可以在:https://github.com/Deng-Jingyuan/MaskCD 找到。
Summary / 总结
MaskCD is proposed to address the hallucinations in large vision-language models (LVLMs) by utilizing the 'image heads' to mask and construct contrastive samples for contrastive decoding. Evaluated on LLaVA-1.5-7b and Qwen-VL-7b using CHAIR, POPE, AMBER, and MME benchmarks, MaskCD effectively reduces hallucinations while maintaining the general capabilities of LVLMs.
MaskCD 通过掩蔽 '图像头部' 来构建对比样本进行对比解码,以解决大型视觉语言模型(LVLM)中的幻觉问题。在LLaVA-1.5-7b 和 Qwen-VL-7b 上使用CHAIR、POPE、AMBER 和 MME 等基准进行评估,MaskCD 有效减少了幻觉现象,同时保持了LVLM 的一般能力。
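To make the contrastive decoding step in the MaskCD entry above concrete, here is a minimal sketch of the generic contrastive decoding rule found in prior work. It is an assumption that MaskCD combines the two distributions in exactly this way; the abstract does not give the formula. The function name, `alpha`, `beta`, and the toy logits are hypothetical, and `masked_logits` stands in for a forward pass with the image heads masked.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(
    full_logits: List[float],
    masked_logits: List[float],
    alpha: float = 1.0,   # contrast strength (assumed hyperparameter)
    beta: float = 0.1,    # plausibility cutoff (assumed hyperparameter)
) -> List[float]:
    """Score tokens as (1 + alpha) * log p_full - alpha * log p_masked.

    Tokens whose full-model probability is below beta times the top
    probability are excluded, so the contrast cannot promote implausible
    tokens.
    """
    lp_full = log_softmax(full_logits)
    lp_masked = log_softmax(masked_logits)
    cutoff = max(lp_full) + math.log(beta)
    return [
        (1 + alpha) * f - alpha * m if f >= cutoff else float("-inf")
        for f, m in zip(lp_full, lp_masked)
    ]


# Toy vocabulary of three tokens. Token 0 stays likely even without image
# information (a language prior); token 1 is supported mainly by the image.
full_logits = [2.0, 1.9, -3.0]
masked_logits = [2.0, 0.5, -3.0]

scores = contrastive_scores(full_logits, masked_logits)
greedy_full = max(range(3), key=full_logits.__getitem__)
greedy_cd = max(range(3), key=scores.__getitem__)
print(greedy_full, greedy_cd)
```

In this toy case plain greedy decoding picks the prior-driven token, while the contrastive scores favor the token whose probability drops when image information is removed.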
OTR: Synthesizing Overlay Text Dataset for Text Removal
Authors: Jan Zdenek, Wataru Shimoda, Kota Yamaguchi
Venue: MM
First: 2025-10-03T07:44:07+00:00 · Latest: 2025-10-03T07:44:07+00:00
Comments: This is the author's version of the work. It is posted here for your
personal use. Not for redistribution. The definitive Version of Record was
published in Proceedings of the 33rd ACM International Conference on
Multimedia (MM '25), October 27-31, 2025, Dublin, Ireland,
https://doi.org/10.1145/3746027.3758297
Abstract
Text removal is a crucial task in computer vision with applications such as
privacy preservation, image editing, and media reuse. While existing research
has primarily focused on scene text removal in natural images, limitations in
current datasets hinder out-of-domain generalization or accurate evaluation. In
particular, widely used benchmarks such as SCUT-EnsText suffer from ground
truth artifacts due to manual editing, overly simplistic text backgrounds, and
evaluation metrics that do not capture the quality of generated results. To
address these issues, we introduce an approach to synthesizing a text removal
benchmark applicable to domains other than scene texts. Our dataset features
text rendered on complex backgrounds using object-aware placement and
vision-language model-generated content, ensuring clean ground truth and
challenging text removal scenarios. The dataset is available at
https://huggingface.co/datasets/cyberagent/OTR .
中文标题/摘要
标题:OTR:合成覆盖文本数据集以实现文本去除
文本去除是计算机视觉中的关键任务,具有隐私保护、图像编辑和媒体重用等应用。尽管现有研究主要集中在自然图像中的场景文本去除,但当前数据集的局限性阻碍了跨域泛化或准确评估。特别是,广泛使用的基准数据集如SCUT-EnsText由于人工编辑导致的地面真实数据缺陷、过于简单的文本背景以及无法捕捉生成结果质量的评估指标,这些问题亟待解决。为了解决这些问题,我们提出了一种合成适用于除场景文本外其他领域的文本去除基准的方法。我们的数据集通过对象感知放置和视觉语言模型生成的内容,实现了干净的地面真实数据和具有挑战性的文本去除场景。数据集可在https://huggingface.co/datasets/cyberagent/OTR 获取。
Summary / 总结
The research aims to improve text removal in computer vision by addressing limitations in existing datasets, such as ground truth artifacts, overly simple text backgrounds, and evaluation metrics that do not capture result quality. The method synthesizes a new benchmark called OTR, in which text is rendered on complex backgrounds using object-aware placement and vision-language model-generated content. The resulting dataset provides clean ground truth and challenging text removal scenarios for domains other than scene text. The dataset is publicly available at https://huggingface.co/datasets/cyberagent/OTR .
研究旨在通过解决现有数据集的局限性,提高计算机视觉中的文本去除效果。方法是合成一个名为OTR的新数据集,使用对象感知放置和视觉语言模型生成的内容来创建复杂的背景文本。关键发现包括干净的地面真实和具有挑战性的文本去除场景,增强了跨域泛化和准确评估。数据集可在https://huggingface.co/datasets/cyberagent/OTR 获取。
Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models
Authors: Prahitha Movva
Venue: COLM 2025: First Workshop on the Application of LLM Explainability
to Reasoning and Planning
First: 2025-10-03T07:27:47+00:00 · Latest: 2025-10-03T07:27:47+00:00
Abstract
Vision-Language Models (VLMs) excel at many multimodal tasks, yet their
cognitive processes remain opaque on complex lateral thinking challenges like
rebus puzzles. While recent work has demonstrated these models struggle
significantly with rebus puzzle solving, the underlying reasoning processes and
failure patterns remain largely unexplored. We address this gap through a
comprehensive explainability analysis that moves beyond performance metrics to
understand how VLMs approach these complex lateral thinking challenges. Our
study contributes a systematically annotated dataset of 221 rebus puzzles
across six cognitive categories, paired with an evaluation framework that
separates reasoning quality from answer correctness. We investigate three
prompting strategies designed to elicit different types of explanatory
processes and reveal critical insights into VLM cognitive processes. Our
findings demonstrate that reasoning quality varies dramatically across puzzle
categories, with models showing systematic strengths in visual composition
while exhibiting fundamental limitations in absence interpretation and cultural
symbolism. We also discover that prompting strategy substantially influences
both cognitive approach and problem-solving effectiveness, establishing
explainability as an integral component of model performance rather than a
post-hoc consideration.
中文标题/摘要
标题:推理谜题:可解释性揭示视觉-语言模型的认知局限
视觉-语言模型(VLMs)在许多多模态任务中表现出色,但在解决复杂的横向思维挑战如谜语时其认知过程仍然不透明。尽管最近的研究表明这些模型在解决谜语时面临显著困难,但其背后的推理过程和失败模式仍鲜有探索。我们通过一项全面的可解释性分析来填补这一空白,该分析超越了性能指标,以理解VLMs如何应对这些复杂的横向思维挑战。我们的研究贡献了一个系统注释的221个谜语数据集,涵盖了六个认知类别,并配有一个将推理质量与答案正确性分开的评估框架。我们研究了三种不同的提示策略,以激发不同类型的解释过程,并揭示了VLM认知过程的关键见解。我们的研究结果表明,谜题类别之间推理质量存在巨大差异,模型在视觉组合方面表现出系统性优势,但在缺失解释和文化象征方面表现出根本性局限。我们还发现,提示策略对认知方法和问题解决效果有重大影响,确立了可解释性作为模型性能不可或缺的组成部分,而不仅仅是事后考虑。
Summary / 总结
This study explores the cognitive processes of Vision-Language Models (VLMs) in solving rebus puzzles, a complex lateral thinking challenge. By conducting a comprehensive explainability analysis, the researchers identified significant variations in reasoning quality across different puzzle categories and highlighted VLMs' strengths in visual composition and limitations in absence interpretation and cultural symbolism. The study also found that different prompting strategies can influence both the cognitive approach and problem-solving effectiveness, emphasizing the importance of explainability in understanding VLMs' reasoning processes.
该研究探讨了视觉语言模型(VLMs)在解决谜语这类横向思维挑战时的认知局限。通过全面的解释性分析,研究人员发现了不同谜题类别中推理质量的显著差异,并指出了VLMs在解释缺失和文化符号方面存在的根本局限。研究还发现,不同的提示策略显著影响了VLMs的认知方法和问题解决效果,强调了解释性在理解模型行为中的重要性。
AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
Authors: Xian Zhang, Zexi Wu, Zinuo Li, Hongming Xu, Luqi Gong, Farid Boussaid, Naoufel Werghi, Mohammed Bennamoun
First: 2025-10-03T07:19:34+00:00 · Latest: 2025-10-03T07:19:34+00:00
Abstract
Understanding long-form videos remains a significant challenge for
vision-language models (VLMs) due to their extensive temporal length and high
information density. Most current multimodal large language models (MLLMs) rely
on uniform sampling, which often overlooks critical moments, leading to
incorrect responses to queries. In parallel, many keyframe selection approaches
impose rigid temporal spacing: once a frame is chosen, an exclusion window
suppresses adjacent timestamps to reduce redundancy. While effective at
limiting overlap, this strategy frequently misses short, fine-grained cues near
important events. Other methods instead emphasize visual diversity but neglect
query relevance. We propose AdaRD-Key, a training-free keyframe sampling module
for query-driven long-form video understanding. AdaRD-Key maximizes a unified
Relevance-Diversity Max-Volume (RD-MV) objective, combining a
query-conditioned relevance score with a log-determinant diversity component to
yield informative yet non-redundant frames. To handle broad queries with weak
alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating
mechanism; when the relevance distribution indicates weak alignment, the method
seamlessly shifts into a diversity-only mode, enhancing coverage without
additional supervision. Our pipeline is training-free, computationally
efficient (running in real time on a single GPU), and compatible with existing
VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and
Video-MME demonstrate state-of-the-art performance, particularly on long-form
videos. Code available at https://github.com/Xian867/AdaRD-Key.
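The AdaRD-Key abstract above describes combining a relevance score with a log-determinant diversity term and falling back to diversity only when relevance is weak. The sketch below illustrates that idea with a greedy selector. It is not the released code: the additive form of the objective, the spread-based gate, and the names `diversity_weight`, `gate`, and `ridge` are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def log_det_spd(matrix: List[List[float]]) -> float:
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][i] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def select_keyframes(
    features: List[List[float]],
    relevance: List[float],
    k: int,
    diversity_weight: float = 1.0,  # assumed trade-off weight
    gate: float = 0.05,             # assumed threshold on relevance spread
    ridge: float = 1e-3,            # keeps the Gram matrix positive definite
) -> List[int]:
    """Greedily pick k frames maximizing relevance plus log-det diversity."""
    # Assumed gate: nearly flat relevance means the query is weakly aligned,
    # so relevance is ignored and only diversity drives the selection.
    use_relevance = (max(relevance) - min(relevance)) >= gate
    selected: List[int] = []
    for _ in range(k):
        best, best_score = -1, float("-inf")
        for i in range(len(features)):
            if i in selected:
                continue
            cand = selected + [i]
            gram = [
                [dot(features[a], features[b]) + (ridge if a == b else 0.0)
                 for b in cand]
                for a in cand
            ]
            score = diversity_weight * log_det_spd(gram)
            if use_relevance:
                score += sum(relevance[j] for j in cand)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)


# Four toy frames as unit vectors; frames 0 and 1 are near-duplicates.
angles = [0.0, 0.1, math.pi / 2, math.pi / 4]
frames = [[math.cos(a), math.sin(a)] for a in angles]
picked = select_keyframes(frames, [0.9, 0.85, 0.2, 0.3], k=2)
picked_flat = select_keyframes(frames, [0.5, 0.5, 0.5, 0.5], k=2)
print(picked, picked_flat)
```

The log-determinant of the Gram matrix shrinks sharply when two selected frames are nearly parallel, which is why the second most relevant frame is skipped when it duplicates the first.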
Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models
Authors: Lihua Zhou, Mao Ye, Shuaifeng Li, Nianxin Li, Jinlin Wu, Xiatian Zhu, Lei Deng, Hongbin Liu, Jiebo Luo, Zhen Lei
First: 2025-10-03T06:27:33+00:00 · Latest: 2025-10-03T06:27:33+00:00
Comments: Under Review
Abstract
Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved
remarkable success in object recognition and detection. However, their
performance often degrades under real-world distribution shifts. Test-time
adaptation (TTA) aims to mitigate this issue by adapting models during
inference. Existing methods either rely on computationally expensive
backpropagation, which hinders real-time deployment, or focus solely on
likelihood adaptation, which overlooks the critical role of the prior. Our
prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for
object recognition by introducing a training-free framework that incorporates
adaptive priors. Building upon this foundation, we now present Bayesian Class
Adaptation plus (BCA+), a unified, training-free framework for TTA for both
object recognition and detection. BCA+ introduces a dynamic cache that
adaptively stores and updates class embeddings, spatial scales (for detection),
and, crucially, adaptive class priors derived from historical predictions. We
formulate adaptation as a Bayesian inference problem, where final predictions
are generated by fusing the initial VLM output with a cache-based prediction.
This cache-based prediction combines a dynamically updated likelihood
(measuring feature and scale similarity) and a prior (reflecting the evolving
class distribution). This dual-adaptation mechanism, coupled with
uncertainty-guided fusion, enables BCA+ to correct both the model's semantic
understanding and its contextual confidence. As a training-free method
requiring no backpropagation, BCA+ is highly efficient. Extensive experiments
demonstrate that BCA+ achieves state-of-the-art performance on both recognition
and detection benchmarks.
中文标题/摘要
标题:基于贝叶斯测试时适应的视觉-语言模型物体识别与检测
视觉-语言模型(VLMs)如CLIP和Grounding DINO在物体识别和检测方面取得了显著成功。然而,它们在现实世界分布变化下的性能往往会下降。测试时适应(TTA)旨在通过在推理过程中调整模型来缓解这一问题。现有方法要么依赖于计算成本高昂的反向传播,这妨碍了实时部署,要么仅专注于似然性适应,忽视了先验的关键作用。我们之前的工作,贝叶斯类别适应(BCA),通过引入一个无需训练的框架,结合了自适应先验,解决了这些不足之处,适用于物体识别。在此基础上,我们现提出贝叶斯类别适应增强版(BCA+),这是一个统一的、无需训练的框架,用于物体识别和检测的TTA。BCA+引入了一个动态缓存,可以自适应地存储和更新类别嵌入、空间尺度(用于检测)以及历史预测推导出的自适应类别先验。我们将适应问题表述为贝叶斯推理问题,最终预测是通过融合初始VLM输出和基于缓存的预测生成的。基于缓存的预测结合了动态更新的似然性(衡量特征和尺度相似性)和先验(反映类别分布的演变)。这种双重适应机制,结合不确定性引导的融合,使BCA+能够纠正模型的语义理解和上下文信心。作为一种无需训练的方法,BCA+不需要反向传播,因此非常高效。广泛的实验表明,BCA+在识别和检测基准上达到了最先进的性能。
Summary / 总结
The paper addresses the performance degradation of vision-language models (VLMs) under real-world distribution shifts by proposing Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for test-time adaptation in object recognition and detection. BCA+ introduces a dynamic cache that stores and updates class embeddings, spatial scales, and adaptive class priors. The framework formulates adaptation as a Bayesian inference problem, combining a dynamically updated likelihood with a prior to generate final predictions. Experiments show that BCA+ outperforms existing methods on both recognition and detection benchmarks.
本文通过引入BCA+,一种统一的无需训练的测试时适应框架,解决了视觉-语言模型(VLMs)在现实世界分布变化下的性能下降问题,该框架适用于对象识别和检测。BCA+使用动态缓存来存储和更新类别嵌入、空间尺度以及适应的类别先验,并将适应问题形式化为贝叶斯推理问题。该方法在识别和检测基准上达到了最先进的性能,且无需进行计算昂贵的反向传播。
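The Bayesian fusion described in the BCA+ entry above can be sketched as follows. The abstract names the components (a cache of class embeddings, an adaptive prior from past predictions, and uncertainty-guided fusion with the initial VLM output) but not the formulas, so the entropy-based fusion weight, the momentum update, and every name below are assumptions. Spatial scales for detection are omitted.

```python
import math
from typing import List, Sequence


def softmax(x: Sequence[float]) -> List[float]:
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p: Sequence[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


class BayesianCache:
    """Hypothetical cache holding class embeddings and prediction counts."""

    def __init__(self, class_embeddings: List[List[float]],
                 temperature: float = 0.1) -> None:
        self.embeddings = [list(e) for e in class_embeddings]  # likelihood side
        self.counts = [1.0] * len(class_embeddings)  # prior side, uniform start
        self.temperature = temperature

    def prior(self) -> List[float]:
        total = sum(self.counts)
        return [c / total for c in self.counts]

    def posterior(self, feature: Sequence[float]) -> List[float]:
        """Posterior proportional to likelihood times prior."""
        sims = [sum(f * e for f, e in zip(feature, emb))
                for emb in self.embeddings]
        likelihood = softmax([s / self.temperature for s in sims])
        joint = [l * p for l, p in zip(likelihood, self.prior())]
        z = sum(joint)
        return [j / z for j in joint]

    def predict(self, feature: Sequence[float],
                vlm_probs: Sequence[float]) -> List[float]:
        """Fuse VLM and cache predictions; the lower-entropy source weighs more."""
        cache_probs = self.posterior(feature)
        h_vlm, h_cache = entropy(vlm_probs), entropy(cache_probs)
        w_cache = h_vlm / (h_vlm + h_cache + 1e-12)
        return [(1.0 - w_cache) * v + w_cache * c
                for v, c in zip(vlm_probs, cache_probs)]

    def update(self, feature: Sequence[float], fused: Sequence[float],
               momentum: float = 0.9) -> int:
        """Update the predicted class's count and embedding without backprop."""
        k = max(range(len(fused)), key=fused.__getitem__)
        self.counts[k] += 1.0
        self.embeddings[k] = [momentum * e + (1.0 - momentum) * f
                              for e, f in zip(self.embeddings[k], feature)]
        return k


cache = BayesianCache([[1.0, 0.0], [0.0, 1.0]])
fused = cache.predict([0.9, 0.1], vlm_probs=[0.5, 0.5])
label = cache.update([0.9, 0.1], fused)
print(label, cache.prior())
```

Both the likelihood (embeddings) and the prior (counts) change after each prediction, which mirrors the dual adaptation the abstract describes, and no gradient computation is involved.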
Filter-Guided Diffusion for Controllable Image Generation
Authors: Zeqi Gu, Ethan Yang, Abe Davis
Venue: SIGGRAPH 2024
First: 2023-06-29T17:44:18+00:00 · Latest: 2025-10-03T06:04:38+00:00
Comments: First two listed authors have equal contribution. The latest version
has been accepted to SIGGRAPH 2024
Abstract
Recent advances in diffusion-based generative models have shown incredible
promise for zero shot image-to-image translation and editing. Most of these
approaches work by combining or replacing network-specific features used in the
generation of new images with those taken from the inversion of some guide
image. Methods of this type are considered the current state-of-the-art in
training-free approaches, but have some notable limitations: they tend to be
costly in runtime and memory, and often depend on deterministic sampling that
limits variation in generated results. We propose Filter-Guided Diffusion
(FGD), an alternative approach that leverages fast filtering operations during
the diffusion process to support finer control over the strength and
frequencies of guidance and can work with non-deterministic samplers to produce
greater variety. With its efficiency, FGD can be sampled over multiple seeds
and hyperparameters in less time than a single run of other SOTA methods to
produce superior results based on structural and semantic metrics. We conduct
extensive quantitative and qualitative experiments to evaluate the performance
of FGD in translation tasks and also demonstrate its potential in localized
editing when used with masks. Project page:
https://filterguideddiffusion.github.io/
中文标题/摘要
标题:基于滤波的扩散模型以实现可控的图像生成
基于扩散的生成模型的最新进展在零样本图像到图像的翻译和编辑方面展现了巨大的潜力。大多数这些方法通过将生成新图像时使用的网络特定特征与某些引导图像的反演特征结合起来或替换来工作。这类方法被认为是无训练方法的当前最佳实践,但有一些明显的局限性:它们在运行时间和内存方面通常较为昂贵,并且往往依赖于确定性采样,这限制了生成结果的多样性。我们提出了基于滤波的扩散(FGD),这是一种利用扩散过程中快速滤波操作的方法,可以支持对引导强度和频率的更精细控制,并且可以与非确定性采样器一起工作以产生更大的多样性。凭借其效率,FGD可以在比其他最新方法单次运行更短的时间内通过多个种子和超参数进行采样,从而根据结构和语义指标生成更优的结果。我们进行了广泛的定量和定性实验来评估FGD在翻译任务中的性能,并展示了其在与掩膜一起使用时进行局部编辑的潜力。项目页面:https://filterguideddiffusion.github.io/
Summary / 总结
The paper introduces Filter-Guided Diffusion (FGD), a method that uses fast filtering operations during the diffusion process to enhance control over image generation. This approach offers greater flexibility and variety in generated images compared to existing methods, which often rely on deterministic sampling and are costly in terms of runtime and memory. FGD can produce superior results based on structural and semantic metrics, and it can be sampled over multiple seeds and hyperparameters more efficiently than state-of-the-art methods.
该论文提出了Filter-Guided Diffusion (FGD) 方法,通过在扩散过程中使用快速滤波操作来增强图像生成的控制和效率。这种方法允许对指导强度和频率进行更精细的控制,并且可以与非确定性采样器一起工作以生成更多样化的结果。实验表明,FGD 在结构和语义指标上优于现有最佳方法,并且在使用多个种子和超参数采样时可以更快地生成更优结果。
ExGS: Extreme 3D Gaussian Compression with Diffusion Priors
Authors: Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun
First: 2025-09-29T13:23:06+00:00 · Latest: 2025-10-03T03:11:54+00:00
Abstract
Neural scene representations, such as 3D Gaussian Splatting (3DGS), have
enabled high-quality neural rendering; however, their large storage and
transmission costs hinder deployment in resource-constrained environments.
Existing compression methods either rely on costly optimization, which is slow
and scene-specific, or adopt training-free pruning and quantization, which
degrade rendering quality under high compression ratios. In contrast, recent
data-driven approaches provide a promising direction to overcome this
trade-off, enabling efficient compression while preserving high rendering
quality. We introduce ExGS, a novel feed-forward framework that unifies
Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS
compression. UGC performs re-optimization-free pruning to aggressively reduce
Gaussian primitives while retaining only essential information, whereas
GaussPainter leverages powerful diffusion priors with mask-guided refinement to
restore high-quality renderings from heavily pruned Gaussian scenes. Unlike
conventional inpainting, GaussPainter not only fills in missing regions but
also enhances visible pixels, yielding substantial improvements in degraded
renderings. To ensure practicality, it adopts a lightweight VAE and a one-step
diffusion design, enabling real-time restoration. Our framework can even
achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31
MB) while preserving fidelity and significantly improving image quality under
challenging conditions. These results highlight the central role of diffusion
priors in bridging the gap between extreme compression and high-quality neural
rendering. Our code repository will be released at:
https://github.com/chenttt2001/ExGS
中文标题/摘要
标题:ExGS: 极端3D高斯压缩与扩散先验
神经场景表示,如3D高斯斑点化(3DGS),已实现高质量的神经渲染;然而,其庞大的存储和传输成本阻碍了在资源受限环境中的部署。现有压缩方法要么依赖昂贵的优化,这既慢又场景特定,要么采用无训练剪枝和量化,这在高压缩比下会降低渲染质量。相比之下,最近的数据驱动方法为克服这一权衡提供了有希望的方向,实现了高效压缩同时保持高质量渲染。我们引入了ExGS,这是一种新颖的前馈框架,将通用高斯压缩(UGC)与GaussPainter结合用于极端3DGS压缩。UGC通过不重新优化的剪枝激进地减少高斯原语,同时保留关键信息,而GaussPainter利用强大的扩散先验和掩码引导细化,从高度剪枝的高斯场景中恢复高质量渲染。与传统的修复不同,GaussPainter不仅填补缺失区域,还增强可见像素,显著改善了降级渲染。为了确保实用性,它采用轻量级的VAE和一步扩散设计,实现实时恢复。我们的框架即使在挑战性条件下也能实现超过100倍的压缩(将典型的354.77 MB模型减少到约3.31 MB),同时保持保真度并显著提高图像质量。这些结果突显了扩散先验在极端压缩与高质量神经渲染之间桥梁作用。我们的代码库将在以下链接发布:https://github.com/chenttt2001/ExGS
Summary / 总结
ExGS is a novel feed-forward framework that combines Universal Gaussian Compression (UGC) and GaussPainter for efficient 3D Gaussian Splatting (3DGS) compression. UGC reduces Gaussian primitives without re-optimization, while GaussPainter uses diffusion priors to refine pruned scenes, enhancing image quality. ExGS achieves over 100X compression while maintaining high rendering fidelity and image quality.
ExGS 是一种结合了通用高斯压缩(UGC)和 GaussPainter 的新型前馈框架,用于高效压缩 3D 高斯散点图(3DGS)。UGC 减少高斯基元的同时保留关键信息,而 GaussPainter 利用扩散先验进行增强和恢复,以生成高质量的渲染。ExGS 在保持保真度的同时实现了超过 100 倍的压缩,并在挑战性条件下显著提高了图像质量,展示了扩散先验在极端压缩中的重要作用。
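As a quick check of the "over 100X" figure in the ExGS entry above, the two model sizes quoted in the abstract give a ratio of roughly 107x. The variable names below are only illustrative.

```python
# Sizes quoted in the ExGS abstract, in megabytes.
original_mb = 354.77
compressed_mb = 3.31

ratio = original_mb / compressed_mb
saved_fraction = 1.0 - compressed_mb / original_mb

print(f"compression ratio: {ratio:.1f}x")
print(f"storage saved: {saved_fraction:.2%}")
```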
RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation
Authors: Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang
First: 2025-07-03T16:56:15+00:00 · Latest: 2025-10-03T02:54:26+00:00
Abstract
Text-to-image (T2I) diffusion models have shown remarkable success in
generating high-quality images from text prompts. Recent efforts extend these
models to incorporate conditional images (e.g., canny edge) for fine-grained
spatial control. Among them, feature injection methods have emerged as a
training-free alternative to traditional fine-tuning-based approaches. However,
they often suffer from structural misalignment, condition leakage, and visual
artifacts, especially when the condition image diverges significantly from
natural RGB distributions. Through an empirical analysis of existing methods,
we identify a key limitation: the sampling schedule of condition features,
previously unexplored, fails to account for the evolving interplay between
structure preservation and domain alignment throughout diffusion steps.
Inspired by this observation, we propose a flexible training-free framework
that decouples the sampling schedule of condition features from the denoising
process, and systematically investigate the spectrum of feature injection
schedules for a higher-quality structure guidance in the feature space.
Specifically, we find that condition features sampled from a single timestep
are sufficient, yielding a simple yet efficient schedule that balances
structure alignment and appearance quality. We further enhance the sampling
process by introducing a restart refinement schedule, and improve the visual
quality with an appearance-rich prompting strategy. Together, these designs
enable training-free generation that is both structure-rich and
appearance-rich. Extensive experiments show that our approach achieves
state-of-the-art results across diverse zero-shot conditioning scenarios.
Summary / 总结
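The research aims to improve training-free spatial control in text-to-image diffusion models by addressing the structural misalignment, condition leakage, and visual artifacts of existing feature injection methods. The authors propose RichControl, a training-free framework that decouples the sampling schedule of condition features from the denoising process and systematically investigates feature injection schedules. Key findings are that condition features sampled from a single timestep are sufficient, and that a restart refinement schedule and an appearance-rich prompting strategy further improve visual quality, yielding state-of-the-art results across diverse zero-shot conditioning scenarios.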
研究旨在通过解决现有特征注入方法的局限性,提高文本到图像生成的空间控制。作者提出了RichControl,一种训练-free框架,将条件特征的采样时间表与去噪过程分离,并系统地研究各种时间表以增强结构和外观。关键发现包括单时间步采样的有效性,以及通过重启细化时间表和外观丰富的提示策略提高视觉质量,从而在零样本条件场景中达到最先进的结果。
ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks
Authors: Zhaorun Chen, Xun Liu, Mintong Kang, Jiawei Zhang, Minzhou Pan, Shuang Yang, Bo Li
First: 2025-10-03T02:28:02+00:00 · Latest: 2025-10-03T02:28:02+00:00
Comments: 60 pages, 16 figures
Abstract
As vision-language models (VLMs) gain prominence, their multimodal interfaces
also introduce new safety vulnerabilities, making the safety evaluation
challenging and critical. Existing red-teaming efforts are either restricted to
a narrow set of adversarial patterns or depend heavily on manual engineering,
lacking scalable exploration of emerging real-world VLM vulnerabilities. To
bridge this gap, we propose ARMs, an adaptive red-teaming agent that
systematically conducts comprehensive risk assessments for VLMs. Given a target
harmful behavior or risk definition, ARMs automatically optimizes diverse
red-teaming strategies with reasoning-enhanced multi-step orchestration, to
effectively elicit harmful outputs from target VLMs. We propose 11 novel
multimodal attack strategies, covering diverse adversarial patterns of VLMs
(e.g., reasoning hijacking, contextual cloaking), and integrate 17 red-teaming
algorithms into ARMs via model context protocol (MCP). To balance the diversity
and effectiveness of the attack, we design a layered memory with an
epsilon-greedy attack exploration algorithm. Extensive experiments on instance-
and policy-based benchmarks show that ARMs achieves SOTA attack success rates,
exceeding baselines by an average of 52.1% and surpassing 90% on
Claude-4-Sonnet. We show that the diversity of red-teaming instances generated
by ARMs is significantly higher, revealing emerging vulnerabilities in VLMs.
Leveraging ARMs, we construct ARMs-Bench, a large-scale multimodal safety
dataset comprising over 30K red-teaming instances spanning 51 diverse risk
categories, grounded in both real-world multimodal threats and regulatory
risks. Safety fine-tuning with ARMs-Bench substantially improves the robustness
of VLMs while preserving their general utility, providing actionable guidance
to improve multimodal safety alignment against emerging threats.
中文标题/摘要
标题:ARMs:针对多模态模型的自适应红队代理对抗插拔式攻击
随着视觉语言模型(VLMs)的兴起,它们的多模态界面也带来了新的安全漏洞,使得安全评估变得具有挑战性和关键性。现有的红队努力要么局限于狭窄的对抗模式集,要么严重依赖手动工程,缺乏对新兴现实世界VLM漏洞的大规模探索。为了解决这一差距,我们提出了ARMs,这是一种自适应红队代理,系统地对VLM进行全面的风险评估。给定一个目标有害行为或风险定义,ARMs自动优化多种增强推理的多步协调红队策略,以有效从目标VLM中引发有害输出。我们提出了11种新颖的多模态攻击策略,涵盖了VLMs的各种对抗模式(例如,推理劫持、上下文隐身),并通过模型上下文协议(MCP)将17种红队算法集成到ARMs中。为了平衡攻击的多样性和有效性,我们设计了一层记忆和ε-贪婪攻击探索算法。在基于实例和策略的基准上进行的广泛实验表明,ARMs实现了SOTA攻击成功率,平均超过基线52.1%,在Claude-4-Sonnet上超过90%。我们展示了ARMs生成的红队实例多样性显著更高,揭示了VLMs中的新兴漏洞。利用ARMs,我们构建了ARMs-Bench,这是一个包含超过30,000个红队实例的大规模多模态安全数据集,涵盖了51个多样风险类别,基于现实世界的多模态威胁和监管风险。使用ARMs-Bench进行的安全微调显著提高了VLMs的鲁棒性,同时保留了其通用功能,提供了针对新兴威胁改进多模态安全对齐的可操作指导。
Summary / 总结
ARMs is an adaptive red-teaming agent designed to systematically assess the safety of vision-language models (VLMs) by automatically optimizing diverse red-teaming strategies. It introduces 11 novel multimodal attack strategies and integrates 17 red-teaming algorithms via the model context protocol (MCP). Experiments show that ARMs achieves state-of-the-art attack success rates, exceeding baselines by an average of 52.1% and surpassing 90% on Claude-4-Sonnet. The authors also use ARMs to build ARMs-Bench, a large-scale multimodal safety dataset with over 30K red-teaming instances across 51 risk categories, and report that safety fine-tuning on it improves VLM robustness while preserving general utility.
ARMs 是一个自适应红队代理,旨在系统地评估视觉语言模型(VLMs)的安全性,通过自动优化多种红队策略。它引入了11种新型的多模态攻击策略,并集成了17种红队算法,相比基线模型,其攻击成功率显著提高。ARMs 构建了一个大规模的多模态安全数据集 ARMs-Bench,包含超过30K个红队实例,增强了VLMs的鲁棒性同时保持其通用性。
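The ARMs abstract above mentions an epsilon-greedy exploration algorithm for balancing diversity and effectiveness. The sketch below shows only the textbook epsilon-greedy rule over abstract strategy labels. It does not reproduce the paper's layered memory or any attack logic, and the class name, placeholder labels, and parameters are hypothetical.

```python
import random
from typing import Dict, List


class StrategyMemory:
    """Tracks per-strategy outcomes and picks the next one epsilon-greedily."""

    def __init__(self, strategies: List[str], epsilon: float = 0.1,
                 seed: int = 0) -> None:
        # Per strategy: [successes, trials].
        self.stats: Dict[str, List[int]] = {s: [0, 0] for s in strategies}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def success_rate(self, strategy: str) -> float:
        successes, trials = self.stats[strategy]
        return successes / trials if trials else 0.0

    def choose(self) -> str:
        """Explore with probability epsilon, otherwise exploit the best so far."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.stats))
        return max(self.stats, key=self.success_rate)

    def record(self, strategy: str, success: bool) -> None:
        self.stats[strategy][1] += 1
        if success:
            self.stats[strategy][0] += 1


# Placeholder labels only; epsilon=0.0 makes the choice purely greedy.
memory = StrategyMemory(["strategy_a", "strategy_b", "strategy_c"], epsilon=0.0)
memory.record("strategy_a", False)
memory.record("strategy_b", True)
greedy_choice = memory.choose()
print(greedy_choice)
```

A larger `epsilon` samples strategies uniformly more often, which trades some short-term success rate for broader coverage.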
PEO: Training-Free Aesthetic Quality Enhancement in Pre-Trained Text-to-Image Diffusion Models with Prompt Embedding Optimization
Authors: Hovhannes Margaryan, Bo Wan, Tinne Tuytelaars
First: 2025-10-02T22:12:36+00:00 · Latest: 2025-10-02T22:12:36+00:00
Abstract
This paper introduces a novel approach to aesthetic quality improvement in
pre-trained text-to-image diffusion models when given a simple prompt. Our
method, dubbed Prompt Embedding Optimization (PEO), leverages a pre-trained
text-to-image diffusion model as a backbone and optimizes the text embedding of
a given simple and uncurated prompt to enhance the visual quality of the
generated image. We achieve this with a tripartite objective function that
improves the aesthetic fidelity of the generated image, ensures adherence to
the optimized text embedding, and minimizes divergence from the initial prompt.
The last is accomplished through a prompt preservation term. Additionally,
PEO is training-free and backbone-independent. Quantitative and qualitative
evaluations confirm the effectiveness of the proposed method, which matches or
exceeds the performance of state-of-the-art text-to-image and prompt
adaptation methods.
中文标题/摘要
标题:PEO:基于提示嵌入优化的预训练文本到图像扩散模型无训练美学质量提升
本文介绍了一种在给定简单提示时提高预训练文本到图像扩散模型美学质量的新方法。我们的方法称为提示嵌入优化(PEO),利用一个预训练的文本到图像扩散模型作为基础,并优化给定的简单且未经筛选的提示的文本嵌入,以提高生成图像的视觉质量。我们通过一个三重目标函数实现这一点,该函数提高了生成图像的美学保真度,确保符合优化后的文本嵌入,并尽量减少与初始提示的偏差。后者通过提示保留项实现。此外,PEO 是无训练的且与基础模型无关。定量和定性评估证实了所提出方法的有效性,超过了最先进的文本到图像和提示适应方法的表现。
Summary / 总结
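PEO (Prompt Embedding Optimization) improves the aesthetic quality of images generated by pre-trained text-to-image diffusion models from simple, uncurated prompts. It optimizes the prompt's text embedding under a tripartite objective that rewards aesthetic fidelity, adherence to the optimized embedding, and minimal divergence from the initial prompt. The method is training-free and backbone-independent, and quantitative and qualitative evaluations show that it matches or exceeds state-of-the-art text-to-image and prompt adaptation methods.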
该论文提出了一种名为Prompt Embedding Optimization (PEO)的方法,用于通过简单的提示优化预训练的文本到图像扩散模型生成的图像的美学质量。PEO通过优化提示的文本嵌入来改善视觉质量,同时保留初始提示的意图。它使用一个三重目标函数,专注于美学保真度、对优化文本嵌入的遵循以及与初始提示的最小偏差。实验结果表明,PEO在定量和定性评估中均优于或与最先进的文本到图像和提示适应方法相当,并且是无需训练和模型无关的。
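The tripartite objective described in the PEO entry above can be written schematically as follows. This is only a sketch: the symbols ($e$ for the optimized text embedding, $e_0$ for the initial prompt embedding, $G$ for the frozen diffusion backbone, $A$ for an aesthetic scorer, $D_{\mathrm{img}}$ and $D_{\mathrm{txt}}$ for dissimilarity measures, and the weights $\lambda_1, \lambda_2$) are assumptions introduced here for illustration and are not taken from the paper.

```latex
% Schematic only: symbols and weights are assumptions, not the paper's notation.
\mathcal{L}(e) \;=\;
  \underbrace{-\,A\big(G(e)\big)}_{\text{aesthetic quality}}
  \;+\; \lambda_1\,\underbrace{D_{\mathrm{img}}\big(G(e),\, e\big)}_{\text{adherence to the optimized embedding}}
  \;+\; \lambda_2\,\underbrace{D_{\mathrm{txt}}\big(e,\, e_0\big)}_{\text{prompt preservation}}
```

Under this reading only $e$ is updated while the backbone $G$ stays fixed, which is consistent with the training-free claim in the abstract.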
CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Authors: Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves
First: 2025-01-21T16:38:04+00:00 · Latest: 2025-10-02T20:31:29+00:00
Comments: Accepted for publication in Computers in Biology and Medicine
Abstract
The main challenges limiting the adoption of deep learning-based solutions in
medical workflows are the availability of annotated data and the lack of
interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the
latter by constraining the model output on a set of predefined and
human-interpretable concepts. However, the increased interpretability achieved
through these concept-based explanations implies a higher annotation burden.
Moreover, if a new concept needs to be added, the whole system needs to be
retrained. Inspired by the remarkable performance shown by Large
Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet
effective, methodology, CBVLM, which tackles both of the aforementioned
challenges. First, for each concept, we prompt the LVLM to answer if the
concept is present in the input image. Then, we ask the LVLM to classify the
image based on the previous concept predictions. Moreover, in both stages, we
incorporate a retrieval module responsible for selecting the best examples for
in-context learning. By grounding the final diagnosis on the predicted
concepts, we ensure explainability, and by leveraging the few-shot capabilities
of LVLMs, we drastically lower the annotation cost. We validate our approach
with extensive experiments across four medical datasets and twelve LVLMs (both
generic and medical) and show that CBVLM consistently outperforms CBMs and
task-specific supervised methods without requiring any training and using just
a few annotated examples. More information on our project page:
https://cristianopatricio.github.io/CBVLM/.
中文标题/摘要
标题:CBVLM:无需训练的基于概念的可解释大型视觉语言模型在医学图像分类中的应用
深度学习解决方案在医学工作流程中的应用受到标注数据不足和系统缺乏可解释性的限制。概念瓶颈模型(CBMs)通过在一组预定义且可由人类解释的概念上约束模型输出来解决后者。然而,通过这些基于概念的解释获得的更高可解释性意味着更高的标注负担。此外,如果需要添加新概念,整个系统都需要重新训练。受大型视觉语言模型(LVLMs)在少样本设置中表现出色的启发,我们提出了一种简单而有效的方法——CBVLM,以解决上述两个问题。首先,对于每个概念,我们提示LVLM判断该概念是否出现在输入图像中。然后,我们要求LVLM根据之前的概念预测对图像进行分类。此外,在两个阶段中,我们引入了一个检索模块,负责选择最佳示例进行上下文学习。通过基于预测的概念进行最终诊断,我们确保了可解释性,并通过利用LVLM的少样本能力,大幅降低了标注成本。我们通过在四个医学数据集和十二个LVLM(通用和医学)上进行广泛的实验验证了我们的方法,并展示了CBVLM在无需训练且仅使用少量标注示例的情况下,始终优于CBMs和任务特定的监督方法。更多关于我们项目的详细信息,请参阅项目页面:https://cristianopatricio.github.io/CBVLM/
Summary / 总结
CBVLM addresses the challenges of limited annotated data and lack of interpretability in medical image classification by leveraging Large Vision-Language Models (LVLMs). It prompts LVLMs to predict the presence of predefined concepts in images and then classifies the images based on these predictions, incorporating a retrieval module for in-context learning. Experiments across four medical datasets and twelve LVLMs show that CBVLM outperforms Concept Bottleneck Models (CBMs) and task-specific supervised methods without requiring any training and using minimal annotated examples.
CBVLM通过利用大型视觉-语言模型(LVLM)来解决医学图像分类中的标注数据和模型可解释性问题。它促使LVLM预测概念在图像中的存在,并基于这些预测进行分类,同时引入检索模块进行上下文学习。在四个医学数据集和十二个LVLM(包括通用和医学专用)的实验中,CBVLM在无需训练的情况下,仅使用少量标注示例,表现出优于概念瓶颈模型(CBMs)和特定任务监督方法的效果。
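The two-stage procedure in the CBVLM entry above (one query per concept, then a classification grounded in those answers) can be sketched as below. This is a minimal illustration, not the authors' code: the `LVLM` callable, the prompt wording, the `demos` dictionary standing in for the retrieval module, and the fallback label are all assumptions.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: any function that takes a text prompt and an image
# reference and returns the model's text answer.
LVLM = Callable[[str, str], str]


def predict_concepts(ask: LVLM, image: str, concepts: Sequence[str],
                     demos: Dict[str, List[str]]) -> Dict[str, bool]:
    """Stage 1: one yes/no query per concept, prefixed by retrieved examples."""
    predictions: Dict[str, bool] = {}
    for concept in concepts:
        # `demos` stands in for the retrieval module (assumed, not the paper's API).
        shots = "\n".join(demos.get(concept, []))
        prompt = (f"{shots}\nIs the concept '{concept}' present in the image? "
                  "Answer yes or no.")
        predictions[concept] = ask(prompt, image).strip().lower().startswith("yes")
    return predictions


def classify(ask: LVLM, image: str, predictions: Dict[str, bool],
             labels: Sequence[str]) -> str:
    """Stage 2: classify the image conditioned on the stage-1 predictions."""
    described = ", ".join(
        f"{c}: {'present' if p else 'absent'}" for c, p in predictions.items())
    prompt = f"Concepts -> {described}. Choose one label from {list(labels)}."
    answer = ask(prompt, image).strip()
    # Falling back to the first label on an unparsable answer is an assumption.
    return answer if answer in labels else labels[0]
```

Because the final label is requested only after the concept answers are placed in the prompt, the concept predictions double as the explanation for the diagnosis.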
Exploring OCR-augmented Generation for Bilingual VQA
Authors: JoonHo Lee, Sunho Park
First: 2025-10-02T20:19:31+00:00 · Latest: 2025-10-02T20:19:31+00:00
Abstract
We investigate OCR-augmented generation with Vision Language Models (VLMs),
exploring tasks in Korean and English toward multilingualism. To support
research in this domain, we train and release KLOCR, a strong bilingual OCR
baseline trained on 100M instances to augment VLMs with OCR ability. To
complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and
analyze different prompting methods. Extensive experiments show that
OCR-extracted text significantly boosts performance across open source and
commercial models. Our work offers new insights into OCR-augmented generation
for bilingual VQA. Model, code, and data are available at
https://github.com/JHLee0513/KLOCR.
中文标题/摘要
标题:探索OCR增强生成在双语VQA中的应用
我们研究了使用视觉语言模型(VLMs)的OCR增强生成,探索了韩语和英语任务以实现多语言能力。为了支持该领域的研究,我们训练并发布了KLOCR,这是一个基于1亿实例训练的强双语OCR基线,用于增强VLMs的OCR能力。为了补充现有的VQA基准,我们为韩语VQA整理了KOCRBench,并分析了不同的提示方法。广泛的实验表明,OCR提取的文本显著提升了开源和商用模型的性能。我们的工作为双语VQA中的OCR增强生成提供了新的见解。模型、代码和数据可在https://github.com/JHLee0513/KLOCR获取。
Summary / 总结
This work studies OCR-augmented generation with Vision Language Models (VLMs) for bilingual (Korean and English) visual question answering (VQA). It introduces KLOCR, a strong bilingual OCR baseline trained on 100 million instances, and KOCRBench, a benchmark for Korean VQA. Experiments show that OCR-extracted text significantly improves the performance of both open source and commercial models. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.
研究旨在通过将OCR增强生成与视觉语言模型结合来提升双语VQA(视觉问答)。研究引入了KLOCR,这是一个基于1亿实例训练的双语OCR基线,以提高视觉语言模型的能力。实验表明,OCR提取的文本显著提升了开源和商用模型在各种任务中的性能。该工作为双语VQA中的OCR增强生成提供了新的见解。模型、代码和数据可在https://github.com/JHLee0513/KLOCR获取。
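One simple form of the OCR augmentation described in the entry above is to prepend the recognized text to the question before querying the VLM. The sketch below illustrates that idea only; the function name, the prompt template, and the `max_chars` truncation are assumptions and do not reflect the prompting methods actually compared in the paper.

```python
from typing import List


def build_ocr_prompt(question: str, ocr_lines: List[str],
                     max_chars: int = 2000) -> str:
    """Prepend OCR-extracted text to a VQA question (illustrative template)."""
    # Drop empty lines and surrounding whitespace from the OCR output.
    text = "\n".join(line.strip() for line in ocr_lines if line.strip())
    text = text[:max_chars]  # crude length cap; the limit is an assumption
    if not text:
        return question  # nothing recognized: fall back to the plain question
    return f"Text detected in the image:\n{text}\n\nQuestion: {question}"
```

The same template works for Korean and English inputs because the OCR text is passed through unchanged.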
Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection
Authors: Jiawen Zhu, Yew-Soon Ong, Chunhua Shen, Guansong Pang
Venue: ICCV 2025
First: 2024-10-14T08:41:31+00:00 · Latest: 2025-10-02T20:14:05+00:00
Comments: Accepted to ICCV 2025; 11 pages, 3 figures
Abstract
Current zero-shot anomaly detection (ZSAD) methods show remarkable success in
prompting large pre-trained vision-language models to detect anomalies in a
target dataset without using any dataset-specific training or demonstration.
However, these methods often focus on crafting/learning prompts that capture
only coarse-grained semantics of abnormality, e.g., high-level semantics like
"damaged", "imperfect", or "defective" objects. They therefore have limited
capability in recognizing diverse abnormality details that deviate from these
general abnormal patterns in various ways. To address this limitation, we
propose FAPrompt, a novel framework designed to learn Fine-grained Abnormality
Prompts for accurate ZSAD. To this end, a novel Compound Abnormality Prompt
learning (CAP) module is introduced in FAPrompt to learn a set of
complementary, decomposed abnormality prompts, where abnormality prompts are
enforced to model diverse abnormal patterns derived from the same normality
semantic. On the other hand, the fine-grained abnormality patterns can be
different from one dataset to another. To enhance the cross-dataset
generalization, another novel module, namely Data-dependent Abnormality Prior
learning (DAP), is introduced in FAPrompt to learn a sample-wise abnormality
prior from abnormal features of each test image to dynamically adapt the
abnormality prompts to individual test images. Comprehensive experiments on 19
real-world datasets, covering both industrial defects and medical anomalies,
demonstrate that FAPrompt substantially outperforms state-of-the-art methods in
both image- and pixel-level ZSAD tasks. Code is available at
https://github.com/mala-lab/FAPrompt.
中文标题/摘要
标题:细粒度异常提示学习在零样本异常检测中的应用
当前的零样本异常检测(ZSAD)方法在无需使用任何特定数据集的训练或演示的情况下,成功地促使大型预训练视觉-语言模型在目标数据集中检测异常。然而,这些方法通常仅关注构建/学习捕捉异常粗粒度语义的提示,例如“损坏”、“不完美”或“缺陷”等高级语义。因此,它们在识别与这些一般异常模式在各种方式上偏离的多种异常细节方面能力有限。为解决这一局限,我们提出了FAPrompt,一种用于准确ZSAD的新型框架。为此,FAPrompt引入了一种新颖的复合异常提示学习(CAP)模块,用于学习一组互补的、分解的异常提示,其中异常提示被强制建模为源自相同正常性语义的多种异常模式。另一方面,细粒度的异常模式在不同数据集中可能有所不同。为了增强跨数据集的泛化能力,FAPrompt还引入了另一种新颖模块,即数据依赖异常先验学习(DAP),从每个测试图像的异常特征中学习样本级别的异常先验,以动态适应异常提示到个体测试图像。在19个真实世界数据集上的全面实验,涵盖了工业缺陷和医学异常,表明FAPrompt在图像级和像素级ZSAD任务中显著优于现有方法。代码可在https://github.com/mala-lab/FAPrompt获取。
Summary / 总结
The paper proposes FAPrompt, a novel framework for fine-grained abnormality prompt learning to enhance zero-shot anomaly detection. It introduces a Compound Abnormality Prompt (CAP) module to learn complementary, decomposed abnormality prompts and a Data-dependent Abnormality Prior (DAP) module to adapt abnormality prompts to individual test images. Experiments on 19 real-world datasets show that FAPrompt outperforms existing methods in both image- and pixel-level zero-shot anomaly detection tasks.
论文提出了一种新颖的细粒度异常提示学习框架FAPrompt,以提高零样本异常检测的准确性。该框架引入了复合异常提示(CAP)模块来学习互补的分解异常提示,并引入了数据依赖异常先验(DAP)模块来使异常提示适应每个测试图像。在19个真实世界数据集上的实验表明,FAPrompt在图像级和像素级的零样本异常检测任务中均优于现有方法。
Multimodal Function Vectors for Spatial Relations
Authors: Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu
First: 2025-10-02T19:55:56+00:00 · Latest: 2025-10-02T19:55:56+00:00
Abstract
Large Multimodal Models (LMMs) demonstrate impressive in-context learning
abilities from limited multimodal demonstrations, yet the internal mechanisms
supporting such task learning remain opaque. Building on prior work on large
language models, we show that a small subset of attention heads in the
vision-language model OpenFlamingo-4B is responsible for transmitting
representations of spatial relations. The activations of these attention heads,
termed function vectors, can be extracted and manipulated to alter an LMM's
performance on relational tasks. First, using both synthetic and real image
datasets, we apply causal mediation analysis to identify attention heads that
strongly influence relational predictions, and extract multimodal function
vectors that improve zero-shot accuracy at inference time. We further
demonstrate that these multimodal function vectors can be fine-tuned with a
modest amount of training data, while keeping LMM parameters frozen, to
significantly outperform in-context learning baselines. Finally, we show that
relation-specific function vectors can be linearly combined to solve analogy
problems involving novel and untrained spatial relations, highlighting the
strong generalization ability of this approach. Our results show that LMMs
encode spatial relational knowledge within localized internal structures, which
can be systematically extracted and optimized, thereby advancing our
understanding of model modularity and enhancing control over relational
reasoning in LMMs.
Summary / 总结
The study aims to uncover the internal mechanisms of Large Multimodal Models (LMMs) in learning spatial relations from limited demonstrations. By focusing on a subset of attention heads in OpenFlamingo-4B, the researchers extract and manipulate function vectors to improve zero-shot accuracy. They demonstrate that these vectors can be fine-tuned with minimal training data to outperform in-context learning baselines and can be combined to solve analogy problems involving new spatial relations, showcasing strong generalization capabilities.
该研究探讨了大型多模态模型(LMMs)在学习空间关系时的内部机制。通过分析OpenFlamingo-4B中的注意力头,研究人员识别出特定的功能向量,这些向量对关系预测有显著影响。这些向量被提取并用少量数据微调,提高了零样本准确率。研究还展示了特定关系的功能向量可以线性组合来解决类比问题,展示了强大的泛化能力。这项工作推进了对模型模块性的理解,并增强了对关系推理的控制。
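The two operations the entry above relies on, linearly combining relation-specific function vectors and adding the result to a hidden state at inference time, can be sketched as follows. This is a toy illustration under stated assumptions: the vectors are plain Python lists, and the relation names, weights, and `scale` parameter are hypothetical rather than values from the paper.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def combine(vectors: Dict[str, Vector], weights: Dict[str, float]) -> Vector:
    """Weighted sum of named function vectors (all of equal length)."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(vectors[name]):
            out[i] += weight * value
    return out


def inject(hidden: Sequence[float], function_vector: Sequence[float],
           scale: float = 1.0) -> Vector:
    """Add a (scaled) function vector to a hidden state, element-wise."""
    return [h + scale * f for h, f in zip(hidden, function_vector)]
```

For example, `combine({"above": a, "left": b}, {"above": 1.0, "left": -1.0})` builds a composite direction from two relation vectors, which `inject` then adds to the model's hidden state while the model parameters stay frozen.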
Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Authors: Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
First: 2025-09-22T11:54:58+00:00 · Latest: 2025-10-02T18:38:00+00:00
Comments: project page: https://soroush-mim.github.io/projects/evict3r/
Abstract
Streaming visual transformers like StreamVGGT achieve strong 3D perception
but suffer from unbounded growth of key-value (KV) memory, which limits
scalability. We propose a training-free, inference-time token eviction policy
that bounds memory by discarding redundant tokens while keeping the most
informative ones. Our method uses significantly less memory with little to no
drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from
18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under
strict memory budgets, eviction enables denser frame sampling, which improves
reconstruction accuracy compared to the baseline. Experiments across video
depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and
camera pose estimation (Sintel, TUM-dynamics) show that our approach closely
matches StreamVGGT at a fraction of the memory and makes long-horizon streaming
inference more practical.
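A budgeted eviction policy of the kind described in the Evict3R entry above can be sketched as follows. This is a minimal illustration, not the paper's policy: the per-token importance score is a placeholder, token ids are assumed unique, and the real method operates on KV tensors rather than Python tuples.

```python
from typing import Iterable, List, Tuple

# (token id, importance score). How the score is computed is left open here.
Token = Tuple[int, float]


def evict(cache: List[Token], budget: int) -> List[Token]:
    """Keep at most `budget` tokens, dropping the lowest-scoring ones first."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if len(cache) <= budget:
        return list(cache)
    keep = sorted(cache, key=lambda tok: tok[1], reverse=True)[:budget]
    keep_ids = {tok[0] for tok in keep}
    # Preserve arrival order among the survivors.
    return [tok for tok in cache if tok[0] in keep_ids]


def stream(frames: Iterable[List[Token]], budget: int) -> List[Token]:
    """Append each frame's tokens, then evict so memory never exceeds budget."""
    cache: List[Token] = []
    for tokens in frames:
        cache.extend(tokens)
        cache = evict(cache, budget)
    return cache
```

Because eviction runs after every frame, the cache size is bounded by `budget` regardless of sequence length, which is what allows denser frame sampling under a fixed memory limit. For scale, the reported drop from 18.63 GB to 9.39 GB on 7-Scenes is roughly a halving of peak memory.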
Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity
Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann
First: 2025-10-02T17:59:58+00:00 · Latest: 2025-10-02T17:59:58+00:00
Comments: Code: https://github.com/ericbill21/FOCUS/
Abstract
Text-to-image (T2I) models excel on single-entity prompts but struggle with
multi-subject descriptions, often showing attribute leakage, identity
entanglement, and subject omissions. We introduce the first theoretical
framework with a principled, optimizable objective for steering sampling
dynamics toward multi-subject fidelity. Viewing flow matching (FM) through
stochastic optimal control (SOC), we formulate subject disentanglement as
control over a trained FM sampler. This yields two architecture-agnostic
algorithms: (i) a training-free test-time controller that perturbs the base
velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight
fine-tuning rule that regresses a control network to a backward adjoint signal
while preserving base-model capabilities. The same formulation unifies prior
attention heuristics, extends to diffusion models via a flow-diffusion
correspondence, and provides the first fine-tuning route explicitly designed
for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and
Stable Diffusion XL, both algorithms consistently improve multi-subject
alignment while maintaining base-model style. Test-time control runs
efficiently on commodity GPUs, and fine-tuned controllers trained on limited
prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal
Control for Unentangled Subjects), which achieves state-of-the-art
multi-subject fidelity across models.
中文标题/摘要
标题:最优控制与流匹配结合:通往多主体保真度的原理性路径
文本到图像(T2I)模型在单一实体提示上表现出色,但在处理多主体描述时遇到困难,经常出现属性泄漏、身份纠缠和主体遗漏。我们提出了第一个理论框架,提供了一个可优化的目标,以引导采样动力学向多主体保真度方向发展。通过将流匹配(FM)视为随机最优控制(SOC),我们将主体解纠缠视为对训练好的FM采样器的控制。这产生了两种架构无关的算法:(i)一个无需训练的测试时控制器,通过单次更新扰动基础速度,以及(ii)伴随匹配,这是一种轻量级的微调规则,通过回归控制网络到后向伴随信号来实现,同时保留基础模型的能力。相同的公式统一了先前的注意力启发式方法,通过流扩散对应关系扩展到扩散模型,并提供了第一个明确为多主体保真度设计的微调路径。实验上,在Stable Diffusion 3.5、FLUX和Stable Diffusion XL上,两个算法在保持基础模型风格的同时,一致地提高了多主体对齐度。测试时控制在普通GPU上高效运行,微调控制器在有限提示下训练后可以泛化到未见过的提示。我们还强调了FOCUS(流最优控制以解纠缠主体),它在多个模型中实现了最先进的多主体保真度。
Summary / 总结
The paper addresses the challenge of generating images from multi-subject descriptions in text-to-image models, which often suffer from attribute leakage, identity entanglement, and subject omissions. It introduces a theoretical framework using stochastic optimal control to steer the sampling dynamics towards multi-subject fidelity. Two algorithms are proposed: a test-time controller that perturbs the base velocity and a fine-tuning rule called Adjoint Matching. Both methods improve multi-subject alignment while preserving the base model's style, and the test-time control runs efficiently on GPUs. FOCUS, a specific implementation, achieves state-of-the-art results in multi-subject fidelity across different models.
本文通过引入基于随机最优控制和流匹配的理论框架,解决了多主体保真度在文本到图像生成中的挑战。作者提出了两个架构无关的算法:测试时控制器和伴随匹配,这些算法在无需重新训练的情况下提高了多主体对齐。实验表明,这些算法在Stable Diffusion 3.5、FLUX和Stable Diffusion XL上的一致提高了多主体保真度,同时保持了基础模型的风格。测试时的控制效率高,微调控制器也能很好地泛化到未见过的提示上。
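The test-time controller in the entry above perturbs the base velocity of a trained flow-matching sampler. A minimal sketch of that idea with an Euler integrator is shown below. The `control` callable is a placeholder: in the paper the control signal is derived from a stochastic optimal control objective, which is not reproduced here, and `strength` and `steps` are illustrative parameters.

```python
from typing import Callable, List

Vec = List[float]
Field = Callable[[Vec, float], Vec]  # maps (state, time) to a vector


def controlled_sample(x0: Vec, velocity: Field, control: Field,
                      strength: float = 1.0, steps: int = 10) -> Vec:
    """Euler-integrate dx/dt = velocity(x, t) + strength * control(x, t) on [0, 1]."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)   # base model's velocity (kept frozen)
        u = control(x, t)    # additive perturbation (placeholder signal)
        x = [xi + dt * (vi + strength * ui) for xi, vi, ui in zip(x, v, u)]
    return x
```

Setting `strength=0.0` recovers the uncontrolled base sampler, so the controller can be switched off without touching the model.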
NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation
Authors: Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez
First: 2025-10-02T17:59:43+00:00 · Latest: 2025-10-02T17:59:43+00:00
Abstract
Text-to-image diffusion models trained on a fixed set of resolutions often
fail to generalize, even when asked to generate images at lower resolutions
than those seen during training. High-resolution text-to-image generators are
currently unable to easily offer an out-of-the-box budget-efficient alternative
to their users who might not need high-resolution images. We identify a key
technical issue in diffusion models that, when addressed, helps tackle this
limitation: noise schedulers have unequal perceptual effects across
resolutions. The same level of noise removes disproportionately more signal
from lower-resolution images than from high-resolution images, leading to a
train-test mismatch. We propose NoiseShift, a training-free method that
recalibrates the noise level of the denoiser conditioned on resolution size.
NoiseShift requires no changes to model architecture or sampling schedule and
is compatible with existing models. When applied to Stable Diffusion 3, Stable
Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly
improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and
Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by
10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results
demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent
artifacts and enhancing the quality of low-resolution image generation.
中文标题/摘要
标题:NoiseShift:针对分辨率的噪声重新校准以提高低分辨率图像生成质量
训练在固定分辨率集上的文本到图像扩散模型在生成低于训练分辨率的图像时往往无法很好地泛化。高分辨率的文本到图像生成器目前无法为那些不需要高分辨率图像的用户提供一个开箱即用且成本效益高的替代方案。我们发现了一个关键的技术洞察:当扩散模型中的噪声调度器得到解决时,可以缓解这一限制:噪声调度器在不同分辨率下的感知效果不等。相同水平的噪声从低分辨率图像中移除的信号比从高分辨率图像中移除的更多,导致训练和测试之间的不匹配。我们提出了一种无需训练的方法NoiseShift,该方法根据分辨率大小重新校准去噪器的噪声水平。NoiseShift 不需要对模型架构或采样时间表进行任何更改,并且与现有模型兼容。当应用于Stable Diffusion 3、Stable Diffusion 3.5和Flux-Dev时,低分辨率的质量显著提高。在LAION-COCO上,NoiseShift分别将SD3.5、SD3和Flux-Dev的FID提高了15.89%、8.56%和2.44%。在CelebA上,NoiseShift分别将SD3.5、SD3和Flux-Dev的FID提高了10.36%、5.19%和3.02%。这些结果表明NoiseShift在减轻分辨率依赖性伪影和提高低分辨率图像生成质量方面的有效性。
Summary / 总结
NoiseShift is a training-free, resolution-aware noise recalibration method for improving low-resolution image generation in text-to-image diffusion models. It addresses the train-test mismatch caused by unequal perceptual effects of noise across resolutions. When applied to Stable Diffusion 3.5, Stable Diffusion 3, and Flux-Dev, NoiseShift significantly improves quality at low resolutions, with average FID improvements of 15.89%, 8.56%, and 2.44% on LAION-COCO, and 10.36%, 5.19%, and 3.02% on CelebA, respectively.
NoiseShift 是一种针对低分辨率图像生成的文本到图像扩散模型的分辨率感知噪声重新校准方法。它解决了由于噪声在不同分辨率下的感知效果不等导致的训练-测试不匹配问题。当应用于 Stable Diffusion 3、Stable Diffusion 3.5 和 Flux-Dev 时,NoiseShift 显著提高了低分辨率图像的质量,在 LAION-COCO 上的平均 FID 改进了 15.89%、8.56% 和 2.44%,在 CelebA 上的平均 FID 改进了 10.36%、5.19% 和 3.02%。
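The recalibration idea in the NoiseShift entry above, conditioning the denoiser on a resolution-dependent noise level while leaving the model and sampling schedule untouched, can be sketched as a lookup. This is an illustration under assumptions: the table layout, the nearest-neighbour matching, and every numeric value are hypothetical, and the paper's actual calibration procedure is not reproduced here.

```python
from typing import Dict


def recalibrated_sigma(sigma: float, resolution: int,
                       table: Dict[int, Dict[float, float]]) -> float:
    """Noise level to condition the denoiser on, given the nominal level.

    `table[resolution]` maps nominal noise levels to calibrated ones
    (hypothetical values, assumed to be precomputed once per resolution).
    """
    per_resolution = table.get(resolution)
    if per_resolution is None:
        return sigma  # uncalibrated resolution: leave the level unchanged
    # Use the calibrated value of the closest tabulated nominal level.
    nearest = min(per_resolution, key=lambda s: abs(s - sigma))
    return per_resolution[nearest]
```

In this sketch a low-resolution entry maps a nominal level to a higher conditioning value, reflecting the abstract's observation that the same noise removes more signal from lower-resolution images.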