Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation
Authors: Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi
First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00
Comments: BMVC 2025. Project page: https://thegoodailab.org/blog/vocalign
Code: https://github.com/Sisso16/VocAlign
Abstract
We introduce VocAlign, a novel source-free domain adaptation framework
specifically designed for VLMs in open-vocabulary semantic segmentation. Our
method adopts a student-teacher paradigm enhanced with a vocabulary alignment
strategy, which improves pseudo-label generation by incorporating additional
class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to
fine-tune the model, preserving its original capabilities while minimizing
computational overhead. In addition, we propose a Top-K class selection
mechanism for the student model, which significantly reduces memory
requirements while further improving adaptation performance. Our approach
achieves a notable 6.11 mIoU improvement on the CityScapes dataset and
demonstrates superior performance on zero-shot segmentation benchmarks, setting
a new standard for source-free adaptation in the open-vocabulary setting.
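Title / Abstract (translated from Chinese)
Title: Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation
We introduce VocAlign, a source-free domain adaptation framework designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm combined with a vocabulary alignment strategy that improves pseudo-label generation by introducing additional class concepts. For efficiency, we fine-tune the model with Low-Rank Adaptation (LoRA), preserving its original capabilities while minimizing computational overhead. We also propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements and further improves adaptation performance. Our method achieves a notable 6.11 mIoU improvement on the CityScapes dataset and performs strongly on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.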
Summary
The research introduces VocAlign, a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation. It uses a student-teacher paradigm with vocabulary alignment to enhance pseudo-label generation and employs Low-Rank Adaptation (LoRA) for efficient fine-tuning. The approach also includes a Top-K class selection mechanism to reduce memory usage. Experiments show a 6.11 mIoU improvement on CityScapes and superior performance in zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in open-vocabulary settings.
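The paper proposes VocAlign, a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation. The method pairs a student-teacher paradigm with a vocabulary alignment strategy to strengthen pseudo-label generation and uses Low-Rank Adaptation (LoRA) for efficient fine-tuning. It also proposes a Top-K class selection mechanism to reduce memory usage. VocAlign achieves a 6.11 mIoU improvement on CityScapes and performs strongly on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.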
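As a rough illustration of the Top-K class selection idea in the VocAlign entry, the sketch below keeps only the K classes that a teacher scores highest on an image, so a student would need to score fewer classes. The abstract does not say how the K classes are chosen; ranking by mean teacher score, and the function names `top_k_classes` and `restrict_to_classes`, are assumptions made for this example and not the authors' implementation.

```python
def top_k_classes(teacher_scores, k):
    """Return (sorted) indices of the k classes with the highest mean teacher score.

    teacher_scores: list of per-pixel score lists, shape [num_pixels][num_classes].
    Ranking by mean score is an assumption for this sketch.
    """
    num_classes = len(teacher_scores[0])
    means = [
        sum(pixel[c] for pixel in teacher_scores) / len(teacher_scores)
        for c in range(num_classes)
    ]
    ranked = sorted(range(num_classes), key=lambda c: means[c], reverse=True)
    return sorted(ranked[:k])


def restrict_to_classes(scores, keep):
    """Keep only the selected class columns of a per-pixel score table."""
    return [[pixel[c] for c in keep] for pixel in scores]


# Toy example: 3 pixels, 4 candidate classes.
teacher = [
    [0.70, 0.10, 0.15, 0.05],
    [0.60, 0.05, 0.30, 0.05],
    [0.20, 0.10, 0.65, 0.05],
]
keep = top_k_classes(teacher, k=2)
student_view = restrict_to_classes(teacher, keep)
print(keep, student_view)
```

With a large open vocabulary, restricting the student to K columns shrinks the per-pixel score tensor from `num_classes` to `K` entries, which is one plausible reading of the memory saving the abstract reports.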
Calibration-Aware Prompt Learning for Medical Vision-Language Models
Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan
First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00
Comments: Accepted in BMVC 2025
Abstract
Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable
performance across diverse medical imaging tasks by leveraging large-scale
image-text pretraining. However, their confidence calibration remains largely
unexplored and is still a significant challenge. Miscalibrated
predictions can lead to overconfident errors, undermining clinical trust and
decision-making reliability. To address this, we introduce CalibPrompt, the
first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt
optimizes a small set of learnable prompts with carefully designed calibration
objectives in a scarce labeled-data regime. First, we study a regularizer that
attempts to align the smoothed accuracy with the predicted model confidences.
Second, we introduce an angular separation loss to maximize textual feature
proximity, with the aim of improving the reliability of confidence estimates in
multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs
and five diverse medical imaging datasets reveal that CalibPrompt consistently
improves calibration without drastically affecting clean accuracy. Our code is
available at https://github.com/iabh1shekbasu/CalibPrompt.
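Title / Abstract (translated from Chinese)
Title: Calibration-Aware Prompt Learning for Medical Vision-Language Models
Medical Vision-Language Models (Med-VLMs) perform well across diverse medical imaging tasks thanks to large-scale image-text pretraining. However, their confidence calibration has not been fully explored and remains a significant challenge. Miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives when labeled data is scarce. First, we study a regularizer that attempts to align the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss to maximize textual feature proximity, improving the reliability of confidence estimates in multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets show that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt/.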
Summary
The paper introduces CalibPrompt, a framework to calibrate Medical Vision-Language Models (Med-VLMs) during prompt tuning. It optimizes learnable prompts with calibration objectives under limited labeled data. The method includes a regularizer to align smoothed accuracy with predicted model confidences and an angular separation loss to enhance textual feature proximity. Experiments show that CalibPrompt improves calibration without significantly affecting clean accuracy on four Med-VLMs and five medical imaging datasets.
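The study aims to reduce overconfident errors in medical imaging tasks by improving the confidence calibration of Medical Vision-Language Models (Med-VLMs). CalibPrompt is a novel framework that does so by optimizing learnable prompts with calibration objectives under limited labeled data. It includes a regularizer that aligns smoothed accuracy with predicted confidence and an angular separation loss that enhances textual feature proximity. Experiments across four Med-VLMs and five datasets show that CalibPrompt improves calibration without significantly affecting clean accuracy.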
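To make "calibration" in the CalibPrompt entry concrete, here is a minimal sketch of Expected Calibration Error (ECE), a widely used metric for the gap between confidence and accuracy. The abstract does not state which metric the paper reports, so treating ECE as the measure is an assumption, as are the function name and the choice of 10 equal-width bins.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins.

    confidences: predicted confidence per sample, each in [0, 1].
    correct: booleans, True where the prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident toy model: 95% confidence but only 50% accuracy.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
# A model that is fully confident and always right.
calibrated = expected_calibration_error([1.0, 1.0], [True, True])
print(overconfident, calibrated)
```

A lower value means predicted confidences track observed accuracy more closely, which is the property the entry's calibration objectives are described as targeting.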
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Authors: Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
First: 2025-09-18T17:59:22+00:00 · Latest: 2025-09-18T17:59:22+00:00
Abstract
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that
operate GUIs autonomously, showing great potential, yet progress is limited by
the lack of large-scale, open-source computer use data and foundation models.
In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It
offers a large-scale dataset spanning 6 operating systems and 3 task domains,
built via a closed-loop pipeline uniting automated agents with human experts.
Trained on this scaled-up data, ScaleCUA can operate seamlessly across
platforms. Specifically, it delivers strong gains over baselines (+26.6 on
WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art
results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on
WebArena-Lite-v2). These findings underscore the power of data-driven scaling
for general-purpose computer use agents. We will release data, models, and code
to advance future research: https://github.com/OpenGVLab/ScaleCUA.
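Title / Abstract (translated from Chinese)
Title: ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously and show great potential, but progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built through a closed-loop pipeline that combines automated agents with human experts. Trained on this scaled-up data, ScaleCUA operates seamlessly across platforms. Specifically, it improves over baselines by +26.6 on WebArena-Lite-v2 and +10.7 on ScreenSpot-Pro, and sets new state-of-the-art results of 94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, and 47.4% on WebArena-Lite-v2. These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.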
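Summary
ScaleCUA addresses the limitations of open-source computer use agents by introducing a large-scale dataset spanning multiple operating systems and task domains. The dataset is built with a closed-loop pipeline involving automated agents and human experts, and models trained on it operate across platforms. ScaleCUA outperforms baselines and sets new state-of-the-art results on several benchmarks, highlighting the importance of data-driven scaling for general-purpose computer use agents. The authors state that the dataset, models, and code will be released for future research.
ScaleCUA addresses the limitations of open-source computer use agents by introducing a large-scale dataset spanning multiple operating systems and task domains. The dataset is built through a closed-loop pipeline that combines automated agents with human experts. ScaleCUA achieves state-of-the-art results on benchmarks including MMBench-GUI L1-Hard, OSWorld-G, and WebArena-Lite-v2, demonstrating the importance of data-driven scaling for general-purpose computer use agents. The dataset, models, and code will be publicly released to support further research.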
MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation
Authors: Gengliang Li, Rongyu Chen, Bin Li, Linlin Yang, Guodong Ding
First: 2025-09-18T16:59:59+00:00 · Latest: 2025-09-18T16:59:59+00:00
Comments: Tech report
Abstract
Ensuring factual consistency and reliable reasoning remains a critical
challenge for medical vision-language models. We introduce MEDFACT-R1, a
two-stage framework that integrates external knowledge grounding with
reinforcement learning to improve factual medical reasoning. The first
stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external
factual expertise, while the second stage applies Group Relative Policy
Optimization (GRPO) with four tailored factual reward signals to encourage
self-consistent reasoning. Across three public medical QA benchmarks,
MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over
previous state-of-the-art methods. Ablation studies highlight the necessity of
pseudo-label SFT cold start and validate the contribution of each GRPO reward,
underscoring the synergy between knowledge grounding and RL-driven reasoning
for trustworthy medical AI. Code is released at
https://github.com/Garfieldgengliang/MEDFACT-R1.
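Title / Abstract (translated from Chinese)
Title: MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation
Ensuring factual consistency and reliable reasoning remains a key challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that combines external knowledge grounding with reinforcement learning to improve factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise, while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of the pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Code is released at https://github.com/Garfieldgengliang/MEDFACT-R1.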
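The GRPO stage mentioned in the MedFact-R1 entry can be illustrated with its core bookkeeping step: rewards for a group of responses to the same prompt are normalized against the group's own mean and standard deviation. The sketch below is a generic illustration of that step only. The abstract does not name the four reward signals or say how they are combined, so the equal-weight sum and the placeholder signal values are assumptions.

```python
import math


def combined_reward(signals, weights):
    """Weighted sum of per-response reward signals (weights are hypothetical)."""
    return sum(w * s for w, s in zip(weights, signals))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, each scored by four placeholder signals.
group_signals = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
weights = [0.25, 0.25, 0.25, 0.25]  # equal weighting is an assumption
rewards = [combined_reward(s, weights) for s in group_signals]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses that beat their group's average receive a positive advantage and are reinforced; those below average are discouraged, without needing a separate learned value model.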
WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
First: 2025-09-18T16:40:47+00:00 · Latest: 2025-09-18T16:40:47+00:00
Comments: Project Webpage: https://worldforge-agi.github.io/
Abstract
Recent video diffusion models demonstrate strong potential in spatial
intelligence tasks due to their rich latent world priors. However, this
potential is hindered by their limited controllability and geometric
inconsistency, creating a gap between their strong priors and their practical
use in 3D/4D tasks. As a result, current approaches often rely on retraining or
fine-tuning, which risks degrading pretrained knowledge and incurs high
computational costs. To address this, we propose WorldForge, a training-free,
inference-time framework composed of three tightly coupled modules. Intra-Step
Recursive Refinement introduces a recursive refinement mechanism during
inference, which repeatedly optimizes network predictions within each denoising
step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages
optical flow similarity to decouple motion from appearance in the latent space
and selectively inject trajectory guidance into motion-related channels.
Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths
to adaptively correct trajectory drift caused by noisy or misaligned structural
signals. Together, these components inject fine-grained, trajectory-aligned
guidance without training, achieving both accurate motion control and
photorealistic content generation. Extensive experiments across diverse
benchmarks validate our method's superiority in realism, trajectory
consistency, and visual fidelity. This work introduces a novel plug-and-play
paradigm for controllable video synthesis, offering a new perspective on
leveraging generative priors for spatial intelligence.
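Title / Abstract (translated from Chinese)
Title: WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
Recent video diffusion models show strong potential in spatial intelligence tasks thanks to their rich latent world priors. However, this potential is held back by limited controllability and geometric inconsistency, leaving a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism that repeatedly optimizes network predictions within each denoising step during inference, enabling precise trajectory injection. Flow-Gated Latent Fusion uses optical flow similarity to decouple motion from appearance in the latent space and selectively injects trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate the method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a new plug-and-play paradigm for controllable video synthesis and offers a new perspective on leveraging generative priors for spatial intelligence.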
Summary
WorldForge addresses the limitations of video diffusion models in 3D/4D tasks by proposing a training-free framework. It consists of three modules: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. These modules enable precise trajectory injection, motion decoupling, and adaptive correction during inference, respectively. Experiments show that WorldForge achieves accurate motion control and photorealistic content generation, outperforming existing methods in realism, trajectory consistency, and visual fidelity.
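WorldForge proposes a training-free framework to address the limitations of video diffusion models in 3D/4D tasks. The framework contains three modules: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. During inference, these modules provide precise trajectory injection, decoupling of motion from appearance, and adaptive correction, respectively. Experiments show that WorldForge outperforms existing methods in realism, trajectory consistency, and visual fidelity.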
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Yijie Guo, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
First: 2024-09-20T03:02:05+00:00 · Latest: 2025-09-18T16:36:42+00:00
Abstract
Recently, driven by advancements in Multimodal Large Language Models (MLLMs),
Vision Language Action Models (VLAMs) are being proposed to achieve better
performance in open-vocabulary scenarios for robotic manipulation tasks. Since
manipulation tasks involve direct interaction with the physical world, ensuring
robustness and safety during task execution is a
critical issue. In this paper, by synthesizing current safety research on MLLMs
and the specific application scenarios of the manipulation task in the physical
world, we comprehensively evaluate VLAMs in the face of potential physical
threats. Specifically, we propose the Physical Vulnerability Evaluating
Pipeline (PVEP), which incorporates as many visual-modality physical threats as
possible to evaluate the physical robustness of VLAMs. The physical threats
in PVEP specifically include Out-of-Distribution, Typography-based Visual
Prompt, and Adversarial Patch Attacks. By comparing the performance
fluctuations of VLAMs before and after being attacked, we provide generalizable
analyses of how VLAMs respond to different physical threats.
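Title / Abstract (translated from Chinese)
Title: Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models
Recently, with advances in Multimodal Large Language Models (MLLMs), Vision Language Action Models (VLAMs) have been proposed to achieve better performance in open-vocabulary scenarios for robotic manipulation tasks. Because manipulation tasks involve direct interaction with the physical world, ensuring robustness and safety during task execution is a critical issue. In this paper, by synthesizing current safety research on MLLMs with the specific application scenarios of manipulation tasks in the physical world, we comprehensively evaluate VLAMs in the face of potential physical threats. Specifically, we propose the Physical Vulnerability Evaluating Pipeline (PVEP), which incorporates as many visual-modality physical threats as possible to evaluate the physical robustness of VLAMs. The physical threats in PVEP include Out-of-Distribution, Typography-based Visual Prompt, and Adversarial Patch Attacks. By comparing the performance fluctuations of VLAMs before and after attack, we provide generalizable analyses of how VLAMs respond to different physical threats.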
Summary
This paper evaluates the physical robustness of Vision Language Action Models (VLAMs) in robotic manipulation tasks, which involve direct interaction with the physical world. The authors propose the Physical Vulnerability Evaluating Pipeline (PVEP) to assess VLAMs against various physical threats, including out-of-distribution, typography-based visual prompt, and adversarial patch attacks. The study reveals how VLAMs perform under these attacks, providing insights into their vulnerability to different types of physical threats.
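This paper proposes the Physical Vulnerability Evaluating Pipeline (PVEP) to evaluate the physical robustness of Vision Language Action Models (VLAMs) in robotic manipulation tasks, assessing their response to physical threats such as out-of-distribution inputs, typography-based visual prompts, and adversarial patch attacks. The study reports VLAM performance before and after these attacks, offering insight into their vulnerability to physical threats in real-world applications.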
Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
Authors: Neale Ratzlaff, Matthew Lyle Olson, Musashi Hinck, Estelle Aflalo, Shao-Yen Tseng, Vasudev Lal, Phillip Howard
First: 2024-11-15T20:06:09+00:00 · Latest: 2025-09-18T15:58:56+00:00
Comments: 10 pages, 6 Figures, 8 Tables. arXiv admin note: text overlap with
arXiv:2410.13976
Abstract
Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as
general-purpose chatbots able to engage in conversations about visual inputs.
However, their responses are influenced by societal biases present in their
training datasets, leading to undesirable differences in how the model responds
when presented with images depicting people of different demographics. In this
work, we propose a training-free debiasing framework for LMMs that intervenes
on the model's representations during text generation by constructing a
steering vector that reduces references to protected attributes. Our framework
introduces two complementary methods: (1) a dataset-based approach that
constructs a steering vector by contrasting model activations on biased and
neutral inputs, and (2) a novel optimization-based approach designed for
low-resource settings, which constructs the steering vector using a single step
of gradient-based perturbation without requiring additional data. Our
experiments show that these interventions effectively reduce the propensity of
LMMs to generate text related to protected attributes while maintaining
sentiment and fluency. Furthermore, we demonstrate that debiased LMMs achieve
comparable accuracy to their unmodified counterparts on downstream tasks,
indicating that bias mitigation can be achieved without sacrificing model
performance.
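Title / Abstract (translated from Chinese)
Title: Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
Large Multi-Modal Models (LMMs) have shown impressive capabilities as general-purpose chatbots able to hold conversations about visual inputs. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds to images depicting people of different demographics. In this work, we propose a training-free debiasing framework for LMMs that intervenes on the model's representations during text generation by constructing a steering vector that reduces reliance on protected attributes. Our framework introduces two complementary methods: (1) a dataset-based approach that constructs the steering vector by contrasting model activations on biased and neutral inputs, and (2) a novel optimization-based approach for low-resource settings that constructs the steering vector using a single step of gradient-based perturbation, without requiring additional data. Our experiments show that these interventions effectively reduce the propensity of LMMs to generate text related to protected attributes while maintaining sentiment and fluency. Furthermore, we show that debiased LMMs achieve accuracy comparable to their unmodified counterparts on downstream tasks, indicating that bias mitigation can be achieved without sacrificing model performance.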
Summary
This work addresses the issue of societal biases in Large Multi-Modal Models (LMMs) by proposing a training-free debiasing framework. The framework uses two methods: a dataset-based approach that constructs a steering vector by contrasting model activations on biased and neutral inputs, and an optimization-based approach that uses a single step of gradient-based perturbation. Experiments show that these interventions reduce the model's propensity to generate text related to protected attributes while maintaining sentiment and fluency, and debiased LMMs perform comparably to their unmodified counterparts on downstream tasks.
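This paper proposes a training-free debiasing framework to address societal biases in Large Multi-Modal Models (LMMs). The framework reduces reliance on protected attributes by constructing a steering vector during text generation. It introduces two methods: a dataset-based contrastive approach and an optimization-based approach suited to low-resource settings. Experiments show that these interventions effectively reduce bias in generated text while maintaining sentiment and fluency, and that debiased LMMs perform comparably to unmodified models on downstream tasks.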
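The dataset-based method in the debiasing entry is described as "contrasting model activations on biased and neutral inputs". One common way to do that in activation-steering work is a difference of mean activations, sketched below on toy vectors. Whether the paper uses exactly this recipe is not stated in the abstract; the function names, the scaling factor `alpha`, and the choice to subtract the direction from a hidden state are all assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(biased_acts, neutral_acts):
    """Difference of mean activations: biased minus neutral."""
    mean_biased = mean_vector(biased_acts)
    mean_neutral = mean_vector(neutral_acts)
    return [b - n for b, n in zip(mean_biased, mean_neutral)]


def steer(hidden, direction, alpha=1.0):
    """Move a hidden state away from the attribute direction by alpha."""
    return [h - alpha * d for h, d in zip(hidden, direction)]


# Toy 2-D activations; real activations would come from a model layer.
biased = [[2.0, 0.0], [4.0, 0.0]]
neutral = [[1.0, 0.0], [1.0, 0.0]]
direction = steering_vector(biased, neutral)
steered = steer([5.0, 1.0], direction, alpha=0.5)
print(direction, steered)
```

In practice the strength `alpha` and the layer at which the vector is applied would need tuning, since too strong an intervention could harm the fluency that the entry reports is preserved.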
Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models
Authors: Mohammad Saleh Vahdatpour, Maryam Eyvazi, Yanqing Zhang
First: 2025-09-18T15:36:38+00:00 · Latest: 2025-09-18T15:36:38+00:00
Comments: Published at ICCVW 2025
Abstract
Air pollution remains a critical threat to public health and environmental
sustainability, yet conventional monitoring systems are often constrained by
limited spatial coverage and accessibility. This paper proposes an AI-driven
agent that predicts ambient air pollution levels from sky images and
synthesizes realistic visualizations of pollution scenarios using generative
modeling. Our approach combines statistical texture analysis with supervised
learning for pollution classification, and leverages vision-language model
(VLM)-guided image generation to produce interpretable representations of air
quality conditions. The generated visuals simulate varying degrees of
pollution, offering a foundation for user-facing interfaces that improve
transparency and support informed environmental decision-making. These outputs
can be seamlessly integrated into intelligent applications aimed at enhancing
situational awareness and encouraging behavioral responses based on real-time
forecasts. We validate our method using a dataset of urban sky images and
demonstrate its effectiveness in both pollution level estimation and
semantically consistent visual synthesis. The system design further
incorporates human-centered user experience principles to ensure accessibility,
clarity, and public engagement in air quality forecasting. To support scalable
and energy-efficient deployment, future iterations will incorporate a green CNN
architecture enhanced with FPGA-based incremental learning, enabling real-time
inference on edge platforms.
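Title / Abstract (translated from Chinese)
Title: Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models
Air pollution remains a critical threat to public health and environmental sustainability, yet conventional monitoring systems are often constrained by limited spatial coverage and accessibility. This paper proposes an AI-driven agent that predicts ambient air pollution levels from sky images and synthesizes realistic visualizations of pollution scenarios using generative modeling. Our approach combines statistical texture analysis with supervised learning for pollution classification, and uses vision-language model (VLM)-guided image generation to produce interpretable representations of air quality conditions. The generated visuals simulate varying degrees of pollution, providing a foundation for user-facing interfaces that improve transparency and support informed environmental decision-making. These outputs can be integrated into intelligent applications aimed at enhancing situational awareness and encouraging behavioral responses based on real-time forecasts. We validate the method on a dataset of urban sky images and demonstrate its effectiveness in both pollution level estimation and semantically consistent visual synthesis. The system design also incorporates human-centered user experience principles to ensure accessibility, clarity, and public engagement in air quality forecasting. To support scalable and energy-efficient deployment, future iterations will incorporate a green CNN architecture enhanced with FPGA-based incremental learning, enabling real-time inference on edge platforms.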
QuizRank: Picking Images by Quizzing VLMs
Authors: Tenghao Ji, Eytan Adar
First: 2025-09-18T15:22:33+00:00 · Latest: 2025-09-18T15:22:33+00:00
Abstract
Images play a vital role in improving the readability and comprehension of
Wikipedia articles by serving as "illustrative aids." However, not all images
are equally effective and not all Wikipedia editors are trained in their
selection. We propose QuizRank, a novel method of image selection that
leverages large language models (LLMs) and vision language models (VLMs) to
rank images as learning interventions. Our approach transforms textual
descriptions of the article's subject into multiple-choice questions about
important visual characteristics of the concept. We utilize these questions to
quiz the VLM: the better an image can help answer questions, the higher it is
ranked. To further improve discrimination between visually similar items, we
introduce a Contrastive QuizRank that leverages differences in the features of
target (e.g., a Western Bluebird) and distractor concepts (e.g., Mountain
Bluebird) to generate questions. We demonstrate the potential of VLMs as
effective visual evaluators by showing a high congruence with human quiz-takers
and an effective discriminative ranking of images.
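Title / Abstract (translated from Chinese)
Title: QuizRank: Picking Images by Quizzing VLMs
Images play a vital role in improving the readability and comprehension of Wikipedia articles by serving as "illustrative aids." However, not all images are equally effective, and not all Wikipedia editors are trained in image selection. We propose QuizRank, a novel image selection method that uses large language models (LLMs) and vision language models (VLMs) to rank images as learning interventions. Our approach transforms textual descriptions of the article's subject into multiple-choice questions about important visual characteristics of the concept. We use these questions to quiz the VLM: the better an image helps answer the questions, the higher it is ranked. To further improve discrimination between visually similar items, we introduce a Contrastive QuizRank that uses differences in the features of target (e.g., a Western Bluebird) and distractor (e.g., Mountain Bluebird) concepts to generate questions. We demonstrate the potential of VLMs as effective visual evaluators by showing high congruence with human quiz-takers and an effective discriminative ranking of images.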
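The ranking rule in the QuizRank entry ("the better an image can help answer questions, the higher it is ranked") can be sketched as scoring each image by the fraction of quiz questions answered correctly. The example below uses pre-recorded answers in place of real VLM calls; the file names, the answer key, and the use of plain accuracy as the score are hypothetical and chosen only to show the ranking step.

```python
def quiz_score(answers, answer_key):
    """Fraction of multiple-choice questions answered correctly."""
    correct = sum(1 for a, k in zip(answers, answer_key) if a == k)
    return correct / len(answer_key)


def rank_images(vlm_answers, answer_key):
    """Sort images from highest to lowest quiz score."""
    scores = {
        image: quiz_score(answers, answer_key)
        for image, answers in vlm_answers.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical answer key for three generated questions.
answer_key = ["B", "A", "D"]
# Hypothetical answers a VLM gave when shown each candidate image.
vlm_answers = {
    "img_1.jpg": ["B", "A", "C"],
    "img_2.jpg": ["B", "A", "D"],
    "img_3.jpg": ["A", "C", "C"],
}
ranking = rank_images(vlm_answers, answer_key)
print(ranking)
```

Under this reading, an image that lets the VLM answer every question about the concept's visual characteristics ranks first, and an uninformative image ranks last.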
PRISM: Product Retrieval In Shopping Carts using Hybrid Matching
Authors: Arda Kabadayi, Senem Velipasalar, Jiajing Chen
First: 2025-09-18T14:15:37+00:00 · Latest: 2025-09-18T14:15:37+00:00
Abstract
Compared to traditional image retrieval tasks, product retrieval in retail
settings is even more challenging. Products of the same type from different
brands may have highly similar visual appearances, and the query image may be
taken from an angle that differs significantly from view angles of the stored
catalog images. Foundational models, such as CLIP and SigLIP, often struggle to
distinguish these subtle but important local differences. Pixel-wise matching
methods, on the other hand, are computationally expensive and incur
prohibitively high matching times. In this paper, we propose a new, hybrid
method, called PRISM, for product retrieval in retail settings by leveraging
the advantages of both vision-language model-based and pixel-wise matching
approaches. To provide both speed and fine-grained retrieval
accuracy, PRISM consists of three stages: 1) A vision-language model (SigLIP)
is employed first to retrieve the top 35 most semantically similar products
from a fixed gallery, thereby narrowing the search space significantly; 2) a
segmentation model (YOLO-E) is applied to eliminate background clutter; 3)
fine-grained pixel-level matching is performed using LightGlue across the
filtered candidates. This framework enables more accurate discrimination
between products with high inter-class similarity by focusing on subtle visual
cues often missed by global models. Experiments performed on the ABV dataset
show that our proposed PRISM outperforms the state-of-the-art image retrieval
methods by 4.21% in top-1 accuracy while still remaining within the bounds of
real-time processing for practical retail deployments.
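Title / Abstract (translated from Chinese)
Title: PRISM: Product Retrieval In Shopping Carts using Hybrid Matching
Compared to traditional image retrieval tasks, product retrieval in retail settings is even more challenging. Products of the same type from different brands may look highly similar, and the query image may be taken from an angle that differs significantly from those of the stored catalog images. Foundational models such as CLIP and SigLIP often struggle to distinguish these subtle but important local differences. Pixel-wise matching methods, on the other hand, are computationally expensive and incur prohibitively high matching times. This paper proposes PRISM, a new hybrid method for product retrieval in retail settings that combines the advantages of vision-language model-based and pixel-wise matching approaches. PRISM consists of three stages: 1) a vision-language model (SigLIP) retrieves the top 35 most semantically similar products from a fixed gallery, significantly narrowing the search space; 2) a segmentation model (YOLO-E) is applied to eliminate background clutter; 3) fine-grained pixel-level matching is performed with LightGlue across the filtered candidates. By focusing on subtle visual cues that global models often miss, this framework discriminates more accurately between products with high inter-class similarity. Experiments on the ABV dataset show that PRISM outperforms state-of-the-art image retrieval methods by 4.21% in top-1 accuracy while remaining within the bounds of real-time processing for practical retail deployments.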
Summary
PRISM is a hybrid method for product retrieval in retail settings, combining the efficiency of vision-language models with the accuracy of pixel-level matching. It consists of three stages: first, SigLIP retrieves the top 35 semantically similar products, then YOLO-E removes background clutter, and finally, LightGlue performs fine-grained pixel-level matching. Experiments show PRISM outperforms state-of-the-art methods by 4.21% in top-1 accuracy while maintaining real-time processing capabilities.
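PRISM is a hybrid method for product retrieval in retail settings that combines the efficiency of vision-language models with the accuracy of pixel-level matching. It has three stages: first, SigLIP retrieves the 35 most similar products; then YOLO-E removes background clutter; finally, LightGlue performs fine-grained pixel-level matching. Experiments show that PRISM outperforms state-of-the-art methods by 4.21% in top-1 accuracy while maintaining real-time processing capability.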
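The coarse-to-fine structure of the PRISM entry can be sketched as a two-step ranking: shortlist gallery items by cosine similarity of global embeddings, then reorder the shortlist by a local match count. The sketch uses toy 2-D embeddings and a shortlist of 2 instead of 35, skips the segmentation stage entirely, and takes the match counts as given numbers; it does not call SigLIP, YOLO-E, or LightGlue, and all product names and values are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def shortlist(query_emb, gallery, k):
    """Stage 1: keep the k gallery items most similar to the query embedding."""
    ranked = sorted(
        gallery, key=lambda pid: cosine(query_emb, gallery[pid]), reverse=True
    )
    return ranked[:k]


def rerank(candidates, match_counts):
    """Stage 3: order shortlisted items by number of local feature matches."""
    return sorted(candidates, key=lambda pid: match_counts.get(pid, 0), reverse=True)


# Toy gallery: two near-identical colas and one unrelated product.
gallery = {"cola_a": [1.0, 0.0], "cola_b": [0.9, 0.1], "chips": [0.0, 1.0]}
query = [1.0, 0.05]
candidates = shortlist(query, gallery, k=2)
# Stand-in for a local matcher's output on the shortlisted items.
match_counts = {"cola_a": 12, "cola_b": 57}
final = rerank(candidates, match_counts)
print(candidates, final)
```

The toy case shows why the second stage matters: the global embedding slightly prefers `cola_a`, while the local matches identify `cola_b`, and the expensive matcher only runs on the shortlist rather than the whole gallery.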
EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Authors: Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang
First: 2025-09-18T14:07:53+00:00 · Latest: 2025-09-18T14:07:53+00:00
Abstract
Ultrasound imaging has become the preferred imaging modality for early cancer
screening due to its advantages of non-ionizing radiation, low cost, and
real-time imaging capabilities. However, conventional ultrasound diagnosis
heavily relies on physician expertise, presenting challenges of high
subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer
promising solutions for this issue, but existing general-purpose models
demonstrate limited knowledge in ultrasound medical tasks, with poor
generalization in multi-organ lesion recognition and low efficiency across
multi-task diagnostics. To address these limitations, we propose EchoVLM, a
vision-language model specifically designed for ultrasound medical imaging. The
model employs a Mixture of Experts (MoE) architecture trained on data spanning
seven anatomical regions. This design enables the model to perform multiple
tasks, including ultrasound report generation, diagnosis and visual
question-answering (VQA). Experimental results show that EchoVLM
improves BLEU-1 and ROUGE-1 scores by 10.15 and 4.77 points,
respectively, over Qwen2-VL on the ultrasound report
generation task. These findings suggest that EchoVLM has substantial potential
to enhance diagnostic accuracy in ultrasound imaging, thereby providing a
viable technical solution for future clinical applications. Source code and
model weights are available at https://github.com/Asunatan/EchoVLM.
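Title / Abstract (translated from Chinese)
Title: EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Ultrasound imaging has become the preferred imaging modality for early cancer screening because it involves no ionizing radiation, is low cost, and supports real-time imaging. However, conventional ultrasound diagnosis relies heavily on physician expertise, which leads to high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions, but existing general-purpose models have limited knowledge of ultrasound medical tasks, generalize poorly in multi-organ lesion recognition, and are inefficient across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model designed specifically for ultrasound medical imaging. The model uses a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions, enabling it to perform multiple tasks, including ultrasound report generation, diagnosis, and visual question-answering (VQA). Experimental results show that, compared to Qwen2-VL, EchoVLM improves BLEU-1 and ROUGE-1 scores by 10.15 and 4.77 points, respectively, on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM/.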
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
Authors: Pei Liu, Haipeng Liu, Haichao Liu, Xin Liu, Jinxin Ni, Jun Ma
First: 2025-02-25T10:02:12+00:00 · Latest: 2025-09-18T11:55:02+00:00
Abstract
Human drivers adeptly navigate complex scenarios by utilizing rich
attentional semantics, but the current autonomous systems struggle to replicate
this ability, as they often lose critical semantic information when converting
2D observations into 3D space. This loss hinders their effective
deployment in dynamic and complex environments. Leveraging the superior scene
understanding and reasoning abilities of Vision-Language Models (VLMs), we
propose VLM-E2E, a novel framework that uses the VLMs to enhance training by
providing attentional cues. Our method integrates textual representations into
Bird's-Eye-View (BEV) features for semantic supervision, which enables the
model to learn richer feature representations that explicitly capture the
driver's attentional semantics. By focusing on attentional semantics, VLM-E2E
better aligns with human-like driving behavior, which is critical for
navigating dynamic and complex environments. Furthermore, we introduce a
BEV-Text learnable weighted fusion strategy to address the issue of modality
importance imbalance in fusing multimodal information. This approach
dynamically balances the contributions of BEV and text features, ensuring that
the complementary information from visual and textual modalities is effectively
utilized. By explicitly addressing the imbalance in multimodal fusion, our
method facilitates a more holistic and robust representation of driving
environments. We evaluate VLM-E2E on the nuScenes dataset and achieve
significant improvements in perception, prediction, and planning over the
baseline end-to-end model, showcasing the effectiveness of our
attention-enhanced BEV representation in enabling more accurate and reliable
autonomous driving tasks.
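Title / Abstract (translated from Chinese)
Title: VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
Human drivers navigate complex scenarios adeptly by drawing on rich attentional semantics, but current autonomous driving systems often lose critical semantic information when converting 2D observations into 3D space, which limits their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that enhances training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, enabling the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E aligns better with human-like driving behavior, which is critical for navigating dynamic and complex environments. In addition, we introduce a learnable BEV-Text weighted fusion strategy to address modality importance imbalance when fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that complementary information from the visual and textual modalities is used effectively. By explicitly addressing the imbalance in multimodal fusion, our method yields a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showing the effectiveness of our attention-enhanced BEV representation for more accurate and reliable autonomous driving.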
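Summary
VLM-E2E is a framework that uses Vision-Language Models (VLMs) to supply driver attentional cues during the training of end-to-end autonomous driving models. It integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision and introduces a learnable BEV-Text weighted fusion strategy to balance the two modalities. On the nuScenes dataset, it achieves significant improvements in perception, prediction, and planning over the baseline end-to-end model.
The paper proposes VLM-E2E, a framework that uses vision-language models to provide attentional cues and strengthen end-to-end autonomous driving. The method integrates textual representations into bird's-eye-view features for semantic supervision, improving feature representations and bringing driving behavior closer to that of humans. It also introduces a learnable weighted fusion strategy that balances the contributions of the visual and textual modalities, yielding significant improvements over the baseline model in perception, prediction, and planning.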
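The "learnable weighted fusion" in the VLM-E2E entry can be pictured as a convex combination of BEV and text features whose two weights come from a softmax over trainable logits. The abstract does not specify the form of the fusion, so the scalar softmax weights, the function names, and the toy feature vectors below are assumptions; training of the logits is not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(bev_feat, text_feat, logits):
    """Blend two feature vectors with softmax-normalized modality weights."""
    w_bev, w_text = softmax(logits)
    return [w_bev * b + w_text * t for b, t in zip(bev_feat, text_feat)]


bev = [2.0, 0.0]
text = [0.0, 2.0]
balanced = fuse(bev, text, logits=[0.0, 0.0])        # equal weights
bev_heavy = fuse(bev, text, logits=[10.0, -10.0])    # almost all weight on BEV
print(balanced, bev_heavy)
```

Because the weights always sum to one, gradient updates to the logits shift importance between modalities without letting either feature scale blow up, which is one simple way to address the modality imbalance the entry describes.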
MARIC: Multi-Agent Reasoning for Image Classification
Authors: Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee
First: 2025-09-18T11:27:00+00:00 · Latest: 2025-09-18T11:27:00+00:00
Comments: Preprint
Abstract
Image classification has traditionally relied on parameter-intensive model
training, requiring large-scale annotated datasets and extensive fine-tuning to
achieve competitive performance. While recent vision language models (VLMs)
alleviate some of these constraints, they remain limited by their reliance on
single-pass representations, often failing to capture complementary aspects of
visual content. In this paper, we introduce Multi-Agent based Reasoning for
Image Classification (MARIC), a multi-agent framework that reformulates image
classification as a collaborative reasoning process. MARIC first utilizes an
Outliner Agent to analyze the global theme of the image and generate targeted
prompts. Based on these prompts, three Aspect Agents extract fine-grained
descriptions along distinct visual dimensions. Finally, a Reasoning Agent
synthesizes these complementary outputs through an integrated reflection step,
producing a unified representation for classification. By explicitly
decomposing the task into multiple perspectives and encouraging reflective
synthesis, MARIC mitigates the shortcomings of both parameter-heavy training
and monolithic VLM reasoning. Experiments on 4 diverse image classification
benchmark datasets demonstrate that MARIC significantly outperforms baselines,
highlighting the effectiveness of multi-agent visual reasoning for robust and
interpretable image classification.
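Title / Abstract (translated from Chinese)
Title: MARIC: Multi-Agent Reasoning for Image Classification
Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine-tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single-pass representations and often fail to capture complementary aspects of visual content. In this paper, we introduce Multi-Agent based Reasoning for Image Classification (MARIC), a multi-agent framework that reformulates image classification as a collaborative reasoning process. MARIC first uses an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine-grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through an integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets show that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.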
Summary / 总结
MARIC is a multi-agent framework for image classification that decomposes the task into multiple perspectives and encourages reflective synthesis. It uses an Outliner Agent to analyze the global theme and generate targeted prompts, followed by three Aspect Agents that extract fine-grained descriptions along distinct visual dimensions. The Reasoning Agent then synthesizes these outputs to produce a unified representation for classification. Experiments show that MARIC outperforms baselines on four diverse image classification benchmarks, demonstrating the effectiveness of multi-agent visual reasoning.
MARIC 是一种多代理框架,用于图像分类,旨在解决传统参数密集型训练和单一视觉语言模型的局限性。它将任务分解为全局主题分析的代理、三个针对不同视觉维度的细粒度描述的代理,以及一个进行综合推理的代理。实验表明,MARIC 在四个不同的数据集上优于基线,展示了多代理视觉推理在鲁棒和可解释图像分类中的有效性。
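The three-stage flow described above (Outliner Agent, Aspect Agents, Reasoning Agent) can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the `VLM` callable, the prompt wording, the label-matching rule, and `fake_vlm` are all hypothetical stand-ins.

```python
from typing import Callable, List

# Hypothetical stand-in for a vision language model call: (image, prompt) -> text.
VLM = Callable[[str, str], str]


def maric_classify(image: str, labels: List[str], vlm: VLM, n_aspects: int = 3) -> str:
    # Outliner Agent: look at the global theme and propose one prompt per aspect.
    outline = vlm(
        image,
        f"Describe the global theme of the image, then list {n_aspects} aspect prompts, one per line.",
    )
    prompts = [line for line in outline.splitlines() if line.strip()][:n_aspects]

    # Aspect Agents: each prompt yields one fine-grained description.
    descriptions = [vlm(image, prompt) for prompt in prompts]

    # Reasoning Agent: reflect on the descriptions and pick a label.
    joined = "\n".join(descriptions)
    answer = vlm(image, f"Reflect on these descriptions, then choose one label from {labels}:\n{joined}")

    # Assumed parsing rule: return the first candidate label mentioned in the answer.
    for label in labels:
        if label.lower() in answer.lower():
            return label
    return labels[0]  # fallback when no label is mentioned


def fake_vlm(image: str, prompt: str) -> str:
    # Deterministic toy model used only to exercise the control flow.
    if prompt.startswith("Describe the global theme"):
        return "Focus on colors.\nFocus on shapes.\nFocus on textures."
    if prompt.startswith("Reflect"):
        return "After reflection, the image shows a cat."
    return f"Observation for: {prompt}"


print(maric_classify("img.png", ["dog", "cat"], fake_vlm))
```

Swapping `fake_vlm` for a real model client would leave the control flow unchanged, which is the point of keeping the agents as plain prompt calls in this sketch.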
The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
Authors: Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Parvez
First: 2025-09-16T08:17:39+00:00 · Latest: 2025-09-18T10:10:19+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in complex
visual understanding across scientific and reasoning tasks. While performance
benchmarking has advanced our understanding of these capabilities, the critical
dimension of uncertainty quantification has received insufficient attention.
Therefore, unlike prior conformal prediction studies that focused on limited
settings, we conduct a comprehensive uncertainty benchmarking study, evaluating
16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets
with 3 distinct scoring functions. Our findings demonstrate that larger models
consistently exhibit better uncertainty quantification; models that know more
also know better what they don't know. More certain models achieve higher
accuracy, while mathematical and reasoning tasks elicit poorer uncertainty
performance across all models compared to other domains. This work establishes
a foundation for reliable uncertainty evaluation in multimodal systems.
中文标题/摘要
标题:说“可能”的艺术:一种用于VLMs不确定性基准测试的同构透镜
视觉-语言模型(VLMs)在跨科学和推理任务的复杂视觉理解方面取得了显著进展。尽管性能基准测试已加深了我们对这些能力的理解,但不确定性量化这一关键维度却未得到充分关注。因此,不同于以往专注于有限场景的同构预测研究,我们进行了全面的不确定性基准测试研究,评估了16个最先进的VLMs(开源和闭源)在6个多模态数据集上的表现,使用了3种不同的评分函数。我们的研究结果表明,较大的模型在不确定性量化方面表现更一致;知道得越多的模型也更清楚自己不知道什么。更确定的模型具有更高的准确性,而数学和推理任务在所有模型中的不确定性表现普遍低于其他领域。这项工作为多模态系统的可靠不确定性评估奠定了基础。
Summary / 总结
This study aims to evaluate the uncertainty quantification in Vision-Language Models (VLMs) by benchmarking 16 state-of-the-art VLMs across 6 multimodal datasets using 3 scoring functions. The research finds that larger models provide better uncertainty quantification, and models with more knowledge are more aware of their limitations. More certain models achieve higher accuracy, but mathematical and reasoning tasks show poorer uncertainty performance compared to other domains. This work lays the groundwork for reliable uncertainty evaluation in multimodal systems.
本研究旨在通过在6个多模态数据集上使用3种评分函数评估16个最先进的视觉-语言模型的不确定性量化。研究发现,较大的模型提供了更好的不确定性量化,知识更多的模型更能意识到自己的局限性。更确定的模型能获得更高的准确性,但数学和推理任务在所有领域中的不确定性表现较差。这项工作为多模态系统的可靠不确定性评估奠定了基础。
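For readers unfamiliar with conformal prediction, a minimal split-conformal sketch may help. The abstract does not name its three scoring functions, so the score used here (one minus the probability of the true class) is an assumption, as are the function names and the toy numbers.

```python
import math
from typing import List, Sequence


def conformal_threshold(cal_probs: Sequence[Sequence[float]],
                        cal_labels: Sequence[int],
                        alpha: float = 0.1) -> float:
    # Nonconformity score (assumed): 1 - probability assigned to the true answer.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank, 1-indexed.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every option is kept
    return scores[rank - 1]


def prediction_set(probs: Sequence[float], qhat: float) -> List[int]:
    # Keep every option whose score does not exceed the calibrated threshold.
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


cal_probs = [[0.875, 0.125], [0.5, 0.5], [0.25, 0.75]]
cal_labels = [0, 0, 1]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
print(qhat, prediction_set([0.75, 0.25], qhat))
```

Smaller prediction sets at a fixed coverage level are one common way to read "better uncertainty quantification" in this setting.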
Frame Sampling Strategies Matter: A Benchmark for small vision language models
Authors: Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi
First: 2025-09-18T09:18:42+00:00 · Latest: 2025-09-18T09:18:42+00:00
Abstract
Comparing vision language models on videos is particularly complex, as
performance is jointly determined by the model's visual representation
capacity and the frame-sampling strategy used to construct the input. Current
video benchmarks are suspected to suffer from substantial frame-sampling bias,
as models are evaluated with different frame selection strategies. In this
work, we propose the first frame-accurate benchmark of state-of-the-art small
VLMs for video question-answering, evaluated under controlled frame-sampling
strategies. Our results confirm the suspected bias and highlight both
data-specific and task-specific behaviors of SVLMs under different
frame-sampling techniques. By open-sourcing our benchmarking code, we provide
the community with a reproducible and unbiased protocol for evaluating video
VLMs and emphasize the need for standardized frame-sampling strategies tailored
to each benchmarking dataset in future research.
中文标题/摘要
标题:帧采样策略很重要:小型视觉语言模型基准测试
在视频上比较视觉语言模型特别复杂,因为模型的表现由其视觉表示能力和用于构建输入的帧采样策略共同决定。当前的视频基准可能受到显著的帧采样偏差影响,因为模型使用了不同的帧选择策略进行评估。在本工作中,我们提出了第一个针对视频问答的最先进的小型视觉语言模型的帧准确基准测试,该基准测试在受控的帧采样策略下进行评估。我们的结果证实了存在的偏差,并突显了在不同帧采样技术下SVLMs的数据特异性和任务特异性行为。通过开源我们的基准测试代码,我们为社区提供了可重复且无偏的视频VLMs评估协议,并强调未来研究中为每个基准测试数据集制定标准化帧采样策略的必要性。
Summary / 总结
This study addresses the complexity of evaluating vision language models on videos, where model performance is influenced by both its visual representation ability and the frame-sampling strategy. The researchers propose a new benchmark for small vision language models, ensuring controlled frame-sampling strategies. Their findings confirm the presence of frame-sampling bias and reveal task-specific and data-specific behaviors of these models under different sampling techniques. The benchmarking code is open-sourced to facilitate reproducible and unbiased evaluations in future research.
该研究解决了在视频上比较视觉语言模型的复杂性,模型性能受其视觉表示能力和帧采样策略的影响。作者引入了一个针对小型视觉语言模型的帧准确基准,采用受控的帧采样策略进行评估。结果证实了帧采样偏见的存在,并揭示了在不同采样技术下这些模型的特定行为。开源基准代码确保了可重复性,并促进了未来研究中针对每个基准数据集标准化帧采样策略的推广。
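To make the notion of a frame-sampling strategy concrete, here is a small sketch of two common choices: uniform sampling of a fixed number of frames and fixed-rate sampling. The abstract does not list the strategies it evaluates, so both functions and their names are assumptions for illustration.

```python
from typing import List, Optional


def uniform_indices(total_frames: int, num_samples: int) -> List[int]:
    # Pick the centre frame of each of num_samples equal-length segments.
    if total_frames <= 0 or num_samples <= 0:
        return []
    n = min(num_samples, total_frames)
    return [int((i + 0.5) * total_frames / n) for i in range(n)]


def fps_indices(total_frames: int, video_fps: float, target_fps: float,
                max_frames: Optional[int] = None) -> List[int]:
    # Take one frame every video_fps / target_fps frames, optionally capped.
    step = max(1, round(video_fps / target_fps))
    indices = list(range(0, total_frames, step))
    return indices if max_frames is None else indices[:max_frames]


print(uniform_indices(100, 4))
print(fps_indices(100, video_fps=30, target_fps=1))
```

The two strategies select different frames from the same 100-frame clip, which is the kind of input difference a controlled benchmark has to hold fixed when comparing models.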
PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution
Authors: Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao
First: 2025-04-19T01:11:46+00:00 · Latest: 2025-09-18T08:24:25+00:00
Abstract
The challenge of tracing the source attribution of forged faces has gained
significant attention due to the rapid advancement of generative models.
However, existing deepfake attribution (DFA) works primarily focus on the
interaction among various domains in vision modality, and other modalities such
as texts and face parsing are not fully explored. Besides, they tend to fail to
assess the generalization performance of deepfake attributors to unseen
advanced generators like diffusion in a fine-grained manner. In this paper, we
propose a novel parsing-aware vision language model with dynamic contrastive
learning (PVLM) method for zero-shot deepfake attribution (ZS-DFA), which
facilitates effective and fine-grained traceability to unseen advanced
generators. Specifically, we conduct a novel and fine-grained ZS-DFA benchmark
to evaluate the attribution performance of deepfake attributors to unseen
advanced generators like diffusion. Besides, we propose an innovative
parsing-guided vision language model with dynamic contrastive learning (PVLM)
method to capture general and diverse attribution features. We are motivated by
the observation that the preservation of source face attributes in facial
images generated by GAN and diffusion models varies significantly. We employ
the inherent face attributes preservation differences to capture face
parsing-aware forgery representations. Therefore, we devise a novel parsing
encoder to focus on global face attribute embeddings, enabling parsing-guided
DFA representation learning via dynamic vision-parsing matching. Additionally,
we present a novel deepfake attribution contrastive center loss to pull
relevant generators closer and push irrelevant ones away, which can be
introduced into DFA models to enhance traceability. Experimental results show
that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various
protocol evaluations.
中文标题/摘要
标题:PVLM:具有动态对比学习的感知导向视觉语言模型在零样本深度伪造溯源中的应用
由于生成模型的迅速发展,伪造人脸的来源溯源挑战引起了广泛关注。然而,现有的深度伪造溯源(DFA)工作主要集中在视觉模态内各领域的交互上,而文本和其他模态如面部解析尚未得到充分探索。此外,它们倾向于以精细的方式评估深度伪造溯源器对未见过的高级生成器(如扩散模型)的一般化性能。在本文中,我们提出了一种新颖的具有动态对比学习的感知导向视觉语言模型(PVLM)方法,用于零样本深度伪造溯源(ZS-DFA),以促进对未见过的高级生成器的有效和精细的溯源。具体而言,我们建立了一个新颖且精细的ZS-DFA基准,以评估深度伪造溯源器对未见过的高级生成器(如扩散模型)的溯源性能。此外,我们提出了一种创新的感知导向视觉语言模型与动态对比学习(PVLM)方法,以捕捉通用和多样的溯源特征。我们受到观察到的由GAN和扩散模型生成的面部图像中源人脸属性保留差异显著这一事实的启发,利用这些内在的属性保留差异来捕捉感知导向的伪造表示。因此,我们设计了一种新颖的解析编码器,专注于全局人脸属性嵌入,通过动态视觉解析匹配实现感知导向的DFA表示学习。此外,我们提出了一种新颖的深度伪造溯源对比中心损失,将相关生成器拉近,将不相关生成器推开,该方法可以引入到DFA模型中以增强溯源性。实验结果表明,通过各种协议评估,我们的模型在ZS-DFA基准上超过了最先进的方法。
Summary / 总结
This paper addresses the challenge of tracing the source attribution of forged faces using a novel parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA). The authors propose a fine-grained ZS-DFA benchmark to evaluate the performance of deepfake attributors on unseen advanced generators like diffusion. They introduce a parsing-guided vision language model with dynamic contrastive learning to capture general and diverse attribution features. The model includes a parsing encoder that focuses on global face attribute embeddings and a novel deepfake attribution contrastive center loss to enhance traceability. Experimental results demonstrate that PVLM outperforms existing methods on the ZS-DFA benchmark.
本文提出了一种新的解析感知视觉语言模型(PVLM)结合动态对比学习方法,用于零样本深伪溯源(ZS-DFA),以应对伪造人脸来源追踪的挑战。作者提出了一个细粒度的ZS-DFA基准来评估深伪检测器在未见过的高级生成器(如扩散模型)上的性能。他们引入了一种解析引导的视觉语言模型结合动态对比学习方法,以捕捉通用和多样化的溯源特征。该模型包含一个解析编码器,专注于全局面部属性嵌入,并提出了一种新的深伪溯源对比中心损失来增强溯源能力。实验结果表明,PVLM在ZS-DFA基准上超越了现有方法。
BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
Authors: Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
First: 2025-09-17T07:58:36+00:00 · Latest: 2025-09-18T04:57:32+00:00
Abstract
Recent advancements in Diffusion Transformers (DiTs) have established them as
the state-of-the-art method for video generation. However, their inherently
sequential denoising process results in inevitable latency, limiting real-world
applicability. Existing acceleration methods either compromise visual quality
due to architectural modifications or fail to reuse intermediate features at
proper granularity. Our analysis reveals that DiT blocks are the primary
contributors to inference latency. Across diffusion timesteps, the feature
variations of DiT blocks exhibit a U-shaped pattern with high similarity during
intermediate timesteps, which suggests substantial computational redundancy. In
this paper, we propose Block-Wise Caching (BWCache), a training-free method to
accelerate DiT-based video generation. BWCache dynamically caches and reuses
features from DiT blocks across diffusion timesteps. Furthermore, we introduce
a similarity indicator that triggers feature reuse only when the differences
between block features at adjacent timesteps fall below a threshold, thereby
minimizing redundant computations while maintaining visual fidelity. Extensive
experiments on several video diffusion models demonstrate that BWCache achieves
up to a 2.24x speedup with comparable visual quality.
中文标题/摘要
标题:BWCache:通过块级缓存加速视频扩散变换器
近期扩散变换器(DiTs)的发展已使其成为视频生成的最新方法。然而,其固有的顺序去噪过程不可避免地导致了延迟,限制了其实用性。现有的加速方法要么因架构修改而牺牲视觉质量,要么无法在适当粒度上重用中间特征。我们的分析表明,DiT块是推理延迟的主要来源。在扩散时间步中,DiT块的特征变化呈现出U形模式,在中间时间步具有高度相似性,这表明存在大量的计算冗余。在本文中,我们提出了一种无需训练的块级缓存(BWCache)方法,以加速基于DiT的视频生成。BWCache动态地跨扩散时间步缓存和重用DiT块的特征。此外,我们引入了一个相似性指标,仅在相邻时间步块特征之间的差异低于阈值时触发特征重用,从而最小化冗余计算并保持视觉保真度。在多个视频扩散模型上的广泛实验表明,BWCache实现了最高2.24倍的加速,同时保持了视觉质量。
Summary / 总结
BWCache is a training-free method that accelerates Diffusion Transformers (DiTs) by caching and reusing features from DiT blocks across diffusion timesteps. This approach reduces computational redundancy and achieves up to 2.24 times speedup with comparable visual quality. The method uses a similarity indicator to trigger feature reuse only when the differences between block features at adjacent timesteps are below a threshold, ensuring minimal redundant computations while maintaining visual fidelity.
BWCache 通过在扩散时间步之间缓存和重用 DiT 块的特征来加速视频扩散变换器 (DiTs),减少计算冗余。该方法基于相似性指标动态触发特征重用,确保视觉保真度。实验显示最高可达 2.24 倍的加速比,视觉质量相当。
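The threshold-triggered reuse described above can be sketched roughly as follows. This is a toy illustration under several assumptions: the relative L1 distance as the similarity indicator, the `max_reuse` cap, and the `BlockCache` class are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Optional

Vector = List[float]


def rel_l1(a: Vector, b: Vector) -> float:
    # Relative L1 difference between two feature vectors (assumed indicator).
    denom = sum(abs(x) for x in b) or 1.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom


class BlockCache:
    """Caches one block's output and reuses it while features change slowly."""

    def __init__(self, threshold: float, max_reuse: int = 1) -> None:
        self.threshold = threshold
        self.max_reuse = max_reuse          # cap on consecutive reuses (assumption)
        self.cached: Optional[Vector] = None
        self.last_change = float("inf")     # change measured at the last computed step
        self.reused = 0
        self.computed_calls = 0

    def run(self, block: Callable[[Vector], Vector], x: Vector) -> Vector:
        if (self.cached is not None and self.last_change < self.threshold
                and self.reused < self.max_reuse):
            self.reused += 1
            return self.cached              # skip the block, reuse cached features
        out = block(x)
        self.computed_calls += 1
        if self.cached is not None:
            self.last_change = rel_l1(out, self.cached)
        self.cached = out
        self.reused = 0
        return out


cache = BlockCache(threshold=0.1)
block = lambda x: [2.0 * v for v in x]      # stand-in for a DiT block
for step_input in ([1.0, 1.0], [1.0, 1.0], [1.01, 1.0], [1.02, 1.0]):
    cache.run(block, step_input)
print(cache.computed_calls)
```

Each call to `run` plays the role of one diffusion timestep; the third call is served from the cache because the change measured at the second step was below the threshold.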
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
Authors: Rashid Mushkani
First: 2025-09-18T03:21:10+00:00 · Latest: 2025-09-18T03:21:10+00:00
Abstract
Understanding how people read city scenes can inform design and planning. We
introduce a small benchmark for testing vision-language models (VLMs) on urban
perception using 100 Montreal street images, evenly split between photographs
and photorealistic synthetic scenes. Twelve participants from seven community
groups supplied 230 annotation forms across 30 dimensions mixing physical
attributes and subjective impressions. French responses were normalized to
English. We evaluated seven VLMs in a zero-shot setup with a structured prompt
and deterministic parser. We use accuracy for single-choice items and Jaccard
overlap for multi-label items; human agreement uses Krippendorff's alpha and
pairwise Jaccard. Results suggest stronger model alignment on visible,
objective properties than subjective appraisals. The top system (claude-sonnet)
reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human
agreement coincides with better model scores. Synthetic images slightly lower
scores. We release the benchmark, prompts, and harness for reproducible,
uncertainty-aware evaluation in participatory urban analysis.
中文标题/摘要
标题:视觉语言模型如何理解城市场景?一种城市感知基准
理解人们如何阅读城市场景可以指导设计和规划。我们引入了一个小型基准,用于测试视觉语言模型(VLMs)在城市感知方面的表现,使用了100张蒙特利尔街道图像,其中照片和逼真合成场景各占一半。来自七个社区团体的12名参与者提供了涵盖30个维度的230份注释表,这些维度混合了物理属性和主观印象。法语回答被标准化为英语。我们在零样本设置下评估了七种VLMs,使用结构化提示和确定性解析器。我们使用准确率评估单选题,使用Jaccard重叠评估多标签题;人类一致性使用Krippendorff的alpha和成对Jaccard。结果表明,模型在可见的、客观的属性上比主观评价有更好的对齐。顶级系统(claude-sonnet)在多标签题上的宏平均得分为0.31,平均Jaccard得分为0.48。更高的人类一致性与更好的模型得分相吻合。合成图像略微降低了得分。我们发布了基准、提示和框架,以实现可重复的、考虑不确定性的评估,用于参与式城市分析。
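The two scoring rules named in the abstract, accuracy for single-choice items and Jaccard overlap for multi-label items, can be written down directly. The sketch below is not the released harness; the function names, the toy answers, and the convention that two empty label sets score 1.0 are assumptions.

```python
from typing import Iterable, List, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    # Fraction of single-choice items answered identically to the reference.
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def jaccard(predicted: Iterable[str], reference: Iterable[str]) -> float:
    # |intersection| / |union| of two label sets.
    a, b = set(predicted), set(reference)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets agree perfectly
    return len(a & b) / len(union)


def mean_jaccard(predicted: List[Iterable[str]], reference: List[Iterable[str]]) -> float:
    pairs = list(zip(predicted, reference))
    return sum(jaccard(p, r) for p, r in pairs) / len(pairs)


print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
print(jaccard({"trees", "benches"}, {"trees", "cars"}))
```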
VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Authors: Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, Yuxin Ma
First: 2025-09-18T03:15:00+00:00 · Latest: 2025-09-18T03:15:00+00:00
Comments: 11 pages, 7 figures, 1 table, accepted to IEEE VIS 2025 (IEEE
Transactions on Visualization and Computer Graphics)
Abstract
Vision-language (VL) models have shown transformative potential across
various critical domains due to their capability to comprehend multi-modal
information. However, their performance frequently degrades under distribution
shifts, making it crucial to assess and improve robustness against real-world
data corruption encountered in practical applications. While advancements in VL
benchmark datasets and data augmentation (DA) have contributed to robustness
evaluation and improvement, there remain challenges due to a lack of in-depth
comprehension of model behavior as well as the need for expertise and iterative
efforts to explore data patterns. Given the achievement of visualization in
explaining complex models and exploring large-scale data, understanding the
impact of various data corruption on VL models aligns naturally with a visual
analytics approach. To address these challenges, we introduce VisMoDAl, a
visual analytics framework designed to evaluate VL model robustness against
various corruption types and identify underperforming samples to guide the
development of effective DA strategies. Grounded in the literature review and
expert discussions, VisMoDAl supports multi-level analysis, ranging from
examining performance under specific corruptions to task-driven inspection of
model behavior and corresponding data slice. Unlike conventional works,
VisMoDAl enables users to reason about the effects of corruption on VL models,
facilitating both model behavior understanding and DA strategy formulation. The
utility of our system is demonstrated through case studies and quantitative
evaluations focused on corruption robustness in the image captioning task.
中文标题/摘要
标题:VisMoDAl:评估和提升视觉语言模型抗腐败鲁棒性的视觉分析
视觉语言(VL)模型因其能够理解多模态信息而在各个关键领域展现了变革性的潜力。然而,它们在分布变化下的性能经常下降,因此评估和提升其在实际应用中遇到的真实数据腐败情况下的鲁棒性变得至关重要。尽管VL基准数据集和数据增强(DA)的进步已经促进了鲁棒性的评估和提升,但由于缺乏对模型行为的深入理解以及需要专业知识和迭代探索数据模式,仍然存在挑战。鉴于可视化在解释复杂模型和探索大规模数据方面的成就,理解各种数据腐败对VL模型的影响自然与视觉分析方法相契合。为了解决这些挑战,我们引入了VisMoDAl,这是一种视觉分析框架,旨在评估VL模型在各种腐败类型下的鲁棒性,并识别表现不佳的样本以指导有效的数据增强策略的发展。VisMoDAl基于文献综述和专家讨论,支持多级分析,从特定腐败下的性能检查到任务驱动的模型行为和相应数据切片的检查。与传统工作不同,VisMoDAl使用户能够推理数据腐败对VL模型的影响,促进模型行为理解和数据增强策略的制定。通过针对图像字幕任务的抗腐败鲁棒性的案例研究和定量评估,展示了我们系统的实用性。
Summary / 总结
VisMoDAl is a visual analytics framework designed to evaluate and improve the corruption robustness of vision-language models. It addresses the challenge of understanding model behavior under distribution shifts by enabling multi-level analysis of performance under various corruptions. Key findings show that VisMoDAl helps identify underperforming samples, guiding the development of effective data augmentation strategies. The system's utility is demonstrated through case studies and quantitative evaluations in the image captioning task.
VisMoDAl 是一个视觉分析框架,旨在评估和提高视觉语言模型在各种数据腐蚀类型下的鲁棒性。它支持多级分析,从特定腐蚀下的性能检查到任务驱动的模型行为和相应数据切片的检查。关键发现表明,VisMoDAl 使用户能够更好地理解模型行为并开发有效的数据增强策略,从而提高图像描述任务中的模型性能。
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
First: 2024-06-20T17:45:02+00:00 · Latest: 2025-09-18T02:44:34+00:00
Comments: Project website: https://ical-learning.github.io/
Abstract
Large-scale generative language and vision-language models (LLMs and VLMs)
excel in few-shot learning but require high-quality demonstrations. We propose
In-Context Abstraction Learning (ICAL), enabling VLM agents to transform
suboptimal trajectories into high-quality training data through self-reflection
and human feedback. Given imperfect task demonstrations, a VLM abstracts
trajectories into generalized strategies and action annotations by correcting
inefficiencies and annotating cognitive abstractions: causal relationships,
object state changes, temporal subgoals, and task-relevant visual elements.
These annotations are iteratively refined through human feedback during
execution in similar environments. The resulting examples significantly improve
decision-making when used for retrieval-augmented generation or fine-tuning. As
the agent's example library grows, it becomes more efficient at abstracting new
examples, requiring less human feedback and fewer environment interactions.
ICAL achieves state-of-the-art results across multiple benchmarks. In TEACh
dialogue-based instruction following, combining fine-tuning and retrieval on
ICAL examples outperforms raw human demonstrations and expert examples by 17.5%
in goal-condition success. In VisualWebArena, retrieval-augmented GPT-4V with
ICAL improves task success 1.6x, while fine-tuned Qwen2-VL achieves 2.8x
improvement over the base model. In Ego4D action forecasting, we surpass
few-shot GPT-4V and remain competitive with supervised models. Our approach
scales 2x better than raw demonstrations and significantly reduces manual
prompt engineering requirements.
中文标题/摘要
标题:VLM智能体生成自己的记忆:将经验提炼为具身思维程序
大规模生成语言和跨模态语言模型(LLMs和VLMs)在少量示例学习方面表现出色,但需要高质量的演示。我们提出了上下文抽象学习(ICAL),使VLM智能体能够通过自我反思和人类反馈将次优轨迹转化为高质量的训练数据。给定不完美的任务演示,VLM将轨迹抽象为通用策略和动作注释,通过纠正低效性和标注认知抽象:因果关系、物体状态变化、时间子目标和任务相关的视觉元素。这些注释在类似环境中执行期间通过人类反馈迭代优化。生成的示例在用于检索增强生成或微调时显著改善了决策。随着智能体示例库的增长,它在抽象新示例方面变得更加高效,需要更少的人类反馈和环境交互。ICAL在多个基准测试中取得了最先进的成果。在TEACh对话式指令遵循中,结合ICAL示例的微调和检索优于原始人类演示和专家示例17.5%的目标条件成功率。在VisualWebArena中,使用ICAL的检索增强GPT-4V将任务成功率提高了1.6倍,而微调后的Qwen2-VL将基线模型提高了2.8倍。在Ego4D动作预测中,我们超越了少量示例的GPT-4V,并在监督模型中保持竞争力。我们的方法比原始演示扩展速度快2倍,并显著减少了手动提示工程的需求。
Summary / 总结
The research aims to enhance the few-shot learning capabilities of vision-language models (VLMs) by enabling them to generate high-quality training data through self-reflection and human feedback. The method, In-Context Abstraction Learning (ICAL), allows VLM agents to transform suboptimal trajectories into generalized strategies and action annotations. Key findings show that ICAL examples significantly improve decision-making across several benchmarks. In TEACh dialogue-based instruction following, combining fine-tuning and retrieval on ICAL examples outperforms raw human demonstrations and expert examples by 17.5% in goal-condition success. In VisualWebArena, retrieval-augmented GPT-4V with ICAL improves task success by 1.6x, and fine-tuned Qwen2-VL achieves a 2.8x improvement over its base model.
研究旨在通过自我反思和人类反馈,使视觉语言模型(VLMs)能够生成高质量的训练数据,从而增强其少量样本学习能力。方法In-Context Abstraction Learning (ICAL) 允许VLM代理将次优轨迹转化为通用策略和动作注释。关键发现表明,ICAL在多种基准测试中显著提高了决策能力,优于原始的人类示范和专家示例。例如,在TEACh对话式指令跟随中,ICAL示例将目标条件成功率提高了17.5%。在VisualWebArena中,ICAL将任务成功率提高了1.6倍,而微调后的Qwen2-VL将基线模型提高了2.8倍。
An Empirical Study of Federated Prompt Learning for Vision Language Model
Authors: Zhihao Wang, Wenke Huang, Tian Chen, Zekun Shi, Guancheng Wan, Yu Qiao, Bin Yang, Jian Wang, Bing Li, Mang Ye
First: 2025-05-29T03:09:15+00:00 · Latest: 2025-09-18T02:36:50+00:00
Abstract
The Vision Language Model (VLM) excels in aligning vision and language
representations, and prompt learning has emerged as a key technique for
adapting such models to downstream tasks. However, the application of prompt
learning with VLM in federated learning (FL) scenarios remains underexplored.
This paper systematically investigates the behavioral differences between
language prompt learning (LPT) and vision prompt learning (VPT) under data
heterogeneity challenges, including label skew and domain shift. We conduct
extensive experiments to evaluate the impact of various FL and prompt
configurations, such as client scale, aggregation strategies, and prompt
length, to assess the robustness of Federated Prompt Learning (FPL).
Furthermore, we explore strategies for enhancing prompt learning in complex
scenarios where label skew and domain shift coexist, including leveraging both
prompt types when computational resources allow. Our findings offer practical
insights into optimizing prompt learning in federated settings, contributing to
the broader deployment of VLMs in privacy-preserving environments.
中文标题/摘要
标题:联邦提示学习在视觉语言模型中的实证研究
视觉语言模型(VLM)在视觉和语言表示对齐方面表现出色,提示学习已成为将此类模型适应下游任务的关键技术。然而,提示学习在联邦学习(FL)场景中的应用尚未得到充分探索。本文系统地研究了在数据异质性挑战(包括标签偏斜和领域偏移)下语言提示学习(LPT)和视觉提示学习(VPT)的行为差异。我们进行了广泛的实验,评估了各种FL和提示配置(如客户端规模、聚合策略和提示长度)对联邦提示学习(FPL)鲁棒性的影响。此外,我们探讨了在标签偏斜和领域偏移共存的复杂场景中增强提示学习的策略,包括在计算资源允许时利用两种提示类型。我们的研究结果为优化联邦设置中的提示学习提供了实用见解,有助于在隐私保护环境中更广泛地部署VLMs。
Summary / 总结
This paper explores the differences between language prompt learning (LPT) and vision prompt learning (VPT) in federated learning (FL) scenarios, focusing on data heterogeneity challenges like label skew and domain shift. Through extensive experiments, the study evaluates the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL). The findings provide practical insights for optimizing prompt learning in federated settings, enhancing the deployment of Vision Language Models (VLMs) in privacy-preserving environments.
该研究探讨了在联邦学习(FL)场景下语言提示学习(LPT)和视觉提示学习(VPT)之间的差异,重点关注标签偏斜和领域偏移等数据异质性挑战。通过大量实验,研究评估了各种FL和提示配置的影响,如客户端规模、聚合策略和提示长度,以评估Federated Prompt Learning(FPL)的鲁棒性。关键发现包括在计算资源允许时使用两种提示类型的有效性,为优化联邦环境下的提示学习提供了实用见解。
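As a concrete example of an aggregation strategy for federated prompt learning, the sketch below averages client prompt vectors weighted by local dataset size, in the style of FedAvg. Whether the paper uses exactly this rule is not stated in the abstract; the function name and the flattened-prompt representation are assumptions.

```python
from typing import List, Sequence


def fedavg_prompts(client_prompts: Sequence[Sequence[float]],
                   client_sizes: Sequence[int]) -> List[float]:
    # Weighted average of flattened prompt parameters, one vector per client.
    # Only the learnable prompts are exchanged; the VLM backbone stays frozen.
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[d] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for d in range(dim)
    ]


# Two clients with 1 and 3 local samples respectively.
global_prompt = fedavg_prompts([[1.0, 0.0], [3.0, 4.0]], [1, 3])
print(global_prompt)
```

Under label skew, the size weighting means a large client with a narrow label distribution can dominate the global prompt, which is one reason aggregation strategy is a natural experimental variable.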
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
Authors: Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, Jun Zhu
Venue: ACM MM 2025
First: 2025-07-11T12:57:51+00:00 · Latest: 2025-09-18T02:19:36+00:00
Comments: Accepted at ACM MM 2025
Abstract
Text-to-audio (T2A) generation has achieved promising results with the recent
advances in generative models. However, because of the limited quality and
quantity of temporally-aligned audio-text pairs, existing T2A methods struggle
to handle the complex text prompts that contain precise timing control, e.g.,
"owl hooted at 2.4s-5.2s". Recent works have explored data augmentation
techniques or introduced timing conditions as model inputs to enable
timing-conditioned 10-second T2A generation, while their synthesis quality is
still limited. In this work, we propose a novel training-free timing-controlled
T2A framework, FreeAudio, making the first attempt to enable timing-controlled
long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping
at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time
windows and recaption each with a refined natural language description, based
on the input text and timing prompts. Then we introduce: 1) Decoupling and
Aggregating Attention Control for precise timing control; 2) Contextual Latent
Composition for local smoothness and Reference Guidance for global consistency.
Extensive experiments show that: 1) FreeAudio achieves state-of-the-art
timing-conditioned T2A synthesis quality among training-free methods and is
comparable to leading training-based methods; 2) FreeAudio demonstrates
comparable long-form generation quality with training-based Stable Audio and
paves the way for timing-controlled long-form T2A synthesis. Demo samples are
available at: https://freeaudio.github.io/FreeAudio/
中文标题/摘要
标题:FreeAudio:无需训练的时间规划以实现可控长文本转音频生成
文本转音频(T2A)生成在生成模型的最新进展下取得了令人鼓舞的结果。然而,由于缺乏高质量和数量的时序对齐的音频-文本对,现有的T2A方法难以处理包含精确时间控制的复杂文本提示,例如“猫头鹰在2.4秒至5.2秒之间发出叫声”。最近的研究探索了数据增强技术或引入时间条件作为模型输入以实现时间条件下的10秒T2A生成,但其合成质量仍然有限。在本文中,我们提出了一种新颖的无需训练的时间控制T2A框架FreeAudio,首次尝试实现时间控制的长文本转音频生成,例如“猫头鹰在2.4秒至5.2秒之间发出叫声,蟋蟀在0秒至24秒之间鸣叫”。具体而言,我们首先使用LLM规划非重叠的时间窗口,并基于输入文本和时间提示重新描述每个窗口。然后我们引入:1)解耦和聚合注意力控制以实现精确的时间控制;2)局部平滑的上下文潜在组成和全局一致性的参考指导。广泛的实验表明:1)FreeAudio在无需训练的方法中实现了最先进的时间条件下的T2A合成质量,并且与领先的基于训练的方法相当;2)FreeAudio展示了与基于训练的Stable Audio相当的长文本生成质量,并为时间控制的长文本转音频合成铺平了道路。演示样本可在:https://freeaudio.github.io/FreeAudio/获取
Summary / 总结
FreeAudio is a training-free framework for timing-controlled long-form text-to-audio (T2A) generation, addressing the difficulty of handling complex text prompts with precise timing. It uses an LLM to plan non-overlapping time windows and recaptions each window with a refined natural language description. Key techniques include Decoupling and Aggregating Attention Control, Contextual Latent Composition, and Reference Guidance. Experiments show that FreeAudio achieves state-of-the-art timing-conditioned synthesis quality among training-free methods, is comparable to leading training-based methods, and paves the way for timing-controlled long-form T2A synthesis.
FreeAudio 是一个无需训练的框架,用于实现具有精确时间控制的文本转音频生成,解决复杂文本提示中的时间控制难题。它使用语言模型规划非重叠的时间窗口,并用自然语言描述进行细化。关键技术包括解耦和聚合注意力控制、上下文潜在组成和参考指导。实验表明,FreeAudio 的合成质量与领先的基于训练的方法相当,并为长格式 T2A 合成中的时间控制铺平了道路。
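The planning step above splits overlapping timed events into non-overlapping windows. FreeAudio delegates this to an LLM, which also rewrites the caption for each window; the sketch below is only a deterministic stand-in for the splitting part, with a hypothetical `plan_windows` function and event tuples of the form (description, start, end).

```python
from typing import List, Tuple

Event = Tuple[str, float, float]            # (description, start_s, end_s)
Window = Tuple[float, float, List[str]]     # (start_s, end_s, active descriptions)


def plan_windows(events: List[Event]) -> List[Window]:
    # Every event boundary becomes a cut point, so no two windows overlap.
    bounds = sorted({t for _, start, end in events for t in (start, end)})
    windows: List[Window] = []
    for start, end in zip(bounds, bounds[1:]):
        # An event is active in a window if the two intervals overlap.
        active = [name for name, s, e in events if s < end and e > start]
        if active:
            windows.append((start, end, active))
    return windows


events = [("owl hooting", 2.4, 5.2), ("crickets chirping", 0.0, 24.0)]
for window in plan_windows(events):
    print(window)
```

For the example prompt from the abstract, this yields three windows: crickets alone, owl plus crickets, then crickets alone again.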
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
Authors: Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng
First: 2025-02-24T21:01:39+00:00 · Latest: 2025-09-17T21:47:50+00:00
Comments: ACL2025 Main
Abstract
Chart generation aims to generate code to produce charts satisfying the
desired visual properties, e.g., texts, layout, color, and type. It has great
potential to empower the automatic professional report generation in financial
analysis, research presentation, education, and healthcare. In this work, we
build a vision-language model (VLM) based multi-agent framework for effective
automatic chart generation. Generating high-quality charts requires both strong
visual design skills and precise coding capabilities that embed the desired
visual properties into code. Such a complex multi-modal reasoning process is
difficult for direct prompting of VLMs. To resolve these challenges, we propose
METAL, a multi-agent framework that decomposes the task of chart generation
into the iterative collaboration among specialized agents. METAL achieves 5.2%
improvement over the current best result in the chart generation task. The
METAL framework exhibits the phenomenon of test-time scaling: its performance
increases monotonically as the computational budget grows on a logarithmic
scale from 512 to 8192 tokens. In addition, we find that separating different modalities
during the critique process of METAL boosts the self-correction capability of
VLMs in the multimodal context.
中文标题/摘要
标题:METAL:一种用于图表生成的多智能体框架(带测试时扩展)
图表生成旨在生成代码以生成满足所需视觉属性的图表,例如文本、布局、颜色和类型。它在金融分析、研究展示、教育和医疗保健中的自动专业报告生成方面具有巨大的潜力。在本工作中,我们构建了一个基于视觉语言模型(VLM)的有效自动图表生成多智能体框架。生成高质量的图表需要强大的视觉设计技能和精确的编码能力,将所需的视觉属性嵌入代码中。这种复杂的多模态推理过程难以直接对VLM进行提示。为了解决这些挑战,我们提出了METAL,一种多智能体框架,将图表生成任务分解为专业智能体之间的迭代协作。METAL在图表生成任务中的表现优于当前最佳结果5.2%。METAL框架展示了测试时扩展的现象:其性能随着计算预算从512增长到8192个令牌单调增加。此外,我们发现,在METAL的批评过程中分离不同的模态增强了VLM在多模态环境下的自我纠正能力。
Summary / 总结
The research aims to develop an effective multi-agent framework for automatic chart generation, which is crucial for professional report generation in various fields. The METAL framework uses a vision-language model to decompose the chart generation task into iterative collaboration among specialized agents, improving the quality of generated charts by 5.2% compared to the previous best results. Additionally, the framework demonstrates test-time scaling, with performance increasing as computational budget grows, and shows enhanced self-correction capabilities when modalities are separated during the critique process.
研究旨在开发一种有效的多智能体框架,用于自动生成图表,这对于金融分析、研究展示、教育和医疗保健等领域的专业报告生成至关重要。METAL 是一个基于视觉语言模型的框架,将图表生成任务分解为多个专门智能体的迭代协作,相比之前的最佳结果提高了5.2%的图表质量。此外,METAL 在测试时表现出计算预算增长时性能递增的现象,并且在多模态上下文中分离模态可以增强视觉语言模型的自我纠正能力。
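The iterative collaboration described above can be pictured as a generate, critique, revise loop in which visual and code feedback come from separate critics. The sketch below is a toy under that reading: the agent roles, the `metal_loop` function, and the string-based toy agents are hypothetical and do not reproduce the paper's agents or prompts.

```python
from typing import Callable, List, Tuple

Critic = Callable[[str], List[str]]


def metal_loop(generate: Callable[[], str],
               critics: List[Critic],
               revise: Callable[[str, List[str]], str],
               max_rounds: int = 3) -> Tuple[str, int]:
    # Returns the final chart code and the number of revision rounds used.
    code = generate()
    for rounds_used in range(max_rounds):
        # Each critic inspects one modality; their feedback is kept separate
        # until it is handed to the revision agent.
        feedback = [msg for critic in critics for msg in critic(code)]
        if not feedback:
            return code, rounds_used
        code = revise(code, feedback)
    return code, max_rounds


# Toy agents standing in for VLM calls.
def toy_generate() -> str:
    return "plot(color='red')"

def visual_critic(code: str) -> List[str]:
    return [] if "blue" in code else ["use blue"]

def code_critic(code: str) -> List[str]:
    return [] if "title" in code else ["add title"]

def toy_revise(code: str, feedback: List[str]) -> str:
    if "use blue" in feedback:
        code = code.replace("red", "blue")
    if "add title" in feedback:
        code += "; title('Sales')"
    return code


final_code, rounds = metal_loop(toy_generate, [visual_critic, code_critic], toy_revise)
print(final_code, rounds)
```

Raising `max_rounds` is the simplest knob for spending more test-time compute in a loop like this.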
Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models
Authors: Ilyass Moummad, Kawtar Zaher, Lukas Rauch, Alexis Joly
First: 2025-09-17T20:58:43+00:00 · Latest: 2025-09-17T20:58:43+00:00
Abstract
Information retrieval with compact binary embeddings, also referred to as
hashing, is crucial for scalable fast search applications, yet state-of-the-art
hashing methods require expensive, scenario-specific training. In this work, we
introduce Hashing-Baseline, a strong training-free hashing method leveraging
powerful pretrained encoders that produce rich pretrained embeddings. We
revisit classical, training-free hashing techniques: principal component
analysis, random orthogonal projection, and threshold binarization, to produce
a strong baseline for hashing. Our approach combines these techniques with
frozen embeddings from state-of-the-art vision and audio encoders to yield
competitive retrieval performance without any additional learning or
fine-tuning. To demonstrate the generality and effectiveness of this approach,
we evaluate it on standard image retrieval benchmarks as well as a newly
introduced benchmark for audio hashing.
中文标题/摘要
标题:哈希-基线:在预训练模型时代重新思考哈希
使用紧凑二进制嵌入的信息检索,也称为哈希,在可扩展的快速搜索应用中至关重要,但最先进的哈希方法需要昂贵且场景特定的训练。在本文中,我们引入了哈希-基线,这是一种强大的无需训练的哈希方法,利用强大的预训练编码器生成丰富的预训练嵌入。我们回顾了经典的无需训练的哈希技术:主成分分析、随机正交投影和阈值二值化,以生成哈希的强基线。我们的方法将这些技术与最先进的视觉和音频编码器的冻结嵌入结合,无需任何额外的学习或微调即可获得具有竞争力的检索性能。为了证明该方法的通用性和有效性,我们在标准图像检索基准以及新引入的音频哈希基准上进行了评估。
Summary / 总结
The research aims to improve hashing methods for information retrieval by leveraging pretrained models to produce compact binary embeddings. The study introduces Hashing-Baseline, which combines principal component analysis, random orthogonal projection, and threshold binarization with pretrained embeddings from vision and audio encoders. The approach achieves competitive retrieval performance without requiring additional training or fine-tuning, demonstrating its effectiveness across image and audio retrieval benchmarks.
研究旨在通过利用预训练模型生成紧凑的二进制嵌入来改进哈希方法。研究引入了Hashing-Baseline,该方法将主成分分析、随机正交投影和阈值二进制化与视觉和音频编码器的冻结嵌入相结合。该方法在不需要额外训练或微调的情况下实现了竞争力的检索性能,并在图像和音频检索基准测试中展示了其有效性。
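Two of the classical ingredients named in the abstract, random orthogonal projection and threshold binarization, fit in a few lines of standard-library Python. The sketch below omits the PCA step for brevity, thresholds at zero after mean-centering, and uses hypothetical function names, so it should be read as an illustration of the idea and not as the Hashing-Baseline code.

```python
import random
from typing import List

Vector = List[float]


def random_orthogonal(dim: int, seed: int = 0) -> List[Vector]:
    # Gram-Schmidt on Gaussian vectors gives a random orthonormal basis.
    rng = random.Random(seed)
    basis: List[Vector] = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:
            basis.append([x / norm for x in v])
    return basis


def hash_codes(embeddings: List[Vector], seed: int = 0) -> List[List[int]]:
    # Centre the frozen embeddings, rotate them, then binarize at zero.
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    rotation = random_orthogonal(dim, seed)
    codes = []
    for e in embeddings:
        centered = [x - m for x, m in zip(e, mean)]
        projected = [sum(r * c for r, c in zip(row, centered)) for row in rotation]
        codes.append([1 if p > 0 else 0 for p in projected])
    return codes


def hamming(a: List[int], b: List[int]) -> int:
    # Number of differing bits; retrieval ranks items by this distance.
    return sum(x != y for x, y in zip(a, b))


codes = hash_codes([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [-1.0, 0.0, 5.0]])
print(codes, hamming(codes[0], codes[1]))
```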
Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis
Authors: Chenjun Li, Laurin Lux, Alexander H. Berger, Martin J. Menten, Mert R. Sabuncu, Johannes C. Paetzold
First: 2025-03-12T20:19:07+00:00 · Latest: 2025-09-17T20:08:48+00:00
Comments: 11 pages, 3 figures
Abstract
Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely
interventions and preventing vision loss. However, current staging models are
hardly interpretable, and most public datasets contain no clinical reasoning or
interpretation beyond image-level labels. In this paper, we present a novel
method that integrates graph representation learning with vision-language
models (VLMs) to deliver explainable DR diagnosis. Our approach leverages
optical coherence tomography angiography (OCTA) images by constructing
biologically informed graphs that encode key retinal vascular features such as
vessel morphology and spatial connectivity. A graph neural network (GNN) then
performs DR staging while integrated gradients highlight critical nodes and
edges and their individual features that drive the classification decisions. We
collect this graph-based knowledge which attributes the model's prediction to
physiological structures and their characteristics. We then transform it into
textual descriptions for VLMs. We perform instruction-tuning with these textual
descriptions and the corresponding image to train a student VLM. This final
agent can classify the disease and explain its decision in a human
interpretable way solely based on a single image input. Experimental
evaluations on both proprietary and public datasets demonstrate that our method
not only improves classification accuracy but also offers more clinically
interpretable results. An expert study further demonstrates that our method
provides more accurate diagnostic explanations and paves the way for precise
localization of pathologies in OCTA images.
中文标题/摘要
标题:基于图知识微调视觉语言模型以实现可解释的医学图像分析
准确的糖尿病视网膜病变(DR)分期对于指导及时干预和预防视力丧失至关重要。然而,当前的分期模型几乎不具备可解释性,而且大多数公开的数据集仅包含图像级别的标签,而没有临床推理或解释。在本文中,我们提出了一种新颖的方法,将图表示学习与视觉-语言模型(VLMs)结合,以提供可解释的DR诊断。我们的方法通过构建生物启发的图来利用光学相干断层扫描血管成像(OCTA)图像,这些图编码了关键的视网膜血管特征,如血管形态和空间连接性。然后,图神经网络(GNN)执行DR分期,集成梯度突出显示驱动分类决策的关键节点和边及其个体特征。我们收集了这种基于图的知识,将模型的预测归因于生理结构及其特征。然后将其转换为文本描述,供VLMs使用。我们使用这些文本描述和相应的图像进行指令微调,以训练一个学生VLM。最终的代理可以根据单张图像输入进行疾病分类,并以人类可解释的方式解释其决策。在私有和公开数据集上的实验评估表明,我们的方法不仅提高了分类准确性,还提供了更具临床解释性的结果。进一步的专家研究证明,我们的方法提供了更准确的诊断解释,并为OCTA图像中的病理精确定位铺平了道路。
Summary / 总结
This paper introduces a method that integrates graph representation learning with vision-language models to enhance the explainability of diabetic retinopathy (DR) diagnosis. The approach uses optical coherence tomography angiography (OCTA) images to construct biologically informed graphs that encode retinal vascular features. A graph neural network (GNN) performs DR staging, with integrated gradients highlighting critical nodes and edges. This graph-based knowledge is then transformed into textual descriptions for instruction-tuning a student VLM, enabling the model to provide human-interpretable explanations. Experiments show improved classification accuracy and more clinically interpretable results compared to existing models.
本文提出了一种将图表示学习与视觉语言模型相结合的方法,以增强糖尿病视网膜病变(DR)诊断的可解释性。该方法使用光学相干断层扫描血管成像(OCTA)图像构建生物信息图,编码视网膜血管特征。图神经网络(GNN)执行DR分期,集成梯度突出关键节点和边及其特征。然后将这种图基知识转换为文本描述,用于指令调优学生视觉语言模型,使其能够仅基于单张图像输入提供可解释的诊断解释。实验表明,该方法不仅提高了分类准确性,还提供了更具有临床解释性的结果。
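The attribution step in this entry relies on integrated gradients, which can be sketched on a toy scalar function. The function `toy_score` is a hypothetical stand-in for the GNN output, the all-zero baseline is an assumption, and gradients are approximated numerically. This illustrates the general technique, not the paper's pipeline.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Attribute f(x) - f(baseline) to each input feature.

    Averages the gradient along the straight path from baseline to x
    (midpoint rule) and scales it by the per-feature displacement.
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_score(features):
    """Hypothetical model output: depends on the first two features only."""
    a, b, c = features
    return a * a + 2.0 * b + 0.0 * c


x = [1.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_score, x, baseline)
```

The attributions sum to `toy_score(x) - toy_score(baseline)` (the completeness property), and the third feature, which the toy function ignores, receives an attribution of zero. In the paper's setting the inputs would be node and edge features of the retinal vessel graph.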
Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
Authors: Nisarg A. Shah, Amir Ziai, Chaitanya Ekanadham, Vishal M. Patel
First: 2025-09-17T17:58:06+00:00 · Latest: 2025-09-17T17:58:06+00:00
Comments: 11 pages, 5 figures, 5 tables
Abstract
While recent advancements in vision-language models have improved video
understanding, diagnosing their capacity for deep, narrative comprehension
remains a challenge. Existing benchmarks often test short-clip recognition or
use template-based questions, leaving a critical gap in evaluating fine-grained
reasoning over long-form narrative content. To address these gaps, we introduce
Cinéaste, a comprehensive benchmark for long-form movie
understanding. Our dataset comprises 3,119 multiple-choice question-answer
pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel
fine-grained contextual reasoning categories. We use GPT-4o to generate
diverse, context-rich questions by integrating visual descriptions, captions,
scene titles, and summaries, which require deep narrative understanding. To
ensure high-quality evaluation, our pipeline incorporates a two-stage filtering
process: Context-Independence filtering ensures questions require video
context, while Contextual Veracity filtering validates factual consistency
against the movie content, mitigating hallucinations. Experiments show that
existing MLLMs struggle on Cinéaste; our analysis reveals
that long-range temporal reasoning is a primary bottleneck, with the top
open-source model achieving only 63.15% accuracy. This underscores significant
challenges in fine-grained contextual understanding and the need for
advancements in long-form movie comprehension.
中文标题/摘要
标题:电影导演:细粒度情境电影问答基准
尽管近期视觉语言模型在视频理解方面取得了进步,但诊断其在深入叙事理解方面的能力仍是一个挑战。现有基准通常测试短片段识别或使用模板化问题,这在评估长篇叙事内容上的细粒度推理方面留下了关键缺口。为解决这些缺口,我们引入了Cinéaste,一个全面的长篇电影理解基准。我们的数据集包含来自200部不同电影1,805个场景的3,119个多项选择题-答案对,涵盖了五个新颖的细粒度情境推理类别。我们使用GPT-4o生成多样化的情境丰富问题,通过整合视觉描述、字幕、场景标题和摘要,这些问题需要深入的叙事理解。为了确保高质量的评估,我们的流水线包含两阶段过滤过程:情境独立性过滤确保问题需要视频情境,而情境一致性过滤验证事实一致性,防止幻觉。实验表明,现有MLLM在Cinéaste上表现不佳;我们的分析显示,长时序推理是主要瓶颈,顶级开源模型的准确率仅为63.15%。这突显了细粒度情境理解的重大挑战,并强调了长篇电影理解方面的进步需求。
Summary / 总结
The research evaluates the deep narrative comprehension of vision-language models by introducing Cinéaste, a benchmark for long-form movie understanding with 3,119 multiple-choice question-answer pairs drawn from 1,805 scenes across 200 movies. Questions are generated with GPT-4o from visual descriptions, captions, scene titles, and summaries, then passed through a two-stage filter (Context-Independence and Contextual Veracity). Existing MLLMs struggle on the benchmark: long-range temporal reasoning is a primary bottleneck, and the top open-source model reaches only 63.15% accuracy.
Cinéaste 是一个用于评估视觉语言模型在长片理解中细粒度叙事理解能力的基准。它包含来自200部不同电影的1,805个场景的3,119个问答对,使用GPT-4o生成富含上下文的问题。实验表明,现有模型在长时序推理方面存在困难,准确率仅为63.15%,突显了该领域需要进一步改进。
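The two-stage filtering described in this entry can be sketched as a simple pipeline. The `Question` fields and the two callables (`blind_answerer`, `veracity_checker`) are hypothetical placeholders for whatever models the authors use. The abstract only states the goals of the two stages, so the concrete checks below are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    text: str
    options: List[str]
    answer: str
    context: str  # scene description the question was generated from


def two_stage_filter(
    questions: List[Question],
    blind_answerer: Callable[[str, List[str]], str],
    veracity_checker: Callable[[Question], bool],
) -> List[Question]:
    """Keep questions that need the context and agree with it."""
    kept = []
    for q in questions:
        # Stage 1 (Context-Independence): if the question can be answered
        # correctly without any context, it does not test video understanding.
        if blind_answerer(q.text, q.options) == q.answer:
            continue
        # Stage 2 (Contextual Veracity): drop questions whose reference
        # answer is not supported by the movie content.
        if not veracity_checker(q):
            continue
        kept.append(q)
    return kept
```

In practice both callables would wrap model calls. For a quick check they can be stubs, for example a blind answerer that always picks the first option and a veracity check that looks for the answer string in the context.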
TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning
Authors: Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao
First: 2025-09-17T16:58:44+00:00 · Latest: 2025-09-17T16:58:44+00:00
Abstract
With the rapid advancement of large language models and vision-language
models, employing large models as Web Agents has become essential for automated
web interaction. However, training Web Agents with reinforcement learning faces
critical challenges including credit assignment misallocation, prohibitively
high annotation costs, and reward sparsity. To address these issues, we propose
Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning
framework built on a tree-structured trajectory representation that merges
semantically identical states across trajectories to eliminate label conflicts.
Our framework incorporates a Process Reward Model that automatically generates
fine-grained rewards through subgoal progress, redundancy detection, and action
verification. Additionally, a dynamic weighting mechanism prioritizes
high-impact decision points during training. Experiments on Online-Mind2Web and
our self-constructed C-WebShop datasets demonstrate that TGPO significantly
outperforms existing methods, achieving higher success rates with fewer
redundant steps.
中文标题/摘要
标题:TGPO:基于树引导的偏好优化以实现鲁棒的网络代理强化学习
随着大型语言模型和视觉-语言模型的迅速发展,使用大型模型作为网络代理对于自动化网页交互变得至关重要。然而,使用强化学习训练网络代理面临着关键挑战,包括奖励分配错误、标注成本高昂以及奖励稀疏性。为了解决这些问题,我们提出了树引导偏好优化(TGPO),这是一种离线强化学习框架,通过树结构轨迹表示将轨迹中语义相同的状态合并,消除标签冲突。该框架结合了过程奖励模型,该模型通过子目标进展、冗余检测和动作验证自动生成细粒度奖励。此外,动态加权机制在训练过程中优先处理高影响决策点。在Online-Mind2Web和我们自构建的C-WebShop数据集上的实验表明,TGPO显著优于现有方法,以较少的冗余步骤实现了更高的成功率。
Summary / 总结
The research aims to improve the training of Web Agents using reinforcement learning by addressing challenges such as credit assignment misallocation and high annotation costs. The proposed Tree-Guided Preference Optimization (TGPO) framework uses a tree-structured trajectory representation to eliminate label conflicts and incorporates a Process Reward Model for generating fine-grained rewards. Experiments show that TGPO outperforms existing methods, achieving higher success rates with fewer redundant steps on Online-Mind2Web and C-WebShop datasets.
论文针对强化学习训练Web代理时遇到的挑战,如信用分配错误、高标注成本和稀疏奖励。提出了树引导偏好优化(TGPO)框架,使用树结构轨迹表示法将语义上相同的状态合并,消除标签冲突。TGPO还包含一个过程奖励模型,根据子目标进度、冗余检测和动作验证自动生成奖励,并使用动态加权机制优先处理高影响决策点。实验表明,TGPO在成功率和减少冗余步骤方面优于现有方法。
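The tree-structured trajectory representation in this entry can be illustrated with a small sketch that merges trajectories passing through the same state and aggregates their outcomes. The `canonical` normalization, the `(action, next_state)` step format, and the success-rate-based preference pairs are all assumptions made for illustration. The abstract does not specify how states are matched or how pairs are formed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    state: str
    visits: int = 0
    successes: int = 0
    # (action, canonical next state) -> child node
    children: Dict[Tuple[str, str], "Node"] = field(default_factory=dict)


def canonical(state: str) -> str:
    """Hypothetical stand-in for semantic state matching."""
    return " ".join(state.lower().split())


def build_tree(trajectories, root_state: str) -> Node:
    """Merge trajectories; each is (steps, success), steps = [(action, next_state)]."""
    root = Node(canonical(root_state))
    for steps, success in trajectories:
        node = root
        node.visits += 1
        node.successes += int(success)
        for action, next_state in steps:
            key = (action, canonical(next_state))
            if key not in node.children:
                node.children[key] = Node(key[1])
            node = node.children[key]
            node.visits += 1
            node.successes += int(success)
    return root


def preference_pairs(node: Node) -> List[Tuple[str, str, str]]:
    """At each branching state, prefer actions with a higher success rate."""
    pairs = []
    ranked = sorted(
        node.children.items(),
        key=lambda kv: kv[1].successes / kv[1].visits,
        reverse=True,
    )
    for i, (good_key, good) in enumerate(ranked):
        for bad_key, bad in ranked[i + 1:]:
            if good.successes / good.visits > bad.successes / bad.visits:
                pairs.append((node.state, good_key[0], bad_key[0]))
    for child in node.children.values():
        pairs.extend(preference_pairs(child))
    return pairs
```

Because outcomes are aggregated per merged node, the same action at the same state receives one statistic rather than contradictory labels from different trajectories, which is one plausible reading of "eliminating label conflicts".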
StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance
Authors: Zefan Qu, Zhenwei Wang, Haoyuan Wang, Ke Xu, Gerhard Hancke, Rynson W. H. Lau
Venue: SIGGRAPH Asia 2025
First: 2025-09-16T17:55:20+00:00 · Latest: 2025-09-17T15:58:50+00:00
Comments: SIGGRAPH Asia 2025, Project page: https://stylesculptor.github.io
Abstract
Creating 3D assets that follow the texture and geometry style of existing
ones is often desirable or even inevitable in practical applications like video
gaming and virtual reality. While impressive progress has been made in
generating 3D objects from text or images, creating style-controllable 3D
assets remains a complex and challenging problem. In this work, we propose
StyleSculptor, a novel training-free approach for generating style-guided 3D
assets from a content image and one or more style images. Unlike previous
works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner,
enabling fine-grained 3D style control that captures the texture, geometry, or
both styles of user-provided style images. At the core of StyleSculptor is a
novel Style Disentangled Attention (SD-Attn) module, which establishes a
dynamic interaction between the input content image and style image for
style-guided 3D asset generation via a cross-3D attention mechanism, enabling
stable feature fusion and effective style-guided generation. To alleviate
semantic content leakage, we also introduce a style-disentangled feature
selection strategy within the SD-Attn module, which leverages the variance of
3D feature patches to disentangle style- and content-significant channels,
allowing selective feature injection within the attention framework. With
SD-Attn, the network can dynamically compute texture-, geometry-, or
both-guided features to steer the 3D generation process. Built upon this, we
further propose the Style Guided Control (SGC) mechanism, which enables
exclusive geometry- or texture-only stylization, as well as adjustable style
intensity control. Extensive experiments demonstrate that StyleSculptor
outperforms existing baseline methods in producing high-fidelity 3D assets.
Summary / 总结
StyleSculptor is a training-free, zero-shot approach for generating style-guided 3D assets from a content image and one or more style images. Its Style Disentangled Attention (SD-Attn) module establishes a dynamic interaction between the content and style images through a cross-3D attention mechanism, and uses the variance of 3D feature patches to separate style-significant from content-significant channels, which limits semantic content leakage. A Style Guided Control (SGC) mechanism adds geometry-only or texture-only stylization and adjustable style intensity. Experiments show that StyleSculptor outperforms existing baselines in producing high-fidelity 3D assets.
StyleSculptor 是一种零样本方法,可以从内容图像和一个或多个风格图像生成带有风格指导的 3D 资产。它使用一种新颖的 Style Disentangled Attention (SD-Attn) 模块动态地在内容图像和风格图像之间进行交互,从而实现对纹理和几何风格的精细控制。实验表明,StyleSculptor 在生成高保真 3D 资产方面优于现有方法,具有稳定的特征融合和有效的风格指导。
VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement
Authors: Jun Du, Weiwei Xing, Ming Li, Fei Richard Yu
First: 2025-09-17T15:04:45+00:00 · Latest: 2025-09-17T15:04:45+00:00
Abstract
Current multi-object tracking (MOT) algorithms typically overlook issues
inherent in low-quality videos, leading to significant degradation in tracking
performance when confronted with real-world image deterioration. Therefore,
advancing the application of MOT algorithms in real-world low-quality video
scenarios represents a critical and meaningful endeavor. To address the
challenges posed by low-quality scenarios, inspired by vision-language models,
this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking
framework (VSE-MOT). Specifically, we first design a tri-branch architecture
that leverages a vision-language model to extract global visual semantic
information from images and fuse it with query vectors. Subsequently, to
further enhance the utilization of visual semantic information, we introduce
the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion
Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic
information to suit multi-object tracking tasks, while the VSFM improves the
efficacy of feature fusion. Through extensive experiments, we validate the
effectiveness and superiority of the proposed method in real-world low-quality
video scenarios. Its tracking performance metrics outperform those of existing
methods by approximately 8% to 20%, while maintaining robust performance in
conventional scenarios.
中文标题/摘要
标题:VSE-MOT:低质量视频场景中基于视觉语义增强的多目标跟踪
当前的多目标跟踪(MOT)算法通常忽视低质量视频中存在的问题,导致在面对真实世界图像退化时跟踪性能显著下降。因此,在真实世界低质量视频场景中应用MOT算法的改进具有重要的意义。为应对低质量场景的挑战,受视觉语言模型的启发,本文提出了一种基于视觉语义增强的多目标跟踪框架(VSE-MOT)。具体而言,我们首先设计了一个三支路架构,利用视觉语言模型从图像中提取全局视觉语义信息并融合到查询向量中。为进一步增强视觉语义信息的利用,我们引入了多目标跟踪适配器(MOT-Adapter)和视觉语义融合模块(VSFM)。MOT-Adapter将提取的全局视觉语义信息适配到多目标跟踪任务中,而VSFM提高了特征融合的效果。通过大量实验,我们验证了所提方法在真实世界低质量视频场景中的有效性和优越性。其跟踪性能指标比现有方法高出约8%到20%,同时在常规场景中保持了稳健的性能。
Summary / 总结
The research aims to improve multi-object tracking (MOT) in low-quality video scenes, where existing algorithms degrade significantly. The proposed VSE-MOT framework uses a tri-branch architecture in which a vision-language model extracts global visual semantic information from images and fuses it with query vectors. A MOT-Adapter adapts this information to the tracking task, and a Visual Semantic Fusion Module (VSFM) improves feature fusion. Experimental results show that VSE-MOT outperforms existing methods by approximately 8% to 20% on tracking metrics in real-world low-quality video scenarios, while maintaining robust performance in conventional scenarios.
本文提出了一种VSE-MOT框架,通过使用视觉语言模型增强视觉语义信息来解决低质量视频场景中的多目标跟踪(MOT)问题。该方法包括三支架构,并引入了MOT-Adapter和VSFM以改进特征融合并使全局视觉语义信息适应MOT任务。实验结果显示,VSE-MOT在低质量场景中的性能比现有方法高出8%到20%,同时在常规场景中保持了稳健的性能。
Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Authors: Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang
First: 2025-09-04T05:35:32+00:00 · Latest: 2025-09-17T13:47:40+00:00
Abstract
Writing is a universal cultural technology that reuses vision for symbolic
communication. Humans display striking resilience: we readily recognize words
even when characters are fragmented, fused, or partially occluded. This paper
investigates whether advanced vision language models (VLMs) share this
resilience. We construct two psychophysics-inspired benchmarks across distinct
writing systems, Chinese logographs and English alphabetic words, by splicing,
recombining, and overlaying glyphs to yield "visible but unreadable" stimuli
for models while remaining legible to humans. Despite strong performance on
clean text, contemporary VLMs show a severe drop under these perturbations,
frequently producing unrelated or incoherent outputs. The pattern suggests a
structural limitation: models heavily leverage generic visual invariances but
under-rely on compositional priors needed for robust literacy. We release
stimuli generation code, prompts, and evaluation protocols to facilitate
transparent replication and follow-up work. Our findings motivate architectures
and training strategies that encode symbol segmentation, composition, and
binding across scripts, and they delineate concrete challenges for deploying
multimodal systems in education, accessibility, cultural heritage, and
security.
中文标题/摘要
标题:可见却不可读:视觉语言模型在不同书写系统中的一个系统性盲点
书写是一种普遍的文化技术,利用视觉进行符号交流。人类表现出惊人的适应性:即使字符被分割、融合或部分遮挡,我们也能轻易识别出单词。本文探讨先进视觉语言模型(VLMs)是否也具备这种适应性。我们通过拼接、重组和叠加字符,构建了跨不同书写系统的心理物理学启发式基准,包括中文表意文字和英文字母词,从而为模型提供“可见但不可读”的刺激,同时保持对人类的可读性。尽管在干净文本上表现出色,但当代VLMs在这些扰动下表现出严重的性能下降,经常产生不相关或不连贯的输出。这一模式表明,模型过度依赖通用的视觉不变性,而对需要稳健读写能力的组合先验依赖不足。我们发布了刺激生成代码、提示和评估协议,以促进透明的复制和后续工作。我们的发现促使了能够编码符号分割、组合和跨书写系统绑定的架构和训练策略的发展,并指出了在教育、无障碍、文化遗产和安全领域部署多模态系统时的具体挑战。
Summary / 总结
This paper investigates whether advanced vision language models (VLMs) share the human ability to recognize fragmented, fused, or occluded text across writing systems. Using "visible but unreadable" stimuli built from Chinese logographs and English words, which remain legible to humans, the study shows that VLMs suffer a severe performance drop and often produce unrelated or incoherent outputs. This points to a structural limitation: the models under-rely on compositional priors. The findings suggest that VLMs need to better encode symbol segmentation, composition, and binding to improve robustness in literacy tasks.
该研究探讨了先进视觉语言模型(VLMs)在识别不同书写系统中碎片化或被遮挡的文本时的鲁棒性。通过创建‘可见但不可读’的刺激,研究显示VLMs在这些条件下表现不佳,表明模型在利用组合先验方面存在结构性限制。研究结果表明,VLMs需要更好地编码符号分割和组合,以提高在识字任务中的鲁棒性。