arXiv 论文速递

2025-09-21 03:45
Snapshot: 20250921_0345
Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation
Authors: Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi
First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00
Comments: BMVC 2025 - Project Page: https://thegoodailab.org/blog/vocalign - Code: https://github.com/Sisso16/VocAlign
Abstract
We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.
中文标题/摘要
标题:迷失在翻译中?开放词汇语义分割中无源域适应的词汇对齐
我们引入了VocAlign,一种专门为开放词汇语义分割中的VLM设计的无源域适应框架。该方法采用学生-教师范式,并结合了词汇对齐策略,通过引入额外的类别概念来改进伪标签生成。为了确保效率,我们使用低秩适应(LoRA)对模型进行微调,在保留其原始能力的同时最小化计算开销。此外,我们还提出了一种面向学生模型的Top-K类别选择机制,这显著减少了内存需求并进一步提高了适应性能。我们的方法在CityScapes数据集上实现了显著的6.11 mIoU提升,并在零样本分割基准测试中表现出色,为开放词汇设置中的无源适应设定了新标准。
Summary / 总结
The paper introduces VocAlign, a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation. It uses a student-teacher paradigm with vocabulary alignment to improve pseudo-label generation and employs Low-Rank Adaptation (LoRA) for efficient fine-tuning. A Top-K class selection mechanism for the student model reduces memory usage and further improves adaptation. The approach achieves a 6.11 mIoU improvement on CityScapes and performs strongly on zero-shot segmentation benchmarks.
研究引入了VocAlign,这是一种针对开放词汇语义分割中VLM的无源域适应框架。该方法采用学生-教师范式并结合词汇对齐策略以改进伪标签生成,并使用低秩适应(LoRA)进行高效微调。此外,该方法还为学生模型引入Top-K类别选择机制,以减少内存使用并进一步提升适应性能。该方法在CityScapes上实现了6.11 mIoU的提升,并在零样本分割基准测试中表现出色。
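The Top-K class selection described above can be illustrated with a small sketch. The digest does not state how classes are ranked, so the rule used here (mean teacher score per class) and the data layout are assumptions made purely for illustration, not the authors' implementation.

```python
def select_top_k_classes(teacher_scores, k):
    """Return the k class names with the highest mean teacher score.

    teacher_scores: dict mapping class name -> list of per-pixel scores
    (a hypothetical layout chosen for this sketch).
    """
    mean_score = {name: sum(s) / len(s) for name, s in teacher_scores.items()}
    ranked = sorted(mean_score, key=mean_score.get, reverse=True)
    return ranked[:k]


# Toy teacher output for a 3-pixel image and a 4-class vocabulary.
scores = {
    "road": [0.9, 0.8, 0.7],
    "car": [0.1, 0.6, 0.2],
    "sky": [0.05, 0.05, 0.1],
    "person": [0.3, 0.4, 0.5],
}
kept = select_top_k_classes(scores, k=2)
print(kept)
```

The student would then compute its loss only over the kept classes, which is where the memory saving would come from under this assumption.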
Calibration-Aware Prompt Learning for Medical Vision-Language Models
Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan
First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00
Comments: Accepted in BMVC 2025
Abstract
Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration is largely unexplored, and so remains a significant challenge. As such, miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives under scarce labeled data regime. First, we study a regularizer that attempts to align the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss to maximize textual feature proximity toward improving the reliability in confidence estimates of multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets reveal that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.
中文标题/摘要
标题:医疗视觉语言模型的校准感知提示学习
医疗视觉语言模型(Med-VLMs)通过大规模图像-文本预训练,在多种医疗成像任务中表现出色。然而,它们的置信度校准尚未得到充分探索,仍然是一个重大挑战。因此,未校准的预测可能导致过度自信的错误,削弱临床信任和决策可靠性。为了解决这一问题,我们引入了CalibPrompt,这是第一个在提示调优过程中校准Med-VLMs的框架。CalibPrompt在少量标注数据条件下,通过精心设计的校准目标优化一小组可学习的提示。首先,我们研究了一个正则化器,试图使平滑后的准确率与预测模型置信度对齐。其次,我们引入了角度分离损失,以最大化文本特征的接近度,从而提高多模态Med-VLMs置信度估计的可靠性。在四个公开的Med-VLMs和五个多样化的医疗成像数据集上的广泛实验表明,CalibPrompt在不大幅影响干净准确率的情况下,始终能够提高校准。我们的代码可在https://github.com/iabh1shekbasu/CalibPrompt/ 获取。
Summary / 总结
The paper addresses the issue of confidence calibration in Medical Vision-Language Models (Med-VLMs) by introducing CalibPrompt, a framework that optimizes learnable prompts during prompt tuning. It uses a regularizer to align smoothed accuracy with predicted model confidences and an angular separation loss to enhance textual feature proximity. Experiments on four Med-VLMs and five medical imaging datasets show that CalibPrompt improves calibration without significantly reducing clean accuracy.
论文提出了CalibPrompt框架,用于在提示调优过程中校准医疗视觉-语言模型(Med-VLMs)。该方法在有限标注数据下优化可学习的提示,并包含一个正则化器来使平滑准确率与预测置信度对齐,以及一个角度分离损失来增强文本特征的接近性。实验表明,CalibPrompt在四个Med-VLMs和五个医学成像数据集上提高了校准,同时没有显著降低干净准确率。
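To make "calibration" concrete, here is a minimal sketch of expected calibration error (ECE), a widely used measure of the gap between confidence and accuracy. The digest does not say which metric the paper reports, so ECE is an assumption here, and the function is an illustration rather than the authors' code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# An overconfident model: 95% confidence but only half of the answers are right.
print(expected_calibration_error([0.95] * 4, [True, True, False, False]))
```

A perfectly calibrated model scores 0; overconfident errors of the kind the abstract warns about push the value up.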
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Authors: Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
First: 2025-09-18T17:59:22+00:00 · Latest: 2025-09-18T17:59:22+00:00
Abstract
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
中文标题/摘要
标题:ScaleCUA:跨平台数据扩展开源计算机使用代理
视觉-语言模型(VLMs)使计算机使用代理(CUAs)能够自主操作GUI,展现出巨大的潜力,但进展受限于缺乏大规模、开源的计算机使用数据和基础模型。在本项工作中,我们介绍了ScaleCUA,这是迈向扩展开源CUA的一个步骤。它提供了一个跨越6个操作系统和3个任务领域的大型数据集,通过将自动化代理与人类专家结合的闭环管道构建而成。在这些扩展的数据上训练后,ScaleCUA可以在不同平台之间无缝操作。具体而言,它相对基线取得了显著提升(WebArena-Lite-v2上+26.6,ScreenSpot-Pro上+10.7),并创下新的最先进结果(MMBench-GUI L1-Hard上94.4%,OSWorld-G上60.6%,WebArena-Lite-v2上47.4%)。这些发现强调了数据驱动扩展对通用计算机使用代理的强大作用。我们将发布数据、模型和代码以促进未来研究:https://github.com/OpenGVLab/ScaleCUA。
Summary / 总结
ScaleCUA addresses the limitation of open-source computer use agents by introducing a large-scale dataset spanning multiple operating systems and task domains. Utilizing a closed-loop pipeline combining automated agents and human experts, ScaleCUA achieves significant performance gains over baselines and sets new state-of-the-art results on various benchmarks, highlighting the importance of data-driven scaling for general-purpose computer use agents.
ScaleCUA通过提供跨多个操作系统和任务领域的大型数据集来解决开源计算机使用代理的局限性。利用结合自动化代理和人工专家的闭环管道,它实现了跨平台的无缝操作。ScaleCUA在MMBench-GUI L1-Hard、OSWorld-G和WebArena-Lite-v2上超越基线并建立了新的最先进结果,突显了数据驱动扩展对于通用计算机使用代理的重要性。作者表示将发布数据、模型和代码以促进进一步研究。
MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation
Authors: Gengliang Li, Rongyu Chen, Bin Li, Linlin Yang, Guodong Ding
First: 2025-09-18T16:59:59+00:00 · Latest: 2025-09-18T16:59:59+00:00
Comments: Tech report
Abstract
Ensuring factual consistency and reliable reasoning remains a critical challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that integrates external knowledge grounding with reinforcement learning to improve the factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise; while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Codes are released at https://github.com/Garfieldgengliang/MEDFACT-R1.
中文标题/摘要
标题:MedFact-R1:通过伪标签增强实现医学事实推理
确保事实一致性与可靠推理仍然是医学视觉-语言模型的关键挑战。我们引入了MEDFACT-R1,这是一种两阶段框架,结合了外部知识接地与强化学习以提高医学事实推理。第一阶段使用伪标签监督微调(SFT)来整合外部事实专业知识;而第二阶段则应用组相对策略优化(GRPO)并使用四个定制的事实奖励信号来促进自我一致的推理。在三个公开的医学问答基准测试中,MEDFACT-R1在事实准确性上比之前最先进的方法提高了高达22.5%。消融研究强调了伪标签SFT冷启动的必要性,并验证了每个GRPO奖励的贡献,突显了知识接地与基于RL的推理之间的协同作用对于可信赖的医学AI的重要性。代码已发布于https://github.com/Garfieldgengliang/MEDFACT-R1。
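The second stage relies on Group Relative Policy Optimization (GRPO), whose core step is to score each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that normalization step. The rewards are made-up numbers, the paper's four factual reward signals are not reproduced, and the use of the population standard deviation is a choice made for this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    better than their siblings get positive values and worse ones negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question, with hypothetical scalar rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

If every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason a supervised cold start before RL can matter.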
WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang
First: 2025-09-18T16:40:47+00:00 · Latest: 2025-09-18T16:40:47+00:00
Comments: Project Webpage: https://worldforge-agi.github.io/
Abstract
Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.
Summary / 总结
WorldForge is a training-free framework that enhances the controllability and geometric consistency of video diffusion models for 3D/4D generation. It consists of three modules: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. These modules enable precise trajectory injection and adaptive correction of trajectory drift, achieving accurate motion control and photorealistic content generation. Experiments show that WorldForge outperforms existing methods in terms of realism, trajectory consistency, and visual fidelity.
WorldForge 是一个无需训练的框架,旨在增强视频扩散模型在3D/4D生成中的可控性和几何一致性。它包含三个模块:Intra-Step Recursive Refinement、Flow-Gated Latent Fusion 和 Dual-Path Self-Corrective Guidance。这些模块能够实现精确的轨迹注入和对轨迹漂移的自适应修正,从而实现准确的运动控制和逼真的内容生成。实验表明,WorldForge 在逼真度、轨迹一致性以及视觉保真度方面优于现有方法。
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Yijie Guo, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
First: 2024-09-20T03:02:05+00:00 · Latest: 2025-09-18T16:36:42+00:00
Abstract
Recently, driven by advancements in Multimodal Large Language Models (MLLMs), Vision Language Action Models (VLAMs) are being proposed to achieve better performance in open-vocabulary scenarios for robotic manipulation tasks. Since manipulation tasks involve direct interaction with the physical world, ensuring robustness and safety during the execution of this task is always a very critical issue. In this paper, by synthesizing current safety research on MLLMs and the specific application scenarios of the manipulation task in the physical world, we comprehensively evaluate VLAMs in the face of potential physical threats. Specifically, we propose the Physical Vulnerability Evaluating Pipeline (PVEP) that can incorporate as many visual modal physical threats as possible for evaluating the physical robustness of VLAMs. The physical threats in PVEP specifically include Out-of-Distribution, Typography-based Visual Prompt, and Adversarial Patch Attacks. By comparing the performance fluctuations of VLAMs before and after being attacked, we provide generalizable Analyses of how VLAMs respond to different physical threats.
中文标题/摘要
标题:操纵面临威胁:评估端到端视觉语言动作模型的物理脆弱性
近年来,随着多模态大型语言模型(MLLMs)的发展,视觉语言动作模型(VLAMs)被提出以在机器人操纵任务的开放词汇场景中实现更好的性能。由于操纵任务涉及直接与物理世界互动,确保执行此任务时的鲁棒性和安全性始终是一个非常关键的问题。在本文中,通过综合当前关于MLLMs的安全研究以及操纵任务在物理世界中的具体应用场景,我们全面评估了VLAMs在面对潜在物理威胁时的表现。具体而言,我们提出了物理脆弱性评估管道(PVEP),该管道可以尽可能多地纳入视觉模态的物理威胁,以评估VLAMs的物理鲁棒性。PVEP中的物理威胁具体包括分布外(Out-of-Distribution)、基于字体排印的视觉提示和对抗性补丁攻击。通过比较攻击前后VLAMs的性能波动,我们提供了关于VLAMs如何应对不同物理威胁的可泛化的分析。
Summary / 总结
This paper evaluates the physical robustness of Vision Language Action Models (VLAMs) in robotic manipulation tasks by proposing the Physical Vulnerability Evaluating Pipeline (PVEP), which includes out-of-distribution, typography-based visual prompt, and adversarial patch attacks. The study finds that VLAMs exhibit varying performance fluctuations when subjected to these physical threats, providing insights into their vulnerability to different types of attacks.
本文旨在评估Vision Language Action Models(VLAMs)在面对潜在物理威胁时的物理鲁棒性,以确保机器人操作任务中的安全性和鲁棒性。作者提出了Physical Vulnerability Evaluating Pipeline(PVEP)来评估VLAMs对诸如分布外、基于字体的视觉提示和对抗性补丁等物理威胁的鲁棒性。关键发现表明,当VLAMs受到这些攻击时,其性能会表现出不同的波动,提供了对其对物理威胁的脆弱性的见解。
Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
Authors: Neale Ratzlaff, Matthew Lyle Olson, Musashi Hinck, Estelle Aflalo, Shao-Yen Tseng, Vasudev Lal, Phillip Howard
First: 2024-11-15T20:06:09+00:00 · Latest: 2025-09-18T15:58:56+00:00
Comments: 10 pages, 6 Figures, 8 Tables. arXiv admin note: text overlap with arXiv:2410.13976
Abstract
Large Multi-Modal Models (LMMs) have demonstrated impressive capabilities as general-purpose chatbots able to engage in conversations about visual inputs. However, their responses are influenced by societal biases present in their training datasets, leading to undesirable differences in how the model responds when presented with images depicting people of different demographics. In this work, we propose a training-free debiasing framework for LMMs that intervenes on the model's representations during text generation by constructing a steering vector that reduces reference on protected attributes. Our framework introduces two complementary methods: (1) a dataset-based approach that constructs a steering vector by contrasting model activations on biased and neutral inputs, and (2) a novel optimization-based approach designed for low-resource settings, which constructs the steering vector using a single step of gradient-based perturbation without requiring additional data. Our experiments show that these interventions effectively reduce the propensity of LMMs to generate text related to protected attributes while maintaining sentiment and fluency. Furthermore, we demonstrate that debiased LMMs achieve comparable accuracy to their unmodified counterparts on downstream tasks, indicating that bias mitigation can be achieved without sacrificing model performance.
中文标题/摘要
标题:通过非对比视觉属性引导在测试时为大型多模态模型去偏
大型多模态模型(LMMs)展示了作为通用聊天机器人的出色能力,能够就视觉输入进行对话。然而,它们的响应受到其训练数据集中存在的社会偏见的影响,导致在展示不同人口统计学特征的人像时,模型的响应存在不希望的差异。在本工作中,我们提出了一种无需训练的去偏框架,该框架在文本生成过程中干预模型的表示,通过构建一个减少对受保护属性依赖的引导向量。我们的框架引入了两种互补的方法:(1)基于数据的方法,通过对比模型在有偏和中性输入上的激活来构建引导向量;(2)一种针对资源有限环境的新颖优化方法,使用单步梯度扰动构建引导向量,无需额外数据。我们的实验表明,这些干预措施有效地减少了LMMs生成与受保护属性相关文本的倾向,同时保持了情感和流畅性。此外,我们证明去偏的LMMs在下游任务上的准确度与未修改的版本相当,表明可以在不牺牲模型性能的情况下实现偏见缓解。
Summary / 总结
This paper addresses the issue of bias in Large Multi-Modal Models (LMMs) by proposing a training-free debiasing framework. The method involves constructing a steering vector during text generation to reduce reliance on protected attributes. Two approaches are introduced: a dataset-based method that contrasts model activations on biased and neutral inputs, and an optimization-based method that uses a single step of gradient-based perturbation. Experiments show that these interventions reduce the model's propensity to generate text related to protected attributes while maintaining sentiment and fluency, and debiased LMMs perform comparably to their unmodified counterparts on downstream tasks.
本文提出了一种无需训练的去偏框架,以解决大型多模态模型(LMMs)中的社会偏见问题。该框架在文本生成过程中构建一个引导向量,以减少对受保护属性的依赖。文中介绍了两种方法:基于数据集的方法通过对比模型在有偏输入和中性输入上的激活来构建引导向量;基于优化的方法面向低资源场景,仅用单步梯度扰动构建引导向量,无需额外数据。实验表明,这些干预措施可以降低模型生成与受保护属性相关文本的倾向,同时保持情感和流畅性,并且在下游任务上的性能与未修改的模型相当。
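The dataset-based variant can be pictured with a small sketch: take the difference between mean activations on biased and neutral inputs as a direction, then remove a hidden state's component along it. The digest does not specify how the paper applies the vector, so the projection-removal step, the strength parameter `alpha`, and the toy activations are all assumptions for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_direction(biased_acts, neutral_acts):
    """Unit vector pointing from the neutral mean to the biased mean.

    Assumes the two means differ; identical means would give a zero norm.
    """
    diff = [b - n for b, n in zip(mean_vector(biased_acts), mean_vector(neutral_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def steer(hidden, direction, alpha=1.0):
    """Subtract alpha times the component of `hidden` along `direction`."""
    projection = sum(h * d for h, d in zip(hidden, direction))
    return [h - alpha * projection * d for h, d in zip(hidden, direction)]


# Toy 2-D activations: the first axis carries the attribute signal.
direction = steering_direction([[2.0, 0.0], [4.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
print(steer([5.0, 7.0], direction))
```

With `alpha=1.0` the component along the attribute direction is removed entirely while the orthogonal component is untouched; smaller values would give a softer intervention.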
Forecasting and Visualizing Air Quality from Sky Images with Vision-Language Models
Authors: Mohammad Saleh Vahdatpour, Maryam Eyvazi, Yanqing Zhang
First: 2025-09-18T15:36:38+00:00 · Latest: 2025-09-18T15:36:38+00:00
Comments: Published at ICCVW 2025
Abstract
Air pollution remains a critical threat to public health and environmental sustainability, yet conventional monitoring systems are often constrained by limited spatial coverage and accessibility. This paper proposes an AI-driven agent that predicts ambient air pollution levels from sky images and synthesizes realistic visualizations of pollution scenarios using generative modeling. Our approach combines statistical texture analysis with supervised learning for pollution classification, and leverages vision-language model (VLM)-guided image generation to produce interpretable representations of air quality conditions. The generated visuals simulate varying degrees of pollution, offering a foundation for user-facing interfaces that improve transparency and support informed environmental decision-making. These outputs can be seamlessly integrated into intelligent applications aimed at enhancing situational awareness and encouraging behavioral responses based on real-time forecasts. We validate our method using a dataset of urban sky images and demonstrate its effectiveness in both pollution level estimation and semantically consistent visual synthesis. The system design further incorporates human-centered user experience principles to ensure accessibility, clarity, and public engagement in air quality forecasting. To support scalable and energy-efficient deployment, future iterations will incorporate a green CNN architecture enhanced with FPGA-based incremental learning, enabling real-time inference on edge platforms.
中文标题/摘要
标题:基于视觉语言模型的天空图像空气质量预测与可视化
空气污染仍然是公共健康和环境可持续性的重大威胁,但传统的监测系统往往受限于有限的空间覆盖范围和可访问性。本文提出了一种基于人工智能的代理,可以从天空图像中预测大气污染水平,并使用生成模型合成逼真的污染场景可视化。我们的方法结合了统计纹理分析和监督学习进行污染分类,并利用视觉语言模型(VLM)指导的图像生成来生成可解释的空气质量条件表示。生成的视觉效果模拟了不同程度的污染,为面向用户的界面提供了基础,以提高透明度并支持知情的环境决策。这些输出可以无缝集成到旨在增强态势感知并鼓励基于实时预测的行为响应的智能应用中。我们使用城市天空图像数据集验证了该方法,并证明了其在污染水平估计和语义一致的视觉合成方面的有效性。系统设计进一步融入了以人为本的用户体验原则,以确保空气质量预测的可访问性、清晰性和公众参与。为了支持可扩展且节能的部署,未来的迭代将采用以基于FPGA的增量学习增强的绿色CNN架构,以实现边缘平台上的实时推理。
Summary / 总结
This paper introduces an AI system that predicts air pollution levels from sky images using statistical texture analysis and supervised learning, and generates realistic visualizations of pollution scenarios with vision-language models. The system effectively estimates pollution levels and produces semantically consistent visuals, enhancing situational awareness and supporting environmental decision-making. Future iterations aim to incorporate a green CNN architecture for scalable and energy-efficient deployment on edge platforms.
本文旨在通过开发一种基于天空图像预测空气质量的AI系统来解决传统空气质量监测系统的局限性,该系统结合统计纹理分析和监督学习进行污染分类,并使用视觉语言模型生成可解释的可视化表示。关键发现包括有效的污染水平估计和语义一致的视觉合成,未来版本将通过嵌入绿色CNN架构和基于FPGA的增量学习,实现边缘平台上的实时推理,以增强环境决策中的情况意识和公众参与。
QuizRank: Picking Images by Quizzing VLMs
Authors: Tenghao Ji, Eytan Adar
First: 2025-09-18T15:22:33+00:00 · Latest: 2025-09-18T15:22:33+00:00
Abstract
Images play a vital role in improving the readability and comprehension of Wikipedia articles by serving as "illustrative aids." However, not all images are equally effective and not all Wikipedia editors are trained in their selection. We propose QuizRank, a novel method of image selection that leverages large language models (LLMs) and vision language models (VLMs) to rank images as learning interventions. Our approach transforms textual descriptions of the article's subject into multiple-choice questions about important visual characteristics of the concept. We utilize these questions to quiz the VLM: the better an image can help answer questions, the higher it is ranked. To further improve discrimination between visually similar items, we introduce a Contrastive QuizRank that leverages differences in the features of target (e.g., a Western Bluebird) and distractor concepts (e.g., Mountain Bluebird) to generate questions. We demonstrate the potential of VLMs as effective visual evaluators by showing a high congruence with human quiz-takers and an effective discriminative ranking of images.
中文标题/摘要
标题:QuizRank:通过测验VLMs挑选图像
图像在提高维基百科文章的可读性和理解性方面起着至关重要的作用,作为“说明性辅助工具”。然而,并非所有图像都同样有效,也不是所有维基百科编辑都受过图像选择的培训。我们提出QuizRank,一种新颖的图像选择方法,利用大型语言模型(LLMs)和视觉语言模型(VLMs)对图像进行排名,作为学习干预措施。我们的方法将文章主题的文本描述转化为关于概念重要视觉特征的多项选择题。我们利用这些问题来测验VLM:图像越能帮助回答问题,排名越高。为了进一步提高对视觉相似项目的区分度,我们引入了对比式QuizRank(Contrastive QuizRank),利用目标概念(如西蓝鸲,Western Bluebird)和干扰概念(如山蓝鸲,Mountain Bluebird)的特征差异来生成问题。我们通过展示VLMs与人类测验者高度一致以及对图像的有效区分排名来证明VLMs作为有效的视觉评估工具的潜力。
PRISM: Product Retrieval In Shopping Carts using Hybrid Matching
Authors: Arda Kabadayi, Senem Velipasalar, Jiajing Chen
First: 2025-09-18T14:15:37+00:00 · Latest: 2025-09-18T14:15:37+00:00
Abstract
Compared to traditional image retrieval tasks, product retrieval in retail settings is even more challenging. Products of the same type from different brands may have highly similar visual appearances, and the query image may be taken from an angle that differs significantly from view angles of the stored catalog images. Foundational models, such as CLIP and SigLIP, often struggle to distinguish these subtle but important local differences. Pixel-wise matching methods, on the other hand, are computationally expensive and incur prohibitively high matching times. In this paper, we propose a new, hybrid method, called PRISM, for product retrieval in retail settings by leveraging the advantages of both vision-language model-based and pixel-wise matching approaches. To provide both efficiency/speed and fine-grained retrieval accuracy, PRISM consists of three stages: 1) A vision-language model (SigLIP) is employed first to retrieve the top 35 most semantically similar products from a fixed gallery, thereby narrowing the search space significantly; 2) a segmentation model (YOLO-E) is applied to eliminate background clutter; 3) fine-grained pixel-level matching is performed using LightGlue across the filtered candidates. This framework enables more accurate discrimination between products with high inter-class similarity by focusing on subtle visual cues often missed by global models. Experiments performed on the ABV dataset show that our proposed PRISM outperforms the state-of-the-art image retrieval methods by 4.21% in top-1 accuracy while still remaining within the bounds of real-time processing for practical retail deployments.
中文标题/摘要
标题:PRISM:购物车中产品检索的混合匹配方法
与传统的图像检索任务相比,零售环境中的产品检索更具挑战性。同一类型的不同品牌产品可能具有高度相似的视觉外观,查询图像的角度可能与存储目录图像的角度相差很大。基础模型如CLIP和SigLIP往往难以区分这些细微但重要的局部差异。像素级匹配方法则计算成本高昂,匹配时间难以接受。本文提出了一种新的混合方法PRISM,通过结合基于视觉语言模型和像素级匹配方法的优势,用于零售环境中的产品检索。PRISM由三个阶段组成:1) 使用视觉语言模型(SigLIP)首先从固定画廊中检索出最相似的35个产品,显著缩小搜索范围;2) 应用分割模型(YOLO-E)去除背景杂乱;3) 在筛选出的候选产品中进行精细的像素级匹配,使用LightGlue。该框架通过关注全局模型常忽略的细微视觉线索,使具有高类间相似性的产品之间能够更准确地区分。在ABV数据集上的实验表明,我们的PRISM在top-1准确率上比最先进的图像检索方法高出4.21%,同时仍保持实时处理的可行性,适用于实际零售部署。
Summary / 总结
The paper introduces PRISM, a hybrid method for product retrieval in retail settings, addressing the challenges of visual similarity and angle differences. It combines the semantic matching of SigLIP with object segmentation by YOLO-E and fine-grained pixel-level matching using LightGlue. The method significantly narrows the search space and improves accuracy, achieving 4.21% higher top-1 accuracy than state-of-the-art methods while maintaining real-time processing capabilities.
论文针对零售环境中产品检索的挑战,提出了一种结合视觉语言模型效率和像素级匹配准确性的混合方法PRISM。PRISM包括三个阶段:首先,SigLIP检索出最相似的35个产品,然后YOLO-E去除背景杂乱,最后使用LightGlue进行精细的像素级匹配。实验结果显示,PRISM在top-1准确率上比最先进的图像检索方法高出4.21%,同时保持实时处理能力。
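The coarse-to-fine structure of PRISM can be sketched as a shortlist step followed by a rerank step. This is only an outline under stated assumptions: the embeddings are toy 2-D vectors standing in for SigLIP features, the segmentation stage is omitted, and `local_match_count` is a hypothetical stand-in for a LightGlue-style matcher, not a real API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def shortlist(query_emb, gallery, k):
    """Stage 1: keep the k gallery items closest to the query embedding."""
    ranked = sorted(gallery, key=lambda pid: cosine(query_emb, gallery[pid]), reverse=True)
    return ranked[:k]


def rerank(candidates, local_match_count):
    """Stage 3: order the shortlist by a fine-grained local matching score."""
    return sorted(candidates, key=local_match_count, reverse=True)


# Toy gallery: two visually similar colas and one unrelated product.
gallery = {"cola_a": [1.0, 0.0], "cola_b": [0.9, 0.1], "soap": [0.0, 1.0]}
candidates = shortlist([1.0, 0.05], gallery, k=2)

# Hypothetical keypoint-match counts, as a pixel-level matcher might return.
fake_matches = {"cola_a": 12, "cola_b": 40}
best = rerank(candidates, lambda pid: fake_matches[pid])[0]
print(candidates, best)
```

The expensive matcher runs only on the shortlist (35 items in the paper), which is how the design keeps the matching cost bounded while still using local detail to separate look-alike products.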
EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Authors: Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang
First: 2025-09-18T14:07:53+00:00 · Latest: 2025-09-18T14:07:53+00:00
Abstract
Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
中文标题/摘要
标题:EchoVLM:用于通用超声智能的动态专家混合视觉语言模型
超声成像已成为早期癌症筛查的首选成像技术,因其无辐射、低成本和实时成像能力。然而,传统的超声诊断高度依赖于医生的专业知识,存在主观性高和诊断效率低的挑战。视觉语言模型(VLMs)为这一问题提供了有前景的解决方案,但现有的通用模型在超声医学任务中的知识有限,在多器官病灶识别上的泛化能力差,且在多任务诊断中的效率低。为解决这些局限性,我们提出了一种专门针对超声医学成像的视觉语言模型EchoVLM。该模型采用跨七个解剖区域数据训练的专家混合(MoE)架构,能够执行包括超声报告生成、诊断和视觉问答(VQA)在内的多种任务。实验结果表明,与Qwen2-VL相比,EchoVLM在超声报告生成任务中的BLEU-1分数和ROUGE-1分数分别提高了10.15和4.77分。这些发现表明,EchoVLM在提高超声成像诊断准确性方面具有巨大潜力,从而为未来的临床应用提供可行的技术解决方案。源代码和模型权重可在https://github.com/Asunatan/EchoVLM/获取。
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
Authors: Pei Liu, Haipeng Liu, Haichao Liu, Xin Liu, Jinxin Ni, Jun Ma
First: 2025-02-25T10:02:12+00:00 · Latest: 2025-09-18T11:55:02+00:00
Abstract
Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but the current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space. In this sense, it hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses the VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the issue of modality importance imbalance in fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from visual and textual modalities is effectively utilized. By explicitly addressing the imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showcasing the effectiveness of our attention-enhanced BEV representation in enabling more accurate and reliable autonomous driving tasks.
中文标题/摘要
标题:VLM-E2E:利用多模态驾驶员注意力融合提升端到端自动驾驶
人类驾驶员能够利用丰富的注意力语义在复杂场景中自如驾驶,但当前的自动驾驶系统在将二维观察转换为三维空间时往往会丢失关键的语义信息,这阻碍了它们在动态和复杂环境中的有效部署。利用视觉语言模型(VLMs)的优越场景理解和推理能力,我们提出了一种名为VLM-E2E的新框架,通过提供注意力提示来增强训练。该方法将文本表示整合到鸟瞰视图(BEV)特征中,用于语义监督,使模型能够学习更丰富的特征表示,明确捕捉驾驶员的注意力语义。通过关注注意力语义,VLM-E2E更好地与人类驾驶行为对齐,这对于在动态和复杂环境中导航至关重要。此外,我们引入了一种可学习的BEV-Text加权融合策略,以解决多模态信息融合中的模态重要性不平衡问题。该方法动态平衡BEV和文本特征的贡献,确保视觉和文本模态互补信息的有效利用。通过明确解决多模态融合中的不平衡问题,我们的方法促进了更全面和稳健的驾驶环境表示。我们在nuScenes数据集上评估了VLM-E2E,并在感知、预测和规划方面显著优于基线端到端模型,展示了我们增强的BEV表示在实现更准确和可靠的自动驾驶任务方面的有效性。
Summary / 总结
VLM-E2E proposes a novel framework to enhance end-to-end autonomous driving by integrating textual representations into Bird's-Eye-View features, which helps the model learn richer feature representations that capture driver attentional semantics. This approach improves perception, prediction, and planning tasks over the baseline model, demonstrating the effectiveness of attention-enhanced BEV representation.
研究旨在通过利用视觉语言模型(VLMs)提供注意力线索,增强端到端自动驾驶,更好地进行语义监督和特征表示。VLM-E2E将文本表示集成到鸟瞰图(BEV)特征中,专注于注意力语义,以与人类驾驶行为对齐。该方法还引入了一种可学习的多模态信息加权融合策略,以平衡视觉和文本模态的贡献。实验结果表明,该方法在感知、预测和规划方面显著优于基线模型,展示了增强的BEV表示在实现更准确可靠的自动驾驶任务中的有效性。
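A learnable weighted fusion of two modalities can be as simple as a softmax over two trainable scalars. The sketch below assumes that form; the digest does not describe the actual parameterization in VLM-E2E, so the function names and the scalar-logit design are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(value - m) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(bev_feat, text_feat, logits):
    """Blend BEV and text features with weights that sum to one.

    `logits` holds two scalars that would be learned during training;
    here they are plain numbers supplied by the caller.
    """
    w_bev, w_text = softmax(logits)
    return [w_bev * b + w_text * t for b, t in zip(bev_feat, text_feat)]


# Equal logits give an even blend of the two modalities.
print(fuse([2.0, 4.0], [0.0, 0.0], logits=[0.0, 0.0]))
```

Because the weights are normalized, raising one logit during training shifts trust toward that modality without changing the overall feature scale.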
MARIC: Multi-Agent Reasoning for Image Classification
Authors: Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee
First: 2025-09-18T11:27:00+00:00 · Latest: 2025-09-18T11:27:00+00:00
Comments: Preprint
Abstract
Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.
中文标题/摘要
标题:MARIC:多智能体推理在图像分类中的应用
传统的图像分类依赖于参数密集型模型的训练,需要大规模标注数据集和大量微调才能达到有竞争力的性能。虽然最近的视觉语言模型(VLMs)在一定程度上缓解了这些限制,但它们仍然受限于对单次前向表示的依赖,往往无法捕捉视觉内容的互补方面。在本文中,我们提出了基于多智能体的图像分类推理(MARIC),这是一种多智能体框架,将图像分类重新表述为协作推理过程。MARIC 首先利用一个提纲智能体(Outliner Agent)分析图像的全局主题并生成针对性的提示。基于这些提示,三个方面智能体(Aspect Agents)沿不同的视觉维度提取细粒度描述。最后,一个推理智能体(Reasoning Agent)通过集成的反思步骤综合这些互补输出,生成用于分类的统一表示。通过明确将任务分解为多个视角并鼓励反思式综合,MARIC 缓解了参数密集型训练和单体式VLM推理的不足。在4个不同的图像分类基准数据集上的实验表明,MARIC 显著优于基线,突显了多智能体视觉推理在稳健且可解释的图像分类中的有效性。
Summary / 总结
MARIC is a multi-agent framework for image classification that addresses the limitations of traditional parameter-intensive training and monolithic vision language models. It decomposes the task into an outliner agent, three aspect agents, and a reasoning agent, each focusing on different aspects of the image. The outliner agent generates prompts, aspect agents extract fine-grained descriptions, and the reasoning agent synthesizes these outputs. Experiments show MARIC outperforms baselines on four diverse datasets, indicating the effectiveness of multi-agent visual reasoning.
MARIC 是一种多代理框架,用于图像分类,旨在解决传统参数密集型训练和单一视觉语言模型的局限性。它将任务分解为轮廓代理、三个方面代理和推理代理,每个代理关注图像的不同方面。轮廓代理生成提示,方面代理提取细粒度描述,推理代理综合这些输出。实验表明,MARIC 在四个不同的数据集上优于基线,表明多代理视觉推理的有效性。
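The three-stage flow described above (outliner, aspect agents, reasoner) can be sketched as plain function composition. This is a minimal sketch, not the authors' implementation: `ask(role, prompt, image)` is a hypothetical callable standing in for whatever VLM client is used, the prompt wording is invented, and passing no image to the reasoner is an assumption the abstract does not settle.

```python
def maric_classify(image, labels, ask):
    """Classify `image` into one of `labels` with an outliner -> aspects -> reasoner flow.

    `ask(role, prompt, image) -> str` is a hypothetical VLM call supplied by the caller.
    """
    # Stage 1: the outliner reads the global theme and proposes targeted prompts,
    # one per line; keep at most three non-empty lines.
    outline = ask("outliner", "Describe the global theme and list three aspects to inspect.", image)
    prompts = [line.strip() for line in outline.splitlines() if line.strip()][:3]

    # Stage 2: one aspect agent per prompt extracts a fine-grained description.
    descriptions = [ask("aspect", prompt, image) for prompt in prompts]

    # Stage 3: the reasoner reflects over the descriptions and names a label.
    # Assumption: the reasoner sees only text, so no image is passed here.
    joined = "\n".join(descriptions)
    question = f"Descriptions:\n{joined}\nReflect, then answer with one of: {', '.join(labels)}"
    return ask("reasoner", question, None).strip()
```

Keeping the model call behind a single callable makes the control flow testable with a stub before any real model is wired in.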
The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
Authors: Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Parvez
First: 2025-09-16T08:17:39+00:00 · Latest: 2025-09-18T10:10:19+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
中文标题/摘要
标题:说“可能”的艺术:从共形预测视角进行VLM不确定性基准测试
视觉-语言模型(VLMs)在跨科学和推理任务的复杂视觉理解方面取得了显著进展。虽然性能基准测试加深了我们对这些能力的理解,但不确定性量化这一关键维度却没有得到足够的关注。因此,不同于以往局限于有限场景的共形预测研究,我们进行了一项全面的不确定性基准测试研究,在6个多模态数据集上使用3种不同的评分函数评估了16个最先进的VLMs(开源和闭源)。我们的研究结果表明,较大的模型始终表现出更好的不确定性量化能力;知道得越多的模型也更清楚自己不知道什么。更确定的模型实现了更高的准确率,而与其他领域相比,数学和推理任务在所有模型上的不确定性表现都更差。这项工作为多模态系统的可靠不确定性评估奠定了基础。
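To make the conformal setup concrete, here is a minimal sketch of split conformal prediction for multiple-choice answers. It assumes the simplest nonconformity score, one minus the probability the model gives the true option (often called LAC); the abstract does not name its three scoring functions, so this choice and the function names are illustrative only.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return the score threshold q_hat estimated on a held-out calibration set."""
    # Nonconformity score: 1 - probability assigned to the true answer.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank; with too few calibration points the
    # guarantee can only be met by including every option.
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else scores[rank - 1]

def prediction_set(probs, q_hat):
    """Return the indices of all options whose score does not exceed q_hat."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= q_hat]
```

Smaller prediction sets at the same target coverage `1 - alpha` indicate a model whose probabilities separate right from wrong answers more sharply, which is the sense in which set size serves as an uncertainty measure.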
Frame Sampling Strategies Matter: A Benchmark for small vision language models
Authors: Marija Brkic, Anas Filali Razzouki, Yannis Tevissen, Khalil Guetari, Mounim A. El Yacoubi
First: 2025-09-18T09:18:42+00:00 · Latest: 2025-09-18T09:18:42+00:00
Abstract
Comparing vision language models on videos is particularly complex, as performance is jointly determined by the model's visual representation capacity and the frame-sampling strategy used to construct the input. Current video benchmarks are suspected to suffer from substantial frame-sampling bias, as models are evaluated with different frame selection strategies. In this work, we propose the first frame-accurate benchmark of state-of-the-art small VLMs (SVLMs) for video question-answering, evaluated under controlled frame-sampling strategies. Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques. By open-sourcing our benchmarking code, we provide the community with a reproducible and unbiased protocol for evaluating video VLMs and emphasize the need for standardized frame-sampling strategies tailored to each benchmarking dataset in future research.
中文标题/摘要
标题:帧采样策略很重要:小型视觉语言模型基准测试
在视频上比较视觉语言模型特别复杂,因为模型的表现由其视觉表示能力和用于构建输入的帧采样策略共同决定。当前的视频基准可能受到显著的帧采样偏差影响,因为模型使用了不同的帧选择策略进行评估。在本工作中,我们提出了第一个针对视频问答的最先进的小型视觉语言模型的帧准确基准测试,该基准测试在受控的帧采样策略下进行评估。我们的结果证实了存在的偏差,并突显了在不同帧采样技术下SVLMs的数据特异性和任务特异性行为。通过开源我们的基准测试代码,我们为社区提供了可重复且无偏的视频VLMs评估协议,并强调未来研究中为每个基准测试数据集制定标准化帧采样策略的必要性。
Summary / 总结
This work addresses the complexity of evaluating vision language models on videos, where model performance is influenced by both visual representation capacity and frame-sampling strategies. The authors propose a frame-accurate benchmark for small vision language models, ensuring controlled frame-sampling strategies. Their results confirm the presence of frame-sampling bias and reveal data-specific and task-specific behaviors of these models. The study emphasizes the importance of standardized frame-sampling strategies for future research.
该研究解决了在视频上比较视觉语言模型的复杂性,模型性能受视觉表示能力和帧采样策略的影响。作者提出了一种针对小型视觉语言模型的帧准确基准,采用受控的帧采样技术进行评估。结果确认了帧采样偏见的存在,并揭示了这些模型在不同帧采样技术下的任务特定和数据特定行为。开源基准代码确保了未来研究中的可重复性和无偏评估。
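As an illustration of what a controlled frame-sampling strategy looks like in code, the sketch below implements two common choices: a fixed number of evenly spaced frames, and a fixed sampling rate. These are assumptions for illustration; the abstract does not list the strategies the benchmark actually compares.

```python
def uniform_indices(num_frames, k):
    """Pick k frame indices spread evenly over the clip (centre of each segment)."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)  # cannot sample more frames than the clip has
    segment = num_frames / k
    return [int(segment * i + segment / 2) for i in range(k)]

def fps_indices(num_frames, native_fps, target_fps, max_frames=None):
    """Pick frames at a fixed rate, optionally truncated to a frame budget."""
    step = max(1, round(native_fps / target_fps))
    indices = list(range(0, num_frames, step))
    return indices if max_frames is None else indices[:max_frames]
```

The two strategies diverge on long videos: uniform sampling always covers the whole clip but thins out with length, while fixed-rate sampling with a budget keeps temporal density but may never reach the end.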
PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution
Authors: Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao
First: 2025-04-19T01:11:46+00:00 · Latest: 2025-09-18T08:24:25+00:00
Abstract
The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in the vision modality, and other modalities such as text and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen advanced generators like diffusion in a fine-grained manner. In this paper, we propose a novel parsing-aware vision language model with dynamic contrastive learning (PVLM) method for zero-shot deepfake attribution (ZS-DFA), which facilitates effective and fine-grained traceability to unseen advanced generators. Specifically, we construct a novel and fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors to unseen advanced generators like diffusion. Besides, we propose an innovative parsing-guided vision language model with dynamic contrastive learning (PVLM) method to capture general and diverse attribution features. We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly. We employ the inherent face attributes preservation differences to capture face parsing-aware forgery representations. Therefore, we devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results show that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various protocol evaluations.
中文标题/摘要
标题:PVLM:结合动态对比学习的解析感知视觉语言模型,用于零样本深度伪造归属
由于生成模型的迅速发展,伪造人脸的来源归属追溯问题引起了广泛关注。然而,现有的深度伪造归属(DFA)工作主要关注视觉模态中各种域之间的交互,而文本和人脸解析等其他模态尚未得到充分探索。此外,它们往往未能以细粒度的方式评估深度伪造归属器对未见过的先进生成器(如扩散模型)的泛化性能。在本文中,我们提出了一种新的结合动态对比学习的解析感知视觉语言模型(PVLM)方法,用于零样本深度伪造归属(ZS-DFA),以实现对未见过的先进生成器的有效且细粒度的可追溯性。具体而言,我们构建了一个新的细粒度ZS-DFA基准,以评估深度伪造归属器对未见过的先进生成器(如扩散模型)的归属性能。此外,我们提出了一种创新的结合动态对比学习的解析引导视觉语言模型(PVLM)方法,以捕捉通用且多样的归属特征。我们的动机来自如下观察:由GAN和扩散模型生成的人脸图像中,源人脸属性的保留程度存在显著差异。我们利用这种固有的人脸属性保留差异来捕捉解析感知的伪造表示。因此,我们设计了一种新的解析编码器,专注于全局人脸属性嵌入,通过动态视觉-解析匹配实现解析引导的DFA表示学习。此外,我们提出了一种新颖的深度伪造归属对比中心损失,将相关生成器拉近、将不相关生成器推远,可以引入到DFA模型中以增强可追溯性。实验结果表明,在多种协议评估下,我们的模型在ZS-DFA基准上超过了最先进的方法。
Summary / 总结
This paper addresses the challenge of tracing the source attribution of forged faces using a novel parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA). The authors propose a fine-grained ZS-DFA benchmark to evaluate the performance of deepfake attributors against unseen advanced generators. The PVLM method captures general and diverse attribution features by focusing on global face attribute embeddings and using a parsing-guided approach. The model also includes a deepfake attribution contrastive center loss to improve traceability. Experimental results demonstrate that the proposed model outperforms existing methods on the ZS-DFA benchmark.
本文针对伪造人脸来源追溯的挑战,提出了一种新的基于解析的视觉语言模型(PVLM)结合动态对比学习方法,用于零样本深伪归属(ZS-DFA)。作者引入了一个细粒度的ZS-DFA基准来评估深伪归属模型在面对未见过的高级生成器时的表现。PVLM模型通过关注全局面部属性嵌入并使用解析编码器来捕捉一般性和多样性的归属特征。此外,还提出了一种新的深伪归属对比中心损失,以增强可追溯性。实验结果表明,PVLM模型在ZS-DFA基准上优于现有方法。
BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
Authors: Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia
First: 2025-09-17T07:58:36+00:00 · Latest: 2025-09-18T04:57:32+00:00
Abstract
Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.
中文标题/摘要
标题:BWCache:通过块级缓存加速视频扩散变换器
最近在扩散变换器(DiTs)方面的进展已经确立了它们作为视频生成的最新方法的地位。然而,它们固有的顺序去噪过程不可避免地导致了延迟,限制了其实用性。现有的加速方法要么因架构修改而牺牲视觉质量,要么无法在适当的粒度上重用中间特征。我们的分析表明,DiT块是推理延迟的主要贡献者。在扩散时间步长中,DiT块的特征变化呈现出U形模式,在中间时间步长中相似度高,这表明存在大量的计算冗余。在本文中,我们提出了一种无需训练的方法Block-Wise Caching(BWCache)来加速基于DiT的视频生成。BWCache动态地跨扩散时间步长缓存和重用DiT块的特征。此外,我们引入了一个相似性指标,仅在相邻时间步长之间块特征的差异低于阈值时触发特征重用,从而最小化冗余计算并保持视觉保真度。在几种视频扩散模型上的广泛实验表明,BWCache在保持视觉质量的同时实现了高达2.24倍的加速。
Summary / 总结
BWCache is a training-free method that accelerates DiT-based video generation by caching and reusing features from DiT blocks across diffusion timesteps. It is motivated by the observation that block feature variations follow a U-shaped pattern across timesteps, with high similarity during intermediate timesteps, which points to computational redundancy. Reuse is triggered only when the difference between block features at adjacent timesteps falls below a threshold, yielding up to 2.24x speedup with comparable visual quality.
BWCache 是一种无需训练的方法,通过在扩散时间步之间缓存和重用 DiT 块的特征来加速 DiT。它基于相似性指标动态缓存特征,并仅在相邻时间步块特征的差异低于阈值时触发特征重用。实验表明,BWCache 可以实现最高 2.24 倍的加速,同时保持视觉质量。
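The threshold-triggered reuse described above can be sketched with toy stand-ins for DiT blocks. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the relative L1 indicator, the rule that a reused step is always followed by a recomputed one, and all names are hypothetical choices made to keep the example small.

```python
def rel_l1(new, old):
    """Relative L1 difference between two feature vectors."""
    num = sum(abs(a - b) for a, b in zip(new, old))
    den = sum(abs(b) for b in old) or 1.0  # avoid division by zero
    return num / den

def run_with_block_cache(blocks, x, timesteps, threshold=0.05):
    """Run `blocks` at every timestep, skipping them when features barely changed.

    blocks: list of callables f(features, t) -> features (toy stand-ins for DiT blocks).
    Returns (per-timestep outputs, number of block evaluations).
    """
    cache = None        # block outputs from the last computed timestep
    reuse_next = False  # set when the last two computed steps were similar
    evals, outputs = 0, []
    for t in timesteps:
        if reuse_next:
            h = cache[-1]       # reuse the cached final block output
            reuse_next = False  # recompute next step to refresh the indicator
        else:
            h, feats = x, []
            for block in blocks:
                h = block(h, t)
                feats.append(h)
                evals += 1
            if cache is not None:
                # Indicator: largest per-block change versus the cached features.
                change = max(rel_l1(n, o) for n, o in zip(feats, cache))
                reuse_next = change < threshold
            cache = feats
        outputs.append(h)
    return outputs, evals
```

The threshold trades speed for fidelity: a larger value skips more steps but reuses features that have drifted further from what a full computation would produce.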
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
Authors: Rashid Mushkani
First: 2025-09-18T03:21:10+00:00 · Latest: 2025-09-18T03:21:10+00:00
Abstract
Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. French responses were normalized to English. We evaluated seven VLMs in a zero-shot setup with a structured prompt and deterministic parser. We use accuracy for single-choice items and Jaccard overlap for multi-label items; human agreement uses Krippendorff's alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than subjective appraisals. The top system (claude-sonnet) reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement coincides with better model scores. Synthetic images slightly lower scores. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.
中文标题/摘要
标题:视觉语言模型是否像人一样看待城市场景?一个城市感知基准
理解人们如何阅读城市场景可以指导设计和规划。我们引入了一个小型基准,使用100张蒙特利尔街道图像测试视觉语言模型(VLMs)的城市感知能力,这些图像平分了照片和写实合成场景。来自七个社区团体的12名参与者提供了涵盖30个维度的230份注释表,这些维度混合了物理属性和主观印象。法语回答被标准化为英语。我们在零样本设置下评估了七种VLMs,使用结构化提示和确定性解析器。我们使用准确率评估单选题,使用Jaccard重叠评估多标签题;人类一致性使用Krippendorff的alpha和成对Jaccard。结果表明,模型在可见的、客观的属性上比主观评价有更好的对齐。顶级系统(claude-sonnet)在多标签题上的宏平均得分为0.31,平均Jaccard得分为0.48。更高的人类一致性与更好的模型得分相关。合成图像略微降低了得分。我们发布了基准、提示和框架,以实现可重复的、考虑不确定性的参与式城市分析评估。
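The two item-level metrics named in the abstract are simple to state in code. The sketch below assumes answers are plain strings or sets of strings; treating two empty label sets as a perfect match is a convention chosen here, not something the abstract specifies.

```python
def accuracy(preds, golds):
    """Fraction of single-choice items answered exactly right."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def jaccard(pred, gold):
    """Overlap between predicted and annotated label sets for a multi-label item."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # assumption: two empty sets count as full agreement
    return len(pred & gold) / len(pred | gold)

def mean_jaccard(pred_sets, gold_sets):
    """Average Jaccard overlap across multi-label items."""
    pairs = list(zip(pred_sets, gold_sets))
    return sum(jaccard(p, g) for p, g in pairs) / len(pairs)
```

For example, predicting {"trees", "benches"} against an annotation of {"trees", "benches", "cars"} gives a Jaccard overlap of 2/3.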
VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Authors: Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, Yuxin Ma
First: 2025-09-18T03:15:00+00:00 · Latest: 2025-09-18T03:15:00+00:00
Comments: 11 pages, 7 figures, 1 table, accepted to IEEE VIS 2025 (IEEE Transactions on Visualization and Computer Graphics)
Abstract
Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, there remain challenges due to a lack of in-depth comprehension of model behavior as well as the need for expertise and iterative efforts to explore data patterns. Given the achievement of visualization in explaining complex models and exploring large-scale data, understanding the impact of various data corruption on VL models aligns naturally with a visual analytics approach. To address these challenges, we introduce VisMoDAl, a visual analytics framework designed to evaluate VL model robustness against various corruption types and identify underperformed samples to guide the development of effective DA strategies. Grounded in the literature review and expert discussions, VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior and corresponding data slice. Unlike conventional works, VisMoDAl enables users to reason about the effects of corruption on VL models, facilitating both model behavior understanding and DA strategy formulation. The utility of our system is demonstrated through case studies and quantitative evaluations focused on corruption robustness in the image captioning task.
中文标题/摘要
标题:VisMoDAl:用于评估和提升视觉语言模型抗数据损坏鲁棒性的可视分析
视觉语言(VL)模型因其能够理解多模态信息而在多个关键领域展现了变革性的潜力。然而,它们的性能在分布偏移下经常下降,因此评估并提升其对实际应用中真实数据损坏的鲁棒性至关重要。尽管VL基准数据集和数据增强(DA)的进步已经促进了鲁棒性的评估和提升,但由于对模型行为缺乏深入理解,且探索数据模式需要专业知识和反复迭代,仍然存在挑战。鉴于可视化在解释复杂模型和探索大规模数据方面的成就,理解各种数据损坏对VL模型的影响自然适合采用可视分析方法。为了解决这些挑战,我们引入了VisMoDAl,这是一种可视分析框架,旨在评估VL模型对各种损坏类型的鲁棒性,并识别表现不佳的样本以指导有效数据增强策略的开发。VisMoDAl基于文献综述和专家讨论,支持多层次分析,从检查特定损坏下的性能到任务驱动的模型行为及相应数据切片的检查。与传统工作不同,VisMoDAl使用户能够推理数据损坏对VL模型的影响,促进模型行为理解和数据增强策略的制定。我们通过聚焦图像描述任务中损坏鲁棒性的案例研究和定量评估,展示了系统的实用性。
Summary / 总结
VisMoDAl is a visual analytics framework designed to evaluate and improve the robustness of vision-language models against various types of data corruption. It supports multi-level analysis, from examining performance under specific corruptions to task-driven inspection of model behavior and corresponding data slices, and it identifies underperforming samples to guide data augmentation. Case studies and quantitative evaluations on corruption robustness in image captioning show that it helps users understand model behavior and formulate effective data augmentation strategies.
VisMoDAl 是一个视觉分析框架,旨在评估和提高视觉语言模型在数据污染下的鲁棒性。它支持多层次分析,从特定污染类型到任务驱动的模型行为检查。主要发现表明,VisMoDAl 有助于识别表现不佳的样本,并指导有效数据增强策略的开发,特别是在图像字幕任务中。
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
Authors: Gabriel Sarch, Lawrence Jang, Michael J. Tarr, William W. Cohen, Kenneth Marino, Katerina Fragkiadaki
First: 2024-06-20T17:45:02+00:00 · Latest: 2025-09-18T02:44:34+00:00
Comments: Project website: https://ical-learning.github.io/
Abstract
Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot learning but require high-quality demonstrations. We propose In-Context Abstraction Learning (ICAL), enabling VLM agents to transform suboptimal trajectories into high-quality training data through self-reflection and human feedback. Given imperfect task demonstrations, a VLM abstracts trajectories into generalized strategies and action annotations by correcting inefficiencies and annotating cognitive abstractions: causal relationships, object state changes, temporal subgoals, and task-relevant visual elements. These annotations are iteratively refined through human feedback during execution in similar environments. The resulting examples significantly improve decision-making when used for retrieval-augmented generation or fine-tuning. As the agent's example library grows, it becomes more efficient at abstracting new examples, requiring less human feedback and fewer environment interactions. ICAL achieves state-of-the-art results across multiple benchmarks. In TEACh dialogue-based instruction following, combining fine-tuning and retrieval on ICAL examples outperforms raw human demonstrations and expert examples by 17.5% in goal-condition success. In VisualWebArena, retrieval-augmented GPT-4V with ICAL improves task success 1.6x, while fine-tuned Qwen2-VL achieves 2.8x improvement over the base model. In Ego4D action forecasting, we surpass few-shot GPT-4V and remain competitive with supervised models. Our approach scales 2x better than raw demonstrations and significantly reduces manual prompt engineering requirements.
中文标题/摘要
标题:VLM智能体生成自己的记忆:将经验提炼为具身思维程序
大规模生成式语言模型和视觉语言模型(LLMs和VLMs)在少样本学习方面表现出色,但需要高质量的演示。我们提出了上下文抽象学习(ICAL),使VLM智能体能够通过自我反思和人类反馈将次优轨迹转化为高质量的训练数据。给定不完美的任务演示,VLM通过纠正低效之处并标注认知抽象(因果关系、物体状态变化、时间子目标和任务相关的视觉元素),将轨迹抽象为通用策略和动作注释。这些注释在类似环境中执行期间通过人类反馈迭代优化。生成的示例在用于检索增强生成或微调时显著改善了决策。随着智能体示例库的增长,它在抽象新示例方面变得更加高效,需要更少的人类反馈和环境交互。ICAL在多个基准测试中取得了最先进的成果。在TEACh对话式指令跟随中,结合ICAL示例的微调和检索在目标条件成功率上比原始人类演示和专家示例高出17.5%。在VisualWebArena中,结合ICAL的检索增强GPT-4V将任务成功率提高了1.6倍,而微调后的Qwen2-VL相比基础模型提高了2.8倍。在Ego4D动作预测中,我们超越了少样本GPT-4V,并与监督模型保持竞争力。我们的方法的扩展性比原始演示好2倍,并显著减少了手动提示工程的需求。
Summary / 总结
The research aims to improve few-shot decision-making in vision-language model (VLM) agents by letting them turn suboptimal trajectories into high-quality training data through self-reflection and human feedback. The method, In-Context Abstraction Learning (ICAL), abstracts imperfect demonstrations into generalized strategies and action annotations, which are iteratively refined with human feedback. On TEACh, combining fine-tuning and retrieval on ICAL examples beats raw human demonstrations and expert examples by 17.5% in goal-condition success; on VisualWebArena, retrieval-augmented GPT-4V improves task success 1.6x and fine-tuned Qwen2-VL improves 2.8x over the base model; on Ego4D action forecasting, ICAL surpasses few-shot GPT-4V.
研究旨在通过自我反思和人类反馈,使视觉语言模型(VLMs)能够生成高质量的训练数据,从而增强其少量样本学习能力。方法In-Context Abstraction Learning (ICAL) 允许VLM代理将次优轨迹抽象为泛化的策略和动作注释,并通过执行过程中的人类反馈进行迭代优化。关键实验结果表明,ICAL在各种基准测试中显著提高了决策能力,最高可将任务成功率提高2.8倍,超越了原始的人类示范和专家示例。
An Empirical Study of Federated Prompt Learning for Vision Language Model
Authors: Zhihao Wang, Wenke Huang, Tian Chen, Zekun Shi, Guancheng Wan, Yu Qiao, Bin Yang, Jian Wang, Bing Li, Mang Ye
First: 2025-05-29T03:09:15+00:00 · Latest: 2025-09-18T02:36:50+00:00
Abstract
The Vision Language Model (VLM) excels in aligning vision and language representations, and prompt learning has emerged as a key technique for adapting such models to downstream tasks. However, the application of prompt learning with VLM in federated learning (FL) scenarios remains underexplored. This paper systematically investigates the behavioral differences between language prompt learning (LPT) and vision prompt learning (VPT) under data heterogeneity challenges, including label skew and domain shift. We conduct extensive experiments to evaluate the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL). Furthermore, we explore strategies for enhancing prompt learning in complex scenarios where label skew and domain shift coexist, including leveraging both prompt types when computational resources allow. Our findings offer practical insights into optimizing prompt learning in federated settings, contributing to the broader deployment of VLMs in privacy-preserving environments.
中文标题/摘要
标题:联邦提示学习在视觉语言模型中的实证研究
视觉语言模型(VLM)在视觉和语言表示对齐方面表现出色,提示学习已成为将此类模型适应下游任务的关键技术。然而,提示学习在联邦学习(FL)场景中的应用尚未得到充分探索。本文系统地研究了在数据异质性挑战(包括标签偏斜和领域偏移)下语言提示学习(LPT)和视觉提示学习(VPT)的行为差异。我们进行了广泛的实验,评估了各种FL和提示配置(如客户端规模、聚合策略和提示长度)对联邦提示学习(FPL)鲁棒性的影响。此外,我们探讨了在标签偏斜和领域偏移共存的复杂场景中增强提示学习的策略,包括在计算资源允许时利用两种提示类型。我们的研究结果为优化联邦设置中的提示学习提供了实用见解,有助于在隐私保护环境中更广泛地部署VLMs。
Summary / 总结
This paper explores the differences between language prompt learning (LPT) and vision prompt learning (VPT) in federated learning (FL) scenarios, focusing on data heterogeneity challenges like label skew and domain shift. Through extensive experiments, the study evaluates the impact of various FL and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL). Key findings include the importance of using both prompt types when computational resources permit to enhance prompt learning in complex scenarios.
该研究探讨了在联邦学习(FL)场景下语言提示学习(LPT)和视觉提示学习(VPT)之间的差异,重点关注标签偏差和领域偏移等数据异质性挑战。通过广泛的实验,研究评估了各种FL和提示配置的影响,如客户端规模、聚合策略和提示长度,以评估Federated Prompt Learning(FPL)的鲁棒性。主要发现包括在计算资源允许的情况下同时使用两种提示类型,以增强复杂场景下的提示学习。
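For readers unfamiliar with how prompts are shared in federated settings, the sketch below shows one aggregation round in which each client's learned prompt vector is averaged, weighted by its local dataset size. FedAvg-style weighting is an assumption used for illustration; the abstract says several aggregation strategies are compared but does not name them.

```python
def fedavg_prompts(client_prompts, client_sizes):
    """Average client prompt vectors, weighting each by its number of local samples.

    client_prompts: list of equal-length lists of floats (one flattened prompt per client).
    client_sizes: number of training samples held by each client.
    """
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[i] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for i in range(dim)
    ]
```

Only the prompt parameters travel between clients and server here; the backbone VLM stays frozen and local, which is what keeps communication cheap in this setup.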
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
Authors: Yuxuan Jiang, Zehua Chen, Zeqian Ju, Chang Li, Weibei Dou, Jun Zhu
Venue: ACM MM 2025
First: 2025-07-11T12:57:51+00:00 · Latest: 2025-09-18T02:19:36+00:00
Comments: Accepted at ACM MM 2025
Abstract
Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/
中文标题/摘要
标题:FreeAudio:无需训练的时间规划以实现可控长文本转音频生成
文本转音频(T2A)生成在生成模型的最新进展下取得了令人鼓舞的结果。然而,由于缺乏高质量和数量的时序对齐的音频-文本对,现有的T2A方法难以处理包含精确时间控制的复杂文本提示,例如“猫头鹰在2.4秒至5.2秒之间发出叫声”。最近的研究探索了数据增强技术或引入时间条件作为模型输入以实现时间条件下的10秒T2A生成,但其合成质量仍然有限。在本文中,我们提出了一种新颖的无需训练的时间控制T2A框架FreeAudio,首次尝试实现时间控制的长文本转音频生成,例如“猫头鹰在2.4秒至5.2秒之间发出叫声,蟋蟀在0秒至24秒之间鸣叫”。具体而言,我们首先使用LLM规划不重叠的时间窗口,并基于输入文本和时间提示重新描述每个窗口。然后我们引入:1)解耦和聚合注意力控制以实现精确的时间控制;2)上下文潜在组成以实现局部平滑和参考指导以实现全局一致性。广泛的实验表明:1)FreeAudio在无需训练的方法中实现了最先进的时间条件下的T2A合成质量,并且与领先的基于训练的方法相当;2)FreeAudio展示了与基于训练的Stable Audio相当的长文本生成质量,并为时间控制的长文本转音频合成铺平了道路。演示样本可在:https://freeaudio.github.io/FreeAudio/获取
Summary / 总结
FreeAudio is a training-free framework for timing-controlled text-to-audio generation, addressing the challenge of precise timing control in long-form audio synthesis. It uses an LLM to plan non-overlapping time windows and refine descriptions, and introduces attention control, latent composition, and reference guidance to enhance timing accuracy and audio quality. Experiments show that FreeAudio matches the quality of leading training-based methods and paves the way for timing-controlled long-form T2A synthesis.
FreeAudio 是一个无需训练的框架,用于实现精确时间控制的长文本转音频生成,解决了长音频合成中的精确时间控制难题。它使用语言模型规划非重叠时间窗口并细化自然语言描述。关键技术包括解耦和聚合注意力控制、上下文潜在组成和参考指导。实验表明,FreeAudio 的合成质量与领先的基于训练的方法相当,并为长文本转音频的精确时间控制铺平了道路。
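The planning step, which turns overlapping timed events into non-overlapping windows, can be illustrated deterministically. This sketch is a stand-in, not the authors' method: FreeAudio uses an LLM for planning and recaptioning, whereas the hypothetical `plan_windows` below only cuts the timeline at every event boundary and lists which events are active in each window.

```python
def plan_windows(events):
    """Split the timeline into non-overlapping windows and list the active events in each.

    events: list of (label, start_seconds, end_seconds) tuples.
    Returns a list of (start, end, [labels]) tuples in chronological order.
    """
    # Every event boundary becomes a potential window edge.
    cuts = sorted({t for _, start, end in events for t in (start, end)})
    windows = []
    for start, end in zip(cuts, cuts[1:]):
        # An event is active if it fully covers the window.
        active = [label for label, s, e in events if s <= start and end <= e]
        if active:  # skip silent gaps
            windows.append((start, end, active))
    return windows
```

Applied to the abstract's own prompt, the owl (2.4s-5.2s) and crickets (0s-24s) yield three windows: crickets alone, both sounds together, then crickets alone again. Each window's label list is what a recaptioning step would turn into a single natural-language description.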
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
Authors: Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng
First: 2025-02-24T21:01:39+00:00 · Latest: 2025-09-17T21:47:50+00:00
Comments: ACL2025 Main
Abstract
Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower the automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. Generating high-quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multi-modal reasoning process is difficult for direct prompting of VLMs. To resolve these challenges, we propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents. METAL achieves 5.2% improvement over the current best result in the chart generation task. The METAL framework exhibits the phenomenon of test-time scaling: its performance increases monotonically as the logarithmic computational budget grows from 512 to 8192 tokens. In addition, we find that separating different modalities during the critique process of METAL boosts the self-correction capability of VLMs in the multimodal context.
中文标题/摘要
标题:METAL:一种用于图表生成的多智能体框架(带测试时扩展)
图表生成旨在生成代码以生成满足所需视觉属性的图表,例如文本、布局、颜色和类型。它在金融分析、研究展示、教育和医疗保健中的自动专业报告生成方面具有巨大的潜力。在这项工作中,我们构建了一个基于视觉语言模型(VLM)的有效自动图表生成多智能体框架。生成高质量的图表需要强大的视觉设计技能和精确的编码能力,将所需的视觉属性嵌入到代码中。这种复杂的多模态推理过程难以直接对VLM进行提示。为了解决这些挑战,我们提出了METAL,一种多智能体框架,将图表生成任务分解为专业智能体之间的迭代协作。METAL在图表生成任务中的表现优于当前最佳结果5.2%。METAL框架展示了测试时扩展的现象:其性能随着计算预算从512增长到8192个令牌呈单调增长。此外,我们发现,在METAL的批评过程中分离不同的模态增强了VLM在多模态环境中的自我纠正能力。
Summary / 总结
The research aims to develop an effective multi-agent framework for automatic chart generation, which is crucial for professional report generation in various fields. METAL, a vision-language model-based framework, decomposes the chart generation task into iterative collaboration among specialized agents, improving the quality of generated charts by 5.2% compared to the previous best result. The framework also demonstrates test-time scaling, with performance increasing as computational budget grows, and enhances the self-correction capability of VLMs through modality separation during the critique process.
研究旨在开发一种有效的多智能体框架,用于自动生成图表,这对于金融分析、研究展示、教育和医疗保健等领域的专业报告生成至关重要。METAL框架利用视觉语言模型将图表生成任务分解为多个专业智能体的迭代协作,相比之前的最佳结果提高了5.2%的图表质量。该框架还展示了随着计算预算增长而性能递增的测试时扩展现象,并且在批评过程中分离不同模态时,增强了视觉语言模型在多模态环境下的自我纠正能力。
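The iterative collaboration with modality-separated critique can be sketched as a generate, render, critique, revise loop. Every callable below is a hypothetical placeholder for an agent or tool; the abstract does not specify the agents' interfaces, so this only illustrates the control flow in which visual and code critiques are produced separately and the round budget acts as the test-time compute knob.

```python
def metal_loop(spec, generate, render, critique_visual, critique_code, revise, max_rounds=3):
    """Refine chart code until both critics are satisfied or the round budget runs out.

    All callables are hypothetical stand-ins:
      generate(spec) -> code, render(code) -> image,
      critique_visual(spec, image) -> str, critique_code(spec, code) -> str,
      revise(code, notes) -> code. An empty critique string means "no issues".
    """
    code = generate(spec)
    for _ in range(max_rounds):
        image = render(code)
        # Keep the two modalities apart: one critic sees the image, the other the code.
        notes = [n for n in (critique_visual(spec, image), critique_code(spec, code)) if n]
        if not notes:
            break  # both critics are satisfied
        code = revise(code, notes)
    return code
```

Raising `max_rounds` is the simplest way to spend more compute at test time in this sketch.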
Hashing-Baseline: Rethinking Hashing in the Age of Pretrained Models
Authors: Ilyass Moummad, Kawtar Zaher, Lukas Rauch, Alexis Joly
First: 2025-09-17T20:58:43+00:00 · Latest: 2025-09-17T20:58:43+00:00
Abstract
Information retrieval with compact binary embeddings, also referred to as hashing, is crucial for scalable fast search applications, yet state-of-the-art hashing methods require expensive, scenario-specific training. In this work, we introduce Hashing-Baseline, a strong training-free hashing method leveraging powerful pretrained encoders that produce rich pretrained embeddings. We revisit classical, training-free hashing techniques: principal component analysis, random orthogonal projection, and threshold binarization, to produce a strong baseline for hashing. Our approach combines these techniques with frozen embeddings from state-of-the-art vision and audio encoders to yield competitive retrieval performance without any additional learning or fine-tuning. To demonstrate the generality and effectiveness of this approach, we evaluate it on standard image retrieval benchmarks as well as a newly introduced benchmark for audio hashing.
中文标题/摘要
标题:哈希基线:在预训练模型时代重新思考哈希
使用紧凑的二进制嵌入进行信息检索,也称为哈希,在可扩展的快速搜索应用中至关重要,但最先进的哈希方法需要昂贵的、特定场景的训练。在本文中,我们引入了哈希基线,这是一种强大的无需训练的哈希方法,利用强大的预训练编码器生成丰富的预训练嵌入。我们回顾了经典的无需训练的哈希技术:主成分分析、随机正交投影和阈值二值化,以生成哈希的强基线。我们的方法将这些技术与最先进的视觉和音频编码器的冻结嵌入相结合,无需任何额外的学习或微调即可获得竞争力的检索性能。为了证明该方法的通用性和有效性,我们在标准图像检索基准以及新引入的音频哈希基准上进行了评估。
Summary / 总结
The research aims to improve the efficiency and effectiveness of information retrieval using hashing methods, which are essential for fast search applications. The study introduces Hashing-Baseline, a training-free hashing method that uses powerful pretrained encoders to generate rich embeddings. The approach combines classical hashing techniques with frozen embeddings from state-of-the-art vision and audio encoders, achieving competitive retrieval performance without additional learning or fine-tuning. The method is evaluated on standard image retrieval benchmarks and a new audio hashing benchmark, demonstrating its generality and effectiveness.
研究旨在通过哈希方法提高信息检索的效率和效果,这些方法对于快速搜索应用至关重要。该研究引入了Hashing-Baseline,这是一种无需训练的哈希方法,利用强大的预训练编码器生成丰富的嵌入。该方法结合了经典的哈希技术与来自最新视觉和音频编码器的冻结嵌入,无需额外的学习或微调即可实现竞争力的检索性能。该方法在标准图像检索基准和新的音频哈希基准上进行了评估,展示了其通用性和有效性。
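A reduced version of the training-free recipe fits in a few lines of standard-library Python. This sketch covers only mean-centering, random orthogonal projection, and sign-threshold binarization; the PCA step mentioned in the abstract is left out to keep the example dependency-free, and all function names are hypothetical.

```python
import random

def random_orthogonal(dim, n_bits, seed=0):
    """Return n_bits orthonormal vectors in R^dim (Gram-Schmidt on Gaussian draws)."""
    assert n_bits <= dim
    rng = random.Random(seed)
    basis = []
    while len(basis) < n_bits:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:  # remove components along earlier directions
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-9:
            basis.append([x / norm for x in v])
    return basis

def fit_mean(embeddings):
    """Per-dimension mean of the frozen embeddings, used for centering."""
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]

def hash_code(embedding, mean, basis):
    """Center, project, and threshold at zero to get one bit per direction."""
    centered = [x - m for x, m in zip(embedding, mean)]
    return [1 if sum(c * b for c, b in zip(centered, row)) > 0 else 0 for row in basis]

def hamming(a, b):
    """Number of differing bits; the retrieval distance between two codes."""
    return sum(x != y for x, y in zip(a, b))
```

Retrieval then ranks database items by `hamming` distance to the query code, which needs no learned parameters beyond the frozen encoder that produced the embeddings.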
Fine-tuning Vision Language Models with Graph-based Knowledge for Explainable Medical Image Analysis
Authors: Chenjun Li, Laurin Lux, Alexander H. Berger, Martin J. Menten, Mert R. Sabuncu, Johannes C. Paetzold
First: 2025-03-12T20:19:07+00:00 · Latest: 2025-09-17T20:08:48+00:00
Comments: 11 pages, 3 figures
Abstract
Accurate staging of Diabetic Retinopathy (DR) is essential for guiding timely interventions and preventing vision loss. However, current staging models are hardly interpretable, and most public datasets contain no clinical reasoning or interpretation beyond image-level labels. In this paper, we present a novel method that integrates graph representation learning with vision-language models (VLMs) to deliver explainable DR diagnosis. Our approach leverages optical coherence tomography angiography (OCTA) images by constructing biologically informed graphs that encode key retinal vascular features such as vessel morphology and spatial connectivity. A graph neural network (GNN) then performs DR staging while integrated gradients highlight critical nodes and edges and their individual features that drive the classification decisions. We collect this graph-based knowledge which attributes the model's prediction to physiological structures and their characteristics. We then transform it into textual descriptions for VLMs. We perform instruction-tuning with these textual descriptions and the corresponding image to train a student VLM. This final agent can classify the disease and explain its decision in a human interpretable way solely based on a single image input. Experimental evaluations on both proprietary and public datasets demonstrate that our method not only improves classification accuracy but also offers more clinically interpretable results. An expert study further demonstrates that our method provides more accurate diagnostic explanations and paves the way for precise localization of pathologies in OCTA images.
中文标题/摘要
标题:基于图知识微调视觉语言模型以实现可解释的医学图像分析
准确的糖尿病视网膜病变(DR)分期对于指导及时干预和预防视力丧失至关重要。然而,当前的分期模型几乎不具备可解释性,而且大多数公开的数据集仅包含图像级别的标签,而没有临床推理或解释。在本文中,我们提出了一种新颖的方法,将图表示学习与视觉-语言模型(VLMs)结合,以提供可解释的DR诊断。我们的方法利用光学相干断层扫描血管成像(OCTA)图像,构建基于生物学知识的图,这些图编码了关键的视网膜血管特征,如血管形态和空间连接性。然后,图神经网络(GNN)执行DR分期,同时积分梯度(integrated gradients)突出显示驱动分类决策的关键节点和边及其各自的特征。我们收集这种将模型预测归因于生理结构及其特征的基于图的知识,然后将其转换为文本描述,供VLMs使用。我们使用这些文本描述和相应的图像进行指令微调,以训练一个学生VLM。最终的智能体仅凭单张图像输入,即可对疾病进行分类,并以人类可理解的方式解释其决策。在私有和公开数据集上的实验评估表明,我们的方法不仅提高了分类准确性,还提供了更具临床可解释性的结果。进一步的专家研究表明,我们的方法提供了更准确的诊断解释,并为OCTA图像中的病理精确定位铺平了道路。
Summary / 总结
This paper presents a method for explainable Diabetic Retinopathy (DR) diagnosis that combines graph-based knowledge with vision-language models (VLMs). The approach constructs biologically informed graphs from OCTA images to encode key retinal vascular features, which a graph neural network then uses for DR staging. Integrated gradients highlight the critical nodes and edges; these attributions are converted into textual descriptions and used to instruction-tune a student VLM that classifies the disease and explains its decision from a single image. Experiments on proprietary and public datasets show improved classification accuracy and more clinically interpretable results, and an expert study finds that the diagnostic explanations are more accurate and pave the way for precise localization of pathologies in OCTA images.
本文提出了一种将基于图的知识与视觉语言模型(VLM)结合、以实现可解释的糖尿病视网膜病变(DR)诊断的方法。该方法从OCTA图像构建基于生物学知识的图来编码关键的视网膜血管特征,然后由图神经网络进行DR分期。积分梯度突出关键节点和边及其特征,这些归因结果被转换为文本描述,用于对学生VLM进行指令微调。实验结果表明,该方法不仅提高了分类准确性,还提供了更具临床可解释性的结果;专家研究进一步表明,该方法能提供更准确的诊断解释,并有助于OCTA图像中病理的精确定位。
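For readers unfamiliar with integrated gradients, the attribution method named in the abstract, the following is a minimal sketch on a toy logistic model. The paper applies the method to a GNN over retinal graphs; the model, weights, input, and all-zero baseline here are illustrative assumptions chosen only so that the gradient has a closed form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy stand-in classifier: logistic regression on a feature vector."""
    return sigmoid(x @ w + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    # Points on the straight line from the baseline to the input.
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.stack([model_grad(p, w, b) for p in path])
    return (x - baseline) * grads.mean(axis=0)

w = np.array([1.5, -2.0, 0.5])   # hypothetical weights
b = 0.1
x = np.array([1.0, -0.5, 2.0])   # hypothetical input features
baseline = np.zeros(3)

attributions = integrated_gradients(x, baseline, w, b)
# Completeness: attributions sum to (approximately) the change in model output.
gap = attributions.sum() - (model(x, w, b) - model(baseline, w, b))
```

The completeness property checked by `gap` is what allows each attribution to be read as a feature's share of the prediction, which is the kind of per-node and per-edge evidence the paper turns into text.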
Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark
Authors: Nisarg A. Shah, Amir Ziai, Chaitanya Ekanadham, Vishal M. Patel
First: 2025-09-17T17:58:06+00:00 · Latest: 2025-09-17T17:58:06+00:00
Comments: 11 pages, 5 figures, 5 tables
Abstract
While recent advancements in vision-language models have improved video understanding, diagnosing their capacity for deep, narrative comprehension remains a challenge. Existing benchmarks often test short-clip recognition or use template-based questions, leaving a critical gap in evaluating fine-grained reasoning over long-form narrative content. To address these gaps, we introduce Cinéaste, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel fine-grained contextual reasoning categories. We use GPT-4o to generate diverse, context-rich questions by integrating visual descriptions, captions, scene titles, and summaries, which require deep narrative understanding. To ensure high-quality evaluation, our pipeline incorporates a two-stage filtering process: Context-Independence filtering ensures questions require video context, while Contextual Veracity filtering validates factual consistency against the movie content, mitigating hallucinations. Experiments show that existing MLLMs struggle on Cinéaste; our analysis reveals that long-range temporal reasoning is a primary bottleneck, with the top open-source model achieving only 63.15% accuracy. This underscores significant challenges in fine-grained contextual understanding and the need for advancements in long-form movie comprehension.
中文标题/摘要
标题:Cinéaste:细粒度上下文电影问答基准
尽管近期视觉语言模型在视频理解方面取得了进步,但诊断其在深层次叙事理解方面的能力仍是一个挑战。现有基准测试往往侧重于短片段识别或使用模板化问题,这在评估长篇叙事内容的精细推理方面留下了关键缺口。为解决这些问题,我们引入了Cinéaste,一个全面的长篇电影理解基准。我们的数据集包含来自200部不同电影1,805个场景的3,119个多项选择题-答案对,涵盖了五个新颖的精细语境推理类别。我们使用GPT-4o生成多样化、富含语境的问题,结合视觉描述、字幕、场景标题和摘要,这些问题都需要深入的叙事理解。为了确保高质量评估,我们的流水线包含两阶段过滤过程:语境独立性过滤确保问题需要视频语境,而语境真实性过滤则对照电影内容验证事实一致性,减少幻觉。实验表明,现有MLLM在Cinéaste上表现不佳;我们的分析显示,长时序推理是主要瓶颈,顶级开源模型的准确率仅为63.15%。这突显了精细语境理解的重大挑战,以及在长篇电影理解方面取得进展的必要性。
Summary / 总结
Cinéaste is a benchmark for evaluating fine-grained contextual reasoning in long-form movie understanding. It includes 3,119 question-answer pairs from 1,805 scenes across 200 movies, covering five reasoning categories. GPT-4o generates context-rich questions by integrating visual descriptions and summaries, requiring deep narrative understanding. Experiments show existing models struggle, with the top open-source model achieving only 63.15% accuracy, highlighting the need for advancements in long-form movie comprehension.
Cinéaste 是一个用于评估长片理解中的细粒度上下文推理的基准,包含来自200部电影的1,805个场景中的3,119个问答对,涵盖五个推理类别。通过整合视觉描述和摘要,GPT-4o 生成了丰富的上下文问题,要求深入理解叙事。实验显示现有模型表现不佳,顶级开源模型的准确率仅为63.15%,突显了长片细粒度理解的挑战和需要改进之处。
TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning
Authors: Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao
First: 2025-09-17T16:58:44+00:00 · Latest: 2025-09-17T16:58:44+00:00
Abstract
With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that introduces a tree-structured trajectory representation, merging semantically identical states across trajectories to eliminate label conflicts. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Additionally, a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
中文标题/摘要
标题:TGPO:基于树引导的偏好优化以增强网络代理的强化学习鲁棒性
随着大型语言模型和视觉-语言模型的迅速发展,使用大型模型作为网络代理已成为自动网络交互的必要手段。然而,使用强化学习训练网络代理面临着关键挑战,包括信用分配不当、标注成本高昂以及奖励稀疏性。为了解决这些问题,我们提出了基于树引导的偏好优化(TGPO),这是一种离线强化学习框架,通过树结构轨迹表示将不同轨迹中语义相同的状态合并,以消除标签冲突。该框架结合了过程奖励模型,该模型能够通过子目标进展、冗余检测和动作验证自动生成细粒度奖励。此外,动态加权机制在训练过程中优先处理高影响决策点。在Online-Mind2Web和我们自构建的C-WebShop数据集上的实验表明,TGPO显著优于现有方法,能够以更少的冗余步骤实现更高的成功率。
Summary / 总结
The research aims to address the challenges in training Web Agents using reinforcement learning, such as credit assignment misallocation, high annotation costs, and sparse rewards. The proposed Tree-Guided Preference Optimization (TGPO) framework uses a tree-structured trajectory representation to merge semantically identical states and eliminate label conflicts. TGPO also includes a Process Reward Model that generates fine-grained rewards and a dynamic weighting mechanism that prioritizes high-impact decision points. Experiments show that TGPO outperforms existing methods by achieving higher success rates with fewer redundant steps.
研究旨在解决使用强化学习训练网页代理所面临的挑战,如信用分配错误、高昂的标注成本和稀疏的奖励。提出的Tree-Guided Preference Optimization (TGPO)框架使用树结构的轨迹表示来合并语义相同的状态并消除标签冲突。TGPO还包含一个过程奖励模型,该模型基于子目标进度、冗余检测和动作验证生成细粒度的奖励。实验表明,TGPO在成功率和减少冗余步骤方面优于现有方法。
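The state-merging idea in the abstract can be illustrated with a small prefix tree over trajectories. This is a simplified sketch, not the authors' construction: the string state keys, the `Node` class, the use of success rate as a node value (the paper uses a Process Reward Model instead), and the example trajectories are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str
    visits: int = 0
    successes: int = 0
    # Children are keyed by (action, next_state), so identical steps merge.
    children: dict = field(default_factory=dict)

    @property
    def value(self):
        return self.successes / self.visits if self.visits else 0.0

def build_tree(trajectories):
    """Merge trajectories that share a start state into one tree."""
    root = Node(state=trajectories[0][0])
    for start, steps, success in trajectories:
        assert start == root.state, "sketch assumes a common start state"
        node = root
        node.visits += 1
        node.successes += int(success)
        for action, next_state in steps:
            key = (action, next_state)
            node = node.children.setdefault(key, Node(state=next_state))
            node.visits += 1
            node.successes += int(success)
    return root

def preference_pairs(node, pairs=None):
    """At each branching state, prefer the action with the higher value."""
    pairs = [] if pairs is None else pairs
    kids = list(node.children.items())
    for (a1, _), c1 in kids:
        for (a2, _), c2 in kids:
            if c1.value > c2.value:
                pairs.append((node.state, a1, a2))
    for child in node.children.values():
        preference_pairs(child, pairs)
    return pairs

trajectories = [
    ("home", [("search shoes", "results"), ("click item 1", "product"), ("buy", "done")], True),
    ("home", [("search shoes", "results"), ("click ad", "ad_page")], False),
    ("home", [("open menu", "menu")], False),
]
root = build_tree(trajectories)
pairs = preference_pairs(root)
```

Because the shared step `search shoes` becomes a single node, it receives one aggregated value instead of a positive label from the first trajectory and a negative label from the second, which is the kind of label conflict the merged representation is meant to avoid.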
StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance
Authors: Zefan Qu, Zhenwei Wang, Haoyuan Wang, Ke Xu, Gerhard Hancke, Rynson W. H. Lau
Venue: SIGGRAPH Asia 2025
First: 2025-09-16T17:55:20+00:00 · Latest: 2025-09-17T15:58:50+00:00
Comments: SIGGRAPH Asia 2025, Project page: https://stylesculptor.github.io
Abstract
Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.
中文标题/摘要
标题:StyleSculptor:基于纹理-几何双重引导的零样本风格可控3D资产生成
在视频游戏和虚拟现实等实际应用中,生成遵循现有纹理和几何风格的3D资产往往是必要的甚至是不可避免的。尽管在从文本或图像生成3D对象方面取得了令人印象深刻的进展,但创建风格可控的3D资产仍然是一个复杂且具有挑战性的问题。在本文中,我们提出了一种名为StyleSculptor的新型无训练方法,用于从内容图像和一个或多个风格图像生成风格引导的3D资产。与以往工作不同,StyleSculptor以零样本方式实现了风格引导的3D生成,能够实现细粒度的3D风格控制,捕捉用户提供的风格图像的纹理、几何或两者风格。StyleSculptor的核心是一个新颖的风格解耦注意力(SD-Attn)模块,该模块通过跨3D注意力机制在输入内容图像和风格图像之间建立动态交互,实现稳定特征融合和有效的风格引导生成。为了缓解语义内容泄露,我们还在SD-Attn模块中引入了一种风格解耦特征选择策略,该策略利用3D特征补丁的方差来解耦风格和内容相关的通道,允许在注意力框架内选择性地注入特征。借助SD-Attn,网络可以动态计算纹理、几何或两者引导的特征,引导3D生成过程。在此基础上,我们进一步提出了风格引导控制(SGC)机制,该机制能够实现独占的几何或仅纹理风格化,以及可调节的风格强度控制。大量实验表明,StyleSculptor在生成高保真3D资产方面优于现有基线方法。
Summary / 总结
StyleSculptor is a zero-shot approach for generating 3D assets with style guidance from a content image and one or more style images. It introduces a Style Disentangled Attention (SD-Attn) module to dynamically interact between the content and style images, enabling fine-grained control over texture, geometry, or both. Experiments show that StyleSculptor outperforms existing methods in generating high-fidelity 3D assets with style control.
StyleSculptor 是一种零样本方法,用于根据一张内容图像和一张或多张风格图像生成风格引导的 3D 资产。它引入了一种新颖的 Style Disentangled Attention (SD-Attn) 模块,在内容图像和风格图像之间建立动态交互,实现对纹理、几何或两者的精细控制。实验表明,StyleSculptor 在生成具有风格控制的高保真 3D 资产方面优于现有方法。
VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement
Authors: Jun Du, Weiwei Xing, Ming Li, Fei Richard Yu
First: 2025-09-17T15:04:45+00:00 · Latest: 2025-09-17T15:04:45+00:00
Abstract
Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking framework (VSE-MOT). Specifically, we first design a tri-branch architecture that leverages a vision-language model to extract global visual semantic information from images and fuse it with query vectors. Subsequently, to further enhance the utilization of visual semantic information, we introduce the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic information to suit multi-object tracking tasks, while the VSFM improves the efficacy of feature fusion. Through extensive experiments, we validate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Its tracking performance metrics outperform those of existing methods by approximately 8% to 20%, while maintaining robust performance in conventional scenarios.
Summary / 总结
The research aims to improve multi-object tracking (MOT) in low-quality video scenes, which are common in real-world applications but often neglected by existing algorithms. The proposed VSE-MOT framework uses a tri-branch architecture in which a vision-language model extracts global visual semantic information that is fused with query vectors; a MOT-Adapter and a Visual Semantic Fusion Module (VSFM) further adapt and fuse this information. Experiments show that VSE-MOT outperforms existing methods by approximately 8% to 20% in low-quality scenarios while maintaining robust performance in conventional scenarios.
研究旨在改善低质量视频场景中的多目标跟踪(MOT),这些场景在现实应用中很常见。提出的VSE-MOT框架采用三分支架构,并包括视觉语义增强过程、多目标跟踪适配器和视觉语义融合模块,以提高跟踪性能。实验表明,VSE-MOT在低质量场景中的性能比现有方法高出约8%到20%,同时在常规场景中保持了稳健的性能。
Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Authors: Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang
First: 2025-09-04T05:35:32+00:00 · Latest: 2025-09-17T13:47:40+00:00
Abstract
Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield "visible but unreadable" stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
中文标题/摘要
标题:可见却不可读:视觉语言模型在不同书写系统中的一个系统性盲点
书写是一种普遍的文化技术,利用视觉进行符号化交流。人类表现出惊人的鲁棒性:即使字符被分割、融合或部分遮挡,我们也能迅速识别出单词。本文探讨先进视觉语言模型(VLMs)是否也具备这种鲁棒性。我们针对两种不同的书写系统(中文语素文字和英文字母词)构建了两个受心理物理学启发的基准,通过拼接、重组和叠加字形,生成对模型来说“可见但不可读”、但对人类来说仍可读的刺激。尽管在干净文本上表现出色,但当代VLMs在这些扰动下性能严重下降,经常产生不相关或不连贯的输出。这一模式表明存在结构性局限:模型过度依赖通用的视觉不变性,而对稳健读写能力所需的组合先验依赖不足。我们发布了刺激生成代码、提示和评估协议,以促进透明的复现和后续工作。我们的发现表明,需要能够编码符号分割、组合和跨书写系统绑定的架构和训练策略,并指出了在教育、无障碍、文化遗产和安全领域部署多模态系统时的具体挑战。
Summary / 总结
This paper investigates the resilience of advanced vision language models (VLMs) in recognizing fragmented or occluded text across different writing systems. The authors created psychophysics-inspired benchmarks by manipulating Chinese logographs and English alphabetic words, making them 'visible but unreadable' for models while still legible to humans. Despite strong performance on clean text, VLMs showed significant drops in accuracy under these perturbations, indicating a structural limitation in leveraging compositional priors for robust literacy. The study highlights the need for models to better encode symbol segmentation and composition across scripts, and underscores challenges in deploying multimodal systems in various domains.
该论文研究了先进视觉语言模型(VLMs)在不同书写系统中识别碎片化或被遮挡的文本时的鲁棒性。作者通过对中文语素文字和英文字母词进行处理,创建了受心理物理学启发的基准,使这些文字对模型而言“可见但不可读”,而对人类来说仍然清晰可读。尽管在干净的文本上表现良好,但VLMs在这些扰动下显示出显著的准确性下降,表明模型在利用组合先验以实现稳健的读写能力方面存在结构性限制。研究强调了模型需要更好地编码符号分割和跨书写系统的组合,同时也指出了在教育、无障碍、文化遗产和安全等领域部署多模态系统所面临的挑战。
History