arXiv Paper Digest

Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation

Authors: Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi

First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00

Comments: BMVC 2025 - Project Page: https://thegoodailab.org/blog/vocalign - Code: https://github.com/Sisso16/VocAlign

Abstract

We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.

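Title / Abstract (translated from Chinese)

Title: Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation

We introduce VocAlign, a source-free domain adaptation framework designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm combined with a vocabulary alignment strategy, which improves pseudo-label generation by introducing additional class concepts. For efficiency, we fine-tune the model with Low-Rank Adaptation (LoRA), preserving its original capabilities while minimizing computational overhead. We also propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements and further improves adaptation performance. Our method achieves a 6.11 mIoU improvement on the CityScapes dataset and performs strongly on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.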

Summary

The research introduces VocAlign, a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation. It uses a student-teacher paradigm with vocabulary alignment to enhance pseudo-label generation and employs Low-Rank Adaptation (LoRA) for efficient fine-tuning. The approach also includes a Top-K class selection mechanism to reduce memory usage. Experiments show a 6.11 mIoU improvement on CityScapes and superior performance in zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in open-vocabulary settings.

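The paper proposes VocAlign, a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation. The method adopts a student-teacher paradigm with a vocabulary alignment strategy to strengthen pseudo-label generation, and uses Low-Rank Adaptation (LoRA) for efficient fine-tuning. It also proposes a Top-K class selection mechanism to reduce memory usage. VocAlign achieves a 6.11 mIoU improvement on CityScapes and performs strongly on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.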
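The Top-K class selection idea can be illustrated with a small sketch. The abstract does not say how classes are scored or where the selection happens in the pipeline, so everything here is an assumption: the function name `select_top_k_classes`, the per-class scores (imagined as the teacher's mean confidence per class for one image), and the tie-breaking rule are all hypothetical and not taken from the VocAlign code.

```python
from typing import Dict, List


def select_top_k_classes(class_scores: Dict[str, float], k: int) -> List[str]:
    """Return the k class names with the highest scores (ties broken by name)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Sort by descending score, then alphabetically for a deterministic order.
    ranked = sorted(class_scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical per-image scores; in practice these would come from the teacher.
teacher_scores = {"road": 0.91, "car": 0.64, "sky": 0.88, "train": 0.02, "rider": 0.11}

# The student would then only be prompted with this reduced vocabulary.
student_vocabulary = select_top_k_classes(teacher_scores, k=3)
print(student_vocabulary)  # ['road', 'sky', 'car']
```

Restricting the student to a handful of class prompts per image is one plausible way such a mechanism could cut memory use, since per-class computations scale with vocabulary size.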

Calibration-Aware Prompt Learning for Medical Vision-Language Models

Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan

First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00

Comments: Accepted in BMVC 2025

Abstract

Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration is largely unexplored, and so remains a significant challenge. As such, miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives under scarce labeled data regime. First, we study a regularizer that attempts to align the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss to maximize textual feature proximity toward improving the reliability in confidence estimates of multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets reveal that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.

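Title / Abstract (translated from Chinese)

Title: Calibration-Aware Prompt Learning for Medical Vision-Language Models

Medical Vision-Language Models (Med-VLMs) perform well across a range of medical imaging tasks thanks to large-scale image-text pretraining. However, their confidence calibration has not been fully explored and remains a major challenge. Miscalibrated predictions can lead to overconfident errors, weakening clinical trust and the reliability of decision-making. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives when only a small amount of labeled data is available. First, we study a regularizer that tries to align the smoothed accuracy with the model's predicted confidences. Second, we introduce an angular separation loss to maximize textual feature proximity, improving the reliability of confidence estimates from multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets show that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.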

Summary

The paper introduces CalibPrompt, a framework to calibrate Medical Vision-Language Models (Med-VLMs) during prompt tuning. It optimizes learnable prompts with calibration objectives under limited labeled data. The method includes a regularizer to align smoothed accuracy with predicted model confidences and an angular separation loss to enhance textual feature proximity. Experiments show that CalibPrompt improves calibration without significantly affecting clean accuracy on four Med-VLMs and five medical imaging datasets.

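The study aims to reduce overconfident errors in medical imaging tasks by improving the confidence calibration of Medical Vision-Language Models (Med-VLMs). CalibPrompt is a new framework that does this by optimizing learnable prompts with calibration objectives under limited labeled data. It includes a regularization term that aligns smoothed accuracy with predicted confidence, and an angular separation loss that enhances textual feature proximity. Experiments across four Med-VLMs and five datasets show that CalibPrompt improves calibration without significantly affecting clean accuracy.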
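For readers unfamiliar with how calibration is usually quantified, the sketch below computes Expected Calibration Error (ECE) with equal-width confidence bins. The abstract does not name the metric used in the paper, so treating ECE as the measure is an assumption. The function name, the bin count, and the sample values are illustrative only, and this is not the CalibPrompt regularizer itself.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hit_sum = [0.0] * n_bins   # number of correct predictions per bin
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if ok else 0.0

    # (count / total) * |acc - conf| simplifies to |hits - conf_sum| / total.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / total


# Illustrative predictions: the model is confident but wrong on the second one.
confs = [0.95, 0.92, 0.85, 0.65]
hits = [True, False, True, True]
ece = expected_calibration_error(confs, hits)
print(round(ece, 4))  # 0.3425
```

A lower ECE means predicted confidence tracks empirical accuracy more closely; an overconfident model shows a large gap in the high-confidence bins.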

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Authors: Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang

First: 2025-09-18T17:59:22+00:00 · Latest: 2025-09-18T17:59:22+00:00

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.

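Title / Abstract (translated from Chinese)

Title: ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously and show great potential, but progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It provides a large-scale dataset spanning 6 operating systems and 3 task domains, built through a closed-loop pipeline that combines automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it improves over baselines by 26.6 on WebArena-Lite-v2 and by 10.7 on ScreenSpot-Pro, and sets new state-of-the-art results of 94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, and 47.4% on WebArena-Lite-v2. These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.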

Summary

ScaleCUA addresses the shortage of large-scale open-source data and foundation models for computer use agents by introducing a dataset spanning 6 operating systems and 3 task domains. The dataset is built with a closed-loop pipeline that combines automated agents with human experts, and models trained on it operate across platforms. ScaleCUA improves over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and reports new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). The authors state that data, models, and code will be released.

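ScaleCUA addresses the limitations of open-source computer use agents by introducing a large-scale dataset spanning multiple operating systems and task domains. The dataset is built through a closed-loop pipeline that combines automated agents with human experts. ScaleCUA clearly outperforms existing baselines on benchmarks such as MMBench-GUI L1-Hard, OSWorld-G, and WebArena-Lite-v2, demonstrating the importance of data-driven scaling for general-purpose computer use agents. The dataset, models, and code will be publicly released to support further research.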

MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation

Authors: Gengliang Li, Rongyu Chen, Bin Li, Linlin Yang, Guodong Ding

First: 2025-09-18T16:59:59+00:00 · Latest: 2025-09-18T16:59:59+00:00

Comments: Tech report

Abstract

Ensuring factual consistency and reliable reasoning remains a critical challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that integrates external knowledge grounding with reinforcement learning to improve the factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise; while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Codes are released at https://github.com/Garfieldgengliang/MEDFACT-R1.

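Title / Abstract (translated from Chinese)

Title: MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation

Ensuring factual consistency and reliable reasoning remains a key challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that combines external knowledge grounding with reinforcement learning to improve factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise, while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 achieves up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of the pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Code is released at https://github.com/Garfieldgengliang/MEDFACT-R1.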
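The group-relative part of GRPO can be sketched in a few lines: several answers are sampled for the same question, each gets a scalar reward, and each answer's advantage is its reward standardized against the group. The sketch below shows only that normalization step. The four component scores per answer are made up, their unweighted sum is an assumption (the abstract does not describe the four rewards or how they are combined), and using the population standard deviation is one of several conventions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores from four reward signals, for four sampled answers
# to the same question (rows = answers, columns = reward components).
component_scores = [
    [1.0, 1.0, 0.5, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.5, 0.5],
    [1.0, 1.0, 1.0, 1.0],
]

# Assumption: the total reward is the unweighted sum of the components.
total_rewards = [sum(scores) for scores in component_scores]
advantages = group_relative_advantages(total_rewards)
print([round(a, 3) for a in advantages])
```

Answers that score above the group average get a positive advantage and are reinforced; those below average are pushed down, without needing a separate value network.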

WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang

First: 2025-09-18T16:40:47+00:00 · Latest: 2025-09-18T16:40:47+00:00

Comments: Project Webpage: https://worldforge-agi.github.io/

Abstract

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.

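Title / Abstract (translated from Chinese)

Title: WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Recent video diffusion models show strong potential for spatial intelligence tasks thanks to their rich latent world priors. However, this potential is limited by their weak controllability and geometric inconsistency, leaving a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current methods often rely on retraining or fine-tuning, which can degrade pretrained knowledge and incurs high computational cost. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism that repeatedly optimizes network predictions during inference to enable precise trajectory injection. Flow-Gated Latent Fusion uses optical flow similarity to decouple motion from appearance in the latent space and selectively injects trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate the method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a new plug-and-play paradigm for controllable video synthesis and offers a new perspective on leveraging generative priors for spatial intelligence.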

Summary

WorldForge addresses the limitations of video diffusion models in 3D/4D tasks by proposing a training-free framework. It consists of three modules: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. These modules enable precise trajectory injection, motion decoupling, and adaptive correction during inference, respectively. Experiments show that WorldForge achieves accurate motion control and photorealistic content generation, outperforming existing methods in realism, trajectory consistency, and visual fidelity.

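WorldForge proposes a training-free framework to address the limitations of video diffusion models in 3D/4D tasks. The framework contains three modules: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. During inference, these modules provide precise trajectory injection, decoupling of motion from appearance, and adaptive correction, respectively. Experiments show that WorldForge outperforms existing methods in realism, trajectory consistency, and visual fidelity.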

Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Yijie Guo, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu

First: 2024-09-20T03:02:05+00:00 · Latest: 2025-09-18T16:36:42+00:00