arXiv Paper Digest

Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation

Authors: Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi

First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00

Comments: BMVC 2025 - Project Page: https://thegoodailab.org/blog/vocalign - Code: https://github.com/Sisso16/VocAlign

Abstract

We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.

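Title / abstract (translated from Chinese)

Title: Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation

We introduce VocAlign, a source-free domain adaptation framework designed for VLMs in open-vocabulary semantic segmentation. The method adopts a student-teacher paradigm combined with a vocabulary alignment strategy that improves pseudo-label generation by introducing additional class concepts. For efficiency, the model is fine-tuned with Low-Rank Adaptation (LoRA), which preserves its original capabilities while minimizing computational overhead. We also propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements and further improves adaptation performance. The method achieves a 6.11 mIoU improvement on the CityScapes dataset and performs strongly on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.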

Summary

VocAlign is a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation. It uses a student-teacher paradigm in which a vocabulary alignment strategy strengthens pseudo-label generation, and it fine-tunes the model efficiently with Low-Rank Adaptation (LoRA). A Top-K class selection mechanism reduces memory usage. The method achieves a 6.11 mIoU improvement on CityScapes and outperforms existing methods on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in open-vocabulary settings.
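
The abstract states that VocAlign fine-tunes with LoRA but gives no configuration details. As a rough illustration of the general LoRA idea only (not VocAlign's actual code), the sketch below adds a scaled low-rank product to a frozen weight matrix and compares trainable parameter counts. The function names, the rank, and the `alpha` value are assumptions chosen for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A for a frozen weight matrix W.

    w:      out_dim x in_dim frozen pretrained weights
    lora_a: r x in_dim trainable matrix (A)
    lora_b: out_dim x r trainable matrix (B), usually initialised to zero
    alpha:  scaling hyperparameter (hypothetical value in the usage below)
    """
    r = len(lora_a)                 # rank of the low-rank update
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # out_dim x in_dim low-rank update
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def count_params(out_dim, in_dim, r):
    """Trainable parameters: full fine-tuning versus a rank-r LoRA update."""
    return {"full": out_dim * in_dim, "lora": r * (out_dim + in_dim)}


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 2.0]]                # rank 1
    b = [[1.0], [2.0]]
    print(lora_effective_weight(w, a, b, alpha=2.0))
    print(count_params(768, 768, 8))
```

With B initialised to zero the effective weight equals the pretrained weight, which is one reason LoRA is often described as preserving the original model at the start of adaptation.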

Calibration-Aware Prompt Learning for Medical Vision-Language Models

Authors: Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan

First: 2025-09-18T17:59:58+00:00 · Latest: 2025-09-18T17:59:58+00:00

Comments: Accepted in BMVC 2025

Abstract

Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration is largely unexplored, and so remains a significant challenge. As such, miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives under scarce labeled data regime. First, we study a regularizer that attempts to align the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss to maximize textual feature proximity toward improving the reliability in confidence estimates of multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets reveal that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.

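Title / abstract (translated from Chinese)

Title: Calibration-Aware Prompt Learning for Medical Vision-Language Models

Medical Vision-Language Models (Med-VLMs) perform well across a variety of medical imaging tasks thanks to large-scale image-text pretraining. Their confidence calibration, however, has not been adequately explored and remains a major challenge: miscalibrated predictions can produce overconfident errors that weaken clinical trust and the reliability of decision-making. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. With only a small amount of labeled data, CalibPrompt optimizes a small set of learnable prompts using carefully designed calibration objectives. First, we study a regularizer that tries to align the smoothed accuracy with the model's predicted confidences. Second, we introduce an angular separation loss that maximizes textual feature proximity in order to make the confidence estimates of multimodal Med-VLMs more reliable. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets show that CalibPrompt consistently improves calibration without substantially affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.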

Summary

The paper introduces CalibPrompt, a framework that calibrates Medical Vision-Language Models (Med-VLMs) during prompt tuning to address miscalibrated predictions. It optimizes learnable prompts with calibration objectives under limited labeled data. CalibPrompt uses a regularizer to align smoothed accuracy with predicted model confidences and an angular separation loss to enhance textual feature proximity. Experiments across four Med-VLMs and five medical imaging datasets show that CalibPrompt improves calibration without significantly affecting clean accuracy.
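
The abstract does not say which calibration metric the authors report. A common choice for measuring the gap between confidence and accuracy is the expected calibration error (ECE), sketched below under that assumption. The function name, the bin count, and the half-open binning convention are choices made for this example and are not taken from the CalibPrompt code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1]
    correct:     booleans, True where the prediction matched the label
    n_bins:      number of equal-width bins (assumed value, not from the paper)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bins are [lo, hi); a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Overconfident model: 90% confidence, 75% accuracy.
    confs = [0.9, 0.9, 0.9, 0.9]
    hits = [True, True, True, False]
    print(expected_calibration_error(confs, hits))
```

A well-calibrated model drives this value toward zero; an overconfident one, as in the usage above, leaves a gap between mean confidence and accuracy.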

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Authors: Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Zeyue Tian, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang

First: 2025-09-18T17:59:22+00:00 · Latest: 2025-09-18T17:59:22+00:00

Abstract

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.

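Title / abstract (translated from Chinese)

Title: ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously and show great potential, but progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work we introduce ScaleCUA, a step toward scaling open-source CUAs. It provides a large-scale dataset spanning 6 operating systems and 3 task domains, built through a closed-loop pipeline that combines automated agents with human experts. Trained on this scaled-up data, ScaleCUA operates seamlessly across platforms. Specifically, it gains +26.6 on WebArena-Lite-v2 and +10.7 on ScreenSpot-Pro over baselines, and sets new state-of-the-art results of 94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, and 47.4% on WebArena-Lite-v2. These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.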

Summary

ScaleCUA addresses the shortage of open-source data and models for computer use agents by introducing a large-scale dataset spanning multiple operating systems and task domains, built with a closed-loop pipeline that combines automated agents and human experts. Models trained on this data improve significantly over existing baselines and achieve new state-of-the-art results on MMBench-GUI L1-Hard, OSWorld-G, and WebArena-Lite-v2. The work highlights the importance of data-driven scaling for general-purpose computer use agents, and the authors plan to release data, models, and code for further research.

MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation

Authors: Gengliang Li, Rongyu Chen, Bin Li, Linlin Yang, Guodong Ding

First: 2025-09-18T16:59:59+00:00 · Latest: 2025-09-18T16:59:59+00:00

Comments: Tech report

Abstract

Ensuring factual consistency and reliable reasoning remains a critical challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that integrates external knowledge grounding with reinforcement learning to improve the factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise; while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Codes are released at https://github.com/Garfieldgengliang/MEDFACT-R1.

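Title / abstract (translated from Chinese)

Title: MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation

Ensuring factual consistency and reliable reasoning remains a key challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that combines external knowledge grounding with reinforcement learning to improve factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise; the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of the pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Code is released at https://github.com/Garfieldgengliang/MEDFACT-R1.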
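
The abstract names GRPO with four factual reward signals but does not define the rewards or how they are combined. The sketch below illustrates only the generic group-relative advantage used in GRPO-style training: each sampled response is scored against the mean and standard deviation of its own group. The weighted-sum combination, the function names, and the example scores are hypothetical and are not taken from the MEDFACT-R1 code.

```python
import math


def combine_rewards(signals, weights=None):
    """Weighted sum of per-response reward signals.

    Hypothetical: the abstract only says four signals exist, not how they mix.
    """
    if weights is None:
        weights = [1.0] * len(signals)
    return sum(w * s for w, s in zip(weights, signals))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    Uses the population standard deviation; implementations differ on this.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question, each with four made-up scores.
    group_signals = [
        [1.0, 0.0, 0.5, 1.0],
        [0.0, 0.0, 0.5, 0.0],
        [1.0, 1.0, 1.0, 1.0],
        [0.0, 1.0, 0.0, 0.0],
    ]
    rewards = [combine_rewards(s) for s in group_signals]
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, no separate value network is needed; a group whose responses all receive the same reward yields zero advantage for every member.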

WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance

Authors: Chenxi Song, Yanming Yang, Tong Zhao, Ruibo Li, Chi Zhang

First: 2025-09-18T16:40:47+00:00 · Latest: 2025-09-18T16:40:47+00:00

Comments: Project Webpage: https://worldforge-agi.github.io/

Abstract

Recent video diffusion models demonstrate strong potential in spatial intelligence tasks due to their rich latent world priors. However, this potential is hindered by their limited controllability and geometric inconsistency, creating a gap between their strong priors and their practical use in 3D/4D tasks. As a result, current approaches often rely on retraining or fine-tuning, which risks degrading pretrained knowledge and incurs high computational costs. To address this, we propose WorldForge, a training-free, inference-time framework composed of three tightly coupled modules. Intra-Step Recursive Refinement introduces a recursive refinement mechanism during inference, which repeatedly optimizes network predictions within each denoising step to enable precise trajectory injection. Flow-Gated Latent Fusion leverages optical flow similarity to decouple motion from appearance in the latent space and selectively inject trajectory guidance into motion-related channels. Dual-Path Self-Corrective Guidance compares guided and unguided denoising paths to adaptively correct trajectory drift caused by noisy or misaligned structural signals. Together, these components inject fine-grained, trajectory-aligned guidance without training, achieving both accurate motion control and photorealistic content generation. Extensive experiments across diverse benchmarks validate our method's superiority in realism, trajectory consistency, and visual fidelity. This work introduces a novel plug-and-play paradigm for controllable video synthesis, offering a new perspective on leveraging generative priors for spatial intelligence.

Summary

WorldForge is a training-free, inference-time framework that improves the controllability and geometric consistency of video diffusion models for 3D/4D generation. It consists of three modules: Intra-Step Recursive Refinement, Flow-Gated Latent Fusion, and Dual-Path Self-Corrective Guidance. These enable precise trajectory injection, selective motion guidance, and adaptive correction of trajectory drift, respectively, without retraining the model. Experiments across various benchmarks show that WorldForge outperforms existing methods in realism, trajectory consistency, and visual fidelity.

Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models

Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Yijie Guo, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu

First: 2024-09-20T03:02:05+00:00 · Latest: 2025-09-18T16:36:42+00:00