arXiv Paper Digest

VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

Authors: Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das

First: 2025-10-15T17:59:52+00:00 · Latest: 2025-10-15T17:59:52+00:00

Abstract

Large Vision-Language Models (VLMs) excel at general visual reasoning tasks but exhibit sharp performance degradation when applied to novel domains with substantial distribution shifts from pretraining data. Existing domain adaptation approaches finetune different VLM components, but this often results in limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a compact set of learnable visual probes. These probes enable efficient domain-specific adaptation with minimal modification to pretrained parameters. We evaluate VisCoP across three challenging domain adaptation settings: cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP consistently outperforms existing adaptation strategies, achieving superior performance on target domains while effectively retaining source-domain knowledge.
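The abstract does not say how the visual probes interact with the vision encoder. As a rough illustration only, the sketch below assumes one common design: a small set of probe vectors acts as attention queries over frozen encoder features, so the probes are the only new parameters. The function `probe_attention`, the sizes, and the random features are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probe_attention(probes, features):
    """Let each probe (query) attend over frozen features (keys and values).

    Assumption: scaled dot-product attention; the paper may use a different
    interaction mechanism.
    """
    dim = len(probes[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in probes:
        # Similarity between this probe and every frozen patch feature.
        scores = [scale * sum(q * f for q, f in zip(query, feat))
                  for feat in features]
        attn = softmax(scores)
        # Probe output is the attention-weighted average of the features.
        out = [sum(attn[j] * features[j][d] for j in range(len(features)))
               for d in range(dim)]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


random.seed(0)
NUM_PATCHES, NUM_PROBES, DIM = 16, 4, 8  # hypothetical sizes

# Stand-in for the output of a frozen, pretrained vision encoder.
frozen_features = [[random.gauss(0.0, 1.0) for _ in range(DIM)]
                   for _ in range(NUM_PATCHES)]

# The probes would be the only trainable parameters in this sketch.
probes = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
          for _ in range(NUM_PROBES)]

probe_outputs, attn_weights = probe_attention(probes, frozen_features)
print(len(probe_outputs), len(probe_outputs[0]))
```

In this sketch the encoder features are never modified. Adaptation would come entirely from updating the probe vectors, which is one way to read "minimal modification to pretrained parameters".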

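Title / Abstract (translated from Chinese)

Title: VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models

Large vision-language models (VLMs) perform well on general visual reasoning tasks, but their performance drops sharply when they are applied to new domains whose distribution differs substantially from the pretraining data. Existing domain adaptation methods finetune different VLM components, but this often leads to limited domain-specific feature learning or catastrophic forgetting of prior capabilities. To address these issues, we introduce Vision Contextualized Probing (VisCoP), which augments the VLM's vision encoder with a set of learnable visual probes. These probes allow the model to adapt efficiently to a specific domain with minimal modification to the pretrained parameters. We evaluate VisCoP in three challenging domain adaptation settings: cross-view (exocentric to egocentric), cross-modal (RGB to depth), and cross-task (human understanding to robot control). Experiments show that VisCoP outperforms existing adaptation strategies on all target domains while effectively retaining source-domain knowledge.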

Summary

The research aims to improve the performance of large Vision-Language Models (VLMs) on novel domains whose data distribution differs from the pretraining data. VisCoP adds a compact set of learnable visual probes to the VLM's vision encoder, enabling efficient domain-specific adaptation with minimal modification to pretrained parameters. Experiments across three settings (cross-view, cross-modal, and cross-task) show that VisCoP outperforms existing adaptation strategies on target domains while retaining source-domain knowledge.

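Translated from the Chinese summary: the work aims to improve the performance of large vision-language models (VLMs) in new domains with distribution shift. VisCoP introduces learnable visual probes into the VLM's vision encoder, achieving efficient domain-specific adaptation while keeping parameter modification to a minimum. In all three domain adaptation settings (cross-view, cross-modal, and cross-task), the method outperforms existing strategies while effectively retaining source-domain knowledge.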