arXiv Paper Digest

Snapshot: 20260403_0400

A ROS 2 Wrapper for Florence-2: Multi-Mode Local Vision-Language Inference for Robotic Systems

Authors: J. E. Domínguez-Vidal

First: 2026-04-01T17:29:59+00:00 · Latest: 2026-04-01T17:29:59+00:00

Comments: 5 pages, 1 figure

Abstract

Foundation vision-language models are becoming increasingly relevant to robotics because they can provide richer semantic perception than narrow task-specific pipelines. However, their practical adoption in robot software stacks still depends on reproducible middleware integrations rather than on model quality alone. Florence-2 is especially attractive in this regard because it unifies captioning, optical character recognition, open-vocabulary detection, grounding and related vision-language tasks within a comparatively manageable model size. This article presents a ROS 2 wrapper for Florence-2 that exposes the model through three complementary interaction modes: continuous topic-driven processing, synchronous service calls and asynchronous actions. The wrapper is designed for local execution and supports both native installation and Docker container deployment. It also combines generic JSON outputs with standard ROS 2 message bindings for detection-oriented tasks. A functional validation is reported together with a throughput study on several GPUs, showing that local deployment is feasible with consumer grade hardware. The repository is publicly available here: https://github.com/JEDominguezVidal/florence2_ros2_wrapper

Summary

This paper introduces a ROS 2 wrapper for Florence-2, a vision-language model that integrates multiple tasks such as captioning and optical character recognition. The wrapper supports three interaction modes: continuous topic-driven processing, synchronous service calls, and asynchronous actions. It is designed for local execution and can be deployed either natively or in Docker containers. Experimental results show that local deployment is feasible with consumer-grade hardware, and a functional validation is provided. The wrapper combines generic JSON outputs with standard ROS 2 message bindings for detection tasks.

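This paper presents a ROS 2 wrapper for Florence-2, a model that covers tasks such as image captioning and optical character recognition. The wrapper supports continuous topic-driven processing, synchronous service calls, and asynchronous actions, and it is built for local execution with optional Docker deployment. Experiments indicate that local deployment is feasible on consumer-grade hardware, and the wrapper has been publicly released.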

ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Redirection

Authors: Yi Sun, Xinhao Zhong, Hongyan Li, Yimin Zhou, Junhao Li, Bin Chen, Xuan Wang

First: 2026-01-01T09:11:09+00:00 · Latest: 2026-04-01T16:54:27+00:00

Abstract

Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.

Summary

The paper introduces ActErase, a training-free method for precise concept erasure in text-to-image diffusion models. It leverages prompt-pair analysis to identify and redirect activation differences, enabling efficient removal of sensitive concepts without fine-tuning. ActErase outperforms existing methods in nudity, artistic style, and object removal tasks while maintaining the model's generative capabilities and showing robustness against adversarial attacks.

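The paper proposes ActErase, a training-free method for precise concept erasure in text-to-image diffusion models. It uses prompt-pair analysis to identify and redirect activations, avoiding the need for fine-tuning. ActErase performs strongly on key erasure tasks such as nudity, artistic style, and object removal while preserving the model's overall generative capability and remaining robust to adversarial attacks.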

Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

Authors: Reyhaneh Ahani Manghotay, Jie Liang

First: 2026-04-01T16:41:04+00:00 · Latest: 2026-04-01T16:41:04+00:00

Comments: 14 pages, 2 figures

Abstract

Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $δ_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.
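For readers unfamiliar with the two metrics quoted in this abstract, the sketch below shows how $δ_1$ accuracy and RMSE are conventionally computed for depth maps. It is not taken from the paper: the function name `depth_metrics`, the rule that a zero marks missing ground truth, and the toy values are assumptions for illustration, and the paper's own evaluation protocol (cropping, depth caps) is not stated in this digest.

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.25):
    """Return (delta_1, rmse) over pixels that have valid ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0  # assumption: zero marks a missing ground-truth depth
    pred, gt = pred[valid], gt[valid]
    # delta_1: share of pixels whose ratio error stays below the threshold.
    # Predictions are assumed to be strictly positive.
    ratio = np.maximum(pred / gt, gt / pred)
    delta_1 = float(np.mean(ratio < threshold))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    return delta_1, rmse

# Toy data: the last pixel has no ground truth and is ignored.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.1, 2.0, 6.0, 3.0])
delta_1, rmse = depth_metrics(pred, gt)
```

Higher $δ_1$ and lower RMSE are better, which is the direction of the improvement reported above.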

Summary

The paper introduces MoA-DepthCLIP, a parameter-efficient framework that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. It integrates a lightweight Mixture-of-Adapters module into the ViT-B/32 backbone and employs selective fine-tuning of the final layers, guided by a global semantic context vector and a hybrid prediction architecture. The method achieves competitive results on the NYU Depth V2 benchmark, significantly improving δ_1 accuracy and reducing RMSE compared to the DepthCLIP baseline.

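The work aims to use CLIP's semantic features for monocular depth estimation while reducing the amount of fine-tuning. It proposes MoA-DepthCLIP, which integrates a lightweight Mixture-of-Adapters module into the pretrained CLIP model to enable spatially aware adaptation. The method uses a hybrid prediction architecture and a composite loss function to improve structural accuracy. On the NYU Depth V2 benchmark, MoA-DepthCLIP significantly outperforms the baseline, achieving higher $δ_1$ accuracy and lower RMSE with few trainable parameters.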

TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models

Authors: Awais Khan, Muhammad Umar Farooq, Kutub Uddin, Khalid Malik

First: 2026-04-01T16:12:31+00:00 · Latest: 2026-04-01T16:12:31+00:00

Abstract

Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. In PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. In LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
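The idea of reading splice boundaries from first-order embedding dynamics can be illustrated with a small sketch. This is not the TRACE scoring statistic, which the abstract does not specify: the cosine-distance score, the median-plus-MAD threshold, and the names `transition_scores` and `flag_boundaries` are assumptions chosen for illustration, and the synthetic frames stand in for hidden states from a frozen speech foundation model.

```python
import numpy as np

def transition_scores(embeddings):
    """Cosine distance between consecutive frame embeddings, shape (T - 1,)."""
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    return 1.0 - np.sum(e[1:] * e[:-1], axis=1)

def flag_boundaries(scores, z=3.0):
    """Indices whose score exceeds median + z * MAD (a hypothetical rule)."""
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-12
    return np.flatnonzero(scores > med + z * mad)

# Toy trajectory: 20 frames near one point, then 20 frames near another,
# which imitates a splice between frame 19 and frame 20.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
other = rng.normal(size=16)
frames = np.stack(
    [base + 0.01 * rng.normal(size=16) for _ in range(20)]
    + [other + 0.01 * rng.normal(size=16) for _ in range(20)]
)
scores = transition_scores(frames)
boundaries = flag_boundaries(scores)
```

In this toy case the largest transition score sits at index 19, the step between the two segments. No training or labels are involved, which is the property the abstract emphasizes.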

Summary

The research aims to develop a training-free method for detecting partial audio deepfakes, which are particularly deceptive because they only alter parts of genuine recordings. TRACE, a training-free framework, analyzes the first-order dynamics of frozen speech foundation model representations to detect abrupt disruptions caused by splice boundaries. The method achieves competitive performance on four benchmarks, including surpassing a supervised baseline on the most challenging LlamaPartialSpoof benchmark without requiring target-domain data.

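The work addresses partial audio deepfakes, which are particularly deceptive because only part of the audio is synthesized. TRACE is a training-free framework that detects the synthesized segments by analyzing the first-order dynamics of frozen speech foundation model representations, with no frame-level annotation or retraining. Across four benchmarks TRACE is competitive with supervised baselines, and on LlamaPartialSpoof, the most challenging benchmark, it surpasses a supervised baseline without any target-domain data.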

ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction

Authors: Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Di Wen, Danda Pani Paudel, Luc Van Gool, Kailun Yang

Venue: CVPR 2026

First: 2026-04-01T16:11:59+00:00 · Latest: 2026-04-01T16:11:59+00:00

Comments: Accepted to CVPR 2026. The source code is publicly available at https://github.com/7uHeng/ProOOD

Abstract

3D semantic occupancy prediction is central to autonomous driving, yet current methods are vulnerable to long-tailed class bias and out-of-distribution (OOD) inputs, often overconfidently assigning anomalies to rare classes. We present ProOOD, a lightweight, plug-and-play method that couples prototype-guided refinement with training-free OOD scoring. ProOOD comprises (i) prototype-guided semantic imputation that fills occluded regions with class-consistent features, (ii) prototype-guided tail mining that strengthens rare-class representations to curb OOD absorption, and (iii) EchoOOD, which fuses local logit coherence with local and global prototype matching to produce reliable voxel-level OOD scores. Extensive experiments on five datasets demonstrate that ProOOD achieves state-of-the-art performance on both in-distribution 3D occupancy prediction and OOD detection. On SemanticKITTI, it surpasses baselines by +3.57% mIoU overall and +24.80% tail-class mIoU; on VAA-KITTI, it improves AuPRCr by +19.34 points, with consistent gains across benchmarks. These improvements yield more calibrated occupancy estimates and more reliable OOD detection in safety-critical urban driving. The source code is publicly available at https://github.com/7uHeng/ProOOD.
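Prototype matching, one ingredient the abstract attributes to EchoOOD, can be sketched in a few lines. The example below is a simplified stand-in and not the paper's scoring rule: it scores each feature by one minus its highest cosine similarity to any class prototype, and the function name `prototype_ood_scores` and the toy vectors are assumptions. The fusion with local logit coherence described in the abstract is not modeled here.

```python
import numpy as np

def prototype_ood_scores(features, prototypes):
    """1 - max cosine similarity to any class prototype; higher means more OOD."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return 1.0 - (f @ p.T).max(axis=1)

# Two class prototypes and two voxel features (toy values).
prototypes = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
features = np.array([[0.9, 0.1, 0.0],   # close to the first prototype
                     [0.0, 0.0, 1.0]])  # orthogonal to both prototypes
scores = prototype_ood_scores(features, prototypes)
```

The first feature gets a score near zero and the second a score of one, so a threshold on this score would separate them.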

Summary

ProOOD addresses the limitations of existing 3D semantic occupancy prediction methods by introducing prototype-guided refinement and training-free OOD scoring. It consists of prototype-guided semantic imputation, prototype-guided tail mining, and EchoOOD for reliable OOD scores. Experiments show ProOOD outperforms baselines on multiple datasets, achieving higher mIoU and AuPRCr scores, leading to better calibrated occupancy estimates and OOD detection in autonomous driving scenarios.

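ProOOD tackles the weaknesses of existing 3D semantic occupancy prediction methods with prototype-guided refinement and training-free OOD scoring. Its three parts are prototype-guided semantic imputation, prototype-guided tail mining, and EchoOOD, which produces reliable voxel-level OOD scores. In experiments ProOOD beats the baselines on multiple datasets in both mIoU and AuPRCr, giving better-calibrated occupancy estimates and more dependable OOD detection for autonomous driving.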

Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA

Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A., Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

Venue: ICLR 2026 poster

First: 2026-03-13T01:11:23+00:00 · Latest: 2026-04-01T15:49:01+00:00

Comments: Accepted as a poster at ICLR 2026 workshop ICBINB, typo fixed

Abstract

Vision-language models (VLMs) have advanced rapidly, yet they still struggle with basic spatial reasoning. Despite strong performance on general benchmarks, modern VLMs remain brittle at understanding 2D spatial relationships such as relative position, layout, and counting. We argue that this failure is not merely a data problem, but is closely tied to dominant design choices in current VLM pipelines: reliance on CLIP-style image encoders and the flattening of images into 1D token sequences with 1D positional encoding. We present a controlled diagnostic study within the LLaVA framework to isolate how these choices affect spatial grounding. We evaluate frontier models and LLaVA variants on a suite of spatial benchmarks, comparing CLIP-based encoders against alternatives trained with denser or generative objectives, as well as variants augmented with 2D positional encoding. Our results show consistent spatial performance gaps across models, and indicate that encoder objectives and positional structure shape spatial behavior, but do not fully resolve it.

Summary

The study investigates why modern vision-language models struggle with basic spatial reasoning despite good performance on general benchmarks. It argues that this issue is not just due to data limitations but is related to the design choices in current VLM pipelines, such as reliance on CLIP-style image encoders and the 1D positional encoding. Through a controlled study within the LLaVA framework, the researchers evaluated different models and variants on spatial benchmarks, finding consistent performance gaps and suggesting that while encoder objectives and positional structures influence spatial behavior, they do not fully address the problem.

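The study examines why modern vision-language models struggle with basic spatial reasoning despite strong results on general benchmarks. It argues that the failure stems not only from data but also from design choices in VLM pipelines, such as reliance on CLIP-style image encoders and the flattening of images into 1D token sequences. Using the LLaVA framework, the study evaluates different models and variants on spatial benchmarks; the results show that encoder objectives and positional structure influence spatial behavior but do not fully close the spatial performance gap.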

PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

Authors: Jingning Xu, Haochen Luo, Chen Liu

First: 2026-04-01T15:16:07+00:00 · Latest: 2026-04-01T15:16:07+00:00

Abstract

Vision-language models (VLMs) are vulnerable to adversarial image perturbations. Existing works based on adversarial training against task-specific adversarial examples are computationally expensive and often fail to generalize to unseen attack types. To address these limitations, we introduce Paraphrase-Decomposition-Aggregation (PDA), a training-free defense framework that leverages text augmentation to enhance VLM robustness under diverse adversarial image attacks. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation entirely at test time, thus requiring no modification to the underlying models. To balance robustness and efficiency, we instantiate PDA as variants that reduce the inference cost while retaining most of its robustness gains. Experiments on multiple VLM architectures and benchmarks for visual question answering, classification, and captioning show that PDA achieves consistent robustness gains against various adversarial perturbations while maintaining competitive clean accuracy, establishing a generic, strong and practical defense framework for VLMs during inference.
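The paraphrase-then-aggregate part of this test-time defense can be sketched as a majority vote over answers to reworded prompts. This is a minimal illustration under stated assumptions, not the PDA implementation: `aggregate_by_vote`, `toy_paraphrase`, and `toy_model` are hypothetical names, the stub model stands in for a real VLM call on a fixed image, and question decomposition is left out because its sub-answers would need a task-specific merging rule that the abstract does not describe.

```python
from collections import Counter
from typing import Callable, List

def aggregate_by_vote(
    model: Callable[[str], str],
    prompt: str,
    paraphrase: Callable[[str], List[str]],
) -> str:
    """Query the model on the prompt and its paraphrases; return the majority answer."""
    prompts = [prompt] + paraphrase(prompt)
    answers = [model(p).strip().lower() for p in prompts]
    return Counter(answers).most_common(1)[0][0]

def toy_paraphrase(prompt: str) -> List[str]:
    # Hypothetical paraphraser; a real one might call a language model.
    return [f"Please answer: {prompt}", f"{prompt} Reply briefly."]

def toy_model(prompt: str) -> str:
    # Stand-in for a VLM that is fooled by one particular phrasing.
    return "dog" if prompt.startswith("Please") else "cat"

answer = aggregate_by_vote(toy_model, "What animal is shown?", toy_paraphrase)
```

Here two of the three phrasings return "cat", so the vote recovers "cat" even though one phrasing yields a different answer. The cost is one model call per phrasing, which is the robustness-efficiency tradeoff the abstract mentions.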

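Summary

The research aims to improve the robustness of vision-language models (VLMs) against adversarial image attacks, noting that existing adversarial-training defenses are computationally expensive and often fail to generalize to unseen attack types. The Paraphrase-Decomposition-Aggregation (PDA) framework is introduced as a training-free method that enhances VLM robustness through text augmentation at test time. PDA performs prompt paraphrasing, question decomposition, and consistency aggregation without modifying the underlying model, and cheaper instantiations reduce inference cost while retaining most of the robustness gains. Experiments show that PDA consistently improves robustness against various adversarial perturbations while preserving clean accuracy on multiple VLM architectures and benchmarks.

The work strengthens VLMs against adversarial image attacks, where existing defenses are both computationally costly and often ineffective against unseen attack types. It proposes Paraphrase-Decomposition-Aggregation (PDA), a training-free defense that improves robustness through text augmentation. PDA runs entirely at test time, applying prompt paraphrasing, question decomposition, and consistency aggregation with no change to the underlying model. Experiments show that PDA consistently improves robustness to a range of adversarial perturbations while keeping clean accuracy competitive, giving a practical inference-time defense framework for VLMs.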

Customizing Large Vision Model-Guided Low-Rank Approximation for Ground-Roll Denoise

Authors: Jiacheng Liao, Feng Qian, Ziyin Fan, Yongjian Guo

First: 2026-04-01T14:59:48+00:00 · Latest: 2026-04-01T14:59:48+00:00

Abstract

Ground-roll is a dominant source of coherent noise in land and vertical seismic profiling (VSP) data, severely masking reflection events and degrading subsequent imaging and interpretation. Conventional attenuation methods, including transform-domain filtering, sparse representation, and deep learning, often suffer from limited adaptability, signal leakage, or dependence on labeled training data, especially under strong signal-noise overlap. To address these challenges, we propose a training-free framework that reformulates ground-roll attenuation as a semantic-guided signal separation problem. Specifically, a promptable large vision model is employed to extract high-level semantic priors by converting seismic gathers into visual representations and localizing ground-roll-dominant regions via text or image prompts. The resulting semantic response is transformed into a continuous soft mask, which is embedded into a mask-conditioned low-rank inverse formulation to enable spatially adaptive suppression and reflection-preserving reconstruction. An efficient alternating direction method of multipliers (ADMM)-based solver is further developed to solve the proposed inverse problem, enabling stable and physically consistent signal recovery without requiring task-specific training or manual annotation. Extensive experiments on both synthetic and field VSP datasets demonstrate that the proposed method achieves superior ground-roll attenuation while preserving reflection continuity and waveform fidelity, consistently outperforming representative transform-domain filtering and implicit neural representation methods.

Summary

The research aims to address the challenges of ground-roll noise in seismic data by proposing a training-free framework that uses a large vision model to extract semantic priors. The method converts seismic gathers into visual representations and uses text or image prompts to localize ground-roll-dominant regions, which are then transformed into a soft mask for low-rank inverse formulation. Experiments show that the proposed method effectively attenuates ground-roll noise while preserving reflection continuity and waveform fidelity, outperforming existing methods.

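The paper targets ground-roll noise in seismic data, which hampers the detection of reflection events. It proposes a training-free framework that uses a large vision model to extract semantic priors from seismic data and guide the separation of ground roll from reflection signals. The method converts seismic gathers into visual representations and uses text or image prompts to localize ground-roll-dominant regions, producing a soft mask for adaptive suppression and reconstruction. Experiments show that the method outperforms conventional methods in preserving reflection continuity and waveform fidelity.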

CDH-Bench: A Commonsense-Driven Hallucination Benchmark for Evaluating Visual Fidelity in Vision-Language Models

Authors: Kesheng Chen, Yamin Hu, Qi Zhou, Zhenqian Zhu, Wenjian Luo

First: 2026-03-30T03:04:53+00:00 · Latest: 2026-04-01T14:55:28+00:00

Abstract

Vision-language models (VLMs) achieve strong performance on many benchmarks, yet a basic reliability question remains underexplored: when visual evidence conflicts with commonsense, do models follow what is shown or what commonsense suggests? A characteristic failure in this setting is that the model overrides visual evidence and outputs the commonsense alternative. We term this phenomenon commonsense-driven hallucination (CDH). To evaluate it, we introduce CDH-Bench, a benchmark designed to create explicit visual evidence-commonsense conflicts. CDH-Bench covers three dimensions: counting anomalies, relational anomalies, and attribute anomalies. We evaluate frontier VLMs under binary Question Answering (QA) and multiple-choice QA, and report metrics including Counterfactual Accuracy (CF-Acc), Commonsense Accuracy (CS-Acc), Counterfactual Accuracy Drop (CFAD), Commonsense Collapse Rate (CCR), and Relative Prior Dependency (RPD). Results show that even strong models remain vulnerable to prior-driven normalization under visual evidence-commonsense conflict. CDH-Bench provides a controlled diagnostic of visual fidelity under visual evidence-commonsense conflict.

Summary

The research aims to evaluate how vision-language models handle conflicts between visual evidence and commonsense. The study introduces CDH-Bench, a benchmark that creates explicit visual evidence-commonsense conflicts in three dimensions: counting anomalies, relational anomalies, and attribute anomalies. The evaluation includes binary and multiple-choice question answering tasks, and metrics such as Counterfactual Accuracy, Commonsense Accuracy, Counterfactual Accuracy Drop, Commonsense Collapse Rate, and Relative Prior Dependency. The results indicate that even strong models can be vulnerable to commonsense-driven hallucination under visual evidence-commonsense conflict.

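The research evaluates how vision-language models behave when visual evidence conflicts with commonsense. It introduces CDH-Bench, a benchmark that creates explicit visual evidence-commonsense conflicts covering counting, relational, and attribute anomalies. Models are evaluated on binary and multiple-choice question answering, with metrics including Counterfactual Accuracy, Commonsense Accuracy, Counterfactual Accuracy Drop, Commonsense Collapse Rate, and Relative Prior Dependency. The results indicate that even strong models can succumb to commonsense-driven hallucination under such conflicts.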

ACT Now: Preempting LVLM Hallucinations via Adaptive Context Integration

Authors: Bei Yan, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

First: 2026-04-01T14:49:50+00:00 · Latest: 2026-04-01T14:49:50+00:00

Abstract

Large Vision-Language Models (LVLMs) frequently suffer from severe hallucination issues. Existing mitigation strategies predominantly rely on isolated, single-step states to enhance visual focus or suppress strong linguistic priors. However, these static approaches neglect dynamic context changes across the generation process and struggle to correct inherited information loss. To address this limitation, we propose Adaptive Context inTegration (ACT), a training-free inference intervention method that mitigates hallucination through the adaptive integration of contextual information. Specifically, we first propose visual context exploration, which leverages spatio-temporal profiling to adaptively amplify attention heads responsible for visual exploration. To further facilitate vision-language alignment, we propose semantic context aggregation that marginalizes potential semantic queries to effectively aggregate visual evidence, thereby resolving the information loss caused by the discrete nature of token prediction. Extensive experiments across diverse LVLMs demonstrate that ACT significantly reduces hallucinations and achieves competitive results on both discriminative and generative benchmarks, acting as a robust and highly adaptable solution without compromising fundamental generation capabilities.

Summary

The paper addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) by proposing Adaptive Context inTegration (ACT), an inference intervention method that dynamically integrates contextual information. ACT includes visual context exploration to adaptively enhance visual attention and semantic context aggregation to effectively aggregate visual evidence. Experiments show that ACT reduces hallucinations and achieves competitive results on various benchmarks without compromising generation capabilities.

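The paper proposes Adaptive Context inTegration (ACT), a method that dynamically integrates contextual information to address hallucination in large vision-language models (LVLMs). ACT comprises visual context exploration, which adaptively amplifies the attention heads responsible for visual exploration, and semantic context aggregation, which aggregates visual evidence to resolve the information loss caused by the discrete nature of token prediction. Experiments show that ACT significantly reduces hallucinations and performs well across benchmarks without sacrificing generation capability.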

D4C: Data-Free Quantization for Contrastive Language-Image Pre-training Models

Authors: Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka

First: 2025-11-19T13:08:25+00:00 · Latest: 2026-04-01T14:42:29+00:00

Comments: Accepted to CVPRF 2026

Abstract

Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: 1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; 2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and 3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models.

Summary

This paper addresses the challenge of applying Data-Free Quantization (DFQ) to Contrastive Language-Image Pre-training (CLIP) models, which have shown performance degradation due to insufficient semantic content and low intra-image diversity in synthesized samples. To overcome these issues, the authors propose D4C, a DFQ framework that synthesizes semantically rich and structurally diverse pseudo images through three components: Prompt-Guided Semantic Injection, Structural Contrastive Generation, and Perturbation-Aware Enhancement. Experimental results demonstrate significant performance improvements on various bit-widths and models.

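This paper tackles the application of Data-Free Quantization (DFQ) to Contrastive Language-Image Pre-training (CLIP) models, where performance degrades because synthesized samples lack semantic content and have low intra-image diversity. To address this, the authors propose D4C, a DFQ framework that generates semantically rich and structurally diverse pseudo images through three components: Prompt-Guided Semantic Injection, Structural Contrastive Generation, and Perturbation-Aware Enhancement. Experiments show that D4C significantly improves performance across bit-widths and models.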

OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

Authors: Xiwen Chen, Wenhui Zhu, Gen Li, Xuanzhao Dong, Yujian Xiong, Hao Wang, Peijie Qiu, Qingquan Song, Zhipeng Wang, Shao Tang, Yalin Wang, Abolfazl Razi

First: 2026-02-22T21:02:47+00:00 · Latest: 2026-04-01T14:09:30+00:00

Comments: Accepted by CVPR2026

Abstract

Multi-modal large language models (MLLMs) achieve strong visual-language reasoning but suffer from high inference cost due to redundant visual tokens. Recent work explores visual token pruning to accelerate inference, while existing pruning methods overlook the underlying distributional structure of visual representations. We propose OTPrune, a training-free framework that formulates pruning as distribution alignment via optimal transport (OT). By minimizing the 2-Wasserstein distance between the full and pruned token distributions, OTPrune preserves both local diversity and global representativeness while reducing inference cost. Moreover, we derive a tractable submodular objective that enables efficient optimization, and theoretically prove its monotonicity and submodularity, providing a principled foundation for stable and efficient pruning. We further provide a comprehensive analysis that explains how distributional alignment contributes to stable and semantically faithful pruning. Comprehensive experiments on wider benchmarks demonstrate that OTPrune achieves superior performance-efficiency tradeoffs compared to state-of-the-art methods. The code is available at https://github.com/xiwenc1/OTPrune.
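To make the notion of distribution alignment concrete, the sketch below measures how far a pruned token set drifts from the full set using a squared 2-Wasserstein distance. It relies on a strong simplifying assumption that is not from the paper: each token set is summarized by a diagonal Gaussian, for which the distance has the closed form of squared mean difference plus squared standard-deviation difference. The function name `w2_diag_gaussian` and the two toy pruning rules are hypothetical, and the paper's submodular objective is not reproduced here.

```python
import numpy as np

def w2_diag_gaussian(x, y):
    """Squared 2-Wasserstein distance between diagonal-Gaussian fits of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_term = np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2)
    std_term = np.sum((x.std(axis=0) - y.std(axis=0)) ** 2)
    return float(mean_term + std_term)

# Toy "visual tokens": 256 tokens with 8 features each.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 8))

# Two ways to keep 64 tokens: evenly spaced, or only those with the
# smallest first feature (a deliberately unrepresentative subset).
spread = tokens[::4]
clustered = tokens[np.argsort(tokens[:, 0])[:64]]

d_spread = w2_diag_gaussian(tokens, spread)
d_clustered = w2_diag_gaussian(tokens, clustered)
```

The evenly spaced subset stays much closer to the full distribution than the clustered one, which is the kind of difference a distribution-aligned pruning criterion is meant to reward.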

中文标题/摘要

标题：OTPrune: 通过最优传输实现分布对齐的视觉标记剪枝

多模态大型语言模型（MLLMs）在视觉语言推理方面表现出色，但由于冗余视觉标记导致推理成本高昂。近期工作探索了视觉标记剪枝以加速推理，但现有剪枝方法忽略了视觉表示的潜在分布结构。我们提出了OTPrune，这是一种无需训练的框架，将剪枝形式化为通过最优传输（OT）实现分布对齐。通过最小化完整标记分布和剪枝标记分布之间的2-Wasserstein距离，OTPrune在减少推理成本的同时保持局部多样性和全局代表性。此外，我们推导出一个可计算的次模目标函数，使其能够高效优化，并从理论上证明其单调性和次模性，为稳定和高效的剪枝提供了理论基础。我们还进行了全面分析，解释了分布对齐如何促进稳定且语义忠实的剪枝。在更广泛的基准测试上的全面实验表明，OTPrune在性能效率权衡方面优于现有方法。代码可在 https://github.com/xiwenc1/OTPrune 获取。

Summary / 总结

OTPrune is a training-free framework that prunes visual tokens in multi-modal large language models to reduce inference cost while preserving representativeness and diversity. It formulates pruning as distribution alignment using optimal transport, minimizing the 2-Wasserstein distance between full and pruned token distributions. Experiments show OTPrune outperforms existing methods in terms of performance-efficiency tradeoff on various benchmarks.

OTPrune 是一个无需训练的框架，通过最优传输对多模态大型语言模型中的视觉标记分布进行对齐，从而减少推理成本同时保持代表性和多样性。实验表明，OTPrune 在性能和效率之间提供了优于现有方法的权衡。该框架基于一个次模目标函数，确保了稳定和高效的剪枝。代码可在 GitHub 上获取。
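
For reference, the two notions the OTPrune abstract relies on can be written out in their standard textbook form. This is only a sketch of the definitions: the symbols $\mu$ and $\nu$ (full and pruned token distributions), $V$ (the set of all visual tokens) and $f$ (the surrogate objective) are notation assumed here for illustration, and the abstract does not say how the paper constructs these distributions or what exact form $f$ takes.

```latex
% Squared 2-Wasserstein distance between distributions \mu and \nu,
% where \Pi(\mu, \nu) is the set of couplings with marginals \mu and \nu.
W_2^2(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int \lVert x - y \rVert_2^2 \, \mathrm{d}\gamma(x, y)

% A set function f on subsets of V is monotone and submodular if,
% for all A \subseteq B \subseteq V and x \in V \setminus B:
f(A) \le f(B), \qquad
f(A \cup \{x\}) - f(A) \ge f(B \cup \{x\}) - f(B)
```

Monotone submodular objectives are the setting in which greedy selection under a cardinality constraint carries the classical $(1 - 1/e)$ approximation guarantee, which is presumably why the abstract stresses these two properties.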

Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis

Authors: Dylan B. Lewis, Jens Gregor, Hector Santos-Villalobos

First: 2026-04-01T14:01:41+00:00 · Latest: 2026-04-01T14:01:41+00:00

Comments: 9 pages, 5 figures, 6 tables

Abstract

Modern vision pipelines increasingly rely on pretrained image encoders whose representations are reused across tasks and models, yet these representations are often overcomplete and model-specific. We propose a simple, training-free method to improve the efficiency of image representations via a post-hoc canonical correlation analysis (CCA) operator. By leveraging the shared structure between representations produced by two pre-trained image encoders, our method finds linear projections that serve as a principled form of representation selection and dimensionality reduction, retaining shared semantic content while discarding redundant dimensions. Unlike standard dimensionality reduction techniques such as PCA, which operate on a single embedding space, our approach leverages cross-model agreement to guide representation distillation and refinement. The technique allows representations to be reduced by more than 75% in dimensionality with improved downstream performance, or enhanced at fixed dimensionality via post-hoc representation transfer from larger or fine-tuned models. Empirical results on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over both baseline and PCA-projected representations, with accuracy gains of up to 12.6%.

中文标题/摘要

标题：基于交叉模型一致性的典范相关分析表示选择

现代视觉流水线越来越多地依赖于预训练的图像编码器，其表示在任务和模型之间重用，但这些表示往往是过度完备且模型特定的。我们提出了一种简单的、无需训练的方法，通过后处理典范相关分析（CCA）算子来提高图像表示的效率。通过利用两个预训练图像编码器生成的表示之间的共享结构，我们的方法找到了线性投影，作为一种原理性的表示选择和降维形式，保留了共享的语义内容，同时消除了冗余维度。与主成分分析（PCA）等标准降维技术不同，我们的方法利用跨模型一致性来指导表示的提炼和优化。该技术允许表示在维度减少超过75%的同时提高下游性能，或在固定维度下通过从较大或微调模型的后处理表示转移来增强。在ImageNet-1k、CIFAR-100、MNIST以及其他基准上的实验证明，与基线和PCA投影表示相比，该方法具有一致的改进，准确率提升高达12.6%。

Summary / 总结

The paper proposes a training-free method using canonical correlation analysis (CCA) to improve the efficiency of image representations by selecting and reducing dimensions in pretrained encoders. The method leverages cross-model agreement between two pretrained encoders to distill and refine representations, reducing dimensionality by more than 75% while improving downstream performance. Experiments on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over baseline and PCA-projected representations, with accuracy gains of up to 12.6%.

论文提出了一种无需训练的方法，使用典范相关分析（CCA）对预训练编码器的图像表示进行选择和降维，保留共享的语义内容，同时丢弃冗余维度，从而提高表示效率。该方法利用两个预训练模型之间表示的共享结构，可在维度减少超过75%的同时提高下游性能，并在各种基准上实现高达12.6%的准确率提升。
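
As background for the entry above, the textbook CCA objective is sketched below. The notation is assumed for illustration only: $x$ and $y$ stand for the embeddings of the same image under the two encoders, $\Sigma_{xx}$ and $\Sigma_{yy}$ for their covariances, and $\Sigma_{xy}$ for the cross-covariance. The abstract does not state which CCA variant or regularization the authors use.

```latex
% First canonical correlation: the most correlated pair of linear projections.
\rho_1 = \max_{a, b} \;
  \frac{a^\top \Sigma_{xy} \, b}
       {\sqrt{a^\top \Sigma_{xx} \, a} \; \sqrt{b^\top \Sigma_{yy} \, b}}

% Later pairs (a_i, b_i) maximize the same ratio subject to being
% uncorrelated with the earlier ones:
a_i^\top \Sigma_{xx} \, a_j = 0, \qquad b_i^\top \Sigma_{yy} \, b_j = 0 \qquad (j < i)
```

Keeping only the leading $k$ projection pairs is what makes CCA usable as dimensionality reduction. Unlike PCA, the retained directions are ranked by agreement between the two embedding spaces rather than by variance within one.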

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao

First: 2026-04-01T13:55:28+00:00 · Latest: 2026-04-01T13:55:28+00:00

Abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

中文标题/摘要

标题：视觉语言模型在跨图示装配指令对齐中的基准测试与机制分析

2D 装配图通常抽象且难以理解，因此需要智能助手来监控进度、检测错误并提供逐步指导。在混合现实环境中，此类系统必须从摄像头画面中识别已完成和正在进行的步骤，并与图示指令对齐。视觉语言模型（VLMs）在这一任务中显示出潜力，但面临图示差距，因为装配图和视频帧共享的视觉特征很少。为了系统地评估这一差距，我们构建了IKEA-Bench，这是一个包含1,623个问题、6种任务类型和29种宜家家具产品的基准测试集，并评估了19种VLMs（2B-38B）在三种对齐策略下的表现。我们的主要发现：（1）装配指令的理解可以通过文本恢复，但同时会降低图示到视频的对齐；（2）架构家族比参数数量更能预测对齐准确性；（3）视频理解仍然是一个难以克服的瓶颈，不受策略影响。进一步的三级机制分析表明，图示和视频占据不同的ViT子空间，添加文本使模型从视觉推理转向文本驱动推理。这些结果指出了提高跨图示鲁棒性的主要目标是视觉编码。项目页面：https://ryenhails.github.io/IKEA-Bench/

Summary / 总结

This study aims to evaluate the performance of Vision-Language Models (VLMs) in aligning assembly instructions with video feeds for mixed reality applications. The researchers constructed IKEA-Bench, a benchmark with 1,623 questions across 6 task types on 29 IKEA furniture products, and tested 19 VLMs (2B-38B) under three alignment strategies. Key findings include that text helps understand instructions but degrades diagram-to-video alignment, architecture family is more predictive of alignment accuracy than parameter count, and video understanding remains a bottleneck. The study also reveals that diagrams and video occupy different subspaces in Vision Transformers and that adding text shifts models towards text-driven reasoning, highlighting the need for improved visual encoding capabilities.

研究旨在评估视觉语言模型（VLMs）在将装配指令与视频流对齐方面的性能。研究构建了包含29种宜家家具产品的1,623个问题的IKEA-Bench基准，并评估了19种VLM在三种对齐策略下的表现。主要发现包括：文本有助于理解指令但会降低图示到视频的对齐效果，架构家族比参数量更能预测对齐准确性，视频理解仍然是瓶颈且不受策略影响。进一步的分析表明，图示和视频在不同的ViT子空间中编码，添加文本使模型转向以文本驱动的推理，这表明需要改进视觉编码以提高跨图示的鲁棒性。

ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Authors: Zimo Cao, Yuchen Deng, Haibin Ling, Bingyao Huang

First: 2026-04-01T13:54:05+00:00 · Latest: 2026-04-01T13:54:05+00:00

Comments: 16 pages, 7 figures

Abstract

Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first it visually isolates virtual and physical layers via automated segmentation; then it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.

中文标题/摘要

标题：ProCap：投影感知的描述生成技术在空间增强现实中的应用

空间增强现实（SAR）直接使用投影仪将数字内容投射到物理场景上，创造出无需头戴式显示器的沉浸式体验。然而，为了使SAR支持智能交互，例如对场景进行推理或回答用户查询，它必须在语义上区分物理场景和投影内容。标准视觉语言模型（VLMs）难以处理这种虚拟-物理的模糊性，经常混淆这两种上下文。为了解决这一问题，我们提出了ProCap，这是一种新颖的框架，明确地将投影内容与物理场景分离。ProCap采用两阶段流水线：首先通过自动分割视觉隔离虚拟和物理层；然后使用区域感知检索以避免由于投影失真引起的模糊语义上下文。为此，我们提出了RGBP（RGB + 投影），这是第一个大规模的SAR语义基准数据集，包含65个多样化的物理场景和超过180,000个投影，具有密集且分离的注释。最后，我们建立了一种双描述评估协议，使用任务特定的标记独立评估物理场景和投影描述。我们的实验表明，ProCap为未来的SAR研究提供了稳健的语义基础。源代码、预训练模型和RGBP数据集可在项目页面上获得：https://ZimoCao.github.io/ProCap/。

Summary / 总结

ProCap is a novel framework designed to address the virtual-physical ambiguity in spatial augmented reality (SAR) by decoupling projected content from physical scenes. It uses a two-stage pipeline for automated segmentation and region-aware retrieval to avoid semantic confusion. The framework is evaluated on a new dataset, RGBP, which includes detailed annotations for 65 physical scenes and over 180,000 projections. Experiments demonstrate that ProCap provides a robust semantic foundation for SAR research.

ProCap 是一个新颖的框架，旨在通过分离投影内容和物理场景来解决空间增强现实（SAR）中的虚实混淆问题。它使用两阶段流水线进行自动分割和区域感知检索，以避免语义混淆。该框架在包含65种不同物理场景和超过180,000个投影的新数据集RGBP上进行评估，这些投影具有详细的注释。实验表明，ProCap 为未来的SAR研究提供了稳健的语义基础。

JAMMEval: A Refined Collection of Japanese Benchmarks for Reliable VLM Evaluation

Authors: Issa Sugiura, Koki Maeda, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Naoaki Okazaki

First: 2026-04-01T13:53:06+00:00 · Latest: 2026-04-01T13:53:06+00:00

Comments: 16 pages, 11 figures

Abstract

Reliable evaluation is essential for the development of vision-language models (VLMs). However, Japanese VQA benchmarks have undergone far less iterative refinement than their English counterparts. As a result, many existing benchmarks contain issues such as ambiguous questions, incorrect answers, and instances that can be solved without visual grounding, undermining evaluation reliability and leading to misleading conclusions in model comparisons. To address these limitations, we introduce JAMMEval, a refined collection of Japanese benchmarks for reliable VLM evaluation. It is constructed by systematically refining seven existing Japanese benchmark datasets through two rounds of human annotation, improving both data quality and evaluation reliability. In our experiments, we evaluate open-weight and proprietary VLMs on JAMMEval and analyze the capabilities of recent models on Japanese VQA. We further demonstrate the effectiveness of our refinement by showing that the resulting benchmarks yield evaluation scores that better reflect model capability, exhibit lower run-to-run variance, and improve the ability to distinguish between models of different capability levels. We release our dataset and code to advance reliable evaluation of VLMs.

中文标题/摘要

标题：JAMMEval：一种用于可靠视觉语言模型评估的精炼日语基准集合

可靠的评估对于视觉语言模型（VLMs）的发展至关重要。然而，现有的日语VQA基准相比于英语版本经历了较少的迭代精炼。因此，许多现有的基准存在诸如含糊的问题、错误的答案以及无需视觉接地即可解决的实例等问题，这损害了评估的可靠性，并导致模型比较中得出误导性的结论。为了解决这些局限性，我们引入了JAMMEval，这是一种用于可靠VLM评估的日语基准集合。它通过两轮的人工注释系统地精炼了七个现有的日语基准数据集，从而提高了数据质量和评估的可靠性。在我们的实验中，我们使用JAMMEval评估了开放权重和专有VLM，并分析了最近模型在日语VQA上的能力。我们进一步通过展示精炼后的基准能够更好地反映模型能力、表现出较低的运行间方差以及提高区分不同能力水平模型的能力来证明其有效性。我们发布了我们的数据集和代码，以促进VLM的可靠评估。

Summary / 总结

The research aims to improve the reliability of evaluating vision-language models (VLMs) by addressing issues in Japanese VQA benchmarks. JAMMEval, a refined collection of Japanese benchmarks, is created through two rounds of human annotation, enhancing data quality and evaluation reliability. Experiments show that JAMMEval provides more accurate model evaluations, reduces variance, and better distinguishes model capabilities compared to existing benchmarks.

JAMMEval 是一个改进后的日语基准集合，旨在提高视觉语言模型评估的可靠性。通过两轮人工注释系统地改进七个现有数据集，JAMMEval 解决了诸如模糊问题和错误答案等问题。实验表明，JAMMEval 提供了更准确的评估分数、更低的运行间方差，并且能够更好地区分不同能力水平的模型。

SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

Authors: Rui Bao, Zheng Gao, Xiaoyu Li, Xiaoyan Feng, Yang Song, Jiaojiao Jiang

First: 2026-03-31T13:39:37+00:00 · Latest: 2026-04-01T13:47:16+00:00

Abstract

Diffusion-based watermarking methods embed verifiable marks by manipulating the initial noise or the reverse diffusion trajectory. However, these methods share a critical assumption: verification can succeed only if the diffusion trajectory can be faithfully reconstructed. This reliance on trajectory recovery constitutes a fundamental and exploitable vulnerability. We propose Stochastic Hidden-Trajectory Deflection (SHIFT), a training-free attack that exploits this common weakness across diverse watermarking paradigms. SHIFT leverages stochastic diffusion resampling to deflect the generative trajectory in latent space, making the reconstructed image statistically decoupled from the original watermark-embedded trajectory while preserving strong visual quality and semantic consistency. Extensive experiments on nine representative watermarking methods spanning noise-space, frequency-domain, and optimization-based paradigms show that SHIFT achieves 95%-100% attack success rates with nearly no loss in semantic quality, without requiring any watermark-specific knowledge or model retraining.

中文标题/摘要

标题：SHIFT：随机隐藏轨迹偏转以去除基于扩散的水印

基于扩散的水印方法通过操纵初始噪声或逆向扩散轨迹来嵌入可验证的标记。然而，这些方法共享一个关键假设：只有在能够忠实重建扩散轨迹的情况下，验证才能成功。对轨迹恢复的依赖构成了一个基本且可利用的漏洞。我们提出了随机隐藏轨迹偏转（Stochastic Hidden-Trajectory Deflection，SHIFT），一种无需训练的攻击方法，利用了不同水印范式中的这一共同弱点。SHIFT 利用随机扩散重采样在潜在空间中偏转生成轨迹，使重建图像在统计上与原始水印嵌入轨迹解耦，同时保持较高的视觉质量和语义一致性。在涵盖噪声空间、频域和基于优化范式的九种代表性水印方法上的广泛实验表明，SHIFT 在几乎不损失语义质量的情况下实现了95%至100%的攻击成功率，无需任何水印特定知识或模型重训练。

Summary / 总结

The paper addresses the vulnerability of diffusion-based watermarking methods that rely on the accurate reconstruction of the diffusion trajectory for verification. It introduces SHIFT, a training-free attack that uses stochastic diffusion resampling to deflect the generative trajectory, making the reconstructed image statistically decoupled from the original watermark-embedded trajectory while maintaining high visual and semantic quality. Experiments show that SHIFT attacks nine representative watermarking methods with 95%-100% success rates and nearly no loss in semantic quality.

研究针对依赖于准确重建扩散轨迹进行验证的水印方法的漏洞。提出了一种无需训练的攻击方法SHIFT，利用随机扩散重采样来偏转生成轨迹，使重建图像与原始水印嵌入轨迹统计上脱钩，同时保持高视觉和语义质量。实验表明，SHIFT在九种不同类型的水印方法上可以实现95%到100%的攻击成功率，且对语义质量影响极小。

PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding

Authors: Nan Wang, Zhiwei Jin, Chen Chen, Haonan Lu

First: 2026-04-01T13:33:27+00:00 · Latest: 2026-04-01T13:33:27+00:00

Abstract

Document understanding and GUI interaction are among the highest-value applications of Vision-Language Models (VLMs), yet they impose exceptionally heavy computational burden: fine-grained text and small UI elements demand high-resolution inputs that produce tens of thousands of visual tokens. We observe that this cost is largely wasteful: across document and GUI benchmarks, only 22-71% of image patches are pixel-unique, the rest being exact duplicates of another patch in the same image. We propose PixelPrune, which exploits this pixel-level redundancy through predictive-coding-based compression, pruning redundant patches before the Vision Transformer (ViT) encoder. Because it operates in pixel space prior to any neural computation, PixelPrune accelerates both the ViT encoder and the downstream LLM, covering the full inference pipeline. The method is training-free, requires no learnable parameters, and supports pixel-lossless compression ($\tau = 0$) as well as controlled lossy compression ($\tau > 0$). Experiments across three model scales and document and GUI benchmarks show that PixelPrune maintains competitive task accuracy while delivering up to 4.2$\times$ inference speedup and 1.9$\times$ training acceleration. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.

中文标题/摘要

标题：PixelPrune：基于预测编码的像素级自适应视觉标记缩减

文档理解和GUI交互是视觉语言模型（VLMs）最具价值的应用之一，但它们带来了极其沉重的计算负担：精细的文字和小型UI元素需要高分辨率输入，产生数万个视觉标记。我们观察到，这种成本是浪费的——在文档和GUI基准测试中，只有22%-71%的图像块是像素唯一的，其余的则是同一图像中另一个块的完全重复。我们提出了PixelPrune，它通过基于预测编码的压缩利用了这种像素级冗余，在视觉变换器（ViT）编码器之前剪枝冗余块。由于它在任何神经计算之前操作在像素空间中，PixelPrune加速了ViT编码器和下游的LLM，覆盖了整个推理管道。该方法无需训练，不需要可学习的参数，并支持无损压缩（τ=0）以及可控的有损压缩（τ>0）。在三个模型规模和文档及GUI基准测试中的实验表明，PixelPrune在保持竞争力的同时，提供了高达4.2倍的推理加速和1.9倍的训练加速。代码可在https://github.com/OPPO-Mente-Lab/PixelPrune/ 获取。

Summary / 总结

PixelPrune is designed to reduce the computational burden of Vision-Language Models (VLMs) by exploiting pixel-level redundancy through predictive-coding-based compression. It prunes redundant patches before the Vision Transformer (ViT) encoder, accelerating both the encoder and downstream Language Models. Experiments show that PixelPrune maintains competitive task accuracy while providing up to 4.2 times faster inference and 1.9 times faster training. The method is training-free and supports both lossless and lossy compression. Code is available at https://github.com/OPPO-Mente-Lab/PixelPrune.

PixelPrune 是一种方法，通过预测编码利用像素级别的冗余来减少视觉语言模型（VLMs）的计算负担。它在视觉变换器（ViT）编码器之前修剪冗余的图像块，加速编码器和下游语言模型。实验表明，PixelPrune 保持了竞争力的任务准确性，同时提供高达 4.2 倍的推理速度和 1.9 倍的训练加速。该方法无需训练，支持无损和有损压缩。代码可在 https://github.com/OPPO-Mente-Lab/PixelPrune 获取。
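
The redundancy observation in the PixelPrune abstract (many patches are exact pixel duplicates of another patch in the same image) can be illustrated with a minimal sketch. It only counts and keeps the first occurrence of each distinct patch, which corresponds loosely to the lossless case; it is not the paper's predictive-coding scheme, and the function names and the toy image are hypothetical.

```python
def split_patches(image, patch):
    """Yield (row, col, pixels) for non-overlapping patch x patch tiles."""
    height, width = len(image), len(image[0])
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            pixels = tuple(
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            )
            yield top // patch, left // patch, pixels


def unique_patch_positions(image, patch):
    """Return grid positions of first-seen patches and the total patch count."""
    seen = set()
    kept = []
    total = 0
    for row, col, pixels in split_patches(image, patch):
        total += 1
        if pixels not in seen:  # exact pixel match only, no tolerance
            seen.add(pixels)
            kept.append((row, col))
    return kept, total


# Toy 4x8 grayscale "document": blank background (0) with one dark pixel (9).
toy_image = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 9, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
kept, total = unique_patch_positions(toy_image, patch=2)
unique_fraction = len(kept) / total
```

On this toy input only two of the eight 2x2 patches are pixel-unique, so a downstream encoder would see a quarter of the original patch count. Real document and GUI screenshots are less extreme; the abstract reports 22-71% pixel-unique patches.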

A 4D Representation for Training-Free Agentic Reasoning from Monocular Laparoscopic Video

Authors: Maximilian Fehrentz, Nicolas Stellwag, Robert Wiebe, Nicole Thorisch, Fabian Grob, Patrick Remerscheid, Ken-Joel Simmoteit, Benjamin D. Killeen, Christian Heiliger, Nassir Navab

First: 2026-04-01T13:14:52+00:00 · Latest: 2026-04-01T13:14:52+00:00

Abstract

Spatiotemporal reasoning is a fundamental capability for artificial intelligence (AI) in soft tissue surgery, paving the way for intelligent assistive systems and autonomous robotics. While 2D vision-language models show increasing promise at understanding surgical video, the spatial complexity of surgical scenes suggests that reasoning systems may benefit from explicit 4D representations. Here, we propose a framework for equipping surgical agents with spatiotemporal tools based on an explicit 4D representation, enabling AI systems to ground their natural language reasoning in both time and 3D space. Leveraging models for point tracking, depth, and segmentation, we develop a coherent 4D model with spatiotemporally consistent tool and tissue semantics. A Multimodal Large Language Model (MLLM) then acts as an agent on tools derived from the explicit 4D representation (e.g., trajectories) without any fine-tuning. We evaluate our method on a new dataset of 134 clinically relevant questions and find that the combination of a general purpose reasoning backbone and our 4D representation significantly improves spatiotemporal understanding and allows for 4D grounding. We demonstrate that spatiotemporal intelligence can be "assembled" from 2D MLLMs and 3D computer vision models without additional training. Code, data, and examples are available at https://tum-ai.github.io/surg4d/

中文标题/摘要

标题：面向单目腹腔镜视频免训练智能体推理的4D表示

时空推理是人工智能（AI）在软组织手术中的基本能力，为智能辅助系统和自主机器人铺平了道路。虽然2D视觉-语言模型在理解手术视频方面显示出越来越大的潜力，但手术场景的空间复杂性表明，推理系统可能从显式的4D表示中受益。在此，我们提出了一种基于显式4D表示的框架，使手术代理能够获得时空工具，使AI系统能够将自然语言推理同时基于时间和三维空间。利用点跟踪、深度和分割模型，我们开发了一个时空一致的4D模型，具有时空一致的工具和组织语义。然后，多模态大型语言模型（MLLM）作为代理作用于从显式4D表示派生的工具（例如轨迹），无需任何微调。我们在包含134个临床相关问题的新数据集上评估了我们的方法，发现通用推理骨干和我们的4D表示的结合显著提高了时空理解能力，并允许4D定位。我们证明时空智能可以从2D MLLMs和3D计算机视觉模型“组装”而来，而无需额外训练。代码、数据和示例可在https://tum-ai.github.io/surg4d/ 获取。

Summary / 总结

This paper proposes a framework for training-free agentic reasoning in soft tissue surgery using an explicit 4D representation. By leveraging models for point tracking, depth, and segmentation, the authors develop a coherent 4D model that enables a Multimodal Large Language Model to reason about surgical tools and tissues in both time and 3D space without fine-tuning. Evaluation on 134 clinically relevant questions shows that this method significantly improves spatiotemporal understanding and allows for 4D grounding of natural language reasoning in surgical contexts.

本文提出了一种框架，通过点跟踪、深度和分割模型生成显式的4D表示，为手术代理提供时空推理能力。该框架利用多模态大型语言模型（MLLM）在无需微调的情况下对工具和组织进行推理。在包含134个临床相关问题的新数据集上进行的评估表明，通用推理骨干与4D表示的结合显著提高了时空理解能力，并实现了4D定位。该方法证明了时空智能可以通过2D MLLMs和3D计算机视觉模型组装而成，无需额外训练。

LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

Authors: Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

First: 2026-04-01T12:38:27+00:00 · Latest: 2026-04-01T12:38:27+00:00

Abstract

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers $\sim$10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.

中文标题/摘要

标题：LinguDistill：通过选择性跨模态蒸馏恢复视觉语言模型的语言能力

将预训练的语言模型（LMs）适应为视觉语言模型（VLMs）可能会由于多模态适应过程中引入的表示转移和跨模态干扰而导致其固有的语言能力下降。即使使用标准目标进行针对性的任务特定微调，这种损失也难以恢复。先前的恢复方法通常引入额外的模块作为中间对齐层，以保持或隔离模态特定的子空间，这增加了架构的复杂性，在推理时增加了参数量，并限制了模型和设置的灵活性。我们提出了一种无需适配器的蒸馏方法LinguDistill，通过利用原始冻结的LM作为教师来恢复语言能力。我们通过引入层级KV缓存共享来克服关键挑战，使教师能够监督学生的多模态表示，而不修改任何模型的架构。然后，我们选择性地在语言密集型数据上蒸馏教师的强语言信号，以恢复语言能力，同时保持学生在多模态任务中的视觉定位。结果，LinguDistill在语言和知识基准上恢复了约10%的性能损失，同时在视觉密集型任务上保持了相当的性能。我们的研究结果表明，语言能力可以在不引入额外模块的情况下恢复，提供了一种高效且实用的解决多模态模型模态特定退化的方案。

Summary / 总结

LinguDistill is a method that recovers linguistic ability in vision-language models by utilizing the original frozen language model as a teacher. It introduces layer-wise KV-cache sharing to enable vision-conditioned teacher supervision without altering the architecture. By selectively distilling the teacher's strong linguistic signal on language-intensive data, LinguDistill recovers about 10% of the performance lost on language and knowledge benchmarks while maintaining comparable performance on vision-heavy tasks.

LinguDistill 是一种方法，通过使用原始冻结的语言模型作为教师来恢复视觉语言模型的语言能力。它通过引入层级的 KV 缓存共享来实现视觉条件下的教师监督，而不修改任何模型的架构。通过在语言密集型数据上选择性地蒸馏教师的强语言信号，LinguDistill 恢复了约 10% 的在语言和知识基准上的性能损失，同时在视觉密集型任务上保持了相当的性能。

Continual Vision-Language Learning for Remote Sensing: Benchmarking and Analysis

Authors: Xingxing Weng, Ruifeng Ni, Chao Pang, XiangYu Hao, Yishan Wang, Xiaokang Zhang, Wei Xu, Gui-Song Xia

First: 2026-04-01T12:27:31+00:00 · Latest: 2026-04-01T12:27:31+00:00

Comments: 23 pages, 7 figures, 9 tables

Abstract

Current remote sensing vision-language models (RS VLMs) demonstrate impressive performance in image interpretation but rely on static training data, limiting their ability to accommodate continuously emerging sensing modalities and downstream tasks. This exposes a fundamental challenge: enabling RS VLMs to continually adapt without catastrophic forgetting. Despite its practical importance, the continual learning capability of RS VLMs remains underexplored, and no dedicated benchmark currently exists. In this work, we present CLeaRS, a comprehensive benchmark for continual vision-language learning in remote sensing. CLeaRS comprises 10 curated subsets with over 207k image-text pairs, spanning diverse interpretation tasks, sensing modalities, and application scenarios. We further define three evaluation protocols: long-horizon, modality-incremental, and task-incremental settings, to systematically assess continual adaptation. Extensive benchmarking of diverse vision-language models reveals catastrophic forgetting across all settings. Moreover, representative continual learning methods, when adapted to RS VLMs, exhibit limited effectiveness in handling task, instruction, and modality transitions. Our findings underscore the need for developing continual learning methods tailored to RS VLMs.

中文标题/摘要

标题：遥感持续视觉语言学习：基准测试与分析

当前的遥感视觉语言模型（RS VLMs）在图像解释方面表现出色，但依赖于静态训练数据，限制了它们适应不断出现的传感模态和下游任务的能力。这暴露出一个根本性的挑战：使RS VLMs能够持续适应而不发生灾难性遗忘。尽管具有实际重要性，RS VLMs的持续学习能力仍未得到充分探索，目前也没有专门的基准测试。在本文中，我们提出了CLeaRS，这是一个全面的遥感持续视觉语言学习基准测试。CLeaRS 包含10个精心策划的子集，涵盖超过207,000个图像-文本对，涉及多样化的解释任务、传感模态和应用场景。我们进一步定义了三种评估协议：长时程、模态增量和任务增量设置，以系统地评估持续适应性。对多种视觉语言模型的广泛基准测试揭示了所有设置中的灾难性遗忘现象。此外，当代表性的持续学习方法应用于RS VLMs时，在处理任务、指令和模态转换方面表现出有限的效果。我们的研究结果强调了开发针对RS VLMs的持续学习方法的必要性。

Summary / 总结

This work addresses the challenge of enabling remote sensing vision-language models to continually adapt without forgetting previously learned information. It introduces CLeaRS, a benchmark for continual vision-language learning in remote sensing, comprising 10 subsets with over 207k image-text pairs. The study evaluates various vision-language models and continual learning methods across different settings and finds that these models suffer from catastrophic forgetting. The results highlight the necessity for specialized continual learning methods for remote sensing vision-language models.

该研究旨在解决使遥感视觉语言模型能够持续适应而不遗忘之前学习的信息的问题。它引入了CLeaRS，一个用于遥感领域持续视觉语言学习的基准，包含10个子集，共有超过207k的图像-文本对。研究评估了多种视觉语言模型和持续学习方法在不同设置下的表现，并发现这些模型存在灾难性遗忘的问题。研究结果强调了为遥感视觉语言模型开发专门的持续学习方法的必要性。

An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

Authors: Lennart Maack, Alexander Schlaefer

First: 2026-04-01T11:45:28+00:00 · Latest: 2026-04-01T11:45:28+00:00

Abstract

Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets lack in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotations or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7515 video clips densely extended with 150k fine-grained spatial-temporal question-answer samples. Our comprehensive evaluation shows that while state-of-the-art generalist VLMs struggle in zero-shot settings, their spatial-temporal capabilities can be improved through in-context learning. A fine-tuned VLM on the SurgSTU training dataset achieves highest performance among all spatial-temporal tasks, validating the dataset's efficacy to improve spatial-temporal understanding of VLMs in surgical videos. Code will be made publicly available.

中文标题/摘要

标题：一种丰富外科视频数据集的方法以实现精细粒度的空间-时间视觉-语言模型理解

外科视频理解是推进计算机辅助手术的关键前提。尽管视觉-语言模型（VLMs）最近被应用于外科领域，但现有的外科视觉-语言数据集在捕捉和评估复杂的、交织的空间-时间动态方面存在不足。由于手动注释成本高或使用大型语言模型生成时容易出错，创建准确反映外科视频中精细粒度的空间-时间关系的大规模数据集具有挑战性。为解决这一问题，我们引入了SurgSTU-Pipeline，这是一种具有时间和空间连续性过滤功能的确定性生成管道，以可靠地创建用于精细粒度空间-时间多模态理解的外科数据集。将此管道应用于公开可用的外科数据集，我们创建了SurgSTU数据集，包含7515个视频片段，密集扩展了150k个精细粒度的空间-时间问答样本。我们的全面评估表明，尽管最先进的通用VLMs在零样本设置中表现不佳，但通过上下文学习可以提高它们的空间-时间能力。在SurgSTU训练数据集上微调的VLM在所有空间-时间任务中表现最佳，验证了该数据集在提高VLMs在外科视频中的空间-时间理解方面的有效性。代码将公开发布。

Summary / 总结

The research aims to enhance the understanding of surgical videos through vision-language models by addressing the lack of fine-grained spatial-temporal dynamics in existing datasets. The SurgSTU-Pipeline, a deterministic generation pipeline, was developed to create the SurgSTU dataset, which includes 7515 video clips with 150k fine-grained spatial-temporal question-answer samples. The evaluation demonstrates that while state-of-the-art VLMs perform poorly in zero-shot settings, they can improve their spatial-temporal capabilities through in-context learning, and a fine-tuned VLM on the SurgSTU dataset shows the best performance in spatial-temporal tasks, validating the dataset's effectiveness.

本文旨在解决创建能够捕捉精细空间-时间动态的大规模手术视频数据集的挑战，这对于推进计算机辅助手术至关重要。作者提出了SurgSTU-Pipeline，这是一种确保时空连续性的确定性生成管道，以创建SurgSTU数据集。该数据集包含7515个视频片段和150k个精细的空间-时间问答样本。评估结果显示，尽管最先进的视觉-语言模型在零样本设置中表现不佳，但通过上下文学习可以提高它们的空间-时间能力，而基于SurgSTU训练集微调的视觉-语言模型在空间-时间任务中表现最佳，验证了该数据集的有效性。

ActivityNarrated: An Open-Ended Narrative Paradigm for Wearable Human Activity Understanding

Authors: Lala Shakti Swarup Ray, Mengxi Liu, Alcina Pinto, Deepika Gurung, Daniel Geissler, Paul Lukowicz, Bo Zhou

First: 2026-04-01T11:31:44+00:00 · Latest: 2026-04-01T11:31:44+00:00

Abstract

Wearable HAR has improved steadily, but most progress still relies on closed-set classification, which limits real-world use. In practice, human activity is open-ended, unscripted, personalized, and often compositional, unfolding as narratives rather than instances of fixed classes. We argue that addressing this gap does not require simply scaling datasets or models. It requires a fundamental shift in how wearable HAR is formulated, supervised, and evaluated. This work shows how to model open-ended activity narratives by aligning wearable sensor data with natural-language descriptions in an open-vocabulary setting. Our framework has three core components. First, we introduce a naturalistic data collection and annotation pipeline that combines multi-position wearable sensing with free-form, time-aligned narrative descriptions of ongoing behavior, allowing activity semantics to emerge without a predefined vocabulary. Second, we define a retrieval-based evaluation framework that measures semantic alignment between sensor data and language, enabling principled evaluation without fixed classes while also subsuming closed-set classification as a special case. Third, we present a language-conditioned learning architecture that supports sensor-to-text inference over variable-length sensor streams and heterogeneous sensor placements. Experiments show that models trained with fixed-label objectives degrade sharply under real-world variability, while open-vocabulary sensor-language alignment yields robust and semantically grounded representations. Once this alignment is learned, closed-set activity recognition becomes a simple downstream task. Under cross-participant evaluation, our method achieves 65.3% Macro-F1, compared with 31-34% for strong closed-set HAR baselines. These results establish open-ended narrative modeling as a practical and effective foundation for real-world wearable HAR.

中文标题/摘要

标题：ActivityNarrated：一种开放式的穿戴式人类活动理解叙述范式

穿戴式HAR已经稳步提升，但大多数进步仍然依赖于封闭集分类，这限制了其在现实世界中的应用。实际上，人类活动是开放式的、未编排的、个性化的，并且往往是组合性的，以叙述的形式展开而非固定类别的实例。我们认为，解决这一差距并不需要简单地扩大数据集或模型规模。这需要在穿戴式HAR的表述、监督和评估方式上进行根本性的转变。这项工作展示了如何通过将穿戴传感器数据与自然语言描述对齐来建模开放式活动叙述，从而在开放词汇量设置中进行建模。我们的框架有三个核心组件。首先，我们引入了一种自然化的数据收集和标注管道，结合多位置穿戴传感与自由形式的时间对齐叙述描述，允许活动语义在没有预定义词汇的情况下浮现。其次，我们定义了一种基于检索的评估框架，该框架衡量传感器数据与语言之间的语义对齐，从而在没有固定类别的前提下进行有原则的评估，同时还将封闭集分类作为特殊情况包含在内。第三，我们提出了一种基于语言的条件学习架构，支持传感器到文本的推理，适用于可变长度的传感器流和异构传感器布置。实验表明，使用固定标签目标训练的模型在现实世界变异性下急剧退化，而开放词汇量的传感器-语言对齐则产生稳健且语义化的表示。一旦这种对齐被学习，封闭集活动识别就成为一项简单的下游任务。在跨参与者评估中，我们的方法实现了65.3%的宏F1值，而强大的封闭集HAR基线则为31-34%。这些结果确立了开放式叙述建模作为现实世界穿戴式HAR的实用且有效的基础。

Summary / 总结

This work addresses the limitations of closed-set classification in wearable human activity recognition (HAR) by proposing an open-ended narrative paradigm. The method involves a naturalistic data collection pipeline that aligns wearable sensor data with free-form narrative descriptions, and a retrieval-based evaluation framework that measures semantic alignment between sensor data and language. Experiments show that open-vocabulary sensor-language alignment yields robust and semantically grounded representations, outperforming fixed-label models by achieving 65.3% Macro-F1 under cross-participant evaluation compared to 31-34% for strong closed-set HAR baselines.

该研究通过提出一种开放式的叙述范式，解决了穿戴式人体活动识别（HAR）中封闭集分类的局限性。方法包括结合穿戴式传感与自由形式叙述描述的自然数据收集管道，以及一种基于检索的评估框架，用于测量传感器数据与语言之间的语义对齐。实验表明，开放词汇量的传感器-语言对齐能够产生稳健且语义上合理的表示，跨参与者评估下达到65.3%的宏F1分数，而封闭集HAR基线则为31-34%。
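
The retrieval-based evaluation described for ActivityNarrated can be illustrated with a small sketch. It assumes sensor windows and narrations have already been embedded into a shared space; the 2-D vectors, the narration strings, and the helper names (`rank_texts`, `recall_at_k`, `classify`) are hypothetical and not taken from the paper. Closed-set classification appears as the special case where the candidate texts are the class names and only the top-ranked candidate is kept.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_texts(sensor_emb, text_embs):
    """Indices of candidate texts, most similar first."""
    scores = [cosine(sensor_emb, t) for t in text_embs]
    return sorted(range(len(text_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(sensor_embs, text_embs, targets, k):
    """Fraction of sensor windows whose paired text is in the top k."""
    hits = sum(
        target in rank_texts(emb, text_embs)[:k]
        for emb, target in zip(sensor_embs, targets)
    )
    return hits / len(sensor_embs)


def classify(sensor_emb, class_embs, class_names):
    """Closed-set special case: candidates are class names, k = 1."""
    return class_names[rank_texts(sensor_emb, class_embs)[0]]


# Toy 2-D embeddings (hypothetical values, not from the paper).
narrations = ["walking", "cooking", "walking while talking"]
text_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sensor_embs = [[0.9, 0.1], [0.1, 0.8], [0.6, 0.8]]
targets = [0, 1, 2]  # index of the narration paired with each sensor window

print(recall_at_k(sensor_embs, text_embs, targets, k=1))
print(classify(sensor_embs[1], text_embs[:2], narrations[:2]))
```

Because the candidate pool is just a list of texts, the same ranking routine serves both open-vocabulary retrieval and fixed-label classification, which is one way to read the claim that the former subsumes the latter.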

IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

Authors: Dong-Jae Lee, Sunghyun Baek, Junmo Kim

First: 2026-04-01T11:23:16+00:00 · Latest: 2026-04-01T11:23:16+00:00

Abstract

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical approaches while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting an optimal subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and information duplication. To efficiently select the subset with the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance. Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing another perspective on existing pruning approaches.

中文标题/摘要

标题：IWP：作为大型视觉语言模型隐式权重剪枝的标记剪枝

大型视觉语言模型在图像和视频理解任务中表现出色，但其计算成本随着视觉标记数量的增加而迅速增长。现有标记剪枝方法通过经验方法减轻这一问题，但忽略了注意力的内部机制。在本文中，我们提出了一种基于注意力双重形式视角的新型无训练剪枝框架。我们将注意力重新表述为一个隐式线性层，其权重矩阵是单个标记的关键值对生成的秩1外积之和。因此，标记剪枝归结为选择一个最佳子集，这些秩1更新最好地逼近原始双重权重矩阵。将这一视角扩展到LVLM中的标准softmax注意力，我们推导出一个衡量标记信息量及其重复的新指标。为了高效地选择具有该指标的子集，我们引入了渐进分块最大边际相关性。广泛的实验表明，我们的方法在性能和效率之间实现了更好的权衡，同时为现有的剪枝方法提供了另一种视角。

Summary / 总结

This paper addresses the computational cost of large vision-language models, which grows rapidly with the number of visual tokens. It proposes IWP, a training-free token pruning framework based on the dual form perspective of attention. By reformulating attention as an implicit linear layer, the method casts pruning as selecting the subset of rank-1 updates that best approximates the original dual weight matrix, using a metric that captures both a token's information magnitude and its duplication, with Progressive Chunked Maximal Marginal Relevance for efficient selection. Experiments show that IWP achieves a better trade-off between performance and efficiency than existing methods.

本文针对大型视觉语言模型因视觉令牌数量增加而导致的计算成本问题，提出了一种名为IWP的无训练集令牌剪枝框架，该框架基于注意力的双重形式视角。通过将注意力重新表述为隐式线性层，该方法选择最优的rank 1更新子集来近似原始权重矩阵，从而在性能和效率之间实现更好的权衡。关键发现是，IWP为令牌剪枝提供了新的视角，并实现了更好的平衡。
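
The dual-form view in the IWP entry can be checked numerically on a toy case. The sketch below covers only un-normalized linear attention, where the output sum_i v_i (k_i . q) equals W q with W = sum_i v_i k_i^T. The greedy selection is a plain Maximal Marginal Relevance loop with an illustrative score (Frobenius norm as magnitude, cosine between rank-1 updates as duplication). The paper's softmax-aware metric and its Progressive Chunked variant are not reproduced here, and all names and values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_attention(keys, values, query):
    """Token view: sum_i v_i * (k_i . q), with no softmax."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        weight = dot(k, query)
        out = [o + weight * x for o, x in zip(out, v)]
    return out


def dual_weight(keys, values, keep=None):
    """Dual view: W = sum_i v_i k_i^T over the kept tokens."""
    kept = range(len(keys)) if keep is None else keep
    W = [[0.0] * len(keys[0]) for _ in range(len(values[0]))]
    for i in kept:
        for r, v_r in enumerate(values[i]):
            for c, k_c in enumerate(keys[i]):
                W[r][c] += v_r * k_c
    return W


def frob_inner(i, j, keys, values):
    """<v_i k_i^T, v_j k_j^T>_F = (v_i . v_j) * (k_i . k_j)."""
    return dot(values[i], values[j]) * dot(keys[i], keys[j])


def mmr_select(keys, values, budget, lam=0.5):
    """Greedy MMR over rank-1 updates (illustrative scoring, not the paper's)."""
    n = len(keys)
    norms = [math.sqrt(frob_inner(i, i, keys, values)) for i in range(n)]
    top = max(norms) or 1.0
    selected, remaining = [], list(range(n))

    def score(i):
        # Magnitude: Frobenius norm of the update, scaled to [0, 1].
        magnitude = norms[i] / top
        # Duplication: largest |cosine| to an already selected update.
        duplication = max(
            (abs(frob_inner(i, j, keys, values)) / (norms[i] * norms[j])
             for j in selected if norms[i] and norms[j]),
            default=0.0,
        )
        return lam * magnitude - (1.0 - lam) * duplication

    while remaining and len(selected) < budget:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Tokens 0 and 1 are near-duplicates; token 2 carries different information.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [1.9, 0.0], [0.0, 1.0]]
query = [1.0, 2.0]

keep = mmr_select(keys, values, budget=2)
full = [dot(row, query) for row in dual_weight(keys, values)]
pruned = [dot(row, query) for row in dual_weight(keys, values, keep)]
print(keep, full, pruned)
```

In this toy case the selection skips the second of the two near-duplicate tokens even though its magnitude is larger than that of the third token, which is the behaviour a duplication-aware metric is meant to produce.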

Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

Authors: Zehao Jin, Yanan Sui

First: 2026-04-01T11:18:40+00:00 · Latest: 2026-04-01T11:18:40+00:00

Abstract

The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.

中文标题/摘要

标题：随机注意力：受脑连接组启发的随机化路由机制以实现高效线性时间注意力

果蝇全脑连接组包含超过13万个神经元，连接概率仅为0.02%，但其平均最短路径仅为4.4跳。尽管在电路层面高度结构化，该网络的长距离连接广泛分布于脑区，作为随机捷径，实现高效的全局通信。受此观察启发，我们提出随机注意力（SA），这是一种滑动窗口注意力（SWA）的即插即用增强方法，在窗口注意力之前对标记序列应用随机排列，并在之后恢复原始顺序。这将固定局部窗口转换为相同$O(nw)$每层预算内的随机全局窗口。通过深度，独立采样的排列产生指数增长的感受野，在$O(\log_w n)$层内实现全序列覆盖，而SWA为$O(n/w)$。我们在两种场景中验证了SA：从零开始预训练语言模型，其中门控SA + SWA组合获得最佳平均零样本准确率；以及无需训练的推理，针对Qwen3-8B和Qwen3-30B-A3B，SA始终优于SWA，并在相似计算预算下匹配或超越块注意力混合。这些结果表明，受脑连接组启发的随机路由是一种实用的基本方法，可提高高效注意力的表达能力，补充现有的线性和稀疏方法。

Summary / 总结

The research aims to improve the expressivity of efficient attention by drawing inspiration from the fruit-fly connectome, whose sparse long-range connections act as stochastic shortcuts. Stochastic Attention (SA) applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This turns the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget, reaching full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for sliding-window attention (SWA). In pre-training from scratch, a gated SA + SWA combination achieves the best average zero-shot accuracy; in training-free inference on Qwen3-8B and Qwen3-30B-A3B, SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets.

研究旨在通过借鉴大脑连接组的启发，提升神经网络中注意力机制的效率和表达能力。提出了随机注意力（SA）方法，在窗口化注意力之前对token序列应用随机排列，以在相同计算预算下实现全局随机视图。实验表明，SA与滑动窗口注意力结合时，在从零开始预训练语言模型中达到最佳零样本准确率，并在无训练推理中优于滑动窗口注意力，展示了其在改进注意力机制方面的实际应用价值。
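
The permute, attend, restore pattern described for Stochastic Attention can be sketched in a few lines. This is a minimal single-head illustration under several assumptions that are not from the paper: a non-causal symmetric window, plain Python lists for tensors, one fresh permutation per call, and no treatment of positional encodings or of the gated SA + SWA combination. Function names are hypothetical.

```python
import math
import random


def sliding_window_attention(q, k, v, window):
    """Non-causal softmax attention where position i sees |i - j| <= window."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        span = range(max(0, i - window), min(n, i + window + 1))
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in span]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, span)) / total
            for c in range(len(v[0]))
        ])
    return out


def stochastic_attention(q, k, v, window, rng):
    """Shuffle tokens, run windowed attention, then undo the shuffle."""
    n = len(q)
    perm = list(range(n))
    rng.shuffle(perm)  # perm[pos] = original index placed at shuffled position pos
    shuffled = sliding_window_attention(
        [q[p] for p in perm], [k[p] for p in perm], [v[p] for p in perm], window
    )
    out = [None] * n
    for pos, p in enumerate(perm):
        out[p] = shuffled[pos]  # restore the original token order
    return out


rng = random.Random(0)
n, d = 8, 2
q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
k = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
v = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

local = sliding_window_attention(q, k, v, window=1)
mixed = stochastic_attention(q, k, v, window=1, rng=rng)
print(local[0], mixed[0])
```

Each position still attends to at most 2 * window + 1 others, so the per-layer cost stays at $O(nw)$; only the identity of the neighbours changes, since they are drawn from the whole sequence instead of the local neighbourhood.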

How Blind and Low-Vision Individuals Prefer Large Vision-Language Model-Generated Scene Descriptions

Authors: Na Min An, Eunki Kim, Wan Ju Kang, Sangryul Kim, James Thorne, Hyunjung Shim

First: 2025-02-15T10:17:52+00:00 · Latest: 2026-04-01T10:55:52+00:00

Comments: This paper has been superseded by version 2 of arXiv:2510.00766

Abstract

For individuals with blindness or low vision (BLV), navigating complex environments can pose serious risks. Large Vision-Language Models (LVLMs) show promise for generating scene descriptions, but their effectiveness for BLV users remains underexplored. To address this gap, we conducted a user study with eight BLV participants to systematically evaluate preferences for six types of LVLM descriptions. While the descriptions helped to reduce fear and improve actionability, user ratings showed wide variation in sufficiency and conciseness. Furthermore, GPT-4o, despite its strong potential to refine descriptions, was not consistently preferred by participants. We use the insights obtained from the user study to build training data for a new automatic evaluation metric that can capture BLV preferences effectively. Our findings underscore the urgent need for BLV-centered evaluation metrics and human-in-the-loop feedback to advance LVLM description quality for accessibility.

中文标题/摘要

标题：盲人和低视力个体偏好大型视觉-语言模型生成的场景描述

对于盲人或低视力（BLV）个体而言，导航复杂环境会带来严重风险。大型视觉-语言模型（LVLMs）显示出生成场景描述的潜力，但其对BLV用户的有效性尚未得到充分探索。为解决这一问题，我们对八名BLV参与者进行了用户研究，系统评估了六种类型LVLM描述的偏好。虽然这些描述有助于减少恐惧并提高行动性，但用户评分在充分性和简洁性方面表现出广泛差异。此外，尽管GPT-4o具有强大的细化描述潜力，但并未被参与者一致偏好。我们利用用户研究获得的见解构建了训练数据，以构建能够有效捕捉BLV偏好自动评估指标。我们的研究结果强调了BLV中心评估指标和人工在环反馈的迫切需求，以提高LVLM描述质量的可访问性。

Summary / 总结

This study evaluates how blind and low-vision (BLV) individuals rate scene descriptions generated by large vision-language models (LVLMs). Eight BLV participants evaluated six types of LVLM descriptions. The descriptions helped reduce fear and improve actionability, but ratings varied widely in sufficiency and conciseness, and GPT-4o, despite its potential, was not consistently preferred. The authors use insights from the study to build training data for a new automatic evaluation metric that captures BLV preferences. The findings highlight the need for BLV-centered evaluation metrics and human-in-the-loop feedback to improve LVLM description quality for accessibility.

研究旨在评估盲和低视力个体如何偏好由大型视觉-语言模型（LVLM）生成的场景描述，以帮助他们导航复杂环境。测试了六种类型的LVLM描述，虽然它们有助于减少恐惧并提高行动性，但用户对它们的评价在充分性和简洁性方面差异很大。尽管GPT-4o具有潜在优势，但它并未被参与者一致偏好。研究导致开发了一种新的自动评估指标，专门针对BLV的偏好。研究结果强调了BLV中心化评估指标和人工在环反馈的重要性，以提高LVLM描述的质量，使其更易于访问。

Are Large Vision-Language Models Ready to Guide Blind and Low-Vision Individuals?

Authors: Eunki Kim, Na Min An, Wan Ju Kang, Sangryul Kim, James Thorne, Hyunjung Shim

First: 2025-10-01T10:55:33+00:00 · Latest: 2026-04-01T10:51:55+00:00

Comments: 42 pages, 14 figures, 28 tables

Abstract

Large Vision-Language Models (LVLMs) demonstrate a promising direction for assisting individuals with blindness or low vision (BLV). Yet, measuring their true utility in real-world scenarios is challenging because evaluating whether their descriptions are BLV-informative requires a fundamentally different approach from assessing standard scene descriptions. While the "VLM-as-a-metric" or "LVLM-as-a-judge" paradigm has emerged, existing evaluators still fall short of capturing the unique requirements of BLV-centric evaluation, lacking at least one of the following key properties: (1) High correlation with human judgments, (2) Long instruction understanding, (3) Score generation efficiency, and (4) Multi-dimensional assessment. To this end, we propose a unified framework to bridge the gap between automated evaluation and actual BLV needs. First, we conduct an in-depth user study with BLV participants to understand and quantify their navigational preferences, curating VL-GUIDEDATA, a large-scale BLV user-simulated preference dataset containing image-request-response-score pairs. We then leverage the dataset to develop an accessibility-aware evaluator, VL-GUIDE-S, which outperforms existing (L)VLM judges in both human alignment and inference efficiency. Notably, its effectiveness extends beyond a single domain, demonstrating strong performance across multiple fine-grained, BLV-critical dimensions. We hope our work lays a foundation for automatic AI judges that advance safe, barrier-free navigation for BLV users.

中文标题/摘要

标题：大型视觉-语言模型是否准备好指导盲人和低视力个体？

大型视觉-语言模型（LVLMs）展示了协助盲人或低视力（BLV）个体的有希望的方向。然而，在实际场景中衡量它们的真实效用颇具挑战性，因为评估其描述是否BLV相关信息需要与评估标准场景描述完全不同的方式。虽然“VLM作为度量标准”或“LVLM作为裁判”的范式已经出现，但现有的评估者仍然无法捕捉BLV中心评估的独特要求，至少缺少以下关键属性之一：（1）与人类判断高度相关，（2）长时间指令理解，（3）评分生成效率，（4）多维度评估。为此，我们提出了一种统一框架以弥合自动化评估与实际BLV需求之间的差距。首先，我们对BLV参与者进行了深入的用户研究，以了解和量化他们的导航偏好，创建了VL-GUIDEDATA，这是一个大规模的BLV用户模拟偏好数据集，包含图像-请求-响应-评分对。然后，我们利用该数据集开发了一个无障碍意识评估器VL-GUIDE-S，其在人类对齐和推理效率方面均优于现有的（L)VLM裁判。值得注意的是，其有效性超越单一领域，在多个细粒度、BLV关键维度上表现出色。我们希望我们的工作为自动AI裁判奠定基础，促进BLV用户的安全无障碍导航。

Summary / 总结

The study aims to evaluate the utility of Large Vision-Language Models (LVLMs) in assisting individuals with blindness or low-vision (BLV) by developing a unified framework. This framework includes an in-depth user study with BLV participants to create VL-GUIDEDATA, a dataset of image-request-response-score pairs. The study then introduces VL-GUIDE-S, an accessibility-aware evaluator that outperforms existing LVLM judges in human alignment and inference efficiency, and shows strong performance across multiple BLV-critical dimensions.

该研究旨在通过提出一个统一框架VL-GUIDE-S来评估大型视觉-语言模型（LVLM）在帮助盲人或低视力（BLV）个体方面的实用性，该框架包括一项用户研究以理解BLV用户的需求，并构建一个数据集VL-GUIDEDATA来开发一个无障碍感知评估器。该评估器在人类一致性与推理效率方面优于现有方法，并在多个关键维度上展示了良好的性能，以满足BLV用户的需求。

Alphacast: An Interaction-Driven Agentic Reasoning Framework for Cognition-Inspired Time Series Forecasting

Authors: Xiaohan Zhang, Tian Gao, Mingyue Cheng, Bokai Pan, Ze Guo, Yaguo Liu, Xiaoyu Tao, Qi Liu

First: 2025-11-12T03:48:05+00:00 · Latest: 2026-04-01T08:51:07+00:00

Abstract

Time series forecasting plays a crucial role in decision-making across many real-world applications. Despite substantial progress, most existing methods still treat forecasting as a static, single-pass regression problem. In contrast, human experts form predictions through iterative reasoning that integrates temporal features, domain knowledge, case-based references, and supplementary context, with continuous refinement. In this work, we propose Alphacast, an interaction-driven agentic reasoning framework that enables accurate time series forecasting with training-free large language models. Alphacast reformulates forecasting as an expert-like process and organizes it into a multi-stage workflow involving context preparation, reasoning-based generation, and reflective evaluation, transforming forecasting from a single-pass output into a multi-turn, autonomous interaction process. To support diverse perspectives commonly considered by human experts, we develop a lightweight toolkit comprising a feature set, a knowledge base, a case library, and a contextual pool that provides external support for LLM-based reasoning. Extensive experiments across multiple benchmarks show that Alphacast generally outperforms representative baselines. Code is available at this repository: https://github.com/echo01-ai/AlphaCast.

中文标题/摘要

标题：Alphacast：一种交互驱动的能动推理框架，用于启发式时间序列预测

时间序列预测在许多实际应用中的决策制定中起着关键作用。尽管取得了显著进展，但大多数现有方法仍然将预测视为静态的单次回归问题。相比之下，人类专家通过迭代推理形成预测，该推理整合了时间特征、领域知识、案例参考和补充背景，并不断改进。在本文中，我们提出了Alphacast，一种交互驱动的能动推理框架，该框架利用训练无监督的大语言模型实现准确的时间序列预测。Alphacast 将预测重新定义为专家式的过程，并将其组织成一个包含上下文准备、基于推理的生成和反思性评估的多阶段工作流，将预测从单次输出转变为多轮次、自主的交互过程。为了支持人类专家通常考虑的多种视角，我们开发了一个轻量级工具包，包括特征集、知识库、案例库和上下文池，为基于大语言模型的推理提供外部支持。在多个基准上的广泛实验表明，Alphacast 通常优于代表性基线。代码可在以下仓库获取：https://github.com/echo01-ai/AlphaCast.

Summary / 总结

Alphacast is an interaction-driven agentic reasoning framework that reformulates time series forecasting as an iterative, expert-like process integrating temporal features, domain knowledge, case-based references, and supplementary context. It uses training-free large language models in a multi-stage workflow of context preparation, reasoning-based generation, and reflective evaluation, turning forecasting from a single-pass output into a multi-turn, autonomous interaction. Experiments across multiple benchmarks show that Alphacast generally outperforms representative baselines.

Alphacast 是一个交互驱动的代理推理框架，用于时间序列预测，模仿了人类专家的迭代过程。它将预测重新构想为一个包含上下文准备、基于推理的生成和反思性评估的多阶段工作流，使用无需训练的大语言模型。跨多个基准的实验表明，Alphacast 在性能上优于现有方法。该框架包含一个轻量级工具包，包括特征集、知识库、案例库和上下文池，以支持基于大语言模型的推理和多样化的专家视角。
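
The three-stage, multi-turn workflow named in the Alphacast entry (context preparation, generation, reflective evaluation) can be pictured as a simple control loop. The sketch below is only a shape illustration: the LLM call is replaced by a deterministic slope extrapolation, the reflection step is a toy plausibility rule (stay within the historical range), and every function name and rule is hypothetical. It does not reflect the actual Alphacast toolkit or prompts.

```python
from statistics import mean


def prepare_context(history):
    """Stage 1: collect simple features the later stages can consult."""
    return {
        "last": history[-1],
        "mean": mean(history),
        "low": min(history),
        "high": max(history),
        "slope": (history[-1] - history[0]) / (len(history) - 1),
    }


def generate(context, horizon, feedback):
    """Stage 2: stand-in for the LLM call; extrapolates the average slope."""
    forecast = [context["last"] + context["slope"] * (h + 1) for h in range(horizon)]
    if feedback == "clip":
        # Apply the reviewer's feedback from the previous turn.
        forecast = [min(max(x, context["low"]), context["high"]) for x in forecast]
    return forecast


def reflect(forecast, context):
    """Stage 3: return None to accept, or a feedback tag for the next turn."""
    if any(x < context["low"] or x > context["high"] for x in forecast):
        return "clip"
    return None


def forecast_with_reflection(history, horizon, max_turns=3):
    """Repeat generate -> reflect until accepted or the turn budget runs out."""
    context = prepare_context(history)
    feedback = None
    for turn in range(1, max_turns + 1):
        forecast = generate(context, horizon, feedback)
        feedback = reflect(forecast, context)
        if feedback is None:
            break
    return forecast, turn


history = [10, 12, 11, 13, 12, 14]
forecast, turns = forecast_with_reflection(history, horizon=3)
print(forecast, turns)
```

The point of the loop is structural: the forecast is a revisable draft, and the evaluation stage decides whether another turn is needed, in contrast to a single-pass regression output.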

Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models

Authors: Hao Tang, Yu Liu, Shuanglin Yan, Fei Shen, Shengfeng He, Jing Qin

Venue: AAAI 2026

First: 2026-01-13T12:08:26+00:00 · Latest: 2026-04-01T08:21:11+00:00

Comments: Accepted by AAAI 2026

Abstract

Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.

中文标题/摘要

标题：跨模态代理演化在视觉-语言模型的OOD检测中

在开放世界环境中部署视觉-语言模型时，可靠的零样本检测出分布外（OOD）输入至关重要。然而，零样本OOD检测中缺乏标记的负样本需要有效的代理信号，这些信号在分布转移下仍然有效。现有的负标签方法依赖于固定的一组文本代理，这会导致（i）稀疏地采样ID类之外的语义空间，以及（ii）视觉特征漂移而代理保持静态，从而导致跨模态对齐不良和预测不稳定。在本文中，我们提出了一种无需训练和注释的测试时框架CoEvo，该框架在测试时双向地、基于样本的条件下适应文本和视觉代理。具体而言，CoEvo引入了一种代理对齐的共进化机制，以维护两个不断进化的代理缓存，该机制通过测试图像引导上下文文本负样本的动态挖掘，并迭代地细化视觉代理，逐步重新对齐跨模态相似性并扩大局部OOD边界。最后，我们动态重新加权双模态代理的贡献，以获得对分布转移具有鲁棒性的OOD分数。在标准基准上的广泛实验表明，CoEvo在ImageNet-1K上实现了最先进的性能，与强大的负标签基线相比，AUROC提高了1.33%，FPR95降低了45.98%。

Summary / 总结

The paper addresses zero-shot out-of-distribution (OOD) detection for vision-language models by proposing CoEvo, a training- and annotation-free framework that adapts both textual and visual proxies at test time. CoEvo uses a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches: it mines contextual textual negatives guided by test images and iteratively refines visual proxies, realigning cross-modal similarities and enlarging local OOD margins. The contributions of the two modalities are then dynamically re-weighted into a calibrated OOD score. On ImageNet-1K, CoEvo improves AUROC by 1.33% and reduces FPR95 by 45.98% compared to strong negative-label baselines.

论文针对开放世界设置下视觉-语言模型的零样本out-of-distribution (OOD)检测挑战，提出了一种名为CoEvo的测试时框架，通过代理对齐的共生机制动态调整文本和视觉代理。该机制维护两个进化的代理缓存，动态挖掘由测试图像引导的上下文文本负例，并迭代细化视觉代理，以重新对齐跨模态相似性和扩大局部OOD边界。实验结果表明，CoEvo在ImageNet-1K上优于强负标签基线，实现了1.33%的AUROC提升和45.98%的FPR95减少。
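
The two metrics quoted for CoEvo, AUROC and FPR95, follow standard definitions that are easy to state in code. The sketch below assumes the common convention that a higher score means "more in-distribution" and that ID samples are the positive class. The scores are made-up values for illustration and have no connection to the paper's results.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample."""
    wins = 0.0
    for s in id_scores:
        for t in ood_scores:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5  # ties count as half a win
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Share of OOD samples accepted when `tpr` of ID samples are accepted."""
    ranked = sorted(id_scores, reverse=True)
    # Smallest number of ID samples that reaches the target true-positive rate.
    needed = math.ceil(tpr * len(ranked) - 1e-9)
    threshold = ranked[needed - 1]
    return sum(t >= threshold for t in ood_scores) / len(ood_scores)


# Toy scores: 20 ID samples (one of them low) and 5 OOD samples.
id_scores = [0.30] + [0.60 + 0.01 * i for i in range(19)]
ood_scores = [0.10, 0.20, 0.40, 0.655, 0.705]

print(auroc(id_scores, ood_scores))
print(fpr_at_tpr(id_scores, ood_scores))
```

AUROC summarizes ranking quality over all thresholds, while FPR95 looks at a single operating point, so a method can move one considerably more than the other.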

HarassGuard: Detecting Harassment Behaviors in Social Virtual Reality with Vision-Language Models

Authors: Junhee Lee, Minseok Kim, Hwanjo Heo, Seungwon Woo, Jinwoo Kim

First: 2026-04-01T07:59:01+00:00 · Latest: 2026-04-01T07:59:01+00:00

Comments: To appear in the 2026 TVCG Special Issue on the 2026 IEEE Conference on Virtual Reality and 3D User Interfaces (VR)

Abstract

Social Virtual Reality (VR) platforms provide immersive social experiences but also expose users to serious risks of online harassment. Existing safety measures are largely reactive, while proactive solutions that detect harassment behavior during an incident often depend on sensitive biometric data, raising privacy concerns. In this paper, we present HarassGuard, a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input. We construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior by considering contextual information in social VR. Experimental results demonstrate that HarassGuard achieves competitive performance compared to state-of-the-art baselines (i.e., LSTM/CNN, Transformer), reaching an accuracy of up to 88.09% in binary classification and 68.85% in multi-class classification. Notably, HarassGuard matches these baselines while using significantly fewer fine-tuning samples (200 vs. 1,115), offering unique advantages in contextual reasoning and privacy-preserving detection.

中文标题/摘要

标题：HarassGuard：使用视觉语言模型在社交虚拟现实环境中检测骚扰行为

社交虚拟现实（VR）平台提供了沉浸式的社交体验，但也使用户面临严重的网络骚扰风险。现有的安全措施大多是被动的，而能够在此过程中检测骚扰行为的主动解决方案往往依赖敏感的生物识别数据，引发了隐私问题。在本文中，我们提出了一种基于视觉语言模型（VLM）的系统——HarassGuard，该系统仅使用视觉输入来检测社交VR中的身体骚扰行为。我们构建了一个IRB批准的骚扰视觉数据集，应用提示工程，并微调VLM以通过考虑社交VR中的上下文信息来检测骚扰行为。实验结果表明，HarassGuard在二分类和多分类中的准确率分别达到了88.09%和68.85%，与最先进的基线（即LSTM/CNN、Transformer）相比，HarassGuard在使用显著较少的微调样本（200 vs. 1,115）的情况下达到了类似的表现，提供了在上下文推理和隐私保护检测方面的独特优势。

Summary / 总结

HarassGuard is a vision-language model (VLM) based system that detects physical harassment in social VR using only visual input, avoiding the sensitive biometric data that other proactive detection approaches often depend on. The authors construct an IRB-approved harassment vision dataset, apply prompt engineering, and fine-tune VLMs to detect harassment behavior using contextual information in social VR. HarassGuard reaches up to 88.09% accuracy in binary classification and 68.85% in multi-class classification, matching state-of-the-art baselines (LSTM/CNN, Transformer) while using significantly fewer fine-tuning samples (200 vs. 1,115), and offers advantages in contextual reasoning and privacy-preserving detection.

HarassGuard 是一个基于视觉-语言模型的系统，旨在仅使用视觉输入在社交 VR 中检测物理骚扰行为，解决现有安全措施相关的隐私问题。它构建了一个 IRB 批准的数据集，并通过提示工程微调 VLMs 来检测骚扰行为，二分类准确率达到 88.09%，多分类准确率达到 68.85%，同时需要比最先进的基线更少的微调样本，提供了在上下文推理和隐私保护检测方面的独特优势。
