arXiv Paper Digest

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Authors: Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han

First: 2025-10-10T17:59:58+00:00 · Latest: 2025-10-10T17:59:58+00:00

Comments: The first two authors contributed equally to this work

Abstract

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
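
The inference-time cache described in the abstract (attention sinks, a short window of recent vision tokens, a long window of recent text tokens) can be illustrated with a toy bookkeeping sketch. This is not the authors' implementation: the class name, the window sizes, and the use of token labels in place of real key/value tensors are all assumptions made for illustration.

```python
from collections import deque


class StreamingKVCache:
    """Toy sink-plus-window cache; token labels stand in for KV states."""

    def __init__(self, num_sink=4, vision_window=8, text_window=16):
        # All three sizes are illustrative assumptions, not values from the paper.
        self.num_sink = num_sink
        self.sinks = []                            # earliest tokens, never evicted
        self.vision = deque(maxlen=vision_window)  # short window of vision tokens
        self.text = deque(maxlen=text_window)      # long window of text tokens

    def append(self, token, modality):
        if modality not in ("vision", "text"):
            raise ValueError(f"unknown modality: {modality!r}")
        if len(self.sinks) < self.num_sink:
            self.sinks.append(token)               # fill the attention sinks first
        elif modality == "vision":
            self.vision.append(token)              # deque drops the oldest entry
        else:
            self.text.append(token)

    def size(self):
        return len(self.sinks) + len(self.vision) + len(self.text)


cache = StreamingKVCache()
for t in range(100):
    cache.append(f"v{t}", "vision")
    cache.append(f"t{t}", "text")

print(cache.size())  # 28 = 4 sinks + 8 vision + 16 text, however long the stream runs
```

The point of the sketch is only that the cache size is bounded by the three budgets rather than by stream length, which is what keeps latency and memory from growing with the video.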

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Authors: Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan

First: 2025-10-10T17:59:56+00:00 · Latest: 2025-10-10T17:59:56+00:00

Comments: Homepage: https://ltbai.github.io/VITA-VLA/

Abstract

Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves a 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving an 82.0% success rate (17% improvement). These results demonstrate that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.

Summary

This work aims to make it cheaper to teach vision-language models (VLMs) to perform robotic manipulation. A distillation-based framework transfers knowledge from a pretrained small action model into the VLM, adding only an action token and a state encoder, and uses a two-stage training strategy that first aligns and then fine-tunes the relevant components. The method reports a 97.3% average success rate on LIBERO and 93.5% on LIBERO-LONG, and outperforms the teacher model across five real-world manipulation tasks, while reducing training cost relative to training a large VLA model from scratch.

Vision Language Models: A Survey of 26K Papers

Authors: Fengming Lin

First: 2025-10-10T17:43:17+00:00 · Latest: 2025-10-10T17:43:17+00:00

Comments: VLM/LLM Learning Notes

Abstract

We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023-2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation like prompting/adapters/LoRA and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency or robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
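
The labeling step described in the abstract (normalize titles and abstracts, then match them against a lexicon to assign topical labels) might look roughly like the sketch below. The three-entry lexicon, the function names, and the plural-tolerant regex are hypothetical; the paper's hand-crafted lexicon and its phrase-protection step are not reproduced here.

```python
import re

# Hypothetical mini-lexicon for illustration only; the released lexicon is larger.
LEXICON = {
    "vlm": ["vision-language", "vlm", "multimodal llm"],
    "diffusion": ["diffusion"],
    "3d": ["nerf", "gaussian splatting"],
}


def normalize(text):
    """Lowercase, replace punctuation (except hyphens) with spaces, squeeze whitespace."""
    text = re.sub(r"[^a-z0-9\-\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def assign_labels(title, abstract, lexicon=LEXICON, max_labels=35):
    """Return lexicon labels whose phrases occur in the title or abstract."""
    text = normalize(title + " " + abstract)
    labels = []
    for label, phrases in lexicon.items():
        for phrase in phrases:
            # Whole-phrase match that also tolerates a simple plural "s".
            if re.search(r"\b" + re.escape(phrase) + r"s?\b", text):
                labels.append(label)
                break
    return labels[:max_labels]


print(assign_labels("From NeRFs to Gaussian Splatting", "A diffusion prior."))
```

The cap of 35 labels mirrors the "up to 35 topical labels" in the abstract; everything else about the matching rule is an assumption.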

Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi

First: 2025-10-01T17:55:37+00:00 · Latest: 2025-10-10T17:38:52+00:00

Comments: 82 pages, 28 figures, 32 tables. Code is available at https://github.com/CHATS-lab/verbalize-sampling

Abstract

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
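
A minimal sketch of how a Verbalized Sampling prompt could be built and its reply consumed is given below. Only the prompt wording follows the example quoted in the abstract; the "probability | response" line format, the parsing regex, the renormalization, and the choice to sample one response in proportion to the stated probabilities are assumptions for illustration, not the authors' pipeline.

```python
import random
import re


def build_vs_prompt(task, k=5):
    """Ask for k responses with verbalized probabilities (output format is assumed)."""
    return (
        f"Generate {k} {task} and their corresponding probabilities. "
        'Write one per line as "<probability> | <response>".'
    )


def parse_verbalized(reply):
    """Parse 'p | text' lines and renormalize the probabilities to sum to 1."""
    items = []
    for line in reply.splitlines():
        match = re.match(r"\s*([0-9]*\.?[0-9]+)\s*\|\s*(.+)", line)
        if match:
            items.append((match.group(2).strip(), float(match.group(1))))
    total = sum(p for _, p in items)
    if total <= 0:
        raise ValueError("no usable probabilities found in reply")
    return [(text, p / total) for text, p in items]


def sample_response(items, rng=random):
    """Draw one response according to the verbalized distribution."""
    texts = [text for text, _ in items]
    weights = [p for _, p in items]
    return rng.choices(texts, weights=weights, k=1)[0]


prompt = build_vs_prompt("jokes about coffee")
reply = "0.5 | Joke A\n0.3 | Joke B\n0.2 | Joke C"  # stand-in for a model reply
print(sample_response(parse_verbalized(reply), random.Random(0)))
```

No model call is made here; the reply string is a placeholder so the parsing and sampling steps can be run on their own.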

Summary

The paper addresses mode collapse after post-training alignment of LLMs and attributes it to typicality bias in preference data. It introduces Verbalized Sampling (VS), a training-free prompting strategy that asks the model to verbalize a probability distribution over a set of responses. Experiments show that VS improves diversity in creative writing, dialogue simulation, open-ended QA, and synthetic data generation without compromising factual accuracy or safety, and that more capable models benefit more.

ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition

Authors: Sanjoy Kundu, Shanmukha Vellamcheti, Sathyanarayanan N. Aakur

Venue: ICCV 2025

First: 2025-04-04T21:30:45+00:00 · Latest: 2025-10-10T17:23:46+00:00

Comments: Accepted to ICCV 2025. 17 pages, 6 figures, 3 tables

Abstract

Open-world egocentric activity recognition poses a fundamental challenge due to its unconstrained nature, requiring models to infer unseen activities from an expansive, partially observed search space. We introduce ProbRes, a Probabilistic Residual search framework based on jump-diffusion that efficiently navigates this space by balancing prior-guided exploration with likelihood-driven exploitation. Our approach integrates structured commonsense priors to construct a semantically coherent search space, adaptively refines predictions using Vision-Language Models (VLMs), and employs a stochastic search mechanism to efficiently locate high-likelihood activity labels while avoiding exhaustive enumeration. We systematically evaluate ProbRes across multiple openness levels (L0-L3), demonstrating its adaptability to increasing search space complexity. In addition to achieving state-of-the-art performance on benchmark datasets (GTEA Gaze, GTEA Gaze+, EPIC-Kitchens, and Charades-Ego), we establish a clear taxonomy for open-world recognition, delineating the challenges and methodological advancements necessary for egocentric activity understanding. Our results highlight the importance of structured search strategies, paving the way for scalable and efficient open-world activity recognition.

Solving Inverse Problems with FLAIR

Authors: Julius Erbach, Dominik Narnhofer, Andreas Dombos, Bernt Schiele, Jan Eric Lenssen, Konrad Schindler

First: 2025-06-03T09:29:47+00:00 · Latest: 2025-10-10T17:14:03+00:00

Abstract

Flow-based latent generative models such as Stable Diffusion 3 are able to generate images with remarkable quality, even enabling photorealistic text-to-image generation. Their impressive performance suggests that these models should also constitute powerful priors for inverse imaging problems, but that approach has not yet led to comparable fidelity. There are several key obstacles: (i) the data likelihood term is usually intractable; (ii) learned generative models cannot be directly conditioned on the distorted observations, leading to conflicting objectives between data likelihood and prior; and (iii) the reconstructions can deviate from the observed data. We present FLAIR, a novel, training-free variational framework that leverages flow-based generative models as prior for inverse problems. To that end, we introduce a variational objective for flow matching that is agnostic to the type of degradation, and combine it with deterministic trajectory adjustments to guide the prior towards regions which are more likely under the posterior. To enforce exact consistency with the observed data, we decouple the optimization of the data fidelity and regularization terms. Moreover, we introduce a time-dependent calibration scheme in which the strength of the regularization is modulated according to off-line accuracy estimates. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity. Our code is available at https://inverseflair.github.io/.
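
For readers less familiar with the terms "data likelihood", "prior", "data fidelity", and "regularization" used in the abstract, the textbook Bayesian formulation of an inverse problem is sketched below. This is the generic setup, not FLAIR's variational objective; the symbols (forward operator \mathcal{A}, noise n, Gaussian noise level \sigma) are assumptions introduced for illustration.

```latex
\begin{aligned}
y &= \mathcal{A}(x) + n
  && \text{(observation: degraded, noisy measurement of } x\text{)} \\
p(x \mid y) &\propto p(y \mid x)\, p(x)
  && \text{(posterior = data likelihood} \times \text{prior)} \\
\hat{x} &= \arg\min_{x}\;
  \underbrace{-\log p(y \mid x)}_{\text{data fidelity}}
  \;\underbrace{-\,\log p(x)}_{\text{regularization}}
  && \text{(MAP estimate)} \\
-\log p(y \mid x) &= \frac{1}{2\sigma^{2}} \lVert y - \mathcal{A}(x) \rVert_2^{2} + \text{const}
  && \text{(if } n \sim \mathcal{N}(0, \sigma^{2} I)\text{)}
\end{aligned}
```

In this reading, a pretrained generative model supplies the prior term, and the obstacles listed in the abstract concern how to combine it with the data fidelity term.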

Summary

The work uses flow-based generative models such as Stable Diffusion 3 as priors for inverse imaging problems, an approach that has so far fallen short of comparable fidelity because the data likelihood is intractable, the likelihood and prior objectives conflict, and reconstructions can deviate from the observed data. FLAIR, a training-free variational framework, combines a degradation-agnostic variational objective for flow matching with deterministic trajectory adjustments that guide the prior towards regions more likely under the posterior. It also decouples the optimization of the data fidelity and regularization terms to enforce consistency with the observations, and introduces a time-dependent calibration scheme for the regularization strength. On standard imaging benchmarks, FLAIR outperforms existing diffusion- and flow-based methods in reconstruction quality and sample diversity.

CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models

Authors: Aneesh Komanduri, Karuna Bhaila, Xintao Wu

Venue: EMNLP 2025

First: 2025-05-21T00:45:15+00:00 · Latest: 2025-10-10T15:38:10+00:00

Comments: Accepted to the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025 Main)

Abstract

Large language models (LLMs) have shown remarkable ability in various language tasks, especially with their emergent in-context learning capability. Extending LLMs to incorporate visual inputs, large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering (VQA). Despite increasing interest in the utility of LLMs in causal reasoning tasks such as causal discovery and counterfactual reasoning, there has been relatively little work showcasing the abilities of LVLMs on visual causal reasoning tasks. We take this opportunity to formally introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. Our CausalVLBench encompasses three representative tasks: causal structure inference, intervention target prediction, and counterfactual prediction. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets and demonstrate their fundamental strengths and weaknesses. We hope that our benchmark elucidates the drawbacks of existing vision-language models and motivates new directions and paradigms in improving the visual causal reasoning abilities of LVLMs.

Summary

CausalVLBench is a benchmark for evaluating the visual causal reasoning of large vision-language models (LVLMs) in a multi-modal in-context learning setting. It covers three tasks: causal structure inference, intervention target prediction, and counterfactual prediction. The study evaluates state-of-the-art open-source LVLMs on three causal representation learning datasets, exposing their strengths and weaknesses. The benchmark is intended to highlight the limitations of current LVLMs and motivate new directions for improving their visual causal reasoning.

D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models

Authors: Jisu Han, Wonjun Hwang

First: 2025-10-10T15:27:44+00:00 · Latest: 2025-10-10T15:27:44+00:00

Abstract

The test-time adaptation paradigm provides flexibility under domain shifts by immediately adapting the source model on unlabeled target data. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can reduce calibration error. Building on this insight, we propose dimensional entropy maximization, which regularizes the distribution of textual features toward uniformity to mitigate the dependency on dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
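
Calibration error is commonly reported as expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The sketch below shows that standard metric in plain Python; whether the paper uses exactly this binning is an assumption, and the function name and the 10-bin default are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy| over bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions at 0.9 confidence, three of them correct: overconfident by 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means the model's stated confidence tracks its accuracy more closely, which is the sense in which the abstract speaks of reducing calibration error.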

Summary

This work studies calibration in test-time prompt tuning of contrastive vision-language models (VLMs) and identifies a modality gap caused by a single dominant feature dimension shared across modalities. The authors propose dimensional entropy maximization, which regularizes the distribution of textual features toward uniformity to reduce dependence on the dominant dimensions and lower calibration error. According to the abstract, the method alleviates the degradation of calibration performance during test-time prompt tuning, improving the reliability of VLMs in real-world deployment.

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

Authors: Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang

Venue: NeurIPS 2025 Spotlight

First: 2025-02-24T13:52:05+00:00 · Latest: 2025-10-10T15:19:20+00:00

Comments: NeurIPS 2025 (Spotlight)

Abstract

Fine-tuning pre-trained models with custom data leads to numerous expert models for specific tasks. Merging these models into one universal model, which provides multi-task ability without data leakage, has gained popularity. With the expansion in data and model size, parameter-efficient tuning has become the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for merging fully fine-tuned models fail under efficient merging. To address the issue, we analyze the problem through low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between starkly different singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients based on inter-parameter relations for singular values to keep directions stable against task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to verify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase its effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.

Summary

RobustMerge is a training-free, parameter-efficient merging method for multimodal large language models (MLLMs) that targets direction robustness during merging. It prunes parameters and scales coefficients to keep singular-value directions stable under task interference, and applies cross-task normalization to improve generalization to unseen tasks. Experiments on a benchmark of diverse multimodal tasks show strong performance and generalizability.

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Authors: Wenrui Zhou, Mohamed Hendy, Shu Yang, Qingsong Yang, Zikun Guo, Yuyu Luo, Lijie Hu, Di Wang

First: 2025-06-08T15:00:21+00:00 · Latest: 2025-10-10T15:15:28+00:00

Comments: 28 pages

Abstract

As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE is the first to bring linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two training-free mitigation strategies that reveal potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://github.com/William030422/Video-Sycophancy.

Summary

This paper addresses sycophancy in Video-LLMs, the tendency to agree with user input even when it contradicts the visual evidence. The authors propose VISE, a benchmark for evaluating sycophantic behavior in Video-LLMs across question formats, prompt biases, and visual reasoning tasks. VISE brings linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across sycophancy types and interaction patterns. The paper also proposes two training-free mitigation strategies: enhancing visual grounding through interpretable key-frame selection, and steering model behavior with inference-time interventions on internal representations.