arXiv Paper Digest (arXiv 论文速递)

2026-01-09 03:34
Snapshot: 20260109_0334
Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models
Authors: Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen
First: 2026-01-07T18:24:12+00:00 · Latest: 2026-01-07T18:24:12+00:00
Abstract
Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.
中文标题/摘要
标题:扫描诱导的领域偏移削弱了病理基础模型的稳健性
病理基础模型(PFMs)已成为计算病理学的核心,旨在提供从全切片图像(WSIs)中提取特征的一般编码器。尽管在基准测试中表现出色,但PFMs对现实世界技术领域偏移(如全切片扫描仪设备的变异性)的稳健性仍然知之甚少。我们系统地评估了14种PFMs对扫描诱导变异性(包括最先进的模型、早期的自监督模型以及基于自然图像训练的基线模型)的稳健性。使用包含384例乳腺癌WSIs的多扫描仪数据集(在五种设备上扫描),我们独立地隔离了扫描器效应,排除了生物学和实验室混杂因素的影响。稳健性通过互补的无监督嵌入分析和一系列临床病理学监督预测任务进行评估。我们的结果表明,当前的PFMs对扫描诱导的领域偏移缺乏不变性。大多数模型在其嵌入空间中编码了明显的扫描器特定变异性。虽然AUC通常保持稳定,但这掩盖了一个关键的失败模式:扫描器变异性系统地改变了嵌入空间,并影响了下游模型预测的校准,导致扫描器依赖性偏差,可能影响临床应用的可靠性。我们进一步表明,稳健性不是简单地由训练数据规模、模型大小或模型的最新程度决定的。没有一种模型能够可靠地抵抗扫描器诱导的变异性。虽然训练数据最多样化的模型(这里代表视觉-语言模型)似乎在稳健性方面具有优势,但它们在下游监督任务中表现不佳。我们得出结论,PFM的开发和评估需要超越以准确性为中心的基准,转向明确评估和优化在现实获取变异性下的嵌入稳定性和校准。
Summary / 总结
The study evaluates the robustness of 14 pathology foundation models (PFMs) to scanner-induced variability using a multiscanner dataset of 384 breast cancer whole-slide images scanned on five devices. It finds that most PFMs encode scanner-specific variability in their embeddings; AUC often stays stable, but the calibration of downstream predictions shifts, producing scanner-dependent bias. Robustness is not a simple function of training data scale, model size, or model recency. Vision-language models, which were trained on the most diverse data, appear more robust but underperform on downstream supervised tasks. None of the evaluated models was reliably robust to scanner-induced domain shifts.
研究使用乳腺癌全切片图像的多扫描仪数据集评估了14种病理基础模型(PFMs)对扫描器引起的变异性的鲁棒性。研究发现,大多数PFMs在嵌入空间中编码了扫描器特有的变异性;尽管AUC通常保持稳定,下游预测的校准仍受影响,产生扫描器依赖性偏差。鲁棒性并非训练数据规模、模型规模或模型新旧程度的简单函数,表明PFM的评估需要超越以准确性为中心的基准,明确考察嵌入稳定性和校准。
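The abstract's point that a stable AUC can hide a calibration shift can be illustrated with a small synthetic sketch. Nothing below comes from the paper: the logits are simulated, the constant logit offset standing in for a "scanner effect" is an assumption, and `auc`, `ece`, and `sigmoid` are hypothetical helper names. Because AUC depends only on the ranking of scores, a constant offset leaves it unchanged, while the expected calibration error (ECE) grows.

```python
import math
import random

def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ece(labels, probs, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    total = 0.0
    for items in bins:
        if items:
            acc = sum(y for y, _ in items) / len(items)
            conf = sum(p for _, p in items) / len(items)
            total += len(items) / len(labels) * abs(acc - conf)
    return total

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
# Simulated logits for a reference scanner; labels are drawn so that
# these logits are well calibrated by construction.
logits_a = [rng.gauss(0.0, 1.5) for _ in range(1000)]
labels = [1 if rng.random() < sigmoid(z) else 0 for z in logits_a]
# Assumed scanner effect: the same slides scored with a constant logit offset.
logits_b = [z + 1.5 for z in logits_a]

auc_a, auc_b = auc(labels, logits_a), auc(labels, logits_b)
ece_a = ece(labels, [sigmoid(z) for z in logits_a])
ece_b = ece(labels, [sigmoid(z) for z in logits_b])
print(f"AUC  reference={auc_a:.3f}  shifted={auc_b:.3f}")
print(f"ECE  reference={ece_a:.3f}  shifted={ece_b:.3f}")
```

A constant offset is the simplest possible shift; real scanner effects on embeddings are presumably more complex, so this only shows why a ranking metric alone cannot detect them.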
Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning
Authors: Yifan Wang, Yanyu Li, Sergey Tulyakov, Yun Fu, Anil Kag
First: 2026-01-07T18:05:08+00:00 · Latest: 2026-01-07T18:05:08+00:00
Abstract
Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy-to-game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
中文标题/摘要
标题:扩散-DRF:可微奖励流用于视频扩散微调
直接偏好优化(DPO)最近通过提高视觉保真度和文本对齐性改善了文本到视频(T2V)生成。然而,当前方法依赖于人类注释或学习奖励模型中的非可微偏好信号,这使得训练耗时、易产生偏差且容易被操纵,从而常常引发奖励劫持和训练不稳定。我们提出了一种可微奖励流(Diffusion-DRF),使用冻结的现成视觉-语言模型(VLM)作为无训练的批评家,用于微调视频扩散模型。Diffusion-DRF直接通过扩散去噪链反向传播VLM反馈,将logit级响应转换为token感知的梯度以进行优化。我们提出了一种自动化的、按方面结构化的提示管道,以获取可靠的多维度VLM反馈,同时梯度检查点技术使通过最终去噪步骤的高效更新成为可能。Diffusion-DRF在不使用额外奖励模型或偏好数据集的情况下,提高了视频质量和语义对齐性,同时减轻了奖励劫持和崩溃。它具有模型通用性,并且可以轻松推广到其他基于扩散的生成任务。
Summary / 总结
The paper addresses the limitations of current Direct Preference Optimization (DPO) methods in Text-to-Video (T2V) generation, which rely on non-differentiable preference signals that are label-intensive and prone to bias and reward hacking. It introduces Diffusion-DRF, a differentiable reward flow that uses a frozen Vision-Language Model (VLM) as a critic to provide token-aware gradients for optimization. This method improves video quality and semantic alignment while avoiding the need for additional reward models or preference datasets, and it is model-agnostic and applicable to other diffusion-based generative tasks.
论文针对当前文本到视频(T2V)生成中的直接偏好优化(DPO)方法存在的问题,这些方法依赖于非可微的偏好信号,导致训练效率低下且容易出现奖励作弊。作者提出了Diffusion-DRF,这是一种可微奖励流,使用冻结的视觉-语言模型(VLM)作为批评家,提供具有标记感知的梯度进行优化。该方法提高了视频质量和语义对齐,并避免了需要额外的奖励模型或偏好数据集,使其更加稳健和高效。该方法对其他基于扩散的生成任务也具有通用性。
GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning
Authors: Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu
First: 2026-01-07T17:26:41+00:00 · Latest: 2026-01-07T17:26:41+00:00
Abstract
The evolution of Remote Sensing Vision-Language Models (RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.
中文标题/摘要
标题:GeoReason: 通过逻辑一致性强化学习使遥感视觉语言模型的思考与回答保持一致
遥感视觉语言模型(RS-VLMs)的发展强调了从感知中心的识别向高级演绎推理过渡的重要性,以增强复杂空间任务中的认知可靠性。然而,当前模型常常遭受逻辑幻觉的问题,即正确的答案是基于有缺陷的推理链或依赖于位置捷径而非空间逻辑。这种脱节削弱了在战略空间决策中的可靠性。为了解决这一问题,我们提出了GeoReason框架,旨在使内部思考与最终决策同步。我们首先构建了GeoReason-Bench,这是一个逻辑驱动的数据集,包含4,000条从几何原语和专家知识中合成的推理轨迹。然后我们制定了两阶段训练策略:(1) 监督知识初始化,以使模型具备推理语法和领域专业知识;(2) 一致性感知强化学习,以提高演绎可靠性。这一阶段整合了一种新颖的逻辑一致性奖励,通过选项排列策略惩罚逻辑漂移,以使决策基于可验证的推理轨迹。实验结果表明,我们的框架显著提高了RS-VLMs的认知可靠性和可解释性,达到了与其他先进方法相比的最优性能。
Summary / 总结
GeoReason is a framework designed to improve the cognitive reliability and interpretability of Remote Sensing Vision-Language Models (RS-VLMs) by addressing logical hallucinations. It introduces a two-stage training strategy: supervised knowledge initialization to equip the model with reasoning syntax and domain expertise, followed by consistency-aware reinforcement learning that penalizes logical drift through a novel Logical Consistency Reward. The framework significantly enhances RS-VLMs' performance, achieving state-of-the-art results.
GeoReason 是一个旨在通过解决逻辑幻觉来提高遥感视觉语言模型 (RS-VLM) 认知可靠性的框架。它采用两阶段训练策略:监督知识初始化和一致性意识强化学习。第一阶段使模型具备推理语法和领域专业知识,而第二阶段使用逻辑一致性奖励来惩罚逻辑漂移,确保决策基于可验证的推理轨迹。实验表明,GeoReason 显著提高了 RS-VLM 的可解释性和可靠性,并优于其他先进方法。
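The option permutation idea behind the Logical Consistency Reward can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's reward: the abstract does not give the formula, so the agreement-fraction definition, the function name `consistency_reward`, and the two toy policies are all hypothetical. The sketch rewards a policy for choosing the same option content no matter how the options are ordered, which penalizes positional shortcuts.

```python
import itertools

def consistency_reward(options, answer_fn):
    """Fraction of option orderings in which the policy picks the same content.

    answer_fn is a hypothetical stand-in for a model: it receives a list of
    option strings and returns the index of the option it selects.
    """
    base_choice = options[answer_fn(list(options))]
    perms = list(itertools.permutations(options))
    agree = sum(1 for p in perms if p[answer_fn(list(p))] == base_choice)
    return agree / len(perms)

options = ["airport runway", "highway", "river", "railway"]

def content_policy(opts):
    # Decides from content: always finds the same option wherever it appears.
    return opts.index("airport runway")

def positional_policy(opts):
    # Positional shortcut: always answers with the first slot.
    return 0

print(consistency_reward(options, content_policy))     # 1.0
print(consistency_reward(options, positional_policy))  # 0.25
```

With four options there are 24 orderings, and the positional policy lands on its original choice only in the 6 orderings that keep that option in the first slot, hence 0.25.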
Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts
Authors: Zhihao Zhu, Jiafeng Liang, Shixin Jiang, Jinlan Fu, Ming Liu, Guanglu Sun, See-Kiong Ng, Bing Qin
First: 2026-01-07T16:39:34+00:00 · Latest: 2026-01-07T16:39:34+00:00
Comments: 10 pages, 5 figures
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in video reasoning via Chain-of-Thought (CoT). However, the robustness of their reasoning chains remains questionable. In this paper, we identify a critical failure mode termed textual inertia, where once a textual hallucination occurs in the thinking process, models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. To systematically investigate this, we propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs spanning both native reasoning architectures and prompt-driven paradigms to evaluate their self-reflection capabilities. The results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation. To mitigate this, we introduce Active Visual-Context Refinement, a training-free inference paradigm which orchestrates an active visual re-grounding mechanism to enforce fine-grained verification coupled with an adaptive context refinement strategy to summarize and denoise the reasoning history. Experiments demonstrate that our approach significantly stifles hallucination propagation and enhances reasoning robustness.
中文标题/摘要
标题:大型多模态模型在跨模态冲突下的推理一致性分析
大型多模态模型(LMMs)在通过链式思考(CoT)进行视频推理方面展现了令人印象深刻的性能。然而,它们推理链的鲁棒性仍然值得怀疑。在本文中,我们识别出一种关键的失败模式,称为文本惯性,即在思考过程中一旦出现文本幻觉,模型往往会盲目地坚持错误的文本,而忽视矛盾的视觉证据。为了系统地研究这一问题,我们提出了逻辑图扰动协议,该协议结构化地向不同LMMs的推理链中注入扰动,涵盖原生推理架构和提示驱动范式,以评估它们的自我反思能力。结果显示,模型在不到10%的情况下成功自我纠正,并且主要依赖于盲目的文本错误传播。为了缓解这一问题,我们引入了主动视觉上下文精炼,这是一种无需训练的推理范式,它协调了一个主动视觉再定位机制,以强制执行精细验证,并结合自适应上下文精炼策略来总结和去噪推理历史。实验表明,我们的方法显著抑制了幻觉传播并增强了推理鲁棒性。
Summary / 总结
This paper addresses the robustness of reasoning chains in Large Multimodal Models (LMMs) by identifying a failure mode called textual inertia. The authors propose the LogicGraph Perturbation Protocol to evaluate models' self-reflection capabilities under cross-modal conflicts. The study finds that models rarely self-correct and mostly propagate textual errors. To improve this, the authors introduce Active Visual-Context Refinement, which enhances reasoning robustness by actively re-grounding visual context and refining reasoning history. Experiments show that this approach reduces hallucination propagation and improves reasoning consistency.
本文通过识别大型多模态模型(LMMs)推理链中的一个失败模式——文本惯性,即模型在面对视觉矛盾证据时仍坚持错误文本,来探讨其稳健性问题。为此,作者提出了逻辑图扰动协议,以评估模型的自我反思能力。结果显示,模型很少自我纠正,主要是在错误文本上盲目传播。随后,作者引入了主动视觉-上下文精炼方法,这是一种无需训练的推理范式,通过主动重新定位视觉上下文和精炼推理历史,显著减少了幻觉传播,提升了推理的稳健性。
Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding
Authors: Shengyuan Ye, Bei Ouyang, Tianyi Qian, Liekang Zeng, Mu Yuan, Xiaowen Chu, Weijie Hong, Xu Chen
First: 2025-12-08T09:32:47+00:00 · Latest: 2026-01-07T16:24:34+00:00
Comments: Accepted by IEEE International Conference on Computer Communications 2026
Abstract
Vision-language models (VLMs) have demonstrated impressive multimodal comprehension capabilities and are being deployed in an increasing number of online video understanding applications. While recent efforts extensively explore advancing VLMs' reasoning power in these cases, deployment constraints are overlooked, leading to overwhelming system overhead in real-world deployments. To address that, we propose Venus, an on-device memory-and-retrieval system for efficient online video understanding. Venus proposes an edge-cloud disaggregated architecture that sinks memory construction and keyframe retrieval from cloud to edge, operating in two stages. In the ingestion stage, Venus continuously processes streaming edge videos via scene segmentation and clustering, where the selected keyframes are embedded with a multimodal embedding model to build a hierarchical memory for efficient storage and retrieval. In the querying stage, Venus indexes incoming queries from memory, and employs a threshold-based progressive sampling algorithm for keyframe selection that enhances diversity and adaptively balances system cost and reasoning accuracy. Our extensive evaluation shows that Venus achieves a 15x-131x speedup in total response latency compared to state-of-the-art methods, enabling real-time responses within seconds while maintaining comparable or even superior reasoning accuracy.
中文标题/摘要
标题:Venus:一种高效的边缘记忆与检索系统,用于基于VLM的在线视频理解
视觉语言模型(VLMs)展示了强大的多模态理解能力,并被部署在越来越多的在线视频理解应用中。尽管最近的工作广泛探索了在这些场景下增强VLMs的推理能力,但部署限制被忽视了,导致实际部署中的系统开销过大。为了解决这个问题,我们提出了Venus,一种用于高效在线视频理解的设备端记忆与检索系统。Venus提出了一种边缘-云分离架构,将记忆构建和关键帧检索从云端下沉到边缘,分为两个阶段运行。在摄取阶段,Venus通过场景分割和聚类持续处理边缘视频流,选出的关键帧经多模态嵌入模型嵌入,构建层次化记忆以实现高效存储和检索。在查询阶段,Venus根据传入的查询在记忆中进行索引,并采用基于阈值的渐进采样算法进行关键帧选择,以增强多样性并自适应地平衡系统成本和推理准确性。我们的广泛评估表明,与最先进的方法相比,Venus在总响应延迟上实现了15倍至131倍的加速,能够在几秒钟内实现实时响应,同时保持相当甚至更优的推理准确性。
Summary / 总结
Venus is an on-device memory-and-retrieval system designed to make online video understanding with vision-language models more efficient. It uses an edge-cloud disaggregated architecture that moves memory construction and keyframe retrieval from the cloud to the edge to reduce system overhead. In the ingestion stage, Venus processes streaming videos via scene segmentation and clustering, embedding selected keyframes with a multimodal embedding model to build a hierarchical memory. In the querying stage, it uses a threshold-based progressive sampling algorithm to select keyframes, balancing cost and accuracy. Extensive evaluation shows a 15x-131x speedup in total response latency over state-of-the-art methods, enabling real-time responses within seconds while maintaining comparable or better reasoning accuracy.
Venus 是一种设备端记忆与检索系统,旨在通过视觉语言模型(VLMs)提高在线视频理解的效率。它提出了一种边缘-云分离架构,将记忆构建和关键帧检索从云迁移到边缘。Venus 通过场景分割和聚类处理流媒体视频,并使用多模态模型嵌入选定的关键帧以构建层次化记忆。在查询阶段,它使用基于阈值的渐进采样算法选择关键帧,以平衡成本和准确性。Venus 显著减少了总响应延迟,比现有方法快15到131倍,能够在几秒钟内实现实时响应,同时保持或提高推理准确性。
FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation
Authors: Gen Li, Peiyu Liu
First: 2026-01-04T12:46:35+00:00 · Latest: 2026-01-07T15:36:31+00:00
Abstract
Vision-Language Models (VLMs) excel at visual reasoning but still struggle with integrating external knowledge. Retrieval-Augmented Generation (RAG) is a promising solution, but current methods remain inefficient and often fail to maintain high answer quality. To address these challenges, we propose VideoSpeculateRAG, an efficient VLM-based RAG framework built on two key ideas. First, we introduce a speculative decoding pipeline: a lightweight draft model quickly generates multiple answer candidates, which are then verified and refined by a more accurate heavyweight model, substantially reducing inference latency without sacrificing correctness. Second, we identify a major source of error - incorrect entity recognition in retrieved knowledge - and mitigate it with a simple yet effective similarity-based filtering strategy that improves entity alignment and boosts overall answer accuracy. Experiments demonstrate that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x. Our framework highlights the potential of combining speculative decoding with retrieval-augmented reasoning to enhance efficiency and reliability in complex, knowledge-intensive multimodal tasks.
中文标题/摘要
标题:FastV-RAG:迈向快速且精细粒度的视频问答检索增强生成
视觉-语言模型(VLMs)在视觉推理方面表现出色,但在整合外部知识方面仍然存在困难。检索增强生成(RAG)是一种有前途的解决方案,但当前的方法仍然不够高效,往往无法保持高质量的答案。为了解决这些挑战,我们提出了一种高效的基于VLM的RAG框架VideoSpeculateRAG,该框架基于两个关键想法。首先,我们引入了一种推测性解码流水线:一个轻量级的草稿模型快速生成多个答案候选,然后由一个更准确的重模型进行验证和细化,从而显著减少推理延迟,同时保持正确性。其次,我们识别出错误的主要来源——检索知识中的实体识别错误,并通过一种简单而有效的基于相似性的过滤策略来缓解这一问题,从而提高实体对齐并提升整体答案准确性。实验表明,VideoSpeculateRAG在准确性和标准RAG方法相当或更高,同时将推理加速约2倍。我们的框架突显了将推测性解码与检索增强推理相结合以提高复杂、知识密集型多模态任务的效率和可靠性的潜力。
Summary / 总结
The research aims to improve the efficiency and accuracy of video question-answering systems by integrating retrieval-augmented generation (RAG) with speculative decoding. The proposed VideoSpeculateRAG framework uses a lightweight draft model to quickly generate multiple answer candidates, which are then refined by a more accurate model, significantly reducing inference latency. It also introduces a similarity-based filtering strategy to improve entity recognition in retrieved knowledge, enhancing overall answer accuracy. Experiments show that VideoSpeculateRAG achieves comparable or higher accuracy than standard RAG approaches while accelerating inference by approximately 2x.
研究旨在通过结合检索增强生成(RAG)与推测性解码来提高视频问答系统的效率和准确性。提出的VideoSpeculateRAG框架使用轻量级的草稿模型快速生成多个答案候选,然后由更准确的重模型进行优化。这种方法减少了推理延迟,同时保持了准确性。此外,采用基于相似性的过滤策略来提高检索知识中的实体识别,从而提高整体答案准确性。实验表明,VideoSpeculateRAG在加速推理约2倍的同时,达到了或超过了标准RAG方法的准确率。
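The similarity-based filtering step can be sketched roughly as below. The abstract does not say which embeddings or threshold are used, so everything here is an assumption for illustration: the toy 3-dimensional vectors, the 0.8 threshold, and the names `cosine` and `filter_entities` are hypothetical. The sketch keeps only retrieved entities whose embedding is close to the query embedding, and drops the rest before they reach the generator.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_entities(query_vec, candidates, threshold=0.8):
    """Keep (name, similarity) pairs at or above the threshold, best first."""
    kept = []
    for name, vec in candidates:
        sim = cosine(query_vec, vec)
        if sim >= threshold:
            kept.append((name, sim))
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Toy embeddings (assumed): the query depicts the first entity.
query = [1.0, 0.0, 0.0]
retrieved = [
    ("Eiffel Tower", [0.9, 0.1, 0.0]),
    ("Tokyo Tower", [0.7, 0.7, 0.0]),
    ("Golden Gate Bridge", [0.0, 0.2, 0.9]),
]
kept = filter_entities(query, retrieved)
print([name for name, _ in kept])
```

A fixed threshold is the simplest choice; a top-k cut or a margin over the runner-up would be equally plausible variants.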
SSSD: Simply-Scalable Speculative Decoding
Authors: Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli
First: 2024-11-08T14:23:02+00:00 · Latest: 2026-01-07T15:11:39+00:00
Comments: 16 pages, 6 figures
Abstract
Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial speedups typically rely on an additional trained draft model or auxiliary model components, increasing deployment and maintenance complexity. This added complexity reduces flexibility, particularly when serving workloads shift to tasks, domains, or languages that are not well represented in the draft model's training data. We introduce Simply-Scalable Speculative Decoding (SSSD), a training-free method that combines lightweight n-gram matching with hardware-aware speculation. Relative to standard autoregressive decoding, SSSD reduces latency by up to 2.9x. It achieves performance on par with leading training-based approaches across a broad range of benchmarks, while requiring substantially lower adoption effort--no data preparation, training or tuning are needed--and exhibiting superior robustness under language and domain shift, as well as in long-context settings.
中文标题/摘要
标题:SSSD:简单可扩展的推测性解码
推测性解码已成为加速大型语言模型推理的一种流行技术。然而,大多数现有方法在生产服务系统中仅能提供适度的性能提升。能够实现显著加速的方法通常依赖于额外训练的草稿模型或辅助模型组件,增加了部署和维护的复杂性。这种额外的复杂性降低了灵活性,特别是在服务工作负载转向草稿模型训练数据中未充分代表的任务、领域或语言时。我们提出了简单可扩展的推测性解码(SSSD),这是一种无需训练的方法,结合了轻量级n-gram匹配与硬件感知推测。与标准自回归解码相比,SSSD 将延迟最多减少2.9倍。它在广泛基准测试中实现了与基于训练的领先方法相当的性能,而采用门槛显著降低——无需数据准备、训练或调优,并且在语言和领域变化以及长上下文设置下表现出更优的鲁棒性。
Summary / 总结
SSSD is a training-free speculative decoding method that combines lightweight n-gram matching with hardware-aware speculation to reduce latency by up to 2.9x compared to standard autoregressive decoding. It achieves performance comparable to leading training-based approaches across various benchmarks while requiring no data preparation, training, or tuning, and showing better robustness under language and domain shifts and in long-context settings.
SSSD 是一种无需训练的推测性解码方法,结合了轻量级 n-gram 匹配和硬件感知推测,与标准自回归解码相比,可将延迟降低多达 2.9 倍。它在各种基准测试中实现了与领先的基于训练的方法相当的性能,无需数据准备、训练或调优,并且在语言和领域变化以及长上下文设置中表现出更好的鲁棒性。
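Training-free drafting by n-gram matching can be sketched as below. This is a generic illustration of the idea, not SSSD's actual algorithm: the abstract does not describe its data structures or its hardware-aware speculation, and `propose_draft`, `accept_prefix`, and the toy target sequence are hypothetical. The draft is copied from what followed the most recent earlier occurrence of the last n tokens, and the target model then accepts the longest prefix it agrees with.

```python
def propose_draft(tokens, n=2, max_draft=4):
    """Copy the tokens that followed the latest earlier match of the last n tokens."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []

def accept_prefix(context, draft, next_token_fn):
    """Keep draft tokens while the target model's greedy choice agrees.

    Checked one token at a time for clarity; real systems typically verify
    all draft positions in a single batched forward pass.
    """
    accepted = []
    for tok in draft:
        if next_token_fn(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted

# Toy target "model" (assumed): deterministically continues a fixed sequence.
target = "the cat sat on the mat and the cat sat on a chair".split()

def next_token_fn(ctx):
    return target[len(ctx)]

context = target[:9]  # "the cat sat on the mat and the cat"
draft = propose_draft(context)
accepted = accept_prefix(context, draft, next_token_fn)
print(draft)     # ['sat', 'on', 'the', 'mat']
print(accepted)  # ['sat', 'on']
```

In standard speculative decoding the target model's own token at the first mismatch is appended as well, so each verification step always makes progress.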
VISTA: Mitigating Semantic Inertia in Video-LLMs via Training-Free Dynamic Chain-of-Thought Routing
Authors: Hongbo Jin, Jiayu Ding, Siyi Xie, Guibo Luo, Ge Li
First: 2025-05-17T04:34:32+00:00 · Latest: 2026-01-07T14:58:44+00:00
Comments: 19 pages, 7 figures
Abstract
Recent advancements in Large Language Models have successfully transitioned towards System 2 reasoning, yet applying these paradigms to video understanding remains challenging. While prevailing research attributes failures in Video-LLMs to perceptual limitations, our empirical analysis reveals a cognitive misalignment termed Semantic Inertia, where models suppress valid visual evidence in favor of dominant language priors. To rectify this, we propose VISTA, a training-free framework designed to align perception with logical deduction. By dynamically routing inference paths and materializing implicit visual features into explicit textual anchors, our approach effectively counterbalances the influence of parametric knowledge. Furthermore, we incorporate a Latent Reasoning Consensus mechanism to mitigate stochastic hallucinations. VISTA showed outstanding results on a wide range of benchmarks, and outperforms its base model by 9.3% on Egochema and 5.6% on VideoEspresso, rivalling or even surpassing larger and proprietary models. Our codebase will be publicly available soon.
中文标题/摘要
标题:VISTA:通过免训练的动态链式思考路由减轻视频LLMs中的语义惯性
近年来,大型语言模型成功过渡到系统2推理,但在将其应用到视频理解方面仍然面临挑战。尽管现有研究将视频LLMs的失败归因于感知限制,但我们的实证分析揭示了一种认知错位,称为语义惯性,即模型倾向于抑制有效的视觉证据,而优先考虑主导的语言先验。为了解决这一问题,我们提出了VISTA,一种免训练框架,旨在使感知与逻辑推理相一致。通过动态路由推理路径并将隐含的视觉特征显式化为文本锚点,我们的方法有效地平衡了参数知识的影响。此外,我们还引入了一种潜在推理共识机制,以减轻随机幻觉。VISTA在多种基准测试中表现出色,并在Egochema上比其基础模型高出9.3%,在VideoEspresso上高出5.6%,甚至与更大的专有模型相当或超越。我们的代码库将很快公开。
Summary / 总结
The research aims to address the issue of Semantic Inertia in Video-LLMs, where models prioritize language priors over visual evidence. The authors propose VISTA, a training-free framework that dynamically routes inference paths and converts implicit visual features into explicit textual anchors to align perception with logical deduction. VISTA outperforms its base model by 9.3% on Egochema and 5.6% on VideoEspresso, demonstrating superior performance across various benchmarks and rivaling larger proprietary models.
研究旨在解决Video-LLMs中的语义惯性问题,即模型优先考虑语言先验而非视觉证据。作者提出了一种名为VISTA的无训练框架,通过动态路由推理路径并将隐含的视觉特征转化为显式的文本锚点,以实现感知与逻辑推理的对齐。VISTA在Egochema和VideoEspresso等基准测试中分别比基模型高出9.3%和5.6%,展示了其在各种基准测试中的有效性,并且能够匹敌甚至超越更大的专有模型。
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Authors: Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng
First: 2026-01-07T13:48:12+00:00 · Latest: 2026-01-07T13:48:12+00:00
Comments: 14 pages, 13 figures
Abstract
Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
中文标题/摘要
标题:FocusUI:通过保留位置的视觉标记选择实现高效的UI定位
视觉-语言模型(VLMs)在用户界面(UI)定位任务中表现出色,这得益于它们处理越来越高的分辨率屏幕截图的能力。然而,屏幕截图被标记成数千个视觉标记(例如,2K分辨率大约有4700个),这导致了显著的计算开销并稀释了注意力。相比之下,人类在与UI交互时通常会关注感兴趣区域。在本文中,我们开创了高效UI定位的任务。根据对任务特性和挑战的实用分析,我们提出了FocusUI,这是一种高效的UI定位框架,它在保留位置连续性的前提下选择与指令最相关的补丁,以实现精确的定位。FocusUI解决了两个关键挑战:(1)在视觉编码中消除冗余标记。我们通过将指令条件得分与基于规则的UI图得分融合来构建补丁级监督,该得分降低了大同质区域的权重,从而选择出独特的且与指令相关的视觉标记。(2)在视觉标记选择过程中保持位置连续性。我们发现,通用的视觉标记修剪方法在UI定位任务中由于位置信息的中断而遭受严重的准确性下降。我们引入了一种新颖的PosPad策略,即将每个连续的删除视觉标记序列压缩成一个特殊的标记放置在序列的最后一个索引处,以保持位置连续性。在四个定位基准上的全面实验表明,FocusUI超越了特定于GUI的基线。在ScreenSpot-Pro基准上,FocusUI-7B在GUI-Actor-7B的基础上提高了3.7%的性能。即使只保留30%的视觉标记,FocusUI-7B的性能下降也只有3.2%,同时实现了高达1.44倍的更快推理速度和17%的更低峰值GPU内存。
Summary / 总结
This paper addresses the challenge of efficient UI grounding in Vision-Language Models (VLMs) by proposing FocusUI. FocusUI selects the patches most relevant to the instruction while preserving positional continuity, addressing the issues of redundant visual tokens and broken positional information. Experiments on four grounding benchmarks show that FocusUI outperforms GUI-specific baselines: on ScreenSpot-Pro, FocusUI-7B improves on GUI-Actor-7B by 3.7%. With only 30% of visual tokens retained, its performance drops by just 3.2%, with up to 1.44x faster inference and 17% lower peak GPU memory.
研究旨在通过解决计算开销和注意力稀释问题,提高视觉-语言模型(VLMs)在UI定位任务中的效率。提出了FocusUI,通过选择与指令最相关的图像块并保持位置连续性来实现这一目标。它通过融合指令条件得分和UI图得分构建图像块级监督,并引入PosPad策略压缩被丢弃的视觉标记,以保持位置信息。实验表明,FocusUI优于GUI专用基线;在仅保留30%视觉标记的情况下,性能仅下降3.2%,同时推理速度最高提升1.44倍,峰值GPU内存降低17%。
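The PosPad rule as stated in the abstract (each contiguous run of dropped visual tokens becomes a single marker placed at the run's last index) can be sketched as follows. The function name `pospad`, the `<pospad>` marker string, and the representation of the output as (position, token) pairs are assumptions for illustration; how the model consumes the retained position indices is not described in the abstract.

```python
def pospad(tokens, keep, pad="<pospad>"):
    """Collapse each run of dropped tokens into one marker at the run's last index.

    tokens: sequence of visual tokens (any objects).
    keep:   parallel sequence of booleans, True for tokens to retain.
    Returns (original_position, token) pairs, so kept tokens keep their
    original positions and the gaps between them stay visible.
    """
    out = []
    i, n = 0, len(tokens)
    while i < n:
        if keep[i]:
            out.append((i, tokens[i]))
            i += 1
        else:
            j = i
            # Extend j to the last index of this contiguous dropped run.
            while j + 1 < n and not keep[j + 1]:
                j += 1
            out.append((j, pad))
            i = j + 1
    return out

tokens = ["a", "b", "c", "d", "e", "f", "g"]
keep = [True, False, False, True, False, True, False]
compressed = pospad(tokens, keep)
print(compressed)
# [(0, 'a'), (2, '<pospad>'), (3, 'd'), (4, '<pospad>'), (5, 'f'), (6, '<pospad>')]
```

Here seven tokens shrink to six entries; the saving grows with the length of the dropped runs, which is where large homogeneous screen regions would help.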
HemBLIP: A Vision-Language Model for Interpretable Leukemia Cell Morphology Analysis
Authors: Julie van Logtestijn, Petru Manescu
First: 2026-01-07T13:31:33+00:00 · Latest: 2026-01-07T13:31:33+00:00
Abstract
Microscopic evaluation of white blood cell morphology is central to leukemia diagnosis, yet current deep learning models often act as black boxes, limiting clinical trust and adoption. We introduce HemBLIP, a vision language model designed to generate interpretable, morphology aware descriptions of peripheral blood cells. Using a newly constructed dataset of 14k healthy and leukemic cells paired with expert-derived attribute captions, we adapt a general-purpose VLM via both full fine-tuning and LoRA based parameter efficient training, and benchmark against the biomedical foundation model MedGEMMA. HemBLIP achieves higher caption quality and morphological accuracy, while LoRA adaptation provides further gains with significantly reduced computational cost. These results highlight the promise of vision language models for transparent and scalable hematological diagnostics.
中文标题/摘要
标题:HemBLIP:一种用于可解释性白血病细胞形态分析的视觉语言模型
白血病诊断中,显微镜下评估白细胞形态是核心内容,但当前的深度学习模型往往作为黑箱运作,限制了临床信任和应用。我们引入了HemBLIP,这是一种视觉语言模型,旨在生成可解释且形态意识强的外周血细胞描述。通过使用一个由14000个健康和白血病细胞及其专家提取的属性描述构成的新数据集,我们通过全微调和基于LoRA的参数高效训练来适应通用视觉语言模型,并与生物医学基础模型MedGEMMA进行基准测试。HemBLIP在描述质量和形态准确性上表现更优,而基于LoRA的适应进一步降低了计算成本。这些结果突显了视觉语言模型在透明和可扩展的血液学诊断中的潜力。
Summary / 总结
HemBLIP is a vision-language model designed to provide interpretable descriptions of leukemia cell morphology, addressing the limitations of current black-box models in clinical settings. Using a newly constructed dataset of 14,000 healthy and leukemic cells paired with expert-derived attribute captions, the authors adapt a general-purpose VLM through full fine-tuning and LoRA-based parameter-efficient training. HemBLIP outperforms the biomedical foundation model MedGEMMA in both caption quality and morphological accuracy, and LoRA adaptation provides further gains at significantly lower computational cost.
HemBLIP 是一种视觉语言模型,旨在为诊断提供可解释的白血病细胞形态描述。它使用包含 14,000 个健康和白血病细胞及其专家注释的全新数据集进行微调,相比生物医学基础模型 MedGEMMA 达到了更高的形态准确性。LoRA 调整方法进一步提高了性能并减少了计算成本。
Current Agents Fail to Leverage World Model as Tool for Foresight
Authors: Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
First: 2026-01-07T13:15:23+00:00 · Latest: 2026-01-07T13:15:23+00:00
Comments: 36 Pages, 13 Figures, 17 Tables
Abstract
Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
中文标题/摘要
标题:当前代理无法利用世界模型作为前瞻工具
基于视觉语言模型构建的代理越来越多地面临需要预测未来状态的任务,而不是依赖于短期推理。生成式世界模型提供了一种有希望的解决方案:代理可以使用它们作为外部模拟器,在行动前预见结果。本文实证研究了当前代理是否能够利用此类世界模型作为工具来增强其认知能力。在各种各样的代理和视觉问答任务中,我们观察到一些代理很少调用模拟(不到1%),频繁误用预测的推演结果(rollouts,约15%),并且在模拟可用或强制执行时,经常表现出不一致甚至退化的性能(最高5%)。进一步的归因分析表明,主要瓶颈在于代理决定何时模拟、如何解释预测结果以及如何将前瞻性纳入下游推理的能力。这些发现强调了需要机制来促进与世界模型的校准、战略性互动,为未来代理系统更可靠的前瞻性认知铺平道路。
Summary / 总结
This paper investigates whether current agents can effectively use generative world models as tools for foresight in tasks that require anticipating future states. Across diverse agentic and visual question answering tasks, the study finds that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often show inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. The main bottleneck is the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings highlight the need for mechanisms that foster calibrated, strategic interaction with world models.
该研究探讨了当前代理是否能够有效利用生成的世界模型来增强其在需要长期推理的任务中的预见能力。研究发现,一些代理很少使用模拟,许多代理错误地使用预测结果,而有些代理在模拟可用时甚至表现更差。主要问题在于代理无法决定何时进行模拟、如何解释预测结果以及如何将预见性融入后续推理中。这些发现强调了需要更好的机制来促进与世界模型的战略互动。
PM4Bench: Benchmarking Large Vision-Language Models with Parallel Multilingual Multi-Modal Multi-task Corpus
Authors: Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
First: 2025-03-24T09:38:37+00:00 · Latest: 2026-01-07T12:57:26+00:00
Comments: Equal contribution: Junyuan Gao, Jiahe Song, Jiang Wu; Corresponding author: Conghui He
Abstract
While Large Vision-Language Models (LVLMs) demonstrate promising multilingual capabilities, their evaluation is currently hindered by two critical limitations: (1) the use of non-parallel corpora, which conflates inherent language capability gaps with dataset artifacts, precluding a fair assessment of cross-lingual alignment; and (2) disjointed multimodal inputs, which deviate from real-world scenarios where most texts are embedded within visual contexts. To address these challenges, we propose PM4Bench, the first Multilingual Multi-Modal Multi-task Benchmark constructed on a strictly parallel corpus across 10 languages. By eliminating content divergence, our benchmark enables a fair comparison of model capabilities across different languages. We also introduce a vision setting where textual queries are visually fused into images, compelling models to jointly "see," "read," and "think". Extensive evaluation of 10 LVLMs uncovers a substantial performance drop in the Vision setting compared to standard inputs. Further analysis reveals that OCR capability is not only a general bottleneck but also contributes to cross-lingual performance disparities, suggesting that improving multilingual OCR is essential for advancing LVLM performance. We will release PM4Bench at https://github.com/opendatalab/PM4Bench .
中文标题/摘要
标题:PM4Bench:基于并行多语言多模态多任务语料库的大规模视觉语言模型基准测试
虽然大规模视觉语言模型(LVLMs)展示了有希望的多语言能力,但其评估目前受到两个关键限制的阻碍:(1)使用非并行语料库,这将固有的语言能力差距与数据集特征混淆,阻碍了跨语言对齐的公平评估;(2)分离的多模态输入,这与大多数文本嵌入在视觉上下文中的现实场景相偏离。为了解决这些挑战,我们提出了PM4Bench,这是第一个基于10种语言严格并行语料库构建的多语言多模态多任务基准测试。通过消除内容差异,我们的基准测试使不同语言的模型能力比较变得公平。我们还引入了一个视觉设置,其中文本查询与图像视觉融合,促使模型共同“看”、“读”和“思考”。对10个LVLMs的广泛评估发现,视觉设置中的性能显著低于标准输入。进一步分析表明,OCR能力不仅是普遍瓶颈,还导致了跨语言性能差异,这表明提高多语言OCR对于提升LVLM性能至关重要。我们将在https://github.com/opendatalab/PM4Bench 发布PM4Bench。
Summary / 总结
PM4Bench addresses the limitations of current benchmarks for Large Vision-Language Models (LVLMs) by using a strictly parallel multilingual corpus across 10 languages, which eliminates content divergence and allows for fair cross-lingual comparisons. The benchmark includes a vision setting where textual queries are visually fused into images, requiring models to jointly process visual and textual information. Experimental results show a significant performance drop in the Vision setting compared to standard inputs, indicating that OCR capability is a critical bottleneck for LVLMs and contributes to cross-lingual performance disparities.
PM4Bench 是一个针对10种语言的大型视觉语言模型(LVLM)的新基准,使用平行多语言多模态语料库进行评估。它通过确保公平的跨语言比较和现实的多模态输入来解决先前基准的局限性。对10种LVLM的评估显示,在视觉设置中的性能显著下降,突显了OCR能力对于跨语言性能的重要性。
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
Authors: Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, Ning Zhang
First: 2025-12-12T22:31:38+00:00 · Latest: 2026-01-07T11:45:25+00:00
Comments: 37 pages, 13 figures
Abstract
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on GitHub at https://github.com/sarendis56/Jailbreak_Detection_RCS.
中文标题/摘要
标题:用表示对比评分重新思考大型视觉语言模型的越狱检测
大型视觉-语言模型(LVLMs)容易受到日益增多的多模态越狱攻击,需要既能应对新型威胁又能在实际部署中高效的防御措施。当前许多策略要么针对特定攻击模式,限制了泛化能力,要么增加了巨大的计算开销。虽然轻量级异常检测方法很有前景,但我们发现它们常见的单类设计容易将新的良性输入误判为恶意输入,导致过度拒绝且不可靠。为解决这一问题,我们提出了表示对比评分(RCS),该框架基于一个关键洞察:最有效的安全信号存在于LVLM自身的内部表示中。我们的方法检查这些表示的内部几何结构,学习一个轻量级的投影,以在关键的安全层中最大化地分离良性输入和恶意输入。这使得一种简单而强大的对比评分能够区分真实的恶意意图和仅仅是新颖性。我们的实现,MCD(马氏对比检测)和KCD(K近邻对比检测),在用于测试对未见过的攻击类型的泛化的挑战性评估协议上达到了最先进的性能。这项工作表明,通过将简单的、可解释的统计方法应用于适当的内部表示,可以实现有效的越狱检测,为更安全的LVLM部署提供了一条实用的道路。我们的代码可在Github上获得 https://github.com/sarendis56/Jailbreak_Detection_RCS。
Summary / 总结
This paper addresses the vulnerability of Large Vision-Language Models (LVLMs) to multimodal jailbreak attacks by proposing Representational Contrastive Scoring (RCS), which uses internal representations to detect malicious inputs. RCS learns a lightweight projection to separate benign and malicious inputs, achieving state-of-the-art performance on a challenging evaluation protocol. The method, instantiated as MCD and KCD, offers a practical and efficient solution for jailbreak detection in LVLMs.
本文提出了一种名为Representational Contrastive Scoring (RCS)的方法,通过利用模型内部表示来检测恶意输入,以应对大型视觉-语言模型(LVLM)的漏洞。RCS学习一个轻量级的投影,以区分良性输入和恶意输入,该方法在通用性测试中达到了最先进的性能。该方法的实例化形式为MCD和KCD,提供了一种实用且高效的LVLM安全解决方案。
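As a rough illustration of the contrastive scoring idea in this entry, the sketch below compares an input's distance to benign reference representations with its distance to malicious ones. It is a minimal k-nearest-neighbour variant under stated assumptions: the function names (`knn_mean_distance`, `contrastive_score`), the toy 2-D vectors, and the exact score formula are hypothetical and are not taken from the paper or its repository, and the learned projection is omitted entirely.

```python
import math

def knn_mean_distance(x, refs, k):
    """Mean Euclidean distance from x to its k nearest reference vectors."""
    dists = sorted(math.dist(x, r) for r in refs)
    return sum(dists[:k]) / min(k, len(dists))

def contrastive_score(x, benign_refs, malicious_refs, k=2):
    """Hypothetical contrastive score (an assumption, not the paper's formula).

    Positive when x lies closer to the malicious references than to the
    benign ones, negative otherwise.
    """
    return knn_mean_distance(x, benign_refs, k) - knn_mean_distance(x, malicious_refs, k)

# Toy stand-ins for projected hidden states (illustrative values only).
benign = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]
malicious = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.2)]

novel_benign = (0.8, 0.9)   # unusual input, but still far from the attacks
attack = (2.7, 2.8)         # sits inside the malicious cluster

novel_score = contrastive_score(novel_benign, benign, malicious)
attack_score = contrastive_score(attack, benign, malicious)
```

A one-class detector would look only at the distance to the benign set and could flag `novel_benign` as anomalous; the contrastive form keeps its score negative because it is still much farther from the malicious references.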
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
Authors: Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, Spencer Whitehead
First: 2026-01-05T09:35:11+00:00 · Latest: 2026-01-07T11:21:44+00:00
Comments: Slightly modified format; added Table 3 for better illustration of the scaling results
Abstract
We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. First, to enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many other prior works on training visual web agents.
中文标题/摘要
标题:WebGym:通过现实任务扩展视觉网络代理的训练环境
我们介绍了WebGym,这是迄今为止最大的开源环境,用于训练现实的视觉网络代理。真实网站是非平稳且多样的,使得人工或小型任务集不足以实现稳健的策略学习。WebGym 包含近300,000个任务,涵盖多样化的实际网站和难度级别,并采用评分表评估。我们使用简单的强化学习(RL)方法训练代理,该方法利用代理自身的交互轨迹(rollouts)进行训练,并使用任务奖励作为反馈来指导学习。为了扩展RL,我们通过开发一种高吞吐量的异步rollout系统来加速WebGym中轨迹的采样,该系统专门针对网络代理设计。我们的系统相比朴素实现实现了4-5倍的rollout加速。其次,我们扩展了任务集的广度、深度和规模,这导致了持续的性能提升。在WebGym上微调强大的基础视觉-语言模型Qwen-3-VL-8B-Instruct后,我们在未见过的测试集上的成功率从26.2%提高到42.9%,显著优于基于专有模型GPT-4o和GPT-5-Thinking的代理,它们分别达到27.1%和29.8%。这一改进是显著的,因为我们的测试集仅包含训练期间从未见过的任务,不同于许多其他关于训练视觉网络代理的研究工作。
Summary / 总结
WebGym is an open-source environment for training visual web agents with nearly 300,000 diverse tasks. It uses reinforcement learning to train agents based on their own interaction traces, and an asynchronous rollout system to speed up the process by 4-5 times. Fine-tuning a vision-language model on WebGym significantly improves the success rate on unseen tasks, outperforming proprietary models like GPT-4o and GPT-5-Thinking.
WebGym 是一个包含近 30 万种多样任务的开放源代码环境,用于训练视觉网络代理。它使用强化学习基于代理自身的交互轨迹进行训练,并通过异步轨迹系统将过程加速 4-5 倍。通过在 WebGym 上微调视觉语言模型,显著提高了在未见过的任务上的成功率,超过了如 GPT-4o 和 GPT-5-Thinking 这样的专有模型。
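The asynchronous rollout collection mentioned in this entry can be pictured with a small `asyncio` sketch. This is only an illustration under stated assumptions: `run_episode`, `collect_rollouts`, the simulated latency, and the random reward are hypothetical placeholders, not WebGym's actual API or system design.

```python
import asyncio
import random

async def run_episode(task_id, rng):
    """Hypothetical stand-in for one browser episode.

    The sleep simulates page-load latency; the reward is random.
    """
    await asyncio.sleep(rng.uniform(0.001, 0.005))
    return {"task": task_id, "reward": float(rng.random() < 0.5)}

async def collect_rollouts(task_ids, max_concurrency=8, seed=0):
    """Run episodes concurrently, capping how many are in flight at once."""
    rng = random.Random(seed)
    limiter = asyncio.Semaphore(max_concurrency)

    async def guarded(task_id):
        async with limiter:
            return await run_episode(task_id, rng)

    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*(guarded(t) for t in task_ids))

rollouts = asyncio.run(collect_rollouts(range(20)))
```

Because each episode spends most of its time waiting on I/O, overlapping the waits is where a speedup over a sequential loop would come from; the concurrency cap keeps the number of open browser sessions bounded.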
Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
Authors: Zabir Al Nazi, GM Shahariar, Md. Abrar Hossain, Wei Peng
First: 2025-12-19T09:47:38+00:00 · Latest: 2026-01-07T09:58:06+00:00
Abstract
Theory of Mind (ToM) - the ability to attribute beliefs and intents to others - is fundamental for social intelligence, yet Vision-Language Model (VLM) evaluations remain largely Western-centric. In this work, we introduce CulturalToM-VQA, a benchmark of 5,095 visually situated ToM probes across diverse cultural contexts, rituals, and social norms. Constructed through a human-verified pipeline built on a frontier proprietary MLLM, the dataset spans a taxonomy of six ToM tasks and four complexity levels. We benchmark 10 VLMs (2023-2025) and observe a significant performance leap: while earlier models struggle, frontier models achieve high accuracy (>93%). However, significant limitations persist: models struggle with false belief reasoning (19-83% accuracy) and show high regional variance (20-30% gaps). Crucially, we find that SOTA models exhibit social desirability bias - systematically favoring semantically positive answer choices over negative ones. Ablation experiments reveal that some frontier models rely heavily on parametric social priors, frequently defaulting to safety-aligned predictions. Furthermore, while Chain-of-Thought prompting aids older models, it yields minimal gains for newer ones. Overall, our work provides a testbed for cross-cultural social reasoning, underscoring that despite architectural gains, achieving robust, visually grounded understanding remains an open challenge.
中文标题/摘要
标题:视觉语言模型是跨文化心智理论推理者吗?
心智理论(ToM)——赋予他人信念和意图的能力——是社会智能的基础,然而视觉-语言模型(VLM)的评估仍然主要集中在西方文化上。在本研究中,我们引入了CulturalToM-VQA,这是一个包含5,095个跨多种文化背景、仪式和社会规范的视觉情境ToM测试题的基准数据集。该数据集通过前沿的专有MLLM和人工验证管道构建,涵盖了六种ToM任务和四种复杂度级别。我们对10个VLM(2023-2025年)进行了基准测试,并观察到显著的性能提升:尽管早期模型表现不佳,前沿模型的准确率超过93%。然而,仍存在显著的局限性:模型在错误信念推理方面表现不佳(19%-83%的准确率),并且显示出高区域差异(20%-30%的差距)。我们发现,SOTA模型表现出社会可接受性偏见——系统地倾向于选择语义上积极的答案选项而非消极选项。消融实验表明,一些前沿模型高度依赖参数化的社会先验,经常默认安全对齐的预测。此外,虽然链式思考提示对较旧的模型有所帮助,但对较新的模型几乎没有增益。总体而言,我们的工作提供了一个跨文化社会推理的测试平台,强调尽管架构有所改进,实现稳健、视觉接地的理解仍然是一个开放的挑战。
Summary / 总结
This study introduces CulturalToM-VQA, a benchmark for evaluating Vision-Language Models (VLMs) on Theory of Mind (ToM) tasks across diverse cultural contexts. The dataset includes 5,095 visually situated ToM probes and reveals significant performance improvements in newer VLMs, with high accuracy (>93%) but persistent challenges in false belief reasoning and regional variance. Notably, state-of-the-art models show social desirability bias, favoring semantically positive answers. The research highlights the ongoing challenge of achieving robust, cross-cultural social reasoning in VLMs despite architectural advancements.
该研究在CulturalToM-VQA基准上评估了视觉语言模型(VLMs),该基准包含5,095个跨文化背景的视觉情境中的心智理论(ToM)问题。模型显示出显著的性能提升,前沿模型的准确率超过93%,但仍难以处理错误信念推理,并表现出社会偏好偏差。消融实验表明,一些模型依赖于参数化的社会先验,并且链式思考提示对较旧的模型有帮助但对较新的模型效果有限。该工作强调了在VLMs中实现稳健的跨文化社会推理仍然是一项挑战。
RadDiff: Describing Differences in Radiology Image Sets with Natural Language
Authors: Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy
First: 2026-01-07T09:25:04+00:00 · Latest: 2026-01-07T09:25:04+00:00
Abstract
Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.
中文标题/摘要
标题:RadDiff:使用自然语言描述放射影像集之间的差异
理解两个放射影像集之间的差异对于生成临床见解和解释医疗AI系统至关重要。我们介绍了RadDiff,这是一种多模态代理系统,能够执行放射科医生风格的比较推理,以描述配对放射学研究之间的临床意义差异。RadDiff 基于 VisDiff 的提案-排名框架,并结合了四个受实际诊断工作流程启发的创新:(1)通过领域适应的视觉-语言模型注入医学知识;(2)多模态推理,将图像与其临床报告相结合;(3)在多轮推理中迭代假设细化;以及(4)针对视觉搜索,定位并放大关键区域以捕捉细微发现。为了评估 RadDiff,我们构建了 RadDiffBench,这是一个具有 57 对专家验证的放射学研究配对的具有挑战性的基准,其中包含真实差异描述的地面真相。在 RadDiffBench 上,RadDiff 达到了 47% 的准确率,并且在使用真实报告引导时达到了 50% 的准确率,显著优于通用领域的 VisDiff 基准。我们进一步展示了 RadDiff 在各种临床任务中的灵活性,包括 COVID-19 表型比较、种族亚组分析以及生存相关影像特征的发现。总体而言,RadDiff 和 RadDiffBench 为系统地揭示放射学数据中的有意义差异提供了第一个方法和基准基础。
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail
Authors: NVIDIA: Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, Liang Feng, Greg Heinrich, Jack Huang, Peter Karkus, Boyi Li, Pinyi Li, Tsung-Yi Lin, Dongran Liu, Ming-Yu Liu, Langechuan Liu, Zhijian Liu, Jason Lu, Yunxiang Mao, Pavlo Molchanov, Lindsey Pavao, Zhenghao Peng, Mike Ranzinger, Ed Schmerling, Shida Shen, Yunfei Shi, Sarah Tariq, Ran Tian, Tilman Wekel, Xinshuo Weng, Tianjun Xiao, Eric Yang, Xiaodong Yang, Yurong You, Xiaohui Zeng, Wenyuan Zhang, Boris Ivanovic, Marco Pavone
First: 2025-10-30T01:25:34+00:00 · Latest: 2026-01-07T09:09:57+00:00
Abstract
End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios where supervision is sparse and causal understanding is limited. We introduce Alpamayo-R1 (AR1), a vision-language-action model (VLA) that integrates Chain of Causation reasoning with trajectory planning for complex driving scenarios. Our approach features three key innovations: (1) the Chain of Causation (CoC) dataset, built through a hybrid auto-labeling and human-in-the-loop pipeline producing decision-grounded, causally linked reasoning traces aligned with driving behaviors; (2) a modular VLA architecture combining Cosmos-Reason, a vision-language model pre-trained for Physical AI, with a diffusion-based trajectory decoder that generates dynamically feasible trajectories in real time; (3) a multi-stage training strategy using supervised fine-tuning to elicit reasoning and reinforcement learning (RL) to enforce reasoning-action consistency and optimize reasoning quality. AR1 achieves up to a 12% improvement in planning accuracy on challenging cases compared to a trajectory-only baseline, with a 35% reduction in close encounter rate in closed-loop simulation. RL post-training improves reasoning quality by 45% and reasoning-action consistency by 37%. Model scaling from 0.5B to 7B parameters shows consistent improvements. On-vehicle road tests confirm real-time performance (99 ms latency) and successful urban deployment. By bridging interpretable reasoning with precise control, AR1 demonstrates a practical path towards Level 4 autonomous driving. Model weights are available at https://huggingface.co/nvidia/Alpamayo-R1-10B with inference code at https://github.com/NVlabs/alpamayo.
中文标题/摘要
标题:Alpamayo-R1:在长尾场景中将推理与行动预测相结合以实现通用自主驾驶
通过模仿学习训练的端到端架构通过扩展模型规模和数据提高了自主驾驶能力,但在安全关键的长尾场景中,由于监督稀少和因果理解有限,性能仍然脆弱。我们引入了Alpamayo-R1(AR1),一种结合因果链推理与轨迹规划的视觉-语言-行动模型(VLA)。我们的方法包括三个关键创新:(1) 通过混合自动标注和人工在环管道构建的因果链(CoC)数据集,生成与驾驶行为一致的决策导向、因果链接的推理轨迹;(2) 结合预训练的物理AI视觉-语言模型Cosmos-Reason和基于扩散的轨迹解码器的模块化VLA架构,实时生成动态可行的轨迹;(3) 多阶段训练策略,使用监督微调激发推理,并使用强化学习(RL)确保推理与行动的一致性,优化推理质量。AR1在具有挑战性的案例中规划准确性提高了12%,闭环仿真中近距离相遇率降低了35%。RL后训练提高了推理质量45%,推理与行动一致性37%。从0.5B到7B参数的模型扩展显示了持续改进。车载道路测试证实了实时性能(99 ms延迟)和城市部署的成功。通过将可解释的推理与精确控制相结合,AR1展示了向L4级自主驾驶的实用路径。模型权重可在https://huggingface.co/nvidia/Alpamayo-R1-10B获取,推理代码在https://github.com/NVlabs/alpamayo/。
BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion
Authors: Qingyao Tian, Bingyu Yang, Huai Liao, Xinyan Huang, Junyong Li, Dong Yi, Hongbin Liu
First: 2026-01-07T09:00:52+00:00 · Latest: 2026-01-07T09:00:52+00:00
Abstract
Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate 6-DoF pose estimation. Our motivation lies in the complementary strengths of both approaches: VLMs offer generalizable semantic understanding, while registration methods provide precise geometric alignment. To further enhance the VLM's ability to capture temporal context, we introduce a lightweight context-learning mechanism that encodes motion history as linguistic prompts, enabling efficient temporal reasoning without expensive video-level computation. Extensive experiments demonstrate that the vision-language module delivers robust semantic localization in challenging surgical scenes. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline, while achieving competitive computational latency.
中文标题/摘要
标题:BREATH-VL:基于语义几何融合的6自由度支气管镜定位
视觉语言模型(VLMs)在利用大规模预训练进行语义理解方面已经在导航和定位任务中表现出色。然而,将VLMs应用于6自由度内窥镜摄像头定位存在几个挑战:1)缺乏大规模、高质量、密集标注且面向定位的现实医疗环境中的视觉语言数据集;2)细粒度姿态回归能力有限;3)从过去帧中提取时序特征时计算延迟高。为了解决这些问题,我们首先构建了BREATH数据集,这是迄今为止最大的体内内窥镜定位数据集,收集于复杂的真人气道中。基于此数据集,我们提出了一种混合框架BREATH-VL,该框架将VLMs提供的语义线索与基于视觉注册方法的几何信息相结合,以实现准确的6自由度姿态估计。我们的动机在于这两种方法的互补优势:VLMs提供通用的语义理解,而注册方法提供精确的几何对齐。为了进一步增强VLMs捕捉时序上下文的能力,我们引入了一种轻量级的上下文学习机制,将运动历史编码为语言提示,从而实现高效的时序推理,而无需昂贵的视频级计算。大量实验表明,视觉语言模块在复杂的手术场景中提供了稳健的语义定位。在此基础上,我们的BREATH-VL在准确性和泛化能力上均优于最先进的仅视觉定位方法,与最佳基线相比,平移误差降低了25.5%,同时实现了具有竞争力的计算延迟。
Summary / 总结
The research aims to improve 6-DoF localization of endoscopic cameras in bronchoscopy by leveraging the strengths of vision-language models (VLMs) and geometric registration methods. To address the challenges of limited annotated datasets and computational latency, the authors constructed the BREATH dataset and proposed BREATH-VL, a hybrid framework that fuses semantic cues from VLMs with geometric information. By introducing a lightweight context-learning mechanism, the system enhances the VLM's ability to capture temporal context. Experimental results show that BREATH-VL outperforms state-of-the-art vision-only methods, reducing translational error by 25.5% and maintaining competitive computational latency.
研究旨在通过结合视觉语言模型(VLMs)的语义线索和基于视觉的注册方法的几何信息,提高内窥镜摄像头在医疗环境中的6-DoF定位。为了解决标注数据有限和计算延迟的问题,作者构建了BREATH数据集,并引入了一种轻量级的上下文学习机制来增强VLMs的时序推理能力。实验结果表明,BREATH-VL 在准确性和泛化能力上优于最先进的视觉仅方法,将平移误差降低了25.5%,同时保持了竞争力的计算延迟。
FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing
Authors: Jinya Sakurai, Yuki Koyama, Issei Sato
First: 2025-02-06T07:22:57+00:00 · Latest: 2026-01-07T08:59:53+00:00
Abstract
Text-to-image (T2I) models have advanced creative content generation, yet their reliance on large uncurated datasets often reproduces societal biases. We present FairT2I, a training-free and interactive framework grounded in a mathematically principled latent variable guidance formulation. This formulation decomposes the generative score function into attribute-conditioned components and reweights them according to a defined distribution, providing a unified and flexible mechanism for bias-aware generation that also subsumes many existing ad hoc debiasing approaches as special cases. Building upon this foundation, FairT2I incorporates (1) latent variable guidance as the core mechanism, (2) LLM-based bias detection to automatically infer bias-prone categories and attributes from text prompts as part of the latent structure, and (3) attribute resampling, which allows users to adjust or redefine the attribute distribution based on uniform, real-world, or user-specified statistics. The accompanying user interface supports this pipeline by enabling users to inspect detected biases, modify attributes or weights, and generate debiased images in real time. Experimental results show that LLMs outperform average human annotators in the number and granularity of detected bias categories and attributes. Moreover, FairT2I achieves superior performance to baseline models in both societal bias mitigation and image diversity, while preserving image quality and prompt fidelity.
中文标题/摘要
标题:FairT2I:通过大型语言模型辅助检测和属性重平衡减轻文本到图像生成中的社会偏见
文本到图像(T2I)模型在创意内容生成方面取得了进展,但它们对大量未加筛选的数据集的依赖往往会导致社会偏见的再现。我们提出了FairT2I,这是一种无需训练且交互式的框架,基于一个数学上合理的潜在变量指导公式。该公式将生成得分函数分解为属性条件分量,并根据定义的分布重新加权,提供了一种统一且灵活的偏见感知生成机制,该机制还涵盖了众多现有的临时去偏见方法。在此基础上,FairT2I 包含(1)潜在变量指导作为核心机制,(2)基于LLM的偏见检测,自动从文本提示中推断出易受偏见影响的类别和属性,作为潜在结构的一部分,以及(3)属性重采样,允许用户根据均匀、现实世界或用户指定的统计数据调整或重新定义属性分布。伴随的用户界面支持此管道,使用户能够检查检测到的偏见、修改属性或权重,并实时生成去偏见图像。实验结果表明,LLM在检测到的偏见类别和属性的数量和精细度上优于普通的人类注释者。此外,FairT2I 在社会偏见缓解和图像多样性方面均优于基线模型,同时保持了图像质量和提示保真度。
Summary / 总结
FairT2I is a training-free framework that mitigates social biases in text-to-image generation by using a latent variable guidance formulation and large language model (LLM)-based bias detection. It reweights attribute-conditioned components to generate bias-aware images and allows users to adjust attribute distributions. Experimental results show that LLMs outperform human annotators in detecting bias categories and attributes, and FairT2I outperforms baseline models in societal bias mitigation and image diversity while maintaining image quality and prompt fidelity.
FairT2I 是一个无需训练的框架,通过大型语言模型辅助检测和属性重新平衡来减轻文本到图像生成中的社会偏见。它将生成得分函数分解并根据定义的分布重新加权组件,提供了一种统一的机制来进行有意识的偏见生成。实验结果表明,FairT2I 在社会偏见缓解和图像多样性方面优于基线模型,同时保持了图像质量和提示的一致性。还显示,LLM 在检测偏见类别和属性方面比人类注释者更有效。
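The attribute-resampling step in this entry can be illustrated with a deliberately simplified sketch. Note the swap: the paper describes reweighting attribute-conditioned components of the generative score function, whereas the code below only samples an attribute value per image from a target distribution and appends it to the prompt. The function name `resample_attributes`, the example prompt, and the 50/50 target distribution are all hypothetical.

```python
import random
from collections import Counter

def resample_attributes(prompt, attribute_dist, n_images, seed=0):
    """Draw one attribute value per image from the target distribution
    and attach it to the prompt (prompt-level simplification, an assumption)."""
    rng = random.Random(seed)
    values = list(attribute_dist)
    weights = [attribute_dist[v] for v in values]
    draws = rng.choices(values, weights=weights, k=n_images)
    return [f"{prompt}, {value}" for value in draws]

# A uniform target; real-world or user-specified statistics would be
# passed in the same way.
target = {"female": 0.5, "male": 0.5}
prompts = resample_attributes("a portrait photo of a doctor", target, n_images=1000)

# Recover the sampled attribute from each augmented prompt and tally it.
counts = Counter(p.rsplit(", ", 1)[1] for p in prompts)
```

Changing `target` is the only thing a user would need to do to move between the uniform, real-world, and custom distributions that the abstract mentions.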
Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis
Authors: Jingguo Qu, Xinyang Han, Jia Ai, Juan Wu, Tong Zhao, Tonghuan Xiao, Sheng Ning, Yuqi Yang, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying
First: 2025-06-10T14:37:51+00:00 · Latest: 2026-01-07T07:58:47+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities, yet their application to medical ultrasound remains constrained by the significant domain shift between natural images and sonographic data. The unique physics of ultrasound, manifesting as speckle noise, shadowing, and variable artifacts, often leads to suboptimal performance when applying off-the-shelf foundation models. To address this, we propose a novel Hybrid-tuning (HT) strategy for the efficient adaptation of CLIP-based models to ultrasound analysis. Our method introduces a lightweight adapter module integrated into the frozen visual backbone, featuring frequency-domain filtering to suppress periodic artifacts and dynamic noise estimation to calibrate feature representations. Furthermore, we design specialized segmentation and classification heads that employ multi-scale feature aggregation to maximize the utility of pre-trained semantic priors. Extensive evaluations across six multi-center datasets (covering lymph nodes, breast, thyroid, and prostate) reveal that our HT-enhanced models significantly outperform existing state-of-the-art methods, including BiomedCLIP and standard LoRA fine-tuning. The results highlight the superior data efficiency and robustness of our approach, paving the way for practical, foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
中文标题/摘要
标题:适应下一代医学超声图像分析的视觉-语言基础模型调整
视觉-语言模型(VLMs)展示了出色的泛化能力,但在应用于医学超声时,由于自然图像与声学数据之间存在显著的领域差异,其应用受到限制。超声特有的物理特性,如斑点噪声、阴影和可变伪影,常常导致在使用现成的基础模型时性能不佳。为了解决这一问题,我们提出了一种新的混合调优(HT)策略,用于高效地将CLIP基础模型调整到超声分析中。我们的方法引入了一个轻量级的适配器模块,集成到冻结的视觉主干中,采用频域滤波来抑制周期性伪影,并采用动态噪声估计来校准特征表示。此外,我们设计了专门的分割和分类头,利用多尺度特征聚合来最大化预训练语义先验的效用。在六个多中心数据集(涵盖淋巴结、乳腺、甲状腺和前列腺)上的广泛评估表明,我们的HT增强模型显著优于现有的最先进的方法,包括BiomedCLIP和标准LoRA微调。结果突显了我们方法的优越的数据效率和鲁棒性,为自动化超声诊断中的基础智能应用铺平了道路。源代码可在https://github.com/jinggqu/NextGen-UIA/获得。
Summary / 总结
This study aims to enhance the application of Vision-Language Models (VLMs) in medical ultrasound image analysis by addressing the domain shift between natural images and sonographic data. The authors propose a Hybrid-tuning (HT) strategy that introduces a lightweight adapter module with frequency-domain filtering and dynamic noise estimation to suppress artifacts and calibrate feature representations. The method also includes specialized segmentation and classification heads that aggregate multi-scale features. Experimental results across six datasets show that the HT-enhanced models outperform existing state-of-the-art methods, demonstrating superior data efficiency and robustness.
研究旨在通过解决自然图像与超声成像数据之间的领域差异,改进Vision-Language Models (VLMs)在医学超声图像分析中的应用。提出的Hybrid-tuning (HT)策略引入了一个轻量级的适配模块,结合频域滤波和动态噪声估计来适应CLIP基模型。实验结果显示,HT增强的模型在六个数据集上优于现有方法,展示了在自动化超声诊断中的优越的数据效率和鲁棒性。
e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings
Authors: Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou
First: 2026-01-07T07:39:40+00:00 · Latest: 2026-01-07T07:39:40+00:00
Abstract
Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
中文标题/摘要
标题:e5-omni:显式跨模态对齐的全模态嵌入
现代信息系统通常涉及不同类型的信息项,例如文本查询、图像、视频片段或音频片段。这促使了全模态嵌入模型的发展,将异构模态映射到共享空间进行直接比较。然而,大多数最新的全模态嵌入仍然严重依赖于从预训练的视觉-语言模型(VLM)骨干继承的隐式对齐。实践中,这导致了三个常见问题:(i)相似度logits具有模态依赖的锐度,因此分数不在一致的尺度上;(ii)批次内负样本随时间变得不那么有效,因为混合模态批次创建了一个不平衡的难度分布;结果,许多负样本很快变得简单且贡献很少梯度;(iii)模态之间的嵌入显示出不匹配的一阶和二阶统计,这使得排名不够稳定。为了解决这些问题,我们提出了e5-omni,这是一种轻量级的显式对齐方法,将现成的VLM适应为稳健的全模态嵌入模型。e5-omni结合了三个简单的组件:(1)模态感知的温度校准以对齐相似度尺度,(2)可控的负样本课程学习以减轻误负样本的影响并集中于混淆的负样本,以及(3)批次白化和协方差正则化以更好地匹配共享嵌入空间中的跨模态几何。在MMEB-V2和AudioCaps上的实验显示,e5-omni在强大的双模态和全模态基线之上具有一致的改进,并且相同的配方也很好地转移到了其他VLM骨干上。我们将在https://huggingface.co/Haon-Chen/e5-omni-7B/发布我们的模型检查点。
Summary / 总结
The paper addresses the challenges of implicit alignment in omni-modal embedding models, which can lead to inconsistent similarity scales, imbalanced negative hardness, and mismatched statistics across modalities. To solve these issues, the authors propose e5-omni, which includes modality-aware temperature calibration, a controllable negative curriculum, and batch whitening with covariance regularization. Experiments on MMEB-V2 and AudioCaps demonstrate consistent improvements over existing bi-modal and omni-modal baselines, and the method is adaptable to different VLM backbones.
论文针对当前多模态嵌入中存在的模态依赖性锐度、硬度分布不平衡以及统计匹配不一致等问题进行了研究。提出了一种轻量级方法e5-omni,包括模态感知温度校准、可控负样本课程以及批量白化和协方差正则化。在MMEB-V2和AudioCaps上的实验表明,该方法能够一致地改进现有的双模态和多模态基线模型,并且该方法适用于不同的VLM基座。
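To make the first component of this entry concrete, the sketch below divides each similarity by a temperature tied to the candidate's modality before the softmax, so that scores from different modalities land on a comparable scale. It is a minimal sketch under stated assumptions: the function names, the toy embeddings, and the fixed temperature values are hypothetical, and how e5-omni actually parameterises or learns its temperatures is not specified in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def calibrated_probs(query, candidates, temperatures):
    """Softmax over cosine similarities, each divided by the temperature
    assigned to its candidate's modality."""
    logits = [cosine(query, vec) / temperatures[mod] for mod, vec in candidates]
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative temperatures only; a smaller value sharpens that modality's logits.
temperatures = {"image": 0.05, "audio": 0.10}
candidates = [
    ("image", (1.0, 0.1)),
    ("audio", (0.9, 0.3)),
    ("image", (0.0, 1.0)),
]
probs = calibrated_probs((1.0, 0.0), candidates, temperatures)
```

With a single shared temperature, a modality whose similarities are systematically flatter would be under-ranked; per-modality temperatures are one simple way to compensate.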
CaTS-Bench: Can Language Models Describe Time Series?
Authors: Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu
First: 2025-09-25T07:10:03+00:00 · Latest: 2026-01-07T07:03:34+00:00
Comments: 8 pages, 6 figures, 3 tables in the main paper. Many more in the appendix
Abstract
Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce CaTS-Bench, a comprehensive benchmark for Context-aware Time Series reasoning across 11 diverse domains, centered on a gold-standard evaluation set of 1746 human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of 910 multiple-choice questions and tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal language generation in numeric domains.
中文标题/摘要
标题:CaTS-Bench:语言模型能否描述时间序列?
时间序列描述任务要求在自然语言中描述时间序列,需要数字和时间推理、趋势解释和上下文理解。然而,现有的基准测试往往依赖于完全合成或通用的描述,通常忽略了元数据和视觉表示。我们引入了**CaTS-Bench**,这是一个涵盖11个不同领域的全面基准测试,专注于1746个人类修订的描述,这些描述衡量了模型如何将数字趋势转化为易于理解的叙述。为了应对人类标注数据的稀缺性,我们还提出了一种可扩展的生成高质量合成描述的管道,并验证了其质量。我们在基准测试上评估了领先的语言-视觉模型,发现即使是专有模型也难以捕捉时间描述中的数字细微差别,而使用合成数据微调开源模型则能显著提高性能。最后,我们发布了一套包含910个选择题和定制化数字指标的诊断工具包,以评估时间序列特定的推理能力,从而确立CaTS-Bench作为数值领域可靠的基础,用于基于现实的多模态语言生成。
Summary / 总结
CaTS-Bench is a new benchmark for time series captioning that evaluates models' ability to describe numeric trends in natural language across 11 diverse domains. It uses 1746 human-rewritten captions to measure how well models translate time series data into interpretable narratives. The benchmark includes a scalable pipeline for generating synthetic captions and a diagnostic suite of 910 multiple-choice questions. The evaluation shows that even proprietary models struggle with temporal descriptions, but finetuning open-source models on synthetic data improves performance significantly.
CaTS-Bench 是一个用于时间序列描述的新基准,评估模型在11个不同领域中如何用自然语言描述数值趋势。它使用1746个人类重写后的描述来衡量模型将时间序列数据转化为可理解叙述的能力。基准还包括一个生成合成描述的可扩展管道,以及包含910个多选题的诊断套件。评估结果显示,即使是专有模型也难以处理时间描述,但通过对合成数据进行微调,开源模型的性能显著提高。
V-Agent: An Interactive Video Search System Using Vision-Language Models
Authors: SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju
First: 2025-11-04T07:24:45+00:00 · Latest: 2026-01-07T06:16:41+00:00
Comments: CIKM 2025 MMGENSR Workshop
Abstract
We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications. The retrieval model and demo videos are available at https://huggingface.co/NCSOFT/multimodal-embedding.
中文标题/摘要
标题:V-Agent:一种使用视觉语言模型的交互式视频搜索系统
我们介绍了V-Agent,这是一种新型多智能体平台,用于高级视频搜索和交互式用户-系统对话。通过使用小型视频偏好数据集微调视觉语言模型(VLM),并结合图像-文本检索模型的检索向量进行增强,我们克服了传统基于文本的检索系统在多模态场景中的局限性。基于VLM的检索模型独立地将视频帧和自动语音识别(ASR)模块生成的音频转录嵌入到共享的多模态表示空间中,使V-Agent能够解释视觉和口头内容,实现上下文感知的视频搜索。该系统由三个智能体——路由智能体、搜索智能体和聊天智能体——协作工作,通过细化搜索输出和与用户交流来解决用户意图。搜索智能体利用基于VLM的检索模型以及额外的重排序模块,进一步提高视频检索质量。我们提出的框架在MultiVENT 2.0基准测试中展示了最先进的零样本性能,突显了其在学术研究和实际应用中的潜力。检索模型和演示视频可在https://huggingface.co/NCSOFT/multimodal-embedding获取。
Summary / 总结
V-Agent is a multi-agent platform that improves video search and interactive user-system conversations by fine-tuning a vision-language model with a small video preference dataset and enhancing it with an image-text retrieval model. It uses a VLM-based retrieval model to embed video frames and audio transcriptions into a shared multimodal space, enabling context-aware video search. The system includes three agents that collaborate to refine search outputs and communicate with users, with the search agent using a re-ranking module to enhance video retrieval quality. V-Agent shows state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, indicating its potential for both academic and practical applications.
V-Agent 是一个使用微调后的视觉-语言模型来增强视频搜索和用户-系统互动的多代理平台。通过将视频帧和自动语音识别模块生成的音频转录嵌入到共享的多模态表示空间中,V-Agent 可以理解和解释视觉和语音内容,实现上下文相关的视频搜索。该系统包括路由代理、搜索代理和聊天代理,它们协同工作以改进搜索输出并与用户交流。搜索代理使用基于 VLM 的检索模型和额外的重排序模块来提高视频检索质量。V-Agent 在 MultiVENT 2.0 基准测试中展示了最先进的零样本性能,表明其在学术研究和实际应用中的潜力。
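The retrieve-then-re-rank flow described for the search agent can be sketched as below. This is a minimal illustration, not the authors' implementation: the dictionary layout, the `retrieve` and `rerank` helpers, and the choice to score a video by the better of its frame and transcript similarities are all assumptions, since the abstract does not say how the two embeddings are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, videos, top_k):
    """First-stage retrieval in a shared embedding space.

    Each video is a dict with hypothetical keys "id", "frame_vec" and
    "transcript_vec". Assumption: a video's score is the larger of its
    frame and transcript similarities to the query.
    """
    scored = []
    for video in videos:
        score = max(
            cosine(query_vec, video["frame_vec"]),
            cosine(query_vec, video["transcript_vec"]),
        )
        scored.append((score, video["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


def rerank(candidate_ids, rerank_score):
    """Second stage: reorder candidates by a separate scoring function."""
    return sorted(candidate_ids, key=rerank_score, reverse=True)


videos = [
    {"id": "a", "frame_vec": [1.0, 0.0], "transcript_vec": [0.0, 1.0]},
    {"id": "b", "frame_vec": [0.0, 1.0], "transcript_vec": [1.0, 1.0]},
    {"id": "c", "frame_vec": [0.0, 1.0], "transcript_vec": [0.0, 1.0]},
]
candidates = retrieve([1.0, 0.0], videos, top_k=2)

# Stand-in for a learned re-ranking module (hypothetical scores).
toy_scores = {"a": 0.2, "b": 0.9}
final_order = rerank(candidates, lambda video_id: toy_scores[video_id])
```

Splitting the work this way keeps the first stage cheap (one similarity per embedding) and reserves the costlier re-ranking model for a short candidate list.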
Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions
Authors: Zhongbin Guo, Zhen Yang, Yushan Li, Xinyue Zhang, Wenyu Gao, Jiacheng Wang, Chengzhi Li, Xiangrui Liu, Ping Jian
First: 2026-01-07T05:13:52+00:00 · Latest: 2026-01-07T05:13:52+00:00
Abstract
Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in localized semantic tasks, a significant "spatial gap" remains in global consistency. Notably, we find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential. Our proposed dataset SiT-Bench serves as a foundational resource to foster the development of spatially-grounded LLM backbones for future VLMs and embodied agents. Our code and benchmark will be released at https://github.com/binisalegend/SiT-Bench .
中文标题/摘要
标题:LLM能不依赖像素看到吗?基于文本描述的空间智能基准测试
近年来,空间智能(SI)的进步主要依赖于视觉语言模型(VLMs),但一个关键问题仍然存在:空间理解是源自视觉编码器还是基础的推理核心?受此问题的启发,我们引入了SiT-Bench,这是一种新型基准测试,旨在评估大型语言模型(LLMs)的空间智能性能,而不依赖像素级输入,包含超过3,800个专家标注的项目,涵盖五个主要类别和17个子任务,从第一人称导航和视角转换到精细的机器人操作。通过将单视图或多视图场景转换为高保真、坐标感知的文本描述,我们挑战LLMs进行符号文本推理,而不是视觉模式匹配。最先进的(SOTA)LLMs的评估结果表明,虽然模型在局部语义任务中表现出色,但在全局一致性方面仍存在显著的“空间差距”。值得注意的是,我们发现明确的空间推理显著提高了性能,表明LLMs具有潜在的世界建模能力。我们提出的SiT-Bench数据集作为基础资源,促进了空间化大型语言模型骨干的发展,为未来的VLMs和具身代理奠定了基础。我们的代码和基准测试将在https://github.com/binisalegend/SiT-Bench 发布。
Summary / 总结
This study investigates whether Large Language Models (LLMs) can demonstrate spatial intelligence without visual input, introducing SiT-Bench, a benchmark with over 3,800 expert-annotated tasks. By converting scenes into textual descriptions, the study evaluates LLMs' ability to perform symbolic reasoning rather than visual pattern matching. The results show that while LLMs excel in localized tasks, they struggle with global consistency, indicating a 'spatial gap.' Explicit spatial reasoning significantly improves performance, suggesting LLMs have potential for world modeling.
研究探讨了大型语言模型(LLMs)在没有视觉输入的情况下能否实现空间智能,引入了包含超过3,800个专家标注项目的SiT-Bench基准。通过将视觉场景转换为文本描述,研究评估了LLMs进行符号推理而非视觉模式匹配的能力。结果显示,虽然LLMs在局部任务上表现出色,但在全局一致性方面存在显著的“空间差距”。明确的空间推理可以提高性能,表明LLMs具有潜在的世界建模能力。
LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning
Authors: Beomseok Kang, Jiwon Song, Jae-Joon Kim
First: 2025-10-16T01:37:39+00:00 · Latest: 2026-01-07T03:47:14+00:00
Abstract
Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage outperforms prior training-free layer skipping methods.
中文标题/摘要
标题:LiteStage:针对多阶段推理的延迟感知层跳过方法
多阶段推理已成为通过将复杂问题分解为顺序子阶段来增强小型语言模型推理能力的有效策略。然而,这会导致延迟增加。我们观察到,现有的自适应加速技术,如层跳过,在这种情况下难以在效率和准确性之间取得平衡,主要由于两个关键挑战:(1)阶段间跳过敏感度的差异,(2)生成冗余输出标记。为了解决这些问题,我们提出了LiteStage,这是一种针对多阶段推理的延迟感知层跳过框架。LiteStage结合了阶段间离线搜索,以分配最优层预算,以及基于置信度的在线生成早期退出,以抑制不必要的解码。在三个基准测试,例如OBQA、CSQA和StrategyQA上的实验表明,LiteStage优于先前的无训练层跳过方法。
Summary / 总结
The research aims to enhance the efficiency of multi-stage reasoning in small language models by addressing the increased latency. LiteStage, a latency-aware layer skipping framework, is proposed to tackle the challenges of varying skip sensitivity across stages and redundant output tokens. The framework uses a stage-wise offline search to allocate optimal layer budgets and an online confidence-based generation early exit to reduce unnecessary decoding. Experimental results on OBQA, CSQA, and StrategyQA show that LiteStage outperforms previous training-free layer skipping methods.
研究旨在通过解决增加的延迟问题,提高小语言模型中多阶段推理的效率。提出了一种名为LiteStage的延迟感知层跳过框架,以应对阶段间跳过敏感度的差异和冗余输出令牌的问题。实验结果显示,LiteStage在OBQA、CSQA和StrategyQA等基准上优于之前的无训练层跳过方法。
Sortblock: Similarity-Aware Feature Reuse for Diffusion Model
Authors: Hanqi Chen, Xu Zhang, Xiaoliu Guan, Lielin Jiang, Guanzhong Wang, Zeyu Chen, Yi Liu
First: 2025-08-01T08:10:54+00:00 · Latest: 2026-01-07T03:25:48+00:00
Abstract
Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities, particularly benefiting from Transformer architectures that enhance visual and artistic fidelity. However, their inherently sequential denoising process results in high inference latency, limiting their deployment in real-time scenarios. Existing training-free acceleration approaches typically reuse intermediate features at fixed timesteps or layers, overlooking the evolving semantic focus across denoising stages and Transformer blocks. To address this, we propose Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Furthermore, we incorporate a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Extensive experiments across various tasks and DiT architectures demonstrate that Sortblock achieves over 2$\times$ inference speedup with minimal degradation in output quality, offering an effective and generalizable solution for accelerating diffusion-based generative models.
Summary / 总结
Sortblock is a training-free inference acceleration framework for Diffusion Transformers (DiTs) that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Experiments show that Sortblock achieves over 2 times inference speedup with minimal degradation in output quality.
Sortblock 是一种针对扩散变换器(DiTs)的无训练加速框架,通过相邻时间步之间的相似性动态缓存块级特征。通过排序残差的演变,Sortblock 适应性地确定重计算比例,选择性地跳过冗余计算同时保持生成质量。实验表明,Sortblock 可以实现超过 2 倍的推理加速,同时保持输出质量的最小退化。
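The block-selection idea in the abstract (rank blocks by how much their residuals change between adjacent timesteps, recompute only the most-changed ones, and reuse cached features for the rest) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's algorithm: the function names are hypothetical, cosine similarity stands in for whatever similarity measure the authors use, the recomputation ratio is passed in as a fixed number instead of being determined adaptively, and the linear prediction step for skipped blocks is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_blocks_to_recompute(older_residuals, newer_residuals, recompute_ratio):
    """Pick the blocks whose residuals changed most between two timesteps.

    older_residuals / newer_residuals: one vector per Transformer block,
    taken from two adjacent, already-computed denoising steps.
    recompute_ratio: fraction of blocks to recompute (assumed fixed here).
    Returns the sorted indices of blocks to recompute; every other block
    would reuse its cached features.
    """
    changes = [
        1.0 - cosine_similarity(old, new)
        for old, new in zip(older_residuals, newer_residuals)
    ]
    n_blocks = len(changes)
    n_recompute = max(1, math.ceil(recompute_ratio * n_blocks))
    # Rank blocks from most-changed to least-changed.
    ranked = sorted(range(n_blocks), key=lambda i: changes[i], reverse=True)
    return sorted(ranked[:n_recompute])


older = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
newer = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
half = select_blocks_to_recompute(older, newer, recompute_ratio=0.5)
quarter = select_blocks_to_recompute(older, newer, recompute_ratio=0.25)
```

In this toy input, blocks 1 and 2 drift the most, so they are the ones selected at a ratio of 0.5, while blocks 0 and 3 would be served from the cache.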
Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
Authors: Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, Lijun Zhang
First: 2025-12-19T07:44:43+00:00 · Latest: 2026-01-07T03:10:43+00:00
Abstract
Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
中文标题/摘要
标题:深入可靠:提升图像思维多轮推理能力
大型视觉-语言模型(VLMs)的最新进展通过图像思维在链式思考(CoT)中展示了强大的推理能力,这通过主动调用工具分析视觉输入而非仅仅感知它们来实现。然而,现有模型在尝试错误的推理轨迹时往往难以反思和自我纠正。为解决这一局限,我们提出了DRIM模型,使其在多模态CoT中进行深入但可靠的多轮推理。我们的管道包括三个阶段:数据构建、冷启动微调(SFT)和强化学习(RL)。基于高分辨率图像数据集,我们构建了高难度且可验证的视觉问答对,其中解决每个任务需要多轮工具调用来达到正确答案。在SFT阶段,我们收集工具轨迹作为冷启动数据,引导多轮推理模式。在RL阶段,我们引入了冗余惩罚策略优化,激励模型发展自我反思的推理模式。基本思想是对推理轨迹进行判断,并惩罚那些在未进行充分多尺度探索的情况下产生错误答案的轨迹。大量实验表明,DRIM在视觉理解基准测试中取得了优越的性能。
Summary / 总结
The research aims to enhance the reasoning capabilities of Vision-Language Models (VLMs) for complex visual tasks by developing a model called DRIM, which supports deep yet reliable multi-turn reasoning. The method involves constructing high-difficulty visual question-answer pairs requiring multiple tool calls, followed by a cold-start supervised fine-tuning stage to guide multi-turn reasoning, and a reinforcement learning stage that penalizes incorrect reasoning without sufficient exploration. Experimental results show that DRIM outperforms existing models on visual understanding benchmarks.
论文旨在通过引入DRIM提升Vision-Language模型在复杂视觉任务中的推理能力,使其能够进行深入但可靠的多轮次推理。DRIM的方法包括三个阶段:数据构建、冷启动监督微调(SFT)和强化学习(RL)。模型构建了需要多次工具调用才能解决的高难度视觉问答对,并通过SFT引导多轮次推理模式。在RL阶段,引入了惩罚冗余策略优化,鼓励自我反思推理,对没有充分探索就给出错误答案的行为进行惩罚。实验表明,DRIM在视觉理解基准测试中表现优于现有模型。
Persona-aware and Explainable Bikeability Assessment: A Vision-Language Model Approach
Authors: Yilong Dai, Ziyi Wang, Chenguang Wang, Kexin Zhou, Yiheng Qian, Susu Xu, Xiang Yan
First: 2026-01-07T02:46:51+00:00 · Latest: 2026-01-07T02:46:51+00:00
Abstract
Bikeability assessment is essential for advancing sustainable urban transportation and creating cyclist-friendly cities, and it requires incorporating users' perceptions of safety and comfort. Yet existing perception-based bikeability assessment approaches face key limitations in capturing the complexity of road environments and adequately accounting for heterogeneity in subjective user perceptions. This paper proposes a persona-aware Vision-Language Model framework for bikeability assessment with three novel contributions: (i) theory-grounded persona conditioning based on established cyclist typology that generates persona-specific explanations via chain-of-thought reasoning; (ii) multi-granularity supervised fine-tuning that combines scarce expert-annotated reasoning with abundant user ratings for joint prediction and explainable assessment; and (iii) AI-enabled data augmentation that creates controlled paired data to isolate infrastructure variable impacts. To test and validate this framework, we developed a panoramic image-based crowdsourcing system and collected 12,400 persona-conditioned assessments from 427 cyclists. Experiment results show that the proposed framework offers competitive bikeability rating prediction while uniquely enabling explainable factor attribution.
中文标题/摘要
标题:基于人物画像和可解释性的骑行安全性评估:一种视觉-语言模型方法
骑行安全性评估对于促进可持续城市交通和创建骑行友好型城市至关重要,且需要纳入用户对安全性和舒适性的感知。然而,现有的基于感知的骑行安全性评估方法在捕捉道路环境的复杂性以及充分考虑用户感知的异质性方面存在关键局限。本文提出了一种基于人物画像的视觉-语言模型框架,具有三个创新贡献:(i)基于已建立的骑行者类型理论指导的人物条件化,通过链式推理生成人物特定的解释;(ii)多粒度监督微调,结合稀缺的专家标注推理与丰富的用户评分进行联合预测和可解释评估;(iii)AI驱动的数据增强,创建受控配对数据以隔离基础设施变量的影响。为了测试和验证该框架,我们开发了一个全景图像为基础的众包系统,并收集了427名骑行者的12,400个人物条件化评估。实验结果表明,所提出的框架在提供竞争力的骑行安全性评分预测的同时,还能够独特地实现可解释的因素归因。
Summary / 总结
This paper addresses the limitations of existing bikeability assessment methods by proposing a persona-aware Vision-Language Model framework. The method incorporates established cyclist typology to generate persona-specific explanations and uses a combination of expert-annotated reasoning and user ratings for fine-tuning. The framework also includes AI-enabled data augmentation to isolate infrastructure impacts. Experimental results demonstrate that the proposed framework provides competitive bikeability ratings while enabling unique explainable factor attribution.
本文提出了一种基于人设的视觉-语言模型框架,以解决现有自行车友好度评估方法的局限性。该框架结合了已有的骑车人类型学来生成人设特定的解释,并使用多粒度监督微调将专家注释与用户评分结合起来进行联合预测和可解释评估。研究从427名骑车人中收集了12,400个人设条件下的评估,并表明所提出的框架提供了具有竞争力的自行车友好度评分预测,同时能够进行独特的可解释因素归因。
SDCD: Structure-Disrupted Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models
Authors: Yuxuan Xia, Siheng Wang, Peng Li
First: 2026-01-07T01:27:58+00:00 · Latest: 2026-01-07T01:27:58+00:00
Abstract
Large Vision-Language Models (LVLMs) demonstrate significant progress in multimodal understanding and reasoning, yet object hallucination remains a critical challenge. While existing studies focus on mitigating language priors or high-level statistical biases, they often overlook the internal complexities of the visual encoding process. We identify that visual statistical bias, arising from the inherent Bag-of-Patches behavior of Vision Encoders under weak structural supervision, acts as a contributing factor to object hallucinations. Under this bias, models prioritize local texture features within individual patches over holistic geometric structures. This tendency may induce spurious visual confidence and result in hallucinations. To address this, we introduce a training-free algorithm called Structure-Disrupted Contrastive Decoding (SDCD), which performs contrastive calibration of the output distribution by introducing a shuffled structure-disrupted view. By penalizing tokens that maintain high confidence under this structure-less view, SDCD effectively suppresses the texture-driven bias. Experimental results demonstrate that SDCD significantly mitigates hallucinations across multiple benchmarks and enhances the overall multimodal capabilities of LVLMs.
Summary / 总结
The paper addresses the issue of object hallucination in Large Vision-Language Models (LVLMs) by identifying the role of visual statistical bias due to the Bag-of-Patches behavior of Vision Encoders. It introduces a training-free method called Structure-Disrupted Contrastive Decoding (SDCD) that introduces a shuffled structure-disrupted view to calibrate the output distribution, thereby suppressing texture-driven bias and mitigating hallucinations across multiple benchmarks.
论文通过识别视觉编码器固有的Bag-of-Patches行为导致的内在统计偏差,来解决大型视觉语言模型(LVLM)中的物体幻觉问题。它引入了SDCD算法,通过引入打乱结构的视图来进行对比校准,惩罚在该无结构视图中仍保持高置信度的标记。实验结果表明,SDCD有效缓解了多个基准上的幻觉,并增强了LVLM的多模态能力。
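The contrastive calibration step can be illustrated with a toy example. The abstract does not give the exact formula, so the sketch below assumes the form commonly used in contrastive decoding, `(1 + alpha) * original - alpha * disrupted`, applied to next-token logits. The names `shuffle_patches`, `contrastive_logits`, and `alpha` are hypothetical, and the two logit vectors stand in for model outputs on the original image and on its patch-shuffled view.

```python
import math
import random


def shuffle_patches(patches, seed=0):
    """Build a structure-disrupted view by permuting the image patches."""
    rng = random.Random(seed)
    shuffled = list(patches)
    rng.shuffle(shuffled)
    return shuffled


def contrastive_logits(original_logits, disrupted_logits, alpha=1.0):
    """Penalize tokens that stay confident under the structure-less view.

    Assumed form: (1 + alpha) * original - alpha * disrupted.
    """
    return [
        (1.0 + alpha) * o - alpha * d
        for o, d in zip(original_logits, disrupted_logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over three candidate tokens (made-up numbers).
# Token 0 is equally confident with and without spatial structure, which
# suggests it is driven by texture; token 1 depends on the intact layout.
original = [2.0, 2.0, 0.0]
disrupted = [2.0, 0.0, 0.0]

calibrated = contrastive_logits(original, disrupted, alpha=1.0)
probs = softmax(calibrated)
best_token = max(range(len(probs)), key=lambda i: probs[i])
```

Before calibration tokens 0 and 1 are tied; afterwards the token whose confidence survives patch shuffling is down-weighted relative to the one that relies on image structure.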
FROST-Drive: Scalable and Efficient End-to-End Driving with a Frozen Vision Encoder
Authors: Zeyu Dong, Yimin Zhu, Yu Wu, Yu Sun
First: 2026-01-06T23:13:35+00:00 · Latest: 2026-01-06T23:13:35+00:00
Abstract
End-to-end (E2E) models in autonomous driving aim to directly map sensor inputs to control commands, but their ability to generalize to novel and complex scenarios remains a key challenge. The common practice of fully fine-tuning the vision encoder on driving datasets potentially limits its generalization by causing the model to specialize too heavily in the training data. This work challenges the necessity of this training paradigm. We propose FROST-Drive, a novel E2E architecture designed to preserve and leverage the powerful generalization capabilities of a pretrained vision encoder from a Vision-Language Model (VLM). By keeping the encoder's weights frozen, our approach directly transfers the rich, generalized world knowledge from the VLM to the driving task. Our model architecture combines this frozen encoder with a transformer-based adapter for multimodal fusion and a GRU-based decoder for smooth waypoint generation. Furthermore, we introduce a custom loss function designed to directly optimize for Rater Feedback Score (RFS), a metric that prioritizes robust trajectory planning. We conduct extensive experiments on the Waymo Open E2E Dataset, a large-scale dataset deliberately curated to capture long-tail scenarios, demonstrating that our frozen-encoder approach significantly outperforms models that employ full fine-tuning. Our results provide substantial evidence that preserving the broad knowledge of a capable VLM is a more effective strategy for achieving robust, generalizable driving performance than intensive domain-specific adaptation. This offers a new pathway for developing vision-based models that can better handle the complexities of real-world application domains.
中文标题/摘要
标题:FROST-Drive:基于冻结视觉编码器的可扩展高效端到端驾驶
自主驾驶中的端到端(E2E)模型旨在直接将传感器输入映射到控制命令,但它们在处理新颖和复杂场景时的能力仍然是一个关键挑战。完全微调视觉编码器的常见做法可能会限制其泛化能力,因为它可能导致模型过度专业化于训练数据。这项工作挑战了这种训练范式。我们提出了一种名为FROST-Drive的新型E2E架构,该架构旨在保留并利用来自视觉语言模型(VLM)的预训练视觉编码器的强大泛化能力。通过冻结编码器的权重,我们的方法直接将VLM中的丰富泛化世界知识转移到驾驶任务中。我们的模型架构将这个冻结的编码器与用于多模态融合的变压器适配器以及用于平滑生成航点的GRU解码器相结合。此外,我们引入了一种自定义损失函数,旨在直接优化评分反馈分数(RFS),这是一种优先考虑稳健轨迹规划的指标。我们在Waymo Open E2E数据集上进行了广泛的实验,这是一个故意收集以捕捉长尾场景的大规模数据集,结果表明,我们的冻结编码器方法显著优于采用完全微调的模型。我们的结果提供了有力的证据,证明保留一个能力强的VLM的广泛知识比密集的领域特定适应是一种更有效的策略,以实现稳健的、可泛化的驾驶性能。这为开发能够更好地处理实际应用领域复杂性的基于视觉的模型提供了一条新途径。
Summary / 总结
This work addresses the challenge of generalizing end-to-end (E2E) models in autonomous driving by proposing FROST-Drive, which freezes the weights of a pretrained vision encoder from a Vision-Language Model (VLM) to leverage its generalization capabilities. The model combines this frozen encoder with a transformer-based adapter and a GRU-based decoder. Experiments on the Waymo Open E2E Dataset show that FROST-Drive outperforms fully fine-tuned models, suggesting that preserving the VLM's broad knowledge is more effective for robust driving performance.
论文提出了一种名为FROST-Drive的新颖E2E架构,该架构冻结了来自Vision-Language Model (VLM)的预训练视觉编码器,以保留其泛化能力。该模型结合了冻结的编码器、基于变换器的多模态融合适配器和基于GRU的解码器。实验结果表明,FROST-Drive在Waymo Open E2E数据集上的表现优于完全微调的模型,这表明利用预训练知识对于实现稳健的驾驶性能更为有效。