arXiv 论文速递

2025-10-16 03:28
Latest digest
UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Authors: Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale
First: 2025-10-14T17:57:56+00:00 · Latest: 2025-10-14T17:57:56+00:00
Comments: Project page at https://thekevinli.github.io/unifusion/
Abstract
Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last-layer information from a VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting their accessibility. We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM's reasoning capabilities for increased capabilities and flexibility at inference. In addition, fine-tuning on an editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model, when trained on single-image editing, generalizes zero-shot to multiple image references, further motivating the unified encoder design of UniFusion.
中文标题/摘要
标题:UniFusion:统一编码的视觉语言模型在图像生成中的应用
尽管近期在视觉生成方面取得了显著进展,但大多数现有架构仍然依赖于独立的图像和文本编码器。这种分离限制了扩散模型进行跨模态推理和知识转移的能力。此前尝试弥合这一差距的方法通常使用视觉语言模型(VLM)的最后一层信息、采用多个视觉编码器或联合训练大规模统一模型以同时生成文本和图像,这需要大量的计算资源和大规模数据,限制了其可访问性。我们提出了UniFusion,这是一种基于扩散的生成模型,条件于一个冻结的大型视觉语言模型(VLM),该模型作为统一的多模态编码器。UniFusion的核心是逐层注意力池化(LAP)机制,该机制从冻结的VLM的文本和视觉标记中提取高层语义和低层细节,以条件化扩散生成模型。我们证明了LAP在文本-图像对齐生成和视觉信息从VLM到扩散模型的忠实转移方面优于其他浅融合架构,这对于编辑至关重要。我们提出了VLM-启用重写注入与灵活推理(VERIFI),该方法仅在模型内提示重写过程中由VLM生成的文本标记条件化扩散变换器(DiT)。VERIFI结合了条件分布的对齐与VLM的推理能力,以提高推理时的能力和灵活性。此外,编辑任务的微调不仅提高了生成的文本-图像对齐,表明跨模态知识转移,还展示了巨大的泛化能力。我们的模型在单图像编辑训练后,零样本泛化到多个图像参考,进一步证明了UniFusion统一编码器设计的有效性。
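The abstract describes Layerwise Attention Pooling only at a high level, so the following is a minimal Python sketch of the general idea of pooling one token's features across layers with attention weights. It is not the authors' implementation: the function names `softmax` and `pool_layers`, the per-layer scores, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_layers(layer_features, layer_scores):
    """Attention-weighted sum of one token's features across layers.

    layer_features: one feature vector per layer (hypothetical toy input).
    layer_scores:   one relevance score per layer (assumed to be learned).
    """
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, layer_features))
        for d in range(dim)
    ]


# Two layers with equal scores contribute equally to the pooled vector.
pooled = pool_layers([[0.0, 2.0], [2.0, 0.0]], [0.0, 0.0])
print(pooled)
```

With unequal scores, the pooled vector leans toward the higher-scored layers, which is one simple way to mix shallow (detail-oriented) and deep (semantic) features.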
Modular Embedding Recomposition for Incremental Learning
Authors: Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
First: 2025-08-22T15:25:40+00:00 · Latest: 2025-10-14T16:54:27+00:00
Comments: Accepted to the 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK
Abstract
The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
中文标题/摘要
标题:模块化嵌入重组以实现增量学习
预训练的视觉-语言模型(VLMs)的出现显著改变了持续学习(CL),主要是由于它们的零样本分类能力。这种能力使VLMs非常适合现实世界的应用,能够在不需要适应的情况下对新的未见过的类别提供稳健的性能。然而,当下游任务与预训练领域相差甚远时,微调仍然是必不可少的。先前的CL方法主要关注在下游任务的增量微调过程中保留VLMs的零样本能力。我们更进一步,提出了一种方法,将保留转变为增强VLMs的零样本能力。我们的方法名为MoDular Embedding Recomposition(MoDER),引入了一个模块化框架,该框架训练多个文本专家,每个专家专门针对一个已见类别,并将它们存储在一个基础枢纽中。在推理时,对于每个未见过的类别,我们查询枢纽并组合检索到的专家,以合成一个改进分类的精炼原型。我们展示了我们的方法在两个流行的零样本增量协议Class-IL和MTIL中的有效性,共涉及14个数据集。代码库可在https://github.com/aimagelab/mammoth/ 获取。
Summary / 总结
The paper introduces MoDER, a modular framework that enhances, rather than merely preserves, the zero-shot capabilities of Vision-Language Models (VLMs) during incremental learning. It trains one specialized textual expert per seen class and stores the experts in a hub. At inference, it queries the hub for each unseen class and composes the retrieved experts into a refined prototype that improves classification. The method is evaluated on two zero-shot incremental protocols, Class-IL and MTIL, covering 14 datasets in total.
该研究提出了一种名为MoDER的方法,通过训练每个已见类别的专门文本专家并存储在一个枢纽中,在推理时通过组合这些专家来增强视觉-语言模型的零样本能力。MoDER在Class-IL和MTIL协议下的14个数据集中展示了有效性。代码可在GitHub上获得。
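The query-and-compose step can be pictured with a short Python sketch. The abstract does not say how experts are retrieved or combined, so the cosine-similarity retrieval, the plain averaging, and the names `cosine`, `compose_prototype`, and `hub` below are assumptions for illustration only, not MoDER's actual procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def compose_prototype(query, hub, k=2):
    """Retrieve the k experts closest to the query and average them.

    query: embedding of an unseen class name (hypothetical toy vector).
    hub:   dict mapping a seen class name to its expert embedding.
    """
    ranked = sorted(hub, key=lambda name: cosine(query, hub[name]), reverse=True)
    chosen = ranked[:k]
    prototype = [
        sum(hub[name][d] for name in chosen) / len(chosen)
        for d in range(len(query))
    ]
    return chosen, prototype


hub = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [-1.0, 0.0]}
chosen, prototype = compose_prototype([1.0, 1.0], hub, k=2)
print(chosen, prototype)
```

In this toy hub, the query sits between the "cat" and "dog" experts, so those two are retrieved and the "car" expert is ignored.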
SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Authors: Lin Lin, Jiefeng Long, Zhihe Wan, Yuchi Wang, Dingkang Yang, Shuang Yang, Yueyang Yao, Xu Chen, Zirui Guo, Shengqiang Li, Weiran Li, Hanyu Li, Yaling Mou, Yan Qiu, Haiyang Yu, Xiao Liang, Hongsheng Li, Chao Feng
First: 2025-10-14T16:43:22+00:00 · Latest: 2025-10-14T16:43:22+00:00
Comments: Technical Report
Abstract
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.158% and the 14-day LT gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.
中文标题/摘要
标题:SAIL-嵌入技术报告:全模态嵌入基础模型
多模态嵌入模型旨在生成具有信息性的统一表示,以赋能多样的跨模态任务。尽管从基于CLIP的双塔架构到大型视觉语言模型的发展取得了令人鼓舞的进展,但先前的工作在实际应用和商业场景中仍面临诸多挑战,如模态支持有限、训练机制不稳定以及工业领域差距。在本工作中,我们引入了SAIL-嵌入,这是一种通过定制化的训练策略和架构设计来解决这些问题的全模态嵌入基础模型。在优化过程中,我们提出了一种多阶段训练方案,以增强表示学习的多面有效性。具体而言,内容感知渐进式训练旨在提高模型对多种下游任务的适应性,并掌握丰富的跨模态能力。协作感知推荐增强训练进一步通过从序列到项目和ID到项目的嵌入中提炼知识,并挖掘用户历史兴趣,来适应推荐场景中的多模态表示。同时,我们开发了随机专业化和数据驱动的模式匹配,以增强模型训练的灵活性和泛化能力。实验结果表明,SAIL-嵌入在不同检索任务中实现了SOTA性能。在与我们的模型集成的各种实际场景中的在线实验中,我们观察到显著的生命周期(LT)提升,这是推荐体验的关键指标。例如,在抖音精选场景中,模型实现了7天LT提升+0.158%,14天LT提升+0.144%。对于抖音信息流排名模型,SAIL-嵌入生成的匹配特征实现了+0.08%的AUC提升。
ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
Authors: Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang
First: 2025-10-14T16:25:46+00:00 · Latest: 2025-10-14T16:25:46+00:00
Abstract
Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present Embodied Reasoning Agent (ERA), a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, Embodied Prior Learning, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
中文标题/摘要
标题:ERA:通过具身先验学习和在线强化学习将VLM转化为具身代理
近期具身AI的发展突显了视觉语言模型(VLMs)作为能够在复杂环境中进行感知、推理和交互的代理的潜力。然而,表现最佳的系统依赖于部署成本高昂的大规模模型,而较小的VLM缺乏成功所需的知识和技能。为弥合这一差距,我们提出了“具身推理代理(ERA)”,这是一种两阶段框架,结合了先验知识学习和在线强化学习(RL)。第一阶段,“具身先验学习”,从三种类型的数据中提炼基础知识:(1)轨迹增强先验,通过更强的模型生成的结构化推理丰富现有轨迹数据;(2)环境锚定先验,提供环境内的知识和定位监督;(3)外部知识先验,从环境外的数据集中转移一般知识。在第二阶段,我们开发了一个基于这些先验的在线RL管道,进一步提升代理性能。为克服代理RL固有的挑战,包括长时程、稀疏奖励和训练不稳定,我们引入了三个关键设计:自我总结以管理上下文、密集奖励塑造和回合级策略优化。在EB-ALFRED(高层规划)和EB-Manipulation(低层控制)任务上的广泛实验表明,ERA-3B同时超越了基于提示的大模型和此前基于训练的基线。具体而言,相比GPT-4o,它在EB-ALFRED和EB-Manipulation上的整体表现分别提高了8.4%和19.4%,并且在未见过的任务上表现出强大的泛化能力。总体而言,ERA提供了一条实用的道路,通往可扩展的具身智能,并为未来的具身AI系统提供了方法论上的见解。
Summary / 总结
ERA is a two-stage framework that transforms vision language models into embodied agents by integrating embodied prior learning and online reinforcement learning. The first stage, Embodied Prior Learning, uses three types of data to distill foundational knowledge: trajectory-augmented priors, environment-anchored priors, and external knowledge priors. The second stage employs an online RL pipeline to further enhance agent performance. Key designs include self-summarization, dense reward shaping, and turn-level policy optimization to address challenges in agent RL. Experiments show that ERA-3B outperforms both prompting-based large models and previous training-based baselines, improving over GPT-4o by 8.4% on EB-ALFRED and 19.4% on EB-Manipulation, and that it generalizes well to unseen tasks.
ERA 是一个两阶段框架,结合了具身先验学习和在线强化学习,将视觉语言模型转化为具身代理。第一阶段,具身先验学习,使用三种数据:轨迹增强先验、环境锚定先验和外部知识先验。第二阶段采用在线 RL 管道进一步提升代理性能。关键设计包括自我总结、密集奖励塑造和回合级策略优化。实验表明,ERA-3B 在 EB-ALFRED 和 EB-Manipulation 任务上分别比 GPT-4o 提高了 8.4% 和 19.4%,并且具有较强的未见过任务的泛化能力。
TTT3R: 3D Reconstruction as Test-Time Training
Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
First: 2025-09-30T17:59:51+00:00 · Latest: 2025-10-14T15:57:11+00:00
Comments: Page: https://rover-xingyu.github.io/TTT3R/ Code: https://github.com/Inception3D/TTT3R
Abstract
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information against adaptation to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a 2× improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available at https://rover-xingyu.github.io/TTT3R
中文标题/摘要
标题:TTT3R:测试时训练的3D重建
现代循环神经网络由于其线性时间复杂度已成为3D重建的竞争性架构。然而,当应用于训练上下文长度之外时,其性能显著下降,显示出有限长度泛化能力。在本文中,我们从测试时训练的角度重新审视3D重建的基础模型,将其设计框架为在线学习问题。基于这一视角,我们利用记忆状态与新观测之间的对齐置信度来推导出记忆更新的闭式学习率,以平衡保留历史信息和适应新观测之间的关系。这种无需训练的干预措施,称为TTT3R,显著提高了长度泛化能力,在全局姿态估计方面比基线提高了2倍,同时以每秒20帧的速度运行,仅使用6 GB的GPU内存处理数千张图像。代码可在https://rover-xingyu.github.io/TTT3R/获取
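The trade-off between retaining history and adapting to new observations can be illustrated with a generic confidence-gated memory update. This Python sketch is not the TTT3R update rule, whose closed form is not given in the abstract: the function `gated_update`, the clamping of the confidence to [0, 1], and the toy vectors are assumptions for illustration.

```python
def gated_update(state, candidate, confidence):
    """Move the memory state toward a candidate by a per-step learning rate.

    state:      current memory vector (hypothetical toy input).
    candidate:  memory proposed from the incoming observation.
    confidence: alignment confidence, used here directly as the learning
                rate after clamping to [0, 1] (an assumption, not the
                paper's closed-form expression).
    """
    rate = min(1.0, max(0.0, confidence))
    return [s + rate * (c - s) for s, c in zip(state, candidate)]


state = [0.0, 2.0]
candidate = [2.0, 0.0]
print(gated_update(state, candidate, 0.0))  # keeps the old memory
print(gated_update(state, candidate, 0.5))  # blends old and new
print(gated_update(state, candidate, 1.0))  # overwrites with the candidate
```

A rate near 0 preserves historical information, while a rate near 1 adapts fully to the new observation; the point of a data-dependent rate is to choose between the two per step rather than with a fixed constant.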
Image Quality Assessment for Embodied AI
Authors: Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai
First: 2025-05-22T15:51:07+00:00 · Latest: 2025-10-14T15:39:20+00:00
Abstract
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the real world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic of IQA for Embodied AI. Specifically, we (1) construct a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and define a comprehensive subjective score collection process; (2) establish the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5 million fine-grained annotations provided by Vision Language Models, Vision-Language-Action models, and real-world robots; (3) train and validate mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex real-world distortions. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA
中文标题/摘要
标题:具身AI的图像质量评估
近年来,具身AI发展迅速,但主要仍部署在实验室中,现实世界中的各种失真限制了其应用。传统上,图像质量评估(IQA)方法用于预测失真图像的人类偏好;然而,没有评估图像在具身任务中可用性的IQA方法,即机器人的感知质量。为了为未来的具身场景提供准确可靠的质量指标,我们首先提出了IQA for 具身AI这一主题。具体来说,我们(1)基于默顿系统和元认知理论,构建了感知-认知-决策-执行管道,并定义了一个全面的主观评分收集过程;(2)建立了具身-IQA数据库,包含超过36000对参考/失真图像,提供了超过500万细粒度的注释,由视觉语言模型/视觉语言动作模型/现实世界机器人提供;(3)在具身-IQA上训练和验证了主流IQA方法的表现,展示了需要为具身AI开发更准确的质量指标。我们真诚地希望通过评估,促进具身AI在现实世界复杂失真下的应用。项目页面:https://github.com/lcysyzxdxc/EmbodiedIQA
Summary / 总结
This study addresses the need for Image Quality Assessment (IQA) methods tailored to embodied AI: traditional IQA predicts human preferences for distorted images, but no existing method assesses how usable an image is for a robot in an embodied task. The authors propose a perception-cognition-decision-execution pipeline and a comprehensive subjective score collection process based on the Mertonian system and meta-cognitive theory. They also build the Embodied-IQA database, with over 36,000 reference/distorted image pairs and more than 5 million fine-grained annotations from vision language models, vision-language-action models, and real-world robots. Evaluating mainstream IQA methods on this database shows that more accurate quality indicators are needed for embodied AI.
该研究针对传统图像质量评估(IQA)方法在具身AI中的不足,提出了基于默顿系统和元认知理论的感知-认知-决策-执行管道和全面的主观评分收集过程。作者还创建了包含超过36,000个参考/失真图像对和500多万条细粒度注释的具身-IQA数据库。研究强调了在复杂现实环境中评估机器人感知质量的必要性,展示了开发更准确的具身AI质量指标的重要性。
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Authors: Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
Venue: NeurIPS 2025
First: 2025-09-21T17:53:30+00:00 · Latest: 2025-10-14T14:25:36+00:00
Comments: Project homepage: https://flageval-baai.github.io/LRM-Eval/ This work will also be presented at NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models (FoRLM)
Abstract
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
中文标题/摘要
标题:FlagEval 研究报告:大型推理模型在自动可验证文本和视觉问题上的初步评估
我们进行了一项中规模的无污染(一定程度上)的大型推理模型(LRMs)评估,并获得了初步发现。我们还发布了用于视觉语言模型的评估基准 ROME,旨在测试从视觉线索中推理的能力。我们在此网站上附上了基准、评估数据和其他更新的链接:https://flageval-baai.github.io/LRM-Eval/
Summary / 总结
This study reports a moderate-scale evaluation of current large reasoning models (LRMs) that is, to some extent, contamination-free, and releases ROME, a benchmark for vision language models that tests reasoning from visual clues. The abstract describes the findings as preliminary and does not list them; links to the benchmark, evaluation data, and updates are on the project website.
该研究在一定程度上无污染的设置下,对当前的大型推理模型(LRMs)进行了中等规模的评估,并发布了用于测试视觉语言模型从视觉线索进行推理的ROME基准。摘要称这些发现是初步的,并未逐一列出;基准、评估数据和更新的链接见项目网站。
VISaGE: Understanding Visual Generics and Exceptions
Authors: Stella Frank, Emily Allaway
Venue: EMNLP 2025
First: 2025-10-14T14:13:06+00:00 · Latest: 2025-10-14T14:13:06+00:00
Comments: EMNLP 2025
Abstract
While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.
中文标题/摘要
标题:VISaGE:理解视觉泛化和例外
尽管视觉语言模型(VLMs)在训练过程中学习到概念化的表示形式,即泛化的知识,但它们通常用于分析单个实例。当评估实例不典型时,这种范式会在模型中产生两个先验之间的紧张关系。第一个是实用先验,即文本和视觉输入都相关,这源于VLM在一致输入上的微调;第二个是语义先验,即概念表示对于该类别的实例通常为真。为了理解VLM如何在这些先验之间权衡,我们引入了一个新的评估数据集VISaGE,包含典型和例外的图像。在精心平衡的实验中,我们展示了当实用先验的基础假设(即一致性)被不一致的图像违反时,概念理解会下降。当查询单个实例时,这种效应比语义先验的影响更强。
Summary / 总结
The research aims to understand how Vision Language Models (VLMs) handle typical and atypical images. The study introduces VISaGE, a dataset with both typical and exceptional images, to evaluate VLMs. The experiments reveal that VLMs' conceptual understanding weakens when incongruent images are used, indicating a stronger impact of the pragmatic prior over the semantic prior in individual instance queries.
研究旨在理解视觉语言模型(VLMs)如何处理典型和异常实例。研究引入了VISaGE数据集,包含典型和异常图像,以评估VLMs。关键发现表明,当异常图像违反了实用先验时,VLMs的概念理解会减弱,这种效果在查询个别实例时比语义先验更为显著。
Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
Authors: Lin Long, Changdae Oh, Seongheon Park, Sharon Li
First: 2025-09-27T02:12:05+00:00 · Latest: 2025-10-14T14:10:23+00:00
Abstract
Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), i.e., memorized textual patterns from pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
中文标题/摘要
标题:通过对比嵌入链理解LVLM的语言先验
大型视觉-语言模型(LVLMs)在多模态任务中表现出色,但它们往往依赖于其语言先验(LP)——预训练中记忆的文本模式,而未能充分利用视觉证据。关于LP的先前分析主要依赖于输入-输出探针,这未能揭示视觉如何以及何时影响模型行为的内部机制。为了解决这一问题,我们首次通过嵌入链的视角系统地分析了语言先验,该视角检查了LVLM中逐层的表示动态。我们的分析揭示了一个普遍现象:每个模型都表现出一个视觉整合点(VIP),这是一个关键层,在此层,视觉信息开始有意义地重塑隐藏表示并影响解码。基于这一观察,我们引入了总视觉整合(TVI)估计器,该估计器汇总了VIP之外的表示距离,以量化视觉查询对响应生成的影响强度。在涵盖9个当代LVLM和6个基准的54种模型-数据集组合中,我们证明VIP始终出现,并且TVI可靠地预测了语言先验的强度。这提供了一个原则性的工具箱,用于诊断和理解LVLM中的语言先验。
Summary / 总结
The research aims to understand how large vision-language models (LVLMs) rely on their language prior (LP) and under-utilize visual evidence. To this end, the study analyzes LP through the lens of chain-of-embedding, which tracks the layer-wise representation dynamics within LVLMs. Key findings include a Visual Integration Point (VIP) in each model, a critical layer where visual information starts to reshape hidden representations, and the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify the influence of the visual query on response generation. Across 54 model-dataset combinations (9 LVLMs, 6 benchmarks), VIP consistently emerges, and TVI reliably predicts the strength of the language prior.
研究旨在理解大型视觉语言模型(LVLMs)如何依赖语言先验(LP)而未充分利用视觉证据。为此,研究人员通过嵌入链的视角分析LVLMs的逐层表示动态,发现了视觉信息开始重塑隐藏表示的关键层,称为视觉整合点(VIP)。他们引入了总视觉整合(TVI)估计器来量化视觉查询对响应生成的影响。在54种模型-数据集组合(9个LVLM、6个基准)中,VIP始终出现,而TVI可靠地预测了语言先验的强度。
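The phrase "aggregates representation distance beyond the VIP" can be made concrete with a small Python sketch. The abstract does not define the distance or the exact aggregation, so the Euclidean distance, the plain sum, the choice to start at the VIP layer itself, and the name `total_visual_integration` are all assumptions rather than the paper's estimator.

```python
import math


def total_visual_integration(with_image, text_only, vip):
    """Sum per-layer distances between two hidden-state trajectories.

    with_image: one hidden vector per layer when the image is provided.
    text_only:  one hidden vector per layer when the image is withheld.
    vip:        index of the layer taken as the Visual Integration Point;
                layers before it are ignored (assumed convention).
    """
    distances = [math.dist(a, b) for a, b in zip(with_image, text_only)]
    return sum(distances[vip:])


with_image = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
text_only = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(total_visual_integration(with_image, text_only, vip=1))
```

Under this reading, a small total means the image barely changes the representations after the VIP, which would be consistent with a strong language prior.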
OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations
Authors: Christina Kassab, Sacha Morin, Martin Büchner, Matías Mattamala, Kumaraditya Gupta, Abhinav Valada, Liam Paull, Maurice Fallon
Venue: NeurIPS 2025
First: 2025-03-25T15:28:50+00:00 · Latest: 2025-10-14T13:14:38+00:00
Comments: NeurIPS 2025
Abstract
3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, at present the evaluation of these representations is limited to datasets with closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark for evaluating 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymous object categories and additional nuanced descriptions. Our label sets provide 13 times more labels per scene than the original datasets. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases and avenues for improvement. Our experiments provide insights on feature precision, segmentation, and downstream capabilities. The benchmark is publicly available at: https://openlex3d.github.io/.
中文标题/摘要
标题:OpenLex3D:开放词汇3D场景表示的分层评估基准
开放词汇语言模型改变了3D场景理解,使其能够通过自然语言进行交互。然而,目前这些表示的评估仅限于具有封闭语义的数据集,无法捕捉语言的丰富性。本工作提出了OpenLex3D,这是一个专门用于评估开放词汇3D场景表示的基准。OpenLex3D为Replica、ScanNet++和HM3D提供了全新的标签注释,通过引入同义对象类别和额外的细微描述,捕捉了现实世界的语言变异性。我们的标签集为每个场景提供了比原始数据集多13倍的标签。通过引入开放集3D语义分割任务和对象检索任务,我们在OpenLex3D上评估了各种现有的3D开放词汇方法,展示了失败案例和改进途径。我们的实验提供了关于特征精度、分割和下游能力的见解。基准已公开发布于:https://openlex3d.github.io/
Summary / 总结
OpenLex3D is a benchmark designed to evaluate open-vocabulary 3D scene representations, addressing the limitations of closed-set semantics in existing datasets. It introduces new label annotations for scenes from Replica, ScanNet++, and HM3D, capturing real-world linguistic variability. The benchmark evaluates various 3D open-vocabulary methods through an open-set 3D semantic segmentation task and an object retrieval task, revealing insights into feature precision and downstream capabilities. The experiments highlight failure cases and suggest areas for improvement in 3D scene understanding.
OpenLex3D 是一个基准,旨在评估开放词汇的3D场景表示,解决了现有数据集闭集语义的局限性。它为来自Replica、ScanNet++和HM3D的场景引入了新的标签注释,捕捉了现实世界的语言变异性。该基准通过开放集3D语义分割任务和对象检索任务评估了各种3D开放词汇方法,揭示了特征精度和下游能力的见解。实验指出了失败案例,并提出了3D场景理解改进的方向。
FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Authors: Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao
Venue: NeurIPS 2025
First: 2025-10-13T09:22:12+00:00 · Latest: 2025-10-14T13:01:01+00:00
Comments: 19 pages, 11 figures. Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model's associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
中文标题/摘要
标题:FlexAC:迈向多模态大型语言模型中关联推理的灵活控制
多模态大型语言模型(MLLMs)在忠实性和创造性之间存在固有的权衡,因为不同的任务需要不同程度的关联推理。然而,现有的方法缺乏调节这种推理强度的灵活性,限制了MLLMs在事实性和创造性场景中的适应性。为了解决这一问题,我们提出为MLLMs配备机制,使其能够灵活控制关联推理。我们首先研究了MLLMs内部驱动关联行为的机制,并发现:(1) 中间层在塑造模型的关联倾向中起着关键作用,(2) 修改这些层中的表示可以有效地调节关联推理强度,(3) 可以利用幻觉来推导出引导这种调节的引导向量。基于这些发现,我们引入了灵活关联控制(FlexAC),这是一种轻量级且无需训练的框架,用于调节MLLMs的关联行为。FlexAC首先通过幻觉引导的中间表示来编码关联方向。然后,它选择高关联实例来构建有效的关联引导向量,其强度会根据创造性指导与输出稳定性之间的平衡进行自适应校准。最后,考虑到关联推理的多维性质,FlexAC结合了从少量目标领域样本前向传递中提取的任务特定关联向量,使模型能够遵循多种关联方向,更好地适应创造性任务。值得注意的是,我们的方法在Creation-MMBench上的创造性提高了5.8倍,在CHAIR上的幻觉率降低了29%,超过了现有基线,证明了其在MLLMs中实现灵活控制关联推理的有效性。我们的代码可在https://github.com/ylhz/FlexAC获取。
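Steering a hidden representation along an "associative direction" can be sketched in a few lines of Python. The difference-of-means construction below is a common way to build steering vectors and is used here as a stand-in; the abstract does not say FlexAC computes its vectors this way. The names `mean_vector`, `steering_vector`, and `steer`, and the fixed `strength` argument (FlexAC calibrates strength adaptively), are assumptions for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def steering_vector(associative, baseline):
    """Direction from baseline representations toward associative ones."""
    high = mean_vector(associative)
    low = mean_vector(baseline)
    return [h - l for h, l in zip(high, low)]


def steer(hidden, direction, strength):
    """Shift a hidden vector along the direction; negative strength reverses it."""
    return [h + strength * d for h, d in zip(hidden, direction)]


direction = steering_vector([[2.0, 0.0], [4.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
print(steer([0.0, 1.0], direction, 0.5))   # more associative
print(steer([0.0, 1.0], direction, -0.5))  # less associative
```

A positive strength would push toward more associative (creative) behavior and a negative one toward more faithful behavior, which mirrors the faithfulness-creativity trade-off the abstract describes.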
A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation
Authors: Shaoyang Zhou, Yingshu Li, Yunyi Liu, Lingqiao Liu, Lei Wang, Luping Zhou
First: 2025-10-14T12:26:23+00:00 · Latest: 2025-10-14T12:26:23+00:00
Abstract
Chest X-ray imaging is a widely used diagnostic tool in modern medicine, and its high utilization creates substantial workloads for radiologists. To alleviate this burden, vision language models are increasingly applied to automate chest X-ray radiology report generation (CXR RRG), aiming for clinically accurate descriptions while reducing manual effort. Conventional approaches, however, typically rely on single images, failing to capture the longitudinal context necessary for producing clinically faithful comparison statements. Recently, growing attention has been directed toward incorporating longitudinal data into CXR RRG, enabling models to leverage historical studies in ways that mirror radiologists' diagnostic workflows. Nevertheless, existing surveys primarily address single-image CXR RRG and offer limited guidance for longitudinal settings, leaving researchers without a systematic framework for model design. To address this gap, this survey provides the first comprehensive review of longitudinal radiology report generation (LRRG). Specifically, we examine dataset construction strategies, report generation architectures alongside longitudinally tailored designs, and evaluation protocols encompassing both longitudinal-specific measures and widely used benchmarks. We further summarize the performance of LRRG methods, alongside analyses of different ablation studies, which collectively highlight the critical role of longitudinal information and architectural design choices in improving model performance. Finally, we summarize five major limitations of current research and outline promising directions for future development, aiming to lay a foundation for advancing this emerging field.
中文标题/摘要
标题:纵向放射学报告生成综述:数据集构成、方法和性能评估
胸部X线成像是现代医学中广泛使用的诊断工具,其高使用率给放射科医生带来了巨大的工作负担。为了减轻这一负担,视觉语言模型越来越多地应用于自动化胸部X线放射学报告生成(CXR RRG),旨在提供临床准确的描述并减少人工工作量。然而,传统方法通常依赖单张图像,无法捕捉到生成临床忠实比较陈述所需的纵向上下文。最近,越来越多的关注转向将纵向数据纳入CXR RRG,使模型能够以类似于放射科医生诊断工作流程的方式利用历史检查。尽管如此,现有的综述主要关注单张图像的CXR RRG,对纵向设置的指导有限,使研究人员缺乏系统的模型设计框架。为解决这一差距,本综述首次对纵向放射学报告生成(LRRG)进行了全面综述。具体而言,我们探讨了数据集构建策略、报告生成架构以及纵向定制设计,并涵盖了纵向特定指标和广泛使用的基准的评估协议。我们还总结了LRRG方法的性能,以及不同消融研究的分析,这些共同突显了纵向信息和架构设计选择在提高模型性能中的关键作用。最后,我们总结了当前研究的五个主要局限性,并概述了未来发展的有希望的方向,旨在为推进这一新兴领域奠定基础。
Summary / 总结
This study reviews the development of longitudinal radiology report generation for chest X-rays, focusing on dataset composition, report generation methods, and performance evaluation. It highlights the importance of incorporating historical imaging data to improve the accuracy of radiology reports. Key findings include the critical role of longitudinal information and architectural design in enhancing model performance, with ablation studies demonstrating the necessity of these elements. The review also identifies several limitations and suggests future research directions to advance this field.
该论文回顾了胸部X光影像的纵向放射学报告生成技术的发展,重点关注如何结合历史影像数据以提高报告的准确性。它探讨了数据集的构建策略、报告生成方法以及评估协议,强调了纵向信息和架构设计的重要性。主要发现包括历史数据在生成临床准确报告中的关键作用,以及需要更多定制化的模型设计来有效利用这些信息。
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
Authors: Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan
Venue: NeurIPS 2025
First: 2025-06-09T17:36:34+00:00 · Latest: 2025-10-14T12:26:03+00:00
Comments: NeurIPS 2025
Abstract
Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all compared baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project website: https://bridgevla.github.io/
中文标题/摘要
标题:BridgeVLA:高效3D操作学习的输入输出对齐方法与视觉语言模型
近年来,利用预训练的视觉语言模型(VLMs)构建视觉语言动作(VLA)模型已成为有效机器人操作学习的一种有前途的方法。然而,只有少数方法将3D信号整合到VLMs中进行动作预测,且未能充分利用3D数据中固有的空间结构,导致样本效率较低。本文介绍了一种新颖的3D VLA模型BridgeVLA,该模型(1)将3D输入投影到多个2D图像上,确保输入与VLM主干对齐,(2)利用2D热图进行动作预测,将输入和输出空间统一到一致的2D图像空间内。此外,我们提出了一种可扩展的预训练方法,使VLM主干能够在下游策略学习之前预测2D热图。大量实验表明,所提出的方法能够高效有效地学习3D操作。BridgeVLA在三个模拟基准测试中均优于最先进的基线方法。在RLBench中,它将平均成功率从81.4%提高到88.2%。在COLOSSEUM中,它在具有挑战性的泛化设置中表现出显著更好的性能,将平均成功率从56.7%提高到64.0%。在GemBench中,它在平均成功率方面超过了所有比较基线方法。在真实机器人实验中,BridgeVLA在平均成功率方面比最先进的基线方法高出32%。它在多种离分布设置中表现出稳健的泛化能力,包括视觉干扰和未见过的指令。值得注意的是,它仅使用每个任务3条轨迹就能在10多个任务中实现96.8%的成功率,突显了其非凡的样本效率。项目网站:https://bridgevla.github.io/
Summary / 总结
BridgeVLA is a novel 3D VLA model that projects 3D inputs to 2D images for input-output alignment with the VLM backbone, and uses 2D heatmaps for action prediction. It also proposes a scalable pre-training method to equip the VLM with the ability to predict 2D heatmaps. BridgeVLA outperforms state-of-the-art methods across three simulation benchmarks and a real-robot experiment, demonstrating significant improvements in success rates and robust generalization in various settings, including visual disturbances and unseen instructions. It shows exceptional sample efficiency with only 3 trajectories per task achieving a success rate of 96.8%.
BridgeVLA 是一种新颖的 3D VLA 模型,它将 3D 输入投影到 2D 图像中,以与 VLM 主干进行输入输出对齐,并使用 2D 热图进行动作预测。此外,它还提出了一种可扩展的预训练方法,使 VLM 主干能够预测 2D 热图。BridgeVLA 在三个模拟基准和一个真实机器人实验中均优于最先进的方法,显示出在各种设置中的显著改进,包括视觉干扰和未见过的指令。它仅使用每任务 3 条轨迹即可实现 96.8% 的成功率,显示出极高的样本效率。
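The BridgeVLA abstract above says that actions are predicted through 2D heatmaps, so the output lives in the same image space as the input. As a rough illustration of that general idea only, the sketch below turns a small grid of scores into a probability heatmap and reads off its peak pixel. The function names and the toy 3x3 grid are assumptions made for this example; this is not the authors' implementation, which is not described at this level of detail in the abstract.

```python
import math

def softmax_2d(logits):
    """Turn a 2D grid of raw scores into a probability heatmap that sums to 1."""
    flat_max = max(v for row in logits for v in row)  # subtract max for stability
    exps = [[math.exp(v - flat_max) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def heatmap_peak(heatmap):
    """Return the (row, col) index of the highest-probability pixel."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, p in enumerate(row):
            if p > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best

# Hypothetical per-pixel scores for one projected view (toy values).
logits = [
    [0.1, 0.2, 0.0],
    [0.3, 2.5, 0.4],
    [0.0, 0.1, 0.2],
]
heatmap = softmax_2d(logits)
peak = heatmap_peak(heatmap)
print("peak pixel:", peak)
```

In a multi-view setup, one such peak per projected image could in principle be combined back into a 3D target; how BridgeVLA does this is not stated in the abstract.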
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Authors: Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha
First: 2024-02-20T18:57:34+00:00 · Latest: 2025-10-14T12:09:01+00:00
Comments: One of the first surveys on Visual Language Models
Abstract
The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. This classification is based on their respective capabilities and functionalities in processing and generating various modalities of data. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We also analyze the performance of VLMs on various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.
中文标题/摘要
标题:探索视觉语言模型的前沿:当前方法学与未来方向综述
大型语言模型(LLMs)的出现极大地重塑了人工智能革命的轨迹。然而,这些LLMs在处理视觉信息方面存在明显局限性。为解决这一限制,研究人员致力于将视觉能力与LLMs结合,从而产生了视觉语言模型(VLMs)。这些先进的模型在图像字幕和视觉问答等更复杂的任务中发挥着重要作用。在我们的综述论文中,我们深入探讨了VLMs的关键进展。我们将VLMs分为三类:专注于视觉语言理解的模型、处理多模态输入以生成单模态(文本)输出的模型以及同时接受和产生多模态输入和输出的模型。这种分类基于它们在处理和生成各种模态数据方面的能力和功能。我们详细剖析了每种模型,提供了其基础架构、训练数据来源以及可能的优势和局限性的全面分析,使读者能够全面了解其关键组成部分。我们还分析了VLMs在各种基准数据集中的性能。通过这种方式,我们旨在提供对VLMs多样景观的深刻理解。此外,我们强调了这一动态领域中未来研究的潜在途径,预示着进一步的突破和进步。
Summary / 总结
This paper explores the advancements in Vision-Language Models (VLMs) by categorizing them into three types based on their capabilities: models for vision-language understanding, models processing multimodal inputs to generate unimodal outputs, and models accepting and producing multimodal inputs and outputs. The authors analyze the foundational architecture, training data, and performance of these models in benchmark datasets, providing a comprehensive understanding of VLMs and suggesting future research directions. The study aims to offer a nuanced understanding of the current landscape of VLMs and their potential for further advancements.
这篇综述论文探讨了视觉语言模型(VLMs)的发展,将其分为三种类型:用于视觉语言理解的模型、处理多模态输入以生成文本输出的模型以及同时接受和产生多模态输入和输出的模型。作者分析了这些模型的架构、训练数据和在基准数据集中的性能,旨在提供对VLMs的全面理解,并提出未来研究的方向。
Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda
Authors: André Torneiro, Diogo Monteiro, Paulo Novais, Pedro Rangel Henriques, Nuno F. Rodrigues
First: 2025-10-14T11:27:46+00:00 · Latest: 2025-10-14T11:27:46+00:00
Comments: 44 pages
Abstract
Urban monitoring of public infrastructure (such as waste bins, road signs, vegetation, sidewalks, and construction sites) poses significant challenges due to the diversity of objects, environments, and contextual conditions involved. Current state-of-the-art approaches typically rely on a combination of IoT sensors and manual inspections, which are costly, difficult to scale, and often misaligned with citizens' perception formed through direct visual observation. This raises a critical question: Can machines now "see" like citizens and infer informed opinions about the condition of urban infrastructure? Vision-Language Models (VLMs), which integrate visual understanding with natural language reasoning, have recently demonstrated impressive capabilities in processing complex visual information, turning them into a promising technology to address this challenge. This systematic review investigates the role of VLMs in urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021 and 2025 to address four core research questions: (1) What urban monitoring tasks have been effectively addressed using VLMs? (2) Which VLM architectures and frameworks are most commonly used and demonstrate superior performance? (3) What datasets and resources support this emerging field? (4) How are VLM-based applications evaluated, and what performance levels have been reported?
中文标题/摘要
标题:基于视觉-语言模型的城市通用监测:综述、评估与研究议程
公共基础设施(如垃圾箱、道路标志、植被、人行道和建筑工地)的城市监测由于涉及多种物体、环境和背景条件而面临重大挑战。当前最先进的方法通常依赖物联网传感器和人工检查,这些方法成本高、难以扩展,并且往往与公民通过直接视觉观察形成的感知不一致。这提出了一个关键问题:机器现在能否“像公民一样”看到,并推断出关于城市基础设施状况的有见地的意见?视觉-语言模型(VLMs),结合视觉理解和自然语言推理,最近在处理复杂视觉信息方面表现出色,成为解决这一挑战的有前途的技术。本系统综述探讨了VLMs在城市监测中的作用,特别强调了零样本应用。按照PRISMA方法,我们分析了2021年至2025年间发表的32篇同行评审研究,以回答四个核心研究问题:(1)哪些城市监测任务已有效使用VLMs解决?(2)哪些VLM架构和框架最常用且表现最佳?(3)哪些数据集和资源支持这一新兴领域?(4)VLM基应用程序如何评估,报告了哪些性能水平?
NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows
Authors: Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, Vladislav Kurenkov
First: 2025-08-23T00:02:15+00:00 · Latest: 2025-10-14T10:06:39+00:00
Comments: https://github.com/dunnolab/NinA/
Abstract
Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
中文标题/摘要
标题:NinA:行动中的归一化流。使用归一化流训练VLA模型
近期在视觉-语言-行动(VLA)模型方面的进展确立了两部分架构,其中预训练的视觉-语言模型(VLM)编码视觉观察和任务描述,而行动解码器将这些表示映射到连续行动。由于能够建模复杂的多模态行动分布,扩散模型被广泛用作行动解码器。然而,它们在推理时需要多次迭代去噪步骤或下游技术来加快采样速度,这限制了它们在需要高频控制的实际场景中的实用性。在本文中,我们提出了NinA(行动中的归一化流),这是一种扩散基解码器的快速且表达能力强的替代方案。NinA用归一化流(NF)替换扩散行动解码器,通过可逆变换实现一次采样,显著减少了推理时间。我们将NinA集成到FLOWER VLA架构中,并在LIBERO基准上进行微调。我们的实验表明,在相同的训练条件下,NinA与基于扩散的解码器具有相当的性能,但推理速度显著加快。这些结果表明,NinA为高效、高频VLA控制提供了一条有前景的道路,而不会牺牲性能。
Summary / 总结
NinA is a novel approach to Vision-Language-Action models that replaces diffusion-based action decoders with Normalizing Flows (NFs) for faster inference. This method enables one-shot sampling and significantly reduces inference time while maintaining performance comparable to diffusion-based models. Experiments on the LIBERO benchmark demonstrate that NinA achieves faster inference without compromising on performance.
NinA 是一种新的视觉-语言-动作模型方法,用归一化流(NFs)替代了基于扩散的行动解码器,以实现更快的推理。这种方法允许一次采样并显著减少推理时间,同时保持与基于扩散模型相当的性能。在 LIBERO 基准上的实验表明,NinA 在不牺牲性能的情况下实现了更快的推理。
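The NinA abstract above contrasts iterative diffusion denoising with a normalizing flow that samples in one pass through an invertible transformation. A minimal sketch of that property, using a single affine coupling layer (a standard normalizing-flow building block), is shown below. The conditioner `toy_scale_shift` is a hypothetical stand-in for a learned network, and conditioning on VLM features is omitted; this is not the NinA architecture itself.

```python
import math
import random

def toy_scale_shift(h):
    """Hypothetical stand-in for a learned conditioner: returns (log_scale, shift)."""
    log_s = [math.tanh(v) for v in h]
    shift = [0.5 * v for v in h]
    return log_s, shift

def coupling_forward(z, scale_shift):
    """Latent z -> sample x in one pass. The first half passes through unchanged;
    the second half is scaled and shifted using values computed from the first."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    log_s, shift = scale_shift(z1)
    x2 = [v * math.exp(s) + t for v, s, t in zip(z2, log_s, shift)]
    return z1 + x2

def coupling_inverse(x, scale_shift):
    """Exact inverse of coupling_forward, which is what makes the layer a flow."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_s, shift = scale_shift(x1)
    z2 = [(v - t) * math.exp(-s) for v, s, t in zip(x2, log_s, shift)]
    return x1 + z2

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]  # one draw from the Gaussian base
x = coupling_forward(z, toy_scale_shift)      # single pass, no iterative denoising
z_back = coupling_inverse(x, toy_scale_shift)
print(max(abs(a - b) for a, b in zip(z, z_back)))
```

Stacking several such layers (with the halves swapped between layers) keeps sampling a single forward pass, which is the source of the inference-time advantage the abstract describes.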
CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding
Authors: Shixin Yi, Lin Shang
First: 2025-08-01T07:17:12+00:00 · Latest: 2025-10-14T09:15:36+00:00
Comments: The paper is not yet mature and needs further improvement
Abstract
Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present CoRGI (Chain of Reasoning with Grounded Insights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmarks (VCR, ScienceQA, MMMU, MathVista, and HallusionBench) demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.
中文标题/摘要
标题:CoRGI:经过后验视觉定位验证的链式推理
多模态推理使用视觉语言模型(VLMs)经常遭受幻觉问题,因为模型往往在仅对图像进行浅层检查后就生成解释。我们提出了**CoRGI**(链式推理与定位见解),这是一种通过后验验证链式输出的框架,以增强推理可靠性。给定VLM生成的推理,CoRGI将其分解为逐步陈述,将每一步与视觉证据联系起来,并在生成最终答案之前过滤或纠正未支持的声明。在五个具有挑战性的基准VCR、ScienceQA、MMMU、MathVista和HallusionBenc上进行的实验表明,CoRGI在多个VLM骨干网络,包括Qwen-2.5VL、LLaVA-1.6和Gemma3-12B中,始终提高了答案准确性和解释可信度。除了定量收益外,定性分析还进一步说明了验证过程如何减少幻觉并增强可解释性,表明后验视觉定位是构建更值得信赖和透明的多模态推理系统的有前途的方向。
Summary / 总结
CoRGI is a framework that enhances the reliability of multimodal reasoning by verifying the chain-of-thought outputs of vision-language models post-hoc. It decomposes the rationale into steps, grounds each step in visual evidence, and filters unsupported claims. Experiments on five benchmarks show that CoRGI improves both answer accuracy and explanation faithfulness across different VLMs, suggesting post-hoc visual grounding is a promising approach for more trustworthy multimodal reasoning systems.
CoRGI 是一个框架,通过后验验证视觉语言模型生成的推理链,增强其可靠性。它将推理分解为步骤,将每个步骤与视觉证据联系起来,并过滤掉未支持的声明。在五个基准测试上的实验表明,CoRGI 能够提高答案准确性和解释可信度,不同 VLM 的结果表明后验视觉接地是构建更可信和透明的多模态推理系统的有前途的方法。
Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector
Authors: Sifan Li, Hongkai Chen, Yujun Cai, Qingwen Ye, Liyang Chen, Junsong Yuan, Yiwei Wang
First: 2025-10-14T08:42:58+00:00 · Latest: 2025-10-14T08:42:58+00:00
Abstract
Vision Language Models (VLMs) have achieved impressive progress in multimodal reasoning; yet, they remain vulnerable to hallucinations, where outputs are not grounded in visual evidence. In this paper, we investigate a previously overlooked setting: logo hallucination, where models generate brand names or textual content despite logos containing no visible words. Using curated splits of pure symbols, hybrids, and text-bearing logos, as well as the challenging Hard-60 subset, we systematically measure hallucination across leading VLMs. We further probe robustness through nine structured perturbations and show that hallucinations persist even under strong distortions, with occlusion exposing the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA demonstrates that hallucination is tied to a small subset of projector dimensions, and targeted ablation substantially reduces errors while preserving OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic priors rather than genuine glyph perception, particularly for iconic circular logos, and that projector subspaces play a decisive role in this failure mode. Our work contributes both a novel diagnostic lens and actionable mitigation insights, highlighting projector disentanglement and OCR-guided decoding as promising directions for building more trustworthy multimodal systems.
中文标题/摘要
标题:视觉语言模型通过视觉投影中的语义纠缠将标志映射到文本
视觉语言模型(VLMs)在多模态推理方面取得了显著进展;然而,它们仍然容易出现幻觉,即输出缺乏视觉证据。在本文中,我们研究了一个之前被忽视的场景:标志幻觉,即模型生成品牌名称或文本内容,尽管标志中没有可见的文字。我们使用精心策划的纯符号、混合体和带文本的标志的划分,以及具有挑战性的Hard-60子集,系统地测量了领先VLMs的幻觉。我们进一步通过九种结构化扰动测试了鲁棒性,并表明即使在强烈失真下幻觉仍然存在,遮挡揭示了最明显的弱点。通过开放权重的LLaVA进行嵌入级分析表明,幻觉与投影维度中的一个小子集相关,有针对性的消融显著减少了错误同时保持OCR准确性。这些发现揭示了VLMs往往依赖于符号先验而非真实的字符感知,特别是对于标志性的圆形标志,且投影子空间在这一失败模式中起着决定性作用。我们的工作不仅提供了一个新颖的诊断视角,还提供了可操作的缓解策略,强调了投影解纠缠和OCR引导解码是构建更可信的多模态系统有希望的方向。
Summary / 总结
This paper investigates the issue of logo hallucination in Vision Language Models (VLMs), where models generate text despite logos containing no visible words. The authors use various logo types and perturbations to measure hallucination across leading VLMs and find that hallucinations persist even under strong distortions. Embedding-level analysis shows that hallucination is linked to specific projector dimensions, and targeted ablation reduces errors while maintaining OCR accuracy. The study reveals that VLMs rely on symbolic priors rather than genuine glyph perception, particularly for circular logos, and suggests that projector disentanglement and OCR-guided decoding could improve VLM robustness.
本文研究了视觉语言模型(VLMs)在处理logo时的幻觉问题,即模型在没有可见文字的情况下生成文本。作者使用不同类型的logo和扰动来测量领先VLMs中的幻觉现象,并发现即使在强烈扭曲的情况下,幻觉仍然存在。他们还表明,幻觉与特定的投影维度相关,并可以通过针对性的消融减少错误同时保持OCR准确性。研究揭示了VLMs在处理圆环形logo时依赖于符号先验而非真正的字符感知,并建议投影解耦和OCR引导解码作为改进多模态系统的有前景策略。
GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
Authors: Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan
First: 2025-09-29T16:48:54+00:00 · Latest: 2025-10-14T08:30:43+00:00
Comments: 6 tables and 8 figures. https://mustansarfiaz.github.io/GeoVLM-R1/
Abstract
Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task-aware reasoning. We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision-language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/.
中文标题/摘要
标题:GeoVLM-R1:强化学习微调以提高遥感推理能力
近期强化学习(RL)在自然图像领域的推理能力取得了显著进展,但在地球观测(EO)领域的潜力尚未得到充分探索。EO任务引入了独特的挑战,包括目标检测、图像或区域描述、变化检测、语义定位和时间分析,这些挑战需要任务感知的推理。我们提出了一种新的后训练框架,结合任务感知的奖励,以使基于RL的推理模型能够有效适应各种EO任务。这种训练策略增强了遥感图像的推理能力,稳定了优化过程,并提高了鲁棒性。在多个EO基准测试中的广泛实验显示,与最先进的通用和专门的视觉语言模型相比,该方法具有一致的性能提升。代码和模型将在https://mustansarfiaz.github.io/GeoVLM-R1/ 公开发布。
Summary / 总结
The research aims to enhance reasoning capabilities in Earth Observation (EO) tasks using reinforcement learning (RL). The method involves a post-training framework that integrates task-aware rewards to fine-tune RL models for diverse EO tasks. Key experimental findings show consistent performance improvements over existing generic and specialized vision-language models across multiple EO benchmarks, improving reasoning and robustness in remote sensing images.
研究旨在利用强化学习提升地球观测(EO)任务中的推理能力。方法是采用一个后训练框架,结合任务感知的奖励来微调RL模型以适应多种EO任务。实验结果表明,在多个EO基准测试中,该方法在推理和鲁棒性方面优于现有的一般和专门的视觉语言模型,提升了遥感图像的推理能力。
HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization
Authors: Ziyi Han, Huanyu Wang, Zeyu Zhang, Xiangxiang Dai, Xutong Liu, John C. S. Lui
First: 2025-10-14T08:19:13+00:00 · Latest: 2025-10-14T08:19:13+00:00
Abstract
Low-Rank Adaptation (LoRA) has emerged as a widely used technique for adapting large language models (LLMs) to new domains, due to its modular design and broad availability on platforms such as HuggingFace. This availability has motivated efforts to reuse existing LoRAs for domain generalization. However, existing methods often rely on explicit task labels or additional training, which are impractical for deployment. Moreover, they typically activate a fixed number of entire LoRA modules, leading to parameter redundancy or insufficiency that degrades performance. In this paper, we propose HiLoRA, a training-free framework that performs adaptive hierarchical routing over LoRA pools. Drawing on structural properties of LoRA, we define rank-one components (ROCs), in which each rank parameter is regarded as an independent unit. For a given input sequence, HiLoRA first adaptively selects a subset of LoRAs and determines their ROC allocation based on Gaussian likelihoods at the sequence level. At the token level, it further refines routing by activating only the most informative ROCs. We further provide theoretical guarantees that HiLoRA selects the most relevant LoRAs with high probability. Extensive experiments show that HiLoRA achieves substantial improvements in domain generalization, with accuracy gains of up to 55% over state-of-the-art baselines, while maintaining comparable inference throughput.
Summary / 总结
HiLoRA is a training-free framework that performs adaptive hierarchical routing over LoRA pools to improve domain generalization. It selects relevant LoRAs based on Gaussian likelihoods and refines routing at the token level by activating only the most informative rank-one components. Experiments show that HiLoRA achieves up to 55% accuracy gains over state-of-the-art baselines while maintaining similar inference throughput.
本文提出了一种名为HiLoRA的训练-free框架,用于使用低秩适应(LoRA)进行领域泛化。该框架旨在实现实用且高效的领域适应,通过层次路由LoRA池来选择子集并基于高斯似然性分配秩一组件(ROCs),并在token级别进一步细化路由。实验表明,与最先进的基线方法相比,HiLoRA在领域泛化准确性上提高了高达55%,同时保持了类似的推理吞吐量。
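The HiLoRA abstract above treats each rank of a LoRA as an independent rank-one component (ROC). The structural fact behind this is that a LoRA update BA, with B of shape d x r and A of shape r x k, equals the sum of r outer products of the i-th column of B with the i-th row of A, so any subset of ranks can be switched on by itself. The sketch below checks that identity on toy matrices; the helper names and values are assumptions for illustration, and the selection of which ROCs to keep (Gaussian likelihoods, token-level routing) is not reproduced here.

```python
def outer(u, v):
    """Outer product of vectors u (length d) and v (length k) as a d x k matrix."""
    return [[ui * vj for vj in v] for ui in u]

def matmul(B, A):
    """Plain matrix product of B (d x r) and A (r x k)."""
    return [[sum(B[i][t] * A[t][j] for t in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def rank_one_components(B, A):
    """Split the LoRA update B @ A into r rank-one pieces, one per rank index."""
    return [outer([row[i] for row in B], A[i]) for i in range(len(A))]

def sum_components(components, keep):
    """Add up only the components whose indices appear in `keep`."""
    d, k = len(components[0]), len(components[0][0])
    total = [[0.0] * k for _ in range(d)]
    for i in keep:
        total = [[a + b for a, b in zip(ra, rb)]
                 for ra, rb in zip(total, components[i])]
    return total

# Toy LoRA factors with d=3, r=2, k=2 (illustrative values only).
B = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
A = [[1.0, 2.0], [0.5, -1.0]]

rocs = rank_one_components(B, A)
full = sum_components(rocs, keep=[0, 1])   # all ranks active: equals B @ A
partial = sum_components(rocs, keep=[0])   # only the first rank active
print(full)
print(partial)
```

Activating a subset of ranks, as in `partial`, is the kind of fine-grained control that module-level routing over whole LoRAs cannot express.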
Cross-Modal Safety Alignment: Is textual unlearning all you need?
Authors: Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy-Chowdhury, Chengyu Song
Venue: EMNLP 2024
First: 2024-05-27T20:29:13+00:00 · Latest: 2025-10-14T07:41:15+00:00
Comments: Accepted by EMNLP 2024 Findings
Abstract
Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability: textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and, in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, while preserving utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.
中文标题/摘要
标题:跨模态安全对齐:你只需要文本去学习吗?
近期研究表明,将新模态整合到大型语言模型(LLMs)中,如视觉-语言模型(VLMs),会创建一个新的攻击面,绕过了现有的安全训练技术,如监督微调(SFT)和基于人类反馈的强化学习(RLHF)。虽然可以在多模态设置中进一步进行SFT和RLHF安全训练,但收集多模态训练数据集存在重大挑战。受最近多模态模型结构设计的启发,无论输入模态的组合如何,所有输入最终都会融合到语言空间中,我们旨在探索是否仅在文本域中去学习可以有效实现跨模态安全对齐。我们的跨六个数据集的评估实证地证明了这种转移性——在VLMs中仅在文本域进行去学习显著降低了攻击成功率(ASR)至低于8%,在某些情况下甚至低至接近2%,同时保持了实用性。此外,我们的实验表明,使用多模态数据集进行去学习没有潜在的好处,但会带来显著增加的计算需求,可能高达6倍。
Summary / 总结
This study addresses the challenge of ensuring safety in Vision-Language Models (VLMs) by exploring textual unlearning as a method for cross-modal safety alignment. Unlike traditional safety training techniques, textual unlearning focuses solely on the textual domain, which is more feasible to implement. The research demonstrates that textual unlearning can effectively reduce the Attack Success Rate (ASR) to less than 8% and even as low as nearly 2% for both text-based and vision-text-based attacks, while maintaining model utility. Importantly, the study finds that using a multi-modal dataset for unlearning does not offer additional benefits but increases computational demands significantly.
该研究旨在通过探索文本去学习来解决视觉语言模型(VLMs)的安全性问题,并将其作为跨模态安全对齐的方法。与传统的安全训练技术不同,文本去学习仅在文本领域进行,更具可行性。研究结果表明,文本去学习可以有效将攻击成功率(ASR)降低到低于8%,甚至在某些情况下低至接近2%,同时保持模型的实用性。此外,研究发现使用多模态数据集进行去学习并不会带来额外的好处,反而会显著增加计算需求。
Extremely low-bitrate Image Compression Semantically Disentangled by LMMs from a Human Perception Perspective
Authors: Juan Song, Lijie Yang, Mingtao Feng
First: 2025-03-01T08:27:11+00:00 · Latest: 2025-10-14T07:36:33+00:00
Abstract
It remains a significant challenge to compress images at extremely low bitrates while achieving both semantic consistency and high perceptual quality. Inspired by the human progressive perception mechanism, we propose a Semantically Disentangled Image Compression framework (SEDIC) in this paper. Initially, an extremely compressed reference image is obtained through a learned image encoder. Then we leverage LMMs to extract essential semantic components, including overall descriptions, detailed object descriptions, and semantic segmentation masks. We propose a training-free Object Restoration model with Attention Guidance (ORAG) built on pre-trained ControlNet to restore object details conditioned on object-level text descriptions and semantic masks. Based on the proposed ORAG, we design a multistage semantic image decoder to progressively restore the details object by object, starting from the extremely compressed reference image, ultimately generating high-quality and high-fidelity reconstructions. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at extremely low bitrates (≤ 0.05 bpp).
中文标题/摘要
标题:从人类感知角度通过LMMs语义解耦的极低比特率图像压缩
在保持语义一致性和高感知质量的同时,对图像进行极低比特率压缩仍然是一个重大挑战。受人类渐进感知机制的启发,本文提出了一种语义解耦图像压缩框架(SEDIC)。首先,通过学习图像编码器获得一个极简压缩的参考图像。然后利用LMMs提取关键的语义组件,包括整体描述、对象详细描述和语义分割掩码。我们基于预训练的ControlNet构建了一个无需训练的对象恢复模型(ORAG),带有注意力引导,以对象级文本描述和语义掩码为条件恢复对象细节。基于提出的ORAG,我们设计了一种多阶段语义图像解码器,逐步按对象恢复细节,从极简压缩的参考图像开始,最终生成高质量和高保真的重构图像。实验结果表明,SEDIC在极低比特率(≤0.05 bpp)下显著优于现有方法,实现了更高的感知质量和语义一致性。
Summary / 总结
The paper addresses the challenge of compressing images at extremely low bitrates while maintaining semantic consistency and high perceptual quality. It proposes a Semantically Disentangled Image Compression framework (SEDIC) that first obtains an extremely compressed reference image using a learned encoder. LMMs are then used to extract essential semantic components, and a training-free Object Restoration model with Attention Guidance (ORAG) restores object details based on text descriptions and semantic masks. The framework progressively restores details, generating high-quality reconstructions. Experiments show that SEDIC outperforms existing methods at bitrates ≤ 0.05 bpp in terms of perceptual quality and semantic consistency.
论文旨在解决在极低比特率下压缩图像的同时保持语义一致性和高感知质量的挑战。提出了一种语义解耦图像压缩框架(SEDIC),首先通过学习编码器获得极简压缩的参考图像。然后使用LMM提取关键语义组件,并利用基于预训练ControlNet的无训练对象恢复模型(ORAG)根据对象级文本描述和语义掩码恢复对象细节。框架采用多阶段语义图像解码器逐步恢复细节,生成高质量的重构图像。实验表明,SEDIC在比特率≤0.05 bpp时显著优于现有方法,实现了更高的感知质量和语义一致性。
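To give a sense of scale for the "≤ 0.05 bpp" regime mentioned in the SEDIC entry above, the short sketch below converts between bits per pixel and a total byte budget using the standard definition (total bits divided by pixel count). The 512x512 resolution is an assumed example, not a figure from the paper.

```python
def bits_per_pixel(num_bytes, width, height):
    """Bitrate of a compressed image: total bits divided by number of pixels."""
    return num_bytes * 8 / (width * height)

def byte_budget(bpp, width, height):
    """Largest payload in bytes (possibly fractional) allowed at a given bitrate."""
    return bpp * width * height / 8

# Assumed example resolution: 512 x 512 pixels.
budget = byte_budget(0.05, 512, 512)
print("byte budget at 0.05 bpp:", budget)
print("bpp of a 1638-byte payload:", bits_per_pixel(1638, 512, 512))
```

At this assumed resolution the whole payload, including the compressed reference image and any text descriptions or masks, has to fit in roughly 1.6 KB.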
HoneyBee: Data Recipes for Vision-Language Reasoners
Authors: Hritik Bansal, Devandra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, Ramakanth Pasunuru
First: 2025-10-14T07:23:44+00:00 · Latest: 2025-10-14T07:23:44+00:00
Comments: 32 pages
Abstract
Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting of 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.
中文标题/摘要
标题:HoneyBee: 视觉-语言数据食谱
近期视觉-语言模型(VLMs)的进步使它们在推理任务中表现出色。然而,构建高性能VL推理训练数据集的原则仍然知之甚少。在本研究中,我们介绍了几种数据整理方法,并通过仔细控制训练和评估设置来研究它们对VL推理能力的影响。我们分析了上下文(图像和问题配对)来源的影响,实施了有针对性的数据干预,并探索了扩展图像、问题和思维链(CoT)解决方案。我们的研究发现:(a) 上下文来源策略显著影响VLM性能;(b) 诸如来自图像描述的辅助信号和包含文本推理的干预措施带来了显著收益;(c) 扩展所有数据维度(例如,每张图像的独特问题数量和每张图像-问题配对的独特CoT数量)一致提高了推理能力。受这些见解的启发,我们引入了HoneyBee,这是一个包含250万示例、35万图像-问题配对的大规模高质量CoT推理数据集。使用HoneyBee训练的VLM在各种模型规模上均优于最先进的模型。例如,一个使用30亿参数训练的HoneyBee模型在MathVerse上的表现分别比最先进的模型和基础模型高出7.8%和24.8%。此外,我们提出了一种测试时扩展策略,该策略将解码成本降低了73%,而不会牺牲准确性。总体而言,这项工作提出了改进的VL推理数据集整理研究策略。
Summary / 总结
This work investigates the principles behind constructing effective vision-language reasoning datasets by introducing HoneyBee, a large-scale dataset with 2.5M examples. The study finds that context source strategies, auxiliary signals from image captions, and scaling data dimensions significantly enhance VLM performance. HoneyBee-trained VLMs outperform state-of-the-art models, with a 7.8% and 24.8% improvement over the SOTA and base models, respectively, on MathVerse. Additionally, a test-time scaling strategy is proposed to reduce decoding cost by 73%.
该研究通过引入包含250万例的HoneyBee大规模数据集,探索构建有效视觉-语言推理数据集的原则。研究发现,上下文来源策略、图像标题的辅助信号以及扩展数据维度显著提升了VLM性能。HoneyBee训练的模型在MathVerse上优于SOTA模型,分别提高了7.8%和24.8%。此外,还提出了一种测试时的缩放策略,可将解码成本降低73%而不牺牲准确性。
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
Authors: Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, Sharon Li
Venue: NeurIPS 2025
First: 2025-05-19T21:04:46+00:00 · Latest: 2025-10-14T07:13:43+00:00
Comments: NeurIPS 2025
Abstract
Worldwide image geolocalization, the task of predicting GPS coordinates from images taken anywhere on Earth, poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.
中文标题/摘要
标题:GeoRanker:全球图像地理定位的距离感知排名
全球图像地理定位——从地球上任何地方拍摄的图像预测GPS坐标——由于地区间视觉内容的巨大多样性而构成了基本挑战。尽管最近的方法采用两阶段管道(首先是检索候选者,然后选择最佳匹配),但它们通常依赖于简单的相似性启发式方法和点监督,未能建模候选者之间的空间关系。在本文中,我们提出了一种距离感知排名框架GeoRanker,该框架利用大规模的视觉-语言模型联合编码查询-候选者交互并预测地理邻近度。此外,我们引入了一种多阶距离损失,可以对绝对和相对距离进行排名,使模型能够推理结构化空间关系。为此,我们构建了GeoRanking,这是首个明确为地理排名任务设计的多模态候选信息数据集。GeoRanker在两个广泛认可的基准测试(IM2GPS3K和YFCC4K)上取得了最先进的结果,显著优于当前最佳方法。
Summary / 总结
GeoRanker is a distance-aware ranking framework for worldwide image geolocalization that uses large vision-language models to encode query-candidate interactions and predict geographic proximity. It introduces a multi-order distance loss to rank both absolute and relative distances, enhancing spatial reasoning. GeoRanker outperforms existing methods on IM2GPS3K and YFCC4K benchmarks, demonstrating its effectiveness in modeling spatial relationships among candidates.
GeoRanker 是一种用于全球图像地理定位的距离感知排名框架,使用大型视觉-语言模型来编码查询-候选交互并预测地理接近度。它引入了一种多级距离损失,用于排名绝对和相对距离,增强空间推理能力。GeoRanker 在 IM2GPS3K 和 YFCC4K 基准测试中表现出色,显著优于现有方法,证明了其在建模候选者之间空间关系方面的有效性。
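To make the idea of ranking candidates by geographic proximity more concrete, the following is a minimal sketch rather than GeoRanker's actual loss. It assumes great-circle (haversine) distance as the ground-truth signal and a plain pairwise logistic ranking loss over candidate scores; the function names and the choice of loss are illustrative assumptions. The relative-distance terms of the multi-order loss mentioned in the abstract are not modeled here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against floating-point values slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def pairwise_distance_ranking_loss(scores, distances_km):
    """Hypothetical pairwise logistic loss: a candidate closer to the query
    should receive a higher score than a candidate farther away."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if distances_km[i] < distances_km[j]:
                margin = scores[i] - scores[j]
                losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch, scores that order candidates from nearest to farthest yield a lower loss than scores in the reverse order, which is the behaviour a distance-aware ranker is trained toward.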
STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
Authors: Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, Yu Yamaguchi
First: 2025-08-14T07:57:06+00:00 · Latest: 2025-10-14T06:54:59+00:00
Comments: Project Page: https://turingmotors.github.io/stride-qa/
Abstract
Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.
中文标题/摘要
标题:STRIDE-QA:城市驾驶场景时空推理的视觉问答数据集
视觉-语言模型(VLMs)已被应用于自动驾驶,以支持在复杂现实场景中的决策。然而,它们基于静态的、来源于网络的图像-文本对的训练,从根本上限制了理解并预测动态交通场景所需的精确时空推理能力。我们通过STRIDE-QA填补了这一关键缺口,这是一个大规模的视觉问答(VQA)数据集,用于从第一人称视角进行物理上接地的推理。该数据集源自东京100小时的多传感器驾驶数据,捕捉了多样且具有挑战性的条件,是最大的用于城市驾驶时空推理的VQA数据集,提供了超过285,000帧的1600万对问答。通过密集的、自动生成的注释,包括3D边界框、分割掩码和多对象轨迹,数据集独特地支持了通过三个需要空间定位和时间预测的新问答任务进行对象中心和第一人称中心推理。我们的基准测试表明,现有VLMs在预测一致性方面表现不佳,得分接近零。相比之下,基于STRIDE-QA微调的VLMs表现出显著的性能提升,空间定位成功率为55%,未来运动预测一致性为28%,而通用VLMs的得分接近零。因此,STRIDE-QA为开发更可靠的VLMs奠定了全面的基础,适用于安全关键的自动驾驶系统。
Summary / 总结
The research aims to enhance the spatiotemporal reasoning capabilities of Vision-Language Models (VLMs) for autonomous driving by addressing their limitations in understanding dynamic traffic scenes. STRIDE-QA, a large-scale VQA dataset, is introduced, containing 16 million QA pairs over 285K frames from 100 hours of driving data in Tokyo. The dataset supports both object-centric and ego-centric reasoning through three novel QA tasks. Benchmarks show that existing VLMs perform poorly, while those fine-tuned on STRIDE-QA achieve 55% success in spatial localization and 28% consistency in future motion prediction, significantly outperforming general-purpose models.
STRIDE-QA 是一个大规模的视觉问答数据集,用于解决城市驾驶场景中的时空推理问题,弥补了现有基于静态图像-文本对训练的视觉语言模型的不足。该数据集包含来自东京100小时多传感器驾驶数据的1600万问答对,覆盖了285K帧,并附有密集的自动标注,包括3D边界框、分割掩码和多对象轨迹。基准测试显示,现有的视觉语言模型在预测一致性方面表现不佳,而经过 STRIDE-QA 微调的模型在空间定位上的成功率为55%,未来运动预测的一致性为28%。
Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos
Authors: Shingo Yokoi, Kento Sasaki, Yu Yamaguchi
Venue: ICCV 2025
First: 2025-10-14T06:36:41+00:00 · Latest: 2025-10-14T06:36:41+00:00
Comments: 2nd Place Winner, ICCV 2025 2COOOL Competition
Abstract
Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.
中文标题/摘要
标题:基于视觉-语言模型的层次推理在行车记录视频事故报告生成中的应用
端到端(E2E)自动驾驶的最新进展得益于大规模驾驶数据集的训练,但自动驾驶模型在分布外(OOD)场景中仍然存在困难。COOOL基准通过鼓励超越封闭分类的理解来解决这一差距,而2COOOL挑战则将其扩展到生成可解释的事故报告。我们提出了一种从行车记录视频生成事故报告的层次推理框架,该框架结合了帧级描述、事故帧检测和视觉-语言模型(VLM)中的细粒度推理。我们进一步通过模型集成和盲A/B评分选择协议提高了事实准确性和可读性。在官方2COOOL公开排行榜上,我们的方法在29支队伍中排名第2,并获得了最佳CIDEr-D分数,生成了准确且连贯的事故叙述。这些结果表明,使用VLM的层次推理是事故分析和更广泛理解关键交通事件的有前途的方向。该实现和代码可在https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution上获得。
Summary / 总结
This paper addresses the challenge of generating human-interpretable incident reports from dashcam videos, focusing on out-of-distribution scenarios. The authors propose a hierarchical reasoning framework that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models. Their method, which includes model ensembling and a Blind A/B Scoring selection protocol, achieves the best CIDEr-D score on the 2COOOL leaderboard, demonstrating improved factual accuracy and readability. This suggests that hierarchical reasoning with VLMs is a promising approach for accident analysis and understanding safety-critical traffic events.
该论文旨在从行车记录仪视频中生成可解释的事故报告,重点关注分布外(OOD)场景。作者提出了一种分层推理框架,结合了帧级描述、事故帧检测和视觉语言模型中的细粒度推理。该方法包括模型集成和盲A/B评分选择协议,在2COOOL排行榜上取得了最佳CIDEr-D分数,表明在事实准确性与可读性方面有所提升。这表明,分层推理与视觉语言模型相结合是事故分析和理解安全关键交通事件的有前途的方向。
Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
Authors: Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin
First: 2025-10-13T15:39:13+00:00 · Latest: 2025-10-14T04:40:48+00:00
Abstract
Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We find that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of the output. Building on these insights, we propose Detail Guidance (DG), an MA-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinements of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
中文标题/摘要
标题:大规模激活是扩散变换器在视觉生成中局部细节合成的关键
扩散变换器(DiTs)最近已成为视觉生成的强大骨干。最近的观察发现它们内部特征图中存在大规模激活(MAs),但其功能尚未得到充分理解。在本工作中,我们系统地研究这些激活以阐明其在视觉生成中的作用。我们发现这些大规模激活出现在所有空间标记中,其分布受输入时间步嵌入的调节。重要的是,我们的研究进一步表明,这些大规模激活在局部细节合成中起着关键作用,而对输出的整体语义内容影响甚微。基于这些见解,我们提出了Detail Guidance(DG),一种基于MAs的、无需训练的自我指导策略,以明确增强DiTs的局部细节保真度。具体而言,DG 通过破坏MAs 构建一个退化的“细节不足”模型,并利用它来引导原始网络向更高质量的细节合成发展。我们的DG 可以无缝地与无分类器引导(CFG)集成,进一步细化微小的细节。广泛的实验表明,我们的DG 在各种预训练的DiTs(例如,SD3、SD3.5 和 Flux)中一致地提高了细粒度细节的质量。
Summary / 总结
This work investigates the role of massive activations (MAs) in Diffusion Transformers (DiTs) for visual generation. It finds that MAs are crucial for local detail synthesis while having minimal impact on the overall semantic content. Based on this, the authors propose Detail Guidance (DG), a training-free self-guidance method that builds a degraded, detail-deficient model by disrupting MAs and uses it to guide the original network toward better detail synthesis. DG can be combined with Classifier-Free Guidance (CFG) for further refinement. Experiments show consistent improvements in fine-grained detail quality across pre-trained DiTs such as SD3, SD3.5, and Flux.
该研究探讨了大规模激活(MAs)在扩散变换器(DiTs)中对视觉生成的作用。研究发现,MAs 对局部细节合成至关重要,而不影响整体语义内容。基于此,作者提出了一种名为 Detail Guidance (DG) 的训练免费方法,通过利用 MAs 来增强局部细节保真度。实验表明,DG 能够在不同 DiTs 模型(如 SD3、SD3.5 和 Flux)中提高细粒度细节质量。
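The self-guidance idea in the Detail Guidance abstract above can be illustrated with a small sketch. The abstract does not give the update rule, so the extrapolation form below (push the original prediction away from the degraded one, in the style of classifier-free guidance) is an assumption, as are the function names, the scale parameters, and the additive way the two guidance terms are combined.

```python
def detail_guidance(eps_original, eps_degraded, dg_scale):
    """Assumed form: extrapolate away from the detail-deficient prediction.
    eps_original: prediction of the unmodified network (list of floats)
    eps_degraded: prediction of the same network with massive activations disrupted
    dg_scale:     hypothetical guidance strength; 0 returns eps_original unchanged
    """
    return [o + dg_scale * (o - d) for o, d in zip(eps_original, eps_degraded)]


def combined_guidance(eps_cond, eps_uncond, eps_degraded, cfg_scale, dg_scale):
    """Assumed additive combination of a CFG term and a detail-guidance term."""
    return [
        c + cfg_scale * (c - u) + dg_scale * (c - d)
        for c, u, d in zip(eps_cond, eps_uncond, eps_degraded)
    ]
```

Under this assumed form, components where the degraded and original predictions agree are left untouched, so the guidance acts only where disrupting the massive activations changed the output.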
ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation
Authors: Ziyuan Luo, Yangyi Zhao, Ka Chun Cheung, Simon See, Renjie Wan
Venue: NeurIPS 2025
First: 2025-10-14T03:45:19+00:00 · Latest: 2025-10-14T03:45:19+00:00
Comments: Accepted at NeurIPS 2025
Abstract
The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has raised significant concerns about the unauthorized use of private image datasets. While these systems have shown remarkable capabilities in enhancing generation quality through reference images, protecting visual datasets from unauthorized use in such systems remains a challenging problem. Traditional digital watermarking approaches face limitations in RAIG systems, as the complex feature extraction and recombination processes fail to preserve watermark signals during generation. To address these challenges, we propose ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our framework synthesizes sentinel images that maintain visual consistency with the original dataset. These sentinels enable protection verification through randomly generated character sequences that serve as retrieval keys. To ensure seamless integration, we leverage vision-language models to generate the sentinel images. Experimental results demonstrate that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications. Code is available at https://github.com/luo-ziyuan/ImageSentinel.
中文标题/摘要
标题:ImageSentinel:保护视觉数据集免受未授权检索增强图像生成的侵害
检索增强图像生成(RAIG)的广泛应用引发了对私有图像数据集未授权使用的严重关切。尽管这些系统在通过参考图像增强生成质量方面表现出色,但在RAIG系统中保护视觉数据集免受未授权使用仍然是一个具有挑战性的问题。传统的数字水印方法在RAIG系统中面临局限性,因为复杂的特征提取和重组过程无法在生成过程中保持水印信号。为了解决这些挑战,我们提出了一种名为ImageSentinel的新颖框架,用于在RAIG中保护视觉数据集。我们的框架合成保持与原始数据集视觉一致性的哨兵图像。这些哨兵通过随机生成的字符序列实现保护验证,这些字符序列作为检索密钥。为了确保无缝集成,我们利用视觉-语言模型生成哨兵图像。实验结果表明,ImageSentinel有效地检测了未授权的数据集使用,同时保留了授权应用的生成质量。代码可在https://github.com/luo-ziyuan/ImageSentinel获取。
Summary / 总结
ImageSentinel is a novel framework designed to protect visual datasets from unauthorized use in Retrieval-Augmented Image Generation (RAIG) systems. It synthesizes sentinel images that maintain visual consistency with the original dataset and uses randomly generated character sequences as retrieval keys for protection verification. Experiments show that ImageSentinel can effectively detect unauthorized usage while maintaining generation quality for authorized applications.
ImageSentinel 是一种新型框架,旨在保护视觉数据集在检索增强图像生成(RAIG)系统中的未经授权使用。它通过生成与原始数据集视觉一致的哨兵图像,并使用随机生成的字符序列作为检索密钥来进行保护验证。实验表明,ImageSentinel 可以有效检测未经授权的使用,同时保持授权应用中的生成质量。
Diffusion Language Models Know the Answer Before Decoding
Authors: Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, Shiwei Liu
First: 2025-08-27T15:40:25+00:00 · Latest: 2025-10-14T03:42:04+00:00
Abstract
Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified at half of the refinement steps, before the final decoding step, under both semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
中文标题/摘要
标题:扩散语言模型在解码前就知道答案
扩散语言模型(DLMs)最近作为一种替代自回归方法出现,提供并行序列生成和灵活的标记顺序。然而,其推理速度仍慢于自回归模型,主要原因是双向注意的成本和生成高质量输出所需的大量细化步骤。在本工作中,我们强调并利用了DLMs早期答案收敛的一个未被重视的特性:在许多情况下,正确的答案可以在最终解码步骤之前由半步骤内部识别,无论是半自回归还是随机重新遮盖调度。例如,在GSM8K和MMLU上,分别有97%和99%的实例仅使用一半的细化步骤即可正确解码。基于这一观察,我们引入了Prophet,这是一种无需训练的快速解码范式,可实现早期提交解码。具体而言,Prophet动态决定是否继续细化或“全押”(即一次性解码剩余所有标记),使用前两个预测候选之间的置信度差距作为标准。它无缝集成到现有的DLM实现中,几乎不增加开销,并不需要额外的训练。对LLaDA-8B和Dream-7B在多个任务上的实证评估表明,Prophet将解码步骤减少多达3.4倍,同时保持高质量生成。这些结果将DLM解码重新定义为何时停止采样的问题,并证明早期解码收敛提供了一种简单而强大的机制,用于加速DLM推理,补充现有的加速技术。我们的代码可在https://github.com/pixeli99/Prophet上公开获取。
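The early-commit criterion described in the Prophet abstract above (continue refining, or decode everything at once, based on the confidence gap between the top-2 candidates) can be sketched as a toy loop. This is not the released implementation: `step_fn`, the fixed `threshold`, and the rule that every still-masked position must clear the threshold are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gap(logits):
    """Probability gap between the two most likely tokens at one position."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def should_commit(masked_logits, threshold):
    """Assumed aggregation: commit only if every masked position is confident."""
    return all(top2_gap(row) >= threshold for row in masked_logits)


def decode_with_early_commit(step_fn, num_steps, threshold):
    """step_fn(step) is a hypothetical callback returning per-position logits
    for the still-masked positions at that refinement step.
    Returns (number of steps used, argmax token ids)."""
    for step in range(num_steps):
        masked_logits = step_fn(step)
        last_step = step == num_steps - 1
        if should_commit(masked_logits, threshold) or last_step:
            # "All-in": decode every remaining token in one step.
            tokens = [max(range(len(row)), key=row.__getitem__) for row in masked_logits]
            return step + 1, tokens
```

With a toy `step_fn` whose logits sharpen over time, the loop stops as soon as the gap clears the threshold and otherwise falls back to the full step budget. A real decoder would also unmask tokens between steps, which this sketch omits.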
IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
Authors: Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Yunfei Zhao, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
First: 2025-09-29T12:38:06+00:00 · Latest: 2025-10-14T03:28:36+00:00
Abstract
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/L-O-I/IWR-Bench.
中文标题/摘要
标题:IWR-Bench:LVLM能否从用户交互视频中重建交互网页?
网页到代码的任务要求模型理解网页的视觉表示并生成相应的代码。然而,现有的基准主要集中在静态截图到代码的任务上,从而忽视了真实世界网页应用中至关重要的动态交互。为了解决这一局限,本文引入了IWR-Bench,这是一个新的基准,用于评估大型视觉-语言模型(LVLM)从视频中重建交互网页的能力。IWR-Bench 包含来自100个真实网站的113个精心策划的任务,涉及1001个动作,具有多样化的交互复杂性(例如,网页游戏)、视觉风格和领域。每个任务不仅包括用户交互视频,还包括所有抓取的静态资产(例如,图像、视频)。该基准评估模型在两个基本挑战上的表现:综合多模态推理以从视频和资产中推断交互逻辑,以及高级代码生成以将这种逻辑转化为功能代码。使用一个代理作为裁判的框架和一个全面的度量系统自动评估生成网页的功能正确性和视觉保真度。在28个LVLM上的广泛实验揭示了一个显著的挑战:最佳模型的整体得分为36.35%,功能正确性(24.39% IFS)远远落后于视觉保真度(64.25% VFS)。这些结果突显了当前模型在推理时间动态性和合成事件驱动逻辑方面的重要局限性,确立了IWR-Bench作为视觉-语言研究具有挑战性的前沿。基准和评估代码将在https://github.com/L-O-I/IWR-Bench上公开发布。
Summary / 总结
IWR-Bench is a new benchmark designed to evaluate Large Vision-Language Models (LVLMs) on reconstructing interactive webpages from user interaction videos. It includes 113 tasks from 100 real-world websites, covering 1,001 actions with diverse interaction complexities and visual styles. Experiments on 28 LVLMs show that the best model achieves an overall score of only 36.35%, with functional correctness (24.39% IFS) lagging well behind visual fidelity (64.25% VFS), indicating significant challenges in reasoning about temporal dynamics and synthesizing event-driven logic. The benchmark highlights the need for stronger vision-language models for interactive web applications.
IWR-Bench 是一个新基准,用于评估大型视觉-语言模型(LVLM)从用户交互视频重建交互网页的能力。它包含来自100个真实网站的113个任务,具有多样化的交互复杂性。实验显示,28个LVLM中表现最好的模型仅获得36.35%的整体分数,功能性正确性低于视觉保真度。这表明模型需要更好地处理交互中的时间动态和事件驱动逻辑。
History