Alterbute: Editing Intrinsic Attributes of Objects in Images
Authors: Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen
First: 2026-01-15T18:59:53+00:00 · Latest: 2026-01-15T18:59:53+00:00
Comments: Project page is available at https://talreiss.github.io/alterbute/
Abstract
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
中文标题/摘要
标题:Alterbute:图像中对象固有属性的编辑
我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中对象的固有属性。我们允许更改对象的颜色、纹理、材质,甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖于无法可靠保持身份的无监督先验,要么使用过于严格的监督,限制了有意义的固有属性变化。我们的方法依赖于:(i) 放松的训练目标,允许模型在参考身份图像、描述目标固有属性的文本提示以及定义外部上下文的背景图像和对象掩码的条件下,同时改变固有属性和外部属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如,“保时捷911卡雷拉”),将具有身份定义特征的对象分组,同时允许固有属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和固有属性描述,实现可扩展、保持身份的监督。Alterbute在保持身份的对象固有属性编辑方面优于现有方法。
Summary / 总结
Alterbute is a diffusion-based method for editing intrinsic attributes of objects in images, such as color, texture, and material, while preserving the object's identity and scene context. It uses a relaxed training objective and Visual Named Entities to allow changes in intrinsic attributes while keeping extrinsic attributes consistent. The method outperforms existing approaches in identity-preserving object intrinsic attribute editing.
Alterbute 是一种基于扩散的方法,用于在图像中编辑对象的内在属性,如颜色、纹理和材料,同时保持其身份和场景上下文。该方法使用一种宽松的训练目标和视觉命名实体(VNEs)来允许内在属性的变化,同时保持外在属性不变。该方法在保持对象身份的同时,在内在属性编辑方面优于现有方法。
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略且不对称的连接方式,仅将视觉编码器的输出连接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现多层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 可以显著提高性能,确立了CLI 作为一种可扩展范式的地位,通过赋予LLM按需访问完整视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the limitation of static vision-language models (VLMs) by introducing Cross-Layer Injection (CLI), a dynamic framework that enables a many-to-many connection between visual and language modalities. CLI includes an Adaptive Multi-Projection (AMP) module to harmonize features from various vision layers and an Adaptive Gating Fusion (AGF) mechanism to allow the language model to selectively inject relevant visual information. Experiments on 18 benchmarks show that CLI improves performance and enhances the model's ability to integrate visual and linguistic information effectively.
研究通过提出动态框架Cross-Layer Injection (CLI),解决了静态视觉-语言模型的局限性,CLI使视觉和语言模态之间能够实现多对多的连接。CLI包含一个用于特征谐调的Adaptive Multi-Projection (AMP)模块和一个用于选择性注入视觉信息的Adaptive Gating Fusion (AGF)机制。实验表明,CLI在18个不同的基准测试中表现出显著的性能提升,增强了LLMs对视觉和语言信息的综合整合能力。
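The gating idea described above, where the LLM's current decoding state decides how much of each vision layer to inject, can be illustrated with a minimal sketch. This is not the paper's AMP or AGF implementation: the function name `gated_injection`, the per-layer gate vectors, and the sigmoid gate are all assumptions chosen for illustration, and the per-layer features are assumed to be already projected to the LLM's hidden size.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_injection(hidden, layer_feats, gate_weights):
    """Add a gated sum of per-layer visual features to one LLM hidden state.

    hidden:       list[float], the current decoding state (size d)
    layer_feats:  one list[float] of size d per vision layer (assumed pre-projected)
    gate_weights: one hypothetical gate vector of size d per vision layer
    """
    fused = list(hidden)
    gates = []
    for feat, w in zip(layer_feats, gate_weights):
        # The gate is a scalar in (0, 1) that depends on the decoding state.
        g = sigmoid(dot(w, hidden))
        gates.append(g)
        fused = [f + g * v for f, v in zip(fused, feat)]
    return fused, gates


if __name__ == "__main__":
    fused, gates = gated_injection(
        hidden=[1.0, 0.0],
        layer_feats=[[1.0, 1.0], [2.0, 2.0]],
        gate_weights=[[0.0, 0.0], [0.0, 0.0]],
    )
    print(gates, fused)
```

With all-zero gate vectors every gate is 0.5, so each layer contributes half of its features; a learned gate would instead vary with the decoding context.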
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:视频问答中可预测可靠性的明确弃权开关
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒高成本错误的风险。我们研究了基于信心的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们建立了两个发现。首先,置信度阈值化在同分布内提供了机制控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率f
Summary / 总结
The study investigates whether confidence-based abstention lets vision-language models abstain reliably in high-stakes video question answering. In-distribution, confidence thresholding provides mechanistic control over error rates: sweeping the threshold produces smooth risk-coverage tradeoffs. The study also examines whether this control remains robust under distribution shift. Experiments use the NExT-QA dataset and the Gemini 2.0 Flash model.
研究探讨了基于置信度的弃权能否让视觉-语言模型在高风险视频问答中可靠地弃权。在同分布条件下,置信度阈值化提供了对错误率的机制性控制:扫掠阈值可产生平滑的风险-覆盖率权衡。研究还考察了这种控制在分布偏移下是否保持稳健。实验使用NExT-QA数据集和Gemini 2.0 Flash模型。
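A threshold sweep of the kind the abstract describes can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code: the function name `risk_coverage` and the toy confidences are assumptions. For each threshold, the system answers only when its confidence is at least the threshold; coverage is the fraction answered and risk is the error rate among answered questions.

```python
def risk_coverage(confidences, correct, thresholds):
    """Return (threshold, coverage, risk) for each abstention threshold.

    confidences: list[float], one model confidence per question
    correct:     list[int], 1 if the model's answer was right, else 0
    """
    n = len(confidences)
    rows = []
    for eps in thresholds:
        # Keep only the questions the system chooses to answer.
        answered = [c for conf, c in zip(confidences, correct) if conf >= eps]
        coverage = len(answered) / n
        # Risk is undefined with no answers; report 0.0 in that case.
        risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
        rows.append((eps, coverage, risk))
    return rows


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    for row in risk_coverage([0.9, 0.8, 0.6, 0.4], [1, 1, 0, 0], [0.0, 0.5, 0.7]):
        print(row)
```

Raising the threshold trades coverage for lower risk; on the toy data above, risk falls as fewer low-confidence questions are answered.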
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open-weight and open-data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最先进的,并展示了在单图像、多图像和视频任务中出色的基于指针的定位新能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的训练方法,并展示了在视觉标记上进行双向注意以及一种新的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率35.5 vs 29.6)并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1分数38.4 vs 20.0,视频跟踪J&F分数56.2 vs 41.1)。
Summary / 总结
The paper introduces Molmo2, a new family of open-source vision-language models that outperform existing open-source models in video understanding and grounding tasks. The authors provide 9 new datasets, including video captions, Q&A, object tracking, and pointing datasets, and a training recipe that includes efficient packing and message-tree encoding. Molmo2 significantly improves performance in video counting, captioning, and video-grounding tasks, surpassing both open-source and proprietary models in some cases.
Molmo2 是一种新的开源视觉-语言模型,其在视频理解和定位任务中优于其他开源模型。研究解决了缺乏开源基础来改进视频和图像语言模型的问题。Molmo2 包含 9 个新数据集用于预训练和微调,并采用一种新颖的训练方法来提升性能。主要发现表明,Molmo2 在短视频、计数和描述等任务上优于现有开源模型,并在视频定位任务如视频指针和跟踪上超越了专有模型。
Semantic Misalignment in Vision-Language Models under Perceptual Degradation
Authors: Guo Cheng
First: 2026-01-13T09:13:05+00:00 · Latest: 2026-01-15T17:10:05+00:00
Comments: 10 pages, 4 figures, 6 tables
Abstract
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
中文标题/摘要
标题:视觉-语言模型在感知退化下的语义不匹配
视觉-语言模型(VLMs)在自动驾驶和具身AI系统中越来越被部署,可靠的感知对于安全的语义推理和决策至关重要。尽管最近的VLMs在多模态基准测试中表现出色,但它们对现实感知退化的鲁棒性仍然知之甚少。在本工作中,我们系统地研究了在上游视觉感知受控退化下VLMs中的语义不匹配,使用Cityscapes数据集上的语义分割作为代表性感知模块。我们引入了感知现实的退化,这些退化仅在传统分割指标上引起适度下降,但观察到下游VLM行为的严重失败,包括虚构对象提及、安全关键实体的遗漏以及不一致的安全判断。为了量化这些影响,我们提出了一组语言层面的不匹配度量,以捕捉虚构、关键遗漏和安全误判,并分析这些度量与分割质量之间的关系,涵盖多个对比性和生成性VLMs。我们的结果揭示了像素级鲁棒性和多模态语义可靠性之间的明显脱节,突显了当前VLM基系统的一个关键局限性,并强调了需要评估框架来明确考虑感知不确定性在关键安全应用中的重要性。
Summary / 总结
This study investigates the robustness of Vision-Language Models (VLMs) under perceptual degradation, focusing on their performance in autonomous driving and embodied AI systems. By introducing controlled corruptions to the semantic segmentation module, the research reveals that even moderate drops in segmentation accuracy can lead to severe failures in VLMs, such as hallucinations and safety misinterpretations. The authors propose new metrics to quantify these issues and demonstrate the need for more rigorous evaluation frameworks that consider perception uncertainty in safety-critical applications.
该研究探讨了视觉-语言模型(VLMs)在感知降级情况下的鲁棒性,重点关注其在自动驾驶和具身AI系统中的表现。通过引入对语义分割模块的可控破坏,研究发现即使分割精度只有轻微下降,也可能导致VLM行为严重失败,如幻觉和安全误判。研究提出了新的量化指标,并强调了需要考虑感知不确定性在安全关键应用中的评估框架的重要性。
Action100M: A Large-scale Video Action Dataset
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
First: 2026-01-15T17:02:27+00:00 · Latest: 2026-01-15T17:02:27+00:00
Abstract
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
中文标题/摘要
标题:Action100M:大规模视频动作数据集
从视觉观察中推断物理动作是推进物理世界机器智能的基本能力。实现这一目标需要涵盖广泛领域的大型、开放词汇量视频动作数据集。我们介绍了Action100M,这是一个从1.2M互联网教学视频(14.6年时长)构建的大规模数据集,包含O(100百万)个时间局部化片段,具有开放词汇量动作监督和丰富的描述。Action100M通过一个完全自动化的流水线生成,该流水线(i)使用V-JEPA 2嵌入进行分层时间分割,(ii)生成多级帧和片段描述组织成描述树,(iii)使用多轮Self-Refine程序下的推理模型(GPT-OSS-120B)聚合证据,输出结构化注释(简要/详细动作、演员、简要/详细描述)。在Action100M上训练VL-JEPA展示了在各种动作识别基准测试中一致的数据规模改进和强大的零样本性能,确立了Action100M作为视频理解和世界建模可扩展研究新基础的地位。
Summary / 总结
The research aims to develop a large-scale video action dataset to enhance machine intelligence in recognizing physical actions from visual observations. The method involves creating Action100M from 1.2 million instructional videos, using a fully automated pipeline for hierarchical temporal segmentation, multi-level captioning, and structured annotation refinement. Key findings show consistent data-scaling improvements and strong zero-shot performance on diverse action recognition benchmarks, positioning Action100M as a foundational resource for video understanding and world modeling research.
研究旨在开发大规模视频动作数据集,以推动物理世界应用中的机器智能。主要方法是从120万条教学视频中创建Action100M,生成超过1亿个时间局部化片段,具有开放词汇的动作监督。关键发现显示,在各种动作识别基准测试中表现出一致的数据扩展改进和强大的零样本性能,确立了Action100M作为视频理解研究新基础的地位。
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Authors: Mikel Williams-Lekuona, Georgina Cosma
First: 2025-12-17T12:19:54+00:00 · Latest: 2026-01-15T16:58:39+00:00
Comments: Camera-ready version for ECIR 2026
Abstract
Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
中文标题/摘要
标题:基于图像复杂性的自适应检索以提高视觉语言模型的效率
视觉语言模型中的视觉变换器通常对每张图像使用相同的计算量,无论其简单与否。我们提出了ICAR(图像复杂性感知检索),这是一种自适应计算方法,使视觉变换器能够为简单的图像使用较少的计算量,而复杂图像则通过其全网络深度进行处理。关键挑战是保持跨模态对齐:不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR 通过双路径训练来解决这一问题,该训练产生来自早期退出路径和全深度路径的兼容嵌入。这在图像表示和文本嵌入处于同一语义空间的情况下保持了兼容性,无论图像是否早期退出或完全处理。与现有的两阶段方法不同,ICAR 不需要昂贵的重排序,可以直接进行图像-文本匹配而无需额外开销。为了确定使用多少计算量,我们开发了ConvNeXt-IC,将其视为分类任务。通过应用现代分类器骨干网络而非专门的架构,ConvNeXt-IC 达到了最先进的性能,获得了与人工标注 0.959 的皮尔逊相关系数,同时实现了 4.4 倍更快的复杂性预测。在标准基准上增加了真实世界的网络数据进行评估,ICAR 在保持类别级性能的同时实现了 20% 更快的图像编码,并且保留了 95% 的实例级性能,从而实现了视觉语言系统的可持续扩展。
Summary / 总结
The paper proposes ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach for vision transformers in vision-language models that uses less compute for simple images while processing complex images through the full network depth. Dual-path training keeps early-exit and full-depth embeddings compatible with text embeddings, so no reranking stage is needed. ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance. To decide how much compute to use, ConvNeXt-IC treats image complexity assessment as a classification task, reaching a Pearson correlation of 0.959 with human labelling and 4.4x faster complexity prediction.
该研究提出了ICAR(图像复杂性感知检索)方法,该方法针对视觉语言模型中的视觉变换器,能够为简单图像使用较少的计算资源,而复杂图像则通过全深度网络处理。通过双路径训练解决跨模态对齐问题,确保图像表示和文本嵌入在相同的语义空间中保持兼容。ICAR在标准基准测试中实现了20%的图像编码加速,同时保持类别级和95%的实例级性能,展示了视觉语言系统的可持续扩展。ConvNeXt-IC作为图像复杂性评估的分类器骨干网络,实现了4.4倍更快的复杂性预测,达到最先进的性能。
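The routing decision described above, early exit for simple images and full depth for complex ones, can be sketched in a few lines. The threshold, the exit layer, and the 12-layer depth below are hypothetical values chosen for illustration; the abstract does not specify them, and the complexity scores would come from a predictor such as ConvNeXt-IC rather than being given by hand.

```python
def route_by_complexity(complexity_scores, threshold=0.5, exit_layer=6, full_depth=12):
    """Pick a processing depth per image and estimate the layer compute saved.

    complexity_scores: list[float] in [0, 1], one per image (assumed given)
    threshold, exit_layer, full_depth: hypothetical settings, not from the paper
    """
    depths = [exit_layer if s < threshold else full_depth for s in complexity_scores]
    # Fraction of transformer layers skipped relative to always running full depth.
    saved = 1.0 - sum(depths) / (full_depth * len(depths))
    return depths, saved


if __name__ == "__main__":
    depths, saved = route_by_complexity([0.1, 0.9, 0.3, 0.7])
    print(depths, saved)
```

The saving counts skipped layers only; it ignores the cost of the complexity predictor itself, which is why a fast predictor matters in practice.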
Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
First: 2026-01-15T16:16:34+00:00 · Latest: 2026-01-15T16:16:34+00:00
Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
中文标题/摘要
标题:释放大型视觉语言模型在路边基础设施智能感知中的能力
城市路边基础设施的自动化感知对于智能城市管理至关重要,但通用模型往往难以捕捉到必要的细粒度属性和领域规则。虽然大型视觉语言模型(VLMs)在开放世界识别方面表现出色,但在遵循工程标准准确解释复杂设施状态方面常常遇到困难,导致在实际应用中的性能不可靠。为了解决这一问题,我们提出了一种领域适应框架,将VLMs转化为专门的智能基础设施分析代理。我们的方法结合了数据高效微调策略和基于知识的推理机制。具体而言,我们利用Grounding DINO的开放式词汇微调来在最少监督的情况下稳健地定位各种资产,然后利用基于LoRA的Qwen-VL适应进行深入的语义属性推理。为了减轻幻觉并确保专业合规,我们引入了一种双模态检索增强生成(RAG)模块,在推理过程中动态检索权威的行业标准和视觉示例。在全面的新建城市路边场景数据集上进行评估,我们的框架实现了58.9的检测性能mAP和95.5%的属性识别准确率,展示了智能基础设施监控的稳健解决方案。
Summary / 总结
The research aims to improve the automated perception of urban roadside infrastructure for smart city management by addressing the limitations of general-purpose models. The proposed domain-adapted framework fine-tunes large vision-language models with open-vocabulary techniques and LoRA-based adaptation, integrating a dual-modality RAG module to enforce professional compliance. Experimental results show a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5% on a new dataset of urban roadside scenes, indicating a robust solution for intelligent infrastructure monitoring.
研究旨在通过智能基础设施监测改善城市道路旁基础设施的自动化感知。提出了一种领域适应框架,利用开放词汇量微调和LoRA基适应技术,并结合双模态检索增强生成模块以确保专业合规。该框架在检测方面达到了58.9 mAP,在属性识别方面达到了95.5%的准确率,展示了在实际应用中的稳健性能。
SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery
Authors: Chong Liu, Luxuan Fu, Yang Jia, Zhen Dong, Bisheng Yang
First: 2026-01-15T15:57:18+00:00 · Latest: 2026-01-15T15:57:18+00:00
Abstract
The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.
中文标题/摘要
标题:SVII-3D:利用分米级3D定位与理解从稀疏街道图像中推进路边基础设施库存
在智慧城市建设和设施生命周期管理中,自动创建数字孪生和精确资产库存是一项关键任务。然而,利用经济有效的稀疏图像仍然具有挑战性,因为其鲁棒性有限、定位不准确且缺乏细粒度状态理解。为了解决这些限制,提出了SVII-3D,这是一种用于整体资产数字化的统一框架。首先,将LoRA微调开放集检测与空间注意力匹配网络融合,以稳健地关联稀疏视图中的观测。其次,引入几何引导的细化机制以解决结构错误,实现精确的分米级3D定位。第三,超越静态几何映射,引入利用多模态提示的视觉-语言模型代理以自动诊断细粒度运行状态。实验表明,SVII-3D显著提高了识别准确性并最小化了定位误差。因此,该框架提供了一种可扩展、经济有效的解决方案,用于高保真基础设施数字化,有效弥合了稀疏感知与自动化智能维护之间的差距。
Summary / 总结
SVII-3D is a unified framework designed to enhance the creation of digital twins and precise asset inventories in smart cities. It addresses challenges such as limited robustness and inaccurate localization by using LoRA fine-tuned open-set detection and a spatial-attention matching network for robust observation association, and a geometry-guided refinement mechanism for precise decimeter-level 3D localization. Additionally, it incorporates a Vision-Language Model agent to diagnose fine-grained operational states. Experiments show that SVII-3D improves identification accuracy and minimizes localization errors, providing a scalable and cost-effective solution for infrastructure digitization.
论文提出了SVII-3D统一框架,利用稀疏街道图像创建详细的数字孪生和精确的资产库存。通过使用LoRA微调开放集检测和空间注意力匹配网络进行稳健的观测关联,以及几何引导的精炼机制实现精确的3D定位,解决有限鲁棒性和不准确定位的问题。此外,还引入了利用多模态提示的视觉语言模型代理来自动诊断细粒度的操作状态。实验表明,SVII-3D提高了识别准确性并减少了定位误差,提供了一种可扩展且成本效益高的基础设施数字化解决方案。
Curvature Tuning: Provable Training-free Model Steering From a Single Parameter
Authors: Leyang Hu, Matteo Gamba, Randall Balestriero
Venue: NeurIPS 2025
First: 2025-02-11T18:59:57+00:00 · Latest: 2026-01-15T15:36:28+00:00
Comments: Accepted at NeurIPS 2025
Abstract
The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model's decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions, thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 8.59%/8.34% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_\infty$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
中文标题/摘要
标题:曲率调谐:单一参数驱动的无需训练模型导向
模型和数据规模的扩展重塑了人工智能的格局,使微调预训练模型成为解决下游任务的标准范式。然而,主流的微调方法通常依赖于权重调整,缺乏可解释性,并且依赖于经验选择的超参数。本文从不同角度出发,将焦点从权重转移到激活函数,通过样条算子的视角来审视它们。我们提出了曲率调谐(CT),这是一种可解释且原理上的导向方法,通过将单一超参数注入激活函数来调节模型的决策边界。我们证明CT能够证明地调整模型决策边界的曲率,并更根本地将模型投影到光滑函数的空间中,从而补充了当前主要依赖于特征调整的微调方法。使这个超参数可训练导致了一种新颖且高度参数高效的微调方法。实验上,CT提高了泛化能力和鲁棒性。例如,它在12个数据集上分别将ResNet-50/152的下游准确性提高了8.59%/8.34%,相对于线性探针和LoRA分别提高了4.64%/1.70%,并在RobustBench的$\ell_\infty$基准上提高了1032.64%/1494.46%的鲁棒准确性。我们的代码可在https://github.com/Leon-Leyang/curvature-tuning/ 获取。
MERGETUNE: Continued fine-tuning of vision-language models
Authors: Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler
First: 2026-01-15T15:15:53+00:00 · Latest: 2026-01-15T15:15:53+00:00
Comments: 20 pages, 5 figures
Abstract
Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. We show, for the first time, performance superior to CLIP on both DTD and EuroSAT in cross-dataset transfer. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
中文标题/摘要
标题:MERGETUNE:视觉-语言模型的持续微调
微调视觉-语言模型(VLMs)如CLIP通常会导致预训练知识的灾难性遗忘。先前的工作主要旨在适应过程中减轻遗忘;然而,在此过程中遗忘往往是不可避免的。我们引入了一种新的范式,即持续微调(CFT),它旨在在零样本模型已经适应后恢复预训练知识。我们提出了一种简单的、模型无关的CFT策略(名为MERGETUNE),该策略由线性模式连通性(LMC)引导,可以在不进行架构更改的情况下应用于现有微调模型。给定一个微调模型,我们继续微调其可训练参数(例如,软提示或线性头),以搜索一个持续模型,该模型具有两条低损失路径到零样本(例如,CLIP)和微调(例如,CoOp)解决方案。通过利用损失景观的几何结构,持续模型隐式地合并了两种解决方案,恢复了在微调对应物中丢失的预训练知识。挑战在于,原始的LMC约束需要从预训练任务中重放数据。我们通过二阶近似零样本模型的LMC约束,避免了大规模数据重放的需求。实验表明,MERGETUNE在不增加参数的情况下,将CoOp的基本新颖泛化提高了5.6%。MERGETUNE首次在DTD和EuroSAT上展示了优于CLIP的性能,实现了跨数据集迁移。在鲁棒微调评估中,MERGETUNE生成的LMC合并模型以较低的推理成本超越了集成基线,并在与零样本模型集成时实现了最先进的结果。我们的代码可在https://github.com/Surrey-UP-Lab/MERGETUNE获得。
Summary / 总结
The research aims to address the issue of catastrophic forgetting in fine-tuning vision-language models like CLIP. It introduces a novel paradigm called continued fine-tuning (CFT) and a model-agnostic strategy named MERGETUNE, which helps recover pretrained knowledge after a model has been adapted. MERGETUNE uses linear mode connectivity (LMC) to find a model that can achieve low-loss paths to both the zero-shot and fine-tuned solutions, thereby implicitly merging these solutions and restoring lost pretrained knowledge. Experiments show that MERGETUNE improves the harmonic mean of CoOp by 5.6% on base-novel generalization without adding parameters and outperforms CLIP on DTD and EuroSAT datasets for cross-dataset transfer. It also achieves state-of-the-art results in robust fine-tuning evaluations with lower inference cost when ensembled with the zero-shot model.
研究旨在解决视觉-语言模型如CLIP在微调过程中出现的灾难性遗忘问题。提出了一种新的持续微调(CFT)范式和一种名为MERGETUNE的模型通用策略,该策略在模型已适应后帮助恢复预训练知识。MERGETUNE利用线性模式连通性(LMC)找到一个可以同时达到零样本和微调解决方案低损失路径的模型,从而隐式地将这两种解决方案合并,恢复丢失的预训练知识。实验表明,MERGETUNE在基底-新颖泛化上提高了CoOp的调和平均值5.6%,并在DTD和EuroSAT数据集上首次优于CLIP的跨数据集迁移性能。此外,MERGETUNE在鲁棒微调评估中也达到了最先进的结果,且具有更低的推理成本,当与零样本模型进行集成时,进一步提高了性能。
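Linear mode connectivity, which guides MERGETUNE, asks whether the loss stays low along the straight line between two parameter vectors. The sketch below measures that with a loss barrier on toy losses. It is a generic illustration of the LMC notion only, not the MERGETUNE objective or its second-order surrogate; the helper names `interpolate` and `loss_barrier` and the toy loss functions are assumptions.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point at fraction alpha on the straight line from theta_a to theta_b."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Evaluate the loss along the linear path and return (losses, barrier).

    The barrier is the largest amount by which the path loss exceeds the
    linear interpolation of the two endpoint losses. A barrier near zero
    means the two solutions are linearly mode connected under loss_fn.
    """
    alphas = [i / (steps - 1) for i in range(steps)]
    losses = [loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas]
    barrier = max(
        l - ((1 - a) * losses[0] + a * losses[-1]) for a, l in zip(alphas, losses)
    )
    return losses, barrier


if __name__ == "__main__":
    # Toy convex loss: no barrier between the two endpoints.
    convex = lambda t: sum(x * x for x in t)
    print(loss_barrier(convex, [1.0, 0.0], [0.0, 1.0])[1])

    # Toy double-well loss: the midpoint sits on a ridge between two minima.
    double_well = lambda t: (t[0] ** 2 - 1.0) ** 2
    print(loss_barrier(double_well, [-1.0], [1.0])[1])
```

In the double-well case the two endpoints are both minima, yet the straight path between them crosses a high-loss ridge, which is the situation a low-loss-path constraint is meant to rule out.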
RGS-SLAM: Robust Gaussian Splatting SLAM with One-Shot Dense Initialization
Authors: Wei-Tse Cheng, Yen-Jen Chiou, Yuan-Fu Yang
First: 2025-12-28T03:45:57+00:00 · Latest: 2026-01-15T15:14:21+00:00
Comments: 10 pages, 9 figures
Abstract
We introduce RGS-SLAM, a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization. Instead of progressively adding Gaussians as residuals reveal missing geometry, RGS-SLAM performs a one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors refined through a confidence-aware inlier classifier, generating a well-distributed and structure-aware Gaussian seed prior to optimization. This initialization stabilizes early mapping and accelerates convergence by roughly 20%, yielding higher rendering fidelity in texture-rich and cluttered scenes while remaining fully compatible with existing GS-SLAM pipelines. Evaluated on the TUM RGB-D and Replica datasets, RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared with state-of-the-art Gaussian and point-based SLAM systems, sustaining real-time mapping performance at up to 925 FPS. Additional details and resources are available at this URL: https://breeze1124.github.io/rgs-slam-project-page/
中文标题/摘要
标题:RGS-SLAM:基于一次性密集初始化的鲁棒高斯泼溅SLAM
我们提出了RGS-SLAM,一种鲁棒的高斯泼溅SLAM框架,用无需训练的“对应到高斯”初始化取代GS-SLAM中由残差驱动的密集化阶段。RGS-SLAM 不是随着残差揭示缺失的几何结构逐步添加高斯点,而是对由DINOv3描述符得到、并经置信度感知内点分类器细化的密集多视图对应进行一次性三角化,在优化之前生成一个分布良好且具有结构感知的高斯种子。这种初始化稳定了早期建图,并将收敛速度提升约20%,从而在纹理丰富和杂乱的场景中提高了渲染保真度,同时保持与现有GS-SLAM流水线的完全兼容性。在TUM RGB-D和Replica数据集上评估,RGS-SLAM在定位和重建准确性方面与最先进的高斯和点基SLAM系统具有竞争力或更优,保持实时建图性能,最高可达925 FPS。更多详细信息和资源请参见此网址:https://breeze1124.github.io/rgs-slam-project-page/
Summary / 总结
RGS-SLAM is a robust Gaussian-splatting SLAM framework that introduces a one-shot dense initialization method using DINOv3 descriptors and a confidence-aware inlier classifier, replacing the traditional residual-driven densification stage. This approach stabilizes early mapping and accelerates convergence by about 20%, leading to higher rendering fidelity in complex scenes. RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared to state-of-the-art SLAM systems while maintaining real-time performance at up to 925 FPS on TUM RGB-D and Replica datasets.
RGS-SLAM 是一种鲁棒的高斯泼溅 SLAM 框架,通过引入一次性密集初始化方法,取代传统的基于残差的密集化阶段。通过使用 DINOv3 描述子和置信度感知的内点分类器,RGS-SLAM 在优化前生成一个分布良好的高斯种子,从而稳定早期建图并将收敛速度提升约 20%。该系统在 TUM RGB-D 和 Replica 数据集上的定位和重建精度与最先进的高斯和点云 SLAM 系统相当或更优,同时保持每秒高达 925 帧的实时性能。此方法在复杂场景中提高了渲染保真度,并且完全兼容现有的 GS-SLAM 流水线。
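The one-shot triangulation step mentioned above can be pictured with standard two-view linear (DLT) triangulation. The sketch below covers only that geometric step, under the assumption of known 3x4 projection matrices and a noise-free correspondence; the DINOv3 matching, the inlier classifier, and the conversion of points into Gaussians are not shown, and the camera setup is hypothetical.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 projection matrices; x1, x2: matching pixel coordinates (u, v).
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix and dehomogenize."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical setup: identity intrinsics, second camera shifted along the x-axis.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])
X_est = triangulate_point(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)
```

Applied to every retained correspondence, a routine like this yields a point set in a single pass, which is the sense in which the seed is built "one-shot" rather than grown from residuals.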
Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Authors: Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li
First: 2026-01-15T15:00:36+00:00 · Latest: 2026-01-15T15:00:36+00:00
Abstract
As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available in https://github.com/AMAP-ML/SocioReasoner.
中文标题/摘要
标题:基于视觉-语言推理的城市社会语义分割
作为人类活动的中心,城市表面包含了大量的语义实体。从卫星图像中分割这些各种实体对于一系列下游应用至关重要。当前先进的分割模型可以可靠地分割由物理属性定义的实体(如建筑物、水体),但在处理社会定义的类别(如学校、公园)方面仍然存在困难。在本工作中,我们通过视觉-语言模型推理实现了社会语义分割。为此,我们引入了名为SocioSeg的城市社会语义分割数据集,该数据集包含卫星图像、数字地图和按分层结构组织的社会语义实体的像素级标签。此外,我们还提出了一种新的视觉-语言推理框架,称为SocioReasoner,该框架通过跨模态识别和多阶段推理模拟人类识别和标注社会语义实体的过程。我们使用强化学习优化这一非可微过程,激发视觉-语言模型的推理能力。实验表明,我们的方法在最先进的模型上有所改进,并且具有强大的零样本泛化能力。我们的数据集和代码可在https://github.com/AMAP-ML/SocioReasoner获取。
Summary / 总结
This work addresses the challenge of socio-semantic segmentation in urban areas by leveraging vision-language model reasoning. The authors introduce the SocioSeg dataset, which includes satellite imagery, digital maps, and pixel-level labels for social semantic entities. They also propose SocioReasoner, a novel framework that uses cross-modal recognition and multi-stage reasoning to identify and annotate social semantic entities. Experiments show that their approach outperforms existing models and demonstrates strong zero-shot generalization capabilities.
该研究旨在通过卫星图像更好地分割社会定义的城市实体,这对于各种应用至关重要。作者引入了一个新的数据集SocioSeg和一个视觉-语言推理框架SocioReasoner,以解决现有模型的局限性。实验表明,他们的方法优于最先进的模型,并展示了强大的零样本泛化能力。
Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning
Authors: Guoqiang Liang, Jianyi Wang, Zhonghua Wu, Shangchen Zhou
First: 2026-01-06T11:00:17+00:00 · Latest: 2026-01-15T14:19:47+00:00
Comments: Project Page: https://ethanliang99.github.io/ZOOMIQA-Projectpage
Abstract
Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or providing low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA by jointly generating quality descriptions and scores. However, existing VLM-based IQA methods often suffer from unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions, and 2) reinforcement learning (RL) for dynamic policy exploration, stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, with a Progressive Re-sampling Strategy for mitigating annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.
中文标题/摘要
标题:Zoom-IQA:基于可靠区域感知推理的图像质量评估
图像质量评估(IQA)是计算机视觉中的一个长期问题。以往的方法通常侧重于预测数值分数而没有解释,或者提供低级描述而缺乏精确的分数。最近的基于视觉语言模型(VLMs)的推理方法在联合生成质量描述和分数方面显示出了强大的潜力。然而,现有的基于VLM的IQA方法往往由于其整合视觉和文本线索能力有限而表现出不可靠的推理。在本文中,我们引入了Zoom-IQA,这是一种基于VLM的IQA模型,旨在明确模拟关键的认知行为:不确定性意识、区域推理和迭代细化。具体而言,我们提出了一种两阶段训练管道:1)在我们的Grounded-Rationale-IQA(GR-IQA)数据集上进行监督微调(SFT),以教导模型将其评估扎根于关键区域;2)通过我们的KL-Coverage正则化器稳定动态策略探索,并结合渐进重采样策略以减轻注释偏差,进行强化学习(RL)。广泛的实验表明,Zoom-IQA在鲁棒性、可解释性和泛化能力方面取得了改进。Zoom-IQA在图像恢复等下游任务中的应用进一步证明了其有效性。
Summary / 总结
Zoom-IQA is a VLM-based IQA model that explicitly emulates cognitive behaviors such as uncertainty awareness, region reasoning, and iterative refinement. It uses a two-stage training pipeline: supervised fine-tuning on the Grounded-Rationale-IQA (GR-IQA) dataset, followed by reinforcement learning stabilized by a KL-Coverage regularizer, with a Progressive Re-sampling Strategy to mitigate annotation bias. Experiments show improved robustness, explainability, and generalization, and an application to image restoration further demonstrates its effectiveness.
Zoom-IQA 是一种基于 VLM 的图像质量评估模型,通过明确模拟关键认知行为如不确定性意识、区域推理和迭代改进来提升鲁棒性、可解释性和泛化能力。它采用两阶段训练管道:在 Grounded-Rationale-IQA 数据集上进行监督微调和强化学习,并使用 KL-Coverage 正则化器和渐进采样策略来稳定推理和评分多样性。该模型在鲁棒性、可解释性和泛化能力方面表现出改进,并通过图像修复等下游任务的应用进一步验证了其有效性。
Global Context Compression with Interleaved Vision-Text Transformation
Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang
First: 2026-01-15T13:29:16+00:00 · Latest: 2026-01-15T13:29:16+00:00
Abstract
Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4x compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3x speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.
中文标题/摘要
标题:全局上下文压缩与交错的视觉-文本转换
视觉-语言模型在端到端OCR方面的最新成就为低损耗压缩文本信息开辟了一条新途径。这促使早期工作将Transformer的输入转换为图像以进行预填充,从而通过视觉编码有效减少了令牌数量,从而减轻了注意力计算的二次增加。然而,这种部分压缩在逐令牌推理时未能节省计算或内存成本。在本文中,我们研究了全局上下文压缩,该压缩在预填充和推理阶段都节省了令牌。因此,我们提出了VIST2,这是一种新颖的Transformer,交错输入文本片段及其视觉编码,同时仅依赖于预上下文中的视觉令牌来预测下一个文本令牌分布。围绕这一理念,我们将文本片段转换为草图图像,并分阶段训练VIST2,从基于课程安排的预训练开始,用于光学语言建模,然后是模态交错指令微调。我们使用从0.6B到8B缩放的VIST2家族进行了广泛的实验,以探索训练配方和超参数。压缩比为4倍的情况下,所得到的模型在长文本任务上显著优于基线,平均第一令牌生成速度提高3倍,内存使用减少77%,FLOPS减少74%。我们的代码和数据集将公开,以支持进一步的研究。
Summary / 总结
This paper studies global context compression for Transformers, motivated by the end-to-end OCR ability of vision-language models, which points to low-loss compression of textual information through rendering. It introduces VIST2, a novel Transformer that interleaves text chunks with their visual encodings and relies exclusively on visual tokens in the preceding context to predict the next text token, saving tokens at both the prefilling and inference stages. With a 4x compression ratio, VIST2 models outperform baselines on long writing tasks and achieve, on average, a 3x speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPS.
本文探讨了全局上下文压缩方法,旨在减少预填充和推理阶段的计算和内存成本。作者提出了一种名为VIST2的新Transformer,该模型将文本片段与其视觉编码交织在一起,并使用预上下文中的视觉标记来预测下一个文本标记。实验表明,VIST2模型,压缩比为4倍,比基线模型在长文本任务上表现更优,实现了3倍的第一标记生成速度提升,77%的内存使用减少和74%的FLOPS减少。
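To make the "quadratically increased Attention computations" remark concrete, the toy calculation below counts query-key pairs for full self-attention before and after a 4x reduction in context length. It is a back-of-the-envelope illustration only: the context length is a hypothetical value, and the count ignores every non-attention cost, so it does not model the end-to-end speedup, memory, or FLOPS figures reported in the abstract.

```python
def attention_pair_count(num_tokens):
    """Number of query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens

def compressed_length(num_text_tokens, compression_ratio):
    """Context length after compression, rounded up to a whole token."""
    return -(-num_text_tokens // compression_ratio)

text_tokens = 8192  # hypothetical context length, not taken from the paper
visual_tokens = compressed_length(text_tokens, 4)

ratio = attention_pair_count(text_tokens) / attention_pair_count(visual_tokens)
print(f"{text_tokens} text tokens -> {visual_tokens} visual tokens")
print(f"pairwise attention work shrinks by a factor of {ratio:.0f}")
```

The quadratic term falls by the square of the compression ratio, while costs that scale linearly with token count fall only by the ratio itself.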
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
Authors: Peng-Fei Zhang, Zi Huang
First: 2026-01-15T11:45:56+00:00 · Latest: 2026-01-15T11:45:56+00:00
Comments: 15 pages, 7 figures
Abstract
Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.
中文标题/摘要
标题:视觉语言模型的分层细化普遍多模态攻击
现有的针对VLP模型的对抗攻击大多针对样本特定,当扩展到大规模数据集或新场景时会产生大量的计算开销。为克服这一限制,我们提出了分层细化攻击(HRA),这是一种针对VLP模型的多模态普遍攻击框架。HRA在样本级别和优化级别细化普遍对抗扰动(UAPs)。对于图像模态,我们将对抗样本分解为干净图像和扰动,允许每个组件独立处理,以更有效地破坏跨模态对齐。我们还引入了一种ScMix增强策略,以多样化视觉上下文并增强UAPs的全局和局部效用,从而减少对虚假特征的依赖。此外,通过利用历史和估计未来梯度的时间层次结构来细化优化路径,以避免局部最小值并稳定普遍扰动学习。对于文本模态,HRA通过结合句内和句间重要性度量来识别全局有影响力的单词,并随后利用这些单词作为普遍文本扰动。广泛的实验结果表明,提出的普遍多模态攻击具有优越性。
Summary / 总结
The research aims to address the high computational cost of sample-specific adversarial attacks on vision-language models by proposing Hierarchical Refinement Attack (HRA), a universal multimodal attack framework. HRA refines universal adversarial perturbations at both the sample and optimization levels, using techniques like ScMix augmentation and a temporal hierarchy of gradients. The study shows that HRA outperforms existing methods across various downstream tasks and datasets, demonstrating its effectiveness in disrupting cross-modal alignment and stabilizing universal perturbation learning.
研究旨在通过提出层次化精炼攻击(HRA),一种针对视觉语言模型的通用多模态攻击框架,解决样本特定的对抗性攻击带来的计算开销问题。HRA 在样本和优化层面精炼通用对抗性扰动,使用 ScMix 增强策略和历史及预测梯度的时间层次结构。研究结果表明,HRA 在各种下游任务、视觉语言模型和数据集上优于现有方法。
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
Authors: Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
First: 2025-05-29T16:41:12+00:00 · Latest: 2026-01-15T11:24:14+00:00
Comments: 29 pages, 13 figures
Abstract
Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands the diverse embodied reasoning capabilities for the task. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
中文标题/摘要
标题:Robot-R1:利用强化学习增强机器人中的具身推理
大型视觉-语言模型(LVLM)最近在结合具身推理和机器人控制方面显示出巨大的潜力。一种常见的方法是通过监督微调(SFT)在与机器人控制相关的具身推理任务上进行训练。然而,SFT数据集通常是通过启发式方法构建的,并未明确优化以提高机器人控制性能。此外,SFT往往会导致灾难性遗忘和泛化性能降低等问题。为了解决这些局限性,我们提出了Robot-R1,这是一种新颖的框架,利用强化学习来增强特别针对机器人控制的具身推理。Robot-R1 学习预测完成任务所需的下一个关键点状态,基于当前场景图像和从专家演示中提取的环境元数据。受DeepSeek-R1学习方法的启发,Robot-R1 采样基于推理的响应,并强化那些导致更准确预测的响应。为了严格评估Robot-R1,我们还引入了一个新的基准,要求具备多样化的具身推理能力。我们的实验表明,使用Robot-R1训练的模型在具身推理任务上优于SFT方法。尽管只有7B参数,Robot-R1甚至在与低级动作控制相关的推理任务,如空间和运动推理方面,超过了GPT-4o。
Summary / 总结
The research aims to improve embodied reasoning in robotics by addressing the limitations of Supervised Fine-Tuning (SFT) methods, such as catastrophic forgetting and reduced generalization. Robot-R1, a novel framework, uses reinforcement learning to enhance embodied reasoning specifically for robot control. It predicts the next keypoint state needed for task completion based on the current scene image and environment metadata from expert demonstrations. Experiments show that models trained with Robot-R1 outperform SFT methods and even surpass GPT-4o on reasoning tasks related to low-level action control.
论文提出了Robot-R1框架,该框架利用强化学习提升面向机器人控制的具身推理能力,解决了监督微调(SFT)方法的局限性。Robot-R1以当前场景图像和源自专家演示的环境元数据为条件,学习预测完成任务所需的下一个关键点状态。实验表明,Robot-R1在具身推理任务上优于SFT方法,并且即使只有7B参数,也在空间和运动推理等与低级动作控制相关的推理任务上超越了GPT-4o。
A Study of Commonsense Reasoning over Visual Object Properties
Authors: Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski
First: 2025-08-14T11:28:40+00:00 · Latest: 2026-01-15T11:10:05+00:00
Abstract
Inspired by human categorization, object property reasoning involves identifying and recognizing low-level details and higher-level abstractions. While current visual question answering (VQA) studies consider multiple object properties, such as size, they typically blend perception and reasoning and lack representativeness in terms of reasoning and image categories, making it unclear whether and how vision-language models (VLMs) abstract and reason over depicted objects. To this end, we introduce a systematic evaluation framework comprising images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions, informed by prior work on common sense. We develop a procedure to instantiate this framework in two VQA object reasoning benchmarks: OPTICS-CNT, comprising 360 images paired with 1,080 multi-level, count-based questions, and OPTICS-CMP, with 2.1k comparison questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations relative to humans, with the best-performing model achieving below 40% counting and 70% comparison accuracy. VLMs struggle particularly with photographic images, counterfactual reasoning, physical and functional properties, and higher counts. We make the OPTICS benchmark data and code available to support future work on scalable benchmarking methods, generalized annotation guidelines, and advanced reasoning VLMs.
中文标题/摘要
标题:视觉对象属性上的常识推理研究
受人类分类启发,对象属性推理涉及识别和识别低级细节和高级抽象。尽管当前的视觉问答(VQA)研究考虑了多个对象属性,如大小,但它们通常将感知和推理结合在一起,并且在推理和图像类别方面缺乏代表性,这使得不清楚视觉语言模型(VLMs)是否以及如何对描绘的对象进行抽象和推理。为此,我们引入了一个系统评估框架,包括三种代表性类型的图像、三种复杂度递增的推理层次和四种对象属性维度,这些维度受到常识相关先前工作的启发。我们开发了一种程序,将此框架实例化为两个VQA对象推理基准:OPTICS-CNT,包含360张图像配对1,080个多层次、基于计数的问题,以及OPTICS-CMP,包含2,100个比较问题。零样本设置下12个最先进的VLMs的实验揭示了与人类相比的重大局限性,最佳模型在计数和比较准确性方面分别达到不到40%和70%。VLMs特别难以处理照片图像、反事实推理、物理和功能属性以及更高数量。我们提供了OPTICS基准数据和代码以支持未来可扩展基准方法、通用注释指南和高级推理VLMs的研究。
Summary / 总结
This study evaluates how vision-language models (VLMs) reason about object properties in images, addressing limitations in current visual question answering (VQA) studies. The researchers developed a systematic evaluation framework with three types of images, three reasoning levels, and four property dimensions, and instantiated it in two VQA benchmarks: OPTICS-CNT and OPTICS-CMP. Experiments with 12 state-of-the-art VLMs in zero-shot settings show significant limitations relative to humans: even the best-performing model achieves below 40% counting and 70% comparison accuracy, with particular difficulty on photographic images, counterfactual reasoning, physical and functional properties, and higher counts. The OPTICS data and code are released to support future work on benchmarking methods, annotation guidelines, and reasoning VLMs.
该研究旨在评估视觉语言模型(VLMs)在图像中对物体属性进行推理的能力,解决了当前视觉问答(VQA)研究中的局限性。研究人员开发了一个系统化的评估框架,包含三种类型的图像、三个推理层次和四个属性维度,并将其应用于两个VQA基准:OPTICS-CNT和OPTICS-CMP。实验结果显示,VLMs的表现远不及人类,尤其是在摄影图像和反事实推理方面,最佳模型在计数和比较任务中的准确率分别低于40%和70%。该研究强调了需要更好的基准测试方法和推理能力的改进。
RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
Authors: Yue Chang, Rufeng Chen, Zhaofan Zhang, Yi Chen, Sihong Xie
First: 2026-01-15T08:15:01+00:00 · Latest: 2026-01-15T08:15:01+00:00
Comments: 9 pages, 6 figures
Abstract
Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works for open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG to mitigate aggregation noise through re-shot guided uncertainty estimation and support object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.
中文标题/摘要
标题:RAG-3DSG:利用重拍引导检索增强生成改进3D场景图
开放词汇的3D场景图(3DSG)生成可以通过利用结构化的语义表示来增强机器人领域的各种下游任务,如操作和导航。3DSG是从场景的多张图像中构建的,其中对象作为节点,关系作为边。然而,现有的开放词汇3DSG生成工作在对象级识别准确性和速度方面都存在问题,主要是由于受限视角、遮挡和冗余表面密度。为了解决这些挑战,我们提出了RAG-3DSG,通过重拍引导的不确定性估计来减轻聚合噪声,并通过可靠的低不确定性对象支持对象级检索增强生成(RAG)。此外,我们提出了一种动态下采样映射策略,以通过自适应粒度加速跨图像对象聚合。在Replica数据集上的实验表明,RAG-3DSG在3DSG生成中显著提高了节点描述的准确性,同时将映射时间减少了三分之二。
Summary / 总结
RAG-3DSG addresses the limitations of existing 3D Scene Graph (3DSG) generation methods by enhancing object-level recognition accuracy and speed. It uses re-shot guided uncertainty estimation to mitigate aggregation noise and supports object-level Retrieval-Augmented Generation (RAG) through reliable low-uncertainty objects. Additionally, a dynamic downsample-mapping strategy accelerates cross-image object aggregation. Experiments show that RAG-3DSG improves node captioning accuracy and reduces mapping time by two-thirds compared to the vanilla version.
研究旨在通过3D场景图(3DSG)生成提高机器人应用中的准确性和速度。RAG-3DSG通过使用重新拍摄引导的不确定性估计来减少聚合噪声,并采用动态下采样映射策略加速对象聚合。实验表明,RAG-3DSG提高了节点描述的准确性,并将映射时间减少了三分之二,相比传统的版本。
Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation
Authors: Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee
First: 2025-12-01T03:38:44+00:00 · Latest: 2026-01-15T07:18:11+00:00
Abstract
We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
中文标题/摘要
标题:生成对抗Gumbel MCTS在抽象视觉组成生成中的应用
我们研究抽象视觉组成,其中身份主要由少量几何基本元素(如部分、对称性、拓扑)的空间配置和关系决定。这些元素主要对纹理和写实细节不变。在几何约束和模糊目标规范(如文本)下从固定组件组成此类结构是非平凡的,由于组合放置选择、有限的数据和离散可行性(无重叠、允许的方向)导致稀疏解空间不适合纯粹的统计像素空间生成器。我们提出了一种结合显式几何推理和神经语义的约束引导框架。AlphaGo风格的搜索确保可行性,而微调的视觉语言模型则作为奖励信号评分语义对齐。我们的算法使用策略网络作为蒙特卡洛树搜索中的启发式方法,并通过搜索生成的计划微调网络。受生成对抗网络启发,我们使用生成实例进行对抗奖励细化。随着时间的推移,当奖励模型无法区分生成实例和真实数据时,生成应更接近真实数据。在七巧板组装任务中,我们的方法在约束收紧时比扩散和自回归基线具有更高的有效性和语义保真度。
Summary / 总结
The research aims to generate abstract visual compositions by addressing the challenges of combinatorial placement and discrete feasibility. The method combines geometric reasoning with neural semantics using a Monte-Carlo Tree Search (MCTS) with a policy network. The approach uses an adversarial reward refinement mechanism inspired by Generative Adversarial Networks (GANs) to improve the quality of generated compositions. Experimental results show that the proposed method outperforms diffusion and auto-regressive baselines in terms of validity and semantic fidelity, particularly under tighter constraints.
研究旨在通过解决空间配置和离散可行性问题来生成抽象视觉组成。方法结合几何推理和神经语义学,使用约束导向框架和蒙特卡洛树搜索。该方法在拼图组装任务中优于扩散和自回归基线,显示出更高的有效性和语义保真度,尤其是在约束更紧的情况下。
Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets
Authors: Huy M. Le, Dat Tien Nguyen, Phuc Binh Nguyen, Gia Bao Le Tran, Phu Truong Thien, Cuong Dinh, Minh Nguyen, Nga Nguyen, Thuy T. N. Nguyen, Tan Nhat Nguyen, Binh T. Nguyen
First: 2025-11-15T15:23:44+00:00 · Latest: 2026-01-15T06:23:25+00:00
Abstract
The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
中文标题/摘要
标题:Fusionista2.0: 大规模数据集高效检索系统
视频浏览器 showdown (VBS) 挑战系统在严格的时间限制下提供准确结果。为满足这一需求,我们推出了 Fusionista2.0,一个优化速度和易用性的精简视频检索系统。所有核心模块都进行了重新设计以提高效率:预处理现在依赖于 ffmpeg 进行快速关键帧提取,光学字符识别使用 Vintern-1B-v3.5 进行稳健的多语言文本识别,自动语音识别采用 faster-whisper 进行实时转录。对于问答,轻量级的视觉语言模型提供了快速响应,而无需大型模型的高昂成本。除了这些技术升级,Fusionista2.0 还引入了重新设计的用户界面,提高了响应性、可访问性和工作流程效率,使非专家用户也能快速检索相关内容。评估表明检索时间减少了高达 75%,同时准确性和用户满意度都得到了提高,确认 Fusionista2.0 是一个具有竞争力且用户友好的大规模视频搜索系统。
Summary / 总结
Fusionista2.0 is an optimized video retrieval system designed to meet the VBS challenge by improving speed and usability. It re-engineers core modules such as preprocessing, optical character recognition, and automatic speech recognition for efficiency. The system also features a redesigned user interface that enhances responsiveness and accessibility. Evaluations show that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, making Fusionista2.0 a competitive and user-friendly solution for large-scale video search.
Fusionista2.0 是为了在严格的时间限制内高效检索大规模视频数据集而设计的。它通过优化预处理、光学字符识别和自动语音识别等核心模块来提高速度和易用性。系统还引入了改进的用户界面,增强了响应性和易用性。实验结果显示,检索时间最多减少了75%,同时准确性和用户满意度均有提高,使其成为大规模视频搜索的有竞争力且用户友好的解决方案。
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
Authors: Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen
First: 2026-01-15T05:47:43+00:00 · Latest: 2026-01-15T05:47:43+00:00
Abstract
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
中文标题/摘要
标题:V-Zero: 自我改进的多模态推理无需标注
近期在多模态学习方面的进展显著增强了视觉语言模型(VLMs)的推理能力。然而,最先进的方法严重依赖大规模的人工标注数据集,这些数据集的获取成本高且耗时。为克服这一限制,我们引入了V-Zero,这是一种通用的后训练框架,通过仅使用未标注的图像来促进自我改进。V-Zero 通过实例化两个不同的角色——提问者和解答者,建立了一个共生进化的循环。提问者通过利用对比直观猜测与推理结果的双重推理奖励机制,学习生成高质量、具有挑战性的问题。解答者则通过对其自身采样响应进行多数投票获得的伪标签进行优化。两个角色通过组相对策略优化(GRPO)迭代训练,推动相互增强的循环。令人惊讶的是,没有一个人工标注,V-Zero 在 Qwen2.5-VL-7B-Instruct 上实现了持续的性能提升,视觉数学推理提高了 1.7,一般视觉中心任务提高了 2.6,展示了多模态系统自我改进的潜力。代码可在 https://github.com/SatonoDia/V-Zero 获取
Summary / 总结
V-Zero is a post-training framework for self-improvement in vision-language models without relying on human annotations. It uses a co-evolutionary loop with a Questioner and a Solver, where the Questioner generates challenging questions and the Solver improves through pseudo-labels. This method achieves consistent performance gains, improving visual mathematical reasoning by 1.7 and general vision-centric tasks by 2.6 on Qwen2.5-VL-7B-Instruct.
V-Zero 是一个无需人工标注的自改进多模态推理后训练框架,通过一个协同进化循环,包含一个生成挑战性问题的提问者和一个通过多数投票伪标签优化的求解者。两者通过组相对策略优化迭代训练,实现了视觉数学推理和一般视觉中心任务的持续性能提升。V-Zero 在视觉数学推理上提高了 1.7,在一般视觉中心任务上提高了 2.6,而没有任何人工标注。
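The Solver's pseudo-labeling step described above, majority voting over its own sampled responses, can be sketched in a few lines. This is a minimal illustration under assumptions: answers are compared as exact strings, ties are broken by first occurrence, and the binary reward rule is a hypothetical choice rather than V-Zero's actual reward.

```python
from collections import Counter

def majority_vote(answers):
    """Return (pseudo_label, agreement), where agreement is the winning share of votes."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)

def pseudo_label_rewards(answers):
    """Assumed reward rule: 1.0 for answers that match the majority label, else 0.0."""
    label, _ = majority_vote(answers)
    return [1.0 if a == label else 0.0 for a in answers]

# Hypothetical sampled answers to one question about an unlabeled image.
samples = ["12", "12", "15", "12", "9"]
label, agreement = majority_vote(samples)
rewards = pseudo_label_rewards(samples)
print(label, agreement, rewards)
```

The agreement score is a natural confidence signal: questions on which the samples barely agree yield unreliable pseudo-labels.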
Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making
Authors: Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim
First: 2026-01-09T05:04:15+00:00 · Latest: 2026-01-15T05:09:03+00:00
Abstract
One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
Summary / 总结
This paper addresses the safety risks associated with Large Language Models (LLMs) in robotics decision-making by evaluating their performance in critical scenarios. Through a qualitative evaluation of a fire evacuation scenario and a quantitative assessment of seven tasks, the study identifies serious vulnerabilities in LLMs. Key findings include several models achieving 0% success rates in ASCII navigation tasks and instructing robots to move towards hazardous areas during a simulated fire drill, highlighting the potential for catastrophic outcomes even with high accuracy rates.
该论文探讨了大型语言模型(LLMs)在机器人决策中的安全风险。通过火灾疏散场景中的系列任务评估LLMs和视觉-语言模型(VLMs),发现多个模型在ASCII导航中完全失败,并在模拟火灾演习中错误地指示机器人向危险区域移动,显示出严重的安全漏洞。研究强调,99%的准确率对于安全关键系统是不够的,因为即使1%的失败率也可能导致灾难性后果。
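The argument that a 1% failure rate is dangerously misleading can be made concrete with a short calculation of how quickly the chance of at least one failure grows with the number of executions. The sketch assumes independent executions with a constant per-run failure rate, which is a simplification introduced here and not a claim from the paper.

```python
def prob_at_least_one_failure(per_run_failure_rate, num_runs):
    """P(at least one failure) over num_runs independent runs with a constant failure rate."""
    return 1.0 - (1.0 - per_run_failure_rate) ** num_runs

# With 99% per-run accuracy, failures become likely well before 1,000 runs.
for n in (1, 10, 100, 500):
    p = prob_at_least_one_failure(0.01, n)
    print(f"{n:>4} runs: P(at least one failure) = {p:.3f}")
```

Under these assumptions the probability of at least one failure passes 50% after 69 runs, since 0.99 raised to the 69th power is just under one half.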
Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model
Authors: Siwen Jiao, Tianxiong Lv, Kangan Qian, Chenxu Zhao, Xiuyuan Zhu, Tianlun Li, Xiaolong Cheng, Jinyu Li, Zhihao Liao, Yang Cai
First: 2026-01-12T16:26:42+00:00 · Latest: 2026-01-15T03:58:36+00:00
Abstract
Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
中文标题/摘要
标题:平滑算子:平滑可验证奖励激活视觉语言模型的空间推理能力
视觉语言模型(VLMs)在实现精确的数值预测以理解3D场景方面面临关键瓶颈。传统的强化学习(RL)方法,主要基于相对排名,往往遭受严重的奖励稀疏性和梯度不稳定性,无法有效利用3D物理约束提供的可验证信号。值得注意的是,在标准GRPO框架中,相对归一化导致“接近但未命中”的样本(特征为小但非零的误差)遭受优势坍塌。这导致在优化过程中有价值边界样本被丢弃的数据利用瓶颈。为解决这一问题,我们引入了平滑数值奖励激活(SNRA)操作和绝对保留GRPO(AP-GRPO)框架。SNRA采用动态参数化的Sigmoid函数将原始反馈转换为密集的连续奖励连续体。同时,AP-GRPO整合绝对标量梯度以减轻传统相对排名机制固有的数值信息损失。通过这种方法,我们构建了包含50,000个可验证3D子任务的数据集Numerical3D-50k。实验证明,AP-GRPO在性能上与大规模监督方法相当,同时保持更高的数据效率,有效激活了VLMs中的潜在3D推理能力,无需进行架构修改。
Summary / 总结
The research addresses the challenge of precise numerical prediction in 3D scene understanding by VLMs. It introduces SNRA and AP-GRPO to overcome issues of reward sparsity and gradient instability. SNRA uses a dynamically parameterized Sigmoid function to create a dense reward signal, while AP-GRPO integrates absolute gradients to preserve numerical information. The approach leads to performance comparable to large-scale supervised methods but with higher data efficiency, enhancing 3D reasoning capabilities in VLMs.
研究旨在通过解决传统强化学习中的奖励稀疏性和梯度不稳定性问题,提高视觉-语言模型(VLMs)在3D场景理解中的精确数值预测能力。研究引入了Smooth Numerical Reward Activation (SNRA) 操作符和Absolute-Preserving GRPO (AP-GRPO) 框架。SNRA将原始反馈转换为密集的奖励连续体,而AP-GRPO减轻了数值信息损失。这些方法使VLMs能够有效利用可验证的3D物理约束,从而实现与大规模监督方法的性能平齐,同时提高数据效率。
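To illustrate why a dense reward helps "near-miss" samples, here is a minimal sketch. It does not reproduce SNRA or AP-GRPO: the abstract does not give their exact form, so the sigmoid-shaped reward, the `steepness` and `tolerance` parameters, and the function names are all assumptions. The group normalization shown is the standard mean/std form used in GRPO-style training.

```python
import math
from statistics import mean, pstdev


def binary_reward(error: float, tolerance: float = 0.05) -> float:
    """Sparse reward: 1 only if the numeric error is within tolerance."""
    return 1.0 if abs(error) <= tolerance else 0.0


def smooth_reward(error: float, steepness: float = 4.0) -> float:
    """Sigmoid-shaped dense reward (assumed form): 2 * sigmoid(-k * |error|).

    Equals 1.0 at zero error and decays smoothly toward 0 as error grows.
    """
    z = min(steepness * abs(error), 700.0)  # clamp to avoid math.exp overflow
    return 2.0 / (1.0 + math.exp(z))


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # One group of sampled answers; the first is a near miss, the rest are far off.
    errors = [0.08, 0.9, 1.5, 2.0]
    sparse = [binary_reward(e) for e in errors]
    dense = [smooth_reward(e) for e in errors]
    print("sparse advantages:", group_relative_advantages(sparse))
    print("dense advantages: ", group_relative_advantages(dense))
```

With the sparse reward every sample in the group scores 0, so all advantages are 0 and the near miss contributes no learning signal. With the dense reward the near miss receives the largest advantage in the group.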
Memo-SQL: Structured Decomposition and Experience-Driven Self-Correction for Training-Free NL2SQL
Authors: Zerui Yang, Weichuan Wang, Yanwei Xu, Linqi Song, Yudai Matsuda, Wei Han, Bo Bai
First: 2026-01-15T02:42:05+00:00 · Latest: 2026-01-15T02:42:05+00:00
Abstract
Existing NL2SQL systems face two critical limitations: (1) they rely on in-context learning with only correct examples, overlooking the rich signal in historical error-fix pairs that could guide more robust self-correction; and (2) test-time scaling approaches often decompose questions arbitrarily, producing near-identical SQL candidates across runs and diminishing ensemble gains. Moreover, these methods suffer from a stark accuracy-efficiency trade-off: high performance demands excessive computation, while fast variants compromise quality. We present Memo-SQL, a training-free framework that addresses these issues through two simple ideas: structured decomposition and experience-aware self-correction. Instead of leaving decomposition to chance, we apply three clear strategies, entity-wise, hierarchical, and atomic sequential, to encourage diverse reasoning. For correction, we build a dynamic memory of both successful queries and historical error-fix pairs, and use retrieval-augmented prompting to bring relevant examples into context at inference time, no fine-tuning or external APIs required. On BIRD, Memo-SQL achieves 68.5% execution accuracy, setting a new state of the art among open, zero-fine-tuning methods, while using over 10 times fewer resources than prior TTS approaches.
中文标题/摘要
标题:Memo-SQL:结构化分解和经验驱动的自纠正以实现无需训练的NL2SQL
现有的NL2SQL系统面临两个关键限制:(1) 它们依赖于上下文学习,仅使用正确的示例,忽视了历史错误修正对的丰富信号,这些信号可以指导更稳健的自纠正;(2) 测试时的扩展方法通常任意地分解问题,导致每次运行生成几乎相同的SQL候选,从而削弱了集成收益。此外,这些方法还面临着明显的准确性和效率权衡:高性能需要大量计算,而快速版本则牺牲了质量。我们提出了Memo-SQL,这是一种无需训练的框架,通过两种简单的想法来解决这些问题:结构化分解和经验感知自纠正。我们不是让分解依赖于运气,而是应用了三种明确的策略:按实体、层次和原子顺序,以鼓励多样化的推理。对于纠正,我们构建了一个动态记忆,包括成功的查询和历史错误修正对,并在推理时使用检索增强提示将相关示例带入上下文,无需微调或外部API。在BIRD上,Memo-SQL实现了68.5%的执行准确率,成为无需训练且开放的方法中的最新状态,同时使用的资源比之前的TTS方法少超过10倍。
Summary / 总结
Memo-SQL is a training-free framework that addresses limitations in existing NL2SQL systems through structured decomposition and experience-aware self-correction. It uses entity-wise, hierarchical, and atomic sequential strategies for decomposition to encourage diverse reasoning, and maintains a dynamic memory of successful queries and historical error-fix pairs for self-correction. On the BIRD dataset, Memo-SQL achieves 68.5% execution accuracy, a new state of the art among open, zero-fine-tuning methods, while using over 10 times fewer resources than prior test-time scaling (TTS) approaches.
Memo-SQL 通过使用结构化分解和经验驱动的自我纠正来解决现有 NL2SQL 系统的限制。它采用了实体级、层次化和原子序列分解策略来促进多样化的推理,并使用成功查询和历史错误修正对的动态记忆来进行自我纠正。在 BIRD 数据集上,Memo-SQL 达到了 68.5% 的执行准确率,在开放、零微调方法中达到了新的最先进水平,并且使用的资源比之前的 TTS 方法少 10 倍以上。
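The idea of retrieving historical error-fix pairs into the prompt can be sketched as follows. This is a toy illustration, not Memo-SQL's implementation: the abstract does not specify the retriever or the prompt format, so the Jaccard token-overlap similarity, the `ErrorFix` and `ExperienceMemory` classes, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorFix:
    """One remembered correction: a question, a wrong query, and its fix."""
    question: str
    wrong_sql: str
    fixed_sql: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for whatever retriever is used."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class ExperienceMemory:
    """Growing store of error-fix pairs, queried by question similarity."""

    def __init__(self) -> None:
        self._items = []

    def add(self, item: ErrorFix) -> None:
        self._items.append(item)

    def retrieve(self, question: str, k: int = 2) -> list:
        ranked = sorted(
            self._items,
            key=lambda it: jaccard(question, it.question),
            reverse=True,
        )
        return ranked[:k]


def build_correction_prompt(question: str, draft_sql: str, examples: list) -> str:
    """Place retrieved error-fix pairs in context ahead of the draft to revise."""
    parts = []
    for ex in examples:
        parts.append(
            f"Question: {ex.question}\nWrong SQL: {ex.wrong_sql}\nFixed SQL: {ex.fixed_sql}\n"
        )
    parts.append(f"Question: {question}\nDraft SQL: {draft_sql}\nFixed SQL:")
    return "\n".join(parts)
```

A caller would add a pair to the memory whenever a correction succeeds, then call `retrieve` and `build_correction_prompt` at inference time. No model weights are updated, which is what makes this style of approach training-free.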
The Spatial Blindspot of Vision-Language Models
Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna
First: 2026-01-15T00:30:34+00:00 · Latest: 2026-01-15T00:30:34+00:00
Abstract
Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
中文标题/摘要
标题:视觉语言模型的空间盲点
视觉语言模型(VLMs)已经取得了快速进展,但它们捕捉空间关系的能力仍然是一个盲点。当前的VLMs通常使用CLIP风格的图像编码器进行对比语言-图像预训练。训练配方通常将图像压平为1D的块序列,从而丢弃了进行空间推理所必需的2D结构。我们认为,这种缺乏空间意识是VLM设计中缺失的一个维度,并且是需要空间定位的应用(如机器人技术和具身AI)的瓶颈。为了应对这一问题,我们研究了(i)使用其他目标训练的图像编码器以及(ii)2D位置编码。我们的实验表明,这些架构选择可以在多个基准上提高空间推理能力。
Summary / 总结
The research addresses the limitation of vision-language models (VLMs) in capturing spatial relationships, which is crucial for applications like robotics. The study explores alternative image encoders and 2D positional encodings to enhance spatial awareness. Experiments demonstrate that these modifications improve spatial reasoning on various benchmarks.
研究旨在解决视觉-语言模型(VLMs)在捕捉空间关系方面的不足,这对于机器人等应用至关重要。研究探索了不同的图像编码器和2D位置编码来增强空间推理能力。实验表明,这些架构选择可以在多个基准上提高VLM的空间推理表现。
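The contrast between a flattened 1D patch sequence and a 2D positional encoding can be sketched as below. The abstract does not say which 2D encoding the authors study; this example uses one common construction (concatenating sinusoidal encodings of the row and column indices), and the function names are assumptions.

```python
import math


def sincos_1d(position: int, dim: int) -> list:
    """Standard sinusoidal encoding of one integer position into `dim` values."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        out.append(math.sin(position * freq))
        out.append(math.cos(position * freq))
    return out


def sincos_2d(row: int, col: int, dim: int) -> list:
    """2D encoding: first half encodes the row, second half the column."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return sincos_1d(row, half) + sincos_1d(col, half)


def flat_index(row: int, col: int, width: int) -> int:
    """Index of a patch after row-major flattening into a 1D sequence."""
    return row * width + col


if __name__ == "__main__":
    width = 14  # patches per row, e.g. a 224-pixel image with 16-pixel patches
    # Vertically adjacent patches sit `width` steps apart in the flat sequence.
    print(flat_index(1, 0, width) - flat_index(0, 0, width))
    # With the 2D encoding they share the column half of the vector exactly.
    print(sincos_2d(0, 0, 8)[4:] == sincos_2d(1, 0, 8)[4:])
```

In the flattened sequence, two patches that touch vertically are as far apart as a full row, whereas the 2D encoding keeps row and column information in separate, directly comparable halves of the vector.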
MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
Authors: Yang Xing, Jiong Wu, Savas Ozdemir, Ying Zhang, Yang Yang, Wei Shao, Kuang Gong
First: 2026-01-14T21:21:00+00:00 · Latest: 2026-01-14T21:21:00+00:00
Abstract
Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we proposed MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
中文标题/摘要
标题:MedVL-SAM2:统一的3D医学视觉语言模型,用于多模态推理和提示驱动分割
医学视觉语言模型(VLMs)在图像级文本中心任务,如报告生成和视觉问答(VQA)方面取得了显著性能。然而,在3D医学VLM中实现精细的视觉定位和体积空间推理仍然具有挑战性,尤其是在希望在一个通用框架内统一这些能力时。为了解决这一挑战,我们提出了MedVL-SAM2,这是一种统一的3D医学多模态模型,同时支持报告生成、VQA和多范式分割,包括语义分割、引用分割和交互分割。MedVL-SAM2 通过一个针对3D医学成像定制的统一架构,将图像级推理和像素级感知相结合,并结合基于SAM2的体积分割模块,以实现精确的多粒度空间推理。该模型在多阶段管道中进行训练:首先在大规模的3D CT图像-文本对语料库上进行预训练,以对齐体积视觉特征与放射学语言嵌入。然后,使用一个全面的3D CT分割数据集,同时优化语言理解和分割目标。这种联合训练使语言、点或框提示的灵活交互成为可能,从而统一高层次的视觉推理与空间精确的定位。我们的统一架构在报告生成、VQA和多个3D分割任务中均实现了最先进的性能。进一步的分析还表明,该模型提供了可靠的3D视觉定位、可控的交互分割和稳健的跨模态推理,证明了高层次语义推理和精确的3D定位可以在统一的3D医学VLM中同时实现。
Summary / 总结
The research aims to improve the fine-grained visual grounding and volumetric spatial reasoning in 3D medical vision-language models. MedVL-SAM2, a unified 3D medical multimodal model, is proposed to support report generation, VQA, and multi-paradigm segmentation. The model integrates image-level reasoning and pixel-level perception through a cohesive architecture and uses a SAM2-based volumetric segmentation module. MedVL-SAM2 is pre-trained on 3D CT image-text pairs and jointly optimized with language-understanding and segmentation objectives. The model achieves state-of-the-art performance in report generation, VQA, and 3D segmentation tasks, and provides reliable 3D visual grounding and robust cross-modal reasoning.
研究旨在开发一个统一的3D医学视觉语言模型(MedVL-SAM2),以解决3D医学成像中精细视觉定位和体积空间推理的挑战。MedVL-SAM2 通过一个统一的架构整合了图像级推理和像素级感知,并集成了基于SAM2的体积分割模块。该模型通过多阶段训练进行训练,首先在3D CT图像-文本对上进行预训练,然后与语言理解和分割目标联合优化。关键发现包括在报告生成、VQA和多个3D分割任务中的领先性能,以及可靠的3D视觉定位、可控的交互分割和稳健的跨模态推理。
Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models
Authors: Michael R. Metel, Yufei Cui, Boxing Chen, Prasanna Parthasarathi
First: 2026-01-14T20:30:55+00:00 · Latest: 2026-01-14T20:30:55+00:00
Comments: Findings of EACL 2026
Abstract
Sequential test-time scaling is a promising training-free method to improve large reasoning model accuracy, but as currently implemented, significant limitations have been observed. Inducing models to think for longer can increase their accuracy, but as the length of reasoning is further extended, it has also been shown to result in accuracy degradation and model instability. This work presents a novel sequential test-time scaling method, Min-Seek, which improves model accuracy significantly over a wide range of induced thoughts, stabilizing the accuracy of sequential scaling, and removing the need for reasoning length fine-tuning. Beyond improving model accuracy over a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache which stores keys without position embeddings, by dynamically encoding them contiguously before each new generated thought, our method can continue to reason well beyond a model's maximum context length, and under mild conditions has linear computational complexity.
中文标题/摘要
标题:长思考,短执行:大型推理模型的稳定顺序测试时缩放
顺序测试时缩放是一种无需训练即可提高大型推理模型准确性的有前途的方法,但目前实施中观察到了显著的限制。延长模型的思考时间可以提高其准确性,但随着推理长度的进一步延长,也已显示出准确性和模型稳定性下降。本研究提出了一种新颖的顺序测试时缩放方法Min-Seek,该方法在广泛诱导思考范围内显著提高了模型准确性,稳定了顺序缩放的准确性,并消除了推理长度微调的需要。除了在各种推理任务中提高模型准确性,我们的方法还具有内在的高效性,因为在推理过程中仅保留一个额外诱导思考的KV对。通过使用一个自定义的KV缓存,该缓存不存储位置嵌入,而是动态地在每次生成新思考前连续编码它们,我们的方法可以继续推理远超模型的最大上下文长度,并在温和条件下具有线性计算复杂度。
Summary / 总结
This work addresses the limitations of current sequential test-time scaling methods for large reasoning models, which can lead to accuracy degradation and instability when reasoning length is extended. The proposed Min-Seek method stabilizes the accuracy of sequential scaling, improves model accuracy across various reasoning tasks, and removes the need for reasoning length fine-tuning. Only the KV pairs of one additional induced thought are kept in the KV cache, which makes the method efficient. A custom KV cache that stores keys without position embeddings and encodes them contiguously before each new thought allows reasoning well beyond the model's maximum context length, with linear computational complexity under mild conditions.
该研究解决了当前用于大型推理模型的序列测试时缩放方法的局限性,这些方法在扩展时可能导致准确率下降和不稳定。提出的Min-Seek方法在各种推理任务中显著提高了模型的准确率并保持了稳定性。该方法只需在推理时将一个额外推理思想的KV对保留在缓存中,使其高效,并允许推理超出模型的最大上下文长度,且在轻微条件下具有线性计算复杂度。
Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Authors: Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
First: 2025-12-08T05:15:41+00:00 · Latest: 2026-01-14T20:22:57+00:00
Comments: 9 pages, 3 figures. Preprint under review
Abstract
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
中文标题/摘要
标题:通过无训练自信心校准提高基于扩散的大语言模型的吞吐量
我们提出了CadLLM,这是一种无训练方法,用于加速基于扩散的大语言模型(dLLMs)的推理吞吐量。我们首先研究了令牌去遮蔽信心在块和步骤中的动态性质。基于这一观察,我们提出了一种轻量级自适应方法,根据未遮蔽令牌的平均信心控制生成块大小、步长和阈值。我们进一步通过动态利用词汇表的子集来调节采样范围,从而减少softmax开销。CadLLM 是一种即插即用、模型无关的方法,适用于基于KV缓存的大语言模型。在四个流行任务上的广泛实验表明,与最先进的基线相比,CadLLM 可以获得高达2.28倍的吞吐量提升,同时保持竞争力的准确性。
Summary / 总结
CadLLM is a training-free, plug-and-play method to enhance the inference throughput of diffusion-based large language models (dLLMs) by adaptively controlling the generation block size, step size, and threshold based on the average confidence of unmasked tokens. It also reduces softmax overhead by dynamically restricting sampling to a subset of the vocabulary. Experiments on four popular tasks show that CadLLM achieves up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
论文提出了CadLLM,一种无需训练的方法来提升扩散型大型语言模型(dLLM)的推理吞吐量。通过分析标记解掩蔽置信度的动态特性,CadLLM 提出了一个自适应方法,基于已解掩标记的平均置信度来调整生成块大小、步长和阈值。此外,通过动态使用词汇表的一个子集来减少 softmax 开销。该方法是模型无关的,兼容基于 KV 缓存的 dLLM。实验表明,CadLLM 在四个流行任务上实现了最高 2.28 倍的吞吐量提升,同时保持了有竞争力的准确性。
ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning
Authors: Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali
First: 2026-01-14T20:14:47+00:00 · Latest: 2026-01-14T20:14:47+00:00
Abstract
Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.
中文标题/摘要
标题:ViSIL:统一评估多模态视频字幕中的信息损失
多模态视频字幕将密集的视频片段浓缩为结构化的关键帧和自然语言格式。通过创建一致的多模态摘要,这种方法将生成型AI锚定在丰富的语义证据上,并作为高效检索的轻量级代理。然而,传统的指标如BLEU或ROUGE无法量化跨不同模态的信息覆盖情况,例如将一段文本与一系列关键帧进行比较。为了解决这个问题,我们提出了视频摘要信息损失(ViSIL)得分,这是一种信息论框架,通过视觉-语言模型(VLM)推理量化未被摘要捕捉的视频信息。通过测量信息损失,ViSIL成为一种统一的指标,可以在尽管存在结构差异的情况下直接比较不同多模态摘要格式。我们的结果显示,ViSIL得分与视频问答(VQA)任务中的人类和VLM性能之间存在统计学上的显著相关性。ViSIL还使摘要选择能够优化信息损失与处理速度之间的权衡,建立了帕累托最优前沿,在不增加处理负载的情况下,VQA准确率提高了7%。
Summary / 总结
The research aims to evaluate the information loss in multimodal video captioning by proposing the Video Summary Information Loss (ViSIL) score, an information-theoretic framework. ViSIL measures the video information not captured by a summary using vision-language model (VLM) inference, enabling direct comparison across different multimodal summary formats. The study shows that ViSIL scores correlate significantly with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also supports summary selection that trades off information loss against processing speed, yielding a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
研究旨在通过提出视频摘要信息损失(ViSIL)评分来评估多模态视频摘要中的信息损失,这是一种信息论框架。ViSIL 使用视觉-语言模型推理来衡量未被摘要捕获的信息,从而可以在不同多模态摘要格式之间进行直接比较。研究显示,ViSIL 评分与人类和 VLM 在 VQA 任务上的表现有显著的相关性,并有助于优化信息损失与处理速度之间的权衡,通过这种方式在 VQA 准确性上提高了 7%,而无需增加处理负载。
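The summary-selection step described above amounts to finding a Pareto frontier over two quantities to minimize: information loss and processing cost. The sketch below shows only that selection logic. It does not compute ViSIL, which requires VLM inference; the `Summary` type, the candidate names, and every number in the usage example are made up for illustration and are not results from the paper.

```python
from typing import NamedTuple


class Summary(NamedTuple):
    """A candidate summary format with two costs, both to be minimized."""
    name: str
    info_loss: float  # e.g. a ViSIL-style score, lower is better
    cost: float       # e.g. processing time or token count, lower is better


def dominates(a: Summary, b: Summary) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.info_loss <= b.info_loss and a.cost <= b.cost
    strictly_better = a.info_loss < b.info_loss or a.cost < b.cost
    return no_worse and strictly_better


def pareto_front(candidates: list) -> list:
    """Keep candidates that no other candidate dominates, sorted by cost."""
    front = [c for c in candidates if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda s: s.cost)


if __name__ == "__main__":
    # Illustrative values only.
    candidates = [
        Summary("text only", 0.60, 1.0),
        Summary("text + 2 keyframes", 0.40, 2.0),
        Summary("8 keyframes only", 0.50, 4.5),
        Summary("text + 8 keyframes", 0.35, 5.0),
    ]
    for s in pareto_front(candidates):
        print(s.name, s.info_loss, s.cost)
```

In this toy data, "8 keyframes only" is dropped because "text + 2 keyframes" loses less information at lower cost. The remaining three formats form the frontier from which a caller would pick according to its processing budget.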