arXiv 论文速递

2026-01-19 03:29
Snapshot: 20260119_0329
Alterbute: Editing Intrinsic Attributes of Objects in Images
Authors: Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen
First: 2026-01-15T18:59:53+00:00 · Latest: 2026-01-15T18:59:53+00:00
Comments: Project page is available at https://talreiss.github.io/alterbute/
Abstract
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., ''Porsche 911 Carrera'') that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
中文标题/摘要
标题:Alterbute:图像中对象固有属性的编辑方法
我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中对象的固有属性。我们允许更改对象的颜色、纹理、材质,甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖于无法可靠保持身份的无监督先验,要么使用过于严格的监督,从而阻止有意义的固有属性变化。我们的方法依赖于:(i) 放松的训练目标,允许模型在参考身份图像、描述目标固有属性的文本提示以及定义外部上下文的背景图像和对象掩码的条件下,同时改变固有属性和外部属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如,“保时捷911卡雷拉”),将具有身份定义特征的对象分组,同时允许固有属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和固有属性描述,从而实现可扩展、保持身份的监督。Alterbute在保持身份的物体固有属性编辑方面优于现有方法。
Summary / 总结
Alterbute is a diffusion-based method for editing intrinsic attributes of objects in images, such as color, texture, material, and even shape, while preserving the object's identity and scene context. It combines a relaxed training objective, which lets the model change both intrinsic and extrinsic attributes, with Visual Named Entities (VNEs) that provide identity-preserving supervision. At inference, reusing the original background and object mask restricts the edit to the desired intrinsic attributes. The authors report that the method outperforms existing approaches in identity-preserving object intrinsic attribute editing.
Alterbute 是一种基于扩散的方法,用于编辑图像中对象的内在属性,如颜色、纹理和材料,同时保持其身份和场景上下文。它使用宽松的训练目标和视觉命名实体来允许内在属性的变化,同时保持外在属性与原始图像一致。该方法在保持对象身份的同时编辑内在属性方面优于现有方法。
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略且不对称的连接方式,仅将视觉编码器的输出链接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现多层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 显著提高了性能,确立了CLI 作为一种可扩展范式的地位,通过赋予LLM按需访问完整视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the limitation of static vision-language models (VLMs) by introducing Cross-Layer Injection (CLI), a dynamic many-to-many framework. CLI includes an Adaptive Multi-Projection (AMP) module to harmonize features from various vision layers and an Adaptive Gating Fusion (AGF) mechanism to allow the language model to selectively inject relevant visual information. Experiments on 18 benchmarks show CLI improves performance, enabling LLMs to better integrate local details with global semantics.
论文通过引入Cross-Layer Injection (CLI)框架解决了当前视觉-语言模型(VLMs)的局限性,CLI创建了视觉和语言模态之间的动态多对多连接。CLI包含一个用于特征协调的Adaptive Multi-Projection (AMP)模块和一个用于选择性注入视觉信息的Adaptive Gating Fusion (AGF)机制。在18个基准测试中的实验表明,CLI提高了性能,使LLMs能够更好地将视觉细节与全局语义相结合。
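The gating idea described for CLI can be illustrated with a toy sketch. This is not the paper's AGF module: the function `gated_inject`, the sigmoid gate, and every weight below are assumptions made for illustration only. The sketch shows a decoder hidden state deciding, per vision layer, how much of that layer's feature to add.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_inject(hidden, visual_layers, gate_weights):
    """Hypothetical gated fusion (an assumption, not the paper's AGF).

    Each vision-layer feature is added to the hidden state, scaled by a gate
    in (0, 1) computed from the original hidden state (the decoding context).
    """
    fused = list(hidden)
    gates = []
    for feat, w in zip(visual_layers, gate_weights):
        g = sigmoid(dot(hidden, w))  # gate depends only on the context
        gates.append(g)
        fused = [f + g * v for f, v in zip(fused, feat)]
    return fused, gates

# Toy values, made up for illustration.
hidden = [1.0, 0.0]
visual_layers = [[0.5, 0.5], [2.0, -2.0]]   # features from two vision layers
gate_weights = [[0.0, 0.0], [-10.0, 0.0]]   # hypothetical learned parameters

fused, gates = gated_inject(hidden, visual_layers, gate_weights)
print(gates)  # first layer half open, second layer almost closed
print(fused)
```

With these toy weights the first vision layer contributes at half strength and the second is almost entirely suppressed, which is the many-to-many selection behaviour the abstract describes in words.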
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒昂贵错误的风险。我们研究了基于置信度的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们建立了两个发现。首先,置信度阈值化在同分布下提供了机制控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率
Summary / 总结
The research targets reliable deployment of vision-language models in high-stakes applications through selective prediction, where a system abstains when uncertain. Using NExT-QA and Gemini 2.0 Flash, the study finds that confidence thresholding gives mechanistic control in-distribution: sweeping the threshold epsilon produces a smooth tradeoff between risk and coverage. It also examines whether that control remains robust under distribution shift.
研究探讨了在视频问答中使用基于置信度的弃权策略来控制错误率。通过扫描置信度阈值,研究展示了风险与覆盖率之间的平滑权衡,在同分布下有效降低了错误率。研究还基于NExT-QA数据集和Gemini 2.0 Flash模型,考察了这种控制在分布偏移下是否保持稳健。
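A risk-coverage tradeoff of the kind described above can be computed in a few lines. The sketch below is a generic illustration of confidence thresholding, not the paper's code: the function name, the "answer when confidence >= threshold" rule, and the toy data are all assumptions.

```python
def risk_coverage(confidences, correct, threshold):
    """Answer only when confidence >= threshold; abstain otherwise.

    Returns (coverage, risk): the fraction of questions answered and the
    error rate among the answered ones.
    """
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    if not answered:
        return coverage, 0.0  # convention assumed here: no answers, no errors
    risk = 1.0 - sum(answered) / len(answered)
    return coverage, risk

# Toy data, made up for illustration (not results from the paper).
confidences = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
correct = [True, True, False, True, False, False]

for threshold in (0.0, 0.5, 0.85):
    coverage, risk = risk_coverage(confidences, correct, threshold)
    print(f"threshold={threshold:.2f} coverage={coverage:.2f} risk={risk:.2f}")
```

Raising the threshold lowers coverage and, on this toy data, lowers risk as well. Whether that relationship survives distribution shift is the question the study poses.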
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open-weight, open-data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当今最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,实际上是从它们中提炼而来,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是通过像素跟踪。即使是私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最新技术,并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、用于微调的自由形式视频问答数据集、一种新的具有复杂查询的对象跟踪数据集以及一种创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的数据训练食谱,并展示了在视觉标记上使用双向注意和一种新颖的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6),并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Semantic Misalignment in Vision-Language Models under Perceptual Degradation
Authors: Guo Cheng
First: 2026-01-13T09:13:05+00:00 · Latest: 2026-01-15T17:10:05+00:00
Comments: 10 pages, 4 figures, 6 tables
Abstract
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
中文标题/摘要
标题:视觉-语言模型在感知退化下的语义不匹配
视觉-语言模型(VLMs)在自动驾驶和具身AI系统中越来越被部署,可靠的感知对于安全的语义推理和决策至关重要。尽管最近的VLMs在多模态基准测试中表现出色,但它们对现实感知退化的鲁棒性仍然知之甚少。在本文中,我们系统地研究了在控制的上游视觉感知退化下VLMs中的语义不匹配,使用城市景观数据集上的语义分割作为代表性的感知模块。我们引入了感知现实的破坏,这些破坏仅在常规分割指标上引起适度下降,但观察到下游VLM行为的严重失败,包括虚构对象提及、安全关键实体的遗漏以及不一致的安全判断。为了量化这些影响,我们提出了一组语言层面的不匹配度量标准,以捕捉虚构、关键遗漏和安全误解,并分析这些度量标准与分割质量之间的关系,涵盖多个对比性和生成性VLMs。我们的结果揭示了像素级鲁棒性和多模态语义可靠性之间的明显脱节,突显了当前VLM基系统的一个关键局限性,并强调了在安全关键应用中明确考虑感知不确定性评估框架的必要性。
Summary / 总结
This study investigates the robustness of Vision-Language Models (VLMs) under perceptual degradation, focusing on their performance in autonomous driving and embodied AI systems. By introducing controlled corruptions to the semantic segmentation of the Cityscapes dataset, the research reveals that even moderate drops in segmentation accuracy can lead to severe failures in VLM behavior, such as hallucinations and safety misinterpretations. The authors propose new metrics to quantify these effects and find a clear disconnect between pixel-level robustness and multimodal semantic reliability, emphasizing the need for better evaluation frameworks in safety-critical applications.
该研究探讨了视觉语言模型(VLMs)在自主驾驶和具身人工智能系统中对感知降级的鲁棒性。通过引入对语义分割的可控破坏,研究发现即使分割精度只有轻微下降,VLMs 也可能出现幻觉和安全误判等严重故障。作者提出了新的度量标准来量化这些影响,并发现像素级鲁棒性和多模态语义可靠性之间存在明显的脱节,强调了需要更好的评估框架来考虑感知不确定性在关键安全应用中的影响。
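Language-level metrics of the kind mentioned above can be sketched as set comparisons between what a model mentions and what is actually in the scene. The abstract does not give the paper's metric definitions, so the two functions below are hypothetical stand-ins, and the object lists are made up.

```python
def hallucination_rate(mentioned, present):
    """Share of mentioned objects that are not in the scene (assumed definition)."""
    mentioned, present = set(mentioned), set(present)
    if not mentioned:
        return 0.0
    return len(mentioned - present) / len(mentioned)

def critical_omission_rate(mentioned, present, critical):
    """Share of safety-critical objects in the scene that go unmentioned
    (assumed definition)."""
    must_mention = set(present) & set(critical)
    if not must_mention:
        return 0.0
    return len(must_mention - set(mentioned)) / len(must_mention)

# Toy scene, made up for illustration.
present = {"road", "car", "pedestrian", "traffic light"}
critical = {"pedestrian", "traffic light", "cyclist"}
mentioned = {"road", "car", "truck", "traffic light"}

print(hallucination_rate(mentioned, present))                 # 0.25
print(critical_omission_rate(mentioned, present, critical))   # 0.5
```

In this toy scene the description invents a truck and misses the pedestrian. A pixel-level score would not flag either error, which is the disconnect the abstract points to.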
Action100M: A Large-scale Video Action Dataset
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
First: 2026-01-15T17:02:27+00:00 · Latest: 2026-01-15T17:02:27+00:00
Abstract
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
中文标题/摘要
标题:Action100M:大规模视频动作数据集
从视觉观察中推断物理动作是推进物理世界机器智能的基本能力。实现这一目标需要涵盖广泛领域的大型、开放词汇量视频动作数据集。我们介绍了Action100M,这是一个由120万互联网教学视频(14.6年时长)构建的大规模数据集,提供了约1亿量级(O(100 million))的时间局部化片段,具有开放词汇量的动作监督和丰富的描述。Action100M通过一个完全自动化的流水线生成,该流水线(i)使用V-JEPA 2嵌入进行分层时间分割,(ii)生成多级帧和片段描述,组织为描述树,(iii)使用多轮Self-Refine程序下的推理模型(GPT-OSS-120B)聚合证据,输出结构化注释(简要/详细动作、执行者、简要/详细描述)。在Action100M上训练VL-JEPA展示了在各种动作识别基准测试中一致的数据规模改进和强大的零样本性能,确立了Action100M作为视频理解和世界建模可扩展研究新基础的地位。
Summary / 总结
The research aims to build a large-scale video action dataset to advance machine intelligence in the physical world. Action100M is constructed from 1.2 million Internet instructional videos by a fully automated pipeline, yielding on the order of 100 million temporally localized segments with open-vocabulary action supervision and rich captions. Training VL-JEPA on Action100M shows consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks.
研究动机是开发大规模视频动作数据集,以促进物理世界中的机器智能应用。主要方法是从120万条教学视频中构建Action100M,并使用全自动流水线进行层次时间分割、多级字幕生成和结构化注释精炼。关键实验发现表明,Action100M在各种动作识别基准测试中表现出一致的数据扩展改进和强大的零样本性能,确立了其作为视频理解和世界建模研究基础的价值。
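The headline figures can be put in perspective with a little arithmetic. The sketch below uses only the numbers quoted in the abstract (1.2M videos, 14.6 years, on the order of 100 million segments) and assumes a 365.25-day year. Because "O(100 million)" is an order of magnitude, the segment figure is a rough estimate.

```python
videos = 1_200_000
years_of_video = 14.6
segments = 100_000_000  # order of magnitude only, per the abstract

# Assumption: one year = 365.25 days.
total_minutes = years_of_video * 365.25 * 24 * 60
avg_minutes_per_video = total_minutes / videos
segments_per_video = segments / videos

print(f"average video length: {avg_minutes_per_video:.1f} minutes")
print(f"segments per video:   {segments_per_video:.0f}")
```

Under these assumptions an average source video runs a little over six minutes and contributes roughly eighty segments. That count is plausible for a hierarchical segmentation, where segments at different levels can overlap.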
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Authors: Mikel Williams-Lekuona, Georgina Cosma
First: 2025-12-17T12:19:54+00:00 · Latest: 2026-01-15T16:58:39+00:00
Comments: Camera-ready version for ECIR 2026
Abstract
Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
中文标题/摘要
标题:基于图像复杂性的自适应检索以提高视觉语言模型的效率
视觉语言模型中的视觉变换器通常对每张图像使用相同的计算量,无论其简单与否。我们提出了ICAR(基于图像复杂性的自适应检索),这是一种自适应计算方法,使视觉变换器能够为简单的图像使用较少的计算量,而为复杂的图像通过其全网络深度进行处理。关键挑战是保持跨模态对齐:不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR 通过双路径训练解决这一问题,该训练产生来自早期退出路径和全深度路径的兼容嵌入。这在相同的语义空间中保持了图像表示和文本嵌入之间的兼容性,无论图像是否早期退出或完全处理。与现有的两阶段方法不同,ICAR 不需要昂贵的重排序,可以直接进行图像-文本匹配而无需额外开销。为了确定使用多少计算量,我们开发了ConvNeXt-IC,将其视为分类任务。通过应用现代分类器骨干网络而非专门的架构,ConvNeXt-IC 达到了最先进的性能,获得了与人工标注 0.959 的皮尔逊相关系数,同时实现了 4.4 倍更快的复杂性预测。在标准基准上增加了真实世界的网络数据,ICAR 在保持类别级性能的同时实现了 20% 更快的图像编码,并保持了 95% 的实例级性能,从而实现了视觉语言系统的可持续扩展。
Summary / 总结
The paper proposes ICAR (Image Complexity-Aware Retrieval), an adaptive computation method for vision transformers in vision-language models that uses less compute for simple images and full network depth for complex ones while maintaining cross-modal alignment. ICAR employs dual-path training to produce compatible embeddings from the early-exit and full-depth paths, and uses ConvNeXt-IC for image complexity assessment, which reaches a Pearson correlation of 0.959 with human labelling and 4.4x faster complexity prediction. ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance.
论文提出了ICAR(图像复杂性感知检索)方法,该方法针对视觉语言模型中的视觉变换器,对简单图像使用较少计算量,对复杂图像使用全网络深度,通过双路径训练保持跨模态对齐。ICAR在标准基准上的图像编码速度提高了20%,保持了95%的实例级性能,使视觉语言系统的可持续扩展成为可能。
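The routing step of complexity-aware encoding can be sketched as follows. This is a toy illustration rather than ICAR itself: the threshold, the layer counts, and the placeholder "layer" are all assumptions, and the dual-path training that keeps early-exit and full-depth embeddings compatible is not modelled.

```python
def choose_depth(complexity, threshold=0.5, early_depth=6, full_depth=12):
    """Pick how many layers to run from a complexity score in [0, 1].

    The threshold and depths are hypothetical values for illustration.
    """
    return early_depth if complexity < threshold else full_depth

def encode(features, depth):
    """Stand-in encoder: applies a placeholder transformation `depth` times."""
    x = list(features)
    for _ in range(depth):
        x = [0.9 * v + 0.1 for v in x]  # placeholder for a transformer block
    return x

# Toy complexity scores, as a complexity classifier might produce them.
complexities = [0.2, 0.7, 0.4, 0.9]
depths = [choose_depth(c) for c in complexities]
embeddings = [encode([0.0, 1.0], d) for d in depths]

# Fraction of layer evaluations avoided relative to always running 12 layers.
layers_saved = 1 - sum(depths) / (12 * len(depths))
print(depths, layers_saved)
```

The saving depends entirely on how many images fall below the threshold, so the cost of the complexity predictor itself matters. This is why the abstract stresses fast complexity prediction.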
Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
First: 2026-01-15T16:16:34+00:00 · Latest: 2026-01-15T16:16:34+00:00
Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
中文标题/摘要
标题:释放大型视觉语言模型在路边基础设施智能感知方面的潜力
城市路边基础设施的自动化感知对于智能城市管理至关重要,但通用模型往往难以捕捉到必要的细粒度属性和领域规则。虽然大型视觉语言模型(VLMs)在开放世界识别方面表现出色,但在遵循工程标准准确解释复杂设施状态方面却常常力不从心,导致在实际应用中的性能不可靠。为了解决这一问题,我们提出了一种领域适应框架,将VLMs转化为专门的智能基础设施分析代理。我们的方法结合了数据高效微调策略和基于知识的推理机制。具体来说,我们利用Grounding DINO的开放式词汇微调来在最少监督的情况下稳健地定位各种资产,然后利用基于LoRA的Qwen-VL适应进行深入的语义属性推理。为了减轻幻觉并确保专业合规,我们引入了一个双模态检索增强生成(RAG)模块,在推理过程中动态检索权威的行业标准和视觉示例。在一项新的城市路边场景数据集上进行评估,我们的框架实现了58.9的检测性能mAP和95.5%的属性识别准确率,展示了智能基础设施监控的稳健解决方案。
Summary / 总结
The research aims to improve automated perception of urban roadside infrastructure for smart city management by addressing the limitations of general-purpose models. The proposed domain-adapted framework applies open-vocabulary fine-tuning to Grounding DINO for asset localization and LoRA-based adaptation to Qwen-VL for attribute reasoning, and adds a dual-modality Retrieval-Augmented Generation module that retrieves industry standards and visual exemplars at inference to mitigate hallucinations and enforce professional compliance. On a new dataset of urban roadside scenes, the framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%.
研究旨在通过自动化感知城市路边基础设施来提升智能城市管理水平。提出了一种领域适应框架,利用开放词汇量微调和知识导向推理技术。该框架在检测方面达到58.9 mAP,在属性识别方面达到95.5%的准确率,展示了智能基础设施监控的可靠性能。
SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery
Authors: Chong Liu, Luxuan Fu, Yang Jia, Zhen Dong, Bisheng Yang
First: 2026-01-15T15:57:18+00:00 · Latest: 2026-01-15T15:57:18+00:00
Abstract
The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.
中文标题/摘要
标题:SVII-3D:利用分米级3D定位与理解从稀疏街道图像中推进路边基础设施库存
在智慧城市建设和设施生命周期管理中,自动创建数字孪生和精确资产库存是一项关键任务。然而,利用经济有效的稀疏图像仍然具有挑战性,因为其鲁棒性有限、定位不准确且缺乏细粒度状态理解。为了解决这些限制,提出了SVII-3D,这是一种用于整体资产数字化的统一框架。首先,将LoRA微调开放集检测与空间注意力匹配网络融合,以稳健地关联稀疏视图中的观测。其次,引入几何引导的细化机制以解决结构错误,实现精确的分米级3D定位。第三,超越静态几何映射,引入利用多模态提示的视觉-语言模型代理以自动诊断细粒度操作状态。实验表明,SVII-3D显著提高了识别准确性并最小化了定位误差。因此,该框架提供了一种可扩展、经济有效的解决方案,用于高保真基础设施数字化,有效弥合了稀疏感知与自动化智能维护之间的差距。
Summary / 总结
The paper proposes SVII-3D, a unified framework for creating digital twins and precise asset inventories using sparse street imagery. It addresses limitations such as robustness and localization accuracy by integrating LoRA fine-tuned open-set detection, a spatial-attention matching network, and a geometry-guided refinement mechanism. The framework also incorporates a Vision-Language Model agent for diagnosing fine-grained operational states. Experiments show that SVII-3D improves identification accuracy and minimizes localization errors, providing a scalable and cost-effective solution for infrastructure digitization.
论文提出了SVII-3D统一框架,利用稀疏街道图像创建数字孪生和精确资产库存。通过结合LoRA微调开放集检测、空间注意力匹配网络和几何引导精炼机制来解决鲁棒性、定位精度等问题。框架还引入了利用多模态提示的视觉语言模型代理来自动诊断细粒度的操作状态。实验表明,SVII-3D提高了识别准确性并减少了定位误差,提供了一种可扩展且成本效益高的基础设施数字化解决方案。
Curvature Tuning: Provable Training-free Model Steering From a Single Parameter
Authors: Leyang Hu, Matteo Gamba, Randall Balestriero
Venue: NeurIPS 2025
First: 2025-02-11T18:59:57+00:00 · Latest: 2026-01-15T15:36:28+00:00
Comments: Accepted at NeurIPS 2025
Abstract
The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model's decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions, thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 8.59%/8.34% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_\infty$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
中文标题/摘要
标题:曲率调谐:单一参数驱动的无需训练模型导向
模型和数据规模的扩展重塑了人工智能的格局,使微调预训练模型成为解决下游任务的标准范式。然而,主流的微调方法通常依赖于权重调整,往往缺乏可解释性,并且依赖于经验选择的超参数。本文从不同角度出发,将焦点从权重转移到激活函数,通过样条算子的视角来审视它们。我们提出了曲率调谐(CT),这是一种可解释且基于原理的导向方法,通过将单一超参数注入激活函数来调节模型的决策边界。我们证明CT能够可证明地调整模型决策边界的曲率,并且更根本地将模型投影到光滑函数的空间中,从而补充了效果主要体现在特征调整上的现有微调方法。使这个超参数可训练,便得到一种新颖且高度参数高效的微调方法。实验表明,CT在泛化能力和鲁棒性方面均有所提升。例如,在12个数据集上,它使ResNet-50/152的下游准确率相对于线性探针分别提高了8.59%/8.34%,相对于LoRA分别提高了4.64%/1.70%,并且在RobustBench的$\ell_\infty$基准上将鲁棒准确率分别提高了1032.64%/1494.46%。我们的代码可在https://github.com/Leon-Leyang/curvature-tuning 获取。
Summary / 总结
This paper introduces Curvature Tuning (CT), a training-free method that modulates a model's decision boundary by adjusting a single hyperparameter in its activation functions. CT is shown to improve model generalization and robustness, achieving significant accuracy boosts on various datasets compared to linear probing and other methods like LoRA. It projects models onto a space of smooth functions, complementing traditional finetuning methods focused on feature adaptation. Empirical results demonstrate CT's effectiveness in enhancing both accuracy and robustness across multiple benchmarks.
本文提出了曲率调谐(CT),这是一种训练-free 方法,通过调整激活函数中的单个超参数来调节模型的决策边界。CT 被证明能够提升模型的泛化能力和鲁棒性,在多个数据集上相比线性探针和 LoRA 等方法实现了显著的准确率提升。它将模型投影到光滑函数的空间中,补充了传统调优方法主要关注特征适应的局限性。实验证明,CT 在多个基准测试中有效提升了准确率和鲁棒性。
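The idea of steering curvature through a single activation-function parameter can be illustrated with the standard softplus function. This is a stand-in: the abstract does not give CT's actual parameterization, so the form used here and the name `beta` are assumptions. The known property it relies on is that softplus with sharpness `beta` approaches ReLU as `beta` grows, with a maximum gap of ln(2)/beta at zero.

```python
import math

def relu(x: float) -> float:
    return max(x, 0.0)

def softplus(x: float, beta: float) -> float:
    """Smooth ReLU surrogate log(1 + exp(beta * x)) / beta, for beta > 0.

    Written in a numerically stable form. Used here as a stand-in for a
    curvature-controlled activation, not as the paper's formulation.
    """
    return max(x, 0.0) + math.log1p(math.exp(-abs(beta * x))) / beta

# One parameter moves the activation between smooth and nearly piecewise linear.
for beta in (1.0, 10.0, 100.0):
    gap_at_zero = softplus(0.0, beta) - relu(0.0)
    print(f"beta={beta:6.1f}  gap at 0 = {gap_at_zero:.4f}")
```

A small `beta` rounds off the kink and a large `beta` restores it, so a network built from such activations changes the smoothness of its decision boundary without any weight update. The abstract describes this kind of effect.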
MERGETUNE: Continued fine-tuning of vision-language models
Authors: Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler
First: 2026-01-15T15:15:53+00:00 · Latest: 2026-01-15T15:15:53+00:00
Comments: 20 pages, 5 figures
Abstract
Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. We show, for the first time, performance superior to CLIP on both DTD and EuroSAT in cross-dataset transfer. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
中文标题/摘要
标题:MERGETUNE:视觉-语言模型的持续微调
微调视觉-语言模型(VLMs)如CLIP通常会导致预训练知识的灾难性遗忘。先前的工作主要旨在适应过程中减轻遗忘,然而遗忘在此过程中仍然不可避免。我们引入了一种新的范式——持续微调(CFT),旨在在零样本模型已经适应后恢复预训练知识。我们提出了一种简单的、模型无关的CFT策略(名为MERGETUNE),该策略由线性模式连通性(LMC)引导,可以在不进行架构更改的情况下应用于现有微调模型。给定一个微调模型,我们继续微调其可训练参数(例如,软提示或线性头),以搜索一个持续模型,该模型具有两条低损失路径到零样本(例如,CLIP)和微调(例如,CoOp)解决方案。通过利用损失景观的几何结构,持续模型隐式地合并了两种解决方案,恢复了在微调对应物中丢失的预训练知识。挑战在于,原始的LMC约束需要从预训练任务中重放数据。我们通过二阶近似零样本模型的LMC约束,避免了大规模数据重放的需要。实验表明,MERGETUNE在不增加参数的情况下,将CoOp的基本-新颖泛化平均值提高了5.6%。MERGETUNE首次在DTD和EuroSAT上展示了优于CLIP的性能,在跨数据集迁移中。在鲁棒微调评估中,MERGETUNE生成的LMC合并模型以较低的推理成本超越了集成基线,并在与零样本模型集成时达到了最先进的结果。我们的代码可在https://github.com/Surrey-UP-Lab/MERGETUNE获得。
Summary / 总结
The research aims to address the issue of catastrophic forgetting in fine-tuning vision-language models (VLMs) like CLIP. The authors introduce a novel paradigm called continued fine-tuning (CFT) and propose a model-agnostic strategy named MERGETUNE, which uses linear mode connectivity (LMC) to recover pretrained knowledge after the model has been adapted. Experiments show that MERGETUNE improves the harmonic mean of CoOp by 5.6% on base-novel generalization without adding parameters, and surpasses CLIP on DTD and EuroSAT datasets for cross-dataset transfer. The LMC-merged model also outperforms ensemble baselines with lower inference cost and achieves state-of-the-art results when ensembled with the zero-shot model.
研究旨在解决视觉-语言模型如CLIP在微调过程中出现的灾难性遗忘问题。作者引入了一种新的持续微调(CFT)范式,并提出了一种名为MERGETUNE的模型通用策略,该策略利用线性模式连通性(LMC)在模型已适应后恢复预训练知识。实验结果显示,MERGETUNE在基底-新颖泛化上的调和平均值提高了5.6%,且在DTD和EuroSAT数据集上的跨数据集迁移中优于CLIP。此外,MERGETUNE的LMC合并模型在较低的推理成本下超越了集成基线,并且与零样本模型集成后达到了最先进的结果。
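Linear mode connectivity, the property that guides MERGETUNE, can be checked by evaluating the loss along the straight line between two parameter vectors. The sketch below is a generic illustration with toy one-dimensional losses. It is not the paper's training procedure or its second-order surrogate, and all names are hypothetical.

```python
def interpolate(theta_a, theta_b, alpha):
    """Point on the straight line between two parameter vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]

def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest amount by which the loss on the path exceeds the linear
    interpolation of the two endpoint losses. Zero means no barrier."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier

# Toy losses, made up for illustration.
def convex(theta):
    return theta[0] ** 2               # a single basin

def double_well(theta):
    return (theta[0] ** 2 - 1) ** 2    # two minima separated by a bump

barrier_convex = loss_barrier(convex, [-1.0], [1.0])
barrier_double_well = loss_barrier(double_well, [-1.0], [1.0])
print(barrier_convex, barrier_double_well)
```

Two solutions are linearly mode connected when this barrier is close to zero. The abstract describes searching for a continued model that has such low-loss paths to both the zero-shot and the fine-tuned solutions.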
RGS-SLAM: Robust Gaussian Splatting SLAM with One-Shot Dense Initialization
Authors: Wei-Tse Cheng, Yen-Jen Chiou, Yuan-Fu Yang
First: 2025-12-28T03:45:57+00:00 · Latest: 2026-01-15T15:14:21+00:00
Comments: 10 pages, 9 figures
Abstract
We introduce RGS-SLAM, a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization. Instead of progressively adding Gaussians as residuals reveal missing geometry, RGS-SLAM performs a one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors refined through a confidence-aware inlier classifier, generating a well-distributed and structure-aware Gaussian seed prior to optimization. This initialization stabilizes early mapping and accelerates convergence by roughly 20%, yielding higher rendering fidelity in texture-rich and cluttered scenes while remaining fully compatible with existing GS-SLAM pipelines. Evaluated on the TUM RGB-D and Replica datasets, RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared with state-of-the-art Gaussian and point-based SLAM systems, sustaining real-time mapping performance at up to 925 FPS. Additional details and resources are available at this URL: https://breeze1124.github.io/rgs-slam-project-page/
中文标题/摘要
标题:RGS-SLAM:基于一次性密集初始化的鲁棒高斯点云SLAM
我们提出了RGS-SLAM,一种鲁棒的高斯点云SLAM框架,用无训练的对应到高斯的初始化阶段取代GS-SLAM中的残差驱动密集化阶段。RGS-SLAM 不是像GS-SLAM那样随着残差揭示缺失的几何结构逐步添加高斯点,而是通过一种基于置信度的内点分类器对DINOv3描述符进行细化,一次性三角化密集多视图对应,生成一个分布良好且结构意识强的高斯种子,作为优化前的先验。这种初始化稳定了早期建图,并通过大约20%的速度提升加速了收敛,从而在纹理丰富和杂乱的场景中提高了渲染保真度,同时保持与现有GS-SLAM流水线的完全兼容性。在TUM RGB-D和Replica数据集上评估,RGS-SLAM在定位和重建准确性方面与最先进的高斯和点云SLAM系统具有竞争力或更优,同时保持实时建图性能,最高可达925 FPS。更多细节和资源请参见此网址:https://breeze1124.github.io/rgs-slam-project-page/
Summary / 总结
RGS-SLAM is a robust Gaussian-splatting SLAM framework that introduces a one-shot dense initialization method to replace the residual-driven densification stage of GS-SLAM. By using DINOv3 descriptors and a confidence-aware inlier classifier, RGS-SLAM generates a well-distributed and structure-aware Gaussian seed prior to optimization, which stabilizes early mapping and accelerates convergence by about 20%. The system achieves competitive or superior localization and reconstruction accuracy on TUM RGB-D and Replica datasets, while maintaining real-time performance at up to 925 FPS and being fully compatible with existing GS-SLAM pipelines.
RGS-SLAM 是一种鲁棒的高斯点云 SLAM 框架,引入了一次性密集初始化方法,替代传统的基于残差的密集化阶段。通过使用 DINOv3 描述子和置信度感知的内点分类器,它进行了一次多视图对应点的三角化,生成了一个分布良好且结构意识强的高斯种子,用于优化前。这种方法稳定了早期建图并加速了收敛约 20%,在复杂场景中提高了渲染保真度,同时保持了实时性能,最高可达 925 FPS。RGS-SLAM 在 TUM RGB-D 和 Replica 数据集上的定位和重建精度与最先进的 SLAM 系统相当或更优。
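The triangulation step mentioned above can be illustrated with a small sketch. It uses midpoint triangulation of two viewing rays, a textbook method chosen here for brevity. The abstract does not say which triangulation RGS-SLAM uses, so treat this as an assumption. The function name `triangulate_midpoint` and its arguments are hypothetical, and camera centers and ray directions are assumed to be given in a common world frame.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(center1, dir1, center2, dir2, eps=1e-12):
    """Return the midpoint of the shortest segment between two viewing rays.

    Each ray is center + scale * direction. Returns None when the rays are
    (nearly) parallel, because the closest points are then not unique.
    """
    w = [a - b for a, b in zip(center1, center2)]
    a = _dot(dir1, dir1)
    b = _dot(dir1, dir2)
    c = _dot(dir2, dir2)
    d = _dot(dir1, w)
    e = _dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom  # scale along the first ray
    t = (a * e - b * d) / denom  # scale along the second ray
    p1 = [ci + s * di for ci, di in zip(center1, dir1)]
    p2 = [ci + t * di for ci, di in zip(center2, dir2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]


if __name__ == "__main__":
    # Two cameras one unit apart, both looking at the point (0.5, 0, 2).
    point = triangulate_midpoint(
        (0.0, 0.0, 0.0), (0.5, 0.0, 2.0),
        (1.0, 0.0, 0.0), (-0.5, 0.0, 2.0),
    )
    print(point)
```

A dense initialization would apply such a routine to every accepted correspondence at once, so each triangulated point can seed one Gaussian before optimization starts.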
Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Authors: Yu Wang, Yi Wang, Rui Dai, Yujie Wang, Kaikui Liu, Xiangxiang Chu, Yansheng Li
First: 2026-01-15T15:00:36+00:00 · Latest: 2026-01-15T15:00:36+00:00
Abstract
As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available in https://github.com/AMAP-ML/SocioReasoner.
中文标题/摘要
标题:基于视觉-语言推理的城市社会语义分割
作为人类活动的中心,城市表面包含了大量的语义实体。从卫星图像中分割这些各种实体对于一系列下游应用至关重要。当前先进的分割模型可以可靠地分割由物理属性定义的实体(如建筑物、水体),但在处理社会定义的类别(如学校、公园)方面仍然存在困难。在本工作中,我们通过视觉-语言模型推理实现了社会语义分割。为了促进这一过程,我们引入了名为SocioSeg的城市社会语义分割数据集,该数据集包含卫星图像、数字地图和按分层结构组织的社会语义实体的像素级标签。此外,我们还提出了一种新的视觉-语言推理框架,称为SocioReasoner,该框架通过跨模态识别和多阶段推理模拟人类识别和标注社会语义实体的过程。我们使用强化学习优化这一非可微过程,激发视觉-语言模型的推理能力。实验表明,我们的方法在最先进的模型上有所改进,并且具有强大的零样本泛化能力。我们的数据集和代码可在https://github.com/AMAP-ML/SocioReasoner获取。
Summary / 总结
This research aims to improve the segmentation of socially defined categories in urban areas using vision-language models. The authors introduce the SocioSeg dataset and a novel framework called SocioReasoner, which combines cross-modal recognition and multi-stage reasoning to achieve socio-semantic segmentation. Experiments show that their approach outperforms existing models and demonstrates strong zero-shot generalization capabilities.
该研究旨在利用卫星图像分割城市中的社会定义类别。作者引入了SocioSeg数据集,并提出了SocioReasoner框架,该框架结合了跨模态识别和多阶段推理来识别社会语义实体。实验结果显示,该方法优于现有模型,并且具有较强的零样本泛化能力。
Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning
Authors: Guoqiang Liang, Jianyi Wang, Zhonghua Wu, Shangchen Zhou
First: 2026-01-06T11:00:17+00:00 · Latest: 2026-01-15T14:19:47+00:00
Comments: Project Page: https://ethanliang99.github.io/ZOOMIQA-Projectpage
Abstract
Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or providing low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA by jointly generating quality descriptions and scores. However, existing VLM-based IQA methods often suffer from unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions, and 2) reinforcement learning (RL) for dynamic policy exploration, stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, with a Progressive Re-sampling Strategy for mitigating annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.
中文标题/摘要
标题:Zoom-IQA:基于可靠区域感知推理的图像质量评估
图像质量评估(IQA)是计算机视觉中的一个长期问题。以往的方法通常侧重于预测数值分数而没有解释,或者提供低级描述而缺乏精确的分数。最近的基于视觉语言模型(VLM)的推理方法在联合生成质量描述和分数方面显示出了强大的潜力。然而,现有的基于VLM的IQA方法往往由于其整合视觉和文本线索的能力有限而表现出不可靠的推理。在本文中,我们引入了Zoom-IQA,这是一种基于VLM的IQA模型,旨在明确模拟关键的认知行为:不确定性意识、区域推理和迭代细化。具体而言,我们提出了一种两阶段训练管道:1)在我们的Grounded-Rationale-IQA(GR-IQA)数据集上进行监督微调(SFT),以教导模型将其评估扎根于关键区域;2)通过我们的KL-Coverage正则化器稳定动态策略探索的强化学习(RL),并结合渐进重采样策略以减轻注释偏差。广泛的实验表明,Zoom-IQA在鲁棒性、可解释性和泛化能力方面有所提升。Zoom-IQA在下游任务中的应用,如图像恢复,进一步证明了其有效性。
Summary / 总结
Zoom-IQA is a VLM-based IQA model that improves upon existing methods by focusing on uncertainty awareness, region reasoning, and iterative refinement. It uses a two-stage training pipeline: supervised fine-tuning on a Grounded-Rationale-IQA dataset and reinforcement learning with a KL-Coverage regularizer. The model demonstrates improved robustness, explainability, and generalization in IQA tasks, and its effectiveness is further validated through applications in downstream tasks like image restoration.
Zoom-IQA 是一种基于 VLM 的 IQA 模型,通过明确模拟认知行为如不确定性意识、区域推理和迭代改进来提高鲁棒性、可解释性和泛化能力。它采用两阶段训练管道:在 Grounded-Rationale-IQA 数据集上进行监督微调和带有 KL-Coverage 正则化器和渐进重采样策略的强化学习。实验表明,Zoom-IQA 在质量评估的可靠性和精度方面优于现有方法。
Global Context Compression with Interleaved Vision-Text Transformation
Authors: Dian Jiao, Jiaxin Duan, Shuai Zhao, Jiabing Leng, Yiran Zhang, Feng Huang
First: 2026-01-15T13:29:16+00:00 · Latest: 2026-01-15T13:29:16+00:00
Abstract
Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4× compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3× speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.
中文标题/摘要
标题:全局上下文压缩与交错的视觉-文本转换
视觉-语言模型在端到端OCR方面的近期成就为低损耗压缩文本信息开辟了一条新途径。这促使早期工作将Transformer的输入转换为图像以进行预填充,从而通过视觉编码有效减少了令牌数量,从而减轻了注意力计算的二次增加。然而,这种部分压缩在逐令牌推理时未能节省计算或内存成本。在本文中,我们研究了全局上下文压缩,这种压缩在预填充和推理阶段都节省了令牌。因此,我们提出了VIST2,这是一种新颖的Transformer,交错地输入文本片段及其视觉编码,同时仅依赖于预上下文中的视觉令牌来预测下一个文本令牌分布。围绕这一理念,我们将文本片段渲染为草图图像,并分阶段训练VIST2,从基于课程表调度的光学语言模型预训练开始,然后是模态交错指令微调。我们使用从0.6B到8B缩放的VIST2家族进行了广泛的实验,以探索训练配方和超参数。压缩比为4倍的情况下,所得到的模型在长文本任务上显著优于基线,平均第一令牌生成速度提高3倍,内存使用减少77%,FLOPS减少74%。我们的代码和数据集将公开,以支持进一步的研究。
Summary / 总结
This paper studies global context compression for Transformers, motivated by the end-to-end OCR ability of vision-language models, which suggests that text can be stored as visual tokens with little loss. It introduces VIST2, a Transformer that interleaves text chunks with their visual encodings and relies only on the visual tokens in the preceding context to predict the next text token, so tokens are saved during both prefilling and inference. With a 4× compression ratio, VIST2 models scaled from 0.6B to 8B outperform baselines on long writing tasks, achieving on average a 3× speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPS.
本文探讨了全局上下文压缩在视觉语言模型中的应用,以减少预填充和推理阶段的计算和内存成本。提出了一种新颖的Transformer——VIST2,它交替使用文本和视觉编码,并利用视觉标记进行预测。实验表明,从0.6B到8B参数的VIST2模型实现了4倍的压缩比,平均在第一标记生成速度上提高了3倍,内存使用减少了77%,FLOPS减少了74%,优于基线模型在长文本任务上的表现。
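The link between token compression and attention cost can be checked with a few lines of arithmetic. This sketch counts only pairwise attention scores over the context. It ignores feed-forward layers, the vision encoder, and everything else in the model, so it does not reproduce the 74% FLOPS figure reported above. The helper names `compressed_length` and `attention_pairs` are hypothetical.

```python
import math


def compressed_length(num_text_tokens, ratio):
    """Number of context tokens left after compressing by the given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.ceil(num_text_tokens / ratio)


def attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    original = 4096
    compressed = compressed_length(original, ratio=4)
    saving = attention_pairs(original) / attention_pairs(compressed)
    print(compressed, saving)
```

Since attention cost grows with the square of the context length, a 4× shorter context needs 16× fewer attention scores, which is the effect the abstract refers to as alleviating quadratic attention computation.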
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
Authors: Peng-Fei Zhang, Zi Huang
First: 2026-01-15T11:45:56+00:00 · Latest: 2026-01-15T11:45:56+00:00
Comments: 15 pages, 7 figures
Abstract
Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.
中文标题/摘要
标题:视觉-语言模型多模态普遍攻击的分层细化
现有的针对VLP模型的对抗攻击大多针对样本特定,当扩展到大规模数据集或新场景时会产生大量的计算开销。为克服这一局限,我们提出了分层细化攻击(HRA),这是一种针对VLP模型的多模态普遍攻击框架。HRA在样本级别和优化级别对普遍对抗扰动(UAPs)进行细化。对于图像模态,我们将对抗样本分解为干净图像和扰动,允许每个组件独立处理,以更有效地破坏跨模态对齐。我们还引入了一种ScMix增强策略,以多样化视觉上下文并增强UAPs的全局和局部效用,从而减少对虚假特征的依赖。此外,通过利用历史和估计未来梯度的时间层次结构来细化优化路径,以避免局部最小值并稳定普遍扰动学习。对于文本模态,HRA通过结合句内和句间重要性度量来识别全局有影响力的单词,并随后利用这些单词作为普遍文本扰动。广泛的实验结果表明,提出的多模态普遍攻击具有优越性。
Summary / 总结
The research aims to address the computational overhead of sample-specific adversarial attacks on vision-language models by proposing Hierarchical Refinement Attack (HRA), a universal multimodal attack framework. HRA refines universal adversarial perturbations at both the sample and optimization levels, using techniques like ScMix augmentation and a temporal hierarchy of gradients. The study shows that HRA outperforms existing methods across various downstream tasks and datasets.
研究旨在通过提出层次化精炼攻击(HRA),一种多模态的通用攻击框架,解决视觉语言模型中样本特定的对抗性攻击的计算开销问题。HRA 在样本和优化层面精炼通用对抗性扰动,使用图像的解纠缠技术和文本的重要性度量。研究结果表明,HRA 在各种下游任务、视觉语言模型和数据集上优于现有方法。
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
Authors: Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
First: 2025-05-29T16:41:12+00:00 · Latest: 2026-01-15T11:24:14+00:00
Comments: 29 pages, 13 figures
Abstract
Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands the diverse embodied reasoning capabilities for the task. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
中文标题/摘要
标题:Robot-R1:强化学习在机器人本体推理中的增强
大型视觉-语言模型(LVLM)最近在通过结合本体推理和机器人控制来推动机器人技术方面展现了巨大的潜力。一种常见的方法是使用监督微调(SFT)对与机器人控制相关的本体推理任务进行训练。然而,SFT数据集通常是通过启发式方法构建的,并未明确优化以提高机器人控制性能。此外,SFT往往会导致灾难性遗忘和泛化性能降低等问题。为了解决这些局限性,我们提出了Robot-R1,这是一种新颖的框架,利用强化学习来增强特别针对机器人控制的本体推理。Robot-R1 学习预测完成任务所需的下一个关键点状态,条件是基于当前场景图像和从专家演示中提取的环境元数据。受DeepSeek-R1学习方法的启发,Robot-R1 采样基于推理的响应,并强化那些导致更准确预测的响应。为了严格评估Robot-R1,我们还引入了一个新的基准,要求具备多样化的本体推理能力。我们的实验表明,使用Robot-R1训练的模型在本体推理任务上优于SFT方法。尽管只有70亿个参数,Robot-R1甚至在与低级动作控制相关的推理任务,如空间和运动推理方面,也超越了GPT-4o。
Summary / 总结
The paper introduces Robot-R1, a framework that uses reinforcement learning to enhance embodied reasoning for robot control, addressing the limitations of supervised fine-tuning methods. Robot-R1 predicts the next keypoint state needed for task completion based on current scene images and environment metadata from expert demonstrations. Experiments show that Robot-R1 outperforms supervised fine-tuning methods on embodied reasoning tasks and even surpasses GPT-4o in low-level action control reasoning tasks.
研究旨在通过解决监督微调(SFT)方法的局限性,如灾难性遗忘和泛化性能降低,来提升机器人的体态推理能力。Robot-R1 是一种新颖的框架,利用强化学习来增强特定于机器人控制的体态推理。它基于当前场景图像和环境元数据从专家演示中学习预测完成任务所需的下一个关键点状态。实验表明,使用Robot-R1训练的模型在体态推理任务上优于SFT方法,并且在低级动作控制任务如空间和运动推理方面甚至超越了GPT-4o。
A Study of Commonsense Reasoning over Visual Object Properties
Authors: Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski
First: 2025-08-14T11:28:40+00:00 · Latest: 2026-01-15T11:10:05+00:00
Abstract
Inspired by human categorization, object property reasoning involves identifying and recognizing low-level details and higher-level abstractions. While current visual question answering (VQA) studies consider multiple object properties, such as size, they typically blend perception and reasoning and lack representativeness in terms of reasoning and image categories, making it unclear whether and how vision-language models (VLMs) abstract and reason over depicted objects. To this end, we introduce a systematic evaluation framework comprising images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions, informed by prior work on common sense. We develop a procedure to instantiate this framework in two VQA object reasoning benchmarks: OPTICS-CNT, comprising 360 images paired with 1,080 multi-level, count-based questions, and OPTICS-CMP, with 2.1k comparison questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations relative to humans, with the best-performing model achieving below 40% counting and 70% comparison accuracy. VLMs struggle particularly with photographic images, counterfactual reasoning, physical and functional properties, and higher counts. We make the OPTICS benchmark data and code available to support future work on scalable benchmarking methods, generalized annotation guidelines, and advanced reasoning VLMs.
中文标题/摘要
标题:视觉对象属性上的常识推理研究
受人类分类启发,对象属性推理涉及识别和识别低级细节和高级抽象。当前的视觉问答(VQA)研究考虑了多个对象属性,如大小,但通常将感知和推理结合在一起,并且在推理和图像类别方面缺乏代表性,这使得不清楚视觉语言模型(VLMs)是否以及如何对描绘的对象进行抽象和推理。为此,我们引入了一个系统评估框架,包括三种代表性类型的图像、三种复杂度递增的推理层次和四种由先前常识工作启发的对象属性维度。我们开发了一种程序,将此框架实例化为两个VQA对象推理基准:OPTICS-CNT,包含360张图像配对1,080个多级计数问题,和OPTICS-CMP,包含2,100个比较问题。零样本设置下12个最先进的VLMs的实验揭示了与人类相比的重大局限性,最佳模型在计数和比较准确性上分别低于40%和70%。VLMs特别难以处理摄影图像、反事实推理、物理和功能属性以及更高数量。我们提供了OPTICS基准数据和代码,以支持未来可扩展基准方法、通用注释指南和高级推理VLMs的研究。
Summary / 总结
This study aims to evaluate how vision-language models (VLMs) reason about object properties in images, addressing limitations in current visual question answering (VQA) studies. The researchers developed a systematic evaluation framework with three types of images, three levels of reasoning complexity, and four object property dimensions, and instantiated it in two benchmarks, OPTICS-CNT and OPTICS-CMP. They tested 12 state-of-the-art VLMs in zero-shot settings and found that these models perform significantly worse than humans, especially on photographic images, counterfactual reasoning, physical and functional properties, and higher counts. The best model achieved below 40% accuracy in counting and below 70% in comparison tasks. The study provides insights into the limitations of VLMs in reasoning over visual object properties and offers a benchmark for future research.
该研究旨在评估视觉语言模型(VLMs)在图像中对物体属性进行推理的能力,解决了当前视觉问答(VQA)研究中的局限性。研究人员开发了一个系统性评估框架,包含三种图像类型、三种推理层次和四个属性维度,并将该框架应用于两个VQA基准数据集OPTICS-CNT和OPTICS-CMP。研究发现,最先进的VLMs表现不佳,在计数和比较任务中的准确率分别低于40%和70%,尤其是在照片图像和反事实推理方面。该研究突显了VLMs在属性推理方面与人类表现之间的显著差距。
RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
Authors: Yue Chang, Rufeng Chen, Zhaofan Zhang, Yi Chen, Sihong Xie
First: 2026-01-15T08:15:01+00:00 · Latest: 2026-01-15T08:15:01+00:00
Comments: 9 pages, 6 figures
Abstract
Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works for open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG to mitigate aggregation noise through re-shot guided uncertainty estimation and support object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.
中文标题/摘要
标题:RAG-3DSG:利用重拍引导检索增强生成改进3D场景图
开放词汇的3D场景图(3DSG)生成可以通过利用结构化的语义表示来增强机器人领域的各种下游任务,如操作和导航。3DSG是从场景的多张图像中构建的,其中对象作为节点,关系作为边。然而,现有的开放词汇3DSG生成工作在对象级识别准确性和速度方面都存在问题,主要是由于受限的视角、遮挡和冗余表面密度。为了解决这些挑战,我们提出了RAG-3DSG,通过重拍引导的不确定性估计来减轻聚合噪声,并通过可靠的低不确定性对象支持对象级检索增强生成(RAG)。此外,我们提出了一种动态下采样映射策略,以通过自适应粒度加速跨图像对象聚合。在Replica数据集上的实验表明,RAG-3DSG在3DSG生成中显著提高了节点描述的准确性,同时将映射时间减少了三分之二,与原版相比。
Summary / 总结
The research aims to enhance 3D Scene Graph (3DSG) generation for robotics applications by addressing low recognition accuracy and speed issues. RAG-3DSG uses re-shot guided uncertainty estimation to reduce aggregation noise and supports object-level Retrieval-Augmented Generation (RAG) with reliable low-uncertainty objects. Additionally, a dynamic downsample-mapping strategy accelerates cross-image object aggregation. Experiments show that RAG-3DSG improves node captioning accuracy and reduces mapping time by two-thirds compared to the vanilla version.
研究旨在通过解决低识别准确率和速度慢的问题,提升3D场景图(3DSG)生成,以应用于机器人领域。RAG-3DSG利用重新拍摄引导的不确定性估计和动态下采样映射策略,提高物体级别的识别准确率并减少映射时间。实验结果显示,RAG-3DSG在节点描述准确性上有所提升,并将映射时间减少了三分之二。
Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation
Authors: Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee
First: 2025-12-01T03:38:44+00:00 · Latest: 2026-01-15T07:18:11+00:00
Abstract
We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
中文标题/摘要
标题:生成对抗Gumbel MCTS在抽象视觉合成生成中的应用
我们研究抽象视觉合成,其中身份主要由少量几何基本元素(如部分、对称性、拓扑)的空间配置和关系决定。它们主要对纹理和写实细节不变。在几何约束和模糊目标规范(如文本)下从固定组件合成此类结构是非平凡的,由于组合放置选择、有限的数据和离散可行性(无重叠、允许的方向),这导致了一个稀疏的解空间,不适合纯粹的统计像素空间生成器。我们提出了一种结合显式几何推理和神经语义的约束引导框架。AlphaGo风格的搜索确保可行性,而微调的视觉语言模型则作为奖励信号评分语义对齐。我们的算法使用策略网络作为蒙特卡洛树搜索中的启发式方法,并通过搜索生成的计划微调网络。受生成对抗网络的启发,我们使用生成实例进行对抗奖励细化。随着时间的推移,当奖励模型无法区分生成实例和真实数据时,生成应更接近真实数据。在七巧板组装任务中,我们的方法在约束收紧时比扩散和自回归基线具有更高的有效性和语义保真度。
Summary / 总结
The study addresses the challenge of generating abstract visual compositions using geometric primitives under geometric constraints and vague goals. It proposes a constraint-guided framework combining explicit geometric reasoning with neural semantics. The approach uses Monte-Carlo Tree Search with a policy network and a fine-tuned vision-language model for reward signals. The method outperforms diffusion and auto-regressive baselines in the Tangram Assembly task, showing higher validity and semantic fidelity, especially under tight constraints.
该研究解决了在几何约束和模糊目标下使用几何基本元素生成抽象视觉组成的问题。提出的方法结合了显式的几何推理和神经语义,通过一个约束导向框架实现。它使用蒙特卡洛树搜索和策略网络,并通过搜索生成的计划来微调网络。该方法在七巧板组装任务中优于扩散和自回归基线,显示出更高的有效性和语义保真度,尤其是在约束条件更严格时。
Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets
Authors: Huy M. Le, Dat Tien Nguyen, Phuc Binh Nguyen, Gia Bao Le Tran, Phu Truong Thien, Cuong Dinh, Minh Nguyen, Nga Nguyen, Thuy T. N. Nguyen, Tan Nhat Nguyen, Binh T. Nguyen
First: 2025-11-15T15:23:44+00:00 · Latest: 2026-01-15T06:23:25+00:00
Abstract
The Video Browser Showdown (VBS) challenges systems to deliver accurate results under strict time constraints. To meet this demand, we present Fusionista2.0, a streamlined video retrieval system optimized for speed and usability. All core modules were re-engineered for efficiency: preprocessing now relies on ffmpeg for fast keyframe extraction, optical character recognition uses Vintern-1B-v3.5 for robust multilingual text recognition, and automatic speech recognition employs faster-whisper for real-time transcription. For question answering, lightweight vision-language models provide quick responses without the heavy cost of large models. Beyond these technical upgrades, Fusionista2.0 introduces a redesigned user interface with improved responsiveness, accessibility, and workflow efficiency, enabling even non-expert users to retrieve relevant content rapidly. Evaluations demonstrate that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, confirming Fusionista2.0 as a competitive and user-friendly system for large-scale video search.
中文标题/摘要
标题:Fusionista2.0:大规模数据集高效检索系统
视频浏览器 showdown (VBS) 挑战系统在严格的时间限制下提供准确结果。为满足这一需求,我们推出了 Fusionista2.0,一个优化速度和易用性的精简视频检索系统。所有核心模块都进行了重新设计以提高效率:预处理现在依赖于 ffmpeg 进行快速关键帧提取,光学字符识别使用 Vintern-1B-v3.5 进行稳健的多语言文本识别,自动语音识别采用 faster-whisper 进行实时转录。对于问答,轻量级的视觉语言模型提供了快速响应,而无需大型模型的高昂成本。除了这些技术升级,Fusionista2.0 还引入了重新设计的用户界面,提高了响应性、可访问性和工作流程效率,使非专家用户也能快速检索相关内容。评估表明,检索时间减少了高达 75%,同时准确性和用户满意度都得到了提高,确认 Fusionista2.0 是一个具有竞争力且用户友好的大规模视频搜索系统。
Summary / 总结
Fusionista2.0 is designed to efficiently retrieve accurate results within strict time constraints for large-scale video datasets. It optimizes key modules such as preprocessing, optical character recognition, and automatic speech recognition for speed and robustness, and uses lightweight vision-language models for question answering. The system also features a redesigned user interface that enhances responsiveness and accessibility. Evaluations show that retrieval time was reduced by up to 75% while accuracy and user satisfaction both increased, making Fusionista2.0 a competitive and user-friendly solution for large-scale video search.
Fusionista2.0 旨在通过优化核心模块来满足 Video Browser Showdown (VBS) 中快速准确检索的需求。它使用 ffmpeg 进行快速关键帧提取,使用 Vintern-1B-v3.5 进行稳健的多语言文本识别,并使用 faster-whisper 进行实时转录。系统还采用轻量级的视觉语言模型进行问答。用户界面的改进增强了响应性和可访问性。实验结果表明,检索时间减少了 75%,准确率和用户满意度均有所提高,使 Fusionista2.0 成为一个具有竞争力且用户友好的大规模视频搜索系统。
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
Authors: Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen
First: 2026-01-15T05:47:43+00:00 · Latest: 2026-01-15T05:47:43+00:00
Abstract
Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
中文标题/摘要
标题:V-Zero: 自我改进的多模态推理无需标注
近期多模态学习的进展显著提升了视觉语言模型(VLMs)的推理能力。然而,最先进的方法严重依赖大规模的人标注数据集,这些数据集的获取成本高且耗时。为克服这一限制,我们引入了V-Zero,这是一种通用的后训练框架,通过仅使用未标记的图像来促进自我改进。V-Zero 通过实例化两个不同的角色——提问者和解答者,建立了一个共生进化的循环。提问者通过利用对比直观猜测与推理结果的双重推理奖励机制,学习生成高质量的挑战性问题。解答者则通过对其自身采样响应进行多数投票获得的伪标签进行优化。两个角色通过组相对策略优化(GRPO)迭代训练,推动相互增强的循环。令人惊讶的是,没有使用任何人工标注,V-Zero 在 Qwen2.5-VL-7B-Instruct 上实现了持续的性能提升,视觉数学推理提高了 1.7,一般视觉中心任务提高了 2.6,展示了多模态系统自我改进的潜力。代码可在 https://github.com/SatonoDia/V-Zero 获取
Summary / 总结
V-Zero is a post-training framework that enables self-improvement in vision-language models using only unlabeled images, without human-annotated datasets. It uses a co-evolutionary loop with a Questioner and a Solver: the Questioner learns to generate challenging questions, and the Solver is optimized on pseudo-labels obtained by majority voting over its own sampled responses. Both roles are trained iteratively with Group Relative Policy Optimization (GRPO). On Qwen2.5-VL-7B-Instruct, V-Zero yields consistent gains, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6.
研究旨在通过利用未标注图像,而非依赖成本高昂且耗时的人工标注数据集,提升视觉语言模型的推理能力。V-Zero 是一个后训练框架,通过一个包含提问者和解答者两者的共生进化循环来促进自我改进。提问者生成具有挑战性的问题,而解答者则通过自身响应的伪标签进行优化。通过组相对策略优化(GRPO)进行迭代训练,促进两者相互提升。V-Zero 在视觉数学推理任务上提高了1.7,在一般视觉中心任务上提高了2.6,无需任何人工标注。
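The pseudo-labelling step described for the Solver can be sketched in a few lines. This is a generic majority vote over sampled answers, not code from the V-Zero repository. The function name `majority_vote` and the returned agreement score are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(responses):
    """Return the most frequent answer and the fraction of samples agreeing with it.

    `responses` is a non-empty list of final answers sampled from the same model
    for one question. The winning answer serves as the pseudo-label.
    """
    if not responses:
        raise ValueError("need at least one sampled response")
    counts = Counter(responses)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(responses)


if __name__ == "__main__":
    label, agreement = majority_vote(["12", "12", "15", "12"])
    print(label, agreement)
```

The agreement fraction is a convenient by-product: a training loop could use it to skip questions on which the samples disagree too much, although whether V-Zero filters this way is not stated in the abstract.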
Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making
Authors: Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim
First: 2026-01-09T05:04:15+00:00 · Latest: 2026-01-15T05:09:03+00:00
Abstract
One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how "rare" errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
中文标题/摘要
标题:安全未找到(404):基于LLM的机器人决策中的隐藏风险
AI系统在关键安全环境中的一个错误可能会导致生命损失。随着大型语言模型(LLMs)在机器人决策中的作用日益重要,物理风险的范围也在扩大;一个错误指令可以直接威胁到人类的安全。本文针对在即使出现微小错误也会导致灾难性后果的场景中系统地评估LLM性能的迫切需求进行了探讨。通过定性评估火灾疏散场景,我们识别出基于LLM的决策中的关键失败案例。基于这些案例,我们设计了七个用于定量评估的任务,分为:完全信息、不完全信息和安全导向的空间推理(SOSR)。完全信息任务使用ASCII地图来减少解释歧义,并将空间推理与视觉处理隔离。不完全信息任务要求模型推断缺失的上下文,测试空间连续性与幻觉。SOSR任务使用自然语言评估在生命威胁情境下的安全决策。我们在这七个任务中对各种LLM和视觉语言模型(VLMs)进行了基准测试。除了整体性能外,我们还分析了1%失败率的影响,强调“罕见”的错误如何升级为灾难性后果。结果揭示了严重的漏洞:一些模型在ASCII导航中实现了0%的成功率,而在模拟火灾演习中,模型指示机器人向危险区域移动而不是紧急出口。我们的研究结果得出一个令人警醒的结论:当前的LLM尚不适合直接部署在关键安全系统中。99%的准确率在机器人领域是危险的误导,因为它意味着每一百次执行中就可能有一次会导致灾难性伤害。我们证明即使是最先进的模型也无法保证安全,完全依赖它们会带来不可接受的风险。
Summary / 总结
This paper addresses the safety risks associated with Large Language Models (LLMs) in robotics decision-making, particularly in critical scenarios. It evaluates LLMs and Vision-Language Models (VLMs) through a series of tasks, including Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). The study reveals that several models failed completely in ASCII navigation and incorrectly directed robots towards hazardous areas during a simulated fire drill. These findings indicate that current LLMs are not suitable for safety-critical systems, as even a 1% failure rate can lead to catastrophic outcomes.
本文探讨了在机器人决策中使用大型语言模型(LLMs)所伴随的安全风险。通过模拟关键场景(如火灾疏散)的一系列任务来评估LLMs和视觉语言模型(VLMs)。关键发现包括严重的漏洞,一些模型在ASCII导航任务中完全失败,并在模拟火灾演习中指示机器人向危险区域移动。研究得出结论,当前的LLMs尚不适合部署在安全关键系统中,即使1%的失败率也可能导致灾难性后果。
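The abstract's point about a 1% failure rate can be made concrete with a short calculation. The sketch below assumes every execution fails independently with the same probability; that independence is a simplifying assumption of this example, not something the paper states, and the function name is illustrative.

```python
def prob_at_least_one_failure(per_run_failure_rate: float, runs: int) -> float:
    """Probability of one or more failures across `runs` independent executions."""
    if not 0.0 <= per_run_failure_rate <= 1.0:
        raise ValueError("per_run_failure_rate must be in [0, 1]")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    # P(no failure in any run) = (1 - p)^runs, so the complement is the answer.
    return 1.0 - (1.0 - per_run_failure_rate) ** runs


if __name__ == "__main__":
    for runs in (1, 10, 100, 1000):
        p = prob_at_least_one_failure(0.01, runs)
        print(f"{runs:>5} runs -> P(at least one failure) = {p:.3f}")
```

Under this assumption, a system that is 99% accurate per execution has roughly a 63% chance of failing at least once over 100 executions.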
Smooth Operator: Smooth Verifiable Reward Activates Spatial Reasoning Ability of Vision-Language Model
Authors: Siwen Jiao, Tianxiong Lv, Kangan Qian, Chenxu Zhao, Xiuyuan Zhu, Tianlun Li, Xiaolong Cheng, Jinyu Li, Zhihao Liao, Yang Cai
First: 2026-01-12T16:26:42+00:00 · Latest: 2026-01-15T03:58:36+00:00
Abstract
Vision-Language Models (VLMs) face a critical bottleneck in achieving precise numerical prediction for 3D scene understanding. Traditional reinforcement learning (RL) approaches, primarily based on relative ranking, often suffer from severe reward sparsity and gradient instability, failing to effectively exploit the verifiable signals provided by 3D physical constraints. Notably, in standard GRPO frameworks, relative normalization causes "near-miss" samples (characterized by small but non-zero errors) to suffer from advantage collapse. This leads to a severe data utilization bottleneck where valuable boundary samples are discarded during optimization. To address this, we introduce the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA employs a dynamically parameterized Sigmoid function to transform raw feedback into a dense, continuous reward continuum. Concurrently, AP-GRPO integrates absolute scalar gradients to mitigate the numerical information loss inherent in conventional relative-ranking mechanisms. By leveraging this approach, we constructed Numerical3D-50k, a dataset comprising 50,000 verifiable 3D subtasks. Empirical results indicate that AP-GRPO achieves performance parity with large-scale supervised methods while maintaining higher data efficiency, effectively activating latent 3D reasoning in VLMs without requiring architectural modifications.
中文标题/摘要
标题:平滑算子:平滑可验证奖励激活视觉语言模型的空间推理能力
视觉语言模型(VLMs)在实现精确的数值预测以理解3D场景方面面临关键瓶颈。传统的强化学习(RL)方法,主要基于相对排名,往往遭受严重的奖励稀疏性和梯度不稳定性,未能有效利用3D物理约束提供的可验证信号。值得注意的是,在标准GRPO框架中,相对归一化导致“接近但未命中”的样本(特征为小但非零的误差)遭受优势坍塌。这导致在优化过程中有价值边界样本被丢弃的数据利用瓶颈。为解决这一问题,我们引入了平滑数值奖励激活(SNRA)操作和绝对保留GRPO(AP-GRPO)框架。SNRA采用动态参数化的Sigmoid函数将原始反馈转换为密集的连续奖励连续体。同时,AP-GRPO整合绝对标量梯度以减轻传统相对排名机制固有的数值信息损失。通过这种方法,我们构建了包含50,000个可验证3D子任务的数据集Numerical3D-50k。实验证明,AP-GRPO在性能上与大规模监督方法相当,同时保持更高的数据效率,有效激活了VLMs中的潜在3D推理能力,无需进行架构修改。
Summary / 总结
The research aims to enhance the precision of numerical predictions in 3D scene understanding for Vision-Language Models (VLMs) by addressing the issues of reward sparsity and gradient instability in traditional reinforcement learning. The study introduces the Smooth Numerical Reward Activation (SNRA) operator and the Absolute-Preserving GRPO (AP-GRPO) framework. SNRA transforms raw feedback into a dense, continuous reward, while AP-GRPO mitigates numerical information loss. These methods enable the construction of Numerical3D-50k, a dataset of 50,000 verifiable 3D subtasks, and demonstrate that AP-GRPO achieves performance comparable to large-scale supervised methods with higher data efficiency, effectively activating 3D reasoning in VLMs without architectural changes.
研究旨在提高Vision-Language模型(VLMs)在3D场景理解中的数值预测精度。提出了Smooth Numerical Reward Activation(SNRA)操作符和Absolute-Preserving GRPO(AP-GRPO)框架,以解决传统强化学习方法中的奖励稀疏性和梯度不稳定性问题。研究表明,AP-GRPO结合SNRA,能够在保持高数据效率的同时,实现与大规模监督方法相当的性能,从而有效激活VLMs中的3D推理能力。
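The "advantage collapse" of near-miss samples described in the abstract can be illustrated with a toy comparison. The sketch below is not the paper's SNRA or AP-GRPO formulation: the reward shapes, the `tol` and `scale` parameters, and all function names are assumptions chosen for illustration. It only contrasts a binary reward with a sigmoid-shaped one under group-relative normalization.

```python
import math
from statistics import mean, pstdev


def binary_reward(error: float, tol: float = 0.05) -> float:
    """Hypothetical sparse reward: 1 only when the error is within tolerance."""
    return 1.0 if abs(error) <= tol else 0.0


def smooth_reward(error: float, scale: float = 0.1) -> float:
    """Hypothetical sigmoid-shaped reward: 1 at zero error, decaying toward 0."""
    # The exponent is clamped so that very large errors do not overflow math.exp.
    exponent = min(abs(error) / scale, 700.0)
    return 2.0 / (1.0 + math.exp(exponent))


def group_advantages(rewards, eps: float = 1e-8):
    """Group-relative normalization: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    errors = [0.06, 0.10, 0.50, 1.00]  # every sample misses the tolerance
    print(group_advantages([binary_reward(e) for e in errors]))
    print(group_advantages([smooth_reward(e) for e in errors]))
```

With the binary reward every sample in the group scores 0, so all normalized advantages are 0 and the near-miss at error 0.06 carries no learning signal. With the smooth reward the same group yields advantages ordered by error.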
Memo-SQL: Structured Decomposition and Experience-Driven Self-Correction for Training-Free NL2SQL
Authors: Zerui Yang, Weichuan Wang, Yanwei Xu, Linqi Song, Yudai Matsuda, Wei Han, Bo Bai
First: 2026-01-15T02:42:05+00:00 · Latest: 2026-01-15T02:42:05+00:00
Abstract
Existing NL2SQL systems face two critical limitations: (1) they rely on in-context learning with only correct examples, overlooking the rich signal in historical error-fix pairs that could guide more robust self-correction; and (2) test-time scaling approaches often decompose questions arbitrarily, producing near-identical SQL candidates across runs and diminishing ensemble gains. Moreover, these methods suffer from a stark accuracy-efficiency trade-off: high performance demands excessive computation, while fast variants compromise quality. We present Memo-SQL, a training-free framework that addresses these issues through two simple ideas: structured decomposition and experience-aware self-correction. Instead of leaving decomposition to chance, we apply three clear strategies, entity-wise, hierarchical, and atomic sequential, to encourage diverse reasoning. For correction, we build a dynamic memory of both successful queries and historical error-fix pairs, and use retrieval-augmented prompting to bring relevant examples into context at inference time, no fine-tuning or external APIs required. On BIRD, Memo-SQL achieves 68.5% execution accuracy, setting a new state of the art among open, zero-fine-tuning methods, while using over 10 times fewer resources than prior TTS approaches.
中文标题/摘要
标题:Memo-SQL:结构化分解和经验驱动的自纠正机制以实现无需训练的NL2SQL
现有的NL2SQL系统面临两个关键限制:(1) 它们依赖于上下文学习,仅使用正确的示例,忽视了历史错误修正对的丰富信号,这些信号可以指导更稳健的自纠正;(2) 测试时的扩展方法通常会任意分解问题,导致每次运行生成几乎相同的SQL候选,从而削弱了集成收益。此外,这些方法还面临着明显的准确性和效率权衡:高性能需要大量计算,而快速版本则牺牲了质量。我们提出了Memo-SQL,这是一种无需训练的框架,通过两种简单的方法来解决这些问题:结构化分解和经验感知自纠正。我们不是让分解依赖于运气,而是应用了三种明确的策略:按实体、层次和原子顺序,以鼓励多样化的推理。对于纠正,我们构建了一个动态记忆,包括成功的查询和历史错误修正对,并在推理时使用检索增强提示将相关示例带入上下文,无需微调或外部API。在BIRD上,Memo-SQL实现了68.5%的执行准确率,成为无需训练且开放的方法中的最新状态,同时使用的资源比之前的TTS方法少10多倍。
Summary / 总结
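Memo-SQL is a training-free framework that addresses the limitations of existing NL2SQL systems through structured decomposition and experience-aware self-correction. It applies three decomposition strategies (entity-wise, hierarchical, and atomic sequential) to encourage diverse reasoning, and it builds a dynamic memory of successful queries and historical error-fix pairs that retrieval-augmented prompting brings into context at inference time. On BIRD, Memo-SQL achieves 68.5% execution accuracy, which the authors report as a new state of the art among open, zero-fine-tuning methods, while using over 10 times fewer resources than prior test-time scaling (TTS) approaches.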
Memo-SQL 通过引入结构化分解和经验驱动的自我纠正来解决现有 NL2SQL 系统的限制。它使用实体级、层次结构和原子序列分解策略来促进多样化的推理,并使用成功查询和历史错误修正对的动态记忆进行自我纠正。在 BIRD 数据集上,Memo-SQL 达到了 68.5% 的执行准确率,超过了之前的开放、零微调方法,同时使用了显著较少的资源。
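The retrieval of historical error-fix pairs into the prompt can be sketched as follows. This is a minimal illustration, not Memo-SQL's implementation: the `ErrorFixPair` type, the token-overlap (Jaccard) similarity, and the prompt layout are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ErrorFixPair:
    """Hypothetical memory entry: a question, a failed query, and its fix."""
    question: str
    wrong_sql: str
    fixed_sql: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two questions (an assumed metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(memory, question: str, k: int = 2):
    """Return the k memory entries whose questions are most similar."""
    ranked = sorted(memory, key=lambda p: jaccard(p.question, question), reverse=True)
    return ranked[:k]


def build_prompt(question: str, examples) -> str:
    """Place retrieved error-fix pairs in context ahead of the new question."""
    parts = []
    for ex in examples:
        parts.append(
            f"Question: {ex.question}\n"
            f"Incorrect SQL: {ex.wrong_sql}\n"
            f"Corrected SQL: {ex.fixed_sql}\n"
        )
    parts.append(f"Question: {question}\nSQL:")
    return "\n".join(parts)


if __name__ == "__main__":
    memory = [
        ErrorFixPair(
            "How many customers placed orders in 2022?",
            "SELECT COUNT(*) FROM orders WHERE year = 2022",
            "SELECT COUNT(DISTINCT customer_id) FROM orders WHERE year = 2022",
        ),
        ErrorFixPair(
            "List the names of all employees",
            "SELECT * FROM employees",
            "SELECT name FROM employees",
        ),
    ]
    query = "How many customers placed orders in 2023?"
    print(build_prompt(query, retrieve(memory, query, k=1)))
```

A real system would likely use embedding-based retrieval; token overlap is used here only to keep the sketch dependency-free.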
The Spatial Blindspot of Vision-Language Models
Authors: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna
First: 2026-01-15T00:30:34+00:00 · Latest: 2026-01-15T00:30:34+00:00
Abstract
Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
中文标题/摘要
标题:视觉语言模型的空间盲点
视觉语言模型(VLMs)已经取得了快速进展,但它们捕捉空间关系的能力仍然是一个盲点。当前的VLMs通常基于对比语言-图像预训练(CLIP)风格的图像编码器构建。训练方案通常将图像展平为1D的图像块序列,丢弃了进行空间推理所必需的2D结构。我们认为,这种缺乏空间意识是VLM设计中缺失的一个维度,并且是需要空间定位的应用(如机器人技术和具身AI)的瓶颈。为了应对这一问题,我们研究了(i)使用其他目标训练的图像编码器和(ii)2D位置编码。我们的实验表明,这些架构选择可以在多个基准上提高空间推理能力。
Summary / 总结
The research aims to address the limitation of vision-language models (VLMs) in capturing spatial relationships, which is crucial for applications like robotics. The study explores alternative image encoders and 2D positional encodings to enhance spatial reasoning. Experiments demonstrate that these modifications improve VLM performance on spatial reasoning benchmarks.
研究旨在解决视觉语言模型(VLMs)在捕捉空间关系方面的局限性,这对于机器人等应用至关重要。研究探讨了使用替代图像编码器和2D位置编码来增强空间推理的方法。实验表明,这些架构上的改进可以提高VLM在空间基准测试中的性能。
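The difference between a flattened 1D patch index and a 2D positional encoding can be shown with a small sketch. It uses a common sinusoidal scheme that encodes the row and column separately and concatenates the two halves; this is an assumed illustrative variant, not necessarily the encoding studied in the paper, and all names are hypothetical.

```python
import math


def sincos(position: int, dim: int, base: float = 10000.0):
    """Standard 1D sinusoidal encoding of length `dim` (dim must be even)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        enc.append(math.sin(position * freq))
        enc.append(math.cos(position * freq))
    return enc


def encode_2d(row: int, col: int, dim: int):
    """First half encodes the row, second half encodes the column."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return sincos(row, half) + sincos(col, half)


def flat_index(row: int, col: int, grid_w: int) -> int:
    """Position of a patch after row-major flattening into a 1D sequence."""
    return row * grid_w + col


if __name__ == "__main__":
    grid_w = 14
    above, below = (3, 5), (4, 5)  # vertically adjacent patches
    print(flat_index(*below, grid_w) - flat_index(*above, grid_w))
    print(encode_2d(*above, 8)[4:] == encode_2d(*below, 8)[4:])
```

In the flattened sequence, two vertically adjacent patches sit a full row width apart, whereas the 2D encoding gives them an identical column component.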
MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
Authors: Yang Xing, Jiong Wu, Savas Ozdemir, Ying Zhang, Yang Yang, Wei Shao, Kuang Gong
First: 2026-01-14T21:21:00+00:00 · Latest: 2026-01-14T21:21:00+00:00
Abstract
Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we proposed MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
中文标题/摘要
标题:MedVL-SAM2:统一的3D医学视觉语言模型,用于多模态推理和提示驱动分割
医学视觉语言模型(VLMs)在图像级文本中心任务,如报告生成和视觉问答(VQA)方面取得了显著性能。然而,在3D医学VLMs中实现精细的视觉定位和体积空间推理仍然具有挑战性,尤其是在希望在一个通用框架内统一这些能力时。为了解决这一挑战,我们提出了MedVL-SAM2,这是一种统一的3D医学多模态模型,同时支持报告生成、VQA和多范式分割,包括语义分割、引用分割和交互分割。MedVL-SAM2 通过一个针对3D医学成像定制的统一架构,结合图像级推理和像素级感知,并结合基于SAM2的体积分割模块,以实现精确的多粒度空间推理。该模型在多阶段管道中进行训练:首先在大规模的3D CT图像-文本对语料库上进行预训练,以对齐体积视觉特征与放射学-语言嵌入。然后使用一个全面的3D CT分割数据集,同时优化语言理解和分割目标。这种联合训练使语言、点或框提示的灵活交互成为可能,从而统一高层次的视觉推理与空间精确的定位。我们的统一架构在报告生成、VQA和多个3D分割任务上实现了最先进的性能。进一步的分析还表明,该模型提供了可靠的3D视觉定位、可控的交互分割和稳健的跨模态推理,证明了高层次语义推理和精确的3D定位可以在统一的3D医学VLM中同时实现。
Summary / 总结
The research aims to improve fine-grained visual grounding and volumetric spatial reasoning in 3D medical vision-language models. MedVL-SAM2 is a unified 3D medical multimodal model that supports report generation, VQA, and multi-paradigm segmentation. It integrates image-level reasoning and pixel-level perception through a cohesive architecture and uses a SAM2-based volumetric segmentation module. The model is trained in a multi-stage pipeline, first pre-trained on 3D CT image-text pairs and then jointly optimized with language-understanding and segmentation objectives. Key findings include state-of-the-art performance across report generation, VQA, and 3D segmentation tasks, and reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning.
研究旨在提高3D医学视觉语言模型的细粒度视觉定位和体积空间推理能力。MedVL-SAM2 是一个统一的3D医学多模态模型,支持报告生成、VQA和多范式分割。模型通过多阶段训练进行训练,首先在3D CT图像-文本对上进行预训练,然后与语言理解和分割目标联合优化。该模型在各种任务上实现了最先进的性能,并展示了可靠的3D视觉定位和稳健的跨模态推理能力。
Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models
Authors: Michael R. Metel, Yufei Cui, Boxing Chen, Prasanna Parthasarathi
First: 2026-01-14T20:30:55+00:00 · Latest: 2026-01-14T20:30:55+00:00
Comments: Findings of EACL 2026
Abstract
Sequential test-time scaling is a promising training-free method to improve large reasoning model accuracy, but as currently implemented, significant limitations have been observed. Inducing models to think for longer can increase their accuracy, but as the length of reasoning is further extended, it has also been shown to result in accuracy degradation and model instability. This work presents a novel sequential test-time scaling method, Min-Seek, which improves model accuracy significantly over a wide range of induced thoughts, stabilizing the accuracy of sequential scaling, and removing the need for reasoning length fine-tuning. Beyond improving model accuracy over a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache which stores keys without position embeddings, by dynamically encoding them contiguously before each new generated thought, our method can continue to reason well beyond a model's maximum context length, and under mild conditions has linear computational complexity.
中文标题/摘要
标题:长思考,短推理:大型推理模型的稳定顺序测试时缩放方法
顺序测试时缩放是一种无需训练即可提高大型推理模型准确性的有前景的方法,但目前实施中观察到了显著的限制。延长模型的思考时间可以提高其准确性,但随着推理长度的进一步延长,也已显示出准确性和模型稳定性下降的问题。本研究提出了一种新颖的顺序测试时缩放方法——Min-Seek,该方法在广泛诱导思考范围内显著提高了模型准确性,稳定了顺序缩放的准确性,并消除了推理长度微调的需要。除了在各种推理任务中提高模型准确性,我们的方法还具有内在的高效性,因为在推理过程中仅保留一个额外诱导思考的KV对。通过使用一个自定义的KV缓存,该缓存不存储位置嵌入,而是动态地在每次生成新思考前连续编码它们,我们的方法可以继续推理远超模型的最大上下文长度,并在温和条件下具有线性计算复杂度。
Summary / 总结
This work addresses a limitation of current sequential test-time scaling for large reasoning models: inducing longer reasoning can raise accuracy at first, but extending it further degrades accuracy and destabilizes the model. It introduces Min-Seek, which improves accuracy over a wide range of induced thoughts across a variety of reasoning tasks and removes the need for reasoning length fine-tuning. Min-Seek keeps the KV pairs of only one additional induced thought in the KV cache, which makes it efficient. With a custom KV cache that stores keys without position embeddings and re-encodes them contiguously before each new thought, it can reason well beyond the model's maximum context length and, under mild conditions, has linear computational complexity.
该研究通过引入Min-Seek方法解决了大型推理模型中顺序测试时缩放的局限性,该方法能够在各种推理任务中显著提高模型的准确性,无需对推理长度进行微调。Min-Seek能够在较长的推理过程中保持准确性稳定,并且在KV缓存中只需额外保留一个诱导思考的KV对,因此十分高效。该方法可以在超出模型最大上下文长度的情况下继续推理,并在温和条件下具有线性计算复杂度。
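The idea of caching keys without position embeddings and re-encoding them contiguously can be sketched with a toy example. This is not Min-Seek's implementation: the rule for choosing which thought to keep is not modeled, the 2D rotation stands in for a rotary-style position embedding (an assumption), and the class and method names are hypothetical.

```python
import math


def rotate(vec, position: int, theta: float = 0.1):
    """Toy rotary-style position encoding: rotate a 2D key by position * theta."""
    x, y = vec
    angle = position * theta
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


class PositionFreeKVCache:
    """Stores raw (unrotated) keys so positions can be reassigned later."""

    def __init__(self):
        self.segments = {}  # segment name -> list of (raw_key, value)

    def put(self, name, keys, values):
        self.segments[name] = list(zip(keys, values))

    def drop(self, name):
        self.segments.pop(name, None)

    def materialize(self, order):
        """Encode the kept segments at contiguous positions 0, 1, 2, ..."""
        encoded, position = [], 0
        for name in order:
            for raw_key, value in self.segments[name]:
                encoded.append((rotate(raw_key, position), value))
                position += 1
        return encoded


if __name__ == "__main__":
    cache = PositionFreeKVCache()
    cache.put("prompt", [(1.0, 0.0), (0.0, 1.0)], ["p0", "p1"])
    cache.put("thought_a", [(1.0, 1.0)] * 3, ["a0", "a1", "a2"])
    cache.put("thought_b", [(1.0, 0.0)], ["b0"])
    cache.drop("thought_a")  # discard an earlier thought
    for key, value in cache.materialize(["prompt", "thought_b"]):
        print(value, key)
```

Because positions are assigned only when the cache is materialized, dropping an earlier thought leaves no gap: the remaining entries occupy positions 0 through 2, so the position index is bounded by what is currently kept rather than by everything generated so far.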
Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Authors: Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
First: 2025-12-08T05:15:41+00:00 · Latest: 2026-01-14T20:22:57+00:00
Comments: 9 pages, 3 figures. Preprint under review
Abstract
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
中文标题/摘要
标题:通过无需训练的置信度感知校准提高基于扩散的大语言模型的吞吐量
我们提出了CadLLM,这是一种无需训练的方法,用于加速基于扩散的大语言模型(dLLMs)的推理吞吐量。我们首先研究了令牌去遮蔽置信度在块和步骤之间的动态性质。基于这一观察,我们提出了一种轻量级自适应方法,根据未遮蔽令牌的平均置信度控制生成块大小、步长和阈值。我们进一步通过动态利用词汇表的子集来调节采样范围,从而减少softmax开销。CadLLM 是一种即插即用、模型无关的方法,适用于基于KV缓存的dLLMs。在四个流行任务上的广泛实验表明,与最先进的基线相比,CadLLM 可以获得高达2.28倍的吞吐量提升,同时保持有竞争力的准确性。
Summary / 总结
CadLLM is a training-free method that raises the inference throughput of diffusion-based large language models (dLLMs) by adapting the generation block size, step size, and threshold to the average confidence of unmasked tokens. It also reduces softmax overhead by dynamically restricting sampling to a subset of the vocabulary. The method is plug-and-play, model-agnostic, and compatible with KV-cache-based dLLMs. On four popular tasks, CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
CadLLM 是一种无需训练的方法,通过动态调整生成块大小、步长和阈值,基于未掩码标记的平均置信度来提升扩散型大语言模型(dLLM)的推理吞吐量。它还通过从词汇表的子集中采样来减少 softmax 开销。实验表明,CadLLM 可以实现最高 2.28 倍的吞吐量提升,同时保持竞争力的准确性。
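The two mechanisms in the abstract, confidence-driven control of the block size and a softmax restricted to part of the vocabulary, can be sketched as follows. The abstract gives neither the control rule nor the subset selection, so the linear interpolation, the `min_block` and `max_block` defaults, and the top-k selection by logit are all assumptions of this example, not CadLLM's actual policy.

```python
import math


def adapt_block_size(confidences, min_block: int = 4, max_block: int = 32) -> int:
    """Hypothetical rule: higher mean confidence allows a larger block."""
    if not confidences:
        return min_block
    mean_conf = sum(confidences) / len(confidences)
    mean_conf = max(0.0, min(1.0, mean_conf))  # clamp to [0, 1]
    return min_block + round(mean_conf * (max_block - min_block))


def softmax_over_top_k(logits, k: int):
    """Normalize only over the k highest-logit tokens (assumed subset choice)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


if __name__ == "__main__":
    print(adapt_block_size([0.9, 0.8, 0.95]))
    print(softmax_over_top_k([2.0, 0.5, 1.0, -1.0], k=2))
```

The sketch shows only the shape of the idea. Sorting the full logit vector, as done here for brevity, would not by itself save computation in a real decoder.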
ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning
Authors: Po-han Li, Shenghui Chen, Ufuk Topcu, Sandeep Chinchali
First: 2026-01-14T20:14:47+00:00 · Latest: 2026-01-14T20:14:47+00:00
Abstract
Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
中文标题/摘要
标题:ViSIL:统一评估多模态视频字幕中的信息损失
多模态视频字幕将密集的视频片段浓缩为结构化的关键帧和自然语言格式。通过创建一致的多模态摘要,这种方法将生成式AI锚定在丰富的语义证据上,并作为高效检索的轻量级代理。然而,传统的指标如BLEU或ROUGE无法量化跨不同模态的信息覆盖情况,例如将一段文本与一系列关键帧进行比较。为了解决这个问题,我们提出了视频摘要信息损失(ViSIL)得分,这是一种信息论框架,通过视觉-语言模型(VLM)推理量化未被摘要捕捉的视频信息。通过测量信息损失,ViSIL成为一种统一的度量标准,即使在摘要格式的结构差异下也能直接进行比较。我们的结果显示,ViSIL得分与视频问答(VQA)任务中的人类和VLM性能之间存在统计学上的显著相关性。ViSIL还使摘要选择能够优化信息损失与处理速度之间的权衡,建立了帕累托最优前沿,在不增加处理负载的情况下,VQA准确率提高了7%。
Summary / 总结
The paper proposes the Video Summary Information Loss (ViSIL) score, an information-theoretic framework for evaluating information loss in multimodal video captioning. It uses vision-language model (VLM) inference to quantify the video information not captured by a summary, which enables direct comparison across multimodal summary formats despite their structural differences. ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL-based summary selection also optimizes the trade-off between information loss and processing speed, yielding a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
论文提出了ViSIL,这是一种信息论框架,用于评估多模态视频摘要中的信息损失。它通过视觉-语言模型推理量化视频摘要未能捕捉到的视频信息,解决了传统指标如BLEU或ROUGE的局限性。结果表明,ViSIL分数与人类和视觉-语言模型在视频问答任务上的表现相关,并且能够优化信息损失与处理速度之间的权衡,使VQA准确率提高了7%,而不增加处理负载。
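Selecting summaries along a Pareto-optimal frontier of information loss versus processing cost can be sketched as below. The candidate names and numbers are invented purely for illustration and are not results from the paper; the `Candidate` type and `pareto_front` function are hypothetical, and how ViSIL itself is computed is not modeled.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical summary format with an information-loss score and a cost."""
    name: str
    info_loss: float  # lower is better
    cost: float       # processing load, lower is better


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both axes."""
    front = []
    for c in candidates:
        dominated = any(
            o.info_loss <= c.info_loss
            and o.cost <= c.cost
            and (o.info_loss < c.info_loss or o.cost < c.cost)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost)


if __name__ == "__main__":
    # Illustrative values only.
    candidates = [
        Candidate("text", 0.60, 1.0),
        Candidate("keyframes", 0.50, 3.0),
        Candidate("text+2kf", 0.40, 2.0),
        Candidate("text+8kf", 0.45, 5.0),
    ]
    for c in pareto_front(candidates):
        print(c)
```

In this made-up example the frontier contains the text-only summary and the text-plus-two-keyframes summary; the other two formats cost more without losing less information.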