ShapeR: Robust Conditional 3D Shape Generation from Casual Captures
Authors: Yawar Siddiqui, Duncan Frost, Samir Aroudj, Armen Avetisyan, Henry Howard-Jenkins, Daniel DeTone, Pierre Moulon, Qirui Wu, Zhengqin Li, Julian Straub, Richard Newcombe, Jakob Engel
Venue: www
First: 2026-01-16T18:51:24+00:00 · Latest: 2026-01-16T18:51:24+00:00
Comments: Project Page: http://facebookresearch.github.io/ShapeR · Video: https://www.youtube.com/watch?v=EbY30KAA55I
Abstract
Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.
中文标题/摘要
标题:ShapeR:基于随意捕捉的稳健条件3D形状生成
近期在3D形状生成方面的进展取得了令人印象深刻的成果,但大多数现有方法依赖于干净、未遮挡和良好分割的输入。在现实世界场景中,这些条件很少被满足。我们提出了ShapeR,一种新颖的方法,用于从随意捕捉的序列中生成条件3D对象形状。给定一个图像序列,我们利用现成的视觉-惯性SLAM、3D检测算法和视觉-语言模型,为每个对象提取一组稀疏的SLAM点、多视角图像和机器生成的描述。一种经过训练以有效利用这些模态的矫正流变换器随后生成高保真度的度量3D形状。为了确保对随意捕捉数据挑战的鲁棒性,我们采用了包括实时组合增强、跨越对象和场景数据集的课程训练方案以及处理背景杂乱的策略。此外,我们引入了一个新的评估基准,包括7个真实世界场景中的178个野外对象,带有几何注释。实验表明,在这种具有挑战性的设置中,ShapeR 显著优于现有方法,与最先进的方法相比,Chamfer距离改善了2.7倍。
Summary / 总结
ShapeR is a method for generating metric 3D object shapes from casually captured image sequences. It uses off-the-shelf visual-inertial SLAM, 3D detection, and vision-language models to extract, for each object, sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer conditioned on these modalities then generates high-fidelity 3D shapes. Robustness to casual captures comes from on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies for handling background clutter. On a new benchmark of 178 in-the-wild objects across 7 real-world scenes, ShapeR outperforms existing approaches, with a 2.7x improvement in Chamfer distance over the state of the art.
ShapeR旨在从随意拍摄的序列中生成3D形状,解决现有方法需要干净且良好分割输入的局限性。它利用视觉惯性SLAM、3D检测和视觉语言模型来提取稀疏的SLAM点、多视角图像和机器生成的描述,然后通过矫正流变换器生成高保真3D形状。ShapeR在现实世界挑战性场景中显著优于现有方法,Chamfer距离较最先进方法改善了2.7倍。
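The Chamfer distance used as the headline metric in this entry can be illustrated with a minimal sketch. The abstract does not say which variant the authors use (squared or unsquared distances, sum or mean), so the symmetric mean-of-nearest-neighbour form below is an assumption, and the point sets are made-up toy data rather than anything from the benchmark.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (lists of xyz tuples).

    Assumed convention: unsquared Euclidean distances, averaged per direction,
    then summed over both directions.
    """
    def one_way(src, dst):
        # Mean distance from each source point to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy example: two points each, differing only in the second point's z value.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, gt))
```

This brute-force version is quadratic in the number of points; evaluation code for dense meshes would normally sample points and use a spatial index instead.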
MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
Authors: Xiaoran Fan, Zhichao Sun, Tao Ji, Lixing Shen, Tao Gui
First: 2026-01-16T17:45:34+00:00 · Latest: 2026-01-16T17:45:34+00:00
Abstract
As vision-language models (VLMs) tackle increasingly complex and multimodal tasks, the rapid growth of Key-Value (KV) cache imposes significant memory and computational bottlenecks during inference. While Multi-Head Latent Attention (MLA) offers an effective means to compress the KV cache and accelerate inference, adapting existing VLMs to the MLA architecture without costly pretraining remains largely unexplored. In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. Our approach features two core techniques: (1) a modality-adaptive partial-RoPE strategy that supports both traditional and multimodal settings by selectively masking nonessential dimensions, and (2) a modality-decoupled low-rank approximation method that independently compresses the visual and textual KV spaces. Furthermore, we introduce parameter-efficient fine-tuning to minimize adaptation cost and demonstrate that minimizing output activation error, rather than parameter distance, substantially reduces performance loss. Extensive experiments on three representative VLMs show that MHA2MLA-VLM restores original model performance with minimal supervised data, significantly reduces KV cache footprint, and integrates seamlessly with KV quantization.
中文标题/摘要
标题:MHA2MLA-VLM:使DeepSeek的经济型多头潜在注意力适用于视觉-语言模型
随着视觉-语言模型(VLMs)处理越来越复杂和多模态的任务,关键-值(KV)缓存的快速增长在推理过程中产生了显著的内存和计算瓶颈。虽然多头潜在注意力(MLA)提供了一种有效的压缩KV缓存和加速推理的方法,但如何在不进行昂贵的预训练的情况下将现有的VLMs适应到MLA架构中仍鲜有探索。在本文中,我们提出了MHA2MLA-VLM,这是一种参数高效且多模态感知的框架,用于将现成的VLMs转换为MLA。我们的方法包含两个核心技术:(1)一种适应模态的部分-RoPE策略,该策略通过选择性地屏蔽非必要维度支持传统的和多模态设置,(2)一种模态解耦的低秩近似方法,该方法独立地压缩了视觉和文本的KV空间。此外,我们引入了参数高效的微调以最小化适应成本,并证明了最小化输出激活误差而非参数距离可以显著减少性能损失。在三个代表性VLMs上的广泛实验表明,MHA2MLA-VLM在最少的监督数据下恢复了原始模型性能,显著减少了KV缓存的占用空间,并与KV量化无缝集成。
Summary / 总结
The research aims to address the memory and computational challenges posed by the Key-Value cache in vision-language models (VLMs) by introducing MHA2MLA-VLM, a parameter-efficient framework for converting existing VLMs to Multi-Head Latent Attention (MLA). The method employs a modality-adaptive partial-RoPE strategy and a modality-decoupled low-rank approximation to compress the visual and textual Key-Value spaces, and includes parameter-efficient fine-tuning to minimize adaptation cost. Experiments on three VLMs show that MHA2MLA-VLM can restore original performance with minimal supervised data and significantly reduce the KV cache footprint.
研究针对视觉语言模型(VLMs)中关键值(KV)缓存带来的内存和计算瓶颈,提出了一种MHA2MLA-VLM框架,将现有VLMs转换为多头潜在注意力(MLA)架构。该框架引入了模态自适应部分-RoPE策略和模态解耦低秩近似方法,以支持传统和多模态设置,并使用参数高效微调来最小化适应成本。实验表明,MHA2MLA-VLM可以在少量监督数据下恢复原始模型性能,减少KV缓存占用,并与KV量化无缝集成。
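To make the KV cache saving concrete, here is a back-of-the-envelope sketch comparing the cache size of standard multi-head attention with an MLA-style cache that stores one shared latent vector plus a small RoPE-carrying key per token. The layer count, head sizes, latent width, and sequence length are hypothetical values chosen for illustration and are not taken from the paper.

```python
def kv_cache_bytes_mha(n_layers, n_heads, head_dim, seq_len, bytes_per_value=2):
    # Standard multi-head attention caches one key and one value vector per head.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value


def kv_cache_bytes_mla(n_layers, latent_dim, rope_dim, seq_len, bytes_per_value=2):
    # MLA-style cache: one compressed latent plus a decoupled RoPE key per token.
    return n_layers * (latent_dim + rope_dim) * seq_len * bytes_per_value


# Hypothetical configuration (not from the paper), 16-bit cache entries.
mha = kv_cache_bytes_mha(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)
mla = kv_cache_bytes_mla(n_layers=32, latent_dim=512, rope_dim=64, seq_len=4096)
print(mha, mla, round(mha / mla, 2))
```

Under these assumed sizes the per-token footprint drops from 8192 cached values per layer to 576, which is where the ratio comes from; real savings depend on the chosen latent width and on how many dimensions keep RoPE.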
Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation
Authors: Tao Tang, Shijie Xu, Jionglong Su, Zhixiang Lu
Venue: ICASSP 2026
First: 2025-07-04T13:52:16+00:00 · Latest: 2026-01-16T16:16:45+00:00
Comments: Accepted by IEEE ICASSP 2026
Abstract
The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model's features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician's natural language command to modulate the segmentation decoder's features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model's trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.
中文标题/摘要
标题:Causal-SAM-LLM:大型语言模型作为因果推理器以实现稳健的医学分割
深度学习模型在医学图像分割中的临床应用受到其无法泛化到未见领域的限制。这一问题通常源于模型学习了解剖内容与领域特定成像风格之间的虚假相关性。为克服这一根本性挑战,我们提出了Causal-SAM-LLM,一种新颖的框架,将大型语言模型(LLM)提升为因果推理器的角色。该框架基于冻结的分割一切模型(SAM)编码器,结合了两种协同创新。首先,语言对抗解耦(LAD)利用视觉-语言模型生成丰富的文本描述,以混淆图像风格。通过训练分割模型的特征与这些风格描述对比性地不同,它学习到一个不含非因果信息的表示。其次,测试时因果干预(TCI)提供了一种交互机制,其中LLM解释临床医生的自然语言命令,实时调节分割解码器的特征,实现有针对性的错误修正。我们在四个公开数据集(BTCV、CHAOS、AMOS、BraTS)组成的综合基准上进行了广泛的实证评估,评估了跨扫描仪、跨模态和跨解剖结构设置下的泛化能力。Causal-SAM-LLM 在离群值稳健性方面建立了新的基准,平均Dice分数提高了6.2个点,Hausdorff距离减少了15.8毫米,同时使用了不到9%的完整模型的可训练参数。我们的工作为构建稳健、高效且可交互控制的医学AI系统开辟了新途径。
Summary / 总结
Causal-SAM-LLM is a framework that improves the out-of-distribution robustness of medical image segmentation by using Large Language Models (LLMs) as causal reasoners on top of a frozen Segment Anything Model (SAM) encoder. Linguistic Adversarial Disentanglement (LAD) uses a Vision-Language Model to generate textual descriptions of confounding image styles and trains the segmentation features to be contrastively dissimilar to them, removing non-causal style information. Test-Time Causal Intervention (TCI) lets an LLM interpret a clinician's natural language command and modulate the segmentation decoder's features in real time. On a composite benchmark built from BTCV, CHAOS, AMOS, and BraTS, the framework improves the average Dice score by up to 6.2 points and reduces the Hausdorff Distance by 15.8 mm over the strongest baseline, while using less than 9% of the full model's trainable parameters.
Causal-SAM-LLM 是一种框架,通过将大型语言模型(LLMs)作为因果推理器来增强医学图像分割模型的鲁棒性。它使用 Linguistic Adversarial Disentanglement (LAD) 生成混淆图像风格的文本描述,并训练分割模型以抵御这些风格的影响。此外,Test-Time Causal Intervention (TCI) 允许根据临床医生的命令实时调整分割特征。该框架实现了最先进的性能,与最强基线相比,Dice分数提高了最多 6.2 个点,Hausdorff 距离减少了 15.8 毫米,同时仅使用了不到完整模型 9% 的可训练参数。
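For readers unfamiliar with the Dice score reported in this entry, the sketch below computes it for two binary masks. The flat-list mask representation and the choice to treat two empty masks as a perfect match are assumptions made for illustration; the paper's evaluation code may handle these cases differently.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: the prediction covers two pixels, the target covers one of them.
pred_mask = [1, 1, 0, 0]
target_mask = [1, 0, 0, 0]
print(dice_score(pred_mask, target_mask))
```

A gain of "6.2 points" corresponds to 0.062 on this 0-to-1 scale when Dice is reported as a percentage.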
Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding
Authors: Wenhui Tan, Ruihua Song, Jiaze Li, Jianzhong Ju, Zhenbo Luo
First: 2026-01-16T15:14:04+00:00 · Latest: 2026-01-16T15:14:04+00:00
Comments: Accepted by ICASSP 2026
Abstract
Recent progress in multi-modal large language models (MLLMs) has significantly advanced video understanding. However, their performance on long-form videos remains limited by computational constraints and suboptimal frame selection. We present Think-Clip-Sample (TCS), a training-free framework that enhances long video understanding through two key components: (i) Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video; and (ii) Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Extensive experiments on MLVU, LongVideoBench, and VideoMME demonstrate that TCS consistently improves performance across different MLLMs, boosting up to 6.9% accuracy, and is capable of achieving comparable accuracy with 50% fewer inference time cost, highlighting both efficiency and efficacy of TCS on long video understanding.
中文标题/摘要
标题:Think-Clip-Sample: 慢速-快速帧选择方法以提升视频理解
多模态大型语言模型(MLLMs)的近期进展显著提升了视频理解能力。然而,它们在长视频上的表现仍受限于计算约束和不理想的帧选择。我们提出了一种无需训练的框架Think-Clip-Sample (TCS),通过两个关键组件增强长视频理解:(i) 多查询推理,生成多个查询以捕捉问题和视频的互补方面;(ii) 剪辑级慢速-快速采样,适应性地平衡密集的局部细节和稀疏的全局上下文。在MLVU、LongVideoBench和VideoMME上的广泛实验表明,TCS在不同MLLMs上均能提升性能,最高可提升6.9%的准确率,并且能够在减少50%推理时间成本的情况下达到相近的准确率,突显了TCS在长视频理解上的高效性和有效性。
Summary / 总结
The research aims to improve how multi-modal large language models (MLLMs) understand long-form videos, where performance is limited by computational constraints and suboptimal frame selection. It proposes Think-Clip-Sample (TCS), a training-free framework with two components: Multi-Query Reasoning, which generates multiple queries to capture complementary aspects of the question and video, and Clip-level Slow-Fast Sampling, which adaptively balances dense local details and sparse global context. Experiments on MLVU, LongVideoBench, and VideoMME show that TCS improves accuracy by up to 6.9% across different MLLMs and can reach comparable accuracy with 50% less inference time.
研究旨在通过改进多模态大型语言模型(MLLMs)在理解长视频方面的性能,解决计算约束和帧选择不足的问题。Think-Clip-Sample (TCS) 框架通过使用多查询推理生成多个查询来捕捉问题和视频的互补方面,并通过剪辑级别的慢速-快速采样来适当地平衡密集的局部细节和稀疏的全局上下文。实验表明,TCS 可以将准确率提高高达 6.9%,并能在推理时间成本减少 50% 的情况下达到相近的准确率。
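The idea behind clip-level slow-fast sampling can be sketched as follows: sample densely inside the clips judged most relevant (the slow path) and sparsely across the whole video (the fast path). The abstract does not describe the actual selection rule, so the relevance scores, the top-k choice, and every parameter name below are hypothetical.

```python
def slow_fast_sample(clip_scores, frames_per_clip, top_k, dense_per_clip, sparse_total):
    """Return sorted frame indices: dense frames from top clips, sparse frames overall."""
    n_frames = len(clip_scores) * frames_per_clip
    # Slow path: pick the top_k highest-scoring clips and sample them densely.
    ranked = sorted(range(len(clip_scores)), key=lambda c: clip_scores[c], reverse=True)
    selected = set()
    for clip in ranked[:top_k]:
        start = clip * frames_per_clip
        step = max(1, frames_per_clip // dense_per_clip)
        selected.update(range(start, start + frames_per_clip, step))
    # Fast path: a uniform stride over the whole video for global context.
    stride = max(1, n_frames // sparse_total)
    selected.update(range(0, n_frames, stride))
    return sorted(selected)


# Hypothetical relevance scores for four 8-frame clips (e.g. query-clip similarity).
frames = slow_fast_sample([0.1, 0.9, 0.2, 0.3], frames_per_clip=8, top_k=1,
                          dense_per_clip=4, sparse_total=4)
print(frames)
```

In this toy run the second clip is sampled every other frame while the rest of the video contributes one frame per clip, so the frame budget is concentrated where the query is most likely answered.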
Enhancing Vision Language Models with Logic Reasoning for Situational Awareness
Authors: Pavana Pradeep, Krishna Kant, Suya Yu
First: 2026-01-16T14:16:38+00:00 · Latest: 2026-01-16T14:16:38+00:00
Comments: Accepted for publication in IEEE Transactions on AI
Abstract
Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a valuable means, during inferencing, to either confirm the validity of the VLM output or indicate why it may be questionable.
中文标题/摘要
标题:利用逻辑推理增强视觉语言模型以提高情境意识
视觉-语言模型(VLMs)能够从图像和视频中生成高层次、可解释的复杂活动描述,使其在情境意识(SA)应用中具有重要价值。在这种环境中,重点在于识别罕见但重要的事件,同时提取细粒度的细节并评估识别质量。在本文中,我们提出了一种通过显式逻辑推理将VLMs与传统计算机视觉方法相结合的方法,以从三个方面增强SA:(a) 提取细粒度事件细节,(b) 使用一种智能微调(FT)策略,其准确度远高于无信息选择,(c) 在推理过程中为VLM输出生成解释。我们证明,我们的智能FT机制提高了准确度,并在推理过程中提供了一种有价值的手段,既可以确认VLM输出的有效性,也可以指出其可能存在的问题。
Summary / 总结
This paper aims to enhance Vision-Language Models (VLMs) for situational awareness by integrating them with logic reasoning. The method involves using explicit logic reasoning to extract fine-grained event details, employing an intelligent fine-tuning strategy that outperforms uninformed selection, and generating justifications for VLM outputs during inference. Key experimental findings show that the intelligent fine-tuning mechanism improves accuracy and provides valuable confirmations or explanations for VLM outputs during inferencing.
本文旨在通过将逻辑推理与视觉语言模型(VLMs)结合,增强其在态势感知中的应用。方法包括使用显式的逻辑推理提取细粒度事件细节、采用一种智能微调策略显著提高准确性,并在推理过程中生成VLM输出的解释。主要发现表明,智能微调机制提升了准确性,并提供了有价值的解释,既可以确认VLM输出的正确性,也可以指出其可能存在的问题。
X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning
Authors: Maanping Shao, Feihong Zhang, Gu Zhang, Baiye Cheng, Zhengrong Xue, Huazhe Xu
First: 2026-01-16T13:15:55+00:00 · Latest: 2026-01-16T13:15:55+00:00
Abstract
Visuomotor policies often leverage large pre-trained Vision Transformers (ViTs) for their powerful generalization capabilities. However, their significant data requirements present a major challenge in the data-scarce context of most robotic learning settings, where compact CNNs with strong inductive biases can be more easily optimized. To address this trade-off, we introduce X-Distill, a simple yet highly effective method that synergizes the strengths of both architectures. Our approach involves an offline, cross-architecture knowledge distillation, transferring the rich visual representations of a large, frozen DINOv2 teacher to a compact ResNet-18 student on the general-purpose ImageNet dataset. This distilled encoder, now endowed with powerful visual priors, is then jointly fine-tuned with a diffusion policy head on the target manipulation tasks. Extensive experiments on 34 simulated benchmarks and 5 challenging real-world tasks demonstrate that our method consistently outperforms policies equipped with from-scratch ResNet or fine-tuned DINOv2 encoders. Notably, X-Distill also surpasses 3D encoders that utilize privileged point cloud observations or much larger Vision-Language Models. Our work highlights the efficacy of a simple, well-founded distillation strategy for achieving state-of-the-art performance in data-efficient robotic manipulation.
中文标题/摘要
标题:X-Distill:跨架构视觉蒸馏在知觉运动学习中的应用
知觉运动策略通常利用大型预训练的视觉变换器(ViTs)以利用其强大的泛化能力。然而,在大多数机器人学习环境中,由于数据稀缺,它们显著的数据需求构成了重大挑战,而紧凑的CNNs由于其强大的归纳偏置更容易优化。为了解决这种权衡,我们引入了X-Distill,这是一种简单而有效的结合了两种架构优势的方法。我们的方法包括一种离线的跨架构知识蒸馏,将大型冻结的DINOv2教师的丰富视觉表示转移到通用的ImageNet数据集上的紧凑的ResNet-18学生上。经过蒸馏的编码器现在具备了强大的视觉先验,然后与扩散策略头一起在目标操作任务上进行联合微调。在34个模拟基准和5个具有挑战性的真实世界任务上的广泛实验表明,我们的方法在性能上始终优于使用从头开始训练的ResNet或微调DINOv2编码器的策略。值得注意的是,X-Distill也超过了利用特权点云观察或更大视觉语言模型的3D编码器。我们的工作突显了简单而坚实的蒸馏策略在数据高效机器人操作中的有效性。
Summary / 总结
The paper introduces X-Distill, a method that leverages knowledge distillation to transfer visual representations from a large Vision Transformer (ViT) to a compact CNN, addressing the data inefficiency issue in robotic learning. The approach involves distilling the knowledge from a pre-trained DINOv2 ViT to a ResNet-18 CNN on ImageNet, and then fine-tuning this distilled encoder with a diffusion policy head for specific manipulation tasks. Experiments show that X-Distill outperforms policies using from-scratch ResNets or fine-tuned DINOv2 encoders, and even surpasses 3D encoders and larger Vision-Language Models in both simulated and real-world tasks.
论文提出了一种名为X-Distill的方法,通过知识蒸馏将大型Vision Transformer (ViT)的视觉表示转移到紧凑型CNN中,以解决机器人学习中的数据效率问题。该方法包括从预训练的DINOv2 ViT向ImageNet上的ResNet-18 CNN转移知识,然后与扩散策略头一起对特定操作任务进行微调。实验表明,X-Distill在模拟和真实世界任务中均优于从零开始训练的ResNet或微调后的DINOv2编码器,甚至超过了3D编码器和更大的视觉-语言模型。
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Authors: Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye
First: 2026-01-05T18:56:34+00:00 · Latest: 2026-01-16T13:04:59+00:00
Comments: Project page: https://sotamak1r.github.io/VINO-web/
Abstract
We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
中文标题/摘要
标题:VINO:统一视觉生成器,具有交错的全模态上下文
我们提出了VINO,一种统一的视觉生成器,可以在单一框架内执行图像和视频生成与编辑。VINO 不依赖于特定任务的模型或独立的模块,而是使用一个共享的扩散骨干网络,该网络根据文本、图像和视频进行条件化,从而在一个模型中实现广泛的视觉创作和编辑任务。具体而言,VINO 将一个视觉语言模型(VLM)与一个多模态扩散变换器(MMDiT)耦合,其中多模态输入被编码为交错的条件化标记,然后用于引导扩散过程。这种设计支持多参考定位、长形式指令跟随以及在静态和动态内容中保持一致的身份,同时避免了特定模态的架构组件。为了训练这样一个统一系统,我们引入了一个多阶段训练管道,逐步扩展一个视频生成基础模型,使其成为一个能够处理图像和视频输入输出的统一、多任务生成器。在各种生成和编辑基准测试中,VINO 展现了强大的视觉质量、忠实的指令跟随、改进的参考和属性保留以及更可控的多身份编辑。我们的结果突显了可扩展统一视觉生成的实用路径,并展示了交错的上下文计算作为通用视觉创作基础的潜力。
Summary / 总结
VINO is a unified visual generator that integrates image and video generation and editing within a single framework. It uses a shared diffusion backbone conditioned on text, images, and videos, coupled with a multimodal diffusion transformer to support various visual tasks. VINO shows strong visual quality, faithful instruction following, and improved reference and attribute preservation across different benchmarks, demonstrating the potential of interleaved, in-context computation for general-purpose visual creation.
VINO 是一个统一的视觉生成器,能够在单一框架中实现图像和视频的生成与编辑。它使用一个共享的扩散骨干网络,并结合多模态扩散变换器(MMDiT),根据文本、图像和视频进行条件化,以支持各种视觉任务。VINO 在不同基准测试中展示了强大的视觉质量、忠实的指令跟随以及改进的参考和属性保留,展示了统一视觉生成的实用途径。
Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval
Authors: Fangke Chen, Tianhao Dong, Sirry Chen, Guobin Zhang, Yishu Zhang, Yining Chen
First: 2026-01-16T12:55:41+00:00 · Latest: 2026-01-16T12:55:41+00:00
Comments: 9 pages, 5 figures
Abstract
Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive computational costs hinder practical edge deployment. To address this, we propose a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings. By jointly optimizing instance-level alignment and class-level semantic consistency, our approach anchors visual embeddings to language-agnostic semantic prototypes, enforcing invariance across scripts and writing styles. Experiments show that our method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks. We further conduct explicit cross-lingual retrieval, where the query language differs from the target language, to validate the effectiveness of the learned cross-lingual representations. Achieving strong performance with only a fraction of the parameters required by existing models, our framework enables accurate and resource-efficient cross-script handwriting retrieval.
中文标题/摘要
标题:跨书写字体识别的语言无关视觉嵌入
手写词检索对于数字档案至关重要,但由于手写体的巨大变异性及跨语言语义差距,这一任务仍然具有挑战性。尽管大型视觉-语言模型提供了潜在的解决方案,但其高昂的计算成本阻碍了其实用边缘部署。为解决这一问题,我们提出了一种轻量级不对称双编码器框架,用于学习统一且风格不变的视觉嵌入。通过联合优化实例级对齐和类别级语义一致性,我们的方法将视觉嵌入锚定到语言无关的语义原型上,确保跨书写字体和书写风格的不变性。实验表明,我们的方法优于28个基线,并在同语言检索基准上达到了最先进的准确率。我们进一步进行了显式的跨语言检索,其中查询语言与目标语言不同,以验证所学跨语言表示的有效性。仅使用现有模型所需参数的一小部分,我们的框架实现了准确且资源高效的跨书写字体检索。
Summary / 总结
The research aims to address the challenges of handwritten word retrieval due to handwriting variability and cross-lingual semantic gaps. It proposes a lightweight asymmetric dual-encoder framework that learns unified, style-invariant visual embeddings by jointly optimizing instance-level alignment and class-level semantic consistency. The method outperforms 28 baselines and achieves state-of-the-art accuracy on within-language retrieval benchmarks, demonstrating strong performance in cross-lingual retrieval as well.
研究旨在解决由于手写变异性和跨语言语义差距导致的手写词检索难题。提出了一种轻量级的不对称双编码器框架,通过联合优化实例级对齐和类级语义一致性来学习统一且风格不变的视觉嵌入。该方法在同语言检索基准上超越了28个基线,并达到了最先进的准确性,同时在跨语言检索中也表现出色。
Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics
Authors: Dongyoung Kim, Sumin Park, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo
Venue: NeurIPS 2025
First: 2025-05-29T16:41:12+00:00 · Latest: 2026-01-16T12:54:08+00:00
Comments: NeurIPS 2025
Abstract
Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach involves training on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands the diverse embodied reasoning capabilities for the task. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
中文标题/摘要
标题:Robot-R1:强化学习在机器人本体推理中的增强
大型视觉-语言模型(LVLM)最近在通过结合本体推理和机器人控制来推动机器人技术方面展现了巨大的潜力。一种常见的方法是使用监督微调(SFT)对与机器人控制相关的本体推理任务进行训练。然而,SFT数据集通常是通过启发式方法构建的,并未明确优化以提高机器人控制性能。此外,SFT往往会导致灾难性遗忘和泛化性能降低等问题。为解决这些局限性,我们提出了Robot-R1,这是一种新颖的框架,利用强化学习来增强专门针对机器人控制的本体推理。Robot-R1 学习预测完成任务所需的下一个关键点状态,基于当前场景图像和从专家演示中提取的环境元数据。受DeepSeek-R1学习方法的启发,Robot-R1 采样基于推理的响应,并强化那些导致更准确预测的响应。为了严格评估Robot-R1,我们还引入了一个新的基准,要求具备多样化的本体推理能力。我们的实验表明,使用Robot-R1训练的模型在本体推理任务上优于SFT方法。尽管只有70亿参数,Robot-R1甚至在与低级动作控制相关的推理任务,如空间和运动推理方面,超过了GPT-4o。
Summary / 总结
The research aims to improve embodied reasoning in robotics by addressing the limitations of Supervised Fine-Tuning (SFT), such as heuristically constructed datasets, catastrophic forgetting, and reduced generalization. Robot-R1 uses reinforcement learning to enhance embodied reasoning specifically for robot control. It learns to predict the next keypoint state needed for task completion from the current scene image and environment metadata derived from expert demonstrations, sampling reasoning-based responses and reinforcing those that lead to more accurate predictions. Experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks, and that with only 7B parameters Robot-R1 surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
研究旨在通过强化学习提升机器人的肢体推理能力,以改善其控制性能。Robot-R1 是一个新颖的框架,基于当前场景图像和环境元数据预测任务完成所需的下一个关键点状态。它使用强化学习来优化预测,实验结果显示,该模型在肢体推理任务上的表现优于监督微调方法。尽管参数量仅有7B,但Robot-R1在空间和动作推理等低级动作控制任务上甚至超越了GPT-4o。
Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification
Authors: Zhiqi Pang, Lingling Zhao, Yang Liu, Chunyu Wang, Gaurav Sharma
First: 2026-01-16T12:45:01+00:00 · Latest: 2026-01-16T12:45:01+00:00
Comments: 12 pages, 10 figures
Abstract
We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. To tackle UMS-ReID, we introduce image-text knowledge modeling (ITKM) -- a three-stage framework that effectively exploits the representational power of vision-language models. We start with a pre-trained CLIP model with an image encoder and a text encoder. In Stage I, we introduce a scenario embedding in the image encoder and fine-tune the encoder to adaptively leverage knowledge from multiple scenarios. In Stage II, we optimize a set of learned text embeddings to associate with pseudo-labels from Stage I and introduce a multi-scenario separation loss to increase the divergence between inter-scenario text representations. In Stage III, we first introduce cluster-level and instance-level heterogeneous matching modules to obtain reliable heterogeneous positive pairs (e.g., a visible image and an infrared image of the same person) within each scenario. Next, we propose a dynamic text representation update strategy to maintain consistency between text and image supervision signals. Experimental results across multiple scenarios demonstrate the superiority and generalizability of ITKM; it not only outperforms existing scenario-specific methods but also enhances overall performance by integrating knowledge from multiple scenarios.
中文标题/摘要
标题:基于图像-文本知识建模的无监督多场景行人重识别
我们提出了无监督多场景(UMS)行人重识别(ReID)任务,该任务在单一连贯框架内扩展了跨多种场景(跨分辨率、着装变化等)的ReID。为解决UMS-ReID,我们引入了图像-文本知识建模(ITKM)——一个三阶段框架,有效利用了视觉-语言模型的表示能力。我们以预训练的CLIP模型作为起点,包含图像编码器和文本编码器。在第一阶段,我们在图像编码器中引入场景嵌入,并微调编码器以适应性地利用来自多个场景的知识。在第二阶段,我们优化一组学习到的文本嵌入,使其与第一阶段的伪标签相关联,并引入多场景分离损失以增加跨场景文本表示之间的差异。在第三阶段,我们首先引入集群级和实例级异质匹配模块,在每个场景内获得可靠的异质正样本对(例如,同一人的可见图像和红外图像)。接下来,我们提出了一种动态文本表示更新策略,以保持文本和图像监督信号之间的一致性。在多个场景的实验结果表明,ITKM不仅优于现有特定场景方法,还能通过整合多个场景的知识提高整体性能。
Summary / 总结
The paper introduces unsupervised multi-scenario person re-identification (UMS-ReID) as a new task to handle diverse scenarios in ReID. It proposes a three-stage framework, image-text knowledge modeling (ITKM), which uses a pre-trained CLIP model. The framework includes fine-tuning the image encoder with scenario embeddings, optimizing text embeddings with multi-scenario separation loss, and introducing heterogeneous matching and dynamic text representation update strategies. Experiments show ITKM outperforms scenario-specific methods and enhances overall performance by integrating multi-scenario knowledge.
研究旨在通过引入名为图像-文本知识建模(ITKM)的三阶段框架来解决多场景无监督行人重识别(UMS-ReID)问题。ITKM 利用预训练的 CLIP 模型并逐步微调图像和文本编码器以适应各种场景。关键发现表明,ITKM 不仅超越了现有的特定场景方法,还能通过整合多个场景的知识来提高整体性能,展示了其在跨分辨率和着装变化等不同场景中的优越性和普适性。
Attention Debiasing for Token Pruning in Vision Language Models
Authors: Kai Zhao, Wubang Yuan, Yuchen Lin, Liting Ruan, Xiaofeng Lu, Deng-Ping Fan, Ming-Ming Cheng, Dan Zeng
First: 2025-08-25T08:56:32+00:00 · Latest: 2026-01-16T09:19:57+00:00
Comments: https://github.com/intcomp/attention-bias
Abstract
Vision-language models (VLMs) typically encode substantially more visual tokens than text tokens, resulting in significant token redundancy. Pruning uninformative visual tokens is therefore crucial for improving computational efficiency, and language-to-vision attention has become a widely used importance criterion for this purpose. However, we find that attention in VLMs is systematically biased. It disproportionately favors tokens appearing later in the sequence, manifesting as over-attention to lower image regions, and assigns inflated scores to semantically empty padding tokens. These behaviors stem from intrinsic recency bias and attention sink effects inherited from large language models (LLMs), and they distort attention-based pruning by preserving irrelevant visual content. To derive a pruning criterion better aligned with semantic relevance, we introduce two lightweight yet effective debiasing techniques that restore the reliability of attention. The first compensates for positional distortions by removing recency-induced attention trends, producing a content-aware and position-agnostic importance measure. The second suppresses attention sink effects by eliminating spurious attention on padding tokens. Our method is model-agnostic, pruning-method-agnostic, and task-agnostic, enabling plug-and-play integration with existing VLM pruning models. Despite its simplicity, our approach consistently delivers strong performance gains. We evaluate our method on ten vision-language benchmarks spanning both image-based and video-based tasks, in comparison with seven state-of-the-art visual token pruning methods and across two representative VLM architectures. Our method achieves substantial performance gains, demonstrating strong effectiveness and generalizability. Our code is available at https://github.com/intcomp/attention-bias.
中文标题/摘要
标题:注意力去偏见化在视觉语言模型中对标记剪枝的应用
视觉语言模型(VLMs)通常编码的视觉标记比文本标记多得多,导致标记冗余性显著。因此,剪枝无信息的视觉标记对于提高计算效率至关重要,而语言到视觉的注意力已成为这一目的中广泛使用的重要性标准。然而,我们发现VLMs中的注意力存在系统性偏差。它不成比例地偏向于序列中较晚出现的标记,表现为对较低图像区域的过度关注,并对语义空的填充标记赋予了过高的分数。这些行为源自大型语言模型(LLMs)固有的近期偏差和注意力陷阱效应,它们扭曲了基于注意力的剪枝,保留了无关的视觉内容。为了获得与语义相关性更好的剪枝标准,我们引入了两种轻量级但有效的去偏见技术,以恢复注意力的可靠性。第一种通过移除近期效应引起的关注趋势,补偿位置偏差,产生内容感知但位置无关的重要性度量。第二种通过消除对填充标记的虚假关注,抑制注意力陷阱效应。我们的方法是模型无关的、剪枝方法无关的和任务无关的,能够与现有的VLM剪枝模型无缝集成。尽管简单,我们的方法在性能上始终表现出色。我们在涵盖图像和视频任务的十个视觉语言基准上评估了我们的方法,与七种最先进的视觉标记剪枝方法和两种代表性VLM架构进行了比较。我们的方法实现了显著的性能提升,证明了其强大的有效性和泛化能力。我们的代码可在https://github.com/intcomp/attention-bias获取。
Summary / 总结
This paper addresses the issue of positional bias in language-to-vision attention within vision-language models (VLMs), which can distort token pruning and preserve irrelevant visual content. To mitigate this, the authors propose two debiasing techniques: one to remove recency-induced attention trends and another to suppress attention sink effects on padding tokens. These techniques are model-agnostic and enable consistent performance gains across various VLM architectures and tasks. Experiments on ten vision-language benchmarks show significant improvements over seven state-of-the-art pruning methods, highlighting the effectiveness and generalizability of the proposed approach.
本文探讨了视觉语言模型(VLM)中语言到视觉注意力的位置偏差问题,这可能会扭曲token修剪并保留无关的视觉内容。为了解决这一问题,作者提出了两种去偏技术:一种用于去除由时间顺序引起的注意力趋势,另一种用于抑制对填充token的虚假注意力。这些技术是模型无关的,能够在各种VLM架构和任务中实现一致的性能提升。在十个视觉语言基准测试上的实验表明,与七个最先进的修剪方法相比,该方法能够显著提高性能,证明了其有效性和普适性。
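The two debiasing steps described in this entry can be illustrated with a small sketch: remove a positional trend from the attention scores, then suppress padding tokens before ranking. Modelling the recency trend as a least-squares line and masking padding with negative infinity are assumptions made here for illustration; the abstract does not specify how the authors estimate or remove either effect, and the scores are toy values.

```python
def debias_attention(scores, is_padding):
    """Remove a linear positional trend and suppress padding tokens."""
    idx = [i for i, pad in enumerate(is_padding) if not pad]
    mean_x = sum(idx) / len(idx)
    mean_y = sum(scores[i] for i in idx) / len(idx)
    var_x = sum((i - mean_x) ** 2 for i in idx)
    # Least-squares slope of score against position (0 if there is no spread).
    cov = sum((i - mean_x) * (scores[i] - mean_y) for i in idx)
    slope = 0.0 if var_x == 0 else cov / var_x
    debiased = []
    for i, score in enumerate(scores):
        if is_padding[i]:
            debiased.append(float("-inf"))  # padding can never be kept
        else:
            debiased.append(score - slope * (i - mean_x))
    return debiased


def keep_top_k(scores, k):
    # Indices of the k highest-scoring tokens, returned in sequence order.
    return sorted(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])


# Toy scores that rise with position; the last token is padding with an inflated score.
raw = [0.10, 0.30, 0.20, 0.30, 0.40, 0.90]
padding = [False, False, False, False, False, True]
print(keep_top_k(raw, 2), keep_top_k(debias_attention(raw, padding), 2))
```

With the raw scores the pruner keeps the padding token and the last real token; after detrending and masking it keeps the token whose score stands out relative to its position.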
Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification
Authors: Tayyab Rehman, Giovanni De Gasperis, Aly Shmahell
First: 2026-01-08T11:31:47+00:00 · Latest: 2026-01-16T08:21:29+00:00
Comments: Author email changed, Acknowledgement changes
Abstract
Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
中文标题/摘要
标题:通过视觉语言模型和基于嵌入的分类实现监控系统中的级联多代理异常检测
在动态视觉环境中实现智能异常检测需要在实时性能与语义可解释性之间取得平衡。传统方法仅解决这一挑战的部分问题。基于重建的模型捕捉低级偏差但缺乏上下文推理,目标检测器提供速度但语义有限,而大型视觉语言系统则以高昂的计算成本提供可解释性。本研究引入了一种级联多代理框架,将这些互补范式统一成一个连贯且可解释的架构。早期模块执行重建门控过滤和对象级评估,而更高层次的推理代理则根据需要选择性地被调用来解释语义模糊的事件。该系统采用自适应升级阈值和发布-订阅通信架构,实现异步协调和在异构硬件上的可扩展部署。在大规模监控数据上的广泛评估表明,所提出的级联架构与直接视觉语言推理相比,延迟降低了三倍,同时保持了高感知保真度(PSNR = 38.3 dB,SSIM = 0.965)和一致的语义标签。该框架超越了传统的检测管道,结合了早期退出的效率、自适应多代理推理和可解释的异常归因,为可扩展的智能视觉监控奠定了可重复和节能的基础。
Summary / 总结
This work addresses the challenge of intelligent anomaly detection in dynamic visual environments by introducing a cascading multi-agent framework that combines reconstruction-based models, object detectors, and large vision-language systems. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are invoked only for semantically ambiguous events. The system achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework extends conventional detection pipelines with early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, providing a scalable and energy-efficient foundation for intelligent visual monitoring.
该工作通过引入一种结合重建模型、目标检测器和大型视觉语言系统的级联多代理框架,解决了动态视觉环境中的智能异常检测挑战。早期模块执行过滤和对象级评估,而高级代理在需要时进行语义解释。该系统在保持高感知保真度(PSNR=38.3 dB,SSIM=0.965)和一致的语义标签的同时,将直接视觉语言推理的延迟降低了三倍。该框架通过结合早期退出效率、自适应多代理推理和可解释的异常归因,超越了传统的检测管道,为智能视觉监控提供了可扩展且节能的基础。
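The early-exit cascade described above can be sketched as a three-stage decision function. This is an illustration under stated assumptions, not the authors' implementation: the stage callables, the fixed thresholds (the paper describes adaptive ones), and the `Decision` record are all hypothetical, and the publish-subscribe backbone is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Decision:
    stage: str       # which stage produced the final decision
    anomalous: bool
    label: str


def run_cascade(
    frame: object,
    recon_error: Callable[[object], float],
    detect: Callable[[object], Tuple[str, float]],
    vlm_label: Callable[[object], str],
    recon_threshold: float = 0.05,      # hypothetical fixed value
    confidence_threshold: float = 0.6,  # hypothetical fixed value
) -> Decision:
    # Stage 1: reconstruction gate. A low error means the frame looks normal: exit early.
    if recon_error(frame) < recon_threshold:
        return Decision("reconstruction", False, "normal")
    # Stage 2: object-level assessment returns (label, confidence).
    label, confidence = detect(frame)
    if confidence >= confidence_threshold:
        return Decision("detector", label != "normal", label)
    # Stage 3: only ambiguous frames reach the expensive vision-language model.
    label = vlm_label(frame)
    return Decision("vlm", label != "normal", label)
```

Because most frames exit at the first or second stage, the costly model is called only on the ambiguous remainder, which is where a latency reduction of the kind reported would come from.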
CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs
Authors: Yuanxiang Liu, Songze Li, Xiaoke Guo, Zhaoyan Gong, Qifei Zhang, Huajun Chen, Wen Zhang
First: 2026-01-16T07:27:40+00:00 · Latest: 2026-01-16T07:27:40+00:00
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges like hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity--applying homogeneous search strategies that render them vulnerable to instability under neighborhood noise and structural misalignment leading to reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses. It triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
中文标题/摘要
标题:CoG:通过关系蓝图和失败意识精炼的知识图谱可控图推理
大型语言模型(LLMs)展示了卓越的推理能力,但常常面临可靠性挑战,如幻觉。知识图谱(KGs)提供了明确的语义基础,但现有的KG增强LLM范式通常表现出认知僵化——采用同质化的搜索策略,使其在邻域噪声和结构错位下变得不稳定,导致推理停滞。为解决这些挑战,我们提出CoG,这是一种基于双重过程理论的无需训练框架,模仿直觉和审慎之间的互动。首先,作为快速直觉过程,关系蓝图指导模块利用可解释的软结构约束来快速稳定搜索方向,抵御噪声。其次,作为审慎的分析过程,失败意识精炼模块在遇到推理困境时介入。它触发基于证据的反思,并执行受控回溯以克服推理停滞。在三个基准上的实验结果表明,CoG在准确性和效率上均显著优于现有最佳方法。
Summary / 总结
CoG is a training-free framework that enhances the reasoning capabilities of Large Language Models (LLMs) by integrating Knowledge Graphs (KGs). It uses a Relational Blueprint Guidance module to provide interpretable soft structural constraints that stabilize search directions against noise, and a Failure-Aware Refinement module that intervenes during reasoning impasses to execute controlled backtracking. Experiments show that CoG improves both accuracy and efficiency compared to existing methods on three benchmarks.
CoG 是一个无需训练的框架,通过利用关系蓝图和失败意识的精炼来解决大型语言模型的可靠性问题。关系蓝图指导模块能够稳定搜索方向以对抗噪声,而失败意识的精炼模块则在遇到推理停滞时介入。实验结果显示,CoG 在准确性和效率上都优于现有方法。
MERGETUNE: Continued fine-tuning of vision-language models
Authors: Wenqing Wang, Da Li, Xiatian Zhu, Josef Kittler
First: 2026-01-15T15:15:53+00:00 · Latest: 2026-01-16T04:31:59+00:00
Comments: 20 pages, 5 figures
Abstract
Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
中文标题/摘要
标题:MERGETUNE:视觉-语言模型的持续微调
对CLIP等视觉-语言模型(VLMs)进行微调通常会导致预训练知识的灾难性遗忘。先前的工作主要致力于在适应过程中减轻遗忘,然而遗忘在此过程中仍然不可避免。我们提出了一种新的范式——持续微调(CFT),旨在在零样本模型已经适应后恢复预训练知识。我们提出了一种简单的、模型无关的CFT策略(命名为MERGETUNE),该策略由线性模式连通性(LMC)引导,可以在不进行架构更改的情况下,后处理应用到现有的微调模型中。给定一个微调模型,我们继续微调其可训练参数(例如,软提示或线性头),以搜索一个持续模型,该模型具有两条低损失路径到零样本(例如,CLIP)和微调(例如,CoOp)解决方案。通过利用损失景观的几何结构,持续模型隐式地合并了两种解决方案,恢复了在微调对应物中丢失的预训练知识。一个挑战是,原始的LMC约束需要从预训练任务中重放数据。我们通过二阶近似对零样本模型进行近似,从而消除大规模数据重放的需要。实验表明,MERGETUNE在不增加参数的情况下,将CoOp的基本-新颖泛化的调和平均值提高了5.6%。在鲁棒微调评估中,MERGETUNE生成的LMC合并模型在较低推理成本的情况下超过了集成基线,并且当与零样本模型集成时,进一步取得了最先进的结果。我们的代码可在https://github.com/Surrey-UP-Lab/MERGETUNE/获得。
Summary / 总结
The research addresses catastrophic forgetting when fine-tuning vision-language models (VLMs) such as CLIP. It introduces a new paradigm, continued fine-tuning (CFT), and proposes a model-agnostic strategy named MERGETUNE, which uses linear mode connectivity (LMC) to recover pretrained knowledge after a zero-shot model has already been adapted. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters; in robust fine-tuning evaluations it surpasses ensemble baselines at lower inference cost, and it achieves state-of-the-art results when ensembled with the zero-shot model.
MERGETUNE 提出了一种新的持续微调(CFT)范式,旨在在视觉-语言模型适应后恢复预训练知识。它提出了一种基于线性模式连通性(LMC)的模型通用策略 MERGETUNE,无需架构更改即可继续微调现有模型的可训练参数。实验表明,MERGETUNE 在基底-新颖泛化上的调和平均值提高了 +5.6%,并在鲁棒微调评估中超越了集成基线,具有更低的推理成本,并且在与零样本模型集成时达到了最先进的结果。
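Linear mode connectivity, which guides MERGETUNE, is commonly checked by measuring how far the loss rises along the straight line between two solutions. The sketch below shows that check on toy losses; it does not reproduce MERGETUNE's training objective or its second-order surrogate, and the helper names and toy losses are hypothetical.

```python
from typing import Callable, List


def interpolate(a: List[float], b: List[float], t: float) -> List[float]:
    """Point on the straight line between parameter vectors a (t=0) and b (t=1)."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def loss_barrier(loss: Callable[[List[float]], float],
                 a: List[float], b: List[float], steps: int = 11) -> float:
    """Largest excess of the loss on the path over the linear blend of endpoint losses.

    A barrier near zero means the two solutions are linearly mode connected.
    """
    la, lb = loss(a), loss(b)
    worst = float("-inf")
    for k in range(steps):
        t = k / (steps - 1)
        gap = loss(interpolate(a, b, t)) - ((1.0 - t) * la + t * lb)
        worst = max(worst, gap)
    return worst


def bowl(theta: List[float]) -> float:
    """Convex toy loss: no barrier between any two points."""
    return sum(x * x for x in theta)


def double_well(theta: List[float]) -> float:
    """Toy loss with minima at -1 and +1 separated by a bump at 0."""
    return (theta[0] ** 2 - 1.0) ** 2


print(loss_barrier(bowl, [1.0, 0.0], [0.0, 1.0]))
print(loss_barrier(double_well, [-1.0], [1.0]))
```

In MERGETUNE's setting, the continued model is sought so that both of its paths, to the zero-shot solution and to the fine-tuned solution, have low loss in this sense.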
Steering Language Models Before They Speak: Logit-Level Interventions
Authors: Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han
First: 2026-01-16T03:00:33+00:00 · Latest: 2026-01-16T03:00:33+00:00
Comments: 14 pages, 5 figures, preprint
Abstract
Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. To address these limitations, we propose a training-free inference-time logit intervention for controllable generation. Our approach utilizes a statistical token score table derived from z-normalized log-odds of labeled corpora to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains: up to +47 percentage points in accuracy and a 50x improvement in F1.
中文标题/摘要
标题:在语言模型开口之前引导它们:logit级干预
引导大语言模型对于风格敏感的文字重写、用户自适应通信和毒性缓解等专门应用至关重要。当前的引导方法,如基于提示和基于激活的方法,广泛用于引导模型行为。然而,基于激活的技术需要深入访问内部层,而基于提示的引导往往无法提供一致或精细的控制。为了解决这些限制,我们提出了一种无需训练的推理时logit干预方法,以实现可控生成。我们的方法利用从标注语料库的z标准化对数似然比派生的统计token得分表来偏移解码分布。针对写作复杂性、正式性和毒性三个不同数据集的实证评估表明,我们的方法能够有效引导输出特性,证实了其广泛适用性和任务无关性。我们的结果表明,基于统计的logit引导可以实现大规模、一致性和多任务控制增益:最高可达+47%的准确率和50倍的f1改进。
Summary / 总结
This paper addresses the need for steering large language models (LLMs) in specialized applications such as style-sensitive text rewriting and toxicity mitigation. It introduces a training-free, inference-time logit intervention that uses a statistical token score table, derived from z-normalized log-odds of labeled corpora, to shift the decoding distribution. Experiments on three datasets covering writing complexity, formality, and toxicity show that the method effectively controls output characteristics, with gains of up to 47 percentage points in accuracy and a 50x improvement in F1.
本文针对大型语言模型(LLMs)在风格敏感文本重写和毒性缓解等特定应用中的引导需求,提出了一种无需训练的logit级干预方法,利用统计化的token得分表来调整解码分布。实验结果表明,该方法可以有效控制输出特性,实现高达47%的准确率提升和50倍的F1分数提升,在多个任务中表现出广泛适用性和任务无关性。
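The score-table idea summarized above can be sketched in a few lines. This is a minimal illustration, assuming add-one smoothing, whitespace tokenization, and a single scaling factor `alpha`; none of these details come from the abstract, and the toy corpora and function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def log_odds(count: int, total: int, vocab_size: int) -> float:
    p = (count + 1) / (total + vocab_size)  # add-one smoothing (assumption)
    return math.log(p / (1.0 - p))


def build_score_table(target_docs: List[str], other_docs: List[str]) -> Dict[str, float]:
    """Z-normalized difference of log-odds between a target-style and a contrast corpus."""
    target = Counter(tok for doc in target_docs for tok in doc.split())
    other = Counter(tok for doc in other_docs for tok in doc.split())
    vocab = sorted(set(target) | set(other))
    n_t, n_o, v = sum(target.values()), sum(other.values()), len(vocab)
    raw = {w: log_odds(target[w], n_t, v) - log_odds(other[w], n_o, v) for w in vocab}
    values = list(raw.values())
    mu, sigma = mean(values), pstdev(values)
    return {w: ((x - mu) / sigma if sigma > 0 else 0.0) for w, x in raw.items()}


def steer_logits(logits: Dict[str, float], table: Dict[str, float],
                 alpha: float = 1.0) -> Dict[str, float]:
    """Shift each token's logit by alpha times its score; unseen tokens are unchanged."""
    return {tok: z + alpha * table.get(tok, 0.0) for tok, z in logits.items()}


formal = ["we hereby request assistance", "we therefore request approval"]
casual = ["hey can you help", "can you help me out"]
table = build_score_table(formal, casual)
steered = steer_logits({"request": 0.0, "help": 0.0, "banana": 0.0}, table, alpha=2.0)
print(steered)
```

Because the shift is applied to logits before sampling, it needs no access to internal activations and no training, which matches the motivation given in the abstract.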
Multi-Stage Patient Role-Playing Framework for Realistic Clinical Interactions
Authors: Shijie Jiang, Zefan Zhang, Kehua Zhu, Tian Bai, Ruihong Zhao
First: 2026-01-16T02:34:22+00:00 · Latest: 2026-01-16T02:34:22+00:00
Comments: 22 pages, 5 figures, under review
Abstract
The simulation of realistic clinical interactions plays a pivotal role in advancing clinical Large Language Models (LLMs) and supporting medical diagnostic education. Existing approaches and benchmarks rely on generic or LLM-generated dialogue data, which limits the authenticity and diversity of doctor-patient interactions. In this work, we propose the first Chinese patient simulation dataset (Ch-PatientSim), constructed from realistic clinical interaction scenarios to comprehensively evaluate the performance of models in emulating patient behavior. Patients are simulated based on a five-dimensional persona structure. To address persona class imbalance, a portion of the dataset is augmented using few-shot generation, followed by manual verification. We evaluate various state-of-the-art LLMs and find that most produce overly formal responses that lack individual personality. To address this limitation, we propose a training-free Multi-Stage Patient Role-Playing (MSPRP) framework, which decomposes interactions into three stages to ensure both personalization and realism in model responses. Experimental results demonstrate that our approach significantly improves model performance across multiple dimensions of patient simulation.
中文标题/摘要
标题:多阶段患者角色扮演框架以实现真实的临床互动
真实的临床互动模拟在推动临床大型语言模型(LLM)的发展和支持医学诊断教育方面起着关键作用。现有的方法和基准依赖于通用或LLM生成的对话数据,这限制了医生与患者互动的真实性和多样性。在本文中,我们提出了第一个中文患者模拟数据集(Ch-PatientSim),该数据集基于真实的临床互动场景,全面评估模型模拟患者行为的能力。患者基于五维人格结构进行模拟。为了解决人格类别不平衡的问题,部分数据集使用少量生成进行扩充,随后进行人工验证。我们评估了多种最先进的LLM,发现大多数生成的回答过于正式,缺乏个性。为了解决这一局限性,我们提出了一种无需训练的多阶段患者角色扮演(MSPRP)框架,将互动分解为三个阶段,以确保模型响应的个性化和真实性。实验结果表明,我们的方法在多个维度上显著提高了模型在患者模拟中的性能。
Summary / 总结
This work addresses the need for more authentic and diverse clinical interactions in the evaluation of Large Language Models (LLMs) for medical diagnostic education. It introduces Ch-PatientSim, a dataset based on realistic clinical scenarios, and proposes a Multi-Stage Patient Role-Playing (MSPRP) framework to enhance model responses. The framework decomposes interactions into three stages to ensure personalization and realism, leading to improved model performance in simulating patient behavior.
该研究旨在通过更真实的临床互动来训练大型语言模型(LLMs),以支持医学诊断教育。它引入了基于真实临床场景的Ch-PatientSim数据集,并提出了一个多阶段患者角色扮演(MSPRP)框架来增强模型响应。该框架将互动分解为三个阶段,以确保个性化和真实性,从而在模拟患者行为方面提高了模型性能。
MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement
Authors: Meidan Ding, Jipeng Zhang, Wenxuan Wang, Haiqin Zhong, Xiaoling Luo, Wenting Chen, Linlin Shen
First: 2026-01-16T02:32:07+00:00 · Latest: 2026-01-16T02:32:07+00:00
Abstract
Medical Vision-Language Models (MedVLMs) excel at perception tasks but struggle with the complex clinical reasoning required in real-world scenarios. While reinforcement learning (RL) has been explored to enhance reasoning capabilities, existing approaches face critical mismatches: deep reasoning data is scarce, cold-start limits multi-specialty alignment, and standard RL algorithms fail to model clinical reasoning diversity. We propose MMedExpert-R1, a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and clinical guideline reinforcement. We construct MMedExpert, a high-quality dataset of 10K samples across four specialties with step-by-step reasoning traces. Our Domain-Specific Adaptation (DSA) creates specialty-specific LoRA modules to provide diverse initialization, while Guideline-Based Advantages (GBA) explicitly models different clinical reasoning perspectives to align with real-world diagnostic strategies. Conflict-Aware Capability Integration then merges these specialized experts into a unified agent, ensuring robust multi-specialty alignment. Comprehensive experiments demonstrate state-of-the-art performance, with our 7B model achieving 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, establishing a robust foundation for reliable multimodal medical reasoning systems.
中文标题/摘要
标题:MMedExpert-R1: 通过领域特定适应和临床指南强化多模态医学推理
医学视觉-语言模型(MedVLMs)在感知任务上表现出色,但在现实场景中复杂的临床推理方面却力不从心。尽管强化学习(RL)已被探索以增强推理能力,但现有方法面临关键的不匹配:深度推理数据稀缺、冷启动限制了多专科对齐,以及标准RL算法无法建模临床推理多样性。我们提出了一种名为MMedExpert-R1的新颖推理MedVLM,通过领域特定适应和临床指南强化来解决这些挑战。我们构建了MMedExpert,这是一个包含10000个样本的高质量数据集,覆盖四个专科,并附有逐步推理痕迹。我们的领域特定适应(DSA)创建了专科特定的LoRA模块,提供了多样化的初始化,而基于指南的优势(GBA)明确建模了不同的临床推理视角,以与实际诊断策略对齐。冲突感知能力集成将这些专业专家合并为一个统一的代理,确保了多专科对齐的稳健性。全面的实验表明,我们的性能达到了最先进的水平,7B模型在MedXpert-MM上达到了27.50分,在OmniMedVQA上达到了83.03分,为可靠的多模态医学推理系统奠定了坚实的基础。
Summary / 总结
The research aims to strengthen the clinical reasoning of medical vision-language models (MedVLMs). The authors construct MMedExpert, a dataset of 10K samples across four specialties with step-by-step reasoning traces, and propose MMedExpert-R1, which combines Domain-Specific Adaptation (specialty-specific LoRA modules), Guideline-Based Advantages that model different clinical reasoning perspectives, and Conflict-Aware Capability Integration that merges the specialized experts into a unified agent. The 7B model achieves 27.50 on MedXpert-MM and 83.03 on OmniMedVQA, which the authors report as state-of-the-art performance.
研究旨在通过解决现有方法的局限性,如推理数据稀缺和多专科模型难以对齐的问题,来增强医疗视觉-语言模型(MedVLMs)的推理能力。提出了一种名为MMedExpert-R1的新方法,该方法采用领域特定适应和临床指南强化。方法包括创建一个包含四个专科10K样本的数据集,并使用领域特定适应(DSA)提供多样化的初始化,以及使用基于指南的优势(GBA)来与实际诊断策略对齐。实验结果表明,MMedExpert-R1在MedXpert-MM和OmniMedVQA基准测试中表现出色,达到了最先进的性能。
PatientVLM Meets DocVLM: Pre-Consultation Dialogue Between Vision-Language Models for Efficient Diagnosis
Authors: K Lokesh, Abhirama Subramanyam Penamakuri, Uday Agarwal, Apoorva Challa, Shreya K Gowda, Somesh Gupta, Anand Mishra
Venue: AAAI 2026
First: 2026-01-16T02:18:29+00:00 · Latest: 2026-01-16T02:18:29+00:00
Comments: Accepted at AAAI 2026 Main Track
Abstract
Traditionally, AI research in medical diagnosis has largely centered on image analysis. While this has led to notable advancements, the absence of patient-reported symptoms continues to hinder diagnostic accuracy. To address this, we propose a Pre-Consultation Dialogue Framework (PCDF) that mimics real-world diagnostic procedures, where doctors iteratively query patients before reaching a conclusion. Specifically, we simulate diagnostic dialogues between two vision-language models (VLMs): a DocVLM, which generates follow-up questions based on the image and dialogue history, and a PatientVLM, which responds using a symptom profile derived from the ground-truth diagnosis. We additionally conducted a small-scale clinical validation of the synthetic symptoms generated by our framework, with licensed clinicians confirming their clinical relevance, symptom coverage, and overall realism. These findings indicate that the resulting DocVLM-PatientVLM interactions form coherent, multi-turn consultations paired with images and diagnoses, which we then use to fine-tune the DocVLM. This dialogue-based supervision leads to substantial gains over image-only training, highlighting the value of realistic symptom elicitation for diagnosis.
中文标题/摘要
标题:PatientVLM与DocVLM会面:基于视觉-语言模型的预咨询对话以提高诊断效率
传统上,医学诊断中的AI研究主要集中在图像分析上。尽管这带来了显著的进步,但患者报告的症状缺失仍然阻碍了诊断的准确性。为了解决这一问题,我们提出了一种预咨询对话框架(PCDF),该框架模拟了现实中的诊断程序,医生在得出结论前会逐步询问患者。具体来说,我们模拟了两个视觉-语言模型(VLMs)之间的诊断对话:DocVLM根据图像和对话历史生成后续问题,而PatientVLM则使用从真实诊断中推断出的症状概况进行回应。我们还对由该框架生成的合成症状进行了小型临床验证,执业临床医生确认了其临床相关性、症状覆盖面和整体真实性。这些发现表明,DocVLM与PatientVLM的互动形成了连贯的、多轮次的咨询,配以图像和诊断,我们随后利用这些互动对DocVLM进行微调。基于对话的监督比仅使用图像的训练带来了显著的改进,突显了真实症状采集对诊断的价值。
Summary / 总结
The research aims to improve medical diagnosis by incorporating patient-reported symptoms into AI systems. It introduces a Pre-Consultation Dialogue Framework (PCDF) using two vision-language models: DocVLM and PatientVLM. DocVLM generates follow-up questions based on images and dialogue history, while PatientVLM responds with symptoms. Clinical validation confirmed the relevance and realism of the synthetic symptoms. The DocVLM-PatientVLM interactions were used to fine-tune the DocVLM, leading to better diagnostic accuracy compared to image-only training.
研究旨在通过纳入患者报告的症状来提高医疗诊断的准确性。提出了一个前咨询对话框架(PCDF),使用两个视觉语言模型:DocVLM和PatientVLM。DocVLM基于图像和对话历史生成后续问题,而PatientVLM使用症状资料进行回应。临床验证确认了合成症状的相关性、覆盖性和真实性。通过这些互动对DocVLM进行微调,相比仅使用图像的训练,显著提高了诊断准确性。
Image2Garment: Simulation-ready Garment Generation from a Single Image
Authors: Selim Emir Can, Jan Ackermann, Kiyohiro Nakayama, Ruofan Liu, Tong Wu, Yang Zheng, Hugo Bertiche, Menglei Chai, Thabo Beeler, Gordon Wetzstein
First: 2026-01-14T17:47:33+00:00 · Latest: 2026-01-15T21:21:50+00:00
Comments: Project Page: https://image2garment.github.io/
Abstract
Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
中文标题/摘要
标题:Image2Garment:从单张图像生成可用于模拟的服装
从单张图像估计物理准确且可用于模拟的服装极具挑战性,因为缺乏图像到物理的数据集,且该问题本身具有病态性。先前的方法要么需要多视角捕捉和昂贵的可微模拟,要么仅预测服装几何形状而没有用于真实模拟所需的材料属性。我们提出了一种无需迭代优化即可从单张图像生成可用于模拟的服装的前馈框架。该框架首先通过微调视觉-语言模型从真实图像中推断材料组成和织物属性,然后使用少量的材料-物理测量数据集训练一个轻量级预测器,将这些属性映射到相应的物理织物参数。我们的方法引入了两个新数据集(FTAG和T2P),并实现了无需迭代优化即可从单张图像生成可用于模拟的服装。实验表明,我们的估计器在材料组成估计和织物属性预测方面具有更高的准确性,并通过我们的物理参数估计器进一步实现了比最先进的图像到服装方法更高的保真度模拟。
Summary / 总结
The research aims to generate physically accurate, simulation-ready garments from a single image, addressing the absence of image-to-physics datasets and the ill-posed nature of the problem. The method fine-tunes a vision-language model to infer material composition and fabric attributes from real images, then trains a lightweight predictor that maps these attributes to physical fabric parameters using a small dataset of material-physics measurements. Experiments demonstrate superior accuracy in material composition estimation and fabric attribute prediction, leading to higher-fidelity simulations than state-of-the-art image-to-garment methods.
研究旨在从单张图像生成物理上准确且可用于模拟的服装,解决数据集稀缺和问题定义不明确的挑战。方法包括微调视觉语言模型从真实图像中推断材料组成和织物属性,然后训练一个轻量级预测器将这些属性映射到相应的物理织物参数。实验表明,在材料组成和织物属性预测方面具有更高的准确性,从而实现比现有方法更逼真的模拟。
ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models
Authors: Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal
Venue: WACV 2026
First: 2025-06-13T19:57:40+00:00 · Latest: 2026-01-15T21:21:42+00:00
Comments: Accepted to WACV 2026
Abstract
Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, ViSTA. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ a Visual Question Answering-based metric TIFA to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model is not only consistent across different frames, but also well-aligned with the narrative text descriptions.
中文标题/摘要
标题:ViSTA:使用多模态适配器的文本到图像扩散模型视觉叙事
文本到图像的扩散模型已经取得了显著的成功,但生成连贯的图像序列进行视觉叙事仍然具有挑战性。一个关键挑战是如何有效地利用所有先前的文本-图像对,即历史文本-图像对,它们提供了上下文信息,以保持帧间的一致性。现有的自回归方法依赖于所有过去的图像-文本对,但需要大量的训练,而无需训练的主题特定方法可以确保一致性,但缺乏对叙述提示的适应性。为了解决这些限制,我们提出了一种多模态历史适配器,称为ViSTA。它包括(1)一个多模态历史融合模块,用于提取相关的历史特征,以及(2)一个历史适配器,用于根据提取的相关特征进行生成。我们还在推理过程中引入了一种显著历史选择策略,其中选择最显著的历史文本-图像对,从而提高条件生成的质量。此外,我们提出使用基于视觉问答的度量TIFA来评估视觉叙事中的文本-图像对齐,从而提供一个更针对性和可解释的生成图像评估。在StorySalon和FlintStonesSV数据集上评估,我们提出的ViSTA模型不仅在不同帧之间保持一致,而且与叙述文本描述也很好地对齐。
Summary / 总结
The research aims to improve the coherence of image sequences generated by text-to-image diffusion models for visual storytelling. ViSTA, a multi-modal history adapter, is proposed to effectively leverage historical text-image pairs. It includes a multi-modal history fusion module for feature extraction and a history adapter for conditioning generation. The model also employs a salient history selection strategy and a Visual Question Answering-based metric TIFA for better text-image alignment. Experimental results on StorySalon and FlintStonesSV datasets show that ViSTA generates consistent and well-aligned images with narrative text descriptions.
研究旨在通过解决现有方法的局限性,提高视觉叙事中生成图像序列的连贯性。提出了一个多模态历史适配器ViSTA,通过多模态历史融合模块提取相关的历史特征进行条件生成。模型还包含一个显著历史选择策略和基于视觉问答的TIFA评估指标,用于评估文本-图像对齐。实验结果表明,ViSTA在StorySalon和FlintStonesSV数据集上生成的图像不仅在不同帧之间保持一致,而且与叙述文本描述高度对齐。
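The salient history selection step mentioned above picks a single history text-image pair to condition on at inference. The abstract does not say how saliency is scored, so the sketch below assumes, purely for illustration, cosine similarity between an embedding of the current prompt and embeddings of each history pair. The embedding vectors and function names are hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_salient(current: List[float], history: List[List[float]]) -> int:
    """Index of the history pair whose embedding is most similar to the current prompt."""
    return max(range(len(history)), key=lambda i: cosine(current, history[i]))


# Toy embeddings: the second history pair points almost the same way as the prompt.
current_prompt = [1.0, 0.0]
history_pairs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(select_salient(current_prompt, history_pairs))
```

Any other relevance score could be substituted for `cosine` without changing the selection logic.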
Can Vision-Language Models Understand Construction Workers? An Exploratory Study
Authors: Hieu Bui, Nathaniel E. Chodosh, Arash Tavakoli
First: 2026-01-15T20:10:03+00:00 · Latest: 2026-01-15T20:10:03+00:00
Abstract
As robots become increasingly integrated into construction workflows, their ability to interpret and respond to human behavior will be essential for enabling safe and effective collaboration. Vision-Language Models (VLMs) have emerged as a promising tool for visual understanding tasks and offer the potential to recognize human behaviors without extensive domain-specific training. This capability makes them particularly appealing in the construction domain, where labeled data is scarce and monitoring worker actions and emotional states is critical for safety and productivity. In this study, we evaluate the performance of three leading VLMs, GPT-4o, Florence 2, and LLaVa-1.5, in detecting construction worker actions and emotions from static site images. Using a curated dataset of 1,000 images annotated across ten action and ten emotion categories, we assess each model's outputs through standardized inference pipelines and multiple evaluation metrics. GPT-4o consistently achieved the highest scores across both tasks, with an average F1-score of 0.756 and accuracy of 0.799 in action recognition, and an F1-score of 0.712 and accuracy of 0.773 in emotion recognition. Florence 2 performed moderately, with F1-scores of 0.497 for action and 0.414 for emotion, while LLaVa-1.5 showed the lowest overall performance, with F1-scores of 0.466 for action and 0.461 for emotion. Confusion matrix analyses revealed that all models struggled to distinguish semantically close categories, such as collaborating in teams versus communicating with supervisors. While the results indicate that general-purpose VLMs can offer a baseline capability for human behavior recognition in construction environments, further improvements, such as domain adaptation, temporal modeling, or multimodal sensing, may be needed for real-world reliability.
中文标题/摘要
标题:视觉语言模型能否理解建筑工人?一项探索性研究
随着机器人技术在建筑工作流程中的日益集成,它们解读和响应人类行为的能力将对于实现安全有效的协作至关重要。视觉语言模型(VLMs)作为一种视觉理解工具已经崭露头角,并且有可能在无需大量领域特定训练的情况下识别人类行为。这种能力使它们在建筑领域特别具有吸引力,因为建筑领域中标记数据稀缺,监测工人行为和情绪状态对于安全和生产率至关重要。在这项研究中,我们评估了三种领先VLMs——GPT-4o、Florence 2和LLaVa-1.5——在从静态现场图像中检测建筑工人行为和情绪方面的性能。我们使用了10个行为类别和10个情绪类别标注的1000张图像数据集,通过标准化推理管道和多个评估指标评估每个模型的输出。GPT-4o在两个任务中均取得了最高分数,行为识别的平均F1分数为0.756,准确率为0.799,情绪识别的F1分数为0.712,准确率为0.773。Florence 2表现适中,行为和情绪识别的F1分数分别为0.497和0.414,而LLaVa-1.5的整体表现最低,行为和情绪识别的F1分数分别为0.466和0.461。混淆矩阵分析显示,所有模型在区分语义相近类别,如团队合作与与监督者沟通方面存在困难。虽然结果表明通用视觉语言模型可以在建筑环境中提供人类行为识别的基本能力,但为了实现实际可靠性,可能还需要进一步改进,如领域适应、时间建模或多模态感知等。
Summary / 总结
This study evaluates the performance of three Vision-Language Models (VLMs) in recognizing construction worker actions and emotions from static images. Using a dataset of 1,000 images annotated for ten action and ten emotion categories, the study finds that GPT-4o outperforms the other models, achieving high F1-scores and accuracy in both tasks. Florence 2 and LLaVa-1.5 show lower performance, with significant challenges in distinguishing semantically close categories. The results suggest that while general-purpose VLMs can provide a baseline for human behavior recognition, further improvements are needed for practical applications in construction environments.
本研究评估了三种视觉-语言模型在从静态工地图像中识别建筑工人行为和情绪方面的性能。使用包含1,000张图像的数据集,这些图像被标注了十个行为和十个情绪类别,研究发现GPT-4o在行为识别上的平均F1得分为0.756,准确率为0.799,在情绪识别上的F1得分为0.712,准确率为0.773。Florence 2表现中等,行为和情绪识别的F1得分分别为0.497和0.414;LLaVa-1.5整体表现最低,行为和情绪识别的F1得分分别为0.466和0.461。模型在区分语义相近类别时存在困难,表明需要进一步改进领域适应和多模态传感以提高实际应用的可靠性。
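The accuracy and average F1 figures reported above are standard multi-class metrics. The sketch below shows how they can be computed from predicted and true labels; it assumes an unweighted (macro) average over classes, which the abstract does not confirm, and the toy labels are hypothetical.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, using F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy example with two action classes.
truth = ["lift", "lift", "weld", "weld"]
preds = ["lift", "weld", "weld", "weld"]
print(accuracy(truth, preds), macro_f1(truth, preds))
```

Macro averaging weights every category equally, so a model that fails on a few rare categories can show a noticeably lower F1 than accuracy, as in several of the reported results.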
Alterbute: Editing Intrinsic Attributes of Objects in Images
Authors: Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen
First: 2026-01-15T18:59:53+00:00 · Latest: 2026-01-15T18:59:53+00:00
Comments: Project page is available at https://talreiss.github.io/alterbute/
Abstract
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
中文标题/摘要
标题:Alterbute:图像中对象固有属性的编辑方法
我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中对象的固有属性。我们允许更改对象的颜色、纹理、材料,甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖于无法有效保持身份的无监督先验,要么使用过于严格的监督,限制了有意义的固有属性变化。我们的方法依赖于:(i) 放松的训练目标,允许模型在参考身份图像、描述目标固有属性的文本提示以及定义外部上下文的背景图像和对象掩码的条件下,同时改变固有属性和外部属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如,“保时捷911卡雷拉”),将具有身份定义特征的对象分组,同时允许固有属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和固有属性描述,实现可扩展、保持身份的监督。Alterbute在保持身份的对象固有属性编辑方面优于现有方法。
Summary / 总结
Alterbute is a diffusion-based method for editing an object's intrinsic attributes in images, such as color, texture, and material, while preserving the object's identity and scene context. It uses a relaxed training objective and Visual Named Entities to allow changes in intrinsic attributes while keeping extrinsic attributes consistent. The method outperforms existing approaches in preserving object identity during intrinsic attribute editing.
Alterbute 是一种基于扩散的方法,用于在保持物体身份和场景上下文不变的情况下编辑图像中物体的内在属性,如颜色、纹理和材料。它使用一个宽松的训练目标和视觉命名实体来允许内在属性的变化,同时保持外在属性的一致性。该方法在保持物体身份方面优于现有方法,特别是在内在属性编辑方面。
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略且不对称的连接方式,仅将视觉编码器的输出链接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现多层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 可以显著提高性能,确立了CLI 作为一种可扩展范式的地位,通过赋予LLM按需访问完整视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the visual feature bottleneck in current Vision-Language Models (VLMs), which connect only the output of the vision encoder to the input of the LLM, by proposing Cross-Layer Injection (CLI), a lightweight framework that builds a dynamic many-to-many bridge between the two modalities. CLI includes an Adaptive Multi-Projection (AMP) module that harmonizes features from different vision layers and an Adaptive Gating Fusion (AGF) mechanism that lets the LLM selectively inject visual information based on its decoding context. Integrated into LLaVA-OneVision and LLaVA-1.5, CLI improves performance across 18 diverse benchmarks.
论文通过提出一种动态框架Cross-Layer Injection (CLI),解决了静态视觉-语言模型(VLMs)的局限性,CLI允许视觉和语言模态之间实现多对多的连接。CLI 包含一个用于从不同视觉层融合特征的Adaptive Multi-Projection (AMP) 模块,以及一个根据LLM的上下文选择性注入相关视觉信息的Adaptive Gating Fusion (AGF) 机制。在18个不同基准上的实验表明CLI提高了性能,展示了其在增强多模态理解方面的有效性与可扩展性。
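The gating idea in the Cross-Layer Injection entry above can be illustrated with a small sketch. This is not the paper's AMP or AGF implementation: the function name `gated_inject`, the scaled dot-product gate, and the residual sum are all assumptions chosen for illustration, and the visual features are assumed to be already projected to the hidden dimension.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_inject(hidden, layer_feats):
    """Add gated visual features from several vision layers to one hidden state.

    hidden      -- list of floats (the decoder state, dimension d)
    layer_feats -- list of lists of floats, one d-dimensional vector per vision layer
                   (assumed to be projected to dimension d beforehand)
    Returns (fused_hidden, gates).
    """
    d = len(hidden)
    # Hypothetical gate: one scalar per vision layer, computed from the current
    # hidden state so that the weighting depends on the decoding context.
    gates = [sigmoid(dot(hidden, feat) / math.sqrt(d)) for feat in layer_feats]
    fused = list(hidden)
    for gate, feat in zip(gates, layer_feats):
        for i in range(d):
            fused[i] += gate * feat[i]
    return fused, gates
```

A layer whose features align with the current hidden state receives a gate above 0.5 and contributes more to the fused state; a layer whose features point the opposite way is suppressed.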
Future Optical Flow Prediction Improves Robot Control & Video Generation
Authors: Kanchana Ranasinghe, Honglu Zhou, Yu Fang, Luyu Yang, Le Xue, Ran Xu, Caiming Xiong, Silvio Savarese, Michael S Ryoo, Juan Carlos Niebles
First: 2026-01-15T18:49:48+00:00 · Latest: 2026-01-15T18:49:48+00:00
Comments: Project Site (Code, Models, Demo): https://fofpred.github.io
Abstract
Future motion representations, such as optical flow, offer immense value for control and generative tasks. However, forecasting generalizable spatially dense motion representations remains a key challenge, and learning such forecasting from noisy, real-world data remains relatively unexplored. We introduce FOFPred, a novel language-conditioned optical flow forecasting model featuring a unified Vision-Language Model (VLM) and Diffusion architecture. This unique combination enables strong multimodal reasoning with pixel-level generative fidelity for future motion prediction. Our model is trained on web-scale human activity data, a highly scalable but unstructured source. To extract meaningful signals from this noisy video-caption data, we employ crucial data preprocessing techniques and our unified architecture with strong image pretraining. The resulting trained model is then extended to tackle two distinct downstream tasks in control and generation. Evaluations across robotic manipulation and video generation under language-driven settings establish the cross-domain versatility of FOFPred, confirming the value of a unified VLM-Diffusion architecture and scalable learning from diverse web data for future optical flow prediction.
中文标题/摘要
标题:未来光学流预测提升机器人控制与视频生成
未来的运动表示,如光学流,对于控制和生成任务具有巨大价值。然而,预测通用的空间密集型运动表示仍然是一个关键挑战,而从嘈杂的现实世界数据中学习这种预测相对未被探索。我们引入了FOFPred,这是一种新颖的语言条件光学流预测模型,结合了统一的视觉语言模型(VLM)和扩散架构。这种独特的组合使多模态推理和像素级生成保真度成为未来运动预测的强大工具。我们的模型在大规模网络人类活动数据上进行训练——这是一个高度可扩展但不结构化的数据源。为了从这些嘈杂的视频-描述数据中提取有意义的信号,我们采用了关键的数据预处理技术和我们统一的架构,具有强大的图像预训练。训练后的模型随后扩展以解决控制和生成中的两个不同下游任务。在语言驱动的机器人操作和视频生成设置下的评估确立了FOFPred的跨域通用性,证实了统一的VLM-扩散架构和从多样化的网络数据中进行可扩展学习的价值对于未来光学流预测的重要性。
Summary / 总结
The research aims to improve robot control and video generation by predicting future optical flow. FOFPred, a novel model combining a Vision-Language Model and Diffusion architecture, is introduced. It is trained on large-scale web video-caption data, and the model demonstrates effectiveness in robotic manipulation and video generation tasks under language-driven settings, highlighting the benefits of a unified VLM-Diffusion architecture and scalable learning from diverse data sources.
研究旨在通过预测未来光学流来提升机器人控制和视频生成。引入了FOFPred模型,该模型结合了视觉语言模型和扩散架构。它在大规模网络视频字幕数据上进行训练,并通过预处理提取有意义的信号。该模型在机器人操作和视频生成任务中表现出色,突显了统一的VLM-扩散架构和从多样化的网络数据中进行可扩展学习的优势。
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒高成本错误的风险。我们研究了基于置信度的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们得出两项发现。首先,置信度阈值化在分布内提供了机制控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率f
Summary / 总结
The research examines whether confidence-based abstention gives reliable control over error rates when vision-language models answer video questions, using NExT-QA and Gemini 2.0 Flash. In-distribution, sweeping the confidence threshold epsilon produces a smooth risk-coverage tradeoff and reduces error rates. The study also investigates whether that control remains robust under distribution shift.
研究旨在通过使视觉-语言模型在不确定时能够选择不进行预测,提高其在高风险应用中的可靠性。研究探讨了基于置信度的回避在控制视频问答中的错误率方面的有效性。主要发现表明,通过调整置信度阈值可以产生平滑的风险-覆盖率权衡,有效降低同分布情况下的错误率。但该方法在分布变化下的鲁棒性仍需进一步研究。
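The threshold sweep described in the abstention entry above can be sketched as follows. This is a generic selective-prediction illustration, not the paper's code: the function name `risk_coverage` and the rule "answer when confidence is at least epsilon" are assumptions.

```python
def risk_coverage(confidences, correct, thresholds):
    """Compute (epsilon, coverage, risk) for each abstention threshold.

    confidences -- list of model confidence scores, one per question
    correct     -- list of booleans, True where the model's answer was right
    thresholds  -- iterable of epsilon values to sweep

    coverage is the fraction of questions answered (confidence >= epsilon);
    risk is the error rate among answered questions, or None if none were answered.
    """
    rows = []
    total = len(confidences)
    for eps in thresholds:
        answered = [ok for conf, ok in zip(confidences, correct) if conf >= eps]
        coverage = len(answered) / total
        risk = 1.0 - sum(answered) / len(answered) if answered else None
        rows.append((eps, coverage, risk))
    return rows
```

Raising epsilon lowers coverage; when confidence is informative, the error rate among the remaining answers falls as well, which is the tradeoff the abstract describes for the in-distribution case.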
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当今最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs生成的合成数据,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最新技术,并展示了在单图像、多图像和视频任务中出色的基于指针的定位能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、用于微调的自由形式视频问答数据集、一种新的具有复杂查询的对象跟踪数据集以及一种创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的数据训练配方,并展示了在视觉标记上使用双向注意和一种新颖的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并超越了私有模型如Gemini 3 Pro在某些任务上的表现(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
Molmo2 is a new family of open vision-language models for video understanding and grounding that is state-of-the-art among open-source models and surpasses proprietary models on some tasks. The research addresses the lack of open foundations for improving video-language models by releasing 9 new datasets (7 video and 2 multi-image), collected without closed VLMs, together with a training recipe. The best 8B model outperforms other open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 outperforms Qwen3-VL on video counting (35.5 vs 29.6 accuracy) and Gemini 3 Pro on video pointing (38.4 vs 20.0 F1) and video tracking (56.2 vs 41.1 J&F).
Molmo2 是一种新的开源视觉-语言模型,其在视频理解和定位任务中优于现有开源和专有模型。研究通过提供 9 个新数据集和训练方法来解决缺乏开源基础的问题。主要发现包括在点驱动的定位任务中表现出色,并在长视频任务上具有竞争力。Molmo2 在视频计数和视频跟踪等特定任务上显著优于现有开源模型,甚至超越了一些专有模型。
Semantic Misalignment in Vision-Language Models under Perceptual Degradation
Authors: Guo Cheng
First: 2026-01-13T09:13:05+00:00 · Latest: 2026-01-15T17:10:05+00:00
Comments: 10 pages, 4 figures, 6 tables
Abstract
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
中文标题/摘要
标题:视觉-语言模型在感知退化下的语义不匹配
视觉-语言模型(VLMs)在自动驾驶和具身AI系统中越来越被部署,可靠的感知对于安全的语义推理和决策至关重要。尽管最近的VLMs在多模态基准测试中表现出色,但它们对现实感知退化的鲁棒性仍然不甚了解。在本工作中,我们系统地研究了在上游视觉感知受控退化下VLMs中的语义不匹配,使用Cityscapes数据集上的语义分割作为代表性感知模块。我们引入了感知现实的退化,这些退化仅在传统分割指标上引起适度下降,但观察到下游VLM行为的严重失败,包括虚构对象提及、安全关键实体的遗漏以及不一致的安全判断。为了量化这些影响,我们提出了一组语言层面的不匹配度量标准,以捕捉虚构、关键遗漏和安全误判,并分析这些度量标准与分割质量之间的关系,涵盖多个对比性和生成性VLMs。我们的结果揭示了像素级鲁棒性和多模态语义可靠性之间的明显脱节,突显了当前VLM基系统的一个关键局限性,并强调了需要评估框架的重要性,这些框架能够明确考虑安全关键应用中的感知不确定性。
Summary / 总结
This study investigates how robust Vision-Language Models (VLMs) are to degradation of upstream visual perception, a concern for autonomous driving and embodied AI systems. Using semantic segmentation on the Cityscapes dataset as a representative perception module, the research introduces perception-realistic corruptions and finds that even moderate drops in conventional segmentation metrics lead to severe downstream failures in VLMs, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. The study proposes language-level misalignment metrics to quantify these failures and highlights the need for evaluation frameworks that account for perception uncertainty in safety-critical applications.
该研究探讨了视觉-语言模型(VLMs)在自主驾驶和具身AI系统中对感知降级的鲁棒性。通过系统地对语义分割应用控制化的破坏,研究发现即使分割精度只有轻微下降,也可能导致VLMs严重的语义错位,包括幻觉、关键实体遗漏以及不一致的安全判断。作者提出了语言层面的错位度量标准来量化这些问题,并强调需要考虑感知不确定性在安全关键应用中的评估框架。
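To make the kind of language-level metric described in the entry above more concrete, here is a minimal set-based sketch. The definitions are assumptions made for illustration and are not the paper's metrics: `hallucination_rate` is taken as the share of mentioned objects that are absent from the scene, and `critical_omission_rate` as the share of safety-critical objects present in the scene that the model never mentions.

```python
def misalignment_rates(mentioned, present, critical):
    """Return (hallucination_rate, critical_omission_rate) for one scene.

    mentioned -- object names extracted from the VLM's description
    present   -- object names actually in the scene (ground truth)
    critical  -- object names treated as safety-critical (hypothetical list)
    """
    mentioned, present, critical = set(mentioned), set(present), set(critical)
    hallucinated = mentioned - present          # named by the model, not in the scene
    critical_present = present & critical       # safety-critical objects in the scene
    omitted = critical_present - mentioned      # safety-critical objects the model missed
    hallucination_rate = len(hallucinated) / len(mentioned) if mentioned else 0.0
    omission_rate = len(omitted) / len(critical_present) if critical_present else 0.0
    return hallucination_rate, omission_rate
```

Both rates are computed on the text output, so they can be tracked against segmentation quality as the perception input is degraded.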
Action100M: A Large-scale Video Action Dataset
Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
First: 2026-01-15T17:02:27+00:00 · Latest: 2026-01-15T17:02:27+00:00
Abstract
Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
中文标题/摘要
标题:Action100M:大规模视频动作数据集
从视觉观察中推断物理动作是推进物理世界机器智能的基本能力。实现这一目标需要涵盖广泛领域的大型、开放词汇量视频动作数据集。我们介绍了Action100M,这是一个从1.2M互联网教学视频(14.6年时长)构建的大规模数据集,包含O(100百万)个时间局部化片段,具有开放词汇量动作监督和丰富的描述。Action100M通过一个完全自动化的流水线生成,该流水线(i)使用V-JEPA 2嵌入进行分层时间分割,(ii)生成多级帧和片段描述组织成描述树,(iii)使用多轮Self-Refine程序下的推理模型(GPT-OSS-120B)聚合证据以输出结构化注释(简要/详细动作、演员、简要/详细描述)。在Action100M上训练VL-JEPA展示了在各种动作识别基准测试中一致的数据规模改进和强大的零样本性能,确立了Action100M作为视频理解和世界建模可扩展研究新基础的地位。
Summary / 总结
Action100M is a large-scale video action dataset built from 1.2 million Internet instructional videos (14.6 years of footage), yielding on the order of 100 million temporally localized segments with open-vocabulary action supervision and rich captions. The dataset is generated by a fully automated pipeline that performs hierarchical temporal segmentation with V-JEPA 2 embeddings, produces multi-level captions organized as a Tree-of-Captions, and aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations. Training VL-JEPA on Action100M shows consistent data-scaling improvements and strong zero-shot performance on diverse action recognition benchmarks, positioning Action100M as a foundation for scalable research in video understanding and world modeling.
Action100M 是一个来自 120 万条教学视频的大规模视频动作数据集,提供了超过 1 亿个时间局部化片段,带有开放词汇的动作监督。该数据集通过一个全自动流水线生成,包括层次时间分割、多级字幕生成和结构化注释精炼。在 Action100M 上训练视觉-语言模型显示了数据规模改进的一致性和在各种动作识别基准上的强大零样本性能,确立了 Action100M 作为大规模视频理解研究的基础资源的地位。
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Authors: Mikel Williams-Lekuona, Georgina Cosma
First: 2025-12-17T12:19:54+00:00 · Latest: 2026-01-15T16:58:39+00:00
Comments: Camera-ready version for ECIR 2026
Abstract
Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
中文标题/摘要
标题:基于图像复杂性的自适应检索以提高视觉语言模型的效率
视觉语言模型中的视觉变换器通常对每张图像使用相同的计算量,无论其简单与否。我们提出了ICAR(基于图像复杂性的自适应检索),这是一种自适应计算方法,使视觉变换器能够为简单的图像使用较少的计算量,而为复杂的图像通过其全网络深度进行处理。关键挑战是保持跨模态对齐:不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR 通过双路径训练解决这一问题,该训练产生来自早期退出路径和全深度路径的兼容嵌入。这在相同的语义空间中保持了图像表示和文本嵌入之间的兼容性,无论图像是否早期退出或完全处理。与现有的两阶段方法不同,ICAR 不需要昂贵的重排序,可以直接进行图像-文本匹配而无需额外开销。为了确定使用多少计算量,我们开发了ConvNeXt-IC,将其视为分类任务。通过应用现代分类器骨干网络而非专门的架构,ConvNeXt-IC 达到了最先进的性能,获得了与人工标注 0.959 的皮尔逊相关系数,同时实现了 4.4 倍更快的复杂性预测。在标准基准上增加了真实世界的网络数据,ICAR 在保持类别级性能的同时实现了 20% 更快的图像编码,并保持了 95% 的实例级性能,从而实现了视觉语言系统的可持续扩展。
Summary / 总结
The paper proposes ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that lets vision transformers use less compute for simple images while processing complex images through the full network depth. Dual-path training keeps embeddings from the early-exit and full-depth paths compatible with text embeddings, so image-text matching needs no reranking stage. On standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance. To decide how much compute to use, the authors develop ConvNeXt-IC, which treats image complexity assessment as a classification task and reaches a Pearson correlation of 0.959 with human labelling while delivering 4.4x faster complexity prediction.
论文提出了ICAR(图像复杂性感知检索)方法,该方法使视觉变换器能够减少简单图像的计算使用量,同时完全处理复杂图像。它使用双路径训练来保持跨模态对齐。ICAR在基准测试中实现了20%的更快图像编码速度,同时保持95%的实例级性能,有助于视觉语言系统的可持续扩展。ConvNeXt-IC 作为分类器骨干,用于评估图像复杂性,其性能达到最先进的水平,预测速度比现有方法快4.4倍。
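The routing decision in the ICAR entry above can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: `encode_adaptive`, the single `exit_layer`, and the scalar `threshold` are hypothetical, and the layers and complexity scorer are passed in as plain callables.

```python
def encode_adaptive(image, layers, exit_layer, complexity_fn, threshold):
    """Run an image through a stack of layers, exiting early if it is simple.

    image         -- input to the first layer
    layers        -- list of callables applied in order (stand-ins for transformer blocks)
    exit_layer    -- number of layers to run for simple images (hypothetical setting)
    complexity_fn -- callable returning a complexity score for the image
    threshold     -- scores below this value take the early-exit path

    Returns (embedding, depth_used).
    """
    is_simple = complexity_fn(image) < threshold
    depth = exit_layer if is_simple else len(layers)
    x = image
    for layer in layers[:depth]:
        x = layer(x)
    return x, depth
```

The sketch only covers routing. For the early-exit output to be usable for retrieval, both paths must produce embeddings in the same space as the text encoder, which the abstract attributes to dual-path training.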
Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Authors: Luxuan Fu, Chong Liu, Bisheng Yang, Zhen Dong
First: 2026-01-15T16:16:34+00:00 · Latest: 2026-01-15T16:16:34+00:00
Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
中文标题/摘要
标题:释放大型视觉语言模型在路边基础设施智能感知能力的潜力
城市路边基础设施的自动化感知对于智能城市管理至关重要,但通用模型往往难以捕捉到必要的细粒度属性和领域规则。虽然大型视觉语言模型(VLMs)在开放世界识别方面表现出色,但在遵循工程标准准确解释复杂设施状态方面却常常遇到困难,导致在实际应用中的性能不可靠。为了解决这一问题,我们提出了一种领域适应框架,将VLMs转化为专门的智能基础设施分析代理。我们的方法结合了数据高效微调策略和基于知识的推理机制。具体来说,我们利用Grounding DINO的开放式词汇微调来在最少监督的情况下稳健地定位各种资产,然后利用基于LoRA的Qwen-VL适应进行深入的语义属性推理。为了减轻幻觉并确保专业合规,我们引入了一个双模态检索增强生成(RAG)模块,在推理过程中动态检索权威的行业标准和视觉示例。在全面的新建城市路边场景数据集上进行评估,我们的框架实现了58.9的检测性能mAP和95.5%的属性识别准确率,展示了智能基础设施监控的稳健解决方案。
Summary / 总结
The research aims to improve automated perception of urban roadside infrastructure for smart city management, where general-purpose models struggle with fine-grained attributes and engineering standards. The proposed domain-adapted framework applies open-vocabulary fine-tuning to Grounding DINO to localize assets with minimal supervision, then LoRA-based adaptation of Qwen-VL for semantic attribute reasoning, and adds a dual-modality Retrieval-Augmented Generation (RAG) module that retrieves industry standards and visual exemplars at inference to reduce hallucinations. On a new dataset of urban roadside scenes, the framework achieves a detection performance of 58.9 mAP and 95.5% attribute recognition accuracy.
研究旨在通过改进通用模型的局限性,提高对城市路边基础设施的自动化感知,以支持智能城市管理。提出的领域适应框架通过开放词汇量微调和知识导向的推理技术来优化大型视觉语言模型,并引入双模态检索增强生成模块以确保符合行业标准。实验结果表明,检测性能为58.9 mAP,属性识别准确率为95.5%,表明这是一种稳健的智能基础设施监控解决方案。