LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
Authors: Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin
First: 2026-01-20T18:58:32+00:00 · Latest: 2026-01-20T18:58:32+00:00
Abstract
We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
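Title/Abstract (translated from Chinese)
Title: LightOnOCR: An End-to-End Multilingual Vision-Language Model for State-of-the-Art Optical Character Recognition
We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. The model is trained on a large-scale, high-quality distillation mix with broad coverage of scanned documents, French documents, and scientific PDFs. LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than the previous best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness through checkpoint averaging and task-arithmetic merging. We release the model checkpoints under the Apache 2.0 license, and publicly release the dataset and the LightOnOCR-bbox-bench evaluation under their respective licenses.
Summary
LightOnOCR-2-1B is a 1B-parameter end-to-end multilingual vision-language model that converts document images into clean text. It is trained on a large-scale distillation mix and achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. The model also predicts normalized bounding boxes for embedded images, a capability introduced during pretraining and refined with RLVR, and its robustness is improved with checkpoint averaging and task-arithmetic merging. The model checkpoints are released under Apache 2.0, and the dataset and evaluation benchmark under their respective licenses.
LightOnOCR-2-1B is a 1B-parameter end-to-end multilingual vision-language model that converts document images into clean text. It is trained on a large-scale, high-quality dataset and achieves state-of-the-art results on OlmOCR-Bench while being smaller and faster than previous models. The model predicts bounding boxes for embedded images, with localization introduced during pretraining via a resume strategy and refined by reinforcement learning with IoU-based rewards. Robustness is further improved through checkpoint averaging and task-arithmetic merging. The model and dataset are released under their respective licenses.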
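The abstract mentions IoU-based rewards for the predicted bounding boxes without giving the formula. The following is a minimal sketch of how such a reward might be computed, assuming boxes are `(x0, y0, x1, y1)` tuples in normalized [0, 1] coordinates. The function names and the best-match averaging in `iou_reward` are illustrative assumptions, not the authors' implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = max(0.0, ax1 - ax0) * max(0.0, ay1 - ay0)
    area_b = max(0.0, bx1 - bx0) * max(0.0, by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred_boxes, gold_boxes):
    """Hypothetical reward: mean over gold boxes of the best IoU with any prediction."""
    if not gold_boxes:
        # Nothing to localize: reward only an empty prediction.
        return 1.0 if not pred_boxes else 0.0
    if not pred_boxes:
        return 0.0
    best = [max(iou(g, p) for p in pred_boxes) for g in gold_boxes]
    return sum(best) / len(best)
```

A reward of this shape is verifiable from the annotation alone, which is what makes it usable in an RLVR loop. A practical version would likely also penalize spurious extra predictions, which this sketch ignores.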
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Authors: Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott
Venue: NeurIPS 2025
First: 2025-04-21T18:12:46+00:00 · Latest: 2026-01-20T17:55:29+00:00
Comments: 37 pages, 19 figures, NeurIPS 2025
Abstract
We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity to attention scores. These results imply that KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% from the non-evicting baseline with an 8K cache budget (~23% KV cache reduction) on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to other token-eviction methods.
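Title/Abstract (translated from Chinese)
Title: KeyDiff: A Key Similarity-Based KV Cache Eviction Method for Long-Context LLM Inference in Resource-Constrained Environments
We show that geometrically distinctive keys tend to receive high attention scores during LLM inference. Building on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts under strict resource constraints and generate responses efficiently. We provide a theoretical basis for KeyDiff by relating key diversity to attention scores. These results indicate that KeyDiff can effectively identify the important tokens to retain. Notably, KeyDiff does not depend on attention scores, which allows the use of optimized attention mechanisms such as FlashAttention. Under a strict memory limit, we demonstrate KeyDiff's effectiveness by observing, on LongBench, a performance gap of less than 0.04% from the non-evicting baseline for Llama 3.1-8B and Llama 3.2-3B with an 8K cache budget (about a 23% KV cache reduction). We also observe near-baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark, and reduce end-to-end inference latency by up to 30%.
Summary
KeyDiff is a training-free KV cache eviction method based on key similarity, designed for long-context LLM inference in resource-constrained environments. It processes long prompts and generates responses efficiently without relying on attention scores, which allows the use of optimized mechanisms like FlashAttention. With an 8K cache budget (about a 23% KV cache reduction), KeyDiff stays within 0.04% of the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. It also reduces end-to-end inference latency by up to 30% compared to other token-eviction methods.
KeyDiff is a training-free KV cache eviction method based on key similarity, suited to long-context LLM inference in resource-constrained environments. It does not rely on attention scores, handles long prompts and generates responses efficiently, and can take advantage of optimized attention mechanisms such as FlashAttention. For Llama 3.1-8B and Llama 3.2-3B, KeyDiff reduces KV cache size by about 23% with a performance loss of less than 0.04%, and it cuts end-to-end inference latency by up to 30% compared to other token-eviction methods.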
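To make the idea of eviction by key similarity concrete, here is a minimal NumPy sketch. It assumes that "geometrically distinctive" can be approximated by low cosine similarity to the mean of the unit-normalized cached keys. That scoring rule, the function name `evict_by_key_similarity`, and the single-head layout are assumptions for illustration. The abstract does not specify the exact criterion.

```python
import numpy as np


def evict_by_key_similarity(keys, budget):
    """Return the sorted indices of the `budget` cached keys to keep.

    keys: array of shape (n_tokens, head_dim) for a single attention head.
    Assumed rule: keys least similar to the mean key direction are kept.
    """
    n_tokens = keys.shape[0]
    if n_tokens <= budget:
        return np.arange(n_tokens)
    # Normalize each key so that only its direction matters.
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.maximum(norms, 1e-12)
    # Anchor direction: normalized mean of the unit keys.
    anchor = unit.mean(axis=0)
    anchor = anchor / max(np.linalg.norm(anchor), 1e-12)
    similarity = unit @ anchor
    # Lowest similarity means most distinctive, so those tokens are retained.
    keep = np.argsort(similarity, kind="stable")[:budget]
    return np.sort(keep)
```

Because the score depends only on the keys, nothing here needs the attention matrix. This is the property the abstract credits for compatibility with fused kernels such as FlashAttention.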
IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models
Authors: Liang Shi, Wei Li, Kevin M Beussman, Lin Chen, Yun Fu
First: 2026-01-20T17:45:24+00:00 · Latest: 2026-01-20T17:45:24+00:00
Abstract
Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., those where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, an approach that not only incurs substantial data collection and training costs but also struggles with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with persons, faces, pets, and general objects as the target instances.
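Title/Abstract (translated from Chinese)
Title: IIR-VLM: In-Context Instance-Level Recognition for Large Vision-Language Models
Instance-level recognition (ILR) is about telling individual instances apart, with person re-identification as a prominent example. Although modern VLMs have impressive visual perception capabilities, we find their ILR performance unsatisfactory, often well below that of specialized ILR models. This limitation hinders many practical applications of VLMs, for example those where recognizing familiar people and objects is essential to effective visual understanding. Existing solutions typically learn to recognize one instance at a time from instance-specific datasets, which incurs substantial data collection and training costs and struggles with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for in-context instance-level recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders that provide specialized features for learning diverse instances, enabling VLMs to learn new instances in context in a one-shot manner. IIR-VLM further uses this knowledge for instance-aware visual understanding. We validate IIR-VLM's effectiveness on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark that assesses ILR across varying difficulty and diverse categories, with persons, faces, pets, and general objects as the instances in the task.
Summary
The research aims to improve the instance-level recognition (ILR) capabilities of large vision-language models (VLMs), which often underperform domain-specific models. The method integrates pre-trained ILR expert models as auxiliary visual encoders so that VLMs can recognize new instances in a one-shot manner. Key findings include improved ILR performance on existing benchmarks and on a challenging new benchmark, demonstrating the model's ability to handle diverse categories and varying difficulty levels.
The research aims to improve the instance-level recognition (ILR) capability of large vision-language models (VLMs), which typically fall short of domain-specific models on ILR tasks. The method integrates pre-trained ILR expert models as auxiliary encoders so that VLMs can learn new instances in a one-shot manner. Key findings show that IIR-VLM outperforms existing models across benchmarks, particularly on a newly proposed challenging benchmark that assesses ILR across varying difficulty levels and categories.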
SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians
Authors: Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Stefano Gasperini, Hendrik P. A. Lensch, Nassir Navab, Federico Tombari
First: 2024-12-13T16:01:19+00:00 · Latest: 2026-01-20T17:27:32+00:00
Comments: 13 pages, 8 figures. Project page: supergseg.github.io
Abstract
3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While its vanilla representation is mainly designed for view synthesis, recent works extended it to scene understanding with language features. However, storing additional high-dimensional features per Gaussian for semantic information is memory-intensive, which limits their ability to segment and interpret challenging scenes. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context-aware hierarchical scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural 3D Gaussians to learn geometry, instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of Super-Gaussians. Super-Gaussians facilitate the lifting and distillation of 2D language features into 3D space. They enable hierarchical scene understanding with high-dimensional language feature rendering at moderate GPU memory costs. Extensive experiments demonstrate that SuperGSeg achieves remarkable performance on both open-vocabulary object selection and semantic segmentation tasks.
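Title/Abstract (translated from Chinese)
Title: SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians
3D Gaussian Splatting has recently attracted attention for its efficient training and real-time rendering. Although its basic representation is mainly designed for view synthesis, recent work has extended it to scene understanding with language features. However, storing additional high-dimensional features per Gaussian to carry semantic information is memory-intensive, which limits the ability to segment and interpret complex scenes. To address this, we propose SuperGSeg, a novel approach that fosters a cohesive, context-aware hierarchical scene representation by disentangling segmentation from language field distillation. SuperGSeg first uses neural 3D Gaussians to learn geometry, instance, and hierarchical segmentation features from multi-view images with the help of off-the-shelf 2D masks. These features are then used to create a sparse set of Super-Gaussians, which make it possible to lift 2D language features into 3D space. They enable hierarchical scene understanding with high-dimensional language feature rendering at moderate GPU memory cost. Extensive experiments show that SuperGSeg achieves remarkable performance on open-vocabulary object selection and semantic segmentation tasks.
Summary
SuperGSeg introduces a novel approach to 3D segmentation that uses structured Super-Gaussians to address the memory cost of storing high-dimensional semantic features per Gaussian. It employs neural 3D Gaussians to learn geometry and segmentation features from multi-view images and distills 2D language features into 3D space, achieving remarkable performance on open-vocabulary object selection and semantic segmentation tasks.
SuperGSeg proposes a new method based on structured Super-Gaussians to address the memory limits of storing high-dimensional features in 3D. It uses neural 3D Gaussians to learn geometry and segmentation features from multi-view images and lifts 2D language features into 3D space through Super-Gaussians, enabling efficient hierarchical scene understanding. Experiments show that SuperGSeg performs strongly on open-vocabulary object selection and semantic segmentation tasks.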
TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers
Authors: Bin Yu, Shijie Lian, Xiaopeng Lin, Yuliang Wei, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Xinming Wang, Bailing Wang, Cong Huang, Kai Chen
First: 2026-01-20T16:30:07+00:00 · Latest: 2026-01-20T16:30:07+00:00
Comments: GitHub: https://github.com/ZGC-EmbodyAI/TwinBrainVLA
Abstract
Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM retaining universal semantic understanding and a specialist VLM dedicated to embodied proprioception for joint robotic control. TwinBrainVLA synergizes a frozen "Left Brain", which retains robust general visual reasoning, with a trainable "Right Brain", specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. Extensive experiments on SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baselines while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM, offering a promising direction for building general-purpose robots that simultaneously achieve high-level semantic understanding and low-level physical dexterity.
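Title/Abstract (translated from Chinese)
Title: TwinBrainVLA: Coordinating a Generalist VLM and a Specialist Embodied VLM via Asymmetric Mixture-of-Transformers to Unlock Their Potential for Embodied Tasks
Standard Vision-Language-Action (VLA) models typically fine-tune a single Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, and it often leads to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we propose TwinBrainVLA, a novel architecture that coordinates a generalist VLM, which retains general semantic understanding, with a specialist VLM dedicated to embodied proprioception, for joint robotic control. Through a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism, TwinBrainVLA combines a frozen "Left Brain", which retains robust general visual reasoning, with a trainable "Right Brain" specialized for embodied perception. This design lets the Right Brain dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. Extensive experiments on the SimplerEnv and RoboCasa benchmarks show that TwinBrainVLA outperforms state-of-the-art baselines in manipulation performance while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM, offering a promising direction for building general-purpose robots that achieve both high-level semantic understanding and low-level physical dexterity.
Summary
TwinBrainVLA addresses the challenge of maintaining both high-level semantic understanding and low-level sensorimotor skills in robotic control by introducing a dual-VLM architecture. It consists of a frozen "Left Brain" for general visual reasoning and a trainable "Right Brain" specialized for embodied perception, connected via an Asymmetric Mixture-of-Transformers mechanism. Experiments show that TwinBrainVLA outperforms existing methods on manipulation tasks while preserving the VLM's comprehensive visual understanding, suggesting a promising approach for general-purpose robots.
TwinBrainVLA resolves the conflict between preserving high-level semantic understanding and learning low-level sensorimotor skills by introducing a dual-VLM architecture. It comprises a frozen "Left Brain" for general visual reasoning and a trainable "Right Brain" specialized for embodied perception, connected through an Asymmetric Mixture-of-Transformers mechanism. Experiments show that TwinBrainVLA outperforms existing methods on manipulation tasks while preserving the comprehensive visual understanding of the pre-trained VLM, which offers a promising direction for general-purpose robots that combine high-level semantic understanding with low-level physical dexterity.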
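The abstract describes the trainable Right Brain querying the frozen Left Brain but does not spell out the attention pattern. One plausible reading, sketched below under that assumption, is one-way attention. In it, the trainable stream's queries attend over the concatenated keys and values of both streams, while the frozen stream never attends to the trainable one. The function names and the single-head NumPy layout are hypothetical and are not taken from the TwinBrainVLA code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def right_stream_attention(q_right, k_right, v_right, k_left, v_left):
    """Single-head attention for the trainable stream (assumed asymmetric pattern).

    Right-stream queries see left-stream (frozen) and right-stream tokens;
    the left stream would run its own unmodified self-attention elsewhere.
    """
    keys = np.concatenate([k_left, k_right], axis=0)
    values = np.concatenate([v_left, v_right], axis=0)
    scores = q_right @ keys.T / np.sqrt(q_right.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values
```

In a training framework the left-stream keys and values would additionally be detached from the gradient graph so that the frozen model stays unchanged. Plain NumPy has no gradients, so that step is implicit here.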
Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing
Authors: Xiaolu Liu, Yicong Li, Qiyuan He, Jiayin Zhu, Wei Ji, Angela Yao, Jianke Zhu
First: 2026-01-20T16:03:22+00:00 · Latest: 2026-01-20T16:03:22+00:00
Comments: 22 pages, 12 figures
Abstract
Textured 3D morphing seeks to generate smooth and plausible transitions between two 3D assets, preserving both structural coherence and fine-grained appearance. This ability is crucial not only for advancing 3D generation research but also for practical applications in animation, editing, and digital content creation. Existing approaches either operate directly on geometry, limiting them to shape-only morphing while neglecting textures, or extend 2D interpolation strategies into 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges underscore the necessity to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition process. To address this, we propose Interp3D, a novel training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency via SLAT (Structured Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluations, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results from fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our proposed approach over previous methods. Source code is available at https://github.com/xiaolul2/Interp3D.
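Title/Abstract (translated from Chinese)
Title: Interp3D: Correspondence-Aware Interpolation for Generative Textured 3D Morphing
Textured 3D morphing aims to generate smooth and plausible transitions between two 3D assets while preserving structural coherence and fine-grained appearance. This capability matters both for advancing 3D generation research and for practical applications such as animation, editing, and digital content creation. Existing methods either operate directly on geometry, which restricts them to shape-only morphing and neglects textures, or extend 2D interpolation strategies to 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges highlight the need to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition. To address this, we propose Interp3D, a novel training-free framework for textured 3D morphing. It exploits generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency through SLAT (Structured Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluation, we build a dedicated dataset, Interp3DData, with graded difficulty levels, and assess generation results in terms of fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our approach over previous methods. Source code is available at https://github.com/xiaolul2/Interp3D.
Summary
Interp3D is a novel training-free framework for textured 3D morphing that addresses the limitations of existing methods by jointly preserving geometric consistency, texture alignment, and robustness. It uses generative priors and a progressive alignment principle, starting with semantically aligned interpolation in condition space, followed by SLAT-guided structure interpolation and fine-grained texture fusion. Evaluations on a custom dataset show that Interp3D outperforms previous methods in terms of generation fidelity, transition smoothness, and plausibility.
Interp3D is a novel 3D morphing framework that addresses the limitations of existing methods by jointly preserving geometric consistency, texture alignment, and robustness. It exploits generative priors and adopts a progressive alignment principle: it starts from semantically aligned interpolation in condition space, then applies SLAT-guided structure interpolation to ensure structural consistency, and finishes with fine-grained texture fusion. Evaluation on a custom dataset shows that Interp3D outperforms previous methods in generation fidelity, transition smoothness, and plausibility.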
Zero-shot adaptable task planning for autonomous construction robots: a comparative study of lightweight single and multi-AI agent systems
Authors: Hossein Naderi, Alireza Shojaei, Lifu Huang, Philip Agee, Kereshmeh Afsari, Abiola Akanmu
First: 2026-01-20T15:54:33+00:00 · Latest: 2026-01-20T15:54:33+00:00
Abstract
Robots are expected to play a major role in the future construction industry but face challenges due to high costs and difficulty adapting to dynamic tasks. This study explores the potential of foundation models to enhance the adaptability and generalizability of task planning in construction robots. Four models are proposed and implemented using lightweight, open-source large language models (LLMs) and vision language models (VLMs). These models include one single agent and three multi-agent teams that collaborate to create robot action plans. The models are evaluated across three construction roles: Painter, Safety Inspector, and Floor Tiling. Results show that the four-agent team outperforms the state-of-the-art GPT-4o in most metrics while being ten times more cost-effective. Additionally, teams with three and four agents demonstrate improved generalizability. By discussing how agent behaviors influence outputs, this study enhances the understanding of AI teams and supports future research in diverse unstructured environments beyond construction.
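Title/Abstract (translated from Chinese)
Title: Zero-Shot Adaptable Task Planning for Autonomous Construction Robots: A Comparative Study of Lightweight Single-Agent and Multi-Agent Systems
Robots are expected to play an important role in the future construction industry, but they face challenges from high costs and difficulty adapting to dynamic tasks. This study explores the potential of foundation models to improve the adaptability and generalizability of task planning for construction robots. Four models are proposed and implemented with lightweight, open-source large language models (LLMs) and vision language models (VLMs). They comprise one single agent and three multi-agent teams that collaborate to produce robot action plans. The models are evaluated on three construction roles: Painter, Safety Inspector, and Floor Tiling. Results show that the four-agent team outperforms the state-of-the-art GPT-4o on most metrics while being ten times more cost-effective. In addition, the three-agent and four-agent teams show improved generalizability. By discussing how agent behaviors influence outputs, this study deepens the understanding of AI teams and supports future research in diverse unstructured environments beyond construction.
Summary
This study aims to improve the adaptability and generalizability of task planning for autonomous construction robots using lightweight AI models. Four models, including a single agent and three multi-agent teams, were developed and tested in three construction roles: Painter, Safety Inspector, and Floor Tiling. The four-agent team outperformed GPT-4o in most metrics while being ten times more cost-effective. The study also found that teams with three and four agents showed better generalizability, providing insights into the behavior of AI teams in unstructured environments.
This study aims to improve the adaptability and generalizability of task planning for autonomous construction robots using lightweight AI models. Four models were developed, a single agent and three multi-agent teams, built on open-source large language models (LLMs) and vision language models (VLMs). The models were tested in three construction roles: Painter, Safety Inspector, and Floor Tiling. The four-agent team outperformed GPT-4o on most metrics and was more cost-effective. The study also found that the three-agent and four-agent teams generalized better. The work helps explain the behavior of AI teams and supports future research in unstructured environments.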
DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning
Authors: Abdurrahim Yilmaz, Ozan Erdem, Ece Gokyayla, Ayda Acar, Burc Bugra Dagtas, Dilara Ilhan Erdil, Gulsum Gencoglan, Burak Temelkuran
First: 2026-01-20T15:44:57+00:00 · Latest: 2026-01-20T15:44:57+00:00
Abstract
Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are required to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14,474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.
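Title/Abstract (translated from Chinese)
Title: DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning
Vision-language models (VLMs) are increasingly important in medical applications. However, their evaluation in dermatology is still limited by datasets that focus mainly on image-level classification tasks such as lesion recognition. Although such datasets are valuable for recognition, they cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are needed to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench contains 656 clinical images from 570 unique patients, covering Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, along with open-ended narrative descriptions and summaries, yielding approximately 14,474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.
Summary
DermaBench is a clinician-annotated benchmark dataset for dermatology VQA and reasoning, addressing the limitations of existing datasets by focusing on visual understanding, language grounding, and clinical reasoning. It uses a hierarchical annotation schema with 22 main questions to annotate 656 clinical images from 570 unique patients, resulting in approximately 14,474 VQA-style annotations. The dataset evaluates models' abilities to interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions.
DermaBench is a clinician-annotated benchmark dataset for visual question answering and reasoning in dermatology. It addresses the shortcomings of existing datasets by focusing on visual understanding, language grounding, and clinical reasoning. It applies a hierarchical annotation schema with 22 main questions to 656 clinical images from 570 unique patients, producing approximately 14,474 VQA-style annotations. The dataset evaluates how well models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions.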
Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model
Authors: Haoran Xu, Yanlin Liu, Zizhao Tong, Jiaze Li, Kexue Fu, Yuyang Zhang, Longxiang Gao, Shuaiguang Li, Xingyu Li, Yanran Xu, Changwei Wang
First: 2026-01-20T15:06:10+00:00 · Latest: 2026-01-20T15:06:10+00:00
Abstract
Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.
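Title/Abstract (translated from Chinese)
Title: Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Models
Out-of-distribution (OOD) detection is a critical task that has attracted wide attention. The emergence of CLIP has driven extensive research on zero-shot OOD detection, usually with training-free approaches. Current methods use expert knowledge from large language models (LLMs) to identify potential outliers. However, these methods tend to over-rely on knowledge in the text space and neglect the inherent challenges of detecting out-of-distribution samples in the image space. In this paper, we propose a new pipeline, MM-OOD, which uses the multimodal reasoning capabilities of MLLMs and their ability to hold multi-round conversations to enhance outlier detection. Our method is designed to improve performance on both near-OOD and far-OOD tasks. Specifically, (1) for near-OOD tasks, we feed ID images and the corresponding text prompts directly into MLLMs to identify potential outliers; and (2) for far-OOD tasks, we introduce the sketch-generate-elaborate framework: we first sketch outlier exposure with text prompts, then generate the corresponding visual OOD samples, and finally elaborate with multimodal prompts. Experiments show that our method achieves significant improvements on widely used multimodal datasets such as Food-101, and they also validate its scalability on ImageNet-1K.
Summary
The paper addresses the challenge of Out-of-Distribution (OOD) detection by proposing MM-OOD, a novel pipeline that leverages the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs). It directly feeds ID images and text prompts into MLLMs for near OOD tasks and uses a sketch-generate-elaborate framework for far OOD tasks. Experiments show that MM-OOD significantly improves OOD detection performance on datasets such as Food-101 and scales to ImageNet-1K.
The paper proposes MM-OOD, a new pipeline that uses the multimodal reasoning capabilities of multimodal large language models (MLLMs) to tackle out-of-distribution (OOD) detection. The method feeds ID images and text prompts directly into an MLLM for near-OOD tasks and uses a sketch-generate-elaborate framework for far-OOD tasks. Experiments show that MM-OOD significantly improves performance on widely used multimodal datasets such as Food-101 and ImageNet-1K, demonstrating its effectiveness in both near-OOD and far-OOD scenarios.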
Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology
Authors: Kaiyu Wu, Pucheng Han, Hualong Zhang, Naigeng Wu, Keze Wang
First: 2026-01-20T15:00:15+00:00 · Latest: 2026-01-20T15:00:15+00:00
Abstract
While Vision Language Models (VLMs) show advancing reasoning capabilities, their application in meteorology is constrained by a domain gap and a reasoning faithfulness gap. Specifically, mainstream Reinforcement Fine-Tuning (RFT) can induce Self-Contradictory Reasoning (Self-Contra), where the model's reasoning contradicts its final answer, which is unacceptable in such a high-stakes domain. To address these challenges, we construct WeatherQA, a novel multimodal reasoning benchmark in meteorology. We also propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which resolves Self-Contra by introducing a logical consistency reward. Furthermore, we introduce Weather-R1, the first reasoning VLM with logical faithfulness in meteorology, to the best of our knowledge. Experiments demonstrate that Weather-R1 improves performance on WeatherQA by 9.8 percentage points over the baseline, outperforming Supervised Fine-Tuning and RFT, and even surpassing the original Qwen2.5-VL-32B. These results highlight the effectiveness of our LoCo-RFT and the superiority of Weather-R1. Our benchmark and code are available at https://github.com/Marcowky/Weather-R1.
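Title/Abstract (translated from Chinese)
Title: Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology
Although Vision Language Models (VLMs) show increasingly strong reasoning capabilities, their application in meteorology is limited by a domain gap and a reasoning faithfulness gap. Specifically, mainstream Reinforcement Fine-Tuning (RFT) can induce Self-Contradictory Reasoning (Self-Contra), in which the model's reasoning contradicts its final answer, which is unacceptable in such a high-stakes domain. To address these challenges, we construct WeatherQA, a novel multimodal reasoning benchmark for meteorology. We also propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which resolves Self-Contra by introducing a logical consistency reward. In addition, we introduce Weather-R1, to our knowledge the first reasoning VLM with logical faithfulness in meteorology. Experiments show that Weather-R1 improves performance on WeatherQA by 9.8 percentage points over the baseline, outperforming Supervised Fine-Tuning and RFT, and even surpassing the original Qwen2.5-VL-32B. These results highlight the effectiveness of LoCo-RFT and the superiority of Weather-R1. Our benchmark and code are available at https://github.com/Marcowky/Weather-R1.
Summary
The research aims to enhance the reasoning capabilities of Vision Language Models (VLMs) in meteorology by addressing domain and reasoning faithfulness gaps. To achieve this, the authors propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which introduces a logical consistency reward to prevent self-contradictory reasoning. The method is applied to develop Weather-R1, described as the first reasoning VLM with logical faithfulness in meteorology. Experiments show that Weather-R1 improves on the baseline by 9.8 percentage points on the WeatherQA benchmark and even surpasses the original Qwen2.5-VL-32B.
The research addresses the limitations of vision language models (VLMs) in meteorology, in particular the domain gap and the reasoning faithfulness gap. To tackle them, the authors introduce WeatherQA, a new multimodal reasoning benchmark, and propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT) to prevent self-contradictory reasoning. With this method, the reasoning VLM Weather-R1 improves its WeatherQA performance by 9.8 percentage points, surpassing the baseline, Supervised Fine-Tuning, and RFT. This demonstrates the effectiveness of LoCo-RFT and the strength of Weather-R1 in meteorological reasoning.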
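The logical consistency reward is named in the abstract but not defined there. The sketch below shows one simple way such a reward could be shaped for multiple-choice questions. It assumes the reasoning ends with a phrase such as "the answer is B" and that the final answer is a single option letter. The phrase pattern, the additive combination, and the `consistency_weight` parameter are illustrative assumptions rather than the LoCo-RFT definition.

```python
import re

# Matches phrases such as "answer is B" or "answer is (B)"; only the words are case-insensitive.
_CONCLUSION = re.compile(r"(?i:answer is)\s*\(?([A-D])\b\)?")


def concluded_choice(reasoning):
    """Return the last option letter the reasoning commits to, or None."""
    matches = _CONCLUSION.findall(reasoning)
    return matches[-1] if matches else None


def consistency_aware_reward(reasoning, final_answer, gold, consistency_weight=0.5):
    """Hypothetical reward: accuracy plus a bonus when reasoning and answer agree."""
    accuracy = 1.0 if final_answer == gold else 0.0
    concluded = concluded_choice(reasoning)
    consistent = 1.0 if concluded is not None and concluded == final_answer else 0.0
    return accuracy + consistency_weight * consistent
```

Under this shaping, a response whose reasoning concludes "B" but whose final answer is a correct "C" scores 1.0 rather than 1.5, so self-contradictory outputs are never the highest-reward option.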
MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting
Authors: Youngmoon Jung, Myunghun Jung, Joon-Young Yang, Yong-Hyeok Lee, Jaeyoung Roh, Hoon-Young Cho
Venue: ICASSP 2026
First: 2026-01-20T14:30:40+00:00 · Latest: 2026-01-20T14:30:40+00:00
Comments: 5 pages, 1 figure, Accepted at ICASSP 2026
Abstract
Open-vocabulary keyword spotting (KWS) with text-based enrollment has emerged as a flexible alternative to fixed-phrase triggers. Prior utterance-level matching methods, from an embedding-learning standpoint, learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities within a single vector via nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding for each prefix size serve as teacher targets to align both audio and text prefixes. This alignment concentrates salient keyword cues in lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS, and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, achieving state-of-the-art results on WSJ and LibriPhrase without any inference overhead.
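Title/Abstract (translated from Chinese)
Title: MATE: Matryoshka Audio-Text Embeddings for Open-Vocabulary Keyword Spotting
Open-vocabulary keyword spotting (KWS) with text-based enrollment has become a flexible alternative to fixed-phrase triggers. From an embedding-learning standpoint, prior utterance-level matching methods learn embeddings at a single fixed dimensionality. We depart from this design and propose Matryoshka Audio-Text Embeddings (MATE), a dual-encoder framework that encodes multiple embedding granularities in a single vector through nested sub-embeddings ("prefixes"). Specifically, we introduce a PCA-guided prefix alignment: PCA-compressed versions of the full text embedding at each prefix size serve as teacher targets for aligning both the audio and the text prefixes. This alignment concentrates salient keyword cues in the lower-dimensional prefixes, while higher dimensions add detail. MATE is trained with standard deep metric learning objectives for audio-text KWS and is loss-agnostic. To our knowledge, this is the first application of matryoshka-style embeddings to KWS, and it achieves state-of-the-art results on WSJ and LibriPhrase without any inference overhead.
Summary
The research aims to improve open-vocabulary keyword spotting by addressing the limitations of fixed-dimensional embeddings. The proposed Matryoshka Audio-Text Embeddings (MATE) uses a dual-encoder framework with nested sub-embeddings (prefixes) to encode multiple granularities within a single vector. This method employs PCA-guided prefix alignment to align both audio and text prefixes, concentrating salient keyword cues in lower dimensions and adding detail in higher dimensions. MATE achieves state-of-the-art results on the WSJ and LibriPhrase datasets without additional inference overhead.
The research aims to improve open-vocabulary keyword spotting by addressing the limitations of embeddings with a single fixed dimensionality. The proposed Matryoshka Audio-Text Embeddings (MATE) uses a dual-encoder framework with nested sub-embeddings to hold multiple embedding granularities in a single vector. The method applies PCA-guided prefix alignment, which concentrates keyword cues in the lower dimensions and adds detail in the higher dimensions. Experiments show that MATE reaches state-of-the-art results on the WSJ and LibriPhrase datasets with no additional inference cost.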
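The PCA-guided prefix alignment can be illustrated with a short NumPy sketch. It computes, for each prefix size k, the projection of the centered full text embeddings onto their top-k principal components, and then scores how closely each nested prefix of an embedding matches that target. The mean-squared-error objective and both function names are assumptions made for illustration. The abstract states only that PCA-compressed text embeddings act as teacher targets.

```python
import numpy as np


def pca_teacher_targets(text_emb, prefix_sizes):
    """Map each prefix size k to a (n_samples, k) PCA-compressed teacher target.

    text_emb: (n_samples, dim) full text embeddings; k must not exceed min(n_samples, dim).
    """
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return {k: centered @ vt[:k].T for k in prefix_sizes}


def prefix_alignment_loss(emb, targets):
    """Assumed objective: mean squared error between each prefix emb[:, :k] and its target."""
    losses = [np.mean((emb[:, :k] - target) ** 2) for k, target in targets.items()]
    return float(np.mean(losses))
```

Because the principal components are ordered by variance, the target for a smaller prefix is exactly the leading columns of the target for a larger one. This matches the nested structure the abstract describes, in which low-dimensional prefixes carry the most salient cues.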
Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval
Authors: Joaquín Polonuer, Lucas Vittor, Iñaki Arango, Ayush Noori, David A. Clifton, Luciano Del Corro, Marinka Zitnik
First: 2026-01-20T13:46:37+00:00 · Latest: 2026-01-20T13:46:37+00:00
Abstract
Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to follow relational links. Similarity-based retrievers provide coverage but remain shallow, whereas traversal-based methods rely on selecting seed nodes to start exploration, which can fail when queries span multiple entities and relations. We introduce ARK: Adaptive Retriever of Knowledge, an agentic KG retriever that gives a language model control over this breadth-depth tradeoff using a two-operation toolset: global lexical search over node descriptors and one-hop neighborhood exploration that composes into multi-hop traversal. ARK alternates between breadth-oriented discovery and depth-oriented expansion without depending on fragile seed selection or a pre-set hop depth, and without requiring retrieval training. ARK adapts tool use to queries, using global search for language-heavy queries and neighborhood exploration for relation-heavy queries. On STaRK, ARK reaches 59.1% average Hit@1 and 67.4 average MRR, improving average Hit@1 by up to 31.4% and average MRR by up to 28.0% over retrieval-based and agentic training-free methods. Finally, we distill ARK's tool-use trajectories from a large teacher into an 8B model via label-free imitation, improving Hit@1 by +7.0, +26.6, and +13.5 absolute points over the base 8B model on the AMAZON, MAG, and PRIME datasets, respectively, while retaining up to 98.5% of the teacher's Hit@1 rate.
中文标题/摘要
标题:自主知识图谱探索与自适应广度深度检索
从知识图谱中检索语言模型查询的证据需要在广泛搜索图谱与多跳遍历以跟随关系链接之间取得平衡。基于相似性的检索器可以提供覆盖范围,但仍然较浅,而基于遍历的方法依赖于选择种子节点来开始探索,当查询跨越多个实体和关系时,这种方法可能会失败。我们引入了ARK:知识检索者,这是一种自主的知识图谱检索器,通过使用包含全局词汇搜索节点描述符和一跳邻域探索的两操作工具集,赋予语言模型控制广度深度权衡。ARK在不依赖脆弱的种子选择、预设的跳数或需要检索训练的情况下,交替进行广度导向的发现和深度导向的扩展。ARK根据查询的特征调整工具使用,对于语言密集型查询使用全局搜索,对于关系密集型查询使用邻域探索。在STaRK上,ARK达到平均Hit@1为59.1%和平均MRR为67.4%,分别比基于检索的方法和无训练的自主方法提高平均Hit@1高达31.4%和平均MRR高达28.0%。最后,我们通过无标签模仿从一个大型教师中提炼出ARK的工具使用轨迹,将该轨迹应用于8B模型,分别在AMAZON、MAG和PRIME数据集上提高了Hit@1绝对值7.0、26.6和13.5个百分点,同时保留了教师高达98.5%的Hit@1率。
Summary / 总结
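ARK (Adaptive Retriever of Knowledge) is an agentic knowledge-graph retriever that lets a language model control the breadth-depth tradeoff with two tools: global lexical search over node descriptors and one-hop neighborhood exploration that composes into multi-hop traversal. It needs no seed selection, preset hop depth, or retrieval training. On STaRK, ARK reaches 59.1% average Hit@1 and 67.4 average MRR. Its tool-use trajectories can also be distilled into an 8B model that retains up to 98.5% of the teacher's Hit@1 rate.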
该论文提出了ARK,一种通过使用全局词汇搜索和一跳邻域探索两种操作工具集来平衡广度和深度搜索的自适应知识图检索器。ARK无需种子节点或固定跳数即可在广泛发现和深入遍历之间交替,从而在STaRK数据集上实现了检索准确性的显著提升。此外,ARK在较小模型中进行提炼时也表现出色,增强了在各种数据集上的检索效果。
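The ARK entry above reports Hit@1 and MRR. As a reading aid, here is a minimal sketch of how these two ranking metrics are commonly computed. The function names and the toy queries are illustrative assumptions, not taken from the paper or from the STaRK evaluation code.

```python
from typing import Iterable, List, Set, Tuple


def hit_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 1) -> float:
    """Return 1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(queries: Iterable[Tuple[List[str], Set[str]]]) -> Tuple[float, float]:
    """Average Hit@1 and MRR over (ranking, relevant set) pairs."""
    queries = list(queries)
    n = len(queries)
    hit1 = sum(hit_at_k(ranked, rel, k=1) for ranked, rel in queries) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in queries) / n
    return hit1, mrr


# Toy data (hypothetical): the first query is answered at rank 1, the second at rank 3.
toy_queries = [(["a", "b"], {"a"}), (["x", "y", "z"], {"z"})]
toy_hit1, toy_mrr = evaluate(toy_queries)
```

On the toy data, Hit@1 averages one hit and one miss, while MRR also credits the rank-3 answer with 1/3. This is why MRR is usually the higher of the two numbers.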
HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs
Authors: Yuezhe Yang, Hao Wang, Yige Peng, Jinman Kim, Lei Bi
First: 2026-01-20T12:48:09+00:00 · Latest: 2026-01-20T12:48:09+00:00
Comments: Under Review
Abstract
Automated clinical diagnosis remains a core challenge in medical AI, as it usually requires models to integrate multi-modal data and reason across complex, case-specific contexts. Although recent methods have advanced medical report generation (MRG) and visual question answering (VQA) with medical vision-language models (VLMs), they predominantly operate under a sample-isolated inference paradigm, processing cases independently without access to longitudinal electronic health records (EHRs) or structurally related patient examples. This paradigm limits reasoning to image-derived information alone and ignores external complementary medical evidence that could support more accurate diagnosis. To overcome this limitation, we propose \textbf{HyperWalker}, a \textit{Deep Diagnosis} framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training. First, we construct a dynamic hypergraph, termed \textbf{iBrochure}, to model the structural heterogeneity of EHR data and implicit high-order associations among multimodal clinical information. Within this hypergraph, a reinforcement learning agent, \textbf{Walker}, navigates the graph and identifies optimal diagnostic paths. To ensure comprehensive coverage of diverse clinical characteristics in test samples, we incorporate a \textit{linger mechanism}, a multi-hop orthogonal retrieval strategy that iteratively selects clinically complementary neighborhood cases reflecting distinct clinical attributes. Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance. Code is available at: https://github.com/Bean-Young/HyperWalker
中文标题/摘要
标题:HyperWalker:基于动态超图的多跳临床建模深度诊断方法,用于医疗VLM中的EHR和X光
自动化临床诊断仍然是医疗AI的核心挑战,通常需要模型整合多模态数据并在复杂的、案例特定的上下文中进行推理。尽管最近的方法在医疗报告生成(MRG)和医学视觉问答(VQA)方面推进了医学视觉语言模型(VLMs)的应用,但这些方法主要在样本孤立的推理范式下运行,即独立处理病例,不访问纵向电子健康记录(EHRs)或结构相关的患者示例。这种范式限制了推理仅限于图像衍生的信息,而忽略了外部互补的医学证据,可能导致更准确的诊断。为克服这一限制,我们提出了一种名为\textbf{HyperWalker}的\textit{深度诊断}框架,通过动态超图和测试时训练重新定义临床推理。首先,我们构建了一个动态超图,称为\textbf{iBrochure},以建模EHR数据的结构异质性和多模态临床信息中的隐式高阶关联。在此超图中,强化学习代理\textbf{Walker}导航并识别最佳诊断路径。为了确保测试样本中各种临床特征的全面覆盖,我们引入了一种\textit{滞留机制},这是一种多跳正交检索策略,通过迭代选择反映不同临床属性的临床互补邻域病例。在MIMIC上的MRG和EHRXQA上的医学VQA实验表明,HyperWalker达到了最先进的性能。代码可在:https://github.com/Bean-Young/HyperWalker获取。
Summary / 总结
HyperWalker is a deep diagnosis framework that addresses the limitations of sample-isolated inference by integrating longitudinal electronic health records and structurally related patient examples. It constructs a dynamic hypergraph, termed iBrochure, to model the structural heterogeneity of EHR data and multimodal clinical information. A reinforcement learning agent, Walker, navigates this hypergraph to identify optimal diagnostic paths. The linger mechanism ensures comprehensive coverage of diverse clinical characteristics. Experiments show that HyperWalker outperforms existing methods in medical report generation and visual question answering tasks on MIMIC and EHRXQA datasets.
HyperWalker 是一个通过动态超图和测试时训练来整合纵向电子健康记录和结构相关患者示例的深度诊断框架,以克服样本孤立推理的局限性。它构建了一个 iBrochure 超图来建模 EHR 数据和多模态临床信息,并使用强化学习代理 Walker 导航并识别最优诊断路径。徘徊机制确保了对不同临床特征的全面覆盖。实验表明,HyperWalker 在医学报告生成和医学视觉问答任务中优于现有方法。
Deferred Commitment Decoding for Diffusion Language Models
Authors: Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen
First: 2026-01-05T12:57:33+00:00 · Latest: 2026-01-20T12:30:52+00:00
Abstract
Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding certainty and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a certainty-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.73% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 16.5%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.
中文标题/摘要
标题:面向扩散语言模型的延迟承诺解码
扩散语言模型(DLMs)最近已成为自回归模型的强大替代方案,通过实现并行文本生成。为了提高推理效率和KV缓存兼容性,先前的工作通常采用基于块的扩散,逐块解码令牌。然而,这种范式遭受了一个我们称之为边界诱导上下文截断(BICT)的结构性限制:接近块边界的未解码令牌被迫在无法访问附近未来上下文的情况下做出承诺,即使这种上下文可以显著减少不确定性。这一限制降低了解码的确定性和生成质量,特别是在需要精确推理的任务中,如数学问题解决和代码生成。我们提出了一种名为延迟承诺解码(DCD)的新型、无需训练的解码策略,以缓解这一问题。DCD 维持一个对遮蔽令牌的不确定性感知滑动窗口,早期解决低不确定性令牌,直到有足够的上下文证据才推迟高不确定性令牌。在多个扩散语言模型、基准测试和缓存配置的广泛实验中显示,与固定块扩散方法相比,DCD 在平均时间相同的情况下提高了生成准确性 1.73%,最高改善幅度达到 16.5%。这些结果表明,基于不确定性推迟令牌承诺是提高扩散语言模型解码质量和效率的一个简单而有效的原则。
Summary / 总结
The paper addresses the issue of Boundary-Induced Context Truncation (BICT) in block-based diffusion language models, which limits the access to future context for undecoded tokens near block boundaries. To overcome this, the authors propose Deferred Commitment Decoding (DCD), a training-free method that uses a certainty-aware sliding window to resolve low-uncertainty tokens early and defer high-uncertainty tokens until sufficient context is available. Experiments show that DCD improves generation accuracy by 1.73% on average compared to fixed block-based methods, with the best improvement reaching 16.5%.
论文针对块基扩散语言模型中的边界诱导上下文截断(BICT)问题,该问题限制了接近块边界未解码令牌的上下文访问。它引入了延迟承诺解码(DCD),这是一种无需训练的解码策略,通过维护一个不确定性感知的滑动窗口,提前解决低不确定性令牌并延迟高不确定性令牌直到获得足够的上下文证据。实验结果显示,DCD 平均提高了生成准确性 1.73%,最高可达 16.5%,同时保持与固定块基方法相当的推理时间。
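The deferral principle described in the DCD entry above can be illustrated with a toy sketch. This is not the authors' algorithm. The `predict` callback, the window size, the confidence threshold, and the fallback rule (commit the single most confident token when nothing clears the threshold) are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Prediction = Tuple[str, float]  # (token, confidence in [0, 1])


def deferred_commit_decode(
    length: int,
    predict: Callable[[List[Optional[str]], int], Prediction],
    window: int = 4,
    threshold: float = 0.9,
) -> Tuple[List[Optional[str]], List[int]]:
    """Fill `length` masked slots, committing confident tokens first.

    Returns the decoded tokens and the order in which positions were committed.
    """
    tokens: List[Optional[str]] = [None] * length
    order: List[int] = []
    while any(t is None for t in tokens):
        masked = [i for i, t in enumerate(tokens) if t is None]
        active = masked[:window]  # sliding window over the leftmost masked slots
        # Score every active slot against the same snapshot of the sequence.
        preds = {i: predict(tokens, i) for i in active}
        confident = [i for i in active if preds[i][1] >= threshold]
        if not confident:
            # Assumed fallback: guarantee progress by committing the best candidate.
            confident = [max(active, key=lambda i: preds[i][1])]
        for i in confident:
            tokens[i] = preds[i][0]
            order.append(i)
    return tokens, order


def toy_predict(tokens: List[Optional[str]], i: int) -> Prediction:
    """Hypothetical model: position 1 is uncertain until position 2 is known."""
    if i == 1 and tokens[2] is None:
        return ("b", 0.2)
    return ("abc"[i], 0.95)


decoded, commit_order = deferred_commit_decode(3, toy_predict, window=3)
```

In the toy run, positions 0 and 2 are committed in the first step. Position 1 is deferred until its right-hand context exists, which is the behavior a fixed left-to-right block boundary would prevent.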
OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3
Authors: Xu Zhang, Danyang Li, Yingjie Xia, Xiaohang Dong, Hualong Yu, Jianye Wang, Qicheng Li
First: 2026-01-20T12:25:41+00:00 · Latest: 2026-01-20T12:25:41+00:00
Abstract
Change Detection (CD) is a fundamental task in remote sensing that monitors the evolution of land cover over time. Open-Vocabulary Change Detection (OVCD) builds on it with a new requirement: reducing the reliance on predefined categories. Existing training-free OVCD methods mostly use CLIP to identify categories and need extra models such as DINO to extract features. However, combining different models often causes feature-matching problems and makes the system unstable. The recently introduced Segment Anything Model 3 (SAM 3) integrates segmentation and identification capabilities within one promptable model, which offers new possibilities for the OVCD task. In this paper, we propose OmniOVCD, a standalone framework designed for OVCD. By leveraging the decoupled output heads of SAM 3, we propose a Synergistic Fusion to Instance Decoupling (SFID) strategy. SFID first fuses the semantic, instance, and presence outputs of SAM 3 to construct land-cover masks, and then decomposes them into individual instance masks for change comparison. This design preserves high accuracy in category recognition and maintains instance-level consistency across images. As a result, the model can generate accurate change masks. Experiments on four public benchmarks (LEVIR-CD, WHU-CD, S2Looking, and SECOND) demonstrate SOTA performance, achieving IoU scores of 67.2, 66.5, 24.5, and 27.1 (class-average), respectively, surpassing all previous methods.
中文标题/摘要
标题:OmniOVCD:借助SAM 3简化开放词汇变化检测
变化检测(CD)是遥感中的一个基本任务,用于监测土地覆盖随时间的变化。基于此,开放词汇变化检测(OVCD)引入了新的要求,旨在减少对预定义类别的依赖。现有的无需训练的OVCD方法大多使用CLIP来识别类别,这些方法还需要额外的模型如DINO来提取特征。然而,将不同模型结合在一起往往会导致特征匹配问题,使系统不稳定。最近,引入了Segment Anything Model 3(SAM 3),它将分割和识别能力整合在一个可提示模型中,为OVCD任务提供了新的可能性。本文中,我们提出了OmniOVCD,这是一种独立框架,专门用于OVCD。通过利用SAM 3的解耦输出头,我们提出了协同融合到实例解耦(SFID)策略。SFID首先将SAM 3的语义、实例和存在输出融合以构建土地覆盖掩码,然后将它们分解为个体实例掩码以进行变化比较。这种设计在类别识别中保持了高精度,并在图像中保持了实例级的一致性。因此,该模型可以生成准确的变化掩码。在四个公开基准(LEVIR-CD、WHU-CD、S2Looking和SECOND)上的实验展示了SOTA性能,分别实现了IoU分数67.2、66.5、24.5和27.1(类别平均),超越了所有先前的方法。
Summary / 总结
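OmniOVCD is a standalone framework for open-vocabulary change detection (OVCD) built on the Segment Anything Model 3 (SAM 3), avoiding the feature-matching problems of pipelines that combine CLIP with extra models such as DINO. Its Synergistic Fusion to Instance Decoupling (SFID) strategy fuses SAM 3's semantic, instance, and presence outputs into land-cover masks, then decomposes them into individual instance masks for change comparison. On LEVIR-CD, WHU-CD, S2Looking, and SECOND it reports IoU scores of 67.2, 66.5, 24.5, and 27.1 (class-average), respectively.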
OmniOVCD 是一个用于开放词汇变化检测 (OVCD) 的框架,利用 Segment Anything Model 3 (SAM 3) 简化过程。它使用一种名为 Synergistic Fusion to Instance Decoupling (SFID) 的策略,将 SAM 3 的输出融合和分解,以实现准确的土地覆盖图构建和个体实例掩码生成,用于变化比较。实验结果显示,OmniOVCD 在四个基准上的表现优于先前方法,分别实现了 67.2、66.5、24.5 和 27.1(类别平均)的 IoU 分数。
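The IoU scores quoted in the OmniOVCD entry above compare predicted change masks with ground truth. Here is a minimal sketch of mask IoU and a class-averaged variant, assuming masks are given as nested lists of 0/1 values. The convention of scoring two empty masks as 1.0 is an assumption, and the benchmarks' exact evaluation protocol may differ.

```python
from typing import Dict, List

Mask = List[List[int]]  # rows of 0/1 pixels


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def class_average_iou(pred: Dict[str, Mask], target: Dict[str, Mask]) -> float:
    """Mean IoU over the classes present in the ground truth."""
    scores = [mask_iou(pred[name], target[name]) for name in target]
    return sum(scores) / len(scores)


# Toy 2x2 masks (hypothetical): one overlapping pixel out of three marked pixels.
toy_pred = [[1, 1], [0, 0]]
toy_target = [[1, 0], [1, 0]]
toy_iou = mask_iou(toy_pred, toy_target)
```

Because the union sits in the denominator, IoU penalizes both missed changes and false alarms. This partly explains why scores on hard benchmarks can remain low even when most pixels are classified correctly.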
Revisiting Multi-Task Visual Representation Learning
Authors: Shangzhe Di, Zhonghua Zhai, Weidi Xie
First: 2026-01-20T11:59:19+00:00 · Latest: 2026-01-20T11:59:19+00:00
Comments: Code: https://github.com/Becomebright/MTV
Abstract
Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.
中文标题/摘要
标题:重访多任务视觉表示学习
当前的视觉表示学习仍然分裂:视觉-语言模型(例如,CLIP)在全局语义对齐方面表现出色,但在空间精度方面有所欠缺,而自监督方法(例如,MAE,DINO)能够捕捉复杂的局部结构,但在高层语义上下文方面存在困难。我们认为这些范式本质上是互补的,并可以整合到一个原理性的多任务框架中,进一步通过密集的空间监督加以增强。我们引入了MTV,这是一种多任务视觉预训练框架,联合优化视觉-语言对比、自监督和密集空间目标的共享骨干网络。为了减少手动注释的需求,我们利用高容量的“专家”模型——例如Depth Anything V2和OWLv2——大规模合成密集的结构化伪标签。除了框架之外,我们还系统地探讨了多任务视觉学习的机制,分析了:(i) 每个目标的边际收益,(ii) 任务协同作用与干扰,以及(iii) 在不同数据和模型规模下的扩展行为。我们的结果表明,MTV实现了“兼收并蓄”的性能,显著增强了细粒度的空间推理,而不会牺牲全局语义理解。我们的研究结果表明,由高质量的伪监督驱动的多任务学习是一种通向更通用视觉编码器的可扩展途径。
Summary / 总结
This paper addresses the split between vision-language and self-supervised visual representation learning by proposing MTV, a multi-task pretraining framework. MTV jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives, with dense pseudo-labels synthesized by expert models such as Depth Anything V2 and OWLv2. The reported results show that MTV significantly improves fine-grained spatial reasoning without compromising global semantic understanding. The study also analyzes the marginal gain of each objective, task synergies versus interference, and scaling behavior across data and model scales.
该论文通过提出一个多任务框架MTV来解决当前视觉表示学习方法的局限性。它结合了视觉-语言对比学习、自我监督学习和密集的空间监督,以增强全局语义理解和精细的空间推理。作者利用专家模型生成大规模的伪标签,并通过实验表明,MTV在局部和全局理解方面均优于单任务方法,这表明多任务学习结合高质量的伪监督是视觉编码器更具扩展性的方向。
OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object Counting
Authors: Michail Spanakis, Iason Oikonomidis, Antonis Argyros
First: 2026-01-20T11:36:38+00:00 · Latest: 2026-01-20T11:36:38+00:00
Abstract
Class-Agnostic object Counting (CAC) involves counting instances of objects from arbitrary classes within an image. Due to its practical importance, CAC has received increasing attention in recent years. Most existing methods assume a single object class per image, rely on extensive training of large deep learning models and address the problem by incorporating additional information, such as visual exemplars or text prompts. In this paper, we present OCCAM, the first training-free approach to CAC that operates without the need of any supplementary information. Moreover, our approach addresses the multi-class variant of the problem, as it is capable of counting the object instances in each and every class among arbitrary object classes within an image. We leverage Segment Anything Model 2 (SAM2), a foundation model, and a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm to achieve competitive performance on widely used benchmark datasets, FSC-147 and CARPK. We propose a synthetic multi-class dataset and F1 score as a more suitable evaluation metric. The code for our method and the proposed synthetic dataset will be made publicly available at https://mikespanak.github.io/OCCAM_counter.
中文标题/摘要
标题:OCCAM:无类别依赖、无需训练、无需先验的多类别物体计数
无类别依赖物体计数(Class-Agnostic Object Counting, CAC)涉及对图像中任意类别的物体实例进行计数。由于其实际重要性,CAC 近年来受到了越来越多的关注。大多数现有方法假设每张图像只有一个物体类别,依赖于大型深度学习模型的大量训练,并通过引入额外信息(如视觉示例或文本提示)来解决该问题。在本文中,我们提出了 OCCAM,这是第一个无需训练的 CAC 方法,无需任何辅助信息即可运行。此外,我们的方法解决了多类别问题的变体,因为它能够对图像中任意类别中的每个物体实例进行计数。我们利用 Segment Anything Model 2 (SAM2) 基础模型和 First Integer Neighbor Clustering Hierarchy (FINCH) 算法的自定义阈值变体来在广泛使用的基准数据集 FSC-147 和 CARPK 上实现竞争力的性能。我们提出了一个合成的多类别数据集和 F1 分数作为更合适的评估指标。我们的方法代码和提出的合成数据集将在 https://mikespanak.github.io/OCCAM_counter/ 公开提供。
Summary / 总结
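OCCAM is presented as the first training-free approach to Class-Agnostic object Counting (CAC) that needs no supplementary information such as visual exemplars or text prompts. It also handles the multi-class variant of the problem, counting the instances of every object class in an image. The method combines Segment Anything Model 2 (SAM2) with a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm and achieves competitive performance on FSC-147 and CARPK. The authors also propose a synthetic multi-class dataset and the F1 score as a more suitable evaluation metric.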
该论文提出了OCCAM,一种无需训练且不需要任何补充信息的类无感知物体计数方法。该方法利用Segment Anything Model 2 (SAM2) 和一种基于阈值的First Integer Neighbor Clustering Hierarchy (FINCH) 算法的自定义变体,在FSC-147和CARPK基准数据集上实现了竞争力的表现。该方法能够在不依赖先验训练或特定类别信息的情况下,对图像中的多个类别物体进行计数。此外,还提出了一个合成的多类别数据集和F1分数作为评估指标。
DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes
Authors: Aisha Al-Mohannadi, Ayisha Firoz, Yin Yang, Muhammad Imran, Ferda Ofli
First: 2026-01-20T10:50:46+00:00 · Latest: 2026-01-20T10:50:46+00:00
Abstract
Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at https://zenodo.org/records/18267770.
中文标题/摘要
标题:DisasterVQA:灾害场景视觉问答基准数据集
社交媒体图像在自然灾害和人为灾害期间提供了低延迟的情景信息来源,有助于快速评估损害和响应。尽管视觉问答(VQA)在通用领域表现出色,但其在灾害响应中所需的复杂且安全关键推理的适用性仍不清楚。我们介绍了DisasterVQA,一个旨在危机情境中感知和推理的基准数据集。DisasterVQA 包含 1,395 张真实世界图像和 4,405 个专家策划的问题-答案对,涵盖了洪水、野火和地震等多样事件。该数据集基于包括FEMA ESF和OCHA MIRA等人道主义框架,包括二元选择、多项选择和开放性问题,涵盖情景意识和操作决策任务。我们对七种最先进的视觉-语言模型进行了基准测试,发现不同问题类型、灾害类别、地区和人道主义任务的性能存在差异。尽管模型在二元问题上取得了高准确率,但在细粒度的定量推理、物体计数和上下文敏感解释方面,特别是在未充分代表的灾害场景中,表现尤为困难。DisasterVQA 提供了一个具有挑战性和实用性的基准,以指导开发更稳健且操作上有意义的视觉-语言模型,用于灾害响应。该数据集可在 https://zenodo.org/records/18267770 公开获取。
Summary / 总结
The research introduces DisasterVQA, a benchmark dataset for visual question answering in disaster scenarios, to evaluate models' performance in complex and safety-critical situations. The dataset includes 1,395 images and 4,405 expert-curated questions covering various disasters like floods and wildfires. Models show high accuracy on binary questions but struggle with fine-grained reasoning and context-sensitive interpretation, especially for underrepresented scenarios. This benchmark aims to guide the development of more robust vision-language models for disaster response.
研究动机是评估视觉问答(VQA)模型在灾害响应中的适用性,鉴于社交媒体图像对于快速评估灾害损失的重要性。主要方法是创建包含1,395张图像和4,405个专家标注问题的DisasterVQA基准数据集,涵盖各种灾害场景。关键发现表明,虽然模型在二元问题上表现良好,但在复杂的任务如定量推理和上下文敏感解释方面表现不佳,尤其是对于未充分代表的灾害类型。这突显了灾害响应对更稳健的VQA模型的需求。
PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval
Authors: Gabriele Serussi, David Vainshtein, Jonathan Kouchly, Dotan Di Castro, Chaim Baskin
First: 2026-01-20T09:57:04+00:00 · Latest: 2026-01-20T09:57:04+00:00
Abstract
Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.
中文标题/摘要
标题:PREGEN:揭示合成视频检索中的潜在思想
合成视频检索(CoVR)旨在根据查询视频和修改文本检索视频。当前的CoVR方法未能充分利用现代视觉-语言模型(VLM),要么使用过时的架构,要么需要昂贵的微调和缓慢的字幕生成。我们提出了PREGEN(PRE GENeration提取),这是一种克服这些限制的有效且强大的CoVR框架。我们的方法独特地将一个冻结的预训练VLM与一个轻量级编码模型配对,消除了任何VLM微调的需要。我们将查询视频和修改文本输入VLM,并从每一层提取最终标记的隐藏状态。然后,我们训练一个简单的编码器,基于这些聚合表示创建一个语义丰富且紧凑的检索嵌入。PREGEN显著推进了现有技术,超越了所有先前的方法,在标准CoVR基准上的召回率@1分别提高了+27.23和+69.59。我们的方法在不同VLM基础模型上表现出鲁棒性,并且在更复杂的文本修改方面表现出强大的零样本泛化能力,突显了其有效性和语义能力。
Summary / 总结
The research aims to improve Composed Video Retrieval (CoVR) by addressing the limitations of current methods, which either use outdated architectures or require expensive fine-tuning. PREGEN, a novel framework, pairs a frozen pre-trained Vision-Language Model (VLM) with a lightweight encoder, avoiding the need for fine-tuning. It extracts hidden states from the VLM and trains a simple encoder on these representations to create compact embeddings for retrieval. PREGEN outperforms previous methods, achieving a significant increase in Recall@1 of +27.23 and +69.59 on standard CoVR benchmarks and showing robustness across different VLMs and strong zero-shot generalization to complex textual modifications.
研究旨在通过解决现有方法的局限性,即使用过时的架构或需要昂贵的微调,来改进组成视频检索(CoVR)。PREGEN 是一种新颖的框架,将预训练的视觉-语言模型与轻量级编码器配对,避免了微调的需要。该方法从视觉-语言模型中提取隐藏状态,并在这些表示上训练一个简单的编码器,以创建用于检索的紧凑嵌入。PREGEN 在标准 CoVR 基准上的 Recall@1 方面显著超越了先前的方法,实现了 +27.23 和 +69.59 的大幅改进。该方法在不同视觉-语言模型上表现出良好的鲁棒性,并且在复杂文本修改的零样本泛化方面表现出强大的语义能力。
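The feature-extraction step in the PREGEN entry above (take the hidden state of the final token from each layer, then pool) can be sketched on plain nested lists. The layout `hidden_states[layer][token]` and the choice of mean pooling are assumptions for illustration. The abstract does not specify the pooling operator, and this sketch does not include the trained encoder.

```python
from typing import List

Vector = List[float]


def final_token_states(hidden_states: List[List[Vector]]) -> List[Vector]:
    """Keep the last token's hidden vector from every layer.

    hidden_states[layer][token] is assumed to hold one hidden vector.
    """
    return [layer[-1] for layer in hidden_states]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise (assumed pooling)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy input (hypothetical): 2 layers, 2 tokens, hidden size 2.
toy_hidden = [
    [[0.0, 0.0], [1.0, 2.0]],
    [[5.0, 5.0], [3.0, 4.0]],
]
per_layer = final_token_states(toy_hidden)
pooled = mean_pool(per_layer)
```

In a real pipeline, the per-layer vectors would come from a frozen VLM's forward pass, and a small trained encoder would replace or follow the pooling step.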
DeCode: Decoupling Content and Delivery for Medical QA
Authors: Po-Jen Ko, Chen-Han Tsai, Yu-Shao Peng
First: 2026-01-05T13:54:38+00:00 · Latest: 2026-01-20T09:31:31+00:00
Comments: Preprint
Abstract
Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients' needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from $28.4\%$ to $49.8\%$, corresponding to a $75\%$ relative improvement. Experimental results suggest the effectiveness of DeCode in improving clinical question answering of LLMs.
中文标题/摘要
标题:DeCode: 解耦内容与交付以实现医疗QA
大型语言模型(LLMs)表现出强大的医学知识,并能生成事实准确的回答。然而,现有模型往往未能考虑个体患者的背景,导致答案在临床上正确但与患者需求严重脱节。在本研究中,我们引入了DeCode,这是一种无需训练、适用于所有模型的框架,能够将现有的LLMs适应于在临床环境中生成上下文化的回答。我们使用OpenAI HealthBench对DeCode进行了评估,这是一个全面且具有挑战性的基准,旨在评估LLM回答的临床相关性和有效性。DeCode将先前的最佳性能从28.4%提高到49.8%,相当于75%的相对改进。实验结果表明,DeCode在提高LLMs的临床问题回答效果方面具有有效性。
Summary / 总结
DeCode is a training-free, model-agnostic framework designed to adapt existing large language models (LLMs) for producing context-specific medical responses. Evaluated on OpenAI HealthBench, DeCode significantly improves the previous state-of-the-art performance from 28.4% to 49.8%, representing a 75% relative improvement in clinical question answering accuracy.
DeCode 是一个无需训练、适用于多种模型的框架,旨在使现有的大型语言模型能够生成具有上下文的相关医疗答案。它在 OpenAI HealthBench 上进行评估,并实现了从 28.4% 到 49.8% 的临床相关性和有效性显著提升,相对改进幅度为 75%。
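The 75% relative improvement quoted in the DeCode entry above follows from the two reported scores. A small sketch of the arithmetic, with a helper name chosen here for illustration:

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Scores reported in the abstract: 28.4% (previous best) and 49.8% (DeCode).
absolute_gain_points = 49.8 - 28.4
relative_gain_percent = relative_improvement(28.4, 49.8) * 100
```

The absolute gain is 21.4 percentage points. Dividing by the 28.4 baseline gives roughly 75.4%, which the abstract rounds to 75%.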
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
Authors: Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu
First: 2026-01-20T08:23:29+00:00 · Latest: 2026-01-20T08:23:29+00:00
Abstract
Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions that rely on naive chunking strategies with retrieval-augmented generation typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.
中文标题/摘要
标题:基于音频视觉实体连贯性和主动搜索的分层长视频理解
长视频理解对视觉语言模型构成了巨大挑战,因为它们需要处理极其长的上下文窗口。现有的依赖于简单的分块策略和检索增强生成的方法,通常会遭受信息碎片化和全局连贯性丧失的问题。我们提出了HAVEN,这是一种统一的长视频理解框架,通过整合音频视觉实体连贯性和分层视频索引与主动搜索机制,实现连贯和全面的推理。首先,通过整合视觉和听觉流中的实体级表示,保持语义一致性,并将内容组织成跨越全局摘要、场景、片段和实体级别的结构化层次。然后,采用主动搜索机制,实现跨这些层次的动态检索和推理,促进连贯叙事重建和细粒度实体跟踪。大量实验表明,我们的方法在时间连贯性、实体一致性以及检索效率方面表现出色,在LVBench上总体准确率达到84.1%,特别是在具有挑战性的推理类别中,准确率达到80.1%。这些结果突显了结构化多模态推理在长视频全面和上下文一致理解中的有效性。
Summary / 总结
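HAVEN is a unified framework for long-video understanding that targets the information fragmentation and loss of global coherence caused by naive chunking with retrieval-augmented generation. It integrates entity-level representations across visual and auditory streams, organizes content into a hierarchy spanning global summary, scene, segment, and entity levels, and uses agentic search to retrieve and reason across these layers. On LVBench it reaches 84.1% overall accuracy and 80.1% in the reasoning category.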
研究提出了HAVEN框架,该框架结合了音频视觉实体一致性、分层视频索引和代理搜索,以解决长视频理解的挑战。它在视觉和听觉流中保持语义一致性,并将内容组织成一个结构化的层次。实验表明,HAVEN在LVBench上的总体准确率为84.1%,在推理类别上的准确率为80.1%,显示出良好的时间连贯性、实体一致性和检索效率。
Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs
Authors: Yujin Jo, Sangyoon Bae, Taesup Kim
First: 2026-01-20T08:04:18+00:00 · Latest: 2026-01-20T08:04:18+00:00
Abstract
Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model's internal behavior by reducing over-dependence on language priors and contrasting visually grounded with language-only representations. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration enables computationally efficient guidance directly embedded in the model's representation contextualization. To correct approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative, reducing latency by up to 2x compared to prior contrastive decoding methods that require multiple forward passes.
中文标题/摘要
标题:注意力-空间对比性引导在LVLMs中高效消除幻觉
大型视觉-语言模型(LVLMs)中的幻觉通常发生在语言先验主导视觉证据时,导致物体识别错误和视觉不一致的描述。我们通过将幻觉消除重新定义为对比性引导,将生成引导至视觉依据和语义忠实的文本。该方法通过减少对语言先验的依赖并对比视觉依据与仅语言表示,调节模型的内部行为。我们提出了注意力-空间对比性引导(ACG),这是一种单次机制,在自注意力层内操作以在单次前向计算中构建视觉-语言和仅语言注意力路径。这种集成使计算高效的引导直接嵌入到模型的表示上下文化中。为了纠正单次机制引入的近似偏差,我们进一步应用正交化校正,去除与仅语言路径对齐的成分,选择性地放大视觉贡献。在CHAIR和POPE基准上的实验表明,ACG在忠实度和描述质量方面达到最先进的水平,同时显著降低计算成本。我们的方法建立了一个原理上和计算上高效的替代方案,与需要多次前向计算的先前对比性解码方法相比,将延迟降低高达2倍。
Summary / 总结
The paper addresses hallucinations in large vision-language models by proposing Attention-space Contrastive Guidance (ACG), which steers text generation towards visually grounded descriptions. ACG operates within self-attention layers to contrast vision-language and language-only representations, reducing the model's reliance on language priors. Experiments show that ACG achieves superior faithfulness and caption quality with reduced computational cost and latency compared to previous methods requiring multiple forward passes.
论文通过提出注意力空间对比引导(ACG)来解决大型视觉语言模型中的幻觉问题,该方法引导文本生成更符合视觉描述的内容。ACG 在自注意力层中整合视觉语言和语言仅有的注意力路径,实现高效的引导。实验结果表明,ACG 在 CHAIR 和 POPE 基准上提高了忠实度和描述质量,同时减少了计算成本和高达 2 倍的延迟,优于之前的多步前向传递方法。
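The orthogonalized correction mentioned in the ACG entry above can be illustrated with a plain vector projection. This sketch assumes the correction removes from the contrast vector its component along the language-only representation, then adds the remainder back with a guidance weight `alpha`. The exact formulation, where it is applied inside the attention layers, and all names below are assumptions, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(vec: Vector, ref: Vector, eps: float = 1e-12) -> Vector:
    """Remove from `vec` its component aligned with `ref`."""
    scale = dot(vec, ref) / (dot(ref, ref) + eps)
    return [v - scale * r for v, r in zip(vec, ref)]


def guided_output(vision_language: Vector, language_only: Vector, alpha: float = 1.0) -> Vector:
    """Assumed guidance rule: amplify the part of the contrast that is
    orthogonal to the language-only path."""
    contrast = [a - b for a, b in zip(vision_language, language_only)]
    correction = orthogonalize(contrast, language_only)
    return [a + alpha * c for a, c in zip(vision_language, correction)]


# Toy vectors (hypothetical): the language-only path lies along the first axis.
toy_vl = [2.0, 1.0]
toy_lang = [1.0, 0.0]
toy_correction = orthogonalize([a - b for a, b in zip(toy_vl, toy_lang)], toy_lang)
toy_guided = guided_output(toy_vl, toy_lang, alpha=1.0)
```

In the toy case, the correction has essentially no component along the language-only axis. The guidance therefore changes only the second coordinate, which stands in for the visually grounded direction.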
ParkingTwin: Training-Free Streaming 3D Reconstruction for Parking-Lot Digital Twins
Authors: Xinhao Liu, Yu Wang, Xiansheng Guo, Gordon Owusu Boateng, Yu Cao, Haonan Si, Xingchen Guo, Nirwan Ansari
First: 2026-01-20T08:03:58+00:00 · Latest: 2026-01-20T08:03:58+00:00
Comments: 35 pages, 10 figures. Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. Under review
Abstract
High-fidelity parking-lot digital twins provide essential priors for path planning, collision checking, and perception validation in Automated Valet Parking (AVP). Yet robot-oriented reconstruction faces a trilemma: sparse forward-facing views cause weak parallax and ill-posed geometry; dynamic occlusions and extreme lighting hinder stable texture fusion; and neural rendering typically needs expensive offline optimization, violating edge-side streaming constraints. We propose ParkingTwin, a training-free, lightweight system for online streaming 3D reconstruction. First, OSM-prior-driven geometric construction uses OpenStreetMap semantic topology to directly generate a metric-consistent TSDF, replacing blind geometric search with deterministic mapping and avoiding costly optimization. Second, geometry-aware dynamic filtering employs a quad-modal constraint field (normal/height/depth consistency) to reject moving vehicles and transient occlusions in real time. Third, illumination-robust fusion in CIELAB decouples luminance and chromaticity via adaptive L-channel weighting and depth-gradient suppression, reducing seams under abrupt lighting changes. ParkingTwin runs at 30+ FPS on an entry-level GTX 1660. On a 68,000 m^2 real-world dataset, it achieves SSIM 0.87 (+16.0%), delivers about 15x end-to-end speedup, and reduces GPU memory by 83.3% compared with state-of-the-art 3D Gaussian Splatting (3DGS) that typically requires high-end GPUs (RTX 4090D). The system outputs explicit triangle meshes compatible with Unity/Unreal digital-twin pipelines. Project page: https://mihoutao-liu.github.io/ParkingTwin/
中文标题/摘要
标题:ParkingTwin:无需训练的流式3D重建用于停车库数字孪生
高保真停车库数字孪生为自动代客泊车(AVP)中的路径规划、碰撞检测和感知验证提供了必要的先验信息。然而,面向机器人的重建面临三难困境:稀疏的前方视角导致微弱的视差和病态几何;动态遮挡和极端光照妨碍了稳定的纹理融合;而神经渲染通常需要昂贵的离线优化,违反了边缘侧流约束。我们提出了ParkingTwin,一种无需训练、轻量级的在线流式3D重建系统。首先,基于OSM先验的几何构建利用OpenStreetMap语义拓扑直接生成度量一致的TSDF,用确定性映射替代盲目的几何搜索,避免了昂贵的优化。其次,几何感知动态过滤采用四模态约束场(法线/高度/深度一致性)实时拒绝移动车辆和瞬态遮挡。第三,CIELAB下的光照鲁棒融合通过自适应L通道加权和深度梯度抑制解耦亮度和色度,减少突然光照变化下的接缝。ParkingTwin在入门级的GTX 1660上运行速度超过30 FPS。在68,000平方米的真实数据集上,它实现了SSIM 0.87(+16.0%),提供了约15倍的端到端加速,并将GPU内存减少了83.3%,与通常需要高端GPU(RTX 4090D)的最新3D高斯点云(3DGS)相比。该系统输出与Unity/Unreal数字孪生管道兼容的显式三角网格。项目页面:https://mihoutao-liu.github.io/ParkingTwin/
Summary / 总结
ParkingTwin is a training-free, lightweight system for online streaming 3D reconstruction of parking lots, addressing sparse forward-facing views, dynamic occlusions, and abrupt lighting changes. It uses OpenStreetMap semantic topology to generate a metric-consistent TSDF, employs a quad-modal constraint field for real-time dynamic filtering, and applies illumination-robust fusion in CIELAB. On a 68,000 m^2 real-world dataset it achieves SSIM 0.87 (+16.0%), about 15x end-to-end speedup, and 83.3% lower GPU memory use than state-of-the-art 3D Gaussian Splatting. It runs at 30+ FPS on an entry-level GTX 1660 and outputs explicit triangle meshes compatible with Unity/Unreal digital-twin pipelines.
ParkingTwin 是一个无需训练的实时 3D 重建系统,针对稀疏视图、动态遮挡和光照变化等挑战。它使用 OpenStreetMap 数据生成一个度量一致的 TSDF,采用四模态约束场进行实时滤波,并在 CIELAB 中应用光照鲁棒融合。该系统在 SSIM 得分、显著的速度提升和 GPU 内存减少方面优于 3D Gaussian Splatting 等现有方法。它在入门级 GPU 上运行速度超过 30 FPS,并输出兼容 Unity/Unreal 数字孪生管道的三角形网格。
Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles
Authors: Maria Lymperaiou, Vasileios Karampinis, Giorgos Filandrianos, Angelos Vlachos, Chrysoula Zerva, Athanasios Voulodimos
First: 2026-01-20T08:02:04+00:00 · Latest: 2026-01-20T08:02:04+00:00
Abstract
Puzzles have long served as compact and revealing probes of human cognition, isolating abstraction, rule discovery, and systematic reasoning with minimal reliance on prior knowledge. Leveraging these properties, visual puzzles have recently emerged as a powerful diagnostic tool for evaluating the reasoning abilities of Large Vision-Language Models (LVLMs), offering controlled, verifiable alternatives to open-ended multimodal benchmarks. This survey provides a unified perspective of visual puzzle reasoning in LVLMs. We frame visual puzzles through a common abstraction and organize existing benchmarks by the reasoning mechanisms they target (inductive, analogical, algorithmic, deductive, and geometric/spatial), thereby linking puzzle design to the cognitive operations required for solving. Synthesizing empirical evidence across these categories, we identify consistent limitations in current models, including brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution. By framing visual puzzles as diagnostic instruments rather than task formats, this survey elaborates on the state of LVLM reasoning and outlines key directions for future benchmarks and reasoning-aware multimodal systems.
中文标题/摘要
标题:推理还是模式匹配?用视觉谜题探究大型视觉-语言模型
谜题长期以来一直作为紧凑而揭示性的探针,用于研究人类认知,通过最小化先验知识的依赖性,隔离抽象、规则发现和系统推理。利用这些特性,视觉谜题最近已成为评估大型视觉-语言模型(LVLM)推理能力的强大诊断工具,提供了开放性多模态基准的可控和可验证替代方案。本文综述了视觉谜题推理在LVLM中的统一视角。我们通过共同的抽象框架来阐述视觉谜题,并根据它们针对的推理机制(归纳、类比、算法、演绎和几何/空间)组织现有的基准测试,从而将谜题设计与解决所需的认知操作联系起来。综合这些类别中的实证证据,我们识别出当前模型的一致局限性,包括脆弱的泛化能力、感知与推理之间的紧密纠缠,以及流畅解释与忠实执行之间的持续差距。通过将视觉谜题视为诊断工具而非任务格式,本文阐述了LVLM推理的现状,并概述了未来基准测试和推理感知多模态系统的关键方向。
Summary / 总结
This survey frames visual puzzles as a diagnostic tool for evaluating the reasoning abilities of Large Vision-Language Models (LVLMs). It organizes existing benchmarks by the reasoning mechanism they target (inductive, analogical, algorithmic, deductive, and geometric/spatial) and identifies consistent limitations in current models: brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution. It also outlines key directions for future benchmarks and reasoning-aware multimodal systems.
论文探讨了使用视觉谜题作为诊断工具来评估大型视觉-语言模型(LVLM)的推理能力。它将视觉谜题分类为归纳、类比、算法、演绎和几何/空间推理机制,并指出了诸如脆弱的泛化能力和感知与推理之间的紧密联系等限制。研究强调了未来基准测试需要更好地捕捉LVLM的推理能力。
HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
Authors: Zhiyuan Shi, Qibo Qiu, Feng Xue, Zhonglin Jiang, Li Yu, Jian Jiang, Xiaofei He, Wenxiao Wang
First: 2026-01-20T07:35:06+00:00 · Latest: 2026-01-20T07:35:06+00:00
Abstract
The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shift and triggers an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model in the 224K context. Our code will be open-sourced.
中文标题/摘要
标题:HeteroCache:一种异构KV缓存动态压缩方法以应对长上下文LLM推理
KV缓存的线性内存增长是长上下文任务中LLM推理的一个重大瓶颈。现有的静态压缩方法往往无法保留全局重要信息,主要是因为它们忽视了注意力漂移现象,即词元的重要性会动态变化。尽管最近的动态检索方法试图解决这一问题,但它们通常采用粗粒度的缓存策略,并因频繁的数据传输而产生高I/O开销。为克服这些限制,我们提出了HeteroCache,这是一种无需训练的动态压缩框架。我们的方法基于两个关键洞察:注意力头表现出多样化的时序异质性,且同一层内的头之间存在显著的空间冗余。根据这些洞察,HeteroCache 根据稳定性和冗余性对头进行分类。因此,我们应用了一种细粒度的加权策略,为快速变化的注意力分配更大的缓存预算,以捕捉上下文变化,从而解决粗粒度策略的低效性。此外,我们采用了一种分层存储机制,其中一部分代表性头监控注意力变化,并触发异步、按需从CPU检索上下文,从而有效隐藏I/O延迟。最后,实验表明,HeteroCache 在多个长上下文基准测试中达到了最先进的性能,并将解码速度提高了多达3倍,相对于224K上下文的原始模型。我们的代码将开源。
Summary / 总结
HeteroCache is a dynamic compression framework designed to address the memory growth issue of KV caches in long-context LLM inference. It leverages the temporal heterogeneity of attention heads and the spatial redundancy within the same layer to apply a fine-grained weighting strategy and a hierarchical storage mechanism. Experiments show that HeteroCache outperforms existing methods, achieving state-of-the-art performance and accelerating decoding by up to 3 times compared to the original model in a 224K context.
HeteroCache 是一种无需训练的动态压缩框架,旨在解决长上下文 LLM 推理中 KV 缓存的内存增长问题。它利用注意力头的时序异质性和同一层内的空间冗余性,采用细粒度加权策略和分层存储机制。实验表明,HeteroCache 在多个长上下文基准测试中达到了最先进的性能,并且与原模型相比,在 224K 上下文下的解码速度最高可提升至 3 倍。
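Two ideas from this entry can be made concrete with a short sketch: why the KV cache grows linearly with context length, and how a fixed cache budget might be split so that heads with faster-shifting attention receive more slots. The memory formula is the standard one for a dense KV cache; the model shape, the per-head drift scores, and the proportional allocation rule in `allocate_budgets` are assumptions for illustration and are not taken from the paper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a dense KV cache: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


def allocate_budgets(drift_scores, total_budget, floor=16):
    """Split a token budget across heads in proportion to attention drift.

    drift_scores: one non-negative score per head (hypothetical measure of
    how fast that head's attention shifts). Every head keeps at least
    `floor` tokens; the remainder is shared proportionally.
    """
    n = len(drift_scores)
    spare = total_budget - floor * n
    if spare < 0:
        raise ValueError("total_budget is too small for the per-head floor")
    total = sum(drift_scores)
    if total == 0:
        shares = [spare // n] * n
    else:
        shares = [int(spare * s / total) for s in drift_scores]
    return [floor + s for s in shares]


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16.
gib = kv_cache_bytes(32, 8, 128, 224 * 1024) / 1024 ** 3
print(f"KV cache at 224K tokens: {gib:.1f} GiB")  # 28.0 GiB for this shape
print(allocate_budgets([1.0, 3.0], total_budget=100, floor=10))  # [30, 70]
```

Doubling `seq_len` doubles the result of `kv_cache_bytes`, which is the linear growth the abstract refers to.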
CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models
Authors: Donghee Lee, Rui Cai, Zhe Zhao
First: 2026-01-20T05:44:33+00:00 · Latest: 2026-01-20T05:44:33+00:00
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have pushed them closer to becoming general-purpose assistants. Despite their strong performance, LVLMs still struggle with vision-centric tasks such as image classification, underperforming compared to their base vision encoders, which are often CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a novel, model-agnostic framework which introduces vision-integration layers and a context-aware ensemble strategy to identify when to prioritize image representations or rely on the reasoning capabilities of the language model. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations, leading to consistent improvements in generalization across classification and vision-language benchmarks. Extensive experiments demonstrate that CARPE not only improves performance on image classification benchmarks but also enhances results across various vision-language benchmarks. Finally, CARPE is designed to be effectively integrated with most open-source LVLMs that consist of a vision encoder and a language model, ensuring its adaptability across diverse architectures.
中文标题/摘要
标题:CARPE: 基于上下文的图像表示优先级化集成框架以增强大型视觉-语言模型
近年来,大型视觉-语言模型(LVLMs)的发展使其更接近成为通用助手。尽管它们在性能上表现出色,但在图像分类等视觉中心任务上仍然存在局限性,表现不如基于CLIP的基视觉编码器。为了解决这一局限性,我们提出了基于上下文的图像表示优先级化集成(CARPE)框架,这是一种新型的、模型无关的方法,引入了视觉集成层和上下文感知集成策略,以确定何时优先考虑图像表示或依赖语言模型的推理能力。该设计增强了模型在适应性加权视觉和文本模态方面的能力,并使模型能够捕捉图像表示的各种方面,从而在分类和视觉-语言基准测试中实现一致的性能提升。广泛的实验表明,CARPE不仅在图像分类基准测试中提高了性能,还在各种视觉-语言基准测试中提高了结果。最后,CARPE被设计为可以有效地与大多数包含视觉编码器和语言模型的开源LVLMs集成,确保其在各种架构中的适应性。
Summary / 总结
The research aims to improve the performance of Large Vision-Language Models (LVLMs) in vision-centric tasks, particularly image classification, by proposing CARPE, a model-agnostic framework that introduces vision-integration layers and a context-aware ensemble strategy. The key experimental findings show that CARPE enhances the model's ability to adaptively weight visual and textual modalities, leading to consistent improvements in generalization across various benchmarks, including image classification and vision-language tasks.
研究旨在通过提出CARPE框架,该框架引入了视觉整合层和上下文感知集成策略,来提高大型视觉语言模型(LVLM)在视觉中心任务,特别是图像分类任务中的性能。关键实验结果表明,CARPE能够增强模型在视觉和文本模态之间适配性加权的能力,从而在各种基准测试中,包括图像分类和视觉语言任务中,实现一致的泛化性能提升。
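The ensemble idea in this entry can be sketched as a gated mixture of two class distributions, one from the vision encoder and one from the language model. The abstract does not specify how CARPE computes its context-aware weighting, so the scalar `gate` and the function `gated_ensemble` below are hypothetical stand-ins.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_ensemble(vision_logits, lm_logits, gate):
    """Mix two class distributions with a gate in [0, 1].

    gate = 1.0 trusts the vision encoder only; gate = 0.0 trusts the
    language model only. In a real system the gate would be predicted
    from context; here it is simply passed in (an assumption).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(vision_logits) != len(lm_logits):
        raise ValueError("both heads must score the same classes")
    p_vision = softmax(vision_logits)
    p_lm = softmax(lm_logits)
    return [gate * v + (1.0 - gate) * l for v, l in zip(p_vision, p_lm)]
```

Because the output is a convex combination of two probability vectors, it always sums to one, whatever the gate value.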
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
Authors: Zheng Liu, Honglin Lin, Chonghan Qin, Xiaoyang Wang, Xin Gao, Yu Li, Mengzhang Cai, Yun Zhu, Zhanping Zhong, Qizhi Pei, Zhuoshi Pan, Xiaoran Shang, Bin Cui, Conghui He, Wentao Zhang, Lijun Wu
First: 2026-01-20T05:11:44+00:00 · Latest: 2026-01-20T05:11:44+00:00
Comments: 29 pages
Abstract
Chart reasoning is a critical capability for Vision Language Models (VLMs). However, the development of open-source models is severely hindered by the lack of high-quality training data. Existing datasets suffer from a dual challenge: synthetic charts are often simplistic and repetitive, while the associated QA pairs are prone to hallucinations and lack the reasoning depth required for complex tasks. To bridge this gap, we propose ChartVerse, a scalable framework designed to synthesize complex charts and reliable reasoning data from scratch. (1) To address the bottleneck of simple patterns, we first introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. Guided by RPE, we develop a complexity-aware chart coder to autonomously synthesize diverse, high-complexity charts via executable programs. (2) To guarantee reasoning rigor, we develop truth-anchored inverse QA synthesis. Diverging from standard generation, we adopt an answer-first paradigm: we extract deterministic answers directly from the source code, generate questions conditional on these anchors, and enforce strict consistency verification. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning. We curate ChartVerse-SFT-600K and ChartVerse-RL-40K using Qwen3-VL-30B-A3B-Thinking as the teacher. Experimental results demonstrate that ChartVerse-8B achieves state-of-the-art performance, notably surpassing its teacher and rivaling the stronger Qwen3-VL-32B-Thinking.
中文标题/摘要
标题:ChartVerse:从零开始可靠程序合成扩展图表推理能力
图表推理是视觉语言模型(VLMs)的关键能力。然而,开源模型的发展受到高质量训练数据的严重限制。现有数据集面临双重挑战:合成图表往往过于简单且重复,而相关的问答对容易产生幻觉且缺乏复杂任务所需的推理深度。为解决这一问题,我们提出了ChartVerse,这是一种从零开始合成复杂图表和可靠推理数据的可扩展框架。(1)为解决简单模式的瓶颈,我们首先引入了滚动后验熵(RPE),这是一种量化图表复杂性的新型指标。受RPE的指导,我们开发了复杂性感知的图表编码器,通过可执行程序自主合成多样且高复杂度的图表。(2)为保证推理严谨性,我们开发了基于事实的逆向问答合成。不同于标准生成,我们采用答案优先的范式:直接从源代码中提取确定性答案,基于这些锚点生成问题,并强制执行严格的一致性验证。为了进一步提高难度和推理深度,我们根据模型失败率过滤样本,并提炼高质量的推理链(CoT)。我们使用Qwen3-VL-30B-A3B-Thinking作为教师,构建了ChartVerse-SFT-600K和ChartVerse-RL-40K。实验结果表明,ChartVerse-8B达到了最先进的性能,显著超越了其教师模型,并与更强的Qwen3-VL-32B-Thinking相媲美。
Summary / 总结
The research aims to enhance chart reasoning capabilities in Vision Language Models (VLMs) by addressing the limitations of existing datasets. The authors propose ChartVerse, a framework that synthesizes complex charts and reliable reasoning data from scratch. They introduce Rollout Posterior Entropy (RPE) to quantify chart complexity and develop a complexity-aware chart coder. Additionally, they use truth-anchored inverse QA synthesis to ensure rigorous reasoning. The results show that ChartVerse-8B achieves state-of-the-art performance, surpassing its teacher (Qwen3-VL-30B-A3B-Thinking) and rivaling the stronger Qwen3-VL-32B-Thinking.
ChartVerse 是一个从零开始合成复杂图表和可靠推理数据的可扩展框架。它引入了滚动后验熵(RPE)来量化图表复杂性,并开发了复杂性感知的图表编码器以生成多样且高复杂度的图表。为了确保推理的严谨性,它使用基于事实的逆向QA合成,专注于确定性答案和严格的一致性验证。基于该框架训练的 ChartVerse-8B 达到了最先进的性能,超越了其教师模型,并与更强的 Qwen3-VL-32B-Thinking 相媲美。
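The answer-first paradigm and the entropy-based complexity signal can be sketched in a few lines. The abstract does not give the exact definition of Rollout Posterior Entropy, so `answer_entropy` below uses plain Shannon entropy over sampled answers as a stand-in; `make_qa`, `fail_rate`, and the toy `series` dictionary are likewise hypothetical.

```python
import math
from collections import Counter


def answer_entropy(rollout_answers):
    """Shannon entropy (bits) of the answers produced by repeated rollouts."""
    counts = Counter(rollout_answers)
    n = len(rollout_answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def make_qa(series):
    """Answer-first synthesis: read the answer from the data, then ask.

    series: dict mapping category label to value, standing in for the
    values that a chart-generating program would contain.
    """
    answer = max(series, key=series.get)
    return {"question": "Which category has the highest value?",
            "answer": answer}


def fail_rate(qa, model_answers):
    """Fraction of model answers that disagree with the anchored answer."""
    wrong = sum(1 for a in model_answers if a != qa["answer"])
    return wrong / len(model_answers)


qa = make_qa({"Q1": 12, "Q2": 30, "Q3": 18})
print(qa["answer"])                                   # Q2
print(answer_entropy(["Q2", "Q2", "Q3", "Q1"]))       # 1.5
print(fail_rate(qa, ["Q2", "Q2", "Q3", "Q1"]))        # 0.5
```

Because the answer is computed from the source data before the question is written, it cannot be hallucinated; the fail rate then serves as a difficulty filter for which samples to keep.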
Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation
Authors: Yu Qin, Shimeng Fan, Fan Yang, Zixuan Xue, Zijie Mai, Wenrui Chen, Kailun Yang, Zhiyong Li
First: 2026-01-20T03:48:54+00:00 · Latest: 2026-01-20T03:48:54+00:00
Comments: The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP
Abstract
Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrow the matching scope, effectively filtering out irrelevant clutter to prevent it from degrading pose estimation. Firstly, we introduce an object-centric disentanglement preprocessing to isolate the semantic target from environmental noise. Secondly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning. Finally, we design a Patch Correlation Predictor (PCP) that generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.
中文标题/摘要
标题:基于跨视角感知的细粒度对应关系学习在开放词汇6D物体姿态估计中的应用
开放词汇6D物体姿态估计使机器人能够仅凭自然语言指导操作任意未见过的物体。然而,现有方法的关键限制在于它们依赖于不受约束的全局匹配策略。在开放世界场景中,试图将锚特征与整个查询图像空间进行匹配会引入过多的不确定性,因为目标特征容易与背景干扰混淆。为解决这一问题,我们提出了细粒度对应关系姿态估计(FiCoP)框架,从噪声多的全局匹配转向空间约束的块级对应关系。我们的核心创新在于利用块到块的相关矩阵作为结构先验,缩小匹配范围,有效过滤掉无关杂乱,防止其降低姿态估计。首先,我们引入了以物体为中心的解耦预处理,以隔离语义目标和环境噪声。其次,我们提出了跨视角全局感知(CPGP)模块,融合双视图特征,通过显式上下文推理建立结构共识。最后,我们设计了块相关预测器(PCP),生成精确的块级关联图,作为空间滤波器以增强细粒度、抗噪的匹配。在REAL275和Toyota-Light数据集上的实验表明,与最先进的方法相比,FiCoP分别提高了平均召回率8.0%和6.1%,突显了其在复杂、开放世界环境中为机器人代理提供稳健和泛化感知的能力。源代码将在https://github.com/zjjqinyu/FiCoP公开。
Summary / 总结
The research addresses the challenge of 6D object pose estimation in open-vocabulary scenarios, where existing methods rely on global matching strategies that introduce ambiguity. The proposed FiCoP framework shifts to patch-level correspondence, using a patch-to-patch correlation matrix as a structural prior to filter out irrelevant clutter. Key components include object-centric disentanglement preprocessing, a Cross-Perspective Global Perception module, and a Patch Correlation Predictor. Experiments show that FiCoP improves Average Recall by 8.0% and 6.1% on REAL275 and Toyota-Light datasets, respectively, compared to the state-of-the-art method, demonstrating its effectiveness in complex environments.
研究旨在通过自然语言指导机器人操纵未见过的对象,提升开放词汇下的6D物体姿态估计。方法Fine-grained Correspondence Pose Estimation (FiCoP)引入了patch-to-patch相关矩阵和Cross-Perspective Global Perception (CPGP)模块,减少歧义并提高姿态估计精度。实验结果显示,FiCoP在REAL275和Toyota-Light数据集上的平均召回率分别比现有方法高出8.0%和6.1%,展示了其在复杂环境中的鲁棒性和泛化能力。
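The role of a patch-to-patch correlation matrix as a spatial filter can be shown with a small sketch. In FiCoP the association map comes from a learned Patch Correlation Predictor; the cosine similarity, the threshold, and the function names below are assumptions used only to show how such a matrix narrows the matching scope.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def correlation_matrix(anchor_patches, query_patches):
    """Rows index anchor patches, columns index query patches."""
    return [[cosine(a, q) for q in query_patches] for a in anchor_patches]


def allowed_matches(corr, threshold=0.5):
    """For each anchor patch, list the query patches it may be matched to."""
    return [[j for j, c in enumerate(row) if c >= threshold] for row in corr]


anchor = [[1.0, 0.0], [0.0, 1.0]]
query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
corr = correlation_matrix(anchor, query)
print(allowed_matches(corr, threshold=0.8))  # [[0], [1]]
```

Fine-grained matching would then run only inside the allowed patch pairs instead of over the whole query image, which is what keeps background clutter out of the correspondence set.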
Hierarchy-Aware Multimodal Unlearning for Medical AI
Authors: Fengli Wu, Vaidehi Patil, Jaehong Yoon, Yue Zhang, Mohit Bansal
First: 2025-12-10T17:55:06+00:00 · Latest: 2026-01-20T03:41:34+00:00
Comments: Dataset and Code: https://github.com/fengli-wu/MedForget
Abstract
Pretrained Multimodal Large Language Models (MLLMs) are increasingly used in sensitive domains such as medical AI, where privacy regulations like HIPAA and GDPR require specific removal of individuals' or institutions' data. This motivates machine unlearning, which aims to remove the influence of target data from a trained model. However, existing unlearning benchmarks fail to reflect the hierarchical and multimodal structure of real-world medical data, limiting their ability to properly evaluate unlearning in practice. Therefore, we introduce MedForget, a hierarchy-aware multimodal unlearning benchmark that models hospital data as a nested structure, enabling fine-grained evaluation of multimodal unlearning across retain and forget splits. Experiments with current unlearning methods show that existing approaches struggle to achieve effective hierarchy-aware forgetting without degrading downstream medical utility. To address this limitation, we propose Cross-modal Hierarchy-Informed Projection for unlearning (CHIP), a training-free, hierarchy-aware multimodal unlearning method that deletes information by selectively removing target-specific weight subspaces while preserving sibling-shared information. Experiments show that CHIP achieves the highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods. Overall, MedForget provides a practical, HIPAA-aligned benchmark for evaluating structured multimodal unlearning for medical data, and CHIP offers an effective and general solution for hierarchy-aware forgetting that balances deletion with utility.
中文标题/摘要
标题:医疗AI中的层次感知多模态遗忘
预训练的多模态大型语言模型(MLLMs)在医疗AI等敏感领域中越来越被使用,而HIPAA和GDPR等隐私法规要求特定地移除个人或机构的数据。这促使了机器遗忘的发展,其目标是从训练模型中移除目标数据的影响。然而,现有的遗忘基准未能反映现实世界医疗数据的层次和多模态结构,限制了它们在实际中评估遗忘的能力。因此,我们引入了MedForget,一种层次感知的多模态遗忘基准,将医院数据建模为嵌套结构,从而在保留和遗忘分割中实现多模态遗忘的精细评估。实验表明,现有方法在实现有效的层次感知遗忘时难以避免对下游医疗效用的退化。为解决这一局限,我们提出了跨模态层次启发式投影遗忘(CHIP),这是一种无需训练、层次感知的多模态遗忘方法,通过选择性地删除目标特定的权重子空间同时保留兄弟共享的信息来删除信息。实验表明,CHIP在所有层次级别上实现了最高的遗忘-保留性能差距,同时保持与现有方法相当的下游效用。总体而言,MedForget提供了一个实用的、符合HIPAA的基准,用于评估结构化的多模态遗忘,而CHIP提供了一种有效的、通用的层次感知遗忘解决方案,平衡了删除与效用。
Summary / 总结
This paper targets machine unlearning in medical AI, where existing benchmarks do not adequately reflect the hierarchical and multimodal structure of medical data. The authors introduce MedForget, a benchmark that models hospital data as a nested hierarchy, and propose CHIP, a training-free method that removes target-specific weight subspaces while preserving sibling-shared information. Experiments show that CHIP achieves the highest forget-retain performance gap across all hierarchy levels while maintaining competitive downstream utility compared to existing methods.
研究旨在解决现有去学习基准在处理医疗数据的层次性和多模态性方面的局限性,这对于敏感领域中的隐私保护机器学习至关重要。研究引入了MedForget,这是一种新的基准,能够以层次化的方式建模医院数据,并提出了CHIP,这是一种无需训练的层次感知多模态去学习方法。实验表明,CHIP在不同层次级别上实现了有效的遗忘,同时保持了医疗效用。MedForget和CHIP提供了一种实用的框架,用于评估和实现医疗AI中的去学习。
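The idea of deleting a target-specific subspace while keeping what siblings share can be sketched with basic linear algebra. The abstract does not describe how CHIP identifies or removes subspaces, so the single-direction Gram-Schmidt construction below, and every function name in it, is an assumption made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def target_specific_direction(target, sibling):
    """Part of `target` that is orthogonal to `sibling` (one Gram-Schmidt step)."""
    s = normalize(sibling)
    coef = dot(target, s)
    residual = [t - coef * x for t, x in zip(target, s)]
    return normalize(residual)


def project_out(weight_rows, direction):
    """Remove the component along a unit `direction` from every weight row."""
    return [[w - dot(row, direction) * d for w, d in zip(row, direction)]
            for row in weight_rows]


# Toy example: the forget target overlaps with its sibling along axis 0.
u = target_specific_direction(target=[1.0, 1.0, 0.0], sibling=[1.0, 0.0, 0.0])
print(project_out([[2.0, 3.0, 4.0]], u))  # [[2.0, 0.0, 4.0]]
```

In the toy example the weight component along the sibling direction (2.0) is untouched, while the component unique to the target (3.0) is zeroed, which mirrors the forget/retain trade-off the benchmark measures.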
Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG
Authors: Baiyu Chen, Wilson Wongso, Xiaoqian Hu, Yue Tan, Flora Salim
Venue: KDD
First: 2025-07-27T05:45:45+00:00 · Latest: 2026-01-20T01:13:34+00:00
Comments: KDD Cup 2025 Meta CRAG-MM Challenge: Third Prize in the Single-Source Augmentation Task
Abstract
This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition's scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM .
中文标题/摘要
标题:面向多模态RAG的多阶段验证中心框架以减轻幻觉
本文介绍了CRUISE团队为KDD Cup 2025 Meta多模态多轮综合RAG基准(CRAG-MM)挑战赛开发的技术解决方案。该挑战旨在解决现代视觉语言模型(VLM)的一个关键局限性:它们倾向于产生幻觉,尤其是在面对第一人称视角图像、长尾实体和复杂多跳问题时。这个问题在现实世界应用中尤为严重,因为用户提出的事实查询要求在多种模态中保持高度的事实准确性。为了解决这一问题,我们提出了一种稳健的多阶段框架,优先考虑事实准确性和真实性而非完整性。我们的解决方案包括一个用于提高效率的轻量级查询路由器、一个查询感知的检索和摘要管道、双路径生成以及事后验证。这种保守策略旨在最小化幻觉,因为幻觉在比赛评分指标中会受到严重惩罚。我们的方法在任务1中获得第3名,证明了在复杂多模态RAG系统中优先考虑答案可靠性的有效性。我们的实现可在 https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM 获取。
Summary / 总结
This paper introduces a multi-stage verification-centric framework to mitigate hallucination in multi-modal RAG systems, addressing the critical issue of VLMs generating inaccurate information, especially with egocentric imagery and complex questions. The framework includes a query router, a retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification. It prioritizes factual accuracy and truthfulness over completeness, achieving 3rd place in Task 1 of the KDD Cup 2025 Meta CRAG-MM challenge.
本文提出了一种以验证为中心的多阶段框架,以减轻多模态RAG系统中的幻觉问题,特别是VLM在处理第一人称视角图像和复杂问题时生成不准确回答的情况。该框架包括查询路由器、检索和摘要管道、双路径生成以及事后验证。它优先考虑事实准确性和真实性而非完整性,在KDD Cup 2025 Meta CRAG-MM挑战赛的任务1中获得第3名。
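The reasoning behind a conservative, verification-first strategy can be illustrated with a small expected-score calculation. The abstract only says that hallucinations incur a severe penalty; the concrete scheme below (reward 1 for a correct answer, 0 for abstaining, a penalty of 1 for a wrong answer) and the function names are assumptions for illustration.

```python
def expected_score(p_correct, reward=1.0, penalty=1.0):
    """Expected score of answering when the answer is right with p_correct."""
    return p_correct * reward - (1.0 - p_correct) * penalty


def answer_or_abstain(candidate, p_correct, penalty=1.0):
    """Answer only when doing so beats the score of abstaining (zero)."""
    if expected_score(p_correct, penalty=penalty) > 0.0:
        return candidate
    return "I don't know"


print(answer_or_abstain("Paris", p_correct=0.9))  # Paris
print(answer_or_abstain("Paris", p_correct=0.4))  # I don't know
```

Under these assumed values the break-even confidence is 0.5, and a larger penalty pushes it higher, which is why a post-hoc verification stage that lowers confidence in unsupported answers can improve the final score even though fewer questions are answered.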