Parallel Context-of-Experts Decoding for Retrieval Augmented Generation
Authors: Giulio Corallo, Paolo Papotti
First: 2026-01-13T15:46:59+00:00 · Latest: 2026-01-13T15:46:59+00:00
Abstract
Retrieval Augmented Generation faces a trade-off: concatenating documents in a long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately offers speed but breaks cross-document interaction. We propose Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding. Pced treats retrieved documents as isolated "experts", synchronizing their predictions via a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model prior. This approach recovers cross-document reasoning capabilities without constructing a shared attention across documents.
中文标题/摘要
标题:并行专家上下文解码以增强检索生成
检索增强生成面临权衡:在长提示中连接文档可以实现多文档推理,但会创建预填充瓶颈,而单独编码文档KV缓存则可以提高速度但会破坏跨文档交互。我们提出了一种无需训练的并行专家上下文解码(Pced)框架,该框架将证据聚合从注意力机制转移到解码。Pced 将检索到的文档视为孤立的“专家”,通过一种新颖的检索感知对比解码规则来同步它们的预测,该规则将专家的logits与模型先验进行权衡。这种方法可以在不构建跨文档共享注意力的情况下恢复跨文档推理能力。
Summary / 总结
The paper addresses a trade-off in Retrieval Augmented Generation: concatenating documents into one long prompt supports multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately is fast but breaks cross-document interaction. It proposes Parallel Context-of-Experts Decoding (Pced), a training-free framework that treats retrieved documents as isolated experts and synchronizes their predictions through a retrieval-aware contrastive decoding rule, recovering cross-document reasoning without constructing shared attention across documents.
论文提出了一种名为Parallel Context-of-Experts Decoding (Pced)的方法,以解决Retrieval Augmented Generation (RAG)中的挑战。Pced将检索到的文档视为独立的专家,并通过一种检索感知的对比解码规则同步它们的预测,从而在无需共享注意力机制的情况下实现跨文档推理。
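As a rough illustration of the decoding-time aggregation described in the Pced abstract, the Python sketch below scores each token under each document expert, contrasts it with the no-context model prior, and adds a retrieval-relevance bonus. The abstract does not state the exact rule, so the scoring formula, the max-over-experts selection, and the names `contrastive_expert_step`, `beta`, and `gamma` are assumptions made for illustration, not the authors' method.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_expert_step(expert_logits, prior_logits, retrieval_scores,
                            beta=1.0, gamma=1.0):
    """Choose one next token from several per-document experts.

    expert_logits: one logit list per retrieved document (hypothetical input).
    prior_logits: logits of the same model without any document.
    retrieval_scores: positive relevance score per document.
    Assumed rule: (1 + beta) * expert - beta * prior + gamma * log(score).
    """
    prior_lp = log_softmax(prior_logits)
    best_token, best_expert, best_score = None, None, -math.inf
    for k, (logits, r) in enumerate(zip(expert_logits, retrieval_scores)):
        expert_lp = log_softmax(logits)
        bonus = gamma * math.log(r)  # retrieval-aware term (assumption)
        for tok, (e, p) in enumerate(zip(expert_lp, prior_lp)):
            score = (1.0 + beta) * e - beta * p + bonus
            if score > best_score:
                best_token, best_expert, best_score = tok, k, score
    return best_token, best_expert


# Expert 0 merely echoes the prior; expert 1 has document-specific evidence.
token, expert = contrastive_expert_step(
    expert_logits=[[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]],
    prior_logits=[2.0, 0.0, 0.0],
    retrieval_scores=[0.5, 0.5],
)
```

In this toy call the token favored by the prior gains nothing from the contrast, so the token supported by the second document wins. No attention is shared between the two experts; they interact only through the per-step selection.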
SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
Authors: Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz
First: 2026-01-13T15:00:03+00:00 · Latest: 2026-01-13T15:00:03+00:00
Abstract
With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art advocates for enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
中文标题/摘要
标题:SoC:测试时提示调优的语义正交校准
随着视觉-语言模型(VLMs)在医疗保健或自动驾驶等关键决策系统中的广泛应用,其不确定性估计的校准变得至关重要。然而,在VLM测试时提示调优(TPT)文献中,这一维度尚未得到充分探索,该文献主要集中在提高其辨别性能上。最近的先进方法提倡对文本提示嵌入成对施加完全正交约束以增强可分性,从而提高校准。然而,如我们在本文中理论上证明的那样,完全正交约束的固有梯度将强烈地将语义相关类别推开,最终使模型过于自信。基于我们的发现,我们提出了语义正交校准(SoC),这是一种基于Huber的正则化器,它在保持语义邻近性的同时强制平滑原型分离,从而在与先前的正交性方法相比时提高校准性能。在全面的经验验证中,我们证明SoC在提高校准性能的同时,也保持了竞争力的辨别能力。
Summary / 总结
This research addresses the calibration of uncertainty estimates in vision-language models used in critical applications, a dimension largely underexplored in test-time prompt tuning. It shows theoretically that fully orthogonal constraints push semantically related classes apart and make the model overconfident, and it introduces Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity. Experiments show that SoC consistently improves calibration performance while maintaining competitive discriminative capabilities.
论文针对在医疗保健和自动驾驶等关键应用中使用的视觉-语言模型(VLMs)的不确定性估计校准不足的问题,提出了语义正交校准(SoC),这是一种基于Huber的正则化器,通过保持语义相近性同时改善原型分离来提升校准性能。实验表明,SoC在提高校准性能的同时,也保持了与之前正交性方法相当的区分能力。
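To make the contrast between a fully orthogonal constraint and a Huber-based one more concrete, the sketch below compares a quadratic penalty on pairwise cosine similarities of class prototypes with a Huber penalty on the same similarities. The SoC abstract does not give the exact loss, so applying the Huber function to cosine similarity, the threshold `delta`, and the function names are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def huber(x, delta):
    """Standard Huber function: quadratic near zero, linear beyond delta."""
    a = abs(x)
    return 0.5 * x * x if a <= delta else delta * (a - 0.5 * delta)


def pairwise_penalties(prototypes, delta=0.3):
    """Return (quadratic, huber) penalties summed over prototype pairs."""
    quad, hub = 0.0, 0.0
    n = len(prototypes)
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(prototypes[i], prototypes[j])
            quad += 0.5 * s * s      # full-orthogonality style penalty
            hub += huber(s, delta)   # assumed Huber-style alternative
    return quad, hub


# Two identical prototypes (a closely related class pair) plus an orthogonal one.
quad, hub = pairwise_penalties([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

The gradient of the quadratic term grows with the similarity, whereas the Huber term's gradient is capped at `delta`. Under this sketch, highly similar pairs are pushed apart less aggressively, which is one way to read "smooth prototype separation while preserving semantic proximity".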
Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification
Authors: Tayyab Rehman, Giovanni De Gasperis, Aly Shmahell
First: 2026-01-08T11:31:47+00:00 · Latest: 2026-01-13T14:40:15+00:00
Comments: Author email changed, Acknowledgement changes
Abstract
Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
中文标题/摘要
标题:通过视觉语言模型和基于嵌入的分类在监控系统中实现级联多代理异常检测
在动态视觉环境中实现智能异常检测需要在实时性能与语义可解释性之间取得平衡。传统方法仅解决这一挑战的部分方面。基于重建的模型捕捉低级偏差但缺乏上下文推理,目标检测器提供速度但语义有限,而大型视觉语言系统则以高昂的计算成本提供可解释性。本研究引入了一种级联多代理框架,将这些互补范式统一成一个连贯且可解释的架构。早期模块执行重建门控过滤和对象级评估,而更高层次的推理代理则根据需要选择性地被调用来解释语义含糊的事件。该系统采用自适应升级阈值和发布-订阅通信架构,实现异步协调和在异构硬件上的可扩展部署。在大规模监控数据上的广泛评估表明,所提出的级联架构与直接视觉语言推理相比,延迟降低了三倍,同时保持了高感知保真度(PSNR = 38.3 dB,SSIM = 0.965)和一致的语义标签。该框架超越了传统的检测管道,结合了早期退出的效率、自适应多代理推理和可解释的异常归因,为可扩展的智能视觉监控奠定了可重复和节能的基础。
Summary / 总结
This work addresses the challenge of intelligent anomaly detection in dynamic visual environments by introducing a cascading multi-agent framework that integrates reconstruction-based models, object detectors, and large vision-language systems. Early modules perform reconstruction-gated filtering and object-level assessment, with higher-level agents invoked for semantic interpretation. The system demonstrates a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity and consistent semantic labeling.
该研究通过引入一个级联多代理框架,结合重建门控过滤、对象级评估和语义解释,解决了动态视觉环境中的智能异常检测挑战。该系统使用自适应阈值和发布-订阅通信结构,以实现高效和可扩展的部署。实验结果表明,与直接视觉-语言推理相比,该框架将延迟降低了三倍,同时保持了高感知保真度和一致的语义标注。
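The early-exit control flow of the cascade can be sketched as below. This is a minimal illustration, assuming fixed thresholds and caller-supplied stage functions; the abstract describes adaptive escalation thresholds and a publish-subscribe backbone, neither of which is modelled here. The names `run_cascade`, `recon_threshold`, and `object_threshold` are hypothetical.

```python
def run_cascade(recon_error, object_detector, vlm_agent,
                recon_threshold=0.1, object_threshold=0.5):
    """Escalate a frame through three stages and stop as early as possible.

    recon_error: reconstruction error already computed for the frame.
    object_detector: zero-argument callable returning an anomaly score in [0, 1].
    vlm_agent: zero-argument callable returning a semantic label (costly stage).
    Returns (label, stage_reached).
    """
    # Stage 1: reconstruction gate. Low error means the frame looks normal.
    if recon_error < recon_threshold:
        return "normal", 1
    # Stage 2: object-level assessment, only for frames that passed the gate.
    if object_detector() < object_threshold:
        return "normal", 2
    # Stage 3: vision-language reasoning, invoked only for ambiguous events.
    return vlm_agent(), 3


calls = []


def detector():
    calls.append("detector")
    return 0.9


def vlm():
    calls.append("vlm")
    return "intrusion"


easy = run_cascade(0.01, detector, vlm)   # exits at stage 1
hard = run_cascade(0.50, detector, vlm)   # escalates to stage 3
```

Because the costly stage is a callable that runs only when both cheaper checks escalate, most frames never reach it, which is the mechanism behind the latency reduction the abstract reports.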
Latent Reconstruction from Generated Data for Multimodal Misinformation Detection
Authors: Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
First: 2025-04-08T13:16:48+00:00 · Latest: 2026-01-13T14:25:49+00:00
Abstract
Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. Due to the scarcity of large-scale annotated datasets for multimodal misinformation detection (MMD), recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic, unrealistic examples, which limits their utility as training examples. To address this, we introduce "MisCaption This!", a framework for generating high-fidelity synthetic miscaptioned datasets through Adversarial Prompting of Vision-Language Models (VLMs). Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a Transformer-based network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" data generalize better to real-world misinformation, while LAMAR achieves new state-of-the-art on NewsCLIPpings, VERITE, and the newly introduced VERITE 24/25 benchmark; highlighting the efficacy of VLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
中文标题/摘要
标题:从生成数据中提取潜在重建以检测多模态虚假信息
多模态虚假信息,如误标图像,其中的说明文歪曲了图像的来源、背景或意义,在数字时代构成了日益严峻的挑战。由于缺乏大规模标注的多模态虚假信息检测(MMD)数据集,最近的方法依赖于通过离境配对或实体命名修改生成的合成训练数据(例如,更改名称、日期或地点)。然而,这些方法往往产生简单且不现实的例子,限制了它们作为训练示例的实用性。为了解决这一问题,我们引入了“MisCaption This!”框架,通过对抗提示视觉-语言模型(VLM)生成高保真度的合成误标数据集。此外,我们还引入了“潜在多模态重建”(LAMAR),这是一种基于变换器的网络,训练其重建真实说明文的嵌入,提供强大的辅助信号以指导检测。我们探索了各种训练策略(端到端 vs. 大规模预训练)和集成机制(直接、掩码、门控和注意力)。广泛的实验表明,使用“MisCaption This!”数据训练的模型在应对真实世界的虚假信息方面表现更好,而LAMAR在NewsCLIPpings、VERITE和新引入的VERITE 24/25基准测试中达到了新的最佳水平;这突显了VLM生成数据和基于重建的网络在推进MMD方面的有效性。我们的代码可在https://github.com/stevejpapad/miscaptioned-image-reconstruction获取
Summary / 总结
This paper addresses the challenge of detecting multimodal misinformation by introducing 'MisCaption This!', a framework for generating high-fidelity synthetic datasets through adversarial prompting of vision-language models. It also proposes 'Latent Multimodal Reconstruction' (LAMAR), a Transformer-based network that reconstructs embeddings of truthful captions to guide misinformation detection. Experiments show that models trained on 'MisCaption This!' data generalize better to real-world misinformation, and LAMAR achieves new state-of-the-art results on several benchmarks, demonstrating the effectiveness of VLM-generated data and reconstruction-based networks for multimodal misinformation detection.
论文通过引入‘MisCaption This!’框架生成高保真度的合成数据集,以及‘Latent Multimodal Reconstruction’(LAMAR)Transformer网络重建真实描述的嵌入,解决了多模态虚假信息检测的挑战。实验表明,使用‘MisCaption This!’数据训练的模型在处理真实世界虚假信息时表现更好,而LAMAR在NewsCLIPpings、VERITE和新引入的VERITE 24/25基准上达到了最先进的性能,突显了基于VLM生成数据和重建网络的有效性。代码可在GitHub上获得。
VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations
Authors: Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen
First: 2026-01-13T13:42:05+00:00 · Latest: 2026-01-13T13:42:05+00:00
Abstract
Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
中文标题/摘要
标题:VideoHEDGE:基于熵的视频-VLM 幻觉检测框架通过语义聚类和时空扰动
视频-视觉语言模型(Video-VLMs)中的幻觉仍然频繁且置信度高,而现有的不确定性度量往往无法与正确性对齐。我们引入了VideoHEDGE,这是一种模块化框架,用于视频问答中的幻觉检测,将基于熵的可靠性估计从图像扩展到时间结构化的输入。给定一个视频-问题对,VideoHEDGE 生成一个基线答案和多个高温度生成,来自干净片段及其光度和时空扰动的变体,然后使用自然语言推理(NLI)或嵌入方法将生成的文本输出聚类成语义假设。聚类级别的概率质量产生三个可靠性分数:语义熵(SE)、RadFlag 和 视觉增强语义熵(VASE)。我们使用LLM作为法官在SoccerChat基准上评估VideoHEDGE,以获得二元幻觉标签。在三个7B Video-VLMs(Qwen2-VL、Qwen2.5-VL和SoccerChat微调模型)中,VASE在更大的失真预算下始终获得最高的ROC-AUC,而SE和RadFlag通常接近随机水平。我们进一步表明,嵌入方法的聚类在计算成本显著降低的情况下,在检测性能上与NLI方法的聚类相当,并且领域微调减少了幻觉频率,但仅在校准方面带来了适度的改进。hedge-bench PyPI库使基准测试可重复和扩展,完整的代码和实验资源可在https://github.com/Simula/HEDGE#videohedge 获取。
Summary / 总结
VideoHEDGE is a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. It draws a baseline answer and multiple high-temperature generations from clean and perturbed clips, clusters them into semantic hypotheses, and computes three reliability scores (SE, RadFlag, and VASE) from cluster-level probability masses. Across three 7B Video-VLMs on SoccerChat, VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. Embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration.
VideoHEDGE 是一个框架,用于检测视频能力的视觉语言模型中的幻觉,通过将基于熵的可靠性估计扩展到时间结构化的输入。它生成基线和扰动答案,进行聚类,并计算可靠性得分。VASE 在较大的扰动预算下表现最佳,而 SE 和 RadFlag 经常接近随机表现。嵌入式聚类比 NLI 聚类更具计算效率。领域微调可以减少幻觉,但仅能小幅提高校准。
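The step from clustered generations to a Semantic Entropy score can be sketched as follows. The sketch assumes each sampled answer has already been assigned a cluster id (by NLI or embeddings) and that every sample carries equal weight, so a cluster's probability mass is its empirical frequency. That weighting and the function name are assumptions; RadFlag and VASE are not covered.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical cluster distribution."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# All samples agree on one meaning: no semantic uncertainty.
agree = semantic_entropy(["goal", "goal", "goal", "goal"])
# Samples split evenly over two meanings: maximal uncertainty for two clusters.
split = semantic_entropy(["goal", "offside", "goal", "offside"])
```

A higher value means the sampled answers disagree in meaning rather than merely in wording, which is the signal an entropy-based detector uses to flag a likely hallucination.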
Sketch-Based Facade Renovation With Generative AI: A Streamlined Framework for Bypassing As-Built Modelling in Industrial Adaptive Reuse
Authors: Warissara Booranamaitree, Xusheng Du, Yushu Cai, Zhengyang Wang, Ye Zhang, Haoran Xie
First: 2026-01-13T13:17:09+00:00 · Latest: 2026-01-13T13:17:09+00:00
Comments: 10 pages, 9 figures, Proceedings of CAADRIA 2026
Abstract
Facade renovation offers a more sustainable alternative to full demolition, yet producing design proposals that preserve existing structures while expressing new intent remains challenging. Current workflows typically require detailed as-built modelling before design, which is time-consuming, labour-intensive, and often involves repeated revisions. To solve this issue, we propose a three-stage framework combining generative artificial intelligence (AI) and vision-language models (VLM) that directly processes rough structural sketch and textual descriptions to produce consistent renovation proposals. First, the input sketch is used by a fine-tuned VLM model to predict bounding boxes specifying where modifications are needed and which components should be added. Next, a stable diffusion model generates detailed sketches of new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet is employed to refine the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the proposed framework can generate renovation proposals that preserve the original structure while improving facade detail quality. This approach effectively bypasses the need for detailed as-built modelling, enabling architects to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intentions with greater clarity.
中文标题/摘要
标题:基于草图的幕墙翻新与生成式AI:工业适应性再利用中绕过建成建模的简化框架
幕墙翻新提供了比全面拆除更可持续的选择,但保留现有结构同时表达新意图的设计提案仍然具有挑战性。当前的工作流程通常需要在设计之前进行详细的建成建模,这既耗时又劳动密集,经常需要反复修改。为了解决这个问题,我们提出了一种结合生成式人工智能(AI)和视觉-语言模型(VLM)的三阶段框架,可以直接处理粗糙的结构草图和文本描述以生成一致的翻新提案。首先,输入的草图由微调后的VLM模型用于预测需要修改的位置和应添加的组件的边界框。其次,稳定扩散模型生成新元素的详细草图,通过生成性修复流水线与原始轮廓合并。最后,使用ControlNet对结果进行细化以生成照片级的真实图像。在数据集和实际工业建筑上的实验表明,所提出的框架可以生成既保留原始结构又能提高幕墙细节质量的翻新提案。这种方法有效地绕过了详细的建成建模需求,使建筑师能够快速探索设计替代方案,迭代早期概念,并以更清晰的方式沟通翻新意图。
Summary / 总结
The paper addresses the challenge of producing sustainable facade renovation proposals by proposing a three-stage framework that uses generative AI and vision-language models. The framework processes rough structural sketches and textual descriptions to predict modifications and generate detailed renovation proposals, bypassing the need for detailed as-built modelling. Experiments show that the proposed method can produce photorealistic renovation proposals that preserve the original structure while improving facade detail quality, enabling architects to explore design alternatives more efficiently.
论文旨在解决在保留现有结构的同时,通过新的设计意图进行可持续的幕墙翻新方案的生成难题。提出了一种结合生成AI和视觉语言模型的三阶段框架,可以直接处理粗略的草图和文本描述,绕过详细的竣工建模步骤。该框架预测必要的修改,生成详细的草图,并将其细化为照片级的真实图像,展示了能够生成保持原有结构并提高幕墙质量的一致翻新方案,而无需进行大量修订。
DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving
Authors: Muxi Diao, Lele Yang, Hongbo Yin, Zhexu Wang, Yejie Wang, Daxin Tian, Kongming Liang, Zhanyu Ma
First: 2025-05-27T03:21:04+00:00 · Latest: 2026-01-13T12:44:57+00:00
Abstract
Effective autonomous driving hinges on robust reasoning across perception, prediction, planning, and behavior. However, conventional end-to-end models fail to generalize in complex scenarios due to the lack of structured reasoning. While recent vision-language models (VLMs) have been applied to driving tasks, they typically rely on isolated modules and static supervision, limiting their ability to support multi-stage decision-making. We present AutoDriveRL, a unified training framework that formulates autonomous driving as a structured reasoning process over four core tasks. Each task is independently modeled as a vision-language QA problem and optimized using task-specific reward models, enabling fine-grained reinforcement signals at different reasoning stages. Within this framework, we train DriveRX, a cross-task reasoning VLM designed for multi-stage decision-making. DriveRX achieves strong performance on the public benchmark, outperforming GPT-4o in behavior reasoning and demonstrating robustness under complex or corrupted driving conditions. DriveRX serves as a high-level semantic reasoning backbone, producing structured stage-wise reasoning chains that enhance decision consistency. These outputs also provide high-quality supervisory signals for annotation and downstream planning/control models. We release the AutoDriveRL framework and DriveRX to support future research.
中文标题/摘要
标题:DriveRX:一种用于跨任务自动驾驶的视觉-语言推理模型
有效的自动驾驶依赖于在感知、预测、规划和行为等方面进行稳健的推理。然而,传统的端到端模型由于缺乏结构化的推理能力,在复杂场景中无法泛化。虽然最近的视觉-语言模型(VLMs)已被应用于驾驶任务,但它们通常依赖于孤立的模块和静态监督,限制了它们支持多阶段决策的能力。我们提出了AutoDriveRL,这是一种统一的训练框架,将自动驾驶问题形式化为一个包含四个核心任务的结构化推理过程。每个任务独立地被建模为一个视觉-语言问答问题,并使用特定任务的奖励模型进行优化,从而在不同的推理阶段提供精细的强化信号。在此框架下,我们训练了DriveRX,这是一种用于多阶段决策的跨任务推理VLM。DriveRX在公共基准测试中表现出色,行为推理方面优于GPT-4o,并且在复杂或受损驾驶条件下表现出鲁棒性。DriveRX充当高级语义推理骨干,生成结构化的阶段推理链,增强决策一致性。这些输出还为注释和下游规划/控制模型提供高质量的监督信号。我们发布了AutoDriveRL框架和DriveRX,以支持未来的研究。
Summary / 总结
The research aims to develop a robust autonomous driving model capable of handling complex scenarios through structured reasoning across perception, prediction, planning, and behavior. The method involves formulating autonomous driving as a structured reasoning process using a unified training framework, AutoDriveRL, where each task is modeled as a vision-language QA problem and optimized with task-specific reward models. The key experimental findings show that DriveRX, a cross-task reasoning VLM, outperforms GPT-4o in behavior reasoning and demonstrates robust performance under complex or corrupted driving conditions, providing structured reasoning chains that enhance decision consistency and high-quality supervisory signals for downstream models.
DriveRX是一种用于自主驾驶多阶段决策的视觉-语言推理模型。它使用统一的训练框架AutoDriveRL,将四个核心任务建模为视觉-语言问答问题,并使用特定任务的奖励模型进行优化。DriveRX在行为推理方面优于GPT-4o,并在复杂驾驶条件下表现出色,生成结构化的推理链,增强决策一致性,并为下游模型提供高质量的监督信号。
Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning
Authors: Ankita Raj, Chetan Arora
Venue: AAAI 2026
First: 2025-11-16T19:05:31+00:00 · Latest: 2026-01-13T12:36:18+00:00
Comments: Accepted to AAAI 2026
Abstract
Open-vocabulary object detectors (OVODs) unify vision and language to detect arbitrary object categories based on text prompts, enabling strong zero-shot generalization to novel concepts. As these models gain traction in high-stakes applications such as robotics, autonomous driving, and surveillance, understanding their security risks becomes crucial. In this work, we conduct the first study of backdoor attacks on OVODs and reveal a new attack surface introduced by prompt tuning. We propose TrAP (Trigger-Aware Prompt tuning), a multi-modal backdoor injection strategy that jointly optimizes prompt parameters in both image and text modalities along with visual triggers. TrAP enables the attacker to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model weights, thus preserving generalization while embedding a hidden backdoor. We adopt a curriculum-based training strategy that progressively shrinks the trigger size, enabling effective backdoor activation using small trigger patches at inference. Experiments across multiple datasets show that TrAP achieves high attack success rates for both object misclassification and object disappearance attacks, while also improving clean image performance on downstream datasets compared to the zero-shot setting. Code: https://github.com/rajankita/TrAP
中文标题/摘要
标题:通过多模态提示调优对开放词汇对象检测器的后门攻击
开放词汇对象检测器(OVODs)结合了视觉和语言,基于文本提示检测任意对象类别,使其在零样本泛化到新概念方面表现出强大的能力。随着这些模型在机器人技术、自动驾驶和监控等高风险应用中获得关注,理解其安全风险变得至关重要。在本文中,我们首次研究了OVODs的后门攻击,并揭示了由提示调优引入的新攻击面。我们提出了TrAP(触发器感知提示调优)多模态后门注入策略,该策略同时优化图像和文本模态中的提示参数以及视觉触发器。TrAP使攻击者能够使用轻量级、可学习的提示标记植入恶意行为,而无需重新训练基础模型权重,从而保持泛化能力同时嵌入隐藏后门。我们采用基于课程的训练策略,逐步缩小触发器大小,使在推理时使用小触发补丁有效激活后门。在多个数据集上的实验表明,TrAP在对象误分类攻击和对象消失攻击中均实现了高攻击成功率,同时与零样本设置相比,在下游数据集上提高了干净图像的性能。代码:https://github.com/rajankita/TrAP
Summary / 总结
This study investigates backdoor attacks on open-vocabulary object detectors (OVODs) by proposing TrAP (Trigger-Aware Prompt tuning), which jointly optimizes prompt parameters in both image and text modalities. The method enables attackers to implant malicious behavior using lightweight, learnable prompt tokens without retraining the base model, leading to high success rates in object misclassification and disappearance attacks. Notably, TrAP also improves clean image performance on downstream datasets compared to the zero-shot setting.
研究通过提出TrAP(Trigger-Aware Prompt tuning)方法,探讨了对开放词汇对象检测器(OVODs)的后门攻击,该方法在图像和文本模态中优化提示参数,使攻击者能够在不重新训练基础模型的情况下植入恶意行为,实现对目标误分类和消失攻击的高成功率,同时保持干净图像性能。实验结果表明TrAP在多个数据集上有效嵌入了隐藏后门。
RGS-SLAM: Robust Gaussian Splatting SLAM with One-Shot Dense Initialization
Authors: Wei-Tse Cheng, Yen-Jen Chiou, Yuan-Fu Yang
First: 2025-12-28T03:45:57+00:00 · Latest: 2026-01-13T12:21:54+00:00
Comments: 10 pages, 9 figures
Abstract
We introduce RGS-SLAM, a robust Gaussian-splatting SLAM framework that replaces the residual-driven densification stage of GS-SLAM with a training-free correspondence-to-Gaussian initialization. Instead of progressively adding Gaussians as residuals reveal missing geometry, RGS-SLAM performs a one-shot triangulation of dense multi-view correspondences derived from DINOv3 descriptors refined through a confidence-aware inlier classifier, generating a well-distributed and structure-aware Gaussian seed prior to optimization. This initialization stabilizes early mapping and accelerates convergence by roughly 20%, yielding higher rendering fidelity in texture-rich and cluttered scenes while remaining fully compatible with existing GS-SLAM pipelines. Evaluated on the TUM RGB-D and Replica datasets, RGS-SLAM achieves competitive or superior localization and reconstruction accuracy compared with state-of-the-art Gaussian and point-based SLAM systems, sustaining real-time mapping performance at up to 925 FPS. Project page: https://breeze1124.github.io/rgs-slam-project-page/
中文标题/摘要
标题:RGS-SLAM:基于一次性密集初始化的鲁棒高斯点云SLAM
我们提出了RGS-SLAM,一种鲁棒的高斯点云SLAM框架,用无训练的对应到高斯的初始化阶段取代GS-SLAM中的残差驱动密集化阶段。RGS-SLAM 不是像GS-SLAM那样随着残差揭示缺失的几何结构逐步添加高斯点,而是通过一种基于置信度的内点分类器对DINOv3描述符进行细化,一次性三角化密集多视图对应,生成一个分布良好且结构意识强的高斯种子,作为优化前的先验。这种初始化稳定了早期建图,并通过大约20%的加速提高了收敛速度,从而在纹理丰富和杂乱的场景中提高了渲染保真度,同时保持与现有GS-SLAM流水线的完全兼容性。在TUM RGB-D和Replica数据集上评估,RGS-SLAM在定位和重建准确性方面与最先进的高斯和点基SLAM系统具有竞争力或更优,同时保持实时建图性能,最高可达925 FPS。项目页面:https://breeze1124.github.io/rgs-slam-project-page/
Summary / 总结
RGS-SLAM is a robust Gaussian-splatting SLAM framework that introduces a one-shot dense initialization method to replace the residual-driven densification stage of GS-SLAM. By using DINOv3 descriptors and a confidence-aware inlier classifier, RGS-SLAM generates a well-distributed and structure-aware Gaussian seed, which stabilizes early mapping and accelerates convergence. The system achieves higher rendering fidelity in complex scenes and maintains real-time performance at up to 925 FPS, outperforming or matching state-of-the-art SLAM systems on TUM RGB-D and Replica datasets.
RGS-SLAM 是一种鲁棒的高斯点云 SLAM 框架,通过引入一次性密集初始化方法来替代 GS-SLAM 中的残差驱动稠密化阶段。通过使用 DINOv3 描述子和置信度感知的内点分类器,RGS-SLAM 在优化前生成一个分布良好且结构意识强的高斯种子,这可以稳定早期建图并加速收敛约 20%。该系统在 TUM RGB-D 和 Replica 数据集上实现了与最先进的高斯和点云 SLAM 系统相当或更优的定位和重建精度,同时保持实时性能,最高可达每秒 925 帧。
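For readers unfamiliar with triangulating a correspondence into a 3D seed point, the sketch below uses the classic two-view midpoint method: each matched pixel defines a ray from its camera centre, and the seed is placed halfway between the closest points of the two rays. The abstract does not say which triangulation RGS-SLAM uses, so the midpoint method is a stand-in, and `triangulate_midpoint` is a hypothetical name.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2.

    o1, o2: camera centres; d1, d2: ray directions (need not be unit length).
    Returns a 3D point as a list, or None when the rays are (nearly) parallel.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel rays carry no depth information
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [o + s * x for o, x in zip(o1, d1)]
    p2 = [o + t * x for o, x in zip(o2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two cameras one unit apart observing the same point at (1, 2, 5).
seed = triangulate_midpoint([0.0, 0.0, 0.0], [1.0, 2.0, 5.0],
                            [1.0, 0.0, 0.0], [0.0, 2.0, 5.0])
```

Running this once over every inlier correspondence would give a dense set of seed positions before any optimization, which is the general idea behind replacing residual-driven densification with a one-shot initialization.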
Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models
Authors: Hao Tang, Yu Liu, Shuanglin Yan, Fei Shen, Shengfeng He, Jing Qin
Venue: AAAI 2026
First: 2026-01-13T12:08:26+00:00 · Latest: 2026-01-13T12:08:26+00:00
Comments: Accepted by AAAI 2026
Abstract
Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided by test images and iteratively refines visual proxies, progressively realigning cross-modal similarities and enlarging local OOD margins. Finally, we dynamically re-weight the contributions of dual-modal proxies to obtain a calibrated OOD score that is robust to distribution shift. Extensive experiments on standard benchmarks demonstrate that CoEvo achieves state-of-the-art performance, improving AUROC by 1.33% and reducing FPR95 by 45.98% on ImageNet-1K compared to strong negative-label baselines.
中文标题/摘要
标题:跨模态代理演化在视觉-语言模型的OOD检测中
在开放世界环境中部署视觉-语言模型时,可靠的零样本检测出分布外(OOD)输入至关重要。然而,零样本OOD检测中缺乏标记的负样本需要有效的代理信号,这些信号在分布偏移下仍然有效。现有的负标签方法依赖于固定的一组文本代理,这会导致(i)稀疏地采样ID类之外的语义空间,以及(ii)视觉特征漂移时代理信号保持静态,从而导致跨模态对齐不良和预测不稳定。在本文中,我们提出了一种无需训练和标注的测试时框架CoEvo,该框架在测试时双向地、基于样本的条件下适应文本和视觉代理。具体而言,CoEvo引入了一种代理对齐的共进化机制,以维护两个不断进化的代理缓存,这些缓存通过测试图像引导上下文文本负样本动态挖掘,并迭代细化视觉代理,逐步重新对齐跨模态相似性并扩大局部OOD边界。最后,我们动态重新加权双模态代理的贡献,以获得鲁棒于分布偏移的OOD分数。在标准基准上的广泛实验表明,CoEvo在ImageNet-1K上实现了最先进的性能,与强大的负标签基线相比,AUROC提高了1.33%,FPR95降低了45.98%。
Summary / 总结
The paper addresses the challenge of zero-shot out-of-distribution (OOD) detection for vision-language models by proposing CoEvo, a test-time framework that dynamically adapts both textual and visual proxies. CoEvo uses a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which are refined iteratively based on test images, thereby realigning cross-modal similarities and improving OOD detection robustness. Experiments show that CoEvo outperforms existing methods, achieving a 1.33% improvement in AUROC and a 45.98% reduction in FPR95 on ImageNet-1K.
本文提出了一种名为CoEvo的测试时框架,用于动态调整文本和视觉代理,以解决视觉-语言模型的零样本out-of-distribution (OOD)检测问题。CoEvo利用代理对齐的双向演化机制,维护两个动态演化的代理缓存,并基于测试图像进行迭代优化,从而重新对齐跨模态相似性并提高OOD检测的鲁棒性。实验表明,CoEvo在ImageNet-1K上优于强负标签基线,实现了1.33%的AUROC提升和45.98%的FPR95减少。
Towards Safer Mobile Agents: Scalable Generation and Evaluation of Diverse Scenarios for VLMs
Authors: Takara Taniguchi, Kuniaki Saito, Atsushi Hashimoto
First: 2026-01-13T11:55:31+00:00 · Latest: 2026-01-13T11:55:31+00:00
Abstract
Vision Language Models (VLMs) are increasingly deployed in autonomous vehicles and mobile systems, making it crucial to evaluate their ability to support safer decision-making in complex environments. However, existing benchmarks inadequately cover diverse hazardous situations, especially anomalous scenarios with spatio-temporal dynamics. While image editing models are a promising means to synthesize such hazards, it remains challenging to generate well-formulated scenarios that include moving, intrusive, and distant objects frequently observed in the real world. To address this gap, we introduce HazardForge, a scalable pipeline that leverages image editing models to generate these scenarios with layout decision algorithms and validation modules. Using HazardForge, we construct MovSafeBench, a multiple-choice question (MCQ) benchmark comprising 7,254 images and corresponding QA pairs across 13 object categories, covering both normal and anomalous objects. Experiments using MovSafeBench show that VLM performance degrades notably under conditions including anomalous objects, with the largest drop in scenarios requiring nuanced motion understanding.
中文标题/摘要
标题:迈向更安全的移动代理:VLMs多样化场景的可扩展生成与评估
视觉语言模型(VLMs)在自主车辆和移动系统中的应用日益增多,因此评估其在复杂环境中的安全决策能力变得至关重要。然而,现有的基准测试未能充分涵盖各种多样的危险情况,尤其是具有时空动态的异常场景。虽然图像编辑模型是合成此类危险情况的有希望的方法,但生成包含移动、侵入和远距离物体的合理场景仍然具有挑战性,这些物体在现实世界中经常被观察到。为了解决这一差距,我们引入了**HazardForge**,这是一种可扩展的流水线,利用图像编辑模型生成这些场景,并使用布局决策算法和验证模块。使用HazardForge,我们构建了**MovSafeBench**,这是一个包含7,254张图像及其对应的QA对的多项选择题(MCQ)基准,覆盖了13个对象类别,包括正常和异常对象。使用MovSafeBench的实验表明,在包含异常对象的情况下,VLM的性能显著下降,尤其是在需要精细运动理解的场景中下降最大。
Summary / 总结
The research evaluates how well Vision Language Models (VLMs) support safe decision-making in autonomous vehicles and mobile systems. It introduces HazardForge, a scalable pipeline that uses image editing models, layout decision algorithms, and validation modules to generate hazardous scenarios with moving, intrusive, and distant objects. The resulting MovSafeBench benchmark comprises 7,254 images and corresponding QA pairs across 13 object categories. Experiments show that VLM performance degrades notably when anomalous objects are present, with the largest drop in scenarios requiring nuanced motion understanding.
研究旨在通过开发名为HazardForge的可扩展管道来增强Vision Language Models (VLMs)在自动驾驶车辆中的安全性,该管道利用图像编辑模型和布局决策算法生成包含移动和侵入性物体的多样化和危险场景。MovSafeBench基准包括7,254张图像和13个物体类别的QA配对,显示VLM在处理需要精细运动理解的异常物体时表现显著下降。
ClimateIQA: A New Dataset and Benchmark to Advance Vision-Language Models in Meteorology Anomalies Analysis
Authors: Jian Chen, Peilin Zhou, Yining Hua, Dading Chong, Meng Cao, Yaowei Li, Wei Chen, Bing Zhu, Junwei Liang, Zixuan Yuan
First: 2024-06-14T08:46:44+00:00 · Latest: 2026-01-13T11:47:20+00:00
Abstract
Meteorological heatmaps play a vital role in deciphering extreme weather phenomena, yet their inherent complexities marked by irregular contours, unstructured patterns, and complex color variations present unique analytical hurdles for state-of-the-art Vision-Language Models (VLMs). Current state-of-the-art models like GPT-4o, Qwen-VL, and LLaVA 1.6 struggle with tasks such as precise color identification and spatial localization, resulting in inaccurate or incomplete interpretations. To address these challenges, we introduce Sparse Position and Outline Tracking (SPOT), a novel algorithm specifically designed to process irregularly shaped colored regions in visual data. SPOT identifies and localizes these regions by extracting their spatial coordinates, enabling structured representations of irregular shapes. Building on SPOT, we construct ClimateIQA, a novel meteorological visual question answering (VQA) dataset, comprising 26,280 high-resolution heatmaps and 762,120 instruction samples for wind gust, total precipitation, wind chill index and heat index analysis. ClimateIQA enhances VLM training by incorporating spatial cues, geographic metadata, and reanalysis data, improving model accuracy in interpreting and describing extreme weather features. Furthermore, we develop Climate-Zoo, a suite of fine-tuned VLMs based on SPOT-empowered ClimateIQA, which significantly outperforms existing models in meteorological heatmap tasks.
中文标题/摘要
标题:ClimateIQA:一个新的数据集和基准,以推进气象异常分析中的视觉语言模型
气象热图在解释极端天气现象中发挥着重要作用,但其固有的复杂性,如不规则轮廓、无序模式和复杂颜色变化,为最先进的视觉语言模型(VLMs)带来了独特的分析挑战。当前最先进的模型如GPT-4o、Qwen-VL和LLaVA 1.6在精确颜色识别和空间定位任务上存在困难,导致不准确或不完整的解释。为了解决这些挑战,我们引入了稀疏位置和轮廓跟踪(SPOT),这是一种专门设计用于处理视觉数据中不规则形状彩色区域的新算法。SPOT通过提取空间坐标来识别和定位这些区域,使其能够以结构化的方式表示不规则形状。基于SPOT,我们构建了ClimateIQA,这是一个新颖的气象视觉问答(VQA)数据集,包含26,280个高分辨率热图和762,120个关于风速、总降水量、风寒指数和热指数分析的指令样本。ClimateIQA通过引入空间线索、地理元数据和再分析数据,增强了VLM的训练,提高了模型在解释和描述极端天气特征方面的准确性。此外,我们开发了基于SPOT赋能ClimateIQA的Climate-Zoo,这是一个细调的VLM套件,显著优于现有模型在气象热图任务上的表现。
Summary / 总结
The paper introduces ClimateIQA, a new dataset and benchmark for advancing Vision-Language Models (VLMs) in analyzing meteorological anomalies. To address the challenges posed by irregular contours and complex color variations in meteorological heatmaps, the authors propose SPOT, a novel algorithm for identifying and localizing irregularly shaped colored regions. The dataset includes 26,280 high-resolution heatmaps and 762,120 instruction samples, enhancing VLM training and improving model accuracy in interpreting extreme weather features. Climate-Zoo, a suite of fine-tuned VLMs, demonstrates significant performance improvements over existing models in meteorological heatmap tasks.
研究旨在通过解决最先进的视觉-语言模型(VLM)在精确颜色识别和空间定位任务中的挑战,提高气象热图的分析能力。研究引入了SPOT,一种用于处理不规则形状彩色区域的新算法,并构建了包含26,280个高分辨率热图和762,120个指令样本的ClimateIQA数据集,这增强了VLM的训练并提高了模型在解释极端天气特征方面的准确性。此外,基于SPOT-增强的ClimateIQA微调的Climate-Zoo模型在气象热图任务中显著优于现有模型。
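The abstract above says SPOT localizes irregularly shaped colored regions by extracting their spatial coordinates. The digest gives no algorithmic detail, so the following is only a minimal sketch of one plausible way to do that: a flood fill over a grid of color labels. The function name `extract_regions`, the 4-connectivity rule, and the output fields are all assumptions for illustration, not the authors' method.

```python
from collections import deque

def extract_regions(grid, background=0):
    """Return one record per 4-connected region of equal non-background colour."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] == background:
                continue
            colour = grid[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:  # breadth-first flood fill over same-colour neighbours
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == colour):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            regions.append({
                "colour": colour,
                "area": len(pixels),
                "bbox": (min(ys), min(xs), max(ys), max(xs)),  # (top, left, bottom, right)
                "centroid": (sum(ys) / len(ys), sum(xs) / len(xs)),
            })
    return regions

# Toy "heatmap": 0 is background, 1 and 2 are two colour bins.
heatmap = [
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 2],
    [0, 0, 0, 2, 2],
]
regions = extract_regions(heatmap)
```

A structured record like this (color, bounding box, centroid) is the kind of spatial cue that could then be written into instruction samples.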
Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling
Authors: Takamichi Miyata, Sumiko Miyata, Andrew Morris
First: 2026-01-13T11:46:05+00:00 · Latest: 2026-01-13T11:46:05+00:00
Abstract
Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto the Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
中文标题/摘要
标题:基于双解耦的零样本分心驾驶检测
分心驾驶是交通碰撞的主要原因之一,需要稳健且可扩展的检测方法。视觉语言模型(VLMs)能够实现强大的零样本图像分类,但现有的基于VLM的分心驾驶检测器在实际应用中往往表现不佳。我们发现个体特定的外观变化(如着装、年龄和性别)是关键瓶颈:VLMs将这些因素与行为线索纠缠在一起,导致决策更多地依赖于驾驶者是谁而不是驾驶者在做什么。为了解决这个问题,我们提出了一种个体解耦框架,该框架提取驾驶员外观嵌入,并在零样本分类之前将其对图像嵌入的影响去除,从而强调与分心相关的证据。我们进一步通过度量投影到Stiefel流形上正交化文本嵌入,以提高可分性同时保持原始语义的接近性。实验结果表明,我们的方法在先前基线方法上具有一致的改进,表明我们的方法在实际道路安全应用中的潜力。
Summary / 总结
The research aims to improve the detection of distracted driving using vision-language models by addressing the issue of subject-specific appearance variations that can lead to misclassification. The method involves a subject decoupling framework that extracts a driver appearance embedding and removes it from the image embedding before zero-shot classification, focusing on behavior cues. The approach also orthogonalizes text embeddings to enhance separability. Experimental results show consistent improvements over previous methods, suggesting the potential of this approach for practical road safety applications.
研究旨在通过解决主体特定外观差异导致的误分类问题,改进基于视觉语言模型的分心驾驶检测。方法包括一个主体解耦框架,将驾驶员外观与行为线索分离,并通过度量投影到Stiefel流形上使文本嵌入正交化以提高可分性。实验结果显示在先前基线方法上的一致改进,表明该方法在实际道路安全应用中的潜力。
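The two decoupling steps in the entry above can be sketched with NumPy. This is an illustration under stated assumptions, not the paper's implementation: "removing the influence" of the appearance embedding is modeled here as projecting it out of the image embedding, and the metric projection onto the Stiefel manifold is computed with the standard SVD (polar factor) construction, which gives the nearest matrix with orthonormal rows in Frobenius norm. All function names and the random toy embeddings are hypothetical.

```python
import numpy as np

def remove_component(image_emb, appearance_emb):
    """Project image_emb onto the orthogonal complement of appearance_emb (assumed removal rule)."""
    a = appearance_emb / np.linalg.norm(appearance_emb)
    residual = image_emb - np.dot(image_emb, a) * a
    return residual / np.linalg.norm(residual)

def project_to_stiefel(text_embs):
    """Nearest matrix with orthonormal rows (Frobenius norm), via the SVD polar factor."""
    u, _, vt = np.linalg.svd(text_embs, full_matrices=False)
    return u @ vt

def zero_shot_classify(image_emb, appearance_emb, text_embs):
    """Score the decoupled image embedding against orthogonalized class prompts."""
    z = remove_component(image_emb, appearance_emb)
    t = project_to_stiefel(text_embs)
    return int(np.argmax(t @ z))

rng = np.random.default_rng(0)
text_embs = rng.normal(size=(4, 16))      # 4 class prompts, 16-dim toy embeddings
appearance = rng.normal(size=16)          # stand-in for a driver appearance embedding
image = text_embs[2] + 3.0 * appearance   # behavior cue plus a strong appearance signal
pred = zero_shot_classify(image, appearance, text_embs)
```

The SVD-based projection requires the number of classes to be at most the embedding dimension, which holds for typical CLIP-sized embeddings.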
CoMa: Contextual Massing Generation with Vision-Language Models
Authors: Evgenii Maslov, Valentin Khrulkov, Anastasia Volkova, Anton Gusarov, Andrey Kuznetsov, Ivan Oseledets
First: 2026-01-13T11:44:00+00:00 · Latest: 2026-01-13T11:44:00+00:00
Comments: Code and dataset will be released later
Abstract
The conceptual design phase in architecture and urban planning, particularly building massing, is complex and heavily reliant on designer intuition and manual effort. To address this, we propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. Consequently, we introduce the CoMa-20K dataset, a comprehensive collection that includes detailed massing geometries, associated economic and programmatic data, and visual representations of the development site within its existing urban context. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models (VLMs), evaluating both fine-tuned and large zero-shot models. Our experiments reveal the inherent complexity of the task while demonstrating the potential of VLMs to produce context-sensitive massing options. The dataset and analysis establish a foundational benchmark and highlight significant opportunities for future research in data-driven architectural design.
中文标题/摘要
标题:CoMa:基于视觉语言模型的上下文化体量生成
在建筑和城市规划中,特别是建筑体量设计的概念设计阶段,过程复杂且高度依赖设计师的直觉和手工劳动。为了解决这一问题,我们提出了一种基于功能需求和场地背景的自动化框架,用于生成建筑体量。此类数据驱动方法的一个主要障碍是缺乏合适的数据集。因此,我们引入了CoMa-20K数据集,这是一个全面的集合,包括详细的体量几何形状、相关的经济和功能数据以及场地在现有城市背景下的视觉表示。我们通过将体量生成表述为视觉语言模型(VLMs)的条件任务来评估该数据集,评估了微调和大型零样本模型。我们的实验揭示了任务的内在复杂性,同时展示了VLMs生成上下文敏感体量选项的潜力。数据集和分析为数据驱动的建筑设计奠定了基础基准,并指出了未来研究的重要机会。
Summary / 总结
The paper addresses the complexity of building massing in architectural design by proposing CoMa, an automated framework that uses Vision-Language Models to generate massing based on functional requirements and site context. The authors introduce the CoMa-20K dataset, which includes detailed massing geometries, economic and programmatic data, and visual representations of the site. Experiments with both fine-tuned and zero-shot VLMs show the task's complexity but also the potential of VLMs to produce context-sensitive massing options. The dataset and analysis provide a benchmark for future research in data-driven architectural design.
论文通过提出CoMa框架,解决建筑设计中建筑体量生成的复杂性问题,该框架基于功能需求和场地环境自动生成体量。论文引入了CoMa-20K数据集,包含详细的体量几何、经济和功能数据以及场地的视觉表示。该框架使用视觉语言模型进行基准测试,展示了视觉语言模型在生成场地敏感的体量选项方面的潜力,尽管任务本身具有复杂性。数据集和分析为未来数据驱动的建筑设计研究奠定了基础基准。
Beyond Linearization: Attributed Table Graphs for Table Reasoning
Authors: Yuxiang Wang, Junhao Gan, Shengxiang Gao, Shenghao Ye, Zhengyi Yang, Jianzhong Qi
First: 2026-01-13T11:14:43+00:00 · Latest: 2026-01-13T11:14:43+00:00
Abstract
Table reasoning, a task to answer questions by reasoning over data presented in tables, is an important topic due to the prevalence of knowledge stored in tabular formats. Recent solutions use Large Language Models (LLMs), exploiting the semantic understanding and reasoning capabilities of LLMs. A common paradigm of such solutions linearizes tables to form plain texts that are served as input to LLMs. This paradigm has critical issues. It loses table structures, lacks explicit reasoning paths for result explainability, and is subject to the "lost-in-the-middle" issue. To address these issues, we propose Table Graph Reasoner (TABGR), a training-free model that represents tables as an Attributed Table Graph (ATG). The ATG explicitly preserves row-column-cell structures while enabling graph-based reasoning for explainability. We further propose a Question-Guided Personalized PageRank (QG-PPR) mechanism to rerank tabular data and mitigate the lost-in-the-middle issue. Extensive experiments on two commonly used benchmarks show that TABGR consistently outperforms state-of-the-art models by up to 9.7% in accuracy. Our code will be made publicly available upon publication.
中文标题/摘要
标题:超越线性化:带属性的表图用于表推理
表推理是一项通过在表格中呈现的数据进行推理以回答问题的任务,由于大量知识以表格形式存储,因此该任务非常重要。最近的解决方案利用了大型语言模型(LLMs)的语义理解和推理能力。此类解决方案的常见范式是将表格线性化为文本,作为输入提供给LLMs。这种范式存在关键问题。它会丢失表格结构,缺乏明确的推理路径以解释结果,并且容易出现“中间迷失”问题。为解决这些问题,我们提出了Table Graph Reasoner(TABGR),这是一种无需训练的模型,将表格表示为带属性的表图(ATG)。ATG明确保留了行-列-单元格结构,同时支持基于图的推理以实现可解释性。我们还提出了一种基于问题的个性化PageRank(QG-PPR)机制,以重新排序表格数据并缓解中间迷失问题。在两个常用基准上的广泛实验表明,TABGR在准确率上始终优于最先进的模型,最多高出9.7%。我们的代码将在发表后公开提供。
Summary / 总结
The paper addresses the limitations of linearizing tables for table reasoning tasks, which include losing table structures, lacking explicit reasoning paths, and the 'lost-in-the-middle' issue. It introduces Table Graph Reasoner (TABGR), which represents tables as Attributed Table Graphs (ATGs) to preserve row-column-cell structures and enable graph-based reasoning for explainability. TABGR also uses a Question-Guided Personalized PageRank (QG-PPR) mechanism to rerank tabular data. Experiments on two benchmarks show that TABGR outperforms existing models by up to 9.7% in accuracy.
论文针对表格推理任务中将表格线性化所带来的结构丢失、缺乏明确推理路径以及‘中间迷失’等问题,提出了Table Graph Reasoner (TABGR) 模型,通过使用Attributed Table Graph (ATG) 来保持行-列-单元格结构,并支持基于图的推理以提高可解释性。此外,还提出了一种问题导向的个性化PageRank (QG-PPR) 机制来重新排序表格数据。实验结果显示,TABGR 在两个常用基准上的准确率比现有模型高出了最多9.7%。
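To make the idea of question-guided Personalized PageRank over a table graph concrete, here is a small sketch. It assumes a simple graph in which every cell node is linked to its row node and its column node, and it seeds the random-walk restart on nodes that match the question. The graph layout, node naming, and seed choice are hypothetical; the entry above does not specify how TABGR builds the ATG or chooses seeds.

```python
def table_graph(header, rows):
    """Build an undirected graph: each cell links to its row node and its column node."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    for i, row in enumerate(rows):
        for col, value in zip(header, row):
            cell = f"cell:{i}:{col}={value}"
            link(cell, f"row:{i}")
            link(cell, f"col:{col}")
    return adj

def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration; `seeds` maps node -> restart weight (normalized internally)."""
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in adj}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in adj}
        for n in adj:
            share = alpha * rank[n] / len(adj[n])  # spread mass evenly over neighbours
            for m in adj[n]:
                nxt[m] += share
        rank = nxt
    return rank

adj = table_graph(["city", "population"], [["Paris", "2.1M"], ["Lyon", "0.5M"]])
# Hypothetical question: "What is the population of Paris?"
seeds = {"col:population": 1.0, "cell:0:city=Paris": 1.0}
rank = personalized_pagerank(adj, seeds)
ranked_cells = sorted((n for n in adj if n.startswith("cell:")), key=rank.get, reverse=True)
```

Reordering rows or cells by these scores before prompting is one way a ranking like this could place question-relevant content away from the middle of the context.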
MMLGNet: Cross-Modal Alignment of Remote Sensing Data using CLIP
Authors: Aditya Chaudhary, Sneha Barman, Mainak Singha, Ankit Jha, Girish Mishra, Biplab Banerjee
First: 2026-01-13T10:44:37+00:00 · Latest: 2026-01-13T10:44:37+00:00
Comments: Accepted at InGARSS 2025
Abstract
In this paper, we propose a novel multimodal framework, Multimodal Language-Guided Network (MMLGNet), to align heterogeneous remote sensing modalities like Hyperspectral Imaging (HSI) and LiDAR with natural language semantics using vision-language models such as CLIP. With the increasing availability of multimodal Earth observation data, there is a growing need for methods that effectively fuse spectral, spatial, and geometric information while enabling semantic-level understanding. MMLGNet employs modality-specific encoders and aligns visual features with handcrafted textual embeddings in a shared latent space via bi-directional contrastive learning. Inspired by CLIP's training paradigm, our approach bridges the gap between high-dimensional remote sensing data and language-guided interpretation. Notably, MMLGNet achieves strong performance with simple CNN-based encoders, outperforming several established multimodal visual-only methods on two benchmark datasets, demonstrating the significant benefit of language supervision. Codes are available at https://github.com/AdityaChaudhary2913/CLIP_HSI.
中文标题/摘要
标题:MMLGNet:使用CLIP实现遥感数据跨模态对齐
在本文中,我们提出了一种新颖的多模态框架——多模态语言引导网络(MMLGNet),利用视觉-语言模型(如CLIP)将如高光谱成像(HSI)和LiDAR等异构遥感模态与自然语言语义对齐。随着多模态地球观测数据的日益可用,有效融合光谱、空间和几何信息并实现语义级理解的方法需求日益增长。MMLGNet 使用模态特定编码器,并通过双向对比学习在共享的潜在空间中对齐视觉特征与手工构建的文本嵌入。受CLIP训练范式的启发,我们的方法填补了高维遥感数据与语言引导解释之间的差距。值得注意的是,MMLGNet 使用简单的基于CNN的编码器实现了强大的性能,在两个基准数据集上优于几种现有的多模态视觉方法,证明了语言监督的显著优势。代码可在https://github.com/AdityaChaudhary2913/CLIP_HSI获取。
Summary / 总结
MMLGNet is a multimodal framework that aligns remote sensing data like Hyperspectral Imaging and LiDAR with natural language semantics using CLIP. It employs modality-specific encoders and aligns visual features with textual embeddings in a shared latent space through bi-directional contrastive learning. MMLGNet outperforms several established multimodal visual-only methods on benchmark datasets, showing the advantage of language supervision with simple CNN-based encoders.
研究旨在使用名为MMLGNet的多模态框架,将如高光谱成像和LiDAR等异构遥感数据与自然语言语义进行对齐。MMLGNet 使用模态特定编码器,并通过双向对比学习将视觉特征与文本嵌入对齐到共享的潜在空间中。该方法在基准数据集上优于几种现有的多模态视觉方法,展示了语言监督在增强遥感数据与语义理解融合中的有效性。
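The bi-directional contrastive alignment mentioned above follows the CLIP training paradigm, which can be sketched as a symmetric cross-entropy over a similarity matrix. The NumPy code below is a generic illustration of that loss, not MMLGNet's code; the temperature value and the function names are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(visual, text, temperature=0.07):
    """Symmetric InfoNCE: row i of `visual` is the positive pair of row i of `text`."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # visual -> text direction
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> visual direction
    return 0.5 * (loss_v2t + loss_t2v)

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(8, 32))                          # toy class-prompt embeddings
visual_feats = text_feats + 0.05 * rng.normal(size=(8, 32))    # well-aligned toy features
loss = bidirectional_contrastive_loss(visual_feats, text_feats)
```

In a setup like the one described, `visual` would come from the modality-specific HSI and LiDAR encoders and `text` from the textual embeddings of the class prompts.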
Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2
Authors: Yizhan Feng, Hichem Snoussi, Jing Teng, Jian Liu, Yuyang Wang, Abel Cherouat, Tian Wang
First: 2026-01-13T10:26:10+00:00 · Latest: 2026-01-13T10:26:10+00:00
Comments: The Tenth International Conference on Data Mining and Big Data (DMBD'2025)
Abstract
The demand for real-time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles. However, a significant challenge arises from the contradiction between the high computational cost of large vision-language models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP-2, integrated with YOLO-World and YOLOv8-Seg models. This integration extends the multi-task capabilities of BLIP-2 for UAV applications with minimal adaptation and without requiring task-specific fine-tuning on drone data. Firstly, the deep integration of BLIP-2 with YOLO models enables it to leverage the precise perceptual results of YOLO for fundamental tasks like object detection and instance segmentation, thereby facilitating deeper visual-attention understanding and reasoning. Secondly, a content-aware key frame sampling mechanism based on K-Means clustering is designed, which incorporates intelligent frame selection and temporal feature concatenation. This equips the lightweight BLIP-2 architecture with the capability to handle video-level interactive tasks effectively. Thirdly, a unified prompt optimization scheme for multi-task adaptation is implemented. This scheme strategically injects structured event logs from the YOLO models as contextual information into BLIP-2's input. Combined with output constraints designed to filter out technical details, this approach effectively guides the model to generate accurate and contextually relevant outputs for various tasks.
中文标题/摘要
标题:基于BLIP-2的边缘优化多模态学习方法及其在无人机视频理解中的应用
随着复杂场景中实时视觉理解和交互需求的不断增加,无人机对视觉理解能力的要求也越来越高。然而,大型视觉语言模型的高计算成本与无人机边缘设备有限的计算资源之间的矛盾成为一个显著挑战。为解决这一挑战,本文提出了一种基于BLIP-2的轻量级多模态任务平台,结合了YOLO-World和YOLOv8-Seg模型。这种集成扩展了BLIP-2在无人机应用中的多任务能力,无需对无人机数据进行特定任务的微调。首先,BLIP-2与YOLO模型的深度集成使其能够利用YOLO的精确感知结果来执行诸如目标检测和实例分割等基本任务,从而促进更深层次的视觉注意力理解和推理。其次,设计了一种基于K-Means聚类的内容感知关键帧采样机制,该机制结合了智能帧选择和时间特征拼接,使轻量级BLIP-2架构能够有效处理视频级交互任务。第三,实现了一种统一的提示优化方案,用于多任务适应。该方案战略性地将YOLO模型的结构化事件日志作为上下文信息注入BLIP-2的输入,并结合输出约束来过滤掉技术细节,从而有效引导模型生成准确且上下文相关的结果,以适应各种任务。
Summary / 总结
This paper addresses the challenge of real-time visual understanding on UAVs by proposing a lightweight multimodal task platform based on BLIP-2, integrated with YOLO-World and YOLOv8-Seg models. The integration leverages the precise perceptual results of YOLO for object detection and instance segmentation, and introduces a content-aware key frame sampling mechanism and a unified prompt optimization scheme to handle video-level interactive tasks effectively. Key findings include the successful extension of BLIP-2's multi-task capabilities for UAV applications without requiring task-specific fine-tuning on drone data.
本文提出了一种基于BLIP-2的轻量级多模态学习平台,结合了YOLO-World和YOLOv8-Seg模型,以解决无人机上的实时视觉理解挑战。该平台利用YOLO的精确感知结果进行目标检测和实例分割,增强视觉注意力和推理能力。此外,还引入了一种基于内容的关键帧采样机制和统一的提示优化方案,以有效处理视频级交互任务。该方案无需在无人机数据上进行特定任务的微调,即可扩展BLIP-2在无人机应用中的多任务能力。
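The content-aware key frame sampling described above is based on K-Means clustering. A minimal sketch of that idea follows: cluster per-frame feature vectors, then keep the frame nearest each centroid, in temporal order. The plain Lloyd's iteration, the random initialization, and the toy features are assumptions for illustration; the entry does not state which features or selection rule the authors use.

```python
import numpy as np

def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's algorithm with random initial centers drawn from the data."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:       # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels

def select_key_frames(features, k):
    """Pick the frame closest to each centroid; return indices in temporal order."""
    centers, _ = kmeans(features, k)
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return sorted(set(int(i) for i in dists.argmin(axis=0)))

rng = np.random.default_rng(0)
# Toy video: three "scenes" of 10 frames each, with 8-dim frame features.
scenes = [rng.normal(loc=c, scale=0.1, size=(10, 8)) for c in (0.0, 5.0, 10.0)]
frame_features = np.concatenate(scenes)
key_frames = select_key_frames(frame_features, k=3)
```

Because the initialization is random, two centers can land in the same scene, so the function may return fewer than `k` distinct frames; a production version would likely use a more careful initialization.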
Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance
Authors: Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia
First: 2026-01-13T09:42:57+00:00 · Latest: 2026-01-13T09:42:57+00:00
Abstract
Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity.
中文标题/摘要
标题:无需训练的扩散模型分布适应性通过最大均值偏差指导
预训练的扩散模型已成为无条件和有条件样本生成的强大生成先验,但它们的输出往往与用户特定目标数据的特征不符。这种不匹配在仅可用少量参考示例且重新训练扩散模型不可行的领域适应任务中尤为严重。现有的推理时指导方法可以调整采样轨迹,但它们通常优化替代目标,如分类器似然性,而不是直接与目标分布对齐。我们提出了一种无需训练的机制MMD指导,该机制通过生成样本与参考数据集之间的最大均值偏差(MMD)梯度来增强逆向扩散过程。MMD可以从有限数据中提供可靠的分布估计,在实践中表现出低方差,并且高效可微,这使其特别适合指导任务。我们的框架通过乘积核自然地扩展到条件生成模型中的提示感知适应。此外,由于指导是在LDM的潜空间中应用的,它还可以高效地应用于潜扩散模型(LDMs)。在合成和真实世界基准上的实验表明,MMD指导可以在保持样本保真度的同时实现分布对齐。
Summary / 总结
The research aims to address the issue of distributional mismatch in diffusion models, especially in domain adaptation scenarios with limited reference data. The proposed MMD Guidance method guides the sampling process by aligning generated samples with a reference dataset using the Maximum Mean Discrepancy (MMD), without requiring retraining. Experiments show that MMD Guidance effectively achieves distributional alignment while maintaining sample quality.
论文针对预训练扩散模型中生成样本与目标数据特征不匹配的问题,提出了一种无需训练的MMD指导方法,直接将生成样本与参考数据集对齐。通过将MMD梯度整合到逆向扩散过程中,该方法能够有效实现分布对齐而不牺牲样本质量。
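The core quantity in the entry above, the gradient of the Maximum Mean Discrepancy between generated samples and a reference set, can be written down in closed form for a Gaussian (RBF) kernel. The sketch below computes the biased MMD² estimate and its gradient with respect to the generated samples, then takes one plain gradient step. It illustrates the gradient only: the kernel choice, bandwidth, and step size are assumptions, and the way the paper injects this gradient into the reverse diffusion update is not reproduced here.

```python
import numpy as np

def rbf(a, b, sigma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between sample sets x and y."""
    return rbf(x, x, sigma).mean() - 2 * rbf(x, y, sigma).mean() + rbf(y, y, sigma).mean()

def mmd2_grad(x, y, sigma=1.0):
    """Gradient of mmd2 with respect to each row of x."""
    n, m = len(x), len(y)
    kxx, kxy = rbf(x, x, sigma), rbf(x, y, sigma)
    diff_xx = x[:, None, :] - x[None, :, :]
    diff_xy = x[:, None, :] - y[None, :, :]
    g_xx = -(kxx[:, :, None] * diff_xx).sum(axis=1) * 2 / (n * n * sigma ** 2)
    g_xy = (kxy[:, :, None] * diff_xy).sum(axis=1) * 2 / (n * m * sigma ** 2)
    return g_xx + g_xy

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(20, 2))   # few-shot target samples
samples = rng.normal(1.0, 1.0, size=(20, 2))     # stand-in for generated samples
before = mmd2(samples, reference)
guided = samples - 1.0 * mmd2_grad(samples, reference)   # one guidance-style step
after = mmd2(guided, reference)
```

The same computation applies unchanged to latent vectors, which matches the abstract's remark that guidance can be applied in the latent space of an LDM.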
Semantic Misalignment in Vision-Language Models under Perceptual Degradation
Authors: Guo Cheng
First: 2026-01-13T09:13:05+00:00 · Latest: 2026-01-13T09:13:05+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception-realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety-critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language-level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel-level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM-based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety-critical applications.
中文标题/摘要
标题:视觉-语言模型在感知退化下的语义不匹配
视觉-语言模型(VLMs)在自动驾驶和具身AI系统中越来越广泛地应用,可靠的感知对于安全的语义推理和决策至关重要。尽管最近的VLMs在多模态基准测试中表现出色,但它们对现实感知退化的鲁棒性仍然知之甚少。在本研究中,我们系统地研究了在控制的上游视觉感知退化条件下VLMs中的语义不匹配,使用Cityscapes数据集上的语义分割作为代表性的感知模块。我们引入了感知现实的破坏,这些破坏仅在传统分割指标上引起适度下降,但观察到下游VLM行为的严重失败,包括虚构对象提及、安全关键实体的遗漏以及不一致的安全判断。为了量化这些影响,我们提出了一组语言层面的不匹配度量标准,以捕捉虚构、关键遗漏和安全误判,并分析这些度量标准与分割质量之间的关系,涵盖多个对比性和生成性VLMs。我们的结果揭示了像素级鲁棒性和多模态语义可靠性之间的明显脱节,突显了当前VLM基系统的一个关键局限性,并强调了在安全关键应用中明确考虑感知不确定性评估框架的必要性。
Summary / 总结
This study investigates the robustness of Vision-Language Models (VLMs) under perceptual degradation, focusing on their performance in autonomous driving and embodied AI systems. By introducing controlled corruptions to semantic segmentation, the research reveals that even moderate drops in segmentation accuracy can lead to severe semantic misalignment in VLMs, including hallucinations and safety misinterpretations. The study proposes new metrics to quantify these effects and highlights the need for evaluation frameworks that consider perception uncertainty in safety-critical applications.
研究探讨了视觉-语言模型(VLMs)在自主驾驶和具身AI系统中的感知降级鲁棒性。通过系统地对Cityscapes数据集上的语义分割应用控制化的破坏,研究发现尽管分割指标仅出现轻微下降,但VLM行为却出现了严重的故障,包括幻觉和安全误解。研究引入了语言层面的不一致性度量来量化这些效果,并强调了在安全关键应用中需要考虑感知不确定性评估框架的必要性。
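The language-level misalignment metrics named above (hallucination and critical omission) are not defined in this digest. As a rough illustration of what such metrics could look like, the sketch below compares the set of objects a VLM mentions with the set actually present in the scene. Both formulas, the function name, and the example object lists are assumptions, not the paper's definitions.

```python
def misalignment_scores(mentioned, present, safety_critical):
    """Assumed definitions:
    hallucination_rate     = mentioned objects that are not present / all mentioned objects
    critical_omission_rate = present safety-critical objects not mentioned / present safety-critical objects
    """
    mentioned, present, safety_critical = set(mentioned), set(present), set(safety_critical)
    hallucinated = mentioned - present
    critical_present = present & safety_critical
    omitted = critical_present - mentioned
    return {
        "hallucination_rate": len(hallucinated) / len(mentioned) if mentioned else 0.0,
        "critical_omission_rate": len(omitted) / len(critical_present) if critical_present else 0.0,
    }

scores = misalignment_scores(
    mentioned=["car", "traffic light", "dog"],                       # parsed from a VLM answer
    present=["car", "pedestrian", "traffic light", "building"],      # from ground-truth labels
    safety_critical=["pedestrian", "car", "cyclist"],
)
```

In this toy case the model invents a dog and misses the pedestrian, so one of three mentions is hallucinated and one of two present safety-critical objects is omitted.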
Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation
Authors: Kang Fu, Huiyu Duan, Zicheng Zhang, Yucheng Zhu, Jun Zhao, Xiongkuo Min, Jia Wang, Guangtao Zhai
First: 2026-01-13T08:00:02+00:00 · Latest: 2026-01-13T08:00:02+00:00
Abstract
Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly in Image Quality Assessment (IQA), demonstrating strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning methods, which aim to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free works for LMMs, we introduce IQARAG, a novel, training-free framework that enhances LMMs' IQA ability. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images, with their corresponding Mean Opinion Scores (MOSs), for the input image. The retrieved images and the input image are integrated into a specific prompt, where the retrieved images provide the LMM with a visual perception anchor for the IQA task. IQARAG contains three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments across multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that the proposed IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.
中文标题/摘要
标题:通过检索增强生成提高LMMs的图像质量评估能力
大型多模态模型(LMMs)在低级视觉感知任务中,特别是在图像质量评估(IQA)方面,最近显示出显著的潜力,表现出强大的零样本能力。然而,要达到最先进的性能通常需要昂贵的微调方法,这些方法旨在使输出中与质量相关的标记分布与图像质量水平对齐。受最近无训练工作的启发,我们引入了IQARAG,这是一种新颖的无训练框架,旨在增强LMMs的IQA能力。IQARAG利用检索增强生成(RAG)检索与输入图像在语义上相似但质量不同的参考图像及其相应的平均意见得分(MOSs),并将这些检索到的图像和输入图像整合到特定的提示中。检索到的图像为LMM提供了IQA任务的视觉感知锚点。IQARAG包含三个关键阶段:检索特征提取、图像检索和整合与质量评分生成。在多个多样化的IQA数据集,包括KADID、KonIQ、LIVE挑战和SPAQ上进行的广泛实验表明,提出的IQARAG有效地提升了LMMs的IQA性能,为质量评估提供了一种资源高效的替代微调方法。
Summary / 总结
The research aims to enhance the image quality assessment (IQA) capability of large multimodal models (LMMs) through a training-free framework called IQARAG. IQARAG uses Retrieval-Augmented Generation (RAG) to retrieve semantically similar but quality-variant reference images with corresponding Mean Opinion Scores (MOSs) for the input image, which are then integrated into a specific prompt to provide LMMs with a visual perception anchor for IQA. Experiments across multiple IQA datasets show that IQARAG effectively improves LMMs' IQA performance, offering a resource-efficient alternative to fine-tuning methods.
论文提出了一个名为IQARAG的训练-free框架,通过检索增强生成(RAG)来增强大型多模态模型(LMM)的图像质量评估(IQA)能力。该框架检索与输入图像在语义上相似但质量不同的参考图像及其对应的主观评分(MOS),并将它们整合到特定的提示中,为IQA任务提供视觉感知锚点。实验结果显示,IQARAG显著提升了LMMs的IQA性能,提供了一种资源高效的替代方案,用于质量评估,无需进行微调。
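The three phases listed above (feature extraction, image retrieval, prompt integration) can be sketched in a few lines once features are available. The code below assumes precomputed feature vectors and uses cosine-similarity top-k retrieval; the prompt wording, function names, and toy data are hypothetical. It covers only the "semantically similar" part of retrieval and does not enforce the quality variance among references that the abstract mentions.

```python
import numpy as np

def retrieve_references(query_feat, bank_feats, bank_mos, k=3):
    """Return (index, MOS) for the k bank images most similar to the query (cosine similarity)."""
    q = query_feat / np.linalg.norm(query_feat)
    b = bank_feats / np.linalg.norm(bank_feats, axis=1, keepdims=True)
    sims = b @ q
    top = np.argsort(-sims)[:k]
    return [(int(i), float(bank_mos[int(i)])) for i in top]

def build_prompt(references):
    """Assemble a text scaffold; in practice the images themselves would be interleaved."""
    lines = [
        f"Reference image {n + 1}: quality score {mos:.1f}"
        for n, (_, mos) in enumerate(references)
    ]
    lines.append("Rate the quality of the final image on the same scale.")
    return "\n".join(lines)

bank_feats = np.eye(4)                     # toy feature bank, one row per reference image
bank_mos = [4.2, 2.1, 3.3, 1.0]            # toy Mean Opinion Scores
query = np.array([1.0, 0.1, 0.0, 0.0])     # toy feature of the input image
refs = retrieve_references(query, bank_feats, bank_mos, k=2)
prompt = build_prompt(refs)
```

The scored references act as anchors: the LMM sees what a given score looks like on similar content before rating the input image.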
Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
Authors: Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Siteng Huang, Honggang Chen
Venue: AAAI 2026
First: 2025-01-09T11:57:58+00:00 · Latest: 2026-01-13T07:37:47+00:00
Comments: Accepted by AAAI 2026. Code is available at https://github.com/xuyang-liu16/GlobalCom2
Abstract
Large vision-language models (LVLMs) excel at visual understanding, but face efficiency challenges due to quadratic complexity in processing long multi-modal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of high-resolution LVLMs with dynamic cropping. Existing methods treat all tokens uniformly, but our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary nature between thumbnails and crops, and the distinctive characteristics across different crops. Based on our observations, we propose "Global Compression Commander" (i.e., GlobalCom²), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom² leverages the thumbnail as the "commander" to guide the compression of local crops, adaptively preserving informative details while eliminating redundancy. Extensive experiments show that GlobalCom² maintains over 90% performance while compressing 90% of visual tokens, reducing FLOPs and peak memory to 9.1% and 60%.
中文标题/摘要
标题:全局压缩指挥官:高分辨率大型视觉语言模型的即插即用推理加速
大型视觉语言模型(LVLMs)在视觉理解方面表现出色,但由于处理长多模态上下文的二次复杂性,面临效率挑战。虽然标记压缩可以降低计算成本,但现有方法主要针对单视图LVLMs,未能考虑高分辨率LVLMs动态裁剪的独特多视图特性。现有方法对所有标记处理一致,但我们的分析表明,全局缩略图可以自然引导局部裁剪的压缩,提供整体上下文以评估信息性。在本文中,我们首先分析动态裁剪策略,揭示了缩略图和裁剪之间的互补性质以及不同裁剪的独特特征。基于我们的观察,我们提出了“全局压缩指挥官”(即GlobalCom²),一种新型的即插即用标记压缩框架,适用于高分辨率LVLMs。GlobalCom²利用缩略图作为“指挥官”来引导局部裁剪的压缩,自适应地保留信息性细节并消除冗余。大量实验表明,GlobalCom²在压缩90%视觉标记的同时保持超过90%的性能,并将FLOPs和峰值内存分别降至9.1%和60%。
Summary / 总结
The research aims to address the efficiency challenges of high-resolution large vision-language models (HR-LVLMs) by proposing a novel token compression framework, Global Compression Commander (GlobalCom²). This framework uses global thumbnails to guide the compression of local crops, preserving informative details while reducing redundancy. Experiments demonstrate that GlobalCom² maintains over 90% performance with 90% visual token compression, reducing FLOPs and peak memory to 9.1% and 60%, respectively.
本文提出了一种新颖的令牌压缩框架Global Compression Commander (GlobalCom²),通过全局缩略图来指导局部裁剪的压缩,保留关键细节同时减少冗余。实验表明,GlobalCom²在90%视觉令牌压缩的情况下保持了超过90%的性能,并将FLOPs和峰值内存分别降至9.1%和60%。
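One way to picture thumbnail-guided compression is as a two-level budget: the thumbnail decides how many tokens each local crop may keep, and each crop then keeps its highest-scoring tokens. The sketch below implements that reading with NumPy. The importance values, the proportional allocation rule, and the top-k selection are all assumptions for illustration and are not taken from the GlobalCom² code.

```python
import numpy as np

def allocate_budgets(crop_importance, tokens_per_crop, keep_ratio):
    """Split a global token budget across crops in proportion to thumbnail-derived importance."""
    importance = np.asarray(crop_importance, dtype=float)
    total_keep = int(round(keep_ratio * tokens_per_crop * len(importance)))
    budgets = np.floor(importance / importance.sum() * total_keep).astype(int)
    return np.minimum(budgets, tokens_per_crop)   # a crop cannot keep more tokens than it has

def compress_crop(token_scores, budget):
    """Keep the `budget` highest-scoring tokens of one crop, in their original order."""
    keep = np.argsort(-token_scores)[:budget]
    return np.sort(keep)

rng = np.random.default_rng(0)
crop_importance = [5, 3, 1, 1]          # hypothetical per-crop scores read off the thumbnail
budgets = allocate_budgets(crop_importance, tokens_per_crop=100, keep_ratio=0.1)
token_scores = rng.random(100)          # hypothetical per-token informativeness in crop 0
kept = compress_crop(token_scores, budgets[0])
```

With a keep ratio of 0.1, at most 40 of the 400 crop tokens survive here, which corresponds to the 90% visual-token compression setting quoted in the abstract.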
GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition
Authors: Jingchao Wang, Yifan He, Haote Yang, Jiang Wu, Lingli Ge, Xingjian Wei, Yinfan Wang, Linye Li, Huijie Ao, Chengjin Liu, Bin Wang, Lijun Wu, Conghui He
First: 2025-06-09T08:47:10+00:00 · Latest: 2026-01-13T07:21:42+00:00
Abstract
Optical Chemical Structure Recognition (OCSR) is essential for converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown promise, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To address these issues, we introduce GTR-VL, featuring two key innovations: (1) the "Graph Traversal as Visual Chain of Thought" mechanism that emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric "Faithfully Recognize What You've Seen" principle, which aligns abbreviated structures in images with their expanded annotations. For hand-drawn OCSR tasks, where datasets lack graph annotations and only provide final SMILES, we apply reinforcement learning using the GRPO method, introducing reward mechanisms like format reward, graph reward, and SMILES reward. This approach significantly enhances performance in hand-drawn recognition tasks through weak supervision. We developed GTR-1.3M, a large-scale instruction-tuning dataset with corrected annotations, and MolRec-Bench, the first benchmark for fine-grained evaluation of graph-parsing accuracy in OCSR. Our two-stage training scheme involves SFT training for printed images and the GRPO method for transferring capabilities to hand-drawn tasks. Experiments show that GTR-VL outperforms specialist models, chemistry-domain VLMs, and commercial VLMs on both printed and hand-drawn datasets.
中文标题/摘要
标题:GTR-CoT: 图形遍历作为视觉链式思考的分子结构识别
光学化学结构识别(OCSR)对于将分子图像转换为机器可读格式至关重要。尽管最近的视觉-语言模型(VLMs)显示出潜力,但它们的图像-描述方法往往难以处理复杂的分子结构和不一致的注释。为了解决这些问题,我们引入了GTR-VL,其包含两个关键创新:(1)图形遍历作为视觉链式思考机制,通过顺序预测原子-键来模拟人类推理过程,逐步解析分子图;(2)以数据为中心的“忠实再现所见”原则,将图像中的简略结构与扩展注释对齐。对于手绘OCSR任务,由于数据集缺乏图形注释,仅提供最终的SMILES,我们使用GRPO方法应用强化学习,引入格式奖励、图形奖励和SMILES奖励等奖励机制。这种方法通过弱监督显著提高了手绘识别任务的性能。我们开发了GTR-1.3M,这是一个包含修正注释的大规模指令调优数据集,以及MolRec-Bench,这是第一个用于细粒度评估OCSR中图形解析准确性的基准。我们的两阶段训练方案包括对印刷图像的SFT训练和使用GRPO方法将能力转移到手绘任务。实验表明,GTR-VL在印刷和手绘数据集上均优于专门模型、化学领域VLMs和商用VLMs。
Summary / 总结
The research aims to improve Optical Chemical Structure Recognition (OCSR) by addressing the limitations of existing vision-language models. The method introduces GTR-VL, which uses a Graph Traversal as Visual Chain of Thought mechanism to incrementally parse molecular graphs and aligns abbreviated structures with expanded annotations. For hand-drawn OCSR tasks, the approach employs reinforcement learning with the GRPO method to enhance performance through weak supervision. Experiments demonstrate that GTR-VL outperforms specialist models and commercial vision-language models on both printed and hand-drawn datasets.
研究旨在通过解决视觉语言模型在处理复杂分子结构和不一致注解方面的局限性,来提升光学化学结构识别(OCSR)。方法引入了GTR-VL,该方法使用图遍历作为视觉链式思考机制来逐步解析分子图,并采用数据为中心的原则将图像结构与注解对齐。对于手绘OCSR任务,应用了基于GRPO的强化学习。实验结果显示,GTR-VL在印刷和手绘数据集上均优于专门模型和商用视觉语言模型。
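The "sequential atom-bond predictions" idea above amounts to serializing a molecular graph as an ordered list of traversal steps rather than emitting a final string in one shot. The sketch below shows a depth-first traversal that records every bond exactly once as an (atom, bond order, neighbor) step. The traversal order, the step format, and the function name are hypothetical; the entry does not specify the format GTR-VL is trained to produce.

```python
def traverse_molecule(atoms, bonds, start=0):
    """Depth-first traversal emitting one (atom, bond order, neighbour) step per bond."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        neighbours[a].append((b, order))
        neighbours[b].append((a, order))
    steps, visited_atoms, visited_bonds = [], {start}, set()
    stack = [start]
    while stack:
        current = stack.pop()
        for nxt, order in neighbours[current]:
            bond = (min(current, nxt), max(current, nxt))
            if bond in visited_bonds:      # each bond is reported once, so rings terminate
                continue
            visited_bonds.add(bond)
            steps.append((f"{atoms[current]}{current}", order, f"{atoms[nxt]}{nxt}"))
            if nxt not in visited_atoms:
                visited_atoms.add(nxt)
                stack.append(nxt)
    return steps

# Heavy-atom graph of acetic acid: CH3-C(=O)-OH
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, "single"), (1, 2, "double"), (1, 3, "single")]
steps = traverse_molecule(atoms, bonds)
```

A step list like this can be checked bond by bond against a reference graph, which is the kind of fine-grained comparison a graph-parsing benchmark needs.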
VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark
Authors: Vy Tuong Dang, An Vo, Emilio Villa-Cueva, Quang Tau, Duc Dm, Thamar Solorio, Daeyoung Kim
First: 2025-08-19T09:31:18+00:00 · Latest: 2026-01-13T07:17:13+00:00
Abstract
We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu.github.io.
中文标题/摘要
标题:VMMU:越南多任务多模态理解与推理基准
我们介绍了VMMU,一个越南多任务多模态理解与推理基准,旨在评估视觉语言模型(VLMs)如何超越英语对视觉和文本信息进行解释和推理。VMMU 包含7个任务中的2500个多模态问题,涵盖了从STEM问题解决到数据解释、规则指导的视觉推理和抽象视觉推理等多种问题情境。所有问题都需要真正的多模态整合,而不是依赖于仅基于文本的线索或OCR捷径。我们对VMMU上的一系列最先进的专有和开源VLMs进行了评估。尽管越南OCR性能很强,但专有模型的平均准确率仅为66%。进一步的分析表明,失败的主要原因是多模态定位和文本与视觉证据的推理,而不是OCR。代码和数据可在https://vmmu.github.io/获取。
Summary / 总结
VMMU is a Vietnamese benchmark for evaluating vision-language models in multitask multimodal understanding and reasoning, extending beyond English. It includes 2,500 multimodal questions across 7 diverse tasks, requiring genuine multimodal integration. Despite strong OCR performance in Vietnamese, proprietary models achieve only 66% mean accuracy, indicating challenges in multimodal grounding and reasoning. Open-source and proprietary VLMs are evaluated on this benchmark, highlighting the need for improved multimodal understanding capabilities. Code and data are available at https://vmmu.github.io.
VMMU是一个用于评估视觉语言模型在理解和推理视觉和文本信息方面的越南基准,超越了英语。它包含2500个跨7个不同任务的多模态问题,需要真正的多模态整合。尽管OCR性能很强,但专用模型的平均准确率仅为66%,主要问题是文本和视觉证据的多模态定位和推理。
Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function
Authors: Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
First: 2025-12-04T08:21:52+00:00 · Latest: 2026-01-13T04:42:44+00:00
Comments: 36 pages, 21 figures, 4 tables
Abstract
Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
中文标题/摘要
标题:通过软Q函数的重参数化策略梯度进行扩散微调
扩散模型在生成高似然样本方面表现出色,但通常需要与下游目标对齐。现有的扩散模型微调方法严重受到奖励过度优化的影响,导致生成高奖励但不自然的样本,并且降低了多样性。为了减轻过度优化问题,我们提出了基于软Q的扩散微调(SQDF),这是一种新颖的KL正则化强化学习方法,用于扩散对齐,它对软Q函数的一个无需训练、可微的估计应用重参数化策略梯度。SQDF进一步通过三个创新进行增强:在去噪过程中用于适当信用分配的折扣因子,集成一致性模型以细化Q函数估计,以及使用离策略回放缓冲区以提高模式覆盖范围并管理奖励与多样性之间的权衡。我们的实验表明,SQDF在文本到图像对齐中实现了更高的目标奖励,同时保持了多样性。此外,在在线黑盒优化中,SQDF实现了高样本效率,同时保持了自然性和多样性。
Summary / 总结
The research aims to improve the alignment of diffusion models with downstream objectives by addressing reward over-optimization, which yields high-reward but unnatural samples and degraded diversity. The proposed method, Soft Q-based Diffusion Finetuning (SQDF), is a KL-regularized reinforcement learning approach that applies a reparameterized policy gradient to a training-free, differentiable estimate of the soft Q-function. SQDF adds three innovations: a discount factor for proper credit assignment in the denoising process, consistency models to refine Q-function estimates, and an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Experiments show that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment, and attains high sample efficiency in online black-box optimization while maintaining naturalness and diversity.
研究旨在解决奖励过度优化问题,从而改进扩散模型与下游目标的对齐,避免生成不自然的样本和多样性下降。作者提出了基于软Q的扩散微调(SQDF)方法,该方法采用KL正则化的强化学习,并使用软Q函数的重参数化策略梯度。SQDF还包括折扣因子、一致性模型和离策略回放缓冲区等创新,以提高性能。实验表明,SQDF在文本到图像对齐中实现了更高的目标奖励,同时保持了多样性,并在在线黑盒优化中表现出高样本效率,同时保持自然性和多样性。
Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging
Authors: Md. Faiyaz Abdullah Sayeedi, Rashedur Rahman, Siam Tahsin Bhuiyan, Sefatul Wasi, Ashraful Islam, Saadia Binte Alam, AKM Mahbubur Rahman
First: 2026-01-13T03:44:06+00:00 · Latest: 2026-01-13T03:44:06+00:00
Abstract
Medical image analysis increasingly relies on large vision-language models (VLMs), yet most systems remain single-pass black boxes that offer limited control over reasoning, safety, and spatial grounding. We propose R^4, an agentic framework that decomposes medical imaging workflows into four coordinated agents: a Router that configures task- and specialization-aware prompts from the image, patient history, and metadata; a Retriever that uses exemplar memory and pass@k sampling to jointly generate free-text reports and bounding boxes; a Reflector that critiques each draft-box pair for key clinical error modes (negation, laterality, unsupported claims, contradictions, missing findings, and localization errors); and a Repairer that iteratively revises both narrative and spatial outputs under targeted constraints while curating high-quality exemplars for future cases. Instantiated on chest X-ray analysis with multiple modern VLM backbones and evaluated on report generation and weakly supervised detection, R^4 consistently boosts LLM-as-a-Judge scores by roughly +1.7-+2.5 points and mAP50 by +2.5-+3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning. These results show that agentic routing, reflection, and repair can turn strong but brittle VLMs into more reliable and better grounded tools for clinical image interpretation. Our code can be found at: https://github.com/faiyazabdullah/MultimodalMedAgent
中文标题/摘要
标题:路由、检索、反思、修复:用于医学影像视觉检测和语言推理的自我改进智能体框架
医学影像分析越来越多地依赖于大型视觉-语言模型(VLMs),但大多数系统仍然是单次通过的黑盒系统,对推理、安全性和空间定位的控制有限。我们提出了一种名为R^4的智能体框架,将医学影像工作流程分解为四个协调的智能体:一个路由器,根据影像、患者病史和元数据配置任务感知和专科感知的提示;一个检索器,使用示例记忆和pass@k采样来联合生成自由文本报告和边界框;一个反思器,针对关键临床错误模式(否定、左右侧别、无依据的断言、矛盾、遗漏的发现和定位错误)审查每个草稿-边界框对;以及一个修复器,在有针对性的约束下迭代修订叙述和空间输出,并为未来病例整理高质量的示例。该框架基于多种现代VLM骨干网络在胸部X光分析中实现,并在报告生成和弱监督检测方面进行评估。相对于强大的单VLM基线,R^4将LLM-as-a-Judge得分提高了约+1.7-+2.5分,将mAP50提高了+2.5-+3.5个绝对百分点,而无需任何基于梯度的微调。这些结果表明,智能体式的路由、反思和修复可以将强大但脆弱的VLM转变为更可靠、定位更准确的临床影像解读工具。我们的代码可以在:https://github.com/faiyazabdullah/MultimodalMedAgent 找到
Summary / 总结
The research aims to improve the control and reliability of visual detection and linguistic reasoning in medical imaging using large vision-language models (VLMs). The proposed R^4 framework decomposes the workflow into four agents: Router, Retriever, Reflector, and Repairer. The Router configures task- and specialization-aware prompts, the Retriever uses exemplar memory and pass@k sampling to generate reports and bounding boxes, the Reflector critiques each draft-box pair for key clinical error modes, and the Repairer iteratively revises the outputs. On chest X-ray analysis, R^4 improves LLM-as-a-Judge scores by roughly +1.7 to +2.5 points and mAP50 by +2.5 to +3.5 absolute points over strong single-VLM baselines, without any gradient-based fine-tuning, indicating that the agentic approach makes VLMs more reliable and better grounded.
研究旨在通过大型视觉-语言模型(VLM)提高医学影像中视觉检测和语言推理的控制和可靠性。提出的R^4框架将工作流程分解为四个代理:Router、Retriever、Reflector和Repairer。Router配置任务感知提示,Retriever生成报告和边界框,Reflector批判错误,Repairer迭代修订输出。在胸部X光分析中,R^4显著提高了LLM-as-a-Judge评分和mAP50,优于强大的单一VLM基线,且无需微调,展示了代理方法在增强VLM可靠性和定位方面的有效性。
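The route, retrieve, reflect, repair cycle described above can be pictured as a small control loop. The sketch below is only an illustration under stated assumptions: the function name `run_r4`, the callable signatures, the `max_rounds` cap, and the rule of storing only issue-free drafts as exemplars are all hypothetical and not taken from the paper, and pass@k sampling is collapsed into a single `retrieve` call.

```python
def run_r4(case, route, retrieve, reflect, repair, memory, max_rounds=3):
    """Hypothetical control loop for a Router/Retriever/Reflector/Repairer pipeline.

    All callables are supplied by the caller (assumed interfaces):
      route(case) -> prompt
      retrieve(prompt, memory) -> draft (report text plus boxes)
      reflect(draft) -> list of detected issues (empty if none)
      repair(draft, issues) -> revised draft
    """
    prompt = route(case)
    draft = retrieve(prompt, memory)
    issues = reflect(draft)
    rounds = 0
    # Revise until the critique is clean or the round budget is spent.
    while issues and rounds < max_rounds:
        draft = repair(draft, issues)
        issues = reflect(draft)
        rounds += 1
    # Assumption: only drafts that pass the critique become exemplars.
    if not issues:
        memory.append(draft)
    return draft, issues
```

Passing the four agents in as callables keeps the loop independent of any particular VLM backbone, which matches the abstract's claim that the framework is instantiated on multiple backbones without gradient-based fine-tuning.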
Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
Authors: Shezheng Song, Shasha Li, Jie Yu
First: 2026-01-13T02:26:21+00:00 · Latest: 2026-01-13T02:26:21+00:00
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage "review" phenomenon where visual signals are reactivated before output generation. Besides, we further analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.
中文标题/摘要
标题:视觉与语言交汇于何处?通过对比注意力理解并精炼MLLM中的视觉融合
多模态大型语言模型(MLLMs)在视觉-语言理解方面取得了显著进展,但它们如何内部整合视觉和文本信息仍不甚明了。为弥合这一差距,我们对多个架构进行了系统的逐层掩码分析,揭示了MLLMs中视觉-文本融合的演变过程。结果显示,融合主要在几个特定层中出现,而不是均匀分布在整张网络中,某些模型还表现出一种晚期“审查”现象,即在输出生成前重新激活视觉信号。此外,我们进一步分析了逐层注意力的演变,并观察到持续的高注意力噪声集中在无关区域,同时文本对齐区域的注意力逐渐增加。根据这些见解,我们引入了一种无需训练的对比注意力框架,以建模早期融合与最终层之间的转换,突出有意义的注意力变化。广泛的实验在各种MLLMs和基准上验证了我们的分析,并证明所提出的方法提高了多模态推理性能。代码将被发布。
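To make the contrastive idea above concrete, here is a minimal NumPy sketch. It assumes that the "transformation between early fusion and final layers" can be approximated by a clipped difference of two normalized attention maps; that choice, the function name `contrastive_attention`, and the fallback behaviour are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def contrastive_attention(early_attn, final_attn, eps=1e-8):
    """Emphasize image regions whose attention grows from an early-fusion
    layer to the final layer (illustrative assumption, not the paper's rule)."""
    early = np.asarray(early_attn, dtype=float)
    final = np.asarray(final_attn, dtype=float)
    # Normalize each map so the two layers are comparable.
    early = early / (early.sum() + eps)
    final = final / (final.sum() + eps)
    # Keep only positive shifts: regions that stay high in both layers
    # (persistent noise) cancel out, regions that gain attention survive.
    shift = np.clip(final - early, 0.0, None)
    total = shift.sum()
    if total < eps:
        return final  # no meaningful shift; fall back to the final map
    return shift / total
```

In this toy formulation a region that receives high attention at both layers contributes nothing to the output, which mirrors the abstract's observation of persistent high-attention noise on irrelevant regions.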
Subspace Alignment for Vision-Language Model Test-time Adaptation
Authors: Zhichen Zeng, Wenxuan Bao, Xiao Lin, Ruizhong Qiu, Tianxin Wei, Xuying Ning, Yuchen Yan, Chen Luo, Monica Xiao Cheng, Jingrui He, Hanghang Tong
First: 2026-01-13T02:02:41+00:00 · Latest: 2026-01-13T02:02:41+00:00
Comments: 17 pages, 10 figures
Abstract
Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.
中文标题/摘要
标题:面向视觉语言模型测试时适应的子空间对齐
视觉语言模型(VLMs)尽管具有非凡的零样本能力,却容易受到分布偏移的影响。测试时适应(TTA)已成为一种主流策略,可以在无标注的测试数据上实时适应VLMs。然而,现有的TTA方法严重依赖零样本预测作为自训练的伪标签,这在分布偏移下可能不可靠,并且由于两个根本限制而误导适应。首先(模态差距),分布偏移在视觉和文本模态之间造成了差距,使得跨模态关系不准确。其次(视觉干扰),视觉嵌入编码了丰富但与任务无关的噪声,在分布偏移下往往会压倒特定任务的语义。为了解决这些限制,我们提出了SubTTA,它对齐两种模态的语义子空间以增强零样本预测,从而更好地指导TTA过程。为了弥合模态差距,SubTTA 提取了两种模态的主子空间,并通过最小化它们的弦距离将视觉流形对齐到文本语义锚点。为了消除视觉干扰,SubTTA 将对齐后的视觉特征投影到特定任务的文本子空间,通过将视觉嵌入限制在有效的语义张成空间内来过滤掉与任务无关的噪声,并在净化后的空间上进一步执行标准TTA以细化决策边界。在各种基准和VLM架构上的广泛实验证明了SubTTA的有效性,相比最先进的TTA方法平均提升2.24%。
Summary / 总结
The paper addresses the limitations of existing test-time adaptation (TTA) methods for vision-language models (VLMs) under distribution shifts, such as modality gaps and visual nuisances. It proposes SubTTA, which aligns the semantic subspaces of visual and textual modalities to enhance zero-shot predictions. SubTTA minimizes the chordal distance between visual and textual subspaces to bridge the modality gap and projects aligned visual features onto the task-specific textual subspace to eliminate visual nuisances. Experiments show that SubTTA improves performance by an average of 2.24% over state-of-the-art TTA methods on various benchmarks and VLM architectures.
论文提出了一种SubTTA方法,通过对视觉和文本模态的语义子空间进行对齐来增强零样本预测,以应对视觉语言模型(VLMs)在分布变化下的脆弱性。SubTTA通过最小化视觉和文本子空间之间的弦距来弥合模态差距,并将对齐后的视觉特征投影到任务特定的文本子空间中,以过滤掉任务无关的噪声。实验表明,SubTTA在各种基准和VLM架构上平均提高了2.24%的性能,优于现有TTA方法。
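The two geometric operations named in the abstract, the chordal distance between principal subspaces and projection onto a subspace, have standard definitions that can be sketched in NumPy. The helper names below are hypothetical, the subspaces are taken from an uncentered SVD as an assumption, and the optimization that actually minimizes the distance in SubTTA is not shown.

```python
import numpy as np

def principal_subspace(X, k):
    """Orthonormal basis (d x k) of the top-k right singular directions
    of a feature matrix X (n x d). Uncentered SVD is an assumption here."""
    _, _, vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    return vt[:k].T

def chordal_distance(U, V):
    """Chordal distance between the subspaces spanned by orthonormal
    bases U and V (both d x k): sqrt(k - ||U^T V||_F^2)."""
    k = U.shape[1]
    overlap = np.linalg.norm(U.T @ V, "fro") ** 2
    return float(np.sqrt(max(k - overlap, 0.0)))

def project_onto(U, x):
    """Orthogonal projection of a vector x (d,) onto span(U)."""
    return U @ (U.T @ np.asarray(x, dtype=float))
```

The chordal distance is zero when the two subspaces coincide and reaches sqrt(k) when they are orthogonal. Projecting a visual feature onto the textual subspace discards every component outside that span, which is the filtering effect the abstract attributes to the projection step.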
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
Authors: Jinhe Bi, Aniri, Yifan Wang, Danqi Yan, Wenke Huang, Zengjie Jin, Xiaowen Ma, Sikuan Yan, Artur Hecker, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma
First: 2025-02-17T18:43:41+00:00 · Latest: 2026-01-13T00:54:02+00:00
Abstract
Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a Global Semantic Drift, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise PRISM, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7% relative improvement over the baseline. The code is available at https://github.com/bibisbar/PRISM.
中文标题/摘要
标题:PRISM:用于无需训练的多模态数据选择的自剪枝内在选择方法
视觉指令调优使预训练的多模态大型语言模型(MLLMs)能够遵循人类指令,以适应实际应用。然而,这些数据集的快速增长引入了大量冗余,导致计算成本增加。现有的指令数据选择方法旨在消除这种冗余,但主要依赖于计算密集型技术,如基于代理的推理或基于训练的指标。因此,这些选择过程产生的大量计算成本往往加剧了它们旨在解决的效率瓶颈,对MLLMs的可扩展和有效调优构成了重大挑战。为应对这一挑战,我们首先识别了一个关键但此前被忽视的因素:视觉特征分布中固有的各向异性。我们发现,这种各向异性导致了全局语义漂移,而忽视这一现象是限制当前数据选择方法效率的关键因素。基于这一洞察,我们设计了PRISM,这是首个无需训练的高效视觉指令选择框架。PRISM通过隐式重新中心化建模内在的视觉语义,精准地消除了全局背景特征的干扰。实验表明,PRISM将数据选择和模型调优的端到端时间缩短至传统流程的30%。更值得注意的是,它在实现这一效率的同时还提升了性能,在八个多模态基准和三个语言理解基准上超越了在完整数据集上微调的模型,相对基线提升达101.7%。代码可通过此仓库访问:https://github.com/bibisbar/PRISM。
Summary / 总结
The research addresses redundancy in visual instruction tuning datasets for Multimodal Large Language Models (MLLMs). It identifies anisotropy in visual feature distributions as the cause of a Global Semantic Drift and introduces PRISM, a training-free selection framework that removes the influence of global background features through implicit re-centering. PRISM reduces the end-to-end time for data selection and model tuning to 30% of conventional pipelines and outperforms models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, achieving a 101.7% relative improvement over the baseline.
PRISM 是一种无需训练的方法,旨在减少 MLLMs 的视觉指令数据冗余,将数据选择和模型调优的端到端时间缩短至传统管道的 30%,同时在多个基准测试中表现出更优的效果,超越了全数据集微调模型,相对基线提升了 101.7%。
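The effect of re-centering anisotropic features can be shown with a small NumPy sketch. It assumes that re-centering amounts to subtracting the dataset mean before comparing features by cosine similarity; PRISM's actual "implicit" re-centering and its selection criterion are not specified in the abstract, so the function names and the mean-subtraction step are assumptions.

```python
import numpy as np

def cosine_matrix(F, eps=1e-12):
    """Pairwise cosine similarity between the rows of F (n x d)."""
    F = np.asarray(F, dtype=float)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)
    return F @ F.T

def recenter(F):
    """Subtract the mean feature, removing the component shared by all
    samples (explicit mean subtraction is an illustrative assumption)."""
    F = np.asarray(F, dtype=float)
    return F - F.mean(axis=0, keepdims=True)

# Toy anisotropic features: one dominant shared direction plus small
# sample-specific components.
features = [[10.0, 1.0, 0.0],
            [10.0, -1.0, 0.0],
            [10.0, 0.0, 1.0],
            [10.0, 0.0, -1.0]]
raw_sim = cosine_matrix(features)
centered_sim = cosine_matrix(recenter(features))
```

Before re-centering, every pair looks nearly identical because the shared component dominates the dot product. After re-centering, the first two samples turn out to point in opposite directions, so a similarity-based selector could tell them apart.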
UNCAP: Uncertainty-Guided Neurosymbolic Planning Using Natural Language Communication for Cooperative Autonomous Vehicles
Authors: Neel P. Bhatt, Po-han Li, Kushagra Gupta, Rohan Siva, Daniel Milan, Alexander T. Hogue, Sandeep P. Chinchali, David Fridovich-Keil, Zhangyang Wang, Ufuk Topcu
Venue: AAMAS 2026
First: 2025-10-14T21:09:09+00:00 · Latest: 2026-01-12T22:23:28+00:00
Abstract
Safe large-scale coordination of multiple cooperative connected autonomous vehicles (CAVs) hinges on communication that is both efficient and interpretable. Existing approaches either rely on transmitting high-bandwidth raw sensor data streams or neglect perception and planning uncertainties inherent in shared data, resulting in systems that are neither scalable nor safe. To address these limitations, we propose Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a vision-language model-based planning approach that enables CAVs to communicate via lightweight natural language messages while explicitly accounting for perception uncertainty in decision-making. UNCAP features a two-stage communication protocol: (i) an ego CAV first identifies the subset of vehicles most relevant for information exchange, and (ii) the selected CAVs then transmit messages that quantitatively express their perception uncertainty. By selectively fusing messages that maximize mutual information, this strategy allows the ego vehicle to integrate only the most relevant signals into its decision-making, improving both the scalability and reliability of cooperative planning. Experiments across diverse driving scenarios show a 63% reduction in communication bandwidth with a 31% increase in driving safety score, a 61% reduction in decision uncertainty, and a four-fold increase in collision distance margin during near-miss events. Project website: https://uncap-project.github.io/
中文标题/摘要
标题:UNCAP:基于自然语言通信的不确定性引导神经符号规划,用于协同自主车辆
多辆协同连接自主车辆(CAVs)的安全大规模协调依赖于高效且可解释的通信。现有方法要么依赖于传输高带宽的原始传感器数据流,要么忽视共享数据中固有的感知和规划不确定性,导致系统既不具有可扩展性也不安全。为解决这些局限性,我们提出了不确定性引导的自然语言协同自主规划(UNCAP),这是一种基于视觉语言模型的规划方法,使CAVs能够通过轻量级自然语言消息进行通信,并在决策中明确考虑感知不确定性。UNCAP具有两阶段的通信协议:(i)一辆自我车辆首先识别最相关的车辆子集进行信息交换,(ii)然后选择的车辆传输定量表达其感知不确定性的消息。通过选择性地融合最大化互信息的消息,该策略使自我车辆仅整合最相关的信号进行决策,从而提高协同规划的可扩展性和可靠性。在多种驾驶场景下的实验显示,通信带宽减少了63%,驾驶安全性得分提高了31%,决策不确定性减少了61%,并且在接近碰撞事件中碰撞距离裕度提高了四倍。项目网站:https://uncap-project.github.io/
Summary / 总结
The research aims to improve the coordination of multiple cooperative connected autonomous vehicles (CAVs) through communication that is both efficient and interpretable. UNCAP, a vision-language model-based planning approach with a two-stage communication protocol, lets CAVs exchange lightweight natural language messages while explicitly accounting for perception uncertainty: the ego vehicle first selects the most relevant vehicles, which then transmit messages that quantify their perception uncertainty. Experiments demonstrate a 63% reduction in communication bandwidth, a 31% increase in driving safety score, a 61% reduction in decision uncertainty, and a four-fold increase in collision distance margin during near-miss events.
研究旨在通过提高通信效率和可解释性,确保合作自动驾驶车辆的安全和可扩展协调。UNCAP是一种基于视觉-语言模型的方法,使自动驾驶车辆能够通过轻量级自然语言消息进行通信,并考虑感知不确定性。实验结果显示,通信带宽减少了63%,驾驶安全性提高了31%,决策不确定性降低了61%,并且在接近碰撞事件中碰撞距离裕度提高了四倍。
Rescind: Countering Image Misconduct in Biomedical Publications with Vision-Language and State-Space Modeling
Authors: Soumyaroop Nandi, Prem Natarajan
First: 2026-01-12T22:13:58+00:00 · Latest: 2026-01-12T22:13:58+00:00
Abstract
Scientific image manipulation in biomedical publications poses a growing threat to research integrity and reproducibility. Unlike natural image forensics, biomedical forgery detection is uniquely challenging due to domain-specific artifacts, complex textures, and unstructured figure layouts. We present the first vision-language guided framework for both generating and detecting biomedical image forgeries. By combining diffusion-based synthesis with vision-language prompting, our method enables realistic and semantically controlled manipulations, including duplication, splicing, and region removal, across diverse biomedical modalities. We introduce Rescind, a large-scale benchmark featuring fine-grained annotations and modality-specific splits, and propose Integscan, a structured state space modeling framework that integrates attention-enhanced visual encoding with prompt-conditioned semantic alignment for precise forgery localization. To ensure semantic fidelity, we incorporate a vision-language model based verification loop that filters generated forgeries based on consistency with intended prompts. Extensive experiments on Rescind and existing benchmarks demonstrate that Integscan achieves state of the art performance in both detection and localization, establishing a strong foundation for automated scientific integrity analysis.
中文标题/摘要
标题:Rescind:利用视觉-语言和状态空间建模应对生物医学出版物中的图像不当行为
生物医学出版物中的科学图像篡改正日益威胁到研究的完整性和可重复性。与自然图像取证不同,生物医学伪造检测因其领域特定的特征、复杂的纹理和不规则的图布局而独具挑战性。我们提出了第一个由视觉-语言引导的框架,用于生成和检测生物医学图像伪造。通过结合基于扩散的合成与视觉-语言提示,我们的方法能够实现现实且语义控制的篡改,包括复制、拼接和区域删除,跨越多种生物医学模态。我们引入了Rescind,一个大规模基准,包含精细注释和模态特定分割,并提出了Integscan,一种结构化的状态空间建模框架,结合了注意力增强的视觉编码与提示条件的语义对齐,以实现精确的伪造定位。为了确保语义一致性,我们引入了基于视觉-语言模型的验证循环,根据与预期提示的一致性过滤生成的伪造。在Rescind和现有基准上的广泛实验表明,Integscan在检测和定位方面均达到最先进的性能,为自动科学研究完整性分析奠定了坚实的基础。
Summary / 总结
The research addresses scientific image manipulation in biomedical publications, which threatens research integrity and reproducibility. It introduces a vision-language guided framework that combines diffusion-based synthesis with vision-language prompting to generate and detect biomedical image forgeries, including duplication, splicing, and region removal, with a verification loop that filters generated forgeries for semantic fidelity. It also contributes Rescind, a large-scale benchmark with fine-grained annotations and modality-specific splits, and Integscan, a structured state space modeling framework for forgery localization. Experiments on Rescind and existing benchmarks show that Integscan achieves state-of-the-art performance in both detection and localization.
研究针对生物医学出版物中威胁研究完整性的图像篡改问题,提出了首个由视觉-语言引导的框架,用于生成和检测生物医学图像篡改。通过使用基于扩散的合成和视觉-语言提示,该方法可以在不同生物医学模态中生成逼真的篡改。研究引入了大规模基准Rescind,并提出了Integscan,一种结构化的状态空间建模框架,结合了视觉编码和语义对齐,以实现精确的篡改定位。实验表明,Integscan在检测和定位方面均达到最先进的性能,为自动化的科学诚信分析奠定了坚实的基础。