VIBE: Can a VLM Read the Room?
Authors: Tania Chakraborty, Eylon Caplan, Dan Goldwasser
Venue: EMNLP
First: 2025-06-11T19:07:35+00:00 · Latest: 2025-12-16T18:42:51+00:00
Comments: Findings of EMNLP, 2025
Abstract
Understanding human social behavior, such as recognizing emotions and the social dynamics causing them, is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially account for this gap; however, their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high-quality dataset to test the abilities of a VLM for this task and benchmark the performance of several VLMs on it.
中文标题/摘要
标题:VIBE:VLM能否读懂房间里的社交信号?
理解人类社会行为,如识别情绪及其背后的社会动态,是一个重要且具有挑战性的问题。尽管语言模型取得了显著进展,但它们仅限于文本领域,无法解释非言语线索在理解社交情境中的重要作用。视觉语言模型(VLMs)可能能够弥补这一差距,但它们在推理此类社会线索方面的能力尚未受到广泛关注。在本文中,我们探讨了VLM在社会推理方面的能力。我们发现VLM的一个先前未被注意到的局限性:视觉社会-语用推理差距。为解决这一差距,我们为VLM提出了一项新任务:视觉社会-语用推理。我们构建了一个高质量的数据集来测试VLM在该任务上的能力,并在该数据集上对几种VLM进行了基准测试。
Summary / 总结
The paper explores the capabilities of Vision Language Models (VLMs) in social reasoning, identifying a limitation known as the Visual Social-Pragmatic Inference gap. To address this, the authors propose a new task and construct a high-quality dataset to test VLMs. The main experimental findings show that current VLMs struggle with visual social-pragmatic inference, highlighting the need for improvement in this area.
论文探讨了视觉语言模型(VLM)在社会推理方面的能力,指出了视觉社会-语用推理差距这一限制。为解决这一问题,作者提出了一项新任务并构建了一个高质量的数据集来测试VLM。主要实验发现表明,当前的VLM在视觉社会-语用推理方面存在困难,强调了在这一领域进行改进的必要性。
Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
Authors: Antonio Guillen-Perez
First: 2025-12-12T20:07:04+00:00 · Latest: 2025-12-16T17:15:46+00:00
Abstract
The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
中文标题/摘要
标题:Semantic-Drive: 通过开放词汇接地和神经符号VLM共识促进长尾数据整理的民主化
自主车辆(AV)的稳健开发受到“长尾”训练数据稀缺的限制。尽管车队收集了大量视频日志,但识别罕见的安全关键事件(例如,不规则的随意横穿马路、施工改道)仍然是一个手动且成本高昂的过程。现有解决方案依赖于粗略的元数据搜索,缺乏精确性,或者依赖于基于云的VLM,这侵犯了隐私并成本高昂。我们提出了Semantic-Drive,这是一种本地优先的神经符号框架,用于语义数据挖掘。我们的方法将感知分为两个阶段:(1)通过实时开放词汇检测器(YOLOE)进行符号接地,以锚定注意力;(2)通过推理VLM进行认知分析,执行法医场景分析。为了减轻幻觉,我们实现了一种“系统2”推理时对齐策略,利用多模型“裁判-侦察兵”共识机制。在nuScenes数据集上与Waymo开放数据集(WOD-E2E)分类法进行基准测试,Semantic-Drive的召回率为0.966(而CLIP为0.475),与最佳单侦察兵模型相比,风险评估误差降低了40%。该系统完全在消费级硬件(NVIDIA RTX 3090)上运行,提供了一种隐私保护的替代方案,替代了基于云的解决方案。
Summary / 总结
Semantic-Drive is a local-first framework designed to address the scarcity of long-tail training data for autonomous vehicles. It uses a two-stage process: symbolic grounding with a real-time open-vocabulary detector (YOLOE) and cognitive analysis with a reasoning visual language model (VLM) for forensic scene analysis. To reduce hallucination, it employs a multi-model consensus mechanism. On the nuScenes dataset, Semantic-Drive outperforms existing solutions, achieving a recall of 0.966 and a 40% reduction in risk assessment error compared to the best single scout models, while running on consumer hardware.
Semantic-Drive 是一个本地优先框架,旨在解决自动驾驶汽车中长尾训练数据稀缺的问题。它采用两阶段过程:使用实时开放词汇检测器(YOLOE)进行符号接地,以及使用推理视觉语言模型(VLM)进行法医场景分析。为了减少幻觉,它采用了多模型共识机制。在 nuScenes 数据集上,Semantic-Drive 的召回率为 0.966,相比最佳单模型降低了 40% 的风险评估误差,同时在消费级硬件(NVIDIA RTX 3090)上运行。
SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome
Authors: Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly, Mohammad Vali Sanian, Sebastian Birk, Yinshui Chang, Adam Boxall, Daniyal Jafree, Lloyd Steele, Vijaya Baskar MS, Muzlifah Haniffa, Mohammad Lotfollahi
First: 2025-11-19T14:22:23+00:00 · Latest: 2025-12-16T16:01:40+00:00
Abstract
Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an average improvement of 9.78% in the gene-expression prediction task and 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
中文标题/摘要
标题:SIGMMA:基于层次图的多尺度多模态对比对齐框架,用于组织病理图像和空间转录组
计算病理学的最新进展利用视觉-语言模型学习Hematoxylin和Eosin (HE) 图像与空间转录组 (ST) 谱型的联合表示。然而,现有方法通常在单尺度上对HE切片与其对应的ST谱型进行对齐,忽视了细微的细胞结构及其空间组织。为了解决这一问题,我们提出Sigmma,一种多模态对比对齐框架,用于在多个尺度上学习HE图像和空间转录组谱型的层次表示。Sigmma 引入了多尺度对比对齐,确保不同尺度下学习的表示在模态间保持一致性。此外,通过将细胞相互作用表示为图,并整合跨子图和内子图关系,我们的方法有效地捕捉了组织微环境中从精细到粗略的细胞-细胞相互作用。我们证明Sigmma 学习的表示能够更好地捕捉跨模态对应关系,在基因表达预测任务中平均提高9.78%,在跨模态检索任务中平均提高26.93%。我们进一步表明,它在下游分析中学习到了有意义的多组织组织。
Summary / 总结
The research aims to improve the alignment of histopathology images and spatial transcriptomic profiles by addressing the limitations of existing single-scale approaches. SIGMMA, a multi-modal contrastive alignment framework, learns hierarchical representations across multiple scales and keeps the representations learned at different scales coherent across modalities. The method represents cell interactions as a graph and integrates inter- and intra-subgraph relationships to capture cell-cell interactions from fine to coarse scales. Experimental results show that SIGMMA better captures cross-modal correspondences, improving gene-expression prediction by 9.78% and cross-modal retrieval by 26.93% on average across datasets, and that it also reveals meaningful multi-tissue organization in downstream analyses.
研究旨在通过解决现有单尺度方法的局限性,改进组织病理学图像与空间转录组学资料的对齐。Sigmma 是一个多模态对比对齐框架,能够在多个尺度上学习层次表示,确保不同尺度下的表示一致性。该方法使用图基方法捕捉从精细到粗略尺度的细胞-细胞相互作用,从而更好地捕捉跨模态对应关系。实验结果表明,在基因表达预测任务上提高了9.78%,在跨模态检索任务上提高了26.93%。
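The multi-scale contrastive alignment mentioned in the SIGMMA abstract can be illustrated with a generic, CLIP-style symmetric InfoNCE loss summed over scales. This is only a minimal sketch under stated assumptions: the abstract does not give the exact loss, so the function names, the temperature of 0.07, and the equal per-scale weights are all hypothetical, and the graph-based component is not modelled here.

```python
import numpy as np

def symmetric_info_nce(img_emb, st_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss; row i of each matrix is a matched pair."""
    # L2-normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    st = st_emb / np.linalg.norm(st_emb, axis=1, keepdims=True)
    logits = img @ st.T / temperature  # (n, n) similarity matrix

    def cross_entropy_diag(scores):
        # Numerically stable log-softmax per row; the targets are the diagonal entries.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-ST and ST-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def multi_scale_loss(pairs_by_scale, weights=None):
    """Weighted sum of per-scale losses; pairs_by_scale is a list of (img_emb, st_emb)."""
    weights = weights or [1.0] * len(pairs_by_scale)  # hypothetical equal weighting
    return sum(w * symmetric_info_nce(a, b) for w, (a, b) in zip(weights, pairs_by_scale))
```

Each entry of `pairs_by_scale` would hold the HE and ST embeddings for one scale (for example, cell level and tile level), so that matched pairs are pulled together at every scale.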
A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning
Authors: Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen
First: 2025-12-16T14:27:47+00:00 · Latest: 2025-12-16T14:27:47+00:00
Abstract
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision foundation models to precisely locate where the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
中文标题/摘要
标题:A4-Agent:零样本功能推理的代理框架
功能预测,基于语言指令识别物体上的交互区域,对于体态人工智能至关重要。现有的端到端模型将高层推理和低层语义结合在一个单一的管道中,并依赖于标注数据集的训练,这导致在新型物体和未见过的环境中泛化能力较差。在本文中,我们超越了这一范式,提出了A4-Agent,这是一种无需训练的代理框架,将功能预测分解为三个阶段的管道。我们的框架在测试时协调专门的基础模型:(1) 一个**Dreamer**,使用生成模型来可视化**如何**进行交互;(2) 一个**Thinker**,利用大型的视觉-语言模型来决定**与哪个**物体部分进行交互;(3) 一个**Spotter**,协调视觉基础模型来精确定位**哪里**是交互区域。通过利用预训练模型的互补优势,而无需任何特定任务的微调,我们的零样本框架在多个基准测试中显著优于最先进的监督方法,并在真实世界环境中展示了鲁棒的泛化能力。
Summary / 总结
The paper introduces A4-Agent, a zero-shot framework for affordance prediction that decouples the process into three stages: Dreamer, Thinker, and Spotter. Dreamer generates visualizations of interactions, Thinker decides which object part to interact with, and Spotter locates the interaction area. This framework, which leverages the strengths of pre-trained models without fine-tuning, outperforms existing supervised methods across multiple benchmarks and shows robust generalization to real-world settings.
研究旨在通过解决现有端到端模型的局限性,提高体态AI中的功能预测。A4-Agent是一个零样本的代理框架,将功能预测分解为三个阶段:Dreamer、Thinker和Spotter。Dreamer可视化交互,Thinker决定交互的对象部分,Spotter定位交互区域。该框架利用预训练模型而不进行微调,超越了最先进的监督方法,并在现实世界环境中表现出强大的泛化能力。
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
Authors: Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li
First: 2025-12-16T14:08:00+00:00 · Latest: 2025-12-16T14:08:00+00:00
Comments: Project page: https://synps26.github.io/
Abstract
Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.
中文标题/摘要
标题:共享注意力中的魔鬼在于注意力协同:通过注意力协同提高复杂非刚性图像编辑的忠实度
无需训练的大规模扩散模型使图像编辑变得实用,但执行复杂的非刚性编辑(例如姿态或形状变化)仍然极具挑战性。我们发现一个关键原因:现有注意力共享机制中的注意力崩溃,其中位置嵌入或语义特征之一主导视觉内容检索,导致过度编辑或不足编辑。为解决这一问题,我们引入了SynPS方法,该方法协同利用位置嵌入和语义信息进行忠实的非刚性图像编辑。我们首先提出了一种编辑度量,量化每个去噪步骤所需的编辑量。基于此度量,我们设计了一种注意力协同管道,动态调节位置嵌入的影响,使SynPS能够平衡语义修改和保真度保留。通过适配性地整合位置和语义线索,SynPS有效地避免了过度编辑和不足编辑。在公共和新收集的基准上的广泛实验表明,我们方法的优越性能和忠实度。
Summary / 总结
The paper addresses the challenge of performing complex non-rigid image edits using large diffusion models, which often result in over- or under-editing due to attention collapse in existing mechanisms. To tackle this, the authors introduce SynPS, a method that synergistically combines positional embeddings and semantic information to balance semantic modifications and fidelity preservation. The approach measures the required editing magnitude at each denoising step and dynamically modulates the influence of positional embeddings, leading to superior performance and faithfulness in non-rigid image editing tasks.
研究解决了使用大型扩散模型进行复杂非刚性图像编辑时经常出现的过度编辑或不足编辑问题,这是由于注意力坍塌造成的。作者提出了SynPS方法,该方法结合了位置嵌入和语义信息的协同作用,以实现忠实的非刚性图像编辑。通过动态调节位置嵌入的影响,SynPS有效地平衡了语义修改和保真度的保留,广泛的实验在公共和新收集的基准上证明了其优越的性能和忠实度。
DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning
Authors: Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa
Venue: AAAI 2026
First: 2025-12-16T14:06:35+00:00 · Latest: 2025-12-16T14:06:35+00:00
Comments: Paper accepted to AAAI 2026
Abstract
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
中文标题/摘要
标题:DISCODE: 分布感知分数解码器在图像字幕评估中的鲁棒自动评价
大型视觉-语言模型(LVLMs)在多种跨模态任务中表现出色。然而,在领域迁移场景下,使用LVLMs进行图像字幕评估仍然具有挑战性。为了解决这一问题,我们提出了分布感知分数解码器(DISCODE),这是一种无需微调的新方法,能够生成与人类判断更一致的鲁棒评估分数,适用于多种领域。DISCODE的核心思想在于其测试时自适应评估方法,引入了自适应测试时(ATT)损失,利用高斯先验分布提高评估分数估计的鲁棒性。我们推导出一种分析解法,在测试时高效地最小化该损失。此外,我们还引入了多领域字幕评估基准(MCEval),这是一个新的图像字幕评估基准,涵盖了六个不同的领域,旨在评估评估指标的鲁棒性。在我们的实验中,我们证明DISCODE在MCEval和四个代表性现有基准上均实现了最先进的无参考评价指标性能。
Summary / 总结
The research aims to improve the robustness of automatic image caption evaluation, especially under domain-shift scenarios. DISCODE, a finetuning-free method, uses an Adaptive Test-Time (ATT) loss with a Gaussian prior to generate evaluation scores more aligned with human judgments. Experiments show that DISCODE outperforms existing methods across various benchmarks, including MCEval and four representative existing benchmarks.
研究旨在通过引入DISCODE方法提高自动图像字幕评估的鲁棒性,特别是在领域转换场景下。DISCODE使用带有高斯先验的Adaptive Test-Time (ATT)损失,在测试时自适应地生成与人类判断更好的对齐的评估分数。实验表明,DISCODE在MCEval和四个其他基准上优于现有的无参考评估指标。
Unified Semantic Transformer for 3D Scene Understanding
Authors: Sebastian Koch, Johanna Wald, Hide Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari
First: 2025-12-16T12:49:35+00:00 · Latest: 2025-12-16T12:49:35+00:00
Comments: Project page: https://unite-page.github.io/
Abstract
Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained with 2D distillation, relies heavily on self-supervision, and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io.
中文标题/摘要
标题:统一语义变换器用于3D场景理解
整体3D场景理解涉及捕获和解析未结构化的3D环境。由于现实世界的固有复杂性,现有的模型主要被开发并局限于特定任务。我们引入了UNITE,一种用于3D场景理解的统一语义变换器,这是一种新颖的前馈神经网络,能够在一个模型中统一多种3D语义任务。我们的模型以端到端的方式处理未见过的场景,并且只需几秒钟即可推断出完整的3D语义几何结构。我们的方法能够直接预测多个语义属性,包括3D场景分割、实例嵌入、开放词汇特征,以及用途和关节,仅从RGB图像中。该方法使用2D蒸馏训练,高度依赖于自我监督,并利用了设计用于确保3D视图一致性的新型多视图损失。我们证明,UNITE在多个不同的语义任务上达到了最先进的性能,并且在许多情况下甚至超过了特定任务的模型,甚至超越了在真实3D几何上操作的方法。请参见项目网站:unite-page.github.io
Summary / 总结
UNITE is a Unified Semantic Transformer designed for holistic 3D scene understanding, capable of predicting multiple semantic attributes from RGB images. It uses a combination of 2D distillation and self-supervision, along with novel multi-view losses to ensure 3D consistency. UNITE outperforms task-specific models and even surpasses methods using ground truth 3D geometry on several semantic tasks, achieving state-of-the-art performance.
UNITE 是一个统一的语义变换器,旨在进行全方位的3D场景理解,能够从RGB图像中预测多种语义属性。它结合了2D蒸馏和自我监督,并使用新颖的多视图损失来确保3D一致性。UNITE 在多个语义任务上超越了专门任务模型,甚至在某些情况下超过了使用真实3D几何的模型,达到了最先进的性能。
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Authors: Jooyeol Yun, Jaegul Choo
First: 2025-12-16T12:03:46+00:00 · Latest: 2025-12-16T12:03:46+00:00
Comments: yeolj00.github.io/personal-projects/vector-prism
Abstract
Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mis-handle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance of which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
Summary / 总结
The research aims to address the challenge of automating the animation of Scalable Vector Graphics (SVG) for modern web design, where current vision-language models (VLMs) often fail due to fragmented low-level shapes. The method involves a framework that recovers the semantic structure of SVGs by aggregating multiple weak part predictions, enabling VLMs to produce more coherent animations. Experiments show significant improvements over existing approaches, indicating that semantic recovery is crucial for robust SVG animation and more interpretable interactions between VLMs and vector graphics.
本文旨在解决使用视觉语言模型(VLMs)自动化 Scalable Vector Graphics (SVG) 动画的挑战。作者提出了一种名为 Vector Prism 的框架,通过聚合多个弱部分预测来恢复 SVG 的语义结构,从而使 VLMs 能够生成更连贯的动画。实验结果显示,与现有方法相比有显著改进,表明语义恢复是实现稳健 SVG 动画的关键步骤。
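The idea of stably inferring semantics by aggregating multiple weak part predictions, as described in the Vector Prism abstract, can be illustrated with a simple per-element majority vote. This is a minimal sketch, not the paper's method: the abstract does not specify the statistical aggregation, so majority voting is a stand-in, and the function names, element ids, and labels below are hypothetical.

```python
from collections import Counter, defaultdict

def aggregate_part_labels(predictions):
    """Majority-vote a semantic label per SVG element.

    predictions: list of dicts {element_id: label}, one dict per noisy prediction run.
    """
    votes = defaultdict(Counter)
    for run in predictions:
        for element_id, label in run.items():
            votes[element_id][label] += 1
    # Keep the most frequent label for each element.
    return {eid: counts.most_common(1)[0][0] for eid, counts in votes.items()}

def group_by_label(labels):
    """Collect element ids that share a label, so they can be animated together."""
    groups = defaultdict(list)
    for element_id, label in sorted(labels.items()):
        groups[label].append(element_id)
    return dict(groups)

# Three noisy runs over three low-level shapes (hypothetical ids and labels).
runs = [
    {"p1": "wheel", "p2": "wheel", "p3": "body"},
    {"p1": "wheel", "p2": "body", "p3": "body"},
    {"p1": "wheel", "p2": "wheel", "p3": "body"},
]
semantic_groups = group_by_label(aggregate_part_labels(runs))
```

With these inputs, the single disagreeing vote for `p2` is outvoted, and `p1` and `p2` end up in the same group. Ties are broken by first occurrence here, which a real system would likely handle more carefully.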
From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region
Authors: Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel
First: 2025-12-16T11:28:55+00:00 · Latest: 2025-12-16T11:28:55+00:00
Comments: 9 pages, 9 figures
Abstract
In the Middle East and North Africa (MENA) region, there is high demand for wastewater treatment plants (WWTPs), which are crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods such as YOLOv8 segmentation require extensive manual labeling, but studies indicate that vision-language models (VLMs) are an efficient alternative that achieves equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams, specifically to identify WWTPs. YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and the UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). The evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral); they are used to identify WWTP components such as circular/rectangular tanks and aeration basins and to distinguish confounders via expert prompts that produce JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and an equal number of non-WWTP sites from field/AI data, as 600 m × 600 m GeoTIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs outperforming YOLOv8's true positive rate, with Gemma-3 the highest. The results confirm that VLMs, particularly in zero-shot settings, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
Summary / 总结
This study aims to improve the identification of wastewater treatment plants (WWTPs) in the Middle East and North Africa (MENA) region using satellite imagery. Traditional methods like YOLOv8 require extensive manual labeling, so vision-language models (VLMs) are proposed as an efficient alternative. The research compares VLMs in zero-shot and few-shot settings on 1,207 validated WWTP locations and an equal number of non-WWTP sites, against a YOLOv8 model trained on 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and the UAE. Several VLMs exceeded YOLOv8's true positive rate in zero-shot evaluations, with Gemma-3 the highest, indicating that VLMs can replace YOLOv8 for annotation-free WWTP classification and facilitate scalable remote sensing.
该研究旨在利用卫星图像提高中东和北非(MENA)地区污水处理厂(WWTPs)的识别。它将视觉语言模型(VLMs)与YOLOv8进行比较,用于WWTP组件的零样本和少量样本检测。VLMs,尤其是在零样本评估中,优于YOLOv8,Gemma-3表现出最高的真阳性率。这表明VLMs可以替代YOLOv8进行高效的、无需标注的WWTP分类,促进遥感的大规模应用。
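The WWTP abstract mentions that the VLMs return JSON outputs with confidence and descriptions, and that models are compared by true positive rate on WWTP images. A minimal sketch of that scoring step follows. The JSON field names (`is_wwtp`, `confidence`, `description`) and the 0.5 threshold are assumptions made for illustration; the abstract does not give the actual schema.

```python
import json

def parse_vlm_reply(reply_text):
    """Parse one VLM reply; the field names used here are hypothetical."""
    data = json.loads(reply_text)
    return bool(data["is_wwtp"]), float(data["confidence"]), str(data.get("description", ""))

def true_positive_rate(replies, threshold=0.5):
    """Fraction of known-WWTP images flagged as WWTP with confidence >= threshold."""
    if not replies:
        raise ValueError("no replies to score")
    hits = 0
    for text in replies:
        is_wwtp, confidence, _ = parse_vlm_reply(text)
        if is_wwtp and confidence >= threshold:
            hits += 1
    return hits / len(replies)

# Replies for three images that are all known to contain a WWTP (made-up values).
example_replies = [
    json.dumps({"is_wwtp": True, "confidence": 0.92, "description": "circular tanks"}),
    json.dumps({"is_wwtp": True, "confidence": 0.71, "description": "aeration basins"}),
    json.dumps({"is_wwtp": False, "confidence": 0.80, "description": "fish farm"}),
]
tpr = true_positive_rate(example_replies)
```

Only positive images are scored here, which matches the true positive rate quoted in the abstract; a fuller evaluation would also score the non-WWTP sites to obtain a false positive rate.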
SAM3-I: Segment Anything with Instructions
Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Wei Ji, Huchuan Lu, Li Cheng
First: 2025-12-04T09:00:25+00:00 · Latest: 2025-12-16T11:17:40+00:00
Comments: Preliminary results; work in progress
Abstract
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
中文标题/摘要
标题:SAM3-I: 按指令分割一切
SAM3模型通过可提示的概念分割增强了开放词汇分割能力,允许用户分割给定概念的所有实例,通常用简短的名词短语(NP)提示指定。虽然这是SAM家族中首次将语言级概念整合进来,但在实际应用中通常需要更丰富的表达,包括属性、空间关系、功能、动作、状态,甚至实例间的隐式推理。目前,SAM3依赖外部多模态代理将复杂指令转换为NP并进行迭代掩码过滤。然而,这些NP级概念过于粗略,往往无法精确表示特定实例。在此项工作中,我们提出了SAM3-I,这是一种增强框架,将概念级理解和指令级推理统一在SAM家族中。SAM3-I引入了一种指令感知级联适应机制,逐步将表达性的指令语义与SAM3现有的视觉-语言表示对齐,从而实现直接的指令遵循分割,同时保留其原有的概念驱动能力。此外,我们设计了一种结构化的指令分类体系,涵盖概念、简单和复杂三个层级,并开发了一个可扩展的数据引擎来构建包含多样指令-掩码对的数据集。实验表明,SAM3-I表现出令人满意的效果,证明SAM3可以有效扩展以遵循自然语言指令,同时保持其强大的概念基础。我们开源了SAM3-I,并提供了实用的微调工作流程,使研究人员能够将其应用于特定领域。源代码可在此获取。
Summary / 总结
SAM3-I enhances the Segment Anything Model 3 (SAM3) by integrating instruction-level reasoning into the concept segmentation framework, allowing direct instruction-following segmentation. It introduces an instruction-aware cascaded adaptation mechanism that aligns expressive instruction semantics with SAM3's vision-language representations. Experiments show that SAM3-I can effectively follow natural-language instructions while maintaining strong concept grounding. The structured instruction taxonomy and scalable data engine support diverse instruction-mask pairs, demonstrating appealing performance in real-world applications.
SAM3-I通过将指令级推理与概念级理解相结合,增强了Segment Anything Model 3 (SAM3),使其能够直接根据自然语言指令进行分割。它引入了一种指令感知的级联适应机制,将表达性的指令语义与SAM3的视觉-语言表示进行对齐。实验表明,SAM3-I能够有效地遵循复杂指令,同时保持强大的概念基础,展示了其在实际应用中的潜力。源代码可供研究人员将其适应到特定领域任务中。
Light-Weight Benchmarks Reveal the Hidden Hardware Cost of Zero-Shot Tabular Foundation Models
Authors: Ishaan Gangwani, Aayam Bansal
Venue: ICML
First: 2025-11-30T13:17:08+00:00 · Latest: 2025-12-16T10:51:18+00:00
Comments: ICML NewInML
Abstract
Zero-shot foundation models (FMs) promise training-free prediction on tabular data, yet their hardware footprint remains poorly characterized. We present a fully reproducible benchmark that reports test accuracy together with wall-clock latency, peak CPU RAM, and peak GPU VRAM on four public datasets: Adult-Income, Higgs-100k, Wine-Quality, and California-Housing. Two open FMs (TabPFN-1.0 and TabICL-base) are compared against tuned XGBoost, LightGBM, and Random Forest baselines on a single NVIDIA T4 GPU. The tree ensembles equal or surpass FM accuracy on three datasets while completing full-test batches in <= 0.40 s and <= 150 MB RAM, using zero VRAM. TabICL achieves a 0.8 percentage-point gain on Higgs but requires roughly 40,000 times more latency (960 s) and 9 GB VRAM. TabPFN matches tree-model accuracy on Wine and Housing but peaks at 4 GB VRAM and cannot process the full 100k-row Higgs table. These results quantify the substantial hardware-versus-accuracy trade-offs in current tabular FMs and provide an open baseline for future efficiency-oriented research.
中文标题/摘要
标题:轻量级基准揭示零样本表格基础模型的隐藏硬件成本
零样本基础模型(FMs)承诺在表格数据上实现无需训练的预测,但其硬件占用情况仍缺乏充分的描述。我们提供了一个完全可复现的基准测试,该测试在四个公共数据集(Adult-Income、Higgs-100k、Wine-Quality 和 California-Housing)上报告了测试准确率、墙钟延迟、峰值CPU RAM 和峰值GPU VRAM。在单个NVIDIA T4 GPU 上,两种开源FMs(TabPFN-1.0 和 TabICL-base)与调优后的XGBoost、LightGBM 和随机森林基线进行比较。树集合模型在三个数据集上的准确率不低于FM,同时完成整个测试批次所需时间≤0.40秒且≤150MB RAM,无需使用任何VRAM。TabICL 在Higgs数据集上实现了0.8个百分点的提升,但需要大约40,000倍的延迟(960秒)和9GB VRAM。TabPFN 在Wine和Housing数据集上的准确率与树模型相当,但峰值VRAM达到4GB,无法处理完整的100,000行Higgs表格。这些结果量化了当前表格FMs中的硬件与准确率之间的显著权衡,并为未来效率导向的研究提供了开放基准。
Summary / 总结
This study evaluates the hardware requirements of zero-shot tabular foundation models (FMs) by comparing them with tuned tree ensembles on four public datasets. The benchmark reports accuracy, wall-clock latency, and peak CPU RAM and GPU VRAM. Tree ensembles equal or surpass FM accuracy on three datasets while using no VRAM and at most 150 MB of RAM, whereas TabICL gains 0.8 percentage points on Higgs at the cost of roughly 40,000 times more latency (960 s) and 9 GB of VRAM. TabPFN matches tree-model accuracy on Wine and Housing but peaks at 4 GB of VRAM and cannot process the full 100k-row Higgs table.
该研究旨在通过将零-shot 表格基础模型(FMs)与传统树集合模型在四个数据集上的表现进行对比,来评估其硬件成本。基准测试衡量准确率、延迟和内存使用情况,结果显示树集合模型可以在资源消耗极低的情况下达到与FM相当的准确率。然而,如TabICL和TabPFN等FM需要大量显存和较长的处理时间,这突显了硬件效率与模型准确率之间的权衡。
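The measurements reported in the tabular benchmark (wall-clock latency and peak memory alongside accuracy) can be approximated with the Python standard library. This is a minimal sketch rather than the authors' harness: `profile_call` is a hypothetical helper, `tracemalloc` only sees allocations made through Python's allocator (not process-level RAM or GPU VRAM), and tracing adds some overhead to the measured time.

```python
import time
import tracemalloc

def profile_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak

# Stand-in workload; in a real benchmark this would be model.predict(X_test).
result, seconds, peak_bytes = profile_call(lambda n: sum(list(range(n))), 10_000)
```

As a sanity check on the abstract's figures, 960 s divided by 40,000 is about 0.024 s, which would put the tree-ensemble latency on Higgs well inside the stated bound of 0.40 s.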
Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes
Authors: Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi
First: 2025-12-16T08:15:24+00:00 · Latest: 2025-12-16T08:15:24+00:00
Abstract
Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.
中文标题/摘要
标题:使用语义高斯过程提高LVLM的语义不确定性量化
大型视觉-语言模型(LVLMs)经常生成可能但不可靠的输出,因此稳健的不确定性估计至关重要。最近关于语义不确定性估计的工作依赖于外部模型对多个采样响应进行聚类并测量它们的语义一致性。然而,这些聚类方法往往是脆弱的,对细微的措辞变化非常敏感,并且可能会错误地将语义相似的答案分组或分开,导致不可靠的不确定性估计。我们提出了语义高斯过程不确定性(SGPU),这是一种贝叶斯框架,通过分析答案嵌入的几何结构来量化语义不确定性,避免了脆弱的聚类。SGPU 将生成的答案映射到密集的语义空间,计算它们嵌入的格拉姆矩阵,并通过特征谱总结它们的语义配置。然后将这种谱表示输入到高斯过程分类器中,学习将语义一致性的模式映射到预测不确定性,并且可以在黑盒和白盒设置中应用。在六个LLMs和LVLMs上的八个数据集上,包括VQA、图像分类和文本问答,SGPU始终实现了最先进的校准(ECE)和区分(AUROC、AUARC)性能。我们进一步表明,SGPU可以在模型和模态之间进行迁移,表明其谱表示捕捉到了语义不确定性的一般模式。
Summary / 总结
This paper addresses the issue of unreliable outputs from Large Vision-Language Models (LVLMs) by proposing Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty without relying on fragile clustering methods. SGPU analyzes the geometric structure of answer embeddings, computes the Gram matrix, and uses the eigenspectrum to represent semantic configurations, which are then used to learn predictive uncertainty through a Gaussian Process Classifier. Experiments on six LLMs and LVLMs across eight datasets show that SGPU outperforms existing methods in terms of calibration and discriminative performance, and it transfers well across different models and modalities.
研究旨在通过提出语义高斯过程不确定性(SGPU)框架来提高大型视觉-语言模型(LVLM)的语义不确定性量化可靠性,该框架避免了聚类方法的脆弱性。SGPU分析答案嵌入的几何结构,计算Gram矩阵,并使用特征谱表示语义配置,然后由高斯过程分类器来估计不确定性。在六个LLM和LVLM的八个数据集上的实验表明,SGPU在校准和区分度指标上达到了最先进的性能,其特征谱表示在模型和模态之间具有可转移性。
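Illustrative sketch for the SGPU entry above: the abstract describes computing the Gram matrix of answer embeddings and summarizing it by its eigenspectrum. The snippet below shows only that spectral step, under assumptions that are not from the paper: embeddings are already available as a NumPy array, rows are L2-normalised (so the Gram matrix holds cosine similarities), and the function name `gram_eigenspectrum` is hypothetical. The Gaussian Process Classifier stage is omitted.

```python
import numpy as np


def gram_eigenspectrum(embeddings: np.ndarray) -> np.ndarray:
    """Eigenvalues (descending) of the Gram matrix of L2-normalised rows.

    Assumption: cosine-style normalisation; the paper may use a different kernel.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero
    gram = unit @ unit.T                             # (n_answers, n_answers)
    eigvals = np.linalg.eigvalsh(gram)               # ascending, symmetric input
    return eigvals[::-1]


# Three identical answers: all spectral mass sits in one eigenvalue.
consistent = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
# Three mutually orthogonal answers: the mass is spread evenly.
diverse = np.eye(3)
print(gram_eigenspectrum(consistent))
print(gram_eigenspectrum(diverse))
```

Intuitively, a spectrum concentrated in one eigenvalue suggests the sampled answers agree, while a flat spectrum suggests they disagree; how SGPU maps such spectra to uncertainty is learned by its classifier and is not reproduced here.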
From My View to Yours: Ego-to-Exo Transfer in VLMs for Understanding Activities of Daily Living
Authors: Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das
First: 2025-01-10T05:01:58+00:00 · Latest: 2025-12-16T07:48:42+00:00
Abstract
Vision Language Models (VLMs) have achieved strong performance across diverse video understanding tasks. However, their viewpoint-invariant training limits their ability to understand egocentric properties (e.g., human-object interactions) from exocentric video observations. This limitation is critical for many applications, such as Activities of Daily Living (ADL) monitoring, where the understanding of egocentric properties is essential, and egocentric cameras are impractical to deploy. To address this limitation, we propose Ego2ExoVLM, a VLM that learns to infer egocentric properties from exocentric videos by leveraging time-synchronized ego-exo videos during training. Ego2ExoVLM accomplishes this through the use of two components: Ego2Exo Sequence Distillation, which transfers knowledge from an egocentric teacher to an exocentric student, and Ego Adaptive Visual Tokens, designed to enhance the effectiveness of this knowledge transfer. To measure this capability, we introduce Ego-in-Exo Perception, a benchmark of 3.9K questions curated to explicitly measure the understanding of egocentric properties from exocentric videos. Ego2ExoVLM is evaluated on 10 tasks across Ego-in-Exo Perception and existing ADL benchmarks, achieving state-of-the-art results on the ADL-X benchmark suite and outperforming strong baselines on our proposed benchmark. All code, models, and data will be released at https://github.com/dominickrei/EgoExo4ADL.
中文标题/摘要
标题:从我的视角到你的视角:在理解日常生活活动中的自我中心属性到环境中心属性的迁移
视觉语言模型(VLMs)在多种视频理解任务中取得了强大的性能。然而,它们的观点不变性训练限制了它们从环境中心视频观察中理解自我中心属性(例如,人类物体交互)的能力。这一限制对于许多应用至关重要,如日常生活活动(ADL)监测,其中理解自我中心属性是必不可少的,而部署自我中心摄像头是不切实际的。为了解决这一限制,我们提出了一种Ego2ExoVLM,这是一种通过利用训练期间时间同步的自我中心和环境中心视频来学习从环境中心视频推断自我中心属性的VLM。Ego2ExoVLM 通过两个组件实现这一点:Ego2Exo 序列蒸馏,它将自我中心教师的知识转移到环境中心学生,以及增强这种知识转移有效性的自我中心适应性视觉标记。为了衡量这种能力,我们引入了Ego-in-Exo 感知基准,这是一个包含3900个问题的基准,专门用于明确衡量从环境中心视频理解自我中心属性的能力。Ego2ExoVLM 在Ego-in-Exo 感知基准和现有ADL基准上的10个任务中进行了评估,其在ADL-X基准套件上达到了最先进的结果,并在我们提出的基准上优于强基线。所有代码、模型和数据将在https://github.com/dominickrei/EgoExo4ADL上发布。
Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs
Authors: Anran Qi, Changjian Li, Adrien Bousseau, Niloy J. Mitra
First: 2025-12-15T14:45:05+00:00 · Latest: 2025-12-16T07:08:17+00:00
Abstract
We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages over state-of-the-art alternatives when turning images into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the final frame in the disoccluded regions, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyond-visible.github.io/
中文标题/摘要
标题:超越可见:基于代理动态图的消隐感知编辑
我们通过显式用户控制最终帧的消隐区域来解决图像到视频的生成问题。当前的图像到视频流水线能够生成可信的运动,但在生成可预测、有条理的运动并确保用户指定内容在新揭示区域的同时遇到困难。我们的核心思想是将运动规范与外观合成分离:我们引入了一个轻量级、用户可编辑的代理动态图(PDG),它以确定性但近似的方式驱动部分运动,而冻结的扩散先验用于合成遵循该运动的可信外观。在我们的无需训练的流水线中,用户对PDG进行粗略标注和重新定位,从中我们计算密集的运动流以利用扩散作为运动引导的着色器。然后,用户可以在图像的消隐区域编辑外观,并利用PDG编码的可见性信息在这些区域执行潜在空间合成,以在运动与用户意图之间达成一致。此设计实现了可控的有条理性和对消隐的用户控制,无需微调。我们展示了在将图像转换为包含有条理对象、家具、车辆和变形体的短视频方面,我们的方法明显优于现有最佳方案。我们的方法结合了生成控制(松散姿态和结构)与最终帧中可预测的外观规范控制,解锁了一种新的图像到视频工作流程。代码将在接受后发布。项目页面:https://anranqi.github.io/beyond-visible.github.io/
Summary / 总结
The research addresses image-to-video generation with user control over disoccluded regions. It introduces a Proxy Dynamic Graph (PDG) to separate motion specification from appearance synthesis, allowing users to edit the appearance in newly revealed areas while maintaining predictable and articulated motions. The method demonstrates superior performance in generating articulated motions for objects, furniture, vehicles, and deformables compared to existing approaches, providing a new workflow combining generative control with predictable appearance specification in the final frame.
该论文解决了在图像到视频转换中生成合理运动的同时允许用户控制被遮挡区域的问题。它引入了一个代理动态图(PDG),该图以确定性方式驱动部分运动,而冻结的扩散模型则合成外观。用户可以松散地标注和重新定位PDG以编辑被遮挡区域,并利用可见性信息来协调运动与用户意图。结果显示,该方法在生成物体、家具、车辆和变形体的有规律运动方面优于现有方法。
UIXPOSE: Mobile Malware Detection via Intention-Behaviour Discrepancy Analysis
Authors: Amirmohammad Pasdar, Toby Murray, Van-Thuan Pham
First: 2025-12-16T06:26:29+00:00 · Latest: 2025-12-16T06:26:29+00:00
Comments: 15 pages
Abstract
We introduce UIXPOSE, a source-code-agnostic framework that operates on both compiled and open-source apps. This framework applies Intention Behaviour Alignment (IBA) to mobile malware analysis, aligning UI-inferred intent with runtime semantics. Previous work either infers intent statically (e.g., permission-centric or widget-level analysis) or monitors coarse dynamic signals (endpoints, partial resource usage) that miss content and context. UIXPOSE infers an intent vector from each screen using vision-language models and knowledge structures and combines decoded network payloads, heap/memory signals, and resource utilisation traces into a behaviour vector. Their alignment, calculated at runtime, can both detect misbehaviour and highlight exploration of behaviourally rich paths. In three real-world case studies, UIXPOSE reveals covert exfiltration and hidden background activity that evade metadata-only baselines, demonstrating how IBA improves dynamic detection.
中文标题/摘要
标题:UIXPOSE:通过意图-行为差异分析进行移动恶意软件检测
我们介绍了UIXPOSE,一种源代码无关的框架,适用于编译后的和开源应用。该框架采用意图行为对齐(IBA)进行移动恶意软件分析,将UI推断的意图与运行时语义对齐。以往的工作要么静态推断意图,例如基于权限,或者在组件级别进行监控,这些方法会错过内容和上下文。UIXPOSE 使用视觉语言模型和知识结构从每个屏幕推断意图向量,并将解码的网络负载、堆/内存信号和资源利用率轨迹组合成行为向量。它们在运行时的对齐可以检测不良行为并突出显示行为丰富的路径探索。在三个实际案例研究中,UIXPOSE 揭示了逃避元数据基线的隐蔽数据泄露和隐藏后台活动,证明了IBA如何提高动态检测效果。
Summary / 总结
UIXPOSE is a source-code-agnostic framework for mobile malware detection that analyzes both compiled and open-source apps. It uses Intention Behaviour Alignment (IBA) to align UI-inferred intent with runtime semantics. Unlike previous methods that infer intent statically or monitor coarse dynamic signals, UIXPOSE infers intent vectors from each screen using vision-language models and combines them with behavior vectors derived from network payloads, heap/memory signals, and resource utilisation traces. The framework detects misbehaviour and highlights rich behavioral paths, effectively identifying covert exfiltration and hidden background activities that evade metadata-only detection methods.
UIXPOSE 是一个无需源代码的框架,用于检测移动恶意软件,可以分析编译后的和开源的应用程序。它使用意图行为对齐(IBA)来对齐从 UI 推断出的意图与运行时语义。不同于之前的方法只能静态推断意图或监控粗略的动态信号,UIXPOSE 使用视觉语言模型从每个屏幕推断意图向量,并结合网络负载解码、堆/内存信号和资源利用率轨迹生成行为向量。该框架能够检测异常行为并突出显示丰富的行为路径,有效地识别出逃避元数据检测的隐蔽数据泄露和后台活动。
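Illustrative sketch for the UIXPOSE entry above: the abstract describes aligning a per-screen intent vector with a runtime behaviour vector. The snippet below is a minimal stand-in that assumes both vectors share the same dimensions and that alignment is measured by cosine similarity; the abstract does not specify the alignment measure, and the function names and the `threshold` value are hypothetical.

```python
import math


def cosine_alignment(intent, behaviour):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(i * b for i, b in zip(intent, behaviour))
    norm_i = math.sqrt(sum(i * i for i in intent))
    norm_b = math.sqrt(sum(b * b for b in behaviour))
    if norm_i == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_i * norm_b)


def is_discrepant(intent, behaviour, threshold=0.5):
    """Flag a screen whose observed behaviour diverges from its inferred intent.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine_alignment(intent, behaviour) < threshold


# Hypothetical axes: [photo editing, network upload, contact access]
screen_intent = [1.0, 0.0, 0.0]
observed_behaviour = [0.1, 0.9, 0.8]
print(is_discrepant(screen_intent, observed_behaviour))
```

A low alignment score on a screen that looks like a photo editor but uploads contacts is the kind of intention-behaviour gap the framework is described as surfacing.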
Consistent Instance Field for Dynamic Scene Understanding
Authors: Junyi Wu, Van Nguyen Nguyen, Benjamin Planche, Jiachen Tao, Changchang Sun, Zhongpai Gao, Zhenghao Zhao, Anwesa Choudhuri, Gengyu Zhang, Meng Zheng, Feiran Wang, Terrence Chen, Yan Yan, Ziyan Wu
First: 2025-12-16T06:12:11+00:00 · Latest: 2025-12-16T06:12:11+00:00
Abstract
We introduce Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike prior methods that rely on discrete tracking or view-dependent features, our approach disentangles visibility from persistent object identity by modeling each space-time point with an occupancy probability and a conditional instance distribution. To realize this, we introduce a novel instance-embedded representation based on deformable 3D Gaussians, which jointly encode radiance and semantic information and are learned directly from input RGB images and instance masks through differentiable rasterization. Furthermore, we introduce new mechanisms to calibrate per-Gaussian identities and resample Gaussians toward semantically active regions, ensuring consistent instance representations across space and time. Experiments on HyperNeRF and Neu3D datasets demonstrate that our method significantly outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.
中文标题/摘要
标题:一致实例场:动态场景理解的连续概率时空表示
我们引入了一致实例场,这是一种用于动态场景理解的连续和概率时空表示。与依赖于离散跟踪或视点相关特征的先前方法不同,我们的方法通过使用占用概率和条件实例分布来解耦可见性和持久对象身份。为了实现这一点,我们基于可变形3D高斯分布引入了一种新的实例嵌入表示,该表示联合编码辐射和语义信息,并通过可微放像素化直接从输入的RGB图像和实例掩码中学习。此外,我们引入了新的机制来校准每个高斯的身份,并重新采样高斯以朝向语义活跃区域,从而确保空间和时间上的一致实例表示。在HyperNeRF和Neu3D数据集上的实验表明,我们的方法在新颖视角全景分割和开放词汇4D查询任务上显著优于现有最佳方法。
Summary / 总结
The research introduces Consistent Instance Field, a continuous and probabilistic spatio-temporal representation for dynamic scene understanding. Unlike previous methods that depend on discrete tracking or view-dependent features, this approach models each space-time point with an occupancy probability and a conditional instance distribution. The method uses a novel instance-embedded representation based on deformable 3D Gaussians to jointly encode radiance and semantic information, which are learned from input RGB images and instance masks through differentiable rasterization. The experiments show that this method outperforms state-of-the-art methods on novel-view panoptic segmentation and open-vocabulary 4D querying tasks.
研究引入了连续且概率性的空间-时间表示方法——一致实例场,用于动态场景理解。不同于以往依赖离散跟踪或视点相关特征的方法,该方法通过占用概率和条件实例分布来建模每个空间-时间点,分离可见性和持久对象身份。该方法使用基于可变形3D高斯的新型实例嵌入表示,通过可微放像素化直接从输入的RGB图像和实例掩码中学习。实验表明,该方法在新颖视角全景分割和开放词汇4D查询任务上优于现有方法。
MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion
Authors: Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen, Gang Wang, Runwei Guan, Weiping Ding
First: 2025-12-15T10:37:59+00:00 · Latest: 2025-12-16T05:50:26+00:00
Abstract
Vision-language models enable the understanding and reasoning of complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in the 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities, including occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
中文标题/摘要
标题:MMDrive:超越视觉的多表示融合交互场景理解
视觉语言模型通过多源信息融合,使复杂交通场景的理解和推理成为可能,成为自动驾驶的核心技术。然而,现有的视觉语言模型受限于二维图像理解范式,限制了其感知三维空间信息和进行深层次语义融合的能力,导致在复杂自动驾驶环境中表现不佳。本研究提出MMDrive,这是一种多模态视觉语言模型框架,将传统的图像理解扩展到一个通用的三维场景理解框架。MMDrive结合了三种互补模态,包括占用地图、LiDAR点云和文本场景描述。为此,它引入了两种新的组件,用于自适应跨模态融合和关键信息提取。具体来说,文本导向的多模态调制器根据问题中的语义线索动态加权每个模态的贡献,引导上下文感知特征整合。跨模态抽象器使用可学习的抽象标记生成紧凑的跨模态摘要,突出关键区域和重要语义。在DriveLM和NuScenes-QA基准上的全面评估表明,MMDrive在自动驾驶中显著优于现有视觉语言模型,DriveLM上的BLEU-4得分为54.56,METEOR得分为41.78,NuScenes-QA上的准确率为62.7%。MMDrive有效地突破了传统的仅图像理解障碍,使在复杂驾驶环境中实现稳健的多模态推理成为可能,并为可解释的自动驾驶场景理解提供了新的基础。
Summary / 总结
MMDrive is a multimodal vision-language model framework that extends traditional 2D image understanding to a 3D scene understanding framework by incorporating occupancy maps, LiDAR point clouds, and textual scene descriptions. It introduces two novel components: Text-oriented Multimodal Modulator for adaptive cross-modal fusion and Cross-Modal Abstractor for key information extraction. MMDrive shows significant performance gains in autonomous driving benchmarks, achieving a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA.
MMDrive 是一种多模态视觉-语言模型框架,将传统的二维图像理解扩展到三维场景理解框架,融合了占用图、LiDAR 点云和文本场景描述。它引入了两种新型组件:面向文本的多模态调制器进行自适应跨模态融合,以及跨模态抽象器生成紧凑的跨模态摘要以突出关键区域和重要语义。MMDrive 在 DriveLM 和 NuScenes-QA 基准测试中取得了显著的性能提升,BLEU-4 得分为 54.56,METEOR 得分为 41.78,NuScenes-QA 的准确率为 62.7%,超越了现有的视觉-语言模型在自动驾驶中的表现。
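Illustrative sketch for the MMDrive entry above: the abstract says the Text-oriented Multimodal Modulator weights each modality according to semantic cues in the question. One simple way to picture question-conditioned weighting is a softmax over per-modality relevance scores followed by a weighted sum of modality features. This is an assumption for illustration only; the actual module is learned, and the names `softmax` and `fuse` and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, relevance_scores):
    """Weighted sum of equal-length modality feature vectors.

    `relevance_scores` stands in for question-dependent scores that a learned
    module would produce; here they are supplied by hand.
    """
    weights = softmax(relevance_scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Hypothetical 2-D features for occupancy, LiDAR, and text description.
modality_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# A distance question might score LiDAR highest.
print(fuse(modality_features, [0.0, 2.0, 0.0]))
```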
Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
Authors: Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang
First: 2025-11-04T11:43:05+00:00 · Latest: 2025-12-16T05:43:39+00:00
Abstract
The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
中文标题/摘要
标题:将通用基础模型适应于低数据条件下X射线 Ptychography
在先进显微镜工作流自动化方面,基础模型如语言模型(LLMs)和视觉-语言模型(VLMs)显示出巨大潜力。然而,将这些通用模型适应于专门的科学任务至关重要,而最优领域适应策略往往不明确。为解决这一问题,我们引入了PtychoBench,这是一种新的多模态、多任务基准,用于衍射分析。利用这一基准,我们系统地比较了两种专门化策略:监督微调(SFT)和上下文学习(ICL)。我们在数据稀缺条件下使用VLMs进行视觉伪影检测任务,使用LLMs进行文本参数推荐任务。我们的研究发现,最优的专门化路径取决于任务。对于视觉任务,SFT和ICL高度互补,带有上下文感知示例的微调模型达到了最高的平均性能(Micro-F1为0.728)。相反,对于文本任务,大型基础模型上的ICL是更优策略,达到了峰值Micro-F1为0.847,并且优于强大的“超级专家”SFT模型(零样本Micro-F1为0.839)。我们还确认了上下文感知提示的优越性,并在微调模型中识别出了一致的上下文干扰现象。这些结果,与包括GPT-4o和基于DINOv3的分类器在内的强基线进行基准测试,为科学中的AI提供了关键观察:在我们的基准中,最优的专门化路径取决于任务模态,为开发更有效的基于科学的代理系统提供了清晰的框架。
Summary / 总结
The paper explores how general-purpose models like Language Models (LLMs) and Vision-Language Models (VLMs) can be adapted for specialized scientific tasks, specifically ptychographic analysis. Two strategies, Supervised Fine-Tuning (SFT) and In-Context Learning (ICL), were compared on visual and textual tasks using the PtychoBench benchmark. For the visual artifact detection task, SFT and ICL were found to be highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). For the textual parameter recommendation task, ICL on a large base model outperformed SFT, reaching a peak Micro-F1 of 0.847 against a 0-shot Micro-F1 of 0.839 for the SFT model. The study highlights the task-dependent nature of the optimal specialization pathway and the importance of context-aware prompting.
该研究旨在探索将通用模型如VLMs和LLMs适应于特定科学任务,特别是X射线 Ptychography分析。研究使用了一个名为PtychoBench的新基准,比较了两种策略:监督微调(SFT)和上下文学习(ICL)。在视觉图像中的伪影检测任务中,SFT和ICL表现出高度互补性,通过上下文感知示例微调的模型取得了最佳性能(Micro-F1为0.728)。而在文本参数推荐任务中,大型基模型上的ICL策略优于SFT,达到了峰值Micro-F1为0.847。研究强调了最优专业化路径依赖于任务模态,并突出了上下文感知提示的重要性。
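Illustrative sketch for the PtychoBench entry above: the reported scores are Micro-F1 values. For readers unfamiliar with the metric, the snippet below computes Micro-F1 by pooling true positives, false positives, and false negatives over all samples before taking the harmonic mean. It assumes a multi-label setting where each sample carries a set of labels, which the abstract does not state explicitly; the label names are made up.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over per-sample label sets.

    Counts are pooled across all samples, so frequent labels dominate the score.
    """
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # predicted but not present
        fn += len(truth - pred)   # present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical artifact labels for two reconstructions.
truth = [{"grid", "blur"}, {"ringing"}]
preds = [{"grid"}, {"ringing", "noise"}]
print(round(micro_f1(truth, preds), 3))  # tp=2, fp=1, fn=1 -> 4/6
```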
Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries
Authors: Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof
First: 2025-12-16T05:33:44+00:00 · Latest: 2025-12-16T05:33:44+00:00
Abstract
Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMs). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
中文标题/摘要
标题:神经符号推理在基础模型中的应用:针对遥感文本到图像检索的复杂查询
遥感(RS)中的文本到图像检索随着大型视觉语言模型(LVLMs)的发展而迅速进步,这些模型专门针对航空和卫星图像,最终形成了遥感大型视觉语言模型(RS-LVLMS)。然而,有限的可解释性和对复杂空间关系的处理不足仍然是实际应用中的关键挑战。为了解决这些问题,我们提出了RUNE(基于神经符号实体的推理),这是一种将大型语言模型(LLMs)与神经符号AI结合的方法,通过推理检测到的实体与从文本查询中推导出的一阶逻辑(FOL)表达式的兼容性来检索图像。与依赖于隐式联合嵌入的RS-LVLMs不同,RUNE执行显式推理,从而提高性能和可解释性。为了提高可扩展性,我们提出了一种逻辑分解策略,该策略在检测到的实体的条件子集上操作,确保执行时间比神经方法更短。我们仅利用基础模型生成一阶逻辑表达式,将推理任务委托给神经符号推理模块。为了评估,我们重新利用了DOTA数据集,该数据集最初用于目标检测,通过增加比现有基准更复杂的查询来增强它。我们展示了大型语言模型在文本到逻辑转换中的有效性,并将RUNE与最先进的RS-LVLMs进行比较,显示出更好的性能。我们引入了两个指标,检索对查询复杂性的鲁棒性(RRQC)和检索对图像不确定性的鲁棒性(RRIU),以评估性能相对于查询复杂性和图像不确定性。RUNE在复杂的RS检索任务中优于联合嵌入模型,提供了性能、鲁棒性和可解释性的提升。我们通过一个洪水后卫星图像检索的应用案例展示了RUNE在实际遥感应用中的潜力。
Summary / 总结
This paper addresses the challenges of explainability and handling complex spatial relations in text-to-image retrieval for remote sensing (RS) using large vision-language models (LVLMs). It introduces RUNE, which combines LLMs with neurosymbolic AI to reason over detected entities and First-Order Logic (FOL) expressions derived from text queries. RUNE outperforms joint-embedding models in complex RS retrieval tasks, showing superior performance, robustness, and explainability. Two new metrics, RRQC and RRIU, are introduced to evaluate performance under query complexity and image uncertainty, with RUNE demonstrating better results.
该论文通过引入结合大型语言模型(LLMs)和神经符号AI的RUNE方法,解决了遥感(RS)中文本到图像检索中的解释性和处理复杂空间关系的挑战。RUNE将文本查询转换为一阶逻辑(FOL)表达式,并对检测到的实体进行显式推理,优于联合嵌入模型在复杂RS检索任务中的表现。它引入了两个新的评估指标RRQC和RRIU,以评估在不同查询复杂性和图像不确定性下的性能,展示了更高的鲁棒性和解释性。
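Illustrative sketch for the RUNE entry above: the abstract describes checking detected entities against a First-Order Logic expression derived from the text query. The toy example below evaluates one existential formula, "there exist a ship x and a harbor y such that near(x, y)", over a list of detections. The entity schema, the `near` predicate with its pixel threshold, and the helper `exists_pair` are all hypothetical; RUNE's actual inference module and logic decomposition are not reproduced.

```python
import math
from itertools import product


def near(a, b, max_dist=50.0):
    """Hypothetical spatial predicate: centres within `max_dist` pixels."""
    return math.dist(a["center"], b["center"]) <= max_dist


def exists_pair(entities, cls_a, cls_b, relation):
    """True if some distinct pair (x, y) satisfies cls_a(x), cls_b(y), relation(x, y)."""
    return any(
        a is not b and a["cls"] == cls_a and b["cls"] == cls_b and relation(a, b)
        for a, b in product(entities, repeat=2)
    )


# Detections for one image (made-up values).
detections = [
    {"cls": "ship", "center": (10.0, 10.0)},
    {"cls": "harbor", "center": (13.0, 14.0)},
    {"cls": "plane", "center": (400.0, 400.0)},
]
print(exists_pair(detections, "ship", "harbor", near))
print(exists_pair(detections, "plane", "harbor", near))
```

Because each predicate is evaluated explicitly, the pair of entities that satisfied the query can be reported back to the user, which is the kind of interpretability the abstract attributes to explicit reasoning.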
Adaptive Detector-Verifier Framework for Zero-Shot Polyp Detection in Open-World Settings
Authors: Shengkai Xu, Hsiang Lun Kao, Tianxiang Xu, Honghui Zhang, Junqiao Wang, Runmeng Ding, Guanyu Liu, Tianyu Shi, Zhenyu Yu, Guofeng Pan, Ziqian Bi, Yuqi Ouyang
First: 2025-12-13T23:33:05+00:00 · Latest: 2025-12-16T04:40:54+00:00
Abstract
Polyp detectors trained on clean datasets often underperform in real-world endoscopy, where illumination changes, motion blur, and occlusions degrade image quality. Existing approaches struggle with the domain gap between controlled laboratory conditions and clinical practice, where adverse imaging conditions are prevalent. In this work, we propose AdaptiveDetector, a novel two-stage detector-verifier framework comprising a YOLOv11 detector with a vision-language model (VLM) verifier. The detector adaptively adjusts per-frame confidence thresholds under VLM guidance, while the verifier is fine-tuned with Group Relative Policy Optimization (GRPO) using an asymmetric, cost-sensitive reward function specifically designed to discourage missed detections -- a critical clinical requirement. To enable realistic assessment under challenging conditions, we construct a comprehensive synthetic testbed by systematically degrading clean datasets with adverse conditions commonly encountered in clinical practice, providing a rigorous benchmark for zero-shot evaluation. Extensive zero-shot evaluation on synthetically degraded CVC-ClinicDB and Kvasir-SEG images demonstrates that our approach improves recall by 14 to 22 percentage points over YOLO alone, while precision remains within 0.7 points below to 1.7 points above the baseline. This combination of adaptive thresholding and cost-sensitive reinforcement learning achieves clinically aligned, open-world polyp detection with substantially fewer false negatives, thereby reducing the risk of missed precancerous polyps and improving patient outcomes.
中文标题/摘要
标题:适应性检测-验证框架在开放世界环境下的零样本结肠息肉检测
在干净数据集上训练的结肠息肉检测器在实际临床内镜检查中往往表现不佳,因为光照变化、运动模糊和遮挡会降低图像质量。现有方法难以弥合实验室控制条件与临床实践中普遍存在的不良成像条件之间的差距。本文提出了一种名为AdaptiveDetector的新型两阶段检测-验证框架,该框架包括YOLOv11检测器和视觉语言模型(VLM)验证器。检测器在VLM的指导下自适应地调整每帧的置信度阈值,而验证器则使用组相对策略优化(GRPO)进行微调,采用一种不对称的成本敏感奖励函数,专门设计用于减少漏检——这是临床中的关键需求。为了在具有挑战性的条件下进行现实评估,我们通过系统地将临床实践中常见的不良条件应用于干净数据集,构建了一个全面的合成测试床,为零样本评估提供了一个严格的基准。在合成降级的CVC-ClinicDB和Kvasir-SEG图像上的零样本评估表明,与仅使用YOLO相比,我们的方法在召回率上提高了14到22个百分点,而精度仅比基线低0.7到1.7个百分点。这种自适应阈值设置与成本敏感强化学习的结合实现了临床对齐的开放世界结肠息肉检测,显著减少了假阴性,从而降低了漏检的癌前息肉的风险,改善了患者预后。
Summary / 总结
The research addresses the underperformance of polyp detectors in real-world endoscopy due to adverse imaging conditions. It introduces AdaptiveDetector, a two-stage detector-verifier framework using YOLOv11 and a vision-language model for adaptive confidence threshold adjustment and fine-tuning with Group Relative Policy Optimization. The approach shows a 14 to 22 percentage point improvement in recall over YOLO alone while maintaining precision close to the baseline, effectively reducing false negatives and improving patient outcomes in open-world settings.
研究旨在通过解决实验室和临床条件之间的领域差距,提高现实世界内窥镜中的息肉检测。提出的AdaptiveDetector框架使用YOLOv11检测器和视觉-语言模型验证器来适应性调整置信阈值,并使用Group Relative Policy Optimization微调验证器。该方法在召回率上显著提高了14-22个百分点,同时保持精度接近基线,展示了临床对齐、开放世界的息肉检测,具有更少的假阴性结果。
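Illustrative sketch for the AdaptiveDetector entry above: the abstract mentions an asymmetric, cost-sensitive reward that discourages missed detections. The snippet below shows what such a reward could look like for a verifier that accepts or rejects a candidate box. The specific reward values and the function name are assumptions for illustration; the paper's actual reward function is not given in the abstract.

```python
def asymmetric_reward(accepted, is_polyp, fn_penalty=3.0, fp_penalty=1.0):
    """Reward for one verifier decision.

    A missed polyp (false negative) is penalised more heavily than a false
    alarm (false positive). The default magnitudes are illustrative only.
    """
    if accepted == is_polyp:
        return 1.0            # correct accept or correct reject
    if is_polyp and not accepted:
        return -fn_penalty    # missed polyp: the costly error
    return -fp_penalty        # accepted a non-polyp


for accepted, is_polyp in [(True, True), (False, False), (False, True), (True, False)]:
    print(accepted, is_polyp, asymmetric_reward(accepted, is_polyp))
```

With a larger penalty for false negatives than for false positives, a policy optimised against this signal is pushed toward higher recall, which is consistent with the recall gains and near-baseline precision the abstract reports.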
Learning neuroimaging models from health system-scale data
Authors: Yiwei Lyu, Samir Harake, Asadur Chowdury, Soumyanil Banerjee, Rachel Gologorsky, Shixuan Liu, Anna-Katharina Meissner, Akshay Rao, Chenhui Zhao, Akhil Kondepudi, Cheng Jiang, Xinhai Hou, Rushikesh S. Joshi, Volker Neuschmelting, Ashok Srinivasan, Dawn Kleindorfer, Brian Athey, Vikas Gulani, Aditya Pandey, Honglak Lee, Todd Hollon
First: 2025-09-23T04:49:59+00:00 · Latest: 2025-12-16T04:26:10+00:00
Abstract
Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout. These challenges disproportionately impact patients in low-resource and rural settings. Here, we utilized a large academic health system as a data engine to develop Prima, the first vision language model (VLM) serving as an AI foundation for neuroimaging that supports real-world, clinical MRI studies as input. Trained on over 220,000 MRI studies, Prima uses a hierarchical vision architecture that provides general and transferable MRI features. Prima was tested in a 1-year health system-wide study that included 30K MRI studies. Across 52 radiologic diagnoses from the major neurologic disorders, including neoplastic, inflammatory, infectious, and developmental lesions, Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. Prima offers explainable differential diagnoses, worklist priority for radiologists, and clinical referral recommendations across diverse patient demographics and MRI systems. Prima demonstrates algorithmic fairness across sensitive groups and can help mitigate health system biases, such as prolonged turnaround times for low-resource populations. These findings highlight the transformative potential of health system-scale VLMs and Prima's role in advancing AI-driven healthcare.
中文标题/摘要
标题:从医疗系统规模数据中学习神经影像模型
神经影像学是评估神经疾病患者的一种普遍工具。全球磁共振成像(MRI)研究的需求持续上升,给医疗系统带来了巨大压力,延长了周转时间,并加剧了医生的职业倦怠。这些挑战在资源匮乏和农村地区患者中尤为突出。在这里,我们利用一个大型学术医疗系统作为数据引擎,开发了Prima,这是第一个作为神经影像AI基础的视觉语言模型(VLM),支持实际临床MRI研究作为输入。Prima基于超过220,000份MRI研究进行训练,使用分层视觉架构提供通用和可转移的MRI特征。Prima在为期一年的全系统研究中进行了测试,包括30,000份MRI研究。在52种主要神经疾病放射学诊断中,包括肿瘤、炎症、感染和发育性病变,Prima的受试者操作特征曲线下面积均值达到92.0,优于其他最先进的通用和医疗AI模型。Prima提供了可解释的鉴别诊断、放射科医生的工作列表优先级以及跨不同患者群体和MRI系统的临床转诊建议。Prima展示了对敏感群体的算法公平性,并能帮助缓解医疗系统偏见,如资源匮乏人群的长时间周转时间。这些发现突显了医疗系统规模视觉语言模型的变革潜力以及Prima在推动AI驱动医疗方面的作用。
Summary / 总结
This study addresses the challenges of MRI demand in health systems by developing Prima, a vision language model trained on over 220,000 MRI studies. In a 1-year health system-wide study of 30K MRI studies, Prima outperformed other state-of-the-art models across 52 radiologic diagnoses spanning the major neurologic disorders, with a mean diagnostic area under the ROC curve of 92.0. It provides explainable differential diagnoses, prioritizes worklists for radiologists, and offers clinical referral recommendations, demonstrating algorithmic fairness and helping mitigate health system biases.
研究旨在通过开发基于超过220,000份MRI研究的Prima视觉语言模型来应对健康系统中的MRI需求挑战。Prima在涉及30,000份MRI研究的1年健康系统研究中测试,对52种放射学诊断的平均诊断ROC曲线下面积达到92.0,优于其他最先进的模型。该模型提供可解释的诊断差异、放射科医生的工作列表优先级和临床转诊建议,展示了算法公平性,并有助于缓解健康系统中的偏见。
SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding
Authors: Shuang Cheng, Yuhua Jiang, Zineng Zhou, Dawei Liu, Wang Tao, Linfeng Zhang, Biqing Qi, Bowen Zhou
First: 2025-12-16T04:12:52+00:00 · Latest: 2025-12-16T04:12:52+00:00
Abstract
Block-wise discrete diffusion offers an attractive balance between parallel generation and causal dependency modeling, making it a promising backbone for vision-language modeling. However, its practical adoption has been limited by high training cost, slow convergence, and instability, which have so far kept it behind strong autoregressive (AR) baselines. We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. This framework unifies three components: (1) Asynchronous Block-wise Noise Scheduling to diversify supervision within each batch; (2) Effective Mask Ratio Scaling for unbiased loss normalization under stochastic masking; and (3) a Progressive Beta Noise Curriculum that increases effective mask coverage while preserving corruption diversity. Experiments on 21 single-image, multi-image, and video benchmarks show that SDAR-VL consistently improves training efficiency, convergence stability, and task performance over conventional block diffusion. On this evaluation suite, SDAR-VL sets a new state of the art among diffusion-based vision-language models and, under matched settings, matches or surpasses strong AR baselines such as LLaVA-OneVision as well as the global diffusion baseline LLaDA-V, establishing block-wise diffusion as a practical backbone for VLU.
中文标题/摘要
标题:SDAR-VL:稳定高效的块级扩散方法用于视觉-语言理解
块级离散扩散在并行生成和因果依赖建模之间提供了吸引人的平衡,使其成为视觉-语言建模的有前途的骨干。然而,由于高训练成本、缓慢收敛和不稳定性,其实际应用受到了限制,目前仍落后于强大的自回归(AR)基线。我们提出了**SDAR-VL**,这是首次系统地将块级离散扩散应用于大规模视觉-语言理解(VLU),并提供了一种**高效的稳定训练集成框架**。该框架统一了三个组件:(1)**异步块级噪声调度**,以在每个批次中多样化监督;(2)**有效的掩码比例缩放**,以在随机掩码下实现无偏损失归一化;以及(3)**渐进贝塔噪声课程**,该课程增加了有效的掩码覆盖率,同时保持了破坏多样性。在21个单图像、多图像和视频基准上的实验表明,SDAR-VL 在训练效率、收敛稳定性和任务性能方面均优于传统的块扩散方法。在这一评估套件中,SDAR-VL 在基于扩散的视觉-语言模型中达到了新的最佳水平,并且在匹配设置下,与强大的自回归基线(如LLaVA-OneVision)以及全球扩散基线(如LLaDA-V)相当或超越,确立了块级扩散作为视觉-语言理解的实用骨干的地位。
Summary / 总结
The paper introduces SDAR-VL, which addresses the limitations of block-wise discrete diffusion in vision-language understanding by proposing an integrated framework that includes asynchronous block-wise noise scheduling, effective mask ratio scaling, and a progressive beta noise curriculum. Across 21 single-image, multi-image, and video benchmarks, this framework improves training efficiency, convergence stability, and task performance over conventional block diffusion, setting a new state of the art among diffusion-based vision-language models and, under matched settings, matching or surpassing strong autoregressive baselines.
论文提出了SDAR-VL,通过提出一个高效的稳定训练框架来解决块离散扩散在视觉语言理解中的局限性。该框架包括异步块噪声调度、有效的掩码比例缩放和渐进的贝塔噪声课程。实验表明,SDAR-VL 在提高训练效率、收敛稳定性和任务性能方面优于传统的块扩散,并且在21个基准测试中与强大的自回归基线相当或超越。
VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management
Authors: Hongbo Jin, Qingyuan Wang, Wenhao Zhang, Yang Liu, Sijie Cheng
First: 2025-12-04T07:42:13+00:00 · Latest: 2025-12-16T03:47:09+00:00
Abstract
Ultra-long video understanding remains an open challenge, as existing vision language models (VLMs) falter on such content due to limited context length and inefficient long-term memory retention. To address this, recent works have attempted to construct external knowledge bases and corresponding retrieval-augmented generation (RAG) systems, yet these incur enormous storage and computational overhead. In this paper, we propose VideoMem, a novel framework that pioneers modeling long video understanding as a sequential generation task via adaptive memory management. Specifically, VideoMem dynamically updates a global memory buffer, which adaptively retains critical information while discarding redundant content across the video timeline. To efficiently train VLMs for such long-term tasks, VideoMem integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm, equipped with two core modules: Progressive State Propagation (PSP) adaptively retains valid current states, propagates them to the next rollout step, and gradually narrows the model exploration space. Temporal Cascading Reward (TCR) further alleviates reward sparsity, improving sample utilization and accelerating convergence. Extensive experiments demonstrate that VideoMem significantly outperforms existing open-source models across diverse benchmarks for ultra-long video understanding tasks.
中文标题/摘要
标题:VideoMem:通过自适应内存管理增强超长视频理解
超长视频理解仍然是一个开放的挑战,因为现有的视觉语言模型(VLMs)在处理此类内容时由于上下文长度有限和长期记忆保留效率低下而表现不佳。为了解决这个问题,最近的工作尝试构建外部知识库和相应的检索增强生成(RAG)系统,但这些方法会带来巨大的存储和计算开销。在本文中,我们提出了VideoMem,这是一种新颖的框架,通过自适应内存管理将模型的长视频理解任务视为一个序列生成任务。具体而言,VideoMem动态更新全局内存缓冲区,该缓冲区会自适应地保留关键信息并丢弃视频时间线上的冗余内容。为了高效地训练VLMs以处理此类长期任务,VideoMem集成了渐进分组相对策略优化(PRPO)算法,该算法配备了两个核心模块:渐进状态传播(PSP)自适应地保留有效的当前状态,将其传播到下一个滚动步骤,并逐步缩小模型的探索空间。时间级联奖励(TCR)进一步缓解了奖励稀疏性,提高了样本利用效率并加速了收敛。广泛的实验表明,VideoMem在各种超长视频理解任务基准测试中显著优于现有开源模型。
Summary / 总结
VideoMem addresses the challenge of ultra-long video understanding by proposing a novel framework that uses adaptive memory management to dynamically update a global memory buffer, retaining critical information and discarding redundant content. It integrates the Progressive Grouped Relative Policy Optimization (PRPO) algorithm with two core modules: Progressive State Propagation (PSP) and Temporal Cascading Reward (TCR), which help in efficiently training VLMs for long-term tasks. Experimental results show that VideoMem outperforms existing models across various benchmarks for ultra-long video understanding tasks.
论文提出VideoMem框架,通过自适应内存管理增强视觉语言模型(VLMs)对超长视频的理解能力。VideoMem动态更新全局内存缓冲区,保留关键信息并丢弃冗余内容。它结合了渐进分组相对策略优化(PRPO)算法,包括渐进状态传播(PSP)和时间级联奖励(TCR),以提高模型训练效率。实验表明,VideoMem在各种超长视频理解任务基准测试中优于现有模型。
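The adaptive memory idea in the VideoMem entry above can be illustrated with a small sketch. This is not the authors' implementation: the class `MemoryBuffer`, its fields, and the per-segment importance scores are hypothetical, and the eviction rule (drop the lowest-scoring entry when the buffer is full) is a simple stand-in for whatever learned policy the paper uses.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBuffer:
    """Hypothetical fixed-capacity memory that keeps the highest-scoring entries."""
    capacity: int
    entries: list = field(default_factory=list)  # (timestamp, score, note)

    def update(self, timestamp, score, note):
        """Add one segment summary, then evict the least important entries."""
        self.entries.append((timestamp, score, note))
        if len(self.entries) > self.capacity:
            # Keep the `capacity` most important entries ...
            kept = sorted(self.entries, key=lambda e: e[1], reverse=True)[: self.capacity]
            # ... and restore timeline order for downstream reading.
            self.entries = sorted(kept, key=lambda e: e[0])

    def notes(self):
        """Return the retained notes in timeline order."""
        return [note for _, _, note in self.entries]


# Toy stream of (importance score, segment note); scores are made up.
stream = [(0.9, "goal scored"), (0.1, "crowd shot"), (0.7, "red card")]
buffer = MemoryBuffer(capacity=2)
for t, (score, note) in enumerate(stream):
    buffer.update(t, score, note)
```

With a capacity of two, the low-scoring "crowd shot" segment is discarded while the two higher-scoring segments stay in their original order.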
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
Authors: Zhenguo Zhang, Haohan Zhen, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang
First: 2025-12-16T03:19:28+00:00 · Latest: 2025-12-16T03:19:28+00:00
Abstract
The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. Thus, we introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.
中文标题/摘要
标题:OmniDrive-R1:强化驱动的交织多模态链式思考框架以实现可信的视觉-语言自动驾驶
在自动驾驶(AD)等安全关键领域部署视觉-语言模型(VLMs)受到可靠性故障的严重阻碍,尤其是对象幻觉。这种故障源于它们依赖于基于文本的链式思考(CoT)推理。虽然现有的多模态CoT方法试图缓解这一问题,但它们存在两个根本缺陷:(1)感知和推理阶段的分离,这妨碍了端到端联合优化,(2)依赖昂贵的密集定位标签。因此,我们提出了OmniDrive-R1,这是一种为自动驾驶设计的端到端VLM框架,通过交织多模态链式思考(iMCoT)机制统一了感知和推理。我们的核心创新是一种强化驱动的视觉定位能力,使模型能够自主地将注意力集中在关键区域进行精细分析。这种能力通过我们纯两阶段强化学习训练管道和Clip-GRPO算法实现。关键的是,Clip-GRPO引入了一种无需标注的过程导向定位奖励。这种奖励不仅消除了对密集标签的需求,还通过强制实时跨模态一致性来避免外部工具调用的不稳定性。在DriveLMM-o1上的大量实验表明,我们的模型取得了显著改进。与基线Qwen2.5VL-7B相比,OmniDrive-R1的整体推理得分从51.77%提高到80.35%,最终答案准确性从37.81%提高到73.62%。
Summary / 总结
OmniDrive-R1 is an end-to-end Vision-Language Model framework for autonomous driving that integrates perception and reasoning through an interleaved Multi-modal Chain-of-Thought mechanism. It introduces a reinforcement-driven visual grounding capability, which allows the model to autonomously focus on critical regions for fine-grained analysis. This is achieved through a two-stage reinforcement learning training pipeline and the Clip-GRPO algorithm, which uses annotation-free, process-based grounding rewards to ensure real-time cross-modal consistency. Experimental results on DriveLMM-o1 show significant improvements, with the reasoning score increasing from 51.77% to 80.35% and the final answer accuracy from 37.81% to 73.62% compared to the baseline Qwen2.5VL-7B.
OmniDrive-R1 是一个端到端的 VLM 框架,用于自动驾驶,它通过交错的多模态链式思考(iMCoT)机制将感知和推理结合起来。该模型引入了一种基于强化学习的视觉定位能力,使其能够聚焦于关键区域进行精细分析。模型使用两阶段的强化学习训练管道和 Clip-GRPO 算法,提供无需注释的过程导向定位奖励,以确保实时跨模态一致性。实验表明,与基线 Qwen2.5VL-7B 相比,OmniDrive-R1 在推理和答案准确性方面有显著改进,整体推理分数从 51.77% 提高到 80.35%,最终答案准确性从 37.81% 提高到 73.62%。
CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images
Authors: Bo Liu, Qiao Qin, Qinghui He
Venue: AAAI 2026
First: 2025-12-15T12:48:27+00:00 · Latest: 2025-12-16T02:47:19+00:00
Comments: 9 pages, accepted to AAAI 2026
Abstract
The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
中文标题/摘要
标题:CausalCLIP:因果驱动的特征解缠与筛选以实现生成图像检测的泛化能力
生成模型的快速发展增加了对能够跨多样且不断演变的生成技术进行泛化的生成图像检测器的需求。然而,现有的方法,包括利用预训练的视觉-语言模型的方法,往往会产生高度纠缠的表示,将与任务相关的法医线索(因果特征)与无关或不相关的模式(非因果特征)混合在一起,从而限制了泛化能力。为了解决这一问题,我们提出了CausalCLIP框架,该框架明确地解缠因果特征与非因果特征,并通过因果推理原则进行目标筛选,仅保留最可转移和区分的法医线索。通过使用结构因果模型建模生成过程,并通过Gumbel-Softmax基特征掩蔽和Hilbert-Schmidt独立性判别准则(HSIC)约束来强制统计独立性,CausalCLIP隔离了对分布偏移具有鲁棒性的稳定因果特征。在不同系列的未见过的生成模型上进行测试时,CausalCLIP展示了强大的泛化能力,相对于最先进的方法,在准确性和平均精度上分别提高了6.83%和4.06%。
Summary / 总结
CausalCLIP is a framework designed to improve the generalization of generated image detectors by disentangling causal and non-causal features. It uses causal inference principles to filter out non-causal features and retains only the most discriminative causal features. By modeling the generation process with a structural causal model and applying Gumbel-Softmax-based feature masking and HSIC constraints, CausalCLIP enhances the stability and transferability of the retained features. Experiments show that CausalCLIP outperforms existing methods, achieving a 6.83% improvement in accuracy and a 4.06% improvement in average precision on unseen generative models.
CausalCLIP 是一种通过分离因果特征和非因果特征来提高生成图像检测器泛化能力的框架。它利用因果推理原则过滤掉非因果特征,确保仅保留最具转移性和区分性的因果特征。通过使用结构因果模型建模生成过程,并应用基于 Gumbel-Softmax 的特征掩码和 HSIC 约束,CausalCLIP 能够隔离在分布变化中依然稳定的因果特征。实验表明,CausalCLIP 在未见过的生成模型上的表现优于最先进的方法,准确率提高了 6.83%,平均精度提高了 4.06%。
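The CausalCLIP entry above relies on the Hilbert-Schmidt Independence Criterion (HSIC) to push two feature groups toward statistical independence. As a rough illustration of what that quantity measures, here is a minimal sketch of the standard biased empirical HSIC estimator, trace(K H L H) / (n - 1)^2, with Gaussian kernels over scalar samples. The function names, the scalar inputs, and the fixed bandwidth `sigma` are assumptions made for illustration; how the paper applies HSIC to its feature tensors is not specified in the abstract.

```python
import math


def gaussian_kernel(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in xs] for a in xs]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(gaussian_kernel(xs, sigma))
    L = gaussian_kernel(ys, sigma)
    # trace(H K H L) equals the elementwise product sum because L is symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


xs = [0.0, 1.0, 2.0, 3.0]
dependent = hsic(xs, xs)            # a variable compared with itself
independent = hsic(xs, [3.0] * 4)   # a constant carries no information about xs
```

A value near zero suggests independence under the chosen kernels, so a training loss that penalizes HSIC between two feature groups discourages them from sharing information.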
MobileWorldBench: Towards Semantic World Modeling For Mobile Agents
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Aditya Grover
First: 2025-12-16T02:16:42+00:00 · Latest: 2025-12-16T02:16:42+00:00
Comments: 21 pages, 13 figures
Abstract
World models have shown great utility in improving the task performance of embodied agents. While prior work largely focuses on pixel-space world models, these approaches face practical limitations in GUI settings, where predicting complex visual elements in future states is often difficult. In this work, we explore an alternative formulation of world modeling for GUI agents, where state transitions are described in natural language rather than predicting raw pixels. First, we introduce MobileWorldBench, a benchmark that evaluates the ability of vision-language models (VLMs) to function as world models for mobile GUI agents. Second, we release MobileWorld, a large-scale dataset consisting of 1.4M samples, that significantly improves the world modeling capabilities of VLMs. Finally, we propose a novel framework that integrates VLM world models into the planning framework of mobile agents, demonstrating that semantic world models can directly benefit mobile agents by improving task success rates. The code and dataset are available at https://github.com/jacklishufan/MobileWorld
中文标题/摘要
标题:MobileWorldBench:面向移动代理的语义世界建模
世界模型在提高具身代理任务性能方面显示出巨大的实用性。尽管先前的工作主要集中在像素空间的世界模型上,但在GUI设置中,预测未来状态中的复杂视觉元素往往存在实际限制。在本工作中,我们探索了GUI代理的一种替代世界建模形式,其中状态转换用自然语言描述,而不是预测原始像素。首先,我们介绍了MobileWorldBench,这是一个基准测试,评估视觉语言模型(VLMs)作为移动GUI代理世界模型的能力。其次,我们发布了MobileWorld,这是一个包含140万样本的大规模数据集,显著提高了VLMs的世界建模能力。最后,我们提出了一种新的框架,将VLM世界模型整合到移动代理的规划框架中,证明了语义世界模型可以直接通过提高任务成功率来直接惠及移动代理。代码和数据集可在https://github.com/jacklishufan/MobileWorld获取。
Guideline-Consistent Segmentation via Multi-Agent Refinement
Authors: Vanshika Vats, Ashwani Rathee, James Davis
Venue: AAAI
First: 2025-09-04T22:32:57+00:00 · Latest: 2025-12-16T01:40:54+00:00
Comments: To be published in The Fortieth AAAI Conference on Artificial Intelligence (AAAI 2026)
Abstract
Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.
中文标题/摘要
标题:基于指南一致的多智能体细化分割
在实际应用中的语义分割不仅需要准确的掩膜,还需要严格遵守文本标注指南。这些指南通常复杂且冗长,无论是人工还是自动标注往往都不能忠实遵守。传统方法依赖于昂贵的任务特定重训练,且随着指南的演变需要重复进行。尽管最近的开放式词汇分割方法在简单的提示下表现出色,但在面对包含段落长度指南的复杂分割规则时往往失败。为解决这一问题,我们提出了一种无需训练的多智能体框架,该框架在迭代的工人-监督者细化架构中协调通用视觉-语言模型。工人执行分割,监督者根据检索到的指南对其进行评价,轻量级的强化学习停止策略决定何时终止循环,确保指南一致的掩膜同时平衡资源使用。在Waymo和ReasonSeg数据集上评估,我们的方法显著优于最先进的基线,展示了强大的泛化能力和指令遵守能力。
Summary / 总结
The research addresses the challenge of semantic segmentation in real-world applications, where strict adherence to complex textual labeling guidelines is crucial. It proposes a multi-agent, training-free framework that uses an iterative Worker-Supervisor architecture to ensure guideline-consistent masks. The Worker performs segmentation, the Supervisor critiques it against retrieved guidelines, and a reinforcement learning policy decides when to terminate the loop. Experiments on Waymo and ReasonSeg datasets show that the method outperforms existing baselines, demonstrating strong generalization and instruction adherence.
研究旨在解决严格遵循复杂文本标注指南的语义分割需求,这些指南往往难以遵守。提出了一种多代理框架,使用Worker进行分割,Supervisor进行评估,并通过强化学习策略终止过程。该方法在Waymo和ReasonSeg数据集上的表现优于现有方法,展示了强大的泛化能力和指令遵循能力。
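The Worker-Supervisor loop described in the entry above can be sketched as a plain control loop. This is an illustrative skeleton only: `refine`, the toy worker and supervisor, and the threshold-based stop rule are hypothetical. In particular, the paper uses a lightweight reinforcement learning stop policy, which is replaced here by a fixed score threshold.

```python
def refine(worker, supervisor, should_stop, max_rounds=5):
    """Iterate Worker -> Supervisor until the stop policy ends the loop."""
    feedback = None
    history = []
    for round_idx in range(max_rounds):
        mask = worker(feedback)              # Worker proposes or revises a mask
        score, feedback = supervisor(mask)   # Supervisor critiques it against guidelines
        history.append((mask, score))
        if should_stop(round_idx, score):
            break
    # Return the best-scoring mask seen so far and the number of rounds used.
    best_mask = max(history, key=lambda item: item[1])[0]
    return best_mask, len(history)


# Toy stand-ins: the "mask" is an integer whose quality grows with each revision.
def toy_worker(feedback):
    return 1 if feedback is None else feedback + 1


def toy_supervisor(mask):
    return min(mask / 3, 1.0), mask  # (score, feedback for the next round)


def toy_stop(round_idx, score):
    return score >= 1.0  # threshold rule standing in for a learned stop policy


best_mask, rounds = refine(toy_worker, toy_supervisor, toy_stop)
```

Keeping the stop decision in a separate callable is what lets a learned policy trade mask quality against the cost of further rounds without touching the loop itself.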
From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation
Authors: Dawid Malarz, Artur Kasymov, Filip Manjak, Maciej Zięba, Przemysław Spurek
First: 2025-12-15T23:15:36+00:00 · Latest: 2025-12-15T23:15:36+00:00
Abstract
The rapid progress of text-to-image diffusion models raises significant concerns regarding the unauthorized reproduction of trademarked content. While prior work targets general concepts (e.g., styles, celebrities), it fails to address specific brand identifiers. Crucially, we note that brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features (e.g., a car's front grille). To tackle this, we introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features, while preserving semantic coherence. To facilitate research, we construct a comprehensive benchmark dataset. Recognizing that existing brand detectors are limited to logos and fail to capture abstract trade dress (e.g., the shape of a Coca-Cola bottle), we introduce a novel evaluation metric based on Vision Language Models (VLMs). This VLM-based metric uses a question-answering framework to probe images for both explicit logos and implicit, holistic brand characteristics. Furthermore, we observe that as model fidelity increases, with newer systems (SDXL, FLUX) synthesizing brand identifiers more readily than older models (Stable Diffusion), the urgency of the unbranding challenge is starkly highlighted. Our results, validated by our VLM metric, confirm unbranding is a distinct, practically relevant problem requiring specialized techniques. Project Page: https://gmum.github.io/UNBRANDING/.
中文标题/摘要
标题:从遗忘学习到去品牌化(UNBRANDING):商标安全文本到图像生成的基准
文本到图像扩散模型的迅速发展引发了对未经授权复制商标内容的严重关切。尽管先前的工作针对一般概念(如风格、名人),但未能解决特定品牌标识的问题。我们注意到,品牌识别是多维度的,不仅限于显性的标志,还包括独特的结构特征(如汽车的前格栅)。为应对这一挑战,我们引入了去品牌化这一新任务,旨在精细去除商标和微妙的品牌结构特征,同时保持语义连贯性。为促进研究,我们构建了一个全面的基准数据集。鉴于现有品牌检测器仅限于标志,无法捕捉抽象的商标外观(如可口可乐瓶子的形状),我们引入了一种基于视觉语言模型(VLM)的新评估指标。该VLM指标使用问答框架来探测图像中的显性标志和隐含的整体品牌特征。此外,我们观察到,随着模型保真度的提高,新系统(如SDXL、FLUX)比旧系统(如Stable Diffusion)更容易生成品牌标识,这突显了去品牌化挑战的紧迫性。我们的结果,经由VLM指标验证,证实了去品牌化是一个独特且实际相关的问题,需要专门的技术。项目页面:https://gmum.github.io/UNBRANDING/
Summary / 总结
This paper addresses the challenge of generating text-to-image content that avoids unauthorized reproduction of trademarks and brand identifiers. It introduces the concept of unbranding, which involves removing both explicit logos and subtle structural brand features while maintaining semantic coherence. The authors develop a comprehensive benchmark dataset and a novel evaluation metric based on Vision Language Models to assess the effectiveness of unbranding techniques. The results show that newer models like SDXL and FLUX are more prone to generating brand identifiers, emphasizing the need for specialized unbranding methods. Validation through the VLM metric confirms the practical relevance of this problem.
该论文旨在通过引入去品牌化概念解决生成不含商标内容图像的挑战。作者开发了一个新的任务,旨在去除显性标志和微妙的品牌特征同时保持语义连贯性。他们构建了一个基准数据集,并引入了一种基于视觉语言模型(VLM)的新评估指标,以评估文本到图像模型在去品牌化方面的有效性。研究表明,较新的模型(如 SDXL 和 FLUX)更容易生成品牌标识,突显了去品牌化技术的需求。结果通过 VLM 指标验证,表明去品牌化是一个独特且实际的问题,需要专门的方法。
MIMIR: Masked Image Modeling for Mutual Information-based Adversarial Robustness
Authors: Xiaoyun Xu, Shujian Yu, Zhuoran Liu, Stjepan Picek
First: 2023-12-08T10:50:02+00:00 · Latest: 2025-12-15T22:24:07+00:00
Comments: Accepted by NDSS 2026
Abstract
Vision Transformers (ViTs) have emerged as a fundamental architecture and serve as the backbone of modern vision-language models. Despite their impressive performance, ViTs exhibit notable vulnerability to evasion attacks, necessitating the development of specialized Adversarial Training (AT) strategies tailored to their unique architecture. While a direct solution might involve applying existing AT methods to ViTs, our analysis reveals significant incompatibilities, particularly with state-of-the-art (SOTA) approaches such as Generalist (CVPR 2023) and DBAT (USENIX Security 2024). This paper presents a systematic investigation of adversarial robustness in ViTs and provides a novel theoretical Mutual Information (MI) analysis in its autoencoder-based self-supervised pre-training. Specifically, we show that MI between the adversarial example and its latent representation in ViT-based autoencoders should be constrained via derived MI bounds. Building on this insight, we propose a self-supervised AT method, MIMIR, that employs an MI penalty to facilitate adversarial pre-training by masked image modeling with autoencoders. Extensive experiments on CIFAR-10, Tiny-ImageNet, and ImageNet-1K show that MIMIR can consistently provide improved natural and robust accuracy, where MIMIR outperforms SOTA AT results on ImageNet-1K. Notably, MIMIR demonstrates superior robustness against unforeseen attacks and common corruption data and can also withstand adaptive attacks where the adversary possesses full knowledge of the defense mechanism. Our code and trained models are publicly available at: https://github.com/xiaoyunxxy/MIMIR.
中文标题/摘要
标题:MIMIR:基于互信息的对抗鲁棒性掩蔽图像建模
视觉变换器(ViTs)已成为一种基本架构,并作为现代视觉-语言模型的骨干。尽管它们表现出色,但ViTs在对抗性攻击中表现出明显的脆弱性,因此需要开发专门针对其独特架构的对抗训练(AT)策略。虽然直接解决方案可能涉及将现有的AT方法应用于ViTs,但我们的分析揭示了与最先进的(SOTA)方法如Generalist(CVPR 2023)和DBAT(USENIX Security 2024)之间存在显著的不兼容性。本文系统地研究了ViTs的对抗鲁棒性,并提供了其基于自编码器的自监督预训练中的互信息(MI)分析的新型理论。具体而言,我们表明,基于ViT的自编码器中的对抗样本与其潜在表示之间的互信息应通过导出的互信息界进行约束。基于这一洞察,我们提出了一种自监督AT方法MIMIR,该方法利用互信息惩罚来通过自编码器的掩蔽图像建模进行对抗预训练。在CIFAR-10、Tiny-ImageNet和ImageNet-1K上的广泛实验表明,MIMIR可以一致地提供改进的自然准确性和鲁棒准确性,其中MIMIR在ImageNet-1K上优于SOTA AT结果。值得注意的是,MIMIR在未预见的攻击和常见损坏数据中表现出更强的鲁棒性,并且还可以抵御适应性攻击,其中对手完全了解防御机制。我们的代码和训练模型可在以下网址公开获取:https://github.com/xiaoyunxxy/MIMIR。
Summary / 总结
This paper addresses the vulnerability of Vision Transformers (ViTs) to evasion attacks by proposing MIMIR, a novel self-supervised Adversarial Training method. MIMIR leverages Mutual Information (MI) analysis in autoencoder-based pre-training to constrain the MI between adversarial examples and their latent representations. Experiments on CIFAR-10, Tiny-ImageNet, and ImageNet-1K demonstrate that MIMIR improves both natural and robust accuracy, outperforming state-of-the-art AT methods on ImageNet-1K and showing superior robustness against various attacks.
本文针对Vision Transformers (ViTs)易受规避攻击的问题,提出了一种名为MIMIR的自监督Adversarial Training (AT)方法。出于为ViTs设计专门AT策略的需要,作者进行了系统分析,并提出了一种基于自编码器的自监督预训练方法,该方法约束对抗样本与其潜在表示之间的互信息。实验结果表明,MIMIR在CIFAR-10、Tiny-ImageNet和ImageNet-1K上的自然准确性和鲁棒准确性均有所提升,并且在ImageNet-1K上优于SOTA的AT方法,同时对各种攻击和损坏数据具有更强的鲁棒性。
Med3DVLM: An Efficient Vision-Language Model for 3D Medical Image Analysis
Authors: Yu Xin, Gorkem Can Ates, Kuang Gong, Wei Shao
First: 2025-03-25T20:09:30+00:00 · Latest: 2025-12-15T20:51:09+00:00
Abstract
Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluate our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images. Results show that Med3DVLM achieves superior performance across multiple benchmarks. For image-text retrieval, it reaches 61.00% R@1 on 2,000 samples, significantly outperforming the current state-of-the-art M3D model (19.10%). For report generation, it achieves a METEOR score of 36.42% (vs. 14.38%). In open-ended visual question answering (VQA), it scores 36.76% METEOR (vs. 33.58%), and in closed-ended VQA, it achieves 79.95% accuracy (vs. 75.78%). These results highlight Med3DVLM's ability to bridge the gap between 3D imaging and language, enabling scalable, multi-task reasoning across clinical applications. Our code is publicly available at https://github.com/mirthAI/Med3DVLM.
中文标题/摘要
标题:Med3DVLM:一种高效的3D医学图像分析视觉-语言模型
视觉-语言模型(VLMs)在2D医学图像分析中显示出潜力,但将其扩展到3D仍然具有挑战性,因为体数据的高计算需求以及3D空间特征与临床文本对齐的难度。我们提出了Med3DVLM,这是一种通过三个关键创新设计的3D VLM,以解决这些挑战:(1)DCFormer,一种高效的编码器,使用分解的3D卷积来捕捉大规模的细粒度空间特征;(2)SigLIP,一种基于成对Sigmoid损失的对比学习策略,提高了图像-文本对齐,而无需依赖大规模负样本批次;(3)一种双流MLP-Mixer投影器,将低级和高级图像特征与文本嵌入融合,以获得更丰富的跨模态表示。我们在包含放射学报告和120,084张3D医学图像的VQA数据集M3D上评估了我们的模型。结果显示,Med3DVLM在多个基准测试中取得了优越的性能。在图像-文本检索中,它在2,000个样本上达到了61.00%的R@1,显著优于当前最先进的M3D模型(19.10%)。在报告生成中,它实现了36.42%的METEOR分数(相比之下为14.38%)。在开放性视觉问答(VQA)中,它获得了36.76%的METEOR分数(相比之下为33.58%),在封闭性VQA中,它实现了79.95%的准确率(相比之下为75.78%)。这些结果突显了Med3DVLM在3D成像与语言之间建立桥梁的能力,使其能够在临床应用中实现可扩展的多任务推理。我们的代码可在https://github.com/mirthAI/Med3DVLM上公开获取。
Summary / 总结
Med3DVLM is designed to enhance 3D medical image analysis by addressing the challenges of volumetric data processing and image-text alignment. It introduces DCFormer for efficient 3D feature extraction, SigLIP for improved image-text alignment, and a dual-stream MLP-Mixer projector for rich multi-modal representations. On the M3D dataset, Med3DVLM outperforms the state-of-the-art model in image-text retrieval, report generation, and VQA tasks, demonstrating its effectiveness in bridging 3D imaging and language for clinical applications.
Med3DVLM旨在通过解决体积数据处理和图像-文本对齐的挑战来增强3D医学图像分析。它引入了DCFormer进行高效的3D特征提取,SigLIP以提高图像-文本对齐,并使用双流MLP-Mixer投影器生成丰富的多模态表示。在M3D数据集上,Med3DVLM在图像-文本检索、报告生成和VQA任务中均优于当前最先进的模型,展示了其在临床应用中将3D成像与语言相结合的有效性。
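The pairwise sigmoid loss mentioned in the Med3DVLM entry above treats every image-text pair as an independent binary decision, which is why it does not need large negative batches. A minimal sketch follows, assuming unit-length embeddings given as plain lists. The function names and the constants `temperature=10.0` and `bias=-5.0` are illustrative choices, not values taken from the paper (in practice these are typically learnable parameters).

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-5.0):
    """Mean over images of -sum_j log sigmoid(z_ij * (t * <x_i, y_j> + b))."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for matched pairs, -1 otherwise
            total += log_sigmoid(label * logit)
    return -total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image is most similar to its own text and large when the pairing is swapped, without any softmax normalization across the batch.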