DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
First: 2025-12-17T18:59:55+00:00 · Latest: 2025-12-17T18:59:55+00:00
Comments: 11 pages, 5 figures
Abstract
In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of diffusion vision language models (dVLMs) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct extensive experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
中文标题/摘要
标题:DiffusionVL:将任何自回归模型转化为扩散视觉语言模型
在最近的多模态研究中,扩散范式因其独特的解码优势,已成为自回归范式(AR)的有前途的替代方案。然而,由于基础扩散语言模型能力的限制,扩散视觉语言模型(dVLM)的性能仍然远远落后于主流模型。这引发了一个简单而基本的问题:是否可以基于现有的强大自回归模型构建dVLM?为此,我们提出了DiffusionVL,这是一个可以从任何强大自回归模型转换而来的dVLM家族。通过简单的微调,我们成功地将自回归预训练模型适应到扩散范式中。这种方法产生了两个关键观察结果:(1)从基于自回归的多模态模型到扩散的范式转变非常有效。(2)直接将自回归语言模型转换为dVLM也是可行的,其性能与LLaVA风格的视觉指令调优相当。此外,我们引入了一种块解码设计到dVLM中,支持任意长度的生成和KV缓存重用,实现了显著的推理速度提升。我们进行了大量的实验。尽管使用了比先前方法少于5%的数据进行训练,DiffusionVL在MMMU-Pro(视觉)基准上的综合性能提高了34.4%,在MME(认知)基准上的性能提高了37.5%,同时实现了2倍的推理速度提升。模型和代码发布在https://github.com/hustvl/DiffusionVL。
Summary / 总结
DiffusionVL translates existing powerful autoregressive models into diffusion vision language models (dVLMs) through simple fine-tuning, demonstrating that the paradigm shift from autoregressive to diffusion models is highly effective. Key findings include a 34.4% improvement on the MMMU-Pro (vision) benchmark and a 37.5% improvement on the MME (Cog.) benchmark, alongside a 2x increase in inference speed. This approach requires less than 5% of the data needed by previous methods.
DiffusionVL 是一种可以从现有强大的自回归(AR)模型转换而来的扩散视觉语言模型(dVLM)家族,通过简单的微调实现。这种方法显著提高了 dVLM 的性能,在 MMMU-Pro(视觉)基准上取得了 34.4% 的提升,在 MME(认知)基准上取得了 37.5% 的提升,同时将推理速度提高了两倍。从 AR 基模到扩散的范式转变效果显著,并引入了一种块解码设计,以支持任意长度的生成和 KV 缓存重用。
VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression
Authors: Kyle Sargent, Ruiqi Gao, Philipp Henzler, Charles Herrmann, Aleksander Holynski, Li Fei-Fei, Jiajun Wu, Jason Zhang
First: 2025-12-17T18:52:55+00:00 · Latest: 2025-12-17T18:52:55+00:00
Comments: 14 pages, 8 figures
Abstract
Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
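As a rough illustration of what "replicating binary 2AFC judgments" means in practice, the sketch below computes the agreement rate between human and VLM choices over a set of image pairs. The function name and the toy labels are hypothetical and are not taken from the paper; each choice is simply 0 or 1, indicating which of the two candidate images was preferred.

```python
def two_afc_agreement(human_choices, vlm_choices):
    """Fraction of 2AFC trials where the VLM picks the same image as the human.

    Each entry is 0 or 1 (which of the two candidates was preferred).
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(human_choices) != len(vlm_choices):
        raise ValueError("choice lists must have the same length")
    if not human_choices:
        raise ValueError("at least one trial is required")
    matches = sum(h == v for h, v in zip(human_choices, vlm_choices))
    return matches / len(human_choices)


# Toy data: five image pairs, the VLM disagrees with the human on one of them.
human = [0, 1, 1, 0, 1]
vlm = [0, 1, 0, 0, 1]
agreement = two_afc_agreement(human, vlm)
```

Binary labels of this kind are also the form of feedback that preference-based post-training consumes, which is why a zero-shot judge with high agreement could plausibly stand in for human raters.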
中文标题/摘要
标题:VLIC:视觉-语言模型作为人类对齐的图像压缩感知裁判
包含人类偏好的图像压缩性能评估通常发现,诸如均方误差(MSE)之类的简单失真函数不足以与人类感知对齐。为了使压缩模型与人类感知对齐,先前的工作使用了基于大规模人类心理视觉判断数据集校准的可微分感知损失,由神经网络组成。我们展示了令人惊讶的是,最先进的视觉-语言模型(VLMs)可以在被要求对两幅图像之间的差异进行推理时,零样本地复制二选一强迫选择(2AFC)的人类判断。受利用VLMs强大的零样本视觉推理能力的启发,我们提出了视觉-语言模型图像压缩系统(VLIC),这是一种基于扩散的图像压缩系统,设计为后训练与二元VLM判断相结合。VLIC 利用现有的扩散模型后训练技术,而不是将VLM判断提炼为一个单独的感知损失网络。我们展示了在VLM判断上校准该系统在感知度量和大规模用户研究中产生了竞争力或最先进的性能,取决于数据集。我们还进行了VLM为基础的奖励设计和训练过程的广泛分析,并分享了重要的见解。更多视觉内容可在 https://kylesargent.github.io/vlic 获取
Summary / 总结
The research aims to improve the alignment of image compression performance with human perception by utilizing vision-language models (VLMs). The method involves training a diffusion-based image compression system (VLIC) using binary judgments from VLMs, without distilling these judgments into a separate perceptual loss network. The key findings show that VLIC achieves competitive or state-of-the-art performance on human-aligned visual compression, as evaluated by perceptual metrics and user studies.
研究旨在通过使图像压缩与人类感知相匹配来改进图像压缩,而传统的失真度量无法做到这一点。方法是利用最先进的视觉-语言模型(VLMs)对图像对进行零样本判断,然后用于训练基于扩散的图像压缩系统(VLIC)。关键发现表明,VLIC在人类对齐的视觉压缩任务上的表现与现有方法相当甚至更好,这通过感知度量和大规模用户研究得到了验证。
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Authors: Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
First: 2025-12-17T17:58:35+00:00 · Latest: 2025-12-17T17:58:35+00:00
Abstract
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.
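The token compression ratio mentioned above can be made concrete with a small arithmetic sketch. It assumes the ratio is defined as text tokens divided by the vision tokens needed to represent the same content once rendered as images. The helper name and the example token counts are hypothetical, and only the 3x-20x range comes from the abstract.

```python
def compression_ratio(text_tokens, vision_tokens):
    """Ratio of original text tokens to vision tokens after rendering.

    Assumed definition for illustration: ratio = text_tokens / vision_tokens.
    """
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Hypothetical example: a 12,000-token document rendered into pages that
# the vision encoder turns into 1,200 tokens.
ratio = compression_ratio(12_000, 1_200)
in_reported_range = 3 <= ratio <= 20  # the abstract cites ratios of 3x-20x
```

A higher ratio means each vision token carries more text, which is the information density whose effect on long-context understanding the benchmark sets out to measure.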
中文标题/摘要
标题:VTCBench:视觉语言模型能否通过视觉文本压缩理解长上下文?
LLM扩展上下文窗口相关的计算和内存开销严重限制了其可扩展性。值得注意的解决方案是视觉文本压缩(VTC),如DeepSeek-OCR和Glyph等框架,将长文本转换为密集的二维视觉表示,从而实现3倍至20倍的标记压缩比。然而,这种高信息密度对视觉语言模型(VLM)的核心长上下文能力的影响仍缺乏研究。为填补这一空白,我们首次引入了VTC基准,并系统评估了VLM在三种长上下文理解设置中的性能:VTC-Retrieval,评估模型检索和聚合信息的能力;VTC-Reasoning,要求模型通过最小的词汇重叠来推断潜在关联以定位事实;VTC-Memory,衡量模型在长期对话记忆中进行综合问答的能力。此外,我们建立了VTCBench-Wild以模拟多样化的输入场景。我们在基准上全面评估了领先开源和专有模型。结果表明,尽管大多数VLM能够很好地解码文本信息(如OCR),但在使用VTC压缩信息时,它们在长上下文理解方面表现出令人惊讶的差劲能力,无法捕捉上下文中的长期关联或依赖关系。本研究为理解VTC提供了深入的见解,并为设计更高效和可扩展的VLM奠定了基础。
Summary / 总结
The paper introduces VTCBench, a benchmark to evaluate the long-context understanding capabilities of vision-language models (VLMs) using vision-text compression (VTC). It assesses models in three settings: VTC-Retrieval, VTC-Reasoning, and VTC-Memory. The study finds that most VLMs struggle to understand long contexts when information is compressed visually, indicating a need for improved VTC integration in VLMs.
该研究引入了VTCBench基准,用于评估使用视觉文本压缩(VTC)的视觉语言模型(VLM)的长上下文理解能力。它在VTC检索、VTC推理和VTC记忆三个场景下评估模型,并发现大多数VLM在处理压缩的视觉文本信息时难以理解长上下文,表明需要改进VLM的设计。研究突显了当前VLM在处理压缩视觉文本数据方面的局限性,并为未来研究提供了基础。
If you can describe it, they can see it: Cross-Modal Learning of Visual Concepts from Textual Descriptions
Authors: Carlo Alberto Barbano, Luca Molinaro, Massimiliano Ciranni, Emanuele Aiello, Vito Paolo Pastore, Marco Grangetto
First: 2024-11-23T17:26:50+00:00 · Latest: 2025-12-17T13:10:40+00:00
Comments: 27 pages. Under review
Abstract
Humans can visualize new and unknown concepts from their natural language description, based on their experience and previous knowledge. Inspired by this, we present a way to extend this ability to Vision-Language Models (VLMs), teaching them novel concepts by only using a textual description. We refer to this approach as Knowledge Transfer (KT). Our hypothesis is that the knowledge of a pre-trained VLM can be re-used to represent previously unknown concepts. Provided with a textual description of the novel concept, KT works by aligning relevant features of the visual encoder, obtained through model inversion, to its text representation. Unlike approaches relying on visual examples or external generative models, KT transfers knowledge within the same VLM by injecting visual knowledge directly from the text. Through an extensive evaluation on several VLM tasks, including classification, segmentation, image-text retrieval, and captioning, we show that: 1) KT can efficiently introduce new visual concepts from a single textual description; 2) the same principle can be used to refine the representation of existing concepts; and 3) KT significantly improves the performance of zero-shot VLMs.
中文标题/摘要
标题:如果你能描述它,他们就能看到它:基于文本描述的视觉概念跨模态学习
人类可以根据自然语言描述和自身经验及先前知识,想象新的和未知的概念。受此启发,我们提出了一种方法,将这种能力扩展到视觉语言模型(VLMs),仅通过文本描述来教授它们新的概念。我们称这种方法为知识迁移(KT)。我们的假设是,预训练的VLM的知识可以被重新利用来表示之前未知的概念。通过提供新概念的文本描述,KT通过模型反转获得视觉编码的相关特征,并将其对齐到其文本表示。不同于依赖视觉示例或外部生成模型的方法,KT在同一个VLM内部通过直接从文本注入视觉知识来进行知识迁移。通过在分类、分割、图像-文本检索和描述生成等多个VLM任务上的广泛评估,我们展示了:1) KT可以从单个文本描述高效地引入新的视觉概念;2) 同一原理可以用于细化现有概念的表示;3) KT显著提高了零样本VLMs的性能。
Summary / 总结
The research aims to enable Vision-Language Models (VLMs) to learn new visual concepts from textual descriptions, inspired by human ability to visualize unknown concepts. The method, referred to as Knowledge Transfer (KT), aligns visual features with text representations to introduce new concepts without visual examples. Experiments show that KT can efficiently teach VLMs new concepts from a single description, refine existing concepts, and significantly enhance zero-shot performance across various tasks such as classification, segmentation, image-text retrieval, and captioning.
研究旨在使Vision-Language模型能够根据文本描述理解和可视化新概念,灵感来源于人类的能力。方法是知识转移(KT),它通过模型反转将视觉特征与新概念的文本表示对齐。实验表明,KT可以从单个描述引入新视觉概念,改进现有概念的表示,并在分类、分割、图像-文本检索和生成等VLM任务中提高零样本性能。
Prompt-Based Continual Compositional Zero-Shot Learning
Authors: Sauda Maryam, Sara Nadeem, Faisal Qureshi, Mohsen Ali
First: 2025-12-09T22:36:31+00:00 · Latest: 2025-12-17T12:41:30+00:00
Abstract
We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning where classes are disjoint, continual CZSL (CCZSL) is more complex, as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization. Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
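The abstract names the losses but does not give their formulas, so the following is only a guess at their general shape, not the authors' definitions. It assumes the Cosine Anchor Loss penalizes drift of a current embedding away from its stored anchor as 1 minus cosine similarity, and that the orthogonality term penalizes squared cosine similarity between new and previous embeddings. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cosine_anchor_loss(current, anchor):
    """Assumed form: 1 - cos(current, anchor); zero when the embedding has not drifted."""
    return 1.0 - cosine(current, anchor)


def orthogonality_penalty(new_embs, old_embs):
    """Assumed form: mean squared cosine similarity over all (new, old) pairs.

    Zero when every new embedding is orthogonal to every previous one.
    """
    pairs = [(n, o) for n in new_embs for o in old_embs]
    if not pairs:
        return 0.0
    return sum(cosine(n, o) ** 2 for n, o in pairs) / len(pairs)


# Toy 2-D embeddings: an unchanged anchor and one new embedding orthogonal to the old one.
anchor_term = cosine_anchor_loss([1.0, 0.0], [1.0, 0.0])
overlap_term = orthogonality_penalty([[0.0, 1.0]], [[1.0, 0.0]])
```

Under these assumptions the two terms pull in complementary directions: the anchor term keeps recurring attributes and objects where they were, while the orthogonality term pushes genuinely new ones away from the old ones.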
中文标题/摘要
标题:基于提示的持续组合零样本学习
我们针对视觉-语言模型在Compositional Zero-Shot Learning (CZSL) 中对新属性、对象及其组合的持续适应问题,同时防止遗忘先前的知识。不同于传统持续学习中类别的互斥,CCZSL 更加复杂,因为属性和对象可能在不同会话中重复出现,而组合则保持唯一性。基于冻结的VLM主干,我们提出了第一个基于提示的持续组合零样本学习(PromptCCZSL)框架,通过最近性加权多教师蒸馏保留先前知识。该框架使用会话感知的组合提示融合多模态特征以生成新的组合,而属性和对象提示通过会话无关的融合学习以保持全局语义一致性,进一步通过余弦锚点损失(CAL)稳定以保留先前知识。为了增强当前会话的适应性,正交投影损失(OPL)确保新属性和对象嵌入与先前的嵌入保持独特性,防止重叠,而会话内多样性损失(IDL)促进当前会话嵌入之间的变化,以获得更丰富、更具区分性的表示。我们还引入了一个综合协议,联合衡量灾难性遗忘和组合泛化。在UT-Zappos和C-GQA基准上的广泛实验表明,PromptCCZSL 在持续组合零样本学习中显著优于基于VLM和非VLM的基线方法,为闭世界设置中的CCZSL 设定了新的基准。
Summary / 总结
The research aims to address the challenge of continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL) without forgetting previous knowledge. The proposed Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework uses a frozen VLM backbone and recency-weighted multi-teacher distillation to retain prior knowledge. It employs session-aware compositional prompts to fuse multimodal features and session-agnostic attribute and object prompts to maintain global semantic consistency, stabilized by a Cosine Anchor Loss. Additionally, the framework includes an Orthogonal Projection Loss to prevent overlap with previous embeddings and an Intra-Session Diversity Loss to promote richer representations. Experiments on UT-Zappos and C-GQA benchmarks show significant improvements over existing baselines, setting a new benchmark for CCZSL in closed-world settings.
研究旨在解决视觉-语言模型在适应新属性、对象及其组合时不断学习并防止遗忘先前知识的挑战。提出的PromptCCZSL框架使用冻结的VLM主干,并采用近期加权多教师蒸馏来保留先前知识。它引入了针对新组合的会话感知组成提示和针对属性和对象的会话无关提示以保持全局语义一致性。此外,还包括正交投影损失和会话内多样性损失以增强适应并防止重叠。在UT-Zappos和C-GQA基准上的实验显示,该方法在闭世界设置中的CCZSL基准上取得了显著改进,超过了现有方法。
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Authors: Mikel Williams-Lekuona, Georgina Cosma
First: 2025-12-17T12:19:54+00:00 · Latest: 2025-12-17T12:19:54+00:00
Comments: Accepted paper for ECIR 2026
Abstract
Vision transformers in vision-language models apply uniform computational effort across all images, expending 175.33 GFLOPs (ViT-L/14) whether analysing a straightforward product photograph or a complex street scene. We propose ICAR (Image Complexity-Aware Retrieval), which enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both reduced-compute and full-compute processing. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance with 0.959 correlation with human judgement (Pearson) and 4.4x speedup. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% practical speedup while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
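The routing idea can be sketched as follows: a complexity score decides whether an image takes a reduced-compute path or the full network, and the mean cost falls as more images exit early. Only the 175.33 GFLOPs full-depth figure comes from the abstract. The threshold, the early-exit cost, and the function names are hypothetical placeholders, and the real system predicts complexity with ConvNeXt-IC rather than receiving scores directly.

```python
FULL_GFLOPS = 175.33  # full-depth ViT-L/14 cost quoted in the abstract


def choose_path(complexity_score, threshold=0.5):
    """Route simple images to the early-exit path (threshold is a hypothetical value)."""
    return "early_exit" if complexity_score < threshold else "full"


def mean_gflops(complexity_scores, early_gflops, threshold=0.5):
    """Average per-image cost given an assumed early-exit cost in GFLOPs."""
    if not complexity_scores:
        raise ValueError("at least one image is required")
    costs = [
        early_gflops if choose_path(s, threshold) == "early_exit" else FULL_GFLOPS
        for s in complexity_scores
    ]
    return sum(costs) / len(costs)


# Toy batch: one simple image and one complex image, with an assumed
# early-exit cost of 100 GFLOPs (not a number from the paper).
average = mean_gflops([0.2, 0.9], early_gflops=100.0)
```

The sketch covers only the compute side. The harder part described in the abstract is keeping embeddings from both paths compatible with the text embeddings, which is what the dual-path training is for.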
中文标题/摘要
标题:基于图像复杂性的自适应检索以提高视觉语言模型的效率
视觉语言模型中的视觉变换器在所有图像上均匀分配计算努力,无论是分析简单的商品照片还是复杂的街道场景,都会消耗175.33 GFLOPs(ViT-L/14)。我们提出了ICAR(基于图像复杂性的自适应检索),使视觉变换器能够为简单图像使用较少的计算资源,而复杂图像则通过其全网络深度处理。关键挑战是保持跨模态对齐:不同处理深度的嵌入必须保持兼容以进行文本匹配。ICAR 通过双路径训练解决这一问题,产生来自低计算和全计算处理的兼容嵌入。这在图像表示和文本嵌入在同一语义空间中保持兼容性方面,无论图像是否提前退出或完全处理。与现有需要昂贵重排序的两阶段方法不同,ICAR 允许直接进行图像-文本匹配而无需额外开销。为了确定使用多少计算资源,我们开发了ConvNeXt-IC,将其视为分类任务。通过应用现代分类器骨干而非专门架构,ConvNeXt-IC 达到了最先进的性能,与人类判断的相关性为0.959(皮尔逊),速度提高了4.4倍。在标准基准上增加了真实世界网络数据进行评估,ICAR 实现了20%的实际加速,同时保持类别级性能和95%的实例级性能,使视觉语言系统的可持续扩展成为可能。
Summary / 总结
The paper proposes ICAR (Image Complexity-Aware Retrieval) to enable vision transformers to use less compute for simple images while processing complex images fully. It addresses the challenge of maintaining cross-modal alignment through dual-path training, allowing direct image-text matching without reranking. ICAR achieves a 20% practical speedup on standard benchmarks while maintaining category-level performance and 95% of instance-level performance. ConvNeXt-IC, the image complexity assessment model that decides how much compute to use, reaches a 0.959 Pearson correlation with human judgement and a 4.4x speedup. This method enables sustainable scaling of vision-language systems.
论文通过提出ICAR(图像复杂性感知检索)解决了视觉-语言模型中均匀计算效率的问题,使得视觉变换器能够对简单图像使用较少计算,而对复杂图像进行全网络深度处理。ICAR通过双路径训练来保持跨模态对齐,确保图像表示和文本嵌入在相同的语义空间中保持兼容性。评估结果显示,ICAR在标准基准测试上实现了20%的实际加速,同时保持类别级性能和95%的实例级性能,而ConvNeXt-IC在图像复杂性评估中达到了最先进的性能,具有4.4倍的加速和0.959的相关性与人类判断一致。
SynthSeg-Agents: Multi-Agent Synthetic Data Generation for Zero-Shot Weakly Supervised Semantic Segmentation
Authors: Wangyu Wu, Zhenhong Chen, Xiaowei Huang, Fei Ma, Jimin Xiao
First: 2025-12-17T10:58:38+00:00 · Latest: 2025-12-17T10:58:38+00:00
Abstract
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels aims to produce pixel-level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real-world training samples. In this paper, we introduce a novel direction, Zero-Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi-agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP-based similarity and nearest-neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high-quality samples, and a ViT-based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high-quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM-driven agents in enabling cost-efficient and scalable semantic segmentation.
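One way to picture similarity scoring combined with nearest-neighbor diversity filtering is the greedy filter below. It is a sketch under stated assumptions rather than the paper's procedure: candidates are assumed to arrive with a precomputed relevance score and an embedding (for example from CLIP), and the thresholds `min_score` and `max_sim` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_prompts(candidates, min_score=0.3, max_sim=0.95):
    """Greedy quality-plus-diversity filter over (name, score, embedding) tuples.

    Keeps candidates in descending score order, skipping any that score below
    min_score or that are near-duplicates (cosine >= max_sim) of a kept one.
    """
    kept = []
    for name, score, emb in sorted(candidates, key=lambda c: c[1], reverse=True):
        if score < min_score:
            continue  # too weakly related to the target class
        if all(cosine(emb, k[2]) < max_sim for k in kept):
            kept.append((name, score, emb))
    return [name for name, _, _ in kept]


# Toy candidates with 2-D stand-in embeddings.
candidates = [
    ("a", 0.9, [1.0, 0.0]),
    ("b", 0.8, [1.0, 0.01]),  # near-duplicate of "a"
    ("c", 0.7, [0.0, 1.0]),
    ("d", 0.1, [0.5, 0.5]),   # low relevance score
]
selected = filter_prompts(candidates)
```

In this toy run the near-duplicate and the low-scoring candidate are dropped, leaving two prompts that are both relevant and distinct from each other.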
中文标题/摘要
标题:SynthSeg-代理:零样本弱监督语义分割的多代理合成数据生成
基于图像级别标签的弱监督语义分割(WSSS)旨在无需密集注释的情况下生成像素级预测。尽管最近的方法利用生成模型来扩充现有数据,但它们仍然依赖于现实世界的训练样本。在本文中,我们提出了一种新的方向——零样本弱监督语义分割(ZSWSSS),并提出了一种由大型语言模型(LLMs)驱动的多代理框架SynthSeg Agents,以完全不使用真实图像的方式生成合成训练数据。SynthSeg Agents 包含两个关键模块,一个自我精炼提示代理和一个图像生成代理。自我精炼提示代理通过迭代精炼、记忆机制和提示空间探索,基于CLIP相似性和最近邻多样性过滤,自主构建多样且语义丰富的图像提示。这些提示随后传递给图像生成代理,该代理利用视觉语言模型(VLMs)生成候选图像。使用冻结的CLIP评分模型选择高质量样本,并进一步训练基于ViT的分类器以提高整个合成数据集的语义精度。我们的框架在没有任何真实图像监督的情况下生成高质量的训练数据。在PASCAL VOC 2012和COCO 2014上的实验表明,SynthSeg Agents 在不使用真实训练图像的情况下实现了具有竞争力的性能。这突显了LLM驱动代理在实现成本效益和可扩展语义分割方面的潜力。
Summary / 总结
The paper introduces SynthSeg Agents, a multi-agent framework for generating synthetic training data for weakly supervised semantic segmentation without relying on real images. It consists of a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent creates diverse image prompts through iterative refinement and similarity filtering, while the Image Generation Agent uses Vision Language Models to synthesize candidate images. The framework produces high-quality synthetic data that achieves competitive performance on PASCAL VOC 2012 and COCO 2014 without using real training images, demonstrating the potential of LLM-driven agents for cost-efficient and scalable semantic segmentation.
该论文提出了SynthSeg Agents,这是一种多代理框架,用于生成无需依赖真实图像的弱监督语义分割训练数据。它由一个Self Refine Prompt Agent和一个Image Generation Agent组成。Self Refine Prompt Agent通过迭代精炼和提示空间探索创建多样化的图像提示,而Image Generation Agent则使用视觉语言模型生成候选图像。该框架生成高质量的合成数据,在PASCAL VOC 2012和COCO 2014上实现了与使用真实训练图像相当的性能,展示了LLM驱动代理在实现成本效益和可扩展语义分割方面的潜力。
Chain-of-Evidence Multimodal Reasoning for Few-shot Temporal Action Localization
Authors: Mengshi Qi, Hongwei Ji, Wulian Yun, Xianlin Zhang, Huadong Ma
First: 2025-04-18T04:35:35+00:00 · Latest: 2025-12-17T10:11:29+00:00
Abstract
Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the action localization task. To address these issues, in this work, we propose a new few-shot temporal action localization method by Chain-of-Evidence multimodal reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level, we design a Chain-of-Evidence (CoE) reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoE text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3, THUMOS14 and our newly collected Human-related Anomaly Localization Dataset. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. Our source code and data are available at https://github.com/MICLAB-BUPT/VAL-VLM.
中文标题/摘要
标题:证据链多模态推理在少样本时间动作定位中的应用
传统的时序动作定位(TAL)方法依赖大量详细的标注数据,而少样本TAL通过仅使用少量训练样本来识别未见过的动作类别,从而减少了对标注数据的依赖。然而,现有的少样本TAL方法通常仅关注视频级别的信息,忽略了文本信息,而文本信息可以为动作定位任务提供有价值的语义支持。为了解决这些问题,本文提出了一种新的基于证据链多模态推理的少样本时间动作定位方法,以提高定位性能。具体而言,我们设计了一种新颖的少样本学习框架来捕捉动作的共性和变异性,其中包括一种语义感知的文本-视觉对齐模块,用于在不同层次上对查询和支撑视频进行对齐。同时,为了更好地表达文本层面动作之间的时序依赖性和因果关系,我们设计了一种证据链(CoE)推理方法,逐步引导视觉语言模型(VLM)和大型语言模型(LLM)生成视频的CoE文本描述。生成的文本可以比视觉特征捕捉到更多的动作变化。我们在公开的ActivityNet1.3、THUMOS14以及我们新收集的人类相关异常定位数据集上进行了广泛的实验。实验结果表明,我们提出的方法在单实例和多实例场景中显著优于现有方法。我们的源代码和数据可在https://github.com/MICLAB-BUPT/VAL-VLM获取。
Summary / 总结
This paper proposes a Chain-of-Evidence multimodal reasoning method for few-shot temporal action localization, addressing the limitations of existing methods by incorporating textual information. The method uses a semantic-aware text-visual alignment module and a Chain-of-Evidence reasoning approach to improve localization performance. Experiments on ActivityNet1.3, THUMOS14, and a new dataset show that the proposed method outperforms existing methods in both single-instance and multi-instance scenarios.
本文提出了一种基于Chain-of-Evidence多模态推理的方法来解决少样本时序动作定位的问题。该方法引入了一个新的少样本学习框架,能够捕捉动作的共性和差异,并结合了一个语义感知的文本-视觉对齐模块。此外,该方法还采用了一种Chain-of-Evidence推理方法生成描述文本,这些文本能够捕捉到比视觉特征更多的动作变化。在ActivityNet1.3、THUMOS14和一个新收集的数据集上的实验表明,所提出的方法在单实例和多实例场景中均优于现有方法。
Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models
Authors: Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin
First: 2025-12-17T09:56:25+00:00 · Latest: 2025-12-17T09:56:25+00:00
Abstract
Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.
中文标题/摘要
标题:评估专门计数架构和视觉语言模型的视觉计数能力
在计算机视觉中,识别视觉场景中的物品数量仍然是一个基本但具有挑战性的任务。传统的解决方法依赖于特定领域的计数架构,这些架构通过带有预定义对象类别的数据集进行训练。然而,最近在大规模多模态视觉语言模型(VLMs)方面的进展表明,这些通用架构可能为开放集对象计数提供一种灵活的替代方案。因此,在这项研究中,我们系统地将最先进的专门计数架构与VLMs在两个流行的计数数据集以及一个专门为控制测试图像的视觉属性而创建的新基准上进行了比较。我们的研究结果表明,大多数VLMs可以大致估计视觉场景中的物品数量,其性能与专门的计算机视觉架构相当甚至更优。值得注意的是,当VLMs被提示生成每个要计数对象的中间表示(即位置和口头标签)时,计数准确性显著提高。然而,没有任何模型能够可靠地在复杂的视觉场景中计数物品,这表明仍需进一步研究以创建能够在现实环境中可靠执行计数程序的AI系统。
Summary / 总结
This study evaluates the visual enumeration capabilities of specialized counting architectures and vision-language models by comparing their performance on two popular counting datasets and a new benchmark. The research finds that vision-language models can roughly count the number of items in a scene, often matching or outperforming specialized counting architectures. Prompting these models to generate intermediate representations of objects improves their accuracy. However, none of the models can reliably count objects in complex scenes, indicating the need for further research to enhance their performance in real-world settings.
本研究评估了专门的计数架构和视觉语言模型在视觉计数能力上的表现。研究在两个标准计数数据集和一个新的基准上进行了比较。研究结果表明,视觉语言模型可以在图像中大致计数物体,通常与专门的计数架构相当甚至超越。提示视觉语言模型生成中间表示可以提高其准确性。然而,没有任何模型能够在复杂场景中可靠地计数物体,表明还需要进一步研究以实现在现实环境中的应用。
Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification
Authors: Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim
First: 2025-12-17T09:47:29+00:00 · Latest: 2025-12-17T09:47:29+00:00
Abstract
Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLMs), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $\Delta$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $\Delta$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
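To make the missed diagnosis gap concrete, the sketch below computes a per-subgroup True Positive Rate and reports the spread between the best-served and worst-served subgroups. The abstract only describes the metric as a "difference in True Positive Rate", so taking the maximum minus the minimum across intersectional subgroups is an assumption here. The function names and toy data are hypothetical.

```python
from collections import defaultdict


def tpr_by_group(y_true, y_pred, groups):
    """True Positive Rate per subgroup; groups without positive cases are omitted."""
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                true_pos[group] += 1
    return {g: true_pos[g] / positives[g] for g in positives}


def delta_tpr(y_true, y_pred, groups):
    """Assumed gap definition: max subgroup TPR minus min subgroup TPR."""
    rates = tpr_by_group(y_true, y_pred, groups)
    if not rates:
        raise ValueError("no positive cases in any subgroup")
    return max(rates.values()) - min(rates.values())


# Toy data: intersectional subgroups encoded as (age band, gender) tuples.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
groups = [("<60", "F"), ("<60", "F"), ("60+", "M"), ("60+", "M"), ("<60", "F")]
gap = delta_tpr(y_true, y_pred, groups)
```

A gap of 0 would mean positive cases are detected at the same rate in every subgroup. The reported reductions (0.50 to 0.26, and 0.41 to 0.31) are movements toward that point.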
中文标题/摘要
标题:视觉语言模型在医学图像疾病分类中的交叉公平性
医学人工智能(AI)系统,尤其是多模态视觉语言模型(VLM),常常表现出交叉偏见,模型在诊断边缘化患者亚组时系统性地缺乏信心。这种偏见可能导致由于样本数据的种族分布偏差和诊断确定性分布差异而出现更高的误诊和漏诊率。当前的公平性干预措施往往未能解决这些差距,或者在实现各亚组统计平等的同时牺牲整体诊断性能。在本研究中,我们开发了跨模态一致性匹配(CMAC-MMD)训练框架,以标准化交叉公平性患者亚组的诊断确定性。与传统的去偏见方法不同,该方法在临床推理过程中不需要敏感的种族数据即可使模型的决策信心相等。我们使用10,015张皮肤病变图像(HAM10000)进行了评估,并通过12,000张图像(BCN20000)和10,000张用于青光眼检测的视网膜图像(Harvard-FairVLMed)进行了外部验证,按交叉公平性年龄、性别和种族属性分层评估性能。在皮肤科队列中,所提出的方法将总体交叉公平性漏诊差距(真实阳性率差异,ΔTPR)从0.50降低到0.26,同时将总体曲线下面积(AUC)从0.94提高到0.97,优于标准训练。同样,在青光眼筛查中,该方法将ΔTPR从0.41降低到0.31,实现了更好的AUC(0.72,基线为0.71)。这建立了一个可扩展的框架,用于开发既准确又能在不同患者亚组中公平执行的高风险临床决策支持系统,确保可靠性能而不增加隐私风险。
Summary / 总结
This study addresses intersectional biases in medical AI systems, particularly in vision-language models used for disease classification. It introduces Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that equalizes diagnostic certainty across different patient subgroups without needing sensitive demographic data. Evaluations on skin lesion and fundus images showed that the proposed method reduced the missed diagnosis gap and improved overall diagnostic performance, achieving better Area Under the Curve (AUC) scores compared to standard training methods.
该研究针对医疗AI系统中的交集偏见问题,特别是视觉-语言模型,这些问题可能导致更高的误诊和漏诊率。作者开发了跨模态一致性对齐(CMAC-MMD)训练框架,该框架能够在不需要敏感人口统计数据的情况下,使不同患者亚组的诊断置信度标准化。在皮肤病变和视网膜图像的评估中,所提出的方法减少了漏诊差距并提高了整体诊断性能,实现了更好的曲线下面积(AUC)分数,分别在皮肤科和青光眼筛查任务中取得了更好的结果。
EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence
Authors: Jiaxu Wan, Xu Wang, Mengwei Xie, Hang Zhang, Mu Xu, Yang Han, Hong Zhang, Ding Yuan, Yifan Yang
First: 2025-12-17T07:51:36+00:00 · Latest: 2025-12-17T07:51:36+00:00
Comments: 13 pages, 7 figures, 6 tables
Abstract
Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
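The micro verification step, predicting a pose on the BEV plane and retrieving the nearest real frame, can be pictured with the minimal sketch below. It assumes each frame has a known BEV pose `(x, y, yaw)` and that "nearest" combines planar distance with a weighted heading difference. That cost, the `yaw_weight` parameter, and the function names are illustrative assumptions, not the paper's formulation.

```python
import math


def angle_diff(a, b):
    """Absolute difference between two angles in radians, wrapped to [0, pi]."""
    d = (a - b + math.pi) % (2 * math.pi) - math.pi
    return abs(d)


def nearest_frame(pred_pose, frame_poses, yaw_weight=1.0):
    """Index of the frame whose BEV pose (x, y, yaw) is closest to pred_pose.

    Assumed cost: planar Euclidean distance + yaw_weight * heading difference.
    """
    if not frame_poses:
        raise ValueError("frame_poses must not be empty")

    def cost(pose):
        planar = math.hypot(pose[0] - pred_pose[0], pose[1] - pred_pose[1])
        return planar + yaw_weight * angle_diff(pose[2], pred_pose[2])

    return min(range(len(frame_poses)), key=lambda i: cost(frame_poses[i]))


# Toy example: three keyframes; the prediction sits near frame 1 and faces
# roughly the same way, while frame 2 is at the same spot but faces backwards.
frames = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, math.pi)]
best = nearest_frame((0.9, 0.1, 0.1), frames)
```

Including heading in the cost matters because two frames taken from the same position can show entirely different views, so a position-only lookup could return evidence that does not match the hypothesis being verified.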
中文标题/摘要
标题:EagleVision:基于BEV定位的链式思考双阶段框架以增强空间智能
近期的空间智能方法通常将3D线索附加到2D推理管道中,或结合MLLMs与黑盒重建模块,导致空间一致性较弱、视角多样性有限以及无法追溯到支持视图的证据链。“以图像思考”的框架(例如ChatGPT-o3和DeepEyes)表明,通过交替进行假设形成和主动获取视觉证据,可以产生逐步多模态推理,但它们未能解决空间链式思考(CoT)中的三个关键挑战:在严格的标记预算下构建全局空间感知、明确将3D假设与视频帧关联以进行验证,以及为强化学习设计空间定位奖励。为解决这些问题,我们提出了EagleVision,这是一种通过宏观感知和微观验证逐步增强空间认知的双阶段框架。在宏观感知阶段,EagleVision 使用语义视角融合行列式点过程(SPF-DPP),在固定标记预算下从长视频中选择一组紧凑的、兼顾几何与语义的关键帧。在微观验证阶段,我们将空间CoT形式化为基于BEV定位的姿态查询:智能体迭代地在BEV平面上预测姿态,检索最近的真实帧,并完全通过强化学习进行训练,其空间定位奖励对预测姿态与观察视图之间的一致性进行评分。在VSI-Bench上,EagleVision 在开源视觉语言模型中达到了最先进的性能,展示了强大且可泛化的空间理解能力。
Summary / 总结
EagleVision is a dual-stage framework designed to enhance spatial intelligence by addressing key challenges in spatial Chain-of-Thought (CoT). It uses a semantics-perspective-fusion determinantal point process for macro perception, selecting keyframes from long videos under a fixed token budget. For micro verification, it formulates spatial CoT as BEV-grounded pose querying, iteratively predicting poses and retrieving real frames, with training based on reinforcement learning and spatial grounding rewards. On VSI-Bench, EagleVision outperforms other open-source vision-language models, showing strong and generalizable spatial understanding.
EagleVision 是一种双阶段框架,旨在通过解决空间链式思考(CoT)中的关键问题来增强空间智能。它在宏观感知阶段使用语义视角融合行列式点过程,在固定标记预算下选择关键帧;在微观验证阶段将空间 CoT 形式化为基于BEV定位的姿态查询,并通过强化学习和空间定位奖励进行训练。该框架在 VSI-Bench 上优于其他开源视觉-语言模型,展示了强大且可泛化的空间理解能力。
The Devil is in Attention Sharing: Improving Complex Non-rigid Image Editing Faithfulness via Attention Synergy
Authors: Zhuo Chen, Fanyue Wei, Runze Xu, Jingjing Li, Lixin Duan, Angela Yao, Wen Li
First: 2025-12-16T14:08:00+00:00 · Latest: 2025-12-17T06:53:51+00:00
Comments: Project page: https://synps26.github.io/
Abstract
Training-free image editing with large diffusion models has become practical, yet faithfully performing complex non-rigid edits (e.g., pose or shape changes) remains highly challenging. We identify a key underlying cause: attention collapse in existing attention sharing mechanisms, where either positional embeddings or semantic features dominate visual content retrieval, leading to over-editing or under-editing. To address this issue, we introduce SynPS, a method that Synergistically leverages Positional embeddings and Semantic information for faithful non-rigid image editing. We first propose an editing measurement that quantifies the required editing magnitude at each denoising step. Based on this measurement, we design an attention synergy pipeline that dynamically modulates the influence of positional embeddings, enabling SynPS to balance semantic modifications and fidelity preservation. By adaptively integrating positional and semantic cues, SynPS effectively avoids both over- and under-editing. Extensive experiments on public and newly curated benchmarks demonstrate the superior performance and faithfulness of our approach.
中文标题/摘要
标题:魔鬼藏在注意力共享中:通过注意力协同提高复杂非刚性图像编辑的忠实度
无需训练的大规模扩散模型使图像编辑变得实用,但准确执行复杂的非刚性编辑(例如姿态或形状变化)仍然极具挑战性。我们发现一个关键原因:现有注意力共享机制中的注意力崩溃,其中位置嵌入或语义特征之一主导视觉内容检索,导致过度编辑或不足编辑。为解决这一问题,我们引入了SynPS方法,该方法协同利用位置嵌入和语义信息进行忠实的非刚性图像编辑。我们首先提出了一种编辑测量方法,量化每个去噪步骤所需的编辑量。基于此测量,我们设计了一种注意力协同管道,动态调节位置嵌入的影响,使SynPS能够平衡语义修改和保真度保留。通过适配性地整合位置和语义线索,SynPS有效地避免了过度编辑和不足编辑。在公共和新收集的基准上的广泛实验表明,我们方法的优越性能和忠实度。
Summary / 总结
The paper addresses the challenge of faithfully performing complex non-rigid image edits with large diffusion models. Existing attention sharing mechanisms often suffer from attention collapse, in which either positional embeddings or semantic features dominate, leading to over- or under-editing. To tackle this, the authors propose SynPS, a method that synergistically combines positional embeddings and semantic information. SynPS introduces an editing measurement that quantifies the required editing magnitude at each denoising step, and an attention synergy pipeline that dynamically modulates the influence of positional embeddings to balance semantic modification and fidelity preservation. Experiments on public and newly curated benchmarks show that SynPS outperforms existing methods in both performance and faithfulness.
该论文解决了使用大型扩散模型进行复杂非刚性图像编辑时的忠实性问题,这些模型常常会遭受注意力坍塌,导致位置嵌入或语义特征之一占据主导地位,从而引起过度编辑或不足编辑。为了解决这一问题,作者提出了SynPS方法,该方法协同结合了位置嵌入和语义信息。SynPS引入了一种编辑测量来量化每个去噪步骤所需的编辑量,并设计了一种注意力协同管道以动态平衡语义修改和保真度保留。实验结果表明,SynPS在各种基准上的表现和忠实性都优于现有方法。
Benchmarking and Mitigating Sycophancy in Medical Vision Language Models
Authors: Zikun Guo, Jingwei Lv, Xinyue Xu, Shu Yang, Jun Wen, Di Wang, Lijie Hu
First: 2025-09-26T07:02:22+00:00 · Latest: 2025-12-17T04:57:17+00:00
Comments: 19 figures, 61 pages
Abstract
Visual language models (VLMs) have the potential to transform medical workflows. However, their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses this gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates showing a correlation to model size or overall accuracy. We discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of visual data. To overcome this, we propose a Visual Information Purification for Evidence based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.
中文标题/摘要
标题:医疗视觉语言模型中的阿谀奉承基准测试与缓解
视觉语言模型(VLMs)有潜力改变医疗工作流程。然而,部署受到阿谀奉承的限制。尽管这对患者安全构成了严重威胁,但系统性的基准测试仍然缺乏。本文通过引入一个医疗基准,该基准在分层医疗视觉问答任务中应用多种模板来解决这一缺口。我们发现当前的VLMs对视觉线索非常敏感,失败率与模型大小或整体准确性呈相关性。我们发现感知权威和用户模仿是强大的触发因素,表明一种独立于视觉数据的偏差机制。为了克服这一问题,我们提出了一种视觉信息净化以支持证据响应(VIPER)策略,该策略主动过滤掉非证据基础的社会线索,从而强化基于证据的推理。VIPER减少了阿谀奉承,同时保持了可解释性,并且始终优于基线方法,为VLMs的稳健和安全集成奠定了必要的基础。
Summary / 总结
This paper aims to address the issue of sycophancy in visual language models (VLMs) for medical applications, which can affect patient safety. The authors introduce a Medical benchmark using hierarchical medical visual question answering tasks and find that current VLMs are highly susceptible to visual cues, with failure rates correlating to model size or accuracy. They propose VIPER, a strategy that filters out non-evidence-based social cues, reducing sycophancy while maintaining interpretability and outperforming baseline methods. This work lays the groundwork for the secure integration of VLMs in medical workflows.
该论文旨在解决视觉语言模型(VLMs)在医疗应用中的奉承问题,这可能影响患者安全。作者通过使用分层医疗视觉问答任务引入了一个医疗基准,并发现当前的VLMs对视觉线索非常敏感,失败率与模型大小或准确性相关。他们提出了VIPER策略,该策略过滤掉非证据基础的社会线索,减少了奉承行为,同时保持了可解释性,并且优于基线方法。这项工作为VLMs在医疗工作流程中的稳健和安全集成奠定了基础。
SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
Authors: Hongbo Wang, MaungMaung AprilPyone, Isao Echizen
Venue: ACL 2026
First: 2025-12-17T03:31:36+00:00 · Latest: 2025-12-17T03:31:36+00:00
Comments: Under Review for ACL 2026
Abstract
Disclaimer: Samples in this paper may be harmful and cause discomfort.
Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
中文标题/摘要
标题:SGM:通过神经元级去毒化为多模态大型语言模型提供安全防护
免责声明:本文中的样本可能有害并引起不适。
多模态大型语言模型(MLLMs)能够实现多模态生成,但会继承来自未严格筛选预训练语料库的有毒、偏见和NSFW信号,导致安全风险,尤其是在对抗性触发下,后期、不透明的无训练去毒化方法难以应对。我们提出SGM,一种白盒的神经元级多模态干预方法,类似于为有毒神经元佩戴的安全眼镜:它通过专家加权软抑制选择性地重新校准一小部分有毒专家神经元,无需任何参数更新即可消除有害的跨模态激活。我们建立了MM-TOXIC-QA,一个多模态毒性评估框架,并将SGM与现有去毒化技术进行比较。在开源MLLM上的实验表明,SGM在标准和对抗条件下减轻了毒性,将有害率从48.2%降低到2.5%,同时保持流畅性和多模态推理能力。SGM具有扩展性,其综合防御措施SGM*与现有去毒化方法结合,提供了一种可解释、低成本的毒性控制多模态生成解决方案。
Summary / 总结
SGM is a white-box neuron-level intervention for multimodal large language models (MLLMs) that selectively recalibrates toxic expert neurons via expertise-weighted soft suppression, mitigating toxicity in both standard and adversarial conditions. SGM reduces harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning, and its combined defenses, SGM*, integrate with existing detoxification methods for enhanced safety performance.
SGM 是一种白盒神经元级干预方法,通过选择性地重新校准有毒专家神经元来净化多模态大型语言模型(MLLMs),而不进行参数更新。它像安全眼镜一样,中和有害的跨模态激活。实验表明,SGM 在标准和对抗条件下将有害率从 48.2% 降低到 2.5%,同时保持流畅性和多模态推理能力。SGM 可以与现有的净化方法结合使用,以增强安全性表现。
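The idea of suppressing a small set of neurons at inference time, without touching the weights, can be sketched as follows. The abstract does not give the formula behind "expertise-weighted soft suppression", so the scaling rule `1 - strength * expertise`, the `strength` parameter, and the function name are hypothetical placeholders rather than SGM's actual method.

```python
def soft_suppress(activations, expertise, strength=0.8):
    """Scale down selected neuron activations; model parameters stay untouched.

    activations: list of floats for one layer (hypothetical hook output).
    expertise:   dict mapping neuron index -> score in [0, 1] saying how strongly
                 that neuron is associated with toxic outputs (assumed given).
    strength:    assumed global suppression factor in [0, 1].
    """
    out = list(activations)
    for idx, score in expertise.items():
        clipped = min(max(score, 0.0), 1.0)
        # Soft suppression: the higher the expertise score, the stronger the damping.
        out[idx] = activations[idx] * (1.0 - strength * clipped)
    return out


layer = [2.0, 1.0, 3.0]
damped = soft_suppress(layer, expertise={0: 1.0, 2: 0.5})
```

Neurons that are not listed in `expertise` pass through unchanged, which matches the abstract's description of recalibrating only a small set of neurons.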
Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
Authors: Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister
First: 2025-11-14T04:16:09+00:00 · Latest: 2025-12-17T02:27:55+00:00
Abstract
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real compared with baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
中文标题/摘要
标题:面向视觉语言模型空间智能的抽象3D感知
视觉语言模型(VLMs)在空间认知和物理理解等3D相关任务上表现不佳,而这些任务对于机器人技术和具身智能体等实际应用至关重要。我们将其归因于3D任务与VLM的2D训练之间的模态差距,这导致从2D输入中检索3D信息的效率低下。为了弥合这一差距,我们引入了SandboxVLM,这是一种简单而有效的框架,利用抽象边界框为VLM编码几何结构和物理运动学信息。具体而言,我们设计了一个包含四个阶段的3D Sandbox重建和感知管道:使用抽象控制生成多视角先验、代理提升、多视角投票和聚类,以及3D感知推理。在多个基准和VLM骨干网络的零样本设置下进行评估,我们的方法持续提升了空间智能,例如与基线方法相比在SAT Real上取得了8.3%的提升。这些结果表明,为VLM配备3D抽象可显著增强其3D推理能力,而无需额外训练,这为通用具身智能提供了新的可能性。
Summary / 总结
The research aims to improve vision-language models' performance in 3D-related tasks such as spatial cognition and physical understanding, which are essential for applications like robotics. To address the modality gap between 3D tasks and 2D training, the authors propose SandboxVLM, a framework that uses abstract bounding boxes to encode geometric structure and physical kinematics. The method involves a 3D Sandbox reconstruction and perception pipeline with four stages: generating multi-view priors, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Experiments show that SandboxVLM significantly enhances spatial intelligence, achieving an 8.3% improvement on SAT Real compared to baseline methods in zero-shot settings across multiple benchmarks and VLM backbones.
研究旨在提高视觉语言模型(VLMs)在3D相关任务如空间认知和物理理解方面的表现,这对于机器人等实际应用至关重要。为了解决3D任务与2D训练之间的模态差距,作者提出了SandboxVLM,该方法使用抽象边界框来编码几何结构和物理运动学。该方法包括一个3D Sandbox重建和感知管道,分为四个阶段:多视图先验生成、代理提升、多视图投票和聚类以及3D感知推理。实验表明,SandboxVLM 显著增强了VLMs的空间智能,在SAT Real基准测试中相比基线方法取得了8.3%的提升。
dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
Authors: Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang
First: 2025-12-02T07:42:38+00:00 · Latest: 2025-12-17T01:55:34+00:00
Abstract
Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this benchmark, dots.ocr achieves state-of-the-art performance, delivering an approximately 10% relative improvement and demonstrating strong multilingual capability.
中文标题/摘要
标题:dots.ocr:单个视觉语言模型中的多语言文档布局解析
文档布局解析是人工智能(AI)访问和解释世界庞大结构化知识库的关键途径。这一过程包括布局检测、文本识别和关系理解,对于增强下一代视觉语言模型至关重要。然而,当前的方法依赖于碎片化的多阶段管道,容易产生错误传播,并且无法充分利用联合训练的协同效应。在本文中,我们介绍了dots.ocr,这是一个单一的视觉语言模型,首次展示了在统一的端到端框架中联合学习三个核心任务的优势。这得益于一个高度可扩展的数据引擎,该引擎合成了大量的多语言语料库,使模型能够在各种任务中表现出色,涵盖多种语言、布局和领域。我们通过在综合的OmniDocBench上取得最先进的性能验证了我们统一范式的有效性。此外,为了促进全球文档智能的研究,我们引入了XDocParse,这是一个涵盖126种语言的具有挑战性的新基准。在该基准上,dots.ocr 达到了最先进的性能,实现了约10%的相对改进,并展示了强大的多语言能力。
Summary / 总结
The paper introduces dots.ocr, a single Vision-Language Model that jointly learns document layout detection, text recognition, and relational understanding in an end-to-end framework. This approach leverages a scalable data engine to synthesize a large multilingual corpus, improving robustness across various tasks. The model achieves state-of-the-art performance on OmniDocBench and an approximately 10% relative improvement on the new XDocParse benchmark, which spans 126 languages, highlighting its strong multilingual capability.
论文介绍了dots.ocr,这是一个在端到端框架中联合学习布局检测、文本识别和关系理解的单一视觉语言模型,克服了多阶段管道的局限性。该模型利用可扩展的数据引擎合成多语言语料库,在OmniDocBench上实现了最先进的性能,并在涵盖126种语言的新XDocParse基准上实现了约10%的相对改进,展示了强大的多语言能力。
Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts
Authors: Kiran Chhatre, Christopher Peters, Srikrishna Karanam
Venue: AAAI
First: 2025-08-08T05:36:20+00:00 · Latest: 2025-12-17T00:34:06+00:00
Comments: Association for the Advancement of Artificial Intelligence (AAAI) 2026, 14 pages, 11 figures. Webpage: https://s-pectrum.github.io/
Abstract
Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine-grained clothing types. Recent open-vocabulary segmentation approaches leverage pretrained text-to-image (T2I) diffusion model features for strong zero-shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part-level pixel parsing (body parts and clothing) and instance-level grouping. While diffusion-based open-vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image-driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image-to-Texture (I2Tx) diffusion model (obtained by fine-tuning a T2I model on 3D human texture maps) for improved alignment with body parts and clothing. From an input image, we extract human-part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt-guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross-dataset experiments, separately assessing body parts, clothing parts, unseen clothing categories, and full-body masks, and demonstrate that Spectrum consistently outperforms baseline methods in prompt-based segmentation.
中文标题/摘要
标题:学习3D纹理感知表示以解析多样的人类服装和身体部位
现有的人体解析方法通常使用固定掩码类别和宽泛的标签,掩盖了细粒度的服装类型。最近的开放词汇分割方法利用预训练的文本到图像(T2I)扩散模型特征进行强大的零样本迁移,但通常将整个人体归为一个人类类别,无法区分多样化的服装或详细的身体部位。为了解决这个问题,我们提出了Spectrum,一种统一网络,用于部分像素解析(身体部位和服装)和实例级分组。虽然基于扩散的开放词汇模型在任务之间具有良好的泛化能力,但它们的内部表示并不专门用于详细的解析。我们观察到,与具有宽泛表示的扩散模型不同,图像驱动的3D纹理生成器保持了与输入图像的忠实对应关系,从而为解析多样化的服装和身体部位提供了更强的表示。Spectrum引入了一种新的I2Tx扩散模型(通过在3D人体纹理图上微调T2I模型获得)的重新利用方法,以提高与身体部位和服装的对齐。从输入图像中,我们通过I2Tx扩散模型提取人体部分内部特征,并通过提示引导的定位生成与多样化服装类别对齐的语义有效掩码。训练完成后,Spectrum可以为场景中任何数量的人生成每个可见身体部位和服装类别的语义分割图,忽略单独的服装或无关对象。我们在广泛的跨数据集实验中分别评估了身体部位、服装部分、未见过的服装类别和全身掩码,并证明Spectrum在提示驱动的分割中始终优于基线方法。
Summary / 总结
The paper addresses the limitations of existing methods in human parsing by proposing Spectrum, a unified network for part-level pixel parsing and instance-level grouping. Unlike previous diffusion-based models that group humans into a single category, Spectrum uses an Image-to-Texture (I2Tx) diffusion model to generate semantically valid masks aligned to diverse clothing categories. The model extracts human-part internal features and generates masks through prompt-guided grounding, improving the parsing of diverse clothing and body parts. Experiments show that Spectrum outperforms baseline methods in prompt-based segmentation across various datasets and scenarios.
论文提出了一种名为Spectrum的统一网络,利用Image-to-Texture (I2Tx) 扩散模型生成3D纹理感知表示,以解决现有方法在解析多样化人体服装和身体部位时的局限性。该方法通过I2Tx模型提取内部特征,并通过提示引导的定位生成语义有效的掩码,从而为各种身体部位和服装类别生成详细的分割图。实验表明,Spectrum在不同数据集和场景下的提示驱动分割中优于基线方法。
Puzzle Curriculum GRPO for Vision-Centric Reasoning
Authors: Ahmadreza Jeddi, Hakki Can Karaimer, Hue Nguyen, Zhongling Wang, Ke Zhao, Javad Rajabi, Ran Zhang, Raghav Goyal, Babak Taati, Radek Grzeszczuk
First: 2025-12-16T22:17:25+00:00 · Latest: 2025-12-16T22:17:25+00:00
Comments: Project page: https://pcgrpo.github.io
Abstract
Recent reinforcement learning (RL) approaches like outcome-supervised GRPO have advanced chain-of-thought reasoning in Vision Language Models (VLMs), yet key issues linger: (i) reliance on costly and noisy hand-curated annotations or external verifiers; (ii) flat and sparse reward schemes in GRPO; and (iii) logical inconsistency between a chain's reasoning and its final answer. We present Puzzle Curriculum GRPO (PC-GRPO), a supervision-free recipe for RL with Verifiable Rewards (RLVR) that strengthens visual reasoning in VLMs without annotations or external verifiers. PC-GRPO replaces labels with three self-supervised puzzle environments: PatchFit, Rotation (with binary rewards) and Jigsaw (with graded partial credit mitigating reward sparsity). To counter flat rewards and vanishing group-relative advantages, we introduce a difficulty-aware curriculum that dynamically weights samples and peaks at medium difficulty. We further monitor Reasoning-Answer Consistency (RAC) during post-training: mirroring reports for vanilla GRPO in LLMs, RAC typically rises early then degrades; our curriculum delays this decline, and consistency-enforcing reward schemes further boost RAC. RAC correlates with downstream accuracy. Across diverse benchmarks and on Qwen-7B and Qwen-3B backbones, PC-GRPO improves reasoning quality, training stability, and end-task accuracy, offering a practical path to scalable, verifiable, and interpretable RL post-training for VLMs.
中文标题/摘要
标题:面向视觉中心推理的Puzzle Curriculum GRPO
近期的强化学习(RL)方法如结果监督GRPO在视觉语言模型(VLMs)的链式推理方面取得了进展,但仍存在关键问题:(i)依赖昂贵且噪声大的手动标注或外部验证者;(ii)GRPO中扁平且稀疏的奖励方案;(iii)链式推理与其最终答案之间的逻辑不一致。我们提出了Puzzle Curriculum GRPO(PC-GRPO),一种无需监督的强化学习(RL)方法,通过验证奖励(RLVR)强化视觉推理,无需标注或外部验证者。PC-GRPO用三个自我监督的谜题环境替代标签:PatchFit、旋转(带二元奖励)和拼图(带分级部分信用奖励,缓解奖励稀疏性)。为对抗扁平奖励和分组相对优势的消失,我们引入了一种难度感知的课程,动态加权样本,并在中等难度时达到峰值。我们还在后训练期间监控推理-答案一致性(RAC):类似于LLMs中vanilla GRPO的报告,RAC通常早期上升然后下降;我们的课程推迟了这种下降,并且一致性强化奖励方案进一步提升了RAC。RAC与下游准确性相关。在多种基准测试和Qwen-7B和Qwen-3B骨干网络上,PC-GRPO提高了推理质量、训练稳定性和最终任务准确性,提供了一条实用的路径,用于VLMs的可扩展、可验证和可解释的后训练强化学习。
Summary / 总结
PC-GRPO addresses limitations in vision-language models by introducing a supervision-free method using self-supervised puzzle environments to enhance visual reasoning. It replaces labels with PatchFit, Rotation, and Jigsaw environments, and employs a difficulty-aware curriculum to mitigate flat rewards and vanishing group-relative advantages. The method improves reasoning quality, training stability, and end-task accuracy across various benchmarks and model sizes.
PC-GRPO通过引入自我监督的拼图环境来增强视觉推理,解决现有RL方法在VLM中的局限性,无需标注即可提升推理质量。该方法使用PatchFit、Rotation和Jigsaw替换标签,并采用难度感知的课程学习来缓解奖励稀疏性和消失的优势。该方法在多种基准测试和VLM骨干网络上提高了推理质量、训练稳定性和最终任务准确性。
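Two ideas from the abstract above lend themselves to a small worked sketch: group-relative advantages vanish when every rollout in a group earns the same reward, and a difficulty-aware weight can peak at medium difficulty. The advantage normalization below follows the commonly described GRPO form; the weighting function `4 * p * (1 - p)` and the Jigsaw partial-credit rule are assumptions chosen for illustration, since the abstract does not specify them.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def difficulty_weight(success_rate):
    """Assumed weighting: 0 for trivially easy or impossible puzzles, 1 at p = 0.5."""
    return 4.0 * success_rate * (1.0 - success_rate)


def jigsaw_reward(predicted_order, true_order):
    """Assumed graded partial credit: fraction of tiles placed in the right slot."""
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)


# All rollouts solved (or all failed): every advantage is zero, so no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: successes get positive advantage, failures negative.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

partial = jigsaw_reward([0, 1, 3, 2], [0, 1, 2, 3])
```

Under these assumptions, puzzles that a group always solves or always fails contribute neither advantage nor weight, while graded Jigsaw credit keeps the reward informative even when no rollout is fully correct.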
Parameter Efficient Multimodal Instruction Tuning for Romanian Vision Language Models
Authors: George-Andrei Dima, Dumitru-Clementin Cercel
First: 2025-12-16T21:36:28+00:00 · Latest: 2025-12-16T21:36:28+00:00
Abstract
Focusing on low-resource languages is an essential step toward democratizing generative AI. In this work, we contribute to reducing the multimodal NLP resource gap for Romanian. We translate the widely known Flickr30k dataset into Romanian and further extend it for visual question answering by leveraging open-source LLMs. We demonstrate the usefulness of our datasets by fine-tuning open-source VLMs on Romanian visual question answering. We select VLMs from three widely used model families: LLaMA 3.2, LLaVA 1.6, and Qwen2. For fine-tuning, we employ the parameter-efficient LoRA method. Our models show improved Romanian capabilities in visual QA, as well as on tasks they were not trained on, such as Romanian image description generation. The seven-billion-parameter Qwen2-VL-RoVQA obtains top scores on both tasks, with improvements of +6.05% and +2.61% in BERTScore F1 over its original version. Finally, the models show substantial reductions in grammatical errors compared to their original forms, indicating improvements not only in language understanding but also in Romanian fluency.
中文标题/摘要
标题:参数高效多模态指令调优以提高罗马尼亚视觉语言模型性能
关注低资源语言是使生成式AI普及化的重要步骤。在本工作中,我们致力于缩小罗马尼亚语的多模态NLP资源差距。我们将广泛使用的Flickr30k数据集翻译成罗马尼亚语,并利用开源LLM将其进一步扩展用于视觉问答。我们通过在罗马尼亚语视觉问答任务上微调开源VLM展示了数据集的实用性。我们从三个广泛使用的模型家族中选择了VLM:LLaMA 3.2、LLaVA 1.6和Qwen2。在微调过程中,我们采用了参数高效的LoRA方法。我们的模型在视觉问答任务上展示了更强的罗马尼亚语能力,在未经训练的任务(如罗马尼亚语图像描述生成)上同样如此。拥有七十亿参数的Qwen2-VL-RoVQA在两个任务上均获得最高分数,BERTScore F1相比其原始版本分别提高了6.05%和2.61%。最后,与原始版本相比,模型的语法错误显著减少,这表明不仅语言理解有所改进,罗马尼亚语流利度也有所提升。
Summary / 总结
This work aims to reduce the resource gap for Romanian in multimodal NLP by translating the Flickr30k dataset into Romanian and fine-tuning open-source vision-language models using the parameter-efficient LoRA method. The models show improved performance in Romanian visual question answering and image description generation, with the Qwen2-VL-RoVQA achieving top scores and a significant reduction in grammatical errors compared to its original version.
本研究旨在通过将Flickr30k数据集翻译成罗马尼亚语,并使用参数高效的LoRA方法微调开源的视觉语言模型(VLMs),来减少罗马尼亚语的多模态NLP资源缺口。模型在罗马尼亚语视觉问答和图像描述生成任务上的表现得到提升,七十亿参数的Qwen2-VL-RoVQA在两项任务上取得了最佳成绩,BERTScore F1得分也有所提高。此外,模型的语法错误显著减少,表明其在罗马尼亚语流畅度方面也有所提升。
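The LoRA method mentioned above keeps the pretrained weight matrix W frozen and learns a low-rank update, so the layer computes W x + (alpha / r) * B A x. The sketch below shows that arithmetic in plain Python with toy matrices; the shapes, values, and function names are illustrative assumptions and are not taken from the paper's training setup.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base projection plus a scaled low-rank update.

    W: d_out x d_in frozen weights, A: rank x d_in, B: d_out x rank (trainable).
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, rank):
    """Parameters trained by LoRA for one layer, versus d_in * d_out for full tuning."""
    return rank * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [2.0, 3.0]

# B starts at zero in the usual LoRA initialization, so the output equals the base model.
at_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0, rank=1)

# After training, a non-zero B shifts the output along a rank-1 direction.
adapted = lora_forward(W, A, [[1.0], [0.0]], x, alpha=2.0, rank=1)
```

For a 4096 x 4096 projection with rank 8, `trainable_params` gives 65,536 trained values against roughly 16.8 million for full fine-tuning, which is the source of the method's parameter efficiency.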
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Authors: Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari
First: 2025-12-04T21:57:10+00:00 · Latest: 2025-12-16T21:10:24+00:00
Abstract
Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT), and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.
中文标题/摘要
标题:从片段到场景:通过视觉语言模型在自动驾驶中的时间理解
在自动驾驶(AD)中的时间理解仍然是一个重大挑战,即使是最近的最先进(SoTA)视觉语言模型(VLMs)也是如此。先前的工作引入了数据集和基准测试以提高时间推理能力,但这些基准测试主要集中在其他视频内容上,包括体育、烹饪和电影。目前没有基准测试专门关注以自我为中心的AD视频中时间理解的独特挑战。为了填补这一空白,本文提出了自动驾驶时间理解(TAD)基准测试,用于评估VLMs捕捉AD中动作之间动态关系的能力。TAD包含近6000个问答(QA)对,涵盖7个人工设计的任务。此外,还对9个闭源和开源通用模型以及SoTA AD专用模型进行了评估。在TAD上,当前的SoTA模型准确率不佳,主要原因是细粒度运动理解不完善。为了改进运动理解并提高在TAD上的整体准确率,本文提出了两种无需训练的新解决方案:利用思维链(CoT)的Scene-CoT,以及结合自我中心时间认知地图的TCogMap。所提出的方法与现有的VLMs集成,使TAD上的平均准确率最多提高17.72%。通过引入TAD、对多个SoTA模型进行基准测试并提出有效的改进,这项工作旨在促进AD中时间理解的未来研究。基准测试和评估代码分别可在Hugging Face(https://huggingface.co/datasets/vbdai/TAD)和GitHub(https://github.com/vbdi/tad_bench)上获得。
Summary / 总结
This paper addresses the challenge of temporal understanding in autonomous driving (AD) by introducing the TAD benchmark, which evaluates VLMs on their ability to capture dynamic relationships between actions in ego-centric AD footage. The authors propose two training-free solutions, Scene-CoT and TCogMap, to improve motion understanding and accuracy. Integrated with existing VLMs, these solutions improve average accuracy on TAD by up to 17.72%, highlighting both the unique challenges of temporal understanding in AD and the potential of vision-language models to address them.
本文通过引入TAD基准,评估VLMs在捕捉AD视频中动态关系的能力,解决了自主驾驶中的时间理解挑战。作者提出了两种无需训练的解决方案,Scene-CoT和TCogMap,以提高运动理解和准确性。这些解决方案通过最多17.72%的提升,增强了现有的VLMs在TAD基准上的表现,展示了AD中时间理解的独特挑战以及视觉语言模型解决这些挑战的潜力。
VIBE: Can a VLM Read the Room?
Authors: Tania Chakraborty, Eylon Caplan, Dan Goldwasser
Venue: EMNLP
First: 2025-06-11T19:07:35+00:00 · Latest: 2025-12-16T18:42:51+00:00
Comments: Findings of EMNLP, 2025
Abstract
Understanding human social behavior such as recognizing emotions and the social dynamics causing them is an important and challenging problem. While LLMs have made remarkable advances, they are limited to the textual domain and cannot account for the major role that non-verbal cues play in understanding social situations. Vision Language Models (VLMs) can potentially account for this gap, however their ability to make correct inferences over such social cues has received little attention. In this paper, we explore the capabilities of VLMs at social reasoning. We identify a previously overlooked limitation in VLMs: the Visual Social-Pragmatic Inference gap. To target this gap, we propose a new task for VLMs: Visual Social-Pragmatic Inference. We construct a high quality dataset to test the abilities of a VLM for this task and benchmark the performance of several VLMs on it.
中文标题/摘要
标题:VIBE:VLM能否读懂房间里的社交信号?
理解人类社会行为,如识别情绪及其背后的社会动态,是一个重要且具有挑战性的问题。尽管语言模型(LLMs)取得了显著进展,但它们仅限于文本领域,无法解释非言语线索在理解社交情境中的重要作用。视觉语言模型(VLMs)有可能弥补这一差距,但它们在推理此类社会线索方面的能力尚未受到广泛关注。在本文中,我们探讨了VLM在社会推理方面的能力。我们发现VLM的一个先前未被注意到的局限性:视觉社会-语用推理差距。为解决这一差距,我们为VLM提出了一项新任务:视觉社会-语用推理。我们构建了一个高质量的数据集来测试VLM在该任务上的能力,并在该数据集上对几种VLM进行了基准测试。
Summary / 总结
This paper explores the capabilities of Vision Language Models (VLMs) in social reasoning, identifying a limitation known as the Visual Social-Pragmatic Inference gap. To address this, the authors propose a new task and construct a high-quality dataset to benchmark VLMs. The main finding is that current VLMs struggle with visual social cues, indicating a need for improvement in this area.
本文探讨了视觉语言模型(VLM)在社会推理方面的能力,指出存在一个名为视觉社会-语用推理的局限性。为解决这一问题,作者提出了一项新任务并构建了一个高质量的数据集来评估VLM的表现。主要发现是当前的VLM在处理视觉社会线索方面存在困难,表明需要在这一领域进行改进。
Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus
Authors: Antonio Guillen-Perez
First: 2025-12-12T20:07:04+00:00 · Latest: 2025-12-16T17:15:46+00:00
Abstract
The development of robust Autonomous Vehicles (AVs) is bottlenecked by the scarcity of "Long-Tail" training data. While fleets collect petabytes of video logs, identifying rare safety-critical events (e.g., erratic jaywalking, construction diversions) remains a manual, cost-prohibitive process. Existing solutions rely on coarse metadata search, which lacks precision, or cloud-based VLMs, which are privacy-invasive and expensive. We introduce Semantic-Drive, a local-first, neuro-symbolic framework for semantic data mining. Our approach decouples perception into two stages: (1) Symbolic Grounding via a real-time open-vocabulary detector (YOLOE) to anchor attention, and (2) Cognitive Analysis via a Reasoning VLM that performs forensic scene analysis. To mitigate hallucination, we implement a "System 2" inference-time alignment strategy, utilizing a multi-model "Judge-Scout" consensus mechanism. Benchmarked on the nuScenes dataset against the Waymo Open Dataset (WOD-E2E) taxonomy, Semantic-Drive achieves a Recall of 0.966 (vs. 0.475 for CLIP) and reduces Risk Assessment Error by 40% compared to the best single scout models. The system runs entirely on consumer hardware (NVIDIA RTX 3090), offering a privacy-preserving alternative to the cloud.
中文标题/摘要
标题:语义驱动:通过开放词汇接地和神经符号VLM共识民主化长尾数据整理
自主车辆(AV)的稳健开发受限于“长尾”训练数据的稀缺性。尽管车队收集了大量视频日志,但识别罕见的安全关键事件(例如,不规则的随意横穿马路、施工改道)仍然是一个手动且成本高昂的过程。现有解决方案依赖于粗略的元数据搜索,缺乏精确性,或者基于云的VLM,这侵犯了隐私并昂贵。我们提出了语义驱动,一种本地优先的神经符号框架,用于语义数据挖掘。我们的方法将感知分为两个阶段:(1)通过实时开放词汇检测器(YOLOE)进行符号接地,以锚定注意力;(2)通过推理VLM进行认知分析,执行法医场景分析。为了减轻幻觉,我们实现了一种“系统2”推理时对齐策略,利用多模型“法官-侦察员”共识机制。在nuScenes数据集上与Waymo开放数据集(WOD-E2E)分类法进行基准测试,语义驱动实现了0.966的召回率(而CLIP为0.475),并将风险评估误差降低了40%。该系统完全在消费级硬件(NVIDIA RTX 3090)上运行,提供了一种保护隐私的云替代方案。
Summary / 总结
Semantic-Drive addresses the challenge of curating long-tail data for autonomous vehicles by introducing a local-first, neuro-symbolic framework. It uses YOLOE for real-time open-vocabulary detection and a Reasoning VLM for scene analysis, with a multi-model consensus mechanism to reduce hallucination. On the nuScenes dataset, Semantic-Drive achieves a recall of 0.966 and a 40% reduction in risk assessment error compared to single model approaches, while running on consumer hardware.
Semantic-Drive 通过引入一种本地优先的神经符号框架来解决自动驾驶车辆中识别罕见的安全关键事件的挑战。该框架包括两个阶段:使用实时开放词汇检测器(YOLOE)进行符号接地,以及通过推理视觉语言模型(VLM)进行场景分析。系统通过多模型共识机制减少幻觉,并在 nuScenes 数据集上与 Waymo 开放数据集 (WOD-E2E) 分类法相比,实现了 0.966 的召回率和 40% 的风险评估误差减少,所有这些都在消费级硬件(NVIDIA RTX 3090)上运行。
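The consensus step and the recall metric reported above can be sketched in a few lines. The abstract names a multi-model "Judge-Scout" mechanism but does not describe its rules, so the majority-vote-with-judge-fallback logic, the `min_agreement` threshold, and the scenario labels below are hypothetical; only the recall formula (true positives divided by all relevant items) is standard.

```python
from collections import Counter


def consensus_label(scout_labels, judge_label, min_agreement=2):
    """Accept the scouts' label only on a clear majority; otherwise defer to the judge."""
    if not scout_labels:
        return judge_label
    counts = Counter(scout_labels)
    label, votes = counts.most_common(1)[0]
    tied = [name for name, n in counts.items() if n == votes]
    if votes >= min_agreement and len(tied) == 1:
        return label
    return judge_label


def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


agreed = consensus_label(["jaywalking", "jaywalking", "none"], judge_label="none")
deferred = consensus_label(["jaywalking", "construction"], judge_label="construction")
found = recall(retrieved=[1, 2, 3], relevant=[2, 3, 4, 5])
```

Requiring agreement before accepting a label is one simple way to damp single-model hallucinations, at the cost of extra inference passes per scene.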
SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome
Authors: Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly, Mohammad Vali Sanian, Sebastian Birk, Yinshui Chang, Adam Boxall, Daniyal Jafree, Lloyd Steele, Vijaya Baskar MS, Muzlifah Haniffa, Mohammad Lotfollahi
First: 2025-11-19T14:22:23+00:00 · Latest: 2025-12-16T16:01:40+00:00
Abstract
Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an improvement of avg. 9.78\% in the gene-expression prediction task and avg. 26.93\% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
中文标题/摘要
标题:SIGMMA:基于层次图的多尺度多模态对比对齐框架,用于组织病理图像和空间转录组
计算病理学的最新进展利用视觉-语言模型学习Hematoxylin和Eosin (HE) 图像与空间转录组 (ST) 轮廓的联合表示。然而,现有方法通常在单尺度上对HE切片与其相应的ST轮廓进行对齐,忽视了细微的细胞结构及其空间组织。为了解决这个问题,我们提出了Sigmma,这是一种多模态对比对齐框架,用于在多个尺度上学习HE图像和空间转录组轮廓的层次表示。Sigmma 引入了多尺度对比对齐,确保不同尺度下学习的表示在模态间保持一致。此外,通过将细胞相互作用表示为图,并整合跨子图和子图内关系,我们的方法有效地捕捉了组织微环境中从精细到粗略的细胞-细胞相互作用。我们证明Sigmma 学习的表示更好地捕捉了跨模态对应关系,在基因表达预测任务中平均提高了9.78%,在跨模态检索任务中平均提高了26.93%。我们进一步表明,它在下游分析中学习了有意义的多组织组织。
Summary / 总结
The research aims to improve the alignment of histopathology images and spatial transcriptomic profiles by addressing the limitations of existing single-scale approaches. SIGMMA, a multi-modal contrastive alignment framework, learns hierarchical representations across multiple scales and keeps the representations learned at different scales coherent across modalities. The method represents cell interactions as a graph and integrates inter- and intra-subgraph relationships to capture cell-cell interactions from fine to coarse scales. Experimental results show an average 9.78% improvement in gene-expression prediction and an average 26.93% improvement in cross-modal retrieval across datasets.
研究旨在通过解决现有单尺度方法的局限性,改进病理图像和空间转录组学资料之间的对齐。提出的SIGMMA框架采用基于多尺度多模态对比性对齐的图表示方法,确保不同尺度下的表示具有一致性。实验表明,SIGMMA增强了跨模态对应关系,分别在基因表达预测和跨模态检索任务中提高了9.78%和26.93%,并且有效地捕捉了从精细到粗略尺度的细胞间相互作用在组织微环境中的情况。
A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning
Authors: Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen
First: 2025-12-16T14:27:47+00:00 · Latest: 2025-12-16T14:27:47+00:00
Abstract
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a $\textbf{Dreamer}$ that employs generative models to visualize $\textit{how}$ an interaction would look; (2) a $\textbf{Thinker}$ that utilizes large vision-language models to decide $\textit{what}$ object part to interact with; and (3) a $\textbf{Spotter}$ that orchestrates vision foundation models to precisely locate $\textit{where}$ the interaction area is. By leveraging the complementary strengths of pre-trained models without any task-specific fine-tuning, our zero-shot framework significantly outperforms state-of-the-art supervised methods across multiple benchmarks and demonstrates robust generalization to real-world settings.
中文标题/摘要
标题:A4-Agent:一种用于零样本功能推理的代理框架
功能预测,基于语言指令识别物体上的交互区域,对于体态人工智能至关重要。现有的端到端模型将高层推理和低层语义绑定在一个单一的管道中,并依赖于标注数据集的训练,这导致在新型物体和未见过的环境中泛化能力较差。在本文中,我们提出了A4-Agent,一种无需训练的代理框架,将功能预测拆分为三个阶段的管道。我们的框架在测试时协调专门的基础模型:(1) 一个**Dreamer**,使用生成模型来可视化**如何**进行交互;(2) 一个**Thinker**,利用大型的视觉-语言模型来决定**与哪个**物体部分进行交互;(3) 一个**Spotter**,协调视觉基础模型来精确定位**哪里**是交互区域。通过利用预训练模型的互补优势,而无需任何特定任务的微调,我们的零样本框架在多个基准测试中显著优于最先进的监督方法,并在真实世界环境中展示了鲁棒的泛化能力。
Summary / 总结
The paper introduces A4-Agent, a zero-shot framework for affordance prediction that decouples the process into three stages: Dreamer, Thinker, and Spotter. Dreamer visualizes interaction scenarios, Thinker decides the object part to interact with, and Spotter locates the precise interaction area. This framework, which leverages pre-trained models without fine-tuning, outperforms existing supervised methods across various benchmarks and shows robust generalization to real-world settings.
论文针对体态AI中的 affordance 预测挑战,即根据语言指令识别物体上的交互区域。为克服现有端到端模型在泛化到新物体和未见过的环境中的局限性,作者提出了 A4-Agent,一种无需训练的框架,将过程分为三个阶段:Dreamer、Thinker 和 Spotter。Dreamer 会可视化交互,Thinker 决定要交互的物体部分,Spotter 精确定位交互区域。该框架利用预训练模型的优势,显著优于监督方法,在多个基准测试中表现出色,并在真实世界环境中表现出良好的泛化能力。
DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning
Authors: Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa
Venue: AAAI 2026
First: 2025-12-16T14:06:35+00:00 · Latest: 2025-12-16T14:06:35+00:00
Comments: Paper accepted to AAAI 2026
Abstract
Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
中文标题/摘要
标题:DISCODE:面向分布的评分解码器,用于稳健的图像字幕自动评估
大型视觉-语言模型(LVLMs)在多种跨模态任务中表现出色。然而,在领域偏移场景下,使用LVLMs进行稳健的图像字幕评估仍然具有挑战性。为了解决这一问题,我们引入了分布感知评分解码器(DISCODE),这是一种无需微调的新颖方法,能够在多种领域中生成与人类判断更一致的稳健评估分数。DISCODE的核心思想在于其测试时自适应评估方法,该方法引入了自适应测试时(ATT)损失,利用高斯先验分布提高评估分数估计的稳健性。我们推导出的解析解在测试时高效地最小化了该损失。此外,我们还引入了多领域字幕评估基准(MCEval),这是一个新的图像字幕评估基准,涵盖了六个不同的领域,旨在评估评估指标的稳健性。在我们的实验中,我们证明了DISCODE在MCEval和四个代表性现有基准上作为无参考评估指标达到了最先进的性能。
Summary / 总结
The research aims to improve the robustness of automatic image caption evaluation, especially under domain-shift conditions, by introducing DISCODE, a finetuning-free method. DISCODE uses an Adaptive Test-Time (ATT) loss with a Gaussian prior to generate evaluation scores that better align with human judgments. Experiments show that DISCODE outperforms existing reference-free metrics on MCEval and four other benchmarks, demonstrating its effectiveness in diverse domains.
研究旨在提高自动图像字幕评估的鲁棒性,特别是在领域迁移场景下。DISCODE是一种无需微调的新方法,通过使用带有高斯先验的Adaptive Test-Time (ATT)损失来生成与人类判断一致的评估分数。实验表明,DISCODE在MCEval和四个其他基准上均优于现有方法,使其成为最先进的无参考评估指标。
Unified Semantic Transformer for 3D Scene Understanding
Authors: Sebastian Koch, Johanna Wald, Hide Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari
First: 2025-12-16T12:49:35+00:00 · Latest: 2025-12-16T12:49:35+00:00
Comments: Project page: https://unite-page.github.io/
Abstract
Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been task-specific. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D semantic tasks within a single model. Our model operates on unseen scenes in a fully end-to-end manner and only takes a few seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, as well as affordance and articulations, solely from RGB images. The method is trained using 2D distillation, relies heavily on self-supervision, and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground truth 3D geometry. See the project website at unite-page.github.io
中文标题/摘要
标题:统一语义变换器用于3D场景理解
整体3D场景理解涉及捕获和解析未结构化的3D环境。由于现实世界的固有复杂性,现有模型主要被开发并局限于特定任务。我们引入了UNITE,一种用于3D场景理解的统一语义变换器,这是一种新颖的前馈神经网络,能够在一个模型中统一多种3D语义任务。我们的模型以端到端的方式处理未见过的场景,并且只需几秒钟即可推断出完整的3D语义几何结构。我们的方法能够直接预测多个语义属性,包括3D场景分割、实例嵌入、开放词汇特征,以及可操作性和关节,仅从RGB图像中。该方法使用2D蒸馏训练,高度依赖于自我监督,并利用了设计用于确保3D视图一致性的新型多视图损失。我们证明,UNITE在多个不同的语义任务上达到了最先进的性能,并且在许多情况下甚至超过了特定任务的模型,甚至在某些情况下超越了在真实3D几何上操作的方法。请参见项目网站:unite-page.github.io
Summary / 总结
UNITE is a Unified Semantic Transformer designed for holistic 3D scene understanding, capable of predicting multiple semantic attributes from RGB images. It uses a combination of 2D distillation and self-supervision, with novel multi-view losses to ensure 3D consistency. UNITE outperforms task-specific models and even surpasses methods using ground truth 3D geometry on several semantic tasks, achieving state-of-the-art performance.
UNITE 是一种统一的语义变换器,用于全面理解 3D 场景,能够从 RGB 图像中预测多种语义属性。它结合了 2D 提炼和自我监督,并使用新型多视图损失来确保 3D 视图一致性。UNITE 在多个语义任务上的表现优于特定任务模型,并且在某些情况下甚至超过了使用真实 3D 几何的模型。
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Authors: Jooyeol Yun, Jaegul Choo
First: 2025-12-16T12:03:46+00:00 · Latest: 2025-12-16T12:03:46+00:00
Comments: yeolj00.github.io/personal-projects/vector-prism
Abstract
Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision-language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.
中文标题/摘要
标题:向量棱柱:通过分层语义结构动画化向量图形
可缩放矢量图形(SVG)是现代网页设计的核心,随着网络环境变得越来越动态,对SVG动画的需求也在不断增长。然而,尽管在代码生成和运动规划方面取得了进展,视觉-语言模型(VLMs)仍然难以自动化SVG的动画。由于视觉上连贯的部分经常被分解成低级形状,这些形状提供的指导有限,难以确定哪些元素应该一起移动。在本文中,我们提出了一种框架,用于恢复实现可靠SVG动画所需的语义结构,并揭示了当前VLM系统所忽视的缺失层。这通过统计聚合多个弱部分预测实现,使系统能够从嘈杂的预测中稳定地推断出语义。通过将SVG重新组织成语义组,我们的方法使VLMs能够生成具有更高连贯性的动画。我们的实验表明,与现有方法相比,取得了显著的改进,这表明语义恢复是解锁稳健SVG动画的关键步骤,支持VLMs与向量图形之间更可解释的交互。
Summary / 总结
The research aims to address the challenge of automating the animation of Scalable Vector Graphics (SVG) for modern web design, where current vision-language models (VLMs) struggle due to fragmented low-level shapes. The method involves recovering the semantic structure of SVGs by aggregating multiple weak part predictions, enabling VLMs to produce more coherent animations. Experiments show significant improvements over existing approaches, indicating that semantic recovery is crucial for robust SVG animation.
本文解决了使用视觉语言模型(VLMs)自动化动画生成Scalable Vector Graphics (SVG)的挑战。它提出了一种名为Vector Prism的框架,通过聚合多个弱部分预测来恢复SVG的语义结构,从而使VLMs能够生成更连贯的动画。实验表明,与现有方法相比,这种方法有显著改进,表明语义恢复是实现稳健SVG动画的关键步骤,有助于VLMs与矢量图形之间的更可解释交互。
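Illustrative sketch (not from the paper): the abstract says only that semantics are inferred through "a statistical aggregation of multiple weak part predictions" and does not specify the estimator. The Python below swaps in plain majority voting as the simplest stand-in. The function names, the element ids, and the idea of one label per SVG element per prediction pass are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_part_labels(predictions):
    """Majority-vote a part label for each SVG element.

    predictions: list of dicts, one per noisy prediction pass,
                 each mapping element id -> predicted part label.
    Returns a dict mapping element id -> (winning label, vote share).
    """
    votes = {}
    for run in predictions:
        for element_id, label in run.items():
            votes.setdefault(element_id, Counter())[label] += 1

    result = {}
    for element_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        result[element_id] = (label, count / sum(counter.values()))
    return result


def group_by_label(aggregated):
    """Collect element ids that share a winning label into one semantic group."""
    groups = {}
    for element_id, (label, _share) in aggregated.items():
        groups.setdefault(label, []).append(element_id)
    return {label: sorted(ids) for label, ids in groups.items()}


if __name__ == "__main__":
    # Hypothetical labels from three weak prediction passes.
    runs = [
        {"path1": "wheel", "path2": "wheel", "path3": "body"},
        {"path1": "wheel", "path2": "body", "path3": "body"},
        {"path1": "wheel", "path2": "wheel", "path3": "body"},
    ]
    print(group_by_label(aggregate_part_labels(runs)))
```

The vote share gives a rough per-element confidence; elements with a low share could be left ungrouped rather than forced into a group.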
From YOLO to VLMs: Advancing Zero-Shot and Few-Shot Detection of Wastewater Treatment Plants Using Satellite Imagery in MENA Region
Authors: Akila Premarathna, Kanishka Hewageegana, Garcia Andarcia Mariangel
First: 2025-12-16T11:28:55+00:00 · Latest: 2025-12-16T11:28:55+00:00
Comments: 9 pages, 9 figures
Abstract
In regions of the Middle East and North Africa (MENA), there is a high demand for wastewater treatment plants (WWTPs), crucial for sustainable water management. Precise identification of WWTPs from satellite images enables environmental monitoring. Traditional methods like YOLOv8 segmentation require extensive manual labeling. But studies indicate that vision-language models (VLMs) are an efficient alternative to achieving equivalent or superior results through inherent reasoning and annotation. This study presents a structured methodology for VLM comparison, divided into zero-shot and few-shot streams specifically to identify WWTPs. The YOLOv8 was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE: ~85% WWTPs (positives), 15% non-WWTPs (negatives). Evaluated VLMs include LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B (Mistral), used to identify WWTP components such as circular/rectangular tanks, aeration basins and distinguish confounders via expert prompts producing JSON outputs with confidence and descriptions. The dataset comprises 1,207 validated WWTP locations (198 UAE, 354 KSA, 655 Egypt) and equal non-WWTP sites from field/AI data, as 600mx600m Geo-TIFF images (Zoom 18, EPSG:4326). Zero-shot evaluations on WWTP images showed several VLMs out-performing YOLOv8's true positive rate, with Gemma-3 highest. Results confirm that VLMs, particularly with zero-shot, can replace YOLOv8 for efficient, annotation-free WWTP classification, enabling scalable remote sensing.
中文标题/摘要
标题:从YOLO到VLM:利用卫星影像在中东和北非地区零样本和少样本识别废水处理厂的进步
在中东和北非(MENA)地区,废水处理厂(WWTP)的需求很高,对于可持续水资源管理至关重要。从卫星图像精确识别WWTP有助于环境监测。传统方法如YOLOv8分割需要大量人工标注。但研究表明,视觉语言模型(VLMs)是通过内在推理和标注实现同等或更优结果的高效替代方案。本研究提出了一种结构化的VLM比较方法,分为零样本和少样本流,专门用于识别WWTP。YOLOv8在埃及、沙特阿拉伯和阿联酋的政府数据集上进行了训练,包含83,566张高分辨率卫星图像:约85%为WWTP(正样本),15%为非WWTP(负样本)。评估的VLMs包括LLaMA 3.2 Vision、Qwen 2.5 VL、DeepSeek-VL2、Gemma 3、Gemini和Pixtral 12B(Mistral),用于识别WWTP组件如圆形/矩形储罐、曝气池,并通过专家提示区分干扰因素,生成带有置信度和描述的JSON输出。数据集包含1,207个验证过的WWTP位置(198个阿联酋,354个沙特阿拉伯,655个埃及)和等量的非WWTP现场/AI数据,作为600mx600m的Geo-TIFF图像(缩放级别18,EPSG:4326)。零样本评估显示,多个VLMs在WWTP图像上的真阳性率优于YOLOv8,Gemma-3最高。结果证实,特别是零样本的VLMs可以替代YOLOv8进行高效的、无需标注的WWTP分类,实现可扩展的遥感。
Summary / 总结
This study aims to improve the identification of wastewater treatment plants (WWTPs) in the Middle East and North Africa (MENA) using satellite imagery. It compares vision-language models (VLMs) like LLaMA 3.2 Vision, Qwen 2.5 VL, DeepSeek-VL2, Gemma 3, Gemini, and Pixtral 12B against YOLOv8 for zero-shot and few-shot detection. The YOLOv8 baseline was trained on a governmental dataset of 83,566 high-resolution satellite images from Egypt, Saudi Arabia, and UAE, while the evaluation dataset comprises 1,207 validated WWTP locations and an equal number of non-WWTP sites. The results show that several VLMs outperform YOLOv8's true positive rate in zero-shot evaluations on WWTP images, with Gemma-3 achieving the highest. This suggests VLMs can replace YOLOv8 for efficient, annotation-free WWTP classification.
该研究旨在利用卫星图像提高中东和北非地区的污水处理厂(WWTPs)识别。研究对比了视觉语言模型(VLMs)与YOLOv8在零样本和少量样本检测中的表现。VLMs在包含1,207个验证WWTP位置的数据集上进行了评估,并在零样本评估中显示出比YOLOv8更高的真阳性率,其中Gemma-3表现最佳。这表明VLMs可以替代YOLOv8,实现高效的无标注WWTP分类,支持该地区的可持续水资源管理。
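Illustrative sketch (not from the paper): the abstract states that the VLMs return JSON outputs with confidence and descriptions, and that models are compared by true positive rate on WWTP images. The Python below shows one way such replies could be scored. The field names `is_wwtp` and `confidence`, the 0.5 threshold, and the choice to count unparseable replies as misses are assumptions, not details from the study.

```python
import json


def true_positive_rate(raw_outputs, threshold=0.5):
    """Fraction of known-positive images that a model flags as a WWTP.

    raw_outputs: one raw JSON string per image, all images being true WWTPs.
    A reply counts as a hit only if it parses, sets the hypothetical field
    "is_wwtp" to true, and reports "confidence" at or above the threshold.
    """
    if not raw_outputs:
        return 0.0

    hits = 0
    for raw in raw_outputs:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # an unparseable reply is scored as a miss
        if not isinstance(record, dict):
            continue
        confidence = record.get("confidence", 0.0)
        if not isinstance(confidence, (int, float)):
            continue
        if record.get("is_wwtp") is True and confidence >= threshold:
            hits += 1
    return hits / len(raw_outputs)


if __name__ == "__main__":
    # Hypothetical replies for four images that are all real WWTPs.
    replies = [
        '{"is_wwtp": true, "confidence": 0.9, "description": "circular tanks"}',
        '{"is_wwtp": false, "confidence": 0.8, "description": "farm ponds"}',
        "not valid json",
        '{"is_wwtp": true, "confidence": 0.3, "description": "unclear basins"}',
    ]
    print(true_positive_rate(replies))
```

Because the score is computed on positive images only, it says nothing about false positives on the non-WWTP sites, which would need a separate count.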
SAM3-I: Segment Anything with Instructions
Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Wei Ji, Huchuan Lu, Li Cheng
First: 2025-12-04T09:00:25+00:00 · Latest: 2025-12-16T11:17:40+00:00
Comments: Preliminary results; work in progress
Abstract
Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
中文标题/摘要
标题:SAM3-I: 按指令分割一切
Segment Anything Model 3 (SAM3) 通过可编程概念分割实现了高级开放词汇分割,允许用户分割给定概念的所有实例,通常用简短的名词短语(NP)提示指定。虽然这标志着在SAM家族中首次将语言级概念集成在一起,但在实际使用中通常需要更丰富的表达,包括属性、空间关系、功能、动作、状态,甚至实例间的隐式推理。目前,SAM3依赖于外部多模态代理将复杂指令转换为NP,然后进行迭代掩码过滤。然而,这些NP级概念仍然过于粗略,往往无法精确表示特定实例。在此项工作中,我们提出了SAM3-I,这是一种增强框架,将概念级理解和指令级推理统一在SAM家族中。SAM3-I引入了一种指令感知级联适应机制,逐步将表达性的指令语义与SAM3现有的视觉-语言表示相匹配,从而实现直接的指令遵循分割,同时保留其原有的概念驱动能力。此外,我们设计了一种结构化的指令分类体系,涵盖概念、简单和复杂三个层次,并开发了一个可扩展的数据引擎来构建包含多样指令-掩码对的数据集。实验表明,SAM3-I表现出令人满意的效果,证明SAM3可以有效扩展以遵循自然语言指令,同时保持其强大的概念基础。我们开源了SAM3-I,并提供了实用的微调工作流程,使研究人员能够将其适应特定领域应用。源代码可在此处获取。
Summary / 总结
The research aims to enhance the SAM3 model for more precise and flexible segmentation by integrating instruction-level reasoning. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that aligns expressive instruction semantics with SAM3's vision-language representations, allowing direct instruction-following segmentation. Experiments show that SAM3-I improves performance in following natural-language instructions while maintaining strong concept grounding. The structured instruction taxonomy and scalable data engine support the development of diverse instruction-mask pairs, enabling effective extension of SAM3 to follow complex instructions. The source code is open-sourced for further research and application adaptation.
SAM3-I通过将指令级推理整合到SAM3框架中,增强了开放词汇分割能力,使其能够更精确地遵循自然语言指令进行分割。它引入了一种指令感知级联适应机制,将表达性的指令语义与SAM3的视觉-语言表示对齐,并设计了一种结构化的指令分类体系,涵盖概念、简单和复杂指令。实验表明,SAM3-I在保持强大的概念基础的同时,提高了遵循自然语言指令的性能。
Light-Weight Benchmarks Reveal the Hidden Hardware Cost of Zero-Shot Tabular Foundation Models
Authors: Ishaan Gangwani, Aayam Bansal
Venue: ICML
First: 2025-11-30T13:17:08+00:00 · Latest: 2025-12-16T10:51:18+00:00
Comments: ICML NewInML
Abstract
Zero-shot foundation models (FMs) promise training-free prediction on tabular data, yet their hardware footprint remains poorly characterized. We present a fully reproducible benchmark that reports test accuracy together with wall-clock latency, peak CPU RAM, and peak GPU VRAM on four public datasets: Adult-Income, Higgs-100k, Wine-Quality, and California-Housing. Two open FMs (TabPFN-1.0 and TabICL-base) are compared against tuned XGBoost, LightGBM, and Random Forest baselines on a single NVIDIA T4 GPU. The tree ensembles equal or surpass FM accuracy on three datasets while completing full-test batches in <= 0.40 s and <= 150 MB RAM, using zero VRAM. TabICL achieves a 0.8 percentage-point gain on Higgs but requires roughly 40,000 times more latency (960 s) and 9 GB VRAM. TabPFN matches tree-model accuracy on Wine and Housing but peaks at 4 GB VRAM and cannot process the full 100k-row Higgs table. These results quantify the substantial hardware-versus-accuracy trade-offs in current tabular FMs and provide an open baseline for future efficiency-oriented research.
中文标题/摘要
标题:轻量级基准揭示零样本表格基础模型的隐藏硬件成本
零样本基础模型(FMs)承诺在表格数据上实现无需训练的预测,但其硬件占用情况尚未得到充分描述。我们提供了一个完全可复现的基准测试,该测试在四个公共数据集(Adult-Income、Higgs-100k、Wine-Quality 和 California-Housing)上报告了测试准确率、墙钟延迟、峰值CPU RAM 和峰值GPU VRAM。在单个NVIDIA T4 GPU 上,两种开源FMs(TabPFN-1.0 和 TabICL-base)与调优后的XGBoost、LightGBM 和随机森林基线进行比较。树集合模型在三个数据集上的准确率不低于FM,同时完成全测试批次所需时间≤0.40秒且≤150MB RAM,无需使用VRAM。TabICL 在Higgs数据集上获得了0.8个百分点的提升,但需要大约40,000倍的延迟(960秒)和9GB VRAM。TabPFN 在Wine和Housing数据集上的准确率与树模型相当,但峰值VRAM达到4GB,无法处理完整的100,000行Higgs表格。这些结果量化了当前表格FMs中的硬件与准确率之间的重大权衡,并为未来效率导向的研究提供了开放基准。
Summary / 总结
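This study benchmarks zero-shot tabular foundation models (FMs) on four public datasets to reveal their hardware costs. Two open FMs, TabPFN-1.0 and TabICL-base, are compared with tuned XGBoost, LightGBM, and Random Forest baselines on a single NVIDIA T4 GPU. The tree ensembles equal or surpass FM accuracy on three datasets while completing full-test batches in at most 0.40 s with at most 150 MB of RAM and zero VRAM. TabICL gains 0.8 percentage points on Higgs but needs 960 s and 9 GB VRAM, and TabPFN peaks at 4 GB VRAM and cannot process the full 100k-row Higgs table, highlighting significant hardware-versus-accuracy trade-offs in current FMs.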
该研究旨在通过将零-shot表型基础模型(FMs)与传统树型集成模型在四个数据集上的表现进行对比,来评估其硬件需求。基准测试测量测试准确率、延迟和内存使用情况,结果显示树型集成模型可以在使用显著较少的VRAM和CPU RAM的情况下达到与FM相当的准确率。然而,如TabICL这样的FM需要更高的VRAM和延迟,表明在硬件成本和准确率之间存在显著的权衡。
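Illustrative sketch (not from the paper): the benchmark reports wall-clock latency together with peak CPU RAM and peak GPU VRAM. The Python below shows a minimal way to time a prediction call and record peak memory using only the standard library. The helper name `measure` is hypothetical. Note that `tracemalloc` tracks only allocations made through Python's memory allocator, so it is a rough proxy for process RAM and does not capture GPU VRAM at all; the authors' actual measurement tooling is not described in the abstract.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, seconds elapsed, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would call model.predict(X_test) here.
    result, seconds, peak_bytes = measure(lambda n: sum(list(range(n))), 100_000)
    print(f"latency: {seconds:.4f} s, peak traced memory: {peak_bytes / 1e6:.2f} MB")
```

A single run is noisy, so a fuller benchmark would repeat the call several times and report the median latency.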