arXiv Paper Digest

2025-08-27 16:54
Latest digest
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
First: 2025-08-26T17:59:47+00:00 · Latest: 2025-08-26T17:59:47+00:00
Comments: Project page: https://huanngzh.github.io/VoxHammer-Page/
Abstract
3D local editing of specified regions is crucial for game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
Summary
VoxHammer is a training-free approach for precise and coherent 3D editing in native 3D latent space, targeting two weaknesses of multi-view editing pipelines: imprecise preservation of unedited regions and poor overall coherence. It first inverts the input 3D model, caching the inverted latents and key-value tokens at each timestep. During denoising, it replaces the features of preserved regions with these cached values, so that preserved areas are reconstructed consistently and edited parts integrate coherently. On Edit3D-Bench, a human-annotated dataset with labeled 3D editing regions, VoxHammer outperforms existing methods in both 3D consistency of preserved regions and overall quality, and the authors suggest it could be used to synthesize paired editing data for in-context 3D generation.
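The replacement step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: latents are reduced to one scalar per voxel, the key-value token cache is omitted, and `denoise_step`, `replace_preserved`, and `edit_with_preservation` are hypothetical names introduced only for this example.

```python
def replace_preserved(edited, inverted, preserve_mask):
    """Per voxel: keep the cached inverted latent where the mask is True."""
    if not (len(edited) == len(inverted) == len(preserve_mask)):
        raise ValueError("inputs must have the same length")
    return [inv if keep else new
            for new, inv, keep in zip(edited, inverted, preserve_mask)]


def edit_with_preservation(inverted_trajectory, denoise_step, preserve_mask):
    """Denoise from the noisiest cached latent, re-imposing preserved voxels.

    inverted_trajectory[t] holds the latents cached during inversion, ordered
    from noisiest (index 0) to clean (last index). This ordering is an
    assumption of the sketch.
    """
    x = list(inverted_trajectory[0])
    for t in range(1, len(inverted_trajectory)):
        x = denoise_step(x, t)  # hypothetical edit-conditioned denoiser
        x = replace_preserved(x, inverted_trajectory[t], preserve_mask)
    return x


# Toy run: three voxels; the stand-in denoiser simply adds 10 to every voxel.
trajectory = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
mask = [True, False, True]  # voxels 0 and 2 are preserved, voxel 1 is edited
result = edit_with_preservation(trajectory, lambda x, t: [v + 10 for v in x], mask)
```

In the toy run, the preserved voxels end at exactly the cached clean values, while the edited voxel accumulates the denoiser's changes. That is the property the abstract attributes to retaining the inverted features.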
Articulate3D: Zero-Shot Text-Driven 3D Object Posing
Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
First: 2025-08-26T17:59:17+00:00 · Latest: 2025-08-26T17:59:17+00:00
Comments: Project page: https://odeb1.github.io/articulate3d_page_deb/
Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/
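Summary
Articulate3D is a training-free method for posing a 3D asset from a text instruction. It works in two steps. First, a modified image generator, using a self-attention rewiring mechanism (RSActrl) that decouples source structure from pose, produces target images. Second, the mesh is aligned to those images through multi-view pose optimisation driven by keypoint correspondences rather than differentiable rendering, which the authors found unreliable for articulation. In a comparative user study, the method was preferred over 85% of the time.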
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-08-22T23:34:37+00:00 · Latest: 2025-08-26T17:13:21+00:00
Abstract
Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment. These pipelines are rarely streamlined for data science pipeline, reducing efficiency and raising operational costs. Workflows also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. In response, we present a practical, healthcare-first framework that uses a single vision-language model (VLM) in two complementary roles. First (Solution 1), the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card id). Checks are provided by (i) stagewise prompts that allow early exit via None/Normal/Other and (ii) a stagewise answer selector that arbitrates between the top-2 candidates at each stage, reducing the chance of an incorrect selection and aligning the workflow with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on specialty-specific datasets ensuring a single model covers multiple downstream tasks within each specialty, maintaining performance while simplifying deployment. Across gastroenterology, hematology, ophthalmology, and pathology, our single-model deployment matches or approaches specialized baselines. Compared with pipelines composed of many task-specific agents, this approach shows that one VLM can both decide and do. It may reduce effort by data scientists, shorten monitoring, increase the transparency of model selection (with per-stage justifications), and lower integration overhead.
Summary
The paper addresses fragmented clinical workflows with a framework that uses a single vision-language model (VLM) in two roles. First, the VLM acts as a model-card matcher that routes an incoming image to the appropriate specialist model through a three-stage workflow (modality -> primary abnormality -> model-card id), with an early exit and a top-2 answer selector at each stage. Second, the VLM is fine-tuned on specialty-specific datasets so that one model covers multiple downstream tasks within each specialty. Across gastroenterology, hematology, ophthalmology, and pathology, the single-model deployment matches or approaches specialized baselines. The authors argue this may reduce data-science effort and integration overhead compared with pipelines of task-specific agents.
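The three-stage routing with early exit and top-2 arbitration can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `propose_top2` and `arbitrate` stand in for the two VLM prompts, and the labels and model-card ids in `TOP2` are invented for the demo.

```python
EARLY_EXIT = {"None", "Normal", "Other"}
STAGES = ("modality", "primary_abnormality", "model_card_id")


def route(image, propose_top2, arbitrate):
    """Return (model_card_id or None, per-stage trace for auditing)."""
    context, trace = {}, []
    for stage in STAGES:
        candidates = propose_top2(image, stage, context)
        choice = arbitrate(image, stage, candidates, context)
        trace.append({"stage": stage, "candidates": candidates, "choice": choice})
        if choice in EARLY_EXIT:
            return None, trace  # early exit: no specialist model is invoked
        context[stage] = choice
    return context["model_card_id"], trace


# Stand-ins for the two VLM calls; a real system would prompt the model here.
TOP2 = {
    "modality": ["endoscopy", "fundus"],
    "primary_abnormality": ["polyp", "Normal"],
    "model_card_id": ["polyp-seg-v1", "polyp-cls-v2"],  # hypothetical ids
}


def propose(image, stage, context):
    return TOP2[stage]


def pick_first(image, stage, candidates, context):
    return candidates[0]


def pick_second(image, stage, candidates, context):
    return candidates[1]


card, trace = route("img.png", propose, pick_first)
exit_card, exit_trace = route("img.png", propose, pick_second)
```

Returning the trace alongside the decision is one way to keep the per-stage justifications that the abstract mentions. With `pick_second`, the second stage selects "Normal" and routing stops before any model card is chosen.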
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu
First: 2025-05-29T23:32:03+00:00 · Latest: 2025-08-26T16:42:37+00:00
Comments: 16 pages
Abstract
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.
Summary
This paper explores the design space of multimodal retrieval-augmented generation (RAG) to address the limitations of large vision-language models (LVLMs) in dynamic real-world applications. It studies the retrieval phase, the re-ranking stage, and the generation phase of the pipeline, and proposes a unified agentic framework that integrates re-ranking and generation through self-reflection. The resulting configuration yields an average performance boost of 5% without any fine-tuning.
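The three phases the paper dissects can be pictured with a text-only toy pipeline. Everything here is an illustrative assumption: the word-overlap score replaces a real multimodal retriever, `is_relevant` replaces the model's self-reflective relevance judgment, and the generation call itself is left out, so only the prompt is built.

```python
def overlap(query, doc):
    """Fraction of query words that also appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query, corpus, k):
    """Retrieval phase: keep the k highest-scoring documents."""
    return sorted(corpus, key=lambda doc: overlap(query, doc), reverse=True)[:k]


def rerank(query, candidates, is_relevant):
    """Re-ranking stage: drop candidates judged irrelevant, keep the order."""
    return [doc for doc in candidates if is_relevant(query, doc)]


def build_prompt(query, evidence):
    """Generation phase input: numbered evidence followed by the question."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(evidence, 1))
    return f"Evidence:\n{context}\n\nQuestion: {query}"


corpus = [
    "the eiffel tower is in paris",
    "cricket scorecards list runs",
    "paris is the capital of france",
]
query = "where is the eiffel tower"
candidates = retrieve(query, corpus, k=2)
evidence = rerank(query, candidates, lambda q, d: overlap(q, d) >= 0.5)
prompt = build_prompt(query, evidence)
```

Keeping the stages as separate functions mirrors the paper's framing, where each phase can be varied independently.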
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar
First: 2025-07-07T22:31:56+00:00 · Latest: 2025-08-26T14:02:57+00:00
Abstract
Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain-specific fine-tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. This study tested seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives, analogous to peer consultation, or retrieval-augmented generation that incorporates medical literature at inference time, serving a role similar to reference-checking. While fine-tuning degraded performance in four of seven models with an average 30% decrease, baseline models collapsed on test data. Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy, maintaining performance on unseen data while generating explainable, literature-grounded outputs critical for clinical adoption. These findings demonstrate that medical AI succeeds by reconstructing the collaborative and evidence-based practices fundamental to clinical diagnosis.
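Summary
The study asks whether mimicking clinical reasoning can improve dermatology telemedicine, where clinicians diagnose from a handful of images and brief descriptions. It tests seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives or retrieval-augmented generation over medical literature. Fine-tuning degraded performance in four of the seven models, whereas the clinically inspired architectures achieved up to 70% accuracy, maintained performance on unseen data, and produced explainable, literature-grounded outputs.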
ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval
Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Venue: EMNLP 2025
First: 2025-08-26T13:42:48+00:00 · Latest: 2025-08-26T13:42:48+00:00
Comments: Accepted by EMNLP 2025 Findings
Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
Summary
ProPy adapts the pretrained vision-language model CLIP to Partially Relevant Video Retrieval (PRVR), where a query is relevant to only some segments of a video. It introduces a Prompt Pyramid that organizes event prompts at multiple granularity levels, and an Ancestor-Descendant Interaction Mechanism that enables dynamic semantic interaction among events in the pyramid. ProPy achieves state-of-the-art performance on three public datasets. Code is available at https://github.com/BUAAPY/ProPy.
ForgetMe: Evaluating Selective Forgetting in Generative Models
Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
First: 2025-04-17T01:44:57+00:00 · Latest: 2025-08-26T13:04:59+00:00
Abstract
The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.
Summary
The paper addresses selective unlearning in diffusion models for image generation. It proposes an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, and uses it to construct the ForgetMe dataset. It also introduces the Entangled metric, which quantifies unlearning effectiveness by assessing the similarity and consistency between target and background regions, for both paired (Entangled-D) and unpaired (Entangled-S) image data. LoRA fine-tuning of Stable Diffusion on this dataset is used to validate both the dataset and the metric as benchmarks for selective unlearning.
Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth
First: 2025-08-26T12:41:35+00:00 · Latest: 2025-08-26T12:41:35+00:00
Abstract
As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.
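Summary
GRAS is a benchmark for uncovering demographic biases in Vision Language Models (VLMs) across gender, race, age, and skin tone, paired with the GRAS Bias Score, an interpretable metric for quantifying bias. Across five state-of-the-art VLMs, the least biased model attains a GRAS Bias Score of only 2 out of 100. The authors also find that evaluating bias with visual question answering (VQA) requires considering multiple formulations of a question.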
Enhancing Document VQA Models via Retrieval-Augmented Generation
Authors: Eric López, Artemis Llabrés, Ernest Valveny
First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-26T12:32:55+00:00
Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China
Abstract
Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
Summary
This paper evaluates Retrieval-Augmented Generation (RAG) for Document VQA as a way to avoid the memory cost of processing every page of a multi-page document. It compares text-based retrieval over OCR tokens with purely visual retrieval across multiple models on MP-DocVQA, DUDE, and InfographicVQA. The text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant gains +5.0 ANLS without any text extraction. An ablation attributes most of the gain to the retrieval and reranking components, whereas layout-guided chunking does not help on these datasets.
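For readers unfamiliar with the metric behind these numbers, ANLS (Average Normalized Levenshtein Similarity) can be sketched as below. The 0.5 threshold and the lowercasing and whitespace stripping follow the common DocVQA convention and are assumptions here, since the abstract does not spell them out. The function returns a value in [0, 1], whereas the reported gains appear to be on a 0-100 scale.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any reference."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)


# One exact match, one near miss, one wrong answer.
score = anls(["paris", "pari", "xyz"],
             [["Paris"], ["paris", "Paris, France"], ["paris"]])
```

The threshold makes the metric forgiving of small OCR-style errors ("pari" still earns 0.8) while giving no credit to answers that are mostly wrong.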
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Authors: Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit
First: 2025-08-24T12:43:27+00:00 · Latest: 2025-08-26T12:16:26+00:00
Abstract
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.
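Summary
MMCRICBENCH-3K is a VQA benchmark on cricket scorecards for evaluating large vision-language models (LVLMs) on numerical and cross-lingual reasoning over semi-structured tabular images. It comprises 1,463 synthetically generated scorecard images accompanied by 1,500 English QA pairs, and includes an English-scorecard subset (MMCRICBENCH-E-1.5K) and a visually similar Hindi-scorecard subset (MMCRICBENCH-H-1.5K). State-of-the-art LVLMs such as GPT-4o and Qwen2.5VL struggle on the English subset and drop further on the Hindi subset.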
Prototype-Guided Diffusion: Visual Conditioning without External Memory
Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
First: 2025-08-13T16:18:35+00:00 · Latest: 2025-08-26T10:29:55+00:00
Abstract
Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.
Summary
The Prototype Diffusion Model (PDM) integrates prototype learning directly into the diffusion process to provide visual conditioning without external memory. Instead of retrieving reference samples, PDM builds a dynamic set of compact visual prototypes from clean image features using contrastive learning, and these prototypes guide denoising by aligning noisy representations with semantically relevant visual patterns. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead relative to retrieval-based conditioning.
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Authors: Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang
First: 2025-04-06T22:02:21+00:00 · Latest: 2025-08-26T10:19:05+00:00
Comments: COLM 2025, 30 pages, 10 figures, 16 tables
Abstract
Multimodal in-context learning (ICL) equips Large Vision-language Models (LVLMs) with the ability to adapt to new tasks via multiple user-provided demonstrations, without requiring any model parameter updates. However, its effectiveness is constrained by the token-intensive nature of multimodal inputs and the complexity of cross-modal few-shot reasoning, which together hinder LVLMs from extracting useful patterns from demonstrations. To address these challenges, we propose M$^2$IV, a novel representation engineering approach that replaces explicit token-level demonstrations with a set of learnable Multimodal In-context Vectors directly injected into the residual streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA) and multi-layer perceptrons (MLP) in the ICL process, we design a training strategy that enables M$^2$IV to perform fine-grained semantic distillation and robust cross-modal representation learning. M$^2$IV not only improves performance across diverse tasks and LVLMs but also significantly reduces token overhead, enabling graceful scaling to many-shot scenarios. To further enhance usability, we introduce VLibrary, a repository that stores trained M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer pre-trained LVLMs in a customized manner that meets diverse requirements. Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla ICL and prior representation engineering baselines, achieving an average accuracy gain of 3.74% with substantial improvements in overall efficiency.
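As a rough picture of what injecting a vector into a residual stream can mean, the sketch below adds a scaled vector to every token's hidden state at one layer. Additive injection with a single scalar `alpha` is an assumption made for illustration. The abstract does not give the exact form, and the paper distinguishes the MHA and MLP branches, which this toy does not.

```python
def inject(hidden_states, vector, alpha=1.0):
    """Add alpha * vector to each token's hidden state (one layer)."""
    if any(len(token) != len(vector) for token in hidden_states):
        raise ValueError("vector must match the hidden size of every token")
    return [[h + alpha * v for h, v in zip(token, vector)]
            for token in hidden_states]


# Two tokens with hidden size 2; the in-context vector is a stand-in
# for a learned one.
states = [[1.0, 2.0], [3.0, 4.0]]
steered = inject(states, [0.5, -1.0], alpha=2.0)
```

Because the vector replaces explicit demonstrations, the sequence length stays at two tokens here regardless of how many shots the vector was distilled from, which is the source of the token savings the abstract describes.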
Video CLIP Model for Multi-View Echocardiography Interpretation
Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda
First: 2025-04-26T05:11:15+00:00 · Latest: 2025-08-26T10:06:14+00:00
Abstract
Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each varying in suitability for detecting specific conditions. Leveraging multiple views may therefore improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models.
Summary
The authors developed a video-language model for echocardiography that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. The motivation is that single-frame inputs can miss conditions identifiable only through cardiac motion, and that views differ in their suitability for detecting specific conditions. The evaluation measures the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models.
Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models
Authors: Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou, Yong Xia
First: 2025-08-26T10:01:23+00:00 · Latest: 2025-08-26T10:01:23+00:00
Abstract
Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.
Summary
DualFairVL is a multimodal prompt-learning framework for fair medical diagnosis across demographic groups under distribution shifts caused by variations in imaging equipment and clinical practice. It jointly debiases and aligns cross-modal representations using a parallel dual-branch architecture that separates sensitive and target attributes. On eight medical imaging datasets across four modalities, it achieves state-of-the-art fairness and accuracy in both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters.
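The idea of removing a sensitive direction from a text embedding by linear projection can be sketched as follows. This is a generic orthogonal projection offered as an assumption about what such a step might look like. The abstract does not specify how DualFairVL constructs its approximately orthogonal anchors, and `remove_direction` is a name introduced here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def remove_direction(embedding, sensitive):
    """Project embedding onto the orthogonal complement of `sensitive`."""
    norm_sq = dot(sensitive, sensitive)
    if norm_sq == 0:
        raise ValueError("sensitive direction must be non-zero")
    scale = dot(embedding, sensitive) / norm_sq
    return [e - scale * s for e, s in zip(embedding, sensitive)]


# Toy 3-d embeddings: the first axis plays the role of the sensitive attribute.
target = [1.0, 1.0, 0.0]
sensitive = [1.0, 0.0, 0.0]
anchor = remove_direction(target, sensitive)
```

After the projection, the anchor has zero inner product with the sensitive direction while its remaining components are unchanged.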
Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment
Authors: Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
First: 2023-12-15T09:08:14+00:00 · Latest: 2025-08-26T08:50:35+00:00
Abstract
Learning to ground natural language queries to target objects or regions in 3D point clouds is essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which are time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence helps ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large-scale vision-language models, and extensive experiments on the ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves results comparable or even superior to those of fully supervised methods.
中文标题/摘要
标题:基于视觉语言对齐的弱监督3D视觉定位
学习将自然语言查询定位到3D点云中的目标对象或区域对于3D场景理解至关重要。然而,现有的3D视觉定位方法需要大量文本查询的边界框注释,这需要大量时间和劳动来获取。在本文中,我们提出了一种基于视觉语言对齐的弱监督3D视觉定位方法3D-VLA。我们的3D-VLA利用了当前大规模视觉语言模型(VLMs)在文本和2D图像之间对齐语义的能力,以及2D图像与3D点云之间自然存在的对应关系,从而在训练过程中隐式地构建了文本与3D点云之间的对应关系,而无需在训练过程中使用细粒度的框注释。在推理阶段,学习到的文本-3D对应关系将帮助我们在没有2D图像的情况下将文本查询定位到3D目标对象。据我们所知,这是首次通过使用大规模视觉语言模型来研究弱监督3D视觉定位的工作,我们在ReferIt3D和ScanRefer数据集上的广泛实验表明,我们的3D-VLA在与完全监督方法相当甚至更优的结果方面取得了成功。
Summary / 总结
This paper addresses the challenge of 3D visual grounding by proposing 3D-VLA, a weakly supervised approach that leverages visual linguistic alignment to align text queries with 3D point clouds without the need for bounding box annotations. The method exploits the alignment capabilities of large-scale vision-language models and the natural correspondences between 2D images and 3D point clouds, achieving comparable and sometimes better performance than fully supervised methods on the ReferIt3D and ScanRefer datasets.
本文提出了一种弱监督方法3D-VLA,通过视觉语言对齐将文本查询与3D点云对齐,而无需边界框注释。该方法利用大规模视觉语言模型的对齐能力以及2D图像与3D点云之间的自然对应关系,在ReferIt3D和ScanRefer数据集上的实验表明,3D-VLA的性能与完全监督的方法相当甚至更好。
Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models
Authors: Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu
First: 2025-08-26T08:40:22+00:00 · Latest: 2025-08-26T08:40:22+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose Hidden Tail, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that Hidden Tail outperforms existing attacks, increasing output length by up to 19.2× and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.
Summary / 总结
The research addresses the vulnerability of Vision-Language Models (VLMs) to resource consumption attacks by proposing a stealthy attack called Hidden Tail. This method crafts prompt-agnostic adversarial images that induce VLMs to generate maximum-length outputs by appending invisible special tokens. Experiments show that Hidden Tail significantly increases output length by up to 19.2 times while maintaining stealthiness, outperforming existing attacks. This highlights the need for VLMs to be more robust against such efficiency-oriented threats.
研究提出了名为Hidden Tail的隐蔽资源消耗攻击方法,通过生成与提示无关的对抗图像,并附加用户不可见的特殊标记,诱导视觉-语言模型生成最大长度的输出。实验表明,Hidden Tail可将输出长度最多提升至19.2倍,同时保持隐蔽性,突显了提高视觉-语言模型对效率导向的对抗威胁的鲁棒性的紧迫性。代码可在GitHub上获取。
Robust and Label-Efficient Deep Waste Detection
Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan
First: 2025-08-26T08:34:04+00:00 · Latest: 2025-08-26T08:34:04+00:00
Comments: Accepted to BMVC 2025
Abstract
Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
Summary / 总结
This research aims to improve AI-driven waste detection for sustainable recycling by addressing limitations in datasets and object detectors. The study benchmarks state-of-the-art models and introduces an ensemble-based semi-supervised learning framework. Key findings include enhanced zero-shot accuracy with LLM-optimized prompts, a new baseline of 51.6 mAP through transformer-based detector fine-tuning, and performance gains in semi-supervised training using a soft pseudo-labeling strategy. This work contributes by establishing rigorous baselines and generating high-quality annotations for unlabeled data.
研究旨在通过解决数据集和目标检测器的限制,提高AI驱动的垃圾分类检测,以促进可持续回收。研究对比了最先进的模型,并引入了一种基于集成的半监督学习框架。关键发现包括通过LLM优化提示增强零样本准确性,通过变压器检测器微调达到新的51.6 mAP基线,并通过软伪标签策略在半监督训练中获得性能提升。这项工作通过建立严格的基线和生成高质量的未标注数据注释做出了贡献。
Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
First: 2025-08-26T07:30:53+00:00 · Latest: 2025-08-26T07:30:53+00:00
Abstract
Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either "throwing" or "catching". When only "catching" is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing" is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.
Summary / 总结
The paper addresses the need for a new benchmark to evaluate human-object interaction (HOI) detection, considering both vision-language models (VLMs) and HOI-specific methods. It introduces a new evaluation protocol that reformulates HOI detection as a multiple-choice task, allowing for valid predictions to be recognized even in ambiguous cases. This approach avoids penalizing correct but non-exact matches, providing a fairer assessment for both VLMs and HOI-specific methods.
论文重新审视了人类物体交互(HOI)检测中视觉语言模型(VLMs)和专门的HOI方法的整合。它引入了一个新的基准,将HOI检测重新表述为一个多项选择题任务,允许对VLMs和HOI特定方法进行更准确的评估。这种新方法避免了在模糊情况下对有效预测的惩罚,提供了更公平的比较,并揭示了当前HOI理解的进展状态。
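The multiple-answer multiple-choice reformulation described in this entry can be illustrated with a small sketch. The function names, the `ambiguous_with` lookup table, and the exact-set scoring rule are assumptions made for illustration only; the abstract does not specify how negatives are curated or how answers are scored.

```python
def build_question(annotated, candidate_pool, ambiguous_with):
    """Build one question: annotated positives plus curated negatives.

    ambiguous_with is a hypothetical table mapping a label to labels that
    are plausibly valid for the same image (e.g. catching -> throwing).
    """
    positives = set(annotated)
    excluded = set()
    for label in positives:
        excluded |= ambiguous_with.get(label, set())
    # Negatives never include labels that could be a valid reading of the image.
    negatives = [c for c in candidate_pool
                 if c not in positives and c not in excluded]
    options = sorted(positives) + negatives
    return options, positives


def score(selected, positives):
    """One possible rule: full credit only when the selected set is exact."""
    return 1.0 if set(selected) == positives else 0.0


options, positives = build_question(
    annotated=["catching"],
    candidate_pool=["throwing", "catching", "kicking", "holding"],
    ambiguous_with={"catching": {"throwing"}},
)
```

In this toy case "throwing" is never offered as an option, so a model cannot be marked wrong for a reading of the image that is equally plausible.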
Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation
Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud
Venue: ICCV 2025
First: 2025-08-22T04:11:28+00:00 · Latest: 2025-08-26T06:32:51+00:00
Comments: CV4A11y@ICCV 2025
Abstract
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.
中文标题/摘要
标题:使用手语参数进行低资源手语教学生成的提示
手语(SL)为聋人和听力障碍社区提供了双向交流的手段,但许多手语在AI领域仍资源不足。手语教学生成(SLIG)产生逐步的文本指令,使非SL用户能够模仿和学习SL手势,促进双向互动。我们介绍了BdSLIG,这是第一个孟加拉语SLIG数据集,用于评估视觉语言模型(VLMs)在(i)低资源SLIG任务上的表现,以及(ii)长尾视觉概念上的表现,因为孟加拉语SL不太可能出现在VLM的预训练数据中。为了增强零样本性能,我们引入了手语参数融合(SPI)提示,将标准的手语参数,如手型、动作和方向,直接整合到文本提示中。将标准手语参数整合到提示中使得指令比纯自然文本提示更具结构化和可重复性。我们设想我们的工作将促进低资源社区手语学习系统的包容性和发展。
Summary / 总结
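This work targets Sign Language Instruction Generation (SLIG) for under-resourced sign languages. It introduces BdSLIG, the first Bengali SLIG dataset, to evaluate vision-language models on under-resourced SLIG tasks and long-tail visual concepts, and proposes Sign Parameter-Infused (SPI) prompting, which places standard sign parameters such as hand shape, motion, and orientation directly in the textual prompt. According to the authors, this makes the generated instructions more structured and reproducible than free-form text from vanilla prompting.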
Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding
Authors: Chufan Gao, Jintai Chen, Jimeng Sun
First: 2025-08-26T04:46:54+00:00 · Latest: 2025-08-26T04:46:54+00:00
Abstract
Automated tabular understanding and reasoning are essential tasks for data scientists. Recently, large language models (LLMs) have become increasingly prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning LLMs using labeled data or (2) training-free prompting of LLM agents using chain-of-thought (CoT). Finetuning offers dataset-specific learning at the cost of generalizability. Training-free prompting is highly generalizable but does not take full advantage of training data. In this paper, we propose a novel prompting-based reasoning approach, Learn then Retrieve: LRTab, which integrates the benefits of both by retrieving relevant information learned from training data. We first use prompting to obtain CoT responses over the training data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions to avoid the error, learning insights from the data. We validate the effectiveness of Prompt Conditions using validation data. Finally, at inference time, we retrieve the most relevant Prompt Conditions for additional context for table understanding. We provide comprehensive experiments on WikiTQ and Tabfact, showing that LRTab is interpretable, cost-efficient, and can outperform previous baselines in tabular reasoning.
中文标题/摘要
标题:利用训练数据提高LLM在表格理解中的推理能力
自动化表格理解和推理是数据科学家的重要任务。近年来,大型语言模型(LLMs)在表格推理任务中越来越普遍。以往的工作主要集中在(1)使用标记数据微调LLMs或(2)通过链式思考(CoT)无训练提示LLM代理。微调提供了特定数据集的学习,但牺牲了泛化能力。无训练提示具有高度的泛化能力,但并未充分利用训练数据。在本文中,我们提出了一种新颖的基于提示的推理方法——Learn then Retrieve: LRTab,该方法结合了两者的优势,通过检索从训练数据中学到的相关信息。我们首先使用提示获得CoT响应。对于错误的CoT,我们提示LLM预测提示条件以避免错误,并从数据中学习见解。我们使用验证数据验证提示条件的有效性。最后,在推理时,我们检索最相关的提示条件以提供额外的上下文以理解表格。我们在WikiTQ和Tabfact上进行了全面的实验,表明LRTab具有可解释性、成本效益,并且在表格推理中可以超越之前的基线。
Summary / 总结
This paper addresses the challenge of improving large language models (LLMs) for automated tabular understanding and reasoning. It introduces LRTab, a prompting-based approach that combines the benefits of finetuning and training-free prompting. By retrieving relevant information from training data and learning from incorrect chain-of-thought responses, LRTab enhances the model's interpretability and cost-efficiency, outperforming previous methods on WikiTQ and Tabfact datasets.
本文旨在提高大型语言模型(LLMs)在表格理解和推理方面的表现。它提出了LRTab,一种结合了微调和无训练提示优点的提示方法。通过从训练数据中检索相关信息并从错误的链式思考响应中学习,LRTab 提高了模型的可解释性和成本效益,在 WikiTQ 和 Tabfact 数据集上优于之前的基线方法。
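The inference-time step of LRTab, retrieving the most relevant Prompt Conditions and adding them as context, might look roughly like the sketch below. The abstract does not say how relevance is measured, so the Jaccard token overlap, the `store` layout, and all function names here are assumptions rather than the authors' implementation.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately crude choice)."""
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_conditions(query, store, k=2):
    """store: list of (training_question, prompt_condition) pairs."""
    q = tokenize(query)
    ranked = sorted(store,
                    key=lambda item: jaccard(q, tokenize(item[0])),
                    reverse=True)
    return [condition for _, condition in ranked[:k]]


def build_prompt(table_text, query, conditions):
    """Prepend the retrieved conditions to the table question."""
    hints = "\n".join(f"- {c}" for c in conditions)
    return (f"Table:\n{table_text}\n\n"
            f"Conditions to keep in mind:\n{hints}\n\n"
            f"Question: {query}")


store = [
    ("how many gold medals did france win",
     "Count rows, do not sum the rank column."),
    ("which year had the highest attendance",
     "Compare numeric values after removing commas."),
    ("who was the first player listed",
     "Use row order, not alphabetical order."),
]
top = retrieve_conditions("which year had the lowest attendance", store, k=1)
```

A real system would more likely use embedding similarity; the overlap measure is used here only to keep the sketch dependency-free.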
PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
Authors: Nanxi Li, Zhengyue Zhao, Chaowei Xiao
First: 2025-08-26T03:45:19+00:00 · Latest: 2025-08-26T03:45:19+00:00
Abstract
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated Safety in Multimodality), a System 2-like framework that aligns VLMs by embedding a structured, safety-aware reasoning process. Our framework consists of two key components: PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, which is generated via Monte Carlo Tree Search (MCTS) and further refines this reasoning through Direct Preference Optimization to help obtain a delicate safety boundary. Comprehensive evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack success rates, including 0.15% on JailbreakV-28K for Qwen2-VL and a 90% improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also exhibits strong robustness against adaptive attacks, significantly increasing computational costs for adversaries, and generalizes effectively to out-of-distribution challenges, reducing attack success rates to just 8.70% on the challenging multi-image MIS benchmark. Remarkably, this robust defense is achieved while preserving, and in some cases enhancing, model utility. To promote reproducibility, we have made our code, data, and model weights available at https://github.com/SaFoLab-WISC/PRISM.
Summary / 总结
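PRISM is a System 2-like framework that aligns vision-language models (VLMs) by embedding a structured, safety-aware reasoning process. It comprises PRISM-CoT, a dataset that teaches safety-aware chain-of-thought reasoning, and PRISM-DPO, which is generated via Monte Carlo Tree Search and refines this reasoning through Direct Preference Optimization. Reported attack success rates include 0.15% on JailbreakV-28K for Qwen2-VL and 8.70% on the multi-image MIS benchmark, with model utility preserved and in some cases enhanced.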
PRISM 是一个系统,旨在通过集成结构化的安全意识推理过程来对视觉语言模型(VLMs)进行对齐。它包括 PRISM-CoT,一个用于教授安全意识链式思考推理的数据集,以及 PRISM-DPO,通过直接偏好优化进一步细化这种推理。PRISM 在低攻击成功率和对适应性攻击的强大鲁棒性方面表现出色,显著优于先前的方法。它还能够很好地泛化到新的分布挑战中。
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Authors: Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang
Venue: EMNLP 2025
First: 2025-05-27T05:17:41+00:00 · Latest: 2025-08-26T03:25:38+00:00
Comments: Accepted by EMNLP 2025 Main conference
Abstract
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world. It relies on a nuanced understanding of spatial structures and inter-object relationships, serving as the foundation for complex reasoning and decision-making. To investigate whether current vision-language models (VLMs) exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate VLMs' spatial perception, structural understanding, and reasoning capabilities, while deliberately minimizing reliance on domain-specific knowledge to better isolate and assess the general spatial reasoning capability. We conduct a comprehensive evaluation across 24 state-of-the-art VLMs. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task, with only 30.00% accuracy, far below the performance exceeding 90% achieved by human participants. This persistent gap underscores the need for continued progress, positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for advancing spatial reasoning research in VLMs. Our project page is at https://zesen01.github.io/jigsaw-puzzles.
Summary / 总结
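Jigsaw-Puzzles is a benchmark of 1,100 curated real-world images with high spatial complexity, paired with five tasks that evaluate the spatial perception, structural understanding, and reasoning capabilities of vision-language models (VLMs). Across 24 state-of-the-art VLMs, the strongest model, Gemini-2.5-Pro, reaches 77.14% overall accuracy and only 30.00% on the Order Generation task, far below the above-90% performance of human participants.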
研究旨在通过引入包含1,100张真实世界图像的新基准Jigsaw-Puzzles,评估视觉语言模型(VLMs)的空间推理能力。设计了五个任务来测试VLMs的空间感知、结构理解和推理能力。尽管最强的Gemini-2.5-Pro总体准确率为77.14%,但在Order Generation任务上的准确率仅为30.00%,远低于人类参与者的表现。这表明需要进一步的研究来提高VLMs的空间推理能力。该基准旨在诊断和推进VLMs的空间推理研究。
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin
Venue: EMNLP 2025
First: 2024-11-25T02:15:30+00:00 · Latest: 2025-08-26T02:48:31+00:00
Comments: Accepted by EMNLP-2025 Main. Project page: https://szhanz.github.io/zoomeye/
Abstract
An image, especially a high-resolution one, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models (MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, in which each child node represents a zoomed sub-patch of its parent node and the root represents the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLM to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks, and the results demonstrate that Zoom Eye not only consistently improves the performance of a series of base MLLMs by a large margin (e.g., LLaVA-v1.5-7B increases by 34.57% on V* Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at https://github.com/om-ai-lab/ZoomEye.
Summary / 总结
ZoomEye is a tree search algorithm designed to help multimodal large language models (MLLMs) better explore and utilize detailed information in high-resolution images. By conceptualizing images as trees, ZoomEye enables MLLMs to simulate human-like zooming actions, improving their performance on various benchmarks. The results show that ZoomEye significantly enhances the performance of both large and small MLLMs, with some small models even outperforming larger ones like GPT-4o.
ZoomEye 是一种树搜索算法,能够增强多模态大型语言模型(MLLMs),使其能够模拟人类在图像上的放大动作。它将图像视为树结构,允许 MLLMs 在图像的不同部分进行导航和探索。实验表明,ZoomEye 在高分辨率基准测试中显著提升了各种 MLLMs 的性能,甚至使小型 7B 模型能够超越大型模型如 GPT-4o。
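The image-as-tree idea in this entry can be sketched as a best-first search over progressively smaller sub-patches. This is only a minimal illustration under stated assumptions: the quadrant split, the priority-queue ordering, and the `relevance` and `answerable` callables (stand-ins for whatever signal an MLLM would provide) are hypothetical and are not taken from the ZoomEye code base.

```python
import heapq


def children(box, min_size):
    """Split a box (x0, y0, x1, y1) into four quadrants, or none if too small."""
    x0, y0, x1, y1 = box
    if (x1 - x0) < 2 * min_size or (y1 - y0) < 2 * min_size:
        return []
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]


def zoom_search(image_size, relevance, answerable, min_size=32):
    """Explore the patch tree from the root, most relevant patch first.

    relevance(box) -> float and answerable(box) -> bool are hypothetical
    callbacks standing in for model-derived scores.
    """
    width, height = image_size
    root = (0, 0, width, height)
    frontier = [(-relevance(root), root)]  # max-heap via negated score
    visited = []
    while frontier:
        _, box = heapq.heappop(frontier)
        visited.append(box)
        if answerable(box):
            return box, visited
        for child in children(box, min_size):
            heapq.heappush(frontier, (-relevance(child), child))
    return root, visited  # fall back to the whole image


# Toy stand-ins: the "detail" of interest sits at pixel (700, 100).
def contains_target(box):
    x0, y0, x1, y1 = box
    return x0 <= 700 < x1 and y0 <= 100 < y1


found, path = zoom_search(
    (1024, 1024),
    relevance=lambda b: 1.0 if contains_target(b) else 0.0,
    answerable=lambda b: contains_target(b) and (b[2] - b[0]) <= 128,
)
```

With these toy callbacks the search descends one branch per level, visiting the root and three nested patches before stopping at a 128-pixel patch around the target.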
The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
Authors: Girish A. Koushik, Fatemeh Nazarieh, Katherine Birch, Shenbin Qian, Diptesh Kanojia
First: 2025-08-26T00:04:01+00:00 · Latest: 2025-08-26T00:04:01+00:00
Comments: Under Review
Abstract
Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. In a user-facing study of our framework's outputs, participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged out Imagen on abstract metaphors. Our analyses show that S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.
Summary / 总结
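This work proposes a self-evaluating framework for visual metaphor generation that focuses on metaphor alignment, combining existing metrics with a new metaphor decomposition score and a meaning alignment (MA) metric. It explores a training-free pipeline that decomposes prompts into source-target-meaning (S-T-M) mappings and a training-based pipeline that uses the self-evaluation reward schema. On the held-out test set, the training-free approach surpasses GPT-4o and Imagen on decomposition, CLIP, and MA scores, although participants in a user study preferred GPT-4o overall.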
Generic Guard AI in Stealth Game with Composite Potential Fields
Authors: Kaijie Xu, Clark Verbrugge
First: 2025-08-25T21:56:13+00:00 · Latest: 2025-08-25T21:56:13+00:00
Abstract
Guard patrol behavior is central to the immersion and strategic depth of stealth games, yet most existing systems rely on hand-crafted routes or specialized logic that struggle to balance coverage efficiency and responsive pursuit with believable naturalness. We propose a generic, fully explainable, training-free framework that integrates global knowledge and local information via Composite Potential Fields, combining three interpretable maps (Information, Confidence, and Connectivity) into a single kernel-filtered decision criterion. Our parametric, designer-driven approach requires only a handful of decay and weight parameters, with no retraining, to adapt smoothly across both occupancy-grid and NavMesh-partition abstractions. We evaluate on five representative game maps, two player-control policies, and five guard modes, confirming that our method outperforms classical baseline methods in both capture efficiency and patrol naturalness. Finally, we show how common stealth mechanics (distractions and environmental elements) integrate naturally into our framework as submodules, enabling rapid prototyping of rich, dynamic, and responsive guard behaviors.
中文标题/摘要
标题:潜行游戏中基于复合势场的通用守卫AI
守卫巡逻行为是潜行游戏沉浸感和战略深度的核心,而现有的大多数系统依赖于手工设计的路线或专门逻辑,难以在覆盖效率和响应追击之间取得自然可信的平衡。我们提出了一种通用的、完全可解释的、无需训练的框架,通过复合潜力场整合全局知识和局部信息,将信息、信心和连接性三个可解释的地图合并为一个内核过滤决策标准。我们的参数化、设计师驱动的方法只需要少量衰减和权重参数——无需重新训练——即可平滑适应占用网格和NavMesh分区抽象。我们在五个代表性游戏地图、两种玩家控制策略和五种守卫模式上进行了评估,确认我们的方法在捕获效率和巡逻自然性方面均优于经典基准方法。最后,我们展示了如何将常见的潜行机制——干扰和环境元素——自然地整合到我们的框架中作为子模块,从而实现丰富、动态和响应式的守卫行为的快速原型设计。
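The combination of three maps into one kernel-filtered decision criterion can be illustrated on a small occupancy grid. The sketch below assumes a weighted sum of the Information, Confidence, and Connectivity maps followed by a 3x3 mean filter and an argmax; those specific choices, the weight values, and the function names are illustrative assumptions, and the decay parameters mentioned in the abstract are not modelled.

```python
def combine(info, conf, conn, weights=(1.0, 1.0, 1.0)):
    """Weighted per-cell sum of the three maps (lists of equal-sized rows)."""
    wi, wc, wn = weights
    rows, cols = len(info), len(info[0])
    return [[wi * info[r][c] + wc * conf[r][c] + wn * conn[r][c]
             for c in range(cols)] for r in range(rows)]


def box_filter(grid):
    """Mean over each cell's 3x3 neighbourhood, clipped at the borders."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            vals = [grid[rr][cc]
                    for rr in range(max(0, r - 1), min(rows, r + 2))
                    for cc in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = sum(vals) / len(vals)
    return out


def next_target(info, conf, conn, weights=(1.0, 1.0, 1.0)):
    """Return the (row, col) with the highest filtered potential.

    Ties are broken in favour of the smallest row, then the smallest column.
    """
    field = box_filter(combine(info, conf, conn, weights))
    best = max((field[r][c], -r, -c)
               for r in range(len(field)) for c in range(len(field[0])))
    return (-best[1], -best[2])


zeros = [[0.0] * 3 for _ in range(3)]
info = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 9.0]]
conf = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_all = next_target(info, zeros, zeros)
target_conf_only = next_target(info, conf, zeros, weights=(0.0, 1.0, 0.0))
```

Changing only the weight tuple redirects the guard from the high-information corner to the high-confidence corner, which is the kind of designer-facing tuning the abstract describes.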
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Authors: Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei
First: 2025-06-10T22:47:16+00:00 · Latest: 2025-08-25T19:45:29+00:00
Abstract
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods of visual affordance prediction often rely on manually annotated data or condition only on a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed <instruction, visual affordance> pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes and to various human activities, despite only being trained on rendered objects in simulation. Using affordance provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/
中文标题/摘要
标题:UAD:无监督功能蒸馏在机器人操作中的一般化
理解细粒度的对象功能对于机器人在未结构化环境中根据开放任务指令操作对象至关重要。然而,现有的视觉功能预测方法往往依赖于手动标注的数据或仅限于预定义的任务集。我们引入了UAD(无监督功能蒸馏),这是一种从基础模型中提取功能知识的方法,无需任何手动标注。通过利用大型视觉模型和视觉语言模型的互补优势,UAD 自动标注了一个包含详细 <指令,视觉功能> 对的大型数据集。仅在冻结特征上训练一个轻量级的任务条件解码器,UAD 在真实环境中的机器人场景和各种人类活动中表现出显著的泛化能力,尽管仅在模拟中的渲染对象上进行训练。使用UAD提供的功能作为观察空间,我们展示了模仿学习策略,在仅训练10个演示后,能够对未见过的对象实例、对象类别,甚至任务指令的变化表现出显著的泛化能力。项目网站:https://unsup-affordance.github.io/
Summary / 总结
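UAD (Unsupervised Affordance Distillation) distills affordance knowledge from foundation models into a task-conditioned affordance model without manual annotations. It uses large vision models and vision-language models to automatically annotate a large-scale dataset of <instruction, visual affordance> pairs, then trains only a lightweight task-conditioned decoder on frozen features. Although trained only on rendered objects in simulation, UAD generalizes to in-the-wild robotic scenes and human activities, and an imitation learning policy that uses its affordances generalizes to unseen object instances, categories, and instruction variations after as few as 10 demonstrations.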
UAD 是一种无需人工标注的无监督方法,从基础模型中提取细粒度的物体操作能力知识并注入到任务条件模型中。通过利用大视觉模型和视觉语言模型的优势,UAD 自动标注了一个包含详细指令-视觉操作能力配对的大规模数据集。尽管仅在模拟中对渲染物体进行训练,UAD 在现实机器人场景和各种人类活动中仍表现出显著的泛化能力,并且使用 UAD 提供的操作能力进行模仿学习的策略在少量示范(仅 10 个)后也展示了对未见过的物体实例和任务指令的泛化能力。
CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering
Authors: Aranya Saha, Tanvir Ahmed Khan, Ismam Nur Swapnil, Mohammad Ariful Haque
First: 2025-08-25T19:22:16+00:00 · Latest: 2025-08-25T19:22:16+00:00
Comments: 10 pages, 8 figures, Prepared for submission to IEEE Transactions on Human-Machine Systems
Abstract
Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist's predictions directly guide the Generalist's reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist's responses in factual dermatological knowledge, ensuring both accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.
中文标题/摘要
标题:CLARIFY:一种用于皮肤病视觉问答的专业-通用框架
视觉语言模型(VLMs)在医疗任务中显示出显著的潜力;然而,它们的一般用途限制了专门诊断的准确性,并且它们的大型尺寸在实际临床部署中带来了重大的推理成本。为了解决这些挑战,我们引入了CLARIFY,一种皮肤病视觉问答(VQA)的专业-通用框架。CLARIFY 结合了两个组件:(i)一个轻量级、领域训练的图像分类器(专家),提供快速且高度准确的诊断预测;(ii)一个强大但压缩的对话VLM(通用者),生成自然语言解释以回答用户查询。在我们的框架中,专家的预测直接引导通用者的推理,使其专注于正确的诊断路径。通过基于知识图谱的检索模块,进一步增强了这种协同作用,该模块使通用者的回答基于事实性的皮肤病知识,确保了准确性和可靠性。这种分层设计不仅减少了诊断错误,还显著提高了计算效率。在我们精心策划的多模态皮肤病数据集上的实验表明,CLARIFY 在诊断准确性上比最强基线(微调的、未压缩的一行VLM)提高了18%,同时将平均VRAM需求和延迟分别减少了至少20%和5%。这些结果表明,专业-通用系统为构建轻量级、可信且临床可行的AI系统提供了一种实用而强大的范式。
Summary / 总结
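CLARIFY is a Specialist-Generalist framework for dermatological visual question answering. A lightweight, domain-trained image classifier (the Specialist) supplies diagnostic predictions that guide a compressed conversational VLM (the Generalist), and a knowledge graph-based retrieval module grounds the responses in dermatological knowledge. On the authors' curated dataset, CLARIFY improves diagnostic accuracy by 18% over the strongest baseline, a fine-tuned, uncompressed VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively.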
CLARIFY 是一种专家-通用框架,通过结合一个轻量级、领域训练的图像分类器(专家)和一个压缩的对话型视觉语言模型(通用),来提升皮肤病视觉问答(VQA)的性能。专家提供快速且准确的诊断预测,指导通用模型的推理,同时基于知识图谱的检索模块确保了事实上的准确性。实验结果显示,CLARIFY 的诊断准确率提高了 18%,并且减少了至少 20% 的 VRAM 要求和 5% 的延迟,相比一个微调且未压缩的视觉语言模型而言。
SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation
Authors: Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, Ziwei Wang
First: 2025-08-25T17:59:02+00:00 · Latest: 2025-08-25T17:59:02+00:00
Comments: Project website is at: https://denghaoyuan123.github.io/SafeBimanip/
Abstract
Bimanual manipulation has been widely applied in household services and manufacturing, where it enables the completion of complex tasks that require coordination. Recent diffusion-based policy learning approaches have achieved promising performance in modeling action distributions for bimanual manipulation. However, they ignore the physical safety constraints of bimanual manipulation, which leads to dangerous behaviors that damage robots and objects. To this end, we propose SafeBimanual, a test-time trajectory optimization framework for any pre-trained diffusion-based bimanual manipulation policy, which imposes safety constraints on bimanual actions to avoid dangerous robot behaviors while improving the success rate. Specifically, we design diverse cost functions for safety constraints in different dual-arm cooperation patterns, including avoiding tearing objects and collisions between arms and objects, and use them to optimize the manipulator trajectories through guided sampling in the diffusion denoising process. Moreover, we employ a vision-language model (VLM) to schedule the cost functions by specifying keypoints and their pairwise relationships, so that the optimal safety constraint is dynamically generated throughout the entire bimanual manipulation process. SafeBimanual demonstrates superiority on 8 simulated tasks in RoboTwin, with a 13.7% increase in success rate and an 18.8% reduction in unsafe interactions over state-of-the-art diffusion-based methods. Extensive experiments on 4 real-world tasks further verify its practical value, improving the success rate by 32.5%.
中文标题/摘要
标题:SafeBimanual:基于扩散的双臂操作轨迹优化方法
双臂操作已在家庭服务和制造中广泛应用,能够完成需要协调配合的复杂任务。最近的基于扩散的策略学习方法在双臂操作的动作分布建模方面取得了令人鼓舞的性能。然而,它们忽略了双臂操作的物理安全约束,导致了对机器人和物体造成损害的危险行为。为此,我们提出了一种名为SafeBimanual的测试时轨迹优化框架,该框架对任何预训练的基于扩散的双臂操作策略施加安全约束,以提高成功率并避免危险的机器人行为。具体而言,我们为不同的双臂合作模式设计了多种成本函数,包括避免物体撕裂和手臂与物体之间的碰撞,并通过对扩散去噪过程进行引导采样来优化机械臂轨迹。此外,我们采用视觉语言模型(VLM)通过指定关键点及其对应的配对关系来调度成本函数,从而在整个双臂操作过程中动态生成最优的安全约束。SafeBimanual在RoboTwin中的8个模拟任务中表现出优越性,与最先进的基于扩散的方法相比,成功率提高了13.7%,不安全交互减少了18.8%。在4个真实世界任务的广泛实验中,其成功率提高了32.5%,进一步验证了其实用价值。
Summary / 总结
SafeBimanual is a test-time trajectory optimization framework that makes bimanual manipulation safer by imposing physical safety constraints on any pre-trained diffusion-based policy. It defines cost functions for different dual-arm cooperation patterns (for example, avoiding tearing objects and avoiding collisions between arms and objects) and uses them to guide sampling in the diffusion denoising process, while a vision-language model schedules the cost functions so that the safety constraint adapts over the course of the task. On 8 simulated RoboTwin tasks it achieves a 13.7% increase in success rate and an 18.8% reduction in unsafe interactions over state-of-the-art diffusion-based methods, and on 4 real-world tasks it improves the success rate by 32.5%.
SafeBimanual 是一个测试时轨迹优化框架,旨在通过将物理安全约束纳入基于扩散的策略来提高双臂操作的安全性。它使用多样化的成本函数来优化轨迹,并使用视觉语言模型动态生成最优的安全约束。SafeBimanual 在模拟任务中显示出 13.7% 的成功率提升和 18.8% 的不安全交互减少,在实际任务中进一步提高了 32.5% 的成功率。
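The idea of steering a sampled trajectory with a safety cost can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the hypothetical `collision_cost` penalizes the two end-effectors for coming closer than a margin `d_min`, and `guidance_step` moves both trajectories one gradient step down that cost. In a guided sampler, a step of this kind would be applied to the sample after each denoising update; the denoiser itself is omitted.

```python
import math

def collision_cost(left, right, d_min):
    """Hypothetical safety cost: squared violation of a minimum
    distance between the two end-effectors, summed over timesteps."""
    cost = 0.0
    for l, r in zip(left, right):
        d = math.dist(l, r)
        cost += max(0.0, d_min - d) ** 2
    return cost

def guidance_step(left, right, d_min, scale):
    """One gradient-descent step on collision_cost.

    For a timestep with 0 < d < d_min, the gradient of (d_min - d)^2
    with respect to the left point is -2 (d_min - d) (l - r) / d, and
    the gradient with respect to the right point is its negative.
    """
    new_left, new_right = [], []
    for l, r in zip(left, right):
        d = math.dist(l, r)
        if 0.0 < d < d_min:
            coef = 2.0 * (d_min - d) / d
            step = [scale * coef * (a - b) for a, b in zip(l, r)]
            l = [a + s for a, s in zip(l, step)]  # push left away from right
            r = [b - s for b, s in zip(r, step)]  # push right away from left
        new_left.append(list(l))
        new_right.append(list(r))
    return new_left, new_right

# Two timesteps: the arms are too close at t=0 and far apart at t=1.
left = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
right = [[0.05, 0.0, 0.0], [1.0, 0.0, 0.0]]
before = collision_cost(left, right, d_min=0.1)
new_left, new_right = guidance_step(left, right, d_min=0.1, scale=0.1)
after = collision_cost(new_left, new_right, d_min=0.1)
```

Only the timestep that violates the margin is modified, so the guidance leaves already-safe parts of the trajectory untouched. The abstract describes several cost functions scheduled by a VLM; this sketch shows a single fixed cost only.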
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
First: 2025-08-25T17:57:49+00:00 · Latest: 2025-08-25T17:57:49+00:00
Comments: Project page: https://project.ironieser.cc/mmtok
Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual input to vision tokens. However, redundancy in vision tokens results in the degenerated inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision/text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, it lacks a generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the criterion of coverage. We first formulate the subset selection problem as a maximum coverage problem. Afterward, a subset of vision tokens is optimized to cover the text tokens and the original set of vision tokens, simultaneously. Finally, a VLM agent can be adopted to further improve the quality of text tokens for guiding vision pruning. The proposed method MMTok is extensively evaluated on benchmark datasets with different VLMs. The comparison illustrates that vision and text information are complementary, and combining multimodal information can surpass the unimodal baseline with a clear margin. Moreover, under the maximum coverage criterion on the POPE dataset, our method achieves a 1.87x speedup while maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore, with only four vision tokens, it still preserves 87.7% of the original performance on LLaVA-1.5-7B. These results highlight the effectiveness of coverage in token selection.
Summary / 总结
The research aims to improve the efficiency of Vision-Language Models (VLMs) by reducing redundant vision tokens while preserving performance. The method, MMTok, leverages both vision and text tokens to select informative vision tokens based on the criterion of coverage. Experiments show that combining multimodal information outperforms unimodal baselines, achieving a 1.87x speedup with 98.7% of original performance on LLaVA-NeXT-13B and 87.7% performance with only four vision tokens on LLaVA-1.5-7B.
研究旨在通过减少冗余的视觉标记来提高视觉语言模型(VLMs)的效率,同时保持性能。方法MMTok利用视觉和文本标记来选择基于覆盖度准则的信息性视觉标记。实验表明,结合多模态信息优于单模态基线,实现1.87倍的速度提升,同时保持LLaVA-NeXT-13B的98.7%性能,并且仅使用四个视觉标记仍能保持LLaVA-1.5-7B的87.7%性能。
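Selecting vision tokens under a coverage criterion can be sketched with a standard greedy routine. This is a minimal illustration under stated assumptions, not the MMTok implementation: coverage is modeled here as a facility-location objective (each target token is credited with its best cosine similarity to any selected vision token, clamped at zero), the targets are the text tokens plus all original vision tokens, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def coverage(selected, vision, targets):
    """Assumed coverage score: sum over targets of the best
    non-negative similarity to any selected vision token."""
    if not selected:
        return 0.0
    return sum(
        max(max(cosine(vision[i], t), 0.0) for i in selected)
        for t in targets
    )

def greedy_select(vision, text, k):
    """Greedily pick up to k vision-token indices that maximize coverage
    of the text tokens and the original vision tokens together."""
    targets = list(text) + list(vision)
    selected = []
    for _ in range(min(k, len(vision))):
        base = coverage(selected, vision, targets)
        best, best_gain = None, -1.0
        for i in range(len(vision)):
            if i in selected:
                continue
            gain = coverage(selected + [i], vision, targets) - base
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Three toy vision tokens (two nearly identical) and one text token.
vision = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
text = [[0.0, 1.0]]
picked = greedy_select(vision, text, k=2)
```

With a budget of two, the routine keeps one of the two near-duplicate tokens and the token aligned with the text, which is the behavior a coverage criterion is meant to encourage. This version recomputes coverage from scratch for clarity; a practical implementation would cache similarities.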
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
First: 2025-08-25T16:33:07+00:00 · Latest: 2025-08-25T16:33:07+00:00
Comments: COLM 2025
Abstract
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
中文标题/摘要
标题:SEAM:面向视觉-语言模型的跨模态语义等价基准
评估视觉-语言模型(VLMs)在不同表示形式中是否一致地进行推理具有挑战性,因为模态比较通常受到任务差异和信息不对称的影响。我们引入了SEAM,这是一个基准,它在四个已有标准化文本和视觉符号表示的领域中配对语义等价的输入。通过在不同模态中使用不同的符号系统,而不是基于OCR的图像-文本配对,SEAM为VLMs的文本-符号和视觉-空间推理能力提供了严格的比较评估。在21个当代模型中,我们观察到系统性的模态不平衡:尽管问题包含语义等价的信息,视觉在整体性能上经常落后于语言,并且跨模态的一致性相对较低。我们的错误分析揭示了两个主要驱动因素:领域符号表示的分词(tokenization)所导致的文本感知失败,以及引发幻觉的视觉感知失败。我们还表明,我们的结果对视觉变换具有很大程度的稳健性。SEAM为测量和提高模态无关推理建立了受控的语义等价环境。
Summary / 总结
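SEAM is a benchmark for testing whether vision-language models (VLMs) reason consistently across modalities. It pairs semantically equivalent inputs across four domains that have standardized textual and visual notations, avoiding the task differences and asymmetric information that usually confound modality comparisons. Across 21 contemporary models, vision frequently lags language in overall performance and cross-modal agreement is relatively low. Error analysis points to two main drivers: textual perception failures caused by tokenization of domain notation, and visual perception failures that induce hallucinations. The results are largely robust to visual transformations.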
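To make the notions of modality imbalance and cross-modal agreement concrete, here is a small sketch. It assumes a simple definition that may differ from the benchmark's own metric: agreement is the fraction of paired items on which a model gives the same answer for the textual and the visual version of a problem. The function name `modality_report` is hypothetical.

```python
def modality_report(text_answers, image_answers, gold):
    """Per-modality accuracy and cross-modal agreement for paired items.

    Item i of each list refers to the same underlying problem, posed
    once in textual notation and once in visual notation.
    """
    n = len(gold)
    if n == 0 or not (len(text_answers) == len(image_answers) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    text_acc = sum(t == g for t, g in zip(text_answers, gold)) / n
    image_acc = sum(v == g for v, g in zip(image_answers, gold)) / n
    # Agreement ignores correctness: it only asks whether the two
    # modalities led the model to the same answer.
    agreement = sum(t == v for t, v in zip(text_answers, image_answers)) / n
    return {"text_acc": text_acc, "image_acc": image_acc, "agreement": agreement}

report = modality_report(
    text_answers=["A", "B", "C", "D"],
    image_answers=["A", "C", "C", "A"],
    gold=["A", "B", "C", "D"],
)
```

Because agreement is computed independently of the gold labels, a model can agree with itself across modalities while being wrong in both, so it is reported alongside accuracy rather than in place of it.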
History