VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
First: 2025-08-26T17:59:47+00:00 · Latest: 2025-08-26T17:59:47+00:00
Comments: Project page: https://huanngzh.github.io/VoxHammer-Page/
Abstract
3D local editing of specified regions is crucial for the game industry and robot
interaction. Recent methods typically edit rendered multi-view images and then
reconstruct 3D models, but they face challenges in precisely preserving
unedited regions and overall coherence. Inspired by structured 3D generative
models, we propose VoxHammer, a novel training-free approach that performs
precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer
first predicts its inversion trajectory and obtains its inverted latents and
key-value tokens at each timestep. Subsequently, in the denoising and editing
phase, we replace the denoising features of preserved regions with the
corresponding inverted latents and cached key-value tokens. By retaining these
contextual features, this approach ensures consistent reconstruction of
preserved areas and coherent integration of edited parts. To evaluate the
consistency of preserved regions, we constructed Edit3D-Bench, a
human-annotated dataset comprising hundreds of samples, each with carefully
labeled 3D editing regions. Experiments demonstrate that VoxHammer
significantly outperforms existing methods in terms of both 3D consistency of
preserved regions and overall quality. Our method holds promise for
synthesizing high-quality edited paired data, thereby laying the data
foundation for in-context 3D generation. See our project page at
https://huanngzh.github.io/VoxHammer-Page/.
Summary / 总结
VoxHammer is a training-free approach for precise and coherent 3D editing in native 3D space, addressing the difficulty of preserving unedited regions and maintaining overall coherence. It predicts the inversion trajectory of a 3D model, caches the inverted latents and key-value tokens at each timestep, and during denoising replaces the features of preserved regions with these cached values, so that unedited areas are reconstructed consistently and edited parts integrate coherently. Experiments show that VoxHammer outperforms existing methods in both 3D consistency of preserved regions and overall quality, and the authors suggest it is suited to synthesizing high-quality edited paired data for in-context 3D generation.
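VoxHammer is a training-free 3D editing method that performs precise and coherent editing in native 3D space. It predicts the inversion trajectory of a 3D model to obtain its latents and key-value tokens, then replaces the denoising features of preserved regions with these latents during editing. This ensures consistent reconstruction of unedited regions and coherent integration of the edited parts. Experiments show that VoxHammer outperforms existing methods in 3D consistency and overall quality, and is suited to synthesizing high-quality edited paired data, laying a data foundation for in-context 3D generation.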
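The preserved-region replacement described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: latents are flat Python lists, `denoise_step` is a hypothetical stand-in for the generative model, and the replacement of cached key-value tokens is omitted.

```python
def blend_preserved(denoised, inverted, keep_mask):
    """Keep the inverted latent wherever keep_mask is True, else the denoised one."""
    return [inv if keep else den
            for den, inv, keep in zip(denoised, inverted, keep_mask)]


def edit_with_preservation(inverted_trajectory, keep_mask, denoise_step):
    """Run a toy denoising loop that re-injects cached inverted latents.

    inverted_trajectory[t] holds the latents cached at timestep t during
    inversion (index 0 is the clean input, the last index is the noisiest).
    denoise_step(latents, t) is a hypothetical model call returning the
    latents at timestep t - 1.
    """
    last = len(inverted_trajectory) - 1
    latents = list(inverted_trajectory[last])  # start from the fully inverted latents
    for t in range(last, 0, -1):
        latents = denoise_step(latents, t)
        latents = blend_preserved(latents, inverted_trajectory[t - 1], keep_mask)
    return latents


# Toy usage: positions 0 and 2 are preserved, position 1 is free to change.
trajectory = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
mask = [True, False, True]
result = edit_with_preservation(trajectory, mask, lambda x, t: [v * 0.5 for v in x])
```

In this toy run the preserved positions end exactly at the clean input values, while the unmasked position follows the stand-in denoiser.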
Articulate3D: Zero-Shot Text-Driven 3D Object Posing
Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
First: 2025-08-26T17:59:17+00:00 · Latest: 2025-08-26T17:59:17+00:00
Comments: Project page: https://odeb1.github.io/articulate3d_page_deb/
Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through
language control. Despite advances in vision and language models, this task
remains surprisingly challenging. To achieve this goal, we decompose the
problem into two steps. We modify a powerful image-generator to create target
images conditioned on the input image and a text instruction. We then align the
mesh to the target images through a multi-view pose optimisation step. In
detail, we introduce a self-attention rewiring mechanism (RSActrl) that
decouples the source structure from pose within an image generative model,
allowing it to maintain a consistent structure across varying poses. We
observed that differentiable rendering is an unreliable signal for articulation
optimisation; instead, we use keypoints to establish correspondences between
input and target images. The effectiveness of Articulate3D is demonstrated
across a diverse range of 3D objects and free-form text prompts, successfully
manipulating poses while maintaining the original identity of the mesh.
Quantitative evaluations and a comparative user study, in which our method was
preferred over 85% of the time, confirm its superiority over existing
approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/
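Title/Abstract (translated from Chinese)
Title: Articulate3D: Zero-Shot Text-Driven 3D Object Posing
We propose Articulate3D, a training-free method for posing a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image generator to produce target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. Specifically, we introduce a self-attention rewiring mechanism (RSActrl) that decouples source structure from pose within the image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for pose optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a variety of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/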
Summary / 总结
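Articulate3D is a training-free method for posing a 3D asset through language control. It first modifies an image generator, using a self-attention rewiring mechanism (RSActrl), to create target images conditioned on the input image and a text instruction, and then aligns the mesh to those images through keypoint-based multi-view pose optimisation. In a comparative user study, the method was preferred over 85% of the time.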
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-08-22T23:34:37+00:00 · Latest: 2025-08-26T17:13:21+00:00
Abstract
Clinical workflows are fragmented into a patchwork of scripts and task-specific
networks that handle triage, task selection, and model deployment. These
pipelines are rarely streamlined for the data science pipeline, reducing efficiency
and raising operational costs. Workflows also lack data-driven model
identification (from imaging/tabular inputs) and standardized delivery of model
outputs. In response, we present a practical, healthcare-first framework that
uses a single vision-language model (VLM) in two complementary roles. First
(Solution 1), the VLM acts as an aware model-card matcher that routes an
incoming image to the appropriate specialist model via a three-stage workflow
(modality -> primary abnormality -> model-card id). Checks are provided by (i)
stagewise prompts that allow early exit via None/Normal/Other and (ii) a
stagewise answer selector that arbitrates between the top-2 candidates at each
stage, reducing the chance of an incorrect selection and aligning the workflow
with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on
specialty-specific datasets ensuring a single model covers multiple downstream
tasks within each specialty, maintaining performance while simplifying
deployment. Across gastroenterology, hematology, ophthalmology, and pathology,
our single-model deployment matches or approaches specialized baselines.
Compared with pipelines composed of many task-specific agents, this approach
shows that one VLM can both decide and do. It may reduce effort by data
scientists, shorten monitoring, increase the transparency of model selection
(with per-stage justifications), and lower integration overhead.
Summary / 总结
The paper addresses fragmented clinical workflows by proposing a framework that uses a single vision-language model (VLM) in two roles: first, as an aware model-card matcher that routes images to the appropriate specialist model via a three-stage workflow, and second, as a model fine-tuned on specialty-specific datasets to cover multiple downstream tasks within each specialty. Across gastroenterology, hematology, ophthalmology, and pathology, the single-model deployment matches or approaches specialized baselines. The authors suggest this may reduce data-science effort and integration overhead compared with pipelines of many task-specific agents.
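The paper proposes a framework in which a single vision-language model (VLM) serves two roles: first, as an intelligent model-card matcher that routes images to the appropriate specialist model through a three-stage workflow; second, as a model fine-tuned on specialty datasets so that one model covers multiple downstream tasks within each specialty. Across gastroenterology, hematology, ophthalmology, and pathology, the framework matches or approaches specialized baselines, which may lower operational costs and improve efficiency, and it simplifies deployment compared with task-specific pipelines.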
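The three-stage routing with early exit and top-2 arbitration described in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's code: `propose` and `arbitrate` are hypothetical stand-ins for VLM prompts, and the labels (including `card-polyp-seg`) are invented for the example.

```python
EARLY_EXIT = {"None", "Normal", "Other"}


def route(image, stages, arbitrate):
    """Walk the stages in order and return (model_card_id or None, trace).

    stages: non-empty list of (stage_name, propose) pairs, where
        propose(image, chosen) returns a non-empty list of candidate
        labels ranked best-first.
    arbitrate(image, stage_name, top2): picks one label from the top-2 list.
    Both callables are hypothetical stand-ins for VLM prompts.
    """
    chosen, trace = [], []
    for stage_name, propose in stages:
        top2 = propose(image, chosen)[:2]
        label = arbitrate(image, stage_name, top2) if len(top2) == 2 else top2[0]
        trace.append((stage_name, label))  # per-stage record for auditing
        if label in EARLY_EXIT:
            return None, trace  # stop early; nothing is routed
        chosen.append(label)
    return chosen[-1], trace


# Toy usage with fixed candidate lists in place of real VLM calls.
stages = [
    ("modality", lambda img, ctx: ["endoscopy", "fundus"]),
    ("primary abnormality", lambda img, ctx: ["polyp", "Normal"]),
    ("model-card id", lambda img, ctx: ["card-polyp-seg"]),
]
card, trace = route("image.png", stages, lambda img, stage, top2: top2[0])
exit_card, exit_trace = route("image.png", stages, lambda img, stage, top2: top2[1])
```

The returned trace keeps one entry per stage, which is one simple way to expose the per-stage decisions the abstract refers to.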
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu
First: 2025-05-29T23:32:03+00:00 · Latest: 2025-08-26T16:42:37+00:00
Comments: 16 pages
Abstract
Large Vision-Language Models (LVLMs) have made remarkable strides in
multimodal tasks such as visual question answering, visual grounding, and
complex reasoning. However, they remain limited by static training data,
susceptibility to hallucinations, and inability to verify claims against
up-to-date, external evidence, compromising their performance in dynamic
real-world applications. Retrieval-Augmented Generation (RAG) offers a
practical solution to mitigate these challenges by allowing the LVLMs to access
large-scale knowledge databases via retrieval mechanisms, thereby grounding
model outputs in factual, contextually relevant information. In this paper, we
conduct the first systematic dissection of the multimodal RAG pipeline for
LVLMs, explicitly investigating (1) the retrieval phase: modality
configurations and retrieval strategies; (2) the re-ranking stage: strategies
to mitigate positional biases and improve the relevance of retrieved evidence;
and (3) the generation phase: how best to integrate retrieved candidates into
the final generation process. Finally, we explore a unified agentic framework
that integrates re-ranking and
generation through self-reflection, enabling LVLMs to select relevant evidence
and suppress irrelevant context dynamically. Our full-stack exploration of RAG
for LVLMs yields substantial insights, resulting in an average performance
boost of 5% without any fine-tuning.
Summary / 总结
This paper aims to address the limitations of large vision-language models (LVLMs) in dynamic real-world applications by exploring the design space of retrieval-augmented generation (RAG). The study investigates the retrieval phase, re-ranking stage, and generation phase of multimodal RAG pipelines for LVLMs, and explores a unified agentic framework that integrates re-ranking and generation through self-reflection. It reports an average performance boost of 5% without fine-tuning.
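This paper aims to address the limitations of large vision-language models (LVLMs) in dynamic real-world applications by exploring the design space of retrieval-augmented generation (RAG). The study examines the retrieval, re-ranking, and generation phases of the multimodal RAG pipeline, and introduces a unified agentic framework that integrates re-ranking and generation through self-reflection. The results show an average performance gain of 5% without fine-tuning, providing important insights into RAG for LVLMs.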
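The three phases named in the abstract (retrieval, re-ranking, generation) can be laid out as a small pipeline. This is a generic sketch, not the configuration studied in the paper: the embeddings are toy vectors, `relevance` is a hypothetical scoring callable, and scoring each candidate independently is only one assumed way to keep list position from influencing the ranking.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def retrieve(query_vec, corpus, k):
    """Return the ids of the k corpus entries most similar to the query."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]),
                    reverse=True)
    return ranked[:k]


def rerank(candidates, relevance, keep):
    """Score each candidate on its own, so list position cannot affect its score."""
    return sorted(candidates, key=relevance, reverse=True)[:keep]


def build_prompt(question, evidence_ids, texts):
    """Place the surviving evidence ahead of the question for the generator."""
    context = "\n".join(texts[i] for i in evidence_ids)
    return f"{context}\n\nQuestion: {question}"


# Toy usage with made-up embeddings, relevance scores, and texts.
corpus = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
texts = {"a": "caption A", "b": "caption B", "c": "caption C"}
top = retrieve([1.0, 0.0], corpus, k=2)
best = rerank(top, relevance={"a": 0.2, "b": 0.9}.get, keep=1)
prompt = build_prompt("What is shown?", best, texts)
```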
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar
First: 2025-07-07T22:31:56+00:00 · Latest: 2025-08-26T14:02:57+00:00
Abstract
Dermatological care via telemedicine often lacks the rich context of
in-person visits. Clinicians must make diagnoses based on a handful of images
and brief descriptions, without the benefit of physical exams, second opinions,
or reference materials. While many medical AI systems attempt to bridge these
gaps with domain-specific fine-tuning, this work hypothesized that mimicking
clinical reasoning processes could offer a more effective path forward. This
study tested seven vision-language models on medical visual question answering
across six configurations: baseline models, fine-tuned variants, and both
augmented with either reasoning layers that combine multiple model
perspectives, analogous to peer consultation, or retrieval-augmented generation
that incorporates medical literature at inference time, serving a role similar
to reference-checking. While fine-tuning degraded performance in four of seven
models with an average 30% decrease, baseline models collapsed on test data.
Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy,
maintaining performance on unseen data while generating explainable,
literature-grounded outputs critical for clinical adoption. These findings
demonstrate that medical AI succeeds by reconstructing the collaborative and
evidence-based practices fundamental to clinical diagnosis.
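Title/Abstract (translated from Chinese)
Title: Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses from a handful of images and brief descriptions, without the help of physical exams, second opinions, or reference materials. While many medical AI systems try to bridge these gaps with domain-specific fine-tuning, this study hypothesized that mimicking clinical reasoning processes could offer a more effective path. The study tested seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both combined with either reasoning layers that merge multiple model perspectives or retrieval-augmented generation that incorporates medical literature at inference time. While fine-tuning degraded performance in four models, with an average 30% decrease, baseline models performed poorly on test data. By contrast, clinically inspired architectures achieved up to 70% accuracy, maintaining performance on unseen data while generating the explainable, literature-grounded outputs needed for clinical adoption. These findings indicate that medical AI succeeds by reconstructing the collaborative and evidence-based practices that are essential to clinical diagnosis.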
Summary / 总结
Dermatological care via telemedicine often lacks the rich context of in-person visits.
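This study aims to improve dermatological telemedicine by developing multi-agent reasoning systems. It tested seven vision-language models across six configurations, including baseline models, fine-tuned variants, and models augmented with reasoning layers or retrieval-augmented generation. While fine-tuning degraded performance in four of the seven models, clinically inspired architectures achieved up to 70% accuracy, maintained performance on unseen data, and produced the explainable, literature-grounded outputs needed for clinical adoption.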
ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval
Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Venue: EMNLP 2025
First: 2025-08-26T13:42:48+00:00 · Latest: 2025-08-26T13:42:48+00:00
Comments: Accepted by EMNLP 2025 Findings
Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task
that involves retrieving videos based on queries relevant to only specific
segments. While existing works follow the paradigm of developing models to
process unimodal features, powerful pretrained vision-language models like CLIP
remain underexplored in this field. To bridge this gap, we propose ProPy, a
model with a systematic architectural adaptation of CLIP specifically designed for
PRVR. Drawing insights from the semantic relevance of multi-granularity events,
ProPy introduces two key innovations: (1) A Prompt Pyramid structure that
organizes event prompts to capture semantics at multiple granularity levels,
and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that
enables dynamic semantic interaction among events. With these designs, ProPy
achieves SOTA performance on three public datasets, outperforming previous
models by significant margins. Code is available at
https://github.com/BUAAPY/ProPy.
Summary / 总结
ProPy is designed to address the challenge of Partially Relevant Video Retrieval (PRVR) by leveraging the powerful pretrained vision-language model CLIP. It introduces a Prompt Pyramid structure and an Ancestor-Descendant Interaction Mechanism to capture and interact with semantic information at multiple granularity levels. ProPy outperforms previous models on three public datasets, achieving state-of-the-art performance and significant improvements over existing methods. Code is available at https://github.com/BUAAPY/ProPy.
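ProPy aims to address the challenge of Partially Relevant Video Retrieval (PRVR) by leveraging the pretrained vision-language model CLIP. It introduces a Prompt Pyramid structure and an Ancestor-Descendant Interaction Mechanism to capture, and enable interaction among, event semantics at multiple granularities. ProPy achieves the best performance on three public datasets, with significant improvements over existing methods.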
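To make the idea of multi-granularity events and ancestor-descendant relations concrete, here is a toy pyramid over clip indices. The layout is an assumption made purely for illustration (spans double at each level, and an ancestor is any coarser span that contains a node); the abstract does not specify how ProPy builds its pyramid, and the function names are hypothetical.

```python
def build_pyramid(num_clips):
    """Build levels of (start, end) spans; level 0 is single clips.

    Assumes num_clips is a power of two so every level tiles the video evenly.
    """
    levels, span = [], 1
    while span <= num_clips:
        levels.append([(s, min(s + span, num_clips))
                       for s in range(0, num_clips, span)])
        span *= 2
    return levels


def ancestors(node, levels):
    """Return every other span in the pyramid that fully contains `node`."""
    start, end = node
    return [n for level in levels for n in level
            if n != node and n[0] <= start and end <= n[1]]


levels = build_pyramid(4)
parents_of_second_clip = ancestors((1, 2), levels)
```

In a structure like this, a fine-grained event could exchange information only with the coarser events that contain it, which is one plausible reading of "interaction among events" at different granularities.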
ForgetMe: Evaluating Selective Forgetting in Generative Models
Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
First: 2025-04-17T01:44:57+00:00 · Latest: 2025-08-26T13:04:59+00:00
Abstract
The widespread adoption of diffusion models in image generation has increased
the demand for privacy-compliant unlearning. However, due to the
high-dimensional nature and complex feature representations of diffusion
models, achieving selective unlearning remains challenging, as existing methods
struggle to remove sensitive information while preserving the consistency of
non-sensitive regions. To address this, we propose an Automatic Dataset
Creation Framework based on prompt-based layered editing and training-free
local feature removal, constructing the ForgetMe dataset and introducing the
Entangled evaluation metric. The Entangled metric quantifies unlearning
effectiveness by assessing the similarity and consistency between the target
and background regions and supports both paired (Entangled-D) and unpaired
(Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe
dataset encompasses a diverse set of real and synthetic scenarios, including
CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We
apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on
this dataset and validate the effectiveness of both the ForgetMe dataset and
the Entangled metric, establishing them as benchmarks for selective unlearning.
Our work provides a scalable and adaptable solution for advancing
privacy-preserving generative AI.
Summary / 总结
The paper addresses the challenge of selective unlearning in diffusion models, which are widely used in image generation. It proposes an Automatic Dataset Creation Framework using prompt-based layered editing and training-free local feature removal. The ForgetMe dataset and Entangled evaluation metric are introduced, which help quantify unlearning effectiveness by assessing the similarity and consistency between target and background regions. The study applies LoRA fine-tuning on Stable Diffusion and validates the ForgetMe dataset and Entangled metric as benchmarks for selective unlearning in generative models.
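The paper tackles the problem of selective forgetting in diffusion models, which are widely used in image generation. It proposes an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal. It introduces the ForgetMe dataset and the Entangled evaluation metric, which quantifies forgetting effectiveness by assessing the similarity and consistency between target and background regions. The study applies LoRA fine-tuning on Stable Diffusion and validates the ForgetMe dataset and the Entangled metric as benchmarks for selective forgetting in generative models.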
Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth
First: 2025-08-26T12:41:35+00:00 · Latest: 2025-08-26T12:41:35+00:00
Abstract
As Vision Language Models (VLMs) become integral to real-world applications,
understanding their demographic biases is critical. We introduce GRAS, a
benchmark for uncovering demographic biases in VLMs across gender, race, age,
and skin tone, offering the most diverse coverage to date. We further propose
the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark
five state-of-the-art VLMs and reveal concerning bias levels, with the least
biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings
also reveal a methodological insight: evaluating bias in VLMs with visual
question answering (VQA) requires considering multiple formulations of a
question. Our code, data, and evaluation results are publicly available.
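Title/Abstract (translated from Chinese)
Title: Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
As Vision Language Models (VLMs) become increasingly important in real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering biases in VLMs across gender, race, age, and skin tone, offering the most comprehensive coverage to date. We also propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.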
Summary / 总结
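GRAS is a benchmark for uncovering demographic biases in Vision Language Models (VLMs) across gender, race, age, and skin tone, paired with the GRAS Bias Score, an interpretable metric for quantifying bias. Benchmarking five state-of-the-art VLMs reveals concerning bias levels, with the least biased model scoring only 2 out of 100. The study also finds that evaluating bias with visual question answering (VQA) requires considering multiple formulations of a question.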
Enhancing Document VQA Models via Retrieval-Augmented Generation
Authors: Eric López, Artemis Llabrés, Ernest Valveny
First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-26T12:32:55+00:00
Comments: Accepted at Workshop on Machine Learning in Document Analysis and
Recognition (ICDAR WML 2025), Wuhan, China
Abstract
Document Visual Question Answering (Document VQA) must cope with documents
that span dozens of pages, yet leading systems still concatenate every page or
rely on very large vision-language models, both of which are memory-hungry.
Retrieval-Augmented Generation (RAG) offers an attractive alternative, first
retrieving a concise set of relevant segments before generating answers from
this selected evidence. In this paper, we systematically evaluate the impact of
incorporating RAG into Document VQA through different retrieval variants -
text-based retrieval using OCR tokens and purely visual retrieval without OCR -
across multiple models and benchmarks. Evaluated on the multi-page datasets
MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the
"concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant
achieves +5.0 ANLS improvement without requiring any text extraction. An
ablation confirms that retrieval and reranking components drive most of the
gain, whereas the layout-guided chunking strategy - proposed in several recent
works to leverage page structure - fails to help on these datasets. Our
experiments demonstrate that careful evidence selection consistently boosts
accuracy across multiple model sizes and multi-page benchmarks, underscoring
its practical value for real-world Document VQA.
Summary / 总结
This paper explores the integration of Retrieval-Augmented Generation (RAG) into Document VQA models to address the memory challenges of processing multi-page documents. By evaluating text-based and purely visual retrieval methods across various models and benchmarks, the study shows that the text-centric variant improves the baseline by up to 22.5 ANLS, while the visual variant achieves a 5.0 ANLS improvement without OCR. The experiments highlight the effectiveness of careful evidence selection in enhancing accuracy across different model sizes and benchmarks.
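This paper explores integrating Retrieval-Augmented Generation (RAG) into Document VQA models to address the memory challenges of processing multi-page documents. It evaluates text-based and purely visual retrieval methods across multiple models and benchmarks, showing that the text-based variant improves the baseline by up to 22.5 ANLS, while the purely visual variant achieves a 5.0 ANLS improvement without OCR. The study confirms that retrieval and reranking are the main sources of gain, whereas layout-guided chunking does not help significantly on these datasets. The results highlight that careful evidence selection consistently improves accuracy on multi-page benchmarks, underscoring its practical value.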
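The gains above are reported in ANLS (Average Normalized Levenshtein Similarity). A minimal sketch of the metric as it is commonly defined for document VQA follows; the 0.5 threshold and the lowercasing and whitespace stripping are the usual conventions and are assumed here rather than taken from this paper. The function returns a value in [0, 1], whereas the abstract's figures appear to be on a 0 to 100 scale.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, tau=0.5):
    """Average over questions of the best thresholded similarity to any reference.

    predictions: list of answer strings, one per question (non-empty list).
    references: list of lists of accepted answer strings, aligned with predictions.
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)


score = anls(["pariz", "42"], [["Paris"], ["forty-two", "42"]])
```

Here the first answer is one edit away from a five-letter reference and the second matches exactly, so the example averages a partial score with a perfect one.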
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Authors: Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit
First: 2025-08-24T12:43:27+00:00 · Latest: 2025-08-26T12:16:26+00:00
Abstract
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA)
on cricket scorecards, designed to evaluate large vision-language models
(LVLMs) on complex numerical and cross-lingual reasoning over semi-structured
tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated
scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English
QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English
scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi
scorecards, with all questions and answers kept in English to enable controlled
cross-script evaluation. The task demands reasoning over structured numerical
data, multi-image context, and implicit domain knowledge. Empirical results
show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle
on the English subset despite it being their primary training language and
exhibit a further drop in performance on the Hindi subset. This reveals key
limitations in structure-aware visual text understanding, numerical reasoning,
and cross-lingual generalization. The dataset is publicly available via Hugging
Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM
research in this direction.
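Title/Abstract (translated from Chinese)
Title: Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate the ability of large vision-language models (LVLMs) to perform complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K contains 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, along with 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, with English scorecards, and MMCRICBENCH-H-1.5K, with visually similar Hindi scorecards; all questions and answers are written in English to enable controlled cross-script evaluation. The task requires reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Experiments show that even state-of-the-art LVLMs such as GPT-4o and Qwen2.5VL struggle on the English subset, their primary training language, and perform even worse on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench to promote LVLM research in this direction.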
Summary / 总结
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images.
Prototype-Guided Diffusion: Visual Conditioning without External Memory
Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah
First: 2025-08-13T16:18:35+00:00 · Latest: 2025-08-26T10:29:55+00:00
Abstract
Diffusion models have emerged as a leading framework for high-quality image
generation, offering stable training and strong performance across diverse
domains. However, they remain computationally intensive, particularly during
the iterative denoising process. Latent-space models like Stable Diffusion
alleviate some of this cost by operating in compressed representations, though
at the expense of fine-grained detail. More recent approaches such as
Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning
denoising on similar examples retrieved from large external memory banks. While
effective, these methods introduce drawbacks: they require costly storage and
retrieval infrastructure, depend on static vision-language models like CLIP for
similarity, and lack adaptability during training. We propose the Prototype
Diffusion Model (PDM), a method that integrates prototype learning directly
into the diffusion process for efficient and adaptive visual conditioning -
without external memory. Instead of retrieving reference samples, PDM
constructs a dynamic set of compact visual prototypes from clean image features
using contrastive learning. These prototypes guide the denoising steps by
aligning noisy representations with semantically relevant visual patterns,
enabling efficient generation with strong semantic grounding. Experiments show
that PDM maintains high generation quality while reducing computational and
storage overhead, offering a scalable alternative to retrieval-based
conditioning in diffusion models.
Summary / 总结
The research aims to improve the efficiency and adaptability of diffusion models in image generation by integrating prototype learning directly into the diffusion process. Instead of relying on external memory banks, PDM constructs dynamic visual prototypes from clean image features using contrastive learning. This method reduces computational and storage overhead while maintaining high generation quality and strong semantic grounding, offering a scalable alternative to retrieval-based conditioning in diffusion models.
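The research aims to improve the efficiency and adaptability of image generation by integrating prototype learning directly into the diffusion process. Rather than relying on reference samples from external memory, PDM uses contrastive learning to build dynamic visual prototypes from clean image features. This reduces computational and storage overhead while maintaining high generation quality and strong semantic grounding, offering a scalable alternative to retrieval-based conditioning in diffusion models.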
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
Authors: Yanshu Li, Yi Cao, Hongyang He, Qisen Cheng, Xiang Fu, Xi Xiao, Tianyang Wang, Ruixiang Tang
First: 2025-04-06T22:02:21+00:00 · Latest: 2025-08-26T10:19:05+00:00
Comments: COLM 2025, 30 pages, 10 figures, 16 tables
Abstract
Multimodal in-context learning (ICL) equips Large Vision-language Models
(LVLMs) with the ability to adapt to new tasks via multiple user-provided
demonstrations, without requiring any model parameter updates. However, its
effectiveness is constrained by the token-intensive nature of multimodal inputs
and the complexity of cross-modal few-shot reasoning, which together hinder
LVLMs from extracting useful patterns from demonstrations. To address these
challenges, we propose M$^2$IV, a novel representation engineering
approach that replaces explicit token-level demonstrations with a set of
learnable Multimodal In-context Vectors directly injected into the residual
streams of LVLMs. By analyzing the distinct roles of multi-head attention (MHA)
and multi-layer perceptrons (MLP) in the ICL process, we design a training
strategy that enables M$^2$IV to perform fine-grained semantic distillation and
robust cross-modal representation learning. M$^2$IV not only improves
performance across diverse tasks and LVLMs but also significantly reduces token
overhead, enabling graceful scaling to many-shot scenarios. To further enhance
usability, we introduce VLibrary, a repository that stores trained
M$^2$IVs for flexible retrieval and injection. With VLibrary, users can steer
pre-trained LVLMs in a customized manner that meets diverse requirements.
Extensive experiments demonstrate that M$^2$IV consistently outperforms vanilla
ICL and prior representation engineering baselines, achieving an average
accuracy gain of 3.74% with substantial improvements in overall efficiency.
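Injecting vectors into a residual stream, as the abstract describes, amounts to shifting each layer's hidden state by a scaled vector. The sketch below shows only that arithmetic on toy lists. It is not the M$^2$IV training procedure: the vectors and scales are fixed by hand here rather than learned, and the abstract's separate treatment of MHA and MLP branches is collapsed into one injection per layer.

```python
def inject(hidden, vector, alpha):
    """Shift a hidden state by a scaled vector: h + alpha * v."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def forward_with_vectors(hidden, layers, vectors, alphas):
    """Apply residual layers, injecting one vector after each layer's update."""
    for layer, vector, alpha in zip(layers, vectors, alphas):
        update = layer(hidden)                             # toy sub-layer output
        hidden = [h + u for h, u in zip(hidden, update)]   # residual connection
        hidden = inject(hidden, vector, alpha)             # injected vector
    return hidden


# Toy usage: two layers that each add 1.0 to every coordinate.
layers = [lambda h: [1.0] * len(h)] * 2
vectors = [[1.0, 0.0], [0.0, 2.0]]   # hand-picked stand-ins for learned vectors
alphas = [0.5, 0.25]                 # hand-picked per-layer scales
out = forward_with_vectors([0.0, 0.0], layers, vectors, alphas)
```

Because the steering happens in hidden states, no demonstration tokens enter the prompt, which is the source of the token savings the abstract mentions.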
Video CLIP Model for Multi-View Echocardiography Interpretation
Authors: Ryo Takizawa, Satoshi Kodera, Tempei Kabayama, Ryo Matsuoka, Yuta Ando, Yuto Nakamura, Haruki Settai, Norihiko Takeda
First: 2025-04-26T05:11:15+00:00 · Latest: 2025-08-26T10:06:14+00:00
Abstract
Echocardiography records ultrasound videos of the heart, enabling clinicians
to assess cardiac function. Recent advances in large-scale vision-language
models (VLMs) have spurred interest in automating echocardiographic
interpretation. However, most existing medical VLMs rely on single-frame
(image) inputs, which can reduce diagnostic accuracy for conditions
identifiable only through cardiac motion. In addition, echocardiographic videos
are captured from multiple views, each varying in suitability for detecting
specific conditions. Leveraging multiple views may therefore improve diagnostic
performance. We developed a video-language model that processes full video
sequences from five standard views, trained on 60,747 echocardiographic
video-report pairs. We evaluated the gains in retrieval performance from video
input and multi-view support, including the contributions of various pretrained
models.
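Title/Abstract (translated from Chinese)
Title: Video CLIP Model for Multi-View Echocardiography Interpretation
Echocardiography records ultrasound videos of the heart, enabling clinicians to assess cardiac function. Recent advances in large-scale vision-language models (VLMs) have spurred interest in automating echocardiographic interpretation. However, most existing medical VLMs rely on single-frame (image) inputs, which can reduce diagnostic accuracy for conditions identifiable only through cardiac motion. In addition, echocardiographic videos are captured from multiple views, each differing in its suitability for detecting specific conditions, so leveraging multiple views may improve diagnostic performance. We developed a video-language model that processes full video sequences from five standard views, trained on 60,747 echocardiographic video-report pairs. We evaluated the gains in retrieval performance from video input and multi-view support, including the contributions of various pretrained models.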
Summary / 总结
The research aims to improve echocardiographic interpretation by developing a video-language model that processes full video sequences from five standard views. The model is trained on 60,747 echocardiographic video-report pairs. The study evaluates the gains in retrieval performance from video input and multi-view support, as well as the contributions of various pretrained models.
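The research aims to improve the accuracy of echocardiographic interpretation by developing a video-language model that processes full video sequences from five standard views. The model is trained on 60,747 echocardiographic video-report pairs, and the study evaluates the benefits of video input and multi-view support, including the retrieval gains from video input and the contributions of various pretrained models.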
Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models
Authors: Yuexuan Xia, Benteng Ma, Jiang He, Zhiyong Wang, Qi Dou, Yong Xia
First: 2025-08-26T10:01:23+00:00 · Latest: 2025-08-26T10:01:23+00:00
Abstract
Ensuring fairness across demographic groups in medical diagnosis is essential
for equitable healthcare, particularly under distribution shifts caused by
variations in imaging equipment and clinical practice. Vision-language models
(VLMs) exhibit strong generalization, and text prompts encode identity
attributes, enabling explicit identification and removal of sensitive
directions. However, existing debiasing approaches typically address vision and
text modalities independently, leaving residual cross-modal misalignment and
fairness gaps. To address this challenge, we propose DualFairVL, a multimodal
prompt-learning framework that jointly debiases and aligns cross-modal
representations. DualFairVL employs a parallel dual-branch architecture that
separates sensitive and target attributes, enabling disentangled yet aligned
representations across modalities. Approximately orthogonal text anchors are
constructed via linear projections, guiding cross-attention mechanisms to
produce fused features. A hypernetwork further disentangles attribute-related
information and generates instance-aware visual prompts, which encode
dual-modal cues for fairness and robustness. Prototype-based regularization is
applied in the visual branch to enforce separation of sensitive features and
strengthen alignment with textual anchors. Extensive experiments on eight
medical imaging datasets across four modalities show that DualFairVL achieves
state-of-the-art fairness and accuracy under both in- and out-of-distribution
settings, outperforming full fine-tuning and parameter-efficient baselines with
only 3.6M trainable parameters. Code will be released upon publication.
Summary / 总结
The research aims to ensure fairness in medical diagnosis across demographic groups, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. DualFairVL, a multimodal prompt-learning framework, jointly debiases and aligns cross-modal representations using a parallel dual-branch architecture. On eight medical imaging datasets across four modalities, it achieves state-of-the-art fairness and accuracy in both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters.
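The research aims to ensure fairness of medical diagnosis across demographic groups under distribution shifts caused by variations in imaging equipment and clinical practice. DualFairVL is a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations through a parallel dual-branch architecture. On eight medical imaging datasets, the framework achieves the best fairness and accuracy with only 3.6M trainable parameters, outperforming full fine-tuning and parameter-efficient baselines.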
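The abstract mentions removing sensitive directions and building approximately orthogonal text anchors via linear projections. The simplest linear projection of that kind removes from one embedding its component along another. The sketch below shows that single step on toy vectors. It is a swapped-in textbook technique (one Gram-Schmidt step, which yields exact rather than approximate orthogonality) and not DualFairVL's learned projection; the anchor values are invented.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(target, sensitive):
    """Remove from `target` its component along `sensitive` (must be non-zero)."""
    scale = dot(target, sensitive) / dot(sensitive, sensitive)
    return [t - scale * s for t, s in zip(target, sensitive)]


target_anchor = [1.0, 1.0, 0.0]     # toy embedding of a diagnosis prompt
sensitive_anchor = [1.0, 0.0, 0.0]  # toy embedding of a sensitive-attribute prompt
debiased = project_out(target_anchor, sensitive_anchor)
```

After the projection, the target anchor has no component along the sensitive direction, so similarity to it no longer varies with that attribute in this toy setting.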
Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment
Authors: Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
First: 2023-12-15T09:08:14+00:00 · Latest: 2025-08-26T08:50:35+00:00
Abstract
Learning to ground natural language queries to target objects or regions in
3D point clouds is essential for 3D scene understanding. Nevertheless,
existing 3D visual grounding approaches require a substantial number of
bounding box annotations for text queries, which is time-consuming and
labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly
supervised approach for 3D visual grounding based on Visual Linguistic
Alignment. Our 3D-VLA exploits the superior ability of current large-scale
vision-language models (VLMs) to align the semantics between texts and 2D
images, as well as the naturally existing correspondences between 2D images and
3D point clouds, and thus implicitly constructs correspondences between texts
and 3D point clouds with no need for fine-grained box annotations in the
training procedure. During the inference stage, the learned text-3D
correspondence will help us ground the text queries to the 3D target objects
even without 2D images. To the best of our knowledge, this is the first work to
investigate 3D visual grounding in a weakly supervised manner by involving
large-scale vision-language models, and extensive experiments on the ReferIt3D
and ScanRefer datasets demonstrate that our 3D-VLA achieves results comparable,
and even superior, to those of fully supervised methods.
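Title / Abstract (translated from Chinese)
Title: Weakly Supervised 3D Visual Grounding Based on Visual-Language Alignment
Learning to ground natural language queries to target objects or regions in 3D point clouds is essential for 3D scene understanding. However, existing 3D visual grounding methods require a large number of bounding box annotations for text queries, which take considerable time and labor to obtain. In this paper, we propose 3D-VLA, a weakly supervised 3D visual grounding method based on visual-language alignment. 3D-VLA exploits the ability of current large-scale vision-language models (VLMs) to align semantics between text and 2D images, together with the naturally existing correspondences between 2D images and 3D point clouds, and thereby implicitly builds correspondences between text and 3D point clouds without fine-grained box annotations during training. At inference time, the learned text-3D correspondence helps ground text queries to 3D target objects even without 2D images. To the best of our knowledge, this is the first work to study weakly supervised 3D visual grounding using large-scale vision-language models, and extensive experiments on the ReferIt3D and ScanRefer datasets show that 3D-VLA achieves results comparable, and even superior, to those of fully supervised methods.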
Summary / 总结
This paper addresses the challenge of 3D visual grounding by proposing 3D-VLA, a weakly supervised approach that leverages visual linguistic alignment to align text queries with 3D point clouds without the need for bounding box annotations. The method exploits the alignment capabilities of large-scale vision-language models and the natural correspondences between 2D images and 3D point clouds, achieving comparable and sometimes better performance than fully supervised methods on the ReferIt3D and ScanRefer datasets.
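This paper proposes 3D-VLA, a weakly supervised method that aligns text queries with 3D point clouds through visual-language alignment, without bounding box annotations. The method exploits the alignment capabilities of large-scale vision-language models and the natural correspondences between 2D images and 3D point clouds; experiments on the ReferIt3D and ScanRefer datasets show that 3D-VLA performs comparably to, or even better than, fully supervised methods.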
Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models
Authors: Rui Zhang, Zihan Wang, Tianli Yang, Hongwei Li, Wenbo Jiang, Qingchuan Zhao, Yang Liu, Guowen Xu
First: 2025-08-26T08:40:22+00:00 · Latest: 2025-08-26T08:40:22+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed in real-world
applications, but their high inference cost makes them vulnerable to resource
consumption attacks. Prior attacks attempt to extend VLM output sequences by
optimizing adversarial images, thereby increasing inference costs. However,
these extended outputs often introduce irrelevant abnormal content,
compromising attack stealthiness. This trade-off between effectiveness and
stealthiness poses a major limitation for existing attacks. To address this
challenge, we propose Hidden Tail, a stealthy resource consumption
attack that crafts prompt-agnostic adversarial images, inducing VLMs to
generate maximum-length outputs by appending special tokens invisible to users.
Our method employs a composite loss function that balances semantic
preservation, repetitive special token induction, and suppression of the
end-of-sequence (EOS) token, optimized via a dynamic weighting strategy.
Extensive experiments show that Hidden Tail outperforms existing
attacks, increasing output length by up to 19.2× and reaching the
maximum token limit, while preserving attack stealthiness. These results
highlight the urgent need to improve the robustness of VLMs against
efficiency-oriented adversarial threats. Our code is available at
https://github.com/zhangrui4041/Hidden_Tail.
Summary / 总结
The research addresses the vulnerability of Vision-Language Models (VLMs) to resource consumption attacks by proposing a stealthy attack called Hidden Tail. This method crafts prompt-agnostic adversarial images that induce VLMs to generate maximum-length outputs by appending invisible special tokens. Experiments show that Hidden Tail significantly increases output length by up to 19.2 times while maintaining stealthiness, outperforming existing attacks. This highlights the need for VLMs to be more robust against such efficiency-oriented threats.
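The study proposes Hidden Tail, a stealthy resource consumption attack that generates prompt-agnostic adversarial images, inducing vision-language models to produce maximum-length outputs by appending invisible special tokens. Experiments show that Hidden Tail increases output length by up to 19.2 times while remaining stealthy, underscoring the urgency of making vision-language models more robust to efficiency-oriented adversarial threats. The code is available on GitHub.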
Robust and Label-Efficient Deep Waste Detection
Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan
First: 2025-08-26T08:34:04+00:00 · Latest: 2025-08-26T08:34:04+00:00
Comments: Accepted to BMVC 2025
Abstract
Effective waste sorting is critical for sustainable recycling, yet AI
research in this domain continues to lag behind commercial systems due to
limited datasets and reliance on legacy object detectors. In this work, we
advance AI-driven waste detection by establishing strong baselines and
introducing an ensemble-based semi-supervised learning framework. We first
benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on
the real-world ZeroWaste dataset, demonstrating that while class-only prompts
perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy.
Next, to address domain-specific limitations, we fine-tune modern
transformer-based detectors, achieving a new baseline of 51.6 mAP. We then
propose a soft pseudo-labeling strategy that fuses ensemble predictions using
spatial and consensus-aware weighting, enabling robust semi-supervised
training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations
achieve performance gains that surpass fully supervised training, underscoring
the effectiveness of scalable annotation pipelines. Our work contributes to the
research community by establishing rigorous baselines, introducing a robust
ensemble-based pseudo-labeling pipeline, generating high-quality annotations
for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models
under real-world waste sorting conditions. Our code is available at:
https://github.com/h-abid97/robust-waste-detection.
Summary / 总结
This research aims to improve AI-driven waste detection for sustainable recycling by addressing limitations in datasets and object detectors. The study benchmarks state-of-the-art models and introduces an ensemble-based semi-supervised learning framework. Key findings include enhanced zero-shot accuracy with LLM-optimized prompts, a new baseline of 51.6 mAP through transformer-based detector fine-tuning, and performance gains in semi-supervised training using a soft pseudo-labeling strategy. This work contributes by establishing rigorous baselines and generating high-quality annotations for unlabeled data.
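The study aims to improve AI-driven waste detection for sustainable recycling by addressing the limitations of datasets and object detectors. It benchmarks state-of-the-art models and introduces an ensemble-based semi-supervised learning framework. Key findings include improved zero-shot accuracy from LLM-optimized prompts, a new baseline of 51.6 mAP from fine-tuning transformer-based detectors, and performance gains in semi-supervised training from a soft pseudo-labeling strategy. The work contributes rigorous baselines and high-quality annotations for unlabeled data.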
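The soft pseudo-labeling idea in the waste detection abstract (fusing ensemble predictions with spatial and consensus-aware weighting) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the greedy IoU clustering, the score-weighted box average, and the rule "mean score times fraction of agreeing detectors" are not taken from the paper, and the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Pred = Tuple[Box, float, str]             # (box, confidence, label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_predictions(per_model: List[List[Pred]], iou_thr: float = 0.5) -> List[Pred]:
    """Fuse detections from several detectors into soft pseudo-labels."""
    n_models = len(per_model)
    clusters = []  # each cluster: list of (model index, box, score, label)
    for m, preds in enumerate(per_model):
        for box, score, label in preds:
            for cluster in clusters:
                _, ref_box, _, ref_label = cluster[0]
                if label == ref_label and iou(box, ref_box) >= iou_thr:
                    cluster.append((m, box, score, label))
                    break
            else:  # no matching cluster found: start a new one
                clusters.append([(m, box, score, label)])

    fused = []
    for cluster in clusters:
        total = sum(s for _, _, s, _ in cluster)
        if total == 0.0:
            continue  # all-zero confidence carries no information
        # Spatial weighting: score-weighted average of box coordinates.
        box = tuple(sum(b[i] * s for _, b, s, _ in cluster) / total for i in range(4))
        # Consensus weighting: scale by the fraction of detectors that agree.
        voters = len({m for m, _, _, _ in cluster})
        soft_score = (total / len(cluster)) * (voters / n_models)
        fused.append((box, soft_score, cluster[0][3]))
    return fused


model_a = [((0.0, 0.0, 10.0, 10.0), 0.8, "bottle")]
model_b = [((0.0, 0.0, 10.0, 10.0), 0.6, "bottle"), ((50.0, 50.0, 60.0, 60.0), 0.4, "can")]
pseudo_labels = fuse_predictions([model_a, model_b])
```

In this toy run the box both detectors agree on keeps its averaged confidence, while the box seen by only one detector has its confidence halved, so it contributes less as a training target.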
Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods
Authors: Qinqian Lei, Bo Wang, Robby T. Tan
First: 2025-08-26T07:30:53+00:00 · Latest: 2025-08-26T07:30:53+00:00
Abstract
Prior human-object interaction (HOI) detection methods have integrated early
vision-language models (VLMs) such as CLIP, but only as supporting components
within their frameworks. In contrast, recent advances in large, generative VLMs
suggest that these models may already possess strong ability to understand
images involving HOI. This naturally raises an important question: can
general-purpose standalone VLMs effectively solve HOI detection, and how do
they compare with specialized HOI methods? Answering this requires a benchmark
that can accommodate both paradigms. However, existing HOI benchmarks such as
HICO-DET were developed before the emergence of modern VLMs, and their
evaluation protocols require exact matches to annotated HOI classes. This is
poorly aligned with the generative nature of VLMs, which often yield multiple
valid interpretations in ambiguous cases. For example, a static image may
capture a person mid-motion with a frisbee, which can plausibly be interpreted
as either "throwing" or "catching". When only "catching" is annotated, the
other, though equally plausible for the image, is marked incorrect when exact
matching is used. As a result, correct predictions might be penalized,
affecting both VLMs and HOI-specific methods. To avoid penalizing valid
predictions, we introduce a new benchmark that reformulates HOI detection as a
multiple-answer multiple-choice task, where each question includes only
ground-truth positive options and a curated set of negatives that are
constructed to reduce ambiguity (e.g., when "catching" is annotated, "throwing"
is not selected as a negative to avoid penalizing valid predictions). The
proposed evaluation protocol is the first of its kind for both VLMs and HOI
methods, enabling direct comparison and offering new insight into the current
state of progress in HOI understanding.
Summary / 总结
The paper addresses the need for a new benchmark to evaluate human-object interaction (HOI) detection, considering both vision-language models (VLMs) and HOI-specific methods. It introduces a new evaluation protocol that reformulates HOI detection as a multiple-choice task, allowing for valid predictions to be recognized even in ambiguous cases. This approach avoids penalizing correct but non-exact matches, providing a fairer assessment for both VLMs and HOI-specific methods.
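The paper revisits human-object interaction (HOI) detection evaluation for both vision-language models (VLMs) and HOI-specific methods. It introduces a new benchmark that reformulates HOI detection as a multiple-choice task, allowing more accurate evaluation of VLMs and HOI-specific methods. The new approach avoids penalizing valid predictions in ambiguous cases, provides a fairer comparison, and sheds light on the current state of progress in HOI understanding.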
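To make the multiple-answer multiple-choice formulation in the HOI benchmark abstract concrete, here is a minimal scoring sketch. The abstract does not state which metric the benchmark reports, so the per-question precision, recall, and F1 below are an assumption, and `score_question` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def score_question(selected: Iterable[str], positives: Iterable[str],
                   options: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one multiple-answer question.

    Answers outside the option list are ignored, so a plausible but
    unlisted interaction is neither rewarded nor penalized.
    """
    option_set = set(options)
    chosen = set(selected) & option_set
    gold = set(positives) & option_set
    hits = len(chosen & gold)
    precision = hits / len(chosen) if chosen else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# "catching" is annotated; "throwing" is deliberately left out of the
# negatives, mirroring the frisbee example in the abstract.
options = ["catching", "kicking", "sitting on"]
exact = score_question(["catching"], ["catching"], options)
extra = score_question(["catching", "kicking"], ["catching"], options)
```

Because ambiguous alternatives never appear among the negatives, a model can only lose points by selecting an option that was curated to be clearly wrong.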
Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation
Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud
Venue: ICCV 2025
First: 2025-08-22T04:11:28+00:00 · Latest: 2025-08-26T06:32:51+00:00
Comments: CV4A11y@ICCV 2025
Abstract
Sign Language (SL) enables two-way communication for the deaf and
hard-of-hearing community, yet many sign languages remain under-resourced in
the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step
textual instructions that enable non-SL users to imitate and learn SL gestures,
promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG
dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced
SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to
appear in the VLM pre-training data. To enhance zero-shot performance, we
introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL
parameters, like hand shape, motion, and orientation, directly into the textual
prompts. Subsuming standard sign parameters into the prompt makes the
instructions more structured and reproducible than free-form natural text from
vanilla prompting. We envision that our work would promote inclusivity and
advancement in SL learning systems for the under-resourced communities.
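Title / Abstract (translated from Chinese)
Title: Prompting with Sign Parameters for Low-Resource Sign Language Instruction Generation
Sign language (SL) provides a means of two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that let non-SL users imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate vision-language models (VLMs) (i) on under-resourced SLIG tasks and (ii) on long-tail visual concepts, since Bengali SL is unlikely to appear in VLM pre-training data. To improve zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, such as hand shape, motion, and orientation, directly into the textual prompt. Integrating standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work will promote inclusivity and progress in SL learning systems for under-resourced communities.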
Summary / 总结
Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space.
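As a rough illustration of Sign Parameter-Infused (SPI) prompting, the sketch below builds a text prompt that asks for instructions organized by sign parameters. The template wording, the function name `build_spi_prompt`, and the exact parameter list are assumptions; the abstract names hand shape, motion, and orientation only as examples and does not give the prompt itself.

```python
from typing import Sequence

# Parameters named in the abstract; a real prompt may use more.
SIGN_PARAMETERS = ("hand shape", "motion", "orientation")


def build_spi_prompt(gloss: str, parameters: Sequence[str] = SIGN_PARAMETERS) -> str:
    """Build a prompt asking a VLM for instructions structured by sign parameters."""
    lines = [
        f"Describe step by step how to perform the sign for '{gloss}'.",
        "Structure every step around these sign parameters:",
    ]
    lines += [f"- {name}: <describe the {name}>" for name in parameters]
    return "\n".join(lines)


prompt = build_spi_prompt("thank you")
```

A vanilla prompt would stop after the first line; the parameter list is what pushes the model toward structured, comparable instructions.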
Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding
Authors: Chufan Gao, Jintai Chen, Jimeng Sun
First: 2025-08-26T04:46:54+00:00 · Latest: 2025-08-26T04:46:54+00:00
Abstract
Automated tabular understanding and reasoning are essential tasks for data
scientists. Recently, large language models (LLMs) have become increasingly
prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning
LLMs using labeled data or (2) training-free prompting of LLM agents using
chain-of-thought (CoT). Finetuning offers dataset-specific learning at the cost
of generalizability. Training-free prompting is highly generalizable but does
not take full advantage of training data. In this paper, we propose a novel
prompting-based reasoning approach, Learn then Retrieve: LRTab, which
integrates the benefits of both by retrieving relevant information learned from
training data. We first use prompting to obtain CoT responses over the training
data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions to
avoid the error, learning insights from the data. We validate the effectiveness
of Prompt Conditions using validation data. Finally, at inference time, we
retrieve the most relevant Prompt Conditions for additional context for table
understanding. We provide comprehensive experiments on WikiTQ and Tabfact,
showing that LRTab is interpretable, cost-efficient, and can outperform
previous baselines in tabular reasoning.
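Title / Abstract (translated from Chinese)
Title: Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding
Automated tabular understanding and reasoning are important tasks for data scientists. In recent years, large language models (LLMs) have become increasingly common in tabular reasoning tasks. Previous work has focused mainly on (1) fine-tuning LLMs with labeled data or (2) training-free prompting of LLM agents with chain-of-thought (CoT). Fine-tuning offers dataset-specific learning at the cost of generalizability. Training-free prompting is highly generalizable but does not take full advantage of training data. In this paper, we propose a novel prompting-based reasoning approach, Learn then Retrieve: LRTab, which combines the benefits of both by retrieving relevant information learned from training data. We first use prompting to obtain CoT responses. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions that avoid the error, learning insights from the data. We validate the effectiveness of the Prompt Conditions on validation data. Finally, at inference time, we retrieve the most relevant Prompt Conditions as additional context for table understanding. Comprehensive experiments on WikiTQ and Tabfact show that LRTab is interpretable, cost-efficient, and can outperform previous baselines in tabular reasoning.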
Summary / 总结
This paper addresses the challenge of improving large language models (LLMs) for automated tabular understanding and reasoning. It introduces LRTab, a prompting-based approach that combines the benefits of finetuning and training-free prompting. By retrieving relevant information from training data and learning from incorrect chain-of-thought responses, LRTab enhances the model's interpretability and cost-efficiency, outperforming previous methods on WikiTQ and Tabfact datasets.
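This paper aims to improve the performance of large language models (LLMs) on tabular understanding and reasoning. It proposes LRTab, a prompting approach that combines the advantages of fine-tuning and training-free prompting. By retrieving relevant information from training data and learning from incorrect chain-of-thought responses, LRTab improves interpretability and cost-efficiency and outperforms previous baselines on the WikiTQ and Tabfact datasets.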
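The retrieval step of LRTab (fetching the most relevant learned Prompt Conditions at inference time) might look roughly like the sketch below. The abstract does not say how relevance is measured, so the bag-of-words cosine similarity is an assumption, and the stored questions, conditions, and function names are made up for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def _bow(text: str) -> Counter:
    """Lower-cased bag-of-words vector for a short text."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_conditions(query: str, store: List[Tuple[str, str]], k: int = 2) -> List[str]:
    """Return the k conditions whose source question is most similar to `query`.

    `store` holds (training question, learned Prompt Condition) pairs.
    """
    q = _bow(query)
    ranked = sorted(store, key=lambda item: _cosine(q, _bow(item[0])), reverse=True)
    return [condition for _, condition in ranked[:k]]


# Hypothetical conditions learned from incorrect chain-of-thought answers.
store = [
    ("how many total gold medals did germany win", "Sum the column rather than counting rows."),
    ("which year had the highest attendance", "Compare numeric values, not strings."),
    ("who was the first player listed", "Row order is as given; do not re-sort."),
]
top = retrieve_conditions("how many total medals did france win", store, k=1)
```

The retrieved conditions would then be appended to the prompt as extra context, which is what lets a training-free prompting setup still benefit from training data.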
PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
Authors: Nanxi Li, Zhengyue Zhao, Chaowei Xiao
First: 2025-08-26T03:45:19+00:00 · Latest: 2025-08-26T03:45:19+00:00
Abstract
Safeguarding vision-language models (VLMs) is a critical challenge, as
existing methods often suffer from over-defense, which harms utility, or rely
on shallow alignment, failing to detect complex threats that require deep
reasoning. To this end, we introduce PRISM (Principled Reasoning for Integrated
Safety in Multimodality), a System-2-like framework that aligns VLMs by
embedding a structured, safety-aware reasoning process. Our framework consists
of two key components: PRISM-CoT, a dataset that teaches safety-aware
chain-of-thought reasoning, and PRISM-DPO, which is generated via Monte Carlo
Tree Search (MCTS) and further refines this reasoning through Direct Preference
Optimization, helping to obtain a delicate safety boundary. Comprehensive
evaluations demonstrate PRISM's effectiveness, achieving remarkably low attack
success rates, including 0.15% on JailbreakV-28K for Qwen2-VL and a 90%
improvement over the previous best method on VLBreak for LLaVA-1.5. PRISM also
exhibits strong robustness against adaptive attacks, significantly increasing
computational costs for adversaries, and generalizes effectively to
out-of-distribution challenges, reducing attack success rates to just 8.70% on
the challenging multi-image MIS benchmark. Remarkably, this robust defense is
achieved while preserving, and in some cases enhancing, model utility. To
promote reproducibility, we have made our code, data, and model weights
available at https://github.com/SaFoLab-WISC/PRISM.
Summary / 总结
Safeguarding vision-language models (VLMs) is a critical challenge, as existing methods often suffer from over-defense, which harms utility, or rely on shallow alignment, failing to detect complex threats that require deep reasoning.
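PRISM is a system that aligns vision-language models (VLMs) by integrating a structured, safety-aware reasoning process. It comprises PRISM-CoT, a dataset for teaching safety-aware chain-of-thought reasoning, and PRISM-DPO, which further refines this reasoning through Direct Preference Optimization. PRISM achieves low attack success rates and strong robustness against adaptive attacks, significantly outperforming previous methods. It also generalizes well to out-of-distribution challenges.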
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models
Authors: Zesen Lyu, Dandan Zhang, Wei Ye, Fangdi Li, Zhihang Jiang, Yao Yang
Venue: EMNLP 2025
First: 2025-05-27T05:17:41+00:00 · Latest: 2025-08-26T03:25:38+00:00
Comments: Accepted by EMNLP 2025 Main conference
Abstract
Spatial reasoning is a core component of human cognition, enabling
individuals to perceive, comprehend, and interact with the physical world. It
relies on a nuanced understanding of spatial structures and inter-object
relationships, serving as the foundation for complex reasoning and
decision-making. To investigate whether current vision-language models (VLMs)
exhibit similar capability, we introduce Jigsaw-Puzzles, a novel benchmark
consisting of 1,100 carefully curated real-world images with high spatial
complexity. Based on this dataset, we design five tasks to rigorously evaluate
VLMs' spatial perception, structural understanding, and reasoning capabilities,
while deliberately minimizing reliance on domain-specific knowledge to better
isolate and assess the general spatial reasoning capability. We conduct a
comprehensive evaluation across 24 state-of-the-art VLMs. The results show that
even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy
and performs particularly poorly on the Order Generation task, with only 30.00%
accuracy, far below the performance exceeding 90% achieved by human
participants. This persistent gap underscores the need for continued progress,
positioning Jigsaw-Puzzles as a challenging and diagnostic benchmark for
advancing spatial reasoning research in VLMs. Our project page is at
https://zesen01.github.io/jigsaw-puzzles.
Summary / 总结
Spatial reasoning is a core component of human cognition, enabling individuals to perceive, comprehend, and interact with the physical world.
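The study evaluates the spatial reasoning ability of vision-language models (VLMs) by introducing Jigsaw-Puzzles, a new benchmark of 1,100 real-world images. Five tasks are designed to test VLMs' spatial perception, structural understanding, and reasoning. The strongest model, Gemini-2.5-Pro, reaches 77.14% overall accuracy but only 30.00% on the Order Generation task, far below human participants. This shows that further research is needed to improve spatial reasoning in VLMs. The benchmark is intended to diagnose and advance spatial reasoning research in VLMs.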
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin
Venue: EMNLP
First: 2024-11-25T02:15:30+00:00 · Latest: 2025-08-26T02:48:31+00:00
Comments: Accepted by EMNLP-2025 Main. Project page:
https://szhanz.github.io/zoomeye/
Abstract
An image, especially a high-resolution one, typically consists of numerous
visual elements, ranging from dominant large objects to fine-grained detailed
objects. When perceiving such images, multimodal large language models (MLLMs)
face limitations due to the restricted input resolution of the pretrained
vision encoder and the cluttered, dense context of the image, resulting in a
focus on primary objects while easily overlooking detailed ones. In this paper,
we propose Zoom Eye, a tree search algorithm designed to navigate the
hierarchical and visual nature of images to capture relevant information. Zoom
Eye conceptualizes an image as a tree, with each child node representing a
zoomed sub-patch of its parent node and the root representing the overall image.
Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLM
to simulate human zooming actions by searching along the image tree from root
to leaf nodes, seeking out pertinent information, and accurately responding to
related queries. We experiment on a series of elaborate high-resolution
benchmarks and the results demonstrate that Zoom Eye not only consistently
improves the performance of a series of base MLLMs by a large margin (e.g.,
LLaVA-v1.5-7B increases by 34.57% on V* Bench and 17.88% on HR-Bench), but
also enables small 7B MLLMs to outperform strong large models such as GPT-4o.
Our code is available at
https://github.com/om-ai-lab/ZoomEye.
Summary / 总结
ZoomEye is a tree search algorithm designed to help multimodal large language models (MLLMs) better explore and utilize detailed information in high-resolution images. By conceptualizing images as trees, ZoomEye enables MLLMs to simulate human-like zooming actions, improving their performance on various benchmarks. The results show that ZoomEye significantly enhances the performance of both large and small MLLMs, with some small models even outperforming larger ones like GPT-4o.
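ZoomEye is a tree search algorithm that enhances multimodal large language models (MLLMs) by letting them simulate human zooming actions on images. It treats an image as a tree structure, allowing MLLMs to navigate and explore different parts of the image. Experiments show that ZoomEye significantly improves the performance of various MLLMs on high-resolution benchmarks and even enables small 7B models to outperform large models such as GPT-4o.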
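The image-as-tree idea behind Zoom Eye can be sketched as a best-first search over sub-patches. This is only an illustration under stated assumptions: the quadrant split, the priority queue, and the two callbacks `relevance` and `can_answer` (which in practice would be derived from an MLLM) are hypothetical and are not the released implementation.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Patch:
    x: int
    y: int
    w: int
    h: int

    def contains(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    def children(self, min_size: int) -> List["Patch"]:
        """Split into four quadrants; stop once quadrants would be too small."""
        hw, hh = self.w // 2, self.h // 2
        if hw < min_size or hh < min_size:
            return []
        return [
            Patch(self.x, self.y, hw, hh),
            Patch(self.x + hw, self.y, self.w - hw, hh),
            Patch(self.x, self.y + hh, hw, self.h - hh),
            Patch(self.x + hw, self.y + hh, self.w - hw, self.h - hh),
        ]


def zoom_search(root: Patch, relevance: Callable[[Patch], float],
                can_answer: Callable[[Patch], bool], min_size: int = 32) -> Optional[Patch]:
    """Best-first search from the whole image toward the most relevant patch."""
    counter = 0  # unique tie-breaker so patches themselves are never compared
    frontier = [(-relevance(root), counter, root)]
    while frontier:
        _, _, patch = heapq.heappop(frontier)
        if can_answer(patch):
            return patch
        for child in patch.children(min_size):
            counter += 1
            heapq.heappush(frontier, (-relevance(child), counter, child))
    return None


# Toy stand-ins for model confidence: a small object sits at pixel (700, 100).
target = (700, 100)
found = zoom_search(
    Patch(0, 0, 1024, 1024),
    relevance=lambda p: 1.0 if p.contains(*target) else 0.0,
    can_answer=lambda p: p.contains(*target) and p.w <= 128,
)
```

The search zooms from the root into progressively smaller patches that still look relevant, which mirrors the root-to-leaf traversal described in the abstract.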
The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
Authors: Girish A. Koushik, Fatemeh Nazarieh, Katherine Birch, Shenbin Qian, Diptesh Kanojia
First: 2025-08-26T00:04:01+00:00 · Latest: 2025-08-26T00:04:01+00:00
Comments: Under Review
Abstract
Visual metaphor generation is a challenging task that aims to generate an
image given an input text metaphor. Inherently, it needs language understanding
to bind a source concept with a target concept, in a way that preserves meaning
while ensuring visual coherence. We propose a self-evaluating visual metaphor
generation framework that focuses on metaphor alignment. Our self-evaluation
approach combines existing metrics with our newly proposed metaphor
decomposition score and a meaning alignment (MA) metric. Within this setup, we
explore two novel approaches: a training-free pipeline that explicitly
decomposes prompts into source-target-meaning (S-T-M) mapping for image
synthesis, and a complementary training-based pipeline that improves alignment
using our proposed self-evaluation reward schema, without any large-scale
retraining. On the held-out test set, the training-free approach surpasses
strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores,
with the training-based approach close behind. We evaluated our framework's
output in a user study and observed that participants preferred GPT-4o
overall, while our training-free pipeline led open-source methods and edged
Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or
more abstract metaphors, with closed models excelling on short, concrete cases;
we also observe sensitivity to sampler settings. Overall, structured prompting
and lightweight RL perform metaphor alignment well under modest compute, and
remaining gaps to human preference appear driven by aesthetics and sampling.
Summary / 总结
Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor.
Generic Guard AI in Stealth Game with Composite Potential Fields
Authors: Kaijie Xu, Clark Verbrugge
First: 2025-08-25T21:56:13+00:00 · Latest: 2025-08-25T21:56:13+00:00
Abstract
Guard patrol behavior is central to the immersion and strategic depth of
stealth games, yet most existing systems rely on hand-crafted routes or
specialized logic that struggle to balance coverage efficiency and responsive
pursuit with believable naturalness. We propose a generic, fully explainable,
training-free framework that integrates global knowledge and local information
via Composite Potential Fields, combining three interpretable maps (Information,
Confidence, and Connectivity) into a single kernel-filtered decision criterion.
Our parametric, designer-driven approach requires only a handful of decay and
weight parameters, with no retraining, to adapt smoothly across both
occupancy-grid and NavMesh-partition abstractions. We evaluate on five representative game
maps, two player-control policies, and five guard modes, confirming that our
method outperforms classical baseline methods in both capture efficiency and
patrol naturalness. Finally, we show how common stealth mechanics (distractions
and environmental elements) integrate naturally into our framework as
submodules, enabling rapid prototyping of rich, dynamic, and responsive guard
behaviors.
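Title / Abstract (translated from Chinese)
Title: Generic Guard AI in Stealth Game with Composite Potential Fields
Guard patrol behavior is central to the immersion and strategic depth of stealth games, yet most existing systems rely on hand-crafted routes or specialized logic and struggle to balance coverage efficiency and responsive pursuit in a believable, natural way. We propose a generic, fully explainable, training-free framework that integrates global knowledge and local information through Composite Potential Fields, merging three interpretable maps (Information, Confidence, and Connectivity) into a single kernel-filtered decision criterion. Our parametric, designer-driven approach needs only a handful of decay and weight parameters, with no retraining, to adapt smoothly to both occupancy-grid and NavMesh-partition abstractions. We evaluate on five representative game maps, two player-control policies, and five guard modes, confirming that our method outperforms classical baselines in both capture efficiency and patrol naturalness. Finally, we show how common stealth mechanics (distractions and environmental elements) integrate naturally into our framework as submodules, enabling rapid prototyping of rich, dynamic, and responsive guard behaviors.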
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Authors: Yihe Tang, Wenlong Huang, Yingke Wang, Chengshu Li, Roy Yuan, Ruohan Zhang, Jiajun Wu, Li Fei-Fei
First: 2025-06-10T22:47:16+00:00 · Latest: 2025-08-25T19:45:29+00:00
Abstract
Understanding fine-grained object affordances is imperative for robots to
manipulate objects in unstructured environments given open-ended task
instructions. However, existing methods for visual affordance prediction often
rely on manually annotated data or condition only on a predefined set of
tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for
distilling affordance knowledge from foundation models into a task-conditioned
affordance model without any manual annotations. By leveraging the
complementary strengths of large vision models and vision-language models, UAD
automatically annotates a large-scale dataset with detailed <instruction,
visual affordance> pairs. Training only a lightweight task-conditioned
decoder atop frozen features, UAD exhibits notable generalization to
in-the-wild robotic scenes and to various human activities, despite only being
trained on rendered objects in simulation. Using affordance provided by UAD as
the observation space, we show an imitation learning policy that demonstrates
promising generalization to unseen object instances, object categories, and
even variations in task instructions after training on as few as 10
demonstrations. Project website: https://unsup-affordance.github.io/
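Title / Abstract (translated from Chinese)
Title: UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
Understanding fine-grained object affordances is essential for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing visual affordance prediction methods often rely on manually annotated data or are limited to a predefined set of tasks. We introduce UAD (Unsupervised Affordance Distillation), a method for distilling affordance knowledge from foundation models without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed <instruction, visual affordance> pairs. Training only a lightweight task-conditioned decoder on top of frozen features, UAD generalizes notably to in-the-wild robotic scenes and to various human activities, despite being trained only on rendered objects in simulation. Using the affordances provided by UAD as the observation space, we show an imitation learning policy that generalizes to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations. Project website: https://unsup-affordance.github.io/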
Summary / 总结
Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions.
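UAD is an unsupervised method that, without manual annotations, distills fine-grained object affordance knowledge from foundation models into a task-conditioned model. By leveraging the strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset of detailed instruction-affordance pairs. Although trained only on rendered objects in simulation, UAD generalizes notably to real-world robotic scenes and various human activities, and an imitation learning policy that uses UAD's affordances generalizes to unseen object instances and task instructions after as few as 10 demonstrations.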
CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering
Authors: Aranya Saha, Tanvir Ahmed Khan, Ismam Nur Swapnil, Mohammad Ariful Haque
First: 2025-08-25T19:22:16+00:00 · Latest: 2025-08-25T19:22:16+00:00
Comments: 10 pages, 8 figures, Prepared for submission to IEEE Transactions on
Human-Machine Systems
Abstract
Vision-language models (VLMs) have shown significant potential for medical
tasks; however, their general-purpose nature can limit specialized diagnostic
accuracy, and their large size poses substantial inference costs for real-world
clinical deployment. To address these challenges, we introduce CLARIFY, a
Specialist-Generalist framework for dermatological visual question answering
(VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image
classifier (the Specialist) that provides fast and highly accurate diagnostic
predictions, and (ii) a powerful yet compressed conversational VLM (the
Generalist) that generates natural language explanations to user queries. In
our framework, the Specialist's predictions directly guide the Generalist's
reasoning, focusing it on the correct diagnostic path. This synergy is further
enhanced by a knowledge graph-based retrieval module, which grounds the
Generalist's responses in factual dermatological knowledge, ensuring both
accuracy and reliability. This hierarchical design not only reduces diagnostic
errors but also significantly improves computational efficiency. Experiments on
our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an
18% improvement in diagnostic accuracy over the strongest baseline, a
fine-tuned, uncompressed single-line VLM, while reducing the average VRAM
requirement and latency by at least 20% and 5%, respectively. These results
indicate that a Specialist-Generalist system provides a practical and powerful
paradigm for building lightweight, trustworthy, and clinically viable AI
systems.
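Title / Abstract (translated from Chinese)
Title: CLARIFY: A Specialist-Generalist Framework for Dermatological Visual Question Answering
Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature limits specialized diagnostic accuracy, and their large size brings substantial inference costs in real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations in response to user queries. In our framework, the Specialist's predictions directly guide the Generalist's reasoning, focusing it on the correct diagnostic path. This synergy is further strengthened by a knowledge graph-based retrieval module, which grounds the Generalist's responses in factual dermatological knowledge, ensuring accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset show that CLARIFY improves diagnostic accuracy by 18% over the strongest baseline (a fine-tuned, uncompressed single-line VLM) while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system offers a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.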
Summary / 总结
Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment.
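CLARIFY is a Specialist-Generalist framework that improves dermatological visual question answering (VQA) by combining a lightweight, domain-trained image classifier (the Specialist) with a compressed conversational vision-language model (the Generalist). The Specialist provides fast, accurate diagnostic predictions that guide the Generalist's reasoning, while a knowledge graph-based retrieval module ensures factual accuracy. Experiments show that, compared with a fine-tuned, uncompressed vision-language model, CLARIFY improves diagnostic accuracy by 18% and reduces VRAM requirements by at least 20% and latency by at least 5%.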
SafeBimanual: Diffusion-based Trajectory Optimization for Safe Bimanual Manipulation
Authors: Haoyuan Deng, Wenkai Guo, Qianzhun Wang, Zhenyu Wu, Ziwei Wang
First: 2025-08-25T17:59:02+00:00 · Latest: 2025-08-25T17:59:02+00:00
Comments: Project website is at: https://denghaoyuan123.github.io/SafeBimanip/
Abstract
Bimanual manipulation has been widely applied in household services and
manufacturing, where it enables the completion of complex tasks with
coordination requirements. Recent diffusion-based policy learning approaches
have achieved promising performance in modeling action distributions for bimanual
manipulation. However, they ignore the physical safety constraints of bimanual
manipulation, which leads to dangerous behaviors that damage robots and
objects. To this end, we propose a test-time trajectory optimization framework
named SafeBimanual for any pre-trained diffusion-based bimanual manipulation
policies, which imposes the safety constraints on bimanual actions to avoid
dangerous robot behaviors with improved success rate. Specifically, we design
diverse cost functions for safety constraints in different dual-arm cooperation
patterns, including avoidance of tearing objects and of collisions between arms
and objects, which optimize the manipulator trajectories through guided sampling
of the diffusion denoising process. Moreover, we employ a vision-language model
(VLM) to schedule the cost functions by specifying keypoints and their corresponding
pairwise relationships, so that the optimal safety constraint is dynamically
generated throughout the entire bimanual manipulation process. SafeBimanual
demonstrates superiority on 8 simulated tasks in RoboTwin with a 13.7% increase
in success rate and an 18.8% reduction in unsafe interactions over
state-of-the-art diffusion-based methods. Extensive experiments on 4 real-world
tasks further verify its practical value by improving the success rate by
32.5%.
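The cost-guided denoising idea described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the quadratic arm-separation cost, the `d_min` and `scale` parameters, and the `denoise` callback (a stand-in for one denoising step of a pre-trained policy) are hypothetical and are not the authors' cost functions or implementation.

```python
import math

def separation_cost_grad(left, right, d_min):
    """Gradient of 0.5 * max(0, d_min - ||left - right||)**2 w.r.t. both points."""
    diff = [a - b for a, b in zip(left, right)]
    dist = math.sqrt(sum(d * d for d in diff))
    if dist >= d_min or dist == 0.0:
        # No penalty when the points are far enough apart. The gradient is
        # undefined at dist == 0, so this toy version returns zero there.
        zero = [0.0] * len(left)
        return zero, list(zero)
    coeff = -(d_min - dist) / dist
    g_left = [coeff * d for d in diff]
    g_right = [-g for g in g_left]
    return g_left, g_right

def guided_sampling(left, right, denoise, steps, d_min=0.2, scale=0.5):
    """Alternate a (hypothetical) denoising step with a cost-gradient correction."""
    for _ in range(steps):
        left, right = denoise(left, right)
        g_left, g_right = separation_cost_grad(left, right, d_min)
        # Move each end-effector position down the cost gradient.
        left = [x - scale * g for x, g in zip(left, g_left)]
        right = [x - scale * g for x, g in zip(right, g_right)]
    return left, right

def identity_denoise(left, right):
    # Placeholder for a learned denoiser; it leaves the sample unchanged.
    return left, right
```

Calling `guided_sampling([0.0, 0.0], [0.1, 0.0], identity_denoise, steps=1)` pushes the two points apart until they are `d_min` apart. In a real system the correction would be applied to whole trajectories inside the policy's sampler, and the active cost would be chosen per task phase.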
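Title / Abstract (translated from Chinese)
Title: SafeBimanual: Diffusion-Based Trajectory Optimization for Bimanual Manipulation
Bimanual manipulation is widely used in household services and manufacturing, where it accomplishes complex tasks that require coordination. Recent diffusion-based policy learning methods have achieved encouraging performance in modeling action distributions for bimanual manipulation. However, they ignore the physical safety constraints of bimanual manipulation, leading to dangerous behaviors that damage robots and objects. To address this, we propose SafeBimanual, a test-time trajectory optimization framework that imposes safety constraints on any pre-trained diffusion-based bimanual manipulation policy to improve the success rate and avoid dangerous robot behaviors. Specifically, we design multiple cost functions for different dual-arm cooperation patterns, including avoiding object tearing and collisions between arms and objects, and optimize the manipulator trajectories by guiding the sampling of the diffusion denoising process. In addition, we use a vision-language model (VLM) to schedule the cost functions by specifying keypoints and their pairwise relationships, so that the optimal safety constraint is generated dynamically throughout the bimanual manipulation process. On 8 simulated tasks in RoboTwin, SafeBimanual outperforms state-of-the-art diffusion-based methods, with a 13.7% higher success rate and 18.8% fewer unsafe interactions. Extensive experiments on 4 real-world tasks improve the success rate by 32.5%, further verifying its practical value.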
Summary / 总结
SafeBimanual is a test-time trajectory optimization framework designed to enhance the safety of bimanual manipulation tasks by imposing physical safety constraints on pre-trained diffusion-based policies. It optimizes manipulator trajectories through guided sampling and diverse cost functions for different dual-arm cooperation patterns, with the help of a vision-language model to dynamically generate optimal safety constraints. The framework shows a 13.7% increase in success rate and an 18.8% reduction in unsafe interactions compared to state-of-the-art methods in simulated tasks, and further improves the success rate by 32.5% in real-world tasks.
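SafeBimanual is a test-time trajectory optimization framework that aims to improve the safety of bimanual manipulation by incorporating physical safety constraints into diffusion-based policies. It uses diverse cost functions to optimize trajectories and a vision-language model to dynamically generate the optimal safety constraints. SafeBimanual shows a 13.7% improvement in success rate and an 18.8% reduction in unsafe interactions on simulated tasks, and further improves the success rate by 32.5% on real-world tasks.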
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Authors: Sixun Dong, Juhua Hu, Mian Zhang, Ming Yin, Yanjie Fu, Qi Qian
First: 2025-08-25T17:57:49+00:00 · Latest: 2025-08-25T17:57:49+00:00
Comments: Project page: https://project.ironieser.cc/mmtok
Abstract
Vision-Language Models (VLMs) demonstrate impressive performance in
understanding visual content with language instruction by converting visual
input to vision tokens. However, redundancy in vision tokens degrades the
inference efficiency of VLMs. While many algorithms have been
proposed to reduce the number of vision tokens, most of them use only
unimodal information (i.e., vision or text) for pruning and ignore the inherent
multimodal property of vision-language tasks. Moreover, there is no generic
criterion that can be applied to different modalities. To mitigate this
limitation, in this work, we propose to leverage both vision and text tokens to
select informative vision tokens by the criterion of coverage. We first
formulate the subset selection problem as a maximum coverage problem.
Afterward, a subset of vision tokens is optimized to cover the text tokens and
the original set of vision tokens, simultaneously. Finally, a VLM agent can be
adopted to further improve the quality of text tokens for guiding vision
pruning. The proposed method MMTok is extensively evaluated on benchmark
datasets with different VLMs. The comparison illustrates that vision and text
information are complementary, and combining multimodal information can surpass
the unimodal baseline by a clear margin. Moreover, under the maximum coverage
criterion on the POPE dataset, our method achieves a 1.87x speedup while
maintaining 98.7% of the original performance on LLaVA-NeXT-13B. Furthermore,
with only four vision tokens, it still preserves 87.7% of the original
performance on LLaVA-1.5-7B. These results highlight the effectiveness of
coverage in token selection.
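The coverage-based selection described above can be sketched as a greedy maximum-coverage routine. This is a minimal illustration under stated assumptions, not the MMTok implementation: the facility-location-style objective (each target token is covered by its most similar selected vision token), the clipped cosine similarity, the weight `alpha` between text and vision coverage, and all function names are hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def coverage_gain(sims, covered):
    """Increase in total coverage if a candidate with similarities `sims` is added."""
    return sum(max(s - c, 0.0) for s, c in zip(sims, covered))

def greedy_coverage(vision, text, k, alpha=0.5):
    """Greedily pick k vision-token indices that cover both text and vision tokens."""
    n = len(vision)
    # Negative similarities are clipped to zero so coverage never decreases.
    sim_v = [[max(cosine(vi, vj), 0.0) for vj in vision] for vi in vision]
    sim_t = [[max(cosine(vi, t), 0.0) for t in text] for vi in vision]
    covered_v = [0.0] * n
    covered_t = [0.0] * len(text)
    selected = []
    for _ in range(min(k, n)):
        best_i, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            gain = (alpha * coverage_gain(sim_t[i], covered_t) / max(len(text), 1)
                    + (1 - alpha) * coverage_gain(sim_v[i], covered_v) / n)
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        # Each target keeps the best similarity to any selected token so far.
        covered_v = [max(s, c) for s, c in zip(sim_v[best_i], covered_v)]
        covered_t = [max(s, c) for s, c in zip(sim_t[best_i], covered_t)]
    return selected
```

For example, with `vision = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]` and `text = [[1.0, 0.0], [0.0, 1.0]]`, asking for `k=2` keeps only one of the two near-duplicate tokens plus the distinct third token. An objective of this form is monotone and submodular, which is why a greedy pass is a reasonable choice for it.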
Summary / 总结
The research aims to improve the efficiency of Vision-Language Models (VLMs) by reducing redundant vision tokens while preserving performance. The method, MMTok, leverages both vision and text tokens to select informative vision tokens based on the criterion of coverage. Experiments show that combining multimodal information outperforms unimodal baselines, achieving a 1.87x speedup with 98.7% of original performance on LLaVA-NeXT-13B and 87.7% performance with only four vision tokens on LLaVA-1.5-7B.
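The research aims to improve the efficiency of vision-language models (VLMs) by reducing redundant vision tokens while maintaining performance. The method, MMTok, uses both vision and text tokens to select informative vision tokens according to a coverage criterion. Experiments show that combining multimodal information outperforms unimodal baselines, achieving a 1.87x speedup while retaining 98.7% of the original performance on LLaVA-NeXT-13B, and still retaining 87.7% of the original performance on LLaVA-1.5-7B with only four vision tokens.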
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Authors: Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
First: 2025-08-25T16:33:07+00:00 · Latest: 2025-08-25T16:33:07+00:00
Comments: COLM 2025
Abstract
Evaluating whether vision-language models (VLMs) reason consistently across
representations is challenging because modality comparisons are typically
confounded by task differences and asymmetric information. We introduce SEAM, a
benchmark that pairs semantically equivalent inputs across four domains that
have existing standardized textual and visual notations. By employing distinct
notation systems across modalities, in contrast to OCR-based image-text
pairing, SEAM provides a rigorous comparative assessment of the
textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21
contemporary models, we observe systematic modality imbalance: vision
frequently lags language in overall performance, despite the problems
containing semantically equivalent information, and cross-modal agreement is
relatively low. Our error analysis reveals two main drivers: textual perception
failures from tokenization in domain notation and visual perception failures
that induce hallucinations. We also show that our results are largely robust to
visual transformations. SEAM establishes a controlled, semantically equivalent
setting for measuring and improving modality-agnostic reasoning.
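The per-modality performance and cross-modal agreement mentioned above can be computed from paired predictions along the following lines. This is an illustrative sketch only: the abstract does not define the agreement metric, so exact-match agreement over paired items is an assumption, and the function and key names are hypothetical.

```python
def modality_report(text_answers, vision_answers, gold):
    """Accuracy per modality and exact-match agreement between modalities."""
    n = len(gold)
    if n == 0:
        raise ValueError("need at least one item")
    if not (len(text_answers) == len(vision_answers) == n):
        raise ValueError("all answer lists must have the same length")
    text_acc = sum(t == g for t, g in zip(text_answers, gold)) / n
    vision_acc = sum(v == g for v, g in zip(vision_answers, gold)) / n
    # Agreement counts items where both modalities give the same answer,
    # whether or not that answer is correct.
    agreement = sum(t == v for t, v in zip(text_answers, vision_answers)) / n
    return {
        "text_accuracy": text_acc,
        "vision_accuracy": vision_acc,
        "agreement": agreement,
    }
```

Because each item is the same problem presented in two notations, a gap between the two accuracies, or low agreement, points to a modality effect rather than a difference in task content.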
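Title / Abstract (translated from Chinese)
Title: SEAM: A Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Evaluating whether vision-language models (VLMs) reason consistently across different representations is challenging, because modality comparisons are usually confounded by task differences and information asymmetry. We introduce SEAM, a benchmark that pairs semantically equivalent inputs in four domains that have existing standardized textual and visual notations. By using distinct notation systems across modalities, rather than OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of textual-symbolic and visual-spatial reasoning capabilities. Across 21 contemporary models, we observe a systematic modality imbalance: although the problems contain semantically equivalent information, vision frequently lags behind language in overall performance, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures caused by tokenization of domain notation, and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.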
Summary / 总结
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information.