arXiv 论文速递

VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space

Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng

First: 2025-08-26T17:59:47+00:00 · Latest: 2025-08-26T17:59:47+00:00

Comments: Project page: https://huanngzh.github.io/VoxHammer-Page/

Abstract

3D local editing of specified regions is crucial for game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.

中文标题/摘要

标题：VoxHammer：在原生3D空间中进行无需训练的精确且连贯的3D编辑

对指定区域进行3D局部编辑对于游戏行业和机器人交互至关重要。现有方法通常编辑渲染的多视角图像，然后重建3D模型，但它们在精确保留未编辑区域和整体连贯性方面面临挑战。受结构化3D生成模型的启发，我们提出VoxHammer，这是一种全新的无需训练的方法，在3D潜在空间中执行精确和连贯的编辑。给定一个3D模型，VoxHammer首先预测其反转轨迹，并在每个时间步获取其反转的潜在变量和键值令牌。随后，在去噪和编辑阶段，我们用保留区域的相应反转潜在变量和缓存的键值令牌替换去噪特征。通过保留这些上下文特征，该方法确保保留区域的一致重建和编辑部分的连贯集成。为了评估保留区域的一致性，我们构建了Edit3D-Bench，这是一个由数百个样本组成的人工标注数据集，每个样本都有仔细标注的3D编辑区域。实验表明，VoxHammer在保留区域的3D一致性和整体质量方面显著优于现有方法。我们的方法有望合成高质量的编辑配对数据，从而为基于上下文的3D生成奠定数据基础。

Summary / 总结

VoxHammer is a training-free approach for precise and coherent 3D editing in native 3D space, addressing the difficulty that multi-view editing pipelines have in preserving unedited regions and maintaining overall coherence. It first predicts the inversion trajectory of a 3D model, caching the inverted latents and key-value tokens at each timestep; during denoising and editing, it replaces the denoising features of preserved regions with these cached values. To evaluate consistency, the authors construct Edit3D-Bench, a human-annotated dataset with labeled 3D editing regions. Experiments show that VoxHammer outperforms existing methods in both the 3D consistency of preserved regions and overall quality, and the method is promising for synthesizing high-quality edited paired data for in-context 3D generation.

VoxHammer 是一种无需训练的方法，可以在 3D 潜空间中对 3D 模型进行精确且连贯的编辑。给定一个 3D 模型后，VoxHammer 预测其反转轨迹，并在每个时间步缓存反转的潜在特征和键值令牌；在去噪和编辑阶段，用这些缓存值替换保留区域的去噪特征，从而确保保留区域的一致重建和编辑部分的连贯集成。实验表明，VoxHammer 在保留区域的 3D 一致性和整体质量两方面均优于现有方法，并有望用于合成高质量的编辑配对数据，以支持上下文中的 3D 生成。
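The replacement step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the flat `(N, C)` latent layout, and the `denoise_step` callable are assumptions made for illustration, and the key-value token replacement inside attention layers is left out entirely.

```python
import numpy as np


def blend_preserved(denoised, inverted, keep_mask):
    """Overwrite preserved voxels with the cached inverted latents.

    denoised, inverted: (N, C) latent features for N voxels (assumed layout).
    keep_mask: (N,) boolean array, True where the region must stay unchanged.
    """
    return np.where(keep_mask[:, None], inverted, denoised)


def edit_loop(inverted_trajectory, keep_mask, denoise_step):
    """Toy editing loop over cached latents, ordered noisiest first.

    denoise_step(latents, t) is a hypothetical stand-in for one step of the
    generative model conditioned on the edit.
    """
    latents = inverted_trajectory[0].copy()
    for t in range(1, len(inverted_trajectory)):
        latents = denoise_step(latents, t)
        # Preserved voxels are reset to the cached value for this timestep.
        latents = blend_preserved(latents, inverted_trajectory[t], keep_mask)
    return latents
```

Because preserved voxels are reset at every step, they finish exactly at the last cached latent, while the remaining voxels follow the edited denoising path.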

Articulate3D: Zero-Shot Text-Driven 3D Object Posing

Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht

First: 2025-08-26T17:59:17+00:00 · Latest: 2025-08-26T17:59:17+00:00

Comments: Project page: https://odeb1.github.io/articulate3d_page_deb/

Abstract

We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/

Summary / 总结

Articulate3D proposes a training-free method to pose 3D assets using text instructions. It decomposes the task into two steps: modifying an image-generator to create target images based on input images and text, and aligning the mesh to these target images through multi-view pose optimization. The method introduces RSActrl for consistent structure across poses and uses keypoints for correspondence. Experiments show that Articulate3D can manipulate diverse 3D objects with free-form text while preserving their original identity, outperforming existing approaches in quantitative evaluations and user studies.

Articulate3D 提出了一种无需训练的方法，通过文本指令来调整 3D 资产的姿态。该方法将任务分解为两步：修改图像生成器以根据输入图像和文本生成目标图像，然后通过多视图姿态优化步骤将网格与这些目标图像对齐。方法中引入了 RSActrl 机制以在不同姿态下保持结构一致性，并使用关键点来建立输入图像与目标图像之间的对应关系。实验表明，Articulate3D 可以使用自由形式的文本调整各种 3D 对象的姿态并保持其原始身份，定量评估和用户研究均证明其优于现有方法。
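To make the idea of keypoint-driven pose fitting concrete, here is a heavily simplified sketch: a single 2D joint rotating about a known pivot, fitted in closed form to matched keypoints (a 2D Procrustes rotation). The paper's multi-view optimisation over an articulated mesh is more involved; the function names and the single-joint setup are assumptions for illustration only.

```python
import numpy as np


def best_joint_angle(source_kp, target_kp, pivot):
    """Least-squares rotation angle (radians) about `pivot` mapping source to target.

    source_kp, target_kp: (N, 2) matched keypoints; pivot: (2,) joint position.
    """
    pivot = np.asarray(pivot, dtype=float)
    s = np.asarray(source_kp, dtype=float) - pivot
    t = np.asarray(target_kp, dtype=float) - pivot
    cross = np.sum(s[:, 0] * t[:, 1] - s[:, 1] * t[:, 0])
    dot = np.sum(s[:, 0] * t[:, 0] + s[:, 1] * t[:, 1])
    return float(np.arctan2(cross, dot))


def rotate_about(points, pivot, angle):
    """Rotate (N, 2) points counter-clockwise by `angle` about `pivot`."""
    pivot = np.asarray(pivot, dtype=float)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return (np.asarray(points, dtype=float) - pivot) @ rot.T + pivot
```

The closed form works here because only one angle is unknown; with several joints and several views, an iterative optimiser over the same keypoint residuals would be the natural extension.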

Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment

Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar

First: 2025-08-22T23:34:37+00:00 · Latest: 2025-08-26T17:13:21+00:00

Abstract

Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment. These pipelines are rarely streamlined for data science pipeline, reducing efficiency and raising operational costs. Workflows also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. In response, we present a practical, healthcare-first framework that uses a single vision-language model (VLM) in two complementary roles. First (Solution 1), the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card id). Checks are provided by (i) stagewise prompts that allow early exit via None/Normal/Other and (ii) a stagewise answer selector that arbitrates between the top-2 candidates at each stage, reducing the chance of an incorrect selection and aligning the workflow with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on specialty-specific datasets ensuring a single model covers multiple downstream tasks within each specialty, maintaining performance while simplifying deployment. Across gastroenterology, hematology, ophthalmology, and pathology, our single-model deployment matches or approaches specialized baselines. Compared with pipelines composed of many task-specific agents, this approach shows that one VLM can both decide and do. It may reduce effort by data scientists, shorten monitoring, increase the transparency of model selection (with per-stage justifications), and lower integration overhead.

Summary / 总结

The paper addresses inefficiencies in fragmented clinical workflows by proposing a framework that uses a single vision-language model (VLM) in two roles: as a model-card matcher that routes an incoming image to the appropriate specialist model, and as a fine-tuned model that handles multiple downstream tasks within each specialty. Routing follows a three-stage workflow (modality -> primary abnormality -> model-card id), with stagewise prompts that allow early exit via None/Normal/Other and a stagewise answer selector that arbitrates between the top-2 candidates to reduce the chance of an incorrect selection. Across gastroenterology, hematology, ophthalmology, and pathology, the single-model deployment matches or approaches specialized baselines. The authors suggest the approach may reduce data-scientist effort, shorten monitoring, lower integration overhead, and make model selection more transparent through per-stage justifications.

论文针对临床工作流碎片化带来的低效问题提出了一种框架，使用单一的视觉语言模型（VLM）承担两个角色：作为模型卡片匹配器将输入图像路由到合适的专科模型，以及作为微调模型处理每个专科内的多个下游任务。路由采用三阶段工作流（模态 -> 主要异常 -> 模型卡片 id），各阶段的提示允许通过 None/Normal/Other 提前退出，并由分阶段的答案选择器在前两名候选之间进行裁决，以降低错误选择的概率。在胃肠病学、血液学、眼科和病理学中，单模型部署可以匹配或接近专科基准。作者认为该方法可能减少数据科学家的工作量、缩短监控时间、降低集成开销，并通过各阶段的理由说明提高模型选择的透明度。
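The three-stage routing with early exit and top-2 arbitration can be sketched as plain control flow. Everything model-related is a placeholder: `rank` and `arbitrate` are hypothetical callables standing in for VLM prompts, and the stage names and exit labels are taken from the abstract's wording rather than from any released code.

```python
from typing import Callable, List, Optional, Tuple

STAGES = ("modality", "primary_abnormality", "model_card_id")
EXIT_LABELS = {"None", "Normal", "Other"}

# rank(image, stage, context) -> [(label, score), ...]  (hypothetical VLM call)
Ranker = Callable[[str, str, List[str]], List[Tuple[str, float]]]
# arbitrate(image, stage, first, second) -> chosen label  (hypothetical VLM call)
Arbiter = Callable[[str, str, str, str], str]


def route(image: str, rank: Ranker, arbitrate: Arbiter):
    """Return (model_card_id or None, per-stage trace of decisions)."""
    trace: List[Tuple[str, str]] = []
    choice: Optional[str] = None
    for stage in STAGES:
        context = [label for _, label in trace]
        candidates = sorted(rank(image, stage, context), key=lambda c: c[1], reverse=True)
        if not candidates:
            return None, trace
        top = [label for label, _ in candidates[:2]]
        # The selector only arbitrates when there are two candidates to compare.
        choice = top[0] if len(top) == 1 else arbitrate(image, stage, top[0], top[1])
        trace.append((stage, choice))
        if choice in EXIT_LABELS:
            return None, trace  # early exit: no specialist model is invoked
    return choice, trace
```

Returning the trace alongside the decision is one way to keep the per-stage choices auditable, in the spirit of the per-stage justifications the abstract mentions.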

mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation

Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu

First: 2025-05-29T23:32:03+00:00 · Latest: 2025-08-26T16:42:37+00:00

Comments: 16 pages

Abstract

Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.

中文标题/摘要

标题：mRAG：多模态检索增强生成的设计空间阐明

大型视觉-语言模型（LVLMs）在视觉问答、视觉定位和复杂推理等多模态任务中取得了显著进展。然而，它们仍然受限于静态训练数据、幻觉倾向以及无法验证与最新外部证据一致的断言，这在动态现实世界应用中影响了它们的性能。检索增强生成（RAG）提供了一种实用的解决方案，通过检索机制使LVLMs能够访问大规模知识数据库，从而将模型输出与事实性、上下文相关的信息联系起来。在这篇论文中，我们首次系统地剖析了LVLMs的多模态RAG管道，明确研究了（1）检索阶段：模态配置和检索策略，（2）重排序阶段：减少位置偏差并提高检索证据的相关性的策略，以及（3）生成阶段：进一步研究如何将检索到的候选者最佳地整合到最终生成过程中。最后，我们扩展了研究，探索了一种统一的代理框架，通过自我反思将重排序和生成过程整合起来，使LVLMs能够动态地选择相关证据并抑制无关背景。我们对RAG的全面探索提供了宝贵的见解，无需微调即实现了平均5%的性能提升。

Summary / 总结

This paper investigates the design space of multimodal retrieval-augmented generation (RAG) for large vision-language models (LVLMs). It systematically examines three stages: retrieval (modality configurations and retrieval strategies), re-ranking (mitigating positional bias and improving the relevance of retrieved evidence), and generation (how best to integrate retrieved candidates). It also explores a unified agentic framework that integrates re-ranking and generation through self-reflection, so that the LVLM can select relevant evidence and suppress irrelevant context. The authors report an average performance boost of 5% without any fine-tuning.

本文研究了大型视觉语言模型（LVLM）中多模态检索增强生成（RAG）的设计空间，系统地探讨了检索阶段（模态配置与检索策略）、重排序阶段（缓解位置偏差并提高检索证据的相关性）和生成阶段（如何最佳地整合检索到的候选内容），并探索了一种通过自我反思整合重排序与生成的统一代理框架。研究在无需微调的情况下实现了平均5%的性能提升。
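The three stages can be laid out as a small pipeline skeleton. This is a generic sketch, not the configuration studied in the paper: cosine similarity over precomputed embeddings stands in for the retriever, `relevance` is a hypothetical scoring callable standing in for a re-ranker, and the prompt template is an assumption.

```python
import numpy as np


def retrieve(query_vec, doc_vecs, k):
    """Return indices of the k documents with the highest cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    return [int(i) for i in np.argsort(-scores)[:k]]


def rerank(candidates, relevance, keep):
    """Reorder candidate indices by a second-stage relevance score and truncate."""
    return sorted(candidates, key=relevance, reverse=True)[:keep]


def build_prompt(question, passages):
    """Place the selected evidence ahead of the question for the generator."""
    evidence = "\n".join(f"[{n + 1}] {p}" for n, p in enumerate(passages))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
```

Keeping the stages as separate functions makes it straightforward to swap one component at a time, which is the kind of stage-by-stage comparison the paper describes.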

Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA

Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar

First: 2025-07-07T22:31:56+00:00 · Latest: 2025-08-26T14:02:57+00:00

Abstract

Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain-specific fine-tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. This study tested seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives, analogous to peer consultation, or retrieval-augmented generation that incorporates medical literature at inference time, serving a role similar to reference-checking. While fine-tuning degraded performance in four of seven models with an average 30% decrease, baseline models collapsed on test data. Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy, maintaining performance on unseen data while generating explainable, literature-grounded outputs critical for clinical adoption. These findings demonstrate that medical AI succeeds by reconstructing the collaborative and evidence-based practices fundamental to clinical diagnosis.

Summary / 总结

This study targets dermatological care via telemedicine, testing whether architectures that mimic clinical reasoning outperform domain-specific fine-tuning on multimodal medical visual question answering. Seven vision-language models were tested across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives or retrieval-augmented generation over medical literature. Fine-tuning degraded performance in four of the seven models, with an average 30% decrease, and baseline models collapsed on test data. Clinical-inspired architectures achieved up to 70% accuracy, maintaining performance on unseen data and generating explainable, literature-grounded outputs essential for clinical adoption.

该研究面向远程医疗中的皮肤科护理，检验模仿临床推理过程的架构能否在多模态医学视觉问答上优于领域微调。七个视觉语言模型在六种配置下进行了测试，包括基础模型、微调变体，以及在二者基础上增加推理层或检索增强生成的模型。微调使七个模型中的四个性能下降，平均降幅为30%，而基础模型在测试数据上表现崩溃；临床启发式架构最高达到70%的准确率，在未见过的数据上保持了性能，并生成了临床采用所需的可解释、有文献支持的输出。
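As a rough illustration of combining several model perspectives, the sketch below uses a confidence-weighted majority vote. This is a deliberately simple stand-in and almost certainly not the paper's reasoning layer; the `(answer, confidence)` input format and the tie-breaking rule are assumptions.

```python
from collections import defaultdict


def consult(opinions):
    """Pick a consensus answer from (answer, confidence) pairs.

    The answer with the most votes wins; ties are broken by summed confidence.
    Raises ValueError when `opinions` is empty.
    """
    votes = defaultdict(int)
    weight = defaultdict(float)
    for answer, confidence in opinions:
        key = answer.strip().lower()  # normalise casing and stray whitespace
        votes[key] += 1
        weight[key] += confidence
    return max(votes, key=lambda a: (votes[a], weight[a]))
```

A vote like this yields a decision but no explanation, which is one reason a system aiming at explainable output would need a richer aggregation step.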

ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao

Venue: EMNLP 2025

First: 2025-08-26T13:42:48+00:00 · Latest: 2025-08-26T13:42:48+00:00

Comments: Accepted by EMNLP 2025 Findings

Abstract

Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.

中文标题/摘要

标题：ProPy：基于CLIP构建交互式提示金字塔以实现部分相关视频检索

部分相关视频检索（PRVR）是一项实际但具有挑战性的任务，涉及根据仅与特定段落相关的查询检索视频。虽然现有工作遵循处理单模态特征的范式，但像CLIP这样的强大预训练跨模态模型在这一领域尚未得到充分利用。为弥合这一差距，我们提出了ProPy，这是一种专门针对PRVR设计的具有系统架构适应性的CLIP模型。从多粒度事件的语义相关性中汲取灵感，ProPy引入了两项关键创新：（1）提示金字塔结构，组织事件提示以捕捉多粒度级别的语义；（2）基于金字塔的祖先-后代交互机制，使事件之间能够动态进行语义交互。通过这些设计，ProPy在三个公开数据集上实现了SOTA性能，显著优于先前的模型。代码可在https://github.com/BUAAPY/ProPy获取。

Summary / 总结

ProPy addresses Partially Relevant Video Retrieval (PRVR) by systematically adapting CLIP, a powerful pretrained vision-language model. It introduces a Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and an Ancestor-Descendant Interaction Mechanism that enables dynamic semantic interaction among events. ProPy achieves state-of-the-art performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.

ProPy旨在通过利用预训练的视觉语言模型CLIP来解决部分相关视频检索的挑战。它引入了Prompt Pyramid结构和Ancestor-Descendant交互机制，以在多个粒度级别捕获和交互语义信息。ProPy在三个公开数据集上实现了最先进的性能，并在现有方法上取得了显著的改进。代码可在https://github.com/BUAAPY/ProPy获得。
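One way to picture a multi-granularity pyramid with ancestor-descendant interaction is as nested frame spans plus a mask that only lets an event interact with the spans that contain it or that it contains. The binary halving scheme, the span representation, and both function names below are assumptions for illustration; the actual ProPy design may differ.

```python
import numpy as np


def build_pyramid(num_frames):
    """Return (start, end) spans from coarse to fine, halving the length per level.

    Assumes num_frames is a power of two so that every level tiles the video.
    """
    if num_frames < 1 or num_frames & (num_frames - 1):
        raise ValueError("num_frames must be a power of two")
    spans = []
    length = num_frames
    while length >= 1:
        spans.extend((s, s + length) for s in range(0, num_frames, length))
        length //= 2
    return spans


def ancestor_descendant_mask(spans):
    """Boolean (n, n) mask: True where one span contains the other (or is the same)."""
    n = len(spans)
    mask = np.zeros((n, n), dtype=bool)
    for i, (a0, a1) in enumerate(spans):
        for j, (b0, b1) in enumerate(spans):
            contains = a0 <= b0 and b1 <= a1
            within = b0 <= a0 and a1 <= b1
            mask[i, j] = contains or within
    return mask
```

A mask of this shape could be applied as an attention mask, so that sibling events at the same level do not attend to each other directly.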

ForgetMe: Evaluating Selective Forgetting in Generative Models

Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang

First: 2025-04-17T01:44:57+00:00 · Latest: 2025-08-26T13:04:59+00:00

Abstract

The widespread adoption of diffusion models in image generation has increased the demand for privacy-compliant unlearning. However, due to the high-dimensional nature and complex feature representations of diffusion models, achieving selective unlearning remains challenging, as existing methods struggle to remove sensitive information while preserving the consistency of non-sensitive regions. To address this, we propose an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal, constructing the ForgetMe dataset and introducing the Entangled evaluation metric. The Entangled metric quantifies unlearning effectiveness by assessing the similarity and consistency between the target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe dataset encompasses a diverse set of real and synthetic scenarios, including CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on this dataset and validate the effectiveness of both the ForgetMe dataset and the Entangled metric, establishing them as benchmarks for selective unlearning. Our work provides a scalable and adaptable solution for advancing privacy-preserving generative AI.

Summary / 总结

The paper addresses the challenge of selective unlearning in diffusion models by proposing an Automatic Dataset Creation Framework based on prompt-based layered editing and training-free local feature removal. The framework is used to construct the ForgetMe dataset, and the paper introduces the Entangled evaluation metric, which quantifies unlearning effectiveness by assessing the similarity and consistency between target and background regions and supports both paired (Entangled-D) and unpaired (Entangled-S) image data. Selective unlearning itself is performed by LoRA fine-tuning of Stable Diffusion on this dataset, and the results validate the ForgetMe dataset and the Entangled metric as benchmarks for selective unlearning.

论文提出了一种自动数据集生成框架，基于提示的分层编辑和无需训练的局部特征移除来构建ForgetMe数据集，并引入了Entangled评估指标。该指标通过评估目标区域和背景区域之间的相似性和一致性来量化遗忘效果，同时支持配对（Entangled-D）和非配对（Entangled-S）图像数据。选择性遗忘本身通过在该数据集上对Stable Diffusion进行LoRA微调来实现，结果验证了ForgetMe数据集和Entangled指标可作为选择性遗忘的基准。

Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone

Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth

First: 2025-08-26T12:41:35+00:00 · Latest: 2025-08-26T12:41:35+00:00

Abstract

As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

中文标题/摘要

标题：用不同方式再问我：GRAS用于衡量视觉语言模型在性别、种族、年龄和肤色上的偏见

随着视觉语言模型（VLMs）在实际应用中变得越来越重要，理解其在不同人口统计学群体中的偏见变得至关重要。我们引入了GRAS，这是一个基准测试，用于揭示视觉语言模型在性别、种族、年龄和肤色方面的偏见，提供了迄今为止最全面的覆盖范围。我们还提出了GRAS偏见评分，这是一个可解释的指标，用于量化偏见。我们对五种最先进的视觉语言模型进行了基准测试，并揭示了令人担忧的偏见水平，最不偏见的模型的GRAS偏见评分为100分中的2分。我们的研究结果还揭示了一个方法论上的见解：使用视觉问答（VQA）评估视觉语言模型的偏见需要考虑问题的多种表述形式。我们的代码、数据和评估结果已公开提供。

Summary / 总结

The research aims to understand and measure demographic biases in Vision Language Models (VLMs) by introducing GRAS, a benchmark that covers gender, race, age, and skin tone. The study proposes the GRAS Bias Score, an interpretable metric to quantify bias. Key findings include concerning bias levels in five state-of-the-art VLMs, with the least biased model scoring only 2 out of 100 on the GRAS Bias Score. The study also highlights the importance of considering multiple question formulations in VQA for bias evaluation. The code, data, and evaluation results are publicly available.

研究旨在通过引入GRAS基准来理解和衡量视觉语言模型（VLMs）中的人口统计学偏见，该基准涵盖了性别、种族、年龄和肤色。研究提出了GRAS偏见评分，这是一种用于量化偏见的可解释指标。关键发现包括五种最先进的VLMs存在令人担忧的偏见水平，其中最不偏见的模型在GRAS偏见评分中仅得2分（满分100分）。研究还强调，在使用VQA评估偏见时需要考虑同一问题的多种表述。代码、数据和评估结果已公开可用。

Enhancing Document VQA Models via Retrieval-Augmented Generation

Authors: Eric López, Artemis Llabrés, Ernest Valveny

First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-26T12:32:55+00:00

Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China

Abstract

Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.

中文标题/摘要

标题：通过检索增强生成提升文档VQA模型

文档视觉问答（Document VQA）必须应对跨数十页的文档，但领先系统仍会将每页连接起来或依赖非常大的视觉语言模型，这两种方法都消耗大量内存。检索增强生成（RAG）提供了一种有吸引力的替代方案，首先检索一组相关的简短段落，然后从这些选定的证据中生成答案。在本文中，我们系统地评估了将RAG整合到文档VQA中的影响，通过不同的检索变体——使用OCR标记的文本检索和完全基于视觉的检索——在多个模型和基准上进行了评估。在多页数据集MP-DocVQA、DUDE和InfographicVQA上进行评估，以文本为中心的变体将“连接所有页面”的基线提高了最多+22.5 ANLS，而视觉变体在无需任何文本提取的情况下实现了+5.0 ANLS的改进。消融实验表明，检索和重新排序组件是主要的改进来源，而布局引导的分块策略——在几项最近的工作中提出，旨在利用页面结构——在这些数据集上未能提供帮助。我们的实验表明，仔细选择证据在多个模型大小和多页基准上始终提高了准确性，突显了其实用价值。
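The gains above are reported in ANLS (Average Normalized Levenshtein Similarity), the usual Document VQA metric. A minimal sketch follows, assuming the common definition with a 0.5 threshold and case-insensitive comparison; official evaluation scripts may normalise text differently, and scores are commonly reported on a 0-100 scale rather than the 0-1 scale returned here.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any reference.

    predictions: list of answer strings.
    references: list of lists of acceptable answers, aligned with predictions.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

The threshold lets small OCR-style slips earn partial credit while answers that are mostly wrong get none.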

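Summary / 总结

This paper evaluates how adding retrieval-augmented generation (RAG) affects multi-page Document VQA, comparing text-based retrieval over OCR tokens with purely visual retrieval that needs no OCR. On MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS and the visual variant by +5.0 ANLS. An ablation attributes most of the gain to the retrieval and reranking components, while layout-guided chunking does not help on these datasets.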

Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs

Authors: Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit

First: 2025-08-24T12:43:27+00:00 · Latest: 2025-08-26T12:16:26+00:00

Abstract

We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.

中文标题/摘要

标题：注意（语言）差距：探究LVLMs在数值和跨语言推理方面的极限

我们介绍了MMCRICBENCH-3K，这是一个用于板球得分卡上视觉问答（VQA）的基准测试，旨在评估大型视觉-语言模型（LVLMs）在半结构化表格图像上进行复杂数值和跨语言推理的能力。MMCRICBENCH-3K 包含1,463张合成生成的得分卡图像，来自ODI、T20和测试格式，以及1,500对英文问答对。它包括两个子集：MMCRICBENCH-E-1.5K，包含英文得分卡，和MMCRICBENCH-H-1.5K，包含视觉上相似的印地语得分卡，所有问题和答案都用英文编写，以实现跨文字的控制性评估。该任务要求进行结构化数值数据推理、多图像上下文推理和隐含领域知识推理。实验证明，即使是最先进的LVLMs，如GPT-4o和Qwen2.5VL，在其主要训练语言的英文子集上也难以应对，而在印地语子集上的表现进一步下降。这揭示了结构感知视觉文本理解、数值推理和跨语言泛化的关键局限性。该数据集可通过Hugging Face公开获取，网址为https://huggingface.co/datasets/DIALab/MMCricBench，以促进该方向的LVLM研究。

Summary / 总结

The study introduces MMCRICBENCH-3K, a benchmark for evaluating large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over cricket scorecard images. The benchmark comprises 1,463 synthetically generated scorecard images and 1,500 English QA pairs, with an English-scorecard subset (MMCRICBENCH-E-1.5K) and a visually similar Hindi-scorecard subset (MMCRICBENCH-H-1.5K); all questions and answers remain in English. Experiments show that even state-of-the-art LVLMs such as GPT-4o and Qwen2.5VL struggle on the English subset and drop further on the Hindi subset, highlighting limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available on Hugging Face for further research.

研究引入了MMCRICBENCH-3K，这是一个用于评估大型视觉-语言模型（LVLMs）在板球得分卡图像上进行复杂数值和跨语言推理的基准。该基准包括1,463张合成生成的得分卡图像和1,500个英文问答对，分为英文得分卡子集（MMCRICBENCH-E-1.5K）和视觉上相似的印地语得分卡子集（MMCRICBENCH-H-1.5K），所有问题和答案均为英文。实验结果显示，即使是最先进的LVLMs如GPT-4o和Qwen2.5VL，在英文子集上也表现不佳，在印地语子集上性能进一步下降，这揭示了结构感知视觉文本理解、数值推理和跨语言泛化的关键局限性。该数据集已在Hugging Face上公开，以促进该方向的进一步研究。