VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
First: 2025-08-26T17:59:47+00:00 · Latest: 2025-08-26T17:59:47+00:00
Comments: Project page: https://huanngzh.github.io/VoxHammer-Page/
Abstract
3D local editing of specified regions is crucial for game industry and robot
interaction. Recent methods typically edit rendered multi-view images and then
reconstruct 3D models, but they face challenges in precisely preserving
unedited regions and overall coherence. Inspired by structured 3D generative
models, we propose VoxHammer, a novel training-free approach that performs
precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer
first predicts its inversion trajectory and obtains its inverted latents and
key-value tokens at each timestep. Subsequently, in the denoising and editing
phase, we replace the denoising features of preserved regions with the
corresponding inverted latents and cached key-value tokens. By retaining these
contextual features, this approach ensures consistent reconstruction of
preserved areas and coherent integration of edited parts. To evaluate the
consistency of preserved regions, we constructed Edit3D-Bench, a
human-annotated dataset comprising hundreds of samples, each with carefully
labeled 3D editing regions. Experiments demonstrate that VoxHammer
significantly outperforms existing methods in terms of both 3D consistency of
preserved regions and overall quality. Our method holds promise for
synthesizing high-quality edited paired data, thereby laying the data
foundation for in-context 3D generation. See our project page at
https://huanngzh.github.io/VoxHammer-Page/.
Summary
VoxHammer is a training-free approach for precise and coherent 3D editing in native 3D latent space. It targets two weaknesses of pipelines that edit rendered multi-view images: imprecise preservation of unedited regions and weak overall coherence. Given a 3D model, VoxHammer predicts its inversion trajectory and caches the inverted latents and key-value tokens at each timestep; during denoising and editing, it replaces the denoising features of preserved regions with these cached values. Experiments show that VoxHammer outperforms existing methods in both 3D consistency of preserved regions and overall quality. The method is promising for synthesizing high-quality edited paired data for in-context 3D generation.
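The preserved-region replacement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes latents are flat `(num_voxels, channels)` arrays, uses a hypothetical `denoise_step` callable in place of the generative model, and omits the key-value token replacement entirely.

```python
import numpy as np

def blend_preserved(denoised, inverted, keep_mask):
    """Overwrite preserved voxels with the cached inverted latent.

    denoised, inverted: arrays of shape (num_voxels, channels)
    keep_mask: boolean array of shape (num_voxels,), True = preserve
    """
    return np.where(keep_mask[:, None], inverted, denoised)

def edit_loop(inverted_latents, keep_mask, denoise_step):
    """Denoise from the noisiest cached latent back to level 0.

    inverted_latents[t] is the cached latent at noise level t (assumption:
    index 0 is the clean latent, the last index is the noisiest).
    denoise_step(x, t) is a hypothetical stand-in for one model step that
    maps a latent at level t to a latent at level t - 1.
    """
    x = inverted_latents[-1].copy()
    for t in range(len(inverted_latents) - 1, 0, -1):
        x = denoise_step(x, t)
        # Preserved voxels are reset to the inversion trajectory at every
        # step, so only the edited region is free to change.
        x = blend_preserved(x, inverted_latents[t - 1], keep_mask)
    return x
```

With this structure, voxels inside `keep_mask` end exactly at the clean cached latent, while the remaining voxels follow whatever the (hypothetical) denoiser produces.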
Articulate3D: Zero-Shot Text-Driven 3D Object Posing
Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
First: 2025-08-26T17:59:17+00:00 · Latest: 2025-08-26T17:59:17+00:00
Comments: Project page: https://odeb1.github.io/articulate3d_page_deb/
Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through
language control. Despite advances in vision and language models, this task
remains surprisingly challenging. To achieve this goal, we decompose the
problem into two steps. We modify a powerful image-generator to create target
images conditioned on the input image and a text instruction. We then align the
mesh to the target images through a multi-view pose optimisation step. In
detail, we introduce a self-attention rewiring mechanism (RSActrl) that
decouples the source structure from pose within an image generative model,
allowing it to maintain a consistent structure across varying poses. We
observed that differentiable rendering is an unreliable signal for articulation
optimisation; instead, we use keypoints to establish correspondences between
input and target images. The effectiveness of Articulate3D is demonstrated
across a diverse range of 3D objects and free-form text prompts, successfully
manipulating poses while maintaining the original identity of the mesh.
Quantitative evaluations and a comparative user study, in which our method was
preferred over 85% of the time, confirm its superiority over existing
approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/
Summary
Articulate3D is a training-free method for posing 3D assets from text instructions. It decomposes the task into two steps: modifying an image generator to create target images conditioned on the input image and the text instruction, then aligning the mesh to those target images through multi-view pose optimisation. A self-attention rewiring mechanism (RSActrl) keeps structure consistent across poses, and keypoints establish correspondences between input and target images. Experiments show that Articulate3D can pose diverse 3D objects from free-form text while preserving the original identity of the mesh, and it outperforms existing approaches in quantitative evaluations and a comparative user study.
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-08-22T23:34:37+00:00 · Latest: 2025-08-26T17:13:21+00:00
Abstract
Clinical workflows are fragmented into a patchwork of scripts and task-specific
networks that often handle triage, task selection, and model deployment. These
pipelines are rarely streamlined for data science workflows, reducing efficiency
and raising operational costs. Workflows also lack data-driven model
identification (from imaging/tabular inputs) and standardized delivery of model
outputs. In response, we present a practical, healthcare-first framework that
uses a single vision-language model (VLM) in two complementary roles. First
(Solution 1), the VLM acts as an aware model-card matcher that routes an
incoming image to the appropriate specialist model via a three-stage workflow
(modality -> primary abnormality -> model-card id). Checks are provided by (i)
stagewise prompts that allow early exit via None/Normal/Other and (ii) a
stagewise answer selector that arbitrates between the top-2 candidates at each
stage, reducing the chance of an incorrect selection and aligning the workflow
with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on
specialty-specific datasets ensuring a single model covers multiple downstream
tasks within each specialty, maintaining performance while simplifying
deployment. Across gastroenterology, hematology, ophthalmology, and pathology,
our single-model deployment matches or approaches specialized baselines.
Compared with pipelines composed of many task-specific agents, this approach
shows that one VLM can both decide and do. It may reduce effort by data
scientists, shorten monitoring, increase the transparency of model selection
(with per-stage justifications), and lower integration overhead.
Summary
The paper addresses inefficiencies in fragmented clinical workflows with a framework that uses a single vision-language model (VLM) in two roles: as a model-card matcher that routes an incoming image to the appropriate specialist model, and as a fine-tuned model that covers multiple downstream tasks within each specialty. Routing follows a three-stage workflow (modality -> primary abnormality -> model-card id), with stagewise prompts that allow early exit and a stagewise answer selector that arbitrates between the top-2 candidates to reduce the chance of an incorrect selection. Across gastroenterology, hematology, ophthalmology, and pathology, the single-model deployment matches or approaches specialized baselines. The authors suggest the approach may reduce data-science effort and integration overhead while making model selection more transparent.
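The three-stage routing with early exit and top-2 arbitration can be sketched as follows. This is an illustrative skeleton only: the `propose` and `select` callables stand in for VLM prompts and are hypothetical, as are the function and variable names; the paper's actual prompts and selector are not reproduced here.

```python
# Labels that end routing early, taken from the abstract's None/Normal/Other.
EXIT_LABELS = {"None", "Normal", "Other"}

def route(image, stages):
    """Route an image through ordered stages.

    stages: list of (name, propose, select) tuples, where
      propose(image, context) -> ranked list of candidate labels
      select(image, context, top2) -> the chosen label
    Both callables are hypothetical stand-ins for VLM calls.
    Returns (final_choice_or_None, trace); trace holds one
    (name, top2_candidates, choice) tuple per executed stage.
    """
    trace = []
    context = []  # choices made at earlier stages
    choice = None
    for name, propose, select in stages:
        top2 = propose(image, context)[:2]
        choice = select(image, context, top2)
        trace.append((name, top2, choice))
        if choice in EXIT_LABELS:
            # Early exit: no specialist model is selected.
            return None, trace
        context.append(choice)
    return choice, trace
```

Keeping the per-stage trace is one simple way to expose the per-stage justifications that the abstract associates with auditability.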
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu
First: 2025-05-29T23:32:03+00:00 · Latest: 2025-08-26T16:42:37+00:00
Comments: 16 pages
Abstract
Large Vision-Language Models (LVLMs) have made remarkable strides in
multimodal tasks such as visual question answering, visual grounding, and
complex reasoning. However, they remain limited by static training data,
susceptibility to hallucinations, and inability to verify claims against
up-to-date, external evidence, compromising their performance in dynamic
real-world applications. Retrieval-Augmented Generation (RAG) offers a
practical solution to mitigate these challenges by allowing the LVLMs to access
large-scale knowledge databases via retrieval mechanisms, thereby grounding
model outputs in factual, contextually relevant information. In this paper, we
conduct the first systematic dissection of the multimodal RAG pipeline for
LVLMs, explicitly investigating (1) the retrieval phase: modality
configurations and retrieval strategies, (2) the re-ranking stage: strategies
to mitigate positional biases and improve the relevance of retrieved evidence,
and (3) the generation phase: how best to integrate retrieved candidates into
the final generation process. Finally, we explore a unified agentic framework
that integrates re-ranking and
generation through self-reflection, enabling LVLMs to select relevant evidence
and suppress irrelevant context dynamically. Our full-stack exploration of RAG
for LVLMs yields substantial insights, resulting in an average performance
boost of 5% without any fine-tuning.
Summary
This paper investigates the design space of multimodal retrieval-augmented generation (RAG) for large vision-language models (LVLMs). It systematically examines the retrieval phase, the re-ranking stage, and the generation phase, and explores a unified agentic framework that integrates re-ranking and generation through self-reflection. The resulting insights yield an average performance boost of 5% without any fine-tuning.
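The three phases named above (retrieval, re-ranking, generation) can be pictured with a generic skeleton. Everything here is an assumption for illustration: cosine-similarity retrieval, a caller-supplied `relevance` scorer, and a plain-text prompt format are common choices, not the configurations the paper evaluates.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k):
    """Return indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k].tolist()

def rerank(candidates, relevance, keep):
    """Score each candidate on its own and keep the best `keep`.

    relevance(candidate) -> float is a hypothetical scorer (for example a
    model judging one candidate at a time). Scoring candidates independently
    is one simple way to avoid list-position effects.
    """
    return sorted(candidates, key=relevance, reverse=True)[:keep]

def build_prompt(question, passages):
    """Assemble the retained evidence and the question into one prompt."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
```

A caller would chain the three functions, passing the retrieved indices to `rerank` and the text of the surviving passages to `build_prompt`.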
Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Authors: Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar
First: 2025-07-07T22:31:56+00:00 · Latest: 2025-08-26T14:02:57+00:00
Abstract
Dermatological care via telemedicine often lacks the rich context of
in-person visits. Clinicians must make diagnoses based on a handful of images
and brief descriptions, without the benefit of physical exams, second opinions,
or reference materials. While many medical AI systems attempt to bridge these
gaps with domain-specific fine-tuning, this work hypothesized that mimicking
clinical reasoning processes could offer a more effective path forward. This
study tested seven vision-language models on medical visual question answering
across six configurations: baseline models, fine-tuned variants, and both
augmented with either reasoning layers that combine multiple model
perspectives, analogous to peer consultation, or retrieval-augmented generation
that incorporates medical literature at inference time, serving a role similar
to reference-checking. While fine-tuning degraded performance in four of seven
models with an average 30% decrease, baseline models collapsed on test data.
Clinical-inspired architectures, meanwhile, achieved up to 70% accuracy,
maintaining performance on unseen data while generating explainable,
literature-grounded outputs critical for clinical adoption. These findings
demonstrate that medical AI succeeds by reconstructing the collaborative and
evidence-based practices fundamental to clinical diagnosis.
Summary
This study tests whether mimicking clinical reasoning can improve multimodal medical visual question answering for dermatological telemedicine. Seven vision-language models were evaluated across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives or retrieval-augmented generation over medical literature. Fine-tuning degraded performance in four of the seven models (an average 30% decrease), whereas clinically inspired architectures achieved up to 70% accuracy, maintained performance on unseen data, and produced explainable, literature-grounded outputs that are critical for clinical adoption.
ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval
Authors: Yi Pan, Yujia Zhang, Michael Kampffmeyer, Xiaoguang Zhao
Venue: EMNLP 2025
First: 2025-08-26T13:42:48+00:00 · Latest: 2025-08-26T13:42:48+00:00
Comments: Accepted by EMNLP 2025 Findings
Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task
that involves retrieving videos based on queries relevant to only specific
segments. While existing works follow the paradigm of developing models to
process unimodal features, powerful pretrained vision-language models like CLIP
remain underexplored in this field. To bridge this gap, we propose ProPy, a
model with systematic architectural adaption of CLIP specifically designed for
PRVR. Drawing insights from the semantic relevance of multi-granularity events,
ProPy introduces two key innovations: (1) A Prompt Pyramid structure that
organizes event prompts to capture semantics at multiple granularity levels,
and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that
enables dynamic semantic interaction among events. With these designs, ProPy
achieves SOTA performance on three public datasets, outperforming previous
models by significant margins. Code is available at
https://github.com/BUAAPY/ProPy.
Summary
ProPy addresses Partially Relevant Video Retrieval (PRVR) by systematically adapting CLIP, a powerful pretrained vision-language model. It introduces a Prompt Pyramid structure that organizes event prompts at multiple granularity levels and an Ancestor-Descendant Interaction Mechanism that enables dynamic semantic interaction among events. ProPy achieves state-of-the-art performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
ForgetMe: Evaluating Selective Forgetting in Generative Models
Authors: Zhenyu Yu, Mohd Yamani Inda Idris, Pei Wang
First: 2025-04-17T01:44:57+00:00 · Latest: 2025-08-26T13:04:59+00:00
Abstract
The widespread adoption of diffusion models in image generation has increased
the demand for privacy-compliant unlearning. However, due to the
high-dimensional nature and complex feature representations of diffusion
models, achieving selective unlearning remains challenging, as existing methods
struggle to remove sensitive information while preserving the consistency of
non-sensitive regions. To address this, we propose an Automatic Dataset
Creation Framework based on prompt-based layered editing and training-free
local feature removal, constructing the ForgetMe dataset and introducing the
Entangled evaluation metric. The Entangled metric quantifies unlearning
effectiveness by assessing the similarity and consistency between the target
and background regions and supports both paired (Entangled-D) and unpaired
(Entangled-S) image data, enabling unsupervised evaluation. The ForgetMe
dataset encompasses a diverse set of real and synthetic scenarios, including
CUB-200-2011 (Birds), Stanford-Dogs, ImageNet, and a synthetic cat dataset. We
apply LoRA fine-tuning on Stable Diffusion to achieve selective unlearning on
this dataset and validate the effectiveness of both the ForgetMe dataset and
the Entangled metric, establishing them as benchmarks for selective unlearning.
Our work provides a scalable and adaptable solution for advancing
privacy-preserving generative AI.
Summary
The paper addresses selective unlearning in diffusion models by proposing an Automatic Dataset Creation Framework built on prompt-based layered editing and training-free local feature removal. The framework is used to construct the ForgetMe dataset, and the paper introduces the Entangled evaluation metric, which quantifies unlearning effectiveness by assessing the similarity and consistency between target and background regions for both paired (Entangled-D) and unpaired (Entangled-S) image data. LoRA fine-tuning of Stable Diffusion on this dataset is used to validate both the dataset and the metric as benchmarks for selective unlearning.
Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth
First: 2025-08-26T12:41:35+00:00 · Latest: 2025-08-26T12:41:35+00:00
Abstract
As Vision Language Models (VLMs) become integral to real-world applications,
understanding their demographic biases is critical. We introduce GRAS, a
benchmark for uncovering demographic biases in VLMs across gender, race, age,
and skin tone, offering the most diverse coverage to date. We further propose
the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark
five state-of-the-art VLMs and reveal concerning bias levels, with the least
biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings
also reveal a methodological insight: evaluating bias in VLMs with visual
question answering (VQA) requires considering multiple formulations of a
question. Our code, data, and evaluation results are publicly available.
Summary
The paper introduces GRAS, a benchmark for uncovering demographic biases in Vision Language Models (VLMs) across gender, race, age, and skin tone, together with the GRAS Bias Score, an interpretable metric for quantifying bias. Benchmarking five state-of-the-art VLMs reveals concerning bias levels: the least biased model scores only 2 out of 100. The study also finds that evaluating bias with visual question answering (VQA) requires considering multiple formulations of a question. The code, data, and evaluation results are publicly available.
Enhancing Document VQA Models via Retrieval-Augmented Generation
Authors: Eric López, Artemis Llabrés, Ernest Valveny
First: 2025-08-26T12:32:55+00:00 · Latest: 2025-08-26T12:32:55+00:00
Comments: Accepted at Workshop on Machine Learning in Document Analysis and
Recognition (ICDAR WML 2025), Wuhan, China
Abstract
Document Visual Question Answering (Document VQA) must cope with documents
that span dozens of pages, yet leading systems still concatenate every page or
rely on very large vision-language models, both of which are memory-hungry.
Retrieval-Augmented Generation (RAG) offers an attractive alternative, first
retrieving a concise set of relevant segments before generating answers from
this selected evidence. In this paper, we systematically evaluate the impact of
incorporating RAG into Document VQA through different retrieval variants -
text-based retrieval using OCR tokens and purely visual retrieval without OCR -
across multiple models and benchmarks. Evaluated on the multi-page datasets
MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the
"concatenate-all-pages" baseline by up to +22.5 ANLS, while the visual variant
achieves +5.0 ANLS improvement without requiring any text extraction. An
ablation confirms that retrieval and reranking components drive most of the
gain, whereas the layout-guided chunking strategy - proposed in several recent
works to leverage page structure - fails to help on these datasets. Our
experiments demonstrate that careful evidence selection consistently boosts
accuracy across multiple model sizes and multi-page benchmarks, underscoring
its practical value for real-world Document VQA.
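Summary
The paper systematically evaluates Retrieval-Augmented Generation (RAG) for multi-page Document VQA, comparing text-based retrieval over OCR tokens with purely visual retrieval that needs no OCR. On MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the "concatenate-all-pages" baseline by up to +22.5 ANLS, and the visual variant gains +5.0 ANLS without any text extraction. An ablation attributes most of the gain to the retrieval and reranking components, while layout-guided chunking fails to help on these datasets.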
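The gains above are reported in ANLS (Average Normalized Levenshtein Similarity), the usual Document VQA metric. A minimal sketch of the commonly used definition follows, assuming the standard threshold of 0.5 and case-insensitive comparison; the paper's exact evaluation script may differ, and its reported values appear to be on a 0-100 scale rather than the 0-1 scale returned here.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best per-answer similarity.

    predictions: list of predicted answer strings
    references: list of lists of acceptable ground-truth answers
    A prediction scores 1 - NL against a reference when the normalized
    edit distance NL is below `threshold`, and 0 otherwise.
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)
```

The threshold means near-misses such as OCR slips earn partial credit, while answers that are mostly wrong score zero.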
Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
Authors: Somraj Gautam, Abhirama Subramanyam Penamakuri, Abhishek Bhandari, Gaurav Harit
First: 2025-08-24T12:43:27+00:00 · Latest: 2025-08-26T12:16:26+00:00
Abstract
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA)
on cricket scorecards, designed to evaluate large vision-language models
(LVLMs) on complex numerical and cross-lingual reasoning over semi-structured
tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated
scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English
QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English
scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi
scorecards, with all questions and answers kept in English to enable controlled
cross-script evaluation. The task demands reasoning over structured numerical
data, multi-image context, and implicit domain knowledge. Empirical results
show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle
on the English subset despite it being their primary training language and
exhibit a further drop in performance on the Hindi subset. This reveals key
limitations in structure-aware visual text understanding, numerical reasoning,
and cross-lingual generalization. The dataset is publicly available via Hugging
Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM
research in this direction.
Summary
The study introduces MMCRICBENCH-3K, a benchmark for evaluating large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over cricket scorecard images. It comprises 1,463 synthetically generated scorecard images and 1,500 English QA pairs, with an English-scorecard subset (MMCRICBENCH-E-1.5K) and a visually similar Hindi-scorecard subset (MMCRICBENCH-H-1.5K). Even state-of-the-art LVLMs such as GPT-4o and Qwen2.5VL struggle on the English subset and drop further on the Hindi subset, highlighting limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available on Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench.