arXiv Paper Digest

Snapshot: 20260414_0423

ParseBench: A Document Parsing Benchmark for AI Agents

Authors: Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Daniel B. Ospina, Simon Suo

First: 2026-04-09T17:59:36+00:00 · Latest: 2026-04-10T17:59:14+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available at https://huggingface.co/datasets/llamaindex/ParseBench and https://github.com/run-llama/ParseBench.
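The abstract's point that text-similarity metrics can miss agent-critical failures can be illustrated with a small sketch. This is not ParseBench's evaluation code: the table contents, `text_similarity`, and `cell_accuracy` are hypothetical, chosen only to show how a parse with two swapped cell values can look nearly identical as text while every data cell is wrong.

```python
from difflib import SequenceMatcher

# Hypothetical ground-truth table and a parse whose premium values are swapped between rows.
reference = "| Policy | Premium |\n| A-100 | 1200 |\n| B-200 | 450 |"
parsed = "| Policy | Premium |\n| A-100 | 450 |\n| B-200 | 1200 |"


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], a stand-in for a text-similarity metric."""
    return SequenceMatcher(None, a, b).ratio()


def parse_table(markdown: str) -> dict[str, str]:
    """Map the first column to the second column, skipping the header row."""
    rows = [[cell.strip() for cell in line.strip("|").split("|")]
            for line in markdown.splitlines()]
    return {row[0]: row[1] for row in rows[1:]}


def cell_accuracy(ref: str, hyp: str) -> float:
    """Fraction of reference rows whose value is reproduced for the same key."""
    ref_cells, hyp_cells = parse_table(ref), parse_table(hyp)
    correct = sum(1 for key, value in ref_cells.items() if hyp_cells.get(key) == value)
    return correct / len(ref_cells)


print(f"text similarity: {text_similarity(reference, parsed):.2f}")  # high
print(f"cell accuracy:   {cell_accuracy(reference, parsed):.2f}")    # 0.00
```

An agent reading the parsed table would attach the wrong premium to each policy, even though a string-level comparison rates the output as mostly correct.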

Summary

This work targets semantic correctness in document parsing for AI agents, which autonomous decisions depend on. It introduces ParseBench, a benchmark of about 2,000 human-verified pages from enterprise documents in the insurance, finance, and government sectors. The benchmark evaluates five dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods, including vision-language models, specialized document parsers, and LlamaParse, no method is consistently strong in all five dimensions, and LlamaParse Agentic achieves the highest overall score. The benchmark highlights the capability gaps that remain in current systems.

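The work addresses the need for semantic correctness when AI agents parse documents, with a focus on preserving the structure and meaning that autonomous decisions rely on. It introduces ParseBench, a benchmark of about 2,000 pages from enterprise documents in insurance, finance, and government, evaluated along five dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. The results show that no method performs well on every dimension. LlamaParse Agentic has the highest overall score, which underlines the capability gaps in current systems.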

Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Authors: Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen, Lvhua Wu, Sheng Sun, Yuwei Wang, Min Liu

First: 2026-04-10T17:48:56+00:00 · Latest: 2026-04-10T17:48:56+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise remains underexplored. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. Injecting visual information in the same way for all samples, despite differences in the quality of their visual cues, causes instability. To address this, we further introduce a lightweight conditional modulation mechanism that adaptively controls the strength of visual information injection and strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
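The two mechanisms described in the abstract, cross-modal attention from prompt tokens to visual features and a gate on injection strength, can be sketched roughly as follows. This is a simplified illustration, not the VisPrompt implementation: `inject_visual` and the scalar `gate` argument are hypothetical, there are no learned projection matrices, and how the paper actually computes its conditional modulation is not stated in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def inject_visual(prompt_tokens, patch_feats, gate):
    """Each prompt token (query) attends over image patch features (keys and values).

    The attended visual vector is added to the token, scaled by `gate` in [0, 1].
    `gate` is an assumed stand-in for a per-sample modulation signal.
    """
    dim = len(prompt_tokens[0])
    updated = []
    for query in prompt_tokens:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in patch_feats])
        visual = [sum(w * feat[i] for w, feat in zip(weights, patch_feats))
                  for i in range(dim)]
        updated.append([p + gate * x for p, x in zip(query, visual)])
    return updated


prompt = [[1.0, 0.0]]                 # one prompt token, dimension 2
patches = [[1.0, 0.0], [0.0, 1.0]]    # two image patch features
print(inject_visual(prompt, patches, gate=0.0))  # prompt left unchanged
print(inject_visual(prompt, patches, gate=1.0))  # prompt shifted toward the matching patch
```

With `gate=0.0` the prompt relies only on its text-side prior, and larger values pull it toward instance-level visual evidence, which is the trade-off the abstract describes.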

VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Authors: Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu

First: 2026-04-10T17:48:51+00:00 · Latest: 2026-04-10T17:48:51+00:00

Comments: Project Page: https://zlab-princeton.github.io/VisionFoundry/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
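The generate, synthesize, and verify loop described in the abstract can be sketched as below. This is a minimal outline under stated assumptions, not the VisionFoundry code: `build_dataset`, `Sample`, and the three callables are hypothetical names, and the toy stand-ins replace the LLM, the T2I model, and the verifying VLM so that the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    question: str
    answer: str
    t2i_prompt: str
    image: bytes


def build_dataset(
    task: str,
    n_candidates: int,
    propose: Callable[[str], tuple[str, str, str]],  # LLM: task -> (question, answer, T2I prompt)
    render: Callable[[str], bytes],                  # T2I model: prompt -> image
    verify: Callable[[bytes, str, str], bool],       # VLM: is the image consistent with the QA pair?
) -> list[Sample]:
    """Keep only candidates whose rendered image passes the consistency check."""
    kept = []
    for _ in range(n_candidates):
        question, answer, t2i_prompt = propose(task)
        image = render(t2i_prompt)
        if verify(image, question, answer):
            kept.append(Sample(question, answer, t2i_prompt, image))
    return kept


# Toy stand-ins (assumptions for illustration only).
def toy_propose(task: str) -> tuple[str, str, str]:
    return ("Which object is closer to the camera?",
            "the red cube",
            f"{task}: a red cube in front of a blue sphere")


def toy_render(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # pretend the prompt text is the image


def toy_verify(image: bytes, question: str, answer: str) -> bool:
    return answer.split()[-1].encode("utf-8") in image  # the answer's object must appear


dataset = build_dataset("Depth Order", 3, toy_propose, toy_render, toy_verify)
print(len(dataset), dataset[0].question)
```

Only the task name enters the loop, matching the abstract's claim that no reference images or human annotation are needed; the verification step is what filters out images that do not match their question-answer pair.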

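Chinese Title / Abstract

Title: VisionFoundry: Training VLMs' Visual Perception with Synthetic Images

Vision-language models (VLMs) still have difficulty with visual perception tasks such as spatial understanding and viewpoint recognition. One possible cause is that natural image datasets offer limited supervision for low-level visual skills. This raises a practical question: can targeted synthetic supervision, generated from nothing more than a task keyword such as Depth Order, address these weaknesses? To study this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only a task name as input, uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, with no need for reference images or human annotation. Using VisionFoundry, we build VisionFoundry-10K, a synthetic visual question answering (VQA) dataset of 10,000 image-question-answer triples covering 10 tasks. Models trained on VisionFoundry-10K improve markedly on visual perception benchmarks, gaining 7% on MMVP and 10% on CV-Bench-3D, while retaining broader capabilities and showing favorable scaling as the amount of data grows. Our results indicate that limited task-targeted supervision is an important cause of this bottleneck and that synthetic supervision is a promising route to more systematic training of VLMs.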

Summary

Vision-language models (VLMs) struggle with visual perception tasks such as spatial understanding and viewpoint recognition, plausibly because natural image datasets provide limited supervision for low-level visual skills. To address this, the authors introduce VisionFoundry, a synthetic data generation pipeline that takes only a task name as input, uses large language models to generate questions, answers, and text-to-image (T2I) prompts, synthesizes images with T2I models, and verifies consistency with a proprietary VLM. The result is VisionFoundry-10K, a synthetic VQA dataset with 10k image-question-answer triples across 10 tasks. Models trained on VisionFoundry-10K improve on visual perception benchmarks, gaining 7% on MMVP and 10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases.

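Vision-language models (VLMs) find visual perception tasks such as spatial understanding and viewpoint recognition challenging, possibly because natural image datasets provide too little supervision. To address this, VisionFoundry was developed: a synthetic data generation pipeline that uses large language models to generate questions, answers, and text-to-image (T2I) prompts, synthesizes images with T2I models, and checks their consistency with a proprietary VLM. The pipeline yields VisionFoundry-10K, a synthetic VQA dataset of 10k image-question-answer triples covering 10 tasks. Models trained on VisionFoundry-10K show clear gains on visual perception benchmarks, 7% on MMVP and 10% on CV-Bench-3D, while keeping their broader capabilities and scaling favorably as the amount of data grows.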

VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning

Authors: Wenyi Xiao, Xinchi Xu, Leilei Gan

Venue: ACL 2026

First: 2026-04-10T17:47:19+00:00 · Latest: 2026-04-10T17:47:19+00:00

Comments: 24 pages, ACL 2026 Main. Repository: https://github.com/Mr-Loevan/VL-Calibration

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
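The two signals named in the abstract, KL-divergence under image perturbations and token entropy, can be computed for a single token position as in the sketch below. The way they are combined here is an assumption made for illustration: `visual_certainty`, the `alpha` weight, and the squashing of the KL term are hypothetical and are not taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; `eps` guards against zeros in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def visual_certainty(p_clean, p_perturbed, alpha=0.5):
    """Hypothetical score in [0, 1) for one token position.

    grounding:  large when perturbing the image changes the prediction,
                i.e. the token actually depends on visual input.
    confidence: large when the clean-image distribution is peaked (low entropy).
    """
    grounding = 1.0 - math.exp(-kl_divergence(p_clean, p_perturbed))
    confidence = 1.0 - entropy(p_clean) / math.log(len(p_clean))
    return alpha * grounding + (1.0 - alpha) * confidence


# A token that is confident and sensitive to the image...
grounded = visual_certainty([0.90, 0.05, 0.05], [0.30, 0.40, 0.30])
# ...versus one that is flat and unaffected by perturbing the image (language prior only).
ungrounded = visual_certainty([0.40, 0.30, 0.30], [0.40, 0.30, 0.30])
print(grounded > ungrounded)
```

A per-token score of this kind is the sort of quantity that token-level advantage reweighting could use to emphasize grounded tokens, though the abstract does not specify the exact weighting scheme.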

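Chinese Title / Abstract

Title: VL-Calibration: Decoupled Confidence Calibration in Large Vision-Language Model Reasoning

Large vision-language models (LVLMs) perform well at multimodal reasoning but often produce hallucinations and incorrect responses with high confidence, which hinders their use in high-stakes domains. Existing verbalized confidence calibration methods were developed mainly for text-only large language models and typically optimize a single holistic confidence score from binary answer-level correctness. This design is a poor fit for LVLMs: an incorrect prediction may stem from a perception failure or from a reasoning error built on correct perception, and a single confidence score conflates these sources, while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual confidence and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimate that combines (i) visual grounding, measured by KL divergence under image perturbations, and (ii) internal certainty, measured by token entropy. We further propose token-level advantage reweighting to focus optimization according to visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while raising visual reasoning accuracy, and that it generalizes well to out-of-distribution benchmarks across model scales and architectures.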

Summary

VL-Calibration is a reinforcement learning framework that decouples confidence into visual and reasoning components to improve the calibration of large vision-language models. It uses intrinsic visual certainty estimation and token-level advantage reweighting to suppress ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration improves both calibration and visual reasoning accuracy, and that it generalizes to out-of-distribution benchmarks across model scales and architectures.

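VL-Calibration is a reinforcement learning framework that aims to improve the confidence calibration of large vision-language models (LVLMs) by splitting confidence into a visual part and a reasoning part. It introduces an intrinsic visual certainty estimate based on KL divergence and token entropy, and uses token-level advantage reweighting to focus on tokens according to their visual certainty, which suppresses ungrounded hallucinations and preserves valid perception. Experiments show that VL-Calibration improves calibration and visual reasoning accuracy on thirteen benchmarks and generalizes well to out-of-distribution benchmarks across model scales and architectures.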

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Authors: Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao

First: 2026-04-10T17:25:34+00:00 · Latest: 2026-04-10T17:25:34+00:00