ParseBench: A Document Parsing Benchmark for AI Agents
Authors: Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Daniel B. Ospina, Simon Suo
First: 2026-04-09T17:59:36+00:00 · Latest: 2026-04-10T17:59:14+00:00
Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall\%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available at https://huggingface.co/datasets/llamaindex/ParseBench and https://github.com/run-llama/ParseBench.
Summary / 总结
The research addresses the need for semantic correctness in document parsing for AI agents, which is crucial for autonomous decisions. The study introduces ParseBench, a benchmark of approximately 2,000 human-verified pages from enterprise documents covering the insurance, finance, and government sectors. The benchmark evaluates five dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods, including vision-language models, specialized document parsers, and LlamaParse, no method is consistently strong in all five dimensions, and LlamaParse Agentic achieves the highest overall score. The benchmark highlights the remaining capability gaps in current systems.
研究旨在解决AI代理在文档解析中对语义正确性的需求,重点关注保留结构和意义以支持自主决策。研究引入了ParseBench,这是一个包含约2,000页经人工验证的企业文档页面的基准,这些文档涵盖了保险、金融和政府领域,评估了五个维度:表格、图表、内容忠实度、语义格式和视觉定位。基准测试表明,没有方法在所有维度上都能表现出色,LlamaParse Agentic在整体得分上最高,突显了当前系统中存在的能力差距。
Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise
Authors: Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen, Lvhua Wu, Sheng Sun, Yuwei Wang, Min Liu
First: 2026-04-10T17:48:56+00:00 · Latest: 2026-04-10T17:48:56+00:00
Abstract
Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise remains underexplored. Visual content carries richer and more reliable semantic information and remains more robust under label noise, whereas the prompt itself is highly susceptible to it. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to inject visual semantics back into the prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. Because injecting visual information in the same way for all samples causes instability when the quality of their visual cues differs, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
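The gated cross-modal injection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inject_visual`, the absence of learned projections, and the scalar `gate` (standing in for the conditional modulation mechanism) are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def inject_visual(prompts, visual_tokens, gate):
    """Hypothetical sketch: each prompt token attends over the visual tokens,
    and the attended visual summary is added back to the prompt token,
    scaled by `gate` in [0, 1]."""
    d = len(prompts[0])
    updated = []
    for p in prompts:
        # Scaled dot-product attention: the prompt token is the query and the
        # visual tokens act as both keys and values (no learned projections).
        scores = [sum(pi * vi for pi, vi in zip(p, v)) / math.sqrt(d)
                  for v in visual_tokens]
        weights = softmax(scores)
        summary = [sum(w * v[j] for w, v in zip(weights, visual_tokens))
                   for j in range(d)]
        # Residual update: gate = 0 leaves the prompt token unchanged.
        updated.append([pi + gate * si for pi, si in zip(p, summary)])
    return updated
```

With `gate = 0` the prompt tokens are returned unchanged, which is the behaviour one would want for samples whose visual cues are judged unreliable; larger gates pull each prompt token toward the visual tokens it attends to most.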
VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images
Authors: Guanyu Zhou, Yida Yin, Wenhao Chai, Shengbang Tong, Xingyu Fu, Zhuang Liu
First: 2026-04-10T17:48:51+00:00 · Latest: 2026-04-10T17:48:51+00:00
Comments: Project Page: https://zlab-princeton.github.io/VisionFoundry/
Abstract
Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.
中文标题/摘要
标题:VisionFoundry:使用合成图像训练VLMs的视觉感知
视觉语言模型(VLMs)在空间理解、视角识别等视觉感知任务上仍然存在困难。一个可能的原因是自然图像数据集对低级视觉技能的监督有限。这促使了一个实际问题:仅从深度顺序等任务关键词生成的目标合成监督能否解决这些弱点?为了研究这个问题,我们引入了VisionFoundry,一个任务感知的合成数据生成管道,仅需输入任务名称,使用大型语言模型(LLMs)生成问题、答案和文本到图像(T2I)提示,然后使用T2I模型合成图像,并通过专有的VLM验证一致性,无需参考图像或人工标注。使用VisionFoundry,我们构建了包含10000个图像-问题-答案三元组的VisionFoundry-10K合成视觉问答(VQA)数据集,覆盖10个任务。在VisionFoundry-10K上训练的模型在视觉感知基准测试中取得了显著改进:在MMVP上提高了7%,在CV-Bench-3D上提高了10%,同时保持了更广泛的能力,并随着数据量的增加表现出有利的扩展行为。我们的结果表明,有限的任务针对性监督是这一瓶颈的重要原因,合成监督是为VLMs进行更系统训练的一个有前景的途径。
Summary / 总结
Vision-language models (VLMs) struggle with visual perception tasks such as spatial understanding and viewpoint recognition, plausibly in part because natural image datasets provide limited supervision for low-level visual skills. To address this, the authors introduce VisionFoundry, a synthetic data generation pipeline that takes only a task name as input, uses large language models to generate questions, answers, and text-to-image (T2I) prompts, synthesizes images with T2I models, and verifies consistency with a proprietary VLM. The pipeline yields VisionFoundry-10K, a synthetic VQA dataset with 10k image-question-answer triples across 10 tasks. Models trained on VisionFoundry-10K improve on visual perception benchmarks, gaining 7% on MMVP and 10% on CV-Bench-3D, while preserving broader capabilities and scaling favorably with more data.
视觉语言模型(VLMs)在空间理解和视角识别等视觉感知任务上存在挑战,这可能部分是因为自然图像数据集对低级视觉技能提供的监督有限。为了解决这个问题,作者提出了VisionFoundry,这是一种合成数据生成管道,仅需输入任务名称,使用大型语言模型生成问题、答案和文本到图像(T2I)提示,然后使用T2I模型合成图像,并通过专有的VLM验证其一致性。这产生了包含10k个图像-问题-答案三元组的VisionFoundry-10K合成VQA数据集,覆盖10个任务。在VisionFoundry-10K上训练的模型在视觉感知基准测试中表现出显著改进,MMVP提高了7%,CV-Bench-3D提高了10%,同时保持了更广泛的能力和随数据量增加的有利扩展行为。
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
Authors: Wenyi Xiao, Xinchi Xu, Leilei Gan
Venue: ACL 2026
First: 2026-04-10T17:47:19+00:00 · Latest: 2026-04-10T17:47:19+00:00
Comments: 24 pages, ACL 2026 Main. Repository: https://github.com/Mr-Loevan/VL-Calibration
Abstract
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
中文标题/摘要
标题:VL-Calibration:大型视觉语言模型推理中的解耦置信度校准
大型视觉语言模型(LVLMs)在多模态推理方面表现出色,但经常以高置信度产生幻觉和错误响应,这阻碍了它们在高风险领域的应用。现有的口头置信度校准方法主要为文本仅大型语言模型开发,通常通过二元答案级正确性优化单一的整体置信度分数。这种设计与LVLMs不匹配:错误预测可能是由于感知失败或正确感知基础上的推理错误,单一置信度将这些来源混淆在一起,而视觉不确定性通常由语言先验主导。为了解决这些问题,我们提出VL-Calibration,这是一种强化学习框架,明确地将置信度解耦为视觉置信度和推理置信度。为了在没有真实感知标签的情况下监督视觉置信度,我们引入了一种内在的视觉确定性估计,结合了(i)在图像扰动下通过KL散度测量的视觉定位和(ii)通过词元熵测量的内部确定性。我们进一步提出词元级优势加权重估,以根据视觉确定性聚焦优化,抑制无根据的幻觉,同时保留有效的感知。在十三个基准上的实验表明,VL-Calibration有效提高了校准效果,同时提升了视觉推理准确性,并且在不同模型规模和架构的离分布基准上具有良好的泛化能力。
Summary / 总结
VL-Calibration is a reinforcement learning framework that decouples confidence into visual and reasoning components to improve the calibration of large vision-language models. It supervises visual confidence with an intrinsic visual certainty estimate that combines KL-divergence under image perturbations with token entropy, and it applies token-level advantage reweighting to suppress ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration improves both calibration and visual reasoning accuracy, and that it generalizes to out-of-distribution benchmarks across model scales and architectures.
VL-Calibration 是一种强化学习框架,旨在通过将信心拆分为视觉和推理两个部分来提高大型视觉语言模型(LVLM)的信心校准。它引入了使用 KL 散度和词元熵的内在视觉确定性估计,并通过词元级优势重加权来关注视觉确定性的词元,从而抑制无根据的幻觉并保留有效的感知。实验表明,VL-Calibration 在十三个基准上提高了校准和视觉推理准确性,并在不同模型规模和架构的分布外基准上表现出良好的泛化能力。
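The two signals named in the abstract, KL-divergence under image perturbation and token entropy, can be combined in many ways. The sketch below shows one possible combination for a single token's next-token distribution. The function `visual_certainty`, the squashing `1 - exp(-KL)`, and the mixing weight `alpha` are illustrative assumptions, not the paper's formula.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def visual_certainty(p_clean, p_perturbed, alpha=0.5):
    """Hypothetical score in [0, 1] for one token.

    p_clean:     token distribution given the original image
    p_perturbed: token distribution given a perturbed image
    A large KL means the token depends on the image (visually grounded);
    low entropy means the model is internally certain.
    """
    grounding = 1.0 - math.exp(-kl_divergence(p_clean, p_perturbed))
    certainty = 1.0 - entropy(p_clean) / math.log(len(p_clean))
    return alpha * grounding + (1.0 - alpha) * certainty

# A token whose distribution shifts when the image is perturbed and is sharply
# peaked scores higher than one that ignores the image and is near-uniform.
grounded = visual_certainty([0.9, 0.05, 0.05], [0.2, 0.4, 0.4])
ungrounded = visual_certainty([0.4, 0.3, 0.3], [0.4, 0.3, 0.3])
```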
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
Authors: Yucheng Shen, Jiulong Wu, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao
First: 2026-04-10T17:25:34+00:00 · Latest: 2026-04-10T17:25:34+00:00
Abstract
Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. (1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality. (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. These mechanisms anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.
中文标题/摘要
标题:VISOR:通过迭代搜索和超视距推理增强的代理视觉检索生成
视觉检索增强生成(VRAG)赋予视觉语言模型检索和处理丰富视觉文档的能力。为应对需要多步推理的复杂查询,代理型VRAG系统将推理与迭代检索交织进行。然而,现有的代理型VRAG面临两个关键瓶颈:(1)视觉证据稀疏性:关键证据分散在多页中且孤立处理,妨碍跨页推理;此外,细粒度的图像内证据往往需要精确的视觉操作,其误用会降低检索质量;(2)长视距搜索漂移:检索的多页视觉标记的累积会稀释上下文并导致认知过载,使代理偏离搜索目标。为解决这些挑战,我们提出了VISOR(通过迭代搜索和超视距推理增强的视觉检索生成),这是一种统一的单代理框架。VISOR具备结构化的证据空间,支持逐步的跨页推理,并结合了视觉操作评估与修正机制来管理视觉操作。此外,我们引入了动态轨迹与滑动窗口和意图注入机制来缓解搜索漂移。它们锚定证据空间,同时丢弃早期的原始交互,防止上下文被视觉标记淹没。我们使用基于组相对策略优化的强化学习(GRPO-based RL)管道进行训练,该管道具有状态遮蔽和针对动态上下文重建的奖励分配。在ViDoSeek、SlideVQA和MMLongBench上的广泛实验表明,VISOR在长视距视觉推理任务中实现了最先进的性能,且具有更高的效率。
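The idea of keeping the evidence space and the search intent while discarding earlier raw interactions can be sketched as a context-rebuilding step. The function `build_context`, the string tags, and the `window` parameter are hypothetical names chosen for illustration; the paper's actual context layout is not given in the abstract.

```python
def build_context(intent, evidence, interactions, window=2):
    """Hypothetical sketch of a sliding-window trajectory.

    intent:       the search objective, re-injected at every step
    evidence:     accumulated structured evidence, always retained
    interactions: raw per-step interactions (e.g. retrieved page content)
    window:       number of most recent raw interactions to keep
    """
    # Guard against window == 0, since interactions[-0:] would keep everything.
    recent = interactions[-window:] if window > 0 else []
    context = [f"[intent] {intent}"]
    context += [f"[evidence] {item}" for item in evidence]
    context += recent
    return context

ctx = build_context(
    "find Q3 revenue",
    ["p2: revenue table"],
    ["obs1", "obs2", "obs3", "obs4"],
    window=2,
)
```

Here the context length is bounded by the evidence size plus `window`, rather than growing with every retrieved page.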
HaloProbe: Bayesian Detection and Mitigation of Object Hallucinations in Vision-Language Models
Authors: Reihaneh Zohrabi, Hosein Hasani, Akshita Gupta, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach
First: 2026-04-07T17:58:04+00:00 · Latest: 2026-04-10T16:49:15+00:00
Abstract
Large vision-language models can produce object hallucinations in image descriptions, highlighting the need for effective detection and mitigation strategies. Prior work commonly relies on the model's attention weights on visual tokens as a detection signal. We reveal that coarse-grained attention-based analysis is unreliable due to hidden confounders, specifically token position and object repetition in a description. This leads to Simpson's paradox: the attention trends reverse or disappear when statistics are aggregated. Based on this observation, we introduce HaloProbe, a Bayesian framework that factorizes external description statistics and internal decoding signals to estimate token-level hallucination probabilities. HaloProbe uses balanced training to isolate internal evidence and combines it with a learned prior over external features to recover the true posterior. While intervention-based mitigation methods often degrade utility or fluency by modifying models' internals, we use HaloProbe as an external scoring signal for non-invasive mitigation. Our experiments show that HaloProbe-guided decoding reduces hallucinations more effectively than state-of-the-art intervention-based methods while preserving utility.
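Combining a classifier trained on balanced data with a separate prior is a standard Bayesian correction, sketched below. Under the idealized assumption that the balanced classifier's odds equal the likelihood ratio of the internal evidence, multiplying by the prior odds recovers the posterior. The function name `recover_posterior` is hypothetical, and a single scalar `prior` stands in for the learned prior over external features (such as token position and object repetition).

```python
def recover_posterior(p_balanced, prior):
    """Hypothetical sketch of prior correction.

    p_balanced: hallucination probability from a classifier trained on
                class-balanced data (its odds approximate a likelihood ratio)
    prior:      prior hallucination probability given external features
    """
    if not (0.0 < p_balanced < 1.0 and 0.0 < prior < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    likelihood_ratio = p_balanced / (1.0 - p_balanced)
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)
```

For example, uninformative internal evidence (`p_balanced = 0.5`) returns the prior itself, while strong internal evidence (`p_balanced = 0.9`) against a low prior of 0.1 yields a posterior of about 0.5.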
Towards Knowledgeable Deep Research: Framework and Benchmark
Authors: Wenxuan Liu, Zixuan Li, Long Bai, Chunmao Zhang, Fenghui Zhang, Zhuo Chen, Wei Li, Yuxin Zuo, Fei Wang, Bingbing Xu, Xuhui Jiang, Jin Zhang, Xiaolong Jin, Jiafeng Guo, Tat-Seng Chua, Xueqi Cheng
First: 2026-04-09T02:06:27+00:00 · Latest: 2026-04-10T16:24:42+00:00
Abstract
Deep Research (DR) requires LLM agents to autonomously perform multi-step information seeking, processing, and reasoning to generate comprehensive reports. In contrast to existing studies that mainly focus on unstructured web content, a more challenging DR task should additionally utilize structured knowledge to provide a solid data foundation, facilitate quantitative computation, and lead to in-depth analyses. In this paper, we refer to this novel task as Knowledgeable Deep Research (KDR), which requires DR agents to generate reports with both structured and unstructured knowledge. Furthermore, we propose the Hybrid Knowledge Analysis framework (HKA), a multi-agent architecture that reasons over both kinds of knowledge and integrates the texts, figures, and tables into coherent multimodal reports. The key design is the Structured Knowledge Analyzer, which utilizes both coding and vision-language models to produce figures, tables, and corresponding insights. To support systematic evaluation, we construct KDR-Bench, which covers 9 domains, includes 41 expert-level questions, and incorporates a large number of structured knowledge resources (e.g., 1,252 tables). We further annotate the main conclusions and key points for each question and propose three categories of evaluation metrics including general-purpose, knowledge-centric, and vision-enhanced ones. Experimental results demonstrate that HKA consistently outperforms most existing DR agents on general-purpose and knowledge-centric metrics, and even surpasses the Gemini DR agent on vision-enhanced metrics, highlighting its effectiveness in deep, structure-aware knowledge analysis. Finally, we hope this work can serve as a new foundation for structured knowledge analysis in DR agents and facilitate future multimodal DR studies.
中文标题/摘要
标题:迈向有知识的深度研究:框架与基准
深度研究(DR)要求LLM代理自主进行多步信息检索、处理和推理以生成全面的报告。与主要关注无结构网页内容的现有研究不同,更具挑战性的DR任务还应利用结构化知识提供坚实的数据基础,促进定量计算,并导致深入分析。在本文中,我们将这一新型任务称为有知识的深度研究(KDR),要求DR代理生成包含结构化和非结构化知识的报告。此外,我们提出了混合知识分析框架(HKA),这是一种多代理架构,能够在两种类型的知识上进行推理,并将文本、图表和表格整合成连贯的多模态报告。关键设计是结构化知识分析器,它利用编码和视觉语言模型生成图表、表格及其相应的见解。为了支持系统的评估,我们构建了KDR-Bench,涵盖了9个领域,包括41个专家级问题,并整合了大量的结构化知识资源(例如,1,252张表格)。我们进一步为每个问题标注了主要结论和关键点,并提出了通用、知识中心和视觉增强三种类型的评估指标。实验结果表明,HKA在通用和知识中心的评估指标上始终优于大多数现有DR代理,并且在视觉增强的评估指标上甚至超过了Gemini DR代理,突显了其在深度结构化知识分析方面的有效性。最后,我们希望这项工作能够为DR代理中的结构化知识分析提供新的基础,并促进未来的多模态DR研究。
Summary / 总结
This paper introduces Knowledgeable Deep Research (KDR), which extends traditional Deep Research by incorporating structured knowledge for more comprehensive and in-depth analysis. The authors propose the Hybrid Knowledge Analysis (HKA) framework, a multi-agent architecture that integrates structured and unstructured knowledge into coherent multimodal reports. Key findings show that HKA outperforms most existing DR agents on general-purpose and knowledge-centric metrics and even surpasses the Gemini DR agent on vision-enhanced metrics, demonstrating its effectiveness in deep, structure-aware knowledge analysis.
本文提出了知识型深度研究(KDR),将传统的深度研究扩展到包含结构化知识,以实现更全面和深入的分析。作者提出了混合知识分析(HKA)框架,将结构化和非结构化知识整合到连贯的多模态报告中。主要发现表明,HKA在通用和知识中心的指标上优于现有深度研究代理,并且在视觉增强指标上甚至超过了Gemini,证明了其在深度结构化知识分析中的有效性。
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Authors: Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu
First: 2026-04-10T16:07:14+00:00 · Latest: 2026-04-10T16:07:14+00:00
Abstract
Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
SCoRe: Clean Image Generation from Diffusion Models Trained on Noisy Images
Authors: Yuta Matsuzaki, Seiichi Uchida, Shumpei Takezaki
First: 2026-04-10T15:51:57+00:00 · Latest: 2026-04-10T15:51:57+00:00
Comments: Accepted at IJCNN 2026
Abstract
Diffusion models trained on noisy datasets often reproduce high-frequency training artifacts, significantly degrading generation quality. To address this, we propose SCoRe (Spectral Cutoff Regeneration), a training-free, generation-time spectral regeneration method for clean image generation from diffusion models trained on noisy images. Leveraging the spectral bias of diffusion models, which infer high-frequency details from low-frequency cues, SCoRe suppresses corrupted high-frequency components of a generated image via a frequency cutoff and regenerates them via SDEdit. Crucially, we derive a theoretical mapping between the cutoff frequency and the SDEdit initialization timestep based on Radially Averaged Power Spectral Density (RAPSD), which prevents excessive noise injection during regeneration. Experiments on synthetic (CIFAR-10) and real-world (SIDD) noisy datasets demonstrate that SCoRe substantially outperforms post-processing and noise-robust baselines, restoring samples closer to clean image distributions without any retraining or fine-tuning.
Summary / 总结
The paper addresses the issue of high-frequency artifacts in images generated by diffusion models trained on noisy data. It introduces SCoRe (Spectral Cutoff Regeneration), a generation-time method that suppresses corrupted high-frequency components and regenerates them using SDEdit, based on a theoretical mapping derived from RAPSD. Experiments show that SCoRe outperforms existing post-processing and noise-robust baselines, improving the quality of generated images without retraining or fine-tuning.
论文针对由噪声数据训练的扩散模型生成图像时出现的高频伪影问题,提出了一种名为SCoRe(频谱截止再生)的生成时方法,该方法通过抑制损坏的高频成分并使用SDEdit再生,基于从径向平均功率谱密度(RAPSD)推导出的理论映射。实验表明,SCoRe在不进行重新训练或微调的情况下,显著优于现有后处理和噪声鲁棒基线,提高了生成图像的质量。
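The frequency-cutoff step can be illustrated on a 1-D signal using only the standard library. This is a simplified stand-in: real images are 2-D, the helper names `dft`, `idft`, and `low_pass` are hypothetical, and the regeneration of high frequencies via SDEdit (and the RAPSD-based choice of timestep) is not shown.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def low_pass(x, cutoff):
    """Zero every frequency bin above `cutoff` (a hard spectral cutoff)."""
    n = len(x)
    X = dft(x)
    for k in range(n):
        # Bins k and n - k represent the same physical frequency.
        if min(k, n - k) > cutoff:
            X[k] = 0
    return idft(X)

# Toy signal: a low-frequency component plus a high-frequency "artifact".
n = 32
clean = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [c + 0.5 * math.cos(2 * math.pi * 12 * t / n) for t, c in zip(range(n), clean)]
filtered = low_pass(noisy, cutoff=4)
```

In this toy case the filtered signal matches the clean component up to floating-point error, because the artifact lies entirely above the cutoff. In the setting the abstract describes, the discarded band would then be regenerated by the diffusion model rather than left empty.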
Do Vision Language Models Need to Process Image Tokens?
Authors: Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal
Venue: TRUE-V Workshop, CVPR 2026 (Oral)
First: 2026-04-10T15:38:00+00:00 · Latest: 2026-04-10T15:38:00+00:00
Comments: Accepted (Oral) at TRUE-V Workshop CVPR 2026
Abstract
Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or whether visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image tokens in VLMs and show that visual representations rapidly converge to a bounded-complexity regime, i.e., their entropy stabilizes, intrinsic dimensionality compresses, and trajectory curvature approaches a near-constant profile. In contrast, textual representations continue to undergo substantial restructuring across depth. Once stabilized, visual representations become largely interchangeable between layers, indicating limited additional transformation in deeper stages. Further, depth-wise visual truncation reveals that the necessity of visual processing is task-dependent: single-token predictions remain comparatively robust to truncated visual depth, but multi-token generation requires sustained access to visual representations. Under deterministic decoding, reducing visual depth perturbs intermediate reasoning trajectories more strongly than final outputs, suggesting that image tokens influence the structure of reasoning more than the ultimate conclusions. Collectively, these findings question the assumption that deeper visual processing is uniformly essential in VLMs, challenging the current paradigm of multimodal LLM architectures.
Summary / 总结
This study investigates whether sustained image-token processing is necessary in Vision Language Models (VLMs) and finds that visual representations rapidly stabilize, becoming largely interchangeable between layers, while textual representations continue to evolve. The study also shows that the necessity of visual processing depends on the task, with single-token predictions being more robust to truncated visual depth compared to multi-token generation. The findings challenge the assumption that deeper visual processing is uniformly essential in VLMs.
该研究通过考察Vision Language Models (VLMs)各层的视觉和文本表示,探讨了图像标记处理的必要性。研究发现,视觉表示迅速稳定,并在各层间基本可互换,而文本表示则继续演变。研究还表明,视觉处理的必要性取决于任务,单个标记预测对视觉深度的截断更为稳健,而多标记生成则需要持续的视觉表示。这些发现挑战了更深的视觉处理在VLMs中普遍必不可少的假设。
EGLOCE: Training-Free Energy-Guided Latent Optimization for Concept Erasure
Authors: Junyeong Ahn, Seojin Yoon, Sungyong Baik
First: 2026-04-10T15:19:02+00:00 · Latest: 2026-04-10T15:19:02+00:00
Abstract
As text-to-image diffusion models grow increasingly prevalent, the ability to remove specific concepts (mostly explicit content and many copyrighted characters or styles) has become essential for safety and compliance. Existing unlearning approaches often require costly re-training, modify parameters at the cost of degrading the fidelity of unrelated concepts, or depend on indirect inference-time adjustments that compromise the effectiveness of concept erasure. Inspired by the success of energy-guided sampling for preservation of the condition of diffusion models, we introduce Energy-Guided Latent Optimization for Concept Erasure (EGLOCE), a training-free approach that removes unwanted concepts by redirecting the noisy latent during inference. Our method employs a dual-objective framework: a repulsion energy that steers generation away from target concepts via gradient descent in latent space, and a retention energy that preserves semantic alignment to the original prompt. Combined with previous approaches that either require erroneous modified model weights or provide weak inference-time guidance, EGLOCE operates entirely at inference and enhances erasure performance, enabling plug-and-play integration. Extensive experiments demonstrate that EGLOCE improves concept removal while maintaining image quality and prompt alignment across baselines, even with adversarial attacks. To the best of our knowledge, our work is the first to establish a new paradigm for safe and controllable image generation through dual energy-based guidance during sampling.
中文标题/摘要
标题:EGLOCE:无需训练的能量引导潜在优化以消除概念
随着文本到图像扩散模型的日益普及,移除特定概念——主要是明确内容和许多受版权保护的角色或风格——的能力对于安全和合规变得至关重要。现有的遗忘方法通常需要昂贵的重新训练,修改参数会损害无关概念的保真度,或者依赖于间接的推理时调整,这会削弱概念消除的有效性。受扩散模型能量引导采样以保留条件成功的启发,我们提出了能量引导潜在优化以消除概念(EGLOCE),这是一种无需训练的方法,通过在推理过程中重新引导噪声潜在变量来移除不需要的概念。我们的方法采用了一种双目标框架:排斥能量,通过潜在空间中的梯度下降引导生成远离目标概念,保留能量,保持与原始提示的语义对齐。结合之前需要错误修改模型权重或提供弱推理时指导的方法,EGLOCE 完全在推理过程中运行,增强消除性能,实现即插即用集成。大量实验表明,EGLOCE 在保持图像质量和提示对齐的同时,即使在对抗攻击下也能提高概念移除效果。据我们所知,我们的工作是第一个通过采样期间的能量引导双目标框架实现安全可控图像生成的方法。
Summary / 总结
EGLOCE is a training-free method that uses energy-guided latent optimization to remove unwanted concepts from images generated by text-to-image diffusion models. It employs a dual-objective framework with repulsion and retention energies to steer generation away from target concepts while preserving semantic alignment to the original prompt. Experiments show that EGLOCE effectively removes concepts while maintaining image quality and prompt alignment, even under adversarial attacks, outperforming existing approaches that require re-training or indirect inference-time adjustments.
EGLOCE 是一种无需训练的方法,通过能量引导的潜在优化来从由文本到图像的扩散模型生成的图像中移除不需要的概念。它采用一个包含排斥能和保留能的双重目标框架,以引导生成远离目标概念的同时保持与原始提示的语义对齐。实验表明,EGLOCE 在有效移除概念的同时保持图像质量和提示对齐,即使在对抗攻击下也能超越现有需要重新训练或间接的推理时调整的方法。
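The dual-energy update can be sketched with toy quadratic energies on a small latent vector. Everything here is an assumption made for illustration: the repulsion energy penalizes alignment with a fixed concept direction, the retention energy keeps the latent near its starting point (a stand-in for alignment to the original prompt), and the names `erase_step`, `erase`, `lr`, and `retain_weight` are hypothetical. The paper's actual energies are computed with the diffusion model and are not specified in the abstract.

```python
def erase_step(z, z_ref, concept, lr=0.1, retain_weight=0.1):
    """One gradient-descent step on E = E_rep + retain_weight * E_ret, with
    E_rep = 0.5 * (z . concept)^2   (toy repulsion from the concept direction)
    E_ret = 0.5 * ||z - z_ref||^2   (toy retention toward the reference latent)
    """
    align = sum(zi * ci for zi, ci in zip(z, concept))
    return [zi - lr * (align * ci + retain_weight * (zi - ri))
            for zi, ci, ri in zip(z, concept, z_ref)]

def erase(z0, concept, steps=100, lr=0.1, retain_weight=0.1):
    """Run several steps starting from z0, using z0 itself as the reference."""
    z = list(z0)
    for _ in range(steps):
        z = erase_step(z, z0, concept, lr=lr, retain_weight=retain_weight)
    return z

# The component along the concept direction shrinks; the orthogonal
# component, which carries unrelated content in this toy, is untouched.
z_final = erase([2.0, 1.0], concept=[1.0, 0.0])
```

The retention weight sets the trade-off: with these toy energies the concept-aligned component settles at `2 * retain_weight / (1 + retain_weight)` rather than at zero, so a smaller weight erases more aggressively at the cost of drifting further from the reference.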
ClusterMark: Towards Robust Watermarking for Autoregressive Image Generators with Visual Token Clustering
Authors: Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer
Venue: CVPR 2026
First: 2025-08-08T19:14:22+00:00 · Latest: 2026-04-10T15:13:10+00:00
Comments: CVPR 2026
Abstract
In-generation watermarking for latent diffusion models has recently shown high robustness in marking generated images for easier detection and attribution. However, its application to autoregressive (AR) image models is underexplored. Autoregressive models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a VQ-VAE decoder. Inspired by KGW watermarking for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose a watermarking approach based on visual token clustering, which assigns similar tokens to the same set (red or green). We investigate token clustering in a training-free setting, as well as in combination with a more accurate fine-tuned token or cluster predictor. Overall, our experiments show that cluster-based watermarks greatly improve robustness against perturbations and regeneration attacks while preserving image quality, outperforming a set of baselines and concurrent works. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking techniques.
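Why cluster-level green/red assignment helps robustness can be shown with a small KGW-style detector. In this sketch the green list is decided per cluster and seeded by the previous token's cluster, so swapping a token for a similar one in the same cluster does not change the detection statistic. The hashing scheme, the `key`, `gamma`, and the toy cluster map are illustrative assumptions, not the paper's construction.

```python
import hashlib
import math

def is_green(prev_cluster, cluster, key="demo-key", gamma=0.5):
    """Pseudo-randomly mark `cluster` as green given the previous cluster."""
    digest = hashlib.sha256(f"{key}:{prev_cluster}:{cluster}".encode()).digest()
    return digest[0] / 256 < gamma

def z_score(tokens, token_to_cluster, key="demo-key", gamma=0.5):
    """One-proportion z-test on the number of green transitions."""
    clusters = [token_to_cluster[t] for t in tokens]
    pairs = list(zip(clusters, clusters[1:]))
    green = sum(is_green(a, b, key, gamma) for a, b in pairs)
    n = len(pairs)
    return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Toy codebook: 64 visual tokens grouped into 16 clusters of 4 similar tokens.
token_to_cluster = {t: t // 4 for t in range(64)}
tokens = [3, 17, 42, 8, 55, 21, 60, 12]
# A perturbation that moves every token to a neighbour in the same cluster.
perturbed = [t ^ 1 for t in tokens]

z_original = z_score(tokens, token_to_cluster)
z_perturbed = z_score(perturbed, token_to_cluster)
```

A token-level scheme would treat every changed token as a fresh coin flip, which is consistent with the abstract's observation that detectability drops considerably under common image perturbations.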
CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning
Authors: Rui Gan, Junyi Ma, Pei Li, Xingyou Yang, Kai Chen, Sikai Chen, Bin Ran
First: 2026-04-09T16:52:04+00:00 · Latest: 2026-04-10T14:52:27+00:00
Abstract
Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present CrashSight, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark 8 state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
Summary / 总结
CrashSight is a vision-language benchmark for understanding traffic crash scenes from the infrastructure perspective, built on real-world roadside camera data. It includes 250 crash videos with 13K multiple-choice question-answer pairs organized into two tiers that evaluate visual grounding and higher-level reasoning. The benchmark evaluates 8 state-of-the-art vision-language models and finds that, despite strong scene description capabilities, they struggle with temporal and causal reasoning in safety-critical scenarios. The dataset and code are available at https://mcgrche.github.io/crashsight.
CrashSight 是一个基于真实路侧摄像头数据、从基础设施视角理解交通事故场景的视觉语言基准。它包含250个事故视频和13K个多项选择问答对,分为两个层级来评估视觉定位和高层次推理。该基准测试了8个最先进的视觉语言模型,并发现它们在安全关键场景中的时间和因果推理方面存在困难,强调了这一领域的改进需求。数据集和代码可在 https://mcgrche.github.io/crashsight 获取。
Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation
Authors: Lingfeng Huang, Huizhong Guo, Tianjun Wei, Yingpeng Du, Zhu Sun
First: 2026-04-10T14:38:45+00:00 · Latest: 2026-04-10T14:38:45+00:00
Abstract
Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse. This is a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
中文标题/摘要
标题:通过他们的视角:基于注视点对齐的个性化用户模拟调优
大型语言模型(LLM)代理越来越多地被部署为推荐系统评估的可扩展用户模拟器。然而,现有的模拟器通过文本或结构化元数据而不是实际用户浏览的视觉界面来感知推荐——这是一个关键差距,因为对推荐布局的关注既受视觉驱动又高度个性化。我们研究视觉语言模型(VLM)的视觉注意力与用户特定注视模式对齐是否能提高模拟的真实性。对一个基于轮播推荐设置收集的真实世界的眼动追踪数据集的分析表明,用户表现出稳定的个体注视模式,这些模式强烈预测点击行为。基于这一发现,我们提出了注视点对齐的用户模拟(FixATE)。我们的方法首先通过可解释性操作探查VLM的内部视觉注意力,以获得与人类注视分布相当的槽级相关性分布,然后学习个性化的软提示,引导模型的注意力朝向每个用户的特点注视模式。在三个可解释性探查操作符和两个架构上不同的VLM后端上进行的实验表明,在注意力对齐和点击预测准确性方面都取得了持续的改进。这些结果表明,让模型“像用户一样看”可能是实现更忠实于用户如何感知和在推荐界面中行动的模拟器的一种可行途径。
Summary / 总结
The research aims to enhance the fidelity of large language model agents as user simulators for recommender systems by aligning their visual attention with user-specific gaze patterns. The method involves using interpretability operators to probe a vision-language model's internal visual attention and then learning personalized soft prompts to steer the model's attention towards each user's characteristic fixation pattern. The experiments across different probing operators and VLM backbones show consistent improvements in attention alignment and click prediction accuracy, indicating that making the model 'see like the user' can better simulate user behavior in recommendation interfaces.
研究旨在通过使大型语言模型代理的视觉注意力与用户的特定注视模式对齐,来提高其作为推荐系统用户模拟器的准确性。方法是利用视觉语言模型的内部视觉注意力,并学习个性化的软提示,使其注意力朝向每个用户的特征注视模式。实验结果显示,在不同的探针操作和模型架构下,注意力对齐和点击预测准确性都有持续改进,表明模拟器可以更好地模拟用户在推荐界面中的感知和行为。
Arbitration Failure, Not Perceptual Blindness: How Vision-Language Models Resolve Visual-Linguistic Conflicts
Authors: Farhad Nooralahzadeh, Omid Rohanian, Yi Zhang, Jonathan Fürst, Kurt Stockinger
First: 2026-04-10T14:36:07+00:00 · Latest: 2026-04-10T14:36:07+00:00
Abstract
When a Vision-Language Model (VLM) sees a blue banana and answers "yellow", is the problem one of perception or arbitration? We explore the question in ten VLMs of various sizes and reveal an Encoding--Grounding Dissociation: models that fail to report what they see (and thus provide a wrong answer) still encode the visual evidence as strongly as models that provide the correct answer. Using Multimodal Arbitration Crossover (MAC) analysis with layer-by-layer Logit Lens probing, we track the competition between visual and prior signals across every layer of each model. We show that visual attributes can be linearly decodable from early layers (AUC > 0.86). The accuracy remains nearly identical for both successful and failed samples. However, the gap in the final-layer logit -- not the strength of encoding -- better predicts grounding outcomes with a correlation of . After having studied when VLMs base their answers on image clues rather than prior knowledge, we want to understand the causal relationships. We establish causality through full-sequence activation patching. The standard last-token interventions in LLM interpretability do not affect VLMs. In contrast, replacing the full token sequence at layers identified by MAC alters 60 to 84% of outputs. Partial-token decomposition shows that image tokens carry almost all of the causal impact, while text tokens have none. Scaling addresses the remaining architectural differences to achieve perfect retention. Moving from diagnosis to intervention, we show that training-free activation steering -- both linear and sparse autoencoder-guided -- in early layers can improve visual grounding by up to +3.8%, though performance degrades in some setups. Overall, these findings lead to a clear conclusion: VLMs already see well, but the challenge is acting on what they see. Targeted interventions can help to bridge this gap.
中文标题/摘要
标题:仲裁失败,而非感知盲点:视觉语言模型如何解决视觉语言冲突
当视觉语言模型(VLM)看到一个蓝色的香蕉并回答‘黄色’时,是感知问题还是仲裁问题?我们在这十个不同规模的VLM中进行了探索,揭示了编码-接地分离:未能报告所见(从而给出错误答案)的模型与给出正确答案的模型一样强烈地编码了视觉证据。通过分层逻辑探针的多模态仲裁交叉(MAC)分析,我们追踪了每层模型中视觉信号和先验信号之间的竞争。我们表明,早期层中的视觉属性可以线性可解(AUC > 0.86)。准确率在成功和失败样本中几乎相同。然而,最终层的逻辑差距——而不是编码强度——更好地预测了接地结果,相关性为。在研究VLMs何时基于图像线索而非先验知识作答后,我们想理解因果关系。我们通过整个序列激活补丁建立了因果关系。LLM的最后标记干预对VLMs没有影响。相反,由MAC识别的层替换整个标记序列会改变60%到84%的输出。部分标记分解表明,图像标记几乎承载了全部因果影响,而文本标记没有影响。通过缩放解决了剩余的架构差异,实现了完美的保留。从诊断转向干预,我们表明,早期层的无训练激活引导——无论是线性还是稀疏自编码器引导——可以提高视觉接地高达+3.8%,但在某些设置中性能会下降。总体而言,这些发现得出一个明确的结论:VLMs已经看得很好,但挑战在于如何行动。有针对性的干预可以帮助弥合这一差距。
Summary / 总结
The study investigates whether Vision-Language Models (VLMs) fail due to perceptual issues or arbitration problems. By analyzing ten VLMs with different sizes, the researchers found that models that provide incorrect answers still encode visual evidence as strongly as those that give correct answers. Using a method called Multimodal Arbitration Crossover (MAC) analysis, they discovered that visual attributes are linearly decodable from early layers, and the accuracy of these decodings is consistent across successful and failed samples. However, the gap in the final-layer logit better predicts the grounding outcomes. The study also shows that replacing the full token sequence at specific layers can significantly alter the model's outputs, indicating that image tokens carry the causal impact, while text tokens do not. These findings suggest that VLMs already have strong visual encoding capabilities but struggle with acting on this information, and targeted interventions can improve visual grounding.
研究探讨了Vision-Language模型(VLM)出错是由于感知问题还是仲裁问题。通过对十种不同大小的VLM进行分析,研究者发现,提供错误答案的模型与给出正确答案的模型在编码视觉证据方面同样强大。使用Multimodal Arbitration Crossover(MAC)分析方法,研究者发现视觉属性可以在早期层线性可解码,并且这些解码的准确性在成功和失败样本中是一致的。然而,最终层的logit差距更能预测接地结果。研究还表明,在特定层替换整个令牌序列可以显著改变模型的输出,表明图像令牌承载因果影响,而文本令牌没有。这些发现表明,VLMs已经在视觉编码方面表现出色,但难以据此行动,而有针对性的干预可以改善视觉接地。
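The layer-by-layer Logit Lens probing described in this entry can be illustrated with a small sketch. It assumes toy hidden states and a toy unembedding matrix, skips the final normalization a real model would apply, and uses hypothetical helper names (`logit_lens_gap`, `first_crossover`). Treating the "crossover" as the first layer where the visual answer overtakes the prior answer is an illustrative reading, not the paper's stated definition of MAC.

```python
from typing import List, Optional

def logit_lens_gap(hidden_states: List[List[float]],
                   unembed: List[List[float]],
                   visual_id: int,
                   prior_id: int) -> List[float]:
    """Per-layer logit gap (visual answer minus prior answer).

    hidden_states[l] is the residual vector at layer l for the answer position;
    unembed[t] is the unembedding row for token t (toy values, assumed).
    """
    gaps = []
    for h in hidden_states:
        # Logit lens: project the intermediate state straight onto the vocabulary.
        visual_logit = sum(w * x for w, x in zip(unembed[visual_id], h))
        prior_logit = sum(w * x for w, x in zip(unembed[prior_id], h))
        gaps.append(visual_logit - prior_logit)
    return gaps

def first_crossover(gaps: List[float]) -> Optional[int]:
    """Index of the first layer where the visual token outscores the prior token."""
    for layer, gap in enumerate(gaps):
        if gap > 0:
            return layer
    return None

# Toy example: token 0 plays the role of "blue" (visual), token 1 "yellow" (prior).
unembed = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = [[0.0, 2.0], [1.0, 1.5], [3.0, 1.0]]
gaps = logit_lens_gap(hidden_states, unembed, visual_id=0, prior_id=1)
crossover = first_crossover(gaps)
```

In this toy run the prior answer dominates the first two layers and the visual answer wins only at the last one; a sample whose final gap stays negative would correspond to a grounding failure in the abstract's terms.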
Visually-Guided Policy Optimization for Multimodal Reasoning
Authors: Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu
Venue: ACL 2026
First: 2026-04-10T14:22:38+00:00 · Latest: 2026-04-10T14:22:38+00:00
Comments: ACL 2026
Abstract
Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.
Summary / 总结
The paper proposes Visually-Guided Policy Optimization (VGPO) to enhance the visual focus of vision-language models in reinforcement learning. VGPO introduces a Visual Attention Compensation mechanism to localize and amplify visual cues, and a dual-grained advantage re-weighting strategy to prioritize visually rich trajectories. Experiments show that VGPO improves visual activation and performs better in mathematical multimodal reasoning and visual-dependent tasks.
论文提出了一种视觉引导策略优化(VGPO)来增强视觉注意力,以提高视觉语言模型在强化学习中的视觉聚焦。VGPO引入了视觉注意力补偿机制来定位和放大视觉线索,并采用双粒度优势加权策略来优先选择视觉丰富的轨迹。实验表明,VGPO提高了视觉激活,并在数学多模态推理和视觉依赖任务中表现更优。
Screen, Cache, and Match: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation
Authors: Jianan Wang, Nailei Hei, Li He, Huanzhen Wang, Aoxing Li, Yingkai Zhao, Yuxuan Lin, Haofen Wang, Chunyang Wang, Yan Wang, Wenqiang Zhang
First: 2025-12-13T08:45:03+00:00 · Latest: 2026-04-10T13:20:41+00:00
Abstract
Human animation aims to generate temporally coherent and visually consistent videos over long sequences, yet modeling long-range dependencies while preserving frame quality remains challenging. Inspired by the human ability to leverage past observations for interpreting ongoing actions, we propose FrameCache, a training-free, causality-consistent reference frame framework. FrameCache explicitly converts historical generation results into causal guidance through two complementary mechanisms. First, at the reference level, a novel Screen-Cache-Match (SCM) strategy constructs a dynamic, high-quality reference memory, ensuring motion-consistent appearance guidance to reduce identity drift. Second, at the generative level, a Trajectory-Aware Autoregressive Generation (TAAG) mechanism aligns denoising trajectories across adjacent video chunks. This is achieved through an overlap-aware latent propagation and a dual-domain fusion strategy that seamlessly blends low-frequency structural layouts with high-frequency textural details. Extensive experiments on standard benchmarks demonstrate that FrameCache consistently improves temporal coherence and visual stability while integrating seamlessly with diverse diffusion baselines. Code will be made publicly available.
中文标题/摘要
标题:屏幕、缓存和匹配:一种无需训练的因果一致性参考框架方法用于人体动画
人体动画旨在生成长时间序列中时序连贯且视觉一致的视频,但在建模长距离依赖关系的同时保持帧质量仍然具有挑战性。受人类利用过往观察来解释正在进行的动作的能力启发,我们提出了FrameCache,一种无需训练的因果一致性参考框架方法。FrameCache通过两种互补机制显式地将历史生成结果转换为因果指导。首先,在参考级别上,一种新颖的屏幕-缓存-匹配(SCM)策略构建了一个动态的高质量参考记忆,确保运动一致的外观指导以减少身份漂移。其次,在生成级别上,一种轨迹感知自回归生成(TAAG)机制在相邻视频块之间对去噪轨迹进行对齐。这通过重叠感知的潜在传播和一种双域融合策略实现,该策略无缝地将低频结构布局与高频纹理细节融合在一起。在标准基准上的广泛实验表明,FrameCache在提高时序连贯性和视觉稳定性的同时,能够无缝集成到各种扩散基线中。代码将公开发布。
Summary / 总结
This study addresses the challenge of generating temporally coherent, visually consistent human animation over long sequences by proposing FrameCache, a training-free, causality-consistent reference frame framework. The framework uses two mechanisms: Screen-Cache-Match (SCM), which constructs a dynamic, high-quality reference memory for motion-consistent appearance guidance, and Trajectory-Aware Autoregressive Generation (TAAG), which aligns denoising trajectories across adjacent video chunks. Experimental results show that FrameCache consistently enhances temporal coherence and visual stability and integrates seamlessly with diverse diffusion baselines.
论文旨在解决生成长时间连贯且视觉稳定的真人动画的挑战。提出了一种无需训练的FrameCache框架,通过Screen-Cache-Match (SCM) 策略创建动态参考记忆,并通过Trajectory-Aware Autoregressive Generation (TAAG) 机制对相邻视频块的去噪轨迹进行对齐。该框架在提高时间连贯性和视觉稳定性方面表现出色,并能与各种扩散模型无缝集成。基准测试表明,该框架在这些方面表现出一致的改进。
Gen-n-Val: Agentic Image Data Generation and Validation
Authors: Jing-En Huang, I-Sheng Fang, Tzuhsuan Huang, Yu-Lun Liu, Chih-Yu Wang, Jun-Cheng Chen
Venue: CVPR 2026
First: 2025-06-05T06:52:26+00:00 · Latest: 2026-04-10T13:11:50+00:00
Comments: Accepted to the CVPR 2026 Findings track
Abstract
Data scarcity, label noise, and long-tailed category imbalance remain important and unresolved challenges in many computer vision tasks, such as object detection and instance segmentation, especially on large-vocabulary benchmarks like LVIS, where most categories appear in only a few images. Current synthetic data generation methods still suffer from multiple objects per mask, inaccurate segmentation, incorrect category labels, and other issues, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), a Large Language Model (LLM), and a Vision Large Language Model (VLLM) to produce high-quality and diverse instance masks and images for object detection and instance segmentation. Gen-n-Val consists of two agents: (1) the LD prompt agent, an LLM, optimizes prompts to encourage LD to generate high-quality foreground single-object images and corresponding segmentation masks; and (2) the data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are optimized by TextGrad. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 7.6% on rare classes in LVIS instance segmentation with Mask R-CNN, and by 3.6% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val scales with model capacity and dataset size. The code is available at https://github.com/aiiu-lab/Gen-n-Val.
中文标题/摘要
标题:Gen-n-Val:代理图像数据生成与验证
在许多计算机视觉任务中,如目标检测和实例分割,数据稀缺、标签噪声和长尾类别不平衡仍然是重要的未解决挑战,尤其是在像LVIS这样的大词汇量基准中,大多数类别只出现在少数图像中。当前的合成数据生成方法仍然存在多个对象的掩码、不准确的分割、错误的类别标签等问题,限制了其有效性。为了解决这些问题,我们引入了Gen-n-Val,这是一种新颖的代理数据生成框架,利用层扩散(LD)、大型语言模型(LLM)和视觉大型语言模型(VLLM)来生成高质量和多样化的实例掩码和图像,用于目标检测和实例分割。Gen-n-Val 包含两个代理:(1)LD提示代理,一个LLM,优化提示以鼓励LD生成高质量的单对象前景图像及其相应的分割掩码;(2)数据验证代理,一个VLLM,过滤掉低质量的合成实例图像。两个代理的系统提示由TextGrad优化。与最先进的合成数据方法如MosaicFusion相比,我们的方法将无效的合成数据从50%减少到7%,在使用Mask R-CNN的LVIS实例分割中提高了7.6%的性能,在使用YOLOv9c和YOLO11m的COCO实例分割中提高了3.6%的mAP。此外,Gen-n-Val 在使用YOLO11m的开放词汇目标检测基准中比YOLO-Worldv2-M提高了7.1%的mAP。此外,Gen-n-Val 在模型容量和数据集大小方面具有可扩展性。代码可在https://github.com/aiiu-lab/Gen-n-Val/ 获取。
Summary / 总结
Gen-n-Val is a novel agentic data generation framework that uses Layer Diffusion, a Large Language Model, and a Vision Large Language Model to generate high-quality and diverse instance masks and images for object detection and instance segmentation. It reduces invalid synthetic data from 50% to 7% and improves performance by 7.6% on rare classes in LVIS instance segmentation with Mask R-CNN and by 3.6% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Additionally, it shows significant improvements in open-vocabulary object detection benchmarks with YOLO11m and has scalability in model capacity and dataset size.
Gen-n-Val 是一种新型的代理数据生成框架,利用 Layer Diffusion、大型语言模型和视觉大型语言模型生成高质量和多样化的实例掩码和图像,用于目标检测和实例分割。它将无效的合成数据从 50% 降低到 7%,并在使用 Mask R-CNN 的 LVIS 实例分割中提高了稀有类别的性能 7.6%,在使用 YOLOv9c 和 YOLO11m 的 COCO 实例分割中提高了稀有类别的 3.6% mAP。此外,它在开放词汇量目标检测基准测试中使用 YOLO11m 显示了显著改进,并且具有模型容量和数据集大小的可扩展性。
BEDTime: A Unified Benchmark for Automatically Describing Time Series
Authors: Medhasweta Sen, Zachary Gottesman, Jiaxing Qiu, C. Bayan Bruss, Nam Nguyen, Tom Hartvigsen
First: 2025-09-05T16:18:20+00:00 · Latest: 2026-04-10T12:15:35+00:00
Abstract
Recent works propose complex multi-modal models that handle both time series and language, ultimately claiming high performance on complex tasks like time series reasoning and cross-modal question answering. However, they skip foundational evaluations that such complex models should have mastered. So we ask a simple question: \textit{How well can recent models describe structural properties of time series?} To answer this, we propose that successful models should be able to \textit{recognize}, \textit{differentiate}, and \textit{generate} descriptions of univariate time series. We then create \textbf{BEDTime}, a benchmark to assess these novel tasks, that comprises \textbf{five datasets} reformatted across \textbf{three modalities}. In evaluating \textbf{17 state-of-the-art models}, we find that (1) surprisingly, dedicated time series-language models fall short, despite being designed for similar tasks, (2) vision language models are quite capable, (3) language only methods perform worst, despite many lauding their potential, and (4) all approaches are clearly fragile to a range of real world robustness tests, indicating directions for future work. Together, our findings critique prior works' claims and provide avenues for advancing multi-modal time series modeling.
Summary / 总结
This paper addresses the need for foundational evaluations of complex multi-modal models in time series analysis by proposing a new benchmark called BEDTime. The benchmark assesses models' abilities to recognize, differentiate, and generate descriptions of univariate time series across five datasets in three modalities. Evaluating 17 state-of-the-art models, the study reveals that dedicated time series-language models perform poorly, while vision-language models show promise, and language-only methods are the weakest. The findings suggest that current models are fragile and highlight areas for future research.
该论文通过提出一个新的基准BEDTime,旨在对复杂多模态模型在时间序列分析中的基础能力进行评估。该基准测试模型在五个数据集上对单变量时间序列进行识别、区分和生成描述的能力,涵盖三种模态。评估17个最先进的模型后,研究发现专门的时间序列-语言模型表现不佳,而视觉-语言模型表现出色,纯语言方法最弱。研究结果表明当前模型存在脆弱性,并指出了未来研究的方向。
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Authors: Yuqin Lan, Gen Li, Yuanze Hu, Weihao Shen, Zhaoxin Fan, Faguo Wu, Xiao Zhang, Laurence T. Yang, Zhiming Zheng
First: 2026-04-10T12:09:06+00:00 · Latest: 2026-04-10T12:09:06+00:00
Comments: 14 pages, 9 figures
Abstract
Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
Summary / 总结
Mosaic is a framework designed to test the vulnerability of closed-source Vision-Language Models (VLMs) to multimodal jailbreak attacks. It addresses the gap between homogeneous and heterogeneous surrogate-target settings by using a multi-view ensemble optimization approach. Mosaic includes a Text-Side Transformation module, a Multi-View Image Optimization module, and a Surrogate Ensemble Guidance module. Experimental results show that Mosaic outperforms existing methods in terms of Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
Mosaic 是一个针对闭源 VLM 的多模态越狱框架,旨在缓解在异质代理-目标设置下观察到的代理依赖性问题。它采用多视图集成优化方法,包含文本侧变换模块、多视图图像优化模块和代理集成指导模块。实验表明,Mosaic 在商业闭源 VLM 上的攻击成功率和平均毒性两项指标均优于现有方法。
Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models
Authors: Wongi Jeong, Hoigi Seo, Se Young Chun
First: 2026-04-10T11:36:55+00:00 · Latest: 2026-04-10T11:36:55+00:00
Abstract
Image generative models have become indispensable tools to yield exquisite high-resolution (HR) images for everyone, ranging from general users to professional designers. However, a desired outcome often requires generating a large number of HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate computational burden, but it is not straightforward how to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity of their HR counterparts for an efficient workflow, allowing users to identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure the LR-HR perceptual consistency for flow matching models, leading to the proposed training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method can generate LR images with up to 33\% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3$\times$ speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.
中文标题/摘要
标题:无需训练、感知一致的低分辨率预览生成,用于高效扩散模型工作流
图像生成模型已成为不可或缺的工具,用于为普通用户和专业设计师生成精美的高分辨率(HR)图像。然而,一个期望的结果通常需要生成大量具有不同提示和种子的HR图像,这将导致用户和服务提供商都面临高昂的计算成本。首先生成低分辨率(LR)图像可以减轻计算负担,但如何生成与HR图像感知上一致的LR图像并不直接。在这里,我们考虑生成高保真LR图像,称为预览,这些图像保留了其HR图像的感知相似性,以实现高效的工作流,使用户能够在生成最终HR图像之前识别出有潜力的候选者。我们提出了交换子零条件以确保流匹配模型中的LR-HR感知一致性,从而提出了无需训练的解决方案,包括下采样矩阵选择和交换子零指导。大量实验表明,我们的方法可以在保持HR感知一致性的同时,计算量减少高达33%。结合现有的加速技术,我们的方法可以实现高达3倍的速度提升。此外,我们的公式可以扩展到图像操作,如扭曲和平移,展示了其普适性。
Summary / 总结
The research aims to reduce the computational cost of generating high-resolution images by first creating low-resolution previews that are perceptually consistent with their high-resolution counterparts. The method uses a commutator-zero condition to ensure perceptual consistency and a training-free approach with downsampling matrix selection and commutator-zero guidance. Experiments show that the proposed method can reduce computational cost by up to 33% while maintaining perceptual consistency, and can achieve up to 3 times speedup when combined with existing acceleration techniques. Additionally, the method can be applied to image manipulations like warping and translation, showcasing its versatility.
研究旨在通过首先生成与高分辨率图像在感知上一致的低分辨率预览来降低生成高分辨率图像的计算成本。方法使用了交换子零条件来确保感知一致性,并采用无训练的方案结合下采样矩阵选择和交换子零指导。实验表明,该方法可以将计算成本最多减少33%,并且与现有的加速技术结合使用时可以实现最多3倍的加速。此外,该方法还可以应用于图像变形等操作,如扭曲和平移,展示了其通用性。
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
Authors: Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye
Venue: NeurIPS 2025 Spotlight
First: 2025-05-24T08:50:08+00:00 · Latest: 2026-04-10T11:08:57+00:00
Comments: NeurIPS 2025 (Spotlight)
Abstract
Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/.
中文标题/摘要
标题:链式缩放:通过尺度自回归和偏好对齐实现极端超分辨率
现代单张图像超分辨率(SISR)模型在训练的尺度因子上可以生成逼真的结果,但在要求其放大远超出该范围时会失效。我们通过链式缩放(CoZ),一种模型无关的框架,将SISR分解为多尺度意识的中间尺度状态的自回归链,解决了这种可扩展性瓶颈。CoZ 重复使用一个基础SR模型,将条件概率分解为可处理的子问题,从而实现极端分辨率而无需额外训练。由于在高放大倍数下视觉线索减弱,我们为每个缩放步骤添加了由视觉语言模型(VLM)生成的多尺度意识文本提示。提示提取器本身使用广义奖励策略优化(GRPO)并使用批评VLM进行微调,使文本指导与人类偏好对齐。实验表明,一个标准的4倍扩散SR模型嵌入CoZ可以实现超过256倍的放大,具有高感知质量和保真度。项目页面:https://bryanswkim.github.io/chain-of-zoom/
Summary / 总结
The research addresses the scalability issue in single-image super-resolution (SISR) models by proposing Chain-of-Zoom (CoZ), a model-agnostic framework that decomposes the SISR task into an autoregressive chain of intermediate scale-states. Each step uses multi-scale-aware text prompts generated by a vision-language model to guide the SR process, and the prompts are fine-tuned using GRPO to align with human preference. Experiments demonstrate that a standard 4x diffusion SR model wrapped in CoZ can achieve over 256x enlargement with high perceptual quality and fidelity.
论文提出了一种名为Chain-of-Zoom (CoZ)的模型通用框架,通过将SISR任务分解为一系列中间尺度状态的自回归链来解决SISR模型的可扩展性问题。每一步都使用由视觉语言模型生成的多尺度感知文本提示来引导SR过程,并通过通用奖励策略优化(GRPO)微调提示提取器以与人类偏好对齐。实验表明,一个标准的4倍扩散SR模型嵌入CoZ后可以实现超过256倍的放大,同时保持高质量和高保真度。
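The autoregressive zoom chain in this entry can be sketched as a simple loop. This is a minimal illustration under stated assumptions: `sr_step` and `make_prompt` are hypothetical callables standing in for the backbone SR model and the VLM prompt extractor, and conditioning the prompt on the whole history of scale-states is one way to read "multi-scale-aware", not a confirmed detail of the method.

```python
from typing import Any, Callable, List, Tuple

def chain_of_zoom(image: Any,
                  sr_step: Callable[[Any, str], Any],
                  make_prompt: Callable[[List[Any]], str],
                  step_scale: int = 4,
                  target_scale: int = 256) -> Tuple[Any, int, int]:
    """Reuse one fixed-scale SR model repeatedly until the target scale is reached.

    Returns the final image, the total magnification, and the number of steps.
    """
    history = [image]          # intermediate scale-states seen so far
    total_scale = 1
    while total_scale < target_scale:
        prompt = make_prompt(history)    # text guidance from earlier scale-states
        image = sr_step(image, prompt)   # one pass of the backbone SR model
        history.append(image)
        total_scale *= step_scale
    return image, total_scale, len(history) - 1

# Toy stand-ins: an "image" is just its side length in pixels.
final, scale, steps = chain_of_zoom(
    1,
    sr_step=lambda img, prompt: img * 4,
    make_prompt=lambda hist: f"zoom level {len(hist)}",
)
```

With a 4x backbone, four chained steps give 4^4 = 256x, which matches the enlargement figure quoted in the abstract.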
Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search
Authors: Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok
Venue: ACL 2026
First: 2025-09-30T15:55:24+00:00 · Latest: 2026-04-10T11:03:22+00:00
Comments: ACL 2026
Abstract
Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
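To make the tree-search framing concrete, the sketch below shows only the selection step of a generic MCTS, using the standard UCT rule. It is not PACO's customized search: the child records, the attribute names, and the exploration constant `c` are all illustrative assumptions. Each child stands for one candidate single-attribute adjustment of the current summary.

```python
import math
from typing import Dict, List

def uct_select(children: List[Dict], c: float = 1.4) -> Dict:
    """Pick the child action to explore next with the standard UCT rule.

    Each child is a dict with 'action', 'visits', and 'value' (sum of rewards).
    """
    parent_visits = sum(ch["visits"] for ch in children)

    def score(ch: Dict) -> float:
        if ch["visits"] == 0:
            return float("inf")  # always try untested adjustments first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(parent_visits) / ch["visits"])
        return exploit + explore

    return max(children, key=score)

# Hypothetical attribute adjustments available from the current summary node.
children = [
    {"action": "length", "visits": 10, "value": 6.0},
    {"action": "topic", "visits": 2, "value": 1.5},
    {"action": "specificity", "visits": 0, "value": 0.0},
]
first_pick = uct_select(children)
second_pick = uct_select(children[:2])
```

An unvisited adjustment is chosen first; among visited ones, the rarely tried "topic" adjustment wins here because its exploration bonus outweighs the better-sampled "length" branch.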
MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
Authors: Henry Zheng, Chenyue Fang, Rui Huang, Siyuan Wei, Xiao Liu, Gao Huang
First: 2026-04-10T09:51:42+00:00 · Latest: 2026-04-10T09:51:42+00:00
Abstract
Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.
Summary / 总结
MAG-3D is a training-free multi-agent framework for grounded 3D reasoning that leverages off-the-shelf vision-language models. It dynamically coordinates expert agents to address the challenges of 3D reasoning, including task decomposition, 3D grounding, and geometric reasoning. MAG-3D achieves state-of-the-art performance on challenging benchmarks without requiring task-specific training or fixed reasoning procedures.
MAG-3D 是一个无需训练的多智能体框架,使用现成的视觉-语言模型进行基于3D的推理。它通过动态协调三个智能体——规划智能体、定位智能体和编码智能体来解决3D推理的挑战。定位智能体执行自由形式的3D定位并检索相关帧,而编码智能体进行灵活的几何推理和明确验证。MAG-3D 在具有挑战性的基准测试中达到了最先进的性能,而无需进行特定任务的训练或固定推理流程。
CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation
Authors: Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, Lequan Yu
First: 2026-04-10T09:41:21+00:00 · Latest: 2026-04-10T09:41:21+00:00
Abstract
Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards, which rely on prompt engineering, brittle heuristics, and VLM-as-critic, lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget and route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) to minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety--helpfulness--interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at cora-agent.github.io.
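The execute/abstain calibration can be sketched with the generic conformal risk control recipe. This is a sketch under stated assumptions: the loss is taken to be the indicator "harmful and executed" (bounded by 1), the calibration scores and labels are made up, and the function names are hypothetical; CORA's actual loss and Guardian scores may differ. The threshold is the largest `tau` for which (harmful executed count + loss bound) / (n + 1) stays within the risk budget `alpha`.

```python
from typing import List

def calibrate_threshold(scores: List[float],
                        harmful: List[int],
                        alpha: float,
                        loss_bound: float = 1.0) -> float:
    """Largest risk-score threshold whose conformal bound stays within alpha.

    scores[i] is the estimated risk of calibration step i; harmful[i] is 1 if
    executing that step would cause harm. Actions with score <= tau execute.
    """
    n = len(scores)
    best = float("-inf")  # -inf means "abstain on everything"
    for tau in sorted(set(scores)):
        # Harmful steps that would slip through at this threshold.
        harmful_executed = sum(1 for s, h in zip(scores, harmful) if h and s <= tau)
        # Finite-sample conformal risk control bound.
        if (harmful_executed + loss_bound) / (n + 1) <= alpha:
            best = tau
        else:
            break  # the loss only grows with tau, so later thresholds fail too
    return best

def should_execute(score: float, tau: float) -> bool:
    """Execute when the estimated risk is at or below the calibrated boundary."""
    return score <= tau

# Made-up calibration set of nine steps.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_harmful = [0, 0, 0, 0, 0, 1, 0, 1, 1]
tau = calibrate_threshold(cal_scores, cal_harmful, alpha=0.25)
```

A rejected action (score above `tau`) would then be handed to a separate reviewer step, which in the abstract's design is the Diagnostician. With too small a budget for the calibration set size, no threshold qualifies and the sketch abstains on every action.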
FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval
Authors: François Gardères, Camille-Sovanneary Gauthier, Jean Ponce, Shizhe Chen
First: 2026-04-10T08:50:47+00:00 · Latest: 2026-04-10T08:50:47+00:00
Abstract
Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion. In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images. To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications. Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.
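The re-ranking step can be pictured with a small sketch. Everything here is hypothetical: `rerank`, the `answer_fn` callback standing in for a visual question answering model, and the toy questions and scores are illustrative assumptions, not FIRE-CIR's actual interface. Candidates that fail any attribute check derived from the modification text are filtered out, and the survivors keep their embedding-similarity order.

```python
from typing import Callable, List, Tuple

def rerank(
    candidates: List[Tuple[str, float]],   # (image_id, embedding similarity)
    checks: List[Tuple[str, str]],         # (question, expected answer)
    answer_fn: Callable[[str, str], str],  # hypothetical VQA model: (image_id, question) -> answer
) -> List[str]:
    """Drop candidates that contradict any check, then sort by similarity."""
    kept = [
        (image_id, sim)
        for image_id, sim in candidates
        if all(answer_fn(image_id, q) == expected for q, expected in checks)
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [image_id for image_id, _ in kept]

# Toy stand-in for a VQA model: a lookup table of answers per image.
toy_answers = {
    ("a", "What is the sleeve length?"): "long",
    ("b", "What is the sleeve length?"): "short",
    ("c", "What is the sleeve length?"): "long",
}
ranked = rerank(
    candidates=[("a", 0.7), ("b", 0.9), ("c", 0.8)],
    checks=[("What is the sleeve length?", "long")],
    answer_fn=lambda image_id, question: toy_answers[(image_id, question)],
)
print(ranked)
```

In this toy run the candidate with the highest embedding score is removed because it contradicts the requested modification, which is the failure mode the abstract attributes to similarity-only retrieval.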
LADR: Locality-Aware Dynamic Rescue for Efficient Text-to-Image Generation with Diffusion Large Language Models
Authors: Chenglin Wang, Yucheng Zhou, Shawn Chen, Tao Wang, Kai Zhang
First: 2026-03-13T15:12:41+00:00 · Latest: 2026-04-10T08:47:56+00:00
Comments: ACL 2026 Main Conference
Abstract
Discrete Diffusion Language Models have emerged as a compelling paradigm for unified multimodal generation, yet their deployment is hindered by high inference latency arising from iterative decoding. Existing acceleration strategies often require expensive re-training or fail to leverage the 2D spatial redundancy inherent in visual data. To address this, we propose Locality-Aware Dynamic Rescue (LADR), a training-free method that expedites inference by exploiting the spatial Markov property of images. LADR prioritizes the recovery of tokens at the "generation frontier", regions spatially adjacent to observed pixels, thereby maximizing information gain. Specifically, our method integrates morphological neighbor identification to locate candidate tokens, employs a risk-bounded filtering mechanism to prevent error propagation, and utilizes manifold-consistent inverse scheduling to align the diffusion trajectory with the accelerated mask density. Extensive experiments on four text-to-image generation benchmarks demonstrate that LADR achieves an approximately 4x speedup over standard baselines. Remarkably, it maintains or even enhances generative fidelity, particularly in spatial reasoning tasks, offering a state-of-the-art trade-off between efficiency and quality.
Summary
LADR is a training-free method that accelerates text-to-image generation by exploiting the spatial Markov property of images. It prioritizes the recovery of tokens at the 'generation frontier' and integrates morphological neighbor identification, risk-bounded filtering, and manifold-consistent inverse scheduling to enhance efficiency. Experiments show LADR achieves about a 4x speedup over standard baselines while maintaining or improving generative fidelity, especially in spatial reasoning tasks.
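The "generation frontier" idea can be sketched with a plain binary dilation over the token grid. This is an illustrative assumption, not LADR's code: the function name `generation_frontier`, the 4-connected neighborhood, and the boolean-mask representation of decoded tokens are all choices made here. The frontier is taken to be every still-masked position that touches at least one already-decoded position.

```python
import numpy as np

def generation_frontier(observed):
    """Return masked grid positions that are 4-adjacent to an observed token.

    `observed` is a 2D boolean array: True where a token is already decoded.
    """
    observed = np.asarray(observed, dtype=bool)
    padded = np.pad(observed, 1, constant_values=False)
    # Binary dilation with a plus-shaped (4-connected) structuring element.
    dilated = (
        padded[1:-1, 1:-1]   # the position itself
        | padded[:-2, 1:-1]  # neighbor above
        | padded[2:, 1:-1]   # neighbor below
        | padded[1:-1, :-2]  # neighbor to the left
        | padded[1:-1, 2:]   # neighbor to the right
    )
    # Keep only positions that are not decoded yet.
    return dilated & ~observed

# Toy 3x3 grid with a single decoded token in the center.
grid = np.zeros((3, 3), dtype=bool)
grid[1, 1] = True
frontier = generation_frontier(grid)
print(frontier.astype(int))
```

Candidate tokens selected this way would still need the filtering and scheduling steps the abstract mentions; the sketch covers only the neighbor identification.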
Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine
Authors: Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang
First: 2026-03-02T10:32:44+00:00 · Latest: 2026-04-10T08:38:35+00:00
Abstract
Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT-DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at https://github.com/TianYin123/Better_Eyes_Better_Thoughts.
Summary
This study investigates the effectiveness of chain-of-thought (CoT) prompting in medical vision-language tasks, finding that CoT often underperforms direct answering (DirA) due to a 'medical perception bottleneck.' The authors introduce two interventions: perception anchoring and description grounding, which improve accuracy and mitigate CoT degradation across various benchmarks and model families. These findings highlight the need for robust visual grounding and cross-modal alignment in clinical VLMs.
CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
Authors: Akshit Jindal, Saket Anand, Chetan Arora, Vikram Goyal
Venue: CVPR
First: 2026-04-10T08:33:56+00:00 · Latest: 2026-04-10T08:33:56+00:00
Comments: 17 pages (8 main + 2 references + 7 supplementary), Accepted to CVPR Findings 2026
Abstract
Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. Other data-level methods that sanitize data before training or during inference also fail to answer the critical question, "Is the delivered model backdoored or not?" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine whether the model exhibits backdoor behaviour. Additionally, we demonstrate that using CI's reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.
MONETA: Multimodal Industry Classification through Geographic Information with Multi Agent Systems
Authors: Arda Yüksel, Gabriel Thiem, Susanne Walter, Patrick Felka, Gabriela Alves Werb, Ivan Habernal
Venue: ACL 2026
First: 2026-04-09T08:21:39+00:00 · Latest: 2026-04-10T07:36:37+00:00
Comments: Accepted to ACL 2026 Main Conference
Abstract
Industry classification schemes are integral parts of public and corporate databases, as they classify businesses based on economic activity. Due to the size of company registers, manual annotation is costly, and fine-tuning models with every update to an industry classification scheme requires significant data collection. We replicate manual expert verification by using existing or easily retrievable multimodal resources for industry classification. We present MONETA, the first multimodal industry classification benchmark with text (Website, Wikipedia, Wikidata) and geospatial sources (OpenStreetMap and satellite imagery). Our dataset comprises 1,000 businesses in Europe with 20 economic activity labels according to EU guidelines (NACE). Our training-free baseline reaches 62.10% and 74.10% with open- and closed-source Multimodal Large Language Models (MLLMs), respectively. We observe an increase of up to 22.80% with the combination of multi-turn design, context enrichment, and classification explanations. We will release our dataset and the enhanced guidelines.