arXiv Paper Digest

Snapshot: 20260421_0418

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Authors: Xuanyi Liu, Chunan Yu, Deyi Ji, Qi Zhu, Lingyun Sun, Xuanfu Li, Jin Ma, Tianrun Chen, Lanyun Zhu

First: 2026-04-16T17:12:10+00:00 · Latest: 2026-04-17T17:56:45+00:00

Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
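The scoring and triage steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the median as the order statistic, the running-mean merge rule, and the names `cross_layer_score`, `compress_cache`, `n_keep`, and `n_merge` are all assumptions made for illustration.

```python
from statistics import median


def cross_layer_score(per_layer_scores):
    """Aggregate one token's per-layer importance with an order statistic.

    The median is an assumption; it damps a spike that appears in a single layer.
    """
    return median(per_layer_scores)


def sq_dist(a, b):
    """Squared Euclidean distance between two key vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def compress_cache(keys, values, layer_scores, n_keep, n_merge):
    """Three-tier triage: keep top tokens, merge the middle tier, evict the rest."""
    scores = [cross_layer_score(s) for s in layer_scores]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    anchors = order[:n_keep]                     # tier 1: retained as-is
    mergers = order[n_keep:n_keep + n_merge]     # tier 2: merged into anchors
    # tier 3 (everything after the middle tier) is simply dropped

    merged_k = {a: list(keys[a]) for a in anchors}
    merged_v = {a: list(values[a]) for a in anchors}
    counts = {a: 1 for a in anchors}
    for m in mergers:
        # Nearest-neighbor assignment in key space (hypothetical merge rule).
        nearest = min(anchors, key=lambda a: sq_dist(keys[a], keys[m]))
        c = counts[nearest]
        merged_k[nearest] = [(k * c + x) / (c + 1) for k, x in zip(merged_k[nearest], keys[m])]
        merged_v[nearest] = [(v * c + x) / (c + 1) for v, x in zip(merged_v[nearest], values[m])]
        counts[nearest] = c + 1
    return [merged_k[a] for a in anchors], [merged_v[a] for a in anchors]
```

In this sketch the cache size after compression is always `n_keep`, which is what keeps the memory cost constant regardless of stream length.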

Summary

StreamCacheVGGT is a training-free framework for reconstructing dense 3D geometry from video streams under a constant memory budget. It addresses the limitations of existing $O(1)$ eviction-only frameworks with two modules: Cross-Layer Consistency-Enhanced Scoring (CLCES), which stabilizes token scoring by tracking importance across the Transformer layers, and Hybrid Cache Compression (HCC), which uses a tiered strategy to merge moderately important tokens into retained anchors instead of deleting them. Experiments on five benchmarks show that StreamCacheVGGT outperforms existing methods in reconstruction accuracy and long-term stability while keeping memory usage constant.

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Authors: Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jared Yang, Jie Yang, Zihan Wang, Qing Yin, Zhengzhong Tu

First: 2026-04-17T17:28:24+00:00 · Latest: 2026-04-17T17:28:24+00:00

Abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.
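To make "per-dimension quality scores via ordinal regression" concrete, here is a minimal decoding sketch. The abstract does not describe the model head, so the cumulative-threshold formulation (one logit per "score exceeds level k" question), the score range starting at 1, and the name `decode_ordinal` are assumptions for illustration only.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_ordinal(threshold_logits, min_score=1):
    """Turn K-1 threshold logits into a score on a K-level ordinal scale.

    threshold_logits[k] is read as the logit of P(score > min_score + k).
    Returns (expected_score, hard_score).
    """
    probs = [sigmoid(z) for z in threshold_logits]
    expected = min_score + sum(probs)                 # soft, real-valued score
    hard = min_score + sum(p > 0.5 for p in probs)    # count of thresholds passed
    return expected, hard
```

Under this assumed formulation, the same decoder would be applied independently to each of the three labeled dimensions (Instruction Following, Rendering Quality, Edit Exclusivity).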

Summary

The paper introduces VEFX-Bench, a benchmark for video editing and visual effects that addresses the lack of large-scale human-annotated datasets and standardized evaluators. It presents VEFX-Dataset, which contains 5,049 human-annotated video editing examples, and VEFX-Reward, a reward model built specifically for editing quality assessment. Experiments show that VEFX-Reward aligns more closely with human judgments than generic vision-language model judges and prior reward models, on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as the evaluator, the study benchmarks commercial and open-source video editing systems and finds a persistent gap between visual plausibility, instruction following, and edit locality.

Information Router for Mitigating Modality Dominance in Vision-Language Models

Authors: Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

First: 2026-04-17T17:20:42+00:00 · Latest: 2026-04-17T17:20:42+00:00

Abstract

Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently yields a more balanced modality contribution and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.

Summary

This paper addresses modality dominance in vision-language models (VLMs) by proposing MoIR (Multi-modal Information Router), which explicitly reduces information disparity before fusion. MoIR identifies less informative tokens and routes complementary information to them from a stronger modality, producing more information-dense token representations. The method improves robustness and downstream performance across multiple benchmarks and model backbones, especially under modality degradation.

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Authors: Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen

First: 2026-04-17T17:15:18+00:00 · Latest: 2026-04-17T17:15:18+00:00

Abstract

Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. Specifically, we construct each problem in text-only, image-only, and image+text formats guaranteeing identical task-relevant information, verified by human annotators. This rigorous alignment effectively isolates modality-specific reasoning differences while eliminating confounding factors such as information mismatch. Extensive evaluation of state-of-the-art VLMs reveals a consistent phenomenon: a substantial performance gap between textual and visual reasoning. Notably, VLMs excel with text-only inputs, whereas incorporating visual data (image+text) frequently degrades performance compared to the text-only baseline. These findings indicate that current VLMs conduct reasoning primarily in the textual space, with limited genuine reliance on visual evidence. To mitigate this limitation, we curate a CrossMath training set for VLM fine-tuning. Empirical evaluations demonstrate that fine-tuning on this training set significantly boosts reasoning performance across all individual and joint modalities, while yielding robust gains on two general visual reasoning tasks. Source code is available at https://github.com/xuyige/CrossMath.

Summary

This study investigates whether vision-language models (VLMs) perform genuine vision-grounded reasoning or rely mainly on the reasoning capabilities of their textual backbones. It introduces CrossMath, a benchmark that presents each problem in text-only, image-only, and image+text formats with identical task-relevant information. Evaluation of state-of-the-art VLMs shows a substantial performance gap between textual and visual reasoning: models perform best with text-only inputs, and adding visual data often degrades performance relative to the text-only baseline. This suggests that current VLMs reason primarily in the textual space. Fine-tuning on a curated CrossMath training set significantly improves reasoning performance across all individual and joint modalities and yields robust gains on general visual reasoning tasks.

Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization

Authors: Siddhant Bharadwaj, Ashish Vashist, Fahimul Aleem, Shruti Vyas

Venue: CVPR EarthVision 2026 Workshop

First: 2026-04-17T17:09:14+00:00 · Latest: 2026-04-17T17:09:14+00:00

Comments: Accepted to the CVPR EarthVision 2026 Workshop

Abstract

Image geolocalization has traditionally been addressed through retrieval-based place recognition or geometry-based visual localization pipelines. Recent advances in Vision-Language Models (VLMs) have demonstrated strong zero-shot reasoning capabilities across multimodal tasks, yet their performance in geographic inference remains underexplored. In this work, we present a systematic evaluation of multiple state-of-the-art VLMs for country-level image geolocalization using ground-view imagery only. Instead of relying on image matching, GPS metadata, or task-specific training, we evaluate prompt-based country prediction in a zero-shot setting. The selected models are tested on three geographically diverse datasets to assess their robustness and generalization ability. Our results reveal substantial variation across models, highlighting the potential of semantic reasoning for coarse geolocalization and the limitations of current VLMs in capturing fine-grained geographic cues. This study provides the first focused comparison of modern VLMs for country-level geolocalization and establishes a foundation for future research at the intersection of multimodal reasoning and geographic understanding.

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

Authors: Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Venue: ACL 2026

First: 2026-04-17T15:42:03+00:00 · Latest: 2026-04-17T15:42:03+00:00

Comments: Accepted by ACL 2026 main conference

Abstract

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

Summary

The research aims to improve the reliability of automated 3D radiology report generation by addressing clinical hallucinations and the lack of iterative verification. It proposes MARCH, a multi-agent framework that emulates the clinical hierarchy of radiology departments, with specialized roles for a Resident Agent, multiple Fellow Agents, and an Attending Agent. On the RadGenome-ChestCT dataset, MARCH outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy, suggesting that modeling human-like organizational structures enhances the reliability of AI in medical domains.

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

Authors: William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald

Venue: ACL 2026

First: 2026-01-08T18:23:03+00:00 · Latest: 2026-04-17T15:14:39+00:00

Comments: ACL 2026 Main

Abstract

Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.

Summary

This study investigates the mechanisms behind prompt-induced hallucinations (PIH) in vision-language models (VLMs) using a controlled object-counting setting in which the prompt overstates the number of objects in the image. Through mechanistic analysis of three VLMs, it identifies a small set of attention heads whose ablation reduces PIH by at least 40% without additional training. The findings offer insight into the internal mechanisms driving prompt-induced hallucinations and reveal model-specific differences in how these heads mediate prompt copying.

DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

Authors: Laziz Hamdi, Amine Tamasna, Thierry Paquet

First: 2026-04-17T14:33:51+00:00 · Latest: 2026-04-17T14:33:51+00:00

Abstract

Tables condense key transactional and administrative information into compact layouts, but practical extraction requires more than text recognition: systems must also recover structure (rows, columns, merged cells, headers) and interpret roles such as line items, subtotals, and totals under common capture artifacts. Many existing resources for table structure recognition and TableVQA are built from clean digital-born sources or rendered tables, and therefore only partially reflect noisy administrative conditions. We introduce DenTab, a dataset of 2,000 cropped table images from dental estimates with high-quality HTML annotations, enabling evaluation of table recognition (TR) and table visual question answering (TableVQA) on the same inputs. DenTab includes 2,208 questions across eleven categories spanning retrieval, aggregation, and logic/consistency checks. We benchmark 16 systems, including 14 vision-language models (VLMs) and two OCR baselines. Across models, strong structure recovery does not consistently translate into reliable performance on multi-step arithmetic and consistency questions, and these reasoning failures persist even when using ground-truth HTML table inputs. To improve arithmetic reliability without training, we propose the Table Router Pipeline, which routes arithmetic questions to deterministic execution. The pipeline combines (i) a VLM that produces a baseline answer, a structured table representation, and a constrained table program with (ii) a rule-based executor that performs exact computation over the parsed table. The source code and dataset will be made publicly available at https://github.com/hamdilaziz/DenTab.
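The routing idea in the last sentences of the abstract can be sketched as follows. This is an illustrative outline under stated assumptions, not the released pipeline: the program format (`op`, `column`, `where`), the `vlm_output` dictionary keys, and the function names `execute` and `route` are all hypothetical.

```python
from decimal import Decimal

# Operations the rule-based executor supports (a hypothetical, minimal set).
OPS = {
    "sum": lambda xs: sum(xs, Decimal("0")),
    "max": max,
    "min": min,
    "count": lambda xs: Decimal(len(xs)),
}


def execute(program, rows):
    """Run a constrained table program exactly over parsed table rows."""
    conditions = program.get("where", {})
    cells = [
        Decimal(row[program["column"]])
        for row in rows
        if all(row.get(k) == v for k, v in conditions.items())
    ]
    return OPS[program["op"]](cells)


def route(vlm_output, rows):
    """Prefer deterministic execution; fall back to the VLM's baseline answer."""
    program = vlm_output.get("program")
    if program and program.get("op") in OPS:
        try:
            return str(execute(program, rows))
        except (KeyError, ArithmeticError, ValueError):
            pass  # malformed program or unparsable cell: use the baseline
    return vlm_output["baseline_answer"]
```

Using `Decimal` rather than floats keeps currency arithmetic exact, which matters for totals and subtotals on estimates.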

Summary

The research addresses table recognition and table visual question answering (TableVQA) on real-world dental estimates, which are often noisy and require more than text recognition. It introduces DenTab, a dataset of 2,000 cropped table images from dental estimates with HTML annotations, to evaluate both tasks. A benchmark of 16 systems, including vision-language models and OCR baselines, shows that strong structure recovery does not always lead to reliable performance on multi-step arithmetic and consistency questions, even when ground-truth table inputs are provided. To improve arithmetic reliability, the authors propose the Table Router Pipeline, which combines a VLM with a rule-based executor that performs exact computation over the parsed table.

ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation

Authors: Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani

First: 2025-08-14T13:33:44+00:00 · Latest: 2026-04-17T14:19:45+00:00

Comments: 11 pages, 5 figures, 7 tables

Abstract

Understanding environmental changes from remote sensing imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERTF1 0.902) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.

Summary

ChatENV is an interactive vision-language model that jointly reasons over satellite image pairs and real-world sensor data for environmental monitoring and scenario simulation. It builds a large dataset of temporal image pairs with rich sensor metadata, annotates it with GPT4o and Gemini 2.0 for stylistic and semantic diversity, and fine-tunes Qwen-2.5-VL with LoRA adapters. ChatENV performs strongly in temporal and "what-if" reasoning (for example, a BERTF1 score of 0.902) and rivals or outperforms state-of-the-art temporal models while supporting interactive scenario-based analysis.

AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

Authors: Guransh Singh

First: 2026-04-17T13:49:57+00:00 · Latest: 2026-04-17T13:49:57+00:00

Abstract

Adapting pre-trained vision-language models (VLMs) for robotic control requires injecting high-magnitude continuous gradients from a flow-matching action expert into a backbone trained exclusively with cross-entropy (CE). This cross-modal gradient asymmetry (the spectral dimensionality mismatch between low-rank MSE regression gradients and the high-dimensional semantic manifold sculpted by CE pre-training) causes rapid, severe erosion of the VLM's visual-question-answering (VQA) capability. Industry-standard defences either sever the gradient pathway entirely via stop gradient, discarding the rich continuous supervision, or restrict parameter capacity through low-rank adapters (LoRA) that constrain the rank of updates but not their direction, and thus still overwrite the pre-trained manifold. We introduce AEGIS (Anchor-Enforced Gradient Isolation System), a buffer-free, layer-wise orthogonal gradient projection framework that enables direct continuous MSE learning while preserving the pre-trained VQA manifold, without any co-training data or replay buffer. AEGIS pre-computes a static Gaussian reference anchor from masked VQA forward passes across all transformer layers, then at each training step constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. A sequential dual-backward pass decomposes the task and anchor gradients; for each transformer layer, AEGIS applies a single Gram-Schmidt orthogonal projection that bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average, yet eliminates the cumulative activation drift that drives severe forgetting.
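A single Gram-Schmidt projection step of the kind the abstract mentions can be sketched on plain vectors. The abstract does not state when the projection fires, so the conflict test (project only when the task gradient has a negative dot product with the anchor restoration gradient) is an assumption, as are the names `project_task_gradient` and `eps`.

```python
def dot(u, v):
    """Inner product of two flattened per-layer gradients."""
    return sum(x * y for x, y in zip(u, v))


def project_task_gradient(task_grad, anchor_grad, eps=1e-12):
    """Remove the component of task_grad that opposes anchor_grad.

    Assumed rule: a descent step along task_grad raises the anchor penalty
    (to first order) only when the two gradients have a negative dot product,
    so the projection is applied only in that case.
    """
    overlap = dot(task_grad, anchor_grad)
    norm_sq = dot(anchor_grad, anchor_grad)
    if overlap >= 0 or norm_sq < eps:
        return list(task_grad)  # no conflict: keep the task gradient intact
    coeff = overlap / norm_sq
    # Gram-Schmidt step: subtract the projection onto the anchor direction.
    return [g - coeff * a for g, a in zip(task_grad, anchor_grad)]
```

After the projection the returned gradient is orthogonal to the anchor gradient, so under this sketch a small step leaves the anchor penalty unchanged to first order while the rest of the task signal is kept.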

Summary

AEGIS is a method for fine-tuning vision-language models for robotic control that allows high-magnitude continuous gradients from an action expert to reach the backbone while preserving the pre-trained visual-question-answering capability. It is a buffer-free, layer-wise orthogonal gradient projection framework: it pre-computes a static Gaussian reference anchor and, at each training step, constructs a Wasserstein-2 transport penalty that generates an anchor restoration gradient. The task and anchor gradients are computed separately, and for each transformer layer a single Gram-Schmidt orthogonal projection bends the task gradient away from the destructive direction while preserving its constructive content. The projection sheds less than 1% of gradient energy on average and prevents severe forgetting.

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

Authors: Sihan Lv, Yechen Jin, Zhen Li, Jintao Chen, Jinshan Zhang, Ying Li, Jianwei Yin, Meng Xi

First: 2026-04-17T13:30:59+00:00 · Latest: 2026-04-17T13:30:59+00:00

Abstract

Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap of publicly accessible benchmarks, we introduce LibriSpeech-Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consistency in unedited regions, we propose Word-level Dynamic Time Warping (WDTW). Extensive experiments demonstrate that AST resolves the controllability-quality trade-off without extra training. Compared to the previous most temporally consistent baseline, AST improves consistency while reducing Word Error Rate by nearly 70%. Moreover, applying AST to a foundation TTS model reduces WDTW by 27%, achieving state-of-the-art speaker preservation and temporal fidelity.

Summary

The paper introduces AST, an adaptive, seamless, and training-free framework for precise speech editing. Built on a pre-trained autoregressive TTS model, it uses Latent Recomposition to stitch preserved source segments together with newly synthesized targets, and Adaptive Weak Fact Guidance (AWFG) to prevent artifacts at edit boundaries. Compared with the previous most temporally consistent baseline, AST improves consistency while reducing Word Error Rate by nearly 70%, and it preserves speaker identity and temporal fidelity. The paper also introduces a new speech editing dataset, LibriSpeech-Edit, and a new metric, Word-level Dynamic Time Warping (WDTW), to better evaluate temporal consistency in unedited regions.

EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence

Authors: Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang

First: 2025-09-18T14:07:53+00:00 · Latest: 2026-04-17T13:07:01+00:00

Abstract

Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.

Summary / 总结

EchoVLM is a vision-language model specifically designed for ultrasound medical imaging, addressing the limitations of existing models in multi-organ lesion recognition and multi-task diagnostics. By employing a Mixture of Experts architecture trained on data from seven anatomical regions, EchoVLM excels in tasks such as ultrasound report generation, diagnosis, and visual question-answering. The model showed significant improvements in BLEU-1 scores and ROUGE-1 scores by 10.15 and 4.77 points respectively compared to Qwen2-VL, indicating its potential to enhance diagnostic accuracy in ultrasound imaging for clinical applications.

EchoVLM 是一种专门针对超声医学成像设计的视觉语言模型，解决了现有模型在多器官病变识别和多任务诊断中的局限性。通过采用在七个解剖区域数据上训练的混合专家架构，EchoVLM 在超声报告生成、诊断和视觉问答等任务上表现出色。与 Qwen2-VL 相比，该模型在 BLEU-1 和 ROUGE-1 分数上分别提高了 10.15 和 4.77 分，表明其在提高超声成像诊断准确性方面的潜力，为未来的临床应用提供了可行的技术解决方案。
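
For readers unfamiliar with the BLEU-1 metric cited above, the following is a minimal sketch of how such a score is typically computed for one candidate/reference pair: clipped unigram precision times a brevity penalty. Whitespace tokenization, lowercasing, and the function name `bleu1` are assumptions made for illustration; the abstract does not describe the authors' evaluation pipeline.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1 for a single sentence pair (illustrative, whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate unigram count by its count in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates that are not longer than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return bp * precision


if __name__ == "__main__":
    print(bleu1("the cat sat on the mat", "the cat is on the mat"))
```

Corpus-level BLEU aggregates counts over all sentences before dividing, so it generally differs from an average of per-sentence scores.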

PILOT: A Promptable Interleaved Layout-aware OCR Transformer

Authors: Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet

First: 2025-04-04T17:39:53+00:00 · Latest: 2026-04-17T13:03:40+00:00

Abstract

Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.

中文标题/摘要

标题：PILOT：可提示的交错排版意识OCR变换器

经典OCR流水线将文档阅读分解为检测、分割和识别阶段，这使得它们对定位错误敏感且难以扩展到交互式查询。这项工作探讨了单个紧凑模型是否可以同时在手写和打印文档上执行文本识别和空间定位。我们引入了PILOT，这是一种1.55亿参数的提示条件生成模型，将文档OCR视为统一的序列生成。一个轻量级的深度可分离卷积编码页面，而Transformer解码器自回归地在10像素网格上发出子词和量化绝对坐标标记流，从而实现全页OCR、区域条件阅读和字符串查询定位。通过从简单的转录到联合文本和框生成，再到提示控制提取的三阶段课程，稳定了训练并提高了空间定位。在IAM、RIMES 2009、SROIE 2019和异构MAURDOR基准上的实验表明，PILOT在文本识别和行级检测方面与传统OCR系统、最近的端到端HTR模型和紧凑的视觉-语言模型相比，具有竞争力或更优性能，同时保持比十亿规模的多模态模型小得多。对细粒度OCR和字符串查询定位的额外评估进一步证实，统一的文本-布局解码器可以在紧凑设置中提供准确且高效的可提示OCR。为了支持可再现性，我们发布了合成SROIE生成器、50万标注的IDL/PDFA页面、IAM、RIMES 2009和MAURDOR的统一行级注释以及源代码（https://github.com/hamdilaziz/PILOT）。
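
The PILOT abstract mentions quantized absolute-coordinate tokens on a 10 px grid. A minimal sketch of what such a quantization could look like follows. The token format `<loc_N>`, the round-to-nearest rule, and all function names are hypothetical; the abstract states only the grid size.

```python
import math

GRID_PX = 10  # grid step stated in the abstract


def quantize(value_px: float, grid: int = GRID_PX) -> int:
    """Round a pixel coordinate to the nearest grid index (assumed rule)."""
    return math.floor(value_px / grid + 0.5)


def box_to_tokens(box, grid: int = GRID_PX):
    """Turn an (x0, y0, x1, y1) pixel box into four coordinate tokens."""
    return [f"<loc_{quantize(v, grid)}>" for v in box]


def tokens_to_box(tokens, grid: int = GRID_PX):
    """Invert box_to_tokens, recovering grid-aligned pixel coordinates."""
    return tuple(int(t[len("<loc_"):-1]) * grid for t in tokens)


if __name__ == "__main__":
    tokens = box_to_tokens((123, 48, 517, 92))
    print(tokens, tokens_to_box(tokens))
```

Under this rounding rule the round trip moves each coordinate by at most half a grid step (5 px), which bounds the localization error introduced by the token vocabulary itself.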

AstroVLM: Expert Multi-agent Collaborative Reasoning for Astronomical Imaging Quality Diagnosis

Authors: Yaohui Han, Tianshuo Wang, Zixi Zhao, Zhengchun Zhu, Shuo Ren, Yiru Wang, Rongliang Fu, Tinghuan Chen, Tsung-Yi Ho

First: 2026-04-17T12:54:26+00:00 · Latest: 2026-04-17T12:54:26+00:00

Abstract

Vision Language Models (VLMs) have been applied to several specific domains and have shown strong problem-solving capabilities. However, astronomical imaging, a complex problem involving multidisciplinary knowledge and several subtasks, has not been adequately studied. Due to the complexity of the astronomical imaging process, both world-class astronomical organizations, such as NASA, and expert enthusiasts devote a great deal of time and effort to it. This is because the processes in astronomical imaging have complex underlying correlations that significantly influence one another, making the quality diagnosis and error localization of astronomical images challenging. To address this problem, we propose AstroVLM, a collaborative multi-agent system for diagnosing the quality of astronomical images. Experimental results show that AstroVLM outperforms all baselines on real-world astronomical imaging quality diagnosis tasks, providing a reference for language models to handle complicated multi-process tasks.

中文标题/摘要

标题：AstroVLM：专家多智能体协作推理在天文成像质量诊断中的应用

视觉语言模型（VLMs）已在多个特定领域得到应用，并展示了强大的问题解决能力。然而，涉及多学科知识和多个子任务的天文成像问题尚未得到充分研究。由于天文成像过程的复杂性，世界级的天文组织如NASA和专家爱好者投入了大量时间和精力。这是因为天文成像过程中的复杂内在关联显著影响彼此，使得天文图像的质量诊断和错误定位具有挑战性。为了解决这一问题，我们提出了AstroVLM，一种用于诊断天文图像质量的协作多智能体系统。实验结果表明，AstroVLM 在实际天文成像质量诊断任务中优于所有基线，为语言模型处理复杂多过程任务提供了参考。

Summary / 总结

The research motivation is to address the challenges in diagnosing the quality of astronomical images, which involve complex multidisciplinary knowledge and subtasks. The proposed method, AstroVLM, is a collaborative multi-agent system that leverages Vision Language Models to diagnose image quality. The key experimental finding is that AstroVLM outperforms all baselines in real-world astronomical imaging quality diagnosis tasks, demonstrating its effectiveness in handling complex multi-process tasks.

研究动机是解决天文学成像质量诊断的挑战，这涉及到复杂的多学科知识和子任务。提出的AstroVLM方法是一种协作的多智能体系统，利用视觉语言模型进行图像质量诊断。关键实验发现是，AstroVLM在实际天文学成像质量诊断任务中优于所有基线，展示了其在处理复杂多过程任务方面的有效性。

Targeted Exploration via Unified Entropy Control for Reinforcement Learning

Authors: Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Ge Lan, Yue Wang

Venue: ACL 2026

First: 2026-04-16T05:52:18+00:00 · Latest: 2026-04-17T11:31:53+00:00

Comments: Accepted for publication in Findings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Abstract

Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potential and valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@$k$. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring UEC-RL as a key for scaling RL-based reasoning in large models. Our code is available at https://github.com/597358816/UEC-RL.
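
The abstract reports results in Pass@1 and Pass@$k$ without saying how they are estimated. A minimal sketch of the commonly used unbiased estimator (introduced alongside the HumanEval benchmark) is shown below; whether UEC-RL uses exactly this estimator is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1), pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimator reduces to the plain fraction of correct samples, c / n.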

From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

Authors: Jinhao Shen, Haoqian Du, Xulu Zhang, Xiao-Yong Wei, Qing Li

First: 2026-04-17T11:10:22+00:00 · Latest: 2026-04-17T11:10:22+00:00

Abstract

Text-guided image editing, a pivotal task in modern multimedia content creation, has seen remarkable progress with training-free methods that eliminate the need for additional optimization. Despite recent progress, existing methods are typically constrained by a competitive paradigm in which the editing and reconstruction branches are independently driven by their respective objectives to maximize alignment with target and source prompts. This adversarial strategy causes semantic conflicts and unpredictable outcomes due to the lack of coordination between branches. To overcome these issues, we propose Coopetitive Training-Free Image Editing (CoEdit), a novel zero-shot framework that transforms attention control from competition to coopetitive negotiation, achieving editing harmony across spatial and temporal dimensions. Spatially, CoEdit introduces Dual-Entropy Attention Manipulation, which quantifies directional entropic interactions between branches to reformulate attention control as a harmony-maximization problem, eventually improving the localization of editable and preservable regions. Temporally, we present an Entropic Latent Refinement mechanism to dynamically adjust latent representations over time, minimizing accumulated editing errors and ensuring consistent semantic transitions throughout the denoising trajectory. Additionally, we propose the Fidelity-Constrained Editing Score, a composite metric that jointly evaluates semantic editing and background fidelity. Extensive experiments on standard benchmarks demonstrate that CoEdit achieves superior performance in both editing quality and structural preservation, enhancing multimedia information utilization by enabling more effective interaction between visual and textual modalities. The code will be available at https://github.com/JinhaoShen/CoEdit.

中文标题/摘要

标题：从竞争到竞合：基于文本指导的无需训练的图像编辑

文本指导的图像编辑，是现代多媒体内容创作中的关键任务，借助无需训练的方法取得了显著进展，消除了额外优化的需要。尽管取得了进展，现有方法通常受限于竞争范式，在这种范式中，编辑和重建分支分别由各自的优化目标驱动，以最大化与目标和源提示的一致性。对抗策略导致了语义冲突和不可预测的结果，由于分支之间缺乏协调。为了解决这些问题，我们提出了无需训练的图像编辑（CoEdit），这是一种新颖的零样本框架，将注意力控制从竞争转变为竞合协商，实现了跨空间和时间维度的编辑和谐。在空间上，CoEdit 引入了双熵注意力操纵，量化分支之间的方向熵交互，重新定义注意力控制为和谐最大化问题，最终改善了可编辑和可保留区域的定位。在时间上，我们提出了熵潜在细化机制，动态调整潜在表示，减少累积编辑误差，确保去噪轨迹中的一致语义过渡。此外，我们提出了保真度约束编辑评分，这是一种综合指标，联合评估语义编辑和背景保真度。在标准基准上的广泛实验表明，CoEdit 在编辑质量和结构保留方面均表现出优越性能，通过增强视觉和文本模态之间的有效交互，提高了多媒体信息的利用。代码将在 https://github.com/JinhaoShen/CoEdit 上提供。

Summary / 总结

The research aims to improve text-guided image editing by addressing the limitations of competitive methods, which often lead to semantic conflicts. Coopetitive Training-Free Image Editing (CoEdit) is proposed, which transforms attention control from competition to coopetition. Spatially, it uses Dual-Entropy Attention Manipulation to improve the localization of editable and preservable regions. Temporally, it introduces Entropic Latent Refinement to minimize editing errors. The Fidelity-Constrained Editing Score evaluates both semantic editing and background fidelity. Experiments show CoEdit outperforms existing methods in editing quality and structural preservation, enhancing multimedia information utilization.

研究旨在通过解决竞争方法导致的语义冲突问题，改进基于文本指导的图像编辑。提出了协作竞争训练免费图像编辑（CoEdit）框架，将注意力控制从竞争转变为协作竞争。空间上，使用双熵注意力操纵来改善可编辑和可保留区域的定位。时间上，引入了熵潜量细化机制以动态调整潜量表示，减少编辑误差。提出了保真度约束编辑评分，综合评估语义编辑和背景保真度。实验表明，CoEdit在编辑质量和结构保存方面优于现有方法，提升了多媒体信息的利用效率。

SENSE: Stereo OpEN Vocabulary SEmantic Segmentation

Authors: Thomas Campagnolo, Ezio Malis, Philippe Martinet, Gaétan Bahl

First: 2026-04-17T11:07:36+00:00 · Latest: 2026-04-17T11:07:36+00:00

Abstract

Open-vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single-view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision-language models to enhance open-vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase-grounded tasks and demonstrates generalization in zero-shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.

中文标题/摘要

标题：SENSE: 立体开放词汇语义分割

开放词汇语义分割使模型能够分割超出固定类别集的对象或图像区域，为动态环境提供了灵活性。然而，现有方法通常依赖单视角图像，并且在遮挡和接近物体边界时难以实现空间精度。我们提出了SENSE，这是首个立体开放词汇语义分割的工作，通过利用立体视觉和视觉语言模型来增强开放词汇语义分割。通过引入立体图像对中的几何线索，我们提高了空间推理和分割准确性。在PhraseStereo数据集上训练，我们的方法在短语导向任务中表现出色，并在零样本设置中展示了泛化能力。在PhraseStereo上，我们展示了相对于基线方法+2.9%的平均精度改进，以及相对于最佳竞争方法+0.76%的改进。SENSE在Cityscapes上的相对改进为+3.5% mIoU，在KITTI上为+18%。通过联合推理语义和几何，SENSE支持从自然语言理解准确的场景，这对于自主机器人和智能交通系统至关重要。
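
The SENSE abstract reports relative mIoU improvements. As a reading aid, here is a minimal sketch of mean IoU over flat label lists and of the relative-improvement arithmetic. The function names are illustrative, and the numbers in the usage line are made up for the example, not taken from the paper.

```python
def mean_iou(pred, gt, num_classes: int) -> float:
    """Mean intersection-over-union across classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def relative_improvement(baseline: float, new: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
    # Hypothetical scores, for illustration only: 60.0 -> 62.1 is a 3.5% relative gain.
    print(relative_improvement(60.0, 62.1))
```

A relative improvement is measured against the baseline score, so the same absolute gain reads as a larger percentage when the baseline is low.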

VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation

Authors: Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Xiaole Zhang, Jesse Thomason, Ali Jannesari, Nesreen Ahmed, Paul Bogdan

First: 2025-10-31T16:40:58+00:00 · Latest: 2026-04-17T09:52:03+00:00

Abstract

Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. We propose VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a quality-guided caching mechanism maintains all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi-path generation strategy leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into two-stage processes that exploit LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15-30% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.

Summary / 总结

VeriMoA is a training-free mixture-of-agents framework designed to improve the automation of RTL design by addressing the limitations of current multi-agent approaches. It introduces a quality-guided caching mechanism and a multi-path generation strategy to enhance reasoning and solution diversity. Experiments show that VeriMoA improves Pass@1 by 15-30% across various LLM backbones, particularly enabling smaller models to match larger ones and fine-tuned alternatives without additional training costs.

VeriMoA 是一个无需训练的混合代理框架，旨在通过解决现有混合代理方法的局限性来提高 RTL 设计的自动化水平。它引入了质量引导的缓存机制和多路径生成策略，以增强推理和解空间的多样性。实验表明，VeriMoA 在各种 LLM 基准模型上将 Pass@1 提高了 15-30%，特别是使较小的模型能够匹配较大的模型和微调的替代方案，而无需额外的训练成本。
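
The quality-guided caching idea in the VeriMoA abstract (keep every intermediate HDL output, then rank and select across the whole generation process) can be sketched roughly as follows. The class names, the score scale, and the top-k selection rule are hypothetical; the abstract does not specify how quality is measured or how many candidates are forwarded.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Candidate:
    score: float                      # quality estimate; only field used for ordering
    layer: int = field(compare=False)  # which reasoning layer produced it
    hdl: str = field(compare=False)    # the generated HDL text


class QualityCache:
    """Keeps all intermediate outputs and ranks them globally by score."""

    def __init__(self) -> None:
        self._items: list[Candidate] = []

    def add(self, hdl: str, score: float, layer: int) -> None:
        self._items.append(Candidate(score, layer, hdl))

    def top_k(self, k: int) -> list[Candidate]:
        # Selection spans every layer seen so far, not just the latest one.
        return heapq.nlargest(k, self._items)


if __name__ == "__main__":
    cache = QualityCache()
    cache.add("module a; endmodule", score=0.9, layer=1)
    cache.add("module b; endmodule", score=0.4, layer=2)
    cache.add("module c; endmodule", score=0.7, layer=2)
    print([c.hdl for c in cache.top_k(2)])
```

Because selection is global, a strong early candidate survives a weak later layer, which is one plausible way such a cache could limit noise propagation.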

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

Authors: Chengyin Hu, Xuemeng Sun, Jiaju Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long

First: 2026-03-29T16:35:18+00:00 · Latest: 2026-04-17T09:33:07+00:00

Abstract

Visual-Language Models (VLMs) have demonstrated exceptional cross-modal understanding across various tasks, including zero-shot classification, image captioning, and visual question answering. However, their robustness to physically plausible non-rigid deformations, such as wrinkles on flexible surfaces, remains poorly understood. In this work, we propose a parametric structural perturbation method inspired by the mechanics of three-dimensional fabric wrinkles. Specifically, our method generates photorealistic non-rigid perturbations by constructing multi-scale wrinkle fields and integrating displacement field distortion with surface-consistent appearance variations. To achieve an optimal balance between visual naturalness and adversarial effectiveness, we design a hierarchical fitness function in a low-dimensional parameter space and employ an optimization-based search strategy. We evaluate our approach using a two-stage framework: perturbations are first optimized on a zero-shot classification proxy task and subsequently assessed for transferability on generative tasks. Experimental results demonstrate that our method significantly degrades the performance of various state-of-the-art VLMs, consistently outperforming baselines in both image captioning and visual question-answering tasks.

中文标题/摘要

标题：当表面欺骗时：利用皱纹引起的注意力转移攻击视觉-语言模型

视觉-语言模型（VLMs）在各种任务中展示了跨模态的出色理解能力，包括零样本分类、图像字幕和视觉问答。然而，它们对物理上合理的非刚性变形（如柔性表面的皱纹）的鲁棒性仍然知之甚少。在本文中，我们提出了一种受三维织物皱纹力学启发的参数化结构扰动方法。具体而言，我们的方法通过构建多尺度皱纹场并结合表面一致的外观变化来生成逼真的非刚性扰动。为了在视觉自然性和对抗有效性之间实现最佳平衡，我们在低维参数空间中设计了一个分层适应函数，并采用基于优化的搜索策略。我们使用两阶段框架评估我们的方法：扰动首先在零样本分类代理任务上进行优化，然后在生成任务上进行转移性评估。实验结果表明，我们的方法显著降低了各种最先进的VLMs的性能，在图像字幕和视觉问答任务中始终优于基线方法。

Summary / 总结

The research aims to evaluate the robustness of Visual-Language Models (VLMs) against physically plausible non-rigid deformations, specifically wrinkles on flexible surfaces. The method uses a parametric structural perturbation inspired by fabric wrinkles, generating photorealistic perturbations through multi-scale wrinkle fields and surface-consistent appearance variations. The approach optimizes perturbations in a low-dimensional parameter space and assesses their transferability on generative tasks. The results show that the proposed method significantly degrades the performance of various state-of-the-art VLMs in zero-shot classification, image captioning, and visual question-answering tasks, outperforming baseline methods.

该研究探讨了视觉-语言模型（VLMs）对物理上合理的非刚性变形，特别是柔性表面的皱纹的脆弱性。作者提出了一种参数化结构扰动方法，通过整合多尺度皱纹场和表面一致的外观变化来生成逼真的非刚性扰动。通过两阶段评估框架，该方法首先优化扰动以用于零样本分类任务，然后测试其在生成任务中的迁移性。实验结果表明，该方法显著降低了各种最先进的VLMs在图像描述和视觉问答任务中的性能，并且优于基线方法。

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

Authors: Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li

First: 2025-10-09T17:20:44+00:00 · Latest: 2026-04-17T08:17:02+00:00

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.

中文标题/摘要

标题：Video-STAR：利用工具强化开放词汇动作识别

多模态大型语言模型（MLLMs）在视觉和文本推理方面展现了显著潜力，但它们对文本中心先验的依赖往往限制了其在开放词汇场景下区分语义相似动作的能力。为了解决这一问题，我们提出了Video-STAR框架，该框架结合了上下文子运动分解与工具增强的强化学习，用于开放词汇动作识别（OVAR）。与以往方法将动作视为单一实体不同，我们的方法创新地将动作分解为具有区分性的子运动进行精细匹配，同时动态调用领域特定工具进行跨模态交织，从而实现类别特定的推理能力和减少跨模态幻觉。此外，通过设计一个分层奖励，平衡工具使用效率、子运动相关性和推理结构一致性，我们的方法能够自主利用外部工具优先考虑子运动模式，从文本中心推理过渡到视觉导向的推理。在HMDB-51、UCF-101、SSv2、Kinetics-400和Kinetics-600数据集上的广泛评估表明，我们的方法在区分精细动作和处理跨模态幻觉方面表现出色，验证了我们卓越的鲁棒性和泛化能力。

Summary / 总结

Video-STAR is a framework that decomposes actions into discriminative sub-motions and dynamically invokes domain-specific tools for cross-modal interleaving, enhancing open-vocabulary action recognition (OVAR). Unlike previous approaches, Video-STAR autonomously leverages tools to prioritize sub-motion patterns, and balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning. Extensive evaluations on the HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets show that it outperforms existing methods in OVAR performance and in handling cross-modal hallucination.

Video-STAR 是一个将动作分解为具有区分性的子动作并使用工具增强的强化学习进行开放词汇动作识别的框架。它通过减少跨模态幻觉和实现类别特定的推理来超越现有方法。在多个数据集上的评估显示 Video-STAR 的优越性能和鲁棒性。

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

Authors: Chengxin Liu, Wonseok Choi, Chenshuang Zhang, Tae-Hyun Oh

Venue: CVPR 2026

First: 2026-04-17T08:07:22+00:00 · Latest: 2026-04-17T08:07:22+00:00

Comments: CVPR 2026. Project page: https://cxliu0.github.io/AIF/

Abstract

Vision-Language Models (VLMs) have demonstrated strong capability in a wide range of tasks such as visual recognition, document parsing, and visual grounding. Nevertheless, recent work shows that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. Based on the observation, we show that modulating the information flow during inference can improve the perception capability of VLMs. The idea is that text tokens should only be associated with important visual tokens during decoding, eliminating the interference of irrelevant regions. To achieve this, we propose a token dynamics-based method to determine the importance of visual tokens, where visual tokens that exhibit distinct activation patterns during different decoding stages are viewed as important. We apply our approach to representative open-source VLMs and evaluate on various datasets, including visual question answering, visual grounding and counting, optical character recognition, and object hallucination. The results show that our approach significantly improves the performance of baselines. Project page: https://cxliu0.github.io/AIF/.
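
As a rough illustration of the idea of restricting text tokens to "important" visual tokens, the sketch below scores each visual token by how much its activation varies across decoding stages and renormalizes attention over the selected tokens. Using variance as the measure of a "distinct activation pattern", the `keep_ratio` parameter, and both function names are assumptions for this example, not the authors' method.

```python
from statistics import pvariance


def select_important_tokens(activations, keep_ratio: float = 0.5):
    """Return indices of the visual tokens whose activations vary most.

    activations[t][s] is the activation of visual token t at decoding stage s.
    Variance across stages is an assumed proxy for a distinct pattern.
    """
    scores = [pvariance(row) for row in activations]
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mask_attention(attn_row, keep):
    """Zero attention to unselected tokens and renormalize the remainder."""
    keep = set(keep)
    masked = [a if i in keep else 0.0 for i, a in enumerate(attn_row)]
    total = sum(masked)
    return [m / total for m in masked] if total else masked


if __name__ == "__main__":
    acts = [[0.1, 0.1, 0.1], [0.1, 0.9, 0.2], [0.5, 0.5, 0.5], [0.0, 0.4, 0.8]]
    keep = select_important_tokens(acts)
    print(keep, mask_attention([0.25, 0.25, 0.25, 0.25], keep))
```

In this toy input, tokens 0 and 2 are constant across stages and are therefore dropped, while the attention mass is redistributed over tokens 1 and 3.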

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

Authors: Lama Moukheiber, Caleb M. Yeung, Haotian Xue, Alec Helbling, Zelin Zhao, Yongxin Chen

First: 2026-04-17T08:06:39+00:00 · Latest: 2026-04-17T08:06:39+00:00

Abstract

Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.

Summary / 总结

The paper introduces SGMRI-VQA, a benchmark for multi-frame spatially grounded reasoning on volumetric MRI, addressing the limitations of existing 2D image-based benchmarks. It involves 41,307 QA pairs with expert annotations and chain-of-thought traces. The study evaluates 10 vision-language models, showing that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision outperforms zero-shot baselines, highlighting the importance of spatial supervision for clinical reasoning.

论文引入了SGMRI-VQA基准，用于多帧空间定位推理的体层MRI，解决了现有基于2D图像基准的局限性。该基准包含41,307个问答对，附有专家注释和思维链轨迹。研究评估了10个视觉语言模型，表明Qwen3-VL-8B在边界框监督下的有监督微调优于零样本基线，强调了空间监督对于临床推理的重要性。
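
Frame-indexed bounding boxes, as described for SGMRI-VQA, invite a grounding score that penalizes both poor boxes and wrong frames. A minimal sketch under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples keyed by frame index, and a frame present in only one of prediction or ground truth contributes an IoU of zero. The benchmark's actual metric is not given in the abstract, and the function names are illustrative.

```python
def box_iou(a, b) -> float:
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def multi_frame_iou(pred: dict, gt: dict) -> float:
    """Average IoU over every frame mentioned by prediction or ground truth."""
    frames = set(pred) | set(gt)
    if not frames:
        return 0.0
    # Frames missing from either side add nothing to the total (assumed rule).
    total = sum(box_iou(pred[f], gt[f]) for f in frames if f in pred and f in gt)
    return total / len(frames)


if __name__ == "__main__":
    gt = {3: (0, 0, 10, 10), 4: (0, 0, 10, 10)}
    pred = {3: (0, 0, 10, 10), 5: (0, 0, 10, 10)}
    print(multi_frame_iou(pred, gt))
```

In the usage example the box on frame 3 is perfect, but the missed frame 4 and the spurious frame 5 pull the score down to one third.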

InstructTable: Improving Table Structure Recognition Through Instructions

Authors: Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan, Yu Gu, Kai zhou, Pengfei Yan

First: 2026-04-03T08:44:45+00:00 · Latest: 2026-04-17T07:24:14+00:00

Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Findings Track (CVPRF)

Abstract

Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided multi-stage training TSR framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct the Balanced Complex Dense Synthetic Tables (BCDSTab) benchmark, comprising 900 complex table images synthesized through our method to serve as a rigorous benchmark. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance in TSR tasks. Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.

中文标题/摘要

标题：InstructTable：通过指令提高表格结构识别

表格结构识别（TSR）通过解析表格图像为结构化表示具有广泛的实用价值，但在处理包含合并或空单元格的复杂布局时面临重大挑战。传统基于视觉的模型仅依赖视觉信息，缺乏关键的语义支持，从而在复杂场景中妨碍准确的结构识别。视觉-语言模型利用上下文语义增强理解，但这些方法在建模视觉结构信息方面有所不足。为解决这些局限性，本文提出了一种指令引导的多阶段训练TSR框架InstructTable。精心设计的表格指令预训练引导注意力关注细粒度的结构模式，增强对复杂表格的理解。TSR微调补充了视觉信息建模，确保在各种场景中保持高精度的表格解析。此外，我们引入了Table Mix Expand（TME），这是一种无模板的大规模真实表格数据合成方法。利用TME，我们构建了Balanced Complex Dense Synthetic Tables（BCDSTab）基准，包含900张通过我们方法合成的复杂表格图像，作为严格的基准。在多个公开数据集（FinTabNet、PubTabNet、MUSTARD）和BCDSTab上的广泛实验表明，InstructTable在TSR任务中达到了最先进的性能。消融研究进一步证实了所提表格数据特定指令和合成数据的积极影响。

Summary / 总结

InstructTable is a multi-stage training framework for table structure recognition that uses instruction guidance to improve the handling of complex layouts. It combines pre-training with table instructions to enhance the understanding of fine-grained structural patterns and fine-tuning to maintain visual information modeling. The framework also introduces Table Mix Expand (TME), a template-free method for generating large-scale synthetic tabular data, which is used to create the BCDSTab benchmark. Experiments on various datasets show that InstructTable outperforms existing methods in table structure recognition tasks.

InstructTable 是一种多阶段训练框架，通过指令指导来提高复杂布局的处理能力。它结合了使用表格指令的预训练和视觉信息建模的微调，以增强对细粒度结构模式的理解。该框架还引入了Table Mix Expand (TME) 方法，这是一种无模板的大规模合成表格数据生成方法，用于构建BCDSTab基准。实验表明，InstructTable 在各种数据集上的表格结构识别任务中优于现有方法。

PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding

Authors: Junjie Wen, Junlin He, Fei Ma, Jinqiang Cui

First: 2026-04-17T07:24:14+00:00 · Latest: 2026-04-17T07:24:14+00:00

Comments: Accepted by ICCA 2026

Abstract

Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at https://github.com/RockWenJJ/PLAF.

中文标题/摘要

标题：PLAF：像素级语言对齐特征提取以实现高效的3D场景理解

准确的开放词汇3D场景理解需要同时在像素级别上具有语义对齐和空间精确性的语义表示，同时在提升到3D空间时保持可扩展性。然而，现有的表示方法难以同时满足这些要求，而密集传播像素级语义到3D通常会导致大量冗余，导致在大规模场景中存储和查询效率低下。为了解决这些挑战，我们提出了\emph{PLAF}，一种像素级语言对齐特征提取框架，能够在2D中实现密集且准确的语义对齐，而不牺牲开放词汇的表达能力。在此表示基础上，我们进一步设计了一种高效的语义存储和查询方案，显著减少了2D和3D域中的冗余。实验结果表明，\emph{PLAF}为准确高效的开放词汇3D场景理解提供了强大的语义基础。代码已公开发布在https://github.com/RockWenJJ/PLAF。

Summary / 总结

PLAF is a framework for pixel-wise language-aligned feature extraction that enables dense and accurate semantic alignment in 2D while maintaining open-vocabulary expressiveness. It addresses the challenge of efficiently storing and querying large-scale 3D scenes by reducing redundancy. Experimental results demonstrate that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding without sacrificing scalability. The code is publicly available.

PLAF是一种像素级语言对齐特征提取框架，能够在2D中实现密集且准确的语义对齐，同时保持开放词汇表的表达能力。它通过减少冗余来高效地存储和查询大规模3D场景。实验结果表明，PLAF为准确且高效的开放词汇表3D场景理解提供了坚实的基础，而不牺牲可扩展性。代码已公开可用。

FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

Authors: Hussain Alasmawi, Numan Saeed, Mohammad Yaqub

First: 2025-12-25T04:54:37+00:00 · Latest: 2026-04-17T07:08:36+00:00

Abstract

The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.

中文标题/摘要

标题：FETAL-GAUGE：胎儿超声成像视觉语言模型基准

随着产前超声成像需求的增长，全球范围内训练超声技师的人才短缺问题日益严重，阻碍了胎儿健康监测的必要性。深度学习有望提高超声技师的工作效率，并支持新从业者培训。视觉语言模型（VLMs）特别适用于超声解释，因为它们可以在单一框架内同时处理图像和文本，执行多种临床任务。然而，尽管VLMs的扩展，尚无标准化基准来评估其在胎儿超声成像中的性能。这一缺口主要是由于该模态的挑战性、操作员依赖性以及公共数据集的有限可用性。为解决这一缺口，我们提出了Fetal-Gauge，这是首个专门用于评估VLMs在各种胎儿超声任务上的最大视觉问答基准。我们的基准包括超过42,000张图像和93,000个问题-答案对，涵盖了解剖平面识别、解剖结构的视觉定位、胎儿方位评估、临床视图一致性以及临床诊断。我们系统地评估了几种最先进的VLMs，包括通用和医学专用模型，并揭示了显著的性能差距：最佳模型的准确率仅为55%，远低于临床要求。我们的分析指出了当前VLMs在胎儿超声解释中的关键局限性，突显了迫切需要领域适应架构和专门训练方法的必要性。Fetal-Gauge为推进产前护理中的多模态深度学习奠定了严格的基石，并提供了一条解决全球医疗保健可及性挑战的途径。我们的基准将在论文被接受后公开。

Summary / 总结

Fetal-Gauge is a benchmark designed to evaluate the performance of Vision-Language Models (VLMs) in fetal ultrasound imaging, addressing the lack of standardized evaluation tools. It includes over 42,000 images and 93,000 question-answer pairs for various tasks. The evaluation of several state-of-the-art VLMs revealed a significant performance gap, with the best model achieving only 55% accuracy, indicating the need for domain-specific adaptations. This benchmark aims to advance multimodal deep learning in prenatal care and improve global healthcare accessibility.

Fetal-Gauge 是一个用于评估 Vision-Language 模型在胎儿超声成像中性能的基准，解决了缺乏标准化评估工具的问题。它包含超过 42,000 张图像和 93,000 个问答对，用于各种任务。对多个最先进的 VLM 的评估显示了显著的性能差距，最佳模型的准确率仅为 55%，表明需要领域特定的适应。该基准旨在推动多模态深度学习在产前护理中的发展，并改善全球医疗保健可及性。

TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models

Authors: Jinlun Ye, Jiang Liao, Runhe Lai, Xinhua Lu, Jiaxin Zhuang, Zhiyong Gan, Ruixuan Wang

Venue: CVPR 2026

First: 2026-04-17T06:59:50+00:00 · Latest: 2026-04-17T06:59:50+00:00

Comments: Accepted to CVPR 2026

Abstract

Vision-language models (VLMs) such as CLIP exhibit strong out-of-distribution (OOD) detection capabilities by aligning visual and textual representations. Recent CLIP-based test-time adaptation methods further improve detection performance by incorporating external OOD labels. However, such labels are finite and fixed, while the real OOD semantic space is inherently open-ended. Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce Test-time Textual Learning (TTL), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. TTL updates learnable prompts using pseudo-labeled test samples to capture emerging OOD knowledge. To suppress the noise introduced by pseudo-labels, we introduce an OOD knowledge purification strategy that selects only reliable OOD samples for adaptation. In addition, TTL maintains an OOD Textual Knowledge Bank that stores high-quality textual features, providing stable score calibration across batches. Extensive experiments on two standard benchmarks with nine OOD datasets demonstrate that TTL consistently achieves state-of-the-art performance, highlighting the value of textual adaptation for robust test-time OOD detection. Our code is available at https://github.com/figec/TTL.
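As a rough illustration of how textual features can drive OOD scoring in a CLIP-style model, the sketch below computes the softmax probability mass that an image feature assigns to a set of OOD text features relative to the in-distribution (ID) class texts. This is a common scoring pattern for label-based OOD detection, not TTL's actual procedure: the function names, the temperature value, and the toy 2-D features are all assumptions, and the learnable prompts, purification step, and knowledge-bank update described in the abstract are not reproduced.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ood_score(image_feat, id_text_feats, ood_bank, temperature=0.01):
    """Softmax probability mass assigned to the OOD text features (hypothetical scorer)."""
    logits = [cosine(image_feat, t) / temperature for t in id_text_feats + ood_bank]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return sum(exps[len(id_text_feats):]) / sum(exps)

# Toy features (assumed): two ID class texts and one stored OOD text feature.
id_texts = [[1.0, 0.0], [0.0, 1.0]]
bank = [[-1.0, 0.0]]
score_id = ood_score([1.0, 0.0], id_texts, bank)    # image aligned with an ID text
score_ood = ood_score([-1.0, 0.1], id_texts, bank)  # image aligned with the OOD text
```

A higher score means more mass on the OOD texts; thresholding it would yield an ID/OOD decision.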

中文标题/摘要

标题：TTL：基于预训练视觉-语言模型的测试时文本学习以进行OOD检测

视觉-语言模型（VLMs）如CLIP通过视觉和文本表示的对齐表现出强大的离群值（OOD）检测能力。基于CLIP的测试时适应方法进一步通过引入外部OOD标签提高了检测性能。然而，这些标签是有限且固定的，而真实的OOD语义空间是固有的开放式的。因此，固定的标签无法代表测试流中遇到的多样且不断演变的OOD语义。为了解决这一局限性，我们引入了测试时文本学习（TTL）框架，该框架可以从未标记的测试流中动态学习OOD文本语义，而不依赖于外部OOD标签。TTL使用伪标记的测试样本更新可学习的提示，以捕捉新兴的OOD知识。为了抑制伪标签引入的噪声，我们引入了一种OOD知识净化策略，该策略选择可靠的OOD样本进行适应，同时抑制噪声。此外，TTL维护了一个OOD文本知识库，存储高质量的文本特征，提供跨批次的稳定评分校准。在两个标准基准上的广泛实验表明，TTL在九个OOD数据集上始终实现了最先进的性能，突显了文本适应对于稳健的测试时OOD检测的价值。我们的代码可在https://github.com/figec/TTL/获取。

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference

Authors: Xiang Xia, Wuyang Zhang, Jiazheng Liu, Cheng Yan, Yanyong Zhang

First: 2026-04-17T06:53:27+00:00 · Latest: 2026-04-17T06:53:27+00:00

Abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based on this view, we propose DepCap, a training-free framework for efficient block-wise DLM inference. Specifically, DepCap instantiates the cross-step signal as the influence of the last decoded block and uses it to adaptively determine how far the next block should extend, while identifying a conflict-free subset of tokens for safe parallel decoding within each block, enabling substantial inference acceleration with negligible quality degradation. DepCap is a plug-and-play method applicable to various DLMs, and compatible with existing KV-cache strategies for block-wise DLM. An information-theoretic analysis further suggests that the cumulative last-block influence on a candidate block is approximately additive across tokens, supporting the proposed block-partitioning criterion. Experimental results show that DepCap achieves favorable speed-quality trade-offs across multiple DLM backbones and reasoning and coding benchmarks, with up to 5.63× speedup without significant performance degradation.
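For context, the conservative confidence-based parallel decoding that the abstract contrasts against can be sketched roughly as follows: within a block, every still-masked position whose top token probability clears a threshold is decoded in the same step. This is a minimal sketch of that baseline idea only, under assumed names and an assumed threshold; it does not implement DepCap's cross-step block sizing or its conflict-free subset selection.

```python
def confident_positions(probs, masked, threshold=0.9):
    """Choose masked positions to decode in parallel in one step.

    probs:  dict mapping position -> list of token probabilities
            (hypothetical model output for the current block)
    masked: positions in the block that are still undecoded
    Returns a dict mapping each chosen position to its argmax token id.
    """
    best = {p: max(range(len(probs[p])), key=probs[p].__getitem__) for p in masked}
    conf = {p: probs[p][best[p]] for p in masked}
    chosen = [p for p in masked if conf[p] >= threshold]
    if not chosen and masked:
        # Decode at least the most confident position so each step makes progress.
        chosen = [max(masked, key=conf.__getitem__)]
    return {p: best[p] for p in chosen}

# Toy distributions over a two-token vocabulary (assumed values).
toy_probs = {0: [0.95, 0.05], 1: [0.6, 0.4], 2: [0.02, 0.98]}
step = confident_positions(toy_probs, [0, 1, 2])
fallback = confident_positions(toy_probs, [1])
```

A fixed threshold trades speed for safety: raising it decodes fewer tokens per step, which is the limitation the abstract attributes to confidence-only selection.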

中文标题/摘要

标题：DepCap：自适应块级并行解码以提高高效扩散语言模型推理效率

扩散语言模型（DLMs）由于其并行解码和整个序列全局优化的潜力，已成为自回归语言生成的有前途的替代方案。为了充分利用这一潜力，DLM推理必须仔细平衡生成质量和解码速度。最近的块级DLM解码方法通过按块顺序进行基于扩散的解码来改善这种权衡。然而，现有方法通常依赖于固定的块调度或当前步骤的局部信号来确定块边界，并使用保守的基于置信度的并行解码以避免冲突，从而限制了质量-速度权衡。在本文中，我们主张块级DLM推理需要更适合其两个核心决策的信号：跨步骤信号以确定块边界，以及标记级冲突信号以进行并行解码。基于这一观点，我们提出了DepCap，这是一种无需训练的高效块级DLM推理框架。具体而言，DepCap将跨步骤信号实例化为最后一个解码块的影响，并使用它来自适应地确定下一个块应扩展多远，同时在每个块内识别一个无冲突的标记子集以安全地进行并行解码，从而实现显著的推理加速，同时质量下降可以忽略不计。DepCap是一种即插即用的方法，适用于各种DLM，并与现有的块级DLM的KV缓存策略兼容。信息论分析进一步表明，候选块的累积最后一个块影响在标记之间是近似可加的，支持所提出的块划分标准。实验结果表明，DepCap在多个DLM骨干网络和推理与编码基准测试中实现了有利的质量-速度权衡，最高可实现5.63倍的加速，而性能下降可以忽略不计。

Summary / 总结

The paper introduces DepCap, a framework for efficient block-wise decoding of diffusion language models (DLMs) that improves the quality-speed trade-off by using adaptive cross-step signals and token-level conflict signals. DepCap dynamically determines block boundaries and identifies conflict-free tokens for parallel decoding, achieving up to 5.63 times speedup with negligible performance degradation across various DLMs and benchmarks.

该论文提出了DepCap框架，通过使用自适应的跨步信号和token级冲突信号，提高了DLMs的解码效率。DepCap动态确定块边界，并在每个块内识别无冲突的token进行并行解码，实现了在各种DLMs和基准测试中高达5.63倍的加速，同时性能几乎没有下降。

Concept-wise Attention for Fine-grained Concept Bottleneck Models

Authors: Minghong Zhong, Guoshuai Zou, Kanghao Chen, Dexia Chen, Ruixuan Wang

Venue: CVPR 2026

First: 2026-04-17T06:43:30+00:00 · Latest: 2026-04-17T06:43:30+00:00

Comments: 10 pages, 7 figures, Accepted by CVPR 2026 Findings

Abstract

Recently, impressive performance has been achieved in Concept Bottleneck Models (CBM) by utilizing the image-text alignment learned by a large pre-trained vision-language model (i.e., CLIP). However, two key limitations remain in concept modeling. First, existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Second, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts and leads to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. Then, a novel concept contrastive optimization guides the model to handle the relative importance of the concept scores, so that concept predictions faithfully reflect the image content and alignment improves. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The code will be available upon acceptance.

Summary / 总结

The research aims to improve Concept Bottleneck Models (CBM) by addressing pre-training biases and mutual exclusivity issues. CoAt-CBM, a novel framework, uses learnable concept-wise visual queries to achieve fine-grained image-concept alignment and high interpretability. Experiments show that CoAt-CBM outperforms existing methods in terms of alignment and performance.

论文提出了细粒度概念瓶颈模型（CoAt-CBM），通过使用可学习的概念级视觉查询来获取细粒度的概念嵌入，并采用一种新颖的概念对比优化来提高对齐效果。实验表明，CoAt-CBM 在细粒度图像-概念对齐和可解释性方面优于现有最佳方法。

P3T: Prototypical Point-level Prompt Tuning with Enhanced Generalization for 3D Vision-Language Models

Authors: Geunyoung Jung, Soohong Kim, Kyungwoo Song, Jiyoung Jung

Venue: ICRA 2026

First: 2026-04-17T05:18:22+00:00 · Latest: 2026-04-17T05:18:22+00:00

Comments: Accepted by ICRA 2026

Abstract

With the rise of pre-trained models in the 3D point cloud domain for a wide range of real-world applications, adapting them to downstream tasks has become increasingly important. However, conventional full fine-tuning methods are computationally expensive and storage-intensive. Although prompt tuning has emerged as an efficient alternative, it often suffers from overfitting, thereby compromising generalization capability. To address this issue, we propose Prototypical Point-level Prompt Tuning (P$^3$T), a parameter-efficient prompt tuning method designed for pre-trained 3D vision-language models (VLMs). P$^3$T consists of two components: 1) Point Prompter, which generates instance-aware point-level prompts for the input point cloud, and 2) Text Prompter, which inserts learnable prompts into the input text instead of hand-crafted ones. Since both prompters operate directly on input data, P$^3$T enables task-specific adaptation of 3D VLMs without sacrificing generalizability. Furthermore, to enhance embedding space alignment, which is key to fine-tuning 3D VLMs, we introduce a prototypical loss that reduces intra-category variance. Extensive experiments demonstrate that our method matches or outperforms full fine-tuning in classification and few-shot learning, and further exhibits robust generalization under data shift in the cross-dataset setting. The code is available at https://github.com/gyjung975/P3T.
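To make the idea of a loss that reduces intra-category variance concrete, the sketch below computes one prototype per class as the mean embedding and penalizes the squared distance of each embedding to its own class prototype. The abstract does not give the exact form of the prototypical loss, so this mean-prototype, squared-distance formulation and all function names are assumptions chosen for illustration.

```python
def class_prototypes(embeddings, labels):
    # Prototype of each class = element-wise mean of its embeddings.
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(e))
        for i, x in enumerate(e):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def intra_class_loss(embeddings, labels):
    """Mean squared distance of each embedding to its class prototype (assumed form)."""
    protos = class_prototypes(embeddings, labels)
    total = 0.0
    for e, y in zip(embeddings, labels):
        total += sum((a - b) ** 2 for a, b in zip(e, protos[y]))
    return total / len(embeddings)

# Toy 2-D embeddings for two classes (assumed values).
spread = intra_class_loss([[0.0, 0.0], [2.0, 0.0], [5.0, 5.0], [5.0, 7.0]], [0, 0, 1, 1])
tight = intra_class_loss([[1.0, 0.0], [1.0, 0.0], [5.0, 6.0], [5.0, 6.0]], [0, 0, 1, 1])
```

The loss falls to zero when every embedding coincides with its class prototype, which is the sense in which minimizing it tightens each category's cluster.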

中文标题/摘要

标题：P3T：用于3D视觉语言模型的原型点级提示调优，增强泛化能力

随着预训练模型在3D点云领域广泛应用，适应下游任务变得越来越重要。然而，传统的全面微调方法计算成本高且存储密集。尽管提示调优作为一种高效的替代方法已经出现，但它往往容易过拟合，从而损害泛化能力。为了解决这个问题，我们提出了原型点级提示调优（P$^3$T），这是一种为预训练3D视觉语言模型（VLMs）设计的参数高效提示调优方法。P$^3$T包括两个组件：1）点提示器，为输入点云生成实例感知的点级提示；2）文本提示器，将可学习的提示引入输入文本，而不是手工设计的提示。由于两个提示器直接作用于输入数据，P$^3$T能够实现3D VLMs的任务特定适应，而不牺牲泛化能力。此外，为了增强嵌入空间对齐，这是微调3D VLMs的关键，我们引入了一种原型损失，以减少类别内的方差。广泛的实验表明，我们的方法在分类和少样本学习中与全面微调相当或更优，并且在跨数据集设置下表现出鲁棒的泛化能力。代码可在https://github.com/gyjung975/P3T获取。

VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck

Authors: Feiran Zhang, Yixin Wu, Zhenghua Wang, Xiaohua Wang, Changze Lv, Xuanjing Huang, Xiaoqing Zheng

First: 2026-01-09T05:58:22+00:00 · Latest: 2026-04-17T05:02:11+00:00

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal tasks, but remain susceptible to hallucinations, where generated text deviates from the underlying visual content. Existing hallucination detection methods primarily rely on output logits or external verification tools, often overlooking their internal mechanisms. In this work, we investigate the outputs of internal attention heads, postulating that specific heads carry the primary signals for truthful generation. However, directly probing these high-dimensional states is challenging due to the entanglement of visual-linguistic syntax and noise. To address this, we propose VIB-Probe, a novel hallucination detection and mitigation framework leveraging the Variational Information Bottleneck (VIB) theory. Our method extracts discriminative patterns across layers and heads while filtering out semantic nuisances through the information bottleneck principle. Furthermore, by leveraging the gradients of our VIB probe, we identify attention heads with strong causal influence on hallucinations and introduce an inference-time intervention strategy for hallucination mitigation. Extensive experiments across diverse benchmarks demonstrate that VIB-Probe significantly outperforms existing baselines in both settings. Our code will be made publicly available.
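The information bottleneck principle mentioned above is usually trained with the standard VIB objective: a prediction loss plus a beta-weighted KL term that compresses the latent code toward a prior. The sketch below shows that generic objective for a diagonal-Gaussian encoder and a standard normal prior. The function names, the beta value, and the single-example interface are assumptions; how VIB-Probe feeds attention-head outputs into its probe is not specified in the abstract and is not reproduced here.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def vib_loss(p_true_label, mu, log_var, beta=1e-3):
    """Generic VIB objective for one example (assumed interface).

    p_true_label: probability the probe assigns to the correct label
    mu, log_var:  encoder outputs parameterizing the latent Gaussian
    beta:         weight of the compression term (hypothetical value)
    """
    return -math.log(p_true_label) + beta * kl_to_standard_normal(mu, log_var)

kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # encoder equals the prior
kl_shifted = kl_to_standard_normal([1.0], [0.0])          # mean shifted by one unit
loss = vib_loss(0.5, [1.0], [0.0], beta=0.1)
```

Larger beta pushes the latent code toward the prior and discards more of the input, which is how the bottleneck filters nuisance information at the cost of some predictive signal.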

中文标题/摘要

标题：VIB-Probe：通过变分信息瓶颈检测和缓解视觉-语言模型中的幻觉

视觉-语言模型（VLMs）在多模态任务中取得了显著进展，但仍易受到幻觉的影响，即生成的文本与底层视觉内容不符。现有的幻觉检测方法主要依赖于输出logits或外部验证工具，往往忽视了其内部机制。在本文中，我们研究了内部注意力头的输出，假设特定的头携带着真实生成的主要信号。然而，直接探测这些高维状态由于视觉-语言语法和噪声的纠缠而具有挑战性。为了解决这一问题，我们提出了一种新的幻觉检测和缓解框架VIB-Probe，该框架利用了变分信息瓶颈（VIB）理论。我们的方法通过信息瓶颈原理提取各层和各头的判别模式，同时过滤掉语义噪声。此外，通过利用我们的VIB探针的梯度，我们识别出对幻觉有强烈因果影响的注意力头，并引入了一种推理时的干预策略以缓解幻觉。广泛的实验表明，VIB-Probe在各种基准上显著优于现有基线。我们的代码将公开发布。
