arXiv Paper Digest

Snapshot: 20260430_0430

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

Authors: Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov, Daniel C. Cole, Alejandro Gómez Frieiro, Elena O. Glen, Hao Hsu, Gang Huang, Raymond Jow, Greshma Shaji, Tom Lubowe, Ligeng Zhu, Luis Mantilla Calderón, Nicola Pancotti, Joel Pendleton, Brandon Severin, Charles Etienne Staub, Sara Sussman, Antti Vepsäläinen, Neel Rajeshbhai Vora, Yilun Xu, Varinia Bernales, Daniel Bowring, Elica Kyoseva, Ivan Rungger, Giulia Semeghini, Sam Stanwyck, Timothy Costa, Alán Aspuru-Guzik, Krysta Svore

First: 2026-04-28T17:28:33+00:00 · Latest: 2026-04-28T17:28:33+00:00

Abs · PDF · Code1 · Code2

Abstract

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.

From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

Authors: Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang, Dongdong Ge

First: 2026-04-28T16:53:37+00:00 · Latest: 2026-04-28T16:53:37+00:00

Comments: Working Paper

Abs · PDF · Code1 · Code2 · Code3

Abstract

Optimization modeling underpins real-world decision-making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural-language requirements remains challenging for current large language models (LLMs). In this paper, we propose Agora-Opt, a modular agentic framework for optimization modeling that combines decentralized debate with a read-write memory bank. Agora-Opt allows multiple agent teams to independently produce end-to-end solutions and reconcile them through an outcome-grounded debate protocol, while memory stores solver-verified artifacts and past disagreement resolutions to support training-free improvement over time. This design is flexible across both backbones and methods: it reduces base-model lock-in, transfers across different LLM families, and can be layered onto existing pipelines with minimal coupling. Across public benchmarks, Agora-Opt achieves the strongest overall performance among all compared methods, outperforming strong zero-shot LLMs, training-centric approaches, and prior agentic baselines. Further analyses show robust gains across backbone choices and component variants, and demonstrate that decentralized debate offers a structural advantage over centralized selection by enabling agents to refine candidate solutions through interaction and even recover correct formulations when all initial candidates are flawed. These results suggest that reliable optimization modeling benefits from combining collaborative cross-checking with reusable experience, and position Agora-Opt as a practical and extensible foundation for trustworthy optimization modeling assistance. Our code and data are available at https://github.com/CHIANGEL/Agora-Opt.

Personalization Toolkit: Training Free Personalization of Large Vision Language Models

Authors: Soroush Seifi, Vaggelis Dorovatas, Matteo Cassinelli, Fabien Despinoy, Daniel Olmeda Reino, Rahaf Aljundi

First: 2025-02-04T16:19:20+00:00 · Latest: 2026-04-28T16:36:23+00:00

Comments: Accepted at Transactions on Machine Learning Research (TMLR) 2026

Abs · PDF · Code1 · Code2

Abstract

Personalization of Large Vision-Language Models (LVLMs) involves customizing models to recognize specific users or object instances and to generate contextually tailored responses. Existing approaches rely on time-consuming training for each item, making them impractical for real-world deployment, as reflected in current personalization benchmarks limited to object-centric single-concept evaluations. In this paper, we present a novel training-free approach to LVLM personalization called Personalization Toolkit. We introduce a comprehensive, real-world benchmark designed to rigorously evaluate various aspects of the personalization task. Personalization Toolkit leverages pre-trained vision foundation models to extract distinctive features, applies retrieval-augmented generation (RAG) techniques to identify instances within visual inputs, and employs visual prompting strategies to guide model outputs. Our model-agnostic vision toolkit enables efficient and flexible multi-concept personalization across both images and videos, without any additional training. We achieve state-of-the-art results, surpassing existing training-based methods.

Summary

The research aims to address the impracticality of training LVLMs for personalization by proposing a training-free approach. The method uses pre-trained vision foundation models to extract features, retrieval-augmented generation techniques to identify instances, and visual prompting to guide model outputs. The approach achieves state-of-the-art results in multi-concept personalization across images and videos without additional training, surpassing existing training-based methods.

Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

Authors: Yashwant Pravinrao Bangde, Debaditya Roy

First: 2026-04-28T16:18:31+00:00 · Latest: 2026-04-28T16:18:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD2), maintains two parallel probability distributions of tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrast-based gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD2 on multiple datasets spanning various generative vision-language reasoning tasks such as captioning and visual question answering, including POPE, MME, VQAv2, AMBER, MS-COCO, and LLaVA-Bench. IECD2 demonstrates consistent improvements in task accuracy and reasoning performance, alongside a substantial reduction in hallucination across all evaluation metrics compared to state-of-the-art decoding approaches.
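
The dual-stream fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not give the gate's formula, so the `fuse_streams` function, its `sharpness` parameter, and the exponential form of the gate are all assumptions made for illustration; only the use of a symmetric KL divergence between the instruction-driven and evidence-driven token distributions comes from the text.

```python
import math

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence, KL(p||q) + KL(q||p), between two token distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return kl_pq + kl_qp

def fuse_streams(p_instr, p_evid, sharpness=1.0):
    """Hypothetical gate (not the paper's formula): lean on the evidence
    stream more heavily as the two streams disagree."""
    divergence = symmetric_kl(p_instr, p_evid)
    # gate is 0 when the streams agree and approaches 1 as they diverge
    gate = 1.0 - math.exp(-sharpness * divergence)
    mixed = [(1.0 - gate) * pi + gate * pe for pi, pe in zip(p_instr, p_evid)]
    total = sum(mixed)
    return [m / total for m in mixed], gate

# Token 0 is favored by the language prior but not by the image evidence.
p_instr = [0.8, 0.1, 0.1]
p_evid = [0.1, 0.8, 0.1]
fused, gate = fuse_streams(p_instr, p_evid)
print(fused, gate)
```

When the two distributions match, the gate is zero and the instruction-driven distribution passes through unchanged, which mirrors the abstract's statement that tokens are preserved when both streams agree.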

MemeScouts@LT-EDI 2026: Asking the Right Questions -- Prompted Weak Supervision for Meme Hate Speech Detection

Authors: Ivo Bueno, Lea Hirlimann, Enkelejda Kasneci

First: 2026-04-27T08:36:23+00:00 · Latest: 2026-04-28T15:12:11+00:00

Comments: Accepted at Sixth Workshop on Language Technology for Equality, Diversity and Inclusion at ACL2026 (LT-EDI@ACL26)

Abs · PDF · Code1 · Code2

Abstract

Detecting hate speech in memes is challenging due to their multimodal nature and subtle, culturally grounded cues such as sarcasm and context. While recent vision-language models (VLMs) enable joint reasoning over text and images, end-to-end prompting can be brittle, as a single prediction must resolve target, stance, implicitness, and irony. These challenges are amplified in multilingual settings. We propose a prompted weak supervision (PWS) approach that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection in the LT-EDI 2026 shared task. Using a quantized Qwen3-VLM to extract features by answering targeted questions, our method outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement via error-driven LF expansion and feature pruning reduces redundancy and improves generalization. Our results highlight the effectiveness of prompted weak supervision for multilingual multimodal hate speech detection.
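
The decomposition into question-based labeling functions can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ask` callable stands in for a call to a vision-language model, the question texts and answer options are invented, and the one-hot encoding of answers is an assumption about how the answers might be turned into features.

```python
def apply_labeling_functions(meme, questions, ask):
    """Turn constrained answers to targeted questions into a feature vector.

    `ask(meme, text, options)` is a hypothetical stand-in for a VLM call.
    Answers outside the allowed options leave that question's slots at zero,
    which acts as an abstention.
    """
    features = []
    for question in questions:
        answer = ask(meme, question["text"], question["options"])
        features.extend(1 if answer == option else 0 for option in question["options"])
    return features

# Hypothetical questions with constrained answer options.
QUESTIONS = [
    {"text": "Does the meme target a group?", "options": ["yes", "no"]},
    {"text": "Is the tone ironic?", "options": ["yes", "no", "unclear"]},
]

def stub_ask(meme, text, options):
    # Stub in place of a real model: always picks the first option.
    return options[0]

print(apply_labeling_functions({"id": 1}, QUESTIONS, stub_ask))
```

Each question contributes a small, interpretable group of features, so a single end-to-end prediction no longer has to resolve target, stance, implicitness, and irony at once.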

Summary

The paper addresses the challenge of detecting hate speech in memes, which are multimodal and contain subtle cultural cues. It proposes a prompted weak supervision (PWS) method that decomposes meme understanding into targeted labeling functions with constrained answers for homophobia and transphobia detection. The method uses a quantized Qwen3-VLM to extract features by answering targeted questions and outperforms direct VLM classification, especially for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement through error-driven labeling function expansion and feature pruning improves generalization.

A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Authors: Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

First: 2026-03-28T17:18:40+00:00 · Latest: 2026-04-28T14:44:21+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites (specifically, those requiring visual recognition). Since surgery requires integrating disparate tasks, generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

Cross-Lingual Jailbreak Detection via Semantic Codebooks

Authors: Shirin Alanova, Bogdan Minko, Sabrina Sadiekh, Evgeniy Kokuykin

First: 2026-04-28T14:43:40+00:00 · Latest: 2026-04-28T14:43:40+00:00

Abs · PDF · Code1 · Code2

Abstract

Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts, operating as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. However, under distribution shift, on behaviorally diverse and heterogeneous unsafe benchmarks, separability degrades markedly (AUC ≈ 0.60-0.70), and recall in the security-critical low-FPR regime drops across all embedding models.
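
The codebook comparison can be sketched in a few lines. This is a minimal illustration under stated assumptions: the toy 3-dimensional vectors stand in for the output of a multilingual embedding model, and the cosine similarity measure, the max-over-codebook scoring rule, and the `threshold` value are all hypothetical choices rather than details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def max_codebook_similarity(query_vec, codebook):
    """Score a query by its closest match among the fixed codebook embeddings."""
    return max(cosine(query_vec, entry) for entry in codebook)

def is_flagged(query_vec, codebook, threshold):
    """Flag the query when its best codebook match reaches the threshold."""
    return max_codebook_similarity(query_vec, codebook) >= threshold

# Toy stand-ins for embeddings of English jailbreak prompts.
CODEBOOK = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(is_flagged([0.9, 0.1, 0.0], CODEBOOK, threshold=0.8))  # close to an entry
print(is_flagged([0.0, 0.0, 1.0], CODEBOOK, threshold=0.8))  # unrelated direction
```

In practice the threshold would be chosen to meet a target false-positive rate, which is the low-FPR regime the abstract evaluates.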

Summary

This study addresses the cross-lingual security gap in large language models by developing a language-agnostic method that uses semantic codebooks to detect jailbreak attempts. The method compares multilingual query embeddings against a fixed English codebook, providing a training-free external guardrail for black-box LLMs. Evaluations across four languages, multiple translation pipelines, and various safety benchmarks show that semantic similarity reliably generalizes for curated jailbreak templates, achieving high separability and reducing attack success rates. However, under distribution shift, separability decreases, and recall drops in the low-FPR regime.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

Authors: Chengsheng Zhang, Chenghao Sun, Xinyan Jiang, Wei Li, Xinmei Tian

Venue: CVPR 2026

First: 2026-04-28T13:42:27+00:00 · Latest: 2026-04-28T13:42:27+00:00

Comments: Accepted by CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable progress in visual-textual understanding, yet their reliability is critically undermined by hallucinations, i.e., the generation of factually incorrect or inconsistent responses. While recent studies using steering vectors demonstrated promise in reducing hallucinations, a notable challenge remains: they inadvertently amplify the severity of residual hallucinations. We attribute this to their exclusive focus on the decoding stage, where errors accumulate autoregressively and progressively worsen subsequent hallucinatory outputs. To address this, we propose Prefill-Time Intervention (PTI), a novel steering paradigm that intervenes only once during the prefill stage, enhancing the initial Key-Value (KV) cache before error accumulation occurs. Specifically, PTI is modality-aware, deriving distinct directions for visual and textual representations. This intervention is decoupled to steer keys toward visually-grounded objects and values to filter background noise, correcting hallucination-prone representations at their source. Extensive experiments demonstrate PTI's significant performance in mitigating hallucinations and its generalizability across diverse decoding strategies, LVLMs, and benchmarks. Moreover, PTI is orthogonal to existing decoding-stage methods, enabling plug-and-play integration and further boosting performance. Code is available at: https://github.com/huaiyi66/PTI.
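
A one-shot, modality-aware edit of the prefill KV cache might look like the sketch below. The abstract does not state how the steering directions are derived or applied, so the additive update, the `alpha` strength, and the `key_dirs` / `value_dirs` dictionaries are assumptions for illustration only; plain Python lists stand in for tensors.

```python
def steer_prefill_cache(keys, values, is_visual, key_dirs, value_dirs, alpha=0.1):
    """Shift each cached key and value once, using a direction chosen by token modality.

    Hypothetical additive form: new = old + alpha * direction. `key_dirs` and
    `value_dirs` map "visual" / "text" to direction vectors.
    """
    new_keys, new_values = [], []
    for k, v, vis in zip(keys, values, is_visual):
        modality = "visual" if vis else "text"
        new_keys.append([x + alpha * d for x, d in zip(k, key_dirs[modality])])
        new_values.append([x + alpha * d for x, d in zip(v, value_dirs[modality])])
    return new_keys, new_values

# Two cached positions: one image token, one text token.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
key_dirs = {"visual": [1.0, 0.0], "text": [0.0, 0.0]}
value_dirs = {"visual": [0.0, -1.0], "text": [0.0, 0.0]}

steered_keys, steered_values = steer_prefill_cache(
    keys, values, [True, False], key_dirs, value_dirs, alpha=0.5
)
print(steered_keys, steered_values)
```

Because the cache is modified once before decoding starts, every generated token attends to the adjusted keys and values, which is the contrast with decoding-stage steering that the abstract draws.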

Summary

The research addresses hallucinations in large vision-language models by proposing Prefill-Time Intervention (PTI), which intervenes only once, during the prefill stage, to enhance the initial Key-Value (KV) cache before errors accumulate. PTI is modality-aware: it derives distinct directions for visual and textual representations, steering keys toward visually grounded objects and values to filter background noise. Experiments show that PTI mitigates hallucinations across diverse decoding strategies, LVLMs, and benchmarks, and that it is orthogonal to existing decoding-stage methods, so it can be combined with them for further gains.

AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?

Authors: Shambhavi Mishra, Gaurav Sahu, Marco Pedersoli, Laurent Charlin, Jose Dolz, Christopher Pal

First: 2025-10-06T22:50:41+00:00 · Latest: 2026-04-28T12:36:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Can large language models solve AI research problems using only their parametric knowledge, without fine-tuning, retrieval, or other external aids? We introduce AInstein, a framework for testing whether LLM agents can generate and refine solutions to research problems through iterative critique loops. A blind study with 20 domain experts on held-out ICLR 2026 problems validates our automated metrics, which we then scale to 1,214 ICLR 2025 papers using an LLM-as-a-judge paradigm. Two metrics capture complementary aspects of performance: Success Rate (does the solution address the problem?) and Rediscovery (does it match the published approach?). LLMs succeed on over 70% of problems, yet strictly rediscover the published solution less than 19% of the time, suggesting genuine problem-solving rather than associative recall. However, this ability has clear limits: models handle familiar methodological territory well but fail when solutions require cross-domain analogical transfer, a pattern we call the parametric knowledge boundary. On the ResearchPlanGen benchmark (2,645 problems), our training-free iterative refinement strategy matches RL finetuning, and a criteria-coverage analysis pins down the ceiling of what test-time refinement alone can achieve. Together, these findings map both the capabilities and the limits of LLMs as autonomous scientific problem-solvers.

Summary

AInstein evaluates whether large language models can solve AI research problems using only their parametric knowledge, generating and refining solutions through iterative critique loops. A blind study with 20 domain experts on held-out ICLR 2026 problems validates the automated metrics, which are then scaled to 1,214 ICLR 2025 papers with an LLM-as-a-judge setup. LLMs succeed on over 70% of problems but strictly rediscover the published solution less than 19% of the time, suggesting genuine problem-solving rather than associative recall. However, models fail when solutions require cross-domain analogical transfer, a pattern the authors call the parametric knowledge boundary. On the ResearchPlanGen benchmark (2,645 problems), the training-free iterative refinement strategy matches RL finetuning, and a criteria-coverage analysis identifies the ceiling of what test-time refinement alone can achieve.

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Authors: Mengyao Du, Han Fang, Haokai Ma, Jiahao Chen, Kai Xu, Quanjun Yin, Ee-Chien Chang

First: 2026-04-28T12:32:21+00:00 · Latest: 2026-04-28T12:32:21+00:00

Comments: 10 pages, 7 figures

Abs · PDF · Code1 · Code2

Abstract

Web agents have emerged as an effective paradigm for automating interactions with complex web environments, yet remain vulnerable to prompt injection attacks that embed malicious instructions into webpage content to induce unintended actions. This threat is further amplified for screenshot-based web agents, which operate on rendered visual webpages rather than structured textual representations, making predominant text-centric defenses ineffective. Although multimodal detection methods have been explored, they often rely on large vision-language models (VLMs), incurring significant computational overhead. The bottleneck lies in the complexity of modern webpages: VLMs must comprehend the global semantics of an entire page, resulting in substantial inference time and GPU memory usage. This raises a critical question: can we detect prompt injection attacks from screenshots in a lightweight manner? In this paper, we observe that injected webpages exhibit distinct characteristics compared to benign ones from both visual and textual perspectives. Building on this insight, we propose SnapGuard, a lightweight yet accurate method that reformulates prompt injection detection as multimodal representation analysis over webpage screenshots. SnapGuard leverages two complementary signals: a visual stability indicator that identifies abnormally smooth gradient distributions induced by malicious content, and action-oriented textual signals recovered via contrast-polarity reversal. Extensive evaluations across eight attacks and two benign settings demonstrate that SnapGuard achieves an F1 score of 0.75, outperforming GPT-4o-prompt while being 8x faster (1.81s vs. 14.50s) and introducing no additional memory overhead.

GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

Authors: Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu

Venue: ACL 2026

First: 2026-01-08T16:58:07+00:00 · Latest: 2026-04-28T12:22:24+00:00

Comments: Accepted to ACL 2026 Findings. Code available at https://github.com/Zengwh02/GlimpRouter

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the "Aha Moment" phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.
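
The routing rule in the abstract (send a step to the larger model only when the entropy of its first token exceeds a threshold) can be sketched as below. The function names, the use of natural-log Shannon entropy, and the example `threshold` of 0.5 are assumptions for illustration; the abstract does not specify the threshold or the entropy base.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def route_step(first_token_probs, threshold):
    """Pick a model for a reasoning step from the small model's first-token distribution."""
    entropy = token_entropy(first_token_probs)
    target = "large" if entropy > threshold else "small"
    return target, entropy

# A confident first token stays on the small model; an uncertain one is escalated.
print(route_step([0.97, 0.01, 0.01, 0.01], threshold=0.5))
print(route_step([0.25, 0.25, 0.25, 0.25], threshold=0.5))
```

Only one token is generated before the routing decision, so the check is cheap relative to verifying a full step after the fact.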

C3G: Learning Compact 3D Representations with 2K Gaussians

Authors: Honggyu An, Jaewoo Jung, Mungyeom Kim, Chaehyun Kim, Minkyeong Jeon, Jisang Han, Kazumi Fukuda, Takuya Narihira, Hyuna Ko, Junsu Kim, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim

First: 2025-12-03T17:59:05+00:00 · Latest: 2026-04-28T10:44:53+00:00

Comments: Project Page : https://cvlab-kaist.github.io/C3G/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.

Summary

C3G is a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations to minimize redundancy and enable effective feature lifting. It uses learnable tokens to aggregate multi-view features through self-attention, guiding Gaussian generation and ensuring each Gaussian integrates relevant visual features across views. Experiments show that C3G achieves superior memory efficiency and feature fidelity compared to existing methods in pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation.

SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks

Authors: Zi-Yang Bo, Wei Lu, Hongruixuan Chen, Si-Bao Chen, Bin Luo

First: 2026-04-28T09:38:02+00:00 · Latest: 2026-04-28T09:38:02+00:00

Comments: 17 pages, 14 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow-Aware and Removal Unified (SARU) Framework, a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N²SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments demonstrate that SARU achieves state-of-the-art performance on both the public AISD dataset and our newly introduced benchmarks. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The source code and datasets are publicly available at: https://github.com/AeroVILab-AHU/SARU-Framework.

HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

Authors: Bingzi Zhang, Kaisi Guan, Ruihua Song

Venue: ICME 2026

First: 2026-04-28T08:27:35+00:00 · Latest: 2026-04-28T08:27:35+00:00

Comments: Accepted to the 2026 IEEE International Conference on Multimedia and Expo (ICME 2026)

Abs · PDF · Code1 · Code2

Abstract

Video generation models have developed rapidly in recent years, where generating natural human motion plays a pivotal role. However, accurately evaluating the quality of generated human motion video remains a significant challenge. Existing evaluation metrics primarily focus on global scene statistics, often overlooking fine-grained human details and consequently failing to align with human subjective preference. To bridge this gap, we propose HuM-Eval, a novel human-centric evaluation framework that adopts a coarse-to-fine strategy. Specifically, our framework first utilizes a Vision Language Model to perform a coarse assessment of global video quality. It then proceeds to a fine-grained analysis, using 2D pose to verify anatomical correctness and 3D human motion to evaluate motion stability. Extensive experiments demonstrate that HuM-Eval achieves an average human correlation of 58.2%, outperforming state-of-the-art baselines. Furthermore, we introduce HuM-Bench, a comprehensive benchmark comprising 1,000 diverse prompts, and conduct a detailed evaluation of existing text-to-video models, paving the way for next-generation human motion generation.

Summary / 总结

HuM-Eval is a human-centric evaluation framework for generated human motion videos, addressing the limitations of existing metrics by focusing on both global and fine-grained aspects. It uses a coarse-to-fine strategy, starting with a Vision Language Model for global quality assessment and then analyzing 2D pose and 3D motion for anatomical correctness and motion stability. Experimental results show that HuM-Eval achieves an average human correlation of 58.2%, surpassing current state-of-the-art methods.

HuM-Eval 是一个针对生成的人体动作视频的人本评价框架，通过关注全局和细粒度方面来解决现有指标的局限性。它采用从粗到细的策略，首先使用视觉语言模型进行全局质量评估，然后分析2D姿态和3D动作以验证解剖正确性和动作稳定性。实验结果显示，HuM-Eval 的平均人类相关性为58.2%，优于当前最先进的方法。

Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference

Authors: Xiaowei Mao, Bowen Sui, Weijie Zhang, Yawen Yang, Shengnan Guo, Shilong Zhao, Jiaqi Lin, Tingrui Wu, Youfang Lin, Huaiyu Wa

First: 2026-04-26T14:09:55+00:00 · Latest: 2026-04-28T08:05:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.

Summary / 总结

The paper addresses the challenge of detecting anomalies in far-field expressway surveillance videos, where subtle vehicle motions are hard to identify. It proposes VIBES, an asynchronous collaborative framework that uses VLMs guided by Bayesian inference. VIBES introduces an online Bayesian inference module to dynamically update the boundaries of normal driving behaviors, triggering the VLM to process only the localized visual regions, thus improving detection accuracy and reducing computational costs while maintaining real-time efficiency and explainability.

论文针对远距离高速公路监控视频中难以识别的细微车辆异常运动问题，提出了一种异步协作框架VIBES，利用贝叶斯推理引导视觉语言模型。VIBES引入了一个在线贝叶斯推理模块，动态更新正常驾驶行为的边界，触发视觉语言模型仅处理触发的局部视觉区域，从而提高检测准确性并减少计算成本，同时保持实时效率和可解释性。
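
The trigger logic described above (continuously update a probabilistic boundary of normal behavior, and hand only out-of-boundary events to the VLM) can be sketched as follows. The abstract does not specify the probabilistic model, so this sketch assumes a conjugate Gaussian update on a single scalar feature (vehicle speed) with known observation variance. The class name, the `k` threshold, and the speed values are all hypothetical.

```python
import math

class OnlineGaussianBoundary:
    """Posterior over the mean of a 'normal' feature, updated online."""

    def __init__(self, prior_mean, prior_var, obs_var, k=3.0):
        self.mean = prior_mean    # posterior mean of normal speed
        self.var = prior_var      # posterior variance of that mean
        self.obs_var = obs_var    # assumed known observation noise
        self.k = k                # boundary width in predictive std units

    def is_anomalous(self, x):
        # Predictive std combines posterior uncertainty and observation noise.
        pred_std = math.sqrt(self.var + self.obs_var)
        return abs(x - self.mean) > self.k * pred_std

    def update(self, x):
        # Standard Normal-Normal conjugate update for a known variance.
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mean = (self.mean / self.var + x / self.obs_var) / precision
        self.var = 1.0 / precision

def find_triggers(speeds, model):
    """Return time indices that would be forwarded to the VLM."""
    triggers = []
    for t, x in enumerate(speeds):
        if model.is_anomalous(x):
            triggers.append(t)    # anomalies do not update the boundary
        else:
            model.update(x)
    return triggers

speeds = [100.0, 102.0, 98.0, 101.0, 99.0, 20.0, 100.0]
model = OnlineGaussianBoundary(prior_mean=100.0, prior_var=25.0, obs_var=4.0)
triggers = find_triggers(speeds, model)
```

Only the frames and regions at the trigger indices would be cropped and sent for semantic reasoning, which is where the compute saving in the abstract would come from.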

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Authors: Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, Amit Ranjan Trivedi

First: 2026-04-28T05:30:18+00:00 · Latest: 2026-04-28T05:30:18+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty

Summary / 总结

The study investigates the reliability of vision-language models (VLMs) as judges in multimodal systems using conformal prediction, which converts scores into calibrated prediction intervals without retraining. Across three VLMs and 14 visual tasks, the research reveals that evaluation uncertainty varies significantly depending on the task, with intervals covering 40-70% of the score range. This indicates that VLMs can rank responses reliably but struggle to provide accurate absolute scores, especially for complex tasks like chart and mathematical reasoning. The study also finds that interval width is influenced by task difficulty and annotation quality, highlighting a failure mode where high ranking correlation does not guarantee reliable scoring.

研究使用分布无关的校准预测方法，将视觉语言模型（VLM）的评分转换为校准的预测区间，无需重新训练，分析了VLM在14种视觉任务类别中的评价可靠性。结果显示，评价不确定性随着任务的不同而显著变化，区间覆盖评分范围的40%-70%。这表明VLM可以可靠地对响应进行排名，但在提供准确的绝对评分方面存在困难，尤其是在复杂的图表和数学推理任务中。研究还发现，区间宽度主要受任务难度和注释质量的影响，揭示了一种标准评估指标未能捕捉到的失败模式，即高排名相关性并不保证可靠的评分。
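
To make the idea of turning a point score into a calibrated interval concrete, here is a minimal split conformal sketch. It uses the textbook absolute-residual nonconformity score on a held-out calibration set, which is an assumption: the paper derives its intervals from score-token log-probabilities, and that construction is not reproduced here. The calibration scores and the 1 to 10 score range are made up for illustration.

```python
import math

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")   # too few calibration points for this alpha
    return sorted(residuals)[k - 1]

def prediction_interval(point_score, q, lo=1.0, hi=10.0):
    """Symmetric interval around the judge score, clipped to the score range."""
    return max(lo, point_score - q), min(hi, point_score + q)

# Hypothetical calibration data: judge scores vs. human reference scores.
calib_pred = [7.0, 5.0, 8.0, 6.0, 4.0, 9.0, 7.0, 3.0, 6.0]
calib_true = [6.0, 5.0, 9.0, 4.0, 4.0, 8.0, 7.0, 5.0, 6.0]
residuals = [abs(p - t) for p, t in zip(calib_pred, calib_true)]

q = conformal_quantile(residuals, alpha=0.2)
interval = prediction_interval(7.0, q)
```

A judge can order responses correctly while `q` stays large, which is the ranking-scoring decoupling the abstract describes: the ranking depends on relative scores, the interval width on absolute error.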

DRAGON: A Benchmark for Evidence-Grounded Visual Reasoning over Diagrams

Authors: Anirudh Iyengar Kaniyar Narayana Iyengar, Tampu Ravi Kumar, Gaurav Najpande, Manan Suri, Dinesh Manocha, Puneet Mathur, Vivek Gupta

First: 2026-04-28T05:24:05+00:00 · Latest: 2026-04-28T05:24:05+00:00

Comments: 22 Pages, 14 Figures

Abs · PDF · Code1 · Code2

Abstract

Diagram question answering (DQA) requires models to interpret structured visual representations such as charts, maps, infographics, circuit schematics, and scientific diagrams. Recent vision-language models (VLMs) often achieve high answer accuracy on these tasks, yet correct answers do not guarantee that models ground their reasoning in the diagram regions that support the prediction. Models may instead rely on textual correlations or dataset artifacts without identifying the visual evidence required to verify the answer. This limitation prevents reliable evaluation of diagram reasoning and reduces interpretability. We introduce DRAGON, a benchmark for evaluating evidence-grounded visual reasoning in diagrams. Given a diagram, a question, and the correct answer, a model must predict bounding boxes that correspond to the visual elements required to justify the answer. These evidence regions may include answer-bearing components, textual labels, legends, axes, connectors, and other supporting structures involved in the reasoning process. The DRAGON dataset contains 11,664 annotated question instances collected from six diagram QA datasets: ChartQA, Circuit-VQA, InfographicsVQA, MapIQ, MapWise, and AI2D. We release a 2,445-instance benchmark test set with human-verified reasoning evidence annotations and a standardized evaluation framework. We evaluate eight recent VLMs and analyze their ability to localize reasoning evidence across diverse diagram domains. DRAGON enables systematic evaluation of diagram reasoning and supports future research on models that ground their predictions in visual evidence.

中文标题/摘要

标题：DRAGON：基于证据的图表视觉推理基准

图表问答（DQA）要求模型解释结构化的视觉表示，如图表、地图、信息图、电路图和科学图表。最近的视觉-语言模型（VLMs）在这些任务上通常能获得高答案准确率，但正确的答案并不保证模型将其推理基于支持预测的图表区域。模型可能依赖于文本相关性或数据集中的异常现象，而不是识别验证答案所需的视觉证据。这一限制阻碍了图表推理的可靠评估，并降低了模型的可解释性。我们引入了DRAGON，一个用于评估图表中基于证据的视觉推理的基准。给定一个图表、一个问题和正确答案，模型必须预测与答案相关的视觉元素对应的边界框。这些证据区域可能包括答案承载组件、文本标签、图例、轴、连接器和其他参与推理过程的支持结构。DRAGON数据集包含从六个图表问答数据集中收集的11,664个注释问题实例：ChartQA、Circuit-VQA、InfographicsVQA、MapIQ、MapWise和AI2D。我们发布了一个包含2,445个实例的基准测试集，其中包含由人类验证的推理证据注释，并提供了一个标准化的评估框架。我们评估了八个最近的VLMs，并分析了它们在不同图表领域定位推理证据的能力。DRAGON使图表推理的系统评估成为可能，并支持未来研究基于视觉证据进行预测的模型。
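
Evaluating predicted evidence boxes requires comparing them with annotated regions. The abstract does not name the metric used by the benchmark, so the sketch below assumes intersection-over-union (IoU), a common choice for box localization. The box format `(x1, y1, x2, y2)` and the function name are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

predicted = (0.0, 0.0, 2.0, 2.0)
annotated = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted, annotated)
```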

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

Authors: Brandon Collins, Logan Bolton, Hung Huy Nguyen, Mohammad Reza Taesiri, Trung Bui, Anh Totti Nguyen

First: 2026-04-23T22:33:15+00:00 · Latest: 2026-04-28T04:48:22+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision-language models (VLMs) such as Gemini-3-Pro and GPT-5 only respond with text, which can be difficult for users to verify. We present SketchVLM, a training-free, model-agnostic framework that enables VLMs to produce non-destructive, editable SVG overlays on the input image to visually explain their answers. Across seven benchmarks spanning visual reasoning (maze navigation, ball-drop trajectory prediction, and object counting) and drawing (part labeling, connecting-the-dots, and drawing shapes around objects), SketchVLM improves visual reasoning task accuracy by up to +28.5 percentage points and annotation quality by up to 1.48x relative to image-editing and fine-tuned sketching baselines, while also producing annotations that are more faithful to the model's stated answer. We find that single-turn generation already achieves strong accuracy and annotation quality, and multi-turn generation opens up further opportunities for human-AI collaboration. An interactive demo and code are at https://sketchvlm.github.io/.

中文标题/摘要

标题：SketchVLM：视觉语言模型可以标注图像以解释想法并引导用户

在回答图像相关问题时，人类自然会指指点点、标注和绘制来解释他们的推理。相比之下，现代视觉-语言模型（VLMs）如Gemini-3-Pro和GPT-5仅以文本形式回应，这使得用户难以验证。我们提出了SketchVLM，这是一种无需训练、模型无关的框架，使VLMs能够在输入图像上生成非破坏性的、可编辑的SVG叠加层，以可视化地解释其答案。在七个涵盖视觉推理（迷宫导航、球落下轨迹预测和物体计数）和绘画（部分标注、连线和围绕物体绘制形状）的基准测试中，SketchVLM在视觉推理任务准确性上提高了最高28.5个百分点，在注释质量上提高了最高1.48倍，同时生成的注释更忠于模型声明的答案。我们发现单轮生成已经实现了强大的准确性和注释质量，而多轮生成则为人类-人工智能协作提供了进一步的机会。交互式演示和代码可在https://sketchvlm.github.io/找到。
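
A non-destructive overlay means the annotation lives in a separate SVG layer rather than in the image pixels. The sketch below builds such a layer with Python's standard XML library. It only illustrates the output format; how SketchVLM prompts the model to produce the SVG is not covered, and the function name, path points, and label are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_overlay(width, height, points, label):
    """Return an SVG string sized to the image, with a path and a text label."""
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    # The polyline could mark, for example, a route through a maze.
    ET.SubElement(svg, "polyline", {
        "points": " ".join(f"{x},{y}" for x, y in points),
        "fill": "none",
        "stroke": "red",
        "stroke-width": "3",
    })
    # Place the label next to the last point of the path.
    x, y = points[-1]
    text = ET.SubElement(svg, "text", {"x": str(x + 5), "y": str(y), "fill": "red"})
    text.text = label
    return ET.tostring(svg, encoding="unicode")

overlay = build_overlay(640, 480, [(10, 10), (200, 10), (200, 300)], "exit")
```

Because the overlay is plain SVG, a user can edit or delete individual elements without touching the underlying image.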

From Scene to Object: Text-Guided Dual-Gaze Prediction

Authors: Zehong Ke, Yanbo Jiang, Jinhao Li, Zhiyuan Liu, Yiqian Tu, Qingwen Meng, Heye Huang, Jianqiang Wang

First: 2026-04-22T05:11:59+00:00 · Latest: 2026-04-28T03:54:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Interpretable driver attention prediction is crucial for human-like autonomous driving. However, existing datasets provide only scene-level global gaze rather than fine-grained object-level annotations, inherently failing to support text-grounded cognitive modeling. Consequently, while Vision-Language Models (VLMs) hold great potential for semantic reasoning, this critical data limitation leads to severe text-vision decoupling and visual-bias hallucinations. To break this bottleneck and achieve precise object-level attention prediction, this paper proposes a novel dual-branch gaze prediction framework, establishing a complete paradigm from data construction to model architecture. First, we construct G-W3DA, an object-level driver attention dataset. By integrating a multimodal large language model with the Segment Anything Model 3 (SAM3), we decouple macroscopic heatmaps into object-level masks under rigorous cross-validation, fundamentally eliminating annotation hallucinations. Building upon this high-quality data foundation, we propose the DualGaze-VLM architecture. This architecture extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate, achieving intent-driven precise spatial anchoring. Extensive experiments on the W3DA benchmark demonstrate that DualGaze-VLM consistently surpasses existing state-of-the-art (SOTA) models in spatial alignment metrics, notably achieving up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios. Furthermore, a visual Turing test reveals that the attention heatmaps generated by DualGaze-VLM are perceived as authentic by 88.22% of human evaluators, proving its capability to generate rational cognitive priors.

Summary / 总结

This study addresses the lack of object-level annotations in existing driver attention datasets, which limits text-grounded cognitive modeling for autonomous driving. It constructs G-W3DA, an object-level driver attention dataset, and proposes the DualGaze-VLM architecture, which extracts the hidden states of semantic queries and dynamically modulates visual features via a Condition-Aware SE-Gate for precise spatial anchoring. Experiments on the W3DA benchmark show that DualGaze-VLM outperforms SOTA models, with up to a 17.8% improvement in Similarity (SIM) under safety-critical scenarios, and in a visual Turing test 88.22% of human evaluators perceived its attention heatmaps as authentic.

SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition

Authors: Jingxiao Yang, DaLin He, Miao Pan, Kaixiang Yao, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang

First: 2026-03-18T13:49:27+00:00 · Latest: 2026-04-28T03:35:25+00:00

Comments: preprint, under review

Abs · PDF · Code1 · Code2

Abstract

Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) they apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) the lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.

中文标题/摘要

标题：SARE：样本自适应推理用于无需训练的细粒度视觉识别

大型视觉-语言模型（LVLMs）的最新进展使无需训练的细粒度视觉识别（FGVR）成为可能。然而，有效利用LVLMs进行FGVR仍然具有挑战性，因为子类别级别的固有视觉模糊性。现有方法主要采用检索导向或推理导向的方法来应对这一挑战，但都受到两个基本限制的制约：(1) 它们对所有样本应用相同的推理管道，而不考虑识别难度的不均衡性，从而导致性能和效率不佳；(2) 缺乏机制来整合和重用特定错误的经验，导致在类似具有挑战性的案例上重复失败。为了解决这些限制，我们提出了SARE，一种样本自适应推理框架，用于无需训练的FGVR。具体而言，SARE采用级联设计，结合快速候选检索与细粒度推理，仅在必要时调用后者。在推理过程中，SARE引入了一种自我反思的经验机制，利用过去的失败来在推理过程中提供可转移的判别性指导，而无需更新任何参数。在14个数据集上的广泛实验表明，SARE在保持高性能的同时显著减少了计算开销。

Summary / 总结

SARE is a Sample-wise Adaptive Reasoning framework for training-free Fine-Grained Visual Recognition, addressing the limitations of existing methods by adapting inference pipelines to sample-specific recognition difficulties and leveraging past failures for discriminative guidance. Experiments across 14 datasets show that SARE outperforms existing methods while reducing computational overhead.

论文提出了一种基于样本自适应推理的SARE框架，以解决使用大型视觉语言模型进行细粒度视觉识别（FGVR）的挑战。SARE通过根据每个样本的识别难度调整推理管道，并利用过去的错误经验避免重复失败，从而改进了现有方法。实验表明，SARE在保持高性能的同时显著减少了计算开销。
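
The cascaded design (fast retrieval first, fine-grained reasoning only when necessary) can be sketched as a confidence gate. The abstract does not say how SARE decides that reasoning is necessary, so this sketch assumes a simple top-1 versus top-2 score margin. `cascade_predict`, the `margin` value, and the stub reasoner are hypothetical.

```python
def cascade_predict(scores, slow_reasoner, margin=0.2, top_k=3):
    """Accept the fast retrieval result unless the top candidates are too close.

    scores: dict mapping label -> retrieval similarity (at least two labels).
    slow_reasoner: callable taking a shortlist of labels and returning one.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_label, top_score), (_, second_score) = ranked[0], ranked[1]
    if top_score - second_score >= margin:
        return top_label, "fast"          # easy sample: skip reasoning
    shortlist = [label for label, _ in ranked[:top_k]]
    return slow_reasoner(shortlist), "slow"  # ambiguous sample: reason

def stub_reasoner(shortlist):
    # Stand-in for an LVLM call; here it just prefers a fixed label.
    return "herring gull" if "herring gull" in shortlist else shortlist[0]

easy = cascade_predict({"cardinal": 0.9, "blue jay": 0.3}, stub_reasoner)
hard = cascade_predict(
    {"ring-billed gull": 0.45, "herring gull": 0.42, "tern": 0.10},
    stub_reasoner,
)
```

The efficiency gain comes from the first branch: any sample that clears the margin never reaches the expensive reasoning call.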

MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents

Authors: Ruihan Chen, Qiming Li, Xiaocheng Feng, Weihong Zhong, Xiaoliang Yang, Yuxuan Gu, Zekun Zhou, Yunfei Lu, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin

First: 2025-11-30T06:47:33+00:00 · Latest: 2026-04-28T03:32:32+00:00

Comments: 35 pages, 15 figures

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) have shown strong potential as multilingual Graphical User Interface (GUI) agents, as evidenced by existing GUI benchmarks. However, these benchmarks exhibit two primary limitations: (1) although Perception and Reasoning (P&R) capabilities are fundamental for GUI agents, current benchmarks lack fine-grained diagnostics to identify which specific capabilities lead to task failures, hindering targeted improvements; (2) existing benchmarks fail to provide a strictly aligned cross-lingual evaluation environment, introducing confounding factors that prevent isolating the language impact on GUI agent performance. To address these issues, we propose the Multilingual P&R GUI Benchmark (MPR-GUI-Bench), featuring strictly aligned environments across six languages and eight fine-grained P&R tasks. Our benchmark reveals consistent P&R gaps between English and non-English settings, particularly on reasoning-intensive tasks. To leverage the superior English P&R capabilities for bridging cross-lingual gaps, we identify layers sensitive to language and propose GUI-XLI, a GUI Cross-Lingual Intervention method that aligns non-English hidden states with their English counterparts at these layers during inference. Experiments show that GUI-XLI effectively reduces the cross-lingual gaps, with an average gain of 6.5% in non-English settings.

Summary / 总结

The study aims to improve multilingual perception and reasoning in GUI agents by addressing limitations in existing benchmarks. It introduces MPR-GUI-Bench, a benchmark with aligned environments across six languages and eight tasks, revealing consistent performance gaps between English and non-English settings. The research proposes GUI-XLI, a method that aligns non-English hidden states with English counterparts, reducing cross-lingual gaps by an average of 6.5%.

研究旨在通过解决现有基准的局限性，提高GUI代理的多语言感知和推理能力。它引入了MPR-GUI-Bench，该基准在六种语言和八项任务上具有对齐的环境，揭示了英语和非英语设置之间的一致性能差距。研究提出了GUI-XLI方法，在推理过程中将非英语隐藏状态与英语对应状态对齐，从而将非英语设置的性能平均提高了6.5%。

Mitigating Coordinate Prediction Bias from Positional Encoding Failures

Authors: Xingjian Tao, Yiwei Wang, Yujun Cai, Yihong Luo, Kai Han, Jing Tang

First: 2025-10-25T00:58:47+00:00 · Latest: 2026-04-28T03:11:40+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

While Multimodal Large Language Models (MLLMs) excel at general vision-language tasks, precise coordinate prediction remains a significant challenge, particularly as high-resolution inputs cause visual positional encodings (VPEs) to degrade. We demonstrate that these encoding failures do not result in random noise but instead trigger predictable, directional biases, suggesting that models default to internal spatial priors when grounding signals are weak. To counteract this, we introduce Vision-PE Shuffle Guidance (VPSG), a training-free, inference-time correction method. VPSG isolates position-unconditioned tendencies by shuffling VPEs and utilizes this negative evidence to steer digit decoding through a lightweight finite-state machine. Evaluation on the ScreenSpot-Pro benchmark confirms that VPSG effectively rectifies coordinate drift, yielding consistent improvements in localization accuracy across various model scales without any retraining. Our code is available at https://github.com/taoxj2001/VPSG.

中文标题/摘要

标题：缓解位置编码失败导致的坐标预测偏差

尽管多模态大型语言模型（MLLMs）在通用视觉-语言任务中表现出色，但精确的坐标预测仍然是一个重大挑战，特别是在高分辨率输入导致视觉位置编码（VPEs）退化的情况下。我们证明这些编码失败并非产生随机噪声，而是触发可预测的方向性偏差，表明当视觉信号较弱时，模型会默认使用内部的空间先验。为应对这一问题，我们引入了视觉PE洗牌指导（VPSG），这是一种无需训练、在推理时进行校正的方法。VPSG通过洗牌VPEs来隔离位置无关的趋势，并利用这种负面证据通过轻量级有限状态机引导数字解码。在ScreenSpot-Pro基准测试上的评估证实，VPSG有效地纠正了坐标漂移，无需重新训练即可在各种模型规模上一致地提高定位准确性。我们的代码可在https://github.com/taoxj2001/VPSG获取。
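
Using the shuffled-encoding run as negative evidence resembles contrastive decoding: digits the model favors even without valid positional information are down-weighted. The sketch below assumes a linear contrast between the two sets of digit logits and omits the finite-state machine that restricts guidance to coordinate digits. The exact VPSG formula is not given in the abstract, and the logit values are made up.

```python
def guided_digit(cond_logits, shuffled_logits, alpha=1.0):
    """Pick the digit after subtracting position-unconditioned tendencies.

    cond_logits: digit -> logit with the normal visual positional encodings.
    shuffled_logits: digit -> logit with shuffled positional encodings.
    alpha: guidance strength (alpha = 0 recovers ordinary greedy decoding).
    """
    adjusted = {
        d: cond_logits[d] + alpha * (cond_logits[d] - shuffled_logits[d])
        for d in cond_logits
    }
    return max(adjusted, key=adjusted.get)

# Hypothetical logits: "5" is a prior-driven favorite in both runs,
# while "3" is supported only when positions are intact.
cond = {"3": 2.0, "5": 2.2}
shuffled = {"3": 0.5, "5": 2.1}

plain = guided_digit(cond, shuffled, alpha=0.0)
guided = guided_digit(cond, shuffled, alpha=1.0)
```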

Latent Anomaly Knowledge Excavation: Unveiling Sparse Sensitive Neurons in Vision-Language Models

Authors: Shaotian Li, Shangze Li, Chuancheng Shi, Wenhua Wu, Yanqiu Wu, Xiaohan Yu, Fei Shen, Tat-Seng Chua

First: 2026-04-09T04:54:25+00:00 · Latest: 2026-04-28T02:50:27+00:00

Abs · PDF · Code1 · Code2

Abstract

Large-scale vision-language models (VLMs) exhibit remarkable zero-shot capabilities, yet the internal mechanisms driving their anomaly detection (AD) performance remain poorly understood. Current methods predominantly treat VLMs as black-box feature extractors, assuming that anomaly-specific knowledge must be acquired through external adapters or memory banks. In this paper, we challenge this assumption by arguing that anomaly knowledge is intrinsically embedded within pre-trained models but remains latent and under-activated. We hypothesize that this knowledge is concentrated within a sparse subset of anomaly-sensitive neurons. To validate this, we propose latent anomaly knowledge excavation (LAKE), a training-free framework that identifies and elicits these critical neuronal signals using only a minimal set of normal samples. By isolating these sensitive neurons, LAKE constructs a highly compact normality representation that integrates visual structural deviations with cross-modal semantic activations. Extensive experiments on industrial AD benchmarks demonstrate that LAKE achieves state-of-the-art performance while providing intrinsic, neuron-level interpretability. Ultimately, our work advocates for a paradigm shift: redefining anomaly detection as the targeted activation of latent pre-trained knowledge rather than the acquisition of a downstream task.

中文标题/摘要

标题：潜在异常知识挖掘：揭示视觉-语言模型中的稀疏敏感神经元

大规模视觉-语言模型（VLMs）表现出显著的零样本能力，但其异常检测（AD）性能背后的内部机制仍不甚明了。当前方法主要将VLMs视为黑盒特征提取器，假设异常特定知识必须通过外部适配器或记忆库获得。在本文中，我们挑战这一假设，认为异常知识实际上已嵌入预训练模型中，但处于潜藏且未被激活的状态。我们假设这种知识集中在少量异常敏感神经元中。为了验证这一点，我们提出了潜在异常知识挖掘（LAKE），这是一种无需训练的框架，仅使用少量正常样本即可识别并激发这些关键神经元信号。通过隔离这些敏感神经元，LAKE 构建了一个高度紧凑的正常性表示，将视觉结构偏差与跨模态语义激活相结合。在工业AD基准上的广泛实验表明，LAKE 达到了最先进的性能，同时提供了内在的、神经元级别的可解释性。最终，我们的工作倡导了一种范式转变：将异常检测重新定义为对潜藏预训练知识的靶向激活，而非下游任务的获取。

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Authors: Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu

First: 2026-03-31T15:02:27+00:00 · Latest: 2026-04-28T02:10:56+00:00

Comments: Project page: https://xpeng-robotics.github.io/dial

Abs · PDF · Code1 · Code2 · Project1

Abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.

Summary / 总结

DIAL addresses the limitations of existing end-to-end VLA models by decoupling intent and action through a latent world modeling approach. It uses a VLM-based System-2 to synthesize latent visual foresight, which serves as a bottleneck for high-level decision making, while a lightweight System-1 decodes this foresight and current observations into precise robot actions. DIAL achieves superior performance on the RoboCasa GR1 Tabletop benchmark with 10x fewer demonstrations and demonstrates robust zero-shot generalization to unseen objects and novel configurations.

DIAL通过潜世界建模方法解耦意图和动作，解决现有端到端VLA模型的局限性。它使用基于VLM的System-2生成潜视觉前瞻，作为高层决策的瓶颈，而轻量级的System-1则解码这些前瞻和当前观察结果以生成精确的机器人动作。DIAL在RoboCasa GR1桌面基准测试中表现出色，所需演示数量仅为先前方法的十分之一，并且在实际部署中对未见过的对象和新型配置具有鲁棒的零样本泛化能力。

CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

First: 2026-04-24T06:34:45+00:00 · Latest: 2026-04-28T02:02:19+00:00

Comments: some errors in the method

Abs · PDF · Code1 · Code2

Abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.

中文标题/摘要

标题：CAGE-SGG：基于反事实关系验证的开放词汇场景图生成

开放词汇场景图生成（SGG）旨在使用灵活和精细的关系短语超越固定谓词词汇集来描述视觉场景。虽然近期的视觉-语言模型极大地扩展了SGG的语义覆盖范围，但也引入了一个关键的可靠性问题：预测的关系可能由语言先验或对象共现驱动，而不是基于视觉证据。本文提出了一种基于反事实关系验证的证据导向的开放词汇场景图生成框架。我们的方法不直接接受可能的关系提案，而是验证每个候选关系是否由特定的关系视觉、几何和上下文证据支持。具体来说，我们首先使用视觉-语言提案生成开放词汇关系候选，然后将谓词短语分解为支持、接触、包含、深度、运动和状态等软证据基础。关系条件下的证据编码器提取与谓词相关的线索，而反事实验证器测试在移除必要证据时关系得分是否降低，并在无关扰动下是否保持稳定。我们进一步引入了矛盾感知谓词学习和图级偏好优化，以提高细粒度的区分能力和全局图的一致性。在常规、开放词汇和全景场景图基准测试上进行的实验表明，我们的方法在标准召回率指标、未见过的谓词泛化能力和反事实定位质量上都表现出一致的改进。这些结果表明，从关系生成转向关系验证可以生成更可靠、可解释和基于证据的场景图。

Summary / 总结

The paper addresses the issue of reliability in open-vocabulary scene graph generation (SGG) by proposing a framework that verifies the visual evidence for each relation. It generates relation candidates using a vision-language model and decomposes predicate phrases into soft evidence bases. A relation-conditioned evidence encoder extracts relevant cues, and a counterfactual verifier checks if the relation score decreases when necessary evidence is removed. The method also includes contradiction-aware predicate learning and graph-level preference optimization. Experiments show improvements in recall-based metrics, unseen predicate generalization, and counterfactual grounding quality, indicating more reliable and interpretable scene graphs.

论文提出了一种框架，通过验证每个关系的视觉证据来解决开放词汇场景图生成（SGG）的可靠性问题。该框架使用视觉语言模型生成关系候选，并将谓词短语分解为软证据基础。关系条件下的证据编码器提取相关线索，而反事实验证器检查在移除必要证据时关系得分是否降低。该方法还包括反矛盾谓词学习和图级偏好优化。实验表明，在召回率指标、未见过谓词泛化能力和反事实定位质量方面均有所改进，表明生成的场景图更可靠、可解释且基于证据。
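
The counterfactual test described above has two parts: the relation score should drop when necessary evidence is removed, and it should stay stable when irrelevant evidence is removed. A minimal sketch follows, with toy scoring functions standing in for the learned model. The thresholds `drop` and `tol`, the evidence keys, and both scorers are hypothetical.

```python
def counterfactual_check(score_fn, evidence, necessary, irrelevant,
                         drop=0.2, tol=0.05):
    """Return True if the relation score behaves as grounded evidence would."""
    base = score_fn(evidence)
    no_necessary = score_fn(
        {k: v for k, v in evidence.items() if k not in necessary})
    no_irrelevant = score_fn(
        {k: v for k, v in evidence.items() if k not in irrelevant})
    sensitive = base - no_necessary >= drop       # must react to real evidence
    stable = abs(base - no_irrelevant) <= tol     # must ignore distractors
    return sensitive and stable

def grounded_scorer(ev):
    # Toy scorer for "cup on table": relies on contact and support cues.
    return 0.5 * ev.get("contact", 0.0) + 0.5 * ev.get("support", 0.0)

def prior_driven_scorer(ev):
    # Toy scorer that only looks at scene context, like a language prior.
    return ev.get("background", 0.0)

evidence = {"contact": 0.9, "support": 0.8, "background": 0.7}
grounded_ok = counterfactual_check(
    grounded_scorer, evidence, {"contact"}, {"background"})
prior_ok = counterfactual_check(
    prior_driven_scorer, evidence, {"contact"}, {"background"})
```

The second scorer fails the check even though it assigns the relation a high score, which is the failure case the abstract attributes to language priors and object co-occurrence.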

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

Authors: Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang

First: 2026-04-24T13:36:41+00:00 · Latest: 2026-04-28T02:01:57+00:00

Comments: Some errors in the experimental sections

Abs · PDF · Code1 · Code2

Abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.

中文标题/摘要

标题：ReLIC-SGG：面向开放词汇场景图生成的关系格补全

开放词汇场景图生成（SGG）旨在使用超出固定谓词集的灵活关系短语来描述视觉场景。现有方法通常将标注三元组视为正样本，将所有未标注的对象对关系视为负样本。然而，场景图标注本质上是不完整的：许多有效的关系缺失，相同的交互可以在不同的粒度下描述，例如on、standing on、resting on和supported by。在开放词汇SGG中，由于关系空间更大，这一问题更为严重。我们提出了ReLIC-SGG，一种关系不完整性感知框架，将未标注的关系视为潜在变量而不是确定的负样本。ReLIC-SGG构建了一个语义关系格来建模开放词汇谓词之间的相似性、蕴含和矛盾，并利用其从视觉语言兼容性、图上下文和语义一致性中推断缺失的正关系。正样本-未标注图学习目标进一步减少了假阴性监督，而格引导解码生成紧凑且语义一致的场景图。在传统、开放词汇和全景SGG基准上的实验表明，ReLIC-SGG提高了罕见和未见过的谓词识别，并更好地恢复了缺失的关系。
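
Treating unannotated relations as unlabeled rather than negative is the setting of positive-unlabeled (PU) learning. The sketch below uses the non-negative PU risk estimator of Kiryo et al. with a sigmoid loss as a stand-in, since the abstract does not state which PU objective ReLIC-SGG uses. The class prior and the scores are made-up values.

```python
import math

def sigmoid_loss(score, label):
    """Loss in (0, 1): small when sign(score) agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk: unlabeled pairs are a mix of positives and negatives."""
    pos_as_pos = sum(sigmoid_loss(s, +1) for s in pos_scores) / len(pos_scores)
    pos_as_neg = sum(sigmoid_loss(s, -1) for s in pos_scores) / len(pos_scores)
    unl_as_neg = sum(sigmoid_loss(s, -1) for s in unl_scores) / len(unl_scores)
    # Subtract the expected contribution of hidden positives, clipped at zero.
    return prior * pos_as_pos + max(0.0, unl_as_neg - prior * pos_as_neg)

def naive_risk(pos_scores, unl_scores, prior):
    """Baseline that treats every unannotated pair as a definite negative."""
    pos_as_pos = sum(sigmoid_loss(s, +1) for s in pos_scores) / len(pos_scores)
    unl_as_neg = sum(sigmoid_loss(s, -1) for s in unl_scores) / len(unl_scores)
    return prior * pos_as_pos + unl_as_neg

# Hypothetical relation scores; two unlabeled pairs look like valid relations.
pos_scores = [2.0, 1.5, 3.0]
unl_scores = [2.5, -1.0, -2.0, 1.0]
pu = nn_pu_risk(pos_scores, unl_scores, prior=0.4)
naive = naive_risk(pos_scores, unl_scores, prior=0.4)
```

The PU risk penalizes high-scoring unlabeled pairs less than the naive baseline does, which is how such an objective reduces false-negative supervision.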

One Perturbation, Two Failure Modes: Probing VLM Safety via Embedding-Guided Typographic Perturbations

Authors: Ravikumar Balakrishnan, Sanket Mendapara

First: 2026-04-28T01:21:47+00:00 · Latest: 2026-04-28T01:21:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Typographic prompt injection exploits vision language models' (VLMs) ability to read text rendered in images, posing a growing threat as VLMs power autonomous agents. Prior work typically focuses on maximizing attack success rate (ASR) but does not explain \emph{why} certain renderings bypass safety alignment. We make two contributions. First, an empirical study across four VLMs including GPT-4o and Claude, twelve font sizes, and ten transformations reveals that multimodal embedding distance strongly predicts ASR ($r{=}{-}0.71$ to ${-}0.93$, $p{<}0.01$), providing an interpretable, model-agnostic proxy. Since embedding distance predicts ASR, reducing it should improve attack success, but the relationship is mediated by two factors: perceptual readability (whether the VLM can parse the text) and safety alignment (whether it refuses to comply). Second, we use this as a red-teaming tool: we directly maximize image-text embedding similarity under bounded $\ell_\infty$ perturbations via CWA-SSA across four surrogate embedding models, stress-testing both factors without access to the target model. Experiments across five degradation settings on GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL confirm that optimization recovers readability and reduces safety-aligned refusals as two co-occurring effects, with the dominant mechanism depending on the model's safety filter strength and the degree of visual degradation.

Summary / 总结

This study investigates the safety of vision language models (VLMs) by probing their susceptibility to typographic prompt injection attacks. It finds that multimodal embedding distance is a strong predictor of attack success rate (ASR) across different VLMs and font sizes. The authors argue that reducing embedding distance should raise attack success, although the effect is mediated by perceptual readability and safety alignment. As a red-teaming approach, the study directly maximizes image-text embedding similarity under bounded perturbations and confirms that this optimization recovers readability and reduces safety-aligned refusals, with the dominant mechanism varying with the model's safety filter strength and the level of visual degradation.

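This study assesses the safety of vision language models (VLMs) by probing their vulnerability to typographic prompt injection attacks. It finds a strong association between multimodal embedding distance and attack success rate, which provides a model-agnostic proxy metric. Building on this insight, the authors directly maximize image-text embedding similarity under bounded perturbations; experiments show that the optimization improves readability and reduces safety-aligned refusals, with the effect depending on the model's safety filter strength and the degree of visual degradation.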
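
The correlation reported in the abstract is a Pearson coefficient between embedding distance and ASR. The sketch below shows how such a coefficient is computed; the helper name `pearson_r` and the four data points are made up for illustration and are not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic values, NOT measurements from the paper: one entry per rendering.
embedding_distance = [0.2, 0.4, 0.6, 0.8]
attack_success_rate = [0.9, 0.7, 0.4, 0.2]

r = pearson_r(embedding_distance, attack_success_rate)
# A negative r means renderings closer to the text embedding succeed more often.
```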

DouC: Dual-Branch CLIP for Training-Free Open-Vocabulary Segmentation

Authors: Mohamad Zamini, Diksha Shukla

First: 2026-04-27T20:59:01+00:00 · Latest: 2026-04-27T20:59:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Open-vocabulary semantic segmentation requires assigning pixel-level semantic labels while supporting an open and unrestricted set of categories. Training-free CLIP-based approaches preserve strong zero-shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves patch-level reliability via lightweight, inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure-aware patch interactions to jointly influence final predictions, with optional instance-aware correction applied as post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training-free methods and scales favorably with model capacity.

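Title / Abstract (translated from Chinese)

Title: DouC: Training-Free Open-Vocabulary Segmentation Based on CLIP

Open-vocabulary semantic segmentation must assign pixel-level semantic labels while supporting an open, unrestricted set of categories. Training-free CLIP-based methods retain strong zero-shot generalization but usually rely on a single inference mechanism, which limits their ability to address unreliable local tokens and insufficient spatial coherence at the same time. We propose DouC, a training-free dual-branch CLIP framework that decomposes dense prediction into two complementary components. OG-CLIP improves local reliability through lightweight inference-time token gating, while FADE-CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, so that local token reliability and structure-aware patch interactions jointly influence the final prediction, with instance-aware correction optionally applied in post-processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero-shot generalization. Extensive experiments on eight benchmarks and multiple CLIP backbones show that DouC outperforms prior training-free methods and scales favorably with model capacity.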
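
The abstract says only that the two branches are fused at the logit level. One simple reading is a weighted average of per-patch class logits followed by an argmax, sketched below. The mixing weight `alpha`, the function names, and the toy logits are assumptions for this example; the actual DouC fusion rule may differ.

```python
def fuse_logits(branch_a, branch_b, alpha=0.5):
    """Blend two per-patch logit maps; alpha is a hypothetical mixing weight."""
    return [
        [alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(branch_a, branch_b)
    ]


def predict_labels(logits):
    """Pick the highest-scoring class index for every patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Two patches, three classes (toy numbers, one row per patch).
gated_branch = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.4]]
structure_branch = [[1.0, 0.7, 0.2], [0.1, 1.5, 0.3]]

fused = fuse_logits(gated_branch, structure_branch, alpha=0.5)
labels = predict_labels(fused)
```

In this toy case the second patch changes label after fusion, because the structure-guided branch outweighs the weak preference of the gated branch.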

VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs

Authors: Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi

Venue: ACL 2026

First: 2025-12-12T22:39:01+00:00 · Latest: 2026-04-27T19:46:30+00:00

Comments: Accepted to ACL 2026 Main

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach to generate diverse datasets. Our approach is iterative and directly optimizes a mathematical quantity that optimizes the diversity of the dataset using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for the working of our method, we also demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches by providing a 1.5-3 times improvement in diversity.

Summary / 总结

VOYAGER is a training-free method that uses large language models to generate diverse datasets by optimizing a mathematical quantity related to determinantal point processes. The approach iteratively enhances dataset diversity and is scalable and applicable to closed-source models. Experimental results show that VOYAGER outperforms existing methods, providing a 1.5 to 3 times improvement in diversity.

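VOYAGER is a training-free method that uses large language models to generate diverse datasets by optimizing a mathematical quantity related to determinantal point processes. The method iteratively increases dataset diversity, is scalable, and applies to closed-source models. Experiments show that VOYAGER outperforms existing methods, improving diversity by a factor of 1.5 to 3.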
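
The abstract does not describe how Voyager applies determinantal point processes. As a generic illustration of the underlying idea, the sketch below greedily picks items whose cosine-similarity kernel submatrix has the largest determinant, which favors mutually dissimilar items. It is not the Voyager algorithm; the function names and the toy embedding vectors are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def kernel_det(vectors, indices):
    """Determinant of the similarity kernel restricted to the chosen indices."""
    sub = [[dot(vectors[i], vectors[j]) for j in indices] for i in indices]
    return det(sub)


def greedy_diverse_subset(vectors, k):
    """Greedily add the item that maximizes the kernel determinant."""
    unit = [normalize(v) for v in vectors]
    chosen = []
    for _ in range(k):
        best, best_det = None, -1.0
        for i in range(len(unit)):
            if i in chosen:
                continue
            d = kernel_det(unit, chosen + [i])
            if d > best_det:
                best, best_det = i, d
        chosen.append(best)
    return chosen


# Toy embeddings: items 0 and 1 are near-duplicates, item 2 is different.
candidates = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
picked = greedy_diverse_subset(candidates, 2)
```

The determinant shrinks toward zero when two selected vectors are nearly parallel, so the greedy step avoids pairing the two near-duplicates.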

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

Authors: Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan, Xiao Xiang Zhu, Begum Demir, Salman Khan

First: 2026-04-27T18:59:49+00:00 · Latest: 2026-04-27T18:59:49+00:00

Comments: 31 pages. Position Paper

Abs · PDF · Code1 · Code2

Abstract

Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have expanded representation learning and language-grounded interaction for remote sensing, and agentic AI has demonstrated long-horizon reasoning and external tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate over georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation actively transform the underlying state and can constrain subsequent analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence, but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We identify the implicit assumptions commonly made in generic agentic models, analyze how they break in geospatial workflows, and characterize the resulting failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and learning objectives aligned with geospatial and physical validity. Finally, we present research directions spanning EO-specific benchmarks, hybrid supervised and reinforcement learning, constrained self-improvement, and trajectory-level evaluation beyond final-answer accuracy. Building reliable geospatial agents therefore requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.

Summary / 总结

This paper addresses the challenges of applying agentic AI to Earth Observation (EO) workflows, which involve complex, geospatially structured data and operations. It identifies common assumptions in generic agentic models that break down in EO contexts and proposes design principles for EO-native agents, including structured geospatial state, tool-aware reasoning, and alignment with geospatial and physical validity. The research suggests new directions for EO-specific benchmarks and hybrid learning methods to build reliable geospatial agents.

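This paper examines the challenges of applying agentic AI to Earth Observation (EO) workflows, which involve complex, geospatially structured data and operations. It points out that assumptions common in generic agentic models fail in EO settings and proposes design principles for EO-native agents, including structured geospatial state, tool-aware reasoning, and alignment with geospatial and physical validity. It also suggests new directions in EO-specific benchmarks and hybrid learning methods for building reliable geospatial agents.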
