Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Authors: Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N. M Hoang, Fartash Faghri, Yizhe Zhang, Minsik Cho, Mehrdad Farajtabar
First: 2026-05-11T17:33:28+00:00 · Latest: 2026-05-11T17:33:28+00:00
Abstract
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
Summary / 总结
This study investigates the conditions under which on-policy distillation is beneficial or detrimental for training reasoning models. It introduces a training-free diagnostic framework that evaluates the alignment of distillation gradients with an ideal gradient at the level of individual tokens. The research finds that distillation guidance is more aligned with the ideal on incorrect rollouts compared to correct ones, and that the optimal distillation context varies depending on the student model's capacity and the target task, with no single universally effective configuration emerging.
论文探讨了在哪些条件下在线策略蒸馏对训练推理模型是有益还是有害。研究引入了一种无需训练的诊断框架,该框架在每个token层面评估理想梯度与实际蒸馏梯度之间的对齐程度。研究发现,蒸馏指导在错误的展开中与理想信号的对齐程度高于正确的展开,且最优的蒸馏上下文取决于学生模型的能力和目标任务,没有一种配置是普遍有效的。
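The gradient alignment score in the abstract above is a cosine similarity between an ideal gradient and a distillation gradient. The following is a minimal sketch of that one computation, assuming both gradients are already available as flattened lists of floats. The toy vectors, the function name, and the zero-gradient convention are illustrative assumptions, not taken from the paper, and the targeted-rollout estimator of the ideal gradient is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero gradient is treated as unaligned.
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy gradients, for illustration only.
ideal_grad = [1.0, 0.0, 2.0]
distill_grad_a = [2.0, 0.0, 4.0]    # same direction as the ideal gradient
distill_grad_b = [-1.0, 0.0, -2.0]  # opposite direction
distill_grad_c = [0.0, 3.0, 0.0]    # orthogonal direction

score_a = cosine_similarity(ideal_grad, distill_grad_a)
score_b = cosine_similarity(ideal_grad, distill_grad_b)
score_c = cosine_similarity(ideal_grad, distill_grad_c)
```

A score near 1 means the distillation update points the same way as the ideal update, a score near -1 means it points the opposite way, and a score near 0 means the two are roughly orthogonal.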
Count Anything at Any Granularity
Authors: Chang Liu, Haoning Wu, Weidi Xie
First: 2026-05-11T17:32:37+00:00 · Latest: 2026-05-11T17:32:37+00:00
Comments: Project page: https://verg-avesta.github.io/KubriCount/
Abstract
Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.
Summary / 总结
This work addresses the issue of open-world object counting by redefining it as multi-grained counting, where visual exemplars and fine-grained text specify the target appearance and intended semantic granularity, respectively. The authors propose KubriCount, a new dataset that includes multi-category scenes, controlled distractors, and instance-level annotations. They also develop a fully automatic data-scaling pipeline to construct this dataset. Experiments show that both multimodal large language models and specialist counting models struggle with fine-grained distinctions. To address this, the authors train HieraCount, a multi-grained counting model that improves accuracy and generalizes well to real-world scenarios.
该研究重新定义了开放世界计数为多粒度计数,通过视觉示例和细粒度文本分别指定目标外观和预期的语义粒度。作者提出了KubriCount数据集,包含多类别场景、受控干扰物和实例级注释。他们还开发了一个全自动数据扩展管道来构建该数据集。实验表明,多模态大语言模型和专门的计数模型在细粒度区分上表现不佳。为了解决这个问题,作者训练了HieraCount,这是一种多粒度计数模型,能够显著提高多粒度计数的准确性,并在现实场景中表现出良好的泛化能力。
Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
Authors: Earl Ranario, Mason J. Earles
First: 2025-12-17T21:22:44+00:00 · Latest: 2026-05-11T17:29:44+00:00
Abstract
Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking
Authors: Sijia Chen, Yanqiu Yu, En Yu, Wenbing Tao
First: 2025-05-26T17:55:19+00:00 · Latest: 2026-05-11T17:14:02+00:00
Comments: Code: https://github.com/chen-si-jia/ReaMOT
Abstract
Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms heavily rely on explicit visual-textual matching and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that elevates tracking to a cognitive level, requiring models to infer and track specific targets satisfying implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark featuring a tailored metric suite and a large-scale dataset. This dataset comprises 1,156 language instructions, 423,359 image-language pairs, and 869 distinct video sequences systematically categorized into six distinct evaluation scenarios, with over 75% of the instructions dedicated to High Level Reasoning. Furthermore, recognizing that traditional trackers lack cognitive capacity while direct application of Large Vision-Language Models (LVLMs) yields severe temporal inconsistencies, we propose ReaTrack. Driven by the insight to decouple high-level cognitive localization from low-level physical motion continuity, this training-free framework dynamically aligns the semantic detections of a Thinking-variant LVLM with the robust motion priors of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate that ReaTrack establishes a new leading performance standard. Notably, it achieves a more than threefold improvement in RHOTA on the High Level Reasoning subset. Our dataset and code will be available at https://github.com/chen-si-jia/ReaMOT.
Summary / 总结
ReaMOT addresses the limitation of existing Referring Multi-Object Tracking (RMOT) methods by introducing a new task that requires logical reasoning for tracking targets specified by language instructions. The proposed ReaTrack framework decouples high-level cognitive localization from low-level motion continuity, achieving significant improvements in performance, especially for high-level reasoning tasks. The ReaMOT Challenge provides a comprehensive benchmark with a large dataset and tailored metrics, demonstrating the effectiveness of ReaTrack over traditional trackers and LVLMs.
ReaMOT旨在通过引入一个需要通过语言指令进行逻辑推理的新任务来解决现有Referring Multi-Object Tracking (RMOT)方法的局限性。作者提出了ReaTrack,这是一种无需训练的框架,结合了Thinking-variant Large Vision-Language Model的语义检测和SAM2的稳健运动先验,以处理跟踪中的高层次推理。在ReaMOT挑战基准上的实验表明,ReaTrack在高层次推理子集中的RHOTA得分显著优于现有方法,提高了三倍以上。
Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
Authors: Ruinan Jin, Beidi Zhao, Myeongkyun Kang, Qiong Zhang, Xiaoxiao Li
First: 2026-05-11T17:00:00+00:00 · Latest: 2026-05-11T17:00:00+00:00
Comments: 31 pages, 12 figures
Abstract
Self-verification, re-invoking the same vision language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework for mapping the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and answer generator are capacity-coupled, the verifier can overly agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs across five medical VQA datasets and seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, while saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Since our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
Authors: Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen, Chi-Nguyen Tran, Phu-Hoa Pham, Nguyen Lam Phu Quy, The Anh Han, Long Tran-Thanh
First: 2026-05-11T16:55:16+00:00 · Latest: 2026-05-11T16:55:16+00:00
Comments: 57 pages, 1 figure, 6 MultiTP moral dimensions
Abstract
Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.
Summary / 总结
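This work addresses cultural bias in large language models by introducing DISCA, an inference-time method that uses within-country sociodemographic disagreement to reduce cultural misalignment. DISCA converts the disagreement among World-Values-Survey-grounded persona agents into a bounded, loss-averse logit correction. Across 20 countries and seven open-weight backbones (2B to 70B), it reduces cultural misalignment on MultiTP by 10-24% on the six backbones of at least 3.8B parameters, and by 2-7% on open-ended scenarios, without fine-tuning or per-country preference data. The results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving global moral preferences.
该研究通过引入DISCA方法,利用国内社会人口分歧来减少文化偏差。DISCA将基于世界价值观调查(World Values Survey)的人物角色之间的分歧转化为有界的、损失厌恶的logit修正。在20个国家和七种开源基础模型(2B至70B)上,DISCA在其中六种参数量不低于3.8B的模型上将MultiTP的文化偏差减少了10-24%,在开放式场景中减少了2-7%,且无需微调或特定国家的偏好数据。结果表明,推理时校准是微调之外一种可扩展的替代方案,适用于服务全球道德偏好。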
When Large Vision-Language Models Meet Person Re-Identification
Authors: Qizao Wang, Bin Li, Xiangyang Xue
Venue: ICASSP 2026
First: 2024-11-27T07:45:25+00:00 · Latest: 2026-05-11T16:39:12+00:00
Comments: Accepted by ICASSP 2026
Abstract
Large Vision-Language Models (LVLMs) that incorporate visual models and large language models have achieved impressive results across cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the representation of pedestrian identity. Our framework integrates the semantic understanding and generation capabilities of LVLM into end-to-end ReID training, allowing LVLM to capture rich semantic cues during both training and inference. LVLM-ReID achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID.
Probing Cross-modal Information Hubs in Audio-Visual LLMs
Authors: Jihoo Jung, Chaeyoung Jung, Ji-Hoon Kim, Joon Son Chung
Venue: ICML 2026
First: 2026-05-11T16:34:18+00:00 · Latest: 2026-05-11T16:34:18+00:00
Comments: Accepted by ICML 2026
Abstract
Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
Summary / 总结
This paper investigates the cross-modal information flow between audio and visual modalities in audio-visual large language models (AVLLMs). By analyzing multiple recent AVLLMs, the authors found that AVLLMs primarily encode integrated audio-visual information in sink tokens, and a subset of these tokens, termed cross-modal sink tokens, specializes in storing cross-modal information. Based on these findings, the authors propose a training-free method to mitigate hallucinations by encouraging reliance on integrated cross-modal information within these tokens.
本文研究了音频和视觉模态在音频-视觉大型语言模型(AVLLMs)中的跨模态信息流。通过对多个最新AVLLMs的分析,作者发现AVLLMs主要在sink tokens中编码整合的音频-视觉信息,而其中一部分sink tokens,称为跨模态sink tokens,专门存储跨模态信息。基于这些发现,作者提出了一种无需训练的方法,通过鼓励在这些tokens中依赖整合的跨模态信息来减轻幻觉现象。
The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
Authors: Shuaizhi Cheng, Xiang Shi, Zhiwei Zhang, Mingwei Li
First: 2026-04-26T14:59:14+00:00 · Latest: 2026-05-11T16:19:42+00:00
Comments: 35 pages, 15 figures. v2: minor layout fixes and author list update
Abstract
Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM's weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model's log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.
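The two training-free steps named in the abstract above (scaling the adapter at its top-norm layers, and doing so only when the base model is confident) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-layer adapter updates are represented as plain nested lists, and `top_k`, `scale`, `threshold`, and the use of a single scalar `base_confidence` are hypothetical parameters chosen for the example.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def boost_top_norm_layers(adapter_deltas, top_k, scale):
    """Return a copy of the per-layer updates with the top_k largest-norm layers scaled."""
    norms = {name: frobenius_norm(delta) for name, delta in adapter_deltas.items()}
    top_layers = set(sorted(norms, key=norms.get, reverse=True)[:top_k])
    return {
        name: [[x * (scale if name in top_layers else 1.0) for x in row] for row in delta]
        for name, delta in adapter_deltas.items()
    }


def conflict_aware_adapter(adapter_deltas, base_confidence, threshold, top_k, scale):
    """Boost only when the base model is confident in its prior (hypothetical trigger)."""
    if base_confidence >= threshold:
        return boost_top_norm_layers(adapter_deltas, top_k, scale)
    return adapter_deltas


# Made-up toy adapter updates for three layers.
deltas = {
    "layer0": [[0.1, 0.0], [0.0, 0.1]],
    "layer1": [[1.0, 0.0], [0.0, 1.0]],  # largest norm, so it is the one boosted
    "layer2": [[0.5, 0.5], [0.5, 0.5]],
}

boosted = conflict_aware_adapter(deltas, base_confidence=0.9, threshold=0.5, top_k=1, scale=2.0)
untouched = conflict_aware_adapter(deltas, base_confidence=0.2, threshold=0.5, top_k=1, scale=2.0)
```

Gating on confidence reflects the abstract's account that conflicts with strong priors are the ones the constant-margin adapter loses, so weak-prior and novel-knowledge cases are left at the original adapter scale.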
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
Authors: Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang
First: 2026-05-11T15:59:02+00:00 · Latest: 2026-05-11T15:59:02+00:00
Comments: Preprint. 17 pages, 8 figures, 6 tables
Abstract
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs
Authors: Mohamed Eltahir, Lama Ayash, Ali Habibullah, Tanveer Hussain, Naeemullah Khan
First: 2026-05-11T15:57:46+00:00 · Latest: 2026-05-11T15:57:46+00:00
Abstract
Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to select a small subset of informative frames before the forward pass, which training-free selectors typically do via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA model is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
Summary / 总结
GridProbe is an efficient training-free inference method that scores evidence in answer space using a frozen VLM's reasoning to select question-relevant frames, reducing attention cost to sub-quadratic while maintaining accuracy. It arranges frames on a $K \times K$ grid and uses lightweight row and column probes to generate an importance map, which drives adaptive frame selection. On Video-MME-v2, GridProbe stays within $1.6$ pp of the monolithic baseline's average accuracy at a $3.36\times$ TFLOPs reduction, and on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Pairing a small 2B selector with a stronger 4B or 8B QA model further improves performance over the 2B monolithic baseline, with no retraining.
GridProbe 是一种高效的无训练后验探针推理方法,通过冻结 VLM 的推理在答案空间中评分,选择与问题相关的帧,从而将注意力成本降低到次平方级同时保持准确性。它将帧排列在一个 $K \times K$ 网格上,并使用轻量级的行和列探针生成重要性图,驱动自适应帧选择。在 Video-MME-v2 上,GridProbe 在 TFLOPs 减少 $3.36\times$ 的情况下,平均准确率与整体式基线的差距在 $1.6$ 个百分点以内;在 LongVideoBench 上,它以 $0.35\times$ 的计算量帕累托优于基线(提高 $0.9$ 个百分点)。将小型 2B 选择器与更强的 4B 或 8B QA 模型结合使用,无需重新训练即可进一步优于 2B 整体式基线。
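The importance map described above is the outer product of the row-probe and column-probe confidences. The sketch below illustrates only that step plus a skewness statistic, under several assumptions: frames are laid out row-major on the grid, the probe confidences are made-up numbers, and a fixed `budget` stands in for the per-question $M_{\mathrm{eff}}$, because the abstract does not give the closed-form Shape-Adaptive Selection rule. All function names are hypothetical.

```python
import math


def importance_map(row_conf, col_conf):
    """Outer product of row-probe and column-probe confidences."""
    return [[r * c for c in col_conf] for r in row_conf]


def skewness(values):
    """Population skewness (third standardized moment) of a list of values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0.0 else m3 / (m2 ** 1.5)


def select_top_frames(imp, budget):
    """Pick the `budget` highest-scoring cells and return frame indices in temporal order."""
    k = len(imp[0])
    cells = [(imp[i][j], i * k + j) for i in range(len(imp)) for j in range(k)]
    cells.sort(key=lambda cell: (-cell[0], cell[1]))
    return sorted(index for _, index in cells[:budget])


# Made-up probe confidences for a 3x3 grid (9 frames, row-major order assumed).
row_conf = [0.9, 0.2, 0.1]
col_conf = [0.1, 0.8, 0.3]

imp = importance_map(row_conf, col_conf)
map_skew = skewness([value for row in imp for value in row])
selected = select_top_frames(imp, budget=3)
```

In this toy map a few cells dominate, so the skewness is positive. A peaked map of this kind is the sort of shape signal that could justify a smaller frame budget, although the actual mapping from skewness and kurtosis to $M_{\mathrm{eff}}$ is not reproduced here.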
TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection
Authors: Yifeng Yang, Jubo Feng, Jing Xu, Xinbing Wang, Qinying Gu, Nanyang Ye
First: 2026-05-11T15:54:34+00:00 · Latest: 2026-05-11T15:54:34+00:00
Abstract
Vision-language models enable OOD detection by comparing image alignment with ID labels and negative semantics. Existing negative-label-based methods mainly rely on static negative labels constructed before inference, limiting their ability to cover diverse and evolving OOD concepts. Although test-time expansion provides a natural solution, naively learning negative semantics from potential OOD samples may introduce hard ID contamination. To address this issue, we propose a Test-time ID-prototype-separated Negative Semantics learning method, termed TINS. TINS learns sample-specific negative text embeddings via image-to-text modality inversion and introduces ID-prototype-separated regularization to keep them separated from ID semantics. To further stabilize negative semantics expansion, TINS employs group-wise aggregation scoring and a buffer update strategy. Extensive experiments across Four-OOD, OpenOOD, Temporal-shift, and Various ID settings show consistent improvements over strong baselines. Notably, on the Four-OOD benchmark with ImageNet-1K as ID, TINS reduces the average FPR95 from 14.04% to 6.72%. Our code is available at https://github.com/zxk1212/tins.
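The first sentence of the abstract above, comparing an image's alignment with ID labels against its alignment with negative semantics, can be illustrated with a generic softmax-mass score of the kind used in negative-label OOD detection. This is a hedged sketch, not TINS itself: the similarity values are made up, the temperature is an assumed hyperparameter, and the learned negative embeddings, group-wise aggregation, and buffer update from the paper are not modeled.

```python
import math


def id_score(id_sims, neg_sims, temperature=0.1):
    """Share of softmax mass on ID labels versus negative labels.

    id_sims / neg_sims: image-text cosine similarities to ID and negative labels.
    Higher scores suggest in-distribution; lower scores suggest OOD.
    """
    id_mass = sum(math.exp(s / temperature) for s in id_sims)
    neg_mass = sum(math.exp(s / temperature) for s in neg_sims)
    return id_mass / (id_mass + neg_mass)


# Made-up similarities for two images.
in_dist_score = id_score(id_sims=[0.31, 0.22], neg_sims=[0.18, 0.15])
ood_score = id_score(id_sims=[0.19, 0.17], neg_sims=[0.30, 0.21])

# Hypothetical decision threshold, chosen here only for the example.
is_ood = ood_score < 0.5
```

With a score of this form, adding negative text embeddings that sit close to true OOD images lowers their score, while negatives that drift toward ID semantics would also lower the score of ID images, which is the contamination risk the abstract describes.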
C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving
Authors: Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li
First: 2026-05-11T15:45:00+00:00 · Latest: 2026-05-11T15:45:00+00:00
Abstract
Safety-critical planning in complex environments, particularly at urban intersections, remains a fundamental challenge for autonomous driving. Existing methods, whether rule-based or data-driven, frequently struggle to capture complex scene semantics, infer potential risks, and make reliable decisions in rare, high-risk situations. While vision-language models (VLMs) offer promising approaches for safe decision-making in these environments, most current approaches lack reflective and causal reasoning, thereby limiting their overall robustness. To address this, we propose a counterfactual chain-of-thought (C-CoT) framework that leverages VLMs to decompose driving decisions into five sequential stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. Within the counterfactual reasoning stage, we introduce a structured meta-action evaluation tree to explicitly assess the potential consequences of alternative action combinations. This self-reflective reasoning establishes causal links between action choices and safety outcomes, improving robustness in long-tail and out-of-distribution scenarios. To validate our approach, we construct the DeepAccident-CCoT dataset based on the DeepAccident benchmark and fine-tune a Qwen2.5-VL (7B) model using low-rank adaptation. Our model achieves a risk prediction recall of 81.9%, reduces the collision rate to 3.52%, and lowers L2 error to 1.98 m. Ablation studies further confirm the critical role of counterfactual reasoning and the meta-action evaluation tree in enhancing safety and interpretability.
中文标题/摘要
标题:C-CoT:基于视觉语言模型的反事实链思考方法以实现安全自动驾驶
在复杂环境中的安全关键规划,尤其是在城市交叉口,仍然是自动驾驶领域的基本挑战。现有方法,无论是基于规则的还是数据驱动的,经常难以捕捉复杂的场景语义、推断潜在风险并做出可靠的决策,特别是在罕见的高风险情况下。尽管视觉语言模型(VLMs)为这些环境中的安全决策提供了有前景的方法,但大多数当前方法缺乏反思性和因果推理,从而限制了它们的整体鲁棒性。为了解决这一问题,我们提出了一种反事实链思考(C-CoT)框架,利用VLMs将驾驶决策分解为五个连续阶段:场景描述、关键对象识别、风险预测、反事实风险推理和最终行动规划。在反事实推理阶段,我们引入了一个结构化的元动作评估树,以明确评估不同行动组合的潜在后果。这种自我反思推理建立了行动选择与安全结果之间的因果联系,提高了在长尾和分布外场景中的鲁棒性。为了验证我们的方法,我们基于DeepAccident基准构建了DeepAccident-CCoT数据集,并使用低秩适应对Qwen2.5-VL(7B)模型进行了微调。我们的模型实现了风险预测召回率81.9%,碰撞率降低至3.52%,L2误差降低至1.98米。消融研究进一步证实了反事实推理和元动作评估树在提高安全性和可解释性方面的作用。
Summary / 总结
The paper proposes C-CoT, a counterfactual chain-of-thought framework that uses vision-language models to decompose driving decisions into five stages: scene description, critical object identification, risk prediction, counterfactual risk reasoning, and final action planning. This approach introduces a structured meta-action evaluation tree to assess potential consequences of alternative actions, enhancing safety in rare, high-risk situations. Experiments with a fine-tuned Qwen2.5-VL (7B) model on the DeepAccident-CCoT dataset show a risk prediction recall of 81.9%, a collision rate of 3.52%, and an L2 error of 1.98 m. Ablation studies confirm the importance of counterfactual reasoning and the meta-action evaluation tree for safety and interpretability.
论文提出了一种C-CoT框架,利用视觉语言模型将驾驶决策分解为五个阶段:场景描述、关键对象识别、风险预测、反事实风险推理和最终行动规划。该方法引入了一个结构化的元动作评估树,以评估不同行动组合的潜在后果,从而提高在罕见高风险情况下的安全性。实验结果显示,该模型的风险预测召回率为81.9%,碰撞率降低至3.52%,L2误差降低至1.98米。消融研究进一步证实了反事实推理和元动作评估树在提高安全性和可解释性方面的重要性。
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models
Authors: Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang
First: 2025-12-26T01:01:25+00:00 · Latest: 2026-05-11T15:42:09+00:00
Comments: 19 pages, 11 figures, 8 tables
Abstract
Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token equally contributes to model instability, we reveal that a small fraction (around 20%) of high-entropy tokens, in the evaluated representative open-source VLMs with diverse architectures, concentrates a disproportionate share of adversarial influence during autoregressive generation. We demonstrate that concentrating adversarial perturbations on these high-entropy positions achieves comparable semantic degradation to global methods while optimizing fewer decoding positions. Additionally, across multiple representative VLMs, such attacks induce not only semantic drift but also a substantial unsafe subset (20-31%) under the current pipeline. Remarkably, since such vulnerable high-entropy tokens recur across architecturally diverse VLMs, attacks focused on them exhibit non-trivial transferability. Motivated by these findings, we design a simple Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting and extends it with a reusable token bank, yielding competitive attack success rates (93-95%) with a considerable harmful rate (30.2-38.6%) on the three representative open-source VLMs.
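The abstract above relies on per-token entropy to find the small fraction of decoding positions (around 20%) where the model is most uncertain. The sketch below shows only that measurement step, assuming each decoding step's next-token distribution is available as a list of probabilities. The distributions are made up and the function names are hypothetical. No perturbation or attack logic is included.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(step_distributions, fraction=0.2):
    """Indices of the top `fraction` of decoding steps ranked by entropy."""
    entropies = [shannon_entropy(p) for p in step_distributions]
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: (-entropies[i], i))
    return sorted(ranked[:k])


# Made-up next-token distributions for five decoding steps.
steps = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # uniform: the most uncertain step
    [0.90, 0.05, 0.03, 0.02],
    [0.70, 0.10, 0.10, 0.10],
    [0.99, 0.005, 0.003, 0.002],
]

positions = high_entropy_positions(steps, fraction=0.2)
```

With five steps and a 20% fraction, exactly one position is returned, and it is the uniform distribution at index 1, whose entropy equals the logarithm of the vocabulary size.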
Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation
Authors: Song Yan, Wei Zhai, Chenfeng Wang, Xinliang Bi, Jian Yang, Yancheng Cai, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha
First: 2025-11-11T02:12:38+00:00 · Latest: 2026-05-11T15:40:18+00:00
Abstract
Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed can lead to large differences in prompt faithfulness, composition, and visual quality. We study this seed sensitivity through the semantic map from initial noise to generated meaning. Although the sampling flow is locally invertible, the subsequent semantic projection is many-to-one, inducing a degenerate pullback semi-metric on the latent space: most local directions are nearly semantic-invariant, while semantic-sensitive variation is concentrated in a much smaller horizontal subspace. This provides an explanatory geometric view of the seed lottery. Motivated by this view, we introduce a training-free prompt-residual seed-shaping procedure. Rather than claiming to recover the exact horizontal space, the method uses a single high-noise cold-start prompt residual as a model-coupled proxy, injects only its tangential component, and retracts the seed to the original Gaussian radius shell. This keeps the initialization prior-compatible while adding only one conditional/unconditional probe before standard sampling. Across multiple generation benchmarks, the method improves alignment and quality metrics over standard sampling, supporting both the practical value of the proxy and the explanatory relevance of semantic anisotropy.
Summary
This paper explores the sensitivity of diffusion models to random seeds, which significantly affect the generated content's faithfulness to the prompt, composition, and visual quality. By analyzing the geometric properties of the latent space, the authors propose a training-free method that shapes the seed by injecting the tangential component of a high-noise prompt residual and then retracting the seed to the original Gaussian radius shell. This method improves alignment and quality metrics on multiple generation benchmarks, supporting the practical value of the proxy and the relevance of semantic anisotropy.
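This paper examines the high sensitivity of diffusion models to the random seed, which significantly affects the prompt faithfulness, composition, and visual quality of the generated content. By analyzing the geometry of the latent space, the authors propose a training-free seed-shaping method that uses a high-noise prompt residual to inject a tangential component and pulls the seed back onto the original Gaussian radius shell. The method improves alignment and quality metrics on multiple generation benchmarks, supporting the practical value of the proxy and the relevance of the semantic-anisotropy explanation.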
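The two geometric operations named in the abstract (inject only the tangential component of a residual, then retract to the original radius shell) can be sketched as follows. This is an illustrative sketch under assumptions: the residual is taken as a given vector, the step size `step` is a hypothetical parameter, and latents are flattened to plain Python lists.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def tangential_component(residual, seed):
    """Remove from `residual` its projection onto the radial direction of `seed`."""
    radial_coeff = dot(residual, seed) / dot(seed, seed)
    return [r - radial_coeff * s for r, s in zip(residual, seed)]


def shape_seed(seed, residual, step=0.5):
    """Move the seed along the tangential residual, then rescale to the original norm.

    `step` is an assumed scalar step size, not a value from the paper.
    """
    radius = norm(seed)
    tangent = tangential_component(residual, seed)
    moved = [s + step * t for s, t in zip(seed, tangent)]
    scale = radius / norm(moved)  # retraction onto the sphere of radius ||seed||
    return [scale * m for m in moved]


# Toy example in two dimensions.
seed = [3.0, 4.0]
residual = [1.0, 0.0]
shaped = shape_seed(seed, residual, step=0.5)
```

Because the update is orthogonal to the seed before rescaling, the shaped seed keeps the norm of the original Gaussian sample, which is what keeps the initialization compatible with the prior.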
ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
Authors: Rongtian Ye
Venue: ACL 2026
First: 2026-03-30T18:29:02+00:00 · Latest: 2026-05-11T15:32:44+00:00
Comments: 21 pages, 17 figures, accepted to ACL 2026: the 4th Workshop on Advances in Language and Vision Research
Abstract
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
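Title / Abstract (translated from Chinese)
Title: ChartDiff: A Large-Scale Benchmark for Cross-Chart Comparative Summarization
Charts are central to analytical reasoning, yet existing chart-understanding benchmarks focus almost entirely on single-chart interpretation and neglect comparative reasoning across multiple charts. To address this, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff contains 8,541 chart pairs covering diverse data sources, chart types, and visual styles, and each pair comes with an LLM-generated, human-verified summary describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods score higher on ROUGE but lower on human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We also find that multi-series charts are challenging for all model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our results show that comparative chart reasoning remains a major challenge for current vision-language models and position ChartDiff as a new benchmark for research on multi-chart understanding.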
Summary
ChartDiff is a large-scale benchmark for evaluating models in understanding pairs of charts, addressing the lack of benchmarks for comparative reasoning. It includes 8,541 chart pairs with diverse data sources and visual styles, each annotated with summaries. Evaluations show that general-purpose models perform best in GPT-based quality, while specialized and pipeline-based models have higher ROUGE scores but lower human-aligned evaluation. The benchmark highlights the challenges in comparative chart reasoning, especially for multi-series charts, and positions ChartDiff as a new standard for advancing research on multi-chart understanding.
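ChartDiff is a large-scale benchmark for evaluating how well models understand pairs of charts, addressing the lack of benchmarks for comparative reasoning. It contains 8,541 chart pairs with diverse data sources and visual styles, each annotated with a summary. Evaluations show that general-purpose models perform best on GPT-based quality, while specialized and pipeline-based models achieve higher ROUGE scores but lower human-aligned evaluation. The benchmark highlights the challenges of comparative chart reasoning, especially for multi-series charts, and positions ChartDiff as a new standard for advancing research on multi-chart understanding.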
Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
Authors: Qingxin Xiao, Peilin Zhao, Yangyang Zhao, Lingwei Dang, Qingyao Wu
First: 2026-05-11T14:55:59+00:00 · Latest: 2026-05-11T14:55:59+00:00
Abstract
During MLLM decoding, attention often concentrates abnormally on irrelevant image tokens. While existing research dismisses this as invalid noise and forcibly redirects attention toward key image information, we argue that these tokens are critical carriers of visual and narrative logic, and that such coercive corrections exacerbate the visual-language imbalance. Adopting a "decoding-as-game" perspective, we reveal that hallucinations stem from an equilibrium imbalance between linguistic priors and visual information. We propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that perturbs visual context via counter-commonsense patches. Leveraging the fact that authentic visual features remain stable under perturbation while hallucinations fluctuate, ACE implements a dynamic game decoding strategy. This approach precisely suppresses perturbation-sensitive priors while compensating for stable visual signals to restore balance. Extensive experiments demonstrate that ACE, as a plug-and-play strategy, enhances model trustworthiness with negligible inference overhead.
Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
Authors: Yannick Brunink, Daniel Daza, Yunjie He, Michael Cochez
First: 2025-11-27T15:57:29+00:00 · Latest: 2026-05-11T14:32:08+00:00
Comments: Accepted in Transactions on Machine Learning Research (2026)
Abstract
Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing. In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance. These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.
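Title / Abstract (translated from Chinese)
Title: Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
Neural methods for complex query answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that go beyond explicit graph structure and can therefore infer answers unreachable by symbolic query processing. In this paper, we systematically compare neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting the resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based methods perform similarly, and no neural model consistently outperforms the latter. A similarity analysis further shows that the answers they retrieve overlap very little and that combining their outputs consistently improves performance. These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings underline the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from the principles of query relaxation.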
Summary
This study investigates the effectiveness of neural methods in Complex Query Answering (CQA) over knowledge graphs by comparing them with a query relaxation strategy that retrieves possible answers by relaxing query constraints and counting paths. The research finds that neural and relaxation-based approaches perform similarly in many cases, with no neural model consistently outperforming the relaxation method. The analysis also shows that their retrieved answers have little overlap, and combining their outputs improves performance. These findings suggest that current neural models do not fully capture the reasoning patterns of query relaxation and highlight the need for stronger non-neural baselines in CQA research.
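The study examines how effective neural methods are for complex query answering over knowledge graphs by comparing them with a query relaxation strategy that retrieves possible answers by relaxing query constraints and counting paths. In many cases the neural and relaxation-based methods perform similarly, and no neural model consistently outperforms the relaxation method. The analysis also shows that the retrieved answers overlap very little and that combining the outputs improves performance. These findings indicate that current neural models do not fully capture the reasoning patterns contained in query relaxation and underline the need for stronger non-neural baselines in CQA research.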
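To make "relaxing query constraints and counting resulting paths" concrete, here is one possible reading as a toy sketch. It assumes the relaxation simply ignores relation labels on a multi-hop path query and scores each candidate by the number of paths reaching it from the anchor; the paper's actual relaxation scheme may differ, and the function name and toy graph are hypothetical.

```python
from collections import Counter, defaultdict


def relaxed_path_counts(triples, anchor, hops):
    """Count label-agnostic paths of length `hops` from `anchor` to every entity.

    Assumption: relation constraints are fully relaxed, so only the number of
    hops in the original query is kept.
    """
    neighbors = defaultdict(list)
    for head, _relation, tail in triples:
        neighbors[head].append(tail)

    frontier = Counter({anchor: 1})
    for _ in range(hops):
        next_frontier = Counter()
        for node, count in frontier.items():
            for tail in neighbors[node]:
                next_frontier[tail] += count
        frontier = next_frontier
    return frontier


# Toy knowledge graph as (head, relation, tail) triples.
triples = [
    ("a", "r1", "b"),
    ("a", "r3", "c"),
    ("b", "r2", "d"),
    ("c", "r4", "d"),
    ("c", "r2", "e"),
]
counts = relaxed_path_counts(triples, "a", hops=2)
ranking = counts.most_common()
```

Candidates reached by more relaxed paths rank higher, which gives a training-free baseline score that could be compared or combined with a neural model's ranking.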
Composing diffusion priors with explicit physical context via generative Gibbs sampling
Authors: Weizhou Wang, Jonathan Weare, Aaron R. Dinner
First: 2026-05-11T14:29:20+00:00 · Latest: 2026-05-11T14:29:20+00:00
Comments: 31 pages, 11 figures
Abstract
Pretrained diffusion models provide powerful learned priors, but in scientific sampling the target distribution often depends on physical context that is not fully represented by one generative model. We introduce Generative Gibbs for Physics-Aware Sampling (GG-PA), a training-free framework that formulates the composition of learned partial priors and explicit physical context as inference over a joint target distribution in an augmented state space. We derive a Gibbs sampler for this joint target, show that it is asymptotically exact as the diffusion time approaches zero, and prove that in settings with quadratic interactions it remains exact at finite diffusion times. We further introduce replica exchange over diffusion time to accelerate mixing. Experiments on a double-well system, a $φ^4$ lattice model, and atomistic peptide systems show that GG-PA recovers context-induced distribution shifts and emergent collective behavior in interacting systems using partial priors without retraining. These results demonstrate GG-PA as a practical approach for combining pretrained generative priors with explicit physical context.
LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
Authors: Nikolaos Gkalelis, Vasileios Mezaris
First: 2026-05-11T14:28:44+00:00 · Latest: 2026-05-11T14:28:44+00:00
Comments: Under review
Abstract
Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, a larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework to models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection
Authors: Guohuan Xie, Xin He, Dingying Fan, Siqi Li, Yun Liu
First: 2026-05-11T14:20:22+00:00 · Latest: 2026-05-11T14:20:22+00:00
Abstract
Few-shot anomaly detection (FSAD) has made significant strides, yet existing methods still face critical challenges: (i) dependence on task- or dataset-specific training/fine-tuning, (ii) reliance on language supervision or carefully hand-crafted prompts, and (iii) limited robustness across domains. In this paper, we introduce HyperFSAD, a novel FSAD framework that is training-free, language-free, and robust across domains, offering a powerful solution to these challenges. Built upon DINOv3 and a hypergraph-based inference mechanism, our approach performs inference without any task-specific optimization or text prompts, while remaining competitive. Specifically, we replace sensitive nearest-neighbor / top-$n$ matching with Sparse Hyper Matching: sparsemax first selects the most relevant support patches, which are then aggregated into a hyperedge as compact normal evidence to suppress background noise and distractors. We further introduce Dual-Branch Image Scoring, which fuses spatial anomaly evidence from the patch-grid anomaly map with global semantic deviation captured by support-aware CLS matching, yielding a robust image-level anomaly score in a strictly visual manner. Notably, all components of HyperFSAD are purely visual, eliminating the need for labor-intensive hand-crafted text prompts. Under the stringent training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets spanning four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
Summary
HyperFSAD is a training-free and language-free few-shot anomaly detection framework that addresses the limitations of existing methods by using a hypergraph-based inference mechanism. It replaces traditional nearest-neighbor matching with Sparse Hyper Matching and introduces Dual-Branch Image Scoring to enhance robustness. Under stringent conditions, HyperFSAD outperforms existing methods across six datasets, including industrial and medical datasets.
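HyperFSAD is a training-free, language-free few-shot anomaly detection framework that uses a hypergraph-based inference mechanism to address the limitations of existing methods. It replaces traditional nearest-neighbor matching with Sparse Hyper Matching and introduces Dual-Branch Image Scoring to strengthen robustness. Under strict conditions, HyperFSAD performs strongly on six datasets, including industrial and medical datasets.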
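The abstract's Sparse Hyper Matching relies on sparsemax, which projects a score vector onto the probability simplex and typically assigns exactly zero weight to low-scoring entries. The sketch below implements standard sparsemax and then uses its weights to average support patches into a single vector. The aggregation rule, the dot-product similarity, and the names `hyperedge` and `support` are assumptions for illustration, not the paper's implementation.

```python
def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    running_sum, tau = 0.0, 0.0
    for j, value in enumerate(ordered, start=1):
        running_sum += value
        # The support condition holds for a prefix of the sorted scores;
        # the last j that satisfies it defines the threshold tau.
        if 1.0 + j * value > running_sum:
            tau = (running_sum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


def hyperedge(query, support):
    """Sparse weights over support patches and their weighted mean (assumed aggregation)."""
    sims = [sum(q * s for q, s in zip(query, patch)) for patch in support]
    weights = sparsemax(sims)
    dim = len(query)
    pooled = [sum(w * patch[d] for w, patch in zip(weights, support)) for d in range(dim)]
    return weights, pooled


# Toy example: one query patch and three support patches in two dimensions.
query = [1.0, 0.0]
support = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
weights, pooled = hyperedge(query, support)
```

Unlike softmax, the dissimilar third patch receives a weight of exactly zero here, which is the property that lets sparse matching discard distractors instead of merely down-weighting them.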
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
Authors: Yangneng Chen, Junlin Li, Weijun Yao, Xilai Ma, Guodong Du, Wenya Wang, Jing Li
Venue: ACL 2026
First: 2026-05-11T14:16:25+00:00 · Latest: 2026-05-11T14:16:25+00:00
Comments: Accepted by ACL 2026 Main
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations, i.e., generated text that contradicts the visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model's general capabilities. Our code is publicly available at https://github.com/lab-klc/HAVAE.
Summary
This study addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) by identifying a phenomenon termed Vocabulary Hijacking, where specific visual tokens attract excessive attention and decode to unrelated words. The authors propose HAVAE, a training-free method that enhances the focus of critical attention heads on salient visual content, effectively reducing hallucinations without additional computational cost. NHAR is introduced to quantify the resilience of visual attention to hijacking, and experiments show significant improvements in factual accuracy across multiple benchmarks.
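By identifying the Vocabulary Hijacking phenomenon, in which specific visual tokens (Inert Tokens) attract excessive attention and consistently decode to unrelated words (Hijacking Anchors), this study addresses hallucination in Large Vision-Language Models (LVLMs). The authors propose HAVAE, a training-free method that strengthens the focus of critical attention heads on salient visual content, effectively reducing hallucinations without additional computational overhead. The NHAR metric is introduced to quantify the impact of the phenomenon and to identify robust attention heads. Experiments across multiple benchmarks show that HAVAE significantly mitigates hallucinations while preserving the model's general capabilities.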
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Zelei Cheng, Haohan Wang
First: 2025-08-28T00:07:10+00:00 · Latest: 2026-05-11T14:12:43+00:00
Comments: 56 pages
Abstract
As Large Language Models (LLMs) become increasingly integral to various domains, their potential to generate harmful responses has prompted significant societal and regulatory concerns. In response, governments have issued ethics guidelines to promote the development of trustworthy AI. However, these guidelines are typically high-level demands for developers and testers, leaving a gap in translating them into actionable testing questions to verify LLM compliance. To address this challenge, we introduce GUARD (Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics), a testing method designed to operationalize guidelines into specific guideline-violating questions that assess LLM adherence. To implement this, GUARD uses automated generation of guideline-violating questions based on government-issued guidelines, thereby testing whether responses comply with these guidelines. When responses directly violate guidelines, GUARD reports inconsistencies. Furthermore, for responses that do not directly violate guidelines, GUARD integrates the concept of "jailbreaks" into diagnostics through a component named GUARD-JD, which creates scenarios that provoke unethical or guideline-violating responses, effectively identifying potential scenarios that could bypass built-in safety mechanisms. Our method culminates in a compliance report, delineating the extent of adherence and highlighting any violations. We empirically validated the effectiveness of GUARD on eight LLMs, including Vicuna-13B, LongChat-7B, Llama2-7B, Llama-3-8B, GPT-3.5, GPT-4, GPT-4o, and Claude-3.7, by testing compliance under three government-issued guidelines and conducting jailbreak diagnostics. Additionally, GUARD-JD can transfer jailbreak diagnostics to vision-language models (MiniGPT-v2 and Gemini-1.5), demonstrating its usage in promoting reliable LLM-based applications.
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
Authors: Chen Zhong, Xiao An, Jiaxing Sun, Zihan Gui, Guangyi Yang, Wei He
First: 2026-05-11T13:48:57+00:00 · Latest: 2026-05-11T13:48:57+00:00
Abstract
Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.
Summary
SenseBench is a benchmark designed to evaluate the low-level visual perception and description capabilities of Vision-Language Models (VLMs) in remote sensing (RS) images. It addresses the gap in current IQA methods that fail to characterize RS-specific degradations. The benchmark includes over 10K instances across 22 RS degradation categories, and evaluates both objective perception and subjective description. Key findings show that VLMs have skewed domain priors, exhibit multi-distortion collapse, and suffer from fluency illusion and perception-description inversion.
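SenseBench is a benchmark for evaluating the low-level visual perception and description capabilities of Vision-Language Models on remote sensing images. It addresses the inability of current image quality assessment methods to characterize remote-sensing-specific degradations. The benchmark contains over 10K instances covering 22 remote sensing degradation categories and evaluates both objective perception and subjective description. Key findings show that VLMs have skewed domain priors, exhibit multi-distortion collapse, and suffer from fluency illusion and a perception-description inversion effect.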
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
Authors: Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu, Lei Yang, Hang Zhang, Mu Xu, Hong Wang
Venue: ICML 2026
First: 2026-05-11T13:36:51+00:00 · Latest: 2026-05-11T13:36:51+00:00
Comments: ICML 2026
Abstract
End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. The resulting approach achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Code is available at: https://github.com/hotdogcheesewhite/DeepSight.
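Title / Abstract (translated from Chinese)
Title: DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
End-to-end autonomous driving systems increasingly integrate vision-language model (VLM) architectures, using text reasoning or visual reasoning to make driving decisions more robust and accurate. However, the reasoning mechanisms in most methods are adapted directly from general domains and lack in-depth exploration tailored to autonomous driving scenarios, particularly in visual reasoning modules. In this paper, we propose a driving world model that predicts, in parallel, latent semantic features for consecutive future frames in bird's-eye-view (BEV) space, enabling long-horizon modeling of future world states. We also introduce an efficient, adaptive text reasoning mechanism that draws on additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. Our novel, efficient, and effective approach achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Code is available at https://github.com/hotdogcheesewhite/DeepSight.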
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
Authors: Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato, Zhanyu Ma
First: 2026-05-08T10:43:54+00:00 · Latest: 2026-05-11T13:10:17+00:00
Comments: 23 pages, 12 figures, including appendices
Abstract
Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
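Title / Abstract (translated from Chinese)
Title: PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
Mainstream vision-language models (VLMs) face fundamental difficulties with severe optical ambiguities such as reflections and transparent objects, owing to the inherent limitations of standard RGB inputs. Although polarization imaging captures polarimetric physical parameters that can resolve these ambiguities, existing methods are limited to fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we propose PolarVLM, the first multimodal framework that integrates polarimetric physical parameters into VLMs. With a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. To complement the architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, which targets reflective and transparent scenes and contains 75K physics-grounded instruction-tuning pairs. Experiments show that PolarVLM outperforms the RGB baseline by 25.4% overall across five evaluation tasks, with notable gains of 26.6% in reflection recognition and 34.0% in glass counting, unlocking physics-aware semantic understanding.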
SLASH the Sink: Sharpening Structural Attention Inside LLMs
Authors: Yiming Liu, Bin Lu, Xinbing Wang, Chenghu Zhou, Meng Jin
First: 2026-05-11T12:59:07+00:00 · Latest: 2026-05-11T12:59:07+00:00
Abstract
Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or fine-tuning, which incur high costs and lose generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (Slash), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate that Slash delivers significant and consistent performance gains across diverse LLMs.
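Title / Abstract (translated from Chinese)
Title: SLASH the Sink: Sharpening Structural Attention Inside LLMs
Large language models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in serialized form. Existing solutions rely on training external graph-based adapters or on fine-tuning, which is costly and sacrifices generalizability. In this work, we examine the internal mechanisms of LLMs and report a key finding: LLMs spontaneously reconstruct the graph's topology internally, which appears in their attention maps as a distinctive "sawtooth" pattern that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck arising from a fundamental conflict: the model's anisotropic bias, which is essential for language tasks, suppresses the topology-aware local aggregation that graph reasoning requires. To address this, we propose a training-free solution named StructuraL Attention SHarpening (Slash), which amplifies this internal structural understanding through plug-and-play attention redistribution. Experiments show that Slash delivers significant and consistent performance gains for a variety of LLMs on pure graph tasks and molecular prediction.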
Summary
This work addresses the limitation of Large Language Models (LLMs) in handling graph topologies by proposing a training-free method called StructuraL Attention SHarpening (Slash). The study finds that LLMs internally reconstruct graph topology through a 'sawtooth' pattern in attention maps, but this is diluted by an attention sink. Slash amplifies this internal structural understanding by redistributing attention, leading to significant performance gains in graph tasks and molecular prediction across various LLMs.
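Addressing the limitations of large language models (LLMs) in handling graph structure, this study proposes a training-free method, StructuraL Attention SHarpening (Slash). The study finds that LLMs internally reconstruct a graph's topology through a "sawtooth" pattern in their attention maps, but that this process is weakened by the attention sink. Slash strengthens this internal structural understanding by redistributing attention, significantly improving performance on graph tasks and molecular prediction across various LLMs.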
Filtering Memorization from Parameter-Space in Diffusion Models
Authors: Yu Zhe, Yang Jiayan, Wei Junhao, Yu-Lin Tsai, Wang Chen
First: 2026-05-11T12:09:42+00:00 · Latest: 2026-05-11T12:09:42+00:00
Abstract
Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, enabling users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is particularly concerning in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches for mitigating memorization rely on access to the training pipeline, training data, or control over the inference process, making them difficult to apply when only the released LoRA weights are available. We propose Base-Anchored Filtering (BAF), a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones demonstrate that BAF consistently reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.
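Title / Abstract (translated from Chinese)
Title: Filtering Memorization from Parameter-Space in Diffusion Models
Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, allowing users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is especially worrying in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches to mitigating memorization depend on access to the training pipeline, the training data, or control over the inference process, which makes them hard to apply when only the released LoRA weights are available. We propose Base-Anchored Filtering (BAF), a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones show that BAF reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.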
Summary
The paper addresses the issue of memorization in Low-Rank Adaptation (LoRA) for diffusion models, which can lead to the generation of copyrighted or sensitive content. To mitigate this, the authors propose Base-Anchored Filtering (BAF), a training-free and data-free method that decomposes LoRA updates into spectral channels and retains only those aligned with the principal subspace of the pretrained backbone, thereby reducing memorization while maintaining generation quality. Experiments show that BAF effectively reduces memorization across various datasets and diffusion models without compromising on the quality of generated outputs.
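The paper addresses memorization in Low-Rank Adaptation (LoRA) for diffusion models, which can lead to the reproduction of copyrighted or sensitive content. To address this, the authors propose Base-Anchored Filtering (BAF), a method that requires neither training nor data: it decomposes LoRA updates into spectral channels and keeps only the channels that are strongly aligned with the principal subspace of the pretrained backbone, reducing memorization while maintaining generation quality. Experiments show that BAF consistently reduces memorization without sacrificing generation quality.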
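The filtering idea in the abstract (split a LoRA update into spectral channels, score each channel by its alignment with the backbone's principal subspace, keep the aligned ones) can be sketched with a plain SVD. This is a minimal sketch under assumptions, not the authors' BAF code: it uses left singular vectors for alignment, drops weak channels entirely, and the parameters `subspace_rank` and `threshold` are hypothetical.

```python
import numpy as np


def base_anchored_filter(w_base, lora_b, lora_a, subspace_rank=2, threshold=0.5):
    """Keep only LoRA spectral channels aligned with the base weight's top subspace.

    Returns the filtered update and the per-channel alignment scores in [0, 1].
    """
    delta = lora_b @ lora_a                      # full low-rank update
    rank = lora_a.shape[0]
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    u, s, vt = u[:, :rank], s[:rank], vt[:rank]  # one channel per LoRA rank

    u_base, _, _ = np.linalg.svd(w_base, full_matrices=False)
    basis = u_base[:, :subspace_rank]            # assumed "principal subspace"

    # Norm of each channel direction after projection onto the base subspace.
    alignment = np.linalg.norm(basis.T @ u, axis=0)
    keep = alignment >= threshold
    filtered = (u[:, keep] * s[keep]) @ vt[keep]
    return filtered, alignment


# Toy example: one channel lies in the base subspace, the other does not.
w_base = np.diag([10.0, 9.0, 0.1, 0.1])
lora_b = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
lora_a = np.array([[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
filtered, alignment = base_anchored_filter(w_base, lora_b, lora_a)
```

The sketch only needs the released LoRA factors and the base weight, which matches the training-free, data-free setting described in the abstract; how the paper defines the subspace and the suppression strength may differ.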
Progressive Photorealistic Simplification
Authors: Adi Rosenthal, Dana Berman, Yedid Hoshen, Ariel Shamir
First: 2026-05-11T11:47:44+00:00 · Latest: 2026-05-11T11:47:44+00:00
Abstract
Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.
Summary / 总结
This paper addresses the challenge of simplifying images while preserving their photorealistic appearance, which Non-Photorealistic Rendering techniques typically sacrifice. The authors propose a progressive semantic image simplification framework that iteratively removes and inpaints scene elements so that the image at every step remains a plausible natural photograph. It is implemented as a Select-Remove-Verify pipeline in which Vision-Language Models identify and prioritize elements for removal and a learned verifier enforces photorealism and coherence. For efficiency, the pipeline is further distilled into an image-to-video generation model that predicts a coherent simplification sequence directly from a single input image. The approach produces cleaner, more focused compositions and enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing.
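This paper tackles the challenge of simplifying images while keeping their photorealistic appearance, a property that traditional non-photorealistic rendering techniques tend to lose. The authors propose a progressive semantic image simplification framework that iteratively removes and inpaints elements so that every step remains a plausible natural photograph. This is realized through a Select-Remove-Verify pipeline that uses vision-language models to identify the elements to remove and a learned verifier to maintain photorealism and coherence. The method is further distilled into an image-to-video generation model that produces a coherent simplification sequence from a single image. Key results include cleaner compositions and applications such as content-aware decluttering, semantic layer decomposition, and interactive editing.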
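The iterative Select-Remove-Verify loop described in this entry can be sketched as plain control flow. This is a minimal illustration, not the authors' implementation: the names `simplify_progressively`, `select`, `remove`, `verify`, `max_steps`, and `max_retries` are all hypothetical, the retry-then-stop policy is an assumption, and the toy "image" is just a set of object names standing in for a real VLM selector, inpainting model, and learned verifier.

```python
from typing import Callable, FrozenSet, List, Optional

Scene = FrozenSet[str]  # toy stand-in for an image: the set of objects it shows


def simplify_progressively(
    image: Scene,
    select: Callable[[Scene], Optional[str]],   # hypothetical: picks the next element to remove
    remove: Callable[[Scene, str], Scene],      # hypothetical: removes and inpaints that element
    verify: Callable[[Scene, Scene], bool],     # hypothetical: accepts or rejects the edit
    max_steps: int = 10,
    max_retries: int = 2,
) -> List[Scene]:
    """Return the trajectory of accepted images, starting with the input."""
    trajectory = [image]
    current = image
    for _ in range(max_steps):
        target = select(current)
        if target is None:          # nothing left that the selector wants to remove
            break
        accepted = None
        for _ in range(max_retries + 1):
            candidate = remove(current, target)
            if verify(current, candidate):
                accepted = candidate
                break
        if accepted is None:        # assumption: stop when no edit passes verification
            break
        current = accepted
        trajectory.append(current)
    return trajectory


# Toy components (assumptions for illustration only).
def toy_select(scene: Scene) -> Optional[str]:
    extras = sorted(obj for obj in scene if obj != "subject")
    return extras[0] if extras else None


def toy_remove(scene: Scene, target: str) -> Scene:
    return scene - {target}


def toy_verify(before: Scene, after: Scene) -> bool:
    # Accept only edits that shrink the scene and keep the main subject.
    return after < before and "subject" in after


if __name__ == "__main__":
    start = frozenset({"subject", "lamp", "cup"})
    for step, scene in enumerate(simplify_progressively(start, toy_select, toy_remove, toy_verify)):
        print(step, sorted(scene))
```

Keeping the three stages as injected callables makes it easy to swap the toy functions for real models, and returning the whole trajectory mirrors the entry's emphasis on sequences of intermediate images rather than a single final result.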
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
Authors: Xi Jiang, Yinjie Zhao, Zesheng Yang, Feng Zheng
First: 2026-05-11T11:40:07+00:00 · Latest: 2026-05-11T11:40:07+00:00
Comments: We release the agent, the benchmark, and the analysis artifacts at https://github.com/jam-cc/AnomalyClaw
Abstract
Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves the anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
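For readers unfamiliar with the metric, the sketch below shows one common way a macro-AUROC and a gain in percentage points (pp) can be computed. It assumes that "macro" means an unweighted mean of per-dataset AUROC values, which the abstract does not spell out. The function names and the two toy datasets are hypothetical, and the numbers are made up for illustration and unrelated to the reported results.

```python
from typing import Dict, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that an anomalous sample (label 1) scores
    higher than a normal sample (label 0), counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(per_dataset: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of per-dataset AUROC (assumed meaning of 'macro')."""
    values = [auroc(labels, scores) for labels, scores in per_dataset.values()]
    return sum(values) / len(values)


def gain_pp(baseline: float, improved: float) -> float:
    """Difference between two AUROC values, expressed in percentage points."""
    return 100.0 * (improved - baseline)


# Toy data only: two hypothetical datasets with made-up anomaly scores.
direct = {
    "dataset_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "dataset_b": ([0, 1], [0.5, 0.5]),
}
agent = {
    "dataset_a": ([0, 0, 1, 1], [0.1, 0.2, 0.6, 0.8]),
    "dataset_b": ([0, 1], [0.3, 0.7]),
}

if __name__ == "__main__":
    base, improved = macro_auroc(direct), macro_auroc(agent)
    print(f"macro-AUROC: {base:.3f} -> {improved:.3f} ({gain_pp(base, improved):+.2f} pp)")
```

Averaging per dataset rather than pooling all samples gives each dataset equal weight regardless of its size, which is the usual motivation for a macro average on a multi-dataset benchmark.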