arXiv Paper Digest

2026-05-17 04:25
Snapshot: 20260517_0425
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Authors: Yifan Wang, Tong He
First: 2026-05-14T17:58:26+00:00 · Latest: 2026-05-14T17:58:26+00:00
Comments: Project page: https://yyfz.github.io/warp-as-history/
Abstract
Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.
Summary
Warp-as-History lets a frozen video generation model follow camera trajectories without training, architectural modification, or test-time optimization. It turns camera-induced warps into camera-warped pseudo-history, aligns that history's positional encoding with the target frames being denoised, and removes warped-history tokens that have no valid source observation. The interface exposes a zero-shot capability of the frozen model, and lightweight offline LoRA finetuning on a single camera-annotated video further improves camera adherence, visual quality, and motion dynamics on unseen videos.
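The visible-token selection step described in the abstract can be illustrated with a minimal sketch. It assumes a per-pixel validity mask for the warped frame is already available (1 where a source observation landed after warping, 0 otherwise) and pools it into patch-level tokens. The function name `visible_token_indices`, the patch size, and the `min_valid_fraction` threshold are hypothetical choices for illustration and are not taken from the paper.

```python
def visible_token_indices(valid_mask, patch_size, min_valid_fraction=0.5):
    """Return flat indices of patch tokens that contain enough valid warped pixels.

    valid_mask: 2D list of 0/1 values, one per pixel of the warped frame.
    patch_size: side length of a square patch (one token per patch).
    min_valid_fraction: hypothetical threshold for keeping a token.
    """
    rows, cols = len(valid_mask), len(valid_mask[0])
    tokens_per_row = cols // patch_size
    keep = []
    for token_row in range(rows // patch_size):
        for token_col in range(tokens_per_row):
            # Count valid pixels inside this patch.
            valid = sum(
                valid_mask[token_row * patch_size + dr][token_col * patch_size + dc]
                for dr in range(patch_size)
                for dc in range(patch_size)
            )
            if valid / (patch_size * patch_size) >= min_valid_fraction:
                # The flat index is the token's position in the frame grid.
                keep.append(token_row * tokens_per_row + token_col)
    return keep


# Left half of a 4x4 frame received source observations, right half did not.
mask = [[1, 1, 0, 0] for _ in range(4)]
kept = visible_token_indices(mask, patch_size=2)
```

Because each kept index is the token's position in the frame grid, it could also be reused as that token's positional index, which is one plausible reading of the target-frame positional alignment the abstract mentions.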
Does Synthetic Layered Design Data Benefit Layered Design Decomposition?
Authors: Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen
First: 2026-05-14T17:55:11+00:00 · Latest: 2026-05-14T17:55:11+00:00
Comments: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers
Abstract
Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on CLD baseline, which is a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.
Summary
This study investigates whether purely synthetic layered data can improve graphic design decomposition. Building on the CLD layer decomposition framework, the authors construct a synthetic dataset, SynLayers, generate textual supervision with vision-language models, and automate inference inputs with VLM-predicted bounding boxes. Training on purely synthetic data outperforms non-scalable alternatives such as PrismLayersPro. Performance improves consistently with training data scale, with gains beginning to saturate at around 50K samples. Synthetic data also allows balanced control over layer-count distributions, avoiding the layer-count imbalance common in real-world datasets.
Do-Undo Bench: Reversibility for Action Understanding in Image Generation
Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli
First: 2025-12-15T18:03:42+00:00 · Latest: 2026-05-14T17:13:30+00:00
Comments: Project page: https://s-mahajan.github.io/Do-Undo-Bench/
Abstract
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.
Summary
This work introduces the Do-Undo task and benchmark for evaluating whether vision-language models can understand and generate plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing, it requires a model to simulate the outcome of a real-world action and then reverse it to the original state, which tests genuine cause-and-effect understanding rather than stylistic or semantic edits. The benchmark curates a high-quality set of reversible actions from real-world scenarios to enable robust action grounding. Experiments show that current models struggle with action reversibility, highlighting the need to evaluate and improve action understanding in multimodal systems.
On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen
First: 2026-05-14T16:58:16+00:00 · Latest: 2026-05-14T16:58:16+00:00
Comments: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/
Abstract
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
Summary
This work addresses cultural anachronism in Vision-Language Models (VLMs): the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. It introduces the Temporal Anachronism Benchmark for VLMs (TAB-VLM), a dataset of 600 questions across six categories that evaluates temporal reasoning on 1,600 Indian cultural artifacts. Evaluations of ten state-of-the-art models show significant deficiencies, with the best model (GPT-5.2) reaching only 58.7% overall accuracy. The gap persists across architectures and scales, pointing to a limitation in interpreting cultural heritage materials, especially for non-Western visual cultures. The benchmark provides a foundation for improving temporal cognition in multimodal AI systems.
LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection
Authors: Mitchell Piehl, Muchao Ye
First: 2026-05-14T16:48:03+00:00 · Latest: 2026-05-14T16:48:03+00:00
Abstract
Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we aim to address a key limitation of such pipelines, which perform segment-level inference independently owing to token constraints and reason without structured temporal context, allowing VLMs to interpret anomalies as deviations from evolving video dynamics rather than producing fragmented predictions and explanations. To specify, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.
Summary
The paper targets a limitation of vision-language model (VLM) pipelines for video anomaly detection (VAD): they run segment-level inference independently and reason without structured temporal context. It proposes LATERN, a context-aware framework that reformulates VAD as a temporal evidence aggregation process. LATERN has two modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA uses an image-grounded memory mechanism that selects historical content as expanded context for anomaly scoring, and REA recursively aggregates temporal evidence to identify coherent anomaly intervals. Experiments on UCF-Crime and XD-Violence show that LATERN improves detection accuracy and explanation consistency for frozen VLMs at test time, while generating temporally coherent and semantically grounded event-level explanations.
MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs
Authors: Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang
First: 2026-05-14T15:31:18+00:00 · Latest: 2026-05-14T15:31:18+00:00
Comments: 19 pages, 17 figures
Abstract
Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
Authors: Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen
First: 2026-05-14T14:58:46+00:00 · Latest: 2026-05-14T14:58:46+00:00
Comments: Preprint. Code, models, and dataset are provided in the manuscript
Abstract
General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
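The scene -> object -> part -> affordance hierarchy described in the abstract can be pictured as nested records. The following is a minimal sketch only: the class names, field names, and the `chain_instances` helper are hypothetical and do not reflect the actual SceneParser-Bench annotation schema or the paper's metrics.

```python
from dataclasses import dataclass, field


@dataclass
class Affordance:
    label: str    # e.g. "grasp"
    point: tuple  # assumed (x, y) interaction point in pixels


@dataclass
class Part:
    label: str
    box: tuple    # assumed (x_min, y_min, x_max, y_max)
    affordances: list = field(default_factory=list)


@dataclass
class SceneObject:
    label: str
    box: tuple
    parts: list = field(default_factory=list)


@dataclass
class Scene:
    description: str
    objects: list = field(default_factory=list)


def chain_instances(scene):
    """List every complete object-part-affordance chain in one parsed scene."""
    return [
        (obj.label, part.label, aff.label)
        for obj in scene.objects
        for part in obj.parts
        for aff in part.affordances
    ]


# A mug whose handle can be grasped; the body carries no affordance here.
handle = Part("handle", (60, 20, 80, 60), [Affordance("grasp", (70, 40))])
body = Part("body", (10, 10, 60, 70))
scene = Scene("kitchen counter", [SceneObject("mug", (10, 10, 80, 70), [handle, body])])
chains = chain_instances(scene)
```

Keeping the cross-level bindings as explicit nesting means a part without an affordance (the mug body above) simply contributes no chain, which is the kind of incompleteness a hierarchical completeness metric would need to account for.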
Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy
Authors: Abdulrahman Alswaidan, Jeffrey D. Varner
First: 2026-03-06T20:50:30+00:00 · Latest: 2026-05-14T14:55:42+00:00
Comments: Main body (including references excluding the appendix): 11 pages, 2 figures and 1 table. Total paper: 26 pages, 13 figures and 7 pages
Abstract
Attention heads retrieve: given a query, they return a weighted average of stored values. We showed that this computation is one step of gradient descent on the modern Hopfield energy, and that Langevin sampling from the corresponding Boltzmann distribution yielded stochastic attention, a training-free sampler controlled by a single temperature parameter. Lowering the temperature gave exact retrieval; raising it gave open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model was required, making the approach particularly suited to the low-data regime where learned generative models are starved of training signal. We derived an entropy inflection condition that identified the retrieval-to-generation transition temperature for any memory geometry and validated the sampler on five domains spanning two orders of magnitude in dimension. A single Boolean mask on the attention softmax, identical to the causal mask used in transformers but applied along the memory axis rather than the sequence axis, turned the sampler into a zero-shot class-conditional generator on Olivetti faces with no retraining and no learned classifier. On MNIST digit images, stochastic attention produced samples that were markedly more novel and more diverse than the best learned baseline while matching a Metropolis-corrected gold standard. On protein sequences from a small Pfam family, the generation regime preserved amino acid composition far more faithfully than a variational autoencoder at matched novelty, indicating that the training-free score function retained family-level fidelity that learned models lost. A denoising diffusion baseline failed across all memory sizes tested, producing samples indistinguishable from isotropic noise. The approach required no architectural changes to the underlying attention mechanism.
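The link between attention and the modern Hopfield energy can be made concrete with a small sketch. It assumes the energy E(x) = 0.5 * |x|^2 - (1/beta) * logsumexp(beta * M x), whose gradient is x minus the softmax-weighted average of the stored patterns M, and it uses a plain unadjusted Langevin update. The function names, the step size, and the way the temperature enters the noise term are assumptions for illustration and may differ from the paper's sampler.

```python
import math
import random


def attention_retrieve(x, memories, beta):
    """Softmax-weighted average of stored patterns: one attention read."""
    scores = [beta * sum(m_d * x_d for m_d, x_d in zip(m, x)) for m in memories]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * m[d] for w, m in zip(weights, memories)) / total
        for d in range(len(x))
    ]


def langevin_step(x, memories, beta, step, temperature, rng):
    """One unadjusted Langevin step on the assumed Hopfield energy.

    grad E(x) = x - attention_retrieve(x), so with step=1 and temperature=0
    the update reduces to a single deterministic attention read.
    """
    retrieved = attention_retrieve(x, memories, beta)
    noise_scale = math.sqrt(2.0 * step * temperature)
    return [
        x_d - step * (x_d - r_d) + noise_scale * rng.gauss(0.0, 1.0)
        for x_d, r_d in zip(x, retrieved)
    ]


memories = [[1.0, 0.0], [0.0, 1.0]]
rng = random.Random(0)

# Zero temperature: the step lands on the retrieved pattern.
cold = langevin_step([0.9, 0.1], memories, beta=50.0, step=1.0, temperature=0.0, rng=rng)

# Positive temperature: repeated steps wander between stored patterns.
state = [0.9, 0.1]
for _ in range(100):
    state = langevin_step(state, memories, beta=5.0, step=0.1, temperature=0.5, rng=rng)
```

In this sketch the only knob separating retrieval from generation is `temperature`, which mirrors the single-parameter control described in the abstract.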
SteerSeg: Attention Steering for Reasoning Video Segmentation
Authors: Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An, Lars Petersson
First: 2026-05-14T14:42:15+00:00 · Latest: 2026-05-14T14:42:15+00:00
Comments: Project page: https://steerseg.github.io
Abstract
Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io
Summary
SteerSeg addresses the diffuse and ambiguous grounding signals that arise when attention maps from frozen large vision-language models (LVLMs) are used for video reasoning segmentation. It is a lightweight framework that steers attention at its source through input-level conditioning, combining learnable soft prompts with reasoning-guided Chain-of-Thought prompting to produce more spatially concentrated attention maps, which then guide a segmentation model. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks and significantly improves the spatial grounding capability of LVLMs.
Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
Authors: Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising
Venue: IEEE Data Descriptions, 2026
First: 2025-11-17T14:12:22+00:00 · Latest: 2026-05-14T14:41:56+00:00
Abstract
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
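The sample structure described above (image, question, ground truth answer, object distance) lends itself to a per-distance accuracy breakdown. The sketch below is illustrative only: the `DTPQASample` class, its field names, the 10 m bin width, and the exact-match scoring rule are assumptions and are not taken from the dataset's released scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DTPQASample:
    image_path: str    # hypothetical field names
    question: str
    answer: str        # ground truth answer
    distance_m: float  # distance of the queried object from the camera


def accuracy_by_distance(samples, predictions, bin_width_m=10.0):
    """Group samples into distance bins and compute exact-match accuracy per bin."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, prediction in zip(samples, predictions):
        bin_start = int(sample.distance_m // bin_width_m) * bin_width_m
        totals[bin_start] += 1
        if prediction.strip().lower() == sample.answer.strip().lower():
            hits[bin_start] += 1
    return {start: hits[start] / totals[start] for start in sorted(totals)}


samples = [
    DTPQASample("img_0.png", "Is there a pedestrian crossing?", "yes", 5.0),
    DTPQASample("img_1.png", "Is the traffic light red?", "no", 8.0),
    DTPQASample("img_2.png", "Is there a pedestrian crossing?", "yes", 35.0),
]
predictions = ["Yes", "yes", "yes"]
report = accuracy_by_distance(samples, predictions)
```

Here `report` maps the lower edge of each distance bin to the accuracy inside that bin, which is the kind of view needed to see how performance changes as the queried object moves farther from the camera.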
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
Authors: Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See
First: 2026-05-14T14:41:17+00:00 · Latest: 2026-05-14T14:41:17+00:00
Comments: Work in progress
Abstract
Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
Authors: Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek
First: 2026-05-14T14:37:50+00:00 · Latest: 2026-05-14T14:37:50+00:00
Abstract
Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
Summary
The study examines structural anomalies and non-semantic noise in the shared latent spaces of contrastively pre-trained Vision-Language Models (VLMs). Using spectral decomposition of covariance matrices, it splits the latent space into a multi-modal semantic signal component and a shared noise subspace. The noise geometry shows strong subgroup invariance across distinct data subsets, and pruning the shared noise dimensions is mainly harmless, preserving or even improving downstream task performance. The results suggest that a substantial fraction of the latent geometry of VLMs is governed by shared, architecture-level noise rather than task-relevant semantics alone.
Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
Authors: Md Abu Obaida Zishan, Jannatun Noor, Annajiat Alim Rasel
First: 2026-05-09T05:13:21+00:00 · Latest: 2026-05-14T14:25:41+00:00
Comments: Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3
Abstract
Stable Diffusion (SD) has evolved DDPM (Denoising Diffusion Probabilistic Model) based image generation significantly by denoising in latent space instead of feature space. This popularized DDPM-based image generation as the cost and compute barrier was significantly lowered. However, these models could only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images show object duplication artifacts consistently. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved a great level of success. But dilated kernels are harder to fine-tune due to being zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient and achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of $2.6\%$ in accuracy and F1-Score relative to the baseline. This shows the applicability of our method to be general, where we interpolate fully-connected layers, going beyond convolution layers. We also discuss how we can reduce the memory footprints of training neural networks, using our method up to at least $4\times$.
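The idea of interpolating a convolution kernel and then multiplying by a constant coefficient can be sketched as follows. This is an illustration under stated assumptions, not the paper's derivation: it uses bilinear interpolation with aligned corner taps, and it assumes the coefficient is the ratio of tap counts, (old_size / new_size)^2, so that a larger kernel does not inflate the response. The function name `interpolate_kernel` is hypothetical.

```python
def interpolate_kernel(kernel, new_size):
    """Bilinearly resample a square 2D kernel to new_size x new_size, then rescale.

    kernel: square 2D list of floats with side length >= 2.
    The rescaling coefficient (old_size / new_size) ** 2 is an assumed choice.
    """
    old_size = len(kernel)
    if old_size < 2 or new_size < 2:
        raise ValueError("both kernel sizes must be at least 2")
    scale = (old_size - 1) / (new_size - 1)   # align the corner taps
    coefficient = (old_size / new_size) ** 2  # assumed constant coefficient
    resized = []
    for i in range(new_size):
        r = i * scale
        r0 = min(int(r), old_size - 2)  # top neighbour row
        fr = r - r0                     # vertical interpolation weight
        row = []
        for j in range(new_size):
            c = j * scale
            c0 = min(int(c), old_size - 2)  # left neighbour column
            fc = c - c0                     # horizontal interpolation weight
            value = (
                (1 - fr) * (1 - fc) * kernel[r0][c0]
                + (1 - fr) * fc * kernel[r0][c0 + 1]
                + fr * (1 - fc) * kernel[r0 + 1][c0]
                + fr * fc * kernel[r0 + 1][c0 + 1]
            )
            row.append(coefficient * value)
        resized.append(row)
    return resized


# Scale a 3x3 averaging-style kernel up to 5x5.
small = [[1.0, 1.0, 1.0] for _ in range(3)]
large = interpolate_kernel(small, 5)
```

For a constant kernel such as `small`, this choice of coefficient keeps the sum of the kernel weights unchanged, so the response to a flat input region stays the same. For non-constant kernels the sum is only approximately preserved.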
中文标题/摘要
标题:超采样稳定扩散及更进一步:一种无需训练的无缝扩展神经网络的方法
稳定扩散(SD)通过在潜在空间而非特征空间去噪,显著提升了基于DDPM(去噪扩散概率模型)的图像生成技术,大幅降低了成本和计算门槛。然而,这些模型只能生成与其训练配置相匹配的固定分辨率图像。当尝试生成更高分辨率的图像时,结果图像会表现出对象重复的伪影。为了解决这一问题而不对SD模型进行微调,最近的研究尝试扩大模型卷积核的大小,并取得了显著的成功。但是,扩大的卷积核由于存在零间隙,难以进行微调。除了这种方法之外,其他方法,如补丁扩散,也无法高效地解决对象重复问题。因此,为了克服扩大小卷积核的局限性,我们提出了使用内插法对SD模型进行高分辨率图像生成。在本文中,我们通过数学证明了内插法可以在乘以一个常数系数后正确地扩展卷积核,并在无需训练的情况下使用稳定扩散生成超出训练分辨率的图像,取得了具有竞争力的实验结果。此外,我们展示了我们的方法能够内插深度神经网络以适应更高维度的训练数据,最坏情况下准确率和F1分数下降2.6%。这表明我们的方法具有广泛的适用性,我们不仅内插了全连接层,还超越了卷积层。我们还讨论了如何使用我们的方法减少训练神经网络的内存占用,最多可以减少4倍。
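The kernel-interpolation idea in the abstract above can be illustrated with a minimal 1D sketch. It assumes linear interpolation and a constant coefficient of 1 / factor; the abstract does not state the coefficient's value, so that choice, along with the function name interpolate_kernel_1d, is an assumption for illustration and not the authors' implementation.

```python
def interpolate_kernel_1d(kernel, factor):
    """Linearly interpolate a 1D kernel to len(kernel) * factor taps,
    then multiply by a constant coefficient (assumed here: 1 / factor)."""
    n = len(kernel)
    m = n * factor
    stretched = []
    for j in range(m):
        # Map output tap j onto the input grid so the endpoints coincide.
        pos = j * (n - 1) / (m - 1) if m > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        stretched.append((1.0 - frac) * kernel[lo] + frac * kernel[hi])
    # Assumption: the stretched kernel covers `factor` times more samples,
    # so dividing by `factor` keeps its total weight comparable.
    coefficient = 1.0 / factor
    return [coefficient * w for w in stretched]
```

Unlike a dilated kernel, the interpolated kernel has no zero gaps between taps, which is the property the abstract contrasts against. A 2D version would interpolate along both axes and would presumably need a different constant.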
AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Authors: Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv
First: 2026-03-16T05:50:31+00:00 · Latest: 2026-05-14T14:09:03+00:00
Abstract
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to https://automot-website.github.io/ for the demonstration videos and qualitative results.
Summary / 总结
AutoMoT is an end-to-end autonomous driving framework that integrates vision-language models into a unified vision-language-action model using a mixture-of-transformers architecture. This approach addresses limitations of existing methods by efficiently balancing reasoning and action generation, and enabling asynchronous execution for fast-slow inference. Experimental results show that AutoMoT performs competitively on multiple benchmarks and that pre-trained vision-language models can achieve good multi-task scene understanding through semantic prompting alone, though fine-tuning is still necessary for action-level tasks like decision-making and trajectory planning.
AutoMoT 是一个端到端的自动驾驶框架,将视觉-语言模型整合到一个统一的视觉-语言-行动模型中,使用混合变换器架构。该方法通过高效平衡推理和行动生成,并实现异步执行以进行快慢推理来解决现有方法的局限性。实验结果表明,AutoMoT 在多个基准测试中表现竞争力,并且预训练的视觉-语言模型仅通过语义提示即可实现良好的多任务场景理解,但针对决策和轨迹规划等行动级任务仍需进行细调。
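The fast-slow inference pattern described above can be sketched as a loop in which a slow reasoning module runs at a low frequency and a fast action head runs on every tick. This is a synchronous toy simulation only; run_fast_slow, slow_reason, fast_act, and slow_every are hypothetical names, and the real system executes the two paths asynchronously inside one model.

```python
def run_fast_slow(observations, slow_reason, fast_act, slow_every=5):
    """Run a fast action head every tick and a slow reasoning module
    only every `slow_every` ticks, reusing its latest output in between."""
    cached_context = None
    actions = []
    for tick, obs in enumerate(observations):
        if tick % slow_every == 0:
            # Slow path: refresh the reasoning context at a low frequency.
            cached_context = slow_reason(obs)
        # Fast path: act on every tick using the most recent context.
        actions.append(fast_act(obs, cached_context))
    return actions
```

The design point is that action latency is bounded by the fast path alone, since stale reasoning output is reused between refreshes.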
Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study
Authors: Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia
First: 2026-05-14T13:53:28+00:00 · Latest: 2026-05-14T13:53:28+00:00
Comments: Accepted at the 14th International Workshop on Biometrics and Forensics
Abstract
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical "Rationalization Trap" emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
Summary / 总结
This study explores the zero-shot performance of state-of-the-art Vision-Language Models (VLMs) on the Signature Verification Challenge (SVC) benchmark. By converting raw kinematic time-series into static images and introducing a scoring protocol, the research evaluates GPT-5.2 and Gemini 2.5 Pro. Results show that in random forgery scenarios, VLMs perform exceptionally well, with GPT-5.2 achieving an Equal Error Rate of 0.32% in mobile tasks. However, in skilled forgery scenarios, performance drops significantly due to a 'Rationalization Trap' where CoT reasoning leads to kinematic hallucinations, degrading performance.
这项研究探讨了最先进的Vision-Language模型(VLM)在签名验证挑战(SVC)基准上的零样本性能。通过将原始的运动时间序列转换为静态图像并引入评分协议,研究评估了GPT-5.2和Gemini 2.5 Pro。结果显示,在随机伪造场景中,VLM表现出色,GPT-5.2在移动任务中的等错误率为0.32%。但在高技能伪造场景中,由于‘理性化陷阱’导致的运动幻觉推理降低了性能。
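For readers unfamiliar with the Equal Error Rate quoted above, the following is a minimal sketch of how it can be approximated from a set of genuine and impostor scores. It is a generic illustration, not the authors' evaluation code; the function name and the convention that higher scores mean "more likely genuine" are assumptions.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return an approximate EER: the error rate at the threshold where
    false acceptance and false rejection rates are closest."""
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # A comparison is accepted as genuine when its score is >= t.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

With perfectly separated score distributions the function returns 0.0; the more the two distributions overlap, the higher the value.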
The Velocity Deficit: Initial Energy Injection for Flow Matching
Authors: Linze Li, Zong-Wei Hong, Shen Zhang, Bo Lin, Jinglun Li, Yao Tang, Jiajun Liang
First: 2026-05-14T13:30:07+00:00 · Latest: 2026-05-14T13:30:07+00:00
Comments: Accepted by ICML2026
Abstract
While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
Summary / 总结
The research addresses the Velocity Deficit in Flow Matching, where the Mean Squared Error (MSE) objective underestimates velocity magnitude, leading to Integration Lag. To solve this, the study proposes Initial Energy Injection through Magnitude-Aware Flow Matching (MAFM) and Scale Schedule Corrector (SSC). SSC, a training-free method, significantly improves FID by 44.6% and achieves a 5x speedup on ImageNet-1k, outperforming a 250-step baseline with just one line of code. The methods also generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by about 22%.
研究解决了Flow Matching中的Velocity Deficit问题,即MSE目标低估了速度幅度,导致Integration Lag。为此,研究提出了Initial Energy Injection,通过Magnitude-Aware Flow Matching (MAFM)和Scale Schedule Corrector (SSC)两种方法。SSC作为一种无需训练的方法,显著提高了ImageNet-1k上的FID,提升了44.6%,并实现了5倍的速度提升,仅需一行代码即可超越250步基线。此外,该方法还适用于Text-to-Image任务和高分辨率生成,使MS-COCO上的FID提高了约22%。
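The "one line of code" claim can be pictured with a toy Euler sampler in which the predicted velocity is multiplied by a time-dependent scale that boosts early steps and decays to 1 at the end. The schedule early_boost, its gain value, and the constant toy velocity are hypothetical; the abstract does not give the actual form of SSC, so this is only a sketch of the general shape of such a correction.

```python
def euler_sample(x0, velocity_fn, num_steps, scale_fn=None):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        if scale_fn is not None:
            v = scale_fn(t) * v  # the single-line correction
        x = x + dt * v
    return x


def early_boost(t, gain=0.2):
    # Hypothetical schedule: amplify velocity near t=0, no change at t=1.
    return 1.0 + gain * (1.0 - t)


def toy_velocity(x, t):
    # Toy stand-in for a model whose velocity is contracted to 90% of the
    # true value 1.0 needed to travel from x=0 to x=1.
    return 0.9
```

In this toy setting the uncorrected sampler stops short of the target at 1.0, while the boosted sampler ends closer to it, mirroring the Integration Lag described in the abstract.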
SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track
Authors: Lukas Roming, Felix Lehnerer, Jonas V. Funk, Andreas Michel, Georg Maier, Thomas Längle, Jürgen Beyerer
Venue: CVPR 2026
First: 2026-05-14T13:22:02+00:00 · Latest: 2026-05-14T13:22:02+00:00
Comments: Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track
Abstract
Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of $62.61\%$, $57.42\%$, and $54.35\%$ on test public, private, and private mixed of MVTec AD 2 respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at https://github.com/LukasRoom/SuperADD.
中文标题/摘要
标题:SuperADD:无需训练的无类别异常分割——CVPR 2026 VAND 4.0 工作坊挑战工业赛道
工业检测中的视觉异常检测(AD)是现代生产环境中一个非常相关的重要任务。当训练和部署数据因生产过程中采集条件的变化而不同时,问题变得尤为具有挑战性。在VAND 4.0工业赛道中,模型必须在分布变化(如光照变化)下保持鲁棒性,并在MVTec AD 2数据集上评估其性能。为了解决这一问题,我们提出了一种无需训练且无类别的异常检测管道,基于SuperAD的工作进行改进。我们的方法通过多种改进来增强在分布变化下的泛化能力,这些改进包括使用DINOv3骨干网络、重叠的块级处理、基于强度的增强、改进的记忆库子采样以更好地覆盖数据分布,以及迭代形态学闭运算以获得更干净且更空间一致的异常图。与依赖于特定类别架构或每类别超参数调优的方法不同,我们的方法使用单一架构和一个适用于所有对象类别的共享超参数配置。这使得该方法非常适合工业部署,能够在最小的适应努力下处理产品变体和外观变化。我们在MVTec AD 2的测试公共、测试私有和测试私有混合数据集上分别实现了分割F1分数为62.61%、57.42%和54.35%,从而优于SuperAD和其他最先进的方法。代码可在https://github.com/LukasRoom/SuperADD/获取。
Summary / 总结
SuperADD is a training-free, class-agnostic anomaly segmentation pipeline for industrial inspection that builds on SuperAD and targets robustness under distribution shifts such as varying illumination. Its modifications include a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling, and iterative morphological closing, all with a single hyperparameter configuration shared across object classes. It achieves segmentation F1 scores of 62.61%, 57.42%, and 54.35% on the public, private, and private mixed test sets of MVTec AD 2, outperforming SuperAD and other state-of-the-art methods.
该研究提出了一种训练-free 和类-无关的异常检测方法 SuperADD,用于工业检测,以应对分布变化的挑战。通过使用 DINOv3 骨干网络、重叠的块级处理和改进的数据增强等修改,增强了鲁棒性。该方法在 MVTec AD 2 测试集上的分割 F1 得分分别为 62.61%、57.42% 和 54.35%,优于 SuperAD 和其他最先进的方法。
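A memory-bank pipeline of the kind described above can be sketched in two steps: subsample the features of normal images so that the bank covers the distribution, then score each test patch by its distance to the nearest bank entry. Greedy farthest-point selection is used here as an assumed stand-in for the "improved memory-bank subsampling"; the abstract does not specify the actual procedure, and both function names are hypothetical.

```python
import math


def farthest_point_subsample(features, k):
    """Greedily pick k features, each as far as possible from those
    already picked (an assumed coverage heuristic)."""
    if not features or k <= 0:
        return []
    selected = [0]
    # Distance from every feature to its nearest selected feature so far.
    min_dist = [math.dist(f, features[0]) for f in features]
    while len(selected) < min(k, len(features)):
        idx = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(idx)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], math.dist(f, features[idx]))
    return [features[i] for i in selected]


def anomaly_score(patch_feature, memory_bank):
    """Score a patch by its distance to the nearest memory-bank entry."""
    return min(math.dist(patch_feature, m) for m in memory_bank)
```

Patches whose features lie far from every stored normal feature receive high scores; thresholding the resulting map and applying morphological closing would then produce the segmentation.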
Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation
Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid
First: 2026-05-14T13:09:16+00:00 · Latest: 2026-05-14T13:09:16+00:00
Abstract
In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.
Summary / 总结
This study systematically evaluates Vision Mamba models for AI-generated image detection, benchmarking multiple variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources. The comparison focuses on accuracy, efficiency, and generalizability across image types and generative models, and the findings highlight both the promise and the current limitations of Vision Mamba for this task.
该研究评估了Vision Mamba模型在检测AI生成图像方面的有效性,将其与CNNs、ViTs和VLMs进行比较,涵盖了多种数据集和生成模型。主要发现表明,Vision Mamba模型在准确性和效率方面表现良好,但在不同图像类型和生成模型的通用性方面仍存在局限性。
EVA: Editing for Versatile Alignment against Jailbreaks
Authors: Yi Wang, Hongye Qiu, Yue Xu, Sibei Yang, Zhan Qin, Minlie Huang, Wenjie Wang
First: 2026-05-14T12:16:10+00:00 · Latest: 2026-05-14T12:16:10+00:00
Comments: IEEE TPAMI 2026
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities but remain vulnerable to jailbreaking attacks, where adversaries exploit textual or visual triggers to bypass safety guardrails. Recent defenses typically rely on safety fine-tuning or external filters to reduce the model's likelihood of producing harmful content. While effective to some extent, these methods often incur significant computational overheads and suffer from the safety utility trade-off, degrading the model's performance on benign tasks. To address these challenges, we propose EVA (Editing for Versatile Alignment against Jailbreaks), a novel framework that pioneers the application of direct model editing for safety alignment. EVA reframes safety alignment as a precise knowledge correction task. Instead of retraining massive parameters, EVA identifies and surgically edits specific neurons responsible for the model's susceptibility to harmful instructions, while leaving the vast majority of the model unchanged. By localizing the updates, EVA effectively neutralizes harmful behaviors without compromising the model's general reasoning capabilities. Extensive experiments demonstrate that EVA outperforms baselines in mitigating jailbreaks across both LLMs and VLMs, offering a precise and efficient solution for post-deployment safety alignment.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
Authors: Yue Ma, Ziyuan Yang, Yi Zhang
First: 2026-05-03T07:38:42+00:00 · Latest: 2026-05-14T12:12:16+00:00
Comments: 12 pages
Abstract
Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
Summary / 总结
The paper addresses the vulnerability of Multi-Agent Systems (MASs) to infectious jailbreaks, where compromising a single agent can spread to others. It proposes a training-free Foresight-Guided Local Purification (FLP) framework in which each agent simulates future interactions under multiple personas and uses inconsistencies across the persona-based predictions as a diagnostic signal for infection. Infected agents are purified locally: recent infections by immediate album rollback, long-term infections by Recursive Binary Diagnosis (RBD). Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47% while keeping retrieval and semantic metrics close to benign baselines.
论文针对多智能体系统(MASs)中的传染性逃逸问题,即一个被攻破的智能体可以传播给其他智能体。提出了一种无需训练的前瞻性局部净化(FLP)框架,使每个智能体能够模拟未来的交互来检测和缓解感染。FLP框架使用多角色模拟策略和响应多样性作为诊断信号来检测和消除感染。实验表明,FLP显著降低了最大累积感染率,并保持了交互多样性。
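The Recursive Binary Diagnosis step described in the abstract can be sketched as a divide-and-conquer search over the image album. The is_infected callback below is a hypothetical stand-in for the paper's persona-based diagnosis, which the abstract describes only at a high level.

```python
def recursive_binary_diagnosis(album, is_infected):
    """Return the album items that the diagnosis localizes as infected.

    `album` is a list of items; `is_infected(subset)` is an assumed oracle
    that reports whether a subset of the album triggers the infection signal.
    """
    if not album or not is_infected(album):
        return []  # this partition is clean, prune it
    if len(album) == 1:
        return list(album)  # infection localized to a single item
    mid = len(album) // 2
    # Apply the same diagnosis to each half of the partition.
    return (recursive_binary_diagnosis(album[:mid], is_infected)
            + recursive_binary_diagnosis(album[mid:], is_infected))
```

When only a few items are infected, clean halves are discarded after a single diagnosis call, so the number of calls grows roughly with the logarithm of the album size per infected item and not linearly.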
From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models
Authors: Zicheng Fan, Kunihiko Fujiwara, Pengyuan Liu, Fan Zhang, Filip Biljecki
First: 2025-05-17T03:41:45+00:00 · Latest: 2026-05-14T12:09:49+00:00
Abstract
Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.
中文标题/摘要
标题:从街道视角到视觉网络:使用视觉语言模型映射城市地标可见性
在城市规划中,可见性分析传统上依赖于视线(LoS)模拟,这些模拟捕捉几何遮挡。然而,这些方法依赖于准确的3D数据,这些数据往往不可用,可能无法充分代表人们在真实街道景观中遇到视觉独特城市地标的方式。我们通过利用广泛可用的街道视角图像(SVI)将地标可见性评估重新表述为图像空间中的城市视觉搜索问题。给定目标地标的一张参考图像,应用视觉语言模型(VLM)在方向和缩放控制的SVI中检测地标。成功的检测表明机器识别的地标在相应视角下的可见性。除了孤立的视角,我们构建了一个异构可见性图来表示地标、街道视角位置以及它们之间的城市空间之间的视觉连接。该图使我们能够映射视觉连接发生的位置、强度以及多个地标通过共享视觉走廊联合连接的情况。在六个全球城市的知名地标结构中,基于图像的方法总体检测准确率为87%,地标可见位置的精确得分为68%。在伦敦泰晤士河的第二个案例研究中,可见性图揭示了多地标连接,并确定了关键的中介位置,桥梁占所有连接的约31%。所提出的方法补充了基于视线的可见性分析,并在数据受限的环境中提供了一种实用的替代方案。它还展示了揭示城市环境中视觉对象普遍连接的可能性,为城市规划和遗产保护提供了新的视角。
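The visibility graph described above can be pictured as a bipartite structure linking landmarks to the street-view locations where they were detected. The sketch below is a simplification under stated assumptions: it omits the urban-space nodes the paper includes, and the input format (viewpoint id, landmark id, detection confidence) and function names are hypothetical.

```python
from collections import defaultdict


def build_visibility_graph(detections):
    """Build landmark<->viewpoint adjacency from successful detections.

    `detections` is an iterable of (viewpoint_id, landmark_id, confidence).
    """
    landmark_to_views = defaultdict(dict)
    view_to_landmarks = defaultdict(dict)
    for view, landmark, conf in detections:
        landmark_to_views[landmark][view] = conf
        view_to_landmarks[view][landmark] = conf
    return landmark_to_views, view_to_landmarks


def shared_viewpoints(view_to_landmarks, min_landmarks=2):
    """Viewpoints from which at least `min_landmarks` landmarks are visible."""
    return sorted(v for v, lms in view_to_landmarks.items()
                  if len(lms) >= min_landmarks)
```

Viewpoints returned by shared_viewpoints correspond to the mediating locations through which several landmarks become jointly connected.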
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Authors: Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu
First: 2025-12-03T07:51:03+00:00 · Latest: 2026-05-14T12:05:47+00:00
Abstract
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
Summary / 总结
OpenTrack3D addresses the challenge of open-vocabulary 3D instance segmentation in diverse and unstructured environments by introducing a novel visual-spatial tracker that generates cross-view consistent object proposals online. Unlike existing methods that rely on pre-generated proposals or mesh-based superpoints, OpenTrack3D's framework is entirely mesh-free and uses a 2D open-vocabulary segmenter to generate masks, which are then lifted to 3D point clouds. The tracker fuses visual and spatial cues to maintain instance consistency, and the pipeline is enhanced with a multi-modal large language model for better compositional reasoning. Experiments on various benchmarks show state-of-the-art performance and strong generalization capabilities.
OpenTrack3D通过引入一种新型的视觉-空间追踪器,能够在线生成跨视图一致的对象提案,解决了在多样且未结构化环境中开放词汇3D实例分割的挑战。不同于依赖预生成提案或基于网格的超点的方法,OpenTrack3D的框架完全不依赖网格,并使用2D开放词汇分割器生成掩码,然后将这些掩码提升到3D点云。追踪器融合视觉和空间线索以保持实例一致性,并通过多模态大型语言模型增强以更好地处理复杂用户查询。在各种基准上的实验显示了最先进的性能和强大的泛化能力。
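The step of lifting 2D masks to 3D point clouds using depth can be sketched with a standard pinhole back-projection. The intrinsics fx, fy, cx, cy and the nested-list image layout are assumptions for illustration; the abstract does not describe the camera model or data format the authors use.

```python
def lift_mask_to_points(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth into camera-frame 3D points.

    `mask` and `depth` are row-major nested lists of equal shape; `mask`
    entries are truthy inside the instance, `depth` is in metric units.
    """
    points = []
    for v, row in enumerate(mask):
        for u, inside in enumerate(row):
            d = depth[v][u]
            if inside and d > 0:
                # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
                points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points
```

Pixels with zero depth are skipped, since sensors commonly report missing measurements that way.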
DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search
Authors: Yuchuan Deng, Zhanpeng Hu, Zijie Xin, Chuang Deng, Qijun Zhao
Venue: ICME
First: 2024-05-13T04:21:00+00:00 · Latest: 2026-05-14T11:40:35+00:00
Abstract
Text-based person search (TBPS) aims to retrieve specific images of individuals from large datasets using textual descriptions. Existing TBPS methods focus primarily on identifying explicit positive attributes, often neglecting the critical role of negative descriptions. This oversight can lead to false positives, where images that should be excluded based on negative descriptions are incorrectly included, due to partial alignment with the positive criteria. To address this limitation, we propose the Dual Attribute Prompt Learning (DAPL) framework, which incorporates both positive and negative descriptions to improve the interpretative accuracy of vision-language models in TBPS tasks. DAPL combines Dual Image-Attribute Contrastive (DIAC) learning with Sensitive Image-Attribute Matching (SIAM) learning to enhance the detection of previously unseen attributes. Furthermore, to achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we introduce the Dynamic Token-wise Similarity (DTS) loss. This loss function refines the representation of both matching and non-matching descriptions at the token level, providing more precise and adaptable similarity assessments, and ultimately improving the accuracy of the matching process. Empirical results demonstrate that DAPL outperforms state-of-the-art methods, enhancing both precision and robustness in TBPS tasks.
中文标题/摘要
标题:DAPL:文本基于的人像搜索中正负描述的整合
文本基于的人像搜索(TBPS)旨在使用文本描述从大型数据集中检索特定个体的图像。现有的TBPS方法主要关注识别显式的正属性,往往忽视了负描述的关键作用。这种忽视可能导致误报,即基于负描述应被排除的图像由于部分符合正描述标准而被错误地包含。为解决这一局限,我们提出了双属性提示学习(DAPL)框架,该框架结合了正负描述以提高视觉-语言模型在TBPS任务中的解释准确性。DAPL结合了双图像-属性对比学习(DIAC)和敏感图像-属性匹配学习(SIAM)来增强对未见过属性的检测。此外,为了在视觉和文本嵌入之间实现粗细粒度的平衡对齐,我们引入了动态令牌级相似性(DTS)损失函数。该损失函数在令牌级别细化匹配和非匹配描述的表示,提供更精确和适应性的相似性评估,最终提高匹配过程的准确性。实验证明,DAPL在TBPS任务中优于现有方法,提高了精确度和鲁棒性。
SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
Authors: Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu
First: 2026-05-14T11:21:41+00:00 · Latest: 2026-05-14T11:21:41+00:00
Abstract
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
Summary / 总结
SceneFunRI is a benchmark designed to evaluate the ability of vision-language models to reason about invisible objects in scenes. Built from the SceneFun3D dataset through a semi-automatic pipeline, it comprises 855 instances in which models must infer the 2D locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline, Gemini 3 Flash, reaches only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65, indicating that current models struggle with this task and motivating models that better integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
SceneFunRI 是一个基准,旨在评估视觉-语言模型在场景中推理不可见物体的能力。它基于 SceneFun3D 数据集使用半自动管道,并包含 855 个实例,要求模型从任务指令和常识推理中推断功能物体的位置。最强的基线模型得分较低,表明当前模型在这一任务上存在困难,突显了需要更好地将任务意图、常识先验、空间推理结合在一起的模型的重要性。
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Authors: Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen
First: 2026-05-14T09:37:55+00:00 · Latest: 2026-05-14T09:37:55+00:00
Abstract
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
Summary / 总结
SIRA is a training-free internal contrastive decoding framework for mitigating hallucinations in large vision-language models (LVLMs). Instead of perturbing the visual input, it lets image and text tokens interact through a shared prefix and then forks a counterfactual branch in later transformer layers where attention to image-token positions is masked, yielding a language-prior-dominated reference inside the same model. During decoding, SIRA suppresses tokens that remain strong without late visual access. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding.
论文提出了一种名为SIRA的无训练内部对比解码框架,通过利用多模态变压器的信息流阶段来构建内部的反事实参考,从而解决大型视觉-语言模型(LVLM)中的幻觉问题,而无需外部工具或扰动输入。实验表明,SIRA能够减少幻觉现象,同时保持描述性覆盖并比两阶段对比解码方法具有更低的开销。
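The token-level contrast between the full branch and the visually masked reference branch can be sketched with the form commonly used in contrastive decoding, (1 + alpha) * log p_full - alpha * log p_ref. Whether SIRA uses exactly this formula is an assumption; the abstract does not give it, and the function names and the alpha parameter are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a {token: logit} mapping."""
    peak = max(logits.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def contrastive_scores(full_logits, reference_logits, alpha=1.0):
    """Favor tokens whose probability depends on the full visual pathway.

    Assumed form: (1 + alpha) * log p_full - alpha * log p_ref.
    """
    full = log_softmax(full_logits)
    ref = log_softmax(reference_logits)
    return {tok: (1.0 + alpha) * full[tok] - alpha * ref[tok] for tok in full}
```

A token that is equally likely with and without late visual access is penalized relative to a token that only becomes likely when the image is attended to, which matches the suppression behavior described in the abstract.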
Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Authors: Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin
First: 2025-10-17T17:42:28+00:00 · Latest: 2026-05-14T09:15:35+00:00
Abstract
Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
中文标题/摘要
标题:Memory-SAM:无需人工提示的舌段分割
准确的舌段分割对于可靠的中医分析至关重要。监督模型需要大量标注的数据集,而SAM家族模型仍依赖于提示。我们提出了Memory-SAM,这是一种无需训练、无需人工提示的流水线,通过密集的DINOv3特征和FAISS检索,自动从少量的先前案例记忆中生成有效的提示。给定查询图像,掩膜约束的对应关系被提炼成前景/背景点提示,引导SAM2进行分割,无需手动点击或模型微调。我们在600张由专家标注的图像(300张受控,300张野外)上进行了评估。在混合测试集上,Memory-SAM的mIoU为0.9863,超过了FCN(0.8188)和一个检测器到框的SAM基线(0.1839)。在受控数据上,天花板效应使得超过0.98的小差异变得不那么有意义,而我们的方法在真实条件下显示出明显的改进。结果表明,检索到提示能够实现数据高效、鲁棒的舌影像不规则边界分割。代码已公开发布在https://github.com/jw-chae/memory-sam。
Summary / 总结
Memory-SAM is a training-free pipeline that generates prompts for tongue segmentation without human intervention. It retrieves a similar exemplar from a small memory of prior cases using dense DINOv3 features and FAISS, then distills mask-constrained correspondences into foreground/background point prompts for SAM2. Evaluated on 600 expert-annotated images (300 controlled, 300 in-the-wild), it achieves an mIoU of 0.9863 on the mixed test split, compared with 0.8188 for FCN and 0.1839 for a detector-to-box SAM baseline. On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, whereas the method shows clear gains under real-world conditions.
Memory-SAM 是一个无需训练、无需人工提示的流水线,能够自动为舌体分割生成提示。它利用密集的 DINOv3 特征和 FAISS 检索,从少量先前案例的记忆中找到相似样例,再将掩码约束的对应关系提炼为前景/背景点提示来引导 SAM2。在 600 张专家标注图像(300 张受控、300 张真实场景)上评估,Memory-SAM 在混合测试集上的 mIoU 达到 0.9863,高于 FCN(0.8188)和检测器到框的 SAM 基线(0.1839)。在受控数据上,0.98 以上的天花板效应加上标注差异使得细微差距意义有限,而该方法在真实场景下表现出明显优势。
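The retrieval-to-prompt idea in the Memory-SAM entry above can be sketched in a few lines. This is a toy illustration under stated assumptions: brute-force cosine search stands in for FAISS, hand-written 2-D vectors stand in for DINOv3 features, and the dictionary layout and function names (`retrieve`, `points_from_correspondences`) are hypothetical. The resulting points would then be passed to a promptable segmenter such as SAM2, which is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_global, memory):
    """Return the memory exemplar whose global descriptor is most similar."""
    return max(memory, key=lambda e: cosine(query_global, e["global"]))

def points_from_correspondences(query_patches, exemplar, k=1):
    """Turn patch correspondences into foreground/background point prompts.

    Each query patch is matched to its most similar exemplar patch; the
    exemplar mask decides whether the query patch becomes a foreground or a
    background candidate. The k most confident candidates of each kind are kept.
    """
    fg, bg = [], []
    for pos, feat in query_patches.items():
        best_pos, best_sim = max(
            ((p, cosine(feat, f)) for p, f in exemplar["patches"].items()),
            key=lambda t: t[1],
        )
        (fg if best_pos in exemplar["mask"] else bg).append((best_sim, pos))
    fg.sort(reverse=True)
    bg.sort(reverse=True)
    return [p for _, p in fg[:k]], [p for _, p in bg[:k]]

# Toy memory of two prior cases with per-patch features and a mask
# (set of patch positions that belong to the tongue).
memory = [
    {"global": [1.0, 0.0],
     "patches": {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]},
     "mask": {(0, 0)}},
    {"global": [0.0, 1.0],
     "patches": {(0, 0): [0.0, 1.0], (0, 1): [1.0, 0.0]},
     "mask": {(0, 1)}},
]
query_global = [0.9, 0.1]
query_patches = {(0, 0): [0.9, 0.1], (1, 1): [0.1, 0.9], (1, 0): [0.6, 0.4]}

exemplar = retrieve(query_global, memory)
fg_points, bg_points = points_from_correspondences(query_patches, exemplar)
```

Here the first memory entry is retrieved, patch (0, 0) becomes the foreground point, and patch (1, 1) becomes the background point.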
HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
Authors: Quanjian Song, Xinyu Wang, Donghao Zhou, Jingyu Lin, Cunjian Chen, Yue Ma
First: 2025-08-25T01:22:15+00:00 · Latest: 2026-05-14T09:03:20+00:00
Comments: 12 pages in total
Abstract
Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
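The deeper-layer idea in the HERO abstract above, estimating features by linear extrapolation instead of running attention and feed-forward blocks, might look like the following minimal sketch. The abstract does not give the exact formula, so the two-point extrapolation, the function name `extrapolate_features`, and the use of plain lists as feature vectors are assumptions.

```python
def extrapolate_features(prev2, prev1, t2, t1, t):
    """Linearly extrapolate a feature vector to step t.

    prev2 and prev1 are features cached at earlier steps t2 and t1 (t2 != t1).
    The estimate continues the straight line through those two points, so no
    attention or feed-forward computation is needed at step t.
    """
    weight = (t - t1) / (t1 - t2)
    return [b + weight * (b - a) for a, b in zip(prev2, prev1)]

# Features cached at steps 0 and 1, estimated at step 2.
estimate = extrapolate_features([0.0, 2.0], [1.0, 4.0], t2=0, t1=1, t=2)
```

With evenly spaced steps this reduces to `2 * prev1 - prev2`; the estimate is only as good as the assumption that deep-layer features change smoothly between steps.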
Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
Authors: Pengkai Wang, Pengwei Liu, Yuanyi Wang, Guanyu Chen, Xingyu Ren, Xiaolong Li, Zhongkai Hao, Yuting Kong, Qixin Zhang, Dong Ni
First: 2026-05-14T08:25:16+00:00 · Latest: 2026-05-14T08:25:16+00:00
Abstract
Recent advances in neural operators have made partial differential equation (PDE) surrogate modeling increasingly scalable and transferable through large-scale pretraining and in-context adaptation. However, after a shared operator is fine-tuned to multiple regimes within a continuous physical family, it remains unclear whether the resulting weight-space updates merely form isolated regime experts or reveal reusable physical structure. Starting from a shared family anchor, we fine-tune low- and high-regime endpoint experts and show that their updates can be separated into a family-shared adaptation and a direction aligned with the underlying physical parameter. This separation reinterprets endpoint experts as finite-difference probes of a local physical direction in weight space, explaining why static averaging can interpolate between regimes but attenuates endpoint-specific physics. Building on this perspective, we propose Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout method for composing neural PDE experts along this physical direction. Given physical metadata, a calibrated coordinate mapping, or a short observed rollout prefix, CCM infers the target composition coordinate and deploys a single merged checkpoint for the remaining rollout. We evaluate CCM on the reaction--diffusion system, viscosity-parameterized two-dimensional Navier--Stokes equations, and radial dam-break dynamics. Across these benchmarks, CCM achieves its strongest gains in extrapolative regimes, reducing out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8%, respectively. Further experiments across FNO scales, a DPOT-style backbone, and ablations confirm that endpoint fine-tuning is not arbitrary checkpoint drift, but reveals a calibratable physical direction for training-free transfer across PDE regimes.
Summary / 总结
The research asks whether fine-tuning a shared neural operator to different physical regimes reveals reusable physical structure in weight space. By fine-tuning low- and high-regime endpoint experts from a shared family anchor, the study shows that the weight-space updates separate into a family-shared adaptation and a direction aligned with the physical parameter. Building on this, it proposes Calibration-Conditioned Merge (CCM), a post-hoc method that composes neural PDE experts along this physical direction and reduces out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8% on the reaction-diffusion, Navier-Stokes, and radial dam-break benchmarks, respectively.
研究旨在探索将共享神经算子微调到不同物理区间时,是否能在权重空间中揭示可重用的物理结构。通过从共享的族锚点出发微调低区间和高区间端点专家,研究发现权重空间更新可以分解为族内共享的适应项和与物理参数对齐的方向。在此基础上提出了 Calibration-Conditioned Merge (CCM) 方法,沿该物理方向组合神经 PDE 专家,在反应扩散、Navier-Stokes 和径向溃坝基准上,相对于族锚点分别将分布外推演误差降低了 54.2%、42.8% 和 13.8%。
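The decomposition described in the entry above, a family-shared adaptation plus a direction aligned with the physical parameter, can be made concrete with a small sketch. The symmetric split used here (mean and half-difference of the two endpoint updates), the linear coordinate mapping, and all function names are assumptions for illustration; the paper's calibrated mapping may differ.

```python
def decompose(anchor, low, high):
    """Split endpoint updates into a shared part and a physical direction.

    anchor, low, high are flat weight vectors of the family anchor and the
    low- and high-regime experts. The shared part is the mean update and the
    direction is half the difference between the two experts.
    """
    shared = [((l - a) + (h - a)) / 2 for a, l, h in zip(anchor, low, high)]
    direction = [(h - l) / 2 for l, h in zip(low, high)]
    return shared, direction

def merge(anchor, shared, direction, alpha):
    """Compose one checkpoint at coordinate alpha along the direction.

    alpha = -1 recovers the low expert, alpha = +1 the high expert,
    alpha = 0 the static average, and |alpha| > 1 extrapolates.
    """
    return [a + s + alpha * d for a, s, d in zip(anchor, shared, direction)]

def linear_coordinate(p, p_low, p_high):
    """Hypothetical linear map from a physical parameter to alpha."""
    return (2 * p - p_low - p_high) / (p_high - p_low)

anchor, low, high = [0.0, 1.0], [1.0, 1.0], [3.0, 2.0]
shared, direction = decompose(anchor, low, high)
alpha = linear_coordinate(1.5, p_low=0.0, p_high=1.0)  # beyond the high regime
merged = merge(anchor, shared, direction, alpha)
```

Under this split, static averaging corresponds to `alpha = 0`, which keeps the shared adaptation but cancels the direction term, consistent with the abstract's remark that averaging attenuates endpoint-specific physics.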
TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models
Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao
First: 2026-02-04T15:33:10+00:00 · Latest: 2026-05-14T08:16:18+00:00
Abstract
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed TRIO is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, TRIO retains just 11.1\% of visual tokens but maintains 97.2\% of the original performance, with a 2.75$\times$ prefill speedup, 2.14$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at https://github.com/ocy1/TRIO.
中文标题/摘要
标题:TRIO:通过推理目标指导的视觉标记减少方法以提高视觉语言模型效率
近年来,通过减少视觉语言模型(VLMs)中的冗余视觉标记来加速VLM推理已成为一个热点话题。然而,大多数现有方法依赖于基于视觉标记间相似性或跨模态视觉-文本相似性的启发式方法,这在压缩性能和实际部署方面存在一定的局限性。相比之下,我们从推理目标的角度提出了TRIO,将视觉标记压缩转化为保持输出结果不变性,并主要通过其对这一目标的重要性来选择标记。具体来说,视觉标记在由我们设计的层局部代理损失生成的标记级梯度显著性指导下重新排序,这是一种从当前层到最终结果的粗略约束。然后,根据非极大值抑制(NMS)原则选择最有价值的视觉标记。所提出的TRIO无需训练,并且与FlashAttention兼容,易于实际应用和部署。它可以独立部署为一种无需编码器的方法,或者与VisionZip等编码器压缩方法结合使用,作为一种涉及编码器的方法。在LLaVA-Next-7B上,TRIO仅保留了11.1%的视觉标记,但保持了97.2%的原始性能,预填充速度提高了2.75倍,推理速度提高了2.14倍,FLOPs降低了6.22倍,KV缓存开销减少了6.05倍。我们的代码可在https://github.com/ocy1/TRIO获取。
Summary / 总结
TRIO reduces redundant visual tokens in vision-language models (VLMs) by treating compression as preserving output invariance. It reorders vision tokens by token-level gradient saliency from a layer-local proxy loss, then selects them following the non-maximum suppression (NMS) principle. On LLaVA-Next-7B, TRIO retains only 11.1% of visual tokens while maintaining 97.2% of the original performance, with a 2.75× prefill speedup, 2.14× inference speedup, 6.22× lower FLOPs, and 6.05× lower KV cache overhead.
TRIO 通过保留输出结果不变性,使用由层局部代理损失引导的 token 级别梯度显著性来减少视觉语言模型中的冗余视觉标记。它仅保留 11.1% 的视觉标记,但仍能保持 97.2% 的原始性能,提供 2.75 倍的预填充加速、2.14 倍的推理加速、6.22 倍的更低 FLOPs 和 6.05 倍的 KV 缓存开销减少。TRIO 是无训练的,并且与 FlashAttention 兼容,使其适合独立部署或与编码器压缩方法(如 VisionZip)结合使用进行实际部署。
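The selection step in the TRIO entry above, picking the most valuable tokens under an NMS principle, can be sketched as greedy selection with suppression. The abstract does not say what defines a token's neighbourhood, so this sketch assumes spatial neighbours on the patch grid; it also takes saliency scores as given rather than computing gradients. The name `nms_select` and the `radius` parameter are hypothetical.

```python
def nms_select(saliency, grid_w, keep, radius=1):
    """Greedy NMS-style token selection on a row-major patch grid.

    Tokens are visited in order of decreasing saliency. A visited token is
    kept unless a previously kept token lies within `radius` grid cells
    (Chebyshev distance). Selection stops once `keep` tokens are chosen, so
    fewer may be returned if suppression removes too many candidates.
    """
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    kept, suppressed = [], set()
    for i in order:
        if len(kept) == keep:
            break
        if i in suppressed:
            continue
        kept.append(i)
        r, c = divmod(i, grid_w)
        for j in range(len(saliency)):
            rj, cj = divmod(j, grid_w)
            if j != i and max(abs(r - rj), abs(c - cj)) <= radius:
                suppressed.add(j)
    return sorted(kept)

# 4x4 grid with one salient cluster in the top-left and one in the bottom-right.
saliency = [0.9, 0.8, 0.1, 0.1,
            0.7, 0.6, 0.1, 0.1,
            0.1, 0.1, 0.1, 0.5,
            0.1, 0.1, 0.2, 0.4]
selected = nms_select(saliency, grid_w=4, keep=2)
```

Plain top-2 selection would keep tokens 0 and 1 from the same cluster, whereas the suppression step keeps one token from each salient region.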
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Authors: Sujung Hong, Chanyong Yoon, Seongjae Hwang
First: 2026-05-14T08:11:32+00:00 · Latest: 2026-05-14T08:11:32+00:00
Abstract
Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
中文标题/摘要
标题:缓解大型扩散视觉语言模型中的掩码先验漂移和位置注意坍塌
大型扩散视觉语言模型(LDVLMs)最近已成为自回归模型的有前途的替代方案,能够实现并行解码以提高推理效率,并利用双向注意以获取全局上下文。尽管取得了这些进展,但它们在长文本生成中的行为仍然未被充分探索。在本文中,我们展示了现有的LDVLMs存在重复生成和视觉定位退化的现象,并确定了两个根本原因。首先,重复生成源自掩码令牌先验:由于生成令牌初始化为掩码令牌,它们的隐藏表示在生成步骤中逐渐向共享先验方向漂移。其次,位置注意偏置与逐步解掩过程之间的基本不一致抑制了对信息性视觉令牌的注意,降低了视觉定位的效果。基于这些见解,我们提出了一种无需训练的方法,引入掩码先验抑制和单调RoPE缩放来缓解解码过程中的掩码先验漂移和位置注意坍塌。在通用多模态基准和视觉定位任务上的实验表明,与基线LDVLMs相比有所改进,特别是在长文本描述基准上表现出稳健的提升。我们的结果表明,这些失败可以通过一种轻量级、即插即用的策略来有效解决,该策略不需要额外的训练且适用于多种LDVLM架构。
Summary / 总结
This paper addresses the issues of repetitive generation and degraded visual grounding in large diffusion vision-language models (LDVLMs) by identifying two underlying causes: mask token prior drift and positional attention collapse. To mitigate these problems, the authors propose a training-free approach involving Mask Prior Suppression and Monotonic RoPE Scaling, which improves performance on general multimodal benchmarks and visual grounding tasks, especially on long-form description tasks.
该论文通过识别两个根本原因——掩码令牌先验漂移和位置注意力坍缩——解决了大型扩散视觉-语言模型(LDVLMs)中的重复生成和视觉定位退化问题。为了解决这些问题,作者提出了一种无需训练的方法,包括掩码令牌先验抑制和单调RoPE缩放,这在通用多模态基准和视觉定位任务中提高了性能,特别是在长形式描述任务中表现出显著的改进。