Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Authors: Yifan Wang, Tong He
First: 2026-05-14T17:58:26+00:00 · Latest: 2026-05-14T17:58:26+00:00
Comments: Project page: https://yyfz.github.io/warp-as-history/
Abstract
Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.
Summary / 总结
Warp-as-History is a simple interface that lets a frozen video generation model follow camera trajectories without post-training on large-scale camera-annotated videos. It constructs camera-warped pseudo-history from past observations, aligns its positional encoding with the target frames being denoised, and removes warped-history tokens that lack valid source observations. This reveals a non-trivial zero-shot capability of the frozen model, which lightweight offline LoRA finetuning on a single camera-annotated video further improves, enhancing camera adherence, visual quality, and motion dynamics.
Warp-as-History 提出了一种简单的方法,使冻结的视频生成模型能够跟随摄像机轨迹,无需在摄像机标注视频上进行后训练。通过对齐位置编码并移除无效令牌,它从过去的观察中生成摄像机扭曲的伪历史。这种方法揭示了模型的零样本能力,并且可以通过对单个标注视频进行轻量级的离线 LoRA 微调来进一步改进,从而增强摄像机对准、视觉质量和运动动态。
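The visible-token selection step described in the abstract can be illustrated with a minimal sketch. It assumes the warp has already produced a per-pixel validity mask, that tokens correspond to square patches, and that a token is kept when enough of its patch is valid. The function name `visible_token_indices`, the patch size, and the `min_valid_fraction` threshold are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def visible_token_indices(valid_mask, patch=16, min_valid_fraction=0.5):
    """Return flat indices of patch tokens that have enough valid warped pixels.

    valid_mask: (H, W) boolean array, True where the warped frame has a
    valid source observation (assumed to come from the warping step).
    patch and min_valid_fraction are illustrative assumptions.
    """
    h, w = valid_mask.shape
    gh, gw = h // patch, w // patch
    # Group pixels into (gh, gw) patches and measure the valid fraction of each.
    blocks = valid_mask[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch)
    fraction = blocks.mean(axis=(1, 3))
    keep = np.flatnonzero(fraction.ravel() >= min_valid_fraction)
    return keep, fraction

# Toy example: only the left half of a 4x4 frame was observed.
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
kept, frac = visible_token_indices(mask, patch=2)
```

Because the function returns row-major grid indices, each kept token can retain its original grid position, which is one plausible way to pair the selection with the target-frame positional alignment the abstract mentions.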
Does Synthetic Layered Design Data Benefit Layered Design Decomposition?
Authors: Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen
First: 2026-05-14T17:55:11+00:00 · Latest: 2026-05-14T17:55:11+00:00
Comments: 22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers
Abstract
Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether pure synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on CLD baseline, which is a state-of-the-art layer decomposition framework. Based on the baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.
Do-Undo Bench: Reversibility for Action Understanding in Image Generation
Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli
First: 2025-12-15T18:03:42+00:00 · Latest: 2026-05-14T17:13:30+00:00
Comments: Project page: https://s-mahajan.github.io/Do-Undo-Bench/
Abstract
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.
中文标题/摘要
标题:Do-Undo 基准:图像生成中的动作理解可逆性
我们提出了 Do-Undo 任务和基准,以解决视觉-语言模型中的关键问题:理解并生成由真实世界动作驱动的场景变换。与先前依赖提示驱动的图像生成和编辑来执行动作条件下的图像操作的工作不同,我们的训练假设要求模型模拟真实世界动作的结果,然后将其恢复到原始状态。这一正向-反向要求测试的是真正的因果理解,而不是风格或语义编辑。我们从真实场景中精心策划了一个高质量的可逆动作基准,以实现稳健的动作定位。我们的实验表明,当前模型在动作可逆性方面存在困难,突显了评估动作理解的必要性。Do-Undo 为评估和推进多模态系统中的动作感知提供了一个直观的测试平台,这些系统必须推理真实世界的动态。
Summary / 总结
This work introduces the Do-Undo task and benchmark to address a gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Rather than relying on prompt-based image generation and editing, the task requires models to simulate the outcome of a real-world action and then reverse it to the original state, and the authors curate a high-quality benchmark of reversible actions from real-world scenarios. Experiments show that current models struggle with action reversibility, highlighting the need for an intuitive testbed for evaluating and advancing action-aware generation.
提出了Do-Undo任务和基准,旨在评估视觉-语言模型在基于真实世界动作理解并生成合理场景变换的能力。不同于以往依赖提示进行图像生成的方法,该基准要求模型模拟动作并将其恢复到原始状态,测试其真正的因果理解能力。实验结果显示当前模型在动作可逆性方面存在困难,表明需要在多模态系统中提高动作理解能力。
On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen
First: 2026-05-14T16:58:16+00:00 · Latest: 2026-05-14T16:58:16+00:00
Comments: Project Page: https://khushboo0012.github.io/tab-vlm-webpage/
Abstract
Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.
中文标题/摘要
标题:视觉语言模型中的文化错置与时间推理问题
视觉-语言模型(VLMs)越来越多地应用于文化遗产材料,从数字档案到教育平台。本文指出了这些模型在解释历史文物时的一个根本问题。我们将其定义为文化错置现象,即使用不适当的时间概念、材料或文化框架来误解历史物件。为了量化这一现象,我们引入了视觉语言模型的时间错置基准(TAB-VLM),这是一个包含600个问题的数据集,涵盖六个类别,旨在评估1600件印度文化遗产物件(从史前到现代)的时间推理能力。对十种最先进的模型进行系统评估显示,它们在基准测试中的表现存在显著缺陷,即使最好的模型(GPT-5.2)也只能达到58.7%的整体准确率。性能差距在不同架构和规模下持续存在,表明文化错置是视觉AI系统的一个重要限制,无论模型大小如何。这些发现突显了当前VLM能力与准确解释文化遗产材料之间存在的差距,特别是对于在训练数据中代表性不足的非西方视觉文化。我们的基准为增强与历史文物互动的多模态AI系统的时序认知提供了基础。数据集和代码可在我们的项目页面获取。
LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection
Authors: Mitchell Piehl, Muchao Ye
First: 2026-05-14T16:48:03+00:00 · Latest: 2026-05-14T16:48:03+00:00
Abstract
Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we aim to address a key limitation of such pipelines, which perform segment-level inference independently owing to token constraints and reason without structured temporal context, allowing VLMs to interpret anomalies as deviations from evolving video dynamics rather than producing fragmented predictions and explanations. To specify, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.
中文标题/摘要
标题:LATERN:测试时上下文感知的可解释视频异常检测
视觉语言模型(VLMs)由于其强大的视觉推理能力和基于自然语言的可解释性,最近在视频异常检测(VAD)中崭露头角。本文旨在解决此类管道中的一个关键局限性,即由于标记限制,它们独立地进行段级推理,且缺乏结构化的时空上下文,导致VLMs将异常解释为视频动态的变化,而不是产生碎片化的预测和解释。为此,我们提出了一种上下文感知框架,称为LATERN,将VAD重新定义为时间证据聚合过程。LATERN由两个互补模块组成:上下文感知异常评分(CEA)和递归证据聚合(REA)。CEA引入了一种新颖的图像导向记忆机制,通过帧多样性和视觉文本对齐选择历史内容作为扩展上下文,以帮助生成可靠的异常评分。基于这些评分,REA执行递归的时间聚合,以识别一致的异常区间,并生成基于视觉文本证据的事件级决策和解释。在包括UCF-Crime和XD-Violence在内的具有挑战性的基准测试中,实验表明,LATERN在测试时增强了冻结VLMs的检测准确性和解释一致性,同时生成了时空一致且语义上合理的事件级解释。
Summary / 总结
This paper addresses the limitations of vision-language models in video anomaly detection by proposing LATERN, a context-aware framework. LATERN reformulates VAD as a temporal evidence aggregation process, using two modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a memory mechanism to select historical content, while REA performs recursive temporal aggregation to identify coherent anomaly intervals. Experiments on UCF-Crime and XD-Violence show that LATERN improves detection accuracy and explanation consistency, generating temporally coherent and semantically grounded event-level explanations.
本文提出了一种上下文感知框架LATERN,以解决视觉语言模型在视频异常检测中的局限性。LATERN 将 VAD 形式化为一个时间证据聚合过程,使用两个模块:上下文感知异常评分 (CEA) 和递归证据聚合 (REA)。CEA 引入了一种记忆机制来选择历史内容,而 REA 则进行递归时间聚合以识别一致的异常区间。在 UCF-Crime 和 XD-Violence 上的实验表明,LATERN 提高了检测准确性和解释一致性,生成了时间上连贯且语义上合理的事件级解释。
MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs
Authors: Wei Ding, Yilin Li, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang
First: 2026-05-14T15:31:18+00:00 · Latest: 2026-05-14T15:31:18+00:00
Comments: 19 pages, 17 figures
Abstract
Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.
Summary / 总结
The paper introduces MHSA, a lightweight framework that mitigates hallucinations in large vision-language models (LVLMs) by learning to correct cross-modal attention patterns. It uses a simple three-layer MLP generator guided by signals from a DHCP discriminator and the LVLM itself. During inference, MHSA replaces the original cross-modal attention with corrected attention, effectively mitigating both discriminative and generative hallucinations across various datasets and LVLMs without modifying any LVLM parameters.
论文提出了MHSA,一种轻量级框架,通过学习纠正跨模态注意力模式来缓解大型视觉-语言模型(LVLM)中的幻觉问题。它使用一个简单的三层MLP生成器,由DHCP鉴别器和LVLM本身的信号引导。在推理过程中,MHSA用纠正后的注意力替换原始的跨模态注意力,有效地缓解了各种数据集和LVLM中的判别性和生成性幻觉,而不修改任何LVLM参数。
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
Authors: Pengxin Xu, Xincheng Lin, Luping Xiao, Qing Jiang, Meishan Zhang, Hao Fei, Shanghang Zhang, Xingyu Chen
First: 2026-05-14T14:58:46+00:00 · Latest: 2026-05-14T14:58:46+00:00
Comments: Preprint. Code, models, and dataset are provided in the manuscript
Abstract
General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene -> object -> part -> affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.
Summary / 总结
SceneParser addresses the gap in scene perception by introducing hierarchical scene parsing, which captures structured dependencies for interaction-oriented understanding. It uses a VLM-based parser trained with structural-completion pseudo labels and curriculum learning, and evaluates with a large-scale benchmark SceneParser-Bench. Experiments show that SceneParser outperforms existing methods in structure-aware hierarchical parsing and provides actionable representations for visual understanding.
SceneParser通过引入层次场景解析来解决场景感知中的空白,捕捉交互导向理解所需的结构依赖。它使用基于VLM的解析器,通过结构完成伪标签和课程学习进行训练,并使用大规模基准SceneParser-Bench进行评估。实验表明,SceneParser在层次解析中表现出更强的结构感知性能,并为视觉理解提供可操作的表示。
Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy
Authors: Abdulrahman Alswaidan, Jeffrey D. Varner
First: 2026-03-06T20:50:30+00:00 · Latest: 2026-05-14T14:55:42+00:00
Comments: Main body (including references excluding the appendix): 11 pages, 2 figures and 1 table. Total paper: 26 pages, 13 figures and 7 pages
Abstract
Attention heads retrieve: given a query, they return a weighted average of stored values. We showed that this computation is one step of gradient descent on the modern Hopfield energy, and that Langevin sampling from the corresponding Boltzmann distribution yielded stochastic attention, a training-free sampler controlled by a single temperature parameter. Lowering the temperature gave exact retrieval; raising it gave open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model was required, making the approach particularly suited to the low-data regime where learned generative models are starved of training signal. We derived an entropy inflection condition that identified the retrieval-to-generation transition temperature for any memory geometry and validated the sampler on five domains spanning two orders of magnitude in dimension. A single Boolean mask on the attention softmax, identical to the causal mask used in transformers but applied along the memory axis rather than the sequence axis, turned the sampler into a zero-shot class-conditional generator on Olivetti faces with no retraining and no learned classifier. On MNIST digit images, stochastic attention produced samples that were markedly more novel and more diverse than the best learned baseline while matching a Metropolis-corrected gold standard. On protein sequences from a small Pfam family, the generation regime preserved amino acid composition far more faithfully than a variational autoencoder at matched novelty, indicating that the training-free score function retained family-level fidelity that learned models lost. A denoising diffusion baseline failed across all memory sizes tested, producing samples indistinguishable from isotropic noise. The approach required no architectural changes to the underlying attention mechanism.
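The link between attention and Langevin sampling described above can be sketched in a few lines of NumPy. The sketch assumes the standard modern Hopfield energy E(x) = -(1/beta) * logsumexp(beta * M x) + 0.5 * ||x||^2 and a plain unadjusted Langevin update; the function names, the step size, and the way `temperature` scales the noise are assumptions for illustration and may differ from the paper's parametrization.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def energy_grad(x, memories, beta):
    """Gradient of the assumed Hopfield energy; memories is (N, d), rows are patterns."""
    weights = softmax(beta * memories @ x)
    # x minus the attention readout over the stored patterns
    return x - memories.T @ weights

def langevin_attention(x0, memories, beta=8.0, temperature=0.0,
                       step=1.0, n_steps=50, seed=0):
    """Unadjusted Langevin iteration: gradient step plus temperature-scaled noise."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = (x - step * energy_grad(x, memories, beta)
             + np.sqrt(2.0 * step * temperature) * noise)
    return x

# Three stored patterns in R^4 and a query close to the first one.
M = np.eye(3, 4)
query = np.array([0.9, 0.1, 0.0, 0.0])
retrieved = langevin_attention(query, M, temperature=0.0)
```

With `step=1.0` and `temperature=0.0`, each iteration reduces to x = M^T softmax(beta * M x), the attention readout, which is the "one step of gradient descent" reading in the abstract. Raising `temperature` injects noise and moves the sampler away from exact retrieval.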
SteerSeg: Attention Steering for Reasoning Video Segmentation
Authors: Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An, Lars Petersson
First: 2026-05-14T14:42:15+00:00 · Latest: 2026-05-14T14:42:15+00:00
Comments: Project page: https://steerseg.github.io
Abstract
Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io
Summary / 总结
SteerSeg addresses the issue of diffuse and ambiguous grounding signals in video segmentation by introducing a lightweight framework that steers attention at its source through input-level conditioning. It combines learnable soft prompts with reasoning-guided Chain-of-Thought prompting to reshape attention distributions and resolve ambiguity among similar objects. The approach improves the spatial grounding capability of large vision-language models, generalizing well across diverse benchmarks despite being trained only on Ref-YouTube-VOS.
SteerSeg通过引入一种轻量级框架,在输入级别调整注意力,解决视频分割中注意力分布模糊和含糊的问题。该框架结合了可学习的软提示和基于推理的Chain-of-Thought提示,重塑注意力分布并解决相似对象之间的歧义。该方法在Ref-YouTube-VOS上训练,但在多种基准测试中表现出色,显著提高了大型视觉语言模型的空间定位能力。
Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
Authors: Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising
Venue: IEEE Data Descriptions, 2026
First: 2025-11-17T14:12:22+00:00 · Latest: 2026-05-14T14:41:56+00:00
Abstract
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
Summary / 总结
The research aims to evaluate the perception capabilities of Vision-Language Models (VLMs) in traffic scenarios, in isolation from skills such as reasoning, and particularly their ability to perceive distant objects. It introduces the Distance-Annotated Traffic Perception Question Answering (DTPQA) benchmark, which consists of a synthetic part (DTP-Synthetic) and a real-world part (DTP-Real), both with distance annotations. Each sample contains an image, a question, the ground-truth answer, and the distance of the object in question, enabling analysis of how VLM performance degrades as object distance increases.
研究旨在评估Vision-Language Models (VLMs)在交通场景中的感知能力,特别是它们识别远处物体的能力。引入了Distance-Annotated Traffic Perception Question Answering (DTPQA)基准,包括一个合成基准(DTP-Synthetic)和一个真实世界基准(DTP-Real)。每个样本包含一张图片、一个问题、正确答案以及物体与相机的距离。研究发现,随着物体距离的增加,VLM的性能会下降,突显了在自动驾驶系统中需要具备在远距离识别物体的稳健感知能力。
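The per-sample structure listed in the abstract (image, question, ground-truth answer, object distance) lends itself to a simple accuracy-by-distance breakdown. The following is a minimal sketch: the class name, field names, exact-match scoring, and bin edges are assumptions chosen for illustration (the 20 m and 30 m edges echo the close and long ranges mentioned in the abstract) and do not reflect the dataset's actual file format or evaluation script.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class DTPQASample:
    # Hypothetical field names; the released dataset may use different ones.
    image_path: str
    question: str
    answer: str
    distance_m: float

def accuracy_by_distance(samples, predictions,
                         bin_edges=(0.0, 20.0, 30.0, float("inf"))):
    """Exact-match accuracy per distance bin [lo, hi) in meters."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= sample.distance_m < hi:
                label = f"[{lo:g}, {hi:g}) m"
                totals[label] += 1
                correct = pred.strip().lower() == sample.answer.strip().lower()
                hits[label] += int(correct)
                break
    return {label: hits[label] / totals[label] for label in totals}

# Made-up samples and predictions, purely to show the call pattern.
samples = [
    DTPQASample("a.png", "Is there a pedestrian?", "yes", 8.0),
    DTPQASample("b.png", "Is there a pedestrian?", "no", 12.0),
    DTPQASample("c.png", "Is there a pedestrian?", "yes", 45.0),
    DTPQASample("d.png", "Is there a pedestrian?", "yes", 50.0),
]
predictions = ["yes", "no", "no", "yes"]
report = accuracy_by_distance(samples, predictions)
```

Exact string matching is the simplest possible scorer; a real evaluation would likely need answer normalization suited to the question types.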
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
Authors: Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See
First: 2026-05-14T14:41:17+00:00 · Latest: 2026-05-14T14:41:17+00:00
Comments: Work in progress
Abstract
Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
Authors: Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek
First: 2026-05-14T14:37:50+00:00 · Latest: 2026-05-14T14:37:50+00:00
Abstract
Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
Summary / 总结
This study aims to understand the structural anomalies in the shared latent spaces of contrastively pre-trained Vision-Language Models (VLMs) by decomposing their covariance matrices. The research employs spectral decomposition to separate the latent space into a semantic signal component and a shared noise subspace. Key findings indicate that the noise geometry shows strong subgroup invariance across different data subsets, and pruning these shared noise dimensions does not harm, and often improves, downstream task performance. This work offers new insights into the representational structure of modern VLMs, suggesting that a significant portion of their latent geometry is due to shared, architecture-level noise rather than task-relevant semantics alone.
该研究旨在通过分解协方差矩阵来理解对比预训练视觉-语言模型(VLMs)共享的潜在空间中的结构异常。研究采用谱分解将潜在空间分离为语义信号成分和共享噪声子空间。关键发现表明,噪声几何在不同数据子集上表现出强烈的子群不变性,去除这些共享噪声维度不会损害,反而可能提升下游任务性能。这项工作为现代VLMs的表征结构提供了新的机制性见解,表明其潜在几何结构的很大一部分是由共享的、架构级别的噪声而非仅由任务相关的语义所支配。
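The general recipe of eigen-decomposing an embedding covariance matrix and pruning a chosen subspace can be sketched as follows. This is only an illustration of the linear algebra: how the paper actually identifies which directions form the shared noise subspace is not stated in the abstract, so the choice of directions here (the top eigenvector of synthetic data) is a stand-in, and both function names are hypothetical.

```python
import numpy as np

def covariance_eigenbasis(embeddings):
    """Eigen-decompose the covariance of (n, d) embeddings.

    Returns eigenvalues in descending order and matching eigenvectors as columns.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def project_out(embeddings, directions):
    """Remove the components along orthonormal directions of shape (d, k)."""
    return embeddings - (embeddings @ directions) @ directions.T

# Synthetic stand-in for embeddings: isotropic data plus one dominant direction.
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 8))
data[:, 0] += 10.0 * rng.standard_normal(500)

eigvals, eigvecs = covariance_eigenbasis(data)
candidate_noise = eigvecs[:, :1]  # placeholder selection, not the paper's criterion
pruned = project_out(data, candidate_noise)
```

In a setting like the one the abstract describes, the selected directions would come from comparing spectra across modalities or data subsets, and the pruned embeddings would then be evaluated on downstream tasks.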
Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
Authors: Md Abu Obaida Zishan, Jannatun Noor, Annajiat Alim Rasel
First: 2026-05-09T05:13:21+00:00 · Latest: 2026-05-14T14:25:41+00:00
Comments: Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3
Abstract
Stable Diffusion (SD) has evolved DDPM (Denoising Diffusion Probabilistic Model) based image generation significantly by denoising in latent space instead of feature space. This popularized DDPM-based image generation as the cost and compute barrier was significantly lowered. However, these models could only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images show object duplication artifacts consistently. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved a great level of success. But dilated kernels are harder to fine-tune due to being zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient and achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of $2.6\%$ in accuracy and F1-Score relative to the baseline. This shows the applicability of our method to be general, where we interpolate fully-connected layers, going beyond convolution layers. We also discuss how we can reduce the memory footprints of training neural networks, using our method up to at least $4\times$.
中文标题/摘要
标题:超采样稳定扩散及更进一步:一种无需训练的无缝扩展神经网络的方法
稳定扩散(SD)通过在潜在空间而非特征空间去噪,显著提升了基于DDPM(去噪扩散概率模型)的图像生成技术,大幅降低了成本和计算门槛。然而,这些模型只能生成与其训练配置相匹配的固定分辨率图像。当尝试生成更高分辨率的图像时,结果图像会表现出对象重复的伪影。为了解决这一问题而不对SD模型进行微调,最近的研究尝试扩大模型卷积核的大小,并取得了显著的成功。但是,扩大的卷积核由于存在零间隙,难以进行微调。除了这种方法之外,其他方法,如补丁扩散,也无法高效地解决对象重复问题。因此,为了克服扩大小卷积核的局限性,我们提出了使用内插法对SD模型进行高分辨率图像生成。在本文中,我们通过数学证明了内插法可以在乘以一个常数系数后正确地扩展卷积核,并在无需训练的情况下使用稳定扩散生成超出训练分辨率的图像,取得了具有竞争力的实验结果。此外,我们展示了我们的方法能够内插深度神经网络以适应更高维度的训练数据,最坏情况下准确率和F1分数下降2.6%。这表明我们的方法具有广泛的适用性,我们不仅内插了全连接层,还超越了卷积层。我们还讨论了如何使用我们的方法减少训练神经网络的内存占用,最多可以减少4倍。
Summary / 总结
This paper addresses object-duplication artifacts that appear when Stable Diffusion (SD) models generate images above their training resolution. Instead of fine-tuning the models or dilating their convolution kernels, the authors propose kernel interpolation and show mathematically that interpolated kernels scale correctly when multiplied by a constant coefficient. The method achieves competitive results on beyond-training-resolution generation with zero training. It also extends to fully-connected layers, adapting deep networks to higher-dimensional training data with a worst-case drop of 2.6% in accuracy and F1-score relative to the baseline, and the authors discuss using it to reduce training memory footprints by up to at least 4×.
该论文解决了使用Stable Diffusion (SD)模型在高分辨率图像生成中出现的对象重复伪影问题。作者提出使用核插值来扩展SD模型以生成更高分辨率的图像,而无需对卷积核进行微调或膨胀。实验表明,这种方法可以在不显著降低准确率的情况下生成超出训练分辨率的图像,展示了其有效性和对全连接层的通用适用性。此外,该方法还能将训练神经网络的内存占用最多减少4倍。
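A minimal sketch of the kernel-interpolation idea described in the abstract above, under stated assumptions. The abstract does not say what the constant coefficient is, so this example picks one that preserves the kernel's sum (its response to a constant input). The function names `interpolate_kernel` and `rescale_to_match_sum` are hypothetical, and the 1D linear resampling is a simplification rather than the authors' implementation.

```python
def interpolate_kernel(kernel, new_size):
    """Linearly resample a 1D kernel to new_size taps (endpoints aligned).

    Assumes len(kernel) >= 2 and new_size >= 2.
    """
    old_size = len(kernel)
    resized = []
    for i in range(new_size):
        # Position of output tap i in the coordinate system of the old kernel.
        pos = i * (old_size - 1) / (new_size - 1)
        lo = int(pos)
        hi = min(lo + 1, old_size - 1)
        frac = pos - lo
        resized.append((1 - frac) * kernel[lo] + frac * kernel[hi])
    return resized


def rescale_to_match_sum(kernel, resized):
    """Multiply the resized kernel by one constant so its sum matches the original.

    ASSUMPTION: sum preservation is an illustrative choice of coefficient,
    not the coefficient derived in the paper. Requires sum(resized) != 0.
    """
    coeff = sum(kernel) / sum(resized)
    return [coeff * w for w in resized], coeff


if __name__ == "__main__":
    original = [1.0, 2.0, 1.0]
    bigger = interpolate_kernel(original, 5)
    scaled, coeff = rescale_to_match_sum(original, bigger)
    print(bigger, coeff, sum(scaled))
```

Without the rescaling step the resampled kernel has more taps and a larger sum, so the layer's output magnitude would drift; a single multiplicative constant is enough to undo that in this toy setting.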
AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving
Authors: Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv
First: 2026-03-16T05:50:31+00:00 · Latest: 2026-05-14T14:09:03+00:00
Abstract
Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to https://automot-website.github.io/ for the demonstration videos and qualitative results.
Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study
Authors: Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia
First: 2026-05-14T13:53:28+00:00 · Latest: 2026-05-14T13:53:28+00:00
Comments: Accepted at the 14th International Workshop on Biometrics and Forensics
Abstract
Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical "Rationalization Trap" emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
中文标题/摘要
标题:探索视觉语言模型在在线签名验证中的应用:零样本能力研究
近期视觉语言模型(VLMs)在通用视觉推理方面表现出强大的能力,但其在严格生物特征任务中的应用尚未被探索。本研究旨在评估最先进的VLMs(GPT-5.2和Gemini 2.5 Pro)在签名验证挑战(SVC)基准上的零样本性能。为了实现视觉处理,原始的运动时间序列被转换成静态图像,当源数据中存在压力信息时,将其编码为笔画的不透明度。此外,我们还引入了一种评分协议,通过提取潜在的标记概率来计算稳健的生物特征评分。实验结果揭示了性能的显著差异,这取决于信号质量和伪造类型。在随机伪造场景中,零样本VLM表现出卓越的区分能力,GPT-5.2在移动任务中的等错误率达到了0.32%,超越了监督学习的先进系统。而在高技能伪造场景中,由于两个签名几乎完全相同,任务更具挑战性,结果显著变差,并出现了一个关键的“合理化陷阱”:链式推理(CoT)降低了性能,因为模型生成运动幻觉来证明伪造特征是自然变异的结果。
Summary / 总结
This study evaluates the zero-shot performance of state-of-the-art Vision-Language Models (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. Raw kinematic time-series are converted into static images, and a scoring protocol derives biometric scores from latent token probabilities. In random forgery scenarios the zero-shot VLMs discriminate very well, with GPT-5.2 achieving an Equal Error Rate of 0.32% in mobile tasks and outperforming supervised state-of-the-art systems. In skilled forgery scenarios, however, results are significantly worse, and chain-of-thought reasoning degrades performance further through a 'Rationalization Trap' in which the model hallucinates kinematic details to justify forgery artifacts as natural variability.
研究评估了最先进的Vision-Language模型(GPT-5.2和Gemini 2.5 Pro)在签名验证挑战基准上的零样本性能。通过将原始的运动时间序列转换为静态图像并引入评分协议,研究在随机伪造场景中表现出色,GPT-5.2在移动任务中的等错误率为0.32%。但在高技能伪造场景中,由于‘推理陷阱’,模型会虚构运动细节来合理化伪造特征,导致性能下降。
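A hedged sketch of how a score might be computed from token probabilities, as the abstract above describes. The abstract does not specify the protocol, so this example assumes the model exposes log-probabilities for two answer tokens (one meaning "genuine", one meaning "forgery") and normalises between them. The function name `verification_score` and its arguments are hypothetical.

```python
import math


def verification_score(logprob_genuine, logprob_forgery):
    """Return the normalised probability of the 'genuine' answer token.

    ASSUMPTION: the two log-probabilities come from the same decoding step;
    the actual scoring protocol in the paper may differ.
    """
    p_genuine = math.exp(logprob_genuine)
    p_forgery = math.exp(logprob_forgery)
    return p_genuine / (p_genuine + p_forgery)


if __name__ == "__main__":
    # Hypothetical log-probabilities for one signature pair.
    print(verification_score(-0.1, -2.4))
```

A continuous score of this kind, unlike a hard yes/no answer, can be thresholded at different operating points, which is what a metric such as Equal Error Rate requires.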
The Velocity Deficit: Initial Energy Injection for Flow Matching
Authors: Linze Li, Zong-Wei Hong, Shen Zhang, Bo Lin, Jinglun Li, Yao Tang, Jiajun Liang
First: 2026-05-14T13:30:07+00:00 · Latest: 2026-05-14T13:30:07+00:00
Comments: Accepted by ICML2026
Abstract
While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold, a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory's start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
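The headline numbers in the abstract above can be checked with a few lines of arithmetic. The variable names are illustrative; the values are those quoted in the abstract.

```python
# FID values quoted in the abstract (lower is better).
fid_before = 13.68        # before applying SSC
fid_after = 7.58          # 50-step generator with SSC
fid_250_baseline = 8.65   # 250-step baseline

# Relative FID improvement: (before - after) / before.
relative_improvement = (fid_before - fid_after) / fid_before

# Speedup expressed as the ratio of sampling steps.
step_ratio = 250 / 50

print(f"relative FID improvement: {relative_improvement:.1%}")
print(f"step ratio: {step_ratio:.0f}x")
print(f"50-step SSC beats 250-step baseline: {fid_after < fid_250_baseline}")
```

The 5x figure here is the ratio of step counts (250 versus 50), which matches the speedup the abstract reports.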
SuperADD: Training-free Class-agnostic Anomaly Segmentation -- CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track
Authors: Lukas Roming, Felix Lehnerer, Jonas V. Funk, Andreas Michel, Georg Maier, Thomas Längle, Jürgen Beyerer
Venue: CVPR 2026
First: 2026-05-14T13:22:02+00:00 · Latest: 2026-05-14T13:22:02+00:00
Comments: Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track
Abstract
Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of $62.61\%$, $57.42\%$, and $54.35\%$ on test public, private, and private mixed of MVTec AD 2 respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at https://github.com/LukasRoom/SuperADD.
Summary / 总结
The paper addresses visual anomaly detection in industrial settings where training and deployment data differ because acquisition conditions, such as illumination, change during production. It proposes SuperADD, a training-free and class-agnostic anomaly detection pipeline that builds on SuperAD and adds a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling, and iterative morphological closing. With a single architecture and one shared hyperparameter configuration across all object classes, it achieves segmentation F1 scores of 62.61%, 57.42%, and 54.35% on the public, private, and private mixed test sets of MVTec AD 2, respectively, outperforming SuperAD and other state-of-the-art methods.
论文针对工业环境中训练数据和部署数据因采集条件变化而不同所带来的视觉异常检测挑战。提出了一种训练免费且跨类别的异常检测管道SuperADD,通过使用DINOv3骨干网络、重叠的块级处理和改进的数据增强技术等修改来增强鲁棒性。该方法在MVTec AD 2数据集上分别实现了公共测试集62.61%、私有测试集57.42%和混合私有测试集54.35%的分割F1分数,超过了现有方法如SuperAD和其他最先进的方法。
Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation
Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid
First: 2026-05-14T13:09:16+00:00 · Latest: 2026-05-14T13:09:16+00:00
Abstract
In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.
EVA: Editing for Versatile Alignment against Jailbreaks
Authors: Yi Wang, Hongye Qiu, Yue Xu, Sibei Yang, Zhan Qin, Minlie Huang, Wenjie Wang
First: 2026-05-14T12:16:10+00:00 · Latest: 2026-05-14T12:16:10+00:00
Comments: IEEE TPAMI 2026
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities but remain vulnerable to jailbreaking attacks, where adversaries exploit textual or visual triggers to bypass safety guardrails. Recent defenses typically rely on safety fine-tuning or external filters to reduce the model's likelihood of producing harmful content. While effective to some extent, these methods often incur significant computational overheads and suffer from the safety utility trade-off, degrading the model's performance on benign tasks. To address these challenges, we propose EVA (Editing for Versatile Alignment against Jailbreaks), a novel framework that pioneers the application of direct model editing for safety alignment. EVA reframes safety alignment as a precise knowledge correction task. Instead of retraining massive parameters, EVA identifies and surgically edits specific neurons responsible for the model's susceptibility to harmful instructions, while leaving the vast majority of the model unchanged. By localizing the updates, EVA effectively neutralizes harmful behaviors without compromising the model's general reasoning capabilities. Extensive experiments demonstrate that EVA outperforms baselines in mitigating jailbreaks across both LLMs and VLMs, offering a precise and efficient solution for post-deployment safety alignment.
中文标题/摘要
标题:EVA:针对脱管攻击的多功能对齐编辑
大型语言模型(LLMs)和视觉语言模型(VLMs)展示了令人印象深刻的性能,但仍然容易受到脱管攻击的影响,攻击者利用文本或视觉触发器绕过安全防护。最近的防御措施通常依赖于安全微调或外部过滤器来降低模型生成有害内容的可能性。虽然这些方法在一定程度上有效,但它们往往会产生显著的计算开销,并且会牺牲安全性和性能之间的权衡,从而降低模型在良性任务上的表现。为了解决这些挑战,我们提出了EVA(针对脱管攻击的多功能对齐编辑),这是一种新颖的框架,首次将直接模型编辑应用于安全对齐。EVA将安全对齐重新定义为精确的知识修正任务。EVA不重新训练大量参数,而是识别并精确编辑导致模型对有害指令敏感的特定神经元,同时保留模型的大部分不变。通过局部更新,EVA有效地消除了有害行为,而不损害模型的一般推理能力。广泛的实验表明,EVA在LLMs和VLMs中都优于基线方法,提供了精确且高效的部署后安全对齐解决方案。
Summary / 总结
EVA is a novel framework that addresses the vulnerability of Large Language Models (LLMs) and Vision Language Models (VLMs) to jailbreak attacks by directly editing specific neurons to correct harmful behaviors. Unlike safety fine-tuning or external filters, EVA identifies and edits these neurons without retraining large portions of the model, thus maintaining the model's performance on benign tasks. Experiments show that EVA effectively mitigates jailbreaks in both LLMs and VLMs, offering a precise and efficient solution for post-deployment safety alignment.
EVA 是一种新型框架,通过直接编辑导致有害指令的特定神经元来解决大型语言模型和视觉语言模型对 jailbreak 攻击的脆弱性问题。与安全微调或外部过滤器不同,EVA 减少了计算开销并保持了模型在良性任务上的性能。实验表明,EVA 在 LLM 和 VLM 中有效缓解了 jailbreak 问题,提供了一种精确且高效的部署后安全对齐解决方案。
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
Authors: Yue Ma, Ziyuan Yang, Yi Zhang
First: 2026-05-03T07:38:42+00:00 · Latest: 2026-05-14T12:12:16+00:00
Comments: 12 pages
Abstract
Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreaks, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses and observe that they operate globally via a shared cure factor, whereas infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
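The Recursive Binary Diagnosis step described in the abstract above can be sketched as a divide-and-conquer search. This is an illustration under assumptions, not the authors' code: `is_infected` is a hypothetical oracle that stands in for the paper's persona-based diagnosis, and the album is modelled as a plain list of image identifiers.

```python
def recursive_binary_diagnosis(album, is_infected):
    """Return the items of `album` that the diagnosis localizes as infectious.

    `is_infected(subset)` is a HYPOTHETICAL callable that reports whether a
    subset of the album triggers the infection signal.
    """
    if not album or not is_infected(album):
        return []               # clean partition: nothing to remove
    if len(album) == 1:
        return list(album)      # infection localized to a single image
    mid = len(album) // 2
    # Apply the same diagnosis to each half of the album.
    return (recursive_binary_diagnosis(album[:mid], is_infected)
            + recursive_binary_diagnosis(album[mid:], is_infected))


if __name__ == "__main__":
    album = [f"img{i}" for i in range(8)]
    planted = {"img2", "img5"}  # toy ground truth for the demo oracle
    oracle = lambda subset: any(item in planted for item in subset)
    print(recursive_binary_diagnosis(album, oracle))
```

Because clean halves are discarded after a single diagnosis call, the number of calls grows with the number of infected items and the logarithm of the album size rather than with the album size itself.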
From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models
Authors: Zicheng Fan, Kunihiko Fujiwara, Pengyuan Liu, Fan Zhang, Filip Biljecki
First: 2025-05-17T03:41:45+00:00 · Latest: 2026-05-14T12:09:49+00:00
Abstract
Visibility analysis in urban planning has traditionally relied on line-of-sight (LoS) simulations, which capture geometric occlusion. However, these approaches depend on accurate 3D data that is often unavailable and may not adequately represent how visually distinctive urban landmarks are encountered in real streetscapes. We reformulate landmark visibility assessment as an urban visual search problem in image space by leveraging the widespread availability of street view imagery (SVI). Given a reference image of a target landmark, a Vision Language Model (VLM) is applied to detect the landmark in direction- and zoom-controlled SVI. A successful detection indicates machine-recognised landmark visibility at the corresponding viewpoint. Beyond isolated viewpoints, we construct a heterogeneous visibility graph to represent visual connectivity among landmarks, street-view locations, and the urban spaces that mediate them. This graph enables us to map where visual connections occur, how strong they are, and how multiple landmarks become jointly connected through shared visual corridors. Across six well-known landmark structures in global cities, the image-based method achieves an overall detection accuracy of 87%, with a precision score of 68% for landmark-visible locations. In a second case study along the River Thames in London, the visibility graph reveals multi-landmark connections and identifies key mediating locations, with bridges accounting for approximately 31% of all connections. The proposed method complements LoS-based visibility analysis and offers a practical alternative in data-constrained settings. It also showcases the possibility of revealing the prevalent connections of visual objects in the urban environment, opening new perspectives for urban planning and heritage conservation.
中文标题/摘要
标题:从街景到视觉网络:使用视觉语言模型映射城市地标可见性
在城市规划中,可见性分析传统上依赖视线(LoS)模拟,捕捉几何遮挡。然而,这些方法依赖于准确的3D数据,这些数据往往不可用,可能无法充分代表人们在真实街道景观中遇到视觉独特地标的情况。我们通过利用广泛可用的街景图像(SVI)将地标可见性评估重新表述为图像空间中的城市视觉搜索问题。给定目标地标的一张参考图像,应用视觉语言模型(VLM)在方向和缩放控制的SVI中检测地标。成功的检测表明机器识别的地标可见性对应于相应的视角。除了孤立的视角,我们构建了一个异构可见性图来表示地标、街景位置以及它们之间的城市空间之间的视觉连接。该图使我们能够映射视觉连接发生的位置、强度以及多个地标通过共享视觉走廊联合连接的情况。在六个全球城市的知名地标结构中,基于图像的方法总体检测准确率为87%,地标可见位置的精确得分为68%。在伦敦泰晤士河的第二个案例研究中,可见性图揭示了多地标连接,并确定了关键的中介位置,桥梁占所有连接的约31%。所提出的方法补充了基于视线的可见性分析,并在数据受限的环境中提供了一种实用的替代方案。它还展示了揭示城市环境中视觉对象普遍连接的可能性,为城市规划和遗产保护提供了新的视角。
Summary / 总结
The study addresses the limitations of traditional line-of-sight (LoS) simulations in urban planning by reformulating landmark visibility as an urban visual search problem using Vision Language Models (VLMs) and street view imagery (SVI). The method detects landmarks in controlled SVI to assess visibility and constructs a visibility graph to represent visual connectivity among landmarks and urban spaces. Across six global landmarks, the image-based method achieved 87% detection accuracy and 68% precision. In a Thames case study, bridges accounted for 31% of connections, highlighting the method's practicality and potential for urban planning and heritage conservation.
研究通过使用视觉语言模型(VLM)和街景图像(SVI)将地标可见性问题重新定义为城市视觉搜索问题,以解决传统视线(LoS)模拟在城市规划中的局限性。该方法通过在控制视角的SVI中检测地标来评估可见性,并构建可视化图来表示地标、街景位置和城市空间之间的视觉连接。在六个全球地标中,基于图像的方法实现了87%的检测准确率和68%的精度。在泰晤士河案例研究中,桥梁占连接的31%,突显了该方法在城市规划和文化遗产保护中的实用性和潜力。
OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation
Authors: Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu
First: 2025-12-03T07:51:03+00:00 · Latest: 2026-05-14T12:05:47+00:00
Abstract
Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
中文标题/摘要
标题:OpenTrack3D:朝向准确且通用的开放词汇3D实例分割
将开放词汇3D实例分割(OV-3DIS)推广到多样、无结构且无网格的环境中对于机器人技术和AR/VR至关重要,但仍然是一个重大挑战。我们将其归因于现有方法的两个关键限制:(1)提案生成依赖于数据集特定的提案网络或基于网格的超点,使其在无网格场景中不适用,并限制了对新场景的泛化;(2)基于CLIP的分类器的弱文本推理能力,难以识别组合性和功能性用户查询。为了解决这些问题,我们提出了OpenTrack3D,这是一种通用且准确的框架。与依赖预生成提案的方法不同,OpenTrack3D采用了一种新颖的视觉-空间跟踪器来在线构建跨视图一致的对象提案。给定一个RGB-D流,我们的流水线首先利用2D开放词汇分割器生成掩码,然后使用深度信息将这些掩码提升到3D点云。掩码引导的实例特征随后使用DINO特征图提取,我们的跟踪器融合视觉和空间线索以保持实例一致性。核心流水线完全无网格,但我们还提供了一个可选的超点细化模块,当场景网格可用时,可以进一步提高性能。最后,我们用多模态大型语言模型(MLLM)取代了CLIP,显著增强了对复杂用户查询的组合性推理能力。在包括ScanNet200、Replica、ScanNet++和SceneFun3D在内的多种基准上的广泛实验表明,该方法具有最先进的性能和强大的泛化能力。
Summary / 总结
OpenTrack3D addresses the challenge of open-vocabulary 3D instance segmentation in diverse and unstructured environments by introducing a novel visual-spatial tracker that generates cross-view consistent object proposals online. The framework leverages a 2D open-vocabulary segmenter and DINO feature maps to extract instance features, and uses a multi-modal large language model for enhanced compositional reasoning. Experiments on various benchmarks show that OpenTrack3D achieves state-of-the-art performance and strong generalization capabilities.
OpenTrack3D通过引入一种新型的视觉-空间追踪器,在多变和非结构化的环境中在线生成跨视图一致的对象提案,解决了开放词汇3D实例分割的挑战。该框架利用2D开放词汇分割器和DINO特征图来提取实例特征,并使用多模态大型语言模型增强组合推理。在各种基准上的实验表明,OpenTrack3D实现了最先进的性能和强大的泛化能力。
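The step in the abstract above where 2D masks are "lifted to 3D point clouds using depth" is commonly done with pinhole back-projection. The sketch below assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a depth map aligned with the mask; the function name `lift_mask_to_points` is hypothetical and the paper's actual pipeline may differ.

```python
def lift_mask_to_points(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth into camera-frame 3D points.

    mask  : 2D list of 0/1 values (rows indexed by v, columns by u)
    depth : 2D list of depths in the same layout; values <= 0 are treated as invalid
    """
    points = []
    for v, row in enumerate(mask):
        for u, inside in enumerate(row):
            z = depth[v][u]
            if inside and z > 0:
                # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


if __name__ == "__main__":
    mask = [[0, 1], [1, 1]]
    depth = [[1.0, 2.0], [0.0, 4.0]]   # the pixel with depth 0.0 is skipped
    print(lift_mask_to_points(mask, depth, fx=1.0, fy=1.0, cx=1.0, cy=0.0))
```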
DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search
Authors: Yuchuan Deng, Zhanpeng Hu, Zijie Xin, Chuang Deng, Qijun Zhao
Venue: ICME
First: 2024-05-13T04:21:00+00:00 · Latest: 2026-05-14T11:40:35+00:00
Abstract
Text-based person search (TBPS) aims to retrieve specific images of individuals from large datasets using textual descriptions. Existing TBPS methods focus primarily on identifying explicit positive attributes, often neglecting the critical role of negative descriptions. This oversight can lead to false positives, where images that should be excluded based on negative descriptions are incorrectly included, due to partial alignment with the positive criteria. To address this limitation, we propose the Dual Attribute Prompt Learning (DAPL) framework, which incorporates both positive and negative descriptions to improve the interpretative accuracy of vision-language models in TBPS tasks. DAPL combines Dual Image-Attribute Contrastive (DIAC) learning with Sensitive Image-Attribute Matching (SIAM) learning to enhance the detection of previously unseen attributes. Furthermore, to achieve a balance between coarse and fine-grained alignment of visual and textual embeddings, we introduce the Dynamic Token-wise Similarity (DTS) loss. This loss function refines the representation of both matching and non-matching descriptions at the token level, providing more precise and adaptable similarity assessments, and ultimately improving the accuracy of the matching process. Empirical results demonstrate that DAPL outperforms state-of-the-art methods, enhancing both precision and robustness in TBPS tasks.
中文标题/摘要
标题:DAPL:文本基于的人像搜索中正负描述的整合
文本基于的人像搜索(TBPS)旨在使用文本描述从大型数据集中检索特定个体的图像。现有的TBPS方法主要集中在识别显式的正属性,往往忽视了负描述的关键作用。这种忽视可能导致误报,即基于负描述应被排除的图像由于部分符合正描述标准而被错误地包含。为解决这一局限,我们提出了双属性提示学习(DAPL)框架,该框架结合了正负描述以提高视觉-语言模型在TBPS任务中的解释准确性。DAPL结合了双图像属性对比学习(DIAC)和敏感图像属性匹配学习(SIAM)来增强对未见过属性的检测。此外,为了在视觉和文本嵌入之间实现粗细粒度的平衡对齐,我们引入了动态令牌级相似性(DTS)损失。该损失函数在令牌级别细化匹配和非匹配描述的表示,提供更精确和适应性的相似性评估,最终提高匹配过程的准确性。实验证明,DAPL在TBPS任务中优于现有方法,提高了精确度和鲁棒性。
SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
Authors: Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu
First: 2026-05-14T11:21:41+00:00 · Latest: 2026-05-14T11:21:41+00:00
Abstract
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
Summary / 总结
SceneFunRI is a benchmark for reasoning about invisible objects in real-world scenes, addressing the challenge of task-driven functional object localization where target objects are not visible. It uses a semi-automatic pipeline based on the SceneFun3D dataset and includes 855 instances requiring models to infer invisible object locations from task instructions and commonsense reasoning. The strongest baseline model achieves low performance with a CAcc@75 of 15.20, mIoU of 0.74, and Dist of 28.65, highlighting the instability of invisible-region reasoning in current VLMs and the need for models that integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
SceneFunRI 是一个用于推理不可见物体的基准,旨在解决在真实场景中目标物体不可见时的任务驱动功能性物体定位问题。它基于 SceneFun3D 数据集使用半自动管道,并包含 855 个实例,要求模型根据任务指令和常识推理来推断不可见物体的位置。最强的基线模型表现不佳,CAcc@75 为 15.20,mIoU 为 0.74,Dist 为 28.65,这表明当前 VLMs 在不可见区域推理方面仍不稳定,未来需要结合任务意图、常识先验、空间定位和不确定性搜索的模型。
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Authors: Tian Qin, Junzhe Chen, Yuqing Shi, Tianshu Zhang, Qiang Ju, Lijie Wen
First: 2026-05-14T09:37:55+00:00 · Latest: 2026-05-14T09:37:55+00:00
Abstract
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
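A minimal sketch of token-level contrast between a full-visual branch and a language-prior-dominated reference, as described in the abstract above. The abstract does not give SIRA's exact scoring rule; the formula below, `(1 + alpha) * full - alpha * reference`, is the form commonly used in contrastive decoding and is an assumption here, as are the function name `contrast_logits`, the `alpha` parameter, and the toy logits.

```python
def contrast_logits(full_logits, reference_logits, alpha=1.0):
    """Boost tokens whose advantage depends on the full visual pathway.

    full_logits      : dict token -> logit from the normal forward pass
    reference_logits : dict token -> logit from the branch with late image attention masked
    ASSUMPTION: this linear contrast is a generic rule, not SIRA's published one.
    """
    return {
        tok: (1 + alpha) * full_logits[tok] - alpha * reference_logits[tok]
        for tok in full_logits
    }


if __name__ == "__main__":
    # Toy case: "dog" stays strong without late visual access (language prior),
    # while "cat" is supported mainly by visual evidence.
    full = {"dog": 2.0, "cat": 1.8}
    reference = {"dog": 2.5, "cat": 0.2}
    adjusted = contrast_logits(full, reference)
    print(max(full, key=full.get), "->", max(adjusted, key=adjusted.get))
```

In this toy case the greedy choice flips from the prior-driven token to the visually supported one, which is the behaviour the abstract attributes to suppressing tokens that remain strong without late visual access.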
Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Authors: Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin
First: 2025-10-17T17:42:28+00:00 · Latest: 2026-05-14T09:15:35+00:00
Abstract
Accurate tongue segmentation is crucial for reliable traditional Chinese medicine (TCM) analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
中文标题/摘要
标题:Memory-SAM:无需人工提示的舌段分割
准确的舌段分割对于可靠的中医分析至关重要。监督模型需要大量标注数据集,而SAM家族模型仍依赖于提示。我们提出Memory-SAM,这是一种无需训练、无需人工提示的流水线,通过密集的DINOv3特征和FAISS检索,从少量的先前案例记忆中自动生成有效的提示。给定查询图像,通过掩码约束检索到的示例的对应关系被提炼成前景/背景点提示,指导SAM2进行分割,无需手动点击或模型微调。我们在600张专家标注图像(300张受控,300张野外)上进行了评估。在混合测试集上,Memory-SAM的mIoU为0.9863,超越了FCN(0.8188)和一个检测到框的SAM基线(0.1839)。在受控数据上,天花板效应使得超过0.98的小差异变得不那么有意义,而我们的方法在真实条件下显示出明显的改进。结果表明,检索到提示能够实现数据高效、鲁棒的舌影像不规则边界分割。代码已公开发布在https://github.com/jw-chae/memory-sam。
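The retrieval half of the retrieval-to-prompt idea in the abstract above can be sketched with plain cosine similarity. This is a stand-in under assumptions: the paper uses dense DINOv3 features with FAISS retrieval, whereas this toy uses short hand-written vectors and a linear scan; `retrieve_exemplar`, the memory layout, and the stored point prompts are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_exemplar(query_feature, memory):
    """Return the memory entry whose feature is most similar to the query."""
    return max(memory, key=lambda entry: cosine(query_feature, entry["feature"]))


if __name__ == "__main__":
    # HYPOTHETICAL memory: each prior case stores a feature and example point prompts.
    memory = [
        {"name": "case_a", "feature": [1.0, 0.0, 0.0], "fg_points": [(40, 52)]},
        {"name": "case_b", "feature": [0.0, 1.0, 1.0], "fg_points": [(61, 70)]},
    ]
    best = retrieve_exemplar([0.1, 0.9, 0.8], memory)
    print(best["name"], best["fg_points"])
```

In the pipeline the abstract describes, the retrieved exemplar's mask would then constrain correspondences that become foreground/background points for SAM2; that step is omitted here.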
HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
Authors: Quanjian Song, Xinyu Wang, Donghao Zhou, Jingyu Lin, Cunjian Chen, Yue Ma
First: 2025-08-25T01:22:15+00:00 · Latest: 2026-05-14T09:03:20+00:00
Comments: 12 pages in total
Abstract
Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
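Translated title/abstract
Title: HERO: Hierarchical Extrapolation and Refresh for Efficient World Models
Generation-driven world models can create immersive virtual environments, but inference is slow because of the iterative nature of diffusion models. Although recent advances have made diffusion models more efficient, applying those techniques directly to world models introduces limitations such as quality degradation. In this paper, we propose HERO, a training-free hierarchical acceleration framework designed for efficient world models. Owing to the multi-modal nature of world models, we find that shallow-layer features show high temporal variability, while deeper layers provide more stable feature representations. Motivated by this, HERO accelerates inference with hierarchical strategies: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects the tokens that need recomputation. Through patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and stays compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features, which completely bypasses the computation of attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73x speedup with almost no loss in quality, significantly outperforming existing diffusion acceleration methods.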
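The linear extrapolation step described for deeper layers can be illustrated with a generic first-order extrapolation. This is only a sketch under stated assumptions: the abstract does not give HERO's actual formula, so the function name `extrapolate_features`, the `step_ratio` parameter, and the use of exactly two cached steps are all hypothetical.

```python
from typing import List


def extrapolate_features(
    prev2: List[float],
    prev1: List[float],
    step_ratio: float = 1.0,
) -> List[float]:
    """Estimate the next feature vector from two cached ones.

    Generic first-order extrapolation (an assumption, not HERO's
    published rule): f_next = f_prev1 + r * (f_prev1 - f_prev2).
    """
    if len(prev2) != len(prev1):
        raise ValueError("feature vectors must have the same length")
    return [b + step_ratio * (b - a) for a, b in zip(prev2, prev1)]


# Features cached at two earlier denoising steps (toy values).
cached_older = [1.0, 2.0]
cached_newer = [2.0, 4.0]

# The estimate replaces a full attention + feed-forward pass for this layer.
estimated = extrapolate_features(cached_older, cached_newer)
```

The point of the sketch is the cost profile: the estimate needs only element-wise arithmetic on cached features, which is why skipping the attention and feed-forward computation can save time when deep-layer features change smoothly.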
Discovering Physical Directions in Weight Space: Composing Neural PDE Experts
Authors: Pengkai Wang, Pengwei Liu, Yuanyi Wang, Guanyu Chen, Xingyu Ren, Xiaolong Li, Zhongkai Hao, Yuting Kong, Qixin Zhang, Dong Ni
First: 2026-05-14T08:25:16+00:00 · Latest: 2026-05-14T08:25:16+00:00
Abstract
Recent advances in neural operators have made partial differential equation (PDE) surrogate modeling increasingly scalable and transferable through large-scale pretraining and in-context adaptation. However, after a shared operator is fine-tuned to multiple regimes within a continuous physical family, it remains unclear whether the resulting weight-space updates merely form isolated regime experts or reveal reusable physical structure. Starting from a shared family anchor, we fine-tune low- and high-regime endpoint experts and show that their updates can be separated into a family-shared adaptation and a direction aligned with the underlying physical parameter. This separation reinterprets endpoint experts as finite-difference probes of a local physical direction in weight space, explaining why static averaging can interpolate between regimes but attenuates endpoint-specific physics. Building on this perspective, we propose Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout method for composing neural PDE experts along this physical direction. Given physical metadata, a calibrated coordinate mapping, or a short observed rollout prefix, CCM infers the target composition coordinate and deploys a single merged checkpoint for the remaining rollout. We evaluate CCM on the reaction--diffusion system, viscosity-parameterized two-dimensional Navier--Stokes equations, and radial dam-break dynamics. Across these benchmarks, CCM achieves its strongest gains in extrapolative regimes, reducing out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8%, respectively. Further experiments across FNO scales, a DPOT-style backbone, and ablations confirm that endpoint fine-tuning is not arbitrary checkpoint drift, but reveals a calibratable physical direction for training-free transfer across PDE regimes.
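The finite-difference reading of endpoint experts can be made concrete with a toy weight-space decomposition. This is a minimal sketch, assuming a symmetric parametrization in which coordinate -1 recovers the low-regime expert and +1 the high-regime expert; the names `decompose` and `merge`, and this particular coordinate convention, are illustrative assumptions rather than the CCM procedure itself.

```python
from typing import Dict, Tuple

Weights = Dict[str, float]


def decompose(anchor: Weights, low: Weights, high: Weights) -> Tuple[Weights, Weights]:
    """Split endpoint updates into a shared part and a directional part."""
    shared = {
        k: 0.5 * ((low[k] - anchor[k]) + (high[k] - anchor[k])) for k in anchor
    }
    direction = {k: 0.5 * (high[k] - low[k]) for k in anchor}
    return shared, direction


def merge(anchor: Weights, shared: Weights, direction: Weights, coord: float) -> Weights:
    """Compose one checkpoint at a chosen coordinate along the direction."""
    return {k: anchor[k] + shared[k] + coord * direction[k] for k in anchor}


# Toy single-parameter "checkpoints".
anchor = {"w": 1.0}
low_expert = {"w": 1.5}
high_expert = {"w": 2.5}

shared, direction = decompose(anchor, low_expert, high_expert)
averaged = merge(anchor, shared, direction, coord=0.0)      # plain average of the experts
extrapolated = merge(anchor, shared, direction, coord=2.0)  # beyond the high endpoint
```

Under this convention, `coord=0.0` reproduces static averaging of the two experts, which keeps the shared adaptation but cancels the directional term. That matches the abstract's remark that averaging interpolates between regimes while attenuating endpoint-specific physics.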
TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models
Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao
First: 2026-02-04T15:33:10+00:00 · Latest: 2026-05-14T08:16:18+00:00
Abstract
Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate inference has emerged as a hot topic. However, most existing methods rely on heuristics based on inter-visual-token similarity or cross-modal visual-text similarity, which limits both compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives: it recasts visual token compression as preserving output invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered under the guidance of token-level gradient saliency generated by our layer-local proxy loss, a coarse constraint from the current layer to the final result. The most valuable vision tokens are then selected following the non-maximum suppression (NMS) principle. TRIO is training-free and compatible with FlashAttention, which makes it practical to deploy. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, TRIO retains just 11.1\% of visual tokens but maintains 97.2\% of the original performance, with a 2.75$\times$ prefill speedup, 2.14$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at https://github.com/ocy1/TRIO.
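Selecting tokens by saliency under an NMS principle can be sketched as a greedy loop: take the most salient token, then suppress tokens that are too similar to anything already kept. This is a hedged illustration only. The abstract does not specify the suppression criterion, so the cosine-similarity test, the `sim_threshold` value, and the function name `nms_select` are assumptions, and the saliency scores are toy numbers rather than gradients of the paper's proxy loss.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nms_select(
    tokens: Sequence[Sequence[float]],
    saliency: Sequence[float],
    keep: int,
    sim_threshold: float = 0.9,
) -> List[int]:
    """Greedily keep up to `keep` salient tokens, skipping near-duplicates."""
    order = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) >= keep:
            break
        # Suppress token i if it is too similar to an already kept token.
        if all(cosine(tokens[i], tokens[j]) <= sim_threshold for j in kept):
            kept.append(i)
    return sorted(kept)  # restore original token order


# Toy example: token 1 is nearly a duplicate of token 0.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [1.0, 1.0]]
saliency = [0.9, 0.8, 0.5, 0.1]
selected = nms_select(tokens, saliency, keep=2)
```

In the toy example the second-most-salient token is skipped because it nearly duplicates the first, so the budget goes to a less salient but more distinct token. That is the behavior an NMS-style rule adds on top of plain top-k selection.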
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Authors: Sujung Hong, Chanyong Yoon, Seongjae Hwang
First: 2026-05-14T08:11:32+00:00 · Latest: 2026-05-14T08:11:32+00:00
Abstract
Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
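Translated title/abstract
Title: Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Large diffusion vision-language models (LDVLMs) have recently become a promising alternative to autoregressive models, offering parallel decoding for more efficient inference and bidirectional attention for global context. Despite this progress, their behavior in long-form generation remains underexplored. In this paper, we show that existing LDVLMs exhibit repetitive generation and degraded visual grounding, and we identify two root causes. First, repetitive generation stems from a mask token prior: because generation tokens are initialized as mask tokens, their hidden representations gradually drift toward a shared prior direction over the generation steps. Second, a fundamental mismatch between the positional attention bias and the iterative unmasking process suppresses attention to informative visual tokens, which weakens visual grounding. Based on these insights, we propose a training-free method that introduces Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks show improvements over baseline LDVLMs, with robust gains on long-form description benchmarks in particular. Our results show that these failures can be addressed effectively with a lightweight, plug-and-play strategy that needs no additional training and applies across diverse LDVLM architectures.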
Summary / 总结
This study addresses mask prior drift and positional attention collapse in large diffusion vision-language models (LDVLMs). It is motivated by the observation that existing LDVLMs suffer from repetitive generation and degraded visual grounding in long-form generation. The method is a training-free approach that introduces Mask Prior Suppression and Monotonic RoPE Scaling to mitigate both issues during decoding. Experiments on general multimodal benchmarks and visual grounding tasks show improvements over baseline LDVLMs, with robust gains on long-form description benchmarks, supporting a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.