arXiv 论文速递

Snapshot: 20260508_0435

Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging

Authors: Bernhard Kainz, Johanna P Mueller, Matthew Baugh, Cosmin Bercea

Venue: MICCAI 2026

First: 2026-05-06T17:32:34+00:00 · Latest: 2026-05-06T17:32:34+00:00

Comments: submitted to MICCAI 2026

Abstract

Zero-shot anomaly localisation via vision-language models (VLMs) offers a compelling approach for rare pathology detection, yet its performance is fundamentally limited by the absence of healthy anatomical context. We reformulate zero-shot localisation as a comparative inference problem in which anomalies are identified through structured comparison against reference distributions of normal anatomy. We introduce WALDO, a training-free framework grounded in optimal transport theory that enables comparative reasoning through: (i) entropy-weighted Sliced Wasserstein distances for anatomically-aware reference selection from DINOv2 patch distributions, (ii) Goldilocks zone sampling exploiting the non-monotonic relationship between reference similarity and localisation accuracy, and (iii) self-consistency aggregation via weighted non-maximum suppression. We theoretically analyse the Goldilocks effect through distributional divergence, and show that references with moderate similarity minimize a bias-variance trade-off in comparative visual reasoning. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves $43.5_{\pm1.6}\%$ mAP@30 (95\% CI: [40.4, 46.7]), representing a 19\% relative improvement over zero-shot baselines. Cross-model evaluation shows consistent gains: GPT-4o achieves $32.0_{\pm6.5}\%$ and Qwen3-VL-32B achieves $32.0_{\pm6.6}\%$ mAP@30. Paired McNemar tests confirm statistical significance ($p<0.01$). Source code is available at https://github.com/bkainz/WALDO_MICCAI26_demo .
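As a rough illustration of the distance used for reference selection, the sketch below computes a plain Monte Carlo Sliced Wasserstein distance between two sets of patch embeddings. It is not WALDO's implementation: the entropy weighting from the abstract is omitted, and the function name, the equal-size assumption, and the projection count are all choices made here for illustration.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Approximate the sliced 1-Wasserstein distance between two point sets.

    x, y: arrays of shape (n, d) with the same n (e.g. patch embeddings).
    The equal-size restriction is an assumption of this sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized point sets")
    rng = np.random.default_rng(seed)
    # Draw random unit directions in R^d.
    dirs = rng.normal(size=(n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project both sets, sort each 1-D marginal, compare order statistics.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean(np.abs(px - py)))
```

In a reference-selection setting, one would presumably compute this distance between the query image's patch features and each candidate reference, then pick references from a moderate-distance band rather than the nearest ones.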

Summary / 总结

The paper addresses the challenge of zero-shot anomaly localization in medical imaging using vision-language models (VLMs), which are limited by the lack of healthy anatomical context. It introduces WALDO, a training-free framework based on optimal transport theory, which uses entropy-weighted Sliced Wasserstein distances for reference selection, Goldilocks zone sampling that favors references of moderate similarity, and self-consistency aggregation via weighted non-maximum suppression. On the NOVA brain MRI benchmark, WALDO with Qwen2.5-VL-72B achieves 43.5% mAP@30, a 19% relative improvement over zero-shot baselines, with consistent gains across different models. Statistical significance is confirmed through paired McNemar tests.

该论文针对使用视觉语言模型（VLMs）进行医学影像中的零样本异常定位面临的挑战，这些模型受限于缺乏健康解剖上下文。它引入了基于最优传输理论的WALDO框架，该框架使用熵加权的Sliced Wasserstein距离进行参考选择，使用Goldilocks区采样以选取相似度适中的参考，并通过加权非极大值抑制实现自一致性聚合以提高定位准确性。在NOVA脑MRI基准测试上，WALDO（使用Qwen2.5-VL-72B）实现了43.5%的mAP@30，比零样本基线相对提高了19%，不同模型的评估结果一致显示了改进。通过配对McNemar检验确认了统计显著性。

Shape2Animal: Creative Animal Generation from Natural Silhouettes

Authors: Quoc-Duy Tran, Anh-Tuan Vo, Dinh-Khoi Vo, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

First: 2025-06-25T17:04:08+00:00 · Latest: 2026-05-06T17:28:14+00:00

Abstract

Humans possess a unique ability to perceive meaningful patterns in ambiguous stimuli, a cognitive phenomenon known as pareidolia. This paper introduces the Shape2Animal framework to mimic this imaginative capacity by reinterpreting natural object silhouettes, such as clouds, stones, or flames, as plausible animal forms. Our automated framework first performs open-vocabulary segmentation to extract the object silhouette and uses vision-language models to infer semantically appropriate animal concepts. It then synthesizes an animal image that conforms to the input shape using a text-to-image diffusion model, and seamlessly blends it into the original scene to generate visually coherent and spatially consistent compositions. We evaluated Shape2Animal on a diverse set of real-world inputs, demonstrating its robustness and creative potential. Shape2Animal can offer new opportunities for visual storytelling, educational content, digital art, and interactive media design. Our project page is here: https://shape2image.github.io

中文标题/摘要

标题：Shape2Animal: 从自然轮廓生成创意动物

人类具有在模糊刺激中感知有意义模式的独特能力，这一认知现象被称为pareidolia。本文介绍了一种名为Shape2Animal的框架，通过重新解释自然物体轮廓，如云朵、石头或火焰，将其转化为可能的动物形态来模仿这种想象力。我们的自动化框架首先执行开放式词汇分割以提取物体轮廓，并使用视觉语言模型解释合适的动物概念。然后，利用文本到图像的扩散模型合成符合输入形状的动物图像，并将其无缝融合到原始场景中，生成视觉上连贯且空间上一致的组合。我们在多种实际输入上评估了Shape2Animal，展示了其稳健性和创意潜力。我们的Shape2Animal可以为视觉叙事、教育内容、数字艺术和互动媒体设计提供新的机会。我们的项目页面在此：https://shape2image.github.io

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Authors: Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, SuiYang Guang, Tuan Kiet Pham, Linh Chi Vo

First: 2026-05-06T15:52:35+00:00 · Latest: 2026-05-06T15:52:35+00:00

Abstract

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict \textit{cut cake} from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose \textbf{ScriptHOI}, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.
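The interval partial-label idea can be pictured as a penalty that is zero whenever a predicted probability stays inside its script-derived bounds. The sketch below is a hedged illustration only: the hinge form, the function name, and the way bounds are passed in are assumptions, not the loss defined in the paper.

```python
import numpy as np

def interval_penalty(probs, lower, upper):
    """Mean hinge penalty for probabilities that leave their [lower, upper] band.

    probs, lower, upper: arrays of the same shape, one entry per
    unannotated candidate interaction (hypothetical layout).
    """
    probs, lower, upper = (np.asarray(a, dtype=float) for a in (probs, lower, upper))
    below = np.maximum(lower - probs, 0.0)  # predicted too low
    above = np.maximum(probs - upper, 0.0)  # predicted too high
    return float(np.mean(below + above))
```

Unlike a closed-world negative label, which would push every unannotated candidate toward zero, this kind of penalty leaves the prediction free anywhere inside the band.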

Summary / 总结

Open-vocabulary human-object interaction (HOI) detection is commonly addressed by matching human-object features with text embeddings, but the resulting predictions are often dominated by object affordance and phrase-level co-occurrence. ScriptHOI proposes a structured framework that represents each interaction phrase as a soft scripted state transition, decomposed into body-role, contact, geometry, affordance, motion, and object-state slots. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.

ScriptHOI 通过将每个交互短语表示为软脚本状态转换来增强开放词汇量的人物体交互检测。它将交互短语分解为身体角色、接触、几何形状、功能、运动和物体状态等槽位，并使用视觉状态解析器和槽位匹配器来估计脚本覆盖和冲突。这种方法提高了对罕见和未见过的交互的识别，并减少了由于物体功能而导致的假阳性。实验结果表明了这些优势。

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Authors: Hongxu Chen, Yanghao Wang, Bowei Zhu, Hongxiang Li, Zhen Wang, Ziqi Jiang, Lin Li, Rui Liu, Long Chen

First: 2026-05-06T15:51:26+00:00 · Latest: 2026-05-06T15:51:26+00:00

Abstract

Recent flow matching (FM) methods improve the few-shot adaptation of vision-language models by modeling cross-modal alignment as a continuous multi-step flow. In this paper, we argue that existing FM methods are inherently constrained by incompatible geometric priors on pre-trained cross-modal features, resulting in suboptimal adaptation performance. We first analyze these methods from a polar decomposition perspective (i.e., radial and angular sub-manifolds). Under this new geometric view, we identify three overlooked limitations: 1) Angular dynamics distortion: The radial-angular coupling induces non-uniform speed on the angular sub-manifold, leading to regression training difficulty and extra truncation errors. 2) Radial dynamics neglect: Feature normalization discards modality confidence, failing to distinguish out-of-distribution and in-distribution data, and abandoning crucial radial dynamics. 3) Context-agnostic unconditional flow: Dataset-specific information loss during pre-trained cross-modal feature extraction remains unrecovered. To resolve these issues, we propose warped product flow matching (WP-FM), a unified Riemannian framework that reformulates alignment on a warped product manifold. Within this framework, we derive direct product flow matching (DP-FM) by introducing a constant-warping metric, which yields a decoupled cylindrical manifold (i.e., direct product manifold). DP-FM enables independent radial evolution and constant-speed angular geodesic transport, effectively eliminating angular dynamics distortion while preserving radial consistency. Meanwhile, we incorporate classifier-free guidance by conditioning the flow on the pre-trained VLMs' hidden states to inject missing dataset-specific information. Extensive results across 11 benchmarks demonstrate that DP-FM achieves a new state-of-the-art for multi-step few-shot adaptation.
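To make the radial/angular decoupling concrete, the sketch below interpolates between two feature vectors by moving the direction along a great-circle arc at constant angular speed while the norm evolves independently. This is only an illustration of the geometric idea: the linear schedule for the radius, the function name, and the exclusion of antipodal or zero vectors are assumptions made here, not details from the paper.

```python
import numpy as np

def decoupled_interpolate(x0, x1, t, eps=1e-8):
    """Interpolate direction and norm separately between x0 and x1.

    Direction: spherical linear interpolation (constant angular speed).
    Norm: linear interpolation (an assumption of this sketch).
    Assumes non-zero inputs that are not antipodal.
    """
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    r0, r1 = np.linalg.norm(x0), np.linalg.norm(x1)
    u0, u1 = x0 / r0, x1 / r1
    theta = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if theta < eps:
        u_t = u0  # directions already coincide
    else:
        u_t = (np.sin((1.0 - t) * theta) * u0 + np.sin(t * theta) * u1) / np.sin(theta)
    r_t = (1.0 - t) * r0 + t * r1
    return r_t * u_t
```

By contrast, a straight-line interpolation `(1 - t) * x0 + t * x1` couples the two components, so the angle swept per unit time is generally not constant.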

中文标题/摘要

标题：直接乘积流匹配：解耦径向和角动态以实现少样本适应

最近的流匹配（FM）方法通过将跨模态对齐建模为连续的多步流，提高了视觉-语言模型的少样本适应能力。在本文中，我们认为现有的FM方法本质上受到不兼容的几何先验的约束，导致适应性能不佳。我们首先从极分解的角度（即径向和角子流形）分析了这些方法。在这一新的几何视角下，我们发现了它们的三个未被注意到的局限性：1）角动态失真：径向-角耦合在角子流形上引入了非均匀速度，导致回归训练难度增加和额外的截断误差。2）径向动态忽略：特征归一化丢弃了模态置信度，无法区分分布外和分布内数据，放弃了关键的径向动态。3）无上下文的无条件流：在预训练跨模态特征提取过程中特定数据集的信息丢失无法恢复。为了解决这些问题，我们提出了变形乘积流匹配（WP-FM），这是一种统一的黎曼框架，重新定义了变形乘积流形上的对齐。在此框架内，通过引入恒定变形度量，我们推导出了直接乘积流匹配（DP-FM），从而获得了一个解耦的圆柱流形（即直接乘积流形）。DP-FM 使径向演化独立且角动态以恒定速度进行几何传输，有效消除了角动态失真并保持了径向一致性。同时，我们通过将流条件化在预训练的 VLMs 的隐藏状态上，注入缺失的特定数据集信息，实现了无分类器引导。在 11 个基准测试中的广泛结果表明，DP-FM 在多步少样本适应中达到了新的最佳水平。

Summary / 总结

This paper addresses the limitations of existing flow matching methods in few-shot adaptation of vision-language models, particularly the issues of angular dynamics distortion, neglect of radial dynamics, and context-agnostic flow. It proposes Direct Product Flow Matching (DP-FM), which decouples radial and angular dynamics on a direct product manifold, thereby improving adaptation performance. Experimental results across 11 benchmarks show that DP-FM outperforms previous methods.

本文针对现有流匹配方法在视觉-语言模型少样本适应中的局限性，特别是角动态失真、径向动态忽略和无上下文的无条件流问题，提出了直接乘积流匹配（DP-FM），通过在直接乘积流形上解耦径向和角动态，提升了适应性能。实验结果表明，DP-FM 在 11 个基准测试中优于先前的方法。

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

Authors: Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, Jie Tang

First: 2025-11-11T13:00:09+00:00 · Latest: 2026-05-06T15:47:34+00:00

Comments: 27 pages

Abstract

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI development that is inherently iterative and feedback-driven. We reformulate UI-to-code as an interactive visual optimization problem, where code generation is embedded in a closed-loop process of execution, visual inspection, and iterative refinement driven by rendered visual feedback. To address the non-differentiability of visual objectives and the noise of absolute visual evaluators, we propose Relative Visual Policy Optimization (RVPO), a preference-based reinforcement learning method that optimizes relative visual rankings among rendered candidates under execution feedback. We instantiate this paradigm in UI2Code^N, an open-source 9B model trained via continual pre-training, supervised fine-tuning, and reinforcement learning. Experiments demonstrate state-of-the-art performance on UI drafting, UI polishing, and UI editing benchmarks, even outperforming larger models, with performance consistently improving through iterative visual optimization. Our code and models are available at https://github.com/zai-org/UI2Code_N.

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

Authors: Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan

First: 2026-05-06T15:41:24+00:00 · Latest: 2026-05-06T15:41:24+00:00

Abstract

Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, i.e., errors on questions that require accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.

Summary / 总结

This study investigates the issue of relation hallucination in vision-language models under visual perturbations such as rotation and noise. The research demonstrates that even mild distortions can severely impair relational reasoning across different models and datasets. The authors evaluate prompt-based augmentation and preprocessing methods but find that these strategies only provide partial improvements and do not fully resolve the hallucination problem. The findings highlight the need for more robust and geometry-aware vision-language models.

研究探讨了视觉语言模型在旋转和噪声等视觉干扰下的关系幻觉问题。研究发现，即使是轻微的扭曲也会严重影响不同模型和数据集中的关系推理能力。作者评估了基于提示的增强和预处理方法，但发现这些方法只能提供部分改进，并不能完全解决幻觉问题。研究结果强调了需要更鲁棒、几何感知能力更强的视觉语言模型。

Prompt-Anchored Vision-Text Distillation for Lifelong Person Re-identification

Authors: Wen Wen, Hao Chen, Shiliang Zhang

Venue: CVPR 2026

First: 2026-05-06T15:23:13+00:00 · Latest: 2026-05-06T15:23:13+00:00

Comments: Accepted to CVPR 2026

Abstract

Lifelong person re-identification (LReID) aims to train a generalizable model with sequentially collected data. However, such models often suffer from semantic drift, limited adaptability, and catastrophic forgetting as new domains emerge. Existing exemplar-free approaches largely rely on visual-only distillation or parameter regularization, while overlooking the potential of auxiliary modalities, such as text, to preserve semantic stability and enable incremental plasticity. We observe that the frozen text encoder in pretrained vision-language models can serve as a stable semantic anchor across domains. To decouple the roles of vision and text, we propose Prompt-Anchored vision-text Distillation (PAD), an asymmetric vision-text framework for semantic alignment and cross-domain generalization. On the textual side, we distill prompts to preserve vision-text alignment under a fixed semantic space, acting as a global semantic reference rather than a dominant learning signal. On the visual side, an EMA-based teacher with an adaptive prompt pool enables domain-wise adaptation by allocating new slots while freezing past ones. Extensive experiments show that PAD substantially outperforms state-of-the-art methods across seen and unseen domains, achieving a strong balance between stability and plasticity. Project page is available at https://github.com/zu-zi/PAD.
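The EMA-based teacher mentioned above follows a standard pattern: after each student update, the teacher's parameters move a small step toward the student's. The sketch below shows only that generic update on a dictionary of parameters; the momentum value and the parameter layout are assumptions, and PAD's adaptive prompt pool is not modelled.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters after one exponential-moving-average step.

    teacher, student: dicts mapping parameter names to floats or arrays
    (a hypothetical layout chosen for this sketch).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }
```

A higher momentum makes the teacher change more slowly, which is the usual lever for trading stability against plasticity in this kind of distillation.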

Summary / 总结

The research aims to address the challenges of semantic drift and catastrophic forgetting in lifelong person re-identification by leveraging a stable semantic anchor from a frozen text encoder in pretrained vision-language models. The proposed Prompt-Anchored vision-text Distillation (PAD) framework uses an asymmetric vision-text approach to align semantic spaces and enable incremental plasticity. Experiments demonstrate that PAD significantly outperforms existing methods, balancing stability and adaptability across seen and unseen domains.

研究旨在通过利用预训练的视觉-语言模型中的文本编码器稳定性来解决终身行人重识别（LReID）中的语义漂移和灾难性遗忘问题。提出的Prompt-Anchored视觉-文本蒸馏（PAD）方法使用不对称的视觉-文本框架来对齐语义空间并实现增量塑性。实验表明，PAD在已见和未见领域中均优于现有方法，有效平衡了稳定性和适应性。

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

Authors: Kenneth J. K. Ong

First: 2026-04-30T14:50:48+00:00 · Latest: 2026-05-06T14:48:56+00:00

Abstract

As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' cooperative behavior using the Iterated Prisoner's Dilemma (IPD) as a test scenario. We examine whether exposure to images depicting behavioral concepts (kindness/helpfulness vs. aggressiveness/selfishness) and color-coded reward matrices alters VLM decision patterns. Experiments were conducted across multiple state-of-the-art VLMs. We further explore mitigation strategies including prompt modifications, Chain of Thought (CoT) reasoning, and visual token reduction. Results show that VLM behavior can be influenced by both image content and color cues, with varying susceptibility and mitigation effectiveness across models. These findings not only underscore the importance of robust evaluation frameworks for VLM deployment in visually rich and safety-critical environments, but also highlight how architectural and training differences among models may lead to distinct behavioral responses, an area worthy of further investigation.
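For readers unfamiliar with the test scenario, the sketch below scores an Iterated Prisoner's Dilemma game and computes a cooperation rate, the kind of quantity a priming study would compare across conditions. The payoff values (3/3, 0/5, 5/0, 1/1) are the textbook ones and are an assumption here; the abstract does not state which reward matrix was used.

```python
# Textbook payoffs (assumed): (my move, their move) -> (my points, their points)
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def play_ipd(moves_a, moves_b):
    """Total scores for two move sequences of 'C' (cooperate) / 'D' (defect)."""
    if len(moves_a) != len(moves_b):
        raise ValueError("both players need the same number of rounds")
    score_a = score_b = 0
    for a, b in zip(moves_a, moves_b):
        pa, pb = PAYOFFS[(a, b)]
        score_a += pa
        score_b += pb
    return score_a, score_b

def cooperation_rate(moves):
    """Fraction of rounds in which the player cooperated."""
    return moves.count("C") / len(moves)
```

Comparing `cooperation_rate` for the same model with and without a priming image is one simple way to quantify the behavioral shift the abstract describes.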

RLDX-1 Technical Report

Authors: Dongyoung Kim, Huiwon Jang, Myungkyu Koo, Suhyeok Jang, Taeyoung Kim, Beomjun Kim, Byungjun Yoon, Changsung Jang, Daewon Choi, Dongsu Han, Donguk Lee, Heeseung Kwon, Hojin Jeon, Jaehyun Kang, Jaekyoung Bae, Jihyuk Lee, Jimin Lee, John Won, Joonwoo Ahn, Junhyeong Park, Junyoung Sung, Kyungmin Lee, Minseong Han, Minsung Yoon, Sejune Joo, Seonil Son, Seungcheol Park, Seunggeun Cho, Seungjun Moon, Seungku Kim, Yonghoon Dong, Yongjin Cho, Youngchan Kim, Chang Hwan Kim, Dohyeon Kim, Heecheol Kim, Heewon Lee, Hensen Ahn, Hyungkyu Ryu, Hyunsoo Choi, Hyunsoo Shin, Jaeheon Jung, Jaewoo Kim, Jinwook Kim, Joochul Chang, Joonsoo Kim, Junghun Park, Jungwoo Park, Junho Cho, Junhyeok Park, Junwon Lee, Kangwook Lee, Kwanghoon Kim, Kyoungwhan Choe, Manoj Bhadu, Nayoung Oh, Sangjun Kim, Sangwoo Kim, Seunghoon Shim, Seunghyun Kim, Seungjun Lee, Seungyup Ka, Sungryol Yang, Wook Jung, Yashu Shukla, Yeonjae Lee, Yeonwoo Bae, Jinwoo Shin

First: 2026-05-05T01:40:15+00:00 · Latest: 2026-05-06T14:24:04+00:00

Comments: Project page: https://rlwrld.ai/rldx-1

Abstract

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g. motion awareness, long-term memory, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including data synthesis for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g. $\pi_{0.5}$ and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority in ALLEX humanoid tasks by achieving success rates of 86.8% while $\pi_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.

Summary / 总结

RLDX-1 is a general-purpose robotic policy for dexterous manipulation, built on the Multi-Stream Action Transformer (MSAT) to integrate various capabilities such as motion awareness and long-term memory. It also includes data synthesis, specialized learning procedures, and real-time inference optimizations. Empirical evaluations show that RLDX-1 outperforms recent VLAs like $\pi_{0.5}$ and GR00T N1.6 in both simulation and real-world tasks, particularly in ALLEX humanoid tasks where it achieves an 86.8% success rate compared to around 40% for the others.

RLDX-1 是一种基于多流动作变换器（MSAT）的通用机器人策略，用于灵巧操作，旨在整合多种能力如运动意识和长期记忆。它还包括数据合成、专门的学习程序和实时推理优化。实证评估显示，RLDX-1 在模拟和真实世界任务中均优于最近的 VLA 如 $\pi_{0.5}$ 和 GR00T N1.6，特别是在 ALLEX 人形任务中，其成功率达到了 86.8%，而其他方法仅为约 40%。

Valley3: Scaling Omni Foundation Models for E-commerce

Authors: Zeyu Chen, Guanghao Zhou, Qixiang Yin, Ziwang Zhao, Huanjin Yao, Pengjiu Xia, Min Yang, Cen Chen, Minghui Qiu

First: 2026-05-02T06:25:48+00:00 · Latest: 2026-05-06T13:38:25+00:00

Abstract

In this work, we present Valley3, an omni multimodal large language model (MLLM) developed for diverse global e-commerce tasks, with unified understanding and reasoning capabilities across text, images, video, and audio. A key feature of Valley3 is its native multilingual audio capability for e-commerce, developed by extending vision-language models to better support crucial audio-visual tasks, particularly in short-video scenarios. To achieve this, we carefully design a four-stage omni e-commerce continued pre-training pipeline, through which Valley3 progressively acquires audio understanding, cross-modal instruction-following, e-commerce domain knowledge, and long-context reasoning capabilities, ultimately evolving into an omni model for diverse e-commerce scenarios. Then, we further improve Valley3 through post-training to encourage long-chain reasoning with controllable reasoning modes, enabling one non-thinking mode and three distinct levels of thinking, thereby balancing inference efficiency in simple scenarios with deep reasoning for complex applications. Moreover, we equip Valley3 with agentic search capabilities to proactively invoke search tools and acquire task-relevant information for e-commerce deep research tasks. To comprehensively assess the capabilities of Valley3, we construct an omni e-commerce benchmark spanning 6 tasks. Experimental results show that Valley3 consistently outperforms strong baselines on our in-house and open-source e-commerce benchmarks, while remaining competitive on general-domain benchmarks.

FairEnc: A Fair Vision-Language Model with Fair Vision and Text Encoders for Glaucoma Detection

Authors: Mohamed Elhabebe, Ayman El-Baz, Qing Liu

First: 2026-05-06T13:18:34+00:00 · Latest: 2026-05-06T13:18:34+00:00

Abstract

Automated glaucoma detection is critical for preventing irreversible vision loss and reducing the burden on healthcare systems. However, ensuring fairness across diverse patient populations remains a significant challenge. In this paper, we propose FairEnc, a fair pretraining method for vision-language models (VLMs) that enables simultaneous debiasing across multiple sensitive attributes. FairEnc jointly mitigates biases in both textual and visual modalities with respect to multiple sensitive attributes, including race, gender, ethnicity, and language. Specifically, for the textual encoder, we leverage a large language model to generate synthetic clinical descriptions with varied sensitive attributes while preserving disease semantics, and employ a contrastive alignment objective to encourage demographic-invariant representations. For the visual encoder, we propose a dual-level fairness strategy that combines mutual information regularization to reduce statistical dependence between learned features and demographic groups, with multi-discriminator adversarial debiasing. Comprehensive experiments on the publicly available Harvard-FairVLMed dataset demonstrate that FairEnc effectively reduces demographic disparity as measured by DPD and DEOdds while achieving strong diagnostic performance under both zero-shot and linear probing evaluations. Additional experiments on the private FairFundus dataset show that FairEnc consistently preserves fairness advantages under cross-domain and cross-modality settings and maintains diagnostic performance within a competitive range. These results highlight FairEnc's ability to generalize fairness under distribution shifts, supporting its potential for more equitable deployment in real-world clinical settings. Our codebase and synthetic clinical notes are available at https://github.com/Mohamed-Elhabebe/FairEnc
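The two fairness metrics named above can be computed from binary predictions and group labels. The sketch below uses common definitions: DPD as the largest gap in positive-prediction rate between groups, and DEOdds as the larger of the true-positive-rate gap and the false-positive-rate gap. Whether the paper uses exactly these variants is an assumption, as are the function names; the code also assumes every group contains both positive and negative cases.

```python
import numpy as np

def dpd(y_pred, groups):
    """Demographic parity difference: max gap in positive-prediction rate."""
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return float(max(rates) - min(rates))

def deodds(y_true, y_pred, groups):
    """Equalized-odds difference: max of the TPR gap and the FPR gap."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    gaps = []
    for label in (1, 0):  # label 1 gives TPR, label 0 gives FPR
        rates = [
            y_pred[(groups == g) & (y_true == label)].mean()
            for g in np.unique(groups)
        ]
        gaps.append(max(rates) - min(rates))
    return float(max(gaps))
```

Both metrics are zero for a perfectly group-invariant classifier, so lower values indicate smaller demographic disparity.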

Summary / 总结

The research aims to develop a fair vision-language model (VLM) for glaucoma detection by addressing biases in both textual and visual modalities. FairEnc uses a large language model to generate synthetic clinical descriptions with varied sensitive attributes and employs a contrastive alignment objective to encourage demographic-invariant representations. For the visual encoder, a dual-level fairness strategy combining mutual information regularization and multi-discriminator adversarial debiasing is proposed. Experiments on the Harvard-FairVLMed dataset show that FairEnc reduces demographic disparity while maintaining strong diagnostic performance. Additional experiments on the FairFundus dataset confirm the model's fairness and diagnostic performance under different settings, highlighting its potential for equitable deployment in clinical settings.

FairEnc 是一种用于视觉-语言模型的公平预训练方法，旨在解决文本和视觉模态中的偏见问题，以实现青光眼检测。该方法利用大型语言模型生成多样化的临床描述，并使用对比对齐目标来鼓励去偏见的表示。对于视觉编码器，它采用互信息正则化和多判别器对抗去偏。实验表明，FairEnc 在减少人口统计差异的同时保持了强大的诊断性能，并且在不同场景下始终保留了公平优势。这项工作支持在临床环境中更公平地部署青光眼检测。

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Authors: Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du

First: 2026-05-06T13:01:52+00:00 · Latest: 2026-05-06T13:01:52+00:00

Abstract

Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
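The upper-bound analysis and the ANLS metric can both be sketched in a few lines. The code below counts a sample as correct if any per-frame answer matches the reference, and scores a single answer with the usual ANLS rule (one minus the normalized edit distance, zeroed at a 0.5 threshold). The exact-match normalization, the threshold, and all function names are assumptions made for illustration, not the paper's evaluation code.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls_score(pred, ref, threshold=0.5):
    """Normalized Levenshtein similarity, set to 0 beyond the threshold."""
    pred, ref = pred.strip().lower(), ref.strip().lower()
    if not pred and not ref:
        return 1.0
    nl = levenshtein(pred, ref) / max(len(pred), len(ref))
    return 1.0 - nl if nl < threshold else 0.0

def upper_bound_accuracy(samples):
    """samples: list of (per-frame answers, reference answer) pairs.

    A sample counts as correct if ANY frame's answer matches the reference.
    """
    hits = sum(
        any(ans.strip().lower() == ref.strip().lower() for ans in frame_answers)
        for frame_answers, ref in samples
    )
    return hits / len(samples)
```

A large gap between this any-frame upper bound and direct video inference is what points to evidence localization, rather than reasoning, as the bottleneck.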

中文标题/摘要

标题：VTAgent：具有证据意识的关键帧锚定的代理框架用于视频文本VQA

基于视频的文本视觉问答（Video TextVQA）旨在通过在视频中推理视觉文本内容来回答问题。尽管最近的视频LLMs在多模态视频理解方面表现出强大的能力，但它们在现有Video TextVQA基准上的表现仍然有限。为了更好地理解这一差距，我们通过逐帧问答进行上限分析，如果任何一帧给出正确答案则视为正确样本，这显著优于直接基于视频的推理，并揭示了显著的性能差距。结果表明，主要瓶颈在于关键问题相关证据的定位，而不是推理能力本身。基于这一洞察，我们提出了一种问题导向的代理框架，在回答之前明确锚定相关关键帧。该方法在无训练设置中有效运行，并且始终超越直接视频推理。通过附加的监督微调（SFT）和强化学习（RL），它在各基准测试上的准确率和ANLS分别平均提高了12.12和11.15，建立了新的最先进结果。我们的研究强调了明确关键帧锚定在推进Video TextVQA中的关键作用。代码将公开发布。

Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation

Authors: Jiamin Zheng, Jingwen Yu, Guangcheng Chen, Hong Zhang

First: 2026-04-20T14:35:31+00:00 · Latest: 2026-05-06T12:52:59+00:00

Comments: 9 pages, 8 figures

Abstract

Indoor robot navigation is often compromised by glass surfaces, which severely corrupt depth sensor measurements. While foundation models like Depth Anything 3 provide excellent geometric priors, they lack an absolute metric scale. We propose a training-free framework that leverages depth foundation models as a structural prior, employing a robust local RANSAC-based alignment to fuse it with raw sensor depth. This naturally avoids contamination from erroneous glass measurements and recovers an accurate metric scale. Furthermore, we introduce \textit{GlassRecon}, a novel RGB-D dataset with geometrically derived ground truth for glass regions. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art baselines, especially under severe sensor depth corruption. The dataset and related code will be released at https://github.com/jarvisyjw/GlassRecon.
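The alignment step can be illustrated with a small RANSAC fit of a scale and shift that map scale-free prior depth onto metric sensor depth while ignoring corrupted readings. This sketch fits one global affine model; the paper describes a local variant, and the affine form, tolerance, iteration count, and function name are all assumptions made here.

```python
import numpy as np

def ransac_scale_shift(prior, sensor, n_iters=200, inlier_tol=0.05, seed=0):
    """Robustly fit sensor ~ scale * prior + shift.

    prior:  relative depth from a foundation model (any shape).
    sensor: metric sensor depth with the same shape, possibly corrupted.
    Returns (scale, shift, boolean inlier mask over the flattened input).
    """
    prior = np.asarray(prior, dtype=float).ravel()
    sensor = np.asarray(sensor, dtype=float).ravel()
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_iters):
        i, j = rng.choice(prior.size, size=2, replace=False)
        if np.isclose(prior[i], prior[j]):
            continue  # degenerate pair, cannot determine a scale
        scale = (sensor[i] - sensor[j]) / (prior[i] - prior[j])
        shift = sensor[i] - scale * prior[i]
        inliers = np.abs(scale * prior + shift - sensor) < inlier_tol
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None:
        raise ValueError("no valid sample pair found")
    # Refit by least squares on the largest consensus set.
    design = np.stack([prior[best], np.ones(int(best.sum()))], axis=1)
    scale, shift = np.linalg.lstsq(design, sensor[best], rcond=None)[0]
    return float(scale), float(shift), best
```

Pixels rejected as outliers, which in this setting would include erroneous glass measurements, can then be filled with `scale * prior + shift`.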

Summary / 总结

The research aims to improve indoor robot navigation by addressing the issue of glass surfaces corrupting depth sensor measurements. The proposed method uses a depth foundation model as a structural prior and aligns it with raw sensor depth using a robust local RANSAC-based approach, which effectively mitigates the impact of erroneous glass measurements and recovers an accurate metric scale. Experiments show that the approach outperforms existing methods, particularly in scenarios with severe sensor depth corruption.

研究旨在通过解决玻璃表面干扰深度传感器测量的问题，提高室内机器人导航性能。提出的方法利用深度基础模型作为结构先验，并使用稳健的局部RANSAC对齐方法将其与原始传感器深度融合，有效避免了错误的玻璃测量干扰，恢复了准确的度量尺度。实验表明，该方法在严重传感器深度干扰的情况下优于现有方法。

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

Authors: Chih-Chung Liu, Zhiwei Lin, Yongtao Wang

First: 2026-05-05T07:44:03+00:00 · Latest: 2026-05-06T10:51:43+00:00

Abstract

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during the inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.
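Retrieval from a non-parametric memory bank is often implemented as a nearest-neighbour lookup over stored embeddings. The sketch below returns the top-k prototypes by cosine similarity; the similarity measure, the value of k, and the function name are assumptions for illustration, and the conversion of prototypes into sparse and dense priors is not shown.

```python
import numpy as np

def retrieve_prototypes(query, memory, k=3):
    """Return indices and cosine similarities of the k closest memory rows.

    query:  embedding of shape (d,).
    memory: prototype embeddings of shape (n, d), all non-zero.
    """
    query = np.asarray(query, dtype=float)
    memory = np.asarray(memory, dtype=float)
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q
    top = np.argsort(-sims)[:k]  # highest similarity first
    return top, sims[top]
```

Because the bank is non-parametric, adding a prototype for a new rare category only requires appending a row to `memory`, with no retraining.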

中文标题/摘要

标题：VL-SAM-v3：面向开放世界物体检测的记忆引导视觉先验

开放世界物体检测旨在定位和识别超出固定封闭标签空间的物体。它通常分为两类：开放词汇检测，假设测试时有一个预定义的类别列表；开放生成检测，需要在推理过程中生成候选类别。现有方法主要依赖粗略的文本语义和参数知识，这通常不足以为细粒度的外观变化、稀有类别和杂乱场景提供足够的视觉证据。在本文中，我们提出了一种名为VL-SAM-v3的统一框架，该框架通过检索导向的外部视觉记忆增强了开放世界检测。具体而言，一旦候选类别可用，VL-SAM-v3将从非参数记忆库中检索相关视觉原型，并将其转换为两个互补的视觉先验，即实例级空间锚定的稀疏先验和类感知局部上下文的密集先验。这些先验通过记忆引导提示精炼与原始检测提示集成，实现了一种共享的检索和精炼机制，支持开放词汇和开放生成推理。在LVIS上的大量零样本实验表明，VL-SAM-v3在开放词汇和开放生成推理下均能提高检测性能，特别是在稀有类别上表现尤为突出。此外，使用更强的开放词汇检测器（即SAM3）的实验验证了所提出的检索和精炼机制的普适性。

Summary / 总结

VL-SAM-v3 is a unified framework for open-world object detection that uses a non-parametric memory bank to retrieve visual prototypes and transform them into sparse and dense visual priors. These priors are integrated with original detection prompts to support both open-vocabulary and open-ended inference. Experiments on LVIS demonstrate that VL-SAM-v3 improves detection performance, especially for rare categories, and the retrieval-and-refinement mechanism is generalizable to stronger detectors.

VL-SAM-v3 是一个统一框架，通过非参数记忆库检索视觉原型并转化为稀疏和密集的视觉先验，这些先验与原始检测提示集成，支持开放词汇和开放生成推理。在 LVIS 上的零样本实验表明，VL-SAM-v3 改进了检测性能，特别是在稀有类别方面，并且检索和精炼机制对更强的检测器具有普适性。

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Authors: Miao Wang, Yuling Shi, Yijiang Li, Yeheng Chen, Xiaodong Gu, Bin Li, Bo Gao, Yaduan Ruan

First: 2026-05-06T10:32:23+00:00 · Latest: 2026-05-06T10:32:23+00:00

Abstract

Text-based role-playing models can imitate character styles, yet they often fail to reflect a scene's atmosphere and evolving tension, both essential for immersive applications such as Virtual Reality (VR) games and interactive narratives. We study video-grounded role-playing dialogue and introduce EBM-RL (Eye-Brain-Mouth Reinforcement Learning), a decoupled GRPO-based framework that explicitly separates observation ([perception]), reasoning ([think]), and utterance ([answer]). This structure promotes human-like sensory grounding by compelling the model to first attend to visual cues, then form internal interpretations, and finally generate context-appropriate dialogue. EBM-RL integrates four complementary rewards: (i) CLIP-based scene-text alignment to improve ambiance and emotion; (ii) a Perceptual-Cognitive reward that encourages [perception] and [think] processes that increase the likelihood of the reference response; (iii) answer accuracy to ensure faithfulness; and (iv) a dense format reward to enforce the desired structured output. Extensive experiments demonstrate that EBM-RL substantially outperforms text-only role-playing baselines and larger-scale vision-language models on our immersive role-playing benchmark, delivering simultaneous gains in visual-atmosphere consistency and character authenticity. Beyond the role-playing domain, EBM-RL also exhibits strong zero-shot generalization: without any additional fine-tuning, it consistently improves performance on out-of-domain VideoQA benchmarks. We additionally release an open-source dataset for video-grounded role-playing dialogue.

中文标题/摘要

标题：基于奖励分解的沉浸式视频角色扮演强化学习

基于文本的角色扮演模型可以模仿角色风格，但往往无法反映场景氛围和不断升级的紧张感，这两者对于沉浸式应用如虚拟现实（VR）游戏和互动叙事至关重要。我们研究了基于视频的角色扮演对话，并引入了EBM-RL（眼-脑-口强化学习）框架，这是一种解耦的GRPO框架，明确地将观察（感知）、推理（思考）和发言（回答）分离。这种结构通过促使模型首先关注视觉线索，然后形成内部解释，最后生成上下文相关对话，促进了类似人类的感觉接地。 EBM-RL 结合了四种互补的奖励：（i）基于CLIP的场景-文本对齐以提高氛围和情感；（ii）感知-认知奖励，鼓励增加参考回答可能性的感知和思考过程；（iii）答案准确性以确保忠实性；（iv）密集格式奖励以确保期望的结构输出。广泛的实验表明，EBM-RL 在我们的沉浸式角色扮演基准测试中显著优于仅基于文本的角色扮演基线和更大规模的视觉语言模型，同时在视觉氛围一致性和角色真实性方面取得进步。除了角色扮演领域，EBM-RL 还表现出强大的零样本泛化能力：无需任何额外微调，它在跨域视频问答基准测试中始终提高性能。我们还发布了基于视频的角色扮演对话的开源数据集。

Summary / 总结

The research aims to enhance immersive video role-playing by addressing the limitations of text-based models in capturing scene atmosphere and evolving tension. EBM-RL, a decoupled GRPO-based framework, is introduced to improve human-like dialogue generation. It uses four complementary rewards: CLIP-based scene-text alignment, perceptual-cognitive encouragement, answer accuracy, and dense format reward. Experiments show that EBM-RL outperforms text-only baselines and vision-language models, improving visual consistency and character authenticity. Beyond role-playing, it also generalizes well to VideoQA benchmarks without fine-tuning.

研究旨在通过解决基于文本的角色扮演模型在捕捉场景氛围和情节紧张感方面的局限性，来提升沉浸式视频角色扮演的效果。EBM-RL，一种解耦的基于GRPO的框架，被引入以改进类人对话生成。该框架使用四种互补奖励：基于CLIP的场景-文本对齐、感知认知鼓励、答案准确性以及密集格式奖励。实验表明，EBM-RL 在视觉一致性与角色真实性方面优于基于文本的基线模型和大规模视觉-语言模型。此外，它在未进行微调的情况下也能很好地泛化到视频问答基准上。

UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

Authors: Jiajin Guan, Haibo Mei, Bonan Zhang, Dan Liu, Yuanshuang Fu, Yue Zhang

First: 2025-08-15T04:06:40+00:00 · Latest: 2026-05-06T08:41:14+00:00

Comments: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: 10.1007/s11704-026-52082-z

Abstract

Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.

Summary / 总结

The research aims to enhance the performance of vision-language models (VLMs) in UAV-based aerial imagery tasks, which are characterized by high resolution and complex spatial semantics. The authors propose UAV-VL-R1, a lightweight VLM trained using a hybrid method combining supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL) with the GRPO algorithm. Experimental results show that UAV-VL-R1 outperforms the Qwen2-VL-2B-Instruct baseline by 48.17% in zero-shot accuracy and even surpasses a larger 72B-scale variant on multiple tasks. Ablation studies indicate that while SFT improves semantic alignment, GRPO-based RL enhances logical flexibility and robustness of inference.

研究旨在提高视觉语言模型（VLMs）在基于无人机的航空图像上的性能，由于高分辨率和复杂的空间语义，这具有挑战性。提出了UAV-VL-R1，结合了监督微调（SFT）和多阶段强化学习（RL）以及GRPO，以增强结构化的推理。实验表明，UAV-VL-R1在多个任务上的零样本准确率高于72B规模的变体，并且具有更好的实时部署能力。
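
For readers unfamiliar with GRPO, the group-relative part can be sketched as follows. This is a minimal illustration of the normalisation step usually associated with GRPO, in which each sampled response is scored relative to the other responses drawn for the same question. The reward values are made up, and the rule-guided rewards used for UAV-VL-R1 are not specified in the abstract.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each response's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses better than the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rule-based rewards for four responses to one question.
rewards = [1.0, 0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)
```

Since the advantages are centred on the group mean, no separate value network is needed to provide a baseline.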

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Lei Huang, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin

First: 2026-05-06T08:32:30+00:00 · Latest: 2026-05-06T08:32:30+00:00

Abstract

Although Large Vision-Language Models (LVLMs) have demonstrated remarkable performance on downstream tasks, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly rely on expensive manual annotation and training, or on decoding strategies that significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly enhanced when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-guided Visual Attention Steering (CAST), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern corresponding to caption queries to enhance LVLMs' visual perception capability. Specifically, we use probing techniques to identify attention heads that are highly sensitive to caption queries and estimate optimized steering directions for their outputs. This steering strengthens LVLMs' fine-grained visual perception capabilities, thereby effectively mitigating object hallucination. CAST reduces object hallucination by an average of 6.03% across five widely used LVLMs and five benchmarks including both discriminative and generative tasks, demonstrating state-of-the-art performance while adding little inference cost and preserving other foundational capabilities.

Summary / 总结

This work addresses the issue of object hallucination in Large Vision-Language Models (LVLMs) by proposing Caption-guided Visual Attention Steering (CAST). CAST leverages the enhanced attention to visual information when answering caption queries to mitigate hallucination. By identifying and steering attention heads sensitive to captions, CAST improves LVLMs' visual perception, reducing hallucination by an average of 6.03% across various models and benchmarks. This method is training-free and adds minimal inference cost while maintaining other foundational capabilities.

本文提出了一种名为Caption-guided Visual Attention Steering (CAST)的方法，以解决大型视觉-语言模型（LVLM）中的物体幻觉问题。CAST通过利用回答标题查询时对视觉信息的增强关注来减轻幻觉。通过识别并引导对标题查询敏感的注意力头，CAST提高了LVLM的视觉感知能力，使其在各种模型和基准测试中平均减少了6.03%的幻觉现象。该方法无需训练且几乎不增加推理成本，同时保留了其他基础能力。

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

Authors: Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei, Jun Wang

Venue: ICML 2026

First: 2026-05-02T09:39:42+00:00 · Latest: 2026-05-06T08:19:25+00:00

Comments: 27 pages, 5 figures, accepted at ICML 2026

Abstract

Visual perception in modern Vision-Language Models (VLMs) is constrained by a perceptual bandwidth bottleneck: a broad field of view preserves global context but sacrifices the fine-grained details required for complex reasoning. We argue that high-resolution visual reasoning is therefore not only semantic reasoning but also task-relevant evidence acquisition under limited perceptual bandwidth. Inspired by active vision and information foraging, we formalise this process as sequential Bayesian optimal experimental design (S-BOED), where an agent decides which visual evidence to acquire before answering. Since exact Bayesian inference is intractable in continuous gigapixel spaces, we derive a tractable coverage-resolution objective as a proxy for task-relevant information gain. We instantiate this framework with FOVEA, a training-free procedure that refines VLM crop proposals through evidence-oriented probing. Experiments on high-resolution benchmarks show consistent gains over direct and ReAct-style baselines, with particularly strong improvements in search-dominated remote-sensing settings.

Summary / 总结

The research addresses the perceptual bandwidth bottleneck in Vision-Language Models (VLMs) by proposing a method to enhance high-resolution visual reasoning. It formulates the process as sequential Bayesian optimal experimental design (S-BOED), where an agent decides which visual evidence to acquire before answering. The key finding is that the proposed FOVEA framework, which refines VLM crop proposals through evidence-oriented probing, achieves consistent gains over direct and ReAct-style baselines, especially in search-dominated remote-sensing settings.

研究针对视觉语言模型（VLMs）中的感知带宽瓶颈，提出了一种增强高分辨率视觉推理的方法。该方法将过程形式化为顺序贝叶斯最优实验设计（S-BOED），其中代理决定在回答前获取哪些视觉证据。关键发现是，通过证据导向的探查来细化VLM裁剪提议的FOVEA框架，相比直接方法和ReAct风格基线取得了稳定的提升，特别是在搜索主导的遥感设置中表现尤为突出。

Advancing Aesthetic Image Generation via Composition Transfer

Authors: Kai Zou, Zhiwei Zhao, Bin Liu, Nenghai Yu

Venue: International Journal of Computer Vision, 2026

First: 2026-05-06T07:56:16+00:00 · Latest: 2026-05-06T07:56:16+00:00

Abstract

Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.

Summary / 总结

The paper aims to improve the aesthetic quality of image generation by explicitly modeling composition. It introduces Composer, a framework that extracts composition-aware representations from a reference image and uses a conditional guidance module to control composition in pre-trained diffusion models. When no reference is provided, Composer leverages Large Vision-Language Models for theme-driven composition retrieval. The framework enhances aesthetic quality and allows for personalized composition control and transfer, offering users precision and flexibility in the creative process.

研究旨在通过明确建模构图来提高生成图像的美学质量。Composer 基于美学理论的框架，从参考图像中提取关键的构图感知表示，并使用定制的条件引导模块在预训练的扩散模型中控制构图。当没有提供参考时，Composer 利用大型视觉-语言模型进行主题驱动的构图检索。实验结果表明，Composer 提高了美学质量，并允许在文本到图像任务中实现精确和灵活的构图控制。

Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR

Authors: Yulong Zhang, Tianyi Liang, Xinyue Huang, Erfei Cui, Guoqing Wang, Xu Guo, Chenhui Li, Gongshen Liu

First: 2025-04-15T11:51:18+00:00 · Latest: 2026-05-06T07:49:45+00:00

Abstract

Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration. Code: https://github.com/Aslan-yulong/consensus-entropy.
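
The intuition that correct predictions converge while errors diverge can be illustrated with a small sketch. It does not reproduce the paper's Consensus Entropy formula. As a stand-in, it scores a set of OCR outputs by their mean pairwise string dissimilarity, computed with the standard library's `difflib.SequenceMatcher`. The function name and the example strings are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_disagreement(outputs):
    """Average dissimilarity over all pairs of model outputs (0.0 = full agreement)."""
    pairs = list(combinations(outputs, 2))
    return sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Hypothetical transcriptions of the same image region by three models.
agreeing = ["Invoice 2024-001", "Invoice 2024-001", "Invoice 2024-001"]
diverging = ["Invoice 2024-001", "lnvoice 2O24-OO1", "Invoice 2824-081"]

low = mean_pairwise_disagreement(agreeing)
high = mean_pairwise_disagreement(diverging)
```

A sample whose score exceeds some chosen threshold could then be flagged for review or routed to a stronger model, which is the kind of unsupervised quality control the abstract describes.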

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

Authors: Ke Xu

First: 2026-04-30T09:19:26+00:00 · Latest: 2026-05-06T07:17:48+00:00

Comments: 16 pages, 3 figures, 8 tables

Abstract

We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubric criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.

Summary / 总结

WaferSAGE is a framework for wafer defect visual question answering using small vision-language models. It addresses data scarcity in semiconductor manufacturing through a three-stage synthesis pipeline involving structured rubric generation. Starting with limited labeled wafer maps, the framework employs clustering-based cleaning, generates comprehensive defect descriptions, and converts them into structured evaluation rubrics. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. The dual assessment framework uses Bayesian optimization to align rule-based metrics with LLM-Judge scores, and through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO), the 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, approaching Gemini-3-Flash (7.149) while enabling on-premise deployment.

WaferSAGE 是一种使用小型视觉语言模型进行晶圆缺陷视觉问答的框架。它通过包含结构化评分表生成的三阶段合成管道来解决半导体制造中的数据稀缺问题。从有限的标注晶圆图开始，框架采用基于聚类的清洗，生成全面的缺陷描述，并将其转换为结构化的评分表。这些评分表指导 VQA 对的合成，确保覆盖缺陷类型识别、空间分布、形态和根本原因分析。双重评估框架使用贝叶斯优化来对齐基于规则的指标与LLM-裁判评分，通过基于课程的强化学习与组序列策略优化（GSPO），4B参数的Qwen3-VL模型实现了6.493的LLM-裁判评分，接近Gemini-3-Flash（7.149），同时支持本地部署。

UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings

Authors: Jiajun Qin, Yuan Pu, Zhuolun He, Seunggeun Kim, David Z. Pan, Bei Yu

First: 2025-05-17T03:53:11+00:00 · Latest: 2026-05-06T07:09:58+00:00

Abstract

Current vision-language models have been explored for multi-modal embedding tasks like information retrieval. However, they face significant challenges in real-world queries and targets involving diverse modality combinations, as existing approaches often fail to align all modality combinations within a unified embedding space during training, leading to degraded performance on rare modality patterns during inference. To address this fundamental limitation, we propose UniMoCo, a novel architecture featuring a modality-completion module that generates visual features from text, thereby ensuring modality completeness for both queries and targets. Additionally, UniMoCo incorporates a specialized training strategy that aligns embeddings from both original and modality-completed inputs, thus ensuring consistent and robust embeddings for diverse modality combinations. Comprehensive experiments demonstrate that UniMoCo outperforms previous methods while exhibiting consistent robustness across diverse settings. Furthermore, we identify and quantify the inherent bias in conventional approaches caused by imbalanced modality combinations in training data, showing that our modality-completion paradigm effectively mitigates this limitation. The code is available at https://github.com/HobbitQia/UniMoCo.

Summary / 总结

The paper addresses the challenge of aligning diverse modality combinations in vision-language models, proposing UniMoCo, which includes a modality-completion module to generate visual features from text and a specialized training strategy to ensure consistent embeddings. Experiments show that UniMoCo outperforms previous methods and demonstrates robust performance across various settings, while also mitigating the inherent bias in conventional approaches due to imbalanced training data.

论文针对视觉-语言模型在处理多样模态组合时的对齐难题，提出了UniMoCo架构，该架构包含一种模态完成模块以从文本生成视觉特征，并采用专门的训练策略以确保一致的嵌入。实验表明，UniMoCo在各种设置中均表现出色并超越了先前的方法。该方法还缓解了由于训练数据中模态组合不平衡而导致的传统模型中的固有偏差。

Reward-Guided Semantic Evolution for Test-time Adaptive Object Detection

Authors: Lihua Zhou, Mao Ye, Xiatian Zhu, Nianxin Li, Changyi Ma, Shuaifeng Li, Yitong Qin, Hongbin Liu, Jiebo Luo, Zhen Lei

First: 2026-05-06T06:17:41+00:00 · Latest: 2026-05-06T06:17:41+00:00

Abstract

Open-vocabulary object detection with vision-language models (VLMs) such as Grounding DINO suffers from performance degradation under test-time distribution shifts, primarily due to semantic misalignment between text embeddings and shifted visual embeddings of region proposals. While recent test-time adaptive object detection methods for VLM-based detectors either rely on costly backpropagation or bypass semantic misalignment via external memory, none directly and efficiently aligns text and vision in a training-free manner. To address this, we propose Reward-Guided Semantic Evolution (RGSE), a training-free framework that directly refines the text embeddings at test time. Inspired by evolutionary search, RGSE treats text embedding adaptation as a semantic search process: it perturbs text embeddings as candidate variants, evaluates them via cosine similarity with current and historical high-confidence visual proposals as a reward signal, and fuses them into a refined embedding through reward-weighted averaging. Without any backpropagation, RGSE achieves state-of-the-art performance across multiple detection benchmarks while adding minimal computational overhead. Our code will be open-sourced upon publication.

Summary / 总结

The paper addresses the issue of performance degradation in open-vocabulary object detection with vision-language models under test-time distribution shifts. It introduces Reward-Guided Semantic Evolution (RGSE), a training-free framework that refines text embeddings at test time by perturbing them and evaluating them based on cosine similarity with visual proposals. RGSE achieves state-of-the-art performance across multiple benchmarks with minimal computational overhead.

论文针对视觉语言模型在对象检测中由于文本和视觉嵌入之间的语义不匹配导致的性能下降问题，提出了一个无需训练的框架——奖励引导的语义进化（RGSE），该框架在测试时通过扰动文本嵌入并基于与高置信度视觉提案的余弦相似度进行评估来优化文本嵌入。RGSE在多个基准测试中实现了最先进的性能，同时增加了极少的计算开销。
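
The perturb, score, and fuse loop described in the RGSE abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: Gaussian perturbations, a softmax over rewards with a hypothetical temperature, and a single list of visual proposal embeddings standing in for both current and historical proposals.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_reward(emb, visual_embs):
    """Reward: average cosine similarity to the visual proposal embeddings."""
    return sum(cosine(emb, v) for v in visual_embs) / len(visual_embs)

def refine_text_embedding(text_emb, visual_embs, n_candidates=16,
                          sigma=0.1, temperature=0.05, seed=0):
    rng = random.Random(seed)
    # Candidate variants: the original embedding plus noisy copies of it.
    candidates = [list(text_emb)] + [
        [x + rng.gauss(0.0, sigma) for x in text_emb] for _ in range(n_candidates)
    ]
    rewards = [mean_reward(c, visual_embs) for c in candidates]
    # Softmax weights (shifted by the max reward for numerical stability).
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    # Reward-weighted average of the candidates; no gradients are involved.
    return [sum(w * c[i] for w, c in zip(weights, candidates)) / total
            for i in range(len(text_emb))]

# Toy 3-D embeddings (assumption): the text embedding is slightly misaligned.
text = [1.0, 0.0, 0.0]
visual = [[0.8, 0.6, 0.0], [0.7, 0.7, 0.1]]
refined = refine_text_embedding(text, visual)
unchanged = refine_text_embedding(text, visual, sigma=0.0)
```

With zero perturbation every candidate equals the original, so the embedding is returned unchanged. With noise, candidates that point towards the visual proposals dominate the average.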

Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

Authors: Jiahua Li, Zhanhe Zhang, Chenghao Xu, Zhe Xu, Kun Wei, Xu Yang, Cheng Deng

First: 2025-09-29T15:42:55+00:00 · Latest: 2026-05-06T06:07:38+00:00

Abstract

Long videos, characterized by temporal complexity and sparse task-relevant information, pose significant reasoning challenges for AI systems. Although existing Large Language Model (LLM)-based approaches have advanced long video understanding, they remain bottlenecked by task-agnostic, fixed-granularity perception pipelines and suffer from vision-language hallucinations. Inspired by human adaptive perception and active verification, we propose CogniGPT, a framework leveraging an interactive loop between a Multi-Granular Perception Agent (MPA) and an Active Verification Agent (AVA). Specifically, instead of predetermined heuristics, MPA adaptively determines the optimal perception granularity and strategy based on the evolving context, while AVA actively mines multi-perspective visual evidence to cross-verify key observations and eliminate hallucinations. This interaction allows CogniGPT to efficiently identify a minimal set of reliable task-related clues. Extensive experiments on EgoSchema, Video-MME, NExT-QA, and MovieChat demonstrate its superiority in accuracy and efficiency. Notably, on EgoSchema, it surpasses existing training-free methods using only 11.2 frames and achieves performance comparable to Gemini 1.5-Pro.

中文标题/摘要

标题：感知、验证和理解长视频：基于交互代理的多粒度感知与主动验证

长视频由于时间复杂性和稀疏的任务相关信息，给AI系统带来了显著的推理挑战。尽管现有的基于大型语言模型（LLM）的方法在长视频理解方面取得了进展，但它们仍然受限于任务无关的固定粒度感知管道，并且容易出现视觉-语言幻觉。受人类适应性感知和主动验证的启发，我们提出了一种框架，该框架利用多粒度感知代理（MPA）和主动验证代理（AVA）之间的交互循环。具体而言，MPA不是基于预设的启发式方法，而是根据不断变化的上下文自适应地确定最佳感知粒度和策略，而AVA则积极挖掘多视角的视觉证据以交叉验证关键观察并消除幻觉。这种交互使CogniGPT能够高效地识别出一组可靠的任务相关信息。在EgoSchema、Video-MME、NExT-QA和MovieChat上的广泛实验表明，它在准确性和效率方面具有优势。值得注意的是，在EgoSchema上，它仅使用11.2帧就超过了现有的无训练方法，并且性能与Gemini 1.5-Pro相当。

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Authors: Binh Long Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

First: 2026-05-06T05:23:41+00:00 · Latest: 2026-05-06T05:23:41+00:00

Comments: The International Conference on Pattern Recognition (ICPR) 2026

Abstract

We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.

Summary / 总结

Ilov3Splat introduces a novel framework for instance-level open-vocabulary 3D scene understanding using 3D Gaussian Splatting. It addresses limitations of previous methods by jointly optimizing scene geometry and semantic representations, leveraging multi-resolution hash embedding for efficient language-aligned CLIP feature encoding and contrastive loss for instance feature learning. Experiments show Ilov3Splat outperforms prior methods in object selection and instance segmentation, enabling accurate language-driven 3D scene understanding without category supervision.

Ilov3Splat 提出了一种基于 3D 高斯点云的新框架，用于实例级开放词汇的 3D 场景理解。该方法通过联合优化场景几何和语义表示，利用多分辨率哈希嵌入高效编码 CLIP 特征，并使用对比损失学习实例特征。实验表明，Ilov3Splat 在对象选择和实例分割方面优于先前的方法，能够实现无需类别监督的语言驱动 3D 场景理解。

Example-Based Object Detection

Authors: ZhiXin Sun

First: 2026-05-06T05:10:09+00:00 · Latest: 2026-05-06T05:10:09+00:00

Abstract

In recent years, object detection has achieved significant progress, especially in the field of open-vocabulary object detection. Unlike traditional methods that rely on predefined categories, open-vocabulary approaches can detect arbitrary objects based on human-provided prompts. With the advancement of prompt-based detection techniques, models such as SAM3 can even outperform some category-specific detectors trained on particular datasets without requiring additional training on those datasets. However, despite these advancements, false positives and false negatives still occur. In practical engineering applications, persistent misdetections or missed detections of the same object are unacceptable. Yet retraining the model every time such errors occur incurs substantial costs in terms of human effort, computational resources, and time. Therefore, how to leverage existing false positive and false negative samples to prevent such errors from recurring remains a highly challenging and urgent problem. To address this issue, we propose EBOD (Example-Based Object Detection), which integrates a prompt-based detector (SAM3) with robust feature matching modules (DINOv3 and LightGlue). The proposed framework effectively suppresses the repeated occurrence of false positives and false negatives by leveraging previous error examples, without requiring additional model retraining. Code is available at https://github.com/sunzx97/examples_based_object_detection.

Summary / 总结

The paper addresses the challenge of persistent misdetections and missed detections in open-vocabulary object detection by proposing EBOD (Example-Based Object Detection). EBOD integrates SAM3, a prompt-based detector, with DINOv3 and LightGlue for robust feature matching. The framework leverages previous error examples to suppress the repeated occurrence of false positives and false negatives without additional model retraining, thus reducing human effort and computational resources. The method effectively mitigates recurring errors in practical applications.

论文通过提出EBOD（基于示例的物体检测）来解决开放词汇物体检测中持续的误检和漏检问题。EBOD将基于提示的检测器SAM3与DINOv3和LightGlue的稳健特征匹配模块结合。该框架利用之前的错误示例来抑制重复出现的假阳性与假阴性，无需额外的模型重新训练，从而减少人力和计算资源的消耗。该方法在实际应用中有效缓解了反复出现的错误问题。

CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

Authors: Zhipeng Song, Yizhi Zhou, Xiangyu Kong, Jiulong Jiao, Xuezhou Ye, Chunqi Gao, Xueqing Shi, Yuhang Zhou, Heng Qi

First: 2026-05-06T04:51:28+00:00 · Latest: 2026-05-06T04:51:28+00:00

Abstract

Retrieval-Augmented Generation (RAG) depends on document ranking to provide useful evidence for generation, but conventional reranking methods mainly optimize query-document relevance rather than generation usefulness. A relevant document may still introduce noise, while a lower-ranked document may better reduce the generator's uncertainty. We propose CAR (Confidence-Aware Reranking), a query-guided, training-free, and plug-and-play reranking framework that uses generator confidence change as a document usefulness signal. CAR estimates confidence through the semantic consistency of multiple sampled answers under query-only and query-document conditions. Documents that significantly increase confidence are promoted, those that decrease confidence are demoted, and uncertain cases preserve the baseline order, while a query-level gate avoids unnecessary intervention on already confident queries. Experiments on four BEIR datasets show that CAR consistently improves NDCG@5 across sparse and dense retrievers, LLM-based and supervised rerankers, and four LLM backbones. Notably, CAR improves the YesNo reranker by 25.4 percent on average under Contriever retrieval, and its ranking gains strongly correlate with downstream generation F1 improvements, achieving Spearman rho = 0.964.
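
The promote, demote, and preserve logic in the CAR abstract can be sketched as below. The sketch assumes the sampled answers are already available as strings and uses exact-match agreement with the majority answer as a crude stand-in for semantic consistency. The `margin` and `gate` thresholds and all names are hypothetical.

```python
from collections import Counter

def consistency(answers):
    """Share of sampled answers that match the most common answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

def confidence_aware_rerank(ranked_docs, base_answers, answers_with_doc,
                            margin=0.2, gate=0.9):
    base = consistency(base_answers)
    if base >= gate:
        return list(ranked_docs)  # query-level gate: already confident, do nothing

    def tier(doc):
        delta = consistency(answers_with_doc[doc]) - base
        if delta >= margin:
            return 0  # promoted: the document raises confidence
        if delta <= -margin:
            return 2  # demoted: the document lowers confidence
        return 1      # uncertain: keep the baseline position

    # sorted() is stable, so the baseline order is preserved within each tier.
    return sorted(ranked_docs, key=tier)

# Hypothetical sampled answers for one query and three retrieved documents.
ranked = ["d1", "d2", "d3"]
query_only = ["paris", "lyon", "paris", "nice"]
with_doc = {
    "d1": ["lyon", "nice", "rome", "paris"],
    "d2": ["paris", "paris", "lyon", "nice"],
    "d3": ["paris", "paris", "paris", "paris"],
}
reranked = confidence_aware_rerank(ranked, query_only, with_doc)
gated = confidence_aware_rerank(ranked, ["paris"] * 4, with_doc)
```

Using tiers with a stable sort keeps the original retriever order wherever the confidence signal is inconclusive, which matches the abstract's description of uncertain cases.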

BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera

Authors: Junwoo Park, Jangho Lee, Sunho Lim

First: 2026-04-13T16:50:05+00:00 · Latest: 2026-05-06T02:51:51+00:00

Comments: Accepted to ICPR 2026

Abstract

Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at https://github.com/Leo-Park1214/Background-Embedding-Memory.git
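
As a rough illustration of re-scoring detection logits against a background prototype memory, the sketch below penalises a detection in proportion to how closely its embedding matches stored background prototypes. The abstract does not give the exact penalty, so the formula, the `strength` and `top_k` parameters, and the function names are assumptions rather than the method in the BEM repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rescore_logit(logit, embedding, prototypes, strength=2.0, top_k=2):
    """Subtract a rank-weighted background-similarity penalty from a logit.

    Hypothetical form: the top_k most similar prototypes are weighted by
    1 / rank, so the closest background match dominates the penalty.
    """
    sims = sorted(
        (max(0.0, cosine(embedding, p)) for p in prototypes),
        reverse=True,
    )[:top_k]
    if not sims:
        return logit
    weights = [1.0 / (rank + 1) for rank in range(len(sims))]
    penalty = sum(w * s for w, s in zip(weights, sims)) / sum(weights)
    return logit - strength * penalty
```

Under these assumptions, a detection whose embedding looks like background loses score, while one that is dissimilar to every prototype keeps its original logit, and no detector weights are touched.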

Summary / 总结

The paper addresses the performance degradation of pretrained detectors in real-world deployments caused by distribution gaps between training data and target environments. It introduces Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that attaches to pretrained detectors at inference, estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits to reduce false positives while maintaining recall. Across the YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance.

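The paper targets the performance degradation of pretrained detectors in real-world scenes and proposes Background Embedding Memory (BEM), a lightweight, training-free module that reduces false detections while maintaining recall by estimating clean background embeddings and re-scoring detection logits. Experimental results show that BEM effectively suppresses false detections in fixed-background surveillance and traffic-monitoring scenes without sacrificing real-time performance.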

Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

Authors: Yating Wang, Yaqi Zhao, Yongshun Gong, Yilong Yin, Haoliang Sun

First: 2026-05-06T02:38:59+00:00 · Latest: 2026-05-06T02:38:59+00:00

Comments: 15 pages, 4 figures. Preprint version

Abstract

Vision-language models such as CLIP achieve strong visual-textual alignment, but often suffer from overfitting and limited interpretability when adapted through continuous prompt learning. While discrete prompt optimization improves interpretability, it usually depends on large external models, leading to high computational costs and limited scalability. In this paper, we propose Interpretable Prompt Learning (IPL), a hybrid framework that alternates between discrete semantic token selection and continuous prompt optimization. Specifically, IPL formulates semantic token selection as an approximate submodular optimization problem, encouraging tokens that are both human-understandable and semantically diverse. It further adopts an alternating optimization strategy to integrate discrete token selection with continuous prompt tuning, improving interpretability while preserving adaptability to downstream tasks. Our framework is plug-and-play, allowing seamless integration with existing prompt learning methods. Extensive experiments on multiple benchmarks show that IPL consistently improves both interpretability and accuracy across five representative prompt learning methods, providing an effective and scalable extension to existing frameworks.
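
To make the idea of selecting tokens through approximate submodular optimization concrete, here is a minimal greedy sketch. It uses a facility-location objective, a standard monotone submodular function, as a stand-in; the abstract does not state which objective IPL uses, and the token names, similarity scores, and function names below are hypothetical.

```python
def marginal_gain(best_so_far, token_sims):
    """Increase in facility-location coverage if this token is added."""
    return sum(max(s - b, 0.0) for s, b in zip(token_sims, best_so_far))


def greedy_select(token_sims, budget):
    """Greedily pick up to `budget` tokens that best cover all targets.

    token_sims maps each candidate token to its similarities with a fixed
    list of target concepts (assumed to be precomputed elsewhere).
    """
    if not token_sims:
        return []
    n_targets = len(next(iter(token_sims.values())))
    best = [0.0] * n_targets          # current coverage of each target
    remaining = dict(token_sims)
    chosen = []
    while remaining and len(chosen) < budget:
        token = max(remaining, key=lambda t: marginal_gain(best, remaining[t]))
        if marginal_gain(best, remaining[token]) <= 0.0:
            break                     # nothing left adds coverage
        chosen.append(token)
        best = [max(b, s) for b, s in zip(best, remaining.pop(token))]
    return chosen
```

Because the gain of a token shrinks once a similar token has been chosen, the greedy rule favours diverse tokens: a near-synonym of an already selected token contributes almost nothing.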

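Chinese title/abstract

Title: Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

Vision-language models such as CLIP perform well at visual-textual alignment, but when adapted through continuous prompt learning they often suffer from overfitting and limited interpretability. Although discrete prompt optimization can improve interpretability, it usually depends on large external models, leading to high computational cost and limited scalability. In this paper, we propose Interpretable Prompt Learning (IPL), a hybrid framework that alternates between discrete semantic token selection and continuous prompt optimization. Specifically, IPL formulates semantic token selection as an approximate submodular optimization problem, encouraging tokens that are both easy for humans to understand and semantically diverse. It further adopts an alternating optimization strategy to integrate discrete token selection with continuous prompt tuning, improving interpretability while preserving adaptability to downstream tasks. Our framework is plug-and-play and can be integrated seamlessly into existing prompt learning methods. Extensive experiments on multiple benchmarks show that IPL consistently improves interpretability and accuracy across five representative prompt learning methods, providing an effective and scalable extension to existing frameworks.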

Summary / 总结

The paper addresses two limitations of adapting vision-language models such as CLIP: continuous prompt learning tends to overfit and offers limited interpretability, while discrete prompt optimization usually depends on large external models. It introduces Interpretable Prompt Learning (IPL), a plug-and-play framework that alternates between discrete semantic token selection, formulated as an approximate submodular optimization problem, and continuous prompt optimization. IPL favors human-understandable and semantically diverse tokens, and experiments on multiple benchmarks show that it improves both interpretability and accuracy across five representative prompt learning methods while preserving adaptability to downstream tasks.

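This paper addresses the interpretability and scalability problems that vision-language models such as CLIP face during continuous prompt learning. It proposes the Interpretable Prompt Learning (IPL) framework, which alternates between discrete semantic token selection and continuous prompt optimization. IPL enhances interpretability by selecting tokens that are both easy to understand and semantically diverse, and combines these tokens with continuous tuning, improving the performance of various prompt learning methods without significantly increasing computational cost.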

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer

Venue: ICML 2026

First: 2026-02-03T00:54:32+00:00 · Latest: 2026-05-05T23:46:52+00:00

Comments: Accepted by ICML 2026. 13 pages

Abstract

Despite rapid progress in autoregressive video diffusion, an emerging system-algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV cache budgets restrict the effective working memory, directly degrading long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic Aware Smoothing, producing low-magnitude, quantization-friendly residuals. It further introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that reduces quantization error while enabling a smooth quality-memory trade-off. Across LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end-to-end latency overhead while consistently outperforming existing baselines in generation quality. Code is available at: https://github.com/svg-project/Quant-VideoGen
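
The coarse-to-fine idea behind residual quantization can be shown with a toy sketch: each stage quantizes what the previous stages failed to capture, so adding stages trades memory for accuracy. The code below uses plain uniform 2-bit quantization on a list of floats; the function names, the min/max scaling, and the stage count are assumptions for illustration and are not taken from the QVG code base, which also applies Semantic Aware Smoothing beforehand.

```python
def quantize_uniform(values, bits=2):
    """Map values to integer codes on a uniform grid spanning [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate values from integer codes."""
    return [lo + c * scale for c in codes]


def progressive_residual_quantize(values, stages=3, bits=2):
    """Quantize values, then repeatedly quantize the remaining residual.

    Requires stages >= 1 and a non-empty input.
    """
    residual = list(values)
    encoded = []
    for _ in range(stages):
        codes, lo, scale = quantize_uniform(residual, bits)
        encoded.append((codes, lo, scale))
        approx = dequantize(codes, lo, scale)
        residual = [r - a for r, a in zip(residual, approx)]
    return encoded


def reconstruct(encoded):
    """Sum the dequantized contribution of every stage."""
    total = [0.0] * len(encoded[0][0])
    for codes, lo, scale in encoded:
        total = [t + x for t, x in zip(total, dequantize(codes, lo, scale))]
    return total
```

Each extra stage stores another set of 2-bit codes plus one offset and one scale, which is the memory side of the trade-off; decoding only the first few stages yields a coarser but cheaper reconstruction.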

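Chinese title/abstract

Title: Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Despite rapid progress in autoregressive video diffusion, an emerging system-algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly takes over GPU memory, often exceeding 30 GB, which prevents deployment on widely available hardware. More seriously, constrained KV cache budgets restrict the effective working memory, directly degrading long-term consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV cache quantization framework for autoregressive video diffusion models. QVG exploits video spatiotemporal redundancy through Semantic Aware Smoothing, producing low-magnitude, quantization-friendly residuals. It further introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that reduces quantization error while allowing a smooth trade-off between quality and memory. On the LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end-to-end latency overhead, while consistently outperforming existing baselines in generation quality. Code is available at: https://github.com/svg-project/Quant-VideoGen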

Summary / 总结

Quant VideoGen (QVG) addresses the KV-cache memory bottleneck in autoregressive video generation with a training-free quantization framework. It uses Semantic Aware Smoothing to produce low-magnitude, quantization-friendly residuals and Progressive Residual Quantization, a coarse-to-fine multi-stage scheme, to reduce quantization error. Across the LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG reduces KV-cache memory by up to 7.0 times with less than 4% end-to-end latency overhead and outperforms existing baselines in generation quality.

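Quant VideoGen (QVG) quantizes KV cache memory with a training-free method to resolve the memory bottleneck in autoregressive video generation. It uses Semantic Aware Smoothing to generate quantization-friendly residuals and adopts a progressively refined residual quantization scheme to reduce quantization error. On benchmarks including LongCat Video, HY WorldPlay, and Self Forcing, QVG achieves up to a 7.0 times reduction in KV cache memory with less than 4% end-to-end latency overhead, and outperforms existing baselines in generation quality.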
