Relational Visual Similarity
Authors: Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li
First: 2025-12-08T18:59:56+00:00 · Latest: 2025-12-08T18:59:56+00:00
Comments: Project page, data, and code: https://thaoshibe.github.io/relsim
Abstract
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
中文标题/摘要
标题:关系视觉相似性
人类不仅看到属性相似性,还看到关系相似性。苹果和桃子都是红颜色的水果,因此相似,但地球也与桃子相似:地壳、地幔和地核分别对应桃子的果皮、果肉和果核。认知科学家认为,人类能够感知和识别关系相似性可能是人类与其他物种的区别之一。然而,目前所有广泛使用的视觉相似性度量(例如LPIPS、CLIP、DINO)仅关注感知属性相似性,而未能捕捉人类感知的丰富且常常令人惊讶的关系相似性。我们如何超越图像的可见内容,捕捉其关系属性?我们如何将具有相同关系逻辑的图像在表示空间中拉近?为回答这些问题,我们首先将关系图像相似性表述为可测量的问题:当两幅图像中的视觉元素之间的内部关系或功能相同时,即使它们的视觉属性不同,它们就是关系相似的。然后,我们制作了一个包含114,000幅图像-描述对的匿名数据集,其中描述的是场景背后的关系逻辑,而不是表面内容。使用此数据集,我们微调了一个视觉-语言模型来测量图像之间的关系相似性。该模型是连接图像背后的关系结构而非其可见外观的第一步。我们的研究显示,尽管关系相似性在实际应用中有很多用途,但现有的图像相似性模型未能捕捉到它——揭示了视觉计算中的一个关键缺口。
Summary / 总结
The research aims to capture relational similarity in images, which humans perceive beyond simple attribute similarity. The authors formulate relational image similarity as a measurable problem and create a dataset of 114k image-caption pairs focusing on the underlying relational logic. They fine-tune a Vision-Language model to measure relational similarity, demonstrating that existing models fail to capture this type of similarity, highlighting a significant gap in visual computing.
研究旨在捕捉图像中超越简单属性相似性的关系相似性,这是人类感知的内容。作者提出了一种方法,使用包含114k图像-描述对的数据集,其中描述的是场景的关系逻辑而非表面内容。他们通过微调Vision-Language模型来测量关系相似性,表明现有的模型如LPIPS、CLIP和DINO无法捕捉这种相似性,揭示了视觉计算中的一个重要缺口。研究显示了关系相似性在实际应用中的重要性,并指出了需要开发新模型来解决这一缺口。
TV2TV: A Unified Framework for Interleaved Language and Video Generation
Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
First: 2025-12-04T18:59:09+00:00 · Latest: 2025-12-08T18:58:00+00:00
Abstract
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
中文标题/摘要
标题:TV2TV:一种统一的交错语言和视频生成框架
视频生成模型正在迅速发展,但仍可能在需要大量语义分支或反复进行下一步该发生什么的高级推理的复杂视频输出上遇到困难。在本文中,我们介绍了一类新的全能视频-文本模型,这些模型结合了最近语言模型推理进展的想法,以解决这一挑战。具体来说,我们提出了TV2TV,这是一种统一的生成建模框架,将视频生成分解为交错的语言和视频生成过程。TV2TV 使用混合的变换器(MoT)架构联合学习语言建模(下一个标记预测)和视频流匹配(下一个帧预测)。在推理时,TV2TV 决定何时在生成文本和视频帧之间交替,使模型能够在“用词思考”后续内容之前“用像素行动”来生成帧。这种设计将决定下一步该发生什么的责任大部分转移到了语言建模塔上,从而提高了生成视频的视觉质量和提示对齐。它还使用户能够在过程中任何一点通过文本干预来实现精细的可控性。在对视频游戏数据的受控实验中,TV2TV 在视觉质量和可控性方面都表现出显著的改进。TV2TV 还扩展到自然视频,正如我们通过使用视觉-语言模型(VLMs)交替自然语言动作描述来增强体育视频所展示的那样。在该语料库上训练 TV2TV 产生了强大的视觉质量和提示对齐,展示了模型能够推理和生成复杂的现实世界动作序列的能力。这些结果共同突显了 TV2TV 作为视频生成中具有开放文本推理和控制潜力的有希望的一步。
Summary / 总结
TV2TV is a unified generative modeling framework that addresses the challenge of complex video outputs by integrating language and video generation. It uses a Mixture-of-Transformers architecture to jointly learn language modeling and video flow matching, allowing the model to alternate between generating text and video frames. Experiments show that TV2TV improves visual quality and controllability in video generation, especially in video game data, and scales to natural videos with interleaved language descriptions.
TV2TV 是一个统一的生成模型框架,将语言和视频生成过程结合起来,以应对复杂视频输出的挑战。它使用混合变换器架构同时学习语言建模和视频流匹配,允许模型在“以像素行动”之前“用语言思考”。实验表明,TV2TV 在视觉质量和可控性方面有所改进,特别是在视频游戏数据中,并且可以扩展到带有交错语言描述的自然视频,展示了模型在生成复杂现实动作序列方面的推理和生成能力。
SAVE: Sparse Autoencoder-Driven Visual Information Enhancement for Mitigating Object Hallucination
Authors: Sangha Park, Seungryong Yoo, Jisoo Mok, Sungroh Yoon
Venue: WACV 2026
First: 2025-12-08T17:20:07+00:00 · Latest: 2025-12-08T17:20:07+00:00
Comments: WACV 2026
Abstract
Although Multimodal Large Language Models (MLLMs) have advanced substantially, they remain vulnerable to object hallucination caused by language priors and visual information loss. To address this, we propose SAVE (Sparse Autoencoder-Driven Visual Information Enhancement), a framework that mitigates hallucination by steering the model along Sparse Autoencoder (SAE) latent features. A binary object-presence question-answering probe identifies the SAE features most indicative of the model's visual information processing, referred to as visual understanding features. Steering the model along these identified features reinforces grounded visual understanding and effectively reduces hallucination. With its simple design, SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10%p improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. Extensive evaluations across multiple models and layers confirm the robustness and generalizability of our approach. Further analysis reveals that steering along visual understanding features suppresses the generation of uncertain object tokens and increases attention to image tokens, mitigating hallucination. Code is released at https://github.com/wiarae/SAVE.
中文标题/摘要
标题:SAVE:稀疏自编码驱动的视觉信息增强以减轻对象幻觉
尽管多模态大型语言模型(MLLMs)取得了显著进步,但它们仍然容易受到语言先验和视觉信息丢失导致的对象幻觉的影响。为了解决这一问题,我们提出了SAVE(Sparse Autoencoder-Driven Visual Information Enhancement)框架,通过引导模型沿着稀疏自编码(SAE)潜在特征来减轻幻觉。二元对象存在性问题回答探针识别出最能反映模型视觉信息处理的SAE特征,称为视觉理解特征。沿着这些识别出的特征引导模型增强了基于视觉的理解,有效地减少了幻觉。凭借其简单的设计,SAVE在标准基准上优于最先进的无训练方法,实现了CHAIR_S 10%p的改进,并在POPE和MMHal-Bench上保持一致的收益。广泛的评估表明,我们的方法具有鲁棒性和通用性。进一步的分析表明,沿着视觉理解特征引导可以抑制不确定对象标记的生成,并增加对图像标记的关注,从而减轻幻觉。代码发布在https://github.com/wiarae/SAVE。
Summary / 总结
SAVE is a framework designed to mitigate object hallucination in Multimodal Large Language Models by steering the model along Sparse Autoencoder latent features. It uses a binary object-presence probe to identify visual understanding features, which are then used to reinforce grounded visual understanding. SAVE outperforms state-of-the-art training-free methods on standard benchmarks, achieving a 10 percentage point improvement in CHAIR_S and consistent gains on POPE and MMHal-Bench. The approach is robust and generalizable across multiple models and layers, and further analysis shows it suppresses uncertain object tokens and increases attention to image tokens, effectively reducing hallucination.
SAVE 是一个框架,通过引导模型沿着稀疏自编码器的潜在特征来减轻多模态大型语言模型中的物体幻觉问题。它使用二元物体存在探针来识别视觉理解特征,然后利用这些特征强化视觉理解。SAVE 在标准基准测试中优于最先进的无训练方法,分别在 CHAIR_S 上实现了 10% 的改进,并在 POPE 和 MMHal-Bench 上取得一致的提升。该方法在多个模型和层面上具有鲁棒性和通用性,并且进一步分析表明,它抑制了不确定的物体标记并增加了对图像标记的关注,从而有效减少了幻觉。
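The steering step described in the SAVE entry above can be illustrated with a minimal sketch. It assumes steering means adding scaled SAE decoder directions to a hidden state; the function name `steer_hidden_state`, the strength `alpha`, the unit normalization, and the random `decoder` matrix are all hypothetical and are not taken from the paper or its released code.

```python
import numpy as np

def steer_hidden_state(hidden, decoder, feature_ids, alpha=1.0):
    """Add alpha times the summed, unit-normalized SAE decoder directions.

    hidden:      (d_model,) activation vector (hypothetical input)
    decoder:     (n_features, d_model) SAE decoder matrix (hypothetical)
    feature_ids: indices of the selected "visual understanding" features
    """
    directions = decoder[feature_ids]                        # (k, d_model)
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    directions = directions / np.clip(norms, 1e-12, None)    # unit length per feature
    return hidden + alpha * directions.sum(axis=0)

rng = np.random.default_rng(0)
decoder = rng.normal(size=(8, 4))   # stand-in for a trained SAE decoder
hidden = np.zeros(4)
steered = steer_hidden_state(hidden, decoder, feature_ids=[1, 5], alpha=2.0)
```

With `alpha=0.0` the hidden state is returned unchanged, which makes the strength parameter a convenient on/off switch when comparing steered and unsteered outputs.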
Depth-Wise Activation Steering for Honest Language Models
Authors: Gracjan Góral, Marysia Winkels, Steven Basart
First: 2025-12-08T16:03:06+00:00 · Latest: 2025-12-08T16:03:06+00:00
Comments: See https://github.com/marysia/gaussian-activation-steering for code and experiments.
Abstract
Large language models sometimes assert falsehoods despite internally representing the correct answer. These are failures of honesty rather than accuracy, and they undermine auditability and safety. Existing approaches largely optimize factual correctness or depend on retraining and brittle single-layer edits, offering limited leverage over truthful reporting. We present a training-free activation steering method that weights steering strength across network depth using a Gaussian schedule. On the MASK benchmark, which separates honesty from knowledge, we evaluate seven models spanning the LLaMA, Qwen, and Mistral families and find that Gaussian scheduling improves honesty over no-steering and single-layer baselines in six of seven models. Equal-budget ablations on LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct show the Gaussian schedule outperforms random, uniform, and box-filter depth allocations, indicating that how intervention is distributed across depth materially affects outcomes beyond total strength. The method is simple, model-agnostic, requires no finetuning, and provides a low-cost control knob for eliciting truthful reporting from models' existing capabilities.
中文标题/摘要
标题:深度激活导向以促进诚实的语言模型
大型语言模型有时会断言错误的事实,尽管它们内部持有正确的答案,这是由于诚实性而非准确性的失败,这削弱了审计性和安全性。现有方法主要优化事实正确性或依赖重新训练和脆弱的单层编辑,提供的杠杆作用有限。我们提出了一种无需训练的激活导向方法,该方法使用高斯时间表在网络深度上加权导向强度。在分离诚实性和知识的MASK基准上,我们评估了涵盖LLaMA、Qwen和Mistral家族的七种模型,并发现高斯调度在六种模型中优于无导向和单层基线。在LLaMA-3.1-8B-Instruct和Qwen-2.5-7B-Instruct上的等预算消融实验表明,高斯时间表优于随机、均匀和箱形滤波深度分配,表明干预如何在深度上分布对结果的影响远超总强度。该方法简单、模型无关、无需微调,并提供了一个低成本的控制旋钮,以激发模型现有能力中的诚实报告。
Summary / 总结
The paper addresses the issue of large language models asserting falsehoods despite knowing the correct answer, a problem of honesty rather than accuracy. It introduces a training-free method called Gaussian scheduling to steer activations across network depth, improving honesty in six out of seven models tested on the MASK benchmark. This method is simple, model-agnostic, and does not require fine-tuning, offering a low-cost way to enhance truthful reporting from models.
论文针对大型语言模型在知道正确答案的情况下仍会断言错误的情况,即诚实性失败的问题。提出了一种名为高斯调度的无训练方法,用于在网络深度上引导激活,测试结果显示,在七个模型中,有六个模型在MASK基准测试中的诚实性得到了提升。该方法简单、模型无关,不需要微调,提供了一种低成本的方法来增强语言模型的诚实报告能力。
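To make the idea of a Gaussian depth schedule under an equal total budget concrete, here is a minimal sketch. The parameter names `center`, `sigma`, and `budget`, and the choice to normalize the weights so they sum to the budget, are assumptions for illustration; the abstract does not specify how the authors parameterize or normalize the schedule.

```python
import numpy as np

def gaussian_depth_weights(num_layers, center, sigma, budget=1.0):
    """Per-layer steering strengths shaped as a Gaussian over depth, summing to `budget`."""
    layers = np.arange(num_layers)
    weights = np.exp(-0.5 * ((layers - center) / sigma) ** 2)
    return budget * weights / weights.sum()

def uniform_depth_weights(num_layers, budget=1.0):
    """Equal-budget baseline: the same strength at every layer."""
    return np.full(num_layers, budget / num_layers)

# Hypothetical 32-layer model with the schedule centered at mid-depth.
gauss = gaussian_depth_weights(num_layers=32, center=16, sigma=4.0, budget=8.0)
uniform = uniform_depth_weights(num_layers=32, budget=8.0)
```

Both schedules spend the same total strength, so any difference in downstream behavior would come from where in the network that strength is applied.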
Optimization-Guided Diffusion for Interactive Scene Generation
Authors: Shiaho Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li
First: 2025-12-08T15:56:18+00:00 · Latest: 2025-12-08T15:56:18+00:00
Abstract
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events, which are essential for this task, are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate 5× more near-collision frames with a time-to-collision under three seconds while maintaining overall scene realism.
中文标题/摘要
标题:优化引导扩散在交互场景生成中的应用
真实的多样化多智能体驾驶场景对于评估自动驾驶车辆至关重要,但这些场景中的关键安全事件在驾驶数据集中很少见且代表性不足。数据驱动的场景生成提供了一种低成本的替代方案,通过从现有的驾驶日志中合成复杂的交通行为。然而,现有的模型往往缺乏可控性,或者生成的样本违反了物理或社会约束,限制了它们的实用性。我们提出了OMEGA,这是一种优化引导、无需训练的框架,在基于扩散的场景生成模型采样过程中确保结构一致性并增强交互意识。OMEGA 通过约束优化重新锚定每个逆向扩散步骤,引导生成物理上合理且行为上一致的轨迹。在此基础上,我们将自我攻击者交互建模为分布空间中的博弈论优化,近似纳什均衡以生成现实且关键的安全对抗场景。在nuPlan和Waymo上的实验表明,OMEGA 提高了生成的真实感、一致性和可控性,使自由探索能力下物理和行为上有效的场景比例从32.35%提高到72.27%,可控性生成下从11%提高到80%。此外,我们的方法还可以生成5倍的接近碰撞帧,碰撞时间在3秒以内,同时保持整体场景的真实感。
Summary / 总结
The research aims to generate realistic and diverse multi-agent driving scenes for evaluating autonomous vehicles, addressing the scarcity of safety-critical events in driving datasets. OMEGA, an optimization-guided framework, enforces structural consistency and interaction awareness during diffusion-based sampling, improving generation realism, consistency, and controllability. Experiments show a significant increase in the ratio of physically and behaviorally valid scenes, from 32.35% to 72.27% for free exploration and from 11% to 80% for controllability-focused generation. Additionally, it can generate five times more near-collision frames with a time-to-collision under three seconds while maintaining scene realism.
研究旨在生成现实且多样的多智能体驾驶场景,以评估自动驾驶车辆,解决驾驶数据集中安全关键事件稀缺的问题。OMEGA 是一个优化引导框架,在基于扩散的采样过程中强制执行结构一致性和交互意识,提高生成的真实性和一致性。实验表明,自由探索场景中物理和行为上有效的场景比例从32.35%提高到72.27%,而专注于可控性生成的比例从11%提高到80%。此外,还能生成五倍数量的时间到碰撞小于三秒的接近碰撞帧,同时保持场景的真实性。
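The near-collision criterion mentioned in the OMEGA entry above (time-to-collision under three seconds) can be sketched for the simplest case. This assumes a one-dimensional, constant-velocity car-following setup; the function names are hypothetical, and the paper's exact time-to-collision definition is not given in the abstract.

```python
def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """Constant-velocity time-to-collision in seconds; None when the gap is not closing."""
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return None
    return gap_m / closing_speed

def is_near_collision(ttc_s, threshold_s=3.0):
    """Flag a frame whose time-to-collision is below the threshold."""
    return ttc_s is not None and ttc_s < threshold_s

# A follower at 15 m/s closing a 20 m gap on a leader at 5 m/s.
ttc = time_to_collision(gap_m=20.0, follower_speed_mps=15.0, leader_speed_mps=5.0)
flagged = is_near_collision(ttc)
```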
DCoAR: Deep Concept Injection into Unified Autoregressive Models for Personalized Text-to-Image Generation
Authors: Fangtai Wu, Mushui Liu, Weijie He, Zhao Wang, Yunlong Yu
First: 2025-08-10T13:36:39+00:00 · Latest: 2025-12-08T15:39:55+00:00
Abstract
The unified autoregressive (AR) model excels at multimodal understanding and generation. However, its full potential in the domain of customized image generation has yet to be fully realized. Existing customization approaches for unified AR models face a fundamental dilemma: adaptation-based methods suffer from overfitting and scalability bottlenecks, while concept-injection paradigms are constrained by a shallow injection strategy that leads to poor visual fidelity and impaired re-contextualization. To address this, we propose DCoAR, a novel deep concept injection framework that maintains a completely frozen pre-trained model. DCoAR deeply integrates new concepts through a Layer-wise Multimodal Context Learning (LMCL) strategy, which is stabilized by a multi-faceted regularization scheme: a Dual Prior Preservation (DPP) loss to mitigate semantic drift and a Context-Aware Self-Regularization (CASR) loss to enhance re-contextualization. The framework also enables training-free subject customization in user-provided styles. Experiments demonstrate that DCoAR significantly outperforms previous injection-based methods and achieves performance competitive with adaptation-based approaches while requiring substantially fewer trainable parameters. Code: https://github.com/KZF-kzf/CoAR
中文标题/摘要
标题:DCoAR:统一自回归模型中的深度概念注入以实现个性化文本到图像生成
统一自回归(AR)模型在多模态理解和生成方面表现出色。然而,它在定制图像生成领域的全部潜力尚未完全实现。现有的统一AR模型定制方法面临一个根本性的困境:基于适应的方法受到过拟合和可扩展性瓶颈的困扰,而概念注入范式则受限于浅层注入策略,导致视觉保真度差和重新语境化能力受损。为了解决这一问题,我们提出了一种名为DCoAR的新型深度概念注入框架,该框架保持了预训练模型的完全冻结状态。DCoAR通过逐层多模态上下文学习(LMCL)策略深度整合新概念,并通过多方面的正则化方案进行稳定:双重先验保留(DPP)损失以减轻语义漂移,以及上下文感知自我正则化(CASR)损失以增强重新语境化。该框架还允许在用户提供的样式中进行无训练的主体定制。实验表明,DCoAR在性能上显著优于之前的注入方法,并且在需要的可训练参数数量大幅减少的情况下,其性能与基于适应的方法相当。代码:https://github.com/KZF-kzf/CoAR
Summary / 总结
DCoAR is a novel deep concept injection framework designed to enhance personalized text-to-image generation using unified autoregressive models. It addresses the limitations of existing methods by integrating new concepts through a Layer-wise Multimodal Context Learning strategy and employing a multi-faceted regularization scheme. Experimental results show that DCoAR outperforms previous injection-based methods and achieves performance comparable to adaptation-based approaches with fewer trainable parameters.
DCoAR 是一种新颖的深度概念注入框架,旨在使用统一的自回归模型增强个性化文本到图像生成。它通过分层多模态上下文学习策略和多方面的正则化方案(包括语义漂移缓解的双重先验保持损失和增强上下文重构的上下文感知自我正则化损失)来解决现有方法的局限性。实验结果表明,DCoAR 在性能上优于之前的注入基方法,并且在可训练参数数量较少的情况下达到了与适应基方法相当的表现。
All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs
Authors: Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Xianfeng Tang, Hui Liu, Yuyin Zhou, Lianghua He
First: 2025-12-08T14:16:01+00:00 · Latest: 2025-12-08T14:16:01+00:00
Abstract
Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), compared to more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) make use of visual tokens at greater depth than weaker models (e.g., LLaVA-1.5). Based on our findings, we show that simple random pruning in deep layers efficiently balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Using DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
中文标题/摘要
标题:仅需随机视觉标记?揭秘VLLMs中的标记剪枝
视觉大型语言模型(VLLMs)由于依赖数百个视觉标记来表示图像而产生高昂的计算成本。虽然标记剪枝为加速推理提供了有希望的解决方案,但本文却发现一个关键观察结果:在较深的层(例如,超过第20层)中,现有的无训练剪枝方法的表现与随机剪枝相当。我们假设这种退化是由“消失的标记信息”引起的,其中视觉标记随网络深度增加而逐渐失去其重要性。为了验证这一假设,我们通过测量移除标记后模型输出概率的变化来量化一个标记的信息量。使用此提出的度量标准,我们对各层中视觉标记的信息量进行分析,揭示了三个关键发现:(1)随着层的加深,视觉标记的信息逐渐变得均匀,并在某个中间层消失,我们称之为“信息边界”,在此之后视觉标记变得多余;(2)这一边界的位置不是固定的,对于视觉密集型任务(如光学字符识别(OCR)),它比更通用的任务(如视觉问答(VQA))更深;(3)这一边界也与模型容量密切相关,更强的VLLMs(例如,Qwen2.5-VL)使用更深的视觉标记,而较弱的模型(例如,LLaVA-1.5)则使用较浅的视觉标记。基于我们的发现,我们展示了在较深的层中简单的随机剪枝可以高效地平衡性能和效率。此外,集成随机剪枝可以一致地增强现有方法。使用DivPrune结合随机剪枝实现了最先进的结果,保持了96.9%的Qwen-2.5-VL-7B性能的同时剪枝了50%的视觉标记。代码将在https://github.com/YahongWang1/Information-Horizon公开。
Summary / 总结
This paper investigates the effectiveness of token pruning in Vision Large Language Models (VLLMs) and finds that in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. The authors hypothesize that this is due to 'vanishing token information,' where visual tokens lose their salience with increasing depth. Key findings include the existence of an 'information horizon' where token information becomes uniform and eventually vanishes, the position of which varies with task type and model capacity, and the efficiency of random pruning in deep layers. Integrating random pruning with existing methods enhances performance: DivPrune combined with random pruning achieves state-of-the-art results, retaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens.
该论文研究了视觉大型语言模型(VLLMs)中令牌剪枝的有效性,发现现有的无训练剪枝方法在较深的层中与随机剪枝效果相当。作者假设这是由于“令牌信息消失”现象,即视觉令牌随深度增加而失去其显著性。他们提出了一种度量令牌信息的方法,并发现视觉令牌信息在“信息水平”之后变得均匀并最终消失,在较深的层中变得多余。这一水平随着任务复杂性和模型容量的变化而变化。研究表明,在较深的层中进行简单的随机剪枝可以高效地平衡性能和效率,并且将随机剪枝整合到现有方法中可以增强其效果,实现96.9%的Qwen-2.5-VL-7B性能,同时剪枝50%的视觉令牌。
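The two ingredients described in the entry above, a removal-based information score and uniform random pruning, can be sketched as follows. The total variation distance used to measure the change in output probabilities is an assumption (the abstract does not name the distance), and `toy_model_probs` is a purely illustrative stand-in for a real VLLM forward pass.

```python
import numpy as np

def token_information(model_probs, tokens, index):
    """Change in output probabilities when one token is removed (total variation distance)."""
    full = model_probs(tokens)
    ablated = model_probs(np.delete(tokens, index, axis=0))
    return 0.5 * np.abs(full - ablated).sum()

def random_prune(num_tokens, keep_ratio, rng):
    """Indices of visual tokens kept after uniform random pruning, in original order."""
    keep = max(1, int(round(num_tokens * keep_ratio)))
    return np.sort(rng.choice(num_tokens, size=keep, replace=False))

def toy_model_probs(tokens):
    """Stand-in for a model: softmax over the mean token vector (illustrative only)."""
    logits = tokens.mean(axis=0)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 5))   # 6 hypothetical visual tokens, 5-way output
scores = [token_information(toy_model_probs, tokens, i) for i in range(len(tokens))]
kept = random_prune(num_tokens=len(tokens), keep_ratio=0.5, rng=rng)
```

If the scores are close to uniform across tokens, no selection rule has much to exploit, which is the intuition behind random pruning being competitive past the information horizon.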
Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
Authors: Kassoum Sanogo, Renzo Ardiccioni
First: 2025-12-08T13:58:46+00:00 · Latest: 2025-12-08T13:58:46+00:00
Comments: 24 pages, 3 figures, 2 tables. Training-free self-correction framework for vision-language models. Code and implementation details will be released at: https://github.com/kassoumsanogo1/self-correcting-vlm-re-Attention.git
Abstract
Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. Validation is currently limited to Qwen2.5-VL-7B [23]; we plan to extend it across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
中文标题/摘要
标题:迈向更可靠的人工智能:减少视觉语言模型的幻觉
视觉语言模型(VLMs)经常生成与图像内容相符但不正确的幻觉内容。我们提出了一种无需训练的自我纠正框架,使VLMs能够通过不确定性引导的视觉重新关注逐步细化响应。我们的方法结合了多维度不确定性量化(标记熵、注意力分散、语义一致性、断言置信度)与注意力引导的对未充分探索区域的裁剪。完全使用冻结的预训练VLMs,我们的框架不需要梯度更新。我们在POPE和MMHAL BENCH基准上使用Qwen2.5-VL-7B [23]架构验证了我们的方法。实验结果表明,与基线相比,我们的方法将幻觉率降低了9.8个百分点,同时在对抗性分割上提高了4.7个百分点的物体存在准确性。此外,定性分析证实,不确定性引导的重新关注成功地将修正与视觉证据联系起来,而标准解码则失败。我们将在Qwen2.5-VL-7B [23]上验证我们的方法,并计划在未来版本中扩展到多种架构。我们发布了代码和方法,以促进未来在可信赖的多模态系统中的研究。
Summary / 总结
This paper addresses the issue of hallucinations in vision-language models (VLMs) by proposing a training-free self-correction framework. The method uses uncertainty quantification and attention-guided cropping to iteratively refine responses. Experiments on POPE and MMHAL BENCH benchmarks show a 9.8 percentage point reduction in hallucination rates and a 4.7 point improvement in object existence accuracy on adversarial splits. Qualitative analysis confirms the effectiveness of uncertainty-guided re-attention in grounding corrections in visual evidence. The framework operates without gradient updates and is implemented using the Qwen2.5-VL-7B architecture.
论文提出了一种无需训练的自我纠正框架,以解决视觉语言模型(VLM)中的幻觉问题。该框架通过不确定性引导的视觉重新关注,结合多维度不确定性量化和注意力引导的裁剪,使VLM能够迭代地细化其响应。该方法在POPE和MMHAL BENCH基准上使用Qwen2.5-VL-7B架构验证,显著减少了9.8个百分点的幻觉率,并在对抗性分割上提高了4.7个百分点的对象存在准确性。
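Of the four uncertainty signals listed in the entry above, token entropy is the most standard and can be sketched directly. The trigger function `needs_reattention` and its threshold are hypothetical; the abstract does not say how the signals are combined or thresholded.

```python
import numpy as np

def token_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each next-token distribution; probs is (steps, vocab)."""
    probs = np.asarray(probs, dtype=float)
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def needs_reattention(probs, threshold):
    """Hypothetical trigger: revisit the image when mean token entropy exceeds a threshold."""
    return bool(token_entropy(probs).mean() > threshold)

confident = [[0.97, 0.01, 0.01, 0.01]]   # peaked distribution, low entropy
uncertain = [[0.25, 0.25, 0.25, 0.25]]   # uniform distribution, entropy log(4)
```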
Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Authors: Samuele Dell'Erba, Andrew D. Bagdanov
First: 2025-11-25T20:20:21+00:00 · Latest: 2025-12-08T12:40:52+00:00
Comments: 11 pages, 7 figures, technical report (preprint)
Abstract
Diffusion models have established the state of the art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize its cosine similarity with the input text prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.
中文标题/摘要
标题:基于优化反演的无训练扩散先验文本到图像生成
扩散模型已经在文本到图像生成中确立了最先进的地位,但它们的表现往往依赖于一个扩散先验网络将文本嵌入转换为视觉流形,以便更容易地解码。这些先验网络计算成本高,需要在大规模数据集上进行大量训练。在本文中,我们通过采用基于优化反演(OVI)的方法,一种无需训练和无需数据的替代方案,来挑战所有训练先验的必要性,以替代先验的需要。OVI 从随机伪标记初始化一个潜在的视觉表示,并迭代优化以最大化与输入文本提示嵌入的余弦相似度。我们还提出了两种新的约束条件,基于马氏距离的和最近邻损失,以规范OVI的优化过程,使其朝向真实图像的分布。我们在Kandinsky 2.2上的实验表明,OVI 可以作为传统先验的替代方案。更重要的是,我们的分析揭示了当前评价基准(如T2I-CompBench++)中的一个关键缺陷,其中仅使用文本嵌入作为先验就能获得令人惊讶的高分数,尽管感知质量较低。我们的约束OVI方法在视觉保真度上优于这一基线,最近邻方法尤其有效,其定量得分与或优于最先进的数据高效先验,表明该想法值得进一步研究。代码将在接受后公开。
Summary / 总结
This work proposes a training-free method for text-to-image generation using Optimization-based Visual Inversion (OVI) to replace the need for a diffusion prior network. OVI initializes a latent visual representation and iteratively optimizes it to match the input text embedding. The authors introduce two constraints, a Mahalanobis-based loss and a Nearest-Neighbor loss, to regularize the optimization toward realistic images. Experiments on Kandinsky 2.2 show that OVI can serve as an alternative to traditional priors, with the constrained variants reaching quantitative scores comparable to or higher than the state-of-the-art data-efficient prior. The study also reveals that current evaluation benchmarks such as T2I-CompBench++ may not accurately reflect the quality of generated images, as using the text embedding alone as a prior achieves high scores despite lower perceptual quality. The Nearest-Neighbor constraint is particularly effective in improving visual fidelity over this baseline.
本文提出了一种无需训练的方法,使用基于优化的视觉反转(OVI)来替代扩散先验网络,用于文本到图像的生成。OVI从随机伪标记开始初始化一个潜在的视觉表示,并迭代优化以匹配输入的文本嵌入。作者引入了两种约束来改进优化过程。实验表明,OVI可以在视觉保真度上达到或超过最先进的数据高效先验的水平,显示出该方法的潜力。然而,研究还揭示了当前的评估基准可能无法准确反映生成图像的质量,因为仅使用文本嵌入即可获得高分。OVI中的最近邻方法特别有效,能够显著提高视觉保真度。
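The core OVI loop described above, starting from a random vector and iteratively increasing its cosine similarity to a text embedding, can be sketched with plain gradient ascent. This is a toy under stated assumptions: the embedding is a random 16-dimensional vector, the iterate is projected back to the unit sphere after each step (a choice made for this sketch), and the Mahalanobis and Nearest-Neighbor constraints from the paper are omitted.

```python
import numpy as np

def optimize_visual_embedding(text_emb, steps=200, lr=0.1, seed=0):
    """Gradient ascent on cos(v, text_emb), starting from a random unit vector v."""
    rng = np.random.default_rng(seed)
    t = text_emb / np.linalg.norm(text_emb)
    v = rng.normal(size=t.shape)
    v /= np.linalg.norm(v)
    start = float(v @ t)
    for _ in range(steps):
        cos = v @ t
        grad = t - cos * v        # gradient of cos(v, t) evaluated at a unit-norm v
        v = v + lr * grad
        v /= np.linalg.norm(v)    # project back to the unit sphere
    return v, start, float(v @ t)

text_emb = np.random.default_rng(1).normal(size=16)   # stand-in text embedding
v, start_sim, final_sim = optimize_visual_embedding(text_emb)
```

Unconstrained, this loop simply converges to the text embedding direction, which mirrors the abstract's observation that the text embedding alone is a surprisingly strong baseline and motivates the additional regularizers.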
SJD++: Improved Speculative Jacobi Decoding for Training-free Acceleration of Discrete Auto-regressive Text-to-Image Generation
Authors: Yao Teng, Zhihuan Jiang, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, Xihui Liu
First: 2025-12-08T12:36:43+00:00 · Latest: 2025-12-08T12:36:43+00:00
Abstract
Large autoregressive models can generate high-quality, high-resolution images but suffer from slow generation speed, because these models require hundreds to thousands of sequential forward passes for next-token prediction during inference. To accelerate autoregressive text-to-image generation, we propose Speculative Jacobi Decoding++ (SJD++), a training-free probabilistic parallel decoding algorithm. Unlike traditional next-token prediction, SJD++ performs multi-token prediction in each forward pass, drastically reducing generation steps. Specifically, it integrates the iterative multi-token prediction mechanism from Jacobi decoding with the probabilistic drafting-and-verification mechanism from speculative sampling. More importantly, for further acceleration, SJD++ reuses high-confidence draft tokens after each verification phase instead of resampling them all. We conduct extensive experiments on several representative autoregressive text-to-image generation models and demonstrate that SJD++ achieves 2× to 3× inference latency reduction and 2× to 7× step compression, while preserving visual quality with no observable degradation.
中文标题/摘要
标题:SJD++:改进的推测雅可比解码以实现无训练加速离散自回归文本到图像生成
大型自回归模型可以生成高质量、高分辨率的图像,但生成速度较慢,因为这些模型在推理过程中需要数百到数千次顺序前向传递来进行下一个标记预测。为了加速自回归文本到图像生成,我们提出了一种无训练概率并行解码算法——推测雅可比解码++(SJD++)。与传统的下一个标记预测不同,SJD++在每次前向传递中执行多标记预测,大幅减少了生成步骤。具体而言,它结合了雅可比解码的迭代多标记预测机制和推测采样的概率草稿和验证机制。更重要的是,为了进一步加速,SJD++在每次验证阶段后重用高置信度草稿标记,而不是重新采样它们。我们在几个代表性自回归文本到图像生成模型上进行了广泛的实验,并证明SJD++实现了2至3倍的推理延迟减少和2至7倍的步骤压缩,同时保持了视觉质量,没有可观察到的退化。
Summary / 总结
The research aims to accelerate autoregressive text-to-image generation by proposing SJD++, a training-free probabilistic parallel decoding algorithm. SJD++ integrates the iterative multi-token prediction of Jacobi decoding with the probabilistic drafting-and-verification of speculative sampling, and reuses high-confidence draft tokens after each verification phase, to reduce the number of inference steps. Experiments show that SJD++ achieves a 2× to 3× reduction in inference latency and 2× to 7× step compression with no observable degradation in visual quality.
研究旨在通过解决大型模型推理速度慢的问题来加速自回归文本到图像生成。SJD++是一种无需训练的方法,每次前向传递中进行多令牌预测,减少生成步骤的数量。实验表明,SJD++可以实现2到3倍的推理延迟减少和2到7倍的步骤压缩,同时保持视觉质量不下降。
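As background for the entry above, the following toy shows plain greedy Jacobi decoding: every draft position is refined in parallel until the draft stops changing, and the fixed point matches ordinary sequential decoding. It is not SJD++ itself; the probabilistic verification and the reuse of high-confidence draft tokens are omitted, and `next_token` is a hypothetical deterministic stand-in for a model.

```python
def next_token(context):
    """Toy deterministic 'model': the next token depends only on the last token."""
    return (context[-1] * 3 + 1) % 7

def sequential_decode(prefix, length):
    """Ordinary next-token decoding: one model call per generated token."""
    tokens = list(prefix)
    for _ in range(length):
        tokens.append(next_token(tokens))
    return tokens[len(prefix):]

def jacobi_decode(prefix, length, init_token=0):
    """Refine all draft positions in parallel until the draft stops changing."""
    draft = [init_token] * length
    iterations = 0
    while True:
        iterations += 1
        # In a real model, all positions would be scored in a single forward pass.
        updated = [next_token(list(prefix) + draft[:i]) for i in range(length)]
        if updated == draft:
            return draft, iterations
        draft = updated

reference = sequential_decode([2], 6)
decoded, num_iterations = jacobi_decode([2], 6)
```

In this toy each iteration fixes only one more position, so it demonstrates the fixed-point property rather than any speedup; gains in practice depend on several draft tokens being accepted per pass.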
Event-Customized Image Generation
Authors: Zhen Wang, Yilei Jiang, Dong Zheng, Jun Xiao, Long Chen
Venue: ICML 2025
First: 2024-10-03T13:41:58+00:00 · Latest: 2025-12-08T12:28:57+00:00
Abstract
Customized Image Generation, generating customized images with user-specified concepts, has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works further explored the customization of action and interaction beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effects are limited by insufficient "exactly same" reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the "event" as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image to the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.
中文标题/摘要
标题:事件定制图像生成
定制图像生成,根据用户指定的概念生成定制图像,由于其创新性和新颖性引起了广泛关注。在主体定制方面取得了显著进展后,一些先驱工作进一步探索了动作和交互的定制,超越了实体(即人类、动物和物体)的外观。然而,这些方法仅关注基本的动作和两个实体之间的交互,其效果受限于缺乏“完全相同”的参考图像。为了将定制图像生成扩展到更复杂的场景,以适应一般现实世界的应用,我们提出了一项新任务:事件定制图像生成。给定一张参考图像,我们将“事件”定义为场景中不同实体之间所有具体的行为、姿态、关系或交互。该任务旨在准确捕捉复杂的事件,并生成包含各种目标实体的定制图像。为了解决这一任务,我们提出了一种新的无需训练的事件定制方法:FreeEvent。具体来说,FreeEvent 引入了两条额外路径,与一般的去噪扩散过程并行:1) 实体切换路径:它应用交叉注意力指导和调节以生成目标实体。2) 事件转移路径:它将参考图像的空间特征和自我注意力图注入到目标图像中以生成事件。为了进一步促进这一新任务,我们收集了两个评估基准:SWiG-Event 和 Real-Event。广泛的实验和消融实验已经证明了 FreeEvent 的有效性。
Summary / 总结
The research focuses on extending customized image generation to more complex scenes by introducing a new task called event-customized image generation. The method, FreeEvent, introduces two paths during the diffusion denoising process: one for entity switching and another for event transferring. The effectiveness of FreeEvent is demonstrated through extensive experiments and ablations on newly collected benchmarks, SWiG-Event and Real-Event.
研究旨在通过提出事件定制化图像生成,将定制化图像生成扩展到更复杂的场景中,基于单张参考图像生成具有多种目标实体的图像。方法FreeEvent引入了两个路径:实体切换路径用于生成目标实体,事件转移路径用于生成事件。实验表明FreeEvent能够准确捕捉复杂事件并生成定制化图像。
Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
Authors: Haidong Kang, Jun Du, Lihong Lin
First: 2025-12-08T10:52:55+00:00 · Latest: 2025-12-08T10:52:55+00:00
Abstract
Mixed-Precision Quantization (MPQ) liberates Deep Neural Networks (DNNs) from the Out-Of-Memory (OOM) bottleneck and has garnered increasing research attention. However, conventional methods either search via costly differentiable optimization, which is neither efficient nor flexible, or learn a quantized DNN from a proxy (e.g., HAWQ) manually designed by human experts, which is labor-intensive and requires extensive expert knowledge. Can we design a proxy without involving any human experts and training? In this paper, we provide an affirmative answer by proposing a novel Large Language Models (LLMs)-driven Training-free Automatic Proxy (dubbed TAP) discovery framework, which reforms the design paradigm of MPQ by utilizing LLMs to automatically find superior TAP tailored for MPQ. In addition, to bridge the gap between black-box LLMs and the tough MPQ task, we propose simple Direct Policy Optimization (DPO) based reinforcement learning to enhance LLMs' reasoning by optimizing prompts, which can construct a positive feedback loop between the LLM and the MPQ task, enabling LLMs to generate better TAP in the next evolution. Extensive experiments on mainstream benchmarks demonstrate that TAP achieves state-of-the-art performance. Finally, we truly believe that our TAP will significantly contribute to the MPQ community by providing a new perspective on LLM-driven design algorithms.
中文标题/摘要
标题:革新混合精度量化:通过大型语言模型实现无需训练的自动代理发现
混合精度量化(MPQ)使深度神经网络(DNNs)摆脱了内存不足(OOM)的瓶颈,引起了越来越多的研究关注。然而,传统方法要么从昂贵的可微优化中搜索,这既不高效也不灵活,要么从人类专家手工设计的代理(例如HAWQ)中学习量化DNN,这既耗时又需要大量专家知识。我们能否设计一个无需任何人类专家和训练的代理?在本文中,我们通过提出一种新颖的大型语言模型(LLMs)驱动的无需训练的自动代理(简称TAP)发现框架,给出了肯定的答案,该框架通过利用LLMs来寻找适用于MPQ的优质TAP,实现了MPQ设计范式的革新。此外,为了弥合黑盒LLMs与MPQ任务之间的差距,我们巧妙地提出了基于直接策略优化(DPO)的强化学习方法,通过优化提示来增强LLMs的推理能力,从而在LLM与MPQ任务之间建立正反馈循环,使LLMs能够在下一次进化中生成更好的TAP。在主流基准上的广泛实验表明,TAP达到了最先进的性能。最后,我们坚信,我们的TAP将通过提供LLM驱动设计算法的新视角,显著贡献于MPQ社区。
Summary / 总结
This paper addresses the challenge of designing Mixed-Precision Quantization (MPQ) proxies without human intervention or training. It introduces a novel framework called TAP, which uses Large Language Models (LLMs) to automatically discover superior proxies. To enhance LLMs' reasoning for the MPQ task, the paper proposes Direct Policy Optimization (DPO) based reinforcement learning. Experiments show that TAP outperforms existing methods on mainstream benchmarks, offering a new approach to MPQ design through LLMs.
本文解决了无需人工干预或训练来设计混合精度量化(MPQ)代理的问题。它提出了一种名为TAP的新框架,利用大型语言模型(LLMs)自动发现更优的代理。为了增强LLMs的推理能力,论文提出了基于直接策略优化(DPO)的强化学习方法。实验表明,TAP在主流基准上优于现有方法,为MPQ设计提供了新的思路。
Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation
Authors: Qiming Huang, Hao Ai, Jianbo Jiao
First: 2025-12-08T10:00:36+00:00 · Latest: 2025-12-08T10:00:36+00:00
Comments: Accepted to WACV2026
Abstract
Benefiting from the inductive biases learned from large-scale datasets, open-vocabulary semantic segmentation (OVSS) leverages the power of vision-language models, such as CLIP, to achieve remarkable progress without requiring task-specific training. However, due to CLIP's pre-training nature on image-text pairs, it tends to focus on global semantic alignment, resulting in suboptimal performance when associating fine-grained visual regions with text. This leads to noisy and inconsistent predictions, particularly in local areas. We attribute this to a dispersed bias stemming from its contrastive training paradigm, which is difficult to alleviate using CLIP features alone. To address this, we propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image. Specifically, we construct a region adjacency graph (RAG) based on low-level features (e.g., colour and texture) to capture local structural relationships and use it to refine CLIP features by enhancing local discrimination. Extensive experiments show that our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
中文标题/摘要
标题:基于区域邻接图的结构感知特征校正以实现无需训练的开放词汇语义分割
得益于大规模数据集学习到的归纳偏置,开放词汇语义分割(OVSS)利用视觉语言模型(如CLIP)的力量取得了显著进展,无需特定任务的训练。然而,由于CLIP在图像-文本对上的预训练性质,它倾向于关注全局语义对齐,导致在将细粒度视觉区域与文本关联时表现不佳。这导致了局部区域的嘈杂和不一致的预测。我们将其归因于其对比训练范式带来的分散偏置,仅使用CLIP特征难以缓解。为了解决这一问题,我们提出了一种结构感知特征校正方法,该方法结合了直接从图像中提取的实例特定先验。具体而言,我们基于低级特征(如颜色和纹理)构建区域邻接图(RAG)以捕获局部结构关系,并使用它来通过增强局部区分性来校正CLIP特征。大量实验表明,我们的方法有效地抑制了分割噪声,提高了区域一致性,并在多个开放词汇分割基准上取得了优异性能。
Summary / 总结
The paper addresses the issue of noisy and inconsistent predictions in open-vocabulary semantic segmentation (OVSS) due to CLIP's focus on global semantic alignment. It proposes a structure-aware feature rectification method using a region adjacency graph (RAG) to capture local structural relationships and refine CLIP features. Experiments show that this approach effectively suppresses segmentation noise and improves region-level consistency, achieving strong performance on multiple benchmarks.
论文针对CLIP侧重全局语义对齐导致的开放词汇语义分割中的噪声和不一致预测问题,提出了一种基于区域邻接图(RAG)的结构感知特征校正方法,通过增强局部区分性来精炼CLIP特征。实验表明,该方法有效抑制了分割噪声,提高了区域一致性,并在多个开放词汇分割基准上取得了优异表现。
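The region adjacency graph idea can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's method: regions are given as an integer label map, the low-level cue is a single mean-colour scalar per region, and the colour-weighted averaging in `refine` (with its temperature `tau`) is a hypothetical stand-in for the actual rectification step.

```python
import math

def build_rag(labels):
    """Return {region: set(neighbouring regions)} using 4-connectivity."""
    adj = {}
    h, w = len(labels), len(labels[0])
    for y in range(h):
        for x in range(w):
            a = labels[y][x]
            adj.setdefault(a, set())
            # Checking only the pixel below and to the right covers every edge once.
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < h and nx < w:
                    b = labels[ny][nx]
                    if a != b:
                        adj[a].add(b)
                        adj.setdefault(b, set()).add(a)
    return adj

def refine(features, colour, adj, tau=0.1):
    """Average each region feature with neighbours, weighted by colour similarity."""
    out = {}
    for r, f in features.items():
        acc, total = list(f), 1.0  # the region itself has weight 1
        for n in adj.get(r, ()):
            w = math.exp(-abs(colour[r] - colour[n]) / tau)
            total += w
            acc = [a + w * fn for a, fn in zip(acc, features[n])]
        out[r] = [a / total for a in acc]
    return out
```

Neighbours with similar low-level appearance pull a region's feature toward theirs, while dissimilar neighbours contribute almost nothing, so features stay sharp across visible boundaries.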
Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding
Authors: Shengyuan Ye, Bei Ouyang, Tianyi Qian, Liekang Zeng, Mu Yuan, Xiaowen Chu, Weijie Hong, Xu Chen
First: 2025-12-08T09:32:47+00:00 · Latest: 2025-12-08T09:32:47+00:00
Comments: Accepted by IEEE International Conference on Computer Communications 2026
Abstract
Vision-language models (VLMs) have demonstrated impressive multimodal comprehension capabilities and are being deployed in an increasing number of online video understanding applications. While recent efforts extensively explore advancing VLMs' reasoning power in these cases, deployment constraints are overlooked, leading to overwhelming system overhead in real-world deployments. To address that, we propose Venus, an on-device memory-and-retrieval system for efficient online video understanding. Venus proposes an edge-cloud disaggregated architecture that sinks memory construction and keyframe retrieval from cloud to edge, operating in two stages. In the ingestion stage, Venus continuously processes streaming edge videos via scene segmentation and clustering, where the selected keyframes are embedded with a multimodal embedding model to build a hierarchical memory for efficient storage and retrieval. In the querying stage, Venus indexes incoming queries from memory, and employs a threshold-based progressive sampling algorithm for keyframe selection that enhances diversity and adaptively balances system cost and reasoning accuracy. Our extensive evaluation shows that Venus achieves a 15x-131x speedup in total response latency compared to state-of-the-art methods, enabling real-time responses within seconds while maintaining comparable or even superior reasoning accuracy.
中文标题/摘要
标题:金星:一种高效的边缘记忆与检索系统,用于基于VLM的在线视频理解
视觉语言模型(VLMs)展示了令人印象深刻的多模态理解能力,并被部署在越来越多的在线视频理解应用中。尽管最近的努力广泛探索了在这些情况下增强VLMs的推理能力,但部署限制被忽视了,导致实际部署中的系统开销过大。为了解决这个问题,我们提出了Venus,一种用于高效在线视频理解的边缘设备上的记忆与检索系统。Venus 提出了一个边缘-云分离的架构,将云中的记忆构建和关键帧检索下沉到边缘,分为两个阶段运行。在摄取阶段,Venus 通过场景分割和聚类连续处理边缘视频流,其中选定的关键帧通过多模态嵌入模型嵌入,构建一个分层记忆以实现高效存储和检索。在查询阶段,Venus 从记忆中索引入来的查询,并采用基于阈值的渐进采样算法进行关键帧选择,以增强多样性并适当地平衡系统成本和推理准确性。我们广泛评估表明,Venus 在总响应延迟上比最先进的方法快15-131倍,能够在几秒钟内实现实时响应,同时保持相当甚至更优的推理准确性。
Summary / 总结
Venus is an on-device memory-and-retrieval system designed to enhance the efficiency of online video understanding using vision-language models (VLMs). It proposes an edge-cloud disaggregated architecture to reduce system overhead. In the ingestion stage, Venus processes streaming videos and builds a hierarchical memory for efficient storage and retrieval. In the querying stage, it uses a threshold-based progressive sampling algorithm for keyframe selection. Venus achieves a 15x-131x speedup in total response latency compared to state-of-the-art methods, enabling real-time responses while maintaining comparable or superior reasoning accuracy.
Venus 是一种用于增强在线视频理解的边缘设备记忆和检索系统,采用边缘-云分离架构,将记忆构建和关键帧检索从云端移至边缘。系统通过场景分割和聚类处理流媒体视频,使用多模态模型嵌入选定的关键帧以构建层次化记忆。在查询阶段,系统索引入来的查询,并使用基于阈值的渐进采样算法选择关键帧,平衡系统成本和推理准确性。广泛评估表明,Venus 将总响应延迟显著降低至 15-131 倍,能够实现秒级实时响应,同时保持或提高推理准确性。
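One way to picture a threshold-based progressive sampler for keyframes is sketched below. The abstract does not spell out the algorithm, so everything here is an assumption for illustration: the function name `progressive_select`, the use of query-similarity scores, the cluster ids used to encourage diversity, and the stopping rule based on a fraction of total score mass.

```python
def progressive_select(scores, clusters, threshold=0.8, budget=8):
    """Pick keyframes in descending score order until enough score mass is covered.

    scores:   query-to-keyframe similarity per frame (non-negative floats)
    clusters: cluster id per frame; at most one frame is taken per cluster
    Stops when `threshold` of the total score mass is covered or `budget` is hit.
    """
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    picked, covered, seen = [], 0.0, set()
    for i in order:
        if len(picked) >= budget or covered >= threshold * total:
            break
        if clusters[i] in seen:
            continue  # skip near-duplicates from an already covered cluster
        picked.append(i)
        seen.add(clusters[i])
        covered += scores[i]
    return picked
```

Raising `threshold` sends more frames to the VLM (higher cost, potentially higher accuracy); lowering it does the opposite, which is the cost/accuracy trade-off the abstract describes.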
Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts
Authors: Mingning Guo, Mengwei Wu, Shaoxian Li, Haifeng Li, Chao Tao
First: 2025-12-08T08:44:57+00:00 · Latest: 2025-12-08T08:44:57+00:00
Abstract
Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
中文标题/摘要
标题:向准确的无人机图像感知迈进:使用更强的任务提示引导视觉-语言模型
基于VLM的现有图像感知方法通常遵循一种模式,即模型根据用户提供的文本任务提示提取和分析图像内容。然而,当应用于无人机图像时,这些方法会面临目标混淆、尺度变化和复杂背景等挑战。这些挑战源于VLM对图像内容的理解依赖于视觉和文本标记之间的语义对齐。当任务提示简单且图像内容复杂时,实现有效的对齐变得困难,限制了模型聚焦于任务相关信息的能力。为解决这一问题,我们引入了AerialVP,这是首个用于无人机图像感知任务提示增强的代理框架。AerialVP主动从无人机图像中提取多维度辅助信息以增强任务提示,克服了传统基于VLM的方法的局限性。具体而言,增强过程包括三个阶段:(1)分析任务提示以确定任务类型和增强需求,(2)从工具库中选择合适的工具,(3)基于分析和选定的工具生成增强的任务提示。为了评估AerialVP,我们引入了AerialSense,这是一个全面的无人机图像感知基准,包括Aerial视觉推理、Aerial视觉问答和Aerial视觉定位任务。AerialSense为评估模型在不同分辨率、光照条件以及城市和自然场景中的泛化能力和性能提供了一个标准化的基础。实验结果表明,AerialVP显著增强了任务提示的指导作用,导致开源和专有VLM在性能上实现了稳定且显著的提升。我们的工作可在https://github.com/lostwolves/AerialVP获取。
Summary / 总结
The paper addresses the limitations of existing vision-language models (VLMs) in processing UAV imagery, which poses challenges such as target confusion, scale variations, and complex backgrounds. To improve this, the authors introduce AerialVP, an agent framework that enhances task prompts by extracting multi-dimensional auxiliary information from UAV images. AerialVP consists of three stages: analyzing the task prompt, selecting appropriate tools, and generating enhanced task prompts. The authors evaluate AerialVP using AerialSense, a comprehensive benchmark for UAV image perception, and show that it leads to significant performance improvements in both open-source and proprietary VLMs.
本文针对现有视觉-语言模型(VLMs)在处理无人机图像时遇到的目标混淆、尺度变化和复杂背景等问题,提出了一种名为AerialVP的任务提示增强框架,该框架通过从无人机图像中提取多维度辅助信息来增强任务提示。框架包括三个阶段:分析任务提示、选择合适工具和生成增强任务提示。AerialVP通过包含Aerial视觉推理、Aerial视觉问答和Aerial视觉定位等任务的AerialSense基准进行评估。实验结果表明,AerialVP显著提升了开源和专有VLMs在无人机图像感知任务中的性能。
Geo3DVQA: Evaluating Vision-Language Models for 3D Geospatial Reasoning from Aerial Imagery
Authors: Mai Tsujimoto, Junjue Wang, Weihao Xuan, Naoto Yokoya
Venue: WACV 2026
First: 2025-12-08T08:16:14+00:00 · Latest: 2025-12-08T08:16:14+00:00
Comments: Accepted to WACV 2026. Camera-ready-based version with minor edits for readability (no change in the contents)
Abstract
Three-dimensional geospatial analysis is critical to applications in urban planning, climate adaptation, and environmental assessment. Current methodologies depend on costly, specialized sensors (e.g., LiDAR and multispectral), which restrict global accessibility. Existing sensor-based and rule-driven methods further struggle with tasks requiring the integration of multiple 3D cues, handling diverse queries, and providing interpretable reasoning. We hereby present Geo3DVQA, a comprehensive benchmark for evaluating vision-language models (VLMs) in height-aware, 3D geospatial reasoning using RGB-only remote sensing imagery. Unlike conventional sensor-based frameworks, Geo3DVQA emphasizes realistic scenarios that integrate elevation, sky view factors, and land cover patterns. The benchmark encompasses 110k curated question-answer pairs spanning 16 task categories across three complexity levels: single-feature inference, multi-feature reasoning, and application-level spatial analysis. The evaluation of ten state-of-the-art VLMs highlights the difficulty of RGB-to-3D reasoning. GPT-4o and Gemini-2.5-Flash achieved only 28.6% and 33.0% accuracy respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6% (+24.8 points). These results reveal both the limitations of current VLMs and the effectiveness of domain adaptation. Geo3DVQA introduces new challenge frontiers for scalable, accessible, and holistic 3D geospatial analysis. The dataset and code will be released upon publication at https://github.com/mm1129/Geo3DVQA.
中文标题/摘要
标题:Geo3DVQA:评估基于视觉-语言模型的从航空影像进行三维地理空间推理能力
三维地理空间分析对于城市规划、气候适应和环境评估至关重要。当前的方法依赖于昂贵的专业传感器(例如,LiDAR和多光谱),这限制了全球的可访问性。现有的基于传感器和规则驱动的方法在处理需要整合多种三维线索、处理多样查询和提供可解释推理的任务时也存在困难。我们在此提出Geo3DVQA,这是一个全面的基准,用于评估视觉-语言模型(VLMs)在仅使用RGB遥感影像进行高度感知的三维地理空间推理方面的能力。与传统的基于传感器的框架不同,Geo3DVQA 强调整合了高程、天空视角因子和土地覆盖模式的现实场景。基准数据集包括11万个精心策划的问题-答案对,涵盖16个任务类别,分为三个复杂度级别:单一特征推理、多特征推理和应用级空间分析。对十种最先进的VLMs的评估揭示了RGB到三维推理的难度。GPT-4o和Gemini-2.5-Flash的准确率分别为28.6%和33.0%,而针对特定领域的Qwen2.5-VL-7B微调则达到了49.6%(+24.8分)。这些结果揭示了当前VLMs的局限性以及领域适应的有效性。Geo3DVQA 为可扩展、可访问和全面的三维地理空间分析引入了新的挑战前沿。数据集和代码将在发表时在https://github.com/mm1129/Geo3DVQA上发布。
Summary / 总结
Geo3DVQA is a benchmark for evaluating vision-language models in 3D geospatial reasoning using RGB-only remote sensing imagery. It includes 110k question-answer pairs covering 16 task categories at three complexity levels. Evaluating ten state-of-the-art VLMs, the results show that GPT-4o and Gemini-2.5-Flash achieved 28.6% and 33.0% accuracy, respectively, while domain-specific fine-tuning of Qwen2.5-VL-7B achieved 49.6%, highlighting the challenges and potential of VLMs in this domain.
Geo3DVQA 使用仅 RGB 的遥感图像评估视觉语言模型在 3D 地理空间推理中的能力,重点关注如高程、天空视角因子和土地覆盖模式的任务。它包含 110k 个问题-答案对,涵盖 16 个类别和三个复杂度级别。评估了十个最先进的 VLMs,研究发现 GPT-4o 和 Gemini-2.5-Flash 的准确率分别为 28.6% 和 33.0%,而针对领域的 Qwen2.5-VL-7B 微调提高了准确率至 49.6%。这突显了 VLMs 在 3D 地理空间分析中的挑战和潜力。
RVLF: A Reinforcing Vision-Language Framework for Gloss-Free Sign Language Translation
Authors: Zhi Rao, Yucheng Zhou, Benjia Zhou, Yiqing Huang, Sergio Escalera, Jun Wan
First: 2025-12-08T08:11:53+00:00 · Latest: 2025-12-08T08:11:53+00:00
Abstract
Gloss-free sign language translation (SLT) is hindered by two key challenges: **inadequate sign representation** that fails to capture nuanced visual cues, and **sentence-level semantic misalignment** in current LLM-based methods, which limits translation quality. To address these issues, we propose a three-stage **r**einforcing **v**ision-**l**anguage **f**ramework (**RVLF**). We build a large vision-language model (LVLM) specifically designed for sign language, and then combine it with reinforcement learning (RL) to adaptively enhance translation performance. First, for a sufficient representation of sign language, RVLF introduces an effective semantic representation learning mechanism that fuses skeleton-based motion cues with semantically rich visual features extracted via DINOv2, followed by instruction tuning to obtain a strong SLT-SFT baseline. Then, to reduce sentence-level semantic misalignment, we introduce a GRPO-based optimization strategy that fine-tunes the SLT-SFT model with a reward function combining translation fidelity (BLEU) and sentence completeness (ROUGE), yielding the optimized model termed SLT-GRPO. Our conceptually simple framework yields substantial gains under the gloss-free SLT setting without pre-training on any external large-scale sign language datasets, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on the CSL-Daily, PHOENIX-2014T, How2Sign, and OpenASL datasets, respectively. To the best of our knowledge, this is the first work to incorporate GRPO into SLT. Extensive experiments and ablation studies validate the effectiveness of GRPO-based optimization in enhancing both translation quality and semantic consistency.
中文标题/摘要
标题:RVLF:一种用于无手语词签语言翻译的强化视觉-语言框架
无手语词签语言翻译(SLT)受到两个关键挑战的阻碍:**不充分的手语表示**无法捕捉细微的视觉线索,以及当前基于LLM的方法在句级语义对齐方面的局限性,这限制了翻译质量。为了解决这些问题,我们提出了一种三阶段的**r** 强化 **v** 视觉- **l** 语言 **f** 框架 (**RVLF**)。我们构建了一个专门针对手语的大型视觉-语言模型(LVLM),然后结合强化学习(RL)以自适应地增强翻译性能。首先,为了充分表示手语,RVLF引入了一种有效的语义表示学习机制,该机制将基于骨架的动作线索与通过DINOv2提取的语义丰富的视觉特征融合,随后进行指令调优以获得强大的SLT-SFT基线。然后,为了改善句级语义对齐,我们引入了一种基于GRPO的优化策略,该策略使用结合翻译忠实度(BLEU)和句子完整性(ROUGE)的奖励函数微调SLT-SFT模型,从而得到优化模型SLT-GRPO。在无需在任何外部大规模手语数据集上进行预训练的情况下,我们的概念简单框架在无手语词SLT设置下取得了显著的提升,在CSL-Daily、PHOENIX-2014T、How2Sign和OpenASL数据集上分别提高了BLEU-4分数5.1、1.11、1.4和1.61。据我们所知,这是首次将GRPO应用于SLT的工作。广泛的实验和消融研究验证了基于GRPO优化在提高翻译质量和语义一致性方面的有效性。
Summary / 总结
The paper addresses the challenges of gloss-free sign language translation by proposing RVLF, a three-stage reinforcing vision-language framework. RVLF uses a large vision-language model and reinforcement learning to enhance translation quality. It introduces a semantic representation learning mechanism that combines skeleton-based motion cues with semantically rich visual features, and uses GRPO-based optimization to fine-tune the model, improving BLEU-4 scores by +5.1, +1.11, +1.4, and +1.61 on different datasets.
该论文通过提出RVLF三阶段强化视觉语言框架来解决无手语词汇的翻译挑战。RVLF引入了有效的语义表示学习机制,并使用强化学习来增强翻译性能。该框架在CSL-Daily、PHOENIX-2014T、How2Sign和OpenASL数据集上分别提高了BLEU-4分数5.1、1.11、1.4和1.61,且无需依赖外部大规模手语数据集。
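The reward described in the abstract (BLEU for fidelity plus ROUGE for completeness) and the group-relative normalization used by GRPO can be sketched as below. This is a minimal illustration, not the paper's implementation: the equal weighting `w_bleu=0.5` is an assumption, and the BLEU/ROUGE values are taken as precomputed inputs rather than calculated here.

```python
import statistics

def combined_reward(bleu, rouge, w_bleu=0.5):
    """Blend translation fidelity (BLEU) and completeness (ROUGE) into one scalar.

    The 50/50 weighting is an illustrative assumption.
    """
    return w_bleu * bleu + (1.0 - w_bleu) * rouge

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four candidate translations sampled for the same sign video.
rewards = [combined_reward(b, r) for b, r in [(0.2, 0.4), (0.3, 0.5), (0.1, 0.2), (0.4, 0.6)]]
advantages = group_relative_advantages(rewards)
```

Candidates scoring above the group mean receive positive advantages and are reinforced; no separate value network is needed because the baseline comes from the group itself.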
Zero-Shot Textual Explanations via Translating Decision-Critical Features
Authors: Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto
First: 2025-12-08T07:39:52+00:00 · Latest: 2025-12-08T07:39:52+00:00
Comments: 11+6 pages, 8 figures, 4 tables
Abstract
Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons -- i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER generates more faithful and interpretable explanations than existing methods. The code will be publicly released.
中文标题/摘要
标题:通过翻译决策关键特征实现零样本文本解释
文本解释通过自然语言描述预测理由使图像分类器的决策透明化。大型视觉-语言模型可以生成描述,但它们旨在进行通用视觉理解,而不是特定分类器的推理。现有的零样本解释方法将全局图像特征与语言对齐,产生可见内容的描述而非驱动预测的内容。我们提出了TEXTER,通过在对齐之前隔离决策关键特征来克服这一限制。TEXTER识别对预测做出贡献的神经元,并强调这些神经元中编码的特征——即决策关键特征。然后,它将这些强调的特征映射到CLIP特征空间,以检索反映模型推理的文本解释。稀疏自编码器进一步提高了可解释性,特别是对于Transformer架构。大量实验表明,TEXTER生成的解释比现有方法更加忠实和可解释。代码将公开发布。
Summary / 总结
The research aims to provide transparent explanations for image classifier decisions by generating natural language descriptions of the prediction rationale. The method, TEXTER, isolates decision-critical features before aligning them with language, focusing on the neurons that contribute to the prediction. Experimental results demonstrate that TEXTER produces more faithful and interpretable explanations compared to existing methods.
研究旨在通过生成文本描述来使图像分类器的决策透明化,解释预测的合理性。方法TEXTER在对决策关键特征进行隔离处理后再进行语言对齐,专注于那些对预测有贡献的神经元。这种方法生成的解释比现有方法更忠实和可解释。实验表明,TEXTER在忠实度和可解释性方面优于当前技术。
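The idea of isolating decision-critical features before retrieving text can be sketched in a few lines. This is a toy illustration under strong assumptions, not TEXTER itself: the classifier head is assumed linear so that a neuron's contribution is `activation * weight`, the mapping into CLIP space is omitted (features and text embeddings are assumed to already share one space), and `decision_critical`, `retrieve`, and the text bank are hypothetical names.

```python
import math

def decision_critical(acts, class_weights, k):
    """Keep only the k neurons contributing most to the predicted class logit."""
    contrib = [a * w for a, w in zip(acts, class_weights)]
    top = set(sorted(range(len(acts)), key=lambda i: contrib[i], reverse=True)[:k])
    return [a if i in top else 0.0 for i, a in enumerate(acts)]

def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(feature, text_bank):
    """Return the candidate phrase whose embedding is closest to the feature."""
    return max(text_bank, key=lambda t: cosine(feature, text_bank[t]))
```

Because the retrieval query contains only the features that drove the prediction, the returned phrase describes what mattered to the classifier rather than everything visible in the image.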
Dropout Prompt Learning: Towards Robust and Adaptive Vision-Language Models
Authors: Biao Chen, Lin Zuo, Mengmeng Jing, Kunbin He, Yuchen Wang
First: 2025-12-08T07:31:27+00:00 · Latest: 2025-12-08T07:31:27+00:00
Abstract
Dropout is a widely used regularization technique which improves the generalization ability of a model by randomly dropping neurons. In light of this, we propose Dropout Prompt Learning, which aims to apply dropout to improve the robustness of vision-language models. Different from vanilla dropout, we apply dropout on the tokens of the textual and visual branches, where we evaluate the token significance considering both intra-modal context and inter-modal alignment, enabling flexible dropout probabilities for each token. Moreover, to maintain semantic alignment for general knowledge transfer while encouraging the diverse representations that dropout introduces, we further propose residual entropy regularization. Experiments on 15 benchmarks show our method's effectiveness in challenging scenarios like low-shot learning, long-tail classification, and out-of-distribution generalization. Notably, our method surpasses regularization-based methods including KgCoOp by 5.10% and PromptSRC by 2.13% in performance on base-to-novel generalization.
中文标题/摘要
标题:Dropout 提示学习:朝向稳健和自适应的视觉-语言模型
Dropout 是一种广泛使用的正则化技术,通过随机丢弃神经元来提高模型的泛化能力。基于此,我们提出了 Dropout 提示学习,旨在通过 Dropout 提高视觉-语言模型的稳健性。与传统的 Dropout 不同,我们对文本和视觉分支的标记应用 Dropout,考虑了模内上下文和模间对齐来评估标记的重要性,从而为每个标记提供灵活的 Dropout 概率。此外,为了在保持语义对齐以促进通用知识转移的同时鼓励 Dropout 引入的多样化表示,我们进一步提出了残差熵正则化。在 15 个基准上的实验表明,我们的方法在低样本学习、长尾分类和离分布泛化等具有挑战性的场景中具有有效性。值得注意的是,我们的方法在基类到新类泛化性能上分别超越了包括 KgCoOp 在内的基于正则化的方法 5.10% 和 PromptSRC 2.13%。
Summary / 总结
The research aims to enhance the robustness of vision-language models by applying dropout on tokens of both textual and visual branches, considering their intra-modal and inter-modal significance. The method introduces residual entropy regularization to maintain semantic alignment while encouraging diverse representations. Experimental results on 15 benchmarks demonstrate the effectiveness of the proposed method, particularly in low-shot learning, long-tail classification, and out-of-distribution generalization, surpassing other regularization-based methods by significant margins.
该研究提出了Dropout Prompt Learning方法,通过在视觉和语言分支的令牌上应用dropout来提高模型的鲁棒性。不同于传统的dropout,该方法考虑了内在模态上下文和跨模态对齐来评估令牌的重要性,允许灵活的dropout概率。此外,还提出了残差熵正则化来保持语义对齐的同时鼓励引入的多样化表示。该方法在低样本学习、长尾分类和离分布泛化等挑战性场景中表现出显著的改进,其基到新泛化性能分别超越其他基于正则化的KgCoOp和PromptSRC方法5.10%和2.13%。
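Token-level dropout with a per-token probability can be sketched as follows. This is an illustrative toy, not the paper's formulation: the significance scores are taken as given, and the linear rule that maps significance to a drop probability (capped at `p_max`) is an assumption.

```python
import random

def token_dropout(tokens, significance, p_max=0.5, rng=None):
    """Drop tokens with a probability that shrinks as significance grows.

    The most significant token gets drop probability 0 and the least
    significant gets p_max (a hypothetical linear schedule).
    """
    rng = rng or random.Random()
    lo, hi = min(significance), max(significance)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    kept = []
    for tok, s in zip(tokens, significance):
        p_drop = p_max * (1.0 - (s - lo) / span)
        if rng.random() >= p_drop:
            kept.append(tok)
    return kept
```

Compared with vanilla dropout, which uses one global rate, this keeps tokens judged important while still injecting randomness among the rest.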
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Authors: Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
First: 2025-12-08T07:05:18+00:00 · Latest: 2025-12-08T07:05:18+00:00
Abstract
To address the trade-off between robustness and performance for robust VLMs, we observe that function words could make VLMs vulnerable to cross-modal adversarial attacks, and propose Function-word De-Attention (FDA) accordingly to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments include 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, as well as in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
中文标题/摘要
标题:减少关注功能词以提高视觉-语言模型的鲁棒性
为了解决视觉-语言模型(VLM)的鲁棒性和性能之间的权衡问题,我们观察到功能词可能会使VLMs在跨模态对抗攻击中变得脆弱,并据此提出了功能词去注意(FDA)方法以减轻功能词的影响。类似于差分放大器,我们的FDA在注意力头内计算原始的和功能词的交叉注意力,并从前者中差分地减去后者,从而获得更对齐和鲁棒的VLMs。全面的实验包括在2个下游任务、3个数据集和3个模型上的2个SOTA基线,在6种不同的攻击下的2个下游任务。总体而言,我们的FDA在3个测试模型上的检索任务中仅导致平均18/13/53%的ASR下降,性能下降仅为0.2/0.3/0.6%;在视觉定位任务中,ASR下降90%,性能提升0.3%。我们通过实验展示了FDA的可扩展性、泛化能力和零样本性能,并进行了深入的消融研究和分析。代码将在https://github.com/michaeltian108/FDA公开。
Summary / 总结
The research aims to enhance the robustness of vision-language models (VLMs) without significantly compromising their performance. It introduces Function-word De-Attention (FDA), which mitigates the vulnerability of VLMs to cross-modal adversarial attacks by subtracting the function-word cross-attention from the original cross-attention within attention heads. Comprehensive experiments show that FDA reduces the attack success rate (ASR) by an average of 18/13/53% with minimal performance drops (0.2/0.3/0.6%) on three models for retrieval tasks, and achieves a 90% ASR drop with a 0.3% performance gain on visual grounding tasks.
研究旨在通过减少对功能词的关注来提高视觉语言模型(VLM)的鲁棒性,而不显著影响其性能。引入了功能词去注意力(FDA)机制,以减轻VLM对跨模态对抗攻击的脆弱性。全面的实验表明,FDA在三个模型上分别将对抗成功率(ASR)降低了18/13/53%,同时仅导致检索任务中0.2/0.3/0.6%的性能下降,而在视觉定位任务中实现了90%的ASR下降和0.3%的性能提升。
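The differential subtraction described in the abstract can be sketched for a single query over a handful of text tokens. This is a simplified illustration, not the released FDA code: the way the function-word attention is formed (a softmax restricted to function-word positions) and the scaling factor `lam` are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def de_attend(scores, is_function_word, lam=0.5):
    """Subtract function-word attention from the original attention weights.

    scores:           raw cross-attention logits for one query over text tokens
    is_function_word: booleans marking tokens such as "the", "of", "a"
    The result may contain negative weights, as in other differential schemes.
    """
    attn = softmax(scores)
    fw_scores = [s for s, f in zip(scores, is_function_word) if f]
    if not fw_scores:
        return attn  # nothing to subtract
    fw_soft = iter(softmax(fw_scores))
    fw_attn = [next(fw_soft) if f else 0.0 for f in is_function_word]
    return [a - lam * b for a, b in zip(attn, fw_attn)]
```

Content-word positions keep their original weight, while function-word positions are pushed down, which is the intuition behind paying less attention to function words.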
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Authors: Md Selim Sarowar, Sungho Kim
First: 2025-12-08T06:54:16+00:00 · Latest: 2025-12-08T06:54:16+00:00
Abstract
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.
中文标题/摘要
标题:VFM-VLM:基于视觉基础模型和视觉语言模型的3D姿态估计中的视觉比较
视觉基础模型(VFMs)和视觉语言模型(VLMs)通过提供丰富的语义和几何表示,已经革新了计算机视觉领域。本文介绍了CLIP基和DINOv2基方法在手部物体抓取场景中3D姿态估计的全面视觉比较。我们在6D物体姿态估计任务上评估了这两种模型,并展示了它们的互补优势:CLIP在通过语言定位实现语义理解方面表现出色,而DINOv2提供了更优的密集几何特征。通过在基准数据集上的大量实验,我们证明了基于CLIP的方法在语义一致性方面表现更好,而基于DINOv2的方法在几何精度方面表现出竞争性性能。我们的分析为选择合适的视觉模型用于机器人操作和抓取提供了见解。
Summary / 总结
This paper compares CLIP-based and DINOv2-based approaches for 6D object pose estimation in hand-object grasping scenarios. It evaluates both Vision Foundation Models (VFMs) and Vision Language Models (VLMs) on benchmark datasets, showing that CLIP excels in semantic consistency and DINOv2 provides superior geometric precision. The study highlights the complementary strengths of these models and offers insights for selecting appropriate vision models for robotic manipulation and grasping tasks.
该研究比较了基于CLIP和DINOv2的方法在手部抓取物体场景中的6D物体姿态估计。利用Vision Foundation Models (VFMs)和Vision Language Models (VLMs)提供丰富的语义和几何表示。研究表明,CLIP在语义理解方面表现出色,而DINOv2在几何精度方面更优。实验表明,基于CLIP的方法在语义一致性方面表现更好,而基于DINOv2的方法在几何精度方面具有竞争力。这些发现为选择合适的视觉模型用于机器人操作和抓取任务提供了参考。
GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations
Authors: Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao
First: 2025-08-05T08:37:06+00:00 · Latest: 2025-12-08T06:51:35+00:00
Comments: AAAI2026 Poster
Abstract
Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users' locations from publicly shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and under low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.
中文标题/摘要
标题:GeoShield:通过对抗性扰动保护地理定位隐私的视觉-语言模型
视觉-语言模型(VLMs)如GPT-4o现在能够从公共共享图像中推断出用户的地理位置,对地理隐私构成了重大风险。尽管对抗性扰动提供了一种潜在的防御手段,但当前的方法并不适合这种场景:它们在高分辨率图像和低扰动预算下表现不佳,并且可能会引入无关的语义内容。为了解决这些限制,我们提出了GeoShield,这是一种针对实际场景中地理隐私保护的新颖对抗框架。GeoShield 包含三个关键模块:一个特征解耦模块,用于分离地理和非地理信息;一个曝光元素识别模块,用于识别图像中的地理揭示区域;以及一个尺度自适应增强模块,用于在全局和局部层面联合优化扰动,以确保在不同分辨率下均有效。在具有挑战性的基准测试上的广泛实验表明,GeoShield 在黑盒设置中始终优于先前的方法,能够在不显著影响视觉或语义质量的情况下实现强大的隐私保护。据我们所知,这是首次探索对抗性扰动以防御高级VLMs的地理定位推断的工作,为日益严重的隐私问题提供了实用且有效的解决方案。
Summary / 总结
GeoShield is a novel adversarial framework designed to protect geolocation privacy from Vision-Language Models (VLMs) by applying adversarial perturbations. It includes three key modules: feature disentanglement, exposure element identification, and scale-adaptive enhancement. Experiments show that GeoShield outperforms existing methods in black-box settings while maintaining visual and semantic quality.
GeoShield 是一种新型对抗框架,旨在通过对抗扰动保护视觉-语言模型(VLMs)中的地理位置隐私。它包含三个关键模块:特征解耦、曝光元素识别和尺度自适应增强。实验表明,GeoShield 在黑盒设置中优于现有方法,提供了强大的隐私保护,同时对视觉和语义质量的影响最小。
MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning
Authors: Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, Faqiang Qian, Yichao Wu
First: 2025-12-08T06:26:13+00:00 · Latest: 2025-12-08T06:26:13+00:00
Comments: 7 pages, 1 figure
Abstract
Multimodal pre-training remains constrained by the descriptive bias of image-caption pairs, leading models to favor surface linguistic cues over grounded visual understanding. We introduce MMRPT, a masked multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models, enabling learning signals that reward visual grounding rather than caption imitation. MMRPT constructs masked multimodal data by estimating sentence-level visual dependency via attention over visual tokens and masking highly vision-dependent segments; the model reconstructs these spans through vision-grounded reasoning guided by a semantic-visual reward. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning, demonstrating that reinforcement-driven masked reasoning provides a more reliable and generalizable pre-training objective for multimodal models.
中文标题/摘要
标题:MMRPT:基于掩码视觉依赖推理的多模态强化预训练
多模态预训练仍然受到图像-描述对描述偏见的限制,导致模型倾向于关注表面语言线索而非基于视觉的理解。我们提出了MMRPT,一种掩码多模态强化预训练框架,以增强MLLMs中的视觉推理。我们首次将强化学习直接纳入大型视觉-语言模型的预训练中,从而能够学习奖励视觉接地而非描述模仿的学习信号。MMRPT通过对视觉令牌的注意力估计句子级别的视觉依赖性,并掩蔽高度视觉依赖的片段,以此构建掩码多模态数据;模型通过由语义-视觉奖励引导的视觉接地推理重建这些片段。实验表明,MMRPT在多种基准测试中实现了一致的零样本增益,并且在监督微调下具有显著增强的鲁棒性,证明了强化驱动的掩蔽推理为多模态模型提供了一个更可靠和泛化的预训练目标。
Summary / 总结
The research addresses the descriptive bias of image-caption pairs used for multimodal pre-training, which can lead models to rely on surface linguistic cues rather than visual understanding. MMRPT introduces a masked multimodal reinforcement pre-training framework that enhances visual reasoning in multimodal large language models (MLLMs). By incorporating reinforcement learning directly into pre-training, MMRPT encourages models to focus on visual grounding rather than just imitating captions. Key experimental findings show consistent improvements in zero-shot performance across various benchmarks and better robustness under supervised fine-tuning, indicating that reinforcement-driven masked reasoning is a more reliable and generalizable pre-training objective for multimodal models.
研究旨在解决图像-描述对在多模态预训练中描述性偏见的问题,这使得模型更倾向于依赖表面语言线索而非视觉理解。MMRPT 提出了一种带掩码的多模态强化预训练框架,以增强大型语言模型中的视觉推理能力。通过直接将强化学习纳入预训练过程,MMRPT 促使模型更注重视觉接地而非仅仅模仿描述。实验结果表明,MMRPT 在各种基准测试中的零样本性能得到了一致的提升,并且在监督微调过程中表现出更好的鲁棒性,这表明强化驱动的带掩码推理是一个更可靠且通用的多模态模型预训练目标。
uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
Authors: Dahyun Chung, Donghyun Shin, Yujin Sung, Seunggi Moon, Jinwoo Jeon, Byung-Jun Lee
First: 2025-11-17T06:34:49+00:00 · Latest: 2025-12-08T06:07:33+00:00
Comments: Our project page can be found at https://dinyudin203.github.io/uCLIP-project/
Abstract
Contrastive Language-Image Pre-training (CLIP) has demonstrated strong generalization across a wide range of visual tasks by leveraging large-scale English-image pairs. However, its extension to low-resource languages remains limited due to the scarcity of high-quality multilingual image-text data. Existing multilingual vision-language models exhibit consistently low retrieval performance in underrepresented languages including Czech, Finnish, Croatian, Hungarian, and Romanian on the Crossmodal-3600 (XM3600) benchmark. To address this, we propose a lightweight and data-efficient framework for multilingual vision-language alignment. Our approach requires no image-text pairs or text-text pairs and freezes both the pretrained image encoder and multilingual text encoder during training. Only a compact 1.7M-parameter projection module is trained, using a contrastive loss over English representations as semantic anchors. This minimal training setup enables robust multilingual alignment even for languages with limited supervision. Extensive evaluation across multiple multilingual retrieval benchmarks confirms the effectiveness of our method, showing significant gains in five underrepresented languages where existing models typically underperform. These findings highlight the effectiveness of our pivot-based, parameter-efficient alignment strategy for inclusive multimodal learning.
中文标题/摘要
标题:uCLIP:利用未配对数据的参数高效多语言视觉-语言模型扩展
对比语言-图像预训练(CLIP)通过利用大规模的英语-图像配对数据,在广泛的视觉任务中展示了强大的泛化能力。然而,将其扩展到低资源语言仍然受到高质量多语言图像-文本数据稀缺的限制。现有的多语言视觉-语言模型在Crossmodal-3600(XM3600)基准测试中,在包括捷克语、芬兰语、克罗地亚语、匈牙利语和罗马尼亚语等代表性不足的语言中表现出一致的检索性能低下。为了解决这个问题,我们提出了一种轻量级且数据高效的多语言视觉-语言对齐框架。我们的方法不需要图像-文本配对或文本-文本配对,并在训练过程中冻结预训练的图像编码器和多语言文本编码器。仅训练一个紧凑的1.7M参数投影模块,使用英语表示作为语义锚点,通过对比损失进行训练。这种最小的训练设置即使在监督有限的语言中也能实现稳健的多语言对齐。在多个多语言检索基准测试中的广泛评估证实了我们方法的有效性,显示出在五个代表性不足的语言中显著的性能提升,这些语言在现有模型中通常表现不佳。这些发现突显了我们基于轴心的参数高效对齐策略在包容性多模态学习中的有效性。
Summary / 总结
The paper introduces uCLIP, a parameter-efficient method for extending vision-language models to low-resource languages using unpaired data. It leverages a contrastive loss with English representations as anchors and trains only a small 1.7M-parameter projection module, while keeping the pretrained encoders frozen. The approach significantly improves retrieval performance in five underrepresented languages, demonstrating its effectiveness in multilingual vision-language alignment.
研究旨在通过利用无配对数据来提高低资源语言中视觉语言模型的性能。方法采用了一个轻量级框架,不需要图像-文本或文本-文本配对,仅训练一个小型的1.7M参数投影模块,使用基于英语表示的对比损失。该方法在五个欠代表语言中取得了显著的提升,展示了参数高效对齐策略的有效性。
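As a rough illustration of the contrastive objective mentioned in the uCLIP entry above, here is a minimal pure-Python sketch of an InfoNCE-style loss that pulls each query embedding toward its matching anchor embedding. The function names, the temperature of 0.07, the toy vectors, and in particular the one-to-one pairing of queries with anchors are assumptions made for illustration; the abstract does not specify how the English anchors are paired or which exact loss the authors use.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, anchors, temperature=0.07):
    """Mean InfoNCE loss where queries[i] is treated as the positive of anchors[i]."""
    q = [normalize(v) for v in queries]
    a = [normalize(v) for v in anchors]
    total = 0.0
    for i, qi in enumerate(q):
        # Similarity of this query to every anchor, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp over all anchors.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy data (hypothetical): three anchors and three nearly aligned queries.
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = info_nce(aligned, anchors)
loss_shuffled = info_nce(shuffled, anchors)
```

In this toy setup the loss is close to zero when each query sits next to its own anchor and much larger when the pairing is shuffled, which is the behavior a contrastive alignment objective is meant to reward.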
DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy
Authors: Jaewoo Song, Jooyoung Choi, Kanghyun Baek, Sangyub Lee, Daemin Park, Sungroh Yoon
Venue: WACV 2026
First: 2025-12-01T05:52:55+00:00 · Latest: 2025-12-08T05:26:07+00:00
Comments: Accepted to WACV 2026
Abstract
Despite recent text-to-image models achieving high-fidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each segment to a designated region. To accurately render each segment within its region while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multi-sentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.
中文标题/摘要
标题:DCText:基于分而治之策略的计划注意掩蔽视觉文本生成
尽管近期的文本到图像模型在高保真文本渲染方面取得了显著进展,但它们仍然难以处理长文本或多文本,因为全局注意力被稀释了。我们提出了一种名为DCText的无需训练的视觉文本生成方法,该方法采用分而治之策略,利用多模态扩散变换器可靠的短文本生成能力。该方法首先通过提取和分割目标文本将其分解,然后将每个文本分配到指定区域。为了在保持图像整体一致性的同时准确渲染每个段落,我们引入了两种注意力掩蔽——文本聚焦和上下文扩展,并在去噪过程中依次应用。此外,局部噪声初始化在不增加计算成本的情况下进一步提高了文本准确性和区域对齐。在单句和多句基准测试中的广泛实验表明,DCText在不牺牲图像质量的情况下实现了最佳的文本准确度,同时生成延迟最低。
Summary / 总结
DCText is a training-free method for visual text generation that addresses the challenge of handling long or multiple texts by using a divide-and-conquer strategy. It decomposes the input text into segments and assigns them to designated regions. Two attention masks, Text-Focus and Context-Expansion, are applied sequentially to render each segment accurately while maintaining image coherence. Localized Noise Initialization enhances text accuracy and region alignment. Experiments show that DCText achieves the highest text accuracy without compromising image quality and has the lowest generation latency compared to other methods.
DCText 是一种无需训练的方法,用于处理长或多段文本的视觉文本生成,通过分而治之的策略将文本分解成段落并分配到指定区域,应用 Text-Focus 和 Context-Expansion 两种注意力掩码以确保准确渲染同时保持图像连贯性。实验结果表明,DCText 在文本准确性和生成延迟方面优于其他方法,同时不牺牲图像质量。
Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
Authors: Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
First: 2025-12-08T05:15:41+00:00 · Latest: 2025-12-08T05:15:41+00:00
Comments: 8 pages, 3 figures. Preprint under review
Abstract
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
中文标题/摘要
标题:通过无训练自信心校准提高基于扩散的大语言模型的吞吐量
我们提出了CadLLM,这是一种无训练方法,用于加速基于扩散的大语言模型(dLLMs)的推理吞吐量。我们首先研究了令牌去遮蔽信心在块和步骤之间的动态性质。基于这一观察,我们提出了一种轻量级自适应方法,根据未遮蔽令牌的平均信心控制生成块大小、步长和阈值。我们进一步通过动态利用词汇表的子集来调节采样范围,从而减少softmax开销。CadLLM 是一种即插即用、模型无关的方法,兼容基于KV缓存的dLLMs。在四个流行任务上的广泛实验表明,与最先进的基线相比,CadLLM 可以获得高达2.28倍的吞吐量提升,同时保持竞争力的准确性。
Summary / 总结
CadLLM is a training-free method that improves the inference throughput of diffusion-based large language models (dLLMs) by dynamically adjusting the generation block size, step size, and threshold based on the average confidence of unmasked tokens. It also reduces softmax overhead by using a subset of the vocabulary. Experiments show that CadLLM can achieve up to 2.28x throughput improvement with competitive accuracy compared to the state-of-the-art baseline.
CadLLM 是一种无需训练的方法,通过动态调整生成块大小、步长和阈值,基于未掩码令牌的平均置信度来提升扩散基础大型语言模型(dLLM)的推理吞吐量。它还通过使用词汇表的子集来动态减少 softmax 的开销。实验表明,CadLLM 可以实现最高 2.28 倍的吞吐量提升,同时保持竞争力的准确性。
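The CadLLM entry above describes controlling block size and threshold from the average confidence of unmasked tokens. The sketch below shows one way such a controller could look. It is not the authors' rule: the linear mapping, the direction of adjustment (higher confidence gives a larger block and a lower threshold), the bounds, and the name `adapt_schedule` are all hypothetical choices made for illustration.

```python
def mean_confidence(confidences):
    """Average confidence over the tokens that were just unmasked."""
    return sum(confidences) / len(confidences)


def adapt_schedule(confidences, min_block=8, max_block=64,
                   min_threshold=0.7, max_threshold=0.95):
    """Map average confidence to a (block_size, threshold) pair.

    Hypothetical rule: interpolate linearly between the bounds, growing the
    block and relaxing the threshold as average confidence rises.
    """
    c = min(1.0, max(0.0, mean_confidence(confidences)))
    block_size = int(round(min_block + c * (max_block - min_block)))
    threshold = max_threshold - c * (max_threshold - min_threshold)
    return block_size, threshold


# Toy inputs (hypothetical): a confident step and an uncertain step.
confident_block, confident_threshold = adapt_schedule([0.98, 0.95, 0.99])
uncertain_block, uncertain_threshold = adapt_schedule([0.30, 0.45, 0.25])
```

Under these assumptions a confident step yields a larger block and a more permissive threshold than an uncertain one; a real controller would also have to adjust the step size, which this sketch leaves out.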
MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning
Authors: Weihai Zhi, Jiayan Guo, Shangyang Li
Venue: AAAI
First: 2025-08-28T08:41:32+00:00 · Latest: 2025-12-08T05:05:37+00:00
Comments: 8 pages, 5 figures
Abstract
The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as a superior training source for both SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.
中文标题/摘要
标题:MedGR$^2$: 通过生成奖励学习打破医学推理的数据壁垒
医学中视觉-语言模型(VLMs)的应用受到高质量、专家标注数据稀缺的严重阻碍。现有数据集上的监督微调(SFT)往往在未见过的模态和任务上表现不佳,而强化学习(RL),作为一种有前景的替代方案,由于数据稀缺领域缺乏可靠的奖励信号而受阻。为打破这一僵局,我们提出了医学推理的生成奖励学习框架(MedGR$^2$),这是一种新颖的框架,能够创建一个自我改进的良性循环。MedGR$^2$ 同时开发了一个数据生成器和一个奖励模型,使自动化、持续地生成高质量多模态医学数据成为可能,这些数据既可作为SFT和RL的优质训练源。我们的实验表明,使用MedGR$^2$生成的数据进行SFT已经超越了基于大规模、人工标注数据集训练的基线。更重要的是,通过组相对策略优化(GRPO)利用这些数据进行RL时,我们的模型在跨模态和跨任务泛化方面达到了最先进的水平,显著优于专门的基于RL的方法。此外,我们的紧凑型模型,得益于MedGR$^2$,在性能上与参数量超过其10倍的预训练模型相当。MedGR$^2$ 为高风险领域中的数据高效学习提供了一个新的范式,将问题从数据稀缺转变为数据生成,并解锁了RL在构建真正泛化医学AI方面的全部潜力。
Summary / 总结
MedGR$^2$ addresses the scarcity of high-quality medical data by introducing a Generative Reward Learning framework for medical reasoning. It co-develops a data generator and a reward model to create high-quality, multi-modal medical data for both Supervised Fine-Tuning and Reinforcement Learning. Experiments show that MedGR$^2$-generated data outperforms large-scale, human-curated datasets in SFT and achieves state-of-the-art generalization in RL tasks, outperforming specialized methods. Additionally, the compact model using MedGR$^2$ matches the performance of much larger foundation models.
MedGR$^2$通过引入生成奖励学习框架解决医学数据稀缺问题,该框架共同开发数据生成器和奖励模型以创建高质量的多模态医学数据,用于监督微调和强化学习。实验表明,MedGR$^2$生成的数据在监督微调中优于大规模的人工标注数据集,并在RL任务中实现最先进的泛化能力,超越专门的RL方法。此外,使用MedGR$^2$的紧凑模型与拥有超过10倍参数的大型基础模型性能相当。
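The MedGR$^2$ entry above relies on Group Relative Policy Optimization (GRPO). The core idea of GRPO's advantage estimate is to score each sampled response relative to the other responses in its group, typically by standardizing the rewards. The sketch below shows only that step; the reward values are made up, the use of the population standard deviation is an assumption (implementations differ), and nothing here reflects how MedGR$^2$ computes its rewards.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    # Population standard deviation; eps guards against a zero-variance group.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy rewards (hypothetical) for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and those below it a negative one, so no separate value network is needed to form the baseline.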
CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics
Authors: Dahyeon Kye, Jeahun Sung, MinKyu Jeon, Jihyong Oh
First: 2025-12-08T04:39:12+00:00 · Latest: 2025-12-08T04:39:12+00:00
Comments: Please visit our project page at https://cmlab-korea.github.io/CHIMERA/
Abstract
Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, which enables spatial and semantic alignment in a depth- and time-adaptive manner as well as natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
中文标题/摘要
标题:CHIMERA:自适应缓存注入与语义锚点提示在零样本图像变形中的应用及其评价指标
扩散模型展示了出色的生成能力,但在实现平滑且语义一致的图像变形方面仍面临挑战。现有方法往往由于缺乏自适应结构和语义对齐而产生突兀的过渡或过度饱和的外观。我们提出CHIMERA,一种基于扩散的零样本框架,将变形公式化为缓存反演引导的去噪过程。为处理大规模的语义和外观差异,我们提出了自适应缓存注入和语义锚点提示。自适应缓存注入(ACI)在DDIM反演过程中缓存输入的低、中、高层特征,并在去噪过程中适配性地重新注入,从而在深度和时间自适应的方式下实现空间和语义对齐,并实现自然特征融合和平滑过渡。语义锚点提示(SAP)利用视觉-语言模型生成共享的锚点提示,作为语义锚点,连接不相似的输入,并引导去噪过程向一致的结果发展。最后,我们引入全局-局部一致性评分(GLCS),这是一种变形导向的评价指标,同时评估两个输入的全局协调性和局部变形的平滑度。广泛的实验和用户研究显示,CHIMERA实现了比现有方法更平滑且更语义一致的过渡,建立了图像变形的新基准。代码和项目页面将公开发布。
Summary / 总结
CHIMERA is a zero-shot diffusion-based framework designed to achieve smooth and semantically consistent image morphing. It introduces Adaptive Cache Injection and Semantic Anchor Prompting to handle large semantic and appearance disparities. Experimental results demonstrate that CHIMERA outperforms existing methods in producing smoother and more semantically aligned transitions, setting a new state of the art in image morphing. The paper also introduces the Global-Local Consistency Score, a metric that evaluates both the global harmonization and the local smoothness of the morphing process.
CHIMERA 是一个零样本扩散基础框架,用于解决图像变形中的突变过渡和过度饱和问题。它引入了自适应缓存注入(ACI)和语义锚点提示(SAP)来对齐空间和语义特征,并引入全局-局部一致性评分(GLCS)来评估变形质量。实验表明,CHIMERA 产生的过渡更加平滑且语义对齐,达到了图像变形的新前沿。
FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers
Authors: Jonghyun Park, Jong Chul Ye
First: 2025-12-08T04:18:13+00:00 · Latest: 2025-12-08T04:18:13+00:00
Abstract
Deep generative models have become powerful priors for solving inverse problems, and various training-free methods have been developed. However, when applied to latent flow models, existing methods often fail to converge to the posterior mode or suffer from manifold deviation within latent spaces. To mitigate this, here we introduce a novel training-free framework, FlowLPS, that solves inverse problems with pretrained flow models via a Langevin Proximal Sampling (LPS) strategy. Our method integrates Langevin dynamics for manifold-consistent exploration with proximal optimization for precise mode seeking, achieving a superior balance between reconstruction fidelity and perceptual quality across multiple inverse tasks on FFHQ and DIV2K and outperforming state-of-the-art inverse solvers.
中文标题/摘要
标题:FlowLPS:面向基于流的逆问题求解器的朗之万-近端采样
深度生成模型已成为解决逆问题的强大先验,各种无需训练的方法也已开发。然而,当应用于潜在流模型时,现有方法往往无法收敛到后验模式或在潜在空间内出现流形偏差。为解决这一问题,我们引入了一种新的无需训练框架FlowLPS,通过朗之万-近端采样(LPS)策略使用预训练流模型来解决逆问题。我们的方法将用于流形一致探索的朗之万动力学与用于精确模式搜索的近端优化相结合,在FFHQ和DIV2K的多个逆任务中实现了重建保真度和感知质量的优越平衡,超越了最先进的逆问题求解器。
Summary / 总结
The research aims to address the limitations of existing training-free methods in deep generative models for solving inverse problems, particularly in latent flow models. The proposed FlowLPS framework uses a Langevin Proximal Sampling strategy to integrate manifold-consistent exploration with precise mode seeking, thereby improving reconstruction fidelity and perceptual quality. Experiments on FFHQ and DIV2K datasets show that FlowLPS outperforms state-of-the-art inverse solvers in multiple inverse tasks.
研究旨在提高深度生成模型在解决逆问题中的性能,特别是在潜流模型中的表现。FlowLPS 使用 Langevin Proximal Sampling 策略来增强收敛性和精确度。该方法结合了 Langevin 动力学进行流形探索和近端优化进行模式搜索,从而在 FFHQ 和 DIV2K 数据集上的多种逆任务中实现了更好的重建保真度和感知质量,超越了现有的逆问题求解器。
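The FlowLPS entry above combines Langevin dynamics with proximal optimization. To make the two building blocks concrete, here is a toy one-dimensional sketch that alternates an unadjusted Langevin step under a standard normal prior with the closed-form proximal step of a quadratic data-fidelity term. This is not the FlowLPS algorithm, which operates on pretrained latent flow models; the prior, the measurement model, the alternation order, and all parameter values are assumptions chosen so the example stays self-contained.

```python
import math
import random


def standard_normal_score(x):
    """Gradient of the log-density of N(0, 1); stands in for a learned prior."""
    return -x


def langevin_step(x, score, step, rng):
    """One unadjusted Langevin update: drift along the score plus Gaussian noise."""
    return x + step * score(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)


def prox_quadratic(v, y, lam):
    """Proximal operator of f(x) = 0.5 * (x - y)**2 with weight lam.

    Solves argmin_x f(x) + (x - v)**2 / (2 * lam), which has a closed form.
    """
    return (v + lam * y) / (1.0 + lam)


def alternate(y, n_iters=200, step=0.01, lam=0.5, seed=0):
    """Alternate prior-driven exploration with a pull toward the measurement y."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(n_iters):
        x = langevin_step(x, standard_normal_score, step, rng)
        x = prox_quadratic(x, y, lam)
    return x


# Toy measurement (hypothetical): observe y = 2.0 and estimate x.
estimate = alternate(y=2.0)
```

The Langevin step keeps the iterate moving in regions the prior considers plausible, while the proximal step pulls it toward agreement with the measurement; larger `lam` weights the measurement more heavily.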
Latent Collaboration in Multi-Agent Systems
Authors: Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
First: 2025-11-25T18:56:57+00:00 · Latest: 2025-12-08T04:05:49+00:00
Comments: Project: https://github.com/Gen-Verse/LatentMAS
Abstract
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thought generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
中文标题/摘要
标题:多智能体系统的潜在协作
多智能体系统(MAS)将大型语言模型(LLMs)从独立的单模型推理扩展到协调的系统级智能。虽然现有的LLM智能体依赖于基于文本的中介进行推理和交流,但我们通过使模型能够在连续的潜在空间中直接协作,向前迈进了一步。我们引入了LatentMAS,这是一种端到端无需训练的框架,使LLM智能体之间能够纯粹地进行潜在协作。在LatentMAS中,每个智能体首先通过最后一层隐藏嵌入进行自回归潜在思维生成。共享的潜在工作记忆则保存并转移每个智能体的内部表示,确保无损信息交换。我们提供了理论分析,证明LatentMAS在表达能力和无损信息保存方面比传统的基于文本的MAS具有更高的性能,且复杂度显著降低。此外,在涵盖数学和科学推理、常识理解和代码生成的9个全面基准测试中,LatentMAS始终优于强大的单模型和基于文本的MAS基线,准确率最高提升14.6%,输出令牌使用量减少了70.8%-83.7%,端到端推理速度提高了4-4.3倍。这些结果表明,我们的新潜在协作框架在提高系统级推理质量的同时,还提供了显著的效率提升,无需额外训练。代码和数据已完全开源,可在 https://github.com/Gen-Verse/LatentMAS 获取。
Summary / 总结
The research aims to enhance multi-agent systems (MAS) by enabling direct collaboration among large language model (LLM) agents in a continuous latent space, rather than relying on text-based mediation. The LatentMAS framework allows agents to generate latent thoughts and share internal representations through a shared latent working memory, achieving higher expressiveness and lossless information exchange. Empirical evaluations across various benchmarks show that LatentMAS outperforms existing single-model and text-based MAS baselines, with up to 14.6% higher accuracy, 70.8%-83.7% fewer output tokens, and 4x-4.3x faster inference times without additional training. This demonstrates the framework's effectiveness in improving reasoning quality and efficiency. Code and data are available at https://github.com/Gen-Verse/LatentMAS.
研究旨在通过使LLM代理直接在连续的潜在空间中协作,而非依赖基于文本的中介,来提升多代理系统(MAS)。LatentMAS框架允许代理生成潜在思想并通过共享的潜在工作记忆传递内部表示,确保信息无损交换。跨多个基准的实证评估表明,LatentMAS在准确性和推理速度方面优于现有的单模型和基于文本的MAS基线,同时减少了输出标记的使用量。