arXiv Paper Digest

2025-09-25 03:35
Snapshot: 20250925_0335
A Gradient Flow Approach to Solving Inverse Problems with Latent Diffusion Models
Authors: Tim Y. J. Wang, O. Deniz Akyildiz
Venue: NeurIPS 2025
First: 2025-09-23T17:41:43+00:00 · Latest: 2025-09-23T17:41:43+00:00
Comments: Accepted at the 2nd Workshop on Frontiers in Probabilistic Inference: Sampling Meets Learning, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Solving ill-posed inverse problems requires powerful and flexible priors. We propose leveraging pretrained latent diffusion models for this task through a new training-free approach, termed Diffusion-regularized Wasserstein Gradient Flow (DWGF). Specifically, we formulate the posterior sampling problem as a regularized Wasserstein gradient flow of the Kullback-Leibler divergence in the latent space. We demonstrate the performance of our method on standard benchmarks using StableDiffusion (Rombach et al., 2022) as the prior.
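As background for the formulation described in the abstract, the unregularized Wasserstein gradient flow of the KL divergence can be written as below. This is the standard textbook form, not the paper's exact objective: the symbols $q_t$ (latent distribution at time $t$), $\pi$ (target posterior over latents), $\mathcal{D}$ (decoder) and $y$ (observation) are notation assumed here, and the diffusion-based regularizer that DWGF adds is not reproduced because the abstract does not specify it.

```latex
% Target posterior in latent space (assumed notation):
\pi(z) \;\propto\; p\bigl(y \mid \mathcal{D}(z)\bigr)\, p(z)

% Functional being minimized:
\mathcal{F}(q) \;=\; \mathrm{KL}(q \,\|\, \pi)
          \;=\; \int q(z)\, \log\frac{q(z)}{\pi(z)}\, \mathrm{d}z

% Wasserstein gradient flow of F (a Fokker-Planck equation):
\partial_t q_t \;=\; \nabla \cdot \Bigl( q_t \, \nabla \log \frac{q_t}{\pi} \Bigr)

% Equivalent particle dynamics (Langevin diffusion):
\mathrm{d}Z_t \;=\; \nabla \log \pi(Z_t)\, \mathrm{d}t \;+\; \sqrt{2}\, \mathrm{d}B_t
```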
Leveraging Large Models to Evaluate Novel Content: A Case Study on Advertisement Creativity
Authors: Zhaoyi Joey Hou, Adriana Kovashka, Xiang Lorraine Li
First: 2025-02-26T04:28:03+00:00 · Latest: 2025-09-23T17:34:10+00:00
Comments: To Appear in EMNLP2025
Abstract
Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.
Summary
The paper aims to evaluate the creativity of visual advertisements by breaking it down into atypicality and originality, using fine-grained human annotations. It proposes a benchmark for subjective creativity assessment and evaluates state-of-the-art vision language models against human judgments, highlighting both their potential and limitations in automatic creativity assessment.
Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation
Authors: Jihao Gu, Yingyao Wang, Meng Cao, Pi Bu, Jun Song, Yancheng He, Shilong Li, Bo Zheng
First: 2024-12-19T03:21:01+00:00 · Latest: 2025-09-23T17:03:44+00:00
Abstract
Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent progress, existing methods suffer from two drawbacks: 1) Lack of scalable token-level rewards; and 2) Neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visual-correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward as the difference of the logistic distributions of generated tokens conditioned on the raw image and the corrupted one. In addition, to highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization. Extensive experimental results demonstrate the state-of-the-art performance of the proposed TPO. For example, when built on top of LLAVA-1.5-7B, TPO yields absolute improvements on hallucination benchmarks.
Summary
The research aims to mitigate hallucinations in Large Vision Language Models (LVLMs) through token-level preference optimization. It introduces Token Preference Optimization (TPO) with self-calibrated visual-anchored rewards, which adaptively focuses on visual-correlated tokens without fine-grained annotations. The token-level reward is the difference between the logistic distributions of generated tokens conditioned on the raw image and on a corrupted one. Experiments built on LLAVA-1.5-7B report absolute improvements on hallucination benchmarks.
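The token-level reward described above can be illustrated with a minimal sketch. It assumes the reward for each generated token is the difference between that token's probability given the raw image and given the corrupted image; the function names, the use of softmax probabilities, and the toy logits are all assumptions for illustration, and the paper's self-calibration step is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_anchored_rewards(logits_raw, logits_corrupted, token_ids):
    """Per-token reward: p(token | raw image) - p(token | corrupted image).

    Hypothetical helper: each argument has one entry per generated token.
    """
    rewards = []
    for step_raw, step_cor, tok in zip(logits_raw, logits_corrupted, token_ids):
        p_raw = softmax(step_raw)[tok]
        p_cor = softmax(step_cor)[tok]
        rewards.append(p_raw - p_cor)
    return rewards


# Toy example: vocabulary of 3 tokens, 2 generated tokens.
# Token 0 depends on the image (its logit drops when the image is corrupted);
# token 1 does not (identical logits under both conditions).
logits_raw = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
logits_corrupted = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_ids = [0, 1]
rewards = visual_anchored_rewards(logits_raw, logits_corrupted, token_ids)
```

In this toy case the first token receives a positive reward and the second receives zero, which matches the intuition that tokens insensitive to image corruption are not visually anchored.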
Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability
Authors: Dong Shu, Haiyan Zhao, Jingyu Hu, Weiru Liu, Ali Payani, Lu Cheng, Mengnan Du
Venue: EMNLP 2025
First: 2025-01-02T16:53:50+00:00 · Latest: 2025-09-23T16:40:27+00:00
Comments: EMNLP 2025 Findings
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in processing both visual and textual information. However, the critical challenge of alignment between visual and textual representations is not fully understood. This survey presents a comprehensive examination of alignment and misalignment in LVLMs through an explainability lens. We first examine the fundamentals of alignment, exploring its representational and behavioral aspects, training methodologies, and theoretical foundations. We then analyze misalignment phenomena across three semantic levels: object, attribute, and relational misalignment. Our investigation reveals that misalignment emerges from challenges at multiple levels: the data level, the model level, and the inference level. We provide a comprehensive review of existing mitigation strategies, categorizing them into parameter-frozen and parameter-tuning approaches. Finally, we outline promising future research directions, emphasizing the need for standardized evaluation protocols and in-depth explainability studies.
Summary
This survey examines the alignment and misalignment in Large Vision-Language Models (LVLMs) through an explainability lens. It explores the representational and behavioral aspects of alignment, training methodologies, and theoretical foundations. The study identifies misalignment at object, attribute, and relational levels, which arise from challenges at the data, model, and inference levels. It reviews existing mitigation strategies and suggests future research directions, focusing on standardized evaluation protocols and in-depth explainability studies.
Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs
Authors: Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva
First: 2025-09-23T16:28:51+00:00 · Latest: 2025-09-23T16:28:51+00:00
Abstract
Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, but understanding long, dense captions remains an open challenge. We hypothesize that compositionality, the capacity to reason about object-attribute bindings and inter-object relationships, is key to understanding longer captions. In this paper, we investigate the interaction between compositionality and long-caption understanding, asking whether training for one property enhances the other. We train and evaluate a range of models that target each of these capabilities. Our results reveal a bidirectional relationship: compositional training improves performance on long-caption retrieval, and training on long captions promotes compositionality. However, these gains are sensitive to data quality and model design. We find that training on poorly structured captions, or with limited parameter updates, fails to support generalization. Likewise, strategies that aim at retaining general alignment, such as freezing positional embeddings, do not improve compositional understanding. Overall, we find that compositional understanding and long-caption understanding are intertwined capabilities that can be jointly learned through training on dense, grounded descriptions. Despite these challenges, we show that models trained on high-quality, long-caption data can achieve strong performance in both tasks, offering practical guidance for improving VLM generalization.
Summary
This study explores the relationship between compositionality and long-caption understanding in vision-language models (VLMs). The research aims to understand how training for one capability can enhance the other. Through a series of experiments, the authors find a bidirectional relationship where compositional training improves long-caption retrieval, and training on long captions enhances compositionality. However, these improvements are dependent on data quality and model design. The study suggests that high-quality, dense, and grounded descriptions are crucial for effective learning of both capabilities.
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
Authors: Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, Georgios Tzimiropoulos
Venue: EMNLP 2025
First: 2025-09-23T16:22:27+00:00 · Latest: 2025-09-23T16:22:27+00:00
Comments: Accepted at EMNLP 2025
Abstract
Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: https://github.com/IoannaNti/LexiCLIP
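The text-to-text retrieval paradigm described in the abstract can be sketched as follows: each image is replaced by a textual description, and a single text-side similarity ranks images against the query. The `embed` function here is a deliberately crude token-count stand-in for the trained text encoder a real system would use, and the image ids and descriptions are invented for illustration; none of this reflects the LexiCLIP code.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for a text encoder: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, descriptions):
    """Rank image ids by similarity between the query and each description."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), image_id)
              for image_id, desc in descriptions.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]


# Descriptions would come from a captioning model; these are made up.
descriptions = {
    "img_001": "a brown dog chasing a red ball on grass",
    "img_002": "a red car parked next to a brick wall",
}
ranking = retrieve("dog playing with a ball", descriptions)
```

No image pixels are touched at query time, which is the property behind the privacy argument in the abstract.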
Penalizing Boundary Activation for Object Completeness in Diffusion Models
Authors: Haoyang Xu, Tianhao Zhao, Sibei Yang, Yutian Lin
First: 2025-09-21T07:58:48+00:00 · Latest: 2025-09-23T16:17:58+00:00
Abstract
Diffusion models have emerged as a powerful technique for text-to-image (T2I) generation, creating high-quality, diverse images across various domains. However, a common limitation in these models is the incomplete display of objects, where fragments or missing parts undermine the model's performance in downstream applications. In this study, we conduct an in-depth analysis of the incompleteness issue and reveal that the primary factor behind incomplete object generation is the usage of RandomCrop during model training. This widely used data augmentation method, although it enhances model generalization ability, disrupts object continuity during training. To address this, we propose a training-free solution that penalizes activation values at image boundaries during the early denoising steps. Our method is easily applicable to pre-trained Stable Diffusion models with minimal modifications and negligible computational overhead. Extensive experiments demonstrate the effectiveness of our method, showing substantial improvements in object integrity and image quality.
Summary
This study addresses the issue of incomplete object generation in diffusion models used for text-to-image synthesis. The authors identify RandomCrop as the primary cause of this problem, as it disrupts object continuity during training. They propose a method that penalizes boundary activations during early denoising steps, which improves object integrity and image quality without significant computational overhead. Extensive experiments validate the effectiveness of this approach.
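A minimal sketch of what a boundary penalty could look like is given below. It assumes the penalty is the mean squared activation inside a band of pixels along the image edges; the abstract does not say which activations are penalized, how the penalty enters the denoising update, or how wide the band is, so `boundary_penalty`, `border`, and the toy maps are all assumptions.

```python
def boundary_penalty(activation, border=1):
    """Mean squared activation over a band `border` pixels wide along the edges.

    `activation` is a 2D list (rows of floats). Illustrative only.
    """
    height = len(activation)
    width = len(activation[0])
    total = 0.0
    count = 0
    for row in range(height):
        for col in range(width):
            on_border = (
                row < border or row >= height - border
                or col < border or col >= width - border
            )
            if on_border:
                total += activation[row][col] ** 2
                count += 1
    return total / count if count else 0.0


# An object centered in the frame activates no border pixels.
centered = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# An object pushed into the top-left corner is cut off by the frame.
cut_off = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
penalty_centered = boundary_penalty(centered)
penalty_cut_off = boundary_penalty(cut_off)
```

In a guidance-style setup, the gradient of such a penalty with respect to the latent would be used to steer the early denoising steps away from placing objects on the frame edge; that coupling is omitted here.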
Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models
Authors: Yueyan Li, Chenggong Zhao, Zeyuan Zang, Caixia Yuan, Xiaojie Wang
First: 2025-09-23T16:07:18+00:00 · Latest: 2025-09-23T16:07:18+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the "what" and "where" pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model's perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.
Summary
This paper aims to improve the understanding of Vision-Language Models (VLMs) by deconstructing their visual processing into object recognition and spatial perception. The authors find that VLMs process images in a two-stage manner, starting with attribute recognition and ending with semantic disambiguation. They also derive the geometric structure underlying positional representation in VLMs and introduce techniques to enhance decoding efficiency and spatial reasoning. The experiments validate these findings, providing insights into VLM internals and guiding future architectural designs.
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Authors: DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
Venue: EMNLP 2025
First: 2025-05-21T11:26:40+00:00 · Latest: 2025-09-23T15:14:42+00:00
Comments: Accepted to EMNLP 2025
Abstract
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.
Summary
This study evaluates the safety of vision-language models (VLMs) using a new benchmark, MemeSafetyBench, which pairs real meme images with both harmful and benign instructions. The research finds that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images, with memes significantly increasing harmful responses and decreasing refusals compared to text-only inputs. Multi-turn interactions provide some mitigation but do not fully address the elevated vulnerability. The study emphasizes the need for more ecologically valid evaluations and stronger safety mechanisms.
FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation
Authors: Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang
First: 2025-09-23T14:49:05+00:00 · Latest: 2025-09-23T14:49:05+00:00
Comments: project website: https://sites.google.com/view/funcanon, 11 pages
Abstract
Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.
Summary
The research aims to address the lack of generalization in robotic manipulation policies learned from end-to-end demonstrations. FunCanon is introduced as a framework that breaks down long-horizon tasks into action chunks defined by actors, verbs, and objects, focusing on action learning rather than isolated tasks. By performing functional object canonicalization, the framework enables pose-aware and category-general policies through shared functional frames and affordance cues. Experiments show category-level generalization, cross-task behavior reuse, and robust sim2real deployment, highlighting the effectiveness of functional canonicalization in complex manipulation domains.
Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications
Authors: Ganesh Mallya, Yotam Gigi, Dahun Kim, Maxim Neumann, Genady Beryozkin, Tomer Shekel, Anelia Angelova
First: 2025-09-23T14:40:52+00:00 · Latest: 2025-09-23T14:40:52+00:00
Abstract
Multi-spectral imagery plays a crucial role in diverse Remote Sensing applications including land-use classification, environmental monitoring and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation. This allows for more accurate identification, and their public availability from missions such as Sentinel-2 and Landsat only adds to their value. Currently, the automatic analysis of such data is predominantly managed through machine learning models specifically trained for multi-spectral input, which are costly to train and support. Furthermore, although they provide a lot of utility for Remote Sensing, such additional inputs cannot be used with powerful generalist large multimodal models, which are capable of solving many visual problems but are not able to understand specialized multi-spectral signals. To address this, we propose a training-free approach which introduces new multi-spectral data in a Zero-Shot-only mode, as inputs to generalist multimodal models trained on RGB-only inputs. Our approach leverages the multimodal models' understanding of the visual space, adapts the inputs to that space, and injects domain-specific information as instructions into the model. We exemplify this idea with the Gemini2.5 model and observe strong Zero-Shot performance gains of the approach on popular Remote Sensing benchmarks for land cover and land use classification, and demonstrate the easy adaptability of Gemini2.5 to new inputs. These results highlight the potential for geospatial professionals, working with non-standard specialized inputs, to easily leverage powerful multimodal models, such as Gemini2.5, to accelerate their work, benefiting from their rich reasoning and contextual capabilities, grounded in the specialized sensor data.
Summary
The paper addresses the challenge of using multi-spectral imagery in remote sensing applications by proposing a zero-shot multi-spectral learning approach. This method leverages generalist multimodal models, like Gemini2.5, which are trained on RGB inputs, to handle multi-spectral data without additional training. The approach shows strong performance gains on remote sensing benchmarks for land cover and land use classification, demonstrating the potential for geospatial professionals to easily adapt these models to specialized inputs.
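One way to picture "adapting multi-spectral inputs to the visual space and injecting domain information as instructions" is the sketch below. It assumes a false-colour composite (near-infrared mapped to the red channel, and so on) plus a text note explaining the mapping; the paper's actual band transformations and prompts are not given in the abstract, so the helper names and the prompt wording are hypothetical. The NDVI formula itself, (NIR - Red) / (NIR + Red), is the standard vegetation index.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel of reflectances."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


def to_pseudo_rgb(nir, red, green):
    """False-colour pixel: NIR -> R, red -> G, green -> B, scaled to 0..255.

    Inputs are assumed to be reflectances in [0, 1].
    """
    def scale(value):
        return max(0, min(255, round(value * 255)))
    return (scale(nir), scale(red), scale(green))


def build_instruction(channel_sources):
    """Compose the text that tells an RGB-trained model what each channel holds."""
    parts = [f"{channel} channel shows {source}"
             for channel, source in channel_sources]
    return "This is a false-colour composite: " + ", ".join(parts) + "."


pixel = to_pseudo_rgb(nir=1.0, red=0.0, green=0.2)
vegetation_index = ndvi(nir=0.5, red=0.1)
prompt = build_instruction([
    ("red", "NIR"),
    ("green", "the red band"),
    ("blue", "the green band"),
])
```

The composite image and the prompt would then be passed together to the multimodal model; that call is omitted because it depends on the specific model API.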
ColorBlindnessEval: Can Vision-Language Models Pass Color Blindness Tests?
Authors: Zijian Ling, Han Zhang, Yazhuo Zhou, Jiahao Cui
Venue: ICLR 2025
First: 2025-09-23T14:33:21+00:00 · Latest: 2025-09-23T14:33:21+00:00
Comments: Accepted at the Open Science for Foundation Models (SCI-FM) Workshop at ICLR 2025
Abstract
This paper presents ColorBlindnessEval, a novel benchmark designed to evaluate the robustness of Vision-Language Models (VLMs) in visually adversarial scenarios inspired by the Ishihara color blindness test. Our dataset comprises 500 Ishihara-like images featuring numbers from 0 to 99 with varying color combinations, challenging VLMs to accurately recognize numerical information embedded in complex visual patterns. We assess 9 VLMs using Yes/No and open-ended prompts and compare their performance with human participants. Our experiments reveal limitations in the models' ability to interpret numbers in adversarial contexts, highlighting prevalent hallucination issues. These findings underscore the need to improve the robustness of VLMs in complex visual environments. ColorBlindnessEval serves as a valuable tool for benchmarking and improving the reliability of VLMs in real-world applications where accuracy is critical.
Summary
ColorBlindnessEval is a novel benchmark to evaluate the robustness of Vision-Language Models (VLMs) in visually adversarial scenarios, inspired by the Ishihara color blindness test. The dataset includes 500 Ishihara-like images with varying color combinations, challenging VLMs to recognize numerical information. Experiments with 9 VLMs reveal limitations in interpreting numbers in adversarial contexts, with prevalent hallucination issues. This highlights the need to improve VLMs' robustness in complex visual environments.
CalFuse: Feature Calibration Enhanced Parameter Fusion for Class-Continual Learning
Authors: Juncen Guo, Siao Liu, Xiaoguang Zhu, Lianlong Sun, Liangyu Teng, Jingyi Wu, Di Li, Linxiao Gong, Weiwei Jiang, Wei Zhou, Ahmed Ghoneim, Liang Song
First: 2025-03-24T13:44:12+00:00 · Latest: 2025-09-23T13:56:52+00:00
Abstract
Class-Continual Learning (CCL) enables models to continuously learn new class knowledge while retaining previous classes, facilitating adaptation and evolution in dynamic, real-world environments. Traditional CCL methods primarily rely on visual features, which limits their effectiveness in complex, multimodal scenarios. In contrast, Vision-Language Models (VLMs) show promising potential for enhancing CCL by leveraging pre-trained knowledge and fusing multi-modal semantic cues such as text and vision. However, existing approaches struggle to mitigate catastrophic forgetting while preserving the generalization strengths of VLMs across diverse modalities. To address these challenges, we propose CalFuse, a framework for feature Calibration enhanced parameter Fusion, which enhances dynamic knowledge fusion. CalFuse introduces a dynamic feature calibration mechanism that iteratively adjusts the contribution of original visual features to the final class decision, thereby preserving the model's intrinsic generalization capability across modalities. Simultaneously, a parameter fusion strategy effectively fuses newly acquired knowledge with prior task parameters, maintaining a balance between acquiring new class representations and preserving old knowledge. Experimental results on popular benchmarks (e.g., CIFAR100 and ImageNet100) validate the superiority of the proposed method.
Summary
The research aims to enhance Class-Continual Learning (CCL) by addressing the limitations of traditional methods that rely solely on visual features. The proposed CalFuse framework introduces a dynamic feature calibration mechanism and a parameter fusion strategy to mitigate catastrophic forgetting and maintain generalization across diverse modalities. Experiments on CIFAR100 and ImageNet100 demonstrate the effectiveness of CalFuse in improving CCL performance.
研究旨在通过克服传统方法仅依赖视觉特征的局限性,改进类连续学习(CCL)。CalFuse框架通过引入动态特征校准机制和参数融合策略,逐步调整视觉特征的贡献,并在获取新知识与保留旧知识之间保持平衡。实验结果表明,CalFuse在CIFAR100和ImageNet100等基准数据集上能够有效缓解灾难性遗忘并保持跨模态的泛化能力。
Pure Vision Language Action (VLA) Models: A Comprehensive Survey
Authors: Dapeng Zhang, Jin Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, Qingguo Zhou
First: 2025-09-23T13:53:52+00:00 · Latest: 2025-09-23T13:53:52+00:00
Abstract
The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics, reframing Vision Language Models (VLMs) from passive sequence generators into active agents for manipulation and decision-making in complex, dynamic environments. This survey delves into advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research. It presents a comprehensive analysis of VLA applications across different scenarios and classifies VLA approaches into several paradigms: autoregression-based, diffusion-based, reinforcement-based, hybrid, and specialized methods; while examining their motivations, core strategies, and implementations in detail. In addition, foundational datasets, benchmarks, and simulation platforms are introduced. Building on the current VLA landscape, the review further proposes perspectives on key challenges and future directions to advance research in VLA models and generalizable robotics. By synthesizing insights from over three hundred recent studies, this survey maps the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose VLA methods.
中文标题/摘要
标题:纯视觉语言行动(VLA)模型综述
视觉语言行动(VLA)模型的出现标志着从传统基于策略的控制向通用机器人学的范式转变,重新定义了视觉语言模型(VLMs)从被动序列生成器到在复杂动态环境中进行操作和决策的主动代理。本文综述了先进的VLA方法,旨在提供清晰的分类和系统、全面的现有研究回顾。它对不同场景下的VLA应用进行了全面分析,并将VLA方法分类为自回归、扩散、强化学习、混合和专门方法;详细探讨了它们的动机、核心策略和实现。此外,还介绍了基础数据集、基准测试和模拟平台。基于当前的VLA景观,综述进一步提出了关键挑战和未来方向,以推动VLA模型和通用机器人学的研究。通过综合三百多篇近期研究的见解,本文勾勒了这一快速发展的领域的轮廓,并指出了将塑造可扩展、通用VLA方法发展的机会和挑战。
Summary / 总结
This survey explores the evolution of Vision Language Action (VLA) models, which shift the focus from policy-based control to generalized robotics. It classifies VLA approaches into autoregression-based, diffusion-based, reinforcement-based, hybrid, and specialized methods, and provides a comprehensive analysis of their applications and implementations. The survey also introduces foundational datasets, benchmarks, and simulation platforms, and proposes future directions to advance VLA research.
该综述探讨了将视觉、语言和行动结合的Vision Language Action (VLA)模型的发展,使其能够进行复杂的环境中的主动操作和决策。它将VLA方法分为自回归、扩散、强化学习、混合和专门化等类别,详细分析了它们的动机、策略和实现方式。综述还介绍了基础数据集和基准测试,并讨论了推进VLA模型和通用机器人技术的关键挑战和未来方向。
Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards
Authors: Honghao Chen, Xingzhou Lou, Xiaokun Feng, Kaiqi Huang, Xinlong Wang
Venue: NeurIPS 2025
First: 2025-09-23T13:47:32+00:00 · Latest: 2025-09-23T13:47:32+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggle to perform fine-grained structured reasoning and, more importantly, make it difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling accurate assessment of reasoning step quality and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.
中文标题/摘要
标题:揭示视觉语言模型的步骤推理链及其细粒度奖励
思维链推理在大型语言模型中已经显示出显著的成功,但在视觉语言推理中的应用仍然是一个开放的挑战,缺乏最佳实践。现有尝试通常在粗粒度级别上使用推理链,这在进行细粒度结构化推理时表现不佳,更重要的是,难以评估中间推理的奖励和质量。在本文中,我们深入探讨了视觉语言模型的步骤推理链,使推理步骤的质量评估更加准确,并实现了细粒度奖励下的有效强化学习和推理时的扩展。我们提出了一种简单、有效且完全透明的框架,包括步骤级推理数据、过程奖励模型(PRM)和强化学习训练。通过所提出的方法,我们的模型在具有挑战性的视觉语言基准测试中建立了强大的基线,并且在性能上持续改进。更重要的是,我们进行了详尽的经验分析和消融研究,揭示了每个组件的影响以及推理时扩展的几个有趣特性。我们认为本文为视觉语言模型提供了一个基线,并为更复杂的多模态推理提供了见解。我们的数据集、PRM和代码将在https://github.com/baaivision/CoS上提供。
Summary / 总结
This work addresses the challenge of fine-grained chain of step reasoning in vision-language models by introducing a transparent framework with step-level reasoning data and a process reward model (PRM). The method enables accurate evaluation of intermediate reasoning quality and effective reinforcement learning. Key experimental findings show consistent improvements on vision-language benchmarks and insights into inference-time scaling properties.
该研究通过引入包含步骤级推理数据和过程奖励模型(PRM)的透明框架,解决了视觉-语言模型中细粒度的链式推理挑战。该方法允许准确评估中间推理质量,并实现有效的强化学习。实验结果表明,在视觉-语言基准测试中表现出一致的改进,并揭示了推理时间缩放的特性。
No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning
Authors: Matheus Vinícius Todescato, Joel Luís Carbonera
First: 2025-09-23T12:54:52+00:00 · Latest: 2025-09-23T12:54:52+00:00
Comments: This paper was accepted at International Conference on Tools with Artificial Intelligence (ICTAI) 2025
Abstract
While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of large language models, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.
中文标题/摘要
标题:无需标签:基于协作自学习的零样本图像分类
尽管深度学习,包括卷积神经网络(CNN)和视觉变换器(ViT),在分类性能上取得了显著进步,但其对大量标注数据的依赖在许多实际场景中造成了重大障碍,尤其是在标注数据稀缺的情况下。视觉语言模型(VLM)和预训练视觉模型的迁移学习似乎为解决这一问题提供了有前景的方法。本文提出了一种新颖的零样本图像分类框架,该框架结合了VLM和预训练视觉模型,并在自学习循环中使用。该方法仅需类名集合和无标注训练数据,利用基于置信度的伪标签策略直接在测试数据上训练轻量级分类器,实现动态适应。VLM识别高置信度样本,预训练视觉模型增强其视觉表示。这些增强的特征随后迭代训练分类器,使系统能够在无监督的情况下捕捉互补的语义和视觉线索。值得注意的是,我们的方法避免了VLM微调和大型语言模型的使用,依赖于视觉模型减少对语义表示的依赖。在十个不同数据集上的实验评估表明,我们的方法优于基线零样本方法。
Summary / 总结
This paper introduces a zero-shot image classification framework that combines a vision-language model and a pre-trained visual model in a self-learning cycle, requiring only class names and no labeled training data. The method uses a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data: the VLM identifies high-confidence samples, the visual model enhances their representations, and the enhanced features iteratively train the classifier. Experiments on ten datasets show that this approach outperforms the baseline zero-shot method.
该论文提出了一种零样本图像分类框架,结合了视觉语言模型和预训练视觉模型,并在自学习循环中运行,仅需类名,无需标注训练数据。方法采用基于置信度的伪标签策略,在测试数据上训练轻量级分类器,利用增强的视觉表示迭代训练分类器,并捕捉互补的语义和视觉线索。在十个不同数据集上的实验结果表明,该方法优于基线零样本方法。
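The self-learning cycle described in this entry can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifier, the confidence threshold, or the number of rounds, so the nearest-prototype classifier, `threshold=0.8`, and `rounds=3` below are assumptions, and `vlm_probs` and `features` are hypothetical stand-ins for the VLM's zero-shot class probabilities and the pre-trained visual model's features.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_means(features, labels, num_classes):
    """Lightweight stand-in classifier: one mean feature vector per class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += f[d]
    # Classes that received no pseudo-labels get no prototype (None).
    return [[s / c for s in row] if c else None for row, c in zip(sums, counts)]

def predict_proba(prototypes, feature):
    """Softmax over negative squared distances to the available prototypes."""
    scores = []
    for p in prototypes:
        if p is None:
            scores.append(-1e9)  # effectively zero probability
        else:
            scores.append(-sum((a - b) ** 2 for a, b in zip(feature, p)))
    return softmax(scores)

def self_learn(vlm_probs, features, threshold=0.8, rounds=3):
    """Iteratively pseudo-label confident test samples and refit the classifier."""
    num_classes = len(vlm_probs[0])
    probs = vlm_probs  # round 0: confidences come from the VLM
    for _ in range(rounds):
        # Keep only samples whose top-class confidence passes the threshold.
        chosen = [(f, max(range(num_classes), key=p.__getitem__))
                  for f, p in zip(features, probs) if max(p) >= threshold]
        if not chosen:
            break
        feats, labels = zip(*chosen)
        prototypes = class_means(feats, labels, num_classes)
        # Re-score every test sample with the refitted classifier.
        probs = [predict_proba(prototypes, f) for f in features]
    return [max(range(num_classes), key=p.__getitem__) for p in probs]
```

In this toy setting, samples the VLM labels with low confidence (or wrongly) can be corrected once the classifier has been fitted on the confident ones, because the visual features cluster by class.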
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective
Authors: Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu
First: 2025-09-23T12:00:14+00:00 · Latest: 2025-09-23T12:00:14+00:00
Comments: a comprehensive visual spatial reasoning evaluation tool, 25 pages, 16 figures
Abstract
Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.
中文标题/摘要
标题:VLMs在视觉空间智能方面的差距:基于基准的视角
视觉空间推理(VSR)是人类核心认知能力之一,对于推进具身智能和自主系统至关重要。尽管在视觉语言模型(VLMs)方面取得了进展,但由于三维空间表示和推理的复杂性,实现人类水平的VSR仍然极具挑战性。本文系统地探讨了VLMs中的VSR,涵盖了输入模态、模型架构、训练策略和推理机制的现有方法综述。此外,我们将空间智能分为基本感知、空间理解、空间规划三个能力层次,并创建了SIBench,一个包含近20个开源数据集的视觉空间智能基准,覆盖23种任务设置。使用最先进的VLMs进行的实验揭示了感知与推理之间的显著差距,模型在基本感知任务中表现出色,但在理解和规划任务中表现不佳,特别是在数值估计、多视图推理、时间动态和空间想象方面。这些发现强调了实现空间智能所面临的巨大挑战,同时为该领域的未来研究提供了一个系统性的路线图和全面的基准。相关资源可访问:https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/。
Summary / 总结
This paper investigates the capability of Vision-Language Models (VLMs) in Visual Spatial Reasoning (VSR), a critical cognitive ability for embodied intelligence. The authors present SIBench, a comprehensive benchmark that includes nearly 20 open-source datasets across 23 task settings, and find that while VLMs excel in basic perceptual tasks, they underperform in understanding and planning tasks, highlighting significant gaps in their spatial reasoning abilities. This work provides a roadmap for future research in VSR and a benchmark for evaluating models.
本文研究了视觉语言模型(VLMs)在视觉空间推理(VSR)方面的能力,VSR是实现自主系统的关键认知能力。作者引入了SIBench,该基准包括近20个开源数据集,涵盖23种任务设置,以评估VSR。实验表明,尽管VLMs在基本感知任务中表现出色,但在理解和规划任务中表现不佳,突显了他们在空间推理方面存在的显著差距。这项工作为未来的研究提供了一个路线图和基准。
EventVL: Understand Event Streams via Multimodal Large Language Model
Authors: Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, Hui Xiong
First: 2025-01-23T14:37:21+00:00 · Latest: 2025-09-23T09:53:54+00:00
Abstract
Event-based Vision-Language Models (VLMs) have recently made good progress on practical vision tasks. However, most of these works merely use CLIP and focus on traditional perception tasks, which prevents the model from explicitly understanding sufficient semantics and context from event streams. To address this deficiency, we propose EventVL, the first generative event-based MLLM (Multimodal Large Language Model) framework for explicit semantic understanding. Specifically, to bridge the data gap in connecting the semantics of different modalities, we first annotate a large event-image/video-text dataset containing almost 1.4 million high-quality data pairs, which enables effective learning across various scenes, e.g., driving scenes or human motion. After that, we design Event Spatiotemporal Representation to fully explore comprehensive information by diversely aggregating and segmenting the event stream. To further promote a compact semantic space, Dynamic Semantic Alignment is introduced to improve and complete the sparse semantic spaces of events. Extensive experiments show that EventVL significantly surpasses existing MLLM baselines in event captioning and scene description generation tasks. We hope our research contributes to the development of the event vision community.
中文标题/摘要
标题:EventVL:通过多模态大型语言模型理解事件流
基于事件的视觉-语言模型(VLM)最近在实际视觉任务中取得了良好的进展。然而,这些工作大多仅利用CLIP专注于传统的感知任务,这阻碍了模型对事件流中充分的语义和上下文进行明确的理解。为了解决这一缺陷,我们提出了EventVL,这是第一个生成的基于事件的多模态大型语言模型(MLLM)框架,用于明确的语义理解。具体来说,为了弥合不同模态语义之间的数据差距,我们首先标注了一个包含近140万高质量数据对的大规模事件-图像/视频-文本数据集,这使得在各种场景中(例如驾驶场景或人体运动)的有效学习成为可能。之后,我们设计了事件时空表示,通过多样化地聚合和分割事件流来全面探索综合信息。为了进一步促进紧凑的语义空间,我们引入了动态语义对齐,以改进和补充事件的稀疏语义空间。广泛的实验表明,我们的EventVL在事件描述和场景描述生成任务中显著优于现有的MLLM基线。我们希望我们的研究能够促进事件视觉社区的发展。
Summary / 总结
EventVL is a generative event-based MLLM framework designed to enhance semantic understanding of event streams. It addresses the limitations of existing models by annotating a large dataset of event-image/video-text pairs and introducing Event Spatiotemporal Representation and Dynamic Semantic Alignment. Experimental results demonstrate that EventVL outperforms existing MLLM baselines in event captioning and scene description generation tasks.
EventVL 是一个生成性的事件基 MLLM 框架,旨在增强对事件流的语义理解。它通过标注大量事件-图像/视频-文本数据集,并引入事件时空表示和动态语义对齐来解决现有模型的局限性。实验结果表明,EventVL 在事件描述和场景描述生成任务中优于现有 MLLM 基线模型。
Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography
Authors: Gianmarco Spinaci, Lukas Klic, Giovanni Colavizza
First: 2025-09-23T09:23:31+00:00 · Latest: 2025-09-23T09:23:31+00:00
Comments: 11 pages, 2 figures
Abstract
This study evaluates the capabilities of Multimodal Large Language Models (LLMs) and Vision Language Models (VLMs) in the task of single-label classification of Christian Iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs, such as GPT-4o and Gemini 2.5, can interpret the Iconography, typically addressed by supervised classifiers, and evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? (RQ2) How does performance vary when enriching input with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets supporting Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to include the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where SigLIP reached the highest accuracy score, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal increments in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the application of LLMs as metadata curation tools in digital humanities workflows, suggesting future research on prompt optimization and the expansion of the study to other classification strategies and models.
中文标题/摘要
标题:在零样本和少样本场景中基准测试视觉语言和多模态大型语言模型:对基督教圣像学的研究
本研究评估了多模态大型语言模型(LLMs)和视觉语言模型(VLMs)在基督教圣像学单一标签分类任务中的能力。目标是评估通用视觉语言模型(如CLIP和SigLIP)和大型语言模型(如GPT-4o和Gemini 2.5)是否能够解释通常由监督分类器处理的圣像学,并评估其性能。分析由两个研究问题指导:(RQ1)多模态LLMs在基督教圣像图像分类中的表现如何?(RQ2)当输入中增加上下文信息或少样本示例时,性能如何变化?我们使用三个原生支持Iconclass的数据库进行基准测试:ArtDL、ICONCLASS和Wikidata,并筛选出前10个最频繁的类别。模型在三种条件下进行了测试:(1)使用类别标签进行分类,(2)使用Iconclass描述进行分类,(3)使用五个示例进行少样本学习。结果与在相同数据集上微调的ResNet50基线进行了比较。研究发现Gemini-2.5 Pro和GPT-4o优于ResNet50基线。在Wikidata数据集上,准确性显著下降,SigLIP达到了最高的准确性分数,表明模型对图像大小和元数据对齐的敏感性。增加类别描述通常提高了零样本性能,而少样本学习产生的结果较低,仅偶尔和轻微地提高了准确性。我们得出结论,通用多模态LLMs能够在视觉复杂的文化遗产领域进行分类。这些结果支持将LLMs作为数字人文工作流程中的元数据编目工具的应用,并建议未来研究提示优化和扩展研究到其他分类策略和模型。
Summary / 总结
This study evaluates the performance of multimodal large language models (LLMs) and vision language models (VLMs) in classifying Christian iconography. The research aimed to assess their ability to interpret images of Christian saints and compare their performance against ResNet50 baselines. Models were tested under three conditions: classification using class labels, classification with Iconclass descriptions, and few-shot learning with five exemplars. Results showed that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines, with accuracy dropping on the Wikidata dataset, indicating sensitivity to image size and metadata alignment. Enriching prompts with class descriptions improved zero-shot performance, but few-shot learning produced lower results.
该研究评估了多模态大型语言模型(LLMs)和视觉语言模型(VLMs)在分类基督教圣像方面的性能。研究旨在评估这些模型在通常由监督分类器处理的这一领域的能力。研究探讨了两个问题:多模态LLMs在基督教圣像图像分类中的表现如何,以及在提供上下文信息或少量示例时性能如何变化。使用三个数据集,研究在不同条件下测试了模型,并将其结果与ResNet50基线进行了比较。主要发现包括Gemini-2.5 Pro和GPT-4o优于基线,但在Wikidata数据集上的准确率下降,这表明模型对图像大小和元数据对齐的敏感性。增强提示中的类别描述提高了零样本性能,而少量样本学习产生的结果较低,仅偶尔有轻微的准确率提升。
Training-Free Data Assimilation with GenCast
Authors: Thomas Savary, François Rozet, Gilles Louppe
First: 2025-09-23T08:59:44+00:00 · Latest: 2025-09-23T08:59:44+00:00
Abstract
Data assimilation is widely used in many disciplines such as meteorology, oceanography, and robotics to estimate the state of a dynamical system from noisy observations. In this work, we propose a lightweight and general method to perform data assimilation using diffusion models pre-trained for emulating dynamical systems. Our method builds on particle filters, a class of data assimilation algorithms, and does not require any further training. As a guiding example throughout this work, we illustrate our methodology on GenCast, a diffusion-based model that generates global ensemble weather forecasts.
中文标题/摘要
标题:基于GenCast的无需训练数据同化
数据同化在气象学、海洋学和机器人学等多个学科中被广泛使用,用于从噪声观测中估计动力系统的状态。在本研究中,我们提出了一种轻量级且通用的方法,使用为模拟动力系统而预训练的扩散模型来进行数据同化。该方法基于粒子滤波器(一类数据同化算法),不需要任何进一步的训练。在整个研究中,我们以GenCast(一种生成全球集合天气预报的扩散模型)为例,展示了我们的方法论。
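Since the method in this entry builds on particle filters, a minimal bootstrap particle filter may help make the idea concrete. This is a generic textbook sketch, not the authors' algorithm: `emulator_step` is a hypothetical one-dimensional toy that stands in for drawing a forecast from a frozen, pre-trained emulator such as GenCast, and the Gaussian observation model and the parameter values are assumptions.

```python
import math
import random

def emulator_step(x, rng):
    """Toy stochastic forecast; stands in for sampling a pre-trained emulator."""
    return 0.9 * x + rng.gauss(0.0, 0.5)

def likelihood(y, x, obs_std):
    """Unnormalised Gaussian likelihood of observation y given state x."""
    return math.exp(-0.5 * ((y - x) / obs_std) ** 2)

def particle_filter(observations, num_particles=500, obs_std=0.5, seed=0):
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 2.0) for _ in range(num_particles)]
    estimates = []
    for y in observations:
        # 1. Forecast: push every particle through the frozen emulator.
        particles = [emulator_step(x, rng) for x in particles]
        # 2. Weight: score each forecast against the noisy observation.
        weights = [likelihood(y, x, obs_std) for x in particles]
        total = sum(weights)
        if total == 0.0:  # every weight underflowed; fall back to uniform
            weights = [1.0] * num_particles
            total = float(num_particles)
        # Weighted mean of the particles as the state estimate.
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # 3. Resample: duplicate likely particles, drop unlikely ones.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return estimates
```

No parameter of the emulator is updated anywhere in the loop, which is the sense in which such a scheme is training-free: observations only reweight and resample the forecasts.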
Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models
Authors: Xijun Wang, Junyun Huang, Rayyan Abdalla, Chengyuan Zhang, Ruiqi Xian, Dinesh Manocha
First: 2025-09-23T07:55:48+00:00 · Latest: 2025-09-23T07:55:48+00:00
Abstract
We address the critical gap between the computational demands of vision-language models and the possible ultra-low-bit weight precision (bitwidth $\leq2$ bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. Our formulation groups the model weights into outlier (salient) and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task across four different benchmarks and three different models. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe 90%-99% redundancy among image tokens. This allows us to further prune the visual tokens to improve efficiency.
中文标题/摘要
标题:Bi-VLM:在视觉语言模型中推动超低精度后训练量化边界
我们解决了视觉语言模型的计算需求与可能使用的超低位权重精度(位宽$\leq2$位)之间的关键差距,以提高效率。我们的工作动机源于视觉语言模型(VLM)的大量计算成本和内存需求,这限制了它们在硬件受限环境中的应用。我们提出了Bi-VLM,它基于高斯分位数非均匀地分离模型权重。我们的公式将模型权重分为异常值(显著)和多个内点(不显著)子集,确保每个子集包含与其分布中分位数成比例的权重。我们提出了一种基于显著性感知的混合量化算法,并根据显著性度量和压缩目标对缩放器和二进制矩阵施加不同的约束来量化权重。我们已在不同的视觉语言模型上评估了我们的方法。对于VLM的语言模型部分,我们的Bi-VLM在四个不同的基准和三种不同的模型上,在视觉问答任务中优于SOTA 3%-47%。对于整体VLM,我们的Bi-VLM优于SOTA 4%-45%。我们还在量化模型上进行了标记剪枝,并观察到量化模型中的图像标记有90%-99%的冗余。这有助于我们进一步剪枝视觉标记以提高效率。
Summary / 总结
This work addresses the computational challenges of vision-language models by proposing Bi-VLM, which uses a saliency-aware hybrid quantization algorithm and separates model weights into outlier and inlier subsets based on Gaussian quantiles. On visual question answering, it outperforms the state of the art by 3%-47% when quantizing the language model part of the VLM and by 4%-45% for the overall VLM, at ultra-low weight precision (bitwidth of at most 2 bits). Token pruning on the quantized models further reveals 90%-99% redundancy among image tokens, which can be pruned to improve efficiency.
研究针对视觉语言模型(VLMs)的计算挑战,提出了Bi-VLM方法,该方法基于高斯分位数非均匀地分离模型权重,确保每个权重子集对应于其在分布中的分位数,从而提高效率。Bi-VLM在视觉问答任务的语言模型部分上比当前最佳方法(SOTA)提高了3%-47%,在整体VLM上提高了4%-45%。此外,在量化模型上进行的标记剪枝表明图像标记存在90%-99%的冗余,可进一步剪枝以提高效率。
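The idea of splitting weights by Gaussian quantiles and binarizing each subset can be sketched as follows. This is a simplified illustration, not Bi-VLM's saliency-aware hybrid algorithm: the band boundaries in `tail_probs`, the use of a plain z-score as the saliency measure, and the per-band scale (the mean absolute value, which minimizes squared error for sign binarization) are all assumptions made for the example.

```python
from statistics import NormalDist, fmean, pstdev

def quantile_groups(weights, tail_probs=(0.5, 0.9)):
    """Assign each weight to a band by |z-score|, using Gaussian quantiles.

    tail_probs are central probability masses: with (0.5, 0.9), band 0 covers
    the central ~50% of a Gaussian, band 1 the next ~40%, and band 2 the
    outer ~10% (the outliers).
    """
    mu, sigma = fmean(weights), pstdev(weights)
    cuts = [NormalDist().inv_cdf(0.5 + p / 2) for p in tail_probs]
    groups = []
    for w in weights:
        z = abs(w - mu) / sigma if sigma else 0.0
        groups.append(sum(z > c for c in cuts))  # number of cuts exceeded
    return groups

def binarize_by_group(weights, groups):
    """Replace each weight by scale_g * sign(w), with one scale per band."""
    scales = {}
    for g in set(groups):
        members = [abs(w) for w, gi in zip(weights, groups) if gi == g]
        scales[g] = fmean(members)
    return [scales[g] * (1.0 if w >= 0 else -1.0) for w, g in zip(weights, groups)]
```

Giving each band its own scale keeps large-magnitude outliers from inflating the scale applied to the many small inlier weights, which is the usual motivation for treating salient weights separately. Note that storing the band index costs extra bits on top of the sign bit.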
FixingGS: Enhancing 3D Gaussian Splatting via Training-Free Score Distillation
Authors: Zhaorui Wang, Yi Gu, Deming Zhou, Renjing Xu
First: 2025-09-23T07:53:46+00:00 · Latest: 2025-09-23T07:53:46+00:00
Abstract
Recently, 3D Gaussian Splatting (3DGS) has demonstrated remarkable success in 3D reconstruction and novel view synthesis. However, reconstructing 3D scenes from sparse viewpoints remains highly challenging due to insufficient visual information, which results in noticeable artifacts persisting across the 3D representation. To address this limitation, recent methods have resorted to generative priors to remove artifacts and complete missing content in under-constrained areas. Despite their effectiveness, these approaches struggle to ensure multi-view consistency, resulting in blurred structures and implausible details. In this work, we propose FixingGS, a training-free method that fully exploits the capabilities of the existing diffusion model for sparse-view 3DGS reconstruction enhancement. At the core of FixingGS is our distillation approach, which delivers more accurate and cross-view coherent diffusion priors, thereby enabling effective artifact removal and inpainting. In addition, we propose an adaptive progressive enhancement scheme that further refines reconstructions in under-constrained regions. Extensive experiments demonstrate that FixingGS surpasses existing state-of-the-art methods with superior visual quality and reconstruction performance. Our code will be released publicly.
中文标题/摘要
标题:FixingGS:通过无训练评分蒸馏增强3D高斯斑点
近年来,3D高斯斑点(3DGS)在3D重建和新颖视图合成方面取得了显著的成功。然而,从稀疏视角重建3D场景仍然极具挑战性,因为视觉信息不足,导致3D表示中存在明显的伪影。为了解决这一局限性,最近的方法转向生成先验以去除伪影并完成欠约束区域的缺失内容。尽管这些方法非常有效,但它们难以确保多视图一致性,导致结构模糊和不合理的细节。在本文中,我们提出FixingGS,这是一种无训练方法,充分利用现有扩散模型的潜力以增强稀疏视角3DGS重建。FixingGS的核心是我们提出的蒸馏方法,它提供了更准确且跨视图一致的扩散先验,从而实现有效的伪影去除和修复。此外,我们还提出了一种自适应分阶段增强方案,进一步细化欠约束区域的重建。大量实验表明,FixingGS在视觉质量和重建性能方面超越了现有最先进的方法。我们的代码将公开发布。
Summary / 总结
FixingGS is a training-free method that enhances 3D Gaussian Splatting (3DGS) by leveraging a diffusion model to improve multi-view consistency and artifact removal. It uses score distillation to provide more accurate and coherent diffusion priors, and an adaptive progressive enhancement scheme to refine reconstructions in under-constrained areas. Experimental results show that FixingGS outperforms existing state-of-the-art methods in terms of visual quality and reconstruction performance.
FixingGS 是一种无需训练的方法,通过利用现有扩散模型的能力来增强 3D 高斯斑点。它使用分数蒸馏来提供更准确和跨视图一致的先验,从而有助于去除伪影和填补缺失内容。实验结果表明,FixingGS 在视觉质量和重建性能方面优于现有最先进的方法。
Knowledge Transfer from Interaction Learning
Authors: Yilin Gao, Kangyi Chen, Zhongxing Peng, Hengjie Lu, Shugong Xu
First: 2025-09-23T07:27:36+00:00 · Latest: 2025-09-23T07:27:36+00:00
Comments: Accepted by ICCV2025
Abstract
Current visual foundation models (VFMs) face a fundamental limitation in transferring knowledge from vision language models (VLMs). While VLMs excel at modeling cross-modal interactions through unified representation spaces, existing VFMs predominantly adopt result-oriented paradigms that neglect the underlying interaction processes. This representational discrepancy hinders effective knowledge transfer and limits generalization across diverse vision tasks. We propose Learning from Interactions (LFI), a cognitive-inspired framework that addresses this gap by explicitly modeling visual understanding as an interactive process. Our key insight is that capturing the dynamic interaction patterns encoded in pre-trained VLMs enables more faithful and efficient knowledge transfer to VFMs. The approach centers on two technical innovations: Interaction Queries, which maintain persistent relational structures across network layers, and interaction-based supervision, derived from the cross-modal attention mechanisms of VLMs. Comprehensive experiments demonstrate consistent improvements across multiple benchmarks, achieving 3.3 and 1.6mAP/2.4AP absolute gains on TinyImageNet classification and COCO detection/segmentation respectively, with minimal parameter overhead and faster convergence. The framework particularly excels in cross-domain settings, delivering 2.4 and 9.3 zero-shot improvements on PACS and VLCS. Human evaluations further confirm its cognitive alignment, outperforming result-oriented methods by 2.7 times in semantic consistency metrics.
中文标题/摘要
标题:从交互学习转移知识
当前的视觉基础模型(VFMs)在从视觉语言模型(VLMs)转移知识方面面临根本性的限制,而VLMs则擅长通过统一的表示空间建模跨模态交互。现有的VFMs主要采用结果导向的方法,忽视了底层的交互过程。这种表示差异阻碍了有效知识转移,并限制了在各种视觉任务中的泛化能力。我们提出了交互学习(LFI),这是一种认知启发式的框架,通过明确建模视觉理解为一个交互过程来解决这一差距。我们的核心见解是,捕捉预训练VLMs中编码的动态交互模式能够更忠实地高效地将知识转移到VFMs中。该方法的核心在于两个技术创新:交互查询,它在网络层中保持持久的关系结构;以及基于交互的监督,源自VLMs的跨模态注意力机制。全面的实验表明,该方法在多个基准测试中表现出一致的改进,在TinyImageNet分类和COCO检测/分割上分别实现了3.3和1.6mAP/2.4AP的绝对增益,且参数开销小,收敛速度快。该框架特别适用于跨域设置,在PACS和VLCS上分别实现了2.4和9.3的零样本改进。人类评估进一步证实了其认知一致性,其在语义一致性指标上比结果导向的方法高出2.7倍。
Summary / 总结
The paper addresses the challenge of knowledge transfer from vision language models (VLMs) to visual foundation models (VFMs), proposing Learning from Interactions (LFI) as a cognitive-inspired framework. LFI explicitly models visual understanding as an interactive process, using Interaction Queries to maintain relational structures and interaction-based supervision from VLMs' cross-modal attention mechanisms. Experiments show consistent improvements across benchmarks, with significant gains in zero-shot settings, and faster convergence with minimal parameter overhead.
论文通过提出学习交互(LFI)框架,将视觉语言模型(VLMs)的知识转移到视觉基础模型(VFMs)中,该框架将视觉理解明确建模为一个交互过程。LFI引入了交互查询来保持关系结构,并从跨模态注意力机制中获取交互监督。实验表明,该方法在多个基准上表现出一致的改进,分别在TinyImageNet和COCO上获得3.3和1.6 mAP/2.4 AP的绝对增益,并在跨域设置中实现显著的零样本改进。人类评估进一步证实了其认知一致性。
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Venue: NeurIPS 2025
First: 2025-09-17T11:28:58+00:00 · Latest: 2025-09-23T07:13:20+00:00
Comments: NeurIPS 2025
Abstract
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
中文标题/摘要
标题:ViSpec:通过视觉感知投机解码加速视觉语言模型
投机解码是一种广泛采用的技术,用于加速大型语言模型(LLMs)的推理,但其在视觉语言模型(VLMs)中的应用尚未得到充分探索,现有方法仅能实现轻微的加速(<1.5倍)。随着多模态能力在大规模模型中变得越来越重要,这一差距变得越来越显著。我们假设大型VLMs可以在逐层过滤冗余图像信息的同时不损害文本理解,而较小的草稿模型则难以做到这一点。为了解决这一问题,我们引入了视觉感知投机解码(ViSpec),这是一种针对VLMs的新型框架。ViSpec采用轻量级的视觉适配模块将图像标记压缩为紧凑表示,并将其无缝集成到草稿模型的注意力机制中,同时保留原始图像的位置信息。此外,我们为每个输入图像提取一个全局特征向量,并将该特征添加到所有后续文本标记中,以增强多模态的一致性。为了克服缺乏带有长助手响应的多模态数据集的问题,我们通过重新利用现有数据集,并使用目标VLM配合修改后的提示生成扩展输出,构建了一个专门的训练数据集。我们的训练策略降低了草稿模型利用对目标模型隐藏状态的直接访问的风险,否则在仅使用目标模型输出进行训练时可能会导致捷径学习。广泛的实验验证了ViSpec,据我们所知,这是首次在VLM投机解码中实现显著加速。代码可在https://github.com/KangJialiang/ViSpec获得。
Summary / 总结
ViSpec is a framework designed to accelerate vision-language models (VLMs) through speculative decoding. It employs a lightweight vision adaptor module that compresses image tokens into a compact representation, which is integrated into the draft model's attention mechanism while preserving image positional information. The method also enhances multimodal coherence by augmenting text tokens with a global image feature. According to the authors, experiments show that ViSpec provides the first substantial speedup in VLM speculative decoding, whereas existing methods achieve only modest speedups (<1.5x).
ViSpec 是一种新颖的推测性解码框架,旨在通过集成轻量级的视觉适配模块来加速视觉语言模型(VLM)。该模块将图像标记压缩为紧凑表示,并将其无缝集成到草稿模型的注意力机制中,同时保留图像的位置信息。此外,ViSpec 还通过全局特征向量增强后续的文本标记,以增强多模态的一致性。实验表明,ViSpec 实现了 VLM 推测性解码的首次显著加速,而现有方法仅能实现不足 1.5 倍的加速。训练使用专门的数据集进行,以降低草稿模型利用对目标模型隐藏状态的直接访问的风险,从而防止在仅使用目标模型输出进行训练时出现捷径学习。代码可在 GitHub 上获得。
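For readers unfamiliar with speculative decoding itself, the draft-then-verify loop can be sketched in its simplest greedy form. This sketch covers only the generic technique, not ViSpec's vision adaptor or global image feature. `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, and the inner verification loop stands in for what a real target model would compute in a single batched forward pass.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; output matches the target's greedy decoding."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # Draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Target model verifies the proposal; counted as one pass.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if i == k or draft[i] != expected:
                break  # bonus token added, or first mismatch corrected
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], target_passes
```

Because every emitted token is the target model's own choice, the output is unchanged; the speedup comes from the target needing fewer passes than tokens whenever the draft's guesses are accepted.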
What Makes You Unique? Attribute Prompt Composition for Object Re-Identification
Authors: Yingquan Wang, Pingping Zhang, Chong Sun, Dong Wang, Huchuan Lu
First: 2025-09-23T07:03:08+00:00 · Latest: 2025-09-23T07:03:08+00:00
Comments: Accepted by TCSVT2025
Abstract
Object Re-IDentification (ReID) aims to recognize individuals across non-overlapping camera views. While recent advances have achieved remarkable progress, most existing models are constrained to either single-domain or cross-domain scenarios, limiting their real-world applicability. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies that may inadvertently suppress identity-specific discriminative cues. To address these limitations, we propose an Attribute Prompt Composition (APC) framework, which exploits textual semantics to jointly enhance discrimination and generalization. Specifically, we design an Attribute Prompt Generator (APG) consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM). SAD is an over-complete attribute dictionary to provide rich semantic descriptions, while PCM adaptively composes relevant attributes from SAD to generate discriminative attribute-aware features. In addition, motivated by the strong generalization ability of Vision-Language Models (VLM), we propose a Fast-Slow Training Strategy (FSTS) to balance ReID-specific discrimination and generalizable representation learning. Specifically, FSTS adopts a Fast Update Stream (FUS) to rapidly acquire ReID-specific discriminative knowledge and a Slow Update Stream (SUS) to retain the generalizable knowledge inherited from the pre-trained VLM. Through a mutual interaction, the framework effectively focuses on ReID-relevant features while mitigating overfitting. Extensive experiments on both conventional and Domain Generalized (DG) ReID datasets demonstrate that our framework surpasses state-of-the-art methods, exhibiting superior performances in terms of both discrimination and generalization. The source code is available at https://github.com/AWangYQ/APC.
中文标题/摘要
标题:什么让你独一无二?对象重识别的属性提示组成
对象重识别(ReID)旨在跨非重叠摄像机视角识别个体。尽管最近的进步取得了显著进展,但大多数现有模型要么局限于单域场景,要么局限于跨域场景,限制了其实用性。单域模型倾向于过度拟合到特定领域的特征,而跨域模型则往往依赖于多样化的归一化策略,这可能会无意中抑制身份特定的区分性线索。为了解决这些限制,我们提出了一种属性提示组成(APC)框架,该框架利用文本语义共同增强区分性和泛化能力。具体而言,我们设计了一个属性提示生成器(APG),包括一个语义属性字典(SAD)和一个提示组成模块(PCM)。SAD是一个过完备的属性字典,提供丰富的语义描述,而PCM则从SAD中自适应地组合相关属性以生成区分性属性感知特征。此外,受视觉语言模型(VLM)的强大泛化能力启发,我们提出了一种快速-缓慢训练策略(FSTS)来平衡ReID特定的区分性和可泛化的表示学习。具体而言,FSTS采用快速更新流(FUS)迅速获取ReID特定的区分性知识,并采用慢速更新流(SUS)保留从预训练VLM继承的可泛化知识。通过相互作用,该框架有效地关注ReID相关的特征,同时减轻过拟合。在传统和域泛化(DG)ReID数据集上的广泛实验表明,我们的框架超越了最先进的方法,在区分性和泛化性方面表现出更优的性能。源代码可在 https://github.com/AWangYQ/APC 获取。
Summary / 总结
The paper addresses the limitations of existing Object Re-IDentification (ReID) models by proposing an Attribute Prompt Composition (APC) framework. This framework uses textual semantics to enhance both discrimination and generalization. It includes an Attribute Prompt Generator (APG) with a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM) to generate discriminative attribute-aware features. Additionally, a Fast-Slow Training Strategy (FSTS) is introduced to balance ReID-specific discrimination and generalizable representation learning. Experiments show that the proposed framework outperforms state-of-the-art methods in both discrimination and generalization on conventional and Domain Generalized (DG) ReID datasets.
论文提出了一种属性提示组成(APC)框架,以解决现有对象重识别模型的局限性。该框架利用文本语义来增强区分性和泛化能力。它包括一个属性提示生成器(APG),包含一个语义属性字典(SAD)和一个提示组成模块(PCM),以生成区分性特征。此外,还提出了一种快速-缓慢训练策略(FSTS),以平衡特定于重识别的区分性和可泛化的表示学习。实验表明,该框架在各种数据集上在区分性和泛化能力方面均优于现有最佳方法。
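The fast/slow split described in the abstract can be illustrated with a toy two-stream update. This is only a sketch under stated assumptions: the abstract does not say how the Slow Update Stream is updated, so an exponential moving average (EMA) is swapped in here as one common choice, and the names `sgd_step`, `ema_update`, `lr`, and `momentum` are hypothetical.

```python
def sgd_step(weights, grads, lr=0.1):
    """Fast stream: take a full gradient step on every iteration."""
    return [w - lr * g for w, g in zip(weights, grads)]


def ema_update(slow, fast, momentum=0.99):
    """Slow stream (assumed EMA): drift a small step toward the fast weights."""
    return [momentum * s + (1.0 - momentum) * f for s, f in zip(slow, fast)]


# Both streams start from the same (stand-in) pre-trained weights.
fast = [1.0, -2.0]
slow = list(fast)

for _ in range(5):
    grads = [2.0 * w for w in fast]  # gradient of the toy loss sum(w**2)
    fast = sgd_step(fast, grads)     # adapts quickly to the task
    slow = ema_update(slow, fast)    # stays close to the starting point
```

After a few steps the fast weights have moved substantially while the slow weights remain near their initial values, which is the qualitative behaviour the abstract attributes to the two streams.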
RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images
Authors: Ke Li, Di Wang, Ting Wang, Fuyu Dong, Yiming Zhang, Luyao Zhang, Xiangyu Wang, Shaofeng Li, Quan Wang
First: 2025-09-23T06:52:15+00:00 · Latest: 2025-09-23T06:52:15+00:00
Abstract
Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (Footnote: In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.) (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.
中文标题/摘要
标题:RSVG-ZeroOV:探索用于遥感图像零样本开放词汇视觉定位的无需训练框架
遥感视觉定位(RSVG)旨在基于自由形式的自然语言表达在遥感图像中定位物体。现有方法通常受限于封闭词汇集,限制了其在开放世界场景中的应用。虽然最近的尝试利用通用基础模型进行开放词汇RSVG,但它们过度依赖昂贵的高质量数据集和耗时的微调。为解决这些限制,我们提出了一种无需训练的框架RSVG-ZeroOV,旨在探索冻结的通用基础模型在零样本开放词汇RSVG中的潜力。具体而言,RSVG-ZeroOV包括三个关键阶段:(i)概览:我们利用视觉语言模型(VLM)获得交叉注意力图,捕捉文本查询与视觉区域之间的语义关联。(ii)聚焦:通过利用扩散模型(DM)的精细建模先验,我们填补了对象在结构和形状信息方面的空白,这些信息往往被VLM忽略。(iii)进化:引入了一个简单而有效的注意力进化模块,抑制无关激活,生成所指物体的净化分割掩码。无需繁琐的任务特定训练,RSVG-ZeroOV提供了一种高效且可扩展的解决方案。大量实验表明,所提出的框架始终优于现有的弱监督和零样本方法。
Summary / 总结
RSVG-ZeroOV is a training-free framework designed to address the limitations of existing approaches in remote sensing visual grounding (RSVG) for open-vocabulary scenarios. It consists of three stages: Overview, Focus, and Evolve. The Overview stage uses a vision-language model to capture semantic correlations between text queries and visual regions. The Focus stage leverages a diffusion model to refine object details. The Evolve stage introduces an attention evolution module to suppress irrelevant activations, resulting in purified segmentation masks. Experiments show that RSVG-ZeroOV outperforms existing methods in weakly-supervised and zero-shot RSVG tasks.
RSVG-ZeroOV 是一个无需训练的框架,旨在解决现有方法在开放词汇场景下的遥感视觉定位(RSVG)问题。它包括三个阶段:概览、聚焦和进化。概览阶段使用视觉语言模型捕获文本查询与视觉区域之间的语义关联。聚焦阶段利用扩散模型细化对象细节。进化阶段引入了一个注意力进化模块来抑制无关激活,从而生成净化的分割掩码。实验表明,RSVG-ZeroOV 在弱监督和零样本 RSVG 任务中优于现有方法。
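As a rough illustration of what "suppressing irrelevant activations" in an attention map can mean, the sketch below min-max normalizes a 2D map and keeps only the cells above a relative threshold. This is a deliberately simple stand-in and not the paper's attention evolution module, whose details the abstract does not give; `purify_mask` and `rel_threshold` are hypothetical names.

```python
def purify_mask(attn, rel_threshold=0.5):
    """Min-max normalize a 2D attention map, then binarize it.

    Cells whose normalized activation falls below rel_threshold are set to 0,
    the rest to 1. A constant map yields an all-zero mask.
    """
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0 for _ in row] for row in attn]
    return [[1 if (v - lo) / span >= rel_threshold else 0 for v in row]
            for row in attn]


# Toy 3x3 attention map with a strong response in the lower-right region.
attn = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.8],
    [0.1, 0.7, 0.2],
]
mask = purify_mask(attn)
```

In this toy case only the three strongly activated cells survive in `mask`; weak background responses are zeroed out.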
Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
Authors: Zhitao Zeng, Guojian Yuan, Junyuan Mao, Yuxuan Wang, Xiaoshuang Jia, Yueming Jin
Venue: NeurIPS 2025
First: 2025-09-22T07:22:27+00:00 · Latest: 2025-09-23T06:52:04+00:00
Comments: 20 pages, 6 figures
Abstract
Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look-ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer-grained than states of spatial relationships. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we present a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we present a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.
中文标题/摘要
标题:通过增量生成和多智能体协作实现多尺度时间预测
准确的时间预测是全面场景理解与具身人工智能之间的桥梁。然而,对于视觉语言模型来说,在多个时间尺度上预测场景的多个细粒度状态是困难的。我们通过将多尺度分解为两个正交维度来形式化多尺度时间预测(MSTP)任务:时间尺度,预测不同展望间隔下的人类和手术状态;状态尺度,建模一般和手术场景中的状态层次结构。例如,在一般场景中,接触关系的状态比空间关系的状态更细粒度。在手术场景中,中等水平的步骤比高级阶段更细粒度,但仍受其包含阶段的约束。为了支持这一统一任务,我们引入了第一个MSTP基准,该基准在多个状态尺度和时间尺度上提供了同步注释。我们还提出了一种方法,增量生成和多智能体协作(IG-MC),该方法结合了两项关键创新。首先,我们提出了一种即插即用的增量生成模块,该模块在扩展的时间尺度上连续生成最新的视觉预览,以通知多个决策智能体,使决策和生成的视觉保持同步,并防止展望间隔延长时性能下降。其次,我们提出了一种以决策为导向的多智能体协作框架,用于多状态预测,该框架包括生成、启动和多状态评估智能体,它们动态触发和评估预测周期,以平衡全局一致性和局部保真度。
Summary / 总结
The research addresses the challenge of predicting multiple fine-grained scene states at multiple temporal scales, which the authors frame as the bridge between scene understanding and embodied AI. It formalizes the Multi-Scale Temporal Prediction (MSTP) task for general and surgical scenes and introduces the first MSTP Benchmark, with synchronized annotations across state scales and temporal scales. The proposed method, Incremental Generation and Multi-agent Collaboration (IG-MC), combines a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews with a decision-driven multi-agent collaboration framework for multi-state prediction. The generation module is designed to keep decisions and generated visuals synchronized and to prevent performance degradation as look-ahead intervals lengthen.
研究旨在解决在多个时间尺度上预测场景中多个细粒度状态的挑战,这对于实现具身AI至关重要。论文形式化了一般场景和手术场景中的多尺度时间预测(MSTP)任务,并引入了首个MSTP基准。所提出的方法增量生成和多智能体协作(IG-MC)包含一个持续更新视觉预览的即插即用增量生成模块,以及一个用于多状态预测的决策驱动多智能体协作框架。增量生成模块旨在使决策与生成的视觉保持同步,并防止展望间隔延长时性能下降。
Hierarchical Neural Semantic Representation for 3D Semantic Correspondence
Authors: Keyu Du, Jingyu Hu, Haipeng Li, Hao Xu, Haibing Huang, Chi-Wing Fu, Shuaicheng Liu
Venue: SIGGRAPH Asia 2025
First: 2025-09-22T07:23:07+00:00 · Latest: 2025-09-23T05:56:37+00:00
Comments: Accepted to the SIGGRAPH Asia 2025 conference track
Abstract
This paper presents a new approach to estimate accurate and robust 3D semantic correspondence with the hierarchical neural semantic representation. Our work has three key contributions. First, we design the hierarchical neural semantic representation (HNSR), which consists of a global semantic feature to capture high-level structure and multi-resolution local geometric features to preserve fine details, by carefully harnessing 3D priors from pre-trained 3D generative models. Second, we design a progressive global-to-local matching strategy, which establishes coarse semantic correspondence using the global semantic feature, then iteratively refines it with local geometric features, yielding accurate and semantically-consistent mappings. Third, our framework is training-free and broadly compatible with various pre-trained 3D generative backbones, demonstrating strong generalization across diverse shape categories. Our method also supports various applications, such as shape co-segmentation, keypoint matching, and texture transfer, and generalizes well to structurally diverse shapes, with promising results even in cross-category scenarios. Both qualitative and quantitative evaluations show that our method outperforms previous state-of-the-art techniques.
中文标题/摘要
标题:基于层次神经语义表示的3D语义对应
本文提出了一种新的方法,利用层次神经语义表示估计准确且鲁棒的3D语义对应。我们的工作有三个关键贡献。首先,我们设计了层次神经语义表示(HNSR),它由全局语义特征组成,用于捕获高层结构,并结合多分辨率局部几何特征以保留细部特征,通过精心利用预训练的3D生成模型中的3D先验知识。其次,我们设计了一种逐步全局到局部匹配策略,该策略使用全局语义特征建立粗略的语义对应,然后通过迭代细化这些对应,产生准确且语义一致的映射。第三,我们的框架无需训练,并且与各种预训练的3D生成主干网络广泛兼容,展示了在多种形状类别中的强大泛化能力。我们的方法还支持各种应用,如形状共分割、关键点匹配和纹理转移,并且在结构多样化的形状上表现出良好的泛化能力,即使在跨类别场景中也能取得有希望的结果。定性和定量评估表明,我们的方法优于之前的最先进的技术。
Summary / 总结
This paper introduces a hierarchical neural semantic representation (HNSR) to estimate accurate and robust 3D semantic correspondence. The method combines a global semantic feature for high-level structure and multi-resolution local geometric features for fine details, using pre-trained 3D generative models. It also includes a progressive global-to-local matching strategy that starts with coarse semantic correspondence and iteratively refines it, leading to semantically consistent mappings. The framework is training-free and compatible with various 3D generative models, showing strong generalization across different shape categories and supporting applications like shape co-segmentation, keypoint matching, and texture transfer. Evaluations demonstrate superior performance compared to previous techniques.
本文提出了一种层次神经语义表示(HNSR)方法,用于估计准确且鲁棒的3D语义对应。该方法包括用于高阶结构的全局语义特征和用于细节的多分辨率局部几何特征,利用预训练的3D生成模型中的3D先验。还包含一种渐进的全局到局部匹配策略,通过迭代使用局部几何特征细化粗略的语义对应,从而产生语义一致的映射。该框架无需训练且兼容各种预训练的3D生成模型,展示了在多种形状类别中的强大泛化能力,并支持如形状共分割和纹理转移等应用。
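The global-to-local matching idea can be sketched as a two-stage search: shortlist candidates by a global feature, then pick among them by a local feature. This is a simplified single-pass illustration, not the paper's iterative multi-resolution procedure; the feature vectors, `coarse_to_fine_match`, and `top_k` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def coarse_to_fine_match(src, tgt, top_k=2):
    """For each source point, return the index of its matched target point.

    src and tgt are lists of (global_feature, local_feature) pairs.
    """
    matches = []
    for g_src, l_src in src:
        # Coarse stage: rank all targets by global-feature similarity.
        ranked = sorted(range(len(tgt)),
                        key=lambda j: cosine(g_src, tgt[j][0]),
                        reverse=True)
        shortlist = ranked[:top_k]
        # Fine stage: choose within the shortlist by local-feature similarity.
        best = max(shortlist, key=lambda j: cosine(l_src, tgt[j][1]))
        matches.append(best)
    return matches


src = [([1.0, 0.0], [1.0, 0.0, 0.0]),
       ([0.0, 1.0], [0.0, 1.0, 0.0])]
tgt = [([0.0, 1.0], [0.0, 1.0, 0.0]),
       ([1.0, 0.1], [0.0, 0.0, 1.0]),
       ([1.0, 0.0], [1.0, 0.0, 0.0])]
matches = coarse_to_fine_match(src, tgt)
```

Restricting the fine stage to a shortlist keeps the local comparison from latching onto geometrically similar but semantically unrelated regions, which is the motivation the abstract gives for going from global to local.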
NaviSense: A Multimodal Assistive Mobile application for Object Retrieval by Persons with Visual Impairment
Authors: Ajay Narayanan Sridhar, Fuli Qiao, Nelson Daniel Troncoso Aldas, Yanpei Shi, Mehrdad Mahdavi, Laurent Itti, Vijaykrishnan Narayanan
First: 2025-09-23T05:45:11+00:00 · Latest: 2025-09-23T05:45:11+00:00
Abstract
People with visual impairments often face significant challenges in locating and retrieving objects in their surroundings. Existing assistive technologies present a trade-off: systems that offer precise guidance typically require pre-scanning or support only fixed object categories, while those with open-world object recognition lack spatial feedback for reaching the object. To address this gap, we introduce 'NaviSense', a mobile assistive system that combines conversational AI, vision-language models, augmented reality (AR), and LiDAR to support open-world object detection with real-time audio-haptic guidance. Users specify objects via natural language and receive continuous spatial feedback to navigate toward the target without needing prior setup. Designed with insights from a formative study and evaluated with 12 blind and low-vision participants, NaviSense significantly reduced object retrieval time and was preferred over existing tools, demonstrating the value of integrating open-world perception with precise, accessible guidance.
中文标题/摘要
标题:NaviSense:一种多模态辅助移动应用,用于视障人士的物体检索
视障人士在定位和检索周围环境中的物体时经常面临重大挑战。现有的辅助技术存在权衡:提供精确指导的系统通常需要预扫描或仅支持固定物体类别,而具有开放世界物体识别的系统缺乏空间反馈以引导用户接近物体。为了解决这一差距,我们引入了‘NaviSense’,这是一种结合了对话式AI、视觉语言模型、增强现实(AR)和LiDAR的移动辅助系统,支持实时音频触觉指导下的开放世界物体检测。用户可以通过自然语言指定物体,并在无需事先设置的情况下接收持续的空间反馈以导航至目标。NaviSense的设计基于形成性研究,并通过12名盲人和低视力参与者进行评估,显著减少了物体检索时间,并且比现有工具更受欢迎,证明了将开放世界感知与精确、可访问的指导相结合的价值。
Summary / 总结
NaviSense is a mobile application designed to help people with visual impairments locate and retrieve objects in their environment. It uses a combination of conversational AI, vision-language models, augmented reality, and LiDAR to provide real-time audio-haptic guidance for open-world object detection. The system was tested with 12 blind and low-vision participants and showed a significant reduction in object retrieval time compared to existing tools, highlighting the benefits of integrating open-world perception with precise guidance.
NaviSense 是一款移动应用,旨在帮助视力受损的人找到并取回环境中的物品。它结合了对话式 AI、视觉语言模型、增强现实和 LiDAR,提供实时的音频触觉指导以进行开放世界物体检测。该系统经过 12 名盲人和低视力参与者测试,显示了与现有工具相比,物体检索时间显著减少,突显了将开放世界感知与精确指导相结合的好处。
Learning neuroimaging models from health system-scale data
Authors: Yiwei Lyu, Samir Harake, Asadur Chowdury, Soumyanil Banerjee, Rachel Gologorsky, Shixuan Liu, Anna-Katharina Meissner, Akshay Rao, Chenhui Zhao, Akhil Kondepudi, Cheng Jiang, Xinhai Hou, Rushikesh S. Joshi, Volker Neuschmelting, Ashok Srinivasan, Dawn Kleindorfer, Brian Athey, Vikas Gulani, Aditya Pandey, Honglak Lee, Todd Hollon
First: 2025-09-23T04:49:59+00:00 · Latest: 2025-09-23T04:49:59+00:00
Abstract
Neuroimaging is a ubiquitous tool for evaluating patients with neurological diseases. The global demand for magnetic resonance imaging (MRI) studies has risen steadily, placing significant strain on health systems, prolonging turnaround times, and intensifying physician burnout [Chen2017-bt, Rula2024-qp-1]. These challenges disproportionately impact patients in low-resource and rural settings. Here, we utilized a large academic health system as a data engine to develop Prima, the first vision language model (VLM) serving as an AI foundation for neuroimaging that supports real-world, clinical MRI studies as input. Trained on over 220,000 MRI studies, Prima uses a hierarchical vision architecture that provides general and transferable MRI features. Prima was tested in a 1-year health system-wide study that included 30K MRI studies. Across 52 radiologic diagnoses from the major neurologic disorders, including neoplastic, inflammatory, infectious, and developmental lesions, Prima achieved a mean diagnostic area under the ROC curve of 92.0, outperforming other state-of-the-art general and medical AI models. Prima offers explainable differential diagnoses, worklist priority for radiologists, and clinical referral recommendations across diverse patient demographics and MRI systems. Prima demonstrates algorithmic fairness across sensitive groups and can help mitigate health system biases, such as prolonged turnaround times for low-resource populations. These findings highlight the transformative potential of health system-scale VLMs and Prima's role in advancing AI-driven healthcare.
中文标题/摘要
标题:从健康系统规模数据中学习神经影像学模型
神经影像学是评估神经疾病患者的一种普遍工具。全球磁共振成像(MRI)研究的需求持续上升,给健康系统带来了巨大压力,延长了周转时间,并加剧了医生的职业倦怠[Chen2017-bt, Rula2024-qp-1]。这些挑战在低资源和农村地区患者中尤为突出。在这里,我们利用一个大型学术健康系统作为数据引擎,开发了Prima,这是第一个用于神经影像学的视觉语言模型(VLM),支持实际临床MRI研究作为输入。Prima基于超过220,000份MRI研究训练,采用分层视觉架构,提供通用和可转移的MRI特征。Prima在为期一年的健康系统范围内的研究中进行了测试,包括30,000份MRI研究。在52种主要神经科疾病放射学诊断中,包括肿瘤、炎症、感染和发育性病变,Prima的受试者操作特征曲线下面积均值达到92.0,优于其他最先进的通用和医学AI模型。Prima提供了可解释的鉴别诊断、放射科医生的工作列表优先级以及跨不同患者群体和MRI系统的临床转诊建议。Prima展示了对敏感群体的算法公平性,并有助于缓解健康系统偏见,如低资源人群的长期周转时间。这些发现突显了健康系统规模VLM的变革潜力以及Prima在推动AI驱动的医疗保健方面的作用。
Summary / 总结
This study addresses the challenges of MRI demand and turnaround times in health systems by developing Prima, a vision language model trained on over 220,000 MRI studies. Prima, which uses a hierarchical vision architecture, achieved a mean diagnostic area under the ROC curve of 92.0 across 52 radiologic diagnoses, outperforming other state-of-the-art models. It provides explainable diagnoses, prioritizes radiologist worklists, and offers clinical referral recommendations, demonstrating algorithmic fairness and helping mitigate health system biases.
该研究通过开发基于超过220,000份MRI研究的Prima视觉语言模型,解决了MRI需求和周转时间在健康系统中的挑战。Prima采用分层视觉架构,在52种放射学诊断中实现了92.0的ROC曲线下面积,优于其他最先进的模型。它提供了可解释的诊断、优先处理放射科医生的工作列表,并提供临床转诊建议,展示了算法公平性,并有助于缓解健康系统中的偏见。
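For readers unfamiliar with the headline metric, the sketch below shows how a mean area under the ROC curve across several diagnoses can be computed. It assumes an unweighted (macro) average over diagnoses, which the abstract does not state explicitly, and uses made-up toy labels and scores rather than any data from the study. The reported 92.0 is presumably on a 0 to 100 scale.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auroc(per_diagnosis):
    """Unweighted mean of per-diagnosis AUROC values."""
    values = [auroc(labels, scores) for labels, scores in per_diagnosis.values()]
    return sum(values) / len(values)


# Hypothetical toy data: diagnosis name -> (binary labels, model scores).
toy = {
    "diagnosis_a": ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    "diagnosis_b": ([1, 0, 0, 0], [0.8, 0.3, 0.2, 0.1]),
}
result = mean_auroc(toy)
```

Here `diagnosis_a` scores 0.75 and `diagnosis_b` scores 1.0, so `result` is 0.875, or 87.5 on a 0 to 100 scale.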
History