Visual Spatial Tuning
Authors: Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao
First: 2025-11-07T18:59:16+00:00 · Latest: 2025-11-07T18:59:16+00:00
Abstract
Capturing spatial relationships from visual inputs is a cornerstone of
human-like general intelligence. Several previous studies have tried to enhance
the spatial awareness of Vision-Language Models (VLMs) by adding extra expert
encoders, which brings extra overhead and usually harms general capabilities.
To enhance the spatial ability in general architectures, we introduce Visual
Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with
human-like visuospatial abilities, from spatial perception to reasoning. We
first attempt to enhance spatial perception in VLMs by constructing a
large-scale dataset termed VST-P, which comprises 4.1 million samples spanning
19 skills across single views, multiple images, and videos. Then, we present
VST-R, a curated dataset with 135K samples that instruct models to reason in
space. In particular, we adopt a progressive training pipeline: supervised
fine-tuning to build foundational spatial knowledge, followed by reinforcement
learning to further improve spatial reasoning abilities. Without side effects
on general capabilities, the proposed VST consistently achieves
state-of-the-art results on several spatial benchmarks, including $34.8\%$ on
MMSI-Bench and $61.2\%$ on VSIBench. It turns out that
Vision-Language-Action models can be significantly enhanced with the proposed
spatial tuning paradigm, paving the way for more physically grounded AI.
中文标题/摘要
标题:视觉空间调谐
从视觉输入中捕捉空间关系是类人通用智能的基础。多项先前研究通过添加额外的专家编码器来增强视觉语言模型(VLMs)的空间意识,这带来了额外的开销并且通常损害了通用能力。为了在通用架构中增强空间能力,我们引入了视觉空间调谐(VST),这是一个全面的框架,旨在培养具有类人视觉空间能力的VLMs,从空间感知到推理。我们首先尝试通过构建一个名为VST-P的大规模数据集来增强VLMs的空间感知,该数据集包含410万样本,跨越单视角、多张图像和视频的19项技能。然后,我们提出了VST-R,一个包含13.5万样本的精编数据集,指导模型在空间中进行推理。特别是,我们采用了一种渐进式训练管道:监督微调以构建基础的空间知识,随后是强化学习以进一步提高空间推理能力。在不损害通用能力的情况下,所提出的VST在多个空间基准测试中始终取得了最先进的结果,包括MMSI-Bench上的34.8%和VSIBench上的61.2%。结果表明,所提出的空间调谐范式可以显著增强视觉语言行动模型,为更物理化的AI铺平了道路。
Summary / 总结
The research aims to improve the spatial awareness of Vision-Language Models (VLMs) by introducing Visual Spatial Tuning (VST), a comprehensive framework. VST includes a large-scale dataset VST-P for spatial perception and a curated dataset VST-R for spatial reasoning. Through a progressive training pipeline, VST enhances foundational spatial knowledge and further improves spatial reasoning abilities. The proposed method achieves state-of-the-art results on spatial benchmarks, such as 34.8% on MMSI-Bench and 61.2% on VSIBench, without harming general capabilities. Vision-Language-Action models are significantly enhanced with this spatial tuning paradigm, advancing physically grounded AI systems.
研究旨在通过增强视觉语言模型(VLMs)的空间意识,提高其类人通用智能。作者引入了视觉空间调优(VST)框架,包括用于空间感知的大规模数据集(VST-P)和用于空间推理的精选数据集(VST-R)。通过渐进式训练管道,模型首先进行监督微调以建立基础的空间知识,然后通过强化学习进一步提高空间推理能力。VST框架在空间基准测试中始终取得了最先进的成果,如MMSI-Bench的34.8%和VSIBench的61.2%,且未损害通用能力。这种方法显著增强了视觉语言行动模型,推动了物理上更可靠的AI系统的发展。
On the Brittleness of CLIP Text Encoders
Authors: Allie Tran, Luca Rossetto
First: 2025-11-06T10:33:55+00:00 · Latest: 2025-11-07T18:05:14+00:00
Comments: Accepted for publication at MMM'26. Analysis code can be found here:
https://github.com/allie-tran/clip-brittleness
Abstract
Multimodal co-embedding models, especially CLIP, have advanced the state of
the art in zero-shot classification and multimedia information retrieval in
recent years by aligning images and text in a shared representation space.
However, such models trained on a contrastive alignment can lack stability
towards small input perturbations. Especially when dealing with manually
expressed queries, minor variations in the query can cause large differences in
the ranking of the best-matching results. In this paper, we present a
systematic analysis of the effect of multiple classes of non-semantic query
perturbations in a multimedia information retrieval scenario. We evaluate a
diverse set of lexical, syntactic, and semantic perturbations across multiple
CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video
collection. Across models, we find that syntactic and semantic perturbations
drive the largest instabilities, while brittleness is concentrated in trivial
surface edits such as punctuation and case. Our results highlight robustness as
a critical dimension for evaluating vision-language models beyond benchmark
accuracy.
中文标题/摘要
标题:CLIP文本编码器的脆弱性
多模态联合嵌入模型,尤其是CLIP,在近年来通过在共享表示空间中对齐图像和文本方面推动了零样本分类和多媒体信息检索的最新进展。然而,这些在对比对齐上训练的模态可能对小输入扰动缺乏稳定性。特别是在处理手动表达的查询时,查询中的细微变化会导致最佳匹配结果排名的巨大差异。在本文中,我们对多媒体信息检索场景中多种非语义查询扰动的影响进行了系统分析。我们使用TRECVID即席视频搜索查询和V3C1视频集合,对多种CLIP变体进行了多样化的词法、句法和语义扰动评估。在不同模型中,我们发现句法和语义扰动导致了最大的不稳定性,而脆弱性集中在诸如标点符号和大小写等简单的表面编辑上。我们的结果强调了鲁棒性是评估视觉语言模型的关键维度,而不仅仅是基准准确性。
Summary / 总结
This paper investigates the brittleness of CLIP text encoders in multimedia information retrieval, focusing on how small perturbations in queries can lead to large changes in ranking results. The authors analyze lexical, syntactic, and semantic perturbations across different CLIP models using TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. They find that syntactic and semantic perturbations cause the most instability, while brittleness is concentrated in trivial surface edits such as punctuation and case. The study emphasizes robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.
本文研究了CLIP文本编码器在多媒体信息检索中的脆弱性,重点关注查询中的小变化如何导致显著的结果变化。作者系统地分析了不同CLIP变体在TRECVID查询和V3C1视频集合上的各种类型的扰动,包括词汇、语法和语义变化。研究发现,语法和语义扰动导致的不稳定性最大,而标点符号和大小写的细微编辑是最脆弱的。该研究强调了在仅基准准确性之外评估视觉语言模型时,鲁棒性的重要性。
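The kind of measurement described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the toy captions in `DOCS`, the scorer `toy_score_fn` (a case-sensitive token-overlap stand-in for a CLIP text encoder), and the choice of Jaccard overlap at k as the stability measure are all assumptions made for this example.

```python
import string

# Hypothetical toy captions standing in for a video collection.
DOCS = {
    "v1": "a person riding a bicycle",
    "v2": "a dog running on a beach",
    "v3": "two people cooking in a kitchen",
}


def toy_score_fn(query):
    """Case-sensitive token overlap; a stand-in for a real text-video similarity."""
    q_tokens = set(query.split())
    return {doc_id: float(len(q_tokens & set(text.split())))
            for doc_id, text in DOCS.items()}


def perturb_case(query):
    """Surface edit: upper-case the whole query."""
    return query.upper()


def strip_punctuation(query):
    """Surface edit: remove all ASCII punctuation."""
    return query.translate(str.maketrans("", "", string.punctuation))


def top_k(scores, k):
    """Return the k best item ids, breaking score ties by id for determinism."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def jaccard_at_k(ranking_a, ranking_b):
    """Set overlap between two top-k lists (1.0 = identical sets)."""
    a, b = set(ranking_a), set(ranking_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_stability(query, perturb, score_fn, k=10):
    """Compare the top-k results for a query and its perturbed version."""
    original = top_k(score_fn(query), k)
    perturbed = top_k(score_fn(perturb(query)), k)
    return jaccard_at_k(original, perturbed)
```

With a real model, `toy_score_fn` would be replaced by a function that embeds the query and scores it against precomputed video embeddings; the behavior of the toy scorer says nothing about how CLIP itself reacts to these edits.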
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models
Authors: Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, Matthieu Cord
Venue: Proceedings of the 2025 International Joint Conference on Neural
Networks (IJCNN 2025)
First: 2024-03-29T10:38:25+00:00 · Latest: 2025-11-07T17:56:20+00:00
Abstract
Foundation models have exhibited unprecedented capabilities in tackling many
domains and tasks. Models such as CLIP are currently widely used to bridge
cross-modal representations, and text-to-image diffusion models are arguably
the leading models in terms of realistic image generation. Image generative
models are trained on massive datasets that provide them with powerful internal
spatial representations. In this work, we explore the potential benefits of
such representations, beyond image generation, in particular, for dense visual
prediction tasks. We focus on the task of image segmentation, which is
traditionally solved by training models on closed-vocabulary datasets, with
pixel-level annotations. To avoid the annotation cost or training large
diffusion models, we constrain our setup to be zero-shot and training-free. In
a nutshell, our pipeline leverages different and relatively small-sized,
open-source foundation models for zero-shot open-vocabulary segmentation. The
pipeline is as follows: the image is passed to both a captioner model (i.e.,
BLIP) and a diffusion model (i.e., Stable Diffusion Model) to generate a text
description and visual representation, respectively. The features are clustered
and binarized to obtain class agnostic masks for each object. These masks are
then mapped to a textual class, using the CLIP model to support
open-vocabulary. Finally, we add a refinement step that allows us to obtain a more
precise segmentation mask. Our approach (dubbed FreeSeg-Diff), which does not
rely on any training, outperforms many training-based approaches on both Pascal
VOC and COCO datasets. In addition, we show very competitive results compared
to the recent weakly-supervised segmentation approaches. We provide
comprehensive experiments showing the superiority of diffusion model features
compared to other pretrained models. Project page:
https://bcorrad.github.io/freesegdiff/
中文标题/摘要
标题:FreeSeg-Diff:无需训练的开放词汇分割方法与扩散模型
基础模型在处理多个领域和任务方面展现了前所未有的能力。目前,CLIP等模型广泛用于跨模态表示的桥梁构建,而文本到图像的扩散模型无疑是生成逼真图像的领先模型。图像生成模型通过大规模数据集的训练,获得了强大的内部空间表示。在本文中,我们探讨了这些表示在图像生成之外的潜在益处,特别是对于密集视觉预测任务。我们专注于图像分割任务,该任务传统上通过在封闭词汇数据集上训练模型并使用像素级注释来解决。为了避免注释成本或训练大型扩散模型,我们将设置限制为零样本且无需训练。简而言之,我们的管道利用不同的、相对较小的开源基础模型进行零样本开放词汇分割。该管道如下:图像被传递给一个描述生成模型(例如BLIP)和一个扩散模型(例如Stable Diffusion Model),以生成文本描述和视觉表示。特征被聚类并二值化,以获得每个对象的类别无关掩码。然后使用CLIP模型将这些掩码映射到文本类别,以支持开放词汇。最后,我们增加了一步细化步骤,以获得更精确的分割掩码。我们的方法(称为FreeSeg-Diff)不依赖于任何训练,在Pascal VOC和COCO数据集上均优于许多基于训练的方法。此外,我们展示了与最近的弱监督分割方法相比具有竞争力的结果。我们提供了全面的实验,展示了扩散模型特征优于其他预训练模型的优越性。项目页面:https://bcorrad.github.io/freesegdiff/
Summary / 总结
This work explores the use of diffusion models for zero-shot, training-free open-vocabulary image segmentation, leveraging the spatial representations that image generative models learn from massive datasets. The image is passed through a captioner model and a diffusion model to generate a text description and a visual representation, respectively; the visual features are then clustered and binarized to create class-agnostic masks. These masks are mapped to textual classes using CLIP, and a refinement step is applied to obtain more precise masks. The approach, named FreeSeg-Diff, outperforms many training-based methods on the Pascal VOC and COCO datasets and shows competitive results compared to recent weakly-supervised segmentation approaches.
该研究探索了利用扩散模型进行零样本开放词汇图像分割的方法,利用训练过程中学习的空间表示来完成图像生成之外的任务。该方法使用一个描述生成模型和一个扩散模型生成文本和视觉表示,然后对其进行聚类和二值化以获得类无差别掩码。这些掩码使用CLIP映射到文本类别,并通过一个细化步骤提高分割精度。该方法名为FreeSeg-Diff,在Pascal VOC和COCO数据集上优于基于训练的方法,并且与最近的弱监督分割方法具有竞争力的结果。
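The "cluster and binarize" step can be illustrated with a small sketch. It assumes per-pixel feature vectors are already available as a nested list (in the paper they come from a diffusion model; here they are made up), and it uses plain k-means with a farthest-point initialization, which is a choice made for this example and not necessarily the clustering the authors use. The function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, iters=10):
    """Tiny deterministic k-means: farthest-point init, then Lloyd iterations."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(pt, centroids[c]))
                  for pt in points]
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def class_agnostic_masks(feature_map, k):
    """Cluster an H x W grid of feature vectors and return k binary masks."""
    height, width = len(feature_map), len(feature_map[0])
    flat = [feature_map[y][x] for y in range(height) for x in range(width)]
    labels = kmeans_labels(flat, k)
    return [
        [[1 if labels[y * width + x] == c else 0 for x in range(width)]
         for y in range(height)]
        for c in range(k)
    ]
```

Each returned mask marks the pixels assigned to one cluster; mapping a mask to a class name (done with CLIP in the paper) and the refinement step are outside the scope of this sketch.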
LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence
Authors: Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, Ningbo Dai, Renzhe Xu, Shuyang Li, Tianyang Zhang, Yue He, Yuanrui Wang, Yunjia Zhang, Zijing Xu, Dongzhe Li, Fang Gao, Hao Zou, Jiandong Liu, Jiashuo Liu, Jiawei Xu, Kaijie Cheng, Kehan Li, Linjun Zhou, Qing Li, Shaohua Fan, Xiaoyu Lin, Xinyan Han, Xuanyue Li, Yan Lu, Yuan Xue, Yuanyuan Jiang, Zimu Wang, Zhenlei Wang, Peng Cui
First: 2025-09-03T17:39:08+00:00 · Latest: 2025-11-07T16:49:47+00:00
Comments: 61 pages
Abstract
We argue that progress toward general intelligence requires complementary
foundation models grounded in language, the physical world, and structured
data. This report presents LimiX-16M and LimiX-2M, two instantiations of our
large structured-data models (LDMs). Both models treat structured data as a
joint distribution over variables and missingness, and are thus capable of addressing a
wide range of tabular tasks through query-based conditional prediction via a
single model. They are pretrained using masked joint-distribution modeling with
an episodic, context-conditional objective, supporting rapid, training-free
adaptation at inference. We evaluate LimiX models across 11 large
structured-data benchmarks with broad regimes of sample size, feature
dimensionality, class number, categorical-to-numerical feature ratio,
missingness, and sample-to-feature ratios. LimiX-16M consistently surpasses
strong baselines, as shown in Figure 1 and Figure 2. The superiority holds
across a wide range of tasks, such as classification, regression, missing value
imputation, and data generation, often by substantial margins, while avoiding
task-specific architectures or bespoke training per task. Notably, LimiX-2M
delivers strong results under tight compute and memory budgets. We also present
the first scaling law study for LDMs, revealing how data and model scaling
jointly influence downstream performance and offering quantitative guidance for
tabular foundation modeling. All LimiX models are publicly accessible under
Apache 2.0.
中文标题/摘要
标题:LimiX:释放通用智能的结构化数据建模能力
我们认为通用智能的进步需要语言、物理世界和结构化数据的互补基础模型。本报告介绍了LimiX-16M和LimiX-2M,这是我们的大型结构化数据模型(LDM)的两种实现。这两种模型将结构化数据视为变量和缺失值的联合分布,因此能够通过查询条件预测解决广泛的表格任务。它们通过掩码联合分布建模进行预训练,具有事件条件上下文目标,支持快速、无需训练的推理适应。我们在11个大型结构化数据基准测试中评估了LimiX模型,这些基准测试涵盖了样本大小、特征维度、类别数量、分类到数值特征的比例、缺失值以及样本到特征比率的广泛范围。如图1和图2所示,LimiX-16M始终超越了强大的基线模型。这种优越性在分类、回归、缺失值填充和数据生成等多种任务中都得到了体现,通常差距很大,而无需特定任务的架构或针对每个任务的定制训练。值得注意的是,LimiX-2M在计算和内存预算紧张的情况下也能取得出色的结果。我们还首次对LDM进行了扩展律研究,揭示了数据和模型扩展如何共同影响下游性能,并为表格基础建模提供了定量指导。所有LimiX模型均在Apache 2.0许可下公开。
Summary / 总结
The research aims to advance general intelligence by developing foundation models grounded in structured data. The study introduces LimiX-16M and LimiX-2M, which are pretrained using masked joint-distribution modeling and support rapid, training-free adaptation at inference. LimiX-16M outperforms strong baselines across 11 structured-data benchmarks in tasks such as classification, regression, missing value imputation, and data generation, without task-specific architectures, while LimiX-2M delivers strong results under tight compute and memory budgets. Additionally, the research presents the first scaling law study for large structured-data models, showing how data and model scaling jointly influence downstream performance.
研究旨在通过构建基于结构化数据的模型来推进通用智能。研究引入了LimiX-16M和LimiX-2M,这两种模型通过掩码联合分布预训练,并能在推理时快速适应。这些模型在11个结构化数据基准测试中表现出色,优于强基线,在分类、回归和缺失值填充等任务上表现出优越性能,无需特定任务的架构。此外,研究还提供了第一个大型结构化数据模型的扩展定律研究,提供了数据和模型扩展对性能影响的定量指导。
Inference-Time Hyper-Scaling with KV Cache Compression
Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti
Venue: NeurIPS 2025
First: 2025-06-05T17:59:55+00:00 · Latest: 2025-11-07T16:42:30+00:00
Comments: Accepted to NeurIPS 2025
Abstract
Inference-time scaling trades efficiency for increased reasoning accuracy by
generating longer or more parallel sequences. However, in Transformer LLMs,
generation cost is bottlenecked by the size of the key-value (KV) cache, rather
than the number of generated tokens. Hence, we explore inference-time
hyper-scaling: by compressing the KV cache, we can generate more tokens within
the same compute budget and further improve the accuracy of scaled inference.
The success of this approach, however, hinges on the ability of compression
methods to preserve accuracy even at high compression ratios. To make
hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a
novel method for sparsifying KV caches that only requires 1K training steps to
achieve 8$\times$ compression, while maintaining better accuracy than
training-free sparse attention. Instead of prematurely discarding cached
tokens, DMS delays token eviction, implicitly merging representations and
preserving critical information. We demonstrate the effectiveness of
inference-time hyper-scaling with DMS on multiple families of LLMs, showing
that it boosts accuracy for comparable inference latency and memory load. For
instance, we enhance Qwen-R1 32B by 12.0 points on AIME 24, 8.6 on GPQA, and
9.7 on LiveCodeBench on average for an equivalent number of memory reads.
中文标题/摘要
标题:推理时的超缩放与KV缓存压缩
推理时的缩放通过生成更长或更并行的序列,在提高推理准确性的同时牺牲了效率。然而,在Transformer大语言模型(LLM)中,生成成本主要受限于键值(KV)缓存的大小,而不是生成的令牌数量。因此,我们探索了推理时的超缩放:通过压缩KV缓存,我们可以在相同的计算预算内生成更多令牌,并进一步提高缩放推理的准确性。然而,这种方法的成功取决于压缩方法在高压缩比下仍能保持准确性。为了使超缩放实用,我们引入了动态内存稀疏化(DMS),这是一种新型的KV缓存稀疏化方法,只需1000个训练步骤即可实现8倍压缩,同时保持比无训练稀疏注意更好的准确性。DMS不会过早地丢弃缓存的令牌,而是延迟令牌的移除,隐式地合并表示并保留关键信息。我们通过多种LLM家族展示了DMS在推理时超缩放的有效性,表明它在相似的推理延迟和内存负载下提高了准确性。例如,我们通过DMS将Qwen-R1 32B在AIME 24上的得分提高了12.0分,在GPQA上提高了8.6分,在LiveCodeBench上提高了9.7分,这些改进是在相同数量的内存读取下实现的。
Summary / 总结
This paper explores inference-time hyper-scaling in Transformer LLMs: compressing the key-value (KV) cache makes it possible to generate more tokens within the same compute budget, thereby improving accuracy. The authors introduce Dynamic Memory Sparsification (DMS), which achieves 8x compression after only 1K training steps while maintaining better accuracy than training-free sparse attention. For an equivalent number of memory reads, DMS enhances Qwen-R1 32B by 12.0 points on AIME 24, 8.6 on GPQA, and 9.7 on LiveCodeBench on average.
论文探讨了通过压缩关键值(KV)缓存来进行推理时的超缩放,在保持相同计算预算的情况下生成更多令牌,从而提高准确性。该方法Dynamic Memory Sparsification (DMS) 仅需1K训练步骤即可实现8倍压缩,同时保持比无训练稀疏注意更好的准确性。DMS 延迟令牌移除,保留关键信息。实验表明,DMS 在各种LLM上提高了准确性,对于相同的内存读取次数,Qwen-R1 32B 在AIME 24上的提升为12.0分,在GPQA上的提升为8.6分,在LiveCodeBench上的提升为9.7分。
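The idea of delaying eviction, as opposed to dropping a cached token the moment it is marked, can be sketched with a toy cache. This is only an illustration of the bookkeeping: in DMS the eviction decisions are learned, whereas here the hypothetical `evict` flag is simply supplied by the caller, and the class name, the `delay` parameter, and the string payloads are all assumptions made for this example.

```python
from collections import deque


class DelayedEvictionCache:
    """Toy KV cache where a token marked for eviction survives `delay` more steps."""

    def __init__(self, delay):
        self.delay = delay
        self.entries = {}       # position -> cached payload (stand-in for K/V tensors)
        self.pending = deque()  # (step at which eviction is due, position)
        self.step = 0

    def append(self, payload, evict):
        """Cache a new token; if flagged, schedule its eviction for a later step."""
        position = self.step
        self.entries[position] = payload
        if evict:
            self.pending.append((position + self.delay, position))
        # Execute evictions whose delay has elapsed. Due steps are appended in
        # increasing order, so only the front of the queue needs checking.
        while self.pending and self.pending[0][0] <= self.step:
            _, old_position = self.pending.popleft()
            del self.entries[old_position]
        self.step += 1

    def positions(self):
        """Positions of tokens still readable by attention."""
        return sorted(self.entries)
```

In this toy, a token flagged at step t remains visible to the next `delay - 1` appended tokens before it is removed, which is the window during which later tokens could still absorb its information.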
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Haohan Wang
First: 2024-02-05T18:54:43+00:00 · Latest: 2025-11-07T16:36:35+00:00
Comments: 28 pages
Abstract
The discovery of "jailbreaks" to bypass safety filters of Large Language
Models (LLMs) and harmful responses has encouraged the community to implement
safety measures. One major safety measure is to proactively test the LLMs with
jailbreaks prior to the release. Therefore, such testing will require a method
that can generate jailbreaks massively and efficiently. In this paper, we
follow a novel yet intuitive strategy to generate jailbreaks in the style of
the human generation. We propose a role-playing system that assigns four
different roles to the user LLMs to collaborate on new jailbreaks. Furthermore,
we collect existing jailbreaks and split them into different independent
characteristics using clustering frequency and semantic patterns sentence by
sentence. We organize these characteristics into a knowledge graph, making them
more accessible and easier to retrieve. Our system of different roles will
leverage this knowledge graph to generate new jailbreaks, which have proved
effective in inducing LLMs to generate unethical or guideline-violating
responses. In addition, we also pioneer a setting in our system that will
automatically follow the government-issued guidelines to generate jailbreaks to
test whether LLMs follow the guidelines accordingly. We refer to our system as
GUARD (Guideline Upholding through Adaptive Role-play Diagnostics). We have
empirically validated the effectiveness of GUARD on three cutting-edge
open-sourced LLMs (Vicuna-13B, LongChat-7B, and Llama-2-7B), as well as a
widely-utilized commercial LLM (ChatGPT). Moreover, our work extends to the
realm of vision language models (MiniGPT-v2 and Gemini Vision Pro), showcasing
GUARD's versatility and contributing valuable insights for the development of
safer, more reliable LLM-based applications across diverse modalities.
中文标题/摘要
标题:GUARD:通过角色扮演生成自然语言脱管测试大型语言模型指南遵守情况
大型语言模型(LLMs)的安全过滤器绕过和有害响应的发现促使社区实施安全措施。一个主要的安全措施是在发布前用脱管测试LLMs。因此,这种测试需要一种可以大规模高效生成脱管的方法。在本文中,我们采用了一种新颖且直观的策略,以人类生成的方式生成脱管。我们提出了一种角色扮演系统,将四种不同角色分配给用户LLMs,以协作生成新的脱管。此外,我们收集了现有的脱管,并逐句使用聚类频率和语义模式将其划分为不同的独立特征。我们将这些特征组织成知识图谱,使其更易于访问和检索。我们系统的不同角色将利用这一知识图谱生成新的脱管,这些脱管已被证明能有效促使LLMs生成不道德或违反指南的响应。此外,我们还在系统中首创了一种设置,该设置将自动遵循政府发布的指南来生成脱管,以测试LLMs是否遵守指南。我们将我们的系统称为GUARD(通过适应性角色扮演诊断维护指南)。我们已在三个前沿开源LLM(Vicuna-13B、LongChat-7B和Llama-2-7B)以及广泛使用的商用LLM(ChatGPT)上实证验证了GUARD的有效性。此外,我们的工作扩展到了视觉语言模型(MiniGPT-v2和Gemini Vision Pro)的领域,展示了GUARD的多功能性,并为不同模态的更安全、更可靠的LLM基础应用的发展提供了宝贵的见解。
Summary / 总结
GUARD is a role-playing system designed to generate natural-language jailbreaks for testing the safety of large language models (LLMs). It assigns four roles to user LLMs to collaboratively create jailbreaks, and organizes existing jailbreaks into a knowledge graph to facilitate the generation of new ones. GUARD has been empirically validated on various LLMs, including Vicuna-13B, LongChat-7B, Llama-2-7B, ChatGPT, MiniGPT-v2, and Gemini Vision Pro, demonstrating its effectiveness in inducing unethical or guideline-violating responses from LLMs.
GUARD 是一种新颖的角色扮演系统,旨在为测试大型语言模型(LLM)的安全性生成规避策略。它将四个角色分配给用户LLM以协作创建规避策略,并将现有的规避策略组织成知识图谱以提高检索效率。GUARD 已在包括 Vicuna-13B、LongChat-7B、Llama-2-7B、ChatGPT、MiniGPT-v2 和 Gemini Vision Pro 在内的多种LLM上得到验证,证明了其在诱导不道德或违反指导方针的响应以及测试遵守政府指导方针方面的有效性。
Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval
Authors: Janet Jenq, Hongda Shen
First: 2025-11-07T15:24:18+00:00 · Latest: 2025-11-07T15:24:18+00:00
Abstract
Multimodal product retrieval systems in e-commerce platforms rely on
effectively combining visual and textual signals to improve search relevance
and user experience. However, vision-language models such as CLIP are
vulnerable to typographic attacks, where misleading or irrelevant text embedded
in images skews model predictions. In this work, we propose a novel method that
reverses the logic of typographic attacks by rendering relevant textual content
(e.g., titles, descriptions) directly onto product images to perform
vision-text compression, thereby strengthening image-text alignment and
boosting multimodal product retrieval performance. We evaluate our method on
three vertical-specific e-commerce datasets (sneakers, handbags, and trading
cards) using six state-of-the-art vision foundation models. Our experiments
demonstrate consistent improvements in unimodal and multimodal retrieval
accuracy across categories and model families. Our findings suggest that
visually rendering product metadata is a simple yet effective enhancement for
zero-shot multimodal retrieval in e-commerce applications.
中文标题/摘要
标题:将对手转化为盟友:逆转 typographic 攻击以增强多模态电子商务产品检索
电子商务平台中的多模态产品检索系统依赖于有效结合视觉和文本信号以提高搜索相关性和用户体验。然而,诸如 CLIP 的视觉-语言模型容易受到 typographic 攻击的影响,其中嵌入在图像中的误导性或无关文本会扭曲模型的预测。在本研究中,我们提出了一种新颖的方法,通过直接在产品图像上渲染相关文本内容(例如,标题、描述)来逆转 typographic 攻击的逻辑,从而增强图像-文本对齐并提升多模态产品检索性能。我们使用六种最先进的视觉基础模型在三个垂直特定的电子商务数据集(运动鞋、手袋和交易卡)上评估了我们的方法。我们的实验表明,在各类别和模型家族中,我们的方法在单模态和多模态检索准确性方面均表现出一致的改进。我们的研究结果表明,在电子商务应用中,视觉呈现产品元数据是一种简单而有效的增强零样本多模态检索的方法。
ExGra-Med: Extended Context Graph Alignment for Medical Vision-Language Models
Authors: Duy M. H. Nguyen, Nghiem T. Diep, Trung Q. Nguyen, Hoang-Bao Le, Tai Nguyen, Tien Nguyen, TrungTin Nguyen, Nhat Ho, Pengtao Xie, Roger Wattenhofer, James Zou, Daniel Sonntag, Mathias Niepert
Venue: NeurIPS 2025
First: 2024-10-03T15:52:03+00:00 · Latest: 2025-11-07T14:48:44+00:00
Comments: Accepted at NeurIPS 2025
Abstract
State-of-the-art medical multi-modal LLMs (med-MLLMs), such as LLaVA-Med and
BioMedGPT, primarily depend on scaling model size and data volume, with
training driven largely by autoregressive objectives. However, we reveal that
this approach can lead to weak vision-language alignment, making these models
overly dependent on costly instruction-following data. To address this, we
introduce ExGra-Med, a novel multi-graph alignment framework that jointly
aligns images, instruction responses, and extended captions in the latent
space, advancing semantic grounding and cross-modal coherence. To scale to
large LLMs (e.g., LLaMA-7B), we develop an efficient end-to-end training scheme
using black-box gradient estimation, enabling fast and scalable optimization.
Empirically, ExGra-Med matches LLaVA-Med's performance using just 10% of the
pre-training data, achieving a 20.13% gain on VQA-RAD and approaching full-data
performance. It also outperforms strong baselines like BioMedGPT and RadFM on
visual chatbot and zero-shot classification tasks, demonstrating its promise
for efficient, high-quality vision-language integration in medical AI.
中文标题/摘要
标题:ExGra-Med: 扩展上下文图对齐方法在医疗视觉语言模型中的应用
当前最先进的医疗多模态大语言模型(med-MLLMs),如LLaVA-Med和BioMedGPT,主要依赖于扩大模型规模和数据量,训练主要由自回归目标驱动。然而,我们发现这种方法可能导致视觉语言对齐较弱,使这些模型过度依赖昂贵的指令跟随数据。为解决这一问题,我们提出了ExGra-Med,这是一种新颖的多图对齐框架,可以在潜在空间中同时对齐图像、指令响应和扩展的描述,从而推进语义定位和跨模态一致性。为了将该方法扩展到大型LLM(例如,LLaMA-7B),我们开发了一种高效的端到端训练方案,使用黑盒梯度估计,使优化快速且可扩展。实验结果显示,ExGra-Med仅使用10%的预训练数据就能达到与LLaVA-Med相当的性能,在VQA-RAD上获得20.13%的提升,并接近全数据性能。此外,ExGra-Med在视觉聊天机器人和零样本分类任务上也优于BioMedGPT和RadFM等强基线模型,展示了其在医疗AI中高效、高质量视觉语言集成的潜力。
Summary / 总结
ExGra-Med is a novel framework that improves vision-language alignment in medical multi-modal LLMs by jointly aligning images, instruction responses, and extended captions in the latent space. It uses an efficient end-to-end training scheme to optimize large LLMs like LLaMA-7B. ExGra-Med matches LLaVA-Med's performance with only 10% of the pre-training data and outperforms strong baselines on visual chatbot and zero-shot classification tasks, showing its potential for efficient and high-quality vision-language integration in medical AI.
ExGra-Med 是一种新颖的框架,通过在潜在空间中联合对齐图像、指令响应和扩展的描述来增强语义定位和跨模态一致性,以改进医疗多模态 LLMs。它使用高效的端到端训练方案来支持大型模型(如 LLaMA-7B),仅使用 10% 的预训练数据即可达到与 LLaVA-Med 相当的性能,并在视觉聊天机器人和零样本分类任务上优于基线模型。
FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction
Authors: Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, Zili Yi
Venue: NeurIPS 2025
First: 2025-11-07T13:17:46+00:00 · Latest: 2025-11-07T13:17:46+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Controlling the spatial and semantic structure of diffusion-generated images
remains a challenge. Existing methods like ControlNet rely on handcrafted
condition maps and retraining, limiting flexibility and generalization.
Inversion-based approaches offer stronger alignment but incur high inference
cost due to dual-path denoising. We present FreeControl, a training-free
framework for semantic structural control in diffusion models. Unlike prior
methods that extract attention across multiple timesteps, FreeControl performs
one-step attention extraction from a single, optimally chosen key timestep and
reuses it throughout denoising. This enables efficient structural guidance
without inversion or retraining. To further improve quality and stability, we
introduce Latent-Condition Decoupling (LCD): a principled separation of the key
timestep and the noised latent used in attention extraction. LCD provides finer
control over attention quality and eliminates structural artifacts. FreeControl
also supports compositional control via reference images assembled from
multiple sources - enabling intuitive scene layout design and stronger prompt
alignment. FreeControl introduces a new paradigm for test-time control,
enabling structurally and semantically aligned, visually coherent generation
directly from raw images, with the flexibility for intuitive compositional
design and compatibility with modern diffusion models at approximately 5
percent additional cost.
中文标题/摘要
标题:FreeControl:通过一步注意力提取实现高效、无需训练的结构控制
控制扩散生成图像的空间和语义结构仍然是一个挑战。现有方法如ControlNet依赖于手工制作的条件图和重新训练,限制了灵活性和泛化能力。基于反演的方法提供了更强的对齐,但由于双路径去噪导致推理成本高。我们提出了FreeControl,一种无需训练的框架,用于扩散模型中的语义结构控制。与之前的方法不同,FreeControl从单个、最优选择的关键时间步进行一步注意力提取,并在整个去噪过程中重用该注意力。这使得在无需反演或重新训练的情况下实现高效的结构指导成为可能。为了进一步提高质量和稳定性,我们引入了潜在条件解耦(LCD):一种原理上将关键时间步和用于注意力提取的噪声潜在变量分离的方法。LCD提供了对注意力质量的更精细控制,并消除了结构伪影。FreeControl还支持通过来自多个来源的参考图像进行组合控制——这使得直观的场景布局设计和更强的提示对齐成为可能。FreeControl引入了一种新的测试时控制范式,能够直接从原始图像生成结构上和语义上对齐、视觉上连贯的生成,具有直观的组合设计灵活性,并且与现代扩散模型兼容,成本约为5%。
Summary / 总结
FreeControl is a training-free framework for controlling the spatial and semantic structure of diffusion-generated images. It extracts attention from a single key timestep and reuses it throughout the denoising process, avoiding the need for retraining or inversion. Latent-Condition Decoupling, which separates the key timestep from the noised latent used in attention extraction, further improves quality and stability. FreeControl also supports compositional control via reference images assembled from multiple sources, enabling intuitive scene layout design and stronger prompt alignment, at approximately 5 percent additional cost.
FreeControl 是一个无需训练的框架,用于控制扩散生成图像的空间和语义结构。它从单一的关键时间步提取注意力并在去噪过程中重用,提供高效的结构指导,无需逆向操作或重新训练。关键发现包括通过潜条件分离提高质量和稳定性,以及通过多源参考图像支持组合控制,实现直观的场景布局设计和更强的提示对齐。
From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
Authors: Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei
First: 2025-10-28T07:38:15+00:00 · Latest: 2025-11-07T07:03:20+00:00
Abstract
Incident management (IM) is central to the reliability of large-scale cloud
systems. Yet manual IM, where on-call engineers examine metrics, logs, and
traces, is labor-intensive and error-prone in the face of massive and
heterogeneous observability data. Existing automated IM approaches often
struggle to generalize across systems, provide limited interpretability, and
incur high deployment costs, which hinders adoption in practice. In this paper,
we present OpsAgent, a lightweight, self-evolving multi-agent system for IM
that employs a training-free data processor to convert heterogeneous
observability data into structured textual descriptions, along with a
multi-agent collaboration framework that makes diagnostic inference transparent
and auditable. To support continual capability growth, OpsAgent also introduces
a dual self-evolution mechanism that integrates internal model updates with
external experience accumulation, thereby closing the deployment loop.
Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art
performance and show that OpsAgent is generalizable, interpretable,
cost-efficient, and self-evolving, making it a practically deployable and
sustainable solution for long-term operation in real-world cloud systems.
中文标题/摘要
标题:从可观测数据到诊断:云系统故障管理的演进多智能体系统
故障管理(IM)是大规模云系统可靠性的核心。然而,手动IM,其中当班工程师检查指标、日志和跟踪,面对庞大的异构可观测数据时劳动密集且容易出错。现有的自动化IM方法往往难以在系统之间泛化,提供有限的可解释性,并且部署成本高,这阻碍了其实用中的采用。在本文中,我们提出了OpsAgent,这是一种轻量级、自我演化的多智能体系统,用于IM,它采用无训练数据处理器将异构可观测数据转换为结构化的文本描述,并采用多智能体协作框架使诊断推理透明且可审计。为了支持持续的能力增长,OpsAgent还引入了一种双重自我演化机制,将内部模型更新与外部经验积累相结合,从而关闭部署循环。在OPENRCA基准上的全面实验表明,OpsAgent具有最先进的性能,并且证明了其通用性、可解释性、成本效益和自我演化,使其成为在实际云系统中长期运行的实用且可持续的解决方案。
Summary / 总结
The research aims to address the challenges of manual incident management in cloud systems by developing OpsAgent, a lightweight multi-agent system. OpsAgent uses a training-free data processor to convert heterogeneous observability data into structured textual descriptions and employs a multi-agent collaboration framework for transparent diagnostic inference. Key experimental results show that OpsAgent achieves state-of-the-art performance, is generalizable, interpretable, and cost-efficient, making it a practical and sustainable solution for cloud systems.
研究旨在通过开发OpsAgent轻量级多智能体系统来解决云系统中手动故障管理的挑战。OpsAgent使用无训练的数据处理器将异构可观测性数据转换为结构化的文本描述,并采用多智能体协作框架进行透明的诊断推理。实验结果表明,OpsAgent实现了最先进的性能,具有通用性、可解释性和低成本,使其成为适用于实际云系统长期运行的实用和可持续解决方案。
Towards Mitigating Hallucinations in Large Vision-Language Models by Refining Textual Embeddings
Authors: Aakriti Agrawal, Gouthaman KV, Rohith Aralikatti, Gauri Jagatap, Jiaxin Yuan, Vijay Kamarshi, Andrea Fanelli, Furong Huang
First: 2025-11-07T06:39:54+00:00 · Latest: 2025-11-07T06:39:54+00:00
Abstract
In this work, we identify an inherent bias in prevailing LVLM architectures
toward the language modality, largely resulting from the common practice of
simply appending visual embeddings to the input text sequence. To address this,
we propose a simple yet effective method that refines textual embeddings by
integrating average-pooled visual features. Our approach demonstrably improves
visual grounding and significantly reduces hallucinations on established
benchmarks. While average pooling offers a straightforward, robust, and
efficient means of incorporating visual information, we believe that more
sophisticated fusion methods could further enhance visual grounding and
cross-modal alignment. Given that the primary focus of this work is to
highlight the modality imbalance and its impact on hallucinations -- and to
show that refining textual embeddings with visual information mitigates this
issue -- we leave exploration of advanced fusion strategies for future work.
中文标题/摘要
标题:通过细化文本嵌入减轻大型视觉-语言模型中的幻觉
在本工作中,我们识别出当前主流的LVLM架构中存在对语言模态的固有偏见,主要源于将视觉嵌入简单地附加到输入文本序列中的常见做法。为了解决这一问题,我们提出了一种简单而有效的方法,通过整合平均池化后的视觉特征来细化文本嵌入。我们的方法在视觉定位方面表现出明显的改进,并且显著减少了在现有基准上的幻觉。虽然平均池化提供了一种直接、稳健且高效的将视觉信息纳入的方法,但我们认为更复杂的融合方法可以进一步增强视觉定位和跨模态对齐。鉴于本工作的主要重点是突出模态失衡及其对幻觉的影响,并展示通过使用视觉信息细化文本嵌入可以缓解这一问题,我们将高级融合策略的探索留作未来工作。
Summary / 总结
This work addresses the inherent bias in large vision-language models (LVLMs) towards the language modality, which results largely from the common practice of simply appending visual embeddings to the input text sequence. The authors propose a method to refine textual embeddings by integrating average-pooled visual features, which improves visual grounding and significantly reduces hallucinations on established benchmarks. They note that while average pooling is straightforward, robust, and efficient, more sophisticated fusion methods could further enhance visual grounding and cross-modal alignment, and they leave these to future work.
本文通过提出一种方法,将平均池化的视觉特征融入文本嵌入中,来解决大型视觉-语言模型(LVLM)对语言模态的固有偏向问题。该方法提高了视觉定位的准确性,并显著减少了在现有基准上的幻觉现象。作者认为,更复杂的融合策略可以进一步增强视觉定位和跨模态对齐,但这留待未来工作探索。
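A minimal sketch of the refinement idea follows. The abstract only states that average-pooled visual features are integrated into the textual embeddings; the additive fusion, the scalar weight `alpha`, and the assumption that visual and text embeddings share the same dimensionality are all choices made for this illustration, and the function names are hypothetical.

```python
def average_pool(visual_tokens):
    """Mean over all visual token embeddings, one value per dimension."""
    count = len(visual_tokens)
    return [sum(column) / count for column in zip(*visual_tokens)]


def refine_text_embeddings(text_tokens, visual_tokens, alpha=0.5):
    """Add the pooled visual vector, scaled by alpha, to every text embedding.

    Assumes text and visual embeddings have the same dimensionality; a real
    model would typically need a learned projection between the two spaces.
    """
    pooled = average_pool(visual_tokens)
    return [[t + alpha * v for t, v in zip(token, pooled)]
            for token in text_tokens]
```

For example, `refine_text_embeddings([[0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])` shifts the single text embedding by half of the pooled visual vector `[2.0, 3.0]`.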
Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs
Authors: Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye
First: 2025-10-17T08:11:54+00:00 · Latest: 2025-11-07T05:24:14+00:00
Abstract
Retrieval-Augmented Generation systems are essential for providing fact-based
guidance from Malaysian Clinical Practice Guidelines. However, their
effectiveness with image-based queries is limited, as general Vision-Language
Model captions often lack clinical specificity and factual grounding. This
study proposes and validates a framework to specialize the MedGemma model for
generating high-fidelity captions that serve as superior queries. To overcome
data scarcity, we employ a knowledge distillation pipeline to create a
synthetic dataset across dermatology, fundus, and chest radiography domains,
and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance
was rigorously assessed through a dual framework measuring both classification
accuracy and, via a novel application of the RAGAS framework, caption
faithfulness, relevancy, and correctness. The fine-tuned model demonstrated
substantial improvements in classification performance, while RAGAS evaluation
confirmed significant gains in caption faithfulness and correctness, validating
the model's ability to produce reliable, factually grounded descriptions. This
work establishes a robust pipeline for specializing medical VLMs and validates
the resulting model as a high-quality query generator, laying the groundwork
for enhancing multimodal RAG systems in evidence-based clinical decision
support.
中文标题/摘要
标题:Fine-Tuning MedGemma 以增强马来西亚临床实践指南的多模态RAG临床配图能力
检索增强生成系统对于提供基于马来西亚临床实践指南的事实指导至关重要。然而,它们在处理基于图像的查询时效果有限,因为通用的视觉-语言模型配图往往缺乏临床特异性和事实依据。本研究提出并验证了一种框架,以专门化MedGemma模型,生成高保真度的配图,作为更优的查询。为克服数据稀缺性,我们采用知识蒸馏管道在皮肤科、眼底和胸部X光领域创建合成数据集,并使用参数高效的QLoRA方法微调MedGemma。通过双重框架严格评估性能,该框架同时测量分类准确性和通过RAGAS框架的新应用测量配图的忠实性、相关性和正确性。微调后的模型在分类性能上表现出显著改进,而RAGAS评估证实了配图忠实性和正确性有显著提升,验证了模型生成可靠、事实依据描述的能力。本研究建立了一个稳健的管道,用于专门化医疗视觉语言模型,并验证了生成的模型作为高质量查询生成器的有效性,为增强基于证据的临床决策支持的多模态RAG系统奠定了基础。
Summary / 总结
This study aims to enhance the effectiveness of Retrieval-Augmented Generation systems in providing fact-based guidance from Malaysian Clinical Practice Guidelines by fine-tuning the MedGemma model. A knowledge distillation pipeline was used to create a synthetic dataset for dermatology, fundus, and chest radiography domains, and the QLoRA method was employed for fine-tuning. The model showed significant improvements in classification performance and notable gains in caption faithfulness and correctness as evaluated by the RAGAS framework, validating its use as a reliable query generator for multimodal RAG systems in clinical decision support.
本研究旨在增强检索增强生成系统在提供基于马来西亚临床实践指南的事实指导方面的有效性,特别是对于基于图像的查询。研究人员提出并验证了一种框架,以使MedGemma模型专门生成高质量的临床描述。他们使用知识蒸馏管道创建了一个合成数据集,并使用QLoRA方法微调MedGemma。微调后的模型在分类性能和描述的忠实性、相关性和正确性方面显示出显著改进,这通过RAGAS框架评估得出。这项工作建立了一个强大的专门化医疗视觉语言模型的管道,并验证了该模型作为高质量查询生成器在多模态RAG系统中的有效性。
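The entry above fine-tunes MedGemma with QLoRA. As a rough sketch of the low-rank adapter part only, the standard LoRA forward pass is y = W x + (alpha / r) · B (A x), where W stays frozen and only A and B are trained. The 4-bit quantization of W that QLoRA adds is not shown, and the tiny matrices below are made-up illustration values, not anything from the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(
    x: Vector, w_frozen: Matrix, a: Matrix, b: Matrix, alpha: float, rank: int
) -> Vector:
    """Frozen base projection plus a scaled low-rank update B(Ax)."""
    base = matvec(w_frozen, x)
    low_rank = matvec(b, matvec(a, x))  # a: rank x d_in, b: d_out x rank
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, low_rank)]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (illustrative)
a = [[1.0, 1.0]]              # rank-1 down-projection
b = [[1.0], [0.0]]            # rank-1 up-projection
y = lora_forward([1.0, 2.0], w, a, b, alpha=2.0, rank=1)
print(y)  # [7.0, 2.0]
```

Only `a` and `b` would receive gradients during fine-tuning, which is why the method is parameter-efficient.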
A benchmark multimodal oro-dental dataset for large vision-language models
Authors: Haoxin Lv, Ijazul Haq, Jin Du, Jiaxin Ma, Binnian Zhu, Xiaobing Dang, Chaoan Liang, Ruxu Du, Yingjie Zhang, Muhammad Saqib
First: 2025-11-07T03:14:20+00:00 · Latest: 2025-11-07T03:14:20+00:00
Abstract
The advancement of artificial intelligence in oral healthcare relies on the
availability of large-scale multimodal datasets that capture the complexity of
clinical practice. In this paper, we present a comprehensive multimodal
dataset, comprising 8775 dental checkups from 4800 patients collected over
eight years (2018-2025), with patients ranging from 10 to 90 years of age. The
dataset includes 50000 intraoral images, 8056 radiographs, and detailed textual
records, including diagnoses, treatment plans, and follow-up notes. The data
were collected under standard ethical guidelines and annotated for
benchmarking. To demonstrate its utility, we fine-tuned state-of-the-art large
vision-language models, Qwen-VL 3B and 7B, and evaluated them on two tasks:
classification of six oro-dental anomalies and generation of complete
diagnostic reports from multimodal inputs. We compared the fine-tuned models
with their base counterparts and GPT-4o. The fine-tuned models achieved
substantial gains over these baselines, validating the dataset and underscoring
its effectiveness in advancing AI-driven oro-dental healthcare solutions. The
dataset is publicly available, providing an essential resource for future
research in AI dentistry.
中文标题/摘要
标题:用于大型视觉-语言模型的基准多模态口腔牙科数据集
口腔医疗保健中人工智能的进步依赖于能够捕捉临床实践复杂性的大规模多模态数据集。本文介绍了包含8775次牙科检查的数据集,来自4800名患者,时间跨度为八年(2018-2025),患者年龄从10岁到90岁不等。数据集包括50000张口腔内图像、8056张放射影像以及详细的文本记录,包括诊断、治疗计划和随访笔记。数据在标准伦理指导下收集并标注用于基准测试。为了展示其用途,我们对最先进的大型视觉-语言模型Qwen-VL 3B和7B进行了微调,并在两个任务上进行了评估:六种口腔牙科异常的分类以及从多模态输入生成完整的诊断报告。我们将微调后的模型与基模型和GPT-4o进行了比较。微调后的模型在这些基线模型上取得了显著的改进,验证了数据集的有效性,并突显了其在推动基于AI的口腔牙科健康解决方案方面的有效性。该数据集已公开,为未来AI牙科研究提供了重要资源。
Summary / 总结
This paper introduces a comprehensive multimodal dataset for oral healthcare, comprising 8775 dental checkups from 4800 patients, with 50000 intraoral images, 8056 radiographs, and detailed textual records. To demonstrate its utility, the authors fine-tuned the large vision-language models Qwen-VL 3B and 7B on the dataset for classifying oro-dental anomalies and generating diagnostic reports. The fine-tuned models showed substantial gains over their base models and GPT-4o, validating the dataset's utility in advancing AI-driven dental solutions.
该论文介绍了包含8775例牙科检查的综合多模态数据集,其中包括50000张口腔内图像、8056张放射影像以及详细的文本记录。该数据集被用于微调最先进的大型视觉-语言模型,这些模型在分类口腔牙科异常和生成诊断报告方面表现出显著改进。这验证了该数据集在推动牙科人工智能解决方案方面的有效性。
Prompt-Based Safety Guidance Is Ineffective for Unlearned Text-to-Image Diffusion Models
Authors: Jiwoo Shin, Byeonghu Na, Mina Kang, Wonhyeok Choi, Il-chul Moon
Venue: NeurIPS 2025
First: 2025-11-06T21:51:03+00:00 · Latest: 2025-11-06T21:51:03+00:00
Comments: Accepted at NeurIPS 2025 Workshop on Generative and Protective AI for
Content Creation
Abstract
Recent advances in text-to-image generative models have raised concerns about
their potential to produce harmful content when provided with malicious input
text prompts. To address this issue, two main approaches have emerged: (1)
fine-tuning the model to unlearn harmful concepts and (2) training-free
guidance methods that leverage negative prompts. However, we observe that
combining these two orthogonal approaches often leads to marginal or even
degraded defense performance. This observation indicates a critical
incompatibility between the two paradigms, which hinders their combined
effectiveness. In this work, we address this issue by proposing a conceptually
simple yet experimentally robust method: replacing the negative prompts used in
training-free methods with implicit negative embeddings obtained through
concept inversion. Our method requires no modification to either approach and
can be easily integrated into existing pipelines. We experimentally validate
its effectiveness on nudity and violence benchmarks, demonstrating consistent
improvements in defense success rate while preserving the core semantics of
input prompts.
中文标题/摘要
标题:基于提示的安全指导对未学习的文本到图像扩散模型无效
文本到图像生成模型的最新进展引发了对其在提供恶意输入文本提示时可能生成有害内容的担忧。为解决这一问题,出现了两种主要方法:(1) 对模型进行微调以消除有害概念,(2) 不需要训练的指导方法,利用负面提示。然而,我们观察到,将这两种方法结合起来往往导致边际甚至退化的防御性能。这一观察表明,两种范式之间存在关键的不兼容性,这阻碍了它们的联合有效性。在本文中,我们通过提出一个概念上简单但实验上稳健的方法来解决这一问题:用概念反转获得的隐式负面嵌入替换不需要训练方法中使用的负面提示。该方法不需要对任何一种方法进行修改,并且可以轻松集成到现有管道中。我们在裸体和暴力基准上进行了实验验证,证明其在防御成功率方面的一致改进,同时保留了输入提示的核心语义。
Summary / 总结
This paper addresses the issue of text-to-image generative models producing harmful content by evaluating the effectiveness of combining fine-tuning and training-free guidance methods. It finds that these two approaches are incompatible, leading to marginal or degraded defense performance. The authors propose a method that replaces negative prompts with implicit negative embeddings, showing consistent improvements in defense success rates without altering existing pipelines or input semantics.
该论文评估了结合微调和无训练指导方法以防止文本生成图像模型产生有害内容的有效性。研究发现这两种方法不兼容,导致防御性能边际或下降。作者提出了一种方法,用概念反转获得的隐式负嵌入替换负提示,展示了在不改变现有管道或输入语义的情况下,一致提高防御成功率。
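For context on the entry above, here is a minimal sketch of how a negative prompt usually enters guidance in common text-to-image pipelines: the negative branch takes the place of the unconditional branch in classifier-free guidance. The paper's proposal, as described in the abstract, is to feed an embedding obtained by concept inversion into that negative branch instead of a text prompt. The function name and toy vectors are hypothetical, and the specific training-free methods evaluated in the paper may use a different guidance rule.

```python
from typing import List


def guided_noise(eps_cond: List[float], eps_neg: List[float], scale: float) -> List[float]:
    """Classifier-free guidance with a negative branch.

    eps_cond: noise predicted with the user's prompt embedding.
    eps_neg:  noise predicted with the negative embedding (a negative text
              prompt, or an inverted concept embedding as in the entry above).
    The result moves away from eps_neg and toward eps_cond.
    """
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_neg)]


eps_prompt = [1.0, 2.0]
eps_negative = [0.0, 1.0]
print(guided_noise(eps_prompt, eps_negative, scale=2.0))  # [2.0, 3.0]
```

With `scale=1.0` the negative branch cancels out and the prediction equals `eps_cond`; larger scales push further from the negative concept.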
Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models
Authors: Alexander Htet Kyaw, Richa Gupta, Dhruv Shah, Anoop Sinha, Kory Mathewson, Stefanie Pender, Sachin Chitta, Yotto Koga, Faez Ahmed, Lawrence Sass, Randall Davis
Venue: NeurIPS 2025
First: 2025-11-04T01:02:21+00:00 · Latest: 2025-11-06T20:54:20+00:00
Comments: Accepted to NeurIPS 2025, Conference on Neural Information Processing
Systems, Creative AI Track
Abstract
Advances in 3D generative AI have enabled the creation of physical objects
from text prompts, but challenges remain in creating objects involving multiple
component types. We present a pipeline that integrates 3D generative AI with
vision-language models (VLMs) to enable the robotic assembly of multi-component
objects from natural language. Our method leverages VLMs for zero-shot,
multi-modal reasoning about geometry and functionality to decompose
AI-generated meshes into multi-component 3D models using predefined structural
and panel components. We demonstrate that a VLM is capable of determining which
mesh regions need panel components in addition to structural components, based
on the object's geometry and functionality. Evaluation across test objects
shows that users preferred the VLM-generated assignments 90.6% of the time,
compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the
system allows users to refine component assignments through conversational
feedback, enabling greater human control and agency in making physical objects
with generative AI and robotics.
中文标题/摘要
标题:基于3D生成AI和视觉语言模型的多组件物体文本到机器人装配
3D生成AI的进步使得从文本提示创建物理对象成为可能,但在创建涉及多种组件类型的对象时仍面临挑战。我们提出了一种将3D生成AI与视觉语言模型(VLMs)集成的管道,以使自然语言能够实现多组件物体的机器人装配。我们的方法利用VLMs进行零样本、多模态的几何和功能推理,将生成的AI网格分解为使用预定义结构和面板组件的多组件3D模型。我们证明VLM能够根据物体的几何形状和功能确定哪些网格区域需要面板组件。在测试对象上的评估显示,用户中有90.6%的时间更喜欢VLM生成的分配,而基于规则的分配为59.4%,随机分配为2.5%。最后,该系统允许用户通过对话反馈来细化组件分配,从而在生成AI和机器人技术中实现更大的人类控制和自主性。
PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference
Authors: Yushu Zhao, Zheng Wang, Minjia Zhang
First: 2025-11-06T20:53:02+00:00 · Latest: 2025-11-06T20:53:02+00:00
Abstract
Mixture-of-Experts (MoE) models have shown strong potential in scaling
language models efficiently by activating only a small subset of experts per
input. However, their widespread deployment remains limited due to the high
memory overhead associated with storing all expert parameters, particularly as
the number of experts increases. To address this challenge, prior works have
explored expert dropping and merging strategies, yet they often suffer from
performance drop at high compression ratios. In this paper, we introduce
PuzzleMoE, a training-free MoE compression method that achieves both high
accuracy and efficient inference through two key innovations: First, PuzzleMoE
performs sparse expert merging by identifying element-wise weight redundancy
and specialization. It uses a dual-mask to capture both shared and
expert-specific parameters. Second, to avoid the overhead of storing binary
masks and signs, PuzzleMoE introduces a bit-packed encoding scheme that reuses
underutilized exponent bits, enabling efficient MoE inference on GPUs.
Extensive experiments demonstrate that PuzzleMoE can compress MoE models by up
to 50% while maintaining accuracy across various tasks. Specifically, it
outperforms prior MoE compression methods by up to 16.7% on MMLU at 50%
compression ratio, and achieves up to $1.28\times$ inference speedup.
中文标题/摘要
标题:PuzzleMoE:通过稀疏专家合并和位压缩推理高效压缩大型混合专家模型
混合专家(MoE)模型在通过激活每个输入的小部分专家来高效扩展语言模型方面显示出强大的潜力。然而,由于存储所有专家参数的高内存开销,特别是在专家数量增加时,其广泛应用受到限制。为了解决这一挑战,先前的工作探索了专家丢弃和合并策略,但它们往往在高压缩比下性能下降。在本文中,我们介绍了PuzzleMoE,这是一种无需训练的MoE压缩方法,通过两种关键创新实现了高准确性和高效推理:首先,PuzzleMoE通过识别元素权重的冗余性和专业化进行稀疏专家合并,使用双重掩码捕获共享和专家特定参数。其次,为了避免存储二进制掩码和符号的开销,PuzzleMoE引入了一种位压缩编码方案,重新利用未充分利用的指数位,使GPU上的MoE推理更加高效。广泛的实验表明,PuzzleMoE可以在压缩MoE模型高达50%的同时保持各种任务的准确性。具体而言,在50%压缩比下,它在MMLU上的性能比先前的MoE压缩方法高出16.7%,并实现了高达1.28倍的推理加速。
Summary / 总结
PuzzleMoE is a training-free method for compressing Mixture-of-Experts (MoE) models through sparse expert merging and bit-packed inference. It uses a dual-mask to capture shared and expert-specific parameters, and a bit-packed encoding scheme that reuses underutilized exponent bits to avoid the overhead of storing binary masks and signs. Experiments show PuzzleMoE can compress MoE models by up to 50% while maintaining accuracy, outperforming previous methods by up to 16.7% on MMLU at a 50% compression ratio and achieving up to 1.28x inference speedup.
PuzzleMoE 是一种无需训练的方法,通过稀疏合并专家和使用位打包推理来压缩 Mixture-of-Experts (MoE) 模型。它使用双掩码来识别共享和专家特定的参数,并通过重用指数位来避免存储二进制掩码和符号。实验表明,PuzzleMoE 可以将 MoE 模型压缩多达 50%,同时保持或甚至提高准确性,在 50% 压缩比下,PuzzleMoE 在 MMLU 上的表现优于先前方法最多 16.7%,并且实现了高达 1.28 倍的推理加速。
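To make the dual-mask idea in the entry above concrete, here is a toy sketch of merging two experts element by element. This is not PuzzleMoE's actual algorithm: the closeness threshold `tau`, the averaging of near-equal weights, and the keep-the-larger-magnitude rule are all assumptions chosen for illustration, and the bit-packed encoding is not shown.

```python
from typing import List, Tuple


def merge_pair(
    w_a: List[float], w_b: List[float], tau: float
) -> Tuple[List[float], List[int], List[int]]:
    """Merge two experts into one weight vector plus one binary mask per expert.

    Near-equal entries are treated as shared (both masks set);
    other entries are kept for a single expert only (hypothetical rule).
    """
    merged, mask_a, mask_b = [], [], []
    for a, b in zip(w_a, w_b):
        if abs(a - b) <= tau:          # redundant: share the average
            merged.append((a + b) / 2)
            mask_a.append(1)
            mask_b.append(1)
        elif abs(a) >= abs(b):         # specialized: keep expert A's value
            merged.append(a)
            mask_a.append(1)
            mask_b.append(0)
        else:                          # specialized: keep expert B's value
            merged.append(b)
            mask_a.append(0)
            mask_b.append(1)
    return merged, mask_a, mask_b


def reconstruct(merged: List[float], mask: List[int]) -> List[float]:
    """Recover one expert's (sparse) weights from the merged vector."""
    return [w if keep else 0.0 for w, keep in zip(merged, mask)]


merged, mask_a, mask_b = merge_pair([1.0, 0.1, -2.0], [1.0, 3.0, -0.5], tau=0.25)
print(merged)                       # [1.0, 3.0, -2.0]
print(reconstruct(merged, mask_a))  # [1.0, 0.0, -2.0]
print(reconstruct(merged, mask_b))  # [1.0, 3.0, 0.0]
```

Storing one merged vector and two masks instead of two full vectors is where the memory saving would come from in this toy setting.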
TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem
Venue: Transactions on Machine Learning Research, 2025
First: 2025-05-29T17:59:59+00:00 · Latest: 2025-11-06T18:59:57+00:00
Comments: Published in TMLR, with a J2C Certification
Abstract
Image-text models excel at image-level tasks but struggle with detailed
visual understanding. While these models provide strong visual-language
alignment, segmentation models like SAM2 offer precise spatial boundaries for
objects. To this end, we propose TextRegion, a simple, effective, and
training-free framework that combines the strengths of image-text models and
SAM2 to generate powerful text-aligned region tokens. These tokens enable
detailed visual understanding while preserving open-vocabulary capabilities.
They can be directly applied to various downstream tasks, including open-world
semantic segmentation, referring expression comprehension, and grounding. We
conduct extensive evaluations and consistently achieve superior or competitive
performance compared to state-of-the-art training-free methods. Additionally,
our framework is compatible with many image-text models, making it highly
practical and easily extensible as stronger models emerge. Code is available
at: https://github.com/avaxiao/TextRegion.
中文标题/摘要
标题:TextRegion: 冻结图像-文本模型的文本对齐区域标记
图像-文本模型在图像级任务上表现出色,但在详细的视觉理解方面存在困难。尽管这些模型提供了强大的视觉-语言对齐,但分割模型如SAM2能够提供精确的空间边界。为此,我们提出了一种简单、有效且无需训练的TextRegion框架,该框架结合了图像-文本模型和SAM2的优点,生成强大的文本对齐区域标记。这些标记能够实现详细的视觉理解,同时保留开放词汇的能力。它们可以直接应用于各种下游任务,包括开放世界语义分割、指示表达理解以及语义定位。我们进行了广泛的评估,并且在与最先进的无需训练方法的比较中,始终取得了优越或竞争力的表现。此外,我们的框架与许多图像-文本模型兼容,使其非常实用且易于扩展,随着更强的模型出现。代码可在:https://github.com/avaxiao/TextRegion 获取。
Summary / 总结
TextRegion is a framework that combines the strengths of image-text models and SAM2 to generate text-aligned region tokens, enabling detailed visual understanding while maintaining open-vocabulary capabilities. It achieves superior or competitive performance in various downstream tasks such as open-world semantic segmentation, referring expression comprehension, and grounding. The framework is training-free and compatible with many image-text models, making it practical and easily extensible as new models are developed.
TextRegion 是一个框架,结合了图像-文本模型和 SAM2 的优势,生成文本对齐的区域令牌,既能实现详细的视觉理解,又能保持开放词汇的能力。该框架在开放世界语义分割、指示表达理解和语义定位等下游任务中表现出优越或竞争力。该框架无需训练且兼容多种图像-文本模型,使其在新模型出现时易于扩展和实用。
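One plausible reading of "text-aligned region tokens" in the entry above is: pool the image-text model's patch features inside each segmentation mask, then compare the pooled vector with text embeddings. The sketch below follows that reading with plain mask-averaged pooling and cosine similarity; the actual TextRegion pooling may differ, and the toy features, mask, and label embeddings are made up.

```python
import math
from typing import Dict, List

Vector = List[float]


def region_token(patch_features: List[Vector], mask: List[int]) -> Vector:
    """Average the patch features that fall inside a binary region mask."""
    selected = [f for f, keep in zip(patch_features, mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, returning 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_region(token: Vector, text_embeddings: Dict[str, Vector]) -> str:
    """Pick the text label whose embedding is most similar to the region token."""
    return max(text_embeddings, key=lambda name: cosine(token, text_embeddings[name]))


patches = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]  # toy patch features
mask = [1, 1, 0]                                # region covers the first two patches
labels = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
token = region_token(patches, mask)
print(label_region(token, labels))  # cat
```

Because nothing is trained, swapping in a stronger image-text model only changes where `patches` and `labels` come from.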
DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash
Authors: Omkar Gurjar, Kin Sum Liu, Praveen Kolli, Utsaw Kumar, Mandar Rahurkar
First: 2025-03-18T20:38:31+00:00 · Latest: 2025-11-06T18:08:18+00:00
Abstract
Despite the success of vision-language models in various generative tasks,
obtaining high-quality semantic representations for products and user intents
is still challenging due to the inability of off-the-shelf models to capture
nuanced relationships between the entities. In this paper, we introduce a joint
training framework for product and user queries by aligning uni-modal and
multi-modal encoders through contrastive learning on image-text data. Our novel
approach trains a query encoder with an LLM-curated relevance dataset,
eliminating the reliance on engagement history. These embeddings demonstrate
strong generalization capabilities and improve performance across applications,
including product categorization and relevance prediction. For personalized ads
recommendation, a significant uplift in the click-through rate and conversion
rate after the deployment further confirms the impact on key business metrics.
We believe that the flexibility of our framework makes it a promising solution
toward enriching the user experience across the e-commerce landscape.
中文标题/摘要
标题:DashCLIP:利用多模态模型为DoorDash生成语义嵌入
尽管视觉-语言模型在各种生成任务中取得了成功,但由于现成模型无法捕捉实体之间的细微关系,获得高质量的语义表示仍然具有挑战性。在本文中,我们通过对比学习图像-文本数据来对齐单模态和多模态编码器,提出了一种产品和用户查询的联合训练框架。我们的新方法使用LLM精炼的相关性数据集训练查询编码器,消除了对互动历史的依赖。这些嵌入展示了强大的泛化能力,并在包括产品分类和相关性预测在内的多个应用中提高了性能。对于个性化广告推荐,在部署后点击率和转化率的显著提升进一步证实了对关键业务指标的影响。我们认为,我们框架的灵活性使其成为丰富电子商务领域用户体验的有前途的解决方案。
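The entry above aligns query and product encoders through contrastive learning. A common form of such an objective is the InfoNCE loss, sketched below for already-computed embeddings; whether DashCLIP uses exactly this loss, this temperature, or a symmetric variant is not stated in the abstract, so treat those details as assumptions.

```python
import math
from typing import List

Vector = List[float]


def info_nce(queries: List[Vector], products: List[Vector], temperature: float = 0.1) -> float:
    """Mean InfoNCE loss where queries[i] should match products[i].

    All other products in the batch act as negatives for query i.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in products]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]    # each product matches its query
shuffled = [[0.0, 1.0], [1.0, 0.0]]   # each product matches the wrong query
print(info_nce(queries, aligned) < info_nce(queries, shuffled))  # True
```

Minimizing this loss pulls matching query-product pairs together and pushes mismatched pairs apart in the shared embedding space.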
IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs
Authors: Ali Faraz, Akash, Shaharukh Khan, Raja Kolla, Akshat Patidar, Suranjan Goswami, Abhinav Ravi, Chandra Khatri, Shubham Agarwal
First: 2025-11-06T18:01:22+00:00 · Latest: 2025-11-06T18:01:22+00:00
Abstract
Vision-language models (VLMs) have demonstrated impressive generalization
across multimodal tasks, yet most evaluation benchmarks remain Western-centric,
leaving open questions about their performance in culturally diverse and
multilingual settings. To address this gap, we introduce IndicVisionBench, the
first large-scale benchmark centered on the Indian subcontinent. Covering
English and 10 Indian languages, our benchmark spans 3 multimodal tasks,
including Optical Character Recognition (OCR), Multimodal Machine Translation
(MMT), and Visual Question Answering (VQA), covering 6 kinds of question types.
Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across
13 culturally grounded topics. In addition, we release a paired parallel corpus
of annotations across 10 Indic languages, creating a unique resource for
analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum
of 8 models, from proprietary closed-source systems to open-weights medium and
large-scale models. Our experiments reveal substantial performance gaps,
underscoring the limitations of current VLMs in culturally diverse contexts. By
centering cultural diversity and multilinguality, IndicVisionBench establishes
a reproducible evaluation framework that paves the way for more inclusive
multimodal research.
中文标题/摘要
标题:IndicVisionBench:在VLM中的文化与多语言理解基准测试
视觉-语言模型(VLMs)在多模态任务中展示了令人印象深刻的泛化能力,但大多数评估基准仍以西方为中心,留下了关于其在文化多样性和多语言环境中的表现的问题。为了解决这一差距,我们引入了IndicVisionBench,这是第一个以印度次大陆为中心的大规模基准测试。该基准测试涵盖了英语和10种印度语言,包括光学字符识别(OCR)、多模态机器翻译(MMT)和视觉问答(VQA)等3个跨模态任务,涵盖了6种问题类型。最终基准测试包括约5000张图像和37000多个问答对,涉及13个文化基础主题。此外,我们还发布了10种印度语言的配对平行注释语料库,为分析VLM中的文化和语言偏见提供了独特资源。我们评估了8种不同模型,从专有封闭源系统到开放权重的中型和大型模型。我们的实验揭示了显著的性能差距,突显了当前VLMs在文化多样环境中存在的局限性。通过强调文化多样性和多语言性,IndicVisionBench建立了一个可重复的评估框架,为更具包容性的多模态研究铺平了道路。
Summary / 总结
IndicVisionBench is a new benchmark for evaluating VLMs in culturally diverse and multilingual settings, focusing on the Indian subcontinent. It spans 3 multimodal tasks (OCR, MMT, and VQA), covers 6 question types, and comprises about 5,000 images and 37,000+ QA pairs across 13 culturally grounded topics in English and 10 Indian languages. Experiments on 8 models reveal substantial performance gaps, indicating that current VLMs struggle in culturally diverse contexts. The benchmark aims to improve inclusivity in multimodal research by addressing the Western-centric focus of existing evaluation benchmarks.
IndicVisionBench 是一个针对文化多样性和多语言环境评估 VLMs 的新基准,特别关注印度次大陆。它包括 OCR、MMT 和 VQA 等 3 个跨模态任务,涵盖 6 种问题类型和 5,000 张图片以及 37,000 多个 QA 对。评估 8 模型后,研究显示存在显著的性能差异,表明当前 VLMs 在文化多样环境中表现不佳。该基准旨在通过解决现有评估方法的局限性,促进更具包容性的跨模态研究。
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
First: 2025-11-06T17:25:23+00:00 · Latest: 2025-11-06T17:25:23+00:00
Comments: 36 pages, 14 figures
Abstract
The "Thinking with Text" and "Thinking with Images" paradigms significantly
improve the reasoning ability of large language models (LLMs) and Vision
Language Models (VLMs). However, these paradigms have inherent limitations: (1)
images capture only single moments and fail to represent dynamic processes or
continuous changes, and (2) the separation of text and vision as distinct
modalities hinders unified multimodal understanding and generation. To
overcome these limitations, we introduce "Thinking with Video", a new paradigm
that leverages video generation models, such as Sora-2, to bridge visual and
textual reasoning in a unified temporal framework. To support this exploration,
we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench
encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing
Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our
evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks,
Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even
surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric
tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
Furthermore, we systematically analyse the source of these abilities. We also
find that self-consistency and in-context learning can improve Sora-2's
performance. In summary, our findings demonstrate that video generation
models are potential unified multimodal understanding and generation models,
positioning "thinking with video" as a unified multimodal reasoning paradigm.
中文标题/摘要
标题:视频思维:视频生成作为有前景的多模态推理范式
"文本思维"和"图像思维"范式显著提高了大型语言模型(LLMs)和视觉语言模型(VLMs)的推理能力。然而,这些范式存在固有的局限性。首先,图像只能捕捉单一时刻,无法表示动态过程或连续变化;其次,将文本和视觉视为不同的模态,阻碍了统一的多模态理解和生成。为克服这些局限,我们引入了“视频思维”这一新范式,利用视频生成模型(如Sora-2)在统一的时间框架内结合视觉和文本推理。为支持这一探索,我们开发了视频思维基准(VideoThinkBench)。VideoThinkBench 包含两类任务:(1)视觉中心任务(如眼力谜题),(2)文本中心任务(如GSM8K和MMMU的子集)。我们的评估表明Sora-2是一个有效的推理者。在视觉中心任务中,Sora-2通常与最先进的视觉语言模型(SOTA VLMs)相当,甚至在某些任务(如眼力游戏)上超越了VLMs。在文本中心任务中,Sora-2在MATH上的准确率为92%,在MMMU上的准确率为75.53%。此外,我们系统地分析了这些能力的来源。我们还发现,自我一致性与上下文学习可以提高Sora-2的性能。总之,我们的研究结果表明,视频生成模型可能是统一的多模态理解和生成模型,将“视频思维”定位为统一的多模态推理范式。
Summary / 总结
The paper introduces 'Thinking with Video' as a new paradigm to enhance multimodal reasoning by leveraging video generation models. It addresses the limitations of 'Thinking with Text' and 'Thinking with Images' paradigms, such as the inability of images to represent dynamic processes and the separation of text and vision. The authors developed the Video Thinking Benchmark (VideoThinkBench) to evaluate this paradigm, showing that Sora-2, a video generation model, performs comparably to state-of-the-art vision-language models on vision-centric tasks and achieves high accuracy on text-centric tasks. The study also identifies self-consistency and in-context learning as factors that improve Sora-2's performance.
论文提出了‘视频思考’的新范式,通过利用视频生成模型来增强多模态推理能力,解决了‘文本思考’和‘图像思考’范式中的局限性,如图像无法表示动态过程和文本与视觉的分离。作者开发了视频思考基准(VideoThinkBench)来评估这一范式,结果显示视频生成模型Sora-2在视觉中心任务上与最先进的视觉语言模型相当,并在文本中心任务上取得了高准确率。研究还发现自我一致性与上下文学习是提高Sora-2性能的因素。
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Authors: Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao
First: 2025-11-06T17:07:49+00:00 · Latest: 2025-11-06T17:07:49+00:00
Comments: Github: https://github.com/MINT-SJTU/Evo-1
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful framework that
unifies perception, language, and control, enabling robots to perform diverse
tasks through multimodal understanding. However, current VLA models typically
contain massive parameters and rely heavily on large-scale robot data
pretraining, leading to high computational costs during training, as well as
limited deployability for real-time inference. Moreover, most training
paradigms often degrade the perceptual representations of the vision-language
backbone, resulting in overfitting and poor generalization to downstream tasks.
In this work, we present Evo-1, a lightweight VLA model that reduces
computation and improves deployment efficiency, while maintaining strong
performance without pretraining on robot data. Evo-1 builds on a native
multimodal Vision-Language model (VLM), incorporating a novel cross-modulated
diffusion transformer along with an optimized integration module, together
forming an effective architecture. We further introduce a two-stage training
paradigm that progressively aligns action with perception, preserving the
representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1
achieves state-of-the-art results on the Meta-World and RoboTwin suite,
surpassing the previous best models by 12.4% and 6.9%, respectively, and also
attains a competitive result of 94.8% on LIBERO. In real-world evaluations,
Evo-1 attains a 78% success rate with high inference frequency and low memory
overhead, outperforming all baseline methods. We release code, data, and model
weights to facilitate future research on lightweight and efficient VLA models.
中文标题/摘要
标题:Evo-1:轻量级视觉-语言-行动模型,保留语义对齐
视觉-语言-行动(VLA)模型已成为一种强大的框架,统一了感知、语言和控制,使机器人能够通过多模态理解执行多种任务。然而,当前的VLA模型通常包含大量参数,并且依赖大规模机器人数据的预训练,导致训练时计算成本高,且实时推理部署能力有限。此外,大多数训练范式往往会降低视觉-语言主干的感知表示,导致过拟合和下游任务泛化能力差。在本工作中,我们提出了Evo-1,这是一种轻量级的VLA模型,减少了计算量并提高了部署效率,同时在无需机器人数据预训练的情况下保持了强大的性能。Evo-1基于原生多模态视觉-语言模型(VLM),结合了一种新颖的交叉调制扩散变换器以及优化的集成模块,共同形成了有效的架构。我们进一步引入了一种两阶段训练范式,逐步将行动与感知对齐,保留了VLM的表示。值得注意的是,仅包含7.7亿个参数的Evo-1在Meta-World和RoboTwin套件上取得了最先进的结果,分别超越了之前最佳模型12.4%和6.9%,并在LIBERO上也取得了竞争力的结果,达到94.8%。在实际评估中,Evo-1以高推理频率和低内存开销实现了78%的成功率,超越了所有基线方法。我们发布了代码、数据和模型权重,以促进轻量级和高效VLA模型的研究。
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
First: 2025-06-05T07:26:34+00:00 · Latest: 2025-11-06T15:28:19+00:00
Comments: Project page: https://youngwanlee.github.io/holisafe
Abstract
Despite emerging efforts to enhance the safety of Vision-Language Models
(VLMs), current approaches face two main shortcomings. 1) Existing
safety-tuning datasets and benchmarks only partially consider how image-text
interactions can yield harmful content, often overlooking contextually unsafe
outcomes from seemingly benign pairs. This narrow coverage leaves VLMs
vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely
primarily on data-centric tuning, with limited architectural innovations to
intrinsically strengthen safety. We address these gaps by introducing a
holistic safety dataset and benchmark, HoliSafe, that spans all five
safe/unsafe image-text combinations, providing a more robust basis for both
training and evaluation (HoliSafe-Bench). We further propose a novel modular
framework for enhancing VLM safety with a visual guard module (VGM) designed to
assess the harmfulness of input images for VLMs. This module endows VLMs with a
dual functionality: they not only learn to generate safer responses but can
also provide an interpretable harmfulness classification to justify their
refusal decisions. A significant advantage of this approach is its modularity;
the VGM is designed as a plug-in component, allowing for seamless integration
with diverse pre-trained VLMs across various scales. Experiments show that
Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety
performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench
itself reveals critical vulnerabilities in existing VLM models. We hope that
HoliSafe and VGM will spur further research into robust and interpretable VLM
safety, expanding future avenues for multimodal alignment.
中文标题/摘要
标题:HoliSafe:视觉语言模型的全面安全基准和建模
尽管已经出现了增强视觉语言模型(VLMs)安全性的努力,但当前的方法存在两个主要不足。1)现有的安全调优数据集和基准仅部分考虑了图像-文本交互可能导致有害内容的问题,经常忽视看似无害的配对所引发的上下文不安全结果。这种狭窄的覆盖范围使VLMs在未见配置中容易受到越狱攻击。2)先前的方法主要依赖于数据驱动的调优,缺乏对内在增强安全性的架构创新。我们通过引入一个全面的安全数据集和基准——HoliSafe,解决了这些差距,该基准涵盖了所有五种安全/不安全的图像-文本组合,为训练和评估提供了更坚实的基础(HoliSafe-Bench)。我们还提出了一种新的模块化框架,通过视觉守护模块(VGM)增强VLM的安全性,该模块旨在评估输入图像对VLM的有害性。该模块赋予VLMs双重功能:它们不仅学习生成更安全的响应,还可以提供可解释的有害性分类,以证明其拒绝决策的合理性。这种方法的一个重要优势是其模块化;VGM被设计为插件组件,可以无缝集成到各种规模的预训练VLMs中。实验表明,使用VGM训练的Safe-VLM在多个VLM基准上实现了最先进的安全性能。此外,HoliSafe-Bench本身揭示了现有VLM模型中的关键漏洞。我们希望HoliSafe和VGM能够激发更多关于稳健和可解释的VLM安全性的研究,扩展未来多模态对齐的途径。
Evaluating LLM-Contaminated Crowdsourcing Data Without Ground Truth
Authors: Yichi Zhang, Jinlong Pang, Zhaowei Zhu, Yang Liu
First: 2025-06-08T04:38:39+00:00 · Latest: 2025-11-06T15:24:22+00:00
Comments: 32 pages, 7 figures
Abstract
The recent success of generative AI highlights the crucial role of
high-quality human feedback in building trustworthy AI systems. However, the
increasing use of large language models (LLMs) by crowdsourcing workers poses a
significant challenge: datasets intended to reflect human input may be
compromised by LLM-generated responses. Existing LLM detection approaches often
rely on high-dimensional training data such as text, making them unsuitable for
annotation tasks like multiple-choice labeling. In this work, we investigate
the potential of peer prediction -- a mechanism that evaluates the information
within workers' responses without using ground truth -- to mitigate
LLM-assisted cheating in crowdsourcing with a focus on annotation tasks. Our
approach quantifies the correlations between worker answers while conditioning
on (a subset of) LLM-generated labels available to the requester. Building on
prior research, we propose a training-free scoring mechanism with theoretical
guarantees under a crowdsourcing model that accounts for LLM collusion. We
establish conditions under which our method is effective and empirically
demonstrate its robustness in detecting low-effort cheating on real-world
crowdsourcing datasets.
中文标题/摘要
标题:无需地面真实性的LLM污染众包数据评估
生成式AI的近期成功突显了高质量人类反馈在构建可信赖AI系统中的关键作用。然而,众包工作者越来越多地使用大型语言模型(LLM)带来了重大挑战:旨在反映人类输入的数据可能受到LLM生成响应的污染。现有的LLM检测方法通常依赖于高维训练数据(如文本),使其不适合用于如多项选择标注等注释任务。在本文中,我们研究了同伴预测——一种机制,该机制可以在不使用地面真实性的前提下评估工人响应中的信息——在众包注释任务中对抗LLM辅助作弊的潜力。我们的方法在条件(LLM生成标签的一部分)下量化了工人答案之间的相关性。基于先前的研究,我们提出了一种无需训练的评分机制,并在考虑LLM合谋的众包模型下提供了理论保证。我们确定了该方法有效性的条件,并通过在真实世界众包数据集上进行实证研究,证明了其在检测低努力作弊方面的鲁棒性。
Summary / 总结
This work addresses the challenge of large language model (LLM) contamination in crowdsourced data by leveraging peer prediction to evaluate worker responses without ground truth. The method conditions on a subset of LLM-generated labels to quantify correlations between worker answers, providing a training-free scoring mechanism with theoretical guarantees. Empirical results show its effectiveness in detecting low-effort cheating on real-world crowdsourcing datasets.
该研究通过利用同伴预测机制解决了大规模语言模型(LLM)污染 crowdsourcing 数据的挑战。方法无需使用 ground truth 数据来评估工人回答,重点关注标注任务。通过量化工人答案之间的相关性并基于 LLM 生成的标签进行条件处理,该方法有效检测了低努力作弊,并在真实世界的数据集中展示了鲁棒性。
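The intuition behind conditioning on LLM-generated labels in the entry above can be shown with a toy score: within each group of tasks that share the same LLM label, measure how much two workers agree beyond what their own label frequencies would produce by chance. Workers who simply copy the LLM show no residual correlation once the LLM label is fixed. This is an illustrative construction, not the paper's scoring mechanism, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import List


def conditioned_agreement(a: List[str], b: List[str], llm: List[str]) -> float:
    """Agreement between workers a and b beyond chance, conditioned on the LLM label."""
    groups = defaultdict(list)
    for ans_a, ans_b, z in zip(a, b, llm):
        groups[z].append((ans_a, ans_b))
    total = len(llm)
    score = 0.0
    for pairs in groups.values():
        n = len(pairs)
        observed = sum(1 for x, y in pairs if x == y) / n
        freq_a = Counter(x for x, _ in pairs)
        freq_b = Counter(y for _, y in pairs)
        # Agreement expected if the two workers answered independently in this group.
        chance = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
        score += (n / total) * (observed - chance)
    return score


llm_labels = ["x", "x", "x", "x"]
copy_a = copy_b = ["x", "x", "x", "x"]      # both workers paste the LLM label
honest_a = honest_b = ["x", "y", "x", "y"]  # both workers add signal beyond the LLM
print(conditioned_agreement(copy_a, copy_b, llm_labels))      # 0.0
print(conditioned_agreement(honest_a, honest_b, llm_labels))  # 0.5
```

The copied answers score zero even though the two workers agree on every task, because all of that agreement is already explained by the LLM label.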
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
Authors: Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
First: 2025-11-06T12:19:02+00:00 · Latest: 2025-11-06T12:19:02+00:00
Abstract
We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and
benchmark suite designed to advance computer-using agents (CUAs). CUAs present
unique challenges and are constrained by three persistent gaps: a scarcity of
real-world CUA tasks, the lack of automated collection-and-annotation pipelines
for multi-modal trajectories, and the absence of a unified benchmark that
jointly evaluates GUI grounding, screen parsing, and action prediction.
GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated
pipeline for query sourcing, environment-template construction, task
instantiation, batched execution, and LLM-driven quality filtering. The
released corpus contains over 1.2M executed action steps across thousands of
trajectories in popular Windows office applications, and includes
full-resolution screenshots, accessibility metadata when available,
instantiated goals, intermediate reasoning traces, and both successful and
failed action trajectories. The dataset supports three canonical tasks, GUI
grounding, screen parsing, and action prediction, and a hybrid GUI+API action
space that reflects modern agent designs. Benchmarking state-of-the-art
vision--language models on GUI-360$^\circ$ reveals substantial out-of-the-box
shortcomings in grounding and action prediction; supervised fine-tuning and
reinforcement learning yield significant gains but do not close the gap to
human-level reliability. We release GUI-360$^\circ$ and accompanying code to
facilitate reproducible research and accelerate progress on robust desktop
CUAs.
The full dataset has been made public on
https://huggingface.co/datasets/vyokky/GUI-360.
中文标题/摘要
标题:GUI-360:计算机使用代理的综合数据集和基准
我们介绍了GUI-360°,这是一个大规模、综合性的数据集和基准套件,旨在推动计算机使用代理(CUAs)的发展。CUAs面临独特的挑战,并受到三个持续存在的缺口的限制:现实世界CUA任务的稀缺性、多模态轨迹的自动化收集和注释管道的缺乏,以及缺乏一个统一的基准来联合评估GUI定位、屏幕解析和动作预测。
GUI-360°通过一个增强的LLM辅助、主要自动化的查询来源、环境模板构建、任务实例化、批量执行和LLM驱动的质量过滤管道来解决这些缺口。发布的语料库包含超过120万执行的动作步骤,跨越数千个轨迹,涵盖了流行的Windows办公应用程序,并包括全分辨率截图、可用时的无障碍元数据、实例化的目标、中间推理轨迹以及成功和失败的动作轨迹。该数据集支持三个经典任务:GUI定位、屏幕解析和动作预测,以及反映现代代理设计的GUI+API动作空间。在GUI-360°上对最先进的视觉-语言模型进行基准测试揭示了在定位和动作预测方面存在显著的开箱即用的不足;监督微调和强化学习取得了显著的改进,但并未完全弥补到人类水平的可靠性差距。我们发布了GUI-360°及其配套代码,以促进可重复研究并加速对稳健桌面CUAs的进展。
整个数据集已公开发布于https://huggingface.co/datasets/vyokky/GUI-360。
Summary / 总结
GUI-360 is a comprehensive dataset and benchmark suite for computer-using agents (CUAs), addressing the scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines, and the absence of a unified benchmark. It includes over 1.2 million action steps from thousands of trajectories in popular Windows office applications, with full-resolution screenshots, accessibility metadata, and reasoning traces. The dataset supports three tasks: GUI grounding, screen parsing, and action prediction. Benchmarking shows that state-of-the-art models have significant shortcomings in grounding and action prediction, but supervised fine-tuning and reinforcement learning improve performance. The dataset is publicly available at https://huggingface.co/datasets/vyokky/GUI-360.
GUI-360是一个针对计算机使用代理(CUAs)的综合数据集和基准套件,解决了现实世界CUA任务稀缺、自动化收集和注解管道缺乏以及缺乏统一基准的问题。它包含来自流行Windows办公应用程序的超过120万条操作步骤,包括全分辨率截图、无障碍元数据和推理轨迹。该数据集支持三个任务:GUI定位、屏幕解析和动作预测。基准测试显示,最先进的模型在定位和动作预测方面存在显著缺陷,但监督微调和强化学习可以显著提高性能。数据集已公开发布在https://huggingface.co/datasets/vyokky/GUI-360。
TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
Authors: André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins
First: 2025-10-22T17:02:48+00:00 · Latest: 2025-11-06T11:09:11+00:00
Comments: 15 pages, 7 figures, submitted to arXiv October 2025. All models,
datasets, and training code will be released at
https://huggingface.co/collections/utter-project/towervision
Abstract
Despite significant advances in vision-language models (VLMs), most existing
work follows an English-centric design process, limiting the models' effectiveness in
multilingual settings. In this work, we provide a comprehensive empirical study
analyzing the impact of several multilingual design choices, such as training
data composition, encoder selection, and text backbones. The result is
TowerVision, a family of open multilingual VLMs for both image-text and
video-text tasks, built upon the multilingual text-only model Tower+.
TowerVision achieves competitive performance on multiple multimodal
multilingual benchmarks and shows particular strength in culturally grounded
tasks and multimodal translation. By incorporating visual and cultural context
during fine-tuning, our models surpass existing approaches trained on
substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image
tasks) and ViMUL-Bench (video tasks). Alongside the models, we release
VisionBlocks, a high-quality, curated vision-language dataset. Our findings
highlight that multilingual vision-language training data substantially
improves cross-lingual generalization -- both from high-resource to
underrepresented languages and vice versa -- and that instruction-tuned LLMs
are not always the optimal initialization point. To support further research,
we publicly release all models, data, and training recipes.
中文标题/摘要
标题:TowerVision:理解并改进视觉语言模型中的多语言性
尽管在视觉语言模型(VLMs)方面取得了显著进展,但大多数现有工作都遵循以英语为中心的设计过程,限制了它们在多语言环境中的有效性。在本研究中,我们提供了一项全面的经验性研究,分析了多种多语言设计选择的影响,如训练数据组成、编码器选择和文本骨干。结果是TowerVision,一个基于多语言文本模型Tower+的多语言VLM家族,适用于图像文本和视频文本任务。TowerVision在多个跨模态多语言基准测试中取得了竞争力的表现,并在文化背景任务和跨模态翻译方面表现出特别的优势。通过在微调过程中结合视觉和文化背景,我们的模型在ALM-Bench和Multi30K(图像任务)以及ViMUL-Bench(视频任务)上超过了现有在更大数据集上训练的方法。除了模型外,我们还发布了VisionBlocks,一个高质量、精选的视觉语言数据集。我们的研究结果表明,多语言视觉语言训练数据显著提高了跨语言泛化能力——无论是从高资源语言到未充分代表的语言,还是反之亦然——并且指令调优的大规模语言模型并不总是最佳的初始化点。为了支持进一步的研究,我们将在https://huggingface.co/collections/utter-project/towervision上公开发布所有模型、数据和训练配方。
RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability
Authors: Jonggwon Park, Byungmu Yoon, Soobum Kim, Kyoyun Choi
Venue: NeurIPS 2025
First: 2025-04-10T03:14:17+00:00 · Latest: 2025-11-06T09:22:17+00:00
Comments: NeurIPS 2025
Abstract
Recent advancements in multimodal models have significantly improved
vision-language (VL) alignment in radiology. However, existing approaches
struggle to effectively utilize complex radiology reports for learning and
offer limited interpretability through attention probability visualizations. To
address these challenges, we introduce RadZero, a novel framework
for VL alignment in chest X-ray with zero-shot multi-task capability. A key
component of our approach is VL-CABS (Vision-Language Cross-Attention
Based on Similarity), which aligns text embeddings with
local image features for interpretable, fine-grained VL reasoning. RadZero
leverages large language models to extract concise semantic sentences from
radiology reports and employs multi-positive contrastive training to
effectively capture relationships between images and multiple relevant textual
descriptions. It uses a pre-trained vision encoder with additional trainable
Transformer layers, allowing efficient high-resolution image processing. By
computing similarity between text embeddings and local image patch features,
VL-CABS enables zero-shot inference with similarity probability for
classification, and pixel-level VL similarity maps for grounding and
segmentation. Experimental results on public chest radiograph benchmarks show
that RadZero outperforms state-of-the-art methods in zero-shot classification,
grounding, and segmentation. Furthermore, VL similarity map analysis highlights
the potential of VL-CABS for improving explainability in VL alignment.
Additionally, qualitative evaluation demonstrates RadZero's capability for
open-vocabulary semantic segmentation, further validating its effectiveness in
medical imaging. Code is available at
https://github.com/deepnoid-ai/RadZero.
中文标题/摘要
标题:RadZero:基于相似性的跨注意力在胸部X光片上实现可解释的视觉-语言对齐,并具备零样本多任务能力
近年来,多模态模型在放射学中的视觉-语言(VL)对齐方面取得了显著进步。然而,现有方法难以有效利用复杂的放射学报告进行学习,并且通过注意力概率可视化提供有限的可解释性。为了解决这些挑战,我们提出了RadZero,一种具备零样本多任务能力的新型胸部X光片上VL对齐框架。我们方法的关键组件是VL-CABS(基于相似性的视觉-语言跨注意力),它将文本嵌入与局部图像特征对齐,以实现可解释的细粒度VL推理。RadZero 利用大型语言模型从放射学报告中提取简洁的语义句子,并采用多正样本对比训练来有效捕捉图像与多个相关文本描述之间的关系。它使用预训练的视觉编码器和额外的可训练Transformer层,实现高效的高分辨率图像处理。通过计算文本嵌入与局部图像块特征之间的相似性,VL-CABS 能够进行零样本推理,通过相似概率进行分类,并生成像素级VL相似图以实现定位和分割。在公共胸部X光片基准测试上的实验结果表明,RadZero 在零样本分类、定位和分割方面优于现有最先进的方法。此外,VL相似图分析突显了VL-CABS 在提升VL对齐可解释性方面的潜力。此外,定性评估展示了RadZero 的开放词汇语义分割能力,进一步验证了其在医学成像中的有效性。代码可在 https://github.com/deepnoid-ai/RadZero 获取。
Summary / 总结
RadZero is a novel framework for vision-language alignment in chest X-ray with zero-shot multi-task capability. It introduces VL-CABS, a similarity-based cross-attention mechanism that aligns text embeddings with local image features for interpretable, fine-grained reasoning. RadZero uses large language models to extract concise semantic sentences from radiology reports and applies multi-positive contrastive training to capture relationships between images and multiple relevant textual descriptions. It outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation on public chest radiograph benchmarks. The pixel-level VL similarity maps produced by VL-CABS also improve the explainability of the alignment.
RadZero 是一种用于胸部 X 光片的视觉-语言对齐框架,具备零样本多任务能力。它引入了基于相似性的跨注意力机制 VL-CABS,用于将文本嵌入与局部图像特征对齐,实现可解释的细粒度推理。RadZero 利用大型语言模型从放射学报告中提取语义句子,并使用多正样本对比训练来捕捉图像与文本描述之间的关系。实验结果表明,RadZero 在零样本分类、定位和分割方面优于现有最先进的方法,并突出了 VL-CABS 在提升视觉-语言对齐可解释性方面的潜力。
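The step of "computing similarity between text embeddings and local image patch features" can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the RadZero implementation: the function names, the softmax-weighted pooling, the sigmoid, and the `temperature` value are all hypothetical choices made for the example.

```python
import numpy as np


def similarity_map(text_emb, patch_feats):
    """Cosine similarity between a text embedding (d,) and patch features (h, w, d)."""
    t = text_emb / np.linalg.norm(text_emb)
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    return p @ t  # shape (h, w), values in [-1, 1]


def similarity_probability(sim_map, temperature=0.1):
    """Pool the map with similarity-derived weights, then squash to (0, 1).

    The pooling and sigmoid are assumptions for illustration only.
    """
    logits = sim_map / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    pooled = float((weights * sim_map).sum())
    return 1.0 / (1.0 + np.exp(-pooled / temperature))


rng = np.random.default_rng(0)
text = rng.normal(size=8)
patches = rng.normal(size=(4, 4, 8))
patches[1, 2] = text  # plant one patch that matches the text exactly

sim = similarity_map(text, patches)
peak = np.unravel_index(sim.argmax(), sim.shape)
prob = similarity_probability(sim)
print(peak, prob)
```

The same map serves two purposes in this sketch: its peak location acts as a coarse grounding signal, and its pooled value gives a single image-level score for classification.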
Text to Sketch Generation with Multi-Styles
Authors: Tengjie Li, Shikui Tu, Lei Xu
Venue: NeurIPS 2025
First: 2025-11-06T07:13:56+00:00 · Latest: 2025-11-06T07:13:56+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Recent advances in vision-language models have facilitated progress in sketch
generation. However, existing specialized methods primarily focus on generic
synthesis and lack mechanisms for precise control over sketch styles. In this
work, we propose a training-free framework based on diffusion models that
enables explicit style guidance via textual prompts and referenced style
sketches. Unlike previous style transfer methods that overwrite key and value
matrices in self-attention, we incorporate the reference features as auxiliary
information with linear smoothing and leverage a style-content guidance
mechanism. This design effectively reduces content leakage from reference
sketches and enhances synthesis quality, especially in cases with low
structural similarity between reference and target sketches. Furthermore, we
extend our framework to support controllable multi-style generation by
integrating features from multiple reference sketches, coordinated via a joint
AdaIN module. Extensive experiments demonstrate that our approach achieves
high-quality sketch generation with accurate style alignment and improved
flexibility in style control. The official implementation of M3S is available
at https://github.com/CMACH508/M3S.
中文标题/摘要
标题:基于多风格的文本到草图生成
视觉语言模型的最新进展促进了草图生成的进步。然而,现有的专门方法主要集中在通用合成上,缺乏对草图风格的精确控制机制。在本工作中,我们提出了一种基于扩散模型的无需训练框架,通过文本提示和参考风格草图实现显式的风格指导。与以往在自注意力中覆盖键和值矩阵的风格迁移方法不同,我们通过线性平滑将参考特征作为辅助信息纳入,并利用风格-内容指导机制。这种设计有效地减少了参考草图中的内容泄露,提高了合成质量,特别是在参考草图和目标草图结构相似度低的情况下。此外,我们通过结合多个参考草图的特征,利用联合AdaIN模块协调,将框架扩展到支持可控的多风格生成。广泛的实验表明,我们的方法实现了高质量的草图生成,具有准确的风格对齐和更高的风格控制灵活性。M3S的官方实现可在https://github.com/CMACH508/M3S获得。
Summary / 总结
This work addresses the limitation of existing methods in controlling sketch styles precisely by proposing a training-free framework based on diffusion models. The framework uses textual prompts and reference sketches for explicit style guidance, incorporating reference features with linear smoothing and a style-content guidance mechanism. This approach reduces content leakage and improves synthesis quality, especially in cases with low structural similarity. The framework is further extended to support multi-style generation by integrating features from multiple reference sketches. Experiments show high-quality sketch generation with accurate style alignment and improved flexibility in style control.
该研究针对现有方法在精确控制素描风格方面的不足,提出了一种基于扩散模型的无需训练框架。该框架通过文本提示和参考风格素描来引导生成过程,采用线性平滑和风格-内容引导机制整合参考特征。这种方法在参考素描和目标素描结构相似度低的情况下,提高了生成质量。此外,该框架还支持多风格生成,通过联合AdaIN模块整合多个参考素描的特征。实验表明,所提出的方法能够实现高质量的素描生成,具有准确的风格对齐和增强的风格控制灵活性。
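Adaptive instance normalization (AdaIN) re-scales content features so that their per-channel mean and standard deviation match those of a style reference. The sketch below shows standard AdaIN plus one possible way to coordinate several references. The abstract does not specify how the joint AdaIN module combines them, so the convex blending of statistics in `joint_adain` and its `weights` argument are assumptions.

```python
import numpy as np


def channel_stats(feat, eps=1e-5):
    """Per-channel mean and standard deviation of a (C, H, W) feature map."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = feat.std(axis=(1, 2), keepdims=True) + eps
    return mean, std


def adain(content, style_mean, style_std):
    """Standard AdaIN: normalize content, then apply the style statistics."""
    c_mean, c_std = channel_stats(content)
    return style_std * (content - c_mean) / c_std + style_mean


def joint_adain(content, styles, weights):
    """Hypothetical multi-style variant: blend style statistics with convex weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stats = [channel_stats(s) for s in styles]
    mean = sum(wi * m for wi, (m, _) in zip(w, stats))
    std = sum(wi * s for wi, (_, s) in zip(w, stats))
    return adain(content, mean, std)


rng = np.random.default_rng(0)
content = rng.normal(size=(3, 8, 8))
style_a = rng.normal(loc=2.0, scale=0.5, size=(3, 8, 8))
style_b = rng.normal(loc=-1.0, scale=2.0, size=(3, 8, 8))

blended = joint_adain(content, [style_a, style_b], weights=[0.5, 0.5])
print(blended.mean(axis=(1, 2)))
```

Changing `weights` moves the output statistics between the two references, which is one simple reading of "controllable multi-style generation".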
Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration
Authors: Yunghee Lee, Byeonghyun Pak, Junwha Hong, Hoseong Kim
Venue: NeurIPS 2025
First: 2025-11-06T07:08:58+00:00 · Latest: 2025-11-06T07:08:58+00:00
Comments: 21 pages, 8 figures. NeurIPS 2025. Project page:
https://yhlee-add.github.io/THG
Abstract
In this paper, we propose Tortoise and Hare Guidance (THG), a training-free
strategy that accelerates diffusion sampling while maintaining high-fidelity
generation. We demonstrate that the noise estimate and the additional guidance
term exhibit markedly different sensitivity to numerical error by reformulating
the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our
error-bound analysis shows that the additional guidance branch is more robust
to approximation, revealing substantial redundancy that conventional solvers
fail to exploit. Building on this insight, THG significantly reduces the
computation of the additional guidance: the noise estimate is integrated with
the tortoise equation on the original, fine-grained timestep grid, while the
additional guidance is integrated with the hare equation only on a coarse grid.
We also introduce (i) an error-bound-aware timestep sampler that adaptively
selects step sizes and (ii) a guidance-scale scheduler that stabilizes large
extrapolation spans. THG reduces the number of function evaluations (NFE) by up
to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward
$\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free
accelerators under identical computation budgets. Our findings highlight the
potential of multirate formulations for diffusion solvers, paving the way for
real-time high-quality image synthesis without any model retraining. The source
code is available at https://github.com/yhlee-add/THG.
Summary / 总结
Tortoise and Hare Guidance (THG) is a training-free method that accelerates diffusion model inference by reformulating the classifier-free guidance (CFG) ODE as a multirate system. THG integrates the noise estimate on a fine-grained timestep grid and the additional guidance on a coarser grid, reducing the number of function evaluations by up to 30% without compromising generation fidelity. It also includes an adaptive timestep sampler and a guidance-scale scheduler to further stabilize the process. THG outperforms other state-of-the-art CFG-based accelerators under the same computational budget.
Tortoise and Hare Guidance (THG) 是一种无需训练的方法,通过将 classifier-free guidance (CFG) ODE 重新表述为多速率系统来加速扩散模型推理。THG 在细粒度的时间步网格上积分噪声估计,在较粗粒度的时间步网格上积分附加指导,最多可减少 30% 的函数评估次数,同时不牺牲生成保真度。它还包含一个自适应时间步采样器和一个指导尺度调度器以进一步稳定过程。THG 在相同的计算预算下优于其他最先进的基于 CFG 的加速器。
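The multirate idea, integrating one term on a fine grid and another on a coarse grid, can be shown on a toy ODE. The sketch below is not THG's solver and involves no diffusion model: `fine_term` and `coarse_term` are stand-ins for the noise estimate and the additional guidance term, the Euler scheme and the "hold the last coarse value" rule are assumptions, and the error-bound-aware sampler and guidance-scale scheduler are omitted.

```python
import math


def integrate(x0, coarse_every, n_steps=100, t_end=1.0):
    """Euler integration of dx/dt = fine_term + coarse_term on a toy problem.

    The coarse term is re-evaluated only every `coarse_every` steps and
    its cached value is reused in between.
    """
    calls = {"fine": 0, "coarse": 0}

    def fine_term(x, t):  # stands in for the noise estimate
        calls["fine"] += 1
        return -x

    def coarse_term(x, t):  # stands in for the additional guidance term
        calls["coarse"] += 1
        return 0.1 * math.cos(t)

    dt = t_end / n_steps
    x, t, cached = x0, 0.0, 0.0
    for step in range(n_steps):
        if step % coarse_every == 0:
            cached = coarse_term(x, t)
        x += dt * (fine_term(x, t) + cached)
        t += dt
    return x, calls


x_ref, calls_ref = integrate(1.0, coarse_every=1)  # both terms on the fine grid
x_mr, calls_mr = integrate(1.0, coarse_every=4)    # coarse term on every 4th step
print(calls_ref, calls_mr, abs(x_ref - x_mr))
```

Because the coarse term varies slowly here, reusing it across several steps removes most of its evaluations while barely changing the final state, which mirrors the redundancy argument in the abstract.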
FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Authors: Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao
Venue: NeurIPS 2025
First: 2025-10-13T09:22:12+00:00 · Latest: 2025-11-06T06:08:08+00:00
Comments: 19 pages, 11 figures. Accepted by the 39th Conference on Neural
Information Processing Systems (NeurIPS 2025)
Abstract
Multimodal large language models (MLLMs) face an inherent trade-off between
faithfulness and creativity, as different tasks require varying degrees of
associative reasoning. However, existing methods lack the flexibility to
modulate this reasoning strength, limiting MLLMs' adaptability across factual
and creative scenarios. To bridge this gap, we propose equipping MLLMs with
mechanisms that enable flexible control over associative reasoning. We begin by
investigating the internal mechanisms underlying associative behavior in MLLMs
and find that: (1) middle layers play a pivotal role in shaping the model's
associative tendencies, (2) modifying representations in these layers
effectively regulates associative reasoning strength, and (3) hallucinations
can be exploited to derive steering vectors that guide this modulation.
Building on these findings, we introduce Flexible Association Control (FlexAC),
a lightweight and training-free framework for modulating associative behavior
in MLLMs. FlexAC first induces hallucination-guided intermediate
representations to encode associative directions. Then, it selects
high-association instances to construct effective associative steering vectors,
whose strengths are adaptively calibrated to balance creative guidance with
output stability. Finally, recognizing the multi-dimensional nature of
associative reasoning, FlexAC incorporates task-specific associative vectors
derived from a forward pass on a few target-domain samples, enabling models to
follow diverse associative directions and better adapt to creative tasks.
Notably, our method achieves up to a 5.8x improvement in creativity on
Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing
existing baselines and demonstrating its effectiveness in enabling flexible
control over associative reasoning in MLLMs. Our code is available at
https://github.com/ylhz/FlexAC.
中文标题/摘要
标题:FlexAC:迈向多模态大型语言模型中关联推理的灵活控制
多模态大型语言模型(MLLMs)在忠实性和创造性之间存在固有的权衡,因为不同的任务需要不同程度的关联推理。然而,现有方法缺乏调节这种推理强度的灵活性,限制了MLLMs在事实性和创造性场景中的适应性。为了解决这一问题,我们提出为MLLMs配备机制,使其能够灵活控制关联推理。我们首先研究了MLLMs内部驱动关联行为的机制,并发现:(1) 中间层在塑造模型的关联倾向中起着关键作用,(2) 修改这些层中的表示可以有效地调节关联推理强度,(3) 可以利用幻觉来推导出引导这种调节的引导向量。基于这些发现,我们引入了灵活关联控制(FlexAC),这是一种轻量级且无需训练的框架,用于调节MLLMs的关联行为。FlexAC 首先通过幻觉引导的中间表示来编码关联方向。然后,它选择高关联实例来构建有效的关联引导向量,其强度会根据创造性指导与输出稳定性之间的平衡进行自适应校准。最后,考虑到关联推理的多维性质,FlexAC 结合了从少量目标领域样本前向传递中推导出的任务特定关联向量,使模型能够遵循多种关联方向,更好地适应创造性任务。值得注意的是,我们的方法在Creation-MMBench上的创造性提高了5.8倍,在CHAIR上的幻觉率降低了29%,超过了现有基线,证明了其在MLLMs中实现灵活控制关联推理的有效性。我们的代码可在https://github.com/ylhz/FlexAC/获取。
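Steering vectors of the kind described above are commonly built as a difference of mean hidden states and then added to intermediate representations with a signed strength. The sketch below illustrates that general recipe on random arrays. It is not FlexAC's procedure: the function names, the unit normalization, and the fixed `alpha` are assumptions, and the paper's instance selection and adaptive calibration are not modeled.

```python
import numpy as np


def steering_vector(assoc_hidden, base_hidden):
    """Unit direction from baseline states toward high-association states."""
    diff = assoc_hidden.mean(axis=0) - base_hidden.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden, direction, alpha):
    """Shift hidden states along the direction; the sign of alpha sets the effect."""
    return hidden + alpha * direction


rng = np.random.default_rng(0)
dim = 32
true_dir = np.zeros(dim)
true_dir[0] = 1.0

base = rng.normal(size=(16, dim))   # stand-in for faithful middle-layer states
assoc = base + 2.0 * true_dir       # stand-in for hallucination-guided states
v = steering_vector(assoc, base)

h = rng.normal(size=(5, dim))
more_associative = steer(h, v, alpha=1.5)
less_associative = steer(h, v, alpha=-1.5)
print(np.round(v[:4], 3))
```

A positive `alpha` pushes representations toward the associative direction for creative tasks, and a negative `alpha` pulls them away for tasks where faithfulness matters more.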
Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
Authors: Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull
First: 2025-06-12T19:14:00+00:00 · Latest: 2025-11-06T05:07:51+00:00
Abstract
Maintaining good driving behavior in out-of-distribution scenarios remains a
critical challenge in autonomous driving. A promising direction is to leverage
the generalist knowledge and reasoning capabilities of large language models by
treating unusual driving scenarios as a logical reasoning task. In this work,
we present Poutine, a method that uses an off-the-shelf 3B-parameter
vision-language model (VLM), without any additional components, to achieve
robust end-to-end autonomous driving via a simple and scalable training recipe.
To learn strong base driving capabilities, we first train Poutine-Base using
self-supervised next-token prediction over vision, language, and trajectory
(VLT) tokens, leveraging both nominal and long-tail driving data. In the second
stage, we fine-tune Poutine-Base using Group Relative Policy Optimization
(GRPO) with a small set of human preference-labeled examples. We evaluate our
approach on the Waymo end-to-end driving benchmark curated for long-tail
scenarios. The final Poutine model achieves an RFS of 7.99 on the test set,
placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a
significant margin. Our results suggest that handcrafted tokenizers or custom
architectural components added to base VLMs in prior work are not necessary to
achieve strong driving performance. Instead, this work highlights the potential
of scalable VLT pretraining combined with lightweight RL fine-tuning to enable
robust and generalizable autonomous driving.
中文标题/摘要
标题:Poutine:视觉-语言-轨迹预训练和强化学习后训练实现稳健的端到端自动驾驶
在分布外场景中保持良好的驾驶行为仍然是自动驾驶领域的关键挑战。一种有前景的方向是利用大型语言模型的通用知识和推理能力,将异常驾驶场景视为逻辑推理任务。在本工作中,我们提出了Poutine,该方法仅使用一个现成的30亿参数视觉-语言模型(VLM),无需任何额外组件,通过简单且可扩展的训练方案实现稳健的端到端自动驾驶。为了学习强大的基础驾驶能力,我们首先在视觉、语言和轨迹(VLT)标记上使用自监督的下一个标记预测来训练Poutine-Base,同时利用常规和长尾驾驶数据。在第二阶段,我们使用组相对策略优化(GRPO)和少量人类偏好标注示例对Poutine-Base进行微调。我们在为长尾场景构建的Waymo端到端驾驶基准上评估了我们的方法。最终的Poutine模型在测试集上的RFS为7.99,在2025年Waymo基于视觉的端到端驾驶挑战赛中以显著优势获得第一名。我们的结果表明,先前工作中为基础VLM添加的手工设计的分词器或自定义架构组件,并不是实现强大驾驶性能的必要条件。相反,本工作强调了可扩展的VLT预训练与轻量级强化学习微调相结合的潜力,以实现稳健且可泛化的自动驾驶。
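The core of Group Relative Policy Optimization is that each sampled output is scored relative to the other outputs drawn for the same prompt, so no learned value function is needed. The sketch below shows only that group-relative advantage computation, using the usual mean and standard-deviation normalization. The reward values are made up for illustration, and nothing here reflects Poutine's actual reward design or hyperparameters.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical preference-based rewards for three candidate trajectories.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
tied = group_relative_advantages([5.0, 5.0])
print(advantages, tied)
```

Candidates that beat their group's average receive positive advantages and the rest receive negative ones. When every candidate ties, all advantages are zero and the group contributes no learning signal.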