arXiv 论文速递

Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models

Authors: Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang

First: 2025-10-06T17:56:46+00:00 · Latest: 2025-10-06T17:56:46+00:00

Comments: 17 pages, 8 figures. Work in progress

Abstract

Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering advantages such as accelerated parallel decoding and bidirectional context modeling. However, the vanilla decoding strategy in discrete dLLMs suffers from a critical limitation: once a token is accepted, it can no longer be revised in subsequent steps. As a result, early mistakes persist across iterations, harming both intermediate predictions and final output quality. To address this issue, we propose Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding strategy that leverages cross-validation among predicted tokens. Unlike existing methods that follow a single progressive unmasking procedure, Tolerator introduces a two-stage process: (i) sequence fill-up and (ii) iterative refinement by remasking and decoding a subset of tokens while treating the remaining as context. This design enables previously accepted tokens to be reconsidered and corrected when necessary, leading to more reliable diffusion decoding outputs. We evaluate Tolerator on five standard benchmarks covering language understanding, code generation, and mathematics. Experiments show that our method achieves consistent improvements over the baselines under the same computational budget. These findings suggest that decoding algorithms are crucial to realizing the full potential of diffusion large language models. Code and data are publicly available.

中文标题/摘要

标题：先完成，后完善：用于扩散大型语言模型的测试时标记级交叉验证

扩散大型语言模型（dLLMs）最近作为一种有前途的替代自回归（AR）模型的方案出现，提供了诸如加速并行解码和双向上下文建模等优势。然而，离散dLLMs中的基本解码策略存在一个关键限制：一旦接受了一个标记，它在后续步骤中将无法被修改。因此，早期错误会在迭代中持续存在，损害中间预测和最终输出的质量。为了解决这一问题，我们提出了Tolerator（标记级交叉验证精炼），这是一种无需训练的解码策略，利用预测标记之间的交叉验证。与现有方法遵循单一渐进解码掩蔽程序不同，Tolerator引入了两阶段过程：（i）序列填充和（ii）通过重新掩蔽和解码一部分标记并将其余部分视为上下文的迭代精炼。这种设计使先前接受的标记能够在必要时被重新考虑和修正，从而产生更可靠的扩散解码输出。我们在涵盖语言理解、代码生成和数学的五个标准基准上评估了Tolerator。实验表明，在相同的计算预算下，我们的方法在基线之上实现了持续改进。这些发现表明，解码算法对于实现扩散大型语言模型的全部潜力至关重要。代码和数据已公开。

Summary / 总结

The paper addresses a limitation of vanilla decoding in diffusion large language models (dLLMs): once a token is accepted it cannot be revised, so early mistakes persist. It introduces Tolerator, a training-free decoding strategy that uses cross-validation among predicted tokens. Tolerator works in two stages: sequence fill-up, followed by iterative refinement in which a subset of tokens is remasked and re-decoded while the remaining tokens serve as context. This allows previously accepted tokens to be reconsidered and corrected, improving the reliability of the final output. Experiments on five benchmarks covering language understanding, code generation, and mathematics show consistent improvements over baselines under the same computational budget, highlighting the importance of decoding algorithms in dLLMs.

论文针对扩散大型语言模型(dLLMs)中vanilla解码方式早期错误无法修正的问题，提出了Tolerator，这是一种无需训练的解码策略，通过预测令牌之间的交叉验证来实现。Tolerator采用两阶段过程：序列填充后进行迭代精炼。这使得之前接受的令牌可以在必要时重新考虑，从而提高最终输出的可靠性。在五个基准上的实验显示，该方法在相同的计算预算下比基线方法有持续的改进，强调了解码算法对dLLMs潜力实现的重要性。
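
The two-stage idea described above (fill-up, then remask-and-redecode) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `MASK`, `fill_up`, `refine`, and `majority_predictor` are hypothetical names, the toy predictor stands in for a real dLLM, and the round-robin choice of positions to remask is a simplification chosen to keep the example deterministic.

```python
from collections import Counter
from typing import Callable, List, Optional

MASK = None  # hypothetical marker for a masked position

# A predictor maps (sequence with masks, position) to a token.
# It stands in for the dLLM and is an assumption of this sketch.
Predictor = Callable[[List[Optional[str]], int], str]


def fill_up(seq: List[Optional[str]], predict: Predictor) -> List[Optional[str]]:
    """Stage 1: decode every masked position once so the sequence is complete."""
    out = list(seq)
    for i, tok in enumerate(out):
        if tok is MASK:
            out[i] = predict(out, i)
    return out


def refine(seq: List[Optional[str]], predict: Predictor,
           editable: List[int], rounds: int, k: int) -> List[Optional[str]]:
    """Stage 2: repeatedly remask k editable positions and re-decode them,
    treating all other tokens as context."""
    out = list(seq)
    n = len(editable)
    if n == 0:
        return out
    for r in range(rounds):
        # Round-robin subset selection (an assumption made for determinism).
        chosen = [editable[(r * k + j) % n] for j in range(k)]
        masked = list(out)
        for i in chosen:
            masked[i] = MASK
        for i in chosen:
            out[i] = predict(masked, i)
    return out


def majority_predictor(context: List[Optional[str]], position: int) -> str:
    """Toy stand-in for the model: predict the most common visible token."""
    visible = [t for t in context if t is not MASK]
    return Counter(visible).most_common(1)[0][0]
```

In this toy setting, a draft such as `["a", "b", "a", "a", "a"]` with an early mistake at index 1 is corrected once that position is remasked and re-predicted from the surrounding tokens, which is the behavior single-pass unmasking cannot provide.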

SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs

Authors: Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, Wen Xiao

First: 2025-10-06T17:46:34+00:00 · Latest: 2025-10-06T17:46:34+00:00

Comments: Code: https://github.com/sdc17/SwiReasoning, Website: https://swireasoning.github.io/

Abstract

Recent work shows that, beyond discrete reasoning through explicit chain-of-thought steps, which are limited by the boundaries of natural languages, large language models (LLMs) can also reason continuously in latent space, allowing richer information per step and thereby improving token efficiency. Despite this promise, latent reasoning still faces two challenges, especially in training-free settings: 1) purely latent reasoning broadens the search distribution by maintaining multiple implicit paths, which diffuses probability mass, introduces noise, and impedes convergence to a single high-confidence solution, thereby hurting accuracy; and 2) overthinking persists even without explicit text, wasting tokens and degrading efficiency. To address these issues, we introduce SwiReasoning, a training-free framework for LLM reasoning which features two key innovations: 1) SwiReasoning dynamically switches between explicit and latent reasoning, guided by block-wise confidence estimated from entropy trends in next-token distributions, to balance exploration and exploitation and promote timely convergence. 2) By limiting the maximum number of thinking-block switches, SwiReasoning curbs overthinking and improves token efficiency across varying problem difficulties. On widely used mathematics and STEM benchmarks, SwiReasoning consistently improves average accuracy by 1.5%-2.8% across reasoning LLMs of different model families and scales. Furthermore, under constrained budgets, SwiReasoning improves average token efficiency by 56%-79%, with larger gains as budgets tighten.

中文标题/摘要

标题：SwiReasoning：在潜在和显式空间中的切换思考以促进帕累托占优的LLM推理

近期研究表明，除了通过有限自然语言边界的离散推理之外，大型语言模型（LLMs）还可以在潜在空间中进行连续推理，允许每步包含更丰富的信息，从而提高标记效率。尽管如此，潜在推理仍然面临两个挑战，尤其是在无需训练的情况下：1）纯粹的潜在推理通过保持多个隐式路径来扩大搜索范围，这会分散概率质量，引入噪声并阻碍向单一高置信度解决方案的收敛，从而损害准确性；2）即使没有显式文本，过度思考也会持续存在，浪费标记并降低效率。为了解决这些问题，我们提出了SwiReasoning，这是一种无需训练的LLM推理框架，具有两个关键创新：1）SwiReasoning根据来自下一个标记分布熵趋势的块级置信度估计动态在显式和潜在推理之间切换，以平衡探索和利用并促进及时收敛。2）通过限制思考块切换的最大次数，SwiReasoning抑制过度思考并在不同问题难度下提高标记效率。在广泛使用的数学和STEM基准测试中，SwiReasoning在不同模型家族和规模的推理LLM中的一致平均准确性提高了1.5%-2.8%。此外，在预算受限的情况下，SwiReasoning将平均标记效率提高了56%-79%，预算越紧，收益越大。

Summary / 总结

SwiReasoning is a training-free framework that dynamically switches between explicit and latent reasoning to improve accuracy and token efficiency in reasoning large language models (LLMs). It uses block-wise confidence estimated from entropy trends in next-token distributions to balance exploration and exploitation, and caps the number of thinking-block switches to curb overthinking. On widely used mathematics and STEM benchmarks, SwiReasoning improves average accuracy by 1.5%-2.8%; under constrained budgets, it improves average token efficiency by 56%-79%, with larger gains as budgets tighten.

SwiReasoning 是一个无需训练的框架，通过在显式和隐式推理之间动态切换来提高准确性和令牌效率。它使用熵趋势估计的块级置信度来平衡探索和利用，并限制思考块的切换次数以遏制过度思考。实验表明，SwiReasoning 在数学和 STEM 基准测试中将平均准确率提高了 1.5%-2.8%，平均令牌效率提高了 56%-79%。
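
As a rough illustration of entropy-guided mode switching with a switch cap, the sketch below tracks the Shannon entropy of next-token distributions and flips between two modes. The concrete rule used here (switch to explicit reasoning when entropy falls below the block's reference value, switch back to latent when it rises above it, and stop switching once the cap is reached) is an assumption made for this example; the abstract does not specify the exact criterion, and `SwitchController` is a hypothetical name.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class SwitchController:
    """Toy controller that alternates between 'latent' and 'explicit' modes."""

    def __init__(self, max_switches: int, start_mode: str = "latent"):
        self.mode = start_mode
        self.max_switches = max_switches
        self.switches = 0
        self.ref_entropy = None  # entropy recorded at the start of the current block

    def step(self, probs: List[float]) -> str:
        h = entropy(probs)
        if self.ref_entropy is None:
            self.ref_entropy = h
            return self.mode
        if self.switches < self.max_switches:
            # Assumed rule: falling entropy signals rising confidence.
            if self.mode == "latent" and h < self.ref_entropy:
                target = "explicit"
            elif self.mode == "explicit" and h > self.ref_entropy:
                target = "latent"
            else:
                target = self.mode
            if target != self.mode:
                self.mode = target
                self.switches += 1
                self.ref_entropy = h  # a new block starts here
        return self.mode
```

Capping `max_switches` is what bounds the amount of back-and-forth in this sketch: once the budget is spent, the controller stays in its current mode regardless of further entropy changes.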

ResCP: Reservoir Conformal Prediction for Time Series Forecasting

Authors: Roberto Neglia, Andrea Cini, Michael M. Bronstein, Filippo Maria Bianchi

First: 2025-10-06T17:37:44+00:00 · Latest: 2025-10-06T17:37:44+00:00

Abstract

Conformal prediction offers a powerful framework for building distribution-free prediction intervals for exchangeable data. Existing methods that extend conformal prediction to sequential data rely on fitting a relatively complex model to capture temporal dependencies. However, these methods can fail if the sample size is small and often require expensive retraining when the underlying data distribution changes. To overcome these limitations, we propose Reservoir Conformal Prediction (ResCP), a novel training-free conformal prediction method for time series. Our approach leverages the efficiency and representation learning capabilities of reservoir computing to dynamically reweight conformity scores. In particular, we compute similarity scores among reservoir states and use them to adaptively reweight the observed residuals at each step. With this approach, ResCP enables us to account for local temporal dynamics when modeling the error distribution without compromising computational scalability. We prove that, under reasonable assumptions, ResCP achieves asymptotic conditional coverage, and we empirically demonstrate its effectiveness across diverse forecasting tasks.

中文标题/摘要

标题：ResCP：用于时间序列预测的储层共形预测

共形预测提供了一种强大的框架，用于为可交换数据构建分布无关的预测区间。现有的将共形预测扩展到序列数据的方法依赖于拟合相对复杂的模型来捕捉时间依赖性。然而，这些方法在样本量小的情况下可能会失败，并且通常需要在底层数据分布变化时进行昂贵的重新训练。为克服这些限制，我们提出了储层共形预测（ResCP），这是一种新的无需训练的时间序列共形预测方法。我们的方法利用了储层计算的效率和表示学习能力，动态地重新加权一致性分数。具体而言，我们计算储层状态之间的相似性分数，并在每一步中使用它们来自适应地重新加权观察到的残差。通过这种方法，ResCP 使我们能够在建模误差分布时考虑局部时间动态，而不牺牲计算可扩展性。我们证明，在合理假设下，ResCP 实现了渐近条件覆盖，并通过多种预测任务的实验证明了其有效性。

Summary / 总结

ResCP is a training-free conformal prediction method for time series that uses reservoir computing to dynamically reweight conformity scores. It computes similarity scores among reservoir states and uses them to adaptively reweight the observed residuals at each step, which captures local temporal dynamics without fitting a complex model or retraining when the data distribution changes. The authors prove that, under reasonable assumptions, ResCP achieves asymptotic conditional coverage, and they demonstrate its effectiveness empirically across diverse forecasting tasks.

ResCP 是一种无需训练的时间序列共形预测方法，利用储层计算动态重新加权一致性分数。它避免了复杂模型拟合和重新训练的需要，适用于小样本和分布变化的情况。作者证明了在合理假设下 ResCP 可实现渐近条件覆盖，并通过实验表明其能有效考虑局部时间动态。
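
The reweighting idea can be illustrated with a small, dependency-free sketch: run a fixed random reservoir over the series, score past states by similarity to the current state, and take a similarity-weighted quantile of past absolute residuals as the interval radius. Everything specific here is an assumption for illustration: the leaky tanh reservoir, cosine similarity, the exponential weighting with a `temperature`, and the function names. The finite-sample correction used in standard conformal prediction is omitted for brevity.

```python
import math
import random
from typing import List, Sequence, Tuple


def reservoir_states(series: Sequence[float], size: int = 8,
                     leak: float = 0.5, seed: int = 0) -> List[List[float]]:
    """Run a fixed (untrained) random reservoir over the series."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    w = [[rng.uniform(-0.3, 0.3) for _ in range(size)] for _ in range(size)]
    state = [0.0] * size
    states = []
    for x in series:
        pre = [w_in[i] * x + sum(w[i][j] * state[j] for j in range(size))
               for i in range(size)]
        state = [(1 - leak) * state[i] + leak * math.tanh(pre[i])
                 for i in range(size)]
        states.append(state)
    return states


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def weighted_quantile(values: Sequence[float], weights: Sequence[float],
                      q: float) -> float:
    """Smallest value whose cumulative weight reaches a fraction q of the total."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= q * total:
            return v
    return pairs[-1][0]


def interval(point_forecast: float, past_states: Sequence[Sequence[float]],
             past_residuals: Sequence[float], current_state: Sequence[float],
             alpha: float = 0.1, temperature: float = 0.1) -> Tuple[float, float]:
    """Prediction interval whose radius is a similarity-weighted residual quantile."""
    weights = [math.exp(cosine(s, current_state) / temperature) for s in past_states]
    radius = weighted_quantile([abs(r) for r in past_residuals], weights, 1 - alpha)
    return point_forecast - radius, point_forecast + radius
```

Because past residuals observed in similar reservoir states dominate the weighted quantile, the interval widens or narrows with the local dynamics, and nothing is trained at any point.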

Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts

Authors: Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi

First: 2025-10-06T17:16:41+00:00 · Latest: 2025-10-06T17:16:41+00:00

Abstract

Diffusion-based large language models (dLLMs) are trained flexibly to model extreme dependence in the data distribution; however, how to best utilize this information at inference time remains an open problem. In this work, we uncover an interesting property of these models: dLLMs trained on textual data implicitly learn a mixture of semi-autoregressive experts, where different generation orders reveal different specialized behaviors. We show that committing to any single, fixed inference time schedule, a common practice, collapses performance by failing to leverage this latent ensemble. To address this, we introduce HEX (Hidden semiautoregressive EXperts for test-time scaling), a training-free inference method that ensembles across heterogeneous block schedules. By doing a majority vote over diverse block-sized generation paths, HEX robustly avoids failure modes associated with any single fixed schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods like GRPO, without additional training. HEX even yields significant gains on MATH benchmark from 16.40% to 40.00%, scientific reasoning on ARC-C from 54.18% to 87.80%, and TruthfulQA from 28.36% to 57.46%. Our results establish a new paradigm for test-time scaling in diffusion-based LLMs (dLLMs), revealing that the sequence in which masking is performed plays a critical role in determining performance during inference.

中文标题/摘要

标题：通过隐藏半自回归专家实现扩散大语言模型的测试时扩展

基于扩散的大语言模型（dLLMs）灵活地训练以建模数据分布中的极端依赖性；然而，在推理时如何最好地利用这些信息仍然是一个开放的问题。在本工作中，我们揭示了这些模型的一个有趣特性：在文本数据上训练的dLLMs隐式地学习了一种半自回归专家的混合模型，不同的生成顺序揭示了不同的专业化行为。我们展示了任何单一固定推理时间表的承诺会因未能利用这种潜在的集合而降低性能。为了解决这个问题，我们引入了HEX（隐藏半自回归专家在推理时的缩放），这是一种无需训练的推理方法，它在异构块调度中进行集成。通过在多种块大小生成路径上进行多数投票，HEX稳健地避免了任何单一固定时间表相关的失败模式。在诸如GSM8K的推理基准测试中，它将准确度提高了3.56倍（从24.72%提高到88.10%），超过了顶级K值推理和专门微调方法（如GRPO）的性能，且无需额外训练。HEX甚至在MATH基准测试中从16.40%提高到40.00%，在ARC-C的科学推理中从54.18%提高到87.80%，在TruthfulQA中从28.36%提高到57.46%。我们的结果确立了扩散大语言模型（dLLMs）在推理时缩放的新范式，揭示了在推理过程中执行掩码的顺序对性能起着关键作用。

Summary / 总结

This paper addresses the challenge of effectively utilizing the flexible training of diffusion-based large language models (dLLMs) at inference time. It introduces HEX, a training-free method that ensembles across heterogeneous block schedules by majority vote, avoiding the failure modes of any single fixed inference schedule. On GSM8K, HEX boosts accuracy by up to 3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and specialized fine-tuned methods such as GRPO; it also yields gains on MATH (16.40% to 40.00%), ARC-C (54.18% to 87.80%), and TruthfulQA (28.36% to 57.46%).

本文研究如何在推理时有效利用基于扩散的大语言模型（dLLMs）的灵活训练。它提出了HEX，一种无需训练的方法，通过对不同块调度的生成结果进行多数投票集成，避免了任何单一固定推理调度的失败模式。在GSM8K上，HEX将准确率最高提升至3.56倍（从24.72%提高到88.10%），优于top-K margin推理和GRPO等专门微调方法。
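
The ensembling step, a majority vote over answers produced under different block sizes, is simple enough to sketch directly. The `generate` callable below is a hypothetical stand-in for decoding a dLLM with a given semi-autoregressive block size and extracting the final answer; it is not an API from the paper's code.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def hex_vote(generate: Callable[[str, int], str], prompt: str,
             block_sizes: Sequence[int]) -> Tuple[str, float]:
    """Decode once per block size and return the majority answer
    together with the fraction of paths that agreed on it."""
    answers = [generate(prompt, b) for b in block_sizes]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)
```

For example, if three of four block schedules produce "42" and one produces "41", `hex_vote` returns `("42", 0.75)`. On ties, `Counter.most_common` returns the answer that was seen first, so a real system would likely need an explicit tie-breaking rule.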

Fast constrained sampling in pre-trained diffusion models

Authors: Alexandros Graikos, Nebojsa Jojic, Dimitris Samaras

First: 2024-10-24T14:52:38+00:00 · Latest: 2025-10-06T16:59:09+00:00

Abstract

Large denoising diffusion models, such as Stable Diffusion, have been trained on billions of image-caption pairs to perform text-conditioned image generation. As a byproduct of this training, these models have acquired general knowledge about image statistics, which can be useful for other inference tasks. However, when confronted with sampling an image under new constraints, e.g. generating the missing parts of an image, using large pre-trained text-to-image diffusion models is inefficient and often unreliable. Previous approaches either utilized backpropagation through the denoiser network, making them significantly slower and more memory-demanding than simple text-to-image generation, or only enforced the constraint locally, failing to capture critical long-range correlations in the sampled image. In this work, we propose an algorithm that enables fast, high-quality generation under arbitrary constraints. We show that in denoising diffusion models, we can employ an approximation to Newton's optimization method that allows us to speed up inference and avoid the expensive backpropagation operations. Our approach produces results that rival or surpass the state-of-the-art training-free inference methods while requiring a fraction of the time. We demonstrate the effectiveness of our algorithm under both linear (inpainting, super-resolution) and non-linear (style-guided generation) constraints. An implementation is provided at https://github.com/cvlab-stonybrook/fast-constrained-sampling.

中文标题/摘要

标题：预训练扩散模型中的快速约束采样

大型去噪扩散模型，如Stable Diffusion，已通过数十亿张图像-描述对进行训练，以执行基于文本的图像生成。作为训练的副产品，这些模型获得了关于图像统计的一般知识，这在其他推理任务中可能很有用。然而，当需要在新约束下采样图像时，例如生成图像的缺失部分，使用大型预训练的文本到图像扩散模型是低效且往往不可靠的。先前的方法要么通过去噪器网络进行反向传播，使其比简单的文本到图像生成慢得多且内存需求更大，要么仅在局部施加约束，无法捕捉采样图像中的关键长程相关性。在本文中，我们提出了一种算法，可以在任意约束下实现快速、高质量的生成。我们展示了在去噪扩散模型中，可以使用近似牛顿优化方法的近似，这使我们能够加快推理速度并避免昂贵的反向传播操作。我们的方法产生的结果与最先进的无训练推理方法相当或更优，但所需时间仅为其中的一小部分。我们证明了在线性（填充、超分辨率）和非线性（风格引导生成）约束下，我们的算法的有效性。有关实现，请参见https://github.com/cvlab-stonybrook/fast-constrained-sampling。

Summary / 总结

This paper addresses the inefficiency and unreliability of using large pre-trained text-to-image diffusion models for constrained sampling tasks. It proposes an algorithm that uses an approximation to Newton's optimization method to enable fast, high-quality image generation under arbitrary constraints. The method avoids expensive backpropagation through the denoiser, producing results that rival or surpass state-of-the-art training-free inference methods in a fraction of the time. The approach is demonstrated under both linear (inpainting, super-resolution) and non-linear (style-guided generation) constraints.

本文解决了使用大型预训练文本到图像扩散模型进行约束采样任务时的低效性和不可靠性问题。它提出了一种算法，利用牛顿优化方法的近似来实现快速且高质量的生成，适用于任意约束条件。该方法避免了昂贵的反向传播操作，并且生成的结果与最先进的无训练推理方法相当或更优，所需时间显著减少。该算法适用于线性（图像修补、超分辨率）和非线性（风格引导生成）约束条件。

Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi

First: 2025-10-01T17:55:37+00:00 · Latest: 2025-10-06T16:29:44+00:00

Comments: 79 pages, 27 figures, 31 tables. Code is available at https://github.com/CHATS-lab/verbalize-sampling

Abstract

Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.

中文标题/摘要

标题：口头采样：如何缓解模式崩溃并解锁大模型多样性

后训练对齐通常会减少大模型的多样性，导致模式崩溃现象。不同于以往将这一现象归因于算法限制的研究，我们发现了一个根本性的、普遍存在的数据层面驱动因素：偏好数据中的典型性偏差，即注释者系统地偏好熟悉的文本，这源于认知心理学中的已有发现。我们从理论上正式化了这一偏差，通过偏好数据集进行实证验证，并表明它在模式崩溃中起着核心作用。基于这一分析，我们引入了口头采样，这是一种简单的、无需训练的提示策略，以绕过模式崩溃。VS促使模型口头化一个响应的概率分布（例如，“生成5个关于咖啡的笑话及其相应的概率”）。全面的实验表明，VS在创意写作（诗歌、故事、笑话）、对话模拟、开放性问答和合成数据生成等方面显著提高了性能，而不会牺牲事实准确性与安全性。例如，在创意写作中，VS将多样性提高了1.6-2.1倍。我们还观察到一个新兴趋势，即更强大的模型从VS中获益更多。总之，我们的工作提供了一种新的数据导向视角来理解模式崩溃，并提供了一种实用的推理时修正方法，有助于解锁预训练生成多样性。

Summary / 总结

The paper addresses mode collapse in large language models (LLMs), the loss of diversity that often follows post-training alignment. It identifies typicality bias in preference data, whereby annotators systematically favor familiar text, as a fundamental data-level driver. To mitigate this, the authors propose Verbalized Sampling (VS), a training-free prompting strategy that prompts the model to verbalize a probability distribution over a set of responses. Experiments show that VS significantly improves performance across creative writing, dialogue simulation, open-ended QA, and synthetic data generation without sacrificing factual accuracy and safety; in creative writing, it increases diversity by 1.6-2.1x over direct prompting. More capable models benefit more from VS.

论文探讨了LLM后训练对齐导致的模式崩溃问题，指出偏好数据中的典型性偏差是其根本的数据层面驱动因素。提出了一种名为Verbalized Sampling (VS) 的无需训练的提示策略，该策略促使模型口头化一组响应的概率分布。实验表明，VS 在创意写作、对话模拟、开放式问答和合成数据生成中显著提升了性能，其中在创意写作中将多样性提高了1.6-2.1倍，同时不牺牲事实准确性和安全性。
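
A minimal sketch of how a Verbalized Sampling prompt might be built and consumed is shown below. The first sentence of the prompt follows the example quoted in the abstract; the JSON output contract (`text` and `probability` keys) and the helper names are assumptions added for this illustration, not the format used in the paper's code.

```python
import json
import random
from typing import Optional


def build_vs_prompt(task: str, k: int = 5) -> str:
    """Ask for k candidate responses together with verbalized probabilities."""
    return (
        f"Generate {k} {task} and their corresponding probabilities. "
        'Return a JSON list of objects with keys "text" and "probability".'
    )


def sample_response(raw_json: str, rng: Optional[random.Random] = None) -> str:
    """Parse the model's verbalized distribution and draw one response from it."""
    rng = rng or random.Random()
    items = json.loads(raw_json)
    texts = [item["text"] for item in items]
    weights = [float(item["probability"]) for item in items]
    return rng.choices(texts, weights=weights, k=1)[0]
```

Calling `build_vs_prompt("jokes about coffee")` yields a prompt that begins with the sentence from the abstract, and `sample_response` then draws a reply according to the probabilities the model stated rather than always taking its single most typical answer.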

Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind

Authors: Lance Ying, Xinyi Li, Shivam Aarya, Yizirui Fang, Yifan Yin, Jason Xinyu Liu, Stefanie Tellex, Joshua B. Tenenbaum, Tianmin Shu

First: 2024-09-17T02:36:10+00:00 · Latest: 2025-10-06T16:05:39+00:00

Comments: 8 pages, 7 figures

Abstract

Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings with human evaluations. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.

中文标题/摘要

标题：基于心智理论的人机协作中的语用具身口语指令跟随

口语指令在智能体协作中无处不在。然而，在现实世界的人机协作中，遵循人类的口语指令可能会因各种说话者和环境因素（如背景噪音或发音错误）而具有挑战性。面对嘈杂的听觉输入，人类可以利用具身环境中的协作上下文来解释嘈杂的口语指令并采取语用上合理的辅助行动。在本文中，我们提出了一种受认知启发的神经符号模型：基于心智理论的口语指令跟随（SIFToM），该模型将视觉-语言模型与基于模型的心理推理相结合，使机器人能够在多种语音条件下合乎语用地遵循人类指令。我们在模拟环境（VirtualHome）和真实世界的人机协作环境中测试了SIFToM，并进行了人类评估。结果显示，SIFToM可以显著提高轻量级基础视觉-语言模型（Gemini 2.5 Flash）的性能，优于最先进的视觉-语言模型（Gemini 2.5 Pro），并在具有挑战性的口语指令跟随任务上接近人类水平的准确性。

Summary / 总结

This paper addresses the challenge of robots following human spoken instructions in noisy environments. It introduces SIFToM, a neurosymbolic model that uses a Vision-Language Model with mental inference to help robots interpret and follow instructions pragmatically. Experimental results show that SIFToM improves the performance of a lightweight base VLM and outperforms state-of-the-art VLMs on challenging spoken instruction following tasks, approaching human-level accuracy in both simulated and real-world settings.

论文旨在解决人机协作中在嘈杂条件下跟随人类口语指令的挑战。它提出了一种名为SIFToM的受认知启发的神经符号模型，该模型利用视觉-语言模型和基于模型的心理推理来帮助机器人理解和执行指令。实验结果显示，在模拟和真实世界的人机协作环境中，SIFToM显著提升了轻量级基础视觉-语言模型的表现，在具有挑战性的口语指令跟随任务上优于最先进的视觉-语言模型，并接近人类水平的准确性。

A Tale of Two Experts: Cooperative Learning for Source-Free Unsupervised Domain Adaptation

Authors: Jiaping Yu, Muli Yang, Jiapeng Ji, Jiexi Yan, Cheng Deng

First: 2025-09-26T11:39:50+00:00 · Latest: 2025-10-06T15:55:42+00:00

Abstract

Source-Free Unsupervised Domain Adaptation (SFUDA) addresses the realistic challenge of adapting a source-trained model to a target domain without access to the source data, driven by concerns over privacy and cost. Existing SFUDA methods either exploit only the source model's predictions or fine-tune large multimodal models, yet both neglect complementary insights and the latent structure of target data. In this paper, we propose the Experts Cooperative Learning (EXCL). EXCL contains the Dual Experts framework and Retrieval-Augmentation-Interaction optimization pipeline. The Dual Experts framework places a frozen source-domain model (augmented with Conv-Adapter) and a pretrained vision-language model (with a trainable text prompt) on equal footing to mine consensus knowledge from unlabeled target samples. To effectively train these plug-in modules under purely unsupervised conditions, we introduce Retrieval-Augmented-Interaction(RAIN), a three-stage pipeline that (1) collaboratively retrieves pseudo-source and complex target samples, (2) separately fine-tunes each expert on its respective sample set, and (3) enforces learning object consistency via a shared learning result. Extensive experiments on four benchmark datasets demonstrate that our approach matches state-of-the-art performance.

中文标题/摘要

标题：两位专家的故事：无源无监督领域适应的协同学习

无源无监督领域适应（SFUDA）解决了在不访问源数据的情况下将源训练模型适应目标领域的真实挑战，这源于对隐私和成本的担忧。现有SFUDA方法要么仅利用源模型的预测，要么微调大型多模态模型，但两者都忽视了目标数据的互补见解和潜在结构。在本文中，我们提出了协同学习专家（EXCL）。EXCL 包含双专家框架和检索增强交互优化管道。双专家框架将冻结的源域模型（附加Conv-Adapter）和预训练的视觉语言模型（带有可训练文本提示）置于平等地位，以从未标记的目标样本中挖掘共识知识。为了在纯无监督条件下有效训练这些插件模块，我们引入了检索增强交互（RAIN），这是一个三阶段管道，包括（1）协作检索伪源和复杂目标样本，（2）分别对每个专家进行其各自的样本集微调，以及（3）通过共享学习结果强制学习对象一致性。在四个基准数据集上的广泛实验表明，我们的方法达到了最先进的性能。

Summary / 总结

The paper addresses Source-Free Unsupervised Domain Adaptation (SFUDA): adapting a source-trained model to a target domain without access to the source data, a setting motivated by privacy and cost concerns. It introduces Experts Cooperative Learning (EXCL), whose Dual Experts framework pairs a frozen source-domain model (augmented with a Conv-Adapter) with a pretrained vision-language model (with a trainable text prompt) to mine consensus knowledge from unlabeled target samples. These plug-in modules are trained with a three-stage Retrieval-Augmented-Interaction (RAIN) pipeline that collaboratively retrieves pseudo-source and complex target samples, fine-tunes each expert separately on its own sample set, and enforces learning object consistency via a shared learning result. Experiments on four benchmark datasets show that EXCL matches state-of-the-art performance.

论文研究无源无监督领域适应（SFUDA）问题，即在无法访问源数据的情况下将源训练模型适应到目标域，这一设定源于对隐私和成本的考虑。提出了专家协同学习（EXCL），其双专家框架将一个冻结的源域模型（附加Conv-Adapter）与一个预训练的视觉-语言模型（带有可训练文本提示）结合，从未标记的目标样本中挖掘共识知识。这些插件模块通过三阶段的检索增强交互（RAIN）管道进行训练：协作检索伪源和复杂的目标样本，分别在各自的样本集上微调每个专家，并通过共享学习结果确保学习对象一致性。实验表明，EXCL在四个基准数据集上达到了与最先进方法相当的性能。

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman

Venue: NeurIPS 2025 spotlight

First: 2025-05-15T17:52:54+00:00 · Latest: 2025-10-06T15:41:20+00:00

Comments: Accepted as a spotlight at NeurIPS 2025

Abstract

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.

中文标题/摘要

标题：MMLongBench：有效且全面地评估长上下文视觉语言模型

大型视觉语言模型中上下文窗口的迅速扩展催生了长上下文视觉语言模型（LCVLMs），使其能够在单次前向传递中处理数百张带有交错文本标记的图像。本文介绍了MMLongBench，这是首个涵盖多种长上下文视觉语言任务的基准测试，旨在有效且全面地评估LCVLMs。MMLongBench 包含13,331个示例，覆盖五个下游任务类别，如视觉RAG和多次示例ICL。它还提供了各种图像类型的广泛覆盖，包括自然和合成图像。为了评估模型对不同输入长度的鲁棒性，所有示例通过结合视觉补丁和文本标记的跨模态标记方案以五种标准化输入长度（8K-128K标记）提供。通过对46个闭源和开源LCVLMs进行全面基准测试，我们提供了当前模型视觉语言长上下文能力的全面分析。结果显示：i) 单个任务上的表现是整体长上下文能力的弱代理；ii) 无论是闭源还是开源模型，在长上下文视觉语言任务中都面临挑战，表明未来有巨大的改进空间；iii) 具有更强推理能力的模型通常表现出更好的长上下文性能。通过提供广泛的任务覆盖、各种图像类型和严格的长度控制，MMLongBench 为诊断和推进下一代LCVLMs提供了缺失的基础。

Summary / 总结

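MMLongBench is a benchmark for evaluating long-context vision-language models (LCVLMs) across a diverse set of tasks and input lengths. It comprises 13,331 examples spanning five task categories and various image types, delivered at five standardized input lengths from 8K to 128K tokens. A benchmark of 46 closed-source and open-source LCVLMs shows that single-task performance is a weak proxy for overall long-context capability, that both closed-source and open-source models leave substantial room for improvement on long-context tasks, and that models with stronger reasoning ability tend to perform better in long-context settings.

MMLongBench 是一个用于评估长上下文视觉语言模型 (LCVLMs) 的基准，涵盖了多种任务和输入长度。它包括 13,331 个例子，分布在五个类别和各种图像类型中，输入长度从 8K 到 128K 个标记标准化。研究发现，单任务性能不是整体长上下文能力的良好指标，且无论是闭源还是开源模型在长上下文任务中都有改进空间。具有更强推理能力的模型在长上下文任务中表现更好。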

Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

Authors: Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen

First: 2025-08-12T17:59:57+00:00 · Latest: 2025-10-06T14:46:22+00:00

Comments: Project webpage: https://aim-uofa.github.io/dLLM-MidTruth

Abstract

Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.

中文标题/摘要

标题：时间是特征：利用扩散语言模型中的时间动态

扩散大型语言模型（dLLMs）通过迭代去噪生成文本，但当前解码策略倾向于丢弃中间丰富的预测，而保留最终输出。我们的研究揭示了一个关键现象，即时间振荡，其中正确的答案往往在中间过程中出现，但在后续去噪步骤中被覆盖。为了解决这一问题，我们引入了两种互补的方法来利用时间一致性：1) 时间自我一致性投票，这是一种无需训练的测试时解码策略，通过聚合去噪步骤中的预测来选择最一致的输出；2) 一种后训练方法，称为时间一致性强化，它使用时间语义熵（TSE），这是一种衡量中间预测语义稳定性的度量，作为奖励信号以鼓励稳定生成。在多个基准上的实验证明了我们方法的有效性。仅使用负TSE奖励，我们在Countdown数据集上观察到与现有dLLM相比平均改进了24.7%。结合准确性奖励，我们在GSM8K上实现了2.0%的绝对收益，在MATH500上实现了4.3%的收益，在SVAMP上实现了6.6%的收益，在Countdown上实现了25.3%的收益。我们的研究结果强调了dLLMs中时间动态的未开发潜力，并提供了两种简单而有效的工具来利用它们。

Summary / 总结

This paper addresses temporal oscillation in diffusion large language models (dLLMs), where correct answers often appear in intermediate denoising steps but are overwritten later. To tackle this, the authors propose two methods: Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output, and Temporal Consistency Reinforcement, a post-training method that uses Temporal Semantic Entropy (TSE) as a reward signal to encourage stable generations. The negative TSE reward alone yields an average improvement of 24.7% on the Countdown dataset over an existing dLLM; combined with an accuracy reward, it yields absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown.

本文研究了扩散大型语言模型（dLLMs）中的时间振荡问题，即正确答案往往在中间步骤出现但后来被覆盖。为此，作者提出了两种方法：时间自一致性投票，这是一种测试时的解码策略，通过聚合去噪步骤中的预测来选择最一致的输出；以及时间一致性强化，它使用时间语义熵作为奖励信号来鼓励稳定生成。实验结果显示，在多个基准测试中取得了显著改进：单独使用负TSE奖励在Countdown数据集上平均提高了24.7%；与准确性奖励结合后，在Countdown上取得了25.3%的绝对提升，在SVAMP上取得了6.6%的绝对提升。
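
Both quantities mentioned above can be illustrated on a list of answers extracted at successive denoising steps. The sketch below is a simplification under stated assumptions: answers are compared by exact string match (a real system would group semantically equivalent answers), the optional step weighting is a hypothetical hook rather than the paper's scheme, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Callable, Optional, Sequence


def temporal_self_consistency_vote(
    step_answers: Sequence[Optional[str]],
    weight: Optional[Callable[[int, int], float]] = None,
) -> str:
    """Pick the answer with the largest (optionally weighted) support
    across denoising steps, instead of trusting only the final step."""
    total_steps = len(step_answers)
    scores = {}
    for t, answer in enumerate(step_answers, start=1):
        if answer is None:  # no parsable answer at this step
            continue
        w = 1.0 if weight is None else weight(t, total_steps)
        scores[answer] = scores.get(answer, 0.0) + w
    if not scores:
        raise ValueError("no answers to vote on")
    return max(scores, key=scores.get)


def temporal_semantic_entropy(step_answers: Sequence[Optional[str]]) -> float:
    """Entropy (in nats) of the answer distribution over steps;
    lower values mean more stable intermediate predictions."""
    answers = [a for a in step_answers if a is not None]
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

With step answers `["7", "12", "12", "12", "9"]`, the vote returns "12" even though the final step drifted to "9", which mirrors the oscillation the paper describes.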

TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration

Authors: Yanshu Li, Jianjiang Yang, Tian Yun, Pinyuan Feng, Jinfa Huang, Ruixiang Tang

First: 2025-05-21T05:22:21+00:00 · Latest: 2025-10-06T13:42:58+00:00

Comments: EMNLP2025 Main, 28 pages, 11 figures, 19 tables

Abstract

Multimodal in-context learning (ICL) has emerged as a key mechanism for harnessing the capabilities of large vision-language models (LVLMs). However, its effectiveness remains highly sensitive to the quality of input ICL sequences, particularly for tasks involving complex reasoning or open-ended generation. A major limitation is our limited understanding of how LVLMs actually exploit these sequences during inference. To bridge this gap, we systematically interpret multimodal ICL through the lens of task mapping, which reveals how local and global relationships within and among demonstrations guide model reasoning. Building on this insight, we present TACO, a lightweight transformer-based model equipped with task-aware attention that dynamically configures ICL sequences. By injecting task-mapping signals into the autoregressive decoding process, TACO creates a bidirectional synergy between sequence construction and task reasoning. Experiments on five LVLMs and nine datasets demonstrate that TACO consistently surpasses baselines across diverse ICL tasks. These results position task mapping as a novel and valuable perspective for interpreting and improving multimodal ICL.

中文标题/摘要

标题：TACO：通过任务映射引导的序列配置增强多模态上下文学习

多模态上下文学习（ICL）已成为利用大型视觉-语言模型（LVLM）能力的关键机制。然而，其有效性对输入ICL序列的质量高度敏感，特别是在涉及复杂推理或开放式生成的任务中。一个主要限制是我们对LVLM在推理过程中如何实际利用这些序列的理解有限。为了弥合这一差距，我们系统地通过任务映射的视角来解释多模态ICL，揭示了示例内部及示例之间的局部和全局关系如何引导模型推理。基于这一见解，我们提出了TACO，一种轻量级的基于Transformer的模型，配备了任务感知注意力，能够动态配置ICL序列。通过将任务映射信号注入自回归解码过程，TACO在序列构建和任务推理之间创造了双向协同作用。在五个LVLM和九个数据集上的实验表明，TACO在各种ICL任务中始终优于基线。这些结果将任务映射定位为解释和改进多模态ICL的一种新颖且有价值的视角。

Summary / 总结

The research aims to enhance the effectiveness of multimodal in-context learning (ICL) in large vision-language models (LVLMs) by addressing the sensitivity of ICL to input sequence quality. The study introduces TACO, a task-aware transformer model that dynamically configures ICL sequences through task-mapping signals, improving model reasoning. Experiments show that TACO outperforms baseline models across various ICL tasks on five LVLMs and nine datasets, suggesting task mapping as a valuable approach for ICL interpretation and improvement.

研究旨在通过提高输入序列的质量来增强视觉语言模型（LVLM）中的多模态上下文学习（ICL），特别是对于复杂推理任务。TACO 是一种任务感知的变压器模型，通过任务映射信号在自回归解码过程中动态配置 ICL 序列。实验表明，TACO 在五个 LVLM 和九个数据集上的各种 ICL 任务中均优于基线模型，突显了任务映射在 ICL 解释和改进中的重要性。

CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification

Authors: Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves

First: 2025-01-21T16:38:04+00:00 · Latest: 2025-10-06T13:22:25+00:00

Comments: Accepted for publication in Computers in Biology and Medicine

Abstract

The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the model output on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.

中文标题/摘要

标题：CBVLM：无需训练的基于概念的大规模视觉语言模型在医学图像分类中的可解释性

深度学习解决方案在医疗工作流程中的应用受限于标注数据的可用性和此类系统的缺乏可解释性。概念瓶颈模型（CBMs）通过在预定义的人类可解释的概念集上约束模型输出来解决后者。然而，通过这些基于概念的解释获得的增强可解释性意味着更高的标注负担。此外，如果需要添加新概念，整个系统都需要重新训练。受大规模视觉语言模型（LVLM）在少样本设置中表现出色的启发，我们提出了一种简单而有效的方法CBVLM，以解决上述两个问题。首先，对于每个概念，我们提示LVLM判断该概念是否出现在输入图像中。然后，我们要求LVLM根据之前的概念预测对图像进行分类。此外，在两个阶段中，我们引入了一个检索模块，负责选择最佳示例进行上下文学习。通过基于预测的概念进行最终诊断，我们确保了可解释性，并通过利用LVLM的少样本能力，我们大幅降低了标注成本。我们通过在四个医学数据集和十二个LVLM（通用和医学）上进行广泛的实验验证了我们的方法，并展示了CBVLM在无需任何训练且仅使用少量标注示例的情况下，始终优于CBMs和任务特定的监督方法。更多关于我们项目的详细信息，请参阅项目页面：https://cristianopatricio.github.io/CBVLM/。

Summary / 总结

The paper addresses two obstacles to deep learning in medical image classification: limited annotated data and lack of interpretability. It proposes CBVLM, a method that uses Large Vision-Language Models (LVLMs) for both concept detection and image classification, reducing annotation costs and enhancing explainability. CBVLM first prompts the LVLM to predict whether each concept is present in the image and then classifies the image based on these predictions, using a retrieval module to select in-context examples at both stages. Experiments across four medical datasets and twelve LVLMs show that CBVLM outperforms CBMs and task-specific supervised methods without any training and with only a few annotated examples.

该论文针对深度学习在医疗图像分类中面临的标注数据有限和缺乏解释性的问题，提出了一种名为CBVLM的方法，利用大型视觉-语言模型（LVLM）进行少样本学习。通过促使LVLM预测图像中概念的存在，并基于这些预测进行分类，CBVLM确保了解释性并降低了标注成本。实验结果显示，CBVLM在四个医疗数据集和十二种LVLM（包括通用和医疗专用）上表现优于概念瓶颈模型（CBMs）和特定任务的监督方法，无需训练且仅使用少量标注样本。
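
The two-stage procedure described in the CBVLM abstract (per-concept yes/no prompting, then classification from the predicted concepts) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LVLM` callable interface, the prompt wording, `toy_lvlm`, and the placeholder concept and label names are all assumptions, and the retrieval module that selects in-context examples is omitted.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any callable mapping a text prompt to a text reply.
LVLM = Callable[[str], str]


def predict_concepts(lvlm: LVLM, image_ref: str, concepts: List[str]) -> Dict[str, bool]:
    """Stage 1: one yes/no query per concept."""
    preds = {}
    for concept in concepts:
        prompt = f"Image: {image_ref}\nIs the concept '{concept}' present? Answer yes or no."
        preds[concept] = lvlm(prompt).strip().lower().startswith("yes")
    return preds


def classify_from_concepts(lvlm: LVLM, preds: Dict[str, bool], labels: List[str]) -> str:
    """Stage 2: classify using only the stage-1 concept predictions."""
    listing = ", ".join(f"{c}: {'present' if p else 'absent'}" for c, p in preds.items())
    reply = lvlm(f"Concepts: {listing}\nChoose one label from {labels}.").strip()
    # Fall back to the first label when the reply is not a valid label.
    return reply if reply in labels else labels[0]


def toy_lvlm(prompt: str) -> str:
    """Rule-based stand-in for a real model, for demonstration only."""
    if prompt.startswith("Image:"):
        return "yes" if "'concept_a'" in prompt else "no"
    return "class_1" if "concept_a: present" in prompt else "class_2"
```

Because the final label is produced only from the concept listing, the concept predictions double as the explanation for the decision, which is the property the abstract emphasizes.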

Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning

Authors: Xiaomeng Fan, Yuchuan Mao, Zhi Gao, Yuwei Wu, Jin Chen, Yunde Jia

First: 2025-10-06T12:43:59+00:00 · Latest: 2025-10-06T12:43:59+00:00

Abstract

Open-vocabulary learning requires modeling the data distribution in open environments, which consists of both seen-class and unseen-class data. Existing methods estimate the distribution in open environments using seen-class data, where the absence of unseen classes makes the estimation error inherently unidentifiable. Intuitively, learning beyond the seen classes is crucial for distribution estimation to bound the estimation error. We theoretically demonstrate that the distribution can be effectively estimated by generating unseen-class data, through which the estimation error is upper-bounded. Building on this theoretical insight, we propose a novel open-vocabulary learning method, which generates unseen-class data for estimating the distribution in open environments. The method consists of a class-domain-wise data generation pipeline and a distribution alignment algorithm. The data generation pipeline generates unseen-class data under the guidance of a hierarchical semantic tree and domain information inferred from the seen-class data, facilitating accurate distribution estimation. With the generated data, the distribution alignment algorithm estimates and maximizes the posterior probability to enhance generalization in open-vocabulary learning. Extensive experiments on 11 datasets demonstrate that our method outperforms baseline approaches by up to 14%, highlighting its effectiveness and superiority.

中文标题/摘要

标题：超越所见：开放词汇量分布估计

开放词汇量学习需要在开放环境中建模数据分布，这包括已见过的类别和未见过的类别。现有方法使用已见过的类别数据估计开放环境中的分布，由于未见过的类别缺失，使得估计误差无法识别。直观上，学习未见过的类别对于分布估计以限制估计误差至关重要。我们从理论上证明，通过生成未见过的类别数据，可以有效估计分布，从而将估计误差上界。基于这一理论洞察，我们提出了一种新的开放词汇量学习方法，该方法生成未见过的类别数据以估计开放环境中的分布。该方法由一个类别领域数据生成流水线和一个分布对齐算法组成。数据生成流水线在层次语义树和从已见过的类别数据推断出的领域信息的指导下生成未见过的类别数据，有助于准确的分布估计。通过生成的数据，分布对齐算法估计并最大化后验概率，以增强开放词汇量学习中的泛化能力。在11个数据集上的广泛实验表明，我们的方法比基线方法高出最多14%，突显了其有效性和优越性。

Summary / 总结

The paper addresses the challenge of estimating the data distribution in open-vocabulary learning, which involves both seen and unseen classes. It proposes a method that generates unseen-class data to bound the estimation error, theoretically demonstrating that this approach can effectively estimate the distribution. The method includes a data generation pipeline and a distribution alignment algorithm, leading to improved performance in open-vocabulary learning, with up to 14% better results compared to baseline approaches on 11 datasets.

论文解决了开放词汇学习中数据分布估计的挑战，涉及已知和未知类别。提出了一种生成未知类别数据的方法来限制估计误差，并理论证明了这种方法可以有效估计分布。该方法包括数据生成管道和分布对齐算法，使开放词汇学习的性能得到提升，在11个数据集上的表现比基线方法高出最多14%。

Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Authors: Chi Yan, Dan Xu

First: 2025-10-06T12:36:07+00:00 · Latest: 2025-10-06T12:36:07+00:00

Comments: Project Page: https://yanchi-3dv.github.io/PG-Occ

Abstract

The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ

中文标题/摘要

标题：具有各向异性感知采样的渐进高斯变换器用于开放词汇占用预测

近年来，3D 占有率预测任务取得了显著进展，在基于视觉的自动驾驶系统中发挥着重要作用。虽然传统方法局限于固定语义类别，但最近的方法转向预测与文本对齐的特征，以支持现实场景中的开放词汇文本查询。然而，在文本对齐场景建模中存在权衡：稀疏的高斯表示难以捕捉场景中的小物体，而密集表示则会带来显著的计算开销。为了解决这些限制，我们提出了 PG-Occ，一种创新的渐进高斯变换器框架，以实现开放词汇的3D 占有率预测。我们的框架采用渐进在线稠密化，这是一种逐步增强3D 高斯表示以捕捉细粒度场景细节的前馈策略。通过迭代增强表示，框架能够实现越来越精确和详细的场景理解。另一个关键贡献是引入了具有时空融合的各向异性感知采样策略，该策略能够根据不同尺度和阶段适配高斯的感知域，从而实现更有效的特征聚合和更丰富的场景信息捕获。通过广泛的评估，我们证明 PG-Occ 达到了最先进的性能，相对提高了14.3% 的mIoU。代码和预训练模型将在发表后在我们的项目页面上发布：https://yanchi-3dv.github.io/PG-Occ

Summary / 总结

The paper introduces PG-Occ, a Progressive Gaussian Transformer Framework designed for open-vocabulary 3D occupancy prediction. It addresses the limitations of sparse and dense Gaussian representations by employing progressive online densification and an anisotropy-aware sampling strategy. Experimental results show that PG-Occ outperforms previous methods with a 14.3% relative improvement in mIoU. The framework enhances 3D scene understanding through iterative representation refinement and effective feature aggregation, making it suitable for vision-based autonomous driving systems.

论文提出了PG-Occ，一种渐进高斯变换器框架，用于开放词汇的3D占用预测。该框架通过渐进在线密集化和各向异性感知采样策略解决了稀疏和密集高斯表示的局限性。框架通过迭代增强3D高斯表示来捕捉细粒度细节，并适配性地分配感受野，从而实现了最先进的性能，相对mIoU改进了14.3%。

Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction

Authors: KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho

Venue: NeurIPS 2025

First: 2025-10-06T11:33:09+00:00 · Latest: 2025-10-06T11:33:09+00:00

Comments: Accepted by NeurIPS 2025. Code: https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes

Abstract

3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, these efforts frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.

中文标题/摘要

标题：面向对象的表示学习以增强3D场景图预测

3D语义场景图预测旨在检测3D场景中的对象及其语义关系，并已成为机器人技术和AR/VR应用中的关键技术。尽管先前的研究已解决了数据集限制并探索了各种方法，包括开放式词汇设置，但它们经常未能优化对象和关系特征的表示能力，过度依赖于图神经网络，尽管这些网络的区分能力不足。在本工作中，我们通过广泛的分析表明，对象特征的质量在决定整体场景图准确性方面起着关键作用。为了解决这一挑战，我们设计了一种高度区分性的对象特征编码器，并采用对比预训练策略，将对象表示学习与场景图预测分离。这一设计不仅提高了对象分类准确性，还直接提高了关系预测。值得注意的是，当将我们的预训练编码器插入现有框架时，我们观察到所有评估指标上都取得了显著性能提升。此外，与现有方法未能充分利用关系信息的整合不同，我们有效结合了几何和语义特征，实现了更优的关系预测。在3DSSG数据集上的全面实验表明，我们的方法显著优于先前的最先进方法。我们的代码可在https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes 获取。

Summary / 总结

This research aims to improve 3D semantic scene graph prediction by focusing on the quality of object features. The authors introduce a discriminative object feature encoder and a contrastive pretraining strategy that decouples object representation learning from scene graph prediction, improving both object classification and relationship prediction. Plugging the pretrained encoder into existing frameworks improves all evaluation metrics, and experiments on the 3DSSG dataset show that the approach significantly outperforms previous state-of-the-art methods. The code is publicly available.

该研究针对3D语义场景图预测中的对象特征质量不足问题，提出了一种区分性对象特征编码器和对比预训练策略，以提升对象和关系预测能力。在3DSSG数据集上的全面实验表明，该方法在所有评估指标上显著优于先前的方法。

ViTs: Teaching Machines to See Time Series Anomalies Like Human Experts

Authors: Zexin Wang, Changhua Pei, Yang Liu, Hengyue Jiang, Quan Zhou, Haotian Si, Hang Cui, Jianhui Li, Gaogang Xie, Jingjing Li, Dan Pei

First: 2025-10-06T11:24:53+00:00 · Latest: 2025-10-06T11:24:53+00:00

Comments: 13 pages

Abstract

Web service administrators must ensure the stability of multiple systems by promptly detecting anomalies in Key Performance Indicators (KPIs). Achieving the goal of "train once, infer across scenarios" remains a fundamental challenge for time series anomaly detection models. Beyond improving zero-shot generalization, such models must also flexibly handle sequences of varying lengths during inference, ranging from one hour to one week, without retraining. Conventional approaches rely on sliding-window encoding and self-supervised learning, which restrict inference to fixed-length inputs. Large Language Models (LLMs) have demonstrated remarkable zero-shot capabilities across general domains. However, when applied to time series data, they face inherent limitations due to context length. To address this issue, we propose ViTs, a Vision-Language Model (VLM)-based framework that converts time series curves into visual representations. By rescaling time series images, temporal dependencies are preserved while maintaining a consistent input size, thereby enabling efficient processing of arbitrarily long sequences without context constraints. Training VLMs for this purpose introduces unique challenges, primarily due to the scarcity of aligned time series image-text data. To overcome this, we employ an evolutionary algorithm to automatically generate thousands of high-quality image-text pairs and design a three-stage training pipeline consisting of: (1) time series knowledge injection, (2) anomaly detection enhancement, and (3) anomaly reasoning refinement. Extensive experiments demonstrate that ViTs substantially enhance the ability of VLMs to understand and detect anomalies in time series data. All datasets and code will be publicly released at: https://anonymous.4open.science/r/ViTs-C484/.

中文标题/摘要

标题：ViTs：教会机器像人类专家一样识别时间序列异常

网络服务管理员必须通过及时检测关键性能指标（KPIs）的异常来确保多个系统的稳定性。时间序列异常检测模型实现“一次训练，多场景推断”的目标仍然是一个基本挑战。除了提高零样本泛化能力，这些模型在推断过程中还必须灵活处理从一小时到一周不等长度的序列，而无需重新训练。传统方法依赖滑动窗口编码和半监督学习，这限制了推断的固定长度输入。大型语言模型（LLMs）在通用领域展示了出色的零样本能力。然而，当应用于时间序列数据时，它们由于上下文长度的限制而面临固有的局限性。为了解决这一问题，我们提出了一种基于视觉语言模型（VLM）的ViTs框架，将时间序列曲线转换为视觉表示。通过缩放时间序列图像，保持时间依赖性的同时保持一致的输入大小，从而能够高效处理任意长度的序列而无上下文约束。为了训练这种目的的VLM，引入了独特的挑战，主要是由于时间序列图像-文本对数据的稀缺性。为了解决这一问题，我们使用进化算法自动生成数千个高质量的图像-文本对，并设计了一个三阶段训练管道：（1）时间序列知识注入，（2）异常检测增强，（3）异常推理精炼。广泛的实验表明，ViTs显著增强了VLM理解并检测时间序列数据中异常的能力。所有数据集和代码将在以下网址公开发布：https://anonymous.4open.science/r/ViTs-C484/

Summary / 总结

The research aims to develop a time series anomaly detection model that generalizes across scenarios without retraining. The proposed ViTs framework converts time series curves into images and rescales them to a consistent input size, so sequences of varying lengths can be processed without context-length constraints. Key findings show that ViTs substantially improves the ability of Vision-Language Models to understand and detect anomalies in time series data, avoiding the fixed-length inputs of conventional sliding-window approaches and the context-length limits of LLMs.

研究旨在开发一种无需重新训练即可跨不同场景泛化的时序异常检测模型。提出的ViTs框架将时序数据转换为视觉表示，使得能够高效处理长度可变的序列。关键发现表明，ViTs显著提高了视觉语言模型理解并检测时序数据中异常的能力，克服了传统方法和LLMs在处理长序列时缺乏上下文约束的局限性。
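
The core idea in the ViTs abstract, keeping the model input at a constant size regardless of sequence length, can be illustrated by resampling a series onto a fixed number of horizontal positions. This is only a sketch under assumptions: the real framework renders curves as images for a VLM, whereas the hypothetical `resample` helper below shows just the length-normalization step using linear interpolation.

```python
from typing import List


def resample(series: List[float], width: int) -> List[float]:
    """Linearly interpolate `series` onto `width` evenly spaced points."""
    if width < 1 or not series:
        raise ValueError("need a non-empty series and width >= 1")
    if len(series) == 1 or width == 1:
        return [series[0]] * width
    step = (len(series) - 1) / (width - 1)
    out = []
    for i in range(width):
        pos = i * step                      # fractional index into the source
        lo = min(int(pos), len(series) - 1)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


# One hour and one week of minute-level data map to the same input width.
hour = [float(i % 7) for i in range(60)]
week = [float(i % 7) for i in range(60 * 24 * 7)]
fixed_hour = resample(hour, 224)
fixed_week = resample(week, 224)
```

Note that aggressive downsampling of a long series can smooth away short anomalies; the abstract does not say how the authors handle this, so the sketch makes no claim about it.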

LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps

Authors: Yihao Wang, Raphael Memmesheimer, Sven Behnke

First: 2025-03-15T18:54:06+00:00 · Latest: 2025-10-06T10:49:54+00:00

Comments: 12 pages, 4 figures, 2 tables, 19th International Conference on Intelligent Autonomous Systems (IAS), Genoa, Italy, June 2025

Abstract

The availability of large language models and open-vocabulary object perception methods enables more flexibility for domestic service robots. The large variability of domestic tasks can be addressed without implementing each task individually by providing the robot with a task description along with appropriate environment information. In this work, we propose LIAM - an end-to-end model that predicts action transcripts based on language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, for which we designed two pre-training tasks to fine-tune its weights and pre-align the latent spaces. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks. Our results demonstrate the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.

中文标题/摘要

标题：LIAM：多模态变换器用于语言指令、图像、动作和语义地图

大型语言模型和开放词汇对象感知方法的可用性为家用服务机器人提供了更大的灵活性。通过向机器人提供任务描述以及适当的环境信息，可以解决家庭任务的大量变异性，而无需单独实现每个任务。在本工作中，我们提出了一种名为LIAM的端到端模型，该模型基于语言、图像、动作和地图输入预测动作脚本。语言和图像输入使用CLIP骨干网络进行编码，为此我们设计了两个预训练任务来微调其权重并预先对齐潜在空间。我们在ALFRED数据集上评估了我们的方法，这是一个由模拟器生成的家庭任务基准。我们的结果表明，不同模态嵌入空间的预先对齐的重要性以及引入语义地图的有效性。

Summary / 总结

The paper introduces LIAM, an end-to-end model that predicts action transcripts using language, images, actions, and semantic maps. It uses a CLIP backbone to encode language and images, with two pre-training tasks to fine-tune and align latent spaces. The model is evaluated on the ALFRED dataset, showing the importance of pre-aligning embedding spaces and the benefits of incorporating semantic maps.

论文介绍了LIAM模型，该模型使用语言、图像、动作和语义地图来预测行动脚本。它使用CLIP骨干网络来编码语言和图像，并通过预训练两个任务来对齐不同模态的嵌入空间。该模型在ALFRED数据集上进行了评估，展示了预对齐嵌入空间的重要性以及引入语义地图的好处。

SIA: Enhancing Safety via Intent Awareness for Vision-Language Models

Authors: Youngjin Na, Sangheon Jeong, Youngwan Lee, Jian Lee, Dawoon Jeong, Youngman Kim

First: 2025-07-21T13:59:50+00:00 · Latest: 2025-10-06T10:16:31+00:00

Comments: Accepted to the Safe and Trustworthy Multimodal AI Systems (SafeMM-AI) Workshop at ICCV 2025, non-archival track

Abstract

With the growing deployment of Vision-Language Models (VLMs) in real-world applications, previously overlooked safety risks are becoming increasingly evident. In particular, seemingly innocuous multimodal inputs can combine to reveal harmful intent, leading to unsafe model outputs. While multimodal safety has received increasing attention, existing approaches often fail to address such latent risks, especially when harmfulness arises only from the interaction between modalities. We propose SIA (Safety via Intent Awareness), a training-free, intent-aware safety framework that proactively detects harmful intent in multimodal inputs and uses it to guide the generation of safe responses. SIA follows a three-stage process: (1) visual abstraction via captioning; (2) intent inference through few-shot chain-of-thought (CoT) prompting; and (3) intent-conditioned response generation. By dynamically adapting to the implicit intent inferred from an image-text pair, SIA mitigates harmful outputs without extensive retraining. Extensive experiments on safety benchmarks, including SIUO, MM-SafetyBench, and HoliSafe, show that SIA consistently improves safety and outperforms prior training-free methods.

中文标题/摘要

标题：SIA：通过意图意识提升视觉-语言模型的安全性

随着视觉-语言模型（VLMs）在实际应用中的部署越来越多，之前被忽视的安全风险变得越来越明显。特别是，看似无害的多模态输入可以结合在一起揭示有害意图，导致不安全的模型输出。尽管多模态安全问题已经引起了越来越多的关注，但现有的方法往往未能解决此类潜在风险，尤其是当有害性仅来源于模态之间的交互时。我们提出了SIA（通过意图意识提升安全性），这是一种无需训练的、意图意识的安全框架，可以主动检测多模态输入中的有害意图，并利用这些意图来引导生成安全的响应。SIA遵循三个阶段的过程：（1）通过配图进行视觉抽象；（2）通过少量链式思考（CoT）提示进行意图推断；（3）基于意图的响应生成。通过动态适应从图像-文本对中隐含推断出的意图，SIA可以在不进行大量重新训练的情况下减轻有害输出。在SIUO、MM-SafetyBench和HoliSafe等安全性基准上的广泛实验表明，SIA在提升安全性方面表现一致，并优于先前的无需训练的方法。

Summary / 总结

SIA is a training-free framework that enhances the safety of Vision-Language Models by detecting harmful intent in multimodal inputs and using it to guide safe response generation. It follows a three-stage process: visual abstraction via captioning, intent inference through few-shot CoT prompting, and intent-conditioned response generation. Experiments on the SIUO, MM-SafetyBench, and HoliSafe benchmarks show that SIA consistently improves safety and outperforms prior training-free methods.

SIA 是一个无需训练的框架，通过检测多模态输入中的有害意图并引导生成安全的响应来增强视觉-语言模型的安全性。它包括三个阶段：通过图像描述进行视觉抽象、通过少量样本的链式思考（CoT）提示进行意图推断以及基于意图的响应生成。SIA 在多个安全基准测试中表现出一致的安全性改进，并优于之前的无需训练的方法。
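
The three stages listed in the SIA abstract compose into a simple pipeline. The sketch below only shows that control flow; the function names, signatures, and the rule-based `toy_*` stand-ins are assumptions for illustration, and a real system would back each stage with a VLM call (including few-shot chain-of-thought prompts for stage 2).

```python
from typing import Callable


def sia_pipeline(
    caption_fn: Callable[[str], str],
    intent_fn: Callable[[str, str], str],
    respond_fn: Callable[[str, str, str], str],
    image: str,
    text: str,
) -> str:
    caption = caption_fn(image)              # stage 1: visual abstraction via captioning
    intent = intent_fn(caption, text)        # stage 2: intent inference from caption + text
    return respond_fn(text, caption, intent)  # stage 3: intent-conditioned response


def toy_caption(image: str) -> str:
    """Stand-in captioner; `image` is just a string label here."""
    return f"a photo showing {image}"


def toy_intent(caption: str, text: str) -> str:
    """Stand-in intent classifier that looks at both modalities together."""
    risky = "hazard" in caption and "how do i" in text.lower()
    return "potentially harmful" if risky else "benign"


def toy_respond(text: str, caption: str, intent: str) -> str:
    """Stand-in generator that conditions on the inferred intent."""
    if intent == "potentially harmful":
        return "I can't help with that request."
    return f"Answer: {text}"
```

The point of routing through an explicit intent string is that the same text query can be answered or declined depending on what the image adds, which matches the abstract's focus on harm that arises only from the interaction between modalities.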

Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

Authors: Zhe Hu, Jing Li, Zhongzhu Pu, Hou Pong Chan, Yu Yin

Venue: NeurIPS 2025

First: 2025-03-21T09:25:23+00:00 · Latest: 2025-10-06T10:03:29+00:00

Comments: Accepted at NeurIPS 2025

Abstract

Vision Language Models exhibit impressive performance for various tasks, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.

Summary / 总结

This paper addresses the limitation of Vision Language Models (VLMs) in complex decision-making by proposing Praxis-VLM, which uses text-driven reinforcement learning to enhance reasoning capabilities. Praxis-VLM employs the GRPO algorithm to learn from textual descriptions, enabling it to evaluate actions and their consequences. The model's reasoning skills, acquired from text, effectively transfer to multimodal inference with visual inputs, leading to superior performance and generalizability across various decision-making benchmarks, outperforming standard supervised fine-tuning methods.

该论文通过提出使用文本驱动的强化学习来增强推理能力的Praxis-VLM，解决了视觉语言模型在复杂决策任务中的局限性。Praxis-VLM 使用GRPO算法从文本描述中学习，使其能够评估行动及其后果。该模型在各种决策基准测试中表现出色，显著优于标准的监督微调方法，展示了更强的性能和泛化能力。
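
For readers unfamiliar with GRPO, which Praxis-VLM applies to textual scenarios, its distinguishing step is scoring each sampled response relative to the other responses for the same prompt. A minimal sketch of that group-relative advantage follows; the choice of population standard deviation and the `eps` constant are assumptions, and the policy-gradient update itself is not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one decision-making scenario, rewarded 1 if correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group mean receive positive advantages and the rest negative ones, so no separate learned value model is needed to form the baseline.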

Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation

Authors: Johanna Sommer, John Rachwan, Nils Fleischmann, Stephan Günnemann, Bertrand Charpentier

Venue: NeurIPS 2025 (AI for Science Workshop)

First: 2025-10-06T09:49:14+00:00 · Latest: 2025-10-06T09:49:14+00:00

Comments: Accepted at the AI for Science Workshop @ NeurIPS 2025

Abstract

Flow matching models generate high-fidelity molecular geometries but incur significant computational costs during inference, requiring hundreds of network evaluations. This inference overhead becomes the primary bottleneck when such models are employed in practice to sample large numbers of molecular candidates. This work discusses a training-free caching strategy that accelerates molecular geometry generation by predicting intermediate hidden states across solver steps. The proposed method operates directly on the SE(3)-equivariant backbone, is compatible with pretrained models, and is orthogonal to existing training-based accelerations and system-level optimizations. Experiments on the GEOM-Drugs dataset demonstrate that caching achieves a twofold reduction in wall-clock inference time at matched sample quality and a speedup of up to 3x compared to the base model with minimal sample quality degradation. Because these gains compound with other optimizations, applying caching alongside other general, lossless optimizations yields as much as a 7x speedup.

Summary / 总结

This work addresses the inference cost of flow matching models for molecular geometry generation, which require hundreds of network evaluations per sample. The authors propose a training-free caching strategy that predicts intermediate hidden states across solver steps instead of recomputing them. On GEOM-Drugs, the method halves wall-clock inference time at matched sample quality and reaches up to a 3x speedup over the base model with minimal quality degradation. Combined with other general, lossless optimizations, the speedup reaches as much as 7x.

本文提出了一种训练-free 缓存策略，以解决使用流匹配模型生成高保真分子几何结构时的计算开销问题。该方法在 SE(3)-对称主干上预测推理过程中的中间隐藏状态，并且兼容预训练模型。实验表明，该缓存策略可以将墙钟推理时间减少两倍，同时保持样本质量，相对于基模型可以实现最多3倍的加速，而样本质量略有下降，与其他优化结合使用时，总加速倍数可达7倍。
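
The caching idea in this abstract, predicting intermediate hidden states across solver steps rather than evaluating the backbone every time, can be sketched with a toy Euler sampler. Everything specific here is an assumption: the abstract does not state the predictor, so linear extrapolation from the last two real evaluations is used as a stand-in, features are scalars instead of SE(3)-equivariant hidden states, and `backbone`, `head`, and `refresh_every` are hypothetical names.

```python
from typing import Callable, List, Tuple


def sample_with_cache(
    backbone: Callable[[float, float], float],  # expensive: (x, t) -> feature
    head: Callable[[float], float],             # cheap: feature -> velocity
    x0: float,
    n_steps: int,
    refresh_every: int,
) -> Tuple[float, int]:
    """Euler sampler that calls `backbone` only on some steps.

    On the remaining steps the feature is linearly extrapolated from the
    two most recent real evaluations. Returns (final x, backbone calls).
    """
    x = x0
    dt = 1.0 / n_steps
    history: List[Tuple[int, float]] = []  # (step index, feature)
    calls = 0
    for i in range(n_steps):
        t = i * dt
        if len(history) < 2 or i % refresh_every == 0:
            feat = backbone(x, t)
            calls += 1
            history = (history + [(i, feat)])[-2:]
        else:
            (i0, f0), (i1, f1) = history
            feat = f1 + (f1 - f0) * (i - i1) / (i1 - i0)
        x = x + dt * head(feat)
    return x, calls


# Toy model whose feature is linear in t, so extrapolation is exact here.
full_x, full_calls = sample_with_cache(lambda x, t: t, lambda f: f, 0.0, 100, 1)
fast_x, fast_calls = sample_with_cache(lambda x, t: t, lambda f: f, 0.0, 100, 4)
```

In this toy setting the cached run matches the full run with roughly a quarter of the backbone calls; for a real network the prediction is approximate, which is why the abstract reports speedups both at matched quality and with minimal quality degradation.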

Conditional Representation Learning for Customized Tasks

Authors: Honglin Liu, Chao Sun, Peng Hu, Yunfan Li, Xi Peng

First: 2025-10-06T08:00:59+00:00 · Latest: 2025-10-06T08:00:59+00:00

Abstract

Conventional representation learning methods learn a universal representation that primarily captures dominant semantics, which may not always align with customized downstream tasks. For instance, in animal habitat analysis, researchers prioritize scene-related features, whereas universal embeddings emphasize categorical semantics, leading to suboptimal results. As a solution, existing approaches resort to supervised fine-tuning, which however incurs high computational and annotation costs. In this paper, we propose Conditional Representation Learning (CRL), aiming to extract representations tailored to arbitrary user-specified criteria. Specifically, we reveal that the semantics of a space are determined by its basis, thereby enabling a set of descriptive words to approximate the basis for a customized feature space. Building upon this insight, given a user-specified criterion, CRL first employs a large language model (LLM) to generate descriptive texts to construct the semantic basis, then projects the image representation into this conditional feature space leveraging a vision-language model (VLM). The conditional representation better captures semantics for the specific criterion, which could be utilized for multiple customized tasks. Extensive experiments on classification and retrieval tasks demonstrate the superiority and generality of the proposed CRL. The code is available at https://github.com/XLearning-SCU/2025-NeurIPS-CRL.

中文标题/摘要

标题：条件表示学习以适应定制任务

传统的表示学习方法学习一种通用表示，主要捕捉主要语义，这可能并不总是与定制的下游任务相一致。例如，在动物栖息地分析中，研究人员优先考虑场景相关的特征，而通用嵌入则强调类别语义，导致结果次优。为了解决这一问题，现有方法采用监督微调，但这种方法会带来高昂的计算和标注成本。在本文中，我们提出了一种条件表示学习（CRL），旨在提取适应任意用户指定标准的表示。具体而言，我们揭示空间语义由其基底决定，从而使得一组描述性词汇能够近似定制特征空间的基底。基于这一洞察，给定用户指定的标准，CRL首先使用大型语言模型（LLM）生成描述性文本以构建语义基底，然后利用视觉语言模型（VLM）将图像表示投影到该条件特征空间中。条件表示更好地捕捉特定标准的语义，可用于多种定制任务。在分类和检索任务上的广泛实验表明，所提出的CRL具有优越性和普适性。代码可在https://github.com/XLearning-SCU/2025-NeurIPS-CRL/ 获取。

Summary / 总结

This paper addresses the limitation of conventional representation learning methods that focus on universal semantics rather than task-specific needs. To tackle this issue, the authors propose Conditional Representation Learning (CRL), which generates representations tailored to user-specified criteria. CRL uses a large language model to generate descriptive texts that define the semantic basis, then projects image representations into a conditional feature space using a vision-language model. Experiments on classification and retrieval tasks show that CRL outperforms existing methods and is highly versatile for various customized tasks.

本文解决了传统表示学习方法专注于通用语义而非特定任务需求的问题。为此，作者提出了条件表示学习（CRL），该方法生成符合用户指定标准的表示。CRL 使用大型语言模型生成描述性文本来定义语义基础，然后使用视觉语言模型将图像表示投影到条件特征空间中。实验表明，CRL 在分类和检索任务中优于现有方法，并且具有很高的通用性。
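
The CRL abstract describes projecting an image representation into a feature space whose basis is formed by descriptive texts. One simple way to read that, assumed here and not confirmed by the source, is to take the similarity of the image embedding to each text embedding as one coordinate. The vectors below are made-up toy values; in practice both would come from a VLM's image and text encoders.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def conditional_representation(image_emb: Vector, basis_embs: List[Vector]) -> Vector:
    """One coordinate per descriptive text: similarity of image to that text."""
    return [cosine(image_emb, b) for b in basis_embs]


# Toy basis for a hypothetical criterion "habitat": two descriptive texts.
basis = [[1.0, 0.0], [0.0, 1.0]]
rep = conditional_representation([3.0, 4.0], basis)
```

Swapping in a different set of descriptive texts changes the coordinates without retraining anything, which is how the same image encoder can serve several user-specified criteria.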

Post-training quantization of vision encoders needs prefixing registers

Authors: Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee

First: 2025-10-06T07:27:46+00:00 · Latest: 2025-10-06T07:27:46+00:00

Abstract

Transformer-based vision encoders -- such as CLIP -- are central to multimodal intelligence, powering applications from autonomous web agents to robotic control. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Post-training quantization offers a practical path, but remains challenging even at 8-bit precision due to massive-scale activations (i.e., outliers). In this work, we propose RegCache, a training-free algorithm to mitigate outliers in vision encoders, enabling quantization with significantly smaller accuracy drops. The proposed RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the target vision encoder, which prevents other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experiments show that our method consistently improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.

中文标题/摘要

标题：视觉编码器的后训练量化需要前缀寄存器

基于变换器的视觉编码器——如CLIP——是多模态智能的核心，驱动着从自主网络代理到机器人控制的各种应用。由于这些应用通常需要实时处理大量视觉数据，因此降低视觉编码器的推理成本至关重要。后训练量化提供了一条可行的路径，但由于大规模激活（即异常值）的存在，即使在8位精度下也仍然具有挑战性。在本工作中，我们提出了一种无需训练的$\textit{RegCache}$算法，以减轻视觉编码器中的异常值问题，从而实现显著减小精度下降的量化。所提出的RegCache在目标视觉编码器中引入了前缀令牌，这些前缀令牌具有异常值但没有语义意义，从而防止其他令牌出现异常值。值得注意的是，我们观察到视觉编码器中的异常值与语言模型中的异常值行为不同，这促使我们提出了两种技术创新：中间层前缀和令牌删除。实验表明，我们的方法在文本监督和自我监督的视觉编码器中均能一致地提高量化模型的准确性。

Summary / 总结

This study addresses the challenge of reducing the inference cost of transformer-based vision encoders, which are crucial for real-time processing in various applications. The authors propose RegCache, a training-free method that adds outlier-prone but semantically meaningless prefix tokens so that the remaining tokens stay free of outliers, enabling quantization with significantly smaller accuracy drops. Experiments demonstrate that this approach improves the accuracy of quantized models across both text-supervised and self-supervised vision encoders.

该研究旨在降低基于变换器的视觉编码器的推理成本，这些编码器在各种应用中的实时处理至关重要。作者提出了一种名为RegCache的无训练方法，通过引入前缀标记来缓解这些编码器中的异常值问题，从而在几乎不损失准确性的前提下实现量化。实验结果表明，RegCache能够提高跨文本监督和自我监督视觉编码器的量化模型的准确性。
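
To see why the activation outliers mentioned in this abstract make 8-bit quantization hard, consider symmetric per-tensor quantization, where a single scale is set by the largest magnitude. The sketch below illustrates only that problem, not RegCache itself; the activation values are made up for demonstration.

```python
from typing import List


def quantize_dequantize(values: List[float], bits: int = 8) -> List[float]:
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


normal = [-1.0, -0.5, 0.25, 0.5, 1.0]       # typical activations (toy values)
with_outlier = normal + [500.0]             # one massive activation

err_clean = mean_abs_error(normal, quantize_dequantize(normal))
# Error on the same five values once the outlier dictates the scale.
err_outlier = mean_abs_error(normal, quantize_dequantize(with_outlier)[:5])
```

With the outlier present, the scale becomes so coarse that all five ordinary values round to zero. Confining outliers to dedicated prefix tokens, as the abstract describes, is one way to keep the remaining tokens in a range that quantizes well.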

More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models

Authors: Xurui Song, Shuo Huai, JingJing Jiang, Jiayi Kong, Jun Luo

First: 2025-10-06T06:50:16+00:00 · Latest: 2025-10-06T06:50:16+00:00

Comments: The dataset will be released publicly once the paper is accepted for publication

Abstract

Vision-Language Model (VLM) driving agents promise explainable end-to-end autonomy by first producing natural-language reasoning and then predicting trajectory planning. However, whether planning is causally driven by this reasoning remains a critical but unverified assumption. To investigate this, we build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus with plan-aligned Chain-of-Thought (CoT), automatically generated from nuPlan. Our data generation process converts sensors and annotations into structured inputs and, crucially, separates priors from to-be-reasoned signals, enabling clean information ablations. Using DriveMind, we train representative VLM agents with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) and evaluate them with nuPlan's metrics. Our results, unfortunately, indicate a consistent causal disconnect in reasoning-planning: removing ego/navigation priors causes large drops in planning scores, whereas removing CoT produces only minor changes. Attention analysis further shows that planning primarily focuses on priors rather than the CoT. Based on this evidence, we propose the Reasoning-Planning Decoupling Hypothesis, positing that the training-yielded reasoning is an ancillary byproduct rather than a causal mediator. To enable efficient diagnosis, we also introduce a novel, training-free probe that measures an agent's reliance on priors by evaluating its planning robustness against minor input perturbations. In summary, we provide the community with a new dataset and a diagnostic tool to evaluate the causal fidelity of future models.

中文标题/摘要

标题：超出表面的真相？揭开训练视觉-语言驾驶模型中的推理-规划断层

视觉-语言模型（VLM）驾驶代理通过首先生成自然语言推理，然后预测轨迹规划，承诺实现可解释的端到端自主性。然而，这种规划是否由这种推理因果驱动仍然是一个关键但未验证的假设。为了调查这一点，我们构建了DriveMind，这是一个大规模的驾驶视觉问答（VQA）数据集，其中包含与规划对齐的思维链（CoT），自动从nuPlan生成。我们的数据生成过程将传感器和注释转换为结构化输入，并且最关键的是，将先验知识与待推理的信号分离，从而实现干净的信息消融。使用DriveMind，我们用监督微调（SFT）和组相对策略优化（GRPO）训练代表性VLM代理，并用nuPlan的指标评估它们。不幸的是，我们的结果表明推理-规划之间存在一致的因果断层：移除自我/导航先验会导致规划分数大幅下降，而移除CoT只会产生微小变化。注意力分析进一步表明，规划主要关注先验而非CoT。基于这些证据，我们提出了推理-规划解耦假设，认为训练产生的推理是一个附带的副产品，而不是因果中介。为了实现高效的诊断，我们还引入了一种新的、无需训练的探针，通过评估代理在轻微输入扰动下的规划稳健性来衡量其对先验的依赖性。总之，我们为社区提供了一个新的数据集和诊断工具，以评估未来模型的因果保真度。

Summary / 总结

This study investigates whether natural-language reasoning influences trajectory planning in vision-language driving models. By creating DriveMind, a large-scale driving VQA corpus with plan-aligned CoT, the researchers found that removing ego/navigation priors significantly impacts planning scores, while removing CoT has minimal effects. This suggests that reasoning is an ancillary byproduct rather than a causal mediator in planning. The study introduces a novel probe to measure an agent's reliance on priors, providing a diagnostic tool for evaluating causal fidelity in future models.

该研究探讨了视觉-语言驾驶模型中推理与规划之间的因果关系。通过创建包含计划对齐的CoT的大规模驾驶VQA数据集DriveMind，研究人员发现移除 ego/导航先验会显著影响规划，而移除CoT则几乎没有影响。这表明推理是辅助产物而非规划的因果中介。研究还引入了一种新的探针，用于衡量代理对先验的依赖程度，为未来模型提供诊断工具。

Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

Authors: Young Kyun Jang, Ser-nam Lim

First: 2024-05-23T15:46:35+00:00 · Latest: 2025-10-06T06:02:41+00:00


Abstract

Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.

中文标题/摘要

标题：面向视觉-语言模型的跨模态后向兼容表示学习

现代检索系统往往难以升级到新的更强大的模型，因为旧模型和新模型之间的嵌入不兼容。这需要一个昂贵的过程，称为回填，涉及重新计算大量数据样本的嵌入。在视觉领域，已经提出了后向兼容训练（BT）以确保新模型与旧模型的嵌入对齐。本文将仅限于视觉的BT扩展到跨模态检索领域，这是首次尝试解决跨模态BT（XBT）问题。我们的目标是实现视觉语言预训练（VLP）模型，如CLIP，在跨模态检索任务中的后向兼容性。为了解决XBT挑战，我们提出了一种高效的解决方案：一个投影模块，将新模型的嵌入映射到旧模型的嵌入。该模块仅使用文本数据进行预训练，大大减少了XBT学习所需的图像-文本对数量，并且在预训练后，在训练过程中避免使用旧模型。此外，我们还利用参数高效的训练策略，提高效率并保留现成新模型的知识，避免任何修改。在跨模态检索数据集上的实验结果表明XBT的有效性及其在新VLP模型出现时实现无回填升级的潜力。

Summary / 总结

The research extends Backward-compatible Training (BT) from vision-only models to cross-modal retrieval (XBT), so that upgrading a vision-language model does not require costly backfilling. The authors propose a projection module, pretrained solely on text data, that maps the new model's embeddings to those of the old model, reducing the number of image-text pairs needed for XBT learning. Experiments on cross-modal retrieval datasets show the effectiveness of this approach and its potential to enable backfill-free upgrades when new VLP models are introduced.

本文旨在解决升级视觉语言模型而不重新计算嵌入的成本问题。它引入了跨模态后向兼容训练（XBT）以确保新旧模型之间的兼容性。作者提出了一种仅用文本数据预训练的投影模块，将新模型的嵌入映射到旧模型的嵌入，减少了需要的图像-文本对数量，并在训练过程中避免使用旧模型。实验结果表明，XBT能够有效实现后向兼容，并且可以在新视觉语言预训练模型出现时，无需重新计算嵌入即可实现升级过程。

SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

Authors: Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin

First: 2025-06-17T13:40:00+00:00 · Latest: 2025-10-06T04:31:42+00:00

Comments: 14 pages


Abstract

Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.

中文标题/摘要

标题：SIRI-Bench：通过复杂推理任务挑战VLM的空间智能

大型语言模型（LLMs）取得了迅速的进步，这主要归功于在复杂推理任务上的强化学习。相比之下，虽然空间智能对于视觉语言模型（VLMs）在现实世界中的交互至关重要，但对其复杂的空间推理的系统研究仍然未被充分探索。为了弥合这一差距，我们引入了SIRI-Bench，这是一个旨在通过空间关联推理任务评估VLMs结构空间智能的基准。SIRI-Bench 包含9,000个视频-问题-答案三元组，其中每个问题都嵌入在真实的3D场景中。该基准精心设计，使得解决每个问题都需要空间理解与结构推理。为了促进大规模数据合成，我们开发了一个自动场景创建引擎，该引擎利用协作的LLM代理将抽象的数学问题转化为忠实的3D场景。实验结果表明，最先进的VLMs在SIRI-Bench上面临巨大挑战，突显了结构空间推理的难度。我们希望我们的研究能够引起研究人员对空间关联推理的关注，并推动VLMs在视觉问题解决方面的进步。

Summary / 总结

SIRI-Bench is designed to evaluate VLMs' spatial intelligence through complex reasoning tasks, addressing the underexplored area of structural spatial reasoning. The benchmark includes 9,000 video-question-answer triplets set in realistic 3D scenes, requiring both spatial comprehension and structural reasoning. State-of-the-art VLMs perform poorly on SIRI-Bench, highlighting the difficulty of structural spatial reasoning. This study aims to draw attention to spatially grounded reasoning and improve VLMs in visual problem-solving.

SIRI-Bench 旨在通过现实的 3D 场景中的复杂推理任务来评估 VLM 的空间智能，包含 9,000 个视频-问题-答案三元组，需要同时具备空间理解能力和结构推理能力。最新的 VLM 在这个基准上表现不佳，突显了结构空间推理的难度。该研究旨在引起研究人员对空间关联推理的关注，并提高 VLM 在视觉问题解决方面的表现。

VaseVQA-3D: Benchmarking 3D VLMs on Ancient Greek Pottery

Authors: Nonghai Zhang, Zeyu Zhang, Jiazi Wang, Yang Zhao, Hao Tang

First: 2025-10-06T04:28:39+00:00 · Latest: 2025-10-06T04:28:39+00:00


Abstract

Vision-Language Models (VLMs) have achieved significant progress in multimodal understanding tasks, demonstrating strong capabilities particularly in general tasks such as image captioning and visual reasoning. However, when dealing with specialized cultural heritage domains like 3D vase artifacts, existing models face severe data scarcity issues and insufficient domain knowledge limitations. Due to the lack of targeted training data, current VLMs struggle to effectively handle such culturally significant specialized tasks. To address these challenges, we propose the VaseVQA-3D dataset, which serves as the first 3D visual question answering dataset for ancient Greek pottery analysis, collecting 664 ancient Greek vase 3D models with corresponding question-answer data and establishing a complete data construction pipeline. We further develop the VaseVLM model, enhancing model performance in vase artifact analysis through domain-adaptive training. Experimental results validate the effectiveness of our approach, where we improve by 12.8% on R@1 metrics and by 6.6% on lexical similarity compared with previous state-of-the-art on the VaseVQA-3D dataset, significantly improving the recognition and understanding of 3D vase artifacts, providing new technical pathways for digital heritage preservation research.

中文标题/摘要

标题：VaseVQA-3D：在古希腊陶器上评估3D VLMs

视觉-语言模型（VLMs）在多模态理解任务中取得了显著进展，特别是在图像字幕和视觉推理等通用任务中表现出强大的能力。然而，在处理如3D陶器这类专门的文化遗产领域时，现有模型面临严重的数据稀缺问题和不足的领域知识限制。由于缺乏针对性的训练数据，当前的VLMs难以有效处理这类文化上重要的专门任务。为应对这些挑战，我们提出了VaseVQA-3D数据集，这是首个用于古希腊陶器分析的3D视觉问答数据集，收集了664个古希腊陶器3D模型及其对应的问答数据，并建立了完整的数据构建管道。我们进一步开发了VaseVLM模型，通过领域适应性训练增强模型在陶器文物分析中的性能。实验结果验证了我们方法的有效性，我们在R@1指标上提高了12.8%，在词汇相似性上提高了6.6%，显著提高了3D陶器文物的识别和理解能力，为数字遗产保护研究提供了新的技术途径。

Summary / 总结

The paper addresses the limitations of existing Vision-Language Models (VLMs) in specialized cultural heritage domains, particularly 3D vase artifacts. It introduces VaseVQA-3D, the first 3D visual question answering dataset for ancient Greek pottery, comprising 664 3D vase models with corresponding question-answer data, and develops the VaseVLM model through domain-adaptive training. On VaseVQA-3D, the approach improves R@1 by 12.8% and lexical similarity by 6.6% over the previous state of the art.

研究旨在利用视觉语言模型（VLM）增强对古希腊陶器的理解和分析。为了解决数据稀缺性和领域特定知识的限制，作者提出了VaseVQA-3D数据集，其中包括664个古希腊陶器的3D模型及其对应的问答对。他们还开发了VaseVLM模型，该模型在R@1指标上提高了12.8%，在词汇相似度上提高了6.6%，相比之前的最先进的模型，显著提升了对3D陶器的识别和理解能力。

MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models

Authors: Soo Yong Kim, Suin Cho, Vincent-Daniel Yun, Gyeongyeon Hwang

First: 2025-10-06T04:26:39+00:00 · Latest: 2025-10-06T04:26:39+00:00


Abstract

Bridging clinical diagnostic reasoning with AI remains a central challenge in medical imaging. We introduce MedCLM, an automated pipeline that converts detection datasets into large-scale medical visual question answering (VQA) data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ segmentation and structured rationales. These contextual signals enable medical vision-language models to generate question-answer pairs with step-by-step reasoning. To utilize this data effectively, we propose an Integrated CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes for visual grounding, a Medium stage that encourages implicit localization, and a Hard stage for weakly supervised reasoning. Experimental results demonstrate that MedCLM attains state-of-the-art performance on several medical VQA benchmarks, providing a scalable framework for developing clinically aligned medical vision-language models.

中文标题/摘要

标题：MedCLM：在医学视觉-语言模型中通过CoT课程学习定位与推理

将临床诊断推理与AI结合仍然是医学影像中的一个核心挑战。我们介绍了MedCLM，这是一种自动流水线，通过将病变框与器官分割和结构化推理链接起来，将检测数据转换为大规模的医学视觉问答（VQA）数据，并通过链式思考（CoT）推理。这些上下文信号使医学视觉语言模型能够生成带有逐步推理的问题-答案对。为了有效利用这些数据，我们提出了一种综合的CoT课程策略，包括一个易于操作的阶段，其中包含显式的病变框以实现视觉定位，一个鼓励隐式定位的中等阶段，以及一个弱监督推理的困难阶段。实验结果表明，MedCLM在多个医学VQA基准测试中达到了最先进的性能，提供了一种可扩展的框架，用于开发临床对齐的医学视觉语言模型。

Summary / 总结

MedCLM is designed to enhance medical vision-language models by integrating clinical diagnostic reasoning. It converts detection datasets into VQA data with CoT reasoning, linking lesion boxes to organ segmentation. MedCLM uses an Integrated CoT-Curriculum Strategy with three stages to improve visual grounding and reasoning. The model achieves state-of-the-art performance on medical VQA benchmarks, offering a scalable framework for clinically aligned models.

MedCLM旨在将临床诊断推理与AI结合应用于医学影像。它通过将检测数据集转换为大规模的医学视觉问答数据，并使用链式思考（CoT）课程，将病灶框与器官分割和结构化推理链接起来，使医学视觉语言模型能够生成带有逐步推理的问答对。MedCLM包含一个集成的CoT课程策略，分为简单、中等和困难三个阶段，以有效利用数据。实验结果表明，MedCLM在多个医学VQA基准测试中达到了最先进的性能，提供了一个可扩展的框架，用于开发临床对齐的医学视觉语言模型。

ExGS: Extreme 3D Gaussian Compression with Diffusion Priors

Authors: Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun

First: 2025-09-29T13:23:06+00:00 · Latest: 2025-10-06T02:40:44+00:00


Abstract

Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at: https://github.com/chenttt2001/ExGS
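The "over 100X" figure can be checked directly from the two sizes quoted in the abstract. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
original_mb = 354.77   # model size quoted in the abstract
compressed_mb = 3.31   # compressed size quoted in the abstract

ratio = original_mb / compressed_mb
saved_fraction = 1.0 - compressed_mb / original_mb

print(f"compression ratio: {ratio:.1f}x")
print(f"storage saved: {saved_fraction:.2%}")
```

The ratio comes out at roughly 107x, which is consistent with the abstract's claim and corresponds to a storage saving of just over 99%.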

中文标题/摘要

标题：ExGS：极端3D高斯压缩与扩散先验

神经场景表示，如3D高斯斑点化（3DGS），已实现高质量的神经渲染；然而，其庞大的存储和传输成本阻碍了在资源受限环境中的部署。现有压缩方法要么依赖昂贵的优化，这既慢又场景特定，要么采用无训练剪枝和量化，这在高压缩比下会降低渲染质量。相比之下，最近的数据驱动方法为克服这一权衡提供了有希望的方向，实现了高效压缩同时保持高质量渲染。我们引入了ExGS，这是一种新颖的前馈框架，将通用高斯压缩（UGC）与GaussPainter结合用于极端3DGS压缩。UGC通过不重新优化的剪枝激进地减少高斯原语，同时保留关键信息，而GaussPainter利用强大的扩散先验和掩码引导细化，从高度剪枝的高斯场景中恢复高质量渲染。与传统的修复不同，GaussPainter不仅填补缺失区域，还增强可见像素，显著改善了降级渲染。为了确保实用性，它采用轻量级的VAE和一步扩散设计，实现实时恢复。我们的框架在保持保真度的同时，即使在具有挑战性的条件下也能显著提高图像质量，甚至可以实现超过100倍的压缩（将典型的354.77 MB模型减少到约3.31 MB）。这些结果突显了扩散先验在极端压缩与高质量神经渲染之间桥梁作用。我们的代码库将在以下地址发布：https://github.com/chenttt2001/ExGS

Summary / 总结

ExGS is a novel feed-forward framework that combines Universal Gaussian Compression (UGC) and GaussPainter for efficient 3D Gaussian Splatting (3DGS) compression. UGC reduces Gaussian primitives while retaining essential information, and GaussPainter uses diffusion priors for mask-guided refinement to restore high-quality renderings. ExGS achieves over 100X compression while maintaining fidelity and improving image quality under challenging conditions, demonstrating the effectiveness of diffusion priors in extreme compression.

ExGS 是一种结合了通用高斯压缩 (UGC) 和 GaussPainter 的新型前馈框架，用于高效压缩 3D 高斯散点图 (3DGS)。UGC 在不重新优化的情况下进行剪枝，而 GaussPainter 利用扩散先验进行细化和增强。ExGS 实现了超过 100 倍的压缩，同时保持高质量的渲染效果和实际的实时恢复能力。

A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering

Authors: Yuanhao Zou, Shengji Jin, Andong Deng, Youpeng Zhao, Jun Wang, Chen Chen

First: 2025-10-06T01:51:13+00:00 · Latest: 2025-10-06T01:51:13+00:00


Abstract

Effectively applying Vision-Language Models (VLMs) to Video Question Answering (VideoQA) hinges on selecting a concise yet comprehensive set of frames, as processing entire videos is computationally infeasible. However, current frame selection methods face a critical trade-off: approaches relying on lightweight similarity models, such as CLIP, often fail to capture the nuances of complex queries, resulting in inaccurate similarity scores that cannot reflect the authentic query-frame relevance, which further undermines frame selection. Meanwhile, methods that leverage a VLM for deeper analysis achieve higher accuracy but incur prohibitive computational costs. To address these limitations, we propose A.I.R., a training-free approach for Adaptive, Iterative, and Reasoning-based frame selection. We leverage a powerful VLM to perform deep, semantic analysis on complex queries, and this analysis is deployed within a cost-effective iterative loop that processes only a small batch of the most high-potential frames at a time. Extensive experiments on various VideoQA benchmarks demonstrate that our approach outperforms existing frame selection methods, significantly boosts the performance of the foundation VLM, and achieves substantial gains in computational efficiency over other VLM-based techniques.
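The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: `select_frames`, its parameters, and the stopping rule are hypothetical, `cheap_scores` stands in for scores from a lightweight similarity model such as CLIP, and `vlm_score` stands in for an expensive VLM relevance call.

```python
def select_frames(cheap_scores, vlm_score, batch_size=4, budget=8,
                  threshold=0.5, max_rounds=3):
    """Iteratively pass the highest-ranked unseen frames to an expensive scorer.

    cheap_scores: dict frame_index -> score from a lightweight similarity model.
    vlm_score: callable frame_index -> relevance in [0, 1] (stands in for a VLM call).
    Returns the selected frame indices and the number of expensive calls made.
    """
    ranked = sorted(cheap_scores, key=cheap_scores.get, reverse=True)
    selected, vlm_calls = [], 0
    for round_idx in range(max_rounds):
        batch = ranked[round_idx * batch_size:(round_idx + 1) * batch_size]
        if not batch:
            break
        for frame in batch:
            vlm_calls += 1
            if vlm_score(frame) >= threshold:
                selected.append(frame)
        if len(selected) >= budget:
            break  # enough relevant frames found; skip the remaining rounds
    return sorted(selected[:budget]), vlm_calls

# Toy data: 32 frames, cheap scores peak around frame 10,
# and the hypothetical VLM judges frames 8..11 as relevant.
cheap = {i: 1.0 - abs(i - 10) / 32 for i in range(32)}
frames, calls = select_frames(cheap, lambda i: 1.0 if 8 <= i <= 11 else 0.0, budget=3)
print(frames, calls)
```

In this toy run the expensive scorer is called on 4 of the 32 frames, which is the kind of saving over scoring every frame with a VLM that the abstract refers to.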

中文标题/摘要

标题：A.I.R.: 使视频问答中的自适应、迭代和基于推理的帧选择成为可能

有效地将视觉-语言模型（VLMs）应用于视频问答（VideoQA）的关键在于选择简洁而全面的帧集，因为处理整个视频在计算上是不可行的。然而，当前的帧选择方法面临一个关键权衡：依赖轻量级相似性模型（如CLIP）的方法往往无法捕捉复杂查询的细微差别，导致不准确的相似性评分，无法反映查询-帧的相关性，从而进一步削弱了帧选择。同时，利用VLM进行更深入分析的方法虽然准确性更高，但计算成本却非常高。为了解决这些限制，我们提出了A.I.R.，一种无需训练的自适应、迭代和基于推理的帧选择方法。我们利用强大的VLM对复杂的查询进行深入的语义分析，并在成本效益高的迭代循环中部署此分析，每次仅处理最有潜力的少量帧。在各种VideoQA基准上的广泛实验表明，我们的方法优于现有的帧选择方法，显著提升了基础VLM的性能，并在其他基于VLM的技术中实现了显著的计算效率提升。

Summary / 总结

The research aims to improve the effectiveness of Vision-Language Models (VLMs) in Video Question Answering (VideoQA) by addressing the challenge of frame selection. A.I.R., an adaptive, iterative, and reasoning-based approach, uses a powerful VLM for deep semantic analysis of complex queries and processes only the most promising frames iteratively. Experiments show that A.I.R. outperforms existing methods, enhances the performance of the foundation VLM, and achieves significant computational efficiency gains compared to other VLM-based techniques.

研究旨在通过解决现有方法的局限性，提高视频问答（VideoQA）中帧选择的效率和准确性。A.I.R. 是一种自适应、迭代且基于推理的帧选择方法，利用强大的视觉-语言模型（VLM）对复杂查询进行深层次语义分析，并在每次迭代中仅处理最有潜力的少量帧。实验表明，A.I.R. 在性能上优于现有方法，显著提升了基础VLM的表现，并且与其他基于VLM的技术相比实现了显著的计算效率提升。

Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting

Authors: Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, Jiahao Zhang

First: 2025-10-06T00:11:24+00:00 · Latest: 2025-10-06T00:11:24+00:00


Abstract

Vision-Language Models (VLMs) have become a central focus of today's AI community, owing to their impressive abilities gained from training on large-scale vision-language data from the Web. These models have demonstrated strong performance across diverse tasks, including image understanding, video understanding, complex visual reasoning, and embodied AI. Despite these noteworthy successes, a fundamental question remains: Can VLMs count objects correctly? In this paper, we introduce a simple yet effective benchmark, VLMCountBench, designed under a minimalist setting with only basic geometric shapes (e.g., triangles, circles) and their compositions, focusing exclusively on counting tasks without interference from other factors. We adopt strict independent variable control and systematically study the effects of simple properties such as color, size, and prompt refinement in a controlled ablation. Our empirical results reveal that while VLMs can count reliably when only one shape type is present, they exhibit substantial failures when multiple shape types are combined (i.e., compositional counting). This highlights a fundamental empirical limitation of current VLMs and motivates important directions for future research.
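To make the single-shape versus compositional setting concrete, here is a small sketch of how such counting samples and their ground truth might be generated. It is an assumption-laden illustration, not VLMCountBench itself: the shape vocabulary, the scene format, and the `exact_match` scoring are all hypothetical, and rendering the scene to an image is left out.

```python
import random

SHAPES = ("triangle", "circle", "square")  # hypothetical shape vocabulary

def make_sample(shape_types, total, seed):
    """Return (scene, ground_truth): a shape list and the per-type counts."""
    rng = random.Random(seed)
    scene = []
    for _ in range(total):
        scene.append({
            "shape": rng.choice(shape_types),
            "x": rng.random(),  # normalized canvas position
            "y": rng.random(),
            "color": "black",   # held fixed so that only composition varies
        })
    truth = {s: sum(1 for item in scene if item["shape"] == s) for s in shape_types}
    return scene, truth

def exact_match(predicted, truth):
    """Score 1 only if every per-type count is correct."""
    return int(all(predicted.get(s) == n for s, n in truth.items()))

single_scene, single_truth = make_sample(("circle",), total=12, seed=0)
mixed_scene, mixed_truth = make_sample(SHAPES, total=12, seed=0)
print(single_truth, mixed_truth)
```

Keeping the total number of objects and the color fixed while changing only the number of shape types mirrors the controlled-variable design the abstract describes.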

中文标题/摘要

标题：你的视觉语言模型连数到20都做不到：揭示VLMs在组合计数中的失败

视觉语言模型（VLMs）已成为当今AI社区的中心焦点，这得益于它们从网络大规模视觉语言数据中获得的令人印象深刻的性能。这些模型在图像理解、视频理解、复杂视觉推理和具身AI等多种任务中表现出色。尽管取得了这些显著的成功，一个基本问题仍然存在：VLMs能否正确计数物体？在本文中，我们介绍了一个简单而有效的基准VLMCountBench，该基准在仅包含基本几何形状（例如三角形、圆）及其组合的极简主义设置下设计，专注于计数任务，不受到其他因素的干扰。我们采用严格的独立变量控制，并系统研究了颜色、大小和提示细化等简单属性的影响。我们的实证结果表明，当只有一种形状类型存在时，VLMs可以可靠地计数，但当多种形状类型组合（即组合计数）时，它们表现出显著的失败。这揭示了当前VLMs的基本实证局限性，并为未来研究指出了重要方向。

Summary / 总结

This paper introduces VLMCountBench, a minimalist benchmark for evaluating the counting abilities of Vision-Language Models (VLMs). The study focuses on basic geometric shapes and their compositions, controlling for factors like color, size, and prompt refinement. The results show that VLMs can count reliably when only one shape type is present but struggle significantly when multiple shape types are combined, indicating a fundamental empirical limitation of current VLMs in compositional counting.

该论文引入了VLMCountBench基准，用于评估VLM的计数能力。它专注于基本几何形状及其组合，以研究颜色和大小等简单属性的影响。结果表明，当只有一种形状类型时，VLM可以可靠地计数，但当结合多种形状类型时，它们的表现显著下降，这表明当前VLM在组合计数方面存在根本性的局限性。