Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Authors: Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
First: 2026-01-29T18:59:24+00:00 · Latest: 2026-01-29T18:59:24+00:00
Comments: 26 pages, 31 figures, 13 tables. Project Page: https://sites.google.com/view/vi-probe/
Abstract
Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.
中文标题/摘要
标题:VLMs 是感知还是回忆?经典视觉错觉探究视觉感知与记忆
大型视觉-语言模型(VLMs)在原始图像上对经典视觉错觉通常会给出“正确”的回答,但在错觉因素反转后仍然坚持相同的回答,尽管人类可以明显察觉到视觉变化。这引发了一个基本问题:VLMs 是感知视觉变化还是仅仅回忆已记忆的模式?尽管已有几项研究注意到了这一现象,但其背后的成因仍然不清楚。为了从观察转向系统理解,本文引入了VI-Probe,这是一种可控的视觉错觉框架,具有分级扰动和匹配的视觉对照(无错觉诱导器),以解开基于视觉的感知与语言驱动的回忆之间的关系。不同于以往工作主要关注平均准确率,我们使用极性反转一致性、模板固定指数以及与匹配对照标准化后的错觉乘数来衡量稳定性和敏感性。在不同家族的实验中发现,反应持久性源自多种原因而非单一机制。例如,GPT-5 表现出记忆覆盖,Claude-Opus-4.1 显示感知与记忆的竞争,而 Qwen 变体则表明视觉处理的限制。我们的发现挑战了单一成因的观点,并促使基于探针的评估,该评估衡量知识和对受控视觉变化的敏感性。数据和代码可在 https://sites.google.com/view/vi-probe/ 获取。
Summary / 总结
This paper investigates whether large vision-language models (VLMs) perceive visual changes or merely recall memorized patterns by using a controllable visual-illusion framework called VI-Probe. The study measures stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier. The experiments across different VLMs reveal that response persistence arises from various causes, challenging the notion of a single mechanism. The findings suggest that probing-based evaluation is necessary to measure both knowledge and sensitivity to controlled visual changes.
该研究通过使用可控视觉错觉框架VI-Probe,探讨大型视觉-语言模型(VLMs)是感知视觉变化还是仅回忆记忆模式。研究使用极性反转一致性、模板固定指数和与匹配控制相比的错觉乘数来衡量稳定性和敏感性。实验表明,响应持久性源自异质原因,挑战了单一原因的观点,并强调了需要进行基于探针的评估,以衡量对受控视觉变化的知识和敏感性。
SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence
Authors: Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi
First: 2026-01-29T18:41:52+00:00 · Latest: 2026-01-29T18:41:52+00:00
Abstract
Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
中文标题/摘要
标题:SINA:使用人工智能的电路原理图图像到网表生成器
当前将电路原理图图像转换为机器可读网表的方法在组件识别和连接推理方面存在困难。在本文中,我们介绍了SINA,这是一个开源的全自动电路原理图图像到网表生成器。SINA结合了深度学习进行准确的组件检测、连通组件标记(CCL)进行精确的连接提取、光学字符识别(OCR)进行组件参考标识符检索,并使用视觉语言模型(VLM)进行可靠的参考标识符分配。在我们的实验中,SINA的整体网表生成准确率为96.47%,比最先进的方法高2.72倍。
Summary / 总结
SINA is an open-source tool that uses artificial intelligence to convert circuit schematic images into machine-readable netlists. It combines deep learning for component detection, CCL for connectivity extraction, OCR for reference designator retrieval, and a VLM for reliable reference designator assignments. The experiments show that SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72 times higher than the current state-of-the-art methods.
SINA 是一种自动化的电路原理图图像到网表生成器,它使用深度学习进行组件检测、CCL 进行连接提取、OCR 进行参考标识符检索,以及 VLM 进行参考标识符分配。它实现了 96.47% 的整体网表生成准确率,比现有方法高 2.72 倍。
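The abstract names Connected-Component Labeling (CCL) as the connectivity-extraction step. As a minimal sketch of what CCL does, and not SINA's actual implementation, the snippet below labels 4-connected foreground pixels in a small binary grid using breadth-first search. The `wire_mask` grid, the function name `label_components`, and the idea that each resulting component corresponds to one electrical net are illustrative assumptions.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected foreground cells (value 1) in a binary grid.

    Returns (labels, count): labels has the same shape as grid, background
    cells are 0, and each connected component gets an id in 1..count.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or labels[r][c] != 0:
                continue  # background, or already part of a labeled component
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # flood-fill the whole component from this seed
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical wire mask: two separate wires left after symbols are removed.
wire_mask = [
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
labels, count = label_components(wire_mask)
```

In this toy case the two wires receive labels 1 and 2. In a netlist setting, component pins touching pixels with the same label would presumably be assigned to the same net.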
Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering
Authors: Dongxuan Zhu, Ly Tran Ho Khanh, Andy Yat-Ming Cheung, Man-Chung Yue, Viet Anh Nguyen
Venue: ICLR 2026
First: 2026-01-29T17:17:04+00:00 · Latest: 2026-01-29T17:17:04+00:00
Comments: 34 pages, 2 figures. Accepted for publication at ICLR 2026
Abstract
Language models often default to a narrow set of high-probability outputs, leaving their generation paths homogeneous and prone to mode collapse. Sampling-based strategies inject randomness but still struggle to guarantee diversity across multiple concurrent generation runs. We address this limitation by introducing STARS ($\textbf{St}$iefel-based $\textbf{A}$ctivation Steering for Diverse $\textbf{R}$ea$\textbf{S}$oning), a training-free, inference-time intervention method that transforms activation steering into an exploration engine. At each token, STARS collects the hidden activations of concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold. STARS maximizes the geometric volume of the steered activations, while the Stiefel manifold induces orthogonality of the steering interventions. This formulation explicitly promotes divergent activation vectors of concurrent generation runs, and implicitly promotes divergent generation trajectories. This manifold optimization formulation can be solved using a Riemannian gradient descent algorithm with convergence guarantees, but this algorithm is too time-consuming for real-time inference. To guarantee low latency, we further design a lightweight one-step update with an aggressive, closed-form stepsize. For test case generation and scientific discovery benchmarks, STARS consistently outperforms standard sampling methods, achieving greater diversity without sacrificing qualitative performance.
中文标题/摘要
标题:通过推断时斯蒂费尔激活引导探索多样的生成路径
语言模型通常默认生成一组窄范围的高概率输出,导致生成路径同质化且容易发生模式崩溃。基于采样的策略虽然引入了随机性,但在多个并发生成运行中仍难以保证多样性。我们通过引入STARS(基于斯蒂费尔的激活引导以促进多样推理)来解决这一限制,这是一种无需训练的、在推断时进行干预的方法,将激活引导转化为探索引擎。在每个标记处,STARS 收集并发生成运行的隐藏激活,并在斯蒂费尔流形上联合优化多个附加的引导方向。STARS 最大化引导激活的几何体积,而斯蒂费尔流形诱导引导干预的正交性。这种形式明确地促进了并发生成运行的发散激活向量,并隐式地促进了发散的生成轨迹。这种流形优化形式可以通过黎曼梯度下降算法求解,具有收敛保证,但该算法对于实时推断来说耗时过长。为了保证低延迟,我们进一步设计了一种轻量级的一步更新,采用激进的闭式步长。在测试案例生成和科学发现基准测试中,STARS 一致地优于标准采样方法,在不牺牲质量性能的情况下实现了更高的多样性。
Summary / 总结
The paper addresses the issue of homogeneity in language model generation paths by introducing STARS, a method that optimizes activation steering at inference time. STARS collects hidden activations from concurrent generation runs and optimizes multiple steering directions on the Stiefel manifold, promoting divergent activation vectors and generation trajectories. Experiments show that STARS outperforms standard sampling methods in generating diverse test cases and scientific discoveries without compromising qualitative performance.
论文通过引入STARS方法,在推理时优化多个在Stiefel流形上的引导方向,以促进多样化的生成路径。STARS最大化引导激活的几何体积,并确保引导干预的正交性,从而导致生成轨迹的发散。实验表明,STARS在保持质量的同时,比标准采样方法在多样性方面表现出色。
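The two ingredients named in the abstract, orthogonal steering directions and the geometric volume of the steered activations, can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses Gram-Schmidt to obtain orthonormal directions and measures volume as the product of Gram-Schmidt residual norms. It does not implement the paper's Riemannian gradient descent or its one-step closed-form update, and the vectors, the step size `alpha`, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, tol=1e-12):
    """Return (orthonormal basis, residual norm of each input vector)."""
    basis, residual_norms = [], []
    for v in vectors:
        w = list(v)
        for b in basis:  # remove the components along earlier basis vectors
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        residual_norms.append(norm)
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis, residual_norms


def volume(vectors):
    """Volume of the parallelepiped spanned by the vectors."""
    _, norms = gram_schmidt(vectors)
    return math.prod(norms)


# Three concurrent runs that currently share the same hidden activation.
hidden = [1.0, 0.0, 0.0, 0.0]
activations = [hidden, hidden, hidden]

# Hypothetical raw steering directions, made orthonormal before use.
raw_directions = [[0.0, 2.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 3.0]]
directions, _ = gram_schmidt(raw_directions)

alpha = 1.0  # assumed steering strength
steered = [[h + alpha * d for h, d in zip(act, direction)]
           for act, direction in zip(activations, directions)]

volume_before = volume(activations)
volume_after = volume(steered)
```

Identical activations span zero volume, while adding mutually orthogonal directions gives a strictly positive volume. That is the sense in which an orthogonality constraint pushes concurrent runs apart.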
Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
Authors: Konstantinos P. Panousis, Diego Marcos
First: 2026-01-29T16:28:55+00:00 · Latest: 2026-01-29T16:28:55+00:00
Abstract
The widespread adoption of Vision-Language Models (VLMs) across fields has amplified concerns about model interpretability. Distressingly, these models are often treated as black boxes, with limited or non-existent investigation of their decision-making process. Despite numerous post- and ante-hoc interpretability methods, systematic and objective evaluation of the learned representations remains limited, particularly for sparsity-aware methods that are increasingly considered to "induce interpretability". In this work, we focus on Concept Bottleneck Models (CBMs) and investigate how different modeling decisions affect the emerging representations. We introduce the notion of clarity, a measure capturing the interplay between the downstream performance and the sparsity and precision of the concept representation, while proposing an interpretability assessment framework using datasets with ground truth concept annotations. We consider both VLM- and attribute predictor-based CBMs, and three different sparsity-inducing strategies: per-example $\ell_1$, $\ell_0$, and Bernoulli-based formulations. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels. The code will be made publicly available upon publication.
中文标题/摘要
标题:清晰度:稀疏感知概念瓶颈模型中的灵活性-可解释性权衡
视觉-语言模型(VLMs)在各领域的广泛应用加剧了人们对模型可解释性的担忧。令人不安的是,这些模型往往被视为黑箱,对其决策过程的研究极为有限或不存在。尽管存在众多事后和事前的可解释性方法,但对学习表示的系统性和客观性评估仍然有限,特别是对于越来越多被认为能“诱导可解释性”的稀疏感知方法。在本文中,我们专注于概念瓶颈模型,并探讨不同的建模决策如何影响生成的表示。我们引入了清晰度的概念,这是一个衡量指标,捕捉下游性能与概念表示的稀疏性和精度之间的相互作用,同时提出了一种使用具有真实概念注释的数据集进行可解释性评估的框架。我们考虑了基于VLM和属性预测器的概念瓶颈模型(CBMs),以及三种不同的稀疏诱导策略:每例$\ell_1$、$\ell_0$和伯努利形式。我们的实验揭示了灵活性和可解释性之间的关键权衡,在相似的性能水平下,给定的方法可能会表现出截然不同的行为。代码将在发表后公开可用。
Summary / 总结
This work addresses the interpretability challenge in Vision-Language Models by focusing on Concept Bottleneck Models. It introduces a measure called clarity to evaluate the trade-off between the flexibility and interpretability of sparsity-aware concept representations. The study considers different sparsity-inducing strategies and finds that there is a critical trade-off between flexibility and interpretability, with methods showing different behaviors even at similar performance levels.
该研究关注Vision-Language Models (VLMs)的可解释性问题,通过聚焦Concept Bottleneck Models进行探讨。引入了一个称为清晰度的度量,评估模型性能与概念表示的稀疏性和精确性的平衡。研究提出了一种使用带有概念标注真实值的数据集的可解释性评估框架,并考虑了三种稀疏性诱导策略。实验显示,在相似性能水平下,不同方法之间存在灵活性和可解释性之间的权衡,不同方法的表现可能会有所不同。
A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances
Authors: Brian B. Moser, Arundhati S. Shanbhag, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
First: 2025-05-23T12:18:34+00:00 · Latest: 2026-01-29T16:22:04+00:00
Abstract
Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.
中文标题/摘要
标题:核心集选择的核心集选择文献综述:介绍与最新进展
核心集选择旨在找到一个能够保留大数据集关键模式的小型代表性子集,以支持有效的机器学习。尽管已有许多综述研究了数据缩减策略,但大多数综述主要集中在经典几何方法或主动学习技术上。相比之下,本文综述通过将无训练、以训练为导向和无标签三大类核心集研究方法统一到一个分类体系中,提供了一个更全面的观点。我们介绍了现有工作中经常忽略的子领域,包括亚模形式、 bilevel优化以及无标签数据集中的伪标签最新进展。此外,我们还探讨了剪枝策略如何影响泛化能力和神经网络的标度法则,提供了先前综述中未提及的新见解。最后,我们在不同的计算、鲁棒性和性能需求下比较了这些方法,并指出了未来研究中需要解决的开放挑战,如鲁棒性、异常值过滤以及将核心集选择适应基础模型等。
Summary / 总结
This paper aims to provide a comprehensive overview of coreset selection methods, which involve selecting a small subset of data that retains the essential patterns for effective machine learning. The authors unify three major lines of research—training-free, training-oriented, and label-free approaches—into a single taxonomy. They also explore submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Key findings include insights into how pruning strategies affect generalization and neural scaling laws, and they highlight open challenges such as robustness and outlier filtering for future research.
本文探讨了从大规模数据集中选择一个小而具代表性的子集以实现有效机器学习的问题,即核心集选择。它引入了一个统一的分类框架,结合了训练前、训练中和无标签方法,涵盖了子模形式、 bilevel优化以及无标签数据集中的伪标签最新进展。研究还探讨了剪枝策略如何影响泛化能力和神经网络的规模法则,提供了关于核心集选择方法及其在不同条件下的性能的新见解。还指出了鲁棒性、异常值过滤等未来研究中的开放挑战。
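To make the idea of a training-free, geometry-based coreset concrete, here is a minimal sketch of k-center greedy selection, a classic heuristic chosen purely as an illustration. The abstract does not name specific algorithms, so this is not presented as a method from the survey. The toy `points`, the starting index, and the function name are assumptions.

```python
import math


def k_center_greedy(points, k, first=0):
    """Pick k indices so that every point lies close to some picked point.

    Starts from index `first`, then repeatedly adds the point that is
    farthest from the centers selected so far.
    """
    selected = [first]
    # Distance from each point to its nearest selected center.
    nearest = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(idx)
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return selected


# Toy dataset with three well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
coreset = k_center_greedy(points, k=3)
```

With `k=3` the selection contains one point from each group, which is the sense in which a small subset can remain representative of the full dataset.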
Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models
Authors: Cong Cao, Huanjing Yue, Shangbin Xie, Xin Liu, Jingyu Yang
First: 2026-01-29T16:14:07+00:00 · Latest: 2026-01-29T16:14:07+00:00
Abstract
Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogenous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogenous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experimental results demonstrate the superiority of the proposed method.
中文标题/摘要
标题:利用视频扩散模型辅助的零样本视频恢复与增强
尽管基于扩散的零样本图像恢复和增强方法已经取得了巨大成功,但将其应用于视频恢复或增强会导致严重的时域闪烁。在本文中,我们提出了第一个利用快速发展的视频扩散模型辅助基于图像的方法,以保持更时域一致性进行零样本视频恢复和增强的框架。我们提出了同源潜变量融合、异源潜变量融合以及基于COT的融合比例策略,利用同源和异源文本到视频扩散模型来补充图像方法。此外,我们提出了时域增强后处理,利用图像到视频扩散模型进一步提高时域一致性。我们的方法无需训练,可以应用于任何基于扩散的图像恢复和增强方法。实验结果表明了所提出方法的优越性。
Summary / 总结
The research aims to address the issue of temporal flickering in zero-shot video restoration and enhancement using diffusion models. The authors propose a framework that integrates video diffusion models to enhance temporal consistency. They introduce fusion strategies for homologous and heterogenous text-to-video diffusion models and a COT-based fusion ratio strategy. Additionally, they propose temporal-strengthening post-processing to further improve temporal consistency. The method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experiments show the proposed method's superiority in maintaining temporal consistency.
研究旨在解决使用扩散模型进行零样本视频修复和增强时出现的严重时间闪烁问题。作者提出了一种框架,利用视频扩散模型辅助图像方法,以确保更好的时间一致性。他们引入了三种融合策略和后处理技术,以整合同源和异源的文本到视频扩散模型。该方法无需训练,可以应用于任何基于扩散的图像修复和增强方法。实验结果表明,所提出的方法在保持时间一致性方面优于现有方法。
FreeFuse: Multi-Subject LoRA Fusion via Adaptive Token-Level Routing at Test Time
Authors: Yaoli Liu, Yao-Xiang Ding, Kun Zhou
First: 2025-10-27T16:54:08+00:00 · Latest: 2026-01-29T16:14:07+00:00
Abstract
This paper proposes FreeFuse, a training-free framework for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to prior studies that focus on retraining LoRA to alleviate feature conflicts, our analysis reveals that simply spatially confining the subject LoRA's output to its target region and preventing other LoRAs from directly intruding into this area is sufficient for effective mitigation. Accordingly, we implement Adaptive Token-Level Routing during the inference phase. We introduce FreeFuseAttn, a mechanism that exploits the flow matching model's intrinsic semantic alignment to dynamically match subject-specific tokens to their corresponding spatial regions at early denoising timesteps, thereby bypassing the need for external segmentors. FreeFuse distinguishes itself through high practicality: it necessitates no additional training, model modifications, or user-defined masks or spatial conditions. Users need only provide subject activation words to achieve seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both identity preservation and compositional fidelity. Our code is available at https://github.com/yaoliliu/FreeFuse.
中文标题/摘要
标题:FreeFuse:通过自适应令牌级路由在测试时进行多主题LoRA融合的无训练框架
本文提出FreeFuse,一种无需训练的多主题文本到图像生成框架,通过自动融合多个主题LoRA实现。与以往专注于重新训练LoRA以缓解特征冲突的研究不同,我们的分析表明,简单地将主题LoRA的输出空间限制在其目标区域,并防止其他LoRA直接侵入该区域就足以有效缓解冲突。因此,在推理阶段我们实现了自适应令牌级路由。我们引入了FreeFuseAttn机制,该机制利用流匹配模型固有的语义对齐,在早期去噪时间步动态匹配主题特定的令牌到其相应的空间区域,从而绕过了对外部分割器的需求。FreeFuse通过其高实用性脱颖而出:它不需要额外的训练、模型修改或用户定义的空间条件。用户只需提供主题激活词即可无缝集成到标准工作流程中。广泛的实验验证了FreeFuse在身份保留和组成保真度方面优于现有方法。我们的代码可在https://github.com/yaoliliu/FreeFuse获取。
Summary / 总结
FreeFuse is a training-free framework for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs without retraining. It uses Adaptive Token-Level Routing during inference to spatially confine each subject's output to its target region, preventing other LoRAs from intruding. FreeFuse outperforms existing methods in identity preservation and compositional fidelity, requiring no additional training, model modifications, or user-defined masks. Users only need to provide subject activation words. Extensive experiments validate its effectiveness.
FreeFuse 是一个无需训练的多主题文本到图像生成框架,通过自动融合多个主题 LoRA 实现。它在推理阶段使用自适应令牌级路由,将主题 LoRA 的输出限制在其目标区域内,防止其他 LoRA 干扰。实验表明,FreeFuse 在保留身份和组成保真度方面优于现有方法,无需额外训练或模型修改。用户只需提供主题激活词即可。
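The core idea of confining each subject LoRA's output to its own region can be sketched as a routing step followed by a masked sum. The snippet below is a simplified illustration, not the FreeFuse code. The affinity `scores` are made-up numbers standing in for whatever the model's attention provides, the hard argmax assignment is an assumption, and each spatial token carries a single scalar feature to keep the example short.

```python
def route_tokens(scores):
    """scores[s][t] is the affinity of subject s for spatial token t.

    Returns, for every token, the index of the subject that owns it.
    """
    num_subjects, num_tokens = len(scores), len(scores[0])
    return [max(range(num_subjects), key=lambda s: scores[s][t])
            for t in range(num_tokens)]


def fuse(base, lora_deltas, owners):
    """Add to each token only the delta of the LoRA that owns that token."""
    return [base[t] + lora_deltas[owners[t]][t] for t in range(len(base))]


# Hypothetical affinities for two subjects over four spatial tokens.
scores = [
    [0.9, 0.8, 0.1, 0.2],  # subject 0 dominates the first two tokens
    [0.1, 0.3, 0.7, 0.6],  # subject 1 dominates the last two tokens
]
owners = route_tokens(scores)

base = [1.0, 1.0, 1.0, 1.0]          # base-model output per token
lora_deltas = [
    [0.5, 0.5, 0.5, 0.5],            # subject 0 LoRA contribution
    [-0.25, -0.25, -0.25, -0.25],    # subject 1 LoRA contribution
]
fused = fuse(base, lora_deltas, owners)
```

Because every token receives exactly one LoRA's delta, neither adapter writes into the other's region, which is the conflict-avoidance property the abstract describes.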
Improving Classifier-Free Guidance of Flow Matching via Manifold Projection
Authors: Jian-Feng Cai, Haixia Liu, Zhengyi Su, Chao Wang
First: 2026-01-29T15:49:31+00:00 · Latest: 2026-01-29T15:49:31+00:00
Comments: 24 pages, 14 figures
Abstract
Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
中文标题/摘要
标题:通过流匹配中的流形投影改进无分类器引导
无分类器引导(CFG)是一种广泛用于扩散和基于流模型可控生成的技术。尽管在实践中取得了成功,但CFG依赖于敏感于引导尺度的启发式线性外推。在本文中,我们通过优化的角度为CFG提供了一个原理性的解释。我们证明流匹配中的速度场对应于一系列平滑距离函数的梯度,这引导潜在变量向缩放的目标图像集移动。这种视角揭示了标准的CFG公式是该梯度的近似,其中预测差距,即条件输出与无条件输出之间的差异,决定了引导的敏感性。利用这一洞察,我们将CFG采样重新表述为具有流形约束的同伦优化。这种表述需要一个流形投影步骤,我们在采样过程中通过增量梯度下降方案实现。为了提高计算效率和稳定性,我们进一步通过Anderson加速改进了这一迭代过程,而无需额外的模型评估。我们提出的方法是训练免费的,并且一致地提高了生成保真度、提示对齐和对引导尺度的鲁棒性。我们在多种基准上验证了其有效性,展示了在DiT-XL-2-256、Flux和Stable Diffusion 3.5等大型模型上的显著改进。
Summary / 总结
This work aims to improve classifier-free guidance (CFG) in flow-based models by providing a principled interpretation through optimization. The authors show that the velocity field in flow matching corresponds to the gradient of smoothed distance functions, guiding latent variables towards the scaled target image set. They reformulate CFG as a homotopy optimization with a manifold constraint, implementing a manifold projection via incremental gradient descent and enhancing it with Anderson Acceleration. These methods consistently refine generation fidelity, prompt alignment, and robustness to guidance scale, showing significant improvements on large-scale models like DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
本文通过优化视角对分类器无指导(CFG)在扩散和流基模型中的局限性进行了改进,将其重新表述为具有流形约束的同伦优化,并通过增量梯度下降方案实现流形投影。通过Anderson加速进一步提高计算效率和稳定性,而无需额外的模型评估。实验表明,该方法在DiT-XL-2-256、Flux和Stable Diffusion 3.5等模型上一致提高了生成保真度、提示对齐和对指导尺度的鲁棒性。
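For reference, the heuristic linear extrapolation that the abstract attributes to standard CFG can be written in a few lines. This sketch shows one common parameterization of CFG together with the prediction gap the abstract mentions. It does not include the paper's manifold projection or Anderson Acceleration, and the vectors and function names are illustrative assumptions.

```python
def prediction_gap(v_uncond, v_cond):
    """Discrepancy between the conditional and unconditional outputs."""
    return [c - u for u, c in zip(v_uncond, v_cond)]


def cfg_velocity(v_uncond, v_cond, scale):
    """Standard CFG: start at the unconditional output and move along
    the prediction gap by a factor of `scale`."""
    gap = prediction_gap(v_uncond, v_cond)
    return [u + scale * g for u, g in zip(v_uncond, gap)]


# Toy two-dimensional model outputs.
v_uncond = [1.0, 0.0]
v_cond = [2.0, 1.0]

no_guidance = cfg_velocity(v_uncond, v_cond, scale=0.0)
plain_conditional = cfg_velocity(v_uncond, v_cond, scale=1.0)
strong_guidance = cfg_velocity(v_uncond, v_cond, scale=3.0)
```

A scale of 0 returns the unconditional output and a scale of 1 returns the conditional one. Larger scales extrapolate past the conditional output in proportion to the gap, so a large gap makes the result sensitive to the chosen scale.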
Trajectory-Guided Diffusion for Foreground-Preserving Background Generation in Multi-Layer Documents
Authors: Taewon Kang
First: 2026-01-29T15:28:48+00:00 · Latest: 2026-01-29T15:28:48+00:00
Comments: 47 pages, 36 figures
Abstract
We present a diffusion-based framework for document-centric background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design rather than explicit constraints. Instead of suppressing diffusion updates or applying masking heuristics, our approach reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space. By shaping the initial noise and its geometric alignment, background generation naturally avoids designated foreground regions, allowing readable content to remain intact without auxiliary mechanisms. To address the long-standing issue of stylistic drift across pages, we decouple style control from text conditioning and introduce cached style directions as persistent vectors in latent space. Once selected, these directions constrain diffusion trajectories to a shared stylistic subspace, ensuring consistent appearance across pages and editing iterations. This formulation eliminates the need for repeated prompt-based style specification and provides a more stable foundation for multi-page generation. Our framework admits a geometric and physical interpretation, where diffusion paths evolve on a latent manifold shaped by preferred directions, and foreground regions are rarely traversed as a consequence of trajectory initialization rather than explicit exclusion. The proposed method is training-free, compatible with existing diffusion backbones, and produces visually coherent, foreground-preserving results across complex documents. By reframing diffusion as trajectory design in latent space, we offer a principled approach to consistent and structured generative modeling.
中文标题/摘要
标题:轨迹引导扩散在多层文档中实现前景保留背景生成
我们提出了一种基于扩散的文档中心背景生成框架,通过潜在空间设计而非显式约束实现前景保留和多页风格一致性。我们的方法重新解释了扩散作为结构化潜在空间中随机轨迹的演变,而不是抑制扩散更新或应用遮罩启发式方法。通过塑造初始噪声及其几何对齐,背景生成自然地避开指定的前景区域,使可读内容保持完整,无需辅助机制。为了解决跨页风格漂移这一长期问题,我们将风格控制与文本条件分离,并引入缓存风格方向作为潜在空间中的持久向量。一旦选定,这些方向将约束扩散轨迹到共享的风格子空间,确保跨页和编辑迭代的一致外观。这种表述消除了重复提示式风格指定的需要,并为多页生成提供了一个更稳定的基础。我们的框架具有几何和物理解释,其中扩散路径在由偏好方向塑造的潜在流形上演变,由于轨迹初始化而不是显式排除,前景区域很少被穿越。所提出的方法无需训练,与现有的扩散主干兼容,并在复杂文档中产生视觉上连贯、前景保留的结果。通过将扩散重新构想为潜在空间中的轨迹设计,我们提供了一种原理性的方法来实现一致和结构化的生成建模。
Summary / 总结
This paper introduces a diffusion-based framework for generating background in multi-layer documents while preserving the foreground. The method reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space, avoiding designated foreground regions by shaping initial noise. To maintain stylistic consistency across pages, the approach uses cached style directions as persistent vectors in latent space, ensuring consistent appearance without the need for repeated style specification. The results are visually coherent and foreground-preserving across complex documents.
论文提出了一种基于扩散的框架,用于在多层文档中生成背景同时保留前景。该方法通过潜空间设计避免显式地抑制前景,并通过引入缓存的风格方向来实现跨页面的一致性风格。这种方法消除了重复风格指定的需要,并确保在不使用辅助机制的情况下产生视觉上连贯的结果,为多页生成提供了一个稳定的基础。
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
Authors: Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, Lijun Wu
First: 2026-01-29T15:07:28+00:00 · Latest: 2026-01-29T15:07:28+00:00
Abstract
Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7\% (123K samples) achieves performance comparable to the full dataset. Notably, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.
中文标题/摘要
标题:MMFineReason: 通过开放数据为中心的方法缩小多模态推理差距
视觉语言模型(VLMs)的最新进展推动了视觉推理的重大进步。然而,开源VLMs仍落后于专有系统,主要原因是缺乏高质量的推理数据。现有数据集在STEM图表和视觉谜题等挑战性领域覆盖不足,缺乏用于激发强大推理能力的一致且长格式的推理链(CoT)注释。为弥合这一差距,我们引入了MMFineReason,这是一个包含180万样本和51亿个解决方案标记的大规模多模态推理数据集,这些注释源自Qwen3-VL-235B-A22B-Thinking。该数据集通过系统性的三阶段管道建立:(1)大规模数据收集和标准化,(2)生成推理链(CoT)理由,(3)基于推理质量和难度意识的全面筛选。该数据集涵盖了STEM问题、视觉谜题、游戏和复杂图表,每个样本都附有视觉支持的推理痕迹。我们对MMFineReason进行微调Qwen3-VL-Instruct,开发了MMFineReason-2B/4B/8B版本。我们的模型在相应规模类别中建立了新的最佳结果。值得注意的是,MMFineReason-4B成功超越了Qwen3-VL-8B-Thinking,而MMFineReason-8B甚至超过了Qwen3-VL-30B-A3B-Thinking,接近Qwen3-VL-32B-Thinking,展示了显著的参数效率。通过我们的难度意识筛选策略,我们发现了一个“少即是多”的现象:仅7%(12.3万样本)的子集就能达到与完整数据集相当的性能。此外,我们揭示了推理导向的数据组合具有协同效应,同时提升了通用能力。
LLM-based Few-Shot Early Rumor Detection with Imitation Agent
Authors: Fengzhu Zeng, Qian Shao, Ling Cheng, Wei Gao, Shih-Fen Cheng, Jing Ma, Cheng Niu
Venue: KDD 2026
First: 2025-12-20T12:42:27+00:00 · Latest: 2026-01-29T15:01:08+00:00
Comments: Accepted at KDD 2026
Abstract
Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for \textit{early time point determination}, while the LLM serves as a powerful \textit{rumor detector}. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.
中文标题/摘要
标题:基于LLM的少样本早期谣言检测与模仿代理
早期谣言检测(EARD)旨在根据一系列社交媒体帖子,识别出一个声明可以被准确分类的最早时间点。这在数据稀缺的环境中尤其具有挑战性。虽然大型语言模型(LLMs)在少样本NLP任务中表现良好,但它们不适用于时间序列数据,并且在训练和推理时计算成本高昂。在本文中,我们提出了一种新颖的EARD框架,该框架结合了一个自主代理和一个基于LLM的检测模型,其中代理作为可靠的决策者负责确定\textit{早期时间点},而LLM则作为强大的\textit{谣言检测器}。这种方法提供了第一个少样本EARD的解决方案,只需要训练一个轻量级的代理,而LLM则无需训练。在四个真实世界数据集上的广泛实验表明,我们的方法在LLMs上提升了性能,并在准确性和及时性方面超越了现有的EARD方法。
Summary / 总结
The research aims to develop an efficient early rumor detection method in data-scarce settings. It proposes a framework combining an autonomous agent for early time point determination and an LLM for rumor detection, which requires only the training of a lightweight agent and allows the LLM to remain training-free. The approach outperforms existing methods in accuracy and earliness across four real-world datasets.
研究旨在通过结合轻量级代理和大型语言模型(LLM),在数据稀缺环境下高效地进行早期谣言检测。提出的框架使用代理确定谣言分类的最早时间点,而LLM负责检测谣言。该方法仅需训练代理,因此计算效率高。在四个真实世界数据集上的实验表明,该方法在准确性和谣言检测的及时性方面均优于现有方法。
Moral Outrage Shapes Commitments Beyond Attention: Multimodal Moral Emotions on YouTube in Korea and the US
Authors: Seongchan Park, Jaehong Kim, Hyeonseung Kim, Heejin Bin, Sue Moon, Wonjae Lee
Venue: The Web Conference 2026
First: 2026-01-29T14:58:54+00:00 · Latest: 2026-01-29T14:58:54+00:00
Comments: Accepted at The Web Conference 2026. We release Korean and English multimodal moral emotion classifiers
Abstract
Understanding how media rhetoric shapes audience engagement is crucial in the attention economy. This study examines how moral emotional framing by mainstream news channels on YouTube influences user behavior across Korea and the United States. To capture the platform's multimodal nature, combining thumbnail images and video titles, we develop a multimodal moral emotion classifier by fine-tuning a vision-language model. The model is trained on human-annotated multimodal datasets in both languages and applied to approximately 400,000 videos from major news outlets. We analyze engagement levels including views, likes, and comments, representing increasing degrees of commitment. The results show that other-condemning rhetoric, that is, expressions of moral outrage that criticize others morally, consistently increases all forms of engagement across cultures, with effects ranging from passive viewing to active commenting. These findings suggest that moral outrage is a particularly effective emotional strategy, attracting not only attention but also active participation. We discuss concerns about the potential misuse of other-condemning rhetoric, as such practices may deepen polarization by reinforcing in-group and out-group divisions. To facilitate future research and ensure reproducibility, we publicly release our Korean and English multimodal moral emotion classifiers.
中文标题/摘要
标题:道德愤怒塑造超越注意力的承诺:韩国和美国YouTube上的多模态道德情绪
在注意力经济中,理解媒体修辞如何影响受众参与至关重要。本研究探讨了主流新闻频道在YouTube上以道德情感框架呈现内容如何影响韩国和美国用户的在线行为。为了捕捉平台的多模态特性,结合缩略图图像和视频标题,我们通过微调视觉语言模型开发了一个多模态道德情绪分类器。该模型在两种语言的人标注多模态数据集上进行训练,并应用于来自主要新闻机构的约40万条视频。我们分析了包括观看次数、点赞和评论在内的参与度水平,代表了不同程度的承诺。结果显示,其他谴责性道德愤怒表达,批评他人道德,无论在哪个文化中,都一致地增加了所有形式的参与度,从被动观看到积极评论。这些发现表明,道德愤怒是一种特别有效的心理策略,不仅吸引注意力,还吸引积极参与。我们讨论了其他谴责性修辞可能被滥用的问题,因为这种做法可能会通过强化群体内部和群体外部的分化来加深分歧。为了促进未来研究并确保可重复性,我们公开发布了韩语和英语多模态道德情绪分类器。
Summary / 总结
This study investigates how moral emotional framing by news channels on YouTube influences user engagement in Korea and the US. By developing a multimodal moral emotion classifier, the researchers analyzed approximately 400,000 videos from major news outlets, focusing on thumbnail images and video titles. The findings indicate that moral outrage consistently increases engagement, from passive views to active comments, suggesting its effectiveness in attracting both attention and active participation across cultures. The research highlights concerns about the potential for such rhetoric to deepen polarization. The classifiers used in the study are publicly available for future research.
该研究探讨了新闻频道在YouTube上进行道德情感框架如何影响韩国和美国的用户参与度。通过开发一个多模态道德情感分类器,研究人员分析了来自主要新闻机构的约40万条视频,重点关注缩略图和视频标题。研究发现,道德愤怒能够一致地增加参与度,从被动观看到积极评论,表明其在吸引注意力和参与方面的有效性。研究还指出,这种言论可能加剧 polarization。研究中使用的分类器已公开,以促进未来研究并确保可重复性。
Knowledge Vector Weakening: Efficient Training-free Unlearning for Large Vision-Language Models
Authors: Yejin Kim, Dongjun Hwang, Sungmin Cha, Junsuk Choe
First: 2026-01-29T14:41:01+00:00 · Latest: 2026-01-29T14:41:01+00:00
Abstract
Large Vision-Language Models (LVLMs) are widely adopted for their strong multimodal capabilities, yet they raise serious concerns such as privacy leakage and harmful content generation. Machine unlearning has emerged as a promising solution for removing the influence of specific data from trained models. However, existing approaches largely rely on gradient-based optimization, incurring substantial computational costs for large-scale LVLMs. To address this limitation, we propose Knowledge Vector Weakening (KVW), a training-free unlearning method that directly intervenes in the full model without gradient computation. KVW identifies knowledge vectors that are activated during the model's output generation on the forget set and progressively weakens their contributions, thereby preventing the model from exploiting undesirable knowledge. Experiments on the MLLMU and CLEAR benchmarks demonstrate that KVW achieves a stable forget-retain trade-off while significantly improving computational efficiency over gradient-based and LoRA-based unlearning methods.
中文标题/摘要
标题:知识向量削弱:大型视觉-语言模型的高效无训练卸载方法
大型视觉-语言模型(LVLMs)因其强大的多模态能力而被广泛采用,但它们引发了严重的隐私泄露和有害内容生成等问题。机器卸载已作为去除训练模型中特定数据影响的一种有前景的解决方案出现。然而,现有方法大多依赖于基于梯度的优化,对大规模LVLMs来说会带来巨大的计算成本。为解决这一局限,我们提出了知识向量削弱(KVW),这是一种无需训练的卸载方法,可以直接干预整个模型而无需进行梯度计算。KVW 识别出在忘记集上模型输出生成过程中被激活的知识向量,并逐步削弱它们的贡献,从而防止模型利用不希望的知识。在MLLMU和CLEAR基准上的实验表明,KVW 在稳定遗忘-保留权衡的同时,显著提高了计算效率,优于基于梯度和LoRA的卸载方法。
Summary / 总结
The paper addresses the challenge of removing specific data influence from large vision-language models (LVLMs) without retraining, a process known as machine unlearning. It introduces Knowledge Vector Weakening (KVW), a training-free method that directly modifies the model's knowledge vectors to weaken their contributions, thus preventing the model from using undesirable knowledge. Experiments show that KVW maintains a good balance between forgetting and retaining information while being more computationally efficient than gradient-based and LoRA-based methods.
研究旨在通过提出训练-free 的知识向量削弱(KVW)方法来解决大型视觉-语言模型中的隐私和内容问题。KVW 不需要梯度计算即可直接干预模型,通过识别并削弱对忘记集输出有不利影响的知识向量来削弱其贡献。实验表明,KVW 在保持遗忘和保留信息之间的良好平衡的同时,提高了计算效率,优于基于梯度和 LoRA 的方法。
Error Amplification Limits ANN-to-SNN Conversion in Continuous Control
Authors: Zijie Xu, Zihan Huang, Yiting Dong, Kang Chen, Wenxuan Liu, Zhaofei Yu
First: 2026-01-29T14:28:00+00:00 · Latest: 2026-01-29T14:28:00+00:00
Abstract
Spiking Neural Networks (SNNs) can achieve competitive performance by converting already existing well-trained Artificial Neural Networks (ANNs), avoiding further costly training. This property is particularly attractive in Reinforcement Learning (RL), where training through environment interaction is expensive and potentially unsafe. However, existing conversion methods perform poorly in continuous control, where suitable baselines are largely absent. We identify error amplification as the key cause: small action approximation errors become temporally correlated across decision steps, inducing cumulative state distribution shift and severe performance degradation. To address this issue, we propose Cross-Step Residual Potential Initialization (CRPI), a lightweight training-free mechanism that carries over residual membrane potentials across decision steps to suppress temporally correlated errors. Experiments on continuous control benchmarks with both vector and visual observations demonstrate that CRPI can be integrated into existing conversion pipelines and substantially recovers lost performance. Our results highlight continuous control as a critical and challenging benchmark for ANN-to-SNN conversion, where small errors can be strongly amplified and impact performance.
中文标题/摘要
标题:误差放大会限制ANN到SNN在连续控制中的转换
通过将已经训练好的人工神经网络(ANN)转换为已有的Spiking神经网络(SNN),SNN可以在不进行进一步昂贵训练的情况下实现竞争性性能。这一特性在强化学习(RL)中尤其具有吸引力,因为在RL中通过环境交互进行训练既昂贵又可能不安全。然而,现有的转换方法在连续控制中表现不佳,因为缺乏合适的基线。我们确定误差放大会是主要原因:小的动作近似误差在决策步骤之间变得时序相关,导致状态分布的累积变化和严重的性能下降。为了解决这一问题,我们提出了跨步骤残余膜电位初始化(CRPI),这是一种轻量级的无需训练机制,可以在决策步骤之间传递残余膜电位以抑制时序相关误差。在具有向量和视觉观察的连续控制基准测试中,CRPI可以集成到现有的转换管道中,并显著恢复了丢失的性能。我们的结果强调了连续控制是ANN到SNN转换的一个关键且具有挑战性的基准,其中小的误差可以被强烈放大并影响性能。
Summary / 总结
The paper investigates the limitations of converting ANNs to SNNs for continuous control tasks, where existing methods suffer due to error amplification. The authors propose CRPI, a lightweight mechanism that suppresses temporally correlated errors by carrying over residual membrane potentials across decision steps. Experiments show that CRPI can significantly recover performance on continuous control benchmarks with both vector and visual observations.
该论文研究了在连续控制任务中将ANN转换为SNN时存在的限制,现有方法往往由于误差放大而失败。作者提出了一种轻量级机制CRPI,通过在决策步骤之间传递残余膜电位来抑制时间相关误差。实验表明,CRPI可以显著提高具有向量和视觉观察的连续控制基准上的性能。
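Note: the error-accumulation argument in the abstract above can be illustrated with a toy example. The sketch below is not the authors' CRPI implementation. It assumes a single soft-reset integrate-and-fire neuron, rate decoding, and a constant input, and the names `if_neuron_step` and `cumulative_error` are hypothetical. It compares resetting the membrane potential to zero at every decision step with carrying the residual potential over to the next step.

```python
def if_neuron_step(activation, timesteps, v_init=0.0, threshold=1.0):
    """Rate-code one activation with a soft-reset integrate-and-fire neuron.

    Assumes 0 <= activation <= threshold, so at most one spike per timestep.
    Returns (decoded_value, residual_membrane_potential).
    """
    v = v_init
    spikes = 0
    for _ in range(timesteps):
        v += activation          # integrate the constant input current
        if v >= threshold:       # fire and subtract the threshold (soft reset)
            v -= threshold
            spikes += 1
    return spikes * threshold / timesteps, v


def cumulative_error(activations, timesteps, carry_over):
    """Sum of (true activation - decoded value) over all decision steps."""
    v = 0.0
    total_error = 0.0
    for a in activations:
        v_init = v if carry_over else 0.0  # toy stand-in for residual carry-over
        decoded, v = if_neuron_step(a, timesteps, v_init=v_init)
        total_error += a - decoded
    return total_error


activations = [0.33] * 50   # hypothetical constant action component
T = 4                       # simulation timesteps per decision step

err_reset = cumulative_error(activations, T, carry_over=False)
err_carry = cumulative_error(activations, T, carry_over=True)
print(f"reset each step: {err_reset:.3f}, carry-over: {err_carry:.3f}")
```

In this toy setting, resetting discards the same sub-threshold charge at every step, so the approximation error has the same sign each time and adds up linearly. With carry-over, the leftover charge is reused, and the cumulative error stays below `threshold / T`.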
Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation
Authors: Qian-Wei Wang, Yaguang Song, Shu-Tao Xia
First: 2025-06-03T12:48:54+00:00 · Latest: 2026-01-29T13:56:19+00:00
Abstract
In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA, and GPT-4V, leveraging these models to replace time-consuming manual annotation and enable annotation-free training has become a promising research direction. This paper studies learning from noisy partial labels generated by pre-trained VLMs and proposes a collaborative consistency regularization (Co-Reg) framework. Unlike symmetric noise commonly assumed in traditional noisy label learning, VLM-generated noise is instance-dependent and reflects the intrinsic biases of pre-trained models, posing greater challenges. To address this issue, we jointly train two neural networks to perform collaborative label purification via a co-pseudo-labeling mechanism, while enforcing consistency regularization in both label and feature representation spaces. In addition, multiple anti-overfitting strategies are introduced, including alternating optimization of contrastive representations and pseudo-labels, as well as maintaining class prototypes in a shared feature space. The proposed method can further incorporate few-shot manually annotated labels for performance enhancement. Extensive experiments under various settings demonstrate the effectiveness of our approach and highlight the potential of integrating weakly supervised learning into the knowledge distillation of pre-trained models.
中文标题/摘要
标题:弱监督学习与VLM精炼的桥梁:基于预训练VLM的嘈杂部分标签学习以实现高效下游适应
在嘈杂部分标签学习(NPLL)的背景下,每个训练样本都与多个嘈杂注释者标注的一组候选标签相关联。随着高性能预训练视觉-语言模型(VLMs)如CLIP、LLaVA和GPT-4V的出现,利用这些模型替代耗时的手动标注并实现无标注训练已成为一个有前景的研究方向。本文研究了从预训练VLM生成的嘈杂部分标签中学习,并提出了一种协作一致性正则化(Co-Reg)框架。与传统嘈杂标签学习中假设的对称噪声不同,VLM生成的噪声是实例相关的,并反映了预训练模型的固有偏差,提出了更大的挑战。为了解决这一问题,我们联合训练两个神经网络,通过共伪标签机制进行协作标签净化,同时在标签和特征表示空间中强制执行一致性正则化。此外,还引入了多种抗过拟合策略,包括对比表示和伪标签的交替优化,以及在共享特征空间中保持类原型。所提出的方法还可以进一步结合少量的手动标注标签以提高性能。在各种设置下的广泛实验表明了我们方法的有效性,并突显了将弱监督学习整合到预训练模型的知识精炼中的潜力。
Summary / 总结
This paper addresses noisy partial label learning (NPLL) where each training sample has multiple candidate labels from noisy annotators. It proposes a collaborative consistency regularization (Co-Reg) framework to purify labels using pre-trained vision-language models (VLMs) like CLIP and LLaVA. The method involves training two neural networks collaboratively and enforcing consistency regularization in both label and feature spaces. Anti-overfitting strategies are also introduced. Experiments show the effectiveness of the proposed method and its potential to integrate weakly supervised learning into the knowledge distillation of pre-trained models.
该论文针对每个训练样本有多名标注者提供的多个候选标签的噪声部分标签学习(NPLL)问题,提出了一种协作一致性正则化(Co-Reg)框架,利用预训练的视觉-语言模型(如CLIP、LLaVA等)来净化标签。方法包括两个神经网络的协作训练,并在标签和特征空间中施加一致性正则化,同时引入了防止过拟合的策略。实验表明该方法在减少手动标注的情况下,能有效提升下游任务的适应性。
Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
Authors: Xinglin Wang, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
First: 2026-01-29T13:18:36+00:00 · Latest: 2026-01-29T13:18:36+00:00
Comments: preprint
Abstract
Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.
中文标题/摘要
标题:不要浪费你的部署:回收搜索经验以实现高效的测试时扩展
测试时扩展通过分配额外的推理计算资源来扩展解决方案空间,从而增强大型语言模型的推理能力。然而,现有的搜索策略通常将部署视为一次性样本,其中在每次试验后有价值的中间见解被有效丢弃。这种系统性的记忆缺失导致了巨大的计算冗余,因为模型在广泛的尝试中反复重新推导出已发现的结论并重新访问已知的死胡同。为了弥合这一差距,我们提出了**回收搜索经验(RSE)**,这是一种无需训练的自我引导策略,将测试时搜索从一系列孤立的试验转变为累积过程。通过积极地将原始轨迹提炼为共享的经验库,RSE 使中间结论的正向回收能够缩短冗余推导,并使失败模式的负向回收能够修剪遇到的死胡同。理论上,我们提供了一种分析,正式化了RSE的效率增益,并验证了它在解决复杂推理任务时比独立采样具有优势。实验上,在HMMT24、HMMT25、IMO-Bench和HLE上进行的大量实验表明,RSE 以与强基线相当的计算成本实现了最先进的扩展效率。
Summary / 总结
The paper addresses the inefficiency of existing test-time scaling methods for Large Language Models, which discard valuable intermediate insights after each trial. It introduces Recycling Search Experience (RSE), a method that recycles search experiences to reduce redundant computations and improve efficiency. Experiments on HMMT24, HMMT25, IMO-Bench, and HLE demonstrate that RSE outperforms strong baselines with similar computational costs, achieving state-of-the-art scaling efficiency.
论文针对现有大型语言模型测试时扩展方法在每次试验后丢弃有价值中间洞察的问题,提出了一种自我引导的策略——回收搜索经验(RSE),以减少冗余计算并提高效率。实验结果显示,RSE 在 HMMT24、HMMT25、IMO-Bench 和 HLE 上的表现优于强基线,且具有相似的计算成本,实现了最先进的扩展效率。
Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification
Authors: Dexuan Ding, Ciyuan Peng, Endrowednes Kuantama, Jingcai Guo, Jia Wu, Jian Yang, Amin Beheshti, Ming-Hsuan Yang, Yuankai Qi
First: 2026-01-29T13:05:46+00:00 · Latest: 2026-01-29T13:05:46+00:00
Abstract
High-dimensional structural MRI (sMRI) images are widely used for Alzheimer's Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or training-free feature extraction using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer's disease benchmarks demonstrate that MVSC performs favourably on both binary and multi-class classification tasks compared with state-of-the-art methods.
中文标题/摘要
标题:阿尔茨海默病分类的多模态视觉代理压缩
高维结构磁共振成像(sMRI)图像广泛用于阿尔茨海默病(AD)诊断。大多数现有的sMRI表示学习方法依赖于3D架构(例如,3D CNNs)、切片级特征提取与后期聚合,或使用2D基础模型(例如,DINO)进行无训练特征提取。然而,这三种范式分别面临高计算成本、跨切片关系丢失和提取判别特征能力有限的问题。为了解决这些挑战,我们提出了多模态视觉代理压缩(MVSC)。它学习将大型3D sMRI体素压缩和适应为紧凑的2D特征,称为视觉代理,这些特征与冻结的2D基础模型更好地对齐,以提取最终AD分类的强大表示。MVSC有两个关键组件:一个体积上下文编码器,在文本引导下捕捉全局跨切片上下文,以及一个增强切片融合模块,在文本增强、块级方式下聚合切片级信息。在三个大规模阿尔茨海默病基准上的广泛实验表明,与最先进的方法相比,我们的MVSC在二分类和多分类任务中均表现良好。
Summary / 总结
The research aims to improve the efficiency and accuracy of Alzheimer's Disease (AD) diagnosis using structural MRI (sMRI) images. MVSC is proposed to compress 3D sMRI volumes into 2D visual surrogates, which are better aligned with 2D foundation models for feature extraction. MVSC consists of a Volume Context Encoder and an Adaptive Slice Fusion module. Experiments show that MVSC outperforms existing methods on both binary and multi-class AD classification tasks.
论文针对高维结构MRI图像在阿尔茨海默病诊断中的挑战,如高计算成本、跨切片关系丢失和特征提取能力有限。提出了一种多模态视觉代理压缩(MVSC)方法,将3D MRI体积压缩为2D视觉代理,并通过文本指导增强这些代理以更好地与2D基础模型对齐。实验表明,MVSC在二分类和多分类任务中均优于现有方法。
Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching
Authors: Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh
First: 2026-01-29T12:58:42+00:00 · Latest: 2026-01-29T12:58:42+00:00
Abstract
Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate the negative log-density of an embedding as a proxy for epistemic uncertainty, where low-density regions signify model ignorance. The proposed method, REPVLM, computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.
中文标题/摘要
标题:通过黎曼流匹配计算预训练VLMs的 epistemic 不确定性
视觉-语言模型(VLMs)通常具有确定性,缺乏内在机制来量化epistemic不确定性,这反映了模型对其自身表示的无知或知识不足。我们从理论上将嵌入的负对数密度作为epistemic不确定性的一个代理,其中低密度区域表示模型的无知。所提出的方法REPVLM使用黎曼流匹配计算VLM嵌入在超球面流形上的概率密度。我们实验证明,REPVLM在不确定性与预测误差之间的相关性接近完美,显著优于现有基线。除了分类之外,我们还证明该模型还提供了一种可扩展的用于检测异常分布和自动数据整理的度量标准。
Summary / 总结
The research aims to address the lack of epistemic uncertainty quantification in Vision-Language Models (VLMs) by proposing REPVLM, which uses Riemannian Flow Matching to compute the probability density on the hyperspherical manifold of VLM embeddings. The method correlates uncertainty with prediction error effectively and outperforms existing baselines. Additionally, REPVLM provides a scalable metric for out-of-distribution detection and data curation.
研究旨在通过提出REPVLM方法来解决视觉-语言模型(VLMs)中缺乏表征不确定性量化的问题,该方法使用黎曼流匹配在VLM嵌入的超球面流形上计算概率密度。该方法在不确定性与预测误差之间的相关性上达到了近乎完美的效果,优于现有基线。此外,REPVLM还提供了一种适用于分类之外任务的大规模异常检测和数据自动整理的度量标准。
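Note: the idea of using negative log-density on the unit sphere as an uncertainty score can be sketched in a few lines. The sketch below does not use Riemannian Flow Matching. As a deliberately simpler stand-in, it estimates density with an unnormalized von Mises-Fisher kernel average over reference embeddings. The function names, the concentration `kappa`, and the 3-D toy embeddings are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def neg_log_density(query, references, kappa=10.0):
    """Negative log of an unnormalized vMF kernel density estimate.

    The vMF normalizing constant is the same for every query, so it is
    omitted; only the ranking of scores is meaningful here.
    """
    q = normalize(query)
    log_kernels = [
        kappa * sum(a * b for a, b in zip(q, normalize(r)))
        for r in references
    ]
    m = max(log_kernels)  # log-sum-exp trick for numerical stability
    log_mean = m + math.log(sum(math.exp(l - m) for l in log_kernels) / len(log_kernels))
    return -log_mean


# Hypothetical "training" embeddings clustered around the +x direction.
references = [[1.0, 0.1, 0.0], [1.0, -0.1, 0.05], [0.9, 0.0, 0.1]]

score_near = neg_log_density([1.0, 0.0, 0.0], references)   # inside the cluster
score_far = neg_log_density([-1.0, 0.0, 0.0], references)   # opposite side of the sphere
print(score_near, score_far)
```

A query in a low-density region of the sphere gets the higher score, which is the behaviour the abstract attributes to epistemic uncertainty. A learned density model would replace the kernel estimate in practice.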
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
Authors: Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng
First: 2026-01-29T12:43:02+00:00 · Latest: 2026-01-29T12:43:02+00:00
Abstract
The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in an end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we construct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, since different domains require various output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.
中文标题/摘要
标题:OCRVerse:端到端视觉语言模型中的全方位OCR
大型视觉语言模型的发展推动了对管理和应用大量多模态数据的需求,使得从视觉图像中提取信息的OCR技术越来越受欢迎。然而,现有的OCR方法主要集中在识别图像或扫描文档中的文本元素(文本中心的OCR),忽视了从视觉信息密集型图像源(视觉中心的OCR)中识别视觉元素,例如图表、网页和科学图表。实际上,这些视觉信息密集型图像在互联网上广泛存在,并具有重要的现实应用价值,如数据可视化和网页分析。在本技术报告中,我们提出了OCRVerse,这是一种端到端的全方位OCR方法,能够统一处理文本中心的OCR和视觉中心的OCR。为此,我们构建了全面的数据工程,涵盖了广泛的文本中心文档,如报纸、杂志和书籍,以及视觉中心的渲染组合,包括图表、网页和科学图表。此外,我们为OCRVerse提出了两阶段的SFT-RL多域训练方法。SFT直接混合跨域数据进行训练和建立初始领域知识,而RL则专注于为每个领域的特性设计个性化的奖励策略。具体而言,由于不同领域需要不同的输出格式和预期输出,我们在RL阶段提供了足够的灵活性,为每个领域定制灵活的奖励信号,从而提高跨域融合并避免数据冲突。实验结果表明了OCRVerse的有效性,其在文本中心和视觉中心数据类型上的表现具有竞争力,甚至可以与大规模开源和闭源模型相媲美。
Summary / 总结
The paper introduces OCRVerse, a holistic OCR method that integrates text-centric and vision-centric OCR in an end-to-end vision-language model. It addresses the limitation of existing OCR methods by constructing a comprehensive dataset and proposing a two-stage SFT-RL training method. The experimental results show that OCRVerse performs competitively across different types of OCR tasks, matching the performance of large-scale models.
论文提出了OCRVerse,这是一种将文本中心OCR和视觉中心OCR整合到端到端视觉语言模型中的综合方法。该方法通过构建综合数据集和采用两阶段SFT-RL训练方法来解决现有OCR方法的局限性。实验结果表明,OCRVerse在各种OCR任务中表现出色,特别是在文本中心和视觉中心数据类型上取得了竞争力的结果。
PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization
Authors: Songhan Jiang, Fengchun Liu, Ziyue Wang, Linghan Cai, Yongbing Zhang
First: 2026-01-29T12:21:16+00:00 · Latest: 2026-01-29T12:21:16+00:00
Abstract
Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly outputting conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on this dataset, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at https://github.com/cyclexfy/PathReasoner-R1.
中文标题/摘要
标题:PathReasoner-R1: 通过知识引导的策略优化将结构化推理融入病理视觉语言模型
视觉语言模型(VLMs)正在推动计算病理学的发展,具备卓越的视觉理解能力。然而,当前系统往往直接输出结论而缺乏可验证的证据链推理,这严重限制了临床信任并阻碍了专家错误的纠正。为解决这些问题,我们构建了PathReasoner,这是首个大规模的全切片图像(WSI)推理数据集。不同于以往依赖未经验证的蒸馏工作,我们开发了一个严格的知识引导生成管道。通过利用医学知识图谱,我们明确地将结构化的病理发现和临床推理与诊断对齐,生成了超过20000个高质量的指导样本。基于该数据库,我们提出了PathReasoner-R1,该模型结合了轨迹掩蔽监督微调与推理导向的强化学习,以植入结构化的推理链能力。为了确保医学严谨性,我们设计了一个知识感知的多粒度奖励函数,其中包括严格与知识图谱对齐的实体奖励机制。这有效地引导模型优化逻辑一致性而非仅仅匹配结果,从而增强其鲁棒性。大量实验表明,PathReasoner-R1 在PathReasoner 和公共基准测试中均实现了最先进的性能,为病理模型提供了透明且临床相关的推理能力。数据集和代码可在 https://github.com/cyclexfy/PathReasoner-R1 获取。
Summary / 总结
The research aims to enhance the reasoning capabilities of Vision-Language Models (VLMs) in computational pathology by providing verifiable evidence-linked reasoning. To achieve this, the authors developed PathReasoner, a large-scale dataset of whole-slide image reasoning, and PathReasoner-R1, which combines trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning. The model uses a knowledge-aware reward function to ensure logical consistency, leading to state-of-the-art performance on both PathReasoner and public benchmarks. This work enhances the robustness and clinical trustworthiness of pathology models.
研究旨在通过提供可验证的证据链推理来增强视觉-语言模型(VLM)在计算病理学中的推理能力。为此,作者开发了PathReasoner,这是一个大规模的全切片图像推理数据集,以及PathReasoner-R1,该模型结合了轨迹掩蔽监督微调和以推理为导向的强化学习。模型使用知识感知的奖励函数以确保逻辑一致性,从而在PathReasoner和公共基准测试上实现了最先进的性能。这项工作增强了病理模型的稳健性和临床可信度。
WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models
Authors: Zijin Yang, Yu Sun, Kejiang Chen, Jiawei Zhao, Jun Jiang, Weiming Zhang, Nenghai Yu
First: 2026-01-29T12:14:32+00:00 · Latest: 2026-01-29T12:14:32+00:00
Abstract
Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.
中文标题/摘要
标题:WMVLM:通过视觉语言模型评估扩散模型图像水印
数字水印对于保护来自扩散模型的生成图像至关重要。准确的水印评估对于算法开发至关重要,但现有方法存在显著局限性:缺乏统一框架处理残差和语义水印,结果缺乏可解释性,忽视了全面的安全考虑,并且经常使用不合适的语义水印度量标准。为解决这些差距,我们提出了WMVLM,这是首个通过视觉语言模型(VLMs)统一且可解释的扩散模型图像水印评估框架。我们重新定义了每种水印类型的质量和安全性度量标准:残差水印通过艺术强度和擦除抗性进行评估,而语义水印则通过潜在分布偏移进行评估。此外,我们引入了三阶段训练策略,逐步使模型实现分类、评分和可解释的文本生成。实验表明,WMVLM在数据集、扩散模型和水印方法之间具有强大的泛化能力,优于最先进的VLMs。
Summary / 总结
The research aims to improve the evaluation of digital watermarks in images generated by diffusion models. It introduces WMVLM, a unified and interpretable framework using vision-language models to evaluate both residual and semantic watermarks. WMVLM redefines quality and security metrics, and employs a three-stage training strategy. Experiments demonstrate that WMVLM outperforms existing methods with strong generalization capabilities across various datasets, diffusion models, and watermarking techniques.
研究旨在改进对由扩散模型生成的图像中数字水印的评估,解决现有方法中存在的缺乏统一框架和解释性等问题。WMVLM 是一个使用视觉语言模型的新型统一评估框架,根据残留水印的伪影强度和擦除抗性以及语义水印的潜在分布变化来进行评估。该框架还包含一个三阶段训练策略。实验表明,WMVLM 在各种数据集、扩散模型和水印方法上的性能更优且具有较强的泛化能力。
Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting
Authors: Sangoh Lee, Sangwoo Mo, Wook-Shin Han
First: 2025-12-23T03:13:39+00:00 · Latest: 2026-01-29T12:02:16+00:00
Comments: Project page with videos and code: https://vap-project.github.io/
Abstract
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
中文标题/摘要
标题:Bring My Cup!使用视觉注意提示个性化视觉-语言-动作模型
尽管视觉-语言-动作(VLA)模型在通用指令上表现出色,但在处理个性化命令如“bring my cup”时却遇到困难,其中机器人必须在视觉上相似的对象中执行特定实例的操作。我们研究了操作个人物品的场景,在这种场景中,VLA 必须使用少量参考图像识别并控制训练期间未见过的用户特定对象。为了解决这一挑战,我们提出了视觉注意提示(VAP),这是一种简单而有效的无需训练的感知适配器,为冻结的VLA提供自上而下的选择性注意。VAP 将参考图像视为非参数化的视觉记忆,通过开放式词汇检测和基于嵌入的匹配将个人对象定位在场景中,然后通过突出显示对象并重写指令将这种定位作为视觉提示注入。我们构建了两个模拟基准 Personalized-SIMPLER 和 Personalized-VLABench,以及一个真实世界的桌面基准,以评估跨多个机器人和任务的个性化操作。实验表明,VAP 在成功率和正确对象操作方面始终优于通用策略和标记学习基线,有助于弥合语义理解和实例级控制之间的差距。
Summary / 总结
The research addresses the challenge of personalizing vision-language-action models to handle specific commands like 'bring my cup,' where the robot must identify and manipulate a specific object among similar ones. It introduces Visual Attentive Prompting (VAP), a training-free method that enhances frozen VLA models with top-down selective attention using reference images as non-parametric visual memory. Experiments demonstrate that VAP improves success rates and correct-object manipulation compared to generic policies and token-learning baselines across various robots and tasks.
研究解决了使视觉-语言-动作模型处理特定命令如“拿我的杯子”的问题,其中机器人需要在相似的物体中识别并操作特定的一个。研究引入了视觉注意提示(VAP),这是一种无需训练的方法,利用参考图像引导模型的注意力,相比通用策略和基于标记的学习基线,提高了成功率和正确的物体操作。
Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening
Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
First: 2026-01-29T12:01:53+00:00 · Latest: 2026-01-29T12:01:53+00:00
Abstract
Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
中文标题/摘要
标题:可扩展的功率采样:通过分布锐化解锁LLMs无需训练的高效推理
强化学习(RL)后训练是提高大型语言模型(LLMs)推理性能的主要方法,但越来越多的证据表明,其主要增益来自于分布锐化而非新能力的获取。最近的研究表明,使用马尔可夫链蒙特卡洛(MCMC)从LLMs的幂分布中采样可以恢复与RL后训练相当的性能,且无需依赖外部奖励;然而,MCMC的高计算成本使其在广泛应用中不可行。在本文中,我们提出了一种理论依据的方法,以消除迭代MCMC的需要。我们推导出一种新的公式,表明全局幂分布可以由一个按令牌缩放的低温分布近似,其中缩放因子捕捉未来轨迹的质量。利用这一洞察,我们引入了一种无需训练和验证者的算法,以自回归方式锐化基础模型的生成分布。实验上,我们在四个LLMs上对数学、问答和代码任务进行了评估,结果显示我们的方法在不依赖任何外部奖励的情况下,能够匹配或超越单次GRPO,同时将推理延迟降低超过10倍,相比基于MCMC的采样方法。
Summary / 总结
This work addresses the high computational cost of using Markov chain Monte Carlo (MCMC) for distribution sharpening in large language models (LLMs), proposing a novel method that avoids iterative MCMC. The method approximates the global power distribution with a token-level scaled low-temperature distribution, which is used to sharpen the base model's generative distribution without training or external rewards. Experiments on math, QA, and code tasks across four LLMs demonstrate that this approach matches or surpasses one-shot GRPO performance and reduces inference latency by over 10x compared to MCMC-based sampling.
该研究旨在解决大规模语言模型通过强化学习后训练提高推理性能时的高计算成本问题,主要归因于分布锐化。作者提出了一种新方法,通过使用标记级别的缩放低温分布来近似全局功率分布,从而避免了迭代的马尔可夫链蒙特卡洛采样。实验结果显示,该方法在数学、问答和代码任务上与单次GRPO的性能相当或更优,同时将推理延迟降低了超过10倍。
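Note: the link between a power distribution and low-temperature sampling mentioned above can be checked directly for a single categorical distribution. Raising softmax probabilities to a power alpha and renormalizing is the same as applying softmax at temperature 1/alpha. The sketch below verifies only this per-token identity. It does not reproduce the paper's sequence-level scaling factor, and the logits and the value of `alpha` are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def power_sharpen(probs, alpha):
    """Raise each probability to the power alpha and renormalize."""
    powered = [p ** alpha for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical next-token logits
alpha = 4.0                      # hypothetical sharpening exponent

base = softmax(logits)
sharpened = power_sharpen(base, alpha)
low_temp = softmax(logits, temperature=1.0 / alpha)

max_gap = max(abs(a - b) for a, b in zip(sharpened, low_temp))
print(f"max difference between p^alpha and T=1/alpha: {max_gap:.2e}")
```

At the level of whole sequences, the power distribution is not the product of these per-token sharpened distributions, because the normalizer depends on what follows each token. That gap is why the abstract speaks of a token-level scaling factor that captures future trajectory quality.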
DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Authors: Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal
First: 2025-10-23T17:42:14+00:00 · Latest: 2026-01-29T11:37:31+00:00
Abstract
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
中文标题/摘要
标题:DyPE:动态位置外推在超高清扩散中的应用
扩散变换器模型可以生成具有非凡保真度和细节的图像,但由于自注意力机制与图像标记数量的平方级扩展,训练它们在超高清分辨率上仍然极其昂贵。在本文中,我们引入了一种名为动态位置外推(DyPE)的新型、无需训练的方法,该方法使预训练的扩散变换器能够在其训练数据之外生成高得多分辨率的图像,且无需额外的采样成本。DyPE 利用了扩散过程固有的频谱进展,其中低频结构早期收敛,而高频结构需要更多步骤才能解决。具体而言,DyPE 在每次扩散步骤中动态调整模型的位置编码,使其频谱与生成过程的当前阶段相匹配。这种方法使我们能够生成远超训练分辨率的图像,例如,使用 FLUX 生成 1600 万像素的图像。在多个基准测试中,DyPE 一致地提高了性能,并在超高清图像生成中达到了最先进的保真度,尤其是在更高分辨率下,性能提升更为显著。项目页面可在 https://noamissachar.github.io/DyPE/ 获取。
Summary / 总结
The research motivation is to address the high computational cost of training diffusion models at ultra-high resolutions. DyPE, a training-free method, enables pre-trained diffusion transformers to generate images at resolutions far beyond their training data by dynamically adjusting the positional encoding at each diffusion step. Key experimental findings show that DyPE can generate images far above the training resolution (e.g., 16 million pixels using FLUX) and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming more pronounced at higher resolutions.
DyPE 是一种无需训练的方法,通过在扩散过程中动态调整位置编码,使预训练的扩散变换器能够在超高清分辨率下生成图像。该方法利用扩散过程中的频谱进展,使模型的位置编码频谱与当前生成阶段相匹配,从而能够在远超训练数据的分辨率下合成图像。DyPE 显著提高了性能,并在超高清分辨率图像生成中达到了最先进的保真度,尤其是在更高分辨率下效果更为显著。
Bi-Anchor Interpolation Solver for Accelerating Generative Modeling
Authors: Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen
First: 2026-01-29T10:59:36+00:00 · Latest: 2026-01-29T10:59:36+00:00
Abstract
Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
中文标题/摘要
标题:双锚点内插求解器加速生成建模
流动匹配(FM)模型已成为高保真合成的领先范式。然而,它们依赖于迭代常微分方程(ODE)求解,这造成了显著的延迟瓶颈。现有解决方案面临两难境地:无训练求解器在低神经网络评估(NFE)下性能严重下降,而基于训练的一或几步生成方法则会带来高昂的训练成本,并缺乏即插即用的灵活性。为弥合这一差距,我们提出了双锚点内插求解器(BA-求解器)。BA-求解器保留了标准无训练求解器的灵活性,同时通过引入一个轻量级的SideNet(1-2%主干大小)与冻结的主干结合,实现了显著加速。具体而言,我们的方法基于两个协同组件:1)双向时间感知,其中SideNet学习在不重新训练重的主干的情况下,近似未来和历史速度;2)双锚点速度集成,利用SideNet与两个锚点速度,高效近似批量高阶积分的中间速度。通过利用主干建立高精度的“锚点”并使用SideNet细化轨迹,BA-求解器能够以最小化误差的方式使用大时间间隔。在ImageNet-256²上的实验结果表明,BA-求解器仅在10个NFE下就能达到与100多个NFE的欧拉求解器相当的生成质量,并且在仅5个NFE下仍能保持高保真度,且几乎不产生训练成本。此外,BA-求解器确保与现有生成管道无缝集成,便于下游任务如图像编辑。
Summary / 总结
The paper addresses the latency issue in Flow Matching (FM) models by proposing the Bi-Anchor Interpolation Solver (BA-solver). This method pairs a lightweight SideNet (1-2% of the backbone size) with a frozen backbone to achieve significant acceleration. The SideNet uses bidirectional temporal perception to approximate both future and historical velocities, and bi-anchor velocity integration uses two backbone anchor velocities to efficiently approximate intermediate velocities. Experiments on ImageNet-256^2 show that BA-solver reaches generation quality comparable to an Euler solver using 100+ Neural Function Evaluations (NFEs) with only 10 NFEs, and maintains high fidelity with as few as 5 NFEs, without incurring significant training costs. Additionally, BA-solver is plug-and-play, making it easy to integrate with existing generative pipelines for tasks like image editing.
论文通过提出双锚点插值求解器(BA-solver)解决了Flow Matching模型中的延迟问题。BA-solver结合轻量级的SideNet和冻结的主干网络,实现了显著的加速。该方法利用双向时间感知来近似速度,并利用双锚点速度集成高效计算中间速度,从而在最小化误差的同时允许较大的时间间隔。实验结果表明,BA-solver在仅需10次神经网络评估(NFEs)的情况下就能达到与100多次NFEs欧拉求解器相当的生成质量,并且在最少5次NFEs时仍能保持高保真度,且无需显著的训练成本。此外,BA-solver与现有的生成管道无缝集成,有助于下游任务如图像编辑。
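To make the NFE trade-off in this entry concrete, the following minimal sketch integrates a toy ODE with a plain fixed-step Euler solver and counts velocity evaluations. It is not the BA-solver and involves no SideNet: the velocity field `toy_velocity`, the helper `euler_sample`, and the chosen step counts are illustrative assumptions, with the toy field standing in for a learned FM backbone.

```python
import math

def toy_velocity(x: float, t: float) -> float:
    """Hypothetical velocity field dx/dt = -x (stand-in for a learned backbone)."""
    return -x

def euler_sample(x0: float, num_steps: int) -> tuple[float, int]:
    """Integrate from t=0 to t=1 with fixed-step Euler; return (x1, NFE count)."""
    x, nfe = x0, 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * toy_velocity(x, t)  # one velocity evaluation per step
        nfe += 1
    return x, nfe

exact = math.exp(-1.0)  # closed-form solution of dx/dt = -x at t=1 for x0=1
errors = {}
for steps in (5, 10, 100):
    x1, nfe = euler_sample(1.0, steps)
    errors[nfe] = abs(x1 - exact)
    print(f"NFE={nfe:3d}  x1={x1:.5f}  abs_error={errors[nfe]:.5f}")
```

In this toy setting the error shrinks as the number of evaluations grows, which is the cost that low-NFE solvers try to avoid paying.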
On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
Authors: Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu
First: 2026-01-29T10:47:21+00:00 · Latest: 2026-01-29T10:47:21+00:00
Comments: Under Review, 20 pages
Abstract
Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
中文标题/摘要
标题:大规模视觉-语言模型在视觉标记压缩下的对抗鲁棒性研究
视觉标记压缩被广泛用于通过剪枝或合并视觉标记来加速大规模视觉-语言模型(LVLMs),但其对抗鲁棒性尚未被探索。我们表明,现有的基于编码器的攻击会显著高估压缩LVLMs的鲁棒性,这是因为优化与推理之间的不匹配:扰动是在完整标记表示上进行优化,而推理则是通过标记压缩瓶颈进行的。为了解决这一差距,我们提出了压缩对齐攻击(CAGE),该攻击在不假设访问部署的压缩机制或其标记预算的情况下,将扰动优化与压缩推理对齐。CAGE 结合了(i)预期特征破坏,将扰动集中在那些在可能的预算范围内可能存活的标记上,以及(ii)排名失真对齐,主动将标记扰动与排名分数对齐,以促进高失真证据的保留。在多种代表性的即插即用压缩机制和数据集上,我们的结果表明,CAGE 一致地实现了比基线更低的鲁棒准确性。这项工作强调了忽略压缩的鲁棒性评估可能会过于乐观,呼吁对高效的LVLMs进行压缩感知的安全评估和防御。
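The idea of concentrating distortion on tokens likely to survive across plausible budgets can be illustrated with a small sketch. This is not the CAGE objective: it assumes, purely for illustration, that compression is top-k pruning by a rank score and that the attacker places a uniform prior over a few candidate budgets. The names `survival_probabilities` and `expected_disruption`, and all numbers, are hypothetical.

```python
def survival_probabilities(rank_scores, budgets):
    """Fraction of candidate budgets under which each token is kept by top-k pruning."""
    order = sorted(range(len(rank_scores)), key=lambda i: rank_scores[i], reverse=True)
    position = {tok: pos for pos, tok in enumerate(order)}  # 0 = highest-ranked token
    return [
        sum(1 for k in budgets if position[i] < k) / len(budgets)
        for i in range(len(rank_scores))
    ]

def expected_disruption(per_token_distortion, survive_prob):
    """Survival-weighted distortion: distortion on tokens likely to be pruned counts less."""
    return sum(d * p for d, p in zip(per_token_distortion, survive_prob))

scores = [0.9, 0.1, 0.5, 0.7]  # hypothetical rank scores for 4 visual tokens
budgets = [1, 2, 3]            # hypothetical token budgets (assumed equally likely)
probs = survival_probabilities(scores, budgets)
print(probs)
print(expected_disruption([1.0, 1.0, 1.0, 1.0], probs))
```

Under these assumptions, distortion spent on the lowest-ranked token contributes nothing, because that token is pruned under every candidate budget.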
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
Authors: Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, Ju Fan
First: 2026-01-29T10:06:52+00:00 · Latest: 2026-01-29T10:06:52+00:00
Abstract
Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design.
中文标题/摘要
标题:ETS:能量引导的测试时缩放以实现无需训练的RL对齐
语言模型的强化学习(RL)后训练对齐在实践中有效,但代价高昂且不稳定,因为其复杂的训练过程。为了解决这个问题,我们提出了一种无需训练的推理方法,可以直接从最优RL策略中采样。应用于掩码语言模型(MLM)的转换概率包括一个参考策略模型和一个能量项。在此基础上,我们的算法能量引导的测试时缩放(ETS)通过在线蒙特卡洛估计关键的能量项,并具有可证明的收敛速率。此外,为了确保实际效率,ETS 利用现代加速框架和定制的重要性采样估计器,大幅减少了推理延迟,同时可证明地保持了采样质量。在包括自回归模型和扩散语言模型的MLM(涵盖推理、编程和科学基准)实验中,我们的ETS始终提高了生成质量,验证了其有效性和设计。
Summary / 总结
The research aims to address the cost and instability of RL post-training alignment for language models by proposing a training-free inference method, Energy-Guided Test-Time Scaling (ETS), that samples directly from the optimal RL policy. The transition probability for Masked Language Modeling is expressed as a reference policy model combined with an energy term, and ETS estimates this energy term via online Monte Carlo with a provable convergence rate. It also leverages modern acceleration frameworks and tailored importance sampling estimators to reduce inference latency while preserving sampling quality. Experiments on autoregressive and diffusion language models across reasoning, coding, and science benchmarks show that ETS improves generation quality, validating its effectiveness.
论文提出一种无需训练的推理方法,即能量引导的测试时缩放(ETS),以解决语言模型 RL 后训练对齐成本高且不稳定的问题。掩码语言建模(MLM)的转移概率由参考策略模型和能量项构成,ETS 通过在线蒙特卡洛估计该能量项,并具有可证明的收敛速率,同时使用定制的重要性采样估计器提高效率。实验表明,ETS 在各种基准测试中提高了生成质量,验证了其有效性和设计。
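As background for the "reference policy plus energy term" structure described in this entry: for KL-regularized RL, the optimal policy has the well-known closed form pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The sketch below approximates sampling from that form by self-normalized importance resampling over candidates drawn from the reference policy. It is not the ETS estimator: the function `energy_weighted_choice`, the candidate strings, the rewards, and `beta` are all hypothetical.

```python
import math
import random

def energy_weighted_choice(candidates, rewards, beta, rng):
    """Pick one reference-policy sample with probability proportional to
    exp(reward / beta) (self-normalized importance resampling)."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / beta) for r in rewards]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(candidates, weights=probs, k=1)[0], probs

rng = random.Random(0)
candidates = ["answer_a", "answer_b", "answer_c"]  # hypothetical samples from pi_ref
rewards = [0.0, 1.0, 2.0]                          # hypothetical reward scores
choice, probs = energy_weighted_choice(candidates, rewards, beta=1.0, rng=rng)
print(choice, [round(p, 3) for p in probs])
```

A smaller `beta` concentrates the resampling on the highest-reward candidate, while a larger `beta` keeps the result closer to the reference policy.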
NOSA: Native and Offloadable Sparse Attention
Authors: Yuxiang Huang, Pengjie Wang, Jicheng Han, Weilin Zhao, Zhou Su, Ao Sun, Hongya Lyu, Hengyu Zhao, Yudong Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
First: 2025-10-15T14:33:16+00:00 · Latest: 2026-01-29T08:26:24+00:00
Comments: Preprint
Abstract
Decoding throughput improvements from larger inference batches are limited by GPU memory, which is largely consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-generation quality due to training-inference mismatch on sparse patterns. Meanwhile, trainable sparse attention is incompatible with efficient offloading, as unconstrained KV accesses may force large CPU-to-GPU transfers and erase throughput gains. To this end, we propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading. NOSA explicitly constrains the volume of CPU-GPU KV transfers, thereby achieving low communication overhead and high decoding throughput. We further build NOSI, a KV cache offloading inference system that fully unlocks NOSA's efficiency. Empirical results on 1B, 3B, and 8B LLMs demonstrate that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, while boosting decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. We release our code at https://github.com/thunlp/NOSA.
中文标题/摘要
标题:NOSA:原生可卸载的稀疏注意机制
从更大推理批次中获得的解码吞吐量提升受限于GPU内存,而GPU内存大部分被键值(KV)缓存消耗。先前的无训练KV缓存卸载通过在CPU上保留冗余上下文并仅获取稀疏子集来进行注意,从而缓解了这一问题,但往往会因稀疏模式的训练-推理不匹配而降低长生成质量。同时,可训练的稀疏注意机制与高效的卸载不兼容,因为不受约束的KV访问可能会强制进行大量CPU到GPU的数据传输,从而消除吞吐量提升。为此,我们提出NOSA,一种原生设计用于KV缓存卸载的可训练稀疏注意机制。NOSA 明确限制了CPU-GPU KV传输的体积,从而实现低通信开销和高解码吞吐量。我们进一步构建了NOSI,一个KV缓存卸载推理系统,完全释放了NOSA的效率。在1B、3B和8B大语言模型上的实验证明,NOSA在通用、长输入和长生成任务上优于KV缓存卸载基线,同时解码吞吐量相对于FullAttn、InfLLMv2和ShadowKV分别最高提升5.04倍、1.92倍和1.83倍。我们已在 https://github.com/thunlp/NOSA 发布了我们的代码。
Summary / 总结
NOSA is a trainable sparse attention mechanism designed for efficient key-value cache offloading, addressing the limitations of previous methods by explicitly constraining the volume of CPU-GPU KV transfers. Experiments on 1B, 3B, and 8B language models show that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, improving decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. NOSI, a KV cache offloading inference system, is built to fully unlock NOSA's efficiency.
NOSA 是一种可训练的稀疏注意力机制,旨在高效地进行键值缓存卸载,通过限制 CPU-GPU 数据传输来解决先前方法的限制,从而保持高解码吞吐量。实验表明,NOSA 在 1B、3B 和 8B 语言模型上优于现有基线,解码吞吐量相对于 FullAttn、InfLLMv2 和 ShadowKV 分别最高提升 5.04 倍、1.92 倍和 1.83 倍。
Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models
Authors: Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, Sadao Kurohashi
First: 2025-10-08T16:46:57+00:00 · Latest: 2026-01-29T07:52:53+00:00
Comments: Accepted at EACL 2026 (Main). Our code will be available at: https://github.com/ku-nlp/language-specific-dimensions
Abstract
Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.
中文标题/摘要
标题:语言存在于稀疏维度中:面向可解释和高效的多语言控制的大语言模型
大语言模型在有限的非英语数据暴露下表现出强大的多语言能力。先前的研究表明,以英语为中心的大语言模型在中间层将多语言内容映射到英语对齐的表示,然后在最终层将它们投影回目标语言的标记空间。基于这一观察,我们假设这种跨语言过渡由一组小而稀疏的维度控制,这些维度在中间层到最终层中出现在一致的索引位置。基于这一见解,我们提出了一种简单的、无需训练的方法来识别和操作这些维度,少至50句平行或单语数据即可。在多语言生成控制任务上的实验揭示了这些维度的可解释性,表明在这些维度上的干预可以切换输出语言同时保留语义内容,并且以显著更低的成本超过了先前基于神经元的方法的性能。
Summary / 总结
The research aims to understand the sparse dimensions that govern the multilingual capabilities of large language models. The method involves identifying and manipulating these dimensions with minimal data, showing that interventions in these dimensions can switch the output language while preserving semantic content. This approach outperforms previous neuron-based methods at a lower cost.
研究探讨了大型语言模型中控制多语言输出的稀疏维度。通过假设这些维度在各层中位于一致的索引位置,研究人员提出了一种无需训练的方法,仅使用少量数据来操控这些维度以实现语言控制。实验表明,对这些维度的干预可以切换输出语言同时保留语义内容,并且在成本更低的情况下超越了之前基于神经元的方法。
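One simple way to picture "identify and manipulate a sparse set of dimensions" is sketched below. The selection rule (largest gap between per-language mean activations) and the intervention (overwriting the selected dimensions with the target-language mean) are assumptions made for illustration and may differ from the paper's actual procedure. The helpers `mean_vector`, `top_k_dimensions`, and `intervene`, and the toy 4-dimensional hidden states, are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def top_k_dimensions(source_states, target_states, k):
    """Indices of the k dimensions whose mean activation differs most between languages."""
    src_mean, tgt_mean = mean_vector(source_states), mean_vector(target_states)
    gaps = [abs(t - s) for s, t in zip(src_mean, tgt_mean)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k]

def intervene(hidden, dims, target_mean):
    """Overwrite only the selected dimensions with the target-language mean."""
    out = list(hidden)
    for d in dims:
        out[d] = target_mean[d]
    return out

# Hypothetical 4-dimensional hidden states collected from two languages.
english = [[0.1, 5.0, 0.2, -1.0], [0.0, 5.2, 0.1, -1.1]]
japanese = [[0.1, -4.9, 0.2, 3.0], [0.1, -5.1, 0.3, 3.1]]
dims = top_k_dimensions(english, japanese, k=2)
steered = intervene(english[0], dims, mean_vector(japanese))
print(dims, steered)
```

Only the selected dimensions change, so the remaining coordinates of the hidden state pass through untouched.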
The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
Authors: Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao
First: 2025-12-31T19:00:03+00:00 · Latest: 2026-01-29T06:04:53+00:00
Abstract
The open-weight language model ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single breaker token that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and evades outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge
中文标题/摘要
标题:词汇中的特洛伊木马:LLM 组合中的隐蔽破坏
开放权重语言模型生态系统越来越多地通过模型组合技术(如权重合并、推测解码和词汇扩展)来重新混搭来自不同来源的能力。在不同模型家族之间应用这些方法的一个关键前提是分词器移植,它将不兼容的词汇表对齐到共享的嵌入空间。我们证明了这一关键的互操作性步骤引入了一个供应链漏洞:我们设计了一个单一的破坏性标记,它在捐赠模型中功能上是惰性的,但在移植到基础模型后会可靠地重建为一个高显著性的恶意特征。通过利用系数重用的几何特性,我们的攻击破坏了基础模型的生成能力,同时使捐赠模型的效用在统计上与正常行为无法区分。我们将此问题形式化为一个双目标优化问题,并使用稀疏求解器实例化了攻击。实验表明,该攻击无需训练且能逃避异常检测,同时在微调和权重合并下表现出结构性持久性,突显了模块化人工智能组合管道中的隐藏风险。代码可在 https://github.com/xz-liu/tokenforge 获取
Summary / 总结
This paper investigates the security risks in the language model ecosystem, particularly focusing on the vulnerability introduced by tokenizer transplant. The authors engineer a single 'breaker' token that is functionally inert in a donor model but reliably reconstructs into a high-salience malicious feature once transplanted into a base model. The attack exploits the geometry of coefficient reuse to sabotage the base model's generation while leaving the donor model's utility statistically indistinguishable from nominal behavior, making it difficult to detect and counteract. The attack is training-free, evades outlier detection, and persists against fine-tuning and weight merging, highlighting a significant security concern in model composition techniques. Code for the attack is available on GitHub.
该论文研究了语言模型生态系统中的安全风险,特别是由分词器移植引入的漏洞。作者设计了一个‘破坏者’标记,该标记在捐赠模型中功能上是惰性的,但在移植到基础模型后会可靠地重建为高显著性的恶意特征。该攻击利用系数重用的几何特性来破坏基础模型的生成,而不影响捐赠模型的性能,使其难以检测和对抗。该攻击无需训练,能够逃避异常检测,并被证明在微调和权重合并下仍然持续存在,突显了模型组合技术中的一个重大安全风险。攻击代码可在GitHub上获得。