Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Authors: Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
First: 2026-01-29T18:59:24+00:00 · Latest: 2026-01-29T18:59:24+00:00
Comments: 26 pages, 31 figures, 13 tables. Project Page: https://sites.google.com/view/vi-probe/
Abstract
Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.
中文标题/摘要
标题:VLMs 是感知还是回忆?经典视觉错觉探究视觉感知与记忆
大型视觉-语言模型(VLMs)在原始图像上通常能正确回答经典视觉错觉,但在错觉因素反转后仍坚持相同的回答,尽管这些视觉变化对人类来说非常明显。这引发了一个基本问题:VLMs 是感知视觉变化还是仅仅回忆已记忆的模式?尽管已有几项研究注意到了这一现象,但其背后的成因仍不清楚。为了从观察转向系统理解,本文引入了VI-Probe,这是一种可控的视觉错觉框架,具有分级扰动和匹配的视觉对照(无错觉诱导器),以解开视觉驱动感知与语言驱动回忆之间的关系。不同于以往工作主要关注平均准确率,我们使用极性反转一致性、模板固定指数和与匹配对照归一化的错觉乘数来衡量稳定性和敏感性。不同家族的实验表明,反应持久性源于多种原因而非单一机制。例如,GPT-5 表现出记忆覆盖,Claude-Opus-4.1 显示感知与记忆的竞争,而 Qwen 变体则表明视觉处理的限制。我们的发现挑战了单一成因的观点,并促使基于探究的评估,以衡量知识和对受控视觉变化的敏感性。数据和代码可在 https://sites.google.com/view/vi-probe/ 获取。
Summary / 总结
This paper investigates whether large vision-language models (VLMs) perceive visual changes or merely recall memorized patterns by using a controllable visual-illusion framework called VI-Probe. The study measures stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different VLMs reveal that response persistence arises from heterogeneous causes, challenging the notion of a single mechanism and highlighting the need for probing-based evaluation that measures both knowledge and sensitivity to controlled visual change.
该研究通过使用可控视觉错觉框架VI-Probe,探讨大型视觉语言模型(VLMs)是感知视觉变化还是仅回忆记忆模式。研究使用极性反转一致性、模板固定指数和与匹配控制相比的错觉乘数来衡量稳定性和敏感性。不同VLMs的实验结果显示,响应持久性源于多种原因,挑战了单一原因的观点,并强调了需要进行基于探针的评估,以衡量对受控视觉变化的知识和敏感性。
SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence
Authors: Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi
First: 2026-01-29T18:41:52+00:00 · Latest: 2026-01-29T18:41:52+00:00
Abstract
Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
中文标题/摘要
标题:SINA:使用人工智能的电路原理图图像到网表生成器
当前将电路原理图图像转换为机器可读网表的方法在组件识别和连接推理方面存在困难。在本文中,我们介绍了SINA,这是一个开源的全自动电路原理图图像到网表生成器。SINA结合了深度学习进行准确的组件检测、连通组件标记(CCL)进行精确的连接提取、光学字符识别(OCR)进行组件参考标识符检索,并使用视觉语言模型(VLM)进行可靠的参考标识符分配。在我们的实验中,SINA的整体网表生成准确率为96.47%,比最先进的方法高出2.72倍。
Summary / 总结
SINA is an AI-based circuit schematic image-to-netlist generator that uses deep learning for component detection, CCL for connectivity extraction, OCR for reference designator retrieval, and a VLM for reference designator assignments. Experiments show that SINA achieves 96.47% overall netlist-generation accuracy, surpassing state-of-the-art methods by 2.72 times.
SINA 是一种基于 AI 的电路图图像到网表生成器,使用深度学习进行组件检测、CCL 进行连接提取、OCR 进行参考标识符检索,以及 VLM 进行参考标识符分配。实验结果显示,SINA 的整体网表生成准确率为 96.47%,比现有最佳方法高出 2.72 倍。
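The abstract names Connected-Component Labeling (CCL) as the connectivity-extraction step but does not describe how it is implemented. The following is a minimal sketch of the general technique only, not SINA's code. It assumes a binary grid in which 1 marks a wire pixel, and it assumes 4-connectivity; both choices, and the names `label_components` and `wires`, are illustrative. Pixels that receive the same label would belong to the same electrical net.

```python
from collections import deque


def label_components(grid):
    """Label 4-connected regions of 1s. Returns (labels, count); 0 is background."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels that already have a label.
            if grid[r][c] != 1 or labels[r][c] != 0:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Breadth-first flood fill over the 4-neighbourhood.
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Hypothetical wire mask: three separate nets.
wires = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
labels, count = label_components(wires)
```

In a real pipeline the detected component bounding boxes would presumably be masked out first so that only wires remain, but that step is an assumption here.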
Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering
Authors: Dongxuan Zhu, Ly Tran Ho Khanh, Andy Yat-Ming Cheung, Man-Chung Yue, Viet Anh Nguyen
Venue: ICLR 2026
First: 2026-01-29T17:17:04+00:00 · Latest: 2026-01-29T17:17:04+00:00
Comments: 34 pages, 2 figures. Accepted for publication at ICLR 2026
Abstract
Language models often default to a narrow set of high-probability outputs, leaving their generation paths homogeneous and prone to mode collapse. Sampling-based strategies inject randomness but still struggle to guarantee diversity across multiple concurrent generation runs. We address this limitation by introducing STARS (Stiefel-based Activation Steering for Diverse ReaSoning), a training-free, inference-time intervention method that transforms activation steering into an exploration engine. At each token, STARS collects the hidden activations of concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold. STARS maximizes the geometric volume of the steered activations, while the Stiefel manifold induces orthogonality of the steering interventions. This formulation explicitly promotes divergent activation vectors of concurrent generation runs, and implicitly promotes divergent generation trajectories. This manifold optimization formulation can be solved using a Riemannian gradient descent algorithm with convergence guarantees, but this algorithm is too time-consuming for real-time inference. To guarantee low latency, we further design a lightweight one-step update with an aggressive, closed-form stepsize. For test case generation and scientific discovery benchmarks, STARS consistently outperforms standard sampling methods, achieving greater diversity without sacrificing qualitative performance.
中文标题/摘要
标题:通过推断时斯蒂费尔激活引导探索多样的生成路径
语言模型通常默认生成一组窄范围的高概率输出,导致生成路径同质化且容易发生模式崩溃。基于采样的策略虽然引入了随机性,但在多个并发生成运行中仍难以保证多样性。我们通过引入STARS(基于斯蒂费尔的激活引导以促进多样推理)来解决这一限制,这是一种无需训练的推断时干预方法,将激活引导转化为探索引擎。在每个标记处,STARS 收集并发生成运行的隐藏激活,并在斯蒂费尔流形上联合优化多个附加引导方向。STARS 最大化引导激活的几何体积,而斯蒂费尔流形诱导引导干预的正交性。这种形式明确促进了并发生成运行的发散激活向量,并隐式促进了发散的生成轨迹。这种流形优化形式可以通过黎曼梯度下降算法求解,具有收敛保证,但该算法对于实时推断来说耗时过长。为了保证低延迟,我们进一步设计了一种轻量级的一步更新,具有激进的闭式步长。在测试案例生成和科学发现基准测试中,STARS 一致地优于标准采样方法,实现了更高的多样性而不牺牲定性性能。
Summary / 总结
The paper addresses the issue of language models generating homogeneous outputs by introducing STARS, a method that optimizes activation steering at inference time. STARS uses the Stiefel manifold to ensure orthogonality of steering interventions, promoting divergent activation vectors and generation trajectories. Experiments show that STARS outperforms standard sampling methods in generating diverse outputs for test cases and scientific discovery benchmarks without compromising quality.
论文通过引入STARS方法解决了语言模型生成同质输出的问题,该方法在推理时优化激活方向。STARS从并发生成运行中收集隐藏激活,并在Stiefel流形上优化多个引导方向,促进发散的激活向量和生成轨迹。实验表明,STARS在测试案例生成和科学发现基准测试中优于标准采样方法,能够生成更具多样性的输出而不牺牲质量。
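To make the "geometric volume" and orthogonality ideas concrete, here is a toy sketch that only illustrates the quantities involved. It is not the STARS algorithm: the abstract does not give the exact objective, the Riemannian update, or the closed-form step size. The sketch assumes the squared volume spanned by a set of vectors is measured by the determinant of their Gram matrix, and that the steering directions are orthonormal (a point on a Stiefel manifold). All names and numbers are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_det(vectors):
    """Determinant of the Gram matrix G[i][j] = <v_i, v_j> (squared volume)."""
    n = len(vectors)
    g = [[dot(u, v) for v in vectors] for u in vectors]
    det = 1.0
    # Gaussian elimination with partial pivoting.
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(g[r][i]))
        if abs(g[pivot][i]) < 1e-12:
            return 0.0  # Linearly dependent vectors span zero volume.
        if pivot != i:
            g[i], g[pivot] = g[pivot], g[i]
            det = -det
        det *= g[i][i]
        for r in range(i + 1, n):
            factor = g[r][i] / g[i][i]
            for c in range(i, n):
                g[r][c] -= factor * g[i][c]
    return det


def is_orthonormal(directions, tol=1e-9):
    """Check <d_i, d_j> = 1 if i == j else 0."""
    for i, u in enumerate(directions):
        for j, v in enumerate(directions):
            target = 1.0 if i == j else 0.0
            if abs(dot(u, v) - target) > tol:
                return False
    return True


# Two concurrent runs with identical hidden activations (collapsed paths).
activations = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# Hypothetical orthonormal steering directions, one per run.
directions = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
steered = [[h + d for h, d in zip(hv, dv)] for hv, dv in zip(activations, directions)]

volume_before = gram_det(activations)  # identical vectors: zero volume
volume_after = gram_det(steered)       # steered vectors span a 2-D parallelogram
```

In this toy case the additive orthogonal directions lift the squared volume from 0 to 3, which is the kind of divergence the abstract describes; how STARS chooses the directions is not reproduced here.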
Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
Authors: Konstantinos P. Panousis, Diego Marcos
First: 2026-01-29T16:28:55+00:00 · Latest: 2026-01-29T16:28:55+00:00
Abstract
The widespread adoption of Vision-Language Models (VLMs) across fields has amplified concerns about model interpretability. Distressingly, these models are often treated as black boxes, with limited or non-existent investigation of their decision-making process. Despite numerous post- and ante-hoc interpretability methods, systematic and objective evaluation of the learned representations remains limited, particularly for sparsity-aware methods that are increasingly considered to "induce interpretability". In this work, we focus on Concept Bottleneck Models and investigate how different modeling decisions affect the emerging representations. We introduce the notion of clarity, a measure capturing the interplay between the downstream performance and the sparsity and precision of the concept representation, while proposing an interpretability assessment framework using datasets with ground truth concept annotations. We consider both VLM- and attribute predictor-based CBMs, and three different sparsity-inducing strategies: per-example $\ell_1, \ell_0$ and Bernoulli-based formulations. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels. The code will be made publicly available upon publication.
中文标题/摘要
标题:清晰度:稀疏感知概念瓶颈模型中的灵活性-可解释性权衡
视觉-语言模型(VLMs)在各领域的广泛应用加剧了人们对模型可解释性的担忧。令人不安的是,这些模型往往被视为黑箱,对其决策过程的研究有限或几乎不存在。尽管存在众多事后和事前的可解释性方法,但对学习表示的系统性和客观性评估仍然有限,特别是对于那些越来越被认为“诱导可解释性”的稀疏感知方法。在本文中,我们专注于概念瓶颈模型,并探讨不同的建模决策如何影响生成的表示。我们引入了清晰度的概念,这是一个衡量指标,捕捉下游性能与概念表示的稀疏性和精确性之间的相互作用,同时提出了一种使用具有真实概念注释的数据集进行可解释性评估的框架。我们考虑了基于VLM和属性预测器的概念瓶颈模型(CBMs),以及三种不同的稀疏诱导策略:每例$\ell_1$、$\ell_0$和伯努利形式。我们的实验揭示了灵活性和可解释性之间的关键权衡,在相似的性能水平下,给定的方法可能会表现出截然不同的行为。代码将在发表后公开可用。
Summary / 总结
This work addresses the interpretability issue in Vision-Language Models (VLMs) by focusing on Concept Bottleneck Models (CBMs). The authors introduce a measure called clarity, which evaluates the balance between the model's performance and the sparsity and precision of the concept representation. They propose an interpretability assessment framework using datasets with ground truth concept annotations and consider three sparsity-inducing strategies. The experiments show a trade-off between flexibility and interpretability, where different methods can behave differently even when performance is similar.
该研究关注Vision-Language Models (VLMs)的可解释性问题,通过聚焦Concept Bottleneck Models进行探讨。引入了一个称为清晰度的度量,用于评估概念表示的稀疏性和精度与下游性能之间的平衡。研究提出了一个使用带有概念标注真实值的数据集的可解释性评估框架,并考虑了三种稀疏性诱导策略。实验表明,在相似性能水平下,不同方法的表现可能大不相同,存在灵活性与可解释性之间的权衡。
A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances
Authors: Brian B. Moser, Arundhati S. Shanbhag, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
First: 2025-05-23T12:18:34+00:00 · Latest: 2026-01-29T16:22:04+00:00
Abstract
Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.
中文标题/摘要
标题:coreset 选择的coreset 选择文献综述:介绍与最新进展
coreset 选择旨在解决如何从大型数据集中找到一个小型、具有代表性的子集,以保留关键模式并有效进行机器学习的问题。尽管已有几篇综述研究了数据缩减策略,但大多数综述仅专注于经典几何方法或主动学习技术。相比之下,本文综述提供了一个更全面的观点,通过将无训练、训练导向和无标签三大类coreset 研究方法统一到一个分类体系中。我们介绍了现有工作中经常忽略的子领域,包括子模形式、 bilevel 优化以及无标签数据集中的伪标签最新进展。此外,我们探讨了剪枝策略如何影响泛化能力和神经网络的标度法则,提供了先前综述中未提及的新见解。最后,我们在不同的计算、鲁棒性和性能需求下比较了这些方法,并指出了未来研究中需要解决的开放挑战,如鲁棒性、异常值过滤以及将coreset 选择适应基础模型。
Summary / 总结
This paper aims to provide a comprehensive overview of coreset selection methods, which involve selecting a small, representative subset of a large dataset to preserve essential patterns for machine learning. The study unifies three major research lines: training-free, training-oriented, and label-free approaches, and explores subfields such as submodular formulations, bilevel optimization, and pseudo-labeling. Key findings include insights into how pruning strategies affect generalization and neural scaling laws, and the comparison of these methods under different computational, robustness, and performance demands, highlighting open challenges for future research.
本文解决了选择一个大型数据集的小型、代表性子集以保留关键模式以实现有效机器学习的挑战。它将训练无关、训练导向和标签无关的三大类 coreset 研究统一到一个分类体系中。研究还探讨了子模形式、双层优化以及未标记数据集的伪标签等子领域,并研究了剪枝策略如何影响泛化能力和神经网络的标度法则。关键发现包括在不同计算、鲁棒性和性能需求下比较这些方法,并指出了鲁棒性、异常值过滤等未来研究中的开放挑战。
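As one concrete instance of the classical geometry-based family the abstract mentions, the sketch below shows greedy k-center selection, a well-known coreset heuristic. It is chosen here for illustration and is not taken from the survey's text. The function name, the starting index, and the toy points are assumptions.

```python
import math


def k_center_greedy(points, k, first=0):
    """Pick k indices so that each new pick is the point farthest from the
    current selection (greedy k-center). Ties go to the lowest index."""
    selected = [first]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        farthest = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(farthest)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[farthest]))
    return selected


# Hypothetical 1-D dataset with three loose clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,)]
coreset = k_center_greedy(points, k=3)
```

Starting from index 0, the selection spreads across the three clusters rather than sampling neighbours, which is the coverage property such training-free methods aim for.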
Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models
Authors: Cong Cao, Huanjing Yue, Shangbin Xie, Xin Liu, Jingyu Yang
First: 2026-01-29T16:14:07+00:00 · Latest: 2026-01-29T16:14:07+00:00
Abstract
Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogenous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogenous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experimental results demonstrate the superiority of the proposed method.
中文标题/摘要
标题:利用视频扩散模型辅助的零样本视频恢复与增强
尽管基于扩散的零样本图像恢复和增强方法已经取得了巨大成功,但将其应用于视频恢复或增强会导致严重的时域闪烁。本文提出了一种利用快速发展的视频扩散模型辅助基于图像的方法,以保持更佳的时域一致性。我们提出了同构潜变量融合、异构潜变量融合以及基于COT的融合比例策略,利用同构和异构文本到视频扩散模型来补充图像方法。此外,我们提出了时域增强后处理,利用图像到视频扩散模型进一步提高时域一致性。该方法无需训练,可以应用于任何基于扩散的图像恢复和增强方法。实验结果表明了所提方法的优越性。
Summary / 总结
The research aims to address the issue of temporal flickering in zero-shot video restoration and enhancement using diffusion models. The authors propose a framework that leverages video diffusion models to assist image-based methods, ensuring better temporal consistency. They introduce fusion strategies for homologous and heterogenous text-to-video diffusion models and a COT-based fusion ratio strategy. Additionally, they propose temporal-strengthening post-processing to enhance temporal consistency further. The method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experiments show the proposed method's superiority in maintaining temporal consistency.
研究旨在解决使用扩散模型进行零样本视频修复和增强时出现的严重时间闪烁问题。作者提出了一种框架,利用视频扩散模型辅助图像方法,以确保更好的时间一致性。他们引入了同源和异源文本到视频扩散模型的融合策略,并提出了一种基于COT的融合比例策略。此外,他们还提出了时间增强后处理,以进一步提高时间一致性。该方法无需训练,可以应用于任何基于扩散的图像修复和增强方法。实验结果表明,所提出的方法在保持时间一致性方面具有优越性。
FreeFuse: Multi-Subject LoRA Fusion via Adaptive Token-Level Routing at Test Time
Authors: Yaoli Liu, Yao-Xiang Ding, Kun Zhou
First: 2025-10-27T16:54:08+00:00 · Latest: 2026-01-29T16:14:07+00:00
Abstract
This paper proposes FreeFuse, a training-free framework for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to prior studies that focus on retraining LoRA to alleviate feature conflicts, our analysis reveals that simply spatially confining the subject LoRA's output to its target region and preventing other LoRAs from directly intruding into this area is sufficient for effective mitigation. Accordingly, we implement Adaptive Token-Level Routing during the inference phase. We introduce FreeFuseAttn, a mechanism that exploits the flow matching model's intrinsic semantic alignment to dynamically match subject-specific tokens to their corresponding spatial regions at early denoising timesteps, thereby bypassing the need for external segmentors. FreeFuse distinguishes itself through high practicality: it necessitates no additional training, model modifications, or user-defined masks or spatial conditions. Users need only provide subject activation words to achieve seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both identity preservation and compositional fidelity. Our code is available at https://github.com/yaoliliu/FreeFuse.
中文标题/摘要
标题:FreeFuse:通过自适应令牌级路由在测试时进行多主题LoRA融合的无训练框架
本文提出FreeFuse,一种无需训练的多主题文本到图像生成框架,通过自动融合多个主题LoRA实现。与以往专注于重新训练LoRA以缓解特征冲突的研究不同,我们的分析表明,简单地将主题LoRA的输出空间限制在其目标区域,并防止其他LoRA直接侵入该区域就足以实现有效的缓解。因此,在推理阶段我们实现了自适应令牌级路由。我们引入了FreeFuseAttn机制,该机制利用流匹配模型固有的语义对齐,在早期去噪时间步动态匹配主题特定的令牌到其相应的空间区域,从而绕过了对外部分割器的需求。FreeFuse通过其高实用性脱颖而出:它不需要额外的训练、模型修改或用户定义的空间条件。用户只需提供主题激活词即可无缝集成到标准工作流程中。广泛的实验验证了FreeFuse在身份保留和组成保真度方面优于现有方法。我们的代码可在https://github.com/yaoliliu/FreeFuse获取。
Summary / 总结
FreeFuse is a training-free framework for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs without retraining. It uses Adaptive Token-Level Routing during inference to spatially confine each subject's output to its target region, preventing interference from other LoRAs. FreeFuse outperforms existing methods in preserving identity and compositional fidelity, and it does not require additional training, model modifications, or user-defined masks. Users only need to provide subject activation words. Extensive experiments validate its effectiveness.
FreeFuse 是一个无需训练的多主题文本到图像生成框架,它通过自适应的令牌级路由在推理阶段自动融合多个主题的 LoRA。它将每个主题的输出限制在其目标区域内,防止其他 LoRA 的干扰。FreeFuse 在保持身份和组成保真度方面优于现有方法,无需额外训练、模型修改或用户定义的遮罩。用户只需提供主题激活词即可。广泛的实验验证了其有效性。
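The core idea of confining each subject LoRA's output to its own region can be illustrated with a toy hard-routing sketch. This is not FreeFuse's implementation: the abstract says regions are derived from the flow matching model's attention (FreeFuseAttn), whereas the sketch takes per-subject score maps as given numbers. The hard argmax assignment, the flattened 1-D positions, and every name below are assumptions.

```python
def route_positions(score_maps):
    """score_maps[s][p] is a hypothetical affinity of subject s at position p.
    Returns, for each position, the index of the subject that owns it."""
    n_subjects = len(score_maps)
    n_positions = len(score_maps[0])
    return [
        max(range(n_subjects), key=lambda s: score_maps[s][p])
        for p in range(n_positions)
    ]


def fuse(base, lora_deltas, owners):
    """Add, at each position, only the delta of the LoRA that owns it, so no
    other LoRA writes into that region."""
    return [base[p] + lora_deltas[owners[p]][p] for p in range(len(base))]


# Two subjects over four spatial positions (all values hypothetical).
scores = [
    [0.9, 0.8, 0.1, 0.2],  # subject 0 dominates the left half
    [0.1, 0.3, 0.7, 0.9],  # subject 1 dominates the right half
]
owners = route_positions(scores)

base = [1.0, 1.0, 1.0, 1.0]
deltas = [
    [0.5, 0.5, 0.5, 0.5],          # LoRA output for subject 0
    [-0.25, -0.25, -0.25, -0.25],  # LoRA output for subject 1
]
fused = fuse(base, deltas, owners)
```

Each position receives exactly one LoRA contribution, which mirrors the "no intrusion" condition described in the abstract; a real system would route per token and per layer.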
Improving Classifier-Free Guidance of Flow Matching via Manifold Projection
Authors: Jian-Feng Cai, Haixia Liu, Zhengyi Su, Chao Wang
First: 2026-01-29T15:49:31+00:00 · Latest: 2026-01-29T15:49:31+00:00
Comments: 24 pages, 14 figures
Abstract
Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
中文标题/摘要
标题:通过流匹配中的流形投影改进无分类器引导
无分类器引导(CFG)是一种广泛用于扩散和基于流模型的可控生成的技术。尽管在实践中取得了成功,但CFG依赖于一种敏感于引导尺度的启发式线性外推。在本文中,我们通过优化的角度为CFG提供了一个原理性的解释。我们证明了流匹配中的速度场对应于一系列平滑距离函数的梯度,这引导潜在变量向缩放的目标图像集移动。这种视角揭示了标准的CFG公式是该梯度的近似,其中预测差距,即条件输出与无条件输出之间的差异,决定了引导的敏感性。利用这一洞察,我们将CFG采样重新表述为具有流形约束的同伦优化。这种表述需要一个流形投影步骤,我们在采样过程中通过增量梯度下降方案实现。为了提高计算效率和稳定性,我们进一步通过Anderson加速改进了这一迭代过程,而无需额外的模型评估。我们提出的方法是训练免费的,并且一致地提高了生成保真度、提示对齐和对引导尺度的鲁棒性。我们在多种基准上验证了其有效性,展示了在DiT-XL-2-256、Flux和Stable Diffusion 3.5等大型模型上取得了显著改进。
Summary / 总结
This work aims to improve classifier-free guidance (CFG) in flow-based models by providing a principled interpretation through optimization. The authors show that the velocity field in flow matching corresponds to the gradient of smoothed distance functions, guiding latent variables towards the scaled target image set. They reformulate CFG as a homotopy optimization with a manifold constraint, implementing a manifold projection step via incremental gradient descent. This method enhances computational efficiency and stability with Anderson Acceleration, leading to consistent improvements in generation fidelity, prompt alignment, and robustness to the guidance scale across various models like DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
本文旨在通过优化提供一种原理性的解释来改进流基模型中的无分类器引导(CFG)。作者表明,流匹配中的速度场对应于平滑距离函数的梯度,引导潜在变量向缩放的目标图像集移动。他们将CFG重新表述为具有流形约束的同伦优化,并通过增量梯度下降实现流形投影,进一步使用Anderson加速提高迭代过程的效率和稳定性。所提出的方法能够一致地提高生成保真度、提示对齐和对引导尺度的鲁棒性,在大型模型如DiT-XL-2-256、Flux和Stable Diffusion 3.5上显示出显著的改进。
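The "heuristic linear extrapolation" that the abstract refers to is the standard CFG combination of conditional and unconditional predictions. The sketch below shows only that standard rule and the prediction gap it scales; the paper's manifold projection and Anderson Acceleration steps are not specified in the abstract and are not reproduced. The convention `v = v_uncond + scale * (v_cond - v_uncond)` is one common parameterization and is assumed here, as are the function names and toy vectors.

```python
import math


def cfg_velocity(v_uncond, v_cond, scale):
    """Standard CFG: extrapolate from the unconditional prediction along the
    direction of the conditional one."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def prediction_gap(v_uncond, v_cond):
    """Euclidean norm of the conditional/unconditional discrepancy."""
    return math.sqrt(sum((c - u) ** 2 for u, c in zip(v_uncond, v_cond)))


# Hypothetical 2-D velocity predictions at one sampling step.
v_uncond = [0.0, 1.0]
v_cond = [1.0, 2.0]

guided = cfg_velocity(v_uncond, v_cond, scale=2.0)
gap = prediction_gap(v_uncond, v_cond)
# Distance of the guided velocity from the conditional one grows as (scale - 1) * gap.
overshoot = prediction_gap(v_cond, guided)
```

With `scale=1.0` the rule returns the conditional prediction unchanged, and larger scales move the result further along the gap direction, which is why a large gap makes sampling sensitive to the guidance scale.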
Trajectory-Guided Diffusion for Foreground-Preserving Background Generation in Multi-Layer Documents
Authors: Taewon Kang
First: 2026-01-29T15:28:48+00:00 · Latest: 2026-01-29T15:28:48+00:00
Comments: 47 pages, 36 figures
Abstract
We present a diffusion-based framework for document-centric background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design rather than explicit constraints. Instead of suppressing diffusion updates or applying masking heuristics, our approach reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space. By shaping the initial noise and its geometric alignment, background generation naturally avoids designated foreground regions, allowing readable content to remain intact without auxiliary mechanisms. To address the long-standing issue of stylistic drift across pages, we decouple style control from text conditioning and introduce cached style directions as persistent vectors in latent space. Once selected, these directions constrain diffusion trajectories to a shared stylistic subspace, ensuring consistent appearance across pages and editing iterations. This formulation eliminates the need for repeated prompt-based style specification and provides a more stable foundation for multi-page generation. Our framework admits a geometric and physical interpretation, where diffusion paths evolve on a latent manifold shaped by preferred directions, and foreground regions are rarely traversed as a consequence of trajectory initialization rather than explicit exclusion. The proposed method is training-free, compatible with existing diffusion backbones, and produces visually coherent, foreground-preserving results across complex documents. By reframing diffusion as trajectory design in latent space, we offer a principled approach to consistent and structured generative modeling.
中文标题/摘要
标题:轨迹引导扩散在多层文档中实现前景保留背景生成
我们提出了一种基于扩散的文档中心背景生成框架,通过潜在空间设计而非显式约束实现前景保留和多页风格一致性。我们的方法重新解释了扩散作为结构化潜在空间中随机轨迹的演变。通过塑造初始噪声及其几何对齐,背景生成自然地避开指定的前景区域,使可读内容保持完整,无需辅助机制。为了解决跨页风格漂移的长期问题,我们将风格控制与文本条件分离,并引入缓存的风格方向作为潜在空间中的持久向量。一旦选定,这些方向将约束扩散轨迹到共享的风格子空间,确保跨页和编辑迭代的一致外观。这种表述消除了重复提示式风格指定的需要,并为多页生成提供了一个更稳定的基座。我们的框架具有几何和物理解释,其中扩散路径在由偏好方向塑造的潜在流形上演变,由于轨迹初始化而不是明确排除,前景区域很少被穿越。所提出的方法无需训练,与现有的扩散骨干兼容,并在复杂文档中产生视觉上连贯、前景保留的结果。通过将扩散重新构想为潜在空间中的轨迹设计,我们提供了一种原理性的方法来实现一致和结构化的生成建模。
Summary / 总结
This paper introduces a diffusion-based framework for generating background in multi-layer documents while preserving the foreground. The method reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space, shaping initial noise to avoid designated foreground regions. To ensure stylistic consistency across pages, the framework uses cached style directions as persistent vectors, constraining diffusion trajectories to a shared stylistic subspace. This approach eliminates the need for repeated style specification and produces visually coherent results without auxiliary mechanisms. The framework is training-free and compatible with existing diffusion models, offering a principled solution for multi-page document generation.
论文提出了一种基于扩散的框架,用于在多层文档中生成背景同时保留前景。该方法通过潜空间设计避免指定的前景区域,而无需使用显式约束。引入了缓存的风格方向以确保跨页面的一致性,从而消除重复的风格指定的需要。这种方法无需训练即可生成视觉上连贯的结果,并且兼容现有的扩散模型。
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
Authors: Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, Lijun Wu
First: 2026-01-29T15:07:28+00:00 · Latest: 2026-01-29T15:07:28+00:00
Abstract
Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a "less is more" phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Notably, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.
中文标题/摘要
标题:MMFineReason:通过开源数据为中心的方法缩小多模态推理差距
视觉语言模型(VLMs)的最新进展在视觉推理方面取得了显著进展。然而,开源VLMs仍然落后于专有系统,主要是因为缺乏高质量的推理数据。现有数据集在涵盖STEM图表和视觉谜题等挑战性领域方面覆盖面有限,并且缺乏用于激发强大推理能力的一致且长形式的推理链(CoT)注释。为了弥合这一差距,我们引入了MMFineReason,这是一个包含180万样本和51亿个解决方案标记的大规模多模态推理数据集,这些注释是从Qwen3-VL-235B-A22B-Thinking中提炼出来的高质量推理注释。该数据集通过一个系统性的三阶段管道建立:(1)大规模数据收集和标准化,(2)生成推理链(CoT)理由,(3)基于推理质量和难度意识的全面筛选。该数据集涵盖了STEM问题、视觉谜题、游戏和复杂图表,每个样本都注释了视觉支持的推理痕迹。我们对MMFineReason进行微调Qwen3-VL-Instruct,开发了MMFineReason-2B/4B/8B版本。我们的模型在相应规模类别中建立了新的最佳结果。值得注意的是,MMFineReason-4B成功超越了Qwen3-VL-8B-Thinking,而MMFineReason-8B甚至超过了Qwen3-VL-30B-A3B-Thinking,接近Qwen3-VL-32B-Thinking,展示了显著的参数效率。通过我们的难度意识筛选策略,我们发现了一个“少即是多”的现象:仅7%(12.3万样本)的子集就达到了与完整数据集相当的性能。值得注意的是,我们揭示了推理导向的数据组合具有协同效应,同时提升了通用能力。
LLM-based Few-Shot Early Rumor Detection with Imitation Agent
Authors: Fengzhu Zeng, Qian Shao, Ling Cheng, Wei Gao, Shih-Fen Cheng, Jing Ma, Cheng Niu
Venue: KDD 2026
First: 2025-12-20T12:42:27+00:00 · Latest: 2026-01-29T15:01:08+00:00
Comments: Accepted at KDD 2026
Abstract
Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.
中文标题/摘要
标题:基于LLM的少样本早期谣言检测与模仿代理
早期谣言检测(EARD)旨在根据一系列社交媒体帖子,识别出一个声明可以被准确分类的最早时间点。这在数据稀缺的环境中尤其具有挑战性。尽管大型语言模型(LLMs)在少样本NLP任务中表现出色,但它们不适用于时间序列数据,并且在训练和推理时计算成本高昂。在本文中,我们提出了一种新颖的EARD框架,该框架结合了一个自主代理和一个基于LLM的检测模型,其中代理作为可靠的决策者负责确定早期时间点,而LLM则作为强大的谣言检测器。这种方法提供了第一个少样本EARD的解决方案,只需要训练一个轻量级的代理,而LLM则无需训练。在四个真实世界数据集上的广泛实验表明,我们的方法在LLM上提升了性能,并在准确性和及时性方面超越了现有的EARD方法。
Summary / 总结
This work addresses the challenge of early rumor detection in data-scarce settings by proposing a novel framework that combines an autonomous agent and an LLM-based detection model. The agent determines the earliest time point for accurate classification, while the LLM detects rumors. This approach requires only the training of a lightweight agent and allows the LLM to remain training-free. Experiments on four real-world datasets demonstrate that this method outperforms existing EARD methods in both accuracy and earliness across different LLMs.
研究旨在通过提出一种新颖框架解决数据稀缺环境下的早期谣言检测难题,该框架结合了轻量级自主代理和大型语言模型(LLM)。代理确定准确分类的最早时间点,而LLM负责检测谣言。该方法仅需训练代理,从而提高计算效率,并在四个真实世界数据集上超越现有方法,在准确性和及时性方面均表现出色。
Moral Outrage Shapes Commitments Beyond Attention: Multimodal Moral Emotions on YouTube in Korea and the US
Authors: Seongchan Park, Jaehong Kim, Hyeonseung Kim, Heejin Bin, Sue Moon, Wonjae Lee
Venue: The Web Conference 2026
First: 2026-01-29T14:58:54+00:00 · Latest: 2026-01-29T14:58:54+00:00
Comments: Accepted at The Web Conference 2026. We release Korean and English multimodal moral emotion classifiers
Abstract
Understanding how media rhetoric shapes audience engagement is crucial in the attention economy. This study examines how moral emotional framing by mainstream news channels on YouTube influences user behavior across Korea and the United States. To capture the platform's multimodal nature, combining thumbnail images and video titles, we develop a multimodal moral emotion classifier by fine-tuning a vision language model. The model is trained on human-annotated multimodal datasets in both languages and applied to approximately 400,000 videos from major news outlets. We analyze engagement levels including views, likes, and comments, representing increasing degrees of commitment. The results show that other-condemning rhetoric (expressions of moral outrage that criticize others morally) consistently increases all forms of engagement across cultures, with effects ranging from passive viewing to active commenting. These findings suggest that moral outrage is a particularly effective emotional strategy, attracting not only attention but also active participation. We discuss concerns about the potential misuse of other-condemning rhetoric, as such practices may deepen polarization by reinforcing in-group and out-group divisions. To facilitate future research and ensure reproducibility, we publicly release our Korean and English multimodal moral emotion classifiers.
中文标题/摘要
标题:道德愤怒塑造超越注意力的承诺:韩国和美国YouTube上的多模态道德情绪
在注意力经济中,理解媒体修辞如何影响受众参与至关重要。本研究探讨了主流新闻频道在YouTube上以道德情感框架呈现内容如何影响韩国和美国用户的在线行为。为了捕捉平台的多模态特性,结合缩略图图像和视频标题,我们通过微调视觉语言模型开发了一个多模态道德情绪分类器。该模型在两种语言的人标注多模态数据集上进行训练,并应用于来自主要新闻机构的约40万条视频。我们分析了包括观看次数、点赞和评论在内的参与度水平,代表了不同程度的承诺。结果显示,其他谴责性道德愤怒表达,批评他人道德,无论在哪个文化中,都能一致地增加所有形式的参与度,从被动观看到积极评论。这些发现表明,道德愤怒是一种特别有效的心理策略,不仅能吸引注意力,还能激发积极的参与。我们讨论了其他谴责性修辞可能被滥用的问题,因为这种做法可能会通过强化群体内部和群体外部的分化来加深分歧。为了促进未来研究并确保可重复性,我们公开发布了韩语和英语的多模态道德情绪分类器。
Knowledge Vector Weakening: Efficient Training-free Unlearning for Large Vision-Language Models
Authors: Yejin Kim, Dongjun Hwang, Sungmin Cha, Junsuk Choe
First: 2026-01-29T14:41:01+00:00 · Latest: 2026-01-29T14:41:01+00:00
Abstract
Large Vision-Language Models (LVLMs) are widely adopted for their strong multimodal capabilities, yet they raise serious concerns such as privacy leakage and harmful content generation. Machine unlearning has emerged as a promising solution for removing the influence of specific data from trained models. However, existing approaches largely rely on gradient-based optimization, incurring substantial computational costs for large-scale LVLMs. To address this limitation, we propose Knowledge Vector Weakening (KVW), a training-free unlearning method that directly intervenes in the full model without gradient computation. KVW identifies knowledge vectors that are activated during the model's output generation on the forget set and progressively weakens their contributions, thereby preventing the model from exploiting undesirable knowledge. Experiments on the MLLMU and CLEAR benchmarks demonstrate that KVW achieves a stable forget-retain trade-off while significantly improving computational efficiency over gradient-based and LoRA-based unlearning methods.
中文标题/摘要
标题:知识向量削弱:大型视觉-语言模型的高效无训练卸载方法
大型视觉-语言模型(LVLMs)因其强大的多模态能力而被广泛采用,但它们引发了严重的隐私泄露和有害内容生成等问题。机器卸载已作为去除训练模型中特定数据影响的一种有前景的解决方案出现。然而,现有方法大多依赖于基于梯度的优化,对大规模LVLMs来说会带来巨大的计算成本。为解决这一局限,我们提出了一种名为知识向量削弱(KVW)的无训练卸载方法,该方法直接干预整个模型而不进行梯度计算。KVW 识别出在模型对忘记集生成输出时被激活的知识向量,并逐步削弱它们的贡献,从而防止模型利用不希望的知识。在MLLMU和CLEAR基准上的实验表明,KVW 在稳定遗忘-保留权衡的同时,显著提高了计算效率,优于基于梯度和LoRA的卸载方法。
Summary / 总结
The research aims to address privacy and content concerns in large vision-language models by proposing Knowledge Vector Weakening (KVW), a training-free unlearning method. KVW directly intervenes in the model without requiring gradient computation, identifying and weakening knowledge vectors activated during output generation on the forget set. Experiments show that KVW maintains a good balance between forgetting and retaining information while enhancing computational efficiency compared to gradient-based and LoRA-based methods.
研究旨在通过提出训练-free 的遗忘方法知识向量削弱(KVW),解决大型视觉-语言模型中的隐私和内容问题。KVW 不进行梯度计算直接干预模型,通过识别并削弱在忘记集上生成输出时激活的知识向量来削弱模型的不良知识利用。实验表明,KVW 保持了遗忘与保留之间的稳定权衡,并且在计算效率上优于基于梯度和 LoRA 的方法。
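The idea of weakening activated knowledge vectors without gradients can be illustrated with a toy sketch. The abstract does not say how knowledge vectors are scored or by how much they are weakened, so everything below is an assumption: the feed-forward layer is treated as a key-value memory, a vector counts as "activated" when its mean activation on the forget set exceeds a hypothetical `threshold`, and it is shrunk by a hypothetical factor `gamma` over several rounds. This is not the authors' KVW implementation.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def activations(keys, x):
    """Coefficient of each value vector for input x: relu(<key_i, x>)."""
    return [relu(sum(k * xi for k, xi in zip(key, x))) for key in keys]

def ffn_output(keys, values, x):
    """Layer output as an activation-weighted sum of value vectors."""
    acts = activations(keys, x)
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(acts, values)) for d in range(dim)]

def weaken(keys, values, forget_inputs, threshold=0.5, gamma=0.5, rounds=3):
    """Shrink value vectors that are strongly activated on the forget set.

    threshold, gamma and rounds are illustrative assumptions, not values
    taken from the paper. No gradients are computed.
    """
    n = len(keys)
    mean_act = [0.0] * n
    for x in forget_inputs:
        for i, a in enumerate(activations(keys, x)):
            mean_act[i] += a / len(forget_inputs)
    new_values = [list(v) for v in values]
    for _ in range(rounds):  # progressive weakening, one factor per round
        for i in range(n):
            if mean_act[i] > threshold:
                new_values[i] = [gamma * c for c in new_values[i]]
    return new_values

# Toy layer: neuron 0 responds to the forget input, neuron 1 to the retain input.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 3.0]]
forget_inputs = [[1.0, 0.0]]
retain_input = [0.0, 1.0]

weakened = weaken(keys, values, forget_inputs)
forget_out = ffn_output(keys, weakened, forget_inputs[0])
retain_out = ffn_output(keys, weakened, retain_input)
```

In this toy setup the response to the forget input shrinks while the response to the retain input is untouched, which is the forget-retain trade-off in its simplest form.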
Error Amplification Limits ANN-to-SNN Conversion in Continuous Control
Authors: Zijie Xu, Zihan Huang, Yiting Dong, Kang Chen, Wenxuan Liu, Zhaofei Yu
First: 2026-01-29T14:28:00+00:00 · Latest: 2026-01-29T14:28:00+00:00
Abstract
Spiking Neural Networks (SNNs) can achieve competitive performance by converting already existing well-trained Artificial Neural Networks (ANNs), avoiding further costly training. This property is particularly attractive in Reinforcement Learning (RL), where training through environment interaction is expensive and potentially unsafe. However, existing conversion methods perform poorly in continuous control, where suitable baselines are largely absent. We identify error amplification as the key cause: small action approximation errors become temporally correlated across decision steps, inducing cumulative state distribution shift and severe performance degradation. To address this issue, we propose Cross-Step Residual Potential Initialization (CRPI), a lightweight training-free mechanism that carries over residual membrane potentials across decision steps to suppress temporally correlated errors. Experiments on continuous control benchmarks with both vector and visual observations demonstrate that CRPI can be integrated into existing conversion pipelines and substantially recovers lost performance. Our results highlight continuous control as a critical and challenging benchmark for ANN-to-SNN conversion, where small errors can be strongly amplified and impact performance.
中文标题/摘要
标题:误差放大限制了ANN到SNN在连续控制中的转换
通过将已经训练好的人工神经网络(ANN)转换为现有的Spiking神经网络(SNN),SNN可以在不进行进一步昂贵训练的情况下实现竞争性性能。这一特性在强化学习(RL)中尤其具有吸引力,因为在RL中通过环境交互进行训练既昂贵又可能不安全。然而,现有的转换方法在连续控制中表现不佳,因为缺乏合适的基线。我们确定误差放大是主要原因:小的动作近似误差在决策步骤之间变得时序相关,导致状态分布的累积变化和严重的性能下降。为了解决这一问题,我们提出了跨步骤残差膜电位初始化(CRPI),这是一种轻量级的无需训练机制,可以在决策步骤之间传递残差膜电位以抑制时序相关误差。在具有向量和视觉观察的连续控制基准测试中,CRPI可以集成到现有的转换管道中,并显著恢复了丢失的性能。我们的结果强调了连续控制是ANN到SNN转换的关键和具有挑战性的基准,其中小的误差可以被强烈放大并影响性能。
Summary / 总结
The paper addresses the issue of error amplification in converting ANNs to SNNs for continuous control tasks, where small action approximation errors accumulate and degrade performance. The authors propose CRPI, a lightweight mechanism that transfers residual membrane potentials across decision steps to mitigate these errors. Experiments show that CRPI can significantly improve performance on continuous control benchmarks with both vector and visual observations, highlighting the critical challenge of error amplification in this domain.
论文探讨了在连续控制任务中将ANN转换为SNN时出现的误差放大问题,其中小的动作近似误差会累积并导致性能下降。作者提出了一种轻量级机制CRPI,它可以在决策步骤之间传递残余膜电位以减轻这些误差。实验表明,CRPI可以在具有向量和视觉观察的连续控制基准测试中显著提高性能,突显了在这一领域中误差放大带来的重大挑战。
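A single-neuron toy model can show why carrying residual membrane potential across decision steps suppresses temporally correlated error. It is only a sketch under stated assumptions: one integrate-and-fire neuron with soft reset, a constant integer input, and hypothetical function names. It is not the authors' CRPI implementation. With the potential reset to zero at every decision step, the neuron under-fires by the same amount each step; carrying the residual lets the leftover charge contribute to later steps.

```python
def run_decision_step(current, threshold, timesteps, v_init=0):
    """Simulate one integrate-and-fire neuron for one decision step.

    Returns (spike_count, residual_potential). Integer units avoid
    floating-point drift in this toy example.
    """
    v = v_init
    spikes = 0
    for _ in range(timesteps):
        v += current
        if v >= threshold:
            spikes += 1
            v -= threshold  # soft reset keeps the overshoot
    return spikes, v

def total_spikes(current, threshold, timesteps, steps, carry_residual):
    """Count spikes over several decision steps, with or without carry-over."""
    v = 0
    total = 0
    for _ in range(steps):
        spikes, residual = run_decision_step(current, threshold, timesteps, v)
        total += spikes
        v = residual if carry_residual else 0
    return total

# Input 3 with threshold 10 corresponds to an ideal firing rate of 0.3,
# i.e. 6 spikes over 5 decision steps of 4 timesteps each.
reset_total = total_spikes(3, 10, 4, 5, carry_residual=False)
carry_total = total_spikes(3, 10, 4, 5, carry_residual=True)
```

Here the reset variant produces 5 spikes (every step errs in the same direction), while the carry-over variant produces the ideal 6.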
Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation
Authors: Qian-Wei Wang, Yaguang Song, Shu-Tao Xia
First: 2025-06-03T12:48:54+00:00 · Latest: 2026-01-29T13:56:19+00:00
Abstract
In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA, and GPT-4V, leveraging these models to replace time-consuming manual annotation and enable annotation-free training has become a promising research direction. This paper studies learning from noisy partial labels generated by pre-trained VLMs and proposes a collaborative consistency regularization (Co-Reg) framework. Unlike symmetric noise commonly assumed in traditional noisy label learning, VLM-generated noise is instance-dependent and reflects the intrinsic biases of pre-trained models, posing greater challenges. To address this issue, we jointly train two neural networks to perform collaborative label purification via a co-pseudo-labeling mechanism, while enforcing consistency regularization in both label and feature representation spaces. In addition, multiple anti-overfitting strategies are introduced, including alternating optimization of contrastive representations and pseudo-labels, as well as maintaining class prototypes in a shared feature space. The proposed method can further incorporate few-shot manually annotated labels for performance enhancement. Extensive experiments under various settings demonstrate the effectiveness of our approach and highlight the potential of integrating weakly supervised learning into the knowledge distillation of pre-trained models.
中文标题/摘要
标题:弱监督学习与VLM精炼的桥梁:基于预训练VLM的嘈杂部分标签学习以实现高效下游适应
在嘈杂部分标签学习(NPLL)的背景下,每个训练样本都与多个嘈杂注释者标注的一组候选标签相关联。随着高性能预训练视觉-语言模型(VLMs)如CLIP、LLaVA和GPT-4V的出现,利用这些模型替代耗时的手动标注并实现无标注训练已成为一个有前景的研究方向。本文研究了从预训练VLM生成的嘈杂部分标签中学习,并提出了一种协作一致性正则化(Co-Reg)框架。与传统嘈杂标签学习中假设的对称噪声不同,VLM生成的噪声是实例相关的,并反映了预训练模型的固有偏差,提出了更大的挑战。为了解决这一问题,我们联合训练两个神经网络,通过共伪标签机制进行协作标签净化,同时在标签和特征表示空间中强制执行一致性正则化。此外,还引入了多种防止过拟合的策略,包括对比表示和伪标签的交替优化,以及在共享特征空间中保持类原型。所提出的方法还可以进一步结合少量手动标注的标签以提高性能。在各种设置下的广泛实验表明了我们方法的有效性,并突显了将弱监督学习整合到预训练模型的知识精炼中的潜力。
Summary / 总结
This paper addresses the challenge of noisy partial label learning (NPLL) by proposing a collaborative consistency regularization (Co-Reg) framework. It leverages pre-trained vision-language models to generate pseudo-labels and trains two neural networks collaboratively to purify labels and enforce consistency in both label and feature spaces. The method also includes anti-overfitting strategies and can incorporate few-shot manual annotations. Experiments show the effectiveness of the proposed approach in various settings, indicating its potential to integrate weakly supervised learning into the knowledge distillation of pre-trained models.
本文提出了一种协作一致性正则化(Co-Reg)框架,利用预训练的视觉-语言模型(VLMs)来净化标签,以应对嘈杂的部分标签学习(NPLL)的挑战。与传统的嘈杂标签学习不同,VLM生成的噪声是实例相关的,并反映了模型的偏差,使其更具挑战性。Co-Reg框架通过联合训练两个神经网络进行共伪标签生成,并在标签和特征表示空间中施加一致性正则化,同时引入了防止过拟合的策略。实验表明,所提出的方法有效地净化了标签,并在少量手动注释标签的支持下增强了下游适应性能。
Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling
Authors: Xinglin Wang, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
First: 2026-01-29T13:18:36+00:00 · Latest: 2026-01-29T13:18:36+00:00
Comments: preprint
Abstract
Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.
中文标题/摘要
标题:不要浪费你的部署:回收搜索经验以实现高效的测试时扩展
测试时扩展通过分配额外的推理计算资源来扩展解决方案空间,从而增强大型语言模型的推理能力。然而,现有的搜索策略通常将部署视为一次性样本,其中在每次试验后有价值的中间见解被有效丢弃。这种系统性的记忆缺失导致了巨大的计算冗余,因为模型在广泛的尝试中反复重新推导出已发现的结论并重新访问已知的死胡同。为了弥合这一差距,我们提出了**回收搜索经验(RSE)**,这是一种无需训练的自我引导策略,将测试时搜索从一系列孤立的试验转变为累积过程。通过积极地将原始轨迹提炼为共享的经验库,RSE 使中间结论的正向回收能够缩短冗余推导,并使失败模式的负向回收能够修剪遇到的死胡同。理论上,我们提供了一种分析,正式化了RSE的效率增益,并验证了它在解决复杂推理任务时比独立采样具有优势。实验上,在HMMT24、HMMT25、IMO-Bench和HLE上进行的大量实验表明,RSE 以与强基线相当的计算成本实现了最先进的扩展效率。
Summary / 总结
The paper addresses the inefficiency of existing test-time scaling methods for Large Language Models, which discard valuable intermediate insights after each trial. It introduces Recycling Search Experience (RSE), a self-guided strategy that recycles search experiences to reduce computational redundancy. Empirically, RSE outperforms strong baselines on various benchmarks with similar computational costs, demonstrating superior scaling efficiency.
论文针对现有测试时扩展方法中在每次试验后丢弃有价值见解的问题,提出了一种自我引导的策略——回收搜索经验(RSE),该策略回收中间结论和失败模式以减少冗余计算。实验结果表明,RSE 在各种基准测试中以相似的计算成本优于强基线,实现了最先进的扩展效率。
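The shared experience bank can be pictured as a small data structure that later rollouts read from. The sketch below is an assumption about what such a bank might look like; the class name, method names, and prompt wording are hypothetical and do not come from the paper. It stores distilled intermediate conclusions (positive recycling) and failed approaches (negative recycling), then prepends both to the prompt of the next rollout.

```python
class ExperienceBank:
    """Shared store of distilled rollout experience (hypothetical structure)."""

    def __init__(self):
        self.conclusions = []  # intermediate results to reuse
        self.dead_ends = []    # failed approaches to avoid

    def add(self, conclusions=(), dead_ends=()):
        """Merge experience from one rollout, skipping exact duplicates."""
        for item in conclusions:
            if item not in self.conclusions:
                self.conclusions.append(item)
        for item in dead_ends:
            if item not in self.dead_ends:
                self.dead_ends.append(item)

    def build_prompt(self, problem):
        """Compose the prompt for the next rollout from the accumulated bank."""
        lines = [f"Problem: {problem}"]
        if self.conclusions:
            lines.append("Known intermediate results (reuse, do not re-derive):")
            lines.extend(f"- {c}" for c in self.conclusions)
        if self.dead_ends:
            lines.append("Approaches already found to fail (avoid):")
            lines.extend(f"- {d}" for d in self.dead_ends)
        return "\n".join(lines)

bank = ExperienceBank()
bank.add(conclusions=["n must be even"], dead_ends=["induction on n"])
bank.add(conclusions=["n must be even", "n is divisible by 3"])
prompt = bank.build_prompt("toy problem about an integer n")
```

How trajectories are distilled into conclusions and failure patterns is the substantive part of the method and is not modeled here; the sketch only shows how a cumulative bank differs from independent sampling, where each rollout would start from the bare problem statement.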
Multimodal Visual Surrogate Compression for Alzheimer's Disease Classification
Authors: Dexuan Ding, Ciyuan Peng, Endrowednes Kuantama, Jingcai Guo, Jia Wu, Jian Yang, Amin Beheshti, Ming-Hsuan Yang, Yuankai Qi
First: 2026-01-29T13:05:46+00:00 · Latest: 2026-01-29T13:05:46+00:00
Abstract
High-dimensional structural MRI (sMRI) images are widely used for Alzheimer's Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or training-free feature extraction using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer's disease benchmarks demonstrate that MVSC performs favourably against state-of-the-art methods on both binary and multi-class classification tasks.
中文标题/摘要
标题:阿尔茨海默病分类的多模态视觉代理压缩
高维结构磁共振成像(sMRI)图像广泛用于阿尔茨海默病(AD)诊断。大多数现有的sMRI表示学习方法依赖于3D架构(例如,3D CNNs)、切片级特征提取与后期聚合,或使用2D基础模型(例如,DINO)进行无训练特征提取。然而,这三种范式分别面临高计算成本、跨切片关系丢失和提取判别特征能力有限的问题。为了解决这些挑战,我们提出了多模态视觉代理压缩(MVSC)。MVSC学习将大型3D sMRI体素压缩和适应为紧凑的2D特征,称为视觉代理,这些特征与冻结的2D基础模型更好地对齐,以提取用于最终AD分类的强大表示。MVSC有两个关键组件:体素上下文编码器,在文本引导下捕获全局跨切片上下文,以及增强切片融合模块,在文本增强、块级方式下聚合切片级信息。在三个大规模阿尔茨海默病基准上的广泛实验表明,与最先进的方法相比,我们的MVSC在二分类和多分类任务上表现更优。
Summary / 总结
The paper addresses the challenges of using high-dimensional structural MRI (sMRI) images for Alzheimer's Disease (AD) diagnosis, such as high computational cost and loss of cross-slice relations. It proposes Multimodal Visual Surrogate Compression (MVSC), which compresses 3D sMRI volumes into compact 2D features, termed visual surrogates, for better alignment with 2D foundation models. MVSC includes a Volume Context Encoder for capturing global cross-slice context and an Adaptive Slice Fusion module for text-enhanced, patch-wise aggregation. Experiments show that MVSC outperforms state-of-the-art methods on both binary and multi-class AD classification tasks.
研究旨在通过结构MRI (sMRI) 图像提高阿尔茨海默病(AD)的诊断效率和准确性。提出了一种多模态视觉代理压缩(MVSC)方法,将3D sMRI体素压缩成2D视觉代理,以便更好地与2D基础模型对齐以提取特征。MVSC包括一个体积上下文编码器来捕捉全局跨层上下文,以及一个增强切片级信息聚合的自适应切片融合模块。实验表明,MVSC在二分类和多分类AD诊断任务上均优于现有方法。
Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching
Authors: Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh
First: 2026-01-29T12:58:42+00:00 · Latest: 2026-01-29T12:58:42+00:00
Abstract
Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate the negative log-density of an embedding as a proxy for epistemic uncertainty, where low-density regions signify model ignorance. The proposed method, REPVLM, computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we also demonstrate that the model provides a scalable metric for out-of-distribution detection and automated data curation.
中文标题/摘要
标题:通过黎曼流匹配计算预训练VLMs的 epistemic 不确定性
视觉-语言模型(VLMs)通常具有确定性,缺乏内在机制来量化epistemic不确定性,这反映了模型对其自身表示的无知或知识不足。我们从理论上将嵌入的负对数密度作为epistemic不确定性的一个代理,其中低密度区域表示模型的无知。所提出的方法REPVLM使用黎曼流匹配计算VLM嵌入在超球面流形上的概率密度。我们实验证明,REPVLM在不确定性与预测误差之间的相关性接近完美,显著优于现有基线。除了分类之外,我们还证明该模型还提供了一种可扩展的用于检测异常分布和自动数据整理的度量标准。
Summary / 总结
The research aims to address the lack of epistemic uncertainty quantification in Vision-Language Models (VLMs) by proposing REPVLM, which uses Riemannian Flow Matching to compute the probability density on the hyperspherical manifold of VLM embeddings. The method shows near-perfect correlation between uncertainty and prediction error, outperforming existing baselines. Additionally, REPVLM provides a scalable metric for out-of-distribution detection and automated data curation beyond classification tasks.
论文通过提出REPVLM方法,利用黎曼流匹配计算VLM嵌入在超球面流形上的概率密度,解决了VLM中缺乏表征不确定性的问题。该方法将不确定性与预测误差相关联,并优于现有基线。此外,REPVLM还提供了一种可扩展的用于检测异常分布和数据筛选的度量标准。
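The negative log-density proxy can be illustrated without flow matching. The sketch below swaps in a much simpler density estimator, a von Mises-Fisher-style kernel density estimate on the unit sphere, purely as a stand-in for the learned Riemannian Flow Matching density; the function names and the concentration `kappa` are assumptions. The density is left unnormalized, which shifts every score by the same constant and so preserves the ranking of queries.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def neg_log_density(query, reference, kappa=10.0):
    """Uncertainty proxy: negative log of an unnormalized kernel density.

    Each reference embedding contributes exp(kappa * (cos - 1)), a
    von Mises-Fisher-style kernel. This is a stand-in for the density that
    the paper estimates with Riemannian Flow Matching.
    """
    q = normalize(query)
    total = 0.0
    for ref in reference:
        r = normalize(ref)
        cos = sum(a * b for a, b in zip(q, r))
        total += math.exp(kappa * (cos - 1.0))
    return -math.log(total / len(reference))

# Toy reference embeddings clustered around the first axis.
reference = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.9, 0.0, 0.1]]
near = neg_log_density([1.0, 0.05, 0.05], reference)  # inside the cluster
far = neg_log_density([0.0, 0.0, 1.0], reference)     # low-density region
```

A query in a low-density region of the sphere receives a higher score, which is the sense in which low density signals model ignorance.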
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
Authors: Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng
First: 2026-01-29T12:43:02+00:00 · Latest: 2026-01-29T12:43:02+00:00
Abstract
The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and science plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method that unifies text-centric OCR and vision-centric OCR in an end-to-end manner. To this end, we perform comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, since different domains require different output formats and expected outputs, the RL stage allows reward signals to be customized for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, which achieves competitive results across text-centric and vision-centric data types, comparable even to large-scale open-source and closed-source models.
中文标题/摘要
标题:OCRVerse:端到端视觉语言模型中的全方位OCR
大型视觉语言模型的发展推动了对管理和应用大量多模态数据的需求,使得从视觉图像中提取信息的OCR技术越来越受欢迎。然而,现有的OCR方法主要集中在识别图像或扫描文档中的文本元素(文本中心的OCR),忽视了从视觉信息密集型图像源(视觉中心的OCR)中识别视觉元素,如图表、网页和科学图表。实际上,这些视觉信息密集型图像在互联网上广泛存在,并具有重要的现实应用价值,如数据可视化和网页分析。在本技术报告中,我们提出了OCRVerse,这是一种端到端的全方位OCR方法,能够统一处理文本中心的OCR和视觉中心的OCR。为此,我们构建了全面的数据工程,涵盖了广泛的文本中心文档,如报纸、杂志和书籍,以及视觉中心的渲染复合体,包括图表、网页和科学图表。此外,我们为OCRVerse提出了两阶段的SFT-RL多域训练方法。SFT直接混合跨域数据进行训练和建立初始领域知识,而RL则专注于为每个领域的特性设计个性化的奖励策略。具体而言,由于不同领域需要不同的输出格式和预期输出,我们在RL阶段提供了足够的灵活性,为每个领域定制灵活的奖励信号,从而提高跨域融合并避免数据冲突。实验结果表明,OCRVerse的有效性,其在文本中心和视觉中心数据类型上的表现与大规模开源和闭源模型相当甚至更优。
Summary / 总结
The paper presents OCRVerse, a holistic OCR method for both text-centric and vision-centric OCR in an end-to-end vision-language model. It addresses the gap in existing OCR methods by incorporating a comprehensive data engineering approach and a two-stage SFT-RL training method. The results show that OCRVerse performs competitively across different types of data, even matching large-scale models.
论文提出了OCRVerse,这是一种端到端的统一方法,适用于文本中心OCR和视觉中心OCR。它通过构建全面的数据工程并提出两阶段SFT-RL训练方法来解决现有OCR方法的局限性。实验结果表明,OCRVerse在各种类型的数据上表现优异,甚至可以与大规模开源和闭源模型相媲美。
PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization
Authors: Songhan Jiang, Fengchun Liu, Ziyue Wang, Linghan Cai, Yongbing Zhang
First: 2026-01-29T12:21:16+00:00 · Latest: 2026-01-29T12:21:16+00:00
Abstract
Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly outputting conclusions without verifiable, evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on this dataset, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at https://github.com/cyclexfy/PathReasoner-R1.
中文标题/摘要
标题:PathReasoner-R1: 通过知识引导的策略优化将结构化推理融入病理视觉语言模型
视觉语言模型(VLMs)正在推动计算病理学的发展,具备卓越的视觉理解能力。然而,当前系统往往直接输出结论而缺乏可验证的证据链推理,这严重限制了临床信任并阻碍了专家错误纠正。为解决这些问题,我们构建了PathReasoner,这是首个大规模的全切片图像(WSI)推理数据集。不同于以往依赖未经验证的蒸馏工作,我们开发了一个严格的知识引导生成管道。通过利用医学知识图谱,我们明确地将结构化的病理发现和临床推理与诊断对齐,生成了超过20000个高质量的指导样本。基于该数据库,我们提出了PathReasoner-R1,该模型结合了轨迹掩蔽监督微调与推理导向的强化学习,以植入结构化的推理链能力。为了确保医学严谨性,我们设计了一个知识感知的多粒度奖励函数,其中包括严格与知识图谱对齐的实体奖励机制。这有效地引导模型优化逻辑一致性而非仅仅匹配结果,从而增强其鲁棒性。大量实验表明,PathReasoner-R1 在PathReasoner和公共基准测试中均实现了最先进的性能,为病理模型提供了透明且临床相关的推理能力。数据集和代码可在https://github.com/cyclexfy/PathReasoner-R1 获取。
Summary / 总结
The research aims to enhance the reasoning capabilities of Vision-Language Models (VLMs) in computational pathology by introducing PathReasoner-R1, which integrates structured reasoning through a knowledge-guided generation pipeline. The method involves creating a large-scale dataset of whole-slide image reasoning and using a knowledge-aware reward function to guide the model towards logical consistency. Key findings show that PathReasoner-R1 outperforms existing models on both PathReasoner and public benchmarks, providing transparent and clinically grounded reasoning capabilities.
研究旨在通过引入知识引导生成管道来增强Vision-Language Models (VLMs)在计算病理学中的诊断推理能力。方法包括创建一个包含超过20K与医学知识图谱对齐的指令样本的大规模数据集PathReasoner。基于该数据集的PathReasoner-R1模型结合了轨迹掩蔽监督微调与推理导向的强化学习,以植入结构化的推理能力。实验结果表明,PathReasoner-R1在PathReasoner和公共基准测试上均优于现有模型,提供了透明且临床依据的推理能力。
WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models
Authors: Zijin Yang, Yu Sun, Kejiang Chen, Jiawei Zhao, Jun Jiang, Weiming Zhang, Nenghai Yu
First: 2026-01-29T12:14:32+00:00 · Latest: 2026-01-29T12:14:32+00:00
Abstract
Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.
中文标题/摘要
标题:WMVLM:通过视觉语言模型评估扩散模型图像水印
数字水印对于保护来自扩散模型的生成图像至关重要。准确的水印评估对于算法开发至关重要,但现有方法存在显著局限性:缺乏统一框架处理残差和语义水印,结果缺乏可解释性,忽视了全面的安全考虑,且经常使用不合适的语义水印度量标准。为解决这些差距,我们提出了WMVLM,这是首个通过视觉语言模型(VLMs)统一且可解释的扩散模型图像水印评估框架。我们重新定义了每种水印类型的质量和安全性度量标准:残差水印通过艺术强度和擦除抗性进行评估,而语义水印则通过潜在分布偏移进行评估。此外,我们引入了三阶段训练策略,逐步使模型实现分类、评分和可解释的文本生成。实验表明,WMVLM在不同数据集、扩散模型和水印方法上的泛化能力优于最先进的VLMs。
Summary / 总结
The paper addresses the need for a unified and interpretable framework to evaluate digital watermarks in images generated by diffusion models. It introduces WMVLM, which uses vision-language models to assess both residual and semantic watermarks based on artifact strength, erasure resistance, and latent distribution shifts. Experiments demonstrate that WMVLM outperforms existing methods with strong generalization capabilities across various datasets, diffusion models, and watermarking techniques.
研究旨在利用视觉语言模型开发一种统一且可解释的扩散模型图像水印评估框架。方法提出了针对残差和语义水印的质量和安全度量标准,并采用三阶段训练策略。关键实验结果显示,WMVLM在不同数据集、扩散模型和水印方法上具有较强的泛化能力,优于现有方法。
Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting
Authors: Sangoh Lee, Sangwoo Mo, Wook-Shin Han
First: 2025-12-23T03:13:39+00:00 · Latest: 2026-01-29T12:02:16+00:00
Comments: Project page with videos and code: https://vap-project.github.io/
Abstract
While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
中文标题/摘要
标题:Bring My Cup!使用视觉注意提示个性化视觉-语言-行动模型
尽管视觉-语言-行动(VLA)模型在通用指令上表现良好,但在处理个性化命令如“bring my cup”时却遇到困难,其中机器人必须在视觉上相似的对象中执行特定实例的操作。我们研究了操作个人物品的场景,在这种场景中,VLA 必须使用少量参考图像识别并控制训练期间未见过的用户特定对象。为了解决这一挑战,我们提出了视觉注意提示(VAP),这是一种简单而有效的无需训练的感知适配器,为冻结的VLA提供自上而下的选择性注意。VAP 将参考图像视为非参数化的视觉记忆,通过开放式词汇检测和基于嵌入的匹配将个人对象定位在场景中,然后通过突出显示对象并重写指令将这种定位作为视觉提示注入。我们构建了两个模拟基准,个性化-SIMPLER 和 个性化-VLABench,以及一个真实世界的桌面基准,以评估多个机器人和任务中的个性化操作。实验表明,VAP 在成功率和正确对象操作方面始终优于通用策略和标记学习基线,有助于弥合语义理解和实例级控制之间的差距。
Summary / 总结
The research addresses the challenge of personalizing vision-language-action models to handle specific commands like 'bring my cup,' where the robot must identify and manipulate a particular object among similar ones. It introduces Visual Attentive Prompting (VAP), a training-free method that enhances frozen VLA models with top-down selective attention using reference images as a visual memory. Experiments show VAP improves success rates and correct-object manipulation compared to generic policies and token-learning baselines, bridging the gap between semantic understanding and instance-level control.
研究旨在提高Vision-Language-Action (VLA)模型处理个性化命令如'bring my cup'的能力,即机器人必须在相似物体中识别并操作特定物体。方法Visual Attentive Prompting (VAP) 使用少量参考图像引导VLA的注意力到正确物体上,增强其实例级控制。实验表明VAP在成功率和正确物体操作方面优于通用策略和token学习基线,适用于多种任务和机器人。
Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening
Authors: Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar
First: 2026-01-29T12:01:53+00:00 · Latest: 2026-01-29T12:01:53+00:00
Abstract
Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model's generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
中文标题/摘要
标题:可扩展的功率采样:通过分布锐化解锁LLMs无需训练的高效推理
强化学习(RL)后训练是提高大型语言模型(LLMs)推理性能的主要方法,但越来越多的证据表明,其主要增益来自于分布锐化而非新能力的获取。最近的研究表明,使用马尔可夫链蒙特卡洛(MCMC)从LLMs的幂分布中采样可以恢复与RL后训练相当的性能,且无需依赖外部奖励;然而,MCMC的高计算成本使其在广泛应用中不可行。在本文中,我们提出了一种理论依据的方法,消除了迭代MCMC的需要。我们推导出一种新的公式,表明全局幂分布可以由一个按令牌缩放的低温分布近似,其中缩放因子捕捉未来轨迹的质量。利用这一洞察,我们引入了一种无需训练和验证者的算法,以自回归方式锐化基础模型的生成分布。实验上,我们在四个LLMs上对数学、问答和代码任务进行了评估,表明我们的方法在不依赖任何外部奖励的情况下,能够匹配或超越单次GRPO,同时将推理延迟降低超过10倍,相比基于MCMC的采样方法。
Summary / 总结
This work addresses the high computational cost of using Markov chain Monte Carlo (MCMC) for distribution sharpening in large language models (LLMs), proposing a novel method that approximates the global power distribution with a token-level scaled low-temperature distribution. The method, which is training-free and verifier-free, sharpens the generative distribution autoregressively. Experiments on math, QA, and code tasks across four LLMs demonstrate that this approach matches or surpasses one-shot GRPO without external rewards and reduces inference latency by over 10x compared to MCMC-based sampling.
本文旨在解决使用马尔可夫链蒙特卡洛(MCMC)进行大规模语言模型(LLM)分布锐化时的计算效率问题,提出了一种无需训练和验证的新方法。通过将全局功率分布近似为一个按令牌缩放的低温度分布,作者在无需外部奖励的情况下实现了与强化学习后训练相当的性能,并将推理延迟显著降低了超过10倍,相比基于MCMC的采样方法。
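The relation between the global power distribution and a token-level low-temperature distribution can be checked on a two-token toy model. The abstract does not give the paper's formulation, so the sketch relies only on an elementary identity: under the sequence-level power distribution, proportional to p(x1, x2)^alpha, the marginal of the first token is proportional to p(x1)^alpha times the sum over x2 of p(x2 | x1)^alpha. The first factor is ordinary low-temperature sampling with temperature 1/alpha, and the second plays the role of a scaling factor that reflects the quality of future continuations. All function names and probabilities are illustrative assumptions.

```python
def low_temperature_first_token(p_first, alpha):
    """Token-level sharpening only: proportional to p(x1)^alpha."""
    weights = {tok: p ** alpha for tok, p in p_first.items()}
    z = sum(weights.values())
    return {tok: w / z for tok, w in weights.items()}

def scaled_first_token(p_first, p_next, alpha):
    """Low-temperature weight times a future-quality scaling factor."""
    weights = {}
    for tok, p in p_first.items():
        future = sum(q ** alpha for q in p_next[tok].values())
        weights[tok] = (p ** alpha) * future
    z = sum(weights.values())
    return {tok: w / z for tok, w in weights.items()}

def brute_force_first_token(p_first, p_next, alpha):
    """Marginal of the first token under the sequence-level power distribution."""
    seq_weights = {}
    for a, pa in p_first.items():
        for b, pb in p_next[a].items():
            seq_weights[(a, b)] = (pa * pb) ** alpha
    z = sum(seq_weights.values())
    marginal = {}
    for (a, _), w in seq_weights.items():
        marginal[a] = marginal.get(a, 0.0) + w / z
    return marginal

# Both first tokens are equally likely, but "A" leads to a confident
# continuation while "B" leads to an uncertain one.
p_first = {"A": 0.5, "B": 0.5}
p_next = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.5, "y": 0.5}}
alpha = 4

low_temp = low_temperature_first_token(p_first, alpha)
scaled = scaled_first_token(p_first, p_next, alpha)
exact = brute_force_first_token(p_first, p_next, alpha)
```

In this example plain low-temperature sampling leaves the two first tokens tied, whereas the scaled version matches the exact power-distribution marginal and prefers "A". Enumerating every continuation is only feasible in a toy setting; how the scaling factor is approximated for real LLMs is the contribution of the paper and is not reproduced here.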
DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Authors: Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal
First: 2025-10-23T17:42:14+00:00 · Latest: 2026-01-29T11:37:31+00:00
Abstract
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that dramatically exceed the training resolution, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
中文标题/摘要
标题:DyPE:动态位置外推在超高清扩散中的应用
扩散变换器模型能够生成具有非凡保真度和细节的图像,但由于自注意力机制与图像标记数量的平方级扩展,训练它们在超高清分辨率上仍然极其昂贵。在本文中,我们提出了动态位置外推(DyPE),这是一种无需训练的新颖方法,使预训练的扩散变换器能够在其训练数据之外生成高得多分辨率的图像,且无需额外的采样成本。DyPE 利用了扩散过程中固有的频谱进展,其中低频结构早期收敛,高频则需要更多步骤来解决。具体而言,DyPE 在每次扩散步骤中动态调整模型的位置编码,使其频谱与生成过程的当前阶段相匹配。这种方法使我们能够在远超训练分辨率的分辨率下生成图像,例如,使用 FLUX 生成 16 百万像素的图像。在多个基准测试中,DyPE 一致地提高了性能,并在超高清图像生成中达到了最先进的保真度,尤其是在更高分辨率下,性能提升更为显著。项目页面可在 https://noamissachar.github.io/DyPE/ 获取。
Summary / 总结
DyPE is a training-free method that enables pre-trained diffusion transformers to synthesize images at ultra-high resolutions by dynamically adjusting the model's positional encoding at each diffusion step. It leverages the spectral progression of the diffusion process, in which low-frequency structure converges early and high frequencies take more steps to resolve, to match the positional encoding's frequency spectrum to the current generative stage, allowing synthesis at resolutions far beyond the training data (e.g., 16 million pixels using FLUX). DyPE improves performance on multiple benchmarks and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with larger gains at higher resolutions and no additional sampling cost.
DyPE 是一种无需训练的方法,通过在每个扩散步骤中动态调整模型的位置编码,使预训练的扩散变换器能够在超高分辨率下生成图像。该方法利用扩散过程中的频谱进展,使位置编码的频谱与当前生成阶段相匹配,从而能够在远超训练数据的分辨率下生成图像。DyPE 在超高分辨率图像生成中提高了性能,并实现了最先进的保真度,且分辨率越高提升越显著,同时无需额外的采样成本。
Bi-Anchor Interpolation Solver for Accelerating Generative Modeling
Authors: Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen
First: 2026-01-29T10:59:36+00:00 · Latest: 2026-01-29T10:59:36+00:00
Abstract
Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256² demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
中文标题/摘要
标题:双锚点内插求解器加速生成建模
流匹配(FM)模型已成为高保真合成的领先范式。然而,它们依赖于迭代常微分方程(ODE)求解,这造成了显著的延迟瓶颈。现有解决方案面临两难境地:无训练求解器在低神经函数评估次数(NFE)下性能严重下降,而基于训练的一步或少步生成方法则会带来高昂的训练成本,并缺乏即插即用的灵活性。为弥合这一差距,我们提出了双锚点内插求解器(BA-求解器)。BA-求解器保留了标准无训练求解器的灵活性,同时通过在冻结的主干旁引入轻量级的SideNet(主干大小的1-2%),实现了显著加速。具体而言,我们的方法基于两个协同组件:1)双向时间感知,其中SideNet在不重新训练庞大主干的情况下,学习近似未来和历史速度;2)双锚点速度积分,利用SideNet与两个锚点速度来高效近似中间速度,以进行批量高阶积分。通过利用主干建立高精度的“锚点”并使用SideNet加密轨迹,BA-求解器能够在误差最小化的同时使用大步长。在ImageNet-256²上的实验结果表明,BA-求解器仅用10个NFE就能达到与100多个NFE的欧拉求解器相当的生成质量,并且在仅5个NFE下仍能保持高保真度,且训练成本可忽略不计。此外,BA-求解器确保与现有的生成管道无缝集成,便于图像编辑等下游任务。
Summary / 总结
The Bi-Anchor Interpolation Solver (BA-solver) addresses the latency bottleneck of iterative ODE solving in Flow Matching models by adding a lightweight SideNet (1-2% of the backbone size) alongside the frozen backbone. It combines bidirectional temporal perception with bi-anchor velocity integration to approximate intermediate velocities efficiently. On ImageNet-256², BA-solver achieves generation quality comparable to a 100+ NFE (Neural Function Evaluation) Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, at negligible training cost, while integrating seamlessly with existing generative pipelines.
论文提出了一种双锚点插值求解器(BA-solver),以解决Flow Matching模型中的延迟问题。BA-solver在冻结的主干网络旁引入轻量级的SideNet,实现了显著的加速。它通过双向时间感知来近似未来和历史速度,并通过双锚点速度积分高效估计中间速度。实验结果表明,BA-solver仅用10个NFE即可达到与100+ NFE欧拉求解器相当的生成质量,并且在低至5个NFE时仍能保持高保真度,训练成本可忽略不计,同时与现有的生成管道无缝集成,便于图像编辑等下游任务。
On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
Authors: Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu
First: 2026-01-29T10:47:21+00:00 · Latest: 2026-01-29T10:47:21+00:00
Comments: Under Review, 20 pages
Abstract
Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks can substantially overestimate the robustness of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.
中文标题/摘要
标题:大规模视觉语言模型在视觉标记压缩下的对抗鲁棒性研究
视觉标记压缩被广泛用于通过剪枝或合并视觉标记来加速大规模视觉语言模型(LVLMs),但其对抗鲁棒性尚未被探索。我们表明,现有的基于编码器的攻击会显著高估压缩LVLMs的鲁棒性,这是因为优化与推理之间的不匹配:扰动是在完整标记表示上进行优化,而推理则通过标记压缩瓶颈进行。为了解决这一差距,我们提出了压缩对齐攻击(CAGE),该攻击在不假设能够访问所部署的压缩机制或其标记预算的情况下,将扰动优化与压缩推理对齐。CAGE 结合了(i)预期特征破坏,将失真集中在那些在各种合理预算下都可能被保留的标记上,以及(ii)排名失真对齐,主动将标记失真与排名分数对齐,以促进高失真证据的保留。在多种有代表性的即插即用压缩机制和数据集上,我们的结果表明,CAGE 一致地实现了比基线更低的鲁棒准确率。这项工作强调了忽略压缩的鲁棒性评估可能会过于乐观,呼吁对高效的LVLMs进行压缩感知的安全评估和防御。
Summary / 总结
The paper investigates the adversarial robustness of large vision-language models (LVLMs) under visual token compression. It proposes the Compression-AliGnEd (CAGE) attack, which aligns perturbation optimization with compression inference without requiring knowledge of the compression mechanism. The study demonstrates that existing encoder-based attacks overestimate the robustness of compressed LVLMs, and CAGE achieves lower robust accuracy compared to the baseline across various compression mechanisms and datasets, highlighting the need for compression-aware security evaluations.
论文研究了大型视觉-语言模型(LVLMs)在视觉标记压缩下的对抗鲁棒性。提出了压缩对齐攻击(CAGE),该攻击在不需要了解压缩机制或其标记预算的情况下,使扰动优化与压缩推理对齐。研究显示,现有的基于编码器的攻击高估了压缩LVLMs的鲁棒性,而CAGE在各种压缩机制和数据集上取得的鲁棒准确率均低于基线,强调了需要进行压缩感知的安全评估和防御措施。
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
Authors: Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, Ju Fan
First: 2026-01-29T10:06:52+00:00 · Latest: 2026-01-29T10:06:52+00:00
Abstract
Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design.
中文标题/摘要
标题:ETS:能量引导的测试时缩放以实现无需训练的RL对齐
语言模型的强化学习(RL)后训练对齐在实践中有效,但代价高昂且不稳定,因为其复杂的训练过程。为了解决这个问题,我们提出了一种无需训练的推理方法,可以直接从最优RL策略中采样。应用于掩码语言模型(MLM)的转换概率包括一个参考策略模型和一个能量项。在此基础上,我们的算法能量引导的测试时缩放(ETS)通过在线蒙特卡洛估计关键的能量项,并具有可证明的收敛速率。此外,为了确保实际效率,ETS 利用现代加速框架和定制的重要性采样估计器,大幅减少了推理延迟,同时可证明地保持了采样质量。在包括自回归模型和扩散语言模型的MLM(涵盖推理、编码和科学基准)实验中,我们的ETS始终提高了生成质量,验证了其有效性和设计。
Summary / 总结
The research aims to address the cost and instability of RL post-training alignment for language models. The proposed method, Energy-Guided Test-Time Scaling (ETS), is a training-free inference approach that uses an energy term to sample directly from the optimal RL policy. ETS estimates the energy term via online Monte Carlo and leverages modern acceleration frameworks and importance sampling estimators to reduce inference latency while maintaining sampling quality. Experiments on various language models across different benchmarks demonstrate that ETS improves generation quality, validating its effectiveness and design.
研究提出了一种无需训练的推理方法——能量引导的测试时缩放(ETS),以解决语言模型的 RL 后训练对齐成本高且不稳定的问题。应用于 Masked Language Modeling (MLM) 的转移概率由参考策略模型和能量项组成;ETS 通过在线蒙特卡洛估计该能量项,并具有可证明的收敛率。此外,ETS 利用现代加速框架和定制的重要性采样估计器,大幅减少推理延迟同时保持采样质量。实验表明,ETS 在推理、编码和科学基准测试中提高了生成质量,验证了其有效性和设计。
NOSA: Native and Offloadable Sparse Attention
Authors: Yuxiang Huang, Pengjie Wang, Jicheng Han, Weilin Zhao, Zhou Su, Ao Sun, Hongya Lyu, Hengyu Zhao, Yudong Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
First: 2025-10-15T14:33:16+00:00 · Latest: 2026-01-29T08:26:24+00:00
Comments: Preprint
Abstract
Decoding throughput improvements from larger inference batches are limited by GPU memory, which is largely consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-generation quality due to training-inference mismatch on sparse patterns. Meanwhile, trainable sparse attention is incompatible with efficient offloading, as unconstrained KV accesses may force large CPU-to-GPU transfers and erase throughput gains. To this end, we propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading. NOSA explicitly constrains the volume of CPU-GPU KV transfers, thereby achieving low communication overhead and high decoding throughput. We further build NOSI, a KV cache offloading inference system that fully unlocks NOSA's efficiency. Empirical results on 1B, 3B, and 8B LLMs demonstrate that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, while boosting decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. We release our code at https://github.com/thunlp/NOSA.
中文标题/摘要
标题:NOSA:原生可卸载的稀疏注意机制
从更大推理批次中获得的解码吞吐量提升受限于GPU内存,而GPU内存大部分被键值(KV)缓存消耗。先前的无训练KV缓存卸载通过在CPU上保留冗余上下文并仅获取稀疏子集来进行注意力计算,从而缓解了这一问题,但往往会因稀疏模式上的训练-推理不匹配而降低长生成质量。同时,可训练的稀疏注意机制与高效的卸载不兼容,因为不受约束的KV访问可能会强制进行大量CPU到GPU的数据传输,从而抵消吞吐量提升。为此,我们提出NOSA,一种原生设计用于KV缓存卸载的可训练稀疏注意机制。NOSA 明确限制了CPU-GPU KV传输的数据量,从而实现低通信开销和高解码吞吐量。我们进一步构建了NOSI,一个KV缓存卸载推理系统,充分释放了NOSA的效率。在1B、3B和8B大语言模型上的实验证明,NOSA在通用、长输入和长生成任务上优于KV缓存卸载基线,同时解码吞吐量最高分别达到FullAttn的5.04倍、InfLLMv2的1.92倍和ShadowKV的1.83倍。我们在 https://github.com/thunlp/NOSA 发布了代码。
Summary / 总结
NOSA is a trainable sparse attention mechanism designed for efficient key-value cache offloading, which constrains CPU-GPU data transfers to maintain low communication overhead and high decoding throughput. Experiments on 1, 3, and 8B LLMs show that NOSA outperforms existing baselines, achieving up to 5.04x, 1.92x, and 1.83x higher decoding throughput compared to FullAttn, InfLLMv2, and ShadowKV, respectively.
NOSA 是一种可训练的稀疏注意力机制,专为高效的键值缓存卸载而设计,通过限制 CPU-GPU 数据传输量来保持低通信开销和高解码吞吐量。实验表明,NOSA 在 1B、3B 和 8B 语言模型上的表现优于现有基线,解码吞吐量最高分别达到 FullAttn、InfLLMv2 和 ShadowKV 的 5.04 倍、1.92 倍和 1.83 倍。
Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models
Authors: Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, Sadao Kurohashi
First: 2025-10-08T16:46:57+00:00 · Latest: 2026-01-29T07:52:53+00:00
Comments: Accepted at EACL 2026 (Main). Our code will be available at: https://github.com/ku-nlp/language-specific-dimensions
Abstract
Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.
中文标题/摘要
标题:语言存在于稀疏维度中:面向可解释和高效的多语言控制的大语言模型
大语言模型尽管接触的非英语数据有限,仍表现出强大的多语言能力。先前的研究表明,以英语为中心的大语言模型在中间层将多语言内容映射到英语对齐的表示,然后在最终层将其投影回目标语言的标记空间。从这一观察出发,我们假设这种跨语言过渡是由一组小而稀疏的维度控制的,这些维度在中间层到最终层之间出现在一致的索引位置上。基于这一见解,我们提出了一种简单的、无需训练的方法来识别和操纵这些维度,最少仅需50句平行或单语数据。在多语言生成控制任务上的实验揭示了这些维度的可解释性,表明对这些维度的干预可以在保留语义内容的同时切换输出语言,并且以低得多的成本超过了先前基于神经元的方法的性能。
Summary / 总结
This study investigates the sparse dimensions that govern the multilingual capabilities of large language models. By hypothesizing that a small set of dimensions at consistent indices across layers controls cross-lingual transitions, the researchers propose a training-free method using minimal data to manipulate these dimensions. Experiments show that interventions in these dimensions can switch the output language while maintaining semantic content, outperforming previous neuron-based approaches with lower costs.
研究旨在通过识别控制跨语言转换的稀疏维度来提升大型语言模型的可解释性和效率。所提方法无需训练,最少仅需 50 句平行或单语数据即可识别并操控这些维度。实验表明,对这些维度的干预可以切换输出语言同时保持语义内容,并且在成本更低的情况下优于先前基于神经元的方法。
The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition
Authors: Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao
First: 2025-12-31T19:00:03+00:00 · Latest: 2026-01-29T06:04:53+00:00
Abstract
The open-weight language model ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single breaker token that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and evades outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge
中文标题/摘要
标题:词汇中的特洛伊木马:LLM 组合中的隐蔽破坏
开放权重语言模型生态系统越来越多地由模型组合技术(如权重合并、推测解码和词汇扩展)所定义,这些技术重新混合来自不同来源的能力。要在不同模型家族之间应用这些方法,一个关键的前提是分词器移植,它将不兼容的词汇表对齐到共享的嵌入空间。我们证明了这一关键的互操作性步骤引入了一个供应链漏洞:我们设计了一个单一的破坏性标记,它在捐赠模型中功能上是惰性的,但在移植到基础模型后会可靠地重建为一个高显著性的恶意特征。通过利用系数重用的几何特性,我们的攻击破坏了基础模型的生成能力,同时使捐赠模型的效用在统计上与正常行为无法区分。我们将此问题形式化为一个双目标优化问题,并使用稀疏求解器实例化了攻击。实验表明,该攻击无需训练且能逃避异常检测,同时在微调和权重合并下表现出结构性持久性,突显了模块化人工智能组合管道中的隐藏风险。代码可在 https://github.com/xz-liu/tokenforge 获取
Summary / 总结
The paper investigates a supply-chain vulnerability in language model composition, focusing on the tokenizer transplant step that aligns incompatible vocabularies to a shared embedding space. The authors engineer a single breaker token that is functionally inert in the donor model but reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, the attack sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. The attack is training-free, evades outlier detection, and persists through fine-tuning and weight merging, highlighting a hidden risk in modular AI composition pipelines. Code for the attack is available on GitHub.
研究关注分词器移植带来的安全风险,分词器移植是跨模型家族应用模型组合技术的关键前提步骤。研究设计了一个单一的“破坏性标记”,该标记在捐赠模型中功能上是惰性的、无害的,但在移植到基础模型后会重建为一个恶意特征。攻击利用系数重用的几何特性来破坏基础模型的生成能力,而不影响捐赠模型的性能,并且在微调和权重合并之后依然有效。这项工作揭示了模块化AI组合管道中的隐藏风险,并将该攻击形式化为一个双目标优化问题。实验表明,该攻击无需训练且能够逃避异常检测。攻击代码可在GitHub上获取。