arXiv Paper Digest

2026-03-20 03:56
Snapshot: 20260320_0356
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Authors: Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee
First: 2026-03-18T17:59:56+00:00 · Latest: 2026-03-18T17:59:56+00:00
Abstract
Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.
中文标题/摘要
标题:统一时空令牌评分以提高视频VLMs的效率
令牌剪枝对于提高视觉语言模型(VLMs)的计算效率至关重要,特别是在视频任务中,时间冗余普遍存在。先前的方法通常仅在视觉变换器(ViT)内剪枝令牌,适用于单模态感知任务如动作识别和对象分割,而不适应下游视觉语言任务;或者仅在LLM内剪枝令牌,而保留ViT输出不变,通常需要复杂的文本条件令牌选择机制。在本文中,我们引入了时空令牌评分(STTS),这是一种简单且轻量级的模块,可以在ViT和LLM之间剪枝视觉令牌,无需文本条件或令牌合并,并且完全兼容端到端训练。通过学习如何通过辅助损失学习时间评分以及通过LLM下游梯度学习空间评分,借助我们高效的打包算法,STTS在整个架构中剪枝了50%的视觉令牌,从而在训练和推理过程中效率提高了62%,并且平均性能下降了0.7%。随着每段视频采样帧数的增加,效率提升更加明显。在长视频问答测试时应用缩放进一步提高了0.5-1%的性能,与基线相比。总体而言,STTS代表了一种新颖、简单而有效的统一架构视觉令牌剪枝技术。
Summary / 总结
This paper addresses the computational cost of video vision-language models by introducing Spatio-Temporal Token Scoring (STTS), a lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging. STTS learns temporal scoring through an auxiliary loss and spatial scoring through downstream LLM gradients, and is compatible with end-to-end training. Pruning 50% of vision tokens improves training and inference efficiency by 62% with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains grow with more sampled frames per video, and test-time scaling for long-video QA yields a further 0.5-1% gain over the baseline.
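This paper proposes Spatio-Temporal Token Scoring (STTS), which prunes vision tokens across both the vision transformer and the language model of a VLM without text conditioning or token merging. Pruning 50% of vision tokens yields a 62% efficiency improvement while holding the average performance drop to 0.7% across 13 video QA tasks. The module learns to score tokens both spatially and temporally and is fully compatible with end-to-end training, making it a simple and effective way to make video VLMs more efficient.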
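The pruning step described above can be illustrated with a minimal sketch. It assumes per-token scores are already available as plain floats; in STTS the scores are learned, and the function name `prune_tokens` and the example scores are hypothetical, not taken from the paper.

```python
def prune_tokens(scores, keep_ratio=0.5):
    """Return the indices of the highest-scoring tokens, in original order.

    `scores` stands in for learned per-token scores (an assumption here).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering so positional structure is preserved.
    return sorted(ranked[:n_keep])


scores = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7]   # hypothetical scores for six vision tokens
kept = prune_tokens(scores, keep_ratio=0.5)
```

With a keep ratio of 0.5, half of the tokens are dropped before later layers see them, which is the sense in which pruning applies "throughout the entire architecture".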
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Authors: Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
First: 2026-03-18T17:59:10+00:00 · Latest: 2026-03-18T17:59:10+00:00
Comments: Project Page: https://kevinqu7.github.io/loc3r-vlm
Abstract
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
中文标题/摘要
标题:Loc3R-VLM:基于语言的空间定位和三维推理
多模态大型语言模型(MLLMs)在连接视觉和语言方面取得了显著进展,但仍然难以理解空间关系和视角相关的推理。最近的努力旨在通过几何提示增强输入表示,而不是明确地教模型在三维空间中进行推理。我们提出了Loc3R-VLM框架,该框架使二维视觉语言模型具备从单目视频输入中获得的高级三维理解能力。受人类空间认知的启发,Loc3R-VLM依赖于两个联合目标:全局布局重建以构建场景结构的整体表示,以及明确的情境建模以锚定主观视角。这些目标提供了直接的空间监督,使感知和语言在三维上下文中得到约束。为了确保几何一致性并实现度量级对齐,我们利用从预训练的三维基础模型中提取的轻量级相机姿态先验。Loc3R-VLM在基于语言的空间定位方面达到了最先进的性能,并在基于图像和视频的现有方法上在涉及和一般三维问答基准测试中表现出色,证明了我们的空间监督框架能够实现强大的三维理解。项目页面:https://kevinqu7.github.io/loc3r-vlm
Summary / 总结
Loc3R-VLM is a framework that equips 2D Vision-Language Models with 3D understanding from monocular video input. It combines global layout reconstruction with explicit situation modeling to provide direct spatial supervision, and uses lightweight camera pose priors from a pre-trained 3D foundation model for geometric consistency and metric-scale alignment. Loc3R-VLM achieves state-of-the-art language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks.
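Loc3R-VLM is a framework that strengthens the 3D understanding of 2D vision-language models from monocular video input. It relies on global layout reconstruction and explicit situation modeling to provide spatial supervision. The approach achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based methods on 3D question-answering benchmarks.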
Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search
Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi
First: 2026-03-17T16:02:38+00:00 · Latest: 2026-03-18T17:58:04+00:00
Comments: 14 pages, 9 figures
Abstract
We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.
中文标题/摘要
标题:Search2Motion:无需训练的对象级运动控制
我们提出了Search2Motion,一种无需训练的框架,用于图像到视频生成中的对象级运动编辑。与需要轨迹、边界框、掩码或运动场的先前方法不同,Search2Motion 采用目标帧基于的控制,利用首尾帧运动先验来实现对象重定位,同时保持场景稳定性,无需微调。通过语义引导的对象插入和鲁棒的背景修复,实现了可靠的目标帧构建。我们进一步展示了早期步骤的自我注意力图预测对象和相机动力学,提供可解释的用户反馈,并激发了ACE-Seed(注意力共识早期步骤种子选择)这一轻量级搜索策略,该策略在无需前瞻采样或外部评估者的情况下提高了运动保真度。鉴于现有基准混淆了对象和相机运动,我们引入了S2M-DAVIS和S2M-OMB进行稳定相机、对象仅有的评估,以及FLF2V-obj指标,该指标隔离了对象伪影,无需真实轨迹。Search2Motion 在FLF2V-obj 和 VBench 上均优于基线。
The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering
Authors: Yigit Ekin, Yossi Gandelsman
First: 2026-03-18T17:57:53+00:00 · Latest: 2026-03-18T17:57:53+00:00
Comments: Project Page: https://yigitekin.github.io/diffusion-sliders
Abstract
We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.
中文标题/摘要
标题:基于文本嵌入插值的无训练连续图像操控框架
我们提出了一种无需训练的框架,在测试时对文本条件生成模型进行连续可控的图像编辑。与依赖额外训练或手动用户干预的先前方法不同,我们发现简单的文本嵌入空间中的操控足以产生平滑的编辑控制。给定一个目标概念(例如,增强照片逼真度或改变面部表情),我们使用大型语言模型自动生成一组去偏见的对比提示对,从中计算生成器文本编码器空间中的操控向量。然后,我们将此向量直接添加到输入提示表示中,以沿所需的语义轴控制生成。为了获得连续控制,我们提出了一种弹性范围搜索程序,自动识别有效的操控幅度范围,避免过度操控(改变其他属性)和不足操控(无编辑)。在该范围内添加该向量的缩放版本可产生平滑且连续的编辑。由于我们的方法仅修改文本表示,因此自然适用于文本条件的各种模态,包括图像和视频生成。为了量化操控连续性,我们引入了一个新的评估指标,该指标衡量编辑强度下语义变化的均匀性。我们比较了不同方法的连续编辑行为,并发现尽管我们的方法简单且设计轻量,但在与基于训练的替代方法相当的情况下,优于其他无训练方法。
Summary / 总结
This paper introduces a training-free framework for continuous and controllable image editing using text embeddings. By steering in the text-embedding space, the authors achieve smooth control over image generation without additional training or manual intervention. The method uses a large language model to generate prompt pairs and computes a steering vector, which is added to the input prompt to control the generation along desired semantic axes. An elastic range search procedure ensures continuous control by identifying an effective interval of steering magnitudes, resulting in smooth and consistent edits. The approach is shown to be comparable to training-based methods and outperforms other training-free methods in terms of continuous editing behavior.
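The paper proposes a training-free framework for continuous, controllable image editing through text embeddings. It uses a large language model to generate contrastive prompt pairs, computes a steering vector in the text-encoder space, and adds that vector to the input prompt representation to control generation along the desired semantic axis. An elastic range search procedure identifies the interval of steering magnitudes that yields smooth, continuous edits. Experiments show that, despite being simple and lightweight, the method is comparable to training-based approaches in continuous editing behavior and outperforms other training-free methods.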
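The steering idea can be sketched in a few lines. This is only an illustration under stated assumptions: embeddings are plain lists of floats, the steering vector is taken as the mean difference over contrastive pairs (the abstract does not specify the exact formula), and the elastic range search is not reproduced. The names `steering_vector` and `steer` are hypothetical.

```python
def steering_vector(pos_embs, neg_embs):
    """Mean difference between paired positive and negative prompt embeddings.

    Averaging pairwise differences is an assumption made for this sketch.
    """
    n = len(pos_embs)
    dim = len(pos_embs[0])
    return [
        sum(p[d] - q[d] for p, q in zip(pos_embs, neg_embs)) / n
        for d in range(dim)
    ]


def steer(prompt_emb, vector, alpha):
    """Shift a prompt embedding along the steering vector by magnitude alpha."""
    return [x + alpha * v for x, v in zip(prompt_emb, vector)]


# Toy 2-D embeddings for two contrastive prompt pairs (hypothetical values).
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 2.0], [1.0, 2.0]]
vec = steering_vector(pos, neg)
edited = steer([0.0, 0.0], vec, alpha=2.0)
```

Sweeping `alpha` over an interval would then give a family of edits of increasing strength; choosing that interval automatically is what the elastic range search in the abstract is for.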
Versatile Editing of Video Content, Actions, and Dynamics without Training
Authors: Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli
First: 2026-03-18T17:50:56+00:00 · Latest: 2026-03-18T17:50:56+00:00
Comments: Project page at https://dynaedit.github.io/
Abstract
Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modification of motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals, and is thus model-agnostic. We show that naively attempting to adapt this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources for these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.
中文标题/摘要
标题:无需训练即可灵活编辑视频内容、动作和动态
受控视频生成在近年来取得了显著进步。然而,编辑动作和动态事件,或插入应影响其他对象行为的内容,仍然是一个重大挑战。现有的训练模型难以处理复杂的编辑,这可能是因为难以收集相关的训练数据。同样,现有的无训练方法本质上只能进行结构和运动保持的编辑,不支持修改运动或交互。在这里,我们介绍了一种无训练编辑方法DynaEdit,该方法利用预训练的文本到视频流模型解锁了灵活的视频编辑能力。我们的方法依赖于最近引入的无需反演的方法,该方法不会干预模型的内部机制,因此是模型无关的。我们展示了直接尝试将此方法适应于一般不受约束的编辑会导致严重的低频错位和高频抖动。我们解释了这些现象的来源,并引入了克服它们的新机制。通过广泛的实验,我们展示了DynaEdit在复杂的基于文本的视频编辑任务中达到了最先进的效果,包括修改动作、插入与场景交互的对象以及引入全局效果。
Summary / 总结
DynaEdit is a training-free method that uses pretrained text-to-video flow models to enable versatile video editing, including modifying actions, inserting interactive objects, and introducing global effects. It addresses the limitations of existing methods by overcoming low-frequency misalignment and high-frequency jitter through novel mechanisms. Experiments demonstrate that DynaEdit outperforms existing approaches on complex text-based video editing tasks.
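The work tackles the editing of actions and dynamic events in videos, which existing trained models handle poorly, likely because relevant training data is hard to collect. DynaEdit is a training-free method that uses pretrained text-to-video flow models to enable versatile video editing. It introduces new mechanisms to overcome low-frequency misalignment and high-frequency jitter, and achieves state-of-the-art results on complex text-based video editing tasks.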
LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency
Authors: Weilong Yan, Haipeng Li, Hao Xu, Nianjin Ye, Yihao Ai, Shuaicheng Liu, Jingyu Hu
First: 2026-02-21T06:55:28+00:00 · Latest: 2026-03-18T17:12:08+00:00
Comments: Accepted by CVPR2026
Abstract
This paper introduces LaS-Comp, a zero-shot and category-agnostic approach that leverages the rich geometric priors of 3D foundation models to enable 3D shape completion across diverse types of partial observations. Our contributions are threefold: First, LaS-Comp harnesses these powerful generative priors for completion through a complementary two-stage design: (i) an explicit replacement stage that preserves the partial observation geometry to ensure faithful completion; and (ii) an implicit refinement stage that ensures seamless boundaries between the observed and synthesized regions. Second, our framework is training-free and compatible with different 3D foundation models. Third, we introduce Omni-Comp, a comprehensive benchmark combining real-world and synthetic data with diverse and challenging partial patterns, enabling a more thorough and realistic evaluation. Both quantitative and qualitative experiments demonstrate that our approach outperforms previous state-of-the-art approaches. Our code and data will be available at https://github.com/DavidYan2001/LaS-Comp.
中文标题/摘要
标题:LaS-Comp:利用潜在空间一致性实现零样本3D补全
本文介绍了LaS-Comp,这是一种零样本且类别无关的方法,利用3D基础模型丰富的几何先验,实现不同类型的不完整观测下的3D形状补全。我们的贡献包括三个方面:首先,我们通过互补的两阶段设计利用这些强大的生成先验进行补全:(i) 显式的替换阶段,保留不完整观测的几何结构,确保补全是忠实的;(ii) 隐式的细化阶段,确保观察区域和合成区域之间的边界无缝衔接。其次,我们的框架无需训练且兼容不同的3D基础模型。第三,我们引入了Omni-Comp,这是一个综合基准,结合了真实世界和合成数据,具有多样且具有挑战性的不完整模式,使评估更加全面和真实。定量和定性的实验表明,我们的方法优于之前的最先进的方法。我们的代码和数据将在https://github.com/DavidYan2001/LaS-Comp/LaS-Comp上提供。
Only relative ranks matter in weight-clustered large language models
Authors: Borja Aizpurua, Sukhbinder Singh, Román Orús
First: 2026-03-18T16:55:13+00:00 · Latest: 2026-03-18T16:55:13+00:00
Comments: 10 pages, 3 figures, 9 tables
Abstract
Large language models (LLMs) contain billions of parameters, yet many exact values are not essential. We show that what matters most is the relative rank of weights (whether one connection is stronger or weaker than another) rather than precise magnitudes. To reduce the number of unique weight values, we apply weight clustering to pretrained models, replacing every weight matrix with K shared values from K-means. For Llama 3.1-8B-Instruct and SmolLM2-135M, reducing each matrix to only 16-64 distinct values preserves strong accuracy without retraining, providing a simple, training-free method to compress LLMs on disk. Optionally fine-tuning only the cluster means (centroids) recovers 30-40 percent of the remaining accuracy gap at minimal cost. We then systematically randomize cluster means while keeping assignments fixed. Scrambling the relative ranks of the clusters degrades quality sharply (perplexity can increase by orders of magnitude) even when global statistics such as mean and variance are preserved. In contrast, rank-preserving randomizations cause almost no loss at mid and late layers. On the other hand, when many layers are perturbed simultaneously, progressive layer-by-layer replacement reveals that scale drift, not rank distortion, is the dominant collapse mechanism; however, an affine correction w' = aw + b with a > 0 (which preserves both rank order and overall weight distribution) can substantially delay this drift. This rank-based perspective offers a new lens on model compression and robustness.
中文标题/摘要
标题:在重量聚类的大语言模型中,相对排名最重要
大语言模型(LLMs)包含数十亿个参数,但许多精确值并不重要。我们表明,最重要的是权重的相对排名——一个连接是否比另一个更强或更弱,而不是精确的大小。为了减少唯一的权重值数量,我们对预训练模型应用权重聚类,将每个权重矩阵替换为K-means的K个共享值。对于Llama 3.1-8B-Instruct和SmolLM2-135M,将每个矩阵减少到仅16-64个不同的值,可以在不重新训练的情况下保持强大的准确性,提供了一种简单且无需训练的方法来压缩LLM。可选地仅微调聚类均值(质心),可以恢复30-40%的剩余准确性差距,且成本极低。然后系统地随机化聚类均值,同时保持分配固定。打乱聚类的相对排名会急剧降低质量——困惑度可以增加几个数量级——即使全局统计量如均值和方差保持不变。相比之下,在中间和后期层,保持排名的随机化几乎不会造成损失。另一方面,当同时扰动许多层时,逐层逐层的替换显示,尺度漂移而不是排名失真是主要的崩溃机制;然而,一个仿射修正w'=aw+b(其中a>0,既保持了排名顺序,也保持了整体权重分布)可以显著延迟这种漂移。基于排名的观点为模型压缩和鲁棒性提供了一个新的视角。
Summary / 总结
The research explores the importance of relative ranks over precise weight values in large language models (LLMs). By applying weight clustering with K-means, the study shows that reducing each weight matrix to 16-64 unique values preserves strong accuracy without retraining. Fine-tuning only the cluster means recovers 30-40 percent of the remaining accuracy gap. With assignments held fixed, randomizations that scramble the relative ranks of the cluster means degrade quality sharply, whereas rank-preserving randomizations cause almost no loss at mid and late layers, indicating that rank preservation is crucial for model performance. The study also finds that scale drift rather than rank distortion is the primary cause of collapse when many layers are perturbed simultaneously, and that an affine correction can substantially delay this drift.
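The study examines why relative rank, rather than exact weight values, matters in large language models (LLMs). Weight clustering with K-means reduces the number of unique weight values while preserving strong accuracy without retraining, and fine-tuning only the cluster centroids recovers part of the lost accuracy. Randomizing the centroids in a way that scrambles their relative ranks, with assignments held fixed, degrades model quality sharply, whereas rank-preserving randomizations cause almost no loss at mid and late layers. When many layers are perturbed simultaneously, scale drift rather than rank distortion is the dominant collapse mechanism, and an affine correction that preserves rank order and the overall weight distribution can substantially delay this drift.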
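The clustering and rank arguments above can be made concrete with a toy sketch. It assumes a flat list of scalar weights rather than a real weight matrix, uses a plain Lloyd's-algorithm K-means written for this illustration (not the authors' code), and the helper names are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns (centroids, assignments)."""
    lo, hi = min(values), max(values)
    # Evenly spaced initial centroids across the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assign


def rank_order(xs):
    """Indices of xs sorted from smallest to largest value."""
    return sorted(range(len(xs)), key=lambda i: xs[i])


weights = [-0.9, -0.8, -0.1, 0.0, 0.1, 0.8, 0.9]   # toy "weight matrix"
centroids, assign = kmeans_1d(weights, k=3)
clustered = [centroids[a] for a in assign]          # every weight replaced by its centroid

# Affine map w' = a*w + b with a > 0 keeps the centroids in the same order.
rescaled = [0.5 * c + 0.2 for c in centroids]
# A permutation of the same centroid values scrambles the order instead.
scrambled = [centroids[2], centroids[0], centroids[1]]
```

The scrambled centroids contain exactly the same values as the originals, so mean and variance are unchanged, yet the ordering between clusters is destroyed. This is the contrast the abstract draws between rank-preserving and rank-scrambling perturbations.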
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
Authors: Zunzhe Zhang, Runhan Huang, Yicheng Liu, Shaoting Zhu, Linzhan Mou, Hang Zhao
First: 2026-03-18T15:27:17+00:00 · Latest: 2026-03-18T15:27:17+00:00
Comments: 10 pages, 6 figures
Abstract
Diffusion models and flow matching have become a cornerstone of robotic imitation learning, yet they suffer from a structural inefficiency where inference is often bound to a fixed integration schedule that is agnostic to state complexity. This paradigm forces the policy to expend the same computational budget on trivial motions as it does on complex tasks. We introduce Generative Control as Optimization (GeCO), a time-unconditional framework that transforms action synthesis from trajectory integration into iterative optimization. GeCO learns a stationary velocity field in the action-sequence space where expert behaviors form stable attractors. Consequently, test-time inference becomes an adaptive process that allocates computation based on convergence, exiting early for simple states while refining longer for difficult ones. Furthermore, this stationary geometry yields an intrinsic, training-free safety signal, as the field norm at the optimized action serves as a robust out-of-distribution (OOD) detector, remaining low for in-distribution states while significantly increasing for anomalies. We validate GeCO on standard simulation benchmarks and demonstrate seamless scaling to pi0-series Vision-Language-Action (VLA) models. As a plug-and-play replacement for standard flow-matching heads, GeCO improves success rates and efficiency with an optimization-native mechanism for safe deployment. Video and code can be found at https://hrh6666.github.io/GeCO/
中文标题/摘要
标题:生成控制作为优化:时间无条件流匹配的自适应和鲁棒机器人控制
扩散模型和流匹配已成为机器人模仿学习的基石,但它们遭受了一种结构上的低效性,即推理往往受限于固定的时间积分计划,这种计划对状态复杂性是无感知的。这种范式迫使策略在简单动作和复杂任务上花费相同的计算预算。我们提出了生成控制作为优化(GeCO),这是一种时间无条件框架,将动作合成从轨迹积分转变为迭代优化。GeCO 在动作序列空间中学习一个不变的速度场,其中专家行为形成稳定的吸引子。因此,测试时的推理成为一种根据收敛性进行自适应的过程——对于简单状态提前退出,而对于困难状态则进行更长时间的细化。此外,这种不变的几何结构提供了一个内在的、无需训练的安全信号,优化动作的速度场在正常状态下的值较低,而在异常状态下显著增加。我们在标准的仿真基准上验证了 GeCO,并展示了其无缝扩展到 pi0 系列视觉-语言-动作(VLA)模型。作为标准流匹配头部的即插即用替代品,GeCO 通过一种优化原生机制提高了成功率和效率,实现了安全部署。有关视频和代码请参见 https://hrh6666.github.io/GeCO/
Summary / 总结
The research addresses the inefficiency of fixed integration schedules in diffusion models and flow matching for robotic imitation learning, which forces the policy to use the same computational budget regardless of the complexity of the task. GeCO, a time-unconditional framework, transforms action synthesis into iterative optimization, allowing adaptive computation allocation based on convergence. Key findings include improved success rates and efficiency, and an intrinsic safety signal for out-of-distribution detection. GeCO is validated on standard simulation benchmarks and scales well to Vision-Language-Action models, enhancing safe deployment through optimization.
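The work addresses the inefficiency of fixed integration schedules in robotic imitation learning, which force the policy to spend the same computational budget regardless of task complexity. GeCO introduces a time-unconditional framework that turns action synthesis into iterative optimization, so computation is allocated adaptively based on convergence. Key findings include higher success rates and efficiency, along with an intrinsic safety signal for detecting out-of-distribution states. GeCO scales seamlessly to Vision-Language-Action models and serves as a plug-and-play replacement for standard flow-matching heads, supporting safe deployment.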
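The convergence-based inference loop can be sketched as follows. This is a toy under explicit assumptions: the learned velocity field is replaced by a hand-written field that points toward a single "expert" action, and the step size, tolerance, and function names are hypothetical rather than taken from GeCO.

```python
import math


def optimize_action(action, velocity_field, step=0.5, tol=1e-3, max_iters=100):
    """Follow a stationary velocity field until its norm falls below tol.

    Returns (action, iterations_used, final_field_norm).
    """
    for it in range(max_iters):
        v = velocity_field(action)
        norm = math.sqrt(sum(x * x for x in v))
        if norm < tol:                      # converged: exit early
            return action, it, norm
        action = [a + step * x for a, x in zip(action, v)]
    v = velocity_field(action)
    return action, max_iters, math.sqrt(sum(x * x for x in v))


EXPERT = [1.0, 1.0]                         # hypothetical attractor


def toy_field(action):
    """Stand-in for a learned field: points from the action toward EXPERT."""
    return [e - a for e, a in zip(EXPERT, action)]


near_action, near_iters, near_norm = optimize_action([0.9, 1.0], toy_field)
far_action, far_iters, far_norm = optimize_action([-9.0, 1.0], toy_field)
```

A start close to the attractor exits after fewer iterations than a distant one, which is the adaptive-compute behavior the abstract describes. The returned field norm is the quantity the abstract proposes as an OOD signal; in this toy field it always converges to a small value, so the sketch does not demonstrate anomaly detection.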
Steering Video Diffusion Transformers with Massive Activations
Authors: Xianhang Cheng, Yujian Zheng, Zhenyu Xie, Tingting Liao, Hao Li
First: 2026-03-18T15:24:12+00:00 · Latest: 2026-03-18T15:24:12+00:00
Abstract
Despite rapid progress in video diffusion transformers, how their internal model signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden state spikes in video diffusion transformers. We observed that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS achieves consistent improvements in terms of video quality and temporal coherence across different text-to-video models, while introducing negligible computational overhead.
中文标题/摘要
标题:利用大规模激活引导视频扩散变换器
尽管视频扩散变换器取得了快速进展,但如何利用其内部模型信号以最小的开销提升视频生成质量仍鲜有探索。在本工作中,我们研究了大规模激活(MAs)的作用,MAs是视频扩散变换器中罕见的高幅度隐藏状态突跃。我们观察到,MAs在所有视觉标记中一致出现,具有明显的幅度层次:第一帧标记显示出最大的MA幅度,潜在帧边界标记(每个潜在空间时间片段的头部和尾部)显示出较高的但略低的MA幅度,而每个潜在帧内的内部标记保持较高但相对适度的幅度。这种结构化模式表明,模型隐式地优先处理与潜在空间时间分块对齐的标记位置。基于这一观察,我们提出了一种无需训练的类似自我引导的方法——结构化激活引导(STAS),该方法将第一帧和边界标记的MA值引导至缩放后的全局最大参考幅度。STAS在不同文本到视频模型中实现了视频质量和时间连贯性的持续改进,同时引入了可忽略不计的计算开销。
Summary / 总结
This study explores the role of Massive Activations (MAs) in video diffusion transformers to enhance video generation quality. MAs are rare, high-magnitude hidden state spikes that emerge consistently across visual tokens, with a clear hierarchy in magnitude. Based on this observation, the authors propose Structured Activation Steering (STAS), a training-free method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude, achieving consistent improvements in video quality and temporal coherence without significant computational overhead.
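The study investigates how Massive Activations (MAs) in video diffusion transformers can be used to improve video generation quality. The authors observe that MAs reach their largest magnitudes at first-frame and latent-frame boundary positions, and on that basis propose Structured Activation Steering (STAS), a training-free method that steers the MA values at those positions toward a scaled global maximum reference magnitude. STAS delivers consistent gains in video quality and temporal coherence with negligible computational overhead.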
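As a rough illustration of what "steering toward a scaled global maximum reference magnitude" could look like, the sketch below interpolates selected activation values toward a reference. Everything specific here is an assumption: the abstract does not give the update rule, so the linear interpolation, the `scale` and `strength` parameters, and the function name `steer_activations` are hypothetical.

```python
def steer_activations(values, target_idx, scale=0.8, strength=0.5):
    """Move activations at target_idx toward scale * (global max magnitude).

    The interpolation rule is an assumption for illustration only.
    """
    reference = scale * max(abs(v) for v in values)
    steered = list(values)
    for i in target_idx:
        sign = 1.0 if values[i] >= 0 else -1.0
        # Blend the current value with the signed reference magnitude.
        steered[i] = (1.0 - strength) * values[i] + strength * sign * reference
    return steered


# Toy per-token activation magnitudes: index 0 plays the first-frame token,
# index 2 a latent-frame boundary token, the rest interior tokens.
activations = [10.0, 2.0, 6.0, 1.0]
steered = steer_activations(activations, target_idx=[0, 2])
```

Only the selected positions change; interior tokens are left untouched, matching the abstract's statement that steering is applied at first-frame and boundary tokens.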
Training-free Detection of Generated Videos via Spatial-Temporal Likelihoods
Authors: Omer Ben Hayun, Roy Betser, Meir Yossef Levi, Levi Kassel, Guy Gilboa
Venue: CVPR 2026
First: 2026-03-16T09:26:56+00:00 · Latest: 2026-03-18T15:12:55+00:00
Comments: Accepted to CVPR 2026
Abstract
Following major advances in text and image generation, the video domain has surged, producing highly realistic and controllable sequences. Along with this progress, these models also raise serious concerns about misinformation, making reliable detection of synthetic videos increasingly crucial. Image-based detectors are fundamentally limited because they operate per frame and ignore temporal dynamics, while supervised video detectors generalize poorly to unseen generators, a critical drawback given the rapid emergence of new models. These challenges motivate zero-shot approaches, which avoid synthetic data and instead score content against real-data statistics, enabling training-free, model-agnostic detection. We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. We evaluate STALL on two public benchmarks and introduce ComGenVid, a new benchmark with state-of-the-art generative models. STALL consistently outperforms prior image- and video-based baselines. Code and data are available at https://omerbenhayun.github.io/stall-video.
中文标题/摘要
标题:基于空间-时间似然性的无训练检测生成视频
随着文本和图像生成技术的重大进展,视频领域迅速发展,产生了高度逼真且可控的序列。伴随这一进展,这些模型也引发了严重的虚假信息问题,使得可靠检测合成视频变得越来越重要。基于图像的检测器本质上受到限制,因为它们逐帧操作并忽略时间动态,而监督视频检测器对未见过的生成器泛化能力差,这是一个关键缺陷,鉴于新模型的快速涌现。这些挑战促使了零样本方法的发展,这些方法避免使用合成数据,而是将内容与真实数据统计进行评分,从而实现无训练、模型无关的检测。我们引入了STALL,这是一种简单、无训练、理论上合理的检测器,它为视频提供基于似然性的评分,并在概率框架内联合建模空间和时间证据。我们在两个公开基准上评估了STALL,并引入了ComGenVid,这是一个包含最新生成模型的新基准。STALL在所有先前基于图像和视频的基线中表现优异。代码和数据可在https://omerbenhayun.github.io/stall-video/获取。
Summary / 总结
This paper addresses the challenge of detecting synthetic videos by introducing STALL, a training-free detector that scores content based on spatial-temporal likelihoods. Unlike image-based detectors that operate per frame and ignore temporal dynamics, or supervised video detectors that generalize poorly to new generators, STALL models both spatial and temporal evidence within a probabilistic framework, making it model-agnostic and reliable. Experiments on two public benchmarks show that STALL outperforms existing image- and video-based baselines consistently.
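This paper tackles synthetic-video detection by introducing STALL, a training-free detector based on spatial-temporal likelihoods. Unlike image-based detectors, which operate on single frames and ignore temporal dynamics, or supervised video detectors, which generalize poorly to new generators, STALL jointly models spatial and temporal evidence within a probabilistic framework, making it a model-agnostic and reliable detection method. Experiments on two public benchmarks show that STALL consistently outperforms existing image- and video-based baselines.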
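The general idea of scoring a video against real-data statistics can be sketched with a toy likelihood model. This is not STALL's formulation, which the abstract does not spell out: the scalar per-frame features, the independent Gaussian models for spatial and temporal terms, and all names and statistics below are assumptions for illustration.

```python
import math


def gaussian_log_likelihood(x, mean, var):
    """Log-density of a scalar x under a Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def video_score(frame_feats, spatial_stats, temporal_stats):
    """Sum of per-frame (spatial) and frame-difference (temporal) log-likelihoods."""
    spatial = sum(gaussian_log_likelihood(f, *spatial_stats) for f in frame_feats)
    diffs = [b - a for a, b in zip(frame_feats, frame_feats[1:])]
    temporal = sum(gaussian_log_likelihood(d, *temporal_stats) for d in diffs)
    return spatial + temporal


# Hypothetical (mean, variance) statistics estimated from real videos.
SPATIAL = (0.0, 1.0)
TEMPORAL = (0.0, 0.01)

smooth_video = [0.0, 0.05, 0.1]    # small frame-to-frame changes
jittery_video = [0.0, 1.0, 0.0]    # large frame-to-frame changes
smooth_score = video_score(smooth_video, SPATIAL, TEMPORAL)
jittery_score = video_score(jittery_video, SPATIAL, TEMPORAL)
```

A per-frame detector would see little difference between the two toy videos, while the temporal term penalizes the jittery one heavily. No synthetic data is needed to compute either score, which is what makes this style of detection training-free.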
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
Authors: Ziwei Xiang, Fanhu Zeng, Hongjian Fang, Rui-Qi Wang, Renxing Chen, Yanan Zhu, Yi Chen, Peipei Yang, Xu-Yao Zhang
Venue: CVPR 2026
First: 2026-03-18T15:03:43+00:00 · Latest: 2026-03-18T15:03:43+00:00
Comments: Accepted by CVPR 2026 Main Conference
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable success in a range of downstream tasks that require multimodal interaction, but their capabilities come with substantial computational and memory overhead, which hinders practical deployment. Among numerous acceleration techniques, post-training quantization is a popular and effective strategy for reducing memory cost and accelerating inference. However, existing LVLM quantization methods typically measure token sensitivity at the modality level, which fails to capture the complex cross-token interactions and falls short in quantitatively measuring the quantization error at the token level. As tokens interact within the model, the distinction between modalities gradually diminishes, suggesting the need for fine-grained calibration. Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. Extensive experiments on multiple LVLMs under both W4A8 and W3A16 settings show that our method improves accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization, our method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to its full-precision counterpart to only 1.33%. The code is available at https://github.com/ucas-xiang/QIG.
中文标题/摘要
标题:大型视觉语言模型的细粒度后训练量化方法:基于量化感知集成梯度
大型视觉语言模型(LVLMs)在需要多模态交互的下游任务中取得了显著成功,但其能力伴随着巨大的计算和内存开销,阻碍了其实用部署。众多加速技术中,后训练量化是一种流行且有效的方法,用于减少内存成本和加速推理。然而,现有的LVLM量化方法通常在模态级别测量标记敏感性,无法捕捉复杂的跨标记交互,也未能在标记级别定量测量量化误差。随着模型中标记的交互,模态之间的区别逐渐消失,表明需要细粒度校准。受机械可解释性中公理归因的启发,我们引入了一种基于量化感知集成梯度(QIG)的细粒度量化策略,利用集成梯度定量评估标记敏感性,并将粒度从模态级别推至标记级别,反映跨模态和内模态动力学。在W4A8和W3A16设置下的多个LVLMs上进行的广泛实验表明,我们的方法在模型和基准测试中提高了准确性,且几乎无延迟开销。例如,在3比特权重量化下,我们的方法将LLaVA-onevision-7B的平均准确性提高了1.60%,使其与全精度版本的差距缩小至仅1.33%。代码可在https://github.com/ucas-xiang/QIG/ 获取。
Summary / 总结
This paper proposes a fine-grained post-training quantization method for large vision language models based on Quantization-aware Integrated Gradients (QIG). The method uses integrated gradients to measure sensitivity at the token level rather than the modality level, reflecting both inter-modality and intra-modality dynamics. Experiments under W4A8 and W3A16 settings show improved accuracy across models and benchmarks with negligible latency overhead. For example, under 3-bit weight-only quantization the method improves the average accuracy of LLaVA-onevision-7B by 1.60%, reducing the gap to the full-precision model to only 1.33%.
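This paper proposes a fine-grained post-training quantization method based on Quantization-aware Integrated Gradients (QIG) to ease the deployment of large vision language models (LVLMs). The method evaluates sensitivity at the token level, capturing inter-modality and intra-modality dynamics more precisely than existing methods that measure sensitivity at the modality level. Experiments show improved accuracy across multiple LVLMs with little latency overhead. For example, under 3-bit weight-only quantization, the accuracy of LLaVA-onevision-7B improves by 1.60%, narrowing the gap to the full-precision model to 1.33%.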
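For readers unfamiliar with the attribution technique that QIG builds on, the sketch below computes standard integrated gradients for a toy function with a known gradient. It is not the quantization-aware variant from the paper; the toy function, the zero baseline, and the midpoint Riemann sum are choices made for this illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum.

    attribution_i = (x_i - baseline_i) * average of dF/dx_i along the
    straight path from baseline to x.
    """
    dim = len(x)
    totals = [0.0] * dim
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(dim):
            totals[i] += grad[i]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def f(p):
    """Toy model output: f(x0, x1) = x0^2 + 3*x1."""
    return p[0] ** 2 + 3.0 * p[1]


def grad_f(p):
    """Analytic gradient of f."""
    return [2.0 * p[0], 3.0]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(grad_f, x, baseline)
```

The attributions sum to `f(x) - f(baseline)`, the completeness axiom of integrated gradients. That property is what makes the scores usable as a quantitative per-input sensitivity measure, which the abstract applies at the token level.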
YOLO26: An Analysis of NMS-Free End to End Framework for Real-Time Object Detection
Authors: Sudip Chakrabarty
First: 2026-01-19T09:36:08+00:00 · Latest: 2026-03-18T14:50:45+00:00
Abstract
The "You Only Look Once" (YOLO) framework has long served as a standard for real-time object detection, though traditional iterations have utilized Non-Maximum Suppression (NMS) post-processing, which introduces specific latency and hyperparameter variables. This paper presents a comprehensive architectural analysis of YOLO26, a model that shifts toward a native end-to-end learning strategy by eliminating NMS. This study examines the core mechanisms driving this framework: the MuSGD optimizer for backbone stabilization, Small-Target-Aware Label Assignment (STAL), and ProgLoss for dynamic supervision. To contextualize its performance, this article reviews exhaustive benchmark data from the COCO val2017 leaderboard. This evaluation provides an objective comparison of YOLO26 across various model scales (Nano to Extra-Large) against both prior CNN lineages and contemporary Transformer-based architectures (e.g., RT-DETR, DEIM, RF-DETR), detailing the observed speed-accuracy trade-offs and parameter requirements without asserting a singular optimal model. Additionally, the analysis covers the framework's unified multi-task capabilities, including the YOLOE-26 open-vocabulary module for promptable detection. Ultimately, this paper serves to document how decoupling representation learning from heuristic post-processing impacts the "Export Gap" and deterministic latency in modern edge-based computer vision deployments.
中文标题/摘要
标题:YOLO26:无NMS端到端实时目标检测框架的分析
“仅看一次”(YOLO)框架长期以来一直是实时目标检测的标准,尽管传统版本使用了非最大抑制(NMS)后处理,这引入了特定的延迟和超参数变量。本文对YOLO26模型进行了全面的架构分析,该模型通过消除NMS转向了原生端到端学习策略。本文研究了驱动该框架的核心机制:MuSGD优化器用于主干稳定,Small-Target-Aware标签分配(STAL),以及ProgLoss用于动态监督。为了对性能进行背景化,本文回顾了COCO \texttt{val2017}排行榜上的详尽基准数据。该评估提供了YOLO26在不同模型规模(Nano到Extra-Large)下与先前的CNN谱系和当代基于Transformer的架构(例如RT-DETR、DEIM、RF-DETR)的客观比较,详细说明了观察到的速度-准确度权衡和参数要求,但未断言单一最优模型。此外,分析还涵盖了该框架的统一多任务能力,包括YOLOE-26开放词汇模块,用于可提示检测。最终,本文记录了将表示学习与启发式后处理解耦如何影响现代边缘计算视觉部署中的“导出差距”和确定性延迟。
Summary / 总结
This paper analyzes YOLO26, an end-to-end object detection model that eliminates Non-Maximum Suppression (NMS) post-processing. It examines the framework's core mechanisms: the MuSGD optimizer for backbone stabilization, Small-Target-Aware Label Assignment (STAL), and ProgLoss for dynamic supervision. Using COCO val2017 benchmark data, it compares YOLO26 across model scales (Nano to Extra-Large) against both CNN and Transformer-based architectures, detailing speed-accuracy trade-offs and parameter requirements without favoring a single model. The study also covers YOLO26's multi-task capabilities, including the YOLOE-26 open-vocabulary module for promptable detection, and documents the impact of removing NMS on the "Export Gap" and deterministic latency in edge-based computer vision deployments.
该研究分析了YOLO26模型,这是一种移除了非极大值抑制(NMS)的端到端目标检测模型,使用了MuSGD优化器、Small-Target-Aware Label Assignment(STAL)和ProgLoss进行监督。研究评估了YOLO26在不同模型规模下的性能,与传统的CNN架构和基于Transformer的架构进行了详细比较,没有推荐单一的最佳模型。研究还强调了YOLO26的多任务能力,包括YOLOE-26开放词汇模块,用于灵活的目标检测。研究展示了移除启发式后处理如何影响实时目标检测系统的性能和延迟。
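For context on what an NMS-free detector leaves out, the following is a minimal sketch of classic greedy NMS, not code from the YOLO26 paper. The function names, the `(x1, y1, x2, y2)` box format, and the default `iou_threshold` are illustrative assumptions; the threshold is one example of the post-processing hyperparameters the abstract refers to.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is dropped when it overlaps an already-kept box by more than
    iou_threshold (an assumed default, tuned per deployment in practice).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the second box overlaps the first and is dropped
```

The loop is sequential and its running time depends on how many candidate boxes survive, which is one reason an NMS-free head can give more predictable latency.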
Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs
Authors: Yuxin Liu, Fei Wang, Kun Li, Yiqi Nie, Junjie Chen, Zhangling Duan, Zhaohong Jia
First: 2026-03-18T14:22:45+00:00 · Latest: 2026-03-18T14:22:45+00:00
Abstract
Image Deepfake Detection (IDD) separates manipulated images from authentic ones by spotting artifacts of synthesis or tampering. Although large vision-language models (LVLMs) offer strong image understanding, adapting them to IDD often demands costly fine-tuning and generalizes poorly to diverse, evolving manipulations. We propose the Semantic Consistent Evidence Pack (SCEP), a training-free LVLM framework that replaces whole-image inference with evidence-driven reasoning. SCEP mines a compact set of suspicious patch tokens that best reveal manipulation cues. It uses the vision encoder's CLS token as a global reference, clusters patch features into coherent groups, and scores patches with a fused metric combining CLS-guided semantic mismatch with frequency- and noise-based anomalies. To cover dispersed traces and avoid redundancy, SCEP samples a few high-confidence patches per cluster and applies grid-based NMS, producing an evidence pack that conditions a frozen LVLM for prediction. Experiments on diverse benchmarks show SCEP outperforms strong baselines without LVLM fine-tuning.
中文标题/摘要
标题:跨域图像深度伪造检测的证据打包方法
图像深度伪造检测(IDD)通过识别合成或篡改的特征将被篡改的图像与真实图像区分开来。尽管大型视觉-语言模型(LVLM)提供了强大的图像理解能力,但将其应用于IDD通常需要昂贵的微调,并且难以适应多样且不断演变的篡改方式。我们提出了语义一致的证据打包(SCEP),这是一种无需训练的LVLM框架,用基于证据的推理替代了整个图像的推理。SCEP 挖掘出一组最能揭示篡改线索的可疑补丁标记。它使用视觉编码器的CLS标记作为全局参考,将补丁特征聚类成一致的组,并使用结合了CLS引导语义不匹配和频率及噪声基异常的融合度量来评分补丁。为了覆盖分散的痕迹并避免冗余,SCEP 每个聚类中采样几个高置信度的补丁,并应用基于网格的NMS,生成一个证据包,该证据包用于条件化冻结的LVLM进行预测。在多种基准上的实验表明,SCEP 在无需LVLM微调的情况下优于强大的基线方法。
Summary / 总结
The research aims to improve cross-domain image deepfake detection using large vision-language models (LVLMs) without fine-tuning. The proposed Semantic Consistent Evidence Pack (SCEP) framework extracts a compact set of suspicious patch tokens to reveal manipulation cues. It uses the vision encoder's CLS token as a reference, clusters patch features, and scores them with a combined metric. SCEP then selects high-confidence patches per cluster and applies grid-based non-max suppression to produce an evidence pack, which conditions a frozen LVLM for prediction. Experiments demonstrate that SCEP outperforms strong baselines on various benchmarks without LVLM fine-tuning.
研究旨在利用大型视觉语言模型(LVLM)进行跨域图像深伪检测,无需微调。提出的语义一致证据包(SCEP)框架挖掘可疑的补丁令牌以揭示篡改线索,使用全局参考和聚类技术。实验表明,SCEP在各种基准上优于强基线,且无需LVLM微调。
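The selection steps described in the SCEP abstract (fused scoring, per-cluster sampling, grid-based NMS) could look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the weighting `alpha`, the use of `1 - cosine` as the semantic mismatch, the Chebyshev-distance suppression radius, and all function and key names are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def fused_score(patch_feat, cls_feat, anomaly, alpha=0.5):
    """Assumed fusion: weighted sum of CLS mismatch and an anomaly score."""
    mismatch = 1.0 - cosine(patch_feat, cls_feat)
    return alpha * mismatch + (1.0 - alpha) * anomaly


def build_evidence_pack(patches, per_cluster=2, radius=1):
    """Pick top patches per cluster, then suppress grid neighbours.

    patches: list of dicts with keys "row", "col", "cluster", "score".
    A candidate is dropped if it lies within `radius` grid cells
    (Chebyshev distance) of an already-kept, higher-scoring patch.
    """
    by_cluster = defaultdict(list)
    for p in patches:
        by_cluster[p["cluster"]].append(p)
    candidates = []
    for members in by_cluster.values():
        ranked = sorted(members, key=lambda p: p["score"], reverse=True)
        candidates.extend(ranked[:per_cluster])
    candidates.sort(key=lambda p: p["score"], reverse=True)
    kept = []
    for p in candidates:
        if all(max(abs(p["row"] - q["row"]), abs(p["col"] - q["col"])) > radius
               for q in kept):
            kept.append(p)
    return kept


patches = [
    {"row": 0, "col": 0, "cluster": 0, "score": 0.9},
    {"row": 0, "col": 1, "cluster": 0, "score": 0.8},
    {"row": 0, "col": 2, "cluster": 0, "score": 0.1},
    {"row": 5, "col": 5, "cluster": 1, "score": 0.7},
]
pack = build_evidence_pack(patches)
positions = [(p["row"], p["col"]) for p in pack]
```

Sampling per cluster spreads the evidence across semantically different regions, while the grid suppression removes near-duplicate neighbours within a region.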
MultiMedEval: A Benchmark and a Toolkit for Evaluating Medical Vision-Language Models
Authors: Corentin Royer, Bjoern Menze, Anjany Sekuboyina
First: 2024-02-14T15:49:08+00:00 · Latest: 2026-03-18T14:09:41+00:00
Comments: Accepted at MIDL 2024
Abstract
We introduce MultiMedEval, an open-source toolkit for fair and reproducible evaluation of large, medical vision-language models (VLM). MultiMedEval comprehensively assesses the models' performance on a broad array of six multi-modal tasks, conducted over 23 datasets, and spanning over 11 medical domains. The chosen tasks and performance metrics are based on their widespread adoption in the community and their diversity, ensuring a thorough evaluation of the model's overall generalizability. We open-source a Python toolkit (github.com/corentin-ryr/MultiMedEval) with a simple interface and setup process, enabling the evaluation of any VLM in just a few lines of code. Our goal is to simplify the intricate landscape of VLM evaluation, thus promoting fair and uniform benchmarking of future models.
中文标题/摘要
标题:MultiMedEval:评估医疗视觉语言模型的基准和工具包
我们介绍了MultiMedEval,一个开源工具包,用于公平和可重复地评估大型医疗视觉语言模型(VLM)。MultiMedEval全面评估了模型在23个数据集上的表现,涵盖了11个医学领域中的六种多模态任务。所选任务和性能指标基于它们在社区中的广泛应用和多样性,确保了对模型整体泛化能力的全面评估。我们开源了一个Python工具包(github.com/corentin-ryr/MultiMedEval),具有简单的接口和设置过程,只需几行代码即可评估任何VLM。我们的目标是简化VLM评估的复杂景观,从而促进未来模型的公平和统一基准测试。
Summary / 总结
MultiMedEval is an open-source toolkit designed to evaluate medical vision-language models (VLM) fairly and reproducibly. It assesses models across 23 datasets covering 11 medical domains through six multi-modal tasks. The toolkit simplifies the evaluation process with a Python interface, allowing any VLM to be evaluated in just a few lines of code. This initiative aims to promote fair benchmarking of VLMs in the medical field.
MultiMedEval 是一个开源工具包,旨在公平且可重复地评估大型医疗视觉-语言模型。它通过六个跨模态任务在23个数据集和11个医学领域中评估模型,使用广泛采用的性能指标。该工具包通过 Python 接口简化了评估过程,只需几行代码即可评估任何 VLM。这促进了未来模型的公平基准测试。
SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition
Authors: Jingxiao Yang, DaLin He, Miao Pan, Ge Su, Wenqi Zhang, Yifeng Hu, Tangwei Li, Yuke Li, Xuhong Zhang
First: 2026-03-18T13:49:27+00:00 · Latest: 2026-03-18T13:49:27+00:00
Comments: preprint, under review
Abstract
Recent advances in Large Vision-Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories. Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations: (1) They apply the same inference pipeline to all samples without accounting for uneven recognition difficulty, thereby leading to suboptimal accuracy and efficiency; (2) The lack of mechanisms to consolidate and reuse error-specific experience causes repeated failures on similar challenging cases. To address these limitations, we propose SARE, a Sample-wise Adaptive REasoning framework for training-free FGVR. Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. In the reasoning process, SARE incorporates a self-reflective experience mechanism that leverages past failures to provide transferable discriminative guidance during inference, without any parameter updates. Extensive experiments across 14 datasets substantiate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
中文标题/摘要
标题:SARE:样本自适应推理用于无需训练的细粒度视觉识别
大型视觉-语言模型(LVLMs)的最新进展使无需训练的细粒度视觉识别(FGVR)成为可能。然而,有效地利用LVLMs进行FGVR仍然具有挑战性,因为子类别级别的固有视觉模糊性。现有方法主要采用检索导向或推理导向的方法来应对这一挑战,但都受到两个基本限制的制约:(1) 它们对所有样本应用相同的推理管道,而不考虑识别难度的不均衡性,从而导致性能和效率不佳;(2) 缺乏机制来整合和重用特定错误的经验,导致在类似具有挑战性的案例上重复失败。为了解决这些限制,我们提出了SARE,一种样本自适应推理框架,用于无需训练的FGVR。具体而言,SARE采用级联设计,结合快速候选检索与细粒度推理,仅在必要时调用后者。在推理过程中,SARE引入了一种自我反思的经验机制,利用过去的失败来在推理过程中提供可转移的判别性指导,而无需更新任何参数。在14个数据集上的广泛实验表明,SARE在保持高性能的同时,显著减少了计算开销。
Summary / 总结
The research aims to improve training-free fine-grained visual recognition, which is hampered by the inherent visual ambiguity of subordinate-level categories. It proposes SARE, a Sample-wise Adaptive Reasoning framework with a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. A self-reflective experience mechanism leverages past failures to provide discriminative guidance during inference, without parameter updates. Experiments on 14 datasets show that SARE achieves state-of-the-art performance with reduced computational overhead.
SARE 是一种针对训练-free 细粒度视觉识别的样本自适应推理框架,通过适应样本特定的识别难度和利用过去的失败经验进行判别性指导来解决现有方法的局限性。广泛的实验表明,SARE 达到了最先进的性能并减少了计算开销。
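The cascaded design in the SARE entry, fast retrieval first and reasoning only when needed, can be sketched as below. The abstract does not say how "necessary" is decided, so the top-2 similarity margin, `margin_threshold`, `top_k`, and the `reason_fn` callback are all assumptions made for illustration.

```python
def cascaded_predict(similarities, reason_fn, margin_threshold=0.1, top_k=3):
    """Return (label, used_reasoning) for one sample.

    similarities: dict mapping candidate label -> retrieval similarity.
    reason_fn: hypothetical callback that takes the top-k candidate labels
        and returns one label (e.g. by prompting an LVLM).
    The slow path runs only when the top-2 margin is below the threshold.
    """
    if not similarities:
        raise ValueError("at least one candidate label is required")
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    candidates = [label for label, _ in ranked[:top_k]]
    confident = len(ranked) < 2 or ranked[0][1] - ranked[1][1] >= margin_threshold
    if confident:
        return candidates[0], False  # fast path: accept the retrieval result
    return reason_fn(candidates), True  # slow path: fine-grained reasoning


# Easy sample: clear margin, so the reasoning step is skipped.
easy = cascaded_predict({"sparrow": 0.9, "finch": 0.5}, reason_fn=lambda c: c[0])
# Hard sample: near-tie, so the (stand-in) reasoner picks among candidates.
hard = cascaded_predict({"sparrow": 0.52, "finch": 0.5, "crow": 0.1},
                        reason_fn=lambda c: c[1])
```

Routing by confidence means the expensive step is paid for only on ambiguous samples, which is the source of the efficiency gain the abstract claims.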
From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
Authors: A. Humnabadkar, A. Sikdar, B. Cave, H. Zhang, N. Bessis, A. Behera
First: 2026-03-18T13:32:26+00:00 · Latest: 2026-03-18T13:32:26+00:00
Comments: Accepted manuscript - Transactions on Intelligent Transportation Systems
Abstract
Autonomous driving technologies have achieved significant advances in recent years, yet their real-world deployment remains constrained by data scarcity, safety requirements, and the need for generalization across diverse environments. In response, synthetic data and virtual environments have emerged as powerful enablers, offering scalable, controllable, and richly annotated scenarios for training and evaluation. This survey presents a comprehensive review of recent developments at the intersection of autonomous driving, simulation technologies, and synthetic datasets. We organize the landscape across three core dimensions: (i) the use of synthetic data for perception and planning, (ii) digital twin-based simulation for system validation, and (iii) domain adaptation strategies bridging synthetic and real-world data. We also highlight the role of vision-language models and simulation realism in enhancing scene understanding and generalization. A detailed taxonomy of datasets, tools, and simulation platforms is provided, alongside an analysis of trends in benchmark design. Finally, we discuss critical challenges and open research directions, including Sim2Real transfer, scalable safety validation, cooperative autonomy, and simulation-driven policy learning, that must be addressed to accelerate the path toward safe, generalizable, and globally deployable autonomous driving systems.
中文标题/摘要
标题:从虚拟环境到现实世界试验:自主驾驶新兴趋势
近年来,自主驾驶技术取得了显著进展,但其在现实世界的部署仍受到数据稀缺性、安全要求以及跨不同环境泛化的限制。为应对这些挑战,合成数据和虚拟环境已成为强大的助力,提供了可扩展、可控且注释丰富的场景,用于训练和评估。本文综述了自主驾驶、模拟技术和合成数据集交叉领域的最新进展。我们从三个核心维度组织了这一景观:(i) 使用合成数据进行感知和规划,(ii) 基于数字孪生的模拟用于系统验证,(iii) 跨合成和现实世界数据的领域适应策略。我们还强调了视觉语言模型和模拟现实性在增强场景理解和泛化方面的作用。我们提供了数据集、工具和模拟平台的详细分类,并分析了基准设计的趋势。最后,我们讨论了必须解决的关键挑战和开放研究方向,包括Sim2Real迁移、可扩展的安全验证、协同自主以及基于模拟的策略学习,以加速实现安全、泛化和全球部署的自主驾驶系统。
Summary / 总结
The paper explores the use of synthetic data and virtual environments to address the challenges of deploying autonomous driving technologies in the real world, such as data scarcity and safety requirements. It reviews recent developments in the integration of autonomous driving, simulation technologies, and synthetic datasets, focusing on perception and planning, system validation, and domain adaptation. Key findings include the importance of vision-language models and simulation realism in enhancing scene understanding and generalization, and the need for Sim2Real transfer and scalable safety validation to achieve safe and generalizable autonomous driving systems.
论文探讨了使用合成数据和虚拟环境来应对自动驾驶技术在现实世界中部署所面临的挑战,如数据稀缺性和安全性要求。它回顾了自动驾驶、仿真技术和合成数据集集成的最新进展,重点关注感知和规划、系统验证和领域适应。关键发现包括视觉-语言模型和仿真逼真性在增强场景理解和泛化方面的重要性,以及为了实现安全和通用的自动驾驶系统,需要进行Sim2Real转移和可扩展的安全验证。
Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
Authors: Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu
First: 2026-03-18T13:10:47+00:00 · Latest: 2026-03-18T13:10:47+00:00
Abstract
The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generated by proprietary models contains systematic errors in fundamental temporal perception, such as confusing motion directions or misjudging speeds. We introduce SynRL, a post-training framework that teaches models temporal primitives, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. Our key insight is that these abstract primitives, learned from programmatically generated synthetic videos, transfer effectively to real-world scenarios. We decompose temporal understanding into short-term perceptual primitives (speed, direction) and long-term cognitive primitives, constructing 7.7K CoT and 7K RL samples with ground-truth frame-level annotations through code-based video generation. Despite training on simple geometric shapes, SynRL achieves substantial improvements across 15 benchmarks spanning temporal grounding, complex reasoning, and general video understanding. Remarkably, our 7.7K synthetic CoT samples outperform Video-R1 with 165K real-world samples. We attribute this to fundamental temporal skills, such as tracking frame by frame changes and comparing velocity, that transfer effectively from abstract synthetic patterns to complex real-world scenarios. This establishes a new paradigm for video post-training: video temporal learning through carefully designed synthetic data provides a more cost efficient scaling path.
中文标题/摘要
标题:通过合成视频学习可迁移的时间基本要素以实现视频推理
从图像理解过渡到视频理解需要视觉-语言模型(VLMs)从识别静态模式转向推理时间动态,如运动轨迹、速度变化和状态转换。然而,当前的后训练方法由于两个关键限制而效果不佳:(1)现有数据集往往缺乏时间中心性,答案可以从孤立的关键帧中推断出来,而不需要整体的时间整合;(2)由专有模型生成的训练数据包含基本时间感知中的系统性错误,如混淆运动方向或误判速度。我们引入了SynRL,这是一种后训练框架,用于教授模型时间基本要素,即时间理解的基本构建块,包括方向、速度和状态跟踪。我们的关键见解是,这些抽象的基本要素,从程序生成的合成视频中学习,可以有效地转移到现实世界场景中。我们将时间理解分解为短期感知基本要素(速度、方向)和长期认知基本要素,通过基于代码的视频生成构建了7,700个CoT样本和7,000个RL样本,带有真实的帧级标注。尽管仅在简单的几何形状上进行训练,SynRL在15个涵盖时间定位、复杂推理和一般视频理解的基准测试中均取得了显著改进。值得注意的是,我们的7,700个合成CoT样本的表现优于使用165,000个真实世界样本的Video-R1。我们将此归因于基本的时间技能,如逐帧跟踪变化和比较速度,它们能够从抽象的合成模式有效转移到复杂的现实世界场景中。这确立了视频后训练的新范式:通过精心设计的合成数据进行视频时间学习提供了一种更经济的扩展路径。
Summary / 总结
The paper addresses the challenge of teaching vision-language models to reason over temporal dynamics in videos. It introduces SynRL, a post-training framework that learns temporal primitives such as direction, speed, and state tracking from synthetic videos. By generating 7.7K CoT and 7K RL samples with ground-truth annotations, SynRL significantly improves performance across 15 benchmarks, outperforming Video-R1 with 165K real-world samples. This demonstrates that synthetic data can effectively transfer fundamental temporal skills to real-world scenarios, offering a cost-efficient approach to video understanding.
该论文提出了SynRL,一种后训练框架,通过合成视频教授模型时间上的基本要素,如方向、速度和状态跟踪。它通过利用程序生成的合成视频解决了现有数据集和训练方法的局限性,展示了在15个基准测试中的显著改进,并表明少量的合成样本比Video-R1的165K真实世界样本更有效,突出了从抽象的合成模式学习对复杂现实场景的时间理解的有效性。
Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers
Authors: Wenhao Sun, Ji Li, Zhaoqiang Liu
First: 2026-03-11T13:16:41+00:00 · Latest: 2026-03-18T13:10:33+00:00
Comments: Accepted by CVPR2026. Project Page: https://wenhao-sun77.github.io/JiT/
Abstract
Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.
中文标题/摘要
标题:Just-in-Time: 无需训练的空间加速方法用于扩散变换器
扩散变换器已在图像合成领域确立了新的前沿地位,但迭代采样的高计算成本严重阻碍了其实用部署。尽管现有的加速方法通常集中在时间域,但它们忽视了生成过程中固有的大量空间冗余,即全局结构在细粒度细节形成之前就已经出现。对所有空间区域的均匀计算处理是一个关键的低效率。在本文中,我们引入了Just-in-Time (JiT),这是一种新颖的无需训练的框架,通过在空间域加速来解决这一挑战。JiT 形式化了一个基于动态选择的稀疏锚定标记的子集进行计算的空间近似生成常微分方程 (ODE),以驱动整个潜在状态的演变。为了确保在新标记被纳入以扩展潜在状态维度时无缝过渡,我们提出了一种确定性微流,这是一种简单且有效的有限时间 ODE,能够保持结构连贯性和统计正确性。在最先进的 FLUX.1-dev 模型上的广泛实验表明,JiT 可以实现高达 7 倍的速度提升,几乎不损失性能,显著优于现有加速方法,并建立了推理速度和生成保真度之间新的优越权衡。
Summary / 总结
This paper introduces Just-in-Time (JiT), a training-free framework that accelerates diffusion transformers by focusing on spatial efficiency. JiT uses a spatially approximated generative ODE to drive the evolution of the latent state based on computations from a sparse subset of anchor tokens. The method achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and improving the trade-off between inference speed and generation fidelity.
论文提出了一种名为Just-in-Time (JiT)的无训练加速框架,通过在空间域中使用基于动态选择的稀疏锚点 token 的生成 ODE 近似,解决扩散变换器在图像合成中的高计算成本问题。实验显示,JiT 可以实现高达 7 倍的加速,同时保持几乎无损的性能,显著优于现有方法。
WeatherReasonSeg: A Benchmark for Weather-Aware Reasoning Segmentation in Visual Language Models
Authors: Wanjun Du, Zifeng Yuan, Tingting Chen, Fucai Ke, Beibei Lin, Shunli Zhang
First: 2026-03-18T12:57:18+00:00 · Latest: 2026-03-18T12:57:18+00:00
Abstract
Existing vision-language models (VLMs) have demonstrated impressive performance in reasoning-based segmentation. However, current benchmarks are primarily constructed from high-quality images captured under idealized conditions. This raises a critical question: when visual cues are severely degraded by adverse weather conditions such as rain, snow, or fog, can VLMs sustain reliable reasoning segmentation capabilities? In response to this challenge, we introduce WeatherReasonSeg, a benchmark designed to evaluate VLM performance in reasoning-based segmentation under adverse weather conditions. It consists of two complementary components. First, we construct a controllable reasoning dataset by applying synthetic weather with varying severity levels to existing segmentation datasets, enabling fine-grained robustness analysis. Second, to capture real-world complexity, we curate a real-world adverse-weather reasoning segmentation dataset with semantically consistent queries generated via mask-guided LLM prompting. We further broaden the evaluation scope across five reasoning dimensions, including functionality, application scenarios, structural attributes, interactions, and requirement matching. Extensive experiments across diverse VLMs reveal two key findings: (1) VLM performance degrades monotonically with increasing weather severity, and (2) different weather types induce distinct vulnerability patterns. We hope WeatherReasonSeg will serve as a foundation for advancing robust, weather-aware reasoning.
中文标题/摘要
标题:WeatherReasonSeg:视觉语言模型在恶劣天气条件下的推理分割基准
现有的视觉-语言模型(VLMs)在基于推理的分割任务上表现出色。然而,当前的基准主要由在理想条件下拍摄的高质量图像构建而成。这引发了一个关键问题:当视觉线索因恶劣天气条件(如雨、雪或雾)严重退化时,VLMs能否保持可靠的推理分割能力?为应对这一挑战,我们引入了WeatherReasonSeg,这是一个旨在评估VLM在恶劣天气条件下进行基于推理的分割性能的基准。它包含两个互补的组成部分。首先,我们通过将合成天气以不同严重程度应用到现有的分割数据集中,构建了一个可控的推理数据集,从而实现精细的鲁棒性分析。其次,为了捕捉现实世界的复杂性,我们通过掩码引导的LLM提示生成语义一致的查询,构建了一个现实世界恶劣天气推理分割数据集。我们进一步将评估范围扩展到五个推理维度,包括功能、应用场景、结构属性、交互和需求匹配。在多种VLM上的广泛实验揭示了两个关键发现:(1)VLM的性能随着天气严重程度的增加而单调下降;(2)不同类型的天气会引发不同的脆弱性模式。我们希望WeatherReasonSeg能够成为推动鲁棒、天气感知推理的基础。
Summary / 总结
The research aims to evaluate the robustness of vision-language models (VLMs) in reasoning-based segmentation under adverse weather conditions. The authors introduce WeatherReasonSeg, a benchmark that includes a controllable reasoning dataset with synthetic weather and a real-world adverse-weather dataset. Key findings show that VLM performance decreases with increasing weather severity and that different weather types affect models differently. The benchmark aims to advance the development of more robust VLMs.
研究旨在评估视觉语言模型(VLMs)在恶劣天气条件下的推理分割能力。为此,作者引入了WeatherReasonSeg基准,该基准包括一个带有合成天气的可控推理数据集和一个真实世界的恶劣天气推理分割数据集。关键发现表明,随着天气严重性的增加,VLM的性能会下降,不同类型的天气对VLM的影响也不同。该基准旨在推动VLM在恶劣天气下的稳健推理能力的提升。
Adaptive Guidance for Retrieval-Augmented Masked Diffusion Models
Authors: Jaemin Kim, Jong Chul Ye
First: 2026-03-18T12:54:50+00:00 · Latest: 2026-03-18T12:54:50+00:00
Abstract
Retrieval-Augmented Generation (RAG) improves factual grounding by incorporating external knowledge into language model generation. However, when retrieved context is noisy, unreliable, or inconsistent with the model's parametric knowledge, it introduces retrieval-prior conflicts that can degrade generation quality. While this problem has been studied in autoregressive language models, it remains largely unexplored in diffusion-based language models, where the iterative denoising process introduces unique challenges for integrating retrieved context. In this work, we propose Adaptive Retrieval-Augmented Masked Diffusion (ARAM), a training-free adaptive guidance framework for Masked Diffusion Models (MDMs) in RAG settings. ARAM dynamically calibrates the guidance scale during denoising according to the Signal-to-Noise Ratio (SNR) of the distributional shift induced by retrieved context. Intuitively, the model strengthens guidance when the retrieved context provides reliable corrective evidence and suppresses it when the contextual signal is noisy or non-supportive. Extensive experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.
Summary / 总结
The paper addresses retrieval-prior conflicts in retrieval-augmented Masked Diffusion Models (MDMs), where retrieved context that is noisy or inconsistent with the model's parametric knowledge can degrade generation quality. It introduces ARAM, a training-free adaptive guidance framework that dynamically adjusts the guidance scale during denoising based on the Signal-to-Noise Ratio (SNR) of the distributional shift induced by the retrieved context. Experiments on multiple knowledge-intensive QA benchmarks show that ARAM improves overall QA performance over competitive RAG baselines.
研究旨在解决扩散型语言模型中检索先验冲突的问题,这可能会降低生成质量。提出的自适应检索增强遮蔽扩散(ARAM)框架会根据检索上下文引起的数据分布变化的信噪比动态调整指导尺度。在多个知识密集型问答基准测试上的实验表明,ARAM在整体问答性能上优于竞争性的RAG基线。
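The idea of strengthening guidance when the retrieved context is reliable and suppressing it otherwise can be illustrated with a classifier-free-guidance-style interpolation. This is a sketch under assumptions, not ARAM's actual rule: the abstract does not give the SNR definition or the calibration function, so the SNR is taken as an input and the saturating map `w_max * snr / (1 + snr)` is a hypothetical choice.

```python
def guidance_scale(snr, w_max=2.0):
    """Map a non-negative SNR estimate to a guidance scale in [0, w_max).

    Assumed mapping: grows with SNR and saturates, so a noisy signal
    (snr near 0) contributes almost no guidance.
    """
    if snr < 0:
        raise ValueError("snr must be non-negative")
    return w_max * snr / (1.0 + snr)


def guided_logits(uncond, cond, snr, w_max=2.0):
    """Blend logits without (uncond) and with (cond) retrieved context."""
    w = guidance_scale(snr, w_max)
    return [u + w * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # logits from the model's parametric knowledge only
cond = [2.0, 0.0]    # logits when the retrieved passage is in context
ignored = guided_logits(uncond, cond, snr=0.0)   # w = 0: context is ignored
followed = guided_logits(uncond, cond, snr=1.0)  # w = 1: context logits are used
```

In a masked diffusion sampler, a rule of this kind would be re-evaluated at each denoising step, so the influence of the retrieved context can vary over the course of generation.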
Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Authors: Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li
Venue: CVPR 2026
First: 2026-03-18T12:20:21+00:00 · Latest: 2026-03-18T12:20:21+00:00
Comments: CVPR 2026
Abstract
Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, albeit they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features. To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
中文标题/摘要
标题:可解释的跨域少样本学习与校正的目标域局部对齐
跨域少样本学习(CDFSL)利用大规模通用数据(源域)训练的模型适应只有少量训练数据的目标域,其中基于视觉语言模型(如CLIP)的研究仍处于初级阶段。典型的下游领域,如医学诊断,需要细粒度的视觉线索进行可解释的识别,但我们发现当前微调的CLIP模型很难关注这些线索,尽管它们可以在源域中粗略地关注重要区域。尽管当前的工作已经证明了CLIP在捕捉局部细微模式方面的不足,但在本文中,我们发现领域差距和稀缺的训练数据进一步加剧了这种不足,远超过整体模式,我们称之为基于CLIP的CDFSL中的局部对齐问题。为了解决这个问题,由于缺乏对局部视觉特征和文本语义对齐的监督,我们转向了自我监督信息。受翻译任务的启发,我们提出了具有循环一致性(CC-CDFSL)的方法,将局部视觉特征翻译成文本特征,然后再翻译回视觉特征(反之亦然),并约束原始特征接近翻译回的特征。为了减少由视觉模态中更丰富信息引入的噪声,我们进一步提出了语义锚机制,首先增强视觉特征以提供更大的文本到图像映射的语料库,然后缩小图像特征以过滤掉无关的图像到文本映射。在各种基准、骨干和微调方法上的广泛实验表明,我们能够(1)有效提高局部视觉语言对齐,(2)通过可视化块增强学习模式和模型决策的可解释性,(3)达到最先进的性能。
Summary / 总结
The paper addresses Cross-Domain Few-Shot Learning (CDFSL) with CLIP-based models, identifying a local misalignment problem that the domain gap and scarce training data make worse. It proposes the CC-CDFSL method, which uses cycle consistency to align local visual features with text semantics, and a Semantic Anchor mechanism to reduce the noise introduced by the richer visual modality. Experiments on various benchmarks, backbones, and fine-tuning methods demonstrate improved local vision-language alignment, enhanced interpretability through patch visualization, and state-of-the-art performance.
该论文针对使用CLIP等视觉-语言模型的跨域少样本学习(CDFSL)中的局部错位问题,该问题在领域差距和稀缺训练数据的影响下更为严重。为了解决这一问题,作者提出了CC-CDFSL方法,利用循环一致性对齐局部视觉特征与文本语义,并引入了语义锚机制以增强可解释性。实验结果显示,该方法在局部视觉-语言对齐、可解释性以及多种基准上的性能均达到了最先进的水平。
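The cycle-consistency constraint in the CC-CDFSL entry, translating a feature to the other modality and back and then keeping the result close to the original, can be sketched with plain linear maps. The linear translators, the mean-squared-error penalty, and the function names are assumptions for illustration; the paper's actual translation modules and loss are not specified in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cycle_consistency_loss(feature, forward_map, backward_map):
    """Mean squared error between a feature and its round-trip translation.

    forward_map: assumed linear visual-to-text translator.
    backward_map: assumed linear text-to-visual translator.
    """
    translated = matvec(forward_map, feature)
    reconstructed = matvec(backward_map, translated)
    return sum((a - b) ** 2 for a, b in zip(feature, reconstructed)) / len(feature)


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
halve = [[0.5, 0.0], [0.0, 0.5]]

# The maps invert each other, so the round trip reproduces the feature.
consistent = cycle_consistency_loss([1.0, 3.0], double, halve)
# The backward map fails to undo the forward map, so the loss is positive.
inconsistent = cycle_consistency_loss([1.0, 0.0], double, identity)
```

Because the loss needs only the feature itself as a target, it supplies a training signal for local alignment without patch-level text labels, which matches the self-supervision motivation in the abstract.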
Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment
Authors: Dongqiang Gou, Xuming He
First: 2026-03-18T12:07:42+00:00 · Latest: 2026-03-18T12:07:42+00:00
Abstract
Grounding natural language questions to functionally relevant regions in 3D objects -- termed language-driven 3D affordance grounding -- is essential for embodied intelligence and human-AI interaction. Existing methods, while progressing from label-based to language-driven approaches, still face challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. To address these issues, we propose a novel two-stage cross-modal framework that enhances both semantic and geometric representations for open-vocabulary 3D affordance grounding. In the first stage, large language models generate part-aware instructions to recover missing semantics, enabling the model to link semantically similar affordances. In the second stage, we introduce two key components: Affordance Prototype Aggregation (APA), which captures cross-object geometric consistency for each affordance, and Intra-Object Relational Modeling (IORM), which refines geometric differentiation within objects to support precise semantic alignment. We validate the effectiveness of our method through extensive experiments on a newly introduced benchmark, as well as two existing benchmarks, demonstrating superior performance in comparison with existing methods.
中文标题/摘要
标题:基于原型语义和几何对齐的分部位开放词汇3D功能定位
将自然语言问题与3D物体的功能相关区域对接——称为语言驱动的3D功能定位——对于体态智能和人机交互至关重要。现有方法虽然从基于标签的方法进步到语言驱动的方法,但在开放词汇泛化、精细几何对齐和部位语义一致性方面仍然面临挑战。为了解决这些问题,我们提出了一种新颖的两阶段跨模态框架,以增强开放词汇3D功能定位的语义和几何表示。在第一阶段,大型语言模型生成分部位指令以恢复缺失的语义,使模型能够链接语义相似的功能。在第二阶段,我们引入了两个关键组件:功能原型聚合(APA),用于捕获每个功能的跨物体几何一致性,以及对象内关系建模(IORM),用于在对象内部细化几何差异以支持精确的语义对齐。我们通过在新引入的基准以及两个现有基准上的广泛实验验证了我们方法的有效性,展示了与现有方法相比的优越性能。
Summary / 总结
The research aims to improve language-driven 3D affordance grounding by addressing challenges in open-vocabulary generalization, fine-grained geometric alignment, and part-level semantic consistency. The proposed method uses a two-stage cross-modal framework with part-aware instructions in the first stage and Affordance Prototype Aggregation and Intra-Object Relational Modeling in the second stage. Experiments on new and existing benchmarks show that the method outperforms existing approaches in these aspects.
研究旨在通过解决开放词汇泛化、精细几何对齐和部分语义一致性等挑战,改进基于语言的3D功能定位。提出的方法采用两阶段跨模态框架,第一阶段生成部分感知的指令以恢复缺失的语义,第二阶段引入Affordance Prototype Aggregation (APA) 和 Intra-Object Relational Modeling (IORM) 来增强几何和语义表示。实验结果显示,该方法在新引入的基准和两个现有基准上均优于现有方法。
ReLaGS: Relational Language Gaussian Splatting
Authors: Yaxu Xie, Abdalla Arafa, Alireza Javanmardi, Christen Millerdurai, Jia Cheng Hu, Shaoxiang Wang, Alain Pagani, Didier Stricker
Venue: CVPR 2026
First: 2026-03-18T11:18:23+00:00 · Latest: 2026-03-18T11:18:23+00:00
Comments: Accepted at CVPR 2026
Abstract
Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. A Gaussian pruning mechanism refines scene geometry, while a robust multi-view language alignment strategy aggregates noisy 2D features into accurate 3D object embeddings. On top of this hierarchy, we build an open-vocabulary 3D scene graph with Vision Language derived annotations and Graph Neural Network-based relational reasoning. Our approach enables efficient and scalable open-vocabulary 3D reasoning by jointly modeling hierarchical semantics and inter/intra-object relationships, validated across tasks including open-vocabulary segmentation, scene graph generation, and relation-guided retrieval. Project page: https://dfki-av.github.io/ReLaGS/
中文标题/摘要
标题:ReLaGS: 关系语言高斯泼溅
在语义分割、检索和关系理解等任务中实现统一的3D感知和推理仍然具有挑战性,因为现有方法要么以对象为中心,要么需要昂贵的训练来实现跨对象推理。我们提出了一种新的框架,该框架构建了一个层次化的语言提炼高斯场景及其3D语义场景图,无需针对特定场景进行训练。高斯剪枝机制细化场景几何结构,而鲁棒的多视图语言对齐策略将嘈杂的2D特征聚合为准确的3D对象嵌入。在此层次结构之上,我们构建了一个基于视觉语言注解和图神经网络关系推理的开放词汇3D场景图。我们的方法通过联合建模层次语义和跨/内对象关系,实现了高效的可扩展的开放词汇3D推理,并在包括开放词汇分割、场景图生成和关系引导检索在内的任务中得到了验证。项目页面:https://dfki-av.github.io/ReLaGS/
Summary / 总结
The research aims to develop a unified framework for 3D perception and reasoning across various tasks by addressing the limitations of existing object-centric methods and costly inter-object reasoning training. The method involves constructing a hierarchical language-distilled Gaussian scene and 3D semantic scene graph, using a Gaussian pruning mechanism to refine scene geometry and a robust multi-view language alignment strategy to aggregate 2D features into accurate 3D object embeddings. Key experimental findings show that the approach enables efficient and scalable open-vocabulary 3D reasoning, validated across tasks such as open-vocabulary segmentation, scene graph generation, and relation-guided retrieval.
研究旨在通过解决现有对象中心方法的局限性和昂贵的跨对象推理训练问题,统一3D感知和跨多种任务的推理。提出的ReLaGS框架构建了一个层次化的语言提炼高斯场景及其3D语义场景图,无需特定场景训练。该框架使用高斯修剪机制来细化场景几何,并使用稳健的多视图语言对齐策略将嘈杂的2D特征聚合为准确的3D对象嵌入。该框架在开放词汇量分割、场景图生成和关系引导检索等任务中展示了高效的可扩展3D推理能力。
Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing
Authors: Seongrae Noh, SeungWon Seo, Gyeong-Moon Park, HyeongYeop Kang
Venue: CVPR 2026
First: 2026-03-18T10:46:42+00:00 · Latest: 2026-03-18T10:46:42+00:00
Comments: Accepted to CVPR 2026
Abstract
Editing a 3D indoor scene from natural language is conceptually straightforward but technically challenging. Existing open-vocabulary systems often regenerate large portions of a scene or rely on image-space edits that disrupt spatial structure, resulting in unintended global changes or physically inconsistent layouts. These limitations stem from treating editing primarily as a generative task. We take a different view. A user instruction defines a desired world state, and editing should be the minimal sequence of actions that makes this state true while preserving everything else. This perspective motivates Edit-As-Act, a framework that performs open-vocabulary scene editing as goal-regressive planning in 3D space. Given a source scene and free-form instruction, Edit-As-Act predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language that we design with explicit preconditions and effects encoding support, contact, collision, and other geometric relations. A language-driven planner proposes actions, and a validator enforces goal-directedness, monotonicity, and physical feasibility, producing interpretable and physically coherent transformations. By separating reasoning from low-level generation, Edit-As-Act achieves instruction fidelity, semantic consistency, and physical plausibility - three criteria that existing paradigms cannot satisfy together. On E2A-Bench, our benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
中文标题/摘要
标题:Edit-As-Act:面向开放词汇3D室内场景编辑的目标回归规划
通过自然语言编辑3D室内场景在概念上很直接,但在技术上颇具挑战。现有的开放词汇系统通常会重新生成场景的大部分内容,或者依赖图像空间的编辑,从而破坏空间结构,导致意外的全局变化或物理上不一致的布局。这些局限源于将编辑主要视为生成任务。我们采取了不同的观点:用户指令定义了期望的世界状态,而编辑应当是使该状态成立、同时保留其他一切的最小动作序列。这一视角促使我们提出Edit-As-Act框架,它在3D空间中以目标回归规划的方式进行开放词汇场景编辑。给定源场景和自由形式的指令,Edit-As-Act预测符号化的目标谓词,并用EditLang进行规划;EditLang是我们设计的一种受PDDL启发的动作语言,具有显式的前提条件和效果,可编码支撑、接触、碰撞及其他几何关系。语言驱动的规划器提出动作,验证器则确保目标导向性、单调性和物理可行性,从而产生可解释且物理上连贯的变换。通过将推理与底层生成分离,Edit-As-Act同时实现了指令忠实度、语义一致性和物理合理性,这是现有范式无法同时满足的三项标准。在我们提出的基准E2A-Bench(涵盖9个室内环境中的63个编辑任务)上,Edit-As-Act在所有编辑类型和场景类别上均显著优于先前方法。
Summary / 总结
The paper addresses natural-language editing of 3D indoor scenes, which is conceptually simple but technically difficult: existing open-vocabulary methods often regenerate large parts of the scene or rely on image-space edits that disrupt spatial structure. It proposes Edit-As-Act, a framework that treats editing as goal-regressive planning in 3D space. Given a source scene and an instruction, it predicts symbolic goal predicates and plans in EditLang, a PDDL-inspired action language with explicit preconditions and effects; a language-driven planner proposes actions and a validator enforces goal-directedness, monotonicity, and physical feasibility. On E2A-Bench, a benchmark of 63 editing tasks across 9 indoor environments, Edit-As-Act significantly outperforms prior approaches across all edit types and scene categories.
论文研究基于自然语言指令编辑3D室内场景的问题,该问题概念上简单,但实现起来困难。论文提出了Edit-As-Act框架,将编辑视为3D空间中的目标回归规划。给定源场景和指令,Edit-As-Act预测符号目标并在自定义动作语言EditLang中规划动作,确保编辑结果物理上合理且语义上一致。该方法在涵盖9个室内环境、63个编辑任务的基准测试中显著优于现有方法。
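As a rough illustration of goal-regressive planning over actions with explicit preconditions and effects, the sketch below applies classical STRIPS-style goal regression to a toy scene. It is not EditLang or the authors' planner: the `Action` class, the `regress` helper, and predicates such as `on(lamp, desk)` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A STRIPS-style action; all names here are illustrative, not EditLang."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def regress(goal, action):
    """Return the subgoal that must hold before `action` so that `goal` holds after it.

    Returns None when the action is irrelevant (achieves no goal predicate)
    or inconsistent (deletes a goal predicate).
    """
    if not (action.add_effects & goal):
        return None
    if action.del_effects & goal:
        return None
    return (goal - action.add_effects) | action.preconditions


# Hypothetical edit: "put the lamp on the desk".
move_lamp = Action(
    name="move(lamp, floor, desk)",
    preconditions=frozenset({"on(lamp, floor)", "clear(desk)"}),
    add_effects=frozenset({"on(lamp, desk)"}),
    del_effects=frozenset({"on(lamp, floor)", "clear(desk)"}),
)

goal = frozenset({"on(lamp, desk)", "on(rug, floor)"})
subgoal = regress(goal, move_lamp)
```

In this toy case the untouched predicate `on(rug, floor)` is carried through the regression unchanged, which loosely mirrors the abstract's requirement that an edit preserve everything it does not need to change.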
FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion
Authors: Hugo Caselles-Dupré, Mathis Koroglu, Guillaume Jeanneret, Arnaud Dapogny, Matthieu Cord
First: 2026-03-18T10:02:37+00:00 · Latest: 2026-03-18T10:02:37+00:00
Comments: 5 authors. Hugo Caselles-Dupré, Mathis Koroglu, and Guillaume Jeanneret contributed equally. 14 pages, 7 figures
Abstract
Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
中文标题/摘要
标题:FrescoDiffusion:使用先验正则化分块扩散生成4K 图像到视频
基于扩散的图像到视频(I2V)模型越来越有效,但难以扩展到超高分辨率输入(例如4K)。在模型的原生分辨率下生成视频往往会丢失精细结构,而高分辨率分块去噪虽能保留局部细节,却会破坏全局布局一致性。这种失败模式在壁画动画场景中尤为严重:这类大型艺术作品包含许多不同的角色、物体和语义各异的子场景,它们必须随时间保持空间一致。我们提出了FrescoDiffusion,这是一种无需训练的方法,可以从单张复杂图像生成连贯的大幅面I2V。其关键思想是用预先计算的潜在先验来增强分块去噪:我们首先在基础模型的原生分辨率下生成一个低分辨率视频,并对其潜在轨迹进行上采样,得到一个捕捉长程时间和空间结构的全局参考。对于4K生成,我们计算每个分块的噪声预测,并在每个扩散时间步通过在模型输出空间中最小化单个加权最小二乘目标,将其与该参考融合。该目标结合了标准的分块合并准则和我们的正则化项,从而得到一个闭式融合更新,在增强全局一致性的同时保留精细细节。我们还提供了一个空间正则化变量,可对允许运动的区域进行区域级控制。在VBench-I2V数据集和我们提出的壁画I2V数据集上的实验表明,与分块基线相比,该方法在全局一致性和保真度方面有所提升,同时计算高效。我们的正则化使创造性与一致性之间的权衡可以被显式控制。
Summary / 总结
FrescoDiffusion addresses the challenge of generating coherent 4K image-to-video (I2V) content by introducing a training-free method that combines tiled denoising with a precomputed latent prior. This approach enhances global temporal and spatial consistency while preserving fine details. The method computes per-tile noise predictions and fuses them with a global reference trajectory, achieving improved fidelity and global consistency over tiled baselines.
FrescoDiffusion通过结合分块去噪和预计算的潜在先验解决了生成4K图像到视频(I2V)内容的挑战。该方法生成低分辨率视频并上采样其潜在轨迹,以保持全局一致性,同时保留细节。对于4K生成,它计算每个分块的噪声预测并将它们与参考轨迹融合,从而在全局一致性和细节保真度方面优于分块基线。
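The abstract mentions a weighted least-squares objective with a closed-form fusion update but does not state the formula. The sketch below therefore assumes a simple per-element objective, sum_i w_i (x - t_i)^2 + lam (x - p)^2, whose minimizer is (sum_i w_i t_i + lam p) / (sum_i w_i + lam). The function name, argument layout, and toy canvas are hypothetical, and the paper's actual objective may differ.

```python
import numpy as np


def fuse_tiles_with_prior(tile_preds, tile_weights, prior, lam):
    """Closed-form minimizer of sum_i w_i * (x - t_i)^2 + lam * (x - p)^2 per element.

    tile_preds:   (T, H, W) per-tile predictions placed on the full canvas
    tile_weights: (T, H, W) non-negative weights, zero where a tile does not cover
    prior:        (H, W) upsampled low-resolution reference
    lam:          scalar or (H, W) regularization strength (assumed form)
    """
    numerator = (tile_weights * tile_preds).sum(axis=0) + lam * prior
    denominator = tile_weights.sum(axis=0) + lam
    return numerator / denominator


# Two overlapping 1x4 "tiles" on a 1x6 canvas.
preds = np.zeros((2, 1, 6))
weights = np.zeros((2, 1, 6))
preds[0, 0, :4], weights[0, 0, :4] = 1.0, 1.0   # tile 0 covers columns 0-3
preds[1, 0, 2:], weights[1, 0, 2:] = 3.0, 1.0   # tile 1 covers columns 2-5
prior = np.full((1, 6), 2.0)

fused = fuse_tiles_with_prior(preds, weights, prior, lam=1.0)
no_prior = fuse_tiles_with_prior(preds, weights, prior, lam=0.0)
```

With `lam=0.0` the update reduces to ordinary weighted tile merging, and larger values pull the result toward the prior. Passing an `(H, W)` array for `lam` is one assumed way to obtain the region-level control that the abstract attributes to its spatial regularization variable.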
MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
Authors: Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya
First: 2026-03-18T09:34:23+00:00 · Latest: 2026-03-18T09:34:23+00:00
Abstract
Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available here.
中文标题/摘要
标题:MM-OVSeg:多模态光学-SAR融合在遥感中的开放词汇分割
开放词汇分割能够针对开放的文本类别集合实现像素级识别,从而泛化到固定类别之外。尽管在遥感领域潜力巨大,但该方向的进展仍主要局限于晴空光学数据,在多云或雾霾污染条件下表现不佳。我们提出了MM-OVSeg,这是一种多模态光学-SAR融合框架,用于在恶劣天气条件下实现稳健的开放词汇分割。MM-OVSeg 利用了两种模态的互补优势:光学图像提供丰富的光谱语义,而合成孔径雷达(SAR)提供可穿透云层的结构线索。为了解决跨模态域差距以及当前视觉语言模型密集预测能力有限的问题,我们提出了两项关键设计:一是用于多传感器表示对齐的跨模态统一过程,二是双编码器融合模块,它整合来自多个视觉基础模型的层次特征,以实现文本对齐的多模态分割。大量实验表明,MM-OVSeg 在多种云条件下实现了更优的鲁棒性和泛化能力。源数据集和代码可在此处获取。
Summary / 总结
MM-OVSeg is a multimodal Optical-SAR fusion framework designed for open-vocabulary segmentation under adverse weather conditions. It combines the rich spectral semantics of optical imagery with the cloud-penetrating structural cues from SAR to address the limitations of clear-sky optical data. Key designs include a cross-modal unification process and a dual-encoder fusion module, which enhance multi-sensor representation alignment and text-aligned multimodal segmentation. Experiments show that MM-OVSeg outperforms existing methods in terms of robustness and generalization across various cloud conditions.
MM-OVSeg 是一种多模态光学-SAR 融合框架,旨在在恶劣天气条件下进行遥感领域的开放词汇分割。该框架结合了光学图像丰富的光谱语义和 SAR 提供的穿透云层的结构线索,以解决领域差距并提高密集预测能力。实验结果表明,MM-OVSeg 在各种云条件下的鲁棒性和泛化能力优于现有方法。
PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation
Authors: Jianjian Yin, Tao Chen, Yi Chen, Gensheng Pei, Xiangbo Shu, Yazhou Yao, Fumin Shen
First: 2026-03-18T09:26:43+00:00 · Latest: 2026-03-18T09:26:43+00:00
Comments: Accepted by CVPR2026
Abstract
Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.
中文标题/摘要
标题:PCA-Seg:重新审视开放词汇语义与部件分割中的代价聚合
近期,视觉-语言模型(VLMs)的进展在开放词汇语义与部件分割(OSPS)方面引起了广泛关注。然而,现有方法通过空间聚合和类别聚合的串行结构从代价体中提取图像-文本对齐线索,导致类别级语义与空间上下文之间的知识干扰。因此,本文提出了一种简单而有效的并行代价聚合(PCA-Seg)范式来缓解上述问题,使模型能够从代价体中捕获更丰富的视觉-语言对齐信息。具体而言,我们设计了一个专家驱动的感知学习(EPL)模块,高效地整合语义流和上下文流。该模块包含一个多专家解析器,从多个视角提取互补特征。此外,我们设计了一个系数映射器,自适应地为每个特征学习像素特定的权重,从而将互补知识整合为统一且稳健的特征嵌入。进一步地,我们提出了一种特征正交化解耦(FOD)策略来缓解语义流与上下文流之间的冗余,使EPL模块能够从正交化特征中学习多样化的知识。在八个基准上的大量实验表明,PCA-Seg中的每个并行块仅增加0.35M参数,同时实现了最先进的OSPS性能。
Summary / 总结
This paper addresses the challenge of knowledge interference in open-vocabulary semantic and part segmentation by proposing PCA-Seg, a parallel cost aggregation paradigm. It introduces an expert-driven perceptual learning module that integrates semantic and contextual streams, using a multi-expert parser and a coefficient mapper to adaptively learn pixel-specific weights. Additionally, a feature orthogonalization decoupling strategy is proposed to reduce redundancy between streams. Experiments on eight benchmarks demonstrate that PCA-Seg achieves state-of-the-art performance with minimal additional parameters.
本文提出PCA-Seg,一种并行代价聚合范式,以解决开放词汇语义与部件分割中的知识干扰问题。该方法引入了一个专家驱动的感知学习模块来整合语义流和上下文流,使用多专家解析器和系数映射器自适应地学习像素特定权重。特征正交化解耦策略进一步减少了两个流之间的冗余。在八个基准上的实验显示,每个并行块仅增加0.35M参数,同时达到最先进的性能。
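The abstract does not say how the FOD strategy or the coefficient mapper is computed, so the following is only a loose sketch of the two ideas it names: removing redundancy between two feature streams, then fusing features with pixel-specific weights. Plain per-pixel vector rejection and a softmax over random logits are swapped-in stand-ins; `orthogonalize`, `fuse_with_pixel_weights`, and all shapes are hypothetical.

```python
import numpy as np


def orthogonalize(semantic, contextual, eps=1e-8):
    """Remove from `contextual` its per-pixel projection onto `semantic`.

    Both inputs have shape (H, W, C); this is plain vector rejection,
    used here as a stand-in for the paper's FOD strategy.
    """
    dot = (contextual * semantic).sum(axis=-1, keepdims=True)
    norm_sq = (semantic * semantic).sum(axis=-1, keepdims=True)
    return contextual - dot / (norm_sq + eps) * semantic


def fuse_with_pixel_weights(features, logits):
    """Fuse K feature maps (K, H, W, C) with per-pixel softmax weights from logits (K, H, W)."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights[..., None] * features).sum(axis=0)


rng = np.random.default_rng(0)
semantic = rng.normal(size=(4, 4, 8))
contextual = rng.normal(size=(4, 4, 8))
contextual_orth = orthogonalize(semantic, contextual)

features = np.stack([semantic, contextual_orth])   # (2, 4, 4, 8)
logits = rng.normal(size=(2, 4, 4))                # would come from a learned mapper
fused = fuse_with_pixel_weights(features, logits)
```

After `orthogonalize`, the two streams have a near-zero inner product at every pixel, so the fusion step combines non-overlapping directions. In a trained model the logits would be predicted per pixel rather than sampled.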
UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images
Authors: Guibiao Liao, Qian Ren, Kaimin Liao, Hua Wang, Zhi Chen, Luchao Wang, Yaohua Tang
First: 2026-03-18T09:26:25+00:00 · Latest: 2026-03-18T09:26:25+00:00
Abstract
Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level and limited generalizable supervision, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control by suppressing redundancy-prone Gaussians using rendering error cues, producing meaningful, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth Rel by 15.2% and improves open-vocabulary segmentation mAcc by 3.7% over strong baselines.
中文标题/摘要
标题:UniSem:基于稀疏无位姿图像的可泛化语义3D重建
对于前馈式3D高斯泼溅(3DGS)而言,基于稀疏、无位姿图像的语义感知3D重建仍然具有挑战性。现有方法在稀疏视角监督下往往会预测一个过完备的高斯基元集合,导致几何不稳定和深度质量较差。同时,它们仅依赖2D分割器特征进行语义提升,这只能提供较弱的3D层面监督和有限的可泛化监督,导致新场景中的3D语义不完整。为了解决这些问题,我们提出了统一框架UniSem,通过两个关键组件联合提升深度精度和语义泛化能力。首先,误差感知高斯丢弃(EGD)利用渲染误差线索抑制易产生冗余的高斯基元,实现误差引导的容量控制,从而生成有意义且几何稳定的高斯表示,以改进深度估计。其次,我们引入了混合训练课程(MTC),逐步将2D分割器提升的语义与模型自身涌现的3D语义先验相融合,并通过对象级原型对齐来实现,以增强语义的一致性和完整性。在ScanNet和Replica上的大量实验表明,UniSem在不同输入视图数量下的深度预测和开放词汇3D分割中均取得了更优的性能。值得注意的是,在16视图输入下,相比强基线,UniSem将深度Rel降低了15.2%,并将开放词汇分割mAcc提高了3.7%。
Summary / 总结
UniSem addresses the challenges of semantic-aware 3D reconstruction from sparse, unposed images by proposing a unified framework that includes Error-aware Gaussian Dropout (EGD) and a Mix-training Curriculum (MTC). EGD suppresses redundant Gaussian primitives using rendering error cues, enhancing depth estimation stability. MTC progressively integrates 2D segmenter-lifted semantics with the model's own 3D semantic priors, improving semantic coherence and completeness. Experiments on ScanNet and Replica demonstrate that UniSem outperforms strong baselines, reducing depth Rel by 15.2% and improving open-vocabulary segmentation mAcc by 3.7% with 16-view inputs.
UniSem 提出了一种统一框架,通过 Error-aware Gaussian Dropout (EGD) 和 Mix-training Curriculum (MTC) 来解决基于稀疏无位姿图像的语义感知 3D 重建问题。EGD 使用渲染误差线索抑制冗余的高斯基元,增强深度估计的稳定性。MTC 逐步将 2D 分割器提升的语义与模型自身的 3D 语义先验相结合,提高语义的一致性和完整性。实验结果表明,UniSem 在 ScanNet 和 Replica 上的表现优于强基线,使用 16 视图输入时,深度 Rel 降低了 15.2%,开放词汇分割 mAcc 提高了 3.7%。
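For readers unfamiliar with the two reported metrics, the sketch below computes them on toy arrays. It assumes that "depth Rel" denotes the commonly used absolute relative depth error and that "mAcc" is the mean of per-class accuracies; the entry does not define either, and the function names and sample values are hypothetical.

```python
import numpy as np


def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid (positive) ground truth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def mean_class_accuracy(pred_labels, gt_labels, num_classes):
    """Average of per-class accuracies, skipping classes absent from the ground truth."""
    accs = []
    for c in range(num_classes):
        mask = gt_labels == c
        if mask.any():
            accs.append(np.mean(pred_labels[mask] == c))
    return float(np.mean(accs))


gt_depth = np.array([2.0, 4.0, 0.0, 5.0])      # 0.0 marks a missing measurement
pred_depth = np.array([2.2, 3.0, 1.0, 5.0])
rel = abs_rel(pred_depth, gt_depth)

gt_labels = np.array([0, 0, 1, 1, 1, 2])
pred_labels = np.array([0, 1, 1, 1, 0, 2])
macc = mean_class_accuracy(pred_labels, gt_labels, num_classes=4)
```

Lower `abs_rel` and higher `macc` are better. The toy data includes a pixel with no ground-truth depth and a class with no ground-truth pixels to show how each is excluded from the average.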
Bodhi VLM: Privacy-Alignment Modeling for Hierarchical Visual Representations in Vision Backbones and VLM Encoders via Bottom-Up and Top-Down Feature Search
Authors: Bo Ma, Wei Qi Yan, Jinsong Wu
First: 2026-03-14T03:11:31+00:00 · Latest: 2026-03-18T08:32:49+00:00
Abstract
Learning systems that preserve privacy often inject noise into hierarchical visual representations; a central challenge is to model how such perturbations align with a declared privacy budget in a way that is interpretable and applicable across vision backbones and vision-language models (VLMs). We propose Bodhi VLM, a privacy-alignment modeling framework for hierarchical neural representations: it (1) links sensitive concepts to layer-wise grouping via NCP and MDAV-based clustering; (2) locates sensitive feature regions using bottom-up (BUA) and top-down (TDA) strategies over multi-scale representations (e.g., feature pyramids or vision-encoder layers); and (3) uses an Expectation-Maximization Privacy Assessment (EMPA) module to produce an interpretable budget-alignment signal by comparing the fitted sensitive-feature distribution to an evaluator-specified reference (e.g., Laplace or Gaussian with scale c/ε). The output is reference-relative and is not a formal differential-privacy estimator. We formalize BUA/TDA over hierarchical feature structures and validate the framework on object detectors (YOLO, PPDPTS, DETR) and on the visual encoders of VLMs (CLIP, LLaVA, BLIP). BUA and TDA yield comparable deviation trends; EMPA provides a stable alignment signal under the reported setups. We compare with generic discrepancy baselines (Chi-square, K-L, MMD) and with task-relevant baselines (MomentReg, NoiseMLE, Wass-1). Results are reported as mean ± std over multiple seeds with confidence intervals in the supplementary materials. This work contributes a learnable, interpretable modeling perspective for privacy-aligned hierarchical representations rather than a post hoc audit only. Source code: https://github.com/mabo1215/bodhi-vlm.git
Summary / 总结
Bodhi VLM is a privacy-alignment modeling framework that links sensitive concepts to hierarchical visual representations in vision backbones and VLM encoders through clustering and feature search strategies. It uses bottom-up and top-down approaches to locate sensitive feature regions and an Expectation-Maximization Privacy Assessment module to produce an interpretable budget-alignment signal. The framework is validated on object detectors and visual encoders of VLMs, showing comparable deviation trends with bottom-up and top-down strategies and a stable alignment signal. It contributes a learnable, interpretable perspective for privacy-aligned hierarchical representations rather than a post hoc audit only.
Bodhi VLM 是一种隐私对齐建模框架,通过聚类和特征搜索策略将敏感概念与视觉表示中的层次结构联系起来。它使用自底向上和自顶向下的方法来定位敏感特征区域,并使用期望最大化隐私评估模块生成可解释的预算对齐信号。实验表明,自底向上和自顶向下方法在对象检测器和 VLM 视觉编码器上的偏差趋势相似,并且 EMPA 在报告的设置下提供了稳定的对齐信号。该框架提供了一种可学习且可解释的视角,用于隐私对齐的层次结构表示,而不仅仅是事后审计。
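To make the idea of comparing a fitted noise distribution with a reference of scale c/ε concrete, the sketch below fits a Laplace scale by maximum likelihood and reports its relative gap to the reference. This is a plain scale fit and not the EMPA module; like the signal described in the abstract, it is not a formal differential-privacy estimate. The function names and the values of `c` and `epsilon` are hypothetical.

```python
import numpy as np


def laplace_scale_mle(samples):
    """Maximum-likelihood Laplace scale: mean absolute deviation from the median."""
    median = np.median(samples)
    return float(np.mean(np.abs(samples - median)))


def scale_deviation(samples, c, epsilon):
    """Relative gap between the fitted scale and the reference scale c / epsilon."""
    reference = c / epsilon
    return abs(laplace_scale_mle(samples) - reference) / reference


rng = np.random.default_rng(0)
c, epsilon = 1.0, 0.5                      # hypothetical sensitivity and budget
aligned = rng.laplace(loc=0.0, scale=c / epsilon, size=50_000)
too_quiet = rng.laplace(loc=0.0, scale=0.5 * c / epsilon, size=50_000)

dev_aligned = scale_deviation(aligned, c, epsilon)
dev_too_quiet = scale_deviation(too_quiet, c, epsilon)
```

Noise drawn at the reference scale yields a deviation near zero, while noise at half that scale yields a deviation near 0.5, so the number reads as "how far the injected noise is from what the declared budget would imply".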
Chain of Mindset: Reasoning with Adaptive Cognitive Modes
Authors: Tianyi Jiang, Arctanx An, Hengyi Feng, Naixin Zhai, Haodong Li, Xiaomin Yu, Jiahui Liu, Hanwen Du, Shuo Zhang, Zhi Yang, Jie Huang, Youhua Li, Yongxin Ni, Huacan Wang, Ronghao Chen
First: 2026-02-10T18:31:47+00:00 · Latest: 2026-03-18T08:27:44+00:00
Abstract
Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single-minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training-free agentic framework that enables step-level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta-Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross-module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state-of-the-art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, while balancing reasoning efficiency. Our code is publicly available at https://github.com/QuantaAlpha/chain-of-mindset.
中文标题/摘要
标题:思维模式链:基于自适应认知模式的推理
人类解决问题绝非单一思维模式的重复;这里的思维模式是指一种独特的认知加工方式。面对特定任务时,我们并非依赖单一思维模式,而是在同一个求解过程中整合多种思维模式。然而,现有的大语言模型推理方法陷入了一个常见陷阱:它们在所有步骤中都采用相同的固定思维模式,忽视了解决同一问题的不同阶段需要根本不同的思维模式。这种单一思维模式的假设阻碍了模型达到更高层次的智能。为解决这一局限,我们提出了思维模式链(Chain of Mindset,CoM),这是一种无需训练的智能体框架,能够实现步骤级的自适应思维模式编排。CoM 将推理分解为四种功能异质的思维模式:空间思维、收敛思维、发散思维和算法思维。一个元智能体(Meta-Agent)根据不断演变的推理状态动态选择最优思维模式,而双向上下文门控(Context Gate)则过滤跨模块的信息流,以保持有效性和效率。在涵盖数学、代码生成、科学问答和空间推理的六个挑战性基准上的实验表明,CoM 达到了最先进的性能,在 Qwen3-VL-32B-Instruct 和 Gemini-2.0-Flash 上的整体准确率分别比最强基线高出 4.96% 和 4.72%,同时兼顾了推理效率。我们的代码已公开发布于 https://github.com/QuantaAlpha/chain-of-mindset。
Summary / 总结
The research aims to enhance the adaptability of large language models (LLMs) in problem-solving by addressing their tendency to use a single fixed mindset. The proposed Chain of Mindset (CoM) framework dynamically switches between four mindsets—Spatial, Convergent, Divergent, and Algorithmic—based on the evolving reasoning state. Experiments show that CoM outperforms existing methods by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, respectively, while maintaining efficiency across six benchmarks including mathematics, code generation, and spatial reasoning.
论文提出了名为 Chain of Mindset (CoM) 的框架,以解决现有大语言模型在推理中固守单一思维模式的问题,该框架能够在推理过程中实现步骤级的自适应思维模式编排。CoM 将推理分解为四种思维模式:空间、收敛、发散和算法,并使用一个元智能体根据推理状态的演变动态选择最优思维模式。实验结果显示,在涵盖数学、代码生成、科学问答和空间推理的六个基准测试中,CoM 在 Qwen3-VL-32B-Instruct 和 Gemini-2.0-Flash 上的整体准确率分别比最强基线高出 4.96% 和 4.72%,同时保持了推理效率。