WorldCache: Content-Aware Caching for Accelerated Video World Models
Authors: Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan
First: 2026-03-23T17:59:54+00:00 · Latest: 2026-03-23T17:59:54+00:00
Comments: 33 Pages
Abstract
Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
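For orientation, here is a minimal sketch of the Zero-Order Hold baseline the abstract argues against, not of WorldCache itself. The class name, the drift measure, the threshold value, and the stand-in `expensive_block` are all assumptions made for illustration; none of them come from the paper.

```python
def relative_drift(current, reference):
    """Mean absolute change relative to the reference magnitude."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) + 1e-8
    return num / den


class ZeroOrderHoldCache:
    """Reuse the last computed output while the input drift stays small."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.ref_input = None      # input seen at the last full computation
        self.cached_output = None
        self.hits = 0

    def __call__(self, block, x):
        if (self.ref_input is not None
                and relative_drift(x, self.ref_input) < self.threshold):
            self.hits += 1
            # Static snapshot: returned unchanged, with no motion compensation.
            return self.cached_output
        self.ref_input = list(x)
        self.cached_output = block(x)
        return self.cached_output


def expensive_block(x):
    # Hypothetical stand-in for a costly transformer block.
    return [2.0 * v for v in x]


cache = ZeroOrderHoldCache(threshold=0.05)
inputs = [[1.0, 1.0], [1.01, 1.0], [1.5, 1.0]]
outputs = [cache(expensive_block, x) for x in inputs]
```

The second step reuses the snapshot because the input barely moved; the third exceeds the threshold and recomputes. A single global threshold like this is what the abstract describes as prone to ghosting in dynamic scenes.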
中文标题/摘要
标题:WorldCache:内容感知缓存加速视频世界模型
扩散变换器(DiTs)驱动高保真视频世界模型,但由于顺序去噪和昂贵的空间-时间注意力,计算成本仍然很高。无需训练的特征缓存通过在去噪步骤中重用中间激活来加速推理;然而,现有方法主要依赖于零阶保持假设,即在全局漂移较小时,将缓存特征作为静态快照重用。这通常会导致动态场景中的鬼影伪影、模糊和运动不一致。我们提出了**WorldCache**,一种感知约束动力缓存框架,以改进何时以及如何重用特征。WorldCache 引入了运动自适应阈值、显著性加权漂移估计、通过混合和扭曲进行的最佳近似以及扩散步骤中的相位感知阈值调度。我们的一体化方法使在无需重新训练的情况下实现适应性、运动一致的特征重用成为可能。在PAI-Bench上评估的Cosmos-Predict2.5-2B上,WorldCache 达到了**2.3倍**的推理加速,同时保持了**99.4%**的基本质量,显著优于先前的无需训练的缓存方法。我们的代码可以在**World-Cache**(https://umair1221.github.io/World-Cache/)上访问。
Summary / 总结
WorldCache proposes a Perception-Constrained Dynamical Caching framework to improve feature reuse in diffusion transformers for video world models. It introduces motion-adaptive thresholds, saliency-weighted drift estimation, and phase-aware threshold scheduling to avoid ghosting artifacts and motion inconsistencies. On PAI-Bench, WorldCache achieves a 2.3x inference speedup while maintaining 99.4% of baseline quality, outperforming previous training-free caching methods.
WorldCache 提出了一种感知约束的动力学缓存框架,以改进视频世界模型中扩散变换器中的特征重用。它引入了运动自适应阈值、显著性加权漂移估计和相位感知阈值调度,以避免鬼影伪影和运动不一致性。在 Cosmos-Predict2.5-2B 上,WorldCache 实现了 2.3 倍的推理加速,同时保持了 99.4% 的基线质量,显著优于之前的无训练缓存方法。
ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model
Authors: Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu
First: 2026-03-23T17:59:42+00:00 · Latest: 2026-03-23T17:59:42+00:00
Comments: 10 pages, 5 figures
Abstract
Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
中文标题/摘要
标题:ThinkJEPA:利用大型视觉-语言推理模型增强潜在世界模型
潜在世界模型(例如V-JEPA2)的近期进展显示了从视频观察中预测未来世界状态的有希望的能力。然而,从短观察窗口进行密集预测限制了时间上下文,并可能导致预测偏向局部、低级的外推,难以捕捉长时序语义,从而降低下游实用性。视觉-语言模型(VLMs)通过在均匀采样的帧上进行推理提供了强大的语义基础和通用知识,但由于计算驱动的稀疏采样、语言输出瓶颈以及在适应小型动作条件数据集时的数据范式不匹配,它们并不适合作为独立的密集预测器。我们提出了一种由VLM引导的JEPA风格的潜在世界建模框架,该框架结合了密集帧动力学建模和通过双时间路径进行长时序语义指导:一个密集JEPA分支用于精细的运动和交互提示,以及一个具有较大时间步长的均匀采样VLM \emph{思考者}分支,用于知识丰富的指导。为了有效地转移VLM的渐进推理信号,我们引入了一种分层金字塔表示提取模块,将多层VLM表示聚合为与潜在预测兼容的指导特征。在手操作轨迹预测实验中,我们的方法优于一个强大的VLM仅基线和一个JEPA预测器基线,并且表现出更稳健的长时序展开行为。
Summary / 总结
The research aims to enhance latent world models by integrating a vision-language model (VLM) for better long-term prediction. The method uses a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM 'thinker' branch for semantic guidance. The hierarchical pyramid representation extraction module transfers VLM's reasoning signals effectively. Experiments show the proposed method outperforms both a VLM-only baseline and a JEPA-predictor baseline, providing more robust long-term predictions.
研究旨在通过整合视觉-语言模型(VLM)来提升潜在世界模型的长期预测能力。方法使用双时间路径,结合密集的JEPA分支来捕捉精细的运动和交互线索,以及VLM思考者分支提供丰富的语义指导。实验表明,所提出的方法在长期滚动行为的鲁棒性方面优于仅VLM基线和JEPA预测器基线。
3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
Authors: Haoyu Zhen, Xiaolong Li, Yilin Zhao, Han Zhang, Sifei Liu, Kaichun Mo, Chuang Gan, Subhashree Radhakrishnan
First: 2026-03-23T17:59:14+00:00 · Latest: 2026-03-23T17:59:14+00:00
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have shown impressive reasoning abilities, yet they struggle with spatial understanding and layout consistency when performing fine-grained visual editing. We introduce a Structured Reasoning framework that performs text-conditioned spatial layout editing via scene-graph reasoning. Given an input scene graph and a natural-language instruction, the model reasons over the graph to generate an updated scene graph that satisfies the text condition while maintaining spatial coherence. By explicitly guiding the reasoning process through structured relational representations, our approach improves both interpretability and control over spatial relationships. We evaluate our method on a new text-guided layout editing benchmark encompassing sorting, spatial alignment, and room-editing tasks. Our training paradigm yields an average 15% improvement in IoU and 25% reduction in center-distance error compared to Chain of Thought Fine-tuning (CoT-SFT) and vanilla GRPO baselines. Compared to SOTA zero-shot LLMs, our best models achieve up to 20% higher mIoU, demonstrating markedly improved spatial precision.
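The two metrics quoted above can be illustrated with a short sketch. It assumes axis-aligned 3D boxes stored as (min corner, max corner); the abstract does not specify the benchmark's box parameterization, so both the representation and the function names are hypothetical.

```python
import math


def iou_3d(a, b):
    """IoU of two axis-aligned boxes, each given as (min_xyz, max_xyz)."""
    (amin, amax), (bmin, bmax) = a, b
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(amin, amax, bmin, bmax):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0:
            return 0.0          # boxes are disjoint along this axis
        inter *= overlap

    def volume(lo, hi):
        return math.prod(h - l for l, h in zip(lo, hi))

    return inter / (volume(amin, amax) + volume(bmin, bmax) - inter)


def center_distance(a, b):
    """Euclidean distance between the two box centers."""
    ca = [(l + h) / 2 for l, h in zip(*a)]
    cb = [(l + h) / 2 for l, h in zip(*b)]
    return math.dist(ca, cb)


pred = ((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
target = ((1.0, 0.0, 0.0), (3.0, 2.0, 2.0))
iou = iou_3d(pred, target)            # intersection 4, union 12
dist = center_distance(pred, target)  # centers differ by 1 along x
```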
中文标题/摘要
标题:3D-布局-R1:语言指导的空间编辑结构化推理
大型语言模型(LLMs)和视觉语言模型(VLMs)展示了令人印象深刻的推理能力,但在执行精细视觉编辑时,它们在空间理解和布局一致性方面存在困难。我们提出了一种结构化推理框架,通过场景图推理进行文本条件下的空间布局编辑。给定输入的场景图和自然语言指令,模型在图上进行推理以生成满足文本条件并保持空间连贯性的更新场景图。通过明确引导推理过程,我们的方法提高了对空间关系的可解释性和控制。我们在一个包含排序、空间对齐和房间编辑任务的新文本指导布局编辑基准上评估了我们的方法。我们的训练范式在IoU上平均提高了15%,在中心距离误差上减少了25%,优于链式思维微调(CoT-SFT)和vanilla GRPO基线。与最先进的零样本LLMs相比,我们的最佳模型在mIoU上提高了高达20%,显示出显著提高的空间精度。
Summary / 总结
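3D-Layout-R1 is a structured reasoning framework for text-conditioned spatial layout editing: given an input scene graph and a natural-language instruction, the model reasons over the graph to produce an updated scene graph that satisfies the instruction while maintaining spatial coherence. On a new text-guided layout editing benchmark covering sorting, spatial alignment, and room-editing tasks, the training paradigm yields an average 15% improvement in IoU and a 25% reduction in center-distance error over CoT-SFT and vanilla GRPO baselines, and the best models achieve up to 20% higher mIoU than SOTA zero-shot LLMs.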
The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
Authors: Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham
First: 2026-03-23T17:58:02+00:00 · Latest: 2026-03-23T17:58:02+00:00
Comments: 26 pages, 35 figures
Abstract
Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.
中文标题/摘要
标题:视觉语言模型中空间推理的双重机制
许多多模态任务,如图像字幕和视觉问答,要求视觉语言模型(VLMs)将物体与其属性和空间关系联系起来。然而,尚不清楚这些联系在VLMs中的何处和如何进行计算。在本工作中,我们展示了VLMs依赖于两种并发机制来表示这些联系。在语言模型骨干中,中间层在视觉标记(对应于物体)之上表示内容无关的空间关系。然而,这种机制在塑造模型预测方面仅起次要作用。相反,空间信息的主要来源在于视觉编码器,其表示编码了物体的布局,并直接被语言模型骨干利用。值得注意的是,这种空间信号在视觉标记中是全局分布的,延伸到物体区域之外的背景区域。我们展示了在所有图像标记中增强这些视觉衍生的空间表示可以提高自然图像上的空间推理性能。综上所述,我们的结果阐明了空间关联在VLMs中的计算方式,并突显了视觉编码器在实现空间推理中的核心作用。
Summary / 总结
This study investigates the mechanisms by which vision-language models (VLMs) handle spatial reasoning in tasks like image captioning and visual question answering. The research reveals that VLMs use two concurrent mechanisms: one in the language model backbone for representing content-independent spatial relations, and another in the vision encoder for encoding the layout of objects. The latter mechanism is more influential in shaping model predictions. Enhancing the vision-derived spatial representations globally improves spatial reasoning performance. The study clarifies the role of vision encoders in spatial reasoning within VLMs.
该研究探讨了视觉语言模型(VLMs)在图像字幕和视觉问答等任务中处理空间推理的机制。研究发现,VLMs 使用两种并发机制:一种是在语言模型主干中表示与内容无关的空间关系,另一种是在视觉编码器中编码物体布局。后一种机制对模型预测的影响更大。在所有图像标记上全局增强源自视觉编码器的空间表示,可以提高空间推理性能。该研究澄清了VLMs 中空间关联的计算方式,并突出了视觉编码器在实现空间推理中的核心作用。
Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
Authors: Alexandra Zelenin, Alexandra Zhuravlyova
First: 2026-03-23T17:57:24+00:00 · Latest: 2026-03-23T17:57:24+00:00
Comments: 30 pages, 15 figures, 15 tables, including appendices. Code and data at https://github.com/sockeye44/dorafactors
Abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved.
We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice.
Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
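The factored-norm idea can be checked on a toy example. The sketch below expands the squared row norm of W + sBA into a base term, a cross term, and a Gram term, using only [d_out, r] and [r, r] intermediates. It is written from the abstract's description, with helper names of my own choosing, and is not taken from the authors' code; in particular it ignores the numerical-stability measures and kernel fusion the abstract mentions.

```python
import math


def matmul(X, Y):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def dense_row_norms(W, B, A, s):
    """Reference path: materialize BA ([d_out, d_in]), then take row norms."""
    BA = matmul(B, A)
    return [math.sqrt(sum((w + s * d) ** 2 for w, d in zip(w_row, d_row)))
            for w_row, d_row in zip(W, BA)]


def factored_row_norms(W, B, A, s):
    """Same norms from base, cross, and Gram terms, without forming BA."""
    At = transpose(A)
    WAt = matmul(W, At)   # [d_out, r]
    G = matmul(A, At)     # [r, r] Gram matrix of A
    BG = matmul(B, G)     # [d_out, r]
    norms = []
    for w_row, b_row, wa_row, bg_row in zip(W, B, WAt, BG):
        base = sum(w * w for w in w_row)                      # ||W_i||^2
        cross = sum(b * wa for b, wa in zip(b_row, wa_row))   # <W_i, (BA)_i>
        gram = sum(b * bg for b, bg in zip(b_row, bg_row))    # ||(BA)_i||^2
        sq = base + 2.0 * s * cross + s * s * gram
        norms.append(math.sqrt(max(sq, 0.0)))  # clamp tiny negative round-off
    return norms


W = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]   # d_out = 2, d_in = 3
B = [[1.0], [2.0]]                        # rank r = 1
A = [[1.0, 0.0, 1.0]]
dense = dense_row_norms(W, B, A, s=0.5)
factored = factored_row_norms(W, B, A, s=0.5)
```

Both paths give row norms of sqrt(18.5) and sqrt(3) here. The naive sum of three terms shown above can lose precision when the terms nearly cancel, which is presumably part of what the paper's stable form addresses.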
中文标题/摘要
标题:DoRA的扩展:通过分解范数和融合核函数实现高秩适应
权重分解低秩适应(DoRA)通过将权重幅度与方向解耦来扩展LoRA,但其前向传播需要计算W + sBA的行范数,每个我们调查的主要框架都通过实现密集的[d_out, d_in]乘积BA来执行此计算。在d_in = 8192和秩r = 384的情况下,单个模块的范数需要大约512 MB的临时工作内存(bf16),这使得高秩DoRA在涉及数百个已适应模块和检查点时在常见的单GPU设置上变得昂贵且不可行。
我们提出了两个系统贡献。分解范数将平方范数分解为基、交叉和格朗项,这些项可以通过O(d_out r + r^2)中间量计算,从而消除密集乘积。融合Triton内核将四核DoRA组合简化为单次通过,减少了约4倍的内存流量,并使用了在幅度缩放集中在实际情况下避免了灾难性消减的数值稳定形式。
在六种8-32B视觉-语言模型(VLMs)上,使用三个NVIDIA GPU(RTX 6000 PRO、H200、B200)在bf16下r = 384时,融合实现比Hugging Face PEFT的DoRA实现快1.5-2.0倍的推理速度,梯度计算(排除优化器步骤)快1.5-1.9倍,峰值VRAM低7 GB。六种跨越四个架构代(L40S、A100、RTX 6000 PRO、H200、B200、B300)的微基准测试确认了1.5-2.7倍组合内核加速。所有模型/GPU对的最终logit余弦相似度超过0.9999,多种子训练曲线在2000步内每步损失差值的均值内匹配7.1 x 10^-4。
Summary / 总结
This paper addresses the computational challenges of high-rank Weight-Decomposed Low-Rank Adaptation (DoRA) by introducing a factored norm and fused Triton kernels. The factored norm decomposes the squared norm into base, cross, and Gram terms, reducing the memory requirement from O(d_out * d_in) to O(d_out * r + r^2), making high-rank DoRA feasible on single-GPU setups. The fused Triton kernels further optimize the DoRA composition, reducing memory traffic and improving numerical stability. Experiments on six vision-language models across three NVIDIA GPUs show that the fused implementation is 1.5-2.0x faster for inference and 1.5-1.9x faster for gradient computation, with up to 7 GB lower peak VRAM usage compared to Hugging Face PEFT's DoRA implementation.
本文解决了高秩Weight-Decomposed Low-Rank Adaptation (DoRA)的计算挑战,通过引入分解范数和融合Triton内核。分解范数将平方范数分解为基、交叉和格朗项,消除了密集乘积的需要,而融合Triton内核减少了内存流量并提高了数值稳定性。实验结果显示,在三个NVIDIA GPU上对六种视觉-语言模型进行测试时,融合实现比现有DoRA实现快1.5-2.0倍的推理速度和1.5-1.9倍的梯度计算速度,且峰值VRAM使用量最多可降低7 GB。
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
Authors: Sashuai Zhou, Qiang Zhou, Junpeng Ma, Yue Cao, Ruofan Hu, Ziang Zhang, Xiaoda Yang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Zhou Zhao
First: 2026-03-23T17:26:35+00:00 · Latest: 2026-03-23T17:26:35+00:00
Abstract
Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
中文标题/摘要
标题:SpatialReward:可验证的空间奖励建模以实现文本到图像生成中的细粒度空间一致性
通过强化学习(RL)实现的文本到图像(T2I)生成的最近进展得益于那些评估语义对齐和视觉质量的奖励模型。然而,大多数现有的奖励模型对细粒度的空间关系关注有限,经常生成整体上看似合理的图像,但包含物体定位的不准确之处。在本文中,我们提出了**SpatialReward**,一种明确设计用于评估生成图像的空间布局的可验证奖励模型。SpatialReward 采用多阶段管道:一个**提示分解器**从自由形式的提示中提取实体、属性和空间元数据;专家检测器提供准确的视觉定位和属性;一个视觉语言模型在地基观察上进行链式推理,以评估规则方法难以处理的复杂空间关系。为了更全面地评估生成图像中的空间关系,我们引入了**SpatRelBench**,一个涵盖物体属性、方向、物体间关系和渲染文本位置的基准。在Stable Diffusion和FLUX上的实验表明,将SpatialReward纳入RL训练中可以一致地提高空间一致性和整体生成质量,结果与人类判断更为一致。这些发现表明,可验证的奖励模型在使文本到图像生成模型更准确和可控的优化方面具有巨大潜力。
Summary / 总结
The research aims to improve the spatial consistency in text-to-image generation by addressing the limitations of existing reward models that often overlook fine-grained spatial relationships. The method involves a multi-stage pipeline including a Prompt Decomposer, expert detectors, and a vision-language model for evaluating complex spatial relations. Experiments show that integrating SpatialReward into RL training enhances spatial consistency and overall generation quality, aligning more closely with human judgments.
研究动机是利用强化学习提高文本生成图像中的细粒度空间一致性。主要方法是采用名为SpatialReward的多阶段管道,包括提示分解器、专家检测器和视觉语言模型来评估复杂的空间关系。关键实验发现表明,将SpatialReward集成到RL训练中可以提高空间一致性并改善整体生成质量,更接近人类的判断。
VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
Authors: Yi Du, Taimeng Fu, Zhipeng Zhao, Shaoshu Su, Zitong Zhan, Qiwei Du, Zhuoqun Chen, Bowen Li, Chen Wang
First: 2025-02-02T21:44:15+00:00 · Latest: 2026-03-23T17:26:04+00:00
Abstract
Navigating unseen, large-scale environments based on complex and abstract human instructions remains a formidable challenge for autonomous mobile robots. Addressing this requires robots to infer implicit semantics and efficiently explore large-scale task spaces. However, existing methods, ranging from end-to-end learning to foundation model-based modular architectures, often lack the capability to decompose complex tasks or employ efficient exploration strategies, leading to robot aimless wandering or target recognition failures. To address these limitations, we propose VL-Nav, a neuro-symbolic (NeSy) vision-language navigation system. The proposed system intertwines neural reasoning with symbolic guidance through two core components: (1) a NeSy task planner that leverages a symbolic 3D scene graph and image memory system to enhance the vision language models' (VLMs) neural reasoning capabilities for task decomposition and replanning; and (2) a NeSy exploration system that couples neural semantic cues with the symbolic heuristic function to efficiently gather the task-related information while minimizing unnecessary repeat travel during exploration. Validated on the DARPA TIAMAT Challenge navigation tasks, our system achieved an 83.4% success rate (SR) in indoor environments and 75% in outdoor scenarios. VL-Nav achieved an 86.3% SR in real-world experiments, including a challenging 483-meter run. Finally, we validate the system with complex instructions in a 3D multi-floor scenario.
中文标题/摘要
标题:VL-Nav:一种基于推理的视觉语言导航的神经符号方法
基于复杂和抽象的人类指令自主导航未见过的大型环境仍然是自主移动机器人的一大挑战。解决这一问题需要机器人推断隐含语义并高效探索大规模任务空间。然而,现有的方法,从端到端学习到基于基础模型的模块化架构,往往缺乏分解复杂任务或采用高效探索策略的能力,导致机器人盲目游荡或目标识别失败。为了解决这些限制,我们提出了VL-Nav,一种神经符号(NeSy)视觉语言导航系统。该系统通过两个核心组件将神经推理与符号指导相结合:(1)一个NeSy任务规划器,利用符号3D场景图和图像记忆系统增强视觉语言模型(VLMs)的神经推理能力,用于任务分解和重新规划;(2)一个NeSy探索系统,将神经语义线索与符号启发式函数耦合,以高效地收集任务相关信息,同时在探索过程中尽量减少不必要的重复行进。在DARPA TIAMAT挑战导航任务中,我们的系统在室内环境中的成功率(SR)为83.4%,在室外场景中为75%。VL-Nav在真实世界实验(包括一次具有挑战性的483米运行)中实现了86.3%的SR。最后,我们在一个3D多层场景中使用复杂指令验证了该系统。
Summary / 总结
VL-Nav is a neuro-symbolic approach designed to help autonomous robots navigate based on complex human instructions. It integrates neural reasoning with symbolic guidance through a task planner and an exploration system. The task planner uses a symbolic 3D scene graph and image memory to improve task decomposition and replanning, while the exploration system efficiently gathers task-related information. Experiments show VL-Nav achieved an 83.4% success rate in indoor environments, 75% in outdoor scenarios, and 86.3% in real-world tests, including a 483-meter run.
VL-Nav 是一种神经符号方法,旨在帮助自主机器人根据复杂的人类指令进行导航。它利用符号3D场景图和图像记忆来增强神经推理能力,并结合启发式函数高效探索环境。该系统在室内环境中的成功率为83.4%,在室外场景中为75%,在包含一次483米长距离运行的真实世界实验中为86.3%。
Power-SMC: Low-Latency Sequence-Level Power Sampling for Training-Free LLM Reasoning
Authors: Seyedarmin Azizi, Erfan Baghaei Potraghloo, Minoo Ahmadi, Souvik Kundu, Massoud Pedram
First: 2026-02-10T20:31:40+00:00 · Latest: 2026-03-23T17:01:31+00:00
Abstract
Many recent reasoning gains in large language models can be explained as distribution sharpening: biasing generation toward high-likelihood trajectories already supported by the pretrained model, rather than modifying its weights. A natural formalization is the sequence-level power distribution $π_α(y\mid x)\propto p_θ(y\mid x)^α$ ($α>1$), which concentrates mass on whole sequences instead of adjusting token-level temperature. Prior work shows that Metropolis--Hastings (MH) sampling from this distribution recovers strong reasoning performance, but at order-of-magnitude inference slowdowns. We introduce Power-SMC, a training-free Sequential Monte Carlo scheme that targets the same objective while remaining close to standard decoding latency. Power-SMC advances a small particle set in parallel, corrects importance weights token-by-token, and resamples when necessary, all within a single GPU-friendly batched decode. We prove that temperature $τ=1/α$ is the unique prefix-only proposal minimizing incremental weight variance, interpret residual instability via prefix-conditioned Rényi entropies, and introduce an exponent-bridging schedule that improves particle stability without altering the target. On MATH500, Power-SMC matches or exceeds MH power sampling while reducing latency from $16$--$28\times$ to $1.4$--$3.3\times$ over baseline decoding. The code is available at https://github.com/ArminAzizi98/Power-SMC.
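The role of the temperature-1/α proposal can be seen in a few lines. With q(v) proportional to p(v)^α, the incremental importance weight p(v)^α / q(v) equals the normalizer Σ_v p(v)^α, which depends only on the prefix and not on the sampled token. The sketch below shows just that identity plus an effective-sample-size helper; using ESS as the resampling trigger is a common SMC convention assumed here, not something the abstract states, and the function names are hypothetical.

```python
import math


def power_proposal(logp, alpha):
    """Temperature-1/alpha proposal and its incremental log-weight.

    logp: normalized next-token log-probabilities under the base model.
    Returns (q, log_w), where q[v] is proportional to p[v]**alpha and
    log_w = log(sum_v p[v]**alpha) is identical for every token v.
    """
    scaled = [alpha * lp for lp in logp]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    q = [math.exp(s - log_z) for s in scaled]
    return q, log_z


def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed from unnormalized log-weights."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)


p = [0.5, 0.25, 0.25]
logp = [math.log(x) for x in p]
q, log_w = power_proposal(logp, alpha=2.0)
# p**2 = [0.25, 0.0625, 0.0625], which sums to 0.375,
# so q = [2/3, 1/6, 1/6] and log_w = log(0.375).
```

In a full sampler each particle would accumulate `log_w` across tokens and the set would be resampled when the ESS falls too low; that loop is omitted here.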
中文标题/摘要
标题:Power-SMC:低延迟序列级功率采样以实现无需训练的LLM推理
许多大型语言模型近期的推理提升可以解释为分布锐化:将生成偏向于预训练模型已支持的高似然轨迹,而不是修改其权重。一种自然的形式化是序列级幂分布$π_α(y\mid x)\propto p_θ(y\mid x)^α$($α>1$),它将概率质量集中在整个序列上,而不是调整令牌级温度。先前的工作表明,从该分布进行Metropolis--Hastings(MH)采样可以恢复强大的推理性能,但会带来数量级的推理减速。我们引入了Power-SMC,这是一种无需训练的序贯蒙特卡洛方案,针对相同的目标,同时保持接近标准解码的延迟。Power-SMC 并行推进一小组粒子,逐个令牌校正重要性权重,并在必要时重采样,全部在单个GPU友好的批量解码中完成。我们证明了温度$τ=1/α$是唯一能使增量权重方差最小的仅依赖前缀的提议分布,通过前缀条件的Rényi熵解释残余的不稳定性,并引入了一种指数桥接调度,在不改变目标的情况下提高粒子稳定性。在MATH500上,Power-SMC 达到或超过了MH幂采样的效果,同时将相对于基线解码的延迟从$16$--$28\times$降低到$1.4$--$3.3\times$。代码可在https://github.com/ArminAzizi98/Power-SMC 获取。
Summary / 总结
Power-SMC is a training-free method that uses Sequential Monte Carlo to sample from a sequence-level power distribution, which concentrates mass on whole sequences rather than adjusting token-level temperature. It achieves strong reasoning performance while staying close to standard decoding latency, avoiding the order-of-magnitude slowdown of Metropolis--Hastings sampling. On MATH500, Power-SMC matches or exceeds the performance of MH power sampling while reducing the latency overhead from 16-28x to 1.4-3.3x over baseline decoding.
Power-SMC 是一种无需训练的方法,使用序贯蒙特卡洛从序列级幂分布中采样,以在保持低延迟的同时实现强大的推理性能。它并行推进一小组粒子,逐个令牌校正重要性权重,并在必要时重采样,所有操作都在单个批处理解码中完成。在 MATH500 上,Power-SMC 达到或超过了 Metropolis--Hastings 幂采样的效果,同时将相对于基线解码的延迟开销从 16--28 倍降低到 1.4--3.3 倍。
MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management
Authors: Jack W O'Sullivan, Mohammad Asadi, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Li Fei-Fei, Ehsan Adeli, Rima Arnaout, Euan A Ashley
First: 2026-03-23T16:42:11+00:00 · Latest: 2026-03-23T16:42:11+00:00
Abstract
Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
中文标题/摘要
标题:MARCUS:一种自主的多模态视图语言模型,用于心脏诊断和管理
心血管疾病仍然是全球死亡的主要原因,进步受阻于人类对复杂心脏测试的解释。当前的AI视图语言模型仅限于单模态输入且不具备交互性。我们介绍了MARCUS(多模态自主推理和超声与信号聊天),这是一种用于独立解释心电图(ECG)、超声心动图和心脏磁共振成像(CMR)以及作为多模态输入的端到端解释的自主视图语言系统。MARCUS采用分层的自主架构,包括模态特定的视图语言专家模型,每个模型整合了领域训练的视觉编码器和多阶段语言模型优化,由多模态协调器协调。MARCUS基于1350万张图像(包括25万张心电图、130万张超声心动图图像、1200万张心脏磁共振成像图像)以及我们新构建的专家标注数据集(涵盖160万问题),实现了最先进的性能,超越了前沿模型(GPT-5 Thinking、Gemini 2.5 Pro Deep Think)。在内部(斯坦福)和外部(UCSF)测试组中,MARCUS的心电图准确率为87-91%,超声心动图准确率为67-86%,心脏磁共振成像准确率为85-88%,分别比前沿模型高出34-45%(P<0.001)。在多模态病例中,MARCUS的准确率为70%,几乎是前沿模型(22-28%)的三倍,且自由文本质量评分提高了1.7-3.0倍。我们的自主架构还赋予了模型对幻象推理的抵抗力,即视图语言模型从无意的文本信号或虚构的视觉内容中推断出推理。MARCUS展示了领域特定的视觉编码器与自主协调器相结合,能够实现多模态心脏解释。我们已开源发布我们的模型、代码和基准测试。
ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints
Authors: Kaili Huang, Hongming Zhang, Rui Shen, Linjun Dai, Jiahao Wang, Hanming Deng, Lewei Lu
First: 2026-03-23T16:26:11+00:00 · Latest: 2026-03-23T16:26:11+00:00
Abstract
While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
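To make the asymmetry concrete, the sketch below writes the standard DPO loss with one extra factor on the rejected implicit reward. The abstract does not give the form of ACPO's complexity-aware coefficient, so `rejected_scale` is a fixed, hypothetical constant standing in for it; setting it to 1.0 recovers plain DPO.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(logp_c, ref_c, logp_r, ref_r, beta=0.1, rejected_scale=1.0):
    """DPO-style loss with a scale applied only to the rejected reward.

    logp_c / logp_r: policy log-likelihoods of chosen / rejected responses.
    ref_c / ref_r:   reference-model log-likelihoods of the same responses.
    rejected_scale:  hypothetical stand-in for ACPO's dynamic coefficient.
    """
    reward_c = beta * (logp_c - ref_c)
    reward_r = beta * (logp_r - ref_r)
    return -log_sigmoid(reward_c - rejected_scale * reward_r)


# Chosen reward +1, rejected reward -1 (beta = 1 for readability).
dpo = preference_loss(-1.0, -2.0, -3.0, -2.0, beta=1.0, rejected_scale=1.0)
scaled = preference_loss(-1.0, -2.0, -3.0, -2.0, beta=1.0, rejected_scale=0.5)
```

With a scale below 1, the rejected term contributes proportionally less to the margin and to its gradient, which is one simple way to read "asymmetrically suppressing the gradient flow on the rejected term".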
中文标题/摘要
标题:ACPO:通过不对称约束对抗视觉语言对齐中的可能性位移
尽管直接偏好优化(DPO)已成为对大型视觉语言模型(LVLMs)进行对齐的默认方法,但它会遭受可能性位移的问题,即选择和拒绝的响应概率都会下降。这种优化缺陷在多模态设置中尤为有害:选择可能性的侵蚀——我们称之为视觉锚点崩溃——导致模型放弃视觉证据,转而依赖强大的语言先验,从而引发显著的幻觉。为了解决这一问题,我们提出了不对称约束偏好优化(ACPO),这是一种跨模态的对齐机制,它对偏好优化应用动态、目标导向的缩放。ACPO 通过仅应用于拒绝奖励的复杂性感知缩放系数,不对称地抑制拒绝项的梯度流动,同时保持选择分布作为梯度稳定的参考。虽然本质上是一种通用目标,但打破这种梯度对称性对于多模态任务至关重要,因为它可以减轻语言先验对视觉标记的抑制。在 InternVL 模型上的实验表明,ACPO 有效地逆转了标准 DPO 的选择奖励退化。通过阻止视觉锚点崩溃,ACPO 在幻觉基准(HallusionBench, MM-IFEval)和通用排行榜(MMBench, MMStar, OCRBenchV2)上通常优于基线,同时推动了通用能力的同步提升。
Summary / 总结
The paper addresses the issue of Likelihood Displacement in Direct Preference Optimization (DPO) for aligning Large Vision-Language Models (LVLMs), which leads to Visual Anchor Collapse and significant hallucinations. To tackle this, the authors propose Asymmetric Constrained Preference Optimization (ACPO), which dynamically scales the rejected reward to preserve the chosen distribution as a gradient-stable reference. Experiments show that ACPO improves performance on hallucination benchmarks and general leaderboards compared to standard DPO methods, while enhancing overall model capabilities.
论文针对直接偏好优化(DPO)在大型视觉语言模型(LVLM)对齐中出现的似然位移问题,导致视觉锚点坍缩和显著的幻觉。为此,作者提出了不对称约束偏好优化(ACPO),动态调整拒绝奖励以保持选择分布作为梯度稳定的参考。实验表明,ACPO在幻觉基准和通用排行榜上优于标准DPO方法,并且提高了模型的整体能力。
Training-Free Layout-to-Image Generation with Marginal Attention Constraints
Authors: Huancheng Chen, Jingtao Li, Weiming Zhuang, Haris Vikalo, Lingjuan Lyu
First: 2024-11-15T05:44:45+00:00 · Latest: 2026-03-23T15:52:01+00:00
Abstract
Recently, many text-to-image diffusion models have excelled at generating high-resolution images from text but struggle with precise control over spatial composition and object counting. To address these challenges, prior works have developed layout-to-image (L2I) approaches that incorporate layout instructions into text-to-image models. However, existing L2I methods typically require fine-tuning of pre-trained parameters or training additional control modules for diffusion models. In this work, we propose a training-free L2I approach, MAC (Marginal Attention Constrained Generation), which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions, and then compute loss functions to optimize latent features during the diffusion reverse process. To enhance spatial controllability and mitigate semantic failures under complex layout instructions, we leverage pixel-to-pixel correlations in self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features. Comprehensive experimental results on both L2I and non-L2I pretrained diffusion models demonstrate that our method outperforms existing training-free L2I techniques, both quantitatively and qualitatively, in terms of image composition on the DrawBench and HRS benchmarks.
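One simple way to quantify the mismatch between a cross-attention map and a layout box is the share of attention mass falling outside the box. The sketch below uses that measure as an assumed stand-in; the abstract does not spell out MAC's three loss functions, and the gradient step on the latents is omitted because it needs an autodiff framework.

```python
def layout_loss(attn, box):
    """1 - (attention mass inside the box) / (total attention mass).

    attn: 2D list of non-negative cross-attention values for one text token.
    box:  (row0, col0, row1, col1), half-open, in attention-map coordinates.
    """
    r0, c0, r1, c1 = box
    total = sum(sum(row) for row in attn)
    if total <= 0:
        return 1.0
    inside = sum(sum(row[c0:c1]) for row in attn[r0:r1])
    return 1.0 - inside / total


attn = [[0.1, 0.1],
        [0.2, 0.6]]
loss_bottom_row = layout_loss(attn, (1, 0, 2, 2))  # 0.8 of the mass is inside
loss_full_map = layout_loss(attn, (0, 0, 2, 2))    # everything is inside
```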
中文标题/摘要
标题:基于边缘注意力约束的无需训练布局到图像生成
近年来,许多文本到图像的扩散模型在从文本生成高分辨率图像方面表现出色,但在精确控制空间组成和物体计数方面存在困难。为了解决这些挑战,先前的工作开发了布局到图像(L2I)方法,将布局指令融入到文本到图像模型中。然而,现有的L2I方法通常需要对预训练参数进行微调或为扩散模型训练额外的控制模块。在本文中,我们提出了一种无需训练的L2I方法,即MAC(边缘注意力约束生成),该方法消除了对额外模块或微调的需求。具体而言,我们使用文本-视觉交叉注意力特征图来量化生成图像的布局与提供的指令之间的不一致性,并在此过程中计算损失函数以优化扩散逆过程中的潜在特征。为了增强空间可控性和在复杂布局指令下减轻语义失败,我们利用自注意力特征图中的像素到像素相关性对交叉注意力图进行对齐,并结合由边界注意力约束的三个损失函数来更新潜在特征。在DrawBench和HRS基准上的全面实验结果表明,与现有的无需训练的L2I技术相比,我们的方法在图像组成方面在定量和定性上均表现出更优的效果。
Summary / 总结
This work introduces a training-free layout-to-image generation method called MAC (Marginal Attention Constrained Generation) to address the challenges of precise spatial composition and object counting in text-to-image generation. MAC uses text-visual cross-attention feature maps to quantify inconsistencies and compute loss functions during the diffusion reverse process, while leveraging pixel-to-pixel correlations and boundary attention to enhance spatial controllability. Experimental results show that MAC outperforms existing training-free L2I techniques on DrawBench and HRS benchmarks in terms of image composition.
该研究提出了MAC(Marginal Attention Constrained Generation)方法,该方法利用文本-视觉交叉注意力特征图在扩散逆过程优化潜在特征。通过利用像素到像素的相关性和边界注意力,MAC增强了空间可控性和减少了语义错误。实验结果表明,MAC在DrawBench和HRS基准测试中在图像组成方面优于现有的无训练L2I技术。
Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models
Authors: Xingyu Zhu, Beier Zhu, Shuo Wang, Junfeng Fang, Kesen Zhao, Hanwang Zhang, Xiangnan He
Venue: CVPR 2026
First: 2026-03-23T15:23:23+00:00 · Latest: 2026-03-23T15:23:23+00:00
Comments: CVPR 2026
Abstract
As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model's general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
中文标题/摘要
标题:基于零空间投影的有原则引导以防御视觉语言模型的脱狱攻击
随着视觉语言模型(VLMs)在开放场景中的广泛应用,它们容易受到视觉脱狱攻击的诱导,生成有害内容,这对模型的安全性和可信使用构成了严重威胁。最近的激活引导方法在推理过程中向模型激活中注入方向向量以诱导拒绝行为,并已显示出有效性。然而,一个引导向量可能会同时增强拒绝能力并导致过度拒绝,从而降低模型在良性输入上的性能。此外,由于缺乏理论可解释性,这些方法仍然存在有限的鲁棒性和有效性。为了更好地平衡安全性和实用性,我们提出了NullSteer,一种零空间投影激活防御框架。我们的方法通过线性变换在模型激活中构建拒绝方向:它在良性子空间中保持零扰动,同时动态诱导潜在有害方向上的拒绝,从而理论上实现安全性增强而不损害模型的一般能力。广泛实验表明,NullSteer在各种脱狱攻击下显著减少了有害输出(MiniGPT-4的平均ASR降低超过15%),同时在通用基准上保持与原始模型相当的性能。
Summary / 总结
The paper addresses the risk of visual jailbreak attacks on vision-language models, which can lead to harmful content generation. It introduces NullSteer, a null-space projected activation defense framework that enhances model safety by dynamically inducing refusal along potentially harmful directions while maintaining zero perturbation within the benign subspace. Experiments show that NullSteer reduces the attack success rate (ASR) by over 15 percent on average on MiniGPT-4 under various jailbreak attacks, while keeping performance on general benchmarks comparable to the original model.
论文针对视觉语言模型因视觉越狱攻击生成有害内容的风险,提出了一种名为NullSteer的零空间投影激活防御框架,通过动态诱导潜在有害方向上的拒绝行为,同时在良性子空间内保持零扰动,从而增强安全性。实验结果显示,在多种越狱攻击下,NullSteer在MiniGPT-4上的平均ASR降低超过15%,且在通用基准测试上的性能与原模型相当。
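The null-space idea described in the NullSteer abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `benign_null_projector` and `steer`, the SVD-based estimate of the benign subspace, and the `rank` and `strength` parameters are all assumptions made for illustration. The sketch builds a projector onto the orthogonal complement of a benign activation subspace, so that a steering vector loses any component that lies inside that subspace.

```python
import numpy as np

def benign_null_projector(benign_acts: np.ndarray, rank: int) -> np.ndarray:
    """Return P = I - V V^T, where V spans the top-`rank` benign directions.

    benign_acts: (n_samples, d) matrix of activations (hypothetical input).
    """
    # Right singular vectors form an orthonormal basis of the row space.
    _, _, vt = np.linalg.svd(benign_acts, full_matrices=False)
    v = vt[:rank].T                              # shape (d, rank)
    d = benign_acts.shape[1]
    return np.eye(d) - v @ v.T

def steer(activation, refusal_dir, projector, strength=1.0):
    """Add the refusal direction after removing its benign-subspace component."""
    return activation + strength * (projector @ refusal_dir)
```

Under these assumptions, the perturbation `steer(x, r, P) - x` is orthogonal to every benign activation used to build `P`, which is one simple reading of "zero perturbation within the benign subspace".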
P-Flow: Prompting Visual Effects Generation
Authors: Rui Zhao, Mike Zheng Shou
First: 2026-03-23T15:21:28+00:00 · Latest: 2026-03-23T15:21:28+00:00
Abstract
Recent advancements in video generation models have significantly improved their ability to follow text prompts. However, the customization of dynamic visual effects, defined as temporally evolving and appearance-driven visual phenomena like object crushing or explosion, remains underexplored. Prior works on motion customization or control mainly focus on low-level motions of the subject or camera, which can be guided using explicit control signals such as motion trajectories. In contrast, dynamic visual effects involve higher-level semantics that are more naturally suited for control via text prompts. However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. By leveraging the semantic and temporal reasoning capabilities of vision-language models, P-Flow performs test-time prompt optimization, refining prompts based on the discrepancy between the visual effects of the reference video and the generated output. Through iterative refinement, the prompts evolve to better induce the desired dynamic effect in novel scenes. Experiments demonstrate that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks. Code is available at https://github.com/showlab/P-Flow.
中文标题/摘要
标题:P-Flow:提示视觉效果生成
近期视频生成模型的发展显著提高了其遵循文本提示的能力。然而,动态视觉效果的定制,即时间上不断演变且以外观为导向的视觉现象(如物体挤压或爆炸)的定义,仍然未被充分探索。此前关于运动定制或控制的工作主要集中在主体或摄像机的低级运动上,这些运动可以通过显式的控制信号(如运动轨迹)进行引导。相比之下,动态视觉效果涉及更高层次的语义,更自然地适合通过文本提示进行控制。然而,人类很难仅凭一个提示准确地指定这些效果,因为它们需要复杂的时序推理和时间上的迭代优化。为了解决这一挑战,我们提出了一种名为P-Flow的新型无训练框架,用于在不修改底层模型的情况下定制视频生成中的动态视觉效果。通过利用视觉语言模型的语义和时序推理能力,P-Flow在测试时进行提示优化,根据参考视频和生成输出之间的差异来细化提示。通过迭代优化,提示逐渐演化以更好地诱导新场景中的所需动态效果。实验表明,P-Flow实现了高保真度和多样化的视觉效果定制,并在文本到视频和图像到视频生成任务中均优于其他模型。代码可在https://github.com/showlab/P-Flow/ 获取。
Summary / 总结
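P-Flow is a training-free framework for customizing dynamic visual effects, such as object crushing or explosion, in video generation without modifying the underlying model. It uses a vision-language model to perform test-time prompt optimization, iteratively refining the prompt based on the discrepancy between the visual effects in a reference video and those in the generated output. Experiments show that P-Flow achieves high-fidelity and diverse visual effect customization and outperforms other models on both text-to-video and image-to-video generation tasks.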
Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning
Authors: Xingyu Zhu, Liang Yi, Shuo Wang, Wenbo Zhu, Yonglinag Wu, Beier Zhu, Hanwang Zhang
Venue: CVPR 2026
First: 2026-03-23T15:03:47+00:00 · Latest: 2026-03-23T15:03:47+00:00
Comments: CVPR 2026
Abstract
Multimodal 3D vision-language models show strong generalization across diverse 3D tasks, but their performance still degrades notably under domain shifts. This has motivated recent studies on test-time adaptation (TTA), which enables models to adapt online using test-time data. Among existing TTA methods, cache-based mechanisms are widely adopted for leveraging previously observed samples in online prediction refinement. However, they store only limited historical information, leading to progressive information loss as the test stream evolves. In addition, their prediction logits are fused heuristically, making adaptation unstable. To address these limitations, we propose BayesMM, a Multimodal Bayesian Distribution Learning framework for test-time point cloud analysis. BayesMM models textual priors and streaming visual features of each class as Gaussian distributions: textual parameters are derived from semantic prompts, while visual parameters are updated online with arriving samples. The two modalities are fused via Bayesian model averaging, which automatically adjusts their contributions based on posterior evidence, yielding a unified prediction that adapts continually to evolving test-time data without training. Extensive experiments on multiple point cloud benchmarks demonstrate that BayesMM maintains robustness under distributional shifts, yielding over 4% average improvement.
中文标题/摘要
标题:通过多模态贝叶斯分布学习适应点云分析
多模态3D视觉-语言模型在多种3D任务上表现出强大的泛化能力,但在领域转移下其性能显著下降。这促使了测试时自适应(TTA)的研究,使模型能够在测试时利用测试数据在线进行自我调整。现有TTA方法中,基于缓存的机制广泛采用,用于利用先前观察到的样本进行在线预测细化。然而,它们仅存储有限的历史信息,随着测试流的演变,导致信息逐渐丢失。此外,它们的预测logits是基于启发式方法融合的,使得自适应不稳定。为解决这些局限性,我们提出了BayesMM,一种用于测试时点云分析的多模态贝叶斯分布学习框架。BayesMM将每个类别的文本先验和流式视觉特征建模为高斯分布:文本参数来自语义提示,而视觉参数则随着到达的样本在线更新。两种模态通过贝叶斯模型平均融合,根据后验证据自动调整其贡献,从而在无需训练的情况下生成统一的预测,能够持续适应不断变化的测试时数据。在多个点云基准上的广泛实验表明,BayesMM在分布转移下保持了鲁棒性,平均提高了超过4%。
Summary / 总结
The paper addresses the limitations of existing cache-based test-time adaptation methods for 3D vision-language models, particularly their heuristic, unstable logit fusion and progressive information loss. It introduces BayesMM, a framework that models textual priors and streaming visual features of each class as Gaussian distributions and fuses them via Bayesian model averaging, enabling continual adaptation without training. Experiments on multiple point cloud benchmarks show that BayesMM remains robust under distributional shifts, with an average improvement of over 4%.
论文提出了一种名为BayesMM的框架,用于解决3D视觉-语言模型在领域变化下性能下降的问题。BayesMM将文本和视觉特征建模为高斯分布,并通过贝叶斯模型平均进行融合,实现无需训练的持续适应。实验表明,BayesMM在分布变化下保持了鲁棒性,平均提高了超过4%。
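The fusion step described for BayesMM can be sketched generically as Bayesian model averaging over two per-class Gaussian models. The following is a minimal illustration rather than the paper's method: isotropic Gaussians with a shared variance, uniform priors over classes and modalities, and the helper names `log_gaussian`, `update_mean`, and `bma_predict` are all assumptions.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) evaluated at x."""
    d = x.shape[-1]
    sq_dist = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (sq_dist / var + d * np.log(2 * np.pi * var))

def update_mean(mean, count, x):
    """Running-mean update of one class's visual mean after observing x."""
    count += 1
    return mean + (x - mean) / count, count

def bma_predict(x, text_means, visual_means, var=1.0):
    """Return p(class | x), averaging text and visual models by their evidence.

    text_means, visual_means: (n_classes, d) arrays of per-class means.
    """
    log_lik = np.stack([log_gaussian(x, text_means, var),
                        log_gaussian(x, visual_means, var)])   # (2, n_classes)
    row_max = log_lik.max(axis=1, keepdims=True)
    lik = np.exp(log_lik - row_max)                 # stabilised likelihoods
    class_post = lik / lik.sum(axis=1, keepdims=True)          # p(c | x, m)
    # log p(x | m) under a uniform class prior.
    log_evidence = row_max[:, 0] + np.log(lik.mean(axis=1))
    weights = np.exp(log_evidence - log_evidence.max())
    weights /= weights.sum()                                   # p(m | x)
    return weights @ class_post
```

In this sketch, the modality whose Gaussians explain the incoming sample better receives a larger weight automatically, so no hand-tuned fusion coefficient is needed.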
FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation
Authors: Wuyang Luo, Chengkai Tan, Chang Ge, Binye Hong, Su Yang, Yongjiu Ma
Venue: CVPR 2026
First: 2026-03-23T14:53:12+00:00 · Latest: 2026-03-23T14:53:12+00:00
Comments: To appear in CVPR 2026
Abstract
Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.
中文标题/摘要
标题:FontCrafter:基于元素驱动的高保真艺术字体生成与视觉上下文生成
艺术字体生成旨在根据参考风格合成风格化的字符。然而,现有方法在风格多样性方面有限且控制粗糙。在本文中,我们探索了基于元素的艺术字体生成的潜力。元素是字体的基本视觉单元,作为所需风格的参考图像。概念上,我们将元素分为具有独特结构的对象元素(如花朵或石头)和具有无序纹理的非定形元素(如火焰或云朵)。我们引入了FontCrafter,这是一种基于元素的字体生成框架,并构建了一个大规模的数据集ElementFont,其中包含多种元素类型和高质量的字符图像。然而,实现参考元素的高保真纹理和结构重建仍然具有挑战性。为了解决这一问题,我们提出了一种上下文生成策略,将元素图像视为视觉上下文,并使用修复模型在像素级别将元素风格转移到字符区域。为了进一步控制字符形状,我们设计了一种轻量级的上下文感知掩码适配器(CMA),注入形状信息。此外,一种无需训练的注意力重定向机制实现了区域感知的风格控制,并抑制了笔画幻觉。另外,应用边缘重绘使边界更加自然。大量实验表明,FontCrafter实现了强大的零样本生成性能,特别是在保留结构和纹理保真度方面表现尤为突出,同时支持灵活的风格混合控制。
Summary / 总结
FontCrafter is an element-driven framework for artistic font generation that addresses the limitations of existing approaches by introducing a large-scale dataset, ElementFont, and an in-context generation strategy. The framework uses an inpainting model to transfer element styles into glyph regions and a Context-aware Mask Adapter to control glyph shapes. Experimental results show that FontCrafter achieves strong zero-shot generation performance, preserving both structural and textural fidelity while supporting flexible style controls.
FontCrafter 是一种基于元素的艺术字体生成框架,通过引入大规模数据集 ElementFont 和上下文生成策略来解决现有方法的局限性。该框架使用修复模型和上下文感知掩码适配器将元素样式转移到字形区域,并控制字形形状。实验结果表明,FontCrafter 在保持结构和纹理保真度方面表现出色,并支持灵活的风格混合。
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Authors: Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun
First: 2026-03-23T14:41:20+00:00 · Latest: 2026-03-23T14:41:20+00:00
Abstract
While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
中文标题/摘要
标题:双曲视觉-语言模型中基于不确定性引导的组合对齐与部分到整体语义代表性
尽管视觉-语言模型(VLMs)已经取得了显著的性能,但它们的欧几里得嵌入在捕捉部分到整体或父子结构等层次关系方面仍然有限,并且在多对象组合场景中经常面临挑战。双曲VLMs通过更好地保留层次结构并利用蕴含来建模部分到整体关系(即整个场景及其部分图像),从而缓解了这一问题。然而,现有方法并未建模每个部分对整体具有不同程度的语义代表性。我们提出了不确定性引导的组合双曲对齐(UNCHA)以增强双曲VLMs。UNCHA利用双曲不确定性来建模部分到整体的语义代表性:对整个场景而言,更具代表性的部分被赋予较低的不确定性,较不具代表性的部分被赋予较高的不确定性。这种代表性随后通过不确定性引导的权重纳入对比目标中。最后,不确定性通过由基于熵的项正则化的蕴含损失进一步校准。借助所提出的损失,UNCHA学习到部分到整体顺序更准确的双曲嵌入,捕捉图像中潜在的组合结构,并提升了对复杂多对象场景的理解。UNCHA在零样本分类、检索和多标签分类基准测试中达到了最先进的性能。我们的代码和模型可在 https://github.com/jeeit17/UNCHA.git 获取。
Summary / 总结
The research aims to enhance the hierarchical relationship representation in Vision-Language Models (VLMs) by addressing the limitations of Euclidean embeddings in capturing part-to-whole structures. The proposed UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) method models the semantic representativeness of parts to the whole using hyperbolic uncertainty, which is then integrated into a contrastive objective with uncertainty-guided weights. The method further refines the uncertainty through an entailment loss regularized by entropy. UNCHA improves the accuracy of part-whole ordering and enhances the understanding of complex multi-object scenes, achieving state-of-the-art performance on various benchmarks.
研究旨在通过解决欧几里得嵌入在捕捉部分到整体结构方面的局限性,改进视觉-语言模型(VLMs)中的层次关系建模。方法UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) 引入了双曲不确定性来建模部分对整体的不同代表性,并将其整合到具有不确定性引导权重的对比目标中。该模型进一步通过熵为基础的项正则化的蕴含损失来校准不确定性。实验结果表明,UNCHA 在零样本分类、检索和多标签分类基准测试中优于现有方法,提高了对复杂多对象场景的理解。
Tuning Real-World Image Restoration at Inference: A Test-Time Scaling Paradigm for Flow Matching Models
Authors: Purui Bai, Junxian Duan, Pin Wang, Jinhua Hao, Ming Sun, Chao Zhou, Huaibo Huang
First: 2026-03-23T14:33:43+00:00 · Latest: 2026-03-23T14:33:43+00:00
Comments: 27 pages, 10 figures
Abstract
Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
中文标题/摘要
标题:在推理时调优现实世界图像恢复:面向流匹配模型的测试时缩放范式
尽管基于扩散的现实世界图像恢复(Real-IR)已经取得了显著进展,但高效利用超大规模预训练的文本到图像(T2I)模型并充分利用其潜力仍然是重大挑战。为了解决这一问题,我们提出了ResFlow-Tuner,这是一种基于最新流匹配模型FLUX.1-dev的图像恢复框架,该框架结合了统一多模态融合(UMMF)和测试时缩放(TTS),以实现前所未有的恢复性能。我们的方法充分利用了多模态扩散变换器(MM-DiT)架构的优势,通过将多模态条件编码为统一序列来指导高质量图像的合成。此外,我们还引入了一种无需训练的测试时缩放范式,专门适用于图像恢复。在推理过程中,该技术通过奖励模型(RM)的反馈动态引导去噪方向,从而实现显著的性能提升,同时具有可控的计算开销。大量实验表明,我们的方法在多个标准基准上达到了最先进的性能。这项工作不仅验证了流匹配模型在低级视觉任务中的强大能力,更重要的是,提出了适用于大型预训练模型的新型高效推理时缩放范式。
Summary / 总结
The paper proposes ResFlow-Tuner, an image restoration framework that leverages the FLUX.1-dev flow matching model with unified multi-modal fusion and test-time scaling to achieve superior restoration performance. During inference, a training-free test-time scaling paradigm dynamically adjusts the denoising process based on feedback from a reward model, leading to significant performance improvements with manageable computational overhead. Experiments show that ResFlow-Tuner outperforms existing methods on multiple benchmarks.
论文旨在高效利用大型预训练文本到图像模型进行实际图像恢复。它提出了ResFlow-Tuner,结合了统一多模态融合和推理时缩放技术,以提升恢复性能。在推理过程中,奖励模型动态引导去噪过程,从而实现显著改进,同时保持可控的计算成本。实验表明,ResFlow-Tuner在多个基准测试中优于现有方法。
What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Authors: Inha Kang, Youngsun Lim, Seonho Lee, Jiho Choi, Junsuk Choe, Hyunjung Shim
Venue: ICLR 2026
First: 2025-10-15T07:36:38+00:00 · Latest: 2026-03-23T13:45:07+00:00
Comments: 56 pages
Abstract
State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.
中文标题/摘要
标题:什么“不”该检测:通过结构化推理和词元合并实现否定感知的VLM
最先进的视觉-语言模型(VLMs)在理解否定方面存在严重缺陷,通常被称为肯定偏见。这一限制在描述对象检测(DOD)任务中尤为严重。为了解决这一问题,我们提出了两个主要贡献:(1)一个新的数据集管道和(2)一种新颖的轻量级适应配方。首先,我们引入了CoVAND,这是一种使用系统性的链式思考(CoT)和VQA管道生成高质量、实例基础的否定数据的数据集。其次,我们提出了NegToMe,这是一种新颖的文本词元合并模块,直接解决了肯定偏见的架构原因。NegToMe从根本上解决了词元化中否定线索的结构损失,将它们与属性组合成连贯的语义短语。它在输入级别保持正确的极性,即使在数据有限的情况下也能实现稳健的否定理解。例如,为了防止模型将“not”和“girl”这两个碎片化的词元简单地视为“girl”,NegToMe将它们合并为一个词元,其含义与单独的“girl”不同。该模块与参数高效且战略性的LoRA微调方法集成。我们的方法在具有降低的假阳性率的挑战性否定基准测试中显著提高了性能,在OVDEval上NMS-AP提高了最多+10.8分,并展示了对最先进的VLMs的一般化能力。这项工作标志着在解决实际检测应用中的否定理解方面迈出了一大步。
Summary / 总结
This paper addresses the issue of affirmative bias in vision-language models (VLMs) regarding negation understanding, particularly in described object detection tasks. It introduces CoVAND, a new dataset pipeline, and NegToMe, a novel text token merging module. NegToMe groups negation cues with attributes into coherent semantic phrases, improving negation understanding. The method significantly enhances performance on negation benchmarks, reducing false positives and boosting NMS-AP by up to 10.8 points on OVDEval, demonstrating generalization to state-of-the-art VLMs.
该论文针对视觉语言模型(VLMs)在处理否定时的肯定偏见问题,特别是在描述对象检测任务中。它引入了CoVAND,一个新的数据集生成管道,以及NegToMe,一种新型的文本词元合并模块。NegToMe将否定线索与属性组合成连贯的语义短语,提高否定理解能力。该方法在否定基准测试中显著提升了性能,减少了误报,并在OVDEval上将NMS-AP提高了高达10.8个百分点,展示了对最先进的VLMs的一般化能力。
Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection
Authors: Youbin Kim, Jinho Park, Hogun Park, Eunbyung Park
First: 2026-03-23T13:01:14+00:00 · Latest: 2026-03-23T13:01:14+00:00
Comments: 24 pages, 7 figures, Project page: https://ubin108.github.io/Group3D/
Abstract
Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
Summary / 总结
Group3D is a multi-view open-vocabulary 3D detection framework that integrates semantic constraints into instance construction, addressing the limitations of geometry-only merging. It uses a scene-adaptive vocabulary from a multimodal large language model to organize objects into semantic compatibility groups, which act as merge-time constraints. Experiments show that Group3D outperforms existing methods in multi-view open-vocabulary 3D detection and generalizes well in zero-shot scenarios.
Group3D 是一个多视图开放词汇 3D 检测框架,将语义约束集成到实例构建过程中,解决了仅靠几何合并的局限性。它使用多模态大语言模型生成的场景自适应词汇表,将物体组织成语义兼容组,作为合并时的约束条件。实验表明,Group3D 在多视图开放词汇 3D 检测中优于现有方法,并在零样本场景中表现出良好的泛化能力。
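The merge rule described for Group3D, where fragments are associated only when they pass both a semantic and a geometric test, can be sketched as follows. This is an illustrative toy, not the authors' pipeline: axis-aligned boxes, the IoU threshold, the union-find clustering, and the names `iou_3d`, `compatible`, and `gated_merge` are assumptions.

```python
from itertools import combinations

def iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)

def compatible(label_a, label_b, groups):
    """True if the labels match or share at least one compatibility group."""
    return label_a == label_b or any(
        label_a in g and label_b in g for g in groups)

def gated_merge(fragments, groups, iou_thresh=0.25):
    """Cluster (label, box) fragments; a link needs BOTH tests to pass."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(fragments)), 2):
        (label_i, box_i), (label_j, box_j) = fragments[i], fragments[j]
        if (compatible(label_i, label_j, groups)
                and iou_3d(box_i, box_j) >= iou_thresh):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(fragments)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

In this toy version, a "sofa" and a "couch" fragment that overlap are merged, while a "table" fragment occupying the same box stays separate because no compatibility group contains both labels.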
AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
Authors: Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu, Xin Luo, Lin Niu, Yifan Tan, Decheng Wu, Linchuan Xie, Rubing Yang, Guanghua Yu, Jianchen Zhu
First: 2026-02-07T07:02:56+00:00 · Latest: 2026-03-23T12:35:58+00:00
Abstract
This technical report introduces AngelSlim, a comprehensive and versatile toolkit for large model compression developed by the Tencent Hunyuan team. By consolidating cutting-edge algorithms, including quantization, speculative decoding, token pruning, and distillation, AngelSlim provides a unified pipeline that streamlines the transition from model compression to industrial-scale deployment. To facilitate efficient acceleration, we integrate state-of-the-art FP8 and INT8 Post-Training Quantization (PTQ) algorithms alongside pioneering research in ultra-low-bit regimes, featuring HY-1.8B-int2 as the first industrially viable 2-bit large model. Beyond quantization, we propose a training-aligned speculative decoding framework compatible with multimodal architectures and modern inference engines, achieving 1.8x to 2.0x throughput gains without compromising output correctness. Furthermore, we develop a training-free sparse attention framework that reduces Time-to-First-Token (TTFT) in long-context scenarios by decoupling sparse kernels from model architectures through a hybrid of static patterns and dynamic token selection. For multimodal models, AngelSlim incorporates specialized pruning strategies, namely IDPruner for optimizing vision tokens via Maximal Marginal Relevance and Samp for adaptive audio token merging and pruning. By integrating these compression strategies from low-level implementations, AngelSlim enables algorithm-focused research and tool-assisted deployment.
中文标题/摘要
标题:AngelSlim:一种更易获取、全面且高效的大型模型压缩工具包
本技术报告介绍了由腾讯混元团队开发的AngelSlim,这是一个全面且多功能的大型模型压缩工具包。通过整合包括量化、推测解码、标记剪枝和蒸馏在内的前沿算法,AngelSlim 提供了一个统一的流水线,简化了从模型压缩到工业规模部署的过渡。为了实现高效的加速,我们集成了最先进的 FP8 和 INT8 后训练量化(PTQ)算法,并结合了超低位宽领域的开创性研究,HY-1.8B-int2 是第一个工业上可行的 2 位大型模型。除了量化,我们还提出了一种与多模态架构和现代推理引擎兼容的训练对齐推测解码框架,实现了 1.8 倍到 2.0 倍的吞吐量增益,同时不牺牲输出正确性。此外,我们开发了一种无需训练的稀疏注意力框架,通过将稀疏内核与模型架构解耦,结合静态模式和动态标记选择的混合方法,减少了长上下文场景中的首次标记时间(TTFT)。对于多模态模型,AngelSlim 结合了专门的剪枝策略,包括 IDPruner 用于通过最大边际相关性优化视觉标记,以及 Samp 用于自适应音频标记合并和剪枝。通过将这些压缩策略从低级实现中集成,AngelSlim 使算法研究和工具辅助部署成为可能。
Summary / 总结
AngelSlim is a comprehensive toolkit for large model compression developed by Tencent Hunyuan, integrating quantization, speculative decoding, token pruning, and distillation. It provides a unified pipeline for transitioning from model compression to industrial-scale deployment. Key findings include achieving 1.8x to 2.0x throughput gains with a training-aligned speculative decoding framework and reducing Time-to-First-Token in long-context scenarios through a hybrid sparse attention framework. Additionally, AngelSlim includes specialized pruning strategies for multimodal models, such as IDPruner and Samp, and supports ultra-low-bit regimes like HY-1.8B-int2, the first 2-bit large model for industrial use.
AngelSlim 是腾讯混元团队开发的一个全面的大型模型压缩工具包,整合了量化、推测解码、token 剪枝和蒸馏等技术。它提供了一个从模型压缩到工业规模部署的统一管道,集成了先进的 FP8 和 INT8 后训练量化算法,并支持以 HY-1.8B-int2 为代表的超低位宽(2 比特)模型。其训练对齐的推测解码框架在不牺牲输出正确性的前提下实现了 1.8 到 2.0 倍的吞吐量提升;无需训练的稀疏注意力框架通过静态模式与动态 token 选择的混合方式将稀疏内核与模型架构解耦,缩短了长上下文场景中的首个 token 时间(TTFT)。此外,它还为多模态模型引入了 IDPruner 和 Samp 等专门的剪枝策略。
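The AngelSlim abstract says IDPruner selects vision tokens via Maximal Marginal Relevance (MMR). The snippet below is a generic greedy MMR selector, shown only to make the criterion concrete; it is not AngelSlim's code or API. The function name `mmr_select`, the cosine-similarity redundancy term, the externally supplied `relevance` scores, and the trade-off weight `lam` are assumptions, and tokens are assumed to have non-zero norm.

```python
import numpy as np

def mmr_select(tokens, relevance, k, lam=0.5):
    """Greedily pick k token indices, trading relevance against redundancy.

    tokens: (n, d) token embeddings; relevance: (n,) importance scores.
    """
    tokens = np.asarray(tokens, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    k = min(k, len(tokens))
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                      # pairwise cosine similarity
    selected = [int(np.argmax(relevance))]       # start from the most relevant
    while len(selected) < k:
        # Redundancy = similarity to the closest already-selected token.
        redundancy = sim[:, selected].max(axis=1)
        score = lam * relevance - (1.0 - lam) * redundancy
        score[selected] = -np.inf                # never pick a token twice
        selected.append(int(np.argmax(score)))
    return selected
```

With `lam=1.0` the selector reduces to plain top-k by relevance; lower values increasingly penalise tokens that are near-duplicates of ones already kept.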
Improving Fairness of Large Language Model-Based ICU Mortality Prediction via Case-Based Prompting
Authors: Gangxiong Zhang, Yongchao Long, Yuxi Zhou, Yong Zhang, Shenda Hong
First: 2025-12-17T12:29:53+00:00 · Latest: 2026-03-23T12:34:05+00:00
Abstract
Accurately predicting mortality risk in intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show strong potential in structured medical prediction tasks, their outputs may exhibit biases related to demographic attributes such as sex, age, and race, limiting their reliability in fairness-critical clinical settings. Existing debiasing methods often degrade predictive performance, making it difficult to balance fairness and accuracy.
In this study, we systematically analyze fairness issues in LLM-based ICU mortality prediction and propose a clinically adaptive prompting framework that improves both performance and fairness without model retraining. We first design a multi-dimensional bias assessment scheme to identify subgroup disparities. Based on this, we introduce CAse Prompting (CAP), a training-free framework that integrates existing debiasing strategies and further guides models using similar historical misprediction cases paired with correct outcomes to correct biased reasoning.
We evaluate CAP on the MIMIC-IV dataset. Results show that AUROC improves from 0.806 to 0.873 and AUPRC from 0.497 to 0.694. Meanwhile, prediction disparities are substantially reduced across demographic groups, with reductions exceeding 90% in sex and certain White-Black comparisons. Feature reliance analysis further reveals highly consistent attention patterns across groups, with similarity above 0.98.
These findings demonstrate that fairness and performance in LLM-based clinical prediction can be jointly optimized through carefully designed prompting, offering a practical paradigm for developing reliable and equitable clinical decision-support systems.
中文标题/摘要
标题:通过案例提示提高基于大型语言模型的ICU病死率预测公平性
准确预测重症监护病房(ICU)患者的死亡风险对于临床决策至关重要。尽管大型语言模型(LLMs)在结构化医疗预测任务中显示出强大的潜力,但其输出可能表现出与性别、年龄和种族等人口统计属性相关的偏差,限制了其在公平性关键的临床环境中的可靠性。现有的去偏方法往往会降低预测性能,使得难以在公平性和准确性之间取得平衡。
在这项研究中,我们系统地分析了LLM在ICU死亡率预测中的公平性问题,并提出了一种临床适应性的提示框架,该框架在无需重新训练模型的情况下提高了性能和公平性。我们首先设计了一个多维度的偏见评估方案来识别子群体差异。在此基础上,我们引入了案例提示(CAP),这是一种无需训练的框架,它整合了现有的去偏策略,并进一步通过与正确结果配对的历史误预测案例来引导模型纠正有偏的推理。
我们在MIMIC-IV数据集上评估了CAP。结果显示,AUROC从0.806提高到0.873,AUPRC从0.497提高到0.694。同时,预测差异在不同的人口统计群体中显著减少,性别和某些白人-黑人比较中的减少率超过90%。特征依赖性分析进一步揭示了各群体之间高度一致的注意力模式,相似度超过0.98。
这些发现表明,通过精心设计的提示可以同时优化LLM在临床预测中的公平性和性能,为开发可靠和公平的临床决策支持系统提供了实用的范式。
Summary / 总结
This study addresses fairness issues in large language model (LLM)-based ICU mortality prediction by proposing a clinically adaptive, training-free prompting framework called CAse Prompting (CAP). The method first applies a multi-dimensional bias assessment scheme to identify subgroup disparities, then integrates existing debiasing strategies and guides the model with similar historical misprediction cases paired with correct outcomes. On the MIMIC-IV dataset, AUROC improves from 0.806 to 0.873 and AUPRC from 0.497 to 0.694, while prediction disparities across demographic groups are substantially reduced (by more than 90% for sex and certain White-Black comparisons) and attention patterns remain highly consistent across groups (similarity above 0.98).
该研究通过提出一种临床适应性的提示框架CAse Prompting (CAP)来解决大型语言模型在ICU死亡率预测中的公平性问题。该方法包括一个多维度的偏见评估方案来识别子群体差异,并使用配对的历史误预测案例来纠正偏见推理。在MIMIC-IV数据集上的评估显示,AUROC和AUPRC有显著提升,同时在不同人口群体中的预测差异大幅减少,且各群体之间的注意力模式高度一致。
LEO-VL: Efficient Scene Representation for Scalable 3D Vision-Language Learning
Authors: Jiangyong Huang, Xiaojian Ma, Xiongkun Linghu, Junchao He, Qing Li, Song-Chun Zhu, Yixin Chen, Baoxiong Jia, Siyuan Huang
First: 2025-06-11T16:56:34+00:00 · Latest: 2026-03-23T10:34:50+00:00
Comments: Project page: https://leo-vl.github.io
Abstract
Developing vision-language models (VLMs) capable of understanding 3D scenes has been a longstanding research goal. Despite recent progress, 3D VLMs still struggle with spatial reasoning and robustness. We identify three key obstacles hindering their progress: (1) scene representation is constrained by a capacity-efficiency trade-off, which impedes scalable learning; (2) training data lacks a comprehensive scheme, with limited diversity across tasks and scene domains; and (3) models exhibit robustness deficiencies and lack effective post-training. To address these challenges, we first propose condensed feature grid (CFG), an efficient scene representation that significantly reduces token overhead while preserving strong perceptual capacity. Building on CFG, we introduce LEO-VL, a 3D VLM trained on over 700k 3D vision-language (3D-VL) data spanning four real-world indoor domains and five tasks such as captioning and dialogue. To further improve robustness, we propose SceneDPO, a novel post-training objective that incorporates contrastive signals across both answers and scenes. LEO-VL achieves state-of-the-art performance on various 3D-VL benchmarks, such as SQA3D, Beacon3D, and Scan2Cap. Extensive analyses highlight the efficiency of CFG and provide key insights such as the importance of task and scene diversity, the priority of data quality for effective scaling, and the advantages of SceneDPO.
中文标题/摘要
标题:LEO-VL:高效场景表示以实现可扩展的3D视觉语言学习
开发能够理解3D场景的视觉语言模型(VLMs)一直是研究目标。尽管取得了进展,但3D VLMs仍然在空间推理和鲁棒性方面存在困难。我们识别出三个阻碍其进展的关键障碍:(1)场景表示受到容量-效率权衡的限制,阻碍了可扩展学习;(2)训练数据缺乏全面方案,任务和场景领域多样性有限;(3)模型表现出鲁棒性不足,缺乏有效的后训练方法。为了解决这些挑战,我们首先提出了紧凑特征网格(CFG),这是一种高效场景表示,显著减少了标记开销,同时保持了强大的感知能力。基于CFG,我们引入了LEO-VL,这是一种在超过70万3D视觉语言(3D-VL)数据上训练的3D VLM,这些数据覆盖了四个真实世界室内领域和五个任务,如描述和对话。为了进一步提高鲁棒性,我们提出了场景DPO,这是一种新颖的后训练目标,结合了答案和场景之间的对比信号。LEO-VL在各种3D-VL基准测试中取得了最先进的性能,如SQA3D、Beacon3D和Scan2Cap。广泛的分析突显了CFG的效率,并提供了关键见解,如任务和场景多样性的重要性、高质量数据对于有效扩展的优先级以及场景DPO的优势。
Summary / 总结
The research aims to improve 3D vision-language models (VLMs) by addressing spatial reasoning and robustness issues. To achieve this, the authors propose a condensed feature grid (CFG) for efficient scene representation and introduce LEO-VL, a 3D VLM trained on diverse 3D vision-language data. Additionally, they propose SceneDPO, a post-training objective that enhances robustness. LEO-VL demonstrates state-of-the-art performance on benchmarks like SQA3D, Beacon3D, and Scan2Cap, highlighting the efficiency and effectiveness of CFG and SceneDPO.
研究旨在开发能够理解3D场景的视觉语言模型(VLMs),解决空间推理和鲁棒性等挑战。为此,作者提出了高效的场景表示方法——压缩特征网格(CFG),并引入了基于大量3D视觉语言数据训练的LEO-VL模型。此外,他们还提出了场景DPO,一种新的后训练目标,以提高鲁棒性。LEO-VL在多种3D-VL基准测试中表现出最先进的性能,强调了CFG的效率以及任务和场景多样性的重要性。
SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
Authors: Bingxuan Zhao, Qing Zhou, Chuang Yang, Qi Wang
First: 2026-03-23T10:25:45+00:00 · Latest: 2026-03-23T10:25:45+00:00
Abstract
Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.
中文标题/摘要
标题:SHARP:基于频谱感知的高动态分辨率提升适应性在遥感合成中的应用
基于扩散变换器(DiTs)的文本到图像生成取得了显著进展,但遥感(RS)合成由于两个障碍落后:缺乏专门领域的DiT先验以及在RS应用所需的大型分辨率下训练成本高昂。通过旋转位置嵌入(RoPE)缩放进行无训练的分辨率提升提供了一种实用的解决方案,但现有方法在整个去噪过程中都应用了静态的位置缩放规则。这种统一的压缩对RS图像特别有害,因为其密度更高的中高频能量编码了诸如车辆、建筑轮廓和道路标记等空中场景真实性的关键结构。解决这两个挑战需要一种专门领域的生成先验与去噪感知的位置适应策略相结合。为此,我们对超过100,000张精心挑选的RS图像进行微调,构建了一个强大的领域先验(RS-FLUX),并提出了一种无训练的分辨率提升方法——基于频谱感知的高动态适应性(SHARP),该方法引入了RoPE中的合理分数时间表k_rs(t)。SHARP在早期布局形成阶段应用强烈的定位提升,并在细节恢复过程中逐步放松,使外推强度与扩散去噪的频率渐进性质相一致。其无分辨率的表述还使从一组超参数生成多尺度图像变得稳健。在六个正方形和矩形分辨率的广泛实验中,SHARP在CLIP分数、美学分数和HPSv2上始终优于所有无训练基线,且在更激进的外推因子下性能差距扩大,且计算开销可忽略不计。代码和权重可在https://github.com/bxuanz/SHARP获取。
Summary / 总结
The research addresses two obstacles to remote sensing (RS) synthesis with text-to-image models: the lack of a domain-specialized diffusion prior and the high cost of training at large resolutions. The authors fine-tune FLUX on over 100,000 curated RS images to obtain RS-FLUX, and propose SHARP, a training-free method that applies a time-dependent schedule to RoPE rescaling: strong positional promotion during early layout formation, progressively relaxed during detail recovery. SHARP consistently outperforms training-free baselines on CLIP Score, Aesthetic Score, and HPSv2 across six resolutions with negligible computational overhead.
研究解决了使用文本到图像生成模型进行遥感(RS)合成面临的挑战,特别是缺乏专门的扩散模型和在大分辨率下训练的高成本。它引入了SHARP方法,通过在RS图像上微调FLUX并使用谱感知的、高度动态的位置适应策略来促进分辨率。SHARP在多个分辨率上在CLIP得分、美学得分和HPSv2上始终优于现有方法,且计算开销很小。
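The idea of a denoising-time-dependent RoPE rescaling can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the form of k_rs(t), so `promotion_factor` below is a made-up rational function, the convention that t = 1 is pure noise and t = 0 is the final image is assumed, and dividing positions by the factor (position-interpolation style) is one common way to rescale RoPE, not necessarily the one SHARP uses.

```python
def promotion_factor(t, scale, a=0.25):
    """Hypothetical rational schedule (NOT the paper's k_rs(t)).

    Returns `scale` at t = 1 (start of denoising, assumed pure noise)
    and 1.0 at t = 0 (end of denoising). `a` > 0 controls how long the
    factor stays close to `scale` before relaxing.
    """
    shape = t / (t + a * (1.0 - t))  # rational in t, rises from 0 to 1
    return 1.0 + (scale - 1.0) * shape


def rope_angles(pos, dim, t, scale, base=10000.0):
    """RoPE rotation angles for one position at denoising time t.

    Positions are divided by the time-dependent factor, so the rescaling
    is strongest early and vanishes at the end of denoising.
    """
    k = promotion_factor(t, scale)
    return [(pos / k) * base ** (-2.0 * i / dim) for i in range(dim // 2)]


# Extrapolating to 2x the training resolution (hypothetical setting).
early = rope_angles(pos=8, dim=4, t=1.0, scale=2.0)
late = rope_angles(pos=8, dim=4, t=0.0, scale=2.0)
```

With these assumptions, the first angle is halved at the start of denoising and left unchanged at the end, which mirrors the "strong early, relaxed late" behaviour described in the abstract.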
Show Me What You Don't Know: Efficient Sampling from Invariant Sets for Model Validation
Authors: Armand Rousselot, Joran Wendebourg, Ullrich Köthe
First: 2026-03-23T10:24:58+00:00 · Latest: 2026-03-23T10:24:58+00:00
Comments: 19 pages, 19 figures
Abstract
The performance of machine learning models is determined by the quality of their learned features. They should be invariant under irrelevant data variation but sensitive to task-relevant details. To visualize whether this is the case, we propose a method to analyze feature extractors by sampling from their fibers -- equivalence classes defined by their invariances -- given an arbitrary representative. Unlike existing work where a dedicated generative model is trained for each feature detector, our algorithm is training-free and exploits a pretrained diffusion or flow-matching model as a prior. The fiber loss -- which penalizes mismatch in features -- guides the denoising process toward the desired equivalence class, via non-linear diffusion trajectory matching. This replaces days of training for invariance learning with a single guided generation procedure at comparable fidelity. Experiments on popular datasets (ImageNet, CheXpert) and model types (ResNet, DINO, BiomedClip) demonstrate that our framework can reveal invariances ranging from very desirable to concerning behaviour. For instance, we show how Qwen-2B places patients with situs inversus (heart on the right side) in the same fiber as typical anatomy.
中文标题/摘要
标题:展示你不知道的:从不变集合高效采样以进行模型验证
机器学习模型的性能取决于其学习特征的质量。这些特征应该在无关数据变化下保持不变,但在任务相关细节上敏感。为了可视化这一点,我们提出了一种方法,通过给定任意代表样本在其纤维——由其不变性定义的等价类——中采样来分析特征提取器。与现有工作需要为每个特征检测器训练专用生成模型不同,我们的算法无需训练,并利用预训练的扩散或流匹配模型作为先验。纤维损失——惩罚特征不匹配——通过非线性扩散轨迹匹配引导去噪过程向所需的等价类发展。这将不变性学习所需的时间从几天缩短为一次指导生成过程,且保真度相当。在流行的数据集(ImageNet,CheXpert)和模型类型(ResNet,DINO,BiomedClip)上的实验表明,我们的框架可以揭示从非常理想的到令人担忧的各种不变性。例如,我们展示了Qwen-2B如何将心肺位置异常(心脏位于右侧)的患者与典型解剖结构放在同一个纤维中。
Summary / 总结
This paper proposes a method to analyze feature extractors by sampling from their fibers, the equivalence classes defined by their invariances. Unlike previous methods that train a dedicated generative model for each feature detector, the algorithm is training-free and uses a pretrained diffusion or flow-matching model as a prior. A fiber loss, which penalizes feature mismatch, guides the denoising process toward the desired equivalence class. Experiments on two datasets (ImageNet, CheXpert) and three model types (ResNet, DINO, BiomedClip) show that the framework reveals invariances ranging from desirable to concerning, including a case where patients with situs inversus are placed in the same fiber as typical anatomy.
本文提出了一种通过从特征提取器的纤维(由不变性定义的等价类)中采样来分析特征提取器的方法。不同于之前需要为每个特征检测器训练专用生成模型的方法,该算法无需训练,并使用预训练的扩散或流匹配模型作为先验。纤维损失指导去噪过程到达所需的等价类,从而实现高效采样并揭示从可取到令人担忧的各种不变性。在ImageNet、CheXpert、ResNet、DINO和BiomedClip上的实验表明,该框架能够发现各种不变性,包括一个错误地将心脏位于右侧的患者与典型解剖学归为一类的例子。
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
Authors: Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi
First: 2025-11-14T16:07:18+00:00 · Latest: 2026-03-23T09:57:02+00:00
Abstract
Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects' attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.
中文标题/摘要
标题:从合成场景到真实表现:增强VLM的空间推理能力
对视觉-语言模型(VLMs)进行微调是一种常见的策略,以提高性能,通常是在收集和标注真实场景数据后进行。然而,这一过程往往容易出现偏差、错误和分布不平衡,导致过拟合和性能不平衡。尽管有一些研究尝试通过生成合成数据来解决这个问题,但它们缺乏对分布偏差和标注质量的控制。为了解决这些挑战,我们以两种方式重新设计了微调过程。首先,我们控制数据及其标注的生成,确保其无偏差、无分布不平衡和无标注错误。我们通过全面采样场景中对象的属性(包括颜色、形状、大小和位置)自动构建数据集。其次,使用这个标注数据集,我们微调最先进的VLMs,并在绝对位置任务上评估其性能转移性。我们在合成和真实世界基准上进行了详尽的评估。我们的实验揭示了两个关键发现:1)在平衡的合成数据上进行微调可以在视觉场景中获得一致的性能并减轻常见偏差;2)在合成刺激上进行微调在真实世界数据(COCO)上的性能提高了13%,优于在完整COCO训练集上进行微调的模型。
Summary / 总结
The study aims to enhance the spatial reasoning of Vision-Language Models (VLMs) by fine-tuning them on synthetic data that is controlled for bias, distribution imbalance, and annotation errors. The dataset is constructed automatically by comprehensively sampling objects' attributes (color, shape, size, and position within the scene). Two findings stand out: fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases, and it improves performance on real-world data (COCO) by 13% on the absolute position task, outperforming models fine-tuned on the full COCO train set.
研究旨在通过解决细调数据集中的偏差和分布不平衡问题,提升视觉语言模型(VLMs)的空间推理能力。作者重新设计了细调过程,生成具有控制属性和注释的平衡合成数据。实验表明,使用这种合成数据进行细调可以提高在真实世界任务上的性能,特别是在COCO数据集上实现了比使用完整COCO训练集进行细调的模型高出13%的性能提升。
Getting to the Point: Why Pointing Improves LVLMs
Authors: Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi
First: 2026-03-23T09:38:15+00:00 · Latest: 2026-03-23T09:38:15+00:00
Abstract
Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
中文标题/摘要
标题:直达要点:为什么指认能提升LVLMs
指认通过将语义关联和推理建模为明确的顺序步骤,提高了大型视觉-语言模型(LVLMs)的准确性和可解释性。模型通过预测对象的坐标来将自然语言查询中的对象进行语义关联,然后基于这些点生成答案。尽管指认已被证明可以提高LVLMs的准确性,但尚不清楚哪种机制支持这些增益及其在认知任务中的相关性。此外,中间点的可靠性仍需进一步研究,限制了它们作为视觉解释的使用。在本研究中,我们探讨了指认在认知任务中的作用:从视觉场景中进行零样本计数。我们按照两种方法对最先进的LVLMs进行微调:直接计数,其中模型仅预测对象总数;指认后计数,其中LVLMs生成目标对象的坐标,然后进行计数。结果表明,指认后计数在分布外泛化方面表现更好,表明坐标有助于LVLMs学习技能而非仅在狭窄任务上过拟合。尽管在超过89%的情况下(通过F1衡量),预测的点准确地与图像中的对象关联,但在不同图像区域中的表现存在差异,揭示了空间偏见。最后,机制分析表明,计数的增益源自于坐标中编码的空间信息。
Summary / 总结
This study investigates why pointing improves Large Vision-Language Models (LVLMs) on a zero-shot counting task. Compared with Direct Counting, the Point-then-Count approach, which predicts object coordinates before the count, achieves higher out-of-distribution generalization. Predicted points are accurately grounded in over 89% of cases (measured by F1), but performance varies across image regions, revealing spatial biases. The findings suggest that coordinates help LVLMs learn generalizable skills rather than overfit to narrow tasks, and mechanistic analyses attribute the counting gains to the spatial information encoded in the coordinates.
本研究探讨了指针如何在零样本计数任务中提升大型视觉语言模型(LVLM)的性能。通过将接地和推理建模为显式的步骤,指针增强了LVLMs的准确性和可解释性。Point-then-Count方法,即模型首先预测物体的坐标,然后进行计数,比直接计数方法表现更好,表明坐标有助于学习通用技能而非过度拟合。然而,性能在图像区域之间存在差异,表明模型预测存在空间偏见。
Rethinking Token Reduction for Large Vision-Language Models
Authors: Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang
First: 2026-03-23T08:40:08+00:00 · Latest: 2026-03-23T08:40:08+00:00
Abstract
Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
中文标题/摘要
标题:重新思考大型视觉语言模型中的标记缩减
大型视觉语言模型(LVLMs)在视觉理解和推理方面表现出色,但过多的视觉标记导致推理成本高昂。尽管最近的标记缩减方法缓解了这一问题,但它们主要针对单轮视觉问答(VQA),而更实用的多轮视觉问答(MT-VQA)场景在很大程度上仍未被探索。MT-VQA 引入了额外的挑战,因为后续问题事先未知且可能引用图像中的任意区域,使得现有的缩减策略失效。具体来说,当前的方法分为两类:依赖提示的方法,偏向初始文本提示并丢弃对后续轮次有用的信息;以及不依赖提示的方法,尽管在技术上适用于多轮设置,但依赖注意力分数等启发式缩减指标,导致性能欠佳。在本文中,我们提出了一种基于学习的不依赖提示的方法,称为MetaCompress,克服了启发式设计的局限性。我们首先将标记缩减形式化为可学习的压缩映射,将剪枝和合并等现有形式统一到单一的学习目标中。在此基础上,我们引入了一种数据高效的训练范式,能够以有限的计算成本学习最优的压缩映射。在MT-VQA基准和多个LVLM架构上的广泛实验表明,MetaCompress 在保持跨对话轮次强泛化能力的同时,实现了更优的效率-准确度权衡。我们的代码可在 https://github.com/MArSha1147/MetaCompress 获取。
Summary / 总结
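This paper proposes MetaCompress, a learning-based, prompt-agnostic token reduction method that addresses the high inference cost caused by excessive visual tokens in Large Vision-Language Models (LVLMs). MetaCompress formulates token reduction as a learnable compression mapping that unifies pruning and merging, and learns it with a data-efficient training paradigm. Experiments on multi-turn VQA (MT-VQA) benchmarks and multiple LVLM architectures show superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns.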
本文提出了一种基于学习的提示无关方法MetaCompress,以解决大型视觉语言模型(LVLM)中由于过多视觉标记导致的高推理成本问题。MetaCompress 将标记减少视为可学习的压缩映射,并使用高效的数据训练范式来学习最优压缩映射。实验表明,MetaCompress 在效率和准确性方面优于现有方法,特别是在多轮视觉问答场景中表现出色,同时在对话轮次中保持了良好的泛化能力。
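The statement that pruning and merging are special cases of one compression mapping can be made concrete with a small sketch. Here the mapping is assumed to be an m x n matrix C applied to n token vectors; the helper names are hypothetical, and the matrices are hand-built rather than learned as in MetaCompress.

```python
def compress(C, X):
    """Apply an m x n compression matrix C to n token vectors X (n x d)."""
    d = len(X[0])
    return [[sum(c * x[j] for c, x in zip(row, X)) for j in range(d)]
            for row in C]


def pruning_matrix(keep, n):
    """One-hot rows: output token i is a copy of input token keep[i]."""
    return [[1.0 if j == k else 0.0 for j in range(n)] for k in keep]


def merging_matrix(groups, n):
    """Averaging rows: output token i is the mean of the tokens in groups[i]."""
    return [[1.0 / len(g) if j in g else 0.0 for j in range(n)]
            for g in groups]


# Four 2-dimensional visual tokens reduced to two tokens in both cases.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
pruned = compress(pruning_matrix([0, 3], 4), tokens)
merged = compress(merging_matrix([[0, 1], [2, 3]], 4), tokens)
```

Under this view, a learning-based method would optimize the entries of C (or a network that produces them) instead of fixing them by a heuristic such as attention scores.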
UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
Authors: Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren
First: 2026-01-07T23:49:52+00:00 · Latest: 2026-03-23T08:29:57+00:00
Comments: Project Page: https://unidrive-wm.github.io/UniDrive-WM
Abstract
World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM's trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 7.3% in L2 trajectory error and 10.4% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM.
中文标题/摘要
标题:UniDrive-WM:统一理解、规划和生成世界模型在自动驾驶中的应用
世界模型已成为自动驾驶的核心,准确的场景理解和未来预测对于安全控制至关重要。近期研究探索了使用视觉-语言模型(VLMs)进行规划,但现有方法通常将感知、预测和规划视为独立模块。我们提出了UniDrive-WM,这是一种基于VLM的统一世界模型,能够在单一架构中联合执行驾驶场景理解、轨迹规划和基于轨迹的未来图像生成。UniDrive-WM的轨迹规划器预测未来轨迹,条件化VLM图像生成器以生成合理的未来帧。这些预测提供了额外的监督信号,增强场景理解并迭代细化轨迹生成。我们进一步比较了离散和连续输出表示对未来图像预测的影响,分析其对下游驾驶性能的影响。在具有挑战性的Bench2Drive基准测试中,UniDrive-WM生成了高保真度的未来图像,并在L2轨迹误差和碰撞率方面分别提高了7.3%和10.4%,超过了之前的最佳方法。这些结果表明,将VLM驱动的推理、规划和生成世界建模紧密集成对于自动驾驶的优势。项目页面可在https://unidrive-wm.github.io/UniDrive-WM/获取。
Summary / 总结
UniDrive-WM is a unified VLM-based world model that integrates driving-scene understanding, trajectory planning, and future image generation. It uses a trajectory planner to predict future paths, which conditions a VLM to generate plausible future frames. Experiments show that UniDrive-WM improves planning performance by 7.3% in L2 trajectory error and 10.4% in collision rate compared to the previous best method. This demonstrates the benefits of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving.
UniDrive-WM 是一个统一的基于 VLM 的世界模型,整合了驾驶场景理解、轨迹规划和未来图像生成。它使用轨迹规划器预测未来路径,进而条件化 VLM 生成可能的未来帧。实验结果显示,UniDrive-WM 在 L2 轨迹误差和碰撞率方面分别比前最佳方法提高了 7.3% 和 10.4%,这表明将 VLM 驱动的推理、规划和生成式世界建模紧密集成对于自动驾驶的优势。
Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Authors: Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong
First: 2025-08-21T13:42:49+00:00 · Latest: 2026-03-23T07:49:49+00:00
Abstract
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
中文标题/摘要
标题:无需反向传播的测试时自适应通过概率高斯对齐
测试时自适应(TTA)通过在推理过程中利用未标记的测试数据,增强分布偏移下的零样本鲁棒性。尽管取得了显著进展,但几个挑战仍然限制了其更广泛的适用性。首先,大多数方法依赖于反向传播或迭代优化,这限制了可扩展性并阻碍了实时部署。其次,它们缺乏对类条件特征分布的显式建模。这种建模对于生成可靠决策边界和校准预测至关重要,但由于测试时既没有源数据也没有监督,这种建模仍然未被充分探索。在本文中,我们提出了一种名为ADAPT的高级分布感知且无需反向传播的测试时自适应方法。我们将TTA重新定义为一个高斯概率推理任务,通过使用逐渐更新的类均值和共享协方差矩阵来建模类条件似然。这使得可以进行闭式、无需训练的推理。为了纠正潜在的似然偏差,我们引入了由CLIP先验和历史知识库引导的轻量级正则化。ADAPT不需要源数据、不需要梯度更新,也不需要完全访问目标数据,支持在线和转导设置。在多种基准上的广泛实验表明,我们的方法在广泛的分布偏移下实现了最先进的性能,具有更好的可扩展性和鲁棒性。
Summary / 总结
The paper addresses the challenges of test-time adaptation (TTA) by proposing ADAPT, which reframes TTA as a Gaussian probabilistic inference task. ADAPT models class-conditional likelihoods with gradually updated class means and a shared covariance matrix, enabling closed-form inference without backpropagation or iterative optimization. Lightweight regularization guided by CLIP priors and a historical knowledge bank corrects potential likelihood bias. The method requires no source data, no gradient updates, and no full access to target data, so it supports both online and transductive settings. Experiments across diverse benchmarks show state-of-the-art performance under a wide range of distribution shifts, with superior scalability and robustness.
论文通过提出ADAPT方法解决了测试时适应(TTA)的挑战,将TTA重新定义为高斯概率推理任务。ADAPT避免了反向传播和迭代优化,实现了可扩展和实时部署。它使用更新后的类均值和共享协方差矩阵来建模条件类似性似然性,允许在无需训练数据的情况下进行闭式推理。ADAPT引入了使用CLIP先验和历史知识库的正则化来纠正似然性偏差。该方法在各种基准测试中实现了最先进的性能,展示了在分布变化下的优越可扩展性和鲁棒性。
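Why Gaussian class-conditionals with a shared covariance give closed-form inference can be shown with a short sketch. It is the textbook linear-discriminant result, not ADAPT's exact procedure: the CLIP-prior regularization and knowledge bank are omitted, and `update_mean` with its weighting is an assumed stand-in for the "gradually updated class means".

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def gaussian_posteriors(x, means, cov, priors):
    """Class posteriors for Gaussians that share one covariance matrix.

    With a shared covariance the quadratic term cancels across classes,
    leaving logit_c = w_c . x + b_c with w_c = cov^{-1} mu_c and
    b_c = -0.5 * mu_c . w_c + log(prior_c). No gradient steps are needed.
    """
    logits = []
    for mu, pi in zip(means, priors):
        w = solve(cov, mu)
        b = -0.5 * sum(wi * mi for wi, mi in zip(w, mu)) + math.log(pi)
        logits.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_mean(mu, x, weight, count):
    """Assumed running update of a class mean from one test feature."""
    new_count = count + weight
    new_mu = [m + weight * (xi - m) / new_count for m, xi in zip(mu, x)]
    return new_mu, new_count


means = [[1.0, 0.0], [-1.0, 0.0]]
cov = [[1.0, 0.2], [0.2, 1.0]]
probs = gaussian_posteriors([0.8, 0.1], means, cov, [0.5, 0.5])
```

In an online setting, one plausible loop would compute `probs` for each incoming feature and then call `update_mean` for each class with `weight` set to that class's posterior.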
Unleashing the Potential of All Test Samples: Mean-Shift Guided Test-Time Adaptation
Authors: Jizhou Han, Chenhao Ding, SongLin Dong, Yuhang He, Xinyuan Gao, Yihong Gong
First: 2025-07-01T06:22:00+00:00 · Latest: 2026-03-23T06:56:02+00:00
Comments: Accepted by IEEE TCSVT. This is the author's version which has not been fully edited and content may change prior to final publication
Abstract
Visual-language models (VLMs) like CLIP exhibit strong generalization but struggle with distribution shifts at test time. Existing training-free test-time adaptation (TTA) methods operate strictly within CLIP's original feature space, relying on high-confidence samples while overlooking the potential of low-confidence ones. We propose MS-TTA, a training-free approach that enhances feature representations beyond CLIP's space using a single-step k-nearest neighbors (kNN) Mean-Shift. By refining all test samples, MS-TTA improves feature compactness and class separability, leading to more stable adaptation. Additionally, a cache of refined embeddings further enhances inference by providing Mean Shift enhanced logits. Extensive evaluations on OOD and cross-dataset benchmarks demonstrate that MS-TTA consistently outperforms state-of-the-art training-free TTA methods, achieving robust adaptation without requiring additional training.
中文标题/摘要
标题:利用均值偏移引导测试时自适应释放所有测试样本的潜力
视觉语言模型(VLMs)如CLIP表现出强大的泛化能力,但在测试时面对分布偏移时存在困难。现有的无需训练的测试时自适应(TTA)方法严格局限于CLIP的原始特征空间,依赖于高置信度样本,而忽视了低置信度样本的潜力。我们提出MS-TTA,这是一种无需训练的方法,通过单步k近邻(kNN)均值偏移增强特征表示,超越CLIP的空间。通过改进所有测试样本,MS-TTA提高特征紧凑性和类别可分性,从而实现更稳定的自适应。此外,缓存的改进嵌入进一步增强了推理,通过提供均值偏移增强的对数。在广泛的异常分布(OOD)和跨数据集基准测试中,MS-TTA一致优于最先进的无需训练的TTA方法,无需额外训练即可实现稳健的自适应。
Summary / 总结
The research aims to improve the performance of vision-language models like CLIP under distribution shifts at test time. The proposed MS-TTA method uses a single-step kNN Mean-Shift to enhance feature representations beyond the original CLIP feature space, refining all test samples, including low-confidence ones, to improve feature compactness and class separability. A cache of refined embeddings further provides Mean-Shift-enhanced logits at inference. Extensive evaluations show that MS-TTA outperforms existing training-free TTA methods on OOD and cross-dataset benchmarks, achieving robust adaptation without additional training.
研究旨在通过解决测试时的分布偏移来提高视觉-语言模型CLIP的泛化能力。提出的MS-TTA方法使用单步kNN Mean-Shift来增强特征表示,超越CLIP的原始特征空间,对所有测试样本进行细化以提高特征紧凑性和类别可分性。广泛评估表明,MS-TTA在OOD和跨数据集基准上优于现有训练免费的TTA方法,实现了无需额外训练的稳健适应。
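A single-step kNN mean shift on embedding vectors can be sketched as below. The details are assumptions made for illustration: cosine similarity as the neighbour metric, a plain unweighted mean over the k neighbours, and re-normalization of the result; the abstract does not specify these choices, and `bank` stands in for whatever set of test features the method keeps.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def knn_mean_shift(query, bank, k):
    """Move a feature one step toward the mean of its k nearest neighbours.

    Neighbours are the bank features with the highest cosine similarity
    to the query; the shifted feature is their re-normalized mean.
    """
    q = normalize(query)
    unit_bank = [normalize(f) for f in bank]
    ranked = sorted(unit_bank,
                    key=lambda f: sum(a * b for a, b in zip(q, f)),
                    reverse=True)
    neighbours = ranked[:k]
    mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
    return normalize(mean)


# Two features near the first axis form a cluster; the third is an outlier.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
shifted = knn_mean_shift([0.8, 0.3], bank, k=2)
```

In this toy example the query is pulled toward the cluster on the first axis, which is the compactness effect the abstract attributes to refining every test sample rather than only high-confidence ones.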