arXiv 论文速递

2025-12-03 03:33
Snapshot: 20251203_0333
Improved Mean Flows: On the Challenges of Fastforward Generative Models
Authors: Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter, Kaiming He
First: 2025-12-01T18:59:49+00:00 · Latest: 2025-12-01T18:59:49+00:00
Comments: Technical report
Abstract
MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256×256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
中文标题/摘要
标题:改进的均值流:快速前向生成模型的挑战
均值流(MF)最近被确立为一种一步生成建模的框架。然而,其“快速前向”的性质在训练目标和引导机制中引入了关键挑战。首先,原始MF的训练目标不仅依赖于底层的真实场,还依赖于网络本身。为了解决这一问题,我们将目标重新定义为瞬时速度$v$的损失,$v$由一个网络预测平均速度$u$参数化。我们的重新定义产生了一个更标准的回归问题,并提高了训练稳定性。其次,原始MF在训练过程中固定了无分类器引导比例,牺牲了灵活性。我们通过将引导形式化为显式条件变量来解决这一问题,从而在测试时保留了灵活性。通过上下文条件处理各种条件,减少了模型大小并提高了性能。总体而言,我们的改进均值流(iMF)方法,从头开始训练,实现了ImageNet 256×256单函数评估(1-NFE)的1.72 FID。iMF在性能上显著优于此类先前方法,并且在不使用蒸馏的情况下接近多步方法的性能。我们希望我们的工作能进一步推动快速前向生成建模作为独立范式的进展。
Summary / 总结
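The paper addresses two challenges in MeanFlow (MF) for one-step generative modeling: a training target that depends on the network itself, and a classifier-free guidance scale that is fixed during training. It recasts the objective as a loss on the instantaneous velocity v, re-parameterized by a network that predicts the average velocity u, which yields a more standard regression problem and improves training stability. It also formulates guidance as explicit conditioning variables, processed through in-context conditioning, which retains flexibility at test time. The improved MeanFlow (iMF) method, trained from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256x256, outperforming prior methods of this kind and closing the gap with multi-step methods without distillation.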
论文解决了MeanFlow (MF)在单步生成建模中的挑战,如训练目标对网络本身的依赖以及训练过程中固定的指导尺度。作者将目标重新表述为瞬时速度的损失,提高了训练稳定性。他们还引入了显式的条件变量进行指导,增强了测试时的灵活性。改进后的MeanFlow (iMF)方法在ImageNet 256x256上使用单次函数评估实现了1.72的FID,超越了之前的同类方法,并且在不使用蒸馏的情况下缩小了与多步方法的差距。
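The iMF abstract above describes a loss on the instantaneous velocity v expressed through a network that predicts the average velocity u. As a minimal sketch of why such a re-parameterization is possible, the toy check below assumes the identity v = u + (t - r) * du/dt from the original MeanFlow formulation (it is not stated in the abstract), and uses a hypothetical velocity v(t) = cos(t) that depends only on time. It is not the paper's training code.

```python
import math

def v(t):
    # Toy instantaneous velocity; depending only on t is an assumption for illustration.
    return math.cos(t)

def u(r, t):
    # Average velocity over [r, t]: (1 / (t - r)) * integral of v from r to t.
    return (math.sin(t) - math.sin(r)) / (t - r)

def v_from_u(r, t, h=1e-5):
    # Recover v from u with the assumed identity v = u + (t - r) * du/dt,
    # estimating du/dt by a central finite difference.
    du_dt = (u(r, t + h) - u(r, t - h)) / (2 * h)
    return u(r, t) + (t - r) * du_dt

r, t = 0.2, 0.9
# The gap between the re-parameterized quantity and the true v should be tiny.
print(abs(v_from_u(r, t) - v(t)))
```

In a learned setting, u would be a network and the derivative would typically come from automatic differentiation rather than finite differences; that part is outside this sketch.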
STORM: Segment, Track, and Object Re-Localization from a Single Image
Authors: Yu Deng, Teng Cao, Hikaru Shindo, Jiahong Xue, Quentin Delfosse, Kristian Kersting
First: 2025-11-12T22:06:51+00:00 · Latest: 2025-12-01T18:48:10+00:00
Abstract
Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically require a pre-defined 3D model of the target and rely on a manually annotated segmentation mask in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single iMage), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and produce precise masks and 3D models for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.
中文标题/摘要
标题:STORM:从单张图像中进行分割、跟踪和对象再定位
准确的6D姿态估计和跟踪是物理AI系统(如机器人)的基本能力。然而,现有方法通常需要目标的预定义3D模型,并依赖于第一帧的手动标注分割掩码,这既耗时又导致在面对遮挡或快速移动时性能降低。为了解决这些限制,我们提出了STORM(从单张图像中进行分割、跟踪和对象再定位),这是一种开源的鲁棒实时6D姿态估计系统,无需手动标注。STORM采用了一种新颖的三阶段流水线,结合了视觉-语言理解与特征匹配:上下文对象描述引导定位,自我交叉注意力机制识别候选区域并生成精确的掩码和3D模型以实现准确的姿态估计。另一个关键创新是我们自动再注册机制,通过特征相似性监控检测跟踪失败,并从严重遮挡或快速运动中恢复。STORM在包含多对象遮挡、高速运动和变化光照的具有挑战性的工业数据集上实现了最先进的精度,同时以实时速度运行而无需额外训练。这种无需标注的方法显著降低了部署成本,为现代应用(如灵活制造和智能质量控制)提供了实用的解决方案。
Summary / 总结
STORM is an open-source 6D pose estimation system that does not require manual annotation, addressing the limitations of existing methods that rely on pre-defined 3D models and manual segmentation masks. It uses a three-stage pipeline combining vision-language understanding and feature matching to guide localization, identify candidate regions, and produce precise masks and 3D models for accurate pose estimation. STORM also includes an automatic re-registration mechanism to handle tracking failures and severe occlusions. The system achieves state-of-the-art accuracy on challenging industrial datasets while operating in real-time without additional training.
STORM 是一个无需手动标注的实时 6D 姿态估计系统,解决了现有方法依赖预定义 3D 模型和分割掩码的局限性。它使用结合视觉-语言理解和自我交叉注意力机制的三阶段管道来引导定位、识别候选区域并生成精确的掩码和 3D 模型以实现准确的姿态估计。STORM 还包括一个自动重新注册机制来处理跟踪失败和严重遮挡。该系统在具有多对象遮挡、高速运动和变化光照的挑战性工业数据集上实现了最先进的准确性,并且在无需额外训练的情况下以实时速度运行。
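The STORM abstract above mentions detecting tracking failures through feature similarity monitoring. A minimal sketch of that idea, assuming cosine similarity between a reference feature and the current frame's feature, and a hypothetical threshold of 0.6 (neither the similarity measure nor the threshold is given in the abstract):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D feature vectors.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def needs_reregistration(reference_feature, current_feature, threshold=0.6):
    # Flag a tracking failure when the tracked region no longer resembles
    # the reference object. The threshold value is a hypothetical choice.
    return cosine_similarity(reference_feature, current_feature) < threshold

print(needs_reregistration([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]))  # similar features
print(needs_reregistration([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # dissimilar features
```

A real system would likely smooth the similarity over several frames before triggering re-registration, so that a single noisy frame does not reset tracking.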
Low-Rank Prehab: Preparing Neural Networks for SVD Compression
Authors: Haoran Qin, Shansita Sharma, Ali Abbasi, Chayne Thrash, Soheil Kolouri
First: 2025-12-01T18:37:53+00:00 · Latest: 2025-12-01T18:37:53+00:00
Abstract
Low-rank approximation methods such as singular value decomposition (SVD) and its variants (e.g., Fisher-weighted SVD, Activation SVD) have recently emerged as effective tools for neural network compression. In this setting, decomposition acts as a "surgical" intervention, followed by fine-tuning that serves as "rehab" to recover accuracy. Inspired by prehabilitation in surgery, we introduce a pre-compression fine-tuning stage, Low-Rank Prehab, that explicitly encourages low-rank structure in weight matrices while preserving task performance. By conditioning the model before SVD, Prehab steers weights toward spectrally compact regions of the parameter space, enabling smoother low-rank approximation and improved recovery. Experiments on large language models (LLMs) and other Transformer-based architectures, including Vision Transformers (ViTs), show that Prehab substantially reduces the immediate accuracy drop after compression and consistently improves post-finetuning performance. Across a wide range of compression ratios, our method outperforms state-of-the-art SVD-based techniques such as SVD-LLM, highlighting the importance of preparing models for compression rather than only improving the compression and recovery stages. Source code is available at https://github.com/niqretnuh/PREHAB-SVD
中文标题/摘要
标题:低秩预康复:为神经网络准备SVD压缩
低秩近似方法,如奇异值分解(SVD)及其变体(例如Fisher加权SVD、激活SVD)最近已成为神经网络压缩的有效工具。在这种情况下,分解作为一种“外科”干预,随后的微调作为“康复”以恢复准确性。受外科手术前康复的启发,我们引入了一个预压缩微调阶段,低秩预康复,它明确地鼓励权重矩阵中的低秩结构,同时保持任务性能。通过在SVD之前对模型进行条件处理,预康复引导权重向参数空间的谱紧凑区域移动,从而实现更平滑的低秩近似并提高恢复效果。在大型语言模型(LLMs)和其他基于Transformer的架构,包括视觉变压器(ViTs)上的实验表明,预康复显著减少了压缩后的即时准确性下降,并且在微调后的一致性性能上有所提高。在广泛的压缩比范围内,我们的方法优于最先进的基于SVD的技术,如SVD-LLM,突显了准备模型进行压缩的重要性,而不仅仅是改进压缩和恢复阶段。源代码可在https://github.com/niqretnuh/PREHAB-SVD获取
Summary / 总结
The paper introduces Low-Rank Prehab, a method that prepares neural networks for singular value decomposition (SVD) compression by encouraging low-rank structure in weight matrices during pre-compression fine-tuning. This approach reduces the immediate accuracy drop after compression and improves post-finetuning performance across various architectures, including large language models and Vision Transformers. The method consistently outperforms state-of-the-art SVD-based techniques by conditioning the model before compression, enabling smoother low-rank approximation and better recovery. Source code is available at https://github.com/niqretnuh/PREHAB-SVD.
论文提出了Low-Rank Prehab方法,在压缩前通过预训练阶段鼓励低秩结构,减少压缩后的准确率下降,并在包括大型语言模型和视觉变换器等各类架构中提高了压缩后的性能。该方法通过在压缩前对模型进行条件化,优于最先进的SVD基技术,实现了更平滑的低秩逼近和更好的恢复效果。
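To make the compression step in the Low-Rank Prehab entry above concrete, here is a minimal sketch of plain truncated SVD on a single weight matrix. It is not Fisher-weighted SVD, Activation SVD, or the Prehab fine-tuning stage itself (the abstract does not specify its regularizer). The `tail_energy` helper is a hypothetical diagnostic for how "spectrally compact" a matrix is.

```python
import numpy as np

def truncated_svd(weight, rank):
    # Factor weight (m x n) into left (m x rank) and right (rank x n).
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    right = vt[:rank, :]
    return left, right, s

def tail_energy(singular_values, rank):
    # Fraction of squared spectral energy discarded by keeping `rank` components.
    total = np.sum(singular_values ** 2)
    return float(np.sum(singular_values[rank:] ** 2) / total)

rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 48))
left, right, s = truncated_svd(weight, rank=8)

# Frobenius error of the rank-8 approximation.
approx_error = np.linalg.norm(weight - left @ right)
print(approx_error, tail_energy(s, 8))
```

The approximation error equals the root of the summed squares of the discarded singular values, which is why a faster-decaying spectrum (lower tail energy) leads to a smaller immediate accuracy drop after truncation.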
Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
Authors: Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu
First: 2025-12-01T18:37:19+00:00 · Latest: 2025-12-01T18:37:19+00:00
Abstract
GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and ambiguity in real-world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain-of-Ground (CoG), a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction, the model progressively reflects and adjusts its hypotheses, leading to more accurate and interpretable localization. Our approach achieves 68.4% accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points. To measure real-world generalization, we introduce TPanel UI, a dataset of 420 labeled industrial control panels with visual distortions such as blur and masking. On TPanel UI, Chain-of-Ground improves over the strong baseline Qwen3 VL 235B by 6.9 points, showing the effectiveness of multi-step, training-free grounding across real-world and digital interfaces. These results highlight a direction for unlocking grounding potential through structured iterative refinement instead of additional training.
中文标题/摘要
标题:Chain-of-Ground:通过迭代推理和参考反馈改进GUI定位
GUI定位旨在将自然语言指令与复杂用户界面中的精确区域对齐。先进的多模态大型语言模型在视觉GUI定位方面表现出强大的能力,但仍难以处理小型或视觉上相似的目标以及现实世界布局中的歧义。这些限制源于定位能力有限以及对现有推理潜力的利用不足。我们提出了Chain-of-Ground(CoG),这是一种无需训练的多步定位框架,利用多模态大型语言模型进行迭代视觉推理和细化。模型不是直接预测,而是逐步反思和调整其假设,从而实现更准确和可解释的定位。我们的方法在ScreenSpot Pro基准测试中达到了68.4%的准确率,提高了4.8个百分点。为了衡量现实世界中的泛化能力,我们引入了TPanel UI数据集,包含420个带标注的工业控制面板,并带有模糊和遮挡等视觉失真。在TPanel UI上,Chain-of-Ground比强基线Qwen3 VL 235B高出6.9个百分点,展示了无需训练的多步定位在现实世界和数字界面中的有效性。这些结果突显了通过结构化迭代细化而非额外训练来释放定位潜力的方向。
Summary / 总结
The paper addresses the challenge of aligning natural language instructions with precise regions in complex user interfaces, particularly for small or visually similar targets and ambiguous layouts. It introduces Chain-of-Ground (CoG), a training-free framework that uses iterative visual reasoning and refinement with multimodal large language models. CoG progressively reflects and adjusts hypotheses, leading to more accurate and interpretable localization. The approach achieves 68.4% accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points over previous methods. Additionally, CoG shows effectiveness on the TPanel UI dataset, improving over the strong baseline Qwen3 VL 235B by 6.9 points, demonstrating its real-world generalization capabilities.
论文针对GUI定位问题,即将自然语言指令与复杂用户界面中的精确区域对齐的挑战。它提出了Chain-of-Ground (CoG) 方法,这是一种无需训练、基于多模态大型语言模型进行迭代视觉推理和修正的多步定位框架。该方法在ScreenSpot Pro基准测试上的准确率达到68.4%,提高了4.8个百分点;在包含模糊和遮挡等视觉失真的TPanel UI数据集上,比强基线Qwen3 VL 235B高出6.9个百分点,展示了更好的现实世界泛化能力。
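The Chain-of-Ground entry above describes a model that progressively reflects on and adjusts its hypotheses instead of predicting once. The control flow can be sketched as below. The `propose` callable is a hypothetical stand-in for a multimodal LLM call, and the toy proposer (which moves halfway toward a known target) exists only so the loop can run; none of these names come from the paper.

```python
def iterative_grounding(propose, initial_guess, steps=3):
    # Repeatedly ask `propose` for a refined guess, passing the history of
    # earlier guesses so it can act as reference feedback.
    guess = initial_guess
    history = [guess]
    for _ in range(steps):
        guess = propose(guess, history)
        history.append(guess)
    return guess, history

def toy_proposer(guess, history, target=(100.0, 40.0)):
    # Hypothetical refinement rule: move halfway toward a fixed target point.
    x, y = guess
    return (x + (target[0] - x) / 2, y + (target[1] - y) / 2)

final, history = iterative_grounding(toy_proposer, (0.0, 0.0), steps=3)
print(final)  # (87.5, 35.0)
```

In a real pipeline, each step would presumably render the previous guess onto the screenshot and ask the model to confirm or correct it; how the paper does this is not described in the abstract.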
Structure is Supervision: Multiview Masked Autoencoders for Radiology
Authors: Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt
First: 2025-11-27T10:20:51+00:00 · Latest: 2025-12-01T18:27:37+00:00
Abstract
Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.
中文标题/摘要
标题:结构即监督:放射学多视图掩蔽自编码器
构建稳健的医疗机器学习系统需要利用临床数据中固有的结构的预训练策略。我们引入了多视图掩蔽自编码器(MVMAE),这是一种自监督框架,利用放射学研究的自然多视图组织来学习视图不变且与疾病相关的表示。MVMAE 结合了掩蔽图像重建与跨视图对齐,将投影间的临床冗余转化为强大的自监督信号。我们进一步通过 MVMAE-V2T 扩展了这种方法,将放射学报告作为辅助的基于文本的学习信号,增强语义关联的同时保持完全基于视觉的推理。在三个大规模公开数据集 MIMIC-CXR、CheXpert 和 PadChest 上的下游疾病分类任务中,MVMAE 一致优于监督和视觉-语言基线。此外,MVMAE-V2T 在低标签情况下提供了额外的增益,特别是在结构化文本监督最有益的情况下。这些结果共同确立了结构和文本监督作为实现可扩展且临床相关的医疗基础模型互补路径的重要性。
Summary / 总结
The research aims to develop robust medical machine learning systems by leveraging the inherent structure in clinical data. The Multiview Masked Autoencoder (MVMAE) framework is introduced, which uses natural multi-view organization of radiology studies to learn invariant and disease-relevant representations through masked image reconstruction and cross-view alignment. MVMAE-V2T further integrates radiology reports to enhance semantic grounding. Experiments on three large-scale public datasets show that MVMAE outperforms supervised and vision-language baselines, and MVMAE-V2T provides additional gains in low-label regimes, highlighting the importance of structural and textual supervision for scalable medical foundation models.
研究旨在通过自监督学习利用临床数据中的内在结构来开发稳健的医疗机器学习系统。引入了多视图掩码自编码器(MVMAE)框架,该框架利用放射学研究的多视图性质进行不变视图和疾病相关表示的学习。MVMAE-V2T进一步整合了放射学报告以增强语义关联。在三个大规模数据集上的实验表明,MVMAE在下游疾病分类任务中优于监督学习和视觉语言基线,而MVMAE-V2T在低标签情况下提供了额外的改进。
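The MVMAE entry above combines masked image reconstruction with cross-view alignment. A minimal sketch of such a combined objective follows. It assumes a mean-squared error on masked patches, a cosine-based alignment term between two view embeddings, and a hypothetical weight `align_weight`; the abstract does not specify any of these choices.

```python
import numpy as np

def masked_reconstruction_loss(pred_patches, true_patches, mask):
    # Mean squared error computed only over masked patches (mask == True).
    mask = np.asarray(mask, dtype=bool)
    diff = pred_patches[mask] - true_patches[mask]
    return float(np.mean(diff ** 2))

def cross_view_alignment_loss(z_view_a, z_view_b):
    # 1 - cosine similarity between embeddings of two views of the same study.
    denom = np.linalg.norm(z_view_a) * np.linalg.norm(z_view_b) + 1e-12
    return 1.0 - float(z_view_a @ z_view_b) / denom

def combined_loss(pred, true, mask, z_view_a, z_view_b, align_weight=1.0):
    # Hypothetical weighting of the two terms.
    recon = masked_reconstruction_loss(pred, true, mask)
    align = cross_view_alignment_loss(z_view_a, z_view_b)
    return recon + align_weight * align

pred = np.zeros((4, 3))
true = np.ones((4, 3))
mask = np.array([1, 0, 1, 0])
z = np.array([1.0, 2.0, 3.0])
print(combined_loss(pred, true, mask, z, z))  # reconstruction term only, views agree
```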
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
Authors: Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, Jingdong Sun, Siyu Huang, Feng Liu, Qi Dai, Zhi-Qi Cheng
Venue: CVPR Workshop Anti-UAV 2025 (Best Paper)
First: 2025-04-16T10:58:33+00:00 · Latest: 2025-12-01T18:22:15+00:00
Comments: Best Paper, Accepted at CVPR Workshop Anti-UAV 2025. 16 pages
Abstract
Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives-classification, detection, and tracking-while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.
中文标题/摘要
标题:确保天空安全:无人机反制方法综述、基准测试及未来方向
无人驾驶航空器(UAVs)对于基础设施检查、监控及相关任务不可或缺,但同时也带来了关键的安全挑战。本文综述了反无人机领域,重点关注分类、检测和跟踪三大核心目标,详细介绍了诸如基于扩散的数据合成、多模态融合、视觉-语言建模、自监督学习和强化学习等新兴方法。我们系统地评估了单模态和多传感器管道(包括RGB、红外、音频、雷达和RF)中的最新解决方案,并讨论了大规模及对抗性导向的基准测试。我们的分析揭示了实时性能、隐形检测和群无人机场景中的持续差距,强调了需要开发稳健且适应性强的反无人机系统。通过突出开放的研究方向,我们旨在促进创新并指导无人机广泛使用时代下新一代防御策略的发展。
Summary / 总结
This survey examines anti-UAV methods focusing on classification, detection, and tracking, evaluating state-of-the-art solutions across single-modality and multi-sensor pipelines. Key findings include persistent gaps in real-time performance, stealth detection, and swarm scenarios, emphasizing the need for robust, adaptive anti-UAV systems. The study highlights open research directions to guide future defense strategies in the UAV era.
该调研聚焦于无人机分类、检测和跟踪方法,评估了包括RGB、红外、音频、雷达和RF等多种模态的先进解决方案。研究指出在实时性能和隐身检测方面存在不足,强调了需要开发稳健且适应性强的反无人机系统。该研究旨在指导未来的研究和开发工作。
Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness
Authors: Tavish McDonald, Bo Lei, Stanislav Fort, Bhavya Kailkhura, Brian Bartoldson
First: 2025-10-08T09:18:53+00:00 · Latest: 2025-12-01T18:15:29+00:00
Comments: 21 pages
Abstract
Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model's training data better reflects the attacked data's components. We empirically support this hypothesis across vision language model and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization. For example, InternVL 3.5 gpt-oss 20B gains little robustness when its test compute is scaled, but such scaling adds significant robustness if we first robustify its vision encoder. This correlation of inference-compute's robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Thus, we advise layering train-time and test-time defenses to obtain their synergistic benefit.
中文标题/摘要
标题:变富或死亡:通过推理计算提高鲁棒性的盈利性推理计算扩展
尽管在模型的鲁棒性提升上投入了大量训练计算资源,但模型仍然容易受到对抗性离分布(OOD)数据的影响。Zaremba等人(2025)在测试时对此问题取得进展,表明语言模型的推理能力提高了模型规范的满足度,这些规范旨在抵御攻击,从而在推理努力与对抗性破解的鲁棒性之间建立了相关性。然而,当攻击者获得梯度访问权或多种模态输入时,这种测试计算的好处会消失。我们解决了这一缺口,阐明了即使在这些情况下,推理计算也提供了益处。我们的方法认为,通过组成泛化,OOD数据可以通过其在分布(ID)组件的理解来解释,从而在对抗性OOD输入上遵守防御规范。具体而言,我们提出了推理计算鲁棒性假设(RICH):当模型的训练数据更好地反映攻击数据的组件时,推理计算防御会受益。我们通过视觉语言模型和攻击类型的经验支持这一假设,发现如果通过组成泛化解锁OOD数据上的规范遵循,测试计算可以带来鲁棒性提升。例如,InternVL 3.5 gpt-oss 20B在测试计算扩展时几乎没有鲁棒性提升,但如果首先使其视觉编码器鲁棒化,这种扩展会显著增加鲁棒性。推理计算鲁棒性益处与基础模型鲁棒性的这种相关性是RICH的富者愈富动态:被攻击数据的组件对于鲁棒化模型来说更像在分布数据,从而帮助组成泛化到OOD数据。因此,我们建议叠加训练时间和测试时间的防御以获得它们的协同效益。
Summary / 总结
The paper addresses the vulnerability of models to adversarially out-of-distribution (OOD) data despite large training-compute investments in robustification. It posits the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses help more as the model's training data better reflects the components of the attacked data, because compositional generalization then lets the model follow defensive specifications on OOD inputs. Experiments across vision-language models and attack types show that scaling test-time compute adds little robustness for a non-robust base model but significant robustness once its vision encoder is robustified. This rich-get-richer dynamic leads the authors to advise layering train-time and test-time defenses.
论文探讨了尽管投入大量训练计算资源,模型仍对对抗性离域数据易受攻击的问题。它提出了推理计算增强假设(RICH),认为增加推理计算可以提升模型的鲁棒性,尤其是在模型的训练数据更好地反映了攻击数据的组成部分时。研究通过实验证明,对于如InternVL 3.5 gpt-oss 20B这样的模型,增加测试时的计算资源可以提高其鲁棒性,特别是在先使视觉编码器变得鲁棒的情况下。这种富者愈富的动态效应归因于组成性泛化,它使模型能够通过其域内组成部分来理解离域数据。
Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models
Authors: Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid
First: 2025-12-01T17:57:27+00:00 · Latest: 2025-12-01T17:57:27+00:00
Comments: 9 pages, 9 figures, 6 tables
Abstract
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data.
中文标题/摘要
标题:Guardian:使用视觉语言模型检测机器人规划与执行错误
稳健的机器人操作需要可靠的故障检测和恢复。尽管当前的视觉语言模型(VLMs)显示出潜力,但它们的准确性和泛化能力受限于故障数据的稀缺性。为解决这一数据缺口,我们提出了一种自动机器人故障合成方法,通过程序化地扰动成功的轨迹来生成多样化的规划和执行故障。该方法不仅生成二元分类标签,还生成细粒度的故障类别和逐步推理轨迹,适用于仿真和真实世界。通过这种方法,我们构建了三个新的故障检测基准:RLBench-Fail、BridgeDataV2-Fail 和 UR5-Fail,显著扩展了现有故障数据集的多样性和规模。然后,我们训练了Guardian,这是一种使用多视角图像的VLM,用于详细的故障推理和检测。Guardian在现有和新引入的基准测试中均达到了最先进的性能。当将其集成到最先进的机器人操作(manipulation)系统中时,它也有效提高了仿真和真实机器人中的任务成功率,证明了我们生成的故障数据的影响。
Summary / 总结
The paper aims to improve failure detection in robotic manipulation by addressing the scarcity of failure data. It proposes an automatic robot failure synthesis approach that generates diverse planning and execution failures by procedurally perturbing successful trajectories. This method produces binary and fine-grained failure labels and reasoning traces. The approach constructs three new failure detection benchmarks and trains Guardian, a VLM, which achieves state-of-the-art performance and improves task success rates in both simulation and real robots.
研究旨在通过开发自动生成多样化故障场景的方法来提高机器人的操作能力,以更好地进行故障检测。该方法通过对成功轨迹进行程序化扰动来创建二元和详细的故障标签。这种方法构建了新的故障检测基准,并训练了Vision-Language模型Guardian,其在仿真和真实机器人中均表现出最先进的性能,并提高了任务成功率。
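The Guardian entry above generates failures by procedurally perturbing successful trajectories and labels each with a failure category. A minimal sketch at the level of a symbolic plan is shown below. The two perturbations and the label names `wrong_order` and `missing_step` are hypothetical; the abstract does not list the paper's actual failure categories, and real trajectories would also carry poses and images.

```python
import random

def perturb_plan(plan, rng):
    # Turn a successful plan (list of distinct step names) into a failed one.
    # Returns (perturbed_plan, failure_label).
    steps = list(plan)
    if not steps:
        raise ValueError("plan must contain at least one step")
    if len(steps) >= 2 and rng.random() < 0.5:
        # Planning failure: two steps executed in the wrong order.
        i, j = rng.sample(range(len(steps)), 2)
        steps[i], steps[j] = steps[j], steps[i]
        return steps, "wrong_order"
    # Planning failure: one required step is dropped.
    del steps[rng.randrange(len(steps))]
    return steps, "missing_step"

plan = ["reach", "grasp", "lift", "place"]
failed_plan, label = perturb_plan(plan, random.Random(0))
print(failed_plan, label)
```

Because each perturbation is applied by a known rule, the label and a textual explanation of what went wrong can be produced automatically alongside the failed plan.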
Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding
Authors: Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi, Javad Mirzapour Kaleybar, Omid Nejati Manzari
Venue: Computers in Biology and Medicine (2026)
First: 2025-12-01T17:40:03+00:00 · Latest: 2025-12-01T17:40:03+00:00
Abstract
Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet, these models remain vulnerable to hallucination outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.
中文标题/摘要
标题:Med-VCD:通过视觉对比解码减轻医疗大型视觉语言模型的幻觉
大型视觉-语言模型(LVLMs)现在在医疗应用中占据中心地位,如医学视觉问答和影像报告生成。然而,这些模型仍然容易产生看似合理但实际上错误的幻觉输出。在自然图像领域,已经提出了几种解码策略来通过强化视觉证据来减轻幻觉,但大多数依赖于次级解码或回滚程序,这会显著减慢推理速度。此外,现有解决方案往往是特定领域的,可能会在模态之间或生成内容与真实内容之间引入对齐问题。我们引入了Med-VCD,这是一种稀疏视觉对比解码方法,可以在不增加次级解码时间开销的情况下减轻医疗LVLMs的幻觉。Med-VCD 结合了一种新颖的令牌稀疏化策略,该策略可以实时选择视觉信息丰富的令牌,从而去除冗余同时保留关键的视觉上下文,从而在效率与可靠性之间取得平衡。在八个医疗数据集上的评估涵盖了眼科、放射学和病理学任务中的视觉问答、报告生成和专门的幻觉基准测试,显示Med-VCD 的事实准确性平均提高了13%,幻觉准确性提高了6%,相对于基线医疗LVLMs。
Summary / 总结
Med-VCD is a sparse visual-contrastive decoding method designed to reduce hallucination in medical large vision-language models without the time overhead of secondary decoding. It selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context. Evaluations across eight medical datasets show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.
Med-VCD 是一种通过引入视觉对比解码方法来减少医疗大型视觉语言模型幻觉的方法。它将事实准确性提高了13%,幻觉准确性提高了6%,同时没有增加推理时间。
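The Med-VCD entry above pairs token sparsification with visual-contrastive decoding. The sketch below illustrates both ideas under stated assumptions: tokens are kept by a top-k rule over hypothetical relevance scores, and logits are contrasted with the form (1 + alpha) * full - alpha * degraded, which is common in the visual contrastive decoding literature. The abstract does not give Med-VCD's actual scoring rule or contrast formula.

```python
import numpy as np

def select_visual_tokens(tokens, scores, keep):
    # Keep the `keep` highest-scoring visual tokens, preserving their order.
    scores = np.asarray(scores, dtype=float)
    idx = np.sort(np.argsort(scores)[-keep:])
    return tokens[idx], idx

def contrastive_logits(logits_full, logits_degraded, alpha=1.0):
    # Amplify what the full visual input supports relative to a degraded one.
    # alpha is a hypothetical contrast strength.
    logits_full = np.asarray(logits_full, dtype=float)
    logits_degraded = np.asarray(logits_degraded, dtype=float)
    return (1.0 + alpha) * logits_full - alpha * logits_degraded

tokens = np.arange(5) * 10
kept, idx = select_visual_tokens(tokens, [0.1, 0.9, 0.3, 0.8, 0.2], keep=2)
print(kept, idx)  # tokens at positions 1 and 3

adjusted = contrastive_logits([1.0, 2.0, 0.5], [1.0, 0.5, 0.5], alpha=1.0)
print(adjusted)  # the second candidate gains the most
```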
KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM
Authors: Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud, Malik Mohrat, Ilya Obrubov, Ekaterina Derevyanka, Ivan Sosin, Sergey Kolyubin
First: 2025-12-01T17:10:40+00:00 · Latest: 2025-12-01T17:10:40+00:00
Abstract
We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems requiring depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it ideal for ego-centric applications and harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through a high-level features based adaptive robust kernel that handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth data and/or odometry estimation, or lack dynamic scene robustness. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.
中文标题/摘要
标题:KM-ViPE:在线紧密耦合的视觉-语言-几何融合开放词汇语义SLAM
我们提出了KM-ViPE(知识映射视频姿态引擎),一种适用于动态环境的实时开放词汇语义SLAM框架,适用于未标定的单目相机。与需要深度传感器和离线标定的系统不同,KM-ViPE 直接处理原始RGB流,使其适用于第一人称应用,并能从互联网规模的视频数据中收集训练数据。KM-ViPE 通过基于高级特征的自适应鲁棒核将DINO视觉特征与几何约束紧密耦合,处理移动物体和可移动静态物体(例如,第一人称视角中的移动家具)。系统通过融合几何和深度视觉特征与语言嵌入对齐来进行同时在线定位和开放词汇语义映射。我们的结果与现有最佳方法相当,而现有解决方案要么离线运行,要么需要深度数据和/或姿态估计,或者缺乏动态场景的鲁棒性。KM-ViPE 从互联网规模的训练中受益,并且独特地结合了在线操作、未标定的单目输入和对动态场景的鲁棒处理,使其非常适合自主机器人和AR/VR应用,并为具身AI的实际空间智能能力的进步做出了贡献。
Unifying Sign and Magnitude for Optimizing Deep Vision Networks via ThermoLion
Authors: Ahmed Nebli
First: 2025-12-01T17:04:17+00:00 · Latest: 2025-12-01T17:04:17+00:00
Abstract
The training of deep vision models is fundamentally a signal recovery problem amidst high-dimensional stochastic noise. Current optimization paradigms impose a static compromise on information channel capacity. For instance, magnitude-based methods, such as AdamW, operate on the assumption that gradient norms are high-fidelity curvature signals. While this allows for precision in smooth regimes, it leads to catastrophic noise amplification when applied to rugged, non-convex landscapes. Conversely, sign-based methods (e.g., Lion) perform a radical 1-bit quantization of the gradient, which aims to provide robust regularization at the cost of discarding fine-grained descent information. We propose that optimal convergence requires neither static prior, but rather a dynamic modulation of the update bitrate. We introduce ThermoLion, a vision-centric framework that utilizes local Signal-to-Noise Ratio (SNR) gating to autonomously transition parameters between a "low-bit" exploration phase and a "high-precision" exploitation phase. Furthermore, we introduce a Momentum Alignment mechanism that detects constructive interference between historical drift and instantaneous gradients to accelerate convergence during stable trajectories. Empirical benchmarks across 12 diverse vision datasets (including CIFAR, SVHN, and GTSRB) demonstrate that ThermoLion serves as a hyperparameter-free generalist, surpassing both AdamW and Lion in convergence speed and terminal accuracy without architecture-specific tuning.
中文标题/摘要
标题:通过ThermoLion统一符号和幅度优化深度视觉网络
深度视觉模型的训练本质上是在高维随机噪声中恢复信号的问题。当前的优化范式在信息信道容量上施加了静态妥协。例如,幅度基方法(如AdamW)假设梯度范数是高保真曲率信号。虽然这在平滑区域允许精度,但在应用于崎岖的非凸景观时会导致灾难性的噪声放大。相反,符号基方法(例如Lion)对梯度进行剧烈的一比特量化,旨在提供鲁棒正则化,但会牺牲细粒度的下降信息。我们提出,最优收敛既不需要静态先验,而是需要动态调节更新比特率。我们引入了**ThermoLion**,这是一种以视觉为中心的框架,利用局部信噪比(SNR)门控自主地在“低比特”探索阶段和“高精度”利用阶段之间转换参数。此外,我们引入了一种动量对齐机制,该机制检测历史漂移和瞬时梯度之间的建设性干涉,以在稳定轨迹期间加速收敛。在包括CIFAR、SVHN和GTSRB在内的12个不同视觉数据集上的实证基准测试表明,ThermoLion作为一种超参数自由的一般化方法,在收敛速度和最终精度上均优于AdamW和Lion,无需针对特定架构进行调整。
Summary / 总结
The paper addresses the challenge of optimizing deep vision models by proposing ThermoLion, a framework that dynamically adjusts the update bitrate based on local Signal-to-Noise Ratio (SNR). This method transitions between a low-bit exploration phase and a high-precision exploitation phase. ThermoLion also includes a Momentum Alignment mechanism to enhance convergence during stable trajectories. Experiments across 12 datasets show that ThermoLion outperforms both AdamW and Lion in terms of convergence speed and terminal accuracy without requiring architecture-specific tuning.
论文提出了一种名为ThermoLion的框架,该框架基于局部信噪比(SNR)动态调整更新位宽,交替处于低位宽探索阶段和高精度利用阶段,并引入了动量对齐机制以在稳定轨迹期间加速收敛。实验结果表明,ThermoLion在12个视觉数据集上表现出更快的收敛速度和更高的终端精度,且无需针对特定架构进行调优。
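The ThermoLion entry above describes SNR gating that switches each parameter between a sign-based ("low-bit") update and a magnitude-based ("high-precision") update. The sketch below shows one way such a gate could look. The SNR estimate |m| / sqrt(v), the fixed `snr_threshold`, and all default values are assumptions made for illustration; the abstract describes the method as hyperparameter-free, so the actual gating rule must differ.

```python
import numpy as np

def snr_gated_step(grad, m, v, lr=1e-3, beta1=0.9, beta2=0.99,
                   snr_threshold=0.5, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Hypothetical per-parameter signal-to-noise estimate.
    snr = np.abs(m) / (np.sqrt(v) + eps)
    sign_update = np.sign(m)                   # Lion-like 1-bit direction
    magnitude_update = m / (np.sqrt(v) + eps)  # Adam-like scaled direction
    # High SNR: trust the magnitude. Low SNR: fall back to the sign.
    update = np.where(snr > snr_threshold, magnitude_update, sign_update)
    return -lr * update, m, v

grad = np.array([1.0, -1.0])
m0 = np.array([0.5, 0.0])
v0 = np.array([0.25, 1.0])
step, m1, v1 = snr_gated_step(grad, m0, v0)
print(step)  # first entry uses the magnitude branch, second the sign branch
```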
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
Authors: Zeqing Wang, Keze Wang, Lei Zhang
First: 2025-12-01T16:28:13+00:00 · Latest: 2025-12-01T16:28:13+00:00
Comments: 17 pages, 8 figures
Abstract
Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify physically impossible content in generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations of the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.
中文标题/摘要
标题:PhyDetEx:检测和解释T2V模型的物理合理性
随着文本到视频(T2V)生成模型的生成能力和训练规模的不断增长,这些模型在视频质量、长度和指令遵循能力方面取得了显著进展。然而,这些模型是否能够理解物理原理并生成物理上合理的视频仍然是一个问题。尽管视觉语言模型(VLMs)在各种应用中被广泛用作通用评估工具,但它们难以识别生成视频中的物理不可能内容。为了研究这一问题,我们构建了一个包含500个手动标注视频的测试集和2,588个配对视频的训练集的PID(物理不合理性检测)数据集,其中每个不合理视频都是通过仔细重写其对应真实视频的标题来生成的,以诱导T2V模型生成物理上不合理的内容。借助构建的数据集,我们提出了一种轻量级微调方法,使VLMs不仅能检测物理上不合理事件,还能生成违反物理原理的文本解释。将微调后的VLM作为物理合理性检测器和解释器,即PhyDetEx,我们对一系列最先进的T2V模型进行了基准测试,以评估它们遵守物理定律的情况。我们的研究发现,尽管最近的T2V模型在生成物理上合理的内容方面取得了显著进展,但理解和遵守物理定律仍然是一个具有挑战性的问题,尤其是在开源模型方面。我们的数据集、训练代码和检查点可在https://github.com/Zeqing-Wang/PhyDetEx获取。
Summary / 总结
This paper addresses the challenge of detecting and explaining the physical plausibility of Text-to-Video (T2V) models. It introduces a dataset called PID, consisting of 500 manually annotated videos for testing and 2,588 paired videos for training, designed to evaluate the physical implausibility of generated videos. Using a lightweight fine-tuning approach, the paper develops a model called PhyDetEx, which can not only detect physically implausible events but also provide textual explanations for the violated physical principles. The study finds that while recent T2V models have improved in generating physically plausible content, they still struggle with understanding and adhering to physical laws, particularly for open-source models.
该论文旨在检测和解释Text-to-Video (T2V) 模型的物理合理性。作者构建了一个名为PID的数据集,包含500个手动标注的视频和2,588对配对视频,用于测试T2V模型生成物理合理内容的能力。研究提出了一种轻量级的微调方法,使Vision-Language模型不仅能检测物理不合理事件,还能生成违反物理原理的文本解释。研究结果表明,尽管最近的T2V模型在生成物理合理内容方面取得了显著进步,但在遵循物理定律方面仍然面临挑战,尤其是开源模型。
OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic
Authors: Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang, Chen Lv
First: 2025-12-01T16:11:57+00:00 · Latest: 2025-12-01T16:11:57+00:00
Abstract
Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.
中文标题/摘要
标题:OpenREAD:基于LLM-as-Critic的端到端自主驾驶开放性推理强化
近年来,两阶段微调策略,例如通过监督微调(SFT)获取必要的驾驶知识,然后通过强化微调(RFT)进一步增强决策和规划能力,已经在推动知识驱动的自主驾驶(AD)范式方面显示出强大的潜力。然而,SFT的学习性质仍然限制了推理的泛化能力,从而限制了驾驶性能的全部潜力。同时,当前的RFT方法主要应用于下游任务,因为场景理解是一个开放性问题,其中相应的奖励难以量化。为了解决这些限制,我们提出了OpenREAD,这是一种基于视觉语言模型(VLM)的开放性推理自主驾驶(AD)框架,能够在从高层次推理到低层次轨迹规划的整个光谱范围内实现端到端的RFT。具体而言,我们首先在开源驾驶相关知识数据集上构建大规模的思维链(CoT)注释,并使用强大的Qwen3大型语言模型(LLM)作为RFT中的批评者,在奖励建模过程中量化开放性问题的推理质量。广泛的实验表明,联合端到端的RFT在上游和下游任务中均取得了显著的改进,使OpenREAD在推理和规划基准测试中达到了最先进的性能。
Summary / 总结
The paper proposes OpenREAD, an open-ended reasoning framework for autonomous driving that combines reinforcement learning with large language models. It addresses the limitations of two-stage fine-tuning by enabling end-to-end reinforcement fine-tuning from high-level reasoning to low-level trajectory planning. The framework constructs large-scale Chain-of-Thought annotations and uses a large language model as the critic to quantify reasoning quality, leading to improved performance on reasoning and planning benchmarks.
该论文提出了OpenREAD,一种结合大规模Chain-of-Thought注解和大型语言模型作为批评家的端到端强化学习框架,以提升推理质量。该方法解决了两阶段微调的局限性,能够在所有层次的推理和规划中进行强化微调。实验结果表明,在上游和下游任务上均取得了显著改进,达到了推理和规划基准的最先进性能。
CauSight: Learning to Supersense for Visual Causal Discovery
Authors: Yize Zhang, Meiqi Chen, Sirui Chen, Bo Peng, Yanxi Zhang, Tianyu Li, Chaochao Lu
First: 2025-12-01T16:05:13+00:00 · Latest: 2025-12-01T16:05:13+00:00
Comments: project page: https://github.com/OpenCausaLab/CauSight
Abstract
Causal thinking enables humans to understand not just what is seen, but why it happens. To replicate this capability in modern AI systems, we introduce the task of visual causal discovery. It requires models to infer cause-and-effect relations among visual entities across diverse scenarios instead of merely perceiving their presence. To this end, we first construct the Visual Causal Graph dataset (VCG-32K), a large-scale collection of over 32,000 images annotated with entity-level causal graphs, and further develop CauSight, a novel vision-language model to perform visual causal discovery through causally aware reasoning. Our training recipe integrates three components: (1) training data curation from VCG-32K, (2) Tree-of-Causal-Thought (ToCT) for synthesizing reasoning trajectories, and (3) reinforcement learning with a designed causal reward to refine the reasoning policy. Experiments show that CauSight outperforms GPT-4.1 on visual causal discovery, achieving over a threefold performance boost (21% absolute gain). Our code, model, and dataset are fully open-sourced at project page: https://github.com/OpenCausaLab/CauSight.
中文标题/摘要
标题:CauSight:学习超感知进行视觉因果发现
因果思考使人类不仅理解所见之事,还能理解其发生的原因。为了在现代AI系统中复制这一能力,我们引入了视觉因果发现的任务。它要求模型在多种场景中推断视觉实体之间的因果关系,而不仅仅是感知它们的存在。为此,我们首先构建了包含超过32,000张图像的视觉因果图数据集(VCG-32K),每张图像都标注了实体级别的因果图,并进一步开发了CauSight,这是一种新型的视觉-语言模型,通过因果感知推理来进行视觉因果发现。我们的训练方法整合了三个组成部分:(1) VCG-32K的数据整理,(2) 因果思维树(ToCT)用于合成推理轨迹,(3) 设计的因果奖励强化学习以优化推理策略。实验表明,CauSight在视觉因果发现任务上优于GPT-4.1,性能提升超过三倍(绝对提升21%)。我们的代码、模型和数据集已完全开源,项目页面:https://github.com/OpenCausaLab/CauSight。
Dimension-free error estimate for diffusion model and optimal scheduling
Authors: Valentin de Bortoli, Romuald Elie, Anna Kazeykina, Zhenjie Ren, Jiacheng Zhang
First: 2025-12-01T15:58:20+00:00 · Latest: 2025-12-01T15:58:20+00:00
Abstract
Diffusion generative models have emerged as powerful tools for producing synthetic data from an empirically observed distribution. A common approach involves simulating the time-reversal of an Ornstein-Uhlenbeck (OU) process initialized at the true data distribution. Since the score function associated with the OU process is typically unknown, it is approximated using a trained neural network. This approximation, along with finite time simulation, time discretization and statistical approximation, introduce several sources of error whose impact on the generated samples must be carefully understood. Previous analyses have quantified the error between the generated and the true data distributions in terms of Wasserstein distance or Kullback-Leibler (KL) divergence. However, both metrics present limitations: KL divergence requires absolute continuity between distributions, while Wasserstein distance, though more general, leads to error bounds that scale poorly with dimension, rendering them impractical in high-dimensional settings. In this work, we derive an explicit, dimension-free bound on the discrepancy between the generated and the true data distributions. The bound is expressed in terms of a smooth test functional with bounded first and second derivatives. The key novelty lies in the use of this weaker, functional metric to obtain dimension-independent guarantees, at the cost of higher regularity on the test functions. As an application, we formulate and solve a variational problem to minimize the time-discretization error, leading to the derivation of an optimal time-scheduling strategy for the reverse-time diffusion. Interestingly, this scheduler has appeared previously in the literature in a different context; our analysis provides a new justification for its optimality, now grounded in minimizing the discretization bias in generative sampling.
中文标题/摘要
标题:扩散模型的无维误差估计与最优调度
扩散生成模型已成为从经验观察分布生成合成数据的强大工具。一种常见方法是模拟Ornstein-Uhlenbeck (OU) 过程的时间逆过程,并将其初始化在真实数据分布上。由于与OU过程相关的分数函数通常是未知的,因此使用训练的神经网络进行近似。这种近似,加上有限时间模拟、时间离散化和统计近似,引入了多种误差源,这些误差对生成样本的影响必须仔细理解。先前的分析以Wasserstein距离或Kullback-Leibler (KL) 散度量化了生成分布与真实数据分布之间的误差。然而,这两种度量都存在局限性:KL散度需要分布之间的绝对连续性,而Wasserstein距离虽然更通用,但误差界随维度增加而劣化,使其在高维设置中不切实际。在本文中,我们推导了生成分布与真实数据分布之间差异的显式、无维界。该界用具有有界一阶和二阶导数的光滑测试函数表示。关键新颖之处在于使用这种较弱的函数度量来获得无维保证,代价是测试函数的更高正则性。作为应用,我们提出了一个变分问题来最小化时间离散化误差,并推导出反向扩散的最优时间调度策略。有趣的是,该调度器在文献中曾以不同背景出现;我们的分析提供了其最优性的新理由,现在是基于最小化生成采样中的离散偏差。
Summary / 总结
This paper derives an explicit, dimension-free bound on the discrepancy between the generated and true data distributions of diffusion generative models. The error is measured through smooth test functionals with bounded first and second derivatives, a weaker metric than Wasserstein distance or KL divergence that avoids poor scaling with dimension at the cost of requiring more regular test functions. As an application, the authors solve a variational problem that minimizes the time-discretization error and obtain an optimal time schedule for the reverse-time diffusion. This scheduler appeared earlier in the literature in a different context and is now justified as minimizing discretization bias in generative sampling.
该论文通过推导生成数据分布与真实数据分布之间差异的无维数上限来解决扩散生成模型中的误差问题。方法是使用具有有界一阶和二阶导数的光滑测试函数来量化误差,从而避免了Wasserstein距离和KL散度的局限性。主要发现是推导出了一种最优的时间调度策略,用于反向时间扩散,该策略可以最小化时间离散化误差,并且证明了其在生成采样中的最优性。
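To make the setting above concrete, here is a minimal sketch of reverse-time Ornstein-Uhlenbeck sampling with Euler-Maruyama on two different time grids. It is only an illustration: the data distribution is a toy 1-D Gaussian (so the score is known in closed form and no network is needed), and the function names, the parameters `M` and `S`, and the geometric grid are assumptions made here. The geometric grid is not claimed to be the optimal schedule derived in the paper, which the abstract does not spell out.

```python
import math
import random

M, S = 2.0, 0.5  # toy data distribution N(M, S^2); assumed for illustration


def score(x, t):
    """Exact score of the forward OU marginal at time t for Gaussian data.

    Forward process: dX = -X dt + sqrt(2) dW, so X_t ~ N(M e^-t, S^2 e^-2t + 1 - e^-2t).
    """
    mean = M * math.exp(-t)
    var = S * S * math.exp(-2 * t) + 1.0 - math.exp(-2 * t)
    return -(x - mean) / var


def uniform_grid(T, n):
    """Decreasing time grid from T to 0 with equal steps."""
    return [T * (1 - k / n) for k in range(n + 1)]


def geometric_grid(T, eps, n):
    """Decreasing time grid from T to eps with steps that shrink near the data end."""
    r = (eps / T) ** (1.0 / n)
    return [T * r ** k for k in range(n + 1)]


def sample(grid, n_samples, seed=0):
    """Euler-Maruyama simulation of the reverse-time SDE dY = [Y + 2 score] ds + sqrt(2) dB."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        y = rng.gauss(0.0, 1.0)  # start from the OU stationary law N(0, 1)
        for t_hi, t_lo in zip(grid, grid[1:]):
            h = t_hi - t_lo
            drift = y + 2.0 * score(y, t_hi)
            y += h * drift + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
        out.append(y)
    return out
```

Calling `sample(uniform_grid(3.0, 50), 2000)` and `sample(geometric_grid(3.0, 1e-3, 50), 2000)` gives two sample sets whose statistics can be compared against the toy target; the choice of grid is exactly the degree of freedom that a time-scheduling analysis optimizes.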
Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
Authors: Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram
First: 2025-12-01T15:36:33+00:00 · Latest: 2025-12-01T15:36:33+00:00
Abstract
Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
中文标题/摘要
标题:生成动作叙述:评估合成视频中的人体运动
尽管视频生成模型取得了快速进展,但评估复杂人体动作的视觉和时间正确性的稳健度量仍然难以捉摸。关键的是,现有的纯视觉编码器和多模态大型语言模型(MLLMs)对外观有很强的偏见,缺乏时间理解能力,因此难以区分生成视频中的精细运动动态和解剖学上的不合理性。我们通过引入一种源自真实世界人体动作学习潜在空间的新评价指标来填补这一空白。我们的方法首先通过融合无外观特征的人体骨骼几何特征与基于外观的特征,捕捉真实世界运动的细微差别、约束条件和时间平滑性。我们认为这种结合的特征空间提供了动作合理性的稳健表示。给定一个生成的视频,我们的指标通过测量其潜在表示与学习到的真实世界动作分布之间的距离来量化其动作质量。为了进行严格的验证,我们开发了一个新的多维度基准,专门设计用于探测人体动作保真度中的时间挑战方面。通过广泛的实验,我们展示了我们的指标在我们的基准上相比现有最先进的方法取得了超过68%的显著改进,表现与现有外部基准相当,并且与人类感知有更强的相关性。我们深入的分析揭示了当前视频生成模型的关键局限性,并确立了视频生成高级研究的新标准。
Summary / 总结
The research aims to develop a robust metric for evaluating the visual and temporal correctness of complex human actions in synthesized videos. The method combines appearance-agnostic skeletal geometry features with appearance-based features to capture the nuances and constraints of real-world motion. Experiments show that the proposed metric outperforms existing methods by more than 68% on a new benchmark designed to test temporal aspects of human action fidelity, and correlates well with human perception.
本文解决了评估合成视频中复杂人体动作的视觉和时间正确性的问题,现有指标偏向于外观且缺乏时间理解。作者提出了一种基于真实世界人体动作学习潜空间的新评价指标,结合了外观无关的骨骼几何特征和外观特征,以捕捉动作动态和时间平滑性。实验表明,该指标在专门设计的新基准上比现有方法高出68%以上,且与人类感知有很强的相关性。
GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
Authors: Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, Wanli Peng, Jingchao Qiao, Zeyu Ren, Haixin Shi, Zhi Su, Jiawen Tian, Yuyang Xiao, Shenyu Zhang, Liwei Zheng, Hang Li, Yonghui Wu
First: 2025-12-01T15:33:59+00:00 · Latest: 2025-12-01T15:33:59+00:00
Abstract
We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Assuming the optimality of human demonstrations is core to existing VLA policies. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress, filters the demonstration trajectories, and only keeps the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.
中文标题/摘要
标题:GR-RL:面向长时程机器人操作的灵巧与精确化
我们提出了GR-RL,一种将通用视觉-语言-动作(VLA)策略转化为适用于长时距灵巧操作的高效专家的机器人学习框架。假设人类演示的最优性是现有VLA策略的核心。然而,我们认为在高度灵巧和精确的操作任务中,人类演示是嘈杂且次优的。GR-RL 提出了一种多阶段训练管道,通过强化学习过滤、增强和强化演示。首先,GR-RL 学习一个视觉-语言条件下的任务进度,过滤演示轨迹,仅保留对进度有积极贡献的转换。具体来说,我们展示了通过直接应用离线RL和稀疏奖励,所得到的$Q$值可以被视为一个稳健的进度函数。接下来,我们引入了形态对称增强,极大地提高了GR-RL的泛化能力和性能。最后,为了更好地使VLA策略与部署行为对高精度控制进行对齐,我们通过学习潜在空间噪声预测器进行在线RL。通过这个管道,GR-RL,据我们所知,是第一个能够自主穿鞋带的策略,成功率为83.3%,该任务需要长时距推理、毫米级精度和柔体交互。我们希望GR-RL能够为使通用机器人基础模型专门化为可靠的现实世界专家提供一步。
Summary / 总结
GR-RL is a robotic learning framework that transforms a generalist vision-language-action policy into a specialist for long-horizon dexterous manipulation. It uses a multi-stage training pipeline involving filtering, augmentation, and reinforcement learning to improve the robustness and performance of the policy. Key findings include an 83.3% success rate in autonomously lacing up a shoe, demonstrating long-horizon reasoning and millimeter-level precision. This work advances the specialization of generalist robot models into reliable real-world experts.
GR-RL 是一种机器人学习框架,将通用的视觉-语言-动作策略转化为擅长长时间精细操作的专家。它采用多阶段训练管道,包括过滤、增强和强化学习,以提高策略的稳健性和性能。关键发现包括在自主系鞋带任务中达到83.3%的成功率,展示了长时间推理和毫米级精度。这项工作推动了通用机器人模型向可靠的现实世界专家的专业化转变。
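The filtering and augmentation steps described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: `progress` stands in for learned Q-values used as a progress function, and the transition layout (a dict with `left` and `right` arm actions as `[x, y, z]` lists, mirrored across the y axis) is hypothetical.

```python
def filter_by_progress(progress, margin=0.0):
    """Return indices t of transitions (t -> t+1) whose progress increases by more than margin.

    `progress[t]` is assumed to be a scalar progress estimate for step t,
    for example a Q-value from offline RL with sparse reward.
    """
    return [t for t in range(len(progress) - 1)
            if progress[t + 1] - progress[t] > margin]


def mirror_transition(tr):
    """Hypothetical left-right symmetry augmentation for a bimanual robot.

    Swaps the two arms and flips the lateral (y) coordinate of each action.
    """
    def flip(v):
        return [v[0], -v[1], v[2]]

    return {"left": flip(tr["right"]), "right": flip(tr["left"])}
```

For example, with `progress = [0.1, 0.3, 0.25, 0.6, 0.6]` only the transitions starting at steps 0 and 2 are kept, since the others do not increase the progress estimate.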
L2RU: a Structured State Space Model with prescribed L2-bound
Authors: Leonardo Massai, Muhammad Zakwan, Giancarlo Ferrari-Trecate
First: 2025-03-31T07:56:17+00:00 · Latest: 2025-12-01T15:33:04+00:00
Abstract
Structured state-space models (SSMs) have recently emerged as a powerful architecture at the intersection of machine learning and control, featuring layers composed of discrete-time linear time-invariant (LTI) systems followed by pointwise nonlinearities. These models combine the expressiveness of deep neural networks with the interpretability and inductive bias of dynamical systems, offering strong performance on long-sequence tasks with favorable computational complexity. However, their adoption in applications such as system identification and optimal control remains limited by the difficulty of enforcing stability and robustness in a principled and tractable manner. We introduce L2RU, a class of SSMs endowed with a prescribed $\mathcal{L}_2$-gain bound, guaranteeing input--output stability and robustness for all parameter values. The L2RU architecture is derived from free parametrizations of LTI systems satisfying an $\mathcal{L}_2$ constraint, enabling unconstrained optimization via standard gradient-based methods while preserving rigorous stability guarantees. Specifically, we develop two complementary parametrizations: a non-conservative formulation that provides a complete characterization of square LTI systems with a given $\mathcal{L}_2$-bound, and a conservative formulation that extends the approach to general (possibly non-square) systems while improving computational efficiency through a structured representation of the system matrices. Both parametrizations admit efficient initialization schemes that facilitate training long-memory models. We demonstrate the effectiveness of the proposed framework on a nonlinear system identification benchmark, where L2RU achieves improved performance and training stability compared to existing SSM architectures, highlighting its potential as a principled and robust building block for learning and control.
中文标题/摘要
标题:L2RU:具有指定L2界的结构化状态空间模型
结构化状态空间模型(SSMs)最近在机器学习和控制的交叉领域中崭露头角,其结构由离散时间线性时不变(LTI)系统层和逐点非线性层组成。这些模型结合了深度神经网络的表达能力和动态系统的可解释性和归纳偏置,提供了在长序列任务上强大的性能,同时具有有利的计算复杂度。然而,由于难以以有原则且易于处理的方式确保稳定性和鲁棒性,它们在系统识别和最优控制等应用中的采用仍然受到限制。我们引入了L2RU,这是一种具有指定$\mathcal{L}_2$增益界的状态空间模型类,确保所有参数值下的输入-输出稳定性和鲁棒性。L2RU架构源自满足$\mathcal{L}_2$约束的LTI系统的自由参数化,通过标准梯度方法实现无约束优化,同时保持严格的稳定性保证。具体而言,我们开发了两种互补的参数化:一种非保守形式,提供了具有给定$\mathcal{L}_2$界的方(square)LTI系统的完整表征,以及一种保守形式,将方法扩展到一般(可能非方)系统,并通过系统矩阵的结构化表示提高计算效率。两种参数化都允许高效的初始化方案,有助于训练长记忆模型。我们在一个非线性系统识别基准上展示了所提出框架的有效性,其中L2RU在性能和训练稳定性方面优于现有SSM架构,突显了其作为学习和控制中有原则且鲁棒的基础组件的潜力。
Summary / 总结
L2RU is a structured state-space model with a prescribed L2-bound, ensuring stability and robustness. It uses two parametrizations of LTI systems to enable unconstrained optimization while maintaining rigorous stability guarantees. L2RU outperforms existing SSM architectures in nonlinear system identification, demonstrating improved performance and training stability.
论文提出了L2RU,这是一种具有指定L2界的状态空间模型,确保了输入输出的稳定性和鲁棒性。该模型结合了深度神经网络的表达能力和动态系统的稳定性保证。在非线性系统识别基准测试中,L2RU在性能和训练稳定性方面均优于现有状态空间模型架构,使其成为学习和控制系统的有前途的构建块。
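The idea of a free parametrization with a built-in L2-gain bound can be illustrated on a scalar discrete-time LTI system x[k+1] = a x[k] + b u[k], y[k] = c x[k], whose L2 gain is |c b| / (1 - |a|) when |a| < 1. The sketch below is a toy construction assumed here for illustration and is not the parametrization derived in the paper: any pair of real numbers maps to a stable system whose gain stays below a prescribed `gamma`, so an optimizer could update the raw parameters without constraints.

```python
import math


def free_to_system(theta_a, theta_c, gamma):
    """Map unconstrained reals to a scalar LTI system (a, b, c) with L2 gain <= gamma."""
    a = 0.999 * math.tanh(theta_a)  # pole strictly inside the unit circle
    b = 1.0
    # |c * b| / (1 - |a|) = gamma * |tanh(theta_c)| <= gamma by construction
    c = gamma * (1.0 - abs(a)) * math.tanh(theta_c)
    return a, b, c


def l2_gain(a, b, c):
    """L2 (H-infinity) gain of the scalar system c * b / (z - a), valid for |a| < 1."""
    return abs(c * b) / (1.0 - abs(a))


def simulate(a, b, c, u):
    """Run the system from a zero initial state and return the output sequence."""
    x, ys = 0.0, []
    for uk in u:
        ys.append(c * x)
        x = a * x + b * uk
    return ys
```

With zero initial state, the output energy of `simulate` never exceeds `gamma ** 2` times the input energy, whatever values `theta_a` and `theta_c` take.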
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Authors: Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
First: 2025-04-08T17:59:49+00:00 · Latest: 2025-12-01T15:10:24+00:00
Comments: 20 pages; Project Page: https://omnisvg.github.io/
Abstract
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
中文标题/摘要
标题:OmniSVG:统一的可扩展矢量图形生成模型
可扩展矢量图形(SVG)是一种广泛应用于图形设计的重要图像格式,由于其分辨率独立性和可编辑性。生成高质量SVG的研究一直受到AIGC社区设计师和研究人员的关注。然而,现有方法要么产生结构不规则的输出并带来巨大的计算成本,要么仅限于生成结构过于简化的单色图标。为了生成高质量和复杂的SVG,我们提出了OmniSVG,这是一种统一框架,利用预训练的视觉-语言模型(VLMs)进行端到端的多模态SVG生成。通过将SVG命令和坐标参数化为离散的标记,OmniSVG将结构逻辑与低级几何学解耦,以提高训练效率,同时保持复杂SVG结构的表达能力。为了进一步推进SVG合成的发展,我们引入了MMSVG-2M,这是一个包含两百万个丰富注释的多模态数据集,以及条件SVG生成任务的标准评估协议。广泛的实验表明,OmniSVG优于现有方法,并展示了其在专业SVG设计工作流程中的潜在应用。
Summary / 总结
OmniSVG is a unified framework that uses pre-trained Vision-Language Models (VLMs) for end-to-end multimodal generation of high-quality, complex Scalable Vector Graphics (SVG). It parameterizes SVG commands and coordinates into discrete tokens, which decouples structural logic from low-level geometry and keeps training efficient without losing the expressiveness of complex SVG structures. Experiments show that OmniSVG outperforms existing methods and has potential for integration into professional SVG design workflows. The paper also introduces MMSVG-2M, a multimodal dataset of two million richly annotated SVG assets, together with a standardized evaluation protocol for conditional SVG generation tasks.
OmniSVG 是一个使用预训练的视觉-语言模型生成高质量和复杂 SVG 的统一框架。通过将 SVG 命令和坐标参数化为离散的标记,它将结构逻辑与低级几何学分离,从而实现高效的训练同时保持复杂 SVG 结构的表达性。实验表明,OmniSVG 在性能上优于现有方法,并且具有集成到专业 SVG 设计工作流程中的潜力。为了支持这一点,作者还引入了 MMSVG-2M,这是一个包含两百万个丰富注释的多模态数据集,以及用于条件 SVG 生成任务的标准评估协议。
FreqEdit: Preserving High-Frequency Features for Robust Multi-Turn Image Editing
Authors: Yucheng Liao, Jiajun Liang, Kaiqian Cui, Baoquan Zhao, Haoran Xie, Wei Liu, Qing Li, Xudong Mao
First: 2025-12-01T15:00:47+00:00 · Latest: 2025-12-01T15:00:47+00:00
Abstract
Instruction-based image editing through natural language has emerged as a powerful paradigm for intuitive visual manipulation. While recent models achieve impressive results on single edits, they suffer from severe quality degradation under multi-turn editing. Through systematic analysis, we identify progressive loss of high-frequency information as the primary cause of this quality degradation. We present FreqEdit, a training-free framework that enables stable editing across 10+ consecutive iterations. Our approach comprises three synergistic components: (1) high-frequency feature injection from reference velocity fields to preserve fine-grained details, (2) an adaptive injection strategy that spatially modulates injection strength for precise region-specific control, and (3) a path compensation mechanism that periodically recalibrates the editing trajectory to prevent over-constraint. Extensive experiments demonstrate that FreqEdit achieves superior performance in both identity preservation and instruction following compared to seven state-of-the-art baselines.
中文标题/摘要
标题:FreqEdit:保留高频特征以实现稳健的多轮图像编辑
基于指令的图像编辑通过自然语言已成为一种强大的直观视觉操作范式。尽管最近的模型在单次编辑上取得了令人印象深刻的成果,但在多轮编辑下它们会遭受严重的质量退化。通过系统的分析,我们确定渐进损失的高频信息是这种质量退化的主要原因。我们提出了一种无需训练的框架FreqEdit,使其能够在10多轮连续编辑中保持稳定。我们的方法包括三个协同工作的组件:(1)从参考速度场注入高频特征以保留细粒度细节,(2)一种自适应注入策略,根据空间调节注入强度以实现精确的区域特定控制,以及(3)一种路径补偿机制,定期重新校准编辑轨迹以防止过度约束。广泛的实验表明,FreqEdit在身份保留和指令遵循方面均优于七个最先进的基线模型。
Summary / 总结
FreqEdit addresses the issue of quality degradation in multi-turn image editing by preserving high-frequency features. It introduces a training-free framework with three components: high-frequency feature injection, adaptive injection strategy, and path compensation mechanism. Experimental results show that FreqEdit outperforms seven state-of-the-art baselines in both identity preservation and instruction following.
FreqEdit 通过保留高频特征来解决多轮图像编辑中的质量下降问题。它提出了一种无需训练的框架,包含三个组件:高频特征注入、自适应注入策略和路径补偿机制。实验表明,FreqEdit 在保留原始图像和准确遵循指令方面优于七个最先进的基线方法。
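The notion of injecting high-frequency content with spatially varying strength can be sketched on a 1-D signal. This is a simplified stand-in, assuming a box-filter frequency split and plain lists instead of the velocity fields used by FreqEdit; the function names and the `strength` mask are hypothetical.

```python
def low_pass(x, radius=2):
    """Box-filter smoothing with edge clamping; keeps the low-frequency part."""
    n = len(x)
    out = []
    for i in range(n):
        window = [x[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def high_pass(x, radius=2):
    """High-frequency residual: the signal minus its low-pass version."""
    return [xi - li for xi, li in zip(x, low_pass(x, radius))]


def inject_high_freq(edited, reference, strength, radius=2):
    """Blend the reference's high frequencies into the edited signal.

    `strength[i]` in [0, 1] controls the injection at position i:
    0 keeps the edited value, 1 replaces its high-frequency part entirely.
    """
    h_ref = high_pass(reference, radius)
    h_edit = high_pass(edited, radius)
    return [e + s * (hr - he)
            for e, s, hr, he in zip(edited, strength, h_ref, h_edit)]
```

Setting `strength` to zero in regions that the instruction is supposed to change and close to one elsewhere gives the kind of region-specific control the abstract describes, at least in this simplified form.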
Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks
Authors: Ngoc-Bao Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung
First: 2025-08-06T05:30:05+00:00 · Latest: 2025-12-01T13:16:25+00:00
Comments: Under review
Abstract
Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior studies have primarily examined unimodal deep networks, the vulnerability of vision-language models (VLMs) remains largely unexplored. In this work, we present the first systematic study of MI attacks on VLMs to understand their susceptibility to leaking private visual training data. Our work makes two main contributions. First, tailored to the token-generative nature of VLMs, we introduce a suite of token-based and sequence-based model inversion strategies, providing a comprehensive analysis of VLMs' vulnerability under different attack formulations. Second, based on the observation that tokens vary in their visual grounding, and hence their gradients differ in informativeness for image reconstruction, we propose Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW) as a novel MI for VLMs. SMI-AW dynamically reweights each token's loss gradient according to its visual grounding, enabling the optimization to focus on visually informative tokens and more effectively guide the reconstruction of private images. Through extensive experiments and human evaluations on a range of state-of-the-art VLMs across multiple datasets, we show that VLMs are susceptible to training data leakage. Human evaluation of the reconstructed images yields an attack accuracy of 61.21%, underscoring the severity of these privacy risks. Notably, we demonstrate that publicly released VLMs are vulnerable to such attacks. Our study highlights the urgent need for privacy safeguards as VLMs become increasingly deployed in sensitive domains such as healthcare and finance. Additional experiments are provided in Supp.
中文标题/摘要
标题:视觉-语言模型泄露所学内容吗?自适应令牌加权模型反转攻击
模型反转(MI)攻击通过从训练好的神经网络中重建私人训练数据,对隐私构成重大风险。尽管先前的研究主要关注单模态深度网络,但视觉-语言模型(VLMs)的脆弱性仍鲜有研究。本文首次系统研究了VLMs的MI攻击,以了解其在不同攻击形式下泄露私人视觉训练数据的易感性。我们的工作做出了两项主要贡献。首先,针对VLMs的令牌生成特性,我们提出了一系列基于令牌和序列的模型反转策略,对VLMs在不同攻击形式下的脆弱性进行了全面分析。其次,基于观察到令牌在视觉定位上的差异,以及因此在图像重建中的梯度信息量不同,我们提出了基于序列的模型反转与自适应令牌加权(SMI-AW)作为VLMs的新颖MI方法。SMI-AW动态调整每个令牌的损失梯度权重,使其优化聚焦于视觉信息丰富的令牌,更有效地指导私人图像的重建。通过在多个数据集上对多种最先进的VLMs进行广泛的实验和人工评估,我们展示了VLMs对训练数据泄露的易感性。人工评估重建图像的准确率为61.21%,突显了这些隐私风险的严重性。值得注意的是,我们证明了公开发布的VLMs对这种攻击易感。我们的研究强调了随着VLMs在医疗保健和金融等敏感领域中的广泛应用,迫切需要隐私保护措施。附加实验详见补充材料。
Summary / 总结
This work investigates the vulnerability of vision-language models (VLMs) to model inversion (MI) attacks, which can reconstruct private training data. The authors introduce token-based and sequence-based MI strategies tailored to VLMs and propose SMI-AW, which adaptively reweights token gradients based on visual grounding. Extensive experiments show that VLMs are susceptible to training data leakage, with human evaluations indicating an attack accuracy of 61.21%. This highlights the need for privacy safeguards in VLMs deployed in sensitive domains.
该研究探讨了视觉语言模型(VLMs)对模型反转(MI)攻击的脆弱性,这些攻击可以从训练好的神经网络中重建私人训练数据。作者提出了针对VLMs的基于令牌和序列的MI策略,并提出了一种基于序列的具有自适应令牌加权的模型反转(SMI-AW),该方法根据视觉接地动态调整每个令牌的损失梯度。实验表明,VLMs容易泄露训练数据,人工评估显示攻击准确率为61.21%。这强调了在敏感领域如医疗和金融中部署VLMs时需要隐私保护措施。
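Adaptive token weighting can be sketched as a weighted sum of per-token losses. The sketch assumes that a visual-grounding score per token is already available and turns the scores into weights with a softmax; how SMI-AW actually measures grounding and normalizes the weights is not stated in the abstract, so `grounding` and `temperature` are assumptions made here.

```python
import math


def token_weights(grounding, temperature=1.0):
    """Softmax over per-token grounding scores; higher score means larger weight."""
    m = max(grounding)  # subtract the max for numerical stability
    exps = [math.exp((g - m) / temperature) for g in grounding]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_sequence_loss(token_losses, grounding, temperature=1.0):
    """Sequence loss in which visually grounded tokens contribute more."""
    weights = token_weights(grounding, temperature)
    return sum(w * l for w, l in zip(weights, token_losses))
```

With equal grounding scores the result reduces to the plain mean of the token losses, so uniform weighting is recovered as a special case.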
Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval
Authors: Xin Wang, Haipeng Zhang, Mang Li, Zhaohui Xia, Yueguo Chen, Yu Zhang, Chunyu Wei
First: 2025-12-01T13:04:55+00:00 · Latest: 2025-12-01T13:04:55+00:00
Abstract
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification. While supervised CIR methods achieve high accuracy, their reliance on costly triplet annotations motivates zero-shot solutions. The core challenge in zero-shot CIR (ZS-CIR) stems from a fundamental dilemma: existing text-centric or diffusion-based approaches struggle to effectively bridge the vision-language modality gap. To address this, we propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment. First, it introduces a multimodal fusion feature editing strategy within a joint vision-language (VL) space, substantially narrowing the modality gap. Second, to maximize data efficiency, the framework incorporates a lightweight Control-Adapter, enabling state-of-the-art performance through fine-tuning on only a limited-scale synthetic dataset of 200K samples. Extensive experiments on standard CIR benchmarks (CIRR, FashionIQ, and CIRCO) demonstrate that Fusion-Diff significantly outperforms prior zero-shot approaches. We further enhance the interpretability of our model by visualizing the fused multimodal representations.
中文标题/摘要
标题:联合视觉-语言空间中的生成编辑用于零样本组合图像检索
组合图像检索(CIR)通过结合参考图像和文本修改,实现精细的视觉搜索。虽然监督CIR方法具有高精度,但它们对昂贵的三元组注解的依赖促使了零样本解决方案的发展。零样本CIR(ZS-CIR)的核心挑战源于一个根本性的难题:现有的以文本为中心或基于扩散的方法难以有效弥合视觉-语言模态的差距。为了解决这一问题,我们提出了一种名为Fusion-Diff的新型生成编辑框架,该框架旨在实现多模态对齐,具有高效性和数据效率。首先,它在联合视觉-语言(VL)空间中引入了多模态融合特征编辑策略,显著缩小了模态差距。其次,为了最大化数据效率,该框架引入了一个轻量级的控制适配器,通过在仅包含20万样本的合成数据集上进行微调,实现了最先进的性能。在标准CIR基准(CIRR、FashionIQ和CIRCO)上的广泛实验表明,Fusion-Diff显著优于之前的零样本方法。我们进一步通过可视化融合的多模态表示增强了模型的可解释性。
Summary / 总结
The paper tackles zero-shot Composed Image Retrieval (ZS-CIR) with Fusion-Diff, a generative editing framework that operates in a joint vision-language space. A multimodal fusion feature editing strategy narrows the vision-language modality gap, and a lightweight Control-Adapter is fine-tuned on a synthetic dataset of only 200K samples. On the standard CIR benchmarks CIRR, FashionIQ, and CIRCO, Fusion-Diff significantly outperforms prior zero-shot approaches. The authors also visualize the fused multimodal representations to improve interpretability.
论文提出了一种名为Fusion-Diff的生成编辑框架,该框架在联合视觉-语言空间中运行,引入了多模态融合特征编辑策略以减少视觉-语言模态差距,并通过在少量合成数据集上进行微调,实现了在标准CIR基准上的优越性能,超越了之前的零样本方法。此外,该模型通过可视化融合表示提高了可解释性。
VITA: Vision-to-Action Flow Matching Policy
Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
First: 2025-07-17T15:41:57+00:00 · Latest: 2025-12-01T12:22:05+00:00
Comments: Project page: https://ucd-dare.github.io/VITA/ Code: https://github.com/ucd-dare/VITA
Abstract
Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce the complexity, we develop VITA (VIsion-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that directly flows from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. As expected, bridging vision and action is challenging, because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents, trained jointly with flow matching. To further prevent latent space collapse, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equation) solving steps. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5x-2x faster inference compared to conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies. Codes, datasets, and demos are available at our project page: https://ucd-dare.github.io/VITA/.
中文标题/摘要
标题:VITA:视觉到行动的流匹配策略
传统的流匹配和基于扩散的策略通过从标准噪声分布(例如高斯分布)迭代去噪进行采样,并需要条件模块在生成过程中反复融入视觉信息,导致大量时间和内存开销。为降低复杂度,我们开发了VITA(视觉到动作策略),一种无需噪声、无需条件模块的流匹配策略学习框架,直接从视觉表示流向潜在动作。由于流的源头以视觉为基础,VITA 在生成过程中不再需要视觉条件。正如预期,连接视觉与动作颇具挑战,因为动作比视觉表示维度更低、结构性更弱且更稀疏;此外,流匹配要求源和目标具有相同的维度。为克服这一问题,我们引入了动作自编码器,将原始动作映射到与视觉潜在变量对齐的结构化潜在空间,并与流匹配联合训练。为了进一步防止潜在空间坍塌,我们提出了流潜在解码,通过流匹配 ODE(常微分方程)求解步骤反向传播动作重建损失,锚定潜在生成过程。我们在 ALOHA 和 Robomimic 的 9 个仿真和 5 个真实世界任务上评估了 VITA。与具有条件模块的传统方法相比,VITA 的推理速度提高了 1.5 到 2 倍,同时优于或匹配最先进的策略。代码、数据集和演示可在我们的项目页面获取:https://ucd-dare.github.io/VITA/
Summary / 总结
VITA is a noise-free and conditioning-free flow matching policy that directly maps visual representations to latent actions, reducing computational overhead. It introduces an action autoencoder to align raw actions with visual latents and uses flow latent decoding to prevent latent space collapse. VITA outperforms or matches state-of-the-art policies on 14 tasks, achieving 1.5x-2x faster inference compared to conventional methods with conditioning modules.
VITA 是一种无噪声和无条件的流匹配策略,直接将视觉表示连接到潜在动作,减少了传统方法的复杂性和开销。它引入了动作自编码器将原始动作映射到与视觉潜在变量对齐的结构化潜在空间,并提出了流潜在解码来防止潜在空间坍塌。VITA 在 9 个模拟和 5 个真实世界任务上优于或匹配了最先进的策略,并且相比具有条件模块的传统方法,推理速度提高了 1.5 倍到 2 倍。
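The VITA entry above describes a flow that starts from a visual latent rather than from noise and is integrated by an ODE solver. A minimal sketch of that idea, assuming a straight-line path between two same-dimensional latents and a plain Euler solver (both are illustrative assumptions, not details confirmed by the abstract; `z_src`, `z_tgt`, and `oracle` are hypothetical names):

```python
# Toy flow matching between two same-dimensional latents (not the VITA code).
# z_src stands in for a visual latent, z_tgt for an action latent.

def interpolate(z_src, z_tgt, t):
    """Point on the straight-line path x_t = (1 - t) * z_src + t * z_tgt."""
    return [(1.0 - t) * s + t * g for s, g in zip(z_src, z_tgt)]

def target_velocity(z_src, z_tgt):
    """Regression target for the velocity field on a linear path."""
    return [g - s for s, g in zip(z_src, z_tgt)]

def euler_integrate(x0, velocity_fn, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

z_src = [0.5, -1.0, 2.0]
z_tgt = [1.5, 0.0, -2.0]

# Hypothetical "perfect" velocity model: always returns the exact target.
# A trained network would replace this function.
def oracle(x, t):
    return target_velocity(z_src, z_tgt)

z_mid = interpolate(z_src, z_tgt, 0.5)
z_end = euler_integrate(z_src, oracle, num_steps=4)
```

Because the source and target must share a dimensionality for this interpolation to be defined, the sketch also shows why the abstract introduces an action autoencoder to lift raw actions into a latent space matching the visual latents.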
RobustVLA: Robustness-Aware Reinforcement Post-Training for Vision-Language-Action Models
Authors: Hongyin Zhang, Shuo Zhang, Junxi Jin, Qixin Zeng, Runze Li, Donglin Wang
First: 2025-11-03T08:30:48+00:00 · Latest: 2025-12-01T12:13:52+00:00
Abstract
Vision-Language-Action (VLA) models have recently emerged as powerful general-purpose policies for robotic manipulation, benefiting from large-scale multi-modal pre-training. However, they often fail to generalize reliably in out-of-distribution deployments, where unavoidable disturbances such as observation noise, sensor errors, or actuation perturbations become prevalent. While recent Reinforcement Learning (RL)-based post-training provides a practical means to adapt pre-trained VLA models, existing methods mainly emphasize reward maximization and overlook robustness to environmental uncertainty. In this work, we introduce RobustVLA, a lightweight online RL post-training method designed to explicitly enhance the resilience of VLA models. Through a systematic robustness analysis, we identify two key regularizations: Jacobian regularization, which mitigates sensitivity to observation noise, and smoothness regularization, which stabilizes policies under action perturbations. Extensive experiments across diverse robotic environments demonstrate that RobustVLA significantly outperforms prior state-of-the-art methods in robustness and reliability. Our results highlight the importance of principled robustness-aware RL post-training as a key step toward improving the reliability and robustness of VLA models.
中文标题/摘要
标题:RobustVLA:面向视觉-语言-动作模型的鲁棒性意识强化后训练
视觉-语言-动作(VLA)模型最近作为强大的通用策略出现,用于机器人操作,得益于大规模多模态预训练。然而,它们在分布外部署中往往无法可靠地泛化,其中不可避免的干扰如观测噪声、传感器错误或动作扰动变得普遍。虽然基于强化学习(RL)的后训练提供了一种实用的方法来适应预训练的VLA模型,但现有方法主要强调奖励最大化,而忽视了对环境不确定性鲁棒性的考虑。在本文中,我们引入了RobustVLA,这是一种轻量级的在线RL后训练方法,旨在显式增强VLA模型的鲁棒性。通过系统的鲁棒性分析,我们确定了两个关键正则化项:雅可比正则化,它减轻了对观测噪声的敏感性;平滑正则化,它在动作扰动下稳定策略。在多种多样的机器人环境中进行的广泛实验表明,RobustVLA在鲁棒性和可靠性方面显著优于先前的最先进的方法。我们的结果强调了鲁棒性意识的强化学习后训练作为提高VLA模型可靠性和鲁棒性关键步骤的重要性。
Summary / 总结
RobustVLA is a lightweight online reinforcement learning post-training method that enhances the resilience of Vision-Language-Action (VLA) models by addressing observation noise and action perturbations. Through robustness analysis, it introduces Jacobian regularization to reduce sensitivity to observation noise and smoothness regularization to stabilize policies under action perturbations. Experiments show that RobustVLA outperforms existing methods in robustness and reliability across various robotic environments.
RobustVLA 是一种轻量级的在线强化学习后训练方法,旨在增强 Vision-Language-Action (VLA) 模型的鲁棒性,这些模型用于机器人操作。通过引入雅可比正则化来减少对观测噪声的敏感性,并引入平滑正则化来在动作扰动下稳定策略,RobustVLA 提高了 VLA 模型的可靠性和鲁棒性。实验表明,RobustVLA 在各种机器人环境中优于现有方法。
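The RobustVLA entry above names Jacobian regularization as a way to reduce sensitivity to observation noise. A minimal sketch of such a penalty, assuming it is the squared Frobenius norm of the policy's Jacobian with respect to the observation, estimated here by finite differences; the exact form used in the paper is not given in the abstract, and `linear_policy`, `lam`, and `regularized_loss` are hypothetical:

```python
def jacobian_frobenius_sq(policy, obs, eps=1e-5):
    """Finite-difference estimate of ||d policy / d obs||_F^2 at obs."""
    base = policy(obs)
    total = 0.0
    for j in range(len(obs)):
        bumped = list(obs)
        bumped[j] += eps  # perturb one observation coordinate
        out = policy(bumped)
        total += sum(((o - b) / eps) ** 2 for o, b in zip(out, base))
    return total

# Hypothetical linear policy a = W o, whose Jacobian is exactly W.
W = [[1.0, 2.0], [0.0, -3.0]]

def linear_policy(obs):
    return [sum(w * x for w, x in zip(row, obs)) for row in W]

def regularized_loss(rl_loss, penalty, lam=0.1):
    """RL objective plus a weighted sensitivity penalty (lam is assumed)."""
    return rl_loss + lam * penalty

penalty = jacobian_frobenius_sq(linear_policy, [0.3, -0.7])
loss = regularized_loss(1.0, penalty)
```

For the linear policy the penalty approaches the sum of squared entries of `W`, so a smaller penalty directly means that small observation changes move the action less.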
NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction
Authors: Fei Liu, Shichao Xie, Minghua Luo, Zedong Chu, Junjun Hu, Xiaolong Wu, Mu Xu
First: 2025-12-01T11:24:16+00:00 · Latest: 2025-12-01T11:24:16+00:00
Abstract
Embodied navigation for long-horizon tasks, guided by complex natural language instructions, remains a formidable challenge in artificial intelligence. Existing agents often struggle with robust long-term planning about unseen environments, leading to high failure rates. To address these limitations, we introduce NavForesee, a novel Vision-Language Model (VLM) that unifies high-level language planning and predictive world model imagination within a single, unified framework. Our approach empowers a single VLM to concurrently perform planning and predictive foresight. Conditioned on the full instruction and historical observations, the model is trained to understand the navigation instructions by decomposing the task, tracking its progress, and formulating the subsequent sub-goal. Simultaneously, it functions as a generative world model, providing crucial foresight by predicting short-term environmental dynamics and long-term navigation milestones. The VLM's structured plan guides its targeted prediction, while the imagined future provides rich context to inform the navigation actions, creating a powerful internal feedback loop of perception-planning/prediction-action. We demonstrate through extensive experiments on the R2R-CE and RxR-CE benchmark that NavForesee achieves highly competitive performance in complex scenarios. Our work highlights the immense potential of fusing explicit language planning with implicit spatiotemporal prediction, paving the way for more intelligent and capable embodied agents.
中文标题/摘要
标题:NavForesee: 统一的视觉-语言世界模型,用于分层规划和双时间尺度导航预测
在人工智能领域,由复杂自然语言指令引导的长时程具身导航仍然是一个巨大的挑战。现有的智能体往往难以对未见环境进行稳健的长期规划,导致高失败率。为了解决这些限制,我们引入了NavForesee,这是一种新颖的视觉-语言模型(VLM),它在单一框架中统一了高层语言规划和预测性世界模型想象。我们的方法使单一的VLM能够同时进行规划和预测性前瞻。以完整指令和历史观察为条件,模型被训练为通过分解任务、跟踪进度并制定后续子目标来理解导航指令。同时,它还作为生成式世界模型,通过预测短期环境动态和长期导航里程碑提供关键的前瞻。VLM的结构化计划指导其有针对性的预测,而想象的未来提供了丰富的上下文来指导导航动作,从而形成感知-规划/预测-行动的强大内部反馈循环。通过在R2R-CE和RxR-CE基准上的广泛实验,我们证明NavForesee在复杂场景中实现了极具竞争力的性能。我们的工作突显了将显式语言规划与隐式时空预测融合的巨大潜力,为更智能、能力更强的具身智能体铺平了道路。
Summary / 总结
NavForesee is a Vision-Language Model that unifies high-level language planning and predictive world model imagination for embodied navigation. It decomposes instructions, tracks progress, and predicts short-term dynamics and long-term milestones. Extensive experiments show NavForesee performs competitively on complex scenarios, highlighting the potential of integrating explicit language planning with implicit spatiotemporal prediction for intelligent embodied agents.
NavForesee 是一个统一的视觉-语言模型,将高层语言规划和预测性世界模型想象结合在一起,用于具身导航任务。它分解指令、跟踪进度并制定子目标,同时预测短期动态和长期里程碑。实验表明,NavForesee 在复杂场景中具有很强的竞争力,展示了将显式语言规划与隐式时空预测相结合的潜力。
3EED: Ground Everything Everywhere in 3D
Authors: Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu, Dongyue Lu, Liang Pan, Lingdong Kong, Junwei Liang, Ziwei Liu
Venue: NeurIPS 2025
First: 2025-11-03T17:05:22+00:00 · Latest: 2025-12-01T10:15:49+00:00
Comments: NeurIPS 2025 DB Track; 38 pages, 17 figures, 10 tables; Project Page at https://project-3eed.github.io/
Abstract
Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
中文标题/摘要
标题:3EED:在3D中随处定位一切
3D视觉定位是使具身智能体在开放世界环境中定位语言所指对象的关键。然而,现有的基准局限于室内场景、单一平台和较小规模。我们引入了3EED,这是一个多平台、多模态的3D视觉定位基准,包含来自车辆、无人机和四足平台的RGB和LiDAR数据。我们提供了超过128,000个物体和22,000条经过验证的指代表达,覆盖了多样化的户外场景,比现有数据集大10倍。我们开发了一种可扩展的标注流水线,结合视觉-语言模型提示与人工验证,以确保高质量的空间定位。为了支持跨平台学习,我们提出了平台感知的归一化和跨模态对齐技术,并建立了域内和跨平台评估的基准协议。我们的研究结果揭示了显著的性能差距,突显了可泛化3D视觉定位的挑战和机遇。3EED数据集和基准工具包已发布,以促进未来语言驱动的3D具身感知研究。
Summary / 总结
The work addresses the limits of existing 3D visual grounding benchmarks, which focus on indoor scenes, a single platform, and small scale. It introduces 3EED, a multi-platform, multi-modal 3D grounding benchmark with RGB and LiDAR data from vehicle, drone, and quadruped platforms, covering over 128,000 objects and 22,000 validated referring expressions in outdoor scenes. Evaluations under in-domain and cross-platform protocols reveal significant performance gaps, indicating how hard generalizable 3D grounding remains. The dataset and benchmark toolkit are released to facilitate future research.
研究旨在通过引入3EED,一个多平台和多模态的3D定位基准,来解决现有基准在视觉定位方面的局限性。方法包括开发一个可扩展的注释流水线和平台感知的归一化技术,以确保高质量的空间定位。关键实验发现表明,不同平台之间存在显著的性能差距,这表明实现通用的3D定位的挑战。3EED数据集和基准工具包已发布,以促进该领域的未来研究。
ChronosObserver: Taming 4D World with Hyperspace Diffusion Sampling
Authors: Qisen Wang, Yifan Zhao, Peisen Shen, Jialu Li, Jia Li
First: 2025-12-01T10:00:26+00:00 · Latest: 2025-12-01T10:00:26+00:00
Abstract
Although prevailing camera-controlled video generation models can produce cinematic results, lifting them directly to the generation of 3D-consistent and high-fidelity time-synchronized multi-view videos remains challenging, which is a pivotal capability for taming 4D worlds. Some works resort to data augmentation or test-time optimization, but these strategies are constrained by limited model generalization and scalability issues. To this end, we propose ChronosObserver, a training-free method including World State Hyperspace to represent the spatiotemporal constraints of a 4D world scene, and Hyperspace Guided Sampling to synchronize the diffusion sampling trajectories of multiple views using the hyperspace. Experimental results demonstrate that our method achieves high-fidelity and 3D-consistent time-synchronized multi-view videos generation without training or fine-tuning for diffusion models.
中文标题/摘要
标题:ChronosObserver:以超空间扩散采样驯服四维世界
尽管现有的相机控制视频生成模型可以产生电影级的效果,但将它们直接提升到生成3D一致且高保真时间同步多视角视频仍然具有挑战性,这是驯服四维世界的关键能力。一些工作依赖于数据增强或测试时优化,但这些策略受到模型泛化能力和可扩展性问题的限制。为此,我们提出ChronosObserver,这是一种无需训练的方法,包括世界状态超空间来表示四维世界场景的时空约束,以及超空间引导采样来使用超空间同步多个视角的扩散采样轨迹。实验结果表明,我们的方法在无需对扩散模型进行训练或微调的情况下,实现了高保真且3D一致的时间同步多视角视频生成。
Summary / 总结
The work targets high-fidelity, 3D-consistent, time-synchronized multi-view video generation for 4D world scenes, which remains challenging for existing camera-controlled video generation models. The proposed training-free method, ChronosObserver, uses a World State Hyperspace to represent the scene's spatiotemporal constraints and Hyperspace Guided Sampling to synchronize the diffusion sampling trajectories of multiple views. Experiments show that it generates such videos without training or fine-tuning the diffusion models.
该研究旨在生成来自4D世界场景的高保真度和3D一致的时间同步多视角视频,这给现有的摄像机控制视频生成模型带来了挑战。提出的ChronosObserver方法使用World State Hyperspace来表示时空约束,并使用Hyperspace Guided Sampling来同步多个视角。关键实验发现是,ChronosObserver可以在不训练或微调扩散模型的情况下生成高保真度和3D一致的时间同步多视角视频。
Stay Unique, Stay Efficient: Preserving Model Personality in Multi-Task Merging
Authors: Kuangpu Guo, Yuhe Ding, Jian Liang, Zilei Wang, Ran He
First: 2025-12-01T09:47:17+00:00 · Latest: 2025-12-01T09:47:17+00:00
Abstract
Model merging has emerged as a promising paradigm for enabling multi-task capabilities without additional training. However, existing methods often experience substantial performance degradation compared with individually fine-tuned models, even on similar tasks, underscoring the need to preserve task-specific information. This paper proposes Decomposition, Thresholding, and Scaling (DTS), an approximation-based personalized merging framework that preserves task-specific information with minimal storage overhead. DTS first applies singular value decomposition to the task-specific information and retains only a small subset of singular values and vectors. It then introduces a novel thresholding strategy that partitions singular vector elements into groups and assigns a scaling factor to each group. To enable generalization to unseen tasks, we further extend DTS with a variant that fuses task-specific information in a data-free manner based on the semantic similarity of task characteristics. Extensive experiments demonstrate that DTS consistently outperforms state-of-the-art baselines while requiring only 1\% additional storage per task. Furthermore, experiments on unseen tasks show that the DTS variant achieves significantly better generalization performance. Our code is available at https://github.com/krumpguo/DTS.
中文标题/摘要
标题:保持独特性,保持高效性:在多任务合并中保留模型个性
模型合并已成为一种有前景的范式,用于实现多任务能力而无需额外训练。然而,现有方法在相似任务上与单独微调的模型相比,通常会经历显著的性能下降,突显了保留任务特定信息的必要性。本文提出了一种基于近似的个性化合并框架——分解、阈值化和缩放(DTS),该框架在最小存储开销的情况下保留了任务特定信息。DTS 首先对任务特定信息应用奇异值分解,并仅保留一小部分奇异值和向量。然后引入了一种新颖的阈值化策略,将奇异向量元素分为组,并为每个组分配一个缩放因子。为了使 DTS 能够泛化到未见过的任务,我们进一步基于任务特征的语义相似性以数据无关的方式扩展了 DTS,融合任务特定信息。大量实验表明,DTS 在仅需每个任务 1% 的额外存储空间的情况下,始终优于最先进的基线方法。此外,对未见过的任务的实验表明,DTS 变体实现了显著更好的泛化性能。我们的代码可在 https://github.com/krumpguo/DTS 获取。
Summary / 总结
This paper addresses the issue of performance degradation in model merging by proposing DTS, a framework that preserves task-specific information using singular value decomposition and thresholding. Experiments show that DTS outperforms existing methods with only a 1% increase in storage per task and achieves better generalization on unseen tasks.
本文提出了一种称为分解、阈值化和缩放(DTS)的框架,以解决模型合并中的性能下降问题。DTS通过奇异值分解和阈值化保留任务特定的信息,只需要少量额外的存储空间。实验表明,DTS在每个任务仅需1%额外存储空间的情况下优于现有方法,并且针对未见过的任务的变体表现出更好的泛化性能。
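The DTS entry above says that only a small subset of singular values and vectors is kept, at about 1% extra storage per task. A back-of-envelope sketch of what such a budget implies for a single weight matrix, assuming storage is counted as plain numbers for a rank-k factorization; this ignores whatever further compression the paper's thresholding and scaling steps provide, and the matrix size is a made-up example:

```python
def rank_k_storage(m, n, k):
    """Numbers stored for a rank-k SVD factorization of an m x n matrix."""
    # k left singular vectors, k right singular vectors, k singular values
    return k * (m + n + 1)

def max_rank_within_budget(m, n, budget_fraction):
    """Largest k whose factors fit within budget_fraction of the dense m * n."""
    return int(budget_fraction * m * n) // (m + n + 1)

# Hypothetical 1024 x 1024 layer and a 1% budget.
m, n = 1024, 1024
k = max_rank_within_budget(m, n, 0.01)
ratio = rank_k_storage(m, n, k) / (m * n)
```

Under this simple accounting the budget admits only a handful of singular directions per matrix, which is consistent with the abstract's emphasis on keeping a small subset.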
Language-Guided Open-World Anomaly Segmentation
Authors: Klara Reichard, Nikolas Brasch, Nassir Navab, Federico Tombari
First: 2025-12-01T09:08:59+00:00 · Latest: 2025-12-01T09:08:59+00:00
Abstract
Open-world and anomaly segmentation methods seek to enable autonomous driving systems to detect and segment both known and unknown objects in real-world scenes. However, existing methods do not assign semantically meaningful labels to unknown regions, and distinguishing and learning representations for unknown classes remains difficult. While open-vocabulary segmentation methods show promise in generalizing to novel classes, they require a fixed inference vocabulary and thus cannot be directly applied to anomaly segmentation where unknown classes are unconstrained. We propose Clipomaly, the first CLIP-based open-world and anomaly segmentation method for autonomous driving. Our zero-shot approach requires no anomaly-specific training data and leverages CLIP's shared image-text embedding space to both segment unknown objects and assign human-interpretable names to them. Unlike open-vocabulary methods, our model dynamically extends its vocabulary at inference time without retraining, enabling robust detection and naming of anomalies beyond common class definitions such as those in Cityscapes. Clipomaly achieves state-of-the-art performance on established anomaly segmentation benchmarks while providing interpretability and flexibility essential for practical deployment.
中文标题/摘要
标题:语言引导的开放世界异常分割
开放世界和异常分割方法旨在使自动驾驶系统能够检测和分割现实场景中的已知和未知对象。然而,现有方法无法为未知区域分配语义上有意义的标签,区分和学习未知类别的表示仍然很困难。虽然开放词汇分割方法在泛化到新类别方面显示出潜力,但它们需要固定的推理词汇表,因此无法直接应用于未知类别不受约束的异常分割。我们提出了Clipomaly,这是第一个基于CLIP的开放世界和异常分割方法,适用于自动驾驶。我们的零样本方法不需要异常特定的训练数据,并利用CLIP共享的图像-文本嵌入空间来分割未知对象并为它们分配人类可解释的名称。与开放词汇方法不同,我们的模型在推理时动态扩展其词汇表而无需重新训练,从而实现对超出 Cityscapes 等数据集中常见类别定义的异常的稳健检测和命名。Clipomaly在现有的异常分割基准测试中达到了最先进的性能,同时提供了对于实际部署至关重要的可解释性和灵活性。
Summary / 总结
The research aims to improve autonomous driving systems by enabling them to detect and segment both known and unknown objects in real-world scenes. The method, Clipomaly, uses CLIP's image-text embedding space for zero-shot open-world and anomaly segmentation, allowing dynamic vocabulary extension at inference time without retraining. Key findings show that Clipomaly outperforms existing methods on established anomaly segmentation benchmarks while offering interpretability and flexibility for practical deployment.
研究旨在开发能够在现实场景中检测和分割已知和未知物体的方法,以提升自动驾驶系统的性能。提出的Clipomaly方法利用CLIP的图像-文本嵌入空间进行零样本开放世界和异常分割,在推理时动态扩展词汇表而无需重新训练。该方法在异常分割基准测试中达到了最先进的性能,同时提供了实用部署所需的可解释性和灵活性。
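The Clipomaly entry above relies on a shared image-text embedding space and a vocabulary that can grow at inference time. A minimal sketch of that naming step, assuming labels are chosen by cosine similarity; the 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, and `best_label` is a hypothetical helper:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_label(region_emb, vocab):
    """Return the vocabulary entry whose text embedding is most similar."""
    return max(vocab, key=lambda name: cosine(region_emb, vocab[name]))

# Toy text embeddings for known classes (not real CLIP outputs).
vocab = {"road": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
region = [0.1, 0.2, 0.9]  # toy embedding of an unknown image region

before = best_label(region, vocab)

# Extend the vocabulary at inference time; nothing is retrained.
vocab["deer"] = [0.0, 0.1, 1.0]
after = best_label(region, vocab)
```

With the fixed vocabulary the region is forced onto the nearest known class, while adding one text embedding is enough to give it a better-matching name.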
ResDiT: Evoking the Intrinsic Resolution Scalability in Diffusion Transformers
Authors: Yiyang Ma, Feng Zhou, Xuedan Yin, Pu Cao, Yonghao Dang, Jianqin Yin
First: 2025-12-01T09:08:01+00:00 · Latest: 2025-12-01T09:08:01+00:00
Comments: 8 pages
Abstract
Leveraging pre-trained Diffusion Transformers (DiTs) for high-resolution (HR) image synthesis often leads to spatial layout collapse and degraded texture fidelity. Prior work mitigates these issues with complex pipelines that first perform a base-resolution (i.e., training-resolution) denoising process to guide HR generation. We instead explore the intrinsic generative mechanisms of DiTs and propose ResDiT, a training-free method that scales resolution efficiently. We identify the core factor governing spatial layout, position embeddings (PEs), and show that the original PEs encode incorrect positional information when extrapolated to HR, which triggers layout collapse. To address this, we introduce a PE scaling technique that rectifies positional encoding under resolution changes. To further remedy low-fidelity details, we develop a local-enhancement mechanism grounded in base-resolution local attention. We design a patch-level fusion module that aggregates global and local cues, together with a Gaussian-weighted splicing strategy that eliminates grid artifacts. Comprehensive evaluations demonstrate that ResDiT consistently delivers high-fidelity, high-resolution image synthesis and integrates seamlessly with downstream tasks, including spatially controlled generation.
中文标题/摘要
标题:ResDiT: 激发扩散变换器内在的分辨率可扩展性
利用预训练的扩散变换器(DiTs)进行高分辨率(HR)图像合成往往会导致空间布局崩溃和纹理保真度下降。先前的工作通过复杂的管道首先执行一个基础分辨率(即训练分辨率)的去噪过程来引导HR生成,以缓解这些问题。我们相反地探索了DiTs的内在生成机制,并提出了一种无需训练的方法ResDiT,该方法可以高效地扩展分辨率。我们确定了控制空间布局的核心因素,即位置嵌入(PEs),并表明当将原始PEs外推到HR时,它们编码了错误的位置信息,从而触发布局崩溃。为了解决这一问题,我们引入了一种PE缩放技术,以在分辨率变化时纠正位置编码。为了进一步修复低保真细节,我们开发了一种基于基础分辨率局部注意力的局部增强机制。我们设计了一个块级融合模块,该模块聚合全局和局部线索,并采用高斯加权拼接策略以消除网格伪影。全面的评估表明,ResDiT始终能够提供高保真、高分辨率的图像合成,并且能够无缝集成到下游任务中,包括空间控制生成。
Summary / 总结
ResDiT is a training-free method that tackles the spatial layout collapse and degraded texture fidelity that arise when pre-trained Diffusion Transformers (DiTs) are used for high-resolution image synthesis. It traces layout collapse to position embeddings that encode incorrect positional information when extrapolated to high resolution, corrects them with a PE scaling technique, and adds a local-enhancement mechanism based on base-resolution local attention. A patch-level fusion module and a Gaussian-weighted splicing strategy combine global and local cues and remove grid artifacts, yielding high-fidelity, high-resolution synthesis that integrates with downstream tasks such as spatially controlled generation.
论文针对使用预训练的扩散变换器(DiTs)进行高分辨率图像合成时遇到的空间布局坍塌和纹理保真度下降等问题。提出了一种无需训练的方法ResDiT,以增强DiTs的内在生成机制来高效地扩展分辨率。关键发现包括将位置嵌入识别为导致空间布局问题的核心因素,并引入了位置嵌入缩放技术和局部增强机制来提高高分辨率图像的质量。
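The ResDiT entry above attributes layout collapse to position embeddings extrapolated beyond the training resolution. One familiar way to avoid extrapolation is to rescale token coordinates back into the trained range before embedding them; the sketch below shows that generic idea and is an assumption, since the abstract does not spell out how ResDiT's PE scaling works. The sinusoidal embedding and the lengths are illustrative choices:

```python
import math

def scaled_positions(target_len, train_len):
    """Map target_len token indices onto the trained range [0, train_len - 1]."""
    if target_len == 1:
        return [0.0]
    scale = (train_len - 1) / (target_len - 1)
    return [i * scale for i in range(target_len)]

def sincos_embedding(pos, dim=4, base=10000.0):
    """Sinusoidal embedding of a (possibly fractional) position."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out

# Hypothetical case: trained on 16 tokens per axis, sampling at 32.
train_len, target_len = 16, 32
pos = scaled_positions(target_len, train_len)
emb = [sincos_embedding(p) for p in pos]
```

After rescaling, every position handed to the embedding lies inside the range seen during training, at the cost of fractional coordinates between neighbouring tokens.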