LitePT: Lighter Yet Stronger Point Transformer
Authors: Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan Dirk Wegner, Christian Rupprecht, Konrad Schindler
First: 2025-12-15T18:59:57+00:00 · Latest: 2025-12-15T18:59:57+00:00
Comments: Project page: https://litept.github.io/
Abstract
Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
中文标题/摘要
标题:LitePT:更轻量但更强大的点变换器
现代用于3D点云处理的神经架构同时包含卷积层和注意力模块,但它们的最佳组合方式尚不明确。我们分析了3D点云网络中不同计算模块的作用,并发现一种直观的行为:卷积足以在早期高分辨率层中提取低级几何特征,而在这些层中使用注意力代价高昂且没有任何收益;注意力则在低分辨率的深层中更高效地捕捉高级语义和上下文。遵循这一设计原则,我们提出了一种新的、改进的3D点云骨干网络,该网络在早期阶段使用卷积,在深层中切换到注意力。为了在丢弃冗余卷积层时避免空间布局信息的损失,我们引入了一种新的、无需训练的3D位置编码PointROPE。最终的LitePT模型与最先进的Point Transformer V3相比,参数量少3.6倍,运行速度快2倍,内存占用少2倍,但在多种任务和数据集上性能相当甚至更好。代码和模型可在 https://github.com/prs-eth/LitePT 获取。
Summary / 总结
The paper proposes LitePT, a new 3D point cloud backbone that uses convolutions in early layers and switches to attention in deeper layers, guided by the observation that convolutions are sufficient for low-level geometry extraction while attention is more efficient for high-level context. LitePT incorporates a novel 3D positional encoding, PointROPE, to preserve spatial layout information. The model has 3.6 times fewer parameters, runs 2 times faster, and uses 2 times less memory than Point Transformer V3, while matching or outperforming it on various tasks and datasets.
论文提出LitePT,这是一种新的3D点云处理骨干网络,早期使用卷积而深层使用注意力机制,这一设计基于卷积适用于低级几何提取而注意力在高层语义和上下文提取中更有效的观察。这种设计使得模型参数量大幅减少,计算需求更低,同时在多种任务和数据集上与最先进的Point Transformer V3相当甚至更优。
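The abstract does not describe how PointROPE is constructed. As a rough illustration only, the sketch below assumes it follows the general rotary-embedding idea applied per coordinate axis: the feature channels are split into three groups, and channel pairs in each group are rotated by angles proportional to x, y, and z. The names `rope_1d` and `point_rope` and the frequency base of 100.0 are hypothetical and not taken from the paper.

```python
import math

def rope_1d(vec, pos, base=100.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    out = []
    half = len(vec) // 2
    for i in range(half):
        theta = pos * base ** (-i / half)  # one frequency per channel pair
        a, b = vec[2 * i], vec[2 * i + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out

def point_rope(vec, xyz, base=100.0):
    """Apply rope_1d to three equal channel groups, one per coordinate axis."""
    d = len(vec) // 3
    assert 3 * d == len(vec) and d % 2 == 0, "channel count must be divisible by 6"
    out = []
    for axis in range(3):
        out += rope_1d(vec[axis * d:(axis + 1) * d], xyz[axis], base)
    return out

def dot(u, v):
    """Plain dot product, standing in for an attention logit."""
    return sum(a * b for a, b in zip(u, v))
```

The encoding has no learned parameters, which is consistent with the "training-free" description. Because each pair is rotated, the dot product between an encoded query and an encoded key depends only on the offset between the two points, not on their absolute positions.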
LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
Authors: Tianye Ding, Yiming Xie, Yiqing Liang, Moitreya Chatterjee, Pedro Miraldo, Huaizu Jiang
First: 2025-12-15T18:59:04+00:00 · Latest: 2025-12-15T18:59:04+00:00
Comments: 16 pages
Abstract
Recent feed-forward reconstruction models like VGGT and $π^3$ achieve impressive reconstruction quality but cannot process streaming videos due to quadratic memory complexity, limiting their practical deployment. While existing streaming methods address this through learned memory mechanisms or causal attention, they require extensive retraining and may not fully leverage the strong geometric priors of state-of-the-art offline models. We propose LASER, a training-free framework that converts an offline reconstruction model into a streaming system by aligning predictions across consecutive temporal windows. We observe that simple similarity transformation ($\mathrm{Sim}(3)$) alignment fails due to layer depth misalignment: monocular scale ambiguity causes relative depth scales of different scene layers to vary inconsistently between windows. To address this, we introduce layer-wise scale alignment, which segments depth predictions into discrete layers, computes per-layer scale factors, and propagates them across both adjacent windows and timestamps. Extensive experiments show that LASER achieves state-of-the-art performance on camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU, enabling practical deployment for kilometer-scale streaming videos. Project website: https://neu-vi.github.io/LASER/
Summary / 总结
The research addresses the quadratic memory complexity of feed-forward reconstruction models like VGGT and $π^3$, which prevents them from processing streaming videos. LASER is a training-free framework that aligns predictions across consecutive temporal windows using layer-wise scale alignment, overcoming the layer depth misalignment caused by monocular scale ambiguity. Experiments demonstrate that LASER achieves state-of-the-art performance in camera pose estimation and point map reconstruction while operating at 14 FPS with 6 GB peak memory on an RTX A6000 GPU.
研究旨在通过解决VGGT和$π^3$等前馈模型的内存复杂性问题,实现无需训练的流式4D重建。LASER提出了一种分层尺度对齐方法,将深度预测分割成层,计算每层的尺度因子,并在连续的时间窗口间传播以对齐预测。实验表明,LASER在相机姿态估计和点云重建方面达到了最先进的性能,同时在RTX A6000 GPU上以14 FPS的速度运行,峰值内存使用为6 GB,适用于千米级流式视频的实际部署。
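The layer-wise scale alignment described in the abstract (segment depth into layers, compute one scale factor per layer) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: layers are formed here by equal-size depth quantiles, and each scale is the median depth ratio on a frame shared by two windows. The function names and the default `num_layers=3` are hypothetical.

```python
from statistics import median

def layerwise_scales(ref_depth, cur_depth, num_layers=3):
    """Per-layer scale factors mapping cur_depth onto ref_depth for a shared frame.

    ref_depth: depths from the previous window; cur_depth: depths from the
    current window for the same pixels. Layers are depth quantiles of cur_depth.
    """
    order = sorted(range(len(cur_depth)), key=lambda i: cur_depth[i])
    size = -(-len(order) // num_layers)  # ceiling division
    layers, scales = [0] * len(cur_depth), []
    for layer in range(num_layers):
        idx = order[layer * size:(layer + 1) * size]
        if not idx:
            break
        for i in idx:
            layers[i] = layer
        # Median ratio is robust to a few outlier pixels within the layer.
        scales.append(median(ref_depth[i] / cur_depth[i] for i in idx))
    return layers, scales

def apply_scales(cur_depth, layers, scales):
    """Rescale every depth value by the factor of the layer it belongs to."""
    return [d * scales[l] for d, l in zip(cur_depth, layers)]
```

A single global scale would be one number for all pixels; the per-layer version lets near, middle, and far scene content be corrected independently, which is the failure case the abstract attributes to plain $\mathrm{Sim}(3)$ alignment.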
AgentIAD: Tool-Augmented Single-Agent for Industrial Anomaly Detection
Authors: Junwen Miao, Penghui Du, Yi Liu, Yu Wang, Yan Wang
First: 2025-12-15T18:57:04+00:00 · Latest: 2025-12-15T18:57:04+00:00
Abstract
Industrial anomaly detection (IAD) is difficult due to the scarcity of normal reference samples and the subtle, localized nature of many defects. Single-pass vision-language models (VLMs) often overlook small abnormalities and lack explicit mechanisms to compare against canonical normal patterns. We propose AgentIAD, a tool-driven agentic framework that enables multi-stage visual inspection. The agent is equipped with a Perceptive Zoomer (PZ) for localized fine-grained analysis and a Comparative Retriever (CR) for querying normal exemplars when evidence is ambiguous. To teach these inspection behaviors, we construct structured perceptive and comparative trajectories from the MMAD dataset and train the model in two stages: supervised fine-tuning followed by reinforcement learning. A two-part reward design drives this process: a perception reward that supervises classification accuracy, spatial alignment, and type correctness, and a behavior reward that encourages efficient tool use. Together, these components enable the model to refine its judgment through step-wise observation, zooming, and verification. AgentIAD achieves a new state-of-the-art 97.62% classification accuracy on MMAD, surpassing prior MLLM-based approaches while producing transparent and interpretable inspection traces.
中文标题/摘要
标题:AgentIAD:工具增强的单智能体工业异常检测
工业异常检测(IAD)由于正常参考样本稀缺和许多缺陷的细微、局部性质而具有挑战性。单次视图语言模型(VLMs)往往忽视小异常,缺乏与标准正常模式进行对比的显式机制。我们提出了一种工具驱动的代理框架AgentIAD,以实现多阶段视觉检查。该代理配备了感知放大器(PZ)进行局部精细分析,以及比较检索器(CR)在证据模糊时查询正常示例。为了教授这些检查行为,我们从MMAD数据集中构建了结构化的感知和比较轨迹,并通过两阶段训练模型:监督微调后进行强化学习。两部分奖励设计驱动这一过程:感知奖励监督分类准确性、空间对齐和类型正确性,行为奖励鼓励高效使用工具。这些组件共同使模型能够通过逐步观察、放大和验证来完善其判断。AgentIAD在MMAD上实现了新的97.62%分类准确率,超越了基于MLLM的先前方法,同时生成透明且可解释的检查轨迹。
Summary / 总结
AgentIAD is a tool-driven framework for industrial anomaly detection that uses a multi-stage visual inspection process. It includes a Perceptive Zoomer for detailed analysis and a Comparative Retriever for comparing against normal patterns. The model is trained in two stages: supervised fine-tuning and reinforcement learning, with rewards designed to improve classification accuracy and tool efficiency. AgentIAD achieves 97.62% classification accuracy on MMAD, outperforming previous methods and providing transparent inspection traces.
AgentIAD 是一种工具驱动的工业异常检测框架,使用感知放大器进行局部分析,并使用比较检索器查询正常模式。它通过监督微调和强化学习两个阶段进行训练,奖励系统鼓励准确分类和高效使用工具。AgentIAD 在 MMAD 上达到 97.62% 的分类准确率,超越了先前的方法,并提供透明的检测轨迹。
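The two-part reward can be illustrated with a small sketch. The abstract names the ingredients (classification accuracy, spatial alignment, type correctness, and efficient tool use) but not the formula, so everything below is an assumption: spatial alignment is measured with box IoU, the terms are summed with equal weight, and each tool call costs 0.25 of the behavior reward. The dictionary keys are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def total_reward(pred, truth, tool_calls):
    """Perception reward plus behavior reward (both forms are illustrative)."""
    correct = pred["anomalous"] == truth["anomalous"]
    perception = float(correct)  # classification term
    if correct and truth["anomalous"]:
        perception += box_iou(pred["box"], truth["box"])     # spatial alignment
        perception += float(pred["type"] == truth["type"])   # defect type
    # Behavior term: only paid for correct answers, reduced by each tool call.
    behavior = max(0.0, 1.0 - 0.25 * tool_calls) if correct else 0.0
    return perception + behavior
```

Paying the behavior term only when the answer is correct is one way to keep the agent from skipping tools it actually needs; other weightings are equally plausible.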
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Authors: Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
First: 2025-12-15T18:52:43+00:00 · Latest: 2025-12-15T18:52:43+00:00
Comments: Project page: https://zhoues.github.io/RoboTracer
Abstract
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.
中文标题/摘要
标题:RoboTracer:通过视觉语言模型中的推理掌握空间跟踪
空间跟踪是机器人基本的体态交互能力,由于需要多步度量导向的推理和复杂的空间指代以及现实世界的度量测量,因此本质上具有挑战性。然而,现有方法在处理这种组合任务时存在困难。为此,我们提出RoboTracer,这是一种3D感知的VLM,首先通过通用空间编码器和回归监督解码器实现3D空间指代和测量,增强监督微调(SFT)期间的尺度意识。此外,RoboTracer通过度量敏感的过程奖励进行强化微调(RFT),监督关键中间感知提示以准确生成空间跟踪。为了支持SFT和RFT训练,我们引入了TraceSpatial,这是一个包含3000万QA对的大规模数据集,涵盖了户外/室内/桌面场景,并支持复杂的推理过程(多达9步)。我们还提出了TraceSpatial-Bench,这是一个具有挑战性的基准,填补了空间跟踪评估的空白。实验结果表明,RoboTracer在空间理解、测量和指代方面超越了基线,平均成功率达到了79.1%,并且在TraceSpatial-Bench上也以显著优势达到了SOTA性能,比Gemini-2.5-Pro高出36%的准确率。值得注意的是,RoboTracer可以与各种控制策略结合,执行跨不同机器人(UR5,G1人形机器人)的复杂场景中的长期任务。
Summary / 总结
RoboTracer is designed to address the challenges of spatial tracing in robotics by integrating 3D spatial reasoning into vision-language models. It uses a universal spatial encoder and regression-supervised decoder for 3D spatial referring and measuring, and reinforcement fine-tuning to enhance multi-step metric-grounded reasoning. RoboTracer outperforms existing methods with an average success rate of 79.1% and sets a new state-of-the-art on the TraceSpatial-Bench benchmark, surpassing Gemini-2.5-Pro by 36% accuracy. The method is versatile and can be integrated with various control policies for dynamic tasks across different robots in complex environments.
RoboTracer 通过将 3D 空间推理集成到视觉语言模型中,旨在解决机器人领域的空间跟踪挑战。它使用通用的空间编码器和回归监督解码器进行空间引用和测量,并采用强化微调来增强多步度量导向的推理。该模型在包含 3000 万个问答对的 TraceSpatial 数据集上进行训练,这些问答对涵盖了复杂的推理过程。RoboTracer 的平均成功率达到了 79.1%,并在 TraceSpatial-Bench 上超越了现有方法,特别是超越了 Gemini-2.5-Pro,准确率提高了 36%。它可以应用于各种机器人,执行复杂环境中的长期任务。
Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models
Authors: Shweta Mahajan, Shreya Kadambi, Hoang Le, Munawar Hayat, Fatih Porikli
First: 2025-12-15T18:03:42+00:00 · Latest: 2025-12-15T18:03:42+00:00
Abstract
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating physically plausible scene transformations driven by real-world actions. Unlike prior work focused on object-level edits, Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world. We curate a large-scale dataset of reversible actions from real-world videos and design a training strategy enforcing consistency for robust action grounding. Our experiments reveal that current models struggle with physical reversibility, underscoring the importance of this task for embodied AI, robotics, and physics-aware generative modeling. Do-Undo establishes an intuitive testbed for evaluating and advancing physical reasoning in multimodal systems.
中文标题/摘要
标题:Do-Undo:生成和逆转物理动作在视觉语言模型中的应用
我们引入了Do-Undo任务和基准测试,以解决视觉语言模型中的一个关键问题:理解并生成由真实世界动作驱动的物理上合理的场景变换。与以往专注于对象级编辑的工作不同,Do-Undo要求模型模拟物理动作的结果,然后准确地逆转它,反映视觉世界中的真正因果关系。我们从真实世界的视频中收集了一个大规模的可逆动作数据集,并设计了一种训练策略,以确保动作定位的稳健性。我们的实验表明,当前的模型在物理可逆性方面存在困难,突显了该任务对于具身人工智能、机器人技术和物理感知生成建模的重要性。Do-Undo为评估和推进多模态系统中的物理推理提供了一个直观的测试平台。
Summary / 总结
The Do-Undo task and benchmark are introduced to address the gap in vision-language models' ability to understand and generate physically plausible scene transformations. Unlike previous work focusing on object-level edits, Do-Undo requires models to simulate and reverse physical actions, reflecting true cause-and-effect. The study curates a large dataset of reversible actions from real-world videos and employs a training strategy to enforce consistency. Experiments show that current models struggle with physical reversibility, highlighting the importance of this task for embodied AI, robotics, and physics-aware generative modeling.
提出了Do-Undo任务和基准,旨在通过要求模型模拟和逆转现实世界动作来增强视觉语言模型的物理推理能力。与以往专注于对象级编辑的工作不同,该任务要求模型理解视觉场景中的因果关系。作者从真实视频中收集了一个大规模的可逆动作数据集,并开发了一种提高一致性的训练策略。实验表明,当前模型在物理逆转方面存在困难,突显了这一任务对于具身AI和物理感知生成建模的重要性。
Image Diffusion Preview with Consistency Solver
Authors: Fu-Yun Wang, Hao Zhou, Liangzhe Yuan, Sanghyun Woo, Boqing Gong, Bohyung Han, Ming-Hsuan Yang, Han Zhang, Yukun Zhu, Ting Liu, Long Zhao
First: 2025-12-15T17:47:49+00:00 · Latest: 2025-12-15T17:47:49+00:00
Abstract
The slow inference process of image diffusion models significantly degrades interactive user experiences. To address this, we introduce Diffusion Preview, a novel paradigm employing rapid, low-step sampling to generate preliminary outputs for user evaluation, deferring full-step refinement until the preview is deemed satisfactory. Existing acceleration methods, including training-free solvers and post-training distillation, struggle to deliver high-quality previews or ensure consistency between previews and final outputs. We propose ConsistencySolver derived from general linear multistep methods, a lightweight, trainable high-order solver optimized via Reinforcement Learning, that enhances preview quality and consistency. Experimental results demonstrate that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, making it ideal for efficient preview-and-refine workflows. Notably, it achieves FID scores on-par with Multistep DPM-Solver using 47% fewer steps, while outperforming distillation baselines. Furthermore, user studies indicate our approach reduces overall user interaction time by nearly 50% while maintaining generation quality. Code is available at https://github.com/G-U-N/consolver.
中文标题/摘要
标题:图像扩散预览与一致性求解器
图像扩散模型的缓慢推理过程显著降低了交互式用户体验。为解决这一问题,我们引入了预览模式,这是一种新颖的范式,通过快速、低步数采样生成初步输出供用户评估,直到预览被判定为满意时才进行完整的步数细化。现有的加速方法,包括无训练求解器和后训练蒸馏,难以提供高质量的预览或确保预览与最终输出之间的一致性。我们提出了一致性求解器ConsistencySolver,这是一种源自通用线性多步法的轻量级、可训练的高阶求解器,通过强化学习优化,能够提升预览质量和一致性。实验结果表明,ConsistencySolver在低步数场景中显著提高了生成质量和一致性,使其成为高效的预览和细化工作流的理想选择。值得注意的是,它在使用47%更少的步骤时,FID分数与Multistep DPM-Solver相当,同时优于蒸馏基线。此外,用户研究显示,我们的方法将总体用户交互时间减少了近50%,同时保持了生成质量。代码可在https://github.com/G-U-N/consolver/ 获取。
Summary / 总结
The paper addresses the slow inference process of image diffusion models by introducing Diffusion Preview, which uses rapid, low-step sampling to generate preliminary outputs for user evaluation and defers full-step refinement until the preview is satisfactory. To enhance preview quality and consistency, the authors propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via Reinforcement Learning. Experimental results show that ConsistencySolver significantly improves generation quality and consistency in low-step scenarios, achieving FID scores on par with Multistep DPM-Solver using 47% fewer steps and outperforming distillation baselines. User studies also indicate a reduction of nearly 50% in overall user interaction time while maintaining generation quality.
研究旨在通过加速图像扩散模型的推理过程来改善交互用户体验。它引入了Diffusion Preview,使用快速的低步数采样生成初步输出供用户评估,直到预览满意后再进行全步数细化。提出的ConsistencySolver是一种基于一般线性多步法的轻量级、可训练的高阶求解器,通过强化学习优化,显著提高了预览质量和一致性。实验结果表明,ConsistencySolver在使用47%更少的步骤时,FID分数与Multistep DPM-Solver相当,并且优于蒸馏基线。用户研究显示,与之前的方法相比,该方法将整体用户交互时间减少了近50%,同时保持了生成质量。
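For readers unfamiliar with linear multistep methods, the sketch below shows the generic explicit update that such solvers build on: the next state is the current state plus a weighted sum of the most recent slope evaluations. It is not ConsistencySolver itself. The coefficients here are the fixed classical two-step Adams-Bashforth values (1.5, -0.5), whereas the abstract describes coefficients that are trainable and optimized with reinforcement learning. The function name and the toy ODE are assumptions for illustration.

```python
import math

def multistep_solve(f, x0, t0, t1, steps, coeffs=(1.5, -0.5)):
    """Explicit linear multistep integration of dx/dt = f(t, x).

    Update rule: x_{n+1} = x_n + h * sum_j coeffs[j] * f_{n-j},
    where f_{n-j} are the most recent slope evaluations (newest first).
    """
    h = (t1 - t0) / steps
    x, t, history = x0, t0, []
    for _ in range(steps):
        history.insert(0, f(t, x))
        history = history[:len(coeffs)]
        if len(history) < len(coeffs):
            slope = history[0]  # warm-up: plain Euler until enough history exists
        else:
            slope = sum(c * g for c, g in zip(coeffs, history))
        x, t = x + h * slope, t + h
    return x
```

On the toy problem dx/dt = -x, reusing one past slope already gives a much smaller error than Euler at the same number of function evaluations, which is the property that makes multistep solvers attractive in low-step sampling.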
Deep priors for satellite image restoration with accurate uncertainties
Authors: Biquard Maud, Marie Chabert, Florence Genin, Christophe Latry, Thomas Oberlin
Venue: IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1-16, 2025, Art no. 5652916
First: 2024-12-05T12:56:03+00:00 · Latest: 2025-12-15T16:43:56+00:00
Abstract
Satellite optical images, upon their on-ground receipt, offer a distorted view of the observed scene. Their restoration, including denoising, deblurring, and sometimes super-resolution, is required before their exploitation. Moreover, quantifying the uncertainties related to this restoration helps to reduce the risks of misinterpreting the image content. Deep learning methods are now state-of-the-art for satellite image restoration. Among them, direct inversion methods train a specific network for each sensor, and generally provide a point estimation of the restored image without the associated uncertainties. Alternatively, deep regularization (DR) methods learn a deep prior on target images before plugging it, as the regularization term, into a model-based optimization scheme. This allows for restoring images from several sensors with a single network and possibly for estimating associated uncertainties. In this paper, we introduce VBLE-xz, a DR method that solves the inverse problem in the latent space of a variational compressive autoencoder (CAE). We adapt the regularization strength by modulating the bitrate of the trained CAE with a training-free approach. Then, VBLE-xz estimates relevant uncertainties jointly in the latent and in the image spaces by sampling an explicit posterior estimated within variational inference. This enables fast posterior sampling, unlike state-of-the-art DR methods that use Markov chains or diffusion-based approaches. We conduct a comprehensive set of experiments on very high-resolution simulated and real Pléiades images, assessing the performance, robustness and scalability of the proposed method. They demonstrate that VBLE-xz represents a compelling alternative to direct inversion methods when uncertainty quantification is required. The code associated with this paper is available at https://github.com/MaudBqrd/VBLExz.
中文标题/摘要
标题:卫星图像恢复的深度先验及其准确的不确定性
卫星光学图像在地面接收时会呈现出观测场景的失真视图。在利用这些图像之前,需要对其进行恢复,包括去噪、去模糊,有时还包括超分辨率。此外,量化恢复过程相关的不确定性有助于降低误读图像内容的风险。如今,深度学习方法已成为卫星图像恢复的前沿技术。其中,直接反演方法为每个传感器训练特定的网络,并通常提供恢复图像的点估计,而不提供相关的不确定性。相反,深度正则化(DR)方法在目标图像上学习一个深度先验,然后将其作为正则化项插入基于模型的优化方案中。这使得使用单个网络从多个传感器恢复图像,并可能估计相关不确定性。在本文中,我们引入了VBLE-xz,这是一种DR方法,它在变分压缩自编码器(CAE)的潜在空间中解决逆问题。我们通过训练免费的方法调节正则化强度。然后,VBLE-xz通过在变分推断中采样显式后验,在潜在空间和图像空间中联合估计相关不确定性。这使得后验采样速度快于最先进的DR方法,这些方法使用马尔可夫链或扩散方法。我们在非常高分辨率的模拟和实际Pléiades图像上进行了全面的实验,证明了所提方法的性能、鲁棒性和可扩展性。这些结果表明,当需要量化不确定性时,VBLE-xz是直接反演方法的一个有吸引力的替代方案。与本文相关的代码可在https://github.com/MaudBqrd/VBLExz/获得。
Summary / 总结
This paper introduces VBLE-xz, a deep regularization method for satellite image restoration that solves the inverse problem in the latent space of a variational compressive autoencoder. It adapts the regularization strength by modulating the bitrate of the trained autoencoder without retraining, and estimates uncertainties jointly in the latent and image spaces by sampling an explicit posterior obtained through variational inference. Experiments on very high-resolution simulated and real Pléiades images assess its performance, robustness, and scalability, and indicate that VBLE-xz is a compelling alternative to direct inversion methods when uncertainty quantification is required.
该论文提出了一种名为VBLE-xz的深度正则化方法,用于卫星图像恢复,该方法在变分压缩自编码器的潜在空间中学习先验知识。通过比特率调整调节正则化强度,并使用变分推断在潜在空间和图像空间中联合估计不确定性。实验结果表明,VBLE-xz在高分辨率Pléiades图像上的性能、鲁棒性和可扩展性均优于直接反演方法,特别是在需要不确定性量化时更为出色。
HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
Authors: Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan, Jingdong Chen, Le Wang
First: 2025-08-14T12:14:15+00:00 · Latest: 2025-12-15T16:23:44+00:00
Comments: Accepted by AAAI2026
Abstract
While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, grounded in the observation that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, we posit that reasoning ability serves as the key to unlocking it. We devise a multi-stage, modality-progressive reinforcement learning approach, resulting in HumanSense-Omni-Reasoning, which substantially enhances performance on higher-level understanding and interactive tasks. Additionally, we observe that successful reasoning processes appear to exhibit consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/
中文标题/摘要
标题:HumanSense:通过推理MLLM实现共情的上下文感知响应
尽管多模态大型语言模型(MLLMs)在实现真正的人类交互方面展现出巨大的潜力,但缺乏针对人类中心场景的精细评估框架,涵盖复杂的人类意图理解和提供共情的、上下文感知的响应,这阻碍了进展。在此,我们介绍了HumanSense,一个全面的基准,旨在评估MLLMs的人类中心感知和交互能力,特别关注对扩展多模态上下文的深刻理解以及合理反馈的形成。我们的评估表明,领先MLLMs在高级交互任务方面仍有很大的改进空间。将视觉输入补充以音频和文本信息可带来显著改进,而全模态模型在这些任务上表现出优势。此外,基于适当反馈源自对话者需求和情绪的上下文分析这一观察,我们认为推理能力是解锁这一能力的关键。我们设计了一种多阶段、模态渐进的强化学习方法,从而产生了HumanSense-Omni-Reasoning,显著提升了高层次理解和交互任务的性能。此外,我们观察到成功的推理过程似乎表现出一致的思维模式。通过设计相应的提示,我们还以无训练的方式增强了非推理模型的性能。项目页面:https://digital-avatar.github.io/ai/HumanSense/
Summary / 总结
HumanSense is a benchmark designed to evaluate MLLMs' human-centered perception and interaction capabilities, focusing on understanding extended multimodal contexts and providing empathetic, context-aware responses. The evaluation shows that leading MLLMs still have considerable room for improvement, especially on advanced interaction-oriented tasks, and that supplementing visual input with audio and text yields substantial gains. A multi-stage, modality-progressive reinforcement learning approach produces HumanSense-Omni-Reasoning, which substantially improves performance on higher-level understanding and interactive tasks. The authors also observe that successful reasoning processes appear to follow consistent thought patterns, and prompts designed around these patterns improve non-reasoning models without training.
研究旨在开发一个全面基准HumanSense,以评估MLLMs在人本感知和交互能力方面的表现,重点在于理解复杂的人类意图并提供同理心的回应。研究发现,领先MLLMs在高级交互任务上仍需改进,整合音频、文本与视觉输入显著提升了性能。提出了一种多阶段、模态渐进的强化学习方法HumanSense-Omni-Reasoning,以提高高层次理解和交互任务的表现,展示了推理过程中的一致思维模式。非推理模型也通过相应的提示设计在无需额外训练的情况下得到了性能提升。
Instance-Level Composed Image Retrieval
Authors: Bill Psomas, George Retsinas, Nikos Efthymiadis, Panagiotis Filntisis, Yannis Avrithis, Petros Maragos, Ondrej Chum, Giorgos Tolias
Venue: NeurIPS 2025
First: 2025-10-29T10:57:59+00:00 · Latest: 2025-12-15T14:45:31+00:00
Comments: NeurIPS 2025
Abstract
The progress of composed image retrieval (CIR), a popular research direction in image retrieval, where a combined visual and textual query is used, is held back by the absence of high-quality training and evaluation data. We introduce a new evaluation dataset, i-CIR, which, unlike existing datasets, focuses on an instance-level class definition. The goal is to retrieve images that contain the same particular object as the visual query, presented under a variety of modifications defined by textual queries. Its design and curation process keep the dataset compact to facilitate future research, while maintaining its challenge (comparable to retrieval among more than 40M random distractors) through a semi-automated selection of hard negatives.
To overcome the challenge of obtaining clean, diverse, and suitable training data, we leverage pre-trained vision-and-language models (VLMs) in a training-free approach called BASIC. The method separately estimates query-image-to-image and query-text-to-image similarities, performing late fusion to upweight images that satisfy both queries, while down-weighting those that exhibit high similarity with only one of the two. Each individual similarity is further improved by a set of components that are simple and intuitive. BASIC sets a new state of the art on i-CIR but also on existing CIR datasets that follow a semantic-level class definition. Project page: https://vrg.fel.cvut.cz/icir/.
中文标题/摘要
标题:实例级合成图像检索
合成图像检索(CIR),作为一种流行的图像检索研究方向,使用结合视觉和文本的查询,其进展受限于高质量训练和评估数据的缺失。我们引入了一个新的评估数据集i-CIR,与现有数据集不同,它专注于实例级类定义。目标是检索包含与视觉查询相同特定对象的图像,这些图像在文本查询定义的多种修改下呈现。其设计和编纂过程使数据集保持紧凑,以促进未来研究,同时通过半自动选择硬负例,保持其挑战性与在超过4000万随机干扰项中的检索相当。
为克服获得干净、多样且合适的训练数据的挑战,我们利用预训练的视觉-语言模型(VLMs)采用无训练方法BASIC。该方法分别估计查询-图像到图像和查询-文本到图像的相似性,在晚期融合中加权满足两个查询的图像,而降低仅与其中一个查询高度相似的图像的权重。每个单独的相似性进一步通过一组简单直观的组件改进。BASIC在i-CIR上以及遵循语义级类定义的现有CIR数据集上均达到了新的最佳性能。项目页面:https://vrg.fel.cvut.cz/icir/
Summary / 总结
The research addresses the lack of high-quality data for composed image retrieval by introducing i-CIR, a new evaluation dataset built on an instance-level class definition. The accompanying method, BASIC, is training-free: it uses pre-trained vision-and-language models to estimate query-image-to-image and query-text-to-image similarities separately, then performs late fusion to upweight images that satisfy both the visual and the textual query. BASIC sets a new state of the art on i-CIR and on existing CIR datasets that follow a semantic-level class definition.
研究通过引入专注于实例级对象检索的新数据集i-CIR,解决了现有组成图像检索数据集的限制。方法BASIC利用预训练的视觉-语言模型估计查询与图像之间的相似性,并进行后期融合以优先考虑同时匹配视觉和文本查询的图像。BASIC在i-CIR和其他组成图像检索数据集上均达到了最先进的性能,展示了其在处理实例级查询方面的有效性。
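The late-fusion step can be illustrated with a minimal sketch. The abstract states only that images satisfying both queries are upweighted and images matching a single query are down-weighted; the particular rule below (min-max normalization followed by multiplication) is an assumption chosen because it has that effect, and it omits the additional components BASIC applies to each similarity. The function names are hypothetical.

```python
def minmax(scores):
    """Rescale scores to [0, 1] across the database (all zeros if constant)."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def late_fusion(img_sims, txt_sims):
    """Multiply normalised similarities so only items high on both rank first.

    img_sims[i]: similarity of database image i to the query image.
    txt_sims[i]: similarity of database image i to the query text.
    """
    return [a * b for a, b in zip(minmax(img_sims), minmax(txt_sims))]
```

With a product, an image that matches the query object but not the text modification (or the reverse) receives a score near zero, whereas a sum would still rank it highly.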
Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs
Authors: Anran Qi, Changjian Li, Adrien Bousseau, Niloy J. Mitra
First: 2025-12-15T14:45:05+00:00 · Latest: 2025-12-15T14:45:05+00:00
Abstract
We address image-to-video generation with explicit user control over the final frame's disoccluded regions. Current image-to-video pipelines produce plausible motion but struggle to generate predictable, articulated motions while enforcing user-specified content in newly revealed areas. Our key idea is to separate motion specification from appearance synthesis: we introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that deterministically yet approximately drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion. In our training-free pipeline, the user loosely annotates and reposes a PDG, from which we compute a dense motion flow to leverage diffusion as a motion-guided shader. We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas. This design yields controllable articulation and user control over disocclusions without fine-tuning. We demonstrate clear advantages over state-of-the-art alternatives when turning images into short videos of articulated objects, furniture, vehicles, and deformables. Our method mixes generative control, in the form of loose pose and structure, with predictable controls, in the form of appearance specification in the disoccluded regions of the final frame, unlocking a new image-to-video workflow. Code will be released on acceptance. Project page: https://anranqi.github.io/beyondvisible.github.io/
中文标题/摘要
标题:超越可见:基于代理动态图的消隐感知编辑
我们解决了带有明确用户控制最终帧消隐区域的图像到视频生成问题。当前的图像到视频流水线能够生成可信的运动,但在生成可预测、有条理的运动同时确保新揭示区域中用户指定内容方面存在困难。我们的核心思想是将运动规范与外观合成分离:我们引入了一个轻量级、用户可编辑的代理动态图(PDG),它以确定性但近似的方式驱动部分运动,而冻结的扩散先验用于合成遵循该运动的可信外观。在我们的无需训练的流水线中,用户对PDG进行粗略标注和重新定位,从中我们计算密集的运动流以利用扩散作为运动导向的着色器。然后,用户可以在图像的消隐区域编辑外观,并利用PDG编码的可见性信息在这些区域执行潜在空间合成,以在运动与用户意图之间达成一致。此设计实现了可控的有条理性和对消隐的用户控制,无需微调。我们展示了与最先进的替代方案相比,我们的方法在将图像转化为包含有条理对象、家具、车辆和变形体的短视频方面的明显优势。我们的方法结合了生成性控制(松散的姿态和结构)与最终帧中消隐区域外观规范的可预测控制,解锁了一种新的图像到视频工作流程。代码将在接受后发布。项目页面:https://anranqi.github.io/beyondvisible.github.io/
Summary / 总结
The paper addresses image-to-video generation in which the motion must be predictable and articulated and the user controls the content of disoccluded regions. It introduces a Proxy Dynamic Graph (PDG) that drives part motion deterministically yet approximately, while a frozen diffusion model synthesizes plausible appearance that follows that motion. Users loosely annotate and repose the PDG, from which a dense motion flow is computed; they then edit appearance in disoccluded areas, and a latent-space composite guided by the PDG's visibility information reconciles motion with user intent. The training-free method provides controllable articulation and user control over disocclusions without fine-tuning, and shows clear advantages over state-of-the-art alternatives in generating short videos of articulated objects, furniture, vehicles, and deformables.
论文解决了在图像到视频转换中生成合理运动的同时让用户控制不遮挡区域的问题。它引入了一个代理动态图(PDG),以确定性方式驱动部分运动,同时使用冻结的扩散模型合成合理的外观。用户可以松散地标注和重新定位PDG,计算密集的运动流,编辑不遮挡区域的外观,并在这些区域中将运动与用户意图统一起来。该方法提供了可控的运动控制和对不遮挡区域的用户控制,无需微调,优于最先进的替代方案,在生成具有运动特征的对象、家具、车辆和变形体的短视频方面表现出色。
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models
Authors: Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Gang Liu, Jiahong Yan, Chun Yuan, Dian Li
First: 2025-11-13T03:08:51+00:00 · Latest: 2025-12-15T13:55:27+00:00
Abstract
Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available.
中文标题/摘要
标题:学习提出问题:基于推理驱动和求解器自适应的数据合成方法
为大型推理模型训练的数据合成提供了比有限的人工精选数据集更具扩展性的替代方案,能够生成高质量的数据。然而,现有方法面临几个挑战:(i)无差别生成忽略求解器能力,导致低价值问题,或依赖复杂的数据管道平衡问题难度;(ii)问题生成缺乏推理,导致浅层问题变体。在本文中,我们开发了一个问题生成器,该生成器在合成前明确推理以规划问题方向,并根据求解器能力调整难度。具体而言,我们构建了相关问题对,并通过推理模型生成中间问题设计推理(CoT)进行增强。这些数据为生成器提供了问题设计策略的启动。然后,我们将求解器对合成问题的反馈作为奖励信号,使生成器能够校准难度并生成接近求解器能力边缘的互补问题。在10个数学和通用推理基准上的广泛实验表明,我们的方法平均提高了2.5%,并能够泛化到语言和视觉语言模型。此外,使用合成数据训练的求解器为生成器的持续训练提供了改进的奖励,促进了协同进化,进一步提高了0.7%的性能。我们的代码将在此公开发布。
Summary / 总结
This paper addresses the challenges in data synthesis for training large reasoning models, such as indiscriminate generation and lack of reasoning in problem creation. The authors propose a method that reasons explicitly to plan problem directions and adapts difficulty based on the solver's ability. By constructing related problem pairs and using a reasoning model to generate intermediate problem-design CoT, the method improves the quality of synthetic data. Experiments on 10 benchmarks show an average improvement of 2.5% and generalization to both language and vision-language models. Additionally, a co-evolutionary process between the solver and generator further enhances performance by 0.7%.
本文通过开发一个显式推理规划问题方向并根据求解器能力调整难度的问题生成器,解决了大规模推理模型训练中的数据合成挑战。该方法构建相关问题对,并使用推理模型生成的中间问题设计推理过程进行增强。它利用求解器对合成问题的反馈作为奖励信号来调整难度并生成接近求解器能力边缘的互补问题。在10个基准测试上的实验显示平均改进了2.5%,并且能够泛化到语言和视觉语言模型,进一步通过协同进化获得0.7%的性能提升。
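As a rough illustration of the solver-feedback reward described in this entry, the sketch below scores a synthetic problem highest when the solver succeeds about half the time. The abstract does not give the reward's functional form, so the triangular shape, the name `difficulty_reward`, and the 0.5 target are illustrative assumptions rather than the authors' formulation.

```python
def difficulty_reward(solver_outcomes, target=0.5):
    """Hypothetical reward: highest when the solver's pass rate is near `target`.

    solver_outcomes: list of booleans, one per solver attempt on one synthetic problem.
    """
    if not solver_outcomes:
        raise ValueError("need at least one solver attempt")
    pass_rate = sum(solver_outcomes) / len(solver_outcomes)
    # Largest possible distance from the target, used to scale the reward into [0, 1].
    worst = max(target, 1.0 - target)
    return 1.0 - abs(pass_rate - target) / worst
```

Under this assumed shape, problems the solver always solves or never solves receive zero reward, while problems at the edge of its competence receive the most.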
CausalCLIP: Causally-Informed Feature Disentanglement and Filtering for Generalizable Detection of Generated Images
Authors: Bo Liu, Qiao Qin, Qinghui He
Venue: AAAI 2026
First: 2025-12-15T12:48:27+00:00 · Latest: 2025-12-15T12:48:27+00:00
Comments: 9 pages. Accepted to AAAI 2026
Abstract
The rapid advancement of generative models has increased the demand for generated image detectors capable of generalizing across diverse and evolving generation techniques. However, existing methods, including those leveraging pre-trained vision-language models, often produce highly entangled representations, mixing task-relevant forensic cues (causal features) with spurious or irrelevant patterns (non-causal features), thus limiting generalization. To address this issue, we propose CausalCLIP, a framework that explicitly disentangles causal from non-causal features and employs targeted filtering guided by causal inference principles to retain only the most transferable and discriminative forensic cues. By modeling the generation process with a structural causal model and enforcing statistical independence through Gumbel-Softmax-based feature masking and Hilbert-Schmidt Independence Criterion (HSIC) constraints, CausalCLIP isolates stable causal features robust to distribution shifts. When tested on unseen generative models from different series, CausalCLIP demonstrates strong generalization ability, achieving improvements of 6.83% in accuracy and 4.06% in average precision over state-of-the-art methods.
中文标题/摘要
标题:CausalCLIP:因果驱动的特征解缠与过滤以实现生成图像检测的泛化能力
生成模型的快速发展增加了对能够跨多样且不断演变的生成技术进行泛化的生成图像检测器的需求。然而,现有的方法,包括利用预训练的视觉-语言模型的方法,往往会产生高度缠结的表示,将与任务相关的法医线索(因果特征)与虚假或无关的模式(非因果特征)混合在一起,从而限制了泛化能力。为了解决这一问题,我们提出了CausalCLIP框架,该框架明确地解缠因果特征与非因果特征,并通过因果推理原则进行目标过滤,仅保留最具转移性和区分性的法医线索。通过使用结构因果模型建模生成过程,并通过Gumbel-Softmax基特征掩蔽和Hilbert-Schmidt独立性判别准则(HSIC)约束来强制统计独立性,CausalCLIP隔离了对分布偏移具有鲁棒性的稳定因果特征。当在不同系列的未见过的生成模型上进行测试时,CausalCLIP展示了强大的泛化能力,相对于最先进的方法,在准确性和平均精度上分别提高了6.83%和4.06%。
Summary / 总结
CausalCLIP is a framework designed to improve the generalization of generated image detectors by disentangling causal and non-causal features. It uses causal inference principles to filter out non-causal features, ensuring that only the most discriminative and transferable causal features are retained. By modeling the generation process with a structural causal model and applying Gumbel-Softmax-based feature masking and HSIC constraints, CausalCLIP isolates stable causal features that are robust to distribution shifts. Experiments show that CausalCLIP outperforms existing methods, achieving a 6.83% improvement in accuracy and a 4.06% improvement in average precision on unseen generative models from different series.
CausalCLIP旨在通过分离因果和非因果特征并过滤掉非因果特征来提高生成图像检测器的泛化能力。它使用结构因果模型和基于Gumbel-Softmax的特征掩蔽以及HSIC约束来隔离稳定的因果特征。实验表明,CausalCLIP优于现有方法,准确率提高了6.83%,平均精度提高了4.06%。
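The HSIC constraint mentioned in the CausalCLIP entry can be pictured with the standard biased empirical estimator, trace(KHLH) / (n - 1)^2. This is a minimal sketch only: the scalar features, the Gaussian kernel, and the bandwidth `sigma` are assumptions made for illustration, and the abstract does not say which kernel or feature dimensionality the authors use.

```python
import math

def gaussian_kernel(xs, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(x_i - x_j)^2 / (2 sigma^2)) for scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs] for a in xs]

def center(K):
    """Double-center a symmetric Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    # K is symmetric, so column means equal row means.
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between two lists of paired scalar samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    Kc = center(gaussian_kernel(xs, sigma))
    L = gaussian_kernel(ys, sigma)
    # trace(K H L H) equals the elementwise sum of (H K H) * L for symmetric L.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2
```

A value near zero indicates little measured dependence between the two feature sets, which is the quantity an independence penalty of this kind would push down.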
Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection
Authors: Juil Koo, Daehyeon Choi, Sangwoo Youn, Phillip Y. Lee, Minhyuk Sung
First: 2025-12-15T12:04:26+00:00 · Latest: 2025-12-15T12:04:26+00:00
Comments: Project page: https://active-view-selection.github.io/
Abstract
Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.
中文标题/摘要
标题:迈向移动视觉:学习基于视觉的主动视角选择
视觉语言模型(VLMs)在视觉问答(VQA)方面表现出色,但仍然局限于静态视觉,仅能从静态图像中进行推理。相比之下,具身智能体需要移动视觉,能够主动移动以获取更有信息量的视角。我们引入了基于视觉的主动视角选择(VG-AVS)任务,该任务仅使用当前图像中的视觉信息来选择最有信息量的下一个视角,而不依赖于场景记忆或外部知识。为了支持这一任务,我们构建了一个合成数据集,其中包含自动生成的查询-目标视图配对以及问题-答案提示。我们还提出了一种框架,通过监督微调(SFT)后跟基于强化学习的策略优化来微调预训练的VLMs。我们的方法在基于视角选择的问题回答方面表现出色,并且能够稳健地泛化到未见过的合成和真实场景中。此外,将我们学习到的VG-AVS框架集成到现有的场景探索为基础的EQA系统中,可以提高下游问题回答的准确性。
Summary / 总结
The research aims to equip embodied agents with ambulatory vision by developing Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image. The method involves creating a synthetic dataset with paired query-target views and question-answer prompts, and fine-tuning pretrained Vision Language Models (VLMs) through supervised fine-tuning followed by reinforcement learning-based policy optimization. The approach demonstrates strong performance in question answering based on viewpoint selection, generalizes well to unseen synthetic and real scenes, and improves the accuracy of existing scene-exploration-based question-answering systems when integrated into them.
研究旨在使机器人能够选择最有信息量的视角进行视觉问答,超越静态图像分析。方法包括创建合成数据集进行训练,并使用一种框架,该框架通过监督微调和基于强化学习的策略优化来微调预训练的视觉语言模型。关键发现表明,在基于视角选择的问答任务中表现出色,并且能够稳健地泛化到未见过的场景。将此框架集成到现有系统中可以提高场景探索任务中的问答准确性。
Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection
Authors: Zihui Zhao, Zechang Li
First: 2025-12-15T11:55:55+00:00 · Latest: 2025-12-15T11:55:55+00:00
Abstract
Direct Preference Optimization (DPO) has emerged as a lightweight and effective alternative to Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with AI Feedback (RLAIF) for aligning large language and vision-language models. However, the standard DPO formulation, in which both the chosen and rejected responses are generated by the same policy, suffers from a weak learning signal because the two responses often share similar errors and exhibit small Kullback-Leibler (KL) divergence. This leads to slow and unstable convergence. To address this limitation, we introduce Reflective Preference Optimization (RPO), a new framework that incorporates hint-guided reflection into the DPO paradigm. RPO uses external models to identify hallucination sources and generate concise reflective hints, enabling the construction of on-policy preference pairs with stronger contrastiveness and clearer preference signals. We theoretically show that conditioning on hints increases the expected preference margin through mutual information and improves sample efficiency while remaining within the policy distribution family. Empirically, RPO achieves superior alignment with fewer training samples and iterations, substantially reducing hallucination rates and delivering state-of-the-art performance across multimodal benchmarks.
中文标题/摘要
标题:反射偏好优化(RPO):通过提示引导的反思增强策略对齐
直接偏好优化(DPO)已成为一种轻量级且有效的替代强化学习从人类反馈(RLHF)和强化学习与AI反馈(RLAIF)的方法,用于对齐大型语言和视觉-语言模型。然而,标准的DPO公式,其中选择和拒绝的响应均由同一策略生成,由于两者经常共享相似的错误并表现出较小的Kullback-Leibler(KL)散度,导致学习信号较弱,从而导致收敛缓慢且不稳定。为解决这一局限性,我们引入了反射偏好优化(RPO),这是一种新的框架,将提示引导的反思融入DPO范式中。RPO使用外部模型来识别幻觉来源并生成简洁的反思提示,从而能够构建具有更强对比度和更清晰偏好信号的策略对齐偏好对。我们从理论上证明,通过互信息条件化提示可以增加预期的偏好边际,提高样本效率,同时保持在策略分布家族内。实验上,RPO在较少的训练样本和迭代次数下实现了更好的对齐,显著降低了幻觉率,并在多模态基准测试中达到了最先进的性能。
Summary / 总结
Reflective Preference Optimization (RPO) addresses the limitations of Direct Preference Optimization (DPO) by introducing hint-guided reflection to enhance on-policy alignment. RPO uses external models to generate reflective hints, which help construct preference pairs with stronger contrastiveness. Empirically, RPO achieves better alignment with fewer samples and iterations, reducing hallucination rates and outperforming existing methods on multimodal benchmarks.
Reflective Preference Optimization (RPO) 通过引入提示引导的反思来改进直接偏好优化(DPO),以增强策略对齐。RPO 使用外部模型生成反思提示,从而提高偏好信号的对比度和清晰度。实验表明,RPO 使用更少的训练样本和迭代次数实现了更好的对齐效果,降低了幻觉率,并在多模态基准测试中达到了最先进的性能。
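For reference, the standard per-pair DPO loss that RPO builds on is -log sigmoid(beta * margin), where the margin compares policy and reference log-probabilities of the chosen and rejected responses. The sketch below implements only that baseline objective; it does not implement RPO's hint-guided reflection, and the function name and the default `beta` are illustrative choices.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log sigmoid(beta * margin)."""
    # Margin: how much more the policy prefers the chosen response than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the chosen and rejected responses receive nearly the same log-probabilities, the margin is close to zero and the loss sits near log 2, which is one way to read the weak learning signal the abstract attributes to pairs sampled from the same policy.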
MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion
Authors: Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen, Gang Wang, Runwei Guan, Weiping Ding
First: 2025-12-15T10:37:59+00:00 · Latest: 2025-12-15T10:37:59+00:00
Abstract
Vision-language models enable understanding of and reasoning about complex traffic scenarios through multi-source information fusion, which has established them as a core technology for autonomous driving. However, existing vision-language models are constrained by the 2D image-understanding paradigm, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to a generalized 3D scene understanding framework. MMDrive incorporates three complementary modalities: occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and METEOR of 41.78 on DriveLM, and an accuracy score of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
中文标题/摘要
标题:MMDrive:超越视觉的多表示融合交互场景理解
视觉语言模型通过多源信息融合,使复杂交通场景的理解和推理成为可能,成为自动驾驶的核心技术。然而,现有的视觉语言模型受限于二维图像理解范式,限制了其对三维空间信息的感知能力和深层次语义融合的能力,导致在复杂自动驾驶环境中表现不佳。本研究提出MMDrive,这是一种多模态视觉语言模型框架,将传统的图像理解扩展到一个通用的三维场景理解框架。MMDrive整合了三种互补的模态,包括占用地图、LiDAR点云和文本场景描述。为此,它引入了两种新的组件,用于自适应跨模态融合和关键信息提取。具体来说,文本导向的多模态调制器根据问题中的语义线索动态加权每个模态的贡献,引导上下文感知特征整合。跨模态抽象器使用可学习的抽象标记生成紧凑的跨模态摘要,突出关键区域和重要语义。在DriveLM和NuScenes-QA基准上的全面评估表明,MMDrive在自动驾驶中显著优于现有视觉语言模型,DriveLM上的BLEU-4得分为54.56,METEOR得分为41.78,NuScenes-QA上的准确率为62.7%。MMDrive有效地突破了传统的仅图像理解障碍,能够在复杂驾驶环境中实现稳健的多模态推理,并为可解释的自动驾驶场景理解提供新的基础。
Summary / 总结
MMDrive is a multimodal vision-language model framework that extends traditional 2D image understanding to a 3D scene understanding framework, incorporating occupancy maps, LiDAR point clouds, and textual scene descriptions. It introduces two novel components: a Text-oriented Multimodal Modulator for adaptive cross-modal fusion based on semantic cues, and a Cross-Modal Abstractor for generating compact summaries. MMDrive significantly outperforms existing models on the DriveLM and NuScenes-QA benchmarks, with a BLEU-4 score of 54.56 and a METEOR score of 41.78 on DriveLM, and an accuracy of 62.7% on NuScenes-QA, demonstrating robust multimodal reasoning in complex driving environments.
MMDrive 是一种多模态视觉-语言模型框架,将传统的二维图像理解扩展到三维场景理解框架,结合了占用地图、LiDAR 点云和文本场景描述。它引入了两个新型组件:面向文本的多模态调节器,用于基于语义线索的自适应跨模态融合,以及跨模态摘要器,用于生成紧凑的跨模态摘要。MMDrive 在 DriveLM 和 NuScenes-QA 基准测试中显著优于现有模型,DriveLM 的 BLEU-4 得分为 54.56,METEOR 得分为 41.78,NuScenes-QA 的准确率为 62.7%,展示了在复杂驾驶环境中的稳健多模态推理能力。
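The idea of weighting modalities by the question, as attributed to the Text-oriented Multimodal Modulator in this entry, can be sketched as a softmax over question-to-modality similarity scores. Everything concrete here is a hypothetical stand-in: the dot-product scoring, the function name `modulate`, and the toy vectors are assumptions, since the abstract does not describe the module's internals.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def modulate(question_vec, modality_feats):
    """Weight each modality by its similarity to the question, then fuse.

    modality_feats: dict mapping a modality name to a feature vector with the
    same length as question_vec (hypothetical pre-pooled features).
    """
    names = list(modality_feats)
    # Assumed scoring rule: dot product between question and modality features.
    scores = [sum(q * f for q, f in zip(question_vec, modality_feats[n])) for n in names]
    weights = softmax(scores)
    dim = len(question_vec)
    fused = [sum(w * modality_feats[n][d] for w, n in zip(weights, names))
             for d in range(dim)]
    return dict(zip(names, weights)), fused
```

For example, calling `modulate([1.0, 0.0], {"occupancy": [2.0, 0.0], "lidar": [0.0, 2.0], "text": [0.0, 0.0]})` gives the occupancy features the largest weight, because they align best with the question vector.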
MR-COSMO: Visual-Text Memory Recall and Direct CrOSs-MOdal Alignment Method for Query-Driven 3D Segmentation
Authors: Chade Li, Pengju Zhang, Yihong Wu
Venue: AAAI 2026
First: 2025-06-26T04:10:33+00:00 · Latest: 2025-12-15T09:43:49+00:00
Comments: Accepted by AAAI 2026. Copyright (c) 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved
Abstract
The rapid advancement of vision-language models (VLMs) in 3D domains has accelerated research in text-query-guided point cloud processing, though existing methods underperform in point-level segmentation due to inadequate 3D-text alignment that limits local feature-text context linking. To address this limitation, we propose MR-COSMO, a Visual-Text Memory Recall and Direct CrOSs-MOdal Alignment Method for Query-Driven 3D Segmentation, establishing explicit alignment between 3D point clouds and text/2D image data through a dedicated direct cross-modal alignment module while implementing a visual-text memory module with specialized feature banks. This direct alignment mechanism enables precise fusion of geometric and semantic features, while the memory module employs specialized banks storing text features, visual features, and their correspondence mappings to dynamically enhance scene-specific representations via attention-based knowledge recall. Comprehensive experiments across 3D instruction, reference, and semantic segmentation benchmarks confirm state-of-the-art performance.
中文标题/摘要
标题:MR-COSMO:基于视觉-文本记忆召回和直接跨模态对齐的查询驱动3D分割方法
视觉语言模型(VLMs)在3D领域的迅速发展加速了文本查询引导的点云处理研究,尽管现有方法在点级分割方面表现不佳,因为缺乏足够的3D-文本对齐,限制了局部特征-文本上下文链接。为了解决这一限制,我们提出了MR-COSMO,一种基于视觉-文本记忆召回和直接跨模态对齐的查询驱动3D分割方法,通过专用的直接跨模态对齐模块建立了3D点云和文本/2D图像数据之间的显式对齐,同时实现了具有专门特征库的视觉-文本记忆模块。这种直接对齐机制能够精确融合几何和语义特征,而记忆模块则通过基于注意力的知识召回动态增强场景特定表示,存储文本特征、视觉特征及其对应映射的专门库。全面的实验结果表明,该方法在3D指令、参考和语义分割基准测试中表现出最先进的性能。
dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
Authors: Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang
First: 2025-12-02T07:42:38+00:00 · Latest: 2025-12-15T09:43:11+00:00
Abstract
Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce $\text{dots.ocr}$, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, $\text{dots.ocr}$ establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.
中文标题/摘要
标题:dots.ocr:单个视觉语言模型中的多语言文档布局解析
文档布局解析是人工智能(AI)访问和解释世界庞大结构化知识库的关键途径。这一过程包括布局检测、文本识别和关系理解,对于增强下一代视觉语言模型至关重要。然而,当前的方法依赖于分段的多阶段管道,容易产生错误传播,并且无法充分利用联合训练的协同效应。在本文中,我们介绍了$\text{dots.ocr}$,这是一种单个视觉语言模型,首次在统一的端到端框架中联合学习三个核心任务。这得益于一个高度可扩展的数据引擎,该引擎综合了一个庞大的多语言语料库,使模型能够在各种任务中表现出强大的性能,涵盖多种语言、布局和领域。我们通过在综合的OmniDocBench上取得最先进的性能验证了我们统一范式的有效性。此外,为了促进全球文档智能的研究,我们引入了XDocParse,这是一个涵盖126种语言的具有挑战性的新基准。在这一测试平台上,$\text{dots.ocr}$建立了新的基准,比第二好的竞争对手高出显著的+7.4分,并证明了其无与伦比的多语言能力。
Summary / 总结
The paper introduces dots.ocr, a single Vision-Language Model that jointly learns layout detection, text recognition, and relational understanding in an end-to-end framework. This approach, enabled by a scalable data engine, achieves state-of-the-art performance on OmniDocBench and sets a new baseline on XDocParse, a benchmark covering 126 languages, outperforming the next-best competitor by 7.4 points.
论文介绍了dots.ocr,这是一个联合学习布局检测、文本识别和关系理解的单一体视语言模型,在端到端框架中实现,克服了分段管道的局限性。该模型利用可扩展的数据引擎合成多语料库,实现了在OmniDocBench上的最先进性能,并在涵盖126种语言的XDocParse基准测试中建立了新的基线,比第二名竞争对手高出7.4个百分点,证明了其无与伦比的多语言能力。
UniVCD: A New Method for Unsupervised Change Detection in the Open-Vocabulary Era
Authors: Ziqiang Zhu, Bowei Yang
First: 2025-12-15T08:42:23+00:00 · Latest: 2025-12-15T08:42:23+00:00
Comments: 10 pages, 6 figures
Abstract
Change detection (CD) identifies scene changes from multi-temporal observations and is widely used in urban development and environmental monitoring. Most existing CD methods rely on supervised learning, making performance strongly dataset-dependent and incurring high annotation costs; they typically focus on a few predefined categories and generalize poorly to diverse scenes. With the rise of vision foundation models such as SAM2 and CLIP, new opportunities have emerged to relax these constraints. We propose Unified Open-Vocabulary Change Detection (UniVCD), an unsupervised, open-vocabulary change detection method built on frozen SAM2 and CLIP. UniVCD detects category-agnostic changes across diverse scenes and imaging geometries without any labeled data or paired change images. A lightweight feature alignment module is introduced to bridge the spatially detailed representations from SAM2 and the semantic priors from CLIP, enabling high-resolution, semantically aware change estimation while keeping the number of trainable parameters small. On top of this, a streamlined post-processing pipeline is further introduced to suppress noise and pseudo-changes, improving the detection accuracy for objects with well-defined boundaries. Experiments on several public BCD (Binary Change Detection) and SCD (Semantic Change Detection) benchmarks show that UniVCD achieves consistently strong performance and matches or surpasses existing open-vocabulary CD methods in key metrics such as F1 and IoU. The results demonstrate that unsupervised change detection with frozen vision foundation models and lightweight multi-modal alignment is a practical and effective paradigm for open-vocabulary CD. Code and pretrained models will be released at https://github.com/Die-Xie/UniVCD.
中文标题/摘要
标题:UniVCD:开放词汇时代的无监督变化检测新方法
变化检测(CD)通过多时相观测识别场景变化,在城市开发和环境监测中广泛应用。现有大多数CD方法依赖于监督学习,导致性能高度依赖于数据集且注释成本高昂;它们通常专注于少数预定义类别,难以泛化到多样化的场景。随着SAM2和CLIP等视觉基础模型的兴起,出现了放松这些限制的新机会。我们提出了统一开放词汇变化检测(UniVCD),这是一种基于冻结的SAM2和CLIP构建的无监督、开放词汇变化检测方法。UniVCD能够在没有任何标注数据或配对变化图像的情况下,检测跨多样场景和成像几何的变化。引入了一个轻量级特征对齐模块,将SAM2的空间详细表示与CLIP的语义先验相结合,实现高分辨率、语义感知的变化估计,同时保持可训练参数数量较少。在此基础上,引入了一条简化的后处理流水线,以抑制噪声和伪变化,提高具有明确边界对象的检测准确性。在几个公开的二值变化检测(BCD)和语义变化检测(SCD)基准测试上进行的实验表明,UniVCD在F1和IoU等关键指标上表现出一致的强性能,并且在某些情况下超越了现有的开放词汇变化检测方法。结果表明,使用冻结的视觉基础模型和轻量级多模态对齐的无监督变化检测是一种实用且有效的开放词汇变化检测范式。代码和预训练模型将在https://github.com/Die-Xie/UniVCD上发布。
Summary / 总结
UniVCD is an unsupervised, open-vocabulary change detection method that leverages frozen SAM2 and CLIP to detect category-agnostic changes across various scenes and imaging geometries without labeled data or paired change images. It introduces a lightweight feature alignment module to combine spatially detailed representations from SAM2 with semantic priors from CLIP, enabling high-resolution, semantically aware change estimation, and adds a streamlined post-processing pipeline to suppress noise and pseudo-changes. Experiments on several public BCD and SCD benchmarks show that UniVCD matches or surpasses existing open-vocabulary change detection methods in key metrics such as F1 and IoU, demonstrating the practicality and effectiveness of this approach.
UniVCD 是一种无需标注数据的无监督变化检测方法,利用冻结的 SAM2 和 CLIP 来检测各种场景和成像几何下的类别无关变化。它引入了一个轻量级的特征对齐模块,将 SAM2 的空间详细表示和 CLIP 的语义先验结合起来,实现高分辨率、语义感知的变化估计。实验结果显示,UniVCD 在多个基准上的表现优于现有开放词汇变化检测方法,在 F1 和 IoU 等关键指标上表现出色,证明了该方法的实用性和有效性。
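The F1 and IoU metrics cited in the UniVCD entry can be computed from flattened binary change masks as follows. This is a generic sketch of the two metrics, not the authors' evaluation code; returning 0.0 when neither mask contains a positive pixel is a convention chosen here.

```python
def f1_and_iou(pred, truth):
    """Compute F1 and IoU for the 'changed' class.

    pred, truth: equal-length sequences of 0/1 labels (flattened binary masks).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    # Assumed convention: both metrics are 0.0 when there are no positives at all.
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return f1, iou
```

The two metrics are monotonically related for a single class (F1 = 2 IoU / (1 + IoU)), so they rank methods the same way on one mask but can diverge once averaged over images or classes.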
CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics
Authors: Dahyeon Kye, Jeahun Sung, Mingyu Jeon, Jihyong Oh
First: 2025-12-08T04:39:12+00:00 · Latest: 2025-12-15T08:33:55+00:00
Comments: Please visit our project page at https://cmlab-korea.github.io/CHIMERA/
Abstract
Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in a depth- and time-adaptive manner and yielding natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
中文标题/摘要
标题:CHIMERA:自适应缓存注入与语义锚点提示的零样本图像形态变换及其形态导向度量
扩散模型展示了卓越的生成能力,但在实现平滑且语义一致的图像形态变换方面仍面临挑战。现有方法往往由于缺乏自适应的结构和语义对齐而产生突兀的过渡或过度饱和的外观。我们提出CHIMERA,一种基于扩散的零样本框架,将形态变换形式化为缓存反演引导的去噪过程。为处理大规模的语义和外观差异,我们提出了自适应缓存注入和语义锚点提示。自适应缓存注入(ACI)在DDIM反演过程中缓存来自两个输入的下、中、上层特征,并在去噪过程中适配性地重新注入,从而在深度和时间自适应的方式下实现空间和语义对齐,并实现自然特征融合和平滑过渡。语义锚点提示(SAP)利用视觉-语言模型生成共享的锚点提示,作为语义锚点,连接不相似的输入,并引导去噪过程向一致的结果发展。最后,我们引入全局-局部一致性评分(GLCS),这是一种形态导向度量,同时评估两个输入的全局和谐性和局部形态变换的平滑度。广泛的实验和用户研究显示,CHIMERA实现了比现有方法更平滑且更语义对齐的过渡,建立了图像形态变换的新基准。代码和项目页面将公开发布。
Summary / 总结
CHIMERA is a zero-shot diffusion-based framework designed to achieve smooth and semantically consistent image morphing by formulating morphing as a cached inversion-guided denoising process. It introduces Adaptive Cache Injection (ACI) and Semantic Anchor Prompting (SAP) to handle large semantic and appearance disparities. ACI caches features from both inputs during DDIM inversion and re-injects them adaptively during denoising, while SAP uses a vision-language model to generate a shared anchor prompt. CHIMERA outperforms existing methods in achieving smoother and more semantically aligned transitions, setting a new state of the art in image morphing. The Global-Local Consistency Score (GLCS) is introduced as a morphing-oriented metric to evaluate the global harmonization of the two inputs and the local smoothness of the morphing transition.
CHIMERA 是一个零样本扩散模型框架,通过将图像变形视为缓存反演引导的去噪过程来实现平滑且语义一致的图像变形。它引入了自适应缓存注入(ACI)和语义锚点提示(SAP)来处理大语义和外观差异。ACI 适当地缓存和重新注入特征,而 SAP 使用视觉语言模型生成共享的锚点提示。实验结果表明,CHIMERA 在实现更平滑且语义对齐的过渡方面优于现有方法,建立了图像变形的新基准。
Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models
Authors: Zizhi Chen, Yizhen Gao, Minghao Han, Yizhou Liu, Zhaoyu Chen, Dingkang Yang, Lihua Zhang
First: 2025-12-15T08:09:40+00:00 · Latest: 2025-12-15T08:09:40+00:00
Abstract
Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.
中文标题/摘要
标题:锻造动态记忆:基于检索的持续学习框架促进通用医疗基础模型
多模态生物医学视觉-语言模型(VLMs)在持续学习(CL)领域展现出巨大的潜力。然而,它们面临一个核心难题:如何在不同模态之间显著的领域差距中保留精细的跨模态特征。为了解决这一挑战,我们提出了一种全面的框架。利用我们从PubMed科学论文中提取的1800万规模的多模态综合医学检索数据库,我们首次将检索增强生成(RAG)集成到持续学习中。具体而言,我们采用多模态、多层RAG系统,通过动态、按需的知识检索为模型微调提供实时指导。在此基础上,我们引入了一种动态知识蒸馏框架。该框架通过动态调节参数空间的重要性、蒸馏知识的粒度以及参考数据集的数据分布,与所需的细节水平相匹配,精确解决了上述核心难题。为了彻底验证我们策略的临床价值,我们设计了一个更严格的医学通用任务增量学习(MGTIL)基准。该基准旨在同时评估模型在面对显著领域变化时的适应能力、保留细微跨域特征的能力以及实时学习新型复杂医学任务的能力。广泛的实验结果表明,我们提出的方法在所有指标上均达到了最先进的(SOTA)性能。代码已提供在补充材料中。
Summary / 总结
The paper addresses the challenge of preserving fine-grained intra-modality features while bridging the domain gap in multimodal biomedical Vision-Language Models (VLMs) through Continual Learning (CL). It proposes a framework integrating Retrieval-Augmented Generation (RAG) for real-time knowledge retrieval and dynamic knowledge distillation to modulate the parameter space and distilled knowledge granularity. The method is validated using a rigorous Medical Generalist Task Incremental Learning (MGTIL) benchmark, showing state-of-the-art performance across all metrics.
论文解决了在多模态生物医学视觉-语言模型(VLMs)中持续学习(CL)时保留细粒度的跨模态特征和跨域鸿沟的挑战。它提出了一种结合检索增强生成(RAG)和动态知识蒸馏的框架,动态调节参数空间的重要性以及蒸馏知识的粒度。该方法使用严格的医学通用任务增量学习(MGTIL)基准进行验证,展示了在所有指标上的最佳性能。
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Authors: Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
First: 2025-12-15T07:11:56+00:00 · Latest: 2025-12-15T07:11:56+00:00
Abstract
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR, which matches the performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
中文标题/摘要
标题:GTR-Turbo:合并检查点秘密地成为自主VLM训练的免费教师
基于视觉语言模型(VLMs)构建的多模态代理的多轮强化学习(RL)受到稀疏奖励和长期信用分配的阻碍。最近的方法通过查询提供逐步反馈的教师来增加奖励密度,例如引导思想强化学习(GTR)和在线策略蒸馏,但依赖于昂贵且通常是有特权的模型作为教师,限制了其实用性和可再现性。我们引入了GTR-Turbo,这是一种GTR的高效升级版,无需训练或查询昂贵的教师模型即可达到相同性能。具体而言,GTR-Turbo将正在进行的RL训练过程中生成的检查点权重合并,并使用此合并模型作为“免费”的教师,通过监督微调或软logit蒸馏来指导后续的RL。此设计消除了对特权VLM(例如GPT或Gemini)的依赖,缓解了先前工作中观察到的“熵崩溃”现象,并保持了训练的稳定性。在多种视觉代理任务中,GTR-Turbo将基线模型的准确性提高了10-30%,同时将墙钟训练时间减少了50%,计算成本减少了60%相对于GTR。
Summary / 总结
GTR-Turbo addresses the challenges of multi-turn reinforcement learning for multi-modal agents by merging checkpoints from ongoing RL training to create a 'free' teacher model. This method eliminates the need for expensive, privileged models and improves the baseline model's accuracy by 10-30% while reducing training time and compute cost by 50% and 60%, respectively, compared to GTR.
GTR-Turbo通过合并正在进行的RL训练产生的检查点来创建一个‘免费’教师模型,以解决多轮强化学习中多模态代理面临的稀疏奖励和长期信用分配问题。这种方法消除了对昂贵的特权教师模型的依赖,相比GTR,将基线模型的准确性提高了10-30%,并将训练时间和计算成本分别减少了50%和60%。
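The checkpoint-merging step described in the GTR-Turbo entry can be pictured as a weighted average of parameters across saved checkpoints. The abstract does not state the merge rule, so the uniform default, the name `merge_checkpoints`, and the dict-of-lists parameter layout below are assumptions made for illustration.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Average parameters across checkpoints.

    checkpoints: list of dicts mapping a parameter name to a flat list of floats.
    weights: optional per-checkpoint weights; uniform averaging if omitted (assumed).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        # Weighted mean of each scalar parameter across all checkpoints.
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(size)
        ]
    return merged
```

In the workflow the abstract outlines, a merged model of this kind would then serve as the teacher for supervised fine-tuning or soft logit distillation in later RL steps; that part is not shown here.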
DualMap: Online Open-Vocabulary Semantic Mapping for Natural Language Navigation in Dynamic Changing Scenes
Authors: Jiajun Jiang, Yiming Zhu, Zirui Wu, Jie Song
Venue: IEEE Robotics and Automation Letters, Vol. 10, No. 12, pp. 12612-12619, 2025
First: 2025-06-02T17:59:10+00:00 · Latest: 2025-12-15T06:43:59+00:00
Comments: 14 pages, 14 figures. Published in IEEE Robotics and Automation Letters (RA-L), 2025. Code: https://github.com/Eku127/DualMap Project page: https://eku127.github.io/DualMap/
Abstract
We introduce DualMap, an online open-vocabulary mapping system that enables robots to understand and navigate dynamically changing environments through natural language queries. Designed for efficient semantic mapping and adaptability to changing environments, DualMap meets the essential requirements for real-world robot navigation applications. Our proposed hybrid segmentation frontend and object-level status check eliminate the costly 3D object merging required by prior methods, enabling efficient online scene mapping. The dual-map representation combines a global abstract map for high-level candidate selection with a local concrete map for precise goal-reaching, effectively managing and updating dynamic changes in the environment. Through extensive experiments in both simulation and real-world scenarios, we demonstrate state-of-the-art performance in 3D open-vocabulary segmentation, efficient scene mapping, and online language-guided navigation. Project page: https://eku127.github.io/DualMap/
中文标题/摘要
标题:DualMap:动态变化场景中基于自然语言导航的在线开放词汇语义映射
我们介绍了DualMap,一种在线开放词汇映射系统,使机器人能够通过自然语言查询理解并导航动态变化的环境。DualMap 旨在高效地进行语义映射并适应变化的环境,满足现实世界机器人导航应用的基本要求。我们提出的混合分段前端和对象级状态检查消除了先前方法所需的昂贵3D对象合并,从而实现高效的在线场景映射。双映射表示结合了一个全局抽象地图用于高层候选选择和一个局部具体地图用于精确目标到达,有效地管理和更新环境中的动态变化。通过在仿真和真实场景中的广泛实验,我们展示了在3D开放词汇分割、高效场景映射和在线语言引导导航方面的最新性能。项目页面:https://eku127.github.io/DualMap/
Summary / 总结
DualMap is an online open-vocabulary semantic mapping system that allows robots to navigate dynamically changing environments through natural language queries. It uses a hybrid segmentation frontend and object-level status check to avoid the need for costly 3D object merging, enabling efficient online scene mapping. The system combines a global abstract map and a local concrete map to manage and update dynamic changes, demonstrating state-of-the-art performance in 3D segmentation, efficient scene mapping, and online language-guided navigation in both simulation and real-world scenarios.
DualMap 是一种在线开放词汇映射系统,使机器人能够通过自然语言查询来导航动态变化的环境。它使用混合分割前端和对象级状态检查来高效地映射场景,无需进行 3D 对象合并,并采用双重地图表示法进行高精度导航。广泛的实验表明,DualMap 在 3D 分割、场景映射和语言引导导航方面优于先前的方法。
TWLR: Text-Guided Weakly-Supervised Lesion Localization and Severity Regression for Explainable Diabetic Retinopathy Grading
Authors: Xi Luo, Shixin Xu, Ying Xie, JianZhong Hu, Yuwei He, Yuhui Deng, Huaxiong Huang
First: 2025-12-15T06:08:16+00:00 · Latest: 2025-12-15T06:08:16+00:00
Abstract
Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high-quality expert annotations. Obtaining pixel-level labels for medical images, particularly fundus images, remains costly and time-consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two-stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates domain-specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly-supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: it localizes lesions accurately without pixel-level supervision, and it provides an interpretable visualization of disease-to-healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation-efficient solution for automated retinal image analysis.
中文标题/摘要
标题:TWLR:基于文本引导的弱监督病变定位与严重程度回归方法及其在可解释性糖尿病视网膜病变分级中的应用
准确的医学图像分析可以大大辅助临床诊断,但其效果依赖于高质量的专家注释。获取医学图像的像素级标签,尤其是眼底图像的标签,仍然成本高昂且耗时。同时,尽管深度学习在医学成像领域取得了成功,但缺乏可解释性限制了其在临床中的应用。为了解决这些挑战,我们提出了一种名为TWLR的两阶段框架,用于可解释的糖尿病视网膜病变(DR)评估。在第一阶段,视觉-语言模型将领域特定的眼科知识整合到文本嵌入中,联合执行DR分级和病变分类,有效地将语义医学概念与视觉特征联系起来。第二阶段引入了一种基于弱监督语义分割的迭代严重程度回归框架。通过迭代细化生成的病变显著图引导一种渐进的修复机制,系统地消除病理特征,有效降低疾病严重程度,使其向更健康的眼底外观转变。关键的是,这种严重程度回归方法实现了双重好处:在无需像素级监督的情况下实现准确的病变定位,并提供疾病到健康转变的可解释可视化。在FGADR、DDR和一个私人数据集上的实验结果表明,TWLR在DR分类和病变分割方面均取得了竞争力的表现,提供了一种更可解释和注释高效的自动视网膜图像分析解决方案。
Summary / 总结
TWLR is a two-stage framework for interpretable and annotation-efficient diabetic retinopathy (DR) assessment. In the first stage, a vision-language model integrates ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, linking semantic medical concepts with visual features. The second stage is an iterative severity regression framework based on weakly-supervised semantic segmentation: iteratively refined lesion saliency maps guide a progressive inpainting mechanism that removes pathological features and moves the image toward a healthier fundus appearance. This yields lesion localization without pixel-level supervision and an interpretable visualization of the disease-to-healthy transformation. Experiments on FGADR, DDR, and a private dataset show that TWLR performs competitively in DR classification and lesion segmentation.
TWLR 是一个两阶段框架,旨在通过将眼科知识与深度学习结合来提高糖尿病视网膜病变(DR)的分级。第一阶段使用视觉-语言模型进行 DR 分级和病灶分类,将医学概念与视觉特征联系起来。第二阶段采用迭代的病灶严重性回归框架进行弱监督语义分割,通过逐步细化病灶显著图来系统地减少疾病严重性。TWLR 实现了准确的病灶定位,并提供了疾病到健康转变的可解释可视化,多个数据集上的实验结果表明其性能具有竞争力。
SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions
Authors: Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap
Venue: NeurIPS 2025
First: 2025-06-29T00:54:13+00:00 · Latest: 2025-12-15T04:45:27+00:00
Comments: 24 pages, 6 figures
Abstract
Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.
中文标题/摘要
标题:SoMi-ToM:评估具身社会互动中的多视角理论思维
人类在动态的现实世界社会互动中不断推断他人的状态、目标和行为。然而,大多数理论思维(ToM)基准仅评估静态的文本场景,与实际互动存在显著差距。我们提出了SoMi-ToM基准,旨在评估多视角理论思维在具身多智能体复杂社会互动中的表现。该基准基于由互动环境SoMi生成的丰富多模态互动数据,涵盖了多样的制作目标和社会关系。我们的框架支持多层次评估:(1)第一人称评估提供任务期间从第一人称视角的多模态(视觉、对话、动作等)输入,用于实时状态推断;(2)第三人称评估提供任务结束后完整的第三人称视角视频和文本记录,用于目标和行为推断。这种评估方法允许从主观即时体验和客观全局观察两个方面对模型的ToM能力进行更全面的考察。我们构建了一个具有挑战性的数据集,包含35个第三人称视角视频、363个第一人称视角图像和1225个专家标注的多项选择题(三个选项)。在该数据集上,我们系统地评估了人类被试和几种最先进的大型视觉-语言模型(LVLMs)的表现。结果显示,LVLMs在SoMi-ToM上的表现显著低于人类:在第一人称评估中,人类和模型的平均准确率差距为40.1%,在第三人称评估中为26.4%。这表明未来LVLMs需要进一步提高其在具身、复杂社会互动中的理论思维能力。
Summary / 总结
The research aims to evaluate multi-perspective Theory of Mind (ToM) in embodied social interactions by proposing the SoMi-ToM benchmark, which uses rich multimodal interaction data from the SoMi interaction environment. The benchmark supports both first-person and third-person evaluations, providing a more comprehensive examination of ToM capabilities. Experimental results show that state-of-the-art large vision-language models perform significantly worse than humans, with an average accuracy gap of 40.1% in first-person evaluation and 26.4% in third-person evaluation, highlighting the need for improved ToM capabilities in embodied social interactions.
论文提出了SoMi-ToM基准,以评估多视角的理论思维(ToM)在实体社会互动中的表现,弥补了静态文本场景与真实互动之间的差距。该基准利用来自互动环境SoMi的丰富多模态数据,支持第一人称和第三人称评估。实验结果显示,最先进的大型视觉-语言模型的表现明显不如人类,第一人称和第三人称评估的准确率差距分别为40.1%和26.4%,这表明未来需要进一步提高这些模型在实体复杂社会互动中的理论思维能力。
Content Adaptive based Motion Alignment Framework for Learned Video Compression
Authors: Tiange Zhang, Xiandong Meng, Siwei Ma
First: 2025-12-15T02:51:47+00:00 · Latest: 2025-12-15T02:51:47+00:00
Comments: Accepted to Data Compression Conference (DCC) 2026 as a poster paper
Abstract
Recent advances in end-to-end video compression have shown promising results owing to their unified end-to-end learning optimization. However, such generalized frameworks often lack content-specific adaptation, leading to suboptimal compression performance. To address this, this paper proposes a content adaptive based motion alignment framework that improves performance by adapting encoding strategies to diverse content characteristics. Specifically, we first introduce a two-stage flow-guided deformable warping mechanism that refines motion compensation with coarse-to-fine offset prediction and mask modulation, enabling precise feature alignment. Second, we propose a multi-reference quality aware strategy that adjusts distortion weights based on reference quality, and applies it to hierarchical training to reduce error propagation. Third, we integrate a training-free module that downsamples frames by motion magnitude and resolution to obtain smooth motion estimation. Experimental results on standard test datasets demonstrate that our framework CAMA achieves significant improvements over state-of-the-art Neural Video Compression models, achieving a 24.95% BD-rate (PSNR) savings over our baseline model DCVC-TCM, while also outperforming reproduced DCVC-DC and traditional codec HM-16.25.
中文标题/摘要
标题:基于内容自适应的运动对齐框架用于学习视频压缩
端到端视频压缩的最新进展因其统一的端到端学习优化而显示出有希望的结果。然而,这些通用框架往往缺乏内容特定的适应性,导致压缩性能不佳。为了解决这个问题,本文提出了一种基于内容自适应的运动对齐框架,通过根据多样化的内容特性调整编码策略来提高性能。具体而言,我们首先引入了一种两阶段的流导向可变形扭曲机制,通过粗到细的偏移预测和掩码调制来细化运动补偿,从而实现精确的特征对齐。其次,我们提出了一种多参考质量感知策略,根据参考质量调整失真权重,并将其应用于分层训练以减少误差传播。第三,我们整合了一个无需训练的模块,通过运动幅度和分辨率对帧进行下采样,以获得平滑的运动估计。在标准测试数据集上的实验结果表明,我们的框架CAMA在与我们的基线模型DCVC-TCM相比时,实现了24.95%的BD率(PSNR)节省,同时在重新实现的DCVC-DC和传统编解码器HM-16.25上也表现出色。
Summary / 总结
This paper proposes a content adaptive based motion alignment framework (CAMA) to enhance the performance of learned video compression. The method introduces a two-stage flow-guided deformable warping mechanism and a multi-reference quality-aware strategy, and integrates a training-free module for motion estimation. Experimental results show that CAMA outperforms state-of-the-art models, achieving a 24.95% BD-rate (PSNR) savings compared to the baseline DCVC-TCM and outperforming both reproduced DCVC-DC and traditional codec HM-16.25.
本文提出了一种内容自适应的运动对齐框架(CAMA),以提高学习视频压缩的性能。该方法引入了两阶段流导向可变形变形机制和多参考质量感知策略,并集成了一个无训练模块进行运动估计。实验结果表明,CAMA在与基线DCVC-TCM相比时,实现了24.95%的BD率(PSNR)节省,并且优于重新生成的DCVC-DC和传统编解码器HM-16.25。
ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
Authors: Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Hengyu Liu, Tingting Shen, Yadong MU
First: 2025-11-01T11:29:14+00:00 · Latest: 2025-12-15T02:43:16+00:00
Comments: Project page: https://angericky.github.io/ID-Crafter, Code: https://github.com/paulpanwang/IDCrafter
Abstract
Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality. Project page: https://angericky.github.io/ID-Crafter
中文标题/摘要
标题:ID-Crafter:基于VLM的在线强化学习多主体视频生成
在高保真视频合成方面取得了显著进展,但当前范式往往难以有效整合多主体的身份信息,导致语义冲突和身份及互动的次优表现,限制了可控性和应用范围。为解决这一问题,我们提出了ID-Crafter,一种实现多主体视频生成的框架,能够实现卓越的身份保留和语义一致性。ID-Crafter 结合了三个关键组件:(i) 一种分层的身份保留注意力机制,逐步在主体内、主体间和跨模态层面聚合特征;(ii) 由预训练的视觉-语言模型(VLM)驱动的语义理解模块,提供精细的指导并捕捉复杂的主体间关系;(iii) 一个在线强化学习阶段,进一步细化模型以优化关键概念。此外,我们构建了一个新的数据集以促进稳健的训练和评估。广泛的实验表明,ID-Crafter 在多主体视频生成基准测试中建立了新的最佳性能,特别是在身份保留、时间一致性和整体视频质量方面表现出色。
Summary / 总结
ID-Crafter is a framework for multi-subject video generation that addresses the challenge of integrating identity information from multiple subjects. It uses a hierarchical identity-preserving attention mechanism, a semantic understanding module powered by a pretrained Vision-Language Model, and an online reinforcement learning phase. Experiments show that ID-Crafter outperforms existing methods in identity preservation, temporal consistency, and overall video quality on multi-subject video generation benchmarks.
ID-Crafter 是一个多主体视频生成框架,结合了层次化的身份保留注意力机制、使用预训练视觉语言模型的语义理解模块以及在线强化学习阶段。这种方法增强了身份保留和语义一致性,使其在身份保留、时间一致性和整体视频质量等方面在多主体视频生成基准测试中优于现有方法。
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen
First: 2025-12-02T12:30:05+00:00 · Latest: 2025-12-15T02:13:12+00:00
Abstract
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup. The code is available at https://github.com/Casey-bit/VLMPruner.
中文标题/摘要
标题:VLM-Pruner:高效VLM中基于离心式标记剪枝的空间稀疏性缓冲
视觉语言模型(VLMs)在图像理解任务中表现出色,但大量的视觉标记导致了显著的计算成本,阻碍了其在移动设备上的部署。许多剪枝方法仅依赖于标记的重要性,从而忽视了标记间的冗余性,保留了大量重复的标记,浪费了容量。尽管已经提出了一些具有冗余意识的方法,但它们往往忽略了视觉标记之间的空间关系。这可能导致保留标记的选择过于稀疏,无法充分覆盖目标对象的区域。为了解决这些限制,我们提出了VLM-Pruner,这是一种无需训练的标记剪枝算法,明确平衡冗余性和空间稀疏性。我们引入了一种离心式标记剪枝范式,能够在优先保留细粒度对象细节的同时实现近到远的选择。此外,我们设计了一种空间稀疏性缓冲(BSS)准则,推迟选择空间上距离较远的标记。我们还采用了一种并行贪婪策略来高效地进行标记选择。为了减轻剪枝带来的信息损失,我们有选择地将被丢弃标记中的重要信息融合到保留的标记中。全面的比较表明,VLM-Pruner在五个VLM中以88.9%的剪枝率持续优于强大的基线模型,同时实现了端到端的推理加速。代码可在https://github.com/Casey-bit/VLMPruner获取。
Summary / 总结
VLM-Pruner is a training-free token pruning algorithm that reduces the computational cost of vision-language models by explicitly balancing redundancy and spatial sparsity. It introduces a centrifugal token pruning paradigm for near-to-far selection that preserves fine-grained object details, a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens, a parallel greedy selection strategy, and a fusion step that merges salient information from discarded tokens into retained ones. Across five VLMs at an 88.9% pruning rate, VLM-Pruner consistently outperforms strong baselines while delivering an end-to-end inference speedup.
VLM-Pruner 是一种无需训练的 token 剪枝算法,旨在通过平衡冗余和空间稀疏性来解决视觉-语言模型的计算挑战。它引入了离心 token 剪枝范式和空间稀疏性缓冲标准,以高效地选择 token 同时保留细粒度的物体细节。实验结果表明,VLM-Pruner 在 88.9% 的剪枝率下优于强基线,并提供了端到端的推理加速。
SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition
Authors: Minghao Zhu, Zhihao Zhang, Anmol Sidhu, Keith Redmill
First: 2025-12-14T23:56:34+00:00 · Latest: 2025-12-14T23:56:34+00:00
Comments: Submitted to IV 2026
Abstract
Automated road sign recognition is a critical task for intelligent transportation systems, but traditional deep learning methods struggle with the sheer number of sign classes and the impracticality of creating exhaustive labeled datasets. This paper introduces a novel zero-shot recognition framework that adapts the Retrieval-Augmented Generation (RAG) paradigm to address this challenge. Our method first uses a Vision Language Model (VLM) to generate a textual description of a sign from an input image. This description is used to retrieve a small set of the most relevant sign candidates from a vector database of reference designs. Subsequently, a Large Language Model (LLM) reasons over the retrieved candidates to make a final, fine-grained recognition. We validate this approach on a comprehensive set of 303 regulatory signs from the Ohio MUTCD. Experimental results demonstrate the framework's effectiveness, achieving 95.58% accuracy on ideal reference images and 82.45% on challenging real-world road data. This work demonstrates the viability of RAG-based architectures for creating scalable and accurate systems for road sign recognition without task-specific training.
中文标题/摘要
标题:SignRAG:一种用于可扩展零样本道路标志识别的检索增强系统
自动道路标志识别是智能交通系统中的关键任务,但传统的深度学习方法难以应对标志类别的庞大数量以及创建详尽标注数据集的不切实际性。本文提出了一种新颖的零样本识别框架,将检索增强生成(RAG)范式适应于解决这一挑战。我们的方法首先使用视觉语言模型(VLM)从输入图像生成标志的文本描述。该描述用于从参考设计向量数据库中检索最相关的标志候选集。随后,大型语言模型(LLM)对检索到的候选集进行推理以做出最终的细粒度识别。我们在来自俄亥俄州 MUTCD 的全面的 303 个监管标志数据集上验证了该方法。实验结果表明该框架的有效性,在理想参考图像上准确率为 95.58%,在具有挑战性的实际道路数据上准确率为 82.45%。这项工作证明了基于 RAG 的架构在无需特定任务训练的情况下创建可扩展且准确的道路标志识别系统的可行性。
Summary / 总结
The paper addresses the challenge of recognizing a large number of road sign classes in intelligent transportation systems by proposing a zero-shot recognition framework using a Retrieval-Augmented Generation (RAG) paradigm. It leverages a Vision Language Model (VLM) to generate a textual description of the sign from an image, retrieves relevant sign candidates from a vector database, and uses a Large Language Model (LLM) to make a final recognition. The framework achieves 95.58% accuracy on ideal images and 82.45% on real-world data, demonstrating its effectiveness and scalability.
本文提出了SignRAG,这是一种利用检索增强生成(RAG)范式的零样本道路标志识别系统。该系统使用视觉语言模型(VLM)从图像中生成标志的文本描述,然后从矢量数据库中检索相关标志候选。大型语言模型(LLM)随后对这些候选进行推理以做出最终识别。该系统在303个监管标志上进行了测试,理想图像上的准确率为95.58%,真实世界数据上的准确率为82.45%,展示了其在无需特定任务训练的情况下实现可扩展和准确的道路标志识别的有效性。
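The retrieval step described in the SignRAG abstract (a text description is matched against a vector database of reference designs to obtain a small candidate set) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the toy three-dimensional embeddings, the sign identifiers, the use of cosine similarity, and the function names `cosine` and `retrieve_top_k` are hypothetical and are not taken from the paper, which does not specify its embedding model or similarity measure in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query_vec, reference_db, k=3):
    """Return the ids of the k reference entries most similar to the query embedding."""
    scored = [(cosine(query_vec, vec), sign_id) for sign_id, vec in reference_db.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sign_id for _, sign_id in scored[:k]]

# Hypothetical reference database: sign id -> embedding of its reference description.
reference_db = {
    "STOP": [1.0, 0.0, 0.0],
    "YIELD": [0.8, 0.6, 0.0],
    "SPEED_LIMIT": [0.0, 0.0, 1.0],
}

# Hypothetical embedding of the VLM-generated description of the input image.
query = [0.9, 0.1, 0.0]

# The short candidate list that a downstream LLM would then reason over.
candidates = retrieve_top_k(query, reference_db, k=2)
print(candidates)
```

In a pipeline of this shape, the candidate list keeps the final reasoning step small: the LLM chooses among a handful of retrieved signs rather than among all 303 classes.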
Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners
Authors: N. K. B. M. P. K. B. Narasinghe, Uthayasanker Thayasivam
First: 2025-12-14T20:13:21+00:00 · Latest: 2025-12-14T20:13:21+00:00
Comments: 9 pages, 3 figures. Accepted to VISAPP 2026
Abstract
Large-scale multimodal foundation models, particularly Contrastive Captioners (CoCa), have achieved state-of-the-art results by unifying contrastive alignment with generative captioning. While zero-shot transfer capabilities are well-documented, the adaptation of these generative-contrastive hybrids to downstream tasks with extreme data scarcity (few-shot learning) remains under-explored. Existing literature predominantly focuses on dual-encoder architectures like CLIP, leaving a gap in understanding how CoCa's distinct latent space responds to parameter-efficient fine-tuning (PEFT). This paper presents a comprehensive empirical study on adapting the CoCa visual backbone for few-shot image classification. We systematically evaluate a hierarchy of strategies, ranging from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA). First, we identify an "augmentation divergence": while strong data augmentation degrades the performance of linear probing in low-shot settings, it is essential for stabilizing LoRA fine-tuning. We also demonstrate that hybrid objectives incorporating Supervised Contrastive (SupCon) loss yield consistent performance improvements over standard Cross-Entropy across varying shot counts. Crucially, we characterize the sensitivity of training configurations to data scarcity, providing empirical reference settings for scaling regularization, rank, and sampling strategies to facilitate the efficient adaptation of generative-contrastive foundation models.
中文标题/摘要
标题:适应少样本学习的多模态基础模型:对比式图释器的全面研究
大规模多模态基础模型,尤其是对比式图释器(CoCa),通过统一对比对齐与生成图释,实现了最先进的结果。虽然零样本迁移能力已得到充分记录,但这些生成-对比混合模型在极端数据稀缺(少样本学习)的下游任务中的适应性仍然未被充分探索。现有文献主要集中在双编码器架构如CLIP上,留下了对CoCa独特潜空间如何响应参数高效微调(PEFT)的理解空白。本文对适应CoCa视觉主干进行少样本图像分类的全面经验研究进行了介绍。我们系统地评估了一系列策略,从无需训练的混合原型到通过低秩适应(LoRA)进行深度参数适应。首先,我们发现了一种“增强偏差”:虽然强大的数据增强在少样本设置中会降低线性探针的性能,但它是稳定LoRA微调的关键。我们还证明,结合监督对比损失(SupCon)的混合目标在不同样本数量下比标准交叉熵提供了更一致的性能提升。至关重要的是,我们描述了训练配置对数据稀缺性的敏感性,提供了正则化、秩和采样策略的实证参考设置,以促进生成-对比基础模型的有效适应。
Summary / 总结
This paper studies how to adapt the Contrastive Captioner (CoCa) visual backbone to few-shot image classification, with a focus on parameter-efficient fine-tuning. It evaluates a hierarchy of strategies, from training-free hybrid prototyping to deep parameter adaptation via Low-Rank Adaptation (LoRA), and identifies an "augmentation divergence": strong data augmentation degrades linear probing in low-shot settings but is essential for stabilizing LoRA fine-tuning. Hybrid objectives that include a Supervised Contrastive (SupCon) loss consistently improve over standard cross-entropy across shot counts, and the study provides empirical reference settings for scaling regularization, rank, and sampling strategies under data scarcity.
本文研究了如何将对比生成式模型(CoCa)适应于少样本学习,重点探讨了参数高效微调方法。评估了从无训练混合原型到通过低秩适应(LoRA)进行深度参数微调的各种策略,并发现强数据增强对于稳定LoRA微调至关重要。研究还表明,结合监督对比损失(SupCon)的目标函数在不同样本量下能持续提升性能。关键发现包括训练配置对数据稀缺性的敏感性,这有助于调整正则化、秩和采样策略以高效适应生成-对比基础模型。
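For readers unfamiliar with Low-Rank Adaptation, the sketch below shows the standard LoRA forward pass: a frozen weight matrix W plus a scaled low-rank update (alpha / r) * B * A, with B initialised to zero so that the adapted layer starts out identical to the base layer. This is a generic illustration of LoRA, not the configuration used in the study; the matrix sizes, the values of `alpha` and `r`, and the helper names `matvec` and `lora_forward` are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x).

    W is the frozen (d_out x d_in) weight, A is (r x d_in), B is (d_out x r).
    Only A and B would be trained; W stays fixed.
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

# Toy example with d_in = d_out = 2 and rank r = 1 (illustrative values only).
W = [[1, 0], [0, 1]]
A = [[1, 1]]
x = [2, 3]

# With B initialised to zero, the adapted layer reproduces the frozen layer.
B_init = [[0], [0]]
out_init = lora_forward(W, A, B_init, x, alpha=2, r=1)

# After B has moved away from zero, the low-rank update changes the output.
B_trained = [[1], [0]]
out_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(out_init, out_trained)
```

The rank r controls how many parameters are trained, which is why the abstract lists rank alongside regularization and sampling as a setting to scale with the amount of available data.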
SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling
Authors: Fanjiang Ye, Zepeng Zhao, Yi Mu, Jucheng Shen, Renjie Li, Kaijian Wang, Saurabh Agarwal, Myungjin Lee, Triston Cao, Aditya Akella, Arvind Krishnamurthy, T. S. Eugene Ng, Zhengzhong Tu, Yuke Wang
First: 2025-08-25T07:49:17+00:00 · Latest: 2025-12-14T17:45:53+00:00
Abstract
Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains. However, generating ultra-high-resolution videos on existing standard-resolution (e.g., 720p) platforms remains challenging due to the excessive re-training requirements and prohibitively high computational and memory costs. To this end, we introduce SUPERGEN, an efficient tile-based framework for ultra-high-resolution video generation. SUPERGEN features a novel training-free algorithmic innovation with tiling to successfully support a wide range of resolutions without additional training efforts while significantly reducing both memory footprint and computational complexity. Moreover, SUPERGEN incorporates a tile-tailored, adaptive, region-aware caching strategy that accelerates video generation by exploiting redundancy across denoising steps and spatial regions. SUPERGEN also integrates cache-guided, communication-minimized tile parallelism for enhanced throughput and minimized latency. Evaluations show that SUPERGEN maximizes performance gains while achieving high output quality across various benchmarks.
中文标题/摘要
标题:SuperGen:一种高效的超高清视频生成系统,结合素描和拼块技术
扩散模型在生成任务(例如图像和视频生成)中最近取得了显著的成功,对高质量内容(例如2K/4K视频)的需求在各个领域迅速增加。然而,由于过度的重新训练要求和高昂的计算和内存成本,现有标准分辨率(例如720p)平台生成超高清视频仍然具有挑战性。为此,我们引入了SUPERGEN,这是一种高效的基于拼块的超高清视频生成框架。SUPERGEN 特设了一种新颖的无需训练的算法创新,通过拼块技术成功支持广泛的分辨率范围,同时显著减少内存占用和计算复杂度。此外,SUPERGEN 结合了一种拼块定制的、自适应的、区域感知的缓存策略,通过利用去噪步骤和空间区域之间的冗余性来加速视频生成。SUPERGEN 还集成了基于缓存、通信最小化的拼块并行性,以提高吞吐量并最小化延迟。评估表明,SUPERGEN 在各种基准测试中实现了高性能增益,同时保持了高质量的输出。
Summary / 总结
SUPERGEN is an efficient framework for ultra-high-resolution video generation that addresses the challenges of excessive re-training and high computational costs. It uses a tiling approach to support various resolutions without additional training, reducing memory and computational complexity. SUPERGEN also includes an adaptive caching strategy and cache-guided parallelism to enhance performance. Experimental results demonstrate that SUPERGEN achieves high output quality across different benchmarks while maximizing performance gains.
SUPERGEN 是一种高效的基于瓦片的超高清视频生成框架,解决了过度重新训练和高计算成本的问题。它使用无训练的瓦片算法支持多种分辨率,同时减少内存占用和计算复杂度。SUPERGEN 还包括自适应缓存策略和缓存引导的并行计算,以加速视频生成并减少延迟,从而在各种基准测试中实现高质量的输出。
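The tile-based idea in the SuperGen abstract (processing a high-resolution frame as a set of smaller tiles) can be sketched as a function that computes tile boxes covering a frame. This is only an illustration of generic tiling under stated assumptions: the overlap between neighbouring tiles, the square tile shape, and the function names `tile_starts` and `tile_boxes` are hypothetical, and the abstract does not describe how SUPERGEN actually lays out or blends its tiles.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles of size `tile` covering [0, length) along one axis.

    Neighbouring tiles share `overlap` pixels; the last tile is placed flush
    with the far edge so that the whole axis is covered.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if tile >= length:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts

def tile_boxes(height, width, tile, overlap):
    """Return (top, left, bottom, right) boxes that together cover the frame."""
    boxes = []
    for top in tile_starts(height, tile, overlap):
        for left in tile_starts(width, tile, overlap):
            boxes.append((top, left, min(top + tile, height), min(left + tile, width)))
    return boxes

# Toy frame of 4 x 6 pixels, 4-pixel tiles, 2 pixels of overlap (illustrative values).
boxes = tile_boxes(4, 6, tile=4, overlap=2)
print(boxes)
```

Each box can then be processed independently at a resolution the underlying model already supports, which is the property that lets a tiled scheme avoid re-training for larger outputs.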