arXiv 论文速递

2025-11-25 03:27
Snapshot: 20251125_0327
ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation
Authors: Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Hengyu Liu, Tingting Shen, Yadong MU
First: 2025-11-01T11:29:14+00:00 · Latest: 2025-11-21T18:35:34+00:00
Abstract
Significant progress has been achieved in high-fidelity video synthesis, yet current paradigms often fall short in effectively integrating identity information from multiple subjects. This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. ID-Crafter integrates three key components: (i) a hierarchical identity-preserving attention mechanism that progressively aggregates features at intra-subject, inter-subject, and cross-modal levels; (ii) a semantic understanding module powered by a pretrained Vision-Language Model (VLM) to provide fine-grained guidance and capture complex inter-subject relationships; and (iii) an online reinforcement learning phase to further refine the model for critical concepts. Furthermore, we construct a new dataset to facilitate robust training and evaluation. Extensive experiments demonstrate that ID-Crafter establishes new state-of-the-art performance on multi-subject video generation benchmarks, excelling in identity preservation, temporal consistency, and overall video quality.
中文标题/摘要
标题:ID-Crafter:基于VLM的在线RL多主体视频生成
在高保真视频合成方面取得了显著进展,但当前范式往往难以有效整合多主体的身份信息,导致语义冲突和身份及互动的次优表现,限制了可控性和应用性。为解决这一问题,我们提出了ID-Crafter,一种实现优越身份保留和语义一致性的多主体视频生成框架。ID-Crafter 结合了三个关键组件:(i) 一种分层的身份保留注意力机制,逐步在主体内、主体间和跨模态层面聚合特征;(ii) 由预训练的视觉-语言模型(VLM)驱动的语义理解模块,提供精细指导并捕捉复杂的主体间关系;(iii) 一个在线强化学习阶段,进一步细化模型以处理关键概念。此外,我们构建了一个新的数据集以促进稳健的训练和评估。大量实验表明,ID-Crafter 在多主体视频生成基准测试中建立了新的最先进性能,特别是在身份保留、时间一致性和整体视频质量方面表现出色。
Summary / 总结
ID-Crafter is a framework for multi-subject video generation that integrates a hierarchical identity-preserving attention mechanism, a semantic understanding module using a pretrained Vision-Language Model, and an online reinforcement learning phase. This approach improves identity preservation and semantic coherence, leading to superior performance on multi-subject video generation benchmarks compared to existing methods. The framework also includes a new dataset for robust training and evaluation, demonstrating its effectiveness in identity preservation, temporal consistency, and overall video quality.
ID-Crafter 是一个多主体视频生成框架,结合了层次化的身份保留注意力机制、使用预训练视觉-语言模型的语义理解模块以及在线强化学习阶段。这种方法显著提高了身份保留和语义一致性,在身份保留、时间一致性和整体视频质量的基准测试中优于现有方法。
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
Authors: Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, Mohit Bansal
First: 2025-11-21T17:48:02+00:00 · Latest: 2025-11-21T17:48:02+00:00
Comments: website: https://sketchverify.github.io/
Abstract
Recent video generation approaches increasingly rely on planning intermediate control signals such as object trajectories to improve temporal coherence and motion fidelity. However, these methods mostly employ single-shot plans that are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
中文标题/摘要
标题:基于草图引导验证的物理感知视频生成规划
近期的视频生成方法越来越多地依赖于规划中间控制信号(如物体轨迹)以提高时间连贯性和运动保真度。然而,这些方法大多采用单次规划方案,通常仅限于简单的运动,或者需要多次调用视频生成器进行迭代细化,从而导致高计算成本。为克服这些限制,我们提出了一种无需训练的、基于草图验证的规划框架SketchVerify,该框架通过引入测试时采样和验证循环,在进行完整视频生成之前,以更动态一致的轨迹(即物理上合理且指令一致的运动)来提高运动规划质量。给定提示和参考图像,我们的方法预测多个候选运动计划,并使用结合评估语义与指令一致性和物理合理性的视觉语言验证器对其进行排名。为了高效地评分候选运动计划,我们通过将对象合成到静态背景上来渲染每个轨迹,生成轻量级视频草图,从而绕过了昂贵的重复扩散合成过程,同时保持了相当的性能。我们不断细化运动计划,直到找到一个满意的结果,然后将其传递给轨迹条件生成器进行最终合成。在WorldModelBench和PhyWorldBench上的实验表明,与竞争性基线相比,我们的方法在运动质量、物理真实性和长期一致性方面显著提高,且效率更高。我们的消融研究进一步表明,增加轨迹候选的数量可以一致地提高整体性能。
Summary / 总结
The research aims to improve the quality of motion planning in video generation by introducing SketchVerify, a training-free framework that uses a sketch-verification loop to predict and refine multiple candidate motion plans. The method ranks these plans using a vision-language verifier that evaluates both semantic alignment and physical plausibility. Experiments show that SketchVerify enhances motion quality, physical realism, and long-term consistency compared to existing methods, while being more efficient. An ablation study confirms that increasing the number of trajectory candidates improves performance.
研究旨在通过引入无训练框架SketchVerify来提高视频生成中的运动规划质量。该方法使用草图验证过程来预测和优化多个运动计划,确保生成的运动轨迹既符合物理规律又与指令一致,然后再进行完整的视频合成。实验表明,SketchVerify在运动质量、物理真实性和长期一致性方面优于现有方法,同时更为高效。该方法使用视觉语言验证器对候选运动计划进行排名,并将每个轨迹渲染为轻量级的视频草图,以避免昂贵的重复合成。
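The sample-and-verify loop described in this entry can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `propose_plans`, `verifier_score`, and `select_plan` are hypothetical names, the random-walk planner stands in for the real motion planner, and the hand-written score stands in for the vision-language verifier that scores rendered video sketches. It is not the authors' implementation.

```python
import random


def propose_plans(rng, n, steps=5):
    """Hypothetical planner: each plan is a list of (x, y) waypoints."""
    plans = []
    for _ in range(n):
        x, y = 0.0, 0.0
        plan = [(x, y)]
        for _ in range(steps):
            x += rng.uniform(0.0, 1.0)
            y += rng.uniform(-0.5, 0.5)
            plan.append((x, y))
        plans.append(plan)
    return plans


def verifier_score(plan, goal_x=3.0):
    """Stand-in for the verifier (assumption, not the paper's scoring).

    'Semantic' term: how close the final waypoint is to a target x position.
    'Physical' term: penalty on abrupt changes in vertical velocity.
    """
    semantic = -abs(plan[-1][0] - goal_x)
    jerk = sum(
        abs((plan[i + 1][1] - plan[i][1]) - (plan[i][1] - plan[i - 1][1]))
        for i in range(1, len(plan) - 1)
    )
    return semantic - jerk


def select_plan(n_candidates, threshold, max_rounds, seed=0):
    """Sample candidates, keep the best, and stop once it is good enough."""
    rng = random.Random(seed)
    best_plan, best_score = None, float("-inf")
    for _ in range(max_rounds):
        for plan in propose_plans(rng, n_candidates):
            score = verifier_score(plan)
            if score > best_score:
                best_plan, best_score = plan, score
        if best_score >= threshold:
            break
    return best_plan, best_score
```

In this toy setting, raising `n_candidates` can only keep or improve the best score for a fixed seed, which mirrors the direction of the ablation result reported in the abstract; only the selected plan would then be handed to the expensive generator.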
MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models
Authors: Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong
First: 2025-11-21T17:46:44+00:00 · Latest: 2025-11-21T17:46:44+00:00
Comments: 10 pages
Abstract
Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.
中文标题/摘要
标题:MMT-ARD:多模态多教师对抗鲁棒蒸馏方法
视觉-语言模型(VLMs)在越来越多的安全关键应用中得到部署,因此其对抗鲁棒性变得至关重要。虽然对抗知识蒸馏在从教师模型向学生模型转移鲁棒性方面显示出潜力,但传统的单教师方法存在知识多样性有限、收敛速度慢以及难以平衡鲁棒性和准确性的缺点。为了解决这些挑战,我们提出了MMT-ARD:一种多模态多教师对抗鲁棒蒸馏框架。我们的主要创新是一个双教师知识融合架构,协同优化干净特征的保留和鲁棒特征的增强。为了更好地处理具有挑战性的对抗样本,我们引入了一种基于教师置信度的动态权重分配策略,能够自适应地关注更难的样本。此外,为了减轻教师之间的偏差,我们设计了一种基于自适应Sigmoid的加权函数,以在不同模态之间平衡知识转移的强度。在ImageNet和零样本基准上的广泛实验表明,MMT-ARD在ViT-B-32模型上将鲁棒准确率提高了4.32%,零样本准确率提高了3.5%,同时训练效率比传统单教师方法提高了2.3倍。这些结果突显了MMT-ARD在增强多模态大型模型对抗鲁棒性方面的有效性和可扩展性。我们的代码可在 https://github.com/itsnotacie/MMT-ARD 获取。
Summary / 总结
The research aims to enhance the adversarial robustness of Vision-Language Models (VLMs) for safety-critical applications. MMT-ARD, a Multimodal Multi-Teacher Adversarial Robust Distillation framework, is proposed to address limitations of traditional single-teacher approaches. It uses a dual-teacher knowledge fusion architecture and a dynamic weight allocation strategy based on teacher confidence to improve robustness and accuracy. Experiments show that MMT-ARD increases robust accuracy by 4.32% and zero-shot accuracy by 3.5% on the ViT-B-32 model, with a 2.3x increase in training efficiency compared to single-teacher methods.
研究旨在通过增强视觉-语言模型(VLMs)的对抗鲁棒性,使其适用于安全关键应用。提出了多模态多教师对抗鲁棒蒸馏(MMT-ARD)框架,以解决传统单教师方法的局限性。关键创新包括双教师知识融合架构和基于教师置信度的动态权重分配策略。实验结果显示,MMT-ARD在ViT-B-32模型上将鲁棒准确率提高了4.32%,零样本准确率提高了3.5%,并且与单教师方法相比,训练效率提高了2.3倍。
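The abstract does not give the weighting formulas, so the following is only a minimal sketch of what confidence-based, sigmoid-weighted fusion of two teachers could look like. The function names, the `sharpness` parameter, and the specific rule (weight the robust teacher by a sigmoid of the confidence gap on the true label) are all assumptions for illustration, not the method from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_teachers(p_clean, p_robust, label, sharpness=5.0):
    """Blend two teacher distributions for one sample (hypothetical rule).

    The teacher that is more confident on the true label gets more weight;
    `sharpness` controls how quickly the weight saturates.
    """
    conf_gap = p_robust[label] - p_clean[label]
    w_robust = sigmoid(sharpness * conf_gap)
    fused = [
        w_robust * r + (1.0 - w_robust) * c
        for c, r in zip(p_clean, p_robust)
    ]
    return w_robust, fused


def kl_divergence(target, student, eps=1e-12):
    """KL(target || student), the usual distillation objective."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(target, student)
    )
```

A student would then be trained to minimize `kl_divergence(fused, student_probs)` on adversarial inputs. Because the fused target is a convex combination of two probability vectors, it remains a valid distribution.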
REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing
Authors: Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir
First: 2025-11-21T17:41:26+00:00 · Latest: 2025-11-21T17:41:26+00:00
Comments: Code and data available at https://github.com/be-chen/REMSA
Abstract
Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.
中文标题/摘要
标题:REMSA:用于遥感基础模型选择的LLM代理
基础模型(FMs)在遥感(RS)中越来越多地用于环境监测、灾害评估和土地利用制图等任务。这些模型包括单模态视觉编码器和多模态架构,后者结合了SAR、多光谱、高光谱和图像-文本数据进行训练。它们支持包括语义分割、图像分类、变化检测和视觉问答在内的多种RS任务。然而,由于文档分散、格式异构和部署约束多样,选择合适的遥感基础模型(RSFM)仍然具有挑战性。我们介绍了RSFM数据库(RS-FMD),这是一个结构化的资源,涵盖了超过150个RSFM,跨越了多种数据模态、分辨率和学习范式。基于RS-FMD,我们提出了REMSA,这是第一个用于从自然语言查询中自动选择RSFM的LLM代理。REMSA解释用户需求,解决缺失的约束,使用上下文学习对候选模型进行排名,并提供透明的解释。我们还提出了一组75个专家验证的RS查询场景基准,生成了900种配置,在专家中心的评估协议下进行。REMSA在多个基线中表现更优,包括朴素代理、密集检索和无结构的RAG基LLM。它完全基于公开的元数据运行,并不访问任何私人或敏感数据。
Summary / 总结
REMSA is an LLM agent designed for automated selection of remote sensing foundation models (RSFMs) from natural language queries. Built on the RSFM Database (RS-FMD), REMSA interprets user requirements, ranks candidate models using in-context learning, and provides transparent justifications. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs, in an expert-centered evaluation protocol using 75 expert-verified query scenarios.
REMSA 是一个基于自然语言查询的 LLM 代理,用于从 RSFM 数据库(RS-FMD)中自动选择遥感基础模型(RSFMs)。REMSA 解释用户需求,使用上下文学习对候选模型进行排名,并提供透明的解释。REMSA 在 75 个专家验证的查询场景的专家中心评估协议下,优于包括简单代理、密集检索和无结构的 RAG 基础的 LLM 等几个基线。
Preventing Shortcut Learning in Medical Image Analysis through Intermediate Layer Knowledge Distillation from Specialist Teachers
Authors: Christopher Boland, Sotirios Tsaftaris, Sonia Dahdouh
Venue: Machine Learning for Biomedical Imaging 3 (2025)
First: 2025-11-21T17:18:35+00:00 · Latest: 2025-11-21T17:18:35+00:00
Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:020
Abstract
Deep learning models are prone to learning shortcut solutions to problems using spuriously correlated yet irrelevant features of their training data. In high-risk applications such as medical image analysis, this phenomenon may prevent models from using clinically meaningful features when making predictions, potentially leading to poor robustness and harm to patients. We demonstrate that different types of shortcuts (those that are diffuse and spread throughout the image, as well as those that are localized to specific areas) manifest distinctly across network layers and can, therefore, be more effectively targeted through mitigation strategies that target the intermediate layers. We propose a novel knowledge distillation framework that leverages a teacher network fine-tuned on a small subset of task-relevant data to mitigate shortcut learning in a student network trained on a large dataset corrupted with a bias feature. Through extensive experiments on CheXpert, ISIC 2017, and SimBA datasets using various architectures (ResNet-18, AlexNet, DenseNet-121, and 3D CNNs), we demonstrate consistent improvements over traditional Empirical Risk Minimization, augmentation-based bias-mitigation, and group-based bias-mitigation approaches. In many cases, we achieve comparable performance with a baseline model trained on bias-free data, even on out-of-distribution test data. Our results demonstrate the practical applicability of our approach to real-world medical imaging scenarios where bias annotations are limited and shortcut features are difficult to identify a priori.
中文标题/摘要
标题:通过专家教师中间层知识蒸馏防止医学图像分析中的捷径学习
深度学习模型容易通过训练数据中虚假相关但无关的特征来学习捷径解决方案。在如医学图像分析等高风险应用中,这种现象可能会阻止模型在预测时使用临床有意义的特征,可能导致模型的鲁棒性差并对患者造成伤害。我们证明了不同类型的捷径(弥漫在整个图像中的以及局限于特定区域的)在不同网络层中表现不同,因此可以通过针对中间层的缓解策略更有效地进行针对性。我们提出了一种新颖的知识蒸馏框架,利用一个在少量任务相关数据上微调的教师网络来缓解在大量带有偏差特征的数据集上训练的学生网络中的捷径学习。通过在CheXpert、ISIC 2017和SimBA数据集上使用各种架构(ResNet-18、AlexNet、DenseNet-121和3D CNNs)进行广泛的实验,我们展示了与传统的经验风险最小化、基于增强的偏差缓解和基于群体的偏差缓解方法相比的一致改进。在许多情况下,即使在分布外测试数据上,我们也能达到与在无偏差数据上训练的基线模型相当的性能。我们的结果证明了我们的方法在临床医学成像场景中的实际应用性,其中偏差注释有限且捷径特征难以先验识别。
Summary / 总结
The paper addresses the issue of shortcut learning in deep learning models for medical image analysis, which can lead to poor robustness. It proposes a knowledge distillation framework that uses a teacher network fine-tuned on relevant data to mitigate shortcut learning in a student network trained on a larger dataset. Experiments on CheXpert, ISIC 2017, and SimBA datasets show consistent improvements over traditional methods, achieving comparable performance to models trained on bias-free data even on out-of-distribution test data.
论文针对医疗图像分析中深度学习模型易学习捷径解决方案的问题,可能导致模型鲁棒性差。提出了一种知识蒸馏框架,使用一个在相关数据上微调的教师网络来缓解学生网络在带有偏差的数据集上训练时的捷径学习。在CheXpert、ISIC 2017和SimBA数据集上的实验表明,该方法在传统方法上表现出一致的改进,并且在某些情况下,即使在分布外测试数据上也能达到与在无偏差数据上训练的基线模型相当的性能。
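As a rough illustration of distilling from intermediate layers, the sketch below adds a feature-matching term to the usual classification loss. This is the generic hint-style formulation, assumed here for illustration; the paper's actual loss, the choice of layers, and the `alpha` weight are not specified in this entry, and all names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def feature_mse(student_feat, teacher_feat):
    """Mean squared error between one pair of intermediate feature vectors."""
    return sum(
        (s - t) ** 2 for s, t in zip(student_feat, teacher_feat)
    ) / len(student_feat)


def distillation_loss(logits, label, student_feats, teacher_feats, alpha=1.0):
    """Task loss plus a penalty pulling student features toward the teacher's.

    `student_feats` and `teacher_feats` are lists of per-layer feature vectors
    taken at the same (assumed) intermediate layers of both networks.
    """
    task = cross_entropy(logits, label)
    hint = sum(feature_mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return task + alpha * hint
```

The intuition is that a teacher fine-tuned on bias-free data does not encode the shortcut feature in its intermediate activations, so matching those activations discourages the student from relying on it.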
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
Authors: Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel
First: 2025-11-21T17:09:43+00:00 · Latest: 2025-11-21T17:09:43+00:00
Abstract
Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, SPEAR-1: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as π0-FAST and π0.5, while it uses 20× fewer robot demonstrations. This carefully engineered training strategy unlocks new VLM capabilities and as a consequence boosts the reliability of embodied control beyond what is achievable with only robotic data. We make our model weights and 3D-annotated datasets publicly available.
中文标题/摘要
标题:SPEAR-1:通过三维理解超越机器人演示
机器人基础模型(RFMs)作为通用的端到端系统,在机器人控制方面具有巨大潜力。然而,它们在新环境、任务和具身形态方面的泛化能力仍然有限。我们认为,主要瓶颈在于它们的基础:大多数RFMs都是通过微调互联网预训练的视觉-语言模型(VLMs)构建的。然而,这些VLMs是在2D图像-语言任务上进行训练的,缺乏在三维世界中进行具身控制所固有需要的三维空间推理能力。直接通过大规模的机器人数据来弥合这一差距成本高昂且难以扩展。相反,我们提出为易于收集的非机器人图像数据添加三维标注,并增强预训练的VLM使其具备三维理解能力。遵循这一策略,我们训练了SPEAR-VLM,这是一种三维感知的VLM,可以从单张2D图像中推断出物体在三维空间中的坐标。基于SPEAR-VLM,我们引入了我们的主要贡献SPEAR-1:一种将有依据的三维感知与语言指令驱动的具身控制相结合的机器人基础模型。SPEAR-1在来自24个Open X-Embodiment数据集的约4500万帧数据上进行训练,其性能优于或持平于π0-FAST和π0.5等最先进的模型,同时使用的机器人演示数据仅为其二十分之一。这种精心设计的训练策略解锁了新的VLM能力,从而将具身控制的可靠性提升到仅靠机器人数据无法达到的水平。我们公开了模型权重和三维标注的数据集。
Summary / 总结
The research aims to enhance the generalization capabilities of Robotic Foundation Models (RFMs) by addressing their limitation in 3D spatial reasoning. The method involves enriching non-robotic image data with 3D annotations and training a 3D-aware Vision-Language Model (VLM) named SPEAR-VLM. The main experimental finding is that the proposed SPEAR-1 model, which integrates grounded 3D perception with language-instructed embodied control, outperforms or matches state-of-the-art models while requiring 20 times fewer robot demonstrations.
研究旨在通过增强Robotic Foundation Models (RFMs)的泛化能力,解决它们在新环境和任务中的有限适应性。方法是丰富非机器人图像数据的3D注释,并训练一个能够从2D图像中推断物体坐标的3D感知视觉语言模型(SPEAR-VLM)。主要实验发现是,该提出的SPEAR-1模型结合了基于3D感知的视觉语言和语言指导的实体控制,其性能优于或与最先进的模型相当,同时仅使用了20倍少的机器人演示数据。
Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models
Authors: Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wengqing Wu, Le Zhang, Massimo Poesio, Juntao Yu
First: 2025-11-14T22:41:27+00:00 · Latest: 2025-11-21T17:08:33+00:00
Abstract
Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. Evaluation on eight long video understanding benchmarks shows near-parity accuracy overall compared with the original Qwen models, and QTSplus outperforms the original model by +20.5 and +5.6 points on TempCompass direction and order accuracies, respectively. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.
中文标题/摘要
标题:见林又见木:面向长视频多模态语言模型的查询感知分词器
尽管多模态大型语言模型(MLLMs)在视频理解能力方面取得了近期进展,但长视频理解仍然是一个挑战。主要问题在于视觉标记的数量随着视频长度线性增长,导致注意力成本、内存和延迟爆炸性增长。为了解决这一挑战,我们提出了查询感知标记选择器(QTSplus),这是一种轻量级但强大的视觉标记选择模块,作为视觉编码器和LLMs之间的信息闸门。给定文本查询和视频标记,QTSplus通过(i)利用交叉注意力对视觉标记评分,(ii)根据查询的复杂性预测实例特定的保留预算,以及(iii)在训练期间使用可微直通估计器、在推理期间使用硬门控来选择Top-n标记,从而动态选择对输入文本查询最重要的视觉证据。此外,一个小的重编码器使用绝对时间信息保持时间顺序,在保持全局覆盖的同时实现秒级定位。将QTSplus集成到Qwen2.5-VL中,在长视频上压缩视觉流最多可达89%,并将端到端延迟减少28%。在八个长视频理解基准上的评估显示,与原始Qwen模型相比,总体准确率接近持平,并且在TempCompass方向和顺序准确率上分别比原始模型高出20.5和5.6个百分点。这些结果表明,QTSplus是一种有效的、通用的机制,可以将MLLMs扩展到现实世界的长视频场景,同时保留任务相关的证据。
Summary / 总结
The paper addresses the challenge of long-video understanding by proposing Query-aware Token Selector (QTSplus), which dynamically selects important visual evidence for text queries. QTSplus scores visual tokens via cross-attention, predicts an instance-specific retention budget, and selects Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. The method compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. Evaluation on eight benchmarks shows near-parity accuracy overall, with gains of +20.5 and +5.6 points on TempCompass direction and order accuracies, respectively, compared to the original model.
论文提出了一种查询感知的视觉标记选择模块QTSplus,该模块能够动态选择与文本查询相关的视觉证据。QTSplus通过交叉注意力评分视觉标记,预测实例特定的保留预算,并在训练中使用可微直通估计器选择Top-n标记,在推理中使用硬门控。这种方法在长视频上将视觉流最多压缩89%,并将端到端延迟减少了28%。实验结果显示,QTSplus在八个长视频理解基准测试中的准确率接近原模型,并分别在TempCompass方向和顺序准确率上比原模型高出20.5和5.6个百分点。
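The three inference-time steps (score, budget, hard select) can be sketched in a few lines. This is a toy illustration under stated assumptions: a single query vector with scaled dot-product scores stands in for cross-attention, and the entropy-based budget rule in `predict_budget` is a hypothetical placeholder for the learned, query-complexity-based budget predictor. The training-time straight-through estimator is omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score_tokens(query, tokens):
    """Scaled dot-product relevance of each visual token to the query."""
    d = len(query)
    logits = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
        for tok in tokens
    ]
    return softmax(logits)


def predict_budget(scores, min_keep=1):
    """Hypothetical budget rule: flatter scores (higher entropy) keep more tokens."""
    entropy = -sum(p * math.log(p) for p in scores if p > 0)
    max_entropy = math.log(len(scores))
    frac = entropy / max_entropy if max_entropy > 0 else 1.0
    budget = math.ceil(frac * len(scores))
    return max(min_keep, min(len(scores), budget))


def hard_select(scores, n):
    """Keep the n highest-scoring token indices, returned in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]
    return sorted(top)
```

Returning the kept indices in ascending order preserves the original frame order, which is what lets a downstream re-encoder attach time information to the surviving tokens.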
IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation
Authors: Yifan Li, Lichi Li, Anh Dao, Xinyu Zhou, Yicheng Qiao, Zheda Mai, Daeun Lee, Zichen Chen, Zhen Tan, Mohit Bansal, Yu Kong
First: 2025-11-21T16:48:49+00:00 · Latest: 2025-11-21T16:48:49+00:00
Abstract
While Visual Large Language Models (VLLMs) show great promise as embodied agents, they continue to face substantial challenges in spatial reasoning. Existing embodied benchmarks largely focus on passive, static household environments and evaluate only isolated capabilities, failing to capture holistic performance in dynamic, real-world complexity. To fill this gap, we present IndustryNav, the first dynamic industrial navigation benchmark for active spatial reasoning. IndustryNav leverages 12 manually created, high-fidelity Unity warehouse scenarios featuring dynamic objects and human movement. Our evaluation employs a PointGoal navigation pipeline that effectively combines egocentric vision with global odometry to assess holistic local-global planning. Crucially, we introduce the "collision rate" and "warning rate" metrics to measure safety-oriented behaviors and distance estimation. A comprehensive study of nine state-of-the-art VLLMs (including models such as GPT-5-mini, Claude-4.5, and Gemini-2.5) reveals that closed-source models maintain a consistent advantage; however, all agents exhibit notable deficiencies in robust path planning, collision avoidance, and active exploration. This highlights a critical need for embodied research to move beyond passive perception and toward tasks that demand stable planning, active exploration, and safe behavior in dynamic, real-world environments.
中文标题/摘要
标题:IndustryNav:探索动态工业导航中具身代理的空间推理
尽管视觉大型语言模型(VLLMs)作为具身代理显示出巨大的潜力,但在空间推理方面仍面临重大挑战。现有的具身基准主要集中在被动的、静态的家庭环境中,并仅评估孤立的能力,未能捕捉到在动态、现实世界复杂性中的整体表现。为填补这一空白,我们提出了IndustryNav,这是首个用于主动空间推理的动态工业导航基准。IndustryNav 利用了12个手工创建的、高保真的Unity仓库场景,其中包含动态物体和人类移动。我们的评估采用PointGoal导航管道,有效地结合了第一人称视觉与全局里程计,以评估局部-全局规划的整体表现。至关重要的是,我们引入了“碰撞率”和“警告率”指标来衡量面向安全的行为和距离估计。对九种最先进的VLLMs(包括GPT-5-mini、Claude-4.5和Gemini-2.5等模型)的全面研究显示,闭源模型保持了一致的优势;然而,所有代理在稳健路径规划、碰撞避免和主动探索方面均表现出明显的不足。这突显了具身研究需要超越被动感知,转向在动态现实世界环境中要求稳定规划、主动探索和安全行为的任务。
Summary / 总结
IndustryNav is a new benchmark for evaluating spatial reasoning of embodied agents in dynamic industrial settings. It uses 12 high-fidelity Unity warehouse scenarios with dynamic objects and human movement. The evaluation focuses on local-global planning using a PointGoal navigation pipeline and introduces metrics like collision rate and warning rate to assess safety and distance estimation. The study of nine state-of-the-art VLLMs shows that closed-source models perform better, but all agents struggle with robust path planning, collision avoidance, and active exploration, indicating a need for more advanced embodied AI capabilities.
IndustryNav 是一个用于评估在动态工业环境中实体代理空间推理能力的新基准。它使用了 12 个高保真 Unity 仓库场景,包含动态物体和人类移动。评估集中在使用 PointGoal 导航管道进行局部-全局规划上,并引入了碰撞率和警告率等指标来评估安全性和距离估计。对九个最先进的 VLLM 的研究显示,闭源模型表现更好,但所有代理在稳健路径规划、碰撞避免和主动探索方面都存在明显不足,表明需要更多先进的实体 AI 研究。
METIS: Multi-Source Egocentric Training for Integrated Dexterous Vision-Language-Action Model
Authors: Yankai Fu, Ning Chen, Junkai Zhao, Shaozhe Shan, Guocai Yao, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
First: 2025-11-21T16:32:36+00:00 · Latest: 2025-11-21T16:32:36+00:00
Abstract
Building a generalist robot that can perceive, reason, and act across diverse tasks remains an open challenge, especially for dexterous manipulation. A major bottleneck lies in the scarcity of large-scale, action-annotated data for dexterous skills, as teleoperation is difficult and costly. Human data, with its vast scale and diverse manipulation behaviors, provides rich priors for learning robotic actions. While prior works have explored leveraging human demonstrations, they are often constrained by limited scenarios and a large visual gap between humans and robots. To eliminate these limitations, we propose METIS, a vision-language-action (VLA) model for dexterous manipulation pretrained on multi-source egocentric datasets. We first construct EgoAtlas, which integrates large-scale human and robotic data from multiple sources, all unified under a consistent action space. We further extract motion-aware dynamics, a compact and discretized motion representation, which provides efficient and expressive supervision for VLA training. Built upon them, METIS integrates reasoning and acting into a unified framework, enabling effective deployment to downstream dexterous manipulation tasks. Our method demonstrates exceptional dexterous manipulation capabilities, achieving the highest average success rate across six real-world tasks. Experimental results also highlight its superior generalization and robustness to out-of-distribution scenarios. These findings emphasize METIS as a promising step toward a generalist model for dexterous manipulation.
中文标题/摘要
标题:METIS:多源自视点训练的综合灵巧视觉-语言-行动模型
构建能够在多种任务中感知、推理和行动的一般机器人仍然是一个开放的挑战,尤其是在灵巧操作方面。主要瓶颈在于缺乏大规模的、带有动作注释的数据,因为远程操作既困难又昂贵。人类数据因其庞大的规模和多样的操作行为,为学习机器人动作提供了丰富的先验知识。尽管先前的工作探索了利用人类示范,但它们往往受限于有限的场景和人类与机器人之间巨大的视觉差距。为了解决这些限制,我们提出了METIS,一种基于多源自视点数据预训练的视觉-语言-行动(VLA)模型。我们首先构建了EgoAtlas,它整合了来自多个来源的大规模人类和机器人数据,并统一在一个一致的动作空间下。我们进一步提取了运动感知动力学,这是一种紧凑且离散化的运动表示,为VLA训练提供了高效的、表达性的监督。基于此,METIS将推理和行动整合到一个统一的框架中,使其能够有效地部署到下游的灵巧操作任务中。我们的方法展示了出色的灵巧操作能力,在六个真实世界任务中实现了最高的平均成功率。实验结果还强调了METIS在泛化能力和对分布外场景的鲁棒性方面的优越性。这些发现强调了METIS作为灵巧操作通用模型的一个有希望的步骤。
Summary / 总结
The research aims to develop a generalist robot capable of dexterous manipulation by addressing the scarcity of large-scale, action-annotated data. METIS, a vision-language-action model, is pretrained on multi-source egocentric datasets, including human and robotic data, to overcome the limitations of prior works. The model integrates motion-aware dynamics for efficient and expressive supervision, achieving high success rates in six real-world tasks and demonstrating superior generalization and robustness to out-of-distribution scenarios.
研究旨在通过解决大规模动作标注数据稀缺的问题,开发出能够进行灵巧操作的通用机器人。METIS 是一种基于多源第一人称数据集预训练的视觉-语言-动作模型,包括人类和机器人数据,以克服先前工作的局限性。该模型整合了运动感知动力学,实现了六个真实任务中的高成功率,并展示了在分布外场景中的优越泛化能力和鲁棒性。
A Training-Free Style-Personalization via SVD-Based Feature Decomposition
Authors: Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im
First: 2025-07-06T17:42:11+00:00 · Latest: 2025-11-21T16:25:35+00:00
Comments: 21 pages, 14 figures
Abstract
We present a training-free framework for style-personalized image generation that operates during inference using a scale-wise autoregressive model. Our method generates a stylized image guided by a single reference style while preserving semantic consistency and mitigating content leakage. Through a detailed step-wise analysis of the generation process, we identify a pivotal step where the dominant singular values of the internal feature encode style-related components. Building upon this insight, we introduce two lightweight control modules: Principal Feature Blending, which enables precise modulation of style through SVD-based feature reconstruction, and Structural Attention Correction, which stabilizes structural consistency by leveraging content-guided attention correction across fine stages. Without any additional training, extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
中文标题/摘要
标题:基于SVD特征分解的无训练风格个性化生成
我们提出了一种在推理过程中使用的无训练框架,用于风格个性化图像生成,采用尺度自回归模型。该方法在单个参考风格的引导下生成风格化图像,同时保持语义一致性并减少内容泄露。通过对生成过程的详细步骤分析,我们确定了一个关键步骤,在此步骤中,内部特征的主要奇异值编码了与风格相关的组件。基于这一洞察,我们引入了两个轻量级控制模块:主特征混合,通过SVD特征重构实现对风格的精确调节;结构注意力校正,通过内容引导的注意力校正来稳定结构一致性。无需额外训练,大量实验表明,与微调基线相比,我们的方法在风格保真度和提示保真度方面具有竞争力,同时提供更快的推理速度和更大的部署灵活性。
Summary / 总结
The paper presents a training-free framework for style-personalized image generation using a scale-wise autoregressive model. It generates stylized images guided by a single reference style while maintaining semantic consistency and reducing content leakage. Key to the method is the identification of dominant singular values in internal features that encode style components, leading to the introduction of two control modules: Principal Feature Blending and Structural Attention Correction. Experiments show that the method achieves competitive style and prompt fidelity, with faster inference and greater deployment flexibility compared to fine-tuned baselines.
论文提出了一种无需训练的风格个性化图像生成框架,使用尺度自回归模型。该方法通过单一参考风格生成风格化的图像,同时保持语义一致性并减少内容泄露。方法引入了两个轻量级控制模块:主特征混合和结构注意力校正,分别实现精确的风格调制和结构一致性。实验表明,该方法在风格和提示保真度方面与微调基线相当,同时具有更快的推理速度和更大的部署灵活性。
UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification
Authors: Taixi Chen, Jingyun Chen, Nancy Guo
First: 2025-11-21T16:18:55+00:00 · Latest: 2025-11-21T16:18:55+00:00
Abstract
Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% (n = 349,882 cells), and tumor segmentation precision from 75% to 80% (n = 406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.
中文标题/摘要
标题:UAM:用于肿瘤细胞分类的多模态框架统一注意力-Mamba骨干网络
细胞级放射组学特征提供了对肿瘤表型的精细洞察,并有可能显著提高对苏木精和伊红(H&E)图像的诊断准确性。通过捕捉微观形态和强度模式,这些特征支持更精确的肿瘤识别,并通过突出显示供病理学家审查的诊断相关细胞来提高AI可解释性。然而,大多数现有研究集中在切片级或图像块级肿瘤分类上,而细胞级放射组学分析在很大程度上尚未被探索。此外,目前没有专门针对放射组学数据设计的骨干网络。受最近Mamba架构在视觉和语言领域取得成功的启发,我们提出了一种利用放射组学特征进行细胞级分类的统一注意力-Mamba(UAM)骨干网络。与之前将注意力和Mamba模块以固定比例集成的混合方法不同,我们的统一设计在单一连贯架构中灵活结合了它们的能力,消除了手动比例调优的需要,并提高了编码能力。我们开发了两种UAM变体以全面评估这种统一结构的好处。在此基础上,我们进一步提出了一种多模态UAM框架,联合执行细胞级分类和图像分割。实验结果表明,UAM在公共基准测试中的两项任务上均实现了最先进的性能,超越了领先的基于图像的基础模型。它将细胞分类准确率从74%提高到78%(n=349,882个细胞),并将肿瘤分割精度从75%提高到80%(n=406个图像块)。这些发现突显了UAM作为统一和可扩展的多模态基础架构在放射组学驱动癌症诊断中的有效性和潜力。
Summary / 总结
The research aims to enhance tumor cell classification using cell-level radiomics features by introducing a Unified Attention-Mamba (UAM) backbone. Unlike previous hybrid approaches that combine Attention and Mamba modules in fixed proportions, UAM flexibly combines their capabilities within a single architecture, removing manual ratio tuning and improving encoding capability. The multimodal UAM framework achieves state-of-the-art performance on public benchmarks, increasing cell classification accuracy from 74% to 78% and tumor segmentation precision from 75% to 80%.
该研究通过引入统一注意力-Mamba(UAM)骨干网络解决了细胞级放射组学分析的空白,该网络将注意力和Mamba模块的优势以灵活统一的方式结合。UAM 进一步扩展为多模态方法,用于细胞级分类和图像分割。实验结果表明,UAM 在公共基准上的表现优于现有模型,将细胞分类准确率从 74% 提高到 78%,肿瘤分割精度从 75% 提高到 80%。
SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
Authors: Jiajie Guo, Qingpeng Zhu, Jin Zeng, Xiaolong Wu, Changyong He, Weida Wang
First: 2025-11-21T15:24:33+00:00 · Latest: 2025-11-21T15:24:33+00:00
Abstract
Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks due to the strong reasoning capability of large language models (LLMs). Nevertheless, most MLLMs suffer from limited spatial reasoning ability to interpret and infer spatial arrangements in three-dimensional space. In this work, we propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embedding and boosting the spatial grounding capability of MLLMs. Specifically, we first unveil that the spatial ambiguity shortcoming stems from the lossy embedding of the vision encoder utilized in most existing MLLMs (e.g., CLIP), restricted to instance-level semantic features. This motivates us to complement CLIP with the geometry features from vision-only self-supervised learning via a hierarchical adapter, enhancing the spatial awareness in the proposed SpatialGeo. The network is efficiently trained using pretrained LLaVA model and optimized with random feature dropping to avoid trivial solutions relying solely on the CLIP encoder. Experimental results show that SpatialGeo improves the accuracy in spatial reasoning tasks, enhancing state-of-the-art models by at least 8.0% in SpatialRGPT-Bench with approximately 50% less memory cost during inference. The source code is available via https://ricky-plus.github.io/SpatialGeoPages/.
中文标题/摘要
标题:SpatialGeo:通过几何语义融合提升多模态大语言模型的空间推理能力
多模态大语言模型(MLLMs)得益于大语言模型(LLMs)强大的推理能力,在图像和语言任务中取得了显著进展。然而,大多数MLLMs的空间推理能力有限,难以解释和推断三维空间中的空间排列。在本工作中,我们提出了一种基于几何和语义特征分层融合的新型视觉编码器,生成空间感知的视觉嵌入,并增强MLLMs的空间定位能力。具体而言,我们首先揭示了空间歧义缺陷源于大多数现有MLLMs所用视觉编码器(如CLIP)的有损嵌入,这类嵌入仅限于实例级语义特征。这促使我们通过分层适配器,用来自纯视觉自监督学习的几何特征补充CLIP,从而增强所提出的SpatialGeo的空间感知能力。该网络基于预训练的LLaVA模型高效训练,并通过随机特征丢弃进行优化,以避免仅依赖CLIP编码器的平凡解。实验结果表明,SpatialGeo提高了空间推理任务的准确率,在SpatialRGPT-Bench上比最先进的模型至少提高8.0%,同时推理时内存成本降低约50%。源代码可通过 https://ricky-plus.github.io/SpatialGeoPages/ 获取。
Summary / 总结
This work addresses the limited spatial reasoning of multimodal large language models (MLLMs) by proposing SpatialGeo, a vision encoder that hierarchically fuses geometry and semantics features. It traces spatial ambiguity to the lossy embeddings of the vision encoders used in most existing MLLMs (e.g., CLIP), which are restricted to instance-level semantic features. The method uses a hierarchical adapter to complement CLIP with geometry features from vision-only self-supervised learning, is trained from a pretrained LLaVA model, and is optimized with random feature dropping to avoid trivial solutions that rely solely on the CLIP encoder. Experimental results show that SpatialGeo improves on state-of-the-art models by at least 8.0% on SpatialRGPT-Bench, with approximately 50% less memory cost during inference.
本文提出了一种名为SpatialGeo的新视觉编码器,通过层次融合几何和语义特征来解决多模态大型语言模型(MLLMs)的空间推理能力不足问题。该方法旨在通过解决现有模型如CLIP的空间模糊问题来增强MLLMs的空间定位能力。实验结果表明,SpatialGeo在空间推理任务中的准确性提高了至少8.0%,并且在推理过程中具有较低的内存成本。
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
Authors: Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Volker Tresp, Alois Knoll
First: 2025-03-30T14:45:54+00:00 · Latest: 2025-11-21T14:50:41+00:00
Abstract
We present OpenDriveVLA, a Vision Language Action model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions by leveraging multimodal inputs, including 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent environment ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question answering tasks. Qualitative analyses further illustrate its capability to follow high-level driving commands and generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
中文标题/摘要
标题:OpenDriveVLA:基于大型视觉语言行动模型的端到端自动驾驶
我们提出了OpenDriveVLA,一种基于开源大型语言模型构建、面向端到端自动驾驶的视觉语言行动模型。OpenDriveVLA 利用包括2D和3D实例感知视觉表示、自车状态和语言命令在内的多模态输入,生成具有空间依据的驾驶行动。为了弥合驾驶视觉表示与语言嵌入之间的模态差距,我们引入了一种分层视觉语言对齐过程,将2D和3D结构化视觉标记投影到统一的语义空间中。此外,我们还将结构化的智能体-环境-自车交互建模融入自回归解码过程,使模型能够捕捉对可靠轨迹规划至关重要的细粒度空间依赖性和行为感知动态。在nuScenes数据集上的大量实验表明,OpenDriveVLA 在开环轨迹规划和驾驶相关问答任务中均取得了最先进的结果。定性分析进一步表明,它能够遵循高级驾驶命令并在具有挑战性的场景中生成轨迹,突显了其在下一代端到端自动驾驶中的潜力。
Summary / 总结
OpenDriveVLA is an end-to-end autonomous driving model that uses large vision language action models to generate spatially grounded driving actions based on multimodal inputs. It introduces a hierarchical vision language alignment process to bridge the gap between visual and language modalities and incorporates structured agent-environment interaction modeling to capture fine-grained spatial dependencies. Experiments show that OpenDriveVLA outperforms existing methods in open-loop trajectory planning and driving-related question answering tasks.
OpenDriveVLA 是一种基于大型视觉语言行动模型的端到端自动驾驶模型,能够根据多模态输入生成具有空间依据的驾驶动作。它引入了分层视觉语言对齐过程以弥合视觉和语言模态之间的差距,并结合了结构化的智能体-环境交互建模以捕捉细粒度的空间依赖关系。实验表明,OpenDriveVLA 在开环轨迹规划和驾驶相关问答任务中优于现有方法。
LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Authors: Yushi Huang, Xingtong Ge, Ruihao Gong, Chengtao Lv, Jun Zhang
First: 2025-10-09T15:03:39+00:00 · Latest: 2025-11-21T14:45:09+00:00
Comments: Code will be released upon acceptance
Abstract
Video diffusion models (DMs) have enabled high-quality video synthesis. However, their computation costs scale quadratically with sequence length because self-attention has quadratic complexity. While linear attention lowers the cost, fully replacing quadratic attention requires expensive pretraining due to the limited expressiveness of linear attention and the complexity of spatiotemporal modeling in video generation. In this paper, we present LinVideo, an efficient data-free post-training framework that replaces a target number of self-attention modules with linear attention while preserving the original model's performance. First, we observe a significant disparity in the replaceability of different layers. Instead of manual or heuristic choices, we frame layer selection as a binary classification problem and propose selective transfer, which automatically and progressively converts layers to linear attention with minimal performance impact. Additionally, to overcome the ineffectiveness and inefficiency of existing objectives for this transfer process, we introduce an anytime distribution matching (ADM) objective that aligns the distributions of samples across any timestep along the sampling trajectory. This objective is efficient and recovers model performance. Extensive experiments show that our method achieves a 1.25-2.00x speedup while preserving generation quality, and our 4-step distilled model further delivers a 15.92x latency reduction with minimal visual quality drop.
中文标题/摘要
标题:LinVideo:一种面向高效视频生成的O(n)注意力后训练框架
视频扩散模型(DMs)已实现高质量视频合成。然而,由于自注意力具有二次复杂度,它们的计算成本随序列长度呈二次增长。虽然线性注意力可以降低成本,但由于线性注意力的表达能力有限,且视频生成中的时空建模复杂,完全替换二次注意力需要昂贵的预训练。在本文中,我们提出了LinVideo,这是一种高效的、无需数据的后训练框架,可以将目标数量的自注意力模块替换为线性注意力,同时保持原始模型的性能。首先,我们观察到不同层的可替换性存在显著差异。我们不采用手动或启发式选择,而是将层选择表述为二元分类问题,并提出选择性迁移,该方法可以自动且逐步地将层转换为线性注意力,同时对性能影响最小。此外,为了克服现有目标在这一迁移过程中的无效性和低效率,我们引入了一种任意时刻分布匹配(ADM)目标,该目标沿采样轨迹对任意时间步的样本分布进行对齐。该目标高效且能够恢复模型性能。大量实验表明,我们的方法在保持生成质量的同时实现了1.25-2.00倍的加速,而我们的4步蒸馏模型进一步实现了15.92倍的延迟降低,且视觉质量下降极小。
Summary / 总结
LinVideo is a post-training framework that replaces self-attention modules with linear attention to reduce the computational cost of video diffusion models. It automatically selects layers for conversion using a binary classification approach and introduces an anytime distribution matching objective to maintain model performance. Experiments demonstrate that LinVideo achieves a 1.25-2.00x speedup with preserved generation quality, and a 4-step distilled model further reduces latency by 15.92x with minimal visual quality drop.
LinVideo 是一个后训练框架,通过将自注意力模块替换为线性注意力来降低视频扩散模型的计算成本。它使用二元分类方法自动选择要转换的层,并引入任意时刻分布匹配(ADM)目标以保持性能。实验显示,该方法在保持生成质量的同时实现了1.25-2.00倍的加速,其4步蒸馏模型进一步实现了15.92倍的延迟降低,且视觉质量下降极小。
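The cost argument behind the LinVideo entry can be illustrated with a generic kernelized linear attention. This is a minimal sketch, not LinVideo's layer-selection or ADM procedure: the `elu(x) + 1` feature map and the helper names `feature_map`, `quadratic_form`, and `linear_form` are assumptions made for illustration. It shows that reordering the matrix products gives the same output without ever building the n x n similarity matrix.

```python
import math

def feature_map(x):
    # elu(x) + 1: a positive feature map often used for linear attention
    # (an assumption here, not taken from the LinVideo abstract).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def quadratic_form(Q, K, V):
    """Reference version: explicit n x n similarities, O(n^2 * d) work."""
    out = []
    for q in Q:
        fq = feature_map(q)
        sims = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        z = sum(sims)  # row normaliser
        out.append([sum(s * v[j] for s, v in zip(sims, V)) / z
                    for j in range(len(V[0]))])
    return out

def linear_form(Q, K, V):
    """Same output via associativity, phi(Q) (phi(K)^T V): O(n * d^2) work."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # running sum of phi(k) v^T
    ksum = [0.0] * d                     # running sum of phi(k)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for i in range(d):
            ksum[i] += fk[i]
            for j in range(dv):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = feature_map(q)
        z = sum(a * b for a, b in zip(fq, ksum))
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / z
                    for j in range(dv)])
    return out
```

Because `kv` and `ksum` are computed once and reused for every query, the cost grows linearly with the number of tokens, which is the property the O(n) in the paper title refers to.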
MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
Authors: Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli
First: 2025-09-17T14:13:20+00:00 · Latest: 2025-11-21T14:33:03+00:00
Abstract
Personalized object detection aims to adapt a general-purpose detector to recognize user-specific instances from only a few examples. Lightweight models often struggle in this setting due to their weak semantic priors, while large vision-language models (VLMs) offer strong object-level understanding but are too computationally demanding for real-time or on-device applications. We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a distillation framework that transfers multimodal region-level knowledge from a frozen VLM teacher into a lightweight vision-only detector. MOCHA extracts fused visual and textual teacher's embeddings and uses them to guide student training through a dual-objective loss that enforces accurate local alignment and global relational consistency across regions. This process enables efficient transfer of semantics without the need for teacher modifications or textual input at inference. MOCHA consistently outperforms prior baselines across four personalized detection benchmarks under strict few-shot regimes, yielding a +10.1 average improvement, with minimal inference cost.
中文标题/摘要
标题:MOCHA:多模态对象感知跨架构对齐
个性化对象检测旨在仅凭少量示例,使通用检测器能够识别用户特定的实例。轻量级模型在这种情况下往往难以应对,因为它们的语义先验较弱,而大型视觉语言模型(VLMs)虽然在对象层面的理解很强,但计算成本过高,不适合实时或设备端应用。我们提出了MOCHA(多模态对象感知跨架构对齐),这是一种蒸馏框架,将冻结的VLM教师的多模态区域级知识迁移到轻量级的纯视觉检测器中。MOCHA提取教师融合的视觉和文本嵌入,并通过双重目标损失引导学生训练,该损失同时约束准确的局部对齐和跨区域的全局关系一致性。这一过程能够高效地迁移语义,无需对教师进行修改,也无需在推理时使用文本输入。MOCHA在严格的少样本设置下,于四个个性化检测基准测试中始终优于先前的基线,平均提升+10.1,且推理成本极低。
Summary / 总结
The research aims to improve personalized object detection by adapting a general-purpose detector to recognize user-specific instances from limited examples. MOCHA, a distillation framework, transfers knowledge from a large vision-language model to a lightweight detector. It uses a dual-objective loss to align regions accurately and maintain global consistency, without requiring textual input at inference. Experiments show MOCHA outperforms previous methods by an average of +10.1 across four benchmarks, with minimal inference cost.
研究旨在改进个性化对象检测,即仅凭少量示例使通用检测器识别用户特定的实例。MOCHA是一种蒸馏框架,将大型视觉语言模型的知识迁移到轻量级检测器中。它使用双重目标损失来对齐区域并保持全局一致性,从而在推理时无需文本输入的情况下实现高效的语义迁移。MOCHA在四个基准测试中平均比先前的方法高出+10.1,且推理成本极低。
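The "dual-objective loss" in the MOCHA entry combines a per-region (local) term with a cross-region (relational) term. The sketch below is one plausible reading of that description, not the authors' exact loss: it assumes student and teacher region embeddings already share a dimension, uses cosine distance for the local term, and compares pairwise cosine-similarity matrices for the relational term. The names `local_alignment_loss`, `relational_loss`, and the weight `lam` are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def local_alignment_loss(student, teacher):
    # Mean cosine distance between each student region and its teacher region.
    return sum(1.0 - cosine(s, t) for s, t in zip(student, teacher)) / len(student)

def relational_loss(student, teacher):
    # Mean squared difference between the two pairwise-similarity matrices,
    # so the student preserves how regions relate to one another.
    n = len(student)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = cosine(student[i], student[j]) - cosine(teacher[i], teacher[j])
            total += diff * diff
    return total / (n * n)

def dual_objective(student, teacher, lam=1.0):
    # lam is a hypothetical weight balancing the two terms.
    return local_alignment_loss(student, teacher) + lam * relational_loss(student, teacher)
```

The two terms are not redundant: swapping two orthogonal regions leaves the relational term at zero while the local term is maximal, so each term catches errors the other misses.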
Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?
Authors: Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud
First: 2025-08-01T23:07:16+00:00 · Latest: 2025-11-21T14:32:46+00:00
Comments: 7 figures
Abstract
We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
中文标题/摘要
标题:柏拉图式表示在贫困制图中的应用:统一的视觉-语言代码还是代理引发的新颖性?
我们研究了社会经济指标如家庭财富是否会在卫星图像(捕捉物理特征)和互联网来源的文本(反映历史/经济叙事)中留下可恢复的印记。使用非洲社区的Demographic and Health Survey (DHS) 数据,我们将Landsat图像与基于位置/年份的LLM生成的文本描述配对,并通过AI搜索代理从网络来源检索文本。我们开发了一种多模态框架,通过五个管道预测家庭财富(国际财富指数):(i)基于卫星图像的视觉模型,(ii)仅使用位置/年份的LLM,(iii)AI代理搜索/合成网络文本,(iv)图像-文本联合编码器,(v)所有信号的集成。我们的框架有三个贡献。首先,将视觉与代理/LLM文本融合在财富预测中优于仅视觉基线(例如,在样本外分割中R平方值为0.77 vs. 0.63),LLM内部知识比代理检索的文本更有效,提高了跨国和跨时间泛化的鲁棒性。其次,我们发现部分表示收敛:视觉/语言模态融合嵌入在对齐后相关性中等(中位余弦相似度为0.60),表明存在共享的物质福祉的潜在代码,同时保留互补细节,这与柏拉图式表示假设一致。尽管仅由LLM生成的文本优于代理检索的数据,挑战了代理引发新颖性假设,但在某些分割中结合代理数据的适度增益部分支持代理收集的信息引入了未被静态LLM知识完全捕捉的独特表示结构的观点。第三,我们发布了一个大规模的多模态数据集,包含超过60,000个DHS集群,链接到卫星图像、LLM生成的描述和代理检索的文本。
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Authors: Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta
First: 2025-11-14T13:56:39+00:00 · Latest: 2025-11-21T14:18:23+00:00
Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code and model are available at https://github.com/Tanveer81/DocSLM.git.
中文标题/摘要
标题:DocSLM:一种用于长多模态文档理解的小型视觉-语言模型
大型视觉-语言模型(LVLMs)在长且复杂的文档上展示了强大的多模态推理能力。然而,它们较高的内存占用使其在资源受限的边缘设备上部署不切实际。我们提出了DocSLM,一种针对受限内存资源设计的高效小型视觉-语言模型,用于长文档理解。DocSLM 结合使用了层次多模态压缩器,能够将每页的视觉、文本和布局信息联合编码为固定长度的序列,大幅减少内存消耗同时保留局部和全局语义。为了实现对任意长输入的可扩展处理,我们引入了一种流式弃权机制,该机制按文档段顺序操作,并使用基于熵的不确定性校准器过滤低置信度响应。在多个长多模态文档基准测试中,DocSLM 在使用 82% 更少的视觉标记、75% 更少的参数和 71% 更低的延迟的同时,达到了或超过了最先进的方法,实现了轻量级边缘设备上的可靠多模态文档理解。代码和模型可在 https://github.com/Tanveer81/DocSLM.git 获取。
Summary / 总结
DocSLM is a small vision-language model designed for efficient long-document understanding on resource-constrained devices. It uses a Hierarchical Multimodal Compressor to encode visual, textual, and layout information into a fixed-length sequence, reducing memory usage. DocSLM also employs a Streaming Abstention mechanism to handle long inputs efficiently. Experimental results show that DocSLM matches or outperforms state-of-the-art methods with significantly fewer visual tokens, parameters, and latency, making it suitable for lightweight edge devices.
DocSLM 是一种针对资源受限设备设计的小型视觉语言模型,用于高效理解长文档。它使用层次多模态压缩器将视觉、文本和布局信息编码为固定长度的序列,从而减少内存使用并保留语义。DocSLM 还采用了一种流式弃权机制,通过分段处理长输入并使用基于熵的不确定性校准器过滤低置信度响应。该模型在视觉标记、参数数量和延迟方面分别比最先进的方法少82%、75%和71%,使其适用于轻量级边缘设备。
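The Streaming Abstention idea in the DocSLM entry (process segments in order, discard low-confidence responses by entropy) can be sketched as follows. This is a simplified illustration under stated assumptions: each segment yields a vector of logits over a fixed candidate-answer list, and a fixed threshold `max_entropy` stands in for the paper's learned uncertainty calibrator. All function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0)

def stream_answer(segment_logits, answers, max_entropy=0.5):
    """Visit segments sequentially and keep the most confident answer.

    Returns None (abstains) if no segment is confident enough.
    """
    best, best_h = None, max_entropy
    for logits in segment_logits:
        probs = softmax(logits)
        h = entropy(probs)
        if h <= best_h:  # confident enough, and no worse than the best so far
            best_h = h
            best = answers[probs.index(max(probs))]
    return best
```

Only one segment's scores are held at a time, which is what lets this style of processing scale to arbitrarily long documents with bounded memory.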
Statistical physics analysis of graph neural networks: Approaching optimality in the contextual stochastic block model
Authors: O. Duranthon, L. Zdeborová
Venue: Physical Review X, 2025
First: 2025-03-03T09:55:10+00:00 · Latest: 2025-11-21T14:07:21+00:00
Abstract
Graph neural networks (GNNs) are designed to process data associated with graphs. They are finding an increasing range of applications; however, as with other modern machine learning techniques, their theoretical understanding is limited. GNNs can encounter difficulties in gathering information from nodes that are far apart by iterated aggregation steps. This situation is partly caused by so-called oversmoothing; and overcoming it is one of the practically motivated challenges. We consider the situation where information is aggregated by multiple steps of convolution, leading to graph convolutional networks (GCNs). We analyze the generalization performance of a basic GCN, trained for node classification on data generated by the contextual stochastic block model. We predict its asymptotic performance by deriving the free energy of the problem, using the replica method, in the high-dimensional limit. Calling depth the number of convolutional steps, we show the importance of going to large depth to approach the Bayes-optimality. We detail how the architecture of the GCN has to scale with the depth to avoid oversmoothing. The resulting large depth limit can be close to the Bayes-optimality and leads to a continuous GCN. Technically, we tackle this continuous limit via an approach that resembles dynamical mean-field theory (DMFT) with constraints at the initial and final times. An expansion around large regularization allows us to solve the corresponding equations for the performance of the deep GCN. This promising tool may contribute to the analysis of further deep neural networks.
中文标题/摘要
标题:图神经网络的统计物理分析:在上下文随机块模型中的最优性逼近
图神经网络(GNNs)旨在处理与图相关联的数据。它们的应用范围正在不断扩大,然而,与其他现代机器学习技术一样,对其理论理解仍然有限。GNNs 在通过迭代聚合步骤收集远处节点信息时可能会遇到困难。这种情况部分是由所谓的过平滑引起的;克服这一问题是具有实际动机的挑战之一。我们考虑通过多步卷积聚合信息的情况,即图卷积网络(GCNs)。我们分析了在由上下文随机块模型生成的数据上训练用于节点分类的基本GCN的泛化性能。我们在高维极限下使用副本方法(replica method)推导问题的自由能,从而预测其渐近性能。我们将深度定义为卷积步骤的数量,并表明为了接近贝叶斯最优性,需要使用较大的深度。我们详细说明了GCN的架构必须如何随深度缩放以避免过平滑。由此产生的大深度极限可以接近贝叶斯最优性,并导出连续的GCN。从技术上讲,我们通过一种类似于动力学平均场理论(DMFT)、并在初始和最终时刻带有约束的方法来处理这种连续极限。在大正则化附近进行展开,使我们能够求解深层GCN性能对应的方程。这一有前景的工具可能有助于分析其他深度神经网络。
Summary / 总结
The paper analyzes the performance of graph neural networks (GNNs) using statistical physics methods, focusing on overcoming oversmoothing in graph convolutional networks (GCNs). By considering the contextual stochastic block model, the authors predict the asymptotic performance of a basic GCN through the replica method and show that increasing the depth of the network is crucial to approach Bayes-optimality. They also detail the necessary scaling of the GCN architecture with depth to avoid oversmoothing and derive the large depth limit, which can be close to Bayes-optimality and leads to a continuous GCN. This approach is similar to dynamical mean-field theory (DMFT) with constraints and provides a promising tool for analyzing deep neural networks.
论文分析了图神经网络(GNNs)在上下文随机块模型中的性能,重点关注如何在增加图卷积网络(GCNs)深度的同时克服过平滑问题。研究使用副本方法(replica method)预测渐近性能,并说明架构必须如何随深度缩放,结果显示大深度可以接近贝叶斯最优性,并导出连续的GCN。该方法采用类似动力学平均场理论的技术,带有初始和最终时刻的约束,并通过在大正则化附近进行展开来求解性能方程。
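Oversmoothing, the obstacle the GCN entry above is concerned with, can be seen in a toy example: repeated neighbourhood averaging pulls all node features toward a common value. The sketch below is an illustration only. It uses a row-normalised adjacency with self-loops, which is an assumption and not the scaling analysed in the paper, and the helper names are hypothetical.

```python
def normalized_adjacency(edges, n):
    """Row-normalised adjacency with self-loops for an undirected graph."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        A[i][j] = 1.0
        A[j][i] = 1.0
    return [[a / sum(row) for a in row] for row in A]

def convolve(A, x, steps):
    """Apply `steps` rounds of neighbourhood averaging to node features x."""
    for _ in range(steps):
        x = [sum(a * v for a, v in zip(row, x)) for row in A]
    return x

def spread(x):
    # Gap between the largest and smallest node feature.
    return max(x) - min(x)
```

On a four-node path graph with features `[1, 1, -1, -1]`, the spread shrinks as the number of convolution steps grows, so the two communities become indistinguishable at large depth unless the architecture is rescaled, which is the issue the paper's depth-dependent scaling addresses.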
Enforcing governing equation constraints in neural PDE solvers via training-free projections
Authors: Omer Rochman, Gilles Louppe
Venue: NeurIPS 2025
First: 2025-11-21T14:03:28+00:00 · Latest: 2025-11-21T14:03:28+00:00
Comments: Machine Learning and the Physical Sciences, NeurIPS 2025, San Diego
Abstract
Neural PDE solvers used for scientific simulation often violate governing equation constraints. While linear constraints can be projected cheaply, many constraints are nonlinear, complicating projection onto the feasible set. Dynamical PDEs are especially difficult because constraints induce long-range dependencies in time. In this work, we evaluate two training-free, post hoc projections of approximate solutions: a nonlinear optimization-based projection, and a local linearization-based projection using Jacobian-vector and vector-Jacobian products. We analyze constraints across representative PDEs and find that both projections substantially reduce violations and improve accuracy over physics-informed baselines.
中文标题/摘要
标题:通过无需训练的投影在神经偏微分方程求解器中强制满足控制方程约束
用于科学模拟的神经偏微分方程求解器经常违反控制方程约束。虽然线性约束可以廉价地投影,但许多约束是非线性的,这使得向可行集的投影变得复杂。动态偏微分方程尤其难以处理,因为约束会引入时间上的长程依赖。在本文中,我们评估了两种无需训练的、对近似解进行事后投影的方法:基于非线性优化的投影,以及使用雅可比-向量积和向量-雅可比积的局部线性化投影。我们分析了代表性偏微分方程的约束,发现这两种投影都显著减少了约束违反,并且准确性优于物理信息基线。
Summary / 总结
This study addresses the issue of neural PDE solvers violating governing equation constraints, particularly focusing on nonlinear constraints. The authors propose two training-free methods for post hoc projections: a nonlinear optimization-based projection and a local linearization-based projection using Jacobian-vector and vector-Jacobian products. The experiments show that both methods significantly reduce constraint violations and enhance accuracy compared to physics-informed baselines across various PDEs.
该研究解决了神经PDE求解器违反控制方程约束的问题,特别是对于非线性约束。研究评估了两种无需训练的事后投影方法:基于非线性优化的投影,以及使用雅可比-向量积和向量-雅可比积的局部线性化投影。研究发现,这两种方法在各种PDE中显著减少了约束违反并提高了准确性,优于物理信息基线方法。
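The contrast between cheap linear projections and linearization-based projections for nonlinear constraints can be made concrete with a single scalar constraint. This is a minimal sketch under stated assumptions, not the paper's implementation: the constraint gradient is supplied analytically rather than through Jacobian-vector or vector-Jacobian products, and the function names are hypothetical.

```python
def project_linear(u, a, b):
    """Orthogonal projection of u onto the hyperplane {v : a . v = b}."""
    r = sum(ai * ui for ai, ui in zip(a, u)) - b  # constraint residual
    aa = sum(ai * ai for ai in a)
    return [ui - ai * r / aa for ui, ai in zip(u, a)]

def project_linearized(u, c, grad_c, iters=20, tol=1e-12):
    """Repeated linearised projection onto {v : c(v) = 0} for scalar c.

    Each step projects onto the hyperplane given by the first-order
    expansion of c around the current iterate (a Gauss-Newton-style update).
    """
    for _ in range(iters):
        r = c(u)
        if abs(r) < tol:
            break
        g = grad_c(u)
        gg = sum(gi * gi for gi in g)
        u = [ui - gi * r / gg for ui, gi in zip(u, g)]
    return u
```

For example, a discrete mass constraint `sum(u) = M` is linear and satisfied after one call to `project_linear`, whereas a quadratic energy constraint such as `sum(v * v) = E` needs the iterative version.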
A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
Authors: Bulat Khaertdinov, Mirela Popa, Nava Tintarev
Venue: WACV
First: 2025-11-21T14:01:36+00:00 · Latest: 2025-11-21T14:01:36+00:00
Comments: Accepted to WACV'26
Abstract
Large vision-language models (VLMs) enable intuitive visual search using natural language queries. However, improving their performance often requires fine-tuning and scaling to larger model variants. In this work, we propose a mechanism inspired by traditional text-based search to improve retrieval performance at inference time: relevance feedback. While relevance feedback can serve as an alternative to fine-tuning, its model-agnostic design also enables use with fine-tuned VLMs. Specifically, we introduce and evaluate four feedback strategies for VLM-based retrieval. First, we revise classical pseudo-relevance feedback (PRF), which refines query embeddings based on top-ranked results. To address its limitations, we propose generative relevance feedback (GRF), which uses synthetic captions for query refinement. Furthermore, we introduce an attentive feedback summarizer (AFS), a custom transformer-based model that integrates multimodal fine-grained features from relevant items. Finally, we simulate explicit feedback using ground-truth captions as an upper-bound baseline. Experiments on Flickr30k and COCO with the VLM backbones show that GRF, AFS, and explicit feedback improve retrieval performance by 3-5% in MRR@5 for smaller VLMs, and 1-3% for larger ones, compared to retrieval with no feedback. Moreover, AFS, similarly to explicit feedback, mitigates query drift and is more robust than GRF in iterative, multi-turn retrieval settings. Our findings demonstrate that relevance feedback can consistently enhance retrieval across VLMs and open up opportunities for interactive and adaptive visual search.
中文标题/摘要
标题:再多一点类似的:利用相关反馈的视觉语言模型文本到图像检索
大型视觉语言模型(VLMs)使得用自然语言查询进行直观的视觉搜索成为可能。然而,提高其性能通常需要微调并扩展到更大的模型变体。在本研究中,我们提出了一种受传统文本搜索启发的机制,用于在推理时提高检索性能:相关反馈。虽然相关反馈可以作为微调的替代方案,但其模型无关的设计也使其能够与微调后的VLMs一起使用。具体而言,我们引入并评估了四种用于VLM检索的反馈策略。首先,我们改进了经典的伪相关反馈(PRF),它基于排名靠前的结果细化查询嵌入。为了解决其局限性,我们提出了生成式相关反馈(GRF),它使用合成图像描述来细化查询。此外,我们引入了注意力反馈摘要器(AFS),这是一种定制的基于Transformer的模型,整合了相关项的多模态细粒度特征。最后,我们使用真实图像描述模拟显式反馈,作为上限基线。在Flickr30k和COCO上使用VLM骨干的实验表明,与无反馈的检索相比,GRF、AFS和显式反馈使较小VLM的MRR@5提高3-5%,使较大VLM的MRR@5提高1-3%。此外,AFS与显式反馈类似,可以缓解查询漂移,并且在迭代的多轮检索设置中比GRF更鲁棒。我们的研究结果表明,相关反馈可以稳定地提升各种VLM的检索性能,并为交互式和自适应视觉搜索带来了机会。
Summary / 总结
This work proposes a relevance feedback mechanism to improve text-to-image retrieval using vision-language models (VLMs). Four feedback strategies are introduced: revising pseudo-relevance feedback, proposing generative relevance feedback, introducing an attentive feedback summarizer, and simulating explicit feedback. Experiments on Flickr30k and COCO show that these methods enhance retrieval performance by 3-5% in MRR@5 for smaller VLMs and 1-3% for larger ones, compared to no feedback. Additionally, the attentive feedback summarizer is robust and mitigates query drift in iterative retrieval settings.
本文提出了一种相关反馈机制,以提高使用视觉语言模型(VLMs)的文本到图像检索性能。四种反馈策略被引入:修订伪相关反馈、提出生成相关反馈、引入注意反馈摘要器以及模拟显式反馈。实验结果表明,这些方法在Flickr30k和COCO上的检索性能分别提高了较小VLMs的3-5%的MRR@5和较大VLMs的1-3%,并且注意反馈摘要器在迭代检索设置中特别稳健。
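Classical pseudo-relevance feedback, the first of the four strategies in the relevance-feedback entry, and the MRR@k metric it is scored with can be sketched in a few lines. This is an illustrative Rocchio-style update under assumptions: embeddings are plain lists, similarity is a dot product, and `top_k` and `alpha` are hypothetical parameters rather than the paper's settings.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rank(query, items):
    """Item indices sorted by dot-product similarity, best first."""
    scores = [sum(q * x for q, x in zip(query, it)) for it in items]
    return sorted(range(len(items)), key=lambda i: -scores[i])

def pseudo_relevance_feedback(query, items, top_k=2, alpha=0.7):
    # Treat the top_k results as relevant and pull the query toward their centroid.
    top = rank(query, items)[:top_k]
    centroid = [sum(items[i][d] for i in top) / top_k for d in range(len(query))]
    return normalize([alpha * q + (1 - alpha) * c for q, c in zip(query, centroid)])

def mrr_at_k(rankings, targets, k=5):
    """Mean reciprocal rank, counting only targets found in the top k."""
    total = 0.0
    for r, t in zip(rankings, targets):
        if t in r[:k]:
            total += 1.0 / (r[:k].index(t) + 1)
    return total / len(rankings)
```

Because the update trusts whatever ranks highest, a wrong top result drags the query away from the user's intent. That is the query-drift limitation the entry attributes to feedback methods and that AFS and explicit feedback are reported to mitigate.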
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Authors: Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang
Venue: NeurIPS 2025
First: 2025-11-21T13:57:38+00:00 · Latest: 2025-11-21T13:57:38+00:00
Comments: Accepted to NeurIPS 2025, Project Page: https://github.com/SooLab/AllPath
Abstract
Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.
中文标题/摘要
标题:Intervene-All-Paths:统一跨对齐格式减轻LVLM幻觉
尽管大型视觉-语言模型(LVLMs)在广泛的任务中表现出色,但它们仍然容易出现幻觉。在本研究中,我们提出了一种与LVLM中Transformer因果架构相一致的综合干预框架,整合了不同干预路径对幻觉的影响。我们发现LVLM中的幻觉并非源自单一的因果路径,而是来自图像到输入文本、图像到输出文本和文本到文本路径之间的相互作用。我们还首次发现,LVLM所依赖的路径取决于问题-答案对齐格式。基于这些见解,我们提出了简单而有效的方法,以识别并干预每条路径中的关键幻觉注意力头,并针对判别式和生成式格式进行定制。在多个基准测试中的实验表明,我们的方法在多种对齐类型中一致地减少了幻觉。
Summary / 总结
This study addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) by proposing a unified intervention framework that considers the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. The research finds that hallucinations are not caused by a single path but by the interaction among these pathways, and that LVLMs rely on different pathways depending on the question-answer alignment format. The proposed method effectively identifies and intervenes on critical hallucination heads within each pathway, reducing hallucinations across various alignment types in multiple benchmarks.
该研究通过提出一个统一的干预框架来解决大型视觉语言模型(LVLM)中的幻觉问题。研究发现,幻觉是由图像到输入文本、图像到输出文本以及文本到文本路径之间的相互作用引起的,并且LVLM在不同问题-答案对齐格式下依赖于不同的路径。作者提出了针对区分性和生成性格式的干预方法,以识别并干预每个路径中的关键幻觉头。实验结果表明,他们的方法能够在多种基准测试中一致地减少各种对齐类型的幻觉。
Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
Authors: Cris Claessens, Christiaan Viviers, Giacomo D'Amicantonio, Egor Bondarev, Fons van der Sommen
First: 2025-11-21T12:41:27+00:00 · Latest: 2025-11-21T12:41:27+00:00
Abstract
We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.
中文标题/摘要
标题:体积CT变换器的自监督和跨模态预训练扩展
我们介绍了SPECTRE,一种基于完全变换器的基础模型,用于体积计算机断层扫描(CT)。我们的CT表示提取的自监督与跨模态预训练(SPECTRE)方法利用可扩展的3D视觉变换器架构和现代自监督及视觉-语言预训练策略来学习通用的CT表示。体积CT带来了独特的挑战,如极端的标记缩放、几何各向异性以及临床监督较弱或噪声较大,这使得标准的变换器和对比学习方法无法开箱即用。该框架联合优化了一个局部变换器以进行高分辨率体积特征提取,以及一个全局变换器以进行整个扫描上下文建模,使得大规模3D注意力计算上可行。值得注意的是,SPECTRE仅在公开可用的CT数据集上进行训练,证明了无需依赖私有数据即可实现高性能和可泛化的表示。预训练结合了DINO风格的自我蒸馏与SigLIP为基础的视觉-语言对齐,使用配对的放射学报告,生成既几何上一致又具有临床意义的特征。在多个CT基准测试中,SPECTRE在零样本和微调设置中均优于先前的CT基础模型,确立了SPECTRE作为3D医学成像的可扩展、开放且完全基于变换器的基础模型的地位。
Summary / 总结
SPECTRE is a transformer-based foundation model for volumetric CT that uses self-supervised and cross-modal pretraining to address unique challenges in CT imaging. The model jointly optimizes local and global transformers to handle high-resolution feature extraction and whole-scan context modeling, making large-scale 3D attention feasible. SPECTRE outperforms previous CT foundation models in both zero-shot and fine-tuned settings, demonstrating that high-performing, generalizable representations can be achieved without private data.
SPECTRE 是一种完全基于Transformer的体积CT基础模型,通过自监督和跨模态预训练解决标记数量激增和几何各向异性等挑战。它联合优化局部和全局Transformer,分别用于高分辨率特征提取和整个扫描的上下文建模。SPECTRE 在多个基准测试中无论是在零样本还是微调设置下都优于先前的 CT 基础模型,证明了无需使用私有数据即可获得高性能、可泛化的表示。
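The SPECTRE abstract mentions SigLIP-based vision-language alignment on paired scans and radiology reports. The pairwise sigmoid loss behind that family of methods can be sketched as below. This is a generic illustration, not SPECTRE's training code: the embeddings are assumed to be L2-normalised lists, and the temperature `t` and bias `b` defaults are illustrative values, not settings taken from the paper.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(img, txt, t=10.0, b=-10.0):
    """Each (image i, text j) pair is an independent binary decision:
    label +1 for matched pairs (i == j), -1 for all other pairs."""
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            logit = t * sum(a * c for a, c in zip(img[i], txt[j])) + b
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n
```

Unlike a softmax contrastive loss, each pair is scored independently, so no normalisation across the whole batch is required.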
Bootstrap Off-policy with World Model
Authors: Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, Shengbo Eben Li
Venue: NeurIPS 2025
First: 2025-11-01T06:33:04+00:00 · Latest: 2025-11-21T12:39:02+00:00
Comments: NeurIPS 2025
Abstract
Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy's actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner's non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner's action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.
中文标题/摘要
标题:基于世界模型的自举离策略学习
在线规划在强化学习(RL)中已被证明能有效提高样本效率和最终性能。然而,使用规划进行环境交互不可避免地会导致收集的数据与策略的实际行为之间出现偏差,从而降低模型学习和策略改进的效果。为了解决这个问题,我们提出了BOOM(Bootstrap Off-policy with WOrld Model)框架,该框架通过一个自举循环将规划和离策略学习紧密结合:策略初始化规划器,规划器细化动作,并通过行为对齐来自举策略。这个循环由一个联合学习的世界模型支持,该模型使规划器能够模拟未来轨迹并提供价值目标,以促进策略改进。BOOM的核心是一个无似然对齐损失,该损失使用规划器的非参数动作分布来自举策略,并结合了一个软价值加权机制,优先考虑高回报行为,并缓解回放缓冲区中规划器动作质量的波动。在高维DeepMind Control Suite和Humanoid-Bench上的实验表明,BOOM在训练稳定性和最终性能方面均达到了最先进的结果。代码可在 https://github.com/molumitu/BOOM_MBRL 获取。
Summary / 总结
BOOM (Bootstrap Off-policy with WOrld Model) addresses the divergence between planner-collected data and the policy's actual behavior by integrating planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner's refined actions bootstrap the policy. A jointly learned world model simulates future trajectories and provides value targets, while a likelihood-free alignment loss with soft value weighting bootstraps the policy from the planner's action distribution. Experiments on the DeepMind Control Suite and Humanoid-Bench show state-of-the-art training stability and final performance.
BOOM (Bootstrap Off-policy with WOrld Model) 通过自举循环将规划与离策略学习紧密结合,以解决规划采集的数据与策略实际行为之间的偏差问题。它使用联合学习的世界模型来模拟未来轨迹并提供价值目标,并采用无似然对齐损失来自举策略。实验结果显示,BOOM 在 DeepMind Control Suite 和 Humanoid-Bench 上实现了最先进的训练稳定性和最终性能。
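The soft value-weighted alignment idea in BOOM can be sketched as follows. The abstract does not give the loss itself, so this is only an assumed form: a squared distance between the policy's action and the planner's action (no likelihood of the planner distribution is needed), weighted by a softmax over value estimates. The function names and the temperature `tau` are hypothetical.

```python
import math


def soft_value_weights(q_values, tau=1.0):
    """Softmax over value estimates: higher-return samples get larger weight."""
    q_max = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - q_max) / tau) for q in q_values]
    norm = sum(exps)
    return [e / norm for e in exps]


def weighted_alignment_loss(policy_actions, planner_actions, q_values, tau=1.0):
    """Pull policy actions toward planner actions sampled from the replay
    buffer, emphasizing samples whose planner actions have high value."""
    weights = soft_value_weights(q_values, tau)
    loss = 0.0
    for w, a_pi, a_plan in zip(weights, policy_actions, planner_actions):
        loss += w * sum((x - y) ** 2 for x, y in zip(a_pi, a_plan))
    return loss


# Toy batch of two transitions with 2-D actions (hypothetical numbers).
policy_actions = [[0.0, 0.0], [0.5, 0.5]]
planner_actions = [[1.0, 0.0], [0.5, 0.5]]
q_values = [2.0, 0.0]

weights = soft_value_weights(q_values)
loss = weighted_alignment_loss(policy_actions, planner_actions, q_values)
print(weights, loss)
```

A lower `tau` concentrates the weight on the highest-value planner actions; a higher `tau` approaches uniform behavior cloning over the buffer.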
VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Authors: Jianxiang He, Meisheng Hong, Jungang Li, Ziyang Chen, Weiyu Guo, Xuming Hu, Hui Xiong
First: 2025-08-09T07:38:48+00:00 · Latest: 2025-11-21T12:37:49+00:00
Comments: 9 pages, 3 figures
Abstract
Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame quality but heavily rely on the visual modality alone. This makes them difficult to adapt to text-related tasks and often leads to retrieval results deviating from core semantic content. To address this, we propose the VISUAL-SUBTITLE INTEGRATION (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach combining Video Search and Subtitle Match to fuse complementary visual and textual information for precise localization. Experiments on LongVideoBench and VideoMME demonstrate that VSI achieves state-of-the-art accuracy in keyframe retrieval while delivering breakthrough performance in text-related tasks and exhibiting strong generalization across other tasks.
中文标题/摘要
标题:VSI:视觉字幕集成的关键帧选择以增强长视频理解
多模态大型语言模型(MLLMs)在视觉语言任务中表现出色,但在处理长视频时受到输入上下文长度和高计算成本的限制。因此,稀疏帧采样成为必要的预处理步骤,采样的帧质量直接影响下游性能。现有的关键帧搜索算法在效率和采样帧质量之间取得平衡,但主要依赖于视觉模态,这使得它们难以适应与文本相关任务,并且往往导致检索结果偏离核心语义内容。为了解决这个问题,我们提出了VISUAL-SUBTITLE INTEGRATION(VSI),这是一种多模态关键帧检索框架。它采用视频搜索和字幕匹配的双分支协作检索方法,结合互补的视觉和文本信息进行精确定位。在LongVideoBench和VideoMME上的实验表明,VSI在关键帧检索中达到了最先进的准确率,同时在与文本相关任务中取得了突破性的性能,并且在其他任务中表现出强大的泛化能力。
Summary / 总结
The research aims to improve long-video understanding in multimodal large language models, which must sparsely sample frames because of context-length and compute limits. Existing keyframe search relies on the visual modality alone; the proposed VSI framework instead uses a dual-branch collaborative retrieval approach that combines Video Search with Subtitle Match to fuse visual and textual information. Experiments on LongVideoBench and VideoMME show that VSI achieves state-of-the-art keyframe retrieval accuracy, delivers strong results on text-related tasks, and generalizes well to other tasks.
研究旨在解决现有关键帧搜索方法仅依赖视觉模态的局限,提高长视频理解中关键帧选择的准确性。VSI框架采用双分支协作检索方法,结合视频搜索和字幕匹配来融合视觉和文本信息。实验表明,VSI在关键帧检索中达到了最先进的准确率,在文本相关任务中表现突出,并具有较强的跨任务泛化能力。
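The dual-branch retrieval described for VSI can be pictured as score fusion over frames. The sketch below is an assumption-laden illustration, not the authors' algorithm: per-frame visual scores stand in for the Video Search branch, subtitle spans with match scores stand in for the Subtitle Match branch, and the linear fusion weight `alpha` is hypothetical.

```python
def fuse_and_select(visual_scores, subtitle_hits, k=2, alpha=0.5):
    """Return the indices of the k frames with the highest fused score.

    visual_scores: per-frame relevance from the visual branch, in [0, 1].
    subtitle_hits: (start_frame, end_frame, score) spans whose subtitle
                   text matched the query.
    """
    num_frames = len(visual_scores)
    text_scores = [0.0] * num_frames
    for start, end, score in subtitle_hits:
        # Spread each subtitle match over the frames it covers.
        for f in range(max(start, 0), min(end, num_frames - 1) + 1):
            text_scores[f] = max(text_scores[f], score)

    fused = [alpha * v + (1.0 - alpha) * t
             for v, t in zip(visual_scores, text_scores)]
    ranked = sorted(range(num_frames), key=lambda f: fused[f], reverse=True)
    return sorted(ranked[:k])  # keep temporal order for the downstream model


visual = [0.1, 0.9, 0.2, 0.3, 0.1]
hits = [(3, 4, 1.0)]  # a subtitle matching the query spans frames 3..4

both_branches = fuse_and_select(visual, hits, k=2, alpha=0.5)
visual_only = fuse_and_select(visual, hits, k=2, alpha=1.0)
print(both_branches, visual_only)
```

With `alpha=1.0` the selection falls back to the visual branch alone, which makes the effect of the subtitle branch easy to compare.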
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
Authors: Hanyu Zhou, Chuanhao Ma, Gim Hee Lee
First: 2025-11-21T12:26:30+00:00 · Latest: 2025-11-21T12:26:30+00:00
Abstract
Vision-language-action (VLA) models show potential for general robotic tasks, but remain challenging in spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable the spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially-smooth and temporally-coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments have been conducted to verify the superiority of our method across different tasks of robotic manipulation.
中文标题/摘要
标题:VLA-4D:将4D感知嵌入视觉-语言-行动模型以实现时空连贯的机器人操作
视觉-语言-行动(VLA)模型在通用机器人任务中显示出潜力,但在时空连贯的操作方面仍然具有挑战性,这需要精细的表示。通常,现有方法将3D位置嵌入视觉表示中以增强动作的空间精度。然而,这些方法难以实现对动作执行的时间连贯控制。在本文中,我们提出了VLA-4D,这是一种具有4D感知能力的通用VLA模型,用于实现时空连贯的机器人操作。我们的模型由两个关键设计指导:1)4D感知视觉表示。我们提取视觉特征,将1D时间嵌入3D位置中形成4D嵌入,并通过交叉注意力机制将它们融合成统一的视觉表示。2)时空动作表示。我们扩展了传统的空间动作表示,加入时间信息以实现时空规划,并将多模态表示对齐到LLM以进行时空动作预测。在这一统一框架中,设计的视觉和动作表示共同使机器人操作在空间上平滑且在时间上连贯。此外,我们扩展了VLA数据集,添加了时间动作注释以微调我们的模型。我们进行了广泛的实验以验证我们方法在不同机器人操作任务中的优越性。
Summary / 总结
This paper addresses the challenge of spatiotemporally coherent robotic manipulation by proposing VLA-4D, a vision-language-action model with 4D awareness. The model embeds 1D time into 3D positions to build a 4D-aware visual representation, fused via cross-attention, and extends spatial action representations with temporal information for spatiotemporal planning. Experiments demonstrate the method's superiority across various robotic manipulation tasks.
研究通过提出具有4D感知能力的VLA-4D,增强视觉-语言-行动模型在时空连贯操作方面的能力。方法包括4D感知视觉表示和时空动作表示,分别提升动作的空间精度和时间连贯性。实验结果表明,VLA-4D在各种机器人操作任务中优于现有方法,实现了空间上更平滑、时间上更连贯的动作。
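One way to picture "embedding 1D time into 3D positions for 4D embeddings" in VLA-4D is a sinusoidal encoding over (x, y, z, t). The abstract does not specify the encoding, so the sinusoidal form, the frequency schedule, and the function names below are assumptions; the cross-attention fusion step is not shown.

```python
import math


def sinusoidal(value, num_freqs):
    """Encode one scalar with sin/cos pairs at doubling frequencies."""
    features = []
    for k in range(num_freqs):
        freq = 2.0 ** k
        features.append(math.sin(freq * value))
        features.append(math.cos(freq * value))
    return features


def embed_4d(x, y, z, t, num_freqs=4):
    """Concatenate encodings of the 3D position and the time stamp."""
    embedding = []
    for coordinate in (x, y, z, t):
        embedding.extend(sinusoidal(coordinate, num_freqs))
    return embedding


# Same 3D point observed at two different times (hypothetical values).
early = embed_4d(0.2, -0.1, 0.5, t=0.0)
late = embed_4d(0.2, -0.1, 0.5, t=1.0)
print(len(early), early[:24] == late[:24], early[24:] != late[24:])
```

The spatial part of the embedding is identical for both observations while the temporal part differs, so a downstream model can tell apart the same location at different moments.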
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Authors: Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, Mingfei Han, Meng Cao, Bokui Chen, Ivan Laptev, Xiaodan Liang
First: 2025-06-10T11:46:06+00:00 · Latest: 2025-11-21T12:13:42+00:00
Abstract
While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2600 block tasks (400 assembly tasks, 2200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that the performance of VLMs exhibits pronounced limitations in high-level planning and reasoning capabilities, leading to a notable decline in performance for the growing complexity of the tasks. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting spatial tasks heavily rely on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.
中文标题/摘要
标题:PhyBlock:一种通过3D积木组装评估物理理解和规划能力的渐进基准
尽管视觉-语言模型(VLMs)在为具身智能体进行推理和规划方面展现了令人鼓舞的能力,但它们在理解物理现象方面的能力,特别是在结构化的3D环境中,仍然受到严重限制。为了解决这一问题,我们引入了PhyBlock,这是一种渐进基准,旨在通过机器人3D积木组装任务评估VLMs在物理理解和规划方面的表现。PhyBlock结合了一种新颖的四级认知层次组装任务和有针对性的视觉问答(VQA)样本,旨在评估渐进的空间推理和基本物理理解能力,包括物体属性、空间关系和整体场景理解。PhyBlock包括2600个积木任务(400个组装任务,2200个VQA任务),并从三个关键维度评估模型:部分完成、故障诊断和规划稳健性。我们对21个最先进的VLMs进行了基准测试,突显了它们在基于物理的多步规划方面的优势和局限性。我们的实证研究结果表明,VLMs在高级规划和推理能力方面表现出明显的局限性,导致随着任务复杂性的增加,性能显著下降。错误分析揭示了在空间定向和依赖推理方面持续存在的困难。令人惊讶的是,思维链提示提供的改进微乎其微,表明空间任务高度依赖于直观的模型理解。我们将PhyBlock定位为统一的测试平台,以促进具身推理,弥合视觉-语言理解与现实世界物理问题解决之间的差距。
Summary / 总结
PhyBlock is a progressive benchmark that evaluates vision-language models (VLMs) on physical understanding and planning through robotic 3D block assembly. It combines a four-level cognitive hierarchy assembly task with targeted VQA samples (2600 tasks in total: 400 assembly, 2200 VQA) and scores models on partial completion, failure diagnosis, and planning robustness. Across 21 VLMs, the study finds pronounced limitations in high-level planning and reasoning, with performance declining as task complexity grows, and persistent difficulties in spatial orientation and dependency reasoning. Chain-of-thought prompting offers minimal improvement, suggesting that spatial tasks rely heavily on intuitive model comprehension.
PhyBlock 是一个基准,用于评估视觉-语言模型(VLMs)通过 3D 积木组装任务进行物理理解和规划的能力。它包含一个四级认知层次组装任务和 VQA 任务,以评估空间推理和物理理解。研究对 21 个最先进的 VLMs 进行了基准测试,发现这些模型在高级规划和推理方面存在局限性,随着任务复杂性增加性能明显下降,并且在空间定向和依赖推理方面存在持续困难。思维链提示带来的改进很小,表明空间任务高度依赖直观的模型理解。PhyBlock 旨在将视觉-语言理解与现实世界的物理问题解决相结合,以推进具身推理。
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Authors: Weikai Huang, Jieyu Zhang, Taoyang Jia, Chenhao Zheng, Ziqi Gao, Jae Sung Park, Winson Han, Ranjay Krishna
First: 2025-10-10T08:04:30+00:00 · Latest: 2025-11-21T12:01:11+00:00
Comments: Project website: https://github.com/weikaih04/Synthetic-Detection-Segmentation-Grounding-Data
Abstract
Visual grouping -- operationalized through tasks such as instance segmentation, visual grounding, and object detection -- enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large-scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. It composes high-quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask-area-weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy-Paste, X-Paste, SynGround, SegGen) by +24-36% -- achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open-vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low-data and closed-vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real-data scales and yields even greater improvements under extremely limited real-data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra-class referring, a diagnostic grounding task we propose that requires fine-grained attribute discrimination.
中文标题/摘要
标题:合成对象组合以实现可扩展和准确的检测、分割和定位学习
视觉分组——通过实例分割、视觉定位和对象检测等任务实现——支持从机器人感知到照片编辑等多种应用。这些计算机视觉中的基本问题依赖于大规模、耗时标注的数据集。尽管这些数据集影响深远,但它们的构建成本高昂、覆盖面有偏见且难以扩展。合成数据集提供了一种有前景的替代方案,但在灵活性、准确性和组合多样性方面存在挑战。 我们提出了合成对象组合(SOC),这是一种新颖的对象为中心的合成策略,通过3D几何布局增强和相机配置增强生成新的图像,使用生成性协调和掩码面积加权融合,生成准确且多样的掩码、边界框和引用表达。 仅使用我们合成图像中的10万张图像训练的模型在GRIT 2000万、V3Det 20万等更大规模的真实数据集和合成管道(Copy-Paste、X-Paste、SynGround、SegGen)上表现出+24-36%的性能提升,分别在LVIS上达到+10.9 AP和gRefCOCO上达到+8.4 NAcc。除了通用的开放词汇设置外,SOC还能够为不同的应用场景控制数据集的构建,并在低数据和封闭词汇场景中提升性能。 将LVIS和COCO与合成对象片段结合使用,在不同真实数据规模下表现出强大的性能,并在极度有限的真实数据条件下(如COCO数据集的1%)获得更大的改进,包括+6.59 AP。此外,这种可控性还能够针对类别内引用生成数据,这是我们提出的一种需要精细属性区分的诊断定位任务。
Summary / 总结
The paper addresses the cost, coverage bias, and limited scalability of annotated datasets for instance segmentation, visual grounding, and object detection. It introduces Synthetic Object Compositions (SOC), a data synthesis pipeline that composes synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation, yielding accurate masks, boxes, and referring expressions. Models trained on just 100K synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and other synthetic pipelines by +24-36%, reaching +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. SOC also enables controllable dataset construction and improves performance in low-data scenarios, including +6.59 AP with 1% of COCO.
论文介绍了合成对象组合(SOC),这是一种数据合成流程,通过新颖的以对象为中心的组合策略,利用 3D 几何布局增强和相机配置增强将合成对象片段组合成新图像,生成准确且多样的掩码、边界框和引用表达。仅用 100K 合成图像训练的模型,其表现优于在更大规模真实数据集和其他合成流程上训练的模型,在 LVIS 上提升 10.9 AP,在 gRefCOCO 上提升 8.4 NAcc。
VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Authors: Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao
First: 2025-11-20T15:16:09+00:00 · Latest: 2025-11-21T11:57:47+00:00
Abstract
Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token prune method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
中文标题/摘要
标题:VLA-Pruner:面向高效视觉-语言-动作推理的时序感知双层视觉标记剪枝
视觉-语言-动作(VLA)模型在具身智能方面展现了巨大的潜力,但处理连续视觉流的高昂计算成本严重限制了其实时部署。标记剪枝(保留显著的视觉标记并丢弃冗余的标记)已成为加速视觉-语言模型(VLMs)的有效方法,为高效VLA提供了解决方案。然而,这些针对VLM的标记剪枝方法仅基于语义显著性指标(例如预填充注意力)选择标记,而忽视了VLA固有的双系统本质,即高层次语义理解和低层次动作执行。因此,这些方法在保留标记时偏向语义线索,丢弃了用于生成动作的关键信息,显著降低了VLA性能。为解决这一问题,我们提出了一种名为VLA-Pruner的通用即插即用VLA专用标记剪枝方法,该方法与VLA模型的双系统本质相一致,并利用机器人操作中的时序连续性。具体而言,VLA-Pruner采用双层重要性标准保留视觉标记:视觉-语言预填充注意力用于衡量语义层面的相关性,通过时序平滑估计的动作解码注意力用于衡量动作层面的重要性。基于此标准,VLA-Pruner提出了一种新颖的双层标记选择策略,在给定计算预算的情况下,自适应地保留一组紧凑且信息丰富的视觉标记,以支持语义理解和动作执行。实验表明,VLA-Pruner在多种VLA架构和不同机器人任务中均实现了最先进的性能。
Summary / 总结
The research addresses the computational cost of deploying Vision-Language-Action (VLA) models in real time by proposing VLA-Pruner, a plug-and-play token pruning method that accounts for both the semantic-understanding and action-execution sides of VLA. VLA-Pruner uses a dual-level importance criterion, combining vision-language prefill attention with action decode attention estimated through temporal smoothing, to retain a compact set of visual tokens under a given compute budget. Experiments show state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
VLA-Pruner 是一种针对 VLA 模型的即插即用视觉标记剪枝方法,旨在提高推理效率的同时保持性能。它同时考虑 VLA 的语义和动作两个方面,采用结合预填充注意力和经时序平滑估计的动作解码注意力的双层重要性标准,以解决现有方法的局限性。实验结果表明,VLA-Pruner 在多种 VLA 架构和机器人任务中达到了最先进的性能。
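The dual-level criterion in VLA-Pruner can be sketched in a few lines. The abstract names the two signals (prefill attention and temporally smoothed action decode attention) but not the exact smoothing or selection rule, so the exponential moving average and the round-robin merge below are assumptions, as are all function names.

```python
def ema_update(previous, current, decay=0.8):
    """Temporal smoothing of per-token action attention across control steps."""
    if previous is None:
        return list(current)
    return [decay * p + (1.0 - decay) * c for p, c in zip(previous, current)]


def rank_tokens(scores):
    """Token indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def select_tokens(prefill_attn, action_attn_est, budget):
    """Alternate between the semantic and action rankings until the
    token budget is filled, so both levels are represented."""
    kept = []
    for sem_idx, act_idx in zip(rank_tokens(prefill_attn),
                                rank_tokens(action_attn_est)):
        for idx in (sem_idx, act_idx):
            if idx not in kept and len(kept) < budget:
                kept.append(idx)
    return sorted(kept)


# Five visual tokens; attention values are made up for illustration.
prefill_attn = [0.9, 0.1, 0.05, 0.6, 0.0]       # semantic-level relevance
action_attn_est = [0.0, 0.1, 0.8, 0.05, 0.7]    # smoothed action-level importance

kept = select_tokens(prefill_attn, action_attn_est, budget=3)
smoothed = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.5)
print(kept, smoothed)
```

Selecting by prefill attention alone would keep tokens 0, 3, and 1 here and drop token 2, which the action-level signal rates highest.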
Device-Guided Music Transfer
Authors: Manh Pham Hung, Changshuo Hu, Ting Dang, Dong Ma
First: 2025-11-21T10:57:11+00:00 · Latest: 2025-11-21T10:57:11+00:00
Abstract
Device-guided music transfer adapts playback across unseen devices for users who lack them. Existing methods mainly focus on modifying the timbre, rhythm, harmony, or instrumentation to mimic genres or artists, overlooking the diverse hardware properties of the playback device (i.e., speaker). Therefore, we propose DeMT, which processes a speaker's frequency response curve as a line graph using a vision-language model to extract device embeddings. These embeddings then condition a hybrid transformer via feature-wise linear modulation. Fine-tuned on a self-collected dataset, DeMT enables effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
中文标题/摘要
标题:设备引导的音乐迁移
设备引导的音乐迁移针对未见过的设备调整播放效果,服务于没有这些设备的用户。现有方法主要集中在修改音色、节奏、和声或乐器以模拟流派或艺术家,而忽视了播放设备(即扬声器)的多样化硬件特性。因此,我们提出了DeMT,它使用视觉语言模型将扬声器的频率响应曲线作为折线图处理,以提取设备嵌入。这些嵌入随后通过特征级线性调制对混合Transformer进行条件化。DeMT在自收集的数据集上微调后,能够实现有效的扬声器风格迁移和对未见设备的鲁棒少样本适应,支持设备风格增强和质量提升等应用。
Summary / 总结
Device-guided music transfer aims to adapt music playback across different devices for users without the specific device. The existing methods primarily focus on altering timbre, rhythm, harmony, or instrumentation to mimic genres or artists, but overlook the unique hardware properties of the playback device. To address this, the authors propose DeMT, which uses a vision-language model to process the speaker's frequency response curve and extract device embeddings. These embeddings condition a hybrid transformer via feature-wise linear modulation, enabling effective speaker-style transfer and robust few-shot adaptation for unseen devices, supporting applications like device-style augmentation and quality enhancement.
设备引导的音乐迁移旨在为没有某款设备的用户模拟该设备的播放效果。提出的DeMT方法通过视觉-语言模型处理扬声器的频率响应曲线,提取设备嵌入。这些嵌入通过特征级线性调制对混合Transformer进行条件化,以实现有效的扬声器风格迁移和对未见设备的鲁棒少样本适应,支持设备风格增强和质量提升等应用。
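Feature-wise linear modulation, the conditioning mechanism named in the DeMT abstract, scales and shifts each feature channel using values predicted from a conditioning vector. The sketch below shows only that generic operation in pure Python; the layer shapes, weights, and function names are hypothetical and unrelated to DeMT's actual architecture.

```python
def linear(weights, bias, vector):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def film(features, device_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Apply y = gamma(c) * x + beta(c) channel-wise to every frame,
    where c is the device embedding."""
    gamma = linear(w_gamma, b_gamma, device_emb)
    beta = linear(w_beta, b_beta, device_emb)
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in features]


# Two audio frames with two feature channels each (toy numbers).
features = [[1.0, 1.0], [2.0, 3.0]]
device_emb = [1.0, 2.0]

# Hypothetical projection weights producing gamma = [1, 2], beta = [0.5, -0.5].
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_beta, b_beta = [[0.0, 0.0], [0.0, 0.0]], [0.5, -0.5]

modulated = film(features, device_emb, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)
```

Because only `gamma` and `beta` depend on the device, swapping the device embedding changes how the same audio features are scaled and shifted without altering the rest of the network.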
ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better
Authors: Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, Shanghang Zhang
First: 2025-11-21T10:11:17+00:00 · Latest: 2025-11-21T10:11:17+00:00
Comments: 16 pages
Abstract
Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged in the LLMs domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. Therefore, we propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, thereby making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Eventually, the pixel coordinates of the selected visual hint and its reliability are incorporated into thinking with a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves $2.3\%$ improvement on the MathVista within MIMO-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.
中文标题/摘要
标题:ChainV:原子视觉提示使多模态推理更短更好
近期多模态推理模型在文本和视觉领域展现了令人印象深刻的性能。然而,即使是最先进的模型在生成长推理链时也会表现出冗余的自我反思。虽然在大语言模型领域已经出现了无需训练的CoT压缩方法,但它们依赖于静态视觉参考,因此对多模态推理的增益有限。因此,我们提出了ChainV框架,该框架动态地将视觉提示整合到推理过程中,从而使得多模态推理更短更好。具体而言,ChainV首先基于上一步推理进行粗略的视觉补丁选择,然后通过识别根据平均注意力强度最具有代表性的原子视觉提示来进行细化。此外,ChainV引入了一种基于一致性的评估机制来评估所选提示的可靠性,从而引导模型适当地调整其自我反思的程度。最终,所选视觉提示的像素坐标及其可靠性被纳入伯努利随机过程中的思考。实验表明,我们的方法显著提高了推理准确性和效率,特别是在数学密集型基准测试中,视觉提示对于多步符号推理至关重要。例如,ChainV在MIMO-VL-RL的MathVista中实现了2.3%的改进,同时将推理延迟降低了51.4%,并缩短了输出词元长度24.5%。
Summary / 总结
ChainV is a framework that integrates dynamic visual hints into the reasoning process to enhance multimodal reasoning, particularly for math-intensive tasks. It selects and refines visual hints based on attention intensity and consistency, and incorporates them into the reasoning process. Experiments show that ChainV improves reasoning accuracy and efficiency, achieving a 2.3% improvement on MathVista and reducing inference latency by 51.4% and output token length by 24.5%.
ChainV 是一个框架,通过动态集成视觉提示来增强多模态推理,特别适用于数学密集型任务。它基于注意力强度和一致性选择和精炼视觉提示,并将其纳入推理过程。实验表明,ChainV 提高了推理准确性和效率,例如在 MathVista 上实现了 2.3% 的改进,并将推理延迟减少了 51.4%,输出令牌长度减少了 24.5%。
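The hint selection steps in ChainV (coarse patch candidates, refinement by averaged attention, and a Bernoulli draw governed by reliability) can be illustrated roughly as below. The abstract leaves the details open, so everything here is an assumed reading: the head-averaged attention, the use of reliability as the Bernoulli probability, and the hint string format are all hypothetical.

```python
import random


def select_atomic_hint(attn_by_head, candidate_patches):
    """Among coarsely selected patches, pick the one with the highest
    attention averaged over heads."""
    num_heads = len(attn_by_head)

    def mean_attention(patch):
        return sum(head[patch] for head in attn_by_head) / num_heads

    return max(candidate_patches, key=mean_attention)


def maybe_inject_hint(patch_xy, reliability, rng):
    """Bernoulli draw: include the hint with probability `reliability`."""
    if rng.random() < reliability:
        x, y = patch_xy
        return f"<hint x={x} y={y} reliability={reliability:.2f}>"
    return None


# Two attention heads over three image patches (made-up values).
attn_by_head = [[0.1, 0.6, 0.3],
                [0.2, 0.2, 0.6]]
rng = random.Random(0)

best = select_atomic_hint(attn_by_head, candidate_patches=[1, 2])
always = maybe_inject_hint((64, 128), reliability=1.0, rng=rng)
never = maybe_inject_hint((64, 128), reliability=0.0, rng=rng)
print(best, always, never)
```

A low reliability means the hint is rarely injected, which leaves more room for the model's own self-reflection; a high reliability injects it almost every time.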