arXiv 论文速递

2025-11-19 03:28
Snapshot: 20251119_0328
Instruction Tuning Chronologically Consistent Language Models
Authors: Songrun He, Linying Lv, Asaf Manela, Jimmy Wu
First: 2025-10-13T17:45:24+00:00 · Latest: 2025-11-17T18:56:19+00:00
Abstract
We introduce a family of chronologically consistent, instruction-tuned large language models to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, ensuring strict temporal separation from any post-cutoff data. The resulting framework offers (i) a simple, conversational chat interface, (ii) fully open, fixed model weights that guarantee replicability, and (iii) a conservative lower bound on forecast accuracy, isolating the share of predictability that survives once training leakage is removed. Together, these features provide researchers with an easy-to-use generative AI tool useful for a wide range of prediction tasks that is free of lookahead bias.
中文标题/摘要
标题:按时间顺序一致的语言模型指令调优
我们介绍了一组按时间顺序一致的指令调优大型语言模型,以消除前瞻偏差。每个模型仅在明确的知识截止日期之前的数据上进行训练,确保与任何后截止日期数据之间严格的时间分离。该框架提供以下功能:(i) 简单的对话式聊天界面,(ii) 完全开放且固定的模型权重,确保可复制性,以及 (iii) 保守的预测准确度下限,隔离在消除训练泄漏后仍存在的可预测性份额。这些功能共同为研究人员提供了一个易于使用的生成式AI工具,适用于各种预测任务,且无前瞻偏差。
Summary / 总结
This study introduces a family of chronologically consistent, instruction-tuned large language models designed to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, keeping it temporally separated from post-cutoff data. The framework offers a simple chat interface, fully open and fixed model weights for replicability, and a conservative lower bound on forecast accuracy that isolates the predictability remaining once training leakage is removed.
本研究提出了一组按时间顺序一致的指令调优大型语言模型,以消除前瞻偏差。每个模型仅使用明确的知识截止日期之前可获得的数据进行训练,从而与截止日期之后的数据保持严格的时间分离。该框架提供简单的聊天界面、完全开放且固定的模型权重(保证可复现性),以及保守的预测准确度下限,该下限隔离了消除训练泄漏后仍然存在的可预测性。
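The temporal separation described above can be illustrated with a minimal sketch. It assumes each training document carries a publication date; the `filter_by_cutoff` helper, the field names, and the toy corpus are hypothetical and are not taken from the paper.

```python
from datetime import date

def filter_by_cutoff(documents, cutoff):
    """Keep only documents dated strictly before the knowledge cutoff."""
    return [doc for doc in documents if doc["date"] < cutoff]

# Hypothetical toy corpus; the field names are assumptions for illustration.
corpus = [
    {"text": "Q3 earnings report", "date": date(2019, 10, 15)},
    {"text": "Year-end market review", "date": date(2019, 12, 31)},
    {"text": "Post-cutoff news story", "date": date(2020, 1, 2)},
]

cutoff = date(2020, 1, 1)
train_docs = filter_by_cutoff(corpus, cutoff)
kept_texts = [doc["text"] for doc in train_docs]
print(kept_texts)
```

Only the two 2019 documents survive the filter, so a model trained on `train_docs` cannot have seen the post-cutoff story when it is asked to forecast events after the cutoff.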
Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting
Authors: Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu
First: 2025-11-17T18:37:41+00:00 · Latest: 2025-11-17T18:37:41+00:00
Comments: Submitting for Neurocomputing
Abstract
We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.
中文标题/摘要
标题:IC-Light的免训练多视图扩展:文本位置感知的场景重新光照
我们引入了GS-Light,一种基于高斯点表示(3DGS)的高效、文本引导的重新光照流水线。GS-Light实现了一种单输入扩散模型的无训练扩展,以处理多视图输入。给定用户提示,可能包含照明方向、颜色、强度或参考对象等信息,我们使用大型视觉语言模型(LVLM)解析提示以提取照明先验。利用现成的几何和语义估计器(深度、表面法线和语义分割),我们将这些照明先验与视图几何约束融合,计算照明图并为每个视图生成初始潜在代码。这些精心推导的初始潜在代码引导扩散模型生成更符合用户期望的重新光照输出,特别是在照明方向方面。通过将多视图渲染图像与初始潜在代码输入到我们的多视图重新光照模型中,我们生成了高保真度、艺术性重新光照的图像。最后,我们用重新光照后的外观微调3DGS场景,以获得完全重新光照的3D场景。我们在室内和室外场景上评估了GS-Light,将其与包括单视图重新光照、视频重新光照和场景编辑方法在内的最新基线进行比较。使用定量指标(多视图一致性、成像质量、美学评分、语义相似度等)和定性评估(用户研究),GS-Light在基线之上展示了持续的改进。代码和资产将在发表后提供。
Summary / 总结
GS-Light is a training-free pipeline for text-guided, position-aware relighting of 3D scenes represented via Gaussian Splatting (3DGS). It uses a large vision-language model to parse the user prompt into lighting priors, then fuses those priors with view-geometry constraints, obtained from off-the-shelf depth, surface-normal, and semantic-segmentation estimators, to compute illumination maps and initial latent codes for each view. These latents guide a multi-view extension of a single-input diffusion model toward relit images that better match the requested lighting direction, and the 3DGS scene is then fine-tuned on the relit views. On indoor and outdoor scenes, quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity) and user studies show consistent improvements over per-view relighting, video relighting, and scene editing baselines.
GS-Light 是一个无需训练的流水线,用于对以高斯点云(3DGS)表示的 3D 场景进行文本引导、位置感知的重新照明。它使用大型视觉语言模型将用户提示解析为光照先验,然后借助现成的深度、表面法线和语义分割估计器,将这些先验与视图几何约束融合,计算光照图并为每个视图生成初始潜在代码。这些潜在代码引导单输入扩散模型的多视图扩展生成更符合所要求光照方向的重新照明图像,随后用重新照明后的视图微调 3DGS 场景。在室内和室外场景上,定量指标(多视图一致性、成像质量、美学评分、语义相似度)和用户研究均表明,GS-Light 相比单视图重新照明、视频重新照明和场景编辑基线有持续的改进。
Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
Authors: Jitesh Chavan, Rohit Lal, Anand Kamat, Mengjia Xu
First: 2025-11-14T12:44:02+00:00 · Latest: 2025-11-17T18:00:42+00:00
Abstract
State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block's terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-mamba" variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ (256$\times$256) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
中文标题/摘要
标题:Arcee:基于Mamba SSM的生成式视觉建模可微循环状态链
状态空间模型(SSMs),特别是Mamba,越来越多地被用于长上下文序列建模,通过输入依赖的、因果的选择性扫描操作提供线性时间聚合。沿着这一思路,最近的“Mamba-for-vision”变体主要探索多种扫描顺序以放松严格的因果性要求(例如,对于图像等非序列信号)。与保留跨块记忆不同,Mamba中选择性扫描操作的常规形式从零重新初始化每个块的状态空间动力学,丢弃前一个块的终端状态空间表示(SSR)。Arcee,一种跨块循环状态链,重用每个块的终端状态空间表示作为下一个块的初始条件。跨块的传递构建为一个可微边界映射,其雅可比矩阵允许端到端梯度流过终端边界。为了实用性,Arcee与所有先前的“vision-mamba”变体兼容,无参数,并且具有恒定、可忽略的成本。作为一种建模视角,我们视终端SSR为由因果输入扫描诱导的轻微方向先验,而不是非序列信号本身的估计器。为了量化影响,在CelebA-HQ(256×256)的无条件生成中,使用Flow Matching,Arcee将单扫描顺序Zigzag Mamba基线的FID从82.81降低到15.33(降低5.4倍)。高效的CUDA内核和训练代码将被发布以支持严格的和可重复的研究。
Summary / 总结
Arcee extends Mamba state-space models for generative vision tasks by reusing each block's terminal state-space representation (SSR) as the initial condition for the next block, forming a cross-block recurrent state chain. The handoff is a differentiable boundary map, so gradients flow end to end across block boundaries. Arcee is parameter-free, compatible with prior vision-Mamba variants, and adds constant, negligible cost. For unconditional generation on CelebA-HQ (256×256) with Flow Matching, it reduces FID from 82.81 to 15.33 (5.4× lower) on a single-scan-order Zigzag Mamba baseline.
Arcee 是一种跨块递归状态链,它将一个块的终端状态空间表示作为下一个块的初始条件,通过一个可微边界映射实现跨块的端到端梯度流动。该方法在 CelebA-HQ 上显著改善了无条件生成,将 Zigzag Mamba 基线的 FID 从 82.81 降低到 15.33,显示出 FID 分数降低了 5.4 倍。Arcee 与所有先前的 'vision-mamba' 变体兼容,无参数,且具有可忽略的计算成本。
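The difference between resetting each block's state and chaining terminal states can be sketched with a toy scalar recurrence. This is an illustration under strong assumptions, not the paper's selective-scan kernel: the recurrence h_t = a * h_{t-1} + b * x_t, the per-block `(a, b)` values, and the identity handoff between blocks are all hypothetical simplifications.

```python
def scan_block(xs, a, b, h0=0.0):
    """Run the linear recurrence h_t = a * h_{t-1} + b * x_t over one block.

    Returns the per-step states and the terminal state.
    """
    h = h0
    states = []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states, h

def run_blocks(xs, params, chain):
    """Apply blocks in sequence; each block consumes the previous block's outputs.

    chain=False resets every block's state to zero (the conventional setup);
    chain=True hands the previous block's terminal state to the next block.
    """
    h_terminal = 0.0
    for a, b in params:
        h0 = h_terminal if chain else 0.0
        xs, h_terminal = scan_block(xs, a, b, h0)
    return xs

inputs = [1.0, 0.0, 0.0]
params = [(0.5, 1.0), (0.5, 1.0)]  # hypothetical (a, b) per block

reset_out = run_blocks(inputs, params, chain=False)
chained_out = run_blocks(inputs, params, chain=True)
print(reset_out)
print(chained_out)
```

With chaining, the second block starts from the first block's terminal state (0.25) instead of zero, so its outputs differ from the reset version at every step. No parameters are added, and because the handoff here is a plain value copy, a gradient-based framework would propagate gradients through it.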
CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding
Authors: Shrenik Patel, Daivik Patel
First: 2025-11-17T17:56:14+00:00 · Latest: 2025-11-17T17:56:14+00:00
Abstract
Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant blocks and attends over both the retrieved and local context for precise, long-range reasoning. CacheFlow is drop-in, architecture-agnostic, and requires no fine-tuning. Experiments on both offline and streaming VQA benchmarks demonstrate that CacheFlow outperforms current strong baselines, while processing up to 87% less tokens. Our dual approach enables VLMs to be both efficient and context-aware, paving the way for practical long-form video understanding.
中文标题/摘要
标题:CacheFlow:压缩流式内存以实现高效长视频理解
长视频问答(VQA)使当前的视觉-语言模型(VLMs)不堪重负,因为注意力和键值(KV)缓存会随着运行时间增长,迫使它们要么进行昂贵的推理,要么使用近视的滑动窗口。我们引入了CacheFlow,这是一种无需训练的流水线,它将动态令牌删除(DTD)与压缩的长期记忆相结合。DTD通过余弦相似度在线删除每块的令牌,存活的令牌被压缩到固定大小的块中。这种基于每帧的在线处理使我们的方法从根本上适合于实时流式VQA。随着块的处理,每个块的键将由小型循环编码器总结形成检索索引,而块的完整KV对则被卸载并在稍后重新激活以进行生成,从而保持答案的准确性。在推理时,基于共识的检索机制仅检索最相关的Top-K块,并在检索到的上下文和局部上下文之间进行注意力处理,以实现精确的长距离推理。CacheFlow是即插即用的,架构无关的,并且不需要微调。在离线和流式VQA基准测试中,CacheFlow不仅优于当前的强基线,而且处理的令牌量最多可减少87%。我们的双管齐下方法使VLMs既高效又具有上下文意识,为实用的长视频理解铺平了道路。
Summary / 总结
CacheFlow addresses long-form video question answering with a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. Each block's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based mechanism retrieves only the Top-K most relevant blocks and attends over both retrieved and local context. Experiments on offline and streaming VQA benchmarks show that CacheFlow outperforms strong baselines while processing up to 87% fewer tokens.
CacheFlow通过结合动态令牌丢弃(DTD)和压缩的长期记忆,提出了一种无需训练的流水线来解决长视频问答的问题。DTD根据与前一帧的余弦相似度在线修剪每个图像块的令牌,并将存活的令牌打包成固定大小的块。每个块的键由一个小型循环编码器进行总结以形成检索索引,而完整的KV对则被卸载并在后续重新载入用于生成,以保持答案的准确性。推理时,基于共识的检索机制仅检索最相关的Top-K块,并同时关注检索到的上下文和局部上下文。在离线和流式VQA基准上的实验表明,CacheFlow优于强基线,同时处理的令牌数量最多减少87%。
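The token-dropping step can be sketched as follows. This is a minimal illustration, assuming one embedding vector per patch and a fixed similarity threshold; the `threshold` value, the function names, and the toy vectors are hypothetical, and the sketch leaves the final block shorter rather than padding it.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def surviving_patches(prev_frame, cur_frame, threshold=0.9):
    """Indices of patches whose embedding changed enough since the previous frame."""
    return [
        i
        for i, (prev, cur) in enumerate(zip(prev_frame, cur_frame))
        if cosine(prev, cur) < threshold
    ]

def pack_into_blocks(tokens, block_size):
    """Group surviving tokens into consecutive blocks of at most block_size."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]

prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cur_frame = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.1]]

kept = surviving_patches(prev_frame, cur_frame)
blocks = pack_into_blocks([0, 1, 2, 3, 4], block_size=2)
print(kept, blocks)
```

Patch 0 is identical to the previous frame and patch 2 is nearly identical, so both are dropped; only patch 1, whose embedding changed direction, is kept.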
CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product
Authors: Kaiwen Xue, Chenglong Li, Zhonghong Ou, Guoxin Zhang, Kaoyan Lu, Shuai Lyu, Yifan Zhu, Ping Zong, Junpeng Ding, Xinyu Liu, Qunlin Chen, Weiwei Qin, Yiran Shen, Jiayi Cen
Venue: AAAI poster
First: 2025-11-17T17:34:05+00:00 · Latest: 2025-11-17T17:34:05+00:00
Comments: 13 pages, 3 figures, The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026), Paper has been accepted for a poster presentation
Abstract
Human-defined creativity is highly abstract, posing a challenge for multimodal large language models (MLLMs) to comprehend and assess creativity that aligns with human judgments. The absence of an existing benchmark further exacerbates this dilemma. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering the multiple dimensions from creative idea to process to products; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset, consisting of 2.2K diverse-sourced multimodal data, 79.2K human feedbacks and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine these human feedbacks to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on the CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation compared to state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision.
中文标题/摘要
标题:CreBench:从创意到过程再到产品的与人类一致的创造力评估基准
人类定义的创造力非常抽象,给多模态大型语言模型(MLLMs)理解和评估与人类判断一致的创造力带来了挑战。由于缺乏现有的基准,这一挑战更加突出。为此,我们提出了CreBench,它包含两个关键组成部分:1)涵盖从创意到过程再到产品的多个维度的评估基准;2)CreMIT(创造力多模态指令调优数据集),一个多模态创造力评估数据集,包含2200条来源多样的多模态数据、79200条人类反馈和470万条多类型指令。具体来说,为了确保MLLMs能够处理各种与创造力相关的查询,我们提示GPT对这些人类反馈进行润色,以激活更强的创造力评估能力。CreBench 为构建理解与人类一致的创造力的MLLMs奠定了基础。基于CreBench,我们对开源通用MLLMs进行微调,产生了CreExpert,一个多模态创造力评估专家模型。广泛的实验表明,提出的CreExpert模型在与人类创造力评估的一致性方面显著优于最先进的MLLMs,包括最先进的GPT-4V和Gemini-Pro-Vision。
Summary / 总结
CreBench is designed to evaluate human-aligned creativity across various dimensions from ideas to processes to products. It includes CreMIT, a multimodal dataset with 2.2K diverse multimodal data, 79.2K human feedbacks, and 4.7M instructions. By prompting GPT to refine human feedbacks, CreBench enhances creativity assessment capabilities. Experiments show that the fine-tuned CreExpert model outperforms state-of-the-art MLLMs like GPT-4V and Gemini-Pro-Vision in aligning with human creativity evaluations.
CreBench 是一个评估人类对齐创造力的基准,涵盖了创意想法、过程和产品等多个维度。它包含了一个名为CreMIT的多模态数据集,包含2.2K多样化的数据、79.2K人类反馈和4.7M指令。通过细化人类反馈,该基准增强了MLLMs的创造力评估能力。基于CreBench微调开源的MLLMs,产生了CreExpert模型,该模型在与人类创造力评估的对齐方面优于包括GPT-4V和Gemini-Pro-Vision在内的最先进的模型。
Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
Authors: Quoc-Huy Trinh, Mustapha Abdullahi, Do Duy Hung Trinh, Bo Zhao, Debesh Jha
First: 2025-11-14T11:21:48+00:00 · Latest: 2025-11-17T17:08:31+00:00
Abstract
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
Summary / 总结
Viper-F1 is a Hybrid State-Space Vision-Language Model that replaces Transformer-based cross-attention, whose quadratic complexity hinders efficiency, with Liquid State-Space Dynamics that keep inference linear-time. To improve visual grounding, it adds a Token-Grid Correlation Module that computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. Experimental results across multiple benchmarks show that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
Viper-F1旨在提高资源受限场景下多模态理解的效率和准确性。它使用液态状态空间动力学代替基于Transformer的交叉注意力来降低计算成本。此外,它引入了一个Token-Grid相关模块,通过计算文本和图像之间的轻量级相关性来增强视觉定位。实验结果表明,Viper-F1在细粒度推理任务上的表现良好,并且具有比现有方法更快的推理速度。
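FiLM conditioning, mentioned in the abstract, applies a per-channel scale and shift to a feature vector. The sketch below shows only that generic operation; how Viper-F1 derives the scale and shift from token-grid correlations is not specified in the abstract, so the `gamma` and `beta` values here are hypothetical placeholders.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

# Hypothetical per-channel conditioning values; in a real model these would be
# predicted from the conditioning signal (for example, text-image correlations).
features = [1.0, 2.0, 3.0]
gamma = [2.0, 0.5, 1.0]
beta = [0.0, 1.0, -1.0]

modulated = film(features, gamma, beta)
print(modulated)
```

Because the modulation is elementwise, its cost grows linearly with the number of features and does not add any quadratic term of the kind cross-attention introduces.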
PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning
Authors: Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu
First: 2025-08-14T10:03:47+00:00 · Latest: 2025-11-17T16:36:12+00:00
Abstract
Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust of decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, AUC, LLM-J.) while balancing computational costs, pushing a new paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems.
中文标题/摘要
标题:PASS:概率代理超网络采样以实现可解释和自适应的胸部X光推理
现有的工具增强代理系统在现实世界中受到以下限制:(i) 黑盒推理步骤削弱了决策制定的信任并带来安全风险,(ii) 贫乏的多模态整合,这对医疗保健任务至关重要,以及(iii) 刚性且计算效率低的代理管道。我们引入了PASS(概率代理超网络采样),这是第一个在胸部X光(CXR)推理中解决这些挑战的多模态框架。PASS 适应性地在多工具图上采样代理工作流,生成带有可解释概率的决策路径。鉴于复杂的CXR推理任务和多模态医疗数据,PASS 利用其在代理超网络上学习的任务条件分布。因此,它在每个超网络层上选择最合适的工具,提供带有概率注释的轨迹以供事后审计,并直接增强医疗AI的安全性。PASS 还不断将关键发现压缩到不断发展的个性化记忆中,同时动态决定是否加深其推理路径或调用早期退出以提高效率。为了优化平衡性能和成本的帕累托前沿,我们设计了一种新颖的三阶段训练程序,包括专家知识预热、对比路径排名和成本感知强化学习。为了促进严格的评估,我们引入了CAB-E,这是一个全面的多步骤、安全关键、自由形式CXR推理基准。跨多个基准的实验验证了PASS在多个指标(如准确性、AUC、LLM-J)上显著优于强基线,同时平衡计算成本,推动了可解释、自适应和多模态医疗代理系统的新范式转变。
Summary / 总结
PASS addresses the limitations of existing tool-augmented agentic systems in healthcare by introducing a multimodal framework that enhances trust, integrates multimodal data effectively, and offers efficient and interpretable decision-making. It uses a learned distribution over a supernet to adaptively select tools, providing probability-annotated decision paths for audits and improving safety. PASS also dynamically compresses findings into a personalized memory and uses a three-stage training procedure to balance performance and cost. Experiments show that PASS outperforms strong baselines in accuracy, AUC, and LLM-J while managing computational costs.
PASS通过引入一个多模态框架解决了现有医疗健康领域工具增强型代理系统的局限性,增强了信任度,有效整合了多模态数据,并提供了高效且可解释的决策。它利用在超网络上学习到的分布来自适应地选择工具,提供带有概率注释的决策路径以供事后审计,从而提高安全性。PASS还动态地将发现压缩到个性化记忆中,并使用三阶段训练过程来平衡性能和成本。实验表明,PASS在准确度、AUC和LLM-J等多个指标上优于强基线,同时兼顾计算成本。
Ghost in the Transformer: Tracing LLM Lineage with SVD-Fingerprint
Authors: Suqing Wang, Ziyang Ma, Xinyi Li, Zuchao Li
Venue: AAAI 2026 Oral
First: 2025-11-09T13:57:59+00:00 · Latest: 2025-11-17T16:20:58+00:00
Comments: Accepted at AAAI 2026 (Oral)
Abstract
Large Language Models (LLMs) have rapidly advanced and are widely adopted across diverse fields. Due to the substantial computational cost and data requirements of training from scratch, many developers choose to fine-tune or modify existing open-source models. While most adhere to open-source licenses, some falsely claim original training despite clear derivation from public models. This raises pressing concerns about intellectual property protection and highlights the need for reliable methods to verify model provenance. In this paper, we propose GhostSpec, a lightweight yet effective method for verifying LLM lineage without access to training data or modification of model behavior. Our approach constructs compact and robust fingerprints by applying singular value decomposition (SVD) to invariant products of internal attention weight matrices, effectively capturing the structural identity of a model. Unlike watermarking or output-based methods, GhostSpec is fully data-free, non-invasive, and computationally efficient. It demonstrates strong robustness to sequential fine-tuning, pruning, block expansion, and even adversarial transformations. Extensive experiments show that GhostSpec can reliably trace the lineage of transformed models with minimal overhead. By offering a practical solution for model verification and reuse tracking, our method contributes to the protection of intellectual property and fosters a transparent, trustworthy ecosystem for large-scale language models.
中文标题/摘要
标题:Transformer中的幽灵:通过SVD指纹追踪LLM谱系
大型语言模型(LLMs)迅速发展并在多个领域广泛应用。由于从头开始训练所需的大量计算资源和数据需求,许多开发者选择对现有的开源模型进行微调或修改。虽然大多数模型遵循开源许可,但有些却虚假声称原始训练,尽管明显源自公共模型。这引发了关于知识产权保护的紧迫问题,并突显了验证模型谱系的可靠方法的需求。在本文中,我们提出GhostSpec,这是一种无需访问训练数据或修改模型行为的轻量级且有效的方法,用于验证LLM谱系。我们的方法通过将奇异值分解(SVD)应用于内部注意力权重矩阵的不变乘积来构建紧凑且稳健的指纹,有效地捕捉模型的结构身份。与水印或基于输出的方法不同,GhostSpec完全无需数据、非侵入且计算效率高。它对顺序微调、剪枝、块扩展甚至对抗性变换都表现出很强的鲁棒性。大量实验表明,GhostSpec可以以最小的开销可靠地追踪转换模型的谱系。通过提供一种实用的模型验证和重用跟踪解决方案,我们的方法有助于保护知识产权并促进大规模语言模型的透明、可信赖生态系统。
Summary / 总结
This paper addresses the problem of verifying the lineage of Large Language Models (LLMs) to protect intellectual property. It introduces GhostSpec, which builds compact fingerprints by applying singular value decomposition (SVD) to invariant products of internal attention weight matrices, without requiring access to training data or modifying model behavior. GhostSpec is robust to sequential fine-tuning, pruning, block expansion, and adversarial transformations, and extensive experiments show that it reliably traces model lineage with minimal overhead.
本文旨在通过验证大型语言模型(LLM)的血统来保护知识产权。提出了GhostSpec方法,利用奇异值分解(SVD)创建LLM的稳健指纹,无需访问训练数据或修改模型行为即可追踪模型的来源。GhostSpec对各种变换和微调技术具有很强的鲁棒性,大量实验验证了其在最小开销下可靠地追踪模型血统的有效性。
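Why a product of attention matrices can serve as a stable fingerprint can be shown with a small worked example. The sketch assumes the invariant product is the query-key product W_q W_k^T and uses 2x2 matrices with a closed-form singular-value formula; the specific product, the toy weights, and the helper names are assumptions for illustration, not the paper's exact construction.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def singular_values_2x2(m):
    """Closed-form singular values of a 2x2 matrix, largest first."""
    frob = sum(x * x for row in m for x in row)      # s1^2 + s2^2
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]      # |det| = s1 * s2
    disc = math.sqrt(max(frob * frob - 4.0 * det * det, 0.0))
    return [math.sqrt((frob + disc) / 2.0), math.sqrt(max((frob - disc) / 2.0, 0.0))]

def fingerprint(w_q, w_k):
    """Singular values of the query-key product W_q W_k^T."""
    return singular_values_2x2(matmul(w_q, transpose(w_k)))

# Toy weights (hypothetical).
w_q = [[2.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]

# Disguise the weights with an invertible map R: W_q -> W_q R, W_k -> W_k R^{-T}.
r = [[1.0, 1.0], [0.0, 1.0]]
r_inv_t = [[1.0, 0.0], [-1.0, 1.0]]
w_q_disguised = matmul(w_q, r)
w_k_disguised = matmul(w_k, r_inv_t)

original_fp = fingerprint(w_q, w_k)
disguised_fp = fingerprint(w_q_disguised, w_k_disguised)
print(original_fp, disguised_fp)
```

The disguised weights differ from the originals entry by entry, yet the product W_q W_k^T, and therefore its singular values, is unchanged, because R cancels against its inverse inside the product. Attention logits depend on the weights only through this product, which is why such a reparameterization leaves model behavior intact while defeating a naive weight comparison.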
FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI
Authors: Yuhang Peng, Yizhou Pan, Xinning He, Jihaoyu Yang, Xinyu Yin, Han Wang, Xiaoji Zheng, Chao Gao, Jiangtao Gong
Venue: AAAI 2026 Oral
First: 2025-11-17T15:58:46+00:00 · Latest: 2025-11-17T15:58:46+00:00
Comments: 9 pages, 4 figures
Abstract
As embodied intelligence emerges as a core frontier in artificial intelligence research, simulation platforms must evolve beyond low-level physical interactions to capture complex, human-centered social behaviors. We introduce FreeAskWorld, an interactive simulation framework that integrates large language models (LLMs) for high-level behavior planning and semantically grounded interaction, informed by theories of intention and social cognition. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks. To validate the framework, we extend the classic Vision-and-Language Navigation (VLN) task into an interaction-enriched Direction Inquiry setting, wherein agents can actively seek and interpret navigational guidance. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data to support training and evaluation of embodied AI systems. We benchmark VLN models and human participants under both open-loop and closed-loop settings. Experimental results demonstrate that models fine-tuned on FreeAskWorld outperform their original counterparts, achieving enhanced semantic understanding and interaction competency. These findings underscore the efficacy of socially grounded simulation frameworks in advancing embodied AI systems toward sophisticated high-level planning and more naturalistic human-agent interaction. Importantly, our work underscores that interaction itself serves as an additional information modality.
中文标题/摘要
标题:FreeAskWorld:面向人类中心的具身AI互动闭环模拟器
随着具身智能成为人工智能研究的核心前沿领域,模拟平台必须超越低级物理交互,以捕捉复杂的人类中心社会行为。我们介绍了FreeAskWorld,这是一种集成了大型语言模型(LLMs)进行高层次行为规划和语义驱动交互的互动模拟框架,这些交互受到意图理论和社会认知理论的启发。该框架支持可扩展、真实的具身人机模拟,并包括一个针对多样化具身任务定制的数据生成流水线。为了验证该框架,我们将经典的视觉-语言导航(VLN)任务扩展到一个增强互动的路线查询设置中,其中代理可以主动寻求和解释导航指导。我们介绍了并公开发布了FreeAskWorld,这是一个大规模基准数据集,包含重建的环境、六种不同的任务类型、16个核心对象类别、63,429个标注样本帧以及超过17小时的互动数据,以支持具身AI系统的训练和评估。我们对VLN模型和人类参与者在开环和闭环设置下进行了基准测试。实验结果表明,基于FreeAskWorld微调的模型优于其原始版本,实现了增强的语义理解和互动能力。这些发现强调了社会驱动的模拟框架在推动具身AI系统向复杂高层次规划和更自然的人机互动方面的作用。重要的是,我们的工作强调了互动本身作为一种额外的信息模态的作用。
Language-Guided Invariance Probing of Vision-Language Models
Authors: Jae Joong Lee
First: 2025-11-17T15:35:49+00:00 · Latest: 2025-11-17T15:35:49+00:00
Abstract
Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
中文标题/摘要
标题:语言引导的视觉-语言模型不变性探查
近期的视觉-语言模型(VLMs)如CLIP、OpenCLIP、EVA02-CLIP和SigLIP在零样本情况下表现出色,但不清楚它们在受控语言扰动下的可靠响应情况。我们引入了语言引导的不变性探查(LGIP),这是一个基准测试,用于测量(i)意义保留的同义句不变性以及(ii)图像-文本匹配中意义改变的语义翻转敏感性。使用40000张MS COCO图像和每张图像五个手工生成的描述,我们自动生成了改变物体类别、颜色或数量的同义句和基于规则的翻转,并用不变性误差、语义敏感性差距和正率统计来总结模型行为。在九种VLMs中,EVA02-CLIP和大型OpenCLIP变体位于有利的不变性-敏感性前沿,结合了低同义句诱导的变异性和始终高于翻转描述的原始描述得分。相比之下,SigLIP和SigLIP2显示出更大的不变性误差,经常偏好翻转描述而非人类描述,尤其是在对象和颜色编辑方面。这些失败在标准检索指标中几乎是看不见的,表明LGIP为VLMs的语言鲁棒性提供了一种模型无关的诊断工具,超越了传统的准确率评分。
Summary / 总结
This study introduces Language-Guided Invariance Probing (LGIP), a benchmark that evaluates the linguistic robustness of vision-language models (VLMs) by measuring invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, it automatically generates paraphrases and rule-based flips that alter object category, color, or count, and summarizes behavior with an invariance error, a semantic sensitivity gap, and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants show a favorable balance between invariance and sensitivity, while SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. Because these failures are largely invisible to standard retrieval metrics, LGIP offers a model-agnostic diagnostic that goes beyond conventional accuracy scores.
该研究引入了语言引导不变性探针(LGIP),通过测量视觉语言模型(VLMs)对意义保留的同义句和意义改变的语义翻转的不变性和敏感性来评估其语言鲁棒性。使用40k MS COCO图像和每个图像五个人工描述,该研究自动生成了同义句和基于规则的翻转来改变物体类别、颜色或数量。关键发现表明,EVA02-CLIP和大型OpenCLIP变体在不变性和敏感性之间表现出良好的平衡,而SigLIP和SigLIP2则显示出更大的不变性误差,并且往往更偏好翻转描述而非人工描述,尤其是在物体和颜色编辑方面,这些差异在标准检索指标中是不可见的。
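The three summary statistics named above can be sketched from per-caption matching scores. The abstract does not give their formulas, so the definitions below are assumptions chosen for illustration: invariance error as the mean absolute score change under paraphrase, sensitivity gap as the mean score drop under a flip, and positive rate as the fraction of cases where the original caption outscores its flipped version. The scores themselves are made up.

```python
def lgip_summary(orig, para, flip):
    """Summarize image-text scores for original, paraphrased, and flipped captions.

    All three definitions are assumed for illustration, not taken from the paper.
    """
    n = len(orig)
    invariance_error = sum(abs(o - p) for o, p in zip(orig, para)) / n
    sensitivity_gap = sum(o - f for o, f in zip(orig, flip)) / n
    positive_rate = sum(1 for o, f in zip(orig, flip) if o > f) / n
    return invariance_error, sensitivity_gap, positive_rate

# Hypothetical similarity scores for four image-caption pairs.
orig_scores = [0.5, 0.75, 0.25, 1.0]
para_scores = [0.5, 0.5, 0.25, 1.0]
flip_scores = [0.25, 0.25, 0.5, 0.5]

inv_err, gap, pos_rate = lgip_summary(orig_scores, para_scores, flip_scores)
print(inv_err, gap, pos_rate)
```

Under these assumed definitions, a robust model would show a low invariance error together with a large positive gap and a positive rate near 1. In the toy data the third pair is a failure case where the flipped caption scores higher than the human one.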
Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling
Authors: Adam Hazimeh, Ke Wang, Mark Collier, Gilles Baechler, Efi Kokiopoulou, Pascal Frossard
First: 2025-11-17T15:16:13+00:00 · Latest: 2025-11-17T15:16:13+00:00
Abstract
Multimedia documents such as slide presentations and posters are designed to be interactive and easy to modify. Yet, they are often distributed in a static raster format, which limits editing and customization. Restoring their editability requires converting these raster images back into structured vector formats. However, existing geometric raster-vectorization methods, which rely on low-level primitives like curves and polygons, fall short at this task. Specifically, when applied to complex documents like slides, they fail to preserve the high-level structure, resulting in a flat collection of shapes where the semantic distinction between image and text elements is lost. To overcome this limitation, we address the problem of semantic document derendering by introducing SliDer, a novel framework that uses Vision-Language Models (VLMs) to derender slide images as compact and editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements in a raster input and organizes them into a coherent SVG format. Crucially, the model iteratively refines its predictions during inference in a process analogous to human design, generating SVG code that more faithfully reconstructs the original raster upon rendering. Furthermore, we introduce Slide2SVG, a novel dataset comprising raster-SVG pairs of slide documents curated from real-world scientific presentations, to facilitate future research in this domain. Our results demonstrate that SliDer achieves a reconstruction LPIPS of 0.069 and is favored by human evaluators in 82.9% of cases compared to the strongest zero-shot VLM baseline.
中文标题/摘要
标题:语义文档去渲染:通过视觉语言建模进行SVG重建
多媒体文档如幻灯片和海报旨在交互性强且易于修改。然而,它们通常以静态位图格式分发,这限制了编辑和定制。恢复其可编辑性需要将这些位图图像转换回结构化的向量格式。然而,现有的几何位图向量转换方法依赖于低级原语如曲线和多边形,在此任务上表现不佳。具体来说,当应用于复杂的文档如幻灯片时,它们无法保留高层次结构,导致一个扁平的形状集合,其中图像和文本元素之间的语义区别丢失。为克服这一限制,我们通过引入SliDer,一种新颖的框架,使用视觉语言模型(VLMs)将幻灯片图像去渲染为紧凑且可编辑的可缩放矢量图形(SVG)表示,来解决语义文档去渲染问题。SliDer从位图输入中检测和提取图像和文本元素的属性,并将它们组织成一个连贯的SVG格式。关键的是,模型在推理过程中迭代地改进其预测,类似于人类设计的过程,生成在渲染时更忠实于原始位图的SVG代码。此外,我们引入了Slide2SVG,这是一个新颖的数据集,包含来自实际科学演示文稿的位图-SVG配对幻灯片文档,以促进该领域的未来研究。我们的结果表明,SliDer的重建LPIPS为0.069,并且在82.9%的情况下被人类评估者偏好,优于最强的零样本VLM基线。
Summary / 总结
The research aims to restore the editability of multimedia documents by converting static raster images into structured vector formats. SliDer, a novel framework, uses Vision-Language Models to detect and extract attributes from image and text elements in raster inputs, organizing them into coherent SVG representations. The model iteratively refines its predictions, generating SVG code that accurately reconstructs the original raster. Experimental results show that SliDer outperforms existing methods with an LPIPS score of 0.069 and is preferred by human evaluators in 82.9% of cases over the strongest zero-shot VLM baseline.
论文解决了将多媒体文档的静态位图图像转换为可编辑的矢量格式的问题,这对于文档的修改和定制至关重要。它引入了SliDer框架,使用视觉-语言模型从位图输入中检测和提取属性,并将其组织成SVG表示。SliDer在推理过程中逐步改进其预测,生成能够准确重建原始文档的SVG代码。结果表明,SliDer的LPIPS得分为0.069,且在82.9%的情况下被人类评估者更偏好,优于最强的零样本VLM基线。
Trust in Vision-Language Models: Insights from a Participatory User Workshop
Authors: Agnese Chiatti, Lara Piccolo, Sara Bernardini, Matteo Matteucci, Viola Schiaffonati
Venue: Proceedings of the European Workshop on Trustworthy AI (Trust-AI) at ECAI 2025
First: 2025-11-17T15:04:59+00:00 · Latest: 2025-11-17T15:04:59+00:00
Abstract
With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants' engagement to fit the case of user-VLM interaction.
中文标题/摘要
标题:视觉语言模型的信任度:参与式用户研讨会的见解
随着视觉语言模型(VLMs)在大规模图像-文本和视频-文本数据集上进行预训练的部署不断增加,为用户提供工具以判断何时信任这些系统变得至关重要。然而,研究用户对VLMs的信任如何建立和发展仍然是一个开放的问题。随着对AI模型作为实验验证的裁判依赖性的增加,这使得绕过直接与用户进行参与式设计研究的成本和影响变得更为复杂。采用用户中心的方法,本文介绍了与潜在VLM用户进行的工作坊的初步结果。这些试点研讨会的见解将为未来旨在为用户-VLM交互情境化信任度指标和参与者参与策略的研究提供指导。
Summary / 总结
This paper examines how user trust in Vision-Language Models (VLMs) builds and evolves, reporting preliminary results from a participatory workshop with prospective VLM users. Rather than relying on AI models as judges for experimental validation, it follows a user-centred approach and works directly with users. The insights from this pilot workshop are intended to inform future studies on contextualising trust metrics and participant-engagement strategies for user-VLM interaction.
本文通过参与式用户研讨会获取见解,探讨用户如何对视觉-语言模型(VLMs)建立信任。研究采用用户中心的方法收集初步数据,以设计更好的信任度量和参与策略。主要发现表明,用户需要更多的背景和互动来建立对VLMs的信任,强调了参与式设计研究的重要性。
Vision Transformers with Self-Distilled Registers
Authors: Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
Venue: NeurIPS 2025 Spotlight
First: 2025-05-27T17:59:41+00:00 · Latest: 2025-11-17T15:02:58+00:00
Comments: NeurIPS 2025 Spotlight. Website: https://github.com/0raiser0/PH-Reg
Abstract
Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of existing large-scale pre-trained ViTs, in this paper we seek to add register tokens to existing models without needing to re-train from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
中文标题/摘要
标题:具有自蒸馏寄存器的视觉变换器
视觉变换器(ViTs)已成为视觉处理任务的主要架构,随着训练数据和模型规模的增加,其表现出卓越的可扩展性。然而,近期研究发现ViTs中出现了与局部语义不符的伪令牌,这些异常令牌在需要精细定位或结构连贯的任务中会降低ViT的性能。有效缓解这一问题的方法是在ViTs中添加寄存器令牌,这些寄存器令牌在训练过程中隐式地“吸收”了伪令牌。鉴于现有大规模预训练ViTs的可用性,本文旨在无需从头开始重新训练的情况下将寄存器令牌添加到现有模型中,这在考虑其规模时是不可行的。具体而言,我们提出了一种高效的后验寄存器(PH-Reg)方法,该方法通过不需额外标注数据和完全重新训练即可将寄存器整合到现有的ViT中。PH-Reg从相同的预训练ViT初始化教师和学生网络。教师保持冻结且未修改,学生则增加了随机初始化的寄存器令牌。通过在教师输入上应用测试时增强,我们生成了无伪令牌的去噪密集嵌入,然后仅优化学生网络中解锁的小部分权重。我们展示了该方法可以有效减少伪令牌的数量,在零样本和线性探针下提高学生ViT的分割和深度预测。
Summary / 总结
This paper addresses the issue of artifact tokens in Vision Transformers (ViTs), which degrade performance in tasks requiring fine-grained localization or structural coherence. To mitigate this, the authors propose Post Hoc Registers (PH-Reg), a self-distillation method that integrates register tokens into an existing ViT without additional labeled data or full retraining. Teacher and student are initialized from the same pre-trained ViT; the teacher stays frozen, and test-time augmentation of its inputs yields denoised, artifact-free dense embeddings that are used to optimize a small subset of the student's weights. This reduces the number of artifact tokens and improves segmentation and depth prediction under zero-shot and linear probing.
本文解决了视觉变换器(ViTs)中出现的异常标记会降低其在需要精细定位的任务中的性能问题。提出了后验注册(PH-Reg),这是一种高效的自蒸馏方法,可以在不重新训练的情况下将注册标记添加到现有的ViTs中。该方法使用一个冻结的预训练ViT作为教师,而学生ViT则被随机初始化的注册标记所增强。通过在教师输入上应用测试时增强,该方法生成去噪的密集嵌入来优化学生的权重,从而有效减少异常标记并提高分割和深度预测性能。
Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline
Authors: Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu
First: 2025-11-17T14:49:57+00:00 · Latest: 2025-11-17T14:49:57+00:00
Abstract
With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. Nowadays, MLLMs demonstrate strong generalization potential across diverse vision-language tasks, and some studies introduce this capability to IFDL via large-scale training. However, such approaches cost considerable computational resources, while failing to reveal the inherent generalization potential of vanilla MLLMs to address this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and utilizes a Flexible Feature Detector (FFD) module to specifically handle copy-move manipulations, thereby effectively unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.
中文标题/摘要
标题:利用原生MLLM解锁图像伪造检测潜力:一种新型无需训练的流水线
随着人工智能生成内容(AIGC)技术的迅速发展,包括多模态大型语言模型(MLLMs)和扩散模型,图像生成和篡改变得异常简便。现有的图像伪造检测与定位(IFDL)方法往往难以在多种数据集上泛化,并且提供有限的可解释性。如今,MLLMs在多种视觉语言任务上展现出强大的泛化潜力,一些研究通过大规模训练将这种能力引入IFDL。然而,这些方法消耗大量计算资源,未能揭示原生MLLM的固有泛化潜力以解决这一问题。受此观察启发,我们提出Foresee,一种针对图像伪造分析的无需训练的MLLM基流水线。它消除了额外训练的需要,实现轻量级推理过程,同时在篡改定位准确性和文本解释丰富性方面超越现有MLLM基方法。Foresee采用类型先验驱动策略,并利用灵活特征检测(FFD)模块专门处理复制移动篡改,从而有效释放原生MLLM在法医领域的潜力。大量实验表明,我们的方法同时实现了更高的篡改定位准确性和更全面的文本解释。此外,Foresee展现出更强的泛化能力,在各种篡改类型,包括复制移动、拼接、删除、局部增强、深度伪造和AIGC编辑基础上,优于现有IFDL方法。代码将在最终版本中发布。
Summary / 总结
This paper introduces Foresee, a training-free pipeline that uses vanilla MLLMs for image forgery detection and localization. It combines a type-prior-driven strategy with a Flexible Feature Detector (FFD) module that specifically handles copy-move manipulations. Foresee surpasses existing MLLM-based methods in both tamper localization accuracy and the richness of its textual explanations, and it generalizes better than existing IFDL methods across tampering types including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing.
论文提出了一个无需额外训练的Foresee管道,使用vanilla MLLMs进行图像伪造检测和定位。该方法利用类型先验驱动策略和灵活特征检测器(FFD)模块来处理复制移动操作,同时在准确性和解释性上超越了现有方法。Foresee在各种篡改类型上展示了强大的泛化能力,并提供了更全面的文本解释,而无需额外的训练成本。
VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task
Authors: Xingming Long, Jie Zhang, Shiguang Shan, Xilin Chen
First: 2025-11-17T14:32:06+00:00 · Latest: 2025-11-17T14:32:06+00:00
Comments: 8 pages
Abstract
Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks that prohibit any output absent from the image. However, little attention has been paid to hallucinations in voluntary imagination tasks, e.g., story writing, where the models are expected to generate novel content beyond the given image. In these tasks, it is inappropriate to simply regard such imagined novel content as hallucinations. To address this limitation, we introduce Voluntary-imagined Object Presence Evaluation (VOPE), a novel method to assess LVLMs' hallucinations in voluntary imagination tasks via presence evaluation. Specifically, VOPE poses recheck-based questions to evaluate how an LVLM interprets the presence of the imagined objects in its own response. The consistency between the model's interpretation and the object's presence in the image is then used to determine whether the model hallucinates when generating the response. We apply VOPE to several mainstream LVLMs and hallucination mitigation methods, revealing two key findings: (1) most LVLMs hallucinate heavily during voluntary imagination, and their performance in presence evaluation is notably poor on imagined objects; (2) existing hallucination mitigation methods show limited effect in voluntary imagination tasks, making this an important direction for future research.
中文标题/摘要
标题:VOPE:回顾自愿想象任务中大型视觉-语言模型的幻觉
大多数关于大型视觉-语言模型(LVLM)幻觉的研究集中在禁止任何超出图像内容的描述性任务上。然而,很少有人关注自愿想象任务中的幻觉,例如故事写作,其中模型被期望生成超出给定图像的新内容。在这些任务中,简单地将想象出的新内容视为幻觉是不合适的。为了解决这一局限性,我们引入了自愿想象对象存在评估(VOPE)——一种通过存在评估来评估LVLM在自愿想象任务中幻觉的新方法。具体而言,VOPE通过重新检查问题来评估LVLM如何解释其自身响应中想象对象的存在。然后根据模型的解释与图像中对象存在的一致性来判断模型在生成响应时是否产生了幻觉。我们将VOPE应用于几种主流的LVLM和幻觉缓解方法,揭示了两个关键发现:(1)大多数LVLM在自愿想象过程中严重幻觉,它们在想象对象的存在评估中的表现明显较差;(2)现有的幻觉缓解方法在自愿想象任务中的效果有限,这为未来的研究指出了一个重要方向。
Summary / 总结
The study revisits hallucinations in Large Vision-Language Models (LVLMs) in voluntary imagination tasks, such as story writing, where models are expected to generate novel content beyond the given image. It introduces VOPE, a novel method for assessing LVLMs' hallucinations by evaluating the consistency between the model's interpretation of imagined objects and their presence in the image. The research finds that most LVLMs hallucinate heavily during voluntary imagination and that their presence-evaluation performance on imagined objects is notably poor. Existing hallucination mitigation methods show only limited effect in these tasks, which marks an important direction for future research.
研究重新审视了大型视觉-语言模型(LVLM)在自愿想象任务中的幻觉问题,如故事写作,模型在此任务中应生成新颖内容。研究引入了VOPE方法,通过存在性评估来评估LVLM的幻觉情况,并发现大多数LVLM在自愿想象过程中严重幻觉,对想象对象的存在性评估表现较差。此外,现有的幻觉缓解方法在这类任务中的效果有限,这表明未来研究的一个重要方向。
Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)
Authors: Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising
First: 2025-11-17T14:12:22+00:00 · Latest: 2025-11-17T14:12:22+00:00
Abstract
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
中文标题/摘要
标题:描述符:带有距离标注的交通感知问答(DTPQA)
视觉语言模型(VLMs)在各种任务上的显著进步激发了其在自动驾驶中的应用兴趣。然而,为了在这样一个安全关键的领域被信任,这些模型必须首先具备稳健的感知能力,即它们必须能够理解一个往往非常复杂且同时发生许多事情的交通场景。此外,由于交通场景中的关键物体和代理通常位于远处,因此我们不仅需要在近距离(20米以内)具有强大的感知能力,还需要在远距离(30米以上)也具有强大的感知能力。因此,有必要在不依赖于推理或其他高级世界知识的情况下评估这些模型的感知能力。带有距离标注的交通感知问答(DTPQA)是一个专门为这一目的设计的视觉问答(VQA)基准:它可以通过使用与驾驶决策相关的简单但至关重要的问题来评估VLMs在交通场景中的感知系统。它包括两个部分:使用模拟器创建的合成基准(DTP-Synthetic)和基于现有真实交通场景图像构建的真实世界基准(DTP-Real)。此外,DTPQA还包括距离标注,即问题中的物体距离相机有多远。具体来说,每个DTPQA样本包括(至少):(a) 一张图像,(b) 一个问题,(c) 真实答案,以及(d) 问题中物体的距离,这使得可以分析VLM性能随物体距离增加而下降的情况。在本文中,我们提供了该数据集本身以及用于创建它的Python脚本,这些脚本可以用来生成相同类型的数据。
Summary / 总结
The research aims to evaluate the perception capabilities of Vision-Language Models (VLMs) in traffic scenarios in isolation from other skills such as reasoning or advanced world knowledge. The Distance-Annotated Traffic Perception Question Answering (DTPQA) benchmark consists of a synthetic part (DTP-Synthetic) created with a simulator and a real-world part (DTP-Real) built on existing images of real traffic scenes. Each sample pairs an image, a question, and a ground-truth answer with the distance of the object in question, so that VLM performance can be analysed as a function of object distance, from close range (up to 20 meters) to long range (30+ meters). The dataset is released together with the Python scripts used to create it.
研究旨在评估Vision-Language Models (VLMs)在交通场景中的感知能力,尤其是它们在不同距离下理解复杂场景的能力。Distance-Annotated Traffic Perception Question Answering (DTPQA)基准包括合成和现实世界两个部分,都带有距离标注。主要发现表明,随着目标距离的增加,VLM的性能会下降,突显了在自动驾驶系统中需要在长距离范围内具备稳健的感知能力。
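The per-sample structure described in the DTPQA abstract (image, question, ground-truth answer, object distance) lends itself to a simple accuracy-by-distance analysis. The following is a minimal sketch of that idea, not the authors' released scripts: the `Sample` field names, the exact-match scoring, and the 20 m / 30 m bin edges are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical field names; the abstract only lists the four components.
    image_path: str
    question: str
    answer: str
    distance_m: float


def bin_label(distance_m, bin_edges):
    """Map a distance to a human-readable bin such as '0-20 m' or '>30 m'."""
    lower = 0.0
    for edge in bin_edges:
        if distance_m <= edge:
            return f"{lower:g}-{edge:g} m"
        lower = edge
    return f">{lower:g} m"


def accuracy_by_distance(samples, predictions, bin_edges=(20.0, 30.0)):
    """Return {bin_label: accuracy}, scoring by case-insensitive exact match."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        label = bin_label(sample.distance_m, bin_edges)
        totals[label] += 1
        hits[label] += int(pred.strip().lower() == sample.answer.strip().lower())
    return {label: hits[label] / totals[label] for label in totals}
```

Feeding a model's answers through `accuracy_by_distance` gives one accuracy figure per distance bin, which is the kind of breakdown the distance annotations are meant to enable.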
Moving Pictures of Thought: Extracting Visual Knowledge in Charles S. Peirce's Manuscripts with Vision-Language Models
Authors: Carlo Teo Pedretti, Davide Picca, Dario Rodighiero
First: 2025-11-17T13:52:23+00:00 · Latest: 2025-11-17T13:52:23+00:00
Abstract
Diagrams are crucial yet underexplored tools in many disciplines, demonstrating the close connection between visual representation and scholarly reasoning. However, their iconic form poses obstacles to visual studies, intermedial analysis, and text-based digital workflows. In particular, Charles S. Peirce consistently advocated the use of diagrams as essential for reasoning and explanation. His manuscripts, often combining textual content with complex visual artifacts, provide a challenging case for studying documents involving heterogeneous materials. In this preliminary study, we investigate whether Visual Language Models (VLMs) can effectively help us identify and interpret such hybrid pages in context. First, we propose a workflow that (i) segments manuscript page layouts, (ii) reconnects each segment to IIIF-compliant annotations, and (iii) submits fragments containing diagrams to a VLM. In addition, by adopting Peirce's semiotic framework, we designed prompts to extract key knowledge about diagrams and produce concise captions. Finally, we integrated these captions into knowledge graphs, enabling structured representations of diagrammatic content within composite sources.
中文标题/摘要
标题:思想的影像:利用视觉语言模型在查尔斯·S·皮尔士手稿中提取视觉知识
图表在许多学科中是至关重要的但尚未充分探索的工具,展示了视觉表示与学术推理之间的密切联系。然而,它们的象形形式给视觉研究、跨媒体分析和基于文本的数字工作流程带来了障碍。特别是,查尔斯·S·皮尔士一直倡导使用图表作为推理和解释的重要工具。他的手稿通常结合了文本内容和复杂的视觉元素,为研究涉及异质材料的文档提供了具有挑战性的案例。在这项初步研究中,我们探讨视觉语言模型(VLMs)是否能有效地帮助我们识别和解释这些混合页面中的图表。首先,我们提出了一种工作流,包括(i) 分割手稿页面布局,(ii) 将每个片段重新连接到IIIF兼容的注释,(iii) 将包含图表的片段提交给VLM。此外,通过采用皮尔士的符号框架,我们设计了提示来提取关于图表的关键知识并生成简洁的说明。最后,我们将这些说明整合到知识图谱中,使图表内容在复合来源中的结构化表示成为可能。
Summary / 总结
This preliminary study explores whether Vision-Language Models (VLMs) can help identify and interpret diagrams in the manuscripts of Charles S. Peirce, who consistently advocated diagrams as essential for reasoning and explanation. The research proposes a workflow that segments manuscript page layouts, reconnects each segment to IIIF-compliant annotations, and submits diagram-containing fragments to a VLM. Using Peirce's semiotic framework, the authors design prompts that extract key knowledge about the diagrams and produce concise captions. These captions are then integrated into knowledge graphs, giving structured representations of diagrammatic content within composite sources.
本研究探索使用视觉语言模型(VLMs)从查尔斯·桑德斯·皮尔士的混合文本和图表手稿中提取视觉知识。研究提出了一种工作流,将手稿页面分割,重新连接到注释,并将包含图表的片段提交给VLMs。通过使用皮尔士的符号框架,研究设计了提取图表关键知识并生成简洁描述的提示,然后将这些描述整合到知识图谱中。主要发现表明,VLMs可以有效地识别和解释皮尔士手稿中的图表,提供图表内容的结构化表示。
LLMC+: Benchmarking Vision-Language Model Compression with a Plug-and-play Toolkit
Authors: Chengtao Lv, Bilang Zhang, Yang Yong, Ruihao Gong, Yushi Huang, Shiqiao Gu, Jiajun Wu, Yumeng Shi, Jinyang Guo, Wenya Wang
Venue: AAAI 2026
First: 2025-08-13T17:54:49+00:00 · Latest: 2025-11-17T13:22:24+00:00
Comments: Accepted by AAAI 2026
Abstract
Large Vision-Language Models (VLMs) exhibit impressive multi-modal capabilities but suffer from prohibitive computational and memory demands, due to their long visual token sequences and massive parameter sizes. To address these issues, recent works have proposed training-free compression methods. However, existing efforts often suffer from three major limitations: (1) Current approaches do not decompose techniques into comparable modules, hindering fair evaluation across spatial and temporal redundancy. (2) Evaluation is confined to simple single-turn tasks, failing to reflect performance in realistic scenarios. (3) Individual compression techniques are used in isolation, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. LLMC+ supports over 20 algorithms across five representative VLM families and enables systematic study of token-level and model-level compression. Our benchmark reveals that: (1) Spatial and temporal redundancies demand distinct technical strategies. (2) Token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks. (3) Combining token and model compression achieves extreme compression with minimal performance loss. We believe LLMC+ will facilitate fair evaluation and inspire future research in efficient VLMs. Our code is available at https://github.com/ModelTC/LightCompress.
中文标题/摘要
标题:LLMC+: 使用即插即用工具包评估视觉-语言模型压缩基准
大型视觉-语言模型(VLMs)表现出色的多模态能力,但由于其长视觉标记序列和庞大的参数量,面临计算和内存需求过高的问题。为解决这些问题,最近的研究提出了无需训练的压缩方法。然而,现有努力往往存在三个主要局限:(1)当前方法未将技术分解为可比模块,阻碍了在空间和时间冗余方面的公平评估。(2)评估局限于简单的单轮任务,未能反映在现实场景中的性能。(3)单独使用个体压缩技术,未探索其联合潜力。为克服这些差距,我们引入了LLMC+,这是一个全面的VLM压缩基准,配备了一个多功能的即插即用工具包。LLMC+支持超过20种算法,覆盖五个代表性VLM家族,并允许系统研究标记级和模型级压缩。我们的基准表明:(1)空间和时间冗余需要不同的技术策略。(2)标记减少方法在多轮对话和细节敏感任务中显著退化。(3)结合标记和模型压缩实现了极高的压缩率,同时保持最小的性能损失。我们相信LLMC+将促进公平评估并激发未来高效VLM的研究。我们的代码可在https://github.com/ModelTC/LightCompress/ 获取。
Summary / 总结
The research aims to address the computational and memory challenges of large Vision-Language Models (VLMs) by introducing LLMC+, a comprehensive compression benchmark with a plug-and-play toolkit. LLMC+ supports over 20 compression algorithms across five VLM families and enables systematic study of token-level and model-level compression. Key findings are that spatial and temporal redundancies demand distinct strategies, that token reduction methods degrade significantly in multi-turn dialogue and detail-sensitive tasks, and that combining token and model compression achieves extreme compression with minimal performance loss.
研究旨在通过引入LLMC+综合基准和插件式工具包来解决大型视觉-语言模型(VLM)的计算和内存挑战。LLMC+评估了五个VLM家族中的20多种压缩算法,重点关注标记级和模型级压缩。主要发现包括需要针对空间和时间冗余采用不同的策略、标记减少方法在多轮对话中表现显著下降以及结合标记和模型压缩可以实现极端压缩并保持最小的性能损失。
Tab-PET: Graph-Based Positional Encodings for Tabular Transformers
Authors: Yunze Leng, Rohan Ghosh, Mehul Motani
First: 2025-11-17T13:08:34+00:00 · Latest: 2025-11-17T13:08:34+00:00
Abstract
Supervised learning with tabular data presents unique challenges, including low data sizes, the absence of structural cues, and heterogeneous features spanning both categorical and continuous domains. Unlike vision and language tasks, where models can exploit inductive biases in the data, tabular data lacks inherent positional structure, hindering the effectiveness of self-attention mechanisms. While recent transformer-based models like TabTransformer, SAINT, and FT-Transformer (which we refer to as 3T) have shown promise on tabular data, they typically operate without leveraging structural cues such as positional encodings (PEs), as no prior structural information is usually available. In this work, we find both theoretically and empirically that structural cues, specifically PEs, can be a useful tool to improve generalization performance for tabular transformers. We find that PEs impart the ability to reduce the effective rank (a form of intrinsic dimensionality) of the features, effectively simplifying the task by reducing the dimensionality of the problem, yielding improved generalization. To that end, we propose Tab-PET (PEs for Tabular Transformers), a graph-based framework for estimating and inculcating PEs into embeddings. Inspired by approaches that derive PEs from graph topology, we explore two paradigms for graph estimation: association-based and causality-based. We empirically demonstrate that graph-derived PEs significantly improve performance across 50 classification and regression datasets for 3T. Notably, association-based graphs consistently yield more stable and pronounced gains compared to causality-driven ones. Our work highlights an unexpected role of PEs in tabular transformers, revealing how they can be harnessed to improve generalization.
中文标题/摘要
标题:Tab-PET:基于图的位置编码以提高表格变换器的性能
使用表格数据进行监督学习面临独特挑战,包括数据量小、缺乏结构线索以及跨分类和连续域的异质特征。与视觉和语言任务不同,模型可以利用数据中的归纳偏置,但表格数据缺乏固有的位置结构,阻碍了自注意力机制的有效性。尽管像TabTransformer、SAINT和FT-Transformer(我们称之为3T)等基于变换器的模型在表格数据上显示出潜力,但它们通常不利用位置编码(PEs)等结构线索,因为通常没有先验的结构信息。在这项工作中,我们理论和实验证明,结构线索,特别是PEs,可以作为提高表格变换器泛化性能的有效工具。我们发现PEs能够降低特征的有效秩(一种内在维度),从而简化任务并减少问题的维度,提高泛化能力。为此,我们提出了Tab-PET(PEs for Tabular Transformers),一种基于图的框架,用于估计和引入PEs到嵌入中。受从图拓扑中推导PEs的方法启发,我们探索了基于关联和因果关系的两种图估计范式。我们实验证明,基于关联的图显著提高了3T在50个分类和回归数据集上的性能。值得注意的是,基于关联的图始终比基于因果关系的图提供更稳定和显著的改进。我们的工作揭示了PEs在表格变换器中的意外作用,展示了它们如何被利用以提高泛化能力。
Summary / 总结
This paper addresses the challenges of supervised learning with tabular data, such as low data sizes and the lack of structural cues. It proposes Tab-PET, a graph-based framework for estimating and incorporating positional encodings into tabular transformers to improve generalization. The study finds that graph-derived positional encodings, particularly those based on association, significantly enhance performance across various datasets, reducing the effective rank of features and simplifying the task.
该论文针对表格数据监督学习中的挑战,如数据量小和缺乏结构线索。它引入了Tab-PET,一种基于图的框架,通过引入位置编码(PEs)来增强表格变换器。理论和实验证明,PEs可以降低特征的有效秩,简化任务并提高泛化能力。研究显示,基于关联的图在50个分类和回归数据集上比基于因果关系的图更有效地提升3T模型的性能。
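The Tab-PET abstract refers to the effective rank of the features as a form of intrinsic dimensionality. A minimal sketch of one common definition follows: the exponential of the Shannon entropy of the normalized singular values. Whether the paper uses exactly this definition is an assumption, and the function name `effective_rank` is illustrative.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank of a matrix, given its singular values.

    Assumed definition: exp(H(p)), where p_i = sigma_i / sum_j sigma_j and
    H is the Shannon entropy. The result ranges from 1 (all energy in one
    direction) to the number of non-zero singular values (uniform spectrum).
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Zero singular values contribute nothing to the entropy (0 * log 0 := 0).
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A flat spectrum such as `[1, 1, 1, 1]` gives an effective rank equal to the number of singular values, while a spectrum dominated by one value gives a result close to 1, which is the sense in which a lower effective rank indicates a lower-dimensional problem.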
Certified Coil Geometry Learning for Short-Range Magnetic Actuation and Spacecraft Docking Application
Authors: Yuta Takahashi, Hayate Tajima, Shin-ichiro Sakai
First: 2025-07-04T20:54:30+00:00 · Latest: 2025-11-17T12:36:41+00:00
Comments: Submitted to IEEE Robotics and Automation Letters
Abstract
This paper presents a learning-based framework for approximating an exact magnetic-field interaction model, supported by both numerical and experimental validation. High-fidelity magnetic-field interaction modeling is essential for achieving exceptional accuracy and responsiveness across a wide range of fields, including transportation, energy systems, medicine, biomedical robotics, and aerospace robotics. In aerospace engineering, magnetic actuation has been investigated as a fuel-free solution for multi-satellite attitude and formation control. Although the exact magnetic field can be computed from the Biot-Savart law, the associated computational cost is prohibitive, and prior studies have therefore relied on dipole approximations to improve efficiency. However, these approximations lose accuracy during proximity operations, leading to unstable behavior and even collisions. To address this limitation, we develop a learning-based approximation framework that faithfully reproduces the exact field while dramatically reducing computational cost. The proposed method additionally provides a certified error bound, derived from the number of training samples, ensuring reliable prediction accuracy. The learned model can also accommodate interactions between coils of different sizes through appropriate geometric transformations, without retraining. To verify the effectiveness of the proposed framework under challenging conditions, a spacecraft docking scenario is examined through both numerical simulations and experimental validation.
中文标题/摘要
标题:认证线圈几何学习在短距离磁驱动和航天器对接应用
本文提出了一种基于学习的框架,用于近似精确的磁场相互作用模型,并通过数值和实验验证支持。高保真度的磁场相互作用建模对于在包括交通运输、能源系统、医学、生物医学机器人和航空航天机器人在内的多个领域实现卓越的准确性和响应性至关重要。在航空航天工程中,磁驱动已被研究作为多卫星姿态和编队控制的无燃料解决方案。尽管可以从毕奥-萨伐尔定律计算出精确的磁场,但相关的计算成本是不可接受的,因此先前的研究依赖于偶极近似以提高效率。然而,在接近操作期间,这些近似会失去准确性,导致不稳定行为甚至碰撞。为了解决这一限制,我们开发了一种基于学习的近似框架,能够忠实再现精确的磁场同时大幅降低计算成本。所提出的方法还提供了由训练样本数量推导出的认证误差界,确保可靠的预测准确性。学习到的模型还可以通过适当的几何变换来适应不同大小线圈之间的相互作用,无需重新训练。为了在具有挑战性的条件下验证所提出框架的有效性,通过数值仿真和实验验证对航天器对接场景进行了研究。
Summary / 总结
This paper introduces a learning-based framework to approximate an exact magnetic-field interaction model, essential for high-fidelity magnetic actuation in various fields. The method reduces computational cost while maintaining accuracy and provides a certified error bound. The framework is validated through numerical simulations and experimental docking scenarios, demonstrating its effectiveness under challenging conditions.
本文提出了一种基于学习的框架来近似精确的磁场相互作用模型,对于高精度应用至关重要。该方法降低了计算成本同时保持准确性,并提供了认证的误差界。该框架通过数值仿真和实验对接场景进行了验证,展示了其在复杂条件下的有效性。
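The abstract's claim that dipole approximations lose accuracy during proximity operations can be illustrated with a textbook case: the on-axis field of a single circular current loop, for which the Biot-Savart law has a closed form. This is only a sketch of that standard comparison under simplifying assumptions (one loop, on-axis point, field rather than coil-to-coil force or torque); it is not the paper's interaction model or its learned approximation.

```python
import math

MU_0 = 4e-7 * math.pi  # vacuum permeability in T*m/A (classical value)


def loop_field_on_axis(current_a, radius_m, z_m):
    """Exact on-axis field of a circular loop, from the Biot-Savart law."""
    return MU_0 * current_a * radius_m**2 / (2.0 * (radius_m**2 + z_m**2) ** 1.5)


def dipole_field_on_axis(current_a, radius_m, z_m):
    """Point-dipole approximation with magnetic moment m = I * pi * R^2."""
    moment = current_a * math.pi * radius_m**2
    return MU_0 * moment / (2.0 * math.pi * z_m**3)


def dipole_overestimate(radius_m, z_m):
    """Ratio dipole/exact on the axis; algebraically (1 + (R/z)^2)^(3/2)."""
    exact = loop_field_on_axis(1.0, radius_m, z_m)
    approx = dipole_field_on_axis(1.0, radius_m, z_m)
    return approx / exact


if __name__ == "__main__":
    # Compare the two models for a 0.1 m loop at increasing distances.
    for z in (0.1, 0.5, 1.0):
        print(f"z = {z:4.1f} m  dipole/exact = {dipole_overestimate(0.1, z):.3f}")
```

The ratio tends to 1 when the distance is large compared with the coil radius and grows as the distance shrinks toward the radius, which is the regime that a docking scenario operates in.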
TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing
Authors: Jongha Kim, Minseong Bae, Sanghyeok Lee, Jinsung Yoon, Hyunwoo J. Kim
Venue: AAAI 2026
First: 2025-11-17T12:00:23+00:00 · Latest: 2025-11-17T12:00:23+00:00
Comments: AAAI 2026 (Main Technical Track)
Abstract
Table images present unique challenges for effective and efficient understanding due to the need for question-specific focus and the presence of redundant background regions. Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact to improve table understanding. We first propose progressive question conditioning, which injects the question into Vision Transformer layers with gradually increasing frequency, considering each layer's capacity to handle additional information, to generate question-aware visual features. To reduce redundancy, we introduce a pruning strategy that discards background tokens, thereby improving efficiency. To mitigate information loss from pruning, we further propose token focusing, a training strategy that encourages the model to concentrate essential information in the retained tokens. By combining these approaches, we present TabFlash, an efficient and effective MLLM for table understanding. TabFlash achieves state-of-the-art performance, outperforming both open-source and proprietary MLLMs, while requiring 27% fewer FLOPs and 30% less memory than the second-best MLLM.
中文标题/摘要
标题:TabFlash:渐进式问题条件化和token聚焦的高效表格理解
表格图像由于需要问题特定的关注和冗余背景区域的存在,为有效的和高效的理解带来了独特的挑战。现有的多模态大型语言模型(MLLM)方法往往忽视了这些特性,导致生成的视觉表示缺乏信息且冗余。为了解决这些问题,我们旨在生成既具有信息性又具有紧凑性的视觉特征,以提高表格理解的效果。我们首先提出了渐进式问题条件化,该方法以逐渐增加的频率将问题注入到Vision Transformer层中,考虑到每一层处理额外信息的能力,以生成问题感知的视觉特征。为了减少冗余,我们引入了一种剪枝策略,该策略丢弃背景token,从而提高效率。为了减轻剪枝带来的信息损失,我们进一步提出了token聚焦的训练策略,该策略鼓励模型在保留的token中集中关键信息。通过结合这些方法,我们提出了TabFlash,一种高效且有效的MLLM,用于表格理解。TabFlash达到了最先进的性能,优于开源和专有MLLM,同时FLOPs和内存使用分别比第二好的MLLM少27%和30%。
Summary / 总结
The research aims to improve table understanding by addressing the challenges of question-specific focus and redundant background regions. The method involves progressive question conditioning to generate question-aware visual features, token pruning to reduce redundancy, and token focusing to retain essential information. TabFlash, the proposed model, achieves state-of-the-art performance with 27% fewer FLOPs and 30% less memory usage compared to the second-best MLLM.
TabFlash通过提出渐进式问题条件化和token聚焦来解决表格图像理解的挑战。它通过逐步增加问题注入频率的Vision Transformer生成问题感知的视觉特征,并通过去除背景token来减少冗余。token聚焦确保保留的关键信息得到集中。TabFlash在表格理解上优于现有MLLMs,使用27%更少的FLOPs和30%更少的内存。
Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency
Authors: Bo Yang, Hengwei Zhang, Jindong Wang, Yuchen Ren, Chenhao Lin, Chao Shen, Zhengyu Zhao
First: 2025-05-19T02:56:41+00:00 · Latest: 2025-11-17T11:44:21+00:00
Abstract
In surrogate ensemble attacks, using more surrogate models yields higher transferability but lower resource efficiency. This practical trade-off between transferability and efficiency has largely limited existing attacks, even though many pre-trained models are easily accessible online. In this paper, we argue that such a trade-off is caused by an unnecessary common assumption, i.e., all models should be identical across iterations. By lifting this assumption, we can use as many surrogates as we want to unleash transferability without sacrificing efficiency. Concretely, we propose Selective Ensemble Attack (SEA), which dynamically selects diverse models (from easily accessible pre-trained models) across iterations based on our new interpretation of decoupling within-iteration and cross-iteration model diversity. In this way, the number of within-iteration models is fixed for maintaining efficiency, while only cross-iteration model diversity is increased for higher transferability. Experiments on ImageNet demonstrate the superiority of SEA in various scenarios. For example, when dynamically selecting 4 from 20 accessible models, SEA yields 8.5% higher transferability than existing attacks under the same efficiency. The superiority of SEA also generalizes to real-world systems, such as commercial vision APIs and large vision-language models. Overall, SEA opens up the possibility of adaptively balancing transferability and efficiency according to specific resource requirements.
中文标题/摘要
标题:使用任意多的代理模型:选择性集成攻击以释放转移性而不牺牲资源效率
在代理模型集成攻击中,使用更多的代理模型会提高转移性但降低资源效率。这种在转移性和效率之间的实用权衡极大地限制了现有攻击的发展,尽管许多预训练模型很容易在线获取。在本文中,我们认为这种权衡是由一个不必要的共同假设引起的,即所有模型在迭代中应该是相同的。通过取消这一假设,我们可以使用任意多的代理模型来释放转移性而不牺牲效率。具体而言,我们提出了选择性集成攻击(SEA),它基于我们对迭代内和跨迭代模型多样性解耦的新解释,在迭代中动态选择不同的模型(来自易于获取的预训练模型)。这样,为了保持效率,迭代内的模型数量保持固定,而仅通过增加跨迭代模型多样性来提高转移性。在ImageNet上的实验表明,SEA在各种场景中具有优越性。例如,在从20个可访问模型中动态选择4个时,SEA在相同效率下比现有攻击提高了8.5%的转移性。SEA的优越性也适用于现实世界系统,如商业视觉API和大型视觉-语言模型。总体而言,SEA为根据特定资源需求适配性地平衡转移性和效率提供了可能性。
Summary / 总结
This paper addresses the trade-off between transferability and resource efficiency in surrogate ensemble attacks, proposing Selective Ensemble Attack (SEA) to dynamically select diverse models across iterations. SEA uses as many surrogates as needed to enhance transferability without compromising efficiency, achieving 8.5% higher transferability compared to existing methods under the same efficiency on ImageNet. SEA's effectiveness is also validated in real-world systems like commercial vision APIs and large vision-language models.
本文针对代理集成攻击中转移性与资源效率之间的权衡问题,提出了动态选择跨迭代不同模型的Selective Ensemble Attack (SEA) 方法。SEA 利用易于访问的预训练模型来提高转移性而不牺牲效率。实验结果表明,当从20个可用模型中选择4个时,SEA 在ImageNet 上的转移性比现有攻击高出8.5%,同时保持相同的效率。SEA 的效果也适用于商业视觉API和大型视觉-语言模型等实际系统。
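The decoupling described in the SEA abstract, a fixed number of surrogates within each iteration but different surrogates across iterations, can be sketched as a selection schedule. The abstract does not state how SEA chooses its models, so the rule below (cycling through a shuffled pool) is a stand-in assumption, and `selection_schedule` is a hypothetical name; the attack's gradient step is omitted entirely.

```python
import random


def selection_schedule(num_models, per_iter, num_iters, seed=0):
    """Pick `per_iter` distinct surrogate indices for each attack iteration.

    Stand-in rule (not SEA's actual criterion): draw without replacement
    from a shuffled pool and reshuffle once too few indices remain, so the
    per-iteration cost stays fixed while coverage grows across iterations.
    """
    if per_iter > num_models:
        raise ValueError("per_iter must not exceed num_models")
    rng = random.Random(seed)
    pool = []
    schedule = []
    for _ in range(num_iters):
        if len(pool) < per_iter:
            # Start a fresh pass over all models; leftover indices are dropped.
            pool = list(range(num_models))
            rng.shuffle(pool)
        schedule.append([pool.pop() for _ in range(per_iter)])
    return schedule
```

With 20 accessible models and 4 per iteration, as in the abstract's example, five iterations of this schedule touch all 20 models while each iteration still evaluates only 4.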
Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
Authors: Lingfeng Zhang, Yuchen Zhang, Hongsheng Li, Haoxiang Fu, Yingbo Tang, Hangjun Ye, Long Chen, Xiaojun Liang, Xiaoshuai Hao, Wenbo Ding
First: 2025-11-17T11:39:20+00:00 · Latest: 2025-11-17T11:39:20+00:00
Abstract
Vision-Language Models (VLMs), leveraging their powerful visual perception and reasoning capabilities, have been widely applied in Unmanned Aerial Vehicle (UAV) tasks. However, the spatial intelligence capabilities of existing VLMs in UAV scenarios remain largely unexplored, raising concerns about their effectiveness in navigating and interpreting dynamic environments. To bridge this gap, we introduce SpatialSky-Bench, a comprehensive benchmark specifically designed to evaluate the spatial intelligence capabilities of VLMs in UAV navigation. Our benchmark comprises two categories, Environmental Perception and Scene Understanding, divided into 13 subcategories, including bounding boxes, color, distance, height, and landing safety analysis, among others. Extensive evaluations of various mainstream open-source and closed-source VLMs reveal unsatisfactory performance in complex UAV navigation scenarios, highlighting significant gaps in their spatial capabilities. To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. Extensive experimental results demonstrate that Sky-VLM achieves state-of-the-art performance across all benchmark tasks, paving the way for the development of VLMs suitable for UAV scenarios. The source code is available at https://github.com/linglingxiansen/SpatialSKy.
中文标题/摘要
标题:你的VLM已准备好应对天空了吗?一种全面的空间智能基准测试用于无人机导航
视觉-语言模型(VLMs),凭借其强大的视觉感知和推理能力,已在无人机(UAV)任务中广泛应用。然而,现有VLMs在无人机场景中的空间智能能力仍鲜有探索,对其在导航和解释动态环境中的有效性提出了质疑。为弥补这一差距,我们引入了SpatialSky-Bench,一种专门设计用于评估VLMs在无人机导航中空间智能能力的全面基准测试。我们的基准测试包括环境感知和场景理解两大类,分为13个子类别,包括边界框、颜色、距离、高度和着陆安全性分析等。对各种主流开源和闭源VLMs的广泛评估显示,在复杂无人机导航场景中的表现不尽如人意,突显了其空间能力的显著差距。为应对这一挑战,我们开发了SpatialSky-数据集,包含100万样本,涵盖各种场景的多样化注释。利用该数据集,我们引入了Sky-VLM,一种专门用于无人机多粒度和上下文空间推理的VLM。广泛的实验结果表明,Sky-VLM在所有基准测试任务中均达到最先进的性能,为开发适用于无人机场景的VLM铺平了道路。源代码可在https://github.com/linglingxiansen/SpatialSKy获取。
Summary / 总结
The research aims to evaluate the spatial intelligence capabilities of Vision-Language Models (VLMs) in UAV navigation, addressing the lack of comprehensive benchmarks in this area. The study introduces SpatialSky-Bench, a benchmark with 13 subcategories for environmental perception and scene understanding, and finds that existing VLMs perform poorly in complex UAV navigation scenarios. To improve this, the authors developed the SpatialSky-Dataset and Sky-VLM, which show superior performance across all benchmark tasks, indicating a significant advancement in VLMs for UAV navigation scenarios.
研究旨在评估视觉语言模型(VLMs)在无人机导航中的空间智能能力,填补了该领域缺乏全面基准的空白。研究引入了SpatialSky-Bench,该基准包含13个子类别,用于环境感知和场景理解,并发现现有VLMs在复杂无人机导航场景中表现不佳。为改进这一问题,作者开发了SpatialSky-Dataset和Sky-VLM,这些模型在所有基准任务中表现出色,表明在无人机导航场景中VLMs取得了显著进步。
Building Egocentric Procedural AI Assistant: Methods, Benchmarks, and Challenges
Authors: Junlong Li, Huaiyuan Xu, Sijie Cheng, Kejun Wu, Kim-Hui Yap, Lap-Pui Chau, Yi Wang
First: 2025-11-17T11:21:42+00:00 · Latest: 2025-11-17T11:21:42+00:00
Comments: 26 pages, 8 figures, 8 tables, Under peer-review
Abstract
Driven by recent advances in vision language models (VLMs) and egocentric perception research, we introduce the concept of an egocentric procedural AI assistant (EgoProceAssist) tailored to support daily procedural tasks step by step from a first-person view. In this work, we start by identifying three core tasks: egocentric procedural error detection, egocentric procedural learning, and egocentric procedural question answering. These tasks define the essential functions of EgoProceAssist within a new taxonomy. Specifically, our work encompasses a comprehensive review of current techniques, relevant datasets, and evaluation metrics across these three core areas. To clarify the gap between the proposed EgoProceAssist and existing VLM-based AI assistants, we introduce novel experiments and provide a comprehensive evaluation of representative VLM-based methods. Based on these findings and our technical analysis, we discuss the challenges ahead and suggest future research directions. Furthermore, an exhaustive list of the work covered in this study is publicly available in an active repository that continuously collects the latest work: https://github.com/z1oong/Building-Egocentric-Procedural-AI-Assistant
中文标题/摘要
标题:构建以自我为中心的程序化AI助手:方法、基准和挑战
受近期视觉语言模型(VLMs)和以自我为中心感知研究的推动,我们引入了以自我为中心的程序化AI助手(EgoProceAssist)的概念,旨在从第一人称视角提供日常程序任务的逐步支持。在本文中,我们首先确定了三个核心任务:以自我为中心的程序错误检测、以自我为中心的程序学习和以自我为中心的程序问答。这些任务定义了EgoProceAssist在新分类中的基本功能。具体而言,我们的工作涵盖了这三个核心领域的当前技术、相关数据集和评估指标的全面回顾。为了澄清所提出的EgoProceAssist与现有基于VLM的AI助手之间的差距,我们引入了新的实验,并对代表性VLM方法进行了全面评估。基于这些发现和技术分析,我们讨论了未来的研究挑战,并提出了未来的研究方向。此外,本研究的详尽列表在活跃的仓库中公开,并不断收集最新的工作:https://github.com/z1oong/Building-Egocentric-Procedural-AI-Assistant
Summary / 总结
This paper introduces an egocentric procedural AI assistant (EgoProceAssist) designed to support daily tasks from a first-person perspective. It identifies three core tasks: error detection, learning, and question answering, and reviews relevant techniques, datasets, and evaluation metrics. Novel experiments evaluate VLM-based methods, highlighting gaps and challenges, and suggesting future research directions. An active repository is provided for ongoing research contributions.
本文介绍了用于支持日常任务的基于第一人称视角的程序化AI助手(EgoProceAssist)。它确定了三个核心任务:错误检测、学习和问答,并回顾了相关技术、数据集和评估指标。进行了新颖的实验来评估基于VLM的方法,指出了差距和挑战,并提出了未来的研究方向。最新的工作可以在一个活跃的仓库中找到。
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model
Authors: Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang
First: 2025-07-28T08:44:58+00:00 · Latest: 2025-11-17T11:15:25+00:00
Abstract
Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
中文标题/摘要
标题:TransPrune: Token 过渡 剪枝 以提高大型视觉-语言模型的效率
大型视觉-语言模型(LVLMs)在多模态学习方面取得了进展,但由于视觉标记数量庞大导致计算成本高昂,因此需要通过标记剪枝来提高推理效率。关键挑战在于确定哪些标记真正重要。大多数现有方法依赖于基于注意力的标准来估计标记的重要性。然而,它们固有地存在某些局限性,如位置偏差。在本文中,我们从LVLMs中的标记过渡角度探索了标记重要性的新视角。我们观察到,标记表示的过渡提供了有意义的语义信息信号。基于这一见解,我们提出了TransPrune,这是一种无需训练且高效的标记剪枝方法。具体而言,TransPrune 通过结合 Token 过渡 变异(TTV)——衡量标记表示在大小和方向上的变化——和指令引导注意力(IGA)——衡量指令如何强烈地关注图像标记——逐步剪枝标记。广泛的实验表明,TransPrune 在八个基准测试中实现了与原始LVLMs(如LLaVA-v1.5和LLaVA-Next)相当的多模态性能,同时将推理TFLOPs减少了超过一半。此外,仅TTV就可以作为有效的标准,无需依赖注意力,其性能与基于注意力的方法相当。代码将在论文被接受后在 https://github.com/liaolea/TransPrune 上公开。
Summary / 总结
TransPrune is a token pruning method for LVLMs that assesses token importance based on token transitions, achieving comparable performance to original models while reducing inference costs. It uses Token Transition Variation and Instruction-Guided Attention to prune tokens without training, and demonstrates significant efficiency gains across multiple benchmarks.
TransPrune 是一种基于 LVLM 中 token 过渡变化和指令引导注意力进行 token 剪枝的方法,解决了现有基于注意力的方法的局限性。它在多个基准测试中实现了与 LLaVA-v1.5 和 LLaVA-Next 等原始模型相当的多模态性能,同时将推理 TFLOPs 减少了超过一半。此外,仅 TTV 就可以作为有效的剪枝标准,无需依赖注意力机制。
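The idea of scoring tokens by how their representations change in magnitude and direction can be sketched as follows. The exact TTV formula is not given in the abstract, so this is only an assumed form: relative norm change plus one minus cosine similarity, equally weighted, with higher scores treated as more important. The function names and the toy vectors are hypothetical, and the IGA term is left out.

```python
import math

def transition_variation(h_in, h_out):
    """Score one token by how much its representation changes across a layer.

    Assumed form: relative norm change + (1 - cosine similarity).
    """
    norm_in = math.sqrt(sum(v * v for v in h_in))
    norm_out = math.sqrt(sum(v * v for v in h_out))
    dot = sum(a * b for a, b in zip(h_in, h_out))
    magnitude_change = abs(norm_out - norm_in) / (norm_in + 1e-8)
    direction_change = 1.0 - dot / (norm_in * norm_out + 1e-8)
    return magnitude_change + direction_change  # assumed equal weighting

def prune_tokens(states_in, states_out, keep):
    """Return the indices (in original order) of the `keep` highest-scoring tokens."""
    scores = [transition_variation(a, b) for a, b in zip(states_in, states_out)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Toy example: token 0 is unchanged, token 1 rotates, token 2 grows in norm.
states_in = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
states_out = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
print(prune_tokens(states_in, states_out, keep=2))
```

In this toy case the unchanged token is the one dropped. A progressive scheme, as the abstract describes, would repeat such a selection at several layers with a shrinking `keep` budget.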
MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI
Authors: Malek Al Abed, Sebiha Demir, Anne Groteklaes, Elodie Germani, Shahrooz Faghihroohi, Hemmen Sabir, Shadi Albarqouni
First: 2025-11-17T10:51:11+00:00 · Latest: 2025-11-17T10:51:11+00:00
Comments: 5 pages, 4 figures
Abstract
Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging a volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR by 15.3%, with a 1.78% gain over the state of the art, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for reliable neonatal brain assessment.
中文标题/摘要
标题:MRIQT:针对新生儿超低场MRI的物理感知扩散模型以提高图像质量
便携式超低场MRI(uLF-MRI,0.064 T)为新生儿护理提供了可访问的神经影像学服务,但与高场(HF)MRI相比,其信噪比低且诊断质量差。我们提出了一种3D条件扩散框架MRIQT,用于从uLF到HF MRI的图像质量转移(IQT)。MRIQT结合了物理一致的超低场模拟的真实K空间降级、用于稳定图像到图像生成的v-预测与无分类引导,并使用信噪比加权的3D感知损失以实现解剖学保真度。该模型从噪声的uLF输入中去噪,并根据相同的扫描进行条件化,利用体素注意力UNet架构进行结构保持的转换。MRIQT在包含多种病理的新生儿队列上进行训练,其PSNR比最近的GAN和CNN基线高出15.3%,超过最新技术水平1.78%,而医生评定其85%的输出为高质量,病理清晰可见。MRIQT使便携式超低场(uLF)MRI的高保真、扩散增强成为可能,以实现可靠的新生儿脑部评估。
Summary / 总结
MRIQT is a 3D conditional diffusion model designed to transfer image quality from ultra-low-field (uLF) to high-field (HF) MRI, addressing the low signal-to-noise ratio and poor diagnostic quality of uLF-MRI. It uses realistic K-space degradation, v-prediction with classifier-free guidance, and an SNR-weighted 3D perceptual loss to ensure anatomical fidelity. MRIQT outperforms recent GAN and CNN baselines in PSNR by 15.3%, and physicians rated 85% of its outputs as good quality with clear pathology present. This model enhances the diagnostic capability of portable uLF-MRI for neonatal brain assessment.
MRIQT 是一种 3D 条件扩散模型,旨在将图像质量从便携式超低场 MRI 转移到高场 MRI,解决超低场 MRI 信号噪声比低和诊断质量差的问题。它使用真实的 K 空间降级、v 预测与无分类引导以及 SNR 加权的 3D 感知损失来确保解剖保真度。MRIQT 在 PSNR 上比最近的 GAN 和 CNN 基线高出 15.3%,并且在 85% 的情况下,医生认为其输出为高质量且病理清晰。
On the Limitations of Language Targeted Pruning: Investigating the Calibration Language Impact in Multilingual LLM Pruning
Authors: Simon Kurz, Jian-Jia Chen, Lucie Flek, Zhixue Zhao
First: 2024-08-26T16:29:13+00:00 · Latest: 2025-11-17T10:48:59+00:00
Comments: Accepted for publication in TACL
Abstract
Recent advances in large language model (LLM) pruning have shown state-of-the-art (SotA) compression results in post-training and retraining-free settings while maintaining high predictive performance. However, previous research mainly considered calibrating based on English text, despite the multilingual nature of modern LLMs and their frequent use in non-English languages. This analysis paper conducts an in-depth investigation of the performance and internal representation changes associated with pruning multilingual language models for monolingual applications. We present the first comprehensive empirical study, comparing different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques. We further analyze the latent subspaces, pruning masks, and individual neurons within pruned models. Our results reveal that while calibration on the target language effectively retains perplexity and yields high signal-to-noise ratios, it does not consistently improve downstream task performance. Further analysis of internal representations at three different levels highlights broader limitations of current pruning approaches: While they effectively preserve dominant information like language-specific features, this is insufficient to counteract the loss of nuanced, language-agnostic features that are crucial for knowledge retention and reasoning.
中文标题/摘要
标题:关于语言目标剪枝的局限性:探究多语言LLM剪枝中的校准语言影响
近年来,大规模语言模型(LLM)剪枝的进展在后训练和无需重新训练的设置中实现了最先进的(SotA)压缩效果,同时保持了高预测性能。然而,之前的研究所主要基于英语文本进行校准,尽管现代LLM具有多语言性质,并且经常在非英语语言中使用。本文深入分析了为单语言应用剪枝多语言模型时性能和内部表示的变化。我们进行了首个全面的经验研究,比较了在不同语言上校准多语言模型的剪枝效果,涉及多种语言、任务、模型和SotA剪枝技术。我们进一步分析了剪枝模型中的潜在子空间、剪枝掩码和单个神经元。研究结果表明,尽管在目标语言上校准有效地保留了困惑度并产生了高信噪比,但并不一致地提高下游任务性能。对内部表示的进一步分析在三个不同层次上揭示了当前剪枝方法更广泛局限性:它们有效地保留了如语言特定特征等主导信息,但不足以抵消对知识保留和推理至关重要的细微、语言无关特征的损失。
Summary / 总结
This paper investigates the limitations of language-targeted pruning in multilingual large language models (LLMs), focusing on the impact of calibration language on model performance. It compares different calibration languages for pruning across various languages, tasks, and models, and finds that while target language calibration helps maintain perplexity and signal-to-noise ratios, it does not uniformly enhance downstream task performance. The analysis also reveals that current pruning techniques preserve language-specific features but fail to maintain nuanced, language-agnostic features essential for knowledge retention and reasoning.
该论文研究了多语言大型语言模型(LLM)中语言目标剪枝的局限性,重点关注校准语言对模型性能的影响。研究比较了不同校准语言在多种语言、任务和模型上的表现,并发现虽然目标语言校准有助于保持困惑度和信噪比,但并不能均匀提升下游任务性能。分析表明,当前的剪枝技术能够有效保留语言特定特征,但无法保留对知识保留和推理至关重要的细微、语言无关特征。
DeToNATION: Decoupled Torch Network-Aware Training on Interlinked Online Nodes
Authors: Mogens Henrik From, Jacob Nielsen, Lukas Galke Poech, Peter Schneider-Kamp
Venue: AAAI 2026
First: 2025-02-10T17:55:59+00:00 · Latest: 2025-11-17T10:34:58+00:00
Comments: Accepted as a paper at AAAI 2026 Main Track
Abstract
Training large neural network models requires extensive computational resources, often distributed across several nodes and accelerators. Recent findings suggest that it may be sufficient to only exchange the fast moving components of the gradients, while accumulating momentum locally (Decoupled Momentum, or DeMo). However, DeMo assumes that models fit on a single accelerator. We relax this assumption and introduce FlexDeMo, whereby nodes fully shard model parameters locally between different accelerators, while inter-node communication is reduced by synchronizing only fast-moving components instead of the full gradients -- resulting in a hybrid sharded data parallel training strategy. We further introduce a framework, denoted as DeToNATION, that generalizes DeMo, FlexDeMo, and other popular distributed training schemes such as DiLoCo -- introducing new variations of replication schemes and challenging choices made in DeMo. Our results across language and vision domains show that FlexDeMo attains similar validation loss as hybrid sharded data parallel training employing AdamW and full gradient synchronization, while being substantially faster. FlexDeMo is thus a promising distributed training scheme for the largest machine learning models.
中文标题/摘要
标题:DeToNATION:解耦的Torch网络感知训练于互联的在线节点
训练大型神经网络模型需要大量的计算资源,通常分布在多个节点和加速器上。最近的研究表明,可能只需要交换梯度的快速移动部分,而局部累积动量(解耦动量,或DeMo)就足够了。然而,DeMo假设模型可以适应单个加速器。我们放宽了这一假设,引入了FlexDeMo,其中节点在不同的加速器之间完全分割模型参数,而节点间通信通过仅同步快速移动的部分而不是完整的梯度来减少,从而形成一种混合分割数据并行训练策略。我们还引入了一个框架,称为DeToNATION,它泛化了DeMo、FlexDeMo和其他流行的分布式训练方案,如DiLoCo——引入了DeMo的复制方案的新变体,并挑战了DeMo中的选择。我们在语言和视觉领域的结果表明,FlexDeMo在使用AdamW和完整梯度同步的混合分割数据并行训练中达到了相似的验证损失,但速度更快。因此,FlexDeMo是大型机器学习模型的有前途的分布式训练方案。
Summary / 总结
The paper addresses the challenge of training large neural network models by proposing FlexDeMo, a hybrid sharded data parallel training strategy that reduces inter-node communication by synchronizing only fast-moving components of gradients. This method, combined with a framework called DeToNATION, achieves similar validation loss to full gradient synchronization while being significantly faster, making it a promising approach for training large models in both language and vision domains.
论文提出了一种名为FlexDeMo的混合分片数据并行训练策略,通过仅同步梯度中的快速移动部分来减少节点间的通信,从而在语言和视觉领域的大规模机器学习模型训练中实现与全梯度同步相当的验证损失,同时速度更快,是一种有前景的分布式训练方案。
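The communication pattern described in the abstract (accumulate momentum locally, exchange only the fast-moving part) can be sketched in a few lines. This is a simplified illustration, not the DeToNATION code: the abstract does not describe how DeMo extracts fast-moving components, so a plain top-k by magnitude is swapped in as a stand-in, intra-node sharding is not modeled, and `split_fast_components`, `sync_step`, `k`, and `beta` are hypothetical names and values.

```python
def split_fast_components(momentum, k):
    """Split momentum into its k largest-magnitude entries and the residual.

    Assumption: top-k by magnitude stands in for the real extraction scheme.
    """
    top = sorted(range(len(momentum)), key=lambda i: abs(momentum[i]), reverse=True)[:k]
    fast = {i: momentum[i] for i in top}
    residual = [0.0 if i in fast else m for i, m in enumerate(momentum)]
    return fast, residual

def sync_step(node_momenta, node_grads, k, beta=0.9):
    """One step: accumulate locally, exchange only the fast components."""
    n = len(node_momenta)
    dim = len(node_momenta[0])
    shared = [0.0] * dim
    new_momenta = []
    for m, g in zip(node_momenta, node_grads):
        m = [beta * mi + gi for mi, gi in zip(m, g)]  # local momentum update
        fast, residual = split_fast_components(m, k)
        for i, v in fast.items():
            shared[i] += v / n  # the only values that would cross the network
        new_momenta.append(residual)  # the slow part never leaves the node
    return shared, new_momenta

momenta = [[0.0] * 4, [0.0] * 4]
grads = [[4.0, 1.0, 0.0, 0.0], [0.0, 1.0, 6.0, 0.0]]
shared, momenta = sync_step(momenta, grads, k=1)
print(shared, momenta)
```

With `k=1`, each node contributes a single value per step instead of a full gradient. The entries that were not sent remain in the local momentum and can be transmitted in a later step once they grow large enough.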
3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation
Authors: Seonho Lee, Jiho Choi, Inha Kang, Jiwook Kim, Junsung Park, Hyunjung Shim
First: 2025-06-11T15:56:59+00:00 · Latest: 2025-11-17T10:19:22+00:00
Abstract
Vision-Language Models (VLMs) have shown remarkable performance on diverse visual and linguistic tasks, yet they remain fundamentally limited in their understanding of 3D spatial structures. We propose Geometric Distillation, a lightweight, annotation-free fine-tuning framework that injects human-inspired geometric cues into pretrained VLMs without modifying their architecture. By distilling (1) sparse correspondences, (2) relative depth relations, and (3) dense cost volumes from off-the-shelf 3D foundation models (e.g., MASt3R, VGGT), our method shapes representations to be geometry-aware while remaining compatible with natural image-text inputs. Through extensive evaluations on 3D vision-language reasoning and 3D perception benchmarks, our method consistently outperforms prior approaches, achieving improved 3D spatial reasoning with significantly lower computational cost. Our work demonstrates a scalable and efficient path to bridge 2D-trained VLMs with 3D understanding, opening up wider use in spatially grounded multimodal tasks.
中文标题/摘要
标题:具有3D意识的视觉-语言模型微调与几何蒸馏
视觉-语言模型(VLMs)在各种视觉和语言任务上表现出色,但在理解3D空间结构方面仍然存在根本限制。我们提出了一种轻量级、无需标注的微调框架——几何蒸馏,该框架通过从现成的3D基础模型(如MASt3R、VGGT)中提取人类启发的几何线索,注入到预训练的VLMs中,而不修改其架构。通过蒸馏(1)稀疏对应关系,(2)相对深度关系,以及(3)密集成本体积,我们的方法使表示具有几何意识,同时保持与自然图像-文本输入的兼容性。通过在3D视觉-语言推理和3D感知基准测试中的广泛评估,我们的方法在所有方面都优于先前的方法,实现了显著更低的计算成本和更好的3D空间推理。我们的工作展示了将2D训练的VLMs与3D理解相结合的可扩展且高效路径,为基于空间的多模态任务的广泛应用打开了大门。
Summary / 总结
The research aims to enhance the 3D spatial understanding of Vision-Language Models (VLMs) by introducing a lightweight fine-tuning framework called Geometric Distillation. This method injects geometric cues from 3D foundation models into pretrained VLMs without altering their architecture. The framework improves 3D spatial reasoning and outperforms previous approaches with lower computational cost. Key findings show consistent performance gains on 3D vision-language and perception benchmarks.
研究旨在通过引入一种轻量级的细调框架——几何蒸馏,增强视觉-语言模型(VLMs)的3D空间理解能力。该方法从3D基础模型中注入几何线索到预训练的VLMs中,而不改变其架构。实验结果表明,几何蒸馏在各种基准测试中提高了3D空间推理性能,并且计算成本更低,优于先前的方法。
Video Spatial Reasoning with Object-Centric 3D Rollout
Authors: Haoran Tang, Meng Cao, Ruyang Liu, Xiaoxi Liang, Linglong Li, Ge Li, Xiaodan Liang
First: 2025-11-17T09:53:41+00:00 · Latest: 2025-11-17T09:53:41+00:00
Abstract
Recent advances in Multi-modal Large Language Models (MLLMs) have showcased remarkable capabilities in vision-language understanding. However, enabling robust video spatial reasoning, the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes, remains a key unsolved challenge. Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. By degrading object-specific visual cues and projecting the altered geometry into 2D space, OCR compels the model to reason holistically across the entire scene. We further design a rollout-based training pipeline that jointly leverages vanilla and region-noisy videos to optimize spatial reasoning trajectories. Experiments demonstrate state-of-the-art performance: our 3B-parameter model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines. Ablations confirm OCR's superiority over prior rollout strategies (e.g., T-GRPO, NoisyRollout).
中文标题/摘要
标题:基于对象中心的3D展开视频空间推理
近期多模态大型语言模型(MLLMs)在视觉-语言理解方面展现了卓越的能力。然而,使模型具备稳健的视频空间推理能力——即理解动态3D场景中物体的位置、姿态和物体间关系的能力——仍然是一个关键的未解挑战。现有方法主要依赖于空间上下文指导的监督微调或强化学习,但我们发现这些模型往往表现出查询锁定的推理方式,仅关注提示中明确提到的物体,而忽视了关键的上下文线索。为解决这一局限,我们提出了对象中心的3D展开(OCR)策略,该策略在训练过程中引入了对选定物体的结构化扰动。通过削弱特定物体的视觉线索并将修改后的几何结构投影到2D空间,OCR促使模型在整个场景中进行整体推理。我们还设计了一种基于展开的训练管道,该管道联合利用普通视频和区域噪声视频来优化空间推理轨迹。实验结果表明,我们的3B参数模型在VSI-Bench上达到了47.5%的准确率,优于多个7B基线模型。消融实验证实了OCR策略优于先前的展开策略(如T-GRPO、NoisyRollout)的优势。
Summary / 总结
The research aims to enhance video spatial reasoning by addressing a limitation of existing models, which often focus narrowly on explicitly mentioned objects. The proposed method, Object-Centric 3D Rollout (OCR), introduces structured perturbations to the 3D geometry of selected objects during training to encourage holistic scene reasoning. Experiments show that the 3B-parameter OCR model achieves 47.5% accuracy on VSI-Bench, outperforming several 7B baselines, and ablations confirm its advantage over previous rollout strategies.
研究旨在通过解决现有模型往往仅聚焦于明确提及的对象的局限性,增强视频空间推理能力。提出的Object-Centric 3D Rollout (OCR) 方法在训练过程中对选定对象的3D几何结构引入结构化扰动,促使模型在整个场景中进行整体推理。实验表明,3B参数的OCR模型在VSI-Bench上的准确率达到47.5%,优于多个7B基线模型,并且优于先前的卷出策略如T-GRPO和NoisyRollout。
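The "perturb selected 3D geometry, then project to 2D" step mentioned in the OCR abstract can be pictured with a minimal sketch. It rests on stated assumptions only: Gaussian jitter is used as the perturbation (the abstract says "structured perturbations" without detail), a simple pinhole camera is assumed, and the function names, intrinsics, and point data are hypothetical.

```python
import random

def perturb_object(points, object_ids, target_id, sigma, seed=0):
    """Add Gaussian jitter to the 3D points of one selected object only."""
    rng = random.Random(seed)
    out = []
    for p, oid in zip(points, object_ids):
        if oid == target_id:
            p = tuple(c + rng.gauss(0.0, sigma) for c in p)
        out.append(p)
    return out

def project(points, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of camera-frame points (x, y, z) to pixel coordinates."""
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in points]

points = [(0.0, 0.0, 2.0), (1.0, 0.0, 4.0)]
object_ids = [0, 1]
noisy = perturb_object(points, object_ids, target_id=1, sigma=0.05)
print(project(points))  # clean view
print(project(noisy))   # only object 1 is displaced in the image
```

Only the selected object's projection changes, so the rest of the scene stays intact as context, which is the property the abstract relies on to push the model toward whole-scene reasoning.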