arXiv Paper Digest

2025-11-06 03:29
Snapshot: 20251106_0329
Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Authors: Rongxin Liao, Feng Li, Yanyan Wei, Zenglin Shi, Le Zhang, Huihui Bai, Meng Wang
First: 2025-03-12T03:03:06+00:00 · Latest: 2025-11-04T15:59:18+00:00
Abstract
Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic "Prompt-Restore-Prompt" pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt.
中文标题/摘要
标题:从提示恢复,恢复到提示:循环提示在通用不良天气去除中的应用
通用不良天气去除(UAWR)旨在在一个统一框架内解决各种天气退化问题。最近的方法受到预训练视觉-语言模型(如CLIP)提示学习的启发,利用退化感知提示来促进无天气图像恢复,取得了显著的改进。在本文中,我们提出了一种名为CyclicPrompt的创新循环提示方法,旨在增强UAWR的有效性、适应性和泛化能力。CyclicPrompt包含两个关键组件:1) 综合上下文提示,将与天气相关的信息和上下文感知表示整合到网络中以指导恢复。这种提示与以往方法不同,通过结合可学习的输入条件向量和特定天气知识,提高了在各种退化中的适应性。2) 在初始引导恢复之后,擦除并粘贴机制用受限的恢复先验替换特定天气知识,将高质量的无天气概念引入综合提示中,进一步微调恢复过程。因此,我们可以形成一个循环的“提示-恢复-提示”管道,巧妙地利用特定天气知识、文本上下文和可靠的纹理。在合成和真实世界数据集上的大量实验验证了CyclicPrompt的优越性能。代码可在以下链接获取:https://github.com/RongxinL/CyclicPrompt.
Summary / 总结
The research aims to improve universal adverse weather removal (UAWR) by proposing CyclicPrompt, which enhances adaptability and generalizability through a cyclic 'Prompt-Restore-Prompt' pipeline. CyclicPrompt includes a composite context prompt that integrates weather-related information and context-aware representations, and an erase-and-paste mechanism that refines the restoration process. Experiments show that CyclicPrompt outperforms existing methods on both synthetic and real-world datasets.
研究旨在通过提出CyclicPrompt来提升统一不良天气去除(UAWR)的效果,该方法通过循环的'Prompt-Restore-Prompt'管道增强适应性和通用性。CyclicPrompt包括一个综合上下文提示,该提示整合了天气相关信息和上下文感知表示,以及一个擦除和粘贴机制,以进一步细化恢复过程。实验表明,CyclicPrompt在合成和真实世界数据集上均优于现有方法。
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Authors: Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
First: 2025-11-04T13:56:39+00:00 · Latest: 2025-11-04T13:56:39+00:00
Comments: 13 pages, 8 figures, 3 tables. The first two authors contributed equally. Project Page: https://iyatomilab.github.io/TAUE
Abstract
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
中文标题/摘要
标题:TAUE:无需训练的噪声移植与培养扩散模型
尽管文本到图像扩散模型取得了显著的成功,但它们输出单一、扁平图像的局限性仍然是专业应用中逐层控制的瓶颈。现有解决方案要么依赖于大规模、难以获取的数据集的微调,要么是无需训练但只能生成孤立的前景元素,无法生成完整且连贯的场景。为了解决这一问题,我们提出了无需训练的噪声移植与培养扩散模型(TAUE),这是一种用于零样本、逐层图像生成的新框架。我们的核心技术,噪声移植与培养(NTC),从前景生成和复合生成过程中提取中间的潜在表示,并将其移植到初始噪声中,以供后续层使用。这确保了前景、背景和复合层之间的语义和结构一致性,从而在无需微调或辅助数据集的情况下实现一致的多层输出。大量实验表明,我们的无需训练方法在性能上与微调方法相当,增强了逐层一致性,同时保持了高质量和高保真度的图像。TAUE不仅消除了昂贵的训练和数据集需求,还解锁了新的下游应用,如复杂的组合编辑,为更易于访问和可控的生成工作流程铺平了道路。
Summary / 总结
TAUE is a training-free noise transplantation and cultivation diffusion model designed to generate multi-layered images without the need for fine-tuning or large datasets. It extracts intermediate latent representations from foreground and composite generation processes and transplants them into initial noise for subsequent layers, ensuring semantic and structural coherence. Experiments show that TAUE achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency and maintaining high image quality and fidelity, thus enabling complex compositional editing and more accessible generative workflows.
TAUE 是一个无需训练的框架,用于零样本、分层图像生成,通过确保前景、背景和合成层之间的语义和结构一致性来解决现有文本到图像扩散模型的限制。它使用噪声移植和培养 (NTC) 技术提取并移植潜在表示,无需微调即可实现一致的多层输出。实验表明,TAUE 的性能与微调方法相当,增强了分层一致性并保持了高质量的图像。
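The transplantation idea in the TAUE abstract can be pictured with a toy example. The sketch below is only an illustration under strong assumptions: it treats latents as flat lists and uses a binary foreground mask to decide where an intermediate latent overwrites the initial noise. The mask, the function name `transplant`, and the constant latent values are hypothetical and are not taken from the paper.

```python
import random


def transplant(noise, latent, mask):
    """Copy latent values into the noise wherever mask is 1 (flat lists)."""
    return [l if m else n for n, l, m in zip(noise, latent, mask)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(6)]  # initial noise for the next layer
fg_latent = [0.5] * 6       # stand-in for an intermediate foreground latent
mask = [0, 1, 1, 0, 0, 0]   # assumed foreground region (1 = transplant here)
seeded = transplant(noise, fg_latent, mask)
```

Outside the masked region the noise is left untouched, so the later generation stays free there while sharing structure with the earlier layer inside it.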
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Authors: Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
First: 2025-10-19T15:38:06+00:00 · Latest: 2025-11-04T13:15:36+00:00
Abstract
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available to support further research.
中文标题/摘要
标题:Uniworld-V2: 使用扩散负样本感知微调和MLLM隐式反馈强化图像编辑
基于指令的图像编辑已经取得了显著进展;然而,仅通过监督微调训练的模型往往会过度拟合标注模式,限制了它们探索和泛化的能力。为此,我们提出了Edit-R1,一种基于策略优化的新型后训练框架。具体而言,我们利用了一种与流匹配前向过程一致的无似然性策略优化方法——扩散负样本感知微调(DiffusionNFT),从而能够使用高阶采样器和更高效的训练。另一个关键挑战是没有通用的奖励模型,这源于编辑指令和任务的多样性。为了解决这一问题,我们采用了一种多模态大型语言模型(MLLM)作为统一的、无需训练的奖励模型,利用其输出logits提供细粒度反馈。此外,我们精心设计了一种低方差组过滤机制,以减少MLLM评分噪声并稳定优化。使用此框架训练的UniWorld-V2在ImgEdit和GEdit-Bench基准上分别取得了4.49和7.83的最佳结果。重要的是,我们的框架是模型无关的,当应用于诸如Qwen-Image-Edit和FLUX-Kontext等不同基础模型时,能够显著提高性能,展示了其广泛的应用性。代码和模型已公开,以支持进一步的研究。
Summary / 总结
The research introduces Edit-R1, a post-training framework for instruction-based image editing using policy optimization. It employs Diffusion Negative-aware Finetuning (DiffusionNFT) for efficient training and a Multimodal Large Language Model (MLLM) as a reward model to provide fine-grained feedback. The framework achieves state-of-the-art results on ImgEdit and GEdit-Bench benchmarks with scores of 4.49 and 7.83, respectively, and is model-agnostic, enhancing various base models like Qwen-Image-Edit and FLUX-Kontext.
该论文提出了一种基于策略优化的后训练框架Edit-R1,采用Diffusion Negative-aware Finetuning (DiffusionNFT) 和Multimodal Large Language Model (MLLM) 作为奖励模型,提供精细反馈。该框架在ImgEdit和GEdit-Bench基准测试中分别取得了4.49和7.83的得分,表现出色,并且是模型无关的,在不同基础模型上显示出显著的性能提升。
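Two ideas from the Edit-R1 abstract, logit-based rewards and low-variance group filtering, can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation. It assumes the MLLM exposes one logit per discrete score token (1 to 5 here), that the reward is the softmax-weighted expected score, and that a group of sampled edits is dropped when the population variance of its rewards falls below a threshold. The names `expected_score`, `filter_low_variance_groups`, and `min_var` are hypothetical.

```python
import math
from statistics import pvariance


def expected_score(logits):
    """Softmax-weighted expected score.

    logits: dict mapping a score value (int) to the logit of that score token.
    """
    m = max(logits.values())  # subtract the max for numerical stability
    weights = {s: math.exp(l - m) for s, l in logits.items()}
    z = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / z


def filter_low_variance_groups(groups, min_var):
    """Keep only groups whose rewards have population variance >= min_var."""
    return [g for g in groups if pvariance(g) >= min_var]


# Uniform logits over scores 1..5 give the midpoint score.
reward = expected_score({1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0})

# The first group is nearly constant, so its reward differences are mostly noise.
kept = filter_low_variance_groups(
    [[4.0, 4.01, 3.99], [1.0, 3.0, 5.0]], min_var=0.05
)
```

The motivation for filtering is that group-normalized advantages divide by the group's spread, so a near-constant group would amplify small scoring noise into large training signals.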
Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
Authors: Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang
First: 2025-11-04T11:43:05+00:00 · Latest: 2025-11-04T11:43:05+00:00
Abstract
The automation of workflows in advanced microscopy is a key goal where foundation models like Large Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.
中文标题/摘要
标题:将通用基础模型适应于低数据量条件下的X射线 Ptychography
在先进显微镜工作流程的自动化方面,基础模型如语言模型(LLMs)和视觉-语言模型(VLMs)显示出巨大的潜力。然而,将这些通用模型适应于特定的科学任务至关重要,而最优的领域适应策略往往不明确。为解决这一问题,我们引入了PtychoBench,这是一种新的多模态、多任务基准,用于衍射分析。利用这一基准,我们系统地比较了两种专业化策略:监督微调(SFT)和上下文学习(ICL)。我们在数据稀缺的条件下,使用VLMs进行视觉伪影检测任务,使用LLMs进行文本参数推荐任务,评估这些策略。我们的研究发现,最优的专业化路径取决于任务。对于视觉任务,SFT和ICL高度互补,微调模型在上下文感知示例的引导下,达到最高的平均性能(Micro-F1为0.728)。相反,对于文本任务,大型基础模型上的ICL是更优策略,达到峰值Micro-F1为0.847,优于强大的“超级专家”SFT模型(零样本Micro-F1为0.839)。我们还确认了上下文感知提示的优越性,并在微调模型中发现了一致的上下文干扰现象。这些结果,与包括GPT-4o和基于DINOv3的分类器在内的强基线进行基准测试,为科学中的AI提供了关键观察:在我们的基准中,最优的专业化路径取决于任务模态,为开发更有效的基于科学的代理系统提供了清晰的框架。
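The PtychoBench abstract reports its results as Micro-F1. As a reminder of what that metric pools, here is a minimal sketch for multi-label predictions. It assumes each sample carries a set of labels, which the abstract does not state, and the artifact label names are invented for illustration.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over multi-label predictions.

    y_true, y_pred: lists of label sets, one set per sample.
    True positives, false positives, and false negatives are pooled
    over all samples and labels before F1 is computed.
    """
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        tp += len(truth & pred)
        fp += len(pred - truth)
        fn += len(truth - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical artifact labels, for illustration only.
truth = [{"ringing"}, {"blur", "grid"}, set()]
preds = [{"ringing"}, {"blur"}, {"grid"}]
score = micro_f1(truth, preds)  # tp=2, fp=1, fn=1
```

Because counts are pooled before the ratio is taken, frequent labels dominate the score, which is worth keeping in mind when comparing the 0.728 and 0.847 figures across tasks with different label distributions.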
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Authors: Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi
Venue: NeurIPS 2025
First: 2025-03-13T18:59:12+00:00 · Latest: 2025-11-04T10:25:46+00:00
Comments: NeurIPS 2025
Abstract
Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer. Unlike conventional VLMs that passively output text, OS agents autonomously perform computer-based tasks in response to a single user prompt. OS agents do so by capturing, parsing, and analysing screenshots and executing low-level actions via application programming interfaces (APIs), such as mouse clicks and keyboard inputs. This direct interaction with the OS significantly raises the stakes, as failures or manipulations can have immediate and tangible consequences. In this work, we uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs), adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs. For instance, a MIP can be embedded in a desktop wallpaper or shared on social media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs generalise across user prompts and screen configurations, and that they can hijack multiple OS agents even during the execution of benign instructions. These findings expose critical security vulnerabilities in OS agents that have to be carefully addressed before their widespread deployment.
中文标题/摘要
标题:MIP对抗代理:恶意图像补丁劫持多模态OS代理
近年来,操作系统(OS)代理的进步使视觉-语言模型(VLMs)能够直接控制用户的计算机。与传统的被动输出文本的VLMs不同,OS代理能够自主执行基于计算机的任务,仅需一个用户指令。OS代理通过捕获、解析和分析屏幕截图,并通过应用程序编程接口(APIs)如鼠标点击和键盘输入执行低级操作来实现这一目标。这种直接与OS的交互显著提高了风险,因为失败或操纵可能会立即产生实际后果。在本研究中,我们发现了一种针对这些OS代理的新攻击向量:恶意图像补丁(MIPs),这些对抗性扰动的屏幕区域在被OS代理捕获时,会通过利用特定的APIs诱导其执行有害操作。例如,MIP可以嵌入在桌面上的壁纸中或在社交媒体上分享,以使OS代理泄露敏感用户数据。我们展示了MIPs在用户指令和屏幕配置方面具有泛化能力,并且即使在执行良性指令期间也能劫持多个OS代理。这些发现揭示了OS代理中关键的安全漏洞,这些漏洞在广泛部署之前必须仔细解决。
From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Authors: Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf
First: 2025-11-04T09:58:29+00:00 · Latest: 2025-11-04T09:58:29+00:00
Comments: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18 DECEMBER 2025
Abstract
Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks enabling the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open vocabulary tasks, combining multiple domains. However, the required computational complexity poses challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge devices in the context of Mobile Robotics. The proposed pipeline is evaluated on a diverse dataset consisting of various real-world cityscape, on-campus and indoor scenarios. The experimental evaluation discusses the potential of these small models on edge devices, with particular emphasis on challenges, weaknesses, inherent model biases and the application of the gained information. Supplementary material is provided via the following repository: https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
中文标题/摘要
标题:从实验室到实际应用:在移动机器人领域评估边缘设备上的零样本场景解释
视频理解、场景解释和常识推理是高度挑战性的任务,能够解释视觉信息,使智能体能够感知、交互并理性地做出决策。近年来,大型语言模型(LLMs)和视觉语言模型(VLMs)在这些领域取得了显著进展,不仅能够实现特定领域的应用,还能完成零样本开放式词汇任务,结合多个领域。然而,所需的计算复杂性为它们在边缘设备上的应用以及在移动机器人领域的应用带来了挑战,特别是在准确性和推理时间之间的权衡。本文研究了最先进的VLMs在场景解释和动作识别任务中的能力,特别关注适用于移动机器人领域边缘设备的小型VLMs。提出的管道在包含各种真实城市景观、校园和室内场景的多样数据集上进行了评估。实验评估讨论了这些小型模型在边缘设备上的潜力,特别是挑战、弱点、固有的模型偏差以及获得的信息的应用。补充材料可通过以下存储库提供:https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
Summary / 总结
This paper evaluates the capabilities of state-of-the-art Visual Language Models (VLMs) for zero-shot scene interpretation and action recognition in the context of mobile robotics, focusing on small models suitable for edge devices. The study uses a diverse dataset and highlights the potential of these models while discussing challenges and inherent biases. Key findings include the ability of these models to perform well on edge devices despite computational constraints, but also their limitations in accuracy and inference time. Supplementary material is available via a provided repository.
本文评估了最先进的视觉语言模型(VLMs)在移动机器人领域的零样本场景理解和动作识别能力,重点关注适合边缘设备的小型模型。研究使用了多样化的数据集,并讨论了这些模型的潜力及其挑战和固有偏差。主要发现包括,尽管存在计算限制,这些模型仍能在边缘设备上表现出色,但在准确性和推理时间方面也存在局限性。补充材料可通过提供的仓库获取。
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Authors: Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Junyuan Gao, Yubin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He
First: 2025-11-04T09:08:44+00:00 · Latest: 2025-11-04T09:08:44+00:00
Abstract
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
中文标题/摘要
标题:RxnCaption: 将化学反应图解解析重新定义为视觉提示引导的描述任务
大规模化学反应数据集对于化学领域的AI研究至关重要。然而,现有的化学反应数据通常以论文中的图像形式存在,使其无法被机器读取和用于训练机器学习模型。为应对这一挑战,我们提出了RxnCaption框架,用于化学反应图解解析任务(RxnDP)。我们的框架将传统的基于坐标的预测解析过程重新定义为图像描述问题,这是大型视觉语言模型(LVLMs)能够自然处理的问题。我们引入了一种称为“边界框和索引作为视觉提示”(BIVP)的策略,使用我们最先进的分子检测器MolYOLO在输入图像上预先绘制分子边界框和索引,将下游解析转化为自然语言描述问题。大量实验表明,BIVP策略显著提高了结构提取质量,简化了模型设计。我们进一步构建了包含11,000个样本的RxnCaption-11k数据集,其规模比之前的实际文献基准数据集大一个数量级,并且在四个布局原型上具有平衡的测试子集。实验表明,RxnCaption-VL在多个指标上达到了最先进的性能。我们认为,我们的方法、数据集和模型将促进化学文献中的结构化信息提取,并推动更广泛的化学领域AI应用。我们将通过GitHub发布数据、模型和代码。
Summary / 总结
The research aims to address the challenge of making chemical reaction images machine-readable for AI training. It proposes the RxnCaption framework, which reformulates reaction diagram parsing as a visual prompt guided captioning task. The BBox and Index as Visual Prompt (BIVP) strategy uses MolYOLO to pre-draw molecular bounding boxes and indices, turning the parsing task into a natural-language description problem. Experiments show that this approach significantly improves structural extraction quality and simplifies model design, achieving state-of-the-art performance on multiple metrics. The RxnCaption-11k dataset, an order of magnitude larger than previous benchmarks, further supports these findings.
论文提出了RxnCaption框架,将化学反应图解析重新定义为视觉提示引导的图像描述任务,使用LVLMs。它引入了BBox和Index作为视觉提示(BIVP)策略,在输入图像上预先绘制分子边界框和索引,简化了解析过程。实验表明,这种方法显著提高了结构提取的质量,并在多个指标上达到了最先进的性能。作者还构建了一个大规模的RxnCaption-11k数据集,增强了化学反应解析模型的训练。
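The "BBox and Index as Visual Prompt" step described in the RxnCaption abstract amounts to drawing detector boxes and their indices onto the input before captioning. The toy sketch below shows the idea on a character canvas rather than a real image. The box coordinates stand in for detector output and are hypothetical, and `draw_boxes` is an illustrative helper rather than part of the released code.

```python
def draw_boxes(canvas, boxes):
    """Draw box outlines and a 1-based index at each box's top-left corner.

    canvas: list of lists of single characters (rows of "pixels").
    boxes: list of (x0, y0, x1, y1) with inclusive coordinates.
    """
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        for x in range(x0, x1 + 1):
            canvas[y0][x] = canvas[y1][x] = "#"  # top and bottom edges
        for y in range(y0, y1 + 1):
            canvas[y][x0] = canvas[y][x1] = "#"  # left and right edges
        canvas[y0][x0] = str(idx)                # index label (single digit here)
    return canvas


canvas = [["." for _ in range(12)] for _ in range(5)]
boxes = [(0, 0, 4, 3), (6, 1, 11, 4)]  # hypothetical detector output
marked = draw_boxes(canvas, boxes)
rendered = ["".join(row) for row in marked]
```

Once the indices are visible in the image, a captioning model can refer to molecules by number in plain text instead of regressing coordinates.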
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Authors: Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Venue: NeurIPS 2025
First: 2025-06-03T17:24:55+00:00 · Latest: 2025-11-04T09:06:34+00:00
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuits (MP) algorithm from sparse coding to design MP-SAE -- an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MP-SAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.
中文标题/摘要
标题:从平面到层次结构:使用匹配追求提取稀疏表示
受神经网络表示可以编码为线性可访问的、近似正交的方向这一假设的启发,稀疏自编码器(SAEs)已成为可解释性的一个流行工具。然而,最近的工作展示了模型表示的现象学,这超出了这一假设的范围,显示出层次化、非线性和多维特征的特征。这提出了一个问题:SAEs是否表示了与其假设相矛盾的结构特征?如果不是,避免这种不匹配是否有助于识别这些特征并进一步了解神经网络表示?为了回答这些问题,我们采取了一种基于构造的方法,并重新将流行的匹配追求(MP)算法从稀疏编码重新定位,设计了MP-SAE——一种将编码器展开为残差引导步骤的SAE,使其能够捕捉层次化和非线性可访问的特征。在合成和自然数据设置的比较中,我们展示了:(i) 层次概念诱导条件正交特征,现有的SAEs无法忠实捕捉,(ii) MP-SAE的非线性编码步骤恢复了高度有意义的特征,帮助我们揭示了不同模态在视觉语言模型中看似二元表示空间中的共享结构,从而证明假设有用的特征仅线性可访问是不足的。我们还展示了MP-SAE的顺序编码原理在推理时提供了自适应稀疏性的额外好处,这可能具有独立的兴趣。总体而言,我们认为我们的结果支持了可解释性应该从表示的现象学开始,方法应源自适应其假设的观点。
Summary / 总结
This study investigates whether sparse autoencoders (SAEs) can capture hierarchical and nonlinear features, which are beyond the linear and orthogonal assumptions they are based on. By recontextualizing the matching pursuit (MP) algorithm, the researchers developed MP-SAE, which captures hierarchical and nonlinear features more effectively than traditional SAEs. The study demonstrates that hierarchical concepts induce conditionally orthogonal features that existing SAEs cannot capture accurately, and that the nonlinear encoding step of MP-SAE recovers highly meaningful features, revealing shared structure in different modalities of a vision-language model. This suggests that the assumption of linear accessibility is insufficient for interpretability. Additionally, MP-SAE offers adaptive sparsity at inference time, which is an independent benefit. Overall, the results support the idea that interpretability should start with the phenomenology of representations, with methods emerging from assumptions that fit it.
该研究探讨了稀疏自编码器(SAEs)是否能够捕捉到层次和非线性特征,这些特征超出了它们基于的线性和正交假设。通过重新利用匹配追求(MP)算法,作者开发了MP-SAE,这是一种能够捕捉层次和非线性特征的SAE。研究发现,现有的SAE无法捕捉到层次的条件正交特征,而MP-SAE能够成功恢复有意义的非线性特征,表明有用的特征不仅仅是线性可访问的。此外,MP-SAE在推理时表现出自适应稀疏性,这是另一个附加优势。研究结果支持从表征的现象学开始进行可解释性分析的观点,而不是将假设拟合到表征上。
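The residual-guided encoding that MP-SAE builds on is the classic matching pursuit loop from sparse coding. The sketch below shows that loop with a small fixed dictionary of unit-norm atoms. It is not the authors' learned model: the dictionary, the input, and the stopping tolerance are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(x, dictionary, n_steps):
    """Greedy matching pursuit over unit-norm atoms.

    At each step, pick the atom most correlated with the current residual,
    add its projection coefficient, and subtract its contribution.
    Returns (coeffs, residual).
    """
    residual = list(x)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_steps):
        scores = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(scores)), key=lambda i: abs(scores[i]))
        if abs(scores[k]) < 1e-12:  # nothing left to explain
            break
        coeffs[k] += scores[k]
        residual = [r - scores[k] * a for r, a in zip(residual, dictionary[k])]
    return coeffs, residual


s = 1 / math.sqrt(2)
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [s, s, 0.0]]  # unit-norm atoms
x = [2.0, 2.0, 0.0]
coeffs, residual = matching_pursuit(x, atoms, n_steps=3)
```

Here the diagonal atom explains the whole input in one step, so the loop stops early. That early stop is one way to picture the adaptive sparsity at inference time that the abstract mentions: the number of active atoms depends on how quickly the residual is exhausted.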
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban
First: 2025-11-04T08:56:28+00:00 · Latest: 2025-11-04T08:56:28+00:00
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs, yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves up to 95% attack success rate on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests then iteratively refines them. Extensive evaluation across commercial and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
中文标题/摘要
标题:AutoAdv:自动化对抗提示以实现大型语言模型的多轮脱笼攻击
大型语言模型(LLMs)仍然容易受到脱笼攻击的影响,其中对抗提示会引发有害输出,但大多数评估主要集中在单轮交互上,而实际攻击则通过适应性的多轮对话展开。我们提出了AutoAdv,这是一种无需训练的框架,用于实现自动化多轮脱笼攻击,在六轮内对Llama-3.1-8B的成功攻击率高达95%,比单轮基线提高了24个百分点。AutoAdv独特地结合了三种适应性机制:一个模式管理器,可以从成功的攻击中学习以增强未来的提示;一个温度管理器,根据失败模式动态调整采样参数;以及一个两阶段重写策略,先隐藏有害请求,然后逐步优化它们。广泛的评估表明,当前的安全机制存在持续的漏洞,多轮攻击始终优于单轮方法。这些发现表明,针对单轮交互优化的对齐策略无法在长时间对话中保持鲁棒性,突显了对多轮攻击意识的防御措施的迫切需求。
Summary / 总结
AutoAdv is a training-free framework designed to automate multi-turn jailbreaking of LLMs, achieving up to 95% success rate within six turns, which is a 24% improvement over single-turn baselines. It combines pattern learning, dynamic temperature adjustment, and a two-phase rewriting strategy to enhance adversarial prompting. The framework consistently outperforms single-turn approaches across various models, indicating the need for multi-turn-aware defenses to improve LLM safety.
AutoAdv 是一个无需训练的框架,用于自动化多轮 LLM 的 jailbreaking,能够在六轮内使 Llama-3.1-8B 的攻击成功率高达 95%,比单轮基线提高了 24%。它结合了模式管理器、温度管理器和两阶段重写策略,以动态调整提示并提高多轮对话中的攻击成功率。
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
First: 2025-03-23T13:18:17+00:00 · Latest: 2025-11-04T08:39:38+00:00
Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems
Abstract
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
中文标题/摘要
标题:从已见重写未见:使用基础模型增强视觉语言导航的观察-指令重写
视觉语言导航(VLN)领域长期面临数据稀缺的挑战,这极大地阻碍了代理在未见环境中的泛化能力。以往的工作主要依赖额外的模拟器数据或网络收集的图像/视频来提高泛化能力。然而,模拟器环境仍然面临多样性有限的问题,而网络收集的数据往往需要大量劳动来去除噪声。在本文中,我们提出了一种重写驱动的增强(RAM)范式,直接通过重写人类标注的训练数据来生成未见的观察-指令对。得益于我们的重写机制,新的观察-指令对可以在无需模拟器和节省劳动的情况下获得,从而促进泛化。具体而言,我们首先引入了对象增强的观察重写,其中结合视觉语言模型(VLMs)和大型语言模型(LLMs)来推导出重写后对象丰富的场景描述,通过文本到图像生成模型(T2IMs)实现具有多样对象和空间布局的观察合成。然后,我们提出了观察对比指令重写,通过要求LLMs推理原始观察与新观察之间的差异来生成与观察对齐的重写指令。我们进一步开发了一种混合然后聚焦的训练策略,结合随机观察裁剪方案,有效增强了数据分布的多样性,同时在训练过程中抑制增强数据噪声。在离散环境(R2R、REVERIE和R4R数据集)和连续环境(R2R-CE数据集)上的实验表明,我们的方法具有优越的性能和令人印象深刻的泛化能力。代码可在https://github.com/SaDil13/VLN-RAM获取。
Summary / 总结
This paper addresses the challenge of data scarcity in Vision-Language Navigation (VLN) by proposing a Rewriting-driven AugMentation (RAM) paradigm. It uses a combination of Vision-Language Models and Large Language Models to rewrite human-annotated training data, generating new observation-instruction pairs without the need for additional simulator data or extensive web data collection. The method includes Object-Enriched Observation Rewriting and Observation-Contrast Instruction Rewriting, and employs a mixing-then-focusing training strategy to enhance data diversity and reduce noise. Experiments on various VLN datasets demonstrate the method's superior performance and strong generalization ability.
本文提出了一种重写驱动增强(RAM)范式,以解决视觉-语言导航(VLN)中的数据稀缺问题。该方法利用视觉-语言模型(VLMs)和大型语言模型(LLMs)重写训练数据,生成新的观察-指令对,无需额外的模拟器数据或大量收集的网络数据。该方法包括对象增强的观察重写和观察对比指令重写,并采用混合-然后聚焦的训练策略,以增强数据多样性并减少噪声。在各种VLN数据集上的实验表明,该方法具有优越的性能和强大的泛化能力。
CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Authors: Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
First: 2025-11-04T08:28:46+00:00 · Latest: 2025-11-04T08:28:46+00:00
Abstract
In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
中文标题/摘要
标题:CoCoVa:连续视觉语言思维链的潜在空间推理
在人类认知中,存在许多默会且无法用言语表达的思维过程,使我们能够以多种方式理解并互动于世界。然而,当前的视觉语言模型(VLMs)仍然局限于语言令牌的离散和僵化空间内的推理,从而限制了视觉感知的丰富性和高维特性。为弥合这一差距,我们提出了CoCoVa(连续视觉语言思维链),一种利用连续跨模态推理的新框架,以应对多种视觉语言任务。CoCoVa的核心是一个迭代推理循环,其中新颖的潜空间Q-Former(LQ-Former)作为动态推理引擎,通过跨模态融合迭代细化思维向量链。为了聚焦此过程,一种标记选择机制动态识别出显著的视觉区域,模拟注意力聚焦。为了确保这些潜思维保持在地,我们使用结合对比学习和扩散重建的多任务目标进行模型训练,强制潜表示与视觉和文本模态之间的对齐。评估表明,CoCoVa在准确性和令牌效率上优于强基线。使用1.5B的骨干网络时,它在几乎所有基准上与或超越了更大的7B-9B模型。当扩展到7B大语言模型(LLM)骨干时,它仍然与最先进的模型竞争。定性分析验证了学习到的潜空间捕捉到可解释和结构化的推理模式,突显了CoCoVa在离散语言处理与视觉理解的连续性之间的表示差距上的潜力。
Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
Authors: Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Jingyi Yu, Lan Xu
First: 2025-11-03T07:21:42+00:00 · Latest: 2025-11-04T07:22:41+00:00
Comments: project page: https://sites.google.com/deemos.com/kinematify
Abstract
A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. Such understanding is captured through articulated objects, which are essential for tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or strong assumptions from hand-curated datasets, which hinders scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To achieve this, we combine Monte Carlo tree search (MCTS) for structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.
中文标题/摘要
标题:Kinematify:高自由度可动物体的开放词汇合成
对运动结构和可动部件的深刻理解对于使机器人能够操作物体并建模其自身的可动形态至关重要。这种理解通过可动物体来捕捉,这些物体对于物理模拟、运动规划和策略学习等任务至关重要。然而,创建这些模型,特别是对于具有高自由度(DoF)的物体,仍然是一个重大挑战。现有方法通常依赖于运动序列或手选数据集中的强假设,这限制了其可扩展性。在本文中,我们介绍了Kinematify,这是一种自动框架,可以从任意RGB图像或文本描述直接合成可动物体。我们的方法解决了两个核心挑战:(i) 推断高DoF物体的运动结构拓扑,(ii) 从静态几何中估计关节参数。为此,我们结合了基于MCTS的结构推理和基于几何的优化来推断关节参数,从而生成物理上一致且功能上有效的描述。我们在从合成和真实环境获取的多种输入上评估了Kinematify,展示了与先前工作相比在注册和运动结构准确性方面的改进。
Summary / 总结
Kinematify is an automated framework that synthesizes articulated objects from RGB images or textual descriptions, addressing the challenge of creating models with high degrees of freedom. It uses MCTS for structural inference and geometry-driven optimization for joint parameter estimation, resulting in physically consistent and functionally valid descriptions. Experiments show improvements in registration and kinematic topology accuracy compared to previous methods.
Kinematify 是一个自动化框架,可以从 RGB 图像或文本描述中合成可动物体,解决了高自由度模型创建的挑战。它使用 MCTS 进行结构推理,并使用几何驱动优化进行关节参数估计,生成物理一致且功能有效的描述。实验结果显示,在注册和运动学拓扑准确性方面优于先前的方法。
The Sequential Edge: Inverse-Entropy Voting Beats Parallel Self-Consistency at Matched Compute
Authors: Aman Sharma, Paras Chopra
First: 2025-11-04T06:48:34+00:00 · Latest: 2025-11-04T06:48:34+00:00
Abstract
We revisit test-time scaling for language model reasoning and ask a fundamental question: at equal token budget and compute, is it better to run multiple independent chains in parallel, or to run fewer chains that iteratively refine through sequential steps? Through comprehensive evaluation across 5 state-of-the-art open-source models and 3 challenging reasoning benchmarks, we find that sequential scaling, where chains explicitly build upon previous attempts, consistently outperforms the dominant parallel self-consistency paradigm in 95.6% of configurations, with accuracy gains of up to 46.7%. Further, we introduce inverse-entropy weighted voting, a novel training-free method to further boost the accuracy of sequential scaling. By weighting answers in proportion to the inverse entropy of their reasoning chains, we increase our success rate over parallel majority voting and establish it as the optimal test-time scaling strategy. Our findings fundamentally challenge the parallel reasoning orthodoxy that has dominated test-time scaling since Wang et al.'s self-consistency decoding (Wang et al., 2022), positioning sequential refinement as the robust default for modern LLM reasoning and necessitating a paradigm shift in how we approach inference-time optimization.
中文标题/摘要
标题:顺序优势:在同等计算量下,逆熵投票优于并行自一致性
我们重新审视语言模型推理的测试时扩展,并提出一个基本问题:在相同的标记预算和计算量下,是并行运行多个独立链更好,还是通过顺序步骤迭代改进的较少链更好?通过在5个最先进的开源模型和3个具有挑战性的推理基准上的全面评估,我们发现,在95.6%的配置中,显式基于先前尝试的顺序扩展始终优于主导的并行自一致性范式,准确率提升高达46.7%。此外,我们引入了逆熵加权投票,这是一种无需训练的新方法,可进一步提高顺序扩展的准确性。通过按推理链的逆熵比例加权答案,我们提高了成功率,并将其确立为测试时扩展的最佳策略。我们的发现从根本上挑战了自Wang等人自一致性解码以来一直主导测试时扩展的并行推理正统观念,将顺序改进定位为现代LLM推理的稳健默认选择,并要求我们在推理时优化方面进行范式转变。
Summary / 总结
The study evaluates test-time scaling strategies for language models and finds that sequential scaling, where chains iteratively refine through steps, outperforms parallel self-consistency in 95.6% of configurations, with accuracy gains up to 46.7%. The research introduces inverse-entropy weighted voting, which further boosts sequential scaling's accuracy by weighting answers based on the inverse entropy of reasoning chains, making it the optimal test-time scaling strategy. This challenges the prevailing parallel reasoning approach and suggests sequential refinement as the robust default for modern language models.
研究评估了语言模型的测试时缩放策略,发现迭代改进的顺序缩放方法在95.6%的配置中优于并行自我一致性方法,准确率提升高达46.7%。研究引入了逆熵加权投票方法,通过根据推理链的逆熵加权答案,进一步提高了顺序缩放的准确性,使其成为最优的测试时缩放策略。这挑战了自Wang等人以来主导测试时缩放的并行推理范式,建议将顺序改进作为现代语言模型推理的稳健默认方法。
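The inverse-entropy weighted voting described in this entry can be illustrated with a minimal sketch. The abstract does not say how a chain's entropy is computed, so the version below assumes it is the mean Shannon entropy of the chain's per-token probability distributions; the function names, the `eps` smoothing term, and the toy distributions are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def mean_token_entropy(token_distributions):
    """Mean Shannon entropy (nats) over a chain's per-token distributions.
    Assumption: this stands in for the paper's chain-level entropy."""
    total = 0.0
    for dist in token_distributions:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)

def inverse_entropy_vote(chains, eps=1e-6):
    """chains: list of (answer, token_distributions) pairs.
    Each chain votes for its answer with weight 1 / (entropy + eps)."""
    scores = defaultdict(float)
    for answer, dists in chains:
        scores[answer] += 1.0 / (mean_token_entropy(dists) + eps)
    return max(scores, key=scores.get)

def majority_vote(chains):
    """Plain self-consistency baseline: every chain has weight 1."""
    return Counter(answer for answer, _ in chains).most_common(1)[0][0]

# Toy data: one confident chain answers "42", two uncertain chains answer "17".
chains = [
    ("42", [[0.99, 0.01], [0.95, 0.05]]),
    ("17", [[0.5, 0.5], [0.6, 0.4]]),
    ("17", [[0.5, 0.5], [0.5, 0.5]]),
]
print(majority_vote(chains), inverse_entropy_vote(chains))
```

On this toy input the unweighted majority picks the answer backed by two uncertain chains, while the entropy-weighted vote favors the single low-entropy chain. This only illustrates the weighting idea and says nothing about the reported accuracy gains.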
Grounded Vision-Language Interpreter for Integrated Task and Motion Planning
Authors: Jeremy Siburian, Keisuke Shirai, Cristian C. Beltran-Hernandez, Masashi Hamaya, Michael Görner, Atsushi Hashimoto
First: 2025-06-03T18:00:32+00:00 · Latest: 2025-11-04T06:01:36+00:00
Comments: Project website: https://omron-sinicx.github.io/ViLaIn-TAMP/
Abstract
While recent advances in vision-language models have accelerated the development of language-guided robot planners, their black-box nature often precludes the safety guarantees and interpretability crucial for real-world deployment. Conversely, classical symbolic planners offer rigorous safety verification but require significant expert knowledge for setup. To bridge the current gap, this paper proposes ViLaIn-TAMP, a hybrid planning framework for enabling verifiable, interpretable, and autonomous robot behaviors. ViLaIn-TAMP comprises three main components: (1) a Vision-Language Interpreter (ViLaIn) adapted from previous work that converts multimodal inputs into structured problem specifications, (2) a modular Task and Motion Planning (TAMP) system that grounds these specifications in actionable trajectory sequences through symbolic and geometric constraint reasoning, and (3) a corrective planning (CP) module which receives concrete feedback on failed solution attempts and feeds it, together with constraints, back to ViLaIn to refine the specification. We design challenging manipulation tasks in a cooking domain and evaluate our framework. Experimental results demonstrate that ViLaIn-TAMP outperforms a VLM-as-a-planner baseline by 18% in mean success rate, and that adding the CP module boosts mean success rate by 32%.
中文标题/摘要
标题:基于视觉-语言解释器的集成任务与运动规划
尽管近期视觉-语言模型的进步加速了语言引导的机器人规划的发展,但它们的黑盒性质往往缺乏现实部署所需的可靠性和可解释性。相反,经典的符号规划器提供了严格的可靠性验证,但需要大量专家知识来设置。为弥合当前的差距,本文提出了一种名为ViLaIn-TAMP的混合规划框架,以实现可验证、可解释和自主的机器人行为。ViLaIn-TAMP包括三个主要组件:(1) 一种从先前工作改编而来的视觉-语言解释器(ViLaIn),将多模态输入转换为结构化问题规范;(2) 一个模块化的任务与运动规划(TAMP)系统,通过符号和几何约束推理将这些规范转化为可执行的轨迹序列;(3) 一个纠正规划(CP)模块,接收失败解决方案的具体反馈,并将约束反馈给ViLaIn以细化规范。我们设计了烹饪领域的挑战性操作任务,并评估了我们的框架。实验结果表明,ViLaIn-TAMP在平均成功率上比VLM作为规划器的基线高出18%,而添加CP模块则将平均成功率提高了32%。
Summary / 总结
The paper addresses the need for interpretable and safe robot planners by proposing ViLaIn-TAMP, which integrates a Vision-Language Interpreter (ViLaIn) and a Task and Motion Planning (TAMP) system. ViLaIn converts multimodal inputs into structured problem specifications, while TAMP grounds these specifications in actionable trajectories. A corrective planning (CP) module refines the specifications based on feedback from failed attempts. Experiments in a cooking domain show that ViLaIn-TAMP outperforms a VLM-as-a-planner baseline by 18% in mean success rate, and that adding the CP module boosts mean success rate by 32%.
本文提出了一种名为ViLaIn-TAMP的混合规划框架,以解决需要可解释和安全的机器人规划器的问题。该框架结合了视觉语言解释器(ViLaIn)和任务与运动规划(TAMP)系统。ViLaIn将多模态输入转换为结构化问题规范,而TAMP将这些规范转化为可执行的轨迹序列。该框架还包括一个纠正规划模块,根据反馈来细化规范。实验在烹饪领域显示,ViLaIn-TAMP在平均成功率上比基于视觉语言模型的规划器高出18%,而加入纠正规划模块则进一步提高了32%的成功率。
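The interpret, plan, and correct loop described in this entry can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the ViLaIn-TAMP implementation: `plan_with_correction`, `toy_interpret`, `toy_plan`, the dictionary-based specification, and the retry limit are all hypothetical stand-ins for the VLM interpreter, the TAMP system, and the CP module.

```python
def plan_with_correction(interpret, plan, instruction, max_attempts=3):
    """Retry planning, feeding failure feedback back to the interpreter.
    interpret(instruction, feedback) -> problem specification
    plan(spec) -> (True, trajectory) on success, (False, feedback) on failure"""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        spec = interpret(instruction, feedback)
        ok, result = plan(spec)
        if ok:
            return result, attempt
        feedback.append(result)  # corrective step: remember why planning failed
    return None, max_attempts

def toy_interpret(instruction, feedback):
    # Hypothetical stand-in for the interpreter: the spec is the goal text
    # plus every constraint learned from earlier failures.
    return {"goal": instruction, "constraints": list(feedback)}

def toy_plan(spec):
    # Hypothetical stand-in for the planner: fails until one constraint is present.
    required = "grasp knife before cutting"
    if required in spec["constraints"]:
        return True, ["grasp knife", "cut cucumber"]
    return False, required

trajectory, attempts = plan_with_correction(toy_interpret, toy_plan, "slice the cucumber")
print(trajectory, attempts)
```

In this toy run the first attempt fails, the missing constraint is fed back into the specification, and the second attempt succeeds.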
LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
Authors: Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi
First: 2025-11-04T04:02:51+00:00 · Latest: 2025-11-04T04:02:51+00:00
Comments: Preprint. Project page: https://vla2026.github.io/LACY/
Abstract
Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/
中文标题/摘要
标题:LACY:基于视觉-语言模型的语言-行动循环以实现自我提升的机器人操作
学习通用的机器人操作策略越来越多地依赖于大规模模型,将语言指令映射为行动(L2A)。然而,这种单向范式通常会产生执行任务但缺乏深层次语境理解的策略,限制了它们的泛化能力和解释行为的能力。我们认为,将行动映射回语言(A2L)的互补技能对于开发更全面的语义接地至关重要。能够执行并解释其行动的智能体可以形成更丰富的内部表示,并解锁新的自我监督学习范式。我们引入了LACY(语言-行动循环),这是一种统一框架,在单一视觉-语言模型中学习这种双向映射。LACY联合训练于三个协同任务:从语言生成参数化行动(L2A)、用语言解释观察到的行动(A2L)以及验证两种语言描述的语义一致性(L2C)。这使得一个自我提升的循环能够自主生成和过滤新的训练数据,通过针对低置信度案例的主动增强策略,从而在无需额外人工标签的情况下提高模型性能。在模拟和真实世界中的拾取和放置任务上的实验表明,LACY将任务成功率平均提高了56.46%,并为机器人操作提供了更稳健的语言-行动接地。项目页面:https://vla2026.github.io/LACY/
Summary / 总结
LACY is a unified framework that learns bidirectional mappings between language and actions within a single vision-language model. It is trained on three tasks: generating actions from language (L2A), explaining actions in language (A2L), and verifying semantic consistency between language descriptions (L2C). This enables a self-improving cycle that generates and filters new training data, improving task success rates by 56.46% on average and enhancing language-action grounding for robotic manipulation in both simulation and the real world.
LACY 是一个统一框架,在单一视觉语言模型中学习语言和动作之间的双向映射。它通过三个任务进行训练:从语言生成动作(L2A)、用语言解释动作(A2L)以及验证语言描述之间的语义一致性(L2C)。这使得系统能够自动生成和过滤新的训练数据,从而将任务成功率平均提高56.46%,并增强机器人操作中的语言-动作对接。
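One way to picture how the three tasks could combine into a data filter is a cycle-consistency check: map language to an action (L2A), describe that action in language (A2L), and keep the sample only if the two descriptions agree (L2C). The sketch below is an assumption about how such a cycle might be wired, not the authors' pipeline; `cycle_filter` and the three toy functions are hypothetical and stand in for calls to a single vision-language model.

```python
def cycle_filter(instructions, l2a, a2l, l2c):
    """Keep (instruction, action) pairs whose round trip is self-consistent."""
    kept = []
    for text in instructions:
        action = l2a(text)          # language -> parameterized action
        description = a2l(action)   # action -> language
        if l2c(text, description):  # language-to-consistency check
            kept.append((text, action))
    return kept

def toy_l2a(text):
    # Hypothetical parser for instructions of the form "pick X place Y".
    parts = text.lower().split()
    if len(parts) == 4 and parts[0] == "pick" and parts[2] == "place":
        return {"pick": parts[1], "place": parts[3]}
    return {"pick": "unknown", "place": "unknown"}

def toy_a2l(action):
    return f"pick {action['pick']} place {action['place']}"

def toy_l2c(a, b):
    # Toy consistency test: exact match after whitespace and case normalization.
    return a.lower().split() == b.lower().split()

instructions = ["pick cube place bowl", "put the cube somewhere nice"]
kept = cycle_filter(instructions, toy_l2a, toy_a2l, toy_l2c)
print(kept)
```

The second instruction cannot be parsed, so its round trip is inconsistent and it is dropped. The abstract's active augmentation of low-confidence cases is not modeled here.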
Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models
Authors: Jay Mohta, Kenan Emir Ak, Dimitrios Dimitriadis, Yan Xu, Mingwei Shen
First: 2025-11-03T18:39:32+00:00 · Latest: 2025-11-04T03:19:41+00:00
Abstract
Vision-Language Models (VLMs) suffer from catastrophic forgetting when sequentially fine-tuned on new tasks, degrading performance on previously learned foundational and task-specific capabilities. While multi-task learning can mitigate forgetting, it requires simultaneous access to all datasets and imposes computational overhead that scales linearly with the number of tasks. In this work, we introduce a routing-based approach that enables the integration of new tasks while preserving the foundational knowledge acquired during pretraining. We evaluate our method using InternVL-2 models (2B and 8B parameters) and demonstrate that routing preserves the model's foundational capabilities by maintaining performance on general-purpose benchmarks such as ChartQA, MMBench, and DocVQA, while simultaneously improving accuracy on specialized tasks. Importantly, our approach achieves this without requiring concurrent access to data from all tasks, avoiding the significant computational and data overhead associated with traditional multi-task learning. We further conduct extensive ablation studies to evaluate the scalability and robustness of routing-based learning, showing that the approach is resilient to a growing number of tasks and performs particularly well when new tasks are semantically related. Finally, we show that the routing mechanism enables superior cross-modal transfer between language and vision capabilities, allowing knowledge learned in one modality to enhance performance in another, a capability not achieved by existing continual learning methods.
中文标题/摘要
标题:专家间动态路由:一种在视觉语言模型中持续学习的数据高效方法
视觉语言模型(VLMs)在按顺序微调新任务时会遭受灾难性遗忘,导致之前学习的基础能力和任务特定能力性能下降。虽然多任务学习可以减轻遗忘,但需要同时访问所有数据集,并且随着任务数量的增加,计算开销呈线性增长。在本工作中,我们提出了一种基于路由的方法,可以在不破坏预训练期间获得的基础知识的情况下整合新任务。我们使用InternVL-2模型(2B和8B参数)评估了我们的方法,并证明了路由可以保持模型的基础能力,通过在通用基准测试(如ChartQA、MMBench和DocVQA)上保持性能,同时在专门任务上提高准确性。重要的是,我们的方法无需同时访问所有任务的数据,从而避免了传统多任务学习相关的重大计算和数据开销。我们还进行了广泛的消融研究,以评估基于路由学习的可扩展性和鲁棒性,表明该方法对任务数量的增长具有鲁棒性,并且在新任务具有语义相关性时表现尤为出色。最后,我们展示了路由机制能够实现语言和视觉能力之间的优越跨模态转移,使在一种能力中获得的知识能够增强另一种能力的表现,这是现有持续学习方法所无法实现的。
ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation
Authors: Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Haopeng Li, Honglei Yan, Tingting Shen, Yadong Mu
First: 2025-11-01T11:29:14+00:00 · Latest: 2025-11-04T03:11:03+00:00
Abstract
Video generative models pretrained on large-scale datasets can produce high-quality videos, but are often conditioned on text or a single image, limiting controllability and applicability. We introduce ID-Composer, a novel framework that addresses this gap by tackling multi-subject video generation from a text prompt and reference images. This task is challenging as it requires preserving subject identities, integrating semantics across subjects and modalities, and maintaining temporal consistency. To faithfully preserve the subject consistency and textual information in synthesized videos, ID-Composer designs a hierarchical identity-preserving attention mechanism, which effectively aggregates features within and across subjects and modalities. To follow user intent faithfully, we introduce semantic understanding via a pretrained vision-language model (VLM), leveraging the VLM's superior semantic understanding to provide fine-grained guidance and capture complex interactions between multiple subjects. Considering that standard diffusion loss often fails to align critical concepts such as subject ID, we employ an online reinforcement learning phase that recasts the overall training objective of ID-Composer as reinforcement learning with verifiable rewards (RLVR). Extensive experiments demonstrate that our model surpasses existing methods in identity preservation, temporal consistency, and video quality.
中文标题/摘要
标题:ID-Composer:多主体视频合成中的分层身份保留
在大规模数据集上预训练的视频生成模型可以生成高质量的视频,但通常需要文本或单张图像的条件,限制了可控性和适用性。我们引入了ID-Composer,这是一种新颖的框架,通过从文本提示和参考图像生成多主体视频来解决这一差距。这一任务具有挑战性,因为它需要保留主体身份、在主体和模态之间整合语义,并保持时间一致性。为了忠实地保留合成视频中的主体一致性和文本信息,ID-Composer 设计了一种分层身份保留注意力机制,该机制有效地在主体和模态内及跨主体聚合特征。为了有效地允许用户意图的语义跟随,我们引入了通过预训练的视觉-语言模型(VLM)进行语义理解,利用VLM的语义理解优势提供细粒度指导并捕捉多个主体之间的复杂交互。考虑到标准扩散损失往往无法对齐关键概念如主体ID,我们采用在线强化学习阶段来驱动ID-Composer的整体训练目标为RLVR。大量实验表明,我们的模型在身份保留、时间一致性和视频质量方面超越了现有方法。
Summary / 总结
ID-Composer is a framework designed to generate multi-subject videos from text prompts and reference images while preserving subject identities and maintaining temporal consistency. It uses a hierarchical identity-preserving attention mechanism and a pretrained vision-language model for semantic understanding. Experiments show that ID-Composer outperforms existing methods in identity preservation, temporal consistency, and video quality.
ID-Composer 是一个框架,用于从文本提示和参考图像生成多主体视频,同时保持身份和时间一致性。它使用层次身份保留注意力机制和预训练的视觉-语言模型进行语义理解。该模型还采用在线强化学习阶段以提高关键概念的对齐。实验表明,ID-Composer 在身份保留、时间一致性和视频质量方面优于现有方法。
MetAdv: A Unified and Interactive Adversarial Testing Platform for Autonomous Driving
Authors: Aishan Liu, Jiakai Wang, Tianyuan Zhang, Hainan Li, Jiangfan Liu, Siyuan Liang, Yilong Ren, Xianglong Liu, Dacheng Tao
Venue: ACM MM 2025
First: 2025-08-04T03:07:54+00:00 · Latest: 2025-11-04T01:23:14+00:00
Comments: ACM MM 2025 Most Popular Demo Award
Abstract
Evaluating and ensuring the adversarial robustness of autonomous driving (AD) systems is a critical and unresolved challenge. This paper introduces MetAdv, a novel adversarial testing platform that enables realistic, dynamic, and interactive evaluation by tightly integrating virtual simulation with physical vehicle feedback. At its core, MetAdv establishes a hybrid virtual-physical sandbox, within which we design a three-layer closed-loop testing environment with dynamic adversarial test evolution. This architecture facilitates end-to-end adversarial evaluation, ranging from high-level unified adversarial generation, through mid-level simulation-based interaction, to low-level execution on physical vehicles. Additionally, MetAdv supports a broad spectrum of AD tasks and algorithmic paradigms (e.g., modular deep learning pipelines, end-to-end learning, vision-language models). It supports flexible 3D vehicle modeling and seamless transitions between simulated and physical environments, with built-in compatibility for commercial platforms such as Apollo and Tesla. A key feature of MetAdv is its human-in-the-loop capability: besides flexible environmental configuration for more customized evaluation, it enables real-time capture of physiological signals and behavioral feedback from drivers, offering new insights into human-machine trust under adversarial conditions. We believe MetAdv can offer a scalable and unified framework for adversarial assessment, paving the way for safer AD.
中文标题/摘要
标题:MetAdv:统一且互动的自动驾驶对抗性测试平台
评估和确保自动驾驶(AD)系统的对抗性鲁棒性是一项关键且未解决的挑战。本文介绍了MetAdv,这是一种新颖的对抗性测试平台,通过紧密集成虚拟仿真与物理车辆反馈,实现现实、动态和互动的评估。其核心在于建立一个混合虚拟-物理的沙箱,在此环境中设计了一个三层闭环测试环境,具有动态对抗性测试演化。该架构支持从高层次统一的对抗性生成,到中间层次基于仿真的交互,再到低层次在物理车辆上的执行的端到端对抗性评估。此外,MetAdv 支持广泛的AD任务、算法范式(例如模块化深度学习管道、端到端学习、视觉-语言模型)。它支持灵活的3D车辆建模,并在模拟和物理环境之间无缝过渡,内置兼容性支持如Apollo和Tesla等商业平台。MetAdv 的一个关键特性是其人机在环能力:除了灵活的环境配置以进行更定制化的评估,它还能够实时捕捉驾驶员的生理信号和行为反馈,提供在对抗性条件下人机信任的新见解。我们相信MetAdv 可以为对抗性评估提供可扩展且统一的框架,为更安全的AD铺平道路。
DRIP: Dynamic patch Reduction via Interpretable Pooling
Authors: Yusen Peng, Sachin Kumar
First: 2025-10-29T01:10:28+00:00 · Latest: 2025-11-04T01:16:43+00:00
Comments: Need more refinement
Abstract
Recent advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, because pretraining is large-scale and hence expensive, efficiency concerns have discouraged researchers from attempting to pretrain a vision-language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification/zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact into scientific domains.
中文标题/摘要
标题:DRIP: 动态可解释池化下的局部区域减少
近年来,视觉-语言模型的进步,包括对比预训练和指令调优,极大地推动了多模态人工智能的前沿。然而,由于大规模预训练成本高昂,效率问题阻碍了研究人员从头开始预训练视觉语言模型。在本文中,我们提出了动态可解释池化下的局部区域减少(DRIP),该方法适应输入图像并在视觉编码器的深层中动态合并令牌。我们在从零开始训练ImageNet和CLIP对比预训练上的结果表明,在保持相当的分类/零样本性能的同时,显著减少了GFLOP。为了进一步验证我们提出的方法,我们在一个大型生物学数据集上进行了持续预训练,将其影响扩展到科学领域。
Summary / 总结
The research aims to address the efficiency concerns in pretraining vision-language models by proposing DRIP, which dynamically merges tokens in deeper layers of a visual encoder. The method achieves a significant reduction in GFLOPs while maintaining comparable classification and zero-shot performance in both ImageNet training from scratch and CLIP contrastive pretraining. Continual pretraining on a large biology dataset further validates the method's effectiveness in scientific domains.
研究动机是解决在从零开始预训练视觉语言模型时的效率问题。主要方法是提出DRIP,该方法根据输入图像动态合并视觉编码器深层的令牌。关键实验发现表明,在ImageNet和CLIP对比预训练任务中,GFLOPs显著减少的同时保持了相当的分类和零样本性能。
Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models
Authors: Alexander Htet Kyaw, Richa Gupta, Dhruv Shah, Anoop Sinha, Kory Mathewson, Stefanie Pender, Sachin Chitta, Yotto Koga, Faez Ahmed, Lawrence Sass, Randall Davis
Venue: NeurIPS 2025
First: 2025-11-04T01:02:21+00:00 · Latest: 2025-11-04T01:02:21+00:00
Comments: Accepted to NeurIPS 2025, Conference on Neural Information Processing Systems, Creative AI Track
Abstract
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components, based on object functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
中文标题/摘要
标题:基于3D生成AI和视觉语言模型的多组件物体文本到机器人装配
3D生成AI的进步使得从文本提示创建物理对象成为可能,但在创建涉及多种组件类型的对象时仍面临挑战。我们提出了一种将3D生成AI与视觉语言模型(VLMs)集成的管道,以使自然语言能够实现多组件物体的机器人装配。该方法利用VLMs进行零样本、多模态的几何和功能推理,将生成的AI网格分解为使用预定义结构和面板组件的多组件3D模型。评估结果显示,用户90.6%的时间更偏好VLM生成的组件分配,而基于规则的分配为59.4%,随机分配为2.5%。此外,该系统允许用户通过对话反馈来细化组件分配,从而在生成AI和机器人技术制造物理对象时赋予更大的人类控制权。
FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Authors: Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
First: 2025-10-29T17:58:14+00:00 · Latest: 2025-11-03T22:47:17+00:00
Comments: Project Page: https://czzzzh.github.io/FreeArt3D Code: https://github.com/CzzzzH/FreeArt3D
Abstract
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility. Please check our website for more details: https://czzzzh.github.io/FreeArt3D
中文标题/摘要
标题:FreeArt3D:无需训练的3D可动物体生成方法利用3D扩散
3D可动物体在机器人学、AR/VR和动画等领域中至关重要。最近对这类物体建模的方法要么依赖于需要密集视角监督的优化重建管道,要么依赖于生成前馈模型,这些模型生成粗糙的几何近似,往往忽略了表面纹理。相比之下,静态3D物体的开放世界生成已经取得了显著成功,尤其是在3D扩散模型(如Trellis)的出现之后。然而,将这些方法扩展到可动物体,通过训练3D扩散模型来生成可动物体,面临着重大挑战。本文中,我们提出了FreeArt3D,这是一种无需训练的可动3D物体生成框架。FreeArt3D 不是针对有限的可动数据训练新模型,而是将一个预先训练好的静态3D扩散模型(例如Trellis)重新用于强大的形状先验。它将Score Distillation Sampling (SDS) 扩展到3D到4D领域,将可动性视为额外的生成维度。给定不同可动状态下的少量图像,FreeArt3D 联合优化物体的几何形状、纹理和可动参数,无需特定任务的训练或访问大规模可动数据集。我们的方法生成了高保真度的几何形状和纹理,准确预测了潜在的运动结构,并在多种物体类别中表现出良好的泛化能力。尽管遵循实例优化范式,FreeArt3D 完成时间仅需几分钟,并且在质量和多功能性方面显著优于先前的先进方法。请访问我们的网站获取更多信息:https://czzzzh.github.io/FreeArt3D
Summary / 总结
FreeArt3D is a training-free framework for generating articulated 3D objects. It repurposes a pre-trained static 3D diffusion model as a shape prior and extends Score Distillation Sampling to the 3D-to-4D domain. Given a few images of an object in different articulation states, FreeArt3D optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or large-scale datasets. The method generates high-fidelity geometry and textures, accurately predicts kinematic structures, and generalizes well across various object categories, outperforming previous approaches in both quality and versatility.
FreeArt3D 是一个无需训练的框架,用于生成可动 3D 物体。它将一个预训练的静态 3D 扩散模型作为形状先验,并将 Score Distillation Sampling 扩展到 3D 到 4D 领域。给定物体在不同可动状态下的几张图片,FreeArt3D 优化物体的几何形状、纹理和可动参数,无需特定任务的训练或大规模可动数据集。该方法生成高质量的几何形状和纹理,准确预测了运动结构,并在各种物体类别中表现出良好的泛化能力,优于之前的先进方法。
TapOut: A Bandit-Based Approach to Dynamic Speculative Decoding
Authors: Aditya Sridhar, Nish Sinnadurai, Sean Lie, Vithursan Thangarasa
First: 2025-11-03T19:42:25+00:00 · Latest: 2025-11-03T19:42:25+00:00
Comments: 9 pages, 6 figures, 5 tables
Abstract
Speculative decoding accelerates LLMs by using a lightweight draft model to generate tokens autoregressively before verifying them in parallel with a larger target model. However, determining the optimal number of tokens to draft remains a key challenge limiting the approach's effectiveness. Dynamic speculative decoding aims to intelligently decide how many tokens to draft to achieve maximum speedups. Existing methods often rely on hand-tuned, sensitive thresholds (e.g., token entropy), which are costly to set and generalize poorly across models and domains. We propose TapOut, an online, training-free, plug-and-play algorithm for dynamic speculation policy selection using multi-armed bandits. Our approach employs a meta-algorithm that selects among multiple parameter-free dynamic speculation strategies based on past reward and exploration. We conduct extensive experiments across diverse model pairs and datasets, showing that TapOut achieves competitive or superior speedups compared to well-established dynamic speculation baselines without any hyperparameter tuning.
中文标题/摘要
标题:TapOut:基于多臂老虎机的动态推测性解码方法
推测性解码通过使用轻量级草稿模型自回归地生成令牌,然后与较大目标模型并行验证来加速LLMs。然而,确定要草稿的最优令牌数量仍然是限制该方法有效性的关键挑战。动态推测性解码旨在智能地决定要草稿多少令牌以实现最大加速。现有方法通常依赖于手工调优、敏感的阈值(例如,令牌熵),这些阈值设置成本高且在模型和领域之间泛化效果差。我们提出TapOut,一种无需训练的、即插即用的多臂老虎机算法,用于动态推测策略选择。我们的方法采用一个元算法,根据过去奖励和探索选择多个无参数动态推测策略。我们在多种模型对和数据集上进行了广泛的实验,结果显示,TapOut在无需任何超参数调优的情况下,实现了与现有动态推测基准相当或更优的加速效果。
Summary / 总结
TapOut is a bandit-based approach for dynamic speculative decoding that intelligently decides the number of tokens to draft to maximize speedups. Unlike existing methods that rely on hand-tuned thresholds, TapOut uses multi-armed bandits to select among multiple parameter-free strategies based on past performance. Extensive experiments across various model pairs and datasets demonstrate that TapOut achieves competitive or superior speedups compared to established baselines without requiring hyperparameter tuning.
TapOut 是一种基于多臂老虎机的动态推测性解码方法,能够智能地决定推测性解码的令牌数量以最大化加速效果。不同于依赖手动调优阈值的现有方法,TapOut 使用多臂老虎机根据过往表现选择多个无参数策略。在多种模型对和数据集上的广泛实验表明,TapOut 在无需调优超参数的情况下,能够实现与现有基准相当或更优的加速效果。
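To make the bandit idea concrete, the following minimal sketch uses the standard UCB1 rule to choose among several speculation strategies. The abstract does not specify which bandit algorithm or reward TapOut uses, so UCB1 is swapped in here, and the three arms with fixed rewards in [0, 1] are hypothetical placeholders for real drafting strategies and measured speedups.

```python
import math

class UCB1:
    """Standard UCB1 bandit over a fixed set of arms, rewards assumed in [0, 1]."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # number of pulls per arm
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Pull every arm once before applying the confidence bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        t = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a] + math.sqrt(2.0 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Hypothetical arms: each index stands for one dynamic speculation strategy.
# Fixed rewards keep the demo deterministic; real rewards would be noisy.
arm_rewards = [0.2, 0.8, 0.5]
bandit = UCB1(len(arm_rewards))
for _ in range(500):
    arm = bandit.select()
    bandit.update(arm, arm_rewards[arm])

best = max(range(len(arm_rewards)), key=lambda a: bandit.counts[a])
print(bandit.counts, best)
```

In a real decoder the update would run once per drafting round, with the reward derived from something like the number of accepted draft tokens. That reward choice is an assumption here, not a detail taken from the paper.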
TRACE: Textual Reasoning for Affordance Coordinate Extraction
Authors: Sangyun Park, Jin Kim, Yuchen Cui, Matthew S. Brown
Venue: ICCV 2025
First: 2025-11-03T19:13:26+00:00 · Latest: 2025-11-03T19:13:26+00:00
Comments: ICCV 2025. *Equal contribution. †Corresponding author
Abstract
Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative improvement) and 55.0% on the more challenging W2P(h) subset. Crucially, an ablation study demonstrates that performance scales directly with the amount of reasoning data used, confirming the CoR's effectiveness. Furthermore, analysis of the model's attention maps reveals an interpretable reasoning process where focus shifts dynamically across reasoning steps. This work shows that training VLMs to generate a textual CoR is an effective and robust strategy for enhancing the precision, reliability, and interpretability of VLM-based robot control. Our dataset and code are available at https://github.com/jink-ucla/TRACE
中文标题/摘要
标题:TRACE:文本推理用于提取功能坐标
视觉-语言模型(VLMs)难以将高层次指令转化为机器人操作所需的精确空间功能。虽然存在视觉链式思考(CoT)方法,但它们通常计算量大。本文提出了一种名为TRACE(文本推理用于提取功能坐标)的新方法,将文本链式推理(CoR)整合到功能预测过程中。我们使用此方法创建了TRACE数据集,这是一个通过自主管道生成的大规模集合,将指令与明确的文本推理配对。通过在该数据集上微调VLM,我们的模型学会在行动前外部化其空间推理。实验表明,我们的TRACE微调模型在主要的Where2Place(W2P)基准测试中达到48.1%的准确率(相对提高9.6%),在更具挑战性的W2P(h)子集中达到55.0%。关键的是,消融研究显示性能与使用的推理数据量成正比,证实了CoR的有效性。此外,对模型注意力图的分析揭示了一个可解释的推理过程,其中焦点在推理步骤中动态转移。本文表明,训练VLM生成文本CoR是提高基于VLM的机器人控制的精确性、可靠性和可解释性的有效且稳健的策略。我们的数据集和代码可在https://github.com/jink-ucla/TRACE获取
Summary / 总结
The research aims to improve the precision of spatial affordance extraction for robotic manipulation by integrating textual reasoning into Vision-Language Models (VLMs). TRACE, a novel methodology, uses a textual Chain of Reasoning (CoR) to enhance the model's ability to translate high-level instructions into precise spatial actions. Experiments show that the TRACE-tuned model outperforms existing methods, achieving 48.1% accuracy on the W2P benchmark and 55.0% on the W2P(h) subset. The ablation study confirms the effectiveness of the CoR, and attention maps reveal a dynamic reasoning process. This work demonstrates that training VLMs to generate textual CoR can enhance the precision, reliability, and interpretability of robot control.
研究旨在通过将文本推理集成到视觉语言模型(VLM)中,提高空间功能提取的精度,以实现机器人操作。TRACE是一种新颖的方法,通过文本链式推理(CoR)增强VLMs,使其在Where2Place基准测试中达到48.1%的准确率。消融研究显示,使用更多推理数据可以提高性能,注意力图揭示了一个动态的推理过程。这项工作表明,训练VLMs生成文本CoR可以增强其解释性和可靠性。
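The abstract gives the W2P score (48.1%) and a 9.6% relative improvement, but not the reference score. A minimal sketch of the arithmetic follows, assuming the improvement is measured against a single prior best score on the same benchmark. The function name `implied_baseline` is hypothetical, and the resulting figure is an inference, not a number reported by the authors.

```python
def implied_baseline(new_score: float, relative_gain: float) -> float:
    """Back out the reference score from a score and its relative gain.

    Assumes new_score = baseline * (1 + relative_gain).
    """
    return new_score / (1.0 + relative_gain)


# Figures taken from the abstract: 48.1% accuracy, 9.6% relative improvement.
baseline = implied_baseline(48.1, 0.096)
absolute_gain = 48.1 - baseline

print(f"implied baseline: {baseline:.1f}%")
print(f"absolute gain: {absolute_gain:.1f} points")
```

Under that assumption the relative gain corresponds to an absolute gain of roughly four accuracy points, which is the more conservative way to read the headline number.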
SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art
Authors: Sagi Eppel, Alona Strugatski
First: 2025-11-03T18:22:11+00:00 · Latest: 2025-11-03T18:22:11+00:00
Abstract
The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the SciTextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,200 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the connection between the visual patterns that shape our world and the mechanisms that produce them. The dataset was created by an agentic AI pipeline that autonomously collects and implements models in standardized form. We use SciTextures to evaluate the ability of leading AI models to link visual patterns to the models and code that generate them, and to identify different patterns that emerged from the same process. We also test AI's ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world pattern and asking the AI to identify, model, and code the mechanism that formed the pattern, then run this code to generate a simulated image that is compared to the real image. These benchmarks show that vision-language models (VLMs) can understand and simulate the physical system beyond a visual pattern. The dataset and code are available at: https://zenodo.org/records/17485502
中文标题/摘要
标题:SciTextures:跨科学与艺术收集和连接视觉模式、模型和代码
将视觉模式与其形成过程联系起来的能力代表了最深层次的视觉理解形式。云和波的纹理、城市和森林的增长、材料和景观的形成都是从底层机制中涌现出来的模式的例子。我们介绍了Scitextures数据集,这是一个涵盖科学、技术和艺术所有领域的大型纹理和视觉模式集合,以及生成这些图像的模型和代码。该数据集包括超过1,200个不同模型和来自物理学、化学、生物学、社会学、技术、数学和艺术的100,000张模式和纹理图像,提供了一种探索塑造我们世界的视觉模式与其产生机制之间联系的方法。通过自主收集和以标准化形式实施模型的智能代理AI管道创建,我们使用SciTextures评估领先AI模型将视觉模式与其生成的模型和代码联系起来的能力,并识别来自相同过程的不同模式。我们还通过提供真实世界模式的自然图像并要求AI识别、建模和编码形成模式的机制,然后运行此代码生成与真实图像进行比较的模拟图像,来测试AI推断和重现视觉模式背后机制的能力。这些基准表明,视觉语言模型(VLMs)可以理解并模拟视觉模式背后的物理系统。数据集和代码可在以下链接获取:https://zenodo.org/records/17485502
Summary / 总结
The research aims to enhance visual understanding by connecting visual patterns to their underlying mechanisms. The SciTextures dataset includes over 1,200 models and 100,000 images from various scientific and artistic domains. The study evaluates leading AI models to determine their ability to link visual patterns to the models and code that generate them, and to infer and recreate the mechanisms behind these patterns. Key findings show that vision-language models can understand and simulate the physical systems beyond just the visual patterns.
研究旨在通过将视觉模式与其背后的机制联系起来,增强视觉理解。SciTextures数据集包含来自各个科学和艺术领域的1,200多个模型和100,000多张图像。研究评估了领先的人工智能模型,以确定它们将视觉模式与其生成的模型和代码联系起来的能力,以及推断和重现这些模式背后的机制。关键发现表明,视觉语言模型可以理解并模拟超出视觉模式的物理系统。
GenDexHand: Generative Simulation for Dexterous Hands
Authors: Feng Chen, Zhuxiu Xu, Tianzhe Chu, Xunzhe Zhou, Li Sun, Zewen Wu, Shenghua Gao, Zhongyu Li, Yanchao Yang, Yi Ma
First: 2025-11-03T17:45:38+00:00 · Latest: 2025-11-03T17:45:38+00:00
Abstract
Data scarcity remains a fundamental bottleneck for embodied intelligence. Existing approaches use large language models (LLMs) to automate gripper-based simulation generation, but they transfer poorly to dexterous manipulation, which demands more specialized environment design. Meanwhile, dexterous manipulation tasks are inherently more difficult due to their higher degrees of freedom. Massively generating feasible and trainable dexterous hand tasks remains an open challenge. To this end, we present GenDexHand, a generative simulation pipeline that autonomously produces diverse robotic tasks and environments for dexterous manipulation. GenDexHand introduces a closed-loop refinement process that adjusts object placements and scales based on vision-language model (VLM) feedback, substantially improving the average quality of generated environments. Each task is further decomposed into sub-tasks to enable sequential reinforcement learning, reducing training time and increasing success rates. Our work provides a viable path toward scalable training of diverse dexterous hand behaviors in embodied intelligence by offering a simulation-based solution to synthetic data generation. Our website: https://winniechen2002.github.io/GenDexHand/.
中文标题/摘要
标题:GenDexHand: 生成式模拟用于灵巧手
数据稀缺仍然是体现智能的基本瓶颈。 现有方法使用大型语言模型(LLMs)自动化夹具基模拟生成,但它们在灵巧操作方面转移效果不佳,这需要更专门的环境设计。同时,由于其更高的自由度,灵巧操作任务本身更加困难。大规模生成可行且可训练的灵巧手任务仍然是一个开放的挑战。为此,我们提出了GenDexHand,这是一种生成式模拟流水线,能够自主产生用于灵巧操作的多样化机器人任务和环境。GenDexHand引入了一个闭环精炼过程,根据视觉语言模型(VLM)反馈调整物体放置和比例,显著提高了生成环境的平均质量。每个任务进一步分解为子任务,以启用顺序强化学习,减少训练时间和提高成功率。我们的工作为通过提供基于模拟的合成数据生成解决方案,提供了实现多样化灵巧手行为可扩展训练的可行途径。我们的网站:https://winniechen2002.github.io/GenDexHand/
Summary / 总结
GenDexHand addresses the challenge of data scarcity in embodied intelligence by introducing a generative simulation pipeline for dexterous manipulation tasks. It uses a closed-loop refinement process with vision-language model feedback to adjust object placements and scales, and decomposes each task into sub-tasks for sequential reinforcement learning. The authors report that the refinement loop substantially improves the average quality of generated environments and that sub-task decomposition reduces training time and raises success rates.
GenDexHand是一种生成式仿真流水线,通过自主生成多样化的机器人任务和环境来解决具身智能中的数据稀缺问题。它采用闭环改进过程并结合视觉语言模型反馈来提高生成环境的质量,并将任务分解为子任务以进行顺序强化学习,从而减少训练时间和提高成功率。该方法通过合成数据生成为具身智能中训练灵巧手行为提供了可扩展的解决方案。
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Authors: Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
First: 2025-01-07T18:58:54+00:00 · Latest: 2025-11-03T17:35:29+00:00
Comments: Code: https://github.com/Bytedance/Sa2VA
Abstract
This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended into various VLMs, including Qwen-VL and Intern-VL, which can be updated with rapid process in current open-sourced VLMs. Code and models have been provided to the community.
中文标题/摘要
标题:Sa2VA:将SAM2与LLaVA结合以实现图像和视频的密集语义理解
本文介绍了Sa2VA,这是首个全面统一的模型,用于处理图像和视频的密集语义理解。与现有的多模态大型语言模型不同,Sa2VA支持广泛的图像和视频任务,包括引用分割和对话,只需少量的一次性指令调优。Sa2VA将基础视频分割模型SAM-2与先进的视觉-语言模型MLLM结合,并将文本、图像和视频统一到共享的LLM标记空间中。利用LLM,Sa2VA生成指令标记,指导SAM-2生成精确的掩码,从而实现对静态和动态视觉内容的语义理解。此外,我们还引入了Ref-SAV,这是一个包含超过72000个复杂视频场景中对象表达式的自动标注数据集,旨在提升模型性能。我们还手动验证了Ref-SAV数据集中2000个视频对象,以评估复杂环境中的引用视频对象分割。实验表明,Sa2VA在多个任务中表现出色,特别是在引用视频对象分割方面,突显了其在复杂现实应用中的潜力。此外,Sa2VA可以轻松扩展到各种VLM中,包括Qwen-VL和Intern-VL,这些模型可以快速更新以适应当前开源VLM。代码和模型已提供给社区。
Summary / 总结
Sa2VA is a unified model for dense grounded understanding of images and videos, combining SAM-2 and MLLM. It supports tasks such as referring segmentation and conversation with minimal one-shot instruction tuning. Sa2VA generates instruction tokens to guide SAM-2 in producing precise masks, enabling grounded understanding of both static and dynamic visual content. Experiments show strong performance across multiple tasks, especially referring video object segmentation, and the model can be easily extended to other VLMs. The authors also introduce Ref-SAV, an auto-labeled dataset with over 72k object expressions, and manually validate 2k of its video objects to benchmark referring video object segmentation in complex environments.
Sa2VA 是一种结合 SAM-2 和 MLLM 的统一模型,用于图像和视频的密集接地理解,支持如引用分割和对话等多种任务,通过最小的指令微调即可实现。Sa2VA 生成指令令牌来指导 SAM-2 生成精确的掩码,特别是在引用视频对象分割任务上表现出色。该模型可以轻松扩展到其他 VLM,还包含一个自动标注的数据集 Ref-SAV,用于训练。基准测试显示 Sa2VA 在复杂现实应用场景中的有效性。
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Venue: ICCV 2025
First: 2025-04-01T07:47:55+00:00 · Latest: 2025-11-03T17:23:02+00:00
Comments: Published as a conference paper at ICCV 2025. Project page: https://github.com/icip-cas/ShortV
Abstract
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV
中文标题/摘要
标题:ShortV:通过冻结无效层中的视觉标记提高多模态大型语言模型的效率
多模态大型语言模型(MLLMs)由于其庞大的规模和大量的视觉标记而面临高昂的计算成本。本文通过引入一个新的度量标准——层贡献(LC),研究了MLLMs中的层间冗余性,该度量标准量化了层的变换对视觉和文本标记的影响。LC的计算涉及测量移除层对指定标记的变换后模型输出的差异。我们的初步实验表明,在处理视觉标记时,MLLMs中的许多层几乎没有贡献。受此观察的启发,我们提出了一种无需训练的方法——ShortV,利用LC来识别无效层,并在这些层中冻结视觉标记的更新。实验表明,ShortV可以在大约60%的MLLM层中冻结视觉标记,从而大幅降低与更新视觉标记相关的计算成本。例如,在LLaVA-NeXT-13B上实现了50%的FLOPs减少,同时保持了优越的性能。代码将在https://github.com/icip-cas/ShortV公开。
Summary / 总结
The research aims to reduce the computational costs of Multimodal Large Language Models (MLLMs) by identifying layers that contribute little to visual tokens and freezing visual token updates in those layers. The Layer Contribution (LC) metric is introduced to quantify the impact of each layer's transformations on visual and text tokens. Experiments show that ShortV, a training-free method using LC, can freeze visual token updates in about 60% of MLLM layers, leading to a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining performance. The code will be made publicly available at https://github.com/icip-cas/ShortV.
本文通过引入一个新的度量标准层贡献(LC)来识别对视觉处理影响较小的层,以解决多模态大型语言模型(MLLMs)的高计算成本问题。提出的ShortV方法无需训练,利用LC在约60%的MLLM层中冻结视觉标记更新。例如,它在LLaVA-NeXT-13B上实现了50%的FLOPs减少,同时保持了优越的性能。代码将在 https://github.com/icip-cas/ShortV 公开。
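The Layer Contribution idea described in the ShortV abstract (measure how much the model output diverges when one layer's transformation is removed for a chosen set of tokens) can be illustrated on a toy network. The sketch below is not the authors' implementation: the tiny residual model, the mean-pooling token mixer, the use of KL divergence as the divergence measure, and all names (`forward`, `layer_contribution`, `visual_idx`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


def forward(h, layers, w_out, frozen_layer=None, frozen_idx=None):
    """Toy residual stack; optionally skip one layer's update for some tokens."""
    h = h.copy()
    for i, w in enumerate(layers):
        ctx = h.mean(axis=0, keepdims=True)   # crude stand-in for token mixing
        update = np.tanh((h + ctx) @ w)
        if i == frozen_layer:
            update[frozen_idx] = 0.0          # frozen tokens keep their old state
        h = h + update
    return softmax(h[-1] @ w_out)             # distribution read from last token


def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def layer_contribution(h, layers, w_out, token_idx):
    """Per-layer divergence caused by freezing `token_idx` in that layer."""
    base = forward(h, layers, w_out)
    return [kl(base, forward(h, layers, w_out, i, token_idx))
            for i in range(len(layers))]


# Hypothetical setup: 6 "visual" tokens followed by 2 "text" tokens.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))
layers = [rng.normal(scale=0.1, size=(16, 16)) for _ in range(4)]
w_out = rng.normal(size=(16, 10))
visual_idx = np.arange(6)

lc = layer_contribution(hidden, layers, w_out, visual_idx)
for i, value in enumerate(lc):
    print(f"layer {i}: LC on visual tokens = {value:.6f}")
```

In this toy setting, layers with the smallest scores would be the candidates for freezing visual token updates. How ShortV actually selects layers and which divergence it uses should be checked against the paper and repository.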
3EED: Ground Everything Everywhere in 3D
Authors: Rong Li, Yuhao Dong, Tianshuai Hu, Ao Liang, Youquan Liu, Dongyue Lu, Liang Pan, Lingdong Kong, Junwei Liang, Ziwei Liu
Venue: NeurIPS 2025
First: 2025-11-03T17:05:22+00:00 · Latest: 2025-11-03T17:05:22+00:00
Comments: NeurIPS 2025 DB Track; 29 pages, 17 figures, 10 tables; Project Page at https://project-3eed.github.io/
Abstract
Visual grounding in 3D is the key for embodied agents to localize language-referred objects in open-world environments. However, existing benchmarks are limited to indoor focus, single-platform constraints, and small scale. We introduce 3EED, a multi-platform, multi-modal 3D grounding benchmark featuring RGB and LiDAR data from vehicle, drone, and quadruped platforms. We provide over 128,000 objects and 22,000 validated referring expressions across diverse outdoor scenes -- 10x larger than existing datasets. We develop a scalable annotation pipeline combining vision-language model prompting with human verification to ensure high-quality spatial grounding. To support cross-platform learning, we propose platform-aware normalization and cross-modal alignment techniques, and establish benchmark protocols for in-domain and cross-platform evaluations. Our findings reveal significant performance gaps, highlighting the challenges and opportunities of generalizable 3D grounding. The 3EED dataset and benchmark toolkit are released to advance future research in language-driven 3D embodied perception.
中文标题/摘要
标题:3EED:在三维空间中随处定位万物
三维视觉定位是使具身智能体在开放世界环境中定位语言所指对象的关键。然而,现有的基准测试仅限于室内场景、单一平台限制和小规模。我们引入了3EED,这是一个多平台、多模态的三维定位基准,包含来自车辆、无人机和四足平台的RGB和LiDAR数据。我们提供了超过128,000个物体和22,000个验证过的参照表达,覆盖了多样化的户外场景——比现有数据集大10倍。我们开发了一种可扩展的注释流水线,结合视觉-语言模型提示与人工验证,以确保高质量的空间定位。为了支持跨平台学习,我们提出了平台感知的标准化和跨模态对齐技术,并建立了领域内和跨平台的基准测试协议。我们的研究结果揭示了显著的性能差距,突显了通用三维定位的挑战和机遇。3EED数据集和基准测试工具包已发布,以促进未来由语言驱动的三维具身感知研究。
UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
Authors: Ropeway Liu, Hangjie Yuan, Bo Dong, Jiazheng Xing, Jinwang Wang, Rui Zhao, Yan Xing, Weihua Chen, Fan Wang
Venue: NeurIPS 2025
First: 2025-11-03T15:41:41+00:00 · Latest: 2025-11-03T15:41:41+00:00
Comments: NeurIPS 2025
Abstract
Relighting is a crucial task with both practical demand and artistic value, and recent diffusion models have shown strong potential by enabling rich and controllable lighting effects. However, as they are typically optimized in semantic latent space, where proximity does not guarantee physical correctness in visual space, they often produce unrealistic results, such as overexposed highlights, misaligned shadows, and incorrect occlusions. We address this with UniLumos, a unified relighting framework for both images and videos that brings RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its outputs, we explicitly align lighting effects with the scene structure, enhancing physical plausibility. Nevertheless, this feedback requires high-quality outputs for supervision in visual space, making standard multi-step denoising computationally expensive. To mitigate this, we employ path consistency learning, allowing supervision to remain effective even under few-step training regimes. To enable fine-grained relighting control and supervision, we design a structured six-dimensional annotation protocol capturing core illumination attributes. Building upon this, we propose LumosBench, a disentangled attribute-level benchmark that evaluates lighting controllability via large vision-language models, enabling automatic and interpretable assessment of relighting precision across individual dimensions. Extensive experiments demonstrate that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency, while delivering a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.
中文标题/摘要
标题:UniLumos:快速统一的图像和视频重新照明,具有物理可信的反馈
重新照明是一项具有实际需求和艺术价值的关键任务,最近的扩散模型展示了强大的潜力,能够实现丰富且可控的照明效果。然而,由于它们通常在语义潜在空间中进行优化,而视觉空间中的邻近性并不保证物理正确性,因此它们经常产生不现实的结果,如过度曝光的高光、错位的阴影和不正确的遮挡。我们通过UniLumos,一种统一的图像和视频重新照明框架,将RGB空间几何反馈引入到流匹配骨干中来解决这一问题。通过使用从模型输出中提取的深度图和法线图进行监督,我们明确地将照明效果与场景结构对齐,增强物理可信度。然而,这种反馈需要高质量的输出来进行视觉空间的监督,使得标准多步去噪计算成本高昂。为缓解这一问题,我们采用路径一致性学习,使监督在少量训练阶段仍然有效。为了实现精细的重新照明控制和监督,我们设计了一种结构化的六维注释协议,捕捉核心照明属性。在此基础上,我们提出了LumosBench,一种解耦的属性级基准,通过大型视觉语言模型评估照明可控性,实现对重新照明精度的自动和可解释评估。大量实验表明,UniLumos在显著提高物理一致性的同时,实现了图像和视频重新照明的20倍速度提升,达到最先进的重新照明质量。代码可在https://github.com/alibaba-damo-academy/Lumos-Custom/ 获取。
Summary / 总结
UniLumos is a unified framework for image and video relighting that integrates RGB-space geometry feedback into a flow matching backbone. By supervising the model with depth and normal maps extracted from its own outputs, it aligns lighting effects with scene structure and improves physical plausibility. The framework employs path consistency learning to keep this supervision effective under few-step training, and a structured six-dimensional annotation protocol supports fine-grained control. It also introduces LumosBench, an attribute-level benchmark that evaluates lighting controllability with large vision-language models. Experiments show that UniLumos achieves state-of-the-art relighting quality with significantly improved physical consistency and a 20x speedup for both image and video relighting. Code is available at https://github.com/alibaba-damo-academy/Lumos-Custom.
UniLumos 是一个统一的图像和视频重新照明框架,通过将 RGB 空间几何反馈整合到流匹配骨干中,增强物理合理性。通过使用深度和法线图进行监督,它将照明效果与场景结构对齐,提高逼真度。路径一致性学习允许在更少的步骤中进行有效的监督,从而减少计算成本。UniLumos 在图像和视频重新照明方面实现了最先进的质量,速度提高了 20 倍。它还引入了 LumosBench,这是一种使用大型视觉语言模型评估照明可控性的基准。代码可在 https://github.com/alibaba-damo-academy/Lumos-Custom 获取。
Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers
Authors: Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan
First: 2025-11-03T14:25:12+00:00 · Latest: 2025-11-03T14:25:12+00:00
Abstract
In the retrieval domain, candidates' fusion from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, representing massive gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
中文标题/摘要
标题:Vote-in-Context:将VLM转化为零样本排名融合器
在检索领域,异构检索器候选融合是一个长期存在的挑战,尤其是在视频等复杂多模态数据方面。尽管典型的融合技术是无训练的,但它们仅依赖于排名或得分信号,忽略了候选者的表示。本文引入了Vote-in-Context (ViC),这是一种通用的、无训练的框架,重新思考列表级重排序和融合为视觉语言模型(VLM)的零样本推理任务。核心洞察是直接在VLM的提示中序列化内容证据和检索器元数据,使模型能够适应性地权衡检索器共识与视觉语言内容。通过将其应用于跨模态视频检索这一具有挑战性的领域,我们展示了该框架的通用性。为此,我们引入了S-Grid,这是一种紧凑的序列化映射,将每个视频表示为图像网格,可选地配以字幕,以支持视频候选者的列表级推理。ViC 作为单列表重排序器进行评估,它显著提高了单个检索器的精度;作为集成融合器进行评估时,它始终优于强基线如CombSUM。在包括ActivityNet和VATEX在内的视频检索基准测试中,该框架建立了新的零样本检索性能的最新记录,展示了其在处理复杂视觉和时间信号的同时处理文本的有效性。在零样本设置中,ViC 在 MSR-VTT 上实现了 87.1% (t2v) / 89.0% (v2t) 的 Recall@1,在 VATEX 上实现了 99.6% (v2t) 的 Recall@1,相对于之前的最新基线,精度提高了高达 +40%。我们提出 ViC 作为一种简单、可重复且高效的食谱,将现代 VLM 转化为强大的零样本重排序器和融合器。代码和资源可在 https://github.com/mohammad2012191/ViC 公开获取。
Summary / 总结
This work addresses the challenge of fusing heterogeneous retrievers for complex, multi-modal data like videos. It introduces Vote-in-Context (ViC), a training-free framework that uses a Vision-Language Model (VLM) to rerank and fuse retrieval candidates by serializing content evidence and retriever metadata within the VLM's prompt. ViC significantly improves the precision of individual retrievers and outperforms strong baselines like CombSUM in zero-shot settings, achieving new state-of-the-art performance on benchmarks such as ActivityNet and VATEX with up to +40 Recall@1 gains over previous methods.
该研究针对复杂多模态数据(尤其是视频)的异构检索器融合问题,引入了基于视觉-语言模型(VLM)的训练免费框架Vote-in-Context(ViC)。ViC通过在VLM提示中序列化内容证据和检索器元数据来重新排序和融合候选者。该框架在跨模态视频检索基准上进行了评估,实现了新的最佳性能,MSR-VTT上的t2v Recall@1得分为87.1% / v2t得分为89.0%,VATEX上的v2t得分为99.6%,超越了如CombSUM等强基线。
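For context on the CombSUM baseline named in the Vote-in-Context abstract, here is a minimal sketch of score-based fusion: each retriever's scores are normalized and then summed per candidate. Min-max normalization is one common choice and is an assumption here, as are the function names and the toy scores; this illustrates the baseline family that ViC is compared against, not ViC itself.

```python
def min_max(scores):
    """Rescale one retriever's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}  # no signal if all scores tie
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Fuse several {candidate: score} runs by summing normalized scores."""
    fused = {}
    for run in runs:
        for doc, s in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    # Highest fused score first; candidate id breaks ties deterministically.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical scores from two retrievers on different scales.
runs = [
    {"v1": 0.9, "v2": 0.5, "v3": 0.1},
    {"v2": 12.0, "v3": 10.0, "v4": 2.0},
]
ranking = comb_sum(runs)
print(ranking)
```

Here `v2` ranks first because both retrievers score it well, even though neither signal looks at the video content. That is the limitation the abstract points to when it says typical fusion relies solely on rank or score signals.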