arXiv 论文速递

2025-12-14 03:27
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov
First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00
Comments: Project page: https://snap-research.github.io/omni-attribute
Abstract
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
中文标题/摘要
标题:Omni-Attribute:面向视觉概念个性化的开放词汇属性编码器
视觉概念个性化旨在仅将特定图像属性(如身份、表情、光照和风格)转移到未见的上下文中。然而,现有方法依赖于通用图像编码器的整体嵌入,这会将多个视觉因素纠缠在一起,使得难以隔离单一属性。这通常会导致信息泄露和不连贯的合成。为了解决这一局限性,我们引入了Omni-Attribute,这是第一个用于学习高保真度、属性特定表示的开放词汇图像属性编码器。我们的方法联合设计了数据和模型:(i) 我们收集了带有正负属性标注的语义关联图像对,以明确地教导编码器保留或抑制什么;(ii) 我们采用了一种双目标训练范式,平衡生成保真度与对比性解耦。所得嵌入在开放词汇属性检索、个性化和组合生成方面证明是有效的,并在多个基准测试中达到了最先进的性能。
Summary / 总结
Omni-Attribute is an open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations for visual concept personalization. It addresses the limitation of existing methods by curating semantically linked image pairs and using a dual-objective training paradigm. The encoder achieves state-of-the-art performance in open-vocabulary attribute retrieval, personalization, and compositional generation across multiple benchmarks.
研究旨在开发一种方法,以在不纠缠多种视觉因素的情况下,将特定图像属性如身份和表情转移到未见过的上下文中。引入了Omni-Attribute,这是一种开放词汇量的图像属性编码器,用于学习高保真的属性特定表示。该方法使用语义关联的图像对和双目标训练范式来实现这一目标。实验结果表明,Omni-Attribute 在开放词汇量属性检索、个性化和组合生成方面优于现有方法,并在多个基准测试中达到了最先进的性能。
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
First: 2025-12-11T18:59:22+00:00 · Latest: 2025-12-11T18:59:22+00:00
Abstract
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.
中文标题/摘要
标题:VL-JEPA:视觉语言联合嵌入预测架构
我们介绍了基于联合嵌入预测架构(JEPA)的VL-JEPA视觉语言模型。与经典VLMs逐个生成标记不同,VL-JEPA预测目标文本的连续嵌入。通过在抽象表示空间中学习,该模型专注于任务相关的语义,同时抽象掉表面语言的变异性。在严格控制的比较中,与使用相同视觉编码器和训练数据的标准标记空间VLM训练相比,VL-JEPA在参数量减少50%的情况下表现出更强的性能。在推理时,仅在需要时调用轻量级文本解码器将VL-JEPA预测的嵌入转换为文本。我们展示了VL-JEPA原生支持选择性解码,将解码操作减少2.85倍,同时保持与非自适应均匀解码相似的性能。除了生成之外,VL-JEPA的嵌入空间自然支持开放词汇分类、文本到视频检索和区分性VQA,无需任何架构修改。在八个视频分类数据集和八个视频检索数据集上,VL-JEPA的平均性能超过了CLIP、SigLIP2和感知编码器。同时,该模型在四个VQA数据集(GQA、TallyQA、POPE和POPEv2)上实现了与经典VLMs(InstructBLIP、QwenVL)相当的性能,尽管只有1.6B参数。
Summary / 总结
VL-JEPA is a vision-language model that uses a Joint Embedding Predictive Architecture to predict continuous embeddings of target texts instead of autoregressively generating tokens. In a controlled comparison with the same vision encoder and training data, it outperforms standard token-space VLM training with 50% fewer trainable parameters, and its selective decoding cuts the number of decoding operations by 2.85x at similar performance. On average, VL-JEPA surpasses CLIP, SigLIP2, and Perception Encoder across eight video classification and eight video retrieval datasets, and it is comparable to classical VLMs on four VQA datasets despite having only 1.6B parameters.
VL-JEPA 是一种基于联合嵌入预测架构的视觉语言模型,它预测目标文本的连续嵌入而不是自回归生成标记。这种方法使得模型在更少的参数下表现出更强的性能,并支持选择性解码,将解码操作的数量减少2.85倍。VL-JEPA 在多个视频分类和检索任务中表现出色,并且在 VQA 任务中实现了与传统视觉语言模型相当的性能,尽管其参数量仅为16亿(1.6B)。
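The VL-JEPA abstract states that the embedding space supports open-vocabulary classification without architectural changes. A minimal sketch of how such classification is commonly done in a joint embedding space is shown below: the predicted embedding is compared with one embedding per candidate label by cosine similarity. The vectors, label names, and the `classify` helper are hypothetical placeholders, not outputs or APIs of the actual model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(pred_embedding, label_embeddings):
    """Return the label whose embedding is closest to the prediction."""
    return max(
        label_embeddings,
        key=lambda name: cosine(pred_embedding, label_embeddings[name]),
    )


# Hypothetical 3-d embeddings; a real model would produce far larger vectors.
label_embeddings = {
    "cooking": [0.9, 0.1, 0.0],
    "swimming": [0.0, 0.8, 0.6],
    "running": [0.1, 0.2, 0.9],
}
predicted = [0.05, 0.7, 0.7]  # stand-in for a predicted text embedding
label = classify(predicted, label_embeddings)
```

Because the label set is just a dictionary of embeddings, new classes can be added at inference time by embedding their names, which is what makes this style of classification open-vocabulary.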
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong
First: 2025-12-11T18:57:05+00:00 · Latest: 2025-12-11T18:57:05+00:00
Abstract
Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
中文标题/摘要
标题:BabyVLM-V2:面向发展性基础视觉模型预训练和基准测试的框架
早期儿童的发展轨迹为高效样本预训练视觉基础模型提供了自然目标。我们介绍了BabyVLM-V2,这是一种基于发展的婴儿启发式视觉语言建模框架,通过纵向多维度预训练集、多功能模型以及最重要的是DevCV工具箱进行认知评估,大幅改进了BabyVLM-V1。预训练集最大限度地覆盖了内容,同时减少了纵向婴儿为中心的视听素材的整理,生成了视频-语句、图像-语句和多轮对话数据,这些数据反映了婴儿的经验。DevCV工具箱将最近发布的NIH婴儿工具箱中所有与视觉相关的度量标准改编为包含十个跨模态任务的基准测试套件,涵盖了空间推理、记忆和词汇理解,与早期儿童的能力相一致。实验结果表明,从零开始预训练的紧凑模型在DevCV工具箱上可以达到竞争力的表现,某些任务上优于GPT-4o。我们希望BabyVLM-V2框架能够促进发展性基础视觉模型预训练的研究。
Summary / 总结
The research aims to develop vision foundation models that can learn efficiently from infant-centric data, aligning with early children's cognitive development. BabyVLM-V2 uses a longitudinal, multifaceted pretraining set and DevCV Toolbox for evaluation, which includes ten multimodal tasks. The model pretrained from scratch achieves competitive performance on these tasks, outperforming GPT-4o on some. This framework is expected to advance research in developmentally plausible pretraining of vision foundation models.
研究旨在通过使用婴儿样数据来训练视觉基础模型,提高样本效率。BabyVLM-V2 使用了一个纵向的、多方面的预训练数据集和一个DevCV 工具箱来进行认知评估,其中包括十个与早期儿童认知能力相匹配的多模态任务。实验结果显示,一个从零开始预训练的紧凑型模型在这些任务上可以达到竞争力的表现,部分任务上甚至超过了GPT-4o。
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
Authors: George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Alina Shutova, Denis Kuznedelev
First: 2025-12-11T18:57:02+00:00 · Latest: 2025-12-11T18:57:02+00:00
Comments: Preprint, work in progress
Abstract
Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing time to first non-thinking token from minutes to <= 5s and the overall real-time delays by 6-11x.
中文标题/摘要
标题:异步推理:无需训练的交互式思考大语言模型
许多最先进的大语言模型在给出答案之前会先进行思考。推理可以大大提升语言模型的能力和安全性,但也使它们变得不那么互动:给定新的输入,模型必须停止思考才能做出回应。现实世界中的用例,如基于语音或嵌入式助手,需要大语言模型代理能够实时响应并根据额外信息进行调整,这与顺序交互不兼容。相比之下,人类可以异步地听、思考和行动:我们在阅读问题时就开始思考,并在构思答案时继续思考。在本项工作中,我们通过利用旋转嵌入的特性,使原本用于顺序交互的大语言模型能够在不进行额外训练的情况下同时思考、聆听和生成输出。我们对数学、常识和安全推理进行了评估,发现这种方法可以实时生成准确的增强思考的答案,将生成首个非思考标记的时间从几分钟缩短到≤5秒,并将整体实时延迟减少6-11倍。
Summary / 总结
This work addresses a limitation of reasoning LLMs: a model must stop thinking before it can respond to new input, which makes it less interactive. The authors propose a training-free method that uses the properties of rotary embeddings to let LLMs think, listen, and generate outputs simultaneously. Evaluations on math, commonsense, and safety reasoning show that the approach generates accurate thinking-augmented answers in real time, cutting the time to the first non-thinking token from minutes to at most 5 seconds and overall real-time delays by 6-11x.
该研究针对现有先进LLM需要在回答前停止推理的问题,使其不够互动。作者提出了一种方法,利用旋转嵌入的特性,使LLM能够同时思考、倾听和生成输出。在数学、常识和安全推理上的评估表明,该方法可以实时生成准确的答案,相比传统方法显著减少了响应延迟。
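The Asynchronous Reasoning abstract says the method relies on "the properties of rotary embeddings" without naming them. One well-known property is that attention scores between rotated queries and keys depend only on the relative offset between positions, not on their absolute values. The sketch below demonstrates that property with a plain-Python rotary embedding; whether this is the exact property the authors exploit is an assumption, and the `rotate` helper and the vectors are illustrative only.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        # Each consecutive pair of dimensions is rotated by its own frequency.
        theta = position * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))
```

Since `score_a` and `score_b` agree up to floating-point error, blocks of cached tokens can in principle be assigned shifted positions without changing how tokens inside a block attend to one another.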
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Authors: Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu
First: 2025-12-11T18:23:03+00:00 · Latest: 2025-12-11T18:23:03+00:00
Comments: Project page: https://intchous.github.io/DuetSVG-site
Abstract
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
中文标题/摘要
标题:DuetSVG:统一的多模态SVG生成,带有内部视觉指导
基于视觉-语言模型(VLM)的方法在SVG生成方面取得了令人印象深刻的成果。然而,由于它们在解码过程中仅生成文本而缺乏视觉信号,因此往往难以处理复杂的语义,无法生成视觉上吸引人或几何上一致的SVG。我们引入了DuetSVG,这是一种统一的多模态模型,可以以端到端的方式同时生成图像标记和相应的SVG标记。DuetSVG在图像和SVG数据集上进行训练。在推理时,我们应用了一种新颖的测试时缩放策略,利用模型的原生视觉预测作为指导,以提高SVG解码质量。广泛的实验表明,我们的方法优于现有方法,能够生成视觉上忠实、语义上对齐且语法上干净的SVG。
Summary / 总结
The research motivation is to address the limitations of existing vision-language model (VLM)-based SVG generation approaches, which often fail to produce visually appealing or geometrically coherent SVGs due to the lack of visual signals during decoding. The main method is DuetSVG, a unified multimodal model that generates both image and SVG tokens end-to-end, and uses a test-time scaling strategy to guide SVG decoding with visual predictions. Key experimental findings show that DuetSVG outperforms existing methods in generating visually faithful, semantically aligned, and syntactically clean SVGs across various applications.
研究动机是解决现有基于视觉-语言模型(VLM)的SVG生成方法在解码过程中缺乏视觉信号的问题,导致生成的SVG往往不够视觉上吸引人或几何上连贯。主要方法是DuetSVG,这是一种联合生成图像和SVG标记的统一多模态模型,并在测试时使用视觉预测来指导SVG解码。关键实验发现表明,DuetSVG在各种应用中生成了视觉上忠实、语义上对齐且语法上干净的SVG。
PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction
Authors: Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland
First: 2025-12-11T18:19:00+00:00 · Latest: 2025-12-11T18:19:00+00:00
Comments: 15 pages, 7 figures
Abstract
Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.
中文标题/摘要
标题:PubTables-v2:新的大规模数据集用于全页和多页表格提取
表格提取(TE)是视觉文档理解中的一个关键挑战。传统方法先检测表格,然后识别其结构。最近,人们开始开发可以直接在全页或文档上下文中提取表格的方法,例如视觉-语言模型(VLMs)。然而,由于缺乏标注数据,进步难以展示。为了解决这个问题,我们创建了一个新的大规模数据集,PubTables-v2。PubTables-v2 支持当前许多具有挑战性的表格提取任务。值得注意的是,它是第一个大规模的多页表格结构识别基准。我们通过在这些任务上评估领域专用的 VLMs 来展示其用途,并突出当前的进展。最后,我们使用 PubTables-v2 创建了 Page-Object Table Transformer(POTATR),这是一种表格变换器的图像到图扩展,用于全面的页面级 TE。数据、代码和训练模型将被发布。
Summary / 总结
The research aims to improve table extraction in visual document understanding by addressing the lack of annotated data. The authors developed PubTables-v2, a large-scale dataset for full-page and multi-page table extraction, especially focusing on multi-page table structure recognition. Key findings include the evaluation of domain-specialized vision-language models and the introduction of POTATR, an image-to-graph extension of the Table Transformer for comprehensive page-level table extraction.
研究旨在通过解决标注数据不足的问题,提高视觉文档理解中的表格提取。研究引入了PubTables-v2,这是一个用于全页和多页表格提取的大规模数据集,特别是第一个大规模的多页表格结构识别基准。关键发现包括对领域特定的视觉语言模型的评估以及开发了POTATR,这是一种表格变换器的图像到图扩展,用于全面的页面级表格提取。
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Authors: Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang
First: 2025-12-11T18:00:21+00:00 · Latest: 2025-12-11T18:00:21+00:00
Abstract
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
中文标题/摘要
标题:从宏观到微观:通过视觉语言模型评估分子微观空间智能
本文介绍了微观空间智能(MiSI)的概念,即感知和推理看不见的微观实体的空间关系的能力,这是科学研究的基础。为了评估视觉语言模型(VLMs)在这一领域的潜力,我们提出了一种系统性的基准框架MiSI-Bench。该框架包含超过163,000个问答对和587,000张图像,源自约4,000个分子结构,涵盖了九个互补任务,评估能力从基本的空间变换到复杂的关联识别。实验结果表明,当前最先进的VLMs在这一基准上的表现远低于人类水平。然而,微调后的7B模型显示出巨大的潜力,甚至在空间变换任务上超过了人类,而其在氢键识别等基于科学的任务上的表现不佳,突显了整合显式领域知识以实现科学AGI的必要性。数据集可在https://huggingface.co/datasets/zongzhao/MiSI-bench获取。
Summary / 总结
This paper introduces Microscopic Spatial Intelligence (MiSI), the ability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To evaluate Vision-Language Models (VLMs) on this ability, the authors build the MiSI-Bench benchmark, with over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures across nine tasks. Current state-of-the-art VLMs perform significantly below human level, while a fine-tuned 7B model even surpasses humans on spatial transformation tasks. Its poor performance on scientifically grounded tasks such as hydrogen bond recognition highlights the need to integrate explicit domain knowledge.
本文介绍了微观空间智能(MiSI),即感知和推理微观实体空间关系的能力,对于科学研究至关重要。开发了一个基准框架MiSI-Bench,包含163,000个问答对和587,000张来自约4,000个分子结构的图像,评估了从简单空间变换到复杂关系识别的九项任务。最先进的视觉-语言模型(VLMs)的表现低于人类水平,但经过微调的7B模型在空间变换任务中显示出潜力,强调了集成领域特定知识对于科学进步的必要性。
Fairness-Aware Fine-Tuning of Vision-Language Models for Medical Glaucoma Diagnosis
Authors: Zijian Gu, Yuxi Liu, Zhenhao Zhang, Song Wang
First: 2025-12-03T06:09:14+00:00 · Latest: 2025-12-11T17:17:07+00:00
Comments: 10 pages, 3 tables
Abstract
Vision-language models achieve expert-level performance on medical imaging tasks but exhibit significant diagnostic accuracy disparities across demographic groups. We introduce fairness-aware Low-Rank Adaptation for medical VLMs, combining parameter efficiency with explicit fairness optimization. Our key algorithmic contribution is a differentiable MaxAccGap loss that enables end-to-end optimization of accuracy parity across demographic groups. We propose three methods: FR-LoRA integrates MaxAccGap regularization into the training objective, GR-LoRA applies inverse frequency weighting to balance gradient contributions, and Hybrid-LoRA combines both mechanisms. Evaluated on 10,000 glaucoma fundus images, GR-LoRA reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies reveal that strong regularization strength achieves optimal fairness with minimal accuracy trade-off, and race-specific optimization yields 60% disparity reduction. Our approach requires only 0.24% trainable parameters, enabling practical deployment of fair medical AI in resource-constrained healthcare settings.
中文标题/摘要
标题:面向医学青光眼诊断的公平性意识微调视觉-语言模型
视觉-语言模型在医学成像任务上达到专家级性能,但在不同人口群体中表现出显著的诊断准确率差异。我们引入了公平性意识的低秩适应方法,结合了参数效率和显式的公平优化。我们的主要算法贡献是一种可微分的MaxAccGap损失,使跨人口群体的准确率平等化成为端到端优化的目标。我们提出了三种方法:FR-LoRA将MaxAccGap正则化整合到训练目标中,GR-LoRA应用逆频率加权以平衡梯度贡献,Hybrid-LoRA结合了这两种机制。在10,000张青光眼眼底图像上评估,GR-LoRA将诊断准确率差异降低了69%,同时保持53.15%的整体准确率。消融研究显示,较强的正则化强度可以实现最佳公平性,同时最小化准确率的损失,种族特定优化可实现60%的差异减少。我们的方法只需要0.24%的可训练参数,使公平的医学AI在资源受限的医疗保健环境中具有实际部署的可能性。
Summary / 总结
This study addresses the issue of diagnostic accuracy disparities in medical glaucoma diagnosis by vision-language models across demographic groups. It introduces a fairness-aware Low-Rank Adaptation (LoRA) method, which includes three techniques: FR-LoRA, GR-LoRA, and Hybrid-LoRA, to optimize accuracy parity. GR-LoRA, which applies inverse frequency weighting, significantly reduces diagnostic accuracy disparities by 69% while maintaining 53.15% overall accuracy. Ablation studies show that strong regularization strength and race-specific optimization are effective in achieving fairness with minimal accuracy loss.
该研究针对视觉语言模型在眼科疾病诊断中的诊断准确率在不同人群之间存在显著差异的问题,引入了一种公平性意识的低秩适应(LoRA)方法,该方法包含一个可微分的MaxAccGap损失,用于优化准确率的公平性。提出了三种方法——FR-LoRA、GR-LoRA和混合LoRA,其中GR-LoRA在减少69%的诊断准确率差异的同时,保持了53.15%的整体准确率。该方法只需要极少的可训练参数,适用于资源受限的医疗保健环境。
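The fairness-aware LoRA abstract names two mechanisms: a differentiable MaxAccGap loss and inverse frequency weighting. The exact formulas are not given in the abstract, so the sketch below is one plausible reading under stated assumptions: per-group "soft accuracy" is taken as the mean probability assigned to the true class, the gap is the largest minus the smallest group value, and weights are inversely proportional to group size. All function names and data are hypothetical.

```python
from collections import defaultdict


def soft_group_accuracy(probs_true, groups):
    """Mean probability of the true class per demographic group (assumed proxy)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for p, g in zip(probs_true, groups):
        sums[g] += p
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def max_acc_gap(probs_true, groups):
    """Largest difference in soft accuracy between any two groups."""
    acc = soft_group_accuracy(probs_true, groups)
    return max(acc.values()) - min(acc.values())


def inverse_frequency_weights(groups):
    """Weight each group by n / (num_groups * group_size)."""
    counts = defaultdict(int)
    for g in groups:
        counts[g] += 1
    n, num_groups = len(groups), len(counts)
    return {g: n / (num_groups * c) for g, c in counts.items()}


# Hypothetical batch: probability given to the correct label, and group id.
probs_true = [0.9, 0.8, 0.7, 0.4, 0.5]
groups = ["a", "a", "a", "b", "b"]

gap = max_acc_gap(probs_true, groups)         # regularizer term to add to the loss
weights = inverse_frequency_weights(groups)   # per-group sample weights
```

Using probabilities rather than hard 0/1 correctness is what keeps the gap usable as a training signal; in an autodiff framework the same arithmetic would be applied to tensors.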
Replace, Don't Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly
Authors: Moshe Lahmy, Roi Yozevitch
First: 2025-12-11T16:31:29+00:00 · Latest: 2025-12-11T16:31:29+00:00
Comments: 24 pages, 2 figures
Abstract
Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-k, typically address this by adding more context or pruning existing lists. However, simply expanding the context window often leads to context dilution, where distractors crowd out relevant information. We propose SEAL-RAG, a training-free controller that adopts a "replace, don't expand" strategy to fight context dilution under a fixed retrieval depth k. SEAL executes a (Search → Extract → Assess → Loop) cycle: it performs on-the-fly, entity-anchored extraction to build a live gap specification (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-k in a shared environment on HotpotQA and 2WikiMultiHopQA. On HotpotQA (k=3), SEAL improves answer correctness by +3-13 pp and evidence precision by +12-18 pp over Self-RAG. On 2WikiMultiHopQA (k=5), it outperforms Adaptive-k by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. These gains are statistically significant (p<0.001). By enforcing fixed-k replacement, SEAL yields a predictable cost profile while ensuring the top-k slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
中文标题/摘要
标题:替换,不要扩展:通过固定预算证据组装缓解多跳RAG中的上下文稀释
检索增强生成(RAG)系统在处理多跳查询时常常会失败,因为初始检索可能会错过桥梁事实。此前的纠正方法,如Self-RAG、CRAG和Adaptive-k,通常通过增加更多上下文或修剪现有列表来解决这一问题。然而,简单地扩展上下文窗口往往会导致上下文稀释,即干扰信息挤占了相关信息。我们提出了SEAL-RAG,这是一种无需训练的控制器,采用“替换,不要扩展”的策略,在固定检索深度k下对抗上下文稀释。SEAL执行一个(S搜索→E提取→A评估→L循环)循环:它进行实时、实体锚定的提取,构建一个动态的“缺口规范”(缺失的实体/关系),触发有针对性的微查询,并使用实体优先排序来主动替换干扰信息,以获取填补缺口的证据。我们在共享环境中对SEAL-RAG与Basic RAG、CRAG、Self-RAG和Adaptive-k的忠实重实现进行了评估,评估数据集为HotpotQA和2WikiMultiHopQA。在HotpotQA(k=3)上,SEAL将答案正确性提高了3-13个百分点,证据精确度提高了12-18个百分点,超过Self-RAG。在2WikiMultiHopQA(k=5)上,它在准确性上比Adaptive-k提高了8.0个百分点,并且保持了96%的证据精确度,而CRAG仅为22%。这些增益在统计上具有显著性(p<0.001)。通过强制执行固定k替换,SEAL提供了可预测的成本模型,同时确保了前k个槽位优化的是精确度而非简单的广度。我们已在https://github.com/mosherino/SEAL-RAG发布了我们的代码和数据。
Summary / 总结
The paper addresses context dilution in multi-hop Retrieval-Augmented Generation (RAG) systems, where expanding the context window lets distractors crowd out relevant facts. It introduces SEAL-RAG, a training-free controller that uses a "replace, don't expand" strategy under a fixed retrieval depth k. SEAL-RAG performs on-the-fly, entity-anchored extraction and entity-first ranking to swap out distractors for gap-closing evidence. On HotpotQA (k=3) it improves answer correctness by 3-13 pp and evidence precision by 12-18 pp over Self-RAG; on 2WikiMultiHopQA (k=5) it outperforms Adaptive-k by 8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG.
论文针对多跳检索增强生成(RAG)系统中因扩展上下文而产生的上下文稀释问题(干扰项挤占相关信息),提出了SEAL-RAG,采用“替换,不要扩展”策略来缓解这一问题。SEAL执行搜索、提取、评估和循环的周期,构建实时的缺口规范,触发目标微查询,并使用实体优先排序来替换掉干扰项。实验结果显示,SEAL在HotpotQA和2WikiMultiHopQA上的表现优于现有方法如Self-RAG和Adaptive-k。在HotpotQA上,SEAL在答案正确性和证据精度上分别比Self-RAG高出3-13个百分点和12-18个百分点。在2WikiMultiHopQA上,SEAL比Adaptive-k高出8.0个百分点的准确率,并保持96%的证据精度,而CRAG仅为22%。
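The "replace, don't expand" idea in the SEAL-RAG entry can be illustrated with a small fixed-budget evidence set: a new passage only enters by evicting a weaker one, so the set never grows beyond k. This is a simplified sketch, not the authors' controller. The `gap_score` function (counting how many missing entities a passage mentions) is a hypothetical stand-in for the entity-first ranking described in the abstract, and the passages are invented.

```python
def replace_dont_expand(evidence, candidates, score, k):
    """Keep at most k passages; a candidate enters only by evicting a weaker one."""
    kept = sorted(evidence, key=score, reverse=True)[:k]
    for cand in candidates:
        if cand in kept:
            continue
        if len(kept) < k:
            kept.append(cand)
            continue
        weakest = min(kept, key=score)
        if score(cand) > score(weakest):
            kept[kept.index(weakest)] = cand  # swap in place, budget unchanged
    return kept


# Hypothetical gap specification: entities the current evidence is missing.
gap_entities = {"Paris", "Seine"}


def gap_score(passage):
    """Count how many gap entities a passage mentions (illustrative only)."""
    return sum(1 for entity in gap_entities if entity in passage)


evidence = ["Paris is in France.", "Bananas are yellow.", "Cats sleep a lot."]
candidates = ["The Seine flows through Paris."]  # e.g. from a targeted micro-query
result = replace_dont_expand(evidence, candidates, gap_score, k=3)
```

The context handed to the generator stays at k passages on every loop iteration, which is where the predictable cost profile mentioned in the abstract comes from.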
Optimization-Guided Diffusion for Interactive Scene Generation
Authors: Shihao Li, Naisheng Ye, Tianyu Li, Kashyap Chitta, Tuo An, Peng Su, Boyang Wang, Haiou Liu, Chen Lv, Hongyang Li
First: 2025-12-08T15:56:18+00:00 · Latest: 2025-12-11T15:08:39+00:00
Abstract
Realistic and diverse multi-agent driving scenes are crucial for evaluating autonomous vehicles, but safety-critical events, which are essential for this task, are rare and underrepresented in driving datasets. Data-driven scene generation offers a low-cost alternative by synthesizing complex traffic behaviors from existing driving logs. However, existing models often lack controllability or yield samples that violate physical or social constraints, limiting their usability. We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling from a scene generation model. OMEGA re-anchors each reverse diffusion step via constrained optimization, steering the generation towards physically plausible and behaviorally coherent trajectories. Building on this framework, we formulate ego-attacker interactions as a game-theoretic optimization in the distribution space, approximating Nash equilibria to generate realistic, safety-critical adversarial scenarios. Experiments on nuPlan and Waymo show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes from 32.35% to 72.27% for free exploration capabilities, and from 11% to 80% for controllability-focused generation. Our approach can also generate 5x more near-collision frames with a time-to-collision under three seconds while maintaining the overall scene realism.
中文标题/摘要
标题:优化引导扩散在交互场景生成中的应用
真实且多样化的多智能体驾驶场景对于评估自动驾驶车辆至关重要,但对此任务至关重要的安全关键事件在驾驶数据集中很少见且代表性不足。数据驱动的场景生成提供了一种低成本的替代方案,通过从现有的驾驶日志中合成复杂的交通行为。然而,现有的模型往往缺乏可控性,或者生成的样本违反了物理或社会约束,限制了它们的实用性。我们提出了OMEGA,这是一种优化引导、无需训练的框架,在基于扩散的场景生成模型采样过程中确保结构一致性并增强交互意识。OMEGA 通过约束优化重新锚定每个逆向扩散步骤,引导生成物理上合理且行为上一致的轨迹。在此框架基础上,我们将自车与攻击者的交互建模为分布空间中的博弈论优化,近似纳什均衡以生成真实的安全关键对抗场景。在nuPlan和Waymo上的实验表明,OMEGA 提高了生成的真实感、一致性和可控性,使自由探索能力下物理和行为上有效的场景比例从32.35%提高到72.27%,可控性生成下从11%提高到80%。此外,我们的方法还可以生成5倍的近碰撞帧(碰撞时间少于3秒),同时保持整体场景的真实感。
Summary / 总结
The research aims to generate realistic and diverse multi-agent driving scenes for autonomous vehicle evaluation, addressing the scarcity of safety-critical events in driving datasets. OMEGA, an optimization-guided framework, enhances the controllability and physical plausibility of generated scenes by re-anchoring each reverse diffusion step via constrained optimization. Experiments on nuPlan and Waymo demonstrate that OMEGA significantly improves scene realism and consistency, increasing the ratio of valid scenes from 32.35% to 72.27% for free exploration and from 11% to 80% for controllability-focused generation. Additionally, OMEGA can generate five times more near-collision frames with a time-to-collision under three seconds while preserving overall scene realism.
论文提出了OMEGA,一种优化引导的框架,用于生成现实且多样的多智能体驾驶场景。该框架通过在扩散采样过程中强制执行结构一致性和交互意识来解决现有模型的局限性。实验表明,OMEGA 显著提高了生成的真实性和可控性,使自由探索的有效场景比例从 32.35% 提高到 72.27%,可控性生成的有效场景比例从 11% 提高到 80%。此外,它还能生成五倍数量的时间到碰撞小于三秒的接近碰撞帧,同时保持场景的真实性。
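The OMEGA entry counts near-collision frames with a time-to-collision (TTC) under three seconds. The abstract does not say how TTC is computed, so the sketch below uses a common simplification: two agents are treated as discs moving at constant velocity, and TTC is the earliest time their centers come within a combined radius. The disc model, the radius, and the example numbers are assumptions for illustration.

```python
import math


def time_to_collision(rel_pos, rel_vel, radius):
    """Earliest t >= 0 at which two constant-velocity discs touch, or None."""
    px, py = rel_pos
    vx, vy = rel_vel
    a = vx * vx + vy * vy
    b = 2.0 * (px * vx + py * vy)
    c = px * px + py * py - radius * radius
    if c <= 0:
        return 0.0  # already overlapping
    if a == 0:
        return None  # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0 or b >= 0:
        return None  # paths miss each other, or agents are moving apart
    return (-b - math.sqrt(disc)) / (2.0 * a)


# Other agent 20 m ahead, closing at 10 m/s, combined radius 2 m.
ttc = time_to_collision((20.0, 0.0), (-10.0, 0.0), 2.0)
near_collision = ttc is not None and ttc < 3.0

# Same geometry but the gap is opening: no collision is predicted.
ttc_apart = time_to_collision((20.0, 0.0), (10.0, 0.0), 2.0)
```

A frame-level metric like the one in the abstract would apply this check to every agent pair in every generated frame and count the frames where `near_collision` holds.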
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
Authors: Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell
First: 2025-12-11T14:59:07+00:00 · Latest: 2025-12-11T14:59:07+00:00
Abstract
End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark among existing VLM-based methods.
中文标题/摘要
标题:SpaceDrive:将空间意识融入基于VLM的自动驾驶
基于视觉语言模型(VLMs)的端到端自动驾驶方法在大规模预训练获得的广泛视觉理解和强大推理能力的驱动下迅速发展。然而,我们发现当前的VLMs在理解细粒度的三维空间关系方面存在困难,这是与物理世界交互的系统的基本要求。为了解决这一问题,我们提出了SpaceDrive,这是一种空间感知的基于VLM的驾驶框架,将空间信息视为显式的位置编码(PEs),而不是文本数字标记,从而实现语义和空间表示的联合推理。SpaceDrive 使用一种通用的位置编码器对多视图深度估计、历史自车状态和文本提示生成的所有3D坐标进行编码。这些3D PE首先叠加到相应的2D视觉标记上以增强它们。同时,它们作为任务无关的坐标表示,取代了逐位数字标记作为VLM的输入和输出。这种机制使模型在空间推理中更好地索引特定的视觉语义,并直接回归轨迹坐标,而不是逐个生成数字,从而提高规划准确性。广泛的实验验证了SpaceDrive在nuScenes数据集上实现了最先进的开环性能,并在Bench2Drive闭环基准测试中取得了现有基于VLM的方法中第二高的驾驶得分78.02。
Summary / 总结
SpaceDrive is a spatial-aware VLM-based driving framework that enhances 3D spatial understanding for autonomous driving. It uses universal positional encodings (PEs) derived from multi-view depth estimation, historical ego-states, and text prompts to augment 2D visual tokens and replace digit-wise numerical tokens. Experiments show that SpaceDrive outperforms existing VLM-based methods on the nuScenes dataset and achieves the second-best Driving Score of 78.02 on the Bench2Drive benchmark, demonstrating improved planning accuracy through better spatial reasoning.
研究旨在提高视觉语言模型(VLMs)在自动驾驶中的空间推理能力。SpaceDrive 引入了空间感知的位置编码来增强 2D 视觉标记,并作为 VLM 的坐标表示。实验表明,SpaceDrive 在 nuScenes 数据集上取得了最先进的开环性能,并在 Bench2Drive 闭环基准测试中取得了现有基于 VLM 的方法中第二高的驾驶得分 78.02。
Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning
Authors: Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt, Michael Moor, Farhad Nooralahzadeh, Michael Krauthammer
First: 2025-12-11T14:36:14+00:00 · Latest: 2025-12-11T14:36:14+00:00
Comments: 10 pages main text (3 figures, 3 tables), 31 pages in total
Abstract
Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
中文标题/摘要
标题:利用强化学习增强放射学报告生成和视觉定位
近期视觉-语言模型(VLMs)在胸部X光(CXR)解释的多个方面取得了进步。然而,许多医学VLMs仅依赖于监督微调(SFT),这优化了下一个词的预测,但没有评估答案质量。相比之下,强化学习(RL)可以结合任务特定反馈,其与显式中间推理(“思考”)的结合在可验证的数学和编程任务上取得了显著进步。为了研究RL和思考在CXR VLM中的影响,我们首先在CXR数据上进行大规模SFT,基于Qwen3-VL构建更新的RadVLM,然后进行冷启动SFT阶段,使模型具备基本的思考能力。接着,我们应用组相对策略优化(GRPO)并使用临床相关的、任务特定的奖励进行报告生成和视觉定位,分别在特定领域和通用领域Qwen3-VL变体上进行匹配的RL实验,有思考和无思考。在这些设置中,我们发现虽然强大的SFT对于高基线性能至关重要,但RL在两个任务上提供了额外的增益,而显式的思考似乎并未进一步改善结果。在统一的评估管道下,RL优化的RadVLM模型优于其基线版本,并在报告生成和定位上达到最先进的性能,突显了临床对齐的RL作为SFT的有力补充对于医学VLMs的重要性。
Summary / 总结
This study investigates how reinforcement learning (RL) and explicit thinking affect a vision-language model (VLM) for Chest X-ray (CXR) interpretation. After large-scale supervised fine-tuning (SFT) on CXR data to build an updated RadVLM based on Qwen3-VL, and a cold-start SFT stage that adds basic thinking ability, the model is trained with Group Relative Policy Optimization (GRPO) using clinically grounded, task-specific rewards for report generation and visual grounding. The results show that strong SFT remains crucial for high base performance and that RL provides additional gains on both tasks, whereas explicit thinking does not appear to improve results further. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baselines and reach state-of-the-art performance on both report generation and grounding.
研究旨在利用强化学习(RL)提升胸部X光(CXR)解读中的放射学报告生成和视觉定位。研究人员首先通过监督微调(SFT)构建基于Qwen3-VL的更新版RadVLM,然后通过冷启动SFT阶段赋予模型基本的思考能力。接着,他们应用组相对策略优化(GRPO)结合临床相关的、任务特定的奖励,用于报告生成和视觉定位。实验结果显示,虽然强大的SFT对于获得高基线性能至关重要,但RL在两个任务上提供了额外的改进,而明确的思考并未进一步提升结果。RL优化后的RadVLM模型在报告生成和定位上均优于基线模型,并达到了该领域的最新技术水平。
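For reference, the group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration, not the RadVLM training code: the reward values are hypothetical, and the function name `group_relative_advantages` and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled response relative to the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical task rewards for four reports sampled for one chest X-ray.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to form a baseline.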
Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation
Authors: Siyu Chen, Ting Han, Chengzheng Fu, Changshe Zhang, Chaolei Wang, Jinhe Su, Guorong Cai, Meiliu Wu
Venue: NeurIPS 2025
First: 2025-06-11T15:54:47+00:00 · Latest: 2025-12-11T14:33:57+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Open-Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open-Vocabulary Domain-Generalized Semantic Segmentation (OV-DGSS). OV-DGSS aims to generate pixel-level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real-world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single-stage framework for OV-DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon the frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain-invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation on these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves the state-of-the-art performance and surpasses existing methods by a large margin in both domain generalization and open-vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/anonymouse-9c53tp182bvz/Vireo.
中文标题/摘要
标题:利用深度和语言实现开放词汇领域泛化语义分割
开放词汇语义分割(OVSS)和语义分割中的领域泛化(DGSS)突显了一种微妙的互补性,这激发了开放词汇领域泛化语义分割(OV-DGSS)的概念。OV-DGSS旨在生成未见类别的像素级掩码,同时在未见领域中保持鲁棒性,这对于自动驾驶等现实场景至关重要。我们提出了Vireo,这是一种新颖的一阶段框架,首次将OVSS和DGSS的优点统一起来。Vireo基于冻结的视觉基础模型(VFMs),并通过深度VFMs引入场景几何结构来提取领域不变的结构特征。为了在领域转移下弥合视觉和文本模态之间的差距,我们提出了三个关键组件:(1)GeoText提示,将几何特征与语言线索对齐,并逐步细化VFM编码器表示;(2)粗略掩码先验嵌入(CMPE),以增强梯度流动,加快收敛速度并增强文本影响;(3)领域开放词汇向量嵌入头(DOV-VEH),将细化的结构和语义特征融合以实现稳健预测。对这些组件的全面评估证明了我们设计的有效性。我们提出的Vireo在领域泛化和开放词汇识别方面均达到了最先进的性能,并大幅超越了现有方法,提供了一种统一且可扩展的解决方案,以实现多样化和动态环境中的稳健视觉理解。代码可在https://github.com/anonymouse-9c53tp182bvz/Vireo/ 获取。
Summary / 总结
The paper addresses the challenge of open-vocabulary domain-generalized semantic segmentation (OV-DGSS) by introducing Vireo, a novel single-stage framework. Vireo leverages frozen Visual Foundation Models and incorporates scene geometry through Depth VFMs to extract domain-invariant features. It also proposes GeoText Prompts, CMPE, and DOV-VEH to align geometric and textual modalities and enhance prediction robustness. Experimental results show that Vireo outperforms existing methods in both domain generalization and open-vocabulary recognition, providing a unified solution for visual understanding in diverse environments.
论文提出了Vireo,一种结合开放词汇语义分割和域泛化的新型单阶段框架,以解决开放词汇域泛化语义分割(OV-DGSS)的挑战。Vireo 使用冻结的视觉基础模型,并通过深度视觉基础模型引入场景几何,以提取域不变特征。此外,还提出了GeoText提示、粗略掩码先验嵌入和DOV-VEH,以使几何特征与语言线索对齐、增强梯度流动并融合精炼特征以实现稳健预测。实验结果表明,Vireo 在域泛化和开放词汇识别方面均优于现有方法,提供了一种在多样化环境中实现视觉理解的统一解决方案。
Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching
Authors: Javier Villena Toro, Mehdi Tarkian
First: 2025-12-11T14:20:17+00:00 · Latest: 2025-12-11T14:20:17+00:00
Abstract
Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance; factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: Similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
中文标题/摘要
标题:Geo6DPose:通过几何过滤特征匹配实现快速零样本6D物体姿态估计
零样本6D物体姿态估计的最新进展主要得益于大规模模型和基于云的推理。然而,这些方法通常会引入高延迟、增加的能耗以及与连接性、成本和数据治理相关的部署风险;这些因素与现实世界机器人应用中的计算资源有限和需要在设备上进行推理的实际限制相冲突。我们提出了Geo6DPose,这是一种轻量级、完全本地且无需训练的零样本6D姿态估计管道,通过牺牲模型规模来换取几何可靠性。我们的方法结合了基础模型的视觉特征与几何过滤策略:计算机载模板DINO描述符与场景片段之间的相似性图,并通过将场景片段中心投影到3D空间和将模板描述符投影到物体模型坐标系中来建立互相对应。最终姿态通过对应驱动的RANSAC恢复,并使用加权几何对齐度量进行排名,该度量同时考虑了重投影一致性与空间支持,从而提高了对噪声、杂乱和部分可见性的鲁棒性。Geo6DPose在单个通用GPU上实现了亚秒级推理,同时与显著更大的零样本基线(53.7 AR,1.08 FPS)保持相同的平均召回率。它不需要训练、微调或网络访问,并且与不断演进的基础骨干网络兼容,推动了适用于机器人部署的实用、完全本地的6D感知技术。
Summary / 总结
Geo6DPose is a lightweight, fully local, training-free pipeline for zero-shot 6D object pose estimation that trades model scale for geometric reliability. It computes similarity maps between template DINO descriptors and scene patches, establishes mutual correspondences by projecting scene patch centers to 3D and template descriptors to the object model coordinate system, and recovers poses with correspondence-driven RANSAC ranked by a weighted geometric alignment metric. The method reaches 53.7 AR at 1.08 FPS on a single commodity GPU (about 0.93 s per frame), matching the average recall of significantly larger zero-shot baselines while requiring no training, fine-tuning, or network access.
Geo6DPose 是一个轻量级、无需训练的零样本 6D 物体姿态估计管道,使用几何过滤来增强可靠性。该方法结合了基础模型的视觉特征和几何过滤策略,将场景片段中心投影到 3D,并将模板描述符投影到物体模型坐标系中。该方法在单个普通 GPU 上实现亚秒级推理,并且在平均召回率(53.7 AR,1.08 FPS)上与更大的零样本基线相当,无需训练、微调或网络访问。
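The matching step described above can be illustrated with a small sketch. It assumes that "mutual correspondences" means mutual nearest neighbours under cosine similarity, which the abstract does not spell out; the toy 2-D descriptors and the names `cosine` and `mutual_matches` are hypothetical, and the 3D projection and RANSAC stages are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_matches(template_desc, scene_desc):
    """Keep (template, scene) index pairs that are each other's best match."""
    sim = [[cosine(t, s) for s in scene_desc] for t in template_desc]
    best_scene = [max(range(len(scene_desc)), key=lambda j: row[j]) for row in sim]
    best_template = [
        max(range(len(template_desc)), key=lambda i: sim[i][j])
        for j in range(len(scene_desc))
    ]
    return [(i, j) for i, j in enumerate(best_scene) if best_template[j] == i]


# Toy descriptors standing in for DINO patch features.
template = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scene = [[0.9, 0.1], [0.1, 0.9]]
matches = mutual_matches(template, scene)
```

The third template descriptor is dropped because neither scene patch picks it as its best match, which is the kind of one-sided pairing a mutual check is meant to filter out.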
CAPTAIN: Semantic Feature Injection for Memorization Mitigation in Text-to-Image Diffusion Models
Authors: Tong Zhang, Carlos Hinojosa, Bernard Ghanem
First: 2025-12-11T14:01:47+00:00 · Latest: 2025-12-11T14:01:47+00:00
Abstract
Diffusion models can unintentionally reproduce training examples, raising privacy and copyright concerns as these systems are increasingly deployed at scale. Existing inference-time mitigation methods typically manipulate classifier-free guidance (CFG) or perturb prompt embeddings; however, they often struggle to reduce memorization without compromising alignment with the conditioning prompt. We introduce CAPTAIN, a training-free framework that mitigates memorization by directly modifying latent features during denoising. CAPTAIN first applies frequency-based noise initialization to reduce the tendency to replicate memorized patterns early in the denoising process. It then identifies the optimal denoising timesteps for feature injection and localizes memorized regions. Finally, CAPTAIN injects semantically aligned features from non-memorized reference images into localized latent regions, suppressing memorization while preserving prompt fidelity and visual quality. Our experiments show that CAPTAIN achieves substantial reductions in memorization compared to CFG-based baselines while maintaining strong alignment with the intended prompt.
中文标题/摘要
标题:CAPTAIN:文本到图像扩散模型中记忆减轻的语义特征注入
扩散模型可能会无意中再现训练示例,这引发了隐私和版权方面的担忧,因为这些系统正越来越多地大规模部署。现有的推理时减轻方法通常会操控无分类引导(CFG)或扰动提示嵌入;然而,它们往往难以在不损害与条件提示的对齐的情况下减少记忆。我们引入了CAPTAIN,这是一种无需训练的框架,通过在去噪过程中直接修改潜在特征来减轻记忆。CAPTAIN 首先应用基于频率的噪声初始化以减少在去噪早期再现记忆模式的倾向。然后,它确定特征注入的最佳去噪时间步,并定位记忆区域。最后,CAPTAIN 将非记忆参考图像中的语义对齐特征注入局部化潜在区域,抑制记忆同时保持提示保真度和视觉质量。我们的实验表明,与基于CFG的基线相比,CAPTAIN 在减轻记忆方面取得了显著的改进,同时保持了与预期提示的强大对齐。
Summary / 总结
CAPTAIN is a training-free framework that mitigates memorization in text-to-image diffusion models by modifying latent features during denoising. It applies frequency-based noise initialization, identifies the optimal denoising timesteps for feature injection, localizes memorized regions, and injects semantically aligned features from non-memorized reference images into those regions. Experiments show substantial reductions in memorization compared to CFG-based baselines while maintaining strong alignment with the intended prompt.
研究旨在通过减轻文本到图像扩散模型中的记忆现象来解决隐私和版权问题。CAPTAIN 是一个无需训练的框架,通过在去噪过程中修改潜在特征来抑制记忆。它使用基于频率的噪声初始化,并确定最佳去噪时间步长来注入来自非记忆图像的语义对齐特征,从而减少记忆现象同时保持提示对齐和视觉质量。实验表明,CAPTAIN 相比基于CFG的方法显著减少了记忆现象,同时保持了提示对齐和视觉质量。
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Authors: Haoning Wu, Xiao Huang, Yaohui Chen, Ya Zhang, Yanfeng Wang, Weidi Xie
First: 2025-05-22T17:59:03+00:00 · Latest: 2025-12-11T13:21:59+00:00
Comments: Technical Report; Project Page: https://haoningwu3639.github.io/SpatialScore
Abstract
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we aim to conduct a holistic assessment of the spatial understanding capabilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Specifically, we make the following contributions: (i) we introduce SpatialScore, to our knowledge, the most comprehensive and diverse benchmark for multimodal spatial intelligence to date. It covers multiple visual data types, input modalities, and question-answering formats, and contains approximately 5K manually verified samples spanning 30 distinct tasks; (ii) using SpatialScore, we extensively evaluate 40 representative MLLMs, revealing persistent challenges and a substantial gap between current models and human-level spatial intelligence; (iii) to advance model capabilities, we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples that supports fine-tuning on spatial reasoning tasks and significantly improves the performance of existing models (e.g., Qwen3-VL); (iv) to complement this data-driven route with a training-free paradigm, we develop SpatialAgent, a multi-agent system equipped with 12 specialized spatial perception tools that supports both Plan-Execute and ReAct reasoning, enabling substantial gains in spatial reasoning without additional model training. Extensive experiments and in-depth analyses demonstrate the effectiveness of our benchmark, corpus, and agent framework. We expect these resources to serve as a solid foundation for advancing MLLMs toward human-level spatial intelligence. All data, code, and models will be released to the research community.
中文标题/摘要
标题:SpatialScore:迈向空间智能的全面评估
现有的多模态大型语言模型(MLLMs)在空间智能方面的评估通常是碎片化的且范围有限。本文旨在对现代MLLM的空间理解能力进行全面评估,并提出数据驱动和基于代理的互补解决方案。具体来说,我们做出了以下贡献:(i) 我们引入了SpatialScore,据我们所知,这是迄今为止最全面和多样的多模态空间智能基准,涵盖了多种视觉数据类型、输入模态和问答格式,包含约5000个手动验证的样本,覆盖30个不同的任务;(ii) 使用SpatialScore,我们对40个代表性MLLM进行了广泛评估,揭示了当前模型与人类水平的空间智能之间持续存在的挑战和显著差距;(iii) 为了提高模型能力,我们构建了SpatialCorpus,这是一个包含33.1万个问答样本的大规模训练资源,支持空间推理任务的微调,并显著提高了现有模型的性能(例如Qwen3-VL);(iv) 为了补充数据驱动的方法,我们开发了SpatialAgent,这是一个配备有12种专门的空间感知工具的多代理系统,支持计划-执行和ReAct推理,能够在不进行额外模型训练的情况下显著提高空间推理能力。广泛的实验和深入的分析证明了我们基准、语料库和代理框架的有效性。我们期望这些资源能够为MLLMs向人类水平的空间智能发展奠定坚实的基础。所有数据、代码和模型将向研究社区公开。
Summary / 总结
This work comprehensively evaluates the spatial understanding capabilities of modern multimodal large language models (MLLMs) by introducing SpatialScore, a benchmark that covers multiple visual data types, input modalities, and question-answering formats, with approximately 5K manually verified samples spanning 30 tasks. Evaluating 40 representative MLLMs on SpatialScore reveals a substantial gap between current models and human-level spatial intelligence. To close it, the authors pursue two complementary routes: SpatialCorpus, a training resource with 331K multimodal QA samples for fine-tuning on spatial reasoning tasks, and SpatialAgent, a multi-agent system with 12 specialized spatial perception tools that improves spatial reasoning without additional model training. Experiments demonstrate the effectiveness of the benchmark, corpus, and agent framework.
论文提出了SpatialScore,这是一个全面的基准,用于评估多模态大型语言模型(MLLMs)在空间智能方面的表现,涵盖了多种视觉数据类型和问题格式。它评估了40个MLLMs,揭示了与人类水平的空间智能之间存在显著差距。作者还构建了用于微调的大规模训练资源SpatialCorpus,并开发了无需额外训练即可提升空间推理能力的多智能体系统SpatialAgent。全面的实验验证了这些贡献的有效性,旨在推动MLLMs向人类水平的空间智能发展。
DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
Authors: Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen, Junbo Niu, Yuan Qu, Bin Wang, Ka-Ho Chow, Conghui He, Wentao Zhang
First: 2025-12-11T13:16:33+00:00 · Latest: 2025-12-11T13:16:33+00:00
Abstract
Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.
中文标题/摘要
标题:DOCR-Inspector:使用VLM进行细粒度和自动化文档解析评估
文档解析旨在将无结构的PDF图像转换为半结构化数据,促进信息在各个领域的数字化和利用。尽管视觉语言模型(VLMs)显著推进了这一任务,但在实际场景中实现可靠且高质量的解析仍然具有挑战性。通常的做法是在标准基准上选择表现最佳的模型,但这些基准可能带有数据集特定的偏差,导致模型排名不一致且与实际性能的相关性有限。此外,基准指标通常仅提供总体评分,这可能会掩盖输出中的不同错误模式。这提出了一个关键挑战:我们如何在实际场景中可靠且全面地评估文档解析质量?我们通过DOCR-Inspector解决了这一问题,它将文档解析评估形式化为细粒度的错误检测和分析。利用VLM-as-a-Judge,DOCR-Inspector分析文档图像及其解析输出,识别所有错误,将其分配到28种预定义类型之一,并生成全面的质量评估。为了实现这一能力,我们构建了DOCRcase-200K用于训练,并提出了检查清单推理范式,以实现解析质量评估的层次结构。为了实证验证,我们引入了DOCRcaseBench,这是一个包含882个真实世界文档解析案例的集合,并附有手动注释。在这一基准上,DOCR-Inspector-7B优于商业模型如Gemini 2.5 Pro,以及领先开源模型。进一步的实验表明,其质量评估为解析结果的改进提供了有价值的指导,使DOCR-Inspector既是实用的评估工具,也是推动大规模文档解析系统发展的驱动力。模型和代码可在:https://github.com/ZZZZZQT/DOCR-Inspector/ 获取。
Summary / 总结
DOCR-Inspector addresses the challenge of reliably evaluating document parsing quality in real-world scenarios by formalizing the assessment as fine-grained error detection and analysis. Using VLM-as-a-Judge, it analyzes a document image and its parsed output, assigns each detected error to one of 28 predefined types, and produces a comprehensive quality assessment. It is trained on DOCRcase-200K with the Chain-of-Checklist reasoning paradigm. On DOCRcaseBench, a set of 882 manually annotated real-world parsing cases, DOCR-Inspector-7B outperforms commercial models such as Gemini 2.5 Pro as well as leading open-source models, and its assessments offer useful guidance for refining parsing results.
DOCR-Inspector通过正式化错误检测和分析来可靠地评估文档解析质量,特别是在现实场景中。它使用VLM-as-a-Judge来识别并分类错误到28种预定义类型,提供全面的质量评估。在DOCRcaseBench上的实证验证表明,DOCR-Inspector在性能上超越了商业和领先开源模型,并为解析结果的改进提供了有价值的指导。
Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
Authors: J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira, M. Prasad
First: 2025-12-11T12:43:41+00:00 · Latest: 2025-12-11T12:43:41+00:00
Comments: 6 pages, 1 figure
Abstract
Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
中文标题/摘要
标题:超越像素:一种无需训练的文本到文本框架用于遥感图像检索
遥感(RS)图像的语义检索是一项关键任务,从根本上受到‘语义鸿沟’的挑战,即模型的低级视觉特征与高级人类概念之间的差异。虽然大型视觉语言模型(VLMs)为弥合这一差距提供了有希望的途径,但现有方法通常依赖于昂贵的、特定领域的训练,而且缺乏基准来评估VLM生成文本在零样本检索中的实用价值。为解决这一研究缺口,我们引入了遥感丰富文本(RSRT)数据集,这是一个新的基准,每个图像包含多个结构化的描述。基于此数据集,我们提出了一种完全无需训练的文本检索参考方法TRSLLaVA。我们的方法将跨模态检索重新表述为文本到文本(T2T)匹配问题,利用丰富的文本描述作为查询,与VLM生成的描述在统一的文本嵌入空间中进行数据库匹配。这种方法完全绕过了模型的训练或微调。在RSITMD和RSICD基准上的实验表明,我们的无需训练方法在与最先进的监督模型的竞争中表现非常出色。例如,在RSITMD上,我们的方法达到了42.62%的平均召回率,几乎是标准零样本CLIP基线23.86%的两倍,并且超过了几个顶级的监督模型。这验证了通过结构化文本获得高质量的语义表示为遥感图像检索提供了一种强大且成本效益高的范式。
Summary / 总结
The research addresses the semantic gap in remote sensing image retrieval by using large Vision-Language Models (VLMs) without domain-specific training. The authors introduce the RSRT dataset, a benchmark with multiple structured captions per image, and propose TRSLLaVA, a training-free, text-only reference method that recasts cross-modal retrieval as text-to-text matching against VLM-generated captions. On RSITMD, TRSLLaVA achieves a mean Recall of 42.62%, nearly double the 23.86% of the zero-shot CLIP baseline (about 1.79 times), and it is competitive with state-of-the-art supervised models, surpassing several of them.
研究通过引入RSRT数据集并提出无需训练的文本到文本框架TRSLLaVA,解决了遥感图像检索中的语义鸿沟问题。该方法将跨模态检索重新定义为文本到文本匹配问题,使用丰富的文本描述作为查询,与VLM生成的描述数据库进行匹配。实验结果显示,TRSLLaVA在RSITMD上的平均召回率为42.62%,接近零样本CLIP基线(23.86%)的两倍,并超过了多个顶级监督模型,证明了使用结构化文本进行语义表示在遥感图像检索中的强大和经济性。
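The text-to-text matching idea and the Recall@K metric can be sketched as below. This is a toy illustration under stated assumptions: the bag-of-words `embed` function stands in for a real text encoder, the captions and queries are invented, and treating the reported mean Recall as an average of Recall@K values is an assumption, since the abstract does not define it.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts (a real system would use a text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recall_at_k(queries, captions, gold, k):
    """Fraction of queries whose gold image index is among the top-k ranked captions."""
    cap_vecs = [embed(c) for c in captions]
    hits = 0
    for query, target in zip(queries, gold):
        qv = embed(query)
        ranked = sorted(range(len(captions)), key=lambda i: cosine(qv, cap_vecs[i]), reverse=True)
        hits += target in ranked[:k]
    return hits / len(queries)


# Hypothetical database: one generated caption per image.
captions = [
    "an airport with several planes parked near a terminal",
    "a river running through dense green forest",
    "a residential area with many houses and roads",
]
queries = [
    "planes parked at an airport terminal",
    "houses and roads in a residential area",
    "green forest beside a river",
]
gold = [0, 2, 1]  # index of the correct image for each query

r_at_1 = recall_at_k(queries, captions, gold, k=1)
mean_recall = sum(recall_at_k(queries, captions, gold, k) for k in (1, 2, 3)) / 3
```

Because both the query and the database live in the same text space, no image encoder is involved at retrieval time; the quality of the captions carries the visual information.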
Lightweight Model Attribution and Detection of Synthetic Speech via Audio Residual Fingerprints
Authors: Matías Pizarro, Mike Laszkiewicz, Dorothea Kolossa, Asja Fischer
First: 2024-11-21T10:55:49+00:00 · Latest: 2025-12-11T12:41:32+00:00
Abstract
As speech generation technologies advance, so do risks of impersonation, misinformation, and spoofing. We present a lightweight, training-free approach for detecting synthetic speech and attributing it to its source model. Our method addresses three tasks: (1) single-model attribution in an open-world setting, (2) multi-model attribution in a closed-world setting, and (3) real vs. synthetic speech classification. The core idea is simple: we compute standardized average residuals--the difference between an audio signal and its filtered version--to extract model-agnostic fingerprints that capture synthesis artifacts. Experiments across multiple synthesis systems and languages show AUROC scores above 99%, with strong reliability even when only a subset of model outputs is available. The method maintains high performance under common audio distortions, including echo and moderate background noise, while data augmentation can improve results in more challenging conditions. In addition, out-of-domain detection is performed using Mahalanobis distances to in-domain residual fingerprints, achieving an F1 score of 0.91 on unseen models, reinforcing the method's efficiency, generalizability, and suitability for digital forensics and security applications.
中文标题/摘要
标题:轻量级模型归因及合成语音检测方法:基于音频残差指纹
随着语音生成技术的进步,冒充、虚假信息和欺骗的风险也在增加。我们提出了一种无需训练的轻量级方法,用于检测合成语音并将其归因于其来源模型。该方法解决了三个任务:(1)开放世界中的单模型归因,(2)封闭世界中的多模型归因,以及(3)真实语音与合成语音的分类。核心思想很简单:我们计算标准化平均残差——音频信号与其滤波版本之间的差异——以提取模型无关的指纹,捕捉合成伪影。在多个合成系统和语言的实验中,AUROC分数超过99%,即使只有模型输出的一部分可用时,可靠性也很强。该方法在常见的音频失真(包括回声和中等背景噪声)下保持高性能,而数据增强可以在更具挑战性的条件下提高结果。此外,通过计算与域内残差指纹之间的马哈拉诺比斯距离进行域外检测,在未见过的模型上F1分数达到0.91,进一步证明了该方法的高效性、泛化能力以及对数字取证和安全应用的适用性。
Summary / 总结
This paper introduces a lightweight, training-free approach for detecting synthetic speech and attributing it to its source model. The method computes standardized average residuals to extract model-agnostic fingerprints, which are used for three tasks: single-model attribution in an open-world setting, multi-model attribution in a closed-world setting, and real vs. synthetic speech classification. Experiments show AUROC scores above 99% across multiple synthesis systems and languages, with strong reliability even when only a subset of model outputs is available. The method performs well under common audio distortions and achieves an F1 score of 0.91 on out-of-domain detection, making it suitable for digital forensics and security applications.
本文介绍了一种轻量级、无需训练的方法,用于检测合成语音并将其归因于其源模型。该方法计算标准化平均残差以提取模型无关的指纹,用于三个任务:开放世界中的单模型归因、封闭世界中的多模型归因以及真实与合成语音分类。实验结果显示,在多个合成系统和语言上AUROC分数超过99%,即使只有模型输出的一部分可用时,也具有很强的可靠性。该方法在常见音频失真下表现良好,并在未见过的模型上实现了0.91的F1分数,使其适用于数字取证和安全应用。
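A rough sketch of the residual-fingerprint idea follows. It is not the authors' pipeline: the moving-average filter, the time-domain averaging over equal-length signals, and the diagonal-covariance simplification of the Mahalanobis distance are all assumptions chosen to keep the example short, and every function name is hypothetical.

```python
import math


def moving_average(signal, width=3):
    """Simple smoothing filter standing in for the unspecified filter."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def residual(signal, width=3):
    """Difference between a signal and its filtered version."""
    return [x - y for x, y in zip(signal, moving_average(signal, width))]


def standardize(values):
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sd if sd else 0.0 for v in values]


def fingerprint(signals, width=3):
    """Element-wise average of standardized residuals over equal-length signals."""
    rows = [standardize(residual(s, width)) for s in signals]
    return [sum(col) / len(rows) for col in zip(*rows)]


def diagonal_mahalanobis(x, mean_vec, var_vec, eps=1e-8):
    """Mahalanobis distance under the simplifying assumption of a diagonal covariance."""
    return math.sqrt(sum((a - m) ** 2 / (v + eps) for a, m, v in zip(x, mean_vec, var_vec)))


# Two toy "recordings" that differ only in loudness.
quiet = [0.0, 1.0, 0.0, 1.0]
loud = [0.0, 2.0, 0.0, 2.0]
fp = fingerprint([quiet, loud])
```

Standardizing each residual before averaging makes the toy fingerprint insensitive to overall loudness, so the two recordings above contribute the same pattern.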
Exploring Automated Recognition of Instructional Activity and Discourse from Multimodal Classroom Data
Authors: Ivo Bueno, Ruikun Hou, Babette Bühler, Tim Fütterer, James Drimalla, Jonathan Kyle Foster, Peter Youngs, Peter Gerjets, Ulrich Trautwein, Enkelejda Kasneci
Venue: WACV
First: 2025-11-26T11:57:22+00:00 · Latest: 2025-12-11T11:15:19+00:00
Comments: This article has been accepted for publication in the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract
Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
中文标题/摘要
标题:探索多模态课堂数据中教学活动和话语的自动化识别
课堂互动的观察可以为教师提供具体的反馈,但当前方法依赖于手动标注,这既耗资源又难以扩展。本研究探索了基于人工智能的课堂录像分析方法,重点关注多模态教学活动和话语识别,作为可操作反馈的基础。使用包含164小时视频和68份课堂转录文本的密集标注数据集,我们设计了并行的、模态特定的流水线。对于视频,我们评估了零样本多模态LLM、微调的视觉-语言模型以及自监督视频变换器在24个活动标签上的表现。对于转录文本,我们微调了一个基于变换器的分类器,使用上下文化输入,并在19个话语标签上将其与基于提示的LLM进行比较。为了处理类别不平衡和多标签复杂性,我们应用了逐标签阈值、上下文窗口和不平衡感知损失函数。结果表明,微调模型始终优于基于提示的方法,视频的宏F1得分为0.577,转录文本的得分为0.460。这些结果证明了自动化课堂分析的可行性,并为可扩展的教师反馈系统奠定了基础。
Summary / 总结
This study aims to automate the recognition of instructional activities and discourse in classrooms to provide teachers with actionable feedback. Using a dataset of 164 hours of video and 68 lesson transcripts, the researchers developed modality-specific pipelines and evaluated various models. Fine-tuned models outperformed prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts, demonstrating the feasibility of automated classroom analysis and scalable teacher feedback systems.
该研究旨在通过自动化识别课堂中的教学活动和话语,为教师提供可操作的反馈。利用包含164小时视频和68份课堂转录文本的数据集,研究人员开发了模态特定的流水线并评估了多种模型。微调模型在视频和转录文本上的表现优于基于提示的方法,分别实现了0.577和0.460的宏F1分数,展示了自动化课堂分析的可行性,并为可扩展的教师反馈系统奠定了基础。
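Per-label thresholding and macro-F1, both mentioned above, can be sketched in a few lines. This is a generic illustration rather than the paper's code: the labels, scores, and candidate threshold grid are hypothetical, and in practice the thresholds would be tuned on a validation split rather than on the data being scored.

```python
def f1(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the decision threshold that maximizes F1 for a single label."""
    return max(candidates, key=lambda th: f1(y_true, [s >= th for s in scores]))


def macro_f1(per_label_true, per_label_scores, thresholds):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    scores = [
        f1(truth, [s >= th for s in sc])
        for truth, sc, th in zip(per_label_true, per_label_scores, thresholds)
    ]
    return sum(scores) / len(scores)


# Two hypothetical labels with different score calibration.
truth = [[1, 0, 1, 0], [0, 0, 1, 1]]
scores = [[0.9, 0.2, 0.35, 0.1], [0.6, 0.1, 0.7, 0.8]]

tuned = [best_threshold(t, s) for t, s in zip(truth, scores)]
macro_tuned = macro_f1(truth, scores, tuned)
macro_fixed = macro_f1(truth, scores, [0.5, 0.5])
```

In this toy case a single 0.5 cutoff misses a positive for the first label and admits a false positive for the second, while label-specific thresholds fix both.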
Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation
Authors: Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang
First: 2025-12-11T10:22:02+00:00 · Latest: 2025-12-11T10:22:02+00:00
Abstract
Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.
中文标题/摘要
标题:基于LLM智能体的零样本3D地图生成:面向程序化内容生成的双智能体架构
程序化内容生成(PCG)提供了通过算法创建复杂可定制世界的可扩展方法。然而,控制这些管道需要精确配置不透明的技术参数。我们提出了一种无需训练的架构,利用LLM代理程序进行零样本PCG参数配置。虽然大型语言模型(LLMs)承诺为PCG工具提供自然语言界面,但现成的模型往往无法弥合抽象用户指令与严格参数规范之间的语义差距。我们的系统将一个执行者代理与一个评论者代理配对,使系统能够自主推理工具参数并逐步优化配置以与人类设计偏好对齐。我们在生成各种3D地图的实验中验证了这种方法,建立了PCG指令遵循的新基准。实验表明,我们的方法优于单代理基线,能够从自然语言描述中生成多样且结构有效的环境。这些结果表明,现成的LLM可以有效重新利用为任意PCG工具的一般代理程序。通过将负担从模型训练转移到架构推理,我们的方法提供了一种无需特定任务微调的复杂软件掌握的可扩展框架。
Summary / 总结
The research addresses the difficulty of controlling procedural content generation (PCG) pipelines, which require precise configuration of opaque technical parameters. It proposes a training-free dual-agent architecture that pairs an Actor agent with a Critic agent in an iterative workflow: the system reasons over tool parameters and refines configurations until they align with human design preferences. On 3D map generation, the approach outperforms single-agent baselines and produces diverse, structurally valid environments from natural language descriptions, suggesting that off-the-shelf LLMs can serve as generalized agents for PCG tools.
研究旨在通过一种使用LLM代理的无需训练的架构,解决程序化内容生成(PCG)管道的控制问题。方法是将一个Actor代理与一个Critic代理配对,以迭代地推理工具参数并根据人类设计偏好进行优化。关键实验结果表明,这种方法在根据自然语言描述生成多样且结构有效的3D地图方面优于单代理基线,从而展示了现成的LLM在PCG工具中的应用潜力。
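The iterative Actor and Critic workflow can be pictured with the loop below. Everything in it is a stand-in: the two functions are deterministic stubs rather than LLM calls, and the `tree_density` parameter, its percentage scale, and the feedback strings are hypothetical names invented for the illustration.

```python
def actor(instruction, params, feedback):
    """Stub Actor: proposes a new configuration in response to the Critic's feedback."""
    new_params = dict(params)
    if feedback == "too sparse":
        new_params["tree_density"] = min(100, params["tree_density"] + 20)
    elif feedback == "too dense":
        new_params["tree_density"] = max(0, params["tree_density"] - 20)
    return new_params


def critic(instruction, params):
    """Stub Critic: checks the configuration against a target implied by the instruction."""
    target = 80 if "dense forest" in instruction else 20
    if params["tree_density"] < target - 10:
        return "too sparse"
    if params["tree_density"] > target + 10:
        return "too dense"
    return "ok"


def configure(instruction, params, max_rounds=5):
    """Alternate Actor proposals and Critic reviews until accepted or out of rounds."""
    feedback = None
    for _ in range(max_rounds):
        params = actor(instruction, params, feedback)
        feedback = critic(instruction, params)
        if feedback == "ok":
            break
    return params, feedback


final_params, verdict = configure("a dense forest map", {"tree_density": 10})
```

The point of the split is that the proposing component never has to judge its own output; a bounded number of rounds keeps the loop from running indefinitely when the two never agree.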
Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning
Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan
Venue: NeurIPS 2025
First: 2025-09-22T09:16:34+00:00 · Latest: 2025-12-11T09:40:22+00:00
Comments: NeurIPS 2025
Abstract
The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.
中文标题/摘要
标题:大型语言模型能否在无需训练的情况下推理非文本模态?一种基于上下文表示学习的案例研究
大型语言模型(LLMs)的出色表现可以通过测试时计算得到增强,这依赖于外部工具甚至其他深度学习模型。然而,将非文本模态表示集成到LLMs中的现有方法通常需要额外的昂贵的监督训练,限制了对新领域和模态的即时适应。在本文中,我们探讨了在无需训练的情况下将非文本基础模型(FMs)的表示集成到基于文本的LLMs中的可行性。我们提出了一种基于上下文表示学习(ICRL)的概念,以允许LLMs通过少样本学习适应性地利用非文本模态表示。与传统的基于上下文学习不同,ICRL用FM表示替换文本输入,使LLM能够在无需微调的情况下进行多模态推理。我们在分子领域的多个任务上评估了ICRL,探讨了三个核心研究问题:(i)如何在无需训练的情况下将FM表示映射到LLMs中,(ii)哪些因素影响ICRL的性能,(iii)ICRL有效性的机制是什么。据我们所知,ICRL是第一个在无需训练的情况下将非文本模态表示集成到基于文本的LLMs中的框架,为适应性、多模态泛化的研究提供了有希望的方向。
Summary / 总结
This work investigates whether representations from non-text foundation models (FMs) can be integrated into text-based Large Language Models (LLMs) without additional training. The proposed In-Context Representation Learning (ICRL) replaces the text inputs of traditional in-context learning with FM representations, so the LLM can use non-text modality representations through few-shot learning and without fine-tuning. The study evaluates ICRL on molecular-domain tasks and examines how to map FM representations into LLMs, which factors influence performance, and which mechanisms underlie its effectiveness. The authors describe ICRL as, to their knowledge, the first training-free framework of this kind.
这项工作探讨了在无需额外训练的情况下将非文本模态表示集成到基于文本的大语言模型(LLMs)中的可行性。提出的In-Context Representation Learning(ICRL)框架允许LLMs通过少样本学习适应性地利用非文本模态表示。研究在分子领域任务上评估了ICRL,探讨了如何将FM表示映射到LLMs中、影响ICRL性能的因素以及其有效性的内在机制。ICRL是首个无需训练的框架,展示了适应性多模态泛化的潜力。
Test-Time Distillation for Continual Model Adaptation
Authors: Xiao Chen, Jiazhen Huang, Zhiming Liu, Qinting Jiang, Fanding Huang, Jingyan Jiang, Zhi Wang
First: 2025-06-03T09:16:51+00:00 · Latest: 2025-12-11T09:34:01+00:00
Comments: 11 pages, 6 figures
Abstract
Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner, yet existing methods, which rely on self-supervision, are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts, and the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls motivate our insight: the key is to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. It then applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% while using only 48% of its time cost on ImageNet-C.
中文标题/摘要
标题:测试时蒸馏以实现持续模型适应
深度神经网络在部署时由于分布偏移往往会遭受性能下降。持续测试时适应(CTTA)旨在以无监督的方式解决这一问题,但现有方法依赖于自我监督,容易产生一种固有的自我参照反馈循环,这会放大初始预测错误,导致模型漂移。我们重新审视了这一局限性,并提出了测试时蒸馏(TTD),将适应重新定义为由冻结的视觉-语言模型(VLM)作为外部信号引导的蒸馏过程。虽然前景广阔,但我们发现直接蒸馏存在两个陷阱:通才陷阱,其中VLM的广泛但非专门化的知识导致特定任务和偏移上的次优性能;以及熵偏差,其中基于熵的简单模型融合技术由于异构模型的不一致校准而失效。这些陷阱促使我们得出一个见解:关键在于构建一个稳健的监督信号,并利用它来引导目标模型实现稳定的适应。因此,我们提出了CoDiRe,一种持续蒸馏和校正框架以实现TTD。CoDiRe首先通过动态融合VLM和目标模型的预测来构建一个稳健的混合教师。关键的是,它通过利用最大softmax概率(MSP)作为更可靠的置信度度量来规避熵偏差,从而为每个模型的专业知识分配权重。然后应用基于最优传输的校正,进一步使预测与混合教师对齐,从而实现持续和稳定的适应。大量实验表明,CoDiRe优于最先进的基线,在ImageNet-C上性能比CoTTA高出10.55%,同时仅使用其48%的时间成本。
Summary / 总结
This paper addresses the performance degradation of deep neural networks due to distribution shifts after deployment. It proposes Test-Time Distillation (TTD) as a method to adapt models in an unsupervised manner, using a frozen Vision-Language Model (VLM) as a supervisory signal. To overcome the Generalist Trap and Entropy Bias, the authors introduce CoDiRe, a framework that dynamically fuses the VLM and target model predictions and uses Maximum Softmax Probability (MSP) for weighting. CoDiRe also applies Optimal Transport-based rectification to align predictions with the blended teacher, achieving superior performance compared to existing methods on ImageNet-C with less computational cost.
论文提出了一种持续测试时蒸馏(TTD)方法,以解决由于分布变化导致的深度神经网络性能下降问题。TTD将适应过程重新定义为由冻结的视觉-语言模型(VLM)引导的蒸馏过程,但面临泛化陷阱和熵偏差等挑战。为克服这些挑战,作者引入了CoDiRe框架,该框架构建了一个稳健的混合教师,并使用最大softmax概率(MSP)进行权重分配,随后采用最优传输基的校正,以进一步对齐预测结果。实验表明,CoDiRe在ImageNet-C上的性能优于现有方法,比CoTTA高出10.55%,且仅使用其48%的时间成本。
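The blended-teacher step described in the CoDiRe entry above can be illustrated with a small sketch. The abstract only says that Maximum Softmax Probability (MSP) is used as the confidence metric for weighting each model, so the normalised-MSP weighting rule below, along with the names `softmax` and `blend_by_msp`, is an assumption made for illustration and not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def blend_by_msp(vlm_logits, target_logits):
    """Fuse two predictions, weighting each by its Maximum Softmax Probability.

    Assumption: weights are the two MSP values normalised to sum to one.
    """
    p_vlm = softmax(vlm_logits)
    p_tgt = softmax(target_logits)
    msp_vlm, msp_tgt = max(p_vlm), max(p_tgt)
    w_vlm = msp_vlm / (msp_vlm + msp_tgt)
    return [w_vlm * a + (1.0 - w_vlm) * b for a, b in zip(p_vlm, p_tgt)]

# The VLM is confident about class 0, the target model is unsure.
blended = blend_by_msp([4.0, 0.0, 0.0], [0.0, 0.5, 0.0])
```

Because the weights sum to one, the blended vector is still a probability distribution, and the more confident model dominates the teacher signal. The Optimal Transport rectification step is not covered by this sketch.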
SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation
Authors: Yuyang Dong, Nobuhiro Ueda, Krisztián Boros, Daiki Ito, Takuya Sera, Masafumi Oyamada
First: 2025-05-20T14:03:24+00:00 · Latest: 2025-12-11T08:51:09+00:00
Abstract
With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. We trained the SCAN model by fine-tuning object detection models on an annotated dataset. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and even commercial document processing solutions.
中文标题/摘要
标题:SCAN: 语义文档布局分析以增强文本和视觉检索生成
随着大型语言模型(LLMs)和视觉语言模型(VLMs)的广泛应用,用于检索增强生成(RAG)和视觉RAG等应用的丰富文档分析技术正获得广泛关注。近期研究表明,使用VLMs可以提高RAG性能,但处理丰富文档仍是一个挑战,因为单页包含大量信息。本文提出了一种名为SCAN(语义文档布局分析)的新方法,该方法增强了处理丰富视觉文档的文本和视觉RAG系统。这是一种VLM友好的方法,能够以适当的语义粒度识别文档组件,平衡上下文保留与处理效率。SCAN采用粗粒度语义方法,将文档划分为包含连续组件的连贯区域。我们通过在标注数据集上微调对象检测模型来训练SCAN模型。我们的实验结果表明,SCAN在英语和日语数据集上的端到端文本RAG性能提高了9.4个点,视觉RAG性能提高了10.4个点,优于传统方法,甚至超越了商用文档处理解决方案。
Summary / 总结
SCAN is a novel approach that enhances Retrieval-Augmented Generation (RAG) systems for visually rich documents by identifying document components with appropriate semantic granularity. It uses a coarse-grained semantic approach to divide documents into coherent regions, improving both textual and visual RAG performance by up to 9.4 and 10.4 points respectively, outperforming conventional methods and commercial solutions.
SCAN 是一种新颖的方法,通过识别具有适当语义粒度的文档组件来增强视觉丰富文档的 Retrieval-Augmented Generation (RAG) 系统。它使用粗粒度的语义方法将文档划分为一致的区域,分别将文本和视觉 RAG 的性能提高最多 9.4 和 10.4 个百分点,优于传统方法和商业文档处理解决方案。
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Authors: Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li
First: 2025-09-26T12:20:01+00:00 · Latest: 2025-12-11T08:31:33+00:00
Comments: 23 pages, 12 figures
Abstract
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
中文标题/摘要
标题:超越分类准确率:Neural-MedBench和深入推理基准的需求
近期视觉-语言模型(VLMs)在标准医学基准测试上取得了显著的性能,但其真正的临床推理能力仍然不清楚。现有数据集主要强调分类准确率,导致一种评估错觉,即模型看似熟练但实际上在高风险诊断推理方面仍然失败。我们引入了Neural-MedBench,这是一个紧凑但推理密集的基准,专门设计用于探索神经学多模态临床推理的极限。Neural-MedBench 结合了多序列MRI扫描、结构化的电子健康记录和临床笔记,并涵盖了三个核心任务家族:鉴别诊断、病灶识别和理由生成。为了确保可靠的评估,我们开发了一种混合评分管道,结合了基于LLM的评分、临床验证和语义相似度度量。通过系统评估最先进的VLMs,包括GPT-4o、Claude-4和MedGemma,我们观察到与传统数据集相比,性能急剧下降。错误分析表明,推理失败而不是感知错误是模型缺陷的主要原因。我们的研究结果强调了双重评估框架的必要性:广度导向的大数据集用于统计泛化,以及深度导向、紧凑的基准如Neural-MedBench用于推理准确性。我们通过https://neuromedbench.github.io/发布Neural-MedBench,作为开放和可扩展的诊断测试平台,指导未来基准的扩展,并使临床可信的AI的严格但成本效益高的评估成为可能。
Summary / 总结
The paper introduces Neural-MedBench, a benchmark designed to evaluate the reasoning abilities of vision-language models in medical contexts, particularly in neurology. It integrates MRI scans, electronic health records, and clinical notes to assess differential diagnosis, lesion recognition, and rationale generation. The study finds that state-of-the-art models perform poorly on this benchmark compared to standard datasets, indicating a need for deeper reasoning benchmarks. The authors propose a Two-Axis Evaluation Framework to guide the development of future benchmarks for clinical AI assessment.
论文介绍了Neural-MedBench,这是一个用于评估视觉-语言模型在医学领域(尤其是神经学)推理能力的基准。该基准结合了MRI扫描、电子健康记录和临床笔记,用于评估鉴别诊断、病灶识别和推理生成。研究发现,最先进的模型在这一基准上的表现远不如标准数据集,表明需要更深入的推理基准。作者提出了一个两轴评估框架,以指导未来基准的发展,用于临床AI的严格评估。
Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention
Authors: Yang Yu, Zhuangzhuang Chen, Siqi Wang, Lanqing Li, Xiaomeng Li
First: 2025-12-11T08:27:02+00:00 · Latest: 2025-12-11T08:27:02+00:00
Abstract
Recently, reinforcement learning (RL) has become a common choice for enhancing the reasoning capabilities of vision-language models (VLMs). Among existing RL-based finetuning methods, entropy intervention has proven to be an effective way to improve exploratory ability and thereby policy performance. Notably, most existing studies intervene in entropy by simply controlling the update of specific tokens during RL policy optimization. They overlook entropy intervention during RL sampling, which can boost the performance of GRPO by improving the diversity of responses. In this paper, we propose Selective-adversarial Entropy Intervention, namely SaEI, which enhances policy entropy by distorting the visual input with a token-selective adversarial objective derived from the entropy of sampled responses. Specifically, we first propose entropy-guided adversarial sampling (EgAS), which formulates the entropy of sampled responses as an adversarial objective. The corresponding adversarial gradient is then used to attack the visual input and produce adversarial samples, allowing the policy model to explore a larger answer space during RL sampling. We further propose token-selective entropy computation (TsEC) to maximize the effectiveness of the adversarial attack in EgAS without distorting factual knowledge within VLMs. Extensive experiments on both in-domain and out-of-domain datasets show that our proposed method can greatly improve policy exploration via entropy intervention, thereby boosting reasoning capabilities. Code will be released once the paper is accepted.
中文标题/摘要
标题:利用选择性对抗熵干预提升基于RL的视觉推理
近年来,强化学习(RL)已成为增强视觉语言模型(VLMs)推理能力的常见选择。考虑到现有的基于RL的微调方法,熵干预被证明是提高探索能力的有效方式,从而改善策略性能。值得注意的是,大多数现有研究通过简单地控制RL策略优化过程中特定标记的更新来进行熵干预,而忽略了在RL采样过程中进行熵干预可以提升GRPO性能,通过提高响应的多样性。在本文中,我们提出了选择性对抗熵干预(简称SaEI),通过使用从采样响应的熵中来的标记选择性对抗目标来扭曲视觉输入,从而增强策略熵。具体来说,我们首先提出了基于熵引导的对抗采样(EgAS),将采样响应的熵作为对抗目标进行建模。然后,相应的对抗梯度可以用来攻击视觉输入以生成对抗样本,使策略模型在RL采样过程中探索更大的答案空间。接着,我们提出了标记选择性熵计算(TsEC),以在EgAS中最大化对抗攻击的有效性,而不扭曲VLMs中的事实知识。在领域内和领域外数据集上的广泛实验表明,我们提出的方法可以通过熵干预大大改善策略探索,从而提升推理能力。论文被接受后将发布代码。
Summary / 总结
This paper introduces Selective-adversarial Entropy Intervention (SaEI) to enhance the reasoning capabilities of vision-language models using reinforcement learning. The method proposes entropy-guided adversarial sampling (EgAS) to formulate the entropy of sampled responses as an adversarial objective, and token-selective entropy computation (TsEC) to maximize the effectiveness of the adversarial attack without distorting factual knowledge. Experiments demonstrate that SaEI improves policy exploration and reasoning capabilities on both in-domain and out-of-domain datasets.
本文旨在通过强化学习(RL)提升视觉语言模型的推理能力。提出了选择性对抗熵干预(SaEI)方法,通过基于采样响应熵的视觉输入扭曲来提高策略熵,从而增加响应多样性和探索性。实验表明,SaEI在领域内和领域外数据集上显著提升了RL模型的推理能力。
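The idea behind EgAS in the entry above, treating output entropy as an objective and stepping the input along its gradient, can be shown on a toy problem. Everything here is an assumption for illustration: the "policy" is a hypothetical linear map from a two-dimensional input to logits, the gradient is approximated with finite differences instead of autodiff, and the update is a single sign step. The token-selective part (TsEC) is not modelled, since the abstract does not state the selection rule.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def policy_entropy(weights, x):
    # Toy stand-in for a policy: logits are a linear map of the "visual" input x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return entropy(softmax(logits))

def entropy_ascent_step(weights, x, eps=0.1, h=1e-5):
    # Central finite differences approximate d(entropy)/dx.
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((policy_entropy(weights, up) - policy_entropy(weights, down)) / (2 * h))
    # Move each input coordinate by eps in the direction that raises entropy.
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

weights = [[2.0, 0.0], [0.0, 2.0]]
x = [1.0, 0.0]
x_adv = entropy_ascent_step(weights, x)
```

In this toy setting the perturbed input yields a flatter output distribution, which is the effect the abstract attributes to adversarial samples during RL sampling.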
LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation
Authors: Huanlin Gao, Ping Chen, Fuyuan Shi, Chao Tan, Zhaoxiang Liu, Fang Zhao, Kai Wang, Shiguo Lian
Venue: NeurIPS 2025 Spotlight
First: 2025-10-30T04:57:26+00:00 · Latest: 2025-12-11T08:10:13+00:00
Comments: NeurIPS 2025 Spotlight
Abstract
We present LeMiCa, a training-free and efficient acceleration framework for diffusion-based video generation. While existing caching strategies primarily focus on reducing local heuristic errors, they often overlook the accumulation of global errors, leading to noticeable content degradation between accelerated and original videos. To address this issue, we formulate cache scheduling as a directed graph with error-weighted edges and introduce a Lexicographic Minimax Path Optimization strategy that explicitly bounds the worst-case path error. This approach substantially improves the consistency of global content and style across generated frames. Extensive experiments on multiple text-to-video benchmarks demonstrate that LeMiCa delivers dual improvements in both inference speed and generation quality. Notably, our method achieves a 2.9x speedup on the Latte model and reaches an LPIPS score of 0.05 on Open-Sora, outperforming prior caching techniques. Importantly, these gains come with minimal perceptual quality degradation, making LeMiCa a robust and generalizable paradigm for accelerating diffusion-based video generation. We believe this approach can serve as a strong foundation for future research on efficient and reliable video synthesis. Our code is available at: https://github.com/UnicomAI/LeMiCa
中文标题/摘要
标题:LeMiCa:基于字典序最小最大路径缓存的高效扩散式视频生成加速框架
我们提出了LeMiCa,一种无需训练且高效的加速框架,用于扩散式视频生成。现有的缓存策略主要集中在减少局部启发式误差,但往往忽视了全局误差的累积,导致加速视频和原始视频之间存在明显的内容降级。为了解决这一问题,我们将缓存调度形式化为带有误差加权边的有向图,并引入了一种字典序最小最大路径优化策略,该策略明确界定了最坏情况路径误差。这种方法显著提高了生成帧中全局内容和风格的一致性。在多个文本到视频基准上的大量实验表明,LeMiCa在推理速度和生成质量上都取得了双重改进。值得注意的是,我们的方法在Latte模型上实现了2.9倍的加速,并在Open-Sora上达到了LPIPS分数0.05,优于先前的缓存技术。重要的是,这些改进带来了极小的感知质量下降,使LeMiCa成为加速扩散式视频生成的稳健且通用的范式。我们认为这种方法可以作为未来高效和可靠视频合成研究的强大基础。我们的代码可在:https://github.com/UnicomAI/LeMiCa 获取。
Summary / 总结
LeMiCa is a training-free acceleration framework for diffusion-based video generation that addresses the issue of global content degradation in existing caching strategies. By formulating cache scheduling as a directed graph and using Lexicographic Minimax Path Optimization, LeMiCa bounds the worst-case path error, improving both inference speed and generation quality. Experiments show a 2.9x speedup on the Latte model and an LPIPS score of 0.05 on Open-Sora, outperforming previous methods with minimal perceptual quality loss.
LeMiCa 是一种用于加速基于扩散的视频生成的框架,旨在提高推理速度和生成质量。它通过将缓存调度建模为有向图并使用 Lexicographic Minimax 路径优化来限制最坏情况路径误差,来解决全局内容降级的问题。实验表明,LeMiCa 在 Latte 模型上实现了 2.9 倍的加速,并在 Open-Sora 上达到了 LPIPS 分数 0.05,优于之前的缓存技术,且具有最小的感知质量下降。
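A lexicographic minimax path search of the kind named in the LeMiCa entry above can be sketched on a small graph. The setup is assumed for illustration: nodes are denoising steps `0..n`, an edge `(i, j)` means jumping from step `i` to step `j` with an associated error, and the path must use exactly `k` edges. The function name `lexi_minimax_path` and the error values are hypothetical, and how the real edge errors are measured is not stated in the abstract.

```python
from functools import lru_cache

def lexi_minimax_path(err, n, k):
    """Return (sorted_errors, path) for the best path 0 -> n with exactly k edges.

    "Best" means the descending-sorted tuple of edge errors is lexicographically
    smallest: the worst edge is minimised first, then the second worst, and so on.
    Returns None if no such path exists.
    """
    @lru_cache(maxsize=None)
    def best(node, edges_left):
        if edges_left == 0:
            return ((), (node,)) if node == n else None
        candidates = []
        for nxt in range(node + 1, n + 1):
            if (node, nxt) not in err:
                continue
            tail = best(nxt, edges_left - 1)
            if tail is None:
                continue
            errors = tuple(sorted(tail[0] + (err[(node, nxt)],), reverse=True))
            candidates.append((errors, (node,) + tail[1]))
        return min(candidates) if candidates else None

    return best(0, k)

# Both two-edge paths share the same worst error (0.5); the second-worst breaks the tie.
err = {(0, 1): 0.2, (1, 3): 0.5, (0, 2): 0.5, (2, 3): 0.3}
result = lexi_minimax_path(err, n=3, k=2)
```

A plain minimax objective would treat the two paths in this example as equally good. Comparing the full sorted error tuples is what separates them. Fixing the edge count keeps all compared tuples the same length, so extending two partial paths by the same edge never changes their relative order and the recursion stays exact.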
Thinking Ahead: Foresight Intelligence in MLLMs and World Models
Authors: Zhantao Gong, Liaoyuan Fan, Qing Guo, Xun Xu, Xulei Yang, Shijie Li
First: 2025-11-24T04:04:59+00:00 · Latest: 2025-12-11T08:02:01+00:00
Comments: 25 pages, 27 figures
Abstract
In this work, we define Foresight Intelligence as the capability to anticipate and interpret future events, an ability essential for applications such as autonomous driving yet largely overlooked by existing research. To bridge this gap, we introduce FSU-QA, a new Visual Question-Answering (VQA) dataset specifically designed to elicit and evaluate Foresight Intelligence. Using FSU-QA, we conduct the first comprehensive study of state-of-the-art Vision-Language Models (VLMs) under foresight-oriented tasks, revealing that current models still struggle to reason about future situations. Beyond serving as a benchmark, FSU-QA also enables the assessment of world models by measuring the semantic coherence of their generated predictions, quantified through performance gains when VLMs are augmented with such outputs. Our experiments further demonstrate that FSU-QA can effectively enhance foresight reasoning: even small VLMs fine-tuned on FSU-QA surpass much larger, advanced models by a substantial margin. Together, these findings position FSU-QA as a principled foundation for developing next-generation models capable of truly anticipating and understanding future events.
中文标题/摘要
标题:前瞻智能:在MLLMs和世界模型中的预见能力
在本研究中,我们定义前瞻智能为预见和解释未来事件的能力——这种能力对于自动驾驶等应用至关重要,但现有研究却对其关注不足。为弥补这一差距,我们引入了FSU-QA,这是一个新的视觉问答(VQA)数据集,专门设计用于激发和评估前瞻智能。利用FSU-QA,我们首次对最先进的视觉-语言模型(VLMs)进行了全面的前瞻导向任务研究,揭示了当前模型在推理未来情况方面仍然存在困难。除了作为基准测试之外,FSU-QA 还通过测量其生成预测的语义连贯性来评估世界模型,这种连贯性通过VLMs与这些输出结合后的性能提升来量化。我们的实验进一步证明,即使是对FSU-QA进行微调的小型VLMs也能显著超越更大、更先进的模型。这些发现共同将FSU-QA定位为开发下一代能够真正预见和理解未来事件的模型的原理性基础。
Summary / 总结
This work introduces Foresight Intelligence as the capability to anticipate future events, crucial for applications like autonomous driving but underexplored in existing research. The authors developed FSU-QA, a new VQA dataset, to evaluate this ability in Vision-Language Models (VLMs). Experiments show that current VLMs struggle with foresight reasoning, but even small models fine-tuned on FSU-QA outperform larger models. FSU-QA also helps assess world models by measuring the semantic coherence of their predictions, demonstrating its effectiveness in enhancing foresight reasoning.
本文提出了预见智能,即预测和解释未来事件的能力,这对于自动驾驶等应用至关重要。作者开发了FSU-QA,这是一个新的VQA数据集,用于评估Vision-Language模型(VLM)的这种能力。实验表明,当前的VLM在预见推理方面存在困难,但即使是经过FSU-QA微调的小型模型也超过了更大、更先进的模型。因此,FSU-QA成为开发能够预见和理解未来事件的模型的基础基准。
Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies
Authors: Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou
First: 2025-12-11T07:48:34+00:00 · Latest: 2025-12-11T07:48:34+00:00
Abstract
Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
中文标题/摘要
标题:大型视觉语言模型精细粒度识别:基准与优化策略
大型视觉语言模型(LVLMs)取得了显著进展,使复杂的视觉语言交互和对话应用成为可能。然而,现有的基准主要集中在推理任务上,往往忽视了精细粒度识别,这对于实际应用场景至关重要。为解决这一问题,我们引入了精细粒度识别开放世界(FROW)基准,旨在使用GPT-4o对LVLMs进行详细评估。在此基础上,我们从数据构建和训练过程两个方面提出了新的优化策略,以提高LVLMs的性能。我们的数据集包括拼接数据,将多个短答案响应结合在一起,以及来自GPT-4o的真实世界问题和答案生成的开放世界数据,从而为评估LVLMs的精细粒度识别提供了一个全面框架。实验表明,拼接数据可将类别识别准确性提高1%,而开放世界数据可将FROW基准准确性提高10%-20%,内容准确性提高6%-12%。同时,将精细粒度数据纳入预训练阶段可将模型的类别识别准确性提高多达10%。基准将可在https://github.com/pc-inno/FROW/获取。
Summary / 总结
This paper addresses the limitation of existing benchmarks that primarily focus on reasoning tasks and neglect fine-grained recognition, which is essential for practical applications. It introduces the FROW benchmark using GPT-4o for evaluating LVLMs and proposes an optimization strategy involving data construction and training process. The study shows that mosaic data enhances category recognition accuracy by 1%, while open-world data improves FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Incorporating fine-grained data during pre-training can further boost category recognition accuracy by up to 10%.
研究旨在通过解决现有基准的局限性,提升大型视觉语言模型(LVLM)的细粒度识别能力。引入了使用GPT-4o评估LVLM的FROW基准,并提出了一种从数据构建和训练过程两个角度优化的方法。研究显示,拼接数据可以将类别识别准确率提高1%,而开放世界数据可以将FROW基准准确率和内容准确率分别提高10%-20%和6%-12%。在预训练阶段加入细粒度数据可以进一步将类别识别准确率提高10%。
Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos
Authors: Mingyu Jeon, Jisoo Yang, Sungjin Han, Jinkwon Hwang, Sunjae Yoon, Jonghee Kim, Junyeoung Kim
First: 2025-12-11T07:25:48+00:00 · Latest: 2025-12-11T07:25:48+00:00
Abstract
Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).
中文标题/摘要
标题:点到区间:导航未见一小时视频的零样本片段检索
零样本长视频片段检索(ZLVMR)是指使用自然语言查询在一小时长的视频中识别时间片段的任务,无需特定任务的训练。LVMR的核心技术挑战源于一次性处理整个长视频的计算不可行性。这一限制已经确立了“搜索-然后-细化”方法,其中候选者迅速缩小,仅分析那些部分,成为LVMR的主要范式。然而,现有方法对此范式面临严重限制。传统的监督学习在可扩展性和泛化能力方面存在局限,尽管消耗了大量资源。然而,现有的零样本方法也失败了,面临双重挑战:(1)其启发式策略导致“搜索”阶段候选者爆炸,(2)“细化”阶段,由于语义差异易受影响,需要高成本的VLM验证,导致显著的计算开销。我们提出了Point-to-Span(P2S),一种新的无需训练的框架,以克服这一低效的“搜索”和昂贵的“细化”阶段挑战。P2S通过两个关键创新克服了这些挑战:一个“自适应区间生成器”来防止搜索阶段候选者爆炸,以及“查询分解”来在不依赖高成本VLM验证的情况下细化候选者。据我们所知,P2S是第一个能够在一小时长视频中实现时间定位的零样本框架,显著优于监督下的最新方法(例如,在MAD上的R5@0.1上提高了3.7%)。
Summary / 总结
The paper addresses the challenge of zero-shot long video moment retrieval (ZLVMR) by proposing Point-to-Span (P2S), a novel training-free framework. P2S introduces an 'Adaptive Span Generator' to reduce the number of candidates in the search phase and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. Experiments show that P2S outperforms supervised state-of-the-art methods by a significant margin, achieving +3.7% on R5@0.1 on MAD.
论文提出了Point-to-Span (P2S)框架,以解决零样本长视频片段检索(ZLVMR)的问题。P2S通过‘自适应跨度生成器’和‘查询分解’来解决‘搜索’阶段的低效和‘精炼’阶段的高计算成本。实验表明,P2S在MAD上的R5@0.1指标上比监督下的最新方法高出3.7%,证明了其在长视频中的时间定位效果。