M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG
Authors: David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata
First: 2025-12-05T18:55:58+00:00 · Latest: 2025-12-05T18:55:58+00:00
Comments: Preprint
Abstract
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
中文标题/摘要
标题:M4-RAG:大规模多语言多文化多模态RAG
视觉语言模型(VLMs)在视觉问答(VQA)方面取得了出色的表现,但仍然受限于静态训练数据。检索增强生成(RAG)通过提供最新的、文化基础的和多语言的信息来缓解这一限制;然而,多语言多模态RAG仍然很少被探索。我们介绍了M4-RAG,这是一个涵盖42种语言和56种地区方言和体裁的大规模基准,包含超过80,000个文化多样性的图像-问题对,用于评估跨语言和模态的检索增强VQA。为了平衡现实性和可重复性,我们构建了一个受控的检索环境,包含数百万个与查询领域相关的精心策划的多语言文档,近似于实际的检索条件,同时确保一致的实验。我们的系统评估表明,尽管RAG持续改善较小的VLMs,但它无法扩展到更大的模型,并且经常甚至会降低其性能,揭示了模型规模与当前检索效果之间的关键不匹配。M4-RAG为推进能够无缝跨越语言、模态和文化背景的下一代RAG系统奠定了基础。
Summary / 总结
M4-RAG is a large-scale benchmark for evaluating retrieval-augmented visual question answering across 42 languages and 56 regional dialects and registers, using over 80,000 culturally diverse image-question pairs. It introduces a controlled retrieval environment with millions of curated multilingual documents to balance realism and reproducibility. The study finds that while RAG consistently improves smaller VLMs, it often degrades the performance of larger models, exposing a mismatch between model size and current retrieval effectiveness.
M4-RAG 是一个涵盖42种语言和56种方言的大型多模态检索增强生成基准,包含超过8万个图像-问题对。它评估跨语言和模态的VQA,使用包含数百万个相关文档的受控检索环境。研究发现,虽然RAG能提升较小的VLM,但在更大模型上却常导致性能下降,这表明需要改进检索技术的扩展性。
iMotion-LLM: Instruction-Conditioned Trajectory Generation
Authors: Abdulwahab Felemban, Nussair Hroub, Jian Ding, Eslam Abdelrahman, Xiaoqian Shen, Abduallah Mohamed, Mohamed Elhoseiny
First: 2024-06-10T12:22:06+00:00 · Latest: 2025-12-05T18:52:32+00:00
Abstract
We introduce iMotion-LLM, a large language model (LLM) integrated with trajectory prediction modules for interactive motion generation. Unlike conventional approaches, it generates feasible, safety-aligned trajectories based on textual instructions, enabling adaptable and context-aware driving behavior. It combines an encoder-decoder multimodal trajectory prediction model with a pre-trained LLM fine-tuned using LoRA, projecting scene features into the LLM input space and mapping special tokens to a trajectory decoder for text-based interaction and interpretable driving. To support this framework, we introduce two datasets: 1) InstructWaymo, an extension of the Waymo Open Motion Dataset with direction-based motion instructions, and 2) Open-Vocabulary InstructNuPlan, which features safety-aligned instruction-caption pairs and corresponding safe trajectory scenarios. Our experiments validate that instruction conditioning enables trajectory generation that follows the intended condition. iMotion-LLM demonstrates strong contextual comprehension, achieving 84% average accuracy in direction feasibility detection and 96% average accuracy in safety evaluation of open-vocabulary instructions. This work lays the foundation for text-guided motion generation in autonomous driving, supporting simulated data generation, model interpretability, and robust safety alignment testing for trajectory generation models. Our code, pre-trained model, and datasets are available at: https://vision-cair.github.io/iMotion-LLM/.
中文标题/摘要
标题:iMotion-LLM:基于指令的轨迹生成
我们介绍了iMotion-LLM,这是一种结合了轨迹预测模块的大语言模型(LLM),用于交互式运动生成。与传统的做法不同,它基于文本指令生成可行且安全对齐的轨迹,从而实现适应性和上下文感知的驾驶行为。该模型结合了编码器-解码器多模态轨迹预测模型和使用LoRA微调的预训练LLM,将场景特征投影到LLM输入空间,并将特殊标记映射到轨迹解码器,以实现基于文本的交互和可解释的驾驶。为了支持这一框架,我们引入了两个数据集:1) InstructWaymo,这是Waymo开放运动数据集的扩展,包含基于方向的运动指令;2) 开放词汇量InstructNuPlan,该数据集包含安全对齐的指令-描述对以及相应的安全轨迹场景。我们的实验验证了指令条件化能够使轨迹生成遵循预期条件。iMotion-LLM展示了强大的上下文理解能力,在方向可行性检测中平均准确率为84%,在开放词汇量指令的安全评估中平均准确率为96%。这项工作为自主驾驶中的文本引导运动生成奠定了基础,支持模拟数据生成、模型可解释性和轨迹生成模型的稳健安全对齐测试。我们的代码、预训练模型和数据集可在以下网址获取:https://vision-cair.github.io/iMotion-LLM/
Summary / 总结
iMotion-LLM is a large language model integrated with trajectory prediction modules for generating feasible and safety-aligned trajectories based on textual instructions. It combines an encoder-decoder multimodal trajectory prediction model with a pre-trained LLM fine-tuned using LoRA, enabling text-based interaction and interpretable driving. Experiments show that iMotion-LLM achieves 84% average accuracy in direction feasibility detection and 96% average accuracy in safety evaluation of open-vocabulary instructions, demonstrating strong contextual comprehension for trajectory generation in autonomous driving.
iMotion-LLM 是一个结合了轨迹预测模块的大语言模型,能够根据文本指令生成可行且安全对齐的轨迹。它结合了编码器-解码器多模态轨迹预测模型和使用 LoRA 微调的预训练语言模型,支持基于文本的交互和可解释驾驶。实验结果显示,iMotion-LLM 在方向可行性检测上的准确率为 84%,在开放词汇指令的安全性评估上的准确率为 96%,展示了强大的上下文理解能力和稳健的安全对齐能力,用于自主驾驶中的轨迹生成。
SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models
Authors: Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, Yilun Du
First: 2025-12-05T18:51:03+00:00 · Latest: 2025-12-05T18:51:03+00:00
Abstract
Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at https://simpact-bot.github.io
中文标题/摘要
标题:SIMPACT:使用视觉语言模型的仿真驱动动作规划
视觉语言模型(VLMs)表现出显著的常识和语义推理能力。然而,它们缺乏对物理动力学的现实理解。这一限制源于VLMs在静态互联网规模的视觉语言数据上进行训练,这些数据中没有因果交互或动作条件下的变化。因此,利用VLMs进行需要物理理解、推理和相应动作规划的精细机器人操作任务仍然具有挑战性。为了解决这个问题,我们提出了SIMPACT,一种测试时仿真驱动的动作规划框架,通过仿真闭环世界建模赋予VLM物理推理能力,而无需额外训练。从单个RGB-D观察开始,SIMPACT高效地构建物理仿真,使VLM能够提出有信息量的动作,观察仿真滚动,并逐步完善其推理。通过将语言推理与物理预测集成,我们的仿真驱动的VLM能够以物理为基础的方式理解接触动力学和动作结果。我们的方法在五个需要精细物理推理的现实世界刚体和变形体操作任务上展示了最先进的性能,优于现有的通用机器人操作模型。我们的结果表明,在测试时通过高效仿真嵌入物理理解为VLM推理提供了通向可泛化的具身智能的有希望的途径。项目网页可访问 https://simpact-bot.github.io
Summary / 总结
SIMPACT is a framework that enhances Vision-Language Models (VLMs) with physical reasoning capabilities through simulation, addressing their limitations in understanding physical dynamics. It constructs physics simulations from a single RGB-D observation to enable VLMs to propose informed actions, observe simulated rollouts, and iteratively refine their reasoning. SIMPACT outperforms existing models on five challenging manipulation tasks, demonstrating state-of-the-art performance and offering a promising approach for embodied intelligence.
SIMPACT 是一个框架,通过仿真增强视觉-语言模型(VLM)的物理推理能力,使其能够执行精细的机器人操作任务。它从单个 RGB-D 观测中构建物理仿真,使 VLM 能够提出动作、观察仿真结果并逐步优化其推理。SIMPACT 在五个具有挑战性的操作任务上超越了现有模型,展示了将物理理解集成到 VLM 中以实现具身智能的潜力。
TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Authors: Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi
First: 2025-12-05T18:40:18+00:00 · Latest: 2025-12-05T18:40:18+00:00
Abstract
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
中文标题/摘要
标题:TRACE:分析和增强视觉语言模型逐步推理的框架
可靠地进行数学和科学推理仍然是大型视觉语言模型面临的开放挑战。标准的最终答案评估往往掩盖了推理错误,允许无声失败持续存在。为了解决这一问题,我们引入了TRACE,一种透明推理和一致性评估框架,该框架诊断推理轨迹而非仅关注最终结果。核心上,TRACE 利用辅助推理集,这是一种分解复杂问题的紧凑子问题答案对,通过基于一致性的度量评估中间步骤,并揭示标准评估中忽略的失败。我们的实验表明,辅助推理集(ARS)的一致性与最终答案的正确性相关,并有助于定位失败出现的推理步骤,提供可操作的信号以改进模型。此外,TRACE 定义了置信区域,区分可靠和不可靠的推理路径,支持有效的过滤、调试和模型细化。
Summary / 总结
The research aims to improve the reliability of mathematical and scientific reasoning in large vision-language models by addressing the limitations of standard final-answer evaluation. TRACE, a framework for Transparent Reasoning And Consistency Evaluation, introduces Auxiliary Reasoning Sets to decompose complex problems and evaluate intermediate steps through consistency-based metrics. Experiments demonstrate that consistency across these sets correlates with final-answer correctness and helps identify specific reasoning steps where failures occur, providing actionable signals for model improvement. Additionally, TRACE defines confidence regions to distinguish reliable from unreliable reasoning paths, supporting effective debugging and refinement.
研究旨在通过解决标准最终答案评估的局限性,提高大型视觉语言模型在数学和科学推理方面的可靠性。TRACE,一种透明推理和一致性评估框架,引入了辅助推理集来分解复杂问题,并通过一致性度量评估中间步骤。实验表明,这些集中的一致性与最终答案的正确性相关,并有助于识别推理步骤中的失败点,提供改进模型的行动信号。此外,TRACE定义了置信区间,以区分可靠的和不可靠的推理路径,支持有效的调试和模型优化。
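The abstract does not spell out TRACE's consistency metrics, so the following is only a minimal sketch of one plausible reading: each sub-question in an Auxiliary Reasoning Set is answered several times, and consistency is the share of samples that agree with the majority answer. The function names, the majority-agreement rule, and the 0.7 threshold are all assumptions made for illustration, not the authors' definitions.

```python
from collections import Counter


def step_consistency(answers):
    """Fraction of sampled answers that match the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def trace_report(ars_samples, threshold=0.7):
    """Summarise consistency over a set of sub-questions.

    ars_samples maps each sub-question to a list of sampled answers.
    The threshold is a hypothetical cut-off for a "reliable" path.
    """
    per_step = {q: step_consistency(a) for q, a in ars_samples.items()}
    mean = sum(per_step.values()) / len(per_step)
    weakest = min(per_step, key=per_step.get)  # least consistent sub-question
    return {
        "per_step": per_step,
        "mean": mean,
        "weakest": weakest,
        "reliable": mean >= threshold,
    }


# Toy data: three sub-questions, four sampled answers each.
samples = {
    "q1": ["4", "4", "4", "4"],
    "q2": ["9", "7", "9", "8"],
    "q3": ["12", "12", "12", "10"],
}
report = trace_report(samples)
```

In this toy run the least consistent sub-question is `q2`, which is the kind of step-level signal the abstract describes as useful for locating where reasoning fails.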
Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding
Authors: Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, Shilong Liu
First: 2025-12-05T18:39:12+00:00 · Latest: 2025-12-05T18:39:12+00:00
Comments: Code is available at https://github.com/Princeton-AI2-Lab/ZoomClick
Abstract
Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.
中文标题/摘要
标题:放大缩小,点击退出:解锁并评估缩放技术在GUI定位中的潜力
GUI定位是构建图形用户界面(GUI)代理的基本能力。尽管现有方法依赖于大规模边界框监督,但仍面临各种挑战,如跨平台通用性、复杂布局分析和细粒度元素定位。在本文中,我们研究了缩放作为GUI定位的强大但尚未充分探索的先验,并提出了一种无需训练的方法ZoomClick。通过描述缩放的四个关键属性(即预缩放、深度、缩小尺寸、最小裁剪尺寸),我们解锁了其动态空间聚焦和自适应上下文切换的全部能力。实验表明,我们的方法显著提升了通用视觉-语言和专门的GUI定位模型的性能,在多个主流基准上取得了最先进的结果;例如,UI-Venus-72B在ScreenSpot-Pro上的成功率为73.1%。此外,我们提出了GUIZoom-Bench,这是一个用于评估模型对缩放适应性的基准,旨在激发未来研究以提高缩放在GUI定位任务中的进一步训练和测试时缩放能力。
Summary / 总结
This paper addresses the challenges in GUI grounding by exploring the underutilized potential of zooming. It introduces ZoomClick, a training-free method that leverages four key properties of zoom (pre-zoom, depth, shrink size, minimal crop size) to enhance dynamic spatial focusing and adaptive context switching. Experimental results show that ZoomClick significantly improves the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several benchmarks, such as a 73.1% success rate on ScreenSpot-Pro. Additionally, the paper introduces GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to advance future research in this area.
本文通过探索缩放的潜在价值,解决了GUI定位中的挑战。提出了一个无需训练的方法ZoomClick,通过四个关键属性来增强动态空间聚焦和自适应上下文切换。该方法显著提高了通用视觉-语言模型和专门的GUI定位模型的性能,达到了多个基准上的最新成果,例如在ScreenSpot-Pro上的成功率为73.1%。此外,论文还提出了GUIZoom-Bench,用于评估模型对缩放的适应性,旨在推动GUI定位任务中进一步训练和测试时缩放的研究。
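As a rough illustration of iterative zooming for grounding, the sketch below repeatedly crops around the current predicted click point and maps the prediction back to full-screenshot coordinates. It is not the ZoomClick implementation: `predict` is a hypothetical callable standing in for a grounding model, the parameter names `depth`, `shrink`, and `min_crop` only loosely mirror the properties named in the abstract, and the pre-zoom step is omitted.

```python
def zoom_click(predict, width, height, depth=3, shrink=0.5, min_crop=64):
    """Refine a click prediction by zooming into successively smaller crops.

    predict(crop) takes (left, top, w, h) in screenshot pixels and returns
    a point (u, v) in [0, 1] x [0, 1] relative to that crop.
    """
    left, top, w, h = 0.0, 0.0, float(width), float(height)
    x = y = None
    for _ in range(depth + 1):
        u, v = predict((left, top, w, h))
        # Map the crop-relative prediction back to screenshot coordinates.
        x, y = left + u * w, top + v * h
        new_w, new_h = w * shrink, h * shrink
        if new_w < min_crop or new_h < min_crop:
            break  # stop before the crop becomes too small to be useful
        # Centre the next crop on the prediction, clamped to the screenshot.
        left = min(max(x - new_w / 2, 0.0), width - new_w)
        top = min(max(y - new_h / 2, 0.0), height - new_h)
        w, h = new_w, new_h
    return x, y


# Toy predictor that always knows where the target is inside the crop.
TARGET = (1500.0, 300.0)


def oracle(crop):
    left, top, w, h = crop
    return (TARGET[0] - left) / w, (TARGET[1] - top) / h


click = zoom_click(oracle, width=1920, height=1080)
```

With a real model, each zoom step would give the model a higher effective resolution around the candidate element; the oracle here only checks that the coordinate bookkeeping is consistent.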
PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation
Authors: Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi
First: 2025-12-05T18:14:55+00:00 · Latest: 2025-12-05T18:14:55+00:00
Abstract
Evaluating vision-language models (VLMs) in scientific domains like mathematics and physics poses unique challenges that go far beyond predicting final answers. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, requirements that most existing benchmarks fail to address. In particular, current datasets tend to be static, lacking intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness. To address these limitations, we introduce PRiSM, a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent, to generate well-structured problem instances. Each problem contains dynamic textual and visual input, a generated figure, alongside rich structured outputs: executable Python code for ground truth generation and verification, and detailed step-by-step reasoning. The dynamic nature and Python-powered automated ground truth generation of our benchmark allow for fine-grained experimental auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning. To this end, we propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution. Through comprehensive evaluation of existing VLMs, we highlight their limitations and showcase how PRiSM enables deeper insights into their scientific reasoning capabilities.
中文标题/摘要
标题:PRiSM:基于Python验证的科学推理多模态基准
在数学和物理学等科学领域评估视觉-语言模型(VLMs)带来了独特的挑战,远超出了仅预测最终答案的范围。这些领域需要概念理解、符号推理和遵守正式法则,而现有的大多数基准未能解决这些问题。特别是,当前的数据集往往是静态的,缺乏中间推理步骤、对变化的鲁棒性或验证科学正确性的机制。为了解决这些局限性,我们引入了PRiSM,这是一个合成的、完全动态的、基于多模态的基准,用于通过嵌入的Python代码评估科学推理。PRiSM 包含超过24,750个大学级别的物理和数学问题,并利用我们可扩展的基于代理的管道PrismAgent生成结构良好的问题实例。每个问题包含动态的文本和视觉输入、生成的图表,以及丰富的结构化输出:用于生成和验证真实结果的可执行Python代码,以及详细的逐步推理。我们的基准的动态特性和基于Python的自动化真实结果生成允许对多模态VLMs进行精细的实验审计,揭示其推理失败模式、不确定性行为和科学推理的局限性。为此,我们提出了五个有针对性的评估任务,涵盖泛化、符号程序合成、扰动鲁棒性、推理纠正和歧义解决。通过全面评估现有的VLMs,我们指出了它们的局限性,并展示了PRiSM如何使我们更深入地了解它们的科学推理能力。
Summary / 总结
PRiSM is a new benchmark for evaluating vision-language models in scientific domains, addressing limitations of existing benchmarks by introducing dynamic, multimodal problems grounded in Python code. It includes over 24,750 university-level physics and math problems, each with dynamic inputs, a generated figure, and structured outputs for ground truth generation and verification. Experimental results reveal failure modes, uncertainty behaviors, and limitations in scientific reasoning of existing models, highlighting the need for more robust models capable of symbolic reasoning and adherence to formal laws.
PRiSM 是一个合成的、完全动态的多模态基准,通过基于 Python 的评估方法来评估科学领域的视觉-语言模型,解决了现有基准的局限性,如静态问题和缺乏动态推理步骤。关键实验发现表明,现有的 VLM 在泛化、符号程序合成、扰动鲁棒性、推理纠正和歧义解决方面存在困难,突显了需要改进的科学推理能力。
Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning
Authors: Negin Baghbanzadeh, Mohammed Saidul Islam, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour
First: 2025-06-03T10:53:19+00:00 · Latest: 2025-12-05T17:47:02+00:00
Comments: 21 pages
Abstract
In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and often only partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale high-fidelity biomedical dataset comprising 18 million image-text pairs, spanning radiology, microscopy, and visible light photography. We train vision-language models on our dataset and perform extensive evaluation on 6 retrieval and 19 zero-shot classification tasks across three major modalities. The models trained on our dataset set new state-of-the-art results in medical representation learning. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.
中文标题/摘要
标题:Open-PMC-18M:多模态表示学习的高保真大规模医学数据集
在生物医学视觉-语言建模中,数据通常是从科学文献中挖掘出来的,将复合图与短、上下文依赖且经常部分信息的描述配对。先前的子图提取工作在数据集大小和泛化能力方面都受到限制。此外,现有的努力没有在图像-文本对中融入丰富的医学上下文。我们重新审视数据收集作为有效生物医学表示学习基础组件的重要性。我们的数据收集过程结合了基于变换器的子图检测、子描述提取以及来自内联参考的上下文文本丰富。我们的子图提取模型在50万复合图的语料库上训练,实现了在真实和合成基准上的最佳性能。通过这一过程,我们收集并发布了Open-PMC-18M,这是一个包含1800万图像-文本对的大规模高保真生物医学数据集,涵盖了放射学、显微镜和可见光摄影。我们在该数据集上训练视觉-语言模型,并在三个主要模态的六个检索和十九个零样本分类任务上进行了广泛的评估。在我们的数据集上训练的模型在医学表示学习中取得了新的最佳结果。我们发布了我们的数据集、模型和代码,以支持可重复的基准测试并进一步研究生物医学视觉-语言建模和表示学习。
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Authors: Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah
Venue: ICLR 2026
First: 2025-09-28T21:15:07+00:00 · Latest: 2025-12-05T17:19:01+00:00
Comments: Under review as a conference paper at ICLR 2026
Abstract
Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as consistent within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism that explains how external cues enhance multimodal binding and offer both interpretability and practical improvements.
中文标题/摘要
标题:揭示接地ID:外部线索如何塑造多模态绑定
大型视觉-语言模型(LVLMs)在多模态基准测试中表现出色,但在结构化推理和精确接地方面仍有限制。近期研究表明,添加简单的视觉结构,如分区和注释,可以提高准确性,但这些改进背后的内部机制尚不清楚。我们研究了这一现象,并提出了接地ID的概念,即由外部线索诱导的潜在标识符,这些标识符在不同模态中将对象与其指定的分区绑定在一起。通过表示分析,我们发现这些标识符在嵌入空间中表现为一致的分区内部对齐,并减少了图像与文本之间的模态差距。因果干预进一步证实这些标识符在对象与符号线索之间起中介作用。我们展示了接地ID增强了相关组件之间的注意力,从而改善了跨模态接地并减少了幻觉。综上所述,我们的结果将接地ID识别为一个关键的符号机制,解释了外部线索如何增强多模态绑定,并提供了可解释性和实际改进。
Summary / 总结
The research aims to understand how external visual cues improve the performance of large vision-language models in multimodal tasks. The study introduces the concept of Grounding IDs, which are latent identifiers induced by external cues that help bind objects to their partitions across modalities. The authors find that these identifiers reduce the modality gap and enhance cross-modal grounding, leading to improved performance and reduced hallucinations. Causal interventions confirm that Grounding IDs mediate the binding between objects and symbolic cues, strengthening attention between related components.
研究旨在理解外部线索如何提升视觉-语言模型在多模态任务中的表现。研究引入了Grounding IDs的概念,这些隐含标识符由外部线索产生,并帮助在不同模态间将对象与其分区绑定。研究发现这些标识符减少了模态间的差距,并增强了跨模态的绑定,从而提高了准确性和减少了多模态任务中的幻觉现象。
Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling
Authors: Saurav Jha, M. Jehanzeb Mirza, Wei Lin, Shiqi Yang, Sarath Chandar
First: 2025-12-05T15:30:08+00:00 · Latest: 2025-12-05T15:30:08+00:00
Comments: Extended abstract at World Modeling Workshop 2026
Abstract
Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
中文标题/摘要
标题:探究世界模型在空间推理中的有效性通过测试时缩放
视觉-语言模型(VLMs)在需要多视角理解和身体视角转换的空间推理任务中仍然受到限制。最近的方法如MindJourney试图通过测试时缩放来缓解这一差距,其中世界模型想象基于动作的轨迹,启发式验证器从中选择有用的观点。在本研究中,我们系统地考察了此类测试时验证器在基准测试中的表现,揭示了它们的潜力和局限性。基于不确定性的分析表明,MindJourney的验证器几乎没有提供有意义的校准,随机评分往往与减少答案熵的效果相当,从而暴露了系统性的动作偏差和不可靠的奖励信号。为缓解这些问题,我们引入了一种基于空间断言的验证框架(ViSA),将测试时的奖励与可验证的、帧锚定的微断言联系起来。这种原理性的验证器在SAT-Real基准测试中一致地提高了空间推理能力,并通过更平衡的探索行为纠正了轨迹选择偏差。然而,在具有挑战性的MMSI-Bench上,包括我们的验证器在内的所有验证器都无法实现一致的缩放,这表明当前的世界模型形成了一个信息瓶颈,想象中的视图未能丰富精细的推理。总之,这些发现描绘了基于世界模型推理的测试时验证的优劣和缺陷。我们的代码可在https://github.com/chandar-lab/visa-for-mindjourney/ 获取。
Summary / 总结
This study evaluates the effectiveness of test-time scaling in Vision-Language Models for spatial reasoning tasks, focusing on the MindJourney approach. The research finds that the heuristic verifier used in MindJourney does not provide meaningful calibration and that random scoring can achieve similar results, indicating biases and unreliable reward signals. To address these issues, the authors propose Verification through Spatial Assertions (ViSA), which improves spatial reasoning on the SAT-Real benchmark by grounding rewards in verifiable micro-claims. However, on the MMSI-Bench, none of the verifiers, including ViSA, achieve consistent scaling, suggesting that current world models may be an information bottleneck. The findings highlight both the potential and limitations of test-time verification for world-model-based reasoning.
这项研究探讨了测试时缩放在视觉-语言模型(VLMs)中用于空间推理任务的有效性。它评估了MindJourney方法,该方法使用世界模型来想象动作条件下的轨迹,并使用启发式验证器选择有用的视图。研究发现,MindJourney中的验证器不能提供有意义的校准,随机评分同样有效,这揭示了偏见和不可靠的奖励信号。为了解决这些问题,作者提出了空间断言验证(ViSA)框架,通过将奖励基于可验证的微断言来改进空间推理,从而在SAT-Real基准上取得了改进。然而,在MMSI-Bench上,包括ViSA在内的所有验证器都无法实现一致的缩放,表明当前的世界模型可能形成信息瓶颈。研究结果既揭示了测试时验证的潜力,也指出了其局限性。
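The abstract refers to answer entropy as the quantity its uncertainty analysis tracks. A minimal sketch of that quantity, assuming it is the Shannon entropy of the empirical distribution over sampled answers (the exact estimator used in the paper is not stated here), looks like this:

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical samples: answers without and with extra imagined views.
before = ["A", "B", "C", "D"]
after = ["A", "A", "A", "B"]
h_before = answer_entropy(before)
h_after = answer_entropy(after)
```

A drop in entropy only shows that answers became more concentrated, not that they became more correct, which is consistent with the abstract's observation that random scoring can reduce entropy about as well as the heuristic verifier.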
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
Authors: Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu
First: 2025-09-13T06:36:07+00:00 · Latest: 2025-12-05T13:23:24+00:00
Abstract
Sparse Autoencoders (SAEs) are widely employed for mechanistic interpretability and model steering. Within this context, steering is by design performed by means of decoding altered SAE intermediate representations. This procedure essentially rewrites the original activations as a weighted sum of decoder features. In contrast to existing literature, we put forward an encoder-centric alternative to model steering which demonstrates stronger cross-modal performance. We introduce S&P Top-K, a retraining-free and computationally lightweight Selection and Projection framework that identifies Top-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and computes an orthogonal projection to be subsequently applied directly in the model's native embedding space. In vision-language models, it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage, and in large language models, it substantially reduces aggressiveness and sycophancy in Llama-3 8B Instruct, achieving up to 3.6 times gains over masked reconstruction. These findings suggest that encoder-centric interventions provide a general, efficient, and more effective mechanism for shaping model behavior at inference time than the traditional decoder-centric use of SAEs.
中文标题/摘要
标题:重新思考稀疏自编码器:仅从编码器特征选择和投影实现公平性和控制
稀疏自编码器(SAEs)广泛应用于机械可解释性和模型引导。在此背景下,引导设计上是通过解码修改的SAE中间表示来实现的。这一过程本质上是将原始激活重新写为解码器特征的加权和。与现有文献不同,我们提出了一种以编码器为中心的模型引导替代方案,该方案展示了更强的跨模态性能。我们引入了S&P Top-K,这是一种无需重新训练且计算量轻的选优和投影框架,该框架识别与敏感属性或行为对齐的Top-K编码器特征,可选地将它们聚合到一个控制轴上,并计算一个正交投影,随后直接应用于模型的原生嵌入空间中。在视觉-语言模型中,它在CelebA和FairFace上的公平性指标上比传统SAE使用方法提高了3.2倍,而在大型语言模型中,它显著减少了Llama-3 8B Instruct的攻击性和奉承性,实现了高达3.6倍的改进。这些发现表明,以编码器为中心的干预措施提供了一种更通用、更高效且更有效的机制,在推理时塑造模型行为,比传统的以解码器为中心的SAE使用方法更为有效。
Summary / 总结
This paper proposes S&P Top-K, an encoder-centric, retraining-free framework for model steering with sparse autoencoders. Unlike conventional steering, which decodes altered SAE intermediate representations, S&P Top-K selects the Top-K encoder features aligned with a sensitive attribute or behavior, optionally aggregates them into a single control axis, and applies an orthogonal projection directly in the model's native embedding space. In vision-language models it improves fairness metrics on CelebA and FairFace by up to 3.2 times over conventional SAE usage; in large language models it reduces aggressiveness and sycophancy in Llama-3 8B Instruct, with gains of up to 3.6 times over masked reconstruction.
该研究旨在通过聚焦编码器而非解码器来提高模型行为的公平性和可控性,使用稀疏自编码器(SAEs)。引入了S&P Top-K框架,该框架选择与敏感属性或行为对齐的前K个编码器特征,并直接在模型的嵌入空间中应用正交投影。该方法在视觉-语言模型中显著提高了公平性指标,最高可达3.2倍的传统SAE使用方法,同时在大型语言模型中减少了攻击性和奉承性,最高可达3.6倍的遮蔽重建方法。
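The select-and-project idea can be sketched in a few lines of plain Python. This is an illustrative reading of the abstract rather than the authors' code: scoring features by the gap in mean encoder pre-activation between two groups, the sign-weighted sum used to aggregate them, and all function names are assumptions. The final step is the standard orthogonal projection x' = x - (x·u)u onto the complement of a unit axis u.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def select_top_k(encoder_rows, group_a, group_b, k):
    """Rank encoder features by the gap in mean pre-activation between groups."""
    def mean_act(row, group):
        return sum(dot(row, x) for x in group) / len(group)

    scores = [mean_act(r, group_a) - mean_act(r, group_b) for r in encoder_rows]
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return ranked[:k], scores


def control_axis(encoder_rows, indices, scores):
    """Aggregate the selected encoder directions into one unit-length axis."""
    axis = [0.0] * len(encoder_rows[0])
    for i in indices:
        sign = 1.0 if scores[i] >= 0 else -1.0
        for d, value in enumerate(encoder_rows[i]):
            axis[d] += sign * value
    return normalize(axis)


def project_out(x, axis):
    """Remove the component of x that lies along the unit axis."""
    coeff = dot(x, axis)
    return [xi - coeff * ai for xi, ai in zip(x, axis)]


# Toy setup: 3 encoder features over a 3-dimensional embedding space.
encoder_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
group_a = [[2.0, 0.1, 0.5], [1.8, -0.1, 0.4]]    # embeddings with the attribute
group_b = [[-2.0, 0.0, 0.5], [-1.9, 0.1, 0.6]]   # embeddings without it
top, scores = select_top_k(encoder_rows, group_a, group_b, k=1)
axis = control_axis(encoder_rows, top, scores)
cleaned = project_out(group_a[0], axis)
```

Because the projection acts on the embedding itself, nothing is passed through the SAE decoder, which is the contrast with decoder-centric steering that the abstract draws.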
Concept-Guided Backdoor Attack on Vision Language Models
Authors: Haoyu Shen, Weimin Lyu, Haotian Xu, Tengfei Ma
First: 2025-11-30T03:24:23+00:00 · Latest: 2025-12-05T13:17:42+00:00
Abstract
Vision-Language Models (VLMs) have achieved impressive progress in multimodal text generation, yet their rapid adoption raises increasing concerns about security vulnerabilities. Existing backdoor attacks against VLMs primarily rely on explicit pixel-level triggers or imperceptible perturbations injected into images. While effective, these approaches reduce stealthiness and remain vulnerable to image-based defenses. We introduce concept-guided backdoor attacks, a new paradigm that operates at the semantic concept level rather than on raw pixels. We propose two different attacks. The first, Concept-Thresholding Poisoning (CTP), uses explicit concepts in natural images as triggers: only samples containing the target concept are poisoned, causing the model to behave normally in all other cases but consistently inject malicious outputs whenever the concept appears. The second, CBL-Guided Unseen Backdoor (CGUB), leverages a Concept Bottleneck Model (CBM) during training to intervene on internal concept activations, while discarding the CBM branch at inference time to keep the VLM unchanged. This design enables systematic replacement of a targeted label in generated text (for example, replacing "cat" with "dog"), even when the replacement behavior never appears in the training data. Experiments across multiple VLM architectures and datasets show that both CTP and CGUB achieve high attack success rates while maintaining moderate impact on clean-task performance. These findings highlight concept-level vulnerabilities as a critical new attack surface for VLMs.
中文标题/摘要
标题:概念引导的视觉语言模型后门攻击
视觉-语言模型(VLMs)在多模态文本生成方面取得了显著进展,但其快速采用引发了越来越多关于安全漏洞的担忧。现有针对VLMs的后门攻击主要依赖于在图像中注入显式像素级触发器或不可感知的扰动。虽然这些方法有效,但它们降低了隐蔽性,并且仍然容易受到图像防御的攻击。我们引入了概念引导的后门攻击,这是一种在语义概念层面而非像素层面操作的新范式。我们提出了两种不同的攻击方法。第一种,概念阈值中毒(CTP),使用自然图像中的显式概念作为触发器:只有包含目标概念的样本才会被中毒,导致模型在其他情况下正常工作,但在概念出现时始终注入恶意输出。第二种,CBL引导的未见后门(CGUB),在训练过程中利用概念瓶颈模型(CBM)干预内部概念激活,而在推理时丢弃CBM分支以保持VLM不变。这种设计使得即使在训练数据中从未出现替换行为,也可以系统地在生成的文本中替换目标标签(例如,将“猫”替换为“狗”)。在多个VLM架构和数据集上的实验表明,CTP和CGUB均能实现高攻击成功率,同时对干净任务性能的影响适中。这些发现突显了概念层面的漏洞作为VLMs的一个关键新攻击面。
Summary / 总结
This paper examines the security vulnerabilities of Vision-Language Models (VLMs) by introducing concept-guided backdoor attacks, which operate at the semantic concept level rather than on raw pixels. Two methods, Concept-Thresholding Poisoning (CTP) and CBL-Guided Unseen Backdoor (CGUB), are proposed. CTP uses explicit concepts in natural images as triggers, while CGUB employs a Concept Bottleneck Model during training to intervene on internal concept activations and discards that branch at inference time, leaving the VLM unchanged. Both methods achieve high attack success rates while maintaining moderate impact on clean-task performance, highlighting concept-level vulnerabilities as a new attack surface for VLMs.
研究旨在通过引入基于概念的后门攻击来解决视觉语言模型(VLMs)的安全漏洞,这些攻击在语义概念层面而非像素层面操作。提出了两种方法:概念阈值中毒(CTP)使用图像中的显式概念作为触发器,而CBL引导的未见后门(CGUB)在训练过程中利用概念瓶颈模型干预内部概念激活。实验表明,这两种方法都能在对干净任务性能影响较小的情况下实现高攻击成功率,突显了概念层面的漏洞是VLMs的一个新的关键攻击面。
Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving
Authors: Lingdong Kong, Xiang Xu, Jiawei Ren, Wenwei Zhang, Liang Pan, Kai Chen, Wei Tsang Ooi, Ziwei Liu
First: 2024-05-08T17:59:53+00:00 · Latest: 2025-12-05T12:51:41+00:00
Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Abstract
Efficient data utilization is crucial for advancing 3D scene understanding in autonomous driving, where reliance on heavily human-annotated LiDAR point clouds challenges fully supervised methods. Addressing this, our study extends into semi-supervised learning for LiDAR semantic segmentation, leveraging the intrinsic spatial priors of driving scenes and multi-sensor complements to augment the efficacy of unlabeled datasets. We introduce LaserMix++, an evolved framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to further assist data-efficient learning. Our framework is tailored to enhance 3D scene consistency regularization by incorporating multi-modality, including 1) multi-modal LaserMix operation for fine-grained cross-sensor interactions; 2) camera-to-LiDAR feature distillation that enhances LiDAR feature learning; and 3) language-driven knowledge guidance generating auxiliary supervisions using open-vocabulary models. The versatility of LaserMix++ enables applications across LiDAR representations, establishing it as a universally applicable solution. Our framework is rigorously validated through theoretical analysis and extensive experiments on popular driving perception datasets. Results demonstrate that LaserMix++ markedly outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations and significantly improving the supervised-only baselines. This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
中文标题/摘要
标题:多模态数据高效自主驾驶3D场景理解
在自主驾驶中,3D场景理解的进步依赖于高效的数据利用,而对大量人工标注的LiDAR点云的依赖挑战了完全监督方法。为解决这一问题,我们的研究扩展到LiDAR语义分割的半监督学习,利用驾驶场景的固有空间先验和多传感器互补来增强未标注数据的有效性。我们引入了LaserMix++,这是一种集成不同LiDAR扫描的激光束操作并结合LiDAR-相机对应关系的框架,进一步辅助数据高效学习。该框架通过结合多模态,包括1)多模态LaserMix操作以实现细粒度的跨传感器交互;2)相机到LiDAR特征蒸馏以增强LiDAR特征学习;3)语言驱动的知识指导,使用开放词汇模型生成辅助监督。LaserMix++的灵活性使其适用于各种LiDAR表示,使其成为一种通用解决方案。我们的框架通过理论分析和在流行驾驶感知数据集上的广泛实验得到了严格验证。结果表明,LaserMix++显著优于完全监督方法,仅需五分之一的标注即可达到相当的准确性,并且显著提高了仅监督基线。这一重大进展突显了半监督方法在减少基于LiDAR的3D场景理解系统对大量标注数据依赖方面的潜力。
Summary / 总结
The study addresses the challenge of data efficiency in 3D scene understanding for autonomous driving by proposing LaserMix++, a semi-supervised learning framework that leverages spatial priors and multi-sensor data to enhance LiDAR semantic segmentation. The framework integrates laser beam manipulations and camera-to-LiDAR feature distillation to improve data efficiency, and incorporates language-driven knowledge guidance for auxiliary supervision. Experiments on popular driving perception datasets show that LaserMix++ reaches accuracy comparable to fully supervised training with five times fewer annotations and significantly improves the supervised-only baselines, demonstrating the potential of semi-supervised approaches in reducing data annotation needs.
研究提出了一种半监督学习框架LaserMix++,通过利用多传感器数据和场景的内在空间先验来解决自主驾驶中3D场景理解的数据效率问题。该框架整合了激光束操作和LiDAR-相机对应关系,以增强未标注数据的利用,并引入了多模态操作和基于语言的知识指导,以提高3D场景一致性。实验结果表明,LaserMix++在流行的驾驶感知数据集上显著优于全监督方法,且使用更少的标注数据,展示了半监督方法在减少对大量标注数据依赖方面的潜力。
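The "laser beam manipulations from disparate LiDAR scans" mentioned in the LaserMix++ abstract can be pictured with a small sketch. It assumes the mixing works by splitting each scan into inclination (elevation-angle) bands and interleaving bands from two scans; the function names, the four-band split, and the -25 to 3 degree field of view are illustrative assumptions, not details taken from the abstract. The camera and language components are not modelled.

```python
import math


def inclination(point):
    """Elevation angle of a LiDAR point (x, y, z, ...) in radians."""
    x, y, z = point[:3]
    return math.atan2(z, math.hypot(x, y))


def area_index(point, num_areas, phi_min, phi_max):
    """Index of the inclination band a point falls into (clamped to range)."""
    frac = (inclination(point) - phi_min) / (phi_max - phi_min)
    return min(num_areas - 1, max(0, int(frac * num_areas)))


def mix_by_inclination(scan_a, scan_b, num_areas=4,
                       phi_min=math.radians(-25.0), phi_max=math.radians(3.0)):
    """Interleave two scans band by band (assumed scheme, for illustration).

    Even-numbered bands are taken from scan_a, odd-numbered bands from
    scan_b. Each point is a tuple (x, y, z, label).
    """
    mixed = [p for p in scan_a
             if area_index(p, num_areas, phi_min, phi_max) % 2 == 0]
    mixed += [p for p in scan_b
              if area_index(p, num_areas, phi_min, phi_max) % 2 == 1]
    return mixed


def make_point(angle_deg, label):
    """Toy point 10 m ahead of the sensor at the given elevation angle."""
    return (10.0, 0.0, 10.0 * math.tan(math.radians(angle_deg)), label)


# One toy point per band in each scan; labels record the source scan.
scan_a = [make_point(a, "a") for a in (-20, -15, -8, 0)]
scan_b = [make_point(a, "b") for a in (-20, -15, -8, 0)]
mixed = mix_by_inclination(scan_a, scan_b)
mixed_labels = [p[3] for p in mixed]
```

In a semi-supervised setup, one scan would typically carry ground-truth labels and the other pseudo-labels, so the mixed scan combines both kinds of supervision in a single sample.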
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM
Authors: Sunghyun Ahn, Youngwan Jo, Kijung Lee, Sein Kwon, Inpyo Hong, Sanghyun Park
Venue: WACV 2026
First: 2025-03-06T14:52:34+00:00 · Latest: 2025-12-05T11:34:09+00:00
Comments: Accepted to WACV 2026
Abstract
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD treats user-defined text as an abnormal event and detects frames containing the specified event in a video. We implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive results on VAD benchmarks, achieving state-of-the-art performance on UBnormal and UCF-Crime and surpassing other methods in generalization across all datasets. Our code is available online at github.com/SkiddieAhn/Paper-AnyAnomaly.
中文标题/摘要
标题:AnyAnomaly: LVLM驱动的零样本可定制视频异常检测
视频异常检测(VAD)对于计算机视觉中的视频分析和监控至关重要。然而,现有的VAD模型依赖于学习到的正常模式,这使得它们难以应用于多种环境。因此,用户需要重新训练模型或为新环境开发单独的AI模型,这需要机器学习专业知识、高性能硬件和大量数据收集,限制了VAD的实际可用性。为了解决这些挑战,本研究提出了可定制视频异常检测(C-VAD)技术和AnyAnomaly模型。C-VAD将用户定义的文本视为异常事件,并检测视频中包含指定事件的帧。我们通过上下文感知的视觉问答有效实现了AnyAnomaly,无需微调大型视觉语言模型。为了验证所提模型的有效性,我们构建了C-VAD数据集,并展示了AnyAnomaly的优越性。此外,我们的方法在VAD基准测试中表现出竞争力,分别在UBnormal和UCF-Crime上达到了最先进的性能,并在所有数据集上的泛化能力超过了其他方法。我们的代码已在线发布在github.com/SkiddieAhn/Paper-AnyAnomaly上。
Summary / 总结
The study addresses the challenge of applying video anomaly detection (VAD) models to diverse environments by proposing a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. AnyAnomaly uses a context-aware visual question answering approach without fine-tuning a large vision language model, allowing users to define abnormal events through text. Experimental results show that AnyAnomaly is competitive on VAD benchmarks, achieving state-of-the-art performance on UBnormal and UCF-Crime and generalizing better than other methods across datasets.
该研究通过提出可定制视频异常检测(C-VAD)技术和AnyAnomaly模型来解决将视频异常检测(VAD)模型应用于不同环境的挑战。该模型采用上下文感知的视觉问答方法,无需微调大型视觉语言模型,允许用户通过文本定义异常事件。实验结果表明,AnyAnomaly在VAD基准测试中优于其他方法,在UBnormal和UCF-Crime上达到最先进的性能,并且在所有数据集上具有很强的泛化能力。
CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
Authors: Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng
First: 2025-12-03T08:01:48+00:00 · Latest: 2025-12-05T10:31:39+00:00
Comments: Accepted by ACM Multimedia 2025
Abstract
Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
中文标题/摘要
标题:CookAnything:一种灵活且一致的多步骤食谱图像生成框架
烹饪是一种顺序性和视觉导向的活动,其中每一步如切菜、搅拌或炒菜都包含程序逻辑和视觉语义。虽然最近的扩散模型在文本到图像生成方面表现出强大的能力,但在处理如食谱插图这样的结构化多步骤场景时却力不从心。此外,当前的食谱插图方法无法适应食谱长度的自然变化,无论实际指令结构如何,都会生成固定数量的图像。为了解决这些限制,我们提出了CookAnything,一种灵活且一致的基于扩散的框架,可以从任意长度的文本烹饪说明中生成连贯且语义上不同的图像序列。该框架引入了三个关键组件:(1) 步骤区域控制(SRC),在单个去噪过程中将文本步骤与相应的图像区域对齐;(2) 灵活RoPE,一种步骤感知的位置编码机制,增强了时间连贯性和空间多样性;(3) 跨步骤一致性控制(CSCC),在步骤之间保持细粒度的食材一致性。在食谱插图基准上的实验结果表明,CookAnything在基于训练和非基于训练的设置中都优于现有方法。所提出的框架支持复杂多步骤指令的可扩展、高质量的视觉合成,并在教学媒体和程序化内容创作方面具有广泛的应用潜力。
Summary / 总结
CookAnything is a framework designed to generate coherent and semantically distinct image sequences from textual cooking instructions, addressing the limitations of existing methods in handling multi-step scenarios. It introduces three key components: Step-wise Regional Control (SRC), Flexible RoPE, and Cross-Step Consistency Control (CSCC). Experimental results demonstrate that CookAnything outperforms existing methods in both training-based and training-free settings, supporting scalable and high-quality visual synthesis of complex multi-step instructions.
CookAnything 是一个框架,旨在从文本烹饪说明生成连贯且语义上不同的图像序列,解决了现有扩散模型在处理多步场景时的局限性。它引入了三个关键组件:步骤区域控制 (SRC)、灵活的 RoPE 和跨步骤一致性控制 (CSCC)。实验结果表明,CookAnything 在训练依赖和非依赖设置中均优于现有方法,支持复杂多步指令的视觉合成。
3D Question Answering via only 2D Vision-Language Models
Authors: Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun
First: 2025-05-28T09:04:39+00:00 · Latest: 2025-12-05T10:05:57+00:00
Comments: ICML2025
Abstract
Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential to address 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Due to the limited training data in 3D, we do not train LVLMs but infer in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. Once the 2D model is chosen (e.g., LLAVA-OV), the quality of the sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks.
中文标题/摘要
标题:仅通过2D视觉-语言模型实现3D问答
大型视觉-语言模型(LVLMs)在众多领域取得了显著进展。在本文中,我们探讨如何利用它们的潜力来解决3D场景理解任务,以3D问答(3D-QA)为例。由于3D数据的训练数据有限,我们没有训练LVLMs,而是以零样本的方式进行推理。具体来说,我们从3D点云中采样2D视图,并将它们输入2D模型以回答给定的问题。当选择2D模型时,例如LLAVA-OV,采样的视图质量最为重要。我们提出了cdViews,这是一种新颖的方法,用于自动选择关键且多样的视图以进行3D-QA。cdViews 包含两个关键组件:viewSelector,基于其提供答案特定信息的潜力来优先选择关键视图;viewNMS,通过基于空间重叠去除冗余视图来增强多样性。我们在广泛使用的ScanQA和SQA基准上评估了cdViews,证明它在仅依赖2D模型而无需微调的情况下实现了3D-QA的最新性能。这些发现支持我们相信2D LVLMs目前是解决3D任务最有效的替代方案(相对于资源密集型的3D LVLMs)。
Summary / 总结
This work explores using large vision-language models (LVLMs) for 3D question answering (3D-QA) in a zero-shot manner, by sampling 2D views from 3D point clouds and feeding them into 2D models. The authors propose cdViews, a method that selects critical and diverse views to improve performance. cdViews outperforms existing methods on ScanQA and SQA benchmarks, showing that 2D LVLMs can effectively address 3D tasks without fine-tuning.
本文探讨了使用大型视觉-语言模型(LVLMs)通过从3D点云中采样2D视图并将其输入LLAVA-OV等2D模型来进行3D问答(3D-QA)的方法。提出的cdViews方法选择关键且多样的视图以提高答案质量。在ScanQA和SQA基准上的评估表明,cdViews在无需微调的情况下实现了最先进的性能,这表明2D LVLMs对于3D任务是有效的。
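The cdViews abstract describes viewNMS as removing redundant views based on spatial overlap. The sketch below illustrates that idea with greedy non-maximum suppression. It assumes each candidate view comes with a relevance score (such as a view selector might produce) and the set of point-cloud indices it sees, and that overlap is measured as the Jaccard index of those sets. The name `view_nms`, the data layout, and the 0.5 threshold are hypothetical choices, not the authors' implementation.

```python
def jaccard(a, b):
    """Intersection-over-union of two sets of visible point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def view_nms(views, overlap_threshold=0.5, max_views=None):
    """Greedy non-maximum suppression over candidate views.

    views: list of (view_id, score, visible_points) tuples, where
    visible_points is a set of point indices (hypothetical layout).
    A view is kept only if its overlap with every already-kept view
    is at most overlap_threshold. Returns kept ids, best score first.
    """
    kept = []
    for view_id, _score, points in sorted(views, key=lambda v: v[1], reverse=True):
        if all(jaccard(points, kept_points) <= overlap_threshold
               for _, kept_points in kept):
            kept.append((view_id, points))
        if max_views is not None and len(kept) >= max_views:
            break
    return [view_id for view_id, _ in kept]


candidates = [
    ("v0", 0.9, {1, 2, 3, 4}),
    ("v1", 0.8, {1, 2, 3, 5}),   # shares 3 of 5 points with v0 (overlap 0.6)
    ("v2", 0.7, {10, 11, 12}),   # sees a different region
]
selected = view_nms(candidates)
```

Here `v1` is suppressed because it overlaps too much with the higher-scoring `v0`, while `v2` survives because it covers a different part of the scene.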
CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning
Authors: Ting-Ting Xie, Yixin Zhang
Venue: NeurIPS 2025
First: 2025-12-05T09:56:58+00:00 · Latest: 2025-12-05T09:56:58+00:00
Comments: 2nd Place Solution to the CURE-Bench Competition @ NeurIPS 2025. Code available at https://github.com/June01/CureAgent
Abstract
Current clinical agents built on small LLMs, such as TxAgent, suffer from a Context Utilization Failure, where models successfully retrieve biomedical evidence thanks to supervised finetuning but fail to ground their diagnosis in that information. In this work, we propose the Executor-Analyst Framework, a modular architecture that decouples the syntactic precision of tool execution from the semantic robustness of clinical reasoning. By orchestrating specialized TxAgents (Executors) with long-context foundation models (Analysts), we mitigate the reasoning deficits observed in monolithic models. Beyond simple modularity, we demonstrate that a Stratified Ensemble strategy significantly outperforms global pooling by preserving evidentiary diversity, effectively addressing the information bottleneck. Furthermore, our stress tests reveal critical scaling insights: (1) a Context-Performance Paradox, where extending reasoning contexts beyond 12k tokens introduces noise that degrades accuracy; and (2) the Curse of Dimensionality in action spaces, where expanding toolsets necessitates hierarchical retrieval strategies. Crucially, our approach underscores the potential of training-free architectural engineering, achieving state-of-the-art performance on CURE-Bench without the need for expensive end-to-end finetuning. This provides a scalable, agile foundation for the next generation of trustworthy AI-driven therapeutics. Code has been released at https://github.com/June01/CureAgent.
中文标题/摘要
标题:CureAgent:一种无需训练的执行-分析师框架用于临床推理
当前基于小型LLM的临床代理,如TxAgent,遭受了“上下文利用失败”的问题,模型在监督微调后能够成功检索生物医学证据,但在将其诊断与这些信息联系起来时却失败了。本文提出了一种执行-分析师框架,这是一种模块化架构,将工具执行的句法精确性与临床推理的语义稳健性分离。通过协调专门的TxAgents(执行者)与长上下文基础模型(分析师),我们减轻了单体模型中观察到的推理缺陷。除了简单的模块化之外,我们还证明了分层集成策略显著优于全局聚合,因为它保留了证据多样性,有效地解决了信息瓶颈问题。此外,我们的压力测试揭示了关键的扩展见解:(1)“上下文-性能悖论”,其中将推理上下文扩展到12k个标记以上引入了噪声,降低了准确性;(2)在动作空间中的“维度灾难”,其中扩展工具集需要分层检索策略。至关重要的是,我们的方法强调了无需训练的架构工程的潜力,无需昂贵的端到端微调即可在CURE-Bench上实现最先进的性能。代码已发布在https://github.com/June01/CureAgent。
Summary / 总结
CureAgent addresses the Context Utilization Failure in clinical agents by proposing an Executor-Analyst Framework that separates the syntactic precision of tool execution from the semantic robustness of clinical reasoning. It uses specialized TxAgents (Executors) and long-context foundation models (Analysts) to mitigate reasoning deficits. The framework also employs a Stratified Ensemble strategy to preserve evidentiary diversity and outperforms global pooling. Stress tests revealed a Context-Performance Paradox and the Curse of Dimensionality, indicating that extending reasoning contexts and expanding toolsets require careful management. CureAgent achieves state-of-the-art performance on CURE-Bench without end-to-end finetuning, providing a scalable foundation for trustworthy AI-driven therapeutics.
研究针对基于小语言模型如TxAgent的临床代理存在的上下文利用失败问题。提出了执行者-分析师框架,将工具执行的语法精确性和临床推理的语义稳健性分离。通过使用专门的TxAgents(执行者)和长上下文基础模型(分析师),该框架缓解了推理缺陷。研究还表明,分层集成策略优于全局聚合,并识别了关键的扩展洞察,如上下文-性能悖论和维度诅咒。该方法在CURE-Bench上实现了最先进的性能,无需端到端微调,为可信的AI驱动疗法提供了可扩展的基础。
MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging
Authors: Xingyu Zhang, Anna Reithmeir, Fryderyk Kögl, Rickmer Braren, Julia A. Schnabel, Daniel M. Lang
First: 2025-12-05T09:53:07+00:00 · Latest: 2025-12-05T09:53:07+00:00
Abstract
Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration, without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.
中文标题/摘要
标题:MedDIFT:基于多尺度扩散的3D医学成像配准
医学图像之间的准确空间对应对于纵向分析、病灶追踪和图像引导干预至关重要。医学图像配准方法依赖于局部强度相似性度量,无法捕捉全局语义结构,常在低对比度或解剖变异区域产生错配。最近在扩散模型方面的进展表明,它们的中间表示编码了丰富的几何和语义信息。我们提出MedDIFT,这是一种无需训练的3D对应框架,利用预训练的医学扩散模型的多尺度特征作为体素描述符。MedDIFT 将扩散激活融合到丰富的体素级描述符中,并通过余弦相似性进行匹配,可选地加入局部搜索先验。在公开的肺部CT数据集上,MedDIFT 的对应精度与基于学习的UniGradICON模型相当,并优于传统的B样条配准方法,无需任何特定任务的模型训练。消融实验表明,多级特征融合和适度的扩散噪声可提高性能。
Summary / 总结
MedDIFT is a training-free 3D correspondence framework that uses multi-scale features from a pretrained latent medical diffusion model to generate voxel descriptors. It matches these descriptors using cosine similarity with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves comparable correspondence accuracy to the state-of-the-art UniGradICON model and outperforms traditional B-spline-based registration methods. Ablation studies show that multi-level feature fusion and moderate diffusion noise enhance performance.
MedDIFT 是一个无需训练的 3D 对应框架,利用预训练的医疗扩散模型的多尺度特征生成体素描述符,并使用余弦相似性进行匹配,可选地加入局部搜索先验。在肺部 CT 数据集上,MedDIFT 的对应精度与最先进的 UniGradICON 模型相当,并优于传统的 B-样条基于的注册方法。消融实验表明,多级特征融合和扩散噪声的引入可以提升性能。
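The matching step in the MedDIFT abstract (voxel descriptors compared by cosine similarity, with an optional local-search prior) can be sketched in a few lines. The sketch assumes descriptors are already available as plain feature vectors keyed by voxel coordinate, and it models the local-search prior as a simple distance window around the query position. The function names, the dictionary layout, and the `radius` parameter are illustrative assumptions; extracting features from a diffusion model is out of scope here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_voxel(query_xyz, fixed_desc, moving_desc, radius=None):
    """Find the moving-image voxel whose descriptor best matches the query.

    fixed_desc / moving_desc: dict mapping (x, y, z) -> feature list.
    radius: if given, only candidates within this Chebyshev distance of
    the query position are considered (a simple local-search window).
    """
    query = fixed_desc[query_xyz]
    best_xyz, best_sim = None, -2.0
    for xyz, feat in moving_desc.items():
        if radius is not None and max(
            abs(a - b) for a, b in zip(xyz, query_xyz)
        ) > radius:
            continue
        sim = cosine(query, feat)
        if sim > best_sim:
            best_xyz, best_sim = xyz, sim
    return best_xyz, best_sim


fixed = {(5, 5, 5): [1.0, 0.0, 0.2]}
moving = {
    (5, 6, 5): [0.9, 0.1, 0.2],    # nearby, similar descriptor
    (30, 2, 9): [1.0, 0.0, 0.2],   # identical descriptor, but far away
    (4, 5, 5): [0.0, 1.0, 0.0],    # nearby, dissimilar descriptor
}
global_match, _ = match_voxel((5, 5, 5), fixed, moving)
local_match, _ = match_voxel((5, 5, 5), fixed, moving, radius=2)
```

Without the window, the distant voxel with an identical descriptor wins; with it, the search is restricted to anatomically plausible neighbours, which is one way a local prior can reduce mismatches in repetitive structures.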
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
Authors: Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang
First: 2025-12-05T09:39:26+00:00 · Latest: 2025-12-05T09:39:26+00:00
Abstract
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
中文标题/摘要
标题:ProPhy: 渐进物理对齐以实现动态世界模拟
近期生成视频方面的进展展示了构建世界模拟器的巨大潜力。然而,当前模型仍然难以生成物理上一致的结果,尤其是在处理大规模或复杂动力学时。这一限制主要源于现有方法对物理提示的各向同性响应以及忽视生成内容与局部物理线索之间的细粒度对齐。为了解决这些挑战,我们提出了ProPhy,一种渐进物理对齐框架,能够实现显式的物理感知条件和各向异性生成。ProPhy采用了一种两阶段物理专家混合机制(MoPE)来提取判别物理先验,其中语义专家从文本描述中推断语义级物理原理,细化专家捕捉标记级物理动态。该机制使模型能够学习更精细的、物理感知的视频表示,更好地反映基本物理定律。此外,我们引入了一种物理对齐策略,将视觉语言模型(VLMs)的物理推理能力转移到细化专家中,从而更准确地表示动态物理现象。在物理感知视频生成基准上的广泛实验表明,ProPhy生成的结果更加真实、动态且物理上一致。
Summary / 总结
ProPhy is a Progressive Physical Alignment Framework designed to enhance the physical consistency of dynamic world simulations. It uses a two-stage Mixture-of-Physics-Experts (MoPE) mechanism to extract discriminative physical priors, with Semantic Experts inferring physical principles from text and Refinement Experts capturing detailed physical dynamics. ProPhy also incorporates a physical alignment strategy to improve the accuracy of dynamic physical phenomena. Experiments show that ProPhy generates more realistic and physically coherent results compared to existing methods.
ProPhy 是一种渐进物理对齐框架,旨在提高动态世界模拟的物理一致性。它使用两阶段的物理专家混合机制(MoPE)来提取判别性的物理先验,其中语义专家从文本中推断物理原理,细化专家捕捉详细的物理动态。ProPhy 还结合了视觉语言模型中的物理对齐策略,以提高动态物理现象的准确性。实验表明,ProPhy 生成的动态物理现象更加真实且物理上更一致。
TabletopGen: Instance-Level Interactive 3D Tabletop Scene Generation from Text or Single Image
Authors: Ziqian Wang, Yonghao He, Licheng Yang, Wei Zou, Hongxuan Ma, Liu Liu, Wei Sui, Yuxin Guo, Hu Su
First: 2025-12-01T02:38:52+00:00 · Latest: 2025-12-05T09:22:27+00:00
Comments: Project page: https://d-robotics-ai-lab.github.io/TabletopGen.project/
Abstract
Generating high-fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI -- especially for robotic manipulation policy learning and data synthesis. However, current text- or image-driven 3D scene generation methods mainly focus on large-scale scenes, struggling to capture the high-density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training-free, fully automatic framework that generates diverse, instance-level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text-to-image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per-instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision-free, simulation-ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top-view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from 2D reference. Extensive experiments and user studies show that TabletopGen achieves state-of-the-art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.
中文标题/摘要
标题:TabletopGen:从文本或单张图像生成实例级交互式3D桌面场景
生成高保真、物理交互的3D模拟桌面场景对于体态AI至关重要,尤其是对于机器人操作策略学习和数据合成。然而,当前基于文本或图像的3D场景生成方法主要集中在大规模场景上,难以捕捉桌面场景中高密度布局和复杂的空间关系。为了解决这些挑战,我们提出TabletopGen,这是一种无需训练、全自动的框架,可以生成多样化的实例级交互式3D桌面场景。TabletopGen接受参考图像作为输入,该图像可以通过文本到图像模型合成以增强场景多样性。然后我们对参考图像进行实例分割和完成,以获取每个实例的图像。每个实例随后被重建为3D模型,并进行标准坐标对齐。对齐后的3D模型进行姿态和尺度估计,然后组装成一个无碰撞、可模拟的桌面场景。我们框架的关键组件是一种新颖的姿态和尺度对齐方法,将复杂的空间推理分解为两个阶段:可微旋转优化器用于精确的旋转恢复,以及俯视图空间对齐机制用于稳健的平移和尺度估计,从而实现从2D参考图像到3D重建的准确重建。广泛的实验和用户研究显示,TabletopGen在视觉保真度、布局准确性和物理合理性方面达到了最先进的性能,能够生成具有丰富风格和空间多样性的逼真桌面场景。我们的代码将公开发布。
Summary / 总结
TabletopGen is a training-free framework that generates diverse, interactive 3D tabletop scenes from a reference image. It uses instance segmentation and completion to obtain per-instance images, reconstructs them into 3D models, and aligns them using a novel pose and scale alignment approach. Experiments show that TabletopGen outperforms existing methods in visual fidelity, layout accuracy, and physical plausibility, generating realistic and diverse tabletop scenes.
TabletopGen 是一个无需训练的框架,可以从参考图像生成多样化的交互式 3D 台面场景。它使用实例分割和完成来创建每个实例的图像,然后将这些图像重建为 3D 模型并使用一种新颖的姿态和尺度对齐方法进行对齐。实验表明,TabletopGen 在视觉保真度、布局准确性和物理合理性方面优于现有方法,能够生成逼真且多样的台面场景。
Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models
Authors: Weijue Bu, Guan Yuan, Guixian Zhang
First: 2025-12-05T09:07:55+00:00 · Latest: 2025-12-05T09:07:55+00:00
Comments: 6 pages, 6 figures
Abstract
Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.
中文标题/摘要
标题:有意识的目光:视觉语言模型中幻觉缓解的自适应注意力机制
大型视觉语言模型(VLMs)经常表现出文本惯性,注意力从视觉证据转向语言先验,导致物体幻觉。现有的解码策略仅在输出logits处介入,因此无法纠正内部推理漂移,而最近基于启发式头抑制或全局引导向量的内部控制方法缺乏原则性基础。我们引入了有意识的目光(CG-VLM),这是一种无需训练、在推理时使用的框架,将博弈论可解释性转化为可操作的解码控制。基于Harsanyi互动构建的认知需求传感器估计即时的视觉-文本协同效应,并识别视觉接地必要的时刻。根据此信号,聚焦共识诱导模块在文本先验合并之前,选择性地重新定向中间层注意力指向视觉标记。CG-VLM 在 InstructBLIP、LLaVA、Qwen-VL 和 mPLUG 上在 POPE 和 CHAIR 上取得了最先进的结果,同时保留了通用能力,表明标记级感知使精确、上下文相关的干预成为可能,而不牺牲基础知识。
Summary / 总结
The paper addresses the issue of object hallucinations in Vision-Language Models (VLMs) due to attention drift towards linguistic priors. It introduces Conscious Gaze (CG-VLM), a training-free framework that uses a Cognitive Demand Sensor to estimate the need for visual grounding and a Focused Consensus Induction module to reorient mid-layer attention towards visual tokens. CG-VLM outperforms existing methods on POPE and CHAIR benchmarks while maintaining general capabilities, showing that token-level sensing can enable precise, context-aware intervention without losing foundational knowledge.
研究旨在解决视觉语言模型(VLM)中由于注意力偏向语言先验而导致的物体幻觉问题。提出的Conscious Gaze (CG-VLM)框架在推理时无需训练,使用认知需求传感器来估计需要视觉接地的时刻,并使用聚焦共识诱导模块选择性地将中间层注意力重新导向视觉令牌。该方法在InstructBLIP、LLaVA、Qwen-VL和mPLUG等多个VLM上实现了最先进的结果,表明这种干预是精确且上下文相关的,同时不牺牲基础知识。
Enabling Validation for Robust Few-Shot Recognition
Authors: Hanxin Wang, Tian Liu, Shu Kong
First: 2025-06-05T07:37:15+00:00 · Latest: 2025-12-05T08:56:48+00:00
Comments: Project website: https://hannawang09.github.io/projects/vest/
Abstract
Few-Shot Recognition (FSR) tackles classification tasks by training with minimal task-specific labeled data. Prevailing methods adapt or finetune a pretrained Vision-Language Model (VLM) and augment the scarce training data by retrieving task-relevant but noisy samples from open data sources. The finetuned VLM generalizes reasonably well to the task-specific in-distribution (ID) test data but struggles with out-of-distribution (OOD) test data. This motivates our study of robust FSR with VLM finetuning. The core challenge of FSR is data scarcity, extending beyond limited training data to a complete lack of validation data. We identify a key paradox as a potential solution: repurposing the retrieved open data for validation. As such retrieved data are inherently OOD compared with the task-specific ID training data, finetuned VLMs yield degraded performance on the retrieved data. This causes the validation logic to favor the pretrained model without any finetuning, hindering improvements with respect to generalization. To resolve this dilemma, we introduce a novel validation strategy that harmonizes performance gain and degradation on the few-shot ID data and the retrieved data, respectively. Our validation enables parameter selection for partial finetuning and checkpoint selection, mitigating overfitting and improving test-data generalization. We unify this strategy with robust learning into a cohesive framework: Validation-Enabled Stage-wise Tuning (VEST). Extensive experiments on the established ImageNet OOD benchmarks show that VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.
中文标题/摘要
标题:利用验证增强鲁棒的少样本识别
少样本识别(FSR)通过使用少量任务特定标记数据进行训练来解决分类任务。现有方法通过调整或微调预训练的视觉-语言模型(VLM),并从开放数据源中检索相关但嘈杂的样本来扩充稀缺的训练数据。微调后的VLM在任务特定的分布内(ID)测试数据上表现良好,但在分布外(OOD)测试数据上表现不佳。这促使我们研究带有VLM微调的鲁棒FSR。FSR的核心挑战是数据稀缺,不仅限于有限的训练数据,还包括完全缺乏验证数据。我们发现一个关键悖论可能是潜在的解决方案:将检索到的开放数据用于验证。由于这些检索到的数据与任务特定的ID训练数据相比是固有的OOD,因此微调后的VLM在这些数据上的表现较差。这导致验证逻辑倾向于选择未微调的预训练模型,阻碍了泛化能力的提升。为了解决这一困境,我们提出了一种新的验证策略,该策略在少样本ID数据和检索数据上分别平衡性能提升和下降。我们的验证策略能够选择部分微调的参数和检查点,减轻过拟合并提高测试数据的泛化能力。我们将这种策略与鲁棒学习统一到一个综合框架中:验证增强阶段微调(VEST)。在建立的ImageNet OOD基准测试上的广泛实验表明,VEST显著优于现有的VLM适应方法,在ID和OOD数据上均实现了最先进的FSR性能。
Summary / 总结
This paper addresses the challenge of robust few-shot recognition (FSR) by proposing a validation strategy to mitigate overfitting. The motivation arises from the difficulty in obtaining validation data for finetuning Vision-Language Models (VLMs) due to data scarcity. The method involves repurposing retrieved open data for validation, which is inherently out-of-distribution (OOD) compared to the in-distribution (ID) training data. This strategy, named Validation-Enabled Stage-wise Tuning (VEST), enables parameter and checkpoint selection, improving generalization on both ID and OOD test data. Experiments on ImageNet OOD benchmarks demonstrate that VEST outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance.
本文旨在通过开发一种验证策略来解决少样本识别(FSR)的鲁棒性问题,以减轻过拟合。动机来自于在训练数据稀缺时验证模型的困难。方法引入了一种新的验证策略——阶段式调谐启用的验证(VEST),该策略在分布内和分布外数据上的性能之间取得平衡。实验表明,VEST在ImageNet分布外基准上优于现有方法,实现了在分布内和分布外数据上的最佳FSR性能。
RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs
Authors: Jonathan Geuter, Gregor Kornhardt
Venue: NeurIPS 2025
First: 2025-12-05T08:55:39+00:00 · Latest: 2025-12-05T08:55:39+00:00
Comments: 20 pages, 3 figures. 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Foundations of Reasoning in Language Models
Abstract
Best-of-$n$ is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, best-of-$n$ traditionally relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-$n$), a sequential multi-LLM alternative to the prevailing single-model best-of-$n$. Given a suite of models $\{m_i\}_{i=1}^M$, RoBoN sequentially routes generations one-by-one across models, based on scores computed using a reward model and an agreement signal on the predicted responses. This online routing requires no additional training, keeps compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-$n$ applied to each individual model for larger $n$, with gains of up to 3.4% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference to improve best-of-$n$ performance over any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.
中文标题/摘要
标题:RoBoN:路由在线最佳-n测试时缩放多LLM的方法
最佳-n是一种广泛用于LLM推理的测试时缩放方法。尽管有证据表明LLM在不同任务上表现出互补的优势,但传统上最佳-n依赖单一模型生成响应。我们提出了RoBoN(路由在线最佳-n),这是一种基于单一模型最佳-n的顺序多LLM替代方案。给定一组模型$\{m_i\}_{i=1}^M$,RoBoN基于奖励模型和预测响应的一致信号,顺序地将生成任务路由到各个模型。这种在线路由无需额外训练,保持计算量一致,并且可以与任何插件奖励模型一起使用。在推理基准测试(MATH500,奥林匹克竞赛题库,MinervaMath,GSM8K,MMLU)中,RoBoN在较大n值时始终优于单独模型应用的标准最佳-n,绝对准确率提高了高达3.4%,并且也优于均匀多模型组合基线。我们的结果表明,在推理时可以利用模型之间的多样性来提高最佳-n性能,超过任何单一模型的性能,提供了一种简单且无需训练的多LLM测试时缩放路径。
Summary / 总结
RoBoN is a test-time scaling approach that sequentially routes generations across multiple LLMs based on scores from a reward model and an agreement signal on the predicted responses. For larger $n$, it consistently outperforms standard best-of-$n$ applied to each individual model, with gains of up to 3.4% in absolute accuracy on reasoning benchmarks, and it also improves over a uniform multi-model portfolio baseline.
RoBoN 是一种新的 LLM 推断测试时扩展方法,它基于奖励模型评分和预测响应的一致性信号顺序使用多个模型。它在各种推理基准测试中优于传统最佳-of-$n$ 方法和均匀的多模型组合基线,最高可实现 3.4% 的准确率提升。
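The two baselines that the RoBoN abstract compares against, single-model best-of-$n$ and a uniform multi-model portfolio, can be sketched as follows. This is a minimal illustration and not the authors' routing rule: the toy models, the exact-match reward, and the round-robin split across models are all assumptions made for the example.

```python
import random
from typing import Callable, List

# A "model" here is any callable that samples one response for a prompt.
Model = Callable[[str, random.Random], str]
# A reward function scores a (prompt, response) pair; higher is better.
RewardFn = Callable[[str, str], float]


def best_of_n(prompt: str, model: Model, reward: RewardFn,
              n: int, rng: random.Random) -> str:
    """Standard best-of-n: sample n responses from one model, keep the top-scoring one."""
    candidates = [model(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda r: reward(prompt, r))


def uniform_portfolio_best_of_n(prompt: str, models: List[Model], reward: RewardFn,
                                n: int, rng: random.Random) -> str:
    """Uniform portfolio baseline (assumed round-robin): the same budget of n
    generations is spread evenly across the models before picking the best."""
    candidates = [models[i % len(models)](prompt, rng) for i in range(n)]
    return max(candidates, key=lambda r: reward(prompt, r))


# Hypothetical toy models and reward, for illustration only.
def always_five(prompt: str, rng: random.Random) -> str:
    return "5"


def always_four(prompt: str, rng: random.Random) -> str:
    return "4"


def exact_match_reward(prompt: str, response: str) -> float:
    # Stand-in for a learned reward model: rewards the answer "4".
    return 1.0 if response.strip() == "4" else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    print(best_of_n("2+2?", always_five, exact_match_reward, 4, rng))
    print(uniform_portfolio_best_of_n(
        "2+2?", [always_five, always_four], exact_match_reward, 4, rng))
```

In this toy setup the portfolio recovers the correct answer because one of its models can produce it, while best-of-$n$ on the weaker model alone cannot. RoBoN, as described in the abstract, replaces the fixed split with sequential, score-based routing.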
See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors
Authors: Kunyi Yang, Qingyu Wang, Cheng Yuan, Yutong Ban
First: 2025-12-05T08:41:42+00:00 · Latest: 2025-12-05T08:41:42+00:00
Comments: The first two authors contributed equally
Abstract
Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and proposes depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled pretrained visual feature and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 auto segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10--20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.
Summary / 总结
The research aims to address the challenge of pixel-wise segmentation of laparoscopic scenes in computer-assisted surgery by proposing a training-free framework called Depth-guided Surgical Scene Segmentation (DepSeg). This method leverages monocular depth as a geometric prior and pretrained vision foundation models. On the CholecSeg8k dataset, DepSeg outperforms a direct SAM2 auto segmentation baseline with a significant improvement in mean IoU (35.9% vs. 14.7%) and maintains good performance even with limited object templates, demonstrating the effectiveness of depth-guided prompting and template-based classification for annotation-efficient segmentation.
研究旨在通过提出一种无需训练的框架——深度引导的手术场景分割(DepSeg),解决腹腔镜场景像素级分割在计算机辅助手术中的挑战。该方法利用单目深度作为几何先验,并结合预训练的视觉基础模型。在CholecSeg8k数据集上,DepSeg在平均IoU(35.9% vs. 14.7%)上显著优于直接的SAM2自动分割基线,并且即使使用有限的对象模板也能保持良好的性能,展示了深度引导提示和基于模板的分类方法在注释高效分割中的有效性。
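The template-matching step that DepSeg uses to label class-agnostic masks can be sketched as a nearest-template lookup. This is a hedged illustration only: the abstract does not specify the similarity measure, so cosine similarity with a maximum over templates is an assumption, and the three-dimensional features and class names below are hypothetical placeholders for pooled pretrained visual features.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_mask(feature: Sequence[float],
                  template_bank: Dict[str, List[Sequence[float]]]) -> str:
    """Assign a mask the label of its most similar template.

    `feature` stands for the pooled visual feature of one mask;
    `template_bank` maps each class label to template features
    built from annotated frames.
    """
    if not template_bank:
        raise ValueError("template bank is empty")
    best_label, best_score = "", -math.inf
    for label, templates in template_bank.items():
        for template in templates:
            score = cosine_similarity(feature, template)
            if score > best_score:
                best_label, best_score = label, score
    return best_label


# Hypothetical toy template bank; real features would be high-dimensional.
TOY_BANK: Dict[str, List[Sequence[float]]] = {
    "liver": [[1.0, 0.0, 0.0]],
    "gallbladder": [[0.0, 1.0, 0.0]],
    "grasper": [[0.0, 0.0, 1.0]],
}

if __name__ == "__main__":
    print(classify_mask([0.9, 0.1, 0.0], TOY_BANK))  # prints "liver"
```

Because classification depends only on the template bank, shrinking the bank (for example to 10--20% of the object templates, as in the abstract) changes the lookup table without any retraining.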
VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation
Authors: Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
First: 2025-12-05T08:34:06+00:00 · Latest: 2025-12-05T08:34:06+00:00
Abstract
Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce the dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at https://github.com/LUNAProject22/VOST.
中文标题/摘要
标题:VOST-SGG: VLM辅助的一阶段时空场景图生成
时空场景图生成(ST-SGG)旨在建模视频帧中对象及其随时间演变的关系,为下游的视频描述和视觉问答等可解释表示任务提供支持。尽管最近在DETR风格的一阶段ST-SGG模型方面取得了进展,但它们仍然存在几个关键限制。首先,虽然这些模型依赖于基于注意力的学习查询作为核心组件,但这些学习查询在语义上是未受过训练的,并且实例无关地初始化。其次,这些模型完全依赖于单模态视觉特征进行谓词分类。为了解决这些挑战,我们提出了一种VLM辅助的一阶段ST-SGG框架,将视觉语言模型(VLM)的常识推理能力整合到ST-SGG管道中。首先,我们引入了双源查询初始化策略,将注意力于何处与关注什么分离,实现语义导向的什么-在哪里推理。此外,我们提出了一种多模态特征库,将从VLM中提取的视觉、文本和空间线索融合,以提高谓词分类效果。在Action Genome数据集上的广泛实验表明,我们的方法达到了最先进的性能,验证了将VLM辅助的语义先验和多模态特征整合到ST-SGG中的有效性。我们将代码发布在https://github.com/LUNAProject22/VOST。
Summary / 总结
The research aims to improve spatio-temporal scene graph generation (ST-SGG) by addressing limitations in existing DETR-style models, such as semantically uninformed learnable queries and reliance on unimodal visual features. VOST-SGG, a VLM-aided one-stage ST-SGG framework, introduces a dual-source query initialization strategy and a multi-modal feature bank to enhance semantic grounding and predicate classification. Experiments on the Action Genome dataset show that VOST-SGG outperforms existing methods, validating the benefits of integrating VLM-aided semantic priors and multi-modal features for ST-SGG.
研究旨在通过解决DETR风格模型中的问题,如语义不明确的学习查询和依赖单一视觉特征,来改进时空场景图生成(ST-SGG)。VOST-SGG通过将视觉语言模型(VLM)的能力整合进来,增强查询初始化和谓词分类,利用多模态特征库融合视觉、文本和空间线索。在Action Genome数据集上的实验表明,VOST-SGG优于现有方法,验证了VLM辅助语义先验和多模态特征对ST-SGG的有效性。
Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
Authors: Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
First: 2025-12-05T08:15:49+00:00 · Latest: 2025-12-05T08:15:49+00:00
Abstract
Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (Qwen, VideoLLaVA, GPT-4o, and Gemini, etc.) reveal that existing models struggle to "show what they know" and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at https://github.com/LUNAProject22/Know-Show.
中文标题/摘要
标题:知行合一:在时空定位推理方面对视频语言模型进行基准测试
大型视频语言模型(Video-LMs)在多模态理解方面取得了显著进展,但其推理在空间和时间上的定位仍然很弱。我们提出了知行合一(Know-Show),这是一个新的基准测试,旨在评估时空定位推理能力,即模型在同时将推理与视觉和时间证据联系起来的情况下,能够推理动作及其语义的能力。知行合一在空间(人物、物体、人物-物体、手-物体)和时间维度上统一了推理和定位,通过五个互补场景构建了一个单一的评估框架。该基准测试基于Charades、Action Genome和Ego4D,包含2500个人工撰写的问答,揭示了当前Video-LMs与人类推理之间的显著差距。为了弥合这一差距,我们提出了GRAM,这是一种无需训练的插件,通过基于注意力的视频标记选择和显式时间戳编码,为Video-LMs增加细粒度的定位。广泛的实验表明,现有的模型在开放和封闭的Video-LMs(Qwen、VideoLLaVA、GPT-4o、Gemini等)中都难以“展示它们所知道的”和“理解它们所展示的”,尤其是在细粒度的手-物体交互方面。知行合一为评估视频语言理解中的定位推理建立了一个统一的标准,并为开发可解释和可靠的多模态推理系统提供了见解。我们将在https://github.com/LUNAProject22/Know-Show/发布代码。
Summary / 总结
The research aims to evaluate the spatio-temporal grounded reasoning ability of Video-Language Models (Video-LMs) by introducing Know-Show, a new benchmark that unifies reasoning and localization in five scenarios. The benchmark highlights significant gaps between current Video-LMs and human reasoning, especially in fine-grained hand-object interactions. Experiments show that existing models struggle to 'show what they know' and vice versa. GRAM, a training-free plug-in, is proposed to enhance Video-LMs with fine-grained grounding. Know-Show provides a unified standard for assessing grounded reasoning in video-language understanding and offers insights for developing interpretable and reliable multimodal reasoning systems.
研究旨在评估视频语言模型(Video-LMs)在时空定位推理方面的能力,目前这一能力较弱。Know-Show 是一个新的基准,包含五个场景来评估这种能力,使用来自 Charades、Action Genome 和 Ego4D 的数据。该基准揭示了当前 Video-LMs 和人类推理之间的显著差距。提出了一个无需训练的插件 GRAM,以增强 Video-LMs 的细粒度定位。实验表明,现有模型在细粒度的手物交互方面存在困难,表明需要更好地在视频语言理解系统中实现定位推理。
InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity
Authors: Haoming Wang, Qiyao Xue, Wei Gao
First: 2025-11-22T22:05:39+00:00 · Latest: 2025-12-05T07:59:09+00:00
Abstract
Modern vision-language models (VLMs) are expected to reason spatially across diverse scene complexities, but evaluating this ability is difficult because existing benchmarks are not simultaneously diverse, scalable, and fully customizable. They offer limited control over scene complexity and cannot isolate and analyze specific VLM failure modes under distinct spatial conditions. To address this gap, rather than presenting separate benchmarks for different scene complexities, we present InfiniBench, a fully automated, customizable, and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control over scene complexity. InfiniBench uniquely translates natural-language scene descriptions into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks, including measurement, perspective-taking, and spatiotemporal tracking.
中文标题/摘要
标题:InfiniBench:基于可定制场景复杂度的视觉空间推理无限基准测试
现代视觉-语言模型(VLMs)被期望具备在不同场景复杂度下进行空间推理的能力,但由于缺乏既能多样化又能扩展且完全可定制的基准测试,评估这些能力变得困难。现有的基准测试在场景复杂度的可定制性方面有限,无法在不同的空间条件下隔离和分析特定的VLM故障模式。为解决这一问题,本文不单独呈现针对不同场景复杂度的基准测试,而是提出了InfiniBench,这是一种全自动、可定制且用户友好的基准测试生成器,能够合成理论上无限多样的3D场景,并通过参数化控制场景复杂度。InfiniBench独特地将自然语言中的场景描述转化为具有复杂且物理上合理的3D布局的逼真视频。这通过三个关键创新实现:1)基于LLM的代理框架,迭代细化从场景描述中生成的程序化场景约束;2)灵活的基于集群的布局优化器,生成先前无法通过程序化方法处理的密集和拥挤的场景;3)任务感知的摄像机轨迹优化方法,将场景渲染为VLM输入的视频,实现全方位物体覆盖。实验表明,InfiniBench在提示保真度和物理合理性方面优于最先进的程序化和基于LLM的3D生成方法,尤其是在高复杂度场景中。我们进一步展示了InfiniBench的实用性,通过生成代表性的空间推理任务基准测试,包括测量、视角转换和时空跟踪。
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Authors: Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich
Venue: NeurIPS 2025
First: 2025-09-23T20:25:53+00:00 · Latest: 2025-12-05T07:58:34+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues (object pose, lane positions, and object trajectories), which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM's outputs and provide accurate reasoning. Evaluations on four public zero-shot dash-cam video benchmarks show that iFinder's grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
中文标题/摘要
标题:iFinder:面向后置行车记录视频分析的结构化零样本视觉基础语言模型对接框架
将大型语言模型(LLMs)对接到如后置行车记录视频分析等特定领域任务具有挑战性,因为它们是通用训练的,缺乏结构化的归纳偏置。由于此类分析通常仅依赖视觉模态(即没有LiDAR、GPS等),现有的基于视频的视觉-语言模型(V-VLMs)在空间推理、因果推理和事件解释方面存在困难。为此,我们提出了iFinder,这是一种结构化语义对接框架,通过将行车记录视频转换为层次化、可解释的数据结构来解耦感知与推理。iFinder 作为模块化、无需训练的流水线,利用预训练的视觉模型提取关键线索——物体姿态、车道位置和物体轨迹,并将这些线索按层次组织成帧级和视频级结构。结合三块提示策略,它使LLM能够逐步、基于上下文地进行推理,以细化V-VLM的输出并提供准确的推理。在四个公开的后置行车记录视频基准测试上的评估表明,iFinder 提出的基于特定领域线索的对接,尤其是物体方向和全局上下文,显著优于端到端的V-VLMs,在四个零样本驾驶基准测试中,事故推理准确性最高可提高39%。通过使用驾驶领域特定的表示对接LLM,iFinder 提供了一种零样本、可解释且可靠的替代端到端V-VLMs的方案,用于后置行车记录视频理解。
Summary / 总结
iFinder is a structured semantic grounding framework designed to enhance zero-shot vision-based large language model (LLM) grounding for post-hoc dash-cam driving video analysis. It translates dash-cam videos into a hierarchical, interpretable data structure, enabling LLMs to perform step-wise, grounded reasoning. Evaluations on four public dash-cam video benchmarks demonstrate that iFinder significantly outperforms end-to-end vision-language models, achieving up to 39% gains in accident reasoning accuracy.
iFinder 是一种结构化的语义接地框架,旨在增强零样本基于视觉的大语言模型(LLM)在后处理行车记录仪视频分析中的推理能力。它通过将行车记录仪视频转换为层次化的可解释数据结构来分离感知和推理。iFinder 提取关键线索,如物体姿态、车道位置和物体轨迹,然后用于使 LLM 逐步进行接地推理。该框架在四个零样本驾驶基准测试中显著优于端到端的视觉-语言模型,事故推理准确性最高可提升 39%。
V-CECE: Visual Counterfactual Explanations via Conceptual Edits
Authors: Nikolaos Spanos, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Athanasios Voulodimos, Giorgos Stamou
Venue: NeurIPS 2025
First: 2025-09-20T07:53:06+00:00 · Latest: 2025-12-05T07:24:35+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior using Convolutional Neural Network (CNN), Vision Transformer (ViT), and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.
中文标题/摘要
标题:V-CECE: 视觉概念编辑的反事实解释
近期的黑盒反事实生成框架未能考虑所提编辑的语义内容,而主要依赖训练来指导生成过程。我们提出了一种新颖的、即插即用的黑盒反事实生成框架,该框架基于最优编辑的理论保证,逐步建议编辑以生成与人类水平相当的反事实解释,无需训练。该框架利用预训练的图像编辑扩散模型,并在不访问分类器内部结构的情况下运行,从而实现可解释的反事实生成过程。在我们的实验中,通过使用卷积神经网络(CNN)、视觉变换器(ViT)和大型视觉语言模型(LVLM)分类器,并通过全面的人类评估,展示了人类推理与神经模型行为之间的解释差距。
Summary / 总结
The research addresses a limitation of existing black-box counterfactual generation frameworks: they ignore the semantic content of their edits and rely heavily on training. The proposed V-CECE framework suggests step-by-step edits based on theoretical guarantees of optimal edits, producing human-level counterfactual explanations without training. Experiments with CNN, ViT, and LVLM classifiers, substantiated by a comprehensive human evaluation, expose the explanatory gap between human reasoning and neural model behavior.
研究针对现有黑盒反事实生成框架不考虑编辑的语义内容的局限性。提出了一种名为V-CECE的新框架,基于理论保证建议逐步编辑,无需训练即可生成人类级别的反事实解释。该框架使用预训练的图像编辑扩散模型,并不访问分类器的内部结构,提供了一种可解释的反事实生成过程。实验通过使用CNN、ViT和LVLM分类器展示了人类推理与神经模型行为之间的解释差距,并通过人类评估验证了结果。
Concept-based Explainable Data Mining with VLM for 3D Detection
Authors: Mai Tsujimoto
First: 2025-12-05T07:18:45+00:00 · Latest: 2025-12-05T07:18:45+00:00
Comments: 28 pages including appendix. Code: https://github.com/mm1129/concept_based_rare_detector_2025
Abstract
Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.
中文标题/摘要
标题:基于概念的可解释数据挖掘与VLM在3D检测中的应用
在自主驾驶系统中,稀有物体检测仍然是一个具有挑战性的任务,尤其是在仅依赖点云数据的情况下。尽管视觉-语言模型(VLMs)在图像理解方面表现出强大的能力,但它们通过智能数据挖掘增强3D物体检测的潜力尚未得到充分探索。本文提出了一种新颖的跨模态框架,利用2D VLMs从驾驶场景中识别和挖掘稀有物体,从而提高3D物体检测性能。我们的方法将物体检测、语义特征提取、降维和多方面离群点检测等互补技术综合成一个系统、可解释的管道,以系统地识别驾驶场景中的稀有但关键物体。通过结合孤立森林和基于t-SNE的离群点检测方法与基于概念的过滤,该框架有效地识别了具有语义意义的稀有物体。该方法的一个关键优势在于能够提取和标注目标稀有物体的概念,如施工车辆、摩托车和障碍物。这大大减少了标注负担,并仅关注最有价值的训练样本。在nuScenes数据集上的实验表明,这种基于概念的数据挖掘策略在使用少量训练数据的情况下提高了3D物体检测模型的性能,特别是在拖车和自行车等具有挑战性的物体类别上,与相同数量的随机数据相比,表现尤为突出。这一发现对安全关键的自主系统中数据集的高效整理具有重要意义。
Summary / 总结
This paper addresses the challenge of detecting rare objects in autonomous driving systems, particularly when relying solely on point cloud data. It introduces a cross-modal framework that uses 2D Vision-Language Models to identify and mine rare objects from driving scenes, improving 3D object detection performance. The framework combines object detection, semantic feature extraction, dimensionality reduction, and outlier detection with concept-based filtering to identify semantically meaningful rare objects, reducing the annotation burden. Experiments on the nuScenes dataset show that this approach enhances 3D object detection while using only a fraction of the training data, with particularly notable improvements for challenging categories such as trailers and bicycles compared with the same amount of randomly selected data.
该论文旨在利用3D点云数据解决自动驾驶系统中稀有物体检测的挑战。它提出了一种新的跨模态框架,结合2D视觉语言模型来识别和挖掘稀有物体,从而提升3D物体检测性能。该框架结合了目标检测、语义特征提取和离群点检测等技术,并使用孤立森林和t-SNE进行离群点检测。该方法能够有效识别具有语义意义的稀有物体,减少标注负担并专注于有价值的训练样本。实验表明,该方法在使用少量训练数据的情况下,特别是在拖车和自行车等具有挑战性的类别中,显著提升了3D物体检测性能,优于随机数据。
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Authors: Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao
First: 2025-11-06T17:07:49+00:00 · Latest: 2025-12-05T06:57:07+00:00
Comments: Github: https://github.com/MINT-SJTU/Evo-1
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive numbers of parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training as well as limited deployability for real-time inference. Moreover, most training paradigms degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suites, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
中文标题/摘要
标题:Evo-1:轻量级视觉-语言-行动模型,保留语义对齐
视觉-语言-行动(VLA)模型已成为一种强大的框架,能够统一感知、语言和控制,使机器人能够通过多模态理解执行多种任务。然而,当前的VLA模型通常包含大量参数,并且高度依赖大规模的机器人数据预训练,导致训练时计算成本高昂,实时推理的部署也受到限制。此外,大多数训练范式往往会降低视觉-语言骨干的感知表示,导致过拟合和下游任务泛化能力差。在本研究中,我们提出了Evo-1,这是一种轻量级的VLA模型,能够减少计算量并提高部署效率,同时保持强大的性能,无需使用机器人数据进行预训练。Evo-1基于一个原生的多模态视觉-语言模型(VLM),结合了一种新颖的交叉调制扩散变换器以及一个优化的集成模块,共同形成了一个有效的架构。我们还引入了一种两阶段的训练范式,逐步将行动与感知对齐,保留了VLM的表示。值得注意的是,仅包含7.7亿个参数的Evo-1在Meta-World和RoboTwin套件上达到了最先进的结果,分别超越了之前最佳模型12.4%和6.9%,并在LIBERO上也取得了有竞争力的结果,达到94.8%。在真实世界评估中,Evo-1以高推理频率和低内存开销实现了78%的成功率,超越了所有基线方法。我们发布了代码、数据和模型权重,以促进轻量级和高效VLA模型的未来研究。
Summary / 总结
Evo-1 is a lightweight VLA model that reduces computational costs and improves deployability while maintaining strong performance without pretraining on robot data. It builds on a native multimodal Vision-Language model and incorporates a novel cross-modulated diffusion transformer and an optimized integration module. The two-stage training paradigm progressively aligns action with perception, preserving the VLM representations. With only 0.77 billion parameters, Evo-1 surpasses the previous best models on the Meta-World and RoboTwin suites by 12.4% and 6.9%, respectively, and attains a competitive 94.8% on LIBERO. In real-world evaluations, it achieves a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods.
Evo-1 是一个轻量级的 VLA 模型,减少了计算成本并提高了部署效率,同时保持了强大的性能。它基于一个原生的多模态视觉-语言模型,并结合了一个新颖的交叉调制扩散变换器和一个优化的集成模块。两阶段的训练范式逐步将动作与感知对齐,保留了 VLM 的表示。仅含 7.7 亿参数的 Evo-1 在 Meta-World 和 RoboTwin 套件上分别超越了之前最佳模型 12.4% 和 6.9%,并在 LIBERO 上达到了 94.8% 的结果。在真实世界评估中,Evo-1 达到了 78% 的成功率,具有高推理频率和低内存开销,并且优于所有基线方法。