arXiv 论文速递

2025-12-04 03:31
Snapshot: 20251204_0331
DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Authors: Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Hongyang Li, Ya-Qin Zhang, Hao Zhao
First: 2025-12-02T18:29:18+00:00 · Latest: 2025-12-02T18:29:18+00:00
Abstract
Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
中文标题/摘要
标题:DGGT:使用未摆姿势图像进行动态驾驶场景的前馈4D重建
自动驾驶需要快速且可扩展的4D重建和重模拟来进行训练和评估,但大多数动态驾驶场景的方法仍然依赖于场景优化、已知的相机校准或短帧窗口,这使得它们速度慢且不实用。我们从前馈的角度重新审视了这个问题,并引入了**驾驶高斯接地变换器(DGGT)**,这是一种无需姿态的统一框架,用于动态场景重建。我们注意到现有的公式将相机姿态视为必需的输入,这限制了灵活性和可扩展性。相反,我们将姿态重新定义为模型的输出,从而可以直接从稀疏的、未摆姿势的图像中进行重建,并支持长序列中的任意数量的视角。我们的方法联合预测每帧的3D高斯图和相机参数,通过一个轻量级的动力学头分离动态,并通过一个寿命头随着时间调整可见性来保持时间一致性。基于扩散的渲染细化进一步减少了运动/插值伪影,并在稀疏输入下提高了新视角的质量。结果是一个单次通过、无需姿态的算法,实现了最先进的性能和速度。在大规模驾驶基准数据集(Waymo、nuScenes、Argoverse2)上进行训练和评估,我们的方法在每个数据集上训练时都优于先前的工作,并且在数据集之间进行零样本迁移时也表现出色,随着输入帧数的增加,其可扩展性也很好。
Summary / 总结
The research aims to address the need for fast and scalable 4D reconstruction for autonomous driving, which most existing methods fail to achieve due to their reliance on per-scene optimization and camera calibration. The authors introduce DGGT, a feedforward framework that treats camera pose as an output, enabling reconstruction from unposed images and supporting long sequences. Key findings include superior performance and speed compared to previous methods, especially in zero-shot transfer across different datasets and scalability with increasing input frames.
研究旨在解决自动驾驶中快速和可扩展的4D重建需求,提出了一种名为DGGT的前馈框架,可以从未标定的图像中重建动态场景,无需输入相机姿态。DGGT联合预测3D高斯图和相机参数,通过动态头和寿命头支持长序列,并使用基于扩散的渲染改进质量。在大规模驾驶数据集上的实验表明,DGGT在单数据集训练和跨数据集零样本迁移中均优于先前方法,并且随着输入帧数的增加具有良好的扩展性。
Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Authors: Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown
First: 2025-11-21T19:18:41+00:00 · Latest: 2025-12-02T18:13:35+00:00
Abstract
Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image in tasks such as counting. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g., number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while counting remains challenging for VLMs, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.
中文标题/摘要
标题:视觉语言模型能数数吗?一种合成基准和注意力干预分析
近期研究表明,视觉语言模型(VLMs)在回答有关图像视觉属性的问题时,往往依赖于训练过程中学到的固有偏见。当VLMs被要求回答需要它们聚焦于图像特定区域的高具体问题时,这种偏见会加剧,例如在计数任务中。我们在此基础上开发了一个合成基准数据集和评估框架,以系统地确定计数性能如何随着图像和提示属性的变化而变化。使用开源VLMs,我们分析了随着输入参数变化(如图像中的物体数量、物体颜色、背景颜色、物体纹理、背景纹理和提示的具体性),注意力分配如何波动。我们进一步实施了基于注意力的干预措施,以调节不同层面上对视觉标记的关注,并评估其对不同视觉条件下计数性能的影响。我们的实验表明,尽管VLM计数性能仍然具有挑战性,尤其是在高视觉或语言复杂性条件下,某些注意力干预措施可以带来适度的计数性能提升。
Summary / 总结
This study investigates the counting capabilities of Vision-Language Models (VLMs) by developing a synthetic benchmark and analyzing attention-based interventions. The research aims to understand how VLMs perform counting tasks under varying image and prompt conditions, revealing that while VLMs struggle with complex visual or linguistic scenarios, specific attention interventions can improve counting performance modestly.
研究通过开发合成基准和分析注意力干预措施,探讨了视觉语言模型(VLMs)的计数能力。研究旨在理解在不同图像和提示条件下,VLMs执行计数任务的表现,发现尽管VLMs在复杂视觉或语言场景下表现不佳,但特定的注意力干预措施可以适度提高计数性能。
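The attention-based intervention described in the counting entry above can be illustrated with a toy example. This is a minimal sketch, assuming the intervention rescales post-softmax attention on image-token positions and renormalizes; the function name `boost_visual_attention`, the token layout, and the factor `alpha` are hypothetical and not taken from the paper.

```python
def boost_visual_attention(weights, visual_idx, alpha):
    """Scale attention on visual-token positions by alpha, then renormalize.

    weights: post-softmax attention weights for one query (they sum to 1).
    visual_idx: positions of image tokens (hypothetical layout).
    alpha: > 1 strengthens focus on image tokens, < 1 weakens it.
    """
    visual = set(visual_idx)
    scaled = [w * alpha if i in visual else w for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Toy attention row: positions 0 and 1 are assumed to be image tokens.
weights = [0.1, 0.2, 0.3, 0.4]
boosted = boost_visual_attention(weights, visual_idx=[0, 1], alpha=2.0)
visual_mass_before = weights[0] + weights[1]
visual_mass_after = boosted[0] + boosted[1]
print(visual_mass_before, visual_mass_after)
```

Renormalizing keeps each row a valid distribution, so the intervention only shifts relative mass between visual and text tokens.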
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
Authors: Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi
First: 2025-07-29T17:55:58+00:00 · Latest: 2025-12-02T18:05:40+00:00
Abstract
We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
中文标题/摘要
标题:Ov3R:基于RGB视频流的开放词汇语义三维重建框架
我们提出了Ov3R,一种基于RGB视频流的开放词汇语义三维重建框架,旨在推动空间人工智能的发展。该系统包含两个关键组件:CLIP3R,一个受CLIP启发的三维重建模块,能够从重叠片段中预测密集点云图并嵌入对象级语义;以及2D-3D OVS,一个2D-3D开放词汇语义模块,通过学习融合空间、几何和语义线索的特征来将2D特征提升到3D。与先前的方法不同,Ov3R直接将CLIP语义融入重建过程,从而实现全局一致的几何结构和精细的语义对齐。我们的框架在密集三维重建和开放词汇三维分割方面均达到了最先进的性能,标志着向实时、语义感知的空间人工智能迈进了一步。
Summary / 总结
Ov3R is a novel framework for open-vocabulary semantic 3D reconstruction from RGB videos, which includes CLIP3R for predicting dense point maps with object-level semantics and 2D-3D OVS for lifting 2D features into 3D using spatial, geometric, and semantic cues. This system directly incorporates CLIP semantics into the reconstruction process, achieving state-of-the-art performance in dense 3D reconstruction and open-vocabulary 3D segmentation, advancing real-time, semantics-aware Spatial AI.
Ov3R 是一种新型的从 RGB 视频流进行开放词汇语义 3D 重建的框架,包括 CLIP3R 预测带有物体级别语义的密集点云图和 2D-3D OVS 将 2D 特征提升到 3D 通过整合空间、几何和语义线索。与以往方法不同,Ov3R 直接将 CLIP 语义融入重建过程,实现了在密集 3D 重建和开放词汇 3D 分割上的最新性能,推动了实时、语义感知的 Spatial AI 发展。
InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration
Authors: Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang
Venue: AAAI 2026
First: 2025-12-02T17:59:52+00:00 · Latest: 2025-12-02T17:59:52+00:00
Comments: Published in AAAI 2026
Abstract
Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
中文标题/摘要
标题:InEx:通过内省和跨模态多智能体协作减轻幻觉
幻觉仍然是大型语言模型(LLMs)中的一个关键挑战,阻碍了可靠多模态LLMs(MLLMs)的发展。现有解决方案往往依赖于人工干预或未能充分利用智能体自主减轻幻觉的能力。为了解决这些限制,我们从人类在现实世界中做出可靠决策的方式中汲取灵感。他们首先通过内省推理减少不确定性并形成初步判断,然后依靠来自不同视角的外部验证来做出最终决策。受这一认知范式启发,我们提出了InEx,这是一种无需训练的多智能体框架,旨在自主减轻幻觉。InEx引入了基于熵不确定性估计的内省推理,以提高决策智能体推理过程的可靠性。智能体首先生成响应,然后通过与编辑智能体和自我反思智能体进行外部跨模态多智能体协作的迭代验证和优化,进一步提高可靠性和减轻幻觉。大量实验表明,InEx在通用和幻觉基准测试中始终优于现有方法,实现了4%-27%的性能提升,并且表现出很强的鲁棒性。
Summary / 总结
InEx is a training-free multi-agent framework designed to autonomously mitigate hallucination in multimodal large language models (MLLMs). It combines internal introspective reasoning, guided by entropy-based uncertainty estimation, with external cross-modal multi-agent collaboration to improve reliability. Experiments show that InEx outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks and demonstrating strong robustness.
InEx 是一个无需训练的多代理框架,旨在自主减轻大型语言模型中的幻觉问题。它利用内部反省推理和基于熵的不确定性估计来提高推理可靠性,随后通过外部跨模态多代理协作进行验证和细化。实验表明,InEx 在幻觉基准测试中优于现有方法,最高可实现 27% 的性能提升,并且表现出很强的鲁棒性。
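The entropy-based uncertainty estimation mentioned in the InEx entry can be sketched as follows. This is an illustrative sketch only: it assumes uncertainty is measured as the Shannon entropy of a next-token distribution and compared against a fixed threshold. The function names and the threshold value are hypothetical; the digest does not say how InEx computes or uses the estimate.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_introspection(probs, threshold):
    """Flag a step whose next-token distribution is too uncertain."""
    return shannon_entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]   # toy distribution, peaked
uncertain = [0.25, 0.25, 0.25, 0.25]   # toy distribution, uniform

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name, round(shannon_entropy(dist), 4), needs_introspection(dist, 1.0))
```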
Lumos: Let there be Language Model System Certification
Authors: Isha Chaudhary, Vedaant Jain, Avaljot Singh, Kavya Sachdeva, Sayan Ranu, Gagandeep Singh
First: 2025-12-02T17:44:47+00:00 · Latest: 2025-12-02T17:44:47+00:00
Abstract
We introduce the first principled framework, Lumos, for specifying and formally certifying Language Model System (LMS) behaviors. Lumos is an imperative probabilistic programming DSL over graphs, with constructs to generate independent and identically distributed prompts for LMS. It offers a structured view of prompt distributions via graphs, forming random prompts from sampled subgraphs. Lumos supports certifying LMS for arbitrary prompt distributions via integration with statistical certifiers. We provide hybrid (operational and denotational) semantics for Lumos, providing a rigorous way to interpret the specifications. Using only a small set of composable constructs, Lumos can encode existing LMS specifications, including complex relational and temporal specifications. It also facilitates specifying new properties - we present the first safety specifications for vision-language models (VLMs) in autonomous driving scenarios developed with Lumos. Using these, we show that the state-of-the-art VLM Qwen-VL exhibits critical safety failures, producing incorrect and unsafe responses with at least 90% probability in right-turn scenarios under rainy driving conditions, revealing substantial safety risks. Lumos's modular structure allows easy modification of the specifications, enabling LMS certification to stay abreast with the rapidly evolving threat landscape. We further demonstrate that specification programs written in Lumos enable finding specific failure cases exhibited by state-of-the-art LMS. Lumos is the first systematic and extensible language-based framework for specifying and certifying LMS behaviors, paving the way for a wider adoption of LMS certification.
中文标题/摘要
标题:Lumos:让语言模型系统认证成为可能
我们介绍了第一个原理性的框架Lumos,用于指定和正式认证语言模型系统(LMS)的行为。Lumos是一种基于图的命令式概率编程DSL,具有生成独立同分布提示的构造。它通过图形成随机提示,从采样的子图中形成随机提示。Lumos支持通过与统计认证器集成来为任意提示分布认证LMS。我们为Lumos提供了混合(操作性和语义性)语义,提供了一种严谨的方式来解释规范。仅使用少量可组合的构造,Lumos可以编码现有的LMS规范,包括复杂的关联和时间规范。它还便于指定新的属性——我们使用Lumos首次为自主驾驶场景中的视觉语言模型(VLMs)开发了安全规范。使用这些规范,我们展示了最先进的VLM Qwen-VL表现出关键的安全故障,在雨天驾驶条件下右转场景中至少有90%的概率产生错误和不安全的响应,揭示了重大的安全风险。Lumos的模块化结构允许轻松修改规范,使LMS认证能够跟上快速变化的威胁环境。我们进一步证明,用Lumos编写的规范程序能够找到最先进的LMS展示的具体故障案例。Lumos是第一个系统性和可扩展的语言基础框架,用于指定和认证LMS行为,为更广泛的LMS认证采用铺平了道路。
Summary / 总结
Lumos is a principled framework for specifying and formally certifying Language Model System (LMS) behaviors. It uses an imperative probabilistic programming DSL over graphs to generate independent and identically distributed prompts and supports certifying LMS for arbitrary prompt distributions. Using safety specifications written in Lumos, the authors show that the state-of-the-art VLM Qwen-VL produces incorrect and unsafe responses with at least 90% probability in right-turn scenarios under rainy driving conditions. The modular structure of Lumos allows for easy modification and adaptation to new threats, facilitating ongoing LMS certification.
Lumos 是一个用于指定和正式认证语言模型系统 (LMS) 行为的框架。它使用基于图的命令式概率编程 DSL 生成独立且同分布的提示,并通过与统计认证器的集成支持对任意提示分布进行认证。Lumos 揭示了最先进的 VLM Qwen-VL 在特定驾驶条件下的关键安全故障,表明存在重大安全风险。Lumos 的模块化结构使得可以轻松修改规范,从而在应对不断变化的威胁时保持 LMS 认证的有效性。
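The Lumos entry pairs i.i.d. prompt sampling with a statistical certifier. The sketch below is not Lumos syntax and not the certifier used in the paper. It assumes a toy attribute graph, a hypothetical `sample_prompt` helper, and a one-sided Hoeffding bound as a stand-in certifier, and the failure counts are made-up illustration values.

```python
import math
import random

# Hypothetical toy "graph": attribute groups and their nodes.
SCENARIO_GRAPH = {
    "weather": ["rainy", "clear"],
    "maneuver": ["right turn", "left turn"],
}


def sample_prompt(rng):
    """Draw one prompt by picking a node from each attribute group independently."""
    weather = rng.choice(SCENARIO_GRAPH["weather"])
    maneuver = rng.choice(SCENARIO_GRAPH["maneuver"])
    return f"It is {weather}. Is it safe to make a {maneuver} now?"


def hoeffding_lower_bound(failures, n, delta):
    """Lower bound on the true failure probability, valid with probability >= 1 - delta.

    Uses the one-sided Hoeffding inequality for n i.i.d. Bernoulli outcomes.
    """
    p_hat = failures / n
    return max(0.0, p_hat - math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


rng = random.Random(0)
print(sample_prompt(rng))
# Made-up counts for illustration: 960 unsafe answers out of 1000 sampled prompts.
print(hoeffding_lower_bound(failures=960, n=1000, delta=0.05))
```

A bound of this kind is what lets a statement such as "fails with at least probability p" be made with a stated confidence level rather than as a raw empirical rate.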
Guardian: Detecting Robotic Planning and Execution Errors with Vision-Language Models
Authors: Paul Pacaud, Ricardo Garcia, Shizhe Chen, Cordelia Schmid
First: 2025-12-01T17:57:27+00:00 · Latest: 2025-12-02T17:33:19+00:00
Comments: Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/. The paper contains 8 pages, 9 figures, 6 tables
Abstract
Robust robotic manipulation requires reliable failure detection and recovery. Although current Vision-Language Models (VLMs) show promise, their accuracy and generalization are limited by the scarcity of failure data. To address this data gap, we propose an automatic robot failure synthesis approach that procedurally perturbs successful trajectories to generate diverse planning and execution failures. This method produces not only binary classification labels but also fine-grained failure categories and step-by-step reasoning traces in both simulation and the real world. With it, we construct three new failure detection benchmarks: RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail, substantially expanding the diversity and scale of existing failure datasets. We then train Guardian, a VLM with multi-view images for detailed failure reasoning and detection. Guardian achieves state-of-the-art performance on both existing and newly introduced benchmarks. It also effectively improves task success rates when integrated into a state-of-the-art manipulation system in simulation and real robots, demonstrating the impact of our generated failure data. Code, Data, and Models available at https://www.di.ens.fr/willow/research/guardian/.
中文标题/摘要
标题:Guardian:使用视觉语言模型检测机器人规划和执行错误
稳健的机器人操作需要可靠的故障检测和恢复。尽管当前的视觉语言模型(VLMs)显示出潜力,但它们的准确性和泛化能力受限于故障数据的稀缺性。为了解决这一数据缺口,我们提出了一种自动机器人故障合成方法,通过程序化地扰动成功的轨迹来生成多样化的规划和执行故障。该方法不仅生成二元分类标签,还生成详细的故障类别和步骤推理轨迹,适用于仿真和真实世界。通过这种方法,我们构建了三个新的故障检测基准:RLBench-Fail、BridgeDataV2-Fail 和 UR5-Fail,显著扩展了现有故障数据集的多样性和规模。然后,我们训练了Guardian,这是一种具有多视图图像的VLM,用于详细故障推理和检测。Guardian在现有和新引入的基准测试中均达到了最先进的性能。当将其集成到最先进的操作系统中时,它也有效提高了仿真和真实机器人中的任务成功率,证明了我们生成的故障数据的影响。代码、数据和模型可在https://www.di.ens.fr/willow/research/guardian/获取。
Summary / 总结
The research aims to improve failure detection and recovery in robotic manipulation, where Vision-Language Models (VLMs) are limited by the scarcity of failure data. The authors propose an automatic failure synthesis method that procedurally perturbs successful trajectories to generate diverse planning and execution failures, yielding binary labels, fine-grained failure categories, and step-by-step reasoning traces. With it they build three new failure detection benchmarks (RLBench-Fail, BridgeDataV2-Fail, and UR5-Fail) and train Guardian, a multi-view VLM that achieves state-of-the-art performance on both existing and new benchmarks. Guardian also improves task success rates in both simulation and real robots when integrated into a manipulation system.
论文旨在通过解决失败数据稀缺性问题,提高机器人的故障检测能力。提出了一种自动机器人故障合成方法,生成多样化的规划和执行故障,并产生二元和细粒度标签。该方法构建了新的故障检测基准,并训练了Guardian,这是一种VLM,在故障推理和检测方面表现出色,实现在仿真和真实机器人中任务成功率的提升。
AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired
Authors: Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla
First: 2025-11-08T17:23:51+00:00 · Latest: 2025-12-02T16:23:51+00:00
Abstract
This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the "Find an Object" feature achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.
中文标题/摘要
标题:AIDEN:视障者人工智能助理的设计与试点研究
本文介绍了AIDEN,一种基于人工智能的助理,旨在增强视障人士的自主性和日常生活质量,他们常常在物体识别、文本阅读和陌生环境中的导航方面遇到困难。现有的解决方案如屏幕阅读器或基于音频的助理虽然有助于信息访问,但经常导致听觉过载,并在开放环境中引发隐私问题。AIDEN通过结合You Only Look Once (YOLO) 实时物体检测和Large Language and Vision Assistant (LLaVA) 场景描述及光学字符识别(OCR)来解决这些限制。该系统的一个关键创新是基于Geiger计数器隐喻的持续触觉引导机制,它支持物体对齐而不占用听觉通道,同时通过确保不存储个人数据来保护隐私。通过技术接受模型(TAM)对视障参与者进行的实证评估测算了感知易用性和接受度。结果表明,用户满意度高,特别是在直观性和感知自主性方面。此外,“查找物体”功能实现了有效的实时性能。这些发现提供了有力的证据,表明多模态触觉-视觉反馈可以提高日常使用性和独立性,与传统的以音频为中心的方法相比,激励进行更大规模的临床验证。
Summary / 总结
AIDEN is an AI assistant designed to improve the daily life of visually impaired individuals by addressing the limitations of existing solutions such as screen readers and audio-based assistants. It uses a hybrid architecture combining YOLO for real-time object detection, LLaVA for scene description and OCR, and a haptic guidance mechanism to support object centering without auditory overload. Empirical evaluations showed high user satisfaction and effective real-time performance in finding objects, indicating that multimodal haptic-visual feedback can enhance usability and independence compared to traditional methods.
AIDEN 是一个为视障人士设计的 AI 辅助系统,旨在通过解决物体识别、文本阅读和导航等挑战来提高他们的自主性和日常生活质量。该系统采用了一种结合 YOLO 实时物体检测、LLaVA 场景描述和 OCR 的混合架构,并使用基于 Geiger 计数器的连续触觉引导机制来避免占用听觉通道。实证评估显示,用户对系统的满意度高,特别是在物体查找方面的实时性能良好,表明多模态触觉-视觉反馈可以提高使用性和独立性,优于传统的以音频为中心的方法。
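The Geiger-counter metaphor in the AIDEN entry suggests haptic pulses that speed up as the target object nears the center of the camera frame. The following is a minimal sketch under that assumption. The linear mapping, the interval limits, and both function names are hypothetical and are not taken from the paper.

```python
import math


def normalized_offset(box, frame_w, frame_h):
    """Distance of a bounding-box center from the frame center, scaled to 0..1.

    box is (x0, y0, x1, y1) in pixels; 1.0 corresponds to a frame corner.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    dx, dy = cx - frame_w / 2.0, cy - frame_h / 2.0
    half_diag = math.hypot(frame_w / 2.0, frame_h / 2.0)
    return math.hypot(dx, dy) / half_diag


def pulse_interval_s(offset, min_interval=0.05, max_interval=1.0):
    """Seconds between vibration pulses: shorter when the object is centered."""
    offset = min(max(offset, 0.0), 1.0)  # clamp to the valid range
    return min_interval + (max_interval - min_interval) * offset


# A detection slightly right of center in a 640x480 frame.
off = normalized_offset((400, 220, 440, 260), 640, 480)
print(off, pulse_interval_s(off))
```

Because the feedback is carried by pulse timing alone, the audio channel stays free, which is the design goal the abstract highlights.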
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Authors: Fan Yang, Kaihao Zhang
First: 2025-12-02T16:22:01+00:00 · Latest: 2025-12-02T16:22:01+00:00
Abstract
Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
中文标题/摘要
标题:MRD:多分辨率检索-检测融合用于高分辨率图像理解
多模态大型语言模型(MLLMs)对高分辨率图像的理解仍然是一项重大挑战。最近的研究通过将图像分割成更小的片段,并使用预训练的检索增强生成(RAG)模型计算每个片段与查询之间的语义相似度来解决这一问题。然后选择最相关的片段来定位目标对象并抑制无关信息。然而,这种基于片段的处理可能会将完整的对象分割到多个片段中,从而破坏语义相似度的计算。在我们的实验中,我们发现不同大小的对象图像片段在不同的分辨率下处理效果更好。基于这一观察,我们提出了多分辨率检索-检测(MRD),一种无需训练的高分辨率图像理解框架。为了解决由于对象被分割到不同图像片段中而导致的语义相似度偏差问题,我们提出了一种多分辨率语义融合方法,该方法将不同分辨率下获得的语义相似度图进行整合,以产生更准确的语义信息并保持目标对象的完整性。此外,为了在全局尺度上直接定位目标对象,我们引入了一种开放式词汇对象检测(OVD)模型,该模型使用滑动窗口方法识别对象区域。使用不同MLLMs在高分辨率图像理解基准测试中进行的实验表明了我们方法的有效性。
Summary / 总结
The paper addresses the challenge of understanding high-resolution images by proposing MRD, a multi-resolution retrieval-detection fusion framework. It overcomes the issue of object fragmentation across image crops by handling objects of different sizes at appropriate resolutions and integrating semantic similarity maps. The approach also introduces an open-vocabulary object detection model for global object localization. Experiments show that MRD improves the accuracy of semantic information and preserves the integrity of target objects compared to previous methods.
论文提出了一种多分辨率检索检测(MRD)框架,通过从不同分辨率融合语义相似性图来保持目标对象的完整性。MRD使用多分辨率语义融合方法结合来自不同分辨率的信息,并引入开放词汇量物体检测(OVD)模型使用滑动窗口方法直接定位目标物体。实验表明,MRD在高分辨率图像理解基准上的表现优于基于裁剪的方法,提高了语义理解和定位的准确性。
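The multi-resolution semantic fusion in the MRD entry can be pictured as combining crop-to-query similarity maps computed on grids of different sizes. The sketch below assumes nearest-neighbor upsampling to the finest grid followed by a uniform average. The actual fusion rule is not given in the digest, and both function names are hypothetical.

```python
def upsample_nearest(grid, out_h, out_w):
    """Resize a 2D list of scores to out_h x out_w by nearest-neighbor lookup."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def fuse_similarity_maps(maps):
    """Average similarity maps after bringing them to the finest resolution."""
    out_h = max(len(m) for m in maps)
    out_w = max(len(m[0]) for m in maps)
    resized = [upsample_nearest(m, out_h, out_w) for m in maps]
    return [
        [sum(m[r][c] for m in resized) / len(resized) for c in range(out_w)]
        for r in range(out_h)
    ]


coarse = [[0.8]]                      # whole image scored as one crop
fine = [[0.2, 0.9], [0.1, 0.3]]       # the same image split into 2x2 crops
fused = fuse_similarity_maps([coarse, fine])
print(fused)
```

The coarse map contributes evidence for objects that a fine grid would split across crops, which is the failure mode the abstract describes.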
VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling
Authors: Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, Guangrun Wang
First: 2025-12-02T16:16:13+00:00 · Latest: 2025-12-02T16:16:13+00:00
Abstract
Vision-language-action (VLA) models achieve strong in-distribution performance but degrade sharply under novel camera viewpoints and visual perturbations. We show that this brittleness primarily arises from misalignment in Spatial Modeling, rather than Physical Modeling. To address this, we propose a one-shot adaptation framework that recalibrates visual representations through lightweight, learnable updates. Our first method, Feature Token Modulation (FTM), applies a global affine transformation to visual tokens and improves Libero viewpoint accuracy from 48.5% to 87.1% with only 4K parameters. Building on this, Feature Linear Adaptation (FLA) introduces low-rank updates to the ViT encoder, achieving 90.8% success with 4.7M parameters -- matching LoRA-scale finetuning at far lower cost. Together, these results reveal substantial untapped robustness in pretrained VLA models and demonstrate that targeted, minimal visual adaptation is sufficient to restore viewpoint generalization.
中文标题/摘要
标题:VLA模型比你想象的更具泛化能力:重访物理建模与空间建模
视觉-语言-行动(VLA)模型在分布内表现强劲,但在新型相机视角和视觉扰动下性能急剧下降。我们表明,这种脆弱性主要源自于空间建模中的对齐问题,而非物理建模。为解决这一问题,我们提出了一种单次适应框架,通过轻量级、可学习的更新重新校准视觉表示。我们的第一个方法,特征标记调制(FTM),对视觉标记应用全局仿射变换,仅使用4K参数将Libero视角准确性从48.5%提高到87.1%。在此基础上,特征线性适应(FLA)引入了对ViT编码器的低秩更新,实现90.8%的成功率,使用4.7M参数——在远低于LoRA规模微调的成本下达到类似效果。这些结果揭示了预训练VLA模型中未充分利用的鲁棒性,并证明了针对视觉的最小化适应足以恢复视角泛化。
Summary / 总结
The research aims to improve the generalization of vision-language-action models under novel camera viewpoints and visual perturbations. It proposes a one-shot adaptation framework that recalibrates visual representations through lightweight updates. Specifically, Feature Token Modulation (FTM) improves viewpoint accuracy from 48.5% to 87.1% with 4K parameters, while Feature Linear Adaptation (FLA) achieves 90.8% success with 4.7M parameters, demonstrating substantial untapped robustness in pretrained models and the effectiveness of minimal visual adaptation.
研究旨在提高视觉-语言-行动(VLA)模型在新型相机视角和视觉扰动下的泛化能力。研究提出了一种一次性适应框架,通过轻量级更新重新校准视觉表示。具体而言,特征标记调制(FTM)将视角准确性从48.5%提高到87.1%,使用4K参数,而特征线性适应(FLA)实现了90.8%的成功率,使用4.7M参数,展示了预训练VLA模型中巨大的鲁棒性,并证明了最小的视觉适应足以恢复视角泛化。
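Feature Token Modulation, as described in the VLA entry, applies one global affine transformation to the visual tokens. A minimal sketch follows, assuming the transform is a per-channel scale `gamma` and shift `beta` shared by every token. That parameterization, and the helper names, are assumptions, since the digest does not give the exact form.

```python
def feature_token_modulation(tokens, gamma, beta):
    """Apply the same per-channel affine map (gamma * x + beta) to every token."""
    return [[g * x + b for x, g, b in zip(tok, gamma, beta)] for tok in tokens]


def ftm_param_count(dim):
    """Learnable parameters under this parameterization: one scale and one shift per channel."""
    return 2 * dim


tokens = [[1.0, 2.0], [3.0, 4.0]]     # two toy visual tokens with 2 channels
gamma = [2.0, 0.5]                    # hypothetical learned scales
beta = [0.0, 1.0]                     # hypothetical learned shifts
print(feature_token_modulation(tokens, gamma, beta))
print(ftm_param_count(2048))
```

Under this assumed form the parameter count is independent of the number of tokens, which is consistent with the abstract's emphasis on a very small adaptation budget. Initializing `gamma` to ones and `beta` to zeros leaves the pretrained features unchanged.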
FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization
Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang
First: 2025-12-02T16:14:08+00:00 · Latest: 2025-12-02T16:14:08+00:00
Comments: 15 pages, 3 figures
Abstract
Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
中文标题/摘要
标题:FAIRY2I:通过广泛线性表示和相位感知量化实现的通用极低比特量化训练框架
大型语言模型(LLMs)已经彻底改变了人工智能,但它们巨大的内存和计算需求迫使人们采取激进的量化措施,这越来越接近单比特表示的理论极限。虽然复值LLMs,如iFairy,提供了比实值对应物更好的低比特表示机会,但它们需要从头开始训练,无法利用庞大的预训练实值基础模型生态系统。在这里,我们提出了Fairy2i,这是一种通用框架,可以将预训练的实值层转换为等效的广泛线性复数形式,从而实现极低比特量化,同时重用现有的检查点。通过证明实数和广泛线性映射之间的无损数学等价性,我们将标准Transformer转换到复数域,并采用一种基于四次单位根的高效码本的相位感知量化方案。此外,我们引入了一种递归残差量化机制,该机制通过迭代最小化量化误差,允许通过高效的无乘法累加进行推理。我们证明,Fairy2i可以在有效2比特精度下恢复LLaMA-2 7B的性能,几乎与全精度基线相当,显著优于最先进的实值二进制和三进制量化方法。这项工作在复值算术的表示效率和预训练模型的实际用途之间架起了一座桥梁,为在普通硬件上实现高效推理开辟了一条新途径。
Summary / 总结
The research aims to address the memory and computational challenges of large language models by developing a universal framework, Fairy2i, for extremely low-bit quantization. It transforms pre-trained real-valued layers into a widely-linear complex form, enabling phase-aware quantization and recursive residual quantization to minimize quantization error. Experimental results show that Fairy2i can restore LLaMA-2 7B's performance to nearly full-precision levels at 2-bit precision, outperforming existing real-valued binary and ternary quantization methods.
FAIRY2I 是一个通用框架,将预训练的实值层转换为广义线性复数形式,同时重用现有检查点,实现极低比特量化。它采用相位感知量化方案和递归残差量化机制来最小化量化误差。实验结果表明,FAIRY2I 在2比特精度下恢复了LLaMA-2 7B 的性能,几乎与全精度基线相当,显著优于最先进的实值二进制和三进制量化方法。
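The equivalence between real and widely-linear maps in the Fairy2i entry can be checked by hand in the smallest case. The sketch below treats a real 2x2 matrix acting on (Re x, Im x) and rewrites it as y = a·x + b·conj(x). It also snaps a complex number to the nearest fourth root of unity. This follows the standard widely-linear identity; the function names are hypothetical, and the paper's scaling factors and recursive residual steps are not shown.

```python
def real_to_widely_linear(r11, r12, r21, r22):
    """Coefficients (a, b) such that a*x + b*conj(x) matches the real 2x2 map."""
    a = complex(r11 + r22, r21 - r12) / 2
    b = complex(r11 - r22, r21 + r12) / 2
    return a, b


def apply_real(r11, r12, r21, r22, x):
    """Apply the real 2x2 matrix to the vector (Re x, Im x)."""
    return complex(r11 * x.real + r12 * x.imag, r21 * x.real + r22 * x.imag)


def apply_widely_linear(a, b, x):
    return a * x + b * x.conjugate()


def quantize_phase(z):
    """Nearest element of {1, i, -1, -i}; maximizing Re(z * conj(c)) minimizes |z - c|."""
    codebook = (1 + 0j, 1j, -1 + 0j, -1j)
    return max(codebook, key=lambda c: (z * c.conjugate()).real)


r = (0.5, -1.25, 2.0, 0.75)           # arbitrary real weights
x = 3 - 2j
a, b = real_to_widely_linear(*r)
print(apply_real(*r, x), apply_widely_linear(a, b, x))
print(quantize_phase(0.9 + 0.2j), quantize_phase(-0.1 + 2j))
```

Since the conversion is exact, no information in the pretrained real weights is lost before quantization, which is the point the abstract makes about reusing existing checkpoints.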
GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation
Authors: Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, Wanli Peng, Jingchao Qiao, Zeyu Ren, Haixin Shi, Zhi Su, Jiawen Tian, Yuyang Xiao, Shenyu Zhang, Liwei Zheng, Hang Li, Yonghui Wu
First: 2025-12-01T15:33:59+00:00 · Latest: 2025-12-02T15:44:55+00:00
Abstract
We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Existing VLA policies rest on the assumption that human demonstrations are optimal. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress, filters the demonstration trajectories, and only keeps the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.
中文标题/摘要
标题:GR-RL:实现长时距灵巧操作的多能向精确化
我们提出了GR-RL,一种将通用视觉-语言-动作(VLA)策略转化为适用于长时距灵巧操作的高效专家的机器人学习框架。假设人类演示的最优性是现有VLA策略的核心。然而,我们认为在高度灵巧和精确的操作任务中,人类演示是嘈杂且次优的。GR-RL 提出了一种多阶段训练管道,通过强化学习过滤、增强和强化演示。首先,GR-RL 学习一个视觉-语言条件下的任务进度,过滤演示轨迹,仅保留对进度有积极贡献的转换。具体来说,我们展示了通过直接应用离线RL和稀疏奖励,所得到的$Q$值可以被视为一个稳健的进度函数。接下来,我们引入了形态对称增强,极大地提高了GR-RL的泛化能力和性能。最后,为了更好地使VLA策略与部署行为对高精度控制进行对齐,我们通过学习潜在空间噪声预测器进行在线RL。通过这个管道,GR-RL,据我们所知,是第一个能够自主穿鞋带的策略,成功率为83.3%,该任务需要长时距推理、毫米级精度和柔体接触交互。我们希望GR-RL能够为使通用机器人基础模型专门化为可靠的现实世界专家提供一步。
Summary / 总结
GR-RL is a robotic learning framework that transforms a generalist vision-language-action policy into a specialist for long-horizon dexterous manipulation. It proposes a multi-stage training pipeline involving filtering, augmenting, and reinforcing human demonstrations through reinforcement learning. Specifically, it uses offline RL with sparse rewards to filter demonstration trajectories and introduces morphological symmetry augmentation to improve generalization. Online RL is also employed to align the policy with deployment behaviors for high-precision control. GR-RL demonstrates an 83.3% success rate in autonomously lacing up a shoe, showcasing its capability in long-horizon reasoning and precise manipulation.
GR-RL 是一种机器人学习框架,将通用的视觉-语言-动作政策转化为擅长长时间精细操作的专家。它提出了一种多阶段训练管道,包括通过强化学习过滤、增强和强化人类演示。具体来说,它使用稀疏奖励的离线 RL 来过滤演示轨迹,并引入形态对称增强以提高泛化能力。还通过学习潜在空间噪声预测器来进行在线 RL,以更好地使政策与部署行为对齐,以实现高精度控制。GR-RL 在自主系鞋带任务中达到了 83.3% 的成功率,展示了其在长时间推理和精细操作方面的能力。
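The first stage in the GR-RL entry keeps only demonstration transitions that contribute positively to task progress, with learned Q-values serving as the progress function. The sketch below assumes a progress estimate is already available as a callable and keeps transitions whose progress increases by more than a margin. The lookup table, the margin, and the function name are hypothetical stand-ins for the learned critic.

```python
def filter_by_progress(transitions, progress, margin=0.0):
    """Keep (state, action, next_state) tuples whose progress estimate rises by more than margin."""
    return [t for t in transitions if progress(t[2]) - progress(t[0]) > margin]


# Toy progress values standing in for Q-values from offline RL with sparse reward.
progress_table = {0: 0.0, 1: 0.3, 2: 0.2, 3: 0.9}
demo = [(0, "reach", 1), (1, "fumble", 2), (2, "thread", 3)]

kept = filter_by_progress(demo, progress_table.get)
print(kept)
```

In this toy trajectory the middle transition lowers the progress estimate and is dropped, mirroring the claim that human demonstrations contain suboptimal segments.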
OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models
Authors: Ryoto Miyamoto, Xin Fan, Fuyuko Kido, Tsuneo Matsumoto, Hayato Yamana
First: 2025-10-18T01:39:28+00:00 · Latest: 2025-12-02T15:12:16+00:00
Comments: WACV2026 Accepted
Abstract
OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). While prior work has reported high attack success rates, our analysis suggests that these results often arise from detecting distributional bias introduced during dataset construction rather than from identifying true membership status. To address this issue, we introduce a controlled benchmark of 6,000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods approached chance level. OpenLVLM-MIA, designed to be a transparent and unbiased benchmark, clarifies certain limitations of MIA research on LVLMs and provides a solid foundation for developing stronger privacy-preserving techniques.
中文标题/摘要
标题:OpenLVLM-MIA:一个受控基准揭示大规模视觉语言模型成员推断攻击的局限性
OpenLVLM-MIA 是一个新的基准,突显了评估大规模视觉语言模型(LVLM)成员推断攻击(MIA)的基本挑战。尽管先前的工作报告了高攻击成功率,但我们的分析表明,这些结果往往源于检测数据集构建过程中引入的分布偏差,而不是识别真正的成员身份。为了解决这一问题,我们引入了一个包含6,000张图像的受控基准,其中成员样本和非成员样本的分布经过仔细平衡,并提供了三个不同训练阶段的真实成员身份标签。使用OpenLVLM-MIA的实验表明,最先进的MIA方法的性能接近随机水平。OpenLVLM-MIA 设计为透明且无偏的基准,澄清了MIA研究在LVLM上的某些局限性,并为开发更强的隐私保护技术提供了坚实的基础。
Summary / 总结
OpenLVLM-MIA is a new benchmark that aims to evaluate the limitations of membership inference attacks (MIA) on large vision-language models (LVLMs). It addresses the issue of high attack success rates reported in prior work, which often result from detecting distributional bias rather than true membership status. By providing a controlled dataset with balanced distributions and ground-truth labels, the benchmark shows that state-of-the-art MIA methods perform at chance level, highlighting the need for stronger privacy-preserving techniques.
OpenLVLM-MIA 是一个新的基准,旨在评估大型视觉-语言模型上的成员身份推理攻击。它解决了先前工作中高成功率的问题,这些高成功率通常是由于检测分布偏差而不是真实成员身份状态。通过仔细平衡成员和非成员样本的分布并提供真实标签,OpenLVLM-MIA 显示出最先进的 MIA 方法的表现达到了随机水平,突显了当前 MIA 研究在 LVLM 上的局限性,并为开发更强的隐私保护技术提供了基础。
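To make the "chance level" finding concrete: MIA performance is commonly reported as ROC AUC over attack scores, where 0.5 corresponds to random guessing. The sketch below is a minimal illustration, not the benchmark's evaluation code. The function `roc_auc` and all toy scores are hypothetical, chosen only to contrast a split where member and non-member scores separate cleanly with one where they overlap.

```python
def roc_auc(member_scores, nonmember_scores):
    """AUC as P(member score > non-member score); ties count as 0.5."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical attack scores on a biased split: members and non-members
# come from different distributions, so the attack separates them easily.
biased_members = [0.9, 0.8, 0.7, 0.6]
biased_nonmembers = [0.4, 0.3, 0.2, 0.1]

# Hypothetical attack scores on a balanced split: the score distributions
# overlap, so the attack carries no membership signal.
balanced_members = [0.2, 0.4, 0.6, 0.8]
balanced_nonmembers = [0.3, 0.5, 0.7, 0.5]

auc_biased = roc_auc(biased_members, biased_nonmembers)
auc_balanced = roc_auc(balanced_members, balanced_nonmembers)
print(auc_biased, auc_balanced)
```

An AUC near 1.0 on the first split and near 0.5 on the second is the pattern the abstract describes: apparent attack success that disappears once the member and non-member distributions are balanced.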
OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic
Authors: Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang, Chen Lv
First: 2025-12-01T16:11:57+00:00 · Latest: 2025-12-02T14:58:48+00:00
Abstract
Recently, two-stage fine-tuning strategies, e.g., acquiring essential driving knowledge through supervised fine-tuning (SFT) and further enhancing decision-making and planning via reinforcement fine-tuning (RFT), have shown strong potential in advancing the knowledge-driven autonomous driving (AD) paradigm. However, the learning nature of SFT still limits the generalization of reasoning, thereby constraining the full potential of driving performance. Meanwhile, current RFT approaches are primarily applied to downstream tasks, since scene understanding is an open-ended problem where corresponding rewards are difficult to quantify. To address these limitations, we propose OpenREAD, an OPEN-ended REasoning reinforced vision-language model (VLM)-based autonomous driving (AD) framework that enables end-to-end RFT across the full spectrum from high-level reasoning to low-level trajectory planning. Specifically, we begin by constructing large-scale Chain-of-Thought (CoT) annotations on open-source driving-related knowledge datasets, and employ the powerful Qwen3 large language model (LLM) as the critic in RFT to quantify reasoning quality for open-ended questions during reward modeling. Extensive experiments confirm that joint end-to-end RFT yields substantial improvements in both upstream and downstream tasks, enabling OpenREAD to achieve state-of-the-art performance on reasoning and planning benchmarks.
中文标题/摘要
标题:OpenREAD:基于LLM-as-Critic的端到端开放性推理自主驾驶
近年来,两阶段微调策略,例如通过监督微调(SFT)获取必要的驾驶知识,然后通过强化微调(RFT)进一步增强决策和规划能力,已经在推动知识驱动的自主驾驶(AD)范式方面显示出强大的潜力。然而,SFT的学习性质仍然限制了推理的泛化能力,从而限制了驾驶性能的全部潜力。同时,当前的RFT方法主要应用于下游任务,因为场景理解是一个开放性问题,其中相应的奖励难以量化。为了解决这些限制,我们提出了一种名为OpenREAD的端到端开放性推理强化视觉语言模型(VLM)自主驾驶(AD)框架,该框架能够在从高层次推理到低层次轨迹规划的整个光谱范围内实现端到端的RFT。具体而言,我们首先在开源驾驶相关知识数据集上构建大规模的思维链(CoT)注释,并利用强大的Qwen3大语言模型(LLM)作为RFT中的批评者,在奖励建模过程中量化开放性问题的推理质量。广泛的实验表明,联合端到端的RFT在上游和下游任务中均取得了显著的改进,使OpenREAD在推理和规划基准测试中达到了最先进的性能。
Summary / 总结
The paper proposes OpenREAD, a framework for end-to-end reinforcement fine-tuning in autonomous driving that combines large-scale Chain-of-Thought annotations with a large language model as a critic to quantify reasoning quality. This approach addresses the limitations of supervised fine-tuning and the difficulty in quantifying rewards for open-ended problems. Experimental results show that OpenREAD significantly improves performance on reasoning and planning benchmarks, achieving state-of-the-art results.
OpenREAD 是一个结合强化学习和大语言模型的开放性推理框架,旨在增强自主驾驶中的决策和规划。它通过从高层次推理到低层次轨迹规划的端到端强化微调(RFT)来解决两阶段微调的局限性。该框架构建了大规模的推理链注解,并使用Qwen3作为评论者来量化推理质量,从而在推理和规划基准测试中取得了显著的改进。
Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?
Authors: Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez
Venue: WACV 2026
First: 2025-12-02T14:57:17+00:00 · Latest: 2025-12-02T14:57:17+00:00
Comments: Accepted in WACV 2026 - Applications Track
Abstract
Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.
中文标题/摘要
标题:一瞥中的行动预判:多模态线索能否替代视频?
在行动理解研究中,预见即将发生的行动是一项核心挑战。传统方法依赖于从视频中提取和聚合时间信息,而人类在给定足够背景的情况下,仅通过观察场景中的一个瞬间就能预测即将发生的行动。模型能否做到这一点?简短的答案是肯定的,尽管其效果取决于任务的复杂性。在本研究中,我们探讨了视频聚合能否被其他模态替代。基于近期在视觉特征提取和基于语言的推理方面的进展,我们引入了AAG方法,用于一瞥中的行动预判。AAG结合了单帧的RGB特征和深度线索,以增强空间推理,并结合先验动作信息提供长期背景。这些背景信息可以通过视觉语言模型的文本摘要获得,或者通过单帧动作识别器生成的预测获得。我们的结果表明,使用AAG进行多模态单帧行动预判,在IKEA-ASM、Meccano和Assembly101三个指导性活动数据集中,可以与基于时间聚合的视频基线和最先进的方法竞争。
Summary / 总结
This study investigates whether a model can anticipate actions based on a single frame of a scene, using multimodal cues, and finds that AAG, a method combining RGB features with depth cues and incorporating prior action information, performs competitively with video-based methods across three instructional activity datasets. The effectiveness varies depending on the task complexity.
该研究探讨了通过观察单帧场景并使用多模态线索而非视频聚合来实现动作预测的可能性。所提出的方法AAG结合了RGB特征和深度线索,并结合先验动作信息提供长期上下文。结果显示,AAG在IKEA-ASM、Meccano和Assembly101三个指令性活动数据集上与视频基线和最先进的方法相比表现相当。
VLM as Strategist: Adaptive Generation of Safety-critical Testing Scenarios via Guided Diffusion
Authors: Xinzheng Wu, Junyi Chen, Naiting Zhong, Yong Shen
First: 2025-12-02T14:56:57+00:00 · Latest: 2025-12-02T14:56:57+00:00
Comments: 25 pages, 9 figures
Abstract
The safe deployment of autonomous driving systems (ADSs) relies on comprehensive testing and evaluation. However, safety-critical scenarios that can effectively expose system vulnerabilities are extremely sparse in the real world. Existing scenario generation methods face challenges in efficiently constructing long-tail scenarios that ensure fidelity, criticality, and interactivity, while particularly lacking real-time dynamic response capabilities to the vehicle under test (VUT). To address these challenges, this paper proposes a safety-critical testing scenario generation framework that integrates the high-level semantic understanding capabilities of Vision Language Models (VLMs) with the fine-grained generation capabilities of adaptive guided diffusion models. The framework establishes a three-layer hierarchical architecture comprising a strategic layer for VLM-directed scenario generation objective determination, a tactical layer for guidance function formulation, and an operational layer for guided diffusion execution. We first establish a high-quality fundamental diffusion model that learns the data distribution of real driving scenarios. Next, we design an adaptive guided diffusion method that enables real-time, precise control of background vehicles (BVs) in closed-loop simulation. The VLM is then incorporated to autonomously generate scenario generation objectives and guidance functions through deep scenario understanding and risk reasoning, ultimately guiding the diffusion model to achieve VLM-directed scenario generation. Experimental results demonstrate that the proposed method can efficiently generate realistic, diverse, and highly interactive safety-critical testing scenarios. Furthermore, case studies validate the adaptability and VLM-directed generation performance of the proposed method.
中文标题/摘要
标题:VLM作为策略师:通过引导扩散生成安全关键测试场景
自主驾驶系统(ADS)的安全部署依赖于全面的测试和评估。然而,能够有效暴露系统漏洞的安全关键场景在现实世界中极为稀少。现有的场景生成方法在高效构建确保真实度、关键性和互动性的长尾场景方面面临挑战,尤其缺乏对被测试车辆(VUT)的实时动态响应能力。为解决这些挑战,本文提出了一种将视觉语言模型(VLM)的高层语义理解能力和自适应引导扩散模型的精细生成能力相结合的安全关键测试场景生成框架。该框架建立了一个三层级的层次架构,包括战略层、战术层和操作层。首先,我们建立了一个高质量的基础扩散模型,学习真实驾驶场景的数据分布。接着,我们设计了一种自适应引导扩散方法,能够在闭环仿真中实现对背景车辆(BV)的实时、精确控制。然后,VLM被整合进来,通过深度场景理解和风险推理自主生成场景生成目标和引导函数,最终引导扩散模型实现VLM导向的场景生成。实验结果表明,所提出的方法能够高效生成真实、多样且高度互动的安全关键测试场景。此外,案例研究验证了所提出方法的适应性和VLM导向生成性能。
Summary / 总结
This paper addresses the challenge of generating safety-critical testing scenarios for autonomous driving systems by proposing a framework that integrates VLMs and adaptive guided diffusion models. The framework consists of three layers: strategic, tactical, and operational. It first establishes a fundamental diffusion model and then designs an adaptive guided diffusion method for real-time control of background vehicles. VLMs are used to generate scenario objectives and guidance functions, enabling VLM-directed scenario generation. Experiments show that the method can produce realistic, diverse, and highly interactive safety-critical testing scenarios.
本文提出了一种框架,通过结合VLM和自适应引导扩散模型来生成自动驾驶系统的安全关键测试场景。该框架包含战略、战术和操作三个层次。首先建立了一个基础扩散模型,然后设计了一种自适应引导扩散方法,用于实时控制背景车辆。VLM用于自主生成场景目标和引导函数,从而高效地生成现实、多样且高度互动的安全关键测试场景。实验表明,所提出的方法能够有效生成此类场景。
BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models
Authors: Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang
First: 2025-11-29T02:36:54+00:00 · Latest: 2025-12-02T14:46:22+00:00
Abstract
Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying ``grammars'' inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.
中文标题/摘要
标题:BioArc:为生物基础模型发现最优神经架构
基础模型已经彻底改变了自然语言处理(NLP)和计算机视觉(CV)等领域。尽管已经努力将基础模型在通用人工智能领域的成功转移到生物学中,但现有工作主要集中在直接采用来自通用机器学习领域的基础模型架构上,而没有系统地考虑每种生物数据模态的独特物理化学和结构特性。这导致了次优的性能,因为这些重新利用的架构难以捕捉生物数据中固有的长程依赖关系、稀疏信息和复杂的“语法”。为了解决这一差距,我们引入了BioArc,这是一种新型框架,旨在超越基于直觉的架构设计,转向基于原理的自动化架构发现,以适应生物基础模型。利用神经架构搜索(NAS),BioArc系统地探索了广泛的架构设计空间,评估了多种生物模态下的架构,同时严格分析了架构、分词和训练策略之间的相互作用。大规模分析识别出新型高性能架构,使我们能够提炼出一套经验设计原则,指导未来模型的开发。此外,为了充分利用这一组发现的原理性架构,我们提出了几种架构预测方法,这些方法能够有效地和高效地预测新生物任务的最佳架构。总体而言,我们的工作提供了一个基础资源和一种基于原理的方法,以指导下一代针对特定任务和基础模型的创建。
Summary / 总结
The research aims to improve the performance of neural architectures in biological foundation models by addressing the limitations of existing transfer approaches. BioArc uses Neural Architecture Search (NAS) to systematically explore architecture spaces, evaluating designs across various biological modalities. Key findings include the discovery of novel, high-performance architectures and empirical design principles that can guide future model development. Additionally, the study proposes methods to predict optimal architectures for new biological tasks, enhancing the efficiency and effectiveness of model creation.
研究旨在通过解决现有架构在生物领域中的局限性,提高基础模型的性能。研究采用神经架构搜索(NAS)系统地探索和发现更适合生物数据特性的新型架构。关键发现包括识别高性能架构和制定未来模型开发的经验设计原则,以及提出架构预测方法以高效预测新生物任务的最佳架构。
DehazeGS: Seeing Through Fog with 3D Gaussian Splatting
Authors: Jinze Yu, Yiqun Wang, Aiheng Jiang, Zhengda Lu, Jianwei Guo, Yong Li, Hongxing Qin, Xiaopeng Zhang
First: 2025-01-07T09:47:46+00:00 · Latest: 2025-12-02T14:44:21+00:00
Comments: 9 pages, 5 figures. Accepted by AAAI2026. Visualizations are available at https://dehazegs.github.io/
Abstract
Current novel view synthesis methods are typically designed for high-quality and clean input images. However, in foggy scenes, scattering and attenuation can significantly degrade the quality of rendering. Although NeRF-based dehazing approaches have been developed, their reliance on deep fully connected neural networks and per-ray sampling strategies leads to high computational costs. Furthermore, NeRF's implicit representation limits its ability to recover fine-grained details from hazy scenes. To overcome these limitations, we propose learning an explicit Gaussian representation to explain the formation mechanism of foggy images through a physically forward rendering process. Our method, DehazeGS, reconstructs and renders fog-free scenes using only multi-view foggy images as input. Specifically, based on the atmospheric scattering model, we simulate the formation of fog by establishing the transmission function directly onto Gaussian primitives via depth-to-transmission mapping. During training, we jointly learn the atmospheric light and scattering coefficients while optimizing the Gaussian representation of foggy scenes. At inference time, we remove the effects of scattering and attenuation in Gaussian distributions and directly render the scene to obtain dehazed views. Experiments on both real-world and synthetic foggy datasets demonstrate that DehazeGS achieves state-of-the-art performance. Visualizations are available at https://dehazegs.github.io/
中文标题/摘要
标题:DehazeGS:透过雾气的3D高斯点绘制
当前的新型视图合成方法通常针对高质量和干净的输入图像进行设计。然而,在雾气场景中,散射和衰减会显著降低渲染质量。尽管已经开发了基于NeRF的去雾方法,但它们依赖于深度全连接神经网络和逐光线采样策略,导致高计算成本。此外,NeRF的隐式表示限制了其从雾气场景中恢复细粒度细节的能力。为克服这些限制,我们提出了一种学习显式高斯表示的方法,通过物理前向渲染过程解释雾气图像的形成机制。我们的方法DehazeGS仅使用多视角雾气图像作为输入,重建和渲染无雾场景。具体而言,基于大气散射模型,我们通过深度到传输映射直接将传输函数建立在高斯原语上,模拟雾的形成。在训练过程中,我们同时学习大气光和散射系数,优化雾气场景的高斯表示。在推理阶段,我们去除高斯分布中的散射和衰减效应,直接渲染场景以获得去雾视图。在现实世界和合成雾气数据集上的实验表明,DehazeGS达到了最先进的性能。可视化结果可在https://dehazegs.github.io/获取。
Summary / 总结
DehazeGS is designed to address the limitations of existing dehazing methods by using an explicit Gaussian representation to simulate the formation of foggy images. It reconstructs and renders fog-free scenes from multi-view foggy images using a physically forward rendering process. The method jointly learns atmospheric light and scattering coefficients during training and removes scattering and attenuation effects at inference time, achieving state-of-the-art performance on both real-world and synthetic foggy datasets.
DehazeGS 通过提出显式的高斯表示来模拟雾景图像的形成机制,解决了现有去雾方法的局限性。该方法使用多视角的雾景图像作为输入,并通过物理前向渲染过程重建无雾场景。实验表明,DehazeGS 在真实世界和合成雾景数据集上的表现优于现有方法,显示出在去雾方面的优越性能。可视化结果可在 https://dehazegs.github.io/ 查看。
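The atmospheric scattering model that the abstract builds on is usually written as I = J * t + A * (1 - t) with transmission t = exp(-beta * d), where J is the clear radiance, A the atmospheric light, beta the scattering coefficient, and d the depth. The sketch below applies that standard scalar form to a single intensity value. It is an illustration only and does not reproduce how DehazeGS attaches transmission to Gaussian primitives; the function names and sample values are hypothetical.

```python
import math


def transmission(depth, beta):
    """Depth-to-transmission mapping t = exp(-beta * depth)."""
    return math.exp(-beta * depth)


def add_fog(clear, depth, beta, airlight):
    """Forward model: attenuated scene radiance plus scattered airlight."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)


def remove_fog(foggy, depth, beta, airlight):
    """Algebraic inverse of add_fog for known depth, beta, and airlight."""
    t = transmission(depth, beta)
    return (foggy - airlight * (1.0 - t)) / t


# Hypothetical values: a dark surface 5 units away under bright airlight.
clear, depth, beta, airlight = 0.2, 5.0, 0.3, 0.9
foggy = add_fog(clear, depth, beta, airlight)
recovered = remove_fog(foggy, depth, beta, airlight)
print(foggy, recovered)
```

The inverse is well conditioned only while t stays away from zero; at large depth or high beta, almost all of the observed intensity is airlight, which is one reason depth and the fog parameters need to be estimated jointly.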
ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Authors: Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu
First: 2025-12-02T14:44:12+00:00 · Latest: 2025-12-02T14:44:12+00:00
Abstract
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .
中文标题/摘要
标题:ReVSeg:通过强化学习激励视频分割中的推理链
以推理为中心的视频对象分割是一个固有的复杂任务:查询通常涉及动态、因果关系和时间交互,而不是静态外观。然而,现有解决方案通常将这些因素简化为潜在嵌入中的简化推理,使推理链变得不透明且实际上无法解决。因此,我们采用显式的分解视角并引入ReVSeg,它在预训练视觉语言模型(VLMs)的原生界面中执行推理作为顺序决策。ReVSeg 不将所有推理合并为一步预测,而是执行三个明确的操作——语义解释、时间证据选择和空间定位,以对齐预训练能力。我们进一步使用强化学习优化多步推理链,使模型能够从结果驱动的信号中自我完善其决策质量。实验结果表明,ReVSeg 在标准视频对象分割基准测试中达到了最先进的性能,并提供了可解释的推理轨迹。项目页面可在 https://clementine24.github.io/ReVSeg/ 获取。
Summary / 总结
ReVSeg addresses the complexity of reasoning in video object segmentation by decomposing the reasoning process into explicit operations and optimizing it with reinforcement learning. It interprets semantics, selects temporal evidence, and grounds spatially, leading to interpretable and state-of-the-art performance on standard benchmarks.
ReVSeg通过将任务分解为明确的操作并使用强化学习优化多步推理链来解决基于推理的视频对象分割的复杂性。它解释语义、选择时间证据并进行空间定位,从而在标准基准上达到最先进的性能并具有可解释的推理轨迹。
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Authors: Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li
First: 2025-12-02T14:42:54+00:00 · Latest: 2025-12-02T14:42:54+00:00
Comments: The first two authors contributed equally. Yang Zhang leads the whole project
Abstract
Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, there exist redundant action modes that are irrelevant to the success action modes of the downstream task. Specifically, we observe a critical inference-time fragility among various sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by stable success modes of the downstream task dataset. Thus, we propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. The VLA models integrated with TACO can execute the actions with maximum pseudo-count from all sampled action chunks, thereby preventing distribution shifts while preserving the generalization ability of VLAs since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it offers significant computational benefits compared to RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult to perform due to the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves the inference stability and success rates in downstream-task adaptations.
中文标题/摘要
标题:引导视觉-语言-动作模型作为反探索:一种测试时扩展方法
视觉-语言-动作(VLA)模型,通过流匹配或扩散目标进行训练,在大规模多模态数据集(例如,人类远程操作、脚本策略)中表现出色,学习复杂行为。然而,由于VLA在预训练阶段整合了多种数据模式,且微调数据集通常包含以动力学次优或不理想方式收集的演示数据,因此存在与下游任务成功动作模式无关的冗余动作模式。具体而言,我们观察到,在预训练VLA的监督微调后,各种采样噪声在推理时表现出关键的不稳定性。在本文中,我们将这种不稳定性归因于VLA策略与由下游任务数据集稳定成功模式诱导的策略之间的分布偏移。因此,我们提出了一种测试时扩展(TTS)框架TACO,该框架采用轻量级伪计数估计器作为动作片段的高保真验证器。结合TACO的VLA模型可以执行所有采样动作片段中伪计数最大的动作,从而防止分布偏移同时保持VLA的泛化能力,因为约束仅在推理时应用。我们的方法类似于离线强化学习(RL)中的经典反探索原则,且由于无梯度,与RL更新相比,它具有显著的计算优势,尤其是对于难以执行RL更新的去噪过程的流或扩散基VLA。在四个模拟基准(RoboTwin2.0、Robotwin、LIBERO、SimplerEnv)和双臂平台上的广泛实验表明,我们的方法显著提高了下游任务适应的推理稳定性和成功率。
Summary / 总结
This paper addresses the instability of Vision-Language-Action (VLA) models during inference, which is attributed to a distribution shift between the VLA policy and the stable success modes of the downstream task. To mitigate this, the authors propose TACO, a test-time-scaling framework that uses a lightweight pseudo-count estimator to select actions with the highest pseudo-count from all sampled action chunks. This approach enhances inference stability and success rates in various simulation benchmarks and a dual-arm platform.
本文研究了Vision-Language-Action (VLA)模型在推理时的不稳定性,这归因于VLA策略与下游任务稳定成功模式之间的分布偏移。为解决这一问题,作者提出了TACO,一种测试时缩放框架,使用轻量级的伪计数估计器来选择具有最高伪计数的动作,从而防止分布偏移并保持泛化能力。实验结果表明,在多个基准和双臂平台上,该方法显著提高了推理稳定性和成功率。
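The selection step described above, sampling several action chunks and executing the one with the largest pseudo-count, can be sketched as a best-of-N filter. The abstract does not specify the estimator, so the Gaussian-kernel count below is a stand-in assumption rather than TACO's actual verifier. The names `pseudo_count` and `select_chunk`, the bandwidth, and all data are hypothetical.

```python
import math


def pseudo_count(chunk, dataset, bandwidth=0.5):
    """Stand-in estimator (assumption): Gaussian-kernel count of dataset
    chunks near `chunk`. Larger values mean more in-distribution."""
    total = 0.0
    for ref in dataset:
        sq_dist = sum((a - b) ** 2 for a, b in zip(chunk, ref))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total


def select_chunk(candidates, dataset):
    """Gradient-free selection: keep the candidate with the largest count."""
    return max(candidates, key=lambda c: pseudo_count(c, dataset))


# Hypothetical flattened action chunks from successful demonstrations.
dataset = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]

# Hypothetical candidates, as if sampled from the policy with different noise.
candidates = [[2.0, 2.0], [0.05, 0.05], [-1.5, 1.0]]

best = select_chunk(candidates, dataset)
print(best)
```

Because the filter only ranks samples the policy already produced, the policy weights are never updated, which matches the abstract's point that the constraint is applied only during inference.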
From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity
Authors: Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, Hongyi Wen
First: 2025-12-02T14:34:10+00:00 · Latest: 2025-12-02T14:34:10+00:00
Comments: Preprint version; 15 pages, 16 figures
Abstract
Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.
中文标题/摘要
标题:从导航到精炼:通过Oracle速度揭示基于流的扩散模型的两阶段本质
基于流的扩散模型已成为训练图像和视频生成模型的主要范式。然而,它们的存储-泛化行为仍然知之甚少。在本文中,我们重新审视了流匹配(FM)目标,并研究了其边际速度场,该速度场具有闭式表达式,允许精确计算Oracle FM目标。分析该Oracle速度场揭示了基于流的扩散模型本质上形成了一个两阶段的训练目标:早期阶段由数据模式混合引导,后期阶段则由最近的数据样本主导。两阶段目标导致了不同的学习行为:早期的导航阶段在数据模式之间泛化以形成全局布局,而后期的精炼阶段则越来越多地记忆细粒度的细节。利用这些见解,我们解释了诸如时间步长偏移计划、无分类引导间隔和潜在空间设计选择等实用技术的有效性。我们的研究加深了对扩散模型训练动力学的理解,并为未来的架构和算法改进提供了原则。
Summary / 总结
This work revisits flow-based diffusion models and identifies a two-stage training process through the analysis of the oracle velocity field. The early stage is guided by a mixture of data modes, leading to generalization across data modes, while the later stage focuses on memorizing fine-grained details. This two-stage objective explains the effectiveness of practical techniques like timestep-shifted schedules and classifier-free guidance intervals. The study enhances the understanding of diffusion model training dynamics and provides guidance for future improvements.
本文重新审视了基于流的扩散模型,并通过分析Oracle速度场识别出一个两阶段的训练过程。早期阶段由数据模式混合引导,实现跨数据模式的泛化,而后期阶段则专注于记忆细粒度细节。这种两阶段目标解释了时间步长偏移调度和无分类器引导间隔等实用技术的有效性。该研究加深了对扩散模型训练动态的理解,并为未来的架构和算法改进提供了指导原则。
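The two-stage behavior can be seen directly from the closed-form oracle velocity on a finite dataset. The sketch below assumes the common linear path x_t = (1 - t) * x_0 + t * x_1 with standard Gaussian noise x_0 at t = 0 and data x_1 at t = 1; the paper's own time convention and parameterization may differ. Under that assumption the oracle is u(x, t) = sum_i w_i * (x_i - x) / (1 - t), with weights w_i given by a softmax over -||x - t * x_i||^2 / (2 * (1 - t)^2). The dataset and query point are hypothetical.

```python
import math


def oracle_weights(x, t, data):
    """Posterior weight of each data point given x at time t (softmax)."""
    var = (1.0 - t) ** 2
    logits = [
        -sum((xk - t * dk) ** 2 for xk, dk in zip(x, d)) / (2.0 * var)
        for d in data
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def oracle_velocity(x, t, data):
    """Weighted average of the conditional velocities (x_i - x) / (1 - t)."""
    w = oracle_weights(x, t, data)
    return [
        sum(wi * (d[k] - x[k]) for wi, d in zip(w, data)) / (1.0 - t)
        for k in range(len(x))
    ]


# Hypothetical 2D dataset with three samples and one query point.
data = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.5]]
x = [0.2, 0.1]

early = oracle_weights(x, 0.05, data)  # close to the noise end
late = oracle_weights(x, 0.95, data)   # close to the data end
v_late = oracle_velocity(x, 0.95, data)
print(early, late, v_late)
```

Near t = 0 the weights are almost uniform, so the target points toward a blend of all samples (the navigation stage); near t = 1 essentially all of the weight sits on the sample nearest to x / t, so the target points at a single training example (the refinement stage).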
Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control
Authors: Yongrui Yu, Zhongzhen Huang, Linjie Mu, Shaoting Zhang, Xiaofan Zhang
First: 2025-12-02T14:25:05+00:00 · Latest: 2025-12-02T14:25:05+00:00
Abstract
Radiology reporting is an essential yet time-consuming and error-prone task for radiologists in clinical examinations, especially for volumetric medical images. Rigorous quality control is also critical but tedious, ensuring that the final report meets clinical standards. Existing automated approaches, including radiology report generation methods and medical vision-language models, focus mainly on the report generation phase and neglect the crucial quality control procedure, limiting their capability to provide comprehensive support to radiologists. We propose Radiologist Copilot, an agentic AI assistant equipped with orchestrated tools designed for automated radiology reporting with quality control. Leveraging large language models as the reasoning backbone, the agentic system autonomously selects tools, plans, and executes actions, emulating the behavior of radiologists throughout the holistic radiology reporting process. The orchestrated tools include region localization, region analysis planning directed by the think-with-image paradigm, strategic template selection for report generation, and quality assessment with feedback-driven adaptive refinement for quality control. Therefore, Radiologist Copilot facilitates accurate, complete, and efficient radiology reporting, assisting radiologists and improving clinical efficiency. Experimental results demonstrate that Radiologist Copilot significantly surpasses other state-of-the-art methods in radiology reporting. The source code will be released upon acceptance.
中文标题/摘要
标题:放射科副驾:具备质量控制的代理型助理与协调工具
放射学报告是临床检查中放射科医生的一项重要但耗时且易出错的任务,尤其是对于体积医学图像。严格的质量控制也至关重要但繁琐,确保最终报告符合临床标准。现有的自动化方法,包括放射学报告生成方法和医学视觉-语言模型,主要集中在报告生成阶段,忽视了关键的质量控制程序,限制了它们为放射科医生提供全面支持的能力。我们提出放射科副驾,这是一种具备代理型的AI助理,配备有为自动化放射学报告和质量控制设计的协调工具。利用大型语言模型作为推理基础,代理系统自主选择工具、规划和执行操作,模拟放射科医生在整个放射学报告过程中的行为。协调工具包括区域定位、基于图像导向的区域分析规划、战略模板选择以生成报告、质量评估以及反馈驱动的质量控制自适应精炼。因此,放射科副驾促进了准确、完整和高效的放射学报告,协助放射科医生并提高临床效率。实验结果表明,放射科副驾在放射学报告方面显著超越了其他最先进的方法。源代码将在接受后发布。
Summary / 总结
Radiologist Copilot is an AI assistant designed to assist radiologists in generating accurate and efficient radiology reports with integrated quality control. It uses large language models to autonomously select and execute tools for tasks such as region localization, image analysis, report generation, and quality assessment. Experimental results show that Radiologist Copilot outperforms existing methods in radiology reporting accuracy and efficiency.
Radiologist Copilot 是一个旨在支持放射科医生进行全面放射学报告工作的 AI 助手,包括质量控制。它使用大型语言模型自主选择和执行工具,用于区域定位、图像分析、报告生成和质量评估。实验结果表明,Radiologist Copilot 在生成准确且高效的放射学报告方面优于现有方法,提高临床效率。
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
Authors: Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li, Liangming Chen, Lei Ren, Yong-Lu Li
First: 2025-12-02T14:02:42+00:00 · Latest: 2025-12-02T14:02:42+00:00
Abstract
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
中文标题/摘要
标题:通过视觉符号诊断、纠正和学习操作失败
视觉-语言-行动(VLA)模型在机器人操作方面取得了显著进展,但在故障诊断和从故障中学习方面仍然有限。此外,现有的故障数据集大多是在模拟中通过编程生成的,这限制了它们在现实世界中的泛化能力。鉴于此,我们提出了ViFailback框架,旨在诊断机器人操作故障并提供文本和视觉纠正指导。我们的框架利用显式的视觉符号来增强注释效率。我们进一步发布了ViFailback数据集,这是一个包含58,126个视觉问答(VQA)对及其对应的5,202条真实世界操作轨迹的大规模集合。基于该数据集,我们建立了ViFailback-Bench基准,这是一个包含11个细粒度VQA任务的基准,旨在评估视觉语言模型(VLM)的故障诊断和纠正能力,其中包括ViFailback-Bench Lite用于封闭式评估和ViFailback-Bench Hard用于开放式评估。为了证明我们框架的有效性,我们构建了ViFailback-8B VLM,它不仅在ViFailback-Bench上实现了显著的整体性能提升,还生成了视觉符号以提供纠正行动指导。最后,通过将ViFailback-8B与VLA模型集成,我们进行了现实世界的机器人实验,展示了其帮助VLA模型从故障中恢复的能力。项目网站:https://x1nyuzhou.github.io/vifailback.github.io/
Summary / 总结
The paper introduces ViFailback, a framework for diagnosing robotic manipulation failures and providing corrective guidance through visual symbols. It includes a large real-world dataset of 58,126 VQA pairs and 5,202 manipulation trajectories, and establishes ViFailback-Bench to evaluate VLMs in failure diagnosis and correction. The ViFailback-8B VLM shows significant improvement on the benchmark and generates visual symbols for corrective actions, aiding a VLA model in real-world robotic experiments to recover from failures.
该论文提出了ViFailback框架,用于通过视觉符号诊断机器人操作失败并提供纠正指导。它包含一个大规模的真实世界数据集和一个基准,用于评估视觉语言模型。ViFailback-8B模型在失败诊断和纠正方面表现出显著的改进,并在与视觉语言行动模型集成后,在真实世界实验中帮助恢复操作。
medDreamer: Model-Based Reinforcement Learning with Latent Imagination on Complex EHRs for Clinical Decision Support
Authors: Qianyi Xu, Gousia Habib, Feng Wu, Dilruk Perera, Mengling Feng
First: 2025-05-26T10:16:39+00:00 · Latest: 2025-12-02T13:41:07+00:00
Abstract
Timely and personalized treatment decisions are essential across a wide range of healthcare settings where patient responses can vary significantly and evolve over time. Clinical data used to support these treatment decisions are often irregularly sampled, where missing data frequencies may implicitly convey information about the patient's condition. Existing Reinforcement Learning (RL) based clinical decision support systems often ignore the missing patterns and distort them with coarse discretization and simple imputation. They are also predominantly model-free and largely depend on retrospective data, which could lead to insufficient exploration and bias by historical behaviors. To address these limitations, we propose medDreamer, a novel model-based reinforcement learning framework for personalized treatment recommendation. medDreamer contains a world model with an Adaptive Feature Integration module that simulates latent patient states from irregular data and a two-phase policy trained on a hybrid of real and imagined trajectories. This enables learning optimal policies that go beyond the sub-optimality of historical clinical decisions, while remaining close to real clinical data. We evaluate medDreamer on both sepsis and mechanical ventilation treatment tasks using two large-scale Electronic Health Records (EHRs) datasets. Comprehensive evaluations show that medDreamer significantly outperforms model-free and model-based baselines in both clinical outcomes and off-policy metrics.
中文标题/摘要
标题:medDreamer:基于模型的强化学习与潜在想象在复杂EHRs中的应用以支持临床决策
及时且个性化的治疗决策在广泛医疗保健环境中至关重要,因为患者反应可能差异显著且随时间变化。用于支持这些治疗决策的临床数据通常采样不规则,缺失数据的频率可能隐含地传达患者状况的信息。现有的基于强化学习(RL)的临床决策支持系统往往忽略了缺失模式,并通过粗略离散化和简单插补来扭曲它们。它们也主要是无模型的,很大程度上依赖于回顾性数据,这可能导致探索不足和历史行为偏差。为解决这些局限性,我们提出了一种名为medDreamer的新型基于模型的强化学习框架,用于个性化治疗推荐。medDreamer包含一个世界模型,其中包含自适应特征整合模块,可以从不规则数据中模拟潜在患者状态,并且包含基于真实和想象轨迹混合训练的两阶段策略。这使得学习超越历史临床决策次优性的最优策略,同时保持接近真实临床数据。我们在两个大型电子健康记录(EHRs)数据集上对medDreamer进行了败血症和机械通气治疗任务的评估。全面评估表明,medDreamer在临床结果和离策性能指标上均显著优于无模型和基于模型的基线。
Summary / 总结
The research aims to improve personalized treatment decisions in healthcare by addressing the limitations of existing Reinforcement Learning (RL) systems, which often ignore missing data patterns and rely on coarse discretization. medDreamer, a model-based RL framework, uses a world model with an Adaptive Feature Integration module to simulate latent patient states from irregular data and a two-phase policy trained on a combination of real and imagined trajectories. The study demonstrates that medDreamer outperforms both model-free and model-based baselines in clinical outcomes and off-policy metrics on sepsis and mechanical ventilation treatment tasks using large-scale EHR datasets.
研究旨在通过解决现有基于强化学习(RL)系统的局限性,如忽略缺失数据模式和依赖粗略离散化,来改善个性化治疗决策。medDreamer 是一种基于模型的 RL 框架,其世界模型带有自适应特征整合模块,可从不规则数据中模拟潜在患者状态,并在真实与想象轨迹的混合数据上训练两阶段策略。研究结果表明,medDreamer 在使用大规模电子健康记录(EHR)数据集进行败血症和机械通气任务评估时,在临床结果和离策指标上均优于无模型和基于模型的基线。
Reasoning-Aware Multimodal Fusion for Hateful Video Detection
Authors: Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu
First: 2025-12-02T13:24:17+00:00 · Latest: 2025-12-02T13:24:17+00:00
Abstract
Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
中文标题/摘要
标题:具有推理意识的多模态融合用于仇恨视频检测
在线视频中的仇恨言论正日益对数字平台构成严重威胁,尤其是随着视频内容变得越来越具多模态性和情境依赖性。现有方法往往难以有效融合模态间的复杂语义关系,缺乏理解细微仇恨内容的能力。为解决这些问题,我们提出了一种创新的具有推理意识的多模态融合(RAMF)框架。为应对第一个挑战,我们设计了局部-全局上下文融合(LGCF)以捕捉局部显著线索和全局时间结构,并提出了语义跨注意力(SCA)以实现细粒度的多模态语义交互。为应对第二个挑战,我们引入了对抗性推理——一个结构化的三阶段过程,其中视觉-语言模型生成(i)客观描述,(ii)假设仇恨的推断,以及(iii)未假设仇恨的推断,提供互补的语义视角,丰富模型对细微仇恨意图的情境理解。在两个真实世界的仇恨视频数据集上的评估表明,我们的方法实现了稳健的泛化性能,分别在宏F1和仇恨类别召回上比最先进的方法提高了3%和7%。在匿名期结束后,我们将发布代码。
Summary / 总结
The paper addresses the challenge of detecting hate speech in online videos by proposing a Reasoning-Aware Multimodal Fusion (RAMF) framework. This framework includes Local-Global Context Fusion (LGCF) for capturing both local and global information, and Semantic Cross Attention (SCA) for fine-grained multimodal interaction. Additionally, it introduces adversarial reasoning to generate objective, hate-assumed, and non-hate-assumed inferences, which enhance contextual understanding. Experiments on two real-world datasets show that the proposed method outperforms existing approaches, with improvements of 3% and 7% in Macro-F1 and hate class recall, respectively.
论文针对在线视频中的仇恨言论检测问题,由于内容变得多模态且依赖上下文,这一问题变得越来越复杂。提出了一种Reasoning-Aware Multimodal Fusion (RAMF)框架,包括Local-Global Context Fusion (LGCF)和Semantic Cross Attention (SCA),以捕捉局部和全局线索并实现细粒度的语义交互。该框架还采用对抗推理生成客观、假设仇恨和非假设仇恨的推断,增强上下文理解。实验结果显示,该方法优于现有方法,分别在Macro-F1和仇恨类别召回上提高了3%和7%。
Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Authors: Xisheng Feng
First: 2025-11-30T13:04:43+00:00 · Latest: 2025-12-02T13:21:07+00:00
Abstract
Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination" where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval.
中文标题/摘要
标题:看、背诵、然后回答:通过自动生成知识提示提高VLM性能
视觉-语言模型(VLMs)在精准农业等专业领域表现出显著的性能瓶颈,主要原因是“推理驱动的幻觉”,其中语言先验覆盖了视觉感知。一个关键瓶颈是“模态差距”:视觉嵌入无法可靠激活模型参数中已编码的细粒度专家知识。我们提出了一种参数高效的“看、背诵、然后回答”框架,通过自动生成的知识提示增强VLMs,同时冻结主干模型。该框架将推理拆分为三个阶段:(1)看生成客观的视觉描述和候选集;(2)背诵使用轻量级的1.7B路由器将视觉线索转化为触发特定候选参数知识的目标查询;(3)回答在描述和背诵的知识之间进行并行证据对齐,以选择最一致的标签。在AgroBench上,我们的方法达到了最先进的结果,相比Qwen2-VL-72B提高了杂草识别的准确性23.52%,并且优于GPT-4o而无需外部搜索开销。这种模块化设计通过将被动感知转化为主动、可控的知识检索来减轻幻觉。
Summary / 总结
The research aims to improve VLM performance in specialized domains like precision agriculture by addressing the 'Reasoning-Driven Hallucination' issue. The proposed 'Look, Recite, Then Answer' framework enhances VLMs through self-generated knowledge hints while keeping the backbone models frozen. It consists of three stages: Look generates visual descriptions and candidate sets, Recite uses a lightweight 1.7B router to transform visual cues into targeted queries, and Answer aligns evidence to select the most consistent label. The method achieves state-of-the-art results on AgroBench, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead.
研究针对视觉语言模型(VLMs)在精准农业等专业领域中的性能瓶颈,特别是‘推理驱动的幻觉’问题。提出了一种参数高效的‘看、背、答’框架,在冻结主干模型的同时,通过自生成的知识提示来增强VLMs。该方法分为三个阶段:视觉描述生成、目标查询转换和证据对齐。在AgroBench上,该方法取得了最先进的成果,杂草识别的准确性相比Qwen2-VL-72B提高了23.52%,并且在没有外部搜索开销的情况下超过了GPT-4o。
GAPO: Robust Advantage Estimation for Real-World Code LLMs
Authors: Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao
First: 2025-10-22T03:37:49+00:00 · Latest: 2025-12-02T13:14:15+00:00
Abstract
Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods like GRPO are popular for their critic-free, normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable outliers, leading to distorted advantage computation and increased noise. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an outlier-free highest-density interval (HDI) per prompt and then uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation. This adaptive Q robustly handles skewed distributions while remaining plug-and-play and efficient. We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks across 10 languages, demonstrating consistent improvements in exact match accuracy over GRPO and its variant DAPO. Code is publicly available.
中文标题/摘要
标题:GAPO:面向实际代码LLM的稳健优势估计
强化学习(RL)广泛用于代码编辑的后训练大型语言模型(LLMs),其中群组相对方法如GRPO因其无批评者、标准化的优势估计而流行。然而,在实际代码编辑场景中,奖励分布往往偏斜且存在不可预测的异常值,导致优势计算失真并增加噪声。为解决这一问题,我们提出了组自适应策略优化(GAPO),它会针对每个提示自适应地找到一个无异常值的最高密度区间(HDI),然后使用该区间的中位数作为自适应Q来替换优势计算中的群组均值。这种自适应Q能够稳健地处理偏斜分布,同时保持即插即用和高效。我们使用包含51,844个实际、历史感知的代码编辑任务(涵盖10种语言)的大规模内部数据集,对九种指令调优的LLM(3B-14B)进行了GAPO的验证,结果显示GAPO在精确匹配准确度上优于GRPO及其变体DAPO。代码已公开。
Summary / 总结
The paper introduces GAPO, a method for robust advantage estimation in real-world code-editing scenarios using large language models (LLMs). It addresses the issue of skewed reward distributions by adaptively finding an outlier-free highest-density interval per prompt and using the median of that interval as an adaptive Q for advantage calculation. Experimental results show consistent improvements in exact match accuracy over existing methods like GRPO and DAPO on nine instruction-tuned LLMs across 51,844 real-world code-editing tasks in 10 languages.
论文针对现实世界代码编辑场景中奖励分布偏斜导致优势计算失真的问题,提出了Group Adaptive Policy Optimization (GAPO) 方法,该方法为每个提示识别一个无异常值的最高密度区间,并使用该区间的中位数作为自适应Q进行优势计算。实验结果显示,GAPO 在九个指令调优的大语言模型上一致提高了精确匹配准确率,优于GRPO和DAPO。
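The advantage computation described in the GAPO abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the highest-density interval is the shortest window of sorted rewards covering a fraction `mass` of the group, and that the advantage keeps a GRPO-style division by the group standard deviation. The function names, the `mass` default, and `eps` are all hypothetical.

```python
import math
import statistics


def hdi_median(rewards, mass=0.5):
    """Median of the shortest window covering `mass` of the sorted rewards.

    Assumption: the HDI is approximated by the narrowest run of
    k = ceil(mass * n) consecutive sorted samples.
    """
    xs = sorted(rewards)
    n = len(xs)
    k = max(1, math.ceil(mass * n))
    # Pick the start index whose k-sample window has the smallest width.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return statistics.median(xs[start:start + k])


def gapo_like_advantages(rewards, mass=0.5, eps=1e-6):
    """Advantages centred on the HDI median instead of the group mean.

    Assumption: scaling by the population standard deviation mirrors
    group-relative normalisation; the paper may scale differently.
    """
    q = hdi_median(rewards, mass)
    std = statistics.pstdev(rewards)
    return [(r - q) / (std + eps) for r in rewards]


group = [1.0, 1.0, 2.0, 1.0, 10.0]  # one outlier reward in the group
adaptive_q = hdi_median(group)
advantages = gapo_like_advantages(group)
```

With the outlier present, the group mean is 3.0 while the HDI median stays at the typical reward of 1.0, so ordinary samples receive an advantage of zero rather than being pushed negative by a single extreme reward.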
NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
Authors: Max Gandyra, Alessandro Santonicola, Michael Beetz
First: 2025-07-02T08:23:14+00:00 · Latest: 2025-12-02T12:42:27+00:00
Comments: 9 pages, 3 figures, 5 tables
Abstract
Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well-known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re-)training has proven to be a difficult task. To handle this, we present a new training-free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre-trained models: Grounded-SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero-shot capabilities. Internally, the proposal-object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals' bounding box and mask as a scoring component; and (iii) an RGB-only pipeline that performs even better than RGB-D ones. We empirically show that NOCTIS, without further training/fine-tuning, outperforms the best RGB and RGB-D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the "Model-based 2D segmentation of unseen objects" task.
中文标题/摘要
标题:NOCTIS:基于新颖对象循环阈值实例分割
给定每种对象的一些示例图像,在RGB图像中对新颖对象实例进行实例分割是一个在计算机视觉中广为人知的问题。设计一个适用于所有类型新颖对象的通用模型而无需(重新)训练,证明是一个困难的任务。为了解决这个问题,我们提出了一种新的无需训练框架,称为:基于新颖对象循环阈值的实例分割(NOCTIS)。NOCTIS 结合了两个预训练模型:Grounded-SAM 2 用于具有精确边界框和相应分割掩码的对象提案;以及 DINOv2 由于其零样本能力,用于稳健的类别和补丁嵌入。内部,提案-对象匹配通过基于类别嵌入的相似性和补丁嵌入的平均最大相似性来确定对象匹配得分,使用新的循环阈值(CT)机制来缓解由重复纹理或视觉相似模式引起的不稳定匹配。除了CT,NOCTIS 引入了:(i)不受对象选择偏差影响的外观得分;(ii)使用提案边界框和掩码的平均置信度作为评分组件;(iii)仅使用RGB的管道,其性能甚至优于RGB-D管道。我们实验证明,NOCTIS 在BOP 2023挑战赛的七个核心数据集上,对于“基于模型的未见对象2D分割”任务,无需进一步训练/微调,其平均AP得分优于最佳RGB和RGB-D方法。
Summary / 总结
NOCTIS is a training-free framework for instance segmentation of novel objects in RGB images, leveraging Grounded-SAM 2 for object proposals and DINOv2 for robust class and patch embeddings. It introduces a cyclic thresholding mechanism to stabilize proposal-object matching and adds an appearance score and a confidence-based scoring component. Without further training, NOCTIS outperforms the best RGB and RGB-D methods in mean AP on the seven core datasets of the BOP 2023 challenge.
NOCTIS 是一个无需训练的框架,用于 RGB 图像中新型物体的实例分割。它利用 Grounded-SAM 2 进行精确的对象提案,并使用 DINOv2 进行稳健的类别和补丁嵌入。NOCTIS 引入了循环阈值机制以提高匹配稳定性,并包括外观得分和置信度得分组件。实验表明,NOCTIS 在 BOP 2023 挑战赛的未见物体分割任务上优于现有 RGB 和 RGB-D 方法。
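The two similarity terms named in the NOCTIS abstract (class-embedding similarity and the average maximum patch-embedding similarity) can be sketched as follows. This is an illustration under assumptions, not the paper's scoring function: cosine similarity is assumed as the similarity measure, the two terms are combined with an equal-weight average, and the cyclic thresholding step and the confidence component are deliberately left out because the abstract does not specify them. All function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def avg_max_patch_similarity(proposal_patches, template_patches):
    """For each proposal patch, take its best cosine match among the
    template patches, then average those maxima."""
    p = l2_normalize(proposal_patches)   # shape (P, D)
    t = l2_normalize(template_patches)   # shape (T, D)
    sim = p @ t.T                        # shape (P, T), cosine similarities
    return float(sim.max(axis=1).mean())


def matching_score(prop_cls, tmpl_cls, prop_patches, tmpl_patches):
    """Assumed equal-weight mean of class and patch similarity.
    Cyclic thresholding is NOT modelled here."""
    cls_sim = float(l2_normalize(prop_cls) @ l2_normalize(tmpl_cls))
    patch_sim = avg_max_patch_similarity(prop_patches, tmpl_patches)
    return 0.5 * (cls_sim + patch_sim)


# Toy embeddings: an identical pair and a fully orthogonal pair.
cls_a = np.array([1.0, 0.0])
cls_b = np.array([0.0, 1.0])
patches_a = np.array([[1.0, 0.0], [0.0, 1.0]])
same = matching_score(cls_a, cls_a, patches_a, patches_a)
different = matching_score(cls_a, cls_b, np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
```

In practice the embeddings would come from a pre-trained encoder such as DINOv2; the toy vectors above only show that identical inputs score near 1 and orthogonal inputs score 0.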
VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Authors: Zhenkai Wu, Xiaowen Ma, Zhenliang Ni, Dengming Zhang, Han Shu, Xin Jiang, Xinghao Chen
First: 2025-12-02T12:30:05+00:00 · Latest: 2025-12-02T12:30:05+00:00
Abstract
Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.
中文标题/摘要
标题:VLM-剪枝器:高效VLM离心式标记剪枝范式中的空间稀疏性缓冲
视觉语言模型(VLMs)在图像理解任务中表现出色,但大量的视觉标记带来了显著的计算成本,阻碍了其在移动设备上的部署。许多剪枝方法仅依赖于标记的重要性,从而忽视了标记间的冗余性,保留了大量重复的标记,浪费了容量。尽管已经提出了一些意识冗余的方法,但它们往往忽略了视觉标记之间的空间关系。这可能导致保留标记的选择过于稀疏,无法充分覆盖目标对象的区域。为了解决这些限制,我们提出了VLM-剪枝器,这是一种无需训练的标记剪枝算法,明确平衡冗余性和空间稀疏性。我们引入了一种离心标记剪枝范式,能够在优先保留细粒度对象细节的同时进行近到远的选择。此外,我们设计了一种空间稀疏性缓冲(BSS)准则,推迟选择空间上距离较远的标记。我们还采用了一种并行贪婪策略来高效地进行标记选择。为了减轻剪枝带来的信息损失,我们选择性地将被丢弃标记中的重要信息融合到保留的标记中。全面的比较表明,VLM-剪枝器在五个VLM中以88.9%的剪枝率持续优于强大的基线模型,同时实现了端到端的推理加速。
Summary / 总结
The research aims to reduce the computational costs of vision-language models (VLMs) by addressing the issue of spatial sparsity in token pruning. VLM-Pruner is a training-free algorithm that balances redundancy and spatial sparsity by introducing a centrifugal token pruning paradigm and a Buffering for Spatial Sparsity (BSS) criterion. The method also uses a parallel greedy strategy for efficient token selection and selectively fuses information from discarded tokens into retained ones to mitigate information loss. Experimental results show that VLM-Pruner outperforms strong baselines with an 88.9% pruning rate, achieving an end-to-end inference speedup.
研究旨在通过提出VLM-Pruner,一种无需训练的token剪枝算法,来解决视觉语言模型(VLMs)的计算挑战,该算法平衡冗余和空间稀疏性。VLM-Pruner使用了离心token剪枝范式和空间稀疏性缓冲(BSS)准则,以高效地选择token并保留细粒度的对象细节。实验结果表明,VLM-Pruner在88.9%的剪枝率下优于强基线,并提供了端到端的推理加速。
Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations
Authors: Nicola Messina, Rosario Leonardi, Luca Ciampi, Fabio Carrara, Giovanni Maria Farinella, Fabrizio Falchi, Antonino Furnari
First: 2025-09-30T09:34:55+00:00 · Latest: 2025-12-02T12:27:47+00:00
Comments: Under consideration at Pattern Recognition Letters
Abstract
Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations – natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations in a weakly-supervised regime. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations. Code and data can be found at https://fpv-iplab.github.io/WISH.
中文标题/摘要
标题:通过人类叙述的弱监督学习第一人称视角手内物体分割
从第一人称图像中识别用户操作的对象的像素级识别能够支持辅助技术、工业安全和活动监测等关键应用。然而,该领域的发展目前受到标注数据稀缺的阻碍,因为现有方法依赖于昂贵的手动标签。本文提出利用叙述(即,摄像机佩戴者执行的动作的自然语言描述,其中包含有关操作对象的线索)来学习人类-物体交互检测。我们引入了叙述监督的手内物体分割(NS-iHOS),这是一个新颖的任务,模型需要通过学习自然语言叙述来学习手内物体分割,处于弱监督的环境中。叙述在推理时不再使用。我们通过提出弱监督的手内物体分割从人类叙述(WISH),一个端到端模型,从叙述中提炼知识来学习合理的手-物体关联,从而在测试时不使用叙述实现手内物体分割。我们基于开放词汇物体检测器和视觉语言模型的不同基线对WISH进行了基准测试。EPIC-Kitchens和Ego4D上的实验表明,WISH超越了所有基线,恢复了完全监督方法超过50%的性能,而无需使用细粒度的像素级注释。代码和数据可在https://fpv-iplab.github.io/WISH/找到。
Summary / 总结
This paper addresses the challenge of in-hand object segmentation from egocentric images, which is crucial for applications like assistive technologies and activity monitoring. The authors propose a weakly-supervised approach that uses natural language narrations as the source of supervision during training; narrations are not used at inference time. Their model, WISH, surpasses all baselines on EPIC-Kitchens and Ego4D and recovers more than 50% of the performance of fully supervised methods without requiring pixel-wise annotations.
本文旨在解决从第一人称视角图像中分割用户手持物体的问题,这对于辅助技术和活动监测至关重要。作者提出了一项弱监督任务,即Narration-Supervised in-Hand Object Segmentation (NS-iHOS),利用摄像机佩戴者所执行动作的自然语言叙述作为监督信号,且推理时不使用叙述。所提出的模型Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH)无需像素级注释。实验结果表明,WISH在EPIC-Kitchens和Ego4D上优于所有基线,恢复了全监督方法50%以上的性能。
3DIS: Depth-Driven Decoupled Instance Synthesis for Text-to-Image Generation
Authors: Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
First: 2024-10-16T15:34:13+00:00 · Latest: 2025-12-02T12:08:17+00:00
Comments: 10 pages
Abstract
The increasing demand for controllable outputs in text-to-image generation has spurred advancements in multi-instance generation (MIG), allowing users to define both instance layouts and attributes. However, unlike image-conditional generation methods such as ControlNet, MIG techniques have not been widely adopted in state-of-the-art models like SD2 and SDXL, primarily due to the challenge of building robust renderers that simultaneously handle instance positioning and attribute rendering. In this paper, we introduce Depth-Driven Decoupled Instance Synthesis (3DIS), a novel framework that decouples the MIG process into two stages: (i) generating a coarse scene depth map for accurate instance positioning and scene composition, and (ii) rendering fine-grained attributes using pre-trained ControlNet on any foundational model, without additional training. Our 3DIS framework integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering. Extensive experiments on COCO-Position and COCO-MIG benchmarks demonstrate that 3DIS significantly outperforms existing methods in both layout precision and attribute rendering. Notably, 3DIS offers seamless compatibility with diverse foundational models, providing a robust, adaptable solution for advanced multi-instance generation. The code is available at: https://github.com/limuloo/3DIS.
中文标题/摘要
标题:3DIS:深度驱动的解耦实例合成用于文本到图像生成
文本到图像生成对可控输出的需求不断增加,推动了多实例生成(MIG)的进步,允许用户定义实例布局和属性。然而,与基于图像条件生成的方法(如ControlNet)相比,MIG技术在SD2和SDXL等最先进的模型中尚未广泛采用,主要原因是构建同时处理实例定位和属性渲染的稳健渲染器的挑战。本文介绍了一种新颖的框架——深度驱动的解耦实例合成(3DIS),该框架将MIG过程分为两个阶段:(i)生成粗略的场景深度图以实现准确的实例定位和场景组成;(ii)使用预训练的ControlNet在任何基础模型上渲染细粒度属性,无需额外训练。我们的3DIS框架将一个自定义适配器集成到LDM3D中,以实现精确的基于深度的布局,并采用无需微调的方法以增强实例级别的属性渲染。在COCO-Position和COCO-MIG基准上的广泛实验表明,3DIS在布局精度和属性渲染方面显著优于现有方法。值得注意的是,3DIS与多种基础模型无缝兼容,提供了一种稳健且适应性强的多实例生成解决方案。代码可在:https://github.com/limuloo/3DIS获取。
Summary / 总结
3DIS is a novel framework for text-to-image generation that decouples the multi-instance generation process into two stages: generating a coarse scene depth map for accurate instance positioning and scene composition, and rendering fine-grained attributes using pre-trained ControlNet. The framework integrates a custom adapter into LDM3D for precise depth-based layouts and employs a finetuning-free method for enhanced instance-level attribute rendering. Experiments on COCO-Position and COCO-MIG benchmarks show that 3DIS significantly outperforms existing methods in layout precision and attribute rendering, offering seamless compatibility with diverse foundational models.
3DIS是一种用于文本到图像生成的新型框架,将多实例生成过程分为两个阶段:生成粗略的场景深度图以实现准确的实例定位和场景组成,以及使用预训练的ControlNet渲染细粒度的属性。在COCO-Position和COCO-MIG基准上的实验表明,3DIS在布局精度和属性渲染方面优于现有方法,并且兼容多种基础模型。