ARC Is a Vision Problem!
Authors: Keya Hu, Ali Cy, Linlu Qiu, Xiaoman Delores Ding, Runqian Wang, Yeyin Eva Zhu, Jacob Andreas, Kaiming He
First: 2025-11-18T18:59:49+00:00 · Latest: 2025-11-18T18:59:49+00:00
Comments: Technical Report. Project webpage: https://github.com/lillian039/VARC
Abstract
The Abstraction and Reasoning Corpus (ARC) is designed to promote research on abstract reasoning, a fundamental aspect of human intelligence. Common approaches to ARC treat it as a language-oriented problem, addressed by large language models (LLMs) or recurrent reasoning models. However, although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. To incorporate visual priors, we represent the inputs on a "canvas" that can be processed like natural images. It is then natural for us to apply standard vision architectures, such as a vanilla Vision Transformer (ViT), to perform image-to-image mapping. Our model is trained from scratch solely on ARC data and generalizes to unseen tasks through test-time training. Our framework, termed Vision ARC (VARC), achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch. Our results are competitive with those of leading LLMs and close the gap to average human performance.
中文标题/摘要
标题:ARC 是一个视觉问题!
抽象和推理语料库(ARC)旨在促进抽象推理的研究,这是人类智能的一个基本方面。常见的ARC处理方法将其视为语言导向的问题,通过大型语言模型(LLMs)或递归推理模型来解决。然而,尽管ARC中的谜题任务本质上是视觉性的,现有的研究很少从视觉中心的角度来处理这个问题。在本文中,我们从视觉范式出发,将ARC视为图像到图像的转换问题。为了引入视觉先验,我们将输入表示为一个“画布”,可以像自然图像一样进行处理。因此,我们自然地可以应用标准的视觉架构,如基础的视觉变换器(ViT),来进行图像到图像的映射。我们的模型仅从头开始在ARC数据上进行训练,并通过测试时的训练泛化到未见过的任务。我们的框架称为Vision ARC(VARC),在ARC-1基准测试中达到了60.4%的准确率,显著优于其他从头开始训练的方法。我们的结果与领先的LLMs相当,并缩小了与平均人类表现的差距。
Summary / 总结
The research formulates the Abstraction and Reasoning Corpus (ARC) within a vision-centric paradigm, treating it as an image-to-image translation problem. The proposed Vision ARC (VARC) framework applies a vanilla Vision Transformer to a "canvas" representation of the inputs, is trained from scratch solely on ARC data, and generalizes to unseen tasks through test-time training. It achieves 60.4% accuracy on the ARC-1 benchmark, substantially outperforming existing methods that are also trained from scratch and closing the gap to average human performance.
抽象推理语料库(ARC)旨在促进抽象推理的研究,这是人类智能的关键方面。以往的方法通常将ARC视为语言问题,使用大型语言模型或递归推理模型。本研究将ARC置于视觉范式中,将其视为图像到图像的转换任务。作者使用标准的Vision Transformer来处理类似画布的输入表示,使其能够应用标准的视觉架构。VARC提出的模型在ARC-1基准测试中达到了60.4%的准确率,显著优于其他从零开始训练的方法,并接近领先的大语言模型和人类的性能。
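As a rough illustration of the "canvas" idea described in the abstract above, the sketch below embeds a small ARC-style grid into a larger fixed-size canvas so that it can be handled like an image. The function name, canvas size, offset, and background value are all hypothetical choices made for this example; the abstract does not specify them.

```python
def place_on_canvas(grid, canvas_size=8, offset=(0, 0), background=-1):
    """Embed a 2D grid of color indices into a square canvas.

    canvas_size, offset and background are illustrative assumptions,
    not values taken from the paper.
    """
    rows, cols = len(grid), len(grid[0])
    top, left = offset
    if top < 0 or left < 0 or top + rows > canvas_size or left + cols > canvas_size:
        raise ValueError("grid does not fit on the canvas at this offset")
    # Start from a canvas filled with the background value.
    canvas = [[background] * canvas_size for _ in range(canvas_size)]
    # Copy the grid cell by cell at the requested offset.
    for r in range(rows):
        for c in range(cols):
            canvas[top + r][left + c] = grid[r][c]
    return canvas


if __name__ == "__main__":
    for row in place_on_canvas([[1, 2], [3, 4]], canvas_size=4, offset=(1, 1)):
        print(row)
```

Varying the offset is one plausible way to expose a model to translated copies of the same puzzle, although whether VARC does this is not stated in the abstract.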
$π^{*}_{0.6}$: a VLA That Learns From Experience
Authors: Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas Godden, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter, Szymon Jakubczak, Rowan Jen, Tim Jones, Ben Katz, Liyiming Ke, Chandra Kuchi, Marinda Lamb, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Yao Lu, Vishnu Mano, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Charvi Sharma, Lucy Xiaoyang Shi, Laura Smith, Jost Tobias Springenberg, Kyle Stachowicz, Will Stoeckle, Alex Swerdlow, James Tanner, Marcel Torne, Quan Vuong, Anna Walling, Haohuan Wang, Blake Williams, Sukwon Yoo, Lili Yu, Ury Zhilinsky, Zhiyuan Zhou
First: 2025-11-18T18:58:55+00:00 · Latest: 2025-11-18T18:58:55+00:00
Abstract
We study how vision-language-action (VLA) models can improve through real-world deployments via reinforcement learning (RL). We present a general-purpose method, RL with Experience and Corrections via Advantage-conditioned Policies (RECAP), that provides for RL training of VLAs via advantage conditioning. Our method incorporates heterogeneous data into the self-improvement process, including demonstrations, data from on-policy collection, and expert teleoperated interventions provided during autonomous execution. RECAP starts by pre-training a generalist VLA with offline RL, which we call $π^{*}_{0.6}$, that can then be specialized to attain high performance on downstream tasks through on-robot data collection. We show that the $π^{*}_{0.6}$ model trained with the full RECAP method can fold laundry in real homes, reliably assemble boxes, and make espresso drinks using a professional espresso machine. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the task failure rate.
中文标题/摘要
标题:$π^{*}_{0.6}$:一种通过经验学习的VLA
我们研究了视觉-语言-行动(VLA)模型如何通过强化学习(RL)在实际部署中得到改进。我们提出了一种通用方法,即通过优势条件策略的强化学习与经验及纠正(RECAP),该方法通过优势条件化为VLAs提供RL训练。我们的方法将异构数据纳入自我改进过程,包括演示、在线策略收集的数据以及自主执行期间提供的专家远程操作干预。RECAP首先通过离线RL预训练一个通用VLA,我们称之为$π^{*}_{0.6}$,然后可以通过机器人数据收集将其专门化以在下游任务中达到高性能。我们展示了使用完整RECAP方法训练的$π^{*}_{0.6}$模型可以在真实家庭中折叠衣物、可靠地组装盒子,并使用专业意式咖啡机制作意式咖啡饮品。在一些最难的任务上,RECAP使任务吞吐量提高了一倍以上,并将任务失败率降低了约一半。
Summary / 总结
The study explores how vision-language-action (VLA) models can improve through real-world deployment using reinforcement learning (RL). The method, RECAP, trains VLAs with advantage conditioning and draws on heterogeneous data: demonstrations, on-policy data, and expert teleoperated interventions. The generalist $π^{*}_{0.6}$ model is pre-trained with offline RL and then specialized through on-robot data collection; trained with the full RECAP method, it folds laundry in real homes, assembles boxes, and makes espresso drinks. On some of the hardest tasks, RECAP more than doubles task throughput and roughly halves the failure rate.
研究旨在通过现实世界的强化学习(RL)部署来提升视觉-语言-动作(VLA)模型。方法RECAP整合了演示、在线收集的数据和专家干预等异构数据以提高VLA性能。通过RECAP预训练的$π^{*}_{0.6}$模型在折叠衣物、组装盒子和制作意式咖啡等实际任务中表现出色,特别是在困难任务上显著提高了任务处理效率和降低了失败率。
Vision Large Language Models Are Good Noise Handlers in Engagement Analysis
Authors: Alexander Vedernikov, Puneet Kumar, Haoyu Chen, Tapio Seppänen, Xiaobai Li
First: 2025-11-18T18:50:26+00:00 · Latest: 2025-11-18T18:50:26+00:00
Abstract
Engagement recognition in video datasets, unlike traditional image classification tasks, is particularly challenged by subjective labels and noise limiting model performance. To overcome the challenges of subjective and noisy engagement labels, we propose a framework leveraging Vision Large Language Models (VLMs) to refine annotations and guide the training process. Our framework uses a questionnaire to extract behavioral cues and split data into high- and low-reliability subsets. We also introduce a training strategy combining curriculum learning with soft label refinement, gradually incorporating ambiguous samples while adjusting supervision to reflect uncertainty. We demonstrate that classical computer vision models trained on refined high-reliability subsets and enhanced with our curriculum strategy show improvements, highlighting benefits of addressing label subjectivity with VLMs. This method surpasses prior state of the art across engagement benchmarks such as EngageNet (three of six feature settings, maximum improvement of +1.21%), and DREAMS / PAFE with F1 gains of +0.22 / +0.06.
中文标题/摘要
标题:视觉大型语言模型是参与分析中的良好噪声处理器
在视频数据集中,参与度识别不同于传统的图像分类任务,特别受到主观标签和噪声的挑战,限制了模型的性能。为了克服主观和带噪声的参与度标签的挑战,我们提出了一种利用视觉大型语言模型(VLMs)来细化注释并指导训练过程的框架。该框架使用问卷提取行为线索,并将数据分为高可靠性和低可靠性子集。我们还引入了一种结合课程学习和软标签细化的训练策略,逐步引入模糊样本并调整监督以反映不确定性。我们证明,在经过细化的高可靠性子集上训练并结合我们的课程策略进行增强的经典计算机视觉模型显示出改进,突显了使用VLMs解决标签主观性的益处。该方法在EngageNet(六个特征设置中的三个,最大改进为+1.21%)和DREAMS / PAFE(F1增益分别为+0.22 / +0.06)等参与度基准测试中超越了先前的最先进水平。
OG-VLA: Orthographic Image Generation for 3D-Aware Vision-Language Action Model
Authors: Ishika Singh, Ankit Goyal, Stan Birchfield, Dieter Fox, Animesh Garg, Valts Blukis
First: 2025-06-01T22:15:45+00:00 · Latest: 2025-11-18T18:49:00+00:00
Comments: 13 pages
Abstract
We introduce OG-VLA, a novel architecture and learning framework that combines the generalization strengths of Vision Language Action models (VLAs) with the robustness of 3D-aware policies. We address the challenge of mapping natural language instructions and one or more RGBD observations to quasi-static robot actions. 3D-aware robot policies achieve state-of-the-art performance on precise robot manipulation tasks, but struggle with generalization to unseen instructions, scenes, and objects. On the other hand, VLAs excel at generalizing across instructions and scenes, but can be sensitive to camera and robot pose variations. We leverage prior knowledge embedded in language and vision foundation models to improve generalization of 3D-aware keyframe policies. OG-VLA unprojects input observations from diverse views into a point cloud which is then rendered from canonical orthographic views, ensuring input view invariance and consistency between input and output spaces. These canonical views are processed with a vision backbone, a Large Language Model (LLM), and an image diffusion model to generate images that encode the next position and orientation of the end-effector on the input scene. Evaluations on the Arnold and Colosseum benchmarks demonstrate state-of-the-art generalization to unseen environments, with over 40% relative improvements while maintaining robust performance in seen settings. We also show real-world adaption in 3 to 5 demonstrations along with strong generalization. Videos and resources at https://og-vla.github.io/
中文标题/摘要
标题:OG-VLA:面向3D感知视觉-语言-动作模型的正交图像生成
我们介绍了OG-VLA,这是一种结合了视觉语言动作模型(VLAs)的泛化优势和3D感知策略的鲁棒性的新型架构和学习框架。我们解决了将自然语言指令和一个或多个RGBD观察映射到准静态机器人动作的挑战。3D感知的机器人策略在精确的机器人操作任务上达到了最先进的性能,但在处理未见过的指令、场景和物体时存在泛化问题。另一方面,VLAs在指令和场景的泛化方面表现出色,但对相机和机器人姿态的变化较为敏感。我们利用嵌入在语言和视觉基础模型中的先验知识来提高3D感知关键帧策略的泛化能力。OG-VLA将输入观察从多个视角反投影到点云中,然后从标准正交视角进行渲染,确保输入视角不变性和输入输出空间的一致性。这些标准视角通过视觉骨干、大型语言模型(LLM)和图像扩散模型处理,生成编码末端执行器在输入场景中下一个位置和方向的图像。在Arnold和Colosseum基准上的评估表明,OG-VLA在未见过的环境中实现了最先进的泛化能力,相对改进超过40%,同时在已见过的环境中保持了稳健的性能。我们还展示了3到5次演示中的实际应用,并且具有强大的泛化能力。有关视频和资源,请访问https://og-vla.github.io/
Summary / 总结
OG-VLA is a novel architecture that integrates the generalization strengths of Vision Language Action models with the robustness of 3D-aware policies. It addresses the challenge of mapping natural language instructions and RGBD observations to robot actions. By leveraging prior knowledge from language and vision models, OG-VLA improves the generalization of 3D-aware keyframe policies. The model generates orthographic images to ensure input view invariance and consistency, which are then processed to predict the next position and orientation of the end-effector. Evaluations show that OG-VLA achieves state-of-the-art generalization to unseen environments with over 40% relative improvements while maintaining robust performance in seen settings, and it also demonstrates real-world adaptability with strong generalization capabilities.
OG-VLA 是一种新颖的架构,它将视觉语言行动模型的泛化能力与 3D 意识策略的鲁棒性相结合。它解决了将自然语言指令和 RGBD 观测转化为机器人动作的挑战。通过将输入观测投影到点云中并从标准正视图渲染,OG-VLA 确保了视图不变性和输入输出空间的一致性。这种方法在未见过的环境中实现了最先进的泛化能力,相对改进超过 40%,同时在已见过的环境中保持了稳健的性能。还展示了在 3 到 5 次演示中的实际应用适应性,具有强大的泛化能力。
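The unproject-then-render step described in the OG-VLA abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, skips the camera-to-world transform (points are treated as already being in a shared frame), and renders a binary occupancy image rather than a colored one.

```python
def unproject(depth, fx, fy, cx, cy):
    """Lift a depth image (list of rows) to 3D points with a pinhole model."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as invalid and skip it
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def orthographic_render(points, drop_axis, lo, hi, size):
    """Project points along one axis onto a size x size occupancy image.

    lo and hi bound both kept coordinates; points outside are ignored.
    """
    keep = [axis for axis in range(3) if axis != drop_axis]
    image = [[0] * size for _ in range(size)]
    scale = size / (hi - lo)
    for p in points:
        a, b = p[keep[0]], p[keep[1]]
        if not (lo <= a < hi and lo <= b < hi):
            continue
        # Clamp to guard against floating-point rounding at the upper edge.
        col = min(size - 1, int((a - lo) * scale))
        row = min(size - 1, int((b - lo) * scale))
        image[row][col] = 1
    return image
```

Because an orthographic projection simply drops one coordinate, the rendered image does not depend on where the original cameras were placed, which is the view-invariance property the abstract refers to.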
Measuring AI Progress in Drug Discovery: A Reproducible Leaderboard for the Tox21 Challenge
Authors: Antonia Ebner, Christoph Bartmann, Sonja Topf, Sohvi Luukkonen, Johannes Schimunek, Günter Klambauer
First: 2025-11-18T18:43:42+00:00 · Latest: 2025-11-18T18:43:42+00:00
Abstract
Deep learning's rise since the early 2010s has transformed fields like computer vision and natural language processing and strongly influenced biomedical research. For drug discovery specifically, a key inflection - akin to vision's "ImageNet moment" - arrived in 2015, when deep neural networks surpassed traditional approaches on the Tox21 Data Challenge. This milestone accelerated the adoption of deep learning across the pharmaceutical industry, and today most major companies have integrated these methods into their research pipelines. After the Tox21 Challenge concluded, its dataset was included in several established benchmarks, such as MoleculeNet and the Open Graph Benchmark. However, during these integrations, the dataset was altered and labels were imputed or manufactured, resulting in a loss of comparability across studies. Consequently, the extent to which bioactivity and toxicity prediction methods have improved over the past decade remains unclear. To this end, we introduce a reproducible leaderboard, hosted on Hugging Face with the original Tox21 Challenge dataset, together with a set of baseline and representative methods. The current version of the leaderboard indicates that the original Tox21 winner - the ensemble-based DeepTox method - and the descriptor-based self-normalizing neural networks introduced in 2017, continue to perform competitively and rank among the top methods for toxicity prediction, leaving it unclear whether substantial progress in toxicity prediction has been achieved over the past decade. As part of this work, we make all baselines and evaluated models publicly accessible for inference via standardized API calls to Hugging Face Spaces.
中文标题/摘要
标题:药物发现中人工智能进步的衡量:Tox21 挑战赛的可重复排行榜
自2010年代初以来,深度学习的兴起已经改变了诸如计算机视觉和自然语言处理等领域,并对生物医学研究产生了强烈影响。对于药物发现而言,一个关键转折点——类似于视觉领域的“ImageNet时刻”——出现在2015年,当时深度神经网络在Tox21数据挑战赛中超越了传统方法。这一里程碑加速了深度学习在制药行业的应用,如今大多数大型公司都已将这些方法整合到其研究管道中。在Tox21挑战赛结束后,其数据集被包含在多个现有基准中,如MoleculeNet和开放图基准。然而,在这些整合过程中,数据集被修改,标签被填补或制造,导致研究之间的可比性丧失。因此,过去十年间生物活性和毒性预测方法的进步程度仍然不清楚。为此,我们引入了一个可重复的排行榜,该排行榜托管在Hugging Face上,使用原始的Tox21挑战赛数据集,以及一组基线和代表性方法。当前版本的排行榜表明,原始Tox21的获胜者——基于集成的DeepTox方法——和2017年引入的基于描述符的自归一化神经网络,继续表现出色,并在毒性预测方面排名靠前,表明过去十年间在毒性预测方面是否取得了实质性进展尚不清楚。作为这项工作的部分,我们使所有基线和评估模型通过标准化API调用在Hugging Face Spaces上公开,以便进行推理。
Summary / 总结
The paper aims to measure progress in AI for drug discovery by introducing a reproducible leaderboard using the original Tox21 Challenge dataset. The method involves comparing various baseline and representative models on this dataset. Key findings show that the original Tox21 winner, DeepTox, and self-normalizing neural networks from 2017 continue to perform well, suggesting that significant advancements in toxicity prediction may not have been achieved over the past decade.
该论文旨在通过引入Tox21挑战的可重复排行榜来衡量AI在药物发现领域的进展,该挑战在2015年成为关键转折点,当时深度神经网络超越了传统方法。研究使用原始的Tox21数据集和一组基线及代表性方法来评估生物活性和毒性预测。结果表明,原始Tox21获胜者和2017年的自规范化神经网络继续表现出色,暗示过去十年在毒性预测方面几乎没有取得显著进展。
Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
Authors: Xiyuan Wang, Muhan Zhang
First: 2025-11-18T17:58:16+00:00 · Latest: 2025-11-18T17:58:16+00:00
Comments: Tech Report. 10 pages
Abstract
Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.
中文标题/摘要
标题:扩散作为自我蒸馏:单一模型中的端到端潜在扩散
标准潜在扩散模型依赖于一个复杂的三部分架构,包括单独的编码器、解码器和扩散网络,这些组件在多个阶段进行训练。这种模块化设计计算效率低下,导致性能不佳,并阻止了扩散与视觉基础模型中常见的单网络架构的统一。我们的目标是将这三个组件统一到一个单一的、端到端可训练的网络中。我们首先证明,一种简单的联合训练方法由于“潜在坍塌”灾难性地失败了,其中扩散训练目标干扰了网络学习良好潜在表示的能力。我们通过将扩散与基于自我蒸馏的无监督学习方法建立新的类比,识别出这种不稳定性的根本原因。基于这一见解,我们提出了扩散作为自我蒸馏(DSD),这是一种具有关键修改训练目标的新框架,可以稳定潜在空间。这种方法首次使单一网络能够同时学习编码、解码和执行扩散的稳定端到端训练成为可能。DSD 在 ImageNet 256×256 条件生成任务中取得了出色的表现:FID=13.44/6.38/4.25,仅使用 42M/118M/205M 参数,在 ImageNet 上进行 50 轮训练,而无需使用无分类器引导。
Summary / 总结
The paper aims to unify the encoder, decoder, and diffusion network of latent diffusion models into a single end-to-end trainable model to improve computational efficiency and performance. It introduces Diffusion as Self-Distillation (DSD), which addresses "latent collapse" by modifying the training objective. On the ImageNet $256\times 256$ conditional generation task, DSD reaches FID 13.44/6.38/4.25 with 42M/118M/205M parameters and 50 training epochs, without classifier-free guidance.
论文旨在将编码器、解码器和扩散网络统一为一个端到端可训练的模型,以提高计算效率和性能。它引入了Diffusion as Self-Distillation (DSD),通过修改训练目标来解决“潜在空间坍塌”的问题。DSD在ImageNet $256\times 256$ 条件生成任务上取得了竞争力的表现,参数量和训练轮次较少,相比现有模型更具优势。
FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation
Authors: Yunfeng Wu, Jiayi Song, Zhenxiong Tan, Zihao He, Songhua Liu
First: 2025-11-18T17:56:04+00:00 · Latest: 2025-11-18T17:56:04+00:00
Comments: 13 pages, 8 figures
Abstract
The quadratic time and memory complexity of the attention mechanism in modern Transformer based video generators makes end-to-end training for ultra high resolution videos prohibitively expensive. Motivated by this limitation, we introduce a training-free approach that leverages video Diffusion Transformers pretrained at their native scale to synthesize higher resolution videos without any additional training or adaptation. At the core of our method lies an inward sliding window attention mechanism, which originates from a key observation: maintaining each query token's training scale receptive field is crucial for preserving visual fidelity and detail. However, naive local window attention, unfortunately, often leads to repetitive content and exhibits a lack of global coherence in the generated results. To overcome this challenge, we devise a dual-path pipeline that backs up window attention with a novel cross-attention override strategy, enabling the semantic content produced by local attention to be guided by another branch with a full receptive field and, therefore, ensuring holistic consistency. Furthermore, to improve efficiency, we incorporate a cross-attention caching strategy for this branch to avoid the frequent computation of full 3D attention. Extensive experiments demonstrate that our method delivers ultra-high-resolution videos with fine-grained visual details and high efficiency in a training-free paradigm. Meanwhile, it achieves superior performance on VBench, even compared to training-based alternatives, with competitive or improved efficiency. Codes are available at: https://github.com/WillWu111/FreeSwim
中文标题/摘要
标题:FreeSwim:重新审视用于免训练超高分辨率视频生成的滑动窗口注意力机制
现代基于Transformer的视频生成器中的注意力机制的时间和空间复杂度呈二次增长,使得端到端训练超高清视频变得极其昂贵。鉴于这一限制,我们提出了一种训练免费的方法,利用在原生尺度下预训练的视频扩散Transformer来合成更高分辨率的视频,无需任何额外的训练或适应。我们方法的核心在于一种向内滑动窗口注意力机制,其源于一个关键观察:保持每个查询令牌的训练尺度感受野对于保持视觉保真度和细节至关重要。然而,朴素的局部窗口注意力往往导致重复的内容,并且在生成结果中缺乏全局一致性。为克服这一挑战,我们设计了一种双路径管道,通过一种新颖的交叉注意力覆盖策略支持窗口注意力,从而使局部注意力生成的语义内容能够受到具有全感受野的另一分支的引导,从而确保整体一致性。此外,为了提高效率,我们为该分支引入了一种交叉注意力缓存策略,以避免频繁计算全3D注意力。大量实验表明,我们的方法在训练免费的框架下提供了具有精细视觉细节和高效率的超高清视频。同时,在VBench上实现了优于基于训练的替代方案的性能,具有竞争力或更高的效率。代码可在:https://github.com/WillWu111/FreeSwim 获取
Summary / 总结
The paper addresses the computational challenges of training Transformer-based video generators for ultra-high-resolution videos. It proposes a training-free approach using video Diffusion Transformers pretrained at their native scale, combined with an inward sliding window attention mechanism that preserves each query token's training-scale receptive field to maintain visual fidelity. The method also introduces a dual-path pipeline with a cross-attention override strategy to enhance global coherence, and a cross-attention caching strategy for efficiency. Experiments show that the proposed method generates ultra-high-resolution videos with fine details and high efficiency, outperforming training-based alternatives on VBench with competitive or improved efficiency. Code is available at https://github.com/WillWu111/FreeSwim
本文解决了使用Transformer进行超高清视频生成时的高计算成本问题。提出了一种名为FreeSwim的无训练方法,利用预训练的视频Diffusion Transformer生成更高分辨率的视频。该方法采用内滑动窗口注意力机制以保持视觉细节和保真度,并引入了一种带有全局视野的交叉注意力覆盖策略的双路径管道来增强全局一致性。实验表明,FreeSwim能够高效地生成具有精细细节的超高清视频,并在VBench上优于基于训练的替代方案,且具有竞争力的效率。
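One way to read the "inward sliding window" idea from the FreeSwim abstract is that every query keeps a window of the same size it saw at training scale, and windows near a border are shifted inward instead of being truncated. The sketch below shows that reading in one dimension. It is an assumption made for illustration: the abstract does not define the mechanism precisely, real video tokens are three-dimensional, and `inward_window` is a hypothetical helper name.

```python
def inward_window(query_index, sequence_length, window_size):
    """Return the [start, end) key range attended to by one query.

    The window is centred on the query where possible and shifted inward
    near the borders, so it always spans exactly window_size keys.
    """
    if not 0 < window_size <= sequence_length:
        raise ValueError("window_size must be in (0, sequence_length]")
    if not 0 <= query_index < sequence_length:
        raise ValueError("query_index out of range")
    start = query_index - window_size // 2
    # Clamp so that the whole window stays inside the sequence.
    start = max(0, min(start, sequence_length - window_size))
    return start, start + window_size


if __name__ == "__main__":
    for q in (0, 5, 9):
        print(q, inward_window(q, sequence_length=10, window_size=4))
```

With a plain truncated window, border queries would see fewer keys than interior ones; clamping the start index keeps the receptive field size constant for every query.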
Optimizing Federated Learning by Entropy-Based Client Selection
Authors: Andreas Lutz, Gabriele Steidl, Karsten Müller, Wojciech Samek
First: 2024-11-02T13:31:36+00:00 · Latest: 2025-11-18T17:47:33+00:00
Comments: Accepted at the 3rd IEEE International Conference on Federated Learning Technologies and Applications (FLTA 2025), Dubrovnik, Croatia, October 14-17, 2025
Abstract
Although deep learning has revolutionized domains such as natural language processing and computer vision, its dependence on centralized datasets raises serious privacy concerns. Federated learning addresses this issue by enabling multiple clients to collaboratively train a global deep learning model without compromising their data privacy. However, the performance of such a model degrades under label skew, where the label distribution differs between clients. To overcome this issue, a novel method called FedEntOpt is proposed. In each round, it selects clients to maximize the entropy of the aggregated label distribution, ensuring that the global model is exposed to data from all available classes. Extensive experiments on multiple benchmark datasets show that the proposed method outperforms several state-of-the-art algorithms by up to 6% in classification accuracy under standard settings regardless of the model size, while achieving gains of over 30% in scenarios with low participation rates and client dropout. In addition, FedEntOpt offers the flexibility to be combined with existing algorithms, enhancing their classification accuracy by more than 40%. Importantly, its performance remains unaffected even when differential privacy is applied.
中文标题/摘要
标题:基于熵的客户端选择优化联邦学习
尽管深度学习在自然语言处理和计算机视觉等领域取得了革命性进展,但其对集中式数据集的依赖引发了严重的隐私问题。联邦学习通过使多个客户端协作训练全局深度学习模型,而不泄露其数据隐私,解决了这一问题。然而,在标签偏斜的情况下,即标签分布在不同客户端之间存在差异时,该模型的性能会下降。为克服这一问题,提出了一种名为FedEntOpt的新方法。在每一轮中,它选择客户端以最大化聚合标签分布的熵,确保全局模型接触到所有可用类别的数据。在多个基准数据集上的广泛实验表明,该方法在标准设置下无论模型大小如何,分类准确率都比几种最先进的算法高出多达6%,而在低参与率和客户端退出的场景中,其性能提升超过30%。此外,FedEntOpt 可以与现有算法结合使用,提高其分类准确率超过40%。重要的是,即使应用差分隐私,其性能也不会受到影响。
Summary / 总结
The paper proposes FedEntOpt, a method for optimizing federated learning by selecting clients based on entropy of the aggregated label distribution to address label skew. Experiments show that FedEntOpt outperforms state-of-the-art algorithms by up to 6% in classification accuracy and offers significant improvements in low participation and client dropout scenarios, while maintaining performance with differential privacy applied.
该研究提出了一种名为FedEntOpt的方法,通过选择具有最大聚合标签分布熵的客户端来优化联邦学习,以解决标签偏差问题并提高模型性能。实验表明,该方法在标准设置下的分类准确率比最先进的算法高出最多6%,在低参与度和客户端退出场景下则高出超过30%。此外,它还能提高现有算法的准确率超过40%,即使在应用差分隐私时也不会影响性能。
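The selection rule described in the FedEntOpt abstract, choosing clients so that the aggregated label distribution has maximal entropy, can be sketched with a simple greedy loop. This is a minimal sketch under assumptions: the greedy strategy, the function names, and the use of raw per-class counts are illustrative choices, and the paper's actual procedure (for example how the first client is picked, or how privacy is handled) may differ.

```python
import math


def entropy(counts):
    """Shannon entropy (in nats) of the distribution given by class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def select_clients(label_counts, num_selected):
    """Greedily pick clients that maximize entropy of the pooled label counts.

    label_counts maps a client id to its list of per-class sample counts.
    """
    num_classes = len(next(iter(label_counts.values())))
    aggregate = [0] * num_classes
    remaining = dict(label_counts)
    chosen = []
    for _ in range(min(num_selected, len(remaining))):
        # Score each candidate by the entropy after adding its labels.
        best = max(
            remaining,
            key=lambda cid: entropy(
                [a + b for a, b in zip(aggregate, remaining[cid])]
            ),
        )
        chosen.append(best)
        aggregate = [a + b for a, b in zip(aggregate, remaining.pop(best))]
    return chosen


if __name__ == "__main__":
    clients = {"a": [10, 0], "b": [9, 1], "c": [0, 10]}
    print(select_clients(clients, num_selected=2))
```

In the toy example, clients "a" and "c" each hold a single class, so a selection that pools complementary clients yields a more balanced label distribution than one that pools similar clients.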
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Authors: Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci
Venue: MICCAI
First: 2025-08-02T09:59:39+00:00 · Latest: 2025-11-18T17:43:54+00:00
Comments: Accepted at MICCAI Workshop 2025
Abstract
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.
中文标题/摘要
标题:GMAT:基于视觉-语言多实例学习的临床描述生成框架
多实例学习(MIL)是全切片图像(WSI)分类的领先方法,能够高效分析十亿像素级病理切片。近期工作将视觉-语言模型(VLMs)引入MIL管道中,通过基于文本的类描述而非简单的类名来整合医学知识。然而,当这些方法依赖大型语言模型(LLMs)生成临床描述或使用固定长度的提示来表示复杂的病理概念时,VLMs的有限标记容量往往限制了编码的类信息的表达性和丰富性。此外,仅由LLMs生成的描述可能缺乏领域依据和精细的医学特异性,导致与视觉特征的对齐不足。为解决这些挑战,我们提出了一种视觉-语言MIL框架,包含两个关键贡献:(1)一个有依据的多智能体描述生成系统,利用精选的病理学教科书和智能体专业化(如形态学、空间上下文)生成准确且多样的临床描述;(2)一种使用描述列表而非单一提示的文本编码策略,捕捉更精细且互补的临床信号,以更好地与视觉特征对齐。整合到VLM-MIL管道中,我们的方法在单提示类基线之上表现出改进的性能,并在肾癌和肺癌数据集上达到了与最先进的模型相当的结果。
Summary / 总结
The research aims to enhance the performance of whole slide image classification using vision-language models by addressing the limitations of text-based class descriptions generated by large language models. The method involves a grounded multi-agent description generation system that uses curated pathology textbooks and agent specialization to produce accurate and diverse clinical descriptions, and a text encoding strategy that utilizes a list of descriptions instead of a single prompt. The key experimental findings show that this approach outperforms single-prompt baselines and achieves results comparable to state-of-the-art models on renal and lung cancer datasets.
研究解决了使用大型语言模型(LLMs)生成临床描述在视觉-语言多实例学习(MIL)中对全切片图像(WSI)分类的限制。提出了一种基于专业化代理和病理教科书的描述生成系统,以生成准确且多样的描述。该系统还使用描述列表进行文本编码,从而增强与视觉特征的对齐。实验结果表明,该方法在肾癌和肺癌数据集上的性能优于单一提示基线,并且与最先进的模型具有竞争力。
NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards
Authors: Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, Soujanya Poria
First: 2025-11-18T16:55:48+00:00 · Latest: 2025-11-18T16:55:48+00:00
Comments: https://declare-lab.github.io/nora-1.5
Abstract
Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by adding to it a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating significant VLA model-reliability gains through simple yet effective reward models. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
中文标题/摘要
标题:NORA-1.5:一种基于世界模型和动作偏好奖励训练的视觉-语言-行动模型
视觉-语言-行动(VLA)模型在各种体态任务中表现出色,但在可靠性和泛化能力方面仍存在不足,尤其是在不同体态或真实环境中的部署。本文介绍了NORA-1.5,这是一种基于预训练NORA骨干网络构建的VLA模型,通过添加基于流匹配的动作专家进行增强。这种架构改进显著提升了性能,使NORA-1.5在模拟和真实世界基准测试中均优于NORA和多个最先进的VLA模型。为了进一步提高鲁棒性和任务成功率,我们开发了一套用于后训练VLA策略的奖励模型。我们的奖励模型结合了(i)动作条件下的世界模型(WM),评估生成的动作是否有助于实现目标,以及(ii)偏离真实值的启发式方法,区分好的动作和差的动作。利用这些奖励信号,我们构建了偏好数据集,并通过直接偏好优化(DPO)使NORA-1.5适应特定体态。广泛的评估表明,奖励驱动的后训练在模拟和真实机器人设置中均能持续提升性能,通过简单的有效奖励模型显著提高了VLA模型的可靠性。我们的研究结果强调了NORA-1.5和奖励引导的后训练作为实现更可靠体态智能体的可行路径。
Summary / 总结
The research aims to enhance the reliability and generalization of vision-language-action (VLA) models for embodied tasks. NORA-1.5, built on the NORA backbone with an action expert, shows significant performance gains. Reward models combining an action-conditioned world model and a deviation-from-ground-truth heuristic were developed to further improve robustness. Extensive evaluations demonstrate that reward-driven post-training improves performance in both simulation and real-world settings, highlighting NORA-1.5 as a promising model for real-world deployment.
NORA-1.5 是一个增强版的视觉-语言-行动模型,通过加入动作专家来提升预训练的 NORA 基准模型,使其在模拟和真实机器人环境中表现出色。该模型结合了动作条件下的世界模型和偏离真实情况的启发式方法来生成偏好数据集,并通过直接偏好优化进行调整。这种方法在模拟和真实机器人设置中都能提高任务成功率和可靠性,展示了奖励驱动后训练对增强实体代理的有效性。
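The post-training recipe in the NORA-1.5 abstract, scoring candidate actions with reward signals, forming preference pairs, and applying direct preference optimization (DPO), can be sketched for a single pair as below. The loss is the standard DPO objective; the pair-building helper, its name, and the choice of taking the best and worst candidates are assumptions made for this example, not details from the paper.

```python
import math


def build_preference_pair(candidates):
    """Pick (chosen, rejected) from (action, reward) tuples.

    Taking the highest- and lowest-reward candidates is an illustrative
    assumption; other pairing schemes are possible.
    """
    ranked = sorted(candidates, key=lambda item: item[1])
    return ranked[-1][0], ranked[0][0]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from log-probabilities.

    Computes -log(sigmoid(margin)), where margin is beta times the gap
    between the policy/reference log-ratios of chosen and rejected.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    chosen, rejected = build_preference_pair([("a", 0.2), ("b", 0.9), ("c", 0.5)])
    print(chosen, rejected)
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

The loss falls as the policy assigns relatively more probability to the chosen action than the reference model does, and rises when it favors the rejected one.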
Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities
Authors: Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia
First: 2025-11-18T16:23:02+00:00 · Latest: 2025-11-18T16:23:02+00:00
Abstract
We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent
中文标题/摘要
标题:利用视觉语言模型能力增强代理自主科学发现
我们展示了由视觉语言模型(VLMs)引导的多代理系统可以提高端到端的自主科学发现能力。通过将图表视为可验证的检查点,VLM作为裁判评估图表与动态生成的领域特定评分标准的符合情况,使代理能够纠正自己的错误并在实时中引导探索性数据分析。在宇宙学和天体化学案例研究中,展示了从错误推理路径中恢复并适应新数据集的能力,无需人工干预。在数据驱动发现的10任务基准测试中,增强VLM的系统达到0.7-0.8的通过率,而仅代码和代码及文本基线分别达到0.2-0.3和0.4-0.5,同时提供可审计的推理轨迹以提高可解释性。代码可在以下链接获取:https://github.com/CMBAgents/cmbagent
Summary / 总结
This study demonstrates that multi-agent systems guided by vision-language models (VLMs) enhance autonomous scientific discovery. With a VLM acting as judge, the system evaluates figures against dynamically generated domain-specific rubrics, allowing agents to correct errors and adapt in real time. Case studies in cosmology and astrochemistry show that VLM-augmented systems can recover from faulty reasoning and adapt to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieved pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while providing auditable reasoning traces that improve interpretability.
这项研究展示了由视觉-语言模型引导的多智能体系统如何增强自主科学研究。通过使用VLM作为裁判,系统可以评估图表并根据领域特定的评分标准进行动态调整,使智能体能够实时纠正错误并适应。在宇宙学和天体化学的案例研究中,VLM增强的系统能够从错误的推理路径中恢复,并在无需人类干预的情况下适应新数据集。在数据驱动发现基准测试中,VLM增强的系统得分在0.7-0.8之间,而代码-only和代码-文本基线分别为0.2-0.3和0.4-0.5,同时提供了可审计的推理痕迹以提高可解释性。
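For readers unfamiliar with the metric reported above, the sketch below shows one common way to compute pass@k, using the widely used unbiased estimator 1 - C(n-c, k) / C(n, k) for n attempts with c successes, averaged over tasks. How the authors actually computed their pass@1 scores is not stated in the abstract, so this is an assumption; the helper names and the sample data are hypothetical.

```python
import math


def pass_at_k(num_attempts, num_correct, k):
    """Unbiased pass@k estimate for one task: 1 - C(n-c, k) / C(n, k)."""
    if not 0 < k <= num_attempts:
        raise ValueError("k must be in (0, num_attempts]")
    if num_attempts - num_correct < k:
        # Every size-k subset of attempts contains at least one success.
        return 1.0
    return 1.0 - math.comb(num_attempts - num_correct, k) / math.comb(num_attempts, k)


def mean_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (attempts, correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical data: 10 tasks, one attempt each, 7 solved.
    results = [(1, 1)] * 7 + [(1, 0)] * 3
    print(mean_pass_at_k(results, k=1))
```

For k = 1 the estimator reduces to the fraction of successful attempts, so with one attempt per task on a 10-task benchmark, a score of 0.7 would correspond to 7 solved tasks.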
StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving
Authors: Ruiyang Hao, Bowen Jing, Haibao Yu, Zaiqing Nie
First: 2025-06-30T15:48:38+00:00 · Latest: 2025-11-18T15:45:36+00:00
Comments: 25 pages, 7 figures, 5 tables
Abstract
Personalization, while extensively studied in conventional autonomous driving pipelines, has been largely overlooked in the context of end-to-end autonomous driving (E2EAD), despite its critical role in fostering user trust, safety perception, and real-world adoption. A primary bottleneck is the absence of large-scale real-world datasets that systematically capture driving preferences, severely limiting the development and evaluation of personalized E2EAD models. In this work, we introduce the first large-scale real-world dataset explicitly curated for personalized E2EAD, integrating comprehensive scene topology with rich dynamic context derived from agent dynamics and semantics inferred via a fine-tuned vision-language model (VLM). We propose a hybrid annotation pipeline that combines behavioral analysis, rule-and-distribution-based heuristics, and subjective semantic modeling guided by VLM reasoning, with final refinement through human-in-the-loop verification. Building upon this dataset, we introduce the first standardized benchmark for systematically evaluating personalized E2EAD models. Empirical evaluations on state-of-the-art architectures demonstrate that incorporating personalized driving preferences significantly improves behavioral alignment with human demonstrations.
中文标题/摘要
标题:StyleDrive:面向端到端自动驾驶的驾驶风格感知基准测试
个性化,在传统自动驾驶管道中得到了广泛研究,但在端到端自动驾驶(E2EAD)的背景下却很少被关注,尽管它在培养用户信任、安全感知和实际应用中的作用至关重要。主要瓶颈在于缺乏大规模的现实世界数据集,这些数据集能够系统地捕捉驾驶偏好,严重限制了个性化E2EAD模型的开发和评估。在本文中,我们介绍了第一个专门用于个性化E2EAD的大规模现实世界数据集,该数据集结合了全面的场景拓扑和通过微调视觉语言模型(VLM)推断出的丰富动态上下文。我们提出了一种混合注释流水线,该流水线结合了行为分析、基于规则和分布的启发式方法以及由VLM推理引导的主观语义建模,并通过人工在环验证进行最终细化。基于此数据集,我们引入了第一个标准化基准,用于系统地评估个性化E2EAD模型。对最先进的架构的实证评估表明,纳入个性化的驾驶偏好可以显著提高行为与人类示范的一致性。
Summary / 总结
This paper addresses the lack of personalization in end-to-end autonomous driving (E2EAD) by introducing a new large-scale real-world dataset that captures driving preferences. The dataset integrates scene topology and dynamic context, and a hybrid annotation pipeline is proposed to annotate it. The authors also introduce a benchmark to evaluate personalized E2EAD models, showing that incorporating personal driving preferences enhances behavioral alignment with human demonstrations.
本文通过引入一个新的大规模真实世界数据集来解决端到端自动驾驶(E2EAD)中缺乏个性化的问题,该数据集能够捕捉驾驶偏好。该数据集整合了场景拓扑和动态上下文,并提出了一种混合注释流水线进行标注。作者还引入了一个基准来评估个性化E2EAD模型,实验证明,将个人驾驶偏好纳入模型可以提高行为与人类示范的一致性。
Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
Authors: Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, Ping Jian
First: 2025-11-14T12:42:07+00:00 · Latest: 2025-11-18T15:36:54+00:00
Abstract
Existing Vision Language Models (VLMs), architecturally rooted in "flatland" perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometric-aware encoders and superficial 2D-only features, and an output-stage misalignment in which discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an "Embedding-as-Value" paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
中文标题/摘要
标题:超越平地:通过解耦三维推理与数值回归,解锁空间智能
现有的视觉语言模型(VLMs)在架构上根植于“平地”感知,难以真正理解现实世界的三维空间智能。这种失败源于双重瓶颈:输入阶段计算成本高昂的几何感知编码器与浅层的二维特征之间的冲突,以及输出阶段离散分词器在结构上无法生成精确连续数值所造成的不匹配。为打破这一僵局,我们引入了GEODE(几何输出和解耦输入引擎),这是一种新型架构,通过解耦三维推理与数值生成来解决这一双重瓶颈。GEODE通过两个专门的即插即用模块增强主要的VLM:解耦推理模块(DRM),作为空间协处理器,通过交叉注意力将显式的三维数据与二维视觉特征对齐,并将空间链式思维(CoT)逻辑提炼为可注入的推理令牌;以及直接回归头(DRH),这是一种“嵌入即值”范式,将专门的控制令牌路由到轻量级MLP,以实现对标量和三维边界框的精确连续回归。这些模块的协同作用使我们的15亿(1.5B)参数模型能够作为高级语义调度器运行,实现可与70亿(7B)以上参数模型相媲美的最先进空间推理性能。
Summary / 总结
The research aims to enhance the spatial reasoning capabilities of Vision Language Models (VLMs) by addressing the limitations of their 'flatland' architecture. To achieve this, the authors introduce GEODE, which decouples 3D reasoning from numerical generation. GEODE includes a Decoupled Rationale Module that aligns 3D data with 2D visual features and a Direct Regression Head for precise numerical predictions. The model, with 1.5 billion parameters, demonstrates state-of-the-art performance in spatial reasoning, comparable to larger models with over 7 billion parameters.
论文针对现有视觉语言模型(VLMs)在处理3D空间智能方面受限于‘平地’架构的问题展开研究。它引入了GEODE,通过增加一个解耦推理模块和直接回归头来分离3D推理与数值生成。该模型以15亿(1.5B)参数规模实现了最先进的空间推理性能,可与70亿(7B)以上参数的更大模型相媲美。
Is Your VLM for Autonomous Driving Safety-Ready? A Comprehensive Benchmark for Evaluating External and In-Cabin Risks
Authors: Xianhui Meng, Yuchen Zhang, Zhijian Huang, Zheng Lu, Ziling Ji, Yaoyao Yin, Hongyuan Zhang, Guangfeng Jiang, Yandan Lin, Long Chen, Hangjun Ye, Li Zhang, Jun Liu, Xiaoshuai Hao
First: 2025-11-18T15:33:49+00:00 · Latest: 2025-11-18T15:33:49+00:00
Abstract
Vision-Language Models (VLMs) show great promise for autonomous driving, but their suitability for safety-critical scenarios is largely unexplored, raising safety concerns. This issue arises from the lack of comprehensive benchmarks that assess both external environmental risks and in-cabin driving behavior safety simultaneously. To bridge this critical gap, we introduce DSBench, the first comprehensive Driving Safety Benchmark designed to assess a VLM's awareness of various safety risks in a unified manner. DSBench encompasses two major categories: external environmental risks and in-cabin driving behavior safety, divided into 10 key categories and a total of 28 sub-categories. This comprehensive evaluation covers a wide range of scenarios, ensuring a thorough assessment of VLMs' performance in safety-critical contexts. Extensive evaluations across various mainstream open-source and closed-source VLMs reveal significant performance degradation under complex safety-critical situations, highlighting urgent safety concerns. To address this, we constructed a large dataset of 98K instances focused on in-cabin and external safety scenarios, showing that fine-tuning on this dataset significantly enhances the safety performance of existing VLMs and paves the way for advancing autonomous driving technology. The benchmark toolkit, code, and model checkpoints will be publicly accessible.
中文标题/摘要
标题:您的VLM在自动驾驶中安全吗?一种全面的评估外部和车内风险基准
视觉-语言模型(VLMs)在自动驾驶中展现出巨大的潜力,但它们在安全关键场景中的适用性尚未得到充分探索,引发了安全方面的担忧。这一问题源于缺乏能够同时评估外部环境风险和车内驾驶行为安全性的全面基准。为弥补这一关键缺口,我们引入了DSBench,这是首个旨在以统一方式评估VLM对各种安全风险意识的全面驾驶安全基准。DSBench 包含两大类:外部环境风险和车内驾驶行为安全,分为10个关键类别和总共28个子类别。这一全面评估涵盖了多种场景,确保了对VLM在安全关键情境下性能的全面评估。在各种主流开源和闭源VLM上的广泛评估显示,在复杂的安全关键情况下性能显著下降,突显了迫切的安全关切。为解决这一问题,我们构建了一个包含98,000个实例的大规模数据集,专注于车内和外部安全场景,表明在该数据集上进行微调显著提高了现有VLM的安全性能,并为推进自动驾驶技术铺平了道路。基准工具包、代码和模型检查点将公开提供。
Summary / 总结
The paper addresses the safety concerns of Vision-Language Models (VLMs) in autonomous driving by introducing DSBench, a comprehensive benchmark. DSBench evaluates VLMs on external environmental risks and in-cabin driving behavior safety, covering 28 sub-categories. Evaluations across various VLMs show significant performance degradation in complex safety scenarios, emphasizing the need for safety improvements. Fine-tuning on a large dataset of 98K instances improves safety performance, suggesting a path forward for safer autonomous driving technology.
论文介绍了DSBench,这是一个全面的基准,用于评估VLM在自动驾驶场景中的表现,重点关注外部和车内安全风险。基准包括10个关键类别和28个子类别,涵盖了多种场景。对各种VLM的评估显示,在复杂的安全关键情况下表现显著下降,强调了增强安全措施的必要性。通过对98K实例的大规模数据集进行微调,可以显著提高VLM的安全性能,展示了推进自动驾驶技术的潜力。
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Authors: Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian liu, Huan Wang
First: 2025-11-18T15:22:32+00:00 · Latest: 2025-11-18T15:22:32+00:00
Comments: Code Link: https://github.com/KD-TAO/OmniZip
Abstract
Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning and preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves a 3.42X inference speedup and a 1.4X memory reduction over other top-performing counterparts, while maintaining performance with no training.
中文标题/摘要
标题:OmniZip:基于音频指导的动态令牌压缩以实现快速多模态大型语言模型
多模态大型语言模型(OmniLLMs)近年来在统一的音频-视频理解方面引起了越来越多的研究关注,其中处理音频-视频令牌序列已成为一个显著的计算瓶颈。现有的令牌压缩方法尚未满足同时压缩多模态令牌的新兴需求。为了解决这一问题,我们提出了OmniZip,这是一种无需训练、基于音频的音频-视觉令牌压缩框架,优化了多模态令牌表示并加速了推理。具体而言,OmniZip 首先识别出重要的音频令牌,然后为每个时间组计算一个音频保留分数以捕捉信息密度,从而动态指导视频令牌的剪枝,并保留由跨模态相似性增强的音频锚点提供的线索。对于每个时间窗口,OmniZip 使用交错的空间-时间方案压缩视频令牌。广泛的实验证明了OmniZip 的优点——它实现了3.42倍的推理加速和1.4倍的内存减少,同时保持性能无需训练。
Summary / 总结
OmniZip is an audio-guided token compression framework for multimodal large language models, addressing the computational bottleneck in processing audio-video token sequences. It identifies salient audio tokens and computes an audio retention score for each time group to guide dynamic video token pruning. This results in a 3.42X inference speedup and 1.4X memory reduction without compromising performance.
OmniZip 是一种基于音频的多模态大语言模型中的令牌压缩框架,旨在解决处理音频-视频令牌序列时的计算瓶颈。它根据显著的音频令牌和跨模态相似性动态压缩视频令牌,实现了3.42倍的推理加速和1.4倍的内存减少,同时保持了性能。该方法无需训练,可以增强多模态令牌表示以实现更快的推理。
CARScenes: Semantic VLM Dataset for Safe Autonomous Driving
Authors: Yuankai He, Weisong Shi
First: 2025-11-12T21:13:19+00:00 · Latest: 2025-11-18T15:20:04+00:00
Comments: 8 pages, 6 figures, 7 tables
Abstract
CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
中文标题/摘要
标题:CARScenes:安全自动驾驶的语义VLM数据集
CAR-Scenes 是一个用于自动驾驶的帧级数据集,用于训练和评估能够进行可解释、场景级理解的视觉-语言模型(VLMs)。我们使用包含环境、道路几何、背景车辆行为、ego车辆行为、弱势道路使用者、传感器状态和离散严重程度等级(1-10)的28键类别/子类别知识库,对来自Argoverse 1、Cityscapes、KITTI和nuScenes的5,192张图像进行标注,总计350多个叶属性。标签由GPT-4o辅助的视觉-语言管道生成,并通过人工在环验证;我们发布了具体的提示、后处理规则以及各领域的基线模型性能。CAR-Scenes 还提供了属性共现图和JSONL记录,支持语义检索、数据集筛选和跨源的风险感知场景挖掘。为了校准任务难度,我们包括了可复现、非基准的基线模型,特别是LoRA调优的Qwen2-VL-2B,通过标量准确性、微平均F1和固定验证分割的严重程度MAE/RMSE进行评估。我们公开发布了注释和分析脚本,包括图构建和评估脚本,以促进未来智能车辆的可解释、数据为中心的工作流程。数据集:https://github.com/Croquembouche/CAR-Scenes
Summary / 总结
CAR-Scenes is a dataset for autonomous driving that enables training and evaluation of vision-language models for scene-level understanding. It includes 5,192 annotated images from various sources, with labels covering 350+ attributes across 28 categories. The dataset is annotated using a GPT-4o-assisted vision-language pipeline and provides attribute co-occurrence graphs and JSONL records. Experimental results show the performance of a LoRA-tuned Qwen2-VL-2B model on a fixed validation split, evaluated by scalar accuracy, micro-averaged F1, and severity MAE/RMSE. The dataset and analysis scripts are publicly available to support explainable, data-centric workflows for future intelligent vehicles.
CAR-Scenes 是一个用于自动驾驶的帧级数据集,旨在训练和评估视觉-语言模型以实现场景级理解。该数据集包含来自多个来源的 5,192 张标注图像,涵盖了 350 多个与环境、道路几何、车辆行为等相关的属性。这些标签使用 GPT-4o 辅助的管道进行标注,并经过人工验证,同时提供了属性共现图和 JSONL 记录以支持语义检索。实验结果显示,LoRA 调整的 Qwen2-VL-2B 模型在固定验证集上的性能通过准确性、F1 分数和严重性 MAE/RMSE 评估。数据集和分析脚本已公开,以支持未来智能车辆的可解释、数据为中心的工作流程。
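The evaluation metrics named in the CAR-Scenes entry above (micro-averaged F1 for list attributes, MAE/RMSE for severity) follow standard definitions, which can be sketched as below. This is a minimal illustration only: the attribute values and severity numbers are made up, and the function names `micro_f1` and `mae_rmse` are hypothetical, not taken from the released evaluation scripts.

```python
import math

def micro_f1(predicted_lists, reference_lists):
    """Micro-averaged F1: pool true/false positives and false negatives over all samples."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted_lists, reference_lists):
        pred_set, ref_set = set(pred), set(ref)
        tp += len(pred_set & ref_set)   # labels predicted and present
        fp += len(pred_set - ref_set)   # labels predicted but absent
        fn += len(ref_set - pred_set)   # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mae_rmse(predicted, reference):
    """Mean absolute error and root-mean-square error for a scalar field."""
    errors = [p - r for p, r in zip(predicted, reference)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Made-up example: a list attribute for two frames, and severity on the 1-10 scale.
pred_lists = [["pedestrian", "cyclist"], ["pedestrian"]]
ref_lists = [["pedestrian"], ["pedestrian", "scooter"]]
f1 = micro_f1(pred_lists, ref_lists)           # tp=2, fp=1, fn=1
mae, rmse = mae_rmse([3, 7, 5], [4, 7, 9])     # errors: -1, 0, -4
```

Micro-averaging pools counts across frames before computing F1, so frames with many labels weigh more than frames with few; whether the released scripts pool per field or across fields is not stated in the abstract.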
Task Addition and Weight Disentanglement in Closed-Vocabulary Models
Authors: Adam Hazimeh, Alessandro Favero, Pascal Frossard
First: 2025-11-18T15:12:21+00:00 · Latest: 2025-11-18T15:12:21+00:00
Abstract
Task arithmetic has recently emerged as a promising method for editing pre-trained open-vocabulary models, offering a cost-effective alternative to standard multi-task fine-tuning. However, despite the abundance of closed-vocabulary models that are not pre-trained with language supervision, applying task arithmetic to these models remains unexplored. In this paper, we deploy and study task addition in closed-vocabulary image classification models. We consider different pre-training schemes and find that weight disentanglement (the property enabling task arithmetic) is a general consequence of pre-training, as it appears in different pre-trained closed-vocabulary models. In fact, we find that pre-trained closed-vocabulary vision transformers can also be edited with task arithmetic, achieving high task addition performance and enabling the efficient deployment of multi-task models. Finally, we demonstrate that simple linear probing is a competitive baseline to task addition. Overall, our findings expand the applicability of task arithmetic to a broader class of pre-trained models and open the way for more efficient use of pre-trained models in diverse settings.
中文标题/摘要
标题:封闭词汇模型中的任务添加与权重解耦
任务算术最近作为一种有前途的方法,被用作编辑预训练的开放词汇模型的手段,提供了一种比标准多任务微调更经济的选择。然而,尽管存在大量未用语言监督预训练的封闭词汇模型,但将任务算术应用于这些模型仍然未被探索。在本文中,我们部署并研究了封闭词汇图像分类模型中的任务添加。我们考虑了不同的预训练方案,并发现权重解耦(使任务算术成为可能的特性)是预训练的普遍结果,因为它在不同的预训练封闭词汇模型中出现。事实上,我们发现预训练的封闭词汇视觉变换器也可以用任务算术进行编辑,实现了高任务添加性能,并使多任务模型的高效部署成为可能。最后,我们证明了简单的线性探针是任务添加的一个有竞争力的基线。总体而言,我们的发现扩大了任务算术的应用范围,使其适用于更广泛的预训练模型,并为在各种场景中更高效地使用预训练模型铺平了道路。
Summary / 总结
This paper explores the application of task addition in closed-vocabulary image classification models, a method previously studied mainly in open-vocabulary models. The authors find that weight disentanglement, a key property for task addition, is a general outcome of pre-training across different models. They demonstrate that pre-trained closed-vocabulary vision transformers can be effectively edited using task addition, achieving high performance and enabling efficient multi-task deployment. The study also shows that simple linear probing can be a competitive alternative to task addition. These findings broaden the scope of task arithmetic's applicability to a wider range of pre-trained models.
该论文探讨了任务算术在封闭词汇模型中的应用,特别是图像分类。作者发现,使任务算术得以实现的权重解耦是预训练模型的普遍特性。他们证明,预训练的视觉变换器可以通过任务算术得到有效编辑,实现高任务添加性能,并使多任务模型的部署更加高效。线性探针在这一场景中被证明是任务算术的一个有竞争力的替代方案。
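Task addition, as discussed in the entry above, is commonly formulated as adding scaled "task vectors" (fine-tuned weights minus pre-trained weights) back onto the pre-trained weights. The sketch below illustrates that general recipe on toy NumPy arrays; the dictionary layout, the names `task_vector` and `task_addition`, and the scaling coefficient `alpha` are assumptions for illustration, not the paper's code.

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """Task vector: element-wise difference between fine-tuned and pre-trained weights."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}

def task_addition(pretrained, task_vectors, alpha=1.0):
    """Merge several tasks by adding their scaled task vectors to the pre-trained weights."""
    merged = {name: w.copy() for name, w in pretrained.items()}
    for tv in task_vectors:
        for name in merged:
            merged[name] += alpha * tv[name]
    return merged

# Toy weights (hypothetical): one parameter tensor, two fine-tuned variants.
pre = {"w": np.zeros(3)}
ft_a = {"w": np.array([1.0, 0.0, 0.0])}
ft_b = {"w": np.array([0.0, 2.0, 0.0])}

merged = task_addition(
    pre, [task_vector(pre, ft_a), task_vector(pre, ft_b)], alpha=0.5
)
```

In this toy case the two task vectors touch disjoint coordinates, which is an idealized picture of weight disentanglement: each task's edit does not interfere with the other's. In practice `alpha` is typically tuned on held-out data.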
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
Authors: Hubert Baniecki, Maximilian Muschalik, Fabian Fumagalli, Barbara Hammer, Eyke Hüllermeier, Przemyslaw Biecek
Venue: NeurIPS 2025
First: 2025-08-07T14:18:56+00:00 · Latest: 2025-11-18T15:05:51+00:00
Comments: NeurIPS 2025. Code: https://github.com/hbaniecki/fixlip
Abstract
Language-image pre-training (LIP) enables the development of vision-language models capable of zero-shot classification, localization, multimodal retrieval, and semantic understanding. Various explanation methods have been proposed to visualize the importance of input image-text pairs on the model's similarity outputs. However, popular saliency maps are limited by capturing only first-order attributions, overlooking the complex cross-modal interactions intrinsic to such encoders. We introduce faithful interaction explanations of LIP models (FIxLIP) as a unified approach to decomposing the similarity in vision-language encoders. FIxLIP is rooted in game theory, where we analyze how using the weighted Banzhaf interaction index offers greater flexibility and improves computational efficiency over the Shapley interaction quantification framework. From a practical perspective, we propose how to naturally extend explanation evaluation metrics, such as the pointing game and area between the insertion/deletion curves, to second-order interaction explanations. Experiments on the MS COCO and ImageNet-1k benchmarks validate that second-order methods, such as FIxLIP, outperform first-order attribution methods. Beyond delivering high-quality explanations, we demonstrate the utility of FIxLIP in comparing different models, e.g. CLIP vs. SigLIP-2.
中文标题/摘要
标题:使用加权班扎夫相互作用解释视觉-语言编码器中的相似性
语言-图像预训练(LIP)使开发能够在零样本分类、定位、多模态检索和语义理解方面发挥作用的视觉-语言模型成为可能。已经提出了各种解释方法来可视化输入图像-文本对对模型相似性输出的重要性。然而,流行的显著图仅限于捕捉一阶归因,忽视了此类编码器中固有的复杂跨模态相互作用。我们引入了LIP模型的忠实相互作用解释(FIxLIP)作为分解视觉-语言编码器中相似性的统一方法。FIxLIP基于博弈论,我们分析了使用加权班扎夫相互作用指数如何提供更大的灵活性并提高计算效率,优于Shapley相互作用量化框架。从实用角度来看,我们提出了如何自然地将指针游戏和插入/删除曲线之间的面积等解释评估指标扩展到二阶相互作用解释。在MS COCO和ImageNet-1k基准上的实验验证了二阶方法,如FIxLIP,优于一阶归因方法。除了提供高质量的解释外,我们还展示了FIxLIP在比较不同模型方面的实用性,例如CLIP vs. SigLIP-2。
Summary / 总结
The research aims to provide more accurate explanations for the similarity outputs in vision-language models by addressing the limitations of first-order attributions. The authors introduce FIxLIP, which uses the weighted Banzhaf interaction index to decompose the similarity in these models, offering greater flexibility and computational efficiency than the Shapley interaction framework. Experiments on MS COCO and ImageNet-1k show that FIxLIP outperforms first-order attribution methods, providing high-quality explanations and aiding in model comparison.
研究旨在通过解决第一阶归因的局限性,为视觉-语言编码器的相似性输出提供更准确的解释。方法FIxLIP使用加权Banzhaf交互指数来分解相似性,相比Shapley交互量化框架,提供了更好的灵活性和计算效率。在MS COCO和ImageNet-1k上的实验表明,FIxLIP优于第一阶归因方法,提供了高质量的解释,并有助于模型比较。
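For readers unfamiliar with the weighted Banzhaf interaction index mentioned in the entry above, the pairwise index can be computed exactly on a small game by enumerating coalitions. The sketch below uses the standard definition, in which every other player joins a coalition independently with probability `p`; the toy value function stands in for a similarity score and is purely hypothetical, and exhaustive enumeration is only feasible for a handful of players (practical methods rely on sampling).

```python
from itertools import combinations

def weighted_banzhaf_interaction(value, n, i, j, p=0.5):
    """Exact pairwise weighted Banzhaf interaction between players i and j.

    Each coalition S of the remaining players is weighted by
    p**|S| * (1 - p)**(n - 2 - |S|); the summand is the second-order
    discrete derivative of the value function at S.
    """
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = p ** size * (1 - p) ** (len(others) - size)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (
                value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            )
    return total

def toy_similarity(coalition):
    """Hypothetical game: the score is 1 only when players 0 and 1 are both present."""
    return 1.0 if {0, 1} <= coalition else 0.0

pair_01 = weighted_banzhaf_interaction(toy_similarity, n=3, i=0, j=1, p=0.3)
pair_02 = weighted_banzhaf_interaction(toy_similarity, n=3, i=0, j=2, p=0.3)
```

Here players 0 and 1 only matter jointly, so their interaction is 1 while the interaction between players 0 and 2 is 0. A first-order attribution would split credit between players 0 and 1 without revealing that the effect is purely joint.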
Closed-Form Feedback-Free Learning with Forward Projection
Authors: Robert O'Shea, Bipin Rajendran
First: 2025-01-27T20:10:37+00:00 · Latest: 2025-11-18T14:56:55+00:00
Comments: 26 pages, 5 figures. Study code available at https://github.com/robertoshea/forward_projection. Study data available at https://data.mendeley.com/datasets/fb7xddyxs4/2
Abstract
State-of-the-art methods for backpropagation-free learning employ local error feedback to direct iterative optimisation via gradient descent. In this study, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. To address this challenge, we propose Forward Projection (FP). This randomised closed-form training method requires only a single forward pass over the entire dataset for model fitting, without retrograde communication. Our method generates target values for pre-activation membrane potentials at each layer through randomised nonlinear projections of pre-synaptic inputs and the labels, thereby encoding information from both sources. Local loss functions are optimised over pre-synaptic inputs using closed-form regression, without feedback from neuronal outputs or downstream layers. Interpretability is a key advantage of FP training; membrane potentials of hidden neurons in FP-trained networks encode information which is interpretable layer-wise as label predictions. We demonstrate the effectiveness of FP across four biomedical datasets, comparing it with backpropagation and local learning techniques such as Forward-Forward training and Local Supervision in multi-layer perceptron and convolutional architectures. In some few-shot learning tasks, FP yielded more generalisable models than those optimised via backpropagation. In large-sample tasks, FP-based models achieve generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, achieving a significant training speed-up.
中文标题/摘要
标题:无需反向传播的闭式无反馈学习与前向投影
当前最先进的无需反向传播的学习方法通过局部误差反馈来指导梯度下降迭代优化。本研究探讨了更严格的场景,其中神经元输出的反向通信无法用于前向突触权重优化。为了解决这一挑战,我们提出了前向投影(FP)。这是一种随机闭式训练方法,仅需对整个数据集进行一次前向传递即可完成模型拟合,无需反向通信。该方法通过随机非线性投影前向突触输入和标签来生成每一层前激活膜电位的目标值,从而编码来自两个来源的信息。前向突触输入的局部损失函数通过闭式回归进行优化,无需来自神经元输出或下游层的反馈。FP训练的可解释性是其主要优势;FP训练网络中的隐藏神经元膜电位逐层编码可解释的标签预测信息。我们在四个生物医学数据集上展示了FP的有效性,将其与多层感知机和卷积架构中的反向传播和局部学习技术(如前向-前向训练和局部监督)进行了比较。在某些少样本学习任务中,FP生成的模型比通过反向传播优化的模型更具泛化能力。在大规模样本任务中,基于FP的方法在泛化能力上与基于梯度下降的局部学习方法相当,但只需一个前向传播步骤,从而显著提高了训练速度。
Summary / 总结
This study addresses the challenge of backpropagation-free learning by proposing Forward Projection (FP), a randomised closed-form training method that requires only a single forward pass over the dataset. FP generates target values for pre-activation membrane potentials through nonlinear projections of inputs and labels, optimizing local loss functions without feedback from neuronal outputs. The method demonstrates effectiveness across four biomedical datasets, showing better generalization in few-shot learning tasks and comparable generalization to gradient descent-based methods in large-sample tasks with significant speedup.
本研究提出了一种前向投影(FP)方法,这是一种随机化的封闭形式训练方法,旨在解决无需反向传播的学习挑战。FP方法仅需一次前向传播遍历整个数据集,并通过随机非线性投影生成预激活膜电位的目标值。该方法不依赖神经元输出的反馈来优化局部损失函数,展示了在生物医学数据集上的有效性。在少量样本的学习任务中,FP训练的模型在泛化能力上优于反向传播方法;而在大量样本的任务中,FP方法的模型泛化能力与基于梯度下降的局部学习方法相当,但训练速度显著加快。
From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D
Authors: Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
First: 2025-03-29T04:51:50+00:00 · Latest: 2025-11-18T14:31:17+00:00
Comments: Project page: https://fudan-zvg.github.io/spar
Abstract
Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.
中文标题/摘要
标题:从平面向空间:训练视觉-语言模型感知和推理三维
最近在LVLMs方面的进展提高了视觉-语言理解,但它们仍然在空间感知方面存在困难,限制了它们对复杂三维场景进行推理的能力。与之前将三维表示整合到模型中以提高空间理解的方法不同,我们旨在通过利用与空间相关的图像数据来解锁VLMs的潜力。为此,我们引入了一种基于具有三维真实值的场景数据的新颖二维空间数据生成和注释管道。该管道使我们能够创建从基本感知任务到更复杂推理任务的多样化空间任务集。利用此管道,我们构建了SPAR-7M,这是一个从多个公共数据集中数千个场景生成的大规模数据集。此外,我们引入了SPAR-Bench,这是一个旨在提供比现有空间基准更全面评估空间能力的基准,支持单视图和多视图输入。在SPAR-7M和大规模二维数据集上进行训练使我们的模型在二维空间基准上达到了最先进的性能。进一步针对三维任务特定数据集进行微调取得了竞争力的结果,突显了我们数据集在增强空间推理方面的有效性。
Summary / 总结
This paper addresses the limitation of vision-language models in spatial perception, which hinders their ability to reason about 3D scenes. It introduces a novel 2D spatial data generation and annotation pipeline to create SPAR-7M, a large-scale dataset, and SPAR-Bench, a benchmark for evaluating spatial capabilities. Models trained on SPAR-7M and 2D datasets achieve state-of-the-art performance on 2D spatial benchmarks and competitive results on 3D tasks after fine-tuning.
研究旨在通过增强视觉语言模型(VLMs)的空间感知能力,提高其对复杂3D场景的推理能力。作者开发了一种新颖的2D空间数据生成和注释管道,创建了SPAR-7M大规模数据集,并引入了SPAR-Bench用于评估空间能力。在SPAR-7M和2D数据集上训练的模型在2D空间基准测试中达到了最先进的性能,进一步在3D任务特定数据集上的微调也取得了竞争力的结果,证明了该方法的有效性。
Segmentation-Driven Initialization for Sparse-view 3D Gaussian Splatting
Authors: Yi-Hsin Li, Thomas Sikora, Sebastian Knorr, Mårten Sjöström
First: 2025-09-15T12:31:33+00:00 · Latest: 2025-11-18T14:13:44+00:00
Abstract
Sparse-view synthesis remains a challenging problem due to the difficulty of recovering accurate geometry and appearance from limited observations. While recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time rendering with competitive quality, existing pipelines often rely on Structure-from-Motion (SfM) for camera pose estimation, an approach that struggles in genuinely sparse-view settings. Moreover, several SfM-free methods replace SfM with multi-view stereo (MVS) models, but generate massive numbers of 3D Gaussians by back-projecting every pixel into 3D space, leading to high memory costs. We propose Segmentation-Driven Initialization for Gaussian Splatting (SDI-GS), a method that mitigates this inefficiency by leveraging region-based segmentation to identify and retain only structurally significant regions. This enables selective downsampling of the dense point cloud, preserving scene fidelity while substantially reducing Gaussian count. Experiments across diverse benchmarks show that SDI-GS reduces Gaussian count by up to 50% and achieves comparable or superior rendering quality in PSNR and SSIM, with only marginal degradation in LPIPS. It further enables faster training and a lower memory footprint, advancing the practicality of 3DGS for constrained-view scenarios.
中文标题/摘要
标题:基于分割驱动初始化的稀视角3D高斯点云合成
稀视角合成仍然是一个具有挑战性的问题,因为从有限的观察中恢复准确的几何形状和外观非常困难。尽管最近在3D高斯点云合成(3DGS)方面的进展使得实时渲染具有竞争力的质量成为可能,但现有的流水线通常依赖于结构从运动(SfM)进行相机姿态估计,这种方法在真正稀视角设置中表现不佳。此外,一些无SfM的方法用多视图立体(MVS)模型替代SfM,但通过将每个像素反投影到3D空间生成大量的3D高斯点,导致高内存成本。我们提出了基于分割驱动初始化的高斯点云合成(SDI-GS)方法,该方法通过利用区域分割来识别并保留仅结构上重要的区域,从而减轻了低效性。这使得密集点云的选择性下采样成为可能,同时保持场景保真度并大幅减少高斯点的数量。跨多种基准的实验表明,SDI-GS将高斯点的数量最多减少50%,在PSNR和SSIM方面达到可比或更优的渲染质量,仅在LPIPS上有轻微的退化。此外,它还使训练速度更快,内存占用更小,推动了3DGS在受限视角场景中的实用性。
Summary / 总结
The paper addresses the challenge of sparse-view synthesis by proposing SDI-GS, which uses region-based segmentation to selectively retain structurally significant regions, reducing the number of 3D Gaussians by up to 50% while maintaining or improving rendering quality. Experiments show that SDI-GS achieves comparable or superior PSNR and SSIM scores and reduces memory usage and training time, making 3D Gaussian Splatting more practical for constrained-view scenarios.
论文通过提出SDI-GS方法,利用区域分割减少3D高斯的数量同时保持场景保真度。该方法最多可减少50%的高斯数量,并在PSNR和SSIM上提高渲染质量,LPIPS略有下降。SDI-GS还加快了训练速度并减少了内存使用,使3D高斯散点图在受限视角场景中更具实用性。
Deep Equilibrium models for Poisson Imaging Inverse problems via Mirror Descent
Authors: Christian Daniele, Silvia Villa, Samuel Vaiter, Luca Calatroni
First: 2025-07-15T16:33:01+00:00 · Latest: 2025-11-18T14:10:36+00:00
Abstract
Deep Equilibrium Models (DEQs) are implicit neural networks with fixed points, which have recently gained attention for learning image regularization functionals, particularly in settings involving Gaussian fidelities, where assumptions on the forward operator ensure contractiveness of standard (proximal) Gradient Descent operators. In this work, we extend the application of DEQs to Poisson inverse problems, where the data fidelity term is more appropriately modeled by the Kullback-Leibler divergence. To this end, we introduce a novel DEQ formulation based on Mirror Descent defined in terms of a tailored non-Euclidean geometry that naturally adapts to the structure of the data term. This enables the learning of neural regularizers within a principled training framework. We derive sufficient conditions and establish refined convergence results based on the Kurdyka-Lojasiewicz framework for subanalytic functions with non-closed domains to guarantee the convergence of the learned reconstruction scheme and propose computational strategies that enable both efficient training and parameter-free inference. Numerical experiments show that our method outperforms traditional model-based approaches and performs comparably to Bregman Plug-and-Play methods, while mitigating their typical drawbacks, such as time-consuming tuning of hyper-parameters. The code is publicly available at https://github.com/christiandaniele/DEQ-MD.
中文标题/摘要
标题:基于镜像下降的深度平衡模型求解泊松成像逆问题
深度平衡模型(DEQs)是具有不动点的隐式神经网络,近年来因其在学习图像正则化泛函方面的应用而受到关注,特别是在涉及高斯保真度的情况下,前向算子的假设确保了标准(近端)梯度下降算子的收缩性。在本文中,我们扩展了DEQs在泊松逆问题中的应用,其中数据保真项更适当地由Kullback-Leibler散度建模。为此,我们引入了一种基于镜像下降的新DEQ公式,该镜像下降定义在针对数据项结构量身定制的非欧几里得几何之上。这使得在有原则的训练框架中学习神经正则化器成为可能。我们基于具有非闭定义域的次解析函数的Kurdyka-Lojasiewicz框架推导了充分条件并建立了细化的收敛结果,以保证所学重建方案的收敛,并提出了计算策略,以实现高效的训练和无参数的推理。数值实验表明,我们的方法优于传统的基于模型的方法,并且在性能上与Bregman即插即用(Plug-and-Play)方法相当,同时减轻了它们的典型缺点,如超参数调整耗时。代码可在https://github.com/christiandaniele/DEQ-MD上公开获取。
Summary / 总结
This paper applies Deep Equilibrium Models (DEQs) to Poisson imaging inverse problems. The authors extend DEQs from Gaussian fidelities to Poisson data by formulating Mirror Descent with a tailored non-Euclidean geometry. Key findings show that the proposed method outperforms traditional model-based approaches and is competitive with Bregman Plug-and-Play methods, while avoiding their hyper-parameter tuning issues. Convergence results and computational strategies are provided to ensure efficient training and inference. The code is publicly available.
本文介绍了将深度平衡模型(DEQs)应用于泊松逆问题的方法,将其数据保真项从高斯保真度扩展到Kullback-Leibler散度。通过在特定非欧几里得几何中使用镜像下降法来形式化DEQs,作者能够在原则性的框架中学习神经正则化器。实验表明,该方法在性能上优于传统的基于模型的方法,并且与Bregman即插即用(Plug-and-Play)方法相当,同时避免了耗时的超参数调整。基于Kurdyka-Lojasiewicz框架建立了收敛结果。
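To make the "Mirror Descent on a Kullback-Leibler data term" idea in the entry above concrete, here is a minimal sketch of a plain mirror-descent iteration on a noiseless toy Poisson-type problem. Several things are assumptions: the abstract does not state which mirror map the authors use, so this sketch picks the negative Shannon entropy (which gives a multiplicative update that keeps iterates positive); there is no learned regularizer or equilibrium solver here; and the matrix `A`, step size, and iteration count are arbitrary toy choices.

```python
import numpy as np

def kl_fidelity(A, x, y):
    """Generalized Kullback-Leibler divergence KL(y, Ax) used as a Poisson data term."""
    Ax = A @ x
    return float(np.sum(Ax - y + y * np.log(y / Ax)))

def kl_gradient(A, x, y):
    """Gradient of KL(y, Ax) with respect to x."""
    return A.T @ (1.0 - y / (A @ x))

def entropic_mirror_descent(A, y, x0, step=0.3, iters=1000):
    """Mirror descent with the (assumed) entropy mirror map: a multiplicative update."""
    x = x0.copy()
    for _ in range(iters):
        # The gradient step is taken on log(x), so x stays strictly positive.
        x = x * np.exp(-step * kl_gradient(A, x, y))
    return x

# Toy forward operator and noiseless data (hypothetical values).
A = np.array([[1.0, 0.5], [0.5, 1.0]])
x_true = np.array([2.0, 3.0])
y = A @ x_true

x_hat = entropic_mirror_descent(A, y, x0=np.ones(2))
```

The point of the non-Euclidean geometry is visible even in this toy: the update never leaves the positive orthant, where the logarithm in the data term is defined, without any explicit projection.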
DepthVision: Enabling Robust Vision-Language Models with GAN-Based LiDAR-to-RGB Synthesis for Autonomous Driving
Authors: Sven Kirchner, Nils Purschke, Ross Greer, Alois C. Knoll
First: 2025-09-09T07:42:07+00:00 · Latest: 2025-11-18T13:49:21+00:00
Abstract
Ensuring reliable autonomous operation when visual input is degraded remains a key challenge in intelligent vehicles and robotics. We present DepthVision, a multimodal framework that enables Vision-Language Models (VLMs) to exploit LiDAR data without any architectural changes or retraining. DepthVision synthesizes dense, RGB-like images from sparse LiDAR point clouds using a conditional GAN with an integrated refiner, and feeds these into off-the-shelf VLMs through their standard visual interface. A Luminance-Aware Modality Adaptation (LAMA) module fuses synthesized and real camera images by dynamically weighting each modality based on ambient lighting, compensating for degradation such as darkness or motion blur. This design turns LiDAR into a drop-in visual surrogate when RGB becomes unreliable, effectively extending the operational envelope of existing VLMs. We evaluate DepthVision on real and simulated datasets across multiple VLMs and safety-critical tasks, including vehicle-in-the-loop experiments. The results show substantial improvements in low-light scene understanding over RGB-only baselines while preserving full compatibility with frozen VLM architectures. These findings demonstrate that LiDAR-guided RGB synthesis is a practical pathway for integrating range sensing into modern vision-language systems for autonomous driving.
中文标题/摘要
标题:DepthVision:基于GAN的LiDAR到RGB合成使视觉-语言模型在自主驾驶中更加稳健
确保在视觉输入退化时可靠地实现自主操作仍然是智能车辆和机器人领域的一项关键挑战。我们提出了DepthVision,这是一种多模态框架,使视觉-语言模型(VLMs)能够利用LiDAR数据,而无需进行任何架构更改或重新训练。DepthVision使用一个条件GAN和集成的细化器从稀疏的LiDAR点云中合成密集的、类似RGB的图像,并通过标准视觉接口将这些图像输入现成的VLMs。一种亮度感知模态适应(LAMA)模块通过根据环境照明动态加权每种模态来融合合成和真实相机图像,补偿诸如黑暗或运动模糊等退化。此设计将LiDAR转换为当RGB不可靠时的即插即用视觉替代品,有效地扩展了现有VLMs的操作范围。我们在多个VLM和安全关键任务上,包括车辆在环实验,对DepthVision进行了实际和模拟数据集的评估。结果表明,与仅基于RGB的基线相比,在低光场景理解方面取得了显著改进,同时保持了与冻结VLM架构的完全兼容性。这些发现表明,基于LiDAR的RGB合成是将距离传感集成到现代视觉-语言系统以实现自主驾驶的一种实用途径。
Summary / 总结
DepthVision is a multimodal framework that enables Vision-Language Models to use LiDAR data by synthesizing dense RGB-like images from sparse LiDAR point clouds using a conditional GAN. It includes a Luminance-Aware Modality Adaptation module that dynamically weights synthesized and real camera images based on lighting conditions. The framework improves low-light scene understanding significantly compared to RGB-only baselines while maintaining compatibility with existing VLM architectures. Evaluations on real and simulated datasets show that DepthVision enhances the operational envelope of VLMs in autonomous driving scenarios.
DepthVision 是一种多模态框架,通过整合 LiDAR 数据来增强视觉-语言模型(VLMs)在视觉退化条件下的性能。它使用条件 GAN 从稀疏 LiDAR 点云中合成密集的 RGB 类似图像,并使用亮度感知模态适应(LAMA)模块将合成图像与真实图像融合,以补偿光照条件。实验结果显示,在低光照场景理解方面,与仅使用 RGB 的基线相比有显著改进,同时保持与现有 VLM 架构的兼容性。
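The abstract does not say how LAMA computes its weights, so the following is only a minimal sketch of luminance-based fusion under stated assumptions. It assumes one global weight per frame, derived from the camera image's mean luminance through a logistic gate. The names `mean_luminance`, `camera_weight`, and `fuse`, and the `threshold` and `sharpness` values, are hypothetical and not taken from the paper.

```python
import math

# Rec. 601 luma coefficients for converting RGB to luminance.
LUMA = (0.299, 0.587, 0.114)


def mean_luminance(image):
    """Mean luminance in [0, 1] of an image given as rows of (r, g, b) floats in [0, 1]."""
    total, count = 0.0, 0
    for row in image:
        for r, g, b in row:
            total += LUMA[0] * r + LUMA[1] * g + LUMA[2] * b
            count += 1
    return total / count


def camera_weight(luminance, threshold=0.25, sharpness=20.0):
    """Hypothetical logistic gate: close to 0 in darkness, close to 1 in good light."""
    return 1.0 / (1.0 + math.exp(-sharpness * (luminance - threshold)))


def fuse(camera, synthesized, threshold=0.25, sharpness=20.0):
    """Convex blend of the camera image and the LiDAR-synthesized image."""
    w = camera_weight(mean_luminance(camera), threshold, sharpness)
    return [
        [tuple(w * c + (1.0 - w) * s for c, s in zip(cam_px, syn_px))
         for cam_px, syn_px in zip(cam_row, syn_row)]
        for cam_row, syn_row in zip(camera, synthesized)
    ]
```

With a fully dark camera frame the blend falls back almost entirely on the synthesized image, which is the "drop-in visual surrogate" behaviour the abstract describes. A real system would more plausibly weight per region and account for motion blur as well.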
Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM
Authors: Jack Qin, Zhitao Wang, Yinan Zheng, Keyu Chen, Yang Zhou, Yuanxin Zhong, Siyuan Cheng
First: 2025-11-18T13:46:18+00:00 · Latest: 2025-11-18T13:46:18+00:00
Abstract
The autonomous driving (AD) system has exhibited remarkable performance in complex driving scenarios. However, generalization, meaning the ability to handle unseen scenarios or unfamiliar sensor configurations, is still a key limitation of current systems. Related works have explored the use of Vision-Language Models (VLMs) to address few-shot or zero-shot tasks. While promising, these methods introduce a new challenge: the emergence of a hybrid AD system, where two distinct systems are used to plan a trajectory, leading to potential inconsistencies. Alternative research directions have explored Vision-Language-Action (VLA) frameworks that generate control actions directly from the VLM. However, these end-to-end solutions have prohibitive computational demands. To overcome these challenges, we introduce Risk Semantic Distillation (RSD), a novel framework that leverages VLMs to enhance the training of End-to-End (E2E) AD backbones. By providing risk attention for key objects, RSD addresses the issue of generalization. Specifically, we introduce RiskHead, a plug-in module that distills causal risk estimates from Vision-Language Models into Bird's-Eye-View (BEV) features, yielding interpretable risk-attention maps. This approach allows BEV features to learn richer and more nuanced risk attention representations, which directly enhance the model's ability to handle spatial boundaries and risky objects. By focusing on risk attention, RSD aligns better with human-like driving behavior, which is essential for navigating complex and dynamic environments. Our experiments on the Bench2Drive benchmark demonstrate the effectiveness of RSD in managing complex and unpredictable driving conditions. Due to the enhanced BEV representations enabled by RSD, we observed a significant improvement in both perception and planning capabilities.
中文标题/摘要
标题:利用VLM的风险语义精炼提升端到端自动驾驶
自动驾驶(AD)系统在复杂驾驶场景中表现出色。然而,泛化仍然是当前系统的关键限制,指的是处理未见过的场景或不熟悉传感器配置的能力。相关研究探索了使用视觉-语言模型(VLMs)来解决少样本或零样本任务。虽然前景看好,但这些方法引入了一个新的挑战:混合AD系统的出现,其中两个独立系统用于规划轨迹,可能导致潜在的不一致性。替代研究方向探索了视觉-语言-动作(VLA)框架,直接从VLM生成控制动作。然而,这些端到端解决方案表现出计算需求过高的问题。为克服这些挑战,我们引入了风险语义精炼(RSD),这是一种新颖的框架,利用VLMs提升端到端(E2E)AD主干的训练。通过为关键对象提供风险注意力,RSD解决了泛化问题。具体来说,我们引入了RiskHead,这是一种插件模块,从视觉-语言模型中提炼因果风险估计到鸟瞰图(BEV)特征,生成可解释的风险注意力图。这种方法使BEV特征能够学习更丰富和更细腻的风险注意力表示,直接增强了模型处理空间边界和危险对象的能力。通过关注风险注意力,RSD更好地与人类驾驶行为对齐,这对于在复杂和动态环境中导航至关重要。我们在Bench2Drive基准上的实验表明,RSD在管理复杂和不可预测的驾驶条件方面非常有效。由于RSD增强了BEV表示,我们观察到感知和规划能力都有显著提高。
Summary / 总结
The paper addresses the challenge of generalization in autonomous driving systems by introducing Risk Semantic Distillation (RSD), a novel framework that leverages Vision-Language Models (VLMs) to enhance the training of end-to-end autonomous driving backbones. RSD uses a RiskHead module to distill risk estimates from VLMs into Bird's-Eye-View (BEV) features, improving the model's ability to handle complex and unpredictable driving conditions. Experiments on the Bench2Drive benchmark show significant improvements in both perception and planning capabilities.
论文通过提出Risk Semantic Distillation (RSD)框架,利用Vision-Language Models (VLMs)来增强端到端的自动驾驶模型,解决了自动驾驶系统在复杂和不可预测驾驶条件下的泛化问题。RSD引入了RiskHead模块,将风险估计转化为Bird's-Eye-View特征,提高了模型处理复杂驾驶条件的能力。实验结果表明,RSD在感知和规划能力上取得了显著提升。
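The abstract describes RiskHead as distilling VLM risk estimates into BEV features but gives no loss function. The sketch below shows one common way such a distillation term could be written, and it is an assumption rather than the authors' method: a binary cross-entropy between a student risk map predicted over the BEV grid and a teacher risk map in [0, 1], added to the planning loss with a weight. The names `risk_distillation_loss` and `total_loss` and the `weight` parameter are hypothetical.

```python
import math


def risk_distillation_loss(student_logits, teacher_risk, eps=1e-7):
    """Mean binary cross-entropy between sigmoid(student_logits) and teacher risk.

    Both arguments are 2D grids (lists of rows) over the BEV plane;
    teacher_risk holds values in [0, 1] assumed to come from a VLM.
    """
    total, count = 0.0, 0
    for logit_row, risk_row in zip(student_logits, teacher_risk):
        for logit, target in zip(logit_row, risk_row):
            p = 1.0 / (1.0 + math.exp(-logit))
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
            count += 1
    return total / count


def total_loss(planning_loss, student_logits, teacher_risk, weight=0.5):
    """Planning loss plus a weighted auxiliary distillation term (weight is illustrative)."""
    return planning_loss + weight * risk_distillation_loss(student_logits, teacher_risk)
```

Because the teacher is only needed to compute this auxiliary term during training, a design of this kind would leave inference cost unchanged, which is consistent with the abstract's concern about the computational demands of VLA models.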
MAVias: Mitigate any Visual Bias
Authors: Ioannis Sarridis, Christos Koutlis, Symeon Papadopoulos, Christos Diou
First: 2024-12-09T16:23:51+00:00 · Latest: 2025-11-18T13:43:20+00:00
Abstract
Mitigating biases in computer vision models is an essential step towards the trustworthiness of artificial intelligence models. Existing bias mitigation methods focus on a small set of predefined biases, limiting their applicability in visual datasets where multiple, possibly unknown biases exist. To address this limitation, we introduce MAVias, an open-set bias mitigation approach leveraging foundation models to discover spurious associations between visual attributes and target classes. MAVias first captures a wide variety of visual features in natural language via a foundation image tagging model, and then leverages a large language model to select those visual features defining the target class, resulting in a set of language-coded potential visual biases. We then translate this set of potential biases into vision-language embeddings and introduce an in-processing bias mitigation approach to prevent the model from encoding information related to them. Our experiments on diverse datasets, including CelebA, Waterbirds, ImageNet, and UrbanCars, show that MAVias effectively detects and mitigates a wide range of biases in visual recognition tasks outperforming current state-of-the-art.
中文标题/摘要
标题:MAVias: 消除任何视觉偏见
消除计算机视觉模型中的偏见是提高人工智能模型可信度的重要步骤。现有的偏见缓解方法主要针对一组预定义的偏见,限制了它们在包含多种可能未知偏见的视觉数据集中的适用性。为了解决这一限制,我们引入了MAVias,这是一种利用基础模型发现视觉属性与目标类之间虚假关联的开放集偏见缓解方法。MAVias首先通过基础图像标签模型捕获自然语言中的广泛视觉特征,然后利用大型语言模型选择定义目标类的视觉特征,从而形成一组语言编码的潜在视觉偏见。我们随后将这一组潜在偏见转化为视觉-语言嵌入,并引入一种在线处理偏见缓解方法,以防止模型编码与它们相关的信息。我们在CelebA、Waterbirds、ImageNet和UrbanCars等多个数据集上的实验表明,MAVias能够有效检测和缓解视觉识别任务中广泛存在的多种偏见,优于当前最先进的方法。
Summary / 总结
MAVias is designed to mitigate various visual biases in computer vision models, addressing the limitation of existing methods that only handle predefined biases. It uses a foundation model to capture diverse visual features and a large language model to identify potential biases related to target classes. MAVias then applies in-processing bias mitigation to prevent the model from encoding these biases. Experiments on multiple datasets demonstrate that MAVias effectively detects and mitigates a wide range of biases, outperforming current state-of-the-art methods.
MAVias 通过利用基础模型发现视觉属性与目标类之间的虚假关联来缓解计算机视觉模型中的各种视觉偏见。它通过基础图像标签模型捕获广泛的视觉特征,并使用大型语言模型选择定义目标类的视觉特征,从而形成潜在的视觉偏见。这些偏见随后被翻译成视觉-语言嵌入,然后应用一种在处理过程中缓解偏见的方法,以防止模型编码相关的信息。在多样化的数据集上的实验表明,MAVias 有效地检测和缓解了各种偏见,并且优于当前最先进的方法。
WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables
Authors: Lino Gerlach, Liv Våge, Thore Gerlach, Elliott Kauffman
First: 2025-10-17T13:44:36+00:00 · Latest: 2025-11-18T12:48:39+00:00
Comments: Preprint. Under review
Abstract
Fast and efficient machine learning is of growing interest to the scientific community and has spurred significant research into novel model architectures and hardware-aware design. Recent hardware and software co-design approaches have demonstrated impressive results with entirely multiplication-free models. Differentiable Logic Gate Networks (DLGNs), for instance, provide a gradient-based framework for learning optimal combinations of low-level logic gates, setting state-of-the-art trade-offs between accuracy, resource usage, and latency. However, these models suffer from high computational cost during training and do not generalize well to logic blocks with more inputs. In this work, we introduce Walsh-Assisted Relaxation for Probabilistic Look-Up Tables (WARP-LUTs) - a novel gradient-based method that efficiently learns combinations of logic gates with substantially fewer trainable parameters. We demonstrate that WARP-LUTs achieve significantly faster convergence on CIFAR-10 compared to DLGNs, while maintaining comparable accuracy. Furthermore, our approach suggests potential for extension to higher-input logic blocks, motivating future research on extremely efficient deployment on modern FPGAs and its real-time science applications.
中文标题/摘要
标题:WARP-LUTs - Walsh辅助放松的概率查找表
快速高效的机器学习日益引起科学界的关注,并推动了新型模型架构和硬件感知设计的研究。最近的硬件和软件协同设计方法展示了完全无乘法运算模型的出色成果。例如,可微逻辑门网络(DLGNs)提供了一种基于梯度的学习低级逻辑门最优组合的框架,实现了在准确度、资源使用和延迟之间的最佳权衡。然而,这些模型在训练过程中计算成本高,并且不适用于具有更多输入的逻辑块。在本研究中,我们引入了Walsh辅助放松的概率查找表(WARP-LUTs)——一种高效的基于梯度的新方法,可以使用显著较少的可训练参数学习逻辑门的组合。我们证明,与DLGNs相比,WARP-LUTs在CIFAR-10上的收敛速度显著加快,同时保持了相当的准确度。此外,我们的方法表明有可能扩展到更高输入的逻辑块,这激发了未来研究在现代FPGA上极其高效的部署及其实时科学应用的动机。
Summary / 总结
WARP-LUTs is a novel gradient-based method that efficiently learns combinations of logic gates with fewer parameters, improving convergence speed on CIFAR-10 compared to DLGNs while maintaining similar accuracy. This method shows potential for extending to higher-input logic blocks, suggesting future research on efficient FPGA deployment and real-time science applications.
WARP-LUTs 是一种新型的基于梯度的方法,能够用更少的可训练参数高效地学习逻辑门的组合,相比 DLGNs 在 CIFAR-10 上实现了更快的收敛速度,同时保持了相近的准确率。它显示出向更高输入逻辑块扩展的潜力,激励未来在现代 FPGA 上的高效部署和实时科学应用的研究。
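The abstract does not spell out the parameterization, so the following only illustrates the standard Walsh-Hadamard expansion that the method's name suggests. Whether WARP-LUTs use exactly this form is an assumption. With bits encoded as ±1, any two-input look-up table can be written as c0 + c1·x1 + c2·x2 + c12·x1·x2, so four real coefficients describe the table. For independent probabilistic inputs, the same polynomial gives the expected output. The function names are hypothetical.

```python
from itertools import product


def to_sign(bit_or_prob):
    """Map a bit, or the probability that the bit is 1, from [0, 1] to [-1, +1]."""
    return 2.0 * bit_or_prob - 1.0


def walsh_coefficients(truth_table):
    """Walsh coefficients (c0, c1, c2, c12) of a 2-input table.

    truth_table maps (b1, b2) with bits in {0, 1} to an output value.
    """
    c0 = c1 = c2 = c12 = 0.0
    for b1, b2 in product((0, 1), repeat=2):
        y = truth_table[(b1, b2)]
        x1, x2 = to_sign(b1), to_sign(b2)
        c0 += y
        c1 += y * x1
        c2 += y * x2
        c12 += y * x1 * x2
    return tuple(c / 4.0 for c in (c0, c1, c2, c12))


def soft_lut(coeffs, p1, p2):
    """Expected table output for independent inputs with P(bit = 1) = p1, p2."""
    c0, c1, c2, c12 = coeffs
    x1, x2 = to_sign(p1), to_sign(p2)
    return c0 + c1 * x1 + c2 * x2 + c12 * x1 * x2


AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```

Since `soft_lut` is differentiable in both the coefficients and the inputs, the coefficients could in principle be trained directly by gradient descent and rounded back to a discrete table afterwards.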
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
Authors: Hong Gao, Yiming Bao, Xuezhen Tu, Yutong Xu, Yue Jin, Yiyang Mu, Bin Zhong, Linan Yue, Min-Ling Zhang
First: 2025-11-18T12:43:15+00:00 · Latest: 2025-11-18T12:43:15+00:00
Abstract
Video understanding requires not only visual recognition but also complex reasoning. While Vision-Language Models (VLMs) demonstrate impressive capabilities, they typically process videos largely in a single-pass manner with limited support for evidence revisit and iterative refinement. While recently emerging agent-based methods enable long-horizon reasoning, they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that can mirror human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis, (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, constituting the agent's interaction environment, and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
中文标题/摘要
标题:代理视频智能:一种灵活的高级视频探索与理解框架
视频理解不仅需要视觉识别,还需要复杂的推理。尽管视觉语言模型(VLMs)表现出色,但它们通常以单次处理视频的方式进行,缺乏对证据的重新访问和迭代优化的支持。虽然最近出现的基于代理的方法能够进行长期推理,但它们要么依赖昂贵的专有模型,要么需要大量的代理强化学习训练。为克服这些限制,我们提出了一种灵活且无需训练的框架——代理视频智能(AVI),通过系统级设计和优化,可以模拟人类的视频理解过程。AVI 引入了三个关键创新:(1)灵感来源于人类的三阶段推理过程(检索-感知-回顾),确保了充分的全局探索和集中的局部分析;(2)通过实体图组织的结构化视频知识库,以及多粒度集成工具,构成了代理的交互环境;(3)开源模型集合,结合推理大语言模型与轻量级基础计算机视觉模型和VLM,消除了对专有API或强化学习训练的依赖。在LVBench、VideoMME-Long、LongVideoBench和Charades-STA上的实验表明,AVI 在保持竞争力的同时提供了更好的可解释性。
Summary / 总结
The research aims to enhance video understanding by addressing the limitations of single-pass processing and the need for iterative refinement. AVI introduces a three-phase reasoning process (Retrieve-Perceive-Review), a structured video knowledge base, and an open-source model ensemble. Experiments show that AVI achieves competitive performance with better interpretability compared to existing methods.
研究旨在通过解决单次处理和需要迭代改进的局限性,提升视频理解能力。AVI 引入了检索-感知-回顾的三阶段推理过程、结构化的视频知识库以及开源模型集合。实验结果显示,AVI 达到了与现有方法相当的性能,且具有更好的可解释性。
RynnEC: Bringing MLLMs into Embodied World
Authors: Ronghao Dang, Yuqian Yuan, Yunxuan Mao, Kehan Li, Jiangpin Liu, Zhikai Wang, Xin Li, Fan Wang, Deli Zhao
First: 2025-08-19T18:00:01+00:00 · Latest: 2025-11-18T12:40:49+00:00
Comments: The technical report of RynnEC, an embodied cognition MLLM
Abstract
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general-purpose vision-language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region-level video interaction. Despite its compact architecture, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region-centric video paradigm for the brain of embodied agents, providing fine-grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC-Bench, a region-centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general-purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba-damo-academy/RynnEC
中文标题/摘要
标题:RynnEC:将MLLMs引入具身世界
我们介绍了RynnEC,这是一种用于具身认知的视频多模态大型语言模型。RynnEC基于通用的视觉-语言基础模型构建,集成了区域编码器和掩码解码器,能够灵活地进行区域级别的视频交互。尽管其架构紧凑,但RynnEC在物体属性理解、物体分割和空间推理方面均达到了最先进的性能。从概念上讲,它为具身代理的大脑提供了一种以区域为中心的视频范式,提供了对物理世界的精细感知,并使交互更加精确。为缓解标注的3D数据集稀缺问题,我们提出了一种基于第一人称视角视频的生成具身认知数据的管道。此外,我们还引入了RynnEC-Bench,这是一种以区域为中心的基准测试,用于评估具身认知能力。我们期望RynnEC将促进通用具身代理认知核心的发展,并促进在各种具身任务上的泛化。代码、模型检查点和基准测试可在:https://github.com/alibaba-damo-academy/RynnEC 获取
Summary / 总结
RynnEC is a video multimodal large language model for embodied cognition, built on a vision-language foundation model with a region encoder and mask decoder for flexible video interaction. It excels in object property understanding, object segmentation, and spatial reasoning despite its compact architecture. The proposed egocentric video pipeline addresses the scarcity of annotated 3D datasets, and RynnEC-Bench serves as a benchmark for evaluating embodied cognitive capabilities. This work aims to advance embodied agents' cognitive development and generalization across diverse tasks.
RynnEC 是一个用于具身认知的视频多模态大语言模型,基于一个视觉语言基础模型,并包含区域编码器和掩码解码器以实现灵活的视频交互。尽管架构紧凑,它在物体属性理解、物体分割和空间推理方面表现出色。提出的基于第一人称视频的数据生成管道解决了标注的 3D 数据集稀缺问题,而 RynnEC-Bench 则用于评估具身认知能力。这项工作旨在推动具身代理的认知发展和跨任务的泛化能力。
Watchdogs and Oracles: Runtime Verification Meets Large Language Models for Autonomous Systems
Authors: Angelo Ferrando
Venue: EPTCS 436, 2025, pp. 80-87
First: 2025-11-18T12:35:05+00:00 · Latest: 2025-11-18T12:35:05+00:00
Comments: In Proceedings FMAS 2025, arXiv:2511.13245
Abstract
Assuring the safety and trustworthiness of autonomous systems is particularly difficult when learning-enabled components and open environments are involved. Formal methods provide strong guarantees but depend on complete models and static assumptions. Runtime verification (RV) complements them by monitoring executions at run time and, in its predictive variants, by anticipating potential violations. Large language models (LLMs), meanwhile, excel at translating natural language into formal artefacts and recognising patterns in data, yet they remain error-prone and lack formal guarantees. This vision paper argues for a symbiotic integration of RV and LLMs. RV can serve as a guardrail for LLM-driven autonomy, while LLMs can extend RV by assisting specification capture, supporting anticipatory reasoning, and helping to handle uncertainty. We outline how this mutual reinforcement differs from existing surveys and roadmaps, discuss challenges and certification implications, and identify future research directions towards dependable autonomy.
中文标题/摘要
标题:看门狗与预言家:运行时验证与大型语言模型在自主系统中的结合
在涉及学习组件和开放环境的自主系统中,确保其安全性和可信性尤为困难。形式化方法可以提供强有力的保证,但依赖于完整模型和静态假设。运行时验证(RV)通过在运行时监控执行和预测潜在违规行为来补充它们。与此同时,大型语言模型(LLMs)在将自然语言转化为形式化制品以及识别数据模式方面表现出色,但它们仍然容易出错且缺乏形式化保证。本文认为,RV和LLMs应该共生整合。RV可以作为LLM驱动自主性的护栏,而LLMs可以扩展RV,通过辅助规范捕获、支持前瞻性推理和处理不确定性来提供支持。我们概述了这种相互强化与现有综述和路线图的不同之处,讨论了挑战和认证影响,并指出了未来研究方向,以实现可靠的自主性。
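To make the "RV as a guardrail" idea concrete, here is a minimal sketch of a runtime monitor that filters actions proposed by a planner. Everything in it is hypothetical and not taken from the paper: the safety property ("move" is allowed only while clearance is held), the event names, and the `stop` fallback. The planner, which could be an LLM, is represented only by the list of events it proposes.

```python
class ClearanceMonitor:
    """Finite-state monitor for the property: 'move' may occur only while clearance is held."""

    def __init__(self):
        self.cleared = False

    def permits(self, event):
        """Return True if executing `event` now keeps the property satisfied."""
        return event != "move" or self.cleared

    def observe(self, event):
        """Advance the monitor state with an event that was actually executed."""
        if event == "grant_clearance":
            self.cleared = True
        elif event == "revoke_clearance":
            self.cleared = False


def guarded_run(proposed_events, monitor, fallback="stop"):
    """Execute proposed events, replacing any that would violate the property."""
    executed = []
    for event in proposed_events:
        if not monitor.permits(event):
            event = fallback  # block the unsafe proposal
        monitor.observe(event)
        executed.append(event)
    return executed
```

For example, `guarded_run(["move", "grant_clearance", "move"], ClearanceMonitor())` replaces only the first `move`, since clearance has not yet been granted at that point.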
O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model
Authors: Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty
Venue: AAAI 2026
First: 2025-11-18T11:18:08+00:00 · Latest: 2025-11-18T11:18:08+00:00
Comments: Accepted to AAAI 2026
Abstract
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (SBIR and fine-grained SBIR), and (d) visual question answering (VQA), using the three existing sketch datasets QuickDraw!, Sketchy, and Tu Berlin along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
中文标题/摘要
标题:O3SLM:开放权重、开放数据和开放词汇量草图语言模型
虽然大型视觉语言模型(LVLMs)在越来越多的实际应用中被部署,但它们对抽象视觉输入的解释能力仍然有限。具体来说,它们难以理解手绘草图,这种模态提供了一种直观的方式来表达难以用文字描述的概念。我们确定的主要瓶颈是没有一个大规模的数据集可以同时建模草图、照片级真实图像及其相应的自然语言指令。为了解决这个问题,我们提出了两个关键贡献:(1) 一个新的大规模图像-草图-指令三元组数据集,旨在促进预训练和指令微调,以及(2) 在此数据集上训练的O3SLM。在多个基于草图的任务上的全面评估:(a) 物体定位,(b) 数量统计,(c) 图像检索,即(SBIR和细粒度SBIR),以及(d) 视觉问答(VQA),结合现有的三个草图数据集,即QuickDraw!、Sketchy和Tu Berlin,以及我们生成的SketchVCL数据集,表明O3SLM达到了最先进的性能,在草图理解和推理方面显著优于现有的LVLMs。
Summary / 总结
The research aims to enhance Large Vision Language Models' ability to interpret hand-drawn sketches, which are useful for expressing ideas that are hard to describe in text. To address this, the authors created a new large-scale dataset of image-sketch-instruction triplets and trained a model called O3SLM on this dataset. Experimental results show that O3SLM outperforms existing LVLMs on object localization, counting, image retrieval, and visual question answering, with substantially stronger sketch comprehension and reasoning.
研究旨在提高大型视觉语言模型对手绘草图的解释能力,因为草图对于表达复杂概念非常有用。为此,作者创建了一个新的大规模图像-草图-指令三元组数据集,并在该数据集上训练了一个名为O3SLM的LVLM。实验结果显示,O3SLM在对象定位、计数、图像检索和视觉问答等任务上表现出色,特别是在理解和推理草图方面优于现有模型。