arXiv Paper Digest

2025-12-24 03:31
Snapshot: 20251224_0331
Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
Authors: Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra
First: 2025-12-22T18:41:45+00:00 · Latest: 2025-12-22T18:41:45+00:00
Comments: 14 pages, 14 figures
Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
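The Recall@1 figures quoted above measure how often the correct image is ranked first for a text query. A minimal sketch of that metric follows, assuming query i is paired with image i and that a similarity matrix is already available; the function name and the toy scores are illustrative and do not come from the paper.

```python
def recall_at_k(similarity, k=1):
    """similarity[i][j] is the score between text query i and image j.

    Assumes image i is the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring images for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers).
toy = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],  # query 1 ranks image 2 first: a miss at k=1
    [0.2, 0.1, 0.7],
]
print(recall_at_k(toy, k=1))
print(recall_at_k(toy, k=2))
```

On the toy matrix, two of the three queries rank their own image first, and all three do so within the top two.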
Summary
The paper targets automated diabetic retinopathy (DR) diagnosis, where general-domain models such as CLIP align medical images and text poorly. The proposed knowledge-enhanced joint embedding framework uses separate encoders for retinal fundus images, clinical text, and structured patient data, fused by a joint multimodal transformer and trained with contrastive, reconstruction, and classification objectives. On BRSET it reports text-to-image Recall@1 of 99.94% (versus 1.29% for fine-tuned CLIP) and classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Zero-shot evaluation on the unseen DeepEyeNet dataset gives 93.95% Recall@1, versus 0.22% for fine-tuned CLIP.
AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models
Authors: Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang
First: 2025-09-16T06:16:05+00:00 · Latest: 2025-12-22T18:22:20+00:00
Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results
Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.
Summary
The authors have withdrawn this submission because of a fundamental methodological error that affects the validity of the main results; the claims below are as originally reported. The paper argued that existing Mixture of Experts (MoE) approaches in Large Vision-Language Models (LVLMs) struggle with the asymmetry between visual and linguistic processing. It proposed AsyMoE, which uses three expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts that suppress parametric biases and maintain contextual grounding. The reported gains were 26.58% and 15.45% in accuracy over vanilla MoE and modality-specific MoE, respectively, with 25.45% fewer activated parameters than dense models.
GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks
Authors: Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang
First: 2025-11-02T11:58:55+00:00 · Latest: 2025-12-22T18:21:18+00:00
Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results
Abstract
Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose GraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.
Summary
The authors have withdrawn this submission because of a fundamental methodological error that affects the validity of the main results; the claims below are as originally reported. GraphGeo was presented as a multi-agent debate framework for visual geo-localization built on heterogeneous graph neural networks. It models supportive collaboration, competitive argumentation, and knowledge transfer as typed edges, and combines node-level refinement with edge-level argumentation modeling in a dual-level debate mechanism. The paper reported that GraphGeo significantly outperformed state-of-the-art methods on multiple benchmarks.
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
Authors: Moritz Böhle, Amélie Royer, Juliette Marrie, Edouard Grave, Patrick Pérez
First: 2025-12-22T16:21:39+00:00 · Latest: 2025-12-22T16:21:39+00:00
Abstract
Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .
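One way to picture "local text-to-text interaction in the dedicated cross-attention layers" is as a rule for which keys each text query may attend to: every image token, plus a small causal window of text tokens. The sketch below only builds that key list. The window definition and the `window` parameter are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
def allowed_keys(num_image_tokens, num_text_tokens, window):
    """For each text query, list the key positions it may attend to.

    Keys are indexed with image tokens first (0 .. num_image_tokens - 1),
    followed by text tokens. `window` is a hypothetical local causal
    window size, counted in text tokens and including the query itself.
    """
    keys_per_query = []
    for t in range(num_text_tokens):
        image_keys = list(range(num_image_tokens))
        # Local text-to-text interaction: the query itself plus up to
        # window - 1 preceding text tokens.
        start = max(0, t - window + 1)
        text_keys = [num_image_tokens + s for s in range(start, t + 1)]
        keys_per_query.append(image_keys + text_keys)
    return keys_per_query


keys = allowed_keys(num_image_tokens=4, num_text_tokens=3, window=2)
for t, ks in enumerate(keys):
    print(t, ks)
```

With a fixed window, the number of keys per text query stays bounded by the image token count plus the window size, which matches the scalability argument made for cross-attention models in the abstract.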
Summary
The work addresses the memory and compute cost of token-insertion vision-language models (VLMs) on high-resolution images, long conversations, and streaming video. CASA (Cross-Attention via Self-Attention) enables local text-to-text interaction inside the dedicated cross-attention layers. This substantially narrows the performance gap with full token insertion on common image understanding benchmarks, while keeping the scalability of cross-attention models on long-context tasks such as streaming video captioning. Samples and code are linked from the project page at https://kyutai.org/casa .
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Authors: Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli
First: 2025-12-22T16:18:00+00:00 · Latest: 2025-12-22T16:18:00+00:00
Abstract
Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors like background noise, counterfactual priors, and strategic prompting and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
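To make "numerical correctness" concrete, the sketch below scores numeric estimates by relative error against ground truth. The abstract does not state the benchmark's actual scoring rule, so the tolerance-based metric, the function names, and the example values here are all assumptions for illustration.

```python
def relative_error(predicted, truth):
    """Absolute error as a fraction of the ground-truth magnitude."""
    return abs(predicted - truth) / abs(truth)


def score(predictions, truths, tolerance=0.1):
    """Fraction of predictions within `tolerance` relative error.

    This is an assumed metric, not the one defined by the benchmark.
    """
    within = [relative_error(p, t) <= tolerance for p, t in zip(predictions, truths)]
    return sum(within) / len(within)


# Made-up velocity estimates in m/s.
truths = [2.0, 5.0, 10.0]
predictions = [2.1, 7.0, 9.5]
print(score(predictions, truths))
```

A verbally plausible answer such as 7.0 m/s for a true 5.0 m/s fails this check, which is the kind of gap between plausibility and correctness the abstract describes.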
Summary
QuantiPhy is a benchmark for the quantitative physical reasoning of vision-language models (VLMs). It contains more than 3.3K video-text instances with numerical ground truth and asks a model to estimate an object's size, velocity, and acceleration at a given timestamp, given one of these properties as an input prior. Experiments show a consistent gap between qualitative plausibility and numerical correctness, and indicate that models lean on pre-trained world knowledge rather than on the provided visual and textual inputs when reasoning about kinematic properties.
VERDI: VLM-Embedded Reasoning for Autonomous Driving
Authors: Bowen Feng, Zhiting Mei, Baiang Li, Julian Ost, Filippo Ghilotti, Roger Girgis, Anirudha Majumdar, Felix Heide
First: 2025-05-21T18:24:36+00:00 · Latest: 2025-12-22T15:37:49+00:00
Abstract
While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We validate VERDI in both open-loop (NuScenes and Bench2Drive benchmarks) and closed-loop (HugSim Simulator) settings. We find that VERDI outperforms existing e2e methods that do not embed reasoning by up to 11% in $\ell_{2}$ distance and 11% in driving performance, while maintaining real-time inference speed.
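The memory figure in the parenthetical can be sanity-checked with simple arithmetic. The abstract does not state the numeric precision, so the sketch below assumes 16-bit weights (2 bytes per parameter); the function name is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: 16-bit (2-byte) weights for a 70B-parameter model.
fp16_gb = weight_memory_gb(70e9, 2)
print(fp16_gb)
```

Under that assumption the weights alone take 140 GB, so a total above 160G is plausible once activations and the key-value cache are counted, although the abstract does not break the figure down.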
Summary
VERDI is a training-time framework that distills the reasoning process and commonsense knowledge of Vision-Language Models (VLMs) into autonomous driving (AD) stacks to improve decision making under partial observability. It aligns intermediate outputs of the perception, prediction, and planning modules with text features of VLM-generated driving reasoning, so the modular AD stack internalizes structured reasoning without the inference-time cost of a large VLM. In open-loop (NuScenes and Bench2Drive) and closed-loop (HugSim) evaluation, VERDI outperforms end-to-end methods that do not embed reasoning by up to 11% in $\ell_{2}$ distance and 11% in driving performance, while maintaining real-time inference speed.
SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
Authors: Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
First: 2025-10-18T09:22:40+00:00 · Latest: 2025-12-22T15:14:59+00:00
Abstract
Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives (such as predicting image rotation or reconstructing masked patches) into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors (such as task difficulty, model scale, and semantic alignment with the target domain) that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
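Rotation prediction shows why a self-supervised task can act as a verifiable reward: the correct answer is known because the transformation was applied by the training code itself. The sketch below is a toy version under that reading. The function names, the four-angle choice, and the 0/1 reward are assumptions for illustration, not the paper's implementation.

```python
import random

ANGLES = [0, 90, 180, 270]


def rotate(image, angle):
    """Rotate a square image (list of rows) clockwise by a multiple of 90 degrees."""
    for _ in range(angle // 90):
        image = [list(row) for row in zip(*image[::-1])]
    return image


def make_task(image, rng):
    """Create one rotation-prediction task and its known label."""
    angle = rng.choice(ANGLES)
    return rotate(image, angle), angle


def reward(predicted_angle, true_angle):
    # The label comes from the transformation itself, so no human
    # annotation or learned judge is needed to check the answer.
    return 1.0 if predicted_angle == true_angle else 0.0


rng = random.Random(0)
image = [[1, 2], [3, 4]]
rotated, angle = make_task(image, rng)
print(rotated, angle, reward(angle, angle))
```

In an RL fine-tuning loop, the model's predicted angle would replace the first argument of `reward`.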
Summary
SSL4RL aims to make vision-language models (VLMs) use visual evidence rather than linguistic priors or textual shortcuts. It turns self-supervised learning (SSL) tasks into verifiable reward signals for reinforcement learning (RL) fine-tuning, giving dense, automatic rewards without human preference data or AI evaluators. The paper reports substantial gains on vision-centric and vision-language reasoning benchmarks, as well as significant gains when the framework is applied to graph learning. Ablations identify task difficulty, model scale, and semantic alignment with the target domain as key factors in the effectiveness of the SSL tasks.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
Authors: Runze Li, Yuwen Zhai, Bo Xu, LiWu Xu, Nian Shi, Wei Zhang, Ran Lin, Liang Wang
First: 2025-12-22T13:42:18+00:00 · Latest: 2025-12-22T13:42:18+00:00
Abstract
Contemporary GUI agents, while increasingly capable due to advances in Large Vision-Language Models (VLMs), often operate with a critical limitation: they treat each task in isolation, lacking a mechanism to systematically learn from past successes. This digital "amnesia" results in sub-optimal performance, repeated errors, and poor generalization to novel challenges. To bridge this gap, we introduce EchoTrail-GUI, a novel framework designed to mimic human-like experiential learning by equipping agents with a dynamic, accessible memory. Our framework operates in three distinct stages. First, during Experience Exploration, an agent autonomously interacts with GUI environments to build a curated database of successful task trajectories, validated by a reward model. Crucially, the entire knowledge base construction is thus fully automated, requiring no human supervision. Second, in the Memory Injection stage, upon receiving a new task, our system efficiently retrieves the most relevant past trajectories to serve as actionable "memories". Finally, during GUI Task Inference, these memories are injected as in-context guidance to inform the agent's reasoning and decision-making process. We demonstrate the efficacy of our approach on benchmarks including Android World and AndroidLab. The results show that EchoTrail-GUI significantly improves the task success rate and operational efficiency of baseline agents, validating the power of structured memory in creating more robust and intelligent GUI automation.
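The three stages described above (store validated trajectories, retrieve the most relevant ones, inject them as context) can be sketched with a small in-memory store. The class name, the reward threshold, and the word-overlap similarity are hypothetical stand-ins; the abstract does not say how trajectories are scored or retrieved.

```python
def tokens(text):
    """Lowercased word set used for a crude similarity measure."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity; a stand-in for whatever retriever is used."""
    return len(a & b) / len(a | b) if a | b else 0.0


class TrajectoryMemory:
    def __init__(self, min_reward=0.5):
        self.min_reward = min_reward  # hypothetical acceptance threshold
        self.entries = []

    def add(self, task, steps, reward):
        # Experience Exploration: keep only trajectories the reward
        # model scored as successful.
        if reward >= self.min_reward:
            self.entries.append((task, steps))

    def retrieve(self, task, k=1):
        # Memory Injection: rank stored tasks by similarity to the new task.
        query = tokens(task)
        ranked = sorted(
            self.entries,
            key=lambda entry: jaccard(query, tokens(entry[0])),
            reverse=True,
        )
        return ranked[:k]


memory = TrajectoryMemory()
memory.add("set an alarm for 7 am", ["open Clock", "tap Alarm", "set 07:00"], reward=0.9)
memory.add("send an email to Bob", ["open Mail", "tap Compose"], reward=0.2)  # rejected
memory.add("turn on wifi", ["open Settings", "tap Wi-Fi", "toggle on"], reward=0.8)

# GUI Task Inference: the retrieved steps become in-context guidance.
hits = memory.retrieve("set an alarm for 6 pm", k=1)
prompt = "Past example: " + " -> ".join(hits[0][1])
print(prompt)
```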
Summary
EchoTrail-GUI gives GUI agents a dynamic memory so that they can learn from past successes. It works in three stages: Experience Exploration, in which the agent autonomously collects successful task trajectories validated by a reward model; Memory Injection, in which the most relevant past trajectories are retrieved for a new task; and GUI Task Inference, in which those trajectories are injected as in-context guidance. On Android World and AndroidLab, it improves the task success rate and operational efficiency of baseline agents.
Xiaomi MiMo-VL-Miloco Technical Report
Authors: Jiaze Li, Jingyang Chen, Yuxun Qu, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu, Jianzhong Ju, Zhenbo Luo, Jian Luan
First: 2025-12-19T10:43:37+00:00 · Latest: 2025-12-22T13:27:24+00:00
Abstract
We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
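Group Relative Policy Optimization scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that normalization in its commonly described mean and standard deviation form. Using the population standard deviation and this epsilon value are choices made here, not details taken from the report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, with made-up 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.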
Summary
The work develops home-centric vision-language models for smart-home applications. MiMo-VL-Miloco-7B is built on the MiMo-VL-7B backbone with a two-stage pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization. It outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks, and it also improves text-only reasoning with only modest trade-offs on document-centric tasks.
SafeMed-R1: Adversarial Reinforcement Learning for Generalizable and Robust Medical Reasoning in Vision-Language Models
Authors: A. A. Gde Yogi Pramana, Jason Ray, Anthony Jaya, Michael Wijaya
First: 2025-12-22T12:07:33+00:00 · Latest: 2025-12-22T12:07:33+00:00
Abstract
Vision-Language Models (VLMs) show significant promise for Medical Visual Question Answering (VQA), yet their deployment in clinical settings is hindered by severe vulnerability to adversarial attacks. Standard adversarial training, while effective for simpler tasks, often degrades both generalization performance and the quality of generated clinical reasoning. We introduce SafeMed-R1, a hybrid defense framework that ensures robust performance while preserving high-quality, interpretable medical reasoning. SafeMed-R1 employs a two-stage approach: at training time, we integrate Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to explicitly robustify the reasoning process against worst-case perturbations; at inference time, we augment the model with Randomized Smoothing to provide certified $L_2$-norm robustness guarantees. We evaluate SafeMed-R1 on the OmniMedVQA benchmark across eight medical imaging modalities comprising over 88,000 samples. Our experiments reveal that standard fine-tuned VLMs, despite achieving 95% accuracy on clean inputs, collapse to approximately 25% under PGD attacks. In contrast, SafeMed-R1 maintains 84.45% accuracy under the same adversarial conditions, representing a 59 percentage point improvement in robustness. Furthermore, we demonstrate that models trained with explicit chain-of-thought reasoning exhibit superior adversarial robustness compared to instruction-only variants, suggesting a synergy between interpretability and security in medical AI systems.
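Randomized smoothing classifies many Gaussian-perturbed copies of an input and returns the majority label; the vote share then bounds an $L_2$ radius within which the prediction cannot change. The sketch below follows the standard form of that technique on a toy classifier. The classifier, the noise level, and the lower bound of 0.9 are illustrative assumptions, and a real certificate needs a statistical lower confidence bound on the vote share, which is omitted here.

```python
import random
from statistics import NormalDist


def smoothed_predict(classify, x, sigma, n, rng):
    """Majority vote of `classify` over n Gaussian-perturbed copies of x."""
    counts = {}
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = classify(noisy)
        counts[label] = counts.get(label, 0) + 1
    top = max(counts, key=counts.get)
    return top, counts[top] / n


def certified_radius(p_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_lower), where p_lower is a lower bound
    on the top-class probability. Returns 0.0 if p_lower <= 0.5 (abstain)."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def toy_classifier(v):
    # Stand-in for a model: the sign of the feature sum.
    return "pos" if sum(v) > 0 else "neg"


rng = random.Random(0)
label, fraction = smoothed_predict(toy_classifier, [1.0, 1.0], sigma=0.25, n=200, rng=rng)
# 0.9 is an assumed lower confidence bound, not derived from `fraction`.
radius = certified_radius(0.9, sigma=0.25)
print(label, fraction, radius)
```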
Summary
SafeMed-R1 is a hybrid defense framework that hardens Vision-Language Models (VLMs) for medical visual question answering against adversarial attacks. At training time it combines Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to robustify the reasoning process; at inference time it adds Randomized Smoothing to provide certified $L_2$-norm robustness guarantees. On OmniMedVQA, SafeMed-R1 maintains 84.45% accuracy under PGD attacks, compared with approximately 25% for standard fine-tuned VLMs. Models trained with explicit chain-of-thought reasoning are also more robust than instruction-only variants, which suggests a synergy between interpretability and security.
Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing
Authors: Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
First: 2025-12-22T11:46:42+00:00 · Latest: 2025-12-22T11:46:42+00:00
Abstract
Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.
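A "mask-only" objective suggests that the prompter is rewarded purely on how well the resulting mask matches the ground truth. Intersection over union (IoU) is a common way to score that overlap. The sketch below assumes IoU as the reward, which the abstract does not confirm, and uses made-up masks.

```python
def mask_iou(predicted, target):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    intersection = 0
    union = 0
    for predicted_row, target_row in zip(predicted, target):
        for p, t in zip(predicted_row, target_row):
            intersection += int(p == 1 and t == 1)
            union += int(p == 1 or t == 1)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0


# Hypothetical mask returned by the frozen segmenter for one prompt,
# and the corresponding ground-truth mask.
predicted = [[1, 1, 0],
             [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
reward = mask_iou(predicted, target)
print(reward)
```

Because the reward depends only on the final mask, the segmenter can stay frozen while the prompter alone is updated.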
中文标题/摘要
标题:语义与几何融合:一种解耦的LVLM-SAM框架用于遥感推理分割
大型视觉语言模型(LVLMs)在推进遥感(RS)分析方面具有巨大潜力,但现有的推理分割框架通过端到端的监督微调将语言推理和像素预测耦合在一起,导致几何定位薄弱且跨任务泛化能力有限。为了解决这一问题,我们开发了Think2Seg-RS,这是一种解耦框架,通过结构化的几何提示训练LVLM提示器控制冻结的分割任何模型(SAM)。通过掩码强化学习目标,LVLM学习将抽象的语义推理转化为空间定位的动作,实现了在EarthReason数据集上的最佳性能。值得注意的是,学习到的提示策略在多个引用分割基准上实现了零样本泛化,揭示了语义级和实例级定位之间的区别。我们还发现,在语义级监督下,紧凑的分割器优于较大的分割器,并且在异质航空背景中负提示无效。这些发现确立了语义级推理分割作为地理空间理解的新范式,为统一、可解释的LVLM驱动地球观测铺平了道路。我们的代码和模型可在https://github.com/Ricardo-XZ/Think2Seg-RS获取。
Summary / 总结
The research aims to improve the geometric grounding of large vision-language models (LVLMs) in remote sensing analysis by decoupling linguistic reasoning from pixel prediction. The study introduces Think2Seg-RS, a framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) through structured geometric prompts. A mask-only reinforcement learning objective teaches the LVLM to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. The learned prompting policy also generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. Two further findings are that compact segmenters outperform larger ones under semantic-level supervision and that negative prompts are ineffective in heterogeneous aerial backgrounds.
研究旨在通过将语言推理与像素预测解耦来提高大型视觉语言模型(LVLM)在遥感分析中的几何定位能力。提出的Think2Seg-RS框架通过结构化的几何提示训练LVLM提示器来控制冻结的SAM模型,实现了在EarthReason数据集上的最佳性能。学习到的提示策略在多个基准上表现出良好的泛化能力,突出了语义级推理分割在地理空间理解中的有效性。在语义级监督下,紧凑的分割器优于较大的分割器,而在异质航空背景中,负向提示的效果较差。
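A mask-only objective means the reward depends only on the quality of the final mask, not on intermediate text or prompt supervision. The sketch below uses intersection-over-union (IoU) as that reward; the choice of IoU and the names `mask_iou` and `mask_only_reward` are assumptions made for illustration, since the abstract does not specify the reward function.

```python
from typing import List

Mask = List[List[int]]  # binary mask as nested lists of 0/1


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def mask_only_reward(pred: Mask, target: Mask) -> float:
    """Hypothetical reward: score the segmenter's output mask and nothing else."""
    return mask_iou(pred, target)


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
reward = mask_only_reward(pred, target)  # 2 shared pixels out of 4 in the union
```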
VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
Authors: Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia
First: 2025-12-22T10:25:38+00:00 · Latest: 2025-12-22T10:25:38+00:00
Abstract
Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.
中文标题/摘要
标题:VisionDirector:视觉-语言引导的闭环精炼方法在生成图像合成中的应用
生成模型现在可以生成逼真的图像,但它们仍然难以处理专业设计师发布的长篇多目标提示。为了揭示这一差距并更好地评估模型在实际环境中的性能,我们引入了长目标基准(LGBench),这是一个包含2000个任务的套件(1000个T2I和1000个I2I),其平均指令包含18到22个紧密耦合的目标,涵盖全局布局、局部对象放置、字体设计和标志保真度。我们发现,即使是最先进的模型也仅能满足不到72%的目标,并且经常遗漏局部编辑,证实了当前管道的脆弱性。为了解决这一问题,我们提出了VisionDirector,这是一种无需训练的视觉-语言监督器,(i)从长指令中提取结构化目标,(ii)动态决定是一次生成还是分阶段编辑,(iii)进行微网格采样,并在每次编辑后进行语义验证和回滚,(iv)记录目标级奖励。我们进一步使用组相对策略优化对规划器进行微调,从而获得更短的编辑轨迹(3.1步对4.2步)和更强的对齐。VisionDirector在GenEval(提高7%)和ImgEdit(提高0.07绝对值)上达到了新的最先进的水平,同时在字体设计、多对象场景和姿态编辑方面产生了持续的定性改进。
Summary / 总结
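The research aims to improve generative models' ability to handle the long, multi-goal instructions that professional designers issue. It introduces LGBench, a 2,000-task benchmark on which even state-of-the-art models satisfy fewer than 72% of goals, and VisionDirector, a training-free vision-language supervisor that extracts structured goals from instructions, decides between one-shot generation and staged edits, runs micro-grid sampling with semantic verification and rollback after every edit, and logs goal-level rewards. Fine-tuning the planner with Group Relative Policy Optimization shortens edit trajectories (3.1 versus 4.2 steps) and strengthens alignment. VisionDirector achieves new state-of-the-art results on GenEval (+7% overall) and ImgEdit (+0.07 absolute) while improving typography, multi-object scenes, and pose editing.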
研究旨在解决生成模型在处理长多目标指令方面的局限性,这些指令在专业设计中很常见。提出了VisionDirector,一种基于视觉语言的指导系统,用于细化生成图像合成。它从指令中提取结构化目标,决定是一次生成还是分阶段编辑,并使用带有语义验证的微网格采样。VisionDirector在GenEval和ImgEdit上取得了新的最佳结果,并在字体排版、多对象场景和姿态编辑方面提高了对齐性和一致性。
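The verify-and-rollback step of the closed loop can be sketched with a toy state machine: each staged edit is kept only if a verifier says the goal score did not drop. Everything here is hypothetical (the set-of-strings state, the goal-counting `verify` function, the edit names); a real system would verify images with a vision-language model.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

State = FrozenSet[str]
Edit = Callable[[State], State]


def verify(state: State, goals: Set[str]) -> int:
    """Toy verifier: count how many goals the current state satisfies."""
    return len(state & goals)


def closed_loop_refine(
    goals: Set[str], edits: List[Tuple[str, Edit]]
) -> Tuple[State, List[Tuple[str, bool]]]:
    """Apply edits in order; roll back any edit that lowers the verified score."""
    state: State = frozenset()
    log: List[Tuple[str, bool]] = []
    for name, edit in edits:
        candidate = edit(state)
        accepted = verify(candidate, goals) >= verify(state, goals)
        if accepted:
            state = candidate  # keep the edit
        # otherwise the previous state is retained, i.e. a rollback
        log.append((name, accepted))
    return state, log


goals = {"layout", "logo", "typography"}
edits = [
    ("set_layout", lambda s: s | {"layout"}),
    ("add_logo", lambda s: s | {"logo"}),
    ("restyle", lambda s: (s - {"logo"}) | {"sticker"}),  # breaks a satisfied goal
    ("fix_text", lambda s: s | {"typography"}),
]
final_state, log = closed_loop_refine(goals, edits)
```

In this toy run the `restyle` edit is rolled back because it removes the already satisfied `logo` goal, and the other three edits are accepted.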
ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models
Authors: Mingxu Zhang, Dazhong Shen, Qi Zhang, Ying Sun
First: 2025-12-22T10:21:40+00:00 · Latest: 2025-12-22T10:21:40+00:00
Abstract
Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model's general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM's intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.
中文标题/摘要
标题:ChemATP:一种无需训练的化学推理框架用于大型语言模型
大型语言模型(LLMs)在一般推理方面表现出色,但在分子科学领域却因缺乏明确的化学先验知识而在标准字符串表示中遇到困难。当前的解决方案面临一个根本性的困境。基于训练的方法将先验知识注入参数中,但这种静态耦合阻碍了知识的快速更新,并且往往损害了模型的一般推理能力。相反,现有的无需训练的方法避免了这些问题,但依赖于表面级提示,无法提供化学推理所需的精细原子级先验知识。为了解决这一问题,我们引入了ChemATP,这是一种将化学知识与推理引擎解耦的框架。通过构建首个原子级文本知识库,ChemATP使冻结的LLM能够动态地检索和推理这些信息。该架构确保了可解释性和适应性,同时保留了LLM固有的通用智能。实验表明,ChemATP显著优于现有的无需训练基线,并且能够与最先进的基于训练的模型相媲美,证明了显式先验注入是隐式参数更新的有竞争力的替代方案。
Summary / 总结
The research aims to enhance large language models' performance in molecular science by addressing the limitations of both training-based and training-free methods. ChemATP introduces a framework that decouples chemical knowledge from the reasoning engine: it builds an atom-level textual knowledge base that frozen LLMs can retrieve from and reason over dynamically. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, indicating that explicit prior injection is a competitive alternative to implicit parameter updates.
ChemATP 是一种训练-free 框架,旨在通过将化学知识与推理过程解耦来增强大型语言模型(LLMs)的化学推理能力。它构建了一个原子级的文本知识库,使冻结的 LLM 能够动态检索和推理这些信息。实验结果表明,ChemATP 在性能上优于现有训练-free 方法,并且能够与最先进的训练-based 模型相媲美,突出了显式先验注入的有效性。
Towards Minimal Fine-Tuning of VLMs
Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
First: 2025-12-22T10:02:10+00:00 · Latest: 2025-12-22T10:02:10+00:00
Abstract
We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
中文标题/摘要
标题:朝向VLMs的极简微调
我们引入了Image-LoRA,这是一种轻量级参数高效微调(PEFT)方法,用于基于变换器的视觉语言模型(VLMs)。Image-LoRA 仅对视觉标记跨度内的注意力层的价值路径应用低秩适应,减少仅适配器训练的FLOPs大致与视觉标记比例成正比。我们进一步只适配了一部分注意力头,这些头是使用秩为1的Image-LoRA估计的头影响得分所选择的,并通过层更新大小归一化来稳定层更新。在跨越文本密集到图像密集领域的屏幕中心定位和引用基准测试中,Image-LoRA 在使用更少可训练参数和更低仅适配器训练FLOPs的情况下,达到了或接近标准LoRA的准确性。该方法还保持了VLMs在微调前后纯文本推理性能,如在GSM8K上进一步所示。
Summary / 总结
The research aims to reduce the cost of fine-tuning vision-language models (VLMs) by introducing Image-LoRA, a lightweight parameter-efficient fine-tuning (PEFT) recipe. Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, which cuts adapter-only training FLOPs roughly in proportion to the visual-token fraction. It adapts only a subset of attention heads, chosen by head influence scores estimated with a rank-1 Image-LoRA, and stabilizes per-layer updates through selection-size normalization. Across screen-centric grounding and referring benchmarks, Image-LoRA matches or closely approaches standard LoRA accuracy with fewer trainable parameters and lower adapter-only training FLOPs. It also preserves the VLM's pure-text reasoning performance after fine-tuning, as shown on GSM8K.
研究旨在通过引入Image-LoRA,一种轻量级参数高效微调方法,来最小化视觉语言模型(VLMs)的微调需求。该方法仅对视觉标记段内的注意力层的价值路径应用低秩适应,从而减少训练中的浮点运算次数。此外,该方法还选择性地适应了一部分注意力头,并对每层更新进行了归一化处理。在各种基准测试中,Image-LoRA 达到了与标准 LoRA 相当或接近的准确率,同时使用更少的可训练参数和更低的训练 FLOPs。此外,它还保持了 VLMs 在微调前后纯文本推理性能。
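The core idea of restricting the low-rank update to the value path of visual tokens can be sketched in plain Python. The function and parameter names (`value_projection`, `lora_a`, `lora_b`, `visual_span`, `scale`) are hypothetical, and the sketch leaves out head selection and selection-size normalization; it only shows that tokens outside the visual span never touch the adapter.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def value_projection(
    tokens: List[Vector],
    w_v: Matrix,
    lora_a: Matrix,  # shape (rank, d): down-projection
    lora_b: Matrix,  # shape (d, rank): up-projection
    visual_span: Tuple[int, int],  # [start, end) indices of visual tokens
    scale: float = 1.0,
) -> List[Vector]:
    """Frozen value projection plus a low-rank update on visual tokens only."""
    start, end = visual_span
    out: List[Vector] = []
    for i, x in enumerate(tokens):
        v = matvec(w_v, x)  # frozen base projection, applied to every token
        if start <= i < end:
            delta = matvec(lora_b, matvec(lora_a, x))  # B(Ax), visual tokens only
            v = [vi + scale * di for vi, di in zip(v, delta)]
        out.append(v)
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # token 0 is text, tokens 1-2 are visual
w_v = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]  # rank 1
lora_b = [[0.5], [0.0]]
values = value_projection(tokens, w_v, lora_a, lora_b, visual_span=(1, 3))
```

Because the adapter runs on 2 of the 3 tokens here, its cost scales with the visual-token fraction (2/3 in this toy input), which is the proportionality the abstract describes.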
Vision-Language-Policy Model for Dynamic Robot Task Planning
Authors: Jin Wang, Kim Tien Ly, Jacques Cloete, Nikos Tsagarakis, Ioannis Havoutis
First: 2025-12-22T09:12:48+00:00 · Latest: 2025-12-22T09:12:48+00:00
Comments: Manuscript under review
Abstract
Bridging the gap between natural language commands and autonomous execution in unstructured environments remains an open challenge for robotics. This requires robots to perceive and reason over the current task scene through multiple modalities, and to plan their behaviors to achieve their intended goals. Traditional robotic task-planning approaches often struggle to bridge low-level execution with high-level task reasoning, and cannot dynamically update task strategies when instructions change during execution, which ultimately limits their versatility and adaptability to new tasks. In this work, we propose a novel language model-based framework for dynamic robot task planning. Our Vision-Language-Policy (VLP) model, based on a vision-language model fine-tuned on real-world data, can interpret semantic instructions and integrate reasoning over the current task scene to generate behavior policies that control the robot to accomplish the task. Moreover, it can dynamically adjust the task strategy in response to changes in the task, enabling flexible adaptation to evolving task requirements. Experiments conducted with different robots and a variety of real-world tasks show that the trained model can efficiently adapt to novel scenarios and dynamically update its policy, demonstrating strong planning autonomy and cross-embodiment generalization. Videos: https://robovlp.github.io/
中文标题/摘要
标题:面向动态机器人任务规划的视觉-语言-政策模型
在非结构化环境中,将自然语言指令与自主执行之间的差距仍然是机器人技术中的一个开放挑战。这要求机器人通过多种模态感知和推理当前任务场景,并计划其行为以实现其预期目标。传统的机器人任务规划方法往往难以将低级执行与高级任务推理相结合,并且在执行过程中指令发生变化时无法动态更新任务策略,这最终限制了它们对新任务的多样性和适应性。在本文中,我们提出了一种基于语言模型的动态机器人任务规划新框架。我们的视觉-语言-政策(VLP)模型基于在真实数据上微调的视觉-语言模型,可以解释语义指令并结合当前任务场景的推理来生成控制机器人完成任务的行为策略。此外,它可以根据任务变化动态调整任务策略,实现对不断变化的任务要求的灵活适应。在不同机器人和各种真实世界任务上的实验表明,训练后的模型可以高效地适应新场景并动态更新其策略,展示了强大的规划自主性和跨体征的一般性。视频:https://robovlp.github.io/
Summary / 总结
This paper addresses the challenge of bridging natural language commands and autonomous execution in unstructured environments. It introduces a Vision-Language-Policy (VLP) model, built on a vision-language model fine-tuned on real-world data, that interprets semantic instructions and reasons over the current task scene. The model generates behavior policies to control the robot and can dynamically adjust its strategy when the task changes during execution. Experiments with different robots and a variety of real-world tasks show that the model adapts to novel scenarios and updates its policy dynamically, demonstrating strong planning autonomy and cross-embodiment generalization.
本文解决了自然语言指令与在非结构化环境中自主执行之间的鸿沟。提出了一种Vision-Language-Policy (VLP) 模型,结合了在真实数据上微调的视觉语言模型,以解释指令并推理解当前的任务场景。该模型生成行为策略来控制机器人,并且在任务执行过程中可以动态调整其策略。实验表明,该模型能够适应新场景并更新其策略,展示了强大的规划自主性和跨体征的一般化能力。
Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation
Authors: Khanh Nguyen, Dasith de Silva Edirimuni, Ghulam Mubashar Hassan, Ajmal Mian
Venue: AAAI 2026
First: 2025-12-22T06:57:42+00:00 · Latest: 2025-12-22T06:57:42+00:00
Comments: Accepted to AAAI 2026 Workshop on New Frontiers in Information Retrieval
Abstract
Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at https://github.com/ndkhanh360/BoxOVIS.
中文标题/摘要
标题:基于盒子引导的开放词汇实例分割从3D场景中检索对象
从场景级点云中定位和检索对象是一个具有广泛机器人技术和增强现实应用挑战性的问题。该任务通常被形式化为开放词汇3D实例分割。尽管最近的方法表现出色,但它们严重依赖SAM和CLIP从伴随点云的图像中生成和分类3D实例掩码,导致巨大的计算开销和缓慢的处理速度,限制了它们在实际环境中的部署。Open-YOLO 3D通过使用实时2D检测器来分类由预训练3D分割器直接从点云生成的类无感知掩码,从而缓解了这一问题,消除了对SAM和CLIP的需求,并显著减少了推理时间。然而,Open-YOLO 3D往往无法泛化到在3D训练数据中出现频率较低的对象类别。在本文中,我们提出了一种方法,通过2D开放词汇检测器从RGB图像中引导生成新对象的3D实例掩码。我们的方法继承了2D检测器识别新对象的能力,同时保持了高效的分类,使从开放文本查询中快速准确地检索稀有实例成为可能。我们的代码将在https://github.com/ndkhanh360/BoxOVIS上公开。
Summary / 总结
This paper addresses the challenge of locating and retrieving objects from scene-level point clouds, formulated as open-vocabulary 3D instance segmentation. Prior methods either depend heavily on SAM and CLIP, which makes them slow, or, like Open-YOLO 3D, avoid both but often fail to generalize to object categories that appear infrequently in the 3D training data. The proposed method generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. It inherits the 2D detector's ability to recognize novel objects while keeping classification efficient, enabling fast and accurate retrieval of rare instances from open-ended text queries.
该论文通过提出一种使用2D开放词汇检测器指导RGB图像生成3D实例掩码的方法,解决了从3D场景中检索物体的挑战。这种方法减少了对SAM和CLIP的依赖,从而降低了计算开销和处理时间。主要发现是,该方法能够准确且高效地从开放文本查询中检索稀有实例,提高了对不常见物体类别的泛化能力。
Population-Evolve: a Parallel Sampling and Evolutionary Method for LLM Math Reasoning
Authors: Yanzhi Zhang, Yitong Duan, Zhaoxi Zhang, Jiyan He, Shuxin Zheng
First: 2025-12-22T06:42:46+00:00 · Latest: 2025-12-22T06:42:46+00:00
Abstract
Test-time scaling has emerged as a promising direction for enhancing the reasoning capabilities of Large Language Models in the last few years. In this work, we propose Population-Evolve, a training-free method inspired by Genetic Algorithms to optimize LLM reasoning. Our approach maintains a dynamic population of candidate solutions for each problem via parallel reasoning. By incorporating an evolve prompt, the LLM self-evolves its population in all iterations. Upon convergence, the final answer is derived via majority voting. Furthermore, we establish a unification framework that interprets existing test-time scaling strategies through the lens of genetic algorithms. Empirical results demonstrate that Population-Evolve achieves superior accuracy with low performance variance and computational efficiency. Our findings highlight the potential of evolutionary strategies to unlock the reasoning power of LLMs during inference.
中文标题/摘要
标题:Population-Evolve:一种用于LLM数学推理的并行采样和进化方法
近年来,测试时缩放已成为增强大型语言模型推理能力的一个有前途的方向。在本文中,我们提出了一种名为Population-Evolve的无需训练的方法,该方法受到遗传算法的启发,用于优化LLM推理。我们的方法通过并行推理为每个问题维护一个动态候选解群体。通过引入进化提示,LLM在所有迭代中自我进化其群体。在收敛后,最终答案通过多数投票得出。此外,我们建立了一个统一框架,通过遗传算法的视角解释现有的测试时缩放策略。实验证明,Population-Evolve在保持低性能波动和计算效率的同时实现了更高的准确性。我们的研究结果突显了进化策略在推理过程中解锁LLM推理能力的潜力。
Summary / 总结
Population-Evolve is a training-free method inspired by Genetic Algorithms designed to enhance the reasoning capabilities of Large Language Models (LLMs). It maintains a dynamic population of candidate solutions for each problem through parallel reasoning and self-evolves the population via an evolve prompt. Upon convergence, the final answer is determined by majority voting. Experiments show that Population-Evolve achieves high accuracy with low performance variance and computational efficiency, indicating the potential of evolutionary strategies for LLM reasoning during inference.
Population-Evolve 是一种受遗传算法启发的训练-free 方法,旨在增强大型语言模型(LLM)的推理能力。它通过并行推理维护每个问题的动态候选解群体,并通过进化提示自我进化该群体。收敛后,最终答案通过多数投票确定。实验表明,Population-Evolve 能够实现高精度、低性能波动和高计算效率,表明进化策略在推理期间解锁 LLM 的推理能力的潜力。
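The population loop described above (parallel candidates, an evolve step, majority voting on convergence) can be sketched as follows. The `toy_evolve` function is a hypothetical, deterministic stand-in for an LLM call with an evolve prompt, and treating "the population stopped changing" as convergence is an assumption made for this sketch.

```python
from collections import Counter
from typing import Callable, List


def population_evolve(
    initial: List[int],
    evolve: Callable[[int, List[int]], int],
    max_iters: int = 5,
) -> int:
    """Evolve every candidate against the whole population, then majority-vote."""
    population = list(initial)
    for _ in range(max_iters):
        new_population = [evolve(c, population) for c in population]
        if new_population == population:  # assumed convergence criterion
            break
        population = new_population
    return Counter(population).most_common(1)[0][0]


def toy_evolve(candidate: int, population: List[int]) -> int:
    """Stand-in for the LLM: keep answers that pass a self-check (x * x == 49),
    otherwise adopt the most common answer in the current population."""
    if candidate * candidate == 49:
        return candidate
    return Counter(population).most_common(1)[0][0]


answer = population_evolve([7, 6, 7, 8, 5], toy_evolve)
```

In a real setting each `evolve` call would be an LLM invocation that sees the other candidates' reasoning, so the calls within one iteration can run in parallel.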
Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding
Authors: Ruiqi Ma, Yu Yan, Chunhong Zhang, Minghao Yin, XinChao Liu, Zhihong Jin, Zheng Hu
First: 2025-12-22T06:20:53+00:00 · Latest: 2025-12-22T06:20:53+00:00
Abstract
Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD), a method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)
中文标题/摘要
标题:密切关注:通过解耦解码减轻大型视觉-语言模型中的对象幻觉
大型视觉-语言模型(LVLMs)在视觉和语言模态之间架起桥梁,展现出在多种领域的强大潜力。然而,尽管取得了显著进展,LVLMs 在物体识别任务中仍然遭受严重的幻觉问题。这些模型往往无法准确识别某些物体,导致生成的文本看似流畅但与视觉内容不符,这在实际应用中可能会产生严重后果。最近,提出了几种缓解 LVLM 幻觉的方法,但大多数方法仅专注于减少语言模态中的幻觉。为了同时减轻语言和视觉模态中的幻觉,我们引入了一种无需训练的幻觉解耦解码(HDD)方法。HDD 通过分割图像并选择增强原始图像的图像来增强原始图像,同时利用空白图像消除原始图像和分割图像中的语言先验幻觉。该设计不仅减少了模型对语言先验的依赖,还提升了其视觉性能。(代码: https://github.com/rickeyhhh/Hallucination-Disentangled-Decoding)
Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models
Authors: Tongyuan Miao, Gary Huang, Kai Jun Han, Annie Jiang
First: 2025-12-22T03:45:04+00:00 · Latest: 2025-12-22T03:45:04+00:00
Abstract
Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
中文标题/摘要
标题:基于上下文的初始化以减少扩散语言模型生成路径长度
扩散大型语言模型(DLLMs)能够实现完全并行的标记解码,但由于需要许多去噪迭代将信息空白的完全掩码初始化细化为连贯的文本,因此在推理时往往不切实际。现有的大多数加速方法侧重于通过改进求解器或采样策略更有效地遍历这种生成轨迹。我们提出了一种互补的观点:通过基于上下文的初始化从目标分布更近的地方开始,从而缩短轨迹本身。我们提出了一种无需训练的接口,将轻量级辅助模型的条件提示注入到扩散初始化中,并通过离散标记注入和表示级嵌入插值机制进行实例化。由于注入的先验可能不完美,且仅解掩码可能会过早地做出承诺,我们还引入了一种基于信心的重新掩码机制作为先验怀疑的形式。初步证据表明,基于上下文的初始化可以显著减少去噪迭代(在我们的设置中约减少35%的功能评估),同时揭示了一个关键的开放挑战:简单的预热启动可能会相对于强大的扩散基线降低最终的准确性。我们利用这些发现来推动围绕校准、修订机制和表示对齐的研究议程,以实现可靠的预热启动扩散解码。
Summary / 总结
The paper aims to reduce the inference cost of diffusion language models by shortening the generative trajectory through context-aware initialization. It proposes a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, using discrete token injection and representation-level embedding interpolation, plus a confidence-based remasking mechanism to guard against imperfect priors. Preliminary evidence on GSM8K suggests about 35% fewer function evaluations in the authors' setting, but also shows that naive warm-starting can degrade final accuracy relative to strong diffusion baselines.
论文针对扩散语言模型(DLLMs)在推理时因需要多次去噪迭代而效率低下问题,提出了一种上下文感知初始化方法,通过从轻量级辅助模型注入基于提示的先验知识,缩短生成路径长度,减少了约35%的功能评估次数。然而,研究也指出,简单的预热启动有时会降低最终准确性,这表明需要进一步研究校准和修订机制。
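Discrete token injection and confidence-based remasking can be sketched on a toy token sequence. The `[MASK]` placeholder, the function names, the fixed `threshold`, and the example confidences are all hypothetical; in practice the draft would come from an auxiliary model and the confidences from the diffusion model itself.

```python
from typing import List, Sequence

MASK = "[MASK]"


def inject_draft(draft_tokens: Sequence[str], length: int) -> List[str]:
    """Start from a fully masked sequence and inject the auxiliary model's draft."""
    tokens = [MASK] * length
    for i, tok in enumerate(draft_tokens[:length]):
        tokens[i] = tok
    return tokens


def remask_low_confidence(
    tokens: Sequence[str], confidences: Sequence[float], threshold: float = 0.5
) -> List[str]:
    """Re-mask positions whose confidence is below the threshold, so that a
    wrong injected token can still be revised by later denoising steps."""
    return [
        tok if conf >= threshold else MASK
        for tok, conf in zip(tokens, confidences)
    ]


init = inject_draft(["The", "answer", "is", "41"], length=6)
# Hypothetical per-position confidences assigned by the diffusion model.
confidences = [0.9, 0.8, 0.95, 0.3, 0.0, 0.0]
revised = remask_low_confidence(init, confidences)
```

Here the doubtful draft token `41` is masked again, while the rest of the draft is kept, so fewer positions remain to be denoised than with a fully masked start.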
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
Authors: Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li
Venue: ECCV 2024
First: 2024-02-09T01:00:14+00:00 · Latest: 2025-12-22T03:14:10+00:00
Comments: Accepted by ECCV 2024
Abstract
By combining natural language understanding, generation capabilities, and breadth of knowledge of large language models with image perception, recent large vision language models (LVLMs) have shown unprecedented visual reasoning capabilities. However, the generated text often suffers from inaccurate grounding in the visual input, resulting in errors such as hallucination of nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes of and relationships between objects. To address these issues, we introduce a novel framework, ViGoR (Visual Grounding Through Fine-Grained Reward Modeling) that utilizes fine-grained reward modeling to significantly enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is efficiently achieved using much cheaper human evaluations instead of full supervisions, as well as automated methods. We show the effectiveness of our approach through a variety of evaluation methods and benchmarks. Additionally, we released our human annotation (https://github.com/amazon-science/vigor) comprising 15,440 images and generated text pairs with fine-grained evaluations to contribute to related research in the community.
中文标题/摘要
标题:ViGoR:通过精细粒度的奖励建模提高大型视觉语言模型的视觉定位能力
通过将大型语言模型的自然语言理解、生成能力和广泛知识与图像感知相结合,最近的大型视觉语言模型(LVLMs)展示了前所未有的视觉推理能力。然而,生成的文本往往在视觉输入中缺乏准确的定位,导致诸如虚构不存在的场景元素、遗漏场景的重要部分以及错误推断物体属性和关系等错误。为了解决这些问题,我们提出了一种新的框架,ViGoR(通过精细粒度的奖励建模进行视觉定位),利用精细粒度的奖励建模显著提高了LVLMs的视觉定位能力,超越了预训练基线。这种改进通过使用更便宜的人类评估而不是全面监督以及自动化方法来高效实现。我们通过多种评估方法和基准展示了我们方法的有效性。此外,我们还发布了包含15,440张图像及其生成文本对的精细粒度评估的人类注释(https://github.com/amazon-science/vigor),以促进相关研究领域的贡献。
Summary / 总结
The paper introduces ViGoR, a framework that uses fine-grained reward modeling to improve the visual grounding of large vision language models (LVLMs). This approach enhances the accuracy of text generation in relation to visual inputs, reducing issues like hallucination and incorrect attribute inference. The improvement is achieved through cheaper human evaluations and automated methods, demonstrating effectiveness across various benchmarks.
研究旨在通过解决生成文本中的视觉定位不准确问题来提升大型视觉语言模型的性能。方法是引入ViGoR框架,利用细粒度奖励建模来增强视觉定位。该方法结合了更便宜的人类评估和自动化方法,显著优于预训练基线。实验结果表明,ViGoR在多种评估方法和基准测试中均表现出有效性。
ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation
Authors: Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin, Sangpil Kim
First: 2025-12-22T03:13:08+00:00 · Latest: 2025-12-22T03:13:08+00:00
Abstract
Dominant paradigms for 4D LiDAR panoptic segmentation usually require training deep neural networks on large superimposed point clouds or designing dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types (static, dynamic, and missing), offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.
中文标题/摘要
标题:ICP-4D:连接迭代最近点与LiDAR全景分割
4D LiDAR全景分割的主要范式通常需要训练深度神经网络以处理大量叠加的点云,或者设计专门的模块进行实例关联。然而,这些方法会进行冗余的点处理,从而变得计算成本高昂,但仍然忽视了原始点云中固有的丰富几何先验。为此,我们提出了ICP-4D,这是一种简单而有效的无需训练框架,通过实例级点集之间的几何关系统一空间和时间推理。具体而言,我们应用迭代最近点(ICP)算法直接通过估计的变换对时间上一致的实例进行关联。为了在噪声实例预测下稳定关联,我们引入了基于Sinkhorn的软匹配。这利用了潜在的实例分布来获得准确的点对对应关系,从而实现稳健的几何对齐。此外,我们精心设计的管道考虑了三种实例类型——静态、动态和缺失——提供了计算效率和遮挡感知的匹配。我们在SemanticKITTI和panoptic nuScenes上的广泛实验表明,即使没有额外的训练或额外的点云输入,我们的方法也始终优于最先进的方法。
Summary / 总结
ICP-4D is a training-free framework that integrates spatial and temporal reasoning for 4D LiDAR panoptic segmentation. It uses the Iterative Closest Point (ICP) algorithm to align point sets and a Sinkhorn-based soft matching to handle noisy instance predictions. The method consistently outperforms state-of-the-art approaches on SemanticKITTI and panoptic nuScenes without additional training or extra point cloud inputs.
ICP-4D 是一个无需训练的框架,通过实例级点集之间的几何关系来整合空间和时间推理,用于 4D LiDAR 全景分割。它使用 ICP 算法对齐源和目标点集,并引入基于 Sinkhorn 的软匹配以在噪声预测下增强鲁棒性。在 SemanticKITTI 和 panoptic nuScenes 上的实验表明,ICP-4D 在无需额外训练或额外点云输入的情况下优于现有方法。
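Sinkhorn-based soft matching can be sketched in a few lines: turn a cost matrix into a kernel, then alternately rescale rows and columns until the matrix is approximately doubly stochastic. The squared-distance cost between 2D points, the uniform unit marginals, and the `epsilon` value are assumptions for illustration; the abstract does not state the exact cost or marginals used in ICP-4D.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance_cost(source: List[Point], target: List[Point]) -> List[List[float]]:
    """Pairwise squared Euclidean distances (assumed matching cost)."""
    return [[(sx - tx) ** 2 + (sy - ty) ** 2 for tx, ty in target] for sx, sy in source]


def sinkhorn(cost: List[List[float]], epsilon: float = 0.1, n_iters: int = 50) -> List[List[float]]:
    """Entropy-regularized soft assignment via alternating row/column scaling."""
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    n, m = len(kernel), len(kernel[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [1.0 / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


source = [(0.0, 0.0), (1.0, 0.0)]
target = [(1.1, 0.0), (0.1, 0.0)]  # same two points, slightly shifted and reordered
plan = sinkhorn(squared_distance_cost(source, target))
matches = [max(range(len(row)), key=row.__getitem__) for row in plan]
```

The soft plan assigns almost all mass of each source point to its shifted counterpart, so taking the row-wise argmax recovers the swapped ordering.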
Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation
Authors: Ryosuke Korekata, Quanting Xie, Yonatan Bisk, Komei Sugiura
Venue: ICRA 2026
First: 2025-12-22T02:55:25+00:00 · Latest: 2025-12-22T02:55:25+00:00
Comments: Accepted to IEEE RA-L, with presentation at ICRA 2026
Abstract
In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
中文标题/摘要
标题:Affordance RAG:基于操作感知体记忆的分层多模态检索
在本研究中,我们解决了开放词汇的移动操作问题,即机器人需要根据自由形式的自然语言指令携带各种物体到容器中。这一任务具有挑战性,因为它涉及理解视觉语义和操作动作的适用性。为了解决这些挑战,我们提出了Affordance RAG,这是一种零样本分层多模态检索框架,能够从预先探索的图像中构建操作感知体记忆。该模型基于区域和视觉语义检索候选目标,并使用适用性评分重新排序,使机器人能够识别在实际环境中可能可执行的操作选项。我们的方法在大型室内环境中的移动操作指令检索性能上优于现有方法。此外,在基于自由形式指令在室内环境中进行移动操作的实地实验中,所提出的方法实现了85%的任务成功率,优于现有方法在检索性能和整体任务成功率方面的表现。
Summary / 总结
The study addresses open-vocabulary mobile manipulation, where a robot must carry a wide range of objects to receptacles based on free-form natural language instructions. The researchers propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that builds Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, so the robot can identify manipulation options that are likely to be executable. The method outperformed existing approaches in retrieval performance in large-scale indoor environments, and in real-world experiments it achieved a task success rate of 85%, again outperforming existing methods.
研究旨在解决基于自然语言指令的开放词汇移动操作问题,即机器人需要根据口头指令执行任务。为此,研究人员提出了一种零样本层次多模态检索框架Affordance RAG。该框架通过预探索图像构建感知操作的体记忆,根据区域和视觉语义以及操作性评分检索和重新排序候选目标。该方法在检索性能上超过了现有方法,并在真实世界实验中实现了85%的任务成功率,优于其他方法在检索和整体任务成功率方面。
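The retrieve-then-rerank pattern can be sketched with precomputed scores: shortlist candidates by semantic similarity, then reorder the shortlist using an affordance score. The `Candidate` fields, the example scores, and the weighted-sum combination are hypothetical; the abstract says only that candidates are reranked with affordance scores, not how the scores are combined.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    semantic_score: float    # similarity between instruction and image region
    affordance_score: float  # how likely the manipulation is executable


def retrieve_and_rerank(
    candidates: List[Candidate], top_k: int = 3, weight: float = 0.5
) -> List[Candidate]:
    """Stage 1: keep the top_k semantic matches. Stage 2: rerank by a weighted
    sum of semantic and affordance scores (assumed combination rule)."""
    shortlist = sorted(candidates, key=lambda c: c.semantic_score, reverse=True)[:top_k]
    return sorted(
        shortlist,
        key=lambda c: (1.0 - weight) * c.semantic_score + weight * c.affordance_score,
        reverse=True,
    )


memory = [
    Candidate("mug_on_high_shelf", 0.90, 0.20),
    Candidate("mug_on_table", 0.85, 0.90),
    Candidate("bowl", 0.60, 0.80),
    Candidate("plant", 0.30, 0.95),
]
ranked = retrieve_and_rerank(memory)
ranked_names = [c.name for c in ranked]
```

The best semantic match (`mug_on_high_shelf`) drops to last place after reranking because it is hard to reach, which is the kind of correction affordance scores are meant to provide.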
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Authors: Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald
First: 2025-12-19T17:22:35+00:00 · Latest: 2025-12-22T02:33:28+00:00
Abstract
While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians' centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
中文标题/摘要
标题:合唱:全方位3D高斯场景编码的多教师预训练
虽然3DGS已成为一种高保真场景表示,但直接从其原语中编码丰富的通用特征仍然未被充分探索。我们通过引入合唱,一种多教师预训练框架,通过从2D基础模型中提取互补信号来学习一个全方位的3D高斯点绘制(3DGS)场景编码器。合唱使用共享的3D编码器和教师特定的投影器,从语言对齐、通用和对象感知的教师中学习,鼓励一个共享的嵌入空间,捕捉从高层语义到精细结构的信号。我们评估了合唱在一系列任务上的表现:开放词汇语义和实例分割、线性探针和解码器探针,以及数据高效监督。除了3DGS,我们还测试了合唱在仅支持点云的几个基准上的表现,通过预训练一个仅使用高斯中心、颜色、估计法线作为输入的变体。有趣的是,这个编码器表现出强大的迁移性能,并在使用39.9倍少的训练场景时优于点云基线。最后,我们提出了一种渲染和提取适应方法,以促进域外微调。我们的代码和模型将在发表后发布。
Summary / 总结
Chorus is a multi-teacher pretraining framework for encoding rich, general-purpose features directly from 3D Gaussian Splatting (3DGS) primitives, an area that remains under-explored. It uses a shared 3D encoder and teacher-specific projectors to distill complementary signals from language-aligned, generalist, and object-aware 2D foundation models, capturing signals from high-level semantics to fine-grained structure. Chorus is evaluated on open-vocabulary semantic and instance segmentation, linear and decoder probing, and data-efficient supervision. A variant pretrained using only Gaussians' centers, colors, and estimated normals transfers well to point-cloud benchmarks, outperforming the point-cloud baseline while using 39.9 times fewer training scenes.
Chorus 是一个多教师预训练框架,旨在直接从 3D 高斯泼溅(3DGS)基元中学习丰富的通用特征。它使用共享的 3D 编码器和特定于教师的投影器,从不同类型的教师(包括语言对齐、通用和对象感知模型)中蒸馏互补信号。Chorus 在开放词汇语义和实例分割、线性探针和解码器探针以及数据高效监督等多种任务上进行了评估。仅使用高斯中心、颜色和估计法线的变体在点云基准上表现出强大的迁移能力,在训练场景数量少 39.9 倍的情况下仍优于点云基线。
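The shared-encoder, per-teacher-projector idea in the Chorus entry can be illustrated with a small sketch. This is not the authors' implementation: the linear projectors, the cosine-distance loss, the teacher names, and all numbers below are assumptions chosen only to show how one shared feature can be compared against several teachers through separate projection heads.

```python
import math

def project(weights, feature):
    """Apply one teacher-specific linear projector (rows of `weights`) to a feature."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def multi_teacher_loss(shared_feature, projectors, teacher_features):
    """Sum the per-teacher distances (assumed loss; the abstract does not specify one)."""
    total = 0.0
    for name, weights in projectors.items():
        student = project(weights, shared_feature)
        total += cosine_distance(student, teacher_features[name])
    return total

# Hypothetical toy data: one shared 2-D feature and two teachers.
shared = [1.0, 0.0]
projectors = {
    "language_aligned": [[1.0, 0.0], [0.0, 1.0]],  # identity projector
    "object_aware": [[0.0, 1.0], [1.0, 0.0]],      # swaps the two dimensions
}
teachers = {
    "language_aligned": [2.0, 0.0],  # same direction as the projected feature
    "object_aware": [3.0, 0.0],      # orthogonal to the projected feature
}
loss = multi_teacher_loss(shared, projectors, teachers)
```

In this toy setup the first teacher is matched exactly and the second is orthogonal, so only the second head contributes to the loss.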
DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation
Authors: Guandong Li, Yijun Ding
First: 2025-12-22T02:25:05+00:00 · Latest: 2025-12-22T02:25:05+00:00
Abstract
Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to "Semantic-Visual Dissonance," where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural "sticker-like" effect. We propose **DVI (Disentangled Visual-Identity)**, a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a **Parameter-Free Feature Modulation** mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's "visual soul" without training. Furthermore, a **Dynamic Temporal Granularity Scheduler** aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.
中文标题/摘要
标题:DVI:分离语义和视觉身份以实现无需训练的个性化生成
近期无需调优的身份定制方法在面部保真度方面取得了高水准,但往往忽视了视觉上下文,如照明、皮肤纹理和环境色调。这一限制导致了“语义-视觉不和谐”的问题,即准确的面部几何结构与输入的独特氛围产生冲突,造成一种不自然的“贴纸效果”。我们提出**DVI(分离视觉-身份)**,这是一种零样本框架,能够正交地将身份细分为细粒度语义流和粗粒度视觉流。与仅依赖语义向量的方法不同,DVI 利用了VAE潜在空间的固有统计特性,使用均值和方差作为轻量级的全局视觉氛围描述符。我们引入了一种**无参数特征调制**机制,能够自适应地用这些视觉统计信息调节语义嵌入,有效地注入参考的“视觉灵魂”而无需训练。此外,一种**动态时间粒度调度器**与扩散过程相协调,在早期去噪阶段优先考虑视觉氛围,而在后期细化语义细节。大量实验表明,DVI 显著提高了视觉一致性和氛围保真度,无需参数微调,同时保持了鲁棒的身份保留,并在IBench评估中优于现有最佳方法。
Summary / 总结
The research addresses the issue of 'Semantic-Visual Dissonance' in identity customization methods, which often result in unnatural effects due to a lack of visual context. DVI (Disentangled Visual-Identity) is proposed as a zero-shot framework that disentangles identity into semantic and visual streams. It uses mean and variance from the VAE latent space to describe the visual atmosphere and introduces a parameter-free feature modulation mechanism to adaptively inject visual context into semantic embeddings. Additionally, a dynamic temporal granularity scheduler aligns with the diffusion process to prioritize visual atmosphere early and refine semantic details later. Experiments show that DVI improves visual consistency and atmospheric fidelity while preserving identity and outperforming existing methods.
研究旨在解决无调参身份定制方法中存在的‘语义-视觉不和谐’问题,这些方法往往忽视了光照、皮肤纹理等视觉上下文。DVI 是一个零样本框架,将身份分解为语义和视觉流,利用VAE潜在空间的均值和方差来描述视觉氛围。它引入了一种无参数特征调制机制,以适应性地注入参考的视觉特征,并使用动态时间粒度调度器在去噪早期阶段优先处理视觉氛围,而在后期细化语义细节。实验表明,DVI 在提高视觉一致性和氛围保真度的同时,保持了身份的稳健性,并在IBench评估中优于现有方法。
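The DVI entry describes modulating semantic embeddings with the mean and variance of a reference latent, with more visual influence early in denoising. The sketch below shows one way such a scheme could look. The abstract does not give formulas, so the AdaIN-style normalization, the linear schedule, and every function name here are assumptions for illustration only.

```python
import math

def mean_std(values):
    """Return the mean and (population) standard deviation of a flat list."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)

def modulate(embedding, ref_latent, eps=1e-6):
    """AdaIN-style, parameter-free modulation (assumed form, not the paper's formula).

    The embedding is normalized, then rescaled and shifted with the
    statistics of the reference latent.
    """
    e_mean, e_std = mean_std(embedding)
    r_mean, r_std = mean_std(ref_latent)
    return [(x - e_mean) / (e_std + eps) * r_std + r_mean for x in embedding]

def visual_weight(step, total_steps):
    """Hypothetical linear schedule: 1.0 at the first step, 0.0 at the last."""
    return 1.0 - step / (total_steps - 1)

def blend(embedding, ref_latent, step, total_steps):
    """Mix modulated and original embeddings according to the schedule."""
    w = visual_weight(step, total_steps)
    modulated = modulate(embedding, ref_latent)
    return [w * m + (1.0 - w) * x for m, x in zip(modulated, embedding)]

# Hypothetical toy values.
semantic = [1.0, 2.0, 3.0, 4.0]
reference = [10.0, 10.0, 14.0, 14.0]  # mean 12, std 2
early = blend(semantic, reference, step=0, total_steps=10)
late = blend(semantic, reference, step=9, total_steps=10)
```

At the first step the output carries the reference statistics; at the last step it equals the original semantic embedding, mirroring the early-visual, late-semantic ordering the abstract describes.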
TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning
Authors: Chuang Jiang, Mingyue Cheng, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu
Venue: WSDM 2026
First: 2025-09-08T02:00:31+00:00 · Latest: 2025-12-22T01:55:37+00:00
Comments: 10 pages, 6 figures. Submitted to WSDM 2026
Abstract
Table reasoning requires models to jointly perform comprehensive semantic understanding and precise numerical operations. Although recent large language model (LLM)-based methods have achieved promising results, most of them still rely on a single-turn reasoning paradigm that processes flattened tables in a single forward pass. This paradigm suffers from inherent limitations, including context overflow on large tables, weak sensitivity to continuous numerical values, and the absence of explicit tool-use and reflection. In this paper, we propose TableMind, a tuning-based autonomous programmatic table agent that simulates the human-like cognitive schema of the multi-turn interaction within a lightweight LLM. Instead of adopting a training-free workflow design, TableMind learns to internalize planning, action, and reflection through a principled two-stage training strategy. To bootstrap structured table reasoning capabilities, we construct and filter high-quality reasoning data for the supervised fine-tuning (SFT) stage. To enable precise code generation, we introduce a designed multi-perspective reward scheme and a novel optimization objective in the reinforcement learning (RL) stage. Extensive experiments on diverse benchmarks demonstrate that TableMind consistently outperforms previous baselines, validating the effectiveness of training autonomous agents to improve overall performance.
中文标题/摘要
标题:TableMind:一种自主程序代理,用于工具增强的表格推理
表格推理需要模型同时进行全面的语义理解和精确的数值操作。尽管最近基于大型语言模型(LLM)的方法取得了令人鼓舞的结果,但大多数方法仍然依赖于单一回合的推理范式,即在单次前向传递中处理扁平化的表格。这种范式存在固有的局限性,包括在大表格上出现上下文溢出、对连续数值的敏感性较弱以及缺乏明确的工具使用和反思。在本文中,我们提出了一种基于调优的自主程序化表格代理TableMind,它模拟了轻量级LLM中多回合交互的人类认知模式。TableMind 通过一个原则性的两阶段训练策略学习内化规划、行动和反思,而不是采用无训练的工作流设计。为了启动结构化的表格推理能力,我们构建并筛选高质量的推理数据用于监督微调(SFT)阶段。为了实现精确的代码生成,我们在强化学习(RL)阶段引入了一种设计的多视角奖励方案和一种新的优化目标。在多种基准上的广泛实验表明,TableMind 一致地优于之前的基线,验证了训练自主代理以提高整体性能的有效性。
Summary / 总结
TableMind is an autonomous programmatic agent designed to enhance table reasoning by simulating multi-turn human-like interactions within a lightweight language model. It overcomes the limitations of single-turn reasoning by learning planning, action, and reflection through a two-stage training strategy. TableMind outperforms previous methods on various benchmarks, demonstrating its effectiveness in improving overall performance in table reasoning tasks.
TableMind 是一个自主程序化代理,用于表格推理,通过引入多轮交互解决单轮推理的局限性。它采用结合监督微调和强化学习的两阶段训练策略,实现精确的数值操作和工具使用。TableMind 在多种基准测试中表现出色,验证了训练自主代理以提高整体性能的有效性。
The Ensemble Schrödinger Bridge filter for Nonlinear Data Assimilation
Authors: Feng Bao, Hui Sun
First: 2025-12-22T00:06:49+00:00 · Latest: 2025-12-22T00:06:49+00:00
Abstract
This work puts forward a novel nonlinear optimal filter, namely the Ensemble Schrödinger Bridge nonlinear filter. The proposed filter marries the standard prediction procedure with diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach has no structural model error, and it is derivative-free, training-free, and highly parallelizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimensions, up to 40 or above, under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the particle filter in numerous tests given different levels of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.
中文标题/摘要
标题:Schrödinger 桥集成滤波器在非线性数据同化中的应用
本文提出了一种新颖的非线性最优滤波器,即集成Schrödinger桥非线性滤波器。所提出的滤波器将标准预测过程与扩散生成建模相结合,以实现一次滤波步骤。所设计的方法不存在结构模型误差,且无需求导、无需训练,具有高度并行性。实验结果表明,在混沌环境中,该设计算法在(轻微)高维(40或以上)且具有高度非线性动力学的情况下表现良好。此外,在不同非线性程度下与经典方法(如集成卡尔曼滤波器和粒子滤波器)相比,该算法在多次测试中表现出更好的性能。未来的工作将致力于将所提出的方法扩展到实际气象应用,并建立严格的收敛分析。
Summary / 总结
This work introduces the Ensemble Schrödinger Bridge nonlinear filter, which combines the standard prediction procedure with diffusion generative modeling for the analysis procedure to realize one filtering step without structural model error. The method is derivative-free, training-free, and highly parallelizable. Experiments show that the filter performs well on highly nonlinear dynamics in dimensions of 40 or above under chaotic conditions, outperforming classical methods such as the ensemble Kalman filter and the particle filter in numerous tests. Future work aims to extend the approach to practical meteorological applications and to establish a rigorous convergence analysis.
该研究提出了一种新的非线性滤波器,即Ensemble Schrödinger Bridge非线性滤波器,将标准预测过程与用于分析过程的扩散生成建模相结合,不存在结构模型误差,无需训练和导数计算,且高度可并行。实验表明,该算法在40维及以上的强非线性混沌动力学中表现出色,优于集合卡尔曼滤波器和粒子滤波器等经典方法。未来的工作将致力于将其扩展到实际气象应用,并进行严格的收敛性分析。
Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
Authors: Mohamad Zamini, Diksha Shukla
First: 2025-12-21T23:02:56+00:00 · Latest: 2025-12-21T23:02:56+00:00
Abstract
Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.
中文标题/摘要
标题:Delta-LLaVA:面向标记高效视觉-语言模型的“先基础后专业化”对齐
多模态大型语言模型(MLLMs)结合视觉和文本表示,以实现丰富的推理能力。然而,处理密集视觉标记的高计算成本仍然是一个主要瓶颈。此管道中的关键组件是视觉投影器,它连接视觉编码器和语言模型。标准设计通常采用简单的多层感知机进行直接标记映射,但这种方法在高分辨率输入下扩展性差,引入了大量冗余。我们提出了Delta-LLaVA,这是一种标记高效的投影器,采用低秩DeltaProjection将多级视觉特征对齐到一个紧凑的子空间中,然后再进一步交互。在此基础对齐之上,轻量级的Transformer块作为专业化层,能够在受限的标记预算下捕捉全局和局部结构。广泛的实验和消融研究证明,这种先基础后专业化的设计在多个基准测试中仅使用144个标记就能获得一致的收益,突显了在扩展交互能力之前标记形成的重要性。通过Delta-LLaVA,推理吞吐量最高提高了55%,而端到端训练在预训练中加速了近4-5倍,在微调中加速了超过1.5倍,突显了我们设计在效率和可扩展性方面的双重优势。
Summary / 总结
Delta-LLaVA is designed to address the computational challenges of processing dense visual tokens in multimodal large language models. It introduces a low-rank DeltaProjection for base alignment, followed by lightweight Transformer blocks for specialization. This approach reduces the token budget to 144 while improving inference throughput by up to 55% and accelerating end-to-end training by nearly 4-5x in pretraining and over 1.5x in finetuning. The base-then-specialize design consistently improves performance across multiple benchmarks.
Delta-LLaVA旨在解决多模态大型语言模型中处理密集视觉标记的计算挑战。它引入了一种低秩DeltaProjection进行基础对齐,随后使用轻量级的Transformer块进行专业化处理。这种设计将标记预算减少到144个,同时将推理吞吐量最高提高55%,并加速端到端训练,预训练加速近4-5倍,微调加速超过1.5倍。这种先基础后专业化的设计在多个基准测试中始终提高了性能。
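To make the low-rank idea in the Delta-LLaVA entry concrete, the sketch below compares the parameter count of a dense projection with a generic rank-r factorization and applies the factorized map to a vector. It shows ordinary low-rank factorization only; the internals of DeltaProjection are not described in the abstract, and the dimensions and rank used here are hypothetical.

```python
def dense_params(d_in, d_out):
    """Parameter count of a single dense d_in -> d_out projection (no bias)."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Parameter count of a rank-`rank` factorization: down (rank x d_in) plus up (d_out x rank)."""
    return rank * (d_in + d_out)

def low_rank_apply(down, up, x):
    """Compute up @ (down @ x) with plain lists; `down` is rank x d_in, `up` is d_out x rank."""
    hidden = [sum(w * v for w, v in zip(row, x)) for row in down]
    return [sum(w * h for w, h in zip(row, hidden)) for row in up]

# Hypothetical sizes, chosen only for the arithmetic.
full = dense_params(1024, 4096)
compact = low_rank_params(1024, 4096, rank=64)

# Tiny rank-1 example: 2-D input, 2-D output.
out = low_rank_apply(down=[[1.0, 1.0]], up=[[2.0], [3.0]], x=[1.0, 2.0])
```

With these assumed sizes the factorized projector needs 327,680 weights instead of 4,194,304, which is the kind of saving that motivates projecting into a compact subspace before heavier interaction layers.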
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban
Venue: NeurIPS 2025
First: 2025-11-04T08:56:28+00:00 · Latest: 2025-12-21T22:30:20+00:00
Comments: Presented at NeurIPS 2025 Lock-LLM Workshop. Code is available at https://github.com/AAN-AutoAdv/AutoAdv
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where adversarial prompts elicit harmful outputs. Yet most evaluations focus on single-turn interactions while real-world attacks unfold through adaptive multi-turn conversations. We present AutoAdv, a training-free framework for automated multi-turn jailbreaking that achieves an attack success rate of up to 95% on Llama-3.1-8B within six turns, a 24% improvement over single-turn baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern manager that learns from successful attacks to enhance future prompts, a temperature manager that dynamically adjusts sampling parameters based on failure modes, and a two-phase rewriting strategy that disguises harmful requests and then iteratively refines them. Extensive evaluation across commercial and open-source models (Llama-3.1-8B, GPT-4o mini, Qwen3-235B, Mistral-7B) reveals persistent vulnerabilities in current safety mechanisms, with multi-turn attacks consistently outperforming single-turn approaches. These findings demonstrate that alignment strategies optimized for single-turn interactions fail to maintain robustness across extended conversations, highlighting an urgent need for multi-turn-aware defenses.
中文标题/摘要
标题:AutoAdv:自动化对抗提示以实现大型语言模型的多轮越狱攻击
大型语言模型(LLMs)仍然容易受到越狱攻击的影响,其中对抗性提示会引发有害输出。然而,大多数评估集中在单轮交互上,而实际攻击则通过适应性的多轮对话展开。我们提出了AutoAdv,这是一种无需训练的框架,用于实现自动化多轮越狱攻击,在六轮内对Llama-3.1-8B的成功攻击率高达95%,比单轮基线提高了24%。AutoAdv独特地结合了三种适应性机制:一个模式管理器,从成功的攻击中学习以增强未来的提示;一个温度管理器,根据失败模式动态调整采样参数;以及一个两阶段重写策略,首先隐藏有害请求,然后逐步优化它们。在商业和开源模型(Llama-3.1-8B、GPT-4o mini、Qwen3-235B、Mistral-7B)上的广泛评估揭示了当前安全机制的持续漏洞,多轮攻击始终优于单轮方法。这些发现表明,针对单轮交互优化的对齐策略无法在长时间对话中保持鲁棒性,突显了对具备多轮感知能力的防御的迫切需求。
Summary / 总结
AutoAdv is a training-free framework for automated multi-turn jailbreaking of large language models, achieving up to 95% attack success rate within six turns on Llama-3.1-8B, which is a 24% improvement over single-turn baselines. It combines three adaptive mechanisms: a pattern manager, a temperature manager, and a two-phase rewriting strategy, to enhance prompt effectiveness and disguise harmful requests. The framework consistently outperforms single-turn approaches across various models, indicating the need for multi-turn-aware defenses.
AutoAdv 是一个无需训练的框架,用于对大语言模型进行自动化多轮越狱攻击,六轮内对 Llama-3.1-8B 的攻击成功率高达 95%,比单轮基线提高了 24%。它结合了模式管理器、温度管理器和两阶段重写策略,动态调整和优化恶意提示,展示了当前安全机制在多轮交互中的持续漏洞。
Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
Authors: Dmitry Demidov, Zaigham Zaheer, Zongyan Han, Omkar Thawakar, Rao Anwer
First: 2025-12-21T22:01:29+00:00 · Latest: 2025-12-21T22:01:29+00:00
Abstract
Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on github.com/demidovd98/FiNDR.
中文标题/摘要
标题:超越标签的思考:基于推理增强LMM的无词汇细粒度识别
无词汇细粒度图像识别旨在区分元类内的视觉相似类别,而无需固定的人类定义标签集。现有解决方案要么依赖于庞大且僵化的词汇列表,要么依赖于复杂且易出错的管道,其中错误会在各个阶段传播。与此同时,近期大型多模态模型(LMMs)具备显式或隐式推理能力,能够理解视觉-语言数据、分解问题、检索潜在知识并自我纠正,这表明了一种更为原则性和有效的方法。基于这些能力,我们提出了FiNDR(基于推理的细粒度名称发现),这是首个基于推理增强LMM的无词汇细粒度识别框架。系统分为三个自动化步骤:(i) 一个推理增强的LMM为每张图像生成描述性候选标签;(ii) 视觉-语言模型筛选并排序这些候选标签,形成一个一致的类别集;(iii) 验证后的名称实例化一个轻量级多模态分类器用于推理阶段。在流行的细粒度分类基准上的广泛实验表明,在无词汇设置下,该方法达到了最先进的性能,相对于先前方法有高达18.8%的显著相对优势。此外,该方法超越了利用预定义真实名称的零样本基线,挑战了人类策划词汇定义上限的假设。另外,我们展示了精心策划的提示使开源LMM能够匹配专有版本。这些发现确立了推理增强LMM作为可扩展、全自动、开放世界的细粒度视觉识别的有效基础。源代码可在github.com/demidovd98/FiNDR获取。
Summary / 总结
The paper addresses the challenge of distinguishing visually similar categories within a meta-class without relying on predefined labels. It introduces FiNDR, a reasoning-augmented large multi-modal model framework that generates descriptive labels for images, filters them to form coherent classes, and uses these names to classify images. Experiments show that FiNDR outperforms previous methods by up to 18.8% and even surpasses zero-shot baselines that use predefined names, suggesting that reasoning-augmented LMMs can effectively handle vocabulary-free fine-grained recognition. The method also demonstrates that open-source models can match proprietary ones with carefully curated prompts, making it a scalable solution for fine-grained visual recognition.
研究旨在通过利用增强推理的大规模多模态模型来改进无词汇表的细粒度图像识别。方法包括三个步骤:使用增强推理的LMM生成描述性候选标签,使用视觉语言模型过滤和排序这些候选标签,以及使用验证的名称实例化轻量级多模态分类器。实验表明,FiNDR在无词汇表设置下达到了最先进的性能,相对于先前的方法提高了高达18.8%的性能,并超越了使用预定义名称的零样本基线,表明人类编纂的词汇表可能不是最佳性能的上限。这些发现表明,增强推理的LMM可以有效地用于无预定义标签的开放世界细粒度视觉识别。
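Steps (ii) and (iii) of the FiNDR pipeline, filtering and ranking candidate names and then classifying against the verified names, can be sketched as follows. This is an illustrative outline, not the released code: the scores and embeddings would come from a vision-language model in practice, and the threshold, class names, and vectors below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_class_names(candidates, min_score, max_classes):
    """Keep candidate names scoring at least `min_score`, best first.

    `candidates` maps a name to a relevance score (assumed to be produced
    by a vision-language model; here they are just numbers).
    """
    kept = [(score, name) for name, score in candidates.items() if score >= min_score]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in kept[:max_classes]]

def classify(image_embedding, name_embeddings):
    """Return the class name whose embedding is most similar to the image."""
    return max(name_embeddings, key=lambda n: cosine(image_embedding, name_embeddings[n]))

# Hypothetical candidates proposed for a set of bird images.
candidates = {"indigo bunting": 0.9, "blue bird": 0.4, "blue grosbeak": 0.8}
class_set = select_class_names(candidates, min_score=0.5, max_classes=2)

# Hypothetical 2-D embeddings for the verified names and one image.
name_embeddings = {"indigo bunting": [1.0, 0.0], "blue grosbeak": [0.0, 1.0]}
prediction = classify([1.0, 0.1], name_embeddings)
```

The generic label is dropped by the score threshold, and the remaining names act as the class prototypes of a nearest-neighbor classifier.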
Independent Density Estimation
Authors: Jiahao Liu, Senhao Cao
First: 2025-12-10T20:43:03+00:00 · Latest: 2025-12-21T21:19:52+00:00
Comments: 10 pages, 1 table, 4 figures
Abstract
Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.
中文标题/摘要
标题:独立密度估计
大规模视觉-语言模型在图像字幕和条件图像生成等领域取得了显著成果,然而这些模型在实现类人的组合泛化方面仍然面临困难。本研究提出了一种新的方法,即独立密度估计(IDE),以应对这一挑战。IDE旨在学习句子中单个词与图像中相应特征之间的联系,从而实现组合泛化。我们基于IDE的理念构建了两个模型。第一个模型使用完全解耦的视觉表示作为输入,第二个模型利用变分自编码器从原始图像中获得部分解耦的特征。此外,我们还提出了一种基于熵的组合推理方法,用于结合句子中每个词的预测。当在各种数据集上进行评估时,我们的模型在未见过的组合泛化方面优于当前模型。
Summary / 总结
This study addresses the challenge of compositional generalization in vision-language models by proposing a new method called Independent Density Estimation (IDE). IDE learns the relationship between individual words in a sentence and corresponding image features, enabling better compositional generalization. Two models were developed: one using fully disentangled visual representations and another using partially disentangled features from raw images via a Variational Auto-Encoder. An entropy-based compositional inference method was also introduced to combine word predictions. The models outperform existing methods in generalizing to unseen compositions across various datasets.
本研究旨在解决大型视觉-语言模型在实现类人类的组合泛化方面的挑战。提出了独立密度估计(IDE)方法,以学习句子中的单词与相应图像特征之间的联系。提出了两种模型:一种使用完全解纠缠的视觉表示,另一种使用通过变分自编码器从原始图像中获得的部分解纠缠特征。还提出了一种基于熵的组合推理方法,以结合每个单词的预测。在各种数据集上的评估表明,这些模型在未见过的组合泛化方面优于现有模型。
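The IDE entry mentions an entropy-based method for combining per-word predictions but gives no formula. The sketch below shows one plausible reading, in which each word yields a probability distribution over candidate images and low-entropy (confident) words receive more weight. The weighting rule and all values are assumptions, not the paper's method.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def combine(word_probs, eps=1e-9):
    """Average per-word distributions, weighting each by (max entropy - entropy).

    `word_probs` is a list of distributions over the same candidates.
    A uniform (uninformative) word gets a weight close to zero.
    """
    n = len(word_probs[0])
    max_h = math.log(n)
    weights = [max(0.0, max_h - entropy(p)) + eps for p in word_probs]
    total = sum(weights)
    return [sum(w * p[i] for w, p in zip(weights, word_probs)) / total for i in range(n)]

# Hypothetical example: "red" is confident about candidate 0, "thing" is uninformative.
per_word = [[0.9, 0.1], [0.5, 0.5]]
combined = combine(per_word)
```

Here the uninformative word barely moves the result, so the combined distribution stays close to the confident word's prediction.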