arXiv Paper Digest

2025-10-21 03:31
Snapshot: 20251021_0331
BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models
Authors: Kaushitha Silva, Mansitha Eashwara, Sanduni Ubayasiri, Ruwan Tennakoon, Damayanthi Herath
First: 2025-10-17T17:58:31+00:00 · Latest: 2025-10-17T17:58:31+00:00
Comments: 10 Pages + 15 Supplementary Material Pages, 5 figures
Abstract
The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model's performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.
中文标题/摘要
标题:BiomedXPro:使用生物医学视觉语言模型进行可解释诊断的提示优化
生物医学视觉语言模型在临床中的应用受到提示优化技术的阻碍,这些技术产生的要么是不可解释的潜在向量,要么是单一的文本提示。这种缺乏透明度和无法捕捉临床诊断的多面性,限制了它们在高风险环境中的可信度。为了解决这一问题,我们引入了BiomedXPro,这是一种进化框架,利用大型语言模型作为生物医学知识提取器和自适应优化器,自动生成一系列可解释的自然语言提示对,用于疾病诊断。在多个生物医学基准上的实验表明,BiomedXPro在数据稀缺的少量样本设置中始终优于最先进的提示调优方法。此外,我们的分析表明,发现的提示与统计上显著的临床特征之间存在强烈的语义对齐,使模型的性能基于可验证的概念。通过生成一系列可解释的提示,BiomedXPro为模型预测提供了可验证的基础,代表了朝着开发更可信和临床对齐的AI系统迈出的关键一步。
Summary / 总结
The paper introduces BiomedXPro, an evolutionary framework that uses a large language model to automatically generate interpretable, natural-language prompt pairs for disease diagnosis, addressing the lack of transparency in biomedical vision-language models. Experiments show that BiomedXPro outperforms state-of-the-art methods, especially in data-scarce few-shot settings, and the discovered prompts align well with clinical features, enhancing the model's trustworthiness.
该论文提出了BiomedXPro框架,利用大型语言模型自动生成可解释的自然语言提示对,解决生物医学视觉语言模型缺乏透明度的问题。实验表明,BiomedXPro在数据稀缺的少量样本设置中优于现有方法,并且发现的提示与临床特征高度一致,提升了模型的可信度。
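The idea of diagnosing with an ensemble of interpretable prompt pairs can be illustrated with a minimal sketch. It assumes that image and prompt embeddings are already available as plain vectors, that each pair consists of a (negative, positive) description, and that pair-level probabilities are averaged uniformly. The function names, the temperature value, and the uniform weighting are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pair_probability(image_emb, neg_emb, pos_emb, temperature=0.07):
    """Softmax over the (negative, positive) prompt similarities; returns P(positive)."""
    s_neg = cosine(image_emb, neg_emb) / temperature
    s_pos = cosine(image_emb, pos_emb) / temperature
    m = max(s_neg, s_pos)  # subtract the max for numerical stability
    e_neg, e_pos = math.exp(s_neg - m), math.exp(s_pos - m)
    return e_pos / (e_neg + e_pos)

def ensemble_probability(image_emb, prompt_pairs):
    """Average P(positive) over all prompt pairs (uniform weights are an assumption)."""
    probs = [pair_probability(image_emb, neg, pos) for neg, pos in prompt_pairs]
    return sum(probs) / len(probs)
```

Because every pair is a readable sentence pair, the per-pair probabilities can be inspected individually to see which observation drove the prediction.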
Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
Authors: Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin
First: 2025-10-17T17:42:28+00:00 · Latest: 2025-10-17T17:42:28+00:00
Abstract
Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.
中文标题/摘要
标题:Memory-SAM:通过检索到提示实现无需人工提示的舌部分割
准确的舌部分割对于可靠的中医分析至关重要。监督模型需要大量标注数据集,而SAM家族模型仍依赖于提示。我们提出了Memory-SAM,这是一种无需训练、无需人工提示的流水线,通过密集的DINOv3特征和FAISS检索,自动从少量的先前案例记忆中生成有效的提示。给定查询图像,检索到的示例的掩码约束对应关系被提炼成前景/背景点提示,指导SAM2进行分割,无需手动点击或模型微调。我们在600张由专家标注的图像(300张受控,300张野外)上进行了评估。在混合测试集上,Memory-SAM的mIoU为0.9863,超过了FCN(0.8188)和一个检测器到框的SAM基线(0.1839)。在受控数据上,天花板效应使得超过0.98的小差异变得不那么有意义,而我们的方法在真实条件下显示出明显的改进。结果表明,检索到提示能够实现数据高效、鲁棒的舌部成像不规则边界分割。代码已公开,可在https://github.com/jw-chae/memory-sam/ 获取。
Summary / 总结
Memory-SAM is a training-free, human-prompt-free pipeline for tongue segmentation that uses dense DINOv3 features and FAISS retrieval to generate point prompts for SAM2 from a small memory of prior cases. On the mixed test split it achieves an mIoU of 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences hard to interpret, whereas the method shows clear gains under real-world conditions, which indicates robustness on irregular tongue boundaries.
Memory-SAM旨在通过传统中医分析中的舌头分割,解决监督模型和SAM家族模型的局限性,采用密集的DINOv3特征和FAISS检索,从少量的先前案例记忆中自动生成有效的提示,无需手动点击或模型微调。在600张专家标注的图像上,Memory-SAM的平均交并比(mIoU)达到0.9863,超过了FCN和一个检测器到框的SAM基线。该方法在真实世界条件下显示出明显的改进,表明其在数据高效和鲁棒分割不规则舌头边界方面的潜力。
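The retrieval-to-prompt step described above can be sketched in a few lines. This is a toy version under stated assumptions: brute-force cosine search stands in for FAISS, features are short hand-written vectors rather than DINOv3 outputs, and a query patch becomes a foreground point when its best-matching exemplar patch lies inside the exemplar mask. The data layout (`global`, `patches`, `mask`) and both function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve(query_vec, memory):
    """Index of the memory entry whose global feature is most similar to the query."""
    return max(range(len(memory)), key=lambda i: cosine(query_vec, memory[i]["global"]))

def points_from_correspondence(query_patches, exemplar):
    """Turn patch correspondences into foreground/background point prompts.

    query_patches: {(x, y): feature}
    exemplar: {"patches": {(x, y): feature}, "mask": set of foreground (x, y)}
    """
    fg, bg = [], []
    for xy, feat in query_patches.items():
        best = max(exemplar["patches"],
                   key=lambda k: cosine(feat, exemplar["patches"][k]))
        # A match that lands inside the exemplar mask votes for foreground.
        (fg if best in exemplar["mask"] else bg).append(xy)
    return fg, bg
```

The resulting point lists would then be passed to a promptable segmenter in place of manual clicks; that call is omitted here because it depends on the segmenter's own API.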
Neuro-Symbolic Spatial Reasoning in Segmentation
Authors: Jiayi Lin, Jiabo Huang, Shaogang Gong
First: 2025-10-17T17:35:34+00:00 · Latest: 2025-10-17T17:35:34+00:00
Abstract
Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to-right-of, person>, and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
中文标题/摘要
标题:神经符号空间推理在分割中的应用
开放词汇语义分割(OVSS)为一个开放类别集合分配像素级标签,需要对未见过且未标注的对象进行泛化。使用视觉语言模型(VLMs)将局部图像块与潜在未见过的对象类别相关联,但由于缺乏对场景中对象空间关系的理解,存在局限性。为解决这一问题,我们引入了OVSS中的神经符号(NeSy)空间推理。与当前基于VLM相关性的方法不同,我们提出了关系分割器(RelateSeg),通过一阶逻辑(FOL)在神经网络架构中施加显式的空间关系约束。这是首次尝试在OVSS中探索NeSy空间推理。具体而言,RelateSeg自动提取空间关系,如<猫, 在...右边, 人>,并使用我们提出的伪类别将其编码为一阶逻辑公式。每个像素同时学习预测一个语义类别(如“猫”)和一个空间伪类别(如“在人的右边”),从而施加关系约束(如“猫”像素必须位于“人”的右边)。最后,这些逻辑约束通过模糊逻辑松弛在深度网络架构中进行形式化,从而实现空间关系一致的分割端到端学习。RelateSeg在四个基准数据集上的平均mIoU上达到了最先进的性能,并且在包含多个类别的图像上特别表现出明显优势,仅引入了一个辅助损失函数且没有增加额外参数,验证了NeSy空间推理在OVSS中的有效性。
Summary / 总结
This paper brings neuro-symbolic spatial reasoning to Open-Vocabulary Semantic Segmentation (OVSS). The authors propose Relational Segmentor (RelateSeg), which encodes spatial relations as first-order logic formulas and imposes them inside the network through fuzzy logic relaxation. RelateSeg achieves state-of-the-art average mIoU across four benchmark datasets, with clear advantages on images containing multiple categories, at the cost of a single auxiliary loss and no additional parameters.
论文通过引入神经符号空间推理解决了开放词汇语义分割(OVSS)的问题。提出了一种关系分割器(RelateSeg),通过一阶逻辑显式编码空间关系,并将其整合到神经网络中。这种方法在包含多个类别的场景中特别有效,实现了最先进的分割性能,且仅引入了一个辅助损失函数,没有增加额外的参数。
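To make the fuzzy logic relaxation concrete, the sketch below turns the rule "a pixel predicted as cat must also be predicted as right of person" into a differentiable penalty. It assumes the Reichenbach implication, I(a, b) = 1 - a + a*b, as the fuzzy operator; the abstract does not say which relaxation the authors use, so the operator and the function names are illustrative only.

```python
def implication(a, b):
    """Reichenbach fuzzy implication: truth degree of (a -> b) for a, b in [0, 1]."""
    return 1.0 - a + a * b

def relational_loss(p_semantic, p_spatial):
    """Mean violation of 'semantic category -> spatial pseudo category' over pixels.

    p_semantic[i]: probability that pixel i is, e.g., "cat"
    p_spatial[i]:  probability that pixel i is, e.g., "right of person"
    """
    violations = [1.0 - implication(a, b) for a, b in zip(p_semantic, p_spatial)]
    return sum(violations) / len(violations)
```

With this operator the per-pixel violation simplifies to a * (1 - b), so the penalty is large only where the semantic prediction is confident and the spatial pseudo category is not. A term of this kind adds one auxiliary loss and no parameters.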
CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models
Authors: Denis Rychkovskiy
First: 2025-10-14T19:57:58+00:00 · Latest: 2025-10-17T15:59:13+00:00
Comments: 8 pages, 3 figures. Endorsed by Dr. Seyedmorteza Sadat (ETH Zurich). The work introduces CADE 2.5 with ZeResFDG as a practical inference-time guidance stack for SD/SDXL. Code and visual examples to be released on GitHub and Hugging Face
Abstract
We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.
中文标题/摘要
标题:CADE 2.5 - ZeResFDG:频率解耦、重新缩放和零投影指导策略用于SD/SDXL潜在扩散模型
我们介绍了CADE 2.5(Comfy自适应细节增强器),这是一种针对SD/SDXL潜在扩散模型的采样级指导堆栈。核心模块ZeResFDG统一了(i) 频率解耦指导,重新加权指导信号中的低频和高频分量,(ii) 能量重新缩放,使受指导预测的样本量级与正分支匹配,以及(iii) 零投影,去除与无条件方向平行的分量。一种轻量级的带滞回的频谱指数移动平均值在采样过程中根据结构的形成在保守模式和细节寻求模式之间切换。在SD/SDXL采样器中,ZeResFDG在中等指导规模下提高了锐度、提示依从性和伪影控制,无需任何重新训练。此外,我们还采用了一种无需训练的推理时稳定器QSilk微粒稳定器(分位数钳制+深度/边缘门控微细节注入),提高了鲁棒性,并在高分辨率下产生了自然的高频微纹理,几乎没有额外开销。我们还注意到,同样的规则适用于其他参数化(例如,速度),我们在附录中简要讨论了这一点;然而,本文主要关注SD/SDXL潜在扩散模型。
Summary / 总结
CADE 2.5, a sampler-level guidance stack for SD/SDXL latent diffusion models, introduces ZeResFDG, which combines frequency-decoupled guidance, energy rescaling, and zero-projection to enhance sharpness, prompt adherence, and artifact control. The lightweight spectral EMA switches between conservative and detail-seeking modes, improving results without retraining. Additionally, QSilk Micrograin Stabilizer is used to stabilize inference and produce natural high-frequency micro-texture at high resolutions with minimal overhead.
CADE 2.5 是一种针对 SD/SDXL 潜在扩散模型的采样器级指导堆栈,引入了 ZeResFDG,结合了频率解耦指导、能量重标定和零投影,以提高清晰度、提示一致性以及减少伪影。轻量级的光谱 EMA 在采样过程中在保守模式和细节寻求模式之间切换,无需重新训练即可改进结果。此外,QSilk 微粒稳定器作为一种无需训练的推理时稳定器,增强了鲁棒性并在高分辨率下生成自然的高频微纹理,且几乎没有额外开销。
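The three ingredients of ZeResFDG can be illustrated on 1-D toy vectors. The sketch below is an assumption-heavy reading of the abstract: a moving average stands in for a real frequency split, the weights `w_low` and `w_high` are arbitrary, the steps are applied in the order decouple, project, rescale, and the L2 norm is used as the magnitude. None of these choices are confirmed by the source.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def low_pass(x, radius=1):
    """Moving-average low-pass filter (a stand-in for a real frequency split)."""
    n = len(x)
    out = []
    for i in range(n):
        window = x[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def guided_prediction(cond, uncond, w_low=1.0, w_high=2.0, eps=1e-12):
    """Return (guided prediction, projected guidance term)."""
    delta = [c - u for c, u in zip(cond, uncond)]
    # (i) Frequency decoupling: reweight low- and high-frequency parts of the guidance.
    low = low_pass(delta)
    high = [d - l for d, l in zip(delta, low)]
    mixed = [w_low * l + w_high * h for l, h in zip(low, high)]
    # (iii) Zero-projection: remove the component parallel to the unconditional direction.
    coef = dot(mixed, uncond) / (dot(uncond, uncond) + eps)
    ortho = [m - coef * u for m, u in zip(mixed, uncond)]
    guided = [c + o for c, o in zip(cond, ortho)]
    # (ii) Energy rescaling: match the magnitude of the conditional (positive) branch.
    scale = norm(cond) / (norm(guided) + eps)
    return [g * scale for g in guided], ortho
```

After the call, the guidance term is orthogonal to `uncond` and the output has the same norm as `cond`, which are the two invariants the abstract describes.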
CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding
Authors: Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
First: 2025-09-27T16:01:09+00:00 · Latest: 2025-10-17T14:59:53+00:00
Comments: Preprint, 27 pages, 3 figures
Abstract
Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.
中文标题/摘要
标题:CCD:通过临床对比解码减轻放射学MLLM中的幻觉
多模态大型语言模型(MLLMs)在放射学领域通过结合视觉感知和自然语言理解取得了显著进展。然而,它们经常生成缺乏临床支持的描述,即医学幻觉,这在需要准确性和图像相关输出的医学应用中构成了严重风险。通过实证分析,我们发现提示诱导的幻觉在放射学MLLM中仍然普遍存在,主要是由于对临床部分的过度敏感。为了解决这一问题,我们引入了临床对比解码(CCD),这是一种无需训练和检索的推理框架,结合了特定任务的放射学专家模型的结构化临床信号。CCD引入了双重对比机制,在生成过程中细化标记级概率,从而提高临床准确性,而不修改基础MLLM。在三个数据集和多种模型上的实验表明,CCD在放射学报告生成(RRG)方面始终如一地提高了整体性能。在MIMIC-CXR数据集上,当应用于最先进的RRG模型时,它在RadGraph-F1上最多可提高17%。我们的方法提供了一种轻量级且可泛化的解决方案,用于减轻医学幻觉,有效地将专家模型和MLLMs在放射学中联系起来。
Summary / 总结
The paper addresses the issue of medical hallucinations in radiology multimodal large language models (MLLMs) by introducing Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework. CCD uses structured clinical signals from task-specific radiology expert models to refine token-level logits during generation, enhancing clinical fidelity. Experiments show that CCD improves performance on radiology report generation, with up to a 17% improvement in RadGraph-F1 on the MIMIC-CXR dataset when applied to state-of-the-art models.
论文通过引入无训练和无检索的推理框架Clinical Contrastive Decoding (CCD),解决了放射学多模态大型语言模型(MLLMs)中的医疗幻觉问题。CCD 使用任务特定的放射学专家模型中的结构化临床信号,在生成过程中细化标记级概率,提高临床准确性。实验表明,CCD 在放射学报告生成上提高了性能,应用到最先进的模型时,在MIMIC-CXR 数据集上的 RadGraph-F1 得分提高了高达 17%。
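The general idea of refining token-level logits by contrast can be shown with a generic single-stage sketch. It is not the paper's dual-stage mechanism: it assumes two logit vectors over the same vocabulary, one computed with expert-derived clinical signals in the prompt and one without, and amplifies their difference with a hypothetical strength `alpha`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(logits_guided, logits_plain, alpha=0.5):
    """Push the distribution toward tokens that the expert-guided pass favours.

    adjusted = (1 + alpha) * guided - alpha * plain
    """
    return [(1 + alpha) * g - alpha * p for g, p in zip(logits_guided, logits_plain)]
```

Setting `alpha` to 0 recovers the guided logits unchanged, so the base model is never modified; only the decoding step is.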
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
First: 2025-10-16T10:18:48+00:00 · Latest: 2025-10-17T14:12:46+00:00
Comments: Github Repo: https://github.com/PaddlePaddle/PaddleOCR
Abstract
In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios. Code is available at https://github.com/PaddlePaddle/PaddleOCR.
中文标题/摘要
标题:PaddleOCR-VL:通过0.9B超紧凑视觉语言模型提升多语言文档解析
在本报告中,我们提出了PaddleOCR-VL,一种针对文档解析的SOTA且资源高效的模型。其核心组件是PaddleOCR-VL-0.9B,这是一种紧凑而强大的视觉语言模型(VLM),结合了NaViT风格的动态分辨率视觉编码器和ERNIE-4.5-0.3B语言模型,以实现准确的元素识别。该创新模型能够高效支持109种语言,并在识别复杂元素(如文本、表格、公式和图表)方面表现出色,同时保持极低的资源消耗。通过在广泛使用的公共基准和内部基准上的全面评估,PaddleOCR-VL 在页面级文档解析和元素级识别方面均达到了SOTA性能。它显著优于现有解决方案,表现出与顶级VLM相当的竞争力,并提供快速的推理速度。这些优势使其非常适合在实际场景中的部署。代码可在https://github.com/PaddlePaddle/PaddleOCR 获取。
Summary / 总结
PaddleOCR-VL is a state-of-the-art and resource-efficient model for document parsing, featuring PaddleOCR-VL-0.9B, a compact vision-language model that integrates a NaViT-style visual encoder and ERNIE-4.5-0.3B language model. It supports 109 languages and excels in recognizing complex elements like text, tables, formulas, and charts. Comprehensive evaluations show that PaddleOCR-VL outperforms existing solutions and delivers fast inference speeds, making it suitable for practical deployment in real-world scenarios.
PaddleOCR-VL 是一种面向文档解析的先进且资源高效的模型,包含 PaddleOCR-VL-0.9B,该模型结合了 NaViT 风格的视觉编码器和 ERNIE-4.5-0.3B 语言模型,支持 109 种语言,并擅长识别文本、表格、公式和图表等复杂元素。全面的评估显示,PaddleOCR-VL 在现有解决方案中表现出色,提供快速的推理速度,适用于实际部署的现实场景。
Methods and Trends in Detecting AI-Generated Images: A Comprehensive Review
Authors: Arpan Mahara, Naphtali Rishe
First: 2025-02-21T03:16:18+00:00 · Latest: 2025-10-17T13:36:01+00:00
Comments: 34 pages, 4 Figures, 10 Tables
Abstract
The proliferation of generative models, such as Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methodologies to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have predominantly focused on deepfake detection and often overlook recent advancements in synthetic image forensics, particularly approaches that incorporate multimodal frameworks, reasoning-based detection, and training-free methodologies. To bridge this gap, this survey provides a comprehensive and up-to-date review of state-of-the-art techniques for detecting and classifying synthetic images generated by advanced generative AI models. The review systematically examines core detection paradigms, categorizes them into spatial-domain, frequency-domain, fingerprint-based, patch-based, training-free, and multimodal reasoning-based frameworks, and offers concise descriptions of their underlying principles. We further provide detailed comparative analyses of these methods on publicly available datasets to assess their generalizability, robustness, and interpretability. Finally, the survey highlights open challenges and future directions, emphasizing the potential of hybrid frameworks that combine the efficiency of training-free approaches with the semantic reasoning of multimodal models to advance trustworthy and explainable synthetic image forensics.
中文标题/摘要
标题:检测AI生成图像的方法与趋势:全面综述
生成模型的普及,如生成对抗网络(GAN)、扩散模型和变分自编码器(VAE),使高质量多媒体数据的合成成为可能。然而,这些进步也引发了关于对抗攻击、不道德使用和社会危害的重大关切。认识到这些挑战,研究人员越来越多地专注于开发有效检测合成数据的方法,以减轻潜在风险。之前的综述主要集中在深度假信息检测上,往往忽略了合成图像取证的最新进展,特别是那些结合多模态框架、基于推理的检测和无需训练的方法。为了弥补这一差距,本综述提供了对先进生成AI模型生成的合成图像检测和分类的最新技术的全面和及时的综述。综述系统地检查了核心检测范式,将它们分类为空间域、频域、指纹基、块基、无需训练和多模态推理基框架,并简要描述了它们的基本原理。我们还对这些方法在公开可用的数据集上的详细比较分析,以评估它们的泛化能力、鲁棒性和可解释性。最后,综述强调了开放挑战和未来方向,强调了结合无需训练方法的效率和多模态模型的语义推理的混合框架的潜力,以推进可信和可解释的合成图像取证。
Summary / 总结
This survey reviews methods for detecting AI-generated images, with emphasis on recent advances in synthetic image forensics that earlier deepfake-focused reviews often overlook. It groups core detection paradigms into spatial-domain, frequency-domain, fingerprint-based, patch-based, training-free, and multimodal reasoning-based frameworks, and compares them on public datasets for generalizability, robustness, and interpretability. It closes by highlighting hybrid frameworks that combine the efficiency of training-free approaches with the semantic reasoning of multimodal models as a promising direction.
本文回顾了检测AI生成图像的方法,重点关注合成图像取证的最新进展。研究了核心检测范式,将其分类为各种框架,并在公共数据集上提供了比较分析。关键发现包括多模态推理基和无训练方法在检测合成图像方面的有效性,强调需要结合无训练方法的效率和多模态模型的语义推理来提高可信度和可解释性。
CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints
Authors: Yuhong Deng, Chao Tang, Cunjun Yu, Linfeng Li, David Hsu
First: 2025-07-26T15:43:25+00:00 · Latest: 2025-10-17T13:17:10+00:00
Abstract
Clothes manipulation, such as folding or hanging, is a critical capability for home service robots. Despite recent advances, most existing methods remain limited to specific clothes types and tasks, due to the complex, high-dimensional geometry of clothes. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation over diverse clothes types (T-shirts, shorts, skirts, long dresses, and others) as well as different tasks (folding, flattening, hanging, and others). The core idea of CLASP is semantic keypoints, e.g., "left sleeve" and "right shoulder": a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be reliably extracted from RGB-D images and provide an effective representation for a wide range of clothes manipulation policies. CLASP uses semantic keypoints as an intermediate representation to connect high-level task planning and low-level action execution. At the high level, it exploits vision language models (VLMs) to predict task plans over the semantic keypoints. At the low level, it executes the plans with the help of a set of pre-built manipulation skills conditioned on the keypoints. Extensive simulation experiments show that CLASP outperforms state-of-the-art baseline methods on multiple tasks across diverse clothes types, demonstrating strong performance and generalization. Further experiments with a Franka dual-arm system on four distinct tasks (folding, flattening, hanging, and placing) confirm CLASP's performance on real-life clothes manipulation.
中文标题/摘要
标题:CLASP:通用服装操作的语义关键点
服装操作,如折叠或悬挂,是家庭服务机器人的一项关键能力。尽管取得了近期进展,但大多数现有方法仍局限于特定类型的衣物和任务,因为衣物具有复杂的、高维的几何形状。本文提出了Clothes mAnipulation with Semantic keyPoints (CLASP),旨在实现对多种衣物类型(T恤、短裤、裙子、长裙等)和不同任务(折叠、压平、悬挂、放置等)的通用服装操作。CLASP的核心思想是语义关键点,例如“左袖”和“右肩”——这是一种稀疏的空间语义表示,对感知和操作都非常重要。可以从RGB-D图像中可靠地提取衣物的语义关键点,并为广泛的服装操作策略提供有效的表示。CLASP使用语义关键点作为中间表示,连接高层任务规划和低层动作执行。在高层,它利用视觉语言模型(VLMs)预测语义关键点上的任务计划。在低层,它在关键点的指导下执行这些计划,借助一组预构建的操作技能。广泛的模拟实验表明,CLASP在多种任务和不同类型的衣物上优于最先进的基线方法,显示出强大的性能和泛化能力。进一步在Franka双臂系统上的实验,针对折叠、压平、悬挂和放置四个不同任务,证实了CLASP在实际服装操作中的性能。
Summary / 总结
CLASP is designed for general-purpose clothes manipulation across various types and tasks. It uses semantic keypoints, such as 'left sleeve' and 'right shoulder', to connect high-level task planning and low-level action execution. CLASP employs vision language models to predict task plans based on these keypoints and executes them using pre-built manipulation skills. Experimental results show that CLASP outperforms existing methods in multiple tasks with diverse clothes types, indicating strong performance and generalization. Further real-life experiments with a Franka dual-arm system confirm its effectiveness in practical scenarios.
CLASP旨在实现不同类型和任务的通用衣物操作,使用‘左袖’和‘右肩’等语义关键点连接高层次的任务规划和低层次的动作执行。CLASP利用视觉语言模型根据这些关键点预测任务计划,并使用预构建的操作技能执行这些计划。实验结果显示,CLASP在多种任务和不同类型的衣物上优于现有方法,表现出强大的性能和泛化能力。进一步使用Franka双臂系统进行的实际操作实验也证实了其在实际场景中的有效性。
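The role of semantic keypoints as an intermediate representation can be sketched as a small dispatch layer: a high-level plan refers to keypoints by name, and a low-level skill receives their coordinates. Everything below is hypothetical scaffolding for illustration (the `pick_and_place` primitive, the plan format, and the coordinates); the paper's actual skill set and plan syntax are not given in the abstract.

```python
def pick_and_place(grasp_xyz, target_xyz):
    """Hypothetical primitive: returns a command description, no robot motion."""
    return {"skill": "pick_and_place", "grasp": grasp_xyz, "target": target_xyz}

# Registry of pre-built skills, keyed by the name a planner would emit.
SKILLS = {"pick_and_place": pick_and_place}

def execute_plan(plan, keypoints, skills=SKILLS):
    """Resolve each (skill name, keypoint names) step to coordinates and call the skill."""
    commands = []
    for skill_name, keypoint_names in plan:
        coords = [keypoints[name] for name in keypoint_names]
        commands.append(skills[skill_name](*coords))
    return commands
```

A plan such as `[("pick_and_place", ("left sleeve", "right shoulder"))]` stays the same across garments; only the detected keypoint coordinates change, which is what lets one planner cover many clothes types.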
GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device
Authors: Jiahao Zhou, Chengliang Lin, Dingji Li, Mingkai Dong, Haibo Chen
First: 2025-10-17T13:06:09+00:00 · Latest: 2025-10-17T13:06:09+00:00
Abstract
Semantic top-K selection with cross-encoder rerankers underpins on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, which opens pruning opportunities before full inference completes. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, GRATING. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via dual-layer sliding window and chunked execution. We evaluate GRATING against state-of-the-art baselines on rerankers from 0.6B to 8B parameters across Apple M2 and RTX 5070. GRATING consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, GRATING lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.
中文标题/摘要
标题:GRATING:设备上的低延迟和内存高效语义选择
使用交叉编码器重排序器进行语义Top-K选择是设备上AI服务(如检索增强生成、代理记忆和个人化推荐)的基础。然而,其延迟和内存需求主导了边缘硬件上的端到端预算。重新审视Top-K选择的目标,我们发现只有相对排名才重要,而非每个候选者的精确得分。我们进一步观察到序列级别的稀疏性:相对排名在中间层早期就趋于稳定,允许在完成完整推理之前进行剪枝机会。基于这一洞察,我们提出了一体化转发并开发了一个无需训练的推理系统GRATING。通过维护所有候选者的全局视图,它通过渐进簇剪枝来降低延迟。它还通过双层滑动窗口和分块执行战略性地重叠I/O与计算,来限制峰值内存使用。我们在Apple M2和RTX 5070上对从0.6B到8B参数的重排序器进行了与最新基准的评估。GRATING在微基准测试中始终将延迟降低高达89.0%,峰值内存降低高达94.9%,而没有任何精度损失。在三个实际的设备上AI应用中,GRATING将延迟降低11.6%-51.0%,峰值内存降低18.6%-77.8%,展示了显著的效率和部署改进。
Summary / 总结
GRATING is a training-free, low-latency and memory-efficient inference system for semantic top-K selection with cross-encoder rerankers. By exploiting the early stabilization of relative rankings, and by combining progressive cluster pruning with a dual-layer sliding window and chunked execution, it reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks without loss in precision. Across three real-world on-device applications, it lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%.
GRATING 是一种无需训练的推理系统,用于减少设备上语义 top-K 选择的延迟和内存使用。通过利用相对排名在中间层早期即趋于稳定这一特性,GRATING 剪枝不必要的计算,并通过双层滑动窗口重叠 I/O 和计算。实验表明,GRATING 在微基准测试中可将延迟最多减少 89.0%,峰值内存最多减少 94.9%,且不损失精度;在实际应用中将延迟降低 11.6%-51.0%,峰值内存降低 18.6%-77.8%。
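The observation that only relative rankings matter suggests a simple form of early pruning, sketched below. This toy version assumes that intermediate per-layer scores are available for every candidate and drops low-ranked candidates at chosen layers. It uses plain rank-based pruning rather than the cluster pruning the paper describes, and the `keep_schedule` parameter is a hypothetical knob.

```python
def progressive_top_k(layer_scores, k, keep_schedule):
    """Select top-k candidates while pruning early.

    layer_scores:  list over layers of {candidate: intermediate score}
    keep_schedule: {layer index: number of candidates to keep after that layer}
    Returns (top-k candidates, number of per-layer candidate evaluations).
    """
    alive = set(layer_scores[0])
    evaluations = 0
    for layer, scores in enumerate(layer_scores):
        evaluations += len(alive)  # only surviving candidates are processed
        if layer in keep_schedule:
            keep = max(k, keep_schedule[layer])  # never prune below k
            ranked = sorted(alive, key=lambda c: scores[c], reverse=True)
            alive = set(ranked[:keep])
    final = layer_scores[-1]
    top = sorted(alive, key=lambda c: final[c], reverse=True)[:k]
    return top, evaluations
```

With four candidates and three layers, full inference costs 12 candidate-layer evaluations; a schedule that keeps three candidates after the first layer and two after the second costs 9 and returns the same top-2 whenever rankings have stabilized by then.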
Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
Authors: Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang
First: 2025-09-16T12:51:11+00:00 · Latest: 2025-10-17T10:09:35+00:00
Abstract
Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model's visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
中文标题/摘要
标题:感知先于推理:视觉语言模型中的两阶段强化学习
强化学习(RL)已被证明在激发大型语言模型(LLMs)的推理能力方面非常有效。受此成功启发,最近的研究探索了将类似技术应用于视觉语言模型(VLMs),以提高其推理性能。然而,直接将RL方法从LLMs移植到VLMs是不理想的,因为VLMs面临的任务本质上更为复杂。具体来说,VLMs必须首先准确地感知和理解视觉输入,然后才能有效地进行推理。为了解决这一挑战,我们提出了一种两阶段的强化学习框架,旨在同时增强VLMs的感知和推理能力。为了缓解RL训练中常见的消失优势问题,我们首先在数据集级别进行采样,以选择性地使用不同的数据源强化特定能力。在训练过程中,第一阶段专注于通过粗粒度和细粒度的视觉理解来提高模型的视觉感知能力,而第二阶段则针对推理能力的提升。经过提出的两阶段强化学习过程后,我们获得了PeBR-R1,这是一种感知和推理能力显著增强的视觉语言模型。在七个基准数据集上的实验结果表明,我们的方法有效,并且验证了PeBR-R1在各种视觉推理任务中的优越性能。
Summary / 总结
This paper proposes a two-stage reinforcement learning framework to improve the perceptual and reasoning capabilities of vision-language models (VLMs). The first stage focuses on enhancing visual perception, while the second stage targets reasoning abilities. By selectively strengthening specific capabilities using distinct data sources, the approach mitigates the vanishing advantage issue in RL training. Experimental results on seven benchmark datasets show that the proposed PeBR-R1 model outperforms existing methods in various visual reasoning tasks.
本文提出了一种两阶段强化学习框架,旨在提升视觉语言模型(VLMs)的感知和推理能力。第一阶段通过粗细粒度的视觉理解来增强视觉感知,第二阶段则针对推理能力进行提升。该方法通过使用不同的数据源有选择地加强特定能力来缓解优势消失(vanishing advantage)问题。实验结果表明,所提出的PeBR-R1模型在七个基准数据集上的多种视觉推理任务中表现优于现有方法。
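The vanishing advantage issue can be seen in a few lines, assuming a group-relative advantage of the kind used in GRPO-style training (the abstract does not name the RL algorithm, so this is an assumption). When every sampled response to a prompt receives the same reward, all advantages are zero and the prompt contributes no gradient. The filter at the end is a hypothetical illustration of selecting prompts that still carry signal.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def has_learning_signal(rewards):
    """False when all rewards in the group are identical (advantages vanish)."""
    return any(abs(a) > 0.0 for a in group_advantages(rewards))

def filter_informative(prompt_rewards):
    """Keep only prompts whose sampled responses received mixed rewards."""
    return [p for p, rewards in prompt_rewards.items() if has_learning_signal(rewards)]
```

Prompts that a model always solves or always fails fall out of this filter, which is one way to read the abstract's dataset-level sampling.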
FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Authors: Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin
First: 2025-10-13T02:32:07+00:00 · Latest: 2025-10-17T08:47:31+00:00
Abstract
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
中文标题/摘要
标题:FG-CLIP 2:一种双语细粒度视觉语言对齐模型
细粒度的视觉语言理解需要视觉内容与语言描述之间精确对齐,而当前模型在这方面的能力仍然有限,尤其是在非英语环境中。虽然像CLIP这样的模型在全局对齐方面表现良好,但在捕捉对象属性、空间关系和语言表达的细粒度细节方面常常力不从心,且对双语理解的支持有限。为了解决这些挑战,我们引入了FG-CLIP 2,这是一种旨在推进英汉双语细粒度对齐的视觉语言模型。我们的方法利用了丰富的细粒度监督,包括区域-文本匹配和长描述建模,以及多个判别性目标。我们还引入了文本内模态对比损失(TIC损失)以更好地区分语义相似的描述。FG-CLIP 2在精心筛选的大量英汉数据上进行训练,实现了强大的双语性能。为了进行严格的评估,我们提出了一个新的中文多模态理解基准,包括长描述检索和边界框分类。在8个任务的29个数据集上的广泛实验表明,FG-CLIP 2在两种语言中均优于现有方法,达到了最先进的性能。我们发布了该模型、代码和基准,以促进双语细粒度对齐的未来研究。
Summary / 总结
FG-CLIP 2 is a bilingual vision-language model designed to improve fine-grained alignment between visual content and linguistic descriptions, especially in non-English settings. It uses rich fine-grained supervision and multiple discriminative objectives, including the Textual Intra-modal Contrastive (TIC) loss, to better distinguish semantically similar captions. Trained on a curated mix of English and Chinese data, FG-CLIP 2 outperforms existing methods on 29 datasets across 8 tasks, achieving state-of-the-art results in both languages. A new benchmark for Chinese multimodal understanding is also introduced to facilitate evaluation.
FG-CLIP 2 是一种双语视觉-语言模型,旨在提高视觉内容与语言描述之间的细粒度对齐,特别是在非英语环境中。它使用丰富的细粒度监督和多个区分性目标,包括文本内模态对比损失(TIC损失),以更好地区分语义相似的描述。通过训练一个精心筛选的英语和中文数据混合集,FG-CLIP 2 在8个任务的29个数据集上超越现有方法,实现了双语领域的最新成果。还引入了一个新的中文多模态理解基准,以促进评估。
Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
First: 2025-10-17T08:37:45+00:00 · Latest: 2025-10-17T08:37:45+00:00
Abstract
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
中文标题/摘要
标题:在大型视觉-语言模型中学习检测未知越狱攻击
尽管进行了广泛的对齐努力,大型视觉-语言模型(LVLMs)仍然容易受到越狱攻击的影响,这带来了严重的安全风险。为了解决这一问题,现有的检测方法要么学习特定攻击的参数,这妨碍了对未见过攻击的泛化,要么依赖于启发式原则,这限制了准确性和效率。为了克服这些限制,我们提出了学习检测(LoD),这是一种通用框架,通过将重点从特定攻击的学习转移到特定任务的学习,准确地检测未知的越狱攻击。该框架包括一个多模态安全概念激活向量模块,用于安全导向的表示学习,以及一个安全模式自编码器模块,用于无监督攻击分类。广泛的实验表明,我们的方法在多种未知攻击上的检测AUROC始终更高,同时提高了效率。代码可在 https://anonymous.4open.science/r/Learning-to-Detect-51CB 获取。
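The abstract reports detection quality as AUROC. As a reminder of what that number measures, here is a minimal sketch that computes AUROC directly from its pairwise definition. The scores are made-up values, and treating a higher score (for example, a reconstruction error) as "more likely an attack" is an assumption for illustration, not something the abstract specifies.

```python
def auroc(benign_scores, attack_scores):
    """Probability that a random attack outscores a random benign input.

    Ties count as half a win, which matches the usual rank-based definition.
    """
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


# Hypothetical detector scores; higher is assumed to mean "more suspicious".
benign = [0.10, 0.20, 0.35, 0.40]
attacks = [0.30, 0.80, 0.90]
score = auroc(benign, attacks)
print(f"AUROC = {score:.3f}")
```

Because AUROC depends only on the ranking of scores, it needs no decision threshold, which is convenient when the attacks are unknown in advance.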
HumorDB: Can AI understand graphical humor?
Authors: Vedaant Jain, Felipe dos Santos Alves Feitosa, Gabriel Kreiman
First: 2024-06-19T13:51:40+00:00 · Latest: 2025-10-17T08:13:12+00:00
Comments: 10 main figures, 4 additional appendix figures
Abstract
Despite significant advancements in image segmentation and object detection, understanding complex scenes remains a significant challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces HumorDB, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs where subtle edits differentiate between humorous and non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, suggesting that an effective understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features and bridging the gap between visual perception and abstract reasoning. All the code and data are available here: https://github.com/kreimanlab/HumorDB
中文标题/摘要
标题:HumorDB:AI能否理解图形幽默?
尽管在图像分割和物体检测方面取得了显著进展,但理解复杂场景仍然是一个重大挑战。本文以图形幽默为例,探讨了需要阐明场景元素在先验认知知识背景下相互作用的图像解释问题。本文介绍了HumorDB,这是一个新颖的、受控的、精心策划的数据集,旨在评估和促进AI系统的视觉幽默理解。该数据集包含从照片、漫画、素描到AI生成内容的多样图像,包括细微对比对的图像对,其中细微的编辑区分了幽默和非幽默版本。我们评估了人类、最先进的视觉模型和大型视觉-语言模型在三项任务上的表现:二元幽默分类、幽默度评分预测和两两幽默比较。结果表明,当前的AI系统与人类的幽默理解之间存在差距。虽然预训练的视觉-语言模型比仅视觉模型表现更好,但在处理抽象素描和微妙的幽默线索方面仍然存在困难。注意力图分析显示,即使模型正确分类了幽默图像,它们也往往未能关注使图像变得有趣的精确区域。初步的机制可解释性研究和模型解释评估提供了不同架构处理幽默的初步见解。我们的结果指出了有希望的趋势和当前的局限性,表明有效理解视觉幽默需要能够检测细微上下文特征并弥合视觉感知与抽象推理之间差距的复杂架构。所有代码和数据均可在此获取:https://github.com/kreimanlab/HumorDB
Summary / 总结
This paper introduces HumorDB, a dataset designed to evaluate AI systems in understanding graphical humor. The dataset includes diverse images and minimally contrastive pairs. The study evaluates humans, vision models, and vision-language models on humor classification, funniness prediction, and pairwise comparison tasks. Results show that current AI systems lag behind human-level understanding, especially with abstract sketches and subtle humor cues. The analysis of attention maps indicates that models often fail to focus on the precise regions that make an image funny. The findings suggest the need for sophisticated architectures that can detect subtle contextual features and bridge the gap between visual perception and abstract reasoning.
该论文介绍了HumorDB数据集,旨在评估AI系统在理解图形幽默方面的能力。数据集包含多样化的图像和最小对比度的图像对。研究对人类、视觉模型和视觉-语言模型进行了幽默分类、趣味性预测和成对比较任务的评估。结果显示,当前的AI系统在抽象素描和微妙幽默提示方面的人类水平理解能力存在差距。注意力图的分析表明,模型往往无法聚焦于使图像有趣的精确区域。研究结果表明,需要具备检测微妙上下文特征并弥合视觉感知与抽象推理之间差距的复杂架构。
Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs
Authors: Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye
First: 2025-10-17T08:11:54+00:00 · Latest: 2025-10-17T08:11:54+00:00
Abstract
Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.
中文标题/摘要
标题:微调MedGemma用于临床图像描述,以增强基于马来西亚临床实践指南的多模态RAG
检索增强生成系统对于提供基于马来西亚临床实践指南的事实指导至关重要。然而,它们在处理基于图像的查询时效果有限,因为通用的视觉-语言模型配图往往缺乏临床特异性和事实依据。本研究提出并验证了一种框架,以专门化MedGemma模型,生成高保真度的配图,作为更优的查询。为克服数据稀缺性,我们采用知识蒸馏管道在皮肤科、眼底和胸部X光领域创建合成数据集,并使用参数高效的QLoRA方法微调MedGemma。通过双重框架严格评估性能,该框架同时测量分类准确性和通过RAGAS框架的新应用测量配图的忠实性、相关性和正确性。微调后的模型在分类性能上表现出显著改进,而RAGAS评估证实了配图忠实性和正确性有显著提升,验证了模型生成可靠、事实依据描述的能力。本研究建立了一个稳健的管道,用于专门化医疗视觉语言模型,并验证了生成的模型作为高质量查询生成器的能力,为增强基于证据的临床决策支持的多模态RAG系统奠定了基础。
Summary / 总结
This study aims to enhance the effectiveness of Retrieval-Augmented Generation (RAG) systems for providing fact-based guidance from Malaysian Clinical Practice Guidelines, especially for image-based queries. To address the issue of clinical specificity in general Vision-Language Model captions, the researchers propose and validate a framework to fine-tune the MedGemma model. They created a synthetic dataset using a knowledge distillation pipeline and employed the QLoRA method for parameter-efficient fine-tuning. The fine-tuned model showed substantial improvements in classification performance and significant gains in caption faithfulness and correctness, as evaluated by the RAGAS framework. This work establishes a robust pipeline for specializing medical Vision-Language Models and validates the model as a high-quality query generator for RAG systems.
本研究旨在增强 Retrieval-Augmented Generation (RAG) 系统在提供基于马来西亚临床实践指南的事实指导方面的有效性,尤其是针对图像查询。研究人员通过知识蒸馏管道创建了一个合成数据集,并使用 QLoRA 方法对 MedGemma 模型进行了微调。经过微调的模型在分类性能和标题忠实度和正确性方面表现出显著的改进,通过 RAGAS 框架评估验证了其作为可靠查询生成器用于多模态 RAG 系统的有效性。
MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Authors: Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
First: 2025-10-17T07:50:58+00:00 · Latest: 2025-10-17T07:50:58+00:00
Abstract
Most existing underwater instance segmentation approaches are constrained by closed-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (Marine Open-Vocabulary Instance Segmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by a lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both In-Domain and Cross-Domain settings on MARIS, establishing a strong foundation for future underwater perception research.
中文标题/摘要
标题:MARIS:海洋开放词汇实例分割与几何增强及语义对齐
大多数现有的水下实例分割方法受到封闭词汇预测的限制,难以识别新的海洋类别。为了支持评估,我们引入了MARIS(Marine Open-Vocabulary Instance Segmentation),这是第一个大规模细粒度的水下开放词汇(OV)分割基准,包含一组有限的已见类别和多种未见类别。尽管OV分割在自然图像上显示出前景,但我们的分析表明,将其迁移到水下场景时会受到严重的视觉退化(例如,颜色衰减)以及因缺乏水下类别定义而导致的语义错位的影响。为了解决这些问题,我们提出了一种统一框架,包含两个互补组件。几何先验增强模块(GPEM)利用稳定的部件级和结构线索,在退化视觉条件下保持对象一致性。语义对齐注入机制(SAIM)通过加入领域特定的先验丰富语言嵌入,减轻语义歧义并提高对未见类别的识别能力。实验表明,我们的框架在MARIS的域内和跨域设置中始终优于现有OV基线,为未来的水下感知研究奠定了坚实的基础。
Summary / 总结
The research aims to address the limitations of existing underwater instance segmentation methods that rely on closed-vocabulary prediction. To tackle this, the authors introduce MARIS, a benchmark for marine Open-Vocabulary (OV) segmentation. They propose a unified framework with two components: GPEM, which enhances geometric priors to maintain object consistency under degraded visual conditions, and SAIM, which enriches language embeddings with domain-specific priors to improve recognition of unseen categories. Experiments demonstrate that this framework outperforms existing OV baselines in both in-domain and cross-domain settings on MARIS, paving the way for future underwater perception research.
MARIS 是一个新的水下开放词汇实例分割基准,旨在解决封闭词汇预测的限制。论文提出了一个统一框架,包含几何先验增强模块 (GPEM) 和语义对齐注入机制 (SAIM),以应对水下场景中的视觉退化和语义错位问题。实验表明,该框架在 MARIS 的域内和跨域设置中均优于现有开放词汇基线,为未来水下感知研究奠定了坚实的基础。
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation
Authors: Ao Wang, Hui Chen, Jiaxin Li, Jianchao Tan, Kefeng Zhang, Xunliang Cai, Zijia Lin, Jungong Han, Guiguang Ding
Venue: NeurIPS 2025
First: 2024-12-04T15:48:59+00:00 · Latest: 2025-10-17T06:54:10+00:00
Comments: NeurIPS 2025 Camera-ready Version
Abstract
Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during next-token prediction. This results in significant contextual information loss for certain layers, leading to a notable performance decline. To address this, we present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with other methods. It exhibits superior inference efficiency and generation quality trade-offs, showing promising potential for practical applications. Code is available at https://github.com/THU-MIG/PrefixKV.
中文标题/摘要
标题:PrefixKV:视觉指令跟随模型高效生成所需的自适应前缀KV缓存
近年来,大型视觉-语言模型(LVLMs)因其在多种多模态输入下的强大生成和推理能力而迅速流行。然而,在推理过程中,这些模型会带来显著的计算和内存开销,极大地阻碍了其实用场景下的高效部署。由于输入和输出序列较长,广泛的键值(KV)缓存显著增加了推理成本。基于此,近期的研究工作探索了减少KV缓存大小的方法以提高效率。尽管有效,但它们通常忽略了KV向量在各层中的重要性分布,并在下个词预测时为每一层保持相同的缓存大小。这导致某些层的重要上下文信息丢失,从而导致性能下降。为解决这一问题,我们提出了PrefixKV,其中“前缀”是指基于重要性而非原始序列位置的排名靠前的KV。它将为所有层确定KV缓存大小的问题重新定义为寻找最优全局前缀配置的任务。基于二分搜索的自适应分层KV保留方案使得每一层可以保留最大上下文信息,从而促进生成。广泛的实验表明,我们的方法在与其他方法的比较中达到了最先进的性能。它展示了出色的推理效率和生成质量权衡,显示出在实际应用中的巨大潜力。代码可在https://github.com/THU-MIG/PrefixKV获取。
Summary / 总结
The paper introduces PrefixKV, an adaptive prefix key-value (KV) cache mechanism designed to enhance the efficiency of vision-language models during inference. It addresses the issue of significant computational and memory overhead by focusing on the varying importance of KV vectors across layers. Through an adaptive layer-wise KV retention strategy based on binary search, PrefixKV preserves maximum contextual information, leading to superior inference efficiency and generation quality. Experiments show that PrefixKV outperforms existing methods in terms of performance and efficiency trade-offs, making it promising for practical applications.
论文提出了PrefixKV,一种自适应前缀键值(KV)缓存方法,旨在提高视觉-语言模型在推理过程中的效率。受广泛KV缓存导致的高计算和内存成本的驱动,作者提出了一种新方法,动态选择每个层中最重要的一些KV向量,从而减少性能损失。实验表明,PrefixKV在保持高效推理的同时,生成质量也达到最佳,显示出在实际应用中的巨大潜力。
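The idea of turning per-layer cache sizing into a search over one global setting can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the importance scores are invented integers, and the rule "each layer keeps the shortest importance-ranked prefix whose share of that layer's total importance reaches a shared threshold `p`" is a hypothetical reading of "global prefix configuration". The function names `prefix_len` and `search_prefix_config` are also hypothetical.

```python
def prefix_len(importance, p):
    """Smallest number of top-ranked entries whose importance share reaches p."""
    ranked = sorted(importance, reverse=True)
    target = p * sum(ranked)
    running = 0.0
    for k, value in enumerate(ranked, start=1):
        running += value
        if running >= target:
            return k
    return len(ranked)


def search_prefix_config(layers, budget, iters=30):
    """Binary-search one shared threshold so the summed prefixes fit the budget.

    Total retained entries grow monotonically with the threshold, which is
    what makes binary search applicable. Each layer keeps at least one entry.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        used = sum(prefix_len(layer, mid) for layer in layers)
        if used <= budget:
            lo = mid  # fits: try to retain more
        else:
            hi = mid  # over budget: retain less
    return [prefix_len(layer, lo) for layer in layers]


# Hypothetical importance scores: one peaked layer, one flat layer.
layers = [[70, 10, 10, 10], [25, 25, 25, 25]]
config = search_prefix_config(layers, budget=4)
print(config)
```

With a budget of four entries, a uniform split would give each layer two. Under the assumed rule, the peaked layer needs only one entry to cover most of its importance, so the flat layer can keep three.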
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
Authors: Shaoqi Dong, Chaoyou Fu, Haihan Gao, Yi-Fan Zhang, Chi Yan, Chu Wu, Xiaoyu Liu, Yunhang Shen, Jing Huo, Deqiang Jiang, Haoyu Cao, Yang Gao, Xing Sun, Ran He, Caifeng Shan
First: 2025-10-10T17:59:56+00:00 · Latest: 2025-10-17T04:19:59+00:00
Comments: Homepage: https://ltbai.github.io/VITA-VLA/
Abstract
Vision-Language Action (VLA) models significantly advance robotic manipulation by leveraging the strong perception capabilities of pretrained vision-language models (VLMs). By integrating action modules into these pretrained models, VLA methods exhibit improved generalization. However, training them from scratch is costly. In this work, we propose a simple yet effective distillation-based framework that equips VLMs with action-execution capability by transferring knowledge from pretrained small action models. Our architecture retains the original VLM structure, adding only an action token and a state encoder to incorporate physical inputs. To distill action knowledge, we adopt a two-stage training strategy. First, we perform lightweight alignment by mapping VLM hidden states into the action space of the small action model, enabling effective reuse of its pretrained action decoder and avoiding expensive pretraining. Second, we selectively fine-tune the language model, state encoder, and action modules, enabling the system to integrate multimodal inputs with precise action generation. Specifically, the action token provides the VLM with a direct handle for predicting future actions, while the state encoder allows the model to incorporate robot dynamics not captured by vision alone. This design yields substantial efficiency gains over training large VLA models from scratch. Compared with previous state-of-the-art methods, our method achieves 97.3% average success rate on LIBERO (11.8% improvement) and 93.5% on LIBERO-LONG (24.5% improvement). In real-world experiments across five manipulation tasks, our method consistently outperforms the teacher model, achieving 82.0% success rate (17% improvement), which demonstrates that action distillation effectively enables VLMs to generate precise actions while substantially reducing training costs.
中文标题/摘要
标题:VITA-VLA:通过动作专家蒸馏高效训练视觉语言模型执行动作
视觉语言动作(VLA)模型通过利用预训练视觉语言模型(VLMs)的强大感知能力,显著推动了机器人操作的进步。通过将动作模块整合到这些预训练模型中,VLA方法展示了更好的泛化能力。然而,从头开始训练它们成本高昂。在本研究中,我们提出了一种简单而有效的基于蒸馏的框架,通过从预训练的小动作模型中转移知识,使VLMs具备执行动作的能力。我们的架构保留了原始VLM结构,仅添加了一个动作标记和一个状态编码器以纳入物理输入。为了蒸馏动作知识,我们采用两阶段训练策略。首先,我们进行轻量级对齐,将VLM隐藏状态映射到小动作模型的动作空间,从而有效利用其预训练的动作解码器并避免昂贵的预训练。其次,我们选择性地微调语言模型、状态编码器和动作模块,使系统能够结合多模态输入并精确生成动作。具体而言,动作标记为VLM提供了直接预测未来动作的手段,而状态编码器使模型能够结合仅凭视觉无法捕捉到的机器人动力学。此设计在从头开始训练大型VLA模型时实现了显著的效率提升。与之前最先进的方法相比,我们的方法在LIBERO上实现了97.3%的平均成功率(提高11.8%),在LIBERO-LONG上实现了93.5%的成功率(提高24.5%)。在五个操作任务的现实世界实验中,我们的方法始终优于教师模型,实现了82.0%的成功率(提高17%),这表明动作蒸馏有效地使VLMs能够生成精确的动作,同时大幅降低了训练成本。
Summary / 总结
This work addresses the high cost of training Vision-Language Action (VLA) models from scratch by proposing a distillation-based framework. It transfers knowledge from pretrained small action models to vision-language models, adding an action token and a state encoder. The method achieves significant improvements in success rates on the LIBERO and LIBERO-LONG datasets, and outperforms the teacher model in real-world manipulation tasks, demonstrating efficiency gains and effective action generation.
研究旨在通过从预训练的小动作模型中提取知识,提高视觉-语言模型执行机器人操作任务的效率。方法包括在原有视觉-语言模型架构中添加动作标记和状态编码器,并采用两阶段训练策略对模型的隐藏状态进行对齐。这种方法显著降低了训练成本,同时在机器人操作任务中取得了高成功率。具体来说,所提出的方法在LIBERO和LIBERO-LONG数据集上优于之前的最先进方法,并在五个操作任务的现实世界实验中表现出一致的改进。
Exemplar-Guided Planing: Enhanced LLM Agent for KGQA
Authors: Jingao Xu, Shuoyoucheng Ma, Xin Song, Rong Jiang, Hongkui Tu, Bin Zhou
First: 2025-10-17T03:43:06+00:00 · Latest: 2025-10-17T03:43:06+00:00
Abstract
Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM's planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.
中文标题/摘要
标题:基于范例引导的规划:增强的LLM代理用于KGQA
大型语言模型(LLMs)作为交互式代理在知识图谱问答(KGQA)中显示出显著的潜力,但往往难以弥合自然语言查询与结构化知识图谱(KG)表示之间的语义差距。这导致在KG上的规划不理想且探索效率低下,而无训练方法往往未能充分利用训练数据中的宝贵推理模式。为解决这些局限性,我们提出了一种新颖的框架——基于范例引导的规划(EGP),该框架增强了LLM代理在KGQA中的规划能力。EGP首先通过实体模板化预处理训练集问题以标准化语义变体。然后,使用语义嵌入和高效的FAISS索引从预处理集中检索高度相似的范例问题及其成功的推理路径。这些检索到的范例动态地在两个关键阶段引导LLM的规划过程:(1)任务分解,通过将生成的子目标与已验证的推理步骤对齐;(2)关系探索,通过提供高质量的辅助信息以提高关系剪枝的准确性。此外,我们在关系探索过程中引入了一种智能前瞻机制,以提高效率,通过预先探索有希望的路径并可能提前终止探索。我们将在Plan-on-Graph(PoG)框架上应用EGP,称为PoG-EGP。在两个真实世界的KGQA数据集WebQSP和CWQ上的广泛实验表明,PoG-EGP显著优于基准PoG系统和其他比较方法。
Summary / 总结
The research aims to enhance the planning capabilities of Large Language Models (LLMs) for Knowledge Graph Question Answering (KGQA) by addressing the semantic gap between natural language queries and structured knowledge graphs. The Exemplar-Guided Planning (EGP) framework preprocesses training questions and retrieves similar exemplary questions and their reasoning paths to guide the LLM's planning process. Key findings show that the PoG-EGP framework significantly outperforms the baseline PoG system and other methods on WebQSP and CWQ datasets.
论文提出了一种Exemplar-Guided Planning (EGP)框架,以增强大型语言模型(LLM)在知识图谱问答(KGQA)中的规划能力。EGP对训练问题进行预处理,并检索相似的范例问题来指导LLM在任务分解和关系探索过程中的规划。实验结果表明,PoG-EGP在WebQSP和CWQ数据集上的表现优于基线PoG系统和其他方法。
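The templating-then-retrieval step can be illustrated with a small, self-contained sketch. Several things here are stand-ins rather than the paper's setup: a bag-of-words vector replaces the semantic embeddings, a brute-force cosine scan replaces the FAISS index, and the example questions, entity lists, and relation names are hypothetical.

```python
import math
from collections import Counter


def template(question, entities):
    """Replace known entity mentions with a placeholder, then lowercase.

    Longer mentions are replaced first so that a short entity name cannot
    clobber part of a longer one. Lowercasing turns the placeholder into
    "[ent]", which is applied consistently to queries and exemplars.
    """
    for entity in sorted(entities, key=len, reverse=True):
        question = question.replace(entity, "[ENT]")
    return question.lower()


def embed(text):
    """Toy embedding: whitespace-token counts (stand-in for a sentence encoder)."""
    return Counter(text.split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, store, k=1):
    """Return the k exemplars whose templates are most similar to the query."""
    q = embed(query)
    return sorted(store, key=lambda ex: cosine(q, embed(ex["template"])), reverse=True)[:k]


# Hypothetical exemplar store: templated training questions with reasoning paths.
store = [
    {"template": template("Who directed Inception?", ["Inception"]), "path": ["film.directed_by"]},
    {"template": template("Where was Barack Obama born?", ["Barack Obama"]), "path": ["person.place_of_birth"]},
]
query = template("Who directed Titanic?", ["Titanic"])
best = retrieve(query, store)[0]
print(query, "->", best["path"])
```

Templating is what lets a question about one film match an exemplar about another: once the entity is masked, the two questions become identical strings and the stored reasoning path can guide planning.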
Scope: Selective Cross-modal Orchestration of Visual Perception Experts
Authors: Tianyu Zhang, Suyuchen Wang, Chao Wang, Juan Rodriguez, Ahmed Masry, Xiangru Jian, Yoshua Bengio, Perouz Taslakian
First: 2025-10-14T20:33:01+00:00 · Latest: 2025-10-17T03:30:31+00:00
Comments: 14 pages, 2 figures
Abstract
Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
中文标题/摘要
标题:SCOPE:视觉感知专家的选择性跨模态编排
视觉语言模型(VLMs)可以从多个视觉编码器中受益,但简单地堆叠它们会导致收益递减并增加推理成本。我们提出SCOPE,这是一种混合编码器(MoEnc)框架,通过实例级路由动态选择每幅图像-文本对的专门编码器,不同于传统MoE中的令牌级路由。SCOPE保持一个共享编码器和一个路由编码器池。一个轻量级的路由器使用文本提示与共享视觉特征之间的交叉注意力来从路由编码器中选择最优编码器。为了训练这个路由器,我们引入了双重熵正则化和辅助损失,以平衡数据集级负载分布与实例级路由置信度。令人惊讶的是,使用一个共享编码器加上一个路由编码器的SCOPE在使用四个额外编码器同时工作的模型上表现更好,同时计算量减少了24-49%。这表明智能编码器选择优于暴力聚合,挑战了多编码器VLMs中的主导范式。
Summary / 总结
SCOPE is a Mixture-of-Encoders framework that dynamically selects a specialized encoder for each image-text pair, improving the efficiency of vision-language models without compromising performance. It uses a lightweight router based on cross-attention to choose the best encoder from a pool of options, reducing inference costs by 24-49% compared to models using all encoders simultaneously. This shows that intelligent encoder selection is more effective than brute-force aggregation.
SCOPE是一种Mixture-of-Encoders框架,通过轻量级的路由机制基于交叉注意力动态选择每个图像-文本对的专用编码器。该方法在保持或提升性能的同时,比同时使用所有额外编码器的模型表现更好,计算成本降低了24-49%。使用双重熵正则化来平衡数据集级别的负载分布与实例级别的路由置信度。
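The tension between dataset-level load balance and instance-level routing confidence can be made concrete with two entropies. The sketch below is an assumption about the general shape of such a regularizer, since the abstract does not give the loss: per-instance entropy should be low (each input picks one encoder decisively) while the entropy of the batch-averaged routing distribution should be high (encoders are used evenly). The router logits and the unweighted combination in `regularizer` are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def dual_entropy(batch_logits):
    """Return (mean per-instance entropy, entropy of the batch-mean distribution)."""
    probs = [softmax(row) for row in batch_logits]
    instance = sum(entropy(p) for p in probs) / len(probs)
    mean = [sum(p[j] for p in probs) / len(probs) for j in range(len(probs[0]))]
    return instance, entropy(mean)


def regularizer(batch_logits):
    """Quantity to minimize: confident instances, balanced batch (equal weights assumed)."""
    instance, batch = dual_entropy(batch_logits)
    return instance - batch


def route(logits):
    """Instance-level routing: one encoder index per image-text pair."""
    return max(range(len(logits)), key=lambda j: logits[j])


# Hypothetical router logits for three inputs over three routed encoders.
balanced = [[8, 0, 0], [0, 8, 0], [0, 0, 8]]   # confident and evenly spread
collapsed = [[8, 0, 0], [8, 0, 0], [8, 0, 0]]  # confident but one encoder takes all
print(regularizer(balanced), regularizer(collapsed))
```

Both batches route confidently, but only the first spreads load across encoders, so it receives the lower (better) regularizer value. Minimizing instance entropy alone would not distinguish the two cases.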
FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Authors: Hao Yin, Lijun Gu, Paritosh Parmar, Lin Xu, Tianxiao Guo, Weiwei Fu, Yang Zhang, Tianyou Zheng
First: 2025-06-02T01:44:02+00:00 · Latest: 2025-10-17T03:26:07+00:00
Comments: Dataset and code are available at https://github.com/HaoYin116/FLEX . Link to Project page https://haoyin116.github.io/FLEX_Dataset
Abstract
Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video→EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at https://github.com/HaoYin116/FLEX . Project page: https://haoyin116.github.io/FLEX_Dataset
中文标题/摘要
标题:FLEX:用于健身动作质量评估的大规模多模态多视角数据集
动作质量评估(AQA),即量化动作执行质量的任务,在健身房重量训练中具有巨大潜力,准确的反馈对于预防受伤和最大化收益至关重要。现有的AQA数据集仅限于单视角竞技运动和RGB视频,缺乏多模态信号和健身动作的专业评估。我们介绍了FLEX,这是首个用于健身AQA的大规模多模态多视角数据集,包含表面肌电图(sEMG)。FLEX包含超过7,500个20种负重锻炼动作的多视角记录,由38名不同技能水平的受试者完成,配有同步的RGB视频、3D姿态、sEMG和生理信号。专家注释组织成健身知识图谱(FKG),链接动作、关键步骤、错误类型和反馈,支持组合评分函数以实现可解释的质量评估。FLEX支持多模态融合、跨模态预测(包括新颖的Video→EMG任务)以及生物力学导向的表示学习。基于FKG,我们进一步引入了FLEX-VideoQA,这是一个具有层次查询的结构化问答基准,驱动视觉语言模型中的跨模态推理。基线实验表明,多模态输入、多视角视频和细粒度注释显著提升了AQA性能。FLEX因此推动了AQA向更丰富的多模态环境发展,并为基于AI的健身评估和指导提供了基础。数据集和代码可在 https://github.com/HaoYin116/FLEX 获取。项目页面链接:https://haoyin116.github.io/FLEX_Dataset
Summary / 总结
The research aims to improve Action Quality Assessment (AQA) in fitness training by developing FLEX, a large-scale multimodal and multiview dataset that includes surface electromyography (sEMG) signals. The dataset contains over 7,500 recordings of 20 weight-loaded exercises performed by 38 subjects, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) to support a compositional scoring function. Experimental results show that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance, enabling multimodal fusion and cross-modal prediction tasks. This dataset advances AQA towards richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching.
研究旨在通过开发名为FLEX的新数据集来改进健身训练中的动作质量评估(AQA),该数据集包含多模态和多视角数据,如RGB视频、3D姿态、表面肌电图(sEMG)和生理信号。FLEX包含来自38名不同技能水平的受试者超过7,500次20项负重练习的录制,专家注释形成了健身知识图谱(FKG),用于结构化评分。实验结果表明,多模态输入和细粒度注释显著提高了AQA性能,使健身评估和指导能够处于更丰富的多模态环境中。
General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting
Authors: Bernard Lange, Anil Yildiz, Mansur Arief, Shehryar Khattak, Mykel Kochenderfer, Georgios Georgakis
First: 2025-06-20T20:06:14+00:00 · Latest: 2025-10-17T03:19:22+00:00
Abstract
Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed information flows, limiting their generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge for reasoning and planning, but prior LVLM-robot integrations have largely depended on pre-mapped spaces, hard-coded representations, and rigid control logic. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools drawn from modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query modules, reason over multimodal inputs, and select navigation actions. This agentic formulation enables robust navigation and reasoning in previously unmapped environments, offering a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA outperforms state-of-the-art EQA-specific approaches. Qualitative results on RxR and custom tasks further demonstrate its ability to generalize across a broad range of navigation challenges.
中文标题/摘要
标题:通用机器人导航通过LVLM协调感知、推理和行动
开发适用于未知环境的通用导航策略仍然是机器人技术中的核心挑战。大多数现有系统依赖于任务特定的神经网络和固定的信息流,限制了它们的泛化能力。大型视觉-语言模型(LVLM)通过嵌入类人的知识来进行推理和规划,提供了一种有前景的替代方案,但之前的LVLM-机器人集成主要依赖于预先映射的空间、硬编码的表示和刚性控制逻辑。我们引入了代理机器人导航架构(ARNA),这是一种通用框架,为基于LVLM的代理配备了来自现代机器人堆栈的感知、推理和导航工具库。在运行时,代理自主定义和执行任务特定的工作流,迭代查询模块、处理多模态输入并选择导航动作。这种代理形式使代理能够在未映射的环境中实现稳健的导航和推理,为机器人堆栈设计提供了新的视角。在Habitat Lab上的HM-EQA基准测试中,ARNA优于最先进的EQA特定方法。在RxR和自定义任务上的定性结果进一步证明了其在广泛导航挑战中的泛化能力。
Summary / 总结
The research aims to develop a general-purpose navigation policy for unknown environments, addressing the limitations of task-specific neural networks and fixed information flows. The Agentic Robotic Navigation Architecture (ARNA) equips an LVLM-based agent with a library of perception, reasoning, and navigation tools, so that the agent autonomously defines and executes task-specific workflows, iteratively querying modules, reasoning over multimodal inputs, and selecting navigation actions. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA outperforms state-of-the-art EQA-specific approaches, and qualitative results on RxR and custom tasks indicate that it generalizes across a broad range of navigation challenges.
研究旨在开发适用于未知环境的一般导航策略,解决任务特定神经网络和固定信息流的局限性。Agentic Robotic Navigation Architecture (ARNA) 将 LVLM 与现代机器人堆栈中的感知、推理和导航工具集成。ARNA 自动定义并执行任务特定的工作流,查询模块、处理多模态输入并选择导航动作。实验结果表明,ARNA 在 HM-EQA 基准测试中优于最先进的 EQA 特定方法,并且能够跨各种导航挑战进行泛化。
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Authors: Zijian Song, Xiaoxin Lin, Qiuming Huang, Guangrun Wang, Liang Lin
First: 2025-06-17T13:40:00+00:00 · Latest: 2025-10-17T02:36:30+00:00
Comments: 20 pages, 11 figures
Abstract
Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
中文标题/摘要
标题:SIRI-Bench:通过复杂推理任务挑战VLM的空间智能
大型语言模型(LLMs)取得了快速进步,主要归功于在复杂推理任务上的强化学习。相比之下,虽然空间智能对于视觉-语言模型(VLMs)在现实世界交互中至关重要,但对其复杂的空间推理的系统研究仍相对不足。为弥补这一差距,我们引入了SIRI-Bench,这是一个旨在通过空间关联推理任务评估VLMs结构空间智能的基准。SIRI-Bench 包含9,000个视频-问题-答案三元组,每个问题都嵌入在现实的3D场景中。该基准精心设计,使得解决每个问题都需要空间理解与结构推理。为了促进大规模数据合成,我们开发了一个自动场景生成引擎,该引擎利用协作的LLM代理将抽象的数学问题转化为忠实的3D场景。实验结果表明,最先进的VLMs在SIRI-Bench上面临巨大挑战,突显了结构空间推理的难度。我们希望我们的研究能够引起研究人员对空间关联推理的关注,并推动VLMs在视觉问题解决方面的进步。
Summary / 总结
SIRI-Bench is a benchmark designed to evaluate VLMs' spatial intelligence through complex reasoning tasks in realistic 3D scenes. It consists of 9,000 video-question-answer triplets, requiring both spatial comprehension and structural reasoning. State-of-the-art VLMs perform poorly on this benchmark, highlighting the difficulty of structural spatial reasoning. The study aims to draw attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
SIRI-Bench 是一个基准,旨在通过复杂的空间推理任务评估 VLMs 的结构空间智能。它包含 9,000 个视频-问题-答案三元组,设置在现实的 3D 场景中,需要空间理解和结构推理。自动场景生成引擎使用协作的 LLM 代理将抽象的数学问题转化为忠实的 3D 场景。最先进的 VLMs 在 SIRI-Bench 上表现不佳,突显了结构空间推理的难度。本研究旨在引起研究人员对空间推理的关注,并推动 VLMs 在视觉问题解决方面的进步。
Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
Authors: Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul
First: 2025-10-17T01:44:28+00:00 · Latest: 2025-10-17T01:44:28+00:00
Abstract
Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and that the VLAT training strategy provides substantial improvements, achieving strong performance on long audio of unseen lengths.
中文标题/摘要
标题:扩展音频上下文以增强大型音频语言模型的长音频理解
大型音频语言模型(LALMs)通常受限于短音频上下文窗口,即使其文本后端支持长上下文,也限制了对长音频的理解。先前的工作已经在单模态LLM上引入了上下文扩展方法(例如YaRN),但其在LALMs中的应用尚未被探索。首先,基于RoPE的上下文扩展,我们引入了Partial YaRN,这是一种无需训练、仅修改音频标记位置的音频扩展方法,保留了基LLM的文本能力。其次,我们提出了虚拟长音频训练(VLAT),这是一种训练策略,将Partial YaRN扩展为训练时的位置增强。VLAT在训练过程中模拟了多种音频长度,使模型能够泛化到远长于训练中所见的输入,并提高了对长上下文音频理解的鲁棒性。我们在SALMONN和Qwen2-Audio上的实验表明,Partial YaRN在各种设置中均优于原始模型,而VLAT训练策略提供了显著的改进,实现了对未见过长度的长音频的强性能。
Summary / 总结
This paper addresses the short audio context windows of Large Audio-Language Models (LALMs) by introducing Partial YaRN, a training-free method that modifies only audio token positions and so extends audio context without affecting text capabilities. VLAT, a training strategy, turns Partial YaRN into a training-time positional augmentation that simulates diverse audio lengths. Experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and that VLAT substantially improves performance on long audio of unseen lengths.
研究旨在通过解决大型音频语言模型(LALM)中的短音频上下文窗口问题,增强其对长音频的理解能力。研究引入了Partial YaRN,这是一种无需训练的方法,通过仅修改音频标记位置来扩展音频上下文,以及VLAT训练策略,在训练过程中模拟多种音频长度,以提高泛化能力和鲁棒性。实验结果表明,Partial YaRN在各种设置中均优于原模型,而VLAT进一步提升了对未见过的长音频输入的性能。
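As a rough illustration of the audio-only idea behind Partial YaRN, the sketch below compresses only the audio span of a position-id sequence and leaves text strides untouched. It uses plain linear position interpolation rather than YaRN's frequency-dependent scaling. The function name `partial_position_ids`, its parameters, and the choice to resume trailing text right after the compressed audio span are assumptions made for illustration, not the authors' implementation.

```python
def partial_position_ids(n_prefix, n_audio, n_suffix, audio_window):
    """Position ids where only the audio span is squeezed into `audio_window` slots.

    n_prefix / n_suffix: number of text tokens before / after the audio (hypothetical layout).
    audio_window: audio length the base model is assumed to handle well.
    """
    # Audio that already fits keeps its ordinary integer positions (scale 1.0).
    scale = min(1.0, audio_window / n_audio) if n_audio else 1.0
    prefix = [float(i) for i in range(n_prefix)]
    # Only audio positions are rescaled; their stride shrinks from 1 to `scale`.
    audio = [n_prefix + i * scale for i in range(n_audio)]
    # Assumption: trailing text resumes after the compressed span with unit stride.
    start = n_prefix + n_audio * scale
    suffix = [start + i for i in range(n_suffix)]
    return prefix + audio + suffix


ids = partial_position_ids(n_prefix=1, n_audio=4, n_suffix=1, audio_window=2)
print(ids)  # [0.0, 1.0, 1.5, 2.0, 2.5, 3.0]
```

Text tokens keep a stride of 1 on both sides of the audio, which is the property the abstract ties to preserving the base LLM's text capabilities.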
D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Authors: Jisu Han, Wonjun Hwang
First: 2025-10-10T15:27:44+00:00 · Latest: 2025-10-17T01:32:13+00:00
Comments: Corrected typos
Abstract
The test-time adaptation paradigm provides flexibility under domain shifts by immediately adapting the source model on unlabeled target data. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can reduce calibration error. Building on this insight, we propose dimensional entropy maximization, which regularizes the distribution of textual features toward uniformity to mitigate the dependency on dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
中文标题/摘要
标题:D-TPT:维度熵最大化在视觉语言模型测试时提示调优校准中的应用
测试时适应范式通过在源模型的未标记目标数据上进行即时适应,提供了对领域转移的灵活性。视觉语言模型(VLMs)利用其泛化能力处理各种下游任务,而测试时提示调优已成为适应VLMs的突出解决方案。在本文中,我们探索对比VLMs,并发现由于各模态中单一主导特征维度的差异导致的模态差距。我们观察到,文本和图像模态中的主导维度表现出高度的预测敏感性,限制其影响可以改善校准误差。基于这一见解,我们提出维度熵最大化,通过使文本特征分布趋于均匀来减轻主导维度的依赖性。我们的方法缓解了测试时提示调优中的校准性能下降,提供了一种简单而有效的解决方案,以增强VLMs在实际部署场景中的可靠性。
Summary / 总结
This work addresses the challenge of calibrating test-time prompt tuning in Vision-Language Models (VLMs) by exploring contrastive VLMs and identifying a modality gap caused by a single dominant feature dimension. The authors propose dimensional entropy maximization, which regularizes textual features toward uniformity, reducing the influence of dominant dimensions and lowering calibration error. This method enhances the reliability of VLMs in real-world deployment scenarios.
该研究通过探索对比视觉语言模型(VLMs)并识别由于跨模态单一主导特征维度导致的模态差距,解决了测试时提示调优校准的问题。作者提出通过最大化文本特征分布的熵来减少对主导维度的依赖,从而降低校准误差。该方法能够增强VLMs在实际部署场景中的可靠性。
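One way to read "regularizing textual features toward uniformity" is as an entropy term computed over feature dimensions. The sketch below is a minimal version under assumptions of our own: each text feature is turned into a distribution over its dimensions by a softmax on absolute activations, and the regularizer is the negative mean entropy. The names `dimensional_entropy` and `entropy_regularizer` are hypothetical, and the paper's exact formulation may differ.

```python
import math


def dimensional_entropy(feature):
    """Entropy of a softmax distribution over the dimensions of one feature vector."""
    mags = [abs(x) for x in feature]
    peak = max(mags)  # subtract the peak for numerical stability
    exps = [math.exp(m - peak) for m in mags]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularizer(text_features):
    """Negative mean entropy; minimizing it pushes features toward uniformity."""
    entropies = [dimensional_entropy(f) for f in text_features]
    return -sum(entropies) / len(entropies)


balanced = [1.0, 1.0, 1.0, 1.0]   # no dominant dimension
dominated = [10.0, 0.0, 0.0, 0.0]  # one dimension dominates
print(dimensional_entropy(balanced) > dimensional_entropy(dominated))  # True
```

In this toy form the entropy peaks at log(D) for a D-dimensional feature with equal magnitudes, so adding the regularizer to a prompt-tuning loss would penalize features that lean on one dimension.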
MLFM: Multi-Layered Feature Maps for Richer Language Understanding in Zero-Shot Semantic Navigation
Authors: Sonia Raychaudhuri, Enrico Cancelli, Tommaso Campari, Lamberto Ballan, Manolis Savva, Angel X. Chang
First: 2025-07-09T21:46:43+00:00 · Latest: 2025-10-17T00:58:38+00:00
Abstract
Recent progress in large vision-language models has driven improvements in language-based semantic navigation, where an embodied agent must reach a target object described in natural language. Yet we still lack a clear, language-focused evaluation framework to test how well agents ground the words in their instructions. We address this gap by proposing LangNav, an open-vocabulary multi-object navigation dataset with natural language goal descriptions (e.g., 'go to the red short candle on the table') and corresponding fine-grained linguistic annotations (e.g., attributes: color=red, size=short; relations: support=on). These labels enable systematic evaluation of language understanding. To evaluate in this setting, we extend the multi-object navigation task to Language-guided Multi-Object Navigation (LaMoN), where the agent must find a sequence of goals specified using language. Furthermore, we propose Multi-Layered Feature Map (MLFM), a novel method that builds a queryable, multi-layered semantic map from pretrained vision-language features and proves effective for reasoning over fine-grained attributes and spatial relations in goal descriptions. Experiments on LangNav show that MLFM outperforms state-of-the-art zero-shot mapping-based navigation baselines.
中文标题/摘要
标题:MLFM:多层特征图在零样本语义导航中的丰富语言理解
大型视觉-语言模型的最新进展推动了基于语言的语义导航的进步,在这种导航中,一个具身代理必须根据自然语言描述到达目标物体。然而,我们仍然缺乏一个清晰的语言导向评估框架来测试代理如何将指令中的词语进行语义化。为此,我们提出了LangNav,这是一个包含自然语言目标描述(例如,“去桌子上的那支红色短蜡烛”)和相应的细粒度语言注释(例如,属性:颜色=红色,大小=短;关系:支撑=在上)的开放词汇多对象导航数据集。这些标签使语言理解的系统评估成为可能。为了在这一环境中进行评估,我们将多对象导航任务扩展为语言引导的多对象导航(LaMoN),其中代理必须找到用语言指定的一系列目标。此外,我们提出了多层特征图(MLFM),这是一种新颖的方法,可以从预训练的视觉-语言特征构建可查询的多层语义地图,并证明在目标描述中的细粒度属性和空间关系推理方面非常有效。LangNav上的实验表明,MLFM优于最先进的零样本映射导航基线。
Summary / 总结
The research aims to improve language understanding in zero-shot semantic navigation by developing a clear evaluation framework and a new method. The method, Multi-Layered Feature Map (MLFM), constructs a queryable semantic map from pretrained vision-language features, which is effective for reasoning over fine-grained attributes and spatial relations. Experiments on the LangNav dataset demonstrate that MLFM outperforms existing zero-shot mapping-based navigation approaches.
研究旨在通过建立清晰的评估框架和新方法来提高零样本语义导航中的语言理解能力。该方法,多层特征图(MLFM),从预训练的视觉-语言特征构建可查询的语义地图,适用于对目标描述中的细粒度属性和空间关系进行推理。在LangNav数据集上的实验表明,MLFM优于现有的零样本映射导航基准方法。
SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow
Authors: Kenan Tang, Yanhong Li, Yao Qin
Venue: NeurIPS
First: 2025-04-13T19:13:04+00:00 · Latest: 2025-10-16T23:37:21+00:00
Comments: The paper has been accepted to NeurIPS Creative AI Track 2025. Figure 4(c) has been accepted to CVPR AI Art Gallery 2025
Abstract
Prompt-based models have demonstrated impressive prompt-following capability at image editing tasks. However, the models still struggle with following detailed editing prompts or performing local edits. Specifically, global image quality often deteriorates immediately after a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality during more than 100 editing steps, while keeping the unedited regions intact. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.
中文标题/摘要
标题:SPICE:一种协同、精确、迭代和可定制的图像编辑工作流
基于提示的模型在图像编辑任务中展示了出色的提示跟随能力。然而,这些模型仍然难以遵循详细的编辑提示或执行局部编辑。具体来说,单次编辑步骤后,全局图像质量往往会迅速下降。为了解决这些挑战,我们引入了SPICE,这是一种无需训练的工作流,可以接受任意分辨率和纵横比,准确地遵循用户要求,并在超过100次编辑步骤中持续提高图像质量,同时保持未编辑区域不变。通过结合基础扩散模型和Canny边缘ControlNet模型的优势,SPICE能够稳健地处理用户的自由形式编辑指令。在一项具有挑战性的现实图像编辑数据集上,SPICE在定量上优于最先进的基线,并且始终被人类标注者偏好。我们为流行的扩散模型Web UI发布了工作流实现,以支持进一步的研究和艺术探索。
GradES: Significantly Faster Training in Transformers with Gradient-Based Early Stopping
Authors: Qifu Wen, Xi Zeng, Zihan Zhou, Shuaijun Liu, Mehdi Hosseinzadeh, Ningxin Su, Reza Rawassizadeh
First: 2025-09-01T23:51:12+00:00 · Latest: 2025-10-16T18:38:51+00:00
Comments: 20 pages, 5 figures
Abstract
Early stopping monitors global validation loss and halts all parameter updates simultaneously, which is computationally costly for large transformers due to the extended time required for validation inference. We propose GradES, a novel gradient-based early stopping approach that operates within transformer components (attention projections and feed-forward layer matrices). We found that different components converge at varying rates during fine-tuning for both language and vision-language models. GradES tracks the magnitude of gradient changes in backpropagation for these matrices during training. When the magnitude of a projection matrix's gradient changes falls below a convergence threshold τ, we exclude that projection matrix from further updates individually, eliminating costly validation passes while allowing slowly converging matrices to continue learning. GradES speeds up training by 1.57-7.22× while simultaneously enhancing generalization through early prevention of overfitting, resulting in 1.2% higher average accuracy on language tasks and 3.88% on multimodal benchmarks.
中文标题/摘要
标题:GradES:基于梯度的早期停止方法显著加速Transformer模型的训练
早期停止监控全局验证损失并在训练过程中同时停止所有参数更新,这在大型Transformer模型中由于验证推理所需时间较长而计算成本高昂。我们提出了一种新颖的基于梯度的早期停止方法GradES,该方法在Transformer组件(注意力投影和前馈层矩阵)内部运行。我们发现,在微调过程中,语言模型和视觉-语言模型的不同组件以不同的速率收敛。GradES 在训练过程中跟踪这些矩阵在反向传播中的梯度变化幅度。当一个投影矩阵的梯度变化幅度低于收敛阈值 τ 时,我们单独排除该投影矩阵的进一步更新,从而消除昂贵的验证过程,同时允许慢收敛的矩阵继续学习。GradES 使训练时间加快了 1.57–7.22 倍,同时通过早期防止过拟合提高泛化能力,从而在语言任务中平均准确率提高了 1.2%,在多模态基准测试中提高了 3.88%。
Summary / 总结
GradES is a gradient-based early stopping method that accelerates transformer training by monitoring gradient changes in attention projections and feed-forward layer matrices. It stops updating each matrix once its gradient changes fall below a convergence threshold, while slowly converging matrices continue to learn, which eliminates costly validation passes. This approach speeds up training by 1.57 to 7.22 times and improves generalization, leading to 1.2% higher average accuracy on language tasks and 3.88% on multimodal benchmarks.
GradES 是一种基于梯度的早期停止方法,通过监控不同组件的收敛情况来加速Transformer的训练。它跟踪注意力投影和前馈层矩阵的梯度变化,对已经收敛的矩阵停止更新,从而省去昂贵的验证过程。这种方法将训练时间加快了1.57–7.22倍,并通过早期防止过拟合提高泛化能力,使语言任务的平均准确率提高了1.2%,多模态基准提高了3.88%。
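The per-matrix freezing rule described above can be sketched in a few lines. This is a minimal, framework-free illustration, not the GradES implementation: the class name `GradESTracker`, the use of an L1 norm of the difference between consecutive gradients, and the flat-list gradient format are all assumptions, and the abstract does not say whether a warm-up period is applied before freezing.

```python
class GradESTracker:
    """Freeze a matrix once its gradient change drops below a threshold tau."""

    def __init__(self, tau):
        self.tau = tau
        self.prev = {}       # last gradient seen for each matrix name
        self.frozen = set()  # matrices excluded from further updates

    def update(self, name, grad):
        """Record this step's gradient; return True if the matrix is frozen."""
        if name in self.frozen:
            return True
        prev = self.prev.get(name)
        self.prev[name] = list(grad)
        if prev is None:
            return False  # no earlier gradient to compare against
        # Assumed metric: L1 norm of the change between consecutive gradients.
        change = sum(abs(g - p) for g, p in zip(grad, prev))
        if change < self.tau:
            self.frozen.add(name)
        return name in self.frozen


tracker = GradESTracker(tau=0.1)
tracker.update("attn.q_proj", [1.0, 1.0])
tracker.update("ffn.up_proj", [1.0, 1.0])
print(tracker.update("attn.q_proj", [1.0, 1.05]))  # True: change is below tau
print(tracker.update("ffn.up_proj", [0.5, 0.5]))   # False: still changing
```

In a real training loop, a frozen matrix would simply be skipped by the optimizer (for example by disabling its gradient), so no validation pass is needed to decide when it stops.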
Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
Authors: Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu
First: 2025-10-16T17:59:59+00:00 · Latest: 2025-10-16T17:59:59+00:00
Comments: Project page: https://coupled-diffusion.github.io
Abstract
We present an inference-time diffusion sampling method to perform multi-view consistent image editing using pre-trained 2D image editing models. These models can independently produce high-quality edits for each image in a set of multi-view images of a 3D scene or object, but they do not maintain consistency across views. Existing approaches typically address this by optimizing over explicit 3D representations, but they suffer from a lengthy optimization process and instability under sparse view settings. We propose an implicit 3D regularization approach by constraining the generated 2D image sequences to adhere to a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that concurrently samples two trajectories from both a multi-view image distribution and a 2D edited image distribution, using a coupling term to enforce the multi-view consistency among the generated images. We validate the effectiveness and generality of this framework on three distinct multi-view image editing tasks, demonstrating its applicability across various model architectures and highlighting its potential as a general solution for multi-view consistent editing.
中文标题/摘要
标题:耦合扩散采样用于无训练多视图图像编辑
我们提出了一种推理时的扩散采样方法,使用预训练的2D图像编辑模型在多视图图像集中执行多视图一致的图像编辑。这些模型可以独立地为多视图场景或对象的一组图像生成高质量的编辑,但它们在视图之间不保持一致性。现有方法通常通过优化显式的3D表示来解决这个问题,但它们会遭受优化过程漫长且在稀疏视图设置下不稳定的问题。我们提出了一种隐式的3D正则化方法,通过约束生成的2D图像序列遵循预训练的多视图图像分布来实现。这通过耦合扩散采样实现,这是一种简单的扩散采样技术,同时从多视图图像分布和2D编辑图像分布中采样两条轨迹,并使用耦合项来强制生成图像之间的多视图一致性。我们在三个不同的多视图图像编辑任务上验证了该框架的有效性和通用性,展示了其在各种模型架构中的适用性,并强调了其作为多视图一致编辑的通用解决方案的潜力。
Summary / 总结
The paper introduces a method for multi-view consistent image editing using pre-trained 2D image editing models. It proposes coupled diffusion sampling to enforce consistency across multiple views without the need for explicit 3D optimization. The method involves concurrently sampling from a multi-view image distribution and a 2D edited image distribution, using a coupling term to maintain consistency. Experiments show the method's effectiveness across different editing tasks and model architectures, making it a general solution for multi-view consistent image editing.
该研究提出了一种使用预训练的2D图像编辑模型进行多视图一致图像编辑的方法。它通过耦合扩散采样来确保多视图之间的一致性,而无需进行显式的3D优化。该方法涉及从多视图和编辑图像分布中同时采样两条轨迹,并使用耦合项来确保生成图像的一致性。实验结果显示该方法在不同模型架构上的有效性和通用性。
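To make the coupling idea concrete, here is a toy scalar sketch. It is not a diffusion sampler: the two "denoisers" are stand-in functions that each pull a sample toward a different target, and the coupling term nudges the two trajectories toward each other after every step. The names `coupled_sampling` and `toward`, the symmetric linear coupling, and the absence of noise are all simplifying assumptions.

```python
def toward(target, rate=0.5):
    """Hypothetical stand-in for one denoising step that moves x toward `target`."""
    return lambda x: x + rate * (target - x)


def coupled_sampling(x_a, x_b, step_a, step_b, coupling, n_steps):
    """Advance two trajectories together, pulling them toward each other each step."""
    for _ in range(n_steps):
        a_next = step_a(x_a)  # e.g. a step under the multi-view prior
        b_next = step_b(x_b)  # e.g. a step under the 2D editing model
        # Coupling term: each sample moves a fraction of the way to the other.
        x_a = a_next + coupling * (b_next - a_next)
        x_b = b_next + coupling * (a_next - b_next)
    return x_a, x_b


free = coupled_sampling(0.0, 1.0, toward(0.0), toward(1.0), coupling=0.0, n_steps=20)
tied = coupled_sampling(0.0, 1.0, toward(0.0), toward(1.0), coupling=0.25, n_steps=20)
print(abs(free[1] - free[0]) > abs(tied[1] - tied[0]))  # True
```

With `coupling=0.0` the two trajectories settle on their own targets; a positive coupling leaves each sample near its own target while shrinking the gap between them, which mirrors the trade-off between edit quality and cross-view consistency described in the abstract.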
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Authors: Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu
First: 2025-10-16T17:59:58+00:00 · Latest: 2025-10-16T17:59:58+00:00
Comments: 21 pages, 7 figures
Abstract
The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (-) How to make research in native VLMs more accessible and democratized, thereby accelerating progress in the field. In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, capable of rivaling top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
中文标题/摘要
标题:从像素到文字——迈向大规模原生视觉-语言基础模块
原生视觉-语言模型(VLMs)的建筑已经成为了典型的模块化VLMs的有力竞争者,这得益于不断演化的模型架构和训练范式。然而,两个悬而未决的问题仍然阻碍了其广泛探索和推广:(-)原生VLMs与模块化VLMs之间有哪些根本性的区别,这些障碍可以克服到什么程度?(-)如何使原生VLMs的研究更加普及和民主化,从而加速该领域的进展。在本文中,我们澄清了这些挑战,并概述了构建原生VLMs的指导原则。具体而言,一个原生VLM基础模块应该:(i)在共享语义空间内有效对齐像素和词的表示;(ii)无缝整合以前分离的视觉和语言模块的优势;(iii)内在地体现各种跨模态特性,以支持统一的视觉-语言编码、对齐和推理。因此,我们推出了NEO,这是一种从第一原理构建的新一代原生VLMs,能够在多种现实场景中与顶级模块化对手竞争。仅使用3.9亿张图像-文本样本,NEO能够从头开始高效地发展视觉感知,同时在密集且单一的模型中缓解视觉-语言冲突,该模型由我们精心设计的基础模块构建而成。我们将NEO定位为大规模且强大的原生VLMs的基础,并配有一套丰富的可重用组件,以促进经济高效且可扩展的生态系统。我们的代码和模型已公开发布于:https://github.com/EvolvingLMMs-Lab/NEO。
Learning an Image Editing Model without Image Editing Pairs
Authors: Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang
First: 2025-10-16T17:59:57+00:00 · Latest: 2025-10-16T17:59:57+00:00
Comments: project page: https://nupurkmr9.github.io/npedit/
Abstract
Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
中文标题/摘要
标题:无需图像编辑配对的学习图像编辑模型
最近的图像编辑模型在遵循自然语言编辑指令方面取得了令人印象深刻的成果,但它们依赖于大规模输入-目标配对数据集的监督微调。这是一个关键瓶颈,因为这种自然出现的配对难以大规模整理。当前的变通方法使用合成训练配对,这些配对利用了现有模型的零样本能力。然而,这可能会将预训练模型的缺陷传播并放大到最终训练模型中。在本工作中,我们提出了一种新的训练范式,完全消除了对配对数据的需求。我们的方法直接优化一个少步扩散模型,在训练过程中将其展开,并利用视觉语言模型(VLM)的反馈。对于每个输入和编辑指令,VLM 评估编辑是否遵循指令并保留未更改的内容,提供端到端优化的直接梯度。为了确保视觉保真度,我们引入了分布匹配损失(DMD),该损失限制生成的图像保持在预训练模型学习到的图像流形内。我们在标准基准上评估了我们的方法,并包括了详尽的消融研究。在没有任何配对数据的情况下,我们的方法在少步设置下与各种在大量监督配对数据上训练的图像编辑扩散模型表现相当。使用相同的 VLM 作为奖励模型,我们还优于基于 RL 的技术如 Flow-GRPO。
Summary / 总结
This work proposes a new training paradigm for image editing models that does not require paired input-target data, addressing the challenge of curation. The method uses unrolled diffusion models and feedback from vision-language models to optimize edits, incorporating a distribution matching loss to maintain visual fidelity. Experiments show that the model performs comparably to those trained on extensive paired data, and outperforms RL-based techniques like Flow-GRPO when using the same VLM as the reward model.
该研究提出了一种无需输入-目标配对数据的新训练范式,解决了数据标注难题。方法利用展开的扩散模型和视觉语言模型的反馈进行优化,并引入分布匹配损失以保持视觉保真度。实验表明,该模型在少量步骤设置下与大量配对数据训练的模型表现相当,并且在使用相同VLM作为奖励模型时,优于基于RL的技术如Flow-GRPO。