arXiv 论文速递

2025-10-26 03:27
Snapshot: 20251026_0327
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang
First: 2025-10-23T17:59:21+00:00 · Latest: 2025-10-23T17:59:21+00:00
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
中文标题/摘要
标题:小草图,大裁决:基于推测的密集信息视觉推理
大型多模态视觉语言模型(VLMs)在多模态理解方面取得了显著进展,但在处理密集交织了文本注释和细粒度图形元素的信息密集型图像时却面临挑战。主要挑战在于精确定位密集布局中的关键线索以及进行多跳推理以整合分散的证据。我们提出了一种名为推测裁决(SV)的无需训练框架,该框架受到推测解码的启发,结合了多个轻量级草图专家和一个大型裁决模型。在草图阶段,小型VLM作为草图专家生成提供多样化定位候选的推理路径;在裁决阶段,强大的VLM综合这些路径生成最终答案,同时降低计算成本并恢复正确答案。为了进一步提高效率和准确性,SV引入了一种共识专家选择机制,仅将高一致性的推理路径转发到裁决阶段。实验证明,SV在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等具有挑战性的信息密集型和高分辨率视觉问答基准测试中取得了持续的改进。通过综合多个部分准确推理路径中的正确见解,SV在错误纠正和成本效率方面优于大型专有模型或训练管道。代码可在https://github.com/Tinaliu0123/speculative-verdict 获取
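The consensus expert selection step can be illustrated with a minimal sketch. The abstract does not specify how SV measures agreement, so everything here is an assumption made for illustration: each draft expert is taken to return a short final answer string, agreement is scored as the number of drafts sharing the same normalized answer, and the names `normalize` and `select_consensus_paths` are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def select_consensus_paths(drafts, k=2):
    """Keep the k drafts whose answers are shared by the most experts.

    drafts: list of (expert_name, reasoning_path, answer) tuples.
    Hypothetical agreement score: number of drafts with the same normalized answer.
    """
    counts = Counter(normalize(ans) for _, _, ans in drafts)
    # sorted() is stable, so tied drafts keep their original order.
    ranked = sorted(drafts, key=lambda d: counts[normalize(d[2])], reverse=True)
    return ranked[:k]


# Invented example drafts, not outputs of any real model.
drafts = [
    ("expert_a", "Reads the bar labelled 2019 ...", "42%"),
    ("expert_b", "Locates the legend, then the 2019 bar ...", "42%"),
    ("expert_c", "Reads the 2018 bar by mistake ...", "37%"),
]
selected = select_consensus_paths(drafts, k=2)
selected_names = [name for name, _, _ in selected]
```

Only the selected reasoning paths would then be passed to the large verdict model, which is where the cost saving described in the abstract would come from.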
DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
Authors: Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal
First: 2025-10-23T17:42:14+00:00 · Latest: 2025-10-23T17:42:14+00:00
Abstract
Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
中文标题/摘要
标题:DyPE:动态位置外推在超高清扩散中的应用
扩散变换器模型可以生成具有非凡保真度和细节的图像,但由于自注意力机制与图像标记数量的平方级扩展,训练它们在超高清分辨率上仍然极其昂贵。在本文中,我们介绍了动态位置外推(DyPE),这是一种无需训练的新方法,使预训练的扩散变换器能够在其训练数据远超的分辨率下合成图像,且无需额外的采样成本。DyPE 利用了扩散过程中固有的频谱进展,其中低频结构早期收敛,而高频则需要更多步骤来解决。具体而言,DyPE 在每次扩散步骤中动态调整模型的位置编码,使其频谱与生成过程的当前阶段匹配。这种方法使我们能够在远超训练分辨率的分辨率下生成图像,例如,使用FLUX生成1600万像素的图像。在多个基准测试中,DyPE 一致地提高了性能,并在超高清图像生成中达到了最先进的保真度,尤其是在更高分辨率下,性能提升更为显著。项目页面可在 https://noamissachar.github.io/DyPE/ 获取。
Summary / 总结
DyPE is a training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data by dynamically adjusting positional encodings at each diffusion step, with no additional sampling cost. It exploits the spectral progression of the diffusion process, in which low-frequency structures converge early and high frequencies take more steps to resolve. DyPE generates images of, for example, 16 million pixels using FLUX and achieves state-of-the-art fidelity on multiple benchmarks, with gains becoming more pronounced at higher resolutions.
DyPE 是一种无需训练的方法,通过在每个扩散步骤中动态调整位置编码,使预训练的扩散变换器能够在超高清分辨率下生成图像。该方法利用扩散过程中的频谱进展(低频结构较早收敛,高频细节需要更多步骤),使位置编码的频谱与当前生成阶段相匹配。DyPE 在多个基准测试中提高了性能,并在超高清分辨率图像生成中达到了最先进的保真度,分辨率越高提升越明显,且无额外采样成本。项目页面:https://noamissachar.github.io/DyPE/
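The quadratic-scaling claim in the abstract can be made concrete with a little arithmetic. The sketch below counts tokens and pairwise attention interactions for one self-attention layer; the value of one token per 16×16 pixel patch is an illustrative assumption, and the function name `attention_cost` is hypothetical.

```python
def attention_cost(height: int, width: int, patch: int = 16):
    """Return (num_tokens, num_attention_pairs) for one self-attention layer.

    patch=16 (pixels per token side) is an illustrative assumption.
    """
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


base_tokens, base_pairs = attention_cost(1024, 1024)
big_tokens, big_pairs = attention_cost(4096, 4096)  # roughly 16 million pixels
ratio = big_pairs // base_pairs
```

Under this assumption, moving from 1024×1024 to 4096×4096 multiplies the token count by 16 and the number of attention pairs by 256, which is why training directly at such resolutions is costly and a training-free extrapolation method is attractive.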
mmWalk: Towards Multi-modal Multi-view Walking Assistance
Authors: Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen
Venue: NeurIPS 2025
First: 2025-10-13T15:25:52+00:00 · Latest: 2025-10-23T16:40:49+00:00
Comments: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk
Abstract
Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) using zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
中文标题/摘要
标题:mmWalk:迈向多模态多视角行走辅助
在极端或复杂环境中提供行走辅助仍然是盲人或低视力(BLV)人群的一大挑战,主要原因是缺乏对整体场景的理解。受BLV社区实际需求的启发,我们构建了mmWalk,这是一个模拟的多模态数据集,集成了多视角传感器和无障碍导向特征,用于户外安全导航。该数据集包含120条手动控制、场景分类的行走轨迹,共有62000帧同步图像。它包含了超过559000张全景图像,涵盖RGB、深度和语义模态。此外,为了强调现实相关性,每条轨迹都涉及户外的边缘情况和专为BLV用户设计的无障碍地标。此外,我们还生成了mmWalkVQA,这是一个包含超过69000个视觉问题-答案三元组的VQA基准,分为9个类别,旨在提供安全和知情的行走辅助。我们使用零样本和少样本设置评估了最先进的视觉-语言模型(VLMs),发现它们在我们的风险评估和导航任务中表现不佳。我们还在真实世界数据集上验证了mmWalk微调模型,并展示了该数据集在推进多模态行走辅助方面的有效性。
Summary / 总结
The paper introduces mmWalk, a simulated multi-modal dataset for walking assistance in complex environments, addressing the need for holistic scene understanding for people with blindness or low vision. The dataset includes 120 walking trajectories with 62k synchronized frames and over 559k panoramic images across RGB, depth, and semantic modalities, together with mmWalkVQA, a benchmark of over 69k visual question-answer triplets across 9 categories. The authors evaluate state-of-the-art Vision-Language Models in zero- and few-shot settings and find that they struggle with risk assessment and navigational tasks. A model fine-tuned on mmWalk is then validated on real-world datasets, showing the dataset's effectiveness for advancing multi-modal walking assistance.
研究旨在通过开发mmWalk多模态数据集来解决视觉障碍人士在极端环境中的行走辅助问题,该数据集整合了多视角传感器数据和无障碍特征。数据集包含120条行走轨迹,共有62k同步帧和超过559k张全景图像,涵盖RGB、深度和语义模态,并强调了真实的户外场景。实验结果显示,最先进的视觉-语言模型在风险评估和导航任务上表现不佳,而mmWalk微调模型在真实世界数据集上表现出有效性。
Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
Authors: Wenyi Xiao, Leilei Gan
First: 2025-04-25T16:11:23+00:00 · Latest: 2025-10-23T16:25:28+00:00
Abstract
When applied to large vision-language model reasoning, reinforcement learning (typically through GRPO) either struggles to scale reasoning length effectively or generates verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
Summary / 总结
FAST-GRPO is a variant of GRPO that dynamically adjusts reasoning depth based on question characteristics, so that large vision-language models reason at length only when a question calls for it. It introduces two complementary metrics to estimate question difficulty and incorporates adaptive length-based rewards and a difficulty-aware KL divergence term into GRPO. Experiments across seven reasoning benchmarks show that FAST achieves state-of-the-art accuracy, with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches.
FAST-GRPO 是 GRPO 的一种变体,根据问题特性动态调整推理深度,使大型视觉-语言模型仅在必要时进行长推理。它引入了两个互补的指标来估计问题难度,并将自适应长度奖励和难度感知的 KL 散度引入 GRPO。在七个推理基准测试上的实验表明,FAST 达到了最先进的准确率,相比基础模型有超过 10% 的相对提升,并且与之前的慢思考方法相比减少了 32.7-67.3% 的标记使用量,有效地平衡了推理长度和准确率。
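To make the idea of an adaptive length-based reward more tangible, here is a minimal sketch. It is not the reward defined in the paper, whose exact form the abstract does not give. The function name `length_reward`, the 0.5 difficulty threshold, the 0.5 brevity weight, and the 512-token budget are all hypothetical choices: correct answers to easy questions earn a bonus for being short, while correct answers to hard questions are not penalized for length.

```python
def length_reward(num_tokens: int, is_correct: bool, difficulty: float,
                  budget: int = 512) -> float:
    """Hypothetical length-aware reward.

    difficulty is assumed to lie in [0, 1]; values below 0.5 count as easy.
    """
    if not is_correct:
        return 0.0  # no credit for wrong answers, however short
    if difficulty < 0.5:
        # Easy question: add a brevity bonus that shrinks as the response grows.
        brevity = max(0.0, 1.0 - num_tokens / budget)
        return 1.0 + 0.5 * brevity
    # Hard question: accuracy alone, so long reasoning is not discouraged.
    return 1.0


easy_short = length_reward(128, True, difficulty=0.2)
easy_long = length_reward(512, True, difficulty=0.2)
hard_long = length_reward(512, True, difficulty=0.9)
wrong = length_reward(64, False, difficulty=0.2)
```

In a GRPO-style setup, rewards like these would be computed for every response in a sampled group before the group-relative advantages are formed.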
Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Authors: Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang
First: 2025-10-23T16:17:47+00:00 · Latest: 2025-10-23T16:17:47+00:00
Comments: Our code is available at https://github.com/xuyang-liu16/MixKV
Abstract
Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
中文标题/摘要
标题:结合重要性与多样性:在大型视觉-语言模型中联合优化KV缓存压缩
近期的大型视觉-语言模型(LVLMs)在处理扩展的多模态序列方面表现出色,但由此产生的键值(KV)缓存扩展造成了一个关键的内存瓶颈,从根本上限制了部署的可扩展性。虽然现有的KV缓存压缩方法侧重于保留高重要性的KV对以最小化存储,但它们往往忽略了多模态KV缓存中出现的独特的模态特定语义冗余模式。在这项工作中,我们首先分析了LVLMs中的KV缓存如何在不同的注意力头中表现出不同程度的冗余,而不仅仅是简单的重要性。我们表明,仅依赖于重要性只能覆盖KV缓存信息分布的一部分,可能导致语义覆盖的潜在损失。为了解决这个问题,我们提出了MixKV,一种新颖的方法,将重要性与多样性结合以优化LVLMs中的KV缓存压缩。MixKV根据头级语义冗余进行调整,在压缩KV对时选择性地平衡多样性和重要性。广泛的实验表明,MixKV在多个LVLMs上始终能够提升现有方法的性能。在极端压缩(预算=64)下,MixKV在五个多模态理解基准测试中使基线方法平均提高了5.1%,并在GUI定位任务上使SnapKV和AdaKV分别实现了8.0%和9.0%的显著提升,同时保持了相当的推理效率。此外,MixKV无缝扩展到LLMs,性能提升相当。我们的代码可在 https://github.com/xuyang-liu16/MixKV 获取。
Summary / 总结
This paper addresses the memory bottleneck caused by the expansion of key-value (KV) cache in large vision-language models (LVLMs) by proposing a method called MixKV. MixKV combines importance and diversity to optimize KV cache compression, addressing the limitations of existing methods that only focus on importance. Extensive experiments show that MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves significant gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, while maintaining comparable inference efficiency.
该研究针对大型视觉-语言模型(LVLMs)中由于键值(KV)缓存扩展导致的内存瓶颈问题,提出了一种结合重要性和多样性的KV缓存压缩方法MixKV。该方法适应注意力头级别的语义冗余,并在压缩KV对时选择性地平衡多样性和重要性。广泛的实验表明,在极端压缩(预算=64)下,MixKV使基线方法在五个多模态理解基准测试中平均提高了5.1%,在GUI定位任务上使SnapKV和AdaKV分别取得了8.0%和9.0%的显著提升,同时保持了相当的推理效率。MixKV还可扩展到大型语言模型(LLMs),并带来相当的性能提升。
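The trade-off between importance and diversity can be sketched with a small greedy selector. This is not MixKV's scoring rule, which the abstract does not spell out; it swaps in a maximal-marginal-relevance-style heuristic purely for illustration. The names `cosine` and `select_kv`, the `mix` weight (which a head-wise method would presumably set per attention head), and the toy key vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_kv(keys, importance, budget, mix=0.5):
    """Greedily pick `budget` key indices, trading importance against redundancy.

    mix=0.0 ranks by importance only; larger values favour keys that are
    dissimilar to those already kept.
    """
    selected = []
    candidates = list(range(len(keys)))
    while candidates and len(selected) < budget:
        def score(i):
            # Redundancy: highest similarity to any key already kept.
            redundancy = max((cosine(keys[i], keys[j]) for j in selected), default=0.0)
            return (1 - mix) * importance[i] + mix * (1.0 - redundancy)
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)


# Toy cache: keys 0 and 1 are near-duplicates, key 2 points elsewhere.
keys = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
importance = [0.9, 0.8, 0.3]
importance_only = select_kv(keys, importance, budget=2, mix=0.0)
mixed = select_kv(keys, importance, budget=2, mix=0.5)
```

With importance alone the two near-duplicate keys are kept, whereas mixing in diversity keeps one of them plus the dissimilar key, which is the kind of semantic coverage loss the abstract says importance-only compression can cause.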
FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Authors: Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng
First: 2025-04-22T14:55:23+00:00 · Latest: 2025-10-23T16:11:42+00:00
Comments: Code: https://github.com/Nihukat/FreeGraftor
Abstract
Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
中文标题/摘要
标题:FreeGraftor:无需训练的跨图像特征嫁接以实现主题驱动的文本到图像生成
主题驱动的图像生成旨在从参考图像中合成新的场景,同时忠实保留主题身份并遵循文本指导。然而,现有方法在保真度和效率之间面临关键权衡。基于调优的方法依赖于耗时且资源密集的主题特定优化,而零样本方法往往无法保持足够的主题一致性。在本文中,我们提出了一种无需训练的FreeGraftor框架,通过跨图像特征嫁接来解决这些限制。具体而言,FreeGraftor利用语义匹配和位置约束注意力融合将参考主题的视觉细节转移到生成图像中。此外,我们的框架引入了一种新颖的噪声初始化策略,以保留参考主题的几何先验,从而促进稳健的特征匹配。广泛的定性和定量实验表明,我们的方法能够实现精确的主题身份转移,同时保持文本对齐的场景合成。无需进行模型微调或额外训练,FreeGraftor在主题保真度和文本对齐方面显著优于现有零样本和无需训练的方法。此外,我们的框架可以无缝扩展到多主题生成,使其适用于实际部署。我们的代码可在https://github.com/Nihukat/FreeGraftor获取。
Summary / 总结
FreeGraftor is a training-free framework for subject-driven text-to-image generation that uses cross-image feature grafting to preserve subject identity while adhering to textual guidance. It employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to generated images, and introduces a noise initialization strategy to maintain subject geometry. Experiments show that FreeGraftor outperforms existing methods in subject fidelity and text alignment without requiring model fine-tuning or additional training, and can handle multi-subject generation.
FreeGraftor 是一个无需训练的框架,用于在保留主体身份的同时根据文本指导生成图像。它使用跨图像特征嫁接技术,通过语义匹配和位置约束注意力融合将参考主体的视觉细节转移到生成图像中,并引入噪声初始化策略以保持主体的几何先验。实验表明,FreeGraftor 在主体保真度和文本对齐方面优于现有零样本和无需训练的方法,且无需进行模型微调。此外,该框架还支持多主体生成,使其适用于实际应用。
Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward
Authors: Jing Bi, Guangyu Sun, Ali Vosoughi, Chen Chen, Chenliang Xu
First: 2025-10-23T16:10:03+00:00 · Latest: 2025-10-23T16:10:03+00:00
Comments: 5 pages
Abstract
Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
中文标题/摘要
标题:视觉推理诊断:挑战、见解与未来路径
多模态大型语言模型(MLLMs)结合视觉和文本推理,利用链式思考(CoT)提示解决复杂视觉任务,但仍表现出视觉幻觉和过度依赖文本先验的问题。我们使用三阶段评估框架系统地诊断了最先进的视觉语言模型,揭示了关键的失败模式。为了解决这些问题,我们提出了一种基于代理的架构,结合LLM推理和轻量级视觉模块,实现精细的推理链分析和迭代优化。我们的结果强调未来视觉推理模型应专注于整合更广泛的专门工具来分析视觉内容。我们的系统在MMMU上提高了10.3,在MathVista上提高了6.0,超过了7B基线模型。我们将发布我们的框架和评估套件以促进未来研究。
Summary / 总结
The paper diagnoses the challenges that multimodal large language models (MLLMs) face in visual reasoning tasks, particularly their tendency to produce visual hallucinations and to over-rely on textual priors. The authors apply a three-stage evaluation framework to uncover key failure modes and propose an agent-based architecture that combines LLM reasoning with lightweight visual modules. The system gains +10.3 on MMMU and +6.0 on MathVista over a 7B baseline, matching or surpassing much larger models, and the results indicate that future visual reasoning models should integrate a broader set of specialized tools for analyzing visual content.
论文旨在诊断多模态大型语言模型(MLLMs)在视觉推理任务中面临的挑战,特别是它们倾向于产生视觉幻觉并过度依赖文本信息。作者提出了一种三阶段评估框架和一种结合轻量级视觉模块与LLM推理的代理架构。关键发现包括相对于7B基线模型,在MMMU上提高了10.3,在MathVista上提高了6.0,表明未来的视觉推理模型应整合更多专门用于分析视觉内容的工具。
REOBench: Benchmarking Robustness of Earth Observation Foundation Models
Authors: Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang
First: 2025-05-22T15:34:50+00:00 · Latest: 2025-10-23T15:43:31+00:00
Comments: Accepted to NeurIPS 2025 D&B Track
Abstract
Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions. (2) The severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drop varying from less than 1% to over 20%. (3) Vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models. Code and data are publicly available at https://github.com/lx709/REOBench.
中文标题/摘要
标题:REOBench:地球观测基础模型鲁棒性基准测试
地球观测基础模型在多个地球观测任务中表现出强大的泛化能力,但在真实世界扰动下的鲁棒性仍被忽视。为填补这一空白,我们引入了REOBench,这是首个全面评估地球观测基础模型鲁棒性的基准,涵盖了六个任务和十二种图像破坏类型,包括基于外观和几何的扰动。为了确保评估的现实性和精细度,我们的基准专注于高分辨率光学遥感图像,这些图像广泛应用于城市规划和灾害响应等关键应用。我们系统地评估了使用掩码图像建模、对比学习和视觉语言预训练范式训练的一系列模型。我们的结果表明:(1)现有地球观测基础模型在面对输入破坏时会遭受显著性能下降。(2)性能下降的程度在不同任务、模型架构、骨干网络大小和破坏类型之间存在差异,性能下降幅度从不到1%到超过20%不等。(3)视觉语言模型在多模态任务中显示出增强的鲁棒性。REOBench突显了当前地球观测基础模型对真实世界破坏的脆弱性,并提供了开发更鲁棒和可靠的模型的可操作见解。代码和数据可在https://github.com/lx709/REOBench公开获取。
Summary / 总结
REOBench is a benchmark designed to evaluate the robustness of Earth observation foundation models under real-world perturbations. It assesses models across six tasks and twelve types of image corruptions. The study reveals that existing models experience significant performance degradation, with drops ranging from less than 1% to over 20%, depending on the task, model architecture, and type of corruption. Vision-language models show enhanced robustness, especially in multimodal tasks. This benchmark highlights the need for developing more robust models for critical applications like urban planning and disaster response.
REOBench 是一个基准,用于评估地球观测基础模型在六种任务和十二种图像扭曲下的鲁棒性。该基准使用高分辨率光学遥感图像,并评估了通过掩码图像建模、对比学习和视觉语言预训练训练的模型。主要发现包括现有模型在暴露于输入扭曲时出现显著性能下降,不同任务和模型类型下的严重程度不同,以及视觉语言模型在多模态任务中的鲁棒性增强。这项工作强调了为关键应用如城市规划和灾害响应开发更鲁棒模型的必要性。
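The reported range of performance drops (from less than 1% to over 20%) can be read with a simple formula in mind. The abstract does not say whether drops are relative or absolute, so the sketch below assumes a relative drop with respect to the clean score; the function name `relative_drop` is hypothetical and the input scores are invented for illustration, not REOBench results.

```python
def relative_drop(clean: float, corrupted: float) -> float:
    """Percentage of clean performance lost under corruption (assumed definition)."""
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return 100.0 * (clean - corrupted) / clean


# Invented scores, chosen only to show both ends of the reported range.
mild = relative_drop(80.0, 79.6)
severe = relative_drop(80.0, 64.0)
```

Under this assumed definition the first case loses about 0.5% of clean performance and the second loses 20%.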
Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging
Authors: Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang, Pengfei Guo, Marc Edgar, Daguang Xu, Bernhard Kainz, Bjoern Menze
Venue: NeurIPS 2025
First: 2025-10-23T15:13:13+00:00 · Latest: 2025-10-23T15:13:13+00:00
Comments: NeurIPS 2025
Abstract
Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512*512*241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
中文标题/摘要
标题:更好的3D标记以实现更好的3D:推进3D医学成像中的视觉-语言建模
3D医学成像中的视觉-语言建模最近的进步得益于大规模的计算机断层扫描(CT)语料库,这些语料库配有配对的自由文本报告,更强的架构和强大的预训练模型。这使得自动化报告生成和文本条件下的3D图像合成等应用成为可能。然而,当前的方法在处理高分辨率、长序列体积时存在困难:对比预训练往往导致视觉编码器与临床语言不一致,而切片级标记模糊了细微解剖结构,降低了下游任务的诊断性能。我们提出了BTB3D(更好的3D标记),这是一种因果卷积编码器-解码器,统一了2D和3D的训练和推理,同时生成紧凑的、频率感知的体素标记。三阶段的训练课程使模型能够(i)局部重建,(ii)重叠窗口镶嵌,以及(iii)长上下文解码器细化,在此过程中,模型从短切片片段中学习,但能够处理超过300片的扫描而无需额外的内存开销。BTB3D在两个关键任务上达到了新的最佳水平:它在报告生成任务上提高了BLEU分数,并且与CT2Rep、CT-CHAT和Merlin相比,临床F1提高了40%;在文本到CT合成任务上,它将FID降低了75%,并将FVD减半,与GenerateCT和MedSyn相比,生成了解剖上一致的512*512*241体积。这些结果表明,精确的三维标记化,而不是更大的语言骨干模型,对于3D医学成像中的可扩展视觉-语言建模至关重要。代码库可在:https://github.com/ibrahimethemhamamci/BTB3D
Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey
Authors: Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng
First: 2025-10-20T02:59:45+00:00 · Latest: 2025-10-23T15:06:39+00:00
Abstract
Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.
中文标题/摘要
标题:高效视觉-语言-行动模型在嵌入式操作中的应用:系统综述
视觉-语言-行动(VLA)模型将视觉-语言模型扩展到嵌入式控制,通过将自然语言指令和视觉观察映射到机器人行动。尽管具有这些能力,VLA 系统由于其巨大的计算和内存需求而面临重大挑战,这与边缘平台(如车载移动操作器)的实时性能要求相冲突。解决这一矛盾已成为最近研究的中心焦点。鉴于对更高效和可扩展的VLA系统的日益努力,本文综述了提高VLA效率的方法,重点在于减少延迟、内存占用和训练及推理成本。我们按照模型架构、感知特征、行动生成和训练/推理策略四个维度对现有解决方案进行了分类,总结了每个类别中的代表性技术。最后,我们讨论了未来趋势和开放挑战,指出了推进高效嵌入式智能的方向。
SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
Authors: Yuan Sheng, Yanbin Hao, Chenxu Li, Shuo Wang, Xiangnan He
First: 2025-10-23T14:55:28+00:00 · Latest: 2025-10-23T14:55:28+00:00
Abstract
Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
中文标题/摘要
标题:SeViCES:统一语义-视觉证据共识的长视频理解
长视频理解由于其复杂、多样且时间上分散的内容而具有挑战性。尽管视频大型语言模型(Video-LLMs)能够处理长达数十分钟的视频,但将其应用于真正长的序列在计算上是不可行的,通常会导致不集中或不一致的推理。一种有希望的解决方案是仅选择最具信息性的帧,但现有方法通常忽略时间依赖性或依赖单一模态的证据,限制了它们提供完整且与查询相关背景的能力。我们提出了一种语义-视觉共识证据选择(SeViCES)框架,以实现有效的和可靠的长视频理解。SeViCES 是无需训练且模型无关的,并引入了两个关键组件。语义-视觉共识帧选择(SVCFS)模块通过(1)一个时间感知的语义分支,利用LLM对字幕的推理,以及(2)一个聚类引导的视觉分支,通过互信息对齐嵌入与语义得分。答案共识细化(ACR)模块进一步通过融合证据并限制答案空间来解决基于语义和视觉预测之间的一致性问题。在长视频理解基准上的广泛实验表明,SeViCES 在准确性和鲁棒性方面始终优于最先进的方法,证明了共识驱动的证据选择对Video-LLMs的重要性。
Summary / 总结
SeViCES is a framework designed to improve long video understanding by selecting the most informative frames and fusing semantic and visual evidence. It uses a temporal-aware semantic branch and a cluster-guided visual branch to select frames, and an Answer Consensus Refinement module to resolve inconsistencies between predictions. Experiments show that SeViCES outperforms existing methods in accuracy and robustness, highlighting the importance of consensus-driven evidence selection for Video-LLMs.
研究旨在通过提出SeViCES框架解决长视频理解的挑战,该框架选择信息丰富的帧并融合语义和视觉证据。SeViCES使用语义-视觉共识帧选择模块根据时序感知的语义推理和视觉对齐来选择帧,并使用答案共识精炼模块解决语义和视觉预测之间的不一致。实验表明,SeViCES在准确性和鲁棒性方面优于现有方法,突显了Video-LLMs中共识驱动的证据选择的重要性。
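A much simplified sketch of consensus-driven frame selection follows. It does not reproduce the SVCFS module, which according to the abstract uses LLM reasoning over captions and a mutual-information alignment; here both branches are reduced to precomputed per-frame scores. The names `top_k` and `consensus_frames`, the tie-filling rule, and the example scores are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest scores, in descending score order."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def consensus_frames(semantic, visual, k):
    """Select k frames, preferring those ranked in the top k by both branches."""
    sem_top, vis_top = set(top_k(semantic, k)), set(top_k(visual, k))
    agreed = sem_top & vis_top
    # Fill remaining slots by average score among frames not yet chosen.
    rest = [i for i in range(len(semantic)) if i not in agreed]
    rest.sort(key=lambda i: (semantic[i] + visual[i]) / 2, reverse=True)
    chosen = list(agreed) + rest[: k - len(agreed)]
    return sorted(chosen)  # restore temporal order


# Invented per-frame scores for a five-frame clip.
semantic = [0.9, 0.2, 0.8, 0.1, 0.4]
visual = [0.7, 0.1, 0.3, 0.9, 0.8]
frames = consensus_frames(semantic, visual, k=3)
```

Frames 0 and 4 are in both branches' top three, and the remaining slot goes to the frame with the best average score, so the selection reflects agreement between the two kinds of evidence rather than either one alone.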
Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models
Authors: Xinmiao Huang, Qisong He, Zhenglin Huang, Boxuan Wang, Zhuoyun Li, Guangliang Cheng, Yi Dong, Xiaowei Huang
First: 2025-10-15T10:44:01+00:00 · Latest: 2025-10-23T14:31:13+00:00
Comments: Project Page: https://shinmohuang.github.io/spatialdise_page/
Abstract
Spatial reasoning ability is crucial for Vision Language Models (VLMs) to support real-world applications in diverse domains including robotics, augmented reality, and autonomous navigation. Unfortunately, existing benchmarks are inadequate in assessing spatial reasoning ability, especially intrinsic-dynamic spatial reasoning, which is a fundamental aspect of human spatial cognition. In this paper, we propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants: Intrinsic-Static, Intrinsic-Dynamic, Extrinsic-Static, and Extrinsic-Dynamic spatial reasoning. Moreover, to address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions, resulting in a new Spatial-DISE dataset that includes Spatial-DISE Bench (559 evaluation VQA pairs) and Spatial-DISE-12K (12K+ training VQA pairs). Our comprehensive evaluation across 28 state-of-the-art VLMs reveals that current VLMs have a large and consistent gap to human competence, especially on multi-step multi-view spatial reasoning. Spatial-DISE offers a robust framework, valuable dataset, and clear direction for future research toward human-like spatial intelligence. Benchmark, dataset, and code will be publicly released.
中文标题/摘要
标题:Spatial-DISE:评估视觉语言模型空间推理能力的统一基准
空间推理能力对于视觉语言模型(VLMs)在包括机器人技术、增强现实和自主导航在内的多个领域支持实际应用至关重要。不幸的是,现有的基准测试在评估空间推理能力方面存在不足,尤其是内在动态空间推理,这是人类空间认知的一个基本方面。在本文中,我们基于认知基础的分类体系提出了一个统一的基准测试——Spatial-DISE,将任务分为四个基本象限:内在静态、内在动态、外在静态和外在动态空间推理。此外,为了解决数据稀缺问题,我们开发了一个可扩展和自动化的管道来生成多样且可验证的空间推理问题,从而形成了一个新的Spatial-DISE数据集,包括Spatial-DISE基准(559个评估VQA对)和Spatial-DISE-12K(12000多个训练VQA对)。我们对28个最先进的VLMs进行全面评估表明,当前的VLMs在多步多视角空间推理方面与人类能力之间存在显著且一致的差距。Spatial-DISE提供了一个稳健的框架、有价值的数据库和未来研究向人类空间智能方向发展的明确方向。基准测试、数据集和代码将公开发布。
Summary / 总结
The paper introduces Spatial-DISE, a unified benchmark for evaluating spatial reasoning in Vision-Language Models (VLMs). It addresses the inadequacy of existing benchmarks in assessing intrinsic-dynamic spatial reasoning and proposes a taxonomy categorizing tasks into four quadrants. The authors developed a scalable pipeline to generate diverse and verifiable spatial reasoning questions, resulting in the Spatial-DISE dataset. Comprehensive evaluations across 28 state-of-the-art VLMs show a significant gap between current models and human competence, especially in multi-step multi-view spatial reasoning. Spatial-DISE provides a robust framework for future research in human-like spatial intelligence.
论文提出了Spatial-DISE,一个统一的基准来评估视觉语言模型(VLMs)的空间推理能力。它解决了现有基准在评估内在动态空间推理方面的不足。该基准将任务分为四个象限,并通过一个可扩展的管道生成多样化的空间推理问题,从而形成了Spatial-DISE数据集。对28个最先进的VLMs的评估显示,模型在多步多视角空间推理任务上的表现与人类能力之间存在显著差距。
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Authors: Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Shu-Tao Xia
First: 2025-05-26T08:36:10+00:00 · Latest: 2025-10-23T13:08:11+00:00
Abstract
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
中文标题/摘要
标题:视觉引导语言:一种条件点互信息校准解码策略以减少LVLM中的幻觉
大型视觉-语言模型(LVLMs)容易出现幻觉现象,即生成的响应在语义上看似合理,但实际上与输入图像几乎没有关联。先前的研究表明,这一问题主要源于LVLMs过度依赖语言先验,而在解码过程中忽略了视觉信息。为了解决这一问题,我们提出了一种新颖的条件点互信息(C-PMI)校准解码策略,该策略能够自适应地增强生成文本与输入图像之间的相互依赖性,从而减轻幻觉现象。与现有方法仅关注文本词元采样不同,我们提出同时建模视觉和文本词元对C-PMI的贡献,将幻觉缓解问题表述为一个双层优化问题,旨在最大化互信息。为了解决这一问题,我们设计了一种词元净化机制,该机制通过动态调节解码过程来采样与给定图像最相关的文本词元,同时不断优化与生成响应最相关的图像词元。在各种基准上的广泛实验表明,所提出的方法在显著减少LVLM中的幻觉现象的同时,保持了解码效率。
Summary / 总结
The paper addresses the issue of hallucinations in Large Vision-Language Models (LVLMs), where generated text is semantically plausible but irrelevant to the input image. To tackle this, the authors propose a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy that enhances the mutual dependency between generated text and input images. This method jointly models the contributions of visual and textual tokens, solving hallucination mitigation as a bi-level optimization problem. Experimental results show that the proposed method effectively reduces hallucinations while maintaining decoding efficiency.
论文针对大型视觉语言模型(LVLMs)中的幻觉问题,即生成的响应虽然在语义上合理但与输入图像无关。提出了一种条件点互信息(C-PMI)校准解码策略,增强生成文本与输入图像之间的相互依赖性。该方法联合建模视觉和文本令牌对C-PMI的贡献,将其作为最大化互信息的双层优化问题来解决。在各种基准上的实验结果表明,所提出的方法有效减少了幻觉现象,同时保持了解码效率。
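For reference, the quantity behind the C-PMI name can be written out with the standard definition of conditional pointwise mutual information. The symbols below are assumptions for illustration ($y_t$ is the text token being decoded, $v$ the image, $x$ the prompt, $y_{<t}$ the previously generated tokens); the paper's exact formulation, which also models the contribution of visual tokens, may differ.

```latex
\[
\mathrm{PMI}\bigl(y_t;\, v \mid x,\, y_{<t}\bigr)
  = \log \frac{p\bigl(y_t \mid v,\, x,\, y_{<t}\bigr)}{p\bigl(y_t \mid x,\, y_{<t}\bigr)}
\]
```

A large positive value means that conditioning on the image raises the token's probability relative to the language prior alone. A value near zero suggests the token is driven mostly by the language prior, which is the failure mode the abstract attributes to hallucination.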
Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization
Authors: Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen
Venue: NeurIPS 2025
First: 2025-05-28T07:00:50+00:00 · Latest: 2025-10-23T12:39:42+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average. Our code is available at https://github.com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning.
中文标题/摘要
标题:平衡的标记剪枝:超越局部优化加速视觉语言模型
大型多模态模型(LVLMs)通过将图像编码成数千个标记,在多模态任务中表现出色。然而,大量的图像标记导致了显著的计算开销,而使用动态高分辨率输入进一步增加了这一负担。先前的方法试图通过标记剪枝减少图像标记的数量,通常基于注意力分数或图像标记多样性来选择标记。通过实证研究,我们观察到现有方法往往忽略了剪枝对当前层输出(局部)和后续层输出(全局)的联合影响,导致剪枝决策欠佳。为解决这一挑战,我们提出了平衡的标记剪枝(BTP),这是一种插件式方法,用于剪枝视觉标记。具体而言,我们的方法利用一个小的校准集将剪枝过程分为多个阶段。在早期阶段,我们的方法强调剪枝对后续层的影响,而在较深的阶段,重点转向保留局部输出的一致性。在各种LVLMs上的广泛实验表明,我们的方法在多个基准测试中具有广泛的适用性。我们的方法在平均情况下实现了78%的压缩率,同时保留了原始模型96.7%的性能。我们的代码可在https://github.com/EmbodiedCity/NeurIPS2025-Balanced-Token-Pruning/ 获取。
Summary / 总结
The paper addresses the computational overhead in large Vision-Language Models (LVLMs) caused by the large number of image tokens. It proposes Balanced Token Pruning (BTP), a plug-and-play method that uses a small calibration set to divide the pruning process into stages, balancing the impact of pruning on local (current-layer) and global (subsequent-layer) outputs. Experiments show that BTP achieves a 78% compression rate while preserving, on average, 96.7% of the original models' performance across multiple benchmarks.
论文针对大型视觉语言模型(LVLM)因使用数千个图像令牌而导致的计算开销问题,提出了一种平衡令牌剪枝(BTP)方法,该方法将剪枝过程分为多个阶段,以平衡对局部和全局输出的影响。实验表明,BTP在各种基准测试中实现了78%的压缩率,同时保持了原始模型96.7%的性能。
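The abstract mentions that earlier approaches typically select image tokens by attention score. A minimal sketch of that baseline idea follows; it is not the BTP algorithm, and the function name `prune_tokens_by_attention` and the `keep_ratio` parameter are hypothetical names chosen for illustration.

```python
def prune_tokens_by_attention(scores, keep_ratio):
    """Return the indices of the image tokens to keep, in original order.

    scores     -- one attention score per image token (higher = more important)
    keep_ratio -- fraction of tokens to keep, in (0, 1]
    This is a generic top-k baseline, not the staged BTP procedure.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so downstream layers see tokens in sequence.
    return sorted(ranked[:n_keep])


scores = [0.05, 0.30, 0.10, 0.40, 0.15]
kept = prune_tokens_by_attention(scores, keep_ratio=0.4)
print(kept)  # [1, 3]
```

Under this convention, a 78% compression rate would correspond to `keep_ratio=0.22`. A purely local criterion like this one ignores how pruning affects later layers, which is the gap the abstract says BTP targets.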
Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models
Authors: Rui Zhu, Song-Lin Lv, Zi-Kang Wang, Lan-Zhe Guo
First: 2025-10-23T12:16:41+00:00 · Latest: 2025-10-23T12:16:41+00:00
Abstract
Exploiting unlabeled data through semi-supervised learning (SSL) or leveraging pre-trained models via fine-tuning are two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to combining fine-tuning of pre-trained vision-language models (VLMs) with SSL, forming the emerging paradigm of semi-supervised fine-tuning. However, existing methods often suffer from model bias and hyperparameter sensitivity, due to reliance on prediction consistency or pre-defined confidence thresholds. To address these limitations, we propose a simple yet effective plug-and-play methodology named $\underline{\textbf{Bi-Co}}$nsistency-$\underline{\textbf{G}}$uided Self-Training (Bi-CoG), which assigns high-quality and low-bias pseudo-labels, by simultaneously exploiting inter-model and intra-model consistency, along with an error-aware dynamic pseudo-label assignment strategy. Both theoretical analysis and extensive experiments over 14 datasets demonstrate the effectiveness of Bi-CoG, which consistently and significantly improves the performance of existing methods.
中文标题/摘要
标题:Bi-CoG: 基于双向一致性自训练的视觉-语言模型
通过半监督学习(SSL)利用未标记数据或通过微调预训练模型是应对标签稀缺场景的两种主要范式。最近,人们越来越关注将预训练视觉-语言模型(VLMs)的微调与SSL结合起来,形成了新兴的半监督微调范式。然而,现有方法往往因依赖预测一致性或预定义的置信阈值而存在模型偏差和超参数敏感性的问题。为了解决这些限制,我们提出了一种简单而有效的即插即用方法,名为$\underline{\textbf{Bi-Co}}$nsistency-$\underline{\textbf{G}}$uided 自训练(Bi-CoG),该方法通过同时利用模型间和模型内的一致性,并结合一种错误感知的动态伪标签分配策略,赋予高质量和低偏差的伪标签。理论分析和在14个数据集上的广泛实验均证明了Bi-CoG的有效性,它一致且显著地提高了现有方法的性能。
Summary / 总结
The paper proposes Bi-CoG, a method for semi-supervised fine-tuning of vision-language models, which combines inter-model and intra-model consistency to generate high-quality pseudo-labels. The method uses an error-aware dynamic strategy to assign pseudo-labels, addressing the limitations of existing methods in terms of model bias and hyperparameter sensitivity. Extensive experiments on 14 datasets show that Bi-CoG significantly improves the performance of existing methods.
该论文提出了一种名为Bi-CoG的方法,用于视觉-语言模型的半监督微调。该方法通过利用模型间的和模型内的一致性来分配高质量的伪标签,并采用一种错误感知的动态伪标签分配策略。在14个数据集上的实验表明,Bi-CoG能够一致且显著地提高现有方法的性能。
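To make the pseudo-labeling terminology concrete, the sketch below combines two ideas named in the abstract: a pre-defined confidence threshold and agreement between two models. It is a simplified illustration, not the Bi-CoG algorithm; `select_pseudo_labels` and its parameters are hypothetical, and the paper's error-aware dynamic assignment is not reproduced here.

```python
def select_pseudo_labels(probs_a, probs_b, threshold=0.9):
    """Keep an unlabeled sample only if both models agree and are confident.

    probs_a, probs_b -- per-sample class-probability lists from two models
    threshold        -- fixed confidence cutoff (the kind of hyperparameter
                        the abstract describes as a source of sensitivity)
    Returns a dict mapping sample index -> pseudo-label.
    """
    selected = {}
    for idx, (pa, pb) in enumerate(zip(probs_a, probs_b)):
        label_a = max(range(len(pa)), key=pa.__getitem__)
        label_b = max(range(len(pb)), key=pb.__getitem__)
        agree = label_a == label_b
        confident = min(pa[label_a], pb[label_b]) >= threshold
        if agree and confident:
            selected[idx] = label_a
    return selected


probs_a = [[0.95, 0.05], [0.60, 0.40], [0.03, 0.97]]
probs_b = [[0.92, 0.08], [0.30, 0.70], [0.05, 0.95]]
labels = select_pseudo_labels(probs_a, probs_b, threshold=0.9)
print(labels)  # {0: 0, 2: 1}
```

Sample 1 is dropped because the two models disagree. Changing `threshold` changes which samples survive, which illustrates why a fixed cutoff can be fragile.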
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Venue: NeurIPS 2025
First: 2025-09-17T11:28:58+00:00 · Latest: 2025-10-23T10:59:53+00:00
Comments: NeurIPS 2025
Abstract
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
中文标题/摘要
标题:ViSpec:通过视觉感知投机解码加速视觉语言模型
投机解码是一种广泛采用的技术,用于加速大型语言模型(LLMs)的推理,但在视觉语言模型(VLMs)中的应用仍被忽视,现有方法仅能实现轻微的加速(<1.5x)。随着多模态能力成为大型模型的核心,这一差距变得越来越重要。我们假设大型VLMs可以在逐层过滤冗余图像信息的同时不损害文本理解,而较小的草稿模型则难以做到这一点。为了解决这个问题,我们引入了视觉感知投机解码(ViSpec),这是一种针对VLMs的新框架。ViSpec使用一个轻量级的视觉适配模块将图像标记压缩成紧凑的表示,无缝集成到草稿模型的注意力机制中,同时保留原始图像的位置信息。此外,我们为每个输入图像提取一个全局特征向量,并将该特征添加到所有后续文本标记中,以增强多模态的一致性。为了克服缺乏带有长助手响应的多模态数据集的问题,我们通过重新利用现有数据集并使用目标VLM生成扩展输出来构建一个专门的训练数据集,使用修改后的提示。我们的训练策略减轻了草稿模型直接访问目标模型隐藏状态的风险,这在仅使用目标模型输出进行训练时可能会导致捷径学习。广泛的实验验证了ViSpec,据我们所知,这是首次在VLM投机解码中实现显著加速。代码可在https://github.com/KangJialiang/ViSpec/获得。
Summary / 总结
ViSpec is a novel speculative decoding framework designed to accelerate vision-language models (VLMs) by integrating a lightweight vision adaptor that compresses image tokens into a compact representation. This method enhances multimodal coherence by augmenting text tokens with global image features. Experiments show that ViSpec achieves significant speedups in VLM inference, overcoming previous limitations where speedups were only modest (<1.5x).
ViSpec 是一种新颖的视觉语言模型(VLM)推测解码框架,通过将图像令牌压缩为紧凑表示并集成到草稿模型的注意力机制中来加速推理。该方法在保持多模态一致性的同时实现了显著的速度提升,而此前的方法仅能实现有限的加速(<1.5x)。此外,该方法还构建了一个专门的训练数据集,并采用可减轻捷径学习风险的训练策略。
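The draft-and-verify loop that speculative decoding builds on can be sketched with toy models. The example below shows one step of the greedy variant only; it does not model ViSpec's vision adaptor or training strategy, and `speculative_step`, `draft_next`, and `target_next` are hypothetical names. A real system would verify all drafted tokens in a single batched forward pass of the target model, whereas the sketch calls the target sequentially for clarity.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Run one greedy draft-and-verify step and return the new tokens.

    The draft model proposes k tokens. The target model accepts the longest
    prefix it agrees with, then contributes one token of its own (a correction
    on mismatch, or a bonus token if every draft token was accepted).
    """
    # Draft phase: propose k tokens autoregressively with the small model.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: compare each proposed token with the target's choice.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all accepted: one bonus token
    return accepted


def target_next(ctx):
    # Toy target model: next token is the context length modulo 5.
    return len(ctx) % 5


def draft_next(ctx):
    # Toy draft model: matches the target except at context length 3.
    return 9 if len(ctx) == 3 else len(ctx) % 5


new_tokens = speculative_step([0], draft_next, target_next, k=4)
print(new_tokens)  # [1, 2, 3]
```

The output always matches what greedy decoding with the target alone would produce. The speedup comes from how many draft tokens are accepted per target call, which is why draft quality on image-heavy inputs matters.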
Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
Authors: Wenping Jin, Yuyang Tang, Li Zhu, Fei Guo
First: 2025-10-21T16:31:56+00:00 · Latest: 2025-10-23T09:53:25+00:00
Abstract
A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed -- without per-scene retraining or parameter tuning -- has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel "Rebellious Student" framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network -- the rebellious student -- is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Experiments on the HAD100 benchmark show substantial improvements over several established baselines with modest computational overhead, confirming the effectiveness of the proposed complementary learning paradigm. Our code is publicly available at https://github.com/xjpp2016/FERS.
中文标题/摘要
标题:叛逆学生:用于高光谱异常检测背景特征增强的互补学习框架
一类可以在背景数据集上训练一次,然后在无需每景重新训练或参数调整的情况下广泛部署的高光谱异常检测方法,已经展示了显著的效率和鲁棒性。在此基础上,我们专注于光谱和空间线索的整合,并引入了一种新颖的“叛逆学生”框架,用于互补特征学习。与传统的模仿驱动的教师-学生范式不同,我们的方法故意训练空间分支与光谱教师相背离,从而学习光谱教师未能捕捉到的互补空间模式。采用两阶段学习策略:(1)首先通过反向蒸馏训练光谱增强网络,以获得稳健的背景光谱表示;(2)随后使用去相关损失优化空间网络(即叛逆学生),以确保特征正交性同时保持重建保真度,以避免无关噪声。训练完成后,该框架增强光谱和空间背景特征,与传统检测器结合时可实现无参数和无训练的异常检测。在HAD100基准上的实验显示,该方法相比几个现有基线取得了显著改进,且仅带来适度的计算开销,确认了所提互补学习范式的有效性。我们的代码可在https://github.com/xjpp2016/FERS/ 获取。
Summary / 总结
This paper introduces a 'Rebellious Student' framework for enhancing background features in hyperspectral anomaly detection. Motivated by the need for efficient and robust methods, the framework integrates spectral and spatial cues. It employs a two-stage learning strategy: first, a spectral enhancement network is trained using reverse distillation to obtain robust background spectral representations, and then a spatial network is optimized to learn complementary spatial patterns through decorrelation losses. Experiments on the HAD100 benchmark demonstrate substantial improvements over several established baselines with modest computational overhead, validating the effectiveness of the complementary learning approach. The framework enables parameter-free and training-free anomaly detection when paired with conventional detectors.
本文提出了一种“叛逆学生”框架,用于增强高光谱异常检测中的背景特征。该方法旨在实现无需场景特定重新训练的高效和鲁棒方法,作者提出了一种两阶段学习策略。首先,使用逆蒸馏训练光谱增强网络以捕获稳健的背景光谱表示。然后,优化空间网络使其与光谱教师相异,学习互补的空间模式。在HAD100基准上的实验表明,与现有方法相比,该方法在计算开销较小的情况下取得了显著改进,验证了互补学习方法的有效性。
S$^2$-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation
Authors: Quantao Yang, Michael C. Welle, Danica Kragic, Olov Andersson
First: 2025-02-13T15:06:42+00:00 · Latest: 2025-10-23T09:09:29+00:00
Abstract
Recent advances in skill learning have propelled robot manipulation to new heights by enabling it to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment \textit{instances} that are shown in the training data, and have trouble transferring to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S$^2$-Diffusion), which generalizes from instance-level training data to the category level, making skills transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks to allow the use of only a single RGB camera. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S$^2$-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even if it was not trained on that specific instance. Project website: https://s2-diffusion.github.io.
中文标题/摘要
标题:S$^2$-扩散:从实例级到类别级的机器人操作技能泛化
近期在技能学习方面的进展通过使机器人能够从实际数量的演示中学习复杂的操作任务,推动了机器人操作达到新的高度。然而,这些技能通常仅限于训练数据中展示的具体动作、物体和环境实例,并且难以转移到同一类别的其他实例。在本工作中,我们提出了一种开放词汇量的空间语义扩散策略(S$^2$-扩散),该策略能够从实例级的训练数据泛化到类别级,使技能能够在同一类别的不同实例之间进行转移。我们展示了通过可提示的语义模块结合空间表示可以捕捉技能的功能方面。我们进一步提出利用深度估计网络,仅使用单个RGB摄像头即可。我们的方法在多种机器人操作任务上进行了评估和比较,包括模拟和真实世界环境。我们的结果表明,S$^2$-扩散对与类别无关的因素变化具有不变性,并且即使未在特定实例上进行训练,也能在相同类别内的其他实例上实现令人满意的性能。项目网站:https://s2-diffusion.github.io/
Summary / 总结
This paper addresses the challenge of generalizing robot manipulation skills from specific instances to broader categories. It introduces S$^2$-Diffusion, a policy that combines a promptable semantic module and a spatial representation to capture the functional aspects of skills. The method uses depth estimation networks to operate with a single RGB camera, simplifying the setup. Experimental results demonstrate that S$^2$-Diffusion can generalize well to new instances within the same category, showing improved performance in various robot manipulation tasks both in simulation and the real world.
本文解决了将机器人操作技能从特定实例推广到更广泛类别中的挑战。它提出了S$^2$-Diffusion策略,结合了可提示的语义模块和空间表示,以捕捉技能的功能方面。该方法使用深度估计网络,仅需单个RGB摄像头即可操作,简化了设置。实验结果表明,S$^2$-Diffusion能够在同一类别中的新实例上表现出良好的泛化能力,并在各种机器人操作任务中(包括仿真和真实世界)展示了更好的性能。
A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Authors: Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, Chang Zou, Yue Ma, Linfeng Zhang
First: 2025-10-22T16:46:05+00:00 · Latest: 2025-10-23T09:09:15+00:00
Comments: 22 pages, 2 figures
Abstract
Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
中文标题/摘要
标题:关于扩散模型中缓存方法的综述:朝向高效的多模态生成
扩散模型已成为现代生成AI的基石,因其卓越的生成质量和可控性。然而,它们固有的\textit{多步迭代}和\textit{复杂骨干网络}导致了巨大的计算开销和生成延迟,成为实时应用的主要瓶颈。尽管现有加速技术取得了一定进展,但仍面临适用性有限、高训练成本或质量下降等问题。 在此背景下,\textbf{扩散缓存}提供了一种无训练、架构无关且高效的推理范式。其核心机制识别并重用了扩散过程中的内在计算冗余。通过在特征级别实现跨步重用和跨层调度,它减少了计算量而不修改模型参数。本文系统地回顾了扩散缓存的理论基础及其演变,并提出了一种统一的分类和分析框架。 通过对代表性方法的比较分析,我们表明扩散缓存从\textit{静态重用}发展到\textit{动态预测}。这一趋势增强了缓存的灵活性,使其适用于各种任务,并能够与其他加速技术如采样优化和模型蒸馏集成,为未来的多模态和交互应用提供统一、高效的推理框架。我们认为,这一范式将成为实时和高效生成AI的关键使能器,为\textit{高效生成智能}的理论和实践注入新的活力。
Summary / 总结
The paper addresses the computational challenges of diffusion models in real-time applications by introducing diffusion caching, which reduces computational overhead without altering model parameters. It reviews the evolution of diffusion caching from static reuse to dynamic prediction, showing its potential for efficient inference and integration with other acceleration techniques. Key findings include the enhancement of caching flexibility and the development of a unified framework for analysis and classification of diffusion caching methods.
论文通过引入扩散缓存来解决实际应用中扩散模型的计算挑战,该方法在不改变模型参数的情况下减少计算开销。它回顾了从静态重用到动态预测的扩散缓存演变,展示了其在高效推理和与其他加速技术(如采样优化和模型蒸馏)集成方面的潜力。关键发现包括缓存灵活性的增强以及开发了一个统一的框架来进行扩散缓存方法的分析和分类。
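The "static reuse" end of the spectrum described in the survey can be illustrated with a wrapper that recomputes an expensive block only every few denoising steps and otherwise returns the cached feature. This is a toy sketch under stated assumptions: `CachedBlock` and `refresh_every` are hypothetical names, the block is a stand-in function rather than a real network layer, and no specific method from the survey is reproduced.

```python
class CachedBlock:
    """Wrap an expensive block and reuse its output across diffusion steps.

    Static reuse: the feature is recomputed on a fixed schedule
    (every `refresh_every` steps) and served from the cache in between.
    """

    def __init__(self, fn, refresh_every):
        self.fn = fn
        self.refresh_every = refresh_every
        self.cache = None
        self.calls = 0  # number of real (uncached) evaluations

    def __call__(self, x, step):
        if self.cache is None or step % self.refresh_every == 0:
            self.cache = self.fn(x)
            self.calls += 1
        return self.cache


# Stand-in for a costly backbone block.
block = CachedBlock(lambda x: x * 2.0, refresh_every=3)
outputs = [block(float(step), step) for step in range(6)]
print(outputs)      # [0.0, 0.0, 0.0, 6.0, 6.0, 6.0]
print(block.calls)  # 2
```

Six steps cost only two real evaluations, at the price of stale features on the steps in between. Methods on the "dynamic prediction" side would replace the fixed schedule with an adaptive decision or a forecast of the feature, which this sketch does not attempt.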
GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?
Authors: Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
First: 2025-10-23T08:33:24+00:00 · Latest: 2025-10-23T08:33:24+00:00
Abstract
Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile graphical user interfaces (GUIs). Operating in dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, environmental injection corrupts an agent's visual perception by inserting adversarial UI elements (for example, deceptive overlays or spoofed notifications) directly into the GUI. This bypasses textual safeguards and can derail execution, causing privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark for assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios. We further propose a judge-LLM protocol that conducts fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding screenshot sequence, pinpointing failure in perception, recognition, or reasoning. Comprehensive experiments on state-of-the-art agents reveal pronounced vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat, paving the way toward more robust and secure embodied agents.
中文标题/摘要
标题:GhostEI-Bench:移动代理在动态设备环境中对环境注入的韧性如何?
视觉-语言模型(VLMs)正越来越多地作为自主代理部署,以导航移动图形用户界面(GUIs)。在包括通知、弹出窗口和跨应用交互的动态设备端生态系统中运行,使它们面临一种独特的、尚未充分探索的威胁向量:环境注入。与基于提示的攻击不同,后者通过操纵文本指令来影响代理,环境注入通过直接向GUI插入敌对的UI元素(例如,欺骗性覆盖或伪造的通知)来破坏代理的视觉感知。这绕过了文本保护措施,可能导致执行中断,引发隐私泄露、经济损失或设备不可逆的破坏。为了系统地评估这一威胁,我们引入了GhostEI-Bench,这是首个评估移动代理在动态可执行环境中遭受环境注入攻击的基准。超越基于静态图像的评估,GhostEI-Bench将敌对事件注入到完全运行的Android模拟器中的现实应用工作流中,并在关键风险场景中评估性能。我们进一步提出了一种法官-LLM协议,通过审查代理的动作轨迹与相应的屏幕截图序列来开展精细的失败分析,定位感知、识别或推理中的失败。全面的实验表明,最先进的代理模型对欺骗性环境线索表现出明显的脆弱性:当前模型系统地无法感知和推理关于被操纵的UI。GhostEI-Bench提供了一种量化和缓解这一新兴威胁的框架,为更稳健和安全的实体代理铺平了道路。
Summary / 总结
The paper introduces GhostEI-Bench, a benchmark for evaluating mobile agents' resilience to environmental injection attacks in dynamic on-device environments. It assesses agents by injecting adversarial events into realistic application workflows on fully operational Android emulators and evaluates their performance in critical risk scenarios. Key findings show that state-of-the-art agents are highly vulnerable to deceptive environmental cues, failing to perceive and reason about manipulated UIs. This benchmark provides a framework for quantifying and mitigating this emerging threat, enhancing the security of embodied agents.
研究旨在评估视觉-语言模型(VLMs)作为移动代理在动态设备环境中对环境注入攻击的抗性。研究引入了GhostEI-Bench基准,该基准在完全运行的Android模拟器中将恶意事件注入到现实的应用工作流中。关键发现表明,最先进的VLMs对欺骗性的环境提示高度脆弱,无法感知和推理关于被操纵的UI。GhostEI-Bench框架有助于量化和缓解这一新兴威胁,提高实体代理的安全性。
VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions
Authors: Ziteng Wang, Siqi Yang, Limeng Qiao, Lin Ma
Venue: NeurIPS 2025
First: 2025-08-04T11:57:10+00:00 · Latest: 2025-10-23T08:20:46+00:00
Comments: Accepted to NeurIPS 2025
Abstract
Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN's visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.
中文标题/摘要
标题:VITRIX-CLIPIN:通过指令编辑数据和长描述增强CLIP的细粒度视觉理解
尽管视觉语言模型(VLMs)如CLIP在视觉和语言对齐方面取得了成功,但在细节和细粒度的视觉理解方面仍面临关键挑战。我们提出了CLIP-IN,这是一种新颖的框架,通过两种核心创新增强了CLIP的细粒度感知。首先,我们利用最初为图像操作设计的指令编辑数据集作为硬负样本图像-文本对的独特来源。结合对称的硬负样本对比损失,这使模型能够有效地区分细微的视觉语义差异。其次,CLIP-IN引入了长描述性字幕,利用旋转位置编码来捕捉标准CLIP经常忽略的丰富语义上下文。我们的实验表明,CLIP-IN在MMVP基准和各种细粒度视觉识别任务上取得了显著的提升,而不会牺牲更广泛分类和检索任务的鲁棒零样本性能。关键的是,将CLIP-IN的视觉表示集成到多模态大型语言模型中,显著减少了视觉幻觉并增强了推理能力。这项工作强调了将目标导向的对比学习与全面的描述性信息相结合以提升VLMs细粒度理解的巨大潜力。
Summary / 总结
The research aims to improve CLIP's fine-grained visual understanding by introducing CLIP-IN, which uses instruction-editing datasets and long captions. CLIP-IN enhances CLIP's ability to distinguish subtle visual-semantic differences and captures rich semantic context, leading to significant improvements on fine-grained visual recognition tasks without affecting broader classification tasks. Integrating CLIP-IN's visual representations into multimodal large language models reduces visual hallucinations and enhances reasoning abilities.
研究旨在通过引入CLIP-IN来提升CLIP的细粒度视觉理解能力,CLIP-IN利用指令编辑数据集和长描述性标题。CLIP-IN增强了CLIP区分细微视觉语义差异的能力,并捕捉丰富的语义上下文,从而在细粒度视觉识别任务上取得了显著改进,同时不影响更广泛的分类任务。将CLIP-IN的视觉表示集成到多模态大型语言模型中可以减少视觉幻觉并增强推理能力。
Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning
Authors: Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, Bo Dai
First: 2025-06-05T17:59:42+00:00 · Latest: 2025-10-23T08:15:56+00:00
Comments: Project Page: https://directlayout.github.io/
Abstract
Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.
中文标题/摘要
标题:基于空间推理的直接数值布局生成以实现3D室内场景合成
逼真的3D室内场景合成对于具身AI和数字内容创作至关重要。它可以自然地分为两个子任务:对象生成和布局生成。虽然最近的生成模型在对象级的质量和可控性方面取得了显著进展,但布局生成仍然具有挑战性,因为数据集有限。现有方法要么过度拟合这些数据集,要么依赖预定义的约束来优化数值布局,牺牲了灵活性。因此,它们无法生成既开放词汇又与精细用户指令对齐的场景。我们引入了DirectLayout框架,该框架使用大型语言模型(LLMs)的可泛化空间推理直接从文本描述生成数值3D布局。DirectLayout将生成过程分解为三个阶段:生成鸟瞰图(BEV)布局、将其提升到3D空间以及细化对象放置。为了实现显式空间推理并帮助模型掌握对象放置的基本原则,我们基于3D-Front数据集采用了基于链式思维(CoT)激活。此外,我们设计了基于链式思维(CoT)的生成布局奖励,以增强泛化能力和空间规划。在推理过程中,DirectLayout通过上下文学习解决资产-布局不匹配问题。广泛的实验表明,DirectLayout实现了令人印象深刻的语义一致性、泛化能力和物理合理性。
Summary / 总结
DirectLayout is a framework that generates 3D indoor scene layouts directly from text descriptions using spatial reasoning from large language models. It decomposes the process into three stages: BEV layout generation, 3D space lifting, and object placement refinement. The method uses Chain-of-Thought Activation and a specialized reward system to improve generalization and spatial planning. Experiments show that DirectLayout achieves high semantic consistency, generalization, and physical plausibility, addressing limitations of previous methods by handling open-vocabulary scenes and user instructions effectively.
DirectLayout 是一个框架,直接从文本描述生成 3D 室内场景布局,使用大型语言模型的空间推理。该方法将过程分解为三个阶段:BEV 布局生成、3D 空间提升和对象放置细化。该方法使用 Chain-of-Thought 激活和专门的奖励系统来提高泛化能力和空间规划,并在推理过程中解决资产-布局不匹配问题。实验表明,DirectLayout 在场景合成中实现了高语义一致性、泛化能力和物理合理性。
MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Authors: Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
First: 2025-10-17T07:50:58+00:00 · Latest: 2025-10-23T07:18:58+00:00
Abstract
Most existing underwater instance segmentation approaches are constrained by closed-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce \textbf{MARIS} (\underline{Mar}ine Open-Vocabulary \underline{I}nstance \underline{S}egmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by a lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (\textbf{GPEM}) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (\textbf{SAIM}) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both In-Domain and Cross-Domain settings on MARIS, establishing a strong foundation for future underwater perception research.
中文标题/摘要
标题:MARIS:海洋开放词汇实例分割与几何增强及语义对齐
大多数现有的水下实例分割方法受到封闭词汇预测的限制,限制了它们识别新型海洋类别的能力。为了支持评估,我们引入了MARIS(Marine Open-Vocabulary Instance Segmentation),这是第一个大规模细粒度的水下开放词汇(OV)分割基准,包含一组有限的已见类别和多样化的未见类别。尽管在自然图像上OV分割显示出前景,但我们的分析表明,将其转移到水下场景会遭受严重的视觉退化(例如,颜色衰减)和由于缺乏水下类定义而引起的语义错位。为了解决这些问题,我们提出了一种统一框架,包含两个互补组件。几何先验增强模块(GPEM)利用稳定的部分级和结构线索,在退化视觉条件下保持对象一致性。语义对齐注入机制(SAIM)通过引入领域特定的先验丰富语言嵌入,减轻语义歧义并提高对未见类别的识别能力。实验表明,我们的框架在MARIS上的领域内和跨领域设置中均持续优于现有OV基线,为未来的水下感知研究奠定了坚实的基础。
Summary / 总结
The research aims to address the limitations of existing underwater instance segmentation methods that are constrained by closed-vocabulary prediction. To tackle this, the authors introduce MARIS, a benchmark for underwater Open-Vocabulary instance segmentation with seen and unseen categories. They propose a unified framework with two components: GPEM, which uses stable part-level and structural cues to maintain object consistency under degraded visual conditions, and SAIM, which enriches language embeddings with domain-specific priors to improve recognition of unseen categories. Experiments demonstrate that their approach outperforms existing OV baselines in both in-domain and cross-domain settings, setting a strong foundation for future underwater perception research.
研究旨在解决现有水下实例分割方法受限于封闭词汇预测的问题。为此,作者引入了MARIS,一个用于水下开放词汇分割的基准,并提出了一种统一框架,包含GPEM模块以在视觉条件退化时保持物体一致性,以及SAIM机制以减轻语义模糊并提高对未见过类别的识别能力。实验结果表明,该框架在域内和跨域设置中均优于现有方法,为未来的水下感知研究奠定了坚实基础。
Breakdance Video classification in the age of Generative AI
Authors: Sauptik Dhar, Naveen Ramakrishnan, Michelle Munson
First: 2025-10-23T07:18:54+00:00 · Latest: 2025-10-23T07:18:54+00:00
Comments: 11 pages
Abstract
Large vision-language models have recently been applied widely across sports use cases. Most of this work targets a limited subset of popular sports such as soccer, cricket, and basketball, and focuses on generative tasks such as visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) to a niche but hugely popular dance sport: breakdance. Our results show that video encoder models continue to outperform state-of-the-art video language models on prediction tasks. We provide insights on how to choose the encoder model and a thorough analysis of the workings of a finetuned decoder model for breakdance video classification.
中文标题/摘要
标题:街舞视频分类在生成式AI时代
大型视觉语言模型最近在多项体育应用中得到了广泛应用。这些工作大多集中在足球、板球、篮球等少数热门体育项目上,主要集中在生成任务,如视觉问答和高光生成。本研究分析了现代视频基础模型(包括编码器和解码器)在一项非常小众但非常流行的舞蹈运动——街舞中的适用性。我们的结果显示,视频编码器模型在预测任务中继续优于最先进的视频语言模型。我们提供了选择编码器模型的见解,并对微调解码器模型在街舞视频分类中的工作机制进行了全面分析。
Summary / 总结
This study explores the use of large vision language models for classifying breakdance videos, a niche but popular dance sport. The research compares modern video foundation models, both encoder and decoder, and finds that video encoders outperform state-of-the-art video language models in prediction tasks. The study also provides guidance on selecting the appropriate encoder model and analyzes the performance of a fine-tuned decoder model for breakdance video classification.
该研究探讨了大型视觉语言模型在分类霹雳舞视频中的应用,霹雳舞是一种小众但流行的舞蹈运动。研究比较了现代视频基础模型,包括编码器和解码器,并发现视频编码器在预测任务中优于最先进的视频语言模型。研究还提供了选择适当编码器模型的指导,并分析了微调后的解码器模型在霹雳舞视频分类中的性能。
Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding
Authors: Minseok Kang, Minhyeok Lee, Minjung Kim, Donghyeong Kim, Sangyoun Lee
Venue: NeurIPS 2025
First: 2025-10-23T05:53:01+00:00 · Latest: 2025-10-23T05:53:01+00:00
Comments: 28 pages, including appendix. 5 figures. Full version of the NeurIPS 2025 paper
Abstract
Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent advances have been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) token-role-aware cross-modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances fine-grained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across the QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.
中文标题/摘要
标题:赋权之言:DualGround 用于结构化短语和句子级时间定位
视频时间定位(VTG)旨在将给定自然语言查询对齐的时序段在长的、未剪辑的视频中进行本地化。该任务通常包括两个子任务:时刻检索(MR)和高光检测(HD)。尽管最近的进步得益于强大的预训练跨模态模型如CLIP和InternVideo2,但现有方法通常在跨模态注意力中统一处理所有文本标记,忽略了它们的语义角色。为了验证这种方法的局限性,我们进行了受控实验,表明VTG模型过度依赖于由[EOS]驱动的全局语义,而未能有效利用词级信号,这限制了它们实现细粒度时间对齐的能力。受此局限的启发,我们提出了DualGround,这是一种双分支架构,明确分离全局和局部语义,通过将[EOS]标记路由到句子级路径,并将词标记聚类为短语级单元进行局部定位。我们的方法引入了(1)词角色感知的跨模态交互策略,以结构化分离的方式对齐视频特征与句子级和短语级语义,以及(2)一种联合建模框架,不仅提高了全局句子级对齐,还通过利用结构化短语感知上下文增强了细粒度时间定位。这种设计使模型能够捕捉粗略和局部语义,从而实现更具表现力和上下文感知的视频定位。DualGround 在 QVHighlights 和 Charades-STA 基准上的时刻检索和高光检测任务中均取得了最先进的性能,证明了分离语义建模在视频-语言对齐中的有效性。
Summary / 总结
The research aims to improve video temporal grounding by addressing the limitations of existing approaches that treat all text tokens uniformly. The DualGround method proposes a dual-branch architecture to explicitly separate global and local semantics, using the [EOS] token for sentence-level processing and clustering word tokens into phrase-level units. This design enhances both global sentence-level alignment and fine-grained temporal grounding, achieving state-of-the-art performance on Moment Retrieval and Highlight Detection tasks.
研究旨在通过解决现有方法将所有文本标记统一处理的局限性,提高视频时间定位的效果。提出的DualGround方法使用双分支架构来分离全局和局部语义,增强细粒度的时间对齐。关键发现表明,DualGround在QVHighlights和Charades-STA基准上的时刻检索和高亮检测任务中均优于先前的方法,突出了分离语义建模的优势。
COS3D: Collaborative Open-Vocabulary 3D Segmentation
Authors: Runsong Zhu, Ka-Hei Hui, Zhengzhe Liu, Qianyi Wu, Weiliang Tang, Shi Qiu, Pheng-Ann Heng, Chi-Wing Fu
Venue: NeurIPS 2025
First: 2025-10-23T05:45:15+00:00 · Latest: 2025-10-23T05:45:15+00:00
Comments: NeurIPS 2025. The code is publicly available at https://github.com/Runsong123/COS3D
Abstract
Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that effectively integrates complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of a collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and the language field through a novel instance-to-language feature mapping and an efficient two-stage training strategy. During inference, to bridge the distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D's leading performance over existing methods on two widely-used benchmarks but also show its high potential for various applications, i.e., novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at https://github.com/Runsong123/COS3D.
中文标题/摘要
标题:COS3D:协作开放式词汇3D分割
开放式词汇3D分割是一项基本但具有挑战性的任务,需要对分割和语言有共同的理解。然而,现有的基于高斯泼溅的方法要么依赖单一的3D语言场,导致分割效果不佳,要么依赖预先计算的类别无关分割,导致错误累积。为了解决这些限制,我们提出了COS3D,这是一种新的协作提示分割框架,能够在其整个管道中有效地整合互补的语言和分割线索。我们首先引入了协作场的新概念,包括实例场和语言场,作为协作的基础。在训练过程中,为了有效地构建协作场,我们的关键思想是通过新颖的实例到语言特征映射捕捉实例场和语言场之间的内在关系,并设计了一种高效的两阶段训练策略。在推理过程中,为了弥合两个场之间不同的特征,我们进一步设计了自适应的语言到实例提示细化,促进高质量的提示分割推理。广泛的实验不仅证明了COS3D在两个广泛使用的基准上的领先性能,还展示了其在各种应用中的高潜力,例如新颖的基于图像的3D分割、层次分割和机器人技术。代码可在 https://github.com/Runsong123/COS3D 公开获取。
Summary / 总结
COS3D is a collaborative prompt-segmentation framework that integrates language and segmentation cues to improve 3D segmentation. It introduces a collaborative field combining an instance field and a language field, and uses a novel instance-to-language feature mapping and an efficient two-stage training strategy to construct this field. During inference, adaptive language-to-instance prompt refinement is designed to enhance the quality of segmentation. Experiments show that COS3D outperforms existing methods on two benchmarks and has potential applications in various fields such as 3D segmentation, hierarchical segmentation, and robotics.
COS3D 是一种协作式提示分割框架,通过整合语言和分割线索来提升 3D 分割效果。它引入了一个协作场,包含实例场和语言场,并使用新颖的实例到语言特征映射和高效的两阶段训练策略来在训练期间有效构建该场。在推理期间,它设计了适应性的语言到实例提示精炼,以提高分割质量。实验表明,COS3D 在两个基准上的性能优于现有方法,并且在 3D 分割、层次分割和机器人等领域具有广泛应用潜力。
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context
Authors: Ge Zheng, Jiaye Qian, Jiajin Tang, Sibei Yang
Venue: ICCV
First: 2025-10-23T05:22:07+00:00 · Latest: 2025-10-23T05:22:07+00:00
Abstract
Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses.
中文标题/摘要
标题:为什么LVLM在长响应中更容易产生幻觉:上下文的作用
大型视觉-语言模型(LVLMs)近年来取得了显著进展,但也容易出现幻觉问题。它们在较长的自由形式响应中表现出更多的幻觉,通常归因于累积的不确定性。在本文中,我们提出一个问题:幻觉的增加是否仅由长度引起的错误导致,还是存在更深层次的机制?经过一系列初步实验和发现,我们建议幻觉的风险并非由长度本身引起,而是由在较长响应中对上下文的依赖增加以确保连贯性和完整性所导致。基于这些见解,我们提出了一种新颖的“诱导-检测-抑制”框架,通过故意设计的上下文主动诱导幻觉,利用诱导的实例早期检测高风险案例,并最终在实际解码过程中抑制潜在的对象级幻觉。我们的方法在所有基准测试中都实现了持续且显著的改进,证明了其有效性。强大的检测和改进的幻觉抑制不仅验证了我们的框架,更重要的是,重新验证了我们关于上下文的假设。本研究不仅旨在追求性能提升,还旨在提供新的见解,并作为深入探索LVLM在长响应中幻觉问题的第一步。
OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts
Authors: Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, Priyadarshini Panda
First: 2025-07-07T19:16:22+00:00 · Latest: 2025-10-23T05:00:20+00:00
Abstract
The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.
中文标题/摘要
标题:OpenWorldSAM:扩展SAM2以通过语言提示实现通用图像分割
基于开放语言提示进行对象分割的能力仍然是一个关键挑战,要求模型将文本语义转化为精确的空间掩码,同时处理多样且未见过的类别。我们提出了OpenWorldSAM,这是一种框架,通过将轻量级视觉-语言模型(VLM)提取的多模态嵌入整合到SAM2的提示驱动模型中,将其扩展到开放词汇场景。我们的方法遵循四个关键原则:i) 统一提示:OpenWorldSAM 支持各种提示,包括类别级和句子级语言描述,提供灵活的接口以适应各种分割任务。ii) 高效性:通过冻结SAM2和VLM的预训练组件,我们仅在COCO-stuff数据集上训练450万参数,实现了显著的资源效率。iii) 实例意识:我们通过新颖的位置决断嵌入和交叉注意力层增强模型的空间理解,使其能够有效分割多个实例。iv) 通用性:OpenWorldSAM 展现出强大的零样本能力,在未见过的类别和开放词汇概念上表现出良好的泛化能力,无需额外训练。广泛的实验表明,OpenWorldSAM 在多个基准测试中的开放词汇语义、实例和全景分割方面达到了最先进的性能。代码可在https://github.com/GinnyXiao/OpenWorldSAM/ 获取。
Summary / 总结
OpenWorldSAM extends SAM2 to handle open-vocabulary image segmentation using multi-modal embeddings from a lightweight VLM. It supports diverse prompts, trains only 4.5 million parameters while keeping the pre-trained SAM2 and VLM components frozen, and enhances spatial understanding with positional tie-breaker embeddings and cross-attention layers. OpenWorldSAM shows strong zero-shot performance and achieves state-of-the-art results on multiple benchmarks for open-vocabulary segmentation tasks.
OpenWorldSAM 扩展了 SAM2,使用轻量级 VLM 提取的多模态嵌入来处理开放词汇的图像分割。它支持多种提示,资源效率高,增强空间理解,并在未见类别和开放词汇概念上表现出强大的零样本泛化能力。实验表明,它在多个基准测试中超越了现有方法,在开放词汇语义、实例和全景分割任务中表现优异。
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Authors: Yi Ding, Ruqi Zhang
Venue: NeurIPS 2025
First: 2025-05-28T17:58:03+00:00 · Latest: 2025-10-23T04:45:46+00:00
Comments: Published at NeurIPS 2025, 27 pages
Abstract
Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.
中文标题/摘要
标题:夏洛克:视觉语言模型的自我纠正推理
视觉语言模型(VLMs)在复杂多模态任务上表现出色,但仍然面临重大挑战:它们对推理错误极为敏感,需要大量标注数据或准确的验证者,并且难以在特定领域之外进行泛化。为解决这些限制,我们探索了自我纠正作为增强推理VLMs的策略。我们首先深入分析了推理VLMs的自我纠正能力,并确定了关键差距。基于我们的发现,我们引入了夏洛克,这是一种自我纠正和自我改进训练框架。夏洛克引入了轨迹级自我纠正目标、基于视觉扰动的偏好数据构建方法以及动态$\beta$值用于偏好调整。一旦模型仅使用20,000个随机采样的标注数据就获得了自我纠正能力,它将继续自我改进而无需外部监督。基于Llama3.2-Vision-11B模型,夏洛克在八个基准测试中取得了显著成果,直接生成的准确率为64.1%,自我纠正后的准确率为65.4%。它在使用不到20%的标注数据的情况下,优于LLaVA-CoT(63.2)、Mulberry(63.9)和LlamaV-o1(63.4)。
Summary / 总结
The paper addresses the limitations of Reasoning Vision-Language Models (VLMs) in handling reasoning errors and generalizing across domains. It introduces Sherlock, a self-correction and self-improvement framework that enhances VLMs by incorporating a trajectory-level self-correction objective, visual perturbation-based preference data, and dynamic $\beta$ for preference tuning. Using only 20k annotated data, Sherlock achieves an average accuracy of 64.1 with direct generation and 65.4 after self-correction, outperforming other models like LLaVA-CoT, Mulberry, and LlamaV-o1 while using less than 20% of the annotated data.
论文通过引入Sherlock框架,旨在解决推理视觉语言模型(VLMs)的局限性,该框架包括轨迹级自我纠正目标、基于视觉扰动的偏好数据构建方法以及动态$\beta$偏好调整。仅使用20k标注数据,Sherlock在直接生成时达到平均准确率64.1,在自我纠正后达到65.4,超越了LLaVA-CoT、Mulberry和LlamaV-o1等模型,同时使用了不到20%的标注数据。
Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Authors: Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
First: 2025-10-14T19:57:03+00:00 · Latest: 2025-10-23T03:45:15+00:00
Comments: This paper contains fundamental errors and will not be replaced
Abstract
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate the FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
中文标题/摘要
标题:具有知识意识的胎儿超声影像语言基础模型
近期的医疗视觉-语言模型在问答、报告生成和异常检测等任务上显示出潜力。然而,大多数模型适应于结构化的成人影像,在胎儿超声方面表现不佳,这带来了多视角图像推理、多种疾病和图像多样性等挑战。为解决这一问题,我们引入了FetalMind,这是一种针对胎儿超声的医疗AI系统,用于报告生成和诊断。受临床工作流程的启发,我们提出了显著知识解耦(SED),将专家策划的二分图注入模型中,以解耦视角-疾病关联,并通过强化学习引导临床忠实步骤的偏好选择。这一设计减轻了疾病间的变异性以及视角间的异质性,减少了学习瓶颈,使模型的推理与产科实践保持一致。为了大规模训练FetalMind,我们构建了FetalSigma-1M数据集,这是首个大规模的胎儿超声报告语料库,包含来自十二家医疗机构的20000份报告,解决了领域数据稀缺的问题。广泛的实验表明,FetalMind在所有妊娠阶段的表现均优于开源和闭源基线,平均提升14%,在关键条件下准确率提高61.2%,同时保持高效、稳定和可扩展。项目页面:https://hexiao0275.github.io/FetalMind。
Summary / 总结
The research aims to improve the performance of vision-language models in fetal ultrasound interpretation, addressing challenges such as multi-view image reasoning and disease diversity. FetalMind, a medical AI system, uses Salient Epistemic Disentanglement (SED) to inject a bipartite graph and steer the model's preference selection via reinforcement learning, aligning with clinical practice. Experiments show FetalMind outperforms existing baselines, achieving significant gains in accuracy, especially for critical conditions, while maintaining efficiency and scalability. However, the paper contains fundamental errors and will not be replaced.
研究旨在通过解决多视角图像推理和疾病多样性的问题,提高胎儿超声图像的解释能力。方法是使用受临床工作流程引导的SED(显著知识解耦),解耦视角-疾病关联,并通过强化学习引导偏好选择。关键发现表明,FetalMind在所有妊娠阶段都优于现有基线,尤其在严重状况下的准确率提高了61.2%,同时保持了高效、稳定和可扩展。项目页面:https://hexiao0275.github.io/FetalMind