arXiv Paper Digest / arXiv 论文速递

Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang

First: 2025-10-23T17:59:21+00:00 · Latest: 2025-10-23T17:59:21+00:00

Abstract

Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
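
As a rough illustration of the consensus expert selection idea in the abstract above, the Python sketch below keeps only the draft reasoning paths whose answers are shared by the most experts before they would be handed to the verdict model. It is a simplification under stated assumptions, not the authors' implementation: the function name `select_consensus_paths`, the `(expert, answer, reasoning)` tuple layout, and agreement measured by exact match on normalized answer strings are all hypothetical, and the paper's actual consensus score may differ.

```python
from collections import Counter


def select_consensus_paths(drafts, k=2):
    """Keep the k draft paths whose answers agree with the most other drafts.

    drafts: list of (expert_name, answer, reasoning) tuples (hypothetical layout).
    Agreement is exact match on lower-cased, stripped answers (an assumption).
    """
    normalized = [answer.strip().lower() for _, answer, _ in drafts]
    counts = Counter(normalized)
    # Rank by how many experts produced the same answer; Python's sort is
    # stable, so ties keep their original order.
    order = sorted(range(len(drafts)), key=lambda i: -counts[normalized[i]])
    return [drafts[i] for i in order[:k]]


drafts = [
    ("expert_a", "42", "reads the value from the bar chart"),
    ("expert_b", "41", "misreads the axis tick"),
    ("expert_c", "42 ", "sums the two annotated values"),
]
selected = select_consensus_paths(drafts, k=2)
print([name for name, _, _ in selected])
```

In this toy run the two experts that agree on "42" are forwarded and the outlier is dropped; a verdict model would then see only those two reasoning paths.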

DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

Authors: Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal

First: 2025-10-23T17:42:14+00:00 · Latest: 2025-10-23T17:42:14+00:00

Abstract

Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that dramatically exceed the training resolution, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. The project page is available at https://noamissachar.github.io/DyPE/.

中文标题/摘要

标题：DyPE：动态位置外推在超高清扩散中的应用

扩散变换器模型可以生成具有非凡保真度和细节的图像，但由于自注意力机制与图像标记数量的平方级扩展，训练它们在超高清分辨率上仍然非常昂贵。在本文中，我们引入了一种名为动态位置外推（DyPE）的新型、无需训练的方法，该方法使预训练的扩散变换器能够在远超其训练数据的分辨率下合成图像，且无需额外的采样成本。DyPE 利用了扩散过程固有的频谱进展，其中低频结构早期收敛，而高频结构需要更多步骤才能解决。具体而言，DyPE 在每次扩散步骤中动态调整模型的位置编码，使其频谱与生成过程的当前阶段相匹配。这种方法使我们能够在远超训练分辨率的分辨率下生成图像，例如，使用 FLUX 生成 1600 万像素的图像。在多个基准测试中，DyPE 一致地提高了性能，并在超高清图像生成中达到了最先进的保真度，尤其是在更高分辨率下，性能提升更为显著。项目页面可在 https://noamissachar.github.io/DyPE/ 获取。

Summary / 总结

DyPE is a training-free method that allows pre-trained diffusion transformers to generate images at ultra-high resolutions by dynamically adjusting positional encodings during the diffusion process. This method leverages the spectral progression of the diffusion process to match the frequency spectrum of the model's positional encoding with the current stage of generation, enabling image synthesis at resolutions far beyond the training data. DyPE significantly improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, especially at higher resolutions.

DyPE 是一种无需训练的方法，通过在扩散过程中动态调整位置编码来使预训练的扩散变换器生成超高清图像。该方法利用扩散过程中的频谱进展，使模型的位置编码频谱与生成过程的当前阶段相匹配，从而允许在远超训练数据分辨率的分辨率下合成图像。DyPE 在超高清图像生成中提高了性能，并实现了最先进的保真度，尤其是在更高分辨率下效果更明显。
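
To make the quadratic-scaling remark in the abstract concrete, the short Python sketch below counts image tokens and attention pairs at two resolutions. The numbers are illustrative only: the 16-pixel-per-token factor (8x latent downsampling followed by 2x2 patching) and the 1024x1024 base resolution are assumptions introduced here, not figures taken from the paper.

```python
def num_tokens(height, width, px_per_token=16):
    """Tokens the transformer sees for an image of the given pixel size.

    px_per_token=16 is an assumption (8x latent downsampling, then 2x2
    patching); other models use different factors.
    """
    return (height // px_per_token) * (width // px_per_token)


def attention_pairs(n_tokens):
    """Query-key pairs in full self-attention, which grows as n^2."""
    return n_tokens * n_tokens


base = num_tokens(1024, 1024)    # hypothetical base resolution
large = num_tokens(4096, 4096)   # about 16.8 million pixels

print(base, large)
print(large // base)                                    # growth in tokens
print(attention_pairs(large) // attention_pairs(base))  # growth in attention pairs
```

Under these assumptions a 16x increase in pixels gives 16x more tokens but 256x more attention pairs, which is why training directly at such resolutions is costly.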

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Authors: Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

Venue: NeurIPS 2025

First: 2025-10-13T15:25:52+00:00 · Latest: 2025-10-23T16:40:49+00:00

Comments: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk

Abstract

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

中文标题/摘要

标题：mmWalk：迈向多模态多视角行走辅助

在极端或复杂环境中提供行走辅助仍然是盲人或低视力（BLV）人群的一大挑战，主要原因是缺乏对整体场景的理解。受BLV社区实际需求的启发，我们构建了mmWalk，这是一个模拟的多模态数据集，集成了多视角传感器和无障碍导向特征，用于户外安全导航。该数据集包含120条手动控制、场景分类的行走轨迹，共有62000帧同步图像。它包含了超过559000张全景图像，涵盖RGB、深度和语义模态。此外，为了强调现实相关性，每条轨迹都涉及户外的边缘情况和专为BLV用户设计的无障碍地标。此外，我们还生成了mmWalkVQA，这是一个包含超过69000个视觉问题-答案三元组的VQA基准，分为9个类别，旨在提供安全和知情的行走辅助。我们使用零样本和少样本设置评估了最先进的视觉-语言模型（VLMs），发现它们在我们的风险评估和导航任务中表现不佳。我们还在真实世界数据集上验证了mmWalk微调模型，并展示了该数据集在推进多模态行走辅助方面的有效性。

Summary / 总结

The research aims to address the challenge of walking assistance in extreme environments for people with blindness or low vision by developing mmWalk, a multi-modal dataset integrating multi-view sensor and accessibility features. The dataset includes 120 walking trajectories with 62k synchronized frames and over 559k panoramic images. Evaluations show that state-of-the-art Vision-Language Models struggle with the risk assessment and navigational tasks, highlighting the need for more effective multi-modal walking assistance. The mmWalk-finetuned model is validated on real-world datasets, demonstrating its effectiveness.

研究旨在通过开发mmWalk多模态数据集来解决盲人或低视力人士在极端环境下的行走辅助问题。该数据集包含120条行走轨迹，62k同步帧以及超过559k的RGB、深度和语义全景图像。关键发现表明，最先进的视觉-语言模型在风险评估和导航任务上表现不佳，突显了需要更有效的多模态行走辅助系统。

Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning

Authors: Wenyi Xiao, Leilei Gan

First: 2025-04-25T16:11:23+00:00 · Latest: 2025-10-23T16:25:28+00:00

Abstract

Reinforcement learning, typically applied through GRPO, struggles in large vision-language model reasoning: it either fails to scale reasoning length effectively or generates verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.

Summary / 总结

The research aims to improve the scalability and efficiency of reinforcement learning in large vision-language model reasoning. FAST-GRPO, a variant of GRPO, dynamically adjusts reasoning depth based on question characteristics, using difficulty estimates, adaptive length-based rewards, and difficulty-aware KL divergence. Experiments across seven reasoning benchmarks show that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model and reduces token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.

研究旨在提高大型视觉-语言模型推理中强化学习的可扩展性和效率。FAST-GRPO 是 GRPO 的一种变体，根据问题特征动态调整推理深度。在七个推理基准测试中的实验表明，FAST 相比基础模型取得了超过 10% 的相对提升，达到了最先进的准确率，并且与此前的慢思考方法相比，标记使用量减少了 32.7-67.3%，有效地平衡了推理长度和准确性。
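
The combination of length-based rewards with GRPO can be sketched as follows. GRPO scores each sampled response relative to the other responses in its group by standardizing the rewards; the reward shaping shown here is only a hypothetical stand-in for the paper's adaptive length-based reward, and the names `length_adjusted_reward`, `budget`, and `penalty` are assumptions made for illustration. The difficulty-aware KL term is not modeled.

```python
import statistics


def length_adjusted_reward(correct, n_tokens, budget, penalty=0.5):
    """Hypothetical shaping: full credit for a correct answer, minus a
    penalty that grows with how far the response overshoots a token budget."""
    base = 1.0 if correct else 0.0
    overflow = max(0, n_tokens - budget) / budget
    return base - penalty * min(overflow, 1.0)


def group_advantages(rewards):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Three sampled responses to one question: (is_correct, token_count).
group = [(True, 150), (True, 400), (False, 120)]
rewards = [length_adjusted_reward(c, n, budget=200) for c, n in group]
print(rewards)
print(group_advantages(rewards))
```

In this toy group the short correct response receives the highest advantage, the verbose correct response a smaller one, and the incorrect response a negative one, which is the qualitative behavior a length-aware reward is meant to encourage.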

Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Authors: Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang

First: 2025-10-23T16:17:47+00:00 · Latest: 2025-10-23T16:17:47+00:00

Comments: Our code is available at https://github.com/xuyang-liu16/MixKV

Abstract

Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.

中文标题/摘要

标题：结合重要性与多样性：在大型视觉-语言模型中联合优化KV缓存压缩

近期的大型视觉-语言模型（LVLMs）在处理扩展的多模态序列方面表现出色，但由此产生的键值（KV）缓存扩展造成了一个关键的内存瓶颈，从根本上限制了部署的可扩展性。虽然现有的KV缓存压缩方法侧重于保留高重要性的KV对以最小化存储，但它们往往忽略了多模态KV缓存中出现的独特的模态特定语义冗余模式。在这项工作中，我们首先分析了除简单的重要性之外，LVLMs中的KV缓存如何在不同的注意力头中表现出不同程度的冗余。我们表明，仅依赖于重要性只能覆盖KV缓存信息分布的一部分，可能导致语义覆盖的潜在损失。为了解决这个问题，我们提出了MixKV，一种新颖的方法，将重要性与多样性结合以优化LVLMs中的KV缓存压缩。MixKV根据头级语义冗余进行调整，在压缩KV对时选择性地平衡多样性和重要性。广泛的实验表明，MixKV在多个LVLMs上始终能够提升现有方法的效果。在极端压缩（预算=64）下，MixKV在五个多模态理解基准测试中将基线方法平均提高了5.1%，并在GUI定位任务上使SnapKV和AdaKV分别获得了8.0%和9.0%的显著提升，同时保持了相当的推理效率。此外，MixKV可无缝扩展到LLMs，并带来相当的性能提升。我们的代码可在 https://github.com/xuyang-liu16/MixKV 获取。

Summary / 总结

This work addresses the memory bottleneck caused by key-value (KV) cache expansion in large vision-language models (LVLMs) by proposing MixKV, a method that combines importance with diversity for KV cache compression. The method adapts to head-wise semantic redundancy and selectively balances diversity and importance. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, while maintaining comparable inference efficiency. MixKV also extends to LLMs with comparable performance gains.

该研究针对大型视觉-语言模型（LVLMs）中由于键值（KV）缓存扩展导致的内存瓶颈问题，提出了一种结合重要性和多样性的方法MixKV，以优化KV缓存压缩。MixKV在五个基准测试中平均提高了5.1%，并在GUI定位任务中分别实现了8.0%和9.0%的显著改进，同时保持了相似的推理效率。该方法适应头部级别的语义冗余，并在压缩过程中选择性地平衡多样性和重要性。

FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

Authors: Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng

First: 2025-04-22T14:55:23+00:00 · Latest: 2025-10-23T16:11:42+00:00

Comments: Code: https://github.com/Nihukat/FreeGraftor

Abstract

Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.

中文标题/摘要

标题：FreeGraftor：无需训练的跨图像特征嫁接以实现主题驱动的文本到图像生成

主题驱动的图像生成旨在从参考图像中合成新的场景，同时忠实保留主题身份并遵循文本指导。然而，现有方法在保真度和效率之间面临关键权衡。基于调优的方法依赖于耗时且资源密集的主题特定优化，而零样本方法往往无法保持足够的主题一致性。在本文中，我们提出了一种无需训练的FreeGraftor框架，通过跨图像特征嫁接来解决这些限制。具体而言，FreeGraftor利用语义匹配和位置约束注意力融合将参考主题的视觉细节转移到生成图像中。此外，我们的框架引入了一种新颖的噪声初始化策略，以保留参考主题的几何先验，从而促进稳健的特征匹配。广泛的定性和定量实验表明，我们的方法能够实现精确的主题身份转移，同时保持文本对齐的场景合成。无需进行模型微调或额外训练，FreeGraftor在主题保真度和文本对齐方面显著优于现有零样本和无需训练的方法。此外，我们的框架可以无缝扩展到多主题生成，使其适用于实际部署。我们的代码可在https://github.com/Nihukat/FreeGraftor获取。

Summary / 总结

FreeGraftor is a training-free framework for subject-driven text-to-image generation that uses cross-image feature grafting to preserve subject identity while adhering to textual guidance. It employs semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to generated images, and introduces a novel noise initialization strategy to maintain subject geometry. Experimental results show that FreeGraftor outperforms existing zero-shot and training-free approaches in subject fidelity and text alignment without requiring model fine-tuning. It also supports multi-subject generation, making it suitable for practical deployment.

FreeGraftor 是一个无需训练的框架，用于基于文本的图像生成，通过跨图像特征嫁接来保留主体身份并遵循文本指导。它使用语义匹配和位置约束注意力融合来将参考主体的视觉细节转移到生成图像中，并引入了一种噪声初始化策略以保持参考主体的几何先验。实验结果表明，FreeGraftor 在主体保真度和文本对齐方面优于现有零样本和无需训练的方法，且无需进行模型微调或额外训练，还可以处理多主体生成。代码可在 https://github.com/Nihukat/FreeGraftor 获取。

Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward

Authors: Jing Bi, Guangyu Sun, Ali Vosoughi, Chen Chen, Chenliang Xu

First: 2025-10-23T16:10:03+00:00 · Latest: 2025-10-23T16:10:03+00:00

Comments: 5 pages

Abstract

Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.

中文标题/摘要

标题：视觉推理诊断：挑战、见解与未来路径

多模态大型语言模型（MLLMs）结合视觉和文本推理，利用链式思考（CoT）提示解决复杂视觉任务，但仍表现出视觉幻觉和过度依赖文本先验的问题。我们使用三阶段评估框架系统地诊断了最先进的视觉语言模型，揭示了关键的失败模式。为了解决这些问题，我们提出了一种基于代理的架构，结合了LLM推理和轻量级视觉模块，使推理链的精细分析和迭代改进成为可能。我们的结果强调未来视觉推理模型应专注于整合更广泛的专门工具来分析视觉内容。我们的系统相对于7B基线模型取得了显著提升（MMMU +10.3，MathVista +6.0），达到或超过了规模大得多的模型的水平。我们将发布我们的框架和评估套件，以促进未来的研究。

Summary / 总结

The research aims to diagnose the limitations of multimodal large language models (MLLMs) in visual reasoning, particularly their tendency to hallucinate and rely heavily on textual information. A three-stage evaluation framework was used to identify key failure modes. The study proposes an agent-based architecture that integrates LLM reasoning with lightweight visual modules, leading to significant improvements in performance on MMMU (+10.3) and MathVista (+6.0) tasks compared to a 7B baseline model. This suggests that future models should incorporate a wider range of specialized tools for analyzing visual content.

论文旨在诊断多模态大型语言模型（MLLMs）在视觉推理中遇到的挑战，特别是它们的视觉幻觉和对文本先验的过度依赖。作者使用三阶段评估框架来识别关键的失败模式。他们提出了一种基于代理的架构，将LLM推理与轻量级视觉模块相结合，使得在MMMU (+10.3) 和MathVista (+6.0) 任务上的性能显著提升，超过了7B基线模型。这项工作表明，未来的视觉推理模型应该整合更多专门用于分析视觉内容的工具。

REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Authors: Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang

First: 2025-05-22T15:34:50+00:00 · Latest: 2025-10-23T15:43:31+00:00

Comments: Accepted to NeurIPS 2025 D&B Track

Abstract

Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions; (2) the severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with the performance drop ranging from less than 1% to over 20%; and (3) vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models. Code and data are publicly available at https://github.com/lx709/REOBench.

中文标题/摘要

标题：REOBench：评估地球观测基础模型鲁棒性的基准

地球观测基础模型在多个地球观测任务中表现出强大的泛化能力，但它们在现实世界扰动下的鲁棒性仍被忽视。为弥补这一差距，我们引入了REOBench，这是首个全面评估地球观测基础模型鲁棒性的基准，涵盖了六个任务和十二种图像破坏类型，包括基于外观和几何的扰动。为了确保评估的现实性和精细度，我们的基准专注于高分辨率光学遥感图像，这些图像广泛应用于城市规划和灾害响应等关键应用。我们系统地评估了使用掩码图像建模、对比学习和视觉-语言预训练范式训练的一系列模型。我们的结果表明：(1) 当暴露于输入扰动时，现有地球观测基础模型会经历显著的性能下降。(2) 性能下降的程度在不同任务、模型架构、骨干网络大小和扰动类型之间有所不同，性能下降幅度从不到1%到超过20%不等。(3) 视觉-语言模型在多模态任务中显示出增强的鲁棒性。REOBench突显了当前地球观测基础模型对现实世界扰动的脆弱性，并为开发更鲁棒和可靠的模型提供了可操作的见解。代码和数据可在https://github.com/lx709/REOBench公开获取。

Summary / 总结

REOBench is a benchmark designed to evaluate the robustness of Earth observation foundation models under various real-world perturbations. It assesses models across six tasks and twelve types of image corruptions, focusing on high-resolution optical remote sensing images. The study reveals that existing models experience significant performance degradation when exposed to input corruptions, with the severity varying widely across different models and tasks. Vision-language models show enhanced robustness, especially in multimodal tasks, highlighting their potential for developing more reliable Earth observation models. Code and data are publicly available at https://github.com/lx709/REOBench.

REOBench 是首个评估地球观测基础模型在六个任务和十二种图像失真类型下的鲁棒性的基准。它专注于用于关键应用的高分辨率光学遥感图像。研究显示，在输入失真下模型性能显著下降，不同模型和任务的鲁棒性差异显著。视觉-语言模型在多模态任务中表现出更强的鲁棒性。该基准突显了开发更鲁棒的地球观测模型的必要性。

Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging

Authors: Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang, Pengfei Guo, Marc Edgar, Daguang Xu, Bernhard Kainz, Bjoern Menze

Venue: NeurIPS 2025

First: 2025-10-23T15:13:13+00:00 · Latest: 2025-10-23T15:13:13+00:00

Comments: NeurIPS 2025

Abstract

Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512*512*241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D

中文标题/摘要

标题：更好的3D标记以实现更好的3D：推进3D医学成像中的视觉-语言建模

3D医学成像中的视觉-语言建模最近的进步得益于大规模的计算机断层扫描（CT）语料库，这些语料库配有配对的自由文本报告，更强的架构和强大的预训练模型。这使得自动化报告生成和文本条件下的3D图像合成等应用成为可能。然而，当前的方法在处理高分辨率、长序列体积时存在困难：对比预训练往往导致视觉编码器与临床语言不一致，而切片级标记模糊了细微解剖结构，降低了下游任务的诊断性能。我们提出了BTB3D（更好的3D标记），这是一种因果卷积编码器-解码器，统一了2D和3D的训练和推理，同时生成紧凑的、频率感知的体素标记。三阶段的训练课程使模型能够（i）局部重建，（ii）重叠窗口镶嵌，以及（iii）长上下文解码器细化，在此过程中，模型从短切片片段中学习，但能够泛化到超过300片的扫描，而无需额外的内存开销。BTB3D在两个关键任务上达到了新的最佳水平：它在报告生成任务上提高了BLEU分数，并且与CT2Rep、CT-CHAT和Merlin相比，临床F1提高了40%；在文本到CT合成任务上，它将FID降低了75%，并将FVD减半，与GenerateCT和MedSyn相比，生成了解剖上一致的512*512*241体积。这些结果表明，精确的三维标记化，而不是更大的语言骨干模型，对于3D医学成像中的可扩展视觉-语言建模至关重要。代码库可在：https://github.com/ibrahimethemhamamci/BTB3D

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

Authors: Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

First: 2025-10-20T02:59:45+00:00 · Latest: 2025-10-23T15:06:39+00:00

Abstract

Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.

中文标题/摘要

标题：面向具身操作的高效视觉-语言-行动模型：系统综述

视觉-语言-行动（VLA）模型通过将自然语言指令和视觉观察映射为机器人动作，将视觉-语言模型扩展到具身控制。尽管能力强大，VLA系统仍因其巨大的计算和内存需求而面临重大挑战，这与需要实时性能的边缘平台（如机载移动操作机器人）的约束相冲突。解决这一矛盾已成为近期研究的核心焦点。鉴于业界日益致力于构建更高效、可扩展的VLA系统，本综述系统回顾了提高VLA效率的方法，重点在于降低延迟、内存占用以及训练和推理成本。我们按照模型架构、感知特征、行动生成和训练/推理策略四个维度对现有解决方案进行分类，并总结了每个类别中的代表性技术。最后，我们讨论了未来趋势和开放挑战，指出了推进高效具身智能的方向。

Summary / 总结

This survey addresses the computational and memory demands of Vision-Language-Action (VLA) models, which conflict with the real-time constraints of edge platforms used for embodied manipulation. It systematically reviews methods for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. Existing solutions are categorized along four dimensions (model architecture, perception feature, action generation, and training/inference strategies), with representative techniques summarized for each. The survey closes with future trends and open challenges for advancing efficient embodied intelligence.

研究旨在解决视觉-语言-行动（VLA）模型在实时物理操作任务中的计算和内存挑战。该研究回顾了提高VLA效率的方法，重点在于减少延迟、内存使用和成本。关键发现包括将现有解决方案分类为四个维度：模型架构、感知特征、行动生成和训练/推理策略，并总结了每个类别中的代表性技术。最后还讨论了未来趋势和开放挑战，以促进高效的物理智能发展。