arXiv 论文速递

Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang

First: 2025-10-23T17:59:21+00:00 · Latest: 2025-10-23T17:59:21+00:00

Abstract

Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict
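
The consensus expert selection step can be pictured with a small sketch. The abstract does not define the agreement score, so this version assumes exact-match agreement between the drafts' final answers after light normalization; the expert names, the `select_by_consensus` helper, and the answers are all hypothetical.

```python
from itertools import combinations


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def select_by_consensus(drafts: dict[str, str], k: int) -> list[str]:
    """Return the k draft experts whose answers agree most with the others.

    Agreement is exact match after normalization. This is an assumption:
    the abstract does not say how SV scores consensus.
    """
    scores = {name: 0 for name in drafts}
    for a, b in combinations(drafts, 2):
        if normalize(drafts[a]) == normalize(drafts[b]):
            scores[a] += 1
            scores[b] += 1
    # Highest agreement first; sorted() is stable, so ties keep insertion order.
    ranked = sorted(drafts, key=lambda name: -scores[name])
    return ranked[:k]


# Hypothetical final answers from five small draft VLMs.
drafts = {
    "expert_a": "42%",
    "expert_b": "42%",
    "expert_c": "17%",
    "expert_d": " 42% ",
    "expert_e": "58%",
}
selected = select_by_consensus(drafts, k=3)
```

Only the reasoning paths of the experts in `selected` would then be passed to the verdict model, which keeps the large model's input short.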

中文标题/摘要

标题：小草图，大裁决：基于推测的密集信息视觉推理

大型多模态视觉语言模型（VLMs）在多模态理解方面取得了显著进展，但在处理密集交织了文本注释和细粒度图形元素的信息密集型图像时，它们面临挑战。主要挑战在于在密集布局中精确定位关键线索以及进行多跳推理以整合分散的证据。我们提出了推测裁决（SV），这是一种无需训练的框架，灵感来源于推测解码，结合了多个轻量级草图专家和一个大型裁决模型。在草图阶段，小型VLM作为草图专家生成提供多样化定位候选的推理路径；在裁决阶段，强大的VLM综合这些路径生成最终答案，同时降低计算成本并恢复正确答案。为了进一步提高效率和准确性，SV引入了一种共识专家选择机制，仅将高一致性的推理路径转发到裁决阶段。实验证明，SV在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等具有挑战性的信息密集型和高分辨率视觉问答基准测试中取得了持续的改进。通过综合多个部分准确推理路径中的正确见解，SV在错误纠正和成本效率方面优于大型专有模型或训练管道。代码可在https://github.com/Tinaliu0123/speculative-verdict 获取

DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

Authors: Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal

First: 2025-10-23T17:42:14+00:00 · Latest: 2025-10-23T17:42:14+00:00

Abstract

Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism's quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model's positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.
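
The quadratic scaling mentioned above can be made concrete with a little arithmetic. The sketch assumes each token covers a 16×16 pixel square and that the model was trained at 1024×1024; both numbers are illustrative assumptions, not figures from the abstract.

```python
def num_tokens(height: int, width: int, pixels_per_token_side: int = 16) -> int:
    """Image tokens when each token covers a square of pixels (assumed 16x16)."""
    return (height // pixels_per_token_side) * (width // pixels_per_token_side)


train_tokens = num_tokens(1024, 1024)   # assumed training resolution
target_tokens = num_tokens(4096, 4096)  # roughly 16 million pixels

token_ratio = target_tokens / train_tokens
# Self-attention compares every token with every other token,
# so its cost grows with the square of the token count.
attention_ratio = token_ratio ** 2
```

Under these assumptions the token count grows 16-fold and the attention cost 256-fold, which is why training directly at the target resolution is so expensive.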

中文标题/摘要

标题：DyPE：动态位置外推在超高清扩散中的应用

扩散变换器模型可以生成具有非凡保真度和细节的图像，但由于自注意力机制与图像标记数量的平方级扩展，它们在超高清分辨率下的训练成本仍然非常高。在本文中，我们引入了一种名为动态位置外推（DyPE）的新型、无需训练的方法，该方法使预训练的扩散变换器能够在远超其训练数据的分辨率下合成图像，且无需额外的采样成本。DyPE 利用了扩散过程固有的频谱进展，其中低频结构早期收敛，而高频结构需要更多步骤才能解决。具体而言，DyPE 在每次扩散步骤中动态调整模型的位置编码，使其频谱与生成过程的当前阶段相匹配。这种方法使我们能够在远超训练分辨率的分辨率下生成图像，例如，使用 FLUX 生成 1600 万像素的图像。在多个基准测试中，DyPE 一致地提高了性能，并在超高清图像生成中达到了最先进的保真度，尤其是在更高分辨率下，性能提升更为显著。项目页面可在 https://noamissachar.github.io/DyPE/ 获取。

Summary / 总结

DyPE is a training-free method that enables pre-trained diffusion transformers to generate images at ultra-high resolutions by dynamically adjusting positional encodings during the diffusion process. It exploits the spectral progression of diffusion, in which low-frequency structures converge early and high frequencies take more steps to resolve, by matching the frequency spectrum of the positional encoding to the current stage of generation. This allows image synthesis at resolutions far beyond the training data with no additional sampling cost. DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with larger gains at higher resolutions.

DyPE 是一种无需训练的方法，通过在扩散过程中动态调整位置编码，使预训练的扩散变换器能够生成超高清图像。该方法利用扩散过程中的频谱进展，早期生成低频结构，后期生成高频结构，从而能够在远超训练数据的分辨率下生成图像。DyPE 显著提高了性能，并在超高清图像生成中达到了最先进的保真度，尤其是在更高分辨率下效果更为显著。

mmWalk: Towards Multi-modal Multi-view Walking Assistance

Authors: Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

Venue: NeurIPS 2025

First: 2025-10-13T15:25:52+00:00 · Latest: 2025-10-23T16:40:49+00:00

Comments: Accepted by NeurIPS 2025 Datasets and Benchmarks Track. Data and Code: https://github.com/KediYing/mmWalk

Abstract

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.

中文标题/摘要

标题：mmWalk：迈向多模态多视角行走辅助

在极端或复杂环境中提供行走辅助仍然是盲人或低视力（BLV）人群的一大挑战，主要原因是缺乏对整体场景的理解。受BLV社区实际需求的启发，我们构建了mmWalk，这是一个模拟的多模态数据集，集成了多视角传感器和无障碍导向特征，用于户外安全导航。该数据集包含120条手动控制、场景分类的行走轨迹，共有62000帧同步图像。它包含了超过559000张全景图像，涵盖RGB、深度和语义模态。此外，为了强调现实相关性，每条轨迹都涉及户外的边缘情况和专为BLV用户设计的无障碍地标。此外，我们还生成了mmWalkVQA，这是一个包含超过69000个视觉问题-答案三元组的VQA基准，分为9个类别，旨在提供安全和知情的行走辅助。我们使用零样本和少样本设置评估了最先进的视觉-语言模型（VLMs），发现它们在我们的风险评估和导航任务中表现不佳。我们还在真实世界数据集上验证了mmWalk微调模型，并展示了该数据集在推进多模态行走辅助方面的有效性。

Summary / 总结

The research aims to address the challenges of walking assistance in extreme environments for people with blindness or low vision by developing a comprehensive multi-modal dataset called mmWalk. The dataset includes 120 walking trajectories with 62k synchronized frames and over 559k panoramic images across RGB, depth, and semantic modalities. The evaluation of state-of-the-art Vision-Language Models shows their limitations in handling risk assessment and navigational tasks, highlighting the need for further development. The mmWalk-finetuned model demonstrates the effectiveness of the dataset in advancing multi-modal walking assistance.

研究旨在通过开发综合性多模态数据集mmWalk，解决盲人或低视力人士在极端环境下的行走辅助问题。该数据集包含120条行走轨迹，62k同步帧和超过559k来自RGB、深度和语义模态的全景图像。研究评估了最先进的视觉-语言模型，发现它们在风险评估和导航任务上表现不佳。mmWalk微调模型在真实世界数据集上的验证显示了其在多模态行走辅助中的有效性。

Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning

Authors: Wenyi Xiao, Leilei Gan

First: 2025-04-25T16:11:23+00:00 · Latest: 2025-10-23T16:25:28+00:00

Abstract

Applying reinforcement learning, typically through GRPO, to large vision-language model reasoning either fails to scale reasoning length effectively or produces verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
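
As a rough illustration of how a length-based reward can interact with GRPO's group-relative advantages, consider the sketch below. The reward shape, the `difficulty` value in [0, 1], and the `max_length` cap are hypothetical; the abstract does not give the paper's formulas, and the difficulty-aware KL term is not modeled here.

```python
from statistics import mean, pstdev


def length_aware_reward(correct: bool, length: int, difficulty: float,
                        max_length: int = 1000) -> float:
    """Hypothetical reward: accuracy plus a brevity bonus that fades with difficulty.

    Easy questions (difficulty near 0) favor short correct answers;
    hard questions (difficulty near 1) apply no length pressure.
    """
    if not correct:
        return 0.0
    brevity = 1.0 - min(length, max_length) / max_length
    return 1.0 + (1.0 - difficulty) * brevity


def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Four sampled responses (correct?, token count) to one easy question.
samples = [(True, 200), (True, 800), (False, 150), (True, 500)]
rewards = [length_aware_reward(c, n, difficulty=0.2) for c, n in samples]
advantages = group_advantages(rewards)
```

With this shaping, the short correct response receives the largest advantage in the group, while on a question with `difficulty=1.0` all correct responses are rewarded equally regardless of length.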

中文标题/摘要

标题：快速-缓慢思考GRPO在大型视觉-语言模型推理中的应用

在通过GRPO应用强化学习时，尤其是在大型视觉-语言模型推理中，难以有效扩展推理长度或在所有任务中生成冗长的输出，仅在准确性上取得微小的提升。为解决这一问题，我们提出了FAST-GRPO，这是一种根据问题特征动态调整推理深度的GRPO变体。通过实证分析，我们通过研究响应长度和数据分布如何影响性能，确立了LVLM中快速-缓慢思考的可行性。受这些观察的启发，我们引入了两个互补的指标来估计问题的难度，指导模型确定何时使用快速或缓慢思考更为合适。随后，我们将自适应长度奖励和难度感知KL散度纳入GRPO算法。在七个推理基准测试中的实验表明，FAST在相对改进超过10%的准确性方面优于基线模型，同时与之前的缓慢思考方法相比，减少了32.7%-67.3%的令牌使用量，有效地平衡了推理长度和准确性。

Summary / 总结

The paper addresses the challenge of scaling reasoning length in large vision-language models using reinforcement learning. It introduces FAST-GRPO, which dynamically adjusts reasoning depth based on question characteristics. Experiments on seven reasoning benchmarks show that FAST achieves over 10% relative accuracy improvement compared to the base model while reducing token usage by 32.7-67.3% compared to previous slow-thinking methods, effectively balancing reasoning length and accuracy.

论文旨在解决使用强化学习在大型视觉-语言模型中扩展推理长度的挑战。它提出了FAST-GRPO，该方法根据问题特征动态调整推理深度。实验表明，FAST相比基线模型提高了超过10%的准确性，同时相比之前的方法减少了32.7-67.3%的标记使用量，有效地平衡了推理长度和准确性。

Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Authors: Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang

First: 2025-10-23T16:17:47+00:00 · Latest: 2025-10-23T16:17:47+00:00

Comments: Our code is available at https://github.com/xuyang-liu16/MixKV

Abstract

Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
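
The idea of mixing importance with diversity can be sketched for a single attention head as follows. This is not MixKV's actual scoring: the diversity measure (one minus cosine similarity to the mean key), the min-max normalization, and the fixed `mix` weight are assumptions chosen for illustration, whereas the abstract describes a head-wise adaptive balance.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def minmax(xs: list[float]) -> list[float]:
    """Rescale values to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def select_kv(keys: list[list[float]], importance: list[float],
              budget: int, mix: float = 0.5) -> list[int]:
    """Keep `budget` KV indices by blending importance and diversity.

    mix=1.0 is importance-only selection; mix=0.0 is diversity-only.
    """
    dim = len(keys[0])
    mean_key = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    diversity = [1.0 - cosine(k, mean_key) for k in keys]
    imp, div = minmax(importance), minmax(diversity)
    score = [mix * i + (1.0 - mix) * d for i, d in zip(imp, div)]
    top = sorted(range(len(keys)), key=lambda i: -score[i])[:budget]
    return sorted(top)


# Three near-duplicate keys with high importance and one distinct key.
keys = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05], [0.0, 1.0]]
importance = [0.9, 0.8, 0.85, 0.3]

importance_only = select_kv(keys, importance, budget=2, mix=1.0)
mixed = select_kv(keys, importance, budget=2, mix=0.5)
```

In this toy case, importance-only selection keeps two near-duplicate keys, while the mixed score swaps one of them for the distinct key, which is the loss of semantic coverage the abstract refers to.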

中文标题/摘要

标题：结合重要性与多样性：在大型视觉-语言模型中联合优化KV缓存压缩

近期的大型视觉-语言模型（LVLMs）在处理扩展的多模态序列方面表现出色，但由此产生的键值（KV）缓存扩展造成了一个关键的内存瓶颈，从根本上限制了部署的可扩展性。虽然现有的KV缓存压缩方法侧重于保留高重要性的KV对以最小化存储，但它们往往忽略了多模态KV缓存中出现的独特的模态特定语义冗余模式。在这项工作中，我们首先分析了LVLMs中的KV缓存如何在不同的注意力头中表现出不同程度的冗余，而不仅仅是简单的重要性。我们表明，仅依赖于重要性只能覆盖KV缓存信息分布的一部分，可能导致语义覆盖的潜在损失。为了解决这个问题，我们提出了MixKV，一种新颖的方法，将重要性与多样性结合以优化LVLMs中的KV缓存压缩。MixKV根据头级语义冗余进行调整，在压缩KV对时选择性地平衡多样性和重要性。广泛的实验表明，MixKV在多个LVLMs中始终优于现有方法。在极端压缩（预算=64）下，MixKV在五个多模态理解基准测试中平均提高了基线方法的5.1%，并在SnapKV和AdaKV的GUI定位任务中分别实现了显著的8.0%和9.0%的提升，同时保持了相当的推理效率。此外，MixKV无缝扩展到LLMs，性能提升相当。我们的代码可在https://github.com/xuyang-liu16/MixKV 获取。

Summary / 总结

This paper addresses the memory bottleneck caused by the expansion of key-value (KV) cache in large vision-language models (LVLMs). It proposes MixKV, a method that combines importance and diversity for KV cache compression, which enhances existing methods by 5.1% on average under extreme compression and achieves significant improvements on specific tasks. The method balances diversity and importance head-wise, leading to better semantic coverage and comparable inference efficiency.

本文提出了一种结合重要性和多样性的方法MixKV，用于解决大型视觉-语言模型（LVLM）中由于关键值（KV）缓存扩展引起的大内存瓶颈问题。该方法适应头部级别的语义冗余，并在压缩KV对时选择性地平衡多样性和重要性。实验表明，MixKV在五个跨模态理解基准测试中平均提高了5.1%，并在GUI定位任务中分别实现了SnapKV和AdaKV的显著提升，达到8.0%和9.0%，同时保持了相当的推理效率。

FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

Authors: Zebin Yao, Lei Ren, Huixing Jiang, Chen Wei, Xiaojie Wang, Ruifan Li, Fangxiang Feng

First: 2025-04-22T14:55:23+00:00 · Latest: 2025-10-23T16:11:42+00:00

Comments: Code: https://github.com/Nihukat/FreeGraftor

Abstract

Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
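
Semantic matching between reference and generated features can be approximated by a nearest-neighbor search over feature vectors. The sketch below is a generic version of that idea, not FreeGraftor's implementation: the `match_features` helper, the cosine-similarity criterion, and the `threshold` value are assumptions, and the position-constrained attention fusion step is not modeled.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def match_features(reference: list[list[float]], generated: list[list[float]],
                   threshold: float = 0.8) -> dict[int, int]:
    """Map each generated token index to its most similar reference token index.

    Tokens whose best similarity falls below `threshold` stay unmatched,
    so background regions are not forced onto the reference subject.
    """
    matches = {}
    for gi, g in enumerate(generated):
        sims = [cosine(g, r) for r in reference]
        best = max(range(len(reference)), key=lambda ri: sims[ri])
        if sims[best] >= threshold:
            matches[gi] = best
    return matches


# Toy 3-dimensional features: two reference tokens, three generated tokens.
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
generated = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.8, 0.1]]
matches = match_features(reference, generated)
```

Matched pairs identify where reference details could be grafted; the unmatched generated token (index 1 here) would be left to the text-conditioned generation.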

中文标题/摘要

标题：FreeGraftor：无需训练的跨图像特征嫁接以实现主题驱动的文本到图像生成

主题驱动的图像生成旨在从参考图像中合成新的场景，同时忠实保留主题身份并遵循文本指导。然而，现有方法在保真度和效率之间面临关键权衡。基于调优的方法依赖于耗时且资源密集的主题特定优化，而零样本方法往往无法保持足够的主题一致性。在本文中，我们提出了一种无需训练的FreeGraftor框架，通过跨图像特征嫁接来解决这些限制。具体而言，FreeGraftor利用语义匹配和位置约束注意力融合来将参考主题的视觉细节转移到生成图像中。此外，我们的框架引入了一种新颖的噪声初始化策略，以保留参考主题的几何先验，从而促进稳健的特征匹配。广泛的定性和定量实验表明，我们的方法能够实现精确的主题身份转移，同时保持文本对齐的场景合成。无需进行模型微调或额外训练，FreeGraftor在主题保真度和文本对齐方面显著优于现有零样本和无需训练的方法。此外，我们的框架可以无缝扩展到多主题生成，使其适用于实际部署。我们的代码可在https://github.com/Nihukat/FreeGraftor获取。

Summary / 总结

FreeGraftor is a training-free framework for subject-driven text-to-image generation that uses cross-image feature grafting to preserve subject identity while adhering to textual guidance. It employs semantic matching and position-constrained attention fusion to transfer visual details from reference images to generated scenes, and introduces a noise initialization strategy to maintain subject geometry. Experimental results show that FreeGraftor outperforms existing zero-shot and training-free approaches in subject fidelity and text alignment without requiring model fine-tuning. It also supports multi-subject generation, making it practical for real-world applications.

FreeGraftor 是一个无需训练的框架，用于在保持主体身份的同时根据文本指导生成图像。它使用跨图像特征嫁接技术，通过语义匹配和位置约束注意力融合将参考图像中的视觉细节转移到生成场景中，并引入了一种噪声初始化策略以保持参考主体的几何先验。实验表明，FreeGraftor 在主体保真度和文本对齐方面优于现有零样本和无需训练的方法，且无需进行模型微调。此外，该框架还支持多主体生成，使其适用于实际部署。

Diagnosing Visual Reasoning: Challenges, Insights, and a Path Forward

Authors: Jing Bi, Guangyu Sun, Ali Vosoughi, Chen Chen, Chenliang Xu

First: 2025-10-23T16:10:03+00:00 · Latest: 2025-10-23T16:10:03+00:00

Comments: 5 pages

Abstract

Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results suggest that future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.

中文标题/摘要

标题：视觉推理诊断：挑战、见解与未来路径

多模态大型语言模型（MLLMs）结合视觉和文本推理，利用链式思考（CoT）提示解决复杂视觉任务，但仍表现出视觉幻觉和过度依赖文本先验的问题。我们使用三阶段评估框架系统地诊断了最先进的视觉语言模型，揭示了关键的失败模式。为了解决这些问题，我们提出了一种基于代理的架构，结合LLM推理和轻量级视觉模块，实现精细的推理链分析和迭代优化。我们的结果强调未来视觉推理模型应专注于整合更广泛的专门工具来分析视觉内容。我们的系统在MMMU上提高了10.3，在MathVista上提高了6.0，超过了7B基线模型。我们将发布我们的框架和评估套件以促进未来研究。

Summary / 总结

The paper aims to diagnose the challenges faced by multimodal large language models (MLLMs) in visual reasoning, particularly their tendency to hallucinate and rely heavily on textual information. The authors propose a three-stage evaluation framework and an agent-based architecture that integrates lightweight visual modules with LLM reasoning to improve visual reasoning. Key findings include significant performance gains on MMMU and MathVista tasks compared to a 7B baseline model, suggesting that future models should incorporate specialized tools for analyzing visual content more effectively.

本文诊断了多模态大型语言模型（MLLMs）在视觉推理中面临的挑战，特别是它们倾向于产生幻觉并过度依赖文本信息。作者提出了一种三阶段评估框架和一种结合轻量级视觉模块与LLM推理的代理架构，以提高视觉推理能力。该系统在7B基线模型上显示出显著的改进（+10.3在MMMU上，+6.0在MathVista上），表明所提出的方法在解决视觉推理挑战方面的有效性。

REOBench: Benchmarking Robustness of Earth Observation Foundation Models

Authors: Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, Tianjin Huang

First: 2025-05-22T15:34:50+00:00 · Latest: 2025-10-23T15:43:31+00:00

Comments: Accepted to NeurIPS 2025 D&B Track

Abstract

Earth observation foundation models have shown strong generalization across multiple Earth observation tasks, but their robustness under real-world perturbations remains underexplored. To bridge this gap, we introduce REOBench, the first comprehensive benchmark for evaluating the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions, including both appearance-based and geometric perturbations. To ensure realistic and fine-grained evaluation, our benchmark focuses on high-resolution optical remote sensing images, which are widely used in critical applications such as urban planning and disaster response. We conduct a systematic evaluation of a broad range of models trained using masked image modeling, contrastive learning, and vision-language pre-training paradigms. Our results reveal that (1) existing Earth observation foundation models experience significant performance degradation when exposed to input corruptions; (2) the severity of degradation varies across tasks, model architectures, backbone sizes, and types of corruption, with performance drops ranging from less than 1% to over 20%; and (3) vision-language models show enhanced robustness, particularly in multimodal tasks. REOBench underscores the vulnerability of current Earth observation foundation models to real-world corruptions and provides actionable insights for developing more robust and reliable models. Code and data are publicly available at https://github.com/lx709/REOBench.
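
One common way to quantify a performance drop under corruption is the share of clean performance that is lost. The abstract does not say whether REOBench reports relative or absolute drops, so the formula below is an assumption, and the corruption names and scores are made up for illustration rather than taken from the benchmark.

```python
def relative_drop(clean: float, corrupted: float) -> float:
    """Percentage of clean performance lost under a corruption."""
    if clean <= 0:
        raise ValueError("clean score must be positive")
    return 100.0 * (clean - corrupted) / clean


# Hypothetical scores for one model on one task; not results from the paper.
clean_score = 80.0
corrupted_scores = {"gaussian_noise": 62.0, "rotation": 79.5, "brightness": 76.0}

drops = {name: relative_drop(clean_score, score)
         for name, score in corrupted_scores.items()}
worst = max(drops, key=drops.get)
```

Computing the drop per corruption type and per task makes it easy to see the kind of spread the abstract describes, from under 1% to over 20%.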

中文标题/摘要

标题：REOBench：地球观测基础模型鲁棒性基准测试

地球观测基础模型在多个地球观测任务中表现出强大的泛化能力，但在实际世界扰动下的鲁棒性仍被忽视。为弥补这一差距，我们引入了REOBench，这是首个全面评估地球观测基础模型鲁棒性的基准，涵盖了六个任务和十二种图像破坏类型，包括基于外观和几何的扰动。为了确保评估的现实性和精细度，我们的基准专注于高分辨率光学遥感图像，这些图像广泛应用于城市规划和灾害响应等关键应用。我们系统地评估了使用掩码图像建模、对比学习和视觉-语言预训练范式训练的一系列模型。我们的结果表明：(1) 当暴露于输入扰动时，现有地球观测基础模型会经历显著的性能下降。(2) 性能下降的程度在不同任务、模型架构、骨干网络大小和扰动类型之间有所不同，性能下降幅度从不到1%到超过20%不等。(3) 视觉-语言模型在多模态任务中显示出增强的鲁棒性。REOBench突显了当前地球观测基础模型对实际世界扰动的脆弱性，并为开发更鲁棒和可靠的模型提供了可操作的见解。代码和数据可在https://github.com/lx709/REOBench上公开获取。

Summary / 总结

REOBench is a benchmark designed to evaluate the robustness of Earth observation foundation models across six tasks and twelve types of image corruptions. It uses high-resolution optical remote sensing images to assess models trained with masked image modeling, contrastive learning, and vision-language pre-training. The study finds that these models experience significant performance degradation under real-world perturbations, with varying severity across different tasks and model architectures. Vision-language models show enhanced robustness, especially in multimodal tasks, highlighting their potential for developing more reliable Earth observation models.

REOBench 是一个用于评估地球观测基础模型在各种实际扰动下鲁棒性的基准。它在六项任务和十二种图像腐蚀类型上评估模型，重点关注高分辨率光学遥感图像。研究发现，现有模型在暴露于输入扰动时表现出显著的性能下降，不同任务、模型架构和腐蚀类型下的严重程度各不相同。视觉语言模型在多模态任务中表现出增强的鲁棒性。该基准突显了需要开发更鲁棒的地球观测模型的必要性。

Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging

Authors: Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang, Pengfei Guo, Marc Edgar, Daguang Xu, Bernhard Kainz, Bjoern Menze

Venue: NeurIPS 2025

First: 2025-10-23T15:13:13+00:00 · Latest: 2025-10-23T15:13:13+00:00

Comments: NeurIPS 2025

Abstract

Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512*512*241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
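
Overlapping-window tiling, the second stage of the curriculum above, amounts to covering a long stack of slices with fixed-size windows that share some slices at their borders. The sketch below shows only the index arithmetic; the window size of 64 and overlap of 16 are hypothetical values, and how BTB3D blends the overlapping regions is not described in the abstract.

```python
def overlapping_windows(num_slices: int, window: int,
                        overlap: int) -> list[tuple[int, int]]:
    """Return [start, end) slice ranges that cover a volume with shared borders.

    The last window is clipped to the volume, so it may be shorter.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_slices)
        windows.append((start, end))
        if end == num_slices:
            break
        start += stride
    return windows


# 241 slices matches the depth of the 512*512*241 volumes in the abstract.
tiles = overlapping_windows(num_slices=241, window=64, overlap=16)
```

Because each window has a bounded size, memory use per step stays constant no matter how many slices the scan contains, which is consistent with the claim of generalizing to scans exceeding 300 slices.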

中文标题/摘要

标题：更好的3D标记以实现更好的3D：推进3D医学成像中的视觉-语言建模

3D医学成像中的视觉-语言建模最近的进步得益于大规模的计算机断层扫描（CT）语料库，这些语料库配有配对的自由文本报告，更强的架构和强大的预训练模型。这使得自动化报告生成和文本条件下的3D图像合成等应用成为可能。然而，当前的方法在处理高分辨率、长序列的体积时存在困难：对比预训练往往导致视觉编码器与临床语言不一致，而切片级标记模糊了细微解剖结构，降低了下游任务的诊断性能。我们提出了BTB3D（更好的3D标记），这是一种因果卷积编码器-解码器，统一了2D和3D的训练和推理，同时生成紧凑的、频率感知的体素标记。三阶段的训练课程使模型能够（i）局部重建，（ii）重叠窗口镶嵌，以及（iii）长上下文解码器细化，在此过程中，模型从短切片片段中学习，但能够泛化到超过300片的扫描，而无需额外的内存开销。BTB3D在两个关键任务上达到了新的最佳水平：它在报告生成任务上提高了BLEU分数，并且在CT2Rep、CT-CHAT和Merlin上将临床F1提高了40%；在文本到CT合成任务上，它将FID降低了75%，并将FVD减半，与GenerateCT和MedSyn相比，生成了解剖上一致的512*512*241体积。这些结果表明，精确的三维标记化，而不是更大的语言骨干模型，对于3D医学成像中的可扩展视觉-语言建模至关重要。代码库可在：https://github.com/ibrahimethemhamamci/BTB3D

Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey

Authors: Weifan Guan, Qinghao Hu, Aosheng Li, Jian Cheng

First: 2025-10-20T02:59:45+00:00 · Latest: 2025-10-23T15:06:39+00:00

Abstract

Vision-Language-Action (VLA) models extend vision-language models to embodied control by mapping natural-language instructions and visual observations to robot actions. Despite their capabilities, VLA systems face significant challenges due to their massive computational and memory demands, which conflict with the constraints of edge platforms such as on-board mobile manipulators that require real-time performance. Addressing this tension has become a central focus of recent research. In light of the growing efforts toward more efficient and scalable VLA systems, this survey provides a systematic review of approaches for improving VLA efficiency, with an emphasis on reducing latency, memory footprint, and training and inference costs. We categorize existing solutions into four dimensions: model architecture, perception feature, action generation, and training/inference strategies, summarizing representative techniques within each category. Finally, we discuss future trends and open challenges, highlighting directions for advancing efficient embodied intelligence.

中文标题/摘要

标题：高效视觉-语言-行动模型在嵌入式操作中的应用：系统综述

视觉-语言-行动（VLA）模型将视觉-语言模型扩展到嵌入式控制，通过将自然语言指令和视觉观察映射到机器人行动。尽管具有这些能力，VLA 系统由于其巨大的计算和内存需求而面临重大挑战，这与边缘平台（如车载移动操作器）的实时性能要求相冲突。解决这一矛盾已成为最近研究的中心焦点。鉴于对更高效和可扩展的VLA系统的日益努力，本文综述了提高VLA效率的方法，重点在于减少延迟、内存占用和训练及推理成本。我们按照模型架构、感知特征、行动生成和训练/推理策略四个维度对现有解决方案进行了分类，总结了每个类别中的代表性技术。最后，我们讨论了未来趋势和开放挑战，指出了推进高效嵌入式智能的方向。

Summary / 总结

This paper addresses the challenges of Vision-Language-Action (VLA) models in embodied manipulation by surveying methods to improve efficiency. The research focuses on reducing latency, memory usage, and costs through model architecture, perception features, action generation, and training/inference strategies. Key findings include the categorization of existing techniques and the identification of future research directions to advance efficient embodied intelligence.

本文通过综述提高效率的方法，解决了Vision-Language-Action (VLA)模型在执行任务中的挑战。研究重点在于通过模型架构、感知特征、动作生成以及训练/推理策略来减少延迟、内存使用和成本。主要发现包括对现有技术的分类以及对未来研究方向的探讨，以推进高效的嵌入式智能。