arXiv 论文速递

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Authors: Hang Hua, Yolo Yunlong Tang, Chenliang Xu, Jiebo Luo

Venue: AAAI 2025

First: 2024-04-18T17:32:46+00:00 · Latest: 2025-10-08T17:22:42+00:00

Comments: Accepted to AAAI 2025

Abstract

Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective training of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.

中文标题/摘要

标题：V2Xum-LLM：跨模态视频摘要生成与时间提示指令调优

视频摘要生成旨在创建长视频的简短、准确且连贯的摘要。尽管存在多种视频摘要数据集，但一个显著的限制是其来源视频数量有限，这阻碍了高级大型视觉-语言模型（VLM）的有效训练。此外，大多数现有数据集仅用于视频到视频摘要，忽视了对多模态视频内容摘要的当前需求。最近的研究努力从单模态扩展到多模态视频摘要，并根据摘要的模态将任务分为三个子任务：视频到视频（V2V）、视频到文本（V2T）以及视频和文本摘要的组合（V2VT）。然而，之前多模态数据集中的文本摘要不足。为了解决这些问题，我们引入了Instruct-V2Xum，这是一个跨模态视频摘要数据集，包含30,000个来自YouTube的多样化视频，长度从40秒到940秒不等，平均摘要率为16.39%。Instruct-V2Xum中的每个视频摘要都配有一个文本摘要，引用了特定的帧索引，便于生成对齐的视频和文本摘要。此外，我们提出了一种新的视频摘要框架V2Xum-LLM。V2Xum-LLM，即本研究中的V2Xum-LLaMA，是第一个将不同视频摘要任务统一到一个大型语言模型（LLM）的文本解码器中的框架，并通过时间提示和任务指令实现了可控的视频摘要。实验表明，V2Xum-LLaMA在多个视频摘要任务中优于强基线模型。此外，我们还为视频到视频（V2V）和视频到文本和视频摘要（V2VT）任务提出了增强的评估指标。

Summary / 总结

The research aims to address the limitations of existing video summarization datasets, particularly their lack of diverse source videos and multimodal content. The authors introduce Instruct-V2Xum, a new dataset with 30,000 videos, and propose V2Xum-LLM, a unified framework for video summarization tasks using a large language model's text decoder with temporal prompts and task instructions. Experiments demonstrate that V2Xum-LLaMA outperforms baseline models in multiple summarization tasks and introduces an enhanced evaluation metric for V2V and V2VT tasks.

研究旨在解决现有视频摘要数据集的局限性，特别是缺乏多样化的源视频和多模态内容。作者引入了Instruct-V2Xum数据集，包含30,000个视频，并提出了一种名为V2Xum-LLM的统一框架，该框架利用大型语言模型的文本解码器，结合时间提示和任务指令，实现了可控制的视频摘要。实验表明，V2Xum-LLaMA在各种摘要任务中优于基线模型，并引入了V2V和V2VT任务的增强评估指标。

Prefilled responses enhance zero-shot detection of AI-generated images

Authors: Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer

First: 2025-05-20T22:44:04+00:00 · Latest: 2025-10-08T16:59:43+00:00

Abs · PDF · Code1 · Code2

Abstract

As AI models generate increasingly realistic images, growing concerns over potential misuse underscore the need for reliable detection. Traditional supervised detection methods depend on large, curated datasets for training and often fail to generalize to novel, out-of-domain image generators. As an alternative, we explore pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. We evaluate VLM performance on three diverse benchmarks encompassing synthetic images of human faces, objects, and animals produced by 16 different state-of-the-art image generators. While off-the-shelf VLMs perform poorly on these datasets, we find that their reasoning can be guided effectively through simple response prefilling -- a method we call Prefill-Guided Thinking (PGT). In particular, prefilling a VLM response with the task-aligned phrase "Let's examine the style and the synthesis artifacts" improves the Macro F1 scores of three widely used open-source VLMs by up to 24%.

中文标题/摘要

标题：预填充响应增强零样本检测AI生成图像的能力

随着AI模型生成的图像越来越逼真，对潜在滥用的担忧加剧了对可靠检测的需求。传统监督检测方法依赖于大型、精心策划的数据集进行训练，往往无法泛化到新的、域外的图像生成器。作为替代方案，我们探索了预训练视觉-语言模型（VLMs）在零样本检测AI生成图像中的应用。我们评估了VLM在三个不同的基准上的性能，这些基准涵盖了由16种不同最先进的图像生成器生成的人脸、物体和动物的合成图像。尽管即用型VLM在这些数据集上的表现不佳，但我们发现，通过简单的响应预填充，可以有效地引导其推理——我们称之为预填充引导思考（PGT）的方法。特别是，用与任务对齐的短语“让我们检查风格和合成伪影”预填充VLM的响应，可以将三种广泛使用的开源VLM的宏F1分数提高多达24%。

Summary / 总结

This study addresses the challenge of detecting AI-generated images by leveraging pre-trained Vision-Language Models (VLMs) in a zero-shot setting. The research evaluates VLM performance on three diverse benchmarks and finds that simple response prefilling, termed Prefill-Guided Thinking (PGT), significantly enhances detection accuracy, improving Macro F1 scores by up to 24% for three open-source VLMs. This method guides the VLM's reasoning effectively, making it a promising approach for reliable detection of AI-generated content.

研究通过利用预训练的Vision-Language模型（VLM）在零样本设置下检测AI生成的图像，解决了这一挑战。研究在三个不同的基准上评估了VLM的表现，并发现通过简单的方法——称为Prefill-Guided Thinking（PGT）——填充VLM的响应，可以显著提高检测准确性，使三种开源VLM的Macro F1分数提高高达24%。这种方法有效地引导了VLM的推理，使其成为可靠检测AI生成内容的一个有前景的方法。

Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Authors: Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, Sadao Kurohashi

First: 2025-10-08T16:46:57+00:00 · Latest: 2025-10-08T16:46:57+00:00

Comments: Work in progress. Our code will be available at: https://github.com/ku-nlp/language-specific-dimensions

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.

中文标题/摘要

标题：语言存在于稀疏维度中：面向可解释和高效多语言控制的大语言模型

大语言模型在有限的非英语数据暴露下表现出强大的多语言能力。先前的研究表明，以英语为中心的大语言模型在中间层将多语言内容映射到英语对齐的表示，然后在最终层将其投影回目标语言的标记空间。基于这一观察，我们假设这种跨语言过渡由一组小而稀疏的维度控制，这些维度在中间层到最终层的一致索引中出现。基于这一见解，我们提出了一种简单的、无需训练的方法来识别和操作这些维度，只需要少量（最多50句）平行或单语数据。在多语言生成控制任务上的实验揭示了这些维度的可解释性，表明在这些维度上的干预可以切换输出语言同时保留语义内容，并且其性能显著低于基于神经元的方法的成本。

Summary / 总结

The research aims to understand the sparse dimensions in large language models that govern multilingual content representation. The method involves identifying and manipulating these dimensions using minimal data, enabling language switching in multilingual generation tasks while preserving semantic content. Experiments show that this approach outperforms previous neuron-based methods with lower computational cost.

研究旨在理解大型语言模型中控制多语言内容处理的稀疏维度。通过假设这些维度在中间层和最终层中是一致的，研究人员提出了一种无需训练的方法，仅使用少量数据来操控这些维度。实验表明，在这些维度中进行干预可以切换输出语言同时保留语义内容，并且在成本更低的情况下超越了之前的神经元基方法。

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Authors: Yi Han, Cheng Chi, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang

First: 2025-10-08T16:20:23+00:00 · Latest: 2025-10-08T16:20:23+00:00

Comments: 9 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.

中文标题/摘要

标题：TIGeR: 工具集成几何推理在视觉-语言模型中的应用以实现机器人技术

视觉-语言模型（VLMs）在空间推理方面表现出色，但它们本质上仍局限于定性的精确度，并缺乏实现现实世界机器人技术所需的计算精确度。当前的方法未能利用深度传感器和相机校准的度量线索，而是将几何问题简化为模式识别任务，这些任务无法提供机器人操作所需的厘米级精度。我们提出了TIGeR（Tool-Integrated Geometric Reasoning），这是一种新颖的框架，通过使VLMs能够生成和执行精确的几何计算，从而将它们从感知估计器转变为几何计算机。TIGeR 不试图在神经网络中内化复杂的几何操作，而是赋予模型识别几何推理需求、合成适当的计算代码并调用专门的库进行精确计算的能力。为了支持这一范式，我们引入了TIGeR-300K，这是一个全面的工具调用导向数据集，涵盖了点变换、姿态估计、轨迹生成和空间兼容性验证，包括工具调用序列和中间计算。通过结合监督微调（SFT）和强化微调（RFT）以及我们提出的分层奖励设计的两阶段训练管道，TIGeR 在几何推理基准测试中达到了最佳性能，同时在现实世界的机器人操作任务中展示了厘米级的精度。

Summary / 总结

TIGeR is a novel framework that enhances Vision-Language Models (VLMs) for geometric reasoning in robotics by integrating external tools. It transforms VLMs from perceptual estimators to geometric computers capable of generating and executing precise geometric computations. TIGeR uses a two-stage training pipeline combining supervised fine-tuning and reinforcement fine-tuning to achieve state-of-the-art performance on geometric reasoning benchmarks and demonstrates centimeter-level precision in real-world robotic manipulation tasks.

TIGeR 是一种新颖的框架，通过集成外部工具将视觉-语言模型（VLMs）转变为能够执行精确几何计算的几何计算机。它使 VLMs 能够生成和执行准确的几何计算。通过两阶段训练管道，TIGeR 在几何推理基准测试中达到了最先进的性能，并在真实的机器人操作任务中展示了厘米级的精度。

A Multi-Agent Framework for Stateful Inference-Time Search

Authors: Arshika Lalan, Rajat Ghosh, Aditya Kolsur, Debojyoti Dutta

First: 2025-10-08T15:48:41+00:00 · Latest: 2025-10-08T15:48:41+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent work explores agentic inference-time techniques to perform structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks due to the absence of persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieve surface-level code generation but remain brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families - Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.

中文标题/摘要

标题：一种状态型推理时态多智能体框架

近期研究探索了代理型推理时态技术以执行结构化、多步推理。然而，无状态推理在多步任务上常常因缺乏持久状态而遇到困难。此外，针对特定任务的微调或指令调优往往只能实现表面的代码生成，但在需要深入推理和长时依赖的任务上仍然脆弱。为解决这些局限，我们提出了一种状态型多智能体进化搜索，这是一种无需训练的框架，它通过结合(i) 持久的推理时态状态，(ii) 对抗性变异，和(iii) 进化保存，从而与之前的无状态方法脱钩。我们通过生成边缘案例来展示其在自动化单元测试生成中的有效性。我们使用进化搜索过程生成稳健的边缘案例，其中专门化的智能体依次提出、变异和评分候选方案。控制器在各代之间保持持久状态，而进化保存确保了所有可能情况下的多样性和探索。这产生了一种通才智能体，能够在未见过的代码库中发现稳健且高覆盖率的边缘案例。实验表明，我们的状态型多智能体推理框架在覆盖范围上显著优于无状态单步基线，这些基线在广泛使用的单元测试基准测试如HumanEval和TestGenEvalMini上进行了评估，并使用了三种不同的LLM家族——Llama、Gemma和GPT。这些结果表明，将持久的推理时态状态与进化搜索相结合，显著提高了单元测试生成的效果。

Summary / 总结

This paper addresses the limitations of stateless inference by proposing a stateful multi-agent evolutionary search framework. The method combines persistent inference-time state, adversarial mutation, and evolutionary preservation to perform structured multi-step reasoning. Experiments show that this framework significantly improves coverage in automated unit test generation compared to stateless single-step baselines on benchmarks like HumanEval and TestGenEvalMini using LLM families Llama, Gemma, and GPT.

论文通过提出一种状态化的多智能体进化搜索框架来解决无状态推理的局限性。该框架结合了持久的推理时状态、对抗性变异和进化保存。该方法在生成自动化单元测试的边缘案例方面表现出色，超越了在HumanEval和TestGenEvalMini等基准测试上的单一步骤无状态基线，适用于不同的LLM家族。

Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

Authors: Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq

First: 2025-10-08T15:29:48+00:00 · Latest: 2025-10-08T15:29:48+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs

中文标题/摘要

标题：遥感视觉语言模型的少样本适应基准

遥感视觉语言模型（RSVLMs）由于大规模预训练显示出显著的潜力，在各种任务上实现了强大的零样本性能。然而，它们在低数据环境下的泛化能力，如少样本学习，仍缺乏充分探索。在本文中，我们首次提出了一个结构化的基准，用于评估RSVLMs的少样本适应方法。我们在十个遥感场景分类数据集上进行了全面实验，应用了五种广泛使用的少样本适应策略，对三种最先进的RSVLMs进行了评估，这些模型具有不同的骨干网络。我们的研究发现，具有相似零样本性能的模型在少样本适应下的行为可能大不相同，一些RSVLMs比其他模型更易于进行此类适应。性能的变异性以及现有方法中没有明显胜者表明，需要开发更多针对遥感的鲁棒性更强的少样本适应方法。为了促进未来的研究，我们提供了一个可重复的基准测试框架和开源代码，以系统地评估RSVLMs在少样本条件下的性能。源代码已公开发布在Github上：https://github.com/elkhouryk/fewshot_RSVLMs

Summary / 总结

This work introduces the first structured benchmark for evaluating few-shot adaptation methods on Remote Sensing Vision-Language Models (RSVLMs). Comprehensive experiments across ten remote sensing scene classification datasets show that models with similar zero-shot performance can have different behaviors under few-shot adaptation, with some RSVLMs being more adaptable than others. The variability in performance and the lack of a clear winner among existing methods underscore the need for more robust few-shot adaptation methods tailored to remote sensing. The study provides a reproducible benchmarking framework and open-source code for future research.

这项工作首次引入了一个结构化的基准，用于评估Remote Sensing Vision-Language Models (RSVLMs)在少样本适应方法上的表现。在十个遥感场景分类数据集上的全面实验表明，具有相似零样本性能的模型在少样本适应时表现差异很大，有些RSVLMs比其他模型更易于适应。性能的差异性强调了需要开发更适合遥感任务的更稳健的少样本适应方法的必要性。

MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency

Authors: Dongki Jung, Jaehoon Choi, Yonghan Lee, Sungmin Eum, Heesung Kwon, Dinesh Manocha

First: 2025-10-08T15:11:32+00:00 · Latest: 2025-10-08T15:11:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the estimated 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse view rendering scenarios.

中文标题/摘要

标题：MoRe: 通过图优化实现单目几何精炼以提高跨视图一致性

单目3D基础模型为感知任务提供了一种可扩展的解决方案，使其成为更广泛3D视觉应用的吸引力选择。在本文中，我们提出MoRe，一种无需训练的单目几何精炼方法，旨在提高跨视图一致性并实现尺度对齐。为了诱导帧间关系，我们的方法在帧之间使用特征匹配来建立对应关系。我们不直接在这些匹配点上应用简单的最小二乘优化，而是提出了一种基于图的优化框架，使用估计的3D点和单目基础模型估计的表面法线进行局部平面近似。这种形式化方法解决了单目几何先验中的尺度歧义，同时保留了潜在的3D结构。我们进一步证明，MoRe不仅增强了3D重建，还提高了新颖视图合成，特别是在稀疏视图渲染场景中。

Summary / 总结

MoRe is a training-free method for monocular geometry refinement that enhances cross-view consistency and scale alignment. It uses feature matching to establish correspondences between frames and applies a graph-based optimization framework to perform local planar approximation using estimated 3D points and surface normals. This approach resolves scale ambiguity and preserves the underlying 3D structure, thereby improving 3D reconstruction and novel view synthesis, especially in sparse view scenarios.

该论文提出了一种无需训练的方法MoRe，用于单目几何精化，以增强跨视图一致性并实现尺度对齐。通过使用特征匹配在帧之间建立对应关系，MoRe采用基于图的优化框架进行局部平面近似，解决尺度歧义问题同时保留3D结构。该方法不仅提高了3D重建效果，还在稀疏视图渲染场景中提升了新颖视图合成的效果。

Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications

Authors: Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, Yuke Zhu

First: 2025-10-08T14:38:25+00:00 · Latest: 2025-10-08T14:38:25+00:00

Comments: Accepted to IEEE Access, website: https://vla-survey.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention. By unifying vision, language, and action data at scale, which have traditionally been studied separately, VLA models aim to learn policies that generalise across diverse tasks, objects, embodiments, and environments. This generalisation capability is expected to enable robots to solve novel downstream tasks with minimal or no additional task-specific data, facilitating more flexible and scalable real-world deployment. Unlike previous surveys that focus narrowly on action representations or high-level model architectures, this work offers a comprehensive, full-stack review, integrating both software and hardware components of VLA systems. In particular, this paper provides a systematic review of VLAs, covering their strategy and architectural transition, architectures and building blocks, modality-specific processing techniques, and learning paradigms. In addition, to support the deployment of VLAs in real-world robotic applications, we also review commonly used robot platforms, data collection strategies, publicly available datasets, data augmentation methods, and evaluation benchmarks. Throughout this comprehensive survey, this paper aims to offer practical guidance for the robotics community in applying VLAs to real-world robotic systems. All references categorized by training approach, evaluation method, modality, and dataset are available in the table on our project website: https://vla-survey.github.io .

中文标题/摘要

标题：机器人领域的视觉-语言-行动模型：面向实际应用的综述

随着对大型语言模型（LLMs）和视觉-语言模型（VLMs）在机器人领域的应用日益重视，视觉-语言-行动（VLA）模型最近获得了广泛关注。通过大规模统一视觉、语言和行动数据，VLA模型旨在学习能够跨多种任务、物体、实体和环境泛化的策略。这种泛化能力有望使机器人能够解决新的下游任务，无需或只需少量特定任务的数据，从而促进更灵活和可扩展的实际部署。不同于以往专注于行动表示或高层模型架构的综述，本文提供了一个全面的全栈式综述，整合了VLA系统的软件和硬件组件。特别是，本文系统地回顾了VLA，涵盖了其策略和架构转变、架构和构建块、模态特定处理技术以及学习范式。此外，为了支持VLA在实际机器人应用中的部署，本文还回顾了常用的机器人平台、数据收集策略、公开可用的数据集、数据增强方法和评估基准。在整个全面综述中，本文旨在为机器人社区在实际机器人系统中应用VLA提供实用指导。所有按训练方法、评估方法、模态和数据集分类的参考文献可在我们项目网站上的表格中获取：https://vla-survey.github.io。

Summary / 总结

This paper reviews Vision-Language-Action (VLA) models, which integrate vision, language, and action data to learn generalizable policies for robotics. The study aims to enable robots to perform novel tasks with minimal additional data, facilitating real-world deployment. It covers VLA strategies, architectures, processing techniques, and learning paradigms, and provides a comprehensive overview of robot platforms, data collection, datasets, and evaluation benchmarks to support practical applications.

本文回顾了将视觉、语言和动作数据结合的Vision-Language-Action (VLA)模型，旨在使机器人能够跨任务和环境进行泛化，从而促进其实用部署。研究提供了VLA策略、架构和学习范式的全面概述，以及机器人平台和评估基准的细节。关键发现包括VLA模型在少量额外数据的情况下解决新任务的潜力，以及需要稳健的数据收集和增强方法。

Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration

Authors: Tengwei Song, Min Wu, Yuan Fang

First: 2025-10-08T14:02:51+00:00 · Latest: 2025-10-08T14:02:51+00:00

Comments: CIKM 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Molecular representation learning plays a crucial role in advancing applications such as drug discovery and material design. Existing work leverages 2D and 3D modalities of molecular information for pre-training, aiming to capture comprehensive structural and geometric insights. However, these methods require paired 2D and 3D molecular data to train the model effectively and prevent it from collapsing into a single modality, posing limitations in scenarios where a certain modality is unavailable or computationally expensive to generate. To overcome this limitation, we propose FlexMol, a flexible molecule pre-training framework that learns unified molecular representations while supporting single-modality input. Specifically, inspired by the unified structure in vision-language models, our approach employs separate models for 2D and 3D molecular data, leverages parameter sharing to improve computational efficiency, and utilizes a decoder to generate features for the missing modality. This enables a multistage continuous learning process where both modalities contribute collaboratively during training, while ensuring robustness when only one modality is available during inference. Extensive experiments demonstrate that FlexMol achieves superior performance across a wide range of molecular property prediction tasks, and we also empirically demonstrate its effectiveness with incomplete data. Our code and data are available at https://github.com/tewiSong/FlexMol.

中文标题/摘要

标题：统一分子预训练与灵活的2D和3D模态：单模态和配对模态集成

分子表示学习在推动药物发现和材料设计等应用方面发挥着关键作用。现有工作利用分子的2D和3D模态进行预训练，旨在捕捉全面的结构和几何洞察。然而，这些方法需要配对的2D和3D分子数据来有效训练模型并防止其退化为单一模态，这在某些模态不可用或生成成本高昂的情况下构成了限制。为克服这一限制，我们提出了一种灵活的分子预训练框架FlexMol，该框架在支持单模态输入的同时学习统一的分子表示。具体而言，我们的方法借鉴了视觉语言模型中的统一结构，分别使用2D和3D分子数据的模型，通过参数共享提高计算效率，并利用解码器生成缺失模态的特征。这使得在训练过程中两种模态能够协作贡献，同时在仅有一种模态可用的推理阶段确保鲁棒性。广泛的实验表明，FlexMol 在一系列分子性质预测任务中取得了优越的性能，并且我们还通过不完整数据的实验验证了其有效性。我们的代码和数据可在 https://github.com/tewiSong/FlexMol 获取。

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Authors: Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri

Venue: NeurIPS

First: 2025-05-27T01:24:29+00:00 · Latest: 2025-10-08T13:51:05+00:00

Comments: The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.

中文标题/摘要

标题：Roboflow100-VL：一种多域物体检测基准，用于视觉-语言模型

在互联网规模数据上训练的视觉-语言模型（VLMs）在常见的物体如汽车、卡车和行人的零样本检测性能上表现出色。然而，最先进的模型仍然难以泛化到其预训练中不常见的分布外类别、任务和成像模态。与其简单地在更多视觉数据上重新训练VLMs，我们认为应该通过包含少量视觉示例和丰富文本描述的注释指令来对齐VLMs到新的概念。为此，我们引入了Roboflow100-VL，这是一个包含100个多模态物体检测数据集的大规模集合，这些数据集中的概念在VLM预训练中不常见。我们在零样本、少样本、半监督和完全监督设置下评估了最先进的模型，允许在不同数据条件下进行比较。值得注意的是，我们在Roboflow100-VL中的挑战性医学成像数据集上发现，如GroundingDINO和Qwen2.5-VL等VLMs的零样本准确率低于2%，这表明需要进行少样本概念对齐。最后，我们讨论了我们最近的CVPR 2025基础FSOD竞赛，并分享了社区的见解。值得注意的是，获胜团队的性能比我们的基线高出17个mAP！我们的代码和数据集可在https://github.com/roboflow/rf100-vl和https://universe.roboflow.com/rf100-vl/获取。

DiffMI: Breaking Face Recognition Privacy via Diffusion-Driven Training-Free Model Inversion

Authors: Hanrui Wang, Shuo Wang, Chun-Shien Lu, Isao Echizen

First: 2025-04-25T01:53:27+00:00 · Latest: 2025-10-08T13:46:41+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Face recognition poses serious privacy risks due to its reliance on sensitive and immutable biometric data. While modern systems mitigate privacy risks by mapping facial images to embeddings (commonly regarded as privacy-preserving), model inversion attacks reveal that identity information can still be recovered, exposing critical vulnerabilities. However, existing attacks are often computationally expensive and lack generalization, especially those requiring target-specific training. Even training-free approaches suffer from limited identity controllability, hindering faithful reconstruction of nuanced or unseen identities. In this work, we propose DiffMI, the first diffusion-driven, training-free model inversion attack. DiffMI introduces a novel pipeline combining robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective. DiffMI applies directly to unseen target identities and face recognition models, offering greater adaptability than training-dependent approaches while significantly reducing computational overhead. Our method achieves 84.42%--92.87% attack success rates against inversion-resilient systems and outperforms the best prior training-free GAN-based approach by 4.01%--9.82%. The implementation is available at https://github.com/azrealwang/DiffMI.

中文标题/摘要

标题：DiffMI：通过扩散驱动的无训练模型反转打破面部识别隐私

面部识别由于依赖敏感且不可变的生物特征数据，带来了严重的隐私风险。尽管现代系统通过将面部图像映射到嵌入（通常认为是隐私保护的）来缓解隐私风险，但模型反转攻击表明，身份信息仍然可以恢复，暴露了关键的漏洞。然而，现有的攻击往往计算成本高昂且缺乏泛化能力，尤其是那些需要针对特定目标进行训练的攻击。即使无训练方法也受到身份可控性有限的限制，阻碍了对复杂或未见过的身份进行忠实重建。在本文中，我们提出了DiffMI，这是第一个扩散驱动的无训练模型反转攻击。DiffMI 引入了一种新颖的结合了鲁棒潜在代码初始化、排名对抗性细化策略和统计上基于信心的优化目标的管道。DiffMI 可直接应用于未见过的目标身份和面部识别模型，提供了比依赖训练的方法更大的适应性，同时显著减少了计算开销。我们的方法在对抗反转鲁棒系统中的攻击成功率达到了84.42%--92.87%，并优于最佳的先前提取的基于GAN的方法4.01%--9.82%。实现代码可在https://github.com/azrealwang/DiffMI 获取。

Summary / 总结

This paper addresses the privacy risks in face recognition by proposing DiffMI, a diffusion-driven, training-free model inversion attack. DiffMI combines robust latent code initialization, a ranked adversarial refinement strategy, and a confidence-aware optimization objective to achieve high attack success rates against inversion-resilient systems. It outperforms existing training-free GAN-based approaches by 4.01% to 9.82% and offers greater adaptability with reduced computational overhead.

本文提出了一种基于扩散驱动的无训练模型反转攻击DiffMI，结合了鲁棒的潜在代码初始化、排名对抗精炼策略和置信度感知的优化目标，以实现对反转抗性的系统高攻击成功率。与现有的基于GAN的无训练方法相比，其性能提高了4.01%到9.82%，并且具有更高的适应性和更低的计算开销。

GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm

Authors: Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen

First: 2025-01-24T04:17:03+00:00 · Latest: 2025-10-08T13:27:03+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Deep neural networks are highly vulnerable to adversarial examples that inputs with small, carefully crafted perturbations that cause misclassification, making adversarial attacks an essential tool for robustness evaluation. Existing black-box attacks fall into three categories: query-only, transfer-only, and query-and-transfer, and vary in perturbation pattern and optimization strategy. However, no prior method jointly achieves query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization, leaving a gap between black-box flexibility and white-box precision. We present GreedyPixel, a new attack framework that fills this gap by combining a surrogate-derived pixel priority map with greedy, per-pixel optimization refined by query feedback. This design reduces the exponential brute-force search space to a tractable linear procedure, guarantees monotonic loss decrease and convergence to a coordinate-wise optimum, and concentrates perturbations on robust, semantically meaningful pixels to improve perceptual quality. Extensive experiments on CIFAR-10 and ImageNet under both white-box and black-box settings demonstrate that GreedyPixel achieves state-of-the-art attack success rates and produces visually imperceptible perturbations. Our results show that GreedyPixel bridges the precision gap between white-box and black-box attacks and provides a practical framework for fine-grained robustness evaluation. The implementation is available at https://github.com/azrealwang/greedypixel.

中文标题/摘要

标题：GreedyPixel：基于贪婪算法的细粒度黑盒对抗攻击

深度神经网络对对抗样本极为敏感，这些样本通过小的、精心构造的扰动导致误分类，使得对抗攻击成为评估鲁棒性的重要工具。现有的黑盒攻击可分为三类：查询型、迁移型和查询-迁移型，它们在扰动模式和优化策略上有所不同。然而，没有先前的方法能够同时实现查询-迁移指导、像素级稀疏性和无需训练的直接优化，这在黑盒灵活性和白盒精度之间留下了差距。我们提出了GreedyPixel，这是一种新的攻击框架，通过结合代理模型衍生的像素优先级图和基于查询反馈的贪婪、逐像素优化来填补这一空白。这种设计将指数级的暴力搜索空间缩减为可处理的线性过程，保证了损失的单调减少和坐标方向上的收敛，并将扰动集中在鲁棒且语义上有意义的像素上，以提高感知质量。在CIFAR-10和ImageNet上的广泛实验表明，GreedyPixel实现了最先进的攻击成功率，并产生了视觉上不可感知的扰动。我们的结果表明，GreedyPixel弥合了白盒和黑盒攻击之间的精度差距，并提供了一种实用的细粒度鲁棒性评估框架。实现代码可在https://github.com/azrealwang/greedypixel/获取。

Summary / 总结

GreedyPixel is a new black-box adversarial attack framework that combines a surrogate-derived pixel priority map with greedy, per-pixel optimization to achieve query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization. It reduces the search space, guarantees monotonic loss decrease, and focuses perturbations on robust, semantically meaningful pixels. Experiments show that GreedyPixel achieves state-of-the-art attack success rates and produces visually imperceptible perturbations on CIFAR-10 and ImageNet datasets. This method bridges the precision gap between white-box and black-box attacks and provides a practical framework for robustness evaluation.

GreedyPixel 是一种新的黑盒对抗攻击框架，结合了代理模型衍生的像素优先级图和逐像素贪婪优化，实现了查询和传输指导、像素级稀疏性和无需训练的直接优化。它减少了搜索空间，保证了损失的单调减少，并将扰动集中在语义上有意义的像素上。实验表明，GreedyPixel 在 CIFAR-10 和 ImageNet 数据集上实现了最先进的攻击成功率，并产生了视觉上不可感知的扰动。这项工作填补了白盒和黑盒攻击之间的精度差距，并提供了一种实用的鲁棒性评估框架。

Unified Unsupervised Anomaly Detection via Matching Cost Filtering

Authors: Zhe Zhang, Mingxiu Cai, Gaochang Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu

First: 2025-10-03T03:28:18+00:00 · Latest: 2025-10-08T12:00:06+00:00

Comments: 63 pages (main paper and supplementary material), 39 figures, 58 tables

Abs · PDF · Code1 · Code2 · Code3

Abstract

Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.

中文标题/摘要

标题：统一无监督异常检测方法：匹配成本过滤

无监督异常检测（UAD）旨在仅使用正常训练数据来识别图像和像素级别的异常，广泛应用于工业检测和医学分析等领域，其中异常由于隐私问题和冷启动限制而稀缺。现有方法无论是基于重建（恢复正常样本）还是基于嵌入（预训练表示），本质上都是在图像或特征级别进行匹配以生成异常图。然而，匹配噪声被广泛忽视，限制了它们的检测能力。除了早期对基于单模态RGB的UAD的关注，最近的进步扩展到了多模态场景，例如RGB-3D和RGB-Text，这得益于点云传感和视觉语言模型。尽管存在共同挑战，这些领域仍然相对孤立，阻碍了全面的理解和知识转移。在本文中，我们提倡在匹配视角下统一的UAD方法，无论是单模态还是多模态。基于这一见解，我们提出了统一成本过滤（UCF），这是一种通用的后处理精炼框架，用于任何UAD模型的成本体积精炼。成本体积通过将测试样本与相同或不同模态的正常样本进行匹配构建，然后通过来自测试样本的多层注意力引导的可学习过滤模块进行精炼，减轻匹配噪声并突出细微异常。在22个不同基准上的全面实验表明，UCF在增强各种UAD方法方面具有有效性，一致地在单模态（RGB）和多模态（RGB-3D，RGB-Text）UAD场景中实现了新的最佳结果。代码和模型将在https://github.com/ZHE-SAPI/CostFilter-AD上发布。

Train-Free Segmentation in MRI with Cubical Persistent Homology

Authors: Anton François, Raphaël Tinarrage

First: 2024-01-02T11:43:49+00:00 · Latest: 2025-10-08T11:59:15+00:00

Comments: Preprint, 36 pages, 18 figures, 4 tables. For associated code, see https://github.com/antonfrancois/gliomaSegmentation_TDA

Abs · PDF · Code1 · Code2 · Code3

Abstract

We present a new general framework for segmentation of MRI scans based on Topological Data Analysis (TDA), offering several advantages over traditional machine learning approaches. The pipeline proceeds in three steps, first identifying the whole object to segment via automatic thresholding, then detecting a distinctive subset whose topology is known in advance, and finally deducing the various components of the segmentation. Unlike most prior TDA uses in medical image segmentation, which are typically embedded within deep networks, our approach is a standalone method tailored to MRI. A key ingredient is the localization of representative cycles from the persistence diagram, which enables interpretable mappings from topological features to anatomical components. In particular, the method offers the ability to perform segmentation without the need for large annotated datasets. Its modular design makes it adaptable to a wide range of data segmentation challenges. We validate the framework on three applications: glioblastoma segmentation in brain MRI, where a sphere is to be detected; myocardium in cardiac MRI, forming a cylinder; and cortical plate detection in fetal brain MRI, whose 2D slices are circles. We compare our method with established supervised and unsupervised baselines.

中文标题/摘要

标题：基于立方持久同调的MRI无训练分割

我们提出了一种基于拓扑数据分析(TDA)的新颖通用框架，用于MRI扫描的分割，相比传统机器学习方法具有多项优势。该流水线分为三个步骤：首先通过自动阈值确定要分割的整体对象，然后检测一个拓扑结构已知的特殊子集，最后推断分割的各个组成部分。与大多数先前将TDA用于医学图像分割的方法通常嵌入在深度网络中不同，我们的方法是一种独立的方法，专门针对MRI。关键成分是将持久图中的代表性环路进行定位，这使得从拓扑特征到解剖结构的映射具有可解释性。特别是，该方法提供了在无需大量标注数据的情况下进行分割的能力。其模块化设计使其能够适应广泛的分割挑战。我们在三个应用中验证了该框架：在脑部MRI中进行胶质母细胞瘤分割，需要检测一个球体；在心脏MRI中进行心肌分割，形成一个圆柱；在胎儿脑部MRI中进行皮质板检测，其2D切片是圆。我们将我们的方法与现有的监督和无监督基准方法进行了比较。

Summary / 总结

This paper introduces a novel framework for MRI segmentation using Topological Data Analysis (TDA), which avoids the need for large annotated datasets. The method consists of three steps: automatic thresholding for object identification, detecting a known topological subset, and deducing the segmentation components. It validates the approach on three MRI applications, demonstrating its ability to segment glioblastoma, myocardium, and cortical plate without training data. The modular design allows for adaptability to various segmentation challenges.

该论文提出了一种新的MRI分割框架，利用拓扑数据分析（TDA），无需大量标注数据。方法分为三步：自动阈值确定对象，检测已知拓扑子集，推导分割组件。该方法在三个MRI应用中进行了验证，展示了其在无训练数据情况下分割胶质母细胞瘤、心肌和胎儿脑皮层的能力。模块化设计使其能够适应各种分割挑战。

AC-LoRA: (Almost) Training-Free Access Control-Aware Multi-Modal LLMs

Authors: Lara Magdalena Lazier, Aritra Dhar, Vasilije Stambolic, Lukas Cavigelli

Venue: NeurIPS 2025

First: 2025-05-15T23:19:35+00:00 · Latest: 2025-10-08T10:01:30+00:00

Comments: Accepted in NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Corporate LLMs are gaining traction for efficient knowledge dissemination and management within organizations. However, as current LLMs are vulnerable to leaking sensitive information, it has proven difficult to apply them in settings where strict access control is necessary. To this end, we design AC-LoRA, an end-to-end system for access control-aware corporate LLM chatbots that maintains a strong information isolation guarantee. AC-LoRA maintains separate LoRA adapters for permissioned datasets, along with the document embedding they are finetuned on. AC-LoRA retrieves a precise set of LoRA adapters based on the similarity score with the user query and their permission. This similarity score is later used to merge the responses if more than one LoRA is retrieved, without requiring any additional training for LoRA routing. We provide an end-to-end prototype of AC-LoRA, evaluate it on two datasets, and show that AC-LoRA matches or even exceeds the performance of state-of-the-art LoRA mixing techniques while providing strong isolation guarantees. Furthermore, we show that AC-LoRA design can be directly applied to different modalities.

中文标题/摘要

标题：AC-LoRA: (几乎) 不需要训练的访问控制意识多模态LLM

企业LLM在组织内部高效知识传播和管理方面正逐渐受到青睐。然而，由于当前LLM存在泄露敏感信息的风险，它们在需要严格访问控制的环境中应用起来非常困难。为此，我们设计了AC-LoRA，这是一种端到端的系统，用于访问控制意识的企业LLM聊天机器人，能够保持强烈的信息隔离保证。AC-LoRA为授权数据集维护独立的LoRA适配器，以及它们所微调的文档嵌入。AC-LoRA根据用户查询与LoRA适配器的相似度分数检索精确的LoRA适配器集，并根据其权限使用相似度分数合并响应，而无需对LoRA路由进行额外训练。我们提供了AC-LoRA的端到端原型，并在两个数据集上进行了评估，结果显示AC-LoRA在提供强隔离保证的同时，其性能与最先进的LoRA混合技术相当甚至更优。此外，我们展示了AC-LoRA的设计可以直接应用于不同的模态。

Summary / 总结

AC-LoRA is designed to enhance access control in corporate LLMs by maintaining separate LoRA adapters for permissioned datasets, ensuring strong information isolation. The system retrieves relevant LoRA adapters based on user queries and merges their responses without additional training. Experiments show that AC-LoRA matches or surpasses state-of-the-art LoRA mixing techniques in performance while providing robust isolation guarantees. This design can be adapted to different modalities.

AC-LoRA 设计用于增强企业 LLM 中的访问控制，通过为权限数据集维护独立的 LoRA 适配器。系统根据用户查询检索最相关的 LoRA 适配器，并在不进行额外训练的情况下合并它们的响应。实验表明，AC-LoRA 在确保强信息隔离的同时，性能与最先进的 LoRA 混合技术相当甚至更优。该设计还可应用于不同的模态。

Intelligent Healthcare Imaging Platform: A VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation

Authors: Samer Al-Hamadani

First: 2025-09-16T23:15:44+00:00 · Latest: 2025-10-08T09:41:31+00:00

Comments: 32 pages, 14 figures, 6 tables

Abs · PDF · Code1 · Code2

Abstract

The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.

中文标题/摘要

标题：智能医疗成像平台：基于VLM的自动化医学图像分析和临床报告生成框架

医疗成像中人工智能（AI）的迅速发展已经革新了诊断医学和临床决策过程。本研究提出了一种基于视觉语言模型（VLM）的智能多模态医学图像分析框架。该框架利用Google Gemini 2.5 Flash实现多模态成像（包括CT、MRI、X光和超声）的自动化肿瘤检测和临床报告生成。系统结合了视觉特征提取和自然语言处理，以实现上下文图像解释，采用坐标验证机制和概率高斯建模来描述异常分布。多层可视化技术生成详细的医学插图、叠加比较和统计表示，以增强临床信心，位置测量平均偏差为80像素。结果处理利用精确的提示工程和文本分析来提取结构化的临床信息，同时保持可解释性。实验评估表明，该系统在多种模态下具有高异常检测性能。该系统具有用户友好的Gradio界面，便于临床工作流程集成，并展示了零样本学习能力，以减少对大数据集的依赖。该框架代表了自动化诊断支持和放射学工作流程效率的重要进步，但临床验证和多中心评估是广泛采用之前必要的。

Summary / 总结

This work introduces an intelligent multimodal framework for medical image analysis using Vision-Language Models (VLMs) to automate tumor detection and clinical report generation across various imaging modalities. The system integrates Google Gemini 2.5 Flash for visual feature extraction and natural language processing, achieving 80 pixels average deviation in location measurement. Experimental evaluations show high performance in anomaly detection across multiple modalities, with zero-shot learning capabilities to reduce data dependency. The framework includes a user-friendly Gradio interface for clinical workflow integration and enhanced visualization techniques for better clinical confidence, though further clinical validation is needed.

该研究提出了一种使用视觉-语言模型（VLM）的智能多模态框架，用于自动化多种成像模态（包括CT、MRI、X光和超声）的肿瘤检测和临床报告生成。系统结合了视觉特征提取和自然语言处理，并包括坐标验证和概率建模机制。实验结果显示，该系统在多种模态的异常检测中表现出色，具有用户友好的Gradio界面和零样本学习能力。然而，在广泛采用之前，仍需进行临床验证和多中心评估。

Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Authors: Chi Yan, Dan Xu

First: 2025-10-06T12:36:07+00:00 · Latest: 2025-10-08T09:34:48+00:00

Comments: Project Page: https://yanchi-3dv.github.io/PG-Occ

Abs · PDF · Code1 · Code2 · Project1

Abstract

The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ

中文标题/摘要

标题：具有各向异性感知采样的渐进高斯变换器用于开放词汇占用预测

近年来，3D 占有预测任务取得了显著进展，在基于视觉的自动驾驶系统中发挥着重要作用。虽然传统方法局限于固定语义类别，但最近的方法转向预测文本对齐特征，以支持现实场景中的开放词汇文本查询。然而，在文本对齐场景建模中存在权衡：稀疏的高斯表示难以捕捉场景中的小物体，而密集表示则会带来显著的计算开销。为了解决这些限制，我们提出了 PG-Occ，一种创新的渐进高斯变换器框架，以实现开放词汇的3D 占有预测。我们的框架采用渐进在线稠密化，这是一种逐步增强3D 高斯表示以捕捉细粒度场景细节的前馈策略。通过迭代增强表示，框架实现了越来越精确和详细的场景理解。另一个关键贡献是引入了具有时空融合的各向异性感知采样策略，该策略根据不同尺度和阶段适配高斯的感知域，从而实现更有效的特征聚合和更丰富的场景信息捕获。通过广泛的评估，我们证明 PG-Occ 达到了最先进的性能，相对提高了上一最佳方法14.3%的mIoU。代码和预训练模型将在我们项目页面上发布：https://yanchi-3dv.github.io/PG-Occ

Summary / 总结

The paper introduces PG-Occ, a Progressive Gaussian Transformer Framework for open-vocabulary 3D occupancy prediction. It addresses the limitations of sparse and dense Gaussian representations by employing progressive online densification and an anisotropy-aware sampling strategy. The framework iteratively enhances the 3D Gaussian representation to capture fine-grained scene details and adaptively assigns receptive fields to Gaussians at different scales. Experimental results show that PG-Occ achieves state-of-the-art performance with a 14.3% relative improvement in mIoU compared to the previous best method.

PG-Occ 是一种渐进高斯变换框架，用于开放词汇的 3D 占有预测。它使用渐进在线密集化来增强 3D 高斯表示，并采用具有时空融合的各向异性感知采样策略，以捕捉细粒度的场景细节。PG-Occ 达到了最先进的性能，相比之前的最佳方法在 mIoU 上提高了 14.3%。

Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

Authors: Tavish McDonald, Bo Lei, Stanislav Fort, Bhavya Kailkhura, Brian Bartoldson

First: 2025-10-08T09:18:53+00:00 · Latest: 2025-10-08T09:18:53+00:00

Comments: 17 pages

Abs · PDF · Code1 · Code2

Abstract

Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model's training data better reflects the attacked data's components. We empirically support this hypothesis across vision language model and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization, while RL finetuning and protracted reasoning are not critical. For example, increasing emphasis on defensive specifications via prompting lowers the success rate of gradient-based multimodal attacks on VLMs robustified by adversarial pretraining, but this same intervention provides no such benefit to not-robustified models. This correlation of inference-compute's robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Accordingly, we advise layering train-time and test-time defenses to obtain their synergistic benefit.

中文标题/摘要

标题：获取财富或死亡：盈利地交易推理计算以提高鲁棒性

尽管在模型的鲁棒性提升上投入了大量训练计算资源，但模型仍然容易受到对抗性离分布（OOD）数据的影响。Zaremba等人（2025）在测试时对此问题取得进展，表明语言模型的推理能力提高了模型规范的满足度，这些规范旨在抵御攻击，从而导致推理努力与对抗性破解的鲁棒性之间存在相关性。然而，当攻击者获得梯度访问权限或多种模态输入时，这种测试计算的好处会消失。我们解决了这一缺口，阐明了即使在这些情况下，推理计算也提供了益处。我们的方法认为，通过组成泛化，OOD数据可以通过其分布内（ID）组件变得可理解，从而在对抗性OOD输入上遵守防御规范。具体而言，我们提出了推理计算鲁棒性假设（RICH）：当模型的训练数据更好地反映受攻击数据的组件时，推理计算防御会受益。我们通过视觉语言模型和不同类型的攻击进行实证支持，发现如果通过组成泛化解锁OOD数据上的规范遵循，测试计算可以带来鲁棒性提升，而强化学习微调和长时间推理则不是关键因素。例如，通过提示增加防御规范的强调会降低对抗预训练后使VLMs鲁棒化的梯度基线多模态攻击的成功率，但同样的干预措施对未鲁棒化的模型没有这种益处。推理计算鲁棒性益处与基础模型鲁棒性之间的这种相关性是RICH的富者愈富动态：受攻击数据的组件对于鲁棒化模型来说更ID，有助于组成泛化到OOD数据。因此，我们建议叠加训练时间和测试时间的防御以获得它们的协同效益。

Summary / 总结

The paper addresses the vulnerability of models to adversarially out-of-distribution (OOD) data despite significant training compute. It introduces the Robustness from Inference Compute Hypothesis (RICH), suggesting that increased inference compute can enhance model robustness, especially when the model's training data better reflects the attacked data's components. Empirical results show that compositional generalization, enabled by this inference compute, leads to robustness gains, particularly in vision language models, without the need for extensive reinforcement learning fine-tuning or prolonged reasoning. This dynamic is termed the 'rich-get-richer' effect, where robustified models benefit more from inference compute due to their better alignment with attacked data components.

论文探讨了尽管进行了大量训练计算，模型仍然容易受到对抗性离分布数据的影响。它提出了推理计算稳健性假设（RICH），认为推理计算可以增强模型的稳健性，尤其是在模型的训练数据更好地反映了攻击数据的组成部分时。实验证明，通过提示增加防御性规范可以提高对抗性梯度多模态攻击下稳健化视觉语言模型的鲁棒性，但对非稳健化模型则无此效果。这表明稳健化模型从推理计算中获益更多，符合RICH的富者愈富动态。

TTRV: Test-Time Reinforcement Learning for Vision Language Models

Authors: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza

First: 2025-10-08T09:10:31+00:00 · Latest: 2025-10-08T09:10:31+00:00

Abs · PDF · Code1 · Code2

Abstract

Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets.Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.

中文标题/摘要

标题：TTRV：视觉语言模型的测试时强化学习

现有的强化学习中提取奖励信号的方法通常依赖于标记数据和专门的训练分割，这与人类直接从环境学习的方式不同。在本工作中，我们提出了TTRV，通过在推理时实时调整模型来增强视觉语言理解，无需任何标记数据。具体而言，我们通过基于基模型输出频率设计奖励，结合多次对每个测试样本进行推理，改进了Group Relative Policy Optimization (GRPO)框架。此外，我们还提出通过同时奖励模型以获得输出经验分布的低熵来控制模型输出的多样性。我们的方法在对象识别和视觉问答（VQA）中均取得了持续的改进，分别提高了52.4%和29.8%，并在16个数据集中平均提高了24.6%和10.0%。在图像识别方面，TTRV应用于InternVL 8B在8个基准测试中平均优于GPT-4o 2.3%，同时在VQA方面保持竞争力，表明测试时的强化学习可以匹配或超越最强的专有模型。最后，我们发现测试时的RL对于VLMs有许多有趣的特性：例如，在极端数据受限的场景中，即使在单个随机选择的未标记测试样本上进行适应，TTRV仍能带来高达5.5%的识别任务改进。

OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot

Authors: Junhan Zhu, Hesong Wang, Mingluo Su, Zefang Wang, Huan Wang

First: 2025-10-08T08:19:15+00:00 · Latest: 2025-10-08T08:19:15+00:00

Abs · PDF · Code1 · Code2

Abstract

Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.

中文标题/摘要

标题：OBS-Diff：针对扩散模型的一次性精确剪枝

大规模文本到图像扩散模型虽然强大，但计算成本高昂。现有的一次性网络剪枝方法由于扩散模型的迭代去噪特性，难以直接应用于它们。为解决这一问题，本文提出了一种名为OBS-Diff的新颖一次性剪枝框架，能够实现大规模文本到图像扩散模型的准确且无需训练的压缩。具体而言，(i) OBS-Diff 重新激活了经典的Optimal Brain Surgeon (OBS)，将其适应现代扩散模型的复杂架构，并支持多种剪枝粒度，包括无结构、N:M半结构和结构（MHA头和FFN神经元）稀疏性；(ii) 为了使剪枝标准与扩散过程的迭代动态相一致，通过从误差累积的角度出发，我们提出了一种新颖的时间步感知Hessian构建方法，引入了对数递减加权方案，赋予早期时间步更高的重要性，以减轻潜在的误差累积；(iii) 此外，还提出了一种计算高效的分组顺序剪枝策略，以摊销昂贵的校准过程。大量实验表明，OBS-Diff 在扩散模型中实现了最先进的一次性剪枝，实现了最小视觉质量降级的推理加速。

Summary / 总结

This paper addresses the computational challenges of large-scale text-to-image diffusion models by presenting OBS-Diff, a novel one-shot pruning framework. OBS-Diff adapts the Optimal Brain Surgeon method to modern diffusion model architectures, supporting various sparsity types. It introduces a timestep-aware Hessian construction with a logarithmic-decrease weighting scheme to align pruning with the iterative denoising process. Additionally, it proposes an efficient group-wise sequential pruning strategy to reduce calibration costs. Experiments demonstrate that OBS-Diff achieves state-of-the-art one-shot pruning, offering significant inference acceleration with minimal impact on visual quality.

OBS-Diff 是一种新颖的一次性剪枝框架，旨在压缩大规模文本到图像的扩散模型同时保持高精度。它将 Optimal Brain Surgeon 方法适应于支持多种稀疏性类型，并引入了一种时间步长感知的海森矩阵构建方法，带有对数递减加权方案，以使剪枝与迭代去噪过程对齐。此外，还提出了一种计算高效的分组顺序剪枝策略来减少校准成本。实验表明，OBS-Diff 达到了最先进的剪枝效果，提供了显著的推理加速同时对视觉质量影响最小。

Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA

Authors: Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong

First: 2024-12-01T23:41:42+00:00 · Latest: 2025-10-08T07:30:38+00:00

Comments: Preprint

Abs · PDF · Code1 · Code2 · Code3

Abstract

Continual learning (CL) aims to accumulate knowledge from sequential tasks without catastrophic forgetting. Vision-language models such as CLIP, with strong generalization, are widely used for CL. Existing methods often adapt isolated PTM components, increasing inference complexity and limiting model improvement, or rely on replay, stored data, or assumptions, leading to high costs and limited applicability. To advance models as continual learners, we explore CL through natural and efficient PTM updates rather than complex task-specific additions. We study continual low-rank learning and analyze how LoRA ranks and placements affect learning and forgetting. A higher-rank LoRA improves task learning (plasticity) but increases forgetting, while a lower-rank LoRA enhances stability but limits adaptation. We observe a plasticity-stability balance tied to rank across parameters and tasks, with moderately small ranks maximizing CL benefits. Motivated by this, we propose Continual Dynamic Rank-Selective LoRA (CoDyRA), which continually updates PTMs with LoRA adapters of adaptively optimized ranks. The new-task objective drives learning, while sparsity-promoting regularization minimizes ranks to reduce interference and forgetting, achieving a balance tailored to each parameter and task. Although all parameters are updated, the minimized ranks keep the model close to its prior state while enabling effective new-task learning. CoDyRA performs efficient CL as a sequence of LoRA-based updates without storing past data or relying on assumptions, preserving the original model architecture and adding no inference overhead. Experiments show CoDyRA improves new representations while retaining old knowledge, achieving state-of-the-art results. Code is available at https://github.com/jeff024/codyra.

中文标题/摘要

标题：自适应秩，减少遗忘：动态秩选择LoRA在持续学习视觉-语言模型中的知识保留

持续学习（CL）旨在从顺序任务中积累知识而不发生灾难性遗忘。视觉-语言模型如CLIP，因其强大的泛化能力而广泛用于持续学习。现有方法通常单独适应PTM组件，增加推理复杂度并限制模型改进，或者依赖于重放、存储数据或假设，导致高成本和有限的应用范围。为了使模型成为持续学习者，我们探索通过自然和高效的PTM更新来实现持续学习，而不是复杂的任务特定添加。我们研究持续低秩学习，并分析LoRA的秩和位置如何影响学习和遗忘。高秩LoRA提高任务学习能力（可塑性）但增加遗忘，而低秩LoRA增强稳定性但限制适应。我们观察到可塑性和稳定性的平衡与参数和任务的秩相关，适度的小秩最大化持续学习的好处。受此启发，我们提出了持续动态秩选择LoRA（CoDyRA），该方法持续更新PTM的LoRA适配器，其秩由优化调整。新任务目标驱动学习，而稀疏性促进正则化最小化秩以减少干扰和遗忘，实现针对每个参数和任务的平衡。尽管所有参数都被更新，但最小化的秩使模型保持接近其初始状态，同时允许有效的新任务学习。CoDyRA作为一系列基于LoRA的更新高效地实现持续学习，无需存储过去的数据或依赖假设，保留原始模型架构并增加零推理开销。实验表明，CoDyRA在保留旧知识的同时提高新表示，达到最先进的结果。

Summary / 总结

The paper addresses the challenge of continual learning in vision-language models like CLIP, focusing on reducing catastrophic forgetting. It proposes Continual Dynamic Rank-Selective LoRA (CoDyRA), which adaptively optimizes ranks of LoRA adapters to balance plasticity and stability. Experiments demonstrate that CoDyRA effectively retains old knowledge while learning new tasks, achieving state-of-the-art results without storing past data or making assumptions. The method updates parameters efficiently, preserving the original model architecture and adding no inference overhead.

论文针对视觉-语言模型如CLIP在持续学习中的灾难性遗忘问题，提出了一种持续动态秩选择性LoRA（CoDyRA）方法，通过自适应优化LoRA适配器的秩来平衡可塑性和稳定性。实验表明，CoDyRA能够有效保留旧知识的同时学习新任务，达到最先进的效果，且无需存储过去的数据或做假设，保持了原始模型架构和推理速度。

ProCut: LLM Prompt Compression via Attribution Estimation

Authors: Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang

First: 2025-08-04T04:44:43+00:00 · Latest: 2025-10-08T04:59:55+00:00

Abs · PDF · Code1 · Code2

Abstract

In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.

中文标题/摘要

标题：ProCut: 通过归因估计进行LLM提示压缩

在大规模工业LLM系统中，提示模板经常扩展到数千个标记，随着团队逐步加入任务说明、少量示例和启发式规则以增强鲁棒性和覆盖率。这种扩展导致了难以维护且会显著增加推理延迟和提供成本的庞大提示。为了解决这一问题，我们引入了通过归因估计进行提示压缩（ProCut）的灵活、LLM无关、无需训练的框架，该框架通过归因分析压缩提示。ProCut将提示模板分割为语义上有意义的单元，量化其对任务性能的影响，并修剪低效组件。通过在五个公开基准数据集和实际工业提示上的广泛实验，我们展示了ProCut在生产中实现了显著的提示大小减少（减少了78%的标记数量）的同时，保持甚至略微提高了任务性能（比其他方法好62%）。我们还引入了由LLM驱动的归因估计器，将压缩延迟降低了超过50%，并证明ProCut可以无缝集成到现有的提示优化框架中，以生成简洁且高性能的提示。

HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

Authors: Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu

Venue: NeurIPS 2025

First: 2025-05-26T18:37:40+00:00 · Latest: 2025-10-08T04:28:29+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Our code is available at https://github.com/hrlics/HoPE.

中文标题/摘要

标题：HoPE：长上下文视觉-语言模型的混合位置嵌入

视觉-语言模型（VLMs）在多模态任务中取得了显著进展。然而，在长上下文场景中，尤其是长视频中，其性能往往会下降。虽然旋转位置嵌入（RoPE）在大型语言模型（LLMs）中广泛用于长度泛化，但将其扩展到捕捉视频中的复杂空间-时间依赖关系仍然是一个未解决的挑战。现有方法通常在RoPE中分配不同的频率来编码3D位置信息。然而，这些分配策略主要依赖于启发式方法，缺乏深入的理论分析。在本文中，我们首先研究不同分配策略如何影响VLMs的长上下文能力。我们的分析表明，当前的多模态RoPE无法可靠地捕捉扩展上下文中的语义相似性。为了解决这一问题，我们提出了HoPE，一种旨在提高VLMs长上下文能力的混合位置嵌入。HoPE引入了一种混合频率分配策略，以在任意长的上下文中可靠地进行语义建模，并引入了一种动态时间缩放机制，以促进在不同上下文长度下的稳健学习和灵活推理。在四个视频基准上的广泛实验表明，HoPE在长视频理解和检索任务中始终优于现有方法，证实了其有效性。我们的代码可在https://github.com/hrlics/HoPE获取。

Summary / 总结

The paper addresses the challenge of long-context performance in Vision-Language Models (VLMs), particularly in long videos. It proposes HoPE, a Hybrid of Position Embedding, which introduces a hybrid frequency allocation strategy and a dynamic temporal scaling mechanism to improve semantic modeling over extended contexts. Experiments on four video benchmarks show that HoPE outperforms existing methods in long video understanding and retrieval tasks, confirming its effectiveness.

研究旨在通过解决现有旋转位置嵌入（RoPE）方法的局限性，提升视觉-语言模型（VLMs）在长上下文场景，尤其是长视频中的性能。研究提出了HoPE，一种混合位置嵌入方法，引入了混合频率分配策略和动态时间缩放机制，以改善在任意长上下文中的语义建模能力。实验在四个视频基准上显示，HoPE在长视频理解和检索任务中优于现有方法，证实了其在长上下文中捕捉语义相似性的有效性。

VUGEN: Visual Understanding priors for GENeration

Authors: Xiangyi Chen, Théophane Vallaeys, Maha Elbayad, John Nguyen, Jakob Verbeek

First: 2025-10-08T00:04:47+00:00 · Latest: 2025-10-08T00:04:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find that a VAE-free pixel diffusion decoder to be on par or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM's original understanding capabilities.

中文标题/摘要

标题：VUGEN: 视觉理解先验的视觉生成

近期视觉-语言模型（VLMs）的发展使跨文本和图像的统一理解成为可能，但为这些模型配备稳健的图像生成能力仍然具有挑战性。现有方法通常依赖于重建导向的自编码器或复杂的桥梁机制，导致理解和生成表示之间的不一致，或架构复杂性。在本文中，我们提出了一种名为VUGEN的新框架，该框架明确利用VLM预训练的视觉理解先验，以实现高效且高质量的图像生成。我们的方法首先将VLM原生视觉编码器的高维潜在空间转换为一个低维、可处理的分布，该分布最大限度地保留了视觉信息。然后，VLM被训练在该减少的潜在空间中采样，以确保与其实现的视觉理解能力保持一致。最后，专用的像素解码器将这些生成的潜在变量映射回图像空间。我们发现，无需VAE的像素扩散解码器与内部依赖于VAE潜在变量的常用复杂潜在扩散解码器相比，性能相当或更好。广泛的实验表明，VUGEN在COCO数据集上的图像生成性能优于DPG Bench，从71.17提高到74.32，FID从11.86降低到9.06，同时完全保留了VLM的原始理解能力。

Summary / 总结

VUGEN proposes a new framework that uses pretrained visual understanding priors from Vision-Language Models to generate high-quality images. It transforms the high-dimensional latent space into a lower-dimensional distribution that preserves visual information, and trains the VLM to sample within this space. The pixel diffusion decoder then maps these generated latents back to the image space. VUGEN outperforms common latent diffusion decoders and achieves better DPG Bench and FID scores on COCO, while maintaining the VLM's original understanding capabilities.

VUGEN 提出了一种新框架，利用 Vision-Language 模型的预训练视觉理解先验来生成高质量图像。它将高维潜空间转换为低维分布，保留视觉信息，并训练 VLM 在此空间中采样。像素扩散解码器随后将这些生成的潜变量映射回图像空间。VUGEN 在 COCO 上的 DPG Bench 和 FID 分数优于常见潜变量扩散解码器，同时保持 VLM 的原始理解能力。

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Authors: Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, Ion Stoica

First: 2025-05-24T21:30:29+00:00 · Latest: 2025-10-07T17:25:03+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at \href{https://github.com/svg-project/Sparse-VideoGen}{https://github.com/svg-project/Sparse-VideoGen}.

中文标题/摘要

标题：Sparse VideoGen2：通过语义感知排列加速视频生成

扩散变换器（DiTs）对于视频生成至关重要，但由于注意力机制的二次复杂性，存在显著的延迟问题。通过仅计算关键令牌，稀疏注意力可以降低计算成本并提供加速的有希望的方法。然而，我们发现现有方法在相同的计算预算下未能接近最优生成质量，原因有两个：（1）关键令牌识别不准确：当前方法基于位置而不是语义对令牌进行聚类，导致聚合表示不精确。（2）计算浪费过多：关键令牌分散在非关键令牌中，导致在优化处理连续令牌的GPU上浪费计算资源。在本文中，我们提出SVG2，这是一种无需训练的框架，旨在最大化识别准确性并最小化计算浪费，实现生成质量和效率之间的帕累托前沿权衡。SVG2的核心是语义感知排列，该方法使用k-means根据语义相似性对令牌进行聚类和重新排序。这种方法确保了精确的聚类表示，提高了识别准确性，并且关键令牌的布局更加密集，从而可以在不填充的情况下实现高效的计算。此外，SVG2集成了top-p动态预算控制和定制内核实现，分别在HunyuanVideo和Wan 2.1上实现了高达2.30倍和1.89倍的加速，同时保持PSNR分别为30和26。我们的代码已开源于https://github.com/svg-project/Sparse-VideoGen。

Summary / 总结

The research aims to accelerate video generation using sparse attention while maintaining high quality. The method involves semantic-aware permutation to cluster and reorder tokens based on semantic similarity, reducing computation waste and improving accuracy. Experiments show that SVG2 achieves up to 2.30x and 1.89x speedup with PSNR up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively, without padding. The framework also includes top-p dynamic budget control and customized kernel implementations.

论文通过提出SVG2框架，利用语义感知排列来识别并仅处理关键令牌，从而解决Diffusion Transformers (DiTs)在视频生成中的延迟问题，实现了生成质量和效率之间的平衡，在HunyuanVideo和Wan 2.1上的加速分别达到2.30倍和1.89倍，同时保持PSNR值分别为30和26。

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits

Authors: Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

First: 2025-10-07T17:08:33+00:00 · Latest: 2025-10-07T17:08:33+00:00

Comments: 18 pages,8 figures,4 tables

Abs · PDF · Code1 · Code2

Abstract

Diffusion large language models (dLLMs) generate text through iterative denoising steps, achieving parallel decoding by denoising only high-confidence positions at each step. However, existing approaches often repetitively remask tokens due to initially low confidence scores, leading to redundant iterations and limiting overall acceleration. Through the analysis of dLLM decoding traces, we observe that the model often determines the final prediction for a token several steps before the decoding step. To leverage this historical information and avoid redundant steps, we introduce the concept of Trace Credit, which quantifies each token's convergence potential by accumulating historical logits. Furthermore, we propose CreditDecoding, a training-free parallel decoding algorithm that accelerates the confidence convergence of correct but underconfident tokens by fusing current logits with Trace Credit. This process significantly reduces redundant iterations and enhances decoding robustness. On eight benchmarks, CreditDecoding achieves a 5.48 times speedup and a 0.48 performance improvement over LLaDA-8B-Instruct, and a 4.11 times speedup with a 0.15 performance improvement over LLaDA-MoE-Instruct. Importantly, CreditDecoding scales effectively to long sequences and is orthogonal to mainstream inference optimizations, making it a readily integrable and versatile solution.

中文标题/摘要

标题：CreditDecoding：通过轨迹信用加速扩散大型语言模型的并行解码

扩散大型语言模型（dLLMs）通过迭代去噪步骤生成文本，通过仅在每一步去噪高置信度位置实现并行解码。然而，现有方法由于初始置信度分数较低，经常重复重新标记令牌，导致冗余迭代并限制整体加速。通过对dLLM解码轨迹的分析，我们发现模型通常在解码步骤前几轮就确定了最终的预测结果。为了利用这种历史信息并避免冗余步骤，我们引入了轨迹信用的概念，通过累积历史logits量化每个令牌的收敛潜力。此外，我们提出了CreditDecoding，这是一种无需训练的并行解码算法，通过融合当前logits与轨迹信用加速正确但置信度不足的令牌的置信度收敛。这一过程显著减少了冗余迭代并增强了解码的鲁棒性。在八个基准测试中，CreditDecoding相对于LLaDA-8B-Instruct实现了5.48倍的加速和0.48的性能提升，相对于LLaDA-MoE-Instruct实现了4.11倍的加速和0.15的性能提升。重要的是，CreditDecoding能够有效地扩展到长序列，并且与主流推理优化方案兼容，使其成为易于集成和多功能的解决方案。

Summary / 总结

CreditDecoding accelerates parallel decoding in diffusion large language models by using Trace Credits to avoid redundant iterations. It quantifies each token's convergence potential and fuses current logits with historical information to enhance decoding robustness. CreditDecoding achieves significant speedups and performance improvements on various benchmarks, and scales effectively to long sequences without conflicting with other optimizations.

CreditDecoding通过使用Trace Credits来避免冗余迭代，加速了扩散大语言模型的并行解码。它量化每个token的收敛潜力，并将当前logits与历史信息融合以增强解码的鲁棒性。CreditDecoding在各种基准上实现了显著的加速和性能提升，并且能够有效扩展到长序列，且不会与其他优化方法冲突。

Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA

Authors: Python Song, Luke Tenyi Chang, Yun-Yun Tsai, Penghui Li, Junfeng Yang

First: 2025-10-07T15:56:21+00:00 · Latest: 2025-10-07T15:56:21+00:00

Comments: 14pages, 11figures

Abs · PDF · Code1 · Code2

Abstract

CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the models inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.

中文标题/摘要

标题：视觉下的推理：视觉-语言模型在CAPTCHA中的视觉-空间认知理解

CAPTCHA最初设计用于区分人类和机器人，现已演变为评估视觉-语言模型空间推理能力的实际基准。在本研究中，我们首先表明，逐步推理对于视觉-语言模型（VLMs）解决CAPTCHA至关重要，因为这些CAPTCHA代表了高难度的空间推理任务，而当前的商业视觉-语言模型仍然难以应对这种推理。特别是，我们观察到大多数商业VLMs（如Gemini、Claude、GPT等）无法有效解决CAPTCHA，因此准确率较低（约21.9%）。然而，我们的研究结果表明，在生成最终坐标之前要求模型进行逐步推理可以显著提高其解决准确率，突显了差距的严重性。为了系统地研究这一问题，我们引入了CAPTCHA-X，这是第一个包含推理的现实世界CAPTCHA基准，涵盖了七类CAPTCHA（如五子棋、hCaptcha等），并附有逐步操作解决方案和注解。我们进一步定义了五个基于推理的评估指标，以全面评估模型的推理能力。为了验证推理的有效性，我们还提出了一种通用的基于代理的VLM框架，该框架整合了模型固有的推理能力。我们的方法在五种高难度CAPTCHA类型中均达到了最先进的性能，平均解决准确率为83.9%，显著超越现有基线。这些结果揭示了当前模型的局限性，并强调了在未来的视觉-空间挑战中推理的重要性。

Summary / 总结

This study investigates the spatial reasoning capabilities of vision-language models (VLMs) using CAPTCHA as a benchmark. It demonstrates that step-by-step reasoning is essential for VLMs to solve high-difficulty CAPTCHAs, with current commercial models achieving low accuracy. The researchers introduce CAPTCHA-X, a benchmark with reasoning annotations, and propose a framework that enhances solving accuracy to 83.9 percent across five CAPTCHA types, highlighting the need for improved reasoning in VLMs.

该研究使用CAPTCHA作为基准，考察了视觉语言模型（VLMs）的空间推理能力。研究表明，逐步推理对于解决高难度CAPTCHA至关重要，当前商用VLMs的准确率较低。作者引入了带有推理注解的CAPTCHA-X基准，并提出了一种框架，将解决准确率提升至83.9%，超越现有方法，突显了VLMs在提高视觉空间挑战中的推理能力的重要性。

CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Authors: Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

First: 2025-08-04T11:42:48+00:00 · Latest: 2025-10-07T15:56:15+00:00

Comments: 16 pages, 9 figures, 7 tables

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

中文标题/摘要

标题：CAMERA: MoE模型通过微专家冗余分析的多矩阵联合压缩

具有混合专家（MoE）架构的大语言模型（LLMs）在广泛的任务中表现出强大的性能随参数增加而增强，但同时也遭受着显著的计算和存储开销。值得注意的是，MoE模型的性能增益并不与专家参数的增长成正比。尽管先前的工作试图通过专家级别的剪枝、合并或分解来减少参数，但在性能和计算效率方面仍然面临挑战。在本文中，我们通过引入微专家作为跨越矩阵的更细粒度的压缩单元来解决这些挑战。我们首先从更基本的角度出发，将MoE层视为微专家的混合，并提出了CAMERA，一种轻量级且无需训练的框架，用于识别微专家冗余。我们的分析揭示了解码过程中微专家贡献的显著差异。基于这一见解，我们进一步提出了CAMERA-P，一种结构化的微专家剪枝框架，以及CAMERA-Q，一种针对微专家的混合精度量化方法。在九个下游任务上的广泛实验表明，CAMERA-P在从20%到60%的不同剪枝比下始终优于强大的基线。此外，CAMERA-Q在激进的2位量化下取得了更好的结果，超越了现有的矩阵级和通道级方法。值得注意的是，我们的方法在单个NVIDIA A100-40GB GPU上对Qwen2-57B-A14B进行完全的微专家分析只需不到5分钟。

Summary / 总结

This paper addresses the computational and storage overheads of Large Language Models with Mixture-of-Experts (MoE) architectures by introducing CAMERA, a framework that identifies micro-expert redundancy. CAMERA-P proposes a structured micro-expert pruning method, while CAMERA-Q introduces mixed-precision quantization for micro-experts. Experiments on nine downstream tasks demonstrate that CAMERA-P outperforms strong baselines under various pruning ratios, and CAMERA-Q achieves superior results under aggressive quantization, surpassing existing methods.

本文通过引入CAMERA框架，识别并减少MoE架构中的微专家冗余，以解决大型语言模型的计算和存储开销问题。提出了CAMERA-P，一种结构化的微专家剪枝框架，以及CAMERA-Q，一种针对微专家的混合精度量化方法。在九个下游任务上的实验表明，CAMERA-P在各种剪枝比例下均优于强基线，而CAMERA-Q在激进的2位量化下取得了优于现有方法的结果。

Medical Vision Language Models as Policies for Robotic Surgery

Authors: Akshay Muppidi, Martin Radfar

Venue: 2025 IEEE Conference on Artificial Intelligence (CAI), Santa Clara, CA, USA, 2025, pp. 513,518

First: 2025-10-07T15:54:34+00:00 · Latest: 2025-10-07T15:54:34+00:00

Comments: IEEE CAI 2025

Abs · PDF · Code1 · Code2

Abstract

Vision-based Proximal Policy Optimization (PPO) struggles with visual observation-based robotic laparoscopic surgical tasks due to the high-dimensional nature of visual input, the sparsity of rewards in surgical environments, and the difficulty of extracting task-relevant features from raw visual data. We introduce a simple approach integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. Our method is evaluated on five diverse laparoscopic surgery task environments in LapGym, using only endoscopic visual observations. MedFlamingo PPO outperforms and converges faster compared to both standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates exceeding 70% across all environments, with improvements ranging from 66.67% to 1114.29% compared to baseline. By processing task observations and instructions once per episode to generate high-level planning tokens, our method efficiently combines medical expertise with real-time visual feedback. Our results highlight the value of specialized medical knowledge in robotic surgical planning and decision-making.

中文标题/摘要

标题：医疗视觉语言模型作为政策指导的机器人外科手术

基于视觉的近端策略优化（PPO）在处理视觉输入的高维性质、外科环境中奖励的稀疏性以及从原始视觉数据中提取相关任务特征的困难时，难以应对视觉观察驱动的腹腔镜外科手术任务。我们提出了一种简单的方法，将MedFlamingo，一种医疗领域特定的视觉-语言模型，与PPO结合。该方法在LapGym的五个不同的腹腔镜手术任务环境中进行评估，仅使用内窥镜视觉观察。MedFlamingo PPO在所有环境中均优于标准的基于视觉的PPO和OpenFlamingo PPO基线，任务成功率超过70%，与基线相比，改进幅度从66.67%到1114.29%不等。通过每集处理一次任务观察和指令以生成高级规划标记，我们的方法有效地结合了医学专业知识和实时视觉反馈。我们的结果突显了在机器人外科手术规划和决策中专门医学知识的价值。

Summary / 总结

The paper addresses the challenges of using Vision-based Proximal Policy Optimization (PPO) for robotic laparoscopic surgery due to high-dimensional visual inputs and sparse rewards. It proposes integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. The method outperforms standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates over 70% in five diverse laparoscopic surgery environments, with improvements ranging from 66.67% to 1114.29%. By processing task observations and instructions once per episode, the method combines medical expertise with real-time visual feedback efficiently.

该研究通过将医疗领域特定的Vision-Language模型MedFlamingo与Proximal Policy Optimization (PPO)结合，解决了在机器人腹腔镜手术中使用基于视觉的PPO所面临的挑战。该方法在LapGym的五个不同腹腔镜手术任务环境中进行了评估，仅使用内窥镜视觉观察。MedFlamingo PPO在所有环境中均实现了超过70%的任务成功率，与基线相比，改进幅度从66.67%到1114.29%不等。

Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search

Authors: Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

First: 2025-01-31T16:09:30+00:00 · Latest: 2025-10-07T15:22:05+00:00

Comments: Accepted to NeurIPS2025. Website: https://sites.google.com/view/t2v-dlbs and Code: https://github.com/shim0114/T2V-Diffusion-Search

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

The remarkable progress in text-to-video diffusion models enables the generation of photorealistic videos, although the content of these generated videos often includes unnatural movement or deformation, reverse playback, and motionless scenes. Recently, an alignment problem has attracted huge attention, where we steer the output of diffusion models based on some measure of the content's goodness. Because there is a large room for improvement of perceptual quality along the frame direction, we should address which metrics we should optimize and how we can optimize them in the video generation. In this paper, we propose diffusion latent beam search with lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality with respect to alignment to prompts requires reward calibration by weighting existing metrics. This is because when humans or vision language models evaluate outputs, many previous metrics to quantify the naturalness of video do not always correlate with the evaluation. We demonstrate that our method improves the perceptual quality evaluated on the calibrated reward, VLMs, and human assessment, without model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling under much more efficient computational cost. The experiments highlight that our method is beneficial to many capable generative models, and provide a practical guideline: we should prioritize the inference-time compute allocation into enabling the lookahead estimator and increasing the search budget, rather than expanding the denoising steps.

中文标题/摘要

标题：在推理时使用扩散潜空间束搜索进行文本到视频对齐

文本到视频扩散模型的显著进步使得生成逼真视频成为可能，尽管这些生成视频的内容常常包含不自然的运动或变形、反向播放和静止场景。最近，对齐问题引起了极大的关注，我们根据内容的好坏度量来引导扩散模型的输出。由于在帧方向上还有很大的感知质量提升空间，我们需要确定应该优化哪些指标以及如何优化它们。在本文中，我们提出了扩散潜空间束搜索结合前瞻估计器，可以在推理时选择更好的扩散潜空间以最大化给定的对齐奖励。我们还指出，为了提高与提示对齐的视频感知质量，需要通过加权现有指标来校准奖励。这是因为当人类或视觉语言模型评估输出时，许多用于量化视频自然度的先前指标并不总是与评估相关。我们证明，我们的方法在经过校准的奖励、VLMs和人类评估下提高了感知质量，且无需更新模型参数，计算成本更低，生成效果优于贪婪搜索和N次采样。实验表明，我们的方法对许多生成模型都有益处，并提供了一条实用的指导方针：我们应该优先在使前瞻估计器生效和增加搜索预算上分配推理时的计算资源，而不是扩展去噪步骤。

Summary / 总结

This paper addresses the issue of unnatural movements in text-to-video generation by proposing diffusion latent beam search with a lookahead estimator. The method optimizes perceptual quality by calibrating existing metrics, leading to improved video quality in alignment with prompts. Experiments show that this approach enhances perceptual quality without updating model parameters, outperforming greedy search and best-of-N sampling with lower computational cost. The method is beneficial for various generative models and provides a practical guideline for compute allocation during inference.

本文提出了一种名为扩散潜空间束搜索带前瞻估计器的方法，以解决文本生成视频中的不自然动作问题。该方法通过在推理时选择更好的扩散潜空间来最大化对齐奖励，从而优化感知质量。实验表明，这种方法在根据校准奖励、视觉语言模型和人类评估方面提高了感知质量，且比贪婪搜索和最佳N次采样方法更高效。