arXiv Paper Digest

RewardDance: Reward Scaling in Visual Generation

Authors: Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, Yan Zeng, Weilin Huang

First: 2025-09-10T17:59:31+00:00 · Latest: 2025-09-10T17:59:31+00:00

Comments: Bytedance Seed Technical Report

Abstract

Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by reward hacking, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of "reward hacking": our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and their ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.

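Chinese title / abstract (translated)

Title: RewardDance: Reward Scaling in Visual Generation

Reward models (RMs) are essential for improving generation models through reinforcement learning (RL), but the RM scaling paradigm in visual generation has not been sufficiently explored. This is mainly due to basic limitations of existing methods: CLIP-based RMs are constrained by architecture and input modality, while the commonly used Bradley-Terry loss is fundamentally misaligned with the next-token prediction mechanism of vision-language models (VLMs), which hinders effective scaling. More critically, the RLHF optimization process is troubled by reward hacking, in which the model exploits flaws in the reward signal without improving true quality. To address these challenges, we propose RewardDance, a scalable reward modeling framework that overcomes these obstacles through a novel generative reward paradigm. By reformulating the reward score as the probability that the model predicts a "yes" token, indicating that the generated image is better than a reference image under specific criteria, RewardDance intrinsically aligns the reward objective with the VLM architecture. This alignment unlocks scaling along two dimensions: (1) model scaling: systematically scaling RMs up to 26 billion parameters; (2) context scaling: integrating task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments show that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Most importantly, we address the persistent "reward hacking" challenge: our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, demonstrating their resistance to hacking and their ability to produce diverse, high-quality outputs. This greatly alleviates the mode collapse problem that troubles smaller models.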

Summary

RewardDance is a scalable reward modeling framework that addresses the limitations of existing reward scaling paradigms in visual generation. By reformulating the reward score as the model's probability of predicting a 'yes' token, it aligns reward objectives with Vision-Language Models (VLMs) and enables scaling in both model size and context. Extensive experiments show that RewardDance outperforms state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation, and it effectively mitigates the reward hacking issue, producing diverse and high-quality outputs.

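RewardDance is a scalable reward modeling framework that addresses the limitations of existing methods in visual generation. It redefines the reward score as the probability that the model predicts a "yes" token, which aligns it with the vision-language model architecture and enables scaling in both model size and context. Extensive experiments show that RewardDance outperforms state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation, and that it effectively addresses reward hacking by maintaining high reward variance during fine-tuning, producing diverse, high-quality outputs.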
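The RewardDance abstract describes the reward as the model's probability of predicting a "yes" token. A minimal sketch of that idea follows, assuming the probability is a softmax over the next-token logits; the three-token vocabulary, the token index, and the logit values are made up for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generative_reward(next_token_logits, yes_token_id):
    """Reward = probability mass the model assigns to the "yes" token."""
    return softmax(next_token_logits)[yes_token_id]

# Hypothetical toy vocabulary: index 0 = "yes", 1 = "no", 2 = any other token.
YES_ID = 0
logits_preferred = [2.0, 0.0, 0.0]  # model leans towards "yes"
logits_rejected = [0.0, 2.0, 0.0]   # model leans towards "no"

reward_preferred = generative_reward(logits_preferred, YES_ID)
reward_rejected = generative_reward(logits_rejected, YES_ID)
print(round(reward_preferred, 3), round(reward_rejected, 3))
```

Because the reward is read off the same next-token distribution the VLM is trained to produce, no separate regression head is needed in this sketch, which is one way to read the abstract's claim about alignment with the VLM architecture.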

SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

Authors: Michael J. Munje, Chen Tang, Shuijing Liu, Zichao Hu, Yifeng Zhu, Jiaxun Cui, Garrett Warnell, Joydeep Biswas, Peter Stone

Venue: CoRL

First: 2025-09-10T16:47:00+00:00 · Latest: 2025-09-10T16:47:00+00:00

Comments: Conference on Robot Learning (CoRL) 2025. Project site: https://larg.github.io/socialnav-sub

Abstract

Robot navigation in dynamic, human-centered environments requires socially compliant decisions grounded in robust scene understanding. Recent Vision-Language Models (VLMs) exhibit promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, capabilities that align with the nuanced requirements of social robot navigation. However, it remains unclear whether VLMs can accurately understand complex social navigation scenes (e.g., inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. While some recent works have explored the use of VLMs in social robot navigation, no existing work systematically evaluates their ability to meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a Visual Question Answering (VQA) dataset and benchmark designed to evaluate VLMs for scene understanding in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines across VQA tasks requiring spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that while the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still underperforms a simpler rule-based approach and human consensus baselines, indicating critical gaps in the social scene understanding of current VLMs. Our benchmark sets the stage for further research on foundation models for social robot navigation, offering a framework to explore how VLMs can be tailored to meet real-world social robot navigation needs. An overview of this paper along with the code and data can be found at https://larg.github.io/socialnav-sub .

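Chinese title / abstract (translated)

Title: SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation

In dynamic, human-centered environments, robot navigation requires socially compliant decisions grounded in robust scene understanding. Recent vision-language models (VLMs) show promising capabilities such as object recognition, common-sense reasoning, and contextual understanding, which match the nuanced requirements of social robot navigation. However, it is still unclear whether VLMs can accurately understand complex social navigation scenes (for example, inferring the spatial-temporal relations among agents and human intentions), which is essential for safe and socially compliant robot navigation. Although some recent studies have explored using VLMs in social robot navigation, no existing work systematically evaluates whether they can meet these necessary conditions. In this paper, we introduce the Social Navigation Scene Understanding Benchmark (SocialNav-SUB), a visual question answering (VQA) dataset and benchmark designed to evaluate the scene understanding of VLMs in real-world social robot navigation scenarios. SocialNav-SUB provides a unified framework for evaluating VLMs against human and rule-based baselines on VQA tasks that require spatial, spatiotemporal, and social reasoning in social robot navigation. Through experiments with state-of-the-art VLMs, we find that although the best-performing VLM achieves an encouraging probability of agreeing with human answers, it still falls short of a simpler rule-based approach and human consensus baselines, which indicates critical gaps in the social scene understanding of current VLMs. Our benchmark lays the groundwork for research on foundation models for social robot navigation and offers a framework for exploring how VLMs can be tailored to real-world social robot navigation needs. An overview of this paper, along with the code and data, is available at https://larg.github.io/socialnav-sub .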

Summary

This paper introduces SocialNav-SUB, a benchmark for evaluating Vision-Language Models (VLMs) in understanding complex social navigation scenes. The motivation is to assess whether VLMs can accurately infer spatial-temporal relations and human intentions, crucial for safe and socially compliant robot navigation. Experiments with state-of-the-art VLMs show that while they perform reasonably well, they still underperform simpler rule-based approaches and human consensus baselines, indicating significant gaps in social scene understanding. The benchmark provides a unified framework for future research on foundation models for social robot navigation.

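This paper introduces SocialNav-SUB, a benchmark for evaluating how well vision-language models (VLMs) understand complex social navigation scenes. The motivation is to assess whether VLMs can accurately interpret spatial-temporal relations and human intentions, which is essential for safe and socially compliant robot navigation. The experiments show that the best-performing VLM still falls short of a simple rule-based approach and human consensus baselines, which indicates critical gaps in the social scene understanding of current VLMs.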

LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Authors: Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, Dimitris N. Metaxas

First: 2025-03-18T00:50:40+00:00 · Latest: 2025-09-10T15:29:43+00:00

Abstract

Large foundation models trained on large-scale vision-language data can boost Open-Vocabulary Object Detection (OVD) via synthetic training data, yet the hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this issue by directly fusing hidden states from Large Language Models (LLMs) into detectors, an avenue that is surprisingly under-explored. This paper presents a systematic method to enhance visual grounding by utilizing decoder layers of the LLM of an MLLM. We introduce a zero-initialized cross-attention adapter to enable efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED lifts GroundingDINO by 3.82% on OmniLabel at just 8.7% extra GFLOPs, and a larger vision backbone pushes the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales, and fusion depths further corroborate our design.

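Chinese title / abstract (translated)

Title: LED: LLM Enhanced Open-Vocabulary Object Detection without Human Curated Data Generation

Foundation models trained on large-scale vision-language data can improve open-vocabulary object detection (OVD) through synthetic training data, but hand-crafted pipelines often introduce bias and overfit to specific prompts. We sidestep this problem by fusing hidden states from large language models (LLMs) directly into detectors, an approach that is surprisingly under-explored. This paper presents a systematic method that enhances visual grounding by using the decoder layers of the LLM inside an MLLM. We introduce a zero-initialized cross-attention adapter that enables efficient knowledge fusion from LLMs to object detectors, a new approach called LED (LLM Enhanced Open-Vocabulary Object Detection). We find that intermediate LLM layers already encode rich spatial semantics; adapting only the early layers yields most of the gain. With Swin-T as the vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel at only 8.7% extra GFLOPs, and a larger vision backbone raises the improvement to 6.22%. Extensive ablations on adapter variants, LLM scales, and fusion depths further support our design.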

Summary

This paper addresses the challenge of Open-Vocabulary Object Detection (OVD) by leveraging Large Language Models (LLMs) to enhance visual grounding without relying on human-curated synthetic data. The method, called LED (LLM Enhanced Open-Vocabulary Object Detection), introduces a zero-initialized cross-attention adapter to fuse LLM hidden states into object detectors. Experiments show that using intermediate LLM layers for knowledge fusion significantly improves performance. With Swin-T as the vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel with only 8.7% extra GFLOPs, and this improvement increases to 6.22% with a larger vision backbone.

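This paper addresses the challenge of open-vocabulary object detection (OVD) by using large language models (LLMs) to enhance visual grounding without relying on human-curated synthetic data. The method, called LED, fuses LLM hidden states directly into the detector through a zero-initialized cross-attention adapter. Experiments show that intermediate LLM layers effectively encode spatial semantics and that adapting only the early layers yields a significant improvement. With Swin-T as the vision encoder, Qwen2-0.5B + LED improves GroundingDINO by 3.82% on OmniLabel at a small computational overhead, and a larger vision backbone improves performance further.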
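The LED abstract mentions a zero-initialized cross-attention adapter. The sketch below illustrates the general technique in plain Python, under assumptions that are not from the paper: a single attention head, no learned query/key/value projections, and zero initialization applied to the output projection `w_out`. All names and values are hypothetical. The point it shows is that, at initialization, the adapter leaves the detector features unchanged.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_adapter(det_feats, llm_states, w_out):
    """Detector features attend to LLM hidden states; the result is added residually.

    det_feats: n x d queries, llm_states: m x d keys and values,
    w_out: d x d output projection (all zeros at initialization in this sketch).
    """
    d = len(det_feats[0])
    keys_t = [list(col) for col in zip(*llm_states)]            # d x m
    scores = matmul(det_feats, keys_t)                          # n x m
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(attn, llm_states)                          # n x d
    update = matmul(context, w_out)                             # n x d
    return [[f + u for f, u in zip(f_row, u_row)]
            for f_row, u_row in zip(det_feats, update)]

# Toy inputs (hypothetical values).
det_feats = [[1.0, 0.0], [0.0, 1.0]]
llm_states = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

zero_w_out = [[0.0, 0.0], [0.0, 0.0]]
fused_at_init = cross_attention_adapter(det_feats, llm_states, zero_w_out)

identity_w_out = [[1.0, 0.0], [0.0, 1.0]]   # stands in for a trained projection
fused_trained = cross_attention_adapter(det_feats, llm_states, identity_w_out)
print(fused_at_init == det_feats, fused_trained != det_feats)
```

Starting from a zero update means the pretrained detector's behaviour is preserved at the beginning of training, and the LLM signal is introduced only as the projection moves away from zero.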

LLaDA-VLA: Vision Language Diffusion Action Models

Authors: Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, Xiaoyan Sun

First: 2025-09-08T17:45:40+00:00 · Latest: 2025-09-10T14:34:25+00:00

Abstract

The rapid progress of autoregressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically, considering the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulation and real-world robots.

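Chinese title / abstract (translated)

Title: LLaDA-VLA: Vision Language Diffusion Action Models

The rapid development of autoregressive vision-language models (VLMs) has sparked research interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to show competitive performance in text generation and multimodal applications, which has driven the development of a series of diffusion-based VLMs (d-VLMs). However, using such models for robot policy learning remains largely unexplored. This paper presents LLaDA-VLA, the first Vision-Language-Diffusion-Action model built on pretrained d-VLMs for robotic manipulation. To adapt d-VLMs effectively to the robotic domain, we propose two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with classification over special action tokens, which reduces the difficulty of adaptation; (2) a hierarchical action-structured decoding strategy that decodes action sequences level by level while accounting for dependencies within and across actions. Extensive experiments show that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulated and real-world robots.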

Summary

This work introduces LLaDA-VLA, a novel Vision-Language-Diffusion-Action model for robotic manipulation, which builds upon pretrained diffusion-based vision-language models. The model includes a localized special-token classification strategy and a hierarchical action-structured decoding strategy to adapt to the robotic domain. Experimental results show that LLaDA-VLA surpasses existing vision-language-action models in both simulation and real-world robotic tasks.

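This paper introduces LLaDA-VLA, the first Vision-Language-Diffusion-Action model built on pretrained diffusion-based vision-language models (d-VLMs) for robotic manipulation. To adapt d-VLMs to the robotic domain, the authors propose two key designs: a localized special-token classification strategy and a hierarchical action-structured decoding strategy. Experimental results show that LLaDA-VLA outperforms existing vision-language-action models on both simulated and real-world robots.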

Have Large Vision-Language Models Mastered Art History?

Authors: Ombretta Strafforello, Derya Soydaner, Michiel Willems, Anne-Sofie Maerten, Stefanie De Winter

First: 2024-09-05T13:33:57+00:00 · Latest: 2025-09-10T14:31:31+00:00

Abstract

The emergence of large Vision-Language Models (VLMs) has established new baselines in image classification across multiple domains. We examine whether their multimodal reasoning can also address a challenge mastered by human experts. Specifically, we test whether VLMs can classify the style, author and creation date of paintings, a domain traditionally mastered by art historians. Artworks pose a unique challenge compared to natural images due to their inherently complex and diverse structures, characterized by variable compositions and styles. This requires a contextual and stylistic interpretation rather than straightforward object recognition. Art historians have long studied the unique aspects of artworks, with style prediction being a crucial component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can effectively reason about the historical and stylistic attributes of paintings. We present the first study of its kind, conducting an in-depth analysis of three VLMs, namely CLIP, LLaVA, and GPT-4o, evaluating their zero-shot classification of art style, author and time period. Using two image benchmarks of artworks, we assess the models' ability to interpret style, evaluate their sensitivity to prompts, and examine failure cases. Additionally, we focus on how these models compare to human art historical expertise by analyzing misclassifications, providing insights into their reasoning and classification patterns.

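Chinese title / abstract (translated)

Title: Have Large Vision-Language Models Mastered Art History?

Large vision-language models (VLMs) have established new baselines in image classification across multiple domains. We examine whether their multimodal reasoning can also address a challenge mastered by human experts. Specifically, we test whether VLMs can classify the style, author, and creation date of paintings, a domain traditionally mastered by art historians. Compared with natural images, artworks pose a unique challenge because of their complex and diverse structures, characterized by variable compositions and styles. This calls for contextual and stylistic interpretation rather than simple object recognition. Art historians have long studied the unique aspects of artworks, and style prediction is a key component of their discipline. This paper investigates whether large VLMs, which integrate visual and textual data, can reason effectively about the historical and stylistic attributes of paintings. We present the first study of its kind, with an in-depth analysis of three VLMs, CLIP, LLaVA, and GPT-4o, evaluating their zero-shot classification of art style, author, and time period. Using two image benchmarks of artworks, we assess the models' ability to interpret style, evaluate their sensitivity to prompts, and examine failure cases. In addition, by analyzing misclassifications we examine how these models compare with human art-historical expertise, which offers insight into their reasoning and classification patterns.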

Summary

This study investigates whether large Vision-Language Models (VLMs) can classify the style, author, and creation date of paintings, a task traditionally mastered by art historians. The research examines CLIP, LLaVA, and GPT-4o, evaluating their zero-shot classification abilities using two art benchmarks. Key findings show that while VLMs can interpret styles and authors to some extent, they struggle with precise time periods and are sensitive to prompt variations, indicating a need for further improvement in contextual and stylistic reasoning.

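This study investigates whether large vision-language models (VLMs) can classify the style, author, and creation date of paintings, a task traditionally handled by art historians. It examines CLIP, LLaVA, and GPT-4o and finds that although these models can perform zero-shot classification, they often misclassify complex and diverse artworks, particularly in judgments of style and period. The study highlights the gap between these models and human expertise in contextual and stylistic interpretation.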

To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model Merging

Authors: Zitao Fang, Guodong DU, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, Sim Kuan Goh

Venue: EMNLP 2025

First: 2025-03-07T11:00:24+00:00 · Latest: 2025-09-10T13:56:44+00:00

Comments: Accepted to EMNLP 2025 Main Conference. This is the camera-ready version. Code: https://ZzzitaoFang.github.io/projects/NeuroMerging/

Abstract

Fine-tuning pre-trained models on targeted datasets enhances task-specific performance but often comes at the expense of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge, leading to performance degradation and suboptimal merged models. Existing approaches largely overlooked the fundamental roles of neurons, their connectivity, and activation, resulting in a merging process and a merged model that does not consider how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. Specifically, we decomposed task-specific representations into two complementary neuronal subspaces that regulate input sensitivity and task adaptability. Leveraging this decomposition, we introduced NeuroMerging, a novel merging framework developed to mitigate task interference within neuronal subspaces, enabling training-free model fusion across diverse tasks. Through extensive experiments, we demonstrated that NeuroMerging achieved superior performance compared to existing methods on multi-task benchmarks across both natural language and vision domains. Our findings highlighted the importance of aligning neuronal mechanisms in model merging, offering new insights into mitigating task interference and improving knowledge fusion. Our project is available at https://ZzzitaoFang.github.io/projects/NeuroMerging/.

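Chinese title / abstract (translated)

Title: To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model Merging

Fine-tuning pre-trained models on targeted datasets improves task-specific performance but often comes at the cost of generalization. Model merging techniques, which integrate multiple fine-tuned models into a single multi-task model through task arithmetic, offer a promising solution. However, task interference remains a fundamental challenge that leads to performance degradation and suboptimal merged models. Existing approaches have largely overlooked the fundamental roles of neurons, their connectivity, and their activation, so that neither the merging process nor the merged model accounts for how neurons relay and process information. In this work, we present the first study that relies on neuronal mechanisms for model merging. Specifically, we decompose task-specific representations into two complementary neuronal subspaces that regulate input sensitivity and task adaptability, respectively. Building on this decomposition, we introduce NeuroMerging, a novel merging framework designed to mitigate task interference within neuronal subspaces, which enables training-free model fusion across diverse tasks. Through extensive experiments, we show that NeuroMerging outperforms existing methods on multi-task benchmarks in both natural language and vision domains. Our findings highlight the importance of aligning neuronal mechanisms in model merging and offer new insights into mitigating task interference and improving knowledge fusion. Our project is available at https://ZzzitaoFang.github.io/projects/NeuroMerging/.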

Summary

This study addresses the challenge of task interference in model merging by proposing NeuroMerging, a framework that decomposes task-specific representations into input sensitivity and task adaptability subspaces. Through extensive experiments, NeuroMerging outperformed existing methods on multi-task benchmarks in natural language and vision domains, demonstrating the importance of aligning neuronal mechanisms in model merging to mitigate task interference and improve knowledge fusion.

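This paper proposes the NeuroMerging framework, which decomposes task-specific representations into complementary neuronal subspaces to address task interference in model merging. In extensive experiments, NeuroMerging outperforms existing methods on multi-task benchmarks in natural language and vision domains, which shows the importance of aligning neuronal mechanisms in model merging for mitigating task interference and improving knowledge fusion.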
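The NeuroMerging abstract refers to merging fine-tuned models "through task arithmetic". The sketch below shows plain task arithmetic only, which is the baseline idea and not the NeuroMerging method itself: each fine-tuned model contributes a task vector (its weights minus the base weights), and the scaled task vectors are added back to the base. The parameter name, the `scale` value, and the weights are hypothetical.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the base model."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])] for name in base}

def merge_by_task_arithmetic(base, finetuned_models, scale=1.0):
    """Add the scaled task vector of every fine-tuned model to the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for model in finetuned_models:
        delta = task_vector(model, base)
        for name in merged:
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta[name])]
    return merged

# Toy "models" with a single parameter tensor stored as a flat list.
base = {"layer.weight": [1.0, 1.0, 1.0]}
model_a = {"layer.weight": [2.0, 1.0, 1.0]}   # task A changed the first weight
model_b = {"layer.weight": [1.0, 1.0, 3.0]}   # task B changed the last weight

merged = merge_by_task_arithmetic(base, [model_a, model_b], scale=0.5)
print(merged)
```

In this toy case the two task vectors touch different weights, so they do not interfere. When two tasks push the same weight in opposite directions, their contributions partly cancel, which is the kind of task interference the abstract says NeuroMerging is designed to mitigate.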

TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making

Authors: Kechen Jiao, Zhirui Fang, Jiahao Liu, Bei Li, Qifan Wang, Xinyu Liu, Junhao Ruan, Zhongjian Qiao, Yifan Zhu, Yaxin Xu, Jingang Wang, Xiu Li

First: 2025-09-10T11:16:21+00:00 · Latest: 2025-09-10T11:16:21+00:00

Abstract

Using effective generalization capabilities of vision language models (VLMs) in context-specific dynamic tasks for embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models can better align with the real physical world, they still exhibit sluggish responses and hallucination issues in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, reliant on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach, transforming sparse reward signals into richer step sample pairs. It emphasizes the alignment of the model's intermediate reasoning process, mitigating the problem of model degradation. Moreover, by incorporating Action Policy Consistency Constraint (APC), it further imposes consistency constraints on the model output. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM and validating the effectiveness of our approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.

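Chinese title / abstract (translated)

Title: TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making

Exploiting the generalization capabilities of vision language models (VLMs) in context-specific dynamic tasks for embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models align better with the real physical world, they still show sluggish responses and hallucination in dynamically changing environments and therefore need further alignment. Existing post-SFT methods rely on reinforcement learning and chain-of-thought (CoT) approaches and are limited by sparse rewards and action-only optimization, which leads to low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach that turns sparse reward signals into richer step-level sample pairs. It emphasizes aligning the model's intermediate reasoning process, which mitigates model degradation. In addition, by introducing an Action Policy Consistency Constraint (APC), it imposes further consistency constraints on the model output. Experiments in the ALFWorld environment show an average success rate of 26.67%, a 6% improvement over RL4VLM, which validates the effectiveness of our approach in mitigating model degradation after fine-tuning. These results show the potential of combining preference-based learning with CoT processes to strengthen the decision-making of vision-language models in embodied agents.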

Summary

This paper addresses the challenge of using vision language models (VLMs) for embodied decision-making in dynamic environments. It proposes Thought-Centric Preference Optimization (TCPO), which transforms sparse reward signals into richer step sample pairs and incorporates an Action Policy Consistency Constraint (APC) to enhance model consistency. Experiments show a 6% improvement in success rate over RL4VLM, validating the approach's effectiveness in mitigating model degradation.

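This paper addresses the challenge of using vision language models (VLMs) for embodied decision-making in dynamic environments. It proposes Thought-Centric Preference Optimization (TCPO), which converts sparse reward signals into richer step-level sample pairs and introduces an Action Policy Consistency Constraint (APC) to improve model consistency. Experiments in the ALFWorld environment show a 6% higher success rate than RL4VLM, which validates the method's effectiveness in mitigating model degradation after fine-tuning.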

A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models

Authors: Edwine Nabahirwa, Wei Song, Minghua Zhang, Yi Fang, Zhou Ni

First: 2025-09-10T11:01:29+00:00 · Latest: 2025-09-10T11:01:29+00:00

Comments: 72 Pages, 11 Figures

Abstract

Underwater object detection (UOD) is vital to diverse marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces numerous challenges that compromise its performance. Over the years, various methods have been proposed to address these issues, but they often fail to fully capture the complexities of underwater environments. This review systematically categorizes UOD challenges into five key areas: Image quality degradation, target-related issues, data-related challenges, computational and processing constraints, and limitations in detection methodologies. To address these challenges, we analyze the progression from traditional image processing and object detection techniques to modern approaches. Additionally, we explore the potential of large vision-language models (LVLMs) in UOD, leveraging their multi-modal capabilities demonstrated in other domains. We also present case studies, including synthetic dataset generation using DALL-E 3 and fine-tuning Florence-2 LVLM for UOD. This review identifies three key insights: (i) Current UOD methods are insufficient to fully address challenges like image degradation and small object detection in dynamic underwater environments. (ii) Synthetic data generation using LVLMs shows potential for augmenting datasets but requires further refinement to ensure realism and applicability. (iii) LVLMs hold significant promise for UOD, but their real-time application remains under-explored, requiring further research on optimization techniques.

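Chinese title / abstract (translated)

Title: A Structured Review of Underwater Object Detection Challenges and Solutions: From Traditional to Large Vision Language Models

Underwater object detection (UOD) is vital to a wide range of marine applications, including oceanographic research, underwater robotics, and marine conservation. However, UOD faces many challenges that affect its performance. Over the years, various methods have been proposed to address these problems, but they often fail to capture the full complexity of underwater environments. This review systematically groups UOD challenges into five key areas: image quality degradation, target-related issues, data-related challenges, computational and processing constraints, and limitations of detection methodologies. To address these challenges, it analyzes the progression from traditional image processing and object detection techniques to modern approaches. It also explores the potential of large vision-language models (LVLMs) in UOD, drawing on the multimodal capabilities they have shown in other domains. The review further presents case studies, including synthetic dataset generation with DALL-E 3 and fine-tuning the Florence-2 LVLM for UOD. It identifies three key insights: (i) current UOD methods are not sufficient to fully address challenges such as image degradation and small object detection in dynamic underwater environments; (ii) synthetic data generation with LVLMs shows potential for augmenting datasets but needs further refinement to ensure realism and applicability; (iii) LVLMs hold great promise for UOD, but their real-time application remains under-explored and calls for further research on optimization techniques.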

Summary

This paper reviews the challenges and solutions in underwater object detection (UOD), categorizing them into five areas: image quality degradation, target-related issues, data-related challenges, computational constraints, and detection methodology limitations. It analyzes the progression from traditional techniques to modern approaches and explores the potential of large vision-language models (LVLMs) in UOD. Key findings include the insufficiency of current methods in addressing challenges like image degradation and small object detection, the potential of synthetic data generation using LVLMs, and the need for further research on real-time application of LVLMs.

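This paper reviews the challenges of underwater object detection (UOD) and their solutions, grouping the challenges into five areas: image quality degradation, target-related issues, data challenges, computational and processing constraints, and limitations of detection methodologies. It analyzes the progression from traditional techniques to modern approaches and explores the potential of large vision-language models (LVLMs) in UOD. Key findings include the shortcomings of current methods on challenges such as image degradation and small object detection, the potential of LVLM-based synthetic data generation together with its need for further refinement to ensure realism and applicability, and the need for further research on optimization techniques before LVLMs can be applied to UOD in real time.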

Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation

Authors: Kaleem Ahmad

First: 2025-09-10T11:00:12+00:00 · Latest: 2025-09-10T11:00:12+00:00

Comments: 14 pages. Preprint

Abstract

Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before and after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. In a small, single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases with an accuracy above 85% based on our criteria. On a high-end GPU, inpainting makes up 60 to 75% of total runtime under typical guidance and sampling settings, which highlights the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.

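Chinese title / abstract (translated)

Title: Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation

Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a unified workflow that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before-and-after composites), and offers the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. On a small segment of single-word prompts, detection and segmentation produced usable masks in over 90% of cases, with accuracy above 85%. On a high-end GPU, inpainting accounts for 60% to 75% of total runtime, which underlines the need for careful tuning. The study offers implementation-oriented advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.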

Summary

This paper presents a unified pipeline for prompt-driven image analysis that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description. The system processes a single natural-language instruction to perform multiple steps and retains intermediate artifacts for debugging. Key findings include detection and segmentation producing usable masks in over 90% of cases with an accuracy above 85%, and inpainting accounting for 60 to 75% of the total runtime. The study offers advice on thresholds, mask tightness, and diffusion parameters to support consistent and repeatable runs.

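This paper presents a unified pipeline for prompt-driven image analysis that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description. The system handles a single natural-language instruction to carry out multiple tasks and retains intermediate results for debugging. Key findings include detection and segmentation that yield usable masks in over 90% of cases with accuracy above 85%, and inpainting that accounts for 60% to 75% of total runtime. The study offers implementation advice on thresholds, mask tightness, and diffusion parameters, and stresses the importance of careful tuning and resource management.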

Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Authors: Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi

First: 2025-09-10T10:07:27+00:00 · Latest: 2025-09-10T10:07:27+00:00

Abstract

Recent advances in Large Language Models (LLMs) have demonstrated their remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore the applications of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture, similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and interpretability of the model predictions. We find that VLMs can outperform CNNs, while also providing greater flexibility in integrating auxiliary textual or semantic information and offering more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, due to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning in experimental neutrino physics.

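Chinese title / abstract (translated)

Title: Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics

Recent advances in large language models (LLMs) have demonstrated a remarkable capacity to process and reason over structured and unstructured data modalities beyond natural language. In this work, we explore applying vision language models (VLMs), specifically a fine-tuned variant of LLaMa 3.2, to the task of identifying neutrino interactions in pixelated detector data from high-energy physics (HEP) experiments. We benchmark this model against a state-of-the-art convolutional neural network (CNN) architecture similar to those used in the NOvA and DUNE experiments, which have achieved high efficiency and purity in classifying electron and muon neutrino events. Our evaluation considers both the classification performance and the interpretability of the model's predictions. We find that VLMs can outperform CNNs while also offering greater flexibility in integrating auxiliary textual or semantic information and providing more interpretable, reasoning-based predictions. This work highlights the potential of VLMs as a general-purpose backbone for physics event classification, owing to their high performance, interpretability, and generalizability, which opens new avenues for integrating multimodal reasoning into experimental neutrino physics.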

Summary

This study investigates the use of Vision Language Models (VLMs) for identifying neutrino interactions in high-energy physics experiments, comparing them to state-of-the-art convolutional neural networks (CNNs). The VLMs, fine-tuned from LLaMa 3.2, outperform CNNs in classification performance while offering greater flexibility and interpretability by integrating textual or semantic information. The results suggest VLMs could be a versatile backbone for physics event classification, enhancing multimodal reasoning in experimental neutrino physics.

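This study explores the use of vision language models (VLMs) to classify neutrino interactions in high-energy physics experiments and compares them with state-of-the-art convolutional neural networks (CNNs). The VLMs, fine-tuned from LLaMa 3.2, achieve better classification performance than the CNNs and offer more flexibility and interpretability when integrating textual information. The results suggest that VLMs could be a valuable tool for physics event classification because of their high performance, interpretability, and generalizability.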

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

Authors: Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

First: 2025-03-19T11:05:42+00:00 · Latest: 2025-09-10T09:57:03+00:00

Abstract

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.

中文标题/摘要

标题：VIPER：视觉感知与可解释推理在序列决策中的应用

虽然大型语言模型（LLMs）在文本推理方面表现出色，视觉语言模型（VLMs）在视觉感知方面非常有效，但将这些模型应用于基于视觉指令的规划仍然是一个开放问题。本文介绍了一种名为VIPER的新框架，该框架将基于VLM的感知与基于LLM的推理相结合，用于多模态指令驱动的规划。我们的方法使用一个模块化的流水线，其中冻结的VLM生成图像观察的文本描述，然后由LLM策略根据任务目标预测动作。我们通过行为克隆和强化学习微调推理模块，提高智能体的决策能力。在ALFWorld基准测试中，VIPER显著优于最先进的基于视觉指令的规划器，同时缩小了与纯文本oracle的差距。通过利用文本作为中间表示，VIPER还增强了可解释性，为感知和推理组件的精细分析铺平了道路。

Summary / 总结

VIPER is a framework for multimodal instruction-based planning that combines VLM-based perception with LLM-based reasoning. It uses a modular pipeline where a VLM generates textual descriptions of visual observations, which are then processed by an LLM to predict actions based on the task goal. VIPER is fine-tuned using behavioral cloning and reinforcement learning, showing significant performance improvements over state-of-the-art visual instruction-based planners on the ALFWorld benchmark. Additionally, VIPER enhances explainability by using text as an intermediate representation, allowing for a detailed analysis of perception and reasoning components.

VIPER 是一种将基于 VLM 的感知与基于 LLM 的推理相结合的多模态指令规划框架。它使用模块化的流水线，其中 VLM 生成图像观察的文本描述，然后由 LLM 处理以根据任务目标预测动作。通过行为克隆和强化学习对 VIPER 进行微调，它在 ALFWorld 基准测试中显著优于现有的视觉指令规划方法，同时提高了可解释性。

Retrieval-Augmented VLMs for Multimodal Melanoma Diagnosis

Authors: Jihyun Moon, Charmgil Hong

Venue: MICCAI

First: 2025-09-10T07:23:30+00:00 · Latest: 2025-09-10T07:23:30+00:00

Comments: Medical Image Computing and Computer-Assisted Intervention (MICCAI) ISIC Skin Image Analysis Workshop (MICCAI ISIC) 2025; 10 pages

Abs · PDF

Abstract

Accurate and early diagnosis of malignant melanoma is critical for improving patient outcomes. While convolutional neural networks (CNNs) have shown promise in dermoscopic image analysis, they often neglect clinical metadata and require extensive preprocessing. Vision-language models (VLMs) offer a multimodal alternative but struggle to capture clinical specificity when trained on general-domain data. To address this, we propose a retrieval-augmented VLM framework that incorporates semantically similar patient cases into the diagnostic prompt. Our method enables informed predictions without fine-tuning and significantly improves classification accuracy and error correction over conventional baselines. These results demonstrate that retrieval-augmented prompting provides a robust strategy for clinical decision support.

中文标题/摘要

标题：基于检索增强的VLMs在多模态黑色素瘤诊断中的应用

准确且早期的黑色素瘤诊断对于改善患者预后至关重要。虽然卷积神经网络（CNNs）在皮肤镜图像分析中显示出潜力，但它们往往忽视临床元数据并需要大量预处理。视觉-语言模型（VLMs）提供了一种多模态替代方案，但在使用通用领域数据训练时难以捕捉临床特异性。为解决这一问题，我们提出了一种检索增强的VLM框架，将具有语义相似性的患者病例纳入诊断提示中。我们的方法能够在无需微调的情况下进行知情预测，并且在分类准确性和错误纠正方面显著优于传统基线。这些结果表明，检索增强的提示提供了一种稳健的临床决策支持策略。

Summary / 总结

The research aims to improve the accuracy and early diagnosis of malignant melanoma by leveraging a retrieval-augmented vision-language model (VLM) that incorporates semantically similar patient cases into the diagnostic prompt. This method enhances classification accuracy and error correction compared to conventional baselines without requiring fine-tuning. The key findings show that this approach provides a robust strategy for clinical decision support in dermatological image analysis.

研究旨在通过结合临床元数据来提高恶性黑色素瘤的诊断准确性。提出的检索增强VLM框架使用具有类似语义的患者病例来增强诊断提示。关键发现表明，这种方法在不需要微调模型的情况下显著提高了分类准确性和错误纠正能力，优于传统方法。
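
The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names (`embedding`, `metadata`, `label`), the prompt wording, and the toy case bank are all hypothetical, and cosine similarity is assumed as the measure of semantic similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_prompt(query_case, case_bank, k=2):
    """Put the k most similar prior cases ahead of the new case in the prompt."""
    ranked = sorted(
        case_bank,
        key=lambda c: cosine(c["embedding"], query_case["embedding"]),
        reverse=True,
    )
    lines = ["Similar prior cases:"]
    for case in ranked[:k]:
        lines.append(f"- {case['metadata']} -> diagnosis: {case['label']}")
    lines.append(f"New patient: {query_case['metadata']}")
    lines.append("Question: is the lesion melanoma? Answer yes or no.")
    return "\n".join(lines)

# Toy, made-up records used only to exercise the function.
case_bank = [
    {"embedding": [0.9, 0.1], "metadata": "age 62, lesion on back", "label": "melanoma"},
    {"embedding": [0.1, 0.9], "metadata": "age 25, lesion on arm", "label": "nevus"},
    {"embedding": [0.8, 0.3], "metadata": "age 58, lesion on shoulder", "label": "melanoma"},
]
query = {"embedding": [1.0, 0.2], "metadata": "age 60, lesion on back"}
prompt = build_prompt(query, case_bank, k=2)
print(prompt)
```

Because the retrieved cases enter only through the prompt, the VLM itself stays frozen, which matches the abstract's claim that no fine-tuning is needed.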

Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors

Authors: Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang

Venue: EMNLP

First: 2025-05-21T10:08:39+00:00 · Latest: 2025-09-10T07:03:03+00:00

Comments: Accepted by EMNLP-2025

Abs · PDF

Abstract

The misuse of large language models (LLMs), such as academic plagiarism, has driven the development of detectors to identify LLM-generated texts. To bypass these detectors, paraphrase attacks have emerged to purposely rewrite these texts to evade detection. Despite the success, existing methods require substantial data and computational budgets to train a specialized paraphraser, and their attack efficacy greatly reduces when faced with advanced detection algorithms. To address this, we propose Contrastive Paraphrase Attack (CoPA), a training-free method that effectively deceives text detectors using off-the-shelf LLMs. The first step is to carefully craft instructions that encourage LLMs to produce more human-like texts. Nonetheless, we observe that the inherent statistical biases of LLMs can still result in some generated texts carrying certain machine-like attributes that can be captured by detectors. To overcome this, CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by the LLM. By subtracting the machine-like patterns from the human-like distribution during the decoding process, CoPA is able to produce sentences that are less discernible by text detectors. Our theoretical analysis suggests the superiority of the proposed attack. Extensive experiments validate the effectiveness of CoPA in fooling text detectors across various scenarios.

中文标题/摘要

标题：你的语言模型可以悄悄地像人类一样写作：针对LLM生成文本检测器的对比性改写攻击

大型语言模型（LLMs）的滥用，如学术剽窃，推动了检测LLM生成文本的检测器的发展。为了绕过这些检测器，已经出现了改写攻击，故意重新编写这些文本以逃避检测。尽管取得了成功，但现有方法需要大量数据和计算预算来训练专门的改写器，而且在面对高级检测算法时其攻击效果大大降低。为了解决这一问题，我们提出了一种名为CoPA（Contrastive Paraphrase Attack）的无需训练的方法，利用现成的LLM有效地欺骗文本检测器。第一步是精心设计指令，鼓励LLM生成更像人类的文本。然而，我们观察到LLM固有的统计偏差仍然会导致一些生成的文本带有某些机器属性，这些属性可以被检测器捕捉到。为了克服这一问题，CoPA构建了一个辅助的机器属性词分布，作为与LLM生成的人类属性分布的对比。通过在解码过程中从人类属性分布中减去机器属性模式，CoPA能够生成更难以被文本检测器识别的句子。我们的理论分析表明了所提攻击的优越性。广泛的实验验证了CoPA在各种场景下欺骗文本检测器的有效性。

Summary / 总结

The paper addresses the challenge of bypassing detectors designed to identify texts generated by large language models (LLMs) through paraphrase attacks. It introduces CoPA, a training-free method that uses off-the-shelf LLMs to produce human-like text that can evade detection. CoPA first crafts instructions to encourage LLMs to generate more human-like text and then constructs an auxiliary machine-like word distribution to subtract machine-like patterns, making the generated text less discernible by detectors. Experiments show that CoPA effectively deceives various text detectors.

论文提出了一种名为CoPA的无训练方法，通过精心设计的指令使大语言模型生成更接近人类的文本，并进一步通过减去机器特征来使生成的文本更难被文本检测器识别。实验表明，CoPA在不同场景下有效欺骗了各种文本检测器。
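
The idea of subtracting machine-like patterns from the human-like distribution at decoding time can be illustrated with a generic contrastive-decoding rule. The sketch below is an assumption, not CoPA's published formula: the weighting `(1 + lam) * human - lam * machine` on log-probabilities, the value of `lam`, and the three-word vocabulary are all illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_distribution(human_logits, machine_logits, lam=0.5):
    """Penalize tokens that the machine-like distribution favors.

    Assumed rule: (1 + lam) * log p_human - lam * log p_machine, renormalized.
    """
    human_logp = [math.log(p) for p in softmax(human_logits)]
    machine_logp = [math.log(p) for p in softmax(machine_logits)]
    adjusted = [(1 + lam) * h - lam * m for h, m in zip(human_logp, machine_logp)]
    return softmax(adjusted)

# Toy next-token logits over a hypothetical three-word vocabulary.
vocab = ["moreover", "honestly", "delve"]
human = [2.0, 1.5, 1.8]    # logits under the human-style instruction
machine = [2.0, 0.2, 3.0]  # logits under the machine-style instruction
baseline = softmax(human)
probs = contrastive_distribution(human, machine, lam=0.5)
for word, before, after in zip(vocab, baseline, probs):
    print(f"{word}: {before:.3f} -> {after:.3f}")
```

In this toy case the token that the machine-like distribution prefers most ("delve") loses probability mass, while a token it rarely produces ("honestly") gains it.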

A Survey on Training-free Alignment of Large Language Models

Authors: Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian

Venue: EMNLP 2025

First: 2025-08-12T15:30:44+00:00 · Latest: 2025-09-10T05:08:47+00:00

Comments: Accepted to EMNLP 2025 (findings), camera-ready version

Abs · PDF

Abstract

The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques--leveraging in-context learning, decoding-time adjustments, and post-generation corrections--offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.

Summary / 总结

The paper addresses the challenge of aligning large language models (LLMs) with human values and ethical standards without relying on resource-intensive fine-tuning. It reviews training-free (TF) alignment methods that use in-context learning, decoding-time adjustments, and post-generation corrections. The study categorizes these methods into pre-decoding, in-decoding, and post-decoding stages, providing a detailed analysis of their mechanisms and limitations for both LLMs and multimodal LLMs (MLLMs). Key challenges and future directions are identified to guide the development of more effective TF alignment techniques.

论文研究了无需训练（TF）的方法来对大型语言模型（LLMs）进行对齐，以确保其输出符合人类价值观和伦理标准，而无需进行大量微调。它将TF对齐技术分为预解码、在解码和后解码阶段，并从LLMs和多模态LLMs的角度详细探讨了它们的机制和局限性。研究指出了关键挑战和未来方向，提供了一个全面的综述，以指导从业者开发更安全和更可靠的LLMs。

Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Authors: Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

Venue: ICML 2025

First: 2025-04-03T09:55:09+00:00 · Latest: 2025-09-10T04:22:46+00:00

Comments: Accepted by ICML 2025

Abs · PDF · Code1

Abstract

Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLAMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLAMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. Code and model are available at https://github.com/steven-ccq/ViLAMP.

中文标题/摘要

标题：通过分层差异性蒸馏将视频-语言模型扩展到10000帧

长视频处理从根本上挑战了视觉-语言模型（VLMs），因为处理延长的时间序列需要极高的计算成本。现有的标记剪裁和特征合并方法往往牺牲了关键的时间依赖性或稀释了语义信息。我们引入了差异性蒸馏，这是一种系统保留任务相关信息同时抑制冗余的原理性方法。基于这一原理，我们开发了ViLAMP，这是一种分层视频-语言模型，通过两种关键机制以“混合精度”处理长达一小时的视频：(1) 差异关键帧选择，最大化查询相关性同时在帧级别保持时间上的独特性；(2) 差异特征合并，保留非关键帧中的查询相关特征在块级别。因此，ViLAMP 在关键帧中保留了完整信息，同时将非关键帧减少到其最显著的特征，类似于混合精度训练。广泛的实验表明，ViLAMP 在四个视频理解基准测试中表现出色，特别是在长视频内容上。值得注意的是，ViLAMP 可以在单个 NVIDIA A100 GPU 上处理超长视频（多达10000帧），在保持最先进的性能的同时实现了显著的计算效率。代码和模型可在 https://github.com/steven-ccq/ViLAMP 获取。

Summary / 总结

This paper addresses the challenge of processing long-form videos for vision-language models (VLMs) by introducing differential distillation. The method, ViLAMP, selects keyframes and merges features hierarchically to preserve critical temporal and semantic information while reducing computational costs. Experiments show that ViLAMP outperforms existing methods on four video understanding benchmarks, especially for long-form content, and can handle ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU efficiently.

研究通过引入ViLAMP，一种使用差异性蒸馏的分层视频-语言模型，解决了长视频处理的挑战。关键机制包括差异性关键帧选择和差异性特征合并。实验表明，ViLAMP在四个视频理解基准测试中优于现有方法，特别适用于长视频内容，并能在单个NVIDIA A100 GPU上处理多达10K帧的超长视频，具有高效率和最先进的性能。
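
Keyframe selection that balances query relevance against temporal distinctiveness can be sketched as a greedy procedure. This is a minimal illustration under stated assumptions, not ViLAMP's actual code: the function name, the cosine-similarity threshold, and the two-dimensional toy embeddings are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_keyframes(frame_embs, query_emb, max_keyframes, sim_threshold=0.9):
    """Greedily keep the most query-relevant frames that are not near-duplicates."""
    # Visit frames from most to least relevant to the query.
    order = sorted(
        range(len(frame_embs)),
        key=lambda i: cosine(frame_embs[i], query_emb),
        reverse=True,
    )
    selected = []
    for i in order:
        if len(selected) >= max_keyframes:
            break
        # Keep a frame only if it differs enough from every keyframe chosen so far.
        if all(cosine(frame_embs[i], frame_embs[j]) < sim_threshold for j in selected):
            selected.append(i)
    return sorted(selected)  # restore temporal order

# Toy embeddings: frames 0 and 1 are near-duplicates of each other.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.2]
keyframes = select_keyframes(frames, query, max_keyframes=2)
print(keyframes)
```

Frames that are not selected would then be compressed rather than dropped; in the abstract this is the role of differential feature merging at the patch level.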

Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models

Authors: Pranav Pawar, Kavish Shah, Akshat Bhalani, Komal Kasat, Dev Mittal, Hadi Gala, Deepali Patil, Nikita Raichada, Monali Deshmukh

First: 2025-09-10T04:15:01+00:00 · Latest: 2025-09-10T04:15:01+00:00

Abs · PDF

Abstract

As Vision-Language Models (VLMs) grow in sophistication, their ability to perform reasoning is coming under increasing supervision. While they excel at many tasks, their grasp of fundamental scientific principles, such as physics, remains an underexplored frontier. To reflect the advancements in these capabilities, we introduce a novel and accessible framework designed to rigorously evaluate VLMs on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. Through comprehensive evaluation of four state-of-the-art VLMs, we demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815. We find that while models excel at formulaic problems, they struggle significantly with domains requiring abstract spatial reasoning. By designing this framework, we aim to democratize the study of scientific reasoning in VLMs and foster deeper insights into their capabilities and limitations.

中文标题/摘要

标题：可解释的物理推理和性能分类在视觉-语言模型中的应用

随着视觉-语言模型（VLMs）变得越来越复杂，它们的推理能力正受到越来越多的关注。尽管它们在许多任务上表现出色，但它们对基本科学原理，如物理的理解仍然是一片未被充分探索的领域。为了反映这些能力的进步，我们引入了一种新颖且易于使用的框架，旨在严格评估VLMs在对二维物理的理解上的表现。该框架包含一个实用的场景生成器，可以生成超过400个问题的多样化测试库，涵盖四个核心领域：抛射运动、碰撞动力学、力学和流体力学。通过对四种最先进的VLMs进行全面评估，我们证明了模型规模与推理能力之间存在很强的相关性，我们的表现最佳模型Qwen2.5-VL-7B的整体得分为0.815。我们发现，虽然模型在公式化问题上表现出色，但在需要抽象空间推理的领域中却面临重大挑战。通过设计这一框架，我们旨在使视觉-语言模型中的科学推理研究民主化，并促进对其能力和局限性的更深入理解。

Summary / 总结

The research aims to evaluate Vision-Language Models (VLMs) in their understanding of 2D physics by introducing a new framework with a scenario generator for diverse physics problems. Four state-of-the-art VLMs were evaluated, showing a correlation between model size and reasoning ability, with Qwen2.5-VL-7B achieving a score of 0.815. The study highlights that models perform well on formulaic problems but struggle with abstract spatial reasoning tasks.

研究旨在通过引入一个新颖的框架来评估Vision-Language模型在2D物理推理方面的能力，该框架生成了涵盖四个领域的多样化问题。对四种最先进的VLMs的评估显示，模型规模与推理能力之间存在关联，Qwen2.5-VL-7B的得分为0.815。研究指出，模型在公式问题上表现良好，但在抽象空间推理任务上存在显著困难。

Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features

Authors: Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown

First: 2025-09-10T03:49:40+00:00 · Latest: 2025-09-10T03:49:40+00:00

Abs · PDF

Abstract

Recent research on Vision Language Models (VLMs) suggests that they rely on inherent biases learned during training to respond to questions about visual properties of an image. These biases are exacerbated when VLMs are asked highly specific questions that require focusing on specific areas of the image. For example, a VLM tasked with counting stars on a modified American flag (e.g., with more than 50 stars) will often disregard the visual evidence and fail to answer accurately. We build upon this research and develop a multi-dimensional examination framework to systematically determine which characteristics of the input data, including both the image and the accompanying prompt, lead to such differences in performance. Using open-source VLMs, we further examine how attention values fluctuate with varying input parameters (e.g., image size, number of objects in the image, background color, prompt specificity). This research aims to learn how the behavior of vision language models changes and to explore methods for characterizing such changes. Our results suggest, among other things, that even minor modifications in image characteristics and prompt specificity can lead to large changes in how a VLM formulates its answer and, subsequently, its overall performance.

Summary / 总结

This study examines the biases of Vision Language Models (VLMs) by analyzing their performance on specific visual tasks. The research uses a multi-dimensional approach to explore how different input characteristics, such as image features and prompt specificity, affect VLM performance. Key findings indicate that small changes in image characteristics and prompt specificity can significantly alter how VLMs formulate their answers and their overall performance.

研究通过多维度实验框架和开源VLMs，探讨了图像和提示特征对视觉语言模型（VLMs）性能的影响。结果显示，即使是图像和提示的小改动也会显著影响VLM的回答准确性和整体性能。研究强调了理解并减轻VLM中固有偏见的重要性，以提高其可靠性。

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Authors: Zhenyuan Chen, Chenxi Wang, Feng Zhang

First: 2025-09-02T03:01:23+00:00 · Latest: 2025-09-10T01:09:56+00:00

Comments: under review

Abs · PDF · Code1

Abstract

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

中文标题/摘要

标题：RSCC：一个用于灾害事件的大型遥感变化描述数据集

遥感对于灾害监测至关重要，但现有数据集缺乏时间图像对和详细的文本注释。虽然当前资源主要由单张快照图像主导，但无法捕捉到灾害随时间的变化影响。为解决这一问题，我们引入了遥感变化描述（RSCC）数据集，这是一个包含62,315个灾前/灾后图像对（涵盖地震、洪水、野火等）的大规模基准，这些图像对配有丰富的、类人类的变化描述。通过在遥感数据中架起时间与语义的桥梁，RSCC 使视觉-语言模型能够进行灾害意识的双时相理解的稳健训练和评估。我们的结果突显了RSCC促进详细灾害相关分析的能力，为遥感中更准确、可解释和可扩展的视觉-语言应用铺平了道路。代码和数据集可在https://github.com/Bili-Sakura/RSCC 获取。

Summary / 总结

The RSCC dataset addresses the lack of temporal image pairs and detailed textual annotations in existing disaster monitoring datasets. It consists of 62,315 pre-/post-disaster image pairs with rich change captions, covering various disaster types. This dataset enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding, facilitating detailed disaster-related analysis and improving remote sensing applications. The dataset is available at https://github.com/Bili-Sakura/RSCC.

RSCC数据集解决了现有灾害监测数据集中缺乏时间图像对和详细文本注释的问题。它包含62,315个灾前/灾后图像对，配有丰富的变化描述，涵盖了多种灾害类型。该数据集使视觉-语言模型能够进行灾害感知的双时相理解的稳健训练和评估，促进了详细的灾害相关分析，并改善了遥感应用。数据集可在https://github.com/Bili-Sakura/RSCC获取。

PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval

Authors: Jiancheng Pan, Muyuan Ma, Qing Ma, Cong Bai, Shengyong Chen

First: 2024-05-16T14:53:45+00:00 · Latest: 2025-09-09T20:30:28+00:00

Comments: 14 pages, 7 figures

Abs · PDF

Abstract

Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet remains challenging under both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose a visual prior-guided vision-language model, PriorCLIP, which leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE constructs a belief matrix with instruction embeddings to filter key features and mitigate semantic bias. At the same time, Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy, consisting of large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs using vision-instruction, which enables robust retrieval across long-tail concepts and vocabulary shifts. Furthermore, a cluster-based symmetric contrastive Attribution Loss is proposed to constrain inter-class relations and alleviate semantic confusion in the shared embedding space. Extensive experiments on RSICD and RSITMD benchmarks demonstrate that PriorCLIP achieves substantial improvements, outperforming existing methods by 4.9% and 4.0% in closed-domain retrieval, and by 7.3% and 9.4% in open-domain retrieval, respectively.

中文标题/摘要

标题：PriorCLIP：视觉先验引导的遥感图像-文本检索视觉语言模型

遥感图像-文本检索在遥感解释中起着重要作用，但在封闭领域和开放领域场景下由于语义噪声和领域偏移仍具有挑战性。为了解决这些问题，我们提出了一种视觉先验引导的视觉语言模型PriorCLIP，该模型利用视觉先验进行无偏表示学习和自适应视觉语言对齐。在封闭领域设置中，PriorCLIP引入了两种渐进注意编码器（PAE）结构：空间PAE构建信念矩阵以指令嵌入过滤关键特征并减轻语义偏见。同时，时间PAE利用时间步长上的循环激活来增强文本表示。对于开放领域设置，我们设计了一种两阶段先验表示学习策略，包括在粗粒度图像-文本对上进行大规模预训练，然后使用视觉指令对细粒度对进行微调，这使得在长尾概念和词汇偏移下实现稳健的检索成为可能。此外，我们提出了一种基于聚类的对称对比归因损失来约束类间关系并缓解共享嵌入空间中的语义混淆。在RSICD和RSITMD基准上的广泛实验表明，PriorCLIP实现了显著的改进，在封闭领域检索中分别优于现有方法4.9%和4.0%，在开放领域检索中分别优于现有方法7.3%和9.4%。

Summary / 总结

PriorCLIP is a visual prior-guided vision-language model designed to improve remote sensing image-text retrieval in both closed-domain and open-domain settings. It uses two Progressive Attention Encoder (PAE) structures, Spatial-PAE and Temporal-PAE, to filter key features and enhance text representation, respectively. Additionally, it employs a two-stage prior representation learning strategy and a cluster-based symmetric contrastive Attribution Loss to handle semantic noise and domain shifts. Experiments show that PriorCLIP outperforms existing methods by 4.9% and 4.0% in closed-domain retrieval and by 7.3% and 9.4% in open-domain retrieval on RSICD and RSITMD benchmarks.

PriorCLIP是一种视觉先验引导的视觉语言模型，旨在提高在封闭域和开放域设置下的遥感图像-文本检索性能。它使用了两种渐进注意编码器（PAE）结构，分别是空间PAE和时间PAE，用于过滤关键特征和增强文本表示。对于开放域检索，PriorCLIP采用两阶段先验表示学习策略和基于聚类的对称对比性归因损失。实验表明，PriorCLIP在RSICD和RSITMD基准上的封闭域检索中分别优于现有方法4.9%和4.0%，在开放域检索中分别优于现有方法7.3%和9.4%。

Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

Authors: Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie

Venue: ICLR 2025

First: 2025-02-10T03:43:55+00:00 · Latest: 2025-09-09T18:19:31+00:00

Comments: Accepted by ICLR 2025. Project page: https://zhangce01.github.io/DeGF/

Abs · PDF · Code1 · Project1

Abstract

While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.

Summary / 总结

This work addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) by proposing a self-correcting Decoding with Generative Feedback (DeGF) method. Inspired by the inverse relationship between text-to-image generation and image-conditioned response generation, DeGF uses text-to-image generative models to provide self-feedback at both the response and token levels. The method generates an image from the initial response, which serves as a visual reference to verify and correct the response. Experiments show that DeGF effectively mitigates various types of hallucinations and outperforms state-of-the-art methods across six benchmarks.

该研究通过提出一种自纠正的生成反馈解码方法（DeGF）来解决大型视觉语言模型（LVLM）中的幻觉问题。受文本到图像生成与图像条件响应生成之间逆向关系的启发，DeGF 使用文本到图像生成模型在响应和标记级别提供自我反馈。该方法从初始响应生成图像，作为视觉参考来验证和纠正响应。实验表明，DeGF 有效地缓解了各种类型的幻觉，并在六个基准测试中优于最先进的方法。
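
The choice between complementary and contrastive decoding can be sketched as follows. This is a hedged illustration only: the abstract does not state how the two modes are chosen, so the Jensen-Shannon divergence test, the `threshold` and `alpha` values, and the logit-combination rules below are assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two distributions."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def feedback_decode(orig_logits, gen_logits, threshold=0.1, alpha=1.0):
    """Combine logits conditioned on the original and on the generated image.

    Assumed rule: if the two predictions agree, reinforce them (complementary);
    otherwise push away from the generated-image prediction (contrastive).
    """
    p, q = softmax(orig_logits), softmax(gen_logits)
    if js_divergence(p, q) < threshold:
        mode = "complementary"
        combined = [a + alpha * b for a, b in zip(orig_logits, gen_logits)]
    else:
        mode = "contrastive"
        combined = [(1 + alpha) * a - alpha * b for a, b in zip(orig_logits, gen_logits)]
    return mode, softmax(combined)

agree_mode, agree_probs = feedback_decode([2.0, 0.0, 0.0], [2.1, 0.0, 0.0])
clash_mode, clash_probs = feedback_decode([2.0, 0.0, 0.0], [0.0, 0.0, 3.0])
print(agree_mode, clash_mode)
```

In the toy inputs, `orig_logits` stands for next-token logits given the real image and `gen_logits` for logits given the image generated from the initial response.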

Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Authors: Boammani Aser Lompo, Marc Haraoui

First: 2025-09-09T17:52:26+00:00 · Latest: 2025-09-09T17:52:26+00:00

Comments: Work in Progress

Abs · PDF · Code1

Abstract

Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.

中文标题/摘要

标题：Visual-TableQA：用于表格图像推理的开放域基准

在结构化数据（如表格）上进行视觉推理是现代视觉语言模型（VLMs）的关键能力，但当前基准在规模、多样性和推理深度上仍然有限，尤其是在涉及渲染表格图像时。为解决这一差距，我们引入了Visual-TableQA，这是一个大规模、开放域的多模态数据集，专门用于评估和增强对复杂表格数据的视觉推理能力。我们的生成管道是模块化、可扩展且完全自主的，涉及多个推理LLM在不同角色（生成、验证和灵感）之间的协作。Visual-TableQA 包含 2500 个丰富结构化的 LaTeX 渲染表格和 6000 个推理密集型问答对，所有这些数据的生成成本低于100美元。为了促进多样性和创造力，我们的管道通过跨模型提示（‘灵感’）和LLM-陪审团筛选进行多模型协作数据生成。更强的模型提供布局和主题，由较弱的模型加以扩展，共同将多样化的推理模式和视觉结构提炼进数据集。实验证明，基于Visual-TableQA微调的模型在外部基准上表现出强大的泛化能力，尽管数据集具有合成性，仍优于几个专有模型。完整的管道和资源可在 https://github.com/AI-4-Everyone/Visual-TableQA 公开获取。

Summary / 总结

Visual-TableQA is an open-domain dataset designed to evaluate and enhance visual reasoning over complex tabular data. It consists of 2,500 richly structured LaTeX-rendered tables and 6,000 reasoning-intensive QA pairs, generated through a modular and scalable pipeline involving multiple reasoning LLMs in roles of generation, validation, and inspiration. The dataset demonstrates strong generalization capabilities, with models fine-tuned on it outperforming proprietary models on external benchmarks despite its synthetic nature.

Visual-TableQA 是一个用于评估表格图像上视觉推理能力的开放领域数据集，解决了现有基准的局限性。该数据集使用了一个模块化和可扩展的管道，涉及多个推理语言模型进行生成、验证和灵感激发。数据集包含2,500个丰富结构化的LaTeX渲染表格和6,000个推理密集型问答对，生成成本较低。尽管是合成数据，但基于此数据集微调的模型在外部基准上的泛化能力很强，超过了几个专有模型。

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Authors: Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong

First: 2024-11-05T07:56:24+00:00 · Latest: 2025-09-09T13:30:17+00:00

Comments: Accepted by EMNLP2025

Abs · PDF

Abstract

Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation due to sequence lengths out-of-distribution, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV Cache criticality at token-level. Through a per-head soft voting mechanism, TokenSelect selectively involves a few critical KV cache tokens in attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive Query similarity and implement the efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of TokenSelect demonstrates up to 23.84× speedup in attention computation and up to 2.28× acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.

中文标题/摘要

标题：TokenSelect：通过动态选择令牌级KV缓存实现高效长上下文推理和长度外推

大型语言模型（LLMs）的迅速发展推动了现代应用中处理扩展上下文序列的需求。然而，这一进展面临两个挑战：由于序列长度超出分布范围导致的性能下降，以及由于注意力机制的二次计算复杂性引起的推理时间过长。这些问题限制了LLMs在长上下文场景中的应用。本文提出了一种无需训练的方法——动态令牌级KV缓存选择（TokenSelect），以实现高效且准确的长上下文推理。TokenSelect基于非连续注意力稀疏性的观察，使用QK点积来衡量每个头在令牌级的KV缓存关键性。通过每个头的软投票机制，TokenSelect选择性地参与少量关键KV缓存令牌的注意力计算，而不牺牲准确性。为了进一步加速TokenSelect，我们基于连续查询相似性的观察设计了选择缓存，并实现了高效的分页点积内核，显著减少了选择开销。TokenSelect的全面评估显示，在注意力计算中可实现高达23.84倍的加速，在端到端延迟中可实现高达2.28倍的加速，同时在长上下文推理方法中提供更优的性能。

Summary / 总结

TokenSelect is a training-free method for efficient long-context inference in LLMs, addressing performance degradation on out-of-distribution sequence lengths and long inference times. It scores cached tokens per head with QK dot products and uses a per-head soft voting mechanism to select a small set of critical KV cache tokens for attention. This yields up to a 23.84× speedup in attention computation and up to a 2.28× speedup in end-to-end latency while maintaining accuracy.

TokenSelect 是一种无需训练的方法，用于在大语言模型中实现高效长上下文推理，解决性能下降和长时间推理的问题。它通过 QK 点积测量每个头的 KV 缓存关键性，并使用每头软投票机制仅涉及关键令牌参与注意力计算。该方法在注意力计算中实现了高达 23.84 倍的加速，并在端到端延迟上实现了 2.28 倍的加速，优于现有长上下文推理方法。
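
Per-head soft voting over cached tokens can be sketched in a few lines. This is a simplified illustration, not the paper's kernel: it assumes each head's QK scores are normalized with a softmax before the votes are summed, and it ignores the Selection Cache and paged memory layout. All names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_kv_tokens(q_heads, k_heads, top_k):
    """Pick the top_k cached tokens by summed per-head soft votes.

    q_heads: one query vector per head, shape [H][D].
    k_heads: cached key vectors per head, shape [H][T][D].
    """
    num_tokens = len(k_heads[0])
    votes = [0.0] * num_tokens
    for q, keys in zip(q_heads, k_heads):
        # QK dot product measures how critical each cached token is for this head.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Softmax puts every head's vote on the same scale before summing.
        for t, weight in enumerate(softmax(scores)):
            votes[t] += weight
    ranked = sorted(range(num_tokens), key=lambda t: votes[t], reverse=True)
    return sorted(ranked[:top_k])

# Toy cache: 2 heads, 4 tokens, head dimension 2.
q_heads = [[1.0, 0.0], [0.0, 1.0]]
k_heads = [
    [[3.0, 0.0], [0.0, 0.0], [0.1, 0.0], [0.2, 0.0]],  # head 0 favors token 0
    [[0.0, 0.0], [0.0, 0.1], [0.0, 3.0], [0.0, 0.2]],  # head 1 favors token 2
]
selected = select_kv_tokens(q_heads, k_heads, top_k=2)
print(selected)
```

Normalizing per head keeps a single head with large raw scores from dominating the selection, which is one plausible reading of "soft voting".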

Light-Weight Cross-Modal Enhancement Method with Benchmark Construction for UAV-based Open-Vocabulary Object Detection

Authors: Zhenhai Weng, Xinjie Li, Can Wu, Weijie He, Jianfeng Lv, Dong Zhou, Zhongliang Yu

First: 2025-09-07T10:59:02+00:00 · Latest: 2025-09-09T12:22:18+00:00

Abs · PDF

Abstract

Open-Vocabulary Object Detection (OVD) faces severe performance degradation when applied to UAV imagery due to the domain gap from ground-level datasets. To address this challenge, we propose a complete UAV-oriented solution that combines both dataset construction and model innovation. First, we design a refined UAV-Label Engine, which efficiently resolves annotation redundancy, inconsistency, and ambiguity, enabling the generation of large-scale UAV datasets. Based on this engine, we construct two new benchmarks: UAVDE-2M, with over 2.4M instances across 1,800+ categories, and UAVCAP-15K, providing rich image-text pairs for vision-language pretraining. Second, we introduce the Cross-Attention Gated Enhancement (CAGE) module, a lightweight dual-path fusion design that integrates cross-attention, adaptive gating, and global FiLM modulation for robust text-vision alignment. By embedding CAGE into the YOLO-World-v2 framework, our method achieves significant gains in both accuracy and efficiency, notably improving zero-shot detection on VisDrone by +5.3 mAP while reducing parameters and GFLOPs, and demonstrating strong cross-domain generalization on SIMD. Extensive experiments and real-world UAV deployment confirm the effectiveness and practicality of our proposed solution for UAV-based OVD.

中文标题/摘要

标题：基于无人机的开放词汇目标检测轻量级跨模态增强方法及基准构建

开放词汇目标检测（OVD）在应用于无人机图像时由于与地面数据集之间的领域差距而面临严重的性能下降。为应对这一挑战，我们提出了一种完整的面向无人机的解决方案，结合了数据集构建和模型创新。首先，我们设计了一种改进的无人机标注引擎，高效地解决了标注冗余、不一致和模糊性问题，从而能够生成大规模的无人机数据集。基于此引擎，我们构建了两个新的基准：UAVDE-2M，包含超过240万实例和1800多个类别，以及UAVCAP-15K，提供了丰富的图像-文本对用于视觉-语言预训练。其次，我们引入了跨注意力门控增强（CAGE）模块，这是一种轻量级的双路径融合设计，结合了跨注意力、自适应门控和全局FiLM调制，以实现稳健的文本-视觉对齐。通过将CAGE嵌入到YOLO-World-v2框架中，我们的方法在准确性和效率上均取得了显著提升，在VisDrone上的零样本检测上提高了5.3个mAP，同时减少了参数和GFLOPs，并在SIMD上展示了强大的跨域泛化能力。广泛的实验和实际无人机部署验证了我们所提出的解决方案在基于无人机的OVD任务中的有效性和实用性。

MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

Authors: Minghao Han, Linhao Qu, Dingkang Yang, Xukun Zhang, Xiaoying Wang, Lihua Zhang

First: 2024-08-21T10:25:51+00:00 · Latest: 2025-09-09T12:15:25+00:00

Comments: This work has been submitted to the IEEE TMI for possible publication

Abs · PDF · Code

Abstract

Multiple instance learning (MIL) has become a standard paradigm for the weakly supervised classification of whole slide images (WSIs). However, this paradigm relies on using a large number of labeled WSIs for training. The lack of training data and the presence of rare diseases pose significant challenges for these methods. Prompt tuning combined with pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) These methods fail to fully leverage the prior knowledge from the VLM's text modality; 2) They overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) They lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for the FSWC task. Specifically, MSCPT employs a frozen large language model to generate pathological visual language prior knowledge at multiple scales, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within WSIs, and finally, a non-parametric cross-guided instance aggregation module is introduced to derive the WSI-level features. Extensive experiments, visualizations, and interpretability analyses were conducted on five datasets and three downstream tasks using three VLMs, demonstrating the strong performance of our MSCPT. All code is publicly available at https://github.com/Hanminghao/MSCPT.

中文标题/摘要

标题：MSCPT：多尺度和上下文聚焦提示调优的少量样本全切片图像分类

多实例学习（MIL）已成为弱监督全切片图像（WSI）分类的标准范式。然而，这种方法依赖于大量标注的WSI进行训练。缺乏训练数据和罕见疾病的出现对这些方法构成了重大挑战。结合预训练的视觉-语言模型（VLM）的提示调优是一种有效的少量样本弱监督WSI分类（FSWC）任务解决方案。然而，将为自然图像设计的提示调优方法应用于WSI时，存在三个重大挑战：1）这些方法未能充分利用VLM文本模态的先验知识；2）它们忽略了WSI中的多尺度和上下文信息，导致结果次优；3）它们缺乏实例聚合方法的探索。为了解决这些问题，我们提出了一种多尺度和上下文聚焦提示调优（MSCPT）方法用于FSWC任务。具体而言，MSCPT利用冻结的大语言模型在多尺度下生成病理视觉语言先验知识，引导分层提示调优。此外，我们设计了一个图提示调优模块来学习WSI内的关键上下文信息，最后引入了一个非参数交叉引导实例聚合模块以提取WSI级别的特征。在五个数据集和三个下游任务上使用三个VLM进行了广泛的实验、可视化和可解释性分析，证明了我们MSCPT的强大性能。所有代码已公开发布在https://github.com/Hanminghao/MSCPT。

Summary / 总结

The research addresses few-shot weakly supervised whole slide image classification (FSWC) by proposing MSCPT, a multi-scale and context-focused prompt tuning method. MSCPT uses a frozen large language model to generate pathological visual-language prior knowledge at multiple scales, which guides hierarchical prompt tuning. A graph prompt tuning module captures contextual information within each WSI, and a non-parametric cross-guided instance aggregation module derives WSI-level features. Experiments on five datasets show strong performance across three downstream tasks using three different Vision-Language models.

研究提出了MSCPT方法，以解决FSWC问题，该方法利用多尺度和上下文聚焦的提示调优。方法使用冻结的大语言模型在多尺度生成视觉语言先验知识，并结合图提示调优模块来捕捉WSI中的关键上下文信息。在五个数据集上的实验结果表明，该方法具有很强的性能，解决了先前方法在利用文本模态、多尺度信息和实例聚合方面的局限性。

Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease

Authors: Fangqi Cheng, Surajit Ray, Xiaochen Yang

First: 2025-09-09T11:36:21+00:00 · Latest: 2025-09-09T11:36:21+00:00

Abs · PDF

Abstract

Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.
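
The first innovation, turning structured metadata into a synthetic report, can be pictured as simple template filling. The sketch below is a minimal illustration under stated assumptions: the field names (`age`, `sex`, `education_years`) and the sentence templates are hypothetical and do not come from the paper.

```python
def metadata_to_report(meta: dict) -> str:
    """Render structured patient metadata as a short free-text report.

    Field names and wording are illustrative assumptions; missing fields
    are skipped so that partial records still produce a valid report.
    """
    parts = []
    if "age" in meta and "sex" in meta:
        parts.append(f"The patient is a {meta['age']}-year-old {meta['sex']}.")
    if "education_years" in meta:
        parts.append(f"The patient has {meta['education_years']} years of education.")
    return " ".join(parts)


example = {"age": 74, "sex": "female", "education_years": 12}
print(metadata_to_report(example))
```

The resulting string would serve as the text input paired with the scan. The MMSE score is deliberately left out of the template here, since the abstract describes it as a prediction target for the auxiliary token rather than as an input.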

中文标题/摘要

标题：基于数据高效微调的视觉-语言模型在阿尔茨海默病诊断中的应用

医学视觉-语言模型（Med-VLMs）在报告生成和视觉问答等任务中取得了令人印象深刻的成果，但仍然面临一些限制。最显著的是，它们未能充分利用患者元数据，并缺乏临床诊断知识的整合。此外，大多数现有模型通常从头开始训练或在大规模2D图像-文本对上进行微调，需要大量的计算资源，而且由于缺乏结构信息，它们在3D医学成像上的效果往往有限。为了解决这些差距，我们提出了一种数据高效的微调流水线，以适应基于3D CT的Med-VLMs，并展示了其在阿尔茨海默病（AD）诊断中的应用。我们的系统引入了两个关键创新。首先，我们将结构化的元数据转换为合成报告，丰富了文本输入，以提高图像-文本对齐。其次，我们添加了一个辅助标记，用于预测迷你精神状态检查（MMSE）分数，这是一种广泛使用的临床认知功能测量指标，与AD严重程度相关。这为微调提供了额外的监督。通过轻量级提示微调图像和文本模态，我们的方法在两个AD数据集上使用1,500张训练图像达到了最先进的性能，优于在10,000张图像上微调的现有方法。代码将在发表后发布。

Summary / 总结

This study aims to enhance the diagnostic capabilities of vision-language models for Alzheimer's disease by addressing their limitations in utilizing patient metadata and integrating clinical knowledge. The researchers propose a data-efficient fine-tuning pipeline that adapts 3D CT-based Med-VLMs to 3D MRI, converts structured metadata into synthetic reports, and adds an auxiliary token trained to predict the MMSE score. With lightweight prompt tuning of both image and text modalities, the approach achieves state-of-the-art performance on two AD datasets using only 1,500 training images, outperforming existing methods fine-tuned on 10,000 images.

研究旨在通过解决视觉语言模型在利用患者元数据和整合临床知识方面的局限性，提升其在阿尔茨海默病诊断中的能力。方法包括一个数据高效微调管道，将结构化元数据转换为合成报告，并添加一个辅助标记以预测MMSE评分。该方法仅使用1,500张训练图像在两个AD数据集上实现了最先进的性能，优于现有方法在10,000张图像上微调的结果。代码将在发表后发布。

Visuospatial Cognitive Assistant

Authors: Qi Feng

First: 2025-05-18T08:55:02+00:00 · Latest: 2025-09-09T09:48:14+00:00

Comments: 31 pages, 10 figures, 6 tables

Abs · PDF

Abstract

Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.

中文标题/摘要

标题：空间视觉认知助手

基于视频的空间认知对于机器人技术和具身AI至关重要，但目前的视觉-语言模型（VLMs）面临挑战。本文做出了两项关键贡献。首先，我们引入了ViCA（空间视觉认知助手）-322K，这是一个包含322,003个问答对的多样数据集，来自真实室内视频（ARKitScenes、ScanNet、ScanNet++），提供3D元数据驱动查询和基于视频的复杂推理的监督。其次，我们开发了在ViCA-322K上微调的ViCA-7B，其在所有八个VSI-Bench任务上均达到新的最佳性能，超越现有模型，包括更大的模型（例如，在绝对距离上提高26.1%）。为了提高可解释性，我们提出了ViCA-Thinking-2.68K数据集，包含明确的推理链，并微调ViCA-7B创建了ViCA-7B-Thinking模型，该模型能够表达其空间推理。我们的工作强调了目标数据的重要性，并指出了改进时空建模的路径。我们发布了所有资源以促进稳健的空间视觉智能研究。

Summary / 总结

This paper addresses the challenge of video-based spatial cognition in robotics and embodied AI by introducing ViCA-322K, a large dataset of 322,003 QA pairs from real-world indoor videos, and ViCA-7B, a model fine-tuned on this dataset that outperforms existing models on eight VSI-Bench tasks. Additionally, the authors release ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and create ViCA-7B-Thinking, a model that can explain its spatial reasoning, highlighting the importance of targeted data for improved temporal-spatial modeling.

该论文通过引入包含322,003个问答对的大型数据集ViCA-322K，以及基于该数据集训练的ViCA-7B模型，解决了机器人和具身AI中的基于视频的空间认知挑战。此外，作者还开发了ViCA-Thinking-2.68K以提高可解释性，并对ViCA-7B进行微调以创建ViCA-7B-Thinking模型，该模型能够解释其空间推理过程，突显了针对目标数据对于提高时空建模的重要性。

InteractPro: A Unified Framework for Motion-Aware Image Composition

Authors: Weijing Tao, Xiaofeng Yang, Miaomiao Cui, Guosheng Lin

First: 2024-09-16T08:44:17+00:00 · Latest: 2025-09-09T08:10:04+00:00

Abs · PDF

Abstract

We introduce InteractPro, a comprehensive framework for dynamic motion-aware image composition. At its core is InteractPlan, an intelligent planner that leverages a Large Vision Language Model (LVLM) for scenario analysis and object placement, determining the optimal composition strategy to achieve realistic motion effects. Based on each scenario, InteractPlan selects between our two specialized modules: InteractPhys and InteractMotion. InteractPhys employs an enhanced Material Point Method (MPM)-based simulation to produce physically faithful and controllable object-scene interactions, capturing diverse and abstract events that require true physical modeling. InteractMotion, in contrast, is a training-free method based on pretrained video diffusion. Traditional composition approaches suffer from two major limitations: requiring manual planning for object placement and generating static, motionless outputs. By unifying simulation-based and diffusion-based methods under planner guidance, InteractPro overcomes these challenges, ensuring richly motion-aware compositions. Extensive quantitative and qualitative evaluations demonstrate InteractPro's effectiveness in producing controllable and coherent compositions across varied scenarios.

中文标题/摘要

标题：InteractPro：一种统一的动态运动感知图像合成框架

我们介绍了InteractPro，一个全面的动态运动感知图像合成框架。其核心是InteractPlan，一种智能规划器，利用大型视觉语言模型（LVLM）进行场景分析和物体放置，确定最佳合成策略以实现逼真的运动效果。根据每个场景，InteractPlan 选择使用我们两个专门模块之一：InteractPhys 和 InteractMotion。InteractPhys 使用增强的基于材料点方法（MPM）的模拟来生成物理上忠实且可控的物体-场景交互，捕捉需要真实物理建模的多样和抽象事件。相比之下，InteractMotion 是一种无需训练的方法，基于预训练的视频扩散。传统合成方法存在两大局限性：需要手动规划物体放置和生成静态、无运动的输出。通过在规划器引导下统一基于模拟和基于扩散的方法，InteractPro 克服了这些挑战，确保了丰富的运动感知合成。广泛的定量和定性评估表明，InteractPro 在各种场景中生成可控且连贯的合成效果的有效性。

Summary / 总结

InteractPro is a framework for dynamic motion-aware image composition that uses an intelligent planner, InteractPlan, which leverages a Large Vision Language Model to analyze scenarios and place objects optimally. It selects between InteractPhys, which uses an enhanced MPM-based simulation for physically faithful object interactions, and InteractMotion, a training-free method based on pretrained video diffusion. This approach addresses the limitations of traditional methods by providing controllable and coherent compositions across various scenarios, overcoming the need for manual planning and static outputs.

InteractPro 是一种用于动态运动感知图像合成的框架，使用智能规划器 InteractPlan，结合大型视觉语言模型来分析场景并优化物体放置。它选择使用增强的 MPM 基模拟进行物理忠实物体交互的 InteractPhys，以及基于预训练视频扩散的无需训练方法 InteractMotion。这种方法通过提供各种场景下的可控和连贯的合成，解决了传统方法需要手动规划和静态输出的问题。

Fine-Tuning Vision-Language Models for Visual Navigation Assistance

Authors: Xiao Li, Bharat Gandhi, Ming Zhan, Mohit Nehra, Zhicheng Zhang, Yuchen Sun, Meijia Song, Naisheng Zhang, Xi Wang

First: 2025-09-09T08:08:35+00:00 · Latest: 2025-09-09T08:08:35+00:00

Abs · PDF

Abstract

We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.
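
For readers unfamiliar with LoRA, the idea is to keep a pretrained weight matrix W0 frozen and learn a low-rank update, so the effective weight is W0 + (alpha / r) * B A, where r is the rank. The sketch below shows that arithmetic in plain Python on tiny matrices; it illustrates the general technique only, and the helper names are made up for this example rather than taken from the paper or from any library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, a, b, alpha):
    """Return W0 + (alpha / r) * B @ A.

    w0: frozen weight, shape (d_out, d_in)
    a:  trainable matrix A, shape (r, d_in)
    b:  trainable matrix B, shape (d_out, r)
    """
    r = len(a)  # LoRA rank
    scale = alpha / r
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


w0 = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]            # rank r = 1
b_init = [[0.0], [0.0]]     # B starts at zero, so training begins from W0
b_trained = [[0.5], [1.0]]  # hypothetical values after some training

print(lora_effective_weight(w0, a, b_init, alpha=2.0))
print(lora_effective_weight(w0, a, b_trained, alpha=2.0))
```

Only A and B are trained, which is r * (d_in + d_out) values instead of d_in * d_out for the full matrix; that saving is what makes the approach practical on a small, manually annotated dataset.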

中文标题/摘要

标题：视觉语言模型的微调以提供视觉导航辅助

我们解决基于视觉语言的室内导航问题，使用图像和自然语言指导视觉障碍人士到达目标位置。传统的导航系统在室内由于缺乏精确的位置数据而无效。我们的方法结合视觉和语言模型生成逐步导航指令，增强无障碍性和独立性。我们使用手动标注的室内导航数据集对BLIP-2模型进行低秩适应（LoRA）微调。我们提出了一种评估指标，通过强调方向性和顺序性变量改进了BERT F1分数，提供了一个更全面的导航性能衡量标准。应用LoRA后，模型在生成方向性指令方面显著提高，克服了原始BLIP-2模型的局限性。

Summary / 总结

The research aims to improve indoor navigation for visually impaired individuals by integrating vision and language models. The study fine-tunes the BLIP-2 model using Low Rank Adaptation (LoRA) on a manually annotated dataset, and introduces a new evaluation metric that focuses on directional and sequential variables. The results show that the model's performance in generating directional instructions was significantly enhanced after fine-tuning, addressing limitations in the original BLIP-2 model.

研究旨在通过结合视觉和语言模型来改善视力受损人士的室内导航。研究通过使用低秩适应（LoRA）对BLIP-2模型进行微调，并在手动标注的数据集上进行训练。主要发现是，微调后模型在生成方向性指令方面的表现显著提高，从而增强了对视力受损用户的导航辅助。

SheetDesigner: MLLM-Powered Spreadsheet Layout Generation with Rule-Based and Vision-Based Reflection

Authors: Qin Chen, Yuanyi Ren, Xiaojun Ma, Mugeng Liu, Han Shi, Dongmei Zhang

Venue: EMNLP 2025

First: 2025-09-09T07:51:38+00:00 · Latest: 2025-09-09T07:51:38+00:00

Comments: Accepted to EMNLP 2025 Main Conference

Abs · PDF

Abstract

Spreadsheets are critical to data-centric tasks, with rich, structured layouts that enable efficient information transmission. Given the time and expertise required for manual spreadsheet layout design, there is an urgent need for automated solutions. However, existing automated layout models are ill-suited to spreadsheets, as they often (1) treat components as axis-aligned rectangles with continuous coordinates, overlooking the inherently discrete, grid-based structure of spreadsheets; and (2) neglect interrelated semantics, such as data dependencies and contextual links, unique to spreadsheets. In this paper, we first formalize the spreadsheet layout generation task, supported by a seven-criterion evaluation protocol and a dataset of 3,326 spreadsheets. We then introduce SheetDesigner, a zero-shot and training-free framework using Multimodal Large Language Models (MLLMs) that combines rule and vision reflection for component placement and content population. SheetDesigner outperforms five baselines by at least 22.6%. We further find that through the vision modality, MLLMs handle overlap and balance well but struggle with alignment, necessitating hybrid rule and visual reflection strategies. Our code and data are available on GitHub.
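
To make the "discrete, grid-based structure" point concrete, the sketch below represents each component as a block of whole cells and runs one rule a rule-based reflection step could plausibly apply: flagging pairs of components that occupy the same cells. The `Block` type, its field names, and the check itself are hypothetical illustrations, not SheetDesigner's actual rules.

```python
from itertools import combinations
from typing import NamedTuple


class Block(NamedTuple):
    name: str
    row: int      # top row index (0-based)
    col: int      # left column index (0-based)
    n_rows: int   # height in cells
    n_cols: int   # width in cells


def overlaps(a: Block, b: Block) -> bool:
    """True if the two blocks share at least one cell."""
    rows = a.row < b.row + b.n_rows and b.row < a.row + a.n_rows
    cols = a.col < b.col + b.n_cols and b.col < a.col + a.n_cols
    return rows and cols


def find_overlaps(blocks):
    """Return the names of every pair of blocks that collide."""
    return [(a.name, b.name) for a, b in combinations(blocks, 2) if overlaps(a, b)]


layout = [
    Block("title", row=0, col=0, n_rows=1, n_cols=4),
    Block("table", row=1, col=0, n_rows=5, n_cols=4),
    Block("chart", row=4, col=3, n_rows=6, n_cols=3),
]
print(find_overlaps(layout))
```

Because coordinates are integer cell indices, the check is exact, with no tolerance thresholds of the kind continuous-coordinate layout models need. A reported collision could then be fed back to the model as a textual hint for the next placement attempt.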

中文标题/摘要

标题：SheetDesigner：基于MLLM的规则与视觉反馈结合的电子表格布局生成

电子表格对于数据为中心的任务至关重要，具有丰富的结构化布局，能够高效地传递信息。鉴于手动设计电子表格布局所需的时间和专业知识，迫切需要自动化解决方案。然而，现有的自动化布局模型并不适合电子表格，因为它们通常（1）将组件视为轴对齐的矩形，忽略了电子表格固有的网格结构；（2）忽视了数据依赖关系和上下文链接等独特的相关语义。在本文中，我们首先通过一个七准则评估协议和包含3,326个电子表格的数据集，形式化了电子表格布局生成任务。然后，我们介绍了SheetDesigner，这是一种无需训练且零样本的框架，使用多模态大型语言模型（MLLMs），结合规则和视觉反馈进行组件放置和内容填充。SheetDesigner在五个基线模型上至少优于22.6%。我们进一步发现，通过视觉模态，MLLMs在处理重叠和平衡方面表现良好，但在对齐方面存在困难，需要混合规则和视觉反馈策略。我们的代码和数据可在Github上获取。

Summary / 总结

SheetDesigner addresses the need for automated spreadsheet layout design by leveraging Multimodal Large Language Models (MLLMs) and combining rule-based and vision-based approaches. It outperforms five baselines by at least 22.6% and handles overlap and balance well through the vision modality, although it struggles with alignment, requiring hybrid strategies. The framework is zero-shot and training-free, using a dataset of 3,326 spreadsheets to evaluate its performance based on seven criteria.

SheetDesigner 通过利用多模态大型语言模型（MLLM）并结合规则和视觉反射，解决了自动化电子表格布局设计的需求。该框架在包含3,326个电子表格的七标准评估协议上优于五个基线至少22.6%，并且在处理重叠和平衡方面表现良好，但在对齐方面存在困难。该框架为零样本且无需训练，代码和数据可在Github上获取。

ANYPORTAL: Zero-Shot Consistent Video Background Replacement

Authors: Wenshuo Gao, Xicheng Lan, Shuai Yang

Venue: ICCV 2025

First: 2025-09-09T07:50:53+00:00 · Latest: 2025-09-09T07:50:53+00:00

Comments: 8 pages, ICCV 2025, Website: https://gaowenshuo.github.io/AnyPortal/

Abs · PDF · Project

Abstract

Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

中文标题/摘要

标题：ANYPORTAL：零样本一致视频背景替换

尽管视频生成技术取得了快速进步，但创建与用户意图精确匹配的高质量视频仍然是一个重大挑战。现有方法往往无法实现对视频细节的精细控制，限制了其实用性。我们提出ANYPORTAL，这是一种利用预训练扩散模型的零样本框架，用于视频背景替换。我们的框架在零样本设置中协作整合了视频扩散模型的时间先验与图像扩散模型的重新照明能力。为解决前景一致性这一关键挑战，我们提出了一种细化投影算法，该算法允许像素级细节操作以确保精确的前景保留。ANYPORTAL 是无训练的，并克服了实现前景一致性和时间上连贯的重新照明的挑战。实验结果表明，ANYPORTAL 在消费级 GPU 上实现了高质量的结果，提供了一种实用且高效的视频内容创作和编辑解决方案。

Summary / 总结

ANYPORTAL is a zero-shot framework for video background replacement that uses pre-trained diffusion models to integrate temporal priors and relighting capabilities. It introduces a Refinement Projection Algorithm to maintain foreground consistency. Experimental results show that ANYPORTAL can produce high-quality video results on consumer-grade GPUs, providing a practical solution for video content creation and editing without requiring training data.

ANYPORTAL 是一种利用预训练扩散模型结合时间先验和光照能力的零样本框架。它提出了细化投影算法以保持前景一致性。实验结果表明，ANYPORTAL 可以在消费级 GPU 上生成高质量的视频，提供一种无需训练数据的视频内容创作和编辑的实用解决方案。