arXiv Paper Digest (arXiv 论文速递)

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu

First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00

Comments: Project page: https://sytwu.github.io/BeyondMemo/

Abstract

We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/

中文标题/摘要

标题：超越记忆：多模态序数回归基准以揭示视觉语言模型中的流行度偏差

我们揭示了最先进的视觉语言模型（VLMs）中存在显著的流行度偏差，这些模型在著名建筑上的准确率比普通建筑最多高出34%，表明它们依赖于记忆而非可泛化的理解。为了系统地研究这一问题，我们引入了该任务上最大的公开基准数据集：YearGuessr数据集，包含来自157个国家的55,546张建筑图像及其多模态属性，附有其建设年份的连续序数标签（1001-2024）、GPS数据和页面浏览量作为流行度的代理。使用该数据集，我们将建筑年份预测任务框定为序数回归，并引入了流行度感知的区间准确度指标来量化这种偏差。我们构建的包含30多个模型的基准，包括我们的YearCLIP模型，证实了VLMs在流行、被记忆的对象上表现出色，但在无法识别的对象上却表现明显不佳，揭示了它们推理能力中的关键缺陷。项目页面：https://sytwu.github.io/BeyondMemo/

Summary / 总结

The paper addresses the significant popularity bias in state-of-the-art vision-language models (VLMs), which perform better on famous buildings than ordinary ones. To systematically investigate this, the authors introduce the YearGuessr dataset, comprising 55,546 building images with multi-modal attributes and continuous ordinal labels of construction years. Using this dataset, they frame the task as ordinal regression and introduce new metrics to quantify the bias. The benchmark of 30+ models, including YearCLIP, confirms that VLMs excel on popular items but struggle with unrecognized subjects, highlighting a critical flaw in their reasoning capabilities.

论文探讨了最先进的视觉-语言模型（VLMs）中存在的显著流行度偏差，这些模型在著名建筑上的表现优于普通建筑。为了系统地研究这一问题，作者引入了包含55,546张建筑图像和多模态属性的YearGuessr数据集，并附有连续的按年份排序标签。使用该数据集，他们将任务定义为序数回归，并引入新的指标来量化偏差。30多种模型的基准测试，包括YearCLIP，证实了VLMs在流行项目上表现出色，但在未识别的主题上却面临重大挑战，揭示了其推理能力的关键缺陷。
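
The popularity-aware interval accuracy idea described above can be sketched as follows. This is a minimal illustration, not the paper's metric definition: the function names, the tolerance of 5 years, the page-view threshold, and the toy data are all assumptions made for the example. It counts a prediction as correct when it falls within a tolerance of the true construction year, then compares accuracy between a high page-view group and a low page-view group.

```python
from typing import List, Sequence, Tuple


def interval_accuracy(pred_years: Sequence[int],
                      true_years: Sequence[int],
                      tolerance: int) -> float:
    """Fraction of predictions within +/- tolerance years of the label."""
    if len(pred_years) != len(true_years):
        raise ValueError("pred_years and true_years must have equal length")
    if not pred_years:
        return float("nan")
    hits = sum(abs(p - t) <= tolerance for p, t in zip(pred_years, true_years))
    return hits / len(pred_years)


def popularity_gap(pred_years: Sequence[int],
                   true_years: Sequence[int],
                   page_views: Sequence[int],
                   tolerance: int,
                   view_threshold: int) -> Tuple[float, float, float]:
    """Interval accuracy for popular vs. ordinary items, and their difference.

    view_threshold is a hypothetical cut-off separating the two groups.
    """
    popular: List[Tuple[int, int]] = []
    ordinary: List[Tuple[int, int]] = []
    for p, t, v in zip(pred_years, true_years, page_views):
        (popular if v >= view_threshold else ordinary).append((p, t))
    acc_pop = interval_accuracy([p for p, _ in popular],
                                [t for _, t in popular], tolerance)
    acc_ord = interval_accuracy([p for p, _ in ordinary],
                                [t for _, t in ordinary], tolerance)
    return acc_pop, acc_ord, acc_pop - acc_ord


# Toy data (made up for illustration only).
preds = [1890, 1700, 2001, 1500]
labels = [1889, 1750, 2000, 1650]
views = [10000, 50, 8000, 20]
acc_pop, acc_ord, gap = popularity_gap(preds, labels, views,
                                       tolerance=5, view_threshold=1000)
```

A large positive gap on real data would point to the pattern the abstract reports: good predictions on frequently viewed buildings and poor ones elsewhere.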

LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov

First: 2025-12-24T15:36:21+00:00 · Latest: 2025-12-24T15:36:21+00:00

Abstract

Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io.

中文标题/摘要

标题：LookPlanGraph：基于VLM图增强的具身指令跟随方法

使用大型语言模型（LLM）作为规划器的方法在具身指令跟随任务中变得普遍。为了成功完成任务，LLM 必须在机器人操作的环境中进行接地。一种解决方案是使用包含所有必要信息的场景图。现代方法依赖于预先构建的场景图，并假设在规划开始时所有任务相关信息都已可用。然而，这些方法没有考虑到在图构建和任务执行之间环境可能发生的改变。我们提出了 LookPlanGraph 方法，该方法利用由静态资产和对象先验组成的场景图。在计划执行过程中，LookPlanGraph 通过验证现有先验或发现新实体，持续用相关对象更新图。这通过使用视觉语言模型处理智能体的第一人称摄像机视图来实现。我们在对象位置发生改变的 VirtualHome 和 OmniGibson 模拟环境中进行了实验，证明 LookPlanGraph 优于基于预定义静态场景图的方法。为了展示我们方法的实际适用性，我们还在现实世界中进行了实验。此外，我们引入了 GraSIF（用于指令跟随的图场景）数据集，其中包括自动验证框架，包含从 SayPlan Office、BEHAVIOR-1K 和 VirtualHome RobotHow 中抽取的 514 个任务。项目页面可在 https://lookplangraph.github.io 查看。

Summary / 总结

The paper proposes LookPlanGraph, a method that enhances embodied instruction following by continuously updating a scene graph with relevant objects during plan execution. This is achieved through a Vision Language Model processing the agent's egocentric camera view. Experiments in simulated and real-world environments show that LookPlanGraph outperforms methods using predefined static scene graphs, especially when the environment changes between graph construction and task execution. The study also introduces the GraSIF dataset for instruction following tasks, including 514 tasks from various sources.

该论文提出了LookPlanGraph方法，通过结合视觉语言模型（VLM）动态更新场景图来增强基于指令的机器人任务执行。这种方法在任务执行过程中持续更新场景图，以应对环境变化。实验结果显示，LookPlanGraph在模拟和真实环境中表现优于依赖静态预定义场景图的方法，尤其是在物体位置改变的情况下。

Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

Authors: Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen

Venue: MM

First: 2025-12-24T15:02:33+00:00 · Latest: 2025-12-24T15:02:33+00:00

Comments: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables

Abstract

Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval

中文标题/摘要

标题：利用轻量级实体提取实现可扩展的基于事件的图像检索

从自然语言描述中检索图像是一项在计算机视觉和自然语言处理交叉领域中的核心任务，广泛应用于搜索引擎、媒体归档和数字内容管理等领域。然而，由于模糊或依赖上下文的查询、语言的多样性以及需要可扩展的解决方案，现实世界中的图像-文本检索仍然具有挑战性。在本文中，我们提出了一种轻量级的两阶段检索管道，利用事件中心的实体提取来结合现实世界标题中的时间与上下文信号。第一阶段使用基于显著实体的BM25高效候选过滤，第二阶段应用BEiT-3模型来捕捉深层多模态语义并重新排序结果。在OpenEvents v1基准上评估，我们的方法达到了0.559的平均精度，显著优于先前的基线。这些结果突显了结合事件引导过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。我们的代码可在https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval 获取。

Summary / 总结

This work addresses the challenge of retrieving images from natural language descriptions by proposing a lightweight two-stage retrieval pipeline. The first stage uses BM25 based on salient entities for efficient candidate filtering, while the second stage employs BEiT-3 models to capture deep multimodal semantics and rerank the results. The method achieves a mean average precision of 0.559 on the OpenEvents v1 benchmark, significantly outperforming previous approaches, demonstrating the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in real-world scenarios.

本文提出了一种轻量级的两阶段检索管道，以解决从自然语言描述中检索图像的挑战。第一阶段使用基于显著实体的BM25进行高效的候选过滤，第二阶段则使用BEiT-3模型捕获深度多模态语义并重新排序结果。该方法在OpenEvents v1基准测试上达到了0.559的平均精度，显著优于先前的方法，展示了结合事件引导过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。
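
The two-stage structure described above (lexical filtering, then reranking) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the BM25 variant and its `k1` and `b` defaults are common textbook choices, the documents are assumed to be pre-extracted entity lists, and `rerank_fn` is a hypothetical stand-in for a multimodal scorer such as BEiT-3, which is not implemented here.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def bm25_scores(query_terms: Sequence[str],
                docs: Sequence[Sequence[str]],
                k1: float = 1.5,
                b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def two_stage_retrieve(query_terms: Sequence[str],
                       docs: Sequence[Sequence[str]],
                       rerank_fn: Callable[[int], float],
                       top_k: int = 2) -> List[int]:
    """Stage 1: keep the top_k BM25 candidates. Stage 2: reorder by rerank_fn."""
    scores = bm25_scores(query_terms, docs)
    candidates = sorted(range(len(docs)),
                        key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=rerank_fn, reverse=True)


# Toy entity lists standing in for captions (made up for illustration).
docs = [
    ["flood", "hanoi", "2020", "rescue"],
    ["election", "paris", "2022"],
    ["flood", "jakarta", "2020"],
    ["football", "final"],
]
query = ["flood", "hanoi", "2020"]
stage1 = bm25_scores(query, docs)
# Hypothetical reranker scores; a real system would score image-text pairs.
fake_rerank = {0: 0.2, 2: 0.9}
ranked = two_stage_retrieve(query, docs, lambda i: fake_rerank.get(i, 0.0))
```

The point of the split is cost: the cheap lexical stage shrinks the candidate set so that the expensive model only scores a handful of items.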

RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

First: 2025-12-24T15:01:26+00:00 · Latest: 2025-12-24T15:01:26+00:00

Comments: 11 pages, 6 figures

Abstract

Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.

中文标题/摘要

标题：RoboSafe：通过可执行安全逻辑保护具身代理

由视觉-语言模型（VLMs）驱动的具身代理越来越能够执行复杂的现实世界任务，但它们仍然容易受到可能导致不安全行为的危险指令的影响。运行时安全护栏可以在任务执行过程中拦截危险行为，提供了一种有前景的解决方案，因为它们具有灵活性。然而，现有的防御措施往往依赖于静态规则过滤或提示级控制，难以应对动态、时序依赖和上下文丰富的环境中隐含的风险。为了解决这一问题，我们提出了一种名为RoboSafe的混合推理运行时保护，通过可执行谓词基础的安全逻辑保护具身代理。RoboSafe结合了在混合长短期安全记忆上的两种互补推理过程。我们首先提出了一种后向反思推理模块，该模块不断回顾短期记忆中的最近轨迹，以推断时间安全谓词，并在检测到违规行为时主动触发重新规划。然后，我们提出了一种前瞻预测推理模块，该模块通过生成基于长期安全记忆和代理的多模态观察的安全谓词来预见即将出现的风险。这些组件共同形成了一种既可解释又可执行的适应性安全逻辑。在多个代理的广泛实验中，RoboSafe将危险行为的风险发生率降低了36.8%，同时保持了接近原始的任务性能。在物理机器人手臂上的实际评估进一步证实了其实用性。代码将在接受后发布。

Summary / 总结

RoboSafe is a hybrid reasoning runtime safeguard for embodied agents using executable predicate-based safety logic. It integrates Backward Reflective Reasoning and Forward Predictive Reasoning to continuously monitor and predict safety risks. Experiments show that RoboSafe significantly reduces hazardous actions by 36.8% compared to leading baselines while maintaining task performance. Real-world evaluations on robotic arms confirm its practicality.

RoboSafe 通过使用可执行的安全逻辑来保护实体代理免受有害指令的影响，结合了回顾性推理，用于回顾近期行为以检测安全违规，以及前瞻性推理，根据长期记忆和当前观察预测风险。实验表明，RoboSafe 相比现有方法将有害行为减少了 36.8%，同时保持了任务性能。实际机器人手臂上的评估进一步证实了其实用性。
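
To make "safety logic that is executable as code" concrete, here is a small sketch of a temporal safety predicate checked against an action trajectory. Everything in it is hypothetical: the action names, the stove rule, and the `guard` function are invented for illustration and are not taken from RoboSafe, whose code had not been released at the time of the abstract.

```python
from typing import List, Tuple

Action = Tuple[str, str]  # (verb, target), a hypothetical action encoding


def violates_unattended_heat(trajectory: List[Action]) -> bool:
    """Temporal predicate: True if the agent leaves a room while the stove is on."""
    stove_on = False
    for verb, target in trajectory:
        if verb == "turn_on" and target == "stove":
            stove_on = True
        elif verb == "turn_off" and target == "stove":
            stove_on = False
        elif verb == "leave_room" and stove_on:
            return True
    return False


def guard(history: List[Action], proposed: Action) -> str:
    """Check the proposed action against the predicate before executing it."""
    if violates_unattended_heat(history + [proposed]):
        return "replan"
    return "execute"


unsafe = guard([("turn_on", "stove")], ("leave_room", "kitchen"))
safe = guard([("turn_on", "stove"), ("turn_off", "stove")],
             ("leave_room", "kitchen"))
```

Because the rule depends on the order of earlier actions, a static filter on the single action `leave_room` could not express it, which is the kind of temporally dependent risk the abstract says static filters struggle with.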

VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan

First: 2025-12-24T14:18:38+00:00 · Latest: 2025-12-24T14:18:38+00:00

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.

中文标题/摘要

标题：VisRes 基准：关于评估 VLM 视觉推理能力的研究

视觉-语言模型（VLMs）在视觉问答和图像描述等任务上取得了显著进展。然而，这些模型在视觉推理方面的表现与其依赖语言先验的程度之间的关系尚不明确。为了解决这一问题，我们引入了 VisRes 基准，该基准旨在在无需上下文语言监督的自然环境中研究视觉推理。通过对三种复杂性级别的模型行为进行分析，我们发现了感知和关系视觉推理能力的明显局限性。VisRes 在其级别上隔离了不同的推理能力。第一级测试在模糊、纹理变化、遮挡和旋转等干扰下的感知完成和全局图像匹配；第二级测试单一属性（如颜色、数量、方向）的基于规则的推理；第三级则针对需要整合多个视觉属性的组合推理。在超过 19,000 张受控任务图像中，我们发现最先进的 VLMs 在细微的感知干扰下表现接近随机，揭示了其有限的抽象能力，仅限于模式识别。最后，我们讨论了 VisRes 如何为多模态研究中推进抽象视觉推理提供统一框架。

Summary / 总结

The paper introduces VisRes Bench, a benchmark to evaluate the visual reasoning capabilities of Vision-Language Models (VLMs) without relying on contextual language supervision. The benchmark consists of three levels of complexity: perceptual completion and global image matching (Level 1), rule-based inference over a single attribute (Level 2), and compositional reasoning integrating multiple visual attributes (Level 3). The study reveals that state-of-the-art VLMs struggle with subtle perceptual perturbations and show limited abstraction beyond pattern recognition, highlighting the need for improved visual reasoning abilities in VLMs.

VisRes Bench 是一个基准，旨在评估 Vision-Language 模型 (VLM) 在无需依赖上下文语言监督的情况下进行视觉推理的能力。它在三个复杂性级别上评估模型：感知完成、基于规则的推理和组合推理。研究发现，最先进的 VLM 在细微的感知扰动下表现不佳，表明它们的抽象能力仅限于模式识别。该基准提供了一个统一的框架，用于推进多模态研究中的抽象视觉推理。

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

Authors: Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang

First: 2025-10-18T09:22:40+00:00 · Latest: 2025-12-24T13:40:37+00:00

Abstract

Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors (such as task difficulty, model scale, and semantic alignment with the target domain) that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.

中文标题/摘要

标题：SSL4RL：重新审视自监督学习作为视觉-语言推理内在奖励的方法

视觉-语言模型（VLMs）通过结合大型语言模型和视觉输入展示了显著的能力。然而，它们往往未能充分利用视觉证据，要么依赖于视觉中心任务中的语言先验，要么在推理过程中求助于文本捷径。尽管强化学习（RL）可以将模型与期望的行为对齐，但将其应用于VLMs受到了缺乏可扩展和可靠的奖励机制的阻碍。为克服这一挑战，我们提出了一种名为SSL4RL的新框架，该框架利用自监督学习（SSL）任务作为基于RL的微调的验证性奖励来源。我们的方法将SSL目标，如预测图像旋转或重建遮罩片段，重新表述为密集的自动奖励信号，从而消除了对人工偏好数据或不可靠的人工智能评估者的需要。实验表明，SSL4RL在视觉中心和视觉-语言推理基准测试中显著提高了性能。此外，通过系统性的消融实验，我们确定了影响SSL4RL任务有效性的关键因素，如任务难度、模型规模和与目标领域的语义对齐，为未来工作提供了新的设计原则。我们还通过将其应用于图学习，展示了该框架的通用性，其中它带来了显著的收益。SSL4RL建立了一种使用可验证的自监督目标对齐多模态模型的灵活且有效的范式。

Summary / 总结

The paper proposes SSL4RL, a framework that uses self-supervised learning (SSL) tasks as intrinsic rewards for reinforcement learning (RL) fine-tuning of vision-language models (VLMs). This approach reformulates SSL objectives into dense, automatic reward signals, improving performance on vision-centric and vision-language reasoning benchmarks. Key factors influencing the effectiveness of SSL4RL tasks include task difficulty, model scale, and semantic alignment with the target domain.

研究旨在通过将自我监督学习（SSL）作为内在奖励集成到强化学习（RL）中来提升视觉语言模型的性能。方法是将SSL任务如图像旋转预测和遮罩块重建转换为密集的自动奖励信号。关键实验发现表明，这种方法在视觉中心任务和视觉语言推理基准测试中显著提高了性能，并且在图学习任务中也表现出色，展示了其通用性和有效性。
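
The idea of turning an SSL task into a verifiable reward can be illustrated with rotation prediction. This is a toy sketch, not the SSL4RL implementation: images are plain nested lists, the reward is a simple 0/1 match, and all function names are assumptions. The key property is that the correct answer is known by construction, so no human label or learned judge is needed.

```python
import random
from typing import List, Tuple

Image = List[List[int]]
ANGLES = (0, 90, 180, 270)


def rotate90(image: Image) -> Image:
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_task(image: Image, rng: random.Random) -> Tuple[Image, int]:
    """Apply a random multiple of 90 degrees and return (rotated image, angle)."""
    k = rng.randrange(4)
    rotated = image
    for _ in range(k):
        rotated = rotate90(rotated)
    return rotated, ANGLES[k]


def rotation_reward(predicted_angle: int, true_angle: int) -> float:
    """Verifiable reward: 1.0 if the predicted rotation matches, else 0.0."""
    return 1.0 if predicted_angle == true_angle else 0.0


rng = random.Random(0)
image = [[1, 2], [3, 4]]
rotated, angle = make_rotation_task(image, rng)
# In RL fine-tuning, the policy's answer would replace this placeholder guess.
reward_if_correct = rotation_reward(angle, angle)
reward_if_wrong = rotation_reward((angle + 90) % 360, angle)
```

In a training loop, the model would be shown `rotated`, asked for the angle, and updated using the returned reward.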

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

Authors: Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng

Venue: ICCV 2025

First: 2024-12-09T06:34:23+00:00 · Latest: 2025-12-24T13:11:11+00:00

Comments: Accepted at ICCV 2025. The code is available at https://github.com/HVision-NKU/DenseVLM

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant 'foreground bias', where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available at https://github.com/HVision-NKU/DenseVLM.

中文标题/摘要

标题：无偏区域-语言对齐以实现开放词汇密集预测

预训练的视觉-语言模型（VLMs），如CLIP，已经展示了令人印象深刻的零样本识别能力，但在密集预测任务中仍然表现不佳。最近，自我蒸馏作为一种有希望的方法正在兴起，用于微调VLMs以更好地适应局部区域，而无需大量注释。然而，之前最先进的方法往往遭受显著的“前景偏差”问题，模型倾向于错误地将背景区域识别为前景对象。为了解决这一问题，我们提出了一种名为DenseVLM的框架，该框架旨在从强大的预训练VLM表示中学习无偏的区域-语言对齐。DenseVLM利用预训练的VLM检索未标记区域的类别，然后分离前景和背景特征之间的干扰。我们展示了DenseVLM可以直接替换原始VLM在开放词汇目标检测和图像分割方法中，从而显著提高性能。此外，当在更广泛和多样化的数据集上进行训练时，它还表现出有希望的零样本可扩展性。我们的代码可在https://github.com/HVision-NKU/DenseVLM获取。

Summary / 总结

The research aims to improve the performance of pre-trained vision-language models (VLMs) in dense prediction tasks by addressing the foreground bias issue. DenseVLM is proposed to learn unbiased region-language alignment using self-distillation from pre-trained VLMs. The method decouples foreground and background features, allowing it to replace the original VLM in open-vocabulary object detection and image segmentation, resulting in significant performance enhancements and zero-shot scalability with larger datasets.

研究旨在通过解决前景偏差问题，提升预训练视觉-语言模型（VLMs）在密集预测任务中的性能。提出了DenseVLM框架，利用自蒸馏技术学习无偏的区域-语言对齐。该方法利用预训练的VLM检索未标记区域的类别，并分离前景和背景特征，从而在开放词汇量的目标检测和图像分割任务中取得了显著的性能提升。此外，DenseVLM在更多样化的数据集上展示了良好的零样本扩展性。

ORCA: Object Recognition and Comprehension for Archiving Marine Species

Authors: Yuk-Kwan Wong, Haixin Liang, Zeyu Ma, Yiwei Chen, Ziqiang Zheng, Rinaldi Gotama, Pascal Sebastian, Lauren D. Sparks, Sai-Kit Yeung

Venue: WACV

First: 2025-12-24T12:36:57+00:00 · Latest: 2025-12-24T12:36:57+00:00

Comments: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

Abstract

Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in the marine domain. Project Page: http://orca.hkustvgd.com/.

中文标题/摘要

标题：ORCA：海洋物种识别与理解以实现海洋生物存档

海洋视觉理解对于监测和保护海洋生态系统至关重要，能够实现自动化的生物调查。然而，由于训练数据有限且缺乏将特定海洋领域的挑战与明确的计算机视觉任务系统化结合的任务表述，进展受到限制，从而限制了有效模型的应用。为解决这一问题，我们提出了ORCA，一个包含14,647张图像和478个物种的多模态基准数据集，其中包含42,217个边界框注释和22,321个专家验证的实例描述。该数据集提供了细粒度的视觉和文本注释，捕捉了不同海洋物种的形态特征。为了促进方法学的进步，我们在三个任务上评估了18个最先进的模型：对象检测（封闭集和开放词汇）、实例描述和视觉定位。结果突显了关键挑战，包括物种多样性、形态重叠和专门领域的特殊需求，强调了海洋理解的难度。ORCA因此建立了一个全面的基准，以推进海洋领域的研究。项目页面：http://orca.hkustvgd.com/

Summary / 总结

The research aims to improve marine ecosystem monitoring through automatic and scalable biological surveys. ORCA, a multi-modal benchmark, is introduced with 14,647 images and detailed annotations for 478 marine species. The study evaluates 18 state-of-the-art models on object detection, instance captioning, and visual grounding tasks, revealing challenges such as species diversity and morphological overlap. ORCA provides a comprehensive benchmark to advance marine domain research.

研究旨在通过自动和大规模的生物调查来改善海洋生态系统的监测。ORCA是一个多模态基准，包含14,647张图像和478种海洋物种的详细注释。研究评估了18种最先进的模型在物体检测、实例描述和视觉定位任务上的表现，揭示了物种多样性、形态重叠等挑战。ORCA为海洋领域研究提供了全面的基准。

Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification

Authors: Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim

First: 2025-12-17T09:47:29+00:00 · Latest: 2025-12-24T12:33:48+00:00

Abstract

Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLMs), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $\Delta$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $\Delta$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.

中文标题/摘要

标题：视觉-语言模型在医学图像疾病分类中的交叉公平性

医学人工智能（AI）系统，尤其是多模态视觉-语言模型（VLM），常常表现出交叉偏见，模型在诊断边缘化患者亚组时系统性地缺乏信心。这种偏见可能导致由于数据的人口统计分布偏斜和诊断确定性分布差异而出现更高的误诊和漏诊率。当前的公平性干预措施往往未能解决这些差距，或者在实现各亚组统计平等的同时牺牲整体诊断性能。在本研究中，我们开发了跨模态对齐一致性（CMAC-MMD）训练框架，该框架标准化了各交叉患者亚组的诊断确定性。与传统的去偏见方法不同，这种方法在临床推断过程中不需要敏感的人口统计数据即可使模型的决策信心相等。我们使用10,015张皮肤病变图像（HAM10000）和外部验证的12,000张图像（BCN20000）以及10,000张用于青光眼检测的眼底图像（Harvard-FairVLMed），按年龄、性别和种族的交叉属性分层评估了该方法。在皮肤科队列中，所提出的方法将总体交叉漏诊差距（真阳性率差异，ΔTPR）从0.50降低到0.26，同时将总体曲线下面积（AUC）从0.94提高到0.97，优于标准训练。同样，在青光眼筛查中，该方法将ΔTPR从0.41降低到0.31，实现了更好的AUC（0.72，与0.71基线相比）。这建立了一个可扩展的框架，用于开发既准确又能在不同患者亚组中公平执行的高风险临床决策支持系统，确保可靠性能而不增加隐私风险。

Summary / 总结

This study addresses the intersectional biases in medical AI systems, particularly in vision-language models, which can lead to higher rates of inaccurate and missed diagnoses. The authors developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that equalizes diagnostic certainty across different patient subgroups without needing sensitive demographic data. Evaluations on skin lesion and fundus images showed that the proposed method reduced the missed diagnosis gap and improved overall diagnostic performance, achieving better Area Under the Curve (AUC) scores compared to standard training methods.

该研究针对医疗AI系统中存在交叉偏见的问题，特别是在用于疾病分类的多模态视觉-语言模型中。研究引入了跨模态一致性对齐（CMAC-MMD）的训练框架，该框架能够在不需要敏感人口统计数据的情况下，使不同患者亚组的诊断确定性标准化。在皮肤病变和眼底图像的评估中，所提出的方法减少了漏诊差距，并提高了整体诊断性能，AUC值优于标准训练方法。
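
The missed-diagnosis gap reported above is a difference in true positive rate (TPR) across intersectional subgroups. The abstract does not spell out how the gap is aggregated, so the sketch below assumes the simplest reading: the maximum subgroup TPR minus the minimum. The function name, the subgroup encoding as (age band, gender) tuples, and the toy labels are all illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def tpr_gap(y_true: Sequence[int],
            y_pred: Sequence[int],
            groups: Sequence[Hashable]) -> Tuple[Dict[Hashable, float], float]:
    """Per-subgroup TPR and the max-min gap (assumed definition of Delta TPR)."""
    true_pos: Dict[Hashable, int] = defaultdict(int)
    positives: Dict[Hashable, int] = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            if p == 1:
                true_pos[g] += 1
    if not positives:
        raise ValueError("no positive cases; TPR is undefined")
    # Subgroups without any positive case are skipped, since TPR is undefined.
    tpr = {g: true_pos[g] / positives[g] for g in positives}
    return tpr, max(tpr.values()) - min(tpr.values())


# Toy labels (made up): 1 = disease present / predicted present.
y_true = [1, 1, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 1, 0]
groups = [("young", "F"), ("young", "F"),
          ("old", "M"), ("old", "M"), ("old", "M"), ("old", "M")]
per_group, gap = tpr_gap(y_true, y_pred, groups)
```

Here the ("old", "M") subgroup has two of its three true cases missed, so the gap is large even though the prediction rule is the same for everyone.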

MarineEval: Assessing the Marine Intelligence of Vision-Language Models

Authors: Yuk-Kwan Wong, Tuan-An To, Jipeng Zhang, Ziqiang Zheng, Sai-Kit Yeung

Venue: WACV

First: 2025-12-24T11:57:50+00:00 · Latest: 2025-12-24T11:57:50+00:00

Comments: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

Abstract

We have witnessed promising progress led by large language models (LLMs) and, more recently, vision-language models (VLMs) in handling various queries as general-purpose assistants. VLMs, as a bridge between the visual world and the language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though VLMs have achieved great success in various fields, in this work we ask whether existing VLMs can act as domain experts and accurately answer marine questions, which require significant domain expertise and involve special domain challenges and requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark, called MarineEval, with 2,000 image-based question-answering pairs. During dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by marine domain experts. We comprehensively benchmark 17 existing VLMs on MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still considerable room for further performance improvement. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/

中文标题/摘要

标题：MarineEval：评估视觉语言模型的海洋智能

我们见证了由大型语言模型（LLMs）和进一步的视觉语言模型（VLMs）引领的在处理各种查询方面取得的令人鼓舞的进展，使其成为通用助手。VLMs 作为连接视觉世界和语言语料库的桥梁，接收视觉内容和各种文本指令以生成相应的响应。尽管 VLMs 在各个领域取得了巨大成功，但在本文中，我们质疑现有的 VLMs 是否可以作为领域专家，准确回答需要大量领域专业知识和解决特殊领域挑战/要求的海洋问题。为了全面评估现有 VLMs 的效果并探索其边界，我们构建了第一个大规模海洋 VLM 数据集和基准 MarineEval，包含 2,000 个基于图像的问题-答案对。在数据集构建过程中，我们确保了构建数据的多样性和覆盖面：7 个任务维度和 20 个能力维度。领域要求特别整合到数据构建中，并由相应的海洋领域专家进一步验证。我们在 MarineEval 上全面基准测试了 17 个现有 VLMs，并调查了现有模型在回答海洋研究问题方面的局限性。实验结果表明，现有 VLMs 无法有效回答领域特定问题，仍有很大的性能提升空间。我们希望我们的新基准和观察结果能促进未来的研究。项目页面：http://marineeval.hkustvgd.com/

Summary / 总结

MarineEval assesses the marine intelligence of vision-language models (VLMs) by constructing a large-scale marine dataset with 2,000 image-based question-answering pairs. The dataset covers 7 task dimensions and 20 capacity dimensions, ensuring diversity and domain-specific requirements. The benchmark evaluates 17 existing VLMs and finds that they struggle with domain-specific marine questions, indicating significant room for improvement. The study aims to advance future research in this area.

MarineEval通过构建包含2000个基于图像的问题-答案对的大规模数据集，涵盖7个任务维度和20个能力维度来评估视觉语言模型的海洋智能。评估结果显示现有的视觉语言模型难以准确回答特定的海洋问题，表明在处理专门知识和挑战方面仍有很大的改进空间。这项工作旨在促进该领域的未来研究。

UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters

Authors: Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, Yu-Gang Jiang

First: 2025-12-24T10:35:21+00:00 · Latest: 2025-12-24T10:35:21+00:00

Abstract

Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed text-formula samples, enabling the training of a powerful yet lightweight model. Second, we identify two challenges in building such a lightweight but unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and at multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9$\times$ speedup, validating its effectiveness and efficiency. Codebase and Dataset: https://github.com/Topdu/OpenOCR.

中文标题/摘要

标题：UniRec-0.1B: 统一文本和公式识别模型，参数量仅为0.1B

文本和公式是许多文档的核心信息组件。准确高效地识别两者对于开发稳健且通用的文档解析系统至关重要。最近，视觉-语言模型（VLMs）在统一识别文本和公式方面取得了令人印象深刻的成果。然而，它们体积庞大且计算需求高，限制了其在许多应用中的使用。在本文中，我们提出了一种仅包含0.1B参数的统一识别模型UniRec-0.1B。该模型能够在字符、单词、行、段落和文档等多个层次上进行文本和公式识别。为了实现这一任务，我们首先建立了包含4000万文本、公式及其混合样本的大型数据集UniRec40M，以训练出强大而轻量级的模型。其次，我们识别了构建这样一个轻量级但统一专家模型时的两个挑战：层次结构中的结构变异性以及文本和公式内容之间的语义纠缠。为了解决这些问题，我们引入了层次监督训练，以明确引导结构理解，并引入了语义解耦分词器，将文本和公式表示分离。最后，我们开发了一个全面的评估基准，涵盖了多个领域和多个层次的中文和英文文档。在该基准和公开基准上的实验结果表明，UniRec-0.1B 在性能和效率上均优于通用视觉语言模型和领先文档解析专家模型，验证了其有效性和效率。代码库和数据集：https://github.com/Topdu/OpenOCR.

Summary / 总结

The paper introduces UniRec-0.1B, a unified text and formula recognition model with only 0.1 billion parameters, which can perform recognition at multiple levels. To achieve this, the authors developed a large-scale dataset, UniRec40M, and introduced hierarchical supervision training and a semantic-decoupled tokenizer to address structural variability and semantic entanglement. The model outperforms both general-purpose vision-language models and document parsing expert models while achieving a 2-9 times speedup. Experimental results on various benchmarks demonstrate its effectiveness and efficiency.

该论文提出了一个仅包含0.1B（约1亿）参数的统一文本和公式识别模型UniRec-0.1B，能够进行多级识别。为此，创建了一个大规模数据集UniRec40M，并解决了结构变异性和语义纠缠两个挑战。模型通过层次监督训练和语义解耦分词器进行训练。实验结果表明，UniRec-0.1B在性能上优于通用视觉语言模型和文档解析专家模型，同时实现了2-9倍的加速。

Case Prompting to Mitigate Large Language Model Bias for ICU Mortality Prediction

Authors: Gangxiong Zhang, Yongchao Long, Yong Zhang, Yuxi Zhou, Shenda Hong

First: 2025-12-17T12:29:53+00:00 · Latest: 2025-12-24T08:34:41+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate mortality risk prediction for intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show promise in predicting outcomes from structured medical data, their predictions may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. Existing debiasing methods often reduce predictive performance, making it difficult to jointly optimize fairness and accuracy. In this study, we systematically examine bias in LLM-based ICU mortality prediction and propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance. We first develop a multi-dimensional bias assessment scheme for comprehensive model diagnosis. Building on this analysis, we introduce CAse Prompting (CAP), a novel prompting framework that integrates conventional debiasing prompts with case-based reasoning. CAP guides the model to learn from similar historical misprediction cases and their correct outcomes, enabling correction of biased reasoning patterns. Experiments on the MIMIC-IV dataset show that CAP substantially improves both predictive accuracy and fairness. CAP increases AUROC from 0.806 to 0.873 and AUPRC from 0.497 to 0.694, while reducing sex- and race-related disparities by over 90%. Feature reliance analysis further indicates highly consistent attention patterns across demographic groups, with similarity scores exceeding 0.98. These results demonstrate that LLMs exhibit measurable bias in ICU mortality prediction, and that a carefully designed prompting framework can effectively co-optimize fairness and performance without retraining, offering a transferable paradigm for equitable clinical decision support.

中文标题/摘要

标题：通过病例提示减轻大型语言模型在ICU病死率预测中的偏见

ICU患者病死率的准确预测对于临床决策至关重要。尽管大型语言模型（LLMs）在基于结构化医疗数据预测结局方面显示出潜力，但它们的预测可能表现出与性别、年龄和种族相关的人口统计学偏见，限制了其在临床实践中的可信应用。现有的去偏方法通常会降低预测性能，使得难以同时优化公平性和准确性。在本研究中，我们系统地检查了基于LLM的ICU病死率预测中的偏见，并提出了一种无需训练、临床适应的提示框架，以同时提高公平性和性能。我们首先开发了一种多维度偏见评估方案，用于全面的模型诊断。在此基础上，我们引入了CAse Prompting（CAP），这是一种新颖的提示框架，将传统的去偏提示与案例推理相结合。CAP引导模型从相似的历史错误预测案例及其正确结果中学习，以纠正有偏见的推理模式。在MIMIC-IV数据集上的实验表明，CAP显著提高了预测准确性和公平性。CAP将AUROC从0.806提高到0.873，AUPRC从0.497提高到0.694，同时将性别和种族相关的差异减少了90%以上。特征依赖性分析进一步表明，不同人口群体之间的注意力模式高度一致，相似度分数超过0.98。这些结果表明，LLMs在ICU病死率预测中表现出可测量的偏见，并且精心设计的提示框架可以在无需重新训练的情况下有效协同优化公平性和性能，为公平的临床决策支持提供了一种可迁移的范式。

Summary / 总结

This study addresses the issue of bias in large language models (LLMs) used for ICU mortality prediction, which can limit their clinical utility. The authors propose a prompting framework called CAse Prompting (CAP) to mitigate these biases without retraining the models. By integrating conventional debiasing prompts with case-based reasoning, CAP helps the model learn from historical mispredictions, thereby improving both predictive accuracy and fairness. Experiments on the MIMIC-IV dataset show that CAP significantly enhances AUROC and AUPRC while reducing sex- and race-related disparities by over 90%. Feature reliance analysis also reveals consistent attention patterns across demographic groups, indicating that CAP effectively corrects biased reasoning patterns.

该研究针对大型语言模型（LLM）在ICU病死率预测中的偏见问题，这些问题可能影响临床决策。研究提出了一种无需重新训练的提示框架——CAse Prompting（CAP），以同时提高公平性和预测准确性。通过多维度偏见分析和案例推理的结合，CAP 引导模型从历史错误预测中学习，从而减少人口统计学差异。MIMIC-IV 数据集上的实验结果显示，CAP 将 AUROC 提高到 0.873，AUPRC 提高到 0.694，同时显著减少了性别和种族相关的差异。特征依赖性分析还表明，不同人口统计学群体之间的注意力模式高度一致，表明 CAP 在不重新训练模型的情况下有效缓解了偏见。
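The case-retrieval idea in the CAP abstract (show the model similar historical misprediction cases together with their correct outcomes) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the cosine-similarity retrieval, the feature vectors, the case texts, and the prompt wording are hypothetical and are not taken from the paper or from MIMIC-IV.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_cases(query_features, case_bank, k=2):
    """Return the k stored misprediction cases most similar to the query."""
    ranked = sorted(
        case_bank,
        key=lambda case: cosine(query_features, case["features"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(patient_text, cases):
    """Combine a debiasing instruction with retrieved cases (hypothetical wording)."""
    lines = [
        "Assess ICU mortality risk. Do not let sex, age, or race alone drive the estimate.",
        "",
    ]
    for i, case in enumerate(cases, 1):
        lines.append(f"Case {i}: {case['text']}")
        lines.append(
            f"  Earlier prediction: {case['predicted']} | Actual outcome: {case['actual']}"
        )
    lines += ["", f"New patient: {patient_text}", "Estimate the mortality risk."]
    return "\n".join(lines)


# Synthetic, made-up case bank (not real patient data).
CASE_BANK = [
    {"id": "A", "features": [1.0, 0.0, 0.2], "text": "synthetic case A",
     "predicted": "survived", "actual": "died"},
    {"id": "B", "features": [0.0, 1.0, 0.1], "text": "synthetic case B",
     "predicted": "died", "actual": "survived"},
    {"id": "C", "features": [0.8, 0.3, 0.0], "text": "synthetic case C",
     "predicted": "survived", "actual": "died"},
]

if __name__ == "__main__":
    similar = retrieve_cases([1.0, 0.1, 0.1], CASE_BANK, k=2)
    print(build_prompt("synthetic new patient", similar))
```

Because the approach is described as training-free, the only moving parts in such a sketch are the retrieval rule and the prompt template; the model itself is left untouched.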

O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model

Authors: Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty

Venue: AAAI 2026

First: 2025-11-18T11:18:08+00:00 · Latest: 2025-12-24T08:17:05+00:00

Comments: Accepted to AAAI 2026

Abs · PDF · Code1 · Code2

Abstract

While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. We comprehensively evaluate O3SLM on multiple sketch-based tasks: (a) object localization, (b) counting, (c) image retrieval (i.e., SBIR and fine-grained SBIR), and (d) visual question answering (VQA). Using the three existing sketch datasets QuickDraw!, Sketchy, and Tu Berlin together with our generated SketchVCL dataset, these evaluations show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.

中文标题/摘要

标题：O3SLM：开放权重、开放数据和开放词汇的草图-语言模型

虽然大型视觉语言模型（LVLMs）在越来越多的实际应用中被部署，但它们对抽象视觉输入的理解能力仍然有限。具体来说，它们难以理解手绘草图，这种模态提供了一种直观的方式来表达难以用文字描述的概念。我们发现的主要瓶颈是没有一个大规模的数据集能够同时建模草图、照片级真实图像及其相应的自然语言指令。为了解决这个问题，我们提出了两个关键贡献：（1）一个新设计的、大规模的图像-草图-指令三元组数据集，旨在促进预训练和指令微调；（2）O3SLM，一个在该数据集上训练的LVLM。在多个基于草图的任务上的全面评估：（a）物体定位，（b）计数，（c）图像检索，即（SBIR和细粒度SBIR），以及（d）视觉问答（VQA），结合现有的三个草图数据集，即QuickDraw！、Sketchy和Tu Berlin，以及我们生成的SketchVCL数据集，表明O3SLM达到了最先进的性能，显著优于现有的LVLMs在草图理解和推理方面的表现。

Summary / 总结

The research aims to improve Large Vision Language Models (LVLMs) in understanding hand-drawn sketches, which are useful for expressing complex concepts. To address this, the authors created a new large-scale dataset of image-sketch-instruction triplets and trained an LVLM called O3SLM on this dataset. Experimental results show that O3SLM outperforms existing models in tasks such as object localization, counting, image retrieval, and visual question answering, particularly in handling sketch-based inputs.

研究旨在提升大型视觉语言模型（LVLM）对手绘草图的理解能力，这是一个具有挑战性的输入模态。研究引入了一个新的大规模图像-草图-指令三元组数据集，并基于此训练了一个名为O3SLM的模型。实验结果表明，O3SLM在对象定位、计数、图像检索和与草图相关的视觉问答等任务上优于现有模型。

V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval

Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim

First: 2025-12-13T11:02:04+00:00 · Latest: 2025-12-24T07:46:59+00:00

Comments: 14 pages, 20 figures, conference, accepted by HPCA 2026

Abs · PDF · Code1 · Code2

Abstract

Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. Because of this iterative prefill stage, streaming inference suffers from significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, this issue is exacerbated for edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time performance of 3.9-8.3 FPS and energy-efficient streaming video LLM inference in edge deployment with negligible accuracy loss. While the DRE accounts for only 2.2% of power and 2.0% of area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over the AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.

中文标题/摘要

标题：V-Rex：通过动态KV缓存检索实现实时流式视频LLM加速

流式视频大型语言模型（LLMs）越来越多地用于实时多模态任务，如视频字幕、问答、对话代理和增强现实。然而，这些模型面临着根本性的内存和计算挑战，因为它们的键值（KV）缓存会随着连续的流式视频输入而大幅增长。这一过程需要一个迭代预填充阶段，这是流式视频LLMs的一个独特特征。由于这一迭代预填充阶段，流式推理受到显著限制，包括大量的计算、大量的数据传输以及准确性的下降。至关重要的是，这个问题在边缘部署中被进一步放大，而边缘部署正是这些模型的主要目标。在这项工作中，我们提出了V-Rex，这是第一个软硬件协同设计的加速器，全面解决了流式视频LLM推理中的算法和硬件瓶颈。V-Rex的核心是引入了ReSV，这是一种无需训练的动态KV缓存检索算法。ReSV利用基于时间和空间相似性的令牌聚类来减少视频帧间的冗余KV缓存内存。为了充分利用这些算法上的优势，V-Rex提供了一个紧凑、低延迟的硬件加速器，其中包括一个动态KV缓存检索引擎（DRE），具有位级和早期退出的计算单元。V-Rex在边缘部署中实现了前所未有的实时性能（3.9-8.3 FPS）和高能效的流式视频LLM推理，几乎无准确度损失。虽然DRE仅占2.2%的功耗和2.0%的面积，但该系统相比AGX Orin GPU实现了1.9-19.7倍的加速和3.1-18.5倍的能效提升。这项工作首次全面解决了算法和硬件中的KV缓存检索问题，使实时流式视频LLM推理能够在资源受限的边缘设备上实现。

Summary / 总结

V-Rex is a software-hardware co-designed accelerator that addresses the memory and computational challenges of streaming video large language models (LLMs) by introducing ReSV, a training-free dynamic key-value cache retrieval algorithm. ReSV reduces KV cache memory through similarity-based token clustering, while the hardware accelerator, built around a dynamic KV cache retrieval engine (DRE), offers low-latency and energy-efficient processing. V-Rex achieves real-time inference at 3.9-8.3 FPS, with 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over the AGX Orin GPU and negligible accuracy loss.

V-Rex 是一种软件硬件协同设计的加速器，通过引入基于时间与空间相似性的动态 KV 缓存检索算法 ReSV 来解决流式视频大语言模型 (LLM) 的内存和计算挑战。ReSV 通过 token 聚类减少不必要的 KV 缓存内存，V-Rex 提供了一个紧凑的硬件加速器，包含动态 KV 缓存检索引擎 (DRE)，实现低延迟推理。V-Rex 实现了 3.9-8.3 FPS 的实时性能，相对于 AGX Orin GPU 取得 1.9-19.7 倍的加速，同时几乎没有准确率损失，适用于边缘部署。
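To make the idea of similarity-based token clustering for a KV cache more concrete, here is a minimal sketch. It is not ReSV: the greedy one-pass rule, the cosine threshold, and the running-mean centroid update are assumptions chosen for illustration, and the abstract does not specify how ReSV forms or retrieves its clusters.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_tokens(keys, threshold=0.95):
    """Greedy one-pass clustering of key vectors.

    Each incoming key joins the most similar existing centroid if the
    similarity is at least `threshold`; otherwise it starts a new cluster.
    Returns (centroids, counts).
    """
    centroids, counts = [], []
    for key in keys:
        best, best_sim = -1, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(key, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            centroids.append(list(key))
            counts.append(1)
        else:
            n = counts[best]
            # Running mean keeps one representative vector per cluster.
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], key)
            ]
            counts[best] = n + 1
    return centroids, counts


if __name__ == "__main__":
    # Toy keys: two near-duplicate pairs, as might occur across similar frames.
    toy_keys = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
    centroids, counts = cluster_tokens(toy_keys)
    print(len(toy_keys), "tokens ->", len(centroids), "clusters", counts)
```

In this toy setting, near-duplicate tokens collapse into one representative each, which is the intuition behind reducing cache memory when consecutive frames are visually similar.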

Generalization of Diffusion Models Arises with a Balanced Representation Space

Authors: Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu

First: 2025-12-24T05:40:40+00:00 · Latest: 2025-12-24T05:40:40+00:00

Comments: 40 pages, 19 figures. The first two authors contributed equally

Abs · PDF · Code1 · Code2

Abstract

Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized "spiky" representations, whereas (ii) generalization arises when the model captures local data statistics, producing "balanced" representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.

中文标题/摘要

标题：扩散模型的泛化能力源于平衡的表示空间

扩散模型在生成高质量、多样化样本方面表现出色，但当过度拟合训练目标时，它们可能会记住训练数据。我们通过表示学习的视角分析扩散模型中的记忆与泛化之间的区别。通过研究两层ReLU去噪自编码器（DAE），我们证明了(i) 记忆对应于模型在编码和解码的学得权重中存储原始训练样本，产生局部的“尖峰”表示，而(ii) 泛化则发生在模型捕捉局部数据统计时，产生“平衡”的表示。此外，我们在现实世界的无条件和文本到图像扩散模型上验证了这些理论发现，展示了这些表示结构在深度生成模型中的重要实践意义。基于这些见解，我们提出了一种基于表示的检测记忆的方法以及一种无需训练的编辑技术，允许通过表示引导实现精确控制。我们的结果共同强调了学习良好表示对于新颖和有意义的生成建模至关重要。

Summary / 总结

This study investigates the generalization capabilities of diffusion models by analyzing the representation learning process. It shows that memorization leads to localized 'spiky' representations, while generalization results in 'balanced' representations. The research validates these findings on real-world models and proposes a method for detecting memorization and a training-free editing technique. The results emphasize the importance of learning good representations for meaningful generative modeling.

研究通过分析表示学习过程，探讨了扩散模型的泛化能力。结果显示，记忆化导致局部的“尖刺”表示，而泛化则产生“平衡”的表示。研究在实际模型上进行了验证，并提出了一种检测记忆化的表示方法和一种无需训练的编辑技术。研究结果强调了学习良好表示对于生成有意义模型的重要性。

Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

Authors: Shengguang Wu, Xiaohan Wang, Yuhui Zhang, Hao Zhu, Serena Yeung-Levy

First: 2025-12-24T04:30:21+00:00 · Latest: 2025-12-24T04:30:21+00:00

Comments: Project Website: https://transductive-visualprogram.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependency than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.

中文标题/摘要

标题：直推式视觉编程：从经验中演化工具库以进行空间推理

在3D场景中的空间推理需要精确的几何计算，这挑战了视觉语言模型的能力。视觉编程通过将问题分解为步骤并调用专门的工具来解决这一问题，但现有方法要么依赖固定的工具集，要么在解决问题之前进行推测性的工具归纳，导致生成的程序效果不佳且归纳出的工具利用不足。我们提出了一种新的框架：直推式视觉编程（TVP），该框架从自身经验而非推测中构建新的工具。TVP 首先使用基本工具解决问题，同时将经验性解决方案积累到示例库中，然后从这些程序中抽象出重复出现的模式，形成可重用的高级工具，从而构建一个不断演化的工具库。这使得TVP能够利用从经验中学到的越来越强大的工具来解决新问题。在Omni3D-Bench上，TVP达到了最先进的性能，比GPT-4o高出22%，比之前最好的视觉编程系统高出11%。我们以直推方式学习得到的工具被用作核心程序依赖的频率是归纳式创建的工具的5倍，表明了更有效的工具发现和重用。演化出的工具还展示了对未见过的空间任务的强大泛化能力，在SpatialScore-Hard集合的基准测试中取得优异表现，而无需针对测试集进行任何特定修改。我们的工作确立了经验驱动的直推式工具创建作为构建自我演化视觉编程代理的强大范式，这些代理能够有效应对具有挑战性的空间推理任务。我们的代码发布于 https://transductive-visualprogram.github.io/。

Summary / 总结

The research aims to improve spatial reasoning in 3D scenes by developing a framework called Transductive Visual Programming (TVP) that evolves tool libraries from experience. TVP first solves problems using basic tools and accumulates solutions into an Example Library. It then abstracts recurring patterns into reusable higher-level tools for an evolving Tool Library. On Omni3D-Bench, TVP outperforms GPT-4o by 22% and the previous best visual programming system by 11%, with transductively learned tools being used 5x more frequently as core program dependencies and showing strong generalization to unseen spatial tasks.

研究旨在通过开发一种称为Transductive Visual Programming (TVP)的框架来改进3D场景中的空间推理，该框架从经验中演化工具库。TVP使用基本工具解决问题，并将解决方案累积到Example Library中，然后从这些程序中抽象出重复出现的模式，形成可重用的高级工具，构成一个不断演化的Tool Library。在Omni3D-Bench上，TVP的性能比GPT-4o高出22%，比之前最佳的视觉编程系统高出11%；以直推方式学习得到的工具被用作核心程序依赖的频率是归纳式创建的工具的5倍，并且无需针对测试集做任何修改即可对未见过的空间任务表现出强大的泛化能力。这项工作确立了经验驱动的直推式工具创建作为构建自我演化视觉编程代理的强大范式，以应对具有挑战性的空间推理任务。
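The TVP abstract describes accumulating solved programs in an Example Library and abstracting recurring patterns into higher-level tools. A toy version of the "find recurring patterns" step might look like the sketch below. It is only an illustration under strong assumptions: programs are modelled as flat lists of tool names, patterns are fixed-length consecutive call sequences, and all tool names are hypothetical. The actual system presumably works on real programs rather than name lists.

```python
from collections import Counter


def recurring_patterns(programs, length=2, min_count=2):
    """Find tool-call sequences of a given length that recur across programs.

    Each pattern is counted at most once per program, so `min_count` means
    "appears in at least this many solved examples".
    """
    counts = Counter()
    for program in programs:
        seen = set()
        for i in range(len(program) - length + 1):
            seen.add(tuple(program[i:i + length]))
        counts.update(seen)
    # Sort for a deterministic result regardless of set iteration order.
    return sorted(p for p, c in counts.items() if c >= min_count)


# Hypothetical example library: each entry is the tool-call trace of one solved problem.
EXAMPLE_LIBRARY = [
    ["detect", "depth", "project_3d", "distance"],
    ["detect", "depth", "project_3d", "compare_size"],
    ["detect", "count"],
]

if __name__ == "__main__":
    for pattern in recurring_patterns(EXAMPLE_LIBRARY, length=3):
        # A recurring pattern is a candidate for a reusable higher-level tool.
        print("candidate tool:", " -> ".join(pattern))
```

The design point the sketch tries to capture is ordering: candidates are derived only after problems have been solved, so every proposed tool is backed by at least `min_count` working programs.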

Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting

Authors: Yoonwoo Jeong, Cheng Sun, Frank Wang, Minsu Cho, Jaesung Choe

First: 2025-12-24T04:16:18+00:00 · Latest: 2025-12-24T04:16:18+00:00

Comments: Will be updated

Abs · PDF · Code1 · Code2

Abstract

Recent advancements in computer vision have successfully extended open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, which causes information loss and thereby degrades segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps. Code will be made publicly available.

中文标题/摘要

标题：分位数渲染：高效嵌入高维特征的3D高斯点绘制

计算机视觉领域的最新进展通过利用3D高斯点绘制（3D-GS）成功地将开放词汇分割（OVS）扩展到了3D领域。尽管取得了这一进展，但高效渲染用于开放词汇查询所需的高维特征仍然面临重大挑战。现有方法使用码本或特征压缩，导致信息丢失，从而降低分割质量。为了解决这一限制，我们提出了分位数渲染（Q-Render），这是一种新颖的3D高斯渲染策略，能够高效处理高维特征同时保持高保真度。与传统的体绘制不同，后者密集地沿每个射线采样所有相交的3D高斯，Q-Render仅稀疏采样沿射线具有主导影响的那些。通过将Q-Render集成到一个通用的3D神经网络中，我们还提出了高斯点绘制网络（GS-Net），该网络以通用的方式预测高斯特征。在ScanNet和LeRF上的广泛实验表明，我们的框架优于最先进的方法，同时能够实现接近43.7倍的加速进行实时渲染。代码将公开提供。

Summary / 总结

The paper introduces Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, Q-Render sparsely samples only those 3D Gaussians with dominant influence along the ray. The authors propose Gaussian Splatting Network (GS-Net), which integrates Q-Render into a generalizable 3D neural network to predict Gaussian features. Experiments on ScanNet and LeRF show that GS-Net outperforms state-of-the-art methods and enables real-time rendering with an approximate 43.7x speedup on 512-D feature maps.

论文提出了Quantile Rendering (Q-Render)，这是一种新型的3D高斯渲染策略，能够高效处理高维特征并保持高保真度。不同于传统的体渲染方法，Q-Render仅稀疏采样沿光线具有主导影响的3D高斯。作者将Q-Render集成到Gaussian Splatting Network (GS-Net)中，以通用的方式预测高斯特征。实验结果表明，该框架在ScanNet和LeRF上优于现有方法，并且能够实现近43.7倍的实时渲染加速，适用于512-D特征图。
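The contrast between dense volume rendering and sparsely sampling the Gaussians with dominant influence along a ray can be sketched as follows. The blending weights use the standard front-to-back alpha compositing formula. The selection rule (pick the Gaussian at which the normalised cumulative weight crosses evenly spaced quantile levels) is an assumption made for this illustration, since the abstract does not state how Q-Render chooses its samples.

```python
def blend_weights(alphas):
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for alpha in alphas:
        weights.append(alpha * transmittance)
        transmittance *= 1.0 - alpha
    return weights


def quantile_select(alphas, num_samples):
    """Indices of Gaussians at which the cumulative weight crosses each quantile level.

    A Gaussian that dominates the ray can be picked several times, which
    gives it a proportionally larger share of the rendered feature.
    """
    weights = blend_weights(alphas)
    total = sum(weights)
    if total == 0.0:
        return []
    levels = [(k + 0.5) / num_samples for k in range(num_samples)]
    picked, cumulative, level_idx = [], 0.0, 0
    for i, w in enumerate(weights):
        cumulative += w / total
        while level_idx < len(levels) and cumulative >= levels[level_idx]:
            picked.append(i)
            level_idx += 1
    return picked


def render_feature(features, picked):
    """Average the high-dimensional features of the selected Gaussians only."""
    if not picked:
        return []
    dim = len(features[0])
    return [sum(features[i][d] for i in picked) / len(picked) for d in range(dim)]


if __name__ == "__main__":
    # Toy ray: four Gaussians sorted front to back, the second one dominates.
    ray_alphas = [0.05, 0.8, 0.05, 0.5]
    print(quantile_select(ray_alphas, num_samples=4))
```

With this rule the cost of feature blending scales with the number of quantile samples rather than with the number of Gaussians on the ray, which is the kind of saving that matters when each feature has hundreds of dimensions.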

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

First: 2025-12-17T18:59:55+00:00 · Latest: 2025-12-24T03:37:34+00:00

Comments: 11 pages, 5 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement (a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench) alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.

中文标题/摘要

标题：DiffusionVL：将任何自回归模型转化为扩散视觉语言模型

在最近的多模态研究中，扩散范式因其独特的解码优势，已成为自回归范式（AR）的有前途的替代方案。然而，由于基础扩散语言模型能力的限制，扩散视觉语言模型（dVLM）的性能仍然远远落后于主流模型。这引发了一个简单而基本的问题：是否可以基于现有的强大自回归模型构建dVLM？为此，我们提出了DiffusionVL，这是一个可以从任何强大自回归模型转换而来的dVLM家族。通过简单的微调，我们成功地将自回归预训练模型适应到扩散范式中。这种方法产生了两个关键观察结果：（1）从基于自回归的多模态模型到扩散的范式转变非常有效。（2）直接将自回归语言模型转换为dVLM也是可行的，其性能与LLaVA风格的视觉指令调优相当。此外，我们引入了一种块解码设计到dVLM中，支持任意长度的生成和KV缓存重用，实现了显著的推理速度提升。我们进行了大量的实验。尽管使用了比先前方法少于5%的数据进行训练，DiffusionVL在MMMU-Pro（视觉）基准上的综合性能提高了34.4%，在MME（认知）基准上的性能提高了37.5%，同时实现了2倍的推理速度提升。模型和代码发布在https://github.com/hustvl/DiffusionVL。

Summary / 总结

DiffusionVL translates existing powerful autoregressive models into diffusion vision language models through simple fine-tuning, achieving significant performance improvements and a 2x inference speedup compared to previous methods. It introduces a block-decoding design that supports arbitrary-length generation and KV cache reuse. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves gains of 34.4% on the MMMU-Pro (vision) bench and 37.5% on the MME (Cog.) bench.

DiffusionVL通过简单的微调将现有的强大自回归模型转化为扩散视觉语言模型(dVLM)，在MMMU-Pro（视觉）基准上取得了34.4%的性能提升，在MME（认知）基准上取得了37.5%的性能提升，并实现了2倍的推理速度提升。这种方法展示了从自回归基于多模态模型到扩散的转变的有效性，并证明了直接将自回归语言模型转换为dVLM的可行性，与LLaVA风格的视觉指令调优相当。此外，引入了一种块解码设计，以支持任意长度的生成和KV缓存重用，进一步提升了性能和速度。

PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

Authors: Seongmin Jung, Seongho Choi, Gunwoo Jeon, Minsu Cho, Jongwoo Lim

First: 2025-12-24T03:18:51+00:00 · Latest: 2025-12-24T03:18:51+00:00

Abs · PDF · Code1 · Code2

Abstract

3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.

中文标题/摘要

标题：PanoGrounder：通过全景场景表示连接2D和3D的基于VLM的3D视觉定位

3D视觉定位（3DVG）是视觉语言感知与机器人技术之间的关键桥梁，需要语言理解与3D场景推理。传统监督模型利用显式的3D几何结构，但由于3D视觉语言数据集稀缺和推理能力有限，其泛化能力有限。我们提出PanoGrounder，这是一种通用的3DVG框架，将多模态全景表示与预训练的2D视觉语言模型结合，以实现强大的视觉语言推理。全景渲染，结合3D语义和几何特征，作为2D和3D之间的中间表示，提供了两大优势：（i）可以直接馈送到视觉语言模型中，无需大量适应；（ii）由于其360度的视野，保留了长距离的物体间关系。我们设计了一个三阶段流水线，考虑场景布局和几何结构放置一组紧凑的全景视点，使用视觉语言模型在每个全景渲染上定位文本查询，并通过提升将每个视点的预测融合为一个3D边界框。我们的方法在ScanRefer和Nr3D上达到了最先进的结果，并展示了对未见过的3D数据集和文本重述的优越泛化能力。

Summary / 总结

PanoGrounder is a framework for 3D Visual Grounding that combines panoramic scene representations with pretrained 2D vision-language models. It uses panoramic renderings, which include 3D semantic and geometric features, to bridge the gap between 2D and 3D, enabling strong vision-language reasoning. The method involves a three-stage pipeline: placing panoramic viewpoints, grounding text queries on each view, and fusing predictions into a 3D bounding box. PanoGrounder achieves state-of-the-art results on ScanRefer and Nr3D and shows better generalization to unseen datasets and text rephrasings.

PanoGrounder 是一种结合全景场景表示与预训练的视觉语言模型的框架，以增强 3D 视觉定位。它使用包含 3D 语义和几何特征的全景渲染作为 2D 和 3D 之间的中间表示，从而实现强大的视觉语言推理。该方法在 ScanRefer 和 Nr3D 上达到了最先进的结果，并且在未见过的数据集和文本重述方面表现出更好的泛化能力。

Benchmarking and Enhancing VLM for Compressed Image Understanding

Authors: Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

First: 2025-12-24T02:59:01+00:00 · Latest: 2025-12-24T02:59:01+00:00

Abs · PDF · Code1 · Code2

Abstract

With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images remains largely unexplored. In this paper, we introduce the first comprehensive benchmark to evaluate VLMs on compressed images, spanning widely used image codecs and a diverse set of tasks and comprising over one million compressed images. Next, we analyse the source of the performance gap by attributing it to a) the information loss during compression and b) the generalisation failure of the VLM. We visualize these gaps with concrete examples and find that, for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. We demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.

中文标题/摘要

标题：视觉语言模型在压缩图像理解中的基准测试与增强

随着视觉语言模型（VLMs）的快速发展及其应用需求的不断增加，高效压缩图像输入变得越来越重要。现有的VLMs主要处理和理解高比特率压缩图像，而它们对低比特率压缩图像的理解能力尚未得到充分探索。在本文中，我们介绍了第一个全面的基准测试，以评估VLM在处理压缩图像方面的能力，涵盖了广泛使用的图像编解码器和多种任务，基准中包含超过一百万张压缩图像。接下来，我们通过将性能差距分为a) 压缩过程中的信息损失和b) VLM的一般化失败来分析性能差距。我们通过具体示例可视化这些差距，并确定对于压缩图像，只能减轻一般化差距。最后，我们提出了一种通用的VLM适配器，以增强现有编解码器压缩图像的模型性能。结果表明，单个适配器可以提高VLM在不同编解码器和比特率图像上的性能10%-30%。我们相信，我们的基准测试和增强方法提供了有价值的见解，并有助于弥合VLMs与压缩图像之间的差距。

Summary / 总结

This paper addresses the challenge of VLMs understanding low-bitrate compressed images, which is less explored than their handling of high-bitrate images. The authors introduce a comprehensive benchmark with over one million compressed images spanning various codecs and tasks. They attribute the performance gap to information loss during compression and to generalization failure of VLMs, and find that only the generalization gap can be mitigated. To enhance VLM performance, they propose a universal adaptor that improves performance by 10%-30% across different codecs and bitrates. This work provides valuable insights for improving VLMs' ability to handle compressed images.

本文旨在解决视觉语言模型（VLM）对图像输入高效压缩的需求，并引入了一个全面的基准来评估VLM在压缩图像上的表现。作者分析了VLM与压缩图像之间的性能差距，发现只有泛化差距可以被缓解。他们提出了一种通用的VLM适配器，可以在不同压缩码率的图像上将VLM性能提高10%-30%。该基准和适配器为增强VLM在实际应用中的表现提供了有价值的见解。

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Authors: Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang

Venue: NeurIPS 2025

First: 2025-09-02T03:01:23+00:00 · Latest: 2025-12-24T01:10:58+00:00

Comments: Accepted by NeurIPS 2025 Dataset and Benchmark Track

Abs · PDF · Code1 · Code2 · Code3

Abstract

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,351 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

中文标题/摘要

标题：RSCC：一种用于灾害事件的大型遥感变化描述数据集

遥感对于灾害监测至关重要，但现有数据集缺乏时间图像对和详细的文本注释。当前资源主要以单张快照图像为主，无法捕捉灾害随时间的变化影响。为解决这一问题，我们引入了遥感变化描述（RSCC）数据集，这是一个包含62,351个灾前/灾后图像对（涵盖地震、洪水、野火等）的大规模基准，每个图像对配有丰富的、类似人类的变更描述。通过在遥感数据中架起时间和语义的桥梁，RSCC 使视觉-语言模型能够进行灾害意识的双时相理解的稳健训练和评估。我们的结果突显了RSCC 在促进详细灾害相关分析方面的能力，为遥感中更准确、可解释和可扩展的视觉-语言应用铺平了道路。代码和数据集可在 https://github.com/Bili-Sakura/RSCC 获取。

Summary / 总结

The RSCC dataset addresses the lack of temporal image pairs and detailed textual annotations in existing disaster monitoring datasets. It consists of 62,351 pre-/post-disaster image pairs with rich change captions, covering various disaster types. This dataset enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding, facilitating detailed disaster-related analysis.

研究旨在解决现有灾害监测数据集中缺乏时间序列图像对和详细文本注释的问题。Remote Sensing Change Caption (RSCC) 数据集包含62,351对灾前/灾后图像及其丰富的变化描述，填补了这一空白。该数据集能够用于训练和评估针对灾害的双时相理解的视觉-语言模型，展示了其在促进详细灾害分析方面的能力。

Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

Authors: Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira

First: 2025-12-23T23:30:56+00:00 · Latest: 2025-12-23T23:30:56+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at https://github.com/kmdavidds/mlfastlm.

中文标题/摘要

标题：输入自适应视觉预处理以提高快速视觉-语言模型推理效率

视觉-语言模型（VLMs）在多模态推理任务中表现出强大的性能，但由于高推理延迟和计算成本，其部署仍然具有挑战性，尤其是在处理高分辨率视觉输入时。尽管最近的架构如FastVLM通过优化视觉编码器提高了效率，但现有的管道仍然依赖于静态视觉预处理，导致对于视觉简单的输入存在冗余计算。在本文中，我们提出了一种自适应视觉预处理方法，该方法根据图像内容特征动态调整输入分辨率和空间覆盖范围。所提出的方法结合了内容感知图像分析、自适应分辨率选择和内容感知裁剪，以在视觉编码前减少视觉冗余。重要的是，该方法与FastVLM集成，无需修改其架构或重新训练。我们在DocVQA数据集的子集上仅在推理设置中评估了所提出的方法，重点关注效率导向的指标。实验结果表明，自适应预处理将每张图像的推理时间减少了超过50%，降低了平均完整生成时间，并且与基线管道相比，视觉标记数量减少了超过55%。这些发现表明，输入感知预处理是一种有效且轻量级的策略，可以提高视觉-语言模型的部署效率。为了便于可重复性，我们的实现作为FastVLM仓库的分支提供，包含所提出方法的文件，并可在https://github.com/kmdavidds/mlfastlm/获得。

Summary / 总结

This work addresses the challenge of high inference latency in Vision-Language Models (VLMs) by proposing an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content. This method, integrated with FastVLM without requiring retraining, reduces per-image inference time by over 50%, lowers mean full generation time, and decreases visual token count by more than 55% compared to the baseline pipeline. These findings highlight the effectiveness of input-aware preprocessing in improving the efficiency of VLMs for deployment.

本文提出了一种自适应视觉预处理方法，该方法根据图像内容动态调整输入分辨率和空间覆盖范围，以解决视觉语言模型（VLMs）的高推理延迟问题。该方法无需重新训练即可与FastVLM集成；与基线管道相比，它将每张图像的推理时间减少了50%以上，降低了平均完整生成时间，并使视觉标记数量稳定减少55%以上。
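The three steps named in the abstract (content-aware analysis, adaptive resolution selection, content-aware cropping) can be illustrated with a minimal sketch. This is not the authors' implementation: the complexity measure (mean absolute difference between neighbouring pixels), the thresholds, the resolution tiers, and all function names are hypothetical and chosen only to show the control flow.

```python
def edge_density(gray):
    """Mean absolute difference between adjacent pixels (hypothetical complexity score).

    `gray` is a list of rows of intensities in [0, 255].
    """
    h, w = len(gray), len(gray[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:  # horizontal neighbour
                total += abs(gray[y][x + 1] - gray[y][x])
                count += 1
            if y + 1 < h:  # vertical neighbour
                total += abs(gray[y + 1][x] - gray[y][x])
                count += 1
    return total / count if count else 0.0


def select_resolution(gray, tiers=((5.0, 256), (20.0, 512)), max_side=1024):
    """Return a target side length: simpler images get a smaller resolution.

    `tiers` holds (density threshold, side length) pairs; the values are assumptions.
    """
    density = edge_density(gray)
    for threshold, side in tiers:
        if density < threshold:
            return side
    return max_side


def content_bbox(gray, background=255, tolerance=8):
    """Bounding box (top, left, bottom, right) of pixels that differ from the background."""
    rows = [y for y, row in enumerate(gray)
            if any(abs(p - background) > tolerance for p in row)]
    cols = [x for x in range(len(gray[0]))
            if any(abs(row[x] - background) > tolerance for row in gray)]
    if not rows or not cols:  # nothing but background: keep the full frame
        return (0, 0, len(gray), len(gray[0]))
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)
```

In a pipeline like the one described, the crop box and the selected side length would be applied before the image reaches the vision encoder, which is why no retraining is needed.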

VL4Gaze: Unleashing Vision-Language Models for Gaze Following

Authors: Shijing Wang, Chaoqun Cui, Yaping Huang, Hyung Jin Chang, Yihua Cheng

First: 2025-12-23T19:47:11+00:00 · Latest: 2025-12-23T19:47:11+00:00

Abstract

Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.

中文标题/摘要

标题：VL4Gaze：释放视觉语言模型的凝视跟随潜力

人类凝视为理解视觉场景中的注意力、意图和社会互动提供了关键线索，但在当前的视觉语言模型（VLMs）中，凝视理解在很大程度上尚未得到探索。尽管最近的VLMs在一系列视觉任务中实现了强大的场景级推理，但尚无基准系统地评估或训练它们进行凝视解释，因此凝视理解能否从通用视觉语言预训练中自然涌现仍是一个悬而未决的问题。为填补这一空白，我们引入了VL4Gaze，这是首个旨在研究、评估和释放VLMs凝视理解潜力的大规模基准。VL4Gaze包含覆盖124K张图像的489K个自动生成的问答对，并通过四个互补任务将凝视理解统一表述为一个VQA问题：(1) 凝视对象描述，(2) 凝视方向描述，(3) 凝视点定位，(4) 模糊问题识别。我们在上下文学习和微调两种设置下全面评估了商业和开源VLMs。结果表明，在没有特定任务监督的情况下，即使是大规模VLMs也难以可靠地推断凝视语义和空间定位。相比之下，在VL4Gaze上训练在所有任务上都带来了显著且一致的改进，突显了针对性多任务监督对于培养VLMs凝视理解能力的重要性。我们将发布数据集和代码，以支持该方向的进一步研究和开发。

Summary / 总结

VL4Gaze introduces the first large-scale benchmark for evaluating and training vision-language models (VLMs) on gaze understanding, a capability that no existing benchmark systematically covers. It contains 489K automatically generated question-answer pairs across 124K images, organized into four tasks: gaze object description, gaze direction description, gaze point location, and ambiguous question recognition. Evaluations show that even large-scale VLMs struggle to infer gaze semantics and spatial localization without task-specific supervision, whereas training on VL4Gaze brings substantial and consistent improvements across all tasks, underscoring the need for targeted multi-task supervision.

VL4Gaze 提出了一个新的基准，用于评估和训练视觉-语言模型（VLMs）的凝视理解能力，而凝视是解读视觉场景中注意力、意图和社会互动的关键线索。该基准包含覆盖 124K 张图像的 489K 个自动生成的问答对，并通过四个任务加以表述：凝视对象描述、凝视方向描述、凝视点定位和模糊问题识别。评估结果显示，大型 VLMs 在没有特定任务监督的情况下难以可靠地推断凝视语义和空间定位，而在 VL4Gaze 上训练则在所有任务上都带来了显著且一致的改进，突显了针对性多任务监督对于培养 VLMs 凝视理解能力的重要性。

FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models

Authors: Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang

First: 2025-12-23T18:05:43+00:00 · Latest: 2025-12-23T18:05:43+00:00

Comments: Under submission

Abstract

Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8% of visual tokens on LLaVA 1.5, and maintaining 92.8% accuracy even under 94.4% compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.

中文标题/摘要

标题：FlashVLM：面向大型多模态模型的文本引导视觉标记选择

大型视觉-语言模型（VLMs）通常对每张图像或视频帧处理数百乃至数千个视觉标记，导致二次方的注意力开销和大量冗余。现有的标记削减方法往往忽视文本查询，或依赖深层注意力图，而后者在激进剪枝下不稳定，导致语义对齐下降。我们提出FlashVLM，一种文本引导的视觉标记选择框架，能够根据查询动态调整视觉输入。FlashVLM 不依赖嘈杂的注意力权重，而是在语言模型空间中计算投影后的图像标记与归一化文本嵌入之间的显式跨模态相似度。这种外在相关性通过对数域加权和温度控制的锐化与内在视觉显著性融合。此外，一种保持多样性的划分保留了少量但具有代表性的背景标记，以维持全局上下文。在相同的标记预算和评估协议下，FlashVLM 实现了超越无损的压缩：在LLaVA 1.5上剪除高达77.8%的视觉标记时仍略优于未剪枝的基线，即使在94.4%的压缩率下也能保持92.8%的准确率。在14个图像和视频基准上的大量实验表明，FlashVLM 在主流VLMs上实现了最先进的效率与性能权衡，同时保持了强大的鲁棒性和泛化能力。

Summary / 总结

FlashVLM is a text-guided visual token selection framework that dynamically adapts visual inputs to textual queries by computing explicit cross-modal similarities and fusing them with intrinsic visual saliency. It achieves beyond lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8% of visual tokens on LLaVA 1.5, and maintaining 92.8% accuracy even under 94.4% compression. Extensive experiments on 14 image and video benchmarks show that FlashVLM offers state-of-the-art efficiency-performance trade-offs while maintaining robustness and generalization across mainstream VLMs.

FlashVLM 是一种文本引导的视觉标记选择框架，通过减少大型视觉-语言模型处理的视觉标记数量来提高效率。它计算图像标记与文本嵌入之间的跨模态相似度，将其与视觉显著性融合，并保留一组多样化的背景标记。在 LLaVA 1.5 上，FlashVLM 剪除最多 77.8% 的视觉标记时仍略优于未剪枝基线，在 94.4% 的压缩率下仍能保持 92.8% 的准确率；在 14 个图像和视频基准上的实验表明其实现了最先进的效率与性能权衡。
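The selection recipe in the abstract can be sketched as follows. This is one possible reading under stated assumptions, not the FlashVLM implementation: the softmax form of the temperature-controlled sharpening, the fusion weight `alpha`, the evenly spaced background picks, and every function name are hypothetical. The only outside fact used is that LLaVA 1.5 encodes an image into 576 visual tokens, so the reported pruning rates correspond to budgets of roughly 128 and 32 tokens.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def softmax(scores, temperature):
    """Temperature-scaled softmax; a lower temperature sharpens the distribution."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kept_budget(total_tokens, pruned_fraction):
    """Number of tokens left after pruning the given fraction."""
    return round(total_tokens * (1 - pruned_fraction))


def select_tokens(tokens, text_emb, saliency, budget,
                  alpha=0.5, temperature=0.1, n_background=1):
    """Return the sorted indices of the visual tokens to keep.

    tokens:   projected image tokens (list of vectors)
    text_emb: pooled text embedding in the same space
    saliency: positive per-token saliency scores (assumed to be given)
    """
    eps = 1e-8
    relevance = softmax([cosine(t, text_emb) for t in tokens], temperature)
    # Log-domain weighted fusion of extrinsic relevance and intrinsic saliency.
    fused = [alpha * math.log(r + eps) + (1 - alpha) * math.log(s + eps)
             for r, s in zip(relevance, saliency)]
    order = sorted(range(len(tokens)), key=lambda i: fused[i], reverse=True)
    n_foreground = max(budget - n_background, 0)
    keep = order[:n_foreground]
    # Background partition: evenly spaced picks (by position) from the rest.
    rest = sorted(order[n_foreground:])
    n_bg = min(budget - len(keep), len(rest))
    if n_bg > 0:
        step = len(rest) / n_bg
        keep += [rest[int(i * step)] for i in range(n_bg)]
    return sorted(keep)
```

Reserving part of the budget for background tokens trades a little query relevance for global context, which matches the motivation given for the diversity-preserving partition.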

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models

Authors: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi

First: 2025-12-23T17:56:36+00:00 · Latest: 2025-12-23T17:56:36+00:00

Abstract

Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolution of object geometry and relationships in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.

中文标题/摘要

标题：在四维中学习推理：视觉语言模型的动态空间理解

视觉-语言模型（VLM）在一般理解方面表现出色，但在动态空间推理（DSR），即在三维空间中随时间推移对物体几何形状和关系的推理方面仍然较弱，这主要是由于缺乏可扩展的四维感知训练资源。为了在数据集、基准和模型的各个方面弥合这一差距，我们引入了DSR套件。首先，我们提出了一种自动流水线，从野外视频中生成DSR的多项选择题-答案对。通过利用现代视觉基础模型，该流水线提取了丰富的几何和运动信息，包括相机姿态、局部点云、物体掩码、方向和三维轨迹。这些几何线索使DSR-Train得以构建，进一步通过人工精炼构建DSR-Bench用于评估。与以往工作相比，我们的数据强调（i）野外视频来源，（ii）物体和场景级别的三维要求，（iii）视角变换，（iv）多物体交互，以及（v）细粒度、程序化的答案。除了数据，我们还提出了一种轻量级的几何选择模块（GSM），以无缝地将几何先验整合到VLM中，该模块压缩了问题语义，并从预训练的四维重建先验中提取与问题相关的信息，形成一组紧凑的几何标记。这种有针对性的提取避免了向模型灌输无关知识。实验表明，将DSR-Train和GSM集成到Qwen2.5-VL-7B中显著增强了其动态空间推理能力，同时在通用视频理解基准测试中保持了准确性。

Summary / 总结

The research aims to improve vision-language models' ability in dynamic spatial reasoning (DSR) by addressing the scarcity of 4D-aware training resources. It introduces DSR Suite, which includes an automated pipeline for generating DSR question-answer pairs from in-the-wild videos and a lightweight Geometry Selection Module (GSM) to integrate geometric priors into VLMs. The key findings show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.

论文通过引入DSR Suite解决视觉语言模型（VLM）在动态空间推理（DSR）方面的局限性。DSR Suite包括一条从真实场景（in-the-wild）视频自动生成DSR问答对的流水线，以及一个轻量级的几何选择模块（GSM），用于将几何先验整合到VLM中。关键发现表明，将DSR-Train和GSM整合到Qwen2.5-VL-7B中可以显著提高其动态空间推理能力，同时保持其在通用视频理解基准上的准确性。

Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios

Authors: Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li

Venue: WACV 2026

First: 2025-12-23T17:55:35+00:00 · Latest: 2025-12-23T17:55:35+00:00

Comments: Accepted to WACV 2026

Abstract

Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.

中文标题/摘要

标题：多粒度文本引导图像融合以应对多曝光和多焦点场景

图像融合旨在从在具有挑战性条件下拍摄的一对输入中合成一张高质量的图像，例如不同的曝光水平或焦深。核心挑战在于有效处理输入之间的动态范围和焦深差异。随着视觉语言模型的出现，最近的方法将文本描述作为辅助指导以提高融合质量。然而，简单地引入粗粒度描述会阻碍对细粒度细节的理解，并且对精确的跨模态对齐构成挑战。为了解决这些限制，我们提出了多粒度文本引导图像融合（MTIF），这是一种具有三个关键设计的新型融合范式。首先，它引入了多粒度文本描述，分别捕捉细粒度细节、结构线索和语义内容，通过分层跨模态调制模块引导图像融合。其次，它在每个粒度级别引入监督信号，以促进视觉和文本特征之间的对齐并增强辅助文本的实用性。第三，它采用显著性驱动的增强模块，用密集的语义内容扩充训练数据，进一步加强跨模态调制和对齐。广泛的实验表明，MTIF在多曝光和多焦点图像融合任务中始终优于先前的方法。

Summary / 总结

The research aims to improve image fusion quality by addressing challenges in handling dynamic range and focus depth disparities between inputs. The proposed Multi-grained Text-guided Image Fusion (MTIF) introduces a hierarchical cross-modal modulation module using multi-grained textual descriptions to capture fine details, structural cues, and semantic content. Supervision signals at each granularity and a saliency-driven enrichment module further enhance cross-modal alignment. Experiments demonstrate that MTIF outperforms existing methods in both multi-exposure and multi-focus image fusion tasks.

论文提出了一种多粒度文本引导图像融合方法（MTIF），以解决多曝光和多焦点场景下的图像融合问题。MTIF通过分层的跨模态调制模块使用多粒度的文本描述来引导图像融合，通过每个粒度上的监督信号增强视觉和文本特征的对齐，并采用显著性驱动的增强模块，用密集的语义内容扩充训练数据。实验表明，MTIF在多曝光和多焦点图像融合任务中均优于先前的方法。

Video Generation Models Are Good Latent Reward Models

Authors: Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang

First: 2025-11-26T16:14:18+00:00 · Latest: 2025-12-23T15:17:06+00:00

Abstract

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

中文标题/摘要

标题：视频生成模型是良好的潜在空间奖励模型

奖励反馈学习（ReFL）已被证明对于使图像生成与人类偏好对齐非常有效。然而，将其扩展到视频生成面临着重大挑战。现有的视频奖励模型依赖于为像素空间输入设计的视觉语言模型，这将ReFL优化限制在计算代价高昂的VAE解码之后、接近完全去噪的步骤中。这种像素空间的方法会产生大量的内存开销并增加训练时间，而且其后期优化缺乏早期监督，只能改进视觉质量，而无法改进基本的运动动态和结构一致性。在本文中，我们展示了预训练的视频生成模型天然适合在噪声潜在空间中进行奖励建模，因为它们本身就被设计用于处理任意时间步的噪声潜在表示，并通过其序列建模能力内在地保留时间信息。因此，我们提出了过程奖励反馈学习（PRFL）框架，该框架完全在潜在空间中进行偏好优化，从而在整个去噪链中实现高效的梯度反向传播，而无需VAE解码。广泛的实验表明，PRFL显著提高了与人类偏好的对齐程度，同时与RGB ReFL相比大幅降低了内存消耗和训练时间。

Summary / 总结

This work addresses the challenge of applying reward feedback learning (ReFL) to video generation by proposing Process Reward Feedback Learning (PRFL). PRFL leverages pre-trained video generation models to optimize preferences in the latent space, avoiding the need for computationally expensive VAE decoding. This approach leads to better alignment with human preferences and reduces memory consumption and training time compared to traditional pixel-space ReFL methods.

本文通过提出Process Reward Feedback Learning (PRFL)解决了将奖励反馈学习（ReFL）应用于视频生成的挑战。PRFL利用预训练的视频生成模型在噪声的潜在空间中优化偏好，避免了昂贵的VAE解码。这种方法使得与人类偏好的对齐效果更好，同时减少了内存消耗和训练时间，相比传统的像素空间ReFL方法有显著优势。

Scaling Laws for Energy Efficiency of Local LLMs

Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús

First: 2025-12-18T13:40:33+00:00 · Latest: 2025-12-23T15:02:39+00:00

Abstract

Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.

中文标题/摘要

标题：本地LLM能效的标度律

在边缘设备上部署本地大型语言模型和视觉-语言模型，需要在准确性与受限的计算和能源预算之间取得平衡。尽管图形处理器主导了现代人工智能部署，但大多数消费级硬件（包括笔记本电脑、台式机、工业控制器和嵌入式系统）仍依赖中央处理器。尽管如此，对于本地语言和视觉-语言工作负载，仅使用中央处理器进行推理的计算规律在很大程度上仍未得到探索。我们在两个广泛用于本地推理的代表性中央处理器层级上系统地对大型语言模型和视觉-语言模型进行了基准测试：MacBook Pro M2，代表主流笔记本电脑级部署；Raspberry Pi 5，代表受限的低功耗嵌入式环境。我们采用统一的方法，对处理器和内存使用情况进行连续采样并结合曲线下面积积分，刻画了语言模型的计算负载如何随输入文本长度变化，以及视觉-语言模型的计算负载如何随图像分辨率变化。我们发现了两条经验标度律：（1）语言模型推理的计算成本大致随标记长度线性增长；（2）视觉-语言模型表现出由预处理驱动的“分辨率拐点”，即计算量在内部分辨率上限以上保持恒定，在其以下则急剧下降。除这些规律外，我们还表明，量子启发式压缩可将处理器和内存使用量最多降低71.9%，能耗最多降低62%，同时保持或提高语义准确性。这些结果系统地量化了本地语言和视觉-语言工作负载在仅使用中央处理器时的多模态标度规律，并指出模型压缩和输入分辨率预处理是实现可持续边缘推理的有效且低成本的手段。
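The measurement recipe in this abstract (sample usage continuously, integrate the area under the curve, then fit against input size) can be sketched in a few lines. This is an illustrative sketch only: the trapezoidal rule, the least-squares fit, the quadratic dependence on side length below the knee, the 448-pixel clamp, and all function names are assumptions, not details taken from the paper.

```python
def area_under_curve(times, values):
    """Trapezoidal integral of sampled `values` (e.g. CPU utilisation) over `times`."""
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:]))


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def relative_compute(side, clamp=448):
    """Toy model of the resolution knee.

    Images larger than the (hypothetical) internal clamp are downscaled to it,
    so compute stays flat above the clamp; below it, compute is assumed to
    scale with pixel count.
    """
    effective = min(side, clamp)
    return (effective / clamp) ** 2
```

Fitting `linear_fit` to (token length, integrated load) pairs is one way to check the first law, and evaluating `relative_compute` on either side of the clamp reproduces the flat-then-falling shape described by the second.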

Chain-of-Anomaly Thoughts with Large Vision-Language Models

Authors: Pedro Domingos, João Pereira, Vasco Lopes, João Neves, David Semedo

First: 2025-12-23T15:01:05+00:00 · Latest: 2025-12-23T15:01:05+00:00

Comments: 2 pages, 3 figures, 1 table. Accepted for RECPAD 2025

Abstract

Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.

中文标题/摘要

标题：基于大型视觉语言模型的异常思维链

基于大型视觉语言模型的自动化视频监控受限于模型对正常情况的固有偏见，往往无法检测到犯罪。虽然思维链（Chain-of-Thought）推理策略在语言任务中显示出显著的性能提升潜力，但其推理中缺乏归纳性的异常偏见，这进一步使模型倾向于正常解释。为了解决这一问题，我们提出了异常思维链（Chain-of-Anomaly-Thoughts，CoAT），这是一种多智能体推理框架，通过最终的、以异常为中心的分类层在推理过程中引入归纳犯罪偏见。我们的方法显著改进了异常检测，在具有挑战性的低分辨率视频上将F1分数提高了11.8个百分点，并在高分辨率视频上将异常分类提高了3.78个百分点。

Summary / 总结

The paper addresses the tendency of large vision-language models to miss crimes in automated video surveillance because of their bias towards normality. It proposes Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias through a final anomaly-focused classification layer. The method improves Anomaly Detection, boosting F1-score by 11.8 percentage points on challenging low-resolution footage, and improves Anomaly Classification by 3.78 percentage points in high-resolution videos.

论文针对大型视觉-语言模型在自动视频监控中因偏向正常性而难以检测犯罪的局限性，提出了Chain-of-Anomaly-Thoughts (CoAT) 多智能体推理框架，通过最终的、以异常为中心的分类层在推理过程中引入归纳犯罪偏见。该方法使低分辨率视频上异常检测的F1分数提高了11.8个百分点，高分辨率视频上的异常分类提高了3.78个百分点。

LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Authors: Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang

First: 2025-08-01T09:51:54+00:00 · Latest: 2025-12-23T14:27:42+00:00

Comments: 8 pages, 5 figures, 3 tables

Abstract

In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.

中文标题/摘要

标题：LAMIC：借助多模态扩散Transformer的可扩展性实现布局感知的多图像合成

在可控图像合成中，从多个参考中生成具有空间布局意识的连贯且一致的图像仍然是一个开放的挑战。我们提出了LAMIC，一种布局感知的多图像合成框架，首次以无需训练的方式将单参考扩散模型扩展到多参考场景。基于MMDiT模型，LAMIC引入了两种即插即用的注意力机制：1）组隔离注意力（GIA）以增强实体分离；2）区域调节注意力（RMA）以实现布局感知生成。为了全面评估模型能力，我们进一步引入了三个指标：1）包含比（IN-R）和填充比（FI-R）以评估布局控制；2）背景相似度（BG-S）以衡量背景一致性。大量实验表明，LAMIC在大多数主要指标上均取得了最先进的性能：在所有设置中，它在ID-S、BG-S、IN-R和AVG得分上始终优于现有的多参考基线，并在复杂合成任务中实现了最佳的DPG。这些结果表明，LAMIC在保持身份、保留背景、布局控制和遵循提示方面具有优越的能力，所有这些均无需任何训练或微调，展示了强大的零样本泛化能力。通过继承先进的单参考模型的优势并使其无缝扩展到多图像场景，LAMIC为可控多图像合成建立了一个新的无需训练的范式。随着基础模型的不断进化，LAMIC的性能预计会相应扩展。我们的实现可在以下链接获取：https://github.com/Suchenl/LAMIC。

Summary / 总结

LAMIC is a Layout-Aware Multi-Image Composition framework that extends single-reference diffusion models to multi-reference scenarios without training. Built on the MMDiT model, it introduces two plug-and-play mechanisms: Group Isolation Attention (GIA) for entity disentanglement and Region-Modulated Attention (RMA) for layout-aware generation. It also introduces three metrics: Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for layout control, and Background Similarity (BG-S) for background consistency. LAMIC consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores across all settings and achieves the best DPG on complex composition tasks, indicating strong identity keeping, background preservation, layout control, and prompt following.

LAMIC 是一种布局感知的多图像合成框架，它将单参考扩散模型扩展到多参考场景，无需训练。它基于MMDiT模型引入了两种即插即用的机制：组隔离注意力（GIA）用于实体分离，区域调节注意力（RMA）用于布局感知生成。论文还引入了三个指标：用于评估布局控制的包含比（IN-R）和填充比（FI-R），以及用于衡量背景一致性的背景相似度（BG-S）。在所有设置中，LAMIC 在 ID-S、BG-S、IN-R 和 AVG 得分上始终优于现有的多参考基线，并在复杂合成任务中取得了最佳的 DPG，展示了出色的身份保持、背景保留、布局控制和提示遵循能力。