arXiv Paper Digest / arXiv 论文速递

2025-12-28 03:32
Snapshot: 20251228_0332
Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00
Comments: Project page: https://sytwu.github.io/BeyondMemo/
Abstract
We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
中文标题/摘要
标题:超越记忆:多模态序数回归基准以揭示视觉语言模型中的流行度偏差
我们揭示了最先进的视觉语言模型(VLMs)中存在显著的流行度偏差,这些模型在著名建筑上的准确率比普通建筑高出34%,表明它们依赖于记忆而非可泛化的理解。为了系统地研究这一问题,我们引入了该任务上最大的公开基准数据集:YearGuessr数据集,包含来自157个国家的55,546张建筑图像,具有多模态属性,并附有其建设年份的连续序数标签(1001-2024)、GPS数据和页面浏览量作为流行度的代理。使用该数据集,我们将建筑年份预测任务框架化为序数回归,并引入了流行度感知的区间准确度指标来量化这种偏差。我们基准测试的30多种模型,包括我们的YearCLIP模型,证实了VLMs在流行、记忆化的项目上表现出色,但在未识别的主题上却面临重大挑战,揭示了它们推理能力中的关键缺陷。项目页面:https://sytwu.github.io/BeyondMemo/
Summary / 总结
The paper addresses popularity bias in state-of-the-art vision-language models (VLMs), showing that they achieve up to 34% higher accuracy on famous buildings than on ordinary ones. To investigate this systematically, the authors created the YearGuessr dataset of 55,546 building images from 157 countries, annotated with construction years, GPS data, and page-view counts. Using this dataset, they frame construction-year prediction as ordinal regression and introduce popularity-aware interval accuracy metrics, confirming that VLMs excel on popular items but struggle with unrecognized subjects, which highlights a critical flaw in their reasoning capabilities.
该研究揭示了最先进的视觉-语言模型(VLMs)中存在的流行度偏差,表明它们在著名建筑上的准确率比普通建筑最多高出34%。为了系统地研究这一问题,作者创建了包含55,546张来自157个国家的建筑图像的YearGuessr数据集,这些图像被标注了建造年份、GPS数据和页面浏览量。使用该数据集,他们将建造年份预测表述为序数回归,并引入流行度感知的区间准确率指标,证实了VLMs在流行对象上表现出色,但在未被识别的对象上表现明显下降,揭示了其推理能力的关键缺陷。
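Editor's sketch: one way a popularity-aware interval accuracy could be computed. It assumes that interval accuracy means the share of predictions within ±k years of the true construction year, and that popularity is split by a page-view threshold. The function names, the threshold, and the toy values are hypothetical and are not taken from the paper.

```python
def interval_accuracy(pred_years, true_years, k):
    """Share of predictions that fall within +/- k years of the label."""
    if not true_years:
        return 0.0
    hits = sum(abs(p - t) <= k for p, t in zip(pred_years, true_years))
    return hits / len(true_years)


def popularity_gap(pred_years, true_years, page_views, k, view_threshold):
    """Interval accuracy for popular vs. ordinary buildings, and their gap."""
    popular = [(p, t) for p, t, v in zip(pred_years, true_years, page_views)
               if v >= view_threshold]
    ordinary = [(p, t) for p, t, v in zip(pred_years, true_years, page_views)
                if v < view_threshold]
    acc_popular = interval_accuracy([p for p, _ in popular],
                                    [t for _, t in popular], k)
    acc_ordinary = interval_accuracy([p for p, _ in ordinary],
                                     [t for _, t in ordinary], k)
    return acc_popular, acc_ordinary, acc_popular - acc_ordinary


# Toy values for illustration only (not results from the benchmark).
pred = [1890, 1700, 1955, 2001]
true = [1889, 1750, 1950, 1960]
views = [90000, 120, 50000, 300]
result = popularity_gap(pred, true, views, k=5, view_threshold=1000)
```

A positive gap in the third element would indicate that the model is more accurate on high-page-view buildings than on the rest.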
LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation
Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov
First: 2025-12-24T15:36:21+00:00 · Latest: 2025-12-24T15:36:21+00:00
Abstract
Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between graph construction and task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page: https://lookplangraph.github.io
中文标题/摘要
标题:LookPlanGraph:基于VLM图增强的具身指令跟随方法
使用大型语言模型(LLM)作为规划器的方法在具身指令跟随任务中已十分普遍。为了成功完成任务,LLM 必须立足于机器人所处的环境。一种解决方案是使用包含所有必要信息的场景图。现有方法依赖于预先构建的场景图,并假设在规划开始时所有任务相关信息都已可用。然而,这些方法没有考虑到在图构建和任务执行之间环境可能发生的变化。我们提出了 LookPlanGraph 方法,该方法利用由静态资产和对象先验组成的场景图。在计划执行过程中,LookPlanGraph 通过验证现有先验或发现新实体,不断用相关对象更新场景图。这是通过使用视觉语言模型处理智能体的第一人称相机视图来实现的。我们在对象位置发生变化的 VirtualHome 和 OmniGibson 模拟环境中进行了实验,证明 LookPlanGraph 优于基于预定义静态场景图的方法。为了展示我们方法的实际适用性,我们还在真实环境中进行了实验。此外,我们引入了 GraSIF(用于指令跟随的图场景)数据集及其自动验证框架,包含来自 SayPlan Office、BEHAVIOR-1K 和 VirtualHome RobotHow 的 514 个任务。项目页面:https://lookplangraph.github.io
Summary / 总结
The research aims to improve embodied instruction following by addressing a limitation of static scene graphs, which do not account for environmental changes. LookPlanGraph uses a scene graph composed of static assets and object priors and updates it during execution by processing the agent's egocentric view with a Vision Language Model. Experiments in simulated environments with changed object positions show that LookPlanGraph outperforms methods relying on predefined static scene graphs, and real-world experiments demonstrate its practical applicability.
研究旨在通过解决静态场景图不考虑环境变化的局限性,提高基于指令的机器人操作能力。LookPlanGraph 使用包含静态资产和对象先验的场景图,并通过视觉语言模型处理代理的主观视角来在任务执行期间不断更新场景图。实验在模拟和真实环境中表明,LookPlanGraph 在对象位置发生变化的情况下比依赖预定义静态场景图的方法表现更好。
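Editor's sketch: the update step described in the abstract (verifying an existing prior or discovering a new entity) can be illustrated with a plain dictionary as the scene graph. The data layout and the function name are hypothetical, and the list of seen objects stands in for whatever the Vision Language Model would report from the egocentric view; this is not the authors' implementation.

```python
def update_scene_graph(graph, location, seen_objects):
    """Return a copy of `graph` refreshed with what was seen at `location`.

    graph: dict mapping object name -> {"location": str, "verified": bool}
    seen_objects: object names reported for the current view
                  (a stand-in for VLM output, assumed for illustration)
    """
    updated = {name: dict(node) for name, node in graph.items()}
    for name in seen_objects:
        # One assignment covers both cases: an existing prior is confirmed
        # (or relocated), and an unknown object is added as a new entity.
        updated[name] = {"location": location, "verified": True}
    return updated


# Toy priors for illustration only.
priors = {
    "mug": {"location": "kitchen_table", "verified": False},
    "book": {"location": "shelf", "verified": False},
}
graph = update_scene_graph(priors, "kitchen_table", ["mug", "apple"])
```

Objects that were not observed keep their unverified prior, so a planner could still choose to look for them later.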
Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval
Authors: Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Tran Chi Nguyen
Venue: MM
First: 2025-12-24T15:02:33+00:00 · Latest: 2025-12-24T15:02:33+00:00
Comments: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables
Abstract
Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval
中文标题/摘要
标题:利用轻量级实体提取实现可扩展的基于事件的图像检索
从自然语言描述中检索图像是一项核心任务,位于计算机视觉和自然语言处理的交叉点,广泛应用于搜索引擎、媒体归档和数字内容管理等领域。然而,由于模糊或依赖上下文的查询、语言的多变性以及需要可扩展的解决方案,现实世界中的图像-文本检索仍然具有挑战性。在本文中,我们提出了一种轻量级的两阶段检索管道,利用事件中心的实体提取来结合现实世界标题中的时间与上下文信号。第一阶段使用BM25基于显著实体进行高效的候选过滤,而第二阶段则应用BEiT-3模型来捕捉深层次的多模态语义并重新排序结果。在OpenEvents v1基准上评估,我们的方法达到了0.559的平均平均精度,显著优于先前的基线。这些结果突显了结合事件引导的过滤与长文本视觉语言建模在复杂现实场景中实现准确高效检索的有效性。我们的代码可在https://github.com/PhamPhuHoa-23/Event-Based-Image-Retrieval 获取。
Summary / 总结
This paper addresses the challenge of retrieving images from natural language descriptions by proposing a lightweight two-stage retrieval pipeline. The first stage filters candidates based on salient entities using BM25, and the second stage uses BEiT-3 models to capture deep multimodal semantics and rerank the results. The method achieves a mean average precision of 0.559 on the OpenEvents v1 benchmark, outperforming previous approaches and demonstrating the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in real-world scenarios.
该研究通过提出一种轻量级的两阶段检索管道来解决从自然语言描述中检索图像的挑战。第一阶段使用基于显著实体的BM25进行高效的候选过滤,第二阶段则使用BEiT-3模型来捕捉深度多模态语义并重新排序结果。该方法在OpenEvents v1基准测试中实现了0.559的平均精度均值(mAP),显著优于之前的基线方法,在事件驱动的图像检索场景中表现出色。
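Editor's sketch: a two-stage filter-then-rerank pipeline in miniature. Stage one scores entity lists with a standard Okapi BM25 formula; stage two reranks the shortlist with a caller-supplied scoring function. The Jaccard reranker below is only a placeholder for a multimodal model such as BEiT-3, and all names, parameters, and toy captions are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query terms."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            norm = tf[q] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


def two_stage_retrieve(query_terms, docs, rerank_fn, top_k=2):
    """Stage 1: keep the top_k BM25 candidates. Stage 2: rerank them."""
    scores = bm25_scores(query_terms, docs)
    shortlist = sorted(range(len(docs)), key=lambda i: scores[i],
                       reverse=True)[:top_k]
    return sorted(shortlist, key=lambda i: rerank_fn(query_terms, docs[i]),
                  reverse=True)


def jaccard(query_terms, doc):
    """Placeholder reranker; a real system would call a multimodal model."""
    q, d = set(query_terms), set(doc)
    return len(q & d) / len(q | d)


# Toy entity lists extracted from captions (illustrative only).
docs = [
    ["paris", "olympics", "2024", "opening"],
    ["tokyo", "olympics", "2021"],
    ["paris", "flood", "2016"],
]
ranking = two_stage_retrieve(["paris", "olympics"], docs, jaccard, top_k=2)
```

Keeping the expensive model out of stage one is what makes the design scale: only `top_k` candidates ever reach the reranker.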
RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu
First: 2025-12-24T15:01:26+00:00 · Latest: 2025-12-24T15:01:26+00:00
Comments: 11 pages, 6 figures
Abstract
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
中文标题/摘要
标题:RoboSafe:通过可执行安全逻辑保护具身代理
由视觉-语言模型(VLMs)驱动的具身代理越来越能够执行复杂的现实世界任务,但它们仍然容易受到可能导致不安全行为的危险指令的影响。运行时安全护栏可以在任务执行过程中拦截危险行为,提供了一种有前景的解决方案,因为它们具有灵活性。然而,现有的防御措施往往依赖于静态规则过滤或提示级控制,难以应对动态、时间依赖性和上下文丰富的环境中出现的隐含风险。为了解决这个问题,我们提出了RoboSafe,这是一种通过可执行谓词安全逻辑为具身代理提供混合推理运行时保护的混合方法。RoboSafe结合了在混合长短期安全记忆上的两种互补推理过程。我们首先提出了一种反向反思推理模块,该模块不断回顾短期记忆中的最近轨迹,以推断时间安全谓词,并在检测到违规行为时主动触发重新规划。然后,我们提出了一种前瞻预测推理模块,该模块通过生成基于长期安全记忆和代理的多模态观察的安全谓词来预见即将出现的风险。这些组件共同形成了一个既可解释又可执行的适应性安全逻辑。在多个代理的广泛实验中,RoboSafe与领先基准相比显著减少了危险行为(风险发生率降低36.8%),同时保持了接近原始的任务性能。在物理机器人手臂上的实际评估进一步证实了其实用性。代码将在接受后发布。
Summary / 总结
RoboSafe is a hybrid reasoning runtime safeguard for embodied agents using executable predicate-based safety logic. It combines Backward Reflective Reasoning, which revisits recent trajectories to infer temporal safety predicates, and Forward Predictive Reasoning, which anticipates risks by generating context-aware safety predicates. Experiments show that RoboSafe significantly reduces hazardous actions by 36.8% compared to leading baselines while maintaining near-original task performance.
RoboSafe 通过使用可执行的安全逻辑来保护实体代理免受有害指令的影响,结合后向反思推理和前瞻预测推理,持续监控和预测潜在的安全风险。实验表明,RoboSafe 相比现有方法将有害行为减少了 36.8%,同时保持了相似的任务性能。实际机器人手臂的评估进一步证实了其实用性。
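Editor's sketch: what an executable, predicate-based safety check over a short-term trajectory might look like. The predicate, the action vocabulary, and the `guard` function are hypothetical illustrations of the idea of safety logic that runs as code; they are not RoboSafe's actual predicates or interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    target: str


def stove_left_on(trajectory, next_step):
    """Temporal predicate: True if the agent would leave with the stove on."""
    stove_on = False
    for step in trajectory:
        if step.target == "stove" and step.action == "turn_on":
            stove_on = True
        elif step.target == "stove" and step.action == "turn_off":
            stove_on = False
    return stove_on and next_step.action == "leave_room"


def guard(trajectory, next_step, predicates):
    """Run every predicate; request replanning if any of them is violated."""
    violated = [p.__name__ for p in predicates if p(trajectory, next_step)]
    return ("replan", violated) if violated else ("execute", [])


# Toy trajectories for illustration only.
unsafe_history = [Step("turn_on", "stove")]
safe_history = [Step("turn_on", "stove"), Step("turn_off", "stove")]
leave = Step("leave_room", "kitchen")

unsafe_decision = guard(unsafe_history, leave, [stove_left_on])
safe_decision = guard(safe_history, leave, [stove_left_on])
```

Because each predicate is an ordinary function, a violation can be traced back to the named rule that fired, which is one reading of "interpretable and executable as code".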
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan
First: 2025-12-24T14:18:38+00:00 · Latest: 2025-12-24T14:18:38+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
中文标题/摘要
标题:VisRes 基准:关于评估 VLM 视觉推理能力的研究
视觉-语言模型(VLMs)在视觉问答和图像描述等任务上取得了显著进展。然而,这些模型在视觉推理方面的表现与其依赖语言先验的程度仍然不清楚。为了解决这个问题,我们引入了 VisRes 基准,该基准旨在在无需上下文语言监督的自然环境中研究视觉推理。通过对三种复杂性级别的模型行为进行分析,我们发现了感知和关系视觉推理能力的明显局限性。VisRes 在其级别上隔离了不同的推理能力。第一级测试在模糊、纹理变化、遮挡和旋转等干扰下的感知完成和全局图像匹配;第二级测试单一属性(如颜色、数量、方向)的基于规则的推理;第三级则针对需要整合多个视觉属性的组合推理。在超过 19,000 张受控任务图像中,我们发现最先进的 VLMs 在微妙的感知干扰下表现接近随机,揭示了其有限的抽象能力,仅限于模式识别。最后,我们讨论了 VisRes 如何为多模态研究中的抽象视觉推理提供统一框架。
Summary / 总结
The paper introduces VisRes Bench, a benchmark to evaluate the visual reasoning capabilities of Vision-Language Models (VLMs) without contextual language supervision. The benchmark consists of three levels of complexity: perceptual completion and global image matching (Level 1), rule-based inference over a single attribute (Level 2), and compositional reasoning integrating multiple visual attributes (Level 3). Across more than 19,000 controlled task images, state-of-the-art VLMs perform near random under subtle perceptual perturbations, indicating limited abstraction beyond pattern recognition.
研究通过引入VisRes Bench这一基准来评估视觉语言模型(VLMs)的视觉推理能力,该基准在没有上下文语言监督的情况下测试模型在自然场景中的表现。研究分析了模型在三个复杂度级别上的行为:感知补全与全局图像匹配、基于单一属性规则的推理和组合推理。关键发现表明,最先进的VLMs在细微的感知扰动下表现接近随机,表明其抽象能力有限,主要停留在模式识别层面。
SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
Authors: Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang
First: 2025-10-18T09:22:40+00:00 · Latest: 2025-12-24T13:40:37+00:00
Abstract
Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives, such as predicting image rotation or reconstructing masked patches, into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors, such as task difficulty, model scale, and semantic alignment with the target domain, that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.
中文标题/摘要
标题:SSL4RL:重新审视自我监督学习作为视觉-语言推理内在奖励的方法
视觉-语言模型(VLMs)通过结合大型语言模型和视觉输入展示了显著的能力。然而,它们往往未能充分利用视觉证据,要么依赖于视觉中心任务中的语言先验,要么在推理过程中求助于文本捷径。尽管强化学习(RL)可以将模型与期望的行为对齐,但将其应用于VLMs受到了缺乏可扩展和可靠的奖励机制的阻碍。为克服这一挑战,我们提出了一种名为SSL4RL的新框架,该框架利用自我监督学习(SSL)任务作为RL基础微调的验证奖励来源。我们的方法将SSL目标,如预测图像旋转或重建遮罩片段,重新表述为密集的自动奖励信号,从而消除了对人类偏好数据或不可靠的人工智能评估者的需要。实验表明,SSL4RL在视觉中心和视觉-语言推理基准测试中显著提高了性能。此外,通过系统性的消融实验,我们确定了影响SSL4RL任务有效性的关键因素,如任务难度、模型规模和与目标领域的语义对齐,为未来工作提供了新的设计原则。我们还通过将其应用于图学习,展示了该框架的通用性,其中它带来了显著的收益。SSL4RL建立了一种灵活且有效的多模态模型对齐范式,使用可验证的自我监督目标。
Summary / 总结
The research aims to enhance the performance of vision-language models (VLMs) by integrating self-supervised learning (SSL) tasks as intrinsic rewards for reinforcement learning (RL) fine-tuning. The proposed SSL4RL framework converts SSL objectives like image rotation prediction into dense reward signals, avoiding the need for human preferences or unreliable evaluators. Experiments show that SSL4RL significantly improves VLMs on various benchmarks and also demonstrates its effectiveness in graph learning tasks.
论文提出了SSL4RL框架,利用自监督学习(SSL)任务作为强化学习(RL)微调视觉语言模型(VLM)的内在奖励。该方法通过提供密集的自动奖励信号,而无需人类偏好数据,提高了视觉中心和视觉语言推理基准上的性能。研究还确定了影响SSL4RL任务有效性的关键因素,并展示了其在图学习中的应用,取得了显著的提升。
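Editor's sketch: how a rotation-prediction task can yield a reward that is checkable without human labels. An image (here a small nested list) is rotated by a random multiple of 90 degrees, and the reward is 1.0 only when the model's answer matches the applied rotation. The function names and the exact-match reward are assumptions for illustration; the paper's reward design may differ.

```python
import random


def rotate90(grid, k):
    """Rotate a 2D list clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def make_rotation_task(image, rng):
    """Return a rotated copy of the image and the angle that was applied."""
    k = rng.randrange(4)
    return rotate90(image, k), k * 90


def rotation_reward(predicted_degrees, true_degrees):
    """Verifiable reward: 1.0 for the correct angle, 0.0 otherwise."""
    return 1.0 if predicted_degrees % 360 == true_degrees else 0.0


# Toy 2x2 "image" for illustration only.
image = [[1, 2], [3, 4]]
rotated, angle = make_rotation_task(image, random.Random(0))
reward_if_correct = rotation_reward(angle, angle)
```

The label comes from the transformation itself, so the reward needs neither preference data nor a learned judge.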
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
Authors: Yunheng Li, Yuxuan Li, Quansheng Zeng, Wenhai Wang, Qibin Hou, Ming-Ming Cheng
Venue: ICCV 2025
First: 2024-12-09T06:34:23+00:00 · Latest: 2025-12-24T13:11:11+00:00
Comments: Accepted at ICCV 2025. The code is available at https://github.com/HVision-NKU/DenseVLM
Abstract
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant "foreground bias", where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available at https://github.com/HVision-NKU/DenseVLM.
中文标题/摘要
标题:无偏区域-语言对齐以实现开放词汇密集预测
预训练的视觉-语言模型(VLMs),如CLIP,展示了令人印象深刻的零样本识别能力,但在密集预测任务中仍然表现不佳。最近,自我蒸馏作为一种有希望的方法正在兴起,用于微调VLMs以更好地适应局部区域,而无需大量注释。然而,之前最先进的方法往往遭受显著的“前景偏差”问题,模型倾向于错误地将背景区域识别为前景对象。为了解决这一问题,我们提出了一种名为DenseVLM的框架,旨在从强大的预训练VLM表示中学习无偏的区域-语言对齐。DenseVLM利用预训练的VLM检索未标记区域的类别,然后分离前景和背景特征之间的干扰。我们展示了DenseVLM可以直接替换开放词汇目标检测和图像分割方法中的原始VLM,从而显著提高性能。此外,当在更广泛和多样化的数据集上进行训练时,它还表现出有希望的零样本扩展性。我们的代码可在https://github.com/HVision-NKU/DenseVLM获取。
Summary / 总结
The research aims to improve the performance of pre-trained vision-language models (VLMs) in dense prediction tasks by addressing the foreground bias issue. DenseVLM is proposed to learn unbiased region-language alignment using pre-trained VLMs, decoupling foreground and background features. The method directly replaces the original VLM in open-vocabulary object detection and image segmentation, resulting in significant performance improvements and promising zero-shot scalability with larger datasets.
研究旨在通过解决前景偏差问题,提升预训练视觉-语言模型(VLMs)在密集预测任务中的性能。提出了DenseVLM框架,从预训练VLM中学习无偏的区域-语言对齐,分离前景和背景特征。该方法可以直接替换原始VLM用于开放词汇对象检测和图像分割,取得了显著的性能提升,并且在更大、更多样化的数据集上展示了良好的零样本扩展性。
ORCA: Object Recognition and Comprehension for Archiving Marine Species
Authors: Yuk-Kwan Wong, Haixin Liang, Zeyu Ma, Yiwei Chen, Ziqiang Zheng, Rinaldi Gotama, Pascal Sebastian, Lauren D. Sparks, Sai-Kit Yeung
Venue: WACV
First: 2025-12-24T12:36:57+00:00 · Latest: 2025-12-24T12:36:57+00:00
Comments: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Abstract
Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in marine domain. Project Page: http://orca.hkustvgd.com/.
中文标题/摘要
标题:ORCA:海洋物种识别与理解以存档海洋生物
海洋视觉理解对于监测和保护海洋生态系统至关重要,能够实现自动化的生物调查。然而,由于训练数据有限且缺乏将特定海洋领域的挑战与明确的计算机视觉任务系统化结合的任务表述,进展受到限制,从而限制了有效模型的应用。为解决这一问题,我们提出了ORCA,一个包含14,647张图像和478个物种的多模态基准数据集,其中包含42,217个边界框注释和22,321个专家验证的实例描述。该数据集提供了细粒度的视觉和文本注释,捕捉了不同海洋物种的形态特征。为了促进方法学的进步,我们在三个任务上评估了18个最先进的模型:对象检测(封闭集和开放词汇)、实例描述和视觉定位。结果突显了关键挑战,包括物种多样性、形态重叠和专门领域的特殊需求,强调了海洋理解的难度。ORCA因此建立了一个全面的基准,以推进海洋领域的研究。项目页面:http://orca.hkustvgd.com/
Summary / 总结
The research aims to improve marine ecosystem monitoring through automatic and scalable biological surveys by addressing the challenges of limited training data and lack of systematic task formulation. ORCA, a multi-modal benchmark, was created with 14,647 images from 478 species and detailed annotations, evaluated on object detection, instance captioning, and visual grounding tasks. The results revealed key challenges such as species diversity and morphological overlap, highlighting the difficulty of marine understanding and the need for further research advancements.
研究旨在通过自动化和可扩展的生物调查来改善海洋生态系统的监测。ORCA 是一个多模态基准,包含来自 478 种海洋物种的 14,647 张图像,并附有详细的注释。研究评估了 18 种最先进的模型在物体检测、实例描述和视觉定位上的表现,揭示了物种多样性、形态重叠等挑战。ORCA 提供了一个全面的基准,以促进海洋领域的研究进展。
Intersectional Fairness in Vision-Language Models for Medical Image Disease Classification
Authors: Yupeng Zhang, Adam G. Dunn, Usman Naseem, Jinman Kim
First: 2025-12-17T09:47:29+00:00 · Latest: 2025-12-24T12:33:48+00:00
Abstract
Medical artificial intelligence (AI) systems, particularly multimodal vision-language models (VLMs), often exhibit intersectional biases where models are systematically less confident in diagnosing marginalised patient subgroups. Such bias can lead to higher rates of inaccurate and missed diagnoses due to demographically skewed data and divergent distributions of diagnostic certainty. Current fairness interventions frequently fail to address these gaps or compromise overall diagnostic performance to achieve statistical parity among the subgroups. In this study, we developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardises diagnostic certainty across intersectional patient subgroups. Unlike traditional debiasing methods, this approach equalises the model's decision confidence without requiring sensitive demographic data during clinical inference. We evaluated this approach using 10,015 skin lesion images (HAM10000) with external validation on 12,000 images (BCN20000), and 10,000 fundus images for glaucoma detection (Harvard-FairVLMed), stratifying performance by intersectional age, gender, and race attributes. In the dermatology cohort, the proposed method reduced the overall intersectional missed diagnosis gap (difference in True Positive Rate, $\Delta$TPR) from 0.50 to 0.26 while improving the overall Area Under the Curve (AUC) from 0.94 to 0.97 compared to standard training. Similarly, for glaucoma screening, the method reduced $\Delta$TPR from 0.41 to 0.31, achieving a better AUC of 0.72 (vs. 0.71 baseline). This establishes a scalable framework for developing high-stakes clinical decision support systems that are both accurate and can perform equitably across diverse patient subgroups, ensuring reliable performance without increasing privacy risks.
中文标题/摘要
标题:视觉语言模型在医学图像疾病分类中的交叉公平性
医学人工智能(AI)系统,尤其是多模态视觉语言模型(VLM),常常表现出交叉偏见,模型在诊断边缘化患者亚组时系统性地缺乏信心。这种偏见可能导致由于样本数据的种族分布偏差和诊断确定性分布差异而出现更高的误诊和漏诊率。当前的公平性干预措施往往未能解决这些差距,或者在实现各亚组统计平等的同时牺牲整体诊断性能。在本研究中,我们开发了跨模态一致性匹配(CMAC-MMD)训练框架,以标准化交叉公平性患者亚组的诊断确定性。与传统的去偏见方法不同,该方法在临床推理过程中不需要敏感的种族数据即可使模型的决策信心平等化。我们使用10,015张皮肤病变图像(HAM10000)和外部验证的12,000张图像(BCN20000)以及10,000张用于青光眼检测的视网膜图像(Harvard-FairVLMed),按交叉公平性年龄、性别和种族属性分层评估了该方法。在皮肤科队列中,所提出的方法将总体交叉公平性漏诊差距(真实阳性率差异,ΔTPR)从0.50降低到0.26,同时将总体曲线下面积(AUC)从0.94提高到0.97,优于标准训练。同样,在青光眼筛查中,该方法将ΔTPR从0.41降低到0.31,实现了更好的AUC(0.72,与0.71基线相比)。这建立了一个可扩展的框架,用于开发既准确又能在不同患者亚组中公平执行的高风险临床决策支持系统,确保可靠性能而不增加隐私风险。
Summary / 总结
This study addresses the intersectional biases in medical AI systems, particularly in vision-language models, which can lead to higher rates of inaccurate and missed diagnoses for marginalized patient subgroups. The authors developed Cross-Modal Alignment Consistency (CMAC-MMD), a training framework that standardizes diagnostic certainty across different patient subgroups without requiring sensitive demographic data. Evaluations on skin lesion and fundus images showed that the proposed method reduced the intersectional missed diagnosis gap and improved overall diagnostic performance, achieving better Area Under the Curve (AUC) scores compared to standard training methods.
该研究针对医疗AI系统中的交集偏见问题,特别是用于疾病分类的多模态视觉-语言模型。研究引入了跨模态一致性对齐(CMAC-MMD)的训练框架,该框架能够在不需要敏感人口统计数据的情况下,标准化不同患者亚组的诊断置信度。在皮肤病变和视网膜图像上的评估表明,所提出的方法减少了漏诊差距,并提高了整体诊断性能,AUC值优于标准训练方法。
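Editor's sketch: one common way to compute a missed-diagnosis gap across intersectional subgroups. It assumes ΔTPR is the difference between the highest and lowest per-subgroup True Positive Rate, with subgroups keyed by a tuple of attributes. That definition, the function names, and the toy labels are assumptions; the paper's exact formulation may differ.

```python
from collections import defaultdict


def tpr_by_subgroup(y_true, y_pred, groups):
    """True Positive Rate per subgroup (subgroups without positives are skipped)."""
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            true_pos[g] += int(p == 1)
    return {g: true_pos[g] / positives[g] for g in positives}


def delta_tpr(y_true, y_pred, groups):
    """Gap between the best-served and worst-served subgroup."""
    rates = tpr_by_subgroup(y_true, y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy labels for illustration only; each group is an (age band, gender) pair.
group_a = ("under_60", "female")
group_b = ("over_60", "male")
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
groups = [group_a, group_a, group_b, group_b, group_a]

rates = tpr_by_subgroup(y_true, y_pred, groups)
gap = delta_tpr(y_true, y_pred, groups)
```

A gap of zero would mean every subgroup has the same share of true cases detected.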
MarineEval: Assessing the Marine Intelligence of Vision-Language Models
Authors: Yuk-Kwan Wong, Tuan-An To, Jipeng Zhang, Ziqiang Zheng, Sai-Kit Yeung
Venue: WACV
First: 2025-12-24T11:57:50+00:00 · Latest: 2025-12-24T11:57:50+00:00
Comments: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Abstract
We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 17 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still a large room for further performance improvements. We hope our new benchmark and observations will facilitate future research. Project Page: http://marineeval.hkustvgd.com/
中文标题/摘要
标题:MarineEval:评估视觉语言模型的海洋智能
我们见证了由大型语言模型(LLMs)和进一步的视觉语言模型(VLMs)引领的在处理各种查询方面的进展,使其成为通用助手。VLMs 作为连接视觉世界和语言语料库的桥梁,接收视觉内容和各种文本指令以生成相应的响应。尽管 VLMs 在各个领域取得了巨大成功,但在本文中,我们询问现有的 VLMs 是否可以作为领域专家,准确回答需要大量领域专业知识和解决特殊领域挑战/要求的海洋问题。为了全面评估现有 VLMs 的效果并探索其边界,我们构建了第一个大规模海洋 VLM 数据集和基准 MarineEval,包含 2,000 个基于图像的问题-答案对。在数据集构建过程中,我们确保了构建数据的多样性和覆盖面:7 个任务维度和 20 个能力维度。领域要求特别整合到数据构建中,并进一步由相应的海洋领域专家进行验证。我们在 MarineEval 上全面基准测试了 17 个现有 VLMs,并且还调查了现有模型在回答海洋研究问题方面的局限性。实验结果表明,现有 VLMs 无法有效回答领域特定问题,仍有很大的性能提升空间。我们希望我们的新基准和观察结果能够促进未来的研究。项目页面:http://marineeval.hkustvgd.com/
Summary / 总结
MarineEval evaluates the marine intelligence of vision-language models (VLMs) by constructing a large-scale marine VLM dataset with 2,000 image-based question-answering pairs, covering 7 task dimensions and 20 capacity dimensions. The study finds that existing VLMs struggle to accurately answer domain-specific marine questions, indicating a need for further improvements in their domain expertise and capabilities.
MarineEval通过构建包含2,000个基于图像的问题-答案对的大规模数据集,涵盖7个任务维度和20个能力维度,评估视觉语言模型的海洋智能。基准测试显示,现有的VLMs难以准确回答特定领域的海洋问题,表明在处理专业领域挑战方面还有很大的改进空间。
UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters
Authors: Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, Yu-Gang Jiang
First: 2025-12-24T10:35:21+00:00 · Latest: 2025-12-24T10:35:21+00:00
Abstract
Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed samples, enabling the training of a powerful yet lightweight model. Secondly, we identify two challenges in building such a lightweight but unified expert model: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and with multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9$\times$ speedup, validating its effectiveness and efficiency. Codebase and Dataset: https://github.com/Topdu/OpenOCR.
中文标题/摘要
标题:UniRec-0.1B: 统一文本和公式识别模型,参数量仅为0.1B
文本和公式是许多文档的核心信息组件。准确高效地识别两者对于开发稳健且通用的文档解析系统至关重要。最近,视觉-语言模型(VLMs)在统一识别文本和公式方面取得了令人印象深刻的成果。然而,它们体积庞大且计算需求高,限制了它们在许多应用中的使用。在本文中,我们提出了一种仅包含0.1B参数的统一识别模型UniRec-0.1B。该模型能够在字符、单词、行、段落和文档等多个层次上进行文本和公式识别。为了实现这一任务,我们首先建立了包含4000万文本、公式及其混合样本的大型数据集UniRec40M,以训练出强大而轻量级的模型。其次,我们识别了构建这样一个轻量级但统一专家模型时的两个挑战:层次结构中的结构变异性以及文本和公式内容之间的语义纠缠。为了解决这些问题,我们引入了层次监督训练,以明确引导结构理解,并引入了语义解耦分词器,将文本和公式表示分离。最后,我们开发了一个全面的评估基准,涵盖了多个领域和多个层次的中文和英文文档。在该基准和公开基准上的实验结果表明,UniRec-0.1B 在性能和效率方面均优于通用视觉语言模型和领先文档解析专家模型,验证了其有效性和效率。代码库和数据集:https://github.com/Topdu/OpenOCR.
Summary / 总结
This paper introduces UniRec-0.1B, a lightweight unified recognition model for text and formulas with only 0.1B parameters. It addresses the challenges of structural variability and semantic entanglement by using hierarchical supervision and a semantic-decoupled tokenizer. The model is trained on UniRec40M, a large dataset of 40 million samples. Experimental results show that UniRec-0.1B outperforms both general-purpose vision-language models and specialized document parsing models, achieving a 2-9 times speedup.
论文介绍了UniRec-0.1B,这是一种仅包含0.1B(约1亿)参数的统一文本和公式识别模型,旨在解决现有视觉语言模型的计算需求问题。该模型使用分层监督训练方法和语义解耦分词器来处理结构变异性和语义纠缠。在综合评估基准上,UniRec-0.1B在性能上优于通用视觉语言模型和专门的文档解析模型,同时实现2-9倍的加速。
Case Prompting to Mitigate Large Language Model Bias for ICU Mortality Prediction
Authors: Gangxiong Zhang, Yongchao Long, Yong Zhang, Yuxi Zhou, Shenda Hong
First: 2025-12-17T12:29:53+00:00 · Latest: 2025-12-24T08:34:41+00:00
Abstract
Accurate mortality risk prediction for intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show promise in predicting outcomes from structured medical data, their predictions may exhibit demographic biases related to sex, age, and race, limiting their trustworthy use in clinical practice. Existing debiasing methods often reduce predictive performance, making it difficult to jointly optimize fairness and accuracy. In this study, we systematically examine bias in LLM-based ICU mortality prediction and propose a training-free, clinically adaptive prompting framework to simultaneously improve fairness and performance. We first develop a multi-dimensional bias assessment scheme for comprehensive model diagnosis. Building on this analysis, we introduce CAse Prompting (CAP), a novel prompting framework that integrates conventional debiasing prompts with case-based reasoning. CAP guides the model to learn from similar historical misprediction cases and their correct outcomes, enabling correction of biased reasoning patterns. Experiments on the MIMIC-IV dataset show that CAP substantially improves both predictive accuracy and fairness. CAP increases AUROC from 0.806 to 0.873 and AUPRC from 0.497 to 0.694, while reducing sex- and race-related disparities by over 90%. Feature reliance analysis further indicates highly consistent attention patterns across demographic groups, with similarity scores exceeding 0.98. These results demonstrate that LLMs exhibit measurable bias in ICU mortality prediction, and that a carefully designed prompting framework can effectively co-optimize fairness and performance without retraining, offering a transferable paradigm for equitable clinical decision support.
中文标题/摘要
标题:病例提示:减轻大型语言模型在ICU病死率预测中的偏见
ICU患者病死率的准确预测对于临床决策至关重要。尽管大型语言模型(LLMs)在基于结构化医疗数据预测结局方面显示出潜力,但它们的预测可能表现出与性别、年龄和种族相关的人口统计学偏见,限制了其在临床实践中的可信应用。现有的去偏方法通常会降低预测性能,使得难以同时优化公平性和准确性。在本研究中,我们系统地考察了基于LLM的ICU病死率预测中的偏见,并提出了一种无需训练、临床自适应的提示框架,以同时提高公平性和性能。我们首先开发了一种多维度偏见评估方案,用于全面的模型诊断。在此基础上,我们引入了CAse Prompting(CAP),这是一种新颖的提示框架,将传统的去偏提示与基于案例的推理相结合。CAP引导模型从相似的历史错误预测案例及其正确结果中学习,以纠正带有偏见的推理模式。在MIMIC-IV数据集上的实验表明,CAP显著提高了预测准确性和公平性。CAP将AUROC从0.806提高到0.873,AUPRC从0.497提高到0.694,同时将性别和种族相关的差异减少了90%以上。特征依赖性分析进一步表明,不同人口统计学组之间的注意力模式高度一致,相似度分数超过0.98。这些结果表明,LLMs在ICU病死率预测中表现出可测量的偏见,并且精心设计的提示框架可以在无需重新训练的情况下有效协同优化公平性和性能,为公平的临床决策支持提供了一种可迁移的范式。
Summary / 总结
This study addresses the issue of demographic biases in large language models (LLMs) used for ICU mortality prediction. It proposes a training-free prompting framework called CAse Prompting (CAP) to improve both fairness and predictive accuracy. The method involves developing a multi-dimensional bias assessment scheme and integrating conventional debiasing prompts with case-based reasoning. Experiments on the MIMIC-IV dataset show that CAP significantly enhances AUROC and AUPRC, and reduces sex- and race-related disparities by over 90%. Feature reliance analysis also reveals highly consistent attention patterns across demographic groups.
该研究针对用于ICU死亡率预测的大语言模型(LLM)中存在的偏见问题,这些偏见限制了其临床应用。研究提出了一种无需训练的提示框架——CAse Prompting(CAP),以同时提高公平性和预测准确性。方法包括开发一个多维度的偏见评估方案,并结合传统的去偏提示与案例推理。实验结果显示,CAP显著提升了AUROC从0.806到0.873和AUPRC从0.497到0.694,同时减少了超过90%的性别和种族相关差异。特征依赖性分析还表明,不同人口统计学组之间的注意力模式高度一致,表明CAP有效地减轻了偏见,而无需重新训练模型。
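To make the reported fairness numbers easier to interpret, here is a minimal sketch of one way a subgroup disparity could be quantified. The abstract does not say which disparity metric the authors use, so the choice below (the largest gap in per-group AUROC) is an assumption, and the function names `auroc`, `auroc_gap`, and `relative_reduction` are hypothetical.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_gap(labels, scores, groups):
    """Assumed disparity metric: max minus min AUROC across subgroups."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    per_group = {g: auroc(ys, ss) for g, (ys, ss) in by_group.items()}
    return max(per_group.values()) - min(per_group.values()), per_group


def relative_reduction(gap_before, gap_after):
    """Fraction by which a disparity shrank; 0.9 means a 90% reduction."""
    if gap_before == 0:
        raise ValueError("gap_before must be non-zero")
    return 1.0 - gap_after / gap_before
```

Under this assumed metric, a statement such as "disparities reduced by over 90%" would mean `relative_reduction(gap_before, gap_after) > 0.9`, where the two gaps are measured with the baseline prompt and with the debiasing prompt respectively.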
O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-Language Model
Authors: Rishi Gupta, Mukilan Karuppasamy, Shyam Marjit, Aditay Tripathi, Anirban Chakraborty
Venue: AAAI 2026
First: 2025-11-18T11:18:08+00:00 · Latest: 2025-12-24T08:17:05+00:00
Comments: Accepted to AAAI 2026
Abstract
While Large Vision Language Models (LVLMs) are increasingly deployed in real-world applications, their ability to interpret abstract visual inputs remains limited. Specifically, they struggle to comprehend hand-drawn sketches, a modality that offers an intuitive means of expressing concepts that are difficult to describe textually. We identify the primary bottleneck as the absence of a large-scale dataset that jointly models sketches, photorealistic images, and corresponding natural language instructions. To address this, we present two key contributions: (1) a new, large-scale dataset of image-sketch-instruction triplets designed to facilitate both pretraining and instruction tuning, and (2) O3SLM, an LVLM trained on this dataset. Comprehensive evaluations on multiple sketch-based tasks, namely (a) object localization, (b) counting, (c) image retrieval (SBIR and fine-grained SBIR), and (d) visual question answering (VQA), using three existing sketch datasets (QuickDraw!, Sketchy, and Tu Berlin) along with our generated SketchVCL dataset, show that O3SLM achieves state-of-the-art performance, substantially outperforming existing LVLMs in sketch comprehension and reasoning.
中文标题/摘要
标题:O3SLM:开放权重、开放数据和开放词汇量草图语言模型
尽管大型视觉语言模型(LVLMs)在越来越多的实际应用中被部署,但它们对抽象视觉输入的理解能力仍然有限。具体来说,它们难以理解手绘草图,这种模态提供了一种直观的方式来表达难以用文字描述的概念。我们确定的主要瓶颈是没有一个大规模的数据集能够同时建模草图、照片写实图像及其相应的自然语言指令。为了解决这个问题,我们提出了两个关键贡献:(1)一个新设计的、大规模的图像-草图-指令三元组数据集,旨在促进预训练和指令微调;(2)O3SLM,一个在该数据集上训练的LVLM。在多个基于草图的任务上的全面评估:(a)物体定位,(b)计数,(c)图像检索,即(SBIR和细粒度SBIR),以及(d)视觉问答(VQA),结合现有的三个草图数据集,即QuickDraw!、Sketchy和Tu Berlin,以及我们生成的SketchVCL数据集,表明O3SLM达到了最先进的性能,显著优于现有的LVLMs在草图理解和推理方面的表现。
Summary / 总结
The research aims to enhance the ability of large vision language models to interpret abstract visual inputs, particularly hand-drawn sketches. To address this, the authors created a new large-scale dataset of image-sketch-instruction triplets and trained a model called O3SLM. Experimental results demonstrate that O3SLM outperforms existing models in tasks such as object localization, counting, image retrieval, and visual question answering, especially in understanding and reasoning about sketches.
研究旨在提升大型视觉语言模型对抽象视觉输入的理解能力,特别是手绘草图。为此,作者引入了一个新的大规模图像-草图-指令三元组数据集和一个名为O3SLM的模型。全面的评估表明,O3SLM在各种草图任务上的表现优于现有模型,在草图理解和推理方面取得了最先进的性能。
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim
First: 2025-12-13T11:02:04+00:00 · Latest: 2025-12-24T07:46:59+00:00
Comments: 14 pages, 20 figures, conference, accepted by HPCA 2026
Abstract
Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. This process requires an iterative prefill stage, which is a unique feature of streaming video LLMs. Because of this iterative prefill stage, streaming video LLM inference suffers from significant limitations, including extensive computation, substantial data transfer, and degraded accuracy. Crucially, these issues are exacerbated in edge deployment, which is the primary target for these models. In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time (3.9-8.3 FPS) and energy-efficient streaming video LLM inference in edge deployment with negligible accuracy loss. While the DRE accounts for only 2.2% of power and 2.0% of area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over the AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.
中文标题/摘要
标题:V-Rex:通过动态KV缓存检索实现实时流式视频LLM加速
流式视频大型语言模型(LLMs)越来越多地用于实时多模态任务,如视频字幕、问答、对话代理和增强现实。然而,这些模型面临着根本性的内存和计算挑战,因为它们的键值(KV)缓存会随着连续的流式视频输入而大幅增长。这一过程需要一个迭代预填充阶段,这是流式视频LLMs的一个独特特征。由于这一迭代预填充阶段,流式视频LLM推理存在显著的局限,包括大量的计算、大量的数据传输以及准确性的下降。至关重要的是,这些问题在边缘部署中被进一步放大,而边缘部署正是这些模型的主要目标。在这项工作中,我们提出了V-Rex,这是第一个软件和硬件协同设计的加速器,全面解决了流式视频LLM推理中的算法和硬件瓶颈。核心上,V-Rex 引入了ReSV,这是一种无需训练的动态KV缓存检索算法。ReSV 利用基于时间和空间相似性的令牌聚类来减少视频帧间的过度KV缓存内存。为了充分利用这些算法上的优势,V-Rex 提供了一个紧凑的、低延迟的硬件加速器,其中包括一个动态KV缓存检索引擎(DRE),配备位级计算单元和基于提前退出的计算单元。V-Rex 在边缘部署中实现了前所未有的实时性能(3.9-8.3 FPS)和高能效的流式视频LLM推理,几乎无准确度损失。虽然DRE仅占2.2%的功耗和2.0%的面积,但该系统相比AGX Orin GPU实现了1.9-19.7倍的速度提升和3.1-18.5倍的能量效率提升。这项工作是首次全面解决算法和硬件中的KV缓存检索问题,使实时流式视频LLM推理能够在资源受限的边缘设备上实现。
Summary / 总结
V-Rex is a software-hardware co-designed accelerator for real-time streaming video LLM inference, addressing memory and computational challenges through a training-free dynamic KV cache retrieval algorithm called ReSV. ReSV reduces excessive KV cache memory by exploiting temporal and spatial similarity-based token clustering. V-Rex also includes a compact hardware accelerator with a dynamic KV cache retrieval engine (DRE) that provides low-latency and energy-efficient inference, achieving 3.9-8.3 FPS with negligible accuracy loss. Compared to the AGX Orin GPU, V-Rex offers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements, while the DRE accounts for only 2.2% of power and 2.0% of area.
V-Rex 是一种软件硬件协同设计的加速器,用于实时流视频 LLM 推断,通过引入基于时空相似性的 ReSV 动态 KV 缓存检索算法来解决内存和计算挑战。ReSV 通过 token 聚类减少视频帧间的 KV 缓存内存使用。V-Rex 还配备了一个紧凑的硬件加速器,其中包括一个动态 KV 缓存检索引擎(DRE),提供低延迟和能效推断。该系统实现了 3.9-8.3 FPS 的推断速度,几乎无精度损失,并提供了 1.9-19.7 倍的速度提升和 3.1-18.5 倍的能效提升,使其能够在资源受限的边缘设备上实现实时流视频 LLM 推断。
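The idea of similarity-based token clustering followed by retrieval can be illustrated with a small sketch. This is not ReSV: the abstract does not specify the algorithm, so the greedy cosine-similarity clustering, the threshold value, and the names `cluster_tokens` and `retrieve` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_tokens(keys, threshold=0.9):
    """Greedy clustering: each key joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    reps, members = [], []
    for i, key in enumerate(keys):
        for c, rep in enumerate(reps):
            if cosine(key, rep) >= threshold:
                members[c].append(i)
                break
        else:
            reps.append(key)
            members.append([i])
    return reps, members


def retrieve(query, reps, members, top_k=1):
    """Return cached token indices belonging to the top_k clusters whose
    representatives best match the query."""
    ranked = sorted(range(len(reps)), key=lambda c: cosine(query, reps[c]), reverse=True)
    return sorted(i for c in ranked[:top_k] for i in members[c])
```

The design point this sketch tries to convey is that the query is compared against one representative per cluster rather than against every cached token, so the comparison cost scales with the number of clusters instead of the number of frames seen so far.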
Generalization of Diffusion Models Arises with a Balanced Representation Space
Authors: Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu
First: 2025-12-24T05:40:40+00:00 · Latest: 2025-12-24T05:40:40+00:00
Comments: 40 pages, 19 figures. The first two authors contributed equally
Abstract
Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized "spiky" representations, whereas (ii) generalization arises when the model captures local data statistics, producing "balanced" representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.
中文标题/摘要
标题:扩散模型中的泛化能力源于平衡的表示空间
扩散模型在生成高质量、多样化样本方面表现出色,但当过度拟合训练目标时,它们可能会记住训练数据。我们通过表示学习的视角分析了扩散模型中记忆和泛化之间的区别。通过研究两层ReLU去噪自编码器(DAE),我们证明了(i) 记忆对应于模型在编码和解码的学得权重中存储原始训练样本,产生局部的“尖峰”表示,而(ii) 泛化则发生在模型捕捉局部数据统计时,产生“平衡”的表示。此外,我们在现实世界的无条件和文本到图像扩散模型上验证了这些理论发现,展示了这些表示结构在深层生成模型中的重要实践意义。基于这些见解,我们提出了一种基于表示的检测记忆的方法以及一种无需训练的编辑技术,允许通过表示引导实现精确控制。我们的结果共同强调了学习良好表示对于新颖和有意义的生成建模至关重要。
Summary / 总结
This study investigates the generalization capabilities of diffusion models by analyzing the representation learning process. It shows that memorization leads to localized 'spiky' representations, while generalization results in 'balanced' representations. The research validates these findings on real-world models and proposes a method for detecting memorization and a representation steering technique for precise control. The study emphasizes the importance of learning good representations for meaningful generative modeling.
该论文通过分析表示学习过程,研究了扩散模型的泛化能力。研究表明,记忆化会导致局部的“尖峰”表示,而泛化则会产生“平衡”的表示。该研究在实际模型上验证了这些发现,并提出了一种检测记忆化的表示方法和一种无需训练的编辑技术。研究结果强调了学习良好表示对于生成有意义模型的重要性。
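For orientation, a two-layer ReLU denoising autoencoder of the kind the abstract mentions can be written in a generic form. The weight names $W_1$ and $W_2$, the noise level $\sigma$, and the omission of bias terms are assumptions for illustration, not necessarily the paper's exact parameterization.

```latex
\begin{align}
  f_\theta(\tilde{x}) &= W_2\,\mathrm{ReLU}(W_1 \tilde{x}),
  \qquad \tilde{x} = x + \sigma\varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I), \\
  \mathcal{L}(\theta) &= \frac{1}{n}\sum_{i=1}^{n}
  \mathbb{E}_{\varepsilon}\,\bigl\| f_\theta(x_i + \sigma\varepsilon) - x_i \bigr\|_2^2 .
\end{align}
```

In this notation, $W_1$ plays the role of the encoding weights and $W_2$ the decoding weights; the abstract's memorization regime is the one in which these weights store raw training samples $x_i$.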
Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning
Authors: Shengguang Wu, Xiaohan Wang, Yuhui Zhang, Hao Zhu, Serena Yeung-Levy
First: 2025-12-24T04:30:21+00:00 · Latest: 2025-12-24T04:30:21+00:00
Comments: Project Website: https://transductive-visualprogram.github.io/
Abstract
Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependencies than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from the SpatialScore-Hard collection without any test-set-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at https://transductive-visualprogram.github.io/.
中文标题/摘要
标题:转导式视觉编程:从经验中演化工具库以进行空间推理
在3D场景中的空间推理需要精确的几何计算,这对视觉语言模型构成了挑战。视觉编程通过将问题分解为调用专门工具的步骤来解决这一问题,但现有方法要么依赖固定的工具集,要么在解决问题之前进行推测性的工具归纳,导致生成的程序效果不佳且归纳出的工具利用不足。我们提出了转导式视觉编程(TVP),这是一种新型框架,能够从自身经验而非推测中构建新的工具。TVP 首先使用基本工具解决问题,同时将经验性解决方案积累到示例库中,然后从这些程序中抽象出重复模式,形成可重用的高级工具,从而构建一个不断进化的工具库。这使得TVP能够利用从经验中学到的越来越强大的工具来解决新问题。在Omni3D-Bench上,TVP达到了最先进的性能,比GPT-4o高出22%,比之前最好的视觉编程系统高出11%。我们以转导方式学得的工具被用作核心程序依赖的频率是归纳式创建工具的5倍,显示出更有效的工具发现和重用。进化出的工具还对未见过的空间任务表现出强大的泛化能力,无需任何针对测试集的修改,就在SpatialScore-Hard集合的基准测试中取得了优异的性能。我们的工作确立了经验驱动的转导式工具创建作为构建自我进化的视觉编程代理的强大范式,这些代理能够有效应对具有挑战性的空间推理任务。我们的代码发布于 https://transductive-visualprogram.github.io/。
Summary / 总结
The research aims to improve spatial reasoning in 3D scenes by developing a framework called Transductive Visual Programming (TVP) that evolves tool libraries from experience. TVP first solves problems using basic tools and accumulates solutions into an Example Library, then abstracts recurring patterns into reusable higher-level tools for an evolving Tool Library. On the Omni3D-Bench, TVP outperforms GPT-4o by 22% and the previous best visual programming system by 11%, with transductively learned tools being used 5x more frequently as core program dependencies and showing strong generalization to unseen spatial tasks.
研究旨在通过开发一种名为Transductive Visual Programming (TVP)的框架来提高3D场景中的空间推理能力,该框架从经验中进化工具库。TVP使用基本工具解决问题,并将解决方案积累到Example Library中,然后从这些程序中抽象出可重用的高级工具,形成一个不断进化的Tool Library。在Omni3D-Bench上,TVP的性能比GPT-4o高出22%,比之前的最佳视觉编程系统高出11%;以转导方式学得的工具被用作核心程序依赖的频率是归纳式创建工具的5倍,并且在SpatialScore-Hard集合的基准测试中表现出强大的泛化能力,无需任何针对测试集的修改。
Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting
Authors: Yoonwoo Jeong, Cheng Sun, Frank Wang, Minsu Cho, Jaesung Choe
First: 2025-12-24T04:16:18+00:00 · Latest: 2025-12-24T04:16:18+00:00
Comments: Will be updated
Abstract
Recent advancements in computer vision have successfully extended open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, which causes information loss and thereby degrades segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps. Code will be made publicly available.
中文标题/摘要
标题:分位数渲染:高效嵌入高维特征的3D高斯点绘制
计算机视觉领域的最新进展成功地通过利用3D高斯点绘制(3D-GS)将开放词汇分割(OVS)扩展到了3D领域。尽管取得了这些进展,但高效渲染用于开放词汇查询所需的高维特征仍然面临重大挑战。现有方法使用码本或特征压缩,导致信息丢失,从而降低分割质量。为了解决这一限制,我们引入了分位数渲染(Q-Render),这是一种新颖的3D高斯渲染策略,能够高效处理高维特征同时保持高保真度。与传统的体渲染不同,后者沿每个射线密集采样所有相交的3D高斯,Q-Render仅稀疏采样沿射线具有主导影响的那些。通过将Q-Render集成到一个通用的3D神经网络中,我们还提出了高斯点绘制网络(GS-Net),该网络以通用方式预测高斯特征。在ScanNet和LeRF上的广泛实验表明,我们的框架在性能上优于最先进的方法,同时能够实现接近43.7倍的加速进行实时渲染。代码将公开提供。
Summary / 总结
The paper addresses the challenge of efficiently rendering high-dimensional features in 3D Gaussian Splatting for open-vocabulary segmentation. It introduces Quantile Rendering (Q-Render), which sparsely samples only the 3D Gaussians with dominant influence along each ray, avoiding the information loss that codebooks and feature compression cause in existing methods. The proposed Gaussian Splatting Network (GS-Net) integrates Q-Render into a generalizable 3D neural network; the framework outperforms state-of-the-art methods on ScanNet and LeRF while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps.
论文旨在解决在3D高维特征渲染中的高效性问题,提出了一种新颖的渲染策略Quantile Rendering (Q-Render),该策略仅稀疏采样沿每个光线影响最大的3D高斯,避免信息丢失。作者还提出了Gaussian Splatting Network (GS-Net),将Q-Render集成到一个通用的3D神经网络中。实验表明,该框架在512-D特征图上实现了显著的约43.7倍加速,并优于现有方法。
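The contrast between dense volume rendering and sampling only the dominant Gaussians along a ray can be sketched as follows. The abstract does not give Q-Render's selection rule, so the rule used here (pick the Gaussians at which the cumulative blending weight crosses evenly spaced quantile levels) is an assumption, and `blend_weights` and `quantile_sample` are hypothetical names.

```python
def blend_weights(alphas):
    """Standard front-to-back alpha compositing weights for Gaussians sorted
    by depth: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(transmittance * a)
        transmittance *= 1.0 - a
    return weights


def quantile_sample(alphas, num_samples):
    """Assumed selection rule: return indices of the Gaussians at which the
    cumulative blending weight crosses the mid-points of num_samples equal
    quantile bins. Gaussians with large weight absorb several quantiles,
    so at most num_samples indices are returned."""
    weights = blend_weights(alphas)
    total = sum(weights)
    if total == 0.0:
        return []
    targets = [(k + 0.5) / num_samples * total for k in range(num_samples)]
    picked, cumulative, t = [], 0.0, 0
    for i, w in enumerate(weights):
        cumulative += w
        while t < len(targets) and cumulative >= targets[t]:
            if not picked or picked[-1] != i:
                picked.append(i)
            t += 1
    return picked
```

With such a rule, the expensive high-dimensional feature lookup would only be performed for the returned indices, while the cheap scalar opacities are still traversed for every Gaussian on the ray.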
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Authors: Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang
First: 2025-12-17T18:59:55+00:00 · Latest: 2025-12-24T03:37:34+00:00
Comments: 11 pages, 5 figures
Abstract
In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) bench and a 37.5% gain on the MME (Cog.) bench, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.
中文标题/摘要
标题:DiffusionVL:将任何自回归模型转化为扩散视觉语言模型
在最近的多模态研究中,扩散范式因其独特的解码优势,已成为自回归范式(AR)的有前途的替代方案。然而,由于基础扩散语言模型能力的限制,扩散视觉语言模型(dVLM)的性能仍然远远落后于主流模型。这引发了一个简单而基本的问题:是否可以基于现有的强大自回归模型构建dVLM?为此,我们提出了DiffusionVL,这是一个可以从任何强大自回归模型转换而来的dVLM家族。通过简单的微调,我们成功地将自回归预训练模型适应到扩散范式中。这种方法产生了两个关键观察结果:(1)从基于自回归的多模态模型到扩散的范式转变非常有效。(2)直接将自回归语言模型转换为dVLM也是可行的,性能与LLaVA风格的视觉指令调优相当。此外,我们引入了一种块解码设计到dVLM中,支持任意长度的生成和KV缓存重用,实现了显著的推理速度提升。我们进行了大量的实验。尽管使用了比先前方法少于5%的数据进行训练,DiffusionVL在MMMU-Pro(视觉)基准上实现了34.4%的整体性能提升,在MME(认知)基准上实现了37.5%的提升,同时推理速度提升了2倍。模型和代码发布在https://github.com/hustvl/DiffusionVL。
Summary / 总结
DiffusionVL translates powerful autoregressive models into diffusion vision language models through simple fine-tuning, achieving significant performance improvements and a 2x inference speedup. Despite using less than 5% of the data required by previous methods, DiffusionVL outperforms existing models on vision and cognitive benchmarks, with gains of 34.4% and 37.5%, respectively.
DiffusionVL 是一种可以从现有强大的自回归(AR)模型中转换而来的扩散视觉语言模型(dVLM)家族,通过简单的微调实现。这种方法表明,从基于AR的多模态模型到扩散的转变非常有效,直接将AR语言模型转换为dVLM可以达到与LLaVA风格的视觉指令调优相当的性能。此外,引入了一种块解码设计,支持任意长度的生成和KV缓存重用,从而实现了显著的推理速度提升。尽管使用了比以前方法少于5%的数据,DiffusionVL 在 MMMU-Pro(视觉)基准上实现了34.4%的性能提升,在 MME(认知)基准上实现了37.5%的性能提升,同时实现了2倍的推理速度提升。
PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding
Authors: Seongmin Jung, Seongho Choi, Gunwoo Jeon, Minsu Cho, Jongwoo Lim
First: 2025-12-24T03:18:51+00:00 · Latest: 2025-12-24T03:18:51+00:00
Abstract
3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
中文标题/摘要
标题:PanoGrounder:通过全景场景表示连接2D和3D的基于VLM的3D视觉定位
3D视觉定位(3DVG)是视觉语言感知与机器人技术之间的关键桥梁,需要语言理解与3D场景推理。传统的监督模型利用显式的3D几何结构,但由于3D视觉语言数据集稀缺以及与现代视觉语言模型(VLM)相比推理能力有限,其泛化能力有限。我们提出了一种可泛化的3DVG框架PanoGrounder,该框架结合了多模态全景表示与预训练的2D VLM,以实现强大的视觉语言推理。全景渲染图,结合3D语义和几何特征,作为2D和3D之间的中间表示,提供了两大优势:(i)可以直接馈送到VLM中,无需大量适应;(ii)由于其360度的视野,保留了长距离的物体到物体的关系。我们设计了一个三阶段流水线,考虑场景布局和几何结构放置一组紧凑的全景视点,使用VLM在每个全景渲染图上定位文本查询,并通过提升将每个视点的预测融合为一个3D边界框。我们的方法在ScanRefer和Nr3D上达到了最先进的结果,并且在未见过的3D数据集和文本重述方面表现出色。
Summary / 总结
PanoGrounder is a framework that combines panoramic scene representations with pretrained 2D vision-language models to enhance 3D visual grounding. By using panoramic renderings that include 3D semantic and geometric features, it bridges the gap between 2D and 3D, allowing for strong vision-language reasoning. The method involves a three-stage pipeline that places panoramic viewpoints, grounds text queries, and fuses predictions to form a 3D bounding box. PanoGrounder achieves state-of-the-art results on ScanRefer and Nr3D and shows better generalization to unseen datasets and text rephrasings.
PanoGrounder 是一种结合全景场景表示与预训练的 2D 视觉语言模型的方法,以增强 3D 视觉定位。它使用包含 3D 语义和几何特征的全景渲染作为中间表示,以提高泛化能力和保留长距离物体关系。该方法在 ScanRefer 和 Nr3D 数据集上达到了最先进的效果,并且在未见过的 3D 数据集和文本重述方面表现出强大的泛化能力。
Benchmarking and Enhancing VLM for Compressed Image Understanding
Authors: Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang
First: 2025-12-24T02:59:01+00:00 · Latest: 2025-12-24T02:59:01+00:00
Abstract
With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has so far remained unexplored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks and encompassing over one million compressed images. Next, we analyze the source of the performance gap by decomposing it into (a) the information loss during compression and (b) the generalization failure of the VLM. We visualize these gaps with concrete examples and identify that, for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. We demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.
中文标题/摘要
标题:视觉语言模型在压缩图像理解中的基准测试与增强
随着视觉语言模型(VLMs)的快速发展及其应用需求的不断增加,高效压缩图像输入变得越来越重要。现有的VLMs主要处理和理解高比特率压缩图像,而它们对低比特率压缩图像的理解能力尚未得到充分探索。在本文中,我们介绍了第一个全面的基准测试,以评估VLM在处理压缩图像方面的能力,涵盖了广泛使用的图像编解码器和多种任务,基准中包含超过一百万张压缩图像。接下来,我们通过将性能差距分为a) 压缩过程中的信息损失和b) VLM的一般化失败来分析性能差距。我们通过具体示例可视化这些差距,并确定对于压缩图像,只能通过缓解一般化差距来改善。最后,我们提出了一种通用的VLM适配器,以增强现有编解码器压缩图像的模型性能。结果表明,单个适配器可以提高VLM在不同编解码器和比特率图像上的性能10%-30%。我们相信,我们的基准测试和增强方法提供了有价值的见解,并有助于弥合VLMs与压缩图像之间的差距。
Summary / 总结
This paper addresses the need for efficient compression of image inputs for Vision-Language Models (VLMs) and introduces a comprehensive benchmark to evaluate VLMs on compressed images. The authors analyze the performance gap due to information loss during compression and generalization failure of VLMs, and propose a universal VLM adaptor to enhance model performance. The adaptor improves VLM performance across different codecs and bitrates by 10%-30%. The benchmark and enhancement method provide valuable insights for improving VLMs' ability to interpret compressed images.
本文针对视觉语言模型(VLM)对压缩图像的高效处理需求,引入了一个全面的基准来评估VLM在压缩图像上的表现。作者分析了由于压缩过程中的信息丢失和VLM的一般化失败导致的性能差距,并提出了一种通用的VLM适配器,该适配器可以提高模型在不同编解码器和比特率下的性能,提升幅度在10%-30%之间。研究为压缩图像的一般化差距提供了有价值的见解。
RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
Authors: Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang
Venue: NeurIPS 2025
First: 2025-09-02T03:01:23+00:00 · Latest: 2025-12-24T01:10:58+00:00
Comments: Accepted by NeurIPS 2025 Dataset and Benchmark Track
Abstract
Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,351 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.
中文标题/摘要
标题:RSCC:用于灾害事件的大型遥感变化描述数据集
遥感对于灾害监测至关重要,但现有数据集缺乏时间图像对和详细的文本注释。当前资源主要以单张快照图像为主,无法捕捉灾害随时间的变化影响。为解决这一问题,我们引入了遥感变化描述(RSCC)数据集,这是一个包含62,351个灾前/灾后图像对的大规模基准数据集(涵盖地震、洪水、野火等),并配有丰富的、类人类的变化描述。通过在遥感数据中架起时间和语义的桥梁,RSCC 使视觉-语言模型能够进行灾害意识的双时态理解的稳健训练和评估。我们的结果突显了RSCC在促进详细灾害相关分析方面的能力,为遥感中更准确、可解释和可扩展的视觉-语言应用铺平了道路。代码和数据集可在https://github.com/Bili-Sakura/RSCC/ 获取。
Summary / 总结
The research aims to address the lack of temporal image pairs and detailed textual annotations in existing disaster monitoring datasets. The Remote Sensing Change Caption (RSCC) dataset, comprising 62,351 pre-/post-disaster image pairs with rich textual annotations, was created to fill this gap. The dataset covers various disaster types such as earthquakes, floods, and wildfires. Key findings show that RSCC enhances the training and evaluation of vision-language models for disaster-aware bi-temporal understanding, facilitating more accurate and interpretable analysis. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.
RSCC数据集解决了现有灾害监测数据集中缺乏时间图像对和详细文本注释的问题。它包含了62,351对灾前/灾后图像配以丰富的变化描述,涵盖多种灾害类型。该数据集能够促进视觉-语言模型的灾后双时相理解训练和评估,有助于详细灾害分析,并提高遥感应用的准确性和可解释性。
Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference
Authors: Putu Indah Githa Cahyani, Komang David Dananjaya Suartana, Novanto Yudistira
First: 2025-12-23T23:30:56+00:00 · Latest: 2025-12-23T23:30:56+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at https://github.com/kmdavidds/mlfastlm.
中文标题/摘要
标题:输入自适应视觉预处理以提高快速视觉-语言模型推理效率
视觉-语言模型(VLMs)在多模态推理任务中表现出强大的性能,但由于高推理延迟和计算成本,其部署仍然具有挑战性,尤其是在处理高分辨率视觉输入时。虽然最近的架构如FastVLM通过优化视觉编码器提高了效率,但现有的管道仍然依赖于静态视觉预处理,导致对于视觉简单的输入存在冗余计算。在本文中,我们提出了一种自适应视觉预处理方法,该方法根据图像内容特征动态调整输入分辨率和空间覆盖范围。所提出的方法结合了内容感知图像分析、自适应分辨率选择和内容感知裁剪,以在视觉编码前减少视觉冗余。重要的是,该方法与FastVLM集成,无需修改其架构或重新训练。我们在DocVQA数据集的子集上仅进行推理评估,重点关注效率导向的指标。实验结果表明,自适应预处理可以将每张图像的推理时间减少超过50%,降低平均完整生成时间,并且与基线管道相比,视觉标记数量减少超过55%。这些发现表明,输入感知预处理是一种有效且轻量级的策略,可以提高视觉-语言模型的部署效率。为了便于可重复性,我们的实现作为FastVLM仓库的分支提供,包含所提出方法的文件,并可在https://github.com/kmdavidds/mlfastlm/获得。
Summary / 总结
This work addresses high inference latency in Vision-Language Models (VLMs) by proposing an adaptive visual preprocessing method. The method dynamically adjusts input resolution and spatial coverage based on image content, reducing visual redundancy before vision encoding, and plugs into FastVLM without architectural changes or retraining. Evaluated on a subset of the DocVQA dataset in an inference-only setting, the approach reduces per-image inference time by over 50%, lowers mean full generation time, and decreases visual token count by more than 55% compared to the baseline pipeline.
本文提出了一种输入自适应视觉预处理方法,以解决视觉语言模型(VLMs)的高推理延迟问题。该方法根据图像内容动态调整输入分辨率和空间覆盖范围,减少冗余计算。在DocVQA数据集上的评估结果显示,该方法显著减少了每张图像的推理时间超过50%,降低了平均完整生成时间,并将视觉标记数量减少了超过55%。这表明输入感知的预处理是提高VLMs部署效率的有效且轻量级策略,无需修改模型架构或重新训练。
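The abstract names three steps (content-aware analysis, adaptive resolution selection, content-aware cropping) without giving formulas. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation: the gradient-based complexity score, the threshold tiers, the white-background assumption, and all function names are hypothetical choices made for illustration. The image is assumed to be a non-empty grayscale grid of integers in 0..255.

```python
def complexity(gray):
    """Hypothetical content score: mean absolute difference between
    horizontally adjacent pixels (0 for a flat image, up to 255)."""
    total, count = 0, 0
    for row in gray:
        for a, b in zip(row, row[1:]):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def select_resolution(gray, tiers=((2.0, 256), (10.0, 512)), max_res=1024):
    """Pick the smallest resolution tier whose threshold exceeds the score.
    The thresholds and resolutions are assumed values, not from the paper."""
    score = complexity(gray)
    for threshold, resolution in tiers:
        if score < threshold:
            return resolution
    return max_res


def content_bbox(gray, background=255, tol=8):
    """Bounding box (top, left, bottom, right), ends exclusive, of pixels
    that differ from an assumed uniform background by more than tol."""
    rows = [i for i, row in enumerate(gray)
            if any(abs(p - background) > tol for p in row)]
    cols = [j for j in range(len(gray[0]))
            if any(abs(row[j] - background) > tol for row in gray)]
    if not rows or not cols:
        # Nothing but background: keep the full image.
        return (0, 0, len(gray), len(gray[0]))
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)
```

A visually simple page receives a low score and therefore a small input resolution, which is where a reduction in visual tokens would come from; cropping to the content box removes empty margins before resizing.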
VL4Gaze: Unleashing Vision-Language Models for Gaze Following
Authors: Shijing Wang, Chaoqun Cui, Yaping Huang, Hyung Jin Chang, Yihua Cheng
First: 2025-12-23T19:47:11+00:00 · Latest: 2025-12-23T19:47:11+00:00
Abstract
Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.
中文标题/摘要
标题:VL4Gaze:释放视觉语言模型的凝视跟踪能力
人类凝视为理解注意力、意图和社会互动提供了重要的线索,但在当前的视觉语言模型(VLMs)中,凝视理解尚未得到充分探索。尽管最近的VLMs在一系列视觉任务中实现了强大的场景级推理,但尚无基准系统地评估或训练它们进行凝视解释,这使得人们质疑是否可以从通用视觉语言预训练中自然涌现出凝视理解能力。为解决这一问题,我们引入了VL4Gaze,这是首个旨在研究、评估和解锁VLMs在凝视理解方面潜力的大规模基准。VL4Gaze包含489K个自动生成的问题-答案对,覆盖124K张图像,并通过四个互补任务将凝视理解统一为一个VQA问题:(1)凝视对象描述,(2)凝视方向描述,(3)凝视点定位,(4)模糊问题识别。我们全面评估了商业和开源VLMs在上下文学习和微调设置下的表现。结果表明,即使大规模VLMs在没有特定任务监督的情况下也难以可靠地推断凝视语义和空间定位。相比之下,通过VL4Gaze进行训练在所有任务上都带来了显著且一致的改进,突显了为开发VLMs的凝视理解能力而进行针对性多任务监督的重要性。我们将发布数据集和代码,以支持该领域的进一步研究和发展。
Summary / 总结
The research aims to explore the capability of vision-language models (VLMs) in understanding human gaze, which is crucial for interpreting attention and social interaction. VL4Gaze, a new benchmark, was introduced to evaluate and train VLMs for gaze interpretation through four tasks: gaze object description, gaze direction description, gaze point location, and ambiguous question recognition. The study found that large-scale VLMs perform poorly in inferring gaze semantics and spatial localization without task-specific supervision, but training on VL4Gaze significantly improves their performance across all tasks, emphasizing the need for targeted multi-task supervision for developing gaze understanding capabilities in VLMs.
论文介绍了VL4Gaze,这是一个用于评估视觉-语言模型(VLMs)在理解凝视方面的基准。它解决了VLMs在凝视解释方面缺乏基准的问题,并提出了四个任务:凝视对象描述、凝视方向描述、凝视点定位和模糊问题识别。评估结果显示,即使大型VLMs在没有特定监督的情况下也难以可靠地推断凝视语义和空间定位,但通过在VL4Gaze上进行训练,可以在所有任务上显著提高性能,强调了为VLMs开发凝视理解能力需要有针对性的多任务监督的重要性。
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
Authors: Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang
First: 2025-12-23T18:05:43+00:00 · Latest: 2025-12-23T18:05:43+00:00
Comments: Under submission
Abstract
Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual query or rely on deep attention maps, whose instability under aggressive pruning leads to degraded semantic alignment. We propose FlashVLM, a text-guided visual token selection framework that dynamically adapts visual inputs to the query. Instead of relying on noisy attention weights, FlashVLM computes an explicit cross-modal similarity between projected image tokens and normalized text embeddings in the language model space. This extrinsic relevance is fused with intrinsic visual saliency using log-domain weighting and temperature-controlled sharpening. In addition, a diversity-preserving partition retains a minimal yet representative set of background tokens to maintain global context. Under identical token budgets and evaluation protocols, FlashVLM achieves beyond-lossless compression, slightly surpassing the unpruned baseline while pruning up to 77.8 percent of visual tokens on LLaVA 1.5, and maintaining 92.8 percent accuracy even under 94.4 percent compression. Extensive experiments on 14 image and video benchmarks demonstrate that FlashVLM delivers state-of-the-art efficiency-performance trade-offs while maintaining strong robustness and generalization across mainstream VLMs.
中文标题/摘要
标题:FlashVLM:文本引导的视觉标记选择框架用于大型多模态模型
大型视觉-语言模型(VLMs)通常每张图像或视频帧处理数百或数千个视觉标记,导致二次注意成本和大量冗余。现有的标记减少方法通常忽略文本查询或依赖于深度注意图,这些图在激进剪枝下的不稳定性导致语义对齐下降。我们提出了FlashVLM,这是一种文本引导的视觉标记选择框架,能够根据查询动态调整视觉输入。FlashVLM 不依赖于嘈杂的注意权重,而是计算投影图像标记与语言模型空间中归一化文本嵌入之间的显式跨模态相似性。这种外在的相关性与内在的视觉显著性通过对数域加权和温度控制锐化进行融合。此外,一种保留多样性的划分保留了一组最小但具有代表性的背景标记,以保持全局上下文。在相同的标记预算和评估协议下,FlashVLM 实现了超越无损的压缩:在LLaVA 1.5上剪枝高达77.8%的视觉标记时略微超过未剪枝的基线,即使在94.4%的压缩率下仍保持92.8%的准确率。在14个图像和视频基准上的广泛实验表明,FlashVLM 提供了最先进的效率-性能权衡,同时在主流VLMs上保持强大的鲁棒性和泛化能力。
Summary / 总结
FlashVLM is a text-guided visual token selection framework that reduces the number of visual tokens processed by large vision-language models. It computes cross-modal similarity between projected image tokens and text embeddings, fuses this with visual saliency, and retains a small, representative set of background tokens. On LLaVA 1.5, FlashVLM slightly surpasses the unpruned baseline while pruning up to 77.8% of visual tokens, and it retains 92.8% accuracy even at 94.4% compression.
FlashVLM 是一种文本引导的视觉 token 选择框架,用于减少大型视觉-语言模型处理的视觉 token 数量。它计算图像 token 和文本嵌入之间的跨模态相似性,将其与视觉显著性融合,并保留少量具有代表性的背景 token。在 LLaVA 1.5 上,FlashVLM 在剪枝高达 77.8% 的视觉 token 时略微超过未剪枝的基线,并且即使在 94.4% 的压缩率下仍保持 92.8% 的准确率。在 14 个图像和视频基准上的实验表明,FlashVLM 提供了最先进的效率-性能权衡,并具有强大的鲁棒性和泛化能力。
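The selection procedure described in the abstract (text relevance, log-domain fusion with saliency, temperature sharpening, then a background partition) can be sketched roughly as below. This is an illustrative reading under stated assumptions, not the paper's algorithm: the cosine rescaling, the mixing weight `alpha`, the softmax form of the sharpening, and the evenly strided background picks are all hypothetical, as is the function name `select_tokens`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_tokens(image_tokens, text_embedding, saliency, budget,
                  n_background=1, alpha=0.5, temperature=0.5, eps=1e-6):
    """Return (kept token indices, sharpened weights).

    image_tokens: token vectors assumed already projected into the text space.
    saliency: one non-negative intrinsic score per token.
    """
    # Extrinsic relevance: cosine to the query, rescaled to [0, 1] (assumption).
    relevance = [(cosine(t, text_embedding) + 1.0) / 2.0 for t in image_tokens]
    # Log-domain weighting: a weighted sum of logs, i.e. a weighted geometric mean.
    fused = [alpha * math.log(r + eps) + (1.0 - alpha) * math.log(s + eps)
             for r, s in zip(relevance, saliency)]
    # Temperature-controlled sharpening, here a numerically stable softmax.
    peak = max(f / temperature for f in fused)
    exps = [math.exp(f / temperature - peak) for f in fused]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Foreground: highest-weight tokens, leaving room for background tokens.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    n_fg = max(budget - n_background, 0)
    foreground = order[:n_fg]
    # Background: evenly strided picks from the remaining tokens, a simple
    # stand-in for the diversity-preserving partition.
    rest = sorted(order[n_fg:])
    n_bg = min(budget - len(foreground), len(rest))
    background = [rest[(k * len(rest)) // n_bg] for k in range(n_bg)]
    return sorted(foreground + background), weights
```

Lowering `temperature` concentrates the weights on fewer tokens, while `n_background` reserves part of the budget for context tokens that the query alone would not select.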
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Authors: Shengchao Zhou, Yuxin Chen, Yuying Ge, Wei Huang, Jiehong Lin, Ying Shan, Xiaojuan Qi
First: 2025-12-23T17:56:36+00:00 · Latest: 2025-12-23T17:56:36+00:00
Abstract
Vision-language models (VLM) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about the evolvement of object geometry and relationship in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across aspects of dataset, benchmark and model, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
中文标题/摘要
标题:学习在四维中推理:视觉语言模型的动态空间理解
视觉语言模型(VLM)在一般理解方面表现出色,但在动态空间推理(DSR)方面仍然较弱,即在时间上理解物体几何形状和关系在三维空间中的演变,这主要是由于缺乏可扩展的四维意识训练资源。为了在数据集、基准和模型的各个方面弥合这一差距,我们引入了DSR套件。首先,我们提出了一种自动管道,从野外视频中生成DSR的多项选择题-答案对。通过利用现代视觉基础模型,该管道提取丰富的几何和运动信息,包括相机姿态、局部点云、物体掩码、方向和三维轨迹。这些几何线索使DSR-Train得以构建,用于学习和进一步的人工细化DSR-基准用于评估。与以往工作相比,我们的数据强调(i)野外视频来源,(ii)物体和场景级别的三维要求,(iii)视角变换,(iv)多物体交互,以及(v)细粒度、程序化的答案。除了数据,我们还提出了一种轻量级的几何选择模块(GSM),以无缝地将几何先验整合到VLM中,该模块浓缩了问题语义,并从预训练的四维重建先验中提取与问题相关的信息到一个紧凑的几何标记集。这种有针对性的提取避免了向模型灌输无关知识。实验表明,将DSR-Train和GSM集成到Qwen2.5-VL-7B中显著增强了其动态空间推理能力,同时在通用视频理解基准测试中保持了准确性。
Summary / 总结
The research aims to improve vision-language models' ability to reason about dynamic spatial relationships in videos. It introduces DSR Suite, which includes an automated pipeline for generating multiple-choice question-answer pairs from in-the-wild videos and a lightweight Geometry Selection Module (GSM) that condenses question-relevant geometric priors into a compact set of geometry tokens. The pipeline extracts geometric and motion information, which is used to build DSR-Train for learning and the human-refined DSR-Bench for evaluation. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B enhances its dynamic spatial reasoning capability while maintaining accuracy on general video understanding benchmarks.
研究旨在通过解决4D感知训练资源稀缺问题,提升视觉语言模型在动态空间推理(DSR)方面的能力。提出了DSR Suite,包括从野生视频自动生成DSR问答对的自动化管道和轻量级的几何选择模块(GSM),以整合几何先验。关键发现表明,将DSR-Train和GSM集成到Qwen2.5-VL-7B中,显著提升了动态空间推理能力,同时保持了通用视频理解基准的准确性。
Multi-Grained Text-Guided Image Fusion for Multi-Exposure and Multi-Focus Scenarios
Authors: Mingwei Tang, Jiahao Nie, Guang Yang, Ziqing Cui, Jie Li
Venue: WACV 2026
First: 2025-12-23T17:55:35+00:00 · Latest: 2025-12-23T17:55:35+00:00
Comments: Accepted to WACV 2026
Abstract
Image fusion aims to synthesize a single high-quality image from a pair of inputs captured under challenging conditions, such as differing exposure levels or focal depths. A core challenge lies in effectively handling disparities in dynamic range and focus depth between the inputs. With the advent of vision-language models, recent methods incorporate textual descriptions as auxiliary guidance to enhance fusion quality. However, simply incorporating coarse-grained descriptions hampers the understanding of fine-grained details and poses challenges for precise cross-modal alignment. To address these limitations, we propose Multi-grained Text-guided Image Fusion (MTIF), a novel fusion paradigm with three key designs. First, it introduces multi-grained textual descriptions that separately capture fine details, structural cues, and semantic content, guiding image fusion through a hierarchical cross-modal modulation module. Second, it involves supervision signals at each granularity to facilitate alignment between visual and textual features and enhance the utility of auxiliary text. Third, it adopts a saliency-driven enrichment module to augment training data with dense semantic content, further strengthening the cross-modal modulation and alignment. Extensive experiments show that MTIF consistently outperforms previous methods on both multi-exposure and multi-focus image fusion tasks.
中文标题/摘要
标题:多粒度文本引导图像融合以应对多曝光和多焦深场景
图像融合旨在从在具有挑战性条件下拍摄的一对输入中合成一张高质量的图像,例如不同的曝光水平或焦深。核心挑战在于有效处理输入之间的动态范围和焦深差异。随着视觉语言模型的出现,最近的方法将文本描述作为辅助指导以提高融合质量。然而,简单地引入粗粒度描述会阻碍对细粒度细节的理解,并且对跨模态对齐提出了挑战。为了解决这些限制,我们提出了多粒度文本引导图像融合(MTIF),这是一种具有三个关键设计的新融合范式。首先,它引入了多粒度文本描述,分别捕捉细粒度细节、结构线索和语义内容,通过分层跨模态调制模块引导图像融合。其次,它在每个粒度级别引入监督信号,以促进视觉和文本特征之间的对齐并增强辅助文本的实用性。第三,它采用了一种基于显著性的增强模块,通过密集的语义内容增强训练数据,进一步加强跨模态调制和对齐。广泛的实验表明,MTIF在多曝光和多焦深图像融合任务上始终优于先前的方法。
Summary / 总结
The paper addresses the challenge of image fusion under varying exposure and focus conditions by proposing Multi-grained Text-guided Image Fusion (MTIF). MTIF uses hierarchical textual descriptions to guide the fusion process, incorporating supervision signals at different granularities to enhance cross-modal alignment. The method also includes a saliency-driven enrichment module to improve semantic content in training data. Experimental results demonstrate that MTIF consistently outperforms existing methods in both multi-exposure and multi-focus scenarios.
该研究针对不同曝光和焦深条件下的图像融合挑战,提出了一种多粒度文本引导图像融合方法(MTIF)。MTIF利用层次化的文本描述来引导融合过程,分别捕捉细粒度细节、结构线索和语义内容,在每个粒度上引入监督信号以促进跨模态对齐,并通过显著性驱动的增强模块为训练数据补充密集的语义内容。实验表明,MTIF在多曝光和多焦点场景中均优于先前的方法。
Video Generation Models Are Good Latent Reward Models
Authors: Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang
First: 2025-11-26T16:14:18+00:00 · Latest: 2025-12-23T15:17:06+00:00
Abstract
Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
中文标题/摘要
标题:视频生成模型是良好的潜在奖励模型
奖励反馈学习(ReFL)已被证明对于使图像生成与人类偏好对齐是有效的。然而,将其扩展到视频生成面临着重大挑战。现有的视频奖励模型依赖于为像素空间输入设计的视觉语言模型,这将ReFL优化限制在昂贵的VAE解码之后的近完全去噪步骤中。这种像素空间的方法带来了巨大的内存开销和增加的训练时间,并且其后期优化缺乏早期监督,仅优化视觉质量而不是基本的运动动态和结构一致性。在本文中,我们展示了预训练的视频生成模型自然适合在嘈杂的潜在空间中进行奖励建模,因为它们明确设计为可以处理任意时间步的嘈杂潜在表示,并通过其序列建模能力内在地保留时间信息。因此,我们提出了过程奖励反馈学习(PRFL)框架,该框架完全在潜在空间中进行偏好优化,从而在整个去噪链中实现高效的梯度反向传播,而无需VAE解码。广泛的实验表明,PRFL在提高与人类偏好的对齐程度方面具有显著优势,同时与RGB ReFL相比实现了显著的内存消耗和训练时间减少。
Summary / 总结
This study addresses the challenge of applying reward feedback learning (ReFL) to video generation by proposing Process Reward Feedback Learning (PRFL). PRFL utilizes pre-trained video generation models for reward modeling in the noisy latent space, avoiding the need for computationally expensive VAE decoding. This approach leads to better alignment with human preferences and reduces memory consumption and training time.
本文解决了将奖励反馈学习(ReFL)应用于视频生成的问题,这比图像生成更为复杂。作者提出了Process Reward Feedback Learning(PRFL),利用预训练的视频生成模型在潜在空间中优化偏好,避免了昂贵的VAE解码。这种方法减少了内存使用和训练时间,同时提高了与人类偏好的一致性。
Scaling Laws for Energy Efficiency of Local LLMs
Authors: Ander Alvarez, Alessandro Genuardi, Nilotpal Sinha, Antonio Tiene, Mikail Okyay, Bakbergen Ryskulov, David Montero, Samuel Mugel, Román Orús
First: 2025-12-18T13:40:33+00:00 · Latest: 2025-12-23T15:02:39+00:00
Abstract
Deploying local large language models and vision-language models on edge devices requires balancing accuracy with constrained computational and energy budgets. Although graphics processors dominate modern artificial-intelligence deployment, most consumer hardware--including laptops, desktops, industrial controllers, and embedded systems--relies on central processing units. Despite this, the computational laws governing central-processing-unit-only inference for local language and vision-language workloads remain largely unexplored. We systematically benchmark large language and vision-language models on two representative central-processing-unit tiers widely used for local inference: a MacBook Pro M2, reflecting mainstream laptop-class deployment, and a Raspberry Pi 5, representing constrained, low-power embedded settings. Using a unified methodology based on continuous sampling of processor and memory usage together with area-under-curve integration, we characterize how computational load scales with input text length for language models and with image resolution for vision-language models. We uncover two empirical scaling laws: (1) computational cost for language-model inference scales approximately linearly with token length; and (2) vision-language models exhibit a preprocessing-driven "resolution knee", where compute remains constant above an internal resolution clamp and decreases sharply below it. Beyond these laws, we show that quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy. These results provide a systematic quantification of multimodal central-processing-unit-only scaling for local language and vision-language workloads, and they identify model compression and input-resolution preprocessing as effective, low-cost levers for sustainable edge inference.
中文标题/摘要
标题:本地大语言模型的能量效率扩展定律
在边缘设备上部署本地的大语言模型和视觉-语言模型需要在准确性与受限的计算和能源预算之间进行权衡。尽管图形处理器主导了现代人工智能部署,但大多数消费级硬件——包括笔记本电脑、台式机、工业控制器和嵌入式系统——仍然依赖于中央处理器。尽管如此,仅中央处理器的推理计算法则对本地语言和视觉-语言工作负载的研究仍然相对较少。我们系统地在两个广泛用于本地推理的中央处理器级别上对大语言模型和视觉-语言模型进行了基准测试:一台搭载M2芯片的MacBook Pro,代表主流笔记本电脑级别的部署,以及一个Raspberry Pi 5,代表受限的、低功耗嵌入式设置。我们采用了一种基于连续采样处理器和内存使用情况并结合面积-曲线积分的统一方法,来表征计算负载随输入文本长度的变化规律,以及图像分辨率对视觉-语言模型的影响。我们发现了两条经验性扩展定律:(1)语言模型推理的计算成本大约与标记长度成线性关系;(2)视觉-语言模型表现出一种预处理驱动的“分辨率拐点”,其中计算量在内部分辨率限制以上保持恒定,在以下则急剧下降。除了这些定律,我们还展示了量子启发式压缩可以将处理器和内存使用量最多减少71.9%,能源消耗最多减少62%,同时保持或提高语义准确性。这些结果为本地语言和视觉-语言工作负载的单一中央处理器多模态扩展提供了一种系统化的量化方法,并指出了模型压缩和输入分辨率预处理作为可持续边缘推理的有效、低成本杠杆。
Summary / 总结
This study investigates the energy efficiency of running large language models and vision-language models locally on CPU-only edge hardware. Benchmarking on a MacBook Pro M2 and a Raspberry Pi 5, the researchers report two empirical scaling laws: computational cost for language-model inference scales approximately linearly with token length, and vision-language models show a preprocessing-driven "resolution knee", where compute stays constant above an internal resolution clamp and drops sharply below it. Additionally, quantum-inspired compression reduces processor and memory usage by up to 71.9% and energy consumption by up to 62%, while preserving or improving semantic accuracy.
该研究探讨了在边缘设备上部署大型语言模型和视觉语言模型的能效问题,重点关注中央处理单元。通过在MacBook Pro M2和Raspberry Pi 5上进行基准测试,研究发现了两个缩放定律:语言模型的计算成本随标记长度线性增加,而视觉语言模型表现出预处理驱动的“分辨率拐点”,其中计算在某个分辨率以上保持恒定,在以下则急剧下降。此外,研究还表明,量子启发式压缩可以将处理器和内存使用量最多减少71.9%,能量消耗最多减少62%,同时保持或提高语义准确性。
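The measurement approach in the abstract (continuous sampling of processor usage, area-under-curve integration, then a scaling fit against input length) can be illustrated with a short sketch. It assumes trapezoidal integration and an ordinary least-squares line; the abstract does not say which integration rule or fitting procedure the authors used, and the function names are hypothetical.

```python
def area_under_curve(times, values):
    """Trapezoidal integral of sampled usage over time.
    times: sample timestamps in seconds, increasing.
    values: usage at each timestamp (for example, percent CPU)."""
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        area += dt * (values[i] + values[i - 1]) / 2.0
    return area


def fit_line(xs, ys):
    """Least-squares slope and intercept for ys ~ slope * xs + intercept.
    Requires at least two distinct x values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Under this reading, each inference run yields one integrated cost via `area_under_curve`, and fitting those costs against token counts with `fit_line` is one way to check for the roughly linear relationship the paper reports.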
Chain-of-Anomaly Thoughts with Large Vision-Language Models
Authors: Pedro Domingos, João Pereira, Vasco Lopes, João Neves, David Semedo
First: 2025-12-23T15:01:05+00:00 · Latest: 2025-12-23T15:01:05+00:00
Comments: 2 pages, 3 figures, 1 table. Accepted for RECPAD 2025
Abstract
Automated video surveillance with Large Vision-Language Models is limited by their inherent bias towards normality, often failing to detect crimes. While Chain-of-Thought reasoning strategies show significant potential for improving performance in language tasks, the lack of inductive anomaly biases in their reasoning further steers the models towards normal interpretations. To address this, we propose Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that introduces inductive criminal bias in the reasoning process through a final, anomaly-focused classification layer. Our method significantly improves Anomaly Detection, boosting F1-score by 11.8 p.p. on challenging low-resolution footage and Anomaly Classification by 3.78 p.p. in high-resolution videos.
中文标题/摘要
标题:大型视觉语言模型中的异常链思考
大型视觉语言模型在自动化视频监控中的应用受限于它们对正常情况的固有偏见,往往无法检测到犯罪行为。虽然链式思考推理策略在语言任务中显示出显著的改进潜力,但在推理过程中缺乏归纳异常偏见进一步导致模型倾向于正常解释。为了解决这一问题,我们提出了一种名为异常链思考(CoAT)的多智能体推理框架,通过引入最终的异常分类层来在推理过程中引入归纳犯罪偏见。我们的方法显著提高了异常检测性能,低分辨率视频上的F1分数提高了11.8个百分点,在高分辨率视频上的异常分类提高了3.78个百分点。
Summary / 总结
The paper addresses the limitation of large vision-language models in detecting anomalies in video surveillance due to their bias towards normality. It introduces Chain-of-Anomaly-Thoughts (CoAT), a multi-agent reasoning framework that adds an inductive criminal bias to the reasoning process through a final, anomaly-focused classification layer. The method improves the anomaly detection F1-score by 11.8 percentage points on challenging low-resolution footage and anomaly classification by 3.78 percentage points on high-resolution videos.
论文针对大型视觉-语言模型在自动视频监控中检测犯罪时因偏向正常性而存在的局限性。提出了Chain-of-Anomaly-Thoughts (CoAT) 多代理推理框架,通过引入归纳犯罪偏见来增强异常检测。该方法在低分辨率视频中实现了F1分数11.8个百分点的提升,在高分辨率视频中实现了3.78个百分点的异常分类改进。
LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer
Authors: Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang
First: 2025-08-01T09:51:54+00:00 · Latest: 2025-12-23T14:27:42+00:00
Comments: 8 pages, 5 figures, 3 tables
Abstract
In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
中文标题/摘要
标题:LAMIC:基于布局感知的多图像合成通过多模态扩散变换器的可扩展性
在可控图像合成中,从多个参考中生成具有空间布局感知的连贯且一致的图像仍然是一个开放的挑战。我们提出了LAMIC,一种布局感知的多图像合成框架,首次以无需训练的方式将单参考扩散模型扩展到多参考场景。基于MMDiT模型,LAMIC引入了两种即插即用的注意力机制:1)组隔离注意力(GIA)以增强实体分离;2)区域调节注意力(RMA)以实现布局感知生成。为进一步评估模型能力,我们还引入了三个指标:1)包含比(IN-R)和填充比(FI-R)以评估布局控制;2)背景相似度(BG-S)以衡量背景一致性。大量实验表明,LAMIC在大多数主要指标上均取得了最先进的性能:在所有设置中,它在ID-S、BG-S、IN-R和AVG得分上始终优于现有的多参考基线,并在复杂合成任务中实现了最佳的DPG。这些结果表明,LAMIC在身份保持、背景保存、布局控制和指令跟随方面具有优越的能力,所有这些均无需任何训练或微调,展示了强大的零样本泛化能力。通过继承高级单参考模型的优势并使其实现无缝扩展到多图像场景,LAMIC为可控多图像合成建立了一个新的无需训练的范式。随着基础模型的不断进化,LAMIC的性能预计也将相应扩展。我们的实现可在以下链接获取:https://github.com/Suchenl/LAMIC。
Summary / 总结
LAMIC is a Layout-Aware Multi-Image Composition framework that extends single-reference diffusion models to multi-reference scenarios without training. Built on the MMDiT model, it introduces two plug-and-play attention mechanisms: Group Isolation Attention (GIA) for entity disentanglement and Region-Modulated Attention (RMA) for layout-aware generation. LAMIC achieves state-of-the-art performance across most major metrics, outperforming existing multi-reference baselines in ID-S, BG-S, IN-R, and AVG scores, and achieving the best DPG in complex composition tasks. These results highlight LAMIC's zero-shot ability in identity keeping, background preservation, layout control, and prompt-following.
LAMIC 是一种布局感知的多图像合成框架,它将单参考扩散模型扩展到多参考场景,无需训练。该框架引入了两种注意力机制:组隔离注意力用于实体分离,区域调节注意力用于布局感知生成。LAMIC 在大多数主要指标上(包括 ID-S、BG-S、IN-R 和 AVG 分数)实现了最先进的性能,并在复杂合成任务中展示了强大的零样本泛化能力,无需任何微调。该框架建立了新的无训练框架,用于可控的多图像合成。