arXiv Paper Digest

Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Authors: Boammani Aser Lompo, Marc Haraoui

First: 2025-09-09T17:52:26+00:00 · Latest: 2025-09-09T17:52:26+00:00

Comments: Work in Progress

Abstract

Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
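The LLM-jury filtering step mentioned in the abstract can be pictured as a vote over independent judge verdicts. The sketch below is only an illustration: the function names `jury_accepts` and `filter_qa_pairs`, the boolean-verdict representation, and the strict-majority threshold are assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def jury_accepts(verdicts: Sequence[bool], min_fraction: float = 0.5) -> bool:
    """Return True when more than `min_fraction` of the judges approve.

    Hypothetical rule: the abstract does not state the actual criterion.
    """
    if not verdicts:
        return False  # no judges, no acceptance
    approvals = sum(1 for v in verdicts if v)
    return approvals / len(verdicts) > min_fraction


def filter_qa_pairs(candidates: List[Tuple[str, Sequence[bool]]]) -> List[str]:
    """Keep only the QA pairs that the jury accepts.

    `candidates` holds (qa_pair, verdicts) tuples, one verdict per judge model.
    """
    return [qa for qa, verdicts in candidates if jury_accepts(verdicts)]
```

A strict majority is used here so that a split jury rejects the sample; a real pipeline might instead require unanimity or weight judges differently.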

中文标题/摘要

标题：Visual-TableQA：基于表格图像的开放领域基准测试，用于表格结构化数据的视觉推理

对结构化数据（如表格）进行视觉推理是现代视觉-语言模型（VLMs）的关键能力，但当前的基准测试在规模、多样性或推理深度上仍然有限，尤其是在涉及渲染表格图像时。为解决这一差距，我们引入了Visual-TableQA，这是一个大规模、开放领域的多模态数据集，专门用于评估和提升对复杂表格数据的视觉推理能力。我们的生成管道是模块化、可扩展且完全自主的，涉及多个推理LLM在不同角色下的协作：生成、验证和启发。Visual-TableQA 包含2500个丰富结构化的LaTeX渲染表格和6000个推理密集型问答对，所有这些数据的生成成本低于100美元。为了促进多样性和创造力，我们的管道通过跨模型提示（‘启发’）和LLM-陪审团筛选进行多模型协作数据生成。更强的模型为较弱的模型提供布局和主题，共同提炼出多样化的推理模式和视觉结构。实验证明，基于Visual-TableQA微调的模型能够稳健地泛化到外部基准测试，尽管数据集具有合成性，仍能超越几个专有模型。完整的管道和资源可在https://github.com/AI-4-Everyone/Visual-TableQA上公开获取。

Summary / 总结

Visual-TableQA is a large-scale dataset designed to evaluate and enhance visual reasoning over complex tabular data, addressing limitations in current benchmarks. It uses a modular and scalable generation pipeline involving multiple reasoning LLMs in roles of generation, validation, and inspiration. The dataset includes 2,500 richly structured LaTeX-rendered tables and 6,000 reasoning-intensive QA pairs, generated for under USD 100. Empirical results show that models fine-tuned on Visual-TableQA generalize well to external benchmarks, outperforming several proprietary models despite the synthetic nature of the dataset.

Visual-TableQA 是一个大规模的数据集，旨在评估和提升对复杂表格数据的视觉推理能力。它包含一个模块化和可扩展的生成管道，涉及多个推理 LLM 在生成、验证和灵感角色中的协作。该数据集包括 2,500 个丰富结构化的 LaTeX 渲染表格和 6,000 个推理密集型 QA 对，以较低的成本生成。在外部基准测试上的实证结果显示，使用 Visual-TableQA 微调的模型能够稳健地泛化，并且在多个专有模型中表现出色，尽管数据集具有合成性质。

TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection

Authors: Wei Wu, Zhuoshi Pan, Chao Wang, Liyi Chen, Yunchu Bai, Tianfu Wang, Kun Fu, Zheng Wang, Hui Xiong

First: 2024-11-05T07:56:24+00:00 · Latest: 2025-09-09T13:30:17+00:00

Comments: Accepted by EMNLP2025

Abstract

Rapid advances in Large Language Models (LLMs) have spurred demand for processing extended context sequences in contemporary applications. However, this progress faces two challenges: performance degradation on out-of-distribution sequence lengths, and excessively long inference times caused by the quadratic computational complexity of attention. These issues limit LLMs in long-context scenarios. In this paper, we propose Dynamic Token-Level KV Cache Selection (TokenSelect), a training-free method for efficient and accurate long-context inference. TokenSelect builds upon the observation of non-contiguous attention sparsity, using QK dot products to measure per-head KV cache criticality at the token level. Through a per-head soft voting mechanism, TokenSelect selectively involves a few critical KV cache tokens in the attention calculation without sacrificing accuracy. To further accelerate TokenSelect, we design the Selection Cache based on observations of consecutive query similarity and implement the efficient Paged Dot Product Kernel, significantly reducing the selection overhead. A comprehensive evaluation of TokenSelect demonstrates up to 23.84× speedup in attention computation and up to 2.28× acceleration in end-to-end latency, while providing superior performance compared to state-of-the-art long-context inference methods.
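The selection idea in the abstract (per-head QK dot products followed by a soft vote across heads) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_tokens`, the use of a per-head softmax as the "soft vote", and the fixed token `budget` are all assumptions, and the Selection Cache and Paged Dot Product Kernel are not modeled.

```python
import math
from typing import List


def select_tokens(q: List[List[float]],
                  k: List[List[List[float]]],
                  budget: int) -> List[int]:
    """Pick `budget` cached tokens by summing per-head soft votes.

    q[h] is the current query vector of head h.
    k[h][t] is the cached key vector of token t for head h.
    """
    num_tokens = len(k[0])
    votes = [0.0] * num_tokens
    for q_h, k_h in zip(q, k):
        # QK dot product: how critical each cached token is for this head.
        scores = [sum(a * b for a, b in zip(q_h, key)) for key in k_h]
        # Softmax normalizes each head's scores to sum to 1, so a head with
        # large raw logits cannot dominate the vote (assumed voting rule).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        for t, e in enumerate(exps):
            votes[t] += e / total
    # Keep the highest-voted tokens and return them in positional order.
    top = sorted(range(num_tokens), key=lambda t: votes[t], reverse=True)[:budget]
    return sorted(top)
```

Attention would then be computed only over the returned indices, which is where the reported speedup would come from when the budget is much smaller than the context length.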

中文标题/摘要

标题：TokenSelect：通过动态选择令牌级KV缓存实现高效长上下文推理和长度外推

大型语言模型（LLMs）的迅速发展推动了现代应用中处理扩展上下文序列的需求。然而，这一进展面临两个挑战：由于序列长度超出分布而导致的性能下降，以及由于注意力机制的二次计算复杂性导致的过长推理时间。这些问题限制了LLMs在长上下文场景中的应用。本文提出了一种无需训练的方法——动态令牌级KV缓存选择（TokenSelect），以实现高效准确的长上下文推理。TokenSelect基于非连续注意力稀疏性的观察，使用QK点积来衡量每个头在令牌级的KV缓存关键性。通过每个头的软投票机制，TokenSelect选择性地参与少量关键KV缓存令牌的注意力计算，而不牺牲准确性。为了进一步加速TokenSelect，我们基于连续查询相似性的观察设计了选择缓存，并实现了高效的分页点积内核，显著减少了选择开销。全面评估表明，TokenSelect在注意力计算中可实现高达23.84倍的加速，在端到端延迟中可实现高达2.28倍的加速，同时在长上下文推理方法中提供更优的性能。

Summary / 总结

TokenSelect is a training-free method for efficient long-context inference in LLMs, addressing performance degradation and long inference times. It performs dynamic token-level KV cache selection, using QK dot products to measure the per-head criticality of each cached token and a per-head soft voting mechanism to choose the few tokens that take part in attention, without accuracy loss. TokenSelect further accelerates the process with a Selection Cache and an efficient Paged Dot Product Kernel, achieving up to 23.84 times speedup in attention computation and up to 2.28 times acceleration in end-to-end latency.

TokenSelect 是一种无需训练的方法，用于在大语言模型中高效进行长上下文推理，解决性能下降和长推理时间的问题。它通过基于 QK 点积测量 KV 缓存 token 的关键性来进行动态 token 级别 KV 缓存选择，允许在不牺牲准确性的前提下进行选择性的注意力计算。TokenSelect 进一步通过选择缓存和高效的分页点积内核加速过程，实现了高达 23.84 倍的注意力计算加速和 2.28 倍的端到端延迟加速，相比现有最先进的长上下文推理方法具有更优的性能。

Light-Weight Cross-Modal Enhancement Method with Benchmark Construction for UAV-based Open-Vocabulary Object Detection

Authors: Zhenhai Weng, Xinjie Li, Can Wu, Weijie He, Jianfeng Lv, Dong Zhou, Zhongliang Yu

First: 2025-09-07T10:59:02+00:00 · Latest: 2025-09-09T12:22:18+00:00

Abstract

Open-Vocabulary Object Detection (OVD) faces severe performance degradation when applied to UAV imagery due to the domain gap from ground-level datasets. To address this challenge, we propose a complete UAV-oriented solution that combines both dataset construction and model innovation. First, we design a refined UAV-Label Engine, which efficiently resolves annotation redundancy, inconsistency, and ambiguity, enabling the generation of large-scale UAV datasets. Based on this engine, we construct two new benchmarks: UAVDE-2M, with over 2.4M instances across 1,800+ categories, and UAVCAP-15K, providing rich image-text pairs for vision-language pretraining. Second, we introduce the Cross-Attention Gated Enhancement (CAGE) module, a lightweight dual-path fusion design that integrates cross-attention, adaptive gating, and global FiLM modulation for robust text-vision alignment. By embedding CAGE into the YOLO-World-v2 framework, our method achieves significant gains in both accuracy and efficiency, notably improving zero-shot detection on VisDrone by +5.3 mAP while reducing parameters and GFLOPs, and demonstrating strong cross-domain generalization on SIMD. Extensive experiments and real-world UAV deployment confirm the effectiveness and practicality of our proposed solution for UAV-based OVD.

中文标题/摘要

标题：基于无人机的开放词汇目标检测轻量级跨模态增强方法及基准构建

开放词汇目标检测（OVD）在应用于无人机图像时由于与地面数据集之间的领域差距而面临严重的性能下降。为解决这一挑战，我们提出了一种完整的面向无人机的解决方案，结合了数据集构建和模型创新。首先，我们设计了一种改进的无人机标注引擎，该引擎高效地解决了标注冗余、不一致和模糊性问题，从而能够生成大规模的无人机数据集。基于此引擎，我们构建了两个新的基准：UAVDE-2M，包含超过240万实例和1800多个类别，以及UAVCAP-15K，提供了丰富的图像-文本对用于视觉-语言预训练。其次，我们引入了跨注意力门控增强（CAGE）模块，这是一种轻量级的双路径融合设计，结合了跨注意力、自适应门控和全局FiLM调制，以实现稳健的文本-视觉对齐。通过将CAGE嵌入YOLO-World-v2框架中，我们的方法在准确性和效率上均取得了显著提升，在VisDrone上的零样本检测上提高了5.3个mAP，同时减少了参数和GFLOPs，并在SIMD上展示了强大的跨域泛化能力。广泛的实验和实际无人机部署证实了我们提出的方法在无人机基于的OVD中的有效性和实用性

Summary / 总结

The paper addresses the performance degradation of Open-Vocabulary Object Detection (OVD) in UAV imagery by proposing a comprehensive solution that includes dataset construction and model innovation. It introduces a UAV-Label Engine to generate large-scale UAV datasets and two new benchmarks, UAVDE-2M and UAVCAP-15K. Additionally, it presents the Cross-Attention Gated Enhancement (CAGE) module, a lightweight dual-path fusion design, which improves zero-shot detection accuracy by +5.3 mAP on VisDrone while reducing computational resources and demonstrating strong cross-domain generalization on SIMD.

论文针对无人机图像中开放词汇对象检测（OVD）性能下降的问题，提出了一种完整的无人机导向解决方案，包括一个改进的无人机标注引擎以构建数据集，以及两个新的基准：UAVDE-2M和UAVCAP-15K。此外，还引入了一个轻量级的交叉注意力门控增强（CAGE）模块，该模块在VisDrone上的零样本检测精度提高了5.3 mAP，同时减少了计算资源的使用，并在SIMD上展示了强大的跨域泛化能力。

MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

Authors: Minghao Han, Linhao Qu, Dingkang Yang, Xukun Zhang, Xiaoying Wang, Lihua Zhang

First: 2024-08-21T10:25:51+00:00 · Latest: 2025-09-09T12:15:25+00:00

Comments: This work has been submitted to the IEEE TMI for possible publication

Abstract

Multiple instance learning (MIL) has become a standard paradigm for the weakly supervised classification of whole slide images (WSIs). However, this paradigm relies on using a large number of labeled WSIs for training. The lack of training data and the presence of rare diseases pose significant challenges for these methods. Prompt tuning combined with pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) These methods fail to fully leverage the prior knowledge from the VLM's text modality; 2) They overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) They lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for the FSWC task. Specifically, MSCPT employs a frozen large language model to generate pathological visual language prior knowledge at multiple scales, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within WSIs, and finally, we introduce a non-parametric cross-guided instance aggregation module to derive WSI-level features. Extensive experiments, visualizations, and interpretability analyses were conducted on five datasets and three downstream tasks using three VLMs, demonstrating the strong performance of our MSCPT. All code has been made publicly accessible at https://github.com/Hanminghao/MSCPT.

中文标题/摘要

标题：MSCPT：多尺度和语境聚焦提示调优的少量样本全切片图像分类

多实例学习（MIL）已成为弱监督全切片图像（WSI）分类的标准范式。然而，这种方法依赖于大量标注的WSI进行训练。缺乏训练数据和罕见疾病的出现对这些方法构成了重大挑战。结合预训练的视觉-语言模型（VLMs）的提示调优是一种有效的少量样本弱监督WSI分类（FSWC）任务解决方案。然而，将为自然图像设计的提示调优方法应用于WSI时，存在三个重大挑战：1）这些方法未能充分利用VLM文本模态的先验知识；2）它们忽略了WSI中的重要多尺度和上下文信息，导致结果次优；3）它们缺乏实例聚合方法的探索。为了解决这些问题，我们提出了一种多尺度和语境聚焦提示调优（MSCPT）方法用于FSWC任务。具体而言，MSCPT利用冻结的大语言模型在多尺度下生成病理视觉语言先验知识，引导分层提示调优。此外，我们设计了一个图提示调优模块来学习WSI内的关键上下文信息，最后引入了一个非参数交叉引导实例聚合模块以提取WSI级别的特征。在三个VLMs的五个数据集和三个下游任务上进行了广泛的实验、可视化和可解释性分析，证明了我们MSCPT的强大性能。所有代码已公开发布在https://github.com/Hanminghao/MSCPT。

Summary / 总结

The research addresses the challenge of few-shot weakly supervised whole slide image classification (FSWC) by proposing MSCPT, which integrates multi-scale and context-focused prompt tuning. It uses a frozen large language model to generate pathological visual-language prior knowledge at multiple scales, which guides hierarchical prompt tuning, and a graph prompt tuning module to capture contextual information within each WSI. The method also introduces a non-parametric cross-guided instance aggregation module to derive WSI-level features. Experiments on five datasets show strong performance in three downstream tasks using three different Vision-Language models.

研究旨在通过提出MSCPT方法解决少量标注的全切片图像分类问题，该方法结合了多尺度和上下文聚焦的提示调优。该方法利用预训练的视觉语言模型的文本模态生成特定尺度的先验知识，并使用图提示调优模块来捕捉上下文信息。在五个数据集上的实验结果表明，MSCPT在三个下游任务中表现出强大的性能，证明了其有效性。

Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease

Authors: Fangqi Cheng, Surajit Ray, Xiaochen Yang

First: 2025-09-09T11:36:21+00:00 · Latest: 2025-09-09T11:36:21+00:00

Abstract

Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.

中文标题/摘要

标题：数据高效微调视觉-语言模型以诊断阿尔茨海默病

医疗视觉-语言模型（Med-VLMs）在报告生成和视觉问答等任务中取得了令人印象深刻的成果，但仍然面临一些限制。最明显的是，它们未能充分利用患者元数据，并缺乏临床诊断知识的整合。此外，大多数现有模型通常从头开始训练或在大规模的2D图像-文本对上进行微调，需要大量的计算资源，而且由于缺乏结构信息，它们在3D医学成像上的效果往往有限。为了解决这些差距，我们提出了一种数据高效的微调流水线，以适应基于3D CT的Med-VLMs，并展示了其在阿尔茨海默病（AD）诊断中的应用。我们的系统引入了两个关键创新。首先，我们将结构化的元数据转换为合成报告，丰富了文本输入，以改善图像-文本对齐。其次，我们添加了一个辅助标记，用于预测迷你精神状态检查（MMSE）分数，这是一种广泛使用的临床认知功能测量指标，与AD严重程度相关。这为微调提供了额外的监督。通过在图像和文本模态上应用轻量级提示微调，我们的方法在两个AD数据集上使用1,500张训练图像达到了最先进的性能，优于在10,000张图像上微调的现有方法。代码将在发表后发布。

Visuospatial Cognitive Assistant

Authors: Qi Feng

First: 2025-05-18T08:55:02+00:00 · Latest: 2025-09-09T09:48:14+00:00

Comments: 31 pages, 10 figures, 6 tables

Abstract

Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). This paper makes two key contributions. First, we introduce ViCA (Visuospatial Cognitive Assistant)-322K, a diverse dataset of 322,003 QA pairs from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), offering supervision for 3D metadata-grounded queries and video-based complex reasoning. Second, we develop ViCA-7B, fine-tuned on ViCA-322K, which achieves new state-of-the-art on all eight VSI-Bench tasks, outperforming existing models, including larger ones (e.g., +26.1 on Absolute Distance). For interpretability, we present ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and fine-tune ViCA-7B to create ViCA-7B-Thinking, a model that articulates its spatial reasoning. Our work highlights the importance of targeted data and suggests paths for improved temporal-spatial modeling. We release all resources to foster research in robust visuospatial intelligence.

中文标题/摘要

标题：空间认知辅助系统

基于视频的空间认知对于机器人技术和具身AI至关重要，但目前的视觉-语言模型（VLMs）面临挑战。本文做出了两项关键贡献。首先，我们引入了ViCA（空间认知辅助系统）-322K，这是一个包含322,003个问答对的多样数据集，来自真实室内视频（ARKitScenes、ScanNet、ScanNet++），提供3D元数据驱动查询和基于视频的复杂推理的监督。其次，我们开发了在ViCA-322K上微调的ViCA-7B，该模型在所有八个VSI-Bench任务上达到了新的最佳性能，超越了现有模型，包括更大的模型（例如，在绝对距离上提高了26.1%）。为了提高可解释性，我们提出了ViCA-Thinking-2.68K数据集，其中包含明确的推理链，并微调ViCA-7B创建了ViCA-7B-Thinking模型，该模型能够阐述其空间推理。我们的工作强调了目标数据的重要性，并指出了改进时空建模的路径。我们发布了所有资源以促进稳健的空间智能研究。

Summary / 总结

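Video-based spatial cognition is vital for robotics and embodied AI but challenges current Vision-Language Models (VLMs). The paper introduces ViCA-322K, a dataset of 322,003 QA pairs drawn from real-world indoor videos (ARKitScenes, ScanNet, ScanNet++), and ViCA-7B, a model fine-tuned on it that achieves state-of-the-art results on all eight VSI-Bench tasks (e.g., +26.1 on Absolute Distance). For interpretability, it also presents ViCA-Thinking-2.68K, a dataset with explicit reasoning chains, and ViCA-7B-Thinking, a model that articulates its spatial reasoning. All resources are released.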

InteractPro: A Unified Framework for Motion-Aware Image Composition

Authors: Weijing Tao, Xiaofeng Yang, Miaomiao Cui, Guosheng Lin

First: 2024-09-16T08:44:17+00:00 · Latest: 2025-09-09T08:10:04+00:00

Abstract

We introduce InteractPro, a comprehensive framework for dynamic motion-aware image composition. At its core is InteractPlan, an intelligent planner that leverages a Large Vision Language Model (LVLM) for scenario analysis and object placement, determining the optimal composition strategy to achieve realistic motion effects. Based on each scenario, InteractPlan selects between our two specialized modules: InteractPhys and InteractMotion. InteractPhys employs an enhanced Material Point Method (MPM)-based simulation to produce physically faithful and controllable object-scene interactions, capturing diverse and abstract events that require true physical modeling. InteractMotion, in contrast, is a training-free method based on pretrained video diffusion. Traditional composition approaches suffer from two major limitations: requiring manual planning for object placement and generating static, motionless outputs. By unifying simulation-based and diffusion-based methods under planner guidance, InteractPro overcomes these challenges, ensuring richly motion-aware compositions. Extensive quantitative and qualitative evaluations demonstrate InteractPro's effectiveness in producing controllable and coherent compositions across varied scenarios.

中文标题/摘要

标题：InteractPro：一种统一的动态运动感知图像合成框架

我们介绍了InteractPro，一个全面的动态运动感知图像合成框架。其核心是InteractPlan，一种智能规划器，利用大型视觉语言模型（LVLM）进行场景分析和物体放置，确定实现逼真运动效果的最佳合成策略。根据每个场景，InteractPlan 选择使用我们两个专门模块之一：InteractPhys 和 InteractMotion。InteractPhys 使用增强的基于材料点方法（MPM）的模拟来生成物理上忠实且可控的物体-场景交互，捕捉需要真实物理建模的多样和抽象事件。相比之下，InteractMotion 是一种无需训练的方法，基于预训练的视频扩散。传统合成方法存在两大局限性：需要手动规划物体放置和生成静态、无运动的输出。通过在规划器指导下统一基于模拟和基于扩散的方法，InteractPro 克服了这些挑战，确保了丰富的运动感知合成。广泛的定量和定性评估表明，InteractPro 在各种场景中生成可控且连贯的合成效果方面非常有效。

Summary / 总结

InteractPro is a framework for dynamic motion-aware image composition that uses an intelligent planner called InteractPlan, which leverages a Large Vision Language Model to analyze scenarios and place objects optimally. It selects between InteractPhys, which uses an enhanced MPM-based simulation for physically faithful object interactions, and InteractMotion, a training-free method based on pretrained video diffusion. This approach addresses the limitations of traditional methods by providing controllable and coherent compositions across various scenarios, overcoming the need for manual planning and static outputs.

InteractPro 是一个用于动态运动感知图像合成的框架，使用了智能规划器 InteractPlan 来分析场景并优化物体放置。它会选择使用增强的 MPM 基础模拟进行物理忠实交互的 InteractPhys，以及基于预训练视频扩散的无需训练的 InteractMotion 方法。这种方法通过提供各种场景下的可控且连贯的合成来克服传统方法的局限性，无需手动规划和静态输出。

Fine-Tuning Vision-Language Models for Visual Navigation Assistance

Authors: Xiao Li, Bharat Gandhi, Ming Zhan, Mohit Nehra, Zhicheng Zhang, Yuchen Sun, Meijia Song, Naisheng Zhang, Xi Wang

First: 2025-09-09T08:08:35+00:00 · Latest: 2025-09-09T08:08:35+00:00

Abstract

We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.
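For readers unfamiliar with Low Rank Adaptation, the core idea is that the pretrained weight matrix W stays frozen and only a low-rank update B·A is trained, so the adapted layer computes y = W x + (alpha / r) · B (A x). The pure-Python sketch below illustrates that formula on a single linear layer; the helper names `matvec` and `lora_forward` are hypothetical, and nothing here reflects how the authors configured LoRA for BLIP-2.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute y = W x + (alpha / r) * B (A x).

    w: frozen pretrained weights, shape (out, in)
    a: trainable down-projection, shape (r, in)
    b: trainable up-projection, shape (out, r)
    """
    r = len(a)  # the rank is the number of rows of A
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / r
    return [y + scale * u for y, u in zip(base, update)]
```

When B is all zeros, which is the usual initialization, the output equals the frozen layer's output, so fine-tuning starts from the pretrained model's behavior.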

中文标题/摘要

标题：视觉语言模型的视觉导航辅助微调

我们解决基于视觉语言的室内导航问题，使用图像和自然语言指导视觉障碍者到达目标位置。传统的导航系统由于缺乏精确的位置数据而在室内无效。我们的方法结合视觉和语言模型生成逐步导航指令，增强无障碍性和独立性。我们使用手动标注的室内导航数据集对BLIP-2模型进行低秩适应（LoRA）微调。我们提出了一种评估指标，通过强调方向性和顺序性变量改进了BERT F1分数，提供了一个更全面的导航性能衡量标准。应用LoRA后，模型在生成方向性指令方面显著提高，克服了原始BLIP-2模型的局限性。

Summary / 总结

The research aims to improve indoor navigation for visually impaired individuals by integrating vision and language models. The method involves fine-tuning the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated dataset. Key findings show that the model's performance in generating directional instructions was significantly enhanced, surpassing the limitations of the original BLIP-2 model.

研究旨在通过结合视觉和语言模型来改善视障人士的室内导航。方法是使用低秩适应（LoRA）对BLIP-2模型进行微调，并在手动标注的数据集上进行训练。主要发现表明，模型在生成方向性指令方面的性能得到了显著提升，超越了原始BLIP-2模型的局限性。

SheetDesigner: MLLM-Powered Spreadsheet Layout Generation with Rule-Based and Vision-Based Reflection

Authors: Qin Chen, Yuanyi Ren, Xiaojun Ma, Mugeng Liu, Han Shi, Dongmei Zhang

Venue: EMNLP 2025

First: 2025-09-09T07:51:38+00:00 · Latest: 2025-09-09T07:51:38+00:00

Comments: Accepted to EMNLP 2025 Main Conference

Abstract

Spreadsheets are critical to data-centric tasks, with rich, structured layouts that enable efficient information transmission. Given the time and expertise required for manual spreadsheet layout design, there is an urgent need for automated solutions. However, existing automated layout models are ill-suited to spreadsheets, as they often (1) treat components as axis-aligned rectangles with continuous coordinates, overlooking the inherently discrete, grid-based structure of spreadsheets; and (2) neglect interrelated semantics, such as data dependencies and contextual links, unique to spreadsheets. In this paper, we first formalize the spreadsheet layout generation task, supported by a seven-criterion evaluation protocol and a dataset of 3,326 spreadsheets. We then introduce SheetDesigner, a zero-shot and training-free framework using Multimodal Large Language Models (MLLMs) that combines rule and vision reflection for component placement and content population. SheetDesigner outperforms five baselines by at least 22.6%. We further find that through the vision modality, MLLMs handle overlap and balance well but struggle with alignment, necessitating hybrid rule and visual reflection strategies. Our code and data are available on GitHub.
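To make the "discrete, grid-based" point concrete, a rule-based check can work directly on integer cell ranges rather than continuous coordinates. The sketch below flags overlapping components on a spreadsheet grid. It is an illustrative assumption about what one such rule might look like: the `Block` type, its field names, and `find_overlaps` are hypothetical and not taken from SheetDesigner.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """A spreadsheet component occupying an inclusive range of cells."""
    name: str
    top: int      # first row index
    left: int     # first column index
    bottom: int   # last row index (inclusive)
    right: int    # last column index (inclusive)


def overlaps(a: Block, b: Block) -> bool:
    """Two blocks overlap when they share at least one cell."""
    separated = (a.bottom < b.top or b.bottom < a.top
                 or a.right < b.left or b.right < a.left)
    return not separated


def find_overlaps(blocks: List[Block]) -> List[Tuple[str, str]]:
    """Return the names of every overlapping pair, in input order."""
    return [(a.name, b.name)
            for i, a in enumerate(blocks)
            for b in blocks[i + 1:]
            if overlaps(a, b)]
```

For example, a title in row 0 and a table in rows 1 to 5 do not collide, whereas a chart starting at row 4, column 2 would be reported against a table that spans columns 0 to 3.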

中文标题/摘要

标题：SheetDesigner：基于MLLM的规则和视觉反馈驱动的电子表格布局生成

电子表格对于数据为中心的任务至关重要，具有丰富的结构化布局，能够高效地传递信息。鉴于手动设计电子表格布局所需的时间和专业知识，迫切需要自动化解决方案。然而，现有的自动化布局模型并不适合电子表格，因为它们通常（1）将组件视为轴对齐的矩形，忽略了电子表格固有的离散、网格结构；（2）忽视了数据依赖关系和上下文链接等独特的相关语义。在本文中，我们首先通过一个七准则评估协议和3,326个电子表格的数据集，正式化了电子表格布局生成任务。然后，我们介绍了SheetDesigner，这是一个无需训练的零样本框架，利用多模态大型语言模型（MLLMs），结合规则和视觉反馈进行组件放置和内容填充。SheetDesigner在五个基线模型上至少优于22.6%。我们进一步发现，通过视觉模态，MLLMs在处理重叠和平衡方面表现良好，但在对齐方面存在困难，需要采用混合规则和视觉反馈策略。我们的代码和数据可在Github上获取。

Summary / 总结

SheetDesigner addresses the need for automated spreadsheet layout design with a zero-shot, training-free framework that uses Multimodal Large Language Models (MLLMs) and combines rule-based and vision-based reflection. It outperforms five baselines by at least 22.6%. The authors find that, through the vision modality, MLLMs handle overlap and balance well but struggle with alignment, which motivates the hybrid reflection strategy. The framework is designed to overcome the limitations of existing models, which often treat spreadsheet components as axis-aligned rectangles with continuous coordinates and neglect interrelated semantics. The task is formalized with a seven-criterion evaluation protocol and a dataset of 3,326 spreadsheets.

SheetDesigner 通过利用多模态大型语言模型（MLLM）并结合规则和视觉反馈的方法，解决了自动化电子表格布局设计的需求。该框架在七个评价标准和3,326个电子表格的数据集支持下，优于五种基线方法至少22.6%。特别地，MLLM 在处理重叠和平衡方面表现出色，但在对齐方面存在困难，需要采用混合规则和视觉反馈策略。现有的模型通常将电子表格组件视为轴对齐的矩形，并忽略相关语义，因此该方法旨在克服这些限制。

ANYPORTAL: Zero-Shot Consistent Video Background Replacement

Authors: Wenshuo Gao, Xicheng Lan, Shuai Yang

Venue: ICCV 2025

First: 2025-09-09T07:50:53+00:00 · Latest: 2025-09-09T07:50:53+00:00

Comments: 8 pages, ICCV 2025, Website: https://gaowenshuo.github.io/AnyPortal/

Abstract

Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

中文标题/摘要

标题：ANYPORTAL：零样本一致视频背景替换

尽管视频生成技术取得了快速进步，但创建与用户意图精确匹配的高质量视频仍然是一个重大挑战。现有方法往往无法实现对视频细节的精细控制，限制了其实用性。我们提出了ANYPORTAL，这是一种利用预训练扩散模型的零样本框架，用于视频背景替换。我们的框架在零样本设置中协作整合了视频扩散模型的时间先验与图像扩散模型的重新照明能力。为了解决前景一致性这一关键挑战，我们提出了一种细化投影算法，该算法允许像素级细节操作，以确保精确的前景保留。ANYPORTAL 是无需训练的，并克服了实现前景一致性和时间上连贯的重新照明的挑战。实验结果表明，ANYPORTAL 在消费级 GPU 上实现了高质量的结果，提供了一种实用且高效的视频内容创建和编辑解决方案。

Summary / 总结

ANYPORTAL is a zero-shot framework for video background replacement that combines the temporal prior of pre-trained video diffusion models with the relighting capabilities of image diffusion models. It introduces a Refinement Projection Algorithm to maintain foreground consistency. Experimental results show that ANYPORTAL can achieve high-quality video background replacement on consumer-grade GPUs, providing a practical, training-free solution for video content creation and editing.

ANYPORTAL 是一种使用预训练扩散模型集成视频时间先验和光照能力的零样本框架。它引入了细化投影算法以保持前景一致性。实验结果表明，ANYPORTAL 可以在消费级 GPU 上生成高质量的视频结果，提供一种实用且高效的视频内容创建和编辑解决方案。

DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis

Authors: Sven Kirchner, Nils Purschke, Ross Greer, Alois C. Knoll

First: 2025-09-09T07:42:07+00:00 · Latest: 2025-09-09T07:42:07+00:00

Abstract

Ensuring reliable robot operation when visual input is degraded or insufficient remains a central challenge in robotics. This letter introduces DepthVision, a framework for multimodal scene understanding designed to address this problem. Unlike existing Vision-Language Models (VLMs), which use only camera-based visual input alongside language, DepthVision synthesizes RGB images from sparse LiDAR point clouds using a conditional generative adversarial network (GAN) with an integrated refiner network. These synthetic views are then combined with real RGB data using a Luminance-Aware Modality Adaptation (LAMA), which blends the two types of data dynamically based on ambient lighting conditions. This approach compensates for sensor degradation, such as darkness or motion blur, without requiring any fine-tuning of downstream vision-language models. We evaluate DepthVision on real and simulated datasets across various models and tasks, with particular attention to safety-critical tasks. The results demonstrate that our approach improves performance in low-light conditions, achieving substantial gains over RGB-only baselines while preserving compatibility with frozen VLMs. This work highlights the potential of LiDAR-guided RGB synthesis for achieving robust robot operation in real-world environments.
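The abstract describes blending real and LiDAR-synthesized RGB views according to ambient lighting. One way to picture such a blend is a per-frame weight derived from the mean luminance of the camera image. The sketch below is an assumption-laden illustration, not the LAMA module itself: the linear weighting rule, the `bright_level` parameter, and the function names are hypothetical, while the luminance weights are the standard Rec. 601 luma coefficients.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # RGB channels in [0, 1]


def mean_luminance(image: List[Pixel]) -> float:
    """Average Rec. 601 luma over all pixels of a flattened image."""
    return sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in image) / len(image)


def blend(real: List[Pixel], synthetic: List[Pixel],
          bright_level: float = 0.5) -> List[Pixel]:
    """Mix the two views; darker camera frames lean on the synthetic view.

    Hypothetical rule: the weight of the real image grows linearly with its
    mean luminance and saturates at 1 once `bright_level` is reached.
    """
    w = min(1.0, max(0.0, mean_luminance(real) / bright_level))
    return [tuple(w * rc + (1.0 - w) * sc for rc, sc in zip(rp, sp))
            for rp, sp in zip(real, synthetic)]
```

Because the blend only changes the image handed to the downstream model, a scheme like this is compatible with keeping the vision-language model frozen, as the abstract emphasizes.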

中文标题/摘要

标题：DepthVision：通过基于GAN的LiDAR到RGB合成实现稳健的跨模态理解

确保在视觉输入退化或不足时机器人操作的可靠性仍然是机器人技术中的一个核心挑战。本信介绍了一种名为DepthVision的多模态场景理解框架，旨在解决这一问题。与现有的仅使用相机视觉输入的Vision-Language模型（VLM）不同，DepthVision利用条件生成对抗网络（GAN）和集成精炼网络从稀疏的LiDAR点云中合成RGB图像。这些合成视图随后与真实RGB数据结合使用Luminance-Aware模态适应（LAMA），根据环境照明条件动态融合两种类型的数据。这种方法可以在不需对下游视觉语言模型进行微调的情况下补偿传感器退化，如黑暗或运动模糊。我们在各种模型和任务的真实和模拟数据集上评估了DepthVision，特别关注安全关键任务。结果表明，我们的方法在低光条件下提高了性能，相对于仅使用RGB基线实现了显著的性能提升，同时保持了与冻结VLM的兼容性。这项工作突显了LiDAR引导的RGB合成在实现真实环境中的稳健机器人操作方面的潜力。

Summary / 总结

DepthVision is a framework for multimodal scene understanding that addresses the challenge of reliable robot operation with degraded visual input. It synthesizes RGB images from sparse LiDAR point clouds using a GAN with a refiner network, and combines them with real RGB data using Luminance-Aware Modality Adaptation (LAMA). This approach improves performance in low-light conditions, outperforming RGB-only baselines and maintaining compatibility with frozen Vision-Language Models (VLMs).

DepthVision 是一个多模态场景理解框架，通过结合生成对抗网络（GAN）和集成精炼网络，从稀疏的 LiDAR 点云中合成 RGB 图像。这些合成视图然后通过亮度感知的模态适应（LAMA）与真实 RGB 数据结合，以适应不同的光照条件。该方法在低光条件下提高了性能，相对于仅使用 RGB 的基线实现了显著的改进，同时保持与冻结的视觉-语言模型（VLMs）的兼容性。

"Humor, Art, or Misinformation?": A Multimodal Dataset for Intent-Aware Synthetic Image Detection

Authors: Anastasios Skoularikis, Stefanos-Iordanis Papadopoulos, Symeon Papadopoulos, Panagiotis C. Petrantonakis

First: 2025-08-28T11:22:15+00:00 · Latest: 2025-09-09T06:47:10+00:00

Abstract

Recent advances in multimodal AI have enabled progress in detecting synthetic and out-of-context content. However, existing efforts largely overlook the intent behind AI-generated images. To fill this gap, we introduce S-HArM, a multimodal dataset for intent-aware classification, comprising 9,576 "in the wild" image-text pairs from Twitter/X and Reddit, labeled as Humor/Satire, Art, or Misinformation. Additionally, we explore three prompting strategies (image-guided, description-guided, and multimodally-guided) to construct a large-scale synthetic training dataset with Stable Diffusion. We conduct an extensive comparative study including modality fusion, contrastive learning, reconstruction networks, attention mechanisms, and large vision-language models. Our results show that models trained on image- and multimodally-guided data generalize better to "in the wild" content, due to preserved visual context. However, overall performance remains limited, highlighting the complexity of inferring intent and the need for specialized architectures.

中文标题/摘要

标题："幽默、艺术还是误导信息？": 一种面向意图的合成图像检测多模态数据集

近年来，多模态AI的进步使得检测合成和脱离上下文的内容成为可能。然而，现有的努力大多忽略了AI生成图像背后的意图。为填补这一空白，我们引入了S-HArM，这是一个面向意图分类的多模态数据集，包含来自Twitter/X和Reddit的9,576个“野生”图像-文本对，标记为幽默/讽刺、艺术或误导信息。此外，我们探索了三种提示策略（图像导向、描述导向和多模态导向）来构建一个大规模的合成训练数据集，使用Stable Diffusion。我们进行了广泛的比较研究，包括模态融合、对比学习、重建网络、注意力机制和大型视觉-语言模型。我们的结果显示，使用图像和多模态导向的数据训练的模型在“野生”内容上的泛化能力更好，因为保留了视觉上下文。然而，总体性能仍然有限，突显了推断意图的复杂性以及需要专门的架构。

Summary / 总结

The research aims to address the gap in detecting the intent behind AI-generated images by introducing S-HArM, a multimodal dataset. The dataset includes 9,576 image-text pairs from social media, labeled as Humor/Satire, Art, or Misinformation. The study explores three prompting strategies and finds that models trained on image- and multimodally-guided data perform better on real-world content, though overall performance is still limited.

研究旨在通过引入S-HArM数据集填补检测AI生成图像背后意图的空白，该数据集包含9,576个来自社交媒体的图像-文本对，标记为幽默/讽刺、艺术或误导信息。研究探索了三种提示策略，并发现使用图像和多模态指导的数据训练的模型在真实世界内容上的表现更好，尽管整体性能仍然有限，因为推断意图的复杂性仍然存在。

Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision

Authors: Raehyuk Jung, Seungjun Yu, Hyunjung Shim

First: 2025-08-31T05:00:51+00:00 · Latest: 2025-09-09T03:19:40+00:00

Comments: Link to publicly available codes is added

Abstract

Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training, showing strong performance on multimodal tasks. A central component in this architecture is the projection layer, which maps visual features into the LLM's embedding space. Despite its importance, its ability to generalize to unseen visual concepts has not been systematically evaluated. To address this, we propose a benchmark for evaluating projection-layer generalization. We adapt object detection datasets (rich in fine-grained annotations) into a prompting format and design train/test splits with disjoint label sets, enabling precise control over seen and unseen concept separation. Experimental results show that the projection layer retains about 79 to 88 percent of the performance on unseen classes compared to seen ones across various settings, suggesting a non-trivial level of generalization even without explicit alignment supervision on those concepts. We further analyze this behavior through a mechanistic interpretability lens. Our findings indicate that the feed-forward network in the projection layer functions like a key-value memory, processing seen and unseen tokens in similar ways. This study introduces a new evaluation framework for alignment generalization and highlights the potential for efficient VLM training with limited aligned data.
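The evaluation protocol rests on two simple ingredients: train/test splits whose label sets are disjoint, and a ratio comparing performance on unseen classes to performance on seen ones. The sketch below shows both under stated assumptions: the sample format `(image_id, label)` and the function names `split_by_label` and `retention` are hypothetical, and the paper's actual metric may be defined differently.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[str, str]  # (image_id, label)


def split_by_label(samples: List[Sample],
                   unseen_labels: Iterable[str]) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that held-out labels never appear in training."""
    unseen = set(unseen_labels)
    train = [s for s in samples if s[1] not in unseen]
    test_unseen = [s for s in samples if s[1] in unseen]
    return train, test_unseen


def retention(unseen_score: float, seen_score: float) -> float:
    """Fraction of seen-class performance retained on unseen classes."""
    return unseen_score / seen_score
```

Under this definition, the reported range of roughly 79 to 88 percent corresponds to retention values of about 0.79 to 0.88.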

中文标题/摘要

标题：揭示无形：超越监督的视觉-语言对齐评估

视觉-语言模型（VLMs）通过对齐训练结合了视觉编码器和大型语言模型（LLM），在多模态任务上表现出色。该架构中的关键组件是投影层，它将视觉特征映射到LLM的嵌入空间。尽管其重要性，但其在未见过的视觉概念上的泛化能力尚未系统评估。为解决这一问题，我们提出了一种评估投影层泛化能力的基准。我们将富含细粒度注释的对象检测数据集改编为提示格式，并设计了具有分离标签集的训练/测试分割，以精确控制已见过和未见过的概念分离。实验结果显示，投影层在不同设置下对未见过类别的性能保留了约79%到88%，表明即使在未对这些概念进行显式对齐监督的情况下，也存在一定的泛化能力。我们进一步通过机制可解释性视角分析了这种行为。我们的研究结果表明，投影层中的前馈网络功能类似于键值记忆，以类似的方式处理已见过和未见过的标记。本研究引入了一种新的对齐泛化评估框架，并突显了在有限对齐数据下高效训练VLM的潜力。

Summary / 总结

The research aims to evaluate the generalization ability of the projection layer in Vision-Language Models (VLMs) to unseen visual concepts. The method involves adapting object detection datasets into a prompting format and creating train/test splits with disjoint label sets. The key findings show that the projection layer retains about 79 to 88 percent of the performance on unseen classes compared to seen ones, indicating a non-trivial level of generalization without explicit alignment supervision on those concepts. This study introduces a new evaluation framework for alignment generalization and suggests the potential for efficient VLM training with limited aligned data.

该研究评估了Vision-Language模型中投影层在处理未见过的视觉概念时的泛化能力。通过将目标检测数据集转换为提示格式，并设计训练/测试拆分以分离标签集，研究人员能够精确控制已见过和未见过的概念。实验结果显示，投影层在未见过的类别上的性能保留了约79到88%，表明存在一定的泛化能力。研究引入了一个新的评估框架，并表明投影层中的前馈网络处理已见过和未见过的标记方式类似，类似于一个键值记忆。
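The seen/unseen evaluation described in this entry can be illustrated with a small sketch. It assumes samples are `(image_id, label)` pairs and that a single scalar score is available per split; the helper names `split_disjoint` and `retention` are hypothetical and are not taken from the paper's code.

```python
from typing import List, Set, Tuple

Sample = Tuple[str, str]  # (image_id, label); a hypothetical sample layout


def split_disjoint(samples: List[Sample], unseen_labels: Set[str]):
    """Put every sample whose label is in `unseen_labels` into the test split,
    and everything else into the train split, so the label sets never overlap."""
    train = [(x, y) for x, y in samples if y not in unseen_labels]
    test = [(x, y) for x, y in samples if y in unseen_labels]
    return train, test


def retention(unseen_score: float, seen_score: float) -> float:
    """Fraction of seen-class performance that is retained on unseen classes."""
    if seen_score <= 0:
        raise ValueError("seen_score must be positive")
    return unseen_score / seen_score


if __name__ == "__main__":
    data = [("img0", "cat"), ("img1", "dog"), ("img2", "zebra"), ("img3", "cat")]
    train, test = split_disjoint(data, unseen_labels={"zebra"})
    print(len(train), len(test))
    # Illustrative scores only, not results from the paper.
    print(round(retention(0.44, 0.55), 2))
```

A retention value in the range reported by the abstract would correspond to `retention(...)` returning roughly 0.79 to 0.88.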

A Novel Image Similarity Metric for Scene Composition Structure

Authors: Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee

First: 2025-08-07T05:29:21+00:00 · Latest: 2025-09-09T02:42:55+00:00

Comments: 2025 IEEE ICIP (Workshop: Generative AI for World Simulations and Communications). Code at https://github.com/RedwanPlague/scssim

Abstract

The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition.

中文标题/摘要

标题：一种新的图像相似度度量方法用于场景构成结构

生成式AI模型的快速发展需要超越人类感知的新方法来评估图像质量。这些模型的一个关键问题是保持图像的底层场景构成结构（SCS），即定义物体与背景之间的几何关系、相对位置、大小、方向等。保持SCS的完整性对于确保生成式AI输出的忠实性和结构准确性至关重要。传统的图像相似度度量往往在评估SCS方面表现不佳。像素级方法对细微视觉噪声过于敏感，而基于感知的度量则侧重于人类审美，两者均未能充分捕捉结构保真度。此外，最近基于神经网络的度量引入了训练开销和潜在的泛化问题。我们提出了场景构成结构相似度指数度量（SCSSIM），这是一种新颖的、分析性的、无需训练的度量方法，通过利用从图像立方体分层划分中导出的统计措施来量化SCS的保持情况，稳健地捕捉非基于对象的结构关系。我们的实验表明，SCSSIM 对非构成性失真具有高度不变性，准确反映了未改变的SCS。相反，它对构成性失真表现出强烈的单调下降，精确地指示了SCS是否被改变。与现有度量相比，SCSSIM 在结构评估方面表现出更优越的特性，使其成为开发和评估生成式模型的重要工具，确保场景构成的完整性。

Summary / 总结

The paper introduces SCSSIM, a novel metric for evaluating the preservation of Scene Composition Structure (SCS) in images generated by generative AI models. Unlike traditional metrics, SCSSIM uses statistical measures from Cuboidal hierarchical partitioning to robustly capture non-object-based structural relationships, avoiding sensitivity to minor visual noise and prioritizing structural fidelity. Experiments show that SCSSIM is highly invariant to non-compositional distortions and strongly indicates compositional changes, making it a superior tool for evaluating generative models.

论文提出了SCSSIM，这是一种用于评估生成AI模型生成图像中场景组成结构（SCS）保留的新方法。SCSSIM利用立方体层次划分的统计措施来稳健地捕捉非对象基础的结构关系，避免对细微视觉噪声的敏感性，优先考虑结构保真度。实验表明，SCSSIM对非组成性失真具有高不变性，并且在组成性失真时表现出强烈的单调减少，使其成为结构评估生成模型的优越工具。

RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events

Authors: Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang

First: 2025-09-02T03:01:23+00:00 · Latest: 2025-09-09T01:10:25+00:00

Comments: under review

Abstract

Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.

中文标题/摘要

标题：RSCC：一种用于灾害事件的大型遥感变化描述数据集

遥感对于灾害监测至关重要，但现有数据集缺乏时间图像对和详细的文本注释。虽然当前资源主要由单张快照图像主导，但无法捕捉到灾害随时间的变化影响。为解决这一问题，我们介绍了遥感变化描述（RSCC）数据集，这是一个包含62,315个灾前/灾后图像对（涵盖地震、洪水、野火等）的大规模基准，这些图像对配有丰富的、类人类的变化描述。通过在遥感数据中弥合时间与语义的鸿沟，RSCC 使视觉-语言模型能够进行灾害意识的双时相理解的稳健训练和评估。我们的结果突显了RSCC在促进详细灾害相关分析方面的能力，为遥感中更准确、可解释和可扩展的视觉-语言应用铺平了道路。代码和数据集可在https://github.com/Bili-Sakura/RSCC 获取。

Summary / 总结

The RSCC dataset addresses the lack of temporal image pairs and detailed textual annotations in existing disaster monitoring datasets. It consists of 62,315 pre-/post-disaster image pairs with rich change captions, covering various disaster types. This dataset enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding, facilitating detailed disaster-related analysis.

RSCC数据集解决了现有灾害监测数据集中缺乏时间图像对和详细文本注释的问题。它包含62,315个灾前/灾后图像对，并配有丰富的变化描述，涵盖了多种灾害类型。该数据集使视觉-语言模型能够进行灾害感知的双时相理解的稳健训练和评估，促进了详细的灾害相关分析，并提高了遥感应用的准确性和可解释性。

Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

First: 2025-07-02T18:21:01+00:00 · Latest: 2025-09-08T19:23:04+00:00

Abstract

Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

中文标题/摘要

标题：大型语言模型在视频事故检测中的应用：方法、数据集和挑战综述

从视频流中检测事故是智能交通系统中的一个关键问题。近年来，大型语言模型（LLMs）和视觉语言模型（VLMs）的发展改变了我们处理、推理和总结多模态信息的方式。本文综述了利用LLMs进行视频数据事故检测的最新方法。我们提出了融合策略的结构化分类，总结了关键数据集，分析了模型架构，比较了性能基准，并讨论了当前的挑战和机遇。我们的综述为这一快速发展的视频理解和基础模型交叉领域提供了研究基础。

Summary / 总结

The paper motivates crash detection from video feeds as a critical problem for intelligent transportation systems. It surveys methods that use large language models (LLMs) and vision-language models (VLMs) for crash detection, presenting a taxonomy of fusion strategies, summarizing key datasets, analyzing model architectures, and comparing performance benchmarks. The review highlights ongoing challenges and opportunities in this field, offering a foundation for future studies.

论文旨在探讨视频流中智能交通系统中碰撞检测的必要性。它回顾了使用大型语言模型（LLMs）和视觉语言模型（VLMs）进行碰撞检测的方法，介绍了融合策略的分类，总结了关键数据集，并比较了模型架构。研究指出了该领域的持续挑战和机遇，为未来的研究提供了基础。

Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Authors: Hamza Rasaee, Taha Koleilat, Hassan Rivaz

Venue: IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, Sept. 2025

First: 2025-06-30T14:33:44+00:00 · Latest: 2025-09-08T19:18:30+00:00

Comments: 11 pages, 3 figures, 7 tables

Abstract

Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model 2) to enable object segmentation across multiple organs in ultrasound. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. Of these, 15 were used to fine-tune and validate Grounding DINO, adapted to the ultrasound domain with Low-Rank Adaptation (LoRA), and 3 were held out entirely for testing to evaluate performance on unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS, on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on code.sonography.ai after acceptance.

中文标题/摘要

标题：Grounding DINO-US-SAM：基于文本提示的超声多器官分割

在超声成像中实现准确且通用的物体分割仍然是一个重大挑战，原因包括解剖变异、多样化的成像协议以及标注数据的有限性。本研究提出了一种基于提示的视觉-语言模型（VLM），将Grounding DINO与SAM2（Segment Anything Model2）结合，以实现跨多个超声器官的物体分割。共使用了18个公开的超声数据集，涵盖乳腺、甲状腺、肝脏、前列腺、肾脏和腰背肌肉。这些数据集分为15个用于微调和验证Grounding DINO，使用低秩适应（LoRA）调整至超声领域，另外3个数据集用于测试，以评估其在未见分布中的性能。全面的实验表明，我们的方法在大多数已见数据集上优于最先进的分割方法，包括UniverSeg、MedSAM、MedCLIP-SAM、BiomedParse和SAMUS，同时在未见数据集上保持了强大的性能，无需额外的微调。这些结果突显了VLMs在可扩展和稳健的超声图像分析中的潜力，减少了对大型、器官特定标注数据集的依赖。论文接受后，我们将发布我们的代码于code.sonography.ai。
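As a rough illustration of the Low-Rank Adaptation (LoRA) idea mentioned in this abstract, the sketch below applies the standard LoRA update, W + (alpha / r) * B A, to a small weight matrix in pure Python. It is a generic toy example, not the authors' fine-tuning code; the function names and matrix sizes are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, rank: int) -> Matrix:
    """Effective weight W + (alpha / rank) * B @ A.

    w: d_out x d_in frozen weight, b: d_out x rank, a: rank x d_in.
    Only a and b would be trained; w stays fixed.
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable parameters introduced by the two low-rank factors."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    w = [[1.0, 2.0], [3.0, 4.0]]
    a = [[1.0, 1.0]]      # rank 1
    b = [[1.0], [0.0]]
    print(lora_weight(w, a, b, alpha=2.0, rank=1))
    print(lora_param_count(1024, 1024, 8), "trainable vs", 1024 * 1024, "frozen")
```

With `b` initialised to zeros, as is common for LoRA, the effective weight equals the frozen weight before any training step.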

Understanding Museum Exhibits using Vision-Language Reasoning

Authors: Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool

Venue: ICCV 2025

First: 2024-12-02T10:54:31+00:00 · Latest: 2025-09-08T18:23:33+00:00

Comments: Accepted at ICCV 2025

Abstract

Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five visual question answering tasks, specifically designed to reflect real-world inquiries and challenges observed in museum settings. The complete dataset is labeled by museum experts, ensuring the quality and the practical significance of the labels. We train two VLMs from different categories: BLIP with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through extensive experiments, we find that while both model types effectively answer visually grounded questions, large vision-language models excel in queries requiring deeper historical context and reasoning. We further demonstrate the necessity of fine-tuning models on large-scale domain-specific datasets by showing that our fine-tuned models significantly outperform current SOTA VLMs in answering questions related to specific attributes, highlighting their limitations in handling complex, nuanced queries.

中文标题/摘要

标题：利用视觉语言推理理解博物馆展览

博物馆是文化遗存和不同历史时期、文明和地区的文物的保存地，保存了详尽的收藏品，蕴含了大量知识。当这些收藏品系统地结构化为大规模数据集时，可以训练专门的模型。访客通过好奇心和问题与展览互动，因此专家领域特定模型对于交互式查询解决和获取历史见解至关重要。理解展览需要分析视觉特征并将其与历史知识联系起来，以推导出有意义的关联。我们通过（a）收集和整理来自世界各地的6500万张图像和2亿个问题-答案对的大规模数据集；（b）在收集的数据集上训练大规模视觉语言模型（VLMs）；（c）在五个视觉问答任务上测试它们的能力，这些任务专门设计以反映博物馆环境中观察到的真实世界询问和挑战。整个数据集由博物馆专家标注，确保了标签的质量和实际意义。我们训练了两类VLMs：BLIP，具有视觉语言对齐嵌入但缺乏大型语言模型的表达能力，以及LLaVA模型，这是一种强大的指令调优的大语言模型，增强了视觉语言推理能力。通过大量实验，我们发现虽然两种模型类型都能有效回答基于视觉的问题，但大规模视觉语言模型在需要更深层次历史背景和推理的查询中表现出色。我们进一步通过展示我们的微调模型在回答特定属性相关问题时显著优于当前最先进的VLMs，证明了在大规模领域特定数据集上微调模型的必要性，突显了它们在处理复杂、细腻查询方面的局限性。

Summary / 总结

This study aims to enhance the understanding of museum exhibits through vision-language reasoning by collecting a large-scale dataset of images and question-answer pairs. The research trains and benchmarks two vision-language models, BLIP and LLaVA, on this dataset for five visual question-answering tasks. The findings show that large vision-language models perform better in queries requiring historical context and reasoning, and that fine-tuning on domain-specific datasets significantly improves model performance for complex, nuanced questions.

本研究旨在通过视觉-语言推理增强对博物馆展品的理解，通过收集大量图像和问答对的数据集。研究在该数据集上训练和评估了两种视觉-语言模型BLIP和LLaVA，用于五个视觉问答任务。研究发现，大型视觉-语言模型在需要历史背景和推理的查询中表现更佳，并且在特定领域的大规模数据集上进行微调可以显著提高模型对复杂、细腻查询的处理能力。

LLaDA-VLA: Vision Language Diffusion Action Models

Authors: Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, Xiaoyan Sun

First: 2025-09-08T17:45:40+00:00 · Latest: 2025-09-08T17:45:40+00:00

Abstract

The rapid progress of autoregressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically, considering the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulated and real-world robots.

中文标题/摘要

标题：LLaDA-VLA：视觉语言扩散动作模型

自回归视觉语言模型（VLMs）的快速发展激发了对用于机器人操作的视觉语言动作模型（VLA）的兴趣。最近，掩码扩散模型，这一与自回归模型不同的范式，在文本生成和多模态应用中开始展示出竞争力，推动了一系列基于扩散的VLMs（d-VLMs）的发展。然而，利用这些模型进行机器人策略学习仍然鲜有探索。在本文中，我们提出了LLaDA-VLA，这是首个基于预训练d-VLMs的视觉语言扩散动作模型。为了有效适应机器人领域，我们引入了两个关键设计：（1）局部特殊标记分类策略，用特殊动作标记分类替代全词汇分类，降低适应难度；（2）分层动作结构解码策略，考虑动作内部和跨动作的依赖关系，逐级解码动作序列。大量实验表明，LLaDA-VLA在模拟和真实机器人上均显著优于现有最先进的VLA。

Summary / 总结

LLaDA-VLA is the first vision-language-diffusion-action model for robotic manipulation, built on pretrained diffusion-based vision-language models (d-VLMs). It introduces two key designs: a localized special-token classification strategy and a hierarchical action-structured decoding strategy. Experimental results show that LLaDA-VLA outperforms existing vision-language-action models on both simulated and real-world robotic tasks.

LLaDA-VLA是首个基于预训练扩散型视觉语言模型（d-VLM）的视觉语言动作模型，用于机器人操作。它引入了局部特殊标记分类策略和层次动作结构解码策略，以适应机器人任务。实验结果表明，LLaDA-VLA在仿真和真实世界机器人操作任务中均优于现有视觉语言动作模型。
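The localized special-token classification idea can be sketched as restricting the softmax to a subset of vocabulary positions reserved for actions. The abstract does not specify how this is implemented, so the following is only one plausible reading; the token ids and the function name `classify_action_token` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify_action_token(
    logits: Sequence[float], action_token_ids: Sequence[int]
) -> Tuple[int, List[float]]:
    """Pick the most likely token among the action tokens only.

    Logits of all other vocabulary entries are ignored, so the classifier
    chooses among len(action_token_ids) classes rather than the full vocabulary.
    """
    sub_logits = [logits[i] for i in action_token_ids]
    probs = softmax(sub_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return action_token_ids[best], probs


if __name__ == "__main__":
    # Token 0 has the largest logit overall but is not an action token.
    vocab_logits = [9.0, 0.1, 2.0, 1.0, 0.5]
    token, probs = classify_action_token(vocab_logits, action_token_ids=[2, 3, 4])
    print(token, [round(p, 3) for p in probs])
```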

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Authors: Eugene Kwek, Wenpeng Yin

First: 2025-09-08T16:07:06+00:00 · Latest: 2025-09-08T16:07:06+00:00

Abstract

Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.

中文标题/摘要

标题：COMPACT: 共享词元优化模型剪枝跨通道和词元

使大语言模型（LLM）在内存、延迟和提供服务成本方面更高效对于边缘部署、交互式应用以及大规模可持续推理至关重要。剪枝是实现这一目标的关键技术。然而，先前的剪枝方法有限：宽度剪枝通常会破坏标准的变压器布局或需要自定义推理代码，而深度剪枝会移除整个层并可能导致准确率骤降。在本工作中，我们提出了COMPACT，它联合（i）剪枝稀有词汇以缩小嵌入/解嵌入，并（ii）使用共享词元加权激活剪枝FFN中间通道，使重要性与后剪枝词元分布对齐。COMPACT兼具深度和宽度剪枝的优点，如：部署友好性（保持标准的变压器架构）、规模适应性（在词汇量与FFN剪枝之间权衡），无需训练即可操作且具有竞争力的剪枝时间，以及强大的内存节省和吞吐量提升。在Qwen、LLaMA和Gemma家族（0.5B-70B）中进行的实验显示，COMPACT在相似或更高的剪枝比率下，下游任务性能达到最先进的水平，同时参数、GPU内存和端到端延迟显著减少。

Summary / 总结

The research aims to enhance the efficiency of large language models (LLMs) in terms of memory, latency, and serving cost for edge deployment and interactive applications. To achieve this, the paper introduces COMPACT, a pruning method that jointly prunes rare vocabulary and FFN intermediate channels using common-token-weighted activations. The key findings show that COMPACT maintains a standard transformer architecture, offers scale-adaptive pruning, and achieves state-of-the-art downstream task performance with significant reductions in parameters, GPU memory, and latency across different model sizes (0.5B-70B).

研究旨在提高大型语言模型（LLMs）在边缘部署和交互应用中的内存、延迟和运行成本效率。提出的COMPACT方法同时剪枝稀有词汇和FFN中间通道，保持标准的变压器架构，并提供可扩展的剪枝适应性。实验表明，COMPACT在不同LLM家族中实现了最先进的下游任务性能，同时显著减少了参数、GPU内存和端到端延迟，且剪枝比例相似或更高。
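The two pruning steps named in this entry (dropping rare vocabulary, then scoring FFN channels with activations weighted by how common each kept token is) can be sketched as follows. This is a toy illustration under assumptions: the exact scoring rule in the paper is not given in the abstract, and the per-token activation table and all function names here are hypothetical.

```python
from typing import Dict, List


def keep_common_tokens(token_counts: Dict[str, int], keep: int) -> List[str]:
    """Keep the `keep` most frequent tokens (ties broken alphabetically)."""
    return sorted(token_counts, key=lambda t: (-token_counts[t], t))[:keep]


def channel_importance(
    activations: Dict[str, List[float]],
    token_counts: Dict[str, int],
    kept: List[str],
) -> List[float]:
    """Score each FFN channel by |activation|, weighted by the relative
    frequency of each kept token, so pruned tokens contribute nothing."""
    total = sum(token_counts[t] for t in kept)
    scores = [0.0] * len(activations[kept[0]])
    for token in kept:
        weight = token_counts[token] / total
        for channel, value in enumerate(activations[token]):
            scores[channel] += weight * abs(value)
    return scores


def prune_channels(scores: List[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring channels, in order."""
    top = sorted(range(len(scores)), key=lambda c: (-scores[c], c))[:keep]
    return sorted(top)


if __name__ == "__main__":
    counts = {"the": 90, "of": 10, "zyx": 1}
    acts = {"the": [1.0, 0.0, 0.0], "of": [0.0, 1.0, 0.0], "zyx": [0.0, 0.0, 9.0]}
    kept = keep_common_tokens(counts, keep=2)
    scores = channel_importance(acts, counts, kept)
    print(kept, scores, prune_channels(scores, keep=2))
```

In this toy setup, the channel that only the rare token activates receives a score of zero once that token is removed from the vocabulary, which is the intuition behind aligning channel importance with the post-pruning token distribution.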

D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

Authors: Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar

First: 2025-09-08T14:55:16+00:00 · Latest: 2025-09-08T14:55:16+00:00

Comments: Accepted at IEEE International Conference on Data Mining (ICDM) 2025

Abstract

Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, the VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams (text, image, and reasoning) via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning

中文标题/摘要

标题：D-HUMOR：通过多模态开放推理理解黑色幽默

在线表情包中的黑色幽默因其依赖于隐含、敏感和文化背景的提示而面临独特挑战。为了解决检测多模态内容中黑色幽默资源和方法的缺乏，我们引入了一个包含4,379个带有黑色幽默标注的Reddit表情包的数据集，标注了目标类别（性别、心理健康、暴力、种族、残疾和其他）和三级强度评分（轻微、中等、严重）。基于此资源，我们提出了一种增强推理框架，首先使用大型视觉-语言模型（VLM）为每个表情包生成结构化解释。通过角色反转自我循环，VLM 采用作者视角迭代完善其解释，确保完整性和一致性。然后，我们从OCR转录文本和自我完善的推理中提取文本特征，使用视觉变换器获取视觉特征。三流交叉推理网络（TCRNet）通过成对注意力机制融合这三流，即文本、图像和推理，生成统一表示进行分类。实验结果表明，我们的方法在黑色幽默检测、目标识别和强度预测三项任务上均优于强基线。数据集、标注和代码已发布，以促进多模态幽默理解和内容审核的进一步研究。代码和数据集可在以下链接获取：https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning

Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization

Authors: Thanh Thi Nguyen, Campbell Wilson, Janis Dalins

First: 2025-09-08T14:47:57+00:00 · Latest: 2025-09-08T14:47:57+00:00

Comments: Accepted for publication in the Proceedings of the 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025)

Abstract

Large Vision-Language Models (LVLMs), or multimodal large language models, represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While large-scale pretraining has driven substantial progress, fine-tuning these models to align with human values or to perform specific tasks or behaviors remains a critical challenge. Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) offer promising frameworks for this alignment process. While DRL enables models to optimize actions using reward signals instead of relying solely on supervised preference data, DPO directly aligns the policy with preferences, eliminating the need for an explicit reward model. This overview explores paradigms for fine-tuning LVLMs, highlighting how DRL and DPO techniques can be used to align models with human preferences and values, improve task performance, and enable adaptive multimodal interaction. We categorize key approaches, examine sources of preference data and reward signals, and discuss open challenges such as scalability, sample efficiency, continual learning, generalization, and safety. The goal is to provide a clear understanding of how DRL and DPO contribute to the evolution of robust and human-aligned LVLMs.

中文标题/摘要

标题：通过深度强化学习和直接偏好优化对大型视觉-语言模型进行对齐

大型视觉-语言模型（LVLMs）或跨模态大型语言模型是人工智能的一个重要进步，使系统能够理解和生成跨视觉和文本模态的内容。虽然大规模预训练推动了显著的进步，但将这些模型微调以与人类价值观对齐或执行特定任务或行为仍然是一个关键挑战。深度强化学习（DRL）和直接偏好优化（DPO）为这一对齐过程提供了有希望的框架。DRL使模型能够使用奖励信号来优化行为，而不仅仅是依赖监督偏好数据，而DPO直接将策略与偏好对齐，消除了显式奖励模型的需要。本文综述了微调LVLMs的范式，强调了DRL和DPO技术如何用于使模型与人类偏好和价值观对齐、提高任务性能和实现适应性跨模态交互。我们对关键方法进行了分类，检查了偏好数据来源、奖励信号，并讨论了可扩展性、样本效率、持续学习、泛化和安全性等开放挑战。目标是提供DRL和DPO如何促进稳健且与人类对齐的LVLMs演化的清晰理解。
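To make concrete how DPO aligns a policy with preferences without an explicit reward model, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. The numeric inputs are illustrative only, and `beta=0.1` is an assumed default rather than a value from this overview.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair:

        -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                             - (log pi(y_l) - log pi_ref(y_l))])
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)); branch to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Policy raises the chosen response and lowers the rejected one: loss drops.
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

The implicit reward is the scaled log-ratio between policy and reference, which is why no separately trained reward model appears in the computation.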

Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Authors: Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah

First: 2025-06-27T11:44:40+00:00 · Latest: 2025-09-08T14:34:04+00:00

Abstract

Despite progress in Vision-Language Models (VLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current VLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces VISER (Visual Input Structure for Enhanced Reasoning), a simple yet effective intervention: augmenting visual inputs with low-level spatial structures and pairing this with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. Specifically, VISER improves GPT-4o visual search accuracy by 25.00%, increases counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and enhances performance on spatial relationship tasks by 9.50% on a 2D synthetic dataset. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER enhances binding with only a single-query inference, underscoring the importance of visual input design over purely linguistically-based approaches. These findings suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning and could serve as a general strategy for enhancing VLM performance on spatially grounded tasks.

中文标题/摘要

标题：视觉结构有助于视觉推理：解决VLMs中的绑定问题

尽管在视觉语言模型（VLMs）方面取得了进展，但它们在视觉推理方面的能力往往受限于绑定问题：无法可靠地将感知特征与正确的视觉参照物关联起来。这一限制导致了诸如计数、视觉搜索、场景描述和空间关系理解等任务中的持续错误。关键因素在于当前的VLMs主要以并行方式处理视觉特征，缺乏空间定位的序列注意力机制。本文介绍了VISER（视觉输入结构以增强推理），这是一种简单而有效的干预措施：通过在视觉输入中添加低级的空间结构，并配以鼓励顺序、空间意识解析的文本提示。我们实证展示了在核心视觉推理任务中取得了显著的性能提升。具体而言，VISER将GPT-4o的视觉搜索准确性提高了25.00%，将计数准确性提高了26.83%，将场景描述中的编辑距离错误降低了0.32，并在2D合成数据集上将空间关系任务的性能提高了9.50%。此外，我们发现视觉修改对于这些提升是必不可少的；纯粹的文本策略，包括链式思考提示，是不够的，甚至可能降低性能。VISER仅通过单查询推理就能增强绑定，突显了视觉输入设计的重要性，而非纯粹基于语言的方法。这些发现表明，低级视觉结构化是一个强大且未被充分探索的方向，可以提高组合视觉推理，并且可以作为增强VLM在空间定位任务上性能的一般策略。

Summary / 总结

This paper addresses the binding problem in Vision-Language Models (VLMs) by introducing VISER, which augments visual inputs with low-level spatial structures and encourages sequential, spatially-aware parsing through a textual prompt. The method substantially improves GPT-4o's performance on visual reasoning tasks, with gains of 25.00% in visual search accuracy, 26.83% in counting accuracy, a reduction of 0.32 in edit distance error for scene description, and 9.50% on spatial relationship tasks on a 2D synthetic dataset. The visual modification is essential for these improvements: purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance.

本文通过引入VISER方法，解决了视觉语言模型（VLMs）中的绑定问题，该方法通过添加空间结构并结合文本提示促进顺序的空间感知解析。该方法在各种视觉推理任务中显著提高了性能，具体表现为视觉搜索提高了25.00%，计数准确性提高了26.83%，场景描述的编辑距离减少了0.32，空间关系任务提高了9.50%。VISER对于这些改进至关重要，纯文本策略不仅无效，甚至会降低性能。研究结果强调了低级视觉结构化对于组成视觉推理的重要性，并可作为增强VLM在空间定位任务中性能的一般策略。
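One way to picture "augmenting visual inputs with low-level spatial structures" is overlaying evenly spaced horizontal lines on the image and pairing it with a prompt that asks for row-by-row parsing. The abstract does not state which structures are used, so the horizontal lines, the prompt wording, and the function name below are all assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values (an assumed layout)


def add_horizontal_lines(image: Image, num_lines: int, value: int = 0) -> Image:
    """Return a copy of `image` with `num_lines` evenly spaced horizontal
    lines drawn in `value`, splitting the image into num_lines + 1 bands."""
    height = len(image)
    out = [row[:] for row in image]
    for k in range(1, num_lines + 1):
        y = k * height // (num_lines + 1)
        out[y] = [value] * len(out[y])
    return out


# Hypothetical companion prompt encouraging sequential, spatially-aware parsing.
PROMPT = "Scan the image band by band, from top to bottom, before answering."

if __name__ == "__main__":
    blank = [[255] * 4 for _ in range(8)]
    structured = add_horizontal_lines(blank, num_lines=3)
    for row in structured:
        print(row)
```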

Robust and Label-Efficient Deep Waste Detection

Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan

First: 2025-08-26T08:34:04+00:00 · Latest: 2025-09-08T10:07:31+00:00

Comments: Accepted at BMVC 2025

Abstract

Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.

中文标题/摘要

标题：稳健且标签高效的深度垃圾检测

有效的垃圾分类对于可持续回收至关重要，但由于数据集有限且依赖于过时的对象检测器，AI 研究在这一领域仍落后于商业系统。在本文中，我们通过建立强基准并引入基于集成的半监督学习框架，推进了 AI 驱动的垃圾检测。我们首先在真实的 ZeroWaste 数据集上基准测试最先进的开放词汇对象检测（OVOD）模型，表明仅类别提示表现不佳，而 LLM 优化的提示显著提高了零样本准确性。接下来，为了解决领域特定的限制，我们微调了现代基于变换器的对象检测器，实现了新的基线 51.6 mAP。然后，我们提出了一种软伪标签策略，使用空间和共识感知加权融合集成预测，实现稳健的半监督训练。应用于未标记的 ZeroWaste-s 子集，我们的伪注释实现了超过全监督训练的性能提升，突显了可扩展注释管道的有效性。我们的工作为研究界做出了贡献，通过建立严格的基准，引入了稳健的基于集成的伪标签管道，生成了未标记 ZeroWaste-s 子集的高质量注释，并系统地评估了 OVOD 模型在真实世界垃圾分类条件下的性能。我们的代码可在 https://github.com/h-abid97/robust-waste-detection 获取。

Summary / 总结

This work aims to improve AI-driven waste detection for sustainable recycling by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. The authors benchmark OVOD models on the ZeroWaste dataset and fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. They propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting; applied to the unlabeled ZeroWaste-s subset, the resulting pseudo-annotations yield performance gains that surpass fully supervised training, demonstrating the effectiveness of scalable annotation pipelines. The research contributes to the community by providing rigorous baselines and a robust pseudo-labeling pipeline for real-world waste sorting conditions.

该研究旨在通过建立强基准并引入基于集成的半监督学习框架来改进面向可持续回收的AI垃圾检测。作者对标了OVOD模型并微调了基于Transformer的检测器，实现了新的51.6 mAP基线。他们提出了一种软伪标签策略，增强了半监督训练，使得在未标记的ZeroWaste-s子集上获得了超越全监督训练的表现。该研究为研究界提供了严格的基准和用于可扩展注释的稳健伪标签流水线。

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

First: 2025-09-08T09:20:04+00:00 · Latest: 2025-09-08T09:20:04+00:00

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; and (3) theoretically, the contrast of attention maps between general queries and task-specific queries enables the decomposition of the visual signal into semantic signal and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.

中文标题/摘要

标题：基于对比注意力聚焦：增强VLMs的视觉推理

视觉-语言模型（VLMs）在多种视觉任务中表现出色，但在复杂视觉环境中其性能会下降。尽管现有的增强方法需要额外的训练、依赖外部分割工具或在粗粒度级别上操作，但它们忽视了VLMs内部固有的能力。为了弥合这一差距，我们研究了VLMs的注意力模式，并发现：（1）视觉复杂性与注意力熵呈强相关性，负面影响了推理性能；（2）注意力从浅层的全局扫描逐渐细化到深层的集中收敛，收敛程度由视觉复杂性决定；（3）理论上，我们证明了通用查询与任务特定查询之间的注意力图对比能够将视觉信号分解为语义信号和视觉噪声成分。基于这些见解，我们提出了对比注意力精炼以增强视觉（CARVE），这是一种无需训练的方法，通过像素级的注意力对比提取任务相关的视觉信号。广泛的实验表明，CARVE能够一致地提升性能，开源模型的性能提升高达75%。我们的工作为理解视觉复杂性和注意力机制之间的相互作用提供了关键见解，为通过对比注意力提高视觉推理提供了高效途径。

Summary / 总结

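This work addresses the performance degradation of vision-language models (VLMs) in complex visual environments. By analyzing attention patterns, the authors find that visual complexity strongly correlates with attention entropy and hurts reasoning performance, and that attention refines from global scanning in shallow layers to focused convergence in deeper layers. They propose CARVE, a training-free method that extracts task-relevant visual signals by contrasting attention maps at the pixel level, achieving up to 75% improvement on open-source models.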

该研究解决了视觉-语言模型（VLMs）在复杂视觉环境中的性能下降问题。通过分析注意力模式，作者发现视觉复杂性会负面影响推理性能，并且注意力会从全局扫描逐渐精炼到聚焦收敛。他们提出了一种名为CARVE的无训练方法，通过像素级注意力对比提取任务相关的视觉信号，实现了对开源模型高达75%的性能提升。
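
The two quantities the abstract relies on, attention entropy and the contrast between a task-specific and a general attention map, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the ratio-based contrast, the `eps` smoothing term, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import math

def attention_entropy(attn):
    """Shannon entropy (in nats) of a 2D attention map of non-negative weights."""
    flat = [w for row in attn for w in row]
    total = sum(flat)
    probs = [w / total for w in flat if w > 0]
    return -sum(p * math.log(p) for p in probs)

def contrast_attention(task_attn, general_attn, eps=1e-6):
    """Assumed contrast: task attention divided by general attention, rescaled to [0, 1]."""
    ratio = [
        [t / (g + eps) for t, g in zip(t_row, g_row)]
        for t_row, g_row in zip(task_attn, general_attn)
    ]
    peak = max(max(row) for row in ratio)
    return [[r / peak for r in row] for row in ratio]

def keep_mask(contrast, threshold=0.5):
    """Binary mask of locations whose contrast score reaches the (hypothetical) threshold."""
    return [[1 if c >= threshold else 0 for c in row] for row in contrast]

# Toy 2x2 maps: the general query attends to the diagonal,
# the task-specific query attends to the left column.
general_attn = [[0.4, 0.1], [0.1, 0.4]]
task_attn = [[0.4, 0.1], [0.4, 0.1]]

mask = keep_mask(contrast_attention(task_attn, general_attn))
print(mask)  # [[0, 0], [1, 0]]
```

Only the bottom-left location survives, because it is the one place where the task-specific query attends much more strongly than the general query. A uniform map has the maximum entropy of `log(4)` for four locations, while a sharply peaked map scores lower.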

When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection

Authors: Rabin Dulal, Lihong Zheng, Muhammad Ashad Kabir

Venue: Australasian Joint Conference on Artificial Intelligence 2025

First: 2025-09-08T08:21:34+00:00 · Latest: 2025-09-08T08:21:34+00:00

Abstract

Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to depend heavily on their training data, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.

中文标题/摘要

标题：语言模型引导视觉：基于Grounding DINO的牛鼻孔检测

鼻孔模式是牛身份识别中最有效的生物特征之一。快速准确地检测鼻孔区域作为感兴趣区域对于自动视觉牛身份识别至关重要。早期的方法依赖于手动检测，这既耗时又不一致。最近，使用监督模型如YOLO进行鼻孔检测的自动化方法变得流行。尽管有效，但这些方法需要大量的标注数据集，并且往往依赖于训练数据，限制了它们在新或未见过的牛上的性能。为了解决这些限制，本研究提出了一种基于Grounding DINO的零样本鼻孔检测框架，Grounding DINO是一种能够无需任何任务特定训练或标注数据即可检测鼻孔的视觉语言模型。该方法利用自然语言提示来引导检测，使鼻孔定位在不同品种和环境中具有可扩展性和灵活性。我们的模型在mAP@0.5上达到了76.8%，展示了无需标注数据的有希望的性能。据我们所知，这是首次为牛鼻孔检测提供一种实际可行、面向行业且无需标注的解决方案。该框架为牲畜监测应用提供了监督方法的实用替代方案，有望提高适应性和部署便利性。

Summary / 总结

This study addresses the challenge of cattle muzzle detection by proposing a zero-shot framework using Grounding DINO, a vision-language model. The method relies on natural language prompts to guide detection, eliminating the need for annotated data. The model achieves a mean Average Precision (mAP)@0.5 of 76.8%, localizing muzzles across various breeds and environments without task-specific training or annotated data. According to the authors, this is the first real-world, industry-oriented, annotation-free solution for cattle muzzle detection, offering a practical alternative to supervised methods in livestock monitoring applications.

该研究旨在通过提出基于Grounding DINO的零样本框架来提高牛鼻环区域检测的准确性和效率。该方法利用自然语言提示来引导检测，无需标注数据。模型在mAP@0.5上达到了76.8%的性能，展示了无需任何标注数据进行训练的有前景的表现。这是首次提供一种适用于现实世界、面向行业的无需标注数据的牛鼻环检测解决方案，为牲畜监测应用提供了实用的替代方案。
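
The "@0.5" in mAP@0.5 is an intersection-over-union (IoU) threshold: a predicted box counts as a correct detection only if its IoU with a ground-truth box is at least 0.5. The sketch below shows that matching criterion alone, not the full mAP computation. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold

ground_truth = (0, 0, 10, 10)
print(is_true_positive((2, 0, 12, 10), ground_truth))  # True: IoU = 80 / 120
print(is_true_positive((6, 0, 16, 10), ground_truth))  # False: IoU = 40 / 160
```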

Content Generation Models in Computational Pathology: A Comprehensive Survey on Methods, Applications, and Challenges

Authors: Yuan Zhang, Xinfeng Zhang, Xiaoming Qi, Xinyu Wu, Feng Chen, Guanyu Yang, Huazhu Fu

First: 2025-05-16T08:44:50+00:00 · Latest: 2025-09-08T08:12:51+00:00

Comments: 20 pages, 8 figures

Abstract

Content generation modeling has emerged as a promising direction in computational pathology, offering capabilities such as data-efficient learning, synthetic data augmentation, and task-oriented generation across diverse diagnostic tasks. This review provides a comprehensive synthesis of recent progress in the field, organized into four key domains: image generation, text generation, molecular profile-morphology generation, and other specialized generation applications. By analyzing over 150 representative studies, we trace the evolution of content generation architectures -- from early generative adversarial networks to recent advances in diffusion models and generative vision-language models. We further examine the datasets and evaluation protocols commonly used in this domain and highlight ongoing limitations, including challenges in generating high-fidelity whole slide images, clinical interpretability, and concerns related to the ethical and legal implications of synthetic data. The review concludes with a discussion of open challenges and prospective research directions, with an emphasis on developing integrated and clinically deployable generation systems. This work aims to provide a foundational reference for researchers and practitioners developing content generation models in computational pathology.

中文标题/摘要

标题：计算病理学中的内容生成模型：方法、应用与挑战的全面综述

内容生成建模已成为计算病理学的一个有前途的方向，提供了诸如数据高效学习、合成数据增强和面向任务的内容生成等能力，适用于多种诊断任务。本文综述了该领域的最新进展，分为四个关键领域：图像生成、文本生成、分子特征-形态学生成和其他专门生成应用。通过分析超过150篇代表性研究，我们追溯了内容生成架构的发展历程——从早期的生成对抗网络到最近的扩散模型和生成视觉语言模型的进步。我们还探讨了该领域常用的数据库和评估协议，并指出了持续存在的局限性，包括生成高保真全切片图像的挑战、临床解释性以及合成数据的伦理和法律问题。综述最后讨论了开放挑战和未来研究方向，强调了开发集成和临床可部署生成系统的必要性。本文旨在为计算病理学中内容生成模型的研究人员和实践者提供一个基础参考。

Summary / 总结

This review surveys content generation models in computational pathology across four domains: image generation, text generation, molecular profile-morphology generation, and other specialized generation applications. By analyzing over 150 studies, the authors trace the evolution from early generative adversarial networks to recent diffusion models and generative vision-language models. The limitations they highlight include the difficulty of generating high-fidelity whole slide images, limited clinical interpretability, and the ethical and legal implications of synthetic data, and they call for integrated and clinically deployable generation systems.

该综述探讨了计算病理学中内容生成模型的进展，重点关注图像、文本和分子特征-形态生成。通过对150多篇研究的分析，作者追溯了从生成对抗网络到扩散模型和视觉-语言模型的架构演变。主要发现包括在生成高保真全切片图像和确保临床可解释性方面面临的挑战，同时还要解决与合成数据相关的伦理和法律问题。

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Authors: Jaemin Son, Sujin Choi, Inyong Yun

Venue: ICASSP 2026

First: 2025-09-08T08:12:26+00:00 · Latest: 2025-09-08T08:12:26+00:00

Comments: Submitted to ICASSP 2026

Abstract

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.

中文标题/摘要

标题：索引保留轻量级分词剪枝方法在视觉语言模型中高效文档理解

视觉语言模型（VLMs）在文档理解任务中取得了令人印象深刻的成果，但其高计算需求仍然是一个挑战。为减轻计算负担，我们提出了一种轻量级分词剪枝框架，在VLM处理之前从文档图像中过滤掉非信息性背景区域。二元块级分类器移除非文本区域，最大池化精炼步骤恢复碎片化的文本区域以增强空间连贯性。在真实世界文档数据集上的实验表明，我们的方法显著降低了计算成本，同时保持了相当的准确性。

Summary / 总结

The research aims to reduce the computational demands of vision-language models in document understanding tasks. It introduces a lightweight token pruning framework that filters out non-informative background regions from document images before VLM processing. The method uses a binary patch-level classifier to remove non-text areas and a max-pooling refinement step to recover fragmented text regions. Experiments on real-world document datasets show that this approach substantially reduces computational cost while maintaining comparable accuracy.

研究旨在减少视觉语言模型在文档理解任务中的计算需求。提出了一种轻量级的标记剪枝框架，在VLM处理之前过滤掉文档图像中的非信息性背景区域。该方法使用二元块级分类器去除非文本区域，并通过最大池化精修步骤恢复断开的文本区域。实验表明，这种方法可以显著降低计算成本，同时保持与现有方法相当的准确性。
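
The refinement step described in the abstract can be pictured as a max-pooling pass over the binary patch mask, which switches a dropped patch back on when any neighbour was classified as text. The sketch below assumes a 3x3 window with stride 1 and reads "index-preserving" as keeping each surviving patch's original row-major position. Neither detail is confirmed by the abstract, and the function names are hypothetical.

```python
def max_pool_refine(mask, kernel=3):
    """Stride-1 max pooling over a binary patch mask, clipped at the borders."""
    height, width = len(mask), len(mask[0])
    radius = kernel // 2
    refined = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            refined[i][j] = max(
                mask[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            )
    return refined

def kept_patch_indices(mask):
    """Row-major indices of kept patches, so original positions are preserved."""
    width = len(mask[0])
    return [i * width + j for i, row in enumerate(mask) for j, v in enumerate(row) if v]

# Classifier output for a 4x5 patch grid: two isolated text patches on row 1.
patch_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

print(kept_patch_indices(patch_mask))                   # [6, 8]
print(kept_patch_indices(max_pool_refine(patch_mask)))  # indices 0 through 14
```

In this toy grid the two fragments merge into one contiguous block covering the top three rows, while the bottom row, which has no text neighbour, stays pruned. A larger window recovers more context at the cost of keeping more tokens.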

Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning

Authors: Yihong Luo, Wenwu He, Zhuo-Xu Cui, Dong Liang

First: 2025-09-08T08:01:26+00:00 · Latest: 2025-09-08T08:01:26+00:00

Abstract

This study presents DiagCoT, a multi-stage framework that applies supervised fine-tuning to general-purpose vision-language models (VLMs) to emulate radiologists' stepwise diagnostic reasoning using only free-text reports. DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to enhance factual accuracy and fluency. On the MIMIC-CXR benchmark, DiagCoT improved zero-shot disease classification AUC from 0.52 to 0.76 (absolute gain of 0.24), pathology grounding mIoU from 0.08 to 0.31 (absolute gain of 0.23), and report generation BLEU from 0.11 to 0.33 (absolute gain of 0.22). It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets. By converting unstructured clinical narratives into structured supervision, DiagCoT offers a scalable approach for developing interpretable and diagnostically competent AI systems for radiology.

中文标题/摘要

标题：使用报告引导的链式推理逐步教学AI诊断推理

本研究提出了DiagCoT，这是一种多阶段框架，通过监督微调将通用视觉-语言模型（VLMs）转化为仅使用自由文本报告来模仿放射科医生逐步诊断推理的机制。DiagCoT结合了对比图像-报告调优以实现领域对齐、链式推理监督以捕捉推理逻辑，以及强化调优与临床奖励信号以提高事实准确性和流畅性。在MIMIC-CXR基准测试中，DiagCoT将零样本疾病分类AUC从0.52提高到0.76（绝对增益0.24），病理定位mIoU从0.08提高到0.31（绝对增益0.23），报告生成BLEU从0.11提高到0.33（绝对增益0.22）。它在长尾疾病和外部数据集上优于包括LLaVA-Med和CXR-LLAVA在内的最新模型。通过将未结构化的临床叙述转换为结构化的监督，DiagCoT提供了一种可扩展的方法，用于开发可解释且诊断能力较强的AI系统以应用于放射学。

Summary / 总结

This study introduces DiagCoT, a multi-stage framework that applies supervised fine-tuning to general-purpose vision-language models so that they emulate radiologists' stepwise diagnostic reasoning using only free-text reports. DiagCoT combines contrastive image-report tuning, chain-of-thought supervision, and reinforcement tuning with clinical reward signals. On the MIMIC-CXR benchmark, it raises zero-shot disease classification AUC from 0.52 to 0.76, pathology grounding mIoU from 0.08 to 0.31, and report generation BLEU from 0.11 to 0.33, and it outperforms models such as LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets.

本研究提出了一种多阶段框架DiagCoT，通过监督微调通用视觉-语言模型来模仿放射科医生基于自由文本报告的逐步诊断推理。DiagCoT结合了对比图像-报告调优、链式推理监督和基于临床奖励信号的强化调优。在MIMIC-CXR基准上，DiagCoT显著提高了零样本疾病分类AUC、病理定位mIoU和报告生成BLEU分数，优于包括LLaVA-Med和CXR-LLAVA在内的最新模型在长尾疾病和外部数据集上的表现。

REVEAL -- Reasoning and Evaluation of Visual Evidence through Aligned Language

Authors: Ipsita Praharaj, Yukta Butala, Badrikanath Praharaj, Yash Butala

Venue: ICCV 2025

First: 2025-08-18T00:42:02+00:00 · Latest: 2025-09-08T07:14:44+00:00

Comments: 4 pages, 6 figures, International Conference on Computer Vision, ICCV 2025

Abstract

The rapid advancement of generative models has intensified the challenge of detecting and interpreting visual forgeries, necessitating robust frameworks for image forgery detection while providing reasoning as well as localization. While existing works approach this problem using supervised training for specific manipulation or anomaly detection in the embedding space, generalization across domains remains a challenge. We frame this problem of forgery detection as a prompt-driven visual reasoning task, leveraging the semantic alignment capabilities of large vision-language models. We propose a framework, `REVEAL` (Reasoning and Evaluation of Visual Evidence through Aligned Language), that incorporates generalized guidelines. We propose two tangential approaches - (1) Holistic Scene-level Evaluation that relies on the physics, semantics, perspective, and realism of the image as a whole and (2) Region-wise anomaly detection that splits the image into multiple regions and analyzes each of them. We conduct experiments over datasets from different domains (Photoshop, DeepFake and AIGC editing). We compare the Vision Language Models against competitive baselines and analyze the reasoning provided by them.

中文标题/摘要

标题：REVEAL —— 通过对齐语言进行视觉证据的推理和评估

生成模型的迅速发展加剧了对视觉伪造的检测和解释的挑战，需要建立稳健的图像伪造检测框架，同时提供推理和定位。现有工作通过监督训练特定操作或嵌入空间中的异常检测来解决这一问题，但在跨领域泛化方面仍面临挑战。我们将伪造检测问题框架化为一个提示驱动的视觉推理任务，利用大型视觉-语言模型的语义对齐能力。我们提出了一种框架，`REVEAL`（通过对齐语言进行视觉证据的推理和评估），并提出了两种辅助方法：（1）整体场景级评估，依赖于图像的整体物理、语义、视角和现实性；（2）区域级异常检测，将图像划分为多个区域并逐个分析。我们在不同领域的数据集（Photoshop、DeepFake和AIGC编辑）上进行了实验。我们将视觉语言模型与竞争性基线进行了比较，并分析了它们提供的推理。

Summary / 总结

The paper addresses the challenge of detecting and interpreting visual forgeries, which the rapid advancement of generative models has made harder. It frames forgery detection as a prompt-driven visual reasoning task and proposes REVEAL, a framework that leverages the semantic alignment capabilities of large vision-language models. REVEAL includes two approaches: Holistic Scene-level Evaluation and Region-wise Anomaly Detection. Experiments cover datasets from different domains (Photoshop, DeepFake, and AIGC editing), comparing vision-language models against competitive baselines and analyzing the reasoning they provide.

论文提出了一种名为REVEAL的框架，利用大型视觉语言模型的语义对齐能力来检测和解释视觉伪造。该框架提出了两种方法：整体场景评估和区域异常检测。该框架在包括Photoshop、DeepFake和AIGC编辑在内的多个数据集上进行了测试，结果显示其在提供伪造定位和推理方面优于现有方法。
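
The region-wise approach splits an image into multiple regions and analyzes each one separately. The abstract does not say how the regions are chosen, so the sketch below assumes a plain rows-by-columns grid. The function name and the `(left, top, right, bottom)` pixel-box format are hypothetical.

```python
def grid_regions(width, height, rows, cols):
    """Split an image of the given size into rows x cols boxes (left, top, right, bottom)."""
    regions = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            regions.append((left, top, right, bottom))
    return regions

for box in grid_regions(100, 60, 2, 2):
    print(box)
# (0, 0, 50, 30)
# (50, 0, 100, 30)
# (0, 30, 50, 60)
# (50, 30, 100, 60)
```

Each box could then be cropped and passed to a vision-language model with a region-specific prompt. Integer division keeps the boxes contiguous and non-overlapping even when the image size is not a multiple of the grid size.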