arXiv 论文速递

LLaDA-VLA: Vision Language Diffusion Action Models

Authors: Yuqing Wen, Hebei Li, Kefan Gu, Yucheng Zhao, Tiancai Wang, Xiaoyan Sun

First: 2025-09-08T17:45:40+00:00 · Latest: 2025-09-08T17:45:40+00:00

Abstract

The rapid progress of autoregressive vision-language models (VLMs) has inspired growing interest in vision-language-action models (VLAs) for robotic manipulation. Recently, masked diffusion models, a paradigm distinct from autoregressive models, have begun to demonstrate competitive performance in text generation and multimodal applications, leading to the development of a series of diffusion-based VLMs (d-VLMs). However, leveraging such models for robot policy learning remains largely unexplored. In this work, we present LLaDA-VLA, the first Vision-Language-Diffusion-Action model built upon pretrained d-VLMs for robotic manipulation. To effectively adapt d-VLMs to the robotic domain, we introduce two key designs: (1) a localized special-token classification strategy that replaces full-vocabulary classification with special action token classification, reducing adaptation difficulty; (2) a hierarchical action-structured decoding strategy that decodes action sequences hierarchically, accounting for the dependencies within and across actions. Extensive experiments demonstrate that LLaDA-VLA significantly outperforms state-of-the-art VLAs on both simulation and real-world robots.
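
The first design above, classifying over special action tokens instead of the full vocabulary, can be illustrated with a minimal sketch. The function names, the vocabulary size, and the choice of token ids 7 to 9 as action tokens are assumptions made for illustration only; the abstract does not specify how LLaDA-VLA implements this step.

```python
import numpy as np

def localized_action_logits(full_logits, action_token_ids):
    """Keep only the logits that belong to the special action tokens.

    full_logits: array of shape (..., vocab_size).
    action_token_ids: vocabulary ids reserved for action tokens (hypothetical).
    """
    return full_logits[..., action_token_ids]

def predict_action_tokens(full_logits, action_token_ids):
    """Pick the best action token per position, ignoring all other vocabulary entries."""
    local = localized_action_logits(full_logits, action_token_ids)
    local_idx = local.argmax(axis=-1)                # index inside the action subset
    return np.asarray(action_token_ids)[local_idx]  # map back to vocabulary ids

# Toy vocabulary of 10 tokens; ids 7, 8, 9 are assumed to be action tokens.
logits = np.array([[0.0, 1.0, 9.0, 0.0, 0.0, 0.0, 0.0, 2.0, 5.0, 3.0]])
predicted = predict_action_tokens(logits, [7, 8, 9])
```

In this toy input the largest logit overall sits on a non-action token (id 2), yet the localized classifier can only return one of the action ids, which is the point of shrinking the classification space.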

中文标题/摘要

标题：LLaDA-VLA：视觉语言扩散动作模型

自回归视觉语言模型（VLMs）的快速发展激发了对视觉语言动作模型（VLA）在机器人操作方面的研究兴趣。最近，掩码扩散模型作为一种不同于自回归模型的范式，在文本生成和多模态应用中开始表现出竞争力，推动了一系列基于扩散的VLMs（d-VLMs）的发展。然而，利用这些模型进行机器人策略学习仍然鲜有探索。本文介绍了LLaDA-VLA，这是首个基于预训练d-VLMs的视觉语言扩散动作模型，用于机器人操作。为了有效适应机器人领域，我们提出了两个关键设计：（1）局部特殊标记分类策略，用特殊动作标记分类替代全词汇分类，降低适应难度；（2）分层动作结构解码策略，考虑动作内部和跨动作的依赖关系，逐级解码动作序列。大量实验表明，LLaDA-VLA在模拟和真实机器人上均显著优于最先进的VLA。

Summary / 总结

LLaDA-VLA is the first vision-language-diffusion-action model for robotic manipulation, built on pretrained diffusion-based vision-language models (d-VLMs). It introduces a localized special-token classification strategy and a hierarchical action-structured decoding strategy to adapt d-VLMs to the robotic domain. Experimental results show that LLaDA-VLA outperforms existing vision-language-action models on both simulation and real-world robots.

研究旨在利用基于扩散的视觉语言模型（d-VLMs）进行机器人操作，填补了机器人策略学习方面的空白。LLaDA-VLA 是首个视觉语言扩散动作模型，引入了局部特殊标记分类和分层动作结构解码策略。实验表明，LLaDA-VLA 在模拟和真实世界机器人任务中均优于现有视觉语言动作模型。

COMPACT: Common-token Optimized Model Pruning Across Channels and Tokens

Authors: Eugene Kwek, Wenpeng Yin

First: 2025-09-08T16:07:06+00:00 · Latest: 2025-09-08T16:07:06+00:00

Abstract

Making LLMs more efficient in memory, latency, and serving cost is crucial for edge deployment, interactive applications, and sustainable inference at scale. Pruning is a key technique toward this goal. However, prior pruning methods are limited: width pruning often breaks the standard transformer layout or requires custom inference code, while depth pruning removes entire layers and can cause abrupt accuracy drops. In this work, we propose COMPACT, which jointly (i) prunes rare vocabulary to shrink embedding/unembedding and (ii) prunes FFN intermediate channels using common-token-weighted activations, aligning importance with the post-pruning token distribution. COMPACT enjoys merits of both depth and width pruning, such as: deployment-friendliness (keeps a standard transformer architecture), scale-adaptivity (trade off vocab vs. FFN pruning), training-free operation with competitive pruning time, and strong memory savings alongside throughput gains. Experiments across Qwen, LLaMA, and Gemma families (0.5B-70B) show state-of-the-art downstream task performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and end-to-end latency.
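
The two pruning steps described above can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes that "common-token weighting" means counting only activations at positions whose token survives vocabulary pruning, and all function names are hypothetical.

```python
import numpy as np

def prune_vocab(token_counts, keep):
    """Return the ids of the `keep` most frequent tokens, in ascending order."""
    order = np.argsort(-np.asarray(token_counts), kind="stable")
    return np.sort(order[:keep])

def channel_importance(activations, token_ids, kept_ids):
    """Score FFN intermediate channels using only positions with a kept token.

    activations: (n_positions, n_channels) intermediate FFN activations.
    token_ids:   (n_positions,) vocabulary id at each position.
    """
    weights = np.isin(token_ids, kept_ids).astype(float)  # 1 for common, 0 for rare (assumption)
    return (weights[:, None] * np.abs(activations)).sum(axis=0)

def prune_channels(importance, keep):
    """Return the indices of the `keep` highest-scoring channels, in ascending order."""
    order = np.argsort(-np.asarray(importance), kind="stable")
    return np.sort(order[:keep])

counts = np.array([100, 1, 50, 2])           # toy corpus frequencies for 4 tokens
kept_tokens = prune_vocab(counts, keep=2)    # tokens 0 and 2 survive
acts = np.array([[1.0, 0.0, 0.0],
                 [0.0, 10.0, 0.0],           # large activation, but on rare token 1
                 [0.0, 0.0, 2.0]])
scores = channel_importance(acts, np.array([0, 1, 2]), kept_tokens)
kept_channels = prune_channels(scores, keep=2)
```

Channel 1 fires strongly, but only on a token that no longer exists after vocabulary pruning, so it receives zero importance. That mirrors the stated goal of aligning channel importance with the post-pruning token distribution.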

中文标题/摘要

标题：COMPACT：跨通道与令牌的常见令牌优化模型剪枝

使大语言模型（LLM）在内存、延迟和服务成本方面更加高效，对于边缘部署、交互式应用以及大规模可持续推理至关重要。剪枝是实现这一目标的关键技术。然而，先前的剪枝方法存在局限性：宽度剪枝通常会破坏标准的Transformer布局或需要自定义推理代码，而深度剪枝会移除整个层并可能导致准确率骤降。在本工作中，我们提出了COMPACT，它联合（i）剪除罕见词汇以缩小嵌入/反嵌入层，并（ii）使用常见令牌加权的激活剪枝FFN中间通道，使重要性与剪枝后的令牌分布相一致。COMPACT兼具深度和宽度剪枝的优点，如：部署友好性（保持标准的Transformer架构）、规模适应性（在词汇剪枝与FFN剪枝之间权衡）、无需训练且剪枝时间具有竞争力，以及显著的内存节省和吞吐量提升。在Qwen、LLaMA和Gemma家族（0.5B-70B）上进行的实验显示，COMPACT在相近或更高的剪枝比率下，下游任务性能达到最先进的水平，同时参数量、GPU内存和端到端延迟显著减少。

Summary / 总结

The research aims to enhance the efficiency of large language models (LLMs) in terms of memory, latency, and serving cost for edge deployment and interactive applications. COMPACT, a training-free pruning method, jointly prunes rare vocabulary and FFN intermediate channels using common-token-weighted activations. This approach combines the benefits of both depth and width pruning: it maintains a standard transformer architecture, offers scale-adaptive pruning, and achieves strong memory savings and throughput gains. Experiments across the Qwen, LLaMA, and Gemma families (0.5B-70B) demonstrate state-of-the-art downstream performance at similar or higher pruning ratios, with substantial reductions in parameters, GPU memory, and latency.

研究旨在提高大型语言模型（LLM）在内存、延迟和服务成本方面的效率，以支持边缘部署和交互式应用。提出的COMPACT方法同时剪除稀有词汇以缩小嵌入层，并使用常见令牌加权的激活剪枝FFN中间通道，结合了深度和宽度剪枝的优点，保持了标准的Transformer架构，实现了可调节的剪枝规模，并获得了显著的内存节省和吞吐量提升。实验结果显示，该方法在不同LLM家族中以相近或更高的剪枝比例取得了最先进的性能。

D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning

Authors: Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar

First: 2025-09-08T14:55:16+00:00 · Latest: 2025-09-08T14:55:16+00:00

Comments: Accepted at IEEE International Conference on Data Mining (ICDM) 2025

Abstract

Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, the VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams (text, image, and reasoning) via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning

中文标题/摘要

标题：D-HUMOR：通过多模态开放式推理理解黑色幽默

在线表情包中的黑色幽默因其依赖于隐含的、敏感的和文化背景相关的提示而面临独特挑战。为了解决检测多模态内容中黑色幽默资源和方法的缺乏，我们引入了一个包含4,379个Reddit表情包的新数据集，这些表情包被标注了黑色幽默、目标类别（性别、心理健康、暴力、种族、残疾和其他）以及三级强度评级（轻微、中等、严重）。基于此资源，我们提出了一种增强推理框架，该框架首先使用大型视觉-语言模型（VLM）为每个表情包生成结构化解释。通过角色反转自循环，VLM 采用作者的视角迭代地细化其解释，确保完整性和一致性。然后，我们从OCR转录文本和自精炼的推理中提取文本特征，使用视觉变换器获取视觉特征。三流交叉推理网络（TCRNet）通过成对注意力机制融合这三流——文本、图像和推理，生成分类的统一表示。实验结果表明，我们的方法在黑色幽默检测、目标识别和强度预测三项任务上均优于强基线。该数据集、注释和代码已发布，以促进多模态幽默理解和内容审核的进一步研究。代码和数据集可在以下链接获取：https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning

Summary / 总结

The research addresses the challenge of detecting dark humor in multimodal online content, introducing a new dataset of 4,379 annotated Reddit memes. A reasoning-augmented framework is proposed, which uses a Large Vision-Language Model to generate structured explanations and iteratively refine them. A Tri-stream Cross-Reasoning Network then fuses text, image, and reasoning features to classify dark humor, outperforming existing methods in detection, target identification, and intensity prediction.

研究针对在线内容中暗黑幽默的检测难题，引入了一个包含4,379个标注的Reddit表情包新数据集。提出了一种增强推理框架，使用大型视觉-语言模型生成结构化解释并迭代精炼。然后，通过三流交叉推理网络融合文本、图像和推理特征进行分类，超越了现有方法在暗黑幽默检测、目标识别和强度预测方面的表现。

Aligning Large Vision-Language Models by Deep Reinforcement Learning and Direct Preference Optimization

Authors: Thanh Thi Nguyen, Campbell Wilson, Janis Dalins

First: 2025-09-08T14:47:57+00:00 · Latest: 2025-09-08T14:47:57+00:00

Comments: Accepted for publication in the Proceedings of the 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025)

Abstract

Large Vision-Language Models (LVLMs), or multimodal large language models, represent a significant advancement in artificial intelligence, enabling systems to understand and generate content across both visual and textual modalities. While large-scale pretraining has driven substantial progress, fine-tuning these models to align with human values or to perform specific tasks or behaviors remains a critical challenge. Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) offer promising frameworks for this alignment process. While DRL enables models to optimize actions using reward signals instead of relying solely on supervised preference data, DPO directly aligns the policy with preferences, eliminating the need for an explicit reward model. This overview explores paradigms for fine-tuning LVLMs, highlighting how DRL and DPO techniques can be used to align models with human preferences and values, improve task performance, and enable adaptive multimodal interaction. We categorize key approaches, examine sources of preference data and reward signals, and discuss open challenges such as scalability, sample efficiency, continual learning, generalization, and safety. The goal is to provide a clear understanding of how DRL and DPO contribute to the evolution of robust and human-aligned LVLMs.
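
To make the statement that DPO needs no explicit reward model concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the argument names and the default `beta` are illustrative choices, not taken from the overview.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]))
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to stay numerically stable.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy favors the chosen response more than the reference does: lower loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The implicit reward is the log-ratio between policy and reference, so the preference signal is optimized directly from log-probabilities without training a separate reward network.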

中文标题/摘要

标题：通过深度强化学习和直接偏好优化对大型视觉-语言模型进行对齐

大型视觉-语言模型（LVLMs）或跨模态大型语言模型是人工智能的一个重要进步，使系统能够理解和生成跨视觉和文本模态的内容。虽然大规模预训练推动了显著的进步，但将这些模型微调以与人类价值观对齐或执行特定任务或行为仍然是一个关键挑战。深度强化学习（DRL）和直接偏好优化（DPO）为这一对齐过程提供了有希望的框架。DRL使模型能够使用奖励信号来优化行为，而不仅仅是依赖监督偏好数据，而DPO直接将策略与偏好对齐，消除了显式奖励模型的需要。本文综述了LVLMs的微调范式，强调了DRL和DPO技术如何用于使模型与人类偏好和价值观对齐、提高任务性能和实现适应性跨模态交互。我们对关键方法进行了分类，检查了偏好数据来源、奖励信号，并讨论了可扩展性、样本效率、持续学习、泛化和安全性等开放挑战。目标是提供DRL和DPO如何促进稳健且与人类对齐的LVLMs演化的清晰理解。

Summary / 总结

This overview examines how Deep Reinforcement Learning (DRL) and Direct Preference Optimization (DPO) can align large vision-language models with human preferences and values. It explores how both techniques are used to fine-tune these models, improve task performance, and enable adaptive multimodal interaction. It notes that DPO aligns the policy directly with preferences, eliminating the need for an explicit reward model, and it discusses open challenges such as scalability, sample efficiency, continual learning, generalization, and safety.

本综述探讨了如何通过深度强化学习（DRL）和直接偏好优化（DPO）使大型视觉语言模型与人类偏好和价值观保持一致。文章讨论了如何使用DRL和DPO来微调这些模型、提高任务性能并实现适应性多模态交互。文章指出，DPO直接将策略与偏好对齐，无需显式奖励模型，并讨论了可扩展性、样本效率、持续学习、泛化和安全性等开放挑战。

Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Authors: Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah

First: 2025-06-27T11:44:40+00:00 · Latest: 2025-09-08T14:34:04+00:00

Abstract

Despite progress in Vision-Language Models (VLMs), their capacity for visual reasoning is often limited by the binding problem: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current VLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces VISER (Visual Input Structure for Enhanced Reasoning), a simple yet effective intervention: augmenting visual inputs with low-level spatial structures and pairing this with a textual prompt that encourages sequential, spatially-aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. Specifically, VISER improves GPT-4o visual search accuracy by 25.00%, increases counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and enhances performance on spatial relationship tasks by 9.50% on a 2D synthetic dataset. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. VISER enhances binding with only a single-query inference, underscoring the importance of visual input design over purely linguistic approaches. These findings suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning and could serve as a general strategy for enhancing VLM performance on spatially grounded tasks.
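
A minimal sketch of the kind of intervention described above: draw a low-level spatial structure onto the image and pair it with a prompt that asks for sequential parsing. The specific structure (evenly spaced horizontal lines), the prompt wording, and the function names are assumptions for illustration; the abstract does not specify VISER's exact structures or prompts.

```python
import numpy as np

def add_horizontal_lines(image, n_rows, value=0):
    """Return a copy of `image` with lines that split it into `n_rows` bands.

    image: array of shape (H, W) or (H, W, C).
    Returns the augmented copy and the y-coordinates of the drawn lines.
    """
    out = image.copy()
    height = out.shape[0]
    boundaries = [round(i * height / n_rows) for i in range(1, n_rows)]
    for y in boundaries:
        out[y] = value  # overwrite one pixel row with the line color
    return out, boundaries

def build_prompt(n_rows, question):
    """Compose a prompt that encourages row-by-row, spatially aware parsing."""
    return (f"The image is divided into {n_rows} rows by horizontal lines. "
            f"Scan the rows one at a time from top to bottom, then answer: {question}")

blank = np.full((9, 4), 255, dtype=np.uint8)   # toy 9x4 white image
structured, lines = add_horizontal_lines(blank, n_rows=3)
prompt = build_prompt(3, "How many red circles are there?")
```

Both the modified image and the prompt go into one model call, which matches the single-query setting described in the abstract.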

中文标题/摘要

标题：视觉结构有助于视觉推理：解决VLM中的绑定问题

尽管在视觉语言模型（VLMs）方面取得了进展，但它们在视觉推理方面的能力往往受限于绑定问题：无法可靠地将感知特征与正确的视觉参照物关联起来。这一限制导致了诸如计数、视觉搜索、场景描述和空间关系理解等任务中的持续错误。关键因素在于当前的VLMs主要以并行方式处理视觉特征，缺乏空间定位的序列注意力机制。本文介绍了VISER（视觉输入结构以增强推理），这是一种简单而有效的干预措施：通过在视觉输入中添加低级的空间结构，并配以鼓励顺序、空间意识解析的文本提示。我们实证展示了在核心视觉推理任务中取得了显著的性能提升。具体而言，VISER将GPT-4o的视觉搜索准确性提高了25.00%，计数准确性提高了26.83%，场景描述中的编辑距离误差减少了0.32，并在2D合成数据集上将空间关系任务的性能提高了9.50%。此外，我们发现视觉修改对于这些提升是必不可少的；纯粹的文本策略，包括链式思考提示，是不够的，甚至可能降低性能。VISER仅通过单查询推理就能增强绑定，突显了视觉输入设计的重要性，而非纯粹基于语言的方法。这些发现表明，低级视觉结构化是一个强大且未被充分探索的方向，可以提高组合视觉推理，并可能作为增强VLM在空间定位任务上性能的一般策略。

Summary / 总结

This paper addresses the binding problem in Vision-Language Models (VLMs) by introducing VISER, which augments visual inputs with low-level spatial structures and encourages sequential, spatially-aware parsing through a textual prompt. The method substantially improves performance on core visual reasoning tasks: with GPT-4o, VISER raises visual search accuracy by 25.00% and counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and improves spatial relationship performance by 9.50% on a 2D synthetic dataset. The study also shows that the visual modification is essential for these gains, as purely textual strategies are insufficient and can even degrade performance.

本文通过引入VISER，即在视觉输入中增加低级空间结构，并通过文本提示鼓励顺序的空间感知解析，来解决视觉语言模型（VLMs）中的绑定问题。该方法在核心视觉推理任务中显著提高了性能，GPT-4o的视觉搜索准确性提高了25.00%，计数准确性提高了26.83%，空间关系任务性能提高了9.50%。研究还表明，视觉修改对于这些改进是必不可少的，纯文本策略不仅无效，甚至会降低性能。

Robust and Label-Efficient Deep Waste Detection

Authors: Hassan Abid, Khan Muhammad, Muhammad Haris Khan

First: 2025-08-26T08:34:04+00:00 · Latest: 2025-09-08T10:07:31+00:00

Comments: Accepted at BMVC 2025

Abstract

Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
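
The soft pseudo-labeling idea above, fusing ensemble predictions with spatial and consensus-aware weighting, can be sketched roughly as follows. This is an assumption-laden illustration, not the released pipeline: boxes are clustered greedily by IoU, coordinates are averaged with confidence weights, and the soft score is the mean confidence scaled by the fraction of models that agree.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(detections, n_models, iou_thr=0.5):
    """Fuse same-class detections from several models into soft pseudo-labels.

    detections: list of (box, score) pairs pooled over all models, scores > 0.
    Returns a list of (fused_box, soft_score).
    """
    clusters = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for cluster in clusters:
            if iou(cluster[0][0], box) >= iou_thr:  # compare with the cluster's top box
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    fused = []
    for cluster in clusters:
        boxes = np.array([b for b, _ in cluster], dtype=float)
        scores = np.array([s for _, s in cluster], dtype=float)
        box = (boxes * scores[:, None]).sum(axis=0) / scores.sum()  # spatial weighting
        consensus = min(len(cluster), n_models) / n_models          # agreement ratio
        fused.append((box, float(scores.mean() * consensus)))
    return fused

pooled = [((0, 0, 10, 10), 0.8), ((2, 0, 12, 10), 0.8), ((50, 50, 60, 60), 0.4)]
labels = fuse_detections(pooled, n_models=2)
```

A box that only one of two models proposes keeps its coordinates but has its score halved, so the downstream detector treats it as a weaker training target instead of a hard label.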

中文标题/摘要

标题：鲁棒且标签高效的深度垃圾检测

有效的垃圾分类对于可持续回收至关重要，但由于数据集有限且依赖于过时的对象检测器，该领域的AI研究仍落后于商业系统。在本文中，我们通过建立强基准并引入基于集成的半监督学习框架，推进了AI驱动的垃圾检测。我们首先在真实的ZeroWaste数据集上基准测试最先进的开放词汇对象检测（OVOD）模型，表明仅类别提示表现不佳，而LLM优化的提示显著提高了零样本准确性。接着，为了解决领域特定的限制，我们微调了现代基于变换器的对象检测器，实现了新的基线51.6 mAP。然后，我们提出了一种软伪标签策略，通过空间和共识感知加权融合集成预测，实现稳健的半监督训练。应用于未标记的ZeroWaste-s子集，我们的伪注释实现了超过全监督训练的性能提升，突显了可扩展注释管道的有效性。我们的工作为研究界做出了贡献，通过建立严格的基准，引入了稳健的集成伪标签管道，生成了未标记ZeroWaste-s子集的高质量注释，并系统地评估了OVOD模型在真实世界垃圾分类条件下的表现。我们的代码可在https://github.com/h-abid97/robust-waste-detection获取。

Summary / 总结

This research aims to improve AI-driven waste detection for sustainable recycling by addressing limitations in existing datasets and object detectors. The study benchmarks state-of-the-art Open-Vocabulary Object Detection models and introduces an ensemble-based semi-supervised learning framework. Key findings include enhanced zero-shot accuracy with LLM-optimized prompts and improved performance through a soft pseudo-labeling strategy that uses spatial and consensus-aware weighting, surpassing fully supervised training on the unlabeled ZeroWaste-s subset.

该研究旨在通过解决现有数据集和物体检测器的限制，提高AI驱动的废物检测，以促进可持续回收。研究对比了最先进的开放词汇物体检测模型，并引入了一种基于集成的半监督学习框架。关键发现包括通过LLM优化提示增强零样本准确性，并通过使用空间和共识感知加权的软伪标签策略实现性能提升，超越了完全监督训练在未标记的ZeroWaste-s子集上的表现。

Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

Authors: Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

First: 2025-09-08T09:20:04+00:00 · Latest: 2025-09-08T09:20:04+00:00

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. Existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, and they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; (3) theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
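
Two quantities from the findings above can be illustrated with a short sketch: the entropy of an attention map, and a contrast between a task-specific map and a general one. The ratio-based contrast and the function names are assumptions chosen for illustration; the abstract does not give CARVE's exact formulation.

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy (in nats) of an attention map, after normalizing it to sum to 1."""
    p = attn / attn.sum()
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum())

def contrast_attention(task_attn, general_attn, eps=1e-8):
    """Divide task-specific attention by general attention and rescale to [0, 1].

    Regions that both queries attend to (assumed to be visual noise) are
    suppressed; regions only the task query attends to are emphasized.
    """
    ratio = task_attn / (general_attn + eps)
    return ratio / ratio.max()

uniform = np.full((2, 2), 0.25)                 # diffuse attention, e.g. a cluttered scene
focused = np.array([[0.1, 0.1], [0.1, 0.7]])    # attention converged on one region
contrast = contrast_attention(focused, uniform)
peak = np.unravel_index(contrast.argmax(), contrast.shape)
```

The diffuse map has the maximum possible entropy for four cells, while the focused map has lower entropy; the contrast map peaks where the task query concentrates.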

中文标题/摘要

标题：基于对比注意力聚焦：增强VLMs的视觉推理

视觉-语言模型（VLMs）在多种视觉任务中取得了显著的成功，但在复杂视觉环境中其性能会下降。尽管现有的增强方法需要额外的训练、依赖外部分割工具或在粗粒度级别上操作，但它们忽视了VLMs内部固有的能力。为了弥合这一差距，我们研究了VLMs的注意力模式，并发现：（1）视觉复杂性与注意力熵呈强相关性，负面影响了推理性能；（2）注意力从浅层的全局扫描逐渐细化到深层的集中收敛，收敛程度由视觉复杂性决定；（3）理论上，我们证明了通用查询与任务特定查询之间的注意力图对比能够将视觉信号分解为语义信号和视觉噪声成分。基于这些见解，我们提出了对比注意力精炼以增强视觉（CARVE），这是一种无需训练的方法，通过像素级的注意力对比提取与任务相关的视觉信号。大量实验表明，CARVE能够一致地提升性能，开源模型的性能提升高达75%。我们的工作为理解视觉复杂性和注意力机制之间的相互作用提供了关键见解，为通过对比注意力改进视觉推理提供了高效途径。

Summary / 总结

The research aims to enhance the visual reasoning of Vision-Language Models (VLMs) in complex visual environments. It finds that visual complexity correlates strongly with attention entropy, which in turn hurts reasoning performance. The proposed Contrastive Attention Refinement for Visual Enhancement (CARVE) method extracts task-relevant visual signals by contrasting attention maps at the pixel level, achieving up to 75% improvement on open-source models without additional training. The work clarifies how visual complexity interacts with attention mechanisms and offers an efficient way to improve visual reasoning.

研究旨在通过解决VLMs在复杂视觉环境中的推理能力下降问题来提升其性能。方法是分析VLMs的注意力模式，并提出通过像素级注意力对比增强视觉信号的Contrastive Attention Refinement for Visual Enhancement (CARVE)方法，无需额外训练。关键发现表明，CARVE显著提升了性能，开源模型的提升幅度可达75%。

When Language Model Guides Vision: Grounding DINO for Cattle Muzzle Detection

Authors: Rabin Dulal, Lihong Zheng, Muhammad Ashad Kabir

Venue: Australasian Joint Conference on Artificial Intelligence 2025

First: 2025-09-08T08:21:34+00:00 · Latest: 2025-09-08T08:21:34+00:00

Abstract

Muzzle patterns are among the most effective biometric traits for cattle identification. Fast and accurate detection of the muzzle region as the region of interest is critical to automatic visual cattle identification. Earlier approaches relied on manual detection, which is labor-intensive and inconsistent. Recently, automated methods using supervised models like YOLO have become popular for muzzle detection. Although effective, these methods require extensive annotated datasets and tend to depend on their training data, limiting their performance on new or unseen cattle. To address these limitations, this study proposes a zero-shot muzzle detection framework based on Grounding DINO, a vision-language model capable of detecting muzzles without any task-specific training or annotated data. This approach leverages natural language prompts to guide detection, enabling scalable and flexible muzzle localization across diverse breeds and environments. Our model achieves a mean Average Precision (mAP)@0.5 of 76.8%, demonstrating promising performance without requiring annotated data. To our knowledge, this is the first research to provide a real-world, industry-oriented, and annotation-free solution for cattle muzzle detection. The framework offers a practical alternative to supervised methods, promising improved adaptability and ease of deployment in livestock monitoring applications.

中文标题/摘要

标题：当语言模型引导视觉：基于Grounding DINO的牛鼻镜检测

鼻镜纹是牛只身份识别中最有效的生物特征之一。快速准确地检测作为感兴趣区域的鼻镜区域，是自动视觉牛只识别的关键。早期的方法依赖于人工检测，既费力又不一致。最近，使用YOLO等监督模型的自动化方法在鼻镜检测中变得流行。尽管有效，但这些方法需要大量标注数据，并且往往依赖训练数据，限制了它们在新的或未见过的牛只上的性能。为了解决这些限制，本研究提出了一种基于Grounding DINO的零样本鼻镜检测框架；Grounding DINO是一种视觉语言模型，无需任何任务特定训练或标注数据即可检测鼻镜。该方法利用自然语言提示来引导检测，使鼻镜定位在不同品种和环境中具有可扩展性和灵活性。我们的模型取得了76.8%的mAP@0.5（平均精度均值），表明在无需标注数据的情况下具有良好的性能。据我们所知，这是首个为牛鼻镜检测提供面向真实场景、面向行业且无需标注的解决方案的研究。该框架为牲畜监测应用提供了监督方法的实用替代方案，有望提高适应性和部署便利性。

Content Generation Models in Computational Pathology: A Comprehensive Survey on Methods, Applications, and Challenges

Authors: Yuan Zhang, Xinfeng Zhang, Xiaoming Qi, Xinyu Wu, Feng Chen, Guanyu Yang, Huazhu Fu

First: 2025-05-16T08:44:50+00:00 · Latest: 2025-09-08T08:12:51+00:00

Comments: 20 pages, 8 figures

Abstract

Content generation modeling has emerged as a promising direction in computational pathology, offering capabilities such as data-efficient learning, synthetic data augmentation, and task-oriented generation across diverse diagnostic tasks. This review provides a comprehensive synthesis of recent progress in the field, organized into four key domains: image generation, text generation, molecular profile-morphology generation, and other specialized generation applications. By analyzing over 150 representative studies, we trace the evolution of content generation architectures -- from early generative adversarial networks to recent advances in diffusion models and generative vision-language models. We further examine the datasets and evaluation protocols commonly used in this domain and highlight ongoing limitations, including challenges in generating high-fidelity whole slide images, clinical interpretability, and concerns related to the ethical and legal implications of synthetic data. The review concludes with a discussion of open challenges and prospective research directions, with an emphasis on developing integrated and clinically deployable generation systems. This work aims to provide a foundational reference for researchers and practitioners developing content generation models in computational pathology.

中文标题/摘要

标题：计算病理学中的内容生成模型：方法、应用与挑战的综合综述

内容生成建模已成为计算病理学的一个有前途的方向，提供了诸如数据高效学习、合成数据增强和面向任务的内容生成等能力，适用于多种诊断任务。本文综述了该领域的最新进展，分为四个关键领域：图像生成、文本生成、分子特征-形态学生成和其他专门生成应用。通过分析超过150篇代表性研究，我们追溯了内容生成架构的发展历程——从早期的生成对抗网络到最近的扩散模型和生成视觉-语言模型的进步。我们还探讨了该领域常用的数据库和评估协议，并指出了持续存在的局限性，包括生成高保真全切片图像的挑战、临床解释性以及合成数据的伦理和法律问题。综述最后讨论了开放挑战和未来研究方向，强调了开发集成和临床可部署生成系统的必要性。本文旨在为计算病理学中内容生成模型的研究人员和实践者提供基础参考。

Summary / 总结

This review explores the development of content generation models in computational pathology, focusing on image, text, and molecular profile-morphology generation. By analyzing over 150 studies, the authors trace the evolution from early generative adversarial networks to recent diffusion models and vision-language models. Key findings include the challenges in generating high-fidelity whole slide images and ensuring clinical interpretability, with a call for integrated and deployable systems.

该综述探讨了计算病理学中内容生成模型的发展，重点关注图像、文本和分子特征-形态学生成。通过分析超过150项研究，作者追溯了从生成对抗网络到扩散模型和视觉语言模型的架构演变。主要发现包括在生成高保真全切片图像方面的挑战以及确保临床可解释性，同时还要解决与合成数据相关的伦理和法律问题。

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Authors: Jaemin Son, Sujin Choi, Inyong Yun

Venue: ICASSP 2026

First: 2025-09-08T08:12:26+00:00 · Latest: 2025-09-08T08:12:26+00:00

Comments: Submitted to ICASSP 2026

Abstract

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.
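
The pruning pipeline above can be sketched on a patch grid. The sketch assumes the binary classifier has already produced a boolean text/non-text mask per patch; the 3x3 window and the function names are illustrative choices, not details from the abstract.

```python
import numpy as np

def max_pool_refine(mask, k=3):
    """Stride-1 max pooling over a boolean patch mask with a k x k window (k odd).

    A patch is kept if any patch in its neighborhood was classified as text,
    which reconnects text regions that the classifier fragmented.
    """
    pad = k // 2
    padded = np.pad(mask, pad, constant_values=False)
    out = np.zeros_like(mask)
    height, width = mask.shape
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + height, dx:dx + width]
    return out

def kept_patch_indices(mask):
    """Flat indices of kept patches in the original grid (index-preserving)."""
    return np.flatnonzero(mask)

# 3x4 patch grid where the classifier found text in a single patch.
text_mask = np.zeros((3, 4), dtype=bool)
text_mask[1, 1] = True
refined = max_pool_refine(text_mask)
indices = kept_patch_indices(refined)
```

Returning the original flat indices, instead of renumbering the surviving patches, lets the kept tokens carry the position information of where they sat in the page, which is one plausible reading of "index-preserving" in the title.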

中文标题/摘要

标题：面向视觉语言模型高效文档理解的索引保留轻量级令牌剪枝

视觉语言模型（VLMs）的最新进展在文档理解任务中取得了令人印象深刻的成果，但其高计算需求仍然是一个挑战。为减轻计算负担，我们提出了一种轻量级令牌剪枝框架，在VLM处理之前从文档图像中过滤掉不含信息的背景区域。二元图像块级分类器移除非文本区域，最大池化细化步骤恢复被割裂的文本区域以增强空间连贯性。在真实世界文档数据集上的实验表明，我们的方法显著降低了计算成本，同时保持了相当的准确性。

Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning

Authors: Yihong Luo, Wenwu He, Zhuo-Xu Cui, Dong Liang

First: 2025-09-08T08:01:26+00:00 · Latest: 2025-09-08T08:01:26+00:00

Abstract

This study presents DiagCoT, a multi-stage framework that applies supervised fine-tuning to general-purpose vision-language models (VLMs) to emulate radiologists' stepwise diagnostic reasoning using only free-text reports. DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to enhance factual accuracy and fluency. On the MIMIC-CXR benchmark, DiagCoT improved zero-shot disease classification AUC from 0.52 to 0.76 (absolute gain of 0.24), pathology grounding mIoU from 0.08 to 0.31 (absolute gain of 0.23), and report generation BLEU from 0.11 to 0.33 (absolute gain of 0.22). It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets. By converting unstructured clinical narratives into structured supervision, DiagCoT offers a scalable approach for developing interpretable and diagnostically competent AI systems for radiology.

中文标题/摘要

标题：使用报告引导的链式推理教学AI逐步诊断推理

本研究提出了DiagCoT，这是一种多阶段框架，通过监督微调将通用视觉-语言模型（VLMs）转化为仅使用自由文本报告来模仿放射科医生逐步诊断推理。DiagCoT结合了对比图像-报告微调以实现领域对齐、链式推理监督以捕捉推理逻辑，以及强化微调带有临床奖励信号以提高事实准确性和流畅性。在MIMIC-CXR基准测试中，DiagCoT将零样本疾病分类AUC从0.52提高到0.76（绝对增益0.24），病理定位mIoU从0.08提高到0.31（绝对增益0.23），报告生成BLEU从0.11提高到0.33（绝对增益0.22）。它在长尾疾病和外部数据集上优于包括LLaVA-Med和CXR-LLAVA在内的最新模型。通过将未结构化的临床叙述转换为结构化的监督，DiagCoT提供了一种可扩展的方法，用于开发可解释且诊断能力较强的AI系统以应用于放射学。

Summary / 总结

This study introduces DiagCoT, a multi-stage framework that fine-tunes VLMs using supervised learning to mimic radiologists' diagnostic reasoning based on free-text reports. It combines contrastive image-report tuning, chain-of-thought supervision, and reinforcement tuning with clinical rewards. DiagCoT significantly improved zero-shot disease classification AUC, pathology grounding mIoU, and report generation BLEU on the MIMIC-CXR benchmark, outperforming state-of-the-art models like LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets.

本研究提出了一种多阶段框架DiagCoT，通过监督学习微调VLMs，模仿放射科医生基于自由文本报告的诊断推理。该框架结合了对比图像-报告调优、推理链监督和基于临床奖励的强化调优。DiagCoT在MIMIC-CXR基准上显著提高了零样本疾病分类AUC、病理定位mIoU和报告生成BLEU，超越了包括LLaVA-Med和CXR-LLAVA在内的最新模型在长尾疾病和外部数据集上的表现。

REVEAL -- Reasoning and Evaluation of Visual Evidence through Aligned Language

Authors: Ipsita Praharaj, Yukta Butala, Badrikanath Praharaj, Yash Butala

Venue: ICCV 2025

First: 2025-08-18T00:42:02+00:00 · Latest: 2025-09-08T07:14:44+00:00

Comments: 4 pages, 6 figures, International Conference on Computer Vision, ICCV 2025

Abstract

The rapid advancement of generative models has intensified the challenge of detecting and interpreting visual forgeries, necessitating robust frameworks for image forgery detection that also provide reasoning and localization. While existing works approach this problem using supervised training for specific manipulations or anomaly detection in the embedding space, generalization across domains remains a challenge. We frame forgery detection as a prompt-driven visual reasoning task, leveraging the semantic alignment capabilities of large vision-language models. We propose a framework, REVEAL (Reasoning and Evaluation of Visual Evidence through Aligned Language), that incorporates generalized guidelines. We propose two tangential approaches: (1) Holistic Scene-level Evaluation, which relies on the physics, semantics, perspective, and realism of the image as a whole, and (2) Region-wise Anomaly Detection, which splits the image into multiple regions and analyzes each of them. We conduct experiments over datasets from different domains (Photoshop, DeepFake, and AIGC editing). We compare the vision-language models against competitive baselines and analyze the reasoning they provide.

中文标题/摘要

标题：REVEAL——通过对齐语言进行视觉证据的推理与评估

生成模型的迅速发展加剧了对视觉伪造的检测和解释的挑战，需要建立稳健的图像伪造检测框架，同时提供推理和定位。现有工作通过监督训练特定操作或嵌入空间中的异常检测来解决这一问题，但在跨领域泛化方面仍面临挑战。我们将伪造检测问题框架化为一个提示驱动的视觉推理任务，利用大型视觉-语言模型的语义对齐能力。我们提出了一种框架，`REVEAL`（通过对齐语言进行视觉证据的推理与评估），并提出了两种辅助方法：（1）整体场景级评估，依赖于图像的整体物理、语义、视角和现实性；（2）区域级异常检测，将图像划分为多个区域并逐个分析。我们在不同领域的数据集（Photoshop、DeepFake和AIGC编辑）上进行了实验。我们将视觉语言模型与竞争性基线进行了比较，并分析了它们提供的推理。

Summary / 总结

The paper addresses the challenge of detecting and interpreting visual forgeries using a framework called REVEAL, which leverages the semantic alignment capabilities of large vision-language models. It proposes two approaches: Holistic Scene-level Evaluation and Region-wise anomaly detection. The authors run experiments on datasets from several domains, including Photoshop, DeepFake, and AIGC editing, compare the vision-language models against competitive baselines, and analyze the reasoning the models provide.

论文提出了一种名为REVEAL的框架，利用大型视觉语言模型的语义对齐能力来检测和解释视觉伪造。REVEAL提出了两种方法：整体场景级评估和区域级异常检测。作者在Photoshop、DeepFake和AIGC编辑等多个领域的数据集上进行了实验，将视觉语言模型与有竞争力的基线进行比较，并分析了模型给出的推理。

Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative Models

Authors: Meiling Li, Zhenxing Qian, Xinpeng Zhang

First: 2024-03-03T11:55:49+00:00 · Latest: 2025-09-08T06:46:18+00:00

Comments: The paper has been withdrawn by the authors because the proposed approach is currently undergoing optimization and improvement. We are refining the methodology to achieve more robust and convincing results, and a revised version will be submitted once the enhancements are completed

Abs · PDF

Abstract

Text-to-image generative models have recently garnered significant attention due to their ability to generate images based on prompt descriptions. While these models have shown promising performance, concerns have been raised regarding the potential misuse of the generated fake images. In response to this, we have presented a simple yet effective training-free method to attribute fake images generated by text-to-image models to their source models. Given a test image to be attributed, we first invert the textual prompt of the image, and then feed the reconstructed prompt into different candidate models to regenerate candidate fake images. By calculating and ranking the similarity of the test image and the candidate images, we can determine the source of the image. This attribution allows model owners to be held accountable for any misuse of their models. Note that our approach does not limit the number of candidate text-to-image generative models. Comprehensive experiments reveal that (1) our method can effectively attribute fake images to their source models, achieving attribution performance comparable to the state-of-the-art method; (2) our method is highly scalable and well adapted to real-world attribution scenarios; and (3) the proposed method is satisfactorily robust to common attacks, such as Gaussian blurring, JPEG compression, and resizing. We also analyze the factors that influence the attribution performance, and explore the boost brought by the proposed method as a plug-in to improve the performance of the existing SOTA method. We hope our work can shed some light on tracing the source of AI-generated images and on preventing the misuse of text-to-image generative models.

中文标题/摘要

标题：基于再生的无需训练归因伪造图像生成的文本到图像生成模型

基于文本的图像生成模型由于能够根据提示描述生成图像，最近引起了广泛关注。尽管这些模型表现出色，但人们对生成的伪造图像的潜在滥用表示担忧。为应对这一问题，我们提出了一种简单而有效的无需训练的方法，用于将由文本到图像生成模型生成的伪造图像归因于其源头模型。给定一个待归因的测试图像，我们首先逆向生成图像的文本提示，然后将重构的提示输入不同的候选模型以再生候选伪造图像。通过计算和排名测试图像与候选图像的相似度，可以确定图像的来源。这种归因使模型所有者能够对其模型的任何滥用负责。值得注意的是，我们的方法不限制候选文本到图像生成模型的数量。全面的实验表明：(1) 我们的方法可以有效地将伪造图像归因于其源头模型，其归因性能与最先进的方法相当；(2) 我们的方法具有很高的可扩展性，能够很好地适应实际的归因场景；(3) 所提出的方法对常见的攻击（如高斯模糊、JPEG压缩和缩放）具有良好的鲁棒性。我们还分析了影响归因性能的因素，并探讨了所提出方法作为插件带来的性能提升。我们希望我们的工作能够为解决AI生成图像的来源以及防止文本到图像生成模型的滥用提供一些启示。

Summary / 总结

The paper presents a training-free method to attribute fake images generated by text-to-image models to their source models. By inverting the textual prompt of the test image and regenerating candidate images with different models, the method ranks the similarity between the test image and the candidates to determine the source. Experiments show that the method achieves attribution performance comparable to the state-of-the-art approach, scales to many candidate models, and is robust to common attacks. The paper has since been withdrawn by the authors, who state that the approach is being refined and that a revised version will be submitted once the improvements are complete.

论文提出了一种无需训练的方法，用于将由文本到图像生成模型生成的假图像归因于其源头模型。通过反转测试图像的文本提示并将其与不同模型生成的候选图像进行比较，可以有效确定图像的来源。实验表明，该方法在性能上与最先进的方法相当，具有高度的可扩展性，并且对常见的攻击具有鲁棒性。作者希望这项工作能够为解决AI生成图像的来源问题以及防止文本到图像生成模型的滥用提供帮助。
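The final ranking step of the attribution procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the test image and the regenerated candidates have already been mapped to feature vectors by some image encoder, uses cosine similarity as the similarity measure, and scores each candidate model by its best-matching regenerated image. The abstract specifies none of these choices, and the names `cosine_similarity` and `rank_candidate_models` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidate_models(
    test_embedding: Vector,
    regenerated: Dict[str, List[Vector]],
) -> List[Tuple[str, float]]:
    """Rank candidate models by similarity between the test image and their regenerations.

    `regenerated` maps a candidate model name to the embeddings of the images
    it produced from the reconstructed prompt. Assumption: a model's score is
    the maximum similarity over its regenerated images.
    """
    scores: List[Tuple[str, float]] = []
    for model_name, embeddings in regenerated.items():
        if not embeddings:
            continue  # nothing regenerated for this candidate
        best = max(cosine_similarity(test_embedding, e) for e in embeddings)
        scores.append((model_name, best))
    # Highest similarity first; the top entry is the attributed source.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because the candidates are held in a plain dictionary, adding another candidate model only adds one more entry to score, which matches the abstract's point that the number of candidates is not limited.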

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

Authors: Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

First: 2025-03-19T11:05:42+00:00 · Latest: 2025-09-08T06:43:33+00:00

Abs · PDF

Abstract

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models to visual instruction-based planning remains a wide-open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.

中文标题/摘要

标题：VIPER：视觉感知与可解释推理在序列决策中的应用

虽然大型语言模型（LLMs）在文本推理方面表现出色，视觉语言模型（VLMs）在视觉感知方面非常有效，但将这些模型应用于基于视觉指令的规划仍然是一个开放的问题。本文介绍了一种名为VIPER的新框架，该框架将基于VLM的感知与基于LLM的推理相结合，用于多模态指令驱动的规划。我们的方法使用一个模块化的流水线，其中冻结的VLM生成图像观察的文本描述，然后由LLM策略根据任务目标预测动作。我们通过行为克隆和强化学习微调推理模块，提高代理的决策能力。在ALFWorld基准测试中，VIPER显著优于最先进的基于视觉指令的规划器，同时缩小了与纯文本oracle之间的差距。通过利用文本作为中间表示，VIPER还增强了可解释性，为感知和推理组件的精细分析铺平了道路。

Summary / 总结

VIPER is a framework for multimodal instruction-based planning that combines VLM-based perception with LLM-based reasoning. It uses a modular pipeline where a VLM generates textual descriptions of image observations, which are then processed by an LLM to predict actions based on the task goal. VIPER is fine-tuned using behavioral cloning and reinforcement learning, and it outperforms state-of-the-art visual instruction-based planners on the ALFWorld benchmark while improving explainability through text-based intermediate representations.

VIPER 是一种结合 VLM 基础感知和 LLM 基础推理的多模态指令驱动规划框架。它采用模块化流水线，其中 VLM 生成图像观察的文本描述，然后由 LLM 处理以根据任务目标预测动作。通过行为克隆和强化学习对 VIPER 进行微调，它在 ALFWorld 基准测试中显著优于最先进的视觉指令驱动规划器，并通过基于文本的中间表示增强可解释性。
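The modular perceive-then-reason loop described above can be sketched with plain callables. This is a structural illustration only: `observe`, `describe`, `policy`, and `act` are hypothetical stand-ins for the environment, the frozen VLM, the LLM policy, and the action executor, and none of these names or signatures come from the paper.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (textual description, chosen action)


def run_episode(
    observe: Callable[[], object],
    describe: Callable[[object], str],
    policy: Callable[[str, str, List[Step]], str],
    act: Callable[[str], bool],
    goal: str,
    max_steps: int = 10,
) -> List[Step]:
    """Run one episode and return the text trace of (description, action) pairs.

    observe():  returns the current image observation.
    describe(): stand-in for the frozen VLM, turning an image into text.
    policy():   stand-in for the LLM policy; sees the goal, the current
                description, and the history so far.
    act():      executes the action and returns True when the task is done.
    """
    trace: List[Step] = []
    for _ in range(max_steps):
        image = observe()
        description = describe(image)
        action = policy(goal, description, trace)
        trace.append((description, action))
        if act(action):
            break
    return trace
```

Because every decision is conditioned on a textual description, the returned trace can be read directly to see whether a failure came from perception (a wrong description) or from reasoning (a wrong action given a correct description).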

Multi View Slot Attention Using Paraphrased Texts For Face Anti-Spoofing

Authors: Jeongmin Yu, Susang Kim, Kisu Lee, Taekyoung Kwon, Won-Yong Shin, Ha Young Kim

Venue: ICCV 2025

First: 2025-09-08T04:53:46+00:00 · Latest: 2025-09-08T04:53:46+00:00

Comments: Accepted by ICCV 2025

Abs · PDF · Code1

Abstract

Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.

中文标题/摘要

标题：使用 paraphrased 文本的多视图槽注意机制用于面部防伪

近期的面部防伪（FAS）方法通过使用像 CLIP 这样的视觉-语言模型展示了跨域的出色性能。然而，现有的基于 CLIP 的 FAS 模型未能充分利用 CLIP 的补丁嵌入标记，未能检测到关键的防伪线索。此外，这些模型依赖于每个类别单一的文本提示（例如 'live' 或 'fake'），这限制了泛化能力。为了解决这些问题，我们提出了 MVP-FAS，这是一种新颖的框架，结合了两个关键模块：多视图槽注意（MVS）和多文本补丁对齐（MTPA）。这两个模块利用多种 paraphrased 文本生成通用特征，减少对特定领域文本的依赖。MVS 通过利用多种视角的多样文本提取局部详细的空域特征和全局上下文。MTPA 对齐补丁与多种文本表示，以提高语义鲁棒性。广泛的实验表明，MVP-FAS 达到了优越的泛化性能，在跨域数据集上超越了先前的最先进方法。代码：https://github.com/Elune001/MVP-FAS.

Summary / 总结

The paper proposes MVP-FAS, a novel framework for face anti-spoofing that addresses limitations of existing CLIP-based models by incorporating Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). These modules use multiple paraphrased texts to generate generalized features and improve semantic robustness. Experiments show that MVP-FAS outperforms previous state-of-the-art methods on cross-domain datasets, demonstrating superior generalization performance.

论文提出了MVP-FAS框架，通过引入Multi-View Slot注意力（MVS）和Multi-Text Patch对齐（MTPA）模块，解决了现有基于CLIP的面部防伪模型的局限性。MVS利用多种文本从不同视角提取详细的和全局的特征，而MTPA则将patches与多种文本表示进行对齐。实验表明，MVP-FAS在跨域数据集上的泛化性能优于之前的方法。

DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving

Authors: Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, Eunjoo Jeon

First: 2025-09-01T03:13:50+00:00 · Latest: 2025-09-08T03:27:39+00:00

Comments: 10 pages, 9 figures. Preprint submitted to IEEE BigData 2025

Abs · PDF

Abstract

Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.

中文标题/摘要

标题：DSDE：基于KLD稳定性动态推测解码用于实际服务

推测解码加速了大型语言模型的推理，但在具有多样化请求的大批量服务环境中，其依赖于固定推测长度是次优的。本文探索了一种新的动态适应方向，通过研究一种新的后验诊断信号。我们提出了动态推测解码引擎（DSDE），这是一种无需训练的框架，主要由两个组成部分构成：（1）基于Kullback-Leibler（KLD）散度方差的预测信号，用于诊断生成的区域稳定性；（2）一种自适应推测长度上限，以缓解逐序列解码中的拖后腿问题。实验表明，使用KLD基稳定性信号进行动态适应具有潜力。由这些信号指导的算法在端到端延迟方面与领先基准相当，并且在各种工作负载下表现出更优越的鲁棒性。这种鲁棒性在低接受率的挑战性环境中尤为重要，所提出的信号在此类环境中仍保持其诊断作用。这些发现验证了后验信号作为构建更鲁棒和智能的LLM推理系统的重要组成部分的价值，并强调了未来研究动态推测长度适应的有希望的方向。

Summary / 总结

This paper addresses the limitations of speculative decoding in large-batch serving environments by proposing Dynamic Speculative Decoding Engine (DSDE), which uses a Kullback-Leibler (KLD) divergence-based predictive signal to dynamically adjust speculation length. Experiments show that DSDE achieves competitive end-to-end latency and superior robustness across various workloads, especially in low-acceptance-rate regimes, validating the use of post-hoc signals for LLM inference systems.

本文针对固定推测长度在多样请求环境中的大型语言模型推理中的局限性，提出了一种名为DSDE的无训练框架，利用KLD方差作为区域稳定性的预测信号，并采用自适应推测长度上限来处理滞后问题。实验表明，DSDE在各种工作负载下实现了竞争力的端到端延迟，并且在低接受率环境中表现出更优的鲁棒性，验证了后验信号在动态推测长度调整中的应用价值。
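To make the two components concrete, the sketch below computes a KLD-variance stability signal over a sliding window and applies a batch-level cap on speculation lengths. It is a toy illustration under loud assumptions, not the DSDE algorithm: the direction of the KL divergence, the window size, the mapping `1 / (1 + scale * variance)` from variance to length, and the mean-based cap are all hypothetical choices that the abstract does not specify.

```python
import math
from statistics import mean, pvariance
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def kld_variance(klds: Sequence[float], window: int = 8) -> float:
    """Population variance of the most recent per-token KLD values."""
    recent = list(klds[-window:])
    if len(recent) < 2:
        return 0.0
    return pvariance(recent)


def choose_speculation_length(
    klds: Sequence[float],
    min_len: int = 1,
    max_len: int = 8,
    scale: float = 1.0,
    window: int = 8,
) -> int:
    """Map low KLD variance (a stable region) to a longer speculation length.

    Assumption: stability = 1 / (1 + scale * variance), which lies in (0, 1].
    """
    stability = 1.0 / (1.0 + scale * kld_variance(klds, window))
    return min_len + round(stability * (max_len - min_len))


def apply_adaptive_cap(lengths: Sequence[int]) -> List[int]:
    """Clip per-sequence lengths to a batch-level cap to limit stragglers.

    Assumption: the cap is the ceiling of the batch mean.
    """
    if not lengths:
        return []
    cap = math.ceil(mean(lengths))
    return [min(length, cap) for length in lengths]
```

In this sketch, a sequence whose recent KLD values barely change is given the maximum speculation length, while one with strongly fluctuating values falls back toward the minimum. The cap then keeps a single long-speculating sequence from holding up the rest of the batch.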

Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding

Authors: Jiangnan Xie, Xiaolong Zheng, Liang Zheng

First: 2025-09-08T02:27:10+00:00 · Latest: 2025-09-08T02:27:10+00:00

Abs · PDF · Code1

Abstract

Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scenes (i.e., scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scenes (i.e., scenarios with both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components: First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes. Our code is available at https://github.com/plankXie/PAML.

中文标题/摘要

标题：面向原型的多模态对齐以实现开放词汇视觉定位

视觉定位(VG)旨在利用给定的自然语言查询在图像中定位特定目标物体。虽然当前基于变换器的方法在标准场景（即，没有新型物体的场景）中表现出强大的定位性能，但在开放词汇场景（即，测试时既有熟悉又有新型物体类别）中表现出明显的局限性。这些局限性主要源于三个关键因素：（1）视觉和语言模态之间的不完美对齐，（2）跨模态特征融合不足，以及（3）语义原型信息的无效利用。为克服这些挑战，我们提出了面向原型的多模态学习(PAML)框架，该框架通过几个关键组件系统地解决了这些问题：首先，我们利用ALBEF在初始特征编码期间建立稳健的跨模态对齐。随后，我们的视觉区分特征编码器选择性地增强显著物体表示并抑制无关的视觉上下文。该框架还引入了一种新颖的原型发现和继承机制，提取并聚合多邻域语义原型以促进开放词汇识别。这些丰富化的特征通过我们的多阶段解码器进行全面的多模态整合，最终进行边界框回归。在五个基准数据集上的广泛实验验证了我们的方法，在标准场景中表现出竞争力，在开放词汇场景中达到最先进的性能。我们的代码可在https://github.com/plankXie/PAML 获取。

Summary / 总结

The paper addresses the limitations of current transformer-based approaches in Visual Grounding (VG) for open-vocabulary scenes by proposing Prototype-Aware Multimodal Learning (PAML). PAML improves cross-modal alignment, enhances salient object representations, and incorporates a prototype discovering and inheriting mechanism. The method achieves competitive performance in standard scenes and state-of-the-art results in open-vocabulary scenes across five benchmark datasets.

论文通过提出 Prototype-Aware Multimodal Learning (PAML) 来解决当前基于变换器的方法在开放词汇场景下的视觉定位问题。PAML 使用 ALBEF 进行稳健的跨模态对齐，使用 Visual Discriminative Feature Encoder 来增强显著对象的表示，并引入一种新的原型发现和继承机制来促进开放词汇识别。该方法在标准场景中表现出竞争力，在五个基准数据集的开放词汇场景中达到了最先进的性能。

Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models

Authors: Yunqing Liu, Nan Zhang, Zhiming Tan

First: 2025-09-01T10:39:37+00:00 · Latest: 2025-09-08T02:22:16+00:00

Abs · PDF

Abstract

Effective specification-aware part retrieval within complex CAD assemblies is essential for automated design verification and downstream engineering tasks. However, directly applying LLMs/VLMs to this task presents some challenges: the input sequences may exceed model token limits, and even after processing, performance remains unsatisfactory. Moreover, fine-tuning LLMs/VLMs requires significant computational resources, and for many high-performing general-use proprietary models (e.g., GPT or Gemini), fine-tuning access is not available. In this paper, we propose a novel part retrieval framework that requires no extra training and instead uses Error Notebooks + RAG for refined prompt engineering to improve the retrieval performance of existing general-purpose models. The construction of Error Notebooks consists of two steps: (1) collecting historical erroneous CoTs and their incorrect answers, and (2) connecting these CoTs through reflective corrections until the correct solutions are obtained. As a result, the Error Notebooks serve as a repository of tasks along with their corrected CoTs and final answers. RAG is then employed to retrieve specification-relevant records from the Error Notebooks and incorporate them into the inference process. Another major contribution of our work is a human-in-the-loop CAD dataset, which is used to evaluate our method. In addition, the engineering value of our novel framework lies in its ability to effectively handle 3D models with lengthy, non-natural language metadata. Experiments with proprietary models, including GPT-4o and the Gemini series, show substantial gains, with GPT-4o (Omni) achieving up to a 23.4% absolute accuracy improvement on the human preference dataset. Moreover, ablation studies confirm that CoT reasoning provides benefits especially in challenging cases with higher part counts (>10).

中文标题/摘要

标题：基于错误笔记本的无需训练部件检索在3D CAD装配中的视觉-语言模型

在复杂CAD装配中实现有效的基于规范的部件检索对于自动化设计验证和下游工程任务至关重要。然而，直接使用LLM/VLM进行此任务存在一些挑战：输入序列可能超出模型的标记限制，即使经过处理，性能仍然不尽如人意。此外，微调LLM/VLM需要大量的计算资源，而对于许多高性能的通用专有模型（例如GPT或Gemini），微调访问权限不可用。在本文中，我们提出了一种无需额外训练的新颖部件检索框架，而是使用错误笔记本+RAG进行精细的提示工程，以帮助提高现有通用模型的检索性能。错误笔记本的构建分为两步：（1）收集历史错误的CoTs及其错误答案，（2）通过反思性修正将这些CoTs连接起来，直到获得正确的解决方案。结果，错误笔记本作为任务及其修正后的CoTs和最终答案的存储库。然后使用RAG从错误笔记本中检索与规范相关的记录，并将其纳入推理过程。我们工作的另一个重要贡献是包含人类在环中的CAD数据集，用于评估我们的方法。此外，我们新颖框架的工程价值在于其能够有效处理具有长且非自然语言元数据的3D模型。使用GPT-4o和Gemini系列等专有模型的实验显示了显著的改进，GPT-4o (Omni)在人类偏好数据集上实现了高达23.4%的绝对准确率提升。此外，消融研究证实，CoT推理在部件数量较高（>10）的更具挑战性的情况下尤其有益。

Summary / 总结

This paper addresses the challenge of part retrieval in complex CAD assemblies using vision-language models. It proposes a training-free framework that leverages Error Notebooks and Retrieval-Augmented Generation (RAG) for improved performance. Error Notebooks are constructed by collecting and correcting historical erroneous reasoning paths, serving as a repository of corrected solutions. RAG retrieves relevant records from these notebooks to enhance the model's inference. Experiments show significant improvements, with GPT-4o achieving up to 23.4% absolute accuracy improvement. The framework also demonstrates benefits in handling complex 3D models with lengthy metadata.

本文针对复杂CAD装配中的部件检索问题：直接使用LLM/VLM时，输入可能超出token限制，而微调又需要大量计算资源。为此提出了一种无需额外训练的框架，利用错误笔记本和检索增强生成（RAG）来提高性能。错误笔记本通过收集并修正历史上的错误推理链来构建，作为任务及其修正后推理链和最终答案的存储库。RAG从这些笔记本中检索相关记录，增强推理过程。在GPT-4o和Gemini系列等专有模型上的实验显示出显著提升，其中GPT-4o在人类偏好数据集上的绝对准确率最高提升23.4%；消融研究表明，思维链（CoT）推理在部件数量超过10个的复杂情况下尤其有益。
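The retrieval step over an Error Notebook can be sketched as below. This is a minimal illustration, not the authors' pipeline: the record fields follow the abstract's description (a task with its corrected CoT and final answer), but the names `NotebookRecord`, `retrieve`, and `build_prompt` are hypothetical, and Jaccard token overlap is a deliberately simple stand-in for whatever retriever the paper actually uses.

```python
import re
from dataclasses import dataclass
from typing import List, Sequence, Set


@dataclass(frozen=True)
class NotebookRecord:
    specification: str  # the task / part specification text
    corrected_cot: str  # reasoning chain after reflective correction
    final_answer: str   # the correct part identifier


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query: str, notebook: Sequence[NotebookRecord], k: int = 2) -> List[NotebookRecord]:
    """Return the k records whose specifications best overlap the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        notebook,
        key=lambda record: jaccard(query_tokens, tokenize(record.specification)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, records: Sequence[NotebookRecord]) -> str:
    """Prepend retrieved corrected examples to the new retrieval query."""
    parts = []
    for i, record in enumerate(records, start=1):
        parts.append(
            f"Example {i}\nSpecification: {record.specification}\n"
            f"Reasoning: {record.corrected_cot}\nAnswer: {record.final_answer}"
        )
    parts.append(f"New task\nSpecification: {query}\nReasoning:")
    return "\n\n".join(parts)
```

The resulting prompt would be sent to the general-purpose model unchanged, so no model weights are touched, which is what makes the approach training-free.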

A Novel Image Similarity Metric for Scene Composition Structure

Authors: Md Redwanul Haque, Manzur Murshed, Manoranjan Paul, Tsz-Kwan Lee

First: 2025-08-07T05:29:21+00:00 · Latest: 2025-09-08T01:12:51+00:00

Comments: 2025 IEEE ICIPW (Generative AI for World Simulations and Communications)

Abs · PDF · Code1

Abstract

The rapid advancement of generative AI models necessitates novel methods for evaluating image quality that extend beyond human perception. A critical concern for these models is the preservation of an image's underlying Scene Composition Structure (SCS), which defines the geometric relationships among objects and the background, their relative positions, sizes, orientations, etc. Maintaining SCS integrity is paramount for ensuring faithful and structurally accurate GenAI outputs. Traditional image similarity metrics often fall short in assessing SCS. Pixel-level approaches are overly sensitive to minor visual noise, while perception-based metrics prioritize human aesthetic appeal, neither adequately capturing structural fidelity. Furthermore, recent neural-network-based metrics introduce training overheads and potential generalization issues. We introduce the SCS Similarity Index Measure (SCSSIM), a novel, analytical, and training-free metric that quantifies SCS preservation by exploiting statistical measures derived from the Cuboidal hierarchical partitioning of images, robustly capturing non-object-based structural relationships. Our experiments demonstrate SCSSIM's high invariance to non-compositional distortions, accurately reflecting unchanged SCS. Conversely, it shows a strong monotonic decrease for compositional distortions, precisely indicating when SCS has been altered. Compared to existing metrics, SCSSIM exhibits superior properties for structural evaluation, making it an invaluable tool for developing and evaluating generative models, ensuring the integrity of scene composition. Code: https://github.com/RedwanPlague/scssim.

中文标题/摘要

标题：一种新的场景组成结构图像相似度度量

生成式AI模型的迅速发展需要新的方法来评估图像质量，超越人类感知。这些模型的一个关键关注点是保持图像的底层场景组成结构（SCS），定义了对象与背景之间的几何关系、相对位置、大小、方向等。保持SCS的完整性对于确保生成式AI输出的忠实性和结构准确性至关重要。传统的图像相似度度量往往在评估SCS方面表现不佳。像素级方法对细微视觉噪声过于敏感，而感知基度量则侧重于人类审美，均未能充分捕捉结构保真度。此外，最近的基于神经网络的度量引入了训练开销和潜在泛化问题。我们提出了场景组成结构相似度指数度量（SCSSIM），这是一种新颖的、分析性的、无需训练的度量，通过利用从图像立方体分层划分中导出的统计措施来量化SCS的保持情况，稳健地捕捉非对象基的结构关系。我们的实验表明，SCSSIM 对非组成性失真具有高度不变性，准确反映了未改变的SCS。相反，它对组成性失真表现出强烈的单调下降，精确地指示了SCS是否被改变。与现有度量相比，SCSSIM 在结构评估方面表现出更优越的特性，使其成为开发和评估生成模型的重要工具，确保场景组成的一致性。参见 https://github.com/RedwanPlague/scssim。

Summary / 总结

The paper introduces SCSSIM, a novel metric for evaluating the preservation of Scene Composition Structure (SCS) in images generated by generative AI models. Unlike traditional pixel-level or perception-based metrics, SCSSIM uses statistical measures from Cuboidal hierarchical partitioning to robustly capture non-object-based structural relationships. Experiments show that SCSSIM is highly invariant to non-compositional distortions and strongly indicates changes in SCS due to compositional distortions, making it a superior tool for structural evaluation in generative models.

论文提出了SCSSIM，这是一种用于评估生成AI模型生成图像中场景组成结构（SCS）保留的新颖度量方法。不同于传统的像素级或感知基度量，SCSSIM 使用立方体分层分区中的统计措施来稳健地捕捉非对象基的结构关系。实验表明，SCSSIM 对非组成性失真具有高度不变性，并且强烈表明由于组成性失真导致的SCS变化，使其成为结构评估的优越工具。

Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

Authors: Mohsen Gholami, Ahmad Rezaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

First: 2025-09-08T01:08:41+00:00 · Latest: 2025-09-08T01:08:41+00:00

Abs · PDF

Abstract

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents such as robots and self-driving cars typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini1.5-Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable performance gap between human-level scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding. To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, resulting in a 12% average improvement on multi-choice QA and a 56% average improvement on absolute distance estimation. Ego3D-VLM is modular and can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level spatial understanding in real-world, multi-view environments.

Summary / 总结

The research aims to improve the 3D spatial reasoning capabilities of Vision-Language Models (VLMs) by addressing their limitations in understanding real-world, multi-view environments. To achieve this, the authors created Ego3D-Bench, a new benchmark with over 8,600 QA pairs, and benchmarked 16 state-of-the-art VLMs. The results showed a significant performance gap between human and VLM performance in spatial reasoning tasks. To enhance VLMs, the authors proposed Ego3D-VLM, which improves 3D spatial reasoning by generating cognitive maps, leading to a 12% improvement in multi-choice QA and a 56% improvement in absolute distance estimation.

研究旨在通过解决视觉-语言模型在理解真实世界多视图环境中的3D空间推理能力的局限性，来提升其性能。为此，作者创建了包含超过8,600个问答对的新基准Ego3D-Bench，并对16个最先进的视觉-语言模型进行了基准测试。结果显示，人类在空间推理任务上的表现与视觉-语言模型之间存在显著差距。为了提升视觉-语言模型的空间推理能力，作者提出了Ego3D-VLM，通过生成认知地图来增强3D空间推理，从而在多项选择问答任务上提高了12%的准确率，在绝对距离估计任务上提高了56%。

Semantic Discrepancy-aware Detector for Image Forgery Identification

Authors: Ziye Wang, Minghang Yu, Chunyan Xu, Zhen Cui

First: 2025-08-17T12:11:09+00:00 · Latest: 2025-09-07T15:56:13+00:00

Comments: 10 pages, 5 figures

Abs · PDF · Code1

Abstract

With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned semantic concepts of pre-trained models are critical for identifying fake images. However, the misalignment between the forgery and semantic concept spaces hinders the model's forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision-language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery traces and semantic concepts. A concept-level forgery discrepancy learning module, built upon a visual reconstruction paradigm, is proposed to strengthen the interaction between visual semantic concepts and forgery traces, effectively capturing discrepancies under the concepts' guidance. Finally, the low-level forgery feature enhancer integrates the learned concept-level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods. The code is available at https://github.com/wzy1111111/SSD.

中文标题/摘要

标题：图像伪造识别的语义不一致性感知检测器

随着图像生成技术的迅速发展，稳健的伪造检测变得越来越重要，以确保数字媒体的可信度。近期研究表明，预训练模型学习到的语义概念对于识别假图像至关重要。然而，伪造与语义概念空间之间的不一致阻碍了模型的伪造检测性能。为了解决这一问题，我们提出了一种新颖的语义不一致性感知检测器（SDD），利用重建学习在细粒度视觉层面对两个空间进行对齐。通过利用预训练视觉语言模型中嵌入的概念知识，我们特别设计了一个语义标记采样模块，以减轻由与伪造痕迹和语义概念无关的特征引起的空间偏移。基于视觉重建范式的概念级伪造不一致性学习模块被提出，以加强视觉语义概念与伪造痕迹之间的交互，有效地在概念的指导下捕捉不一致性。最后，低级伪造特征增强器将学习到的概念级伪造不一致性整合起来，以最小化冗余的伪造信息。在两个标准图像伪造数据集上的实验表明，所提出的SDD具有优越的效果，优于现有方法。代码可在https://github.com/wzy1111111/SSD获取。

Summary / 总结

The research aims to improve forgery detection in digital media by addressing the misalignment between the forgery and semantic concept spaces. The proposed Semantic Discrepancy-aware Detector (SDD) uses reconstruction learning to align these spaces and a semantic token sampling module to mitigate irrelevant feature shifts. The concept-level forgery discrepancy learning module enhances the interaction between visual semantic concepts and forgery traces, leading to better forgery detection. Experiments show that SDD outperforms existing methods on standard image forgery datasets.

研究旨在通过解决伪造与语义概念空间之间的对齐问题，提高数字媒体中的伪造检测。提出的语义不一致性感知检测器（SDD）使用重构学习来对齐这些空间，并包括一个语义标记采样模块和一个概念级伪造不一致性学习模块。在标准数据集上的实验结果显示，SDD在识别图像伪造方面优于现有方法。

PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology

Authors: Yating Huang, Ziyan Huang, Lintao Xiang, Qijun Yang, Hujun Yin

First: 2025-09-07T15:42:38+00:00 · Latest: 2025-09-07T15:42:38+00:00

Comments: Accepted by EMNLP 2025

Abs · PDF

Abstract

Accurate analysis of pathological images is essential for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations in tissue images. Current vision-language (VL) models often struggle to capture the complex reasoning required for interpreting structured pathological reports. To address these limitations, we propose PathoHR-Bench, a novel benchmark designed to evaluate VL models' abilities in hierarchical semantic understanding and compositional reasoning within the pathology domain. Results of this benchmark reveal that existing VL models fail to effectively model intricate cross-modal relationships, hence limiting their applicability in clinical settings. To overcome this, we further introduce a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning. Experimental evaluations demonstrate that our approach achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, highlighting its effectiveness in fine-grained pathology representation.

中文标题/摘要

标题：PathoHR：病理学中的层次推理

准确分析病理图像对于自动化肿瘤诊断至关重要，但由于组织图像中存在高结构相似性和细微的形态学变化，这一任务仍然具有挑战性。当前的视觉-语言（VL）模型往往难以捕捉到解释结构化病理报告所需的复杂推理。为了解决这些限制，我们提出了PathoHR-Bench，这是一种新型基准，旨在评估VL模型在病理学领域中的层次语义理解和组合推理能力。该基准的结果表明，现有的VL模型无法有效地建模跨模态关系，从而限制了它们在临床环境中的应用。为克服这一问题，我们进一步引入了一种针对病理学的VL训练方案，该方案生成增强和扰动样本以进行多模态对比学习。实验评估表明，我们的方法在PathoHR-Bench和六个额外的病理数据集上达到了最先进的性能，突显了其在细粒度病理表示方面的有效性。

Analysis of Blood Report Images Using General Purpose Vision-Language Models

Authors: Nadia Bakhsheshi, Hamid Beigy

First: 2025-09-07T12:31:16+00:00 · Latest: 2025-09-07T12:31:16+00:00

Comments: 4 pages, 3 figures. This paper has been submitted to the IEEE-affiliated ICBME Conference (Iran), 2025, and is currently under review. DOR number: [20.1001.2.0425023682.1404.10.1.440.7]

Abs · PDF

Abstract

The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, measuring their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. Their ability to provide clear interpretations directly from images can improve health literacy and lower the barriers to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.

中文标题/摘要

标题：使用通用视觉-语言模型分析血液报告图像

可靠地分析血液报告对于健康知识至关重要，但个人往往难以解读，导致焦虑和忽视问题。我们探讨了通用视觉-语言模型（VLMs）的潜力，通过自动分析血液报告图像来应对这一挑战。我们对三种VLMs（Qwen-VL-Max、Gemini 2.5 Pro和Llama 4 Maverick）进行了比较评估，在包含100张不同血液报告图像的数据集上确定了它们的表现。每个模型都用适应每份血液报告的临床相关问题进行了提示。然后使用Sentence-BERT处理答案，以比较和评估模型的响应程度。研究结果表明，通用视觉-语言模型是一种实用且有前景的技术，可用于开发面向患者的工具，以进行初步血液报告分析。它们能够直接从图像中提供清晰的解释，可以提高健康素养并减少理解复杂医学信息的局限性。这项工作为未来开发可靠且可访问的AI辅助医疗应用奠定了基础。尽管结果令人鼓舞，但鉴于数据集规模有限，应谨慎解读。

Summary / 总结

This study investigates the use of general-purpose Vision-Language Models (VLMs) to analyze blood report images, aiming to improve patient understanding of medical information. Three models—Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick—were evaluated on 100 diverse blood report images. The models were prompted with clinically relevant questions, and their responses were compared using Sentence-BERT. The results indicate that VLMs can provide clear interpretations of blood reports, enhancing health literacy and reducing the complexity of medical information. However, the limited dataset size should be considered when interpreting the findings.

本研究探讨了通用视觉-语言模型（VLMs）分析血液报告图像的可能性，旨在提高患者对医疗信息的理解。三种模型（Qwen-VL-Max、Gemini 2.5 Pro 和 Llama 4 Maverick）在100张不同的血液报告图像上进行了评估。研究者针对每份报告向模型提出临床相关的问题，然后使用Sentence-BERT比较它们的回答。结果表明，VLMs能够清晰地解释血液报告，提高健康素养并降低理解复杂医学信息的难度。然而，由于数据集较小，应谨慎解释这些发现。

Cross-Modal Enhancement and Benchmark for UAV-based Open-Vocabulary Object Detection

Authors: Zhenhai Weng, Zhongliang Yu

First: 2025-09-07T10:59:02+00:00 · Latest: 2025-09-07T10:59:02+00:00

Abs · PDF

Abstract

Open-Vocabulary Object Detection (OVD) has emerged as a pivotal technology for applications involving Unmanned Aerial Vehicles (UAVs). However, the prevailing large-scale datasets for OVD pre-training are predominantly composed of ground-level, natural images. This creates a significant domain gap, causing models trained on them to exhibit a substantial drop in performance on UAV imagery. To address this limitation, we first propose a refined UAV-Label engine. Then we construct and introduce UAVDE-2M (containing over 2,000,000 instances and 1,800 categories) and UAVCAP-15k (containing over 15,000 images). Furthermore, we propose a novel Cross-Attention Gated Enhancement Fusion (CAGE) module and integrate it into the YOLO-World-v2 architecture. Finally, extensive experiments on the VisDrone and SIMD datasets verify the effectiveness of our proposed method for applications in UAV-based imagery and remote sensing.

中文标题/摘要

标题：基于无人机的开放式词汇目标检测的跨模态增强和基准

开放式词汇目标检测（OVD）已成为涉及无人机（UAV）应用的关键技术。然而，现有的大规模OVD预训练数据集主要由地面自然图像组成，这造成了显著的领域差距，导致在无人机图像上的性能大幅下降。为解决这一局限性，我们首先提出了一种改进的无人机标签引擎。然后构建并介绍了UAVDE-2M（包含超过200万实例和1800个类别）和UAVCAP-15k（包含超过15000张图像）。此外，我们提出了一种新的跨注意力门控增强融合（CAGE）模块，并将其集成到YOLO-World-v2架构中。最后，在VisDrone和SIMD数据集上的广泛实验验证了我们提出的方法在无人机图像和遥感应用中的有效性。

Summary / 总结

The paper addresses the domain gap between ground-level and UAV imagery in Open-Vocabulary Object Detection (OVD) by proposing a refined UAV-Label engine and constructing two datasets: UAVDE-2M (over 2,000,000 instances across 1,800 categories) and UAVCAP-15k (over 15,000 images). A novel Cross-Attention Gated Enhancement Fusion (CAGE) module is integrated into the YOLO-World-v2 architecture. Experiments on the VisDrone and SIMD datasets support the effectiveness of the proposed method for UAV-based imagery and remote sensing.

研究旨在通过解决地面和无人机图像之间的领域差距，提高无人机（UAV）图像的开放词汇目标检测（OVD）。作者提出了一种改进的UAV-Label引擎，并创建了包含超过200万实例和1800类别的UAVDE-2M和超过15000张图像的UAVCAP-15k数据集。他们还引入了一个名为Cross-Attention Gated Enhancement Fusion（CAGE）的模块，并将其集成到YOLO-World-v2架构中。在VisDrone和SIMD数据集上的实验验证了所提出方法在无人机图像和遥感应用中的有效性。

ADIR: Adaptive Diffusion for Image Reconstruction

Authors: Shady Abu-Hussein, Tom Tirer, Raja Giryes

Venue: BMVC 2025

First: 2022-12-06T18:39:58+00:00 · Latest: 2025-09-07T10:42:42+00:00

Comments: Project page https://shadyabh.github.io/ADIR/

Abstract

Denoising diffusion models have recently achieved remarkable success in image generation, capturing rich information about natural image statistics. This makes them highly promising for image reconstruction, where the goal is to recover a clean image from a degraded observation. In this work, we introduce a conditional sampling framework that leverages the powerful priors learned by diffusion models while enforcing consistency with the available measurements. To further adapt pre-trained diffusion models to the specific degradation at hand, we propose a novel fine-tuning strategy. In particular, we employ LoRA-based adaptation using images that are semantically and visually similar to the degraded input, efficiently retrieved from a large and diverse dataset via an off-the-shelf vision-language model. We evaluate our approach on two leading publicly available diffusion models, Stable Diffusion and Guided Diffusion, and demonstrate that our method, termed Adaptive Diffusion for Image Reconstruction (ADIR), yields substantial improvements across a range of image reconstruction tasks.

中文标题/摘要

标题：ADIR：自适应扩散用于图像重建

去噪扩散模型近来在图像生成方面取得了显著的成功，能够捕捉自然图像统计的丰富信息。这使它们在图像重建方面极具前景，其目标是从退化的观测中恢复干净的图像。在本文中，我们提出了一种条件采样框架，利用扩散模型学习到的强大先验，同时确保与可用测量数据的一致性。为了使预训练的扩散模型进一步适应当前的特定退化情况，我们提出了一种新的微调策略。具体而言，我们使用与退化输入在语义和视觉上相似的图像进行基于LoRA的适配，这些图像通过一个现成的视觉语言模型从一个大型且多样化的数据集中高效检索得到。我们在两个领先的公开可用扩散模型（Stable Diffusion和Guided Diffusion）上评估了我们的方法，并表明该方法（称为自适应扩散图像重建，ADIR）在一系列图像重建任务中取得了显著的改进。

Summary / 总结

The research aims to improve image reconstruction, the recovery of a clean image from a degraded observation, by leveraging the natural-image priors learned by denoising diffusion models. The method introduces a conditional sampling framework that enforces consistency with the available measurements, and it fine-tunes the pre-trained diffusion model with LoRA on images that are semantically and visually similar to the degraded input, retrieved from a large dataset by an off-the-shelf vision-language model. Experiments on Stable Diffusion and Guided Diffusion show that ADIR substantially improves results across a range of image reconstruction tasks.

研究旨在通过利用能够捕捉自然图像统计特性的去噪扩散模型来提升图像重建效果。方法引入了一种条件采样框架，结合预训练的扩散模型，并通过一种新颖的基于LoRA的策略，使用与降级输入在语义和视觉上相似的图像进行微调。实验表明，ADIR在多种图像重建任务中显著提高了重建质量。

Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance

Authors: Mohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold, Tal Arbel

First: 2025-09-07T08:52:18+00:00 · Latest: 2025-09-07T08:52:18+00:00

Abstract

Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the impressive performance of these models in 2D is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained foundation models do not exist for 3D, significantly limiting progress in this domain. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language descriptions remains completely unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression scenarios, and enhanced medical training by visualizing hypothetical medical conditions in realistic detail. Our work takes a meaningful step toward addressing this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this represents the first demonstration of a language-guided native-3D diffusion model applied specifically to neurological imaging data, where faithful three-dimensional modeling is essential to represent the brain's three-dimensional structure. Through results on two distinct neurological MRI datasets, our framework successfully simulates varying counterfactual lesion loads in Multiple Sclerosis (MS) and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity in synthetically generated medical images. Our results lay the groundwork for prompt-driven disease progression analysis within 3D medical imaging.

中文标题/摘要

标题：想象替代方案：通过语言指导生成高分辨率3D反事实医学图像

视觉-语言模型在各种条件下生成2D图像方面展现了令人印象深刻的性能；然而，这些模型在2D方面的出色表现很大程度上得益于广泛且易于获取的预训练基础模型。关键在于，3D领域缺乏类似的预训练基础模型，极大地限制了该领域的进展。因此，视觉-语言模型仅凭自然语言描述生成高分辨率3D反事实医学图像的潜力尚未被探索。解决这一差距将使临床和研究应用成为可能，例如个性化反事实解释、疾病进展情景模拟以及通过可视化假设的医学状况进行增强医学培训。我们的工作朝着解决这一挑战迈出了有意义的一步，通过引入一种框架，该框架能够根据自由形式的语言提示生成合成患者的高分辨率3D反事实医学图像。我们采用最先进的3D扩散模型，并结合Simple Diffusion的改进，引入增强条件以提高文本对齐和图像质量。据我们所知，这是首次将语言指导的原生3D扩散模型应用于神经影像数据的演示，其中准确的三维建模对于表示大脑的三维结构至关重要。通过两个不同的神经影像MRI数据集的结果，我们的框架成功模拟了多发性硬化症（MS）中不同的反事实病灶负荷以及阿尔茨海默病中不同的认知状态，生成高质量图像，同时在合成医学图像中保持受试者保真度。我们的结果为3D医学影像中的提示驱动疾病进展分析奠定了基础。

Summary / 总结

The research aims to generate high-resolution 3D counterfactual medical images using language guidance, addressing the lack of pretrained foundation models for 3D comparable to those available for 2D. The method adapts state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and adds augmented conditioning to improve text alignment and image quality. On two neurological MRI datasets, the framework simulates varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer's disease, generating high-quality images that preserve subject fidelity and laying groundwork for prompt-driven disease progression analysis in 3D medical imaging.

该论文旨在解决使用语言指导生成高分辨率3D反事实医学图像的问题，这对于临床和研究应用至关重要。作者引入了一个框架，使用增强的3D扩散模型和增强的条件处理来根据自然语言描述生成详细的3D图像。该框架成功模拟了多发性硬化症中的不同病灶负荷和阿尔茨海默病中的认知状态，生成了高质量的图像，同时保持了合成医学图像中的主体保真度。

Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model

Authors: Wooseok Shin, Jisu Kang, Hyeonki Jeong, Jin Sob Kim, Sung Won Han

First: 2025-07-04T05:12:37+00:00 · Latest: 2025-09-07T07:45:25+00:00

Comments: Accepted for publication in Knowledge-Based Systems

Abstract

In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications. The code is available at https://github.com/wooseok-shin/SemiOVS

中文标题/摘要

标题：利用分布外未标注图像：开放词汇模型下的半监督语义分割

在半监督语义分割中，现有研究在学术环境中展示了控制基准数据集划分后的有希望的结果。然而，利用大量未标注图像的潜在益处尚未被探索。在现实场景中，大量未标注图像通常可以从在线来源（网页抓取图像）或大规模数据集中获得。然而，这些图像可能与目标数据集的分布不同，这种情况称为分布外（OOD）。使用这些图像作为半监督学习中的未标注数据可能导致不准确的伪标签，可能误导网络训练。在本文中，我们提出了一种新的半监督语义分割框架（SemiOVS），结合开放词汇分割模型（OVS）有效利用分布外未标注图像。在Pascal VOC和Context数据集上的广泛实验表明两个关键发现：（1）使用额外的未标注图像可以提高在少量标签场景中半监督学习者的性能；（2）使用开放词汇分割模型（OVS）对分布外图像进行伪标签可以带来显著的性能提升。特别是，SemiOVS在Pascal VOC 92标签设置中分别比PrevMatch和SemiVL方法高出+3.5和+3.0 mIoU，达到最先进的性能。这些发现表明，我们的方法有效地利用了大量分布外未标注图像进行语义分割任务。我们希望这项工作能激发未来的研究和实际应用。代码可在https://github.com/wooseok-shin/SemiOVS获取。

Summary / 总结

This paper proposes a semi-supervised semantic segmentation framework (SemiOVS) that leverages out-of-distribution (OOD) unlabeled images using an open-vocabulary segmentation model. Experiments on the Pascal VOC and Context datasets show that using additional unlabeled images improves performance in scenarios with few labels, and that pseudo-labeling OOD images with the open-vocabulary model leads to substantial gains: SemiOVS outperforms PrevMatch and SemiVL by 3.5 and 3.0 mIoU, respectively, on Pascal VOC with a 92-label setting.

本文提出了一种半监督语义分割框架（SemiOVS），借助开放词汇分割模型来利用分布外（OOD）未标注图像。在Pascal VOC和Context数据集上的实验表明，在少量标签的情况下使用额外的未标注图像可以提高性能，并且用开放词汇模型为分布外图像生成伪标签能够带来显著的性能提升：在Pascal VOC 92标签设置下，SemiOVS分别比现有方法PrevMatch和SemiVL高出3.5和3.0 mIoU。
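
For readers unfamiliar with the metric behind the 3.5 and 3.0 point gains above, the following is a minimal sketch of mean Intersection-over-Union (mIoU) on flattened label maps. It is not taken from the SemiOVS code base; the function name `mean_iou`, the `ignore_index=255` convention, and the toy labels are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in the prediction or the target.

    `pred` and `target` are flat sequences of integer class ids.
    Predictions are assumed to lie in [0, num_classes); target pixels
    equal to `ignore_index` are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Pixel is in both the predicted and the true region of class t.
            intersection[t] += 1
            union[t] += 1
        else:
            # Pixel extends the union of the predicted class and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
toy_pred = [0, 0, 1, 1, 2, 2]
toy_target = [0, 1, 1, 1, 2, 0]
print(mean_iou(toy_pred, toy_target, num_classes=3))
```

On the toy input the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Segmentation benchmarks typically report the value as a percentage with intersections and unions accumulated over the whole validation set, so a gain of 3.5 mIoU corresponds to 3.5 percentage points.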

A Survey on Training-free Alignment of Large Language Models

Authors: Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian

Venue: EMNLP 2025

First: 2025-08-12T15:30:44+00:00 · Latest: 2025-09-07T02:11:17+00:00

Comments: Accepted to EMNLP 2025 (findings), camera-ready version

Abstract

The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques, which leverage in-context learning, decoding-time adjustments, and post-generation corrections, offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.

中文标题/摘要

标题：大型语言模型无训练对齐综述

大型语言模型（LLMs）的对齐旨在确保其输出符合人类价值观、道德标准和法律规范。传统的对齐方法通常依赖于资源密集型微调（FT），这可能会导致知识退化，并在模型访问受限或计算资源有限的情况下面临挑战。相比之下，无训练（TF）对齐技术——利用上下文学习、解码时调整和生成后修正——通过无需大量重新训练LLMs来实现对齐，使其能够适应开源和封闭源环境。本文首次系统地回顾了TF对齐方法，按解码前、解码中和解码后阶段进行分类。对于每个阶段，我们从LLMs和多模态LLMs（MLLMs）的角度进行了详细的分析，突出了其机制和局限性。此外，我们还指出了关键挑战和未来方向，为更包容和有效的TF对齐技术铺平了道路。通过综合和组织快速增长的研究文献，本文为从业者提供了指导，并促进了更安全和更可靠的LLMs的发展。

Summary / 总结

The paper aims to address the challenges of aligning large language models (LLMs) with human values and ethical standards without the need for resource-intensive fine-tuning. It reviews training-free (TF) alignment methods that use in-context learning, decoding-time adjustments, and post-generation corrections. The study categorizes these methods into pre-decoding, in-decoding, and post-decoding stages and highlights their mechanisms and limitations, identifying key challenges and future directions for more effective TF alignment techniques.

论文旨在解决在无需进行资源密集型微调的情况下，将大型语言模型（LLMs）与人类价值观和伦理标准对齐的问题。它回顾了使用上下文学习、解码时调整和后生成修正的无训练（TF）对齐方法，并将这些方法分为解码前、解码中和解码后三个阶段，详细分析了它们的工作机制和局限性，并指出了未来研究的关键挑战。

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

Authors: Shaina Raza, Aravind Narayanan, Vahid Reza Khazaie, Ashmal Vayani, Mukund S. Chettiar, Amandeep Singh, Mubarak Shah, Deval Pandya

First: 2025-05-16T17:09:44+00:00 · Latest: 2025-09-06T21:27:33+00:00

Abstract

Large multimodal models (LMMs) have been widely tested on tasks like visual question answering (VQA), image captioning, and grounding, but lack rigorous evaluation for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a novel benchmark of 32,000 real-world image-question pairs and an evaluation suite. Labels are generated via an AI-assisted pipeline and validated by experts. HumaniBench assesses LMMs across seven key alignment principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality, through diverse open-ended and closed-ended VQA tasks. Grounded in AI ethics and real-world needs, these principles provide a holistic lens for societal impact. Benchmarking results on different LMMs show that proprietary models generally lead in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Most models struggle to balance accuracy with ethical and inclusive behavior. Techniques like Chain-of-Thought prompting and test-time scaling improve alignment. As the first benchmark tailored for HC alignment, HumaniBench offers a rigorous testbed to diagnose limitations and promote responsible LMM development. All data and code are publicly available for reproducibility. Keywords: HumaniBench, vision-language models, responsible AI benchmark, AI alignment evaluation, AI ethics assessment, fairness in AI models, visual question answering (VQA) benchmark, image captioning evaluation, visual grounding tasks, trustworthy AI models, Chain-of-Thought prompting, test-time scaling, ethical AI development tools.

中文标题/摘要

标题：HumaniBench：一种以人为本的大规模多模态模型评估框架

大规模多模态模型（LMMs）已在视觉问答（VQA）、图像描述和视觉定位等任务中得到广泛测试，但缺乏对公平性、伦理性和包容性等以人为本（HC）价值观的严格评估。为解决这一问题，我们引入了HumaniBench，一个包含32,000个真实图像-问题对的新基准以及评估套件。标签通过AI辅助的流程生成并由专家验证。HumaniBench 通过多样化的开放性和封闭性视觉问答任务，从公平性、伦理学、同理心、包容性、推理、稳健性和多语言性七个关键对齐原则评估LMMs。这些原则基于AI伦理和现实需求，为社会影响提供了一个全面的视角。不同LMMs的基准测试结果显示，专有模型通常在推理、公平性和多语言性方面表现更佳，而开源模型在稳健性和视觉定位方面表现更优。大多数模型难以在准确性与伦理和包容性行为之间取得平衡。通过链式思考提示和测试时缩放等技术可以提高对齐性。作为首个针对以人为本对齐的基准，HumaniBench 提供了一个严格的测试平台，用于诊断局限性并促进负责任的大规模多模态模型开发。所有数据和代码均已公开，以确保可再现性。

Summary / 总结

HumaniBench is a new benchmark for evaluating large multimodal models (LMMs) on human-centered values such as fairness, ethics, and inclusivity. It consists of 32,000 real-world image-question pairs and assesses models across seven principles: fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality. The evaluation shows that proprietary models perform better in reasoning, fairness, and multilinguality, while open-source models excel in robustness and grounding. Techniques like Chain-of-Thought prompting and test-time scaling improve alignment with human values. HumaniBench provides a rigorous framework for diagnosing and promoting responsible LMM development.

HumaniBench 是一个用于评估大型多模态模型（LMM）在公平性、伦理和包容性等人类中心（HC）价值方面的基准。它包含 32,000 个真实世界的图像-问题对和一个评估套件，评估 LMM 在七个原则上的表现。结果显示，专有模型在推理、公平性和多语言性方面表现更好，而开源模型在鲁棒性和定位方面表现更佳。通过链式思考提示和测试时缩放等技术可以提高对齐。HumaniBench 提供了一个严格的框架，用于诊断限制并促进负责任的 LMM 开发。

VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models

Authors: Jen-tse Huang, Jiantong Qin, Jianping Zhang, Youliang Yuan, Wenxuan Wang, Jieyu Zhao

Venue: EMNLP 2025

First: 2025-03-10T17:42:30+00:00 · Latest: 2025-09-06T20:25:10+00:00

Comments: Accepted to EMNLP 2025 (Main)

Abstract

This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., "What is the education level of the person in the image?") (2) Yes-No comparisons using two images (e.g., "Is the person in the first image more educated than the person in the second image?") For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at https://github.com/uscnlp-lime/VisBias.

中文标题/摘要

标题：VisBias：视觉语言模型中显性和隐性社会偏见的测量

这项研究探讨了视觉语言模型（VLMs）中显性和隐性社会偏见的表现。这些偏见类型的关键区别在于意识水平：显性偏见指的是有意识的、故意的偏见，而隐性偏见则在潜意识中起作用。为了分析显性偏见，我们直接向VLMs提出关于性别和种族差异的问题：(1) 基于给定图像的多项选择题（例如，“图像中的人的教育水平是什么？”）(2) 使用两张图像的“是/否”比较（例如，“第一张图像中的人比第二张图像中的人更受教育吗？”）对于隐性偏见，我们设计了任务，让VLMs在帮助用户时通过其回答揭示偏见：(1) 图像描述任务：模型被要求描述图像中的个体，我们分析不同人口群体在文本提示方面的差异。(2) 表单填写任务：模型草拟一份包含20个属性的个人信息收集表，我们检查所选属性之间的相关性以寻找潜在的偏见。我们评估了Gemini-1.5、GPT-4V、GPT-4o、LLaMA-3.2-Vision和LLaVA-v1.6。我们的代码和数据可在https://github.com/uscnlp-lime/VisBias上公开获取。