arXiv 论文速递

DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

Authors: Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang

Venue: EMNLP 2025

First: 2025-06-12T03:13:21+00:00 · Latest: 2025-09-05T17:21:02+00:00

Comments: EMNLP 2025 Main Conference

Abstract

Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.

中文标题/摘要

标题：DiMo-GUI：通过模态意识视觉推理提升GUI语义定位的测试时扩展

将自然语言查询与图形用户界面（GUI）关联存在独特挑战，由于视觉元素的多样性、空间杂乱以及语言的模糊性。本文介绍了一种无需训练的DiMo-GUI框架，该框架采用两种核心策略：动态视觉定位和模态意识优化。我们的方法不将GUI视为单一图像，而是将输入分为文本元素和图示元素，使模型能够使用通用的视觉语言模型独立地对每个模态进行推理。当预测结果模糊或错误时，DiMo-GUI动态聚焦注意力，生成以模型初始预测为中心的候选焦点区域，并逐步放大子区域以细化定位结果。这种分层细化过程有助于在无需额外训练或注释的情况下澄清视觉拥挤的布局。我们在标准GUI语义定位基准上评估了该方法，并展示了相对于基线推理管道的一致改进，突显了结合模态分离与区域聚焦推理的有效性。
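
The zoom-in refinement described in the abstract can be sketched as a loop that repeatedly crops a smaller focal region around the current prediction. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict` is a hypothetical callable standing in for a vision-language model that returns a click point in full-image coordinates for a given region, and the fixed `rounds` and `shrink` parameters are invented for the example. The modality separation step (text versus icon elements) is not modeled here.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y) in full-image coordinates


def focal_region(center: Point, size: Tuple[float, float], bounds: Box) -> Box:
    """Return a window of `size` centered on `center`, shifted to stay inside `bounds`."""
    width, height = size
    left = min(max(center[0] - width / 2, bounds[0]), bounds[2] - width)
    top = min(max(center[1] - height / 2, bounds[1]), bounds[3] - height)
    return (left, top, left + width, top + height)


def zoom_in_grounding(
    predict: Callable[[Box], Point],  # hypothetical model wrapper (assumption)
    image_box: Box,
    rounds: int = 3,      # assumed number of zoom steps
    shrink: float = 0.5,  # assumed per-step scale factor
) -> Tuple[Point, Box]:
    """Refine a grounding prediction by zooming into regions around it."""
    region = image_box
    point = predict(region)  # initial prediction on the full screenshot
    for _ in range(rounds):
        width = (region[2] - region[0]) * shrink
        height = (region[3] - region[1]) * shrink
        # The next focal region is centered on the latest prediction.
        region = focal_region(point, (width, height), region)
        point = predict(region)
    return point, region
```

In practice `predict` would crop the screenshot to the region, query the model, and map the answer back to full-image coordinates; a stopping rule based on prediction stability could replace the fixed number of rounds.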

ArtRAG: Retrieval-Augmented Generation with Structured Context for Visual Art Understanding

Authors: Shuai Wang, Ivona Najdenkoska, Hongyi Zhu, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

First: 2025-05-09T13:08:27+00:00 · Latest: 2025-09-05T16:04:23+00:00

Abstract

Understanding visual art requires reasoning across multiple perspectives -- cultural, historical, and stylistic -- beyond mere object recognition. While recent multimodal large language models (MLLMs) perform well on general image captioning, they often fail to capture the nuanced interpretations that fine art demands. We propose ArtRAG, a novel, training-free framework that combines structured knowledge with retrieval-augmented generation (RAG) for multi-perspective artwork explanation. ArtRAG automatically constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources, organizing entities such as artists, movements, themes, and historical events into a rich, interpretable graph. At inference time, a multi-granular structured retriever selects semantically and topologically relevant subgraphs to guide generation. This enables MLLMs to produce contextually grounded, culturally informed art descriptions. Experiments on the SemArt and Artpedia datasets show that ArtRAG outperforms several heavily trained baselines. Human evaluations further confirm that ArtRAG generates coherent, insightful, and culturally enriched interpretations.

中文标题/摘要

标题：ArtRAG：结构化背景增强生成以理解视觉艺术

理解视觉艺术需要在文化、历史和风格等多个视角上进行推理，而不仅仅是对象识别。尽管最近的多模态大型语言模型（MLLMs）在通用图像描述方面表现良好，但它们往往无法捕捉到美术作品所需的细微解释。我们提出ArtRAG，这是一种新颖的无需训练的框架，结合了结构化知识与检索增强生成（RAG），用于多视角艺术品解释。ArtRAG 从领域特定的文本来源中自动构建艺术背景知识图谱（ACKG），将艺术家、艺术流派、主题和历史事件组织成一个丰富且可解释的图。在推理时，多粒度结构化检索器选择语义和拓扑上相关的子图来引导生成。这使MLLMs 能够生成上下文相关、文化背景丰富的艺术描述。在SemArt 和 Artpedia 数据集上的实验表明，ArtRAG 优于多个经过大量训练的基线。进一步的人类评估也证实，ArtRAG 生成了连贯、有洞察力且文化丰富的解释。

Summary / 总结

ArtRAG is a framework that combines structured knowledge with retrieval-augmented generation to provide multi-perspective explanations of visual art. It constructs an Art Context Knowledge Graph (ACKG) from domain-specific textual sources to guide the generation of culturally informed descriptions. Experiments show that ArtRAG outperforms several heavily trained baselines and generates coherent, insightful interpretations of artworks.

ArtRAG 是一个结合结构化知识与检索增强生成的框架，用于提供视觉艺术的多视角解释。它从领域特定的文本来源构建艺术上下文知识图谱（ACKG），以指导生成文化背景下的描述。实验表明，ArtRAG 在多个基准模型中表现更优，并生成了连贯且富有洞察力的艺术解读。

GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization

Authors: Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Yixuan Li

First: 2025-09-04T15:52:04+00:00 · Latest: 2025-09-05T15:02:49+00:00

Abstract

Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, the first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected thousands of voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.

中文标题/摘要

标题：GeoArena：一个用于评估全球图像地理定位的大规模视觉语言模型的开放平台

图像地理定位旨在预测地球上任何地方拍摄的图像的地理位置，但其全球性质带来了重大挑战。当前的评估方法存在两个主要局限性。首先，数据泄露：先进的方法通常依赖大规模视觉语言模型（LVLMs）来预测图像位置，但这些模型经常在测试数据集上进行预训练，这会损害评估模型实际地理定位能力的准确性。其次，现有的评估指标主要依赖于精确的地理坐标来评估预测结果，这不仅忽视了推理过程，还当需要用户级别的位置数据时引发了隐私问题。为了解决这些问题，我们提出了GeoArena，这是一个首个用于评估大规模视觉语言模型在世界范围图像地理定位任务上的开放平台，提供真实的野外和以人为本的基准测试。GeoArena 允许用户上传野外图像以获得更多样化的评估语料，并利用成对的人类判断来确定哪个模型输出更符合人类期望。该平台已在线部署两个月，期间我们收集了数千条投票记录。基于这些数据，我们进行了详细分析，并建立了不同大规模视觉语言模型在图像地理定位任务上的排行榜。

Summary / 总结

GeoArena is an open platform designed to benchmark large vision-language models (LVLMs) on global image geolocalization tasks. It addresses the limitations of current evaluation methods by avoiding data leakage and using human judgments to assess model outputs. The platform has collected thousands of voting records, leading to a detailed analysis and a leaderboard of different LVLMs on image geolocalization.

GeoArena 是一个开放平台，旨在评估大型视觉语言模型在环球图像地理定位任务中的表现。它通过避免数据泄露并使用人类判断来评估模型预测，解决了现有评估方法的局限性。该平台在两个月内收集了数千条投票记录，进行了详细分析并建立了不同模型在图像地理定位任务中的排行榜。
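
One common way to turn pairwise human votes into a leaderboard is an Elo-style rating update. The abstract does not say which rating scheme GeoArena uses, so the sketch below is an assumption chosen for illustration; the model names, the `k` factor, and the base rating are all hypothetical, and ties are not handled.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def elo_leaderboard(
    votes: Iterable[Tuple[str, str]],  # (winner, loser) pairs from human judgments
    k: float = 32.0,                   # assumed update step size
    base: float = 1000.0,              # assumed starting rating
) -> List[Tuple[str, float]]:
    """Rank models from pairwise votes using sequential Elo updates."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes among three placeholder models.
board = elo_leaderboard([("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")])
```

Sequential Elo depends on vote order; a Bradley-Terry fit over all votes is a frequently used order-independent alternative.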

VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation

Authors: Julia Dietlmeier, Oluwabukola Grace Adegboro, Vayangi Ganepola, Claudia Mazo, Noel E. O'Connor

First: 2025-09-05T14:48:19+00:00 · Latest: 2025-09-05T14:48:19+00:00

Comments: Medical Imaging with Deep Learning (MIDL 2025) short paper

Abstract

Vision-language models and their adaptations to image segmentation tasks present enormous potential for producing highly accurate and interpretable results. However, implementations based on CLIP and BiomedCLIP are still lagging behind more sophisticated architectures such as CRIS. In this work, instead of focusing on text prompt engineering as is the norm, we attempt to narrow this gap by showing how to ensemble vision-language segmentation models (VLSMs) with a low-complexity CNN. By doing so, we achieve a significant Dice score improvement of 6.3% on the BKAI polyp dataset using the ensembled BiomedCLIPSeg, while other datasets exhibit gains ranging from 1% to 6%. Furthermore, we provide initial results on four additional radiology and non-radiology datasets. We conclude that ensembling works differently across these datasets (from outperforming to underperforming the CRIS model), indicating a topic for future investigation by the community. The code is available at https://github.com/juliadietlmeier/VLSM-Ensemble.

中文标题/摘要

标题：VLSM-Ensemble：基于CLIP的视觉语言模型集成以增强医学图像分割

视觉语言模型及其在图像分割任务中的适应性具有巨大的潜力，能够产生高度准确且可解释的结果。然而，基于CLIP和BiomedCLIP的实现仍然落后于更复杂的架构如CRIS。在本研究中，我们没有专注于文本提示工程，而是通过展示如何使用低复杂度CNN集成视觉语言分割模型（VLSMs）来缩小这一差距。通过这种方式，我们在BKAI息肉数据集上使用集成的BiomedCLIPSeg实现了显著的Dice分数提高6.3%，而在其他数据集上则表现出1%到6%的增益。此外，我们还提供了四个放射学和非放射学数据集的初步结果。我们得出结论，集成在这些数据集上的效果不同（从超越CRIS模型到不如CRIS模型），这表明这是一个值得社区进一步研究的主题。代码可在https://github.com/juliadietlmeier/VLSM-Ensemble获取。

Summary / 总结

This study aims to enhance medical image segmentation by ensembling CLIP-based vision-language models with low-complexity CNNs, addressing the limitations of existing models. The authors achieved a 6.3% improvement in Dice score on the BKAI polyp dataset using ensembled BiomedCLIPSeg, with other datasets showing gains ranging from 1% to 6%. The results suggest that ensembling works differently across various datasets, indicating further research potential.

该研究旨在通过视觉-语言模型的集成来提升医学图像分割。作者采用低复杂度CNN与CLIP基模型进行集成，BKAI息肉数据集上实现了6.3%的Dice分数提升。该方法在其他数据集上也表现出1%到6%的改进，不同数据集上的性能差异表明未来的研究方向。代码可在GitHub上获得。
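
For readers unfamiliar with the metric, the Dice score compares a predicted mask P with a ground-truth mask T as 2|P∩T| / (|P| + |T|). The sketch below computes it on flat binary masks and shows one simple way two models could be combined, by averaging per-pixel probabilities before thresholding. The averaging rule is an assumption made for illustration; the abstract does not state how the VLSM and the CNN are fused, and the linked repository is the place to check the actual method.

```python
from typing import List, Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient for two flat binary masks of equal length."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ensemble_masks(prob_maps: Sequence[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Average per-pixel probabilities across models, then threshold (assumed fusion rule)."""
    count = len(prob_maps)
    mean_probs = [sum(pixel) / count for pixel in zip(*prob_maps)]
    return [1 if p >= threshold else 0 for p in mean_probs]


# Toy four-pixel example with made-up probabilities.
vlsm_probs = [0.9, 0.4, 0.2, 0.7]
cnn_probs = [0.8, 0.7, 0.1, 0.2]
target = [1, 1, 0, 0]
fused = ensemble_masks([vlsm_probs, cnn_probs])
```

In this toy case each model alone makes errors on different pixels, and the averaged mask recovers the target; real gains depend on how complementary the two models are, which matches the abstract's observation that results vary by dataset.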

SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing

Authors: Chaolei Wang, Yang Luo, Jing Du, Siyu Chen, Yiping Chen, Ting Han

First: 2025-09-05T14:37:31+00:00 · Latest: 2025-09-05T14:37:31+00:00

Abstract

Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D-to-3D lifting approaches struggles to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic mask for high-fidelity 3D instance segmentation (SGS-3D), a novel "split-then-grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available in the supplementary materials.

中文标题/摘要

标题：SGS-3D：通过可靠的语义分割和生长实现高保真3D实例分割

准确的3D实例分割对于3D视觉领域高质量场景理解至关重要。然而，基于2D到3D提升的方法在提升过程中由于语义指导模糊和深度约束不足而引入的累积误差，难以产生精确的实例级分割。为应对这些挑战，我们提出了一种新的“分割-然后生长”框架SGS-3D，该框架首先使用几何原语净化和分割模糊的提升掩码，然后在场景中将其生长为完整的实例。与现有直接依赖原始提升掩码并牺牲分割精度的方法不同，SGS-3D作为一种无需训练的细化方法，联合融合语义和几何信息，使两个表示层次之间能够有效合作。具体而言，对于语义指导，我们引入了一种掩码过滤策略，利用3D几何原语的共现性来识别并移除模糊的掩码，从而确保与3D对象实例更可靠的语义一致性。对于几何细化，我们通过利用空间连续性和高层特征构建精细的物体实例，特别是在不同物体之间语义模糊的情况下。在ScanNet200、ScanNet++和KITTI-360上的实验结果表明，SGS-3D显著提高了分割精度，并且在对抗预训练模型产生的不准确掩码时具有更高的鲁棒性，同时在多种室内外环境中保持了强大的泛化能力。代码可在附录材料中获取。

Summary / 总结

SGS-3D proposes a novel 'split-then-grow' framework for high-fidelity 3D instance segmentation, addressing the limitations of 2D-to-3D lifting approaches by purifying and splitting ambiguous masks using geometric primitives and then growing them into complete instances. Experimental results on ScanNet200, ScanNet++, and KITTI-360 show that SGS-3D significantly improves segmentation accuracy and robustness, producing high-fidelity object instances with strong generalization across various environments.

SGS-3D 提出了一种新颖的 '分割-然后生长' 框架，用于高保真 3D 实例分割，通过几何原语净化和分割模糊的掩码，然后在场景中生长为完整的实例。实验结果表明，SGS-3D 显著提高了分割精度和鲁棒性，生成了高保真的对象实例，并在各种室内和室外环境中具有强大的泛化能力。

Poison Once, Control Anywhere: Clean-Text Visual Backdoors in VLM-based Mobile Agents

Authors: Xuan Wang, Siyuan Liang, Zhe Liu, Yi Yu, Aishan Liu, Yuliang Lu, Xitong Gao, Ee-Chien Chang

First: 2025-06-16T08:09:32+00:00 · Latest: 2025-09-05T14:19:03+00:00

Comments: 10 pages

Abstract

Mobile agents powered by vision-language models (VLMs) are increasingly adopted for tasks such as UI automation and camera-based assistance. These agents are typically fine-tuned using small-scale, user-collected data, making them susceptible to stealthy training-time threats. This work introduces VIBMA, the first clean-text backdoor attack targeting VLM-based mobile agents. The attack injects malicious behaviors into the model by modifying only the visual input while preserving textual prompts and instructions, achieving stealth through the complete absence of textual anomalies. Once the agent is fine-tuned on this poisoned data, adding a predefined visual pattern (trigger) at inference time activates the attacker-specified behavior (backdoor). Our attack aligns the training gradients of poisoned samples with those of an attacker-specified target instance, effectively embedding backdoor-specific features into the poisoned data. To ensure the robustness and stealthiness of the attack, we design three trigger variants that better resemble real-world scenarios: static patches, dynamic motion patterns, and low-opacity blended content. Extensive experiments on six Android applications and three mobile-compatible VLMs demonstrate that our attack achieves high success rates (ASR up to 94.67%) while preserving clean-task behavior (FSR up to 95.85%). We further conduct ablation studies to understand how key design factors impact attack reliability and stealth. These findings are the first to reveal the security vulnerabilities of mobile agents and their susceptibility to backdoor injection, underscoring the need for robust defenses in mobile agent adaptation pipelines.

中文标题/摘要

标题：一毒多控：基于VLM的移动代理中的清洁文本视觉后门

基于视觉语言模型（VLMs）的移动代理越来越多地被用于UI自动化和基于摄像头的帮助任务。这些代理通常使用小型用户收集的数据进行微调，使其容易受到隐蔽的训练时威胁。本文介绍了VIBMA，这是第一个针对基于VLM的移动代理的清洁文本后门攻击。该攻击通过仅修改视觉输入而不改变文本提示和指令来注入恶意行为，从而通过完全不存在文本异常实现隐蔽。一旦代理使用被污染的数据进行微调，在推理时添加预定义的视觉模式（触发器）即可激活攻击者指定的行为（后门）。我们的攻击使被污染样本的训练梯度与攻击者指定的目标实例的梯度对齐，有效地将后门特定的特征嵌入到被污染的数据中。为了确保攻击的稳健性和隐蔽性，我们设计了三种更符合现实场景的触发器变体：静态补丁、动态运动模式和低不透明度融合内容。在六个Android应用程序和三个移动兼容的VLMs上的广泛实验表明，我们的攻击在保持清洁任务行为的同时（最高FSR为95.85%）实现了高成功率（ASR最高为94.67%）。我们还进行了消融研究，以了解关键设计因素如何影响攻击的可靠性和隐蔽性。这些发现首次揭示了移动代理的安全漏洞及其对后门注入的易感性，强调了在移动代理适应管道中需要强大的防御措施。

Summary / 总结

This work introduces VIBMA, a clean-text backdoor attack on vision-language model-based mobile agents, which injects malicious behaviors by modifying only the visual input without altering textual prompts. The attack uses predefined visual triggers to activate the backdoor at inference time. Experiments on six Android applications and three mobile-compatible VLMs show high success rates (up to 94.67%) for the backdoor activation while maintaining clean-task behavior (up to 95.85%).

这项工作提出了针对基于视觉语言模型的移动代理的清洁文本后门攻击VIBMA。通过仅修改视觉输入而不改变文本提示，该攻击实现了隐蔽性。一旦在中毒数据上进行微调，添加预定义的视觉触发器即可激活攻击者指定的行为。实验在六个Android应用程序和三个移动兼容的VLM上显示了高成功率（最高94.67%）的同时保持了清洁任务行为（最高95.85%）。

GenAI-based test case generation and execution in SDV platform

Authors: Denesa Zyberaj, Lukasz Mazur, Nenad Petrovic, Pankhuri Verma, Pascal Hirmer, Dirk Slama, Xiangwei Cheng, Alois Knoll

First: 2025-09-05T13:50:26+00:00 · Latest: 2025-09-05T13:50:26+00:00

Abstract

This paper introduces a GenAI-driven approach for automated test case generation, leveraging Large Language Models and Vision-Language Models to translate natural language requirements and system diagrams into structured Gherkin test cases. The methodology integrates Vehicle Signal Specification modeling to standardize vehicle signal definitions, improve compatibility across automotive subsystems, and streamline integration with third-party testing tools. Generated test cases are executed within the digital.auto playground, an open and vendor-neutral environment designed to facilitate rapid validation of software-defined vehicle functionalities. We evaluate our approach using the Child Presence Detection System use case, demonstrating substantial reductions in manual test specification effort and rapid execution of generated tests. Despite significant automation, the generation of test cases and test scripts still requires manual intervention due to current limitations in the GenAI pipeline and constraints of the digital.auto platform.

中文标题/摘要

标题：基于GenAI的SDV平台测试用例生成与执行

本文介绍了一种基于GenAI的自动化测试用例生成方法，利用大型语言模型和视觉-语言模型将自然语言需求和系统图转换为结构化的Gherkin测试用例。该方法结合了车辆信号规范建模，以标准化车辆信号定义、提高跨汽车子系统的兼容性，并简化与第三方测试工具的集成。生成的测试用例在digital.auto游乐场中执行，这是一个开放且供应商中立的环境，旨在促进对软件定义车辆功能的快速验证。我们使用儿童存在检测系统用例评估了该方法，展示了显著减少手动测试规范工作量和快速执行生成测试的优势。尽管自动化程度很高，但由于GenAI管道的当前限制和digital.auto平台的约束，测试用例和测试脚本的生成仍需人工干预。

Dual-Domain Perspective on Degradation-Aware Fusion: A VLM-Guided Robust Infrared and Visible Image Fusion Framework

Authors: Tianpei Zhang, Jufeng Zhao, Yiming Zhu, Guangmang Cui

First: 2025-09-05T10:48:46+00:00 · Latest: 2025-09-05T10:48:46+00:00

Abstract

Most existing infrared-visible image fusion (IVIF) methods assume high-quality inputs, and therefore struggle to handle dual-source degraded scenarios, typically requiring manual selection and sequential application of multiple pre-enhancement steps. This decoupled pre-enhancement-to-fusion pipeline inevitably leads to error accumulation and performance degradation. To overcome these limitations, we propose Guided Dual-Domain Fusion (GD^2Fusion), a novel framework that synergistically integrates vision-language models (VLMs) for degradation perception with dual-domain (frequency/spatial) joint optimization. Concretely, the designed Guided Frequency Modality-Specific Extraction (GFMSE) module performs frequency-domain degradation perception and suppression and discriminatively extracts fusion-relevant sub-band features. Meanwhile, the Guided Spatial Modality-Aggregated Fusion (GSMAF) module carries out cross-modal degradation filtering and adaptive multi-source feature aggregation in the spatial domain to enhance modality complementarity and structural consistency. Extensive qualitative and quantitative experiments demonstrate that GD^2Fusion achieves superior fusion performance compared with existing algorithms and strategies in dual-source degraded scenarios. The code will be publicly released after acceptance of this paper.

中文标题/摘要

标题：退化感知融合的双域视角：一种基于VLM的鲁棒红外和可见光图像融合框架

大多数现有的红外-可见光图像融合（IVIF）方法假设高质量的输入，因此在处理双源退化场景时往往难以应对，通常需要手动选择并按顺序应用多个预增强步骤。这种分离的预增强-融合管道不可避免地导致误差累积和性能下降。为克服这些限制，我们提出了一种名为引导双域融合（GD^2Fusion）的新框架，该框架将视觉-语言模型（VLMs）用于退化感知与双域（频率/空间）联合优化协同整合。具体而言，设计的引导频率模态特定提取（GFMSE）模块在频率域中进行退化感知和抑制，并区分性地提取融合相关的子带特征。同时，引导空间模态聚合融合（GSMAF）模块在空间域中进行跨模态退化过滤和自适应多源特征聚合，以增强模态互补性和结构一致性。广泛的定性和定量实验表明，GD^2Fusion在双源退化场景中实现了优于现有算法和策略的融合性能。论文被接受后，代码将公开发布。

Summary / 总结

The paper addresses the limitations of existing infrared-visible image fusion methods that assume high-quality inputs and struggle with degraded scenarios. It introduces GD^2Fusion, a framework that integrates vision-language models for degradation perception and performs joint frequency and spatial domain optimization. GD^2Fusion includes a GFMSE module for frequency-domain degradation perception and suppression, and a GSMAF module for cross-modal degradation filtering and adaptive feature aggregation. Experimental results show that GD^2Fusion outperforms existing methods in dual-source degraded scenarios.

论文提出GD^2Fusion框架，结合VLMs进行降级感知和双域联合优化，以处理降级的红外和可见光图像。GFMSE模块在频域进行降级感知和抑制，而GSMAF模块在空间域增强模态互补性和结构一致性。通过广泛的实验，GD^2Fusion在双源降级场景中优于现有方法。

InfoScale: Unleashing Training-free Variable-scaled Image Generation via Effective Utilization of Information

Authors: Guohui Zhang, Jiangtong Tan, Linjiang Huang, Zhonghang Yuan, Naishan Zheng, Jie Huang, Feng Zhao

First: 2025-09-01T12:27:04+00:00 · Latest: 2025-09-05T09:39:32+00:00

Abstract

Diffusion models (DMs) have become dominant in visual generation but suffer a performance drop when tested on resolutions that differ from the training scale, whether lower or higher. In fact, the key challenge in generating variable-scale images lies in the differing amounts of information across resolutions, which requires information conversion procedures to be varied for generating variable-scaled images. In this paper, we investigate the issues of three critical aspects in DMs for a unified analysis in variable-scaled generation: dilated convolution, attention mechanisms, and initial noise. Specifically, 1) dilated convolution in DMs for higher-resolution generation loses high-frequency information. 2) Attention for variable-scaled image generation struggles to adjust the information aggregation adaptively. 3) The spatial distribution of information in the initial noise is misaligned with variable-scaled images. To solve the above problems, we propose InfoScale, an information-centric framework for variable-scaled image generation that effectively utilizes information from the three corresponding aspects. For information loss in 1), we introduce a Progressive Frequency Compensation module to compensate for high-frequency information lost by dilated convolution in higher-resolution generation. For information aggregation inflexibility in 2), we introduce an Adaptive Information Aggregation module to adaptively aggregate information in lower-resolution generation and achieve an effective balance between local and global information in higher-resolution generation. For information distribution misalignment in 3), we design a Noise Adaptation module to re-distribute information in the initial noise for variable-scaled generation. Our method is plug-and-play for DMs, and extensive experiments demonstrate its effectiveness in variable-scaled image generation.

中文标题/摘要

标题：InfoScale：通过有效利用信息释放无需训练的可变比例图像生成

扩散模型（DMs）在视觉生成中已成为主流，但在测试不同训练比例的分辨率时会表现出性能下降。实际上，生成可变比例图像的关键挑战在于不同分辨率下的信息量不同，这需要信息转换过程随之变化。在本文中，我们探讨了DMs中三个关键方面的三个问题，以统一分析可变比例生成：膨胀卷积、注意力机制和初始噪声。具体来说，1）DMs中的膨胀卷积在高分辨率生成中会丢失高频信息。2）在可变比例图像生成中，注意力机制难以适应性地聚合信息。3）初始噪声中的信息空间分布与可变比例图像不一致。为了解决上述问题，我们提出了InfoScale，这是一种信息为中心的框架，通过从三个方面有效利用信息来实现可变比例图像生成。对于1）中的信息损失，我们引入了渐进频率补偿模块，以补偿膨胀卷积在高分辨率生成中丢失的高频信息。对于2）中的信息聚合灵活性不足，我们引入了自适应信息聚合模块，以适应性地在低分辨率生成中聚合信息，并在高分辨率生成中实现局部和全局信息的有效平衡。对于3）中的信息分布不一致，我们设计了噪声适应模块，以重新分配初始噪声中的信息，实现可变比例生成。我们的方法适用于DMs，广泛的实验表明其在可变比例图像生成中的有效性。

Summary / 总结

The research aims to address the performance drop of diffusion models (DMs) when generating images at resolutions different from their training scale. The key challenges include information loss in dilated convolution, inflexibility in information aggregation with attention mechanisms, and misalignment of information distribution in initial noise. To tackle these issues, the authors propose InfoScale, an information-centric framework that introduces three modules: Progressive Frequency Compensation, Adaptive Information Aggregation, and Noise Adaptation. These modules effectively utilize information from three aspects to improve variable-scaled image generation, demonstrating effectiveness in extensive experiments.

研究旨在解决扩散模型（DMs）在生成与训练尺度不同的分辨率图像时性能下降的问题。主要挑战包括卷积中的高频信息丢失、注意力机制在信息聚合上的灵活性不足以及初始噪声中的信息分布与变尺度图像不匹配。为了解决这些问题，作者提出了InfoScale，这是一种信息为中心的框架，引入了三个模块：渐进频率补偿、自适应信息聚合和噪声适应。这些模块有效地利用了三个方面的信息来改进变尺度图像生成，并在大量实验中证明了其有效性。

SynGen-Vision: Synthetic Data Generation for training industrial vision models

Authors: Alpana Dubey, Suma Mani Kuriakose, Nitish Bhardwaj

First: 2025-09-05T08:15:46+00:00 · Latest: 2025-09-05T08:15:46+00:00

Abstract

We propose an approach to generate synthetic data to train computer vision (CV) models for industrial wear and tear detection. Wear and tear detection is an important CV problem for predictive maintenance tasks in any industry. However, data curation for training such models is expensive and time-consuming due to the unavailability of datasets for different wear and tear scenarios. Our approach employs a vision language model along with a 3D simulation and rendering engine to generate synthetic data for varying rust conditions. We evaluate our approach by training a CV model for rust detection using the generated dataset and testing the trained model on real images of rusted industrial objects. The model trained with the synthetic data generated by our approach outperforms the other approaches with a mAP50 score of 0.87. The approach is customizable and can be easily extended to other industrial wear and tear detection scenarios.

中文标题/摘要

标题：SynGen-Vision: 工业视觉模型训练的合成数据生成

我们提出了一种生成合成数据的方法，用于训练计算机视觉(CV)模型以进行工业磨损检测。磨损检测是任何行业中预测性维护任务中一个重要的CV问题。然而，由于不同磨损场景数据集的缺乏，训练此类模型的数据整理既昂贵又耗时。我们的方法利用视觉语言模型和3D模拟与渲染引擎生成不同锈蚀条件下的合成数据。我们通过使用生成的数据集训练CV模型进行锈蚀检测，并在锈蚀的工业物体的真实图像上测试训练好的模型。使用我们方法生成的合成数据训练的模型，在mAP50得分上优于其他方法，得分为0.87。该方法可定制，并且可以轻松扩展到其他工业磨损检测场景。

Summary / 总结

The research aims to address the challenges of data curation for training computer vision models for industrial wear and tear detection, which is crucial for predictive maintenance. The method involves using a vision language model and a 3D simulation engine to generate synthetic data for different rust conditions. The experimental results show that the model trained with the synthetic data outperforms other approaches, achieving an mAP50 score of 0.87 on real images of rusted industrial objects.

研究旨在解决工业磨损检测中计算机视觉模型训练数据收集的挑战，这对于预测性维护至关重要。方法是使用视觉语言模型和3D仿真引擎生成不同锈蚀条件的合成数据。实验结果表明，使用该方法生成的合成数据训练的CV模型在真实锈蚀工业物体上的mAP50得分为0.87，优于其他方法，证明了该方法的有效性和可定制性。
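
The mAP50 figure reported above is mean average precision with detections counted as correct when their intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below shows only that matching criterion, not the full average-precision computation; the box coordinates in the example are made up.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box: Box, gt_box: Box, threshold: float = 0.5) -> bool:
    """A detection matches a ground-truth box when IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


# A half-overlapping prediction fails the 0.5 threshold (IoU is 1/3).
half_overlap = is_true_positive((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0))
```

A full mAP50 evaluation additionally ranks detections by confidence, builds a precision-recall curve per class, and averages the resulting areas across classes.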

Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens

Authors: Sohee Kim, Soohyun Ryu, Joonhyung Park, Eunho Yang

Venue: EMNLP 2025

First: 2025-09-03T05:17:25+00:00 · Latest: 2025-09-05T07:49:47+00:00

Comments: accepted to EMNLP 2025

Abstract

Large Vision-Language Models (LVLMs) generate contextually relevant responses by jointly interpreting visual and textual inputs. However, our finding reveals they often mistakenly perceive text inputs lacking visual evidence as being part of the image, leading to erroneous responses. In light of this finding, we probe whether LVLMs possess an internal capability to determine if textual concepts are grounded in the image, and discover a specific subset of Feed-Forward Network (FFN) neurons, termed Visual Absence-aware (VA) neurons, that consistently signal the visual absence through a distinctive activation pattern. Leveraging these patterns, we develop a detection module that systematically classifies whether an input token is visually grounded. Guided by its prediction, we propose a method to refine the outputs by reinterpreting question prompts or replacing the detected absent tokens during generation. Extensive experiments show that our method effectively mitigates the models' tendency to falsely presume the visual presence of text input and that it generalizes across various LVLMs.

中文标题/摘要

标题：揭示大型视觉-语言模型对视觉缺失词的响应

大型视觉-语言模型（LVLMs）通过联合解释视觉和文本输入来生成上下文相关响应。然而，我们的发现表明，它们经常错误地将缺乏视觉证据的文本输入视为图像的一部分，导致错误的响应。鉴于这一发现，我们探究LVLMs是否具有内部能力来判断文本概念是否与图像相关，并发现一种称为视觉缺失感知（VA）神经元的特定前馈网络（FFN）神经元，它们通过独特的激活模式一致地信号化视觉缺失。利用这些模式，我们开发了一个检测模块，系统地分类输入词是否与视觉相关。根据其预测，我们提出了一种方法，通过重新解释问题提示或在生成过程中替换检测到的缺失词来改进输出。广泛的实验表明，我们的方法有效地减轻了模型对文本输入视觉存在的虚假假设，并且具有跨各种LVLMs的一般性。

Summary / 总结

The study investigates how large vision-language models (LVLMs) handle text inputs without visual evidence, revealing that they often incorrectly interpret such text as part of the image. To address this issue, the researchers identified specific neurons, termed Visual Absence-aware (VA) neurons, which consistently signal the absence of visual elements. They developed a detection module to classify whether input tokens are visually grounded and proposed a method to refine model outputs by reinterpreting question prompts or replacing absent tokens. Experiments demonstrated that this method reduces the models' tendency to assume the visual presence of text inputs and is applicable across different LVLMs.

研究探讨了大型视觉-语言模型（LVLMs）如何处理缺乏视觉证据的文本输入，发现它们常常错误地将此类文本视为图像的一部分。为解决这一问题，研究人员识别出特定的神经元，称为视觉缺失感知（VA）神经元，这些神经元会通过特定的激活模式信号化视觉元素的缺失。他们开发了一个检测模块来判断输入令牌是否与视觉内容相关，并提出了一种方法，通过重新解释问题提示或在生成过程中替换检测到的缺失令牌来改进模型输出。实验表明，这种方法减少了模型假设文本输入具有视觉存在的倾向，并且适用于不同的LVLMs。

TemporalFlowViz: Parameter-Aware Visual Analytics for Interpreting Scramjet Combustion Evolution

Authors: Yifei Jia, Shiyu Cheng, Yu Dong, Guan Li, Dong Tian, Ruixiao Peng, Xuyi Lu, Yu Wang, Wei Yao, Guihua Shan

First: 2025-09-05T06:35:36+00:00 · Latest: 2025-09-05T06:35:36+00:00

Abstract

Understanding the complex combustion dynamics within scramjet engines is critical for advancing high-speed propulsion technologies. However, the large scale and high dimensionality of simulation-generated temporal flow field data present significant challenges for visual interpretation, feature differentiation, and cross-case comparison. In this paper, we present TemporalFlowViz, a parameter-aware visual analytics workflow and system designed to support expert-driven clustering, visualization, and interpretation of temporal flow fields from scramjet combustion simulations. Our approach leverages hundreds of simulated combustion cases with varying initial conditions, each producing time-sequenced flow field images. We use pretrained Vision Transformers to extract high-dimensional embeddings from these frames, apply dimensionality reduction and density-based clustering to uncover latent combustion modes, and construct temporal trajectories in the embedding space to track the evolution of each simulation over time. To bridge the gap between latent representations and expert reasoning, domain specialists annotate representative cluster centroids with descriptive labels. These annotations are used as contextual prompts for a vision-language model, which generates natural-language summaries for individual frames and full simulation cases. The system also supports parameter-based filtering, similarity-based case retrieval, and coordinated multi-view exploration to facilitate in-depth analysis. We demonstrate the effectiveness of TemporalFlowViz through two expert-informed case studies and expert feedback, showing TemporalFlowViz enhances hypothesis generation, supports interpretable pattern discovery, and enhances knowledge discovery in large-scale scramjet combustion analysis.

中文标题/摘要

标题：TemporalFlowViz：面向参数的视觉分析工作流以解释超燃冲压发动机燃烧演变

理解超燃冲压发动机内的复杂燃烧动力学对于推进高速推进技术至关重要。然而，由仿真生成的时间序列流场数据的大规模和高维度性为视觉解释、特征区分和跨案例比较带来了重大挑战。本文介绍了TemporalFlowViz，一种面向参数的视觉分析工作流和系统，旨在支持专家驱动的聚类、可视化和超燃冲压发动机燃烧仿真时间序列流场的解释。我们的方法利用了数百个具有不同初始条件的仿真燃烧案例，每个案例都生成了时间序列的流场图像。我们使用预训练的视觉变换器从这些帧中提取高维嵌入，应用降维和基于密度的聚类以发现潜在的燃烧模式，并在嵌入空间中构建时间轨迹以跟踪每个仿真随时间的演变。为了弥合潜在表示与专家推理之间的差距，领域专家对代表性的聚类中心进行标注并赋予描述性标签。这些注释被用作视觉语言模型的上下文提示，生成单个帧和完整仿真案例的自然语言摘要。该系统还支持基于参数的过滤、基于相似性的案例检索以及协调的多视图探索，以促进深入分析。通过两个专家指导的案例研究和专家反馈，我们展示了TemporalFlowViz的有效性，证明TemporalFlowViz增强了假设生成、支持可解释模式发现，并增强了大规模超燃冲压发动机燃烧分析中的知识发现。

Summary / 总结

TemporalFlowViz is a visual analytics system designed to interpret complex combustion dynamics in scramjet engines. It uses pretrained Vision Transformers to extract high-dimensional embeddings from time-sequenced flow field images and applies dimensionality reduction and clustering to uncover latent combustion modes. The system supports parameter-based filtering and coordinated multi-view exploration, and domain specialists annotate cluster centroids with descriptive labels to generate natural-language summaries. The effectiveness of TemporalFlowViz is demonstrated through expert case studies, showing it enhances hypothesis generation and pattern discovery.

TemporalFlowViz 是一个用于解释超燃冲压发动机复杂燃烧动力学的可视化分析系统。该系统使用预训练的 Vision Transformers 提取时间序列流场图像的高维嵌入，并应用降维和聚类来发现潜在的燃烧模式。系统支持基于参数的过滤、基于相似性的案例检索和协调的多视图探索，以促进深入分析。专家反馈表明，TemporalFlowViz 有助于假设生成和大规模超燃冲压发动机燃烧分析中的可解释模式发现。

FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph

Authors: Zhangding Liu, Neda Mohammadi, John E. Taylor

First: 2025-09-05T03:05:18+00:00 · Latest: 2025-09-05T03:05:18+00:00

Abstract

Timely and accurate floodwater depth estimation is critical for road accessibility and emergency response. While recent computer vision methods have enabled flood detection, they suffer from both accuracy limitations and poor generalization due to dependence on fixed object detectors and task-specific training. To enable accurate depth estimation that can generalize across diverse flood scenarios, this paper presents FloodVision, a zero-shot framework that combines the semantic reasoning abilities of the foundation vision-language model GPT-4o with a structured domain knowledge graph. The knowledge graph encodes canonical real-world dimensions for common urban objects including vehicles, people, and infrastructure elements to ground the model's reasoning in physical reality. FloodVision dynamically identifies visible reference objects in RGB images, retrieves verified heights from the knowledge graph to mitigate hallucination, estimates submergence ratios, and applies statistical outlier filtering to compute final depth values. Evaluated on 110 crowdsourced images from MyCoast New York, FloodVision achieves a mean absolute error of 8.17 cm, a 20.5% reduction from the GPT-4o baseline error of 10.28 cm, and surpasses prior CNN-based methods. The system generalizes well across varying scenes and operates in near real-time, making it suitable for future integration into digital twin platforms and citizen-reporting apps for smart city flood resilience.
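
As a rough illustration of the pipeline described in the abstract, the sketch below multiplies each reference object's height by its estimated submergence ratio, drops outlying estimates, and averages the rest. The object names, heights, and ratios are hypothetical, and the median-absolute-deviation filter is an assumption: the abstract does not say which statistical outlier test FloodVision applies. The last helper reproduces the reported relative error reduction from the two MAE figures.

```python
from statistics import mean, median

# Hypothetical reference heights in cm, standing in for knowledge-graph entries.
REFERENCE_HEIGHT_CM = {"car_wheel": 65.0, "stop_sign_post": 210.0, "adult": 170.0}

def estimate_depth_cm(observations, mad_threshold=3.0):
    """observations: list of (object_name, submergence_ratio in [0, 1])."""
    # One depth estimate per visible reference object.
    depths = [REFERENCE_HEIGHT_CM[name] * ratio for name, ratio in observations]
    med = median(depths)
    mad = median([abs(d - med) for d in depths])
    if mad == 0:
        return mean(depths)
    # Assumed filter: keep estimates within mad_threshold MADs of the median.
    kept = [d for d in depths if abs(d - med) / mad <= mad_threshold]
    return mean(kept)

def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the `baseline` error."""
    return (baseline - improved) / baseline

observations = [("car_wheel", 0.40), ("adult", 0.15),
                ("stop_sign_post", 0.12), ("car_wheel", 0.95)]
depth = estimate_depth_cm(observations)          # the 0.95 reading is filtered out
reduction = relative_reduction(10.28, 8.17)      # MAE figures from the abstract
```

With these made-up observations the three consistent estimates (about 25 to 26 cm) survive and the fourth is discarded; the reduction evaluates to roughly 0.205, matching the 20.5% quoted above.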

中文标题/摘要

标题：FloodVision：使用基础视觉-语言模型和领域知识图谱的城市洪水深度估计

及时准确的洪水水位估计对于道路通行能力和应急响应至关重要。虽然最近的计算机视觉方法已经实现了洪水检测，但它们在准确性和泛化能力方面存在局限，因为依赖于固定的物体检测器和特定任务的训练。为了实现能够跨不同洪水场景泛化的准确深度估计，本文提出了FloodVision，这是一种零样本框架，结合了基础视觉-语言模型GPT-4o的语义推理能力和结构化的领域知识图谱。知识图谱编码了包括车辆、人员和基础设施元素在内的常见城市物体的标准现实世界尺寸，使模型的推理基于物理现实。FloodVision动态识别RGB图像中的可见参考物体，从知识图谱中检索验证过的高度以减轻幻觉，估计淹没比例，并应用统计异常值过滤来计算最终的深度值。在MyCoast New York提供的110张众包图像上评估，FloodVision的平均绝对误差为8.17厘米，比GPT-4o基线降低了10.28厘米的20.5%，超过了基于CNN的先前方法。该系统在不同场景下泛化良好，可近实时运行，适用于未来集成到数字孪生平台和市民报告应用中以增强智慧城市防洪韧性。

AnomalyLMM: Bridging Generative Knowledge and Discriminative Retrieval for Text-Based Person Anomaly Search

Authors: Hao Ju, Hu Zhang, Zhedong Zheng

First: 2025-09-04T16:34:46+00:00 · Latest: 2025-09-05T02:40:36+00:00

Abstract

With growing public safety demands, text-based person anomaly search has emerged as a critical task, aiming to retrieve individuals with abnormal behaviors via natural language descriptions. Unlike conventional person search, this task presents two unique challenges: (1) fine-grained cross-modal alignment between textual anomalies and visual behaviors, and (2) anomaly recognition under sparse real-world samples. While Large Multi-modal Models (LMMs) excel in multi-modal understanding, their potential for fine-grained anomaly retrieval remains underexplored, hindered by: (1) a domain gap between generative knowledge and discriminative retrieval, and (2) the absence of efficient adaptation strategies for deployment. In this work, we propose AnomalyLMM, the first framework that harnesses LMMs for text-based person anomaly search. Our key contributions are: (1) a novel coarse-to-fine pipeline integrating LMMs to bridge generative world knowledge with retrieval-centric anomaly detection; (2) a training-free adaptation cookbook featuring masked cross-modal prompting, behavioral saliency prediction, and knowledge-aware re-ranking, enabling zero-shot focus on subtle anomaly cues. As the first study to explore LMMs for this task, we conduct a rigorous evaluation on the PAB dataset, the only publicly available benchmark for text-based person anomaly search, with its curated real-world anomalies covering diverse scenarios (e.g., falling, collision, and being hit). Experiments show the effectiveness of the proposed method, which surpasses the competitive baseline by 0.96% in Recall@1. Notably, our method reveals interpretable alignment between textual anomalies and visual behaviors, validated via qualitative analysis. Our code and models will be released for future research.
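
For readers unfamiliar with the reported metric, here is a minimal sketch of Recall@K for text-to-image retrieval. It assumes each text query has exactly one correct gallery image, which may be simpler than the PAB evaluation protocol; the identifiers are made up.

```python
def recall_at_k(ranked_ids_per_query, gold_id_per_query, k=1):
    """Fraction of queries whose correct item appears in the top-k results."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query)
        if gold in ranked[:k]
    )
    return hits / len(gold_id_per_query)

# Hypothetical rankings for three text queries (best match first).
rankings = [["img_a", "img_b"], ["img_c", "img_d"], ["img_f", "img_e"]]
gold = ["img_a", "img_d", "img_e"]

r1 = recall_at_k(rankings, gold, k=1)   # only the first query is right at rank 1
r2 = recall_at_k(rankings, gold, k=2)   # all three are found within the top 2
```

Under this definition, a re-ranking stage improves Recall@1 only when it moves the correct image from a lower position to the top of the list.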

中文标题/摘要

标题：AnomalyLMM：连接生成性知识与辨别性检索的文本基础人员异常搜索框架

随着公共安全需求的增长，基于文本的人员异常搜索已成为一项关键任务，旨在通过自然语言描述检索具有异常行为的个体。与传统的人员搜索任务不同，这一任务面临两个独特的挑战：（1）文本异常与视觉行为之间的精细跨模态对齐，以及（2）在稀疏的现实世界样本下异常识别。虽然大型多模态模型（LMMs）在多模态理解方面表现出色，但它们在精细异常检索方面的潜力尚未得到充分探索，受到以下因素的阻碍：（1）生成性知识与辨别性检索之间的领域差距，以及（2）缺乏有效的部署适应策略。在本文中，我们提出了AnomalyLMM，这是第一个利用LMMs进行基于文本的人员异常搜索的框架。我们的主要贡献包括：（1）一种新颖的从粗到细的流水线，将LMMs集成以连接生成性世界的知识与检索为中心的异常检测；（2）一种无需训练的适应食谱，包括掩码跨模态提示、行为显著性预测和知识感知再排序，使零样本聚焦于细微的异常线索。作为第一个探索LMMs用于此任务的研究，我们在PAB数据集上进行了严格的评估，这是唯一公开的基于文本的人员异常搜索基准数据集，其精心策划的现实世界异常涵盖了多种场景（例如，跌倒、碰撞和被击中）。实验表明，所提出的方法的有效性，超越了竞争性基线+0.96%的召回率。值得注意的是，我们的方法揭示了文本异常与视觉行为之间的可解释对齐，通过定性分析进行了验证。我们的代码和模型将为未来的研究发布。

Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

Authors: Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita

First: 2025-09-05T02:37:01+00:00 · Latest: 2025-09-05T02:37:01+00:00

Comments: 10 pages, Accepted to ICCV2025

Abstract

This paper proposes a method for detecting dynamic human groups in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. In our method, such local and global appearance features are extracted from each frame using a Vision-Language Model (VLM) augmented for group detection. For further improvement, the group structure should be consistent over time. While previous methods gain stability by assuming that groups do not change within a video, our method detects dynamically changing groups through global optimization over a graph built from the groupness probabilities of all frames, which are estimated from our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git

中文标题/摘要

标题：使用VLM增强的时间群体性图动态群体检测

本文提出了一种视频中动态人群群体检测方法。对于检测复杂群体而言，不仅需要考虑群体成员的局部外观特征，还需要考虑场景的全局上下文。在我们的方法中，使用增强的视觉-语言模型（VLM）从每一帧中提取局部和全局外观特征。为了进一步改进，群体结构应保持时间一致性。虽然之前的模型假设视频中的群体不会发生变化，我们的方法通过使用估计了所有帧群体性概率的图进行全局优化，来检测动态变化的群体。我们的实验结果表明，我们的方法在公共数据集上优于最先进的群体检测方法。代码：https://github.com/irajisamurai/VLM-GroupDetection.git

Summary / 总结

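This paper proposes a method for detecting dynamically changing human groups in videos. A Vision-Language Model (VLM) augmented for group detection extracts local appearance features and global scene context from each frame. A graph holding the groupness probabilities of all frames is then globally optimized to detect groups that change over time. Experiments show that the method outperforms state-of-the-art group detection methods on public datasets.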

该研究提出了一种利用Vision-Language Model (VLM) 提取局部外观特征和全局场景上下文的方法，以实现视频中动态人类群体检测。该方法通过一个图来表示所有帧的群体概率，并利用全局优化来检测动态变化的群体。实验结果表明，该方法在公共数据集上优于现有方法。

Guideline-Consistent Segmentation via Multi-Agent Refinement

Authors: Vanshika Vats, Ashwani Rathee, James Davis

First: 2025-09-04T22:32:57+00:00 · Latest: 2025-09-04T22:32:57+00:00

Abstract

Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.
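
The Worker-Supervisor loop described above can be sketched as a small control loop. Everything here is illustrative: the `refine` function, the toy worker and supervisor, and the guideline format are hypothetical, masks are reduced to sets of labels, and a hand-written rule stands in for the paper's learned reinforcement learning stop policy.

```python
def refine(worker, supervisor, should_stop, guidelines, max_rounds=5):
    """Iterate worker -> supervisor until the stop policy fires."""
    mask = worker(None, [])                      # initial segmentation, no feedback yet
    history = []
    for round_idx in range(max_rounds):
        critique = supervisor(mask, guidelines)  # list of guideline violations
        history.append(critique)
        if should_stop(round_idx, critique):
            break
        mask = worker(mask, critique)            # revise using the critique
    return mask, history

# Toy stand-ins: a "mask" is just the set of labels that were segmented.
GUIDELINES = {"include": {"road", "curb"}, "exclude": {"shadow"}}

def toy_worker(mask, critique):
    if mask is None:
        return {"road", "shadow"}                # deliberately imperfect first pass
    fixed = set(mask)
    for action, label in critique:
        if action == "add":
            fixed.add(label)
        else:
            fixed.discard(label)
    return fixed

def toy_supervisor(mask, guidelines):
    issues = [("add", label) for label in sorted(guidelines["include"] - mask)]
    issues += [("remove", label) for label in sorted(guidelines["exclude"] & mask)]
    return issues

def stop_when_clean(round_idx, critique):
    # Assumed rule; the paper instead learns when to terminate the loop.
    return not critique

final_mask, critiques = refine(toy_worker, toy_supervisor, stop_when_clean, GUIDELINES)
```

In this toy run the loop ends after the second critique comes back empty. A learned stop policy could also end it earlier, trading some guideline compliance for fewer model calls.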

中文标题/摘要

标题：基于指南一致性的多智能体细化分割

在实际应用中的语义分割不仅需要准确的掩膜，还需要严格遵守文本标签指南。这些指南通常复杂且冗长，无论是人工还是自动标注往往未能忠实遵守。传统方法依赖于昂贵的任务特定重新训练，且随着指南的演变需要重复进行。尽管最近的开放式词汇分割方法在简单的提示下表现出色，但在面对包含段落长度复杂分割规则的指南集时，它们往往会失败。为解决这一问题，我们提出了一种无需训练的多智能体框架，该框架在迭代的工人-监督者细化架构中协调通用视觉-语言模型。工人执行分割，监督者根据检索到的指南对其进行评价，轻量级的强化学习停止策略决定何时终止循环，以确保指南一致的掩膜并平衡资源使用。在Waymo和ReasonSeg数据集上的评估表明，我们的方法显著优于最先进的基线，展示了强大的泛化能力和指令遵守能力。

Summary / 总结

The research addresses the challenge of semantic segmentation in real-world applications, where strict adherence to complex textual labeling guidelines is crucial. It proposes a multi-agent, training-free framework that uses an iterative Worker-Supervisor architecture to ensure guideline-consistent segmentation. The Worker performs the segmentation, the Supervisor critiques it against retrieved guidelines, and a reinforcement learning policy decides when to terminate the loop. Experiments on the Waymo and ReasonSeg datasets show that this method outperforms existing approaches, demonstrating strong generalization and instruction adherence.

研究针对实际应用中语义分割需要严格遵守复杂文本标注指南的挑战。提出了一种无需训练的多代理框架，采用迭代的工人-监督者架构来确保符合指南的分割。工人执行分割任务，监督者根据检索到的指南对其进行批评，轻量级的强化学习停止策略决定何时终止循环。在Waymo和ReasonSeg数据集上的实验表明，该方法优于现有方法，展示了强大的泛化能力和指令遵循能力。

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection

Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee

Venue: EMNLP 2025

First: 2025-09-04T17:59:43+00:00 · Latest: 2025-09-04T17:59:43+00:00

Comments: EMNLP 2025; Project Homepage: https://yanzehong.github.io/trust-vl/

Abstract

Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.

中文标题/摘要

标题：TRUST-VL：一种可解释的通用多模态虚假信息检测助手

多模态虚假信息，包括文本、视觉和跨模态的扭曲，构成了日益严重的社会威胁，这种威胁被生成式AI放大。现有方法通常专注于一种类型的扭曲，并且难以泛化到未见过的场景中。在这项工作中，我们观察到不同类型的扭曲共享一些共同的推理能力，同时也需要特定任务的技能。我们假设跨类型联合训练有助于知识共享并增强模型的泛化能力。为此，我们引入了TRUST-VL，这是一种统一且可解释的视觉语言模型，用于通用多模态虚假信息检测。TRUST-VL 包含一个新颖的问答感知视觉增强模块，旨在提取特定任务的视觉特征。为了支持训练，我们还构建了TRUST-Instruct，这是一个包含198,000个样本的大规模指令数据集，这些样本具有与人类事实核查工作流程对齐的结构化推理链。在领域内和零样本基准上的广泛实验表明，TRUST-VL 达到了最先进的性能，同时提供了强大的泛化能力和可解释性。

Summary / 总结

The research aims to address the challenge of detecting multimodal misinformation that involves textual, visual, and cross-modal distortions, which are becoming more prevalent due to generative AI. The study introduces TRUST-VL, a unified vision-language model that incorporates a Question-Aware Visual Amplifier module to extract task-specific visual features. The model is trained using TRUST-Instruct, a large instruction dataset with 198K samples. Experimental results show that TRUST-VL outperforms existing methods on both in-domain and zero-shot benchmarks, demonstrating strong generalization and interpretability capabilities.

研究旨在应对包含文本、视觉和跨模态扭曲的多模态 misinformation 检测挑战。方法是使用一个统一的视觉语言模型 TRUST-VL，该模型在不同类型的扭曲上进行联合训练以共享知识并提高泛化能力。关键实验结果表明，TRUST-VL 在领域内和零样本基准测试中均优于现有方法，并且还提供了强大的可解释性和泛化能力。

OVGrasp: Open-Vocabulary Grasping Assistance via Multimodal Intent Detection

Authors: Chen Hu, Shan Luo, Letizia Gionfrida

First: 2025-09-04T15:42:36+00:00 · Latest: 2025-09-04T15:42:36+00:00

Abstract

Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present OVGrasp, a hierarchical control framework for soft exoskeleton-based grasp assistance that integrates RGB-D vision, open-vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalization in open environments, OVGrasp incorporates a vision-language foundation model with an open-vocabulary mechanism, allowing zero-shot detection of previously unseen objects without retraining. A multimodal decision-maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in multi-object scenarios. We deploy the complete framework on a custom egocentric-view wearable exoskeleton and conduct systematic evaluations on 15 objects across three grasp types. Experimental results with ten participants demonstrate that OVGrasp achieves a grasping ability score (GAS) of 87.00%, outperforming state-of-the-art baselines and achieving improved kinematic alignment with natural hand motion.

中文标题/摘要

标题：OVGrasp: 开放词汇抓取辅助通过多模态意图检测

抓取辅助对于恢复运动受损个体的自主性至关重要，特别是在物体类别和用户意图多样且不可预测的非结构化环境中。我们提出OVGrasp，一种基于软外骨骼的抓取辅助的分层控制框架，结合RGB-D视觉、开放词汇提示和语音命令，实现稳健的多模态交互。为了在开放环境中增强泛化能力，OVGrasp整合了一个视觉语言基础模型和开放词汇机制，允许在无需重新训练的情况下进行零样本检测，以识别未见过的对象。多模态决策者进一步融合空间和语言线索，推断用户意图，如抓取或释放，在多物体场景中。我们在一个定制的主观视角可穿戴外骨骼上部署了完整的框架，并在15个物体上进行了三种抓取类型的系统评估。十名参与者的实验结果表明，OVGrasp实现了87.00%的抓取能力评分（GAS），优于最先进的基线，并实现了与自然手部运动更好的运动学对齐。

Image Embedding Sampling Method for Diverse Captioning

Authors: Sania Waheed, Na Min An

First: 2025-02-14T12:33:19+00:00 · Latest: 2025-09-04T15:00:25+00:00

Comments: 17 pages, 5 figures, 9 tables

Abstract

Image captioning with state-of-the-art VLMs has improved significantly over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices and assistive technologies. Alternatively, comparatively smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparatively small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on the MSCOCO, Flickr30k, and Nocaps test datasets, achieving Div-2 scores of 0.735, 0.750, and 0.748, respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.
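
To make the diversity numbers concrete, the sketch below computes a Div-n score under one common definition: the number of distinct n-grams across the captions generated for an image, divided by the total number of words in those captions. This is an assumption, since the abstract does not state which variant the authors use, and whitespace tokenization is a simplification.

```python
def div_n(captions, n=2):
    """Distinct n-grams across a caption set divided by total word count."""
    total_words = 0
    distinct = set()
    for caption in captions:
        tokens = caption.lower().split()
        total_words += len(tokens)
        distinct.update(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return len(distinct) / total_words if total_words else 0.0

varied = div_n(["a dog runs", "a dog sleeps"])     # 3 distinct bigrams over 6 words
repeated = div_n(["a dog runs", "a dog runs"])     # 2 distinct bigrams over 6 words
```

Captions that describe different image regions share fewer bigrams, so a region-aware captioner would be expected to score higher on this measure than one that repeats the same scene-level sentence.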

中文标题/摘要

标题：图像嵌入采样方法以实现多样的描述

最先进的VLM的图像描述随着时间的推移显著提高，但这也带来了计算复杂性的增加，使得它们对于资源受限的应用，如移动设备和辅助技术来说不够普及。相反，较小的VLM更侧重于高层次的场景描述，而忽略了有助于更深入理解图像的细节。在本文中，我们介绍了一种无需训练的框架，通过使用BLIP作为骨干网络，明确关注不同的图像区域，从而增强描述的多样性和信息量。我们的方法利用结构化分割生成层次表示，捕捉全局和局部语义。无需额外的模型训练，我们证明了我们的方法使较小的VLM在图像-描述对齐、语义完整性和多样性方面达到了与较大模型相当的性能。我们在MSCOCO、Flickr30k和Nocaps测试数据集上评估了我们的框架，分别获得了Div-2得分为0.735、0.750和0.748，同时保持了与人工标注描述的强烈相关性和语义完整性。

Summary / 总结

This paper addresses the challenge of enhancing the diversity and informativeness of image captions using a small vision-language model (VLM) called BLIP. By leveraging structured segmentation, the method captures both global and localized semantics without additional training. The approach significantly improves the performance of smaller VLMs, achieving comparable results to larger models in terms of image-caption alignment, semantic integrity, and diversity, as evaluated on MSCOCO, Flickr30k, and Nocaps datasets.

本文提出了一种无需额外训练的方法，利用小型视觉-语言模型BLIP和结构化分割，增强图像描述的多样性和信息量。该方法能够捕捉全局和局部语义，使较小的VLM在图像-描述对齐、语义完整性和多样性方面达到与大型模型相当的性能。该框架在MSCOCO、Flickr30k和Nocaps数据集上的Div-2得分为0.735、0.750和0.748，同时保持了与人工标注描述的高度相关性和语义完整性。

Straighter Flow Matching via a Diffusion-Based Coupling Prior

Authors: Siyu Xing, Jie Cao, Huaibo Huang, Haichao Shi, Xiao-Yu Zhang

First: 2023-11-28T06:19:30+00:00 · Latest: 2025-09-04T14:24:04+00:00

Abstract

Flow matching, as a paradigm of generative modeling, has achieved notable success across various domains. However, existing methods rely on either multi-round training or knowledge within minibatches, which makes it hard to find a favorable coupling strategy that straightens trajectories for few-step generation. To address this issue, we propose a novel approach, Straighter trajectories of Flow Matching (StraightFM). It straightens trajectories with a coupling strategy defined at the level of the entire distribution. More specifically, during training, StraightFM uses a single diffusion model as a coupling prior to create couplings of images and noise, straightening trajectories for few-step generation. Our coupling strategy can also integrate with the existing coupling direction from real data to noise, improving image quality in few-step generation. Experimental results on pixel space and latent space show that StraightFM yields attractive samples within 5 steps. Moreover, our unconditional StraightFM is seamlessly compatible with training-free multimodal conditional generation, maintaining high-quality image generation in few steps.
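
A toy calculation shows why straighter trajectories matter for few-step generation. On a straight path between a noise sample and a data sample the velocity is constant, so Euler integration lands on the target with any number of steps; on a curved path with the same endpoints, coarse steps leave a visible error. The one-dimensional values and both velocity fields below are illustrative assumptions, not StraightFM's learned model.

```python
def euler(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-size steps."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt)
    return x

noise, target = -1.0, 2.0   # hypothetical coupled pair (noise sample, data sample)

def straight(x, t):
    # Straight path x(t) = (1 - t) * noise + t * target has constant velocity.
    return target - noise

def curved(x, t):
    # Curved path x(t) = noise + (target - noise) * t**2, same endpoints.
    return 2.0 * (target - noise) * t

one_step_straight = euler(straight, noise, steps=1)
five_step_curved = euler(curved, noise, steps=5)
many_step_curved = euler(curved, noise, steps=1000)
```

Here a single Euler step on the straight field already reaches the target, while five steps on the curved field stop noticeably short of it and only a fine discretization closes the gap.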

中文标题/摘要

标题：基于扩散耦合先验的更直流水流动匹配

水流动匹配作为一种生成模型的范式，在各个领域取得了显著的成功。然而，现有方法要么采用多轮训练，要么利用小批量内的知识，这在寻找适合直流水流动策略以实现几步生成方面提出了挑战。为解决这一问题，我们提出了一种新的方法，即更直流水流动匹配（StraightFM）。该方法在整体分布层面采用耦合策略来直流水流动。具体而言，在训练过程中，StraightFM通过一个扩散模型将图像和噪声耦合起来作为耦合先验，以直流水流动进行几步生成。我们的耦合策略还可以与真实数据到噪声的现有耦合方向结合，从而在几步生成中提高图像质量。在像素空间和潜在空间的实验结果显示，StraightFM在5步内生成了具有吸引力的样本。此外，我们的无条件StraightFM与无需训练的多模态条件生成无缝兼容，在几步内保持高质量的图像生成。

Summary / 总结

The paper addresses the challenge of straightening trajectories in flow matching for few-step generation by proposing Straighter trajectories of Flow Matching (StraightFM). StraightFM uses a diffusion-based coupling prior to straighten trajectories during training, improving image quality in few-step generation. Experiments show that StraightFM generates attractive samples within 5 steps and maintains high-quality image generation in few-step unconditional and multimodal conditional generation.

论文提出了一种名为StraightFM的方法，通过使用基于扩散的耦合先验在训练过程中直角化轨迹来解决流匹配中轨迹直角化的问题，该方法还与现有的从真实数据到噪声的耦合策略相结合，以提高几步生成中的图像质量。实验结果表明，StraightFM在五步内生成了吸引人的样本，并且在无条件和多模态条件生成中保持了高质量的图像生成。

Learning Active Perception via Self-Evolving Preference Optimization for GUI Grounding

Authors: Wanfu Wang, Qipeng Huang, Guangquan Xue, Xiaobo Liang, Juntao Li

First: 2025-09-04T14:17:01+00:00 · Latest: 2025-09-04T14:17:01+00:00

Abstract

Vision Language Models (VLMs) have recently achieved significant progress in bridging visual perception and linguistic reasoning. The OpenAI o3 model introduced a zoom-in search strategy that effectively elicits active perception capabilities in VLMs, improving downstream task performance. However, enabling VLMs to reason effectively over appropriate image regions remains a core challenge in GUI grounding, particularly under high-resolution inputs and complex multi-element visual interactions. In this work, we propose LASER, a self-evolving framework that progressively endows VLMs with multi-step perception capabilities, enabling precise coordinate prediction. Specifically, our approach integrates Monte Carlo quality estimation with Intersection-over-Union (IoU)-based region quality evaluation to jointly encourage both accuracy and diversity in constructing high-quality preference data. This combination explicitly guides the model to focus on instruction-relevant key regions while adaptively allocating reasoning steps based on task complexity. Comprehensive experiments on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks demonstrate consistent performance gains, validating the effectiveness of our method. Furthermore, when fine-tuned on GTA1-7B, LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark, establishing a new state-of-the-art (SoTA) among 7B-scale models.
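
The IoU-based region scoring mentioned in the abstract can be illustrated with a short sketch. The box format `(x1, y1, x2, y2)` and the `preference_pair` helper, which simply takes the best- and worst-overlapping candidate regions as a chosen/rejected pair, are assumptions for illustration; the abstract does not describe how LASER actually assembles its preference data or how it combines IoU with the Monte Carlo estimate.

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def preference_pair(candidates, target):
    """Hypothetical helper: (chosen, rejected) regions ranked by IoU with the target."""
    ranked = sorted(candidates, key=lambda box: iou(box, target), reverse=True)
    return ranked[0], ranked[-1]

target_box = (1, 1, 3, 3)                                   # made-up ground-truth element
candidate_regions = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
chosen, rejected = preference_pair(candidate_regions, target_box)
```

A zoom-in region that barely overlaps the target element scores near zero, which is what makes IoU usable as a signal for whether a focus region is worth reasoning over.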

中文标题/摘要

标题：通过自我进化偏好优化学习主动感知 GUI 地址

视觉语言模型（VLMs）最近在视觉感知和语言推理的结合方面取得了显著进展。最近，OpenAI 的 o3 模型引入了一种缩放搜索策略，有效地激发了 VLMs 的主动感知能力，提高了下游任务的性能。然而，在 GUI 地址中，特别是在高分辨率输入和复杂多元素视觉交互下，使 VLMs 有效地在适当图像区域进行推理仍然是一个核心挑战。在本文中，我们提出了一种自我进化的框架 LASER，该框架逐步赋予 VLMs 多步感知能力，使其能够进行精确的坐标预测。具体而言，我们的方法将蒙特卡洛质量估计与基于交并比（IoU）的区域质量评估相结合，以共同促进构建高质量偏好数据的准确性和多样性。这种结合明确地引导模型关注与指令相关的关键区域，并根据任务复杂性自适应地分配推理步骤。在 ScreenSpot Pro 和 ScreenSpot-v2 基准上的全面实验表明，该方法具有一致的性能提升，验证了其有效性。此外，当在 GTA1-7B 上进行微调时，LASER 在 ScreenSpot-Pro 基准上的得分为 55.7，成为 7B 规模模型中的新最佳水平。

Summary / 总结

This work addresses the challenge of enabling Vision Language Models (VLMs) to effectively reason over appropriate image regions in GUI grounding tasks. The proposed LASER framework uses a self-evolving preference optimization method that combines Monte Carlo quality estimation with IoU-based region quality evaluation. This approach enhances the model's ability to predict precise coordinates and focus on instruction-relevant key regions. Experiments on the ScreenSpot-Pro and ScreenSpot-v2 benchmarks show consistent performance gains, and LASER achieves a score of 55.7 on the ScreenSpot-Pro benchmark when fine-tuned on GTA1-7B, setting a new state-of-the-art for 7B-scale models.

本文旨在解决使视觉语言模型（VLMs）在GUI定位任务中有效推理适当图像区域的挑战。作者提出了一种自进化的框架LASER，该框架结合了蒙特卡洛质量估计和交并比（IoU）区域质量评估，以提高多步感知能力。在ScreenSpot Pro和ScreenSpot-v2基准上的实验显示了一致的性能提升，并且LASER在ScreenSpot-Pro基准上的得分为55.7，成为7B规模模型中的新最佳表现。

Exposing Synthetic Speech: Model Attribution and Detection of AI-generated Speech via Audio Fingerprints

Authors: Matías Pizarro, Mike Laszkiewicz, Shawkat Hesso, Dorothea Kolossa, Asja Fischer

First: 2024-11-21T10:55:49+00:00 · Latest: 2025-09-04T12:43:52+00:00

Abstract

As speech generation technologies continue to advance in quality and accessibility, the risk of malicious use cases, including impersonation, misinformation, and spoofing, increases rapidly. This work addresses this threat by introducing a simple, training-free, yet effective approach for detecting AI-generated speech and attributing it to its source model. Specifically, we tackle three key tasks: (1) single-model attribution in an open-world setting, where the goal is to determine whether a given audio sample was generated by a specific target neural speech synthesis system (with access only to data from that system); (2) multi-model attribution in a closed-world setting, where the objective is to identify the generating system from a known pool of candidates; and (3) detection of synthetic versus real speech. Our approach leverages standardized average residuals, defined as the difference between an input audio signal and its filtered version using either a low-pass filter or the EnCodec audio autoencoder. We demonstrate that these residuals consistently capture artifacts introduced by diverse speech synthesis systems, serving as distinctive, model-agnostic fingerprints for attribution. Across extensive experiments, our approach achieves AUROC scores exceeding 99% in most scenarios, evaluated on augmented benchmark datasets that pair real speech with synthetic audio generated by multiple synthesis systems. In addition, our robustness analysis underscores the method's ability to maintain high performance even in the presence of moderate additive noise. Due to its simplicity, efficiency, and strong generalization across speech synthesis systems and languages, this technique offers a practical tool for digital forensics and security applications.
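
The residual idea can be demonstrated on synthetic signals. The sketch below subtracts a low-pass filtered copy from a signal and compares residual energy with and without a faint high-frequency component. The moving-average filter, the sine-wave signals, and the energy comparison are all assumptions for illustration; the averaging and standardization steps that turn residuals into the paper's fingerprints are omitted because the abstract does not specify them.

```python
import math

def moving_average(signal, width=5):
    """Crude low-pass filter: mean over a centered window (truncated at the edges)."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def residual(signal, width=5):
    """Input signal minus its low-pass filtered version."""
    smooth = moving_average(signal, width)
    return [s - m for s, m in zip(signal, smooth)]

def energy(xs):
    return sum(x * x for x in xs) / len(xs)

n = 400
# Slowly varying component standing in for natural speech content.
slow = [math.sin(2 * math.pi * 2 * i / n) for i in range(n)]
# Faint high-frequency component standing in for a synthesis artifact.
artifact = [0.1 * math.sin(2 * math.pi * 80 * i / n) for i in range(n)]

clean_energy = energy(residual(slow))
synthetic_energy = energy(residual([s + a for s, a in zip(slow, artifact)]))
```

The filter removes almost all of the slow component, so the residual of the second signal is dominated by the injected artifact even though it is ten times weaker than the main content in the raw waveform.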

中文标题/摘要

标题：揭示合成语音：通过音频指纹检测和归因于AI生成语音的模型

随着语音生成技术在质量和可访问性方面的不断进步，恶意使用案例，包括冒充、误导和欺诈，的风险迅速增加。本研究通过引入一种简单、无需训练且有效的检测方法和归因方法来应对这一威胁，该方法可用于检测AI生成的语音并将其归因于其来源模型。具体而言，我们解决了三个关键任务：（1）开放世界中的单模型归因，目标是在仅访问该系统数据的情况下确定给定音频样本是否由特定目标神经语音合成系统生成；（2）封闭世界中的多模型归因，目标是从已知候选池中识别生成系统；最后但同样重要的是（3）合成语音与真实语音的检测。我们的方法利用标准化平均残差——输入音频信号与其使用低通滤波器或EnCodec音频自编码器进行滤波后的版本之间的差异。我们证明这些残差能够一致地捕捉到由多种语音合成系统引入的特征，作为区分的、模型无关的指纹用于归因。在广泛的实验中，我们的方法在大多数场景中实现了超过99%的AUROC分数，评估基于扩展基准数据集，该数据集将真实语音与由多个合成系统生成的合成音频配对。此外，我们的鲁棒性分析强调了该方法即使在存在中等附加噪声的情况下仍能保持高性能的能力。由于其简单性、效率以及在语音合成系统和语言方面的强大泛化能力，该技术为数字取证和安全应用提供了一种实用工具。

Summary / 总结

This study addresses the risk of malicious use of speech generation technologies by developing a training-free approach for detecting and attributing AI-generated speech. The method uses standardized average residuals to identify artifacts introduced by different speech synthesis systems, serving as distinctive fingerprints. Experiments show high AUROC scores exceeding 99% across various scenarios, and the technique remains robust under moderate noise conditions.

该研究提出了一种无需训练的方法，通过标准化平均残差来检测和归因AI生成的语音。该方法能够区分真实和合成语音，并在开放和封闭世界设置中识别出生成模型。实验结果显示，该方法在大多数场景下的AUROC得分超过99%，即使在有中等噪声的情况下也能保持高性能，显示出其实用性，适用于数字取证和安全领域。

TAGAL: Tabular Data Generation using Agentic LLM Methods

Authors: Benoît Ronval, Pierre Dupont, Siegfried Nijssen

First: 2025-09-04T12:25:14+00:00 · Latest: 2025-09-04T12:25:14+00:00

Abstract

Generating data is a common approach to improving the performance of machine learning tasks, including the training of classification models. In this paper, we present TAGAL, a collection of methods able to generate synthetic tabular data using an agentic workflow. The methods leverage Large Language Models (LLMs) for an automatic and iterative process that uses feedback to improve the generated data without any further LLM training. The use of LLMs also allows for the addition of external knowledge in the generation process. We evaluate TAGAL across diverse datasets and different aspects of quality for the generated data. We look at the utility of downstream ML models, both by training classifiers on synthetic data only and by combining real and synthetic data. Moreover, we compare the similarities between the real and the generated data. We show that TAGAL is able to perform on par with state-of-the-art approaches that require LLM training and generally outperforms other training-free approaches. These findings highlight the potential of agentic workflows and open new directions for LLM-based data generation methods.

中文标题/摘要

标题：TAGAL：使用代理型LLM方法生成表格数据

数据生成是提高机器学习任务性能的常见方法，其中也包括分类模型的训练。本文介绍了TAGAL，一种能够使用代理型工作流生成合成表格数据的方法。该方法利用大型语言模型（LLMs）进行自动且迭代的过程，通过反馈不断改进生成的数据，而无需进一步训练LLM。使用LLMs还允许在生成过程中添加外部知识。我们通过多种数据集和生成数据的不同质量方面评估了TAGAL。我们不仅通过仅使用合成数据训练分类器，还通过结合真实和合成数据来评估下游机器学习模型的实用性。此外，我们还比较了真实数据和生成数据之间的相似性。结果显示，TAGAL能够与需要训练LLM的最新方法相媲美，并且通常优于其他无需训练的方法。这些发现突显了代理型工作流的潜力，并为基于LLM的数据生成方法开辟了新的方向。

Summary / 总结

TAGAL is a method for generating synthetic tabular data using an agentic workflow that leverages Large Language Models (LLMs) for an automatic and iterative process. The method improves generated data through feedback without further LLM training and incorporates external knowledge. Experiments across various datasets show that TAGAL performs comparably to state-of-the-art approaches requiring LLM training and outperforms other training-free methods in terms of downstream model utility and data similarity.

TAGAL 是一种使用大型语言模型（LLMs）的自动工作流生成合成表格数据的方法。该方法利用LLMs进行一个自动迭代的过程，通过反馈改进生成的数据，无需进一步的LLM训练。该方法在多种数据集和数据质量方面进行了评估，结果显示TAGAL在生成用于机器学习模型的数据方面与需要LLM训练的先进方法表现相当，并且优于其他无需训练的方法。

MUNBa: Machine Unlearning via Nash Bargaining

Authors: Jing Wu, Mehrtash Harandi

First: 2024-11-23T12:18:28+00:00 · Latest: 2025-09-04T11:00:46+00:00

Abstract

Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.

中文标题/摘要

标题：MUNBa: 机器去学习通过纳什讨价还价

机器去学习（MU）旨在从模型中选择性地删除有害行为，同时保留模型的整体效用。作为多任务学习问题，MU涉及平衡忘记特定概念/数据和保持整体性能的目标。简单地将这些忘记和保留目标结合起来可能导致梯度冲突和支配，阻碍MU算法达到最优解。为了解决梯度冲突和支配问题，我们将MU重新表述为一个两玩家合作博弈，其中两个玩家，即忘记玩家和保留玩家，通过他们的梯度提案来最大化他们的整体收益并平衡他们的贡献。为此，借鉴纳什讨价还价理论，我们推导出一个闭式解来引导模型向帕累托稳定点发展。我们对MU的表述保证了一个均衡解，在此解中，任何偏离最终状态都会导致两个玩家的整体目标减少，确保每个目标的最优性。我们在图像分类和图像生成等多种任务上评估了我们算法的有效性。广泛的实验使用ResNet、视觉-语言模型CLIP和文本到图像扩散模型表明，我们的方法优于最先进的MU算法，实现了更好的忘记与保留之间的权衡。我们的结果还突显了忘记精度、保持泛化能力和对抗攻击鲁棒性的改进。

Summary / 总结

MUNBa reformulates Machine Unlearning as a two-player cooperative game using Nash Bargaining to address gradient conflicts. It derives a closed-form solution to guide the model towards a Pareto stationary point, ensuring optimality in both forgetting and preserving objectives. Experiments on various tasks show MUNBa outperforms existing methods in achieving a better trade-off between forgetting and preserving, with improvements in forgetting precision and robustness against adversarial attacks.

MUNBa将机器遗忘（MU）重新表述为一个基于纳什讨价还价理论的两人合作博弈，以解决梯度冲突和支配问题。这种方法确保了双方遗忘和保留目标的优化。实验表明，MUNBa在图像分类和生成任务中优于现有MU算法，实现了更好的遗忘与保留之间的权衡，并提高了遗忘精度和对抗攻击的鲁棒性。

SMooGPT: Stylized Motion Generation using Large Language Models

Authors: Lei Zhong, Yi Yang, Changjian Li

First: 2025-09-04T09:41:18+00:00 · Latest: 2025-09-04T09:41:18+00:00

Abstract

Stylized motion generation is actively studied in computer graphics, especially benefiting from the rapid advances in diffusion models. The goal of this task is to produce a novel motion respecting both the motion content and the desired motion style, e.g., "walking in a loop like a Monkey". Existing research attempts to address this problem via motion style transfer or conditional motion generation. These methods typically embed the motion style into a latent space and guide the motion implicitly in a latent space as well. Despite the progress, they suffer from low interpretability and control, generalize poorly to new styles, and fail to produce motions other than "walking" due to the strong bias in the public stylization dataset. In this paper, we propose to solve the stylized motion generation problem from a new perspective of reasoning-composition-generation, based on our observations: i) human motion can often be effectively described using natural language in a body-part centric manner, ii) LLMs exhibit a strong ability to understand and reason about human motion, and iii) human motion has an inherently compositional nature, facilitating the generation of new motion content or styles via effective recomposition. We thus propose utilizing body-part text space as an intermediate representation, and present SMooGPT, a fine-tuned LLM, acting as a reasoner, composer, and generator when generating the desired stylized motion. Our method executes in the body-part text space with much higher interpretability, enabling fine-grained motion control, effectively resolving potential conflicts between motion content and style, and generalizes well to new styles thanks to the open-vocabulary ability of LLMs. Comprehensive experiments and evaluations, and a user perceptual study, demonstrate the effectiveness of our approach, especially for purely text-driven stylized motion generation.

中文标题/摘要

标题：SMooGPT：使用大型语言模型的风格化运动生成

风格化运动生成在计算机图形学中得到了积极的研究，特别得益于扩散模型的迅速发展。该任务的目标是生成既尊重运动内容又符合所需运动风格的新运动，例如“像猴子一样环形行走”。现有研究试图通过运动风格转换或条件运动生成来解决这一问题。它们通常将运动风格嵌入到潜在空间中，并在潜在空间中隐式地引导运动。尽管取得了进展，但它们的方法在可解释性和控制性方面较低，难以泛化到新的风格，并且由于公共风格化数据集中的强烈偏见，无法生成除“行走”之外的运动。在本文中，我们从推理-组合-生成的新视角出发，解决风格化运动生成问题，基于我们的观察：i) 人体运动往往可以用自然语言在以身体部位为中心的方式进行有效描述，ii) 大型语言模型表现出强大的理解与推理人体运动的能力，iii) 人体运动具有固有的组合性质，有助于通过有效的重组生成新的运动内容或风格。因此，我们提出使用身体部位文本空间作为中间表示，并提出SMooGPT，这是一种微调后的大型语言模型，在生成所需风格化运动时充当推理者、组合者和生成者。我们的方法在身体部位文本空间中执行，具有更高的可解释性，能够实现精细的运动控制，有效解决运动内容与风格之间的潜在冲突，并由于大型语言模型的开放式词汇能力，能够很好地泛化到新的风格。全面的实验和评估，以及用户感知研究，证明了我们方法的有效性，特别是在纯文本驱动的风格化运动生成中。

Summary / 总结

The research aims to generate stylized motions that respect both motion content and style, such as 'walking like a Monkey'. Existing methods use motion style transfer or conditional motion generation, embedding styles in latent spaces, but suffer from low interpretability and limited generalization. This paper introduces SMooGPT, which leverages large language models (LLMs) to reason, compose, and generate motions in a body-part text space, offering higher interpretability and better control over motion generation. Experiments show that SMooGPT effectively handles potential conflicts between motion content and style and generalizes well to new styles.

研究旨在生成既符合动作内容又符合特定风格的动作，例如“像猴子一样行走”。现有方法使用动作风格转换或条件动作生成，将风格嵌入到潜在空间中，但存在可解释性低和泛化能力有限的问题。本文提出了SMooGPT，利用大型语言模型（LLMs）在身体部位文本空间中进行推理、合成和生成动作，提供更高的可解释性和更好的动作生成控制。实验表明，SMooGPT能够有效解决动作内容和风格之间的潜在冲突，并且能够很好地泛化到新的风格。

DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model

Authors: Qian Chen, Xianyin Zhang, Lifan Guo, Feng Chen, Chi Zhang

First: 2025-08-18T03:28:57+00:00 · Latest: 2025-09-04T08:05:29+00:00

Abstract

Recent advances in large vision-language models (LVLMs) have enabled a new paradigm of end-to-end document image parsing, excelling in Optical Character Recognition (OCR) tasks such as text, table, and formula recognition. However, generative LVLMs, like large language models (LLMs), are prone to hallucinations, that is, generating words that do not exist in the input images. Furthermore, LVLMs are designed for general purposes and tend to be less effective on OCR tasks than expert models trained on domain-specific datasets. In this paper, we propose DianJin-OCR-R1, a reasoning-enhanced framework designed to address these limitations by training reasoning-and-tool interleaved VLMs. Given a recognition instruction, our DianJin-OCR-R1 model first recognizes the content in the input image using its own OCR capabilities, then calls other tools (i.e., other expert models) to obtain their results as references, and finally "looks again" at the image and rethinks its reasoning process to provide the final recognized content. Because the architectures of expert models are tailored to specific OCR tasks, which makes them less prone to hallucinations, their results can help VLMs mitigate hallucinations. We evaluate our model on ReST and OmniDocBench, and experimental results show that our DianJin-OCR-R1 models consistently outperform their non-reasoning counterparts and expert OCR models, which demonstrates the effectiveness of our method. Additionally, the results indicate that enhancing expert models, which are typically small and easy to iterate on, enables performance improvements for VLMs.
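
The three-step flow described in this abstract (recognize, consult expert tools, look again) can be sketched roughly as below. This is only an illustration under stated assumptions: `recognize_with_tools`, `vlm_ocr`, `expert_tools`, and `vlm_rethink` are hypothetical names, the stubs stand in for real models, and in the paper a single trained VLM interleaves reasoning and tool calls rather than invoking three separate functions.

```python
from typing import Callable, Dict

def recognize_with_tools(
    image: object,
    vlm_ocr: Callable[[object], str],
    expert_tools: Dict[str, Callable[[object], str]],
    vlm_rethink: Callable[[object, str, Dict[str, str]], str],
) -> str:
    """Hypothetical sketch of a recognize -> call tools -> look again flow."""
    # Step 1: the VLM produces its own first-pass recognition.
    draft = vlm_ocr(image)
    # Step 2: expert OCR models are called as tools; outputs serve as references.
    references = {name: tool(image) for name, tool in expert_tools.items()}
    # Step 3: the VLM revisits the image with its draft and the references.
    return vlm_rethink(image, draft, references)

# Toy stand-ins (assumptions, not real models).
def stub_vlm_ocr(image: object) -> str:
    return "Tota1: 42"  # first pass contains a character-level error

def stub_expert(image: object) -> str:
    return "Total: 42"

def stub_rethink(image: object, draft: str, references: Dict[str, str]) -> str:
    # Toy policy: keep the draft only if every reference agrees with it.
    answers = set(references.values())
    if answers and answers != {draft}:
        return next(iter(references.values()))
    return draft

final_text = recognize_with_tools(
    "page.png", stub_vlm_ocr, {"expert_ocr": stub_expert}, stub_rethink
)
print(final_text)
```

The toy `stub_rethink` policy simply defers to the reference; a real model would weigh the draft, the references, and the image itself.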

Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection

Authors: Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen

First: 2025-09-04T07:39:18+00:00 · Latest: 2025-09-04T07:39:18+00:00

Abstract

Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on image modality, limiting feature representation, change pattern modeling, and generalization, especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision-language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine-grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image-Text Feature Fusion (ITFF) module that enables deep cross-modal integration. Extensive experiments on LEVIR-CD, WHU-CD, and SYSU-CD demonstrate that MMChange consistently surpasses state-of-the-art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.

中文标题/摘要

标题：基于文本差异增强的多模态特征融合网络在遥感变化检测中的应用

尽管深度学习已推动遥感变化检测（RSCD）的发展，但大多数方法仅依赖图像模态，限制了特征表示、变化模式建模和泛化能力，尤其是在光照和噪声干扰下。为解决这一问题，我们提出了一种MMChange方法，该方法结合图像和文本模态以提高准确性和鲁棒性。引入了图像特征精炼（IFR）模块以突出关键区域并抑制环境噪声。为克服图像特征的语义限制，我们采用视觉语言模型（VLM）生成双时相图像的语义描述。随后，文本差异增强（TDE）模块捕捉细微的语义变化，引导模型关注有意义的变化。为弥合模态之间的异质性，我们设计了图像文本特征融合（ITFF）模块，实现深层次的跨模态集成。在LEVIRCD、WHUCD和SYSUCD上的广泛实验表明，MMChange在多个指标上始终超越了现有方法，验证了其在多模态RSCD中的有效性。代码可在：https://github.com/yikuizhai/MMChange 获取。

Summary / 总结

The research aims to improve the accuracy and robustness of remote sensing change detection by integrating image and text modalities. It introduces MMChange, which includes an IFR module for image feature refinement, a VLM-based TDE module for capturing semantic shifts, and an ITFF module for cross-modal integration. Experiments show that MMChange outperforms existing methods on the LEVIR-CD, WHU-CD, and SYSU-CD datasets, confirming its effectiveness in multimodal RSCD.

研究旨在通过结合图像和文本模态来提高遥感变化检测的准确性和鲁棒性。提出的MMChange方法包括图像特征精炼模块以突出关键区域、文本差异增强模块以捕捉语义变化，以及图像文本特征融合模块以实现模态间的深度跨模态集成。在LEVIRCD、WHUCD和SYSUCD上的实验表明，MMChange在多个指标上优于现有方法，验证了其在多模态遥感变化检测中的有效性。

ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection

Authors: Zhu Wenjie, Zhang Yabin, Xin Jin, Wenjun Zeng, Lei Zhang

First: 2025-09-04T07:26:20+00:00 · Latest: 2025-09-04T07:26:20+00:00

Abstract

The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.

中文标题/摘要

标题：ANTS：通过MLLM塑造适应性负文本空间以进行OOD检测

引入负标签（NLs）已被证明能有效提升Out-of-Distribution (OOD)检测。然而，现有方法往往缺乏对OOD图像的理解，难以构建准确的负空间。此外，假负标签的存在显著降低了其近OOD性能。为解决这些问题，我们提出利用多模态大语言模型（MLLM）的理解和推理能力，塑造适应性负文本空间（ANTS）。具体而言，我们识别出可能为OOD样本的图像作为负图像，并提示MLLM描述这些图像，生成能够精确描述OOD分布的表达性负句子，从而增强远OOD检测。对于近OOD设置，其中OOD样本与分布内（ID）子集相似，我们首先识别出与负图像视觉相似的ID类子集，然后利用MLLM的推理能力生成针对该子集的视觉相似负标签，有效减少假负标签并提高近OOD检测。为了平衡这两种类型的负文本空间，我们设计了一个自适应加权分数，使方法能够在无需依赖特定任务先验知识的情况下处理不同的OOD任务设置（近OOD和远OOD），使其在开放环境中具有高度适应性。在ImageNet基准测试中，我们的ANTS显著降低了FPR95，达到新的最佳水平。此外，我们的方法无需训练且零样本，具有高可扩展性。

Summary / 总结

The paper introduces ANTS, a method that uses multimodal large language models (MLLMs) to shape an adaptive negative textual space for enhancing Out-of-Distribution (OOD) detection. By identifying OOD samples and prompting MLLMs to generate descriptive negative sentences, ANTS improves far-OOD detection. For near-OOD scenarios, it generates visually similar negative labels to reduce false negatives. ANTS uses an adaptive weighted score to balance far-OOD and near-OOD settings, achieving state-of-the-art performance with a 4.2% reduction in FPR95 on ImageNet. The method is training-free and zero-shot, making it highly scalable.

该论文提出了一种名为ANTS的方法，利用多模态大语言模型（MLLMs）塑造适应性的负面文本空间，以增强Out-of-Distribution (OOD)检测。通过识别OOD样本并促使MLLM生成描述性的负面句子，ANTS提升了远OOD检测效果。对于近OOD场景，它生成视觉上相似的负面标签以减少误检。ANTS使用自适应加权分数来平衡远OOD和近OOD设置，实现了在ImageNet基准上的最新性能，FPR95降低了4.2%。该方法无需训练且为零样本，具有高度的可扩展性。
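
For readers unfamiliar with the FPR95 metric reported for ANTS: it is the false positive rate on OOD samples at the score threshold where roughly 95% of in-distribution samples are still accepted. Below is a minimal plain-Python sketch of that standard metric, not the paper's evaluation code. The function name `fpr_at_95_tpr` and the toy scores are illustrative assumptions, and the convention assumed here is that a higher score means "more in-distribution".

```python
import math
from typing import Sequence

def fpr_at_95_tpr(id_scores: Sequence[float],
                  ood_scores: Sequence[float],
                  tpr: float = 0.95) -> float:
    """Fraction of OOD samples accepted as ID at the threshold that
    accepts at least `tpr` of the ID samples (higher score = more ID-like)."""
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted to reach the target TPR.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    # OOD samples at or above the threshold are wrongly accepted as ID.
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)

# Toy scores, purely illustrative.
id_scores = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75, 0.88, 0.92]
ood_scores = [0.1, 0.65, 0.3, 0.72, 0.2]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
print(fpr95)
```

With only ten ID scores, reaching 95% TPR requires accepting all ten, so the threshold falls at the lowest ID score; real benchmarks use far larger sample sets.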

Defending LVLMs Against Vision Attacks through Partial-Perception Supervision

Authors: Qi Zhou, Tianlin Li, Qing Guo, Dongxia Wang, Yun Lin, Yang Liu, Jin Song Dong

Venue: ICML 2025

First: 2024-12-17T09:38:58+00:00 · Latest: 2025-09-04T06:43:22+00:00

Comments: Accepted to ICML 2025

Abstract

Recent studies have raised significant concerns regarding the vulnerability of Large Vision Language Models (LVLMs) to maliciously injected or perturbed input images, which can mislead their responses. Existing defense methods observe that such vision attacks are sensitive to image modifications, especially cropping, and use majority voting across the responses to modified images as the corrected response. However, these modifications often yield partial images and distort the semantics, which reduces response quality on clean images after voting. Instead of directly using responses from partial images for voting, we investigate using them to supervise the LVLM's responses to the original images. We propose a black-box, training-free method called DPS (Defense through Partial-Perception Supervision). In this approach, the model is prompted with the responses generated by a model that perceives only a partial image. With DPS, the model can adjust its response based on partial image understanding when under attack, while confidently maintaining its original response for clean input. Our findings show that the weak model can supervise the strong model: when faced with an attacked input, the strong model becomes less confident and adjusts its response based on the weak model's partial understanding, effectively defending against the attack. With clean input, it confidently maintains its original response. Empirical experiments show that our method outperforms the baseline, cutting the average attack success rate by 76.3% across six datasets on three popular models.

中文标题/摘要

标题：通过部分感知监督防御LVLM的视觉攻击

近期研究对大型视觉语言模型（LVLMs）在恶意注入或篡改输入图像时的脆弱性提出了严重关切，这些篡改可以误导其响应。现有的防御方法表明，这种视觉攻击对图像修改特别敏感，尤其是裁剪，通过在修改图像的响应中采用多数投票来获得正确的响应。然而，这些修改通常会导致部分图像，从而扭曲语义，这在投票后降低了干净图像的响应质量。我们不直接使用部分图像的响应进行投票，而是研究使用它们来监督LVLM对原始图像的响应。我们提出了一种无需训练的黑盒方法，称为DPS（通过部分感知监督的防御）。在此方法中，模型使用仅感知部分图像的模型生成的响应进行提示。通过DPS，模型在受到攻击时可以根据部分图像的理解调整其响应，同时自信地保持其原始响应以应对干净输入。我们的研究发现，弱模型可以监督强模型：面对攻击输入时，强模型变得不那么自信，并根据弱模型的部分理解调整其响应，从而有效防御攻击。在干净输入时，它自信地保持其原始响应。实验证明，我们的方法优于基线方法，在三个流行模型的六个数据集上将平均攻击成功率降低了76.3%。

Summary / 总结

The research addresses the vulnerability of Large Vision Language Models (LVLMs) to vision attacks by proposing a black-box, training-free method called DPS. DPS uses responses from a model that perceives only partial images to supervise the LVLM's responses to the original images. This method enhances the model's ability to adjust its response under attack while maintaining confidence for clean inputs. Experiments show that DPS significantly reduces the average attack success rate by 76.3% across six datasets on three popular models.

本文提出了一种名为DPS（通过部分感知监督进行防御）的方法，用于防御大型视觉语言模型（LVLM）受到的视觉攻击。该方法利用从部分图像生成的响应来监督原始图像的响应，使模型在受到攻击时能够调整其响应，同时对干净输入保持信心。实验结果显示，DPS在三个流行模型的六个数据集上将平均攻击成功率降低了76.3%。
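
The contrast drawn in this entry, between majority voting over cropped views and using partial-view answers as supervision in the prompt, can be sketched as follows. This is a loose illustration under stated assumptions, not the authors' implementation: the `LVLM` callable interface, the function names, the prompt wording, and the toy `stub_lvlm` are all hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical interface: an LVLM is any callable (image, prompt) -> answer text.
LVLM = Callable[[object, str], str]

def majority_vote_defense(lvlm: LVLM, crops: Sequence[object], question: str) -> str:
    """Baseline from the abstract: answer each cropped image separately
    and return the most common answer."""
    answers = [lvlm(crop, question) for crop in crops]
    return Counter(answers).most_common(1)[0][0]

def partial_perception_supervision(lvlm: LVLM, image: object,
                                   crops: Sequence[object], question: str) -> str:
    """DPS-style idea: answers from partial views are passed as hints,
    and the model answers again while seeing the full image."""
    hints = [lvlm(crop, question) for crop in crops]
    hint_text = "\n".join(f"- {h}" for h in hints)
    prompt = (  # illustrative wording, not the paper's prompt
        f"{question}\n"
        "Another assistant that saw only part of the image answered:\n"
        f"{hint_text}\n"
        "Reconsider the full image and give your final answer."
    )
    return lvlm(image, prompt)

def stub_lvlm(image: object, prompt: str) -> str:
    # Toy stand-in: the attacked full image yields a wrong answer unless
    # the prompt carries hints from partial views.
    if image == "full_attacked":
        return "cat" if "only part of the image" in prompt else "dog"
    return "cat"

undefended = stub_lvlm("full_attacked", "What animal is shown?")
voted = majority_vote_defense(stub_lvlm, ["crop1", "crop2", "crop3"], "What animal is shown?")
supervised = partial_perception_supervision(
    stub_lvlm, "full_attacked", ["crop1", "crop2"], "What animal is shown?"
)
print(undefended, voted, supervised)
```

The key design difference is that the supervised variant still shows the model the full image, so on clean input it can keep its original answer rather than inheriting errors from cropped views.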

Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model

Authors: Phuoc-Nguyen Bui, Khanh-Binh Nguyen, Hyunseung Choo

Venue: ICCV 2025

First: 2025-09-04T05:42:02+00:00 · Latest: 2025-09-04T05:42:02+00:00

Comments: ICCV 2025 - LIMIT Workshop

Abstract

Contrastive vision-language models excel in zero-shot image recognition but face challenges in few-shot scenarios due to computationally intensive offline fine-tuning using prompt learning, which risks overfitting. To overcome these limitations, we propose Attn-Adapter, a novel online few-shot learning framework that enhances CLIP's adaptability via a dual attention mechanism. Our design incorporates dataset-specific information through two components: the Memory Attn-Adapter, which refines category embeddings using support examples, and the Local-Global Attn-Adapter, which enriches image embeddings by integrating local and global features. This architecture enables dynamic adaptation from a few labeled samples without retraining the base model. Attn-Adapter outperforms state-of-the-art methods in cross-category and cross-dataset generalization, maintaining efficient inference and scaling across CLIP backbones.

中文标题/摘要

标题：Attn-Adapter：无需离线微调的视觉-语言模型在线少样本学习者

对比视觉-语言模型在零样本图像识别中表现出色，但在少样本场景中由于使用提示学习进行计算密集型离线微调而面临过拟合风险。为克服这些限制，我们提出了一种名为Attn-Adapter的新颖在线少样本学习框架，通过双重注意力机制增强CLIP的适应性。我们的设计通过两个组件整合了数据集特定的信息：Memory Attn-Adapter，通过支持样本细化类别嵌入；Local-Global Attn-Adapter，通过整合局部和全局特征丰富图像嵌入。该架构能够在少量标记样本下实现动态适应，而无需重新训练基础模型。Attn-Adapter在跨类别和跨数据集泛化方面优于现有方法，同时保持高效的推理并适用于各种CLIP基础模型。

Summary / 总结

The research aims to address the limitations of contrastive vision-language models in few-shot scenarios by proposing Attn-Adapter, a novel online few-shot learning framework. It enhances CLIP's adaptability through a dual attention mechanism, incorporating dataset-specific information via Memory Attn-Adapter and Local-Global Attn-Adapter. The framework allows dynamic adaptation from a few labeled samples without retraining the base model, outperforming state-of-the-art methods in cross-category and cross-dataset generalization while maintaining efficient inference and scalability across CLIP backbones.

研究旨在通过提出Attn-Adapter，增强CLIP的适应性，解决视觉-语言模型在少样本学习中的挑战。该方法使用Memory Attn-Adapter通过支持示例细化类别嵌入，使用Local-Global Attn-Adapter整合局部和全局特征到图像嵌入中。实验结果显示，Attn-Adapter在跨类别和跨数据集泛化方面优于现有最佳方法，同时保持高效的推理和在CLIP骨干网络上的可扩展性。
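
The idea of refining category embeddings with support examples can be illustrated with a bare cross-attention step: the class embedding acts as the query and the support image embeddings act as keys and values. The sketch below is a simplified assumption-laden illustration, not the Memory Attn-Adapter itself. It omits any learned projections, and `refine_class_embedding` and the blending weight `alpha` are hypothetical.

```python
import math
from typing import List

Vector = List[float]

def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def refine_class_embedding(class_emb: Vector, support_embs: List[Vector],
                           alpha: float = 0.5) -> Vector:
    """Scaled dot-product cross-attention (query = class embedding,
    keys/values = support embeddings) followed by a residual blend.
    `alpha` is a hypothetical mixing weight."""
    scale = math.sqrt(len(class_emb))
    weights = softmax([dot(class_emb, s) / scale for s in support_embs])
    attended = [sum(w * s[i] for w, s in zip(weights, support_embs))
                for i in range(len(class_emb))]
    return [(1 - alpha) * c + alpha * a for c, a in zip(class_emb, attended)]

# Toy 2-D embeddings, purely illustrative.
class_emb = [1.0, 0.0]
supports = [[0.0, 1.0], [1.0, 0.0]]
refined = refine_class_embedding(class_emb, supports, alpha=1.0)
print(refined)
```

Because the base embeddings are only blended with attended support features, no backbone weights change, which is consistent with the abstract's claim of adaptation without retraining the base model.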