arXiv 论文速递

2025-08-29 03:55
Snapshot: 20250829_0355
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
Authors: Peng-Hao Hsu, Ke Zhang, Fu-En Wang, Tao Tu, Ming-Feng Li, Yu-Lun Liu, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo
First: 2025-08-27T17:17:00+00:00 · Latest: 2025-08-27T17:17:00+00:00
Comments: ICCV2025
Abstract
Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. Training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images as input and demonstrates superior accuracy and speed (0.3 sec. per scene) on ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. We outperform a strong two-stage method that leverages our class-agnostic detector with a ViT CLIP-based OV classifier and a baseline incorporating a multi-view depth estimator on both accuracy and speed.
Segmentation Assisted Incremental Test Time Adaptation in an Open World
Authors: Manogna Sreenivas, Soma Biswas
First: 2025-08-27T16:33:32+00:00 · Latest: 2025-08-27T16:33:32+00:00
Comments: Accepted at BMVC 2025
Abstract
In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of the deployed trained models. This work addresses Incremental Test Time Adaptation of Vision Language Models, tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation assisted active labeling module, termed SegAssist, which is training free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/
中文标题/摘要
标题:开放世界中的分割辅助增量测试时适应
在动态环境中,经常遇到不熟悉的对象和分布变化,这挑战了部署模型的泛化能力。本研究针对视觉语言模型的增量测试时适应问题,处理测试过程中不断出现的未见类别和未见领域的情况。与传统的测试时适应方法不同,后者仅从预定义的类别集合中获取测试流,我们的框架允许模型同时适应协变量和标签的变化,积极地将新类别纳入其中。为此,我们为增量测试时适应(ITTA)建立了新的基准,将面向视觉语言模型的单张图像测试时适应方法与主动标注技术结合,在测试时向 oracle 查询可能代表未见类别的样本。我们提出了一种分割辅助主动标注模块,称为SegAssist,该模块无需训练,重新利用视觉语言模型的分割能力来细化主动样本选择,优先选择可能属于未见类别的样本。在多个基准数据集上的广泛实验表明,SegAssist能够增强视觉语言模型在现实世界场景中的性能,其中持续适应新兴数据至关重要。项目页面:https://manogna-s.github.io/segassist/
Summary / 总结
This work addresses Incremental Test Time Adaptation (ITTA) for Vision Language Models (VLMs) in dynamic environments where unseen classes and domains emerge continuously. It introduces SegAssist, a training-free, segmentation-assisted active labeling module that prioritizes samples likely to belong to unseen classes when querying an oracle. Experiments on several benchmark datasets show that SegAssist improves VLM performance in real-world scenarios requiring continuous adaptation to new data.
该研究针对动态环境中不断出现的未见过的类别和领域,研究视觉语言模型(VLMs)的增量测试时适应(ITTA)问题。提出了一个无需训练的分割辅助主动标注模块SegAssist,该模块在向 oracle 查询时优先选择可能属于未见过类别的样本,帮助VLMs适应新类别。在多个基准数据集上的实验表明,SegAssist能够提高VLMs在需要持续适应新兴数据的现实场景中的性能。
SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
Authors: Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo
First: 2025-08-27T16:27:19+00:00 · Latest: 2025-08-27T16:27:19+00:00
Comments: 28 pages, 12 figures
Abstract
The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
中文标题/摘要
标题:SWIRL:面向移动GUI控制的交错强化学习分阶段工作流
大型视觉语言模型(LVLM)和代理系统的快速发展激发了对能够可靠地将自然语言转换为界面操作的移动GUI代理的兴趣。然而,现有的单智能体方法仍然受到结构限制的限制。尽管多智能体系统自然地解耦了不同的能力,但最近多智能体强化学习(MARL)的进步往往受到效率低下的阻碍,并且与当前的LVLM架构不兼容。为了解决这些挑战,我们提出了SWIRL,这是一种为多智能体系统设计的交错强化学习的分阶段工作流。SWIRL将MARL重新表述为一系列单智能体强化学习任务,每次更新一个智能体,同时保持其他智能体不变。这种表述形式使训练更加稳定,并促进了智能体之间的高效协调。理论上,我们提供了逐步安全性边界、跨轮次单调改进定理以及回报收敛保证,确保了稳健和原则性的优化。在移动GUI控制的应用中,SWIRL 实现了一个导航器,将语言和屏幕上下文转换为结构化的计划,以及一个执行器,将这些计划转化为可执行的原子动作。广泛的实验表明,SWIRL 在高阶和低阶GUI基准测试中均表现出色。除了GUI任务,SWIRL 还展示了强大的多智能体数学推理能力,突显了其作为开发高效和稳健的多智能体系统的一般框架的潜力。
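The interleaved schedule described in the SWIRL abstract (update one agent at a time while keeping the others fixed) can be illustrated with a toy example. This is a minimal sketch of the scheduling idea only, not the paper's reinforcement learning procedure: the scalar "policies", the quadratic `team_return`, the learning rate, and the finite-difference gradient are all hypothetical stand-ins chosen so the code runs with the standard library alone.

```python
def team_return(params):
    """Hypothetical shared return for two agents with scalar parameters."""
    nav, inter = params["navigator"], params["interactor"]
    return -(nav - 1.0) ** 2 - (inter - 2.0) ** 2 - 0.5 * nav * inter


def update_one_agent(params, name, lr=0.1, eps=1e-5):
    """One gradient-ascent step on a single agent; every other agent stays frozen."""
    up, down = dict(params), dict(params)
    up[name] = params[name] + eps
    down[name] = params[name] - eps
    # Central finite difference with respect to the selected agent only.
    grad = (team_return(up) - team_return(down)) / (2.0 * eps)
    updated = dict(params)
    updated[name] = params[name] + lr * grad
    return updated


def interleaved_training(params, rounds=20):
    """Alternate single-agent updates and record the return after each one."""
    history = [team_return(params)]
    for _ in range(rounds):
        for name in ("navigator", "interactor"):
            params = update_one_agent(params, name)
            history.append(team_return(params))
    return params, history


final_params, history = interleaved_training({"navigator": 0.0, "interactor": 0.0})
```

In this toy setting each single-agent step can only raise the shared return, which loosely mirrors the cross-round monotonic improvement the abstract claims; the paper's actual guarantees rest on its own assumptions.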
Diffusion Language Models Know the Answer Before Decoding
Authors: Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, Shiwei Liu
First: 2025-08-27T15:40:25+00:00 · Latest: 2025-08-27T15:40:25+00:00
Abstract
Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified by the halfway point of refinement, before the final decoding step, under both semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.
中文标题/摘要
标题:扩散语言模型在解码前就知道答案
扩散语言模型(DLMs)最近作为一种替代自回归方法出现,提供并行序列生成和灵活的标记顺序。然而,其推理速度仍慢于自回归模型,主要由于双向注意的成本和生成高质量输出所需的大量细化步骤。在这项工作中,我们强调并利用了DLMs一个被忽视的特性,即早期答案收敛:在许多情况下,正确的答案在细化步骤进行到一半时就已能在内部被识别,早于最终解码步骤,无论是半自回归还是随机遮盖调度。例如,在GSM8K和MMLU上,分别有多达97%和99%的实例仅使用一半的细化步骤即可正确解码。基于这一观察,我们引入了Prophet,这是一种无需训练的快速解码范式,可实现早期提交解码。具体而言,Prophet动态决定是否继续细化或“全押”(即一次性解码剩余所有标记),使用前两名预测候选之间的置信度差距作为标准。它无缝集成到现有的DLM实现中,几乎不增加开销,并不需要额外的训练。对LLaDA-8B和Dream-7B在多个任务上的实证评估显示,Prophet将解码步骤减少多达3.4倍,同时保持高质量生成。这些结果将DLM解码重新定义为何时停止采样的问题,并证明了早期解码收敛提供了一种简单而强大的机制,用于加速DLM推理,补充现有的加速技术。我们的代码可在https://github.com/pixeli99/Prophet上公开获取。
Summary / 总结
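Diffusion language models (DLMs) often converge on the correct answer well before the final decoding step: on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Prophet, a training-free decoding paradigm, exploits this by using the confidence gap between the top-2 prediction candidates to decide whether to keep refining or to decode all remaining tokens in one step. On LLaDA-8B and Dream-7B, it reduces the number of decoding steps by up to 3.4x while preserving high generation quality.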
研究通过利用扩散语言模型(DLMs)的早期答案收敛特性,解决了其推理速度慢的问题。研究引入了Prophet,这是一种无需训练的快速解码范式,根据前两个预测候选之间的置信度差距来决定是否继续细化或一次性解码剩余的所有标记。实验表明,Prophet可以在保持高质量生成的同时,将解码步骤减少多达3.4倍,从而加速DLM推理,且无需额外的训练开销。
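The early-commit criterion in the Prophet abstract (the confidence gap between the top-2 prediction candidates) can be sketched as follows. This is an illustrative sketch, not the released implementation: measuring the gap on softmax probabilities, requiring every masked position to clear the threshold, and the threshold value itself are assumptions, and `top2_gap` and `should_commit` are hypothetical helper names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gap(logits):
    """Probability gap between the best and second-best candidate token."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def should_commit(masked_logits, threshold):
    """Assumed rule: go "all-in" once every masked position is confident enough."""
    return all(top2_gap(logits) >= threshold for logits in masked_logits)


confident = should_commit([[5.0, 0.0, 0.0]], threshold=0.5)
uncertain = should_commit([[1.0, 0.9, 0.0]], threshold=0.5)
```

A decoding loop would call `should_commit` after each refinement step and, when it returns true, take the argmax at every remaining masked position in a single step instead of continuing to refine.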
GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
Authors: Seongheon Park, Yixuan Li
First: 2025-08-27T15:30:06+00:00 · Latest: 2025-08-27T15:30:06+00:00
Abstract
Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
中文标题/摘要
标题:GLSim:通过全局-局部相似性检测LVLM中的对象幻觉
大型视觉-语言模型中的对象幻觉对其实用场景的安全部署构成了重大挑战。近期研究提出了对象级幻觉评分以估计对象幻觉的可能性;然而,这些方法通常孤立地采用全局或局部视角,这可能限制了检测的可靠性。本文介绍了一种名为GLSim的新颖无训练框架,该框架利用图像和文本模态之间互补的全局和局部嵌入相似性信号,能够在多种场景下实现更准确和可靠的幻觉检测。我们全面评估了现有的对象幻觉检测方法,并证明GLSim在检测性能上优于竞争基线,具有显著优势。
Summary / 总结
GLSim is a training-free framework designed to detect object hallucinations in large vision-language models by combining global and local embedding similarity signals between image and text modalities. This approach enhances the accuracy and reliability of hallucination detection across various scenarios. GLSim outperforms existing methods, achieving superior detection performance compared to competitive baselines.
GLSim 是一个无需训练的框架,通过结合图像和文本模态之间的全局和局部嵌入相似性信号来检测大型视觉-语言模型中的对象幻觉。这种方法相比仅关注全局或局部视角的方法,提高了幻觉检测的准确性和可靠性。GLSim 在全面的基准测试中表现出色,其检测性能显著优于现有基线,在多种场景中表现更优。
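The GLSim abstract does not give a scoring formula, so the following is only a loose sketch of what combining a global and a local similarity signal could look like. Every detail is an assumption made for illustration: the pooled image embedding as the global view, the best-matching patch as the local view, the equal weighting, and the function name `global_local_score`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_local_score(image_emb, patch_embs, object_text_emb, weight=0.5):
    """Hypothetical blend: low values suggest the mentioned object may be hallucinated."""
    global_sim = cosine(image_emb, object_text_emb)                 # whole-image view
    local_sim = max(cosine(p, object_text_emb) for p in patch_embs)  # best patch
    return weight * global_sim + (1.0 - weight) * local_sim


score = global_local_score(
    image_emb=[1.0, 0.0],
    patch_embs=[[0.0, 1.0], [1.0, 0.0]],
    object_text_emb=[0.0, 1.0],
)
```

In this toy case the global view misses the object while one patch matches it exactly, which is the kind of disagreement a combined score is meant to capture.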
Assessing the Geolocation Capabilities, Limitations and Societal Risks of Generative Vision-Language Models
Authors: Oliver Grainge, Sania Waheed, Jack Stilgoe, Michael Milford, Shoaib Ehsan
Venue: AAAI
First: 2025-08-27T15:21:31+00:00 · Latest: 2025-08-27T15:21:31+00:00
Comments: Accepted to AAAI Fall Symposium 2025 on AI Trustworthiness and Risk Assessment for Challenging Contexts (ATRACC)
Abstract
Geo-localization is the task of identifying the location of an image using visual cues alone. It has beneficial applications, such as improving disaster response, enhancing navigation, and geography education. Vision-Language Models (VLMs) are increasingly demonstrating capabilities as accurate image geo-locators. This brings significant privacy risks, including those related to stalking and surveillance, considering the widespread uses of AI models and sharing of photos on social media. The precision of these models is likely to improve in the future. Despite these risks, there is little work on systematically evaluating the geolocation precision of Generative VLMs, their limits and potential for unintended inferences. To bridge this gap, we conduct a comprehensive assessment of the geolocation capabilities of 25 state-of-the-art VLMs on four benchmark image datasets captured in diverse environments. Our results offer insight into the internal reasoning of VLMs and highlight their strengths, limitations, and potential societal risks. Our findings indicate that current VLMs perform poorly on generic street-level images yet achieve notably high accuracy (61%) on images resembling social media content, raising significant and urgent privacy concerns.
中文标题/摘要
标题:评估生成型视觉语言模型的地理定位能力、局限性和社会风险
地理定位是指仅通过视觉线索识别图像位置的任务。它具有有益的应用,如提高灾害响应、增强导航和地理教育。近年来,视觉语言模型(VLMs)越来越显示出作为准确图像地理定位器的能力。这带来了重大的隐私风险,包括跟踪和监视风险,考虑到人工智能模型的广泛应用和社交媒体上照片的共享。这些模型的精度未来可能会提高。尽管存在这些风险,但很少有系统评估生成型VLMs的地理定位精度、其局限性和潜在的非预期推断的工作。为了弥补这一差距,我们在四个基准图像数据集上对25个最先进的VLMs的地理定位能力进行了全面评估,这些数据集在不同的环境中拍摄。我们的结果提供了对VLMs内部推理的见解,并突显了它们的优势、局限性和潜在的社会风险。我们的发现表明,当前的VLMs在通用街道级图像上的表现不佳,但在类似于社交媒体内容的图像上达到了显著的高精度(61%),这引发了重大的和迫切的隐私担忧。
Summary / 总结
This study evaluates the geolocation capabilities of Generative Vision-Language Models (VLMs) by assessing their performance on four benchmark datasets. The research aims to address the privacy risks associated with these models, which can accurately locate images using visual cues. Key findings show that VLMs perform poorly on generic street-level images but achieve high accuracy (61%) on social media-like images, highlighting significant privacy concerns.
该研究通过评估四种基准数据集上的表现,评估生成型视觉-语言模型(VLMs)的地理定位能力。研究旨在揭示这些模型的精度、局限性和潜在的隐私风险。关键发现表明,虽然VLMs在普通街道图像上的表现较差,但在类似社交媒体内容的图像上却能实现高达61%的高精度,这引发了重大的隐私担忧。
KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
Authors: Taebaek Hwang, Minseo Kim, Gisang Lee, Seonuk Kim, Hyunjun Eun
First: 2025-08-27T15:01:02+00:00 · Latest: 2025-08-27T15:01:02+00:00
Abstract
Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.
中文标题/摘要
标题:KRETA:面向文本丰富视觉问答的韩语阅读与推理基准,适应多样的视觉环境
在视觉上下文中理解和推理文本对视觉语言模型(VLMs)构成了重大挑战,鉴于现实世界场景的复杂性和多样性。为应对这一挑战,面向高资源语言如英语的文本丰富视觉问答(VQA)数据集和基准已经出现。然而,对于低资源语言如韩语来说,缺乏全面的基准阻碍了模型的稳健评估和比较。为弥补这一差距,我们引入了KRETA,一个面向韩语阅读与推理的文本丰富视觉问答基准,适应多样的视觉环境。KRETA 促进了对视觉文本理解和推理能力的深入评估,同时支持跨15个领域和26种图像类型的多维度评估。此外,我们还引入了一种针对文本丰富设置的半自动化VQA生成流水线,利用细化的逐步图像分解和严格的七项指标评估协议来确保数据质量。虽然KRETA 专为韩语设计,但我们希望我们的可适应和可扩展的流水线能够促进其他语言类似基准的开发,从而加速多语言VLM研究。KRETA 的代码和数据集可在 https://github.com/tabtoyou/KRETA 获取。
Summary / 总结
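KRETA is a benchmark for Korean reading and reasoning in text-rich visual question answering (VQA), built to address the lack of comprehensive benchmarks for low-resource languages such as Korean. It supports evaluation of visual text understanding and reasoning across 15 domains and 26 image types. The data comes from a semi-automated VQA generation pipeline that uses stepwise image decomposition and a seven-metric evaluation protocol to ensure quality, and the code and dataset are publicly available.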
KRETA 是一个用于评估韩语阅读和推理能力的基准,特别是在文本丰富的视觉问答(VQA)场景中,旨在填补低资源语言如韩语缺乏全面基准的空白。它涵盖了15个领域和26种图像类型,并采用了一种半自动化的VQA生成流水线和七项指标评估协议来确保数据质量。主要发现包括能够评估在多种情境下的视觉文本理解和推理能力。
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
Authors: Zeyi Sun, Ziyang Chu, Pan Zhang, Tong Wu, Xiaoyi Dong, Yuhang Zang, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
First: 2024-12-02T18:59:26+00:00 · Latest: 2025-08-27T12:26:39+00:00
Comments: code: https://github.com/SunzeY/X-Prompt
Abstract
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large-vision language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
中文标题/摘要
标题:X-Prompt:面向自回归视觉语言基础模型的通用上下文内图像生成
上下文生成是大型语言模型(LLMs)开放任务泛化能力的关键组成部分。通过利用少量示例作为上下文,LLMs 可以执行领域内和领域外任务。基于LLMs构建的自回归视觉语言模型(VLMs)在文本到图像生成方面展现了令人印象深刻的性能。然而,上下文学习在通用图像生成任务中的潜力尚未得到充分探索。为了解决这一问题,我们引入了X-Prompt,这是一种纯粹的自回归大型视觉语言模型,旨在在一个统一的上下文内学习框架中实现广泛可见和不可见图像生成任务的竞争力。X-Prompt结合了专门设计,有效压缩了上下文示例中的有价值特征,支持更长的上下文标记序列,并提高其对未见过任务的泛化能力。统一的训练任务使X-Prompt能够处理一般图像生成,增强任务意识。广泛的实验验证了该模型在多种可见图像生成任务中的性能,并展示了其对以前未见过任务的泛化能力。
Summary / 总结
The research explores in-context learning for general image generation in auto-regressive vision-language models. X-Prompt, a purely auto-regressive large vision-language model, is introduced to handle both seen and unseen image generation tasks within a unified in-context learning framework. Its design efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving generalization to unseen tasks.
X-Prompt 是一种旨在增强自动回归视觉语言模型在图像生成任务中上下文学习能力的设计。通过高效压缩上下文示例中的特征,X-Prompt 支持更长的序列并更好地泛化到未见过的任务。实验表明,X-Prompt 在多种已见和未见的图像生成任务中表现出色,均在一个统一的框架内完成。
DiffArtist: Towards Structure and Appearance Controllable Image Stylization
Authors: Ruixiang Jiang, Changwen Chen
Venue: ACM MM 2025
First: 2024-07-22T17:58:05+00:00 · Latest: 2025-08-27T10:30:27+00:00
Comments: Accepted to ACM MM 2025, Homepage: https://DiffusionArtist.github.io
Abstract
Artistic styles are defined by both their structural and appearance elements. Existing neural stylization techniques primarily focus on transferring appearance-level features such as color and texture, often neglecting the equally crucial aspect of structural stylization. To address this gap, we introduce DiffArtist, the first 2D stylization method to offer fine-grained, simultaneous control over both structure and appearance style strength. This dual controllability is achieved by representing structure and appearance generation as separate diffusion processes, necessitating no further tuning or additional adapters. To properly evaluate this new capability of dual stylization, we further propose a Multimodal LLM-based stylization evaluator that aligns significantly better with human preferences than existing metrics. Extensive analysis shows that DiffArtist achieves superior style fidelity and dual-controllability compared to state-of-the-art methods. Its text-driven, training-free design and unprecedented dual controllability make it a powerful and interactive tool for various creative applications. Project homepage: https://diffusionartist.github.io.
中文标题/摘要
标题:DiffArtist:朝结构和外观可控图像风格化迈进
艺术风格由结构和外观元素共同定义。现有的神经风格化技术主要侧重于转移色彩和纹理等外观特征,往往忽视了同样重要的结构风格化方面。为解决这一问题,我们引入了DiffArtist,这是首个同时提供结构和外观风格精细控制的2D风格化方法。这种双重可控性通过将结构和外观生成表示为独立的扩散过程来实现,无需进一步调整或附加适配器。为了评估这种新的双重风格化能力,我们进一步提出了一种基于多模态LLM的风格化评估器,其与人类偏好更为一致。广泛分析表明,DiffArtist在风格保真度和双重可控性方面优于现有方法。其基于文本驱动、无需训练的设计和前所未有的双重可控性使其成为各种创意应用的强大且互动的工具。项目主页:https://diffusionartist.github.io
Summary / 总结
DiffArtist is a novel neural stylization method that simultaneously controls both structural and appearance styles in images. It achieves this by treating structure and appearance generation as separate diffusion processes, without requiring additional tuning. The method outperforms existing techniques in terms of style fidelity and dual controllability, and includes a new stylization evaluator that better aligns with human preferences. This makes DiffArtist a powerful tool for creative applications.
DiffArtist 是一种新的 2D 图像风格化方法,允许对结构和外观进行精细控制。它通过分别使用结构和外观的扩散过程来实现这一点,无需额外调整。该方法引入了一种基于多模态 LLM 的评估器来评估双风格化,显示出比现有技术更高的风格保真度和双控制性。其文本驱动和无需训练的设计使其成为创意应用的强大工具。
Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
Authors: Hongseok Oh, Wonseok Hwang
First: 2025-02-27T12:20:02+00:00 · Latest: 2025-08-27T10:19:08+00:00
Abstract
Large Vision-Language Models (LVLMs) have recently shown remarkable performance across various domains. However, these models suffer from object hallucination. This study revisits the previous claim that the cause of such hallucinations lies in the limited representational capacity of the vision encoder. Our analysis implies that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLMs (4.9% in POPE).
Summary / 总结
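This study revisits the claim that object hallucination in Large Vision-Language Models (LVLMs) stems from the limited representational capacity of the vision encoder, and finds that encoder capacity is not necessarily a major limiting factor in detecting it. It proposes Fine-grained CLIPScore (F-CLIPScore), which incorporates text embeddings at the noun level. On the OHD-Caps benchmark, F-CLIPScore outperforms conventional CLIPScore in accuracy by 39.6% without additional training, and F-CLIPScore-based data filtering reduces object hallucination in LVLMs (4.9% in POPE).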
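To make the noun-level idea concrete, here is a small sketch that operates on precomputed embeddings. The `clip_score` function follows the standard CLIPScore definition (2.5 times the cosine similarity, clipped at zero). How F-CLIPScore combines the caption-level and noun-level terms is not stated in the abstract, so the plain average used in `fine_grained_clip_score` is an assumption, and both function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5):
    """Standard CLIPScore: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, text_emb), 0.0)


def fine_grained_clip_score(image_emb, caption_emb, noun_embs):
    """Assumed aggregation: mean of the caption score and each noun's score."""
    scores = [clip_score(image_emb, caption_emb)]
    scores += [clip_score(image_emb, noun) for noun in noun_embs]
    return sum(scores) / len(scores)


image = [1.0, 0.0]
caption = [1.0, 0.0]
nouns = [[1.0, 0.0], [0.0, 1.0]]  # second noun is unrelated to the image
coarse = clip_score(image, caption)
fine = fine_grained_clip_score(image, caption, nouns)
```

In this toy case the caption as a whole matches the image, but one noun does not, so the fine-grained score drops below the caption-only score. That is the behavior a noun-level metric needs in order to flag a hallucinated object.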
HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs
Authors: Zheng Qin, Ruobing Zheng, Yabing Wang, Tianqi Li, Yi Yuan, Jingdong Chen, Le Wang
First: 2025-08-14T12:14:15+00:00 · Latest: 2025-08-27T10:04:02+00:00
Abstract
While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, we argue that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, with reasoning ability serving as the key to unlocking it. Accordingly, we employ a multi-stage, modality-progressive reinforcement learning to enhance the reasoning abilities of an Omni model, achieving substantial gains on evaluation results. Additionally, we observe that successful reasoning processes exhibit highly consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner. Project page: https://digital-avatar.github.io/ai/HumanSense/
Summary / 总结
HumanSense is a benchmark designed to evaluate the human-centered perception and interaction capabilities of Multimodal Large Language Models (MLLMs), focusing on understanding complex human intentions and providing empathetic, context-aware responses. The evaluation shows that leading MLLMs still need improvement, especially for advanced interaction tasks. Supplementing visual input with audio and text information improves performance, and omni-modal models show advantages. Reasoning ability is crucial for generating appropriate feedback, and a multi-stage, modality-progressive reinforcement learning method enhances the reasoning abilities of an omni model, leading to better evaluation results. Thought patterns in successful reasoning processes are highly consistent, which can be leveraged to improve non-reasoning models without additional training.
HumanSense 是一个基准,旨在评估多模态大型语言模型(MLLMs)的人类中心感知和交互能力,重点在于理解复杂的人类意图并提供同理心和上下文相关的回应。评估结果显示,领先的 MLLMs 在高级交互任务上仍需改进。补充视觉输入以包含音频和文本信息可以提高性能,而全模态模型在这些任务上表现出优势。推理能力对于生成适当的反馈至关重要,通过多阶段、模态渐进的强化学习方法可以增强全模型的推理能力,从而在评估结果上取得显著进步。成功的推理过程表现出高度一致的思维模式,这可以用于在无需额外训练的情况下提升非推理模型的性能。
NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
Authors: Aritra Dutta, Swapnanil Mukherjee, Deepanway Ghosal, Somak Aditya
First: 2025-08-27T09:34:28+00:00 · Latest: 2025-08-27T09:34:28+00:00
Abstract
Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized VLMs such as Qwen-2 VL-2B and SmolVLM-2.5B. As these benchmarks contain 10-25% label noise, additional finetuning using noise-robust losses (such as symmetric cross entropy and generalised cross entropy) adds another 2.5% in CRIC, and 5.5% in AOKVQA. Our findings expose when LLM-based commonsense knowledge beats retrieval from commonsense knowledge bases, how noise-aware training stabilises small models in the context of external knowledge augmentation, and why parameter-efficient commonsense reasoning is now within reach for 250M models.
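The two noise-robust losses named in the NLKI abstract have standard per-example forms, sketched below for a single prediction. Generalised cross entropy is (1 - p_y^q) / q, and symmetric cross entropy adds a reverse cross-entropy term in which log 0 is replaced by a negative constant. The hyperparameter values (`q`, `alpha`, `beta`, `log_zero`) are illustrative defaults, not the settings used in the paper.

```python
import math


def generalized_cross_entropy(probs, label, q=0.7):
    """GCE: (1 - p_y^q) / q; tends to cross entropy as q -> 0 and is MAE-like at q = 1."""
    return (1.0 - probs[label] ** q) / q


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0,
                            log_zero=-4.0, eps=1e-7):
    """SCE: alpha * CE + beta * reverse CE for a one-hot label."""
    p_y = probs[label]
    ce = -math.log(max(p_y, eps))
    # Reverse CE sums p_k * log(onehot_k); log(1) = 0 and log(0) is set to log_zero,
    # so only the probability mass placed on wrong classes contributes.
    rce = -log_zero * (1.0 - p_y)
    return alpha * ce + beta * rce


uniform = [0.5, 0.5]
gce = generalized_cross_entropy(uniform, label=0)
sce = symmetric_cross_entropy(uniform, label=0)
```

Both losses are bounded more tightly than plain cross entropy when the model puts little mass on a (possibly wrong) label, which is why they are a common choice for benchmarks with the 10-25% label noise the abstract reports.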
FreeVPS: Repurposing Training-Free SAM2 for Generalizable Video Polyp Segmentation
Authors: Qiang Hu, Ying Zhou, Gepeng Ji, Nick Barnes, Qiang Li, Zhiwei Wang
First: 2025-08-27T09:12:38+00:00 · Latest: 2025-08-27T09:12:38+00:00
Abstract
Existing video polyp segmentation (VPS) paradigms usually struggle to balance between spatiotemporal modeling and domain generalization, limiting their applicability in real clinical scenarios. To address this challenge, we recast the VPS task as a track-by-detect paradigm that leverages the spatial contexts captured by the image polyp segmentation (IPS) model while integrating the temporal modeling capabilities of segment anything model 2 (SAM2). However, during long-term polyp tracking in colonoscopy videos, SAM2 suffers from error accumulation, resulting in a snowball effect that compromises segmentation stability. We mitigate this issue by repurposing SAM2 as a video polyp segmenter with two training-free modules. In particular, the intra-association filtering module eliminates spatial inaccuracies originating from the detecting stage, reducing false positives. The inter-association refinement module adaptively updates the memory bank to prevent error propagation over time, enhancing temporal coherence. Both modules work synergistically to stabilize SAM2, achieving cutting-edge performance in both in-domain and out-of-domain scenarios. Furthermore, we demonstrate the robust tracking capabilities of FreeVPS in long, untrimmed colonoscopy videos, underscoring its potential for reliable clinical analysis.
中文标题/摘要
标题:FreeVPS: 重新利用无训练SAM2进行通用视频息肉分割
现有的视频息肉分割(VPS)范式通常难以在时空建模和领域泛化之间取得平衡,限制了其在实际临床场景中的应用。为应对这一挑战,我们将VPS任务重新定义为一个跟踪-检测范式,该范式利用图像息肉分割(IPS)模型捕获的空间上下文,同时整合分割一切模型2(SAM2)的时空建模能力。然而,在结肠镜视频中进行长期息肉跟踪时,SAM2会因误差累积而表现不佳,导致雪崩效应,影响分割稳定性。我们通过重新利用SAM2作为视频息肉分割器,并引入两个无训练模块来缓解这一问题。具体而言,内部关联过滤模块消除了检测阶段产生的空间不准确性,减少假阳性。跨关联精炼模块则自适应地更新记忆库,防止错误随时间传播,增强时间连贯性。两个模块协同工作,稳定SAM2,使其在领域内和领域外场景中均达到前沿性能。此外,我们展示了FreeVPS在长未剪辑结肠镜视频中的稳健跟踪能力,突显了其在临床分析中的潜在可靠性。
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Authors: Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, Hamid Rezatofighi
First: 2025-08-24T11:01:51+00:00 · Latest: 2025-08-27T08:55:54+00:00
Comments: Project Page: https://github.com/pokerme7777/Compositional-Visual-Reasoning-Survey
Abstract
Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
Summary / 总结
This survey aims to address the gap in literature on compositional visual reasoning, a key area in multimodal AI. It reviews over 260 papers from top venues from 2023 to 2025, formalizing core definitions and tracing a paradigm shift in approaches. The survey highlights the advantages of compositional methods and catalogs benchmarks to probe reasoning accuracy. Key insights include challenges such as limitations of LLM-based reasoning and a bias toward deductive reasoning, with future directions including human-AI collaborative reasoning and richer evaluation protocols.
该综述旨在填补关于组合视觉推理的文献空白,组合视觉推理是多模态AI的关键领域。它回顾了2023年至2025年间来自顶级会议的260多篇论文,正式化了核心定义并追踪了方法的范式转变。综述突出了组合方法的优势,并列出了用于探究推理准确性的基准测试。关键见解包括LLM推理的局限性以及倾向于演绎推理的偏差等挑战,未来方向包括人机协作推理和更丰富的评估协议。
VideoEraser: Concept Erasure in Text-to-Video Diffusion Models
Authors: Naen Xu, Jinghuai Zhang, Changjiang Li, Zhi Chen, Chunyi Zhou, Qingming Li, Tianyu Du, Shouling Ji
Venue: EMNLP
First: 2025-08-21T07:15:18+00:00 · Latest: 2025-08-27T08:48:35+00:00
Comments: To appear in the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Abstract
The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.
中文标题/摘要
标题:VideoEraser:文本到视频扩散模型中的概念擦除
文本到视频(T2V)扩散模型的迅速发展引发了对隐私、版权和安全的担忧,因为这些模型有可能被滥用以生成有害或误导性的内容。这些模型通常基于大量数据集进行训练,包括未经授权的个人身份、艺术创作和有害材料,这可能导致此类内容的不受控生产和分发。为了解决这一问题,我们提出了一种无需训练的VideoEraser框架,即使在明确提示这些概念时,也能防止T2V扩散模型生成带有不良概念的视频。VideoEraser设计为即插即用模块,可以通过两阶段过程——选择性提示嵌入调整(SPEA)和抗对抗噪声引导(ARNG)无缝集成到代表性T2V扩散模型中。我们在四个任务中进行了广泛的评估,包括对象擦除、艺术风格擦除、名人擦除和裸露内容擦除。实验结果表明,VideoEraser在有效性、完整性、保真度、鲁棒性和泛化能力方面均优于先前方法。值得注意的是,VideoEraser在T2V生成过程中抑制不良内容方面达到了最先进的性能,在四个任务中平均降低了46%。
Summary / 总结
VideoEraser is a training-free framework designed to prevent text-to-video diffusion models from generating videos with undesirable concepts, even when explicitly prompted. It integrates seamlessly via a two-stage process: Selective Prompt Embedding Adjustment and Adversarial-Resilient Noise Guidance. Extensive evaluations across four tasks show that VideoEraser outperforms prior methods in terms of efficacy, integrity, fidelity, robustness, and generalizability, reducing undesirable content by 46% on average compared to baselines.
VideoEraser 是一个无需训练的框架,旨在防止文本到视频扩散模型在受到特定概念提示时生成包含不良内容的视频。它通过两个阶段的过程无缝集成:选择性提示嵌入调整和抗对抗噪声引导。广泛的评估显示,VideoEraser 在有效性、完整性、保真度、鲁棒性和通用性方面优于先前的方法,平均将不良内容减少 46% 以上,优于基线方法。
InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning
Authors: Qihang Ai, Pi Bu, Yue Cao, Yingyao Wang, Jihao Gu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Zhicheng Zheng, Jun Song, Yuning Jiang, Bo Zheng
First: 2025-08-27T08:40:05+00:00 · Latest: 2025-08-27T08:40:05+00:00
Abstract
Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents' capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training strategy and an interactive pre-action reasoning mechanism. Finally, our model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines on InquireBench. We will open-source all datasets, models, and evaluation code to facilitate development in both academia and industry.
中文标题/摘要
标题:InquireMobile: 使用基于VLM的移动代理通过强化微调请求人类帮助
近期视觉语言模型(VLMs)的进步使移动代理能够基于人类指令感知和互动现实移动环境。然而,当前完全自主的范式在模型理解或推理能力不足时存在潜在的安全风险。为应对这一挑战,我们首先引入了**InquireBench**,这是一个专门设计的基准,用于评估移动代理在安全互动和主动问询用户方面的能力,涵盖5个类别和22个子类别,其中大多数现有的基于VLM的代理表现出接近零的性能。在本文中,我们旨在开发一个交互系统,在关键决策点主动寻求人类确认。为此,我们提出了**InquireMobile**,这是一种受强化学习启发的新模型,具有两阶段训练策略和交互式预动作推理机制。最终,我们的模型在InquireBench上的问询成功率提高了46.8%,并且在现有基线中的整体成功率最高。我们将开源所有数据集、模型和评估代码,以促进学术界和工业界的开发。
Summary / 总结
The research aims to enhance the safety of mobile agents by developing an interactive system that seeks human confirmation at critical decision points. It proposes InquireMobile, a model inspired by reinforcement learning that features a two-stage training strategy and an interactive pre-action reasoning mechanism. On the InquireBench benchmark, the model achieves a 46.8% improvement in inquiry success rate and the best overall success rate among existing baselines.
研究旨在通过开发一个在关键决策点寻求人类确认的交互系统来提高移动代理的安全性。方法是提出了一种名为InquireMobile的新模型,该模型利用强化学习进行两阶段训练策略和交互式预动作推理机制。该模型在InquireBench基准测试中表现出46.8%的问询成功率提升,并且在所有现有基线中具有最佳的整体成功率。
R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning
Authors: Lijun Sheng, Jian Liang, Zilei Wang, Ran He
Venue: CVPR 2025
First: 2025-04-15T13:49:31+00:00 · Latest: 2025-08-27T08:19:03+00:00
Comments: CVPR 2025 (Corrected the results on the Aircraft dataset)
Abstract
Vision-language models (VLMs), such as CLIP, have gained significant popularity as foundation models, with numerous fine-tuning methods developed to enhance performance on downstream tasks. However, due to their inherent vulnerability and the common practice of selecting from a limited set of open-source models, VLMs suffer from a higher risk of adversarial attacks than traditional vision models. Existing defense techniques typically rely on adversarial fine-tuning during training, which requires labeled data and lacks flexibility for downstream tasks. To address these limitations, we propose robust test-time prompt tuning (R-TPT), which mitigates the impact of adversarial attacks during the inference stage. We first reformulate the classic marginal entropy objective by eliminating the term that introduces conflicts under adversarial conditions, retaining only the pointwise entropy minimization. Furthermore, we introduce a plug-and-play reliability-based weighted ensembling strategy, which aggregates useful information from reliable augmented views to strengthen the defense. R-TPT enhances defense against adversarial attacks without requiring labeled training data while offering high flexibility for inference tasks. Extensive experiments on widely used benchmarks with various attacks demonstrate the effectiveness of R-TPT. The code is available at https://github.com/TomSheng21/R-TPT.
Summary / 总结
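R-TPT is a test-time prompt tuning method that defends vision-language models such as CLIP against adversarial attacks at inference. It drops the conflict-inducing term from the classic marginal entropy objective, keeping only pointwise entropy minimization, and adds a plug-and-play reliability-based weighted ensembling of augmented views. Experiments on widely used benchmarks with various attacks show that it is effective without requiring labeled training data.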
R-TPT 通过在推理阶段缓解对抗攻击的影响来提升视觉语言模型的对抗鲁棒性。它重新定义了边际熵目标,并引入了一种基于可靠性的加权集成策略。实验表明,R-TPT 在无需标注训练数据的情况下有效防御了各种对抗攻击,并且在推理任务中具有高度的灵活性。
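The two ingredients named in the R-TPT abstract, pointwise entropy minimization and reliability-based weighted ensembling, can be illustrated with a minimal sketch. The abstract does not say how reliability is measured, so the entropy-based weighting below is an assumption, and the function names and the `temperature` parameter are hypothetical rather than taken from the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pointwise_entropy_loss(view_logits):
    """Mean of the per-view entropies: each augmented view is pushed toward
    a confident prediction on its own, with no averaged (marginal) term."""
    return sum(entropy(softmax(l)) for l in view_logits) / len(view_logits)


def reliability_weighted_ensemble(view_logits, temperature=1.0):
    """Fuse per-view predictions, giving low-entropy views more weight.

    ASSUMPTION: reliability is approximated here by negative entropy;
    the paper's actual reliability measure may differ.
    """
    probs = [softmax(l) for l in view_logits]
    weights = softmax([-entropy(p) / temperature for p in probs])
    n_classes = len(probs[0])
    fused = [sum(w * p[c] for w, p in zip(weights, probs))
             for c in range(n_classes)]
    return fused, weights
```

With logits from several augmented views of one test image, `reliability_weighted_ensemble` returns a fused class distribution plus the per-view weights, so a confident view contributes more than a near-uniform one.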
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Authors: Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, Dong Yu
First: 2025-08-27T08:01:03+00:00 · Latest: 2025-08-27T08:01:03+00:00
Comments: 16 pages, two figures
Abstract
Vision-Language Models (VLMs) often suffer from visual hallucinations, saying things that are not actually in the image, and language shortcuts, where they skip the visual part and just rely on text priors. These issues arise because most post-training methods for VLMs rely on simple verifiable answer matching and supervise only final outputs, leaving intermediate visual reasoning without explicit guidance. As a result, VLMs receive sparse visual signals and often learn to prioritize language-based reasoning over visual perception. To mitigate this, some existing methods add visual supervision using human annotations or distilled labels from external large models. However, human annotations are labor-intensive and costly, and because external signals cannot adapt to the evolving policy, they cause distributional shifts that can lead to reward hacking. In this paper, we introduce Vision-SR1, a self-rewarding method that improves visual reasoning without relying on external visual supervision via reinforcement learning. Vision-SR1 decomposes VLM reasoning into two stages: visual perception and language reasoning. The model is first prompted to produce self-contained visual perceptions that are sufficient to answer the question without referring back to the input image. To validate this self-containment, the same VLM is then re-prompted to perform language reasoning using only the generated perception as input to compute the reward. This self-reward is combined with supervision on final outputs, providing a balanced training signal that strengthens both visual perception and language reasoning. Our experiments demonstrate that Vision-SR1 improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.
Summary / 总结
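Vision-SR1 is a self-rewarding reinforcement learning method that targets visual hallucinations and language shortcuts in Vision-Language Models (VLMs). It splits reasoning into visual perception and language reasoning: the model first writes a self-contained visual perception, and the same VLM is then re-prompted to answer from that perception alone, which yields the reward. Combined with supervision on final outputs, this improves visual reasoning, mitigates visual hallucinations, and reduces reliance on language shortcuts across diverse vision-language tasks.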
本文通过引入Vision-SR1,一种自我奖励的方法,解决了视觉语言模型(VLM)中的视觉幻觉和语言捷径问题。Vision-SR1将VLM的推理分解为视觉感知和语言推理两个阶段,利用强化学习引导模型。模型生成自我包含的视觉感知,然后重新提示模型仅使用这些感知进行语言推理以计算奖励。实验表明,Vision-SR1增强了视觉推理,减少了视觉幻觉,并降低了对语言捷径的依赖,适用于各种视觉语言任务。
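The two-stage self-reward described in the Vision-SR1 abstract can be sketched as follows. This is only an illustration under stated assumptions: the three callables stand in for the same VLM under different prompts, exact-match scoring and the `perception_weight` mixing coefficient are hypothetical choices, and none of the names come from the authors' implementation.

```python
from typing import Callable


def _matches(prediction: str, gold: str) -> float:
    """Hypothetical verifier: 1.0 on a case-insensitive exact match."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def self_reward(
    perceive: Callable[[str, str], str],           # (image, question) -> perception text
    reason_from_text: Callable[[str, str], str],   # (perception, question) -> answer, no image
    answer_with_image: Callable[[str, str], str],  # (image, question) -> final answer
    image: str,
    question: str,
    gold: str,
    perception_weight: float = 0.5,                # ASSUMPTION: mixing weight is illustrative
) -> float:
    """Combine a perception self-reward with the usual final-answer reward."""
    # Stage 1: the model describes what it sees, aiming to be self-contained.
    perception = perceive(image, question)
    # Stage 2: the same model answers from the description alone; success
    # indicates that the description really was sufficient.
    perception_reward = _matches(reason_from_text(perception, question), gold)
    # Supervision on the final output, as in answer-matching post-training.
    answer_reward = _matches(answer_with_image(image, question), gold)
    return perception_weight * perception_reward + (1.0 - perception_weight) * answer_reward
```

In an actual training loop the returned scalar would feed a policy-gradient update; that step is omitted because the abstract does not specify the optimizer.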
A Survey on Training-free Alignment of Large Language Models
Authors: Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Venue: EMNLP 2025
First: 2025-08-12T15:30:44+00:00 · Latest: 2025-08-27T05:46:37+00:00
Comments: Accepted to EMNLP 2025 (findings), camera-ready version
Abstract
The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques--leveraging in-context learning, decoding-time adjustments, and post-generation corrections--offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.
中文标题/摘要
标题:大型语言模型无训练对齐综述
大型语言模型(LLMs)的对齐旨在确保其输出符合人类价值观、道德标准和法律规范。传统的对齐方法通常依赖于资源密集型微调(FT),这可能会导致知识退化,并在模型可访问性或计算资源受限的情况下面临挑战。相比之下,无训练(TF)对齐技术——利用上下文学习、解码时调整和生成后修正——通过不重训LLMs来实现对齐,从而使其能够适应开源和封闭源环境。本文首次系统地回顾了TF对齐方法,按解码前、解码中和解码后阶段进行分类。对于每个阶段,我们从LLMs和多模态LLMs(MLLMs)的角度进行了详细的分析,突出了其机制和局限性。此外,我们还指出了关键挑战和未来方向,为更包容和有效的TF对齐技术铺平了道路。通过综合和组织快速增长的研究文献,本文为从业者提供了指导,并促进了更安全和更可靠的LLMs的发展。
Summary / 总结
The paper investigates training-free alignment methods for large language models (LLMs) to ensure their outputs align with human values and ethical standards without the need for resource-intensive fine-tuning. It categorizes these methods into pre-decoding, in-decoding, and post-decoding stages, providing a detailed examination of their mechanisms and limitations for both LLMs and multimodal LLMs. The study highlights key challenges and future directions, aiming to develop more inclusive and effective alignment techniques.
本文探讨了无需重新训练即可对大型语言模型(LLMs)进行对齐的方法,以确保其输出符合人类价值观和伦理标准,而无需进行资源密集型微调。研究将这些方法分为预解码、在解码和后解码阶段,并从LLMs和多模态LLMs的角度详细分析了它们的机制和局限性。研究还指出了关键挑战和未来方向,旨在提高LLMs的安全性和可靠性。
NPHardEval4V: Dynamic Evaluation of Large Vision-Language Models with Effects of Vision
Authors: Xiang Li, Wenyue Hua, Kaijie Zhu, Lingyao Li, Haoyang Ling, Jinkui Chi, Qi Dou, Jindong Wang, Yongfeng Zhang, Xin Ma, Lizhou Fan
First: 2024-03-04T07:10:31+00:00 · Latest: 2025-08-27T05:37:31+00:00
Comments: 25 pages, 9 figures, 2 tables
Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal understanding, yet their reasoning abilities remain underexplored. Existing benchmarks tend to focus on perception or text-based comprehension, offering limited insight into how well these models perform on structured, logic-driven tasks that require both visual and linguistic reasoning. To address this gap, we introduce NPHardEval4V, a multimodal benchmark suite grounded in four classical NP-hard problems: Knapsack, Set Cover, Traveling Salesperson, and Vertex Cover. Each task is presented through a combination of structured visual layouts and textual prompts, designed to assess the ability of LVLMs to perform combinatorial reasoning under visual-linguistic constraints. We evaluate a set of advanced open-source and closed-source vision-language models under a unified prompting and problem representation framework. This enables fair comparison across models and task types, while isolating key variables affecting performance. Our results show that while these models perform reasonably well on perception-based inputs, they struggle with global optimization, abstraction, and constraint satisfaction. No single model demonstrates consistent reasoning capability across all problem types, and common failure patterns reveal fundamental limitations in current architectures. By leveraging the structure and complexity of NP-hard problems, NPHardEval4V provides a scalable, interpretable, and challenging testbed for diagnosing reasoning behaviors in LVLMs. We hope this benchmark can support the community in building more robust, inference-capable multimodal systems. The benchmark dataset and code are available at https://github.com/lizhouf/NPHardEval4.
Summary / 总结
NPHardEval4V is a multimodal benchmark suite designed to evaluate the reasoning abilities of large vision-language models (LVLMs) on NP-hard problems. The benchmark uses four classical NP-hard problems: Knapsack, Set Cover, Traveling Salesperson, and Vertex Cover, presented through visual and textual inputs to assess combinatorial reasoning. The study evaluates various advanced LVLMs and finds that while they perform well on perception-based tasks, they struggle with global optimization, abstraction, and constraint satisfaction. No single model consistently performs well across all problem types, indicating fundamental limitations in current architectures. This benchmark provides a scalable and challenging testbed for diagnosing reasoning behaviors in LVLMs.
NPHardEval4V 是一个用于评估大型视觉-语言模型(LVLM)在 NP-hard 问题上推理能力的多模态基准套件。该基准使用四个经典 NP-hard 问题:背包问题、集合覆盖问题、旅行商问题和顶点覆盖问题,通过视觉和文本输入来评估组合推理。研究评估了多种先进的 LVLM,并发现尽管它们在感知任务上表现良好,但在全局优化、抽象和约束满足方面却表现不佳。没有一个模型在所有问题类型上都能一致地表现出色,表明当前架构存在根本性的局限性。该基准提供了可扩展、可解释且具有挑战性的测试平台,用于诊断 LVLM 的推理行为。
Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
First: 2025-03-14T15:42:42+00:00 · Latest: 2025-08-27T03:19:29+00:00
Comments: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection, Localization, and Interpretability
Abstract
Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.
中文标题/摘要
标题:探索跨模态生成模型中的字体视觉提示注入威胁
当前的跨模态生成模型(GMs)在各种生成任务中表现出色。鉴于现实场景中视觉模态输入的普遍性和信息丰富性,包括视觉语言感知(VLP)和图像到图像(I2I)在内的跨视觉任务引起了广泛关注。大型视觉语言模型(LVLMs)和I2I生成模型分别用于处理VLP和I2I任务。先前的研究表明,在输入图像中印刷字体文字会显著诱导LVLMs和I2I GMs生成与这些文字语义一致的破坏性输出。此外,视觉提示作为一种更复杂的字体形式,也被发现对跨视觉任务的各种应用构成安全风险。然而,视觉提示所造成的具体威胁特征仍待进一步探索。在本文中,为了全面调查字体视觉提示注入(TVPI)对各种LVLMs和I2I GMs性能影响,我们提出了字体视觉提示注入数据集,并在具有不同目标语义的视觉提示下对各种开源和闭源LVLMs和I2I GMs进行了彻底的安全风险评估,加深了对TVPI威胁的理解。
Summary / 总结
This paper investigates the security threats posed by typographic visual prompts in cross-modality generation models. It proposes a dataset to evaluate the impact of these prompts on various large vision language models and image-to-image generation models. Key findings show that visual prompts can significantly influence model outputs, leading to semantically aligned but disruptive results, highlighting the need for better security measures in cross-vision tasks.
本文探讨了图文提示注入(TVPI)在跨模态生成模型中的安全威胁。研究引入了一个TVPI数据集,并评估了不同模型在图文提示下的影响,发现图文提示可以诱导生成与提示内容一致的破坏性输出。研究加深了对LVLMs和I2I GMs中TVPI威胁的理解。
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin
Venue: EMNLP
First: 2024-11-25T02:15:30+00:00 · Latest: 2025-08-27T02:39:28+00:00
Comments: Accepted by EMNLP-2025 Main. Project page: https://szhanz.github.io/zoomeye/
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding. Recently, with the integration of test-time scaling techniques, these models have also shown strong potential in visual reasoning. However, most existing reasoning approaches remain text-level in nature: MLLMs are prompted to explore various combinations of textual tokens via their underlying language model, while the visual input remains fixed throughout the reasoning process. This paradigm limits the model's ability to fully exploit rich visual information, particularly when dealing with images containing numerous fine-grained elements. In such cases, vision-level reasoning becomes crucial - where models dynamically zoom into specific regions of the image to gather detailed visual cues necessary for accurate decision-making. In this paper, we propose Zoom Eye, a training-free, model-agnostic tree search algorithm tailored for vision-level reasoning. Zoom Eye treats an image as a hierarchical tree structure, where each child node represents a zoomed-in sub-region of its parent, and the root corresponds to the full image. The algorithm enables MLLMs to simulate human-like zooming behavior by navigating from root to leaf nodes in search of task-relevant visual evidence. We experiment on a series of high-resolution benchmarks and the results demonstrate that Zoom Eye consistently improves the performance of multiple MLLMs by a large margin (e.g., InternVL2.5-8B increases by 15.71% and 17.69% on HR-Bench) and also enables small 3-8B MLLMs to outperform strong large models such as GPT-4o. Code: https://github.com/om-ai-lab/ZoomEye
Summary / 总结
ZoomEye is a training-free, model-agnostic tree search algorithm designed to enhance multimodal large language models (MLLMs) for vision-level reasoning. By treating images as hierarchical tree structures, ZoomEye allows MLLMs to simulate human-like zooming behavior, dynamically focusing on specific image regions to gather detailed visual cues. Experiments on high-resolution benchmarks show that ZoomEye significantly improves the performance of multiple MLLMs, with some models showing up to a 17.69% increase in accuracy.
ZoomEye 是一种无需训练、适用于多种模型的树搜索算法,旨在增强多模态大型语言模型(MLLMs)在视觉层面的推理能力。它将图像视为层次树结构,使MLLMs能够模拟人类的放大行为,探索特定的图像区域。实验表明,ZoomEye 在高分辨率基准测试中显著提升了多个 MLLMs 的性能,例如 InternVL2.5-8B 在 HR-Bench 上分别提高了 15.71% 和 17.69%,并且小型 3-8B MLLMs 能够超越更大的模型如 GPT-4o。
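The tree view of an image described in the Zoom Eye abstract (root = full image, each child = a zoomed-in sub-region) can be sketched as a best-first search. The quadrant split, the priority rule, and the stopping test below are assumptions made for illustration; `relevance` and `can_answer` are hypothetical stand-ins for queries to an MLLM, not the authors' functions.

```python
import heapq
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def children(box: Box) -> List[Box]:
    """Split a region into four quadrants (one possible child layout)."""
    l, t, r, b = box
    mx, my = (l + r) // 2, (t + b) // 2
    return [(l, t, mx, my), (mx, t, r, my), (l, my, mx, b), (mx, my, r, b)]


def zoom_search(
    root: Box,
    relevance: Callable[[Box], float],  # hypothetical: model's belief that the cue is in this box
    can_answer: Callable[[Box], bool],  # hypothetical: model judges the crop sufficient to answer
    min_size: int = 64,
    max_visits: int = 50,
) -> Box:
    """Best-first descent from the full image toward task-relevant crops."""
    frontier = [(-relevance(root), root)]  # max-heap via negated scores
    visits = 0
    while frontier and visits < max_visits:
        _, box = heapq.heappop(frontier)
        visits += 1
        if can_answer(box):
            return box
        l, t, r, b = box
        if min(r - l, b - t) // 2 < min_size:
            continue  # children would be too small: treat this node as a leaf
        for child in children(box):
            heapq.heappush(frontier, (-relevance(child), child))
    return root  # fall back to the full image if nothing sufficed
```

The returned box would then be cropped and passed back to the model together with the question; because the search only calls the two scoring functions, it stays independent of any particular MLLM.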
Do VLMs Have Bad Eyes? Diagnosing Compositional Failures via Mechanistic Interpretability
Authors: Ashwath Vaithinathan Aravindan, Abha Jha, Mihir Kulkarni
Venue: ICCV
First: 2025-08-20T01:15:28+00:00 · Latest: 2025-08-26T20:07:44+00:00
Comments: To be published in Explainable Computer Vision: Quo Vadis? workshop at ICCV'25
Abstract
Vision-Language Models (VLMs) have shown remarkable performance in integrating visual and textual information for tasks such as image captioning and visual question answering. However, these models struggle with compositional generalization and object binding, which limit their ability to handle novel combinations of objects and their attributes. Our work explores the root causes of these failures using mechanistic interpretability techniques. We show evidence that individual neurons in the MLP layers of CLIP's vision encoder represent multiple features, and this "superposition" directly hinders its compositional feature representation, which consequently affects compositional reasoning and object binding capabilities. We hope this study will serve as an initial step toward uncovering the mechanistic roots of compositional failures in VLMs. The code and supporting results can be found at https://github.com/Mystic-Slice/Do-VLMs-Have-Bad-Eyes.
中文标题/摘要
标题:VLMs的视觉缺陷?通过机制可解释性诊断组合失败
视觉-语言模型(VLMs)在图像字幕和视觉问答等任务中表现出色,能够整合视觉和文本信息。然而,这些模型在组合泛化和对象绑定方面存在困难,限制了它们处理新对象组合及其属性的能力。我们的研究使用机制可解释性技术探索这些失败的根本原因。我们展示了CLIP视觉编码器的MLP层中的单个神经元代表多个特征,这种“叠加”直接阻碍了其组合特征表示,从而影响了组合推理和对象绑定能力。我们希望这项研究能成为揭示VLMs组合失败机制根源的初步步骤。相关代码和支持结果可在https://github.com/Mystic-Slice/Do-VLMs-Have-Bad-Eyes获取。
Summary / 总结
This study investigates the root causes of compositional failures in Vision-Language Models (VLMs) by employing mechanistic interpretability techniques. It reveals that individual neurons in CLIP's vision encoder represent multiple features, leading to a 'superposition' that hinders compositional feature representation, thereby affecting the model's ability to handle novel object combinations and perform compositional reasoning. This work aims to uncover the mechanistic roots of compositional failures in VLMs, contributing to a better understanding of their limitations in handling complex visual and textual information.
该研究通过使用机制可解释性技术探讨了视觉-语言模型(VLMs)在处理组合性任务时失败的原因。研究发现,CLIP视觉编码器中的个别神经元代表多个特征,这阻碍了模型对新对象组合的组合性和推理能力。这项工作旨在揭示VLMs中组合性失败的机制根源,有助于更好地理解并改进这些模型。
GENIE-ASI: Generative Instruction and Executable Code for Analog Subcircuit Identification
Authors: Phuoc Pham, Arun Venkitaraman, Chia-Yu Hsieh, Andrea Bonetti, Stefan Uhlich, Markus Leibl, Simon Hofmann, Eisaku Ohbuchi, Lorenzo Servadei, Ulf Schlichtmann, Robert Wille
First: 2025-08-26T19:39:10+00:00 · Latest: 2025-08-26T19:39:10+00:00
Abstract
Analog subcircuit identification is a core task in analog design, essential for simulation, sizing, and layout. Traditional methods often require extensive human expertise, rule-based encoding, or large labeled datasets. To address these challenges, we propose GENIE-ASI, the first training-free, large language model (LLM)-based methodology for analog subcircuit identification. GENIE-ASI operates in two phases: it first uses in-context learning to derive natural language instructions from a few demonstration examples, then translates these into executable Python code to identify subcircuits in unseen SPICE netlists. In addition, to evaluate LLM-based approaches systematically, we introduce a new benchmark composed of operational amplifier netlists (op-amps) that cover a wide range of subcircuit variants. Experimental results on the proposed benchmark show that GENIE-ASI matches rule-based performance on simple structures (F1-score = 1.0), remains competitive on moderate abstractions (F1-score = 0.81), and shows potential even on complex subcircuits (F1-score = 0.31). These findings demonstrate that LLMs can serve as adaptable, general-purpose tools in analog design automation, opening new research directions for foundation model applications in analog design automation.
中文标题/摘要
标题:GENIE-ASI:生成式指令和可执行代码的模拟子电路识别
模拟子电路识别是模拟设计中的核心任务,对于仿真、尺寸确定和布局至关重要。传统方法通常需要大量的手工专业知识、基于规则的编码或大量标记的数据集。为了解决这些挑战,我们提出了GENIE-ASI,这是第一个无需训练、基于大型语言模型(LLM)的模拟子电路识别方法。GENIE-ASI分为两个阶段:首先使用上下文学习从少量示范示例中推导出自然语言指令,然后将这些指令翻译成可执行的Python代码,以识别未见过的SPICE网表中的子电路。此外,为了系统地评估LLM方法,我们引入了一个新的基准,该基准由运算放大器网表(op-amps)组成,涵盖了广泛的子电路变体。在所提出的基准上的实验结果表明,GENIE-ASI在简单结构上与基于规则的方法具有相同的性能(F1分数=1.0),在中等抽象上保持竞争力(F1分数=0.81),甚至在复杂子电路中也显示出潜力(F1分数=0.31)。这些发现表明,LLM可以作为模拟设计自动化中的适应性强、通用的工具,为LLM在模拟设计自动化中的基础模型应用开辟了新的研究方向。
Summary / 总结
The paper introduces GENIE-ASI, a methodology for analog subcircuit identification using large language models, addressing the need for extensive human expertise. GENIE-ASI consists of two phases: deriving natural language instructions from demonstrations and translating them into executable Python code. On a new benchmark of operational amplifier netlists, GENIE-ASI achieves F1-scores of 1.0, 0.81, and 0.31 for simple, moderate, and complex subcircuits, respectively, indicating its potential as an adaptable tool in analog design automation.
论文提出了GENIE-ASI,一种使用大型语言模型进行模拟子电路识别的方法,解决了需要大量人工经验和标注数据的问题。GENIE-ASI 包含两个阶段:从示例中推导自然语言指令并将其翻译成可执行的Python代码。在新的运算放大器网表基准测试中,GENIE-ASI 对简单结构的F1分数为1.0,中等抽象为0.81,复杂子电路为0.31,表明其作为模拟设计自动化中可适应的工具的潜力。
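To make the second phase of GENIE-ASI more concrete, the sketch below shows the kind of executable Python rule that could identify one simple subcircuit, a differential pair, in a SPICE netlist. It is a hand-written illustration, not code generated by or taken from GENIE-ASI: the matching rule, the `SUPPLY_NETS` set, and both function names are assumptions, and only the standard MOSFET element layout (`Mname drain gate source bulk model ...`) is relied on.

```python
from itertools import combinations

# ASSUMPTION: net names treated as supplies; real netlists may use others.
SUPPLY_NETS = {"0", "gnd", "vss", "vdd"}


def parse_mosfets(netlist: str):
    """Extract MOSFET lines of the form: Mname drain gate source bulk model ..."""
    devices = []
    for line in netlist.splitlines():
        tokens = line.split()
        if len(tokens) >= 6 and tokens[0][0].upper() == "M":
            name, drain, gate, source, _bulk, model = tokens[:6]
            devices.append({
                "name": name,
                "d": drain.lower(),
                "g": gate.lower(),
                "s": source.lower(),
                "model": model.lower(),
            })
    return devices


def find_differential_pairs(netlist: str):
    """Illustrative rule: two same-model transistors sharing a non-supply
    source net, with distinct gate nets and distinct drain nets."""
    pairs = []
    for a, b in combinations(parse_mosfets(netlist), 2):
        if (a["model"] == b["model"]
                and a["s"] == b["s"]
                and a["s"] not in SUPPLY_NETS
                and a["g"] != b["g"]
                and a["d"] != b["d"]):
            pairs.append((a["name"], b["name"]))
    return pairs
```

A rule this small only covers the simplest structures; the abstract reports that performance drops on more abstract subcircuits, where such local pattern checks are unlikely to be enough.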
Fine-Tuning Vision-Language Models for Neutrino Event Analysis in High-Energy Physics Experiments
Authors: Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi
First: 2025-08-26T19:12:28+00:00 · Latest: 2025-08-26T19:12:28+00:00
Abstract
Recent progress in large language models (LLMs) has shown strong potential for multimodal reasoning beyond natural language. In this work, we explore the use of a fine-tuned Vision-Language Model (VLM), based on LLaMA 3.2, for classifying neutrino interactions from pixelated detector images in high-energy physics (HEP) experiments. We benchmark its performance against an established CNN baseline used in experiments like NOvA and DUNE, evaluating metrics such as classification accuracy, precision, recall, and AUC-ROC. Our results show that the VLM not only matches or exceeds CNN performance but also enables richer reasoning and better integration of auxiliary textual or semantic context. These findings suggest that VLMs offer a promising general-purpose backbone for event classification in HEP, paving the way for multimodal approaches in experimental neutrino physics.
中文标题/摘要
标题:高能物理实验中细调视觉-语言模型进行中微子事件分析
大型语言模型(LLMs)的最新进展显示了其在自然语言之外的多模态推理中的强大潜力。在本文中,我们探讨了基于LLaMA 3.2的细调视觉-语言模型(VLM)在高能物理(HEP)实验中从像素化探测器图像分类中微子相互作用的应用。我们将其性能与NOvA和DUNE等实验中使用的成熟CNN基线进行了基准测试,评估了分类准确性、精确度、召回率和AUC-ROC等指标。我们的结果表明,VLM不仅能够匹配甚至超越CNN的性能,还能够实现更丰富的推理并更好地整合辅助文本或语义上下文。这些发现表明,VLM为HEP中的事件分类提供了一个有前景的通用基础架构,为实验中微子物理中的多模态方法铺平了道路。
Summary / 总结
This study explores the use of a fine-tuned Vision-Language Model (VLM) based on LLaMA 3.2 for classifying neutrino interactions from detector images in high-energy physics experiments. The model is benchmarked against a CNN baseline used in experiments like NOvA and DUNE, showing that it matches or exceeds CNN performance while enabling richer reasoning and better integration of textual or semantic context. The results suggest VLMs could be a promising general-purpose backbone for event classification in HEP, facilitating multimodal approaches in experimental neutrino physics.
本研究探讨了基于LLaMA 3.2的细调视觉-语言模型(VLM)在高能物理实验中从探测器图像分类中中微子相互作用的应用。该模型与NOvA和DUNE等实验中使用的CNN基线进行了基准测试,结果显示其性能与CNN相当或更优,同时能够实现更丰富的推理并更好地整合文本或语义上下文。研究结果表明,VLMs可能是HEP中事件分类的有前途的一般用途基础架构,有助于实验中中微子物理的多模态方法。
VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space
Authors: Lin Li, Zehuan Huang, Haoran Feng, Gengxiong Zhuang, Rui Chen, Chunchao Guo, Lu Sheng
First: 2025-08-26T17:59:47+00:00 · Latest: 2025-08-26T17:59:47+00:00
Comments: Project page: https://huanngzh.github.io/VoxHammer-Page/
Abstract
3D local editing of specified regions is crucial for the game industry and robot interaction. Recent methods typically edit rendered multi-view images and then reconstruct 3D models, but they face challenges in precisely preserving unedited regions and overall coherence. Inspired by structured 3D generative models, we propose VoxHammer, a novel training-free approach that performs precise and coherent editing in 3D latent space. Given a 3D model, VoxHammer first predicts its inversion trajectory and obtains its inverted latents and key-value tokens at each timestep. Subsequently, in the denoising and editing phase, we replace the denoising features of preserved regions with the corresponding inverted latents and cached key-value tokens. By retaining these contextual features, this approach ensures consistent reconstruction of preserved areas and coherent integration of edited parts. To evaluate the consistency of preserved regions, we constructed Edit3D-Bench, a human-annotated dataset comprising hundreds of samples, each with carefully labeled 3D editing regions. Experiments demonstrate that VoxHammer significantly outperforms existing methods in terms of both 3D consistency of preserved regions and overall quality. Our method holds promise for synthesizing high-quality edited paired data, thereby laying the data foundation for in-context 3D generation. See our project page at https://huanngzh.github.io/VoxHammer-Page/.
中文标题/摘要
标题:VoxHammer:无需训练的精确和连贯的3D编辑
对指定区域进行3D局部编辑对于游戏行业和机器人交互至关重要。最近的方法通常编辑渲染的多视图图像,然后重建3D模型,但它们在精确保留未编辑区域和整体连贯性方面面临挑战。受结构化3D生成模型的启发,我们提出了VoxHammer,这是一种新颖的无需训练的方法,可以在3D潜在空间中进行精确和连贯的编辑。给定一个3D模型,VoxHammer首先预测其反转轨迹,并在每个时间步获取其反转的潜在变量和键值令牌。随后,在去噪和编辑阶段,我们用保留区域的相应反转潜在变量和缓存的键值令牌替换去噪特征。通过保留这些上下文特征,该方法确保保留区域的一致重建和编辑部分的连贯集成。为了评估保留区域的一致性,我们构建了Edit3D-Bench,这是一个由数百个样本组成的人工标注数据集,每个样本都有仔细标注的3D编辑区域。实验表明,VoxHammer在保留区域的3D一致性和整体质量方面显著优于现有方法。我们的方法有望合成高质量的编辑配对数据,从而为基于上下文的3D生成奠定数据基础。请参见我们的项目页面:https://huanngzh.github.io/VoxHammer-Page/
Summary / 总结
VoxHammer is a training-free approach for precise and coherent 3D editing in native 3D space. It predicts the inversion trajectory of a 3D model to obtain latents and key-value tokens, which are then used to replace denoising features of preserved regions during the editing phase. This method ensures consistent reconstruction of preserved areas and coherent integration of edited parts. Experimental results show that VoxHammer outperforms existing methods in terms of 3D consistency of preserved regions and overall quality, making it suitable for synthesizing high-quality edited paired data for in-context 3D generation.
VoxHammer 是一种无需训练的 3D 编辑方法,通过预测 3D 模型的反转轨迹并使用反转的潜在变量和键值令牌来编辑保留区域,确保一致性和连贯性。实验表明,VoxHammer 在保留 3D 一致性和整体质量方面优于现有方法。它有望用于合成高质量的编辑配对数据,为 3D 生成奠定数据基础。
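The feature-replacement step described in the VoxHammer abstract can be illustrated with a minimal NumPy sketch. It assumes a flat token layout and a boolean mask marking preserved tokens; the function name `blend_preserved`, the array shapes, and the toy values are illustrative assumptions rather than the authors' implementation, and the cached key-value tokens are left out.

```python
import numpy as np

def blend_preserved(denoised, inverted, preserve_mask):
    """Keep inverted latents where preserve_mask is True, denoised latents elsewhere.

    denoised, inverted: arrays of shape (num_tokens, dim) for the same timestep.
    preserve_mask: boolean array of shape (num_tokens,), True for preserved tokens.
    All names and shapes here are hypothetical, chosen for illustration.
    """
    mask = preserve_mask[:, None]  # broadcast the mask over the feature dimension
    return np.where(mask, inverted, denoised)

# Toy example: 4 tokens with 2 features each; tokens 0 and 1 are preserved.
denoised = np.ones((4, 2))    # stands in for the current denoising features
inverted = np.zeros((4, 2))   # stands in for latents cached during inversion
preserve_mask = np.array([True, True, False, False])
blended = blend_preserved(denoised, inverted, preserve_mask)
```

In this toy setup the preserved tokens take the cached inversion values unchanged, while the remaining tokens keep whatever the editing pass produced.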
Articulate3D: Zero-Shot Text-Driven 3D Object Posing
Authors: Oishi Deb, Anjun Hu, Ashkan Khakzar, Philip Torr, Christian Rupprecht
First: 2025-08-26T17:59:17+00:00 · Latest: 2025-08-26T17:59:17+00:00
Comments: Project page: https://odeb1.github.io/articulate3d_page_deb/
Abstract
We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image-generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/
中文标题/摘要
标题:Articulate3D:零样本文本驱动的3D物体姿态控制
我们提出了一种无需训练的方法Articulate3D,通过语言控制来摆放3D资产。尽管在视觉和语言模型方面取得了进展,但这项任务仍然令人惊讶地具有挑战性。为了实现这一目标,我们将问题分解为两个步骤。我们修改了一个强大的图像生成器,使其根据输入图像和文本指令生成目标图像。然后,我们通过多视角姿态优化步骤将网格与目标图像对齐。具体来说,我们引入了一种自注意力重连机制(RSActrl),该机制在图像生成模型中解耦了源结构和姿态,使其能够在不同姿态下保持一致的结构。我们观察到,可微渲染对于姿态优化来说是一个不可靠的信号;相反,我们使用关键点来建立输入图像和目标图像之间的对应关系。Articulate3D的有效性在各种3D物体和自由形式的文本提示下得到了验证,成功地操控了姿态同时保持了网格的原始身份。定量评估和对比用户研究证实了其优于现有方法的优越性。项目页面:https://odeb1.github.io/articulate3d_page_deb/
Summary / 总结
Articulate3D proposes a training-free method to pose 3D assets using language control. The method decomposes the task into two steps: modifying an image-generator to create target images based on input images and text instructions, followed by aligning the mesh to these target images through multi-view pose optimization. A self-attention rewiring mechanism (RSActrl) is introduced to maintain consistent structure across varying poses. The method uses keypoints to establish correspondences between input and target images, and it is effective for diverse 3D objects and free-form text prompts, with quantitative evaluations and user studies showing its superiority over existing approaches.
Articulate3D 提出了一种无需训练的方法,通过语言控制来调整 3D 资产的姿态。方法将任务分解为两步:修改图像生成器以根据输入图像和文本指令生成目标图像,然后通过多视图姿态优化步骤将网格对齐到这些目标图像。引入了一种自注意力重连线机制(RSActrl),以在不同姿态下保持结构的一致性。该方法使用关键点来建立输入图像和目标图像之间的对应关系,并且对于各种 3D 对象和自由形式的文本提示都有效,定量评估和用户研究显示其优于现有方法。
Route-and-Execute: Auditable Model-Card Matching and Specialty-Level Deployment
Authors: Shayan Vassef, Soorya Ram Shimegekar, Abhay Goyal, Koustuv Saha, Pi Zonooz, Navin Kumar
First: 2025-08-22T23:34:37+00:00 · Latest: 2025-08-26T17:13:21+00:00
Abstract
Clinical workflows are fragmented as a patchwork of scripts and task-specific networks that often handle triage, task selection, and model deployment. These pipelines are rarely streamlined for data science pipelines, reducing efficiency and raising operational costs. Workflows also lack data-driven model identification (from imaging/tabular inputs) and standardized delivery of model outputs. In response, we present a practical, healthcare-first framework that uses a single vision-language model (VLM) in two complementary roles. First (Solution 1), the VLM acts as an aware model-card matcher that routes an incoming image to the appropriate specialist model via a three-stage workflow (modality -> primary abnormality -> model-card id). Checks are provided by (i) stagewise prompts that allow early exit via None/Normal/Other and (ii) a stagewise answer selector that arbitrates between the top-2 candidates at each stage, reducing the chance of an incorrect selection and aligning the workflow with clinical risk tolerance. Second (Solution 2), we fine-tune the VLM on specialty-specific datasets ensuring a single model covers multiple downstream tasks within each specialty, maintaining performance while simplifying deployment. Across gastroenterology, hematology, ophthalmology, and pathology, our single-model deployment matches or approaches specialized baselines. Compared with pipelines composed of many task-specific agents, this approach shows that one VLM can both decide and do. It may reduce effort by data scientists, shorten monitoring, increase the transparency of model selection (with per-stage justifications), and lower integration overhead.
中文标题/摘要
标题:路由执行:可审计的模型卡匹配和专科级部署
临床工作流是碎片化的,由一系列脚本和特定任务的网络组成,通常处理分诊、任务选择和模型部署。这些管道很少为数据科学管道进行优化,降低了效率并增加了运营成本。工作流还缺乏基于数据的模型识别(从影像/表格输入)和标准化的模型输出交付。为此,我们提出了一种实用的、以医疗保健为先的框架,该框架使用单一的视觉语言模型(VLM)在两个互补角色中发挥作用。首先(解决方案1),VLM 作为有意识的模型卡匹配器,通过三阶段工作流(模态 -> 主要异常 -> 模型卡ID)将传入的图像路由到合适的专科模型。通过(i)阶段性的提示,允许早期退出(无/正常/其他),以及(ii)阶段性的答案选择器,在每个阶段之间选择前两名候选人,从而减少错误选择的机会,并使工作流与临床风险容忍度保持一致。其次(解决方案2),我们对VLM 进行微调,使其适用于特定专科的数据集,确保一个模型可以覆盖每个专科内的多个下游任务,保持性能的同时简化部署。在胃肠病学、血液学、眼科和病理学中,我们的单模型部署达到了或接近专科基准。与由许多特定任务代理组成的管道相比,这种方法表明一个VLM 可以同时决定和执行。这可能减少数据科学家的工作量,缩短监控时间,增加模型选择的透明度(每阶段都有理由),并降低集成成本。
Summary / 总结
The paper addresses the inefficiencies in clinical workflows by proposing a framework that uses a single vision-language model (VLM) for both model-card matching and specialty-level deployment. The VLM routes incoming images through a three-stage workflow, providing early exits and reducing the chance of incorrect selections. Additionally, the VLM is fine-tuned on specialty-specific datasets to handle multiple downstream tasks within each specialty, simplifying deployment. The approach shows competitive performance across gastroenterology, hematology, ophthalmology, and pathology, potentially reducing the effort required by data scientists and increasing model transparency.
论文提出了一种框架,使用单一的视觉语言模型(VLM)进行模型卡片匹配和专科级部署,以解决临床工作流中的低效问题。VLM通过三阶段工作流对传入图像进行路由,提供早期退出并减少错误选择的机会。此外,VLM在专科特定数据集上进行微调,以处理每个专科内的多个下游任务,简化部署。该方法在胃肠病学、血液学、眼科和病理学中表现出竞争力,可能减少数据科学家的工作量,提高模型透明度。
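The three-stage routing workflow (modality -> primary abnormality -> model-card id) with early exit and top-2 arbitration can be sketched as below. This is a rough illustration only: the stub proposers, the `pick_first` selector, the file names, and the model-card ids are all hypothetical stand-ins for VLM calls, and nothing here reflects the authors' prompts or code.

```python
from typing import Callable, List, Optional, Tuple

# Labels that trigger an early exit, as named in the abstract.
EXIT_LABELS = {"None", "Normal", "Other"}

# A proposer returns candidate labels ranked best-first, given prior choices.
Proposer = Callable[[str, List[str]], List[str]]
# A selector arbitrates between the top-2 candidates of a stage.
Selector = Callable[[str, List[str]], str]

def route(image_id: str,
          stages: List[Tuple[str, Proposer]],
          selector: Selector) -> Tuple[Optional[str], List[str]]:
    """Run the stages in order; return (model-card id or None, per-stage trace)."""
    context: List[str] = []
    trace: List[str] = []
    for stage_name, propose in stages:
        top2 = propose(image_id, context)[:2]
        choice = selector(image_id, top2)
        trace.append(f"{stage_name}: {choice} (from {top2})")
        if choice in EXIT_LABELS:
            return None, trace  # early exit: no specialist model is invoked
        context.append(choice)
    return context[-1], trace

# Hypothetical stubs standing in for VLM prompts at each stage.
def propose_modality(image_id, context):
    return ["endoscopy", "fundus"]

def propose_abnormality(image_id, context):
    if image_id == "healthy.png":
        return ["Normal", "polyp"]
    return ["polyp", "Normal"]

def propose_card(image_id, context):
    return ["card-polyp-seg", "card-polyp-cls"]

def pick_first(image_id, candidates):
    return candidates[0]  # a real selector would re-query the VLM

STAGES = [("modality", propose_modality),
          ("primary abnormality", propose_abnormality),
          ("model-card id", propose_card)]

card, trace = route("case.png", STAGES, pick_first)
exit_card, exit_trace = route("healthy.png", STAGES, pick_first)
```

The returned trace keeps one line per stage, which is one simple way to expose the per-stage justifications the abstract mentions.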
mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Authors: Chan-Wei Hu, Yueqi Wang, Shuo Xing, Chia-Ju Chen, Suofei Feng, Ryan Rossi, Zhengzhong Tu
First: 2025-05-29T23:32:03+00:00 · Latest: 2025-08-26T16:42:37+00:00
Comments: 16 pages
Abstract
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. In this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: the modality configurations and retrieval strategies, (2) the re-ranking stage: strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: how best to integrate retrieved candidates into the final generation process. Finally, we explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.
中文标题/摘要
标题:mRAG:多模态检索增强生成的设计空间阐明
大型视觉-语言模型(LVLMs)在视觉问答、视觉定位和复杂推理等多模态任务中取得了显著进展。然而,它们仍然受限于静态训练数据、幻觉倾向以及无法验证与最新外部证据一致的断言,这在动态现实世界应用中影响了它们的性能。检索增强生成(RAG)提供了一种实用的解决方案,通过检索机制使LVLMs能够访问大规模知识数据库,从而将模型输出与事实性、上下文相关的信息联系起来。在本文中,我们首次系统地剖析了LVLMs的多模态RAG管道,明确研究了(1)检索阶段:模态配置和检索策略,(2)重排序阶段:减少位置偏差并提高检索证据的相关性的策略,以及(3)生成阶段:进一步研究如何将检索到的候选者最佳地整合到最终生成过程中。最后,我们扩展了研究,探索了一种统一的代理框架,通过自我反思将重排序和生成阶段整合起来,使LVLMs能够动态地选择相关证据并抑制无关背景。我们对RAG的全面研究提供了宝贵的见解,导致性能平均提升5%,无需微调。
Summary / 总结
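This paper systematically dissects the multimodal RAG pipeline for Large Vision-Language Models (LVLMs) across three stages: retrieval (modality configurations and retrieval strategies), re-ranking (mitigating positional bias and improving the relevance of retrieved evidence), and generation (integrating retrieved candidates into the final output). It also explores a unified agentic framework that combines re-ranking and generation through self-reflection, so the model can select relevant evidence and suppress irrelevant context. The authors report an average performance boost of 5% without any fine-tuning.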
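The retrieve, re-rank, generate structure discussed in the mRAG abstract can be sketched in a few lines. This is a generic illustration under stated assumptions: embeddings are given as plain NumPy vectors, `relevance` is a hypothetical per-document score that a re-ranker would supply, and the prompt format is invented for the example. It does not reproduce the paper's retrievers, re-rankers, or prompts.

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k):
    """Return indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return [int(i) for i in np.argsort(-scores)[:k]]

def rerank(candidates, relevance, n):
    """Reorder retrieved candidates by a separate relevance score; keep the top n."""
    return sorted(candidates, key=lambda i: -relevance[i])[:n]

def build_prompt(question, docs, chosen):
    """Place the selected evidence ahead of the question for the generator."""
    evidence = "\n".join(f"[{rank + 1}] {docs[i]}" for rank, i in enumerate(chosen))
    return f"Evidence:\n{evidence}\nQuestion: {question}"

# Toy data: three documents with 2-D embeddings (all values are made up).
docs = ["doc about cats", "doc about kittens", "doc about cars"]
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query_vec = np.array([1.0, 0.0])
relevance = {0: 0.2, 1: 0.9, 2: 0.5}  # hypothetical re-ranker scores

retrieved = retrieve(query_vec, doc_vecs, k=2)
chosen = rerank(retrieved, relevance, n=1)
prompt = build_prompt("What is a kitten?", docs, chosen)
```

Separating the three functions mirrors the stage-by-stage analysis in the abstract: each stage can be swapped or ablated without touching the others.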
AT-CXR: Uncertainty-Aware Agentic Triage for Chest X-rays
Authors: Xueyang Li, Mingze Jiang, Gelei Xu, Jun Xia, Mengzhao Jia, Danny Chen, Yiyu Shi
First: 2025-08-26T14:33:09+00:00 · Latest: 2025-08-26T14:33:09+00:00
Abstract
Agentic AI is advancing rapidly, yet truly autonomous medical-imaging triage, where a system decides when to stop, escalate, or defer under real constraints, remains relatively underexplored. To address this gap, we introduce AT-CXR, an uncertainty-aware agent for chest X-rays. The system estimates per-case confidence and distributional fit, then follows a stepwise policy to issue an automated decision or abstain with a suggested label for human intervention. We evaluate two router designs that share the same inputs and actions: a deterministic rule-based router and an LLM-decided router. Across five-fold evaluation on a balanced subset of the NIH ChestX-ray14 dataset, both variants outperform strong zero-shot vision-language models and state-of-the-art supervised classifiers, achieving higher full-coverage accuracy and superior selective-prediction performance, evidenced by a lower area under the risk-coverage curve (AURC) and a lower error rate at high coverage, while operating with lower latency that meets practical clinical constraints. The two routers provide complementary operating points, enabling deployments to prioritize maximal throughput or maximal accuracy. Our code is available at https://github.com/XLIAaron/uncertainty-aware-cxr-agent.
中文标题/摘要
标题:AT-CXR:胸部X光的不确定性感知自主分诊
自主AI正迅速发展,但真正自主的医学影像分诊,即在实际约束下系统决定何时停止、升级或推迟,仍相对未被充分探索。为解决这一问题,我们引入了AT-CXR,一种不确定性感知的胸部X光代理。该系统估计每例的置信度和分布拟合,然后遵循逐步策略发出自动决策或在建议标签下弃权供人类干预。我们评估了两种共享相同输入和动作的路由器设计:确定性规则路由器和基于LLM的路由器。在对NIH ChestX-ray14数据集平衡子集进行五折评估中,两种变体均优于强大的零样本视觉语言模型和最先进的监督分类器,实现了更高的全面覆盖准确性和更优的选择性预测性能,通过较低的风险覆盖曲线下的面积(AURC)和高覆盖下的较低错误率得到证实,同时以满足实际临床约束的较低延迟运行。两种路由器提供了互补的操作点,使部署能够优先考虑最大吞吐量或最大准确性。我们的代码可在https://github.com/XLIAaron/uncertainty-aware-cxr-agent 获取。
Summary / 总结
AT-CXR is an uncertainty-aware agent designed for autonomous medical imaging triage of chest X-rays. It estimates case confidence and distributional fit, then decides whether to issue an automated decision or abstain with a suggested label for human intervention. Evaluations on a balanced subset of the NIH ChestX-ray14 dataset show that both deterministic and LLM-decided router designs outperform strong zero-shot vision-language models and state-of-the-art supervised classifiers, achieving higher full-coverage accuracy and better selective-prediction performance with lower latency that meets clinical constraints. The two routers provide complementary operating points for different deployment priorities.
AT-CXR 是一种针对胸部 X 光片的自主医学影像分诊的不确定性感知代理。它估计每个病例的置信度和分布拟合,并遵循逐步策略决定是否发出自动决策或在需要人类干预时提供建议标签。在 NIH ChestX-ray14 数据集平衡子集上的评估表明,这两种基于规则的确定性路由器和基于大语言模型的路由器都优于强大的零样本视觉-语言模型和最先进的监督分类器,实现了更高的全面覆盖准确性和更好的选择性预测性能,同时具有较低的延迟,符合实际临床约束。这两种路由器提供了互补的操作点,以适应不同的部署优先级。
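The AT-CXR abstract reports selective-prediction quality via the area under the risk-coverage curve (AURC). A minimal sketch of one common empirical estimate follows: cases are sorted from most to least confident, the error rate is computed over each confident prefix, and those risks are averaged. The function names and toy values are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) when accepting cases from most to least confident.

    confidence: array of per-case confidence scores.
    correct: array of 0/1 flags, 1 if the prediction for that case was right.
    """
    order = np.argsort(-confidence, kind="stable")
    errors = 1.0 - correct[order].astype(float)
    counts = np.arange(1, len(errors) + 1)
    coverage = counts / len(errors)      # fraction of cases accepted
    risk = np.cumsum(errors) / counts    # error rate among accepted cases
    return coverage, risk

def aurc(confidence, correct):
    """Empirical AURC: mean selective risk over all coverage levels."""
    _, risk = risk_coverage_curve(confidence, correct)
    return float(risk.mean())

# Toy example with four cases (values are made up).
confidence = np.array([0.9, 0.8, 0.6, 0.3])
correct = np.array([1, 1, 0, 1])
coverage, risk = risk_coverage_curve(confidence, correct)
score = aurc(confidence, correct)
```

A lower value means errors are concentrated among low-confidence cases, which is what lets a triage system abstain on exactly the cases it is likely to get wrong.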
History