Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Authors: Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou
First: 2026-04-08T17:53:26+00:00 · Latest: 2026-04-08T17:53:26+00:00
Abstract
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
中文标题/摘要
标题:Appear2Meaning: 跨文化结构化文化元数据从图像推断基准
近期视觉-语言模型(VLMs)的进步提高了文化遗产图像描述的效果。然而,从视觉输入中推断结构化文化元数据(例如,创作者、起源、时期)仍然未被充分探索。我们引入了一个多类别、跨文化的基准用于此任务,并使用LLM作为评判者框架来衡量与参考注释的语义对齐。为了评估文化推理,我们在不同文化区域报告了精确匹配、部分匹配和属性级准确度。结果表明,模型捕捉到了碎片化的信号,并且在不同文化和元数据类型上表现出显著的性能差异,导致不一致且缺乏依据的预测。这些发现突显了当前VLMs在结构化文化元数据推断方面超越视觉感知的局限性。
Summary / 总结
The research evaluates how well vision-language models (VLMs) infer structured cultural metadata (e.g., creator, origin, period) from images. It introduces a multi-category, cross-cultural benchmark and scores model outputs with an LLM-as-Judge framework, reporting exact-match, partial-match, and attribute-level accuracy across cultural regions. The study finds that models capture fragmented signals and show substantial performance variation across cultures and metadata types, indicating limitations in structured cultural metadata inference beyond visual perception.
研究旨在通过视觉语言模型(VLMs)来提高从图像中推断结构化文化元数据的能力。引入了一个多类别、跨文化的基准测试来评估这些模型,并通过LLM-as-Judge框架关注文化推理。研究发现,模型捕获的是碎片化的信号,并且在不同文化和元数据类型上表现出显著的性能差异,这表明在视觉感知之外进行结构化文化元数据推断时存在局限性。
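The three metrics named in the Appear2Meaning abstract can be illustrated with a minimal sketch. It assumes the LLM judge has already produced a boolean verdict per attribute; the record layout, the region names, and the definition of "partial match" (some but not all attributes judged correct) are assumptions for illustration, not the paper's specification.

```python
from collections import defaultdict

# Example attribute names taken from the abstract ("creator, origin, period").
ATTRIBUTES = ("creator", "origin", "period")

def score_records(records):
    """records: list of {'region': str, 'verdicts': {attribute: bool}} (hypothetical layout)."""
    per_region = defaultdict(lambda: {"n": 0, "exact": 0, "partial": 0})
    attr_hits = defaultdict(int)
    for rec in records:
        verdicts = rec["verdicts"]
        correct = sum(bool(verdicts[a]) for a in ATTRIBUTES)
        stats = per_region[rec["region"]]
        stats["n"] += 1
        # Exact match: the judge accepted every attribute.
        stats["exact"] += int(correct == len(ATTRIBUTES))
        # Partial match (assumed definition): some, but not all, attributes accepted.
        stats["partial"] += int(0 < correct < len(ATTRIBUTES))
        for a in ATTRIBUTES:
            attr_hits[a] += int(bool(verdicts[a]))
    region_rates = {
        region: {"exact": s["exact"] / s["n"], "partial": s["partial"] / s["n"]}
        for region, s in per_region.items()
    }
    attr_acc = {a: attr_hits[a] / len(records) for a in ATTRIBUTES}
    return region_rates, attr_acc
```

Reporting the rates per region rather than pooled is what exposes the cross-cultural variation the abstract describes.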
Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity
Authors: Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra
First: 2025-09-11T18:53:21+00:00 · Latest: 2026-04-08T16:45:50+00:00
Comments: 33 pages; 2 appendices; 6 figures; 2 tables. Code available at https://github.com/Lafayette-EshbaughSilveyra-Group/synthetic-homes
Abstract
Computational models have emerged as powerful tools for multi-scale energy modeling at the building and urban scales, supporting data-driven analysis of energy systems. However, these models require large amounts of building parameter data that is often inaccessible, expensive to collect, or subject to privacy constraints. We introduce a modular, multimodal generative Artificial Intelligence (AI) framework that integrates image, tabular, and simulation-based components and produces synthetic residential building datasets from publicly available county records and images, and present an end-to-end pipeline instantiating this framework. To reduce typical Large Language Model (LLM) challenges, we evaluate our model's components using occlusion-based visual focus analysis. Our analysis demonstrates that our selected vision-language model achieves significantly stronger visual focus than a GPT-based alternative for building image processing. We also assess realism of our results against a national reference dataset. Our synthetic data overlaps more than 65% with the reference dataset across all evaluated parameters and more than 90% for three of the four parameters. This work reduces dependence on costly or restricted data sources, lowering barriers to building-scale energy research and Machine Learning (ML)-driven urban energy modeling, thereby enabling scalable downstream tasks such as energy modeling, retrofit analysis, and urban-scale simulation under data scarcity.
中文标题/摘要
标题:合成住宅:在数据稀缺条件下住宅建筑数据生成的多模态生成AI管道
计算模型已成为多尺度建筑和城市能效建模研究的强大工具,支持建筑和城市能效系统的数据驱动分析。然而,这些模型需要大量的建筑参数数据,这些数据往往难以获取、收集成本高昂或受到隐私限制。我们介绍了一种模块化、多模态生成人工智能(AI)框架,该框架结合了图像、表格和基于模拟的组件,并从公开的县记录和图像中生成合成住宅建筑数据集,同时展示了该框架的端到端管道实例。为了减少大型语言模型(LLM)的典型挑战,我们使用遮挡基视觉焦点分析评估了模型的各个组件。我们的分析表明,我们选择的视觉-语言模型在建筑图像处理方面的视觉焦点显著强于基于GPT的替代方案。我们还评估了结果的现实性,与国家参考数据集进行了对比。我们的合成数据在所有评估参数上与参考数据集的重叠超过65%,在四个参数中的三个上超过90%。这项工作减少了对昂贵或受限数据源的依赖,降低了建筑规模能效研究和基于机器学习(ML)的城市能效建模的门槛,从而使得在数据稀缺条件下进行能效建模、改造分析和城市规模模拟等下游任务变得可行。
Summary / 总结
This paper introduces a multimodal generative AI framework to create synthetic residential building datasets, addressing the scarcity of building parameter data. The framework integrates image, tabular, and simulation components and uses publicly available county records and images. Visual focus analysis shows that the vision-language model outperforms a GPT-based model for building image processing. The synthetic data closely matches a national reference dataset, with overlaps exceeding 65% across all parameters and over 90% for three out of four parameters, facilitating scalable energy modeling and urban simulation under data scarcity.
本文介绍了一种多模态生成AI框架,用于创建合成的住宅建筑数据集,以应对建筑参数数据稀缺的问题。该框架整合了图像、表格和模拟组件,并利用公开的县记录和图像。视觉焦点分析表明,视觉语言模型在处理建筑图像方面优于基于GPT的模型。合成数据与国家参考数据集高度匹配,所有参数的重叠率超过65%,四个参数中的三个超过90%,这有助于在数据稀缺条件下进行大规模的能源建模和城市模拟。
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding
Authors: Yuhao Su, Anwesa Choudhuri, Zhongpai Gao, Benjamin Planche, Van Nguyen Nguyen, Meng Zheng, Yuhan Shen, Arun Innanje, Terrence Chen, Ehsan Elhamifar, Ziyan Wu
Venue: CVPR 2026
First: 2025-12-06T22:27:59+00:00 · Latest: 2026-04-08T16:04:11+00:00
Comments: Accepted at CVPR 2026
Abstract
Large vision-language models struggle with medical video understanding, where spatial precision, temporal reasoning, and clinical semantics are critical. To address this, we first introduce MedVidBench, a large-scale benchmark of 531,850 video-instruction pairs across 8 medical sources spanning video, segment, and frame-level tasks, curated through a rigorous quality assurance pipeline with expert-guided prompting and dual-model validation. While supervised fine-tuning on MedVidBench yields noticeable gains, standard Reinforcement Learning (RL) fails due to imbalanced reward scales across datasets, which destabilizes optimization and leads to training collapse. To overcome this, we introduce MedGRPO, a novel RL framework for balanced multi-dataset training with two key innovations: (1) cross-dataset reward normalization that maps each dataset's median performance to a common reward value, ensuring fair optimization regardless of difficulty, and (2) a medical LLM judge that evaluates caption quality on five clinical dimensions through comparative similarity scoring. Qwen2.5-VL-7B with supervised fine-tuning on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash across all tasks, while MedGRPO further improves the SFT baseline on grounding and captioning. Our work establishes a foundational benchmark and training methodology for advancing medical video understanding with VLMs. Our project website is available at: https://uii-america.github.io/MedGRPO/.
中文标题/摘要
标题:MedGRPO:多任务强化学习在异质医疗视频理解中的应用
大型视觉-语言模型在医疗视频理解方面存在困难,因为需要精确的空间感知、时间推理和临床语义。为了解决这一问题,我们首先引入了**MedVidBench**,这是一个包含531,850个视频指令对的大规模基准数据集,覆盖8个医疗来源,涵盖视频、片段和帧级任务,通过严格的质控管道和专家引导提示及双模型验证进行编目。虽然在MedVidBench上进行监督微调可以取得显著进步,但标准强化学习(RL)由于不同数据集之间的奖励尺度不平衡,导致优化不稳定并导致训练崩溃。为克服这一问题,我们引入了**MedGRPO**,这是一种新的RL框架,用于平衡多数据集训练,具有两个关键创新:(1)**跨数据集奖励归一化**,将每个数据集的中位性能映射到一个共同的奖励值,确保无论难度如何都能公平优化,以及(2)**医疗LLM裁判**,通过比较相似度评分评估字幕质量的五个临床维度。在MedVidBench上对Qwen2.5-VL-7B进行监督微调,其表现优于GPT-4.1和Gemini-2.5-Flash,而MedGRPO进一步提高了微调基线在定位和字幕生成上的表现。我们的工作为使用VLMs推进医疗视频理解奠定了基础,并提出了训练方法。我们的项目网站可访问:https://uii-america.github.io/MedGRPO/。
Summary / 总结
The research aims to improve medical video understanding by addressing the limitations of large vision-language models in handling spatial precision, temporal reasoning, and clinical semantics. MedGRPO, a novel RL framework, introduces cross-dataset reward normalization and a medical LLM judge to balance multi-dataset training. Experiments show that supervised fine-tuning Qwen2.5-VL-7B on MedVidBench outperforms GPT-4.1 and Gemini-2.5-Flash, and MedGRPO further enhances the SFT baseline on grounding and captioning tasks.
研究通过引入包含531,850个视频-指令对的MedVidBench基准,解决了医学视频理解的挑战。提出了MedGRPO,这是一种强化学习框架,包括跨数据集奖励归一化和医学LLM裁判,以实现多数据集训练的平衡。监督微调Qwen2.5-VL-7B在MedVidBench上优于GPT-4.1和Gemini-2.5-Flash,并且MedGRPO进一步提高了监督微调基线的定位和描述性能。
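The cross-dataset reward normalization in the MedGRPO entry above (each dataset's median performance mapped to a common reward value) could look roughly like the sketch below. The piecewise-linear mapping, the target value of 0.5, the [0, 1] score range, and the function names are assumptions chosen for illustration; the abstract does not specify the mapping's form.

```python
from statistics import median

def make_normalizer(baseline_scores, target=0.5):
    """Build a per-dataset reward map that sends the dataset's median score to `target`.

    Assumes raw scores lie in [0, 1]; the piecewise-linear shape is a hypothetical choice.
    """
    m = median(baseline_scores)

    def normalize(raw):
        raw = min(max(raw, 0.0), 1.0)  # clip to the assumed score range
        if raw <= m:
            # Scores at or below the median are mapped linearly onto [0, target].
            return target * raw / m if m > 0 else target
        # Scores above the median are mapped linearly onto (target, 1].
        return target + (1.0 - target) * (raw - m) / (1.0 - m)

    return normalize

# Two hypothetical datasets of different difficulty.
easy = make_normalizer([0.7, 0.8, 0.9])
hard = make_normalizer([0.1, 0.2, 0.3])
```

With this mapping, a median-quality answer earns the same reward on the easy and the hard dataset, so neither dominates the policy gradient purely because of its raw score scale.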
Caption-Matching: A Multimodal Approach for Cross-Domain Image Retrieval
Authors: Lucas Iijima, Nikolaos Giakoumoglou, Tania Stathaki
First: 2024-03-22T12:08:16+00:00 · Latest: 2026-04-08T14:55:27+00:00
Abstract
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Existing CDIR methods rely either on supervised learning with labeled cross-domain correspondences or on methods that require training or fine-tuning on target datasets, often struggling with substantial domain gaps and limited generalization to unseen domains. This paper introduces a novel CDIR approach that incorporates textual context by leveraging publicly available pre-trained vision-language models. Our method, Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or further training. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in plug-and-play settings with consistent improvements on Office-Home and DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.
中文标题/摘要
标题:Caption-Matching:跨域图像检索的多模态方法
跨域图像检索(CDIR)是计算机视觉中的一个挑战性任务,旨在匹配不同视觉域(如素描、绘画和照片)中的图像。现有CDIR方法要么依赖于带有跨域对应标签的监督学习,要么需要在目标数据集上进行训练或微调,通常难以处理显著的域差距并有限地泛化到未见过的域。本文提出了一种新的CDIR方法,通过利用公开的预训练视觉-语言模型来引入文本上下文。我们的方法,图像配对(CM),使用生成的图像描述作为域无关的中间表示,从而在无需标注数据或进一步训练的情况下实现有效的跨域相似性计算。我们在标准CDIR基准数据集上评估了我们的方法,展示了在插件即用设置中优于先前方法的最新性能,并在Office-Home和DomainNet上实现了持续改进。我们还展示了该方法在Midjourney生成的图像数据集上的有效性,证明了其处理复杂、多域查询的能力。
Summary / 总结
The research aims to address the challenge of Cross-Domain Image Retrieval (CDIR) by proposing a novel approach called Caption-Matching (CM) that uses publicly available pre-trained vision-language models to generate image captions as a domain-agnostic intermediate representation. This method enables effective cross-domain similarity computation without requiring labeled data or further training. Experimental results show that CM outperforms previous methods on standard CDIR benchmark datasets, including Office-Home and DomainNet, and also demonstrates robustness in handling complex, multi-domain queries from AI-generated images.
研究提出了一种名为Caption-Matching (CM)的新方法,利用生成的图像描述作为跨域通用的中间表示,无需使用标注数据或在目标数据集上进行进一步训练,能够有效处理显著的域差距。实验结果显示,CM在标准CDIR基准数据集上优于先前的方法,并且也适用于来自Midjourney的复杂多域查询的AI生成图像。
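To make the idea in the Caption-Matching entry above concrete, the following sketch ranks gallery images by the similarity of their captions to the query's caption. It assumes captions have already been generated by some vision-language model; the bag-of-words cosine similarity is a stand-in for whatever text similarity the paper uses, and the file names and captions are invented.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector for a caption (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_caption, gallery_captions, k=1):
    """Return the k gallery image names whose captions best match the query caption."""
    q = bow(query_caption)
    ranked = sorted(
        gallery_captions,
        key=lambda name: cosine(q, bow(gallery_captions[name])),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical captions for images from three different visual domains.
gallery = {
    "photo_dog.jpg": "a brown dog running on grass",
    "painting_boat.png": "a sailing boat on a calm sea",
    "sketch_cat.png": "a cat sitting on a chair",
}
```

Because the comparison happens entirely in text space, the sketch needs no labeled cross-domain pairs and no training, which is the property the abstract emphasizes.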
Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering
Authors: Zhuohong Chen, Zhenxian Wu, Yunyao Yu, Hangrui Xu, Zirui Liao, Zhifang Liu, Xiangwen Deng, Pen Jiao, Haoqian Wang
First: 2026-04-08T14:37:35+00:00 · Latest: 2026-04-08T14:37:35+00:00
Abstract
Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, or Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
中文标题/摘要
标题:学习搜索:基于决策的代理模型在知识驱动的视觉问答中的应用
知识驱动的视觉问答(KB-VQA)要求视觉-语言模型理解图像并利用外部知识,特别是对于稀有实体和长尾事实。现有的大多数检索增强生成(RAG)方法采用固定管道,顺序检索信息、过滤信息,然后生成答案。这种设计使得模型难以适应多种问题类型。此外,检索与推理分离,使得模型难以决定何时搜索、如何调整查询或何时停止。因此,检索到的证据往往与问题不匹配。为了解决这些局限性,我们将KB-VQA重新定义为搜索代理问题,并将解决问题的过程建模为多步决策过程。在每一步,代理根据其当前信息状态选择四种动作之一-回答、图像检索、文本检索和基于描述的检索。我们进一步设计了一个自动化管道来收集多步轨迹,记录代理的推理过程、工具使用情况和中间决策。这些轨迹随后用于微调的监督。在InfoSeek和E-VQA上的实验表明,我们的方法达到了最先进的性能,始终优于先前的基线,证实了我们框架的有效性。
Summary / 总结
The research aims to improve knowledge-based visual question answering by addressing the limitations of existing retrieval-augmented generation methods. It proposes a decision-based agent that models the solving process as a multi-step decision-making procedure, where the agent selects actions such as Answer, Image Retrieval, Text Retrieval, or Caption at each step based on its current information state. The method uses an automated pipeline to collect multi-step trajectories for fine-tuning, and experiments show that it achieves state-of-the-art performance on InfoSeek and E-VQA, outperforming previous methods.
论文通过将KB-VQA重新定义为搜索代理问题,解决了现有检索增强生成方法的局限性。代理在每一步根据当前信息状态选择回答问题、检索图像或文本、或使用描述,从而做出决策。该方法使用自动化管道收集多步轨迹进行微调。实验表明,这种方法在InfoSeek和E-VQA上的表现优于先前的方法,证明了其在处理不同问题类型和使检索证据与问题对齐方面的有效性。
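The multi-step decision procedure in the entry above can be sketched as a simple control loop. Only the four action names come from the abstract; the `policy` callable, the `tools` mapping, the state layout, and the step limit are hypothetical placeholders for the fine-tuned model and its retrieval back-ends.

```python
# The four action names are taken from the abstract.
ACTIONS = ("Answer", "Image Retrieval", "Text Retrieval", "Caption")

def run_agent(question, policy, tools, max_steps=6):
    """Run the decision loop until the policy answers or the step budget runs out.

    policy(state) -> (action, argument); tools maps each non-Answer action to a callable.
    Returns (answer or None, trajectory of (action, argument) pairs).
    """
    state = {"question": question, "evidence": []}
    trajectory = []
    for _ in range(max_steps):
        action, argument = policy(state)
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        trajectory.append((action, argument))
        if action == "Answer":
            return argument, trajectory
        # Any other action calls a tool and adds its output to the information state.
        state["evidence"].append(tools[action](argument))
    return None, trajectory
```

The recorded `trajectory` corresponds to the kind of multi-step trace the abstract says is collected and used as fine-tuning supervision.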
A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing
Authors: Chenhao Liu, Zelin Wen, Yan Tong, Junjie Zhu, Xinyu Tian, Yuchi Liu, Ashu Gupta, Syed M. S. Islam, Tom Gedeon, Yue Yao
First: 2026-04-08T14:21:29+00:00 · Latest: 2026-04-08T14:21:29+00:00
Abstract
Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.
中文标题/摘要
标题:一种保留用途的跨医院放射学数据去标识流水线
大规模放射学数据对于开发稳健的医疗AI系统至关重要。然而,由于隐私问题,跨医院共享此类数据仍然受到严重限制。现有放射学去标识研究主要集中在移除可识别信息以实现合规数据发布。然而,去标识的放射学数据是否仍能保留足够的用途以供大规模视觉-语言模型训练和跨医院转移尚未得到充分探索。在本文中,我们介绍了一种保留用途的跨医院放射学数据去标识流水线(UPDP)。具体而言,我们编制了一份隐私敏感术语黑名单和一份病理相关术语白名单。对于放射学图像,我们使用生成性过滤机制,合成隐私过滤和病理保留的图像对应物。这些合成图像对应物与ID过滤报告一起,可以安全地在医院之间共享,以供下游模型开发和评估使用。在公共胸部X光基准测试上的实验表明,我们的方法能够有效去除隐私敏感信息,同时保留诊断相关的病理线索。在去标识数据上训练的模型与在原始数据上训练的模型相比保持了竞争力的诊断准确性,而身份相关准确性则显著下降,这证实了有效的隐私保护。在跨医院环境中,我们进一步表明,去标识数据可以与本地数据结合使用,以获得更好的性能。
Summary / 总结
This paper addresses the challenge of sharing radiology data across hospitals while preserving utility for AI model training. It introduces a utility-preserving de-identification pipeline (UPDP) that uses a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. The method generates synthetic images that retain diagnostic information while removing privacy details, allowing secure data sharing. Experiments show that models trained on de-identified data maintain diagnostic accuracy similar to original data but with reduced identity-related accuracy, confirming effective privacy protection and utility preservation.
本文旨在解决在保护隐私的同时跨医院共享放射学数据并保持其对AI模型训练的实用性问题。提出了一种保留实用性的去标识化管道(UPDP),使用隐私敏感词汇黑名单和病理相关词汇白名单。该管道生成保留诊断信息但去除识别细节的合成图像。实验表明,使用去标识化数据训练的模型保持与使用原始数据训练的模型相似的诊断准确性,同时在身份相关准确性方面有所下降,表明有效保护隐私。此外,在跨医院环境中,去标识化数据与本地数据结合使用可以提高性能。
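One way the blacklist and whitelist from the UPDP entry above might be applied to report text is sketched below. The abstract only states that the two term lists are compiled; the word-level matching, the rule that whitelisted terms always survive, the mask string, and the example terms are all assumptions for illustration.

```python
import re

def filter_report(report, blacklist, whitelist, mask="[REDACTED]"):
    """Mask blacklisted words in a report, never masking whitelisted pathology terms."""
    protected = {w.lower() for w in whitelist}
    # A term on both lists is kept: pathology terms take precedence (assumed rule).
    blocked = {w.lower() for w in blacklist} - protected

    def replace(match):
        word = match.group(0)
        return mask if word.lower() in blocked else word

    # Visit each alphanumeric word; punctuation and spacing are left untouched.
    return re.sub(r"[A-Za-z0-9]+", replace, report)
```

For example, `filter_report("Patient Smith shows cardiomegaly.", {"Smith"}, {"cardiomegaly"})` masks the name and leaves the finding in place.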
Selective Neuron Amplification for Training-Free Task Enhancement
Authors: Ryyan Akhtar
First: 2026-04-08T13:51:07+00:00 · Latest: 2026-04-08T13:51:07+00:00
Comments: 28 pages, 12 figures. Preprint. Code and experiments conducted independently
Abstract
Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification (SNA), which increases the influence of task-relevant neurons without changing the model's parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain and has little effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.
Summary / 总结
The research addresses large language models failing on tasks they seem to understand, which the author's experiments attribute to weak activation of task-relevant internal circuits rather than missing knowledge. The method, Selective Neuron Amplification (SNA), increases the influence of task-relevant neurons at inference time without altering the model's parameters. The key finding is that SNA helps mainly when the model is uncertain and has little effect when it is already confident, suggesting that some model failures stem from weak activation rather than lack of capability.
研究旨在解决大型语言模型在似乎理解的任务上失败的问题,这通常归因于任务相关神经元的激活不足,而不是知识的缺乏。方法是选择性神经放大(SNA),在推理过程中增强任务相关神经元的影响,而不改变模型参数。主要发现是,当模型不确定时,SNA最有效,这表明一些模型失败是由于相关神经元激活不足造成的。
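A toy illustration of the amplification idea in the entry above: selected hidden activations are scaled at inference time while the weights stay untouched. How task-relevant neurons are identified is not covered here (their indices are assumed to be given), and the tiny linear readout, the gain value, and all numbers are hypothetical.

```python
def amplify(hidden, neuron_ids, gain=2.0):
    """Scale the activations of the chosen neurons; the input list is not modified."""
    return [h * gain if i in neuron_ids else h for i, h in enumerate(hidden)]

def logits(hidden, weights):
    """Linear readout: one row of weights per output class."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def predict(hidden, weights):
    scores = logits(hidden, weights)
    return scores.index(max(scores))

# Hypothetical state: neuron 0 supports class 0 but is only weakly active,
# so the unmodified model narrowly prefers class 1.
hidden = [0.2, 1.0, 0.1]
weights = [[3.0, 0.0, 0.0],
           [0.0, 0.7, 0.0]]

before = predict(hidden, weights)
after = predict(amplify(hidden, {0}), weights)
```

In this toy case the margin between the two classes is small, so amplifying one neuron flips the prediction; with a large margin the same gain would leave the output unchanged, which mirrors the abstract's observation about confident versus uncertain cases.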
PlaneCycle: Training-Free 2D-to-3D Lifting of Foundation Models Without Adapters
Authors: Yinghong Yu, Guangyuan Li, Jiancheng Yang
First: 2026-03-04T15:23:30+00:00 · Latest: 2026-04-08T13:17:26+00:00
Abstract
Large-scale 2D foundation models exhibit strong transferable representations, yet extending them to 3D volumetric data typically requires retraining, adapters, or architectural redesign. We introduce PlaneCycle, a training-free, adapter-free operator for architecture-agnostic 2D-to-3D lifting of foundation models. PlaneCycle reuses the original pretrained 2D backbone by cyclically distributing spatial aggregation across orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. The method introduces no additional parameters and is applicable to arbitrary 2D networks. Using pretrained DINOv3 models, we evaluate PlaneCycle on six 3D classification and three 3D segmentation benchmarks. Without any training, the lifted models exhibit intrinsic 3D fusion capability and, under linear probing, outperform slice-wise 2D baselines and strong 3D counterparts, approaching the performance of fully trained models. With full fine-tuning, PlaneCycle matches standard 3D architectures, highlighting its potential as a seamless and practical 2D-to-3D lifting operator. These results demonstrate that 3D capability can be unlocked from pretrained 2D foundation models without structural modification or retraining. Code is available at https://github.com/HINTLab/PlaneCycle.
中文标题/摘要
标题:PlaneCycle:无需训练的2D到3D基础模型提升操作
大规模的2D基础模型表现出强大的可迁移表示,但将其扩展到3D体数据通常需要重新训练、适配器或架构重设计。我们引入了PlaneCycle,这是一种无需训练、无需适配器的操作符,用于基础模型的架构无关的2D到3D提升。PlaneCycle 通过在网络深度中周期性地在正交的HW、DW和DH平面间分配空间聚合,重用了原始预训练的2D主干,从而实现渐进的3D融合并保留预训练的归纳偏置。该方法不引入额外参数,并适用于任意2D网络。使用预训练的DINOv3模型,我们在六个3D分类和三个3D分割基准上评估了PlaneCycle。在无需训练的情况下,提升后的模型展示了内在的3D融合能力,并在线性探测下优于切片式的2D基线和强大的3D对应物,接近完全训练模型的性能。在全微调后,PlaneCycle 达到了标准3D架构的性能,突显了其作为无缝且实用的2D到3D提升操作符的潜力。这些结果表明,3D能力可以从预训练的2D基础模型中解锁,无需结构修改或重新训练。代码可在 https://github.com/HINTLab/PlaneCycle 获取。
Summary / 总结
PlaneCycle is a training-free, adapter-free operator for lifting 2D foundation models to 3D volumetric data without architectural redesign. It cyclically distributes spatial aggregation across the orthogonal HW, DW, and DH planes throughout network depth, enabling progressive 3D fusion while preserving pretrained inductive biases. Evaluations with pretrained DINOv3 models on six 3D classification and three 3D segmentation benchmarks show that, under linear probing, the lifted models outperform slice-wise 2D baselines and strong 3D counterparts and approach the performance of fully trained models; with full fine-tuning, PlaneCycle matches standard 3D architectures.
PlaneCycle 是一种无需训练、无需适配器或架构重设计的方法,用于将 2D 基础模型转换为 3D 模型。它通过在 HW、DW 和 DH 方向的正交平面上周期性地分布空间聚合来实现逐层的 3D 融合,同时保留预训练的归纳偏置。在六个 3D 分类和三个 3D 分割基准上的评估表明,提升后的模型在线性探针下优于切片式的 2D 基线和强大的 3D 对手,并且在全微调后与标准 3D 架构的性能相当。
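The plane cycling described in the PlaneCycle entry above can be illustrated on a nested-list volume indexed as `vol[d][h][w]`. The sketch below applies the same 2D operator on HW slices at one layer, DW slices at the next, and DH slices after that. The plane order, the use of one shared operator for every layer, and the slice-mean operator are simplifying assumptions; a real backbone would apply its own pretrained 2D blocks.

```python
PLANES = ("HW", "DW", "DH")  # assumed cycling order

def apply_on_plane(vol, op2d, plane):
    """Apply a shape-preserving 2D operator to every slice of vol[d][h][w] along one plane."""
    D, H, W = len(vol), len(vol[0]), len(vol[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(D)]
    if plane == "HW":      # one slice per depth index
        for d in range(D):
            res = op2d([row[:] for row in vol[d]])
            for h in range(H):
                for w in range(W):
                    out[d][h][w] = res[h][w]
    elif plane == "DW":    # one slice per height index
        for h in range(H):
            res = op2d([[vol[d][h][w] for w in range(W)] for d in range(D)])
            for d in range(D):
                for w in range(W):
                    out[d][h][w] = res[d][w]
    elif plane == "DH":    # one slice per width index
        for w in range(W):
            res = op2d([[vol[d][h][w] for h in range(H)] for d in range(D)])
            for d in range(D):
                for h in range(H):
                    out[d][h][w] = res[d][h]
    else:
        raise ValueError(f"unknown plane: {plane}")
    return out

def lift(vol, op2d, num_layers):
    """Cycle through the three planes, one plane per layer."""
    for layer in range(num_layers):
        vol = apply_on_plane(vol, op2d, PLANES[layer % len(PLANES)])
    return vol

def slice_mean(x):
    """Toy 2D 'aggregation': replace every element with the slice mean."""
    m = sum(map(sum, x)) / (len(x) * len(x[0]))
    return [[m] * len(x[0]) for _ in x]
```

Starting from a 2x2x2 volume with a single nonzero voxel, one layer spreads the value only within its HW slice, while two layers reach every voxel. This is a small-scale picture of the progressive 3D fusion the abstract mentions, obtained without adding any parameters.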
PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking
Authors: Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao, Yakai Li, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang
First: 2025-07-29T07:13:56+00:00 · Latest: 2026-04-08T13:05:48+00:00
Comments: This version is withdrawn to consolidate the submission under the corresponding author's primary account. The most recent and maintained version of this work can be found at arXiv:2603.09246
Abstract
The increasing sophistication of large vision-language models (LVLMs) has been accompanied by advances in safety alignment mechanisms designed to prevent harmful content generation. However, these defenses remain vulnerable to sophisticated adversarial attacks. Existing jailbreak methods typically rely on direct and semantically explicit prompts, overlooking subtle vulnerabilities in how LVLMs compose information over multiple reasoning steps. In this paper, we propose a novel and effective jailbreak framework inspired by Return-Oriented Programming (ROP) techniques from software security. Our approach decomposes a harmful instruction into a sequence of individually benign visual gadgets. A carefully engineered textual prompt directs the sequence of inputs, prompting the model to integrate the benign visual gadgets through its reasoning process to produce a coherent and harmful output. This makes the malicious intent emergent and difficult to detect from any single component. We validate our method through extensive experiments on established benchmarks including SafeBench and MM-SafetyBench, targeting popular LVLMs. Results show that our approach consistently and substantially outperforms existing baselines on state-of-the-art models, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving ASR by up to 0.39. Our findings reveal a critical and underexplored vulnerability that exploits the compositional reasoning abilities of LVLMs, highlighting the urgent need for defenses that secure the entire reasoning process.
中文标题/摘要
标题:PRISM:基于图像序列操作的程序化推理脱狱框架
随着大型视觉-语言模型(LVLM)的日益复杂,已经出现了旨在防止有害内容生成的安全对齐机制。然而,这些防御措施仍然容易受到高级对抗攻击的攻击。现有的脱狱方法通常依赖于直接且语义明确的提示,忽视了LVLM在多步推理过程中信息组合中的微妙漏洞。在本文中,我们提出了一种受软件安全领域Return-Oriented Programming (ROP) 技术启发的新型有效脱狱框架。我们的方法将有害指令分解为一系列单独无害的视觉组件。经过精心设计的文本提示指导输入序列,促使模型通过其推理过程整合这些无害的视觉组件,生成连贯且有害的输出。这使得恶意意图在任何单一组件中都难以被检测到。我们通过在SafeBench和MM-SafetyBench等现有基准上进行广泛的实验,针对流行的LVLM验证了我们的方法。结果显示,我们的方法在最先进的模型上始终且显著地优于现有基线,攻击成功率接近完美(SafeBench上超过0.90),并提高了ASR高达0.39。我们的研究结果揭示了一种关键且未被充分探索的漏洞,利用了LVLM的组合推理能力,突显了对整个推理过程进行防御的迫切需求。
Summary / 总结
This paper introduces PRISM, a jailbreak framework for large vision-language models (LVLMs) inspired by Return-Oriented Programming (ROP) techniques. It decomposes a harmful instruction into individually benign visual gadgets and uses a textual prompt to guide the model's reasoning process, making the malicious intent emergent and difficult to detect from any single component. Experiments on SafeBench and MM-SafetyBench show that PRISM outperforms existing baselines, achieving near-perfect attack success rates (over 0.90 on SafeBench) and improving attack success rate (ASR) by up to 0.39.
论文提出了PRISM框架,借鉴了Return-Oriented Programming (ROP)技术,将有害指令分解为无害的视觉组件,并通过文本提示引导模型的推理过程,使恶意意图得以浮现并逃避检测。在SafeBench和MM-SafetyBench上的实验表明,PRISM优于现有方法,实现了接近完美的攻击成功率,并将攻击成功率(ASR)提高了最多0.39。
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
Authors: Mehdi Hosseinzadeh, King Hang Wong, Feras Dayoub
Venue: ICRA 2026
First: 2026-04-08T12:49:24+00:00 · Latest: 2026-04-08T12:49:24+00:00
Comments: ICRA 2026; Project page: https://m80hz.github.io/kite/
Abstract
We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/
中文标题/摘要
标题:KITE:基于关键帧的标记化证据前端用于VLM基机器人故障分析
我们提出了KITE,一种无需训练、以关键帧为中心、以布局为基础的前端,将长机器人执行视频转换为紧凑、可解释的标记化证据,供视觉语言模型(VLM)使用。KITE 将每个轨迹提炼为一组具有开放词汇检测的关键帧,并为每个关键帧配以一个示意图的鸟瞰图(BEV)表示,该表示编码了相对物体布局、轴、时间戳和检测置信度。这些视觉线索与机器人配置文件和场景上下文标记序列化为一个统一的提示,允许相同的前端支持故障检测、识别、定位、解释和纠正,使用现成的VLM。在RoboFAC基准测试中,KITE与Qwen2.5-VL相比,在无需训练的情况下显著提高,特别是在模拟故障检测、识别和定位方面,同时仍与RoboFAC调优基线保持竞争力。进一步的小型QLoRA微调进一步提高了解释和纠正的质量。我们还在真实双臂机器人上报告了定性结果,展示了KITE作为结构化和可解释的前端在机器人故障分析中的实际应用。代码和模型发布在我们的项目页面:https://m80hz.github.io/kite/
Summary / 总结
KITE is a training-free system that converts robot-execution videos into compact, interpretable tokenized evidence for VLMs. It extracts keyframes with open-vocabulary detections and pairs them with BEV representations, which are then serialized into a unified prompt. KITE improves failure detection, identification, and localization on the RoboFAC benchmark compared to a vanilla VLM, and further fine-tuning enhances explanation and correction quality. Qualitative results on real dual-arm robots show KITE's practical applicability in robot failure analysis.
KITE 是一个无需训练的前端,将机器人执行视频转换为 VLM 可解释的紧凑化标记化证据。它提取具有开放词汇检测的运动显著关键帧,并与 BEV 表示配对,然后与机器人和场景标记序列化。在 RoboFAC 基准上,KITE 显著优于 Qwen2.5-VL,特别是在模拟故障检测、识别和定位方面,同时保持与调优基线的竞争力。一个小规模的 QLoRA 微调进一步提高了解释和纠正质量。实际双臂机器人的定性结果表明,KITE 在机器人故障分析中的实用性和可解释性。
ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation
Authors: Qingze He, Fagui Liu, Dengke Zhang, Qingmao Wei, Quan Tang
First: 2026-04-08T12:38:07+00:00 · Latest: 2026-04-08T12:38:07+00:00
Abstract
Weakly supervised semantic segmentation aims to achieve pixel-level predictions using image-level labels. Existing methods typically entangle semantic recognition and object localization, which often leads models to focus exclusively on sparse discriminative regions. Although foundation models show immense potential, many approaches still follow the tightly coupled optimization paradigm, struggling to effectively alleviate pseudo-label noise and often relying on time-consuming multi-stage retraining or unstable end-to-end joint optimization. To address the above challenges, we present ModuSeg, a training-free weakly supervised semantic segmentation framework centered on explicitly decoupling object discovery and semantic assignment. Specifically, we integrate a general mask proposer to extract geometric proposals with reliable boundaries, while leveraging semantic foundation models to construct an offline feature bank, transforming segmentation into a non-parametric feature retrieval process. Furthermore, we propose semantic boundary purification and soft-masked feature aggregation strategies to effectively mitigate boundary ambiguity and quantization errors, thereby extracting high-quality category prototypes. Extensive experiments demonstrate that the proposed decoupled architecture better preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmark datasets. Code is available at https://github.com/Autumnair007/ModuSeg.
中文标题/摘要
标题:ModuSeg:解耦对象发现与语义检索的免训练弱监督分割
弱监督语义分割旨在使用图像级标签实现像素级预测。现有方法通常将语义识别和对象定位紧密结合,这往往导致模型专注于稀疏的判别区域。尽管基础模型具有巨大的潜力,但许多方法仍然遵循紧密耦合的优化范式,难以有效缓解伪标签噪声,经常依赖于耗时的多阶段重新训练或不稳定的端到端联合优化。为了解决上述挑战,我们提出了ModuSeg,这是一种以明确解耦对象发现和语义分配为中心的无监督弱监督语义分割框架。具体而言,我们整合了一种通用的掩码提案器以提取具有可靠边界的几何提案,同时利用语义基础模型构建离线特征库,将分割转换为非参数特征检索过程。此外,我们提出了语义边界净化和软掩码特征聚合策略,以有效缓解边界模糊和量化误差,从而提取高质量的类别原型。广泛的实验表明,提出的解耦架构在不进行参数微调的情况下更好地保留了精细边界,并在标准基准数据集上取得了高度竞争力的性能。代码可在https://github.com/Autumnair007/ModuSeg 获取。
Summary / 总结
ModuSeg is a training-free weakly supervised semantic segmentation framework that decouples object discovery from semantic assignment. It uses a general mask proposer to extract geometric proposals with reliable boundaries and semantic foundation models to build an offline feature bank, turning segmentation into a non-parametric feature retrieval process. Semantic boundary purification and soft-masked feature aggregation mitigate boundary ambiguity and quantization errors when extracting category prototypes. Experiments show that the decoupled architecture better preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmarks.
ModuSeg 是一个无需训练的弱监督语义分割框架,通过明确解耦对象发现和语义分配来提升性能。它使用掩码提案器提取可靠的几何提案,并利用语义基础模型构建离线特征库,将分割转换为特征检索过程。该方法还包括净化语义边界和软掩码聚合的策略,以减少模糊性和误差。实验表明,ModuSeg 在标准基准数据集上的性能优于现有方法,同时保持了精细的边界。
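The retrieval view of segmentation in the ModuSeg entry above can be sketched in two steps: pool per-pixel features under a soft mask, then label the mask proposal with the nearest prototype in a feature bank. Weighted averaging as the aggregation rule, cosine similarity as the retrieval metric, and the two-dimensional toy features are assumptions; boundary purification is not modeled.

```python
import math

def soft_masked_feature(features, soft_mask):
    """Weighted average of per-pixel feature vectors, using soft mask values in [0, 1]."""
    total = sum(soft_mask)
    if total == 0:
        raise ValueError("soft mask has zero total weight")
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(soft_mask, features)) / total for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def assign_label(proposal_feature, feature_bank):
    """Non-parametric retrieval: pick the class whose prototype is most similar."""
    return max(feature_bank, key=lambda label: cosine(proposal_feature, feature_bank[label]))

# Hypothetical proposal: two confident pixels and one low-weight boundary pixel.
pixel_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
soft_mask = [1.0, 0.8, 0.1]
bank = {"cat": [1.0, 0.0], "sofa": [0.0, 1.0]}

label = assign_label(soft_masked_feature(pixel_features, soft_mask), bank)
```

Down-weighting the boundary pixel keeps its off-class feature from contaminating the pooled vector, which is the intuition behind using a soft mask rather than a hard one.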
Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
Authors: Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
First: 2026-04-02T09:53:20+00:00 · Latest: 2026-04-08T10:48:57+00:00
Abstract
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be released on https://github.com/Yzk1114/PGPO.
中文标题/摘要
标题:并非所有词元所见相同:面向大型视觉-语言模型的感知驱动策略优化
虽然可验证奖励强化学习(RLVR)在大型视觉-语言模型(LVLMs)中推进了推理能力,但现有的框架存在一个根本性的方法论缺陷:通过向所有生成的词元分配相同的优势,这些方法会稀释对于优化关键的视觉导向步骤至关重要的学习信号。为弥补这一差距,我们提出了词元视觉依赖性,通过计算视觉条件下的预测分布与仅基于文本的预测分布之间的Kullback-Leibler(KL)散度来量化视觉输入的因果信息增益。在发现这种依赖性高度稀疏且在语义上至关重要之后,我们引入了感知驱动策略优化(PGPO),这是一种新颖的细粒度信用分配框架,能够动态地在词元级别重塑优势。通过一个阈值门控、质量守恒的机制,PGPO能够积极放大依赖视觉的词元的学习信号,同时抑制语言先验带来的梯度噪声。基于Qwen2.5-VL系列在七个具有挑战性的多模态推理基准上的广泛实验表明,PGPO平均提升了模型18.7%。理论和实证分析均证实,PGPO有效降低了梯度方差,防止了训练崩溃,并作为强大的正则化器促进了稳健的、基于感知的多模态推理。代码将在https://github.com/Yzk1114/PGPO上发布。
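The token-level idea in the PGPO abstract above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the `threshold` and `boost` parameters, and the specific renormalization rule are assumptions introduced here. It computes a per-token KL divergence between the visual-conditioned and text-only next-token distributions, then reweights a shared sequence-level advantage so that the total advantage mass is unchanged.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def token_visual_dependency(visual_dists, text_only_dists):
    """One KL value per generated token: how much the image changes the prediction."""
    return [kl_divergence(p, q) for p, q in zip(visual_dists, text_only_dists)]

def reshape_advantages(advantage, dependencies, threshold=0.1, boost=2.0):
    """Hypothetical threshold-gated, mass-conserving reweighting.

    Tokens whose dependency exceeds `threshold` get weight `boost`, the rest
    get weight 1; weights are then rescaled to average 1, so the summed
    advantage equals len(dependencies) * advantage.
    """
    raw = [boost if d > threshold else 1.0 for d in dependencies]
    scale = len(raw) / sum(raw)
    return [advantage * w * scale for w in raw]

# Two tokens over a two-word vocabulary: only the first depends on the image.
deps = token_visual_dependency([[0.9, 0.1], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
token_advantages = reshape_advantages(1.0, deps)
```

In this toy case the first token receives a larger share of the advantage and the second a smaller one, while their sum stays equal to what uniform assignment would give.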
XR-CareerAssist: An Immersive Platform for Personalised Career Guidance Leveraging Extended Reality and Multimodal AI
Authors: N. D. Tantaroudas, A. J. McCracken, I. Karachalios, E. Papatheou, V. Pastrikakis
First: 2026-04-08T09:56:56+00:00 · Latest: 2026-04-08T09:56:56+00:00
Comments: 21
Abstract
Conventional career guidance platforms rely on static, text-driven interfaces that struggle to engage users or deliver personalised, evidence-based insights. Although Computer-Assisted Career Guidance Systems have evolved since the 1960s, they remain limited in interactivity and pay little attention to the narrative dimensions of career development. We introduce XR-CareerAssist, a platform that unifies Extended Reality (XR) with several Artificial Intelligence (AI) modules to deliver immersive, multilingual career guidance. The system integrates Automatic Speech Recognition for voice-driven interaction, Neural Machine Translation across English, Greek, French, and Italian, a Langchain-based conversational Training Assistant for personalised dialogue, a BLIP-based Vision-Language model for career visualisations, and AWS Polly Text-to-Speech delivered through an interactive 3D avatar. Career trajectories are rendered as dynamic Sankey diagrams derived from a repository of more than 100,000 anonymised professional profiles. The application was built in Unity for Meta Quest 3, with backend services hosted on AWS. A pilot evaluation at the University of Exeter with 23 participants returned 95.6% speech recognition accuracy, 78.3% overall user satisfaction, and 91.3% favourable ratings for system responsiveness, with feedback informing subsequent improvements to motion comfort, audio clarity, and text legibility. XR-CareerAssist demonstrates how the fusion of XR and AI can produce more engaging, accessible, and effective career development tools, with the integration of five AI modules within a single immersive environment yielding a multimodal interaction experience that distinguishes it from existing career guidance platforms.
中文标题/摘要
标题:XR-CareerAssist:利用扩展现实和多模态AI的沉浸式个性化职业指导平台
传统的职业指导平台依赖于静态、文本驱动的界面,难以吸引用户或提供个性化的、基于证据的见解。尽管自20世纪60年代以来,计算机辅助职业指导系统有所发展,但它们在互动性方面仍然有限,并且很少关注职业发展中的叙事维度。我们介绍了XR-CareerAssist平台,该平台将扩展现实(XR)与多个人工智能(AI)模块结合,提供沉浸式、多语言的职业指导。该系统集成了自动语音识别(ASR)以实现语音驱动的交互,跨英语、希腊语、法语和意大利语的神经机器翻译,基于Langchain的对话训练助手以实现个性化对话,基于BLIP的视觉语言模型以实现职业可视化,以及通过交互式3D化身提供的AWS Polly文本转语音。职业轨迹以动态的Sankey图呈现,这些图是从超过100,000份匿名职业档案库中提取的。该应用在Unity中为Meta Quest 3构建,后端服务托管在AWS上。在埃克塞特大学进行的试点评估中,23名参与者中有95.6%的语音识别准确率,78.3%的整体用户满意度,91.3%的系统响应性正面评价,反馈意见指导了后续对运动舒适性、音频清晰度和文本可读性的改进。XR-CareerAssist展示了XR和AI融合如何产生更具吸引力、更易访问和更有效的职业发展工具,五个AI模块的集成在一个沉浸式环境中提供了多模态交互体验,使其区别于现有的职业指导平台。
Summary / 总结
XR-CareerAssist is an immersive platform that combines Extended Reality (XR) and Artificial Intelligence (AI) to provide personalized career guidance. It uses Automatic Speech Recognition, Neural Machine Translation, a conversational AI assistant, a Vision-Language model, and Text-to-Speech technology to offer a dynamic and engaging experience. The platform was evaluated at the University of Exeter, achieving high speech recognition accuracy and user satisfaction, and demonstrating the potential of integrating AI with XR for more effective career development tools.
XR-CareerAssist 是一个结合扩展现实 (XR) 和 AI 模块的沉浸式平台,旨在提供个性化的职业指导。它包括自动语音识别、神经机器翻译、基于 Langchain 的对话训练助手、用于职业可视化的视觉语言模型以及通过互动 3D 头像实现的文本转语音功能。该平台在埃克塞特大学进行了评估,实现了 95.6% 的语音识别准确率和高用户满意度。主要发现包括 78.3% 的总体用户满意度和 91.3% 的系统响应性好评,基于用户反馈进行了运动舒适性、音频清晰度和文本可读性的改进。
AgriPath: A Systematic Exploration of Architectural Trade-offs for Crop Disease Classification
Authors: Hamza Mooraj, George Pantazopoulos, Alessandro Suglia
First: 2026-03-08T17:28:01+00:00 · Latest: 2026-04-08T09:42:36+00:00
Comments: 11 pages main text, 24 pages total including references and appendix. 6 figures, 14 tables. Code and dataset will be released upon publication
Abstract
Reliable crop disease detection requires models that perform consistently across diverse acquisition conditions, yet existing evaluations often focus on single architectural families or lab-generated datasets. This work presents a systematic empirical comparison of three model paradigms for fine-grained crop disease classification: Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs. To enable controlled analysis of domain effects, we introduce AgriPath-LF16, a benchmark of 111k images spanning 16 crops and 41 diseases with explicit separation between laboratory and field imagery, alongside a balanced 30k subset for standardised training and evaluation. We train and evaluate all models under unified protocols across full, lab-only, and field-only training regimes using macro-F1 and Parse Success Rate (PSR) to account for generative reliability (i.e., output parsability measured via PSR). The results reveal distinct performance profiles: CNNs achieve the highest accuracy on in-domain imagery but exhibit pronounced degradation under domain shift; contrastive VLMs provide a robust and parameter-efficient alternative with competitive cross-domain performance; generative VLMs demonstrate the strongest resilience to distributional variation, albeit with additional failure modes stemming from free-text generation. These findings highlight that architectural choice should be guided by deployment context rather than aggregate performance alone.
中文标题/摘要
标题:AgriPath:作物病害分类架构权衡的系统性探索
可靠的作物病害检测需要在多种获取条件下表现一致的模型,但现有评估往往集中在单一架构家族或实验室生成的数据集上。本研究系统性地比较了三种模型范式在细粒度作物病害分类中的表现:卷积神经网络(CNNs)、对比视觉语言模型(VLMs)和生成性VLMs。为了控制领域效应的分析,我们引入了AgriPath-LF16基准,包含111,000张图像,覆盖16种作物和41种病害,明确区分了实验室和田间图像,并提供了一个平衡的30,000张图像子集用于标准化训练和评估。我们使用宏F1和解析成功率(PSR)统一协议训练和评估所有模型,在全领域、仅实验室和仅田间训练制度下进行评估。结果表明:CNNs在领域内图像上表现最佳,但在领域迁移时表现显著下降;对比VLMs提供了一种稳健且参数高效的替代方案,具有竞争力的跨领域性能;生成性VLMs在分布变化中表现出最强的鲁棒性,但存在额外的失败模式,源于自由文本生成。这些发现表明,架构选择应根据部署环境而非单一性能指标来指导。
Summary / 总结
This study aims to evaluate the performance of different model paradigms for crop disease classification under varying acquisition conditions. It compares Convolutional Neural Networks (CNNs), contrastive Vision-Language Models (VLMs), and generative VLMs using a new benchmark dataset, AgriPath-LF16, which includes 111k images from 16 crops and 41 diseases, separated into laboratory and field imagery. The results show that CNNs perform best in their domain but degrade significantly under domain shift, while contrastive VLMs offer robust and efficient cross-domain performance, and generative VLMs are highly resilient to distributional changes but have additional failure modes related to text generation. These findings suggest that model choice should be tailored to the specific deployment context.
该研究旨在评估不同模型架构在多样采集条件下对作物病害分类的性能。引入了包含111k张图像的AgriPath-LF16基准数据集,这些图像来自16种作物和41种病害,并分为实验室和田间图像。研究比较了卷积神经网络(CNNs)、对比视觉语言模型(VLMs)和生成型VLMs。结果显示,CNNs在域内表现最佳但在域间迁移时表现较差,对比VLMs提供了一种鲁棒且参数高效的替代方案,具有竞争力的跨域性能;生成型VLMs对分布变化具有最强的抗性,尽管存在额外的文本生成失败模式。这些发现表明,架构选择应根据部署环境而定,而不仅仅是基于整体性能。
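The two metrics named in the AgriPath abstract above, macro-F1 and Parse Success Rate (PSR), can be illustrated as follows. This is a sketch under stated assumptions: the abstract does not define how an output is parsed, so the rule used here (exact match against a known label after trimming and lowercasing) is hypothetical, as are the function names and the toy labels.

```python
def parse_success_rate(outputs, labels):
    """Fraction of free-text outputs that map to a known class label.

    Assumed parsing rule: strip whitespace, lowercase, exact match.
    """
    valid = set(labels)
    parsed = [o.strip().lower() for o in outputs]
    return sum(p in valid for p in parsed) / len(outputs)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; unparsable predictions count as misses."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

labels = ["rust", "blight"]
y_true = ["rust", "rust", "blight", "blight"]
outputs = ["Rust", "blight", "I think it is blight", "blight "]
y_pred = [o.strip().lower() for o in outputs]
psr = parse_success_rate(outputs, labels)
f1 = macro_f1(y_true, y_pred, labels)
```

Reporting PSR alongside macro-F1 separates two failure modes of a generative model: producing an answer that cannot be mapped to any class, and producing a well-formed but wrong class.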
Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Authors: Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han
Venue: ACL 2026
First: 2026-04-08T09:33:44+00:00 · Latest: 2026-04-08T09:33:44+00:00
Comments: Accepted to ACL 2026 (Findings)
Abstract
Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy. Practical deployment further confirms significant efficiency gains, yielding up to ~1.7× memory savings and ~1.1× faster time-to-first-token on long utterances. Our results challenge the necessity of fully distinct token representations, providing new perspectives on LSLM efficiency.
中文标题/摘要
标题:我们需要为每个语音令牌都使用独特的表示吗?揭示并利用大型语音语言模型中的冗余
大型语音语言模型(LSLMs)通常以高令牌率(令牌/秒)运行以确保声学保真度,但这也导致序列长度远超过底层语义内容,导致高昂的推理成本。在本文中,我们实证重新审视这种精细的令牌级处理的必要性。通过逐层的先验干预,我们揭示了一个结构化的冗余层次:虽然浅层编码了关键的声学细节,深层则表现出极端的冗余,允许进行激进的压缩。受这些发现的启发,我们引入了亲和池化,这是一种无需训练、基于相似性的令牌合并机制。通过在输入和深层策略性地应用此方法,我们有效地压缩了语音表示,同时不牺牲语义信息。在三个任务上的广泛评估表明,我们的方法在预填充FLOPs上减少了27.48%,同时保持了竞争力的准确性。实际部署进一步证实了显著的效率提升,长语音片段上节省了约1.7倍的内存和约1.1倍的首词时间。我们的结果挑战了完全独特令牌表示的必要性,为LSLM效率提供了新的视角。
Summary / 总结
This paper addresses the inefficiency of high-token-rate processing in Large Speech Language Models (LSLMs) by empirically demonstrating a redundancy hierarchy in model layers. Through layer-wise oracle interventions, the authors reveal that deep layers contain redundant information, allowing for compression. They introduce Affinity Pooling, a training-free token merging mechanism, which effectively reduces computational costs by 27.48% without compromising accuracy. Experiments across three tasks show significant memory savings and faster processing times on long utterances, challenging the necessity of distinct token representations in LSLMs.
本文通过实验证明,大型语音语言模型(LSLM)的深层层表现出极大的冗余性,重新审视了细粒度的令牌级处理的必要性。作者引入了一种无需训练的基于相似性的令牌合并机制——亲和池化,有效地压缩了语音表示,同时保留了语义信息。三项任务的评估显示,该方法在保持竞争力的同时减少了27.48%的预填充FLOPs,并且实际部署进一步确认了显著的内存和时间节省。
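A generic form of similarity-based token merging, the family of techniques to which the abstract above assigns Affinity Pooling, might look like the sketch below. The abstract does not give the exact merging rule, so the greedy grouping of adjacent tokens, the cosine threshold, and the function names are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(group):
    """Element-wise average of a non-empty list of equal-length vectors."""
    return [sum(col) / len(group) for col in zip(*group)]

def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily average runs of adjacent tokens similar to the run's first token."""
    if not tokens:
        return []
    merged, group = [], [tokens[0]]
    for tok in tokens[1:]:
        if cosine(group[0], tok) >= threshold:
            group.append(tok)      # still the same acoustic/semantic run
        else:
            merged.append(mean_vector(group))
            group = [tok]          # start a new run
    merged.append(mean_vector(group))
    return merged

frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 1.0]]
compressed = merge_similar_tokens(frames)
```

Here four input frames collapse to two, since each adjacent pair is nearly identical. No parameters are trained, which is consistent with the training-free property the abstract describes.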
Vision-Language Model-Guided Deep Unrolling Enables Personalized, Fast MRI
Authors: Fangmao Ju, Yuzhu He, Zhiwen Xue, Chunfeng Lian, Jianhua Ma
First: 2026-04-08T09:10:30+00:00 · Latest: 2026-04-08T09:10:30+00:00
Abstract
Magnetic Resonance Imaging (MRI) is a cornerstone in medicine and healthcare but suffers from long acquisition times. Traditional accelerated MRI methods optimize for generic image quality, lacking adaptability for specific clinical tasks. To address this, we introduce PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that leverages a Vision-Language Model (VLM) to guide a deep unrolling network for task-oriented, fast imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: (1) a deep unrolled reconstruction network derived from a physics-based MRI model; (2) a sampling module that generates patient-specific $k$-space trajectories; and (3) an anomaly-aware prior, extracted from a pretrained VLM, which steers both sampling and reconstruction toward clinically relevant regions. By integrating the high-level clinical reasoning of a VLM with an interpretable, physics-aware network, PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This enhancement directly translates to improvements in downstream diagnostic tasks, including fine-grained anomaly detection, localization, and diagnosis.
中文标题/摘要
标题:视觉-语言模型引导的深度展开实现个性化快速MRI
磁共振成像(MRI)是医学和医疗保健中的基石,但其采集时间较长。传统的加速MRI方法优化通用图像质量,缺乏针对特定临床任务的适应性。为解决这一问题,我们引入了PASS(个性化、异常感知采样与重建),这是一种智能MRI框架,利用视觉-语言模型(VLM)引导深度展开网络进行任务导向的快速成像。PASS通过三个核心贡献动态个性化成像管道:(1)从物理MRI模型衍生的深度展开重建网络;(2)生成患者特定的k空间轨迹的采样模块;(3)从预训练的VLM提取的异常感知先验,引导采样和重建向临床相关区域。通过将VLM的高级临床推理与一个可解释的、物理感知的网络相结合,PASS在多种解剖结构、对比度、异常和加速因子下实现了卓越的图像质量。这一增强直接转化为下游诊断任务的改进,包括细粒度异常检测、定位和诊断。
Summary / 总结
The research aims to address the long acquisition times in MRI by introducing PASS, a framework that uses a Vision-Language Model to guide a deep unrolling network for personalized, fast imaging. PASS includes a deep unrolled reconstruction network, a patient-specific sampling module, and an anomaly-aware prior. The framework dynamically personalizes the imaging pipeline, leading to superior image quality across various anatomies and acceleration factors, which improves downstream diagnostic tasks.
该研究提出了PASS框架,利用视觉语言模型指导深度展开网络实现个性化快速MRI。该框架包括一个基于物理的深度展开重建网络、一个患者特定的采样模块以及一个来自预训练视觉语言模型的异常感知先验。结果表明,PASS在多种解剖结构和加速因子下提高了图像质量,提升了如异常检测和诊断等下游诊断任务的效果。
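The general shape of a deep unrolled reconstruction, as mentioned in the PASS abstract above, is a fixed number of iterations that alternate a physics-based data-consistency step with a prior step. The sketch below is a 1-D toy under stated assumptions, not the PASS network: the forward model is a masked unitary Fourier transform, and `prior` is a placeholder (identity by default) standing in for the learned, anomaly-aware regularizer, whose form the abstract does not specify.

```python
import cmath
import math

def dft(x, inverse=False):
    """Unitary 1-D discrete Fourier transform (O(N^2), for illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    return [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n)) / math.sqrt(n)
        for k in range(n)
    ]

def unrolled_reconstruct(y, mask, num_iters=5, step=1.0, prior=lambda x: x):
    """Alternate a data-consistency gradient step with a prior step.

    y    : measured k-space samples (zero where not sampled)
    mask : 1 where k-space was sampled, 0 elsewhere
    """
    x = [0j] * len(y)
    for _ in range(num_iters):
        kspace = dft(x)
        # Residual on sampled k-space entries only.
        residual = [m * (k - yk) for m, k, yk in zip(mask, kspace, y)]
        grad = dft(residual, inverse=True)  # adjoint of the unitary transform
        x = [xi - step * gi for xi, gi in zip(x, grad)]
        x = prior(x)  # placeholder for a learned regularizer
    return x

signal = [1.0, 2.0, 3.0, 4.0]
measurements = dft(signal)  # fully sampled toy case
recovered = unrolled_reconstruct(measurements, [1, 1, 1, 1], num_iters=3)
```

With full sampling the data-consistency step alone recovers the signal. The prior step matters once `mask` drops entries, which is where a learned or task-aware regularizer would take effect.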
Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning
Authors: Subin Park, Jung Uk Kim
Venue: CVPR 2026
First: 2026-04-08T08:40:33+00:00 · Latest: 2026-04-08T08:40:33+00:00
Comments: Accepted to CVPR 2026
Abstract
The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.
中文标题/摘要
标题:生成、分析与精炼:基于MLLM元推理的无需训练声源定位
声源定位任务旨在通过利用音频和视觉模态之间的相关性来识别声源物体的位置。现有大多数声源定位(SSL)方法依赖于对比学习特征匹配,但缺乏明确的推理和验证,限制了其在复杂声学场景中的效果。受人类元认知过程的启发,我们提出了一种无需训练的SSL框架,利用多模态大型语言模型(MLLMs)的内在推理能力。我们的生成-分析-精炼(GAR)流水线包括三个阶段:生成阶段产生初始边界框和音频分类;分析阶段通过开放集角色标记和锚投票量化音频-视觉一致性;精炼阶段应用自适应门控以防止不必要的调整。在单源和多源基准上的广泛实验表明,该方法具有竞争力。源代码可在https://github.com/VisualAIKHU/GAR-SSL获取。
Summary / 总结
The research aims to improve sound source localization by leveraging the reasoning abilities of Multimodal Large Language Models (MLLMs) without requiring training data. The proposed framework, GAR (Generation-Analysis-Refinement), consists of three stages: generating initial bounding boxes and audio classifications, analyzing audio-visual consistency, and refining the results to prevent unnecessary adjustments. Experiments on single-source and multi-source benchmarks show competitive performance compared to existing methods.
研究旨在通过利用多模态大型语言模型(MLLMs)的推理能力来提高声源定位效果,无需训练数据。提出的GAR(生成-分析-精炼)框架包括三个阶段:初始边界框和音频分类、量化音频-视觉一致性以及应用自适应门控进行精炼。实验表明,在单源和多源基准上具有竞争力的表现。
FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts
Authors: Guillermo Gil de Avalle, Laura Maruster, Eric Sloot, Christos Emmanouilidis
First: 2026-04-08T07:38:43+00:00 · Latest: 2026-04-08T07:38:43+00:00
Abstract
Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection accuracy and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.
中文标题/摘要
标题:FlowExtract: 从维护流程图中提取程序性知识
制造设施中的维护程序通常以静态PDF或扫描图像形式记录为流程图。它们包含了对于资产生命周期管理至关重要的程序性知识,但这些知识对现代操作支持系统来说是不可访问的。视觉-语言模型,图像理解的主要范式,难以从这些图表中重建连接拓扑结构。我们提出了FlowExtract,一种从标准化为ISO 5807的流程图中提取有向图的流水线。该系统将元素检测与连接性重建分离,使用YOLOv8和EasyOCR进行标准领域对齐的节点检测和文本提取,并结合一种新颖的边检测方法,该方法分析箭头方向并反向追踪连接线至源节点。在工业故障排除指南上的评估表明,FlowExtract在节点检测方面表现优异,并在边提取方面显著优于视觉-语言模型基线,为组织提供了可查询的程序性知识表示的实际途径。实现代码可在https://github.com/guille-gil/FlowExtract获取。
Summary / 总结
FlowExtract is a pipeline designed to extract directed graphs from ISO 5807-standardized maintenance flowcharts, addressing the challenge of reconstructing connection topology from static PDFs or scanned images. It uses YOLOv8 and EasyOCR for node detection and text extraction, and a novel edge detection method for connectivity reconstruction. Evaluated on industrial troubleshooting guides, FlowExtract shows high node detection accuracy and significantly outperforms vision-language model baselines in edge extraction, providing a practical solution for converting procedural knowledge into queryable representations.
FlowExtract 是一个用于从标准化为 ISO 5807 的维护流程图中提取有向图的管道,这些流程图通常出现在静态 PDF 或扫描图像中。它使用 YOLOv8 和 EasyOCR 进行节点检测和文本提取,并采用一种新颖的边检测方法来重构连接。在工业故障排除指南上的评估表明,FlowExtract 在边提取方面优于视觉语言模型基线,并且节点检测准确性很高,提供了一种将程序知识转换为可查询表示的实用解决方案。
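The backward tracing described in the FlowExtract abstract above (start at an arrowhead, follow the connector against its direction until a source node is reached) can be illustrated with a deliberately simplified sketch. It is not the FlowExtract implementation: it assumes node boxes and arrowhead tips are already detected, handles only straight, axis-aligned connectors on an integer pixel grid, and all names are hypothetical.

```python
def node_at(point, nodes):
    """Return the id of the node whose bounding box contains the point, else None."""
    x, y = point
    for name, (x0, y0, x1, y1) in nodes.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def trace_edge(tip, direction, nodes, line_pixels, max_steps=1000):
    """Follow a straight connector backward from an arrowhead tip.

    tip        : (x, y) pixel of the arrowhead tip
    direction  : unit step (dx, dy) the arrow points in
    Returns (source, target) or None if the line breaks before a node.
    """
    dx, dy = direction
    target = node_at((tip[0] + dx, tip[1] + dy), nodes)  # node the arrow points into
    x, y = tip
    for _ in range(max_steps):
        x, y = x - dx, y - dy              # step against the arrow direction
        source = node_at((x, y), nodes)
        if source is not None:
            return (source, target)
        if (x, y) not in line_pixels:
            return None                    # connector ended without reaching a node
    return None

nodes = {"A": (0, 0, 2, 2), "B": (8, 0, 10, 2)}
line = {(x, 1) for x in range(3, 8)}
edge = trace_edge((7, 1), (1, 0), nodes, line)
```

A practical system would also need to follow bends and junctions. The sketch only shows why separating node detection from connectivity makes the edge direction unambiguous: the arrowhead fixes the target, and the trace finds the source.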
CodecFlow: Codec-Guided End-to-End Optimization for Streaming Video Analytics
Authors: Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov
First: 2026-04-07T16:31:45+00:00 · Latest: 2026-04-08T07:19:13+00:00
Comments: 18 pages, 34 figures
Abstract
Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams.
We present CodecFlow, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecFlow treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CodecFlow achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.
中文标题/摘要
标题:CodecFlow:基于编解码器的端到端优化流媒体视频分析
流媒体视频分析是视觉语言模型服务中的关键工作负载,但多模态推理的高成本限制了其可扩展性。先前的系统通过利用视频流中的时域和空域冗余来减少推理成本,但它们要么针对视觉变换器(ViT),要么针对有限视角的LLM,从而错过了端到端的机会。此外,现有方法在识别冗余方面产生了显著的开销,要么通过离线配置和训练,要么通过昂贵的在线计算,这使得它们不适合动态实时流。我们提出了CodecFlow,这是一种基于编解码器的流媒体视频分析系统,基于一个关键观察,即视频编解码器在压缩过程中已经提取了每个流的时域和空域结构。CodecFlow将这种编解码器元数据视为低成本的运行时信号,以统一视频解码、视觉处理和LLM预填充的优化,直接操作压缩位流本身具有减少传输的好处。这驱动了ViT编码前的编解码器引导补丁修剪和LLM预填充期间的选择性键值缓存刷新,两者都是完全在线的,不需要离线训练。实验表明,CodecFlow在与最先进的基线相比,吞吐量提高了3倍,GPU计算减少了87%,同时保持了与仅0-8% F1下降的竞争力。
Summary / 总结
CodecFlow is a codec-guided system for optimizing streaming video analytics, addressing the high cost of multimodal inference. It leverages codec metadata to unify optimization across video decoding, visual processing, and LLM prefilling, reducing inference cost and overhead. Experiments demonstrate up to 3x throughput improvement and 87% GPU compute reduction, with minimal accuracy drop.
CodecFlow 是一个通过利用编解码器元数据来优化流式视频分析的系统,它结合了基于编解码器的补丁修剪和选择性缓存刷新,分别用于 ViT 编码和 LLM 预填充,无需离线训练。实验表明,CodecFlow 可以实现高达 3 倍的吞吐量提升和高达 87% 的 GPU 计算减少,同时保持有竞争力的准确性,F1 分数仅下降 0-8%。
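One way to picture the codec-guided patch pruning in the CodecFlow abstract above is to keep only those patches whose codec metadata indicates change. The sketch below is an assumption-laden illustration, not CodecFlow's algorithm: the abstract does not say which metadata fields are used, so the per-patch motion vectors, residual energies, thresholds, and function names here are hypothetical.

```python
import math

def select_patches(motion_vectors, residual_energy, mv_thresh=1.0, res_thresh=0.5):
    """Indices of patches worth re-encoding with the vision transformer.

    A patch is kept if its motion vector is large or its prediction residual
    is high; all other patches are treated as redundant with the prior frame.
    """
    keep = []
    for idx, ((mvx, mvy), res) in enumerate(zip(motion_vectors, residual_energy)):
        if math.hypot(mvx, mvy) > mv_thresh or res > res_thresh:
            keep.append(idx)
    return keep

def pruned_fraction(num_patches, kept):
    """Share of patches skipped for this frame."""
    return 1.0 - len(kept) / num_patches

mvs = [(0.0, 0.0), (3.0, 4.0), (0.1, 0.1), (0.0, 0.2)]
residuals = [0.0, 0.0, 0.9, 0.1]
kept = select_patches(mvs, residuals)
saved = pruned_fraction(len(mvs), kept)
```

Because the decoder already produces this metadata during decompression, a selection of this kind adds little runtime cost, which matches the abstract's point about avoiding offline profiling and expensive online computation.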
How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
Authors: Roberto Brusnicki, Mattia Piccinini, Johannes Betz
First: 2026-04-08T07:14:55+00:00 · Latest: 2026-04-08T07:14:55+00:00
Comments: 8 pages, 5 figures
Abstract
Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we show that even top models achieve only 57% accuracy, falling short of human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel at static object detection but struggle to understand vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material is available at https://V3NU55.github.io
中文标题/摘要
标题:视觉-语言模型在理解连续驾驶场景方面的理解能力如何?一项敏感性研究
视觉-语言模型(VLMs)在自主驾驶任务中越来越被提出,但它们在连续驾驶场景上的表现尚未得到充分描述,尤其是在输入配置如何影响其能力方面。我们引入了VENUSS(VLM Evaluation oN Understanding Sequential Scenes),一种系统分析VLM在连续驾驶场景上表现的框架,为未来研究建立了基线。基于现有数据集,VENUSS从驾驶视频中提取时间序列,并在自定义类别中生成结构化评估。通过在2,600多个场景中比较25多种现有VLM,我们揭示即使顶级模型也只能达到57%的准确率,未能达到类似约束下人类表现的65%,并暴露了显著的能力差距。我们的分析表明,VLMs在静态物体检测方面表现出色,但在理解车辆动力学和时间关系方面存在困难。VENUSS提供了第一个专注于输入图像配置(分辨率、帧数、时间间隔、空间布局和呈现模式)如何影响连续驾驶场景上表现的系统敏感性分析。补充材料可在https://V3NU55.github.io获取
Summary / 总结
The study evaluates the performance of Vision-Language Models (VLMs) on sequential driving scenes, introducing VENUSS as a framework for systematic sensitivity analysis. By comparing 25+ VLMs across 2,600+ scenarios, the research reveals that even top models achieve only 57% accuracy, highlighting significant gaps compared to human performance (65%). VLMs excel in static object detection but struggle with understanding vehicle dynamics and temporal relations.
研究评估了视觉-语言模型(VLMs)在连续驾驶场景中的表现,引入了VENUSS框架进行系统性的敏感性分析。通过比较25+个VLMs在2,600+场景中的表现,研究发现即使顶级模型也只能达到57%的准确率,与人类表现(65%)存在显著差距。VLMs在静态物体检测方面表现出色,但在理解车辆动态和时间关系方面存在困难。
PyFi: Toward Pyramid-like Financial Image Understanding for VLMs via Adversarial Agents
Authors: Yuqun Zhang, Yuxuan Zhao, Sijia Chen
First: 2025-12-11T06:04:33+00:00 · Latest: 2026-04-08T06:53:15+00:00
Abstract
This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: https://github.com/AgenticFinLab/PyFi .
中文标题/摘要
标题:PyFi:通过对抗代理实现金字塔式金融图像理解的VLMs框架
本文提出PyFi,一种新颖的金字塔式金融图像理解框架,使视觉语言模型(VLMs)能够以逐步、从简单到复杂的方式进行推理。PyFi的核心是包含60万金融问答对的数据集PyFi-600K,这些问答对被组织成一个推理金字塔:基部的问题只需要基本的感知能力,而接近顶端的问题则需要不断增加的金融视觉理解和专业知识水平。由于数据是通过合成生成的,无需人工注释,因此具有可扩展性。PyFi-adv是一种基于蒙特卡洛树搜索(MCTS)范式的多代理对抗机制,对于每张图像,挑战者代理与解决者代理通过生成问题链来竞争,逐步探索金融视觉推理的更深层次能力。利用该数据集,我们对高级VLMs在金融领域的细粒度、分层和全面评估。此外,对Qwen2.5-VL-3B和Qwen2.5-VL-7B进行金字塔结构问题链的微调,使这些模型能够通过逐步增加推理需求将复杂金融问题分解为子问题,分别在数据集上获得19.52%和8.06%的平均准确率提升。所有代码、数据集和模型资源均可在:https://github.com/AgenticFinLab/PyFi 获取。
Summary / 总结
PyFi is a novel framework for financial image understanding that uses a pyramid-like structure to enable VLMs to reason through questions in a progressive manner. At the core is PyFi-600K, a dataset of 600K financial question-answer pairs synthesized using a multi-agent adversarial mechanism. This dataset allows VLMs to be evaluated and fine-tuned for financial visual reasoning, with fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the dataset improving their accuracy by 19.52% and 8.06%, respectively, on complex financial questions. The framework and resources are available on GitHub.
PyFi 提出了一种使用 VLMs 进行金融图像理解的新框架,包含 600K 合成的问答对,形成一个推理金字塔。该框架使用多智能体对抗机制生成问题链,逐步探索金融视觉推理能力。评估结果显示,对 Qwen2.5-VL-3B 和 Qwen2.5-VL-7B 进行金字塔结构问题链的微调,分别提高了 19.52% 和 8.06% 的准确性。所有资源包括代码、数据集和模型均可公开获取。
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
Authors: Yijie Deng, Shuaihang Yuan, Yi Fang
First: 2026-04-07T02:45:07+00:00 · Latest: 2026-04-08T06:49:27+00:00
Abstract
Image Goal Navigation (ImageNav) is evaluated by a coarse success criterion: the agent must stop within 1m of the target. This is sufficient for finding objects but falls short for downstream tasks such as grasping that require precise positioning. We introduce AnyImageNav, a training-free system that pushes ImageNav toward this more demanding setting. Our key insight is that the goal image can be treated as a geometric query: any photo of an object, a hallway, or a room corner can be registered to the agent's observations via dense pixel-level correspondences, enabling recovery of the exact 6-DoF camera pose. Our method realizes this through a semantic-to-geometric cascade: a semantic relevance signal guides exploration and acts as a proximity gate, invoking a 3D multi-view foundation model only when the current view is highly relevant to the goal image; the model then self-certifies its registration in a loop for an accurate recovered pose. Our method sets state-of-the-art navigation success rates on Gibson (93.1%) and HM3D (82.6%), and achieves pose recovery that prior methods do not provide: a position error of 0.27m and heading error of 3.41 degrees on Gibson, and 0.21m / 1.23 degrees on HM3D, a 5-10x improvement over adapted baselines.
中文标题/摘要
标题:AnyImageNav: 任意视角几何用于精确最后一米图像目标导航
图像目标导航(ImageNav)通过粗略的成功标准进行评估,代理必须在目标1米内停止,这足以用于寻找物体,但对于抓取等需要精确定位的下游任务来说则不够。我们引入了AnyImageNav,这是一种无需训练的系统,将ImageNav推向了更严格的设置。我们的关键见解是,目标图像可以被视为几何查询:任何一张物体、走廊或房间角落的照片都可以通过密集的像素级对应关系与代理的观察进行注册,从而恢复出精确的6自由度相机姿态。我们的方法通过语义到几何的级联实现这一点:语义相关性信号指导探索并作为接近门,只有当当前视图与目标图像高度相关时,才会调用3D多视图基础模型;然后模型在循环中自我验证其注册,以获得准确的恢复姿态。我们的方法在Gibson(93.1%)和HM3D(82.6%)上设定了最先进的导航成功率,并实现了先前方法未提供的姿态恢复:Gibson上的位置误差为0.27米、航向误差为3.41度,HM3D上的位置误差为0.21米/1.23度,比适应的基线提高了5-10倍。
Summary / 总结
AnyImageNav is designed to improve Image Goal Navigation (ImageNav) by focusing on precise positioning, which is crucial for tasks like grasping. It uses a semantic-to-geometric cascade where a semantic relevance signal guides exploration and triggers a 3D multi-view model only when the current view is highly relevant to the goal image. This model then self-certifies its registration to recover the exact 6-DoF camera pose. AnyImageNav achieves state-of-the-art navigation success rates on Gibson and HM3D datasets, with significant improvements in position and heading errors compared to previous methods.
AnyImageNav通过将目标图像视为几何查询来解决精确定位的需求。它使用语义到几何的级联,其中语义相关性信号引导探索,并仅在当前视图与目标图像高度相关时触发一个3D多视图基础模型。该模型自我验证其注册以恢复精确的6-DoF相机姿态。AnyImageNav在Gibson和HM3D上实现了最先进的导航成功率,并在位置和航向误差方面显著优于先前的方法。
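The difference between the coarse 1m success criterion and the pose errors reported in the AnyImageNav abstract above can be made concrete with a short sketch. The function names and the planar (x, y, heading) simplification are assumptions for illustration; the paper recovers a full 6-DoF pose.

```python
import math

def pose_error(pred_xy, pred_heading_deg, goal_xy, goal_heading_deg):
    """Position error in metres and absolute heading error in degrees.

    The heading difference is wrapped into [-180, 180) before taking the
    absolute value, so 359 deg versus 1 deg counts as a 2 deg error.
    """
    position_error = math.dist(pred_xy, goal_xy)
    diff = (pred_heading_deg - goal_heading_deg + 180.0) % 360.0 - 180.0
    return position_error, abs(diff)

def coarse_success(pred_xy, goal_xy, radius_m=1.0):
    """Standard ImageNav criterion: stop within radius_m of the target."""
    return math.dist(pred_xy, goal_xy) <= radius_m

pos_err, head_err = pose_error((0.0, 0.0), 359.0, (0.3, 0.4), 1.0)
ok = coarse_success((0.0, 0.0), (0.3, 0.4))
```

An agent that stops 0.5m away while facing the wrong way still passes the coarse criterion, which is why the abstract reports position and heading errors separately for tasks such as grasping.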
Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
Authors: Jiahua Chen, Qihong Tang, Weinong Wang, Qi Fan
First: 2026-04-08T06:47:55+00:00 · Latest: 2026-04-08T06:47:55+00:00
Abstract
Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.
中文标题/摘要
标题:通过主动3D场景探索增强MLLM的空间理解能力以实现多视角推理
尽管多模态大型语言模型已经取得了显著进展,但在复杂的3D空间推理方面仍然存在困难,这主要归因于它们依赖于2D视觉先验。现有方法通常通过在有限的3D数据集上进行昂贵的后训练处理来缓解这一限制,或者通过刚性工具调用机制,这些机制缺乏明确的几何理解和视角灵活性。为了解决这些挑战,我们提出了一种无需训练的框架,该框架引入了一种基于明确3D重建的视觉链式思考机制。该提出的流水线首先使用MLLM引导的关键词提取和多粒度掩码生成,从单张图像中重建高保真3D网格。随后,该框架利用外部知识库迭代计算最优相机外参并合成新视图,从而模拟人类视角转换。广泛的实验表明,该提出的方法显著增强了空间理解能力。具体而言,该框架在3DSRBench和Rel3D等主要基准测试中优于专门的空间模型和通用的MLLM,包括GPT-5.2和Gemini-2.5-Flash。
Summary / 总结
The research aims to improve the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) by addressing their reliance on 2D visual priors. The proposed framework introduces a Visual Chain-of-Thought mechanism for explicit 3D reconstruction, which reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation. The framework then iteratively computes optimal camera extrinsic parameters and synthesizes novel views to emulate human perspective-taking. Experimental results show that the proposed approach significantly enhances spatial comprehension, outperforming specialized spatial models and general-purpose MLLMs on benchmarks like 3DSRBench and Rel3D.
研究旨在通过解决多模态大型语言模型依赖2D视觉先验的问题,提高其空间推理能力。提出的框架引入了一种基于显式3D重建的视觉链式思考机制,首先使用MLLM指导的关键词提取和掩码生成,从单张图像中重建高保真3D网格。然后,通过迭代计算最优相机外参并合成新视角,模拟人类视角转换。实验结果表明,该方法显著增强了空间理解能力,在3DSRBench和Rel3D等基准测试中优于专门的空间模型和通用的多模态大型语言模型。
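The camera extrinsic parameters that the abstract above mentions for novel-view synthesis are a rotation and a translation mapping world coordinates into the camera frame. The sketch below shows one standard way to build them from a chosen viewpoint (a look-at construction with the camera looking along its negative z-axis). How the paper selects viewpoints is not described in the abstract, so the orbit parameterization and all function names are assumptions.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]

def _normalize(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]

def orbit_eye(target, radius, azimuth_deg, elevation_deg):
    """Camera position on a sphere around the target (y is up)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return [target[0] + radius * math.cos(el) * math.sin(az),
            target[1] + radius * math.sin(el),
            target[2] + radius * math.cos(el) * math.cos(az)]

def look_at_extrinsics(eye, target, up=(0.0, 1.0, 0.0)):
    """Rotation rows (right, up, -forward) and translation t = -R @ eye.

    Undefined when the viewing direction is parallel to `up`.
    """
    forward = _normalize(_sub(target, eye))
    right = _normalize(_cross(forward, up))
    true_up = _cross(right, forward)
    rotation = [right, true_up, [-f for f in forward]]
    translation = [-_dot(row, eye) for row in rotation]
    return rotation, translation

eye = orbit_eye([0.0, 0.0, 0.0], 5.0, 0.0, 0.0)
R, t = look_at_extrinsics(eye, [0.0, 0.0, 0.0])
```

Sweeping `azimuth_deg` and `elevation_deg` yields a set of candidate viewpoints around the reconstructed mesh; an iterative procedure like the one the abstract describes would then choose among such candidates.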
Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation
Authors: Jianing Zhang, Runan Li, Honglin Pang, Ding Xia, Zhou Zhu, Qian Zhang, Chuntao Li, Xi Yang
First: 2026-04-08T06:05:54+00:00 · Latest: 2026-04-08T06:05:54+00:00
Abstract
Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the "interpretation gap": while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.
中文标题/摘要
标题:通过组件导向的多模态知识增强专门化大型模型以解读甲骨文
解读古代中国甲骨文(OBS)是一项具有挑战性的任务,可以提供对古代信仰、系统和文化的见解。现有方法将解码视为封闭集图像识别问题,无法弥合“解释差距”:虽然单个字符往往是独特的和稀有的,但它们由一组有限的、图示化的组件组成,这些组件具有可转移的语义意义。为了利用这种结构逻辑,我们提出了一种基于代理的视觉-语言模型(VLM)框架,该框架将用于精确视觉定位的VLM与基于LLM的代理结合,以自动化组件识别、基于图的知识检索和关系推理的推理链,从而实现语言准确的解释。为此,我们还引入了OB-Radix,这是一个由专家注释的数据集,提供了先前语料库中缺乏的结构和语义数据,包括1,022个字符图像(934个独特字符)和478个不同组件的1,853个细粒度组件图像,以及经过验证的解释。通过在三个不同任务基准上评估我们的系统,我们证明了我们的框架在细节和精确性方面优于基线方法。
Summary / 总结
The research aims to improve the decipherment of Oracle Bone Script by addressing the limitations of existing closed-set image recognition approaches. It proposes a Vision-Language Model (VLM) framework that integrates visual grounding with language modeling to identify components and infer relationships, thereby providing linguistically accurate interpretations. The system outperforms baseline methods on three different benchmarks, offering more detailed and precise decipherments. The framework is supported by OB-Radix, a new dataset that includes structural and semantic information for character and component images.
研究旨在通过解决现有封闭集图像识别方法的局限性,提高甲骨文的解读能力。提出了一种结合视觉接地和语言模型的Vision-Language Model (VLM)框架,以识别组件并推断关系,从而提供语义准确的解读。该系统在三个不同基准测试中优于基线方法,提供了更详细和精确的解读。该框架得到了OB-Radix的支持,这是一个新数据集,包含了字符和组件图像的结构和语义信息。
AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
Authors: Xintong Zhang, Xiaowen Zhang, Jingrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li
First: 2026-02-02T19:00:27+00:00 · Latest: 2026-04-08T05:28:21+00:00
Abstract
Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models' capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
中文标题/摘要
标题:AdaptMMBench:评估自适应多模态推理以选择模式和推理过程
自适应多模态推理已成为视觉语言模型(VLMs)的一个有前景的研究前沿,旨在动态调节工具增强的视觉推理和文本推理之间的平衡,以提高效果和效率。然而,现有的评估依赖于静态难度标签和简单的度量标准,无法捕捉难度相对于不同模型能力的动态性质。因此,它们模糊了自适应模式选择与总体性能之间的区别,同时忽略了精细过程分析。在本文中,我们提出了AdaptMMBench,这是一个跨五个领域(现实世界、OCR、GUI、知识和数学)的全面基准,涵盖了直接感知和复杂推理任务。AdaptMMBench 使用马修斯相关系数(MCC)度量来评估不同推理模式的选择合理性,通过动态识别任务难度来隔离这种元认知能力。此外,AdaptMMBench 促进了关键步骤覆盖率、工具有效性以及计算效率等多维度过程评估。我们的评估表明,虽然自适应模式选择随着模型能力的增加而扩展,但它与最终准确性显著脱钩。相反,关键步骤覆盖率与性能一致,尽管工具有效性在不同模型架构之间仍然高度不一致。
Summary / 总结
AdaptMMBench is a benchmark for adaptive multimodal reasoning in Vision-Language Models, addressing limitations of existing evaluations by focusing on dynamic difficulty assessment and meta-cognition. It uses the Matthews Correlation Coefficient to evaluate mode selection rationality and includes multi-dimensional process evaluation. Key findings show that adaptive mode selection scales with model capacity but decouples from final accuracy, while key step coverage aligns with performance, though tool effectiveness varies across architectures.
论文提出了AdaptMMBench,这是一个针对Vision-Language模型在五个领域中的自适应多模态推理的基准。它使用Matthews相关系数(MCC)来评估模式选择的合理性,并实现多维度的过程评估。主要发现表明,自适应模式选择随着模型容量的增加而增加,但与最终准确性无关,而关键步骤的覆盖范围与性能一致,尽管工具的有效性在不同模型架构之间差异很大。
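To make the mode-selection metric in the AdaptMMBench entry above concrete, here is a minimal sketch of the Matthews Correlation Coefficient applied to a binary choice between tool-augmented and text-only reasoning. The MCC formula is standard. How the benchmark derives the per-task "tool needed" labels from a model's capability boundary is not detailed in the abstract, so the boolean inputs and the function name `mode_selection_mcc` are assumptions.

```python
import math

def mode_selection_mcc(tool_needed, tool_chosen):
    """MCC between the 'tool needed' labels and the model's mode choices.

    tool_needed[i]: True if task i is assumed to require the tool-augmented mode.
    tool_chosen[i]: True if the model selected the tool-augmented mode on task i.
    """
    if len(tool_needed) != len(tool_chosen):
        raise ValueError("inputs must have the same length")
    tp = fp = tn = fn = 0
    for needed, chosen in zip(tool_needed, tool_chosen):
        if chosen and needed:
            tp += 1
        elif chosen and not needed:
            fp += 1
        elif needed:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is 0 when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom

needed = [True, True, True, False, False, False]
chosen = [True, True, False, True, False, False]
score = mode_selection_mcc(needed, chosen)
```

A model that always picks the same mode scores 0 under this convention regardless of its final accuracy, which is one way a mode-selection score can separate from task performance.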
ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding
Authors: Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu
Venue: ACL 2026
First: 2026-04-08T05:01:59+00:00 · Latest: 2026-04-08T05:01:59+00:00
Comments: Accepted by ACL 2026 Findings, Preprint Version
Abstract
While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in "black-box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systematically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.
中文标题/摘要
标题:ChemVLR:在化学视觉语言理解中优先考虑推理
尽管视觉语言模型(VLMs)在化学视觉理解方面展现了巨大的潜力,但当前的模型主要针对直接的视觉问答任务进行了优化。这种范式往往导致“黑盒”系统未能充分利用大型语言模型(LLMs)推断反应机制的能力。在本文中,我们提出了ChemVLR,这是一种旨在优先考虑感知过程中推理的化学VLM。与传统的化学VLM不同,ChemVLR通过明确识别功能团等细粒度化学描述符,以细粒度的方式分析视觉输入,从而在生成答案之前进行推理。这种方法确保了对复杂视觉化学问题生成明确且可解释的推理路径。为了支持这一方法,我们实施了一种跨模态逆向工程策略,并结合严格的过滤管道,构建了一个包含76万高质量样本的大规模推理和注释数据集,涵盖分子和反应任务。此外,我们采用了一种三阶段训练框架,系统地构建了模型的感知和推理能力。实验表明,ChemVLR达到了最先进的(SOTA)性能,超越了领先的专有模型和领域特定的开源基线。我们还提供了全面的消融研究来验证我们的训练策略和数据生成设计。代码和模型权重将在https://github.com/xxlllz/ChemVLR上提供。
Summary / 总结
The research aims to enhance chemical vision-language understanding by prioritizing reasoning in perception processes. ChemVLR, a novel chemical VLM, analyzes visual inputs by identifying granular chemical descriptors before generating answers, ensuring explicit reasoning paths. The model is trained using a three-stage framework and a large-scale reasoning-and-captioning dataset, achieving SOTA performance and outperforming existing models. Comprehensive ablation studies validate the training strategy and data generation designs.
ChemVLR 是一种化学 VLM,它在生成答案之前先对视觉输入进行细粒度分析并识别化学描述符,从而将推理置于感知过程的优先位置。它采用跨模态逆向工程策略和严格的过滤管道来构建大规模数据集。实验表明,ChemVLR 达到了最先进的性能,超越了专有模型和开源基线,消融研究验证了其训练策略与数据生成设计。
Reading Between the Pixels: An Inscriptive Jailbreak Attack on Text-to-Image Models
Authors: Zonghao Ying, Haowen Dai, Lianyu Hu, Zonglei Jing, Quanchen Zou, Yaodong Yang, Aishan Liu, Xianglong Liu
First: 2026-04-07T13:16:07+00:00 · Latest: 2026-04-08T04:16:14+00:00
Comments: Withdrawn for extensive revisions and inclusion of new experimental results
Abstract
Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zero-order loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across 7 models on 2 benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignments and underscore the urgent need for typography-aware multimodal defense mechanisms.
中文标题/摘要
标题:在像素之间阅读:针对文本到图像模型的嵌入式越狱攻击
现代文本到图像(T2I)模型现在可以渲染清晰可读、段落长度的文本,由此催生了一类全新的滥用方式。我们识别并形式化了嵌入式越狱(inscriptive jailbreak)攻击,即攻击者迫使 T2I 系统生成包含有害文本载荷(例如欺诈性文件)的图像,这些文本载荷嵌入在视觉上无害的场景中。与诱导生成视觉上令人反感的图像的传统描绘式越狱不同,嵌入式攻击将文本渲染能力本身武器化。由于现有的越狱技术是为粗粒度的视觉操纵设计的,它们难以在保持字符级保真度的同时绕过多级安全过滤器。为了揭示这一漏洞,我们提出了 Etch,一种黑盒攻击框架,将对抗性提示分解为三个功能上正交的层:语义伪装、视觉-空间锚定和排版编码。这种分解将对整个提示空间的联合优化化简为可处理的子问题,并通过零阶循环迭代优化。在这个过程中,一个视觉-语言模型评审每个生成的图像,将失败定位到特定层,并给出针对性的修订。在 7 个模型和 2 个基准上的广泛评估表明,Etch 的平均攻击成功率达到 65.57%(峰值为 91.00%),显著优于现有基线。我们的结果揭示了当前 T2I 安全对齐中的一个关键盲点,并强调了对排版感知的多模态防御机制的迫切需求。
Summary / 总结
The research addresses the vulnerability of text-to-image models to inscriptive jailbreak attacks, where harmful text is embedded in visually benign images. The study proposes Etch, a black-box attack framework that decomposes the adversarial prompt into semantic camouflage, visual-spatial anchoring, and typographic encoding. Etch achieves an average attack success rate of 65.57%, significantly outperforming existing methods. This highlights the need for typography-aware defense mechanisms in T2I models.
论文探讨了文本到图像模型面临的嵌入式越狱攻击漏洞,攻击者可以在看似无害的图像中嵌入有害文本。作者提出了Etch,一种黑盒攻击框架,将对抗性提示分解为语义伪装、视觉空间锚定和字体编码三个功能独立的层。Etch 的攻击成功率平均为 65.57%,显著优于现有方法。研究揭示了当前文本到图像模型安全对齐中的关键盲点,强调了需要字体感知的多模态防御机制的迫切需求。
Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels
Authors: Yaqi Zhao, Haoliang Sun, Yating Wang, Yongshun Gong, Yilong Yin
First: 2026-04-08T02:49:19+00:00 · Latest: 2026-04-08T02:49:19+00:00
Abstract
Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the most frequent labels from the nearest neighbors' candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. These results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.
中文标题/摘要
标题:全面最优标签选择以应对部分标签下的鲁棒提示学习
提示学习因其参数高效性而受到广泛关注,成为适应大规模预训练视觉-语言模型到下游任务的方法。然而,当仅提供部分标签时,其性能往往受限于标签的模糊性和监督信息的不足。为解决这一问题,我们提出了全面最优标签选择(HopS),通过两种互补策略利用预训练特征编码器的泛化能力。首先,我们设计了一个基于局部密度的过滤器,从最近邻候选集选择最频繁的标签,并使用softmax分数来识别最可能的标签,捕捉特征空间中的结构规律。其次,我们引入了一个基于最优传输的全局选择目标,将均匀采样分布映射到批次中的候选标签分布。通过最小化期望传输成本,可以确定最可能的标签分配。这两种策略从局部和全局两个角度共同提供鲁棒的标签选择。在八个基准数据集上的广泛实验表明,HopS在部分监督下始终能提高性能,并优于所有基线。这些结果突显了全面标签选择的优势,并为弱监督设置下的提示学习提供了一个实用的解决方案。
Summary / 总结
The research aims to enhance prompt learning performance when only partial labels are available, addressing label ambiguity and insufficient supervisory information. It proposes Holistic Optimal Label Selection (HopS), which uses a local density-based filter and a global selection objective based on optimal transport to robustly select labels. Experiments on eight benchmark datasets demonstrate that HopS outperforms existing methods and improves performance under partial supervision.
研究旨在通过提出Holistic Optimal Label Selection (HopS)来提升在部分标签条件下提示学习的表现。HopS 使用基于局部密度的过滤器和基于最优传输的全局选择目标来稳健地选择标签。在八个基准数据集上的实验表明,HopS 一致地优于现有方法,突显了整体标签选择在弱监督设置中对提示学习的有效性。
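The two selection strategies in the HopS entry above can be sketched as follows. This is not the authors' implementation: the abstract gives neither the exact cost, the marginals, nor the solver. The sketch assumes an entropic optimal-transport relaxation solved by Sinkhorn iterations, a cost of `-log p` restricted to each sample's candidate set, a column marginal proportional to how often each label appears in the batch's candidate sets, and a local vote restricted to the sample's own candidates. The function names and `epsilon`, `n_iters`, `top_m` are hypothetical.

```python
from collections import Counter

def local_density_filter(i, probs, candidates, neighbors, top_m=2):
    """Local view: vote over neighbours' candidate sets, then break ties by softmax score."""
    votes = Counter()
    for nb in neighbors[i]:
        # Assumption: only labels that are also candidates of sample i may receive votes.
        votes.update(candidates[nb] & candidates[i])
    pool = [lab for lab, _ in votes.most_common(top_m)] or list(candidates[i])
    return max(pool, key=lambda lab: probs[i][lab])

def sinkhorn_label_selection(probs, candidates, epsilon=0.5, n_iters=200):
    """Global view: entropic OT from uniform sample mass to a label marginal."""
    n, k = len(probs), len(probs[0])
    # Gibbs kernel exp(-cost / epsilon) with cost = -log p; zero outside the candidate set.
    kernel = [[probs[i][j] ** (1.0 / epsilon) if j in candidates[i] else 0.0
               for j in range(k)] for i in range(n)]
    row_marginal = [1.0 / n] * n
    counts = [sum(1 for c in candidates if j in c) for j in range(k)]
    col_marginal = [c / sum(counts) for c in counts]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        for i in range(n):
            s = sum(kernel[i][j] * v[j] for j in range(k))
            u[i] = row_marginal[i] / s if s > 0 else 0.0
        for j in range(k):
            s = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = col_marginal[j] / s if s > 0 else 0.0
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Each sample takes the candidate label that receives the most transported mass.
    labels = [max(candidates[i], key=lambda j: plan[i][j]) for i in range(n)]
    return labels, plan

# Toy batch: softmax scores over 3 classes and one candidate label set per sample.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
candidates = [{0, 1}, {1, 2}, {2}]
labels, plan = sinkhorn_label_selection(probs, candidates)
```

Masking the kernel keeps every assignment inside the sample's candidate set, while the marginal constraints couple the samples in a batch so that a label cannot absorb more mass than the batch-level label distribution allows.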
Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework
Authors: Shiyu Liu, Xinyi Wen, Zhibin Lan, Ante Wang, Jinsong Su
First: 2026-01-30T01:37:53+00:00 · Latest: 2026-04-08T01:52:53+00:00
Comments: Code is available at https://github.com/Liushiyu-0709/SelfVal
Abstract
Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in image captioning, where models describe non-existent objects and thereby compromise their reliability. Previous work attributes this to LVLMs' over-reliance on language priors and attempts to mitigate it through logits calibration, but it still lacks a thorough analysis of that over-reliance. To gain a deeper understanding, we conduct a series of preliminary experiments, which indicate that as the generation length increases, LVLMs' over-reliance on language priors inflates the probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates objects' existence in sampled candidate captions and then mitigates object hallucination via caption selection or aggregation. Experimental results demonstrate that our framework significantly mitigates object hallucination in image captioning (e.g., a 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B), surpassing previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
中文标题/摘要
标题:克服过度依赖陷阱:通过自我验证框架减轻LVLM中的对象幻觉
尽管大型视觉语言模型(LVLMs)取得了进展,但在图像描述任务中,对象幻觉仍然是一个关键问题,模型会生成不存在的对象描述,影响其可靠性。以往工作将此归因于LVLMs过度依赖语言先验,并通过logits校准尝试减轻这一问题。然而,它们仍然缺乏对过度依赖的全面分析。为了更深入地理解过度依赖,我们进行了一系列初步实验,表明随着生成长度的增加,LVLMs对语言先验的过度依赖导致幻觉对象词元的概率膨胀,从而加剧了对象幻觉。为解决这一问题,我们提出了无语言先验验证(Language-Prior-Free Verification),使LVLMs能够忠实验证对象存在的置信度。在此基础上,我们提出了一种新的无需训练的自我验证框架来克服过度依赖陷阱。它首先在采样的候选描述中验证对象的存在,并进一步通过描述选择或聚合减轻对象幻觉。实验结果表明,我们的框架在图像描述任务中显著减轻了对象幻觉(例如,在CHAIR_I指标上,LLaVA-v1.5-7B的改进幅度为65.6%),超越了之前的SOTA方法。这一结果突显了一条新的减轻幻觉的途径,即通过挖掘LVLMs本身固有的潜在能力。
Summary / 总结
The paper addresses object hallucination in Large Vision Language Models (LVLMs) for image captioning. Preliminary experiments indicate that LVLMs' over-reliance on language priors inflates the probability of hallucinated object tokens as generation length increases. The authors propose Language-Prior-Free Verification, which lets LVLMs verify the confidence of object existence, and build on it a training-free Self-Validation Framework that validates objects in sampled candidate captions and then selects or aggregates captions. Experiments show a 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B, surpassing previous state-of-the-art methods.
论文针对大型视觉语言模型(LVLMs)在图像描述任务中的对象幻觉问题展开研究。初步实验表明,随着生成长度的增加,LVLMs对语言先验的过度依赖会抬高幻觉对象词元的概率。作者提出Language-Prior-Free Verification,使LVLMs能够验证对象存在的置信度,并在此基础上构建无需训练的Self-Validation框架:先验证采样候选描述中对象的存在性,再通过描述选择或聚合来减轻幻觉。在LLaVA-v1.5-7B上,该框架使CHAIR_I指标改进了65.6%,超越了之前的SOTA方法。
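For readers unfamiliar with the metric cited in the entry above, the sketch below computes the instance-level CHAIR score (hallucinated object mentions divided by all object mentions) and shows one simple way a caption could be selected from sampled candidates using per-object existence confidences. The selection rule, the `threshold` value, and the `existence_confidence` dictionary (standing in for whatever Language-Prior-Free Verification outputs) are assumptions. The abstract does not specify them, and the example captions and scores are made up for illustration.

```python
def chair_instance(mentioned_objects, ground_truth_objects):
    """Instance-level CHAIR: fraction of mentioned objects that are not in the image."""
    if not mentioned_objects:
        return 0.0
    hallucinated = [obj for obj in mentioned_objects if obj not in ground_truth_objects]
    return len(hallucinated) / len(mentioned_objects)

def select_caption(candidates, existence_confidence, threshold=0.5):
    """Pick the candidate caption with the fewest objects below the confidence threshold.

    candidates: list of (caption_text, mentioned_objects) pairs.
    existence_confidence: hypothetical per-object scores in [0, 1].
    Ties are broken in favour of the caption that mentions more objects.
    """
    def unverified(objects):
        return sum(1 for obj in objects if existence_confidence.get(obj, 0.0) < threshold)
    best = min(candidates, key=lambda cand: (unverified(cand[1]), -len(cand[1])))
    return best[0]

# Illustrative inputs only; none of these values come from the paper.
candidates = [
    ("A dog catches a frisbee near a bench.", ["dog", "frisbee", "bench"]),
    ("A dog catches a frisbee.", ["dog", "frisbee"]),
]
confidence = {"dog": 0.9, "frisbee": 0.8, "bench": 0.2}
chosen = select_caption(candidates, confidence)
```

Lower CHAIR values are better, so an "improvement" on this metric corresponds to a reduction in the hallucinated fraction.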
Invisible to Humans, Triggered by Agents: Stealthy Jailbreak Attacks on Mobile Vision-Language Agents
Authors: Renhua Ding, Xiao Yang, Zhengwei Fang, Jun Luo, Kun He, Jun Zhu
First: 2025-10-09T05:34:57+00:00 · Latest: 2026-04-08T01:22:16+00:00
Abstract
Large vision-language models (LVLMs) enable autonomous mobile agents to operate smartphone user interfaces, yet vulnerabilities in their perception and interaction remain critically understudied. Existing research often relies on conspicuous overlays, elevated permissions, or unrealistic threat assumptions, limiting stealth and real-world feasibility. In this paper, we introduce a practical and stealthy jailbreak attack framework, which comprises three key components: (i) non-privileged perception compromise, which injects visual payloads into the application interface without requiring elevated system permissions; (ii) agent-attributable activation, which leverages input attribution signals to distinguish agent from human interactions and limits prompt exposure to transient intervals to preserve stealth from end users; and (iii) efficient one-shot jailbreak, a heuristic iterative deepening search algorithm (HG-IDA*) that performs keyword-level detoxification to bypass built-in safety alignment of LVLMs. Moreover, we developed three representative Android applications and curated a prompt-injection dataset for mobile agents. We evaluated our attack across multiple LVLM backends, including closed-source services and representative open-source models, and observed high planning and execution hijack rates (e.g., GPT-4o: 82.5% planning / 75.0% execution), exposing a fundamental security vulnerability in current mobile agents and underscoring critical implications for autonomous smartphone operation.
中文标题/摘要
标题:对人类隐形,由代理触发:针对移动视觉语言代理的隐蔽越狱攻击
大型视觉语言模型(LVLMs)使自主移动代理能够操作智能手机用户界面,但其感知和交互方面的漏洞仍严重未被研究。现有研究往往依赖于显眼的覆盖层、提升的权限或不切实际的威胁假设,限制了隐蔽性和现实世界的可行性。在本文中,我们介绍了一种实用且隐蔽的越狱攻击框架,该框架包括三个关键组件:(i) 非特权感知破坏,无需系统权限即可向应用程序界面注入视觉载荷;(ii) 代理可归因激活,利用输入归因信号区分代理与人类交互,并限制提示暴露于短暂时段以从最终用户那里保持隐蔽;(iii) 高效的一次性越狱,一种启发式迭代加深搜索算法(HG-IDA*),在关键词级别进行去毒处理以绕过LVLM内置的安全对齐。此外,我们开发了三个代表性的Android应用程序,并为移动代理整理了一个提示注入数据集。我们在多个LVLM后端进行了攻击评估,包括闭源服务和代表性开源模型,并观察到高规划和执行劫持率(例如,GPT-4o:82.5%规划/75.0%执行),揭示了当前移动代理中的基本安全漏洞,并强调了自主智能手机操作的关键影响。
Summary / 总结
This paper studies security vulnerabilities in mobile agents driven by large vision-language models (LVLMs) and introduces a practical, stealthy jailbreak attack framework. The framework combines non-privileged perception compromise, agent-attributable activation, and an efficient one-shot jailbreak based on a heuristic iterative deepening search (HG-IDA*). Evaluated across closed-source and open-source LVLM backends, the attack achieves high hijack rates (e.g., GPT-4o: 82.5% planning / 75.0% execution), exposing a fundamental security vulnerability in current mobile agents.
本文研究了由大型视觉语言模型(LVLMs)驱动的移动代理的安全漏洞,并提出了一种实用且隐蔽的越狱攻击框架。该框架结合了非特权感知破坏、代理可归因激活,以及基于启发式迭代加深搜索(HG-IDA*)的高效一次性越狱。在闭源和开源LVLM后端上的评估显示,该攻击实现了较高的劫持率(例如,GPT-4o:82.5%规划/75.0%执行),揭示了当前移动代理中的基本安全漏洞。