arXiv Paper Digest

2026-04-16 04:25
Snapshot: 20260416_0425
SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis
Authors: Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla
First: 2026-04-14T17:59:26+00:00 · Latest: 2026-04-14T17:59:26+00:00
Comments: Project Page: https://lab-spell.github.io/SceneCritic/
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneCritic traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
Summary
SceneCritic is a symbolic evaluator for 3D indoor scene synthesis that uses a structured spatial ontology, SceneOnto, to verify semantic, orientation, and geometric coherence. It aligns better with human judgments than VLM-based evaluators. The accompanying experiments also show that text-only LLMs can outperform VLMs on semantic layout quality, while image-based VLM refinement is the most effective critic modality for semantic and orientation correction.
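The rule-based critic in the abstract uses collision constraints as feedback. A minimal sketch of such a check on a floor-plan layout follows, assuming axis-aligned rectangular footprints; the `Box` type, its field names, and the report format are hypothetical and not taken from SceneCritic.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned object footprint on the floor plan (hypothetical schema)."""
    name: str
    x: float  # footprint center along x, in metres
    y: float  # footprint center along y, in metres
    w: float  # extent along x
    d: float  # extent along y


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two footprints; 0.0 if they do not overlap."""
    dx = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    dy = min(a.y + a.d / 2, b.y + b.d / 2) - max(a.y - a.d / 2, b.y - b.d / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def collision_report(boxes, tol=1e-9):
    """List (name_a, name_b, area) for every pair whose footprints overlap."""
    report = []
    for a, b in combinations(boxes, 2):
        area = overlap_area(a, b)
        if area > tol:
            report.append((a.name, b.name, area))
    return report


if __name__ == "__main__":
    layout = [
        Box("bed", 1.0, 1.0, 2.0, 2.0),
        Box("chair", 1.9, 1.0, 0.6, 0.6),  # pushed into the bed
        Box("desk", 4.0, 4.0, 1.0, 1.0),
    ]
    for name_a, name_b, area in collision_report(layout):
        print(f"collision: {name_a} / {name_b}, overlap {area:.2f} m^2")
```

A report of named violations like this can be fed back to the generator as text, which is the kind of object-level feedback the abstract describes. Rotated footprints would need a polygon intersection test instead.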
Representation geometry shapes task performance in vision-language modeling for CT enterography
Authors: Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham
First: 2026-04-14T17:56:23+00:00 · Latest: 2026-04-14T17:56:23+00:00
Abstract
Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7-14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80-0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
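Multi-window RGB encoding, as the abstract describes it, maps complementary Hounsfield Unit (HU) windows to the three color channels. The sketch below shows the general idea on plain Python lists. The window centers and widths are hypothetical placeholders, since the abstract does not state which windows were used, and real volumes would normally be processed with an array library.

```python
# Hypothetical (center, width) HU windows; the abstract does not list the
# windows the authors actually used.
HU_WINDOWS = (
    (40.0, 400.0),    # red channel: broad soft-tissue window
    (50.0, 150.0),    # green channel: narrow, higher-contrast window
    (400.0, 1800.0),  # blue channel: wide window reaching dense tissue
)


def window_to_unit(hu, center, width):
    """Clip one HU value to a window and rescale it linearly to [0, 1]."""
    low = center - width / 2.0
    return min(max((hu - low) / width, 0.0), 1.0)


def encode_slice(hu_slice, windows=HU_WINDOWS):
    """Turn a 2D list of HU values into a 2D list of (r, g, b) tuples."""
    return [
        [tuple(window_to_unit(hu, c, w) for c, w in windows) for hu in row]
        for row in hu_slice
    ]


if __name__ == "__main__":
    tiny_slice = [[-1000.0, 40.0], [300.0, 2000.0]]  # air up to very dense
    for row in encode_slice(tiny_slice):
        print([tuple(round(v, 3) for v in px) for px in row])
```

Each channel saturates at a different HU range, so a single RGB slice carries three tissue-contrast views of the same anatomy rather than three spatial views.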
Summary
This study investigates vision-language transfer learning on CT enterography for assessing inflammatory bowel disease (IBD). Mean pooling of slice embeddings is better for categorical disease assessment, while attention pooling is better for cross-modal retrieval. Per-slice tissue contrast matters more than spatial coverage: multi-window RGB encoding outperforms multiplanar sampling strategies. For report generation, fine-tuning without retrieval context reaches only chance-level within-1 severity accuracy, whereas retrieval-augmented generation (RAG) improves both accuracy and ordinal MAE. A three-teacher pseudolabel framework enables all comparisons without expert annotations.
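The report-generation results are stated in terms of within-1 severity accuracy, ordinal MAE, a prevalence-matched chance level, and MRR for retrieval. The sketch below gives common definitions of these quantities. The chance-level formula assumes predictions are drawn independently from the same class distribution as the labels, which is one reading of "prevalence-matched" and is not confirmed by the abstract.

```python
def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions at most one ordinal level from the label."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def ordinal_mae(y_true, y_pred):
    """Mean absolute difference between predicted and true ordinal levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def chance_within_one(prevalence):
    """Expected within-1 accuracy of a guesser that samples each prediction
    from the class prevalence, independently of the input (an assumption)."""
    return sum(
        p_i * p_j
        for i, p_i in enumerate(prevalence)
        for j, p_j in enumerate(prevalence)
        if abs(i - j) <= 1
    )


def mean_reciprocal_rank(ranks):
    """MRR given the 1-based rank of the correct item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

With skewed class prevalence the chance baseline can sit high, which is why a within-1 accuracy near 70% carries little evidence of learned ordering when random guessing already reaches about 71%.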
SAM3-I: Segment Anything with Instructions
Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Wei Ji, Qi Bi, Yongri Piao, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Huchuan Lu, Li Cheng
First: 2025-12-04T09:00:25+00:00 · Latest: 2026-04-14T17:28:04+00:00
Abstract
Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.
Summary
SAM3-I extends SAM3 by integrating instruction-aware cascaded adaptation and alignment losses, allowing direct interpretation of natural-language instructions while maintaining strong concept recall. Experiments show that SAM3-I performs well in both referring and reasoning-based segmentation tasks, demonstrating that SAM3 can be effectively adapted to follow complex instructions without losing its original strengths. The dataset HMPL-Instruct, which covers hierarchical instruction semantics and diverse target granularities, supports this instruction-following learning.
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
Authors: Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze
First: 2026-04-14T17:12:41+00:00 · Latest: 2026-04-14T17:12:41+00:00
Abstract
Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.
ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding
Authors: Xucheng Wang, Xiaoman Zhang, Sung Eun Kim, Ankit Pal, Pranav Rajpurkar
First: 2026-04-13T02:32:51+00:00 · Latest: 2026-04-14T16:22:32+00:00
Abstract
Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
Summary
ReXSonoVQA evaluates vision-language models (VLMs) on dynamic procedural understanding in ultrasound, a capability that autonomous ultrasound systems would require. The benchmark comprises 514 video clips and 514 questions targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot results show that VLMs can extract some procedural information but gain little over text-only baselines on troubleshooting questions, exposing limitations in causal reasoning.
ASTRA: Let Arbitrary Subjects Transform in Video Editing
Authors: Fei Shen, Weihao Xu, Rui Yan, Dong Zhang, Xiangbo Shu, Jinhui Tang, Maocheng Zhao
First: 2025-10-01T17:59:56+00:00 · Latest: 2026-04-14T16:17:31+00:00
Abstract
While existing video editing methods excel with single subjects, they struggle in dense, multi-subject scenes, frequently suffering from attention dilution and mask boundary entanglement that cause attribute leakage and temporal instability. To address this, we propose ASTRA, a training-free framework for seamless, arbitrary-subject video editing. Without requiring model fine-tuning, ASTRA precisely manipulates multiple designated subjects while strictly preserving non-target regions. It achieves this via two core components: a prompt-guided multimodal alignment module that generates robust conditions to mitigate attention dilution, and a prior-based mask retargeting module that produces temporally coherent mask sequences to resolve boundary entanglement. Functioning as a versatile plug-and-play module, ASTRA seamlessly integrates with diverse mask-driven video generators. Extensive experiments on our newly constructed benchmark, MSVBench, demonstrate that ASTRA consistently outperforms state-of-the-art methods. Code, models, and data are available at https://github.com/XWH-A/ASTRA.
Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts
Authors: Ruijia Li, Mingzi Zhang, Zengyi Yu, Yuang Wei, Bo Jiang
First: 2026-04-11T13:12:22+00:00 · Latest: 2026-04-14T16:14:36+00:00
Abstract
As Vision-Language Models (VLMs) become integral to educational decision-making, ensuring their fairness is paramount. However, current text-centric evaluations neglect the visual modality, leaving an unregulated channel for latent social biases. To bridge this gap, we present Edu-MMBias, a systematic auditing framework grounded in the tri-component model of attitudes from social psychology. This framework diagnoses bias across three hierarchical dimensions: cognitive, affective, and behavioral. Utilizing a specialized generative pipeline that incorporates a self-correct mechanism and human-in-the-loop verification, we synthesize contamination-resistant student profiles to conduct a holistic stress test on state-of-the-art VLMs. Our extensive audit reveals critical, counter-intuitive patterns: models exhibit a compensatory class bias favoring lower-status narratives while simultaneously harboring deep-seated health and racial stereotypes. Crucially, we find that visual inputs act as a safety backdoor, triggering a resurgence of biases that bypass text-based alignment safeguards and revealing a systematic misalignment between latent cognition and final decision-making. The contributions of this paper are available at: https://anonymous.4open.science/r/EduMMBias-63B2.
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
Authors: Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
First: 2025-02-02T05:14:22+00:00 · Latest: 2026-04-14T16:10:41+00:00
Comments: 706 papers, 60 pages, 3 figures, 14 tables; GitHub: https://github.com/xingjunm/Awesome-Large-Model-Safety
Abstract
The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review the defense strategies proposed for each type of attack, where available, and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.
Summary
This survey examines the safety challenges of large models and agents, covering model families such as Vision Foundation Models, Large Language Models, and Vision-Language Models. It identifies common safety threats, including adversarial attacks and data poisoning, and reviews existing defense strategies. The survey highlights the need for comprehensive safety evaluations and scalable defense mechanisms, and stresses the importance of international collaboration in addressing these challenges.
Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
Authors: Muhammad Kamran Janjua, Hugo Silva, Di Niu, Bahador Rashidi
Venue: CVPR 2026
First: 2026-04-14T15:45:22+00:00 · Latest: 2026-04-14T15:45:22+00:00
Comments: Accepted to CVPR 2026
Abstract
Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs but how tool outputs are represented. We introduce Perception Programs (P^2), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P^2 consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P^2 raises its accuracy from 41.35% to 86.47% on multi-view reasoning, from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15-40% absolute gains from P^2, surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
Summary
The research aims to enhance visual reasoning in multimodal language models (MLLMs) with Perception Programs (P^2), which rewrite dense, pixel-level tool outputs into compact, structured, language-native summaries that the model can parse and reason over directly. Across six perception-centric tasks, P^2 yields large improvements, such as raising GPT-5 Mini's accuracy from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, setting new state-of-the-art results. Smaller models also see 15-40% absolute gains, without any training or model modifications.
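To make the idea of rewriting a tool output into a language-native summary concrete, here is a minimal sketch for a relative-depth question. It is an illustration only: the function names, the text format, and the convention that smaller depth values mean closer to the camera are all assumptions, and the actual P^2 program format is not given in the abstract.

```python
def patch_mean(depth, row, col, radius):
    """Average depth in a square patch around (row, col), clipped to the map."""
    rows = range(max(row - radius, 0), min(row + radius + 1, len(depth)))
    cols = range(max(col - radius, 0), min(col + radius + 1, len(depth[0])))
    values = [depth[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


def relative_depth_summary(depth, points, radius=1):
    """Rewrite a dense depth map into a few lines of text about named points.

    depth  : 2D list of depth values (assumed: smaller = closer to camera)
    points : dict mapping a point label to its (row, col) location
    """
    stats = {name: patch_mean(depth, r, c, radius) for name, (r, c) in points.items()}
    lines = [
        "task: relative_depth",
        "convention: smaller depth value = closer to camera",
    ]
    for name, (r, c) in points.items():
        lines.append(f"point {name}: row={r}, col={c}, mean_depth={stats[name]:.2f}")
    lines.append(f"closer_point: {min(stats, key=stats.get)}")
    return "\n".join(lines)


if __name__ == "__main__":
    depth_map = [[1, 1, 5, 5], [1, 1, 5, 5], [1, 1, 5, 5]]
    print(relative_depth_summary(depth_map, {"A": (1, 0), "B": (1, 3)}))
```

The resulting text would be placed in the prompt in place of, or alongside, the raw depth image, so the model reasons over a handful of labeled numbers instead of pixels.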
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
Authors: Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li, Xin Wang, Yiwei Wei, Jiujiang Guo, Jiahuan Long, Tingsong Jiang, Wen Yao
First: 2026-04-14T14:52:15+00:00 · Latest: 2026-04-14T14:52:15+00:00
Abstract
Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.
Summary
The research addresses the security vulnerabilities of Vision-Language Models (VLMs) in the physical world, whereas existing studies focus primarily on digital settings. The authors introduce Multimodal Semantic Lighting Attacks (MSLA), a physically deployable adversarial attack framework that uses controllable lighting to disrupt semantic understanding in real scenes. MSLA degrades the zero-shot classification performance of CLIP variants and induces severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering tasks. Experiments in both digital and physical domains show that the attack is effective, transferable, and practically realizable, highlighting the need for robustness evaluation in real-world settings.
RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair
Authors: Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi, Praful Hambarde, Amit Shukla
First: 2026-04-14T14:44:45+00:00 · Latest: 2026-04-14T14:44:45+00:00
Abstract
Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
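The abstract describes STAMP as redirecting MLP activations through closed-form pseudoinverse updates. The sketch below is not the authors' algorithm. It shows the generic single-sample building block that such methods rest on: given a weight matrix W, one input x, and a desired output t, the minimum-norm update W' = W + (t - Wx) x^+ satisfies W'x = t, where x^+ = x^T / (x^T x) is the pseudoinverse of a nonzero column vector. All names here are illustrative.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def redirect_activation(W, x, target):
    """Return W' = W + (target - W x) x^+, so that W' maps x exactly to target.

    Directions orthogonal to x are left untouched, because the correction is
    a rank-one matrix whose rows are multiples of x.
    """
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        raise ValueError("x must be nonzero to have a usable pseudoinverse")
    residual = [t - y for t, y in zip(target, matvec(W, x))]
    return [
        [w + r * xi / norm_sq for w, xi in zip(row, x)]
        for row, r in zip(W, residual)
    ]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    x = [1.0, 2.0]          # activation for the sample to be redirected
    refusal = [0.0, 0.0]    # hypothetical target output
    W_new = redirect_activation(W, x, refusal)
    print(matvec(W_new, x))
```

Because the update needs no gradient steps and only one sample, it illustrates why a closed-form approach can be training-free. The low-rank variant and the choice of refusal subspace mentioned in the abstract are not modeled here.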
CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs
Authors: Yuanxiang Liu, Songze Li, Xiaoke Guo, Zhaoyan Gong, Qifei Zhang, Huajun Chen, Wen Zhang
Venue: ACL 2026
First: 2026-01-16T07:27:40+00:00 · Latest: 2026-04-14T13:55:51+00:00
Comments: ACL 2026 Main
Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities but often grapple with reliability challenges like hallucinations. While Knowledge Graphs (KGs) offer explicit grounding, existing paradigms of KG-augmented LLMs typically exhibit cognitive rigidity: they apply homogeneous search strategies that leave them unstable under neighborhood noise and structural misalignment, leading to reasoning stagnation. To address these challenges, we propose CoG, a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. First, functioning as the fast, intuitive process, the Relational Blueprint Guidance module leverages relational blueprints as interpretable soft structural constraints to rapidly stabilize the search direction against noise. Second, functioning as the prudent, analytical process, the Failure-Aware Refinement module intervenes upon encountering reasoning impasses. It triggers evidence-conditioned reflection and executes controlled backtracking to overcome reasoning stagnation. Experimental results on three benchmarks demonstrate that CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
Summary
CoG is a training-free framework that enhances the reasoning capabilities of large language models by integrating relational blueprints and failure-aware refinement over knowledge graphs. The Relational Blueprint Guidance module stabilizes search direction against noise, while the Failure-Aware Refinement module intervenes to overcome reasoning stagnation. Experiments show that CoG outperforms existing methods in both accuracy and efficiency on three benchmarks.
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
Authors: Yingjian Zhu, Xinming Wang, Kun Ding, Ying Wang, Bin Fan, Shiming Xiang
Venue: ACL 2026
First: 2026-04-07T12:52:38+00:00 · Latest: 2026-04-14T13:54:15+00:00
Comments: Accepted by ACL 2026 Findings
Abstract
Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advancements, prevailing methods still primarily depend on images as the retrieval key, and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to leverage their potential fully. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that bridges these gaps by proposing a multi-modal retriever and redefining the role of VLMs. Rather than using VLMs merely as answer generators, we assign them two specialized agent roles: a Refiner and an Inspector. The Refiner utilizes the capability of VLMs to rewrite the textual query according to the input image, significantly improving the performance of the multimodal retriever. The Inspector facilitates a decoupled generation strategy by selectively routing reliable retrieved context to another LLM for answer generation, while relying on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released at https://github.com/zhuyjan/WikiSeeker.
Summary / 总结
WikiSeeker is a novel multi-modal Retrieval-Augmented Generation framework that enhances Knowledge-Based Visual Question Answering by leveraging Vision-Language Models (VLMs) more effectively. It introduces a Refiner and an Inspector to improve query rewriting and context routing, respectively. Experiments on EVQA, InfoSeek, and M2KR show that WikiSeeker outperforms existing methods in both retrieval accuracy and answer quality, achieving state-of-the-art results.
WikiSeeker 是一种新型的多模态检索增强生成框架,通过更有效地利用视觉语言模型(VLMs),增强基于知识的视觉问答。它引入了重构器和检查器来改进查询重写和上下文路由。在 EVQA、InfoSeek 和 M2KR 上的实验表明,WikiSeeker 在检索准确性和答案质量方面均优于现有方法,达到了最先进的性能。
BINDER: Instantly Adaptive Mobile Manipulation with Open-Vocabulary Commands
Authors: Seongwon Cho, Daechul Ahn, Donghyun Shin, Hyeonbeom Choi, San Kim, Jonghyun Choi
First: 2025-11-27T12:03:31+00:00 · Latest: 2026-04-14T13:17:34+00:00
Comments: 12 pages, 8 figures
Abstract
Open-vocabulary mobile manipulation (OVMM) requires robots to follow language instructions, navigate, and manipulate while updating their world representation under dynamic environmental changes. However, most prior approaches update their world representation only at discrete update points such as navigation targets, waypoints, or the end of an action step, leaving robots blind between updates and causing cascading failures: overlooked objects, late error detection, and delayed replanning. To address this limitation, we propose BINDER (Bridging INstant and DEliberative Reasoning), a dual-process framework that decouples strategic planning from continuous environment monitoring. Specifically, BINDER integrates a Deliberative Response Module (DRM, a multimodal LLM for task planning) with an Instant Response Module (IRM, a VideoLLM for continuous monitoring). The two modules play complementary roles: the DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM analyzes video streams to update memory, correct ongoing actions, and trigger replanning when necessary. Through this bidirectional coordination, the modules address the trade-off between maintaining awareness and avoiding costly updates, enabling robust adaptation under dynamic conditions. Evaluated in three real-world environments with dynamic object placement, BINDER achieves substantially higher success and efficiency than SoTA baselines, demonstrating its effectiveness for real-world deployment.
中文标题/摘要
标题:BINDER:即时自适应移动操作与开放词汇命令
开放词汇移动操作(OVMM)要求机器人遵循语言指令、导航和操作,同时在动态环境变化下更新其世界表示。然而,大多数先前的方法仅在导航目标、航点或动作步骤结束时进行离散的世界表示更新,使机器人在更新之间处于盲区,导致级联失败:忽略对象、延迟错误检测和延迟重新规划。为解决这一局限,我们提出了BINDER(连接即时与决策推理),这是一种双过程框架,将战略规划与连续环境监控脱钩。具体而言,BINDER将决策响应模块(DRM,一种多模态LLM,用于任务规划)与即时响应模块(IRM,一种视频LLM,用于连续监控)相结合。这两个模块发挥互补作用:DRM执行具有结构化3D场景更新的战略规划,并指导IRM关注什么,而IRM分析视频流以更新记忆、纠正正在进行的操作并在必要时触发重新规划。通过这种双向协调,模块解决了保持意识和避免昂贵更新之间的权衡,使机器人在动态条件下实现稳健的自适应。在三个具有动态物体放置的现实环境中评估,BINDER在成功率和效率方面显著优于最先进的基线,证明了其在现实世界部署中的有效性。
Summary / 总结
BINDER is a dual-process framework for open-vocabulary mobile manipulation that couples a Deliberative Response Module (DRM) with an Instant Response Module (IRM). The DRM performs strategic planning with structured 3D scene updates and guides what the IRM attends to, while the IRM monitors video streams, corrects ongoing actions, and triggers replanning when necessary. Through this coordination, the modules maintain awareness while avoiding costly updates, enabling robust adaptation to dynamic conditions. In three real-world environments, BINDER achieves higher success and efficiency than SoTA baselines.
Are Video Reasoning Models Ready to Go Outside?
Authors: Yangfan He, Changgyu Boo, Jaehong Yoon
First: 2026-03-11T11:10:52+00:00 · Latest: 2026-04-14T12:30:40+00:00
Comments: Project Page: https://robust-video-reason.github.io/
Abstract
In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
Do VLMs Truly "Read" Candlesticks? A Multi-Scale Benchmark for Visual Stock Price Forecasting
Authors: Kaiqi Hu, Linda Xiao, Shiyue Xu, Ziyi Tang, Mingwen Liu
First: 2026-04-14T12:26:34+00:00 · Latest: 2026-04-14T12:26:34+00:00
Comments: We evaluate whether VLMs can comprehend multi-scale visual stock price data like human analysts with a proposed benchmark, identifying current VLMs' weak predictive power, significant biases, and limited sensitivity to forecast horizons and prompts
Abstract
Vision-language models (VLMs) are increasingly applied to visual stock price forecasting, yet existing benchmarks inadequately evaluate their understanding of stock prices in candlestick charts. First, prior studies fail to isolate whether VLMs' comprehension of visual inputs genuinely improves predictive performance and whether VLMs truly comprehend candlestick patterns. Further, most existing datasets and evaluation setups are designed around single-period or tabular inputs, whereas human analysts rely strongly on multi-scale candlestick charts, in which longer-term horizons capture trend direction and shorter-term horizons provide cues for inflection points. This mismatch makes it difficult to systematically assess VLMs' ability to integrate short-term and long-term visual market dynamics. To bridge this gap, we construct a multi-scale candlestick chart dataset and a standardized evaluation framework to assess VLMs' ability to utilize multi-scale visual market signals. Evaluation combines confusion-matrix-based diagnostics with information coefficient (IC) time series metrics and includes XGBoost as a feature-based temporal baseline. Using this dataset, we benchmark representative VLMs and analyze their ability to leverage multi-scale stock price data. Experimental results show that most VLMs perform well only under persistent uptrend or downtrend conditions, while exhibiting weak predictive capability in more common market scenarios. We also identify significant prediction biases and limited sensitivity to explicitly specified forecast horizons in prompts, indicating inherent limitations in precise temporal reasoning.
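The information coefficient mentioned in the abstract is commonly defined as the correlation between predicted signals and the returns that are subsequently realized, computed once per period to form a time series. The sketch below illustrates that general idea only. The abstract does not say which correlation variant the authors use, so the choice of Pearson correlation and the names `pearson` and `ic_series` are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def ic_series(predictions_by_period, returns_by_period):
    """One IC value per period: correlation of predictions with realized returns.

    Assumption: each period holds one prediction and one realized return
    per stock, listed in the same order.
    """
    return [pearson(p, r) for p, r in zip(predictions_by_period, returns_by_period)]
```

A series that hovers near zero would indicate weak predictive power in the sense the abstract describes, while a persistently positive series would indicate that predictions rank outcomes usefully.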
PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning
Authors: Jinlong Liu, Wanggui He, Peng Zhang, Mushui Liu, Hao Jiang, Pipei Huang
First: 2026-04-14T12:21:15+00:00 · Latest: 2026-04-14T12:21:15+00:00
Abstract
Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires no annotation and no reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.
中文标题/摘要
标题:PromptEcho:无需标注的视觉语言模型文本到图像强化学习奖励
强化学习(RL)可以提高文本到图像(T2I)模型的指令遵循能力,但获得高质量的奖励信号仍然具有挑战性:CLIP分数过于粗糙,而基于VLM的奖励模型(例如RewardDance)需要昂贵的人工标注偏好数据和额外的微调。我们提出PromptEcho,这是一种无需标注且无需训练奖励模型的奖励构建方法。给定生成的图像和引导查询,PromptEcho 计算冻结的VLM在原始提示作为标签时的令牌级交叉熵损失,直接提取VLM预训练期间编码的图像-文本对齐知识。奖励是确定性的,计算效率高,并且随着更强的开源VLM变得可用而自动提高。为了评估,我们开发了DenseAlignBench,这是一个概念丰富的密集描述基准,用于严格测试指令遵循能力。在两个最先进的T2I模型(Z-Image和QwenImage-2512)上的实验结果表明,PromptEcho在DenseAlignBench上实现了显著的改进(+26.8个百分点 / +16.2个百分点净胜率),并且在GenEval、DPG-Bench和TIIFBench上也表现出一致的收益,而无需任何特定任务的训练。消融研究证实,PromptEcho全面优于使用相同VLM的基于推理的评分,并且奖励质量随着VLM规模的增加而提高。我们将开源训练模型和DenseAlignBench。
Summary / 总结
PromptEcho is a reward construction method for text-to-image reinforcement learning that does not require annotation or additional reward model training. It computes the token-level cross-entropy loss of a frozen vision-language model with the original prompt as the label, extracting image-text alignment knowledge from pretraining. Experiments on Z-Image and QwenImage-2512 show significant improvements on DenseAlignBench (+26.8 percentage points) and consistent gains on other benchmarks without task-specific training. Ablation studies confirm its superior performance over inference-based scoring and that reward quality improves with larger VLMs.
PromptEcho 是一种无需标注和额外训练奖励模型的方法,用于文本到图像的强化学习。它通过计算冻结的视觉语言模型与原始提示词之间的标记级交叉熵损失,提取预训练中的图像-文本对齐知识。实验表明,该方法在 DenseAlignBench 上取得了显著改进(+26.8 个百分点),并在其他基准上也表现出一致的提升,无需特定任务的训练。消融研究证实,它在基于推理的评分方法上表现更优,并且奖励质量随着模型规模的增大而提高。
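The reward construction described in this entry can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the frozen VLM has already produced one logit vector per prompt token, conditioned on the generated image, and it assumes the reward is the negated mean token-level cross-entropy, so that a lower loss gives a higher reward. The function names are hypothetical.

```python
import math


def token_cross_entropy(logits, target_id):
    """Cross-entropy for one token: -log softmax(logits)[target_id]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_id]


def prompt_echo_reward(per_token_logits, prompt_token_ids):
    """Hypothetical reward: negative mean cross-entropy over the prompt tokens.

    per_token_logits: one logit vector per prompt token (assumed to come
    from a frozen VLM that was shown the generated image).
    prompt_token_ids: token ids of the original prompt, used as labels.
    """
    if not prompt_token_ids or len(per_token_logits) != len(prompt_token_ids):
        raise ValueError("need one logit vector per prompt token")
    losses = [
        token_cross_entropy(logits, token_id)
        for logits, token_id in zip(per_token_logits, prompt_token_ids)
    ]
    return -sum(losses) / len(losses)
```

Because nothing here is sampled or trained, the same image and prompt always give the same value, which matches the abstract's description of the reward as deterministic.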
ArtiCAD: Articulated CAD Assembly Design via Multi-Agent Code Generation
Authors: Yuan Shui, Yandong Guan, Zhanwei Zhang, Juncheng Hu, Jing Zhang, Dong Xu, Qian Yu
First: 2026-04-13T04:49:30+00:00 · Latest: 2026-04-14T12:18:13+00:00
Abstract
Parametric Computer-Aided Design (CAD) of articulated assemblies is essential for product development, yet generating these multi-part, movable models from high-level descriptions remains unexplored. To address this, we propose ArtiCAD, the first training-free multi-agent system capable of generating editable, articulated CAD assemblies directly from text or images. Our system divides this complex task among four specialized agents: Design, Generation, Assembly, and Review. One of our key insights is to predict assembly relationships during the initial design stage rather than the assembly stage. By utilizing a Connector that explicitly defines attachment points and joint parameters, ArtiCAD determines these relationships before geometry generation, effectively bypassing the limited spatial reasoning capabilities of current LLMs and VLMs. To further ensure high-quality outputs, we introduce validation steps in the generation and assembly stages, accompanied by a cross-stage rollback mechanism that accurately isolates and corrects design- and code-level errors. Additionally, a self-evolving experience store accumulates design knowledge to continuously improve performance on future tasks. Extensive evaluations on three datasets (ArtiCAD-Bench, CADPrompt, and ACD) validate the effectiveness of our approach. We further demonstrate the applicability of ArtiCAD in requirement-driven conceptual design, physical prototyping, and the generation of embodied AI training assets through URDF export.
中文标题/摘要
标题:ArtiCAD:基于多智能体代码生成的铰接式 CAD 装配体设计
铰接式装配体的参数化计算机辅助设计(CAD)对产品开发至关重要,但如何从高层次描述生成这些多部件、可活动的模型尚未被探索。为了解决这一问题,我们提出了 ArtiCAD,这是第一个无需训练的多智能体系统,能够直接从文本或图像生成可编辑的铰接式 CAD 装配体。我们的系统将这一复杂任务分配给四个专门的智能体:设计、生成、装配和审查。我们的一项关键见解是在初始设计阶段而不是装配阶段预测装配关系。通过使用明确定义连接点和关节参数的连接器(Connector),ArtiCAD 在几何生成之前确定这些关系,从而有效地绕过了当前 LLM 和 VLM 有限的空间推理能力。为了进一步确保高质量的输出,我们在生成和装配阶段引入了验证步骤,并配备了跨阶段回滚机制,以准确隔离和纠正设计和代码级别的错误。此外,一个自我进化的经验库积累设计知识,以持续改进未来任务的性能。在三个数据集(ArtiCAD-Bench、CADPrompt 和 ACD)上的广泛评估验证了我们方法的有效性。我们进一步展示了 ArtiCAD 在需求驱动的概念设计、物理原型制作以及通过 URDF 导出生成具身 AI 训练资产方面的适用性。
Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs
Authors: Jianhao Chen, Haoyang Chen, Hanjie Zhao, Haozhe Liang, Tieyun Qian
First: 2026-04-14T11:44:59+00:00 · Latest: 2026-04-14T11:44:59+00:00
Comments: 12 pages, 9 figures
Abstract
The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce MemJack, a MEMory-augmented multi-agent JAilbreak attaCK framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48% ASR against Qwen3-VL-Plus, scaling to 90% under extended budgets. Furthermore, to catalyze future defensive alignment research, we will release MemJack-Bench, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.
中文标题/摘要
标题:每张图片都讲述一个危险的故事:针对VLM的记忆增强多智能体越狱攻击
视觉语言模型(VLMs)的迅速发展催生了人工智能前所未有的能力;然而,这种不断扩展的多模态特性无意中暴露了一个广泛且不受约束的对抗攻击面。当前的多模态越狱策略主要集中在表面像素扰动、排版攻击或有害图像上,但未能触及视觉数据中固有的复杂语义结构。这使得原始自然图像的庞大语义攻击面大多未被审视。为了揭示这些深层次的语义漏洞,我们引入了MemJack,一种记忆增强的多智能体越狱攻击框架(MEMory-augmented multi-agent JAilbreak attaCK),明确利用视觉语义来组织自动化越狱攻击。MemJack 通过协调多智能体合作动态地将视觉实体映射到恶意意图,通过多角度视觉语义伪装生成对抗性提示,并利用迭代零空间投影(INLP)几何滤波器绕过过早的潜在空间拒绝。通过在持久的多模态经验记忆中积累和转移成功的策略,MemJack 能在不同图像上保持高度连贯的多轮越狱攻击交互,从而提高新图像上的攻击成功率(ASR)。在完整的未修改的COCO val2017图像上进行的广泛实证评估表明,MemJack 在Qwen3-VL-Plus上的ASR达到71.48%,在扩展预算下可达到90%。此外,为了促进未来防御对齐研究,我们将发布MemJack-Bench,一个包含超过113,000个交互式多模态越狱攻击轨迹的综合数据集,为开发固有鲁棒的VLMs奠定重要基础。
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
Authors: Zijian Liu, Sihan Cao, Pengcheng Zheng, Kuien Liu, Caiyan Qin, Xiaolin Qin, Jiwei Wei, Chaoning Zhang
First: 2026-04-14T11:05:42+00:00 · Latest: 2026-04-14T11:05:42+00:00
Abstract
Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
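The anchor-frame definition in the abstract (the frame with the highest aggregated frame-level attention mass) can be illustrated with a small sketch. It covers only the identification step, not the DTR rebalancing itself. It assumes a flat list of attention weights over visual tokens in which each frame contributes the same number of consecutive tokens; the function names and that layout are assumptions for illustration.

```python
def frame_attention_mass(token_attention, tokens_per_frame):
    """Sum attention over the consecutive visual tokens that belong to each frame."""
    if tokens_per_frame <= 0 or not token_attention:
        raise ValueError("need at least one token and a positive frame size")
    if len(token_attention) % tokens_per_frame != 0:
        raise ValueError("token count must be a multiple of tokens_per_frame")
    return [
        sum(token_attention[start:start + tokens_per_frame])
        for start in range(0, len(token_attention), tokens_per_frame)
    ]


def anchor_frame(token_attention, tokens_per_frame):
    """Return (index of the frame with the largest mass, its share of total mass)."""
    masses = frame_attention_mass(token_attention, tokens_per_frame)
    index = max(range(len(masses)), key=masses.__getitem__)
    total = sum(masses)
    share = masses[index] / total if total > 0 else 0.0
    return index, share
```

A share close to 1 means nearly all visual attention falls on one frame, which is the kind of temporally imbalanced concentration the abstract associates with hallucination-prone generation.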
Cross-Attentive Multiview Fusion of Vision-Language Embeddings
Authors: Tomas Berriel Martins, Martin R. Oswald, Javier Civera
First: 2026-04-14T10:25:32+00:00 · Latest: 2026-04-14T10:25:32+00:00
Abstract
Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.
中文标题/摘要
标题:视觉-语言嵌入的多视图注意力融合
视觉-语言模型对于开放词汇2D语义分割的发展至关重要。然而,将这些模型从2D图像提升到3D场景仍然是一个具有挑战性的问题。现有方法通常将2D描述符反投影并平均到多个视图中,或者选择一个单一的代表性描述符,经常导致次优的3D表示。在本文中,我们提出了一种新颖的多视图变换器架构,该架构在多个视角的视觉-语言描述符之间进行交叉注意,并将它们融合为一个统一的每3D实例嵌入。作为第二个贡献,我们利用多视图一致性作为此融合的自我监督信号,当将其添加到标准监督目标类损失中时,显著提高了性能。我们所称的交叉注意力多视图融合(CAMFusion),不仅一致地优于简单的平均或单视图描述符选择,而且在3D语义和实例分类基准测试中达到了最先进的结果,包括对域外数据集的零样本评估。
Summary / 总结
The research aims to improve 3D semantic and instance classification by developing a novel multiview transformer architecture called CAMFusion. This method cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified embedding for each 3D instance. The approach leverages multiview consistency as a self-supervision signal, which enhances performance when added to a standard supervised target-class loss. Experimental results show that CAMFusion outperforms naive averaging or single-view descriptor selection and achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.
研究旨在通过开发一种名为CAMFusion的新颖多视图变换器架构来改进3D语义和实例分类。该方法跨多个视角的视觉-语言描述符进行交叉注意,并将其融合为一个统一的每3D实例嵌入。该方法利用多视图一致性作为自我监督信号,在添加到标准监督目标类损失后显著提高性能。实验表明,CAMFusion不仅优于简单的平均或单视图描述符选择,还在3D语义和实例分类基准测试中达到了最先进的结果,包括在离域数据集上的零样本评估。
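To make the contrast between naive averaging and attention-based fusion concrete, here is a deliberately simplified sketch. It is not the CAMFusion architecture, which the abstract describes as a multiview transformer. It only compares a plain mean of per-view descriptors with a single scaled dot-product attention pooling step. The `query` vector stands in for whatever learned query such a model would use and is an assumption, as are the function names.

```python
import math


def average_fusion(views):
    """Baseline: element-wise mean of the per-view descriptors."""
    n = len(views)
    return [sum(view[d] for view in views) / n for d in range(len(views[0]))]


def attention_fusion(query, views):
    """Single-query scaled dot-product attention pooling over views (illustrative)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, view)) / scale for view in views]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    return [
        sum(w * view[d] for w, view in zip(weights, views))
        for d in range(len(query))
    ]
```

With a zero query every view receives the same weight and the result equals the average; a query aligned with one view shifts the fused embedding toward that view, which is the flexibility that averaging lacks.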
INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT
Authors: Idan Tankel, Nir Mazor, Rafi Brada, Christina LeBedis, Guy ben-Yosef
First: 2025-12-10T23:28:26+00:00 · Latest: 2026-04-14T10:20:02+00:00
Comments: Accepted for Spotlight presentation at MIDL 2026
Abstract
Incidental findings in CT scans, though often benign, can have significant clinical implications and should be reported following established guidelines. Traditional manual inspection by radiologists is time-consuming and variable. This paper proposes a novel framework that leverages large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to improve the efficiency and precision of incidental findings detection, classification, and reporting for abdominal CT scans. Given medical guidelines for abdominal organs, the process of managing incidental findings is automated through a planner-executor framework. The planner, based on LLM, generates Python scripts using predefined base functions, while the executor runs these scripts to perform the necessary checks and detections, via VLMs, segmentation models, and image processing subroutines. We demonstrate the effectiveness of our approach through experiments on a CT abdominal benchmark for three organs, in a fully automatic end-to-end manner. Our results show that the proposed framework outperforms existing pure VLM-based approaches in terms of accuracy and efficiency.
中文标题/摘要
标题:INFORM-CT:结合LLMs和VLMs进行腹部CT意外发现管理
CT扫描中的意外发现虽然通常无害,但可能具有重要的临床意义,应按照既定指南进行报告。传统的放射科医生手动检查耗时且不一致。本文提出了一种新的框架,利用大型语言模型(LLMs)和基础视觉-语言模型(VLMs)在计划和执行代理式方法中,以提高腹部CT扫描中意外发现检测、分类和报告的效率和精确度。根据腹部器官的医疗指南,通过规划者-执行者框架自动化管理意外发现的过程。规划者基于LLM生成使用预定义基础函数的Python脚本,而执行者运行这些脚本,通过VLMs、分割模型和图像处理子程序进行必要的检查和检测。 我们通过在三个器官的CT腹部基准数据集上进行的端到端全自动实验,展示了该方法的有效性。我们的结果表明,所提出的框架在准确性和效率方面优于现有的纯VLM方法。
Summary / 总结
The research aims to improve the detection and reporting of incidental findings in abdominal CT scans by automating the process using a planner-executor framework. The planner, based on a large language model, generates Python scripts, which are then executed by a vision-language model and other image processing tools to perform checks and detections. Experiments on a CT abdominal benchmark for three organs show that the proposed framework outperforms existing methods in terms of accuracy and efficiency.
本文提出了一种结合大型语言模型(LLMs)和视觉语言模型(VLMs)的框架,以应对腹部CT扫描中意外发现的管理挑战。该方法通过规划者执行者系统自动化意外发现的检测、分类和报告。规划者使用LLMs生成Python脚本,执行者则使用VLMs和图像处理技术运行这些脚本。在三个腹部器官的CT基准测试上的实验表明,该框架在准确性和效率方面优于现有的基于VLM的方法。
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
Authors: Ruoxiang Huang, Zhen Yuan
Venue: CVPR 2026 Highlight
First: 2026-04-14T10:12:24+00:00 · Latest: 2026-04-14T10:12:24+00:00
Comments: Accepted by CVPR 2026 (Highlight). 10 pages, 2 figures, 5 tables
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.
Summary / 总结
The research aims to enhance the positional encoding mechanisms in Vision-Language Models (VLMs) by addressing the inefficiencies in attention allocation. MODIX, a training-free framework, dynamically adjusts positional strides based on modality-specific contributions, using covariance-based entropy for intra-modal density and cross-modal alignment for inter-modal interaction. Experiments show that MODIX improves multimodal reasoning and adaptively reallocates attention according to task-specific information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.
研究旨在通过解决注意力分配效率低下的问题来改进视觉语言模型(VLMs)的位置编码机制。MODIX是一个无需训练的框架,根据各模态的贡献动态调整位置步长,使用基于协方差的熵来建模模态内密度,并通过跨模态对齐来建模模态间交互。实验表明,MODIX能够提升多模态推理能力,并根据任务特定的信息分布自适应地重新分配注意力,表明在用于多模态序列建模的Transformer中,位置编码应被视为一种自适应资源。
Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation
Authors: Sihang Jia, Shuliang Liu, Songbo Yang, Yibo Yan, Xin Zou, Xuming Hu
First: 2026-04-14T08:15:44+00:00 · Latest: 2026-04-14T08:15:44+00:00
Abstract
Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.
中文标题/摘要
标题:扰动解码:通过动态文本扰动缓解MLLM幻觉
多模态大型语言模型在推理过程中经常出现幻觉现象,部分原因是语言先验压倒了视觉证据。现有的无需训练的缓解方法要么扰动视觉表示,偏离自然图像分布,要么施加侵入性操作,损害模型的固有生成流畅性。我们提出了一种新的观点,即多模态幻觉表现为解码阶段视觉定位对文本措辞的过度敏感。基于这一洞察,我们提出了扰动解码(DeP),这是一种无需训练的框架,通过受控的文本干预缓解先验引起的幻觉。DeP 使用动态探针施加多层次的文本扰动,以激发潜在的语言先验。利用注意力方差,它增强稳定证据区域,同时抑制特征空间中的可疑噪声。此外,它使用 logits 统计量构建可解释的先验漂移方向,以抵消文本共现带来的概率偏差。大量实验表明,DeP 有效减少了幻觉,并在多个基准上取得了更优的性能。
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
Authors: Jungwon Choi, Eunwoo Kim
Venue: CVPR 2026
First: 2026-04-14T07:39:03+00:00 · Latest: 2026-04-14T07:39:03+00:00
Comments: Accepted by CVPR 2026 findings
Abstract
Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions, to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
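For context, the standard entropy-based filtering that the abstract criticizes can be sketched as follows. This shows the baseline only, not the proposed dual-modality anchor-guided method: each augmented view gets a class-probability vector from the model, and only the views with the lowest prediction entropy are kept. The function names and the default `keep_fraction` of 0.1 are assumptions for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_confident_views(view_probs, keep_fraction=0.1):
    """Return indices of the lowest-entropy views, most confident first.

    view_probs: one class-probability vector per augmented view.
    keep_fraction: share of views to keep (at least one view is kept).
    """
    keep = max(1, int(len(view_probs) * keep_fraction))
    order = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return order[:keep]
```

The weakness the abstract points to is visible here: the selection depends only on the model's own confidence, so a background crop that happens to receive a peaked but wrong prediction would still be selected.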
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks
Authors: Zijian Song, Xiaoxin Lin, Qiuming Huang, Sihan Qin, Guangrun Wang, Liang Lin
First: 2025-06-17T13:40:00+00:00 · Latest: 2026-04-14T07:32:54+00:00
Comments: 20 pages, 11 figures
Abstract
Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems into faithful 3D scenes. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning. We hope that our study will bring researchers' attention to spatially grounded reasoning and advance VLMs in visual problem-solving.
中文标题/摘要
标题:SIRI-Bench:通过复杂推理任务挑战VLM的空间智能
大型语言模型(LLMs)取得了快速进步,主要归功于在复杂推理任务上的强化学习。相比之下,虽然空间智能是视觉语言模型(VLMs)在现实世界交互中不可或缺的基础,但对其复杂的空间推理的系统研究仍相对不足。为弥补这一差距,我们引入了SIRI-Bench,这是一个旨在通过空间关联推理任务评估VLMs结构空间智能的基准。SIRI-Bench 包含9,000个视频-问题-答案三元组,每个问题都嵌入在真实的3D场景中。该基准精心设计,使得解决每个问题都需要空间理解与结构推理。为了促进大规模数据合成,我们开发了一个自动场景生成引擎,该引擎利用协作的LLM代理将抽象的数学问题转化为忠实的3D场景。实验结果表明,最先进的VLMs在SIRI-Bench上面临巨大挑战,突显了结构空间推理的难度。我们希望我们的研究能够引起研究人员对空间关联推理的关注,并推动VLMs在视觉问题解决方面的进步。
MedVeriSeg: Teaching MLLM-Based Medical Segmentation Models to Verify Query Validity Without Extra Training
Authors: Ziqian Lu, Qinyue Tong, Jun Liu, Yunlong Yu
First: 2026-04-11T14:47:32+00:00 · Latest: 2026-04-14T07:00:34+00:00
Comments: 7 pages, 4 figures; the paper is under consideration at Pattern Recognition Letters
Abstract
Despite recent advances in MLLM-based medical image segmentation, existing LISA-like methods cannot reliably reject false queries and often produce hallucinated segmentation masks for absent targets. This limitation reduces practical reliability in both medical education and clinical use. In this work, we propose MedVeriSeg, a training-free verification framework that equips LISA-like medical segmentation models with the ability to identify and reject false queries which contain non-existent targets. Our key observation is that the similarity map between the [SEG] token feature and MLLM image features exhibits markedly different distribution patterns for true and false queries. Based on this, we introduce a Similarity Response Quality Scoring Module that characterizes the similarity map from three aspects: strength, compactness, and purity, producing an initial target-existence prediction. We further incorporate qualitative visual evidence by using GPT-4o to jointly assess the similarity heatmap and the results of Similarity Response Quality Scoring Module for final verification. Experiments on a small-scale benchmark constructed from SA-Med2D-20M show that MedVeriSeg effectively rejects false-query segmentation requests while maintaining reliable recognition of true queries.
Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models
Authors: Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg
Venue: ICLR 2026
First: 2026-04-14T06:59:27+00:00 · Latest: 2026-04-14T06:59:27+00:00
Comments: Accepted at ICLR 2026 Workshop on Agents in the Wild
Abstract
We study typographic prompt injection attacks on vision-language models (VLMs), where adversarial text is rendered as images to bypass safety mechanisms, posing a growing threat as VLMs serve as the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice, the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, while the growing ecosystem of VLMs exhibits substantial variation in vulnerability, complicating defensive approaches. Evaluating 1,000 prompts from SALAD-Bench across four VLMs, namely, GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct under varying font sizes (6--28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR while mid-range fonts achieve peak effectiveness; (2) text attacks are more effective than image attacks for GPT-4o (36% vs 8%) and Claude (47% vs 22%), while Qwen3-VL and Mistral show comparable ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) shows strong negative correlation with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10--12% and reduce ASR by 34--96%, while rotation asymmetrically affects models (Mistral drops 50%, GPT-4o unchanged). These findings highlight that model-specific robustness patterns preclude one-size-fits-all defenses and offer empirical guidance for practitioners selecting VLM backbones for agentic systems operating in adversarial environments.
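Title / Abstract (translated from Chinese)
Title: Reading Between the Pixels: The Relationship Between Text-Image Embedding Alignment and Typographic Attack Success on Vision-Language Models
We study typographic prompt injection attacks on vision-language models (VLMs), in which adversarial text is rendered as an image to bypass safety mechanisms. The threat is growing as VLMs become the perceptual backbone of autonomous agents, from browser automation and computer-use systems to camera-equipped embodied agents. In practice the attack surface is heterogeneous: adversarial text appears at varying font sizes and under diverse visual conditions, and the growing VLM ecosystem varies substantially in vulnerability, which complicates defense. Evaluating 1,000 SALAD-Bench prompts on four VLMs (GPT-4o, Claude Sonnet 4.5, Mistral-Large-3, and Qwen3-VL-4B-Instruct) across font sizes (6-28px) and visual transformations (rotation, blur, noise, contrast changes), we find: (1) font size significantly affects attack success rate (ASR), with very small fonts (6px) yielding near-zero ASR and mid-range fonts being most effective; (2) for GPT-4o and Claude, text attacks are more effective than image attacks (36% vs 8% and 47% vs 22%), while Qwen3-VL and Mistral show similar ASR across modalities; (3) text-image embedding distance from two multimodal embedding models (JinaCLIP and Qwen3-VL-Embedding) correlates strongly and negatively with ASR across all four models (r = -0.71 to -0.93, p < 0.01); (4) heavy degradations increase embedding distance by 10-12% and reduce ASR by 34-96%, while rotation affects models asymmetrically (Mistral drops 50%, GPT-4o is unchanged). These findings indicate that model-specific robustness patterns rule out one-size-fits-all defenses, and they offer empirical guidance for practitioners choosing VLM backbones for agentic systems operating in adversarial environments.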
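The correlation reported in finding (3) can be illustrated with a minimal sketch. It assumes the reported r is a Pearson coefficient, which the abstract does not state, and the function name `pearson_r` and all numbers below are hypothetical placeholders, not data or code from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values (NOT results from the paper):
# one text-image embedding distance and one ASR per rendering condition.
embedding_distance = [0.20, 0.25, 0.31, 0.38, 0.45]
attack_success_rate = [0.46, 0.40, 0.33, 0.21, 0.10]

r = pearson_r(embedding_distance, attack_success_rate)
print(f"r = {r:.2f}")
```

A negative r here means that conditions whose rendered text sits farther from the original prompt in embedding space tend to have lower attack success, which is the direction of the relationship the abstract describes.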
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Authors: Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park
First: 2026-04-14T06:48:31+00:00 · Latest: 2026-04-14T06:48:31+00:00
Comments: Preprint, Project : https://ptkjw1997.github.io/DSTP-page/
Abstract
Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.
Summary / 总结
The study investigates why and when visual token pruning fails in Multimodal Large Language Models, focusing on complex visual reasoning tasks. It identifies Relevant Visual Information Shift (RVIS) during decoding as the primary cause of failure and proposes Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on that lets existing pruning methods align visual tokens with shifting reasoning requirements during decoding. Experiments show that DSTP significantly mitigates the performance degradation of pruning methods on complex reasoning tasks while also yielding gains on visual understanding benchmarks, and that it is effective across diverse state-of-the-art architectures with minimal computational overhead.
ReflectCAP: Detailed Image Captioning with Reflective Memory
Authors: Kyungmin Min, Minbeom Kim, Kang-il Lee, Seunghyun Yoon, Kyomin Jung
First: 2026-04-14T06:47:47+00:00 · Latest: 2026-04-14T06:47:47+00:00
Abstract
Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes -- what to avoid and what to attend to -- yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21--36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.
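Title / Abstract (translated from Chinese)
Title: ReflectCAP: Detailed Image Captioning with Reflective Memory
Detailed image captioning requires both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve both at once. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), in which a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, and distills these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along two axes, what to avoid and what to attend to, producing detailed captions that improve factuality and coverage together. Applied to 8 LVLMs spanning the GPT-4.1 family, the Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the factuality-coverage trade-off and delivers substantial gains on CapArena-Auto, where generated captions are compared head-to-head against strong reference models. In addition, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21-36% greater overhead. This makes high-quality detailed captioning viable under real-world cost and latency constraints.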
Summary / 总结
ReflectCAP addresses the challenge of detailed image captioning by using a multi-agent pipeline to analyze what large vision-language models consistently hallucinate and systematically overlook, distilling these patterns into Structured Reflection Notes. These notes guide captioning to improve both factuality and coverage. ReflectCAP reaches the Pareto frontier of the factuality-coverage trade-off, delivers substantial gains on CapArena-Auto, and offers a better trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines.