arXiv Paper Digest

2026-01-29 03:44
Snapshot: 20260129_0344
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Authors: Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi
Venue: NeurIPS 2025
First: 2025-03-13T18:59:12+00:00 · Latest: 2026-01-27T18:10:17+00:00
Comments: NeurIPS 2025
Abstract
Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer. Unlike conventional VLMs that passively output text, OS agents autonomously perform computer-based tasks in response to a single user prompt. OS agents do so by capturing, parsing, and analysing screenshots and executing low-level actions via application programming interfaces (APIs), such as mouse clicks and keyboard inputs. This direct interaction with the OS significantly raises the stakes, as failures or manipulations can have immediate and tangible consequences. In this work, we uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs), adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs. For instance, a MIP can be embedded in a desktop wallpaper or shared on social media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs generalise across user prompts and screen configurations, and that they can hijack multiple OS agents even during the execution of benign instructions. These findings expose critical security vulnerabilities in OS agents that have to be carefully addressed before their widespread deployment.
中文标题/摘要
标题:MIP对抗代理:恶意图像补丁劫持多模态OS代理
近年来,操作系统(OS)代理的进步使视觉语言模型(VLMs)能够直接控制用户的计算机。与传统的被动输出文本的VLMs不同,OS代理能够自主执行基于计算机的任务,仅需一个用户提示。OS代理通过捕获、解析和分析屏幕截图,并通过应用程序编程接口(APIs)执行低级操作(如鼠标点击和键盘输入)来实现这一目标。这种直接与OS的交互显著提高了风险,因为失败或操纵可能会立即产生实际后果。在本研究中,我们发现了一种针对这些OS代理的新攻击向量:恶意图像补丁(MIPs),这些对抗性扰动的屏幕区域在被OS代理捕获时,会利用特定的APIs诱导其执行有害操作。例如,MIP可以嵌入在桌面上的壁纸中或在社交媒体上分享,以使OS代理泄露敏感用户数据。我们展示了MIPs在用户提示和屏幕配置方面具有泛化能力,并且即使在执行良性指令期间也能劫持多个OS代理。这些发现揭示了OS代理中关键的安全漏洞,这些漏洞在广泛部署之前必须仔细解决。
Summary / 总结
This study investigates a new attack vector called Malicious Image Patches (MIPs) that can hijack OS agents by exploiting specific APIs. The research focuses on vision-language models (VLMs) that control user computers directly. The key finding is that MIPs can be embedded in desktop backgrounds or shared content to cause OS agents to perform harmful actions, such as exfiltrating sensitive data, even when executing benign instructions. This highlights critical security vulnerabilities in OS agents that need to be addressed before their widespread deployment.
研究探讨了一种名为恶意图像补丁(MIPs)的新攻击向量,它通过利用特定的API来劫持OS代理。研究集中在可以直接控制用户计算机的视觉语言模型(VLMs)上。关键发现是,MIPs可以嵌入到桌面背景或共享内容中,导致OS代理执行有害操作,如泄露敏感数据,即使是在执行良性指令时也是如此。这揭示了OS代理中的关键安全漏洞,需要在广泛部署之前加以解决。
Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries
Authors: Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak, Abby Stylianou, Robert Pless
First: 2026-01-24T17:30:23+00:00 · Latest: 2026-01-27T18:04:35+00:00
Abstract
Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives a user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.
中文标题/摘要
标题:能否零样本?预测任意查询的零样本分类性能
像CLIP这样的视觉-语言模型创建了文本和图像对齐的嵌入空间,使得任何人都可以通过简单地命名他们想要区分的类别来构建视觉分类器。然而,一个在某一领域表现良好的模型在另一个领域可能会失败,非专家用户没有直接的方法来评估他们选择的VLM是否适用于他们的问题。我们在此前仅使用文本比较的工作基础上,评估模型在给定自然语言任务中的表现,并探索生成与该任务相关的合成图像来评估和改进零样本准确性的预测方法。我们展示了生成的图像相对于基线文本仅比较分数显著提高了这些预测的质量。此外,它还为用户提供反馈,说明了用于评估的图像类型。在标准CLIP基准数据集上的实验表明,基于图像的方法帮助用户在没有任何标注示例的情况下预测VLM是否适用于他们的应用。
Summary / 总结
The research aims to predict zero-shot classification performance for arbitrary queries with vision-language models like CLIP. The method builds on text-only comparisons and additionally generates synthetic images relevant to the task to evaluate and refine the predicted zero-shot accuracy. Adding generated imagery to the baseline text-only scores substantially improves prediction quality and gives users feedback on the kinds of images used for the assessment. Experiments on standard CLIP benchmark datasets show that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.
研究旨在使用如CLIP的Vision-Language模型预测任意查询的零样本分类性能。研究基于文本比较的方法,并引入生成与任务相关的合成图像来提高预测准确性。实验表明,在标准CLIP基准数据集上,结合生成的图像可以显著提高零样本准确性的预测质量,并为用户提供有关用于评估的图像类型的反馈。这种方法使非专家用户能够在无需标注样本的情况下更好地预测VLM在特定应用中的有效性。
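The idea of scoring a zero-shot classifier on generated images, as described in the abstract above, can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the three-dimensional vectors stand in for real CLIP text and image embeddings, and the names `zero_shot_predict` and `estimated_accuracy` are hypothetical. Each synthetic image is assumed to carry the label of the class prompt it was generated from.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_predict(image_emb, class_embs):
    """Pick the class whose text embedding is closest to the image embedding."""
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))

def estimated_accuracy(synthetic, class_embs):
    """Fraction of synthetic images classified as the class they were generated for."""
    hits = sum(zero_shot_predict(emb, class_embs) == label for label, emb in synthetic)
    return hits / len(synthetic)

# Toy stand-ins for text embeddings of the class names (assumption, not CLIP output).
class_embs = {"cat": [1.0, 0.1, 0.0], "dog": [0.1, 1.0, 0.0]}

# Toy stand-ins for embeddings of generated images, labeled by their source prompt.
synthetic = [
    ("cat", [0.9, 0.2, 0.1]),
    ("cat", [0.4, 0.6, 0.0]),  # an ambiguous image that lands nearer to "dog"
    ("dog", [0.2, 0.8, 0.1]),
    ("dog", [0.0, 1.0, 0.3]),
]

score = estimated_accuracy(synthetic, class_embs)
print(score)
```

The misclassified synthetic images are also the kind of feedback the abstract mentions: they show a user which images drove the estimate down.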
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs
Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti
First: 2025-04-24T17:39:25+00:00 · Latest: 2026-01-27T17:59:04+00:00
Abstract
Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the largest-scale empirical analysis to date of training-free sparse attention, evaluating six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 (i.e., $1/20$ attention budget) on nine diverse tasks. We first organise the rapidly evolving landscape of sparse attention methods into a taxonomy along four design axes. Our analysis then yields actionable insights: 1) sparse attention is effective -- larger sparse models outperform smaller dense ones at equivalent cost, improving the Pareto frontier; 2) due to computational constraints, token-to-page importance estimation is unfeasible during prefilling, where the choice of an alternative solution (global-to-token or block-to-block) depends on the task, but is possible during decoding, enabling better generalisation and tolerance to higher sparsity; 3) longer sequences tolerate higher sparsity, suggesting that fixed-budget methods in production are suboptimal. Together, these findings provide practical guidance for deploying sparse attention and methodological recommendations for future evaluations. Our code is available at https://github.com/PiotrNawrot/sparse-frontier.
中文标题/摘要
标题:稀疏前沿:Transformer大模型中稀疏注意机制的效率-准确度权衡
稀疏注意为扩展Transformer大模型的长上下文能力提供了有希望的策略,但由于缺乏全面评估,其效率-准确度权衡尚不明确。我们通过迄今为止最大规模的经验分析填补了这一空白,评估了六种方法在多个模型家族和规模、最多128K个标记的序列以及高达0.95(即1/20的注意预算)的稀疏水平上的表现,共涉及九个不同的任务。我们首先按照四个设计轴将快速发展的稀疏注意方法分类。我们的分析提供了可操作的见解:1) 稀疏注意是有效的——较大的稀疏模型在等效成本下优于较小的密集模型,改善了帕累托前沿;2) 由于计算限制,在预填充期间无法估计标记到页面的重要性,任务的不同决定了替代方案(全局到标记或块到块)的选择,但在解码期间是可行的,这有助于更好的泛化和对更高稀疏度的容忍;3) 较长的序列可以容忍更高的稀疏度,表明固定预算的方法在生产中是次优的。这些发现共同提供了部署稀疏注意的实际指导,并为未来的评估提供了方法论建议。我们的代码可在https://github.com/PiotrNawrot/sparse-frontier/ 获取。
Summary / 总结
This study investigates the efficiency-accuracy trade-offs of training-free sparse attention in Transformer LLMs through an empirical analysis of six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95, on nine tasks. It finds that larger sparse models outperform smaller dense models at equivalent cost. Token-to-page importance estimation is infeasible during prefilling, where the choice of alternative (global-to-token or block-to-block) depends on the task, but it is possible during decoding, enabling better generalization and tolerance to higher sparsity. Longer sequences tolerate higher sparsity, suggesting that fixed-budget methods in production are suboptimal. These findings offer practical guidance for deploying sparse attention and methodological recommendations for future evaluations.
研究通过在不同模型大小和高达0.95的稀疏度水平上评估六种方法,并在九种不同的任务上进行测试,探讨了Transformer LLM中稀疏注意力的效率-准确度权衡。研究发现,较大的稀疏模型在同等成本下优于较小的密集模型,并且在预填充期间无法进行令牌到页面的重要性估计,但在解码期间可以实现,这增强了泛化能力和对更高稀疏度的容忍度。较长的序列可以容忍更高的稀疏度,表明固定预算的方法在生产环境中可能不是最优的。
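The relationship between a sparsity level and the attention budget quoted in the abstract (0.95 sparsity corresponds to a 1/20 budget) can be made concrete with a small sketch. The code below is a generic top-k selection for a single query, not one of the six methods the paper evaluates; `attention_budget` and `topk_sparse_attention` are hypothetical names, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math

def attention_budget(seq_len, sparsity):
    """Number of key positions a query may attend to at the given sparsity."""
    return max(1, round(seq_len * (1.0 - sparsity)))

def topk_sparse_attention(query, keys, values, budget):
    """Softmax attention restricted to the `budget` highest-scoring keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:budget]
    peak = max(scores[i] for i in keep)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in keep]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) / total
            for d in range(dim)]

# Assuming 128K means 128 * 1024 tokens, a 0.95 sparsity keeps about 1/20 of them.
print(attention_budget(128 * 1024, 0.95))

# With a budget of one, only the best-matching key contributes to the output.
out = topk_sparse_attention([1.0, 0.0],
                            [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
                            [[1.0], [2.0], [3.0]],
                            budget=1)
print(out)
```

Because the budget here is a fixed fraction of the sequence length, it grows with context; the finding that longer sequences tolerate higher sparsity suggests that fraction could itself shrink as sequences get longer.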
EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning
Authors: Binzhu Xie, Shi Qiu, Sicheng Zhang, Yinqiao Wang, Hao Xu, Muzammal Naseer, Chi-Wing Fu, Pheng-Ann Heng
Venue: ICLR 2026
First: 2026-01-27T17:58:12+00:00 · Latest: 2026-01-27T17:58:12+00:00
Comments: Accepted in ICLR 2026, Codebase: https://github.com/Nicous20/EgoHandICL
Abstract
Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts. Code and data: https://github.com/Nicous20/EgoHandICL
中文标题/摘要
标题:EgoHandICL:基于上下文学习的自视点三维手部重建
自视点视角下的稳健三维手部重建具有挑战性,由于深度模糊、自遮挡以及复杂的手部-物体交互。先前的方法通过扩大训练数据或添加辅助提示来缓解这些问题,但它们在未见过的场景中往往表现不佳。我们提出了EgoHandICL,这是首个用于三维手部重建的上下文学习(ICL)框架,能够提高语义对齐、视觉一致性,并在具有挑战性的自视点条件下增强鲁棒性。EgoHandICL引入了由视觉语言模型(VLM)引导的补充示例检索、针对多模态上下文的ICL定制分词器以及基于掩码自编码器(MAE)的架构,该架构通过手部引导的几何和感知目标进行训练。在ARCTIC和EgoExo4D上的实验显示,EgoHandICL在最先进的方法上具有持续的改进。我们还展示了其实用场景下的泛化能力,并通过使用重建的手部作为视觉提示来改进EgoVLM对手部-物体交互的推理。
Summary / 总结
EgoHandICL addresses the challenges of 3D hand reconstruction in egocentric vision by introducing an in-context learning framework that enhances semantic alignment, visual consistency, and robustness. It uses complementary exemplar retrieval guided by vision-language models, a tailored tokenizer for multimodal context, and a masked autoencoder architecture. Experiments on ARCTIC and EgoExo4D show consistent improvements over existing methods. The framework also demonstrates real-world generalization and improves hand-object interaction reasoning in EgoVLM.
EgoHandICL通过引入一种上下文学习框架来解决第一人称视角下的3D手部重建挑战,该框架增强了语义对齐、视觉一致性以及在复杂条件下的鲁棒性。它使用由视觉语言模型指导的互补示例检索、针对多模态上下文的定制化分词器以及基于掩码自编码器的架构。在ARCTIC和EgoExo4D上的实验显示,该方法相比现有方法取得了一致的改进。此外,该框架还展示了在现实世界中的泛化能力,并通过将重建的手部用作视觉提示提高了EgoVLM对手物交互的推理能力。
When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Authors: Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
First: 2026-01-27T17:35:05+00:00 · Latest: 2026-01-27T17:35:05+00:00
Comments: 27 pages, 15 figures
Abstract
Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
中文标题/摘要
标题:当迭代RAG超越理想证据时:科学多跳问答中的诊断研究
检索增强生成(RAG)将大型语言模型(LLMs)扩展到参数化知识之外,但尚不清楚何时迭代检索-推理循环在意义上优于静态RAG,特别是在具有多跳推理、稀疏领域知识和异构证据的科学领域。我们提供了第一个受控的、机制层面的诊断研究,探讨同步迭代检索和推理是否能超越理想化的静态上限(黄金上下文)RAG。我们以三个范式对十一个最先进的LLMs进行了基准测试:(i)无上下文,衡量对参数化记忆的依赖;(ii)黄金上下文,所有先验证据一次性提供;(iii)迭代RAG,一个无需训练的控制器,交替进行检索、假设细化和证据感知停止。使用化学重点的ChemKGMultiHopQA数据集,我们隔离了需要真正检索的问题,并通过检索覆盖率差距、锚点携带丢失、查询质量、组合保真度和控制校准等诊断分析了行为。在所有模型中,迭代RAG始终优于黄金上下文,增幅高达25.6个百分点,尤其是对于非推理微调模型。分阶段检索减少了晚期跳失败,缓解了上下文过载,并允许动态纠正早期假设漂移,但剩余的失败模式包括不完整的跳覆盖、干扰物锁定轨迹、早期停止校准不当以及即使在完美检索的情况下也有较高的组合失败率。总体而言,分阶段检索往往比理想证据的存在更具影响力;我们提供了在专门的科学环境中部署和诊断RAG系统的实用指导,并为更可靠、可控的迭代检索-推理框架奠定了基础。
Summary / 总结
This study investigates when iterative RAG outperforms static RAG in scientific multi-hop question answering, using the ChemKGMultiHopQA dataset. Eleven state-of-the-art LLMs were benchmarked under three regimes: No Context, Gold Context, and Iterative RAG. Iterative RAG consistently outperformed Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduced late-hop failures and enabled dynamic correction of early hypothesis drift, but incomplete hop coverage and high composition failure rates remained challenges.
该研究探讨了在科学领域中迭代检索-推理循环何时能超越静态RAG。使用ChemKGMultiHopQA数据集,它对十一个最先进的LLM在三种模式下进行了基准测试:无上下文、理想证据上下文和迭代RAG。迭代RAG在所有模型中都优于理想证据上下文,增幅最高可达25.6个百分点,尤其对于非推理微调模型。阶段检索减少了晚期跳失败和上下文过载,但仍面临如不完整的跳覆盖和高组合失败率等挑战。
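The Iterative RAG regime described above alternates retrieval, hypothesis refinement, and evidence-aware stopping. A minimal sketch of such a control loop follows. It is an assumption-laden toy rather than the paper's controller: the knowledge base is a plain dictionary with placeholder entities, the relation chain is given up front instead of being generated by an LLM, and `iterative_rag` and `retrieve` are hypothetical names.

```python
def iterative_rag(question_entity, relations, retrieve, max_hops=5):
    """Resolve a multi-hop question one hop at a time.

    Each step retrieves only the fact needed for the current hop, conditioned
    on the current hypothesis, instead of receiving all evidence at once.
    """
    hypothesis = question_entity
    evidence = []
    for relation in relations[:max_hops]:
        fact = retrieve(hypothesis, relation)  # retrieval for the current hop
        if fact is None:
            # Evidence-aware stop: no support for the next hop, so give up
            # rather than composing an answer from incomplete evidence.
            return None, evidence
        evidence.append((hypothesis, relation, fact))
        hypothesis = fact  # hypothesis refinement feeds the next retrieval
    return hypothesis, evidence

# Toy two-hop knowledge base with placeholder entities (purely illustrative).
kb = {
    ("compound_A", "reacts_with"): "compound_B",
    ("compound_B", "produces"): "compound_C",
}

def retrieve(entity, relation):
    """Hypothetical retriever: exact lookup in the toy knowledge base."""
    return kb.get((entity, relation))

answer, trace = iterative_rag("compound_A", ["reacts_with", "produces"], retrieve)
print(answer, len(trace))
```

Staging the lookups this way keeps each step's context small, which is one plausible reading of why the abstract reports reduced context overload and fewer late-hop failures.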
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Authors: Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim
First: 2025-02-20T18:01:41+00:00 · Latest: 2026-01-27T17:16:10+00:00
Comments: In Proceedings of the IJCNLP-AACL 2025
Abstract
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
中文标题/摘要
标题:ReVision:一种用于隐私保护任务导向视觉指令重写的数据集和基线VLM
随着AR、VR和配备强大摄像头的现代智能手机成为人机通信的主要接口,高效的隐私保护多模态交互变得至关重要。现有的强大视觉-语言模型(VLMs)支持多模态交互,通常依赖于基于云的处理,这引发了(1)视觉隐私问题,即传输敏感的视觉数据到服务器,以及(2)其有限的实时、设备端可用性问题。本文探讨了视觉指令重写这一新颖的方法,即将多模态指令转换为纯文本命令,允许轻量级设备端指令重写VLM(参数量250M)与现有对话AI系统的无缝集成,增强视觉数据隐私。为此,我们提供了一个涵盖14个领域的超过39,000个示例的数据集,并开发了一个紧凑的VLM,该模型在图像字幕数据集上进行预训练,并针对指令重写进行了微调。实验结果通过NLG指标(如BLEU、METEOR和ROUGE)以及语义解析分析评估,表明即使是量化版本的模型(存储占用量<500MB)也能实现有效的指令重写,从而实现以隐私为中心的多模态AI应用。
Summary / 总结
This paper addresses the need for efficient and privacy-preserving multimodal interaction by introducing ReVision, a dataset and baseline vision-language model for visual instruction rewriting. The model transforms multimodal instructions into text-only commands, enhancing vision data privacy and on-device usability. Experiments show that even a quantized version of the model (<500MB storage footprint) can rewrite instructions effectively, as measured by NLG metrics and semantic parsing analysis.
该论文通过引入ReVision数据集和视觉指令重写基线模型,解决高效且保护隐私的多模态交互需求。该模型将多模态指令转换为纯文本命令,增强隐私性和设备端使用性。实验结果表明,即使是量化版本的模型也能有效重写指令,并在自然语言生成指标和语义解析分析中表现出色。
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
Authors: Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long, Shifeng Liu, Hongze Shen, Wei Liu, Xiaoyu Tan, Haojia Lin, Yubo Zhu, Qianyu Li, Di Yin, Haoyu Cao, Weibo Gu, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Mingkong Tang, Shuangyin Liu, Lexiang Tang, Haodong Lin, Junru Lu, Jiarui Qin, Lingfeng Qiao, Ruizhi Qiao, Bo Ke, Jianfeng He, Ke Li, Yangning Li, Yunhang Shen, Mengdan Zhang, Peixian Chen, Kun Yin, Bing Liu, Yunfei Wu, Huang Chen, Zhongpeng Cai, Xiaotian Li
First: 2026-01-27T17:01:16+00:00 · Latest: 2026-01-27T17:01:16+00:00
Abstract
Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
中文标题/摘要
标题:Youtu-VL:通过统一的视觉语言监督释放视觉潜力
尽管视觉语言模型(VLMs)取得了显著进展,但当前架构在保留细微视觉信息方面仍存在局限性,导致粗粒度的多模态理解。我们将其归因于现有VLMs中固有的次优训练范式,这种范式表现出以文本为主的优化偏见,将视觉信号仅视为被动的条件输入而非监督目标。为解决这一问题,我们提出了Youtu-VL框架,该框架利用视觉语言统一自回归监督(VLUAS)范式,从根本上将优化目标从“视觉作为输入”转变为“视觉作为目标”。通过直接将视觉标记集成到预测流中,Youtu-VL 对视觉细节和语言内容应用统一的自回归监督。此外,我们还将此范式扩展到视觉中心任务,使标准VLM能够在无需特定任务添加的情况下执行视觉中心任务。广泛的实证评估表明,Youtu-VL 在通用多模态任务和视觉中心任务上均取得了竞争力的表现,为全面通用视觉代理的发展奠定了坚实基础。
Summary / 总结
The research aims to improve the fine-grained visual information retention in Vision-Language Models (VLMs) by addressing the text-dominant optimization bias. Youtu-VL introduces a Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which shifts the optimization objective to treat visual signals as supervisory targets. This method integrates visual tokens into the prediction stream and applies unified autoregressive supervision to both visual and linguistic content. Experimental results show that Youtu-VL performs competitively on both general multimodal tasks and vision-centric tasks, providing a strong foundation for comprehensive generalist visual agents.
研究旨在通过解决文本主导的优化偏差,改善视觉语言模型(VLMs)对细粒度视觉信息的保留。Youtu-VL 引入了视觉语言统一自回归监督(VLUAS)范式,将优化目标从“视觉作为输入”转变为“视觉作为目标”。该方法将视觉标记直接集成到预测流中,并使标准 VLM 能够在无需特定任务修改的情况下执行视觉中心任务。实验结果表明,Youtu-VL 在通用多模态任务和视觉中心任务上均表现出色,提升了 VLM 的整体性能。
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
Authors: Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti
First: 2025-10-22T17:47:12+00:00 · Latest: 2026-01-27T16:39:04+00:00
Abstract
Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to bridge the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.
中文标题/摘要
标题:SCoPE VLM:选择性上下文处理以提高视觉语言模型的文档导航效率
理解长上下文视觉信息仍然是视觉语言模型的基本挑战,尤其是在如GUI控制和网页导航等代理任务中。虽然网页和GUI环境本质上是结构化的文档,但当前的VLMs在训练目标中通常忽略了决策导向的文档理解。现有方法主要通过扩展视觉嵌入来处理长的高分辨率输入,但这些方法内存密集且不适用于本地部署解决方案。为了解决这些问题,我们提出了SCoPE VLM,这是一种文档导航专家,利用新颖的滚动链机制选择性和递归地导航文档,专注于相关段落。我们引入了一种专门的数据生成管道来构建有信息量的滚动链轨迹,并提出了一种定制的强化学习方法——阶段性组相对策略优化,以弥合训练和推理之间的差距。我们的方法显著减少了内存使用,并有效地模拟了人类的阅读行为。据我们所知,SCoPE VLM是第一个明确建模多页文档问答中代理阅读模式的框架,推动了多模态代理的能力。
Summary / 总结
The research aims to improve vision-language models' ability to understand long-context visual information, especially for agentic tasks like GUI control and web navigation. The proposed SCoPE VLM uses a Chain of Scroll mechanism to selectively and recursively navigate documents, focusing on relevant segments, and introduces Episodic Group Relative Policy Optimization, a tailored reinforcement learning method that bridges the gap between training and inference. The method substantially reduces memory usage and models human-like reading behaviors, addressing the memory cost that makes existing approaches impractical for local deployment.
研究旨在提高视觉语言模型在理解长上下文视觉信息方面的能力,特别是对于GUI控制和网页导航等任务。提出的SCoPE VLM使用链式滚动机制选择性地导航文档,专注于相关部分。该方法减少了内存使用,并模拟了人类的阅读行为,使其适用于本地部署的解决方案。主要实验发现是,SCoPE VLM有效地解决了现有方法的内存密集问题,并推动了多模态代理在多页文档问答中的能力。
RvB: Automating AI System Hardening via Iterative Red-Blue Games
Authors: Lige Huang, Zicheng Liu, Jie Zhang, Lewen Yan, Dongrui Liu, Jing Shao
First: 2026-01-27T15:49:58+00:00 · Latest: 2026-01-27T15:49:58+00:00
Abstract
The dual offensive and defensive utility of Large Language Models (LLMs) highlights a critical gap in AI security: the lack of unified frameworks for dynamic, iterative adversarial adaptation hardening. To bridge this gap, we propose the Red Team vs. Blue Team (RvB) framework, formulated as a training-free, sequential, imperfect-information game. In this process, the Red Team exposes vulnerabilities, driving the Blue Team to learn effective solutions without parameter updates. We validate our framework across two challenging domains: dynamic code hardening against CVEs and guardrail optimization against jailbreaks. Our empirical results show that this interaction compels the Blue Team to learn fundamental defensive principles, leading to robust remediations that are not merely overfitted to specific exploits. RvB achieves Defense Success Rates of 90% and 45% across the respective tasks while maintaining near 0% False Positive Rates, significantly surpassing baselines. This work establishes the iterative adversarial interaction framework as a practical paradigm that automates the continuous hardening of AI systems.
中文标题/摘要
标题:RvB:通过迭代红蓝游戏自动化AI系统加固
大型语言模型(LLMs)的双重进攻和防御功能突显了AI安全中的一个关键缺口:缺乏统一的动态迭代对抗适应加固框架。为弥补这一缺口,我们提出了红队 vs 蓝队(RvB)框架,该框架被表述为一个无需训练、顺序进行且信息不完全的游戏。在此过程中,红队揭示漏洞,促使蓝队学习有效的解决方案而不更新参数。我们在两个具有挑战性的领域验证了该框架:针对CVE的动态代码加固和针对越狱的护栏优化。我们的实验证明,这种互动促使蓝队学习到基本的防御原则,从而产生稳健的修复措施,而不仅仅是针对特定攻击的过拟合。RvB在相应任务中的防御成功率分别为90%和45%,同时保持接近0%的误报率,显著超越基线。本研究确立了迭代对抗互动框架作为自动化持续加固AI系统的实用范式。
Summary / 总结
The paper addresses the need for dynamic adversarial adaptation in AI security by proposing the Red Team vs. Blue Team (RvB) framework. This framework is formulated as a training-free, sequential, imperfect-information game where the Red Team identifies vulnerabilities, and the Blue Team learns effective defenses without parameter updates. The RvB framework was validated in two domains: dynamic code hardening against CVEs and guardrail optimization against jailbreaks. The results showed that the RvB framework achieved Defense Success Rates of 90% and 45% respectively, with near 0% False Positive Rates, outperforming baseline methods.
论文提出了一种名为Red Team vs. Blue Team (RvB)的框架,以解决AI安全中的动态对抗适应需求。该框架被表述为一个无需训练、顺序进行且信息不完全的游戏,其中红队发现漏洞,蓝队学习有效的防御措施而不更新参数。RvB框架在两个领域进行了验证:动态代码加固以应对CVEs和护栏优化以应对越狱攻击。实验结果显示,RvB框架在相应任务中的防御成功率分别达到了90%和45%,且误报率接近0%,显著优于基线方法。
KeepLoRA: Continual Learning with Residual Gradient Adaptation
Authors: Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei, Min-Ling Zhang
Venue: ICLR 2026
First: 2026-01-27T14:38:57+00:00 · Latest: 2026-01-27T14:38:57+00:00
Comments: Accepted at ICLR 2026
Abstract
Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents a simple but effective approach called KeepLoRA to effectively balance these objectives. We first analyze the knowledge retention mechanism within the model parameter space and find that general knowledge is mainly encoded in the principal subspace, while task-specific knowledge is encoded in the residual subspace. Motivated by this finding, KeepLoRA learns new tasks by restricting LoRA parameter updates in the residual subspace to prevent interfering with previously learned capabilities. Specifically, we infuse knowledge for a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of pre-trained model and the dominant directions of previous task features. Our theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance. The implementation code is available at https://github.com/MaolinLuo/KeepLoRA.
中文标题/摘要
标题:KeepLoRA:残差梯度适应的持续学习
预训练视觉-语言模型的持续学习需要平衡三个相互竞争的目标:保留预训练知识、保持一系列学习任务的知识、以及保持获取新知识的可塑性。本文提出了一种简单而有效的方法KeepLoRA来有效平衡这些目标。我们首先分析了模型参数空间中的知识保留机制,并发现一般知识主要编码在主子空间中,而任务特定知识编码在残差子空间中。受这一发现的启发,KeepLoRA通过限制LoRA参数更新在残差子空间中,防止干扰之前学习的能力来学习新任务。具体而言,我们通过将新任务的梯度投影到与预训练模型的主子空间和先前任务特征的主要方向正交的子空间中来注入新任务的知识。我们的理论和实证分析证实,KeepLoRA平衡了这三个目标并实现了最先进的性能。代码实现可在https://github.com/MaolinLuo/KeepLoRA获取。
Summary / 总结
The paper introduces KeepLoRA, a method for continual learning in pre-trained vision-language models that balances retaining pre-trained knowledge, preserving knowledge from previous tasks, and maintaining the ability to learn new tasks. By analyzing the model parameter space, the authors find that general knowledge is encoded in the principal subspace and task-specific knowledge in the residual subspace. KeepLoRA updates parameters only in the residual subspace to prevent interference with previously learned capabilities, projecting the gradient of a new task onto a subspace orthogonal to both the principal subspace and previous task features. Experiments show that KeepLoRA achieves state-of-the-art performance in continual learning scenarios.
该论文提出了KeepLoRA方法,用于平衡预训练视觉-语言模型中的保留预训练知识、保持先前任务的知识以及维持学习新知识的能力。通过分析模型的参数空间,作者发现一般知识编码在主子空间中,而任务特定知识编码在残差子空间中。KeepLoRA仅在残差子空间更新参数以防止干扰先前学习的能力,并将新任务梯度投影到与预训练模型主子空间和先前任务特征主导方向正交的子空间中。该方法在理论和实验分析中达到了最先进的性能。
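The projection step described in the abstract, where a new task's gradient is projected onto a subspace orthogonal to both the pre-trained principal subspace and the dominant directions of previous tasks, can be sketched with plain linear algebra. The code below is a generic orthogonal-complement projection on flat vectors, under the assumption that the protected directions are already known; it does not reproduce how KeepLoRA derives those directions or applies the result to LoRA factors, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(directions, tol=1e-10):
    """Gram-Schmidt over the protected directions; dependent vectors are dropped."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:
            basis.append([x / norm for x in r])
    return basis

def project_to_residual(grad, protected):
    """Remove from `grad` every component lying in span(protected)."""
    out = list(grad)
    for b in orthonormal_basis(protected):
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out

# Toy setup (assumed values): one principal direction of the pre-trained model
# and one dominant direction from a previously learned task.
principal = [[1.0, 0.0, 0.0, 0.0]]
previous_task = [[1.0, 1.0, 0.0, 0.0]]

grad = [3.0, 2.0, 5.0, -1.0]
residual_grad = project_to_residual(grad, principal + previous_task)
print(residual_grad)
```

The projected gradient has zero inner product with every protected direction, so a step along it leaves those directions untouched to first order while still using the remaining coordinates to learn the new task.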
Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models
Authors: Shuang Liang, Zhihao Xu, Jiaqi Weng, Jialing Tao, Hui Xue, Xiting Wang
First: 2025-08-08T16:13:28+00:00 · Latest: 2026-01-27T13:58:13+00:00
Comments: 12 pages; Previously this version appeared as arXiv:2510.15430 which was submitted as a new work by accident
Abstract
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. To mitigate these risks, existing detection methods are essential, yet they face two major challenges: generalization and accuracy. While learning-based methods trained on specific attacks fail to generalize to unseen attacks, learning-free methods based on hand-crafted heuristics suffer from limited accuracy and reduced efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that eliminates the need for any attack data or hand-crafted heuristics. LoD operates by first extracting layer-wise safety representations directly from the model's internal activations using Multi-modal Safety Concept Activation Vectors classifiers, and then converting the high-dimensional representations into a one-dimensional anomaly score for detection via a Safety Pattern Auto-Encoder. Extensive experiments demonstrate that LoD consistently achieves state-of-the-art detection performance (AUROC) across diverse unseen jailbreak attacks on multiple LVLMs, while also significantly improving efficiency. Code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
中文标题/摘要
标题:在大型视觉-语言模型中学习检测未见过的越狱攻击
尽管进行了广泛的对齐努力,大型视觉-语言模型(LVLMs)仍然容易受到越狱攻击的影响。为了减轻这些风险,现有的检测方法至关重要,但它们面临两个主要挑战:泛化能力和准确性。基于特定攻击训练的学习方法无法泛化到未见过的攻击,而基于手工构建启发式的无学习方法则在准确性和效率方面受到限制。为了解决这些限制,我们提出了Learning to Detect(LoD),这是一种可学习的框架,无需任何攻击数据或手工构建的启发式方法。LoD 通过首先使用多模态安全性概念激活向量分类器从模型的内部激活中提取逐层的安全表示,然后通过安全性模式自编码器将高维表示转换为一维异常分数来进行检测。广泛的实验表明,LoD 在多个 LVLMs 上对多种未见过的越狱攻击的一致检测性能(AUROC)达到了最先进的水平,同时显著提高了效率。代码可在 https://anonymous.4open.science/r/Learning-to-Detect-51CB 获取。
Summary / 总结
The paper addresses the vulnerability of Large Vision-Language Models (LVLMs) to jailbreak attacks despite extensive alignment efforts. It proposes Learning to Detect (LoD), a learnable framework that extracts safety representations from the model's internal activations and converts them into anomaly scores for detection. Experiments show that LoD outperforms existing methods in terms of detection accuracy and efficiency across various unseen jailbreak attacks on multiple LVLMs.
本文针对大型视觉-语言模型(LVLMs)在经过广泛对齐后仍易受越狱攻击的问题,提出了可学习框架Learning to Detect (LoD),该框架从模型内部激活中提取安全表示并将其转换为异常分数进行检测。实验表明,LoD在多种LVLM上对各种未见过的越狱攻击具有更高的检测准确性和效率。
ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving
Authors: Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu, Daxin Tian, Bingzhao Gao, Jianqiang Wang, Hong Chen
First: 2026-01-27T13:17:50+00:00 · Latest: 2026-01-27T13:17:50+00:00
Abstract
In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.
中文标题/摘要
标题:ScenePilot-Bench:自动驾驶场景中视觉语言模型评估的大规模数据集和基准
在本文中,我们介绍了ScenePilot-Bench,这是一个大规模第一人称驾驶基准,旨在评估视觉语言模型(VLMs)在自动驾驶场景中的性能。ScenePilot-Bench 构建于 ScenePilot-4K 之上,后者是一个包含3,847小时驾驶视频的多样化数据集,标注了场景描述、风险评估、关键参与者识别、自车轨迹和相机参数等多粒度信息。该基准提供四轴评估套件,从场景理解、空间感知、运动规划和GPT-Score四个方面评估VLM的能力,并配有安全感知指标和跨区域泛化设置。我们在ScenePilot-Bench上对代表性VLM进行了基准测试,提供了实证分析,明确了当前的性能边界并指出了面向驾驶的推理方面的差距。ScenePilot-Bench 提供了一个全面的框架,用于评估和推进安全关键的自动驾驶场景中的VLM。
Summary / 总结
ScenePilot-Bench is a large-scale first-person driving benchmark for evaluating vision-language models (VLMs) in autonomous driving scenarios. It is built on ScenePilot-4K, a diverse dataset of 3,847 hours of driving videos with multi-granularity annotations. Its four-axis evaluation suite covers scene understanding, spatial perception, motion planning, and GPT-Score, complemented by safety-aware metrics and cross-region generalization settings. Benchmarking representative VLMs clarifies current performance boundaries and identifies gaps in driving-oriented reasoning.
ScenePilot-Bench 是一个大规模的第一人称驾驶基准,旨在评估视觉语言模型在自动驾驶场景中的性能。它基于包含 3,847 小时驾驶视频的 ScenePilot-4K 数据集,通过四轴评估套件评估模型在场景理解、空间感知、运动规划和 GPT-Score 方面的表现,并配有安全感知指标和跨区域泛化设置。对代表性 VLM 的实证分析明确了当前的性能边界,并指出了驾驶导向推理中的不足。
High-Layer Attention Pruning with Rescaling
Authors: Songtao Liu, Peng Liu
First: 2025-07-02T17:15:05+00:00 · Latest: 2026-01-27T12:54:05+00:00
Comments: TMLR 2026
Abstract
Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that indiscriminately removes some attention heads across all pruning layers, without considering their positions within the network architecture. In this work, we propose a novel pruning algorithm that strategically prunes attention heads in the model's higher layers. Since the removal of attention heads can alter the magnitude of token representations, we introduce an adaptive rescaling parameter that calibrates the representation scale post-pruning to counteract this effect. We conduct comprehensive experiments on a wide range of LLMs, including LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B. Our evaluation includes both generation and discriminative tasks across 27 datasets. The results consistently demonstrate that our method outperforms existing structured pruning methods. This improvement is particularly notable in generation tasks, where our approach significantly outperforms existing baselines. Code is available at https://github.com/SongtaoLiu0823/HARP.
中文标题/摘要
标题:高层注意力剪枝与重新缩放
剪枝是压缩大型语言模型(LLMs)的一种非常有效的方法,显著减少了推理延迟。然而,传统的无训练结构化剪枝方法通常使用启发式指标,不加区分地在所有剪枝层中移除一些注意力头,而不考虑它们在网络架构中的位置。在本工作中,我们提出了一种新的剪枝算法,该算法有策略地剪除模型高层的注意力头。由于移除注意力头会改变 token 表示的幅度,我们引入了一个自适应重新缩放参数,在剪枝后校准表示尺度以抵消这种影响。我们在包括LLaMA3.1-8B、Mistral-7B-v0.3、Qwen2-7B和Gemma2-9B在内的多种LLMs上进行了全面实验。评估包括27个数据集上的生成和判别任务。结果一致表明,我们的方法优于现有结构化剪枝方法。特别是在生成任务中,我们的方法显著优于现有基线。代码可在 https://github.com/SongtaoLiu0823/HARP 获取。
Summary / 总结
This paper introduces High-Layer Attention Pruning with Rescaling (HARP), a training-free structured pruning method for compressing large language models. Unlike conventional methods that remove attention heads indiscriminately across all pruning layers, HARP prunes attention heads in the model's higher layers and introduces an adaptive rescaling parameter that calibrates the scale of token representations after pruning. Experiments on LLaMA3.1-8B, Mistral-7B-v0.3, Qwen2-7B, and Gemma2-9B across 27 datasets show that HARP consistently outperforms existing structured pruning methods, most notably on generation tasks.
本文提出了一种新的剪枝算法:高层注意力剪枝与重新缩放(HARP),用于压缩大型语言模型(LLMs)。与传统方法在所有剪枝层中不加区分地移除注意力头不同,HARP 有策略地剪除模型高层的注意力头,并引入一个自适应重新缩放参数来校准剪枝后的 token 表示尺度。在涵盖 27 个数据集的实验中,HARP 在多种 LLM 上的表现一致优于现有结构化剪枝方法,在生成任务中优势尤为明显。代码可在 https://github.com/SongtaoLiu0823/HARP 获取。
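The rescaling idea in this entry can be illustrated with a small sketch. The abstract only says that an adaptive parameter calibrates the representation scale after pruning; it does not say how the parameter is computed. The rule below (matching the mean L2 norm of dense and pruned outputs on a calibration batch) is an assumption made for illustration, and every function name is hypothetical rather than taken from the HARP repository.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token representation (a list of floats)."""
    return math.sqrt(sum(x * x for x in vec))


def mean_norm(vectors):
    """Average L2 norm over a batch of token representations."""
    return sum(l2_norm(v) for v in vectors) / len(vectors)


def rescale_factor(dense_outputs, pruned_outputs):
    """Assumed calibration rule: ratio of mean norms before and after pruning."""
    pruned = mean_norm(pruned_outputs)
    if pruned == 0.0:
        return 1.0  # nothing to calibrate against; leave the scale unchanged
    return mean_norm(dense_outputs) / pruned


def apply_rescale(vectors, factor):
    """Multiply every pruned representation by the calibration factor."""
    return [[factor * x for x in v] for v in vectors]
```

A single scalar per pruned layer keeps the correction cheap at inference time; whether the paper uses one scalar, a per-layer value, or something learned is not stated in the abstract.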
Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
Authors: Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang
Venue: ICLR 2026
First: 2025-09-30T04:57:26+00:00 · Latest: 2026-01-27T11:48:21+00:00
Comments: Accepted to ICLR 2026
Abstract
Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets.
中文标题/摘要
标题:自我演化的视觉-语言模型在投票和排名基础上的图像质量评估
在后训练阶段提升视觉-语言模型(VLMs)通常依赖于监督微调或强化学习,这些方法需要昂贵的人工标注数据。虽然自监督技术已被证明对增强推理能力有效,但它们在图像质量评估(IQA)等感知领域中的应用仍鲜有探索。在本文中,我们提出了EvoQuality,这是一种新颖的框架,使VLM能够自主完善其质量感知能力,无需任何真实标签。EvoQuality使自洽性原则适配IQA基于排序的特性。它对VLM自身的输出进行成对多数投票来生成伪标签,从而就相对质量达成共识。这些伪排名随后被转化为保真度奖励,通过组相对策略优化(GRPO)引导模型的迭代进化。通过反复利用自身的预测,EvoQuality逐步完善了VLM的感知能力。广泛实验表明,EvoQuality使基础VLM在多个IQA基准上的零样本PLCC提升了31.8%。值得注意的是,尽管完全自监督,EvoQuality的表现与最先进的基于VLM的监督IQA模型相当甚至更优,在7个IQA基准中的5个上超越了这些模型。此外,该框架展示了显著的灵活性,可以与预训练的IQA模型堆叠,以增强对未见数据集的泛化能力。
Summary / 总结
This work introduces EvoQuality, a self-evolving framework that lets a vision-language model (VLM) refine its image quality assessment (IQA) capability without ground-truth labels. Pairwise majority voting over the VLM's own outputs yields pseudo-rankings, which are formulated into a fidelity reward that guides iterative optimization through group relative policy optimization (GRPO). Experiments show that EvoQuality improves the base VLM's zero-shot PLCC by 31.8% across diverse IQA benchmarks, is competitive with state-of-the-art supervised VLM-based IQA models, and outperforms them on 5 of 7 benchmarks.
该研究提出了EvoQuality框架,使视觉-语言模型(VLMs)能够在无需真实标签的情况下自主提升其图像质量评估(IQA)能力。EvoQuality对VLM自身输出进行成对多数投票以生成伪排名,并将其转化为保真度奖励,通过组相对策略优化(GRPO)引导模型的迭代优化。该方法使基础VLM在各种IQA基准上的零样本PLCC提升了31.8%,并在七个IQA基准中的五个上超越了最先进的基于VLM的监督IQA模型。
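The pseudo-labelling step described in this entry, pairwise majority voting over the model's own outputs, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the vote encoding ("A" or "B" for the image judged better), the tie handling, and the function names are all assumptions, and the fidelity reward and GRPO update are left out because the abstract does not specify them.

```python
from collections import Counter


def majority_vote(votes):
    """Return 'A' or 'B' for the majority preference, or None on a tie."""
    counts = Counter(votes)
    a, b = counts.get("A", 0), counts.get("B", 0)
    if a == b:
        return None  # no consensus; assumed to be skipped as a pseudo-label
    return "A" if a > b else "B"


def pseudo_labels(pair_votes):
    """Map each image pair to its consensus winner, dropping tied pairs.

    pair_votes: dict from (image_a, image_b) to a list of sampled 'A'/'B' votes.
    """
    labels = {}
    for pair, votes in pair_votes.items():
        winner = majority_vote(votes)
        if winner is not None:
            labels[pair] = winner
    return labels
```

In use, each list of votes would come from sampling the VLM several times on the same image pair; the consensus labels then stand in for ground-truth rankings.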
Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation
Authors: Yizhao Han, Tianxing Shi, Zhao Wang, Zifan Xu, Zhiyuan Pu, Mingxiao Li, Qian Zhang, Wei Yin, Xiao-Xiao Long
First: 2026-01-27T11:19:53+00:00 · Latest: 2026-01-27T11:19:53+00:00
Abstract
Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token's predicted distribution. ENkG uses adaptive token candidate sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; for high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
中文标题/摘要
标题:熵引导的k-卫采样用于长时序自回归视频生成
自回归(AR)架构在LLMs中取得了显著的成功,激发了视频生成的探索。在LLMs中,top-p/top-k采样策略表现非常出色:语言标记具有高语义密度和低冗余性,因此固定大小的标记候选者已经能够在语义准确性和生成多样性之间取得平衡。相比之下,视频标记具有低语义密度和高时空冗余性。这种不匹配使得静态的top-k/top-p策略对视频解码器无效:它们要么在低不确定性区域(静态背景)引入不必要的随机性,要么在高不确定性区域(前景对象)陷入早期错误。随着更多帧的生成,预测误差会累积并最终严重降低长时序质量。为了解决这个问题,我们提出了熵引导的k-卫采样(ENkG),这是一种简单而有效的策略,能够根据每个标记预测分布的熵来适应采样。ENkG使用自适应的标记候选者大小:在低熵区域,它使用较少的候选者来抑制冗余噪声并保持结构完整性;在高熵区域,它使用更多的候选者来减轻误差累积。ENkG是模型无关的,无需训练,并且几乎不增加开销。实验表明,与静态的top-k/top-p策略相比,ENkG在感知质量和结构稳定性方面表现出一致的改进。
Summary / 总结
The paper addresses the challenge of generating high-quality long-horizon videos using autoregressive models. It proposes Entropy-Guided k-Guard (ENkG) sampling, which adapts to the dispersion of predicted token distributions by using fewer candidates in low-entropy regions and more in high-entropy regions. This method improves perceptual quality and structural stability compared to static top-k/top-p strategies.
论文提出了一种熵引导的k-卫采样(ENkG)方法,以解决使用自回归模型生成高质量长时序视频的挑战。该方法根据每个标记预测分布的熵自适应调整候选数量:低熵区域使用较少的候选标记以抑制噪声,高熵区域使用较多的候选标记以减轻误差累积。该方法无需训练且与模型无关,实验结果表明其在感知质量和结构稳定性方面优于静态的top-k/top-p策略。
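The entropy-adaptive candidate size described in this entry can be sketched as follows. The abstract states only that low-entropy tokens get fewer candidates and high-entropy tokens get more; the linear mapping from normalized entropy to k, the bounds `k_min` and `k_max`, and the function names are assumptions made for illustration, not the paper's actual rule.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_k(probs, k_min=1, k_max=8):
    """Assumed rule: map normalized entropy in [0, 1] linearly onto [k_min, k_max]."""
    n = len(probs)
    if n < 2:
        return 1
    h_norm = entropy(probs) / math.log(n)  # 0 = one-hot, 1 = uniform
    h_norm = min(max(h_norm, 0.0), 1.0)
    k = k_min + round(h_norm * (k_max - k_min))
    return max(1, min(k, n))


def sample_token(probs, rng, k_min=1, k_max=8):
    """Top-k sampling where k is chosen per token from its entropy."""
    k = adaptive_k(probs, k_min, k_max)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # rng.choices normalizes relative weights
    return rng.choices(top, weights=weights, k=1)[0]
```

A near-deterministic token (for example a static background patch) collapses to a small k, while an uncertain token keeps a wider candidate set, which is the behaviour the abstract attributes to ENkG.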
Fixed Aggregation Features Can Rival GNNs
Authors: Celia Rubio-Madrigal, Rebekka Burkholz
First: 2026-01-27T10:36:31+00:00 · Latest: 2026-01-27T10:36:31+00:00
Abstract
Graph neural networks (GNNs) are widely believed to excel at node representation learning through trainable neighborhood aggregations. We challenge this view by introducing Fixed Aggregation Features (FAFs), a training-free approach that transforms graph learning tasks into tabular problems. This simple shift enables the use of well-established tabular methods, offering strong interpretability and the flexibility to deploy diverse classifiers. Across 14 benchmarks, well-tuned multilayer perceptrons trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks -- often using only mean aggregation. The only exceptions are the Roman Empire and Minesweeper datasets, which typically require unusually deep GNNs. To explain the theoretical possibility of non-trainable aggregations, we connect our findings to Kolmogorov-Arnold representations and discuss when mean aggregation can be sufficient. In conclusion, our results call for (i) richer benchmarks benefiting from learning diverse neighborhood aggregations, (ii) strong tabular baselines as standard, and (iii) employing and advancing tabular models for graph data to gain new insights into related tasks.
中文标题/摘要
标题:固定聚合特征可与GNNs匹敌
图神经网络(GNNs)普遍被认为通过可训练的邻域聚合在节点表示学习方面表现出色。我们对这一观点提出质疑,引入了固定聚合特征(FAFs),这是一种无需训练的方法,可将图学习任务转化为表格问题。这一简单的转变使得可以使用成熟的表格方法,提供强大的可解释性和部署多种分类器的灵活性。在14个基准测试中,基于FAFs训练并经过充分调优的多层感知机在12个任务上与最先进的GNNs和图Transformer相当或更优,而且通常仅使用均值聚合。唯一的例外是Roman Empire和Minesweeper数据集,它们通常需要异常深的GNNs。为了解释不可训练聚合在理论上的可行性,我们将研究结果与Kolmogorov-Arnold表示联系起来,并讨论均值聚合在何种情况下已经足够。总之,我们的结果呼吁:(i)构建更丰富的基准,使学习多样化的邻域聚合能够真正带来收益;(ii)将强大的表格基线作为标准;(iii)在图数据上使用并发展表格模型,以获得对相关任务的新见解。
Summary / 总结
The paper challenges the belief that GNNs excel at node representation learning because of trainable neighborhood aggregations. It introduces Fixed Aggregation Features (FAFs), a training-free method that turns graph learning tasks into tabular problems. Across 14 benchmarks, well-tuned multilayer perceptrons trained on FAFs rival or outperform state-of-the-art GNNs and graph transformers on 12 tasks, often using only mean aggregation. The only exceptions are the Roman Empire and Minesweeper datasets, which typically require unusually deep GNNs. The authors also connect their findings to Kolmogorov-Arnold representations and discuss when mean aggregation can be sufficient.
论文提出了固定聚合特征(FAFs),这是一种无需训练的方法,可将图学习转化为表格问题,从而对GNNs依靠可训练邻域聚合在节点表示学习中占优的观点提出质疑。在14个基准测试中,基于FAFs训练的多层感知机在12个任务上与最先进的GNNs和图Transformer相当或更优,通常仅使用均值聚合。唯一的例外是Roman Empire和Minesweeper数据集,它们通常需要异常深的GNNs。作者还将研究结果与Kolmogorov-Arnold表示联系起来,讨论均值聚合何时足够,并呼吁构建更丰富的基准测试、将强大的表格基线作为标准。
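The "graph to table" step in this entry can be sketched with plain mean aggregation. This is an illustration under stated assumptions: the adjacency-list input, the choice to concatenate raw features with repeated mean aggregates, the `hops` parameter, and the function names are hypothetical, and the paper's exact feature construction may differ.

```python
def mean_aggregate(adj, feats):
    """One round of mean aggregation over each node's neighbours."""
    out = []
    for node, neighbours in enumerate(adj):
        dim = len(feats[node])
        if not neighbours:
            out.append([0.0] * dim)  # isolated node: assumed zero vector
            continue
        out.append(
            [sum(feats[j][d] for j in neighbours) / len(neighbours) for d in range(dim)]
        )
    return out


def fixed_aggregation_features(adj, feats, hops=2):
    """Concatenate raw features with 1..hops rounds of mean aggregation.

    No parameters are trained; the result is one tabular row per node.
    """
    rows = [list(f) for f in feats]
    current = feats
    for _ in range(hops):
        current = mean_aggregate(adj, current)
        for row, agg in zip(rows, current):
            row.extend(agg)
    return rows
```

The resulting rows can be passed to any tabular classifier, such as a multilayer perceptron, which is the setting the abstract describes.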
From Internal Diagnosis to External Auditing: A VLM-Driven Paradigm for Online Test-Time Backdoor Defense
Authors: Binyan Xu, Fan Yang, Xilin Dai, Di Tang, Kehuan Zhang
First: 2026-01-27T10:34:06+00:00 · Latest: 2026-01-27T10:34:06+00:00
Comments: 19 pages, 10 figures, 12 tables
Abstract
Deep Neural Networks remain inherently vulnerable to backdoor attacks. Traditional test-time defenses largely operate under the paradigm of internal diagnosis methods like model repairing or input robustness, yet these approaches are often fragile under advanced attacks as they remain entangled with the victim model's corrupted parameters. We propose a paradigm shift from Internal Diagnosis to External Semantic Auditing, arguing that effective defense requires decoupling safety from the victim model via an independent, semantically grounded auditor. To this end, we present a framework harnessing Universal Vision-Language Models (VLMs) as evolving semantic gatekeepers. We introduce PRISM (Prototype Refinement & Inspection via Statistical Monitoring), which overcomes the domain gap of general VLMs through two key mechanisms: a Hybrid VLM Teacher that dynamically refines visual prototypes online, and an Adaptive Router powered by statistical margin monitoring to calibrate gating thresholds in real-time. Extensive evaluation across 17 datasets and 11 attack types demonstrates that PRISM achieves state-of-the-art performance, suppressing Attack Success Rate to <1% on CIFAR-10 while improving clean accuracy, establishing a new standard for model-agnostic, externalized security.
中文标题/摘要
标题:从内部诊断到外部审计:基于VLM的在线测试时后门防御范式
深度神经网络仍然固有地容易受到后门攻击的影响。传统的测试时防御大多基于内部诊断方法,如模型修复或输入鲁棒性,但这些方法在面对高级攻击时往往脆弱,因为它们仍然与受害模型的受损参数纠缠在一起。我们提出了一种从内部诊断到外部语义审计的范式转变,认为有效的防御需要通过独立的、有语义依据的审计器将安全性与受害模型解耦。为此,我们提出了一种框架,利用通用视觉语言模型(VLMs)作为不断进化的语义守门人。我们引入了PRISM(Prototype Refinement & Inspection via Statistical Monitoring,基于统计监控的原型精炼与检验),通过两种关键机制克服了通用VLMs的领域差距:一种在线动态精炼视觉原型的混合VLM教师,以及由统计间隔(margin)监控驱动的自适应路由器,用于实时校准门控阈值。在17个数据集和11种攻击类型上的广泛评估表明,PRISM达到了最先进的性能,在CIFAR-10上将攻击成功率抑制到<1%,同时提高了干净准确率,确立了模型无关、外部化安全的新标准。
Summary / 总结
The paper addresses the vulnerability of deep neural networks to backdoor attacks by proposing a shift from internal diagnosis to external semantic auditing. It introduces PRISM, which uses Universal Vision-Language Models as evolving semantic gatekeepers. PRISM combines a Hybrid VLM Teacher that refines visual prototypes online with an Adaptive Router that calibrates gating thresholds in real time through statistical margin monitoring. Experiments across 17 datasets and 11 attack types show that PRISM achieves state-of-the-art performance, suppressing Attack Success Rate to below 1% on CIFAR-10 while improving clean accuracy, setting a new standard for model-agnostic, externalized security.
论文提出了一种新的范式,从内部诊断转向外部语义审计,以应对深度神经网络的后门攻击问题。它引入了PRISM框架,利用通用视觉语言模型作为不断进化的语义守门人。PRISM包括一个用于在线精炼视觉原型的混合VLM教师,以及一个基于统计间隔(margin)监控、用于实时校准门控阈值的自适应路由器。实验表明,PRISM在CIFAR-10上将攻击成功率抑制到低于1%,同时提高了干净准确率,确立了模型无关、外部化安全的新标准。
RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming
Authors: Jisheng Chu, Wenrui Li, Rui Zhao, Wangmeng Zuo, Shifeng Chen, Xiaopeng Fan
First: 2026-01-27T10:10:55+00:00 · Latest: 2026-01-27T10:10:55+00:00
Abstract
Generating immersive 3D scenes from texts is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial blindness and rely on predefined trajectories that fail to exploit the inner relationships among salient objects. Consequently, these approaches are unable to comprehend the semantic layout, preventing them from exploring the scene adaptively to infer occluded content. Moreover, current inpainting models operate in 2D image space, struggling to plausibly fill holes caused by camera motion. To address these limitations, we propose RoamScene3D, a novel framework that bridges the gap between semantic guidance and spatial generation. Our method reasons about the semantic relations among objects and produces consistent and photorealistic scenes. Specifically, we employ a vision-language model (VLM) to construct a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. Furthermore, to mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset integrating authentic camera trajectories, making it adaptive to camera motion. Extensive experiments demonstrate that with semantic reasoning and geometric constraints, our method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes. Our code is available at https://github.com/JS-CHU/RoamScene3D.
中文标题/摘要
标题:RoamScene3D:通过自适应对象感知漫游实现沉浸式文本到3D场景生成
从文本生成沉浸式3D场景是计算机视觉中的核心任务,对于虚拟现实和游戏开发的应用至关重要。尽管利用2D扩散先验具有潜力,但现有方法存在空间盲视问题,依赖于预定义的轨迹,无法充分利用显著对象之间的内在关系。因此,这些方法无法理解语义布局,无法适应地探索场景以推断被遮挡的内容。此外,当前的修复模型在2D图像空间中运行,难以合理填补由相机运动造成的空洞。为解决这些限制,我们提出RoamScene3D,这是一种新颖的框架,将语义指导与空间生成结合起来。我们的方法考虑了对象之间的语义关系,生成一致且逼真的场景。具体而言,我们使用视觉-语言模型(VLM)构建场景图,编码对象关系,引导相机感知显著对象边界并规划自适应漫游轨迹。此外,为缓解静态2D先验的限制,我们引入了一种注入运动的修复模型,该模型在结合真实相机轨迹的合成全景数据集上进行微调,使其能够适应相机运动。大量实验表明,通过语义推理和几何约束,我们的方法在生成一致且逼真的场景方面显著优于现有最佳方法。我们的代码可在https://github.com/JS-CHU/RoamScene3D获取。
Summary / 总结
RoamScene3D addresses the limitations of existing text-to-3D scene generation methods by integrating semantic guidance and spatial generation. It uses a vision-language model to construct a scene graph that guides the camera to explore the scene adaptively and plan a trajectory, while a Motion-Injected Inpainting model is introduced to handle camera motion. Experiments show that RoamScene3D outperforms state-of-the-art approaches in generating consistent and photorealistic scenes.
RoamScene3D通过结合语义指导和空间生成来解决现有文本到3D场景生成方法的局限性。它使用视觉语言模型构建场景图,引导相机感知对象边界并规划自适应轨迹。此外,它引入了一种在合成全景数据集上微调的运动注入填充模型,以处理相机运动。实验表明,RoamScene3D在生成一致且逼真的场景方面优于现有最佳方法。
GhostUI: Unveiling Hidden Interactions in Mobile UI
Authors: Minkyu Kweon, Seokhyeon Park, Soohyun Lee, You Been Lee, Jeongmin Rhee, Jinwook Seo
First: 2026-01-27T06:40:29+00:00 · Latest: 2026-01-27T06:40:29+00:00
Comments: Accepted at ACM CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract
Modern mobile applications rely on hidden interactions--gestures without visual cues like long presses and swipes--to provide functionality without cluttering interfaces. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents--systems designed to automate tasks on mobile user interfaces, powered by vision language models (VLMs)--struggle to detect veiled interactions or determine actions for completing tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations with VLMs show that models fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI's potential as a foundation for advancing mobile task automation.
中文标题/摘要
标题:GhostUI:揭示移动UI中的隐藏交互
现代移动应用程序依赖隐藏交互(即长按、滑动等没有视觉提示的手势)来提供功能,同时避免界面杂乱。虽然经验丰富的用户可能通过先前的使用或引导教程发现这些交互,但它们的隐含性质使得大多数用户难以发现。同样,移动代理(由视觉语言模型(VLMs)驱动、旨在自动化移动用户界面任务的系统)也难以检测这些隐蔽的交互或确定完成任务所需的动作。为了解决这一挑战,我们提出了GhostUI,这是一个旨在支持移动应用程序中隐藏交互检测的新数据集。GhostUI 提供了交互前后的屏幕截图、简化的视图层次结构、手势元数据和任务描述,使VLMs能够更好地识别隐藏手势并预测交互后的状态。使用VLMs的定量评估表明,在GhostUI上微调的模型优于基线VLMs,尤其是在预测隐藏交互和推断交互后屏幕方面,突显了GhostUI作为推进移动任务自动化的基础的潜力。
Summary / 总结
The research addresses hidden interactions in mobile applications: gestures without visual cues, such as long presses and swipes, that most users and VLM-powered mobile agents find difficult to discover. It introduces GhostUI, a new dataset designed to enable the detection of these interactions, which provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions. Quantitative evaluations show that models fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, highlighting its potential as a foundation for mobile task automation.
该论文介绍了GhostUI数据集,旨在帮助视觉语言模型检测移动应用中的隐藏交互。这些交互,如没有视觉提示的手势,对于用户和移动代理来说都很难发现。GhostUI包含前后截图、简化视图层次结构、手势元数据和任务描述,这有助于模型更好地识别隐藏手势并预测交互后的状态。实验表明,基于GhostUI微调的模型在识别隐藏交互和推断后续屏幕方面优于基线模型,突显了其在移动任务自动化方面的潜力。
Knowledge-enhanced Pretraining for Vision-language Pathology Foundation Model on Cancer Diagnosis
Authors: Xiao Zhou, Luoyi Sun, Dexuan He, Wenbin Guan, Ge Wang, Ruifen Wang, Lifeng Wang, Xiaojun Yuan, Xin Sun, Ya Zhang, Kun Sun, Yanfeng Wang, Weidi Xie
First: 2024-12-17T17:45:21+00:00 · Latest: 2026-01-27T06:24:09+00:00
Comments: V2: fixed typos, updated experimental results, added ablation
Abstract
Vision-language foundation models have shown great promise in computational pathology but remain primarily data-driven, lacking explicit integration of medical knowledge. We introduce KEEP (KnowledgE-Enhanced Pathology), a foundation model that systematically incorporates disease knowledge into pretraining for cancer diagnosis. KEEP leverages a comprehensive disease knowledge graph encompassing 11,454 diseases and 139,143 attributes to reorganize millions of pathology image-text pairs into 143,000 semantically structured groups aligned with disease ontology hierarchies. This knowledge-enhanced pretraining aligns visual and textual representations within hierarchical semantic spaces, enabling deeper understanding of disease relationships and morphological patterns. Across 18 public benchmarks (over 14,000 whole-slide images) and 4 institutional rare cancer datasets (926 cases), KEEP consistently outperformed existing foundation models, showing substantial gains for rare subtypes. These results establish knowledge-enhanced vision-language modeling as a powerful paradigm for advancing computational pathology.
中文标题/摘要
标题:知识增强预训练在癌症诊断视觉语言病理基础模型中的应用
视觉语言基础模型在计算病理学中显示出巨大的潜力,但仍然主要依赖数据驱动,缺乏明确的知识整合。我们介绍了KEEP(知识增强病理学),这是一种系统地将疾病知识纳入癌症诊断预训练的基础模型。KEEP 利用一个包含11,454种疾病和139,143个属性的全面疾病知识图谱,重新组织数百万张病理图像-文本对,形成143,000个语义结构化的组,与疾病本体层次结构对齐。这种知识增强的预训练在层次语义空间内对齐视觉和文本表示,使对疾病关系和形态学模式的理解更加深入。在18个公开基准(超过14,000张全切片图像)和4个机构罕见癌症数据集(926例)上,KEEP 一致地优于现有基础模型,显示出对罕见亚型的显著改进。这些结果确立了知识增强的视觉语言建模作为推进计算病理学的强大范式。
Summary / 总结
The research aims to enhance vision-language foundation models in computational pathology by integrating medical knowledge. KEEP, a knowledge-enhanced pretraining model, uses a comprehensive disease knowledge graph to reorganize image-text pairs into semantically structured groups. This approach improves the understanding of disease relationships and morphological patterns, leading to better performance across various benchmarks and institutional datasets, especially for rare cancer subtypes.
研究旨在通过整合医学知识来提升计算病理中的视觉-语言基础模型。KEEP是一种知识增强的预训练模型,利用全面的疾病知识图谱重新组织图像-文本对为语义结构化的组。这种方法提高了对疾病关系和形态学模式的理解,使其在各种基准测试和机构数据集中的表现更好,尤其是对罕见癌症亚型的表现提升显著。
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Authors: Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou
First: 2025-10-13T05:51:22+00:00 · Latest: 2026-01-27T05:41:52+00:00
Abstract
While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning capability, which is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks - including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
中文标题/摘要
标题:Vlaser:具有协同具身推理能力的视觉-语言-行动模型
尽管已有大量研究致力于利用视觉-语言模型(VLM)发展具身推理能力,或将先进的VLM集成到视觉-语言-行动(VLA)模型中以实现端到端的机器人控制,但鲜有研究直接解决上游基于VLM的推理与下游VLA策略学习之间的关键差距。在本文中,我们引入了Vlaser,向连接具身推理与VLA策略学习迈出了初步的一步。Vlaser是一种具有协同具身推理能力的视觉-语言-行动模型,也是一种基础视觉-语言模型,旨在为具身智能体整合高层推理与低层控制。基于高质量的Vlaser-6M数据集,Vlaser在一系列具身推理基准测试中取得了最先进的性能,包括空间推理、具身定位(grounding)、具身问答和任务规划。此外,我们系统地研究了不同VLM初始化如何影响有监督的VLA微调,为缓解互联网规模预训练数据与具身策略学习数据之间的领域偏移提供了新见解。基于这些见解,我们的方法在WidowX基准测试中取得了最先进的结果,并在Google Robot基准测试中取得了有竞争力的表现。
Summary / 总结
This paper addresses the gap between upstream VLM-based embodied reasoning and downstream Vision-Language-Action (VLA) policy learning by introducing Vlaser, a VLA model with synergistic embodied reasoning that integrates high-level reasoning with low-level control. Built on the Vlaser-6M dataset, Vlaser achieves state-of-the-art performance on embodied reasoning benchmarks covering spatial reasoning, embodied grounding, embodied QA, and task planning. The study also examines how different VLM initializations affect supervised VLA fine-tuning; the resulting approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
该研究引入了Vlaser模型,以弥合上游基于VLM的具身推理与下游视觉-语言-行动(VLA)策略学习之间的差距,将高层推理与低层控制结合起来。基于Vlaser-6M数据集,Vlaser在多种具身推理基准测试中取得了最先进的性能。研究还探讨了不同VLM初始化对有监督VLA微调的影响,最终在WidowX基准测试中达到最先进的结果,并在Google Robot基准测试中取得了有竞争力的表现。
Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP
Authors: Sen Nie, Jie Zhang, Zhuo Wang, Shiguang Shan, Xilin Chen
First: 2026-01-27T05:24:45+00:00 · Latest: 2026-01-27T05:24:45+00:00
Comments: 21 pages
Abstract
Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model's inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.
中文标题/摘要
标题:对比频谱校正:面向CLIP零样本对抗鲁棒性的测试时防御
视觉-语言模型(VLMs)如CLIP展示了显著的零样本泛化能力,但仍然极易受到对抗样本(AEs)的影响。虽然测试时防御方法很有前景,但现有方法无法对强攻击提供足够的鲁棒性,并且通常受到高推理延迟和仅适用于特定任务的限制。为了解决这些限制,我们首先研究了AEs的内在特性,发现AEs在渐进频率衰减下表现出严重的特征不一致性。我们进一步将其归因于模型固有的频谱偏差。利用这一洞察,我们提出了一种高效的测试时防御方法,称为对比频谱校正(CSR)。CSR在频谱引导的对比目标下优化一个校正扰动,使输入重新与自然流形对齐,并以输入自适应的方式应用。在16个分类基准上的广泛实验表明,面对强大的AutoAttack,CSR平均优于当前最佳方法18.1%,且推理开销适中。此外,CSR在多种视觉任务中具有广泛的适用性。代码可在https://github.com/Summu77/CSR获取。
Summary / 总结
The paper addresses the vulnerability of vision-language models such as CLIP to adversarial examples, which persists despite their strong zero-shot generalization. It proposes Contrastive Spectral Rectification (CSR), an efficient test-time defense that optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective. Across 16 classification benchmarks, CSR outperforms the state of the art by an average of 18.1% against strong AutoAttack with modest inference overhead, and it applies broadly across diverse visual tasks.
论文针对CLIP等视觉语言模型对对抗样本的脆弱性,提出了高效的测试时防御方法Contrastive Spectral Rectification (CSR)。CSR在频谱引导的对比目标下优化校正扰动,使输入重新与自然流形对齐。实验表明,在16个分类基准上,CSR在强AutoAttack攻击下比最先进的方法平均高出18.1%,推理开销适中,并广泛适用于多种视觉任务。
MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning
Authors: Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda, Peter J. Stuckey, Hamid Rezatofighi
Venue: ICLR 2026
First: 2026-01-27T05:06:54+00:00 · Latest: 2026-01-27T05:06:54+00:00
Comments: ICLR 2026
Abstract
Recent vision-language models have strong perceptual ability but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system presented as a hierarchical finite-state automaton for visual reasoning whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton, and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding transparent execution history. To supervise the hyper agent's transition policy, we build transition-trajectory trees and transform to memory-to-next-state pairs, forming the MATA-SFT-90K dataset for supervised finetuning (SFT). The finetuned LLM as the transition policy understands the query and the capacity of agents, and it can efficiently choose the optimal agent to solve the task. Across multiple visual reasoning benchmarks, MATA achieves the state-of-the-art results compared with monolithic and compositional baselines. The code and dataset are available at https://github.com/ControlNet/MATA.
中文标题/摘要
标题:MATA:一个多智能体视觉推理可训练层次自动机系统
近期的视觉-语言模型具有强大的感知能力,但其隐式推理难以解释,并且在复杂查询上容易产生幻觉。组合式方法可以提高可解释性,但大多数方法依赖于单个智能体或手工设计的流水线,无法决定何时在互补智能体之间协作或在功能重叠的智能体之间竞争。我们引入了MATA(Multi-Agent hierarchical Trainable Automaton,多智能体层次可训练自动机),这是一种以层次有限状态自动机形式呈现的多智能体视觉推理系统,其顶层转换由可训练的超智能体选择。每个智能体对应超自动机中的一个状态,并运行一个小的基于规则的子自动机以实现可靠的微控制。所有智能体读写同一个共享内存,从而产生透明的执行历史。为了监督超智能体的转换策略,我们构建了转换轨迹树,并将其转换为内存到下一状态的配对,形成了用于监督微调(SFT)的MATA-SFT-90K数据集。微调后的LLM作为转换策略,能够理解查询和各智能体的能力,并能高效地选择最优的智能体来解决任务。在多个视觉推理基准测试中,与单体式和组合式基线相比,MATA取得了最先进的结果。代码和数据集可在https://github.com/ControlNet/MATA获取。
Summary / 总结
MATA is a multi-agent system for visual reasoning, organized as a hierarchical finite-state automaton, that aims to improve interpretability and reduce hallucinations on complex queries. Each agent is a state in the hyper automaton and runs a small rule-based sub-automaton; all agents read and write a shared memory, which yields a transparent execution history. Top-level transitions are chosen by a trainable hyper agent, an LLM fine-tuned on MATA-SFT-90K, a dataset of memory-to-next-state pairs derived from transition-trajectory trees. MATA achieves state-of-the-art results against monolithic and compositional baselines across multiple visual reasoning benchmarks.
MATA 是一个用于视觉推理的多智能体系统,以层次有限状态自动机的形式组织,由可训练的超智能体选择顶层转换,即决定下一步由哪个智能体执行。每个智能体运行一个小的基于规则的子自动机,所有智能体读写同一个共享内存,从而保证执行过程透明。超智能体的转换策略通过 MATA-SFT-90K 数据集进行监督微调。MATA 在多种视觉推理基准测试中优于单体式和组合式基线。
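The control flow described in this entry (a top-level policy picks the next agent, agents read and write a shared memory, and the visited states form an execution history) can be sketched with a toy automaton. Everything here is hypothetical: in the paper the transition policy is a fine-tuned LLM and each agent runs its own rule-based sub-automaton, whereas this sketch uses a hand-written policy and plain functions with hard-coded outputs.

```python
def detector(memory):
    """Toy perception agent: writes a fixed object list into shared memory."""
    memory["objects"] = ["cat", "cat", "dog"]  # stand-in for a real detector


def counter(memory):
    """Toy reasoning agent: counts the queried object using shared memory."""
    memory["answer"] = memory["objects"].count(memory["query"])


AGENTS = {"detector": detector, "counter": counter}


def toy_policy(memory):
    """Hand-written stand-in for the trainable hyper agent's transition policy."""
    if "objects" not in memory:
        return "detector"
    if "answer" not in memory:
        return "counter"
    return "FINAL"


def run(agents, policy, memory, max_steps=10):
    """Run the top-level automaton until the policy emits FINAL or steps run out."""
    memory.setdefault("history", [])
    for _ in range(max_steps):
        state = policy(memory)
        if state == "FINAL":
            break
        agents[state](memory)            # the agent reads and writes shared memory
        memory["history"].append(state)  # transparent execution trace
    return memory
```

Because every transition is a (memory, next state) decision, traces like `history` are the kind of data that could be turned into supervised pairs for training the policy, which is how the abstract describes MATA-SFT-90K.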
Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Authors: Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
First: 2025-10-14T19:57:03+00:00 · Latest: 2026-01-27T04:30:00+00:00
Comments: This paper contains fundamental errors and will not be replaced
Abstract
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
中文标题/摘要
标题:具有知识意识的视觉-语言基础模型用于胎儿超声图像解释
近期的医疗视觉-语言模型在诸如VQA、报告生成和异常检测等任务上显示出潜力。然而,大多数模型适应于结构化的成人成像,而在胎儿超声图像上表现不佳,这带来了多视角图像推理、多种疾病和图像多样性等挑战。为弥合这一差距,我们引入了FetalMind,这是一种针对胎儿超声图像的医疗AI系统,用于报告生成和诊断。根据临床工作流程,我们提出了显著的知识解耦(SED),该方法将专家策划的二分图注入模型中,以解耦视角-疾病关联,并通过强化学习引导临床忠实步骤的偏好选择。这种设计减轻了疾病间的变异性以及视角间的异质性,减少了学习瓶颈,使模型的推理与产科实践保持一致。为了大规模训练FetalMind,我们构建了FetalSigma-1M数据集,这是首个大规模的胎儿超声图像报告语料库,包含来自十二家医疗机构的20000份报告,解决了领域数据稀缺的问题。广泛的实验表明,FetalMind在所有妊娠阶段的表现均优于开源和闭源基线,平均提升14%的性能,并在关键条件下提高了61.2%的准确性,同时保持高效、稳定和可扩展。项目页面:https://hexiao0275.github.io/FetalMind。
Summary / 总结
The research addresses the challenges of interpreting fetal ultrasound, which include multi-view image reasoning, numerous diseases, and image diversity. The authors propose FetalMind, a medical AI system for report generation and diagnosis that uses Salient Epistemic Disentanglement (SED): an expert-curated bipartite graph is injected into the model to decouple view-disease associations, and reinforcement learning steers preference selection along clinically faithful steps. The system is trained on FetalSigma-1M, described as the first large-scale fetal ultrasound report corpus (20K reports from twelve medical centers). The authors report that FetalMind outperforms open- and closed-source baselines across all gestational stages, with +14% average gains and +61.2% higher accuracy on critical conditions. Note that the arXiv comment on this entry states that the paper contains fundamental errors.
研究旨在通过解决多视角图像推理和疾病多样性的问题,提高胎儿超声成像的解读能力。开发了FetalMind医疗AI系统,采用Salient Epistemic Disentanglement (SED)方法,结合二分图引导模型的选择偏好,并与临床工作流程对齐。实验表明,FetalMind在不同妊娠阶段和关键条件下优于现有基线,显著提高了准确性和效率。
SNR-Edit: Structure-Aware Noise Rectification for Inversion-Free Flow-Based Editing
Authors: Lifan Jiang, Boxi Wu, Yuhang Pei, Tianrun Wu, Yongyuan Chen, Yan Zhao, Shiyu Yu, Deng Cai
First: 2026-01-27T04:24:21+00:00 · Latest: 2026-01-27T04:24:21+00:00
Abstract
Inversion-free image editing using flow-based generative models challenges the prevailing inversion-based pipelines. However, existing approaches rely on fixed Gaussian noise to construct the source trajectory, leading to biased trajectory dynamics and causing structural degradation or quality loss. To address this, we introduce SNR-Edit, a training-free framework achieving faithful Latent Trajectory Correction via adaptive noise control. Mechanistically, SNR-Edit uses structure-aware noise rectification to inject segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image's implicit inversion position and reducing trajectory drift during source--target transport. This lightweight modification yields smoother latent trajectories and ensures high-fidelity structural preservation without requiring model tuning or inversion. Across SD3 and FLUX, evaluations on PIE-Bench and SNR-Bench show that SNR-Edit delivers performance on pixel-level metrics and VLM-based scoring, while adding only about 1s overhead per image.
中文标题/摘要
标题:SNR-Edit:面向免反演流式编辑的结构感知噪声校正
使用基于流的生成模型进行无逆过程的图像编辑挑战了现有的基于逆过程的管道。然而,现有方法依赖于固定的高斯噪声来构建源轨迹,导致轨迹动力学偏差并造成结构退化或质量损失。为了解决这一问题,我们引入了SNR-Edit,这是一种无需训练的框架,通过自适应噪声控制实现忠实的潜在轨迹校正。机制上,SNR-Edit 使用结构感知的噪声校正将分割约束注入初始噪声中,将源轨迹的随机成分锚定到真实图像的隐式逆位置,并在源-目标传输过程中减少轨迹漂移。这种轻量级的修改产生了更平滑的潜在轨迹,并确保高保真的结构保留,无需进行模型调优或逆过程。在SD3和FLUX上,对PIE-Bench和SNR-Bench的评估显示,SNR-Edit 在像素级指标和VLM基评分上表现出色,同时每张图像仅增加约1秒的额外开销。
Summary / 总结
SNR-Edit is a training-free framework that uses structure-aware noise rectification to inject segmentation constraints into the initial noise, correcting the latent trajectory for faithful image editing. This method reduces trajectory drift and ensures high-fidelity structural preservation without requiring model tuning or inversion. Experiments on PIE-Bench and SNR-Bench demonstrate that SNR-Edit performs well on pixel-level metrics and VLM-based scoring, with only about 1s overhead per image.
SNR-Edit 是一个无需训练的框架,通过结构感知的噪声校正将分割约束注入初始噪声,将源轨迹的随机成分锚定到真实图像的隐式反演位置,从而减少轨迹漂移并确保高保真的结构保真度,无需进行模型调优或反演。PIE-Bench 和 SNR-Bench 上的实验表明,SNR-Edit 在像素级指标和 VLM 基准评分上表现出色,每张图像仅增加约 1 秒的额外开销。
SVBench: Evaluation of Video Generation Models on Social Reasoning
Authors: Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
First: 2025-12-25T04:44:59+00:00 · Latest: 2026-01-27T03:50:59+00:00
Comments: 10pages
Abstract
Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
中文标题/摘要
标题:SVBench:视频生成模型在社会推理评估中的应用
近期的文本到视频生成模型在视觉真实感、运动保真度和文本视频对齐方面取得了显著进展,但在生成社会连贯行为方面仍然存在根本性的局限。与人类能够从简短的视觉线索中轻松推断意图、信念、情感和社会规范不同,当前的模型往往渲染字面场景,而未能捕捉到潜在的因果或心理逻辑。为了系统地评估这一差距,我们引入了第一个视频生成中的社会推理基准。该基准基于发展心理学和社会心理学的研究成果,将三十个经典的社会认知范式组织成七个核心维度,包括心理状态推断、目标导向行为、共同注意、社会协调、亲社会行为、社会规范和多智能体策略。为了实现这些范式的操作化,我们开发了一个完全无需训练的基于代理的流水线,包括(i) 提炼每个实验的推理机制,(ii) 合成多种多样的视频准备场景,(iii) 通过基于线索的批评实现概念中立性和难度控制,以及(iv) 使用高容量的VLM裁判在五个可解释的社会推理维度上评估生成的视频。利用这一框架,我们首次对七个最先进的视频生成系统进行了大规模研究。我们的结果显示了显著的性能差距:尽管现代模型在表面合理性方面表现出色,但在意图识别、信念推理、共同注意和亲社会推理方面却系统性地失败。
Summary / 总结
The research aims to evaluate the ability of text-to-video generation models to produce socially coherent behavior, which current models struggle with despite advancements in visual realism and alignment. The study introduces SVBench, a benchmark based on social cognition paradigms from psychology, to assess models across seven dimensions. Key findings show that while models perform well in surface-level plausibility, they fail in recognizing intentions, belief reasoning, joint attention, and prosocial inference.
SVBench 提出了一个用于评估视频生成模型在生成社会连贯行为方面的基准,解决了当前模型在社会推理能力上的不足。基准包括三十个社会认知范式,分为七个维度。通过一个无需训练的基于代理的管道,该研究评估了七个最先进的视频生成系统,结果显示这些模型虽然在视觉逼真度上表现出色,但在意图识别、信念推理、共同注意和利他推理方面存在系统性缺陷。
Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding
Authors: Zhongyu Xiao, Zhiwei Hao, Jianyuan Guo, Yong Luo, Jia Liu, Jie Xu, Han Hu
First: 2026-01-25T17:36:04+00:00 · Latest: 2026-01-27T03:21:28+00:00
Comments: Tech report. Code is available at https://github.com/xiaoshideta/Streaming-dLLM
Abstract
Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block-wise diffusion process. Specifically, they suffer from spatial redundancy by modeling informative-sparse suffix regions uniformly and temporal inefficiency by applying fixed denoising schedules across all the decoding process. To address this, we propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence aware strategy with an early exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming-dLLM achieves up to 68.2X speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming-dLLM.
中文标题/摘要
标题:Streaming-dLLM:通过后缀剪枝和动态解码加速扩散大语言模型
扩散大语言模型(dLLMs)提供了一种引人注目的自然语言生成范式,通过并行解码和双向注意力实现与自回归模型相比更好的全局连贯性。虽然最近的工作通过键值缓存重用或启发式解码加速了推理,但它们忽视了块状扩散过程中的内在低效性。具体来说,它们在建模具有信息稀疏性的后缀区域时存在空间冗余,并且在解码过程中应用固定去噪时间表存在时间低效性。为了解决这个问题,我们提出了一种无需训练的Streaming-dLLM框架,该框架在空间和时间维度上简化了推理。空间上,我们引入了衰减引导的后缀建模来通过剪枝冗余掩码令牌近似完整的上下文。时间上,我们采用了一种动态置信度感知策略并结合了早期退出机制,使模型能够跳过已收敛令牌的不必要的迭代。广泛的实验表明,Streaming-dLLM在保持生成质量的同时实现了高达68.2倍的加速,突显了其在扩散解码中的有效性。代码可在https://github.com/xiaoshideta/Streaming-dLLM获取。
Summary / 总结
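The paper targets two inefficiencies in block-wise decoding for diffusion large language models (dLLMs): spatial redundancy from modeling information-sparse suffix regions uniformly, and temporal inefficiency from applying a fixed denoising schedule throughout decoding. The proposed training-free framework, Streaming-dLLM, prunes redundant mask tokens via attenuation-guided suffix modeling and uses a dynamic confidence-aware strategy with early exit to skip iterations for converged tokens. The authors report up to 68.2X speedup while maintaining generation quality.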
论文提出了Streaming-dLLM框架,通过解决空间冗余和时间效率问题来加速扩散大型语言模型(dLLMs)。它引入了衰减引导的后缀建模来修剪冗余的掩码令牌,并采用动态置信度感知策略和早期退出机制来跳过不必要的迭代。实验结果显示,加速比可达68.2倍,同时保持生成质量。
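The temporal half of the idea, skipping iterations once tokens have converged, can be illustrated with a toy decoding loop. This is a sketch under stated assumptions rather than the Streaming-dLLM code: `decode_block`, the `threshold` value, and the confidence schedule are all hypothetical, and the forced commit of the most confident token is a common progress guarantee in confidence-based decoders, not something the abstract specifies.

```python
from typing import Callable, List, Tuple


def decode_block(
    confidence_fn: Callable[[int], List[float]],
    block_len: int,
    max_steps: int,
    threshold: float = 0.9,
) -> Tuple[List[int], int]:
    """Commit positions once their confidence passes `threshold`.

    Returns the order in which positions were committed and the number
    of denoising steps actually used.
    """
    if block_len <= 0:
        return [], 0
    committed = [False] * block_len
    commit_order: List[int] = []
    steps_used = 0
    for step in range(max_steps):
        steps_used = step + 1
        conf = confidence_fn(step)  # per-position confidence at this step
        newly = [
            i for i in range(block_len)
            if not committed[i] and conf[i] >= threshold
        ]
        if not newly:
            # Guarantee progress: commit the most confident open position.
            remaining = [i for i in range(block_len) if not committed[i]]
            newly = [max(remaining, key=lambda i: conf[i])]
        for i in newly:
            committed[i] = True
            commit_order.append(i)
        if all(committed):
            break  # early exit: every token in the block has converged
    return commit_order, steps_used


# Hypothetical confidences for a 4-token block over 4 denoising steps.
SCHEDULE = [
    [0.95, 0.40, 0.30, 0.20],
    [0.99, 0.92, 0.50, 0.45],
    [0.99, 0.97, 0.93, 0.91],
    [0.99, 0.99, 0.99, 0.99],
]

if __name__ == "__main__":
    print(decode_block(lambda s: SCHEDULE[s], block_len=4, max_steps=4))
```

With this schedule the block finishes in three steps instead of the four that a fixed one-token-per-step schedule would need, because two positions cross the threshold in the same step.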
Generation then Reconstruction: Accelerating Masked Autoregressive Models via Two-Stage Sampling
Authors: Feihong Yan, Peiru Wang, Yao Zhu, Kaiyu Pang, Qingyan Wei, Huiqi Li, Linfeng Zhang
First: 2025-10-20T05:22:10+00:00 · Latest: 2026-01-27T02:28:22+00:00
Comments: 12 pages, 6 figures
Abstract
Masked Autoregressive (MAR) models promise better efficiency in visual generation than autoregressive (AR) models for the ability of parallel generation, yet their acceleration potential remains constrained by the modeling complexity of spatially correlated visual tokens in a single step. To address this limitation, we introduce Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy that decomposes generation into two stages: structure generation establishing global semantic scaffolding, followed by detail reconstruction efficiently completing remaining tokens. Assuming that it is more difficult to create an image from scratch than to complement images based on a basic image framework, GtR is designed to achieve acceleration by computing the reconstruction stage quickly while maintaining the generation quality by computing the generation stage slowly. Moreover, observing that tokens on the details of an image often carry more semantic information than tokens in the salient regions, we further propose Frequency-Weighted Token Selection (FTS) to offer more computation budget to tokens on image details, which are localized based on the energy of high frequency information. Extensive experiments on ImageNet class-conditional and text-to-image generation demonstrate 3.72x speedup on MAR-H while maintaining comparable quality (e.g., FID: 1.59, IS: 304.4 vs. original 1.59, 299.1), substantially outperforming existing acceleration methods across various model scales and generation tasks. Our codes will be released in https://github.com/feihongyan1/GtR.
中文标题/摘要
标题:生成然后重建:通过两阶段采样加速掩蔽自回归模型
掩蔽自回归(MAR)模型在视觉生成方面比自回归(AR)模型具有更好的效率,因为它们能够并行生成,但其加速潜力受限于在单步中建模空间相关视觉标记的复杂性。为了解决这一限制,我们引入了生成然后重建(GtR)这一无需训练的分层采样策略,将生成过程分解为两个阶段:结构生成建立全局语义框架,随后是细节重建高效地完成剩余标记。假设从零开始创建图像比基于基本图像框架补充图像更难,GtR 旨在通过快速计算重建阶段同时缓慢计算生成阶段来实现加速,从而保持生成质量。此外,鉴于图像细节中的标记通常携带比显著区域更多语义信息,我们进一步提出了频率加权标记选择(FTS),为图像细节中的标记分配更多的计算预算,这些标记基于高频信息的能量进行局部化。在 ImageNet 类条件和文本到图像生成上的广泛实验表明,GtR 在 MAR-H 上实现了 3.72 倍的加速,同时保持了相当的质量(例如,FID:1.59,IS:304.4 对比原始的 1.59,299.1),在各种模型规模和生成任务上显著优于现有加速方法。我们的代码将在 https://github.com/feihongyan1/GtR 发布。
Summary / 总结
The paper introduces Generation then Reconstruction (GtR), a training-free hierarchical sampling strategy for accelerating Masked Autoregressive (MAR) models. GtR decomposes the generation process into two stages: structure generation and detail reconstruction, aiming to maintain generation quality while significantly speeding up the reconstruction stage. Experiments on ImageNet and text-to-image generation show a 3.72x speedup with comparable quality, outperforming existing methods across different model scales and tasks.
论文提出了训练免费的层级采样策略Generation then Reconstruction (GtR),该策略将生成过程分解为两个阶段:结构生成用于构建全局语义框架,以及细节重建用于高效完成剩余部分。该方法在MAR-H上实现了3.72倍的加速,同时保持了相近的质量指标(例如,FID: 1.59,IS: 304.4 vs. 原始的1.59,299.1),在各种模型规模和生成任务中显著优于现有加速方法。
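The abstract says that Frequency-Weighted Token Selection localizes detail tokens by the energy of high-frequency information. A rough sketch of that idea follows, with assumptions stated plainly: the high-pass filter here is a simple 4-neighbour Laplacian on a pixel grid, tokens are taken to be square patches, and the names `high_freq_energy`, `token_scores`, and `select_tokens` are hypothetical. The paper may compute frequency energy differently, for example in latent space.

```python
from typing import List

Image = List[List[float]]


def high_freq_energy(img: Image) -> Image:
    """Squared response of a 4-neighbour Laplacian (a crude high-pass filter)."""
    h, w = len(img), len(img[0])
    energy = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nbrs = [
                img[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < h and 0 <= nx < w
            ]
            lap = img[y][x] - sum(nbrs) / len(nbrs)
            energy[y][x] = lap * lap
    return energy


def token_scores(img: Image, patch: int) -> List[float]:
    """Sum high-frequency energy inside each patch, in row-major token order."""
    energy = high_freq_energy(img)
    h, w = len(img), len(img[0])
    scores = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            scores.append(sum(
                energy[y][x]
                for y in range(py, min(py + patch, h))
                for x in range(px, min(px + patch, w))
            ))
    return scores


def select_tokens(scores: List[float], k: int) -> List[int]:
    """Indices of the k tokens with the most high-frequency energy."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Flat image with fine detail only in the bottom-right 2x2 patch.
    img = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]
    print(select_tokens(token_scores(img, patch=2), k=1))
```

In a two-stage sampler, the selected indices would be the tokens given extra computation budget; how that budget is spent is left out of this sketch.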
m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning
Authors: Yosub Shin, Michael Buriek, Igor Molybog
First: 2026-01-27T02:01:56+00:00 · Latest: 2026-01-27T02:01:56+00:00
Abstract
Vision--language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
中文标题/摘要
标题:m2sv:一种可扩展的用于地图到街景空间推理的基准
视觉-语言模型(VLMs)在许多多模态基准测试中表现出色,但在需要将抽象的鸟瞰图表示与第一人称视角对齐的空间推理任务上仍然脆弱。我们引入了m2sv,一种可扩展的用于地图到街景空间推理的基准,要求模型通过将北向上的鸟瞰图与同一真实交叉口拍摄的街景图像对齐来推断相机的视角方向。我们发布了包含地理多样性且具有可控模糊性的m2sv-20k基准,以及用于监督微调的m2sv-sft-11k结构化推理轨迹集。尽管在现有多模态基准测试中表现出色,但最佳评估的VLM在m2sv上的准确率仅为65.2%,远低于人类基准的95%。虽然监督微调和强化学习可以带来一致的改进,但跨基准评估揭示了有限的迁移。除了总体准确率外,我们系统地分析了地图到街景推理中的难度,使用结构信号和人力投入,并对适应的开放模型进行了广泛的失败分析。我们的发现突显了几何对齐、证据聚合和推理一致性方面的持续差距,激励未来在不同视角下的空间推理研究。
Summary / 总结
This work evaluates the spatial reasoning of vision-language models, specifically their ability to align overhead maps with street-level views. The m2sv benchmark asks a model to infer the camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. The best evaluated VLM reaches only 65.2% accuracy, far below the human baseline of 95%. Supervised fine-tuning and reinforcement learning yield consistent gains but transfer poorly across benchmarks, pointing to persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency.
论文介绍了m2sv,一个用于地图到街景空间推理的基准,评估模型将鸟瞰图与街景图像对齐的能力。尽管其他多模态基准上表现良好,但VLMs在m2sv上的准确率仅为65.2%,远低于人类基准。监督微调和强化学习可以提高性能,但在不同基准间表现有限。研究揭示了几何对齐、证据聚合和推理一致性方面的持续差距,指出了未来研究的方向。
Can We Trust LLM Detectors?
Authors: Jivnesh Sandhan, Harshit Jaiswal, Fei Cheng, Yugo Murawaki
First: 2026-01-09T04:53:06+00:00 · Latest: 2026-01-27T01:27:34+00:00
Abstract
The rapid adoption of LLMs has increased the need for reliable AI text detection, yet existing detectors often fail outside controlled benchmarks. We systematically evaluate 2 dominant paradigms (training-free and supervised) and show that both are brittle under distribution shift, unseen generators, and simple stylistic perturbations. To address these limitations, we propose a supervised contrastive learning (SCL) framework that learns discriminative style embeddings. Experiments show that while supervised detectors excel in-domain, they degrade sharply out-of-domain, and training-free methods remain highly sensitive to proxy choice. Overall, our results expose fundamental challenges in building domain-agnostic detectors. Our code is available at: https://github.com/HARSHITJAIS14/DetectAI
中文标题/摘要
标题:我们能信任LLM检测器吗?
LLM的快速采用增加了对可靠AI文本检测的需求,但现有的检测器往往在超出受控基准的情况下失效。我们系统地评估了2种主导范式(免训练和有监督),并表明两者在分布偏移、未见过的生成器和简单的风格扰动下都表现脆弱。为解决这些局限性,我们提出了一种监督对比学习(SCL)框架,用于学习区分性风格嵌入。实验表明,虽然监督检测器在域内表现出色,但在域外表现急剧下降,免训练方法对代理选择仍然非常敏感。总体而言,我们的结果揭示了构建领域无关检测器的基本挑战。我们的代码可在:https://github.com/HARSHITJAIS14/DetectAI 获取。
Summary / 总结
The study evaluates the reliability of LLM text detectors by comparing training-free and supervised approaches, finding both to be brittle under distribution shifts and stylistic perturbations. The research introduces a supervised contrastive learning framework to improve detector performance, but notes that supervised detectors still degrade out-of-domain and training-free methods remain sensitive to proxy choice. The findings highlight the need for more robust detectors that can handle distribution shifts and stylistic variations. Code is available at https://github.com/HARSHITJAIS14/DetectAI.
研究系统地评估了两种主流的LLM检测方法:免训练和有监督。发现这两种方法在分布变化和简单的风格变化下表现脆弱。研究提出了一种监督对比学习(SCL)框架来解决这些问题,结果显示虽然监督检测器在其领域内表现良好,但在领域外表现显著下降,而免训练方法仍对代理选择高度敏感。研究揭示了构建领域无关检测器的基本挑战。