arXiv 论文速递

Snapshot: 20260503_0414

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Gordon Guocheng Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov

Venue: CVPR 2026

First: 2025-12-11T18:59:56+00:00 · Latest: 2026-04-30T17:59:43+00:00

Comments: CVPR 2026. Project page: https://snap-research.github.io/omni-attribute

Abstract

Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

中文标题/摘要

标题：Omni-Attribute：面向视觉概念个性化的开放词汇属性编码器

视觉概念个性化旨在仅将特定图像属性，如身份、表情、光照和风格，转移到未见的上下文中。然而，现有方法依赖于通用图像编码器的整体嵌入，这会将多个视觉因素纠缠在一起，使得难以隔离单一属性。这通常会导致信息泄露和不连贯的合成。为了解决这一局限性，我们引入了Omni-Attribute，这是第一个用于学习高保真度、属性特定表示的开放词汇图像属性编码器。我们的方法联合设计了数据和模型：(i) 我们收集了带有正负属性标注的语义关联图像对，以明确地教导编码器保留或抑制什么；(ii) 我们采用了一种双目标训练范式，平衡生成保真度与对比性解耦。所得到的嵌入在开放词汇属性检索、个性化和组合生成方面证明是有效的，在多个基准上实现了最先进的性能。

Summary / 总结

The research aims to transfer specific image attributes, such as identity and lighting, into unseen contexts without entangling multiple visual factors. It introduces Omni-Attribute, an open-vocabulary image attribute encoder that learns high-fidelity, attribute-specific representations. The method curates semantically linked image pairs annotated with positive and negative attributes and trains with a dual objective that balances generative fidelity with contrastive disentanglement. The authors report state-of-the-art performance across multiple benchmarks in open-vocabulary attribute retrieval, personalization, and compositional generation.

研究旨在开发一种方法，以在不纠缠多种视觉因素的情况下，将特定图像属性如身份和照明转移到未见过的上下文中。引入了Omni-Attribute，这是一种开放词汇量的图像属性编码器，用于学习高保真度、属性特定的表示。该方法通过标注语义关联的图像对并采用双重目标训练范式来实现这一目标。实验结果表明，Omni-Attribute 在开放词汇量属性检索、个性化和组合生成方面优于现有方法，达到了新的最佳性能基准。

Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

Authors: Emma Andrews, Sahan Sanjaya, Prabhat Mishra

First: 2026-04-30T17:56:40+00:00 · Latest: 2026-04-30T17:56:40+00:00

Abstract

Machine learning models can learn from data samples to carry out various tasks efficiently. When data samples are adversarially manipulated, such as by the insertion of carefully crafted noise, the model can be led to make mistakes. Quantum machine learning models are also vulnerable to such adversarial attacks, especially in image classification using variational quantum classifiers. While there are promising defenses against these adversarial perturbations, such as training with adversarial samples, they face practical limitations. For example, they are not applicable in scenarios where training with adversarial samples is not possible, and they can overfit the models to one type of attack. In this paper, we propose an adversarial training-free defense framework that utilizes a quantum autoencoder to purify the adversarial samples through reconstruction. Moreover, our defense framework provides a confidence metric to identify potentially adversarial samples that cannot be purified by the quantum autoencoder. Extensive evaluation demonstrates that our defense framework can significantly outperform the state of the art in prediction accuracy (up to 68%) under adversarial attacks.

Summary / 总结

The paper addresses the vulnerability of quantum classifiers to adversarial perturbations by proposing a defense framework that uses a quantum autoencoder to purify adversarial samples through reconstruction. The method does not require adversarial training and provides a confidence metric to identify potentially adversarial samples that the autoencoder cannot purify. Experimental results show that this approach significantly improves prediction accuracy under adversarial attacks, up to 68% better than state-of-the-art methods.

论文提出了一种使用量子自编码器通过重建来净化对抗样本的防御框架，以应对针对量子分类器的对抗攻击。该方法不需要对抗训练，并提供了一个置信度指标来识别无法被量子自编码器净化的样本。实验结果表明，这种方法在对抗攻击下的预测准确性显著提高，最高可比最先进的方法高出 68%。

PhyCo: Learning Controllable Physical Priors for Generative Motion

Authors: Sriram Narayanan, Ziyu Jiang, Srinivasa Narasimhan, Manmohan Chandraker

Venue: CVPR 2026

First: 2026-04-30T17:53:03+00:00 · Latest: 2026-04-30T17:53:03+00:00

Comments: CVPR 2026. Project Page: https://phyco-video.github.io/

Abstract

Modern video diffusion models excel at appearance synthesis but still struggle with physical consistency: objects drift, collisions lack realistic rebound, and material responses seldom match their underlying properties. We present PhyCo, a framework that introduces continuous, interpretable, and physically grounded control into video generation. Our approach integrates three key components: (i) a large-scale dataset of over 100K photorealistic simulation videos where friction, restitution, deformation, and force are systematically varied across diverse scenarios; (ii) physics-supervised fine-tuning of a pretrained diffusion model using a ControlNet conditioned on pixel-aligned physical property maps; and (iii) VLM-guided reward optimization, where a fine-tuned vision-language model evaluates generated videos with targeted physics queries and provides differentiable feedback. This combination enables a generative model to produce physically consistent and controllable outputs through variations in physical attributes, without any simulator or geometry reconstruction at inference. On the Physics-IQ benchmark, PhyCo significantly improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes. Our results demonstrate a scalable path toward physically consistent, controllable generative video models that generalize beyond synthetic training environments.

Summary / 总结

PhyCo is a framework that brings continuous, interpretable, and physically grounded control to video generation. It combines a large-scale dataset of over 100K photorealistic simulation videos, physics-supervised fine-tuning of a pretrained diffusion model through a ControlNet conditioned on pixel-aligned physical property maps, and VLM-guided reward optimization with a fine-tuned vision-language model. On the Physics-IQ benchmark, the framework improves physical realism over strong baselines, and human studies confirm clearer and more faithful control over physical attributes.

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

Authors: Zeyu Jiang, Changqing Zhou, Xingxing Zuo, Changhao Chen

Venue: RSS 2026

First: 2026-04-30T17:05:56+00:00 · Latest: 2026-04-30T17:05:56+00:00

Comments: RSS 2026

Abstract

Existing learning-based occupancy prediction methods rely on large-scale 3D annotations and generalize poorly across environments. We present FreeOcc, a training-free framework for open-vocabulary occupancy prediction from monocular or RGB-D sequences. Unlike prior approaches that require voxel-level supervision and ground-truth camera poses, FreeOcc operates without 3D annotations, pose ground truth, or any learning stage. FreeOcc incrementally builds a globally consistent occupancy map via a four-layer pipeline: a SLAM backbone estimates poses and sparse geometry; a geometrically consistent Gaussian update constructs dense 3D Gaussian maps; open-vocabulary semantics from off-the-shelf vision-language models are associated with Gaussian primitives; and a probabilistic Gaussian-to-occupancy projection produces dense voxel occupancy. Despite being entirely training-free and pose-agnostic, FreeOcc achieves over 2× improvements in IoU and mIoU on EmbodiedOcc-ScanNet compared to prior self-supervised methods. We further introduce ReplicaOcc, a benchmark for indoor open-vocabulary occupancy prediction, and show that FreeOcc transfers zero-shot to novel environments, substantially outperforming both supervised and self-supervised baselines. Project page: https://the-masses.github.io/freeocc-web/.
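
Illustrative sketch (not from the paper): the last pipeline stage, a probabilistic Gaussian-to-occupancy projection, could look roughly like the following. The abstract does not give the actual formula, so everything here is an assumption: isotropic Gaussians, a noisy-OR combination of per-Gaussian contributions, a 0.5 occupancy threshold, and the hypothetical names `voxel_occupancy`, `opacity`, and `sigma`.

```python
import math

def voxel_occupancy(voxel_center, gaussians, threshold=0.5):
    """Return (occupancy probability, semantic label or None) for one voxel.

    Assumption: each Gaussian is isotropic and contributes
    opacity * exp(-d^2 / (2 * sigma^2)); contributions are merged with a
    noisy-OR. The paper's actual projection may differ.
    """
    free = 1.0                      # probability that no Gaussian occupies the voxel
    best_label, best_weight = None, 0.0
    for g in gaussians:
        d2 = sum((v - m) ** 2 for v, m in zip(voxel_center, g["mean"]))
        weight = g["opacity"] * math.exp(-0.5 * d2 / g["sigma"] ** 2)
        free *= 1.0 - weight
        if weight > best_weight:    # the strongest contributor supplies the label
            best_label, best_weight = g["label"], weight
    p = 1.0 - free
    return p, (best_label if p > threshold else None)

# Toy scene: one Gaussian primitive carrying an open-vocabulary label.
gaussians = [{"mean": (0.0, 0.0, 0.0), "sigma": 0.1, "opacity": 0.9, "label": "chair"}]
p_near, label_near = voxel_occupancy((0.0, 0.0, 0.0), gaussians)
p_far, label_far = voxel_occupancy((1.0, 1.0, 1.0), gaussians)
```

A voxel at the Gaussian's mean receives the Gaussian's opacity as its occupancy probability, while a distant voxel stays free and unlabeled.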

中文标题/摘要

标题：FreeOcc：无需训练的具身开放词汇占用预测

现有的基于学习的占用预测方法依赖于大规模的3D标注，并且在不同环境中泛化能力较差。我们提出了FreeOcc，这是一种无需训练的框架，可以从单目或RGB-D序列中进行开放词汇的占用预测。与之前需要体素级监督和真值相机位姿的方法不同，FreeOcc无需3D标注、位姿真值或任何学习阶段。FreeOcc通过四层流水线逐步构建全局一致的占用图：SLAM主干估计位姿和稀疏几何；几何一致的高斯更新构建稠密的3D高斯图；来自现成视觉语言模型的开放词汇语义与高斯基元关联；概率化的高斯到占用投影生成稠密体素占用。尽管完全无需训练且与位姿无关，FreeOcc在EmbodiedOcc-ScanNet上的IoU和mIoU相比之前的自监督方法提高了2倍以上。我们还引入了ReplicaOcc，一个用于室内开放词汇占用预测的基准，并展示了FreeOcc可以零样本迁移到新的环境中，显著优于监督和自监督基线。项目页面：https://the-masses.github.io/freeocc-web/

Auto-FlexSwitch: Efficient Dynamic Model Merging via Learnable Task Vector Compression

Authors: Junqi Gao, Dazhi Zhang, Zhichang Guo, Biqing Qi, Yi Ran, Wangmeng Zuo

First: 2026-04-30T16:58:05+00:00 · Latest: 2026-04-30T16:58:05+00:00

Abstract

Model merging has attracted attention as an effective path toward multi-task adaptation by integrating knowledge from multiple task-specific models. Among existing approaches, dynamic merging mitigates performance degradation caused by conflicting parameter updates across tasks by flexibly combining task-specific parameters at inference time, thereby maintaining high performance. However, these methods require storing independent parameters for each task, resulting in prohibitive storage overhead. To address this issue, we first experimentally demonstrate that the fine-tuned weight increments (referred to as task vectors) exhibit an impulse-like activation pattern and high robustness to low-bit representations. Driven by this insight, we propose T-Switch, which decomposes task vectors into three compact components: a binary sparse mask, a sign vector, and a scalar scaling factor, achieving high-fidelity approximation at high compression ratios. Building on this, we develop Auto-Switch, a training-free merging scheme that automatically assembles task vectors through feature similarity retrieval. Furthermore, to transform task vector sparsification and quantization from static rules to adaptive learning, we propose FlexSwitch, a learnable framework which jointly optimizes the compression strategy for each model unit via Learnable Gating Sparsification (LGS) and Bit-width Adaptive Selection (BAS), while employing the Sparsity-Aware Storage Strategy (SASS) to select the optimal storage encoding structure. Finally, by incorporating a K-Nearest Neighbor (KNN) inference scheme with a learnable low-rank metric, we present Auto-FlexSwitch, a dynamic model merging approach that supports highly efficient task vector compression.
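
Illustrative sketch (not from the paper): the three-part decomposition described for T-Switch (binary sparse mask, sign vector, scalar scaling factor) can be pictured as below. The abstract does not say how the mask or the scalar is chosen, so the top-magnitude mask, the mean-magnitude scalar, and the names `compress_task_vector`, `decompress`, and `keep_ratio` are assumptions.

```python
def compress_task_vector(tau, keep_ratio=0.25):
    """Approximate a task vector as alpha * mask * sign (hypothetical helper)."""
    k = max(1, int(len(tau) * keep_ratio))
    # Assumption: the binary sparse mask keeps the k largest-magnitude entries.
    top = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)[:k]
    mask = [0] * len(tau)
    for i in top:
        mask[i] = 1
    sign = [1 if v >= 0 else -1 for v in tau]
    # For a fixed mask and sign, the mean kept magnitude is the scalar that
    # minimizes the squared reconstruction error.
    alpha = sum(abs(tau[i]) for i in top) / k
    return mask, sign, alpha

def decompress(mask, sign, alpha):
    """Rebuild the approximate task vector from its three components."""
    return [alpha * m * s for m, s in zip(mask, sign)]

# Toy "impulse-like" task vector: two large entries, the rest near zero.
tau = [0.9, -0.05, 0.02, -1.1, 0.01, 0.0, 0.03, -0.04]
mask, sign, alpha = compress_task_vector(tau, keep_ratio=0.25)
approx = decompress(mask, sign, alpha)
```

Each weight then costs roughly one mask bit plus one sign bit, with a single shared scalar, which is where the storage saving in this sketch comes from.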

Summary / 总结

The paper addresses the challenge of high storage overhead in dynamic model merging by proposing Auto-FlexSwitch. It first shows that task vectors exhibit an impulse-like pattern and are robust to low-bit representations. T-Switch decomposes task vectors into a binary mask, a sign vector, and a scaling factor for high-fidelity approximation. Auto-Switch uses feature similarity retrieval for automatic composition. FlexSwitch introduces learnable gating sparsification and bit-width adaptive selection to optimize compression strategies. Finally, Auto-FlexSwitch combines a KNN inference scheme with a learnable low-rank metric for efficient task vector compression.

论文旨在解决动态模型合并中的高存储开销问题，提出了Auto-FlexSwitch。首先展示了任务向量表现出脉冲模式，并且对低比特表示具有鲁棒性。T-Switch将任务向量分解为二进制掩码、符号向量和缩放因子，实现高保真近似。Auto-Switch通过特征相似性检索自动组合任务向量。FlexSwitch引入了可学习的门控稀疏化和位宽自适应选择来优化压缩策略。最后，Auto-FlexSwitch结合了KNN推理方案和可学习的低秩度量，实现高效的任务向量压缩。

Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models

Authors: Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu

Venue: ICLR 2026

First: 2025-10-20T06:17:57+00:00 · Latest: 2026-04-30T15:40:24+00:00

Comments: ICLR 2026 camera-ready version

Abstract

Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
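
Illustrative sketch (not from the paper): chunk-based sparse attention can be pictured as scoring one landmark vector per chunk, keeping the top-k chunks, and attending only over their tokens. Mean pooling stands in here for the paper's non-linear Chunk Encoder with a CLS token, and token vectors double as keys and values; both choices, along with every function name, are assumptions made to keep the example small.

```python
import math

def chunk_landmarks(tokens, chunk_size):
    """Split tokens into chunks and summarize each chunk by its mean vector."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    landmarks = [[sum(col) / len(chunk) for col in zip(*chunk)] for chunk in chunks]
    return chunks, landmarks

def select_chunks(query, landmarks, top_k):
    """Return the indices of the top_k chunks by query-landmark dot product."""
    scores = [sum(q * l for q, l in zip(query, lm)) for lm in landmarks]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, tokens, chunk_size, top_k):
    """Softmax attention restricted to the tokens of the selected chunks."""
    chunks, landmarks = chunk_landmarks(tokens, chunk_size)
    picked = select_chunks(query, landmarks, top_k)
    selected = [tok for c in picked for tok in chunks[c]]
    logits = [sum(q * t for q, t in zip(query, tok)) for tok in selected]
    peak = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * tok[d] for w, tok in zip(weights, selected)) / total
           for d in range(len(query))]
    return picked, out

tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
picked, out = sparse_attention([0.0, 1.0], tokens, chunk_size=2, top_k=1)
```

Because only top_k chunks are ever attended to, the per-query cost depends on the number of chunks and the chunk size rather than on the full context length.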

中文标题/摘要

标题：理解并改进层次稀疏注意力模型中的长度泛化

有效处理长上下文是语言模型面临的关键挑战。虽然标准Transformer受到二次复杂度和长度外推能力差的限制，但滑动窗口注意力和状态空间模型等替代架构由于其固定大小的记忆而牺牲了充分利用完整上下文的能力。基于块的稀疏注意力已成为一种有前景的极端长度泛化范式，但其成功背后的关键架构原则尚未被完全理解。在本文中，我们系统地剖析了这些模型，以识别驱动其性能的核心组件。通过统一框架和全面的消融研究，我们证明了三种设计原则的结合至关重要：（1）带有专用CLS标记的、表达能力强的非线性块编码器，用于生成供检索使用的表示；（2）旁路残差路径，以稳定地整合检索到的全局信息，而不被局部残差流覆盖；（3）预训练期间强制选择稀疏性，以弥合训练-测试分布差距。我们为块内信息处理和地标生成提供了理论动机。通过结合这些原则，我们在免训练长度外推上达到了新的最先进水平，成功地将在4K上下文上训练的模型在RULER和BABILong上泛化到3200万个标记。我们的发现为开发未来高能力的长上下文语言模型提供了一套清晰且有实证依据的设计原则。

Summary / 总结

This work aims to improve the ability of language models to handle long contexts by analyzing and enhancing chunk-based sparse attention models. The study identifies three key design principles: an expressive Chunk Encoder with a CLS token, a Bypassing Residual Path, and enforced selection sparsity during pre-training. These principles enable better generalization to longer sequences, setting a new state-of-the-art for training-free length extrapolation on RULER and BABILong datasets.

该研究旨在通过分析和改进基于块的稀疏注意力模型来提高语言模型处理长上下文的能力。研究确定了三个关键设计原则：带有CLS标记的、表达能力强的块编码器、旁路残差路径以及预训练期间强制选择稀疏性。这些原则使模型能够更好地泛化到更长的序列上，在RULER和BABILong上达到了免训练长度外推的新的最先进水平。

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Authors: Hong-Tao Yu, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

Venue: ICLR 2026

First: 2025-04-21T09:30:41+00:00 · Latest: 2026-04-30T15:12:07+00:00

Comments: Accepted to ICLR 2026

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks, which are fundamental to computer vision, remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

中文标题/摘要

标题：大型视觉-语言模型在细粒度图像任务上的基准测试：全面评估

近年来，大型视觉-语言模型（LVLMs）在多模态感知能力方面取得了显著进展，引起了广泛关注。尽管已经出现了许多评估研究，对LVLMs进行了整体和专门任务的评估，但计算机视觉中至关重要的细粒度图像任务仍鲜有探索。为填补这一空白，我们引入了一个全面的细粒度评估基准，即FG-BMK，包含101万道问题和33万张图像。我们的评估从人类导向和机器导向两个方面系统地考察了LVLMs，重点关注它们的语义识别能力和细粒度特征表示能力。通过对十二个代表性LVLMs/VLMs进行广泛的实验，我们揭示了训练范式、模态对齐、扰动敏感性和细粒度类别推理对任务性能的影响。本研究为当前LVLMs的局限性提供了关键见解，并为未来数据构建和模型设计提供了指导，以开发更先进的LVLMs。我们的代码开源并可在https://github.com/SEU-VIPGroup/FG-BMK获取。

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Authors: Ce Chen, Yi Ren, Yuanming Li, Viktor Goriachko, Zhenhui Ye, Zujin Guo, Zhibin Hong, Mingming Gong

First: 2026-04-30T15:05:06+00:00 · Latest: 2026-04-30T15:05:06+00:00

Comments: This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/

Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs.

中文标题/摘要

标题：TransVLM：一种检测任意镜头过渡的视觉-语言框架和基准

传统的镜头边界检测（SBD）固有地难以处理复杂的过渡，因为它围绕孤立的剪辑点来定义任务，经常导致视频剪辑被破坏。我们通过正式化镜头过渡检测（STD）任务来解决这一根本限制。不同于常规地在剪辑点上寻找模糊的点，STD 明确地检测过渡的连续时间段。为了解决这个问题，我们提出了 TransVLM，一种用于 STD 的视觉-语言模型（VLM）框架。与主要依赖空间语义的常规 VLM 不同，我们的方法在输入阶段显式地注入了光学流作为关键的运动先验。通过一种简单而有效的特征融合策略，TransVLM 直接处理了颜色和运动的拼接表示，显著增强了其时间意识，而不会在语言骨干上增加任何额外的视觉标记开销。为了克服公共数据中严重的类别不平衡，我们设计了一个可扩展的数据引擎来合成多样化的过渡视频以进行稳健训练，并且还提供了一个全面的 STD 基准。广泛的实验表明，TransVLM 在整体性能上表现出色，优于传统的启发式方法、专门的空间-时间网络以及顶级的 VLM。这项工作已部署到生产环境中。如需更多相关研究，请访问 HeyGen 研究（https://www.heygen.com/research）和 HeyGen Avatar-V（https://www.heygen.com/research/avatar-v-model）。项目页面：https://chence17.github.io/TransVLM/

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Authors: Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Xiuying Chen

First: 2026-04-30T15:03:56+00:00 · Latest: 2026-04-30T15:03:56+00:00

Abstract

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-state definitions, and an overreliance on final-task success, obscuring where and why agents fail. To address this gap, we introduce FineState-Bench, a benchmark that evaluates whether an agent can correctly ground an instruction to the intended UI control and reach the exact target state. FineState-Bench comprises 2,209 instances across desktop, web, and mobile platforms, spanning four interaction families and 23 UI component types, with each instance explicitly specifying an exact target state for fine-grained state setting. We further propose FineState-Metrics, a four-stage diagnostic pipeline with stage-wise success rates: Localization Success Rate (SR@Loc), Interaction Success Rate (SR@Int), Exact State Success Rate at Locate (ES-SR@Loc), and Exact State Success Rate at Interact (ES-SR@Int), and a plug-and-play Visual Diagnostic Assistant (VDA) that generates a Description and a bounding-box Localization Hint to diagnose visual grounding as a cause of failure through controlled with/without comparisons. On FineState-Bench, exact goal-state success remains low: ES-SR@Int peaks at 32.8% on Web and 22.8% on average across platforms. With VDA localization hints, Gemini-2.5-Flash gains +14.9 ES-SR@Int points, suggesting substantial headroom from improved visual grounding, yet overall accuracy is still insufficient for reliable fine-grained state-conditioned interaction. GitHub: https://github.com/FengxianJi/FineState-Bench

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

Authors: Kenneth J. K. Ong

First: 2026-04-30T14:50:48+00:00 · Latest: 2026-04-30T14:50:48+00:00

Abstract

As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effects of visual priming on VLMs' cooperative behavior using the Iterated Prisoner's Dilemma (IPD) as a test scenario. We examine whether exposure to images depicting behavioral concepts (kindness/helpfulness vs. aggressiveness/selfishness) and color-coded reward matrices alters VLM decision patterns. Experiments were conducted across multiple state-of-the-art VLMs. We further explore mitigation strategies including prompt modifications, Chain of Thought (CoT) reasoning, and visual token reduction. Results show that VLM behavior can be influenced by both image content and color cues, with varying susceptibility and mitigation effectiveness across models. These findings not only underscore the importance of robust evaluation frameworks for VLM deployment in visually rich and safety-critical environments, but also highlight how architectural and training differences among models may lead to distinct behavioral responses, an area worthy of further investigation.

中文标题/摘要

标题：视觉启动对视觉语言模型合作行为的影响

随着视觉语言模型（VLMs）越来越多地集成到决策系统中，理解视觉输入如何影响其行为变得至关重要。本文通过迭代囚徒困境（IPD）作为测试场景，研究视觉启动对VLMs合作行为的影响。我们探讨了暴露于描绘行为概念的图像（友善/乐于助人 vs. 好斗/自私）和颜色编码的奖励矩阵是否改变了VLM的决策模式。实验在多个最先进的VLMs上进行。我们进一步探讨了包括提示修改、链式思考（CoT）推理和视觉标记减少在内的缓解策略。结果表明，VLM的行为可以受到图像内容和颜色提示的影响，不同模型在影响程度和缓解效果上存在差异。这些发现不仅强调了在视觉丰富和安全关键环境中部署VLM时需要稳健的评估框架的重要性，还突显了模型架构和训练差异可能导致不同行为反应这一领域值得进一步研究。

Summary / 总结

This paper explores how visual priming affects cooperative behavior in Vision-Language Models (VLMs) using the Iterated Prisoner's Dilemma. The study examines the impact of images and color-coded reward matrices on VLM decisions and tests multiple state-of-the-art VLMs. It also investigates mitigation strategies like prompt modifications, Chain of Thought reasoning, and visual token reduction. The results indicate that VLM behavior is influenced by visual and color cues, with different models showing varying degrees of susceptibility and differing effectiveness of mitigation strategies.

本文研究了视觉提示如何影响视觉语言模型（VLM）的合作行为，使用迭代囚徒困境作为测试场景。研究考察了图像和颜色编码奖励矩阵对VLM决策的影响，并测试了多个最先进的VLM。研究还探讨了包括提示修改、链式思考推理和视觉标记减少在内的缓解策略。结果表明，VLM的行为受到视觉和颜色提示的影响，不同模型的敏感性和缓解策略的有效性存在差异。

Diffusion-OAMP for Joint Image Compression and Wireless Transmission

Authors: Wentao Hou, Yimin Bai, Zelei Luo, Jiadong Hong, Lei Liu

First: 2026-04-30T14:49:31+00:00 · Latest: 2026-04-30T14:49:31+00:00

Comments: 6 pages, 5 figures, 2 tables, submitted for a possible publication

Abstract

Joint image compression and wireless transmission remains relatively underexplored compared to generic image restoration, despite its importance in practical communication systems. We formulate this problem under an equivalent linear model and propose Diffusion-OAMP, a training-free reconstruction framework that embeds a pre-trained diffusion model into the OAMP algorithm. In Diffusion-OAMP, the OAMP linear estimator produces pseudo-AWGN observations, while the diffusion model serves as a nonlinear estimator under an SNR-matching rule. This framework offers a way to incorporate multiple generative priors into OAMP. Experiments with varying compression ratios and noise levels show that Diffusion-OAMP performs favorably against classic methods in the evaluated settings.

Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training

Authors: Mingliang Liang, Zhuoran Liu, Arjen P. de Vries, Martha Larson

First: 2026-04-30T14:33:23+00:00 · Latest: 2026-04-30T14:33:23+00:00

Abstract

The computational cost of training a vision-language model (VLM) can be reduced by sampling the training data. Previous work on efficient VLM pre-training has pointed to the importance of semantic data balance, adjusting the distribution of topics in the data to improve VLM accuracy. However, existing efficient pre-training approaches may disproportionately remove rare concepts from the training corpus. As a result, long-tail concepts remain insufficiently represented in the training data and are not effectively captured during training. In this work, we introduce a dynamic cluster-based sampling approach (DynamiCS) that downsamples large clusters of data and upsamples small ones. The approach is dynamic in that it applies sampling at each epoch. We first show the importance of dynamic sampling for VLM training. Then, we demonstrate the advantage of our cluster-scaling approach, which maintains the relative order of semantic clusters in the data and emphasizes the long-tail. This approach contrasts with current work, which focuses only on flattening the semantic distribution of the data. Our experiments show that DynamiCS reduces the computational cost of VLM training and provides a performance advantage for long-tail concepts.
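
Illustrative sketch (not from the paper): one way to downsample large clusters and upsample small ones while keeping their relative order is to make each cluster's per-epoch quota proportional to a power of its size. The exponent `alpha`, the fixed per-epoch `budget`, and the function names are assumptions; the abstract does not state the actual scaling rule.

```python
import random

def cluster_targets(sizes, budget, alpha=0.5):
    """Per-cluster quotas proportional to size**alpha (assumed scaling rule).

    With 0 < alpha < 1 the quota is still increasing in cluster size, so the
    relative order of clusters is kept while the head is compressed.
    Rounding means the quotas may not sum exactly to the budget.
    """
    weights = {c: n ** alpha for c, n in sizes.items()}
    total = sum(weights.values())
    return {c: max(1, round(budget * w / total)) for c, w in weights.items()}

def sample_epoch(clusters, budget, epoch, alpha=0.5, seed=0):
    """Draw a fresh subset for one epoch, so each epoch sees different items."""
    rng = random.Random(seed + epoch)
    targets = cluster_targets({c: len(v) for c, v in clusters.items()}, budget, alpha)
    batch = []
    for c, items in clusters.items():
        quota = targets[c]
        if quota <= len(items):
            batch.extend(rng.sample(items, quota))      # downsample, no repeats
        else:
            batch.extend(rng.choices(items, k=quota))   # upsample with repeats
    return batch

clusters = {
    "dog": [f"dog_{i}" for i in range(100)],
    "cat": [f"cat_{i}" for i in range(25)],
    "axolotl": [f"axolotl_{i}" for i in range(4)],
}
targets = cluster_targets({c: len(v) for c, v in clusters.items()}, budget=68)
epoch_0 = sample_epoch(clusters, budget=68, epoch=0)
```

In this toy run the rare cluster grows from about 3% of the raw data to about 12% of the epoch, while the largest cluster stays the largest.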

中文标题/摘要

标题：动态聚类数据采样以实现高效且长尾意识的视觉-语言预训练

通过采样训练数据可以降低视觉-语言模型（VLM）的训练计算成本。先前关于高效VLM预训练的工作强调了语义数据平衡的重要性，通过调整数据中的主题分布来提高VLM的准确性。然而，现有的高效预训练方法可能会不成比例地从训练语料库中移除稀有概念，导致训练数据中长尾概念的代表性不足，且在训练过程中未能有效捕捉。在本文中，我们提出了一种动态聚类采样方法（DynamiCS），该方法在每个周期中对大数据簇进行下采样，对小数据簇进行上采样。该方法是动态的，因为它在每个周期中应用采样。我们首先展示了动态采样对于VLM训练的重要性，然后展示了我们聚类缩放方法的优势，该方法在数据中保持语义簇的相对顺序，并强调长尾。这种方法与当前工作不同，后者仅关注数据语义分布的扁平化。我们的实验表明，DynamiCS可以降低VLM训练的计算成本，并为长尾概念提供性能优势。

Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction

Authors: Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song

First: 2026-04-30T14:31:00+00:00 · Latest: 2026-04-30T14:31:00+00:00

Abstract

Tunnel inspection requires outputs that can support defect localization, measurement, severity grading, and engineering documentation. Existing training-free foundation-model pipelines usually stop at coarse open-vocabulary proposals, which are difficult to use directly in interference-heavy tunnel scenes. We propose a training-free framework TunnelMIND. Specifically, language-guided defect proposals are not treated as final outputs; instead, their spatial support is recalibrated at inference time through dense visual consistency, so that coarse semantic anchors can be transformed into more reliable prompts under tunnel-specific hard negatives. The resulting masks are further reconstructed into structured defect entities with category, location, geometry, severity, and context attributes, which are then mapped to retrieval-grounded explanation and engineering-readable report generation under expert knowledge constraints. On visible, GPR, and road defect tasks, TunnelMIND achieves F1 scores of 0.68, 0.78, and 0.72, respectively. Overall, TunnelMIND shows that training-free tunnel inspection can move beyond coarse localization toward structured defect evidence for engineering assessment.

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

Authors: Zhenguo Zhang, Haohan Zheng, Yishen Wang, Le Xu, Tianchen Deng, Xuefeng Chen, Qu Chen, Bo Zhang, Wuxiong Huang

First: 2025-12-16T03:19:28+00:00 · Latest: 2026-04-30T14:06:23+00:00

Abstract

The deployment of Vision-Language Models (VLMs) in safety-critical domains like autonomous driving (AD) is critically hindered by reliability failures, most notably object hallucination. This failure stems from their reliance on ungrounded, text-based Chain-of-Thought (CoT) reasoning. While existing multi-modal CoT approaches attempt mitigation, they suffer from two fundamental flaws: (1) decoupled perception and reasoning stages that prevent end-to-end joint optimization, and (2) reliance on expensive, dense localization labels. We therefore introduce OmniDrive-R1, an end-to-end VLM framework designed for autonomous driving, which unifies perception and reasoning through an interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism. Our core innovation is a reinforcement-driven visual grounding capability, enabling the model to autonomously direct its attention and "zoom in" on critical regions for fine-grained analysis. This capability is enabled by our pure two-stage reinforcement learning training pipeline and Clip-GRPO algorithm. Crucially, Clip-GRPO introduces an annotation-free, process-based grounding reward. This reward not only eliminates the need for dense labels but also circumvents the instability of external tool calls by enforcing real-time cross-modal consistency between the visual focus and the textual reasoning. Extensive experiments on DriveLMM-o1 demonstrate our model's significant improvements. Compared to the baseline Qwen2.5VL-7B, OmniDrive-R1 improves the overall reasoning score from 51.77% to 80.35%, and the final answer accuracy from 37.81% to 73.62%.

NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains

Authors: Shiyao Peng, Qianhe Zheng, Zhuodi Hao, Zichen Tang, Rongjin Li, Qing Huang, Jiayu Huang, Jiacheng Liu, Yifan Zhu, Haihong E

Venue: WWW 2026

First: 2026-04-30T13:37:01+00:00 · Latest: 2026-04-30T13:37:01+00:00

Comments: Accepted to WWW 2026

Abstract

Although precise recall is a core objective in Retrieval-Augmented Generation (RAG), a critical oversight persists in the field: improvements in retrieval performance do not consistently translate to commensurate gains in downstream reasoning. To diagnose this gap, we propose the Recall Conversion Rate (RCR), a novel evaluation metric to quantify the contribution of retrieval to reasoning accuracy. Our quantitative analysis of mainstream RAG methods reveals that as Recall@5 improves, the RCR exhibits a near-linear decay. We identify the neglect of retrieval quality in these methods as the underlying cause. In contrast, approaches that focus solely on quality optimization often suffer from inferior recall performance. Both categories lack a comprehensive understanding of retrieval quality optimization, resulting in a trade-off dilemma. To address these challenges, we propose comprehensive retrieval quality optimization criteria and introduce the NeocorRAG framework. This framework achieves holistic retrieval quality optimization by systematically mining and utilizing Evidence Chains. Specifically, NeocorRAG first employs an innovative activated search algorithm to obtain a refined candidate space. Then it ensures precise evidence chain generation through constrained decoding. Finally, the retrieved set of evidence chains guides the retrieval optimization process. Evaluated on benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ, NeocorRAG achieves SOTA performance on both 3B and 70B parameter models, while consuming less than 20% of tokens used by comparable methods. This study presents an efficient, training-free paradigm for RAG enhancement that effectively optimizes retrieval quality while maintaining high recall. Our code is released at https://github.com/BUPT-Reasoning-Lab/NeocorRAG.

Summary / 总结

The paper addresses the gap between improved retrieval performance and downstream reasoning accuracy in Retrieval-Augmented Generation (RAG) by introducing the Recall Conversion Rate (RCR) metric. It identifies that mainstream RAG methods suffer from a trade-off between retrieval quality and recall. To overcome this, the NeocorRAG framework is proposed, which optimizes retrieval quality through systematic mining and utilization of Evidence Chains. NeocorRAG achieves state-of-the-art performance on benchmarks while using fewer tokens compared to other methods.

该研究通过引入召回转换率（RCR）指标，解决了检索增强生成（RAG）中检索性能提升与下游推理准确性之间的差距问题。研究指出主流RAG方法在检索质量和召回率之间存在权衡。为解决这一问题，提出了NeocorRAG框架，该框架通过系统挖掘和利用证据链来优化检索质量。NeocorRAG在多个基准测试上实现了最先进的性能，同时使用了比其他方法更少的令牌数量。

Hyper-Dimensional Fingerprints as Molecular Representations

Authors: Jonas Teufel, Luca Torresi, André Eberhard, Pascal Friederich

First: 2026-04-30T12:53:58+00:00 · Latest: 2026-04-30T12:53:58+00:00

Comments: Code: https://doi.org/10.5281/zenodo.19373621

Abs · PDF · Code1 · Code2

Abstract

Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.
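
The idea of replacing learned message passing with fixed algebraic operations on high-dimensional vectors can be illustrated with a toy sketch. This is not the authors' HDF implementation: the bipolar vectors, the element-wise-product binding, the summation bundling, and the names `atom_vector`, `bind`, `bundle`, and `encode` are all assumptions taken from generic hyperdimensional computing, and the molecule is reduced to atom symbols plus an undirected bond list.

```python
import hashlib
import random

DIM = 256  # hypothetical dimensionality, chosen only for illustration


def atom_vector(symbol, dim=DIM):
    """Deterministic bipolar (+1/-1) hypervector derived from an atom symbol."""
    seed = int.from_bytes(hashlib.sha256(symbol.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Binding: element-wise product of two hypervectors."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors, dim=DIM):
    """Bundling: element-wise sum of a (possibly empty) list of hypervectors."""
    out = [0] * dim
    for v in vectors:
        for i, x in enumerate(v):
            out[i] += x
    return out


def sign(v):
    """Map a bundled vector back to bipolar form (ties go to +1)."""
    return [1 if x >= 0 else -1 for x in v]


def encode(atoms, bonds, layers=2, dim=DIM):
    """Training-free graph embedding: each layer binds every node state with
    the bundled states of its neighbours, then all layers are summed."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    states = [atom_vector(a, dim) for a in atoms]
    graph = bundle(states, dim)
    for _ in range(layers):
        states = [
            bind(states[i], sign(bundle([states[j] for j in neighbors[i]], dim)))
            for i in range(len(atoms))
        ]
        graph = [g + s for g, s in zip(graph, bundle(states, dim))]
    return graph


def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm if norm else 0.0


ethanol = encode(["C", "C", "O"], [(0, 1), (1, 2)])
propane = encode(["C", "C", "C"], [(0, 1), (1, 2)])
print(cosine(ethanol, propane))
```

Because every step is a sum or product of fixed vectors, the embedding is deterministic and does not depend on the order in which atoms are listed, which is the property the abstract attributes to training-free fingerprints.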

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Authors: Wenrui Zhou, Mohamed Hendy, Shu Yang, Qingsong Yang, Zikun Guo, Yuyu Luo, Lijie Hu, Di Wang

Venue: ACL 2026

First: 2025-06-08T15:00:21+00:00 · Latest: 2026-04-30T12:27:19+00:00

Comments: 27 Pages, Accepted by ACL 2026 Main Conference

Abs · PDF · Code1 · Code2

Abstract

As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the video domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. Furthermore, we propose two training-free mitigation strategies that reveal potential paths for reducing sycophantic bias: (i) enhancing visual grounding through interpretable key-frame selection and (ii) steering model behavior away from sycophancy via targeted, inference-time intervention on its internal neural representations. Our code is available at https://anonymous.4open.science/r/VideoSycophancy-567F.

Summary / 总结

The research aims to address the issue of sycophancy in Video-LLMs, which can undermine their factual consistency and reliability. To tackle this, the authors developed VISE, the first benchmark to evaluate sycophantic behavior in Video-LLMs across various question formats and visual reasoning tasks. Key findings include the identification of different sycophancy types and the proposal of two mitigation strategies: enhancing visual grounding and steering model behavior through inference-time interventions on neural representations.

研究旨在解决视频大语言模型（Video-LLMs）中的奉承行为问题，这可能削弱其事实一致性与可靠性。为此，作者开发了VISE，这是首个用于评估Video-LLMs在不同问题格式和视觉推理任务中的奉承行为的基准。主要发现包括识别不同类型的奉承行为，并提出了两种缓解策略：增强视觉接地和通过推理时对神经表示的定向干预来引导模型行为。

Iterative Multimodal Retrieval-Augmented Generation for Medical Question Answering

Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Jiaqi Zhang, Kewen Wang, Ran Gong, Jinhan Zhang, Chihang Wang

First: 2026-04-30T11:16:07+00:00 · Latest: 2026-04-30T11:16:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Medical retrieval-augmented generation (RAG) systems typically operate on text chunks extracted from biomedical literature, discarding the rich visual content (tables, figures, structured layouts) of original document pages. We propose MED-VRAG, an iterative multimodal RAG framework that retrieves and reasons over PMC document page images instead of OCR'd text. The system pairs ColQwen2.5 patch-level page embeddings with a sharded MapReduce LLM filter, scaling to ~350K pages while keeping Stage-1 retrieval under 30 ms via an offline coarse-to-fine index (C=8 centroids per page, ANN over centroids, exact two-way scoring on the top-R shortlist). A vision-language model (VLM) then iteratively refines its query and accumulates evidence in a memory bank across up to 3 reasoning rounds, with a single iteration costing ~15.9 s and the full three-round pipeline ~47.8 s on 4xA100. Across four medical QA benchmarks (MedQA, MedMCQA, PubMedQA, MMLU-Med), MED-VRAG reaches 78.6% average accuracy. Under controlled comparison with the same Qwen2.5-VL-32B backbone, retrieval contributes a +5.8 point gain over the no-retrieval baseline; we also note a +1.8 point edge over MedRAG + GPT-4 (76.8%), with the caveat that this is a cross-paper rather than head-to-head comparison. Ablations isolate +1.0 from page-image vs text-chunk retrieval, +1.5 from iteration, and +1.0 from the memory bank.

中文标题/摘要

标题：迭代多模态检索增强生成在医疗问答中的应用

医疗检索增强生成（RAG）系统通常基于从生物医学文献中提取的文本片段运行，忽略了原始文档页面中的丰富视觉内容（表格、图表、结构化布局）。我们提出了一种迭代多模态RAG框架MED-VRAG，该框架检索和推理的是PMC文档页面图像，而不是OCR文本。该系统将ColQwen2.5页面级片段嵌入与分片MapReduce LLM过滤器配对，可扩展到约35万页，通过离线粗到细索引（每页8个质心，质心的ANN，精确的两向评分在前R短列表上）使第一阶段检索保持在30毫秒以下。视觉语言模型（VLM）随后在最多3轮推理中迭代地细化其查询并在记忆库中累积证据，单轮迭代成本约为15.9秒，完整的三轮管道成本约为47.8秒（4xA100）。在四个医疗问答基准（MedQA、MedMCQA、PubMedQA、MMLU-Med）上，MEDVRAG达到78.6%的平均准确率。在与相同Qwen2.5-VL-32B骨干的受控比较中，检索比无检索基线提高了5.8分；我们还注意到与MedRAG + GPT-4相比有1.8分的优势（76.8%），但需注意这是跨论文而非直接对比。消融实验分别隔离了页面图像与文本片段检索的+1.0分增益、迭代的+1.5分增益以及记忆库的+1.0分增益。
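
The coarse-to-fine index described in the MED-VRAG abstract (a few centroids per page, a cheap search over centroids, then exact scoring on a shortlist) can be sketched as below. This is an illustration under stated assumptions, not the paper's pipeline: chunk averaging stands in for whatever clustering produces the centroids, a brute-force scan stands in for the ANN search, one-directional MaxSim stands in for the "exact two-way scoring", and the names `centroids`, `maxsim`, and `search` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def centroids(patches, c):
    """Stand-in for clustering: average contiguous chunks of patch
    embeddings into at most c centroids for one page."""
    c = min(c, len(patches))
    size = -(-len(patches) // c)  # ceiling division
    out = []
    for start in range(0, len(patches), size):
        chunk = patches[start:start + size]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out


def maxsim(query_tokens, vectors):
    """Late-interaction score: each query token takes its best match."""
    return sum(max(dot(q, v) for v in vectors) for q in query_tokens)


def search(query_tokens, pages, index, shortlist_size, top_k):
    """Stage 1: rank pages by their centroids (brute force here, ANN in a
    real system). Stage 2: rescore only the shortlist on full patches."""
    coarse = sorted(pages, key=lambda pid: maxsim(query_tokens, index[pid]),
                    reverse=True)
    shortlist = coarse[:shortlist_size]
    exact = sorted(shortlist, key=lambda pid: maxsim(query_tokens, pages[pid]),
                   reverse=True)
    return exact[:top_k]


# Toy corpus: page id -> list of 2-D patch embeddings (made-up values).
pages = {
    "a": [[1.0, 0.0], [1.0, 0.0]],
    "b": [[0.0, 1.0], [0.0, 1.0]],
    "c": [[0.7, 0.7], [0.7, 0.7]],
}
index = {pid: centroids(patches, 8) for pid, patches in pages.items()}
print(search([[1.0, 0.0]], pages, index, shortlist_size=2, top_k=1))
```

The design point is that the expensive patch-level scoring touches only `shortlist_size` pages, so its cost no longer grows with the size of the corpus.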

Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

Authors: Xupeng Chen, Binbin Shi, Chenqian Le, Qifu Yin, Lang Lin, Haowei Ni, Ran Gong, Panfeng Li

First: 2026-04-30T11:11:47+00:00 · Latest: 2026-04-30T11:11:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Deploying vision-language models (VLMs) in clinical settings demands auditable behavior under realistic failure conditions, yet the failure landscape of frontier VLMs on specialized medical inputs is poorly characterized. We audit five recent frontier and grounding-aware VLMs (Gemini 2.5 Pro, GPT-5, o3, GLM-4.5V, Qwen 2.5 VL) on Medical VQA along two trust-relevant axes. Perception: all models localize anatomical and pathological targets poorly (the best model reaches only 0.23 mean IoU and 19.1% Acc@0.5) and exhibit clinically dangerous laterality confusion. Pipeline integration: a self-grounding pipeline, where the same model localizes then answers, degrades VQA accuracy for every model, driven by both inaccurate localization and format-compliance failures under the two-step prompt (parse failure rises to 70%-99% for Gemini and GPT-5 on VQA-RAD). Replacing predicted boxes with ground-truth annotations recovers and improves VQA accuracy, consistent with the failure residing in the perception module rather than in the decomposition itself. These observational findings identify grounding quality as a primary trustworthiness bottleneck in our SLAKE bounding-box setting. As a complementary fine-tuning follow-up, supervised fine-tuning of Qwen 2.5 VL on combined Med-VQA training data attains the highest reported SLAKE open-ended recall (85.5%) among comparable methods, suggesting that the VQA-level gap is tractable with domain adaptation; whether this also closes the perception/trustworthiness bottleneck is left to future work.
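
For readers unfamiliar with the grounding metrics quoted in the abstract above, the following minimal sketch shows how mean IoU and Acc@0.5 are commonly computed for axis-aligned boxes. The box format `(x_min, y_min, x_max, y_max)`, the `>=` threshold convention, and the function names are assumptions; the paper's exact evaluation protocol may differ.

```python
def iou(a, b):
    """Intersection over union of two boxes given as
    (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(predicted, ground_truth, threshold=0.5):
    """Return (mean IoU, fraction of predictions with IoU >= threshold)."""
    scores = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    mean_iou = sum(scores) / len(scores)
    accuracy = sum(s >= threshold for s in scores) / len(scores)
    return mean_iou, accuracy


# Toy example with made-up boxes: one exact hit, one partial overlap.
predicted = [(0, 0, 2, 2), (0, 0, 2, 2)]
ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
print(grounding_metrics(predicted, ground_truth))
```

Under this definition a mean IoU of 0.23 means that predicted and annotated regions overlap on less than a quarter of their combined area on average.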

Training-Free Reward-Guided Image Editing via Trajectory Optimal Control

Authors: Jinho Chang, Jaemin Kim, Jong Chul Ye

Venue: ICLR 2026 Poster

First: 2025-09-30T06:34:37+00:00 · Latest: 2026-04-30T11:11:41+00:00

Comments: Poster in ICLR 2026; 22 pages, 9 figures. The code is available at https://github.com/jinhojsk515/ITOC

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward-guided guidance, which steers the generation process during inference to align with specific objectives. However, applying this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.

中文标题/摘要

标题：基于轨迹最优控制的无需训练奖励引导图像编辑

近期在扩散和流匹配模型方面的进展展示了其在高保真图像合成方面的卓越能力。研究的一个重要方向是奖励引导的指导，该方法在推理过程中引导生成过程以满足特定目标。然而，将这种奖励引导的方法应用于需要保留源图像语义内容同时增强目标奖励的图像编辑任务，尚未得到充分探索。在本文中，我们提出了一种新的无需训练的奖励引导图像编辑框架。我们将编辑过程形式化为一个轨迹最优控制问题，其中扩散模型的逆过程被视为从源图像出发的可控轨迹，通过迭代更新伴随状态来引导编辑过程。通过在不同编辑任务上的广泛实验，我们证明了我们的方法在奖励最大化和对源图像保真度之间取得了优于现有基于反转的无需训练指导基线的显著平衡，同时没有出现奖励作弊。

Summary / 总结

This work introduces a training-free reward-guided image editing framework that formulates the editing process as a trajectory optimal control problem. The reverse process of a diffusion model is treated as a controllable trajectory starting from the source image, with adjoint states iteratively updated to steer the editing process. Experiments show that this approach outperforms existing inversion-based training-free guidance methods, balancing reward maximization and source image fidelity without reward hacking.

该研究提出了一种无需训练的奖励引导图像编辑框架，将编辑过程建模为轨迹最优控制问题。扩散模型的逆过程被视为从源图像出发的可控轨迹，通过迭代更新伴随状态来引导编辑过程。实验表明，该方法在最大化奖励和保持源图像保真度之间取得了更好的平衡，且未出现奖励作弊现象，优于现有基于反转的无需训练的引导方法。

Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models via Data-Free Flatness-Aware Prompt Pretraining

Authors: Hyeonseo Jang, Jaebyeong Jeon, Joong-Won Hwang, Kibok Lee

Venue: CVPR 2026

First: 2026-04-30T11:01:23+00:00 · Latest: 2026-04-30T11:01:23+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Test-time prompt tuning (TPT) has emerged as a promising technique for enhancing the adaptability of vision-language models by optimizing textual prompts using unlabeled test data. However, prior studies have observed that TPT often produces poorly calibrated models, raising concerns about the reliability of their predictions. Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. Motivated by this observation, we introduce Flatness-aware Prompt Pretraining (FPP), a simple yet effective pretraining framework for TPT that initializes prompts within flatter regions of the loss landscape prior to adaptation. We show that simply replacing the initialization in existing TPT pipelines--without modifying any other components--is sufficient to improve both calibration and performance. Notably, FPP requires no labeled data and incurs no additional computational costs during test-time tuning, making it highly practical for real-world deployment. The code is available at: https://github.com/YonseiML/fpp.

Summary / 总结

This paper addresses the issue of poorly calibrated models in Test-time prompt tuning (TPT) for vision-language models. It proposes Flatness-aware Prompt Pretraining (FPP), which initializes prompts in flatter regions of the loss landscape to improve calibration without degrading performance. Experiments show that FPP enhances both calibration and performance, and it does not require additional labeled data or computational costs during test-time tuning.

本文解决了Test-time prompt tuning (TPT)在视觉-语言模型中的模型校准问题。提出了Flatness-aware Prompt Pretraining (FPP)，通过将初始化的提示置于损失景观的更平坦区域来提高校准，而不降低性能。实验表明，FPP可以同时提升校准和性能，且无需额外的标注数据或测试时的计算成本。

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

Authors: Ke Xu

First: 2026-04-30T09:19:26+00:00 · Latest: 2026-04-30T09:19:26+00:00

Comments: 16 pages, 3 figures, 8 tables

Abs · PDF · Code1 · Code2

Abstract

We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubric criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.

中文标题/摘要

标题：WaferSAGE：基于合成数据生成和准则导向强化学习的大语言模型驱动晶圆缺陷分析

我们提出了WaferSAGE，一种使用小型视觉语言模型进行晶圆缺陷视觉问答的框架。为了解决半导体制造中的数据稀缺问题，我们提出了一种包含结构化准则生成的三阶段合成流水线，以实现精确评估。从有限的晶圆图标注开始，我们采用基于聚类的清理来过滤标签噪声，然后使用视觉语言模型生成全面的缺陷描述，这些描述被转换为结构化的评估准则。这些准则指导VQA对的合成，确保覆盖缺陷类型识别、空间分布、形态和根本原因分析。我们的双重评估框架通过贝叶斯优化将基于规则的度量与LLM-裁判评分对齐，从而实现可靠的自动化评估。通过基于课程的强化学习和组序列策略优化（GSPO）以及准则对齐的奖励，我们的4B参数Qwen3-VL模型获得了6.493的LLM-裁判评分，接近Gemini-3-Flash（7.149），同时实现了完全本地部署。我们证明，具有领域特定训练的小模型可以在专门的工业视觉理解中超越专有的大型模型，为半导体制造中提供隐私保护、成本效益的部署提供了一条可行路径。

Summary / 总结

WaferSAGE is a framework for wafer defect visual question answering using small vision-language models. It addresses data scarcity in semiconductor manufacturing through a three-stage synthesis pipeline involving structured rubric generation. Starting from limited labeled wafer maps, WaferSAGE employs clustering-based cleaning, generates comprehensive defect descriptions, and converts them into structured evaluation rubrics. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. The framework uses a dual assessment approach with Bayesian optimization to align rule-based metrics with LLM-Judge scores. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, the 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment.

WaferSAGE 是一种使用小型视觉语言模型进行晶圆缺陷视觉问答的框架。它通过包含结构化评价标准生成的三阶段合成管道来解决半导体制造中的数据稀缺问题。从有限的标注晶圆图开始，WaferSAGE 使用聚类基清洗，生成全面的缺陷描述，并将其转换为结构化的评价标准。这些标准指导 VQA 对应物的合成，确保覆盖缺陷类型识别、空间分布、形态和根本原因分析。该框架使用贝叶斯优化方法将基于规则的度量与LLM-裁判评分对齐。通过基于课程的强化学习与组序列策略优化（GSPO）和评价标准对齐的奖励，4B参数的Qwen3-VL 模型实现了6.493的LLM-裁判评分，接近Gemini-3-Flash（7.149），同时实现了完全本地部署。

Test-Time Distillation for Continual Model Adaptation

Authors: Xiao Chen, Jiazhen Huang, Zhiming Liu, Qinting Jiang, Fanding Huang, Jingyan Jiang, Zhi Wang

Venue: CVPR 2026

First: 2025-06-03T09:16:51+00:00 · Latest: 2026-04-30T09:01:20+00:00

Comments: Accepted by CVPR 2026 Findings

Abs · PDF · Code1 · Code2 · Code3

Abstract

Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner. However, existing methods that rely on self-supervision are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: (1) the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts; and (2) the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls highlight the need to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. Then it applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% with only 48% of its time cost on ImageNet-C. Project page is publicly available at https://github.com/walawalagoose/TTD.
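
The blended-teacher step described in the CoDiRe abstract above, fusing VLM and target-model predictions with Maximum Softmax Probability (MSP) as the confidence weight, might look roughly like the sketch below. The specific weighting rule (weights proportional to each model's MSP) and the names `softmax` and `blend_by_msp` are assumptions made for illustration; the abstract does not give the exact formula, and the Optimal Transport-based rectification is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_by_msp(vlm_probs, target_probs):
    """Fuse two class distributions, weighting each model by its maximum
    softmax probability (assumed rule: weights proportional to MSP)."""
    w_vlm = max(vlm_probs)
    w_target = max(target_probs)
    total = w_vlm + w_target
    return [
        (w_vlm * v + w_target * t) / total
        for v, t in zip(vlm_probs, target_probs)
    ]


# Toy two-class example with made-up logits for a single test sample.
vlm = softmax([2.0, 0.0])      # confident VLM prediction
target = softmax([0.0, 0.3])   # uncertain target-model prediction
teacher = blend_by_msp(vlm, target)
print(teacher)
```

Unlike an entropy-based weight, the MSP depends only on each model's top-class probability, which is the property the abstract relies on to sidestep the calibration mismatch between heterogeneous models.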

CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation

Authors: Sonali Sharma, Jin Long, George Shih, Sarah Eid, Christian Bluethgen, Francine L. Jacobson, Emily B. Tsai, Global Radiology Consortium, Ahmed M. Alaa, Curtis P. Langlotz

First: 2026-04-29T04:33:43+00:00 · Latest: 2026-04-30T08:02:58+00:00

Comments: 51 pages, 7 figures, 10 tables

Abs · PDF · Code1 · Code2

Abstract

Chest X-ray interpretation is one of the most frequently performed diagnostic tasks in medicine and a primary target for AI development, yet current vision-language models are primarily trained on datasets of paired images and reports, not the cognitive processes and visual attention that underlie clinical reasoning. Here, we present CheXthought, a global, multimodal resource containing 103,592 chain-of-thought reasoning traces and 6,609,082 synchronized visual attention annotations across 50,312 multi-read chest X-rays from 501 radiologists in 71 countries. Our analysis reveals clinical reasoning patterns in how experts deploy distinct visual search strategies, integrate clinical context, and communicate uncertainty. We demonstrate the clinical utility of CheXthought across four dimensions. First, CheXthought reasoning significantly outperforms state-of-the-art vision-language model chain-of-thought in factual accuracy and spatial grounding. Second, visual attention data used as an inference-time hint recovers missed findings and significantly reduces hallucinations. Third, vision-language models trained on CheXthought data achieve significantly stronger pathology classification, visual faithfulness, temporal reasoning and uncertainty communication. Fourth, leveraging CheXthought's multi-reader annotations, we predict both human-human and human-AI disagreement directly from an image, enabling transparent communication of case difficulty, uncertainty and model reliability. These findings establish CheXthought as a resource for advancing multimodal clinical reasoning and the development of more transparent, interpretable vision-language models.

Summary / 总结

CheXthought is a global multimodal dataset containing 103,592 chain-of-thought reasoning traces and 6,609,082 visual attention annotations from 50,312 chest X-rays read by 501 radiologists. It reveals clinical reasoning patterns and visual search strategies. The dataset demonstrates that CheXthought reasoning outperforms state-of-the-art vision-language models in factual accuracy and spatial grounding. Visual attention data improves model performance and reduces hallucinations. Models trained on CheXthought show better pathology classification and uncertainty communication. Multi-reader annotations predict human-human and human-AI disagreements, enhancing model transparency and interpretability.

CheXthought 是一个包含 103,592 条临床推理链和 6,609,082 个视觉注意力标注的全球多模态数据集，来自 50,312 张胸部 X 光片，由 501 名放射科医生在 71 个国家阅读。该数据集揭示了临床推理模式和视觉搜索策略。研究表明，CheXthought 的推理在事实准确性与空间定位上优于最先进的视觉-语言模型。视觉注意力数据提高了模型性能并减少了幻觉。基于 CheXthought 训练的模型在病理分类、视觉忠实度和不确定性沟通方面表现更好。多读者标注可以预测人类与人类、人类与 AI 的分歧，增强模型的透明度和可解释性。

EdgeFM: Efficient Edge Inference for Vision-Language Models

Authors: Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An

First: 2026-04-30T06:18:50+00:00 · Latest: 2026-04-30T06:18:50+00:00

Comments: Technical report version

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to 1.49 times speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

Summary / 总结

EdgeFM is a lightweight, agent-driven inference framework for vision-language models (VLMs) and LLMs, designed for cross-platform industrial edge deployment. It removes non-essential features to reduce single-request latency and packages agent-tuned kernel optimizations as a modular library of reusable skills that can be invoked directly, narrowing the performance gap with proprietary toolchains. The framework supports x86, NVIDIA Orin, and Horizon Journey platforms and achieves up to a 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin.

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

Authors: Meghana Kshirsagar, Allen Nie, Ching-An Cheng, Fanglei Xue, Rahul Dodhia, Juan Lavista Ferres, Kevin K. Yang, Frank DiMaio

First: 2026-04-19T00:20:18+00:00 · Latest: 2026-04-30T05:12:50+00:00

Abs · PDF · Code1 · Code2

Abstract

We introduce RosettaSearch, an inference-time multi-objective optimization approach for backbone-conditioned protein sequence design. We use large language models (LLMs) as a generative optimizer within a search algorithm capable of controlled exploration and exploitation, using rewards computed from RosettaFold3, a structure prediction model, under a strict computational budget. In a large-scale evaluation, we apply RosettaSearch to 400 suboptimal sequences generated by LigandMPNN (a state-of-the-art model trained for protein sequence design), recovering high-fidelity designs that LigandMPNN's single-pass decoding fails to produce. RosettaSearch's designs show improvements in structural fidelity metrics ranging from 18% to 68%, translating to a 2.5x improvement in design success rate. We observe that these gains in success rate are robust when RosettaSearch-designed sequences are evaluated with an independent structure prediction oracle (Chai-1) and generalize across two distinct LLM families (o4-mini and Gemini-3), with performance scaling consistently with reasoning capability. We further demonstrate that RosettaSearch improves the sequence fidelity of ProteinMPNN designs for de novo backbones from the Dayhoff atlas, showing that the approach generalizes beyond native protein structures to computationally generated backbones. We also demonstrate a multi-modal extension of RosettaSearch with vision-language models, where images of predicted protein structures are used as feedback to incorporate structural context to guide protein sequence generation. To our knowledge, this is the first large-scale demonstration that LLMs can serve as effective generative optimizers for backbone-conditioned protein sequence design, yielding systematic gains without any model retraining.

Summary / 总结

RosettaSearch is an inference-time multi-objective optimization approach for protein sequence design, using large language models as a generative optimizer within a search algorithm. It improves structural fidelity metrics by 18% to 68%, leading to a 2.5x increase in design success rate compared to single-pass decoding by LigandMPNN. The method generalizes across different LLM families and shows improvements for de novo backbones and multi-modal extensions with vision-language models.

RosettaSearch 是一种在推断时进行多目标优化的蛋白质序列设计方法，使用大型语言模型作为生成优化器嵌入在搜索算法中。它通过将结构准确度指标提高18%到68%，使设计成功率提高了2.5倍，相比LigandMPNN的一次性解码。该方法在不同的LLM家族中具有普适性，并且在新的骨架和多模态扩展（结合视觉语言模型）中也显示出改进。

Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

Authors: David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Mert D. Pese

Venue: SAE Technical Paper 2026-01-0170, SAE WCX 2026

First: 2026-04-30T04:33:38+00:00 · Latest: 2026-04-30T04:33:38+00:00

Comments: 9 pages, 2 figures. Accepted at SAE WCX 2026

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning, supporting more interpretable decision-making. Yet their robustness to physical adversarial attacks, especially whether such attacks transfer across different VLM architectures, is not well understood, which poses a practical risk when attackers do not know which model a vehicle uses. We address this gap with a systematic cross-architecture study of adversarial transferability in VLM-based driving, evaluating three representative architectures (Dolphins, OmniDrive, and LeapVAD) using physically realizable patches placed on roadside infrastructure in both crosswalk and highway scenarios. Our transfer-matrix evaluation shows high cross-architecture effectiveness, with transfer rates of 73-91% (mean TR = 0.815 for crosswalk and 0.833 for highway) and sustained frame-level manipulation over 64.7-79.4% of the critical decision window even when patches are not optimized for the target model.

中文标题/摘要

标题：理解视觉语言模型在自动驾驶中的对抗转移性：跨架构分析

视觉语言模型（VLMs）在自动驾驶中越来越受欢迎，因为它们结合了视觉感知和基于语言的推理，支持更具解释性的决策，但它们对物理对抗攻击的鲁棒性，尤其是这些攻击是否在不同的VLM架构之间转移，尚未得到充分理解，当攻击者不知道车辆使用的是哪个模型时，这会带来实际风险。我们通过在人行横道和高速公路场景中使用物理可实现的补丁放置在路边基础设施上，对基于VLM的驾驶中的对抗转移性进行了系统性的跨架构研究，评估了三种代表性架构（Dolphins、OmniDrive和LeapVAD）。我们的转移矩阵评估显示了高跨架构有效性，转移率为73-91%（人行横道的平均转移率TR = 0.815，高速公路为0.833），即使补丁未针对目标模型进行优化，也能在关键决策窗口的64.7-79.4%的帧级上保持操纵。

Summary / 总结

The study investigates the transferability of adversarial attacks across different vision-language model architectures in autonomous driving, focusing on three representative models (Dolphins, OmniDrive, and LeapVAD). By placing physically realizable patches on roadside infrastructure, the research demonstrates high transfer rates of 73-91% across architectures, with sustained manipulation over 64.7-79.4% of the critical decision window in both crosswalk and highway scenarios.

该研究探讨了不同视觉-语言模型架构在自动驾驶系统中对抗攻击的跨架构转移性。通过在人行横道和高速公路场景中使用实际可实现的贴片对三种代表性架构（Dolphins、OmniDrive 和 LeapVAD）进行评估，研究发现跨架构转移性效果显著，转移率为73-91%，并在关键决策窗口的64.7-79.4%时间内持续操纵，表明攻击者可能利用这种转移性对不同模型发起攻击的风险较大。
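
A transfer-matrix evaluation like the one in the adversarial transferability study above can be sketched as follows. The abstract does not define the transfer rate (TR), so this sketch assumes it is the fraction of trials in which a patch optimized on a source model also succeeds against a target model, averaged over source/target pairs with different architectures. The function names, the model labels "A" and "B", and the trial outcomes are hypothetical toy values, not results from the paper.

```python
def transfer_matrix(outcomes):
    """Map each (source, target) pair to its attack success rate.
    outcomes[(source, target)] is a list of booleans, one per trial."""
    return {
        pair: sum(trials) / len(trials)
        for pair, trials in outcomes.items()
    }


def mean_transfer_rate(matrix):
    """Average success rate over pairs where source and target differ."""
    off_diagonal = [
        rate for (source, target), rate in matrix.items() if source != target
    ]
    return sum(off_diagonal) / len(off_diagonal)


# Toy outcomes for two hypothetical models (made-up data).
outcomes = {
    ("A", "A"): [True, True],
    ("A", "B"): [True, False],
    ("B", "A"): [True, True],
    ("B", "B"): [True, True],
}
matrix = transfer_matrix(outcomes)
print(matrix[("A", "B")], mean_transfer_rate(matrix))
```

Excluding the diagonal keeps white-box results, where the patch is evaluated on the model it was optimized for, from inflating the cross-architecture figure.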

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Authors: Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan

First: 2026-04-30T03:39:32+00:00 · Latest: 2026-04-30T03:39:32+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.

Summary / 总结

VeraRetouch is a lightweight, fully differentiable framework for multi-task photo retouching that addresses the limitations of existing approaches built on non-differentiable external software. It uses a 0.5B Vision-Language Model to generate retouching plans and a fully differentiable Retouch Renderer to replace external tools, allowing end-to-end pixel-level training. The framework also introduces AetherRetouch-1M+, a million-scale dataset, and DAPO-AE, a reinforcement learning post-training strategy. Experiments show that VeraRetouch achieves state-of-the-art performance across multiple benchmarks with a significantly smaller footprint, enabling mobile deployment.

VeraRetouch 是一个轻量级的全可微框架，用于多任务照片修图，解决了现有非可微方法的局限性。它使用一个 0.5B 视觉-语言模型生成修图计划，并使用一个全可微的修图渲染器替换外部工具，允许端到端的像素级训练。该框架还引入了 AetherRetouch-1M+，一个百万规模的数据集，以及 DAPO-AE，一种强化学习后训练策略，以提高性能。实验表明，VeraRetouch 在多个基准测试中表现出色，同时体积更小，适合移动部署。
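
To illustrate why a differentiable renderer permits direct pixel-level training, the minimal sketch below fits a single exposure-gain parameter by gradient descent on a pixel loss. This is not the paper's Retouch Renderer: the names `render`, `mse`, and `fit_gain`, the one-parameter gain model, and the sample pixel values are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy, not VeraRetouch's renderer: a one-parameter "renderer"
# that scales pixel intensities, trained directly on a pixel-level loss.

def render(pixels, gain):
    """Apply a differentiable exposure gain to a list of pixel intensities."""
    return [gain * p for p in pixels]

def mse(pred, target):
    """Mean squared error between two equally long pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

def fit_gain(pixels, target, gain=1.0, lr=0.5, steps=200):
    """Gradient descent on the MSE between rendered and target pixels."""
    n = len(pixels)
    for _ in range(steps):
        pred = render(pixels, gain)
        # Analytic gradient: d(MSE)/d(gain) = (2/n) * sum((gain*p - t) * p)
        grad = (2.0 / n) * sum((a - t) * p for a, t, p in zip(pred, target, pixels))
        gain -= lr * grad
    return gain

source = [0.10, 0.20, 0.30, 0.40]
target = [1.5 * p for p in source]  # stand-in for a retouched reference image
learned = fit_gain(source, target)
print(round(learned, 3), mse(render(source, learned), target))
```

Because every step from parameter to pixel is differentiable, the loss gradient reaches the control parameter directly. A non-differentiable external tool would break that chain, which is the optimization barrier the abstract describes.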

CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling

Authors: Yingrui Wu, Youkang Kong, Mingyang Zhao, Weize Quan, Dong-Ming Yan, Yang Liu

First: 2026-04-30T03:18:26+00:00 · Latest: 2026-04-30T03:18:26+00:00

Comments: SIGGRAPH 2026 (Journal Track), Code: https://github.com/YingruiWoo/CasLayout


Abstract

Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To maintain physical validity within complex floor plans, we explicitly model building elements (e.g., walls, doors, and windows) as conditional constraints. Furthermore, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation aligned with human spatial descriptions. By encoding these sparse graphs into a compact latent space using a bidirectional Variational Autoencoder (VAE), the proposed framework provides enhanced relational controllability, allowing generated layouts to better respect functional organization. Experiments demonstrate that CasLayout achieves state-of-the-art performance in fidelity and diversity while enabling improved controllability in practical applications.

Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

Authors: Naeem Rehmat, Muhammad Saad Saeed, Ijaz Ul Haq, Khalid Malik

Venue: CVPR NeXD Workshop 2026

First: 2026-04-30T02:25:33+00:00 · Latest: 2026-04-30T02:25:33+00:00

Comments: Accepted at CVPR NeXD Workshop (2026)


Abstract

Web filtering systems rely on accurate web content classification to block cyber threats, prevent data exfiltration, and ensure compliance. However, classification is increasingly difficult due to the dynamic and rapidly evolving nature of the modern web. Embedding-based zero-shot approaches map content and category descriptions into a shared semantic space, enabling label assignment without labeled training data, but remain highly sensitive to definition quality. Poorly specified or ambiguous definitions create semantic overlap in the embedding space, leading to systematic misclassification. In this paper, we propose a training-free, adaptive iterative definition refinement framework that improves zero-shot web content classification by progressively optimizing category definitions rather than updating model parameters. Using LLMs as feedback-driven definition optimizers, we investigate three refinement strategies namely example-guided, confusion-aware, and history-aware, each refining class descriptions using structured signals from misclassified instances. Furthermore, we introduce a human-labeled benchmark of 10 URL categories with 1,000 samples per class and evaluate across 13 state-of-the-art embedding foundation models. Results demonstrate that iterative definition refinement consistently improves classification performance across diverse architectures, establishing definition quality as a critical and underexplored factor in embedding-based systems. The dataset is available at https://github.com/naeemrehmat/B2MWT-10C.

中文标题/摘要

标题：基于LLM语义原型优化的零样本分类迭代定义精炼

网络过滤系统依赖准确的网络内容分类来阻止网络威胁、防止数据外泄并确保合规。然而，由于现代网络动态且快速演变，分类变得越来越困难。基于嵌入的零样本方法将内容和类别描述映射到共享的语义空间中，从而无需标注训练数据即可分配标签，但对定义质量仍然高度敏感。定义不明确或含糊会在嵌入空间中造成语义重叠，导致系统性误分类。在本文中，我们提出了一种无需训练的自适应迭代定义精炼框架，通过逐步优化类别定义而非更新模型参数来改进零样本网络内容分类。我们将LLM用作反馈驱动的定义优化器，研究了三种精炼策略，即示例引导、混淆感知和历史感知，每种策略都利用来自误分类实例的结构化信号来精炼类别描述。此外，我们引入了一个包含10个URL类别、每个类别1,000个样本的人工标注基准，并在13个最先进的嵌入基础模型上进行了评估。结果表明，迭代定义精炼在不同架构上一致地提高了分类性能，确立了定义质量是基于嵌入的系统中一个关键且尚未充分探索的因素。数据集可在 https://github.com/naeemrehmat/B2MWT-10C 获取。

Summary / 总结

This paper addresses the challenge of accurate web content classification in dynamic web environments by proposing an iterative definition refinement framework for zero-shot classification. The method uses LLMs to optimize category definitions through example-guided, confusion-aware, and history-aware strategies, without training the underlying model. Experiments across 13 state-of-the-art embedding models on a human-labeled benchmark dataset show consistent performance improvements, highlighting the importance of definition quality in embedding-based systems.

本文针对动态网络环境中网页内容的准确分类问题，提出了一种用于零样本分类的迭代定义精炼框架。该方法利用LLM，通过示例引导、混淆感知和历史感知三种策略优化类别定义，而无需训练底层模型。在人工标注的基准数据集上对13个最先进的嵌入模型进行的实验显示出一致的性能提升，突显了定义质量在基于嵌入的系统中的重要性。
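
The sensitivity of embedding-based zero-shot classification to definition quality can be shown with a minimal sketch. It is an illustration under stated assumptions, not the paper's pipeline: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the category definitions and sample page text are invented, and the "refined" definition is written by hand where the paper would use an LLM optimizer.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, definitions):
    """Assign the category whose definition is most similar to the text."""
    vec = embed(text)
    return max(definitions, key=lambda c: cosine(vec, embed(definitions[c])))

page = "buy shoes online with free shipping cart checkout"

# Vague definitions overlap in the shared space and mislabel the page.
vague = {
    "news": "site with online articles",
    "shopping": "site with products",
}

# Hand-written refinement of one definition (assumed; an LLM would propose it).
refined = dict(vague, shopping="online store to buy products with cart checkout and shipping")

print(classify(page, vague))    # label under the vague definitions
print(classify(page, refined))  # label after refining the "shopping" definition
```

Only the category text changes between the two calls; the embedding function is untouched. That matches the training-free setting described in the abstract, where definitions are optimized and model parameters are not.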
