See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
First: 2025-12-26T18:59:47+00:00 · Latest: 2025-12-26T18:59:47+00:00
Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
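The two KL constraints described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the KL direction, the hinge-style `margin`, the `weight`, and the function names are all assumptions made for illustration. The abstract only states that a KL-consistency term ties the original image to the evidence-preserving view and a KL-separation term pushes it away from the evidence-ablated view.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def bips_style_loss(p_orig, p_preserved, p_ablated, margin=1.0, weight=1.0):
    """Hypothetical combined objective (assumed form, not from the paper).

    p_orig:      answer distribution given the original image
    p_preserved: answer distribution given the evidence-preserving view
    p_ablated:   answer distribution given the evidence-ablated view
    """
    # Consistency: predictions on the evidence-preserving view should match the original.
    consistency = kl_divergence(p_orig, p_preserved)
    # Separation: predictions on the ablated view should differ; the hinge (an assumption)
    # stops rewarding divergence once it exceeds the margin.
    separation = kl_divergence(p_orig, p_ablated)
    separation_penalty = max(0.0, margin - separation)
    return consistency + weight * separation_penalty

# A model that truly relies on the evidence: same answer with evidence, different without.
relies_on_image = bips_style_loss([0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
# A model taking a text-only shortcut: the answer never changes, so it is penalized.
text_shortcut = bips_style_loss([0.7, 0.2, 0.1], [0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
```

In this toy setting the shortcut model pays the full margin as a penalty, while the model that changes its answer once the evidence is removed pays none.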
中文标题/摘要
标题:见少而明:双向感知塑造用于多模态推理
大型视觉-语言模型(VLMs)通常从中间视觉提示中受益,这些提示要么通过外部工具注入,要么在推理过程中作为潜在视觉标记生成,但这些机制仍然忽略了细粒度的视觉证据(例如图表中的折线),在不同领域泛化能力差,并且推理时成本高。在本文中,我们提出了双向感知塑造(BiPS),它将问题条件下的掩码视图转换为双向的“看哪里”的信号,在训练过程中塑造感知。BiPS 首先在原始图像和仅保留与问题相关区域的证据保留视图之间施加KL一致性约束,鼓励粗略但完整的支持像素覆盖。然后在原始图像和证据消除视图之间施加KL分离约束,该视图中的关键像素被遮蔽,使得图像不再支持原始答案,从而抑制纯文本捷径(即仅凭文本作答)并强化细粒度的视觉依赖。在八个基准测试中,BiPS 将 Qwen2.5-VL-7B 的性能平均提高了 8.2%,并在未见过的数据集和图像类型上展示了强大的跨域泛化能力。
Summary / 总结
This paper addresses the limitations of existing vision-language models in handling fine-grained visual evidence and their poor generalization across domains. It introduces Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals to shape perception during training. BiPS uses KL-consistency and KL-separation constraints to encourage coarse but complete coverage of supporting pixels and discourage text-only shortcuts, respectively. The method improves Qwen2.5-VL-7B by 8.2% on average across eight benchmarks and demonstrates strong out-of-domain generalization.
本文针对现有视觉-语言模型在处理细粒度视觉证据和跨域泛化能力方面的局限性,提出了双向感知塑造(BiPS)方法。BiPS通过将问题条件下的遮罩视图转换为双向的注视信号,在训练过程中塑造感知。该方法使用KL一致性约束和KL分离约束,以鼓励粗略但完整的支持像素覆盖,并避免纯文本捷径。该方法在八个基准测试中平均提高了Qwen2.5-VL-7B 8.2%,并在未见过的数据集和图像类型上表现出强大的跨域泛化能力。
ProEdit: Inversion-based Editing From Prompts Done Right
Authors: Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng
First: 2025-12-26T18:59:14+00:00 · Latest: 2025-12-26T18:59:14+00:00
Comments: Equal contributions from first two authors. Project page: https://isee-laboratory.github.io/ProEdit/ Code: https://github.com/iSEE-Laboratory/ProEdit
Abstract
Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue both in the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.
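As a rough illustration of the KV-mix idea in the abstract above, the sketch below blends source and target key/value features only inside an edit mask. The linear blend, the ratio `alpha`, the reuse of source features outside the mask, and the name `kv_mix` are assumptions for illustration. The abstract does not specify how the mixing is done.

```python
def kv_mix(source_kv, target_kv, edit_mask, alpha=0.5):
    """Blend per-token key/value features inside the edited region (hypothetical sketch).

    source_kv, target_kv: lists of per-token feature vectors (lists of floats)
    edit_mask:            list of booleans, True where the token lies in the edited region
    alpha:                assumed weight on the target features inside the region
    """
    mixed = []
    for src, tgt, in_edit in zip(source_kv, target_kv, edit_mask):
        if in_edit:
            # Edited region: weaken the source influence by blending in target features.
            mixed.append([alpha * t + (1.0 - alpha) * s for s, t in zip(src, tgt)])
        else:
            # Background: keep source features unchanged (assumed) for consistency.
            mixed.append(list(src))
    return mixed

source = [[1.0, 1.0], [2.0, 2.0]]
target = [[3.0, 3.0], [4.0, 4.0]]
mixed = kv_mix(source, target, edit_mask=[True, False], alpha=0.5)
```

Only the first token is inside the mask here, so only its features move toward the target. The background token keeps its source features.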
中文标题/摘要
标题:ProEdit: 根据提示正确进行反转编辑的方法
基于反转的视觉编辑提供了一种有效且无需训练的方法,可以根据用户指令编辑图像或视频。现有方法通常在采样过程中注入源图像信息以保持编辑一致性。然而,这种采样策略过度依赖源信息,这会负面影响目标图像中的编辑(例如,未能根据指示改变主体的姿态、数量或颜色)。在本文中,我们提出ProEdit以解决这一问题,同时在注意力和潜在方面进行改进。在注意力方面,我们引入了KV-mix,它在编辑区域混合源和目标的KV特征,减轻了源图像对编辑区域的影响,同时保持背景一致性。在潜在方面,我们提出了Latents-Shift,它扰动源潜在的编辑区域,消除了反转潜在对采样的影响。在几个图像和视频编辑基准上的广泛实验表明,我们的方法达到了SOTA性能。此外,我们的设计是即插即用的,可以无缝集成到现有的反转和编辑方法中,如RF-Solver、FireFlow和UniEdit。
Summary / 总结
ProEdit addresses the over-reliance on source image information in inversion-based visual editing, which prevents edits from taking effect in the target image. It introduces KV-mix, which mixes KV features of the source and target in the edited region to reduce the source's influence there while maintaining background consistency, and Latents-Shift, which perturbs the edited region of the source latent to eliminate the influence of the inverted latent on sampling. Experiments show that ProEdit achieves state-of-the-art performance and can be integrated as a plug-and-play component into existing methods such as RF-Solver, FireFlow, and UniEdit.
ProEdit通过引入KV-mix和Latents-Shift解决了现有基于反转的视觉编辑方法的局限性。KV-mix在编辑区域混合源和目标的关键值特征,保持背景一致性的同时减少源的影响。Latents-Shift在编辑区域扰动源的潜在特征,消除反转潜在特征对采样的影响。实验表明,ProEdit在各种基准上优于现有方法,并且可以无缝集成到其他反转和编辑方法中。
Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs
Authors: Jiayu Hu, Beibei Li, Jiangwei Xia, Yanjun Qin, Bing Ji, Zhongshi He
First: 2025-12-26T11:56:45+00:00 · Latest: 2025-12-26T11:56:45+00:00
Abstract
While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for hallucination mitigation in VLMs, which follows an Activate-Locate-Edit Adversarially (ALEA) paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.
中文标题/摘要
标题:仔细看看!一种对抗参数编辑框架以减轻VLM中的幻觉
视觉-语言模型(VLMs)由于其有前景的实际应用而在AI社区中引起了越来越多的关注,但它们仍然存在持续的幻觉问题,生成的输出与视觉输入不一致。最近的研究将这些幻觉归因于VLMs过度依赖于语言先验和视觉特征整合不足,提出了启发式解码校准策略来减轻这些问题。然而,这些策略的不可训练性质从根本上限制了它们的优化潜力。为此,我们提出了一种对抗参数编辑框架,以减轻VLM中的幻觉,遵循“激活-定位-对抗式编辑”(Activate-Locate-Edit Adversarially,ALEA)范式。具体来说,我们首先构建了一个激活数据集,其中包括与视觉特征紧密关联的接地响应(正样本)和反映LLM先验偏见和内部知识缺陷的幻觉响应(负样本)。然后,通过分析响应对的差异隐藏状态来识别关键的幻觉易发参数簇。接着,使用注入对抗调优前缀的提示对这些簇进行微调,这些前缀被优化以最大化视觉忽视,从而迫使模型优先考虑视觉证据而非固有的参数偏见。在生成性和判别性VLM任务上的评估表明,ALEAHallu在减轻幻觉方面具有显著效果。我们的代码可在 https://github.com/hujiayu1223/ALEAHallu 获取。
Summary / 总结
This paper addresses the hallucination issue in Vision-Language Models (VLMs) by proposing an adversarial parametric editing framework called ALEAHallu. The framework builds an activation dataset of grounded and hallucinatory responses, locates the parameter clusters most prone to hallucination, and fine-tunes these clusters with prompts carrying adversarially tuned prefixes so that the model prioritizes visual evidence. Experiments show ALEAHallu effectively mitigates hallucinations in both generative and discriminative VLM tasks.
论文提出了一种对抗参数编辑框架ALEAHallu来解决Vision-Language模型(VLM)的幻觉问题。该方法包括构建包含正负样本的激活数据集,识别易产生幻觉的关键参数簇,并通过对抗优化前缀进行微调,以优先考虑视觉证据。实验表明,ALEAHallu在生成性和判别性VLM任务中显著减轻了幻觉现象。
LVLM-Aided Alignment of Task-Specific Vision Models
Authors: Alexander Koebler, Lukas Kuhn, Ingo Thon, Florian Buettner
First: 2025-12-26T11:11:25+00:00 · Latest: 2025-12-26T11:11:25+00:00
Abstract
In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model's dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.
中文标题/摘要
标题:LVLM辅助的任务特定视觉模型对齐
在高风险领域,由于其低计算需求和解释结果的多种方法,小型任务特定视觉模型至关重要。然而,这些解释往往揭示出模型与人类领域知识不一致,而是依赖于虚假的相关性。这可能导致模型在实际部署后表现出脆弱的行为。为解决这一问题,我们提出了一种利用大型视觉语言模型(LVLM)泛化能力的新颖且高效的方法,以实现小型任务特定视觉模型与人类领域知识的对齐。我们的LVLM辅助视觉对齐(LVLM-VA)方法提供了一个双向接口,将模型行为翻译成自然语言,并将人类类级规范映射到图像级批评,从而实现领域专家与模型的有效互动。我们的方法在合成数据集和真实世界数据集上都证明了在对齐模型行为与人类规范方面有显著改进。我们展示了它有效地减少了模型对虚假特征和群体特定偏见的依赖,而无需精细的反馈。
Summary / 总结
The research aims to improve the alignment between task-specific vision models and human domain knowledge in high-stakes applications. It introduces LVLM-VA, a method that leverages a Large Vision Language Model to translate model behavior into natural language and map human class-level specifications to image-level critiques. The method significantly enhances model alignment with human specifications, reducing reliance on spurious features and group-specific biases, as demonstrated on both synthetic and real-world datasets.
本文针对任务特定的视觉模型与人类领域知识不匹配的问题,可能会导致在实际应用中的脆弱行为。作者提出了一种名为LVLM-VA的方法,利用大型视觉语言模型(LVLM)来使这些模型与人类规范相匹配。LVLM-VA将模型行为翻译成自然语言,并将人类的类别级规范映射到图像级批评,促进了领域专家与模型之间的互动。该方法在合成和真实世界数据集上显著提高了与人类规范的匹配度,并减少了对虚假特征和群体特定偏见的依赖。
Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models
Authors: Dunyuan XU, Xikai Yang, Yaoqian Li, Juzheng Miao, Jinpeng Li, Pheng-Ann Heng
First: 2025-12-26T10:23:30+00:00 · Latest: 2025-12-26T10:23:30+00:00
Abstract
Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.
中文标题/摘要
标题:感知与校准:分析和增强医疗多模态大型语言模型的鲁棒性
医疗多模态大型语言模型(MLLMs)在临床应用中表现出有希望的性能。然而,它们对现实世界输入扰动的敏感性,如成像伪影和文本错误,严重削弱了其临床应用性。对医疗MLLMs中此类噪声影响的系统分析尚未得到充分探索。此外,虽然已有几项研究探讨了MLLMs在一般领域的鲁棒性,但它们主要集中在文本模态上,并依赖于昂贵的微调。它们无法解决复杂的噪声模式并满足医学中的严格安全标准。为解决这一问题,本研究系统分析了各种扰动对医疗MLLMs在视觉和文本模态上的影响。基于我们的发现,我们提出了一种无需训练的固有增强多模态校准(IMC)框架,该框架遵循感知与校准原则,利用MLLMs的固有去噪能力增强跨模态鲁棒性。对于视觉模态,我们提出了一种扰动感知去噪校准(PDC),利用MLLMs自身的视觉编码器识别噪声模式并进行原型引导的特征校准。对于文本去噪,我们设计了一种自我实例化多智能体系统(SMS),利用MLLMs的自我评估能力通过智能体合作层次结构精炼噪声文本。我们构建了一个包含11种不同类型噪声的基准,覆盖图像和文本模态的两个数据集。实验结果表明,我们的方法在多个模态上达到了最先进的性能,显示出增强MLLMs在实际临床场景中鲁棒性的潜力。
Summary / 总结
This work addresses the robustness issues of Medical Multi-modal Large Language Models (MLLMs) by analyzing their sensitivity to real-world input perturbations. It introduces an Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities. For visual modality, a Perturbation-aware Denoising Calibration (PDC) is proposed to identify and calibrate noise patterns. For text, a Self-instantiated Multi-agent System (SMS) is designed to refine noisy text through a cooperative hierarchy of agents. The method demonstrates state-of-the-art performance across multiple modalities, enhancing MLLMs' robustness for clinical applications.
该研究通过分析医疗多模态大型语言模型(MLLMs)对现实世界干扰的敏感性,并引入了多模态增强校准(IMC)框架来解决这一问题。IMC框架包括针对视觉模态的扰动感知去噪校准(PDC)和针对文本去噪的自我实例化多代理系统(SMS)。该方法构建了一个包含图像和文本模态11种噪声类型的基准,并展示了在增强MLLMs鲁棒性方面达到最先进的性能。
Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?
Authors: Naen Xu, Jinghuai Zhang, Changjiang Li, Hengyu An, Chunyi Zhou, Jun Wang, Boyu Xu, Yuyuan Li, Tianyu Du, Shouling Ji
Venue: AAAI 2026 Oral
First: 2025-12-26T05:09:55+00:00 · Latest: 2025-12-26T05:09:55+00:00
Comments: AAAI 2026 (Oral)
Abstract
Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (i.e., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book excerpts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content -- such as book excerpts, news articles, music lyrics, and code documentation -- when it is presented as visual input. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting copyrighted content, even when presented with the copyright notice. To address this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.
中文标题/摘要
标题:填补版权空白:大型视觉-语言模型能否识别和尊重版权内容?
大型视觉-语言模型(LVLMs)在多模态推理任务中取得了显著进展。然而,它们的广泛应用引发了关于潜在版权侵权的严重关切。当LVLMs遇到版权内容(即用户输入、检索文档)时,它们是否会准确识别并遵守版权规定?不遵守版权规定可能导致严重的法律和伦理后果,尤其是在LVLMs基于版权材料生成回应(例如,检索到的书籍节选、新闻报道)时。本文对各种LVLMs进行了全面评估,考察它们如何处理版权内容——如作为视觉输入呈现的书摘、新闻文章、歌词和代码文档。为了系统地衡量版权合规性,我们引入了一个包含50,000个多模态查询-内容对的大规模基准数据集,用于评估LVLMs处理可能导致版权侵权的查询的能力。鉴于现实世界中的版权内容可能包含也可能不包含版权通知,数据集包括两种不同的场景:有版权通知和无版权通知的查询-内容对。对于前者,我们广泛涵盖了四种类型的版权通知以涵盖不同情况。我们的评估表明,即使是最先进的闭源LVLMs,在有版权通知的情况下,也表现出在识别和尊重版权内容方面存在显著缺陷。为了解决这一局限性,我们提出了一种新的工具增强防御框架,可在所有场景下降低侵权风险。我们的研究结果强调了开发具备版权意识的LVLMs的重要性,以确保负责任和合法地使用版权内容。
Summary / 总结
This paper evaluates how large vision-language models (LVLMs) recognize and comply with copyright regulations when handling copyrighted content presented as visual input. It introduces a benchmark dataset of 50,000 multimodal query-content pairs, covering scenarios with and without copyright notices. The evaluation shows that even state-of-the-art closed-source LVLMs struggle with copyright compliance, even when a copyright notice is present. To address this, the authors introduce a tool-augmented defense framework that reduces infringement risk in all evaluated scenarios.
本文评估了大型视觉-语言模型(LVLMs)在处理版权内容时是否能够识别和遵守版权法规。引入了一个包含50,000个多模态查询-内容对的基准数据集,涵盖了有和没有版权通知的场景。评估结果显示,即使是最先进的LVLMs也无法在有版权通知的情况下识别和尊重版权内容。作者提出了一种工具增强的防御框架来增强LVLMs的版权合规性,从而在所有评估的场景中降低侵权风险。这项工作强调了开发版权意识强的LVLMs的重要性,以确保对版权内容的合法使用。
Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
Authors: Masayuki Kawarada, Kosuke Yamada, Antonio Tejero-de-Pablos, Naoto Inoue
First: 2025-12-26T04:51:23+00:00 · Latest: 2025-12-26T04:51:23+00:00
Abstract
Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre), which has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.
中文标题/摘要
标题:利用大型视觉语言模型的无训练条件图像嵌入框架
条件图像嵌入是专注于由给定文本条件(如颜色、类型)指示的图像特定方面的特征表示,这是一个具有挑战性的问题。尽管最近的视觉基础模型,如CLIP,提供了丰富的图像表示,但它们并不是专门设计来关注特定条件的。在本文中,我们提出了一种名为DIOR的方法,该方法利用大型视觉语言模型(LVLM)生成条件图像嵌入。DIOR是一种无训练方法,通过提示LVLM用与给定条件相关的单个词描述图像,从中提取LVLM最后一个词的隐藏状态向量作为条件图像嵌入。DIOR提供了一种通用的解决方案,可以在任何图像和条件下应用,无需额外训练或任务特定先验。在条件图像相似性任务上的全面实验结果表明,DIOR优于现有的无训练基线,包括CLIP。此外,DIOR在多个设置中优于需要额外训练的方法,表现出更优的性能。
Summary / 总结
The paper addresses the challenge of generating conditional image embeddings that focus on specific aspects of an image as indicated by a textual condition. It introduces DIOR, a training-free method that prompts a large vision-language model to describe an image with a single word related to the condition and takes the hidden state of the last token as the embedding. On conditional image similarity tasks, DIOR outperforms existing training-free baselines, including CLIP, and also surpasses methods that require additional training across multiple settings.
论文旨在生成能够聚焦于图像特定方面(由文本条件指示)的条件图像嵌入。提出了一种无需训练的方法DIOR,利用大型视觉语言模型通过提示模型用与条件相关的单个词描述图像来生成这些嵌入。实验结果表明,DIOR在条件图像相似性任务中优于现有无需训练的基线,并且在多种设置中优于需要额外训练的方法。
RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
Authors: Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu
First: 2025-12-24T15:01:26+00:00 · Latest: 2025-12-26T03:30:51+00:00
Comments: 11 pages, 6 figures
Abstract
Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
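To make "safety logic that is executable as code" concrete, here is a minimal sketch of a temporal safety predicate checked against a short-term action history. The action names, the predicate, and `check_action` are hypothetical and are not taken from RoboSafe. They only illustrate the general pattern of evaluating predicates before an action runs and triggering replanning on a violation.

```python
from typing import Callable, List, Tuple

# A predicate receives the recent trajectory and the proposed next action,
# and returns True when executing that action is considered safe.
Predicate = Callable[[List[str], str], bool]

def no_leave_while_stove_on(recent_actions: List[str], next_action: str) -> bool:
    """Hypothetical temporal predicate: do not leave the room while the stove is on."""
    stove_on = False
    for action in recent_actions:
        if action == "turn_on_stove":
            stove_on = True
        elif action == "turn_off_stove":
            stove_on = False
    return not (stove_on and next_action == "leave_room")

def check_action(recent_actions: List[str], next_action: str,
                 predicates: List[Predicate]) -> Tuple[str, List[str]]:
    """Return ("execute", []) if all predicates hold, else ("replan", violated names)."""
    violated = [p.__name__ for p in predicates if not p(recent_actions, next_action)]
    return ("replan", violated) if violated else ("execute", [])

decision = check_action(["turn_on_stove", "pick_up_cup"], "leave_room",
                        [no_leave_while_stove_on])
```

Because each predicate is an ordinary function, a violation can be traced to a named rule, which is one way to read the abstract's claim that the logic is both interpretable and executable.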
中文标题/摘要
标题:RoboSafe:通过可执行的安全逻辑保护具身代理
由视觉-语言模型(VLMs)驱动的具身代理越来越能够执行复杂的现实世界任务,但它们仍然容易受到可能导致不安全行为的危险指令的影响。运行时安全护栏可以在任务执行过程中拦截危险行为,提供了一种有前景的解决方案,因为它们具有灵活性。然而,现有的防御措施往往依赖于静态规则过滤或提示级控制,难以应对动态、时间依赖性和上下文丰富的环境中出现的隐含风险。为了解决这个问题,我们提出了RoboSafe,这是一种通过可执行谓词基础的安全逻辑为具身代理提供混合推理运行时保护的混合方法。RoboSafe结合了在混合长短期安全记忆上的两种互补推理过程。我们首先提出了一种后向反思推理模块,该模块不断回顾短期记忆中的最近轨迹,以推断时间安全谓词,并在检测到违规行为时主动触发重新规划。然后,我们提出了一种前瞻预测推理模块,该模块通过从长期安全记忆和代理的多模态观察中生成上下文感知的安全谓词来预见即将出现的风险。这些组件共同形成了一个既可解释又可执行的适应性、验证性安全逻辑。在多个代理的广泛实验中,RoboSafe与领先基准相比显著减少了危险行为(风险发生率降低36.8%),同时保持了接近原始的任务性能。在物理机器人手臂上的实际评估进一步证实了其实用性。代码将在接受后发布。
Summary / 总结
RoboSafe is designed to safeguard embodied agents by using executable safety logic. It combines backward reflective reasoning, which continuously monitors recent actions for safety, and forward predictive reasoning, which forecasts potential risks based on long-term memory and current observations. Experiments show that RoboSafe significantly reduces hazardous actions by 36.8% compared to existing methods, while preserving task performance. Real-world tests on robotic arms validate its practicality.
RoboSafe 是一种使用可执行谓词安全逻辑的混合推理运行时保护方案,结合了持续回顾近期轨迹以推断时间安全谓词并触发重新规划的后向反思推理模块,以及从长期安全记忆和代理的多模态观察中生成情境感知安全谓词以预见风险的前瞻预测推理模块。实验表明,RoboSafe 将危险行为减少了 36.8%,同时保持了接近原始的任务性能。
Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
Authors: Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang
First: 2025-12-26T01:01:25+00:00 · Latest: 2025-12-26T01:01:25+00:00
Comments: 19 pages, 11 figures, 8 tables
Abstract
Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with the reliability of VLM. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, enabling feasible transferability (17-26% harmful rates on unseen targets). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieves competitive attack success rates (93-95%) alongside high harmful conversion, thereby revealing new weaknesses in current VLM safety mechanisms.
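The selection step the abstract describes, picking the roughly 20% of decoding positions with the highest entropy, can be sketched as follows. This covers only the position selection, not the attack itself. The function names and the rounding rule for the top fraction are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_positions(step_distributions, fraction=0.2):
    """Indices of the decoding steps whose entropy falls in the top `fraction`."""
    entropies = [token_entropy(d) for d in step_distributions]
    # At least one position is always selected; rounding up is an assumed choice.
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Five decoding steps over a toy 3-token vocabulary.
steps = [
    [1.0, 0.0, 0.0],        # fully confident
    [0.5, 0.5, 0.0],        # two-way fork
    [0.9, 0.1, 0.0],
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
    [0.99, 0.01, 0.0],
]
critical = high_entropy_positions(steps, fraction=0.2)
```

With five steps and a 20% budget, a single position is selected: the uniform distribution at index 3.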
中文标题/摘要
标题:少数词元起作用:熵引导的视觉-语言模型攻击
视觉-语言模型(VLMs)取得了显著的性能,但仍然容易受到对抗性攻击的影响。熵,衡量模型不确定性的一个指标,与VLM的可靠性密切相关。先前的基于熵的攻击在所有解码步骤中最大化不确定性,隐含地假设每个词元对生成不稳定性贡献相同。相反,我们证明一小部分(约20%)高熵词元,即自回归生成中的关键决策点,不成比例地控制了输出轨迹。通过将对抗性扰动集中在这些位置,我们实现了与全局方法相当的语义降级,但使用了更小的预算。更重要的是,在多个代表性VLM中,这种选择性攻击将35-49%的良性输出转化为有害输出,揭示了更严重的安全风险。令人惊讶的是,这些脆弱的高熵分支在架构上不同的VLM中反复出现,使得跨模型的转移性成为可能(在未见过的目标上,有害率在17-26%)。受这些发现的启发,我们提出了熵库引导的对抗性攻击(EGA),该方法在攻击成功率(93-95%)和高有害转化率方面表现出竞争力,从而揭示了当前VLM安全机制的新弱点。
Summary / 总结
This paper investigates the vulnerability of vision-language models (VLMs) to adversarial attacks, focusing on the role of entropy in model uncertainty. It demonstrates that a small fraction of high-entropy tokens, rather than all tokens, significantly influence the generation process. By targeting these critical tokens, the proposed method achieves comparable semantic degradation to global attacks but with a smaller budget. The study also reveals that such selective attacks can convert up to 49% of benign outputs to harmful ones, highlighting a critical safety risk. The findings suggest that these high-entropy tokens are recurrent across different VLM architectures, enabling transferability of the attacks. The paper proposes Entropy-bank Guided Adversarial attacks (EGA) that achieve high attack success rates and harmful conversion rates, indicating new weaknesses in VLM safety mechanisms.
该论文通过关注模型不确定性度量熵,研究了视觉语言模型(VLMs)对对抗攻击的脆弱性。研究表明,一小部分高熵令牌而非所有令牌显著影响模型的输出。通过针对这些关键令牌,所提出的方法在预算较小的情况下实现了与全局攻击相当的语义降级效果。研究还揭示,此类选择性攻击可以将高达49%的良性输出转化为有害输出,突显了重大安全风险。所提出的熵银行引导的对抗攻击(EGA)实现了高攻击成功率和有害转化率,表明当前VLM安全机制存在新的弱点。
Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models
Authors: Nimrod Berman, Adam Botach, Emanuel Ben-Baruch, Shunit Haviv Hakimi, Asaf Gendler, Ilan Naiman, Erez Yosef, Igor Kviatkovsky
First: 2025-12-25T20:31:36+00:00 · Latest: 2025-12-25T20:31:36+00:00
Abstract
Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.
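One simple way to turn token-level logits into a confidence score, in the spirit of the scheme mentioned in the abstract above, is a softmax over the logits of two answer tokens. The two-token setup, the function names, and the thresholds are assumptions. The abstract does not describe the exact extraction scheme.

```python
import math

def boundary_confidence(logit_boundary, logit_no_boundary):
    """Softmax probability of the 'boundary' token against the 'no boundary' token."""
    m = max(logit_boundary, logit_no_boundary)  # subtract the max for numerical stability
    e_b = math.exp(logit_boundary - m)
    e_n = math.exp(logit_no_boundary - m)
    return e_b / (e_b + e_n)

def predict_boundaries(shot_logits, threshold=0.5):
    """Mark a shot as a scene boundary when its confidence reaches the threshold."""
    return [boundary_confidence(b, n) >= threshold for b, n in shot_logits]

# (boundary logit, no-boundary logit) for three consecutive shots.
shot_logits = [(2.0, 0.0), (0.0, 2.0), (0.5, 0.0)]
balanced = predict_boundaries(shot_logits, threshold=0.5)
high_precision = predict_boundaries(shot_logits, threshold=0.8)
```

Raising the threshold drops the borderline third shot, which is the kind of precision-recall control that a continuous score makes possible.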
中文标题/摘要
标题:Scene-VLM:通过视觉语言模型进行多模态视频场景分割
将长视频分割成语义上连贯的场景是大规模视频理解中的基本任务。现有的基于编码器的方法受到视觉中心偏见的限制,它们孤立地对每个镜头进行分类而没有利用序列依赖性,并且缺乏叙事理解和可解释性。在本文中,我们提出了Scene-VLM,这是第一个用于视频场景分割的微调视觉语言模型(VLM)框架。Scene-VLM 联合处理包括帧、转录和可选元数据在内的视觉和文本线索,以实现连续镜头之间的多模态推理。该模型以因果依赖性顺序生成预测,并引入了一个上下文聚焦窗口机制,以确保每个镜头级决策的足够时间上下文。此外,我们提出了一种方案,从VLM的标记级logits中提取置信分数,这使得可以控制精确度-召回率之间的权衡,而以前这些方法仅限于基于编码器的方法。此外,我们证明,通过最小的定向监督,我们的模型可以对齐以生成其边界决策的连贯自然语言解释。我们的方法在标准场景分割基准上达到了最先进的性能。例如,在MovieNet上,Scene-VLM相对于之前领先的方法在AP上提高了6个点,在F1上提高了13.7个点。
Summary / 总结
Scene-VLM is the first fine-tuned vision-language model framework for video scene segmentation. It addresses limitations of encoder-based methods by jointly processing visual and textual cues and predicting shot boundaries sequentially with causal dependencies among shots. It introduces a context-focus window mechanism and a scheme to extract confidence scores from token-level logits, enabling controllable precision-recall trade-offs. Scene-VLM achieves state-of-the-art performance; on MovieNet it improves over the previous leading method by +6 AP and +13.7 F1.
Scene-VLM 是首个用于视频场景分割的视觉-语言模型框架,通过整合视觉和文本线索并利用镜头间的因果依赖关系来解决现有方法的局限性。它引入了上下文焦点窗口机制,并提出了一种从标记级概率中提取置信分数的方案,以实现更好的精确率-召回率权衡。Scene-VLM 达到了最先进的性能,在 MovieNet 基准测试中,其 AP 和 F1 分数分别提高了 6 个和 13.7 个百分点,超过了之前的领先方法。
ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation
Authors: Hosam Elgendy, Ahmed Sharshar, Ahmed Aboeitta, Mohsen Guizani
First: 2025-08-14T13:33:44+00:00 · Latest: 2025-12-25T15:41:13+00:00
Comments: 11 pages, 5 figures, 7 tables
Abstract
Understanding environmental changes from remote sensing imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERTF1 0.902) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
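For readers unfamiliar with LoRA, the sketch below shows the general low-rank update it relies on: a frozen weight matrix W is adjusted by a scaled product of two small trainable matrices B and A. This is the standard LoRA formulation, not ChatENV's configuration. The rank, scaling value, and matrix sizes are illustrative assumptions, since the abstract does not report them.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the weight used after merging the adapter.

    w: frozen d_out x d_in weight
    b: trainable d_out x rank matrix
    a: trainable rank x d_in matrix
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

w = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
b = [[1.0], [2.0]]            # 2x1 adapter matrix (rank 1)
a = [[0.5, 0.5]]              # 1x2 adapter matrix
merged = lora_effective_weight(w, b, a, alpha=2.0, rank=1)
```

Only B and A are trained, so the number of trainable parameters grows with the rank rather than with the full size of W.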
中文标题/摘要
标题:ChatENV:一种基于传感器引导的环境监测和场景模拟的交互式视觉语言模型
从遥感图像中理解环境变化对于气候韧性、城市规划和生态系统监测至关重要。然而,当前的视觉语言模型(VLMs)忽视了环境传感器的因果信号,依赖于单一来源的描述,容易产生风格偏见,并缺乏基于交互式场景的推理。我们提出了ChatENV,这是第一个能够联合推理卫星图像对和现实世界传感器数据的交互式VLM。我们的框架:(i) 创建了一个包含177,000张图像的数据集,形成152,000个时间对,覆盖197个国家的62个土地利用类别,具有丰富的传感器元数据(例如,温度、PM10、CO);(ii) 使用GPT4o和Gemini 2.0注释数据,以实现风格和语义多样性;(iii) 使用高效的低秩适应(LoRA)适配器对Qwen-2.5-VL进行微调,以实现聊天目的。ChatENV在时间推理和“假设情景”推理(例如,BERTF1 0.902)方面表现出色,与最先进的时序模型相当或优于它们,同时支持交互式场景分析。这使ChatENV成为一种有据可依、具备传感器感知能力的强大环境监测工具。
Summary / 总结
ChatENV is an interactive vision-language model that jointly reasons over satellite image pairs and real-world sensor data for environmental monitoring and scenario simulation. The authors build a dataset of 177k images forming 152k temporal pairs, annotate it using GPT4o and Gemini 2.0, and fine-tune Qwen-2.5-VL with LoRA adapters. ChatENV performs strongly in temporal and "what-if" reasoning (e.g., BERTF1 0.902) and rivals or outperforms state-of-the-art temporal models while supporting interactive scenario-based analysis.
ChatENV 是一个将卫星图像与实际传感器数据结合的交互式视觉-语言模型,以提升环境监测和情景模拟。它创建了一个包含丰富传感器元数据的大规模数据集,使用先进的文本生成模型进行注释,并使用 LoRA 适配器对 Qwen-2.5-VL 进行微调。ChatENV 在时间序列和‘假设’推理方面表现出色,并支持交互式情景分析,使其成为环境监测和规划的强大工具。
Towards Responsible and Explainable AI Agents with Consensus-Driven Reasoning
Authors: Eranga Bandara, Tharaka Hewa, Ross Gore, Sachin Shetty, Ravi Mukkamala, Peter Foytik, Abdul Rahman, Safdar H. Bouk, Xueping Liang, Amin Hass, Sachini Rajapakse, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan
First: 2025-12-25T14:49:25+00:00 · Latest: 2025-12-25T14:49:25+00:00
Abstract
Agentic AI represents a major shift in how autonomous systems reason, plan, and execute multi-step tasks through the coordination of Large Language Models (LLMs), Vision Language Models (VLMs), tools, and external services. While these systems enable powerful new capabilities, increasing autonomy introduces critical challenges related to explainability, accountability, robustness, and governance, especially when agent outputs influence downstream actions or decisions. Existing agentic AI implementations often emphasize functionality and scalability, yet provide limited mechanisms for understanding decision rationale or enforcing responsibility across agent interactions. This paper presents a Responsible (RAI) and Explainable (XAI) AI Agent Architecture for production-grade agentic workflows based on multi-model consensus and reasoning-layer governance. In the proposed design, a consortium of heterogeneous LLM and VLM agents independently generates candidate outputs from a shared input context, explicitly exposing uncertainty, disagreement, and alternative interpretations. A dedicated reasoning agent then performs structured consolidation across these outputs, enforcing safety and policy constraints, mitigating hallucinations and bias, and producing auditable, evidence-backed decisions. Explainability is achieved through explicit cross-model comparison and preserved intermediate outputs, while responsibility is enforced through centralized reasoning-layer control and agent-level constraints. We evaluate the architecture across multiple real-world agentic AI workflows, demonstrating that consensus-driven reasoning improves robustness, transparency, and operational trust across diverse application domains. This work provides practical guidance for designing agentic AI systems that are autonomous and scalable, yet responsible and explainable by construction.
中文标题/摘要
标题:迈向基于共识驱动推理的负责任且可解释的AI代理
代理型AI代表了自主系统在协调大型语言模型(LLMs)、视觉语言模型(VLMs)、工具和外部服务的情况下进行推理、规划和执行多步骤任务方式的重大转变。尽管这些系统提供了强大的新能力,但不断增加的自主性引入了与可解释性、问责制、稳健性和治理相关的关键挑战,尤其是在代理输出影响下游行动或决策时。现有的代理型AI实现通常强调功能性和可扩展性,但提供了有限的机制来理解决策理由或在代理交互中实施责任。本文提出了一种基于多模型共识和推理层治理的责任(RAI)和可解释(XAI)AI代理架构,适用于生产级代理型工作流。在所提出的架构中,由异构LLM和VLM代理组成的联盟独立地从共享输入上下文中生成候选输出,明确地暴露不确定性、分歧和替代解释。然后,一个专门的推理代理进行结构化的综合,强制执行安全性和政策约束,减轻幻觉和偏见,并生成可审计、基于证据的决策。通过显式的跨模型比较和保留中间输出实现可解释性,通过集中式推理层控制和代理级约束实施责任。我们在多个实际代理型AI工作流中评估了该架构,证明共识驱动的推理提高了不同应用领域的鲁棒性、透明度和操作信任。本研究为设计自主且可扩展但又责任和可解释的代理型AI系统提供了实用指导。
Summary / 总结
This paper addresses the challenges of explainability, accountability, and governance in agentic AI systems, which reason, plan, and execute multi-step tasks through the coordination of LLMs, VLMs, tools, and external services. The proposed RAI and XAI architecture uses a multi-model consensus approach, where heterogeneous agents independently generate outputs, and a reasoning agent consolidates these outputs while enforcing safety and policy constraints. The evaluation across various workflows shows that this approach enhances robustness, transparency, and operational trust. Explainability is achieved through explicit cross-model comparisons and preserved intermediate outputs, while responsibility is enforced through centralized control and agent-level constraints.
本文针对通过LLMs、VLMs、工具和外部服务进行推理、规划和执行多步骤任务的代理AI系统中的可解释性、问责制和治理问题,提出了一种基于多模型共识和推理层治理的RAI和XAI架构。该架构中,一个推理代理整合来自各种模型的输出,同时确保安全并减轻偏见。在多个实际应用领域的评估表明,该架构提高了鲁棒性、透明度和操作信任。通过跨模型比较和保留中间输出,该架构确保了可解释性,而集中控制和对代理的约束则确保了责任。
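The consensus step in the entry above can be pictured with a small sketch. This is not the paper's reasoning agent: the abstract describes an LLM-based consolidation stage, while the code below swaps in a plain majority vote purely for illustration. The agent names, the `consolidate` function, and the returned fields are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict


def consolidate(context: str, agents: Dict[str, Callable[[str], str]]) -> dict:
    """Collect one candidate per agent and summarize agreement.

    Majority voting is an assumption standing in for the paper's
    reasoning agent, whose internals the abstract does not specify.
    """
    # Each agent answers independently from the same shared context.
    candidates = {name: agent(context) for name, agent in agents.items()}
    counts = Counter(candidates.values())
    answer, votes = counts.most_common(1)[0]
    return {
        "decision": answer,
        "agreement": votes / len(candidates),
        # Disagreement is exposed explicitly rather than hidden.
        "dissenting": sorted(n for n, out in candidates.items() if out != answer),
        # Intermediate outputs are preserved so the decision can be audited.
        "candidates": candidates,
    }


# Hypothetical stub agents; real ones would call LLM or VLM backends.
agents = {
    "llm_a": lambda ctx: "approve",
    "llm_b": lambda ctx: "approve",
    "vlm_c": lambda ctx: "escalate",
}
result = consolidate("demo input context", agents)
```

Keeping `candidates` and `dissenting` in the result mirrors the abstract's point that explainability comes from cross-model comparison and preserved intermediate outputs.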
SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration
Authors: Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang
First: 2025-12-25T14:02:27+00:00 · Latest: 2025-12-25T14:02:27+00:00
Abstract
Modern vision-language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset, a curated corpus of 1,117 medical imaging lecture slides from a university course, we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.
中文标题/摘要
标题:SlideChain:通过区块链注册实现讲座理解的语义溯源
现代视觉-语言模型(VLMs)越来越多地用于解释和生成教育内容,但它们的语义输出在长时间内难以验证、重现和审计。不同模型家族、推理设置和计算环境之间的不一致性削弱了AI生成的教育材料的可靠性,尤其是在高风险和定量的STEM领域。本文介绍了SlideChain,这是一种基于区块链的溯源框架,旨在为大规模多模态语义提取提供可验证的完整性。使用SlideChain幻灯片数据集——一个包含1,117张来自大学课程的医学成像讲座幻灯片的精选语料库,从四种最先进的VLMs中提取概念和关系三元组,并为每张幻灯片构建结构化的溯源记录。SlideChain将这些记录的加密哈希值锚定在本地兼容以太坊虚拟机(EVM)的区块链上,提供防篡改审计和持久的语义基线。通过首次系统分析多模态教育内容中的语义分歧、跨模型相似性和讲座级别的变异性,我们揭示了显著的跨模型差异,包括概念重叠低和许多幻灯片中关系三元组几乎零一致性。我们进一步在模拟部署条件下评估了气体使用量、吞吐量和可扩展性,并展示了完美的篡改检测以及独立提取运行中的确定性重现性。这些结果共同表明,SlideChain为可信、可验证的多模态教育流水线提供了一种实用且可扩展的步骤,支持长期审计、重现性和完整性,以支持AI辅助的教育系统。
Summary / 总结
SlideChain is a blockchain-based provenance framework designed to make the semantic outputs of vision-language models (VLMs) used for educational content verifiable. It extracts concepts and relational triples from 1,117 medical imaging lecture slides using four state-of-the-art VLMs, builds a structured provenance record for every slide, and anchors cryptographic hashes of those records on a local EVM-compatible blockchain. Key findings include pronounced cross-model discrepancies, with low concept overlap and near-zero agreement in relational triples on many slides. The system also demonstrates perfect tamper detection and deterministic reproducibility across independent extraction runs, suggesting a practical and scalable step toward trustworthy educational pipelines.
SlideChain 是一个基于区块链的溯源框架,旨在提高教育内容中多模态语义提取的可靠性。通过一个包含1,117张医学影像讲义的语料库,作者从四个最先进的视觉-语言模型中提取概念和关系三元组,并构建结构化的溯源记录。这些记录被锚定在本地区块链上,确保了防篡改的审计能力和持久的语义基线。研究揭示了显著的跨模型差异,包括概念重叠低和关系三元组几乎零一致性,并在模拟部署条件下展示了完美的防篡改检测和确定性的可重复性。
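The hash-anchoring idea in the SlideChain entry can be sketched in a few lines. The abstract does not say which hash function or record schema is used, so SHA-256 over canonical JSON and every field name below are assumptions made for illustration; the on-chain storage itself is not modeled.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """Hash a provenance record via a canonical JSON encoding.

    Sorting keys and fixing separators makes the digest independent of
    dictionary ordering. SHA-256 is an assumption, not the paper's choice.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_tampered(record: dict, anchored_hash: str) -> bool:
    """A record is flagged when its digest no longer matches the anchor."""
    return record_hash(record) != anchored_hash


# Hypothetical provenance record for a single slide.
record = {
    "slide_id": "lecture01_slide003",
    "model": "example-vlm",
    "concepts": ["attenuation", "x-ray"],
    "triples": [["x-ray", "undergoes", "attenuation"]],
}
anchored = record_hash(record)  # the value that would be stored on chain

# Dropping one concept changes the digest, so the edit is detectable.
modified = dict(record, concepts=["attenuation"])
```

Only the digest needs to be anchored; the record itself can live off chain, which is one common way to keep storage costs low.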
The Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds
Authors: Subramanyam Sahoo, Jared Junkin
First: 2025-12-25T13:27:56+00:00 · Latest: 2025-12-25T13:27:56+00:00
Comments: 10 pages, 5 figures, Initial Work
Abstract
Deepfake detection models have achieved high accuracy in identifying synthetic media, but their decision processes remain largely opaque. In this paper we present a mechanistic interpretability framework for deepfake detection applied to a vision-language model. Our approach combines a sparse autoencoder (SAE) analysis of internal network representations with a novel forensic manifold analysis that probes how the model's features respond to controlled forensic artifact manipulations. We demonstrate that only a small fraction of latent features are actively used in each layer, and that the geometric properties of the model's feature manifold, including intrinsic dimensionality, curvature, and feature selectivity, vary systematically with different types of deepfake artifacts. These insights provide a first step toward opening the "black box" of deepfake detectors, allowing us to identify which learned features correspond to specific forensic artifacts and to guide the development of more interpretable and robust models.
中文标题/摘要
标题:深伪侦探:通过稀疏特征和流形进行神经法医解释
深伪检测模型在识别合成媒体方面取得了高精度,但其决策过程仍然相当不透明。本文提出了一种应用于视觉-语言模型的深伪检测机制可解释性框架。我们的方法结合了稀疏自编码器(SAE)对内部网络表示的分析以及一种新颖的法医流形分析,该分析探究了模型特征对受控法医特征操纵的响应。我们证明了每个层中只有少量的潜在特征被积极使用,而模型特征流形的几何属性,包括固有维数、曲率和特征选择性,随着不同类型的深伪特征的变化而系统地变化。这些见解为打开“黑箱”深伪检测器的第一步,使我们能够识别出哪些学习特征对应于特定的法医特征,并指导更可解释和稳健模型的发展。
Summary / 总结
The research aims to enhance the interpretability of deepfake detection models by analyzing their internal representations. The method combines a sparse autoencoder to identify active features and a novel forensic manifold analysis to understand how the model responds to specific manipulations. Key findings show that only a small fraction of latent features are used in each layer, and the geometric properties of the feature manifold vary with different types of deepfake artifacts, providing insights into the model's decision-making process.
研究旨在通过分析内部表示来增强深度假信息检测模型的可解释性。方法结合了稀疏自编码器来识别活跃特征,并采用一种新颖的法医流形分析来理解模型对特定操作的响应。关键发现表明,每个层中只有少量的潜在特征被使用,而特征流形的几何特性会随着不同类型深度假信息的出现而变化,这为理解模型的决策过程提供了见解。
Training-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints
Authors: Mutiara Shabrina, Nova Kurnia Putri, Jefri Satria Ferdiansyah, Sabita Khansa Dewi, Novanto Yudistira
First: 2025-12-25T11:38:10+00:00 · Latest: 2025-12-25T11:38:10+00:00
Abstract
Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The Predict, Prevent, and Evaluate (PPE) framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing. In this work, we analyze the PPE framework, focusing on its architectural components, including BERT-based attribute prediction and StyleGAN2-based image generation on the CelebA-HQ dataset. Through empirical analysis, we identify a limitation in the original regularization strategy, where latent updates remain dense and prone to semantic leakage. To mitigate this issue, we introduce a sparsity-based constraint using L1 regularization on latent space manipulation. Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.
中文标题/摘要
标题:无训练解缠文本引导图像编辑通过稀疏潜在约束
文本驱动的图像操作经常遭受属性纠缠的问题,其中修改目标属性(例如,添加刘海)会无意中改变其他语义属性,如身份或外观。PPE框架通过利用预训练的视觉-语言模型解决这一问题,进行解缠编辑。在本文中,我们分析了PPE框架,重点关注其架构组件,包括基于BERT的属性预测和基于StyleGAN2的图像生成(在CelebA-HQ数据集上)。通过实证分析,我们发现原始正则化策略的一个局限性,即潜在更新保持密集且容易发生语义泄漏。为解决这一问题,我们引入了一种基于稀疏性的约束,使用L1正则化对潜在空间操作进行约束。实验结果表明,所提出的方法能够执行更集中和可控的编辑,有效减少非目标属性的意外变化,同时保留面部身份。
Summary / 总结
This study addresses the problem of attribute entanglement in text-driven image manipulation by proposing a sparsity-based constraint using L1 regularization. The PPE framework, which uses BERT for attribute prediction and StyleGAN2 for image generation, is analyzed. The key finding is that the proposed method reduces unintended changes in non-target attributes while preserving facial identity, demonstrating more focused and controlled edits compared to the original framework.
该研究通过提出使用L1正则化引入稀疏性约束的方法,解决了文本驱动图像编辑中属性纠缠的问题。分析了PPE框架,该框架结合BERT进行属性预测和StyleGAN2进行图像生成,并指出其原始正则化策略的局限性。所提出的方法引入了稀疏性约束,以减轻语义泄露,从而实现更集中和可控的编辑,同时保留面部身份并减少非目标属性的意外变化。
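To make the sparsity constraint in the entry above concrete, here is a minimal sketch of how an L1 term acts on a latent edit direction. The abstract does not state the optimizer or penalty weight, so the soft-thresholding step (the proximal operator of the L1 norm), the example vector, and the threshold are illustrative assumptions rather than the authors' procedure.

```python
import math


def l1_penalty(delta, lam):
    """L1 regularization term: lam times the sum of absolute latent offsets."""
    return lam * sum(abs(d) for d in delta)


def soft_threshold(delta, tau):
    """Proximal operator of tau * ||.||_1.

    Every entry shrinks toward zero by tau, and entries smaller than tau
    in magnitude become exactly zero, which is what yields sparse edits.
    """
    return [math.copysign(max(abs(d) - tau, 0.0), d) for d in delta]


# Hypothetical dense latent update: two large entries carry the intended
# edit, two small entries represent leakage into other attributes.
dense_delta = [0.9, -0.05, 0.02, -0.6]
sparse_delta = soft_threshold(dense_delta, tau=0.1)
num_active = sum(1 for d in sparse_delta if d != 0.0)
```

After thresholding, only the two dominant coordinates remain nonzero, which is the "focused edit" behavior the abstract attributes to the L1 constraint.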
SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration
Authors: Zhiyuan Liu, Daocheng Fu, Pinlong Cai, Lening Wang, Ying Liu, Yilong Ren, Botian Shi, Jianqiang Wang
First: 2025-12-25T10:28:43+00:00 · Latest: 2025-12-25T10:28:43+00:00
Abstract
High-fidelity and controllable 3D simulation is essential for addressing the long-tail data scarcity in Autonomous Driving (AD), yet existing methods struggle to simultaneously achieve photorealistic rendering and interactive traffic editing. Current approaches often falter in large-angle novel view synthesis and suffer from geometric or lighting artifacts during asset manipulation. To address these challenges, we propose SymDrive, a unified diffusion-based framework capable of joint high-quality rendering and scene editing. We introduce a Symmetric Auto-regressive Online Restoration paradigm, which constructs paired symmetric views to recover fine-grained details via a ground-truth-guided dual-view formulation and utilizes an auto-regressive strategy for consistent lateral view generation. Furthermore, we leverage this restoration capability to enable a training-free harmonization mechanism, treating vehicle insertion as context-aware inpainting to ensure seamless lighting and shadow consistency. Extensive experiments demonstrate that SymDrive achieves state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion.
中文标题/摘要
标题:SymDrive:通过对称自回归在线恢复实现逼真可控的驾驶模拟
高保真度和可控的3D模拟对于解决自动驾驶(AD)中的长尾数据稀缺问题至关重要,但现有方法难以同时实现逼真的渲染和交互式的交通编辑。当前的方法在大角度新颖视图合成方面常常失败,并且在资产操作过程中会出现几何或照明伪影。为了解决这些挑战,我们提出SymDrive,这是一种统一的基于扩散的框架,能够同时实现高质量渲染和场景编辑。我们引入了一种对称自回归在线恢复范式,该范式通过基于真实值的双视图公式构建配对的对称视图以恢复细粒度的细节,并利用自回归策略进行一致的侧视图生成。此外,我们利用这种恢复能力来实现一种无需训练的协调机制,将车辆插入视为上下文感知的图像填充,以确保照明和阴影的一致性。广泛的实验表明,SymDrive 在新颖视图增强和真实3D车辆插入方面均达到了最先进的性能。
Summary / 总结
SymDrive is a unified diffusion-based framework for high-fidelity, controllable 3D simulation in autonomous driving. It introduces a Symmetric Auto-regressive Online Restoration paradigm that constructs paired symmetric views to recover fine-grained details and uses an auto-regressive strategy for consistent lateral view generation. The same restoration capability enables a training-free harmonization mechanism that treats vehicle insertion as context-aware inpainting, ensuring seamless lighting and shadow consistency. Experiments show state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion.
SymDrive 是一个统一的扩散基框架,旨在解决自动驾驶仿真中高保真渲染和交互式交通编辑的挑战。它引入了一种对称自回归在线恢复范式,以恢复细粒度细节并确保一致的横向视图生成。关键发现表明,SymDrive 在新颖视图增强和真实3D车辆插入方面均优于现有方法,实现了这两个领域的最先进性能。
TAMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant
Authors: Rongpei Hong, Jian Lang, Ting Zhong, Yong Wang, Fan Zhou
Venue: KDD 2026
First: 2025-12-25T10:23:56+00:00 · Latest: 2025-12-25T10:23:56+00:00
Comments: Accepted by KDD 2026 research track. Code and data are available at https://github.com/ronpay/TAME
Abstract
Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on the simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., "A yellow puppy" -> "Your puppy Mochi"), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant is capable of engaging in long-context dialogues with humans and continually improving its experience quality by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs in perceiving variations of personalized concepts and generating contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce a novel training-free and state-aware framework TAME. TAME endows MLLMs with double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step to extract the contextually fitted information from the multi-memory retrieved knowledge to the current questions, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.
中文标题/摘要
标题:TAMEing长上下文在个性化中的应用:面向训练无依赖和状态感知的MLLM个性化助理
多模态大型语言模型(MLLM)个性化是一个关键的研究问题,它使MLLM能够与特定实体(称为个性化概念)进行个性化的对话。然而,现有的方法和基准主要关注简单的、上下文无关的个性化概念的视觉识别和文本替换(例如,“一只黄色的狗” -> “你的小狗Mochi”),忽视了支持长上下文对话的能力。理想的个性化MLLM助理能够与人类进行长上下文对话,并通过学习过去的对话历史来不断改进其体验质量。为了弥合这一差距,我们提出了LCMP,这是第一个长上下文MLLM个性化评估基准。LCMP评估MLLM在感知个性化概念的变化以及生成反映这些变化的上下文适当个性化响应方面的能力。作为LCMP的强基线,我们引入了一种新的训练无依赖和状态感知框架TAME。TAME赋予MLLM双记忆,以不同的方式管理每个个性化概念的时态和持久变化。此外,TAME结合了一种新的训练无依赖检索-然后对齐增强生成(RA2G)范式。RA2G引入了一步对齐步骤,从多记忆检索的知识中提取与当前问题相适应的信息,从而更好地处理复杂的现实用户查询。在LCMP上的实验表明,TAME取得了最佳性能,展示了在长上下文场景中显著且不断进化的交互体验。
Summary / 总结
The research aims to enhance personalized dialogues with MLLMs by addressing the limitation of existing methods that focus on simple, context-agnostic personalization. The proposed LCMP benchmark evaluates MLLMs' ability to handle long-context conversations and adapt to variations of personalized concepts. TAME, a training-free and state-aware framework, is introduced to manage temporal and persistent variations of personalized concepts through double memories and a Retrieve-then-Align Augmented Generation (RA2G) paradigm, which improves contextually appropriate responses. Experiments show that TAME outperforms other methods in long-context scenarios, providing better interaction experiences.
研究旨在通过解决现有方法仅关注简单视觉和文本替换而不考虑长上下文对话的问题,来提升个性化对话中MLLM的能力。提出的LCMP基准评估MLLM处理个性化概念变体并生成上下文适当响应的能力。TAME是一种训练免费且状态感知的框架,通过双记忆管理个性化概念的临时和持久变化,并引入新的RA2G范式。实验表明,TAME在长上下文场景中提供了更好的交互体验。
A Medical Multimodal Diagnostic Framework Integrating Vision-Language Models and Logic Tree Reasoning
Authors: Zelin Zang, Wenyi Gu, Siqi Ma, Dan Yang, Yue Shen, Zhu Zhang, Guohui Fan, Wing-Kuen Ling, Fuji Yang
First: 2025-12-25T09:01:06+00:00 · Latest: 2025-12-25T09:01:06+00:00
Abstract
With the rapid growth of large language models (LLMs) and vision-language models (VLMs) in medicine, simply integrating clinical text and medical imaging does not guarantee reliable reasoning. Existing multimodal models often produce hallucinations or inconsistent chains of thought, limiting clinical trust. We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. The system includes an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles stepwise premises into verifiable conclusions. Evaluations on MedXpertQA and other benchmarks show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive on text-only settings. These results suggest a promising step toward trustworthy multimodal medical AI.
中文标题/摘要
标题:一种结合视觉语言模型和逻辑树推理的医学多模态诊断框架
随着医学领域大型语言模型(LLMs)和视觉语言模型(VLMs)的迅速发展,仅仅整合临床文本和医学影像并不能保证可靠的推理。现有的一些多模态模型常常产生幻觉或不一致的推理链,限制了临床信任。我们提出了一种基于LLaVA的诊断框架,结合了视觉语言对齐和逻辑规整推理。该系统包括一个文本和图像的输入编码器、一个用于跨模态对齐的投影模块、一个分解诊断任务的推理控制器,以及一个逻辑树生成器,将逐步前提组装成可验证的结论。在MedXpertQA和其他基准上的评估表明,我们的方法提高了多模态诊断的准确性,并在多模态任务上产生了更可解释的推理轨迹,同时在纯文本设置上保持竞争力。这些结果表明,朝着可信的多模态医学人工智能迈出了有希望的一步。
Summary / 总结
The research aims to address the limitations of existing multimodal models in medicine, which often produce hallucinations or inconsistent reasoning. The proposed framework builds on LLaVA and combines vision-language alignment with logic-regularized reasoning to improve diagnostic accuracy and produce more interpretable reasoning traces. Evaluations on MedXpertQA and other benchmarks show improved diagnostic accuracy and more interpretable reasoning on multimodal tasks, while the method remains competitive in text-only settings.
研究旨在解决现有医学多模态模型存在的问题,如产生幻觉或不一致的推理。提出的框架结合了LLaVA和逻辑规约推理,以提高诊断准确性并生成更具解释性的推理轨迹。在MedXpertQA和其他基准上的评估表明,该方法在多模态任务上提高了诊断准确性,并且在纯文本设置上保持了竞争力。
Towards Long-window Anchoring in Vision-Language Model Distillation
Authors: Haoyi Zhou, Shuo Li, Tianyu Chen, Qi Song, Chonghan Gao, Jianxin Li
Venue: AAAI 2026
First: 2025-12-25T08:39:14+00:00 · Latest: 2025-12-25T08:39:14+00:00
Comments: Accepted by AAAI 2026
Abstract
While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students' capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
中文标题/摘要
标题:向视觉语言模型蒸馏中长窗口锚定迈进
虽然大型视觉语言模型(VLMs)在长上下文理解方面表现出色,但它们普遍存在的小分支在有限窗口大小下对语言-摄影对齐表现不佳。我们发现,知识蒸馏作为旋转位置嵌入(RoPE)的补充,能够提升学生模型在不同窗口大小上的能力。基于这一发现,我们提出了LAid,该方法直接致力于通过两种互补组件转移长距离注意力机制:(1)渐进的距离加权注意力匹配,在训练过程中动态强调更长的位置差异;(2)可学习的RoPE响应增益调制,选择性地在需要的地方放大位置敏感性。在多个模型家族的广泛实验中,LAid蒸馏模型的有效上下文窗口长度比基线小模型长3.2倍,同时在标准VL基准测试上保持或提高了性能。频谱分析还表明,LAid成功保留了传统方法无法转移的关键低频注意力成分。我们的工作不仅提供了构建更高效的长上下文VLMs的实用技术,还提供了蒸馏过程中位置理解如何出现和转移的理论见解。
Summary / 总结
The research aims to enhance the long-context understanding of small vision-language models by improving their alignment with large models. LAid, a novel method, is proposed to transfer long-range attention mechanisms through progressive distance-weighted attention matching and learnable RoPE response gain modulation. Experiments show that LAid-distilled models can achieve up to 3.2 times longer effective context windows than baseline models while maintaining or improving performance on standard VL benchmarks. Spectral analysis indicates that LAid successfully preserves low-frequency attention components that other methods fail to transfer.
本文针对小型视觉-语言模型在处理长上下文理解方面的局限性,提出了LAid,通过渐进的距离加权注意力匹配和可学习的RoPE响应增益调制来增强知识蒸馏。实验表明,LAid蒸馏模型可以实现比基线模型长3.2倍的有效上下文窗口,同时在标准基准测试上保持或提高了性能。频谱分析表明,LAid成功保留了其他方法无法转移的重要低频注意力成分。
Toward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration
Authors: Unnati Saraswat, Tarun Rao, Namah Gupta, Shweta Swami, Shikhar Sharma, Prateek Narang, Dhruv Kumar
First: 2025-12-25T08:12:27+00:00 · Latest: 2025-12-25T08:12:27+00:00
Abstract
Intelligent image editing increasingly relies on advances in computer vision, multimodal reasoning, and generative modeling. While vision-language models (VLMs) and diffusion models enable guided visual manipulation, existing work rarely ensures that inserted objects are contextually appropriate. We introduce two new tasks for advertising and digital media: (1) context-aware object insertion, which requires predicting suitable object categories, generating them, and placing them plausibly within the scene; and (2) sponsor-product logo augmentation, which involves detecting products and inserting correct brand logos, even when items are unbranded or incorrectly branded. To support these tasks, we build two new datasets with category annotations, placement regions, and sponsor-product labels.
中文标题/摘要
标题:面向上下文感知对象插入和赞助商标识集成的智能场景增强
智能图像编辑越来越多地依赖计算机视觉、多模态推理和生成建模的进步。尽管视觉语言模型(VLM)和扩散模型能够实现指导性的视觉操作,但现有工作很少确保插入的对象是上下文适宜的。我们提出了两个新的广告和数字媒体任务:(1)上下文感知对象插入,要求预测合适的对象类别、生成它们并在场景中合理地放置;(2)赞助产品标识增强,涉及检测产品并插入正确的品牌标识,即使物品未标品牌或错误标品牌。为了支持这些任务,我们构建了两个新的数据集,包含类别注释、放置区域和赞助产品标签。
Summary / 总结
The research aims to make inserted objects and sponsor logos contextually appropriate in edited images. It introduces two tasks: context-aware object insertion, which requires predicting suitable object categories, generating them, and placing them plausibly within the scene, and sponsor-product logo augmentation, which involves detecting products and inserting correct brand logos. To support these tasks, the authors build two new datasets with category annotations, placement regions, and sponsor-product labels.
研究旨在通过确保对象插入和品牌标识集成的上下文适宜性来提升智能图像编辑。该研究引入了两个新任务:上下文感知对象插入和赞助产品标识增强。为了实现这一目标,作者构建了两个新数据集,包含对象类别、放置区域和赞助产品的详细标注,使得能够在场景中合理生成和放置对象和标识。
Chain-of-Evidence Multimodal Reasoning for Few-shot Temporal Action Localization
Authors: Mengshi Qi, Hongwei Ji, Wulian Yun, Xianlin Zhang, Huadong Ma
First: 2025-04-18T04:35:35+00:00 · Latest: 2025-12-25T06:57:15+00:00
Abstract
Traditional temporal action localization (TAL) methods rely on large amounts of detailed annotated data, whereas few-shot TAL reduces this dependence by using only a few training samples to identify unseen action categories. However, existing few-shot TAL methods typically focus solely on video-level information, neglecting textual information, which can provide valuable semantic support for the action localization task. To address these issues, in this work, we propose a new few-shot temporal action localization method by Chain-of-Evidence multimodal reasoning to improve localization performance. Specifically, we design a novel few-shot learning framework to capture action commonalities and variations, which includes a semantic-aware text-visual alignment module designed to align the query and support videos at different levels. Meanwhile, to better express the temporal dependencies and causal relationships between actions at the textual level, we design a Chain-of-Evidence (CoE) reasoning method that progressively guides the Vision Language Model (VLM) and Large Language Model (LLM) to generate CoE text descriptions for videos. The generated texts can capture more variance of action than visual features. We conduct extensive experiments on the publicly available ActivityNet1.3, THUMOS14 and our newly collected Human-related Anomaly Localization Dataset. The experimental results demonstrate that our proposed method significantly outperforms existing methods in single-instance and multi-instance scenarios. Our source code and data are available at https://github.com/MICLAB-BUPT/VAL-VLM.
中文标题/摘要
标题:证据链多模态推理在少样本时间动作定位中的应用
传统的时序动作定位(TAL)方法依赖大量详细的标注数据,而少样本TAL通过仅使用少量训练样本来识别未见过的动作类别,从而减少了这种依赖。然而,现有的少样本TAL方法通常仅关注视频级别的信息,忽略了文本信息,而文本信息可以为动作定位任务提供有价值的语义支持。为了解决这些问题,本文提出了一种新的基于证据链多模态推理的少样本时间动作定位方法,以提高定位性能。具体而言,我们设计了一种新颖的少样本学习框架来捕捉动作的共性和变异性,其中包括一种语义感知的文本-视觉对齐模块,用于在不同级别对查询和支撑视频进行对齐。同时,为了更好地表达文本级别上动作之间的时序依赖性和因果关系,我们设计了一种证据链(CoE)推理方法,逐步引导视觉语言模型(VLM)和大型语言模型(LLM)生成视频的CoE文本描述。生成的文本可以比视觉特征捕捉到更多的动作变化。我们在公开的ActivityNet1.3、THUMOS14以及我们新收集的人类相关异常定位数据集上进行了广泛的实验。实验结果表明,我们提出的方法在单实例和多实例场景中显著优于现有方法。我们的源代码和数据可在https://github.com/MICLAB-BUPT/VAL-VLM上获取。
Summary / 总结
This paper proposes a Chain-of-Evidence multimodal reasoning method for few-shot temporal action localization, addressing the limitations of existing methods by incorporating textual information. The method includes a semantic-aware text-visual alignment module and a Chain-of-Evidence reasoning process that guides Vision Language Models and Large Language Models to generate detailed text descriptions. Experiments on ActivityNet1.3, THUMOS14, and a new dataset show significant improvements over existing methods in both single-instance and multi-instance scenarios.
本文提出了一种基于链式证据的多模态推理方法,以解决少样本时空动作定位的问题。该方法设计了语义感知的图文对齐模块和链式证据推理方法,更好地利用了文本和视觉信息。实验结果表明,该方法在ActivityNet1.3、THUMOS14以及一个新收集的数据集上均优于现有方法,在单实例和多实例场景下均表现出色。
Hierarchy-Aware Fine-Tuning of Vision-Language Models
Authors: Jiayu Li, Rajesh Gangireddy, Samet Akcay, Wei Cheng, Juhua Hu
First: 2025-12-25T06:44:33+00:00 · Latest: 2025-12-25T06:44:33+00:00
Abstract
Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM's shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.
中文标题/摘要
标题:基于层次结构的视觉-语言模型微调
视觉-语言模型(VLMs)通过大规模的图像-文本预训练学习强大的多模态表示,但将其适应层次分类尚未得到充分探索。标准方法将标签视为扁平类别,并需要完全微调,这既昂贵又会在分类层次结构的不同级别上产生不一致的预测。我们提出了一种高效的基于层次结构的微调框架,该框架更新少量参数并强制结构一致性。我们结合了两个目标:树路径KL散度(TP-KL)沿真实标签路径对齐预测,以实现垂直一致性,而层次兄弟平滑交叉熵(HiSCE)鼓励兄弟类之间的一致预测。这两种损失在VLM的共享嵌入空间中工作,并与轻量级的LoRA适应相结合。在多个基准测试中的实验表明,在参数开销最小的情况下,我们的方法在全长准确性和基于树的不一致性误差方面都表现出一致的改进。我们的方法为将VLMs适应结构化分类层次结构提供了一种有效的策略。
Summary / 总结
The research aims to improve the adaptation of Vision-Language Models (VLMs) to hierarchical classification tasks by proposing a hierarchy-aware fine-tuning framework. This framework updates only a few parameters while ensuring structural consistency. It combines Tree-Path KL Divergence and Hierarchy-Sibling Smoothed Cross-Entropy to align predictions along the label path and encourage consistent predictions among sibling classes, respectively. Experiments across multiple benchmarks demonstrate consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead.
论文针对Vision-Language模型(VLM)在层次分类中的适应性不足问题,提出了一种新的层次感知微调框架,该框架仅更新少量参数并保持结构一致性。该框架结合了Tree-Path KL Divergence和Hierarchy-Sibling Smoothed Cross-Entropy,以提高垂直一致性和兄弟类别的一致性。实验结果表明,在多个基准测试中,该方法能够以最小的参数开销实现全路径准确性和树基不一致性错误的持续改进。
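The sibling-aware objective in the entry above can be illustrated with a small sketch. The abstract names Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) but does not give its formula, so the version below is one plausible reading: label smoothing whose smoothing mass goes only to sibling classes. The function names, the `eps` value, and the three-class toy taxonomy are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sibling_smoothed_target(true_idx, siblings, num_classes, eps=0.1):
    """Put 1 - eps on the true class and share eps among its siblings.

    Assumed form only; the paper's exact HiSCE definition may differ.
    """
    target = [0.0] * num_classes
    if not siblings:
        target[true_idx] = 1.0
        return target
    target[true_idx] = 1.0 - eps
    for s in siblings:
        target[s] = eps / len(siblings)
    return target


def cross_entropy(target, logits):
    """Cross-entropy between a soft target and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Toy taxonomy: classes 0 and 1 share a parent, class 2 sits elsewhere.
target = sibling_smoothed_target(true_idx=0, siblings=[1], num_classes=3)
loss_sibling_confusion = cross_entropy(target, [2.0, 1.0, -1.0])
loss_distant_confusion = cross_entropy(target, [2.0, -1.0, 1.0])
```

Under this reading, leaking probability to a sibling is penalized less than leaking it to an unrelated class, which is one way to encourage taxonomy-consistent mistakes.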
PhysicsCorrect: A Training-Free Approach for Stable Neural PDE Simulations
Authors: Xinquan Huang, Paris Perdikaris
Venue: AAAI 2026 Oral
First: 2025-07-03T01:22:57+00:00 · Latest: 2025-12-25T05:02:40+00:00
Comments: AAAI 2026 Oral
Abstract
Neural networks have emerged as powerful surrogates for solving partial differential equations (PDEs), offering significant computational speedups over traditional methods. However, these models suffer from a critical limitation: error accumulation during long-term rollouts, where small inaccuracies compound exponentially, eventually causing complete divergence from physically valid solutions. We present PhysicsCorrect, a training-free correction framework that enforces PDE consistency at each prediction step by formulating correction as a linearized inverse problem based on PDE residuals. Our key innovation is an efficient caching strategy that precomputes the Jacobian and its pseudoinverse during an offline warm-up phase, reducing computational overhead by two orders of magnitude compared to standard correction approaches. Across three representative PDE systems, including Navier-Stokes fluid dynamics, wave equations, and the chaotic Kuramoto-Sivashinsky equation, PhysicsCorrect reduces prediction errors by up to 100x while adding negligible inference time (under 5%). The framework integrates seamlessly with diverse architectures, including Fourier Neural Operators, UNets, and Vision Transformers, effectively transforming unstable neural surrogates into reliable simulation tools that bridge the gap between deep learning's computational efficiency and the physical fidelity demanded by practical scientific applications.
中文标题/摘要
标题:PhysicsCorrect:一种无需训练的稳定神经PDE模拟方法
神经网络已发展成为解决偏微分方程(PDEs)的强大代理,相比传统方法提供了显著的计算速度提升。然而,这些模型面临一个关键限制:在长时间模拟过程中误差累积,导致微小的不准确性指数级放大,最终导致完全偏离物理上有效的解。我们提出了一种无需训练的纠正框架PhysicsCorrect,通过基于PDE残差的线性化逆问题来在每个预测步骤中强制执行PDE一致性。我们的关键创新是一种高效的缓存策略,在离线预热阶段预计算雅可比矩阵及其伪逆,与标准纠正方法相比,将计算开销降低了两个数量级。在包括纳维-斯托克斯流体动力学、波动方程和混沌库拉托夫斯基-西瓦什金斯基方程在内的三个代表性PDE系统中,PhysicsCorrect将预测误差降低了高达100倍,同时增加的推理时间不到5%。该框架可以无缝集成到各种架构中,包括傅里叶神经算子、UNets和视觉变换器,有效地将不稳定的神经代理转化为可靠的模拟工具,填补了深度学习计算效率与实际科学应用所需的物理精度之间的差距。
Summary / 总结
PhysicsCorrect is a training-free framework that corrects neural surrogate predictions for PDE simulations by enforcing PDE consistency at each prediction step, posing the correction as a linearized inverse problem on the PDE residuals. It precomputes the Jacobian and its pseudoinverse during an offline warm-up phase and caches them, which cuts computational overhead by two orders of magnitude relative to standard correction approaches. Across three PDE systems (Navier-Stokes, wave, and Kuramoto-Sivashinsky), PhysicsCorrect reduces prediction errors by up to 100x while adding under 5% to inference time, making neural PDE solvers more reliable for scientific applications.
PhysicsCorrect 是一个无需训练的框架,通过在每一步强制执行 PDE 一致性来纠正神经网络的预测。它使用高效的缓存策略来预计算雅可比矩阵及其伪逆,从而减少计算开销。在各种 PDE 系统中,PhysicsCorrect 将预测误差降低多达 100 倍,同时增加的推理时间几乎可以忽略不计,使神经网络在科学模拟中更加可靠。
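The correction step described above can be sketched on a toy problem. This is not the authors' implementation: it assumes a 1D heat equation discretized with implicit Euler, where the residual is linear in the next state, so a single cached pseudoinverse gives an exact correction. The grid size, time step, diffusivity, and noise level are all hypothetical, and the noisy `u_pred` merely stands in for a neural surrogate's output.

```python
import numpy as np


def laplacian_1d(n: int, dx: float) -> np.ndarray:
    """Second-difference matrix with periodic boundaries."""
    eye = np.eye(n)
    return (-2.0 * eye + np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)) / dx**2


def residual(u_next, u_prev, lap, dt, nu):
    """Residual of one implicit-Euler heat step; zero when the step is PDE-consistent."""
    return u_next - u_prev - dt * nu * (lap @ u_next)


n, dx, dt, nu = 32, 1.0 / 32, 1e-3, 0.1
lap = laplacian_1d(n, dx)

# Offline warm-up: Jacobian of the residual w.r.t. u_next, and its
# pseudoinverse, are computed once and reused at every rollout step.
J = np.eye(n) - dt * nu * lap
J_pinv = np.linalg.pinv(J)


def correct(u_pred, u_prev):
    """Linearized correction: solve J @ delta = -residual using the cached pseudoinverse."""
    return u_pred - J_pinv @ residual(u_pred, u_prev, lap, dt, nu)


rng = np.random.default_rng(0)
x = np.arange(n) * dx
u_prev = np.sin(2.0 * np.pi * x)
u_exact = np.linalg.solve(J, u_prev)
u_pred = u_exact + 0.05 * rng.standard_normal(n)  # stand-in for surrogate error

u_corr = correct(u_pred, u_prev)
res_before = np.linalg.norm(residual(u_pred, u_prev, lap, dt, nu))
res_after = np.linalg.norm(residual(u_corr, u_prev, lap, dt, nu))
```

For a nonlinear PDE the Jacobian depends on the state, so a cached pseudoinverse would only approximate the correction; how the paper handles that case is not covered by the abstract.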
Fixed-Budget Parameter-Efficient Training with Frozen Encoders Improves Multimodal Chest X-Ray Classification
Authors: Md Ashik Khan, Md Nahid Siddique
First: 2025-12-25T05:02:19+00:00 · Latest: 2025-12-25T05:02:19+00:00
Comments: Accepted at the 2025 28th International Conference on Computer and Information Technology (ICCIT). 6 pages, 6 figures
Abstract
Multimodal chest X-Ray analysis often fine-tunes large vision-language models, which is computationally costly. We study parameter-efficient training (PET) strategies, including frozen encoders, BitFit, LoRA, and adapters for multi-label classification on the Indiana University Chest X-Ray dataset (3,851 image-report pairs; 579 test samples). To mitigate data leakage, we redact pathology terms from reports used as text inputs while retaining clinical context. Under a fixed parameter budget (2.37M parameters, 2.51% of total), all PET variants achieve AUROC between 0.892 and 0.908, outperforming full fine-tuning (0.770 AUROC), which uses 94.3M trainable parameters, a 40x reduction. External validation on CheXpert (224,316 images, 58x larger) confirms scalability: all PET methods achieve >0.69 AUROC with <9% trainable parameters, with Adapter achieving best performance (0.7214 AUROC). Budget-matched comparisons reveal that vision-only models (0.653 AUROC, 1.06M parameters) outperform budget-matched multimodal models (0.641 AUROC, 1.06M parameters), indicating improvements arise primarily from parameter allocation rather than cross-modal synergy. While PET methods show degraded calibration (ECE: 0.29-0.34) compared to simpler models (ECE: 0.049), this represents a tractable limitation addressable through post-hoc calibration methods. These findings demonstrate that frozen encoder strategies provide superior discrimination at substantially reduced computational cost, though calibration correction is essential for clinical deployment.
中文标题/摘要
标题:固定预算参数高效训练与冻结编码器改善多模态胸部X光分类
多模态胸部X光分析通常微调大型视觉-语言模型,这在计算上成本高昂。我们研究了参数高效训练(PET)策略,包括冻结编码器、BitFit、LoRA和适配器,用于印第安纳大学胸部X光数据集(3,851张图像-报告对;579个测试样本)的多标签分类。为避免数据泄露,我们在作为文本输入的报告中删除病理术语,同时保留临床背景。在固定参数预算(2.37M参数,总参数的2.51%)下,所有PET变体的AUROC在0.892到0.908之间,优于使用94.3M可训练参数的全微调(0.770 AUROC),后者参数减少40倍。在CheXpert的外部验证(224,316张图像,规模大58倍)中,所有PET方法均实现>0.69 AUROC,且<9%的可训练参数,适配器表现最佳(0.7214 AUROC)。预算匹配比较显示,仅视觉模型(0.653 AUROC,1.06M参数)优于预算匹配的多模态模型(0.641 AUROC,1.06M参数),表明改进主要来自参数分配而非跨模态协同作用。尽管PET方法的校准(ECE:0.29-0.34)比简单模型(ECE:0.049)差,但这是可以通过后处理校准方法解决的可管理限制。这些发现表明,冻结编码器策略在显著降低计算成本的同时提供了更优的区分能力,但校准修正对于临床部署至关重要。
Summary / 总结
The study investigates parameter-efficient training (PET) strategies for multimodal chest X-ray classification, covering frozen encoders, BitFit, LoRA, and adapters. On the Indiana University Chest X-Ray dataset, all PET variants reach AUROC between 0.892 and 0.908 under a fixed budget of 2.37M trainable parameters (2.51% of the total), outperforming full fine-tuning (0.770 AUROC with 94.3M trainable parameters, roughly 40x more). External validation on CheXpert shows that all PET methods achieve >0.69 AUROC with <9% trainable parameters, with adapters performing best (0.7214 AUROC). Budget-matched comparisons find vision-only models slightly ahead of multimodal ones (0.653 vs. 0.641 AUROC), suggesting the gains come from parameter allocation rather than cross-modal synergy. PET methods are also less well calibrated (ECE 0.29-0.34 vs. 0.049 for simpler models), so calibration correction is necessary before clinical use.
研究探讨了参数高效训练(PET)策略在多模态胸部X光分类中的应用,包括冻结编码器、BitFit、LoRA和适配器。使用印第安纳大学胸部X光数据集,研究显示所有PET方法在固定参数预算2.37M参数下,AUROC达到0.892到0.908,显著优于全量微调。外部验证在CheXpert上确认了可扩展性,所有PET方法使用<9%的可训练参数均达到>0.69 AUROC,适配器表现最佳。研究结果表明,冻结编码器策略在较低计算成本下提供了更好的区分能力,但需要进行校准修正以适应临床应用。
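The calibration figures above are expected calibration error (ECE) values. A minimal sketch of the standard binned ECE for binary predictions follows. It is an assumption that the paper uses this form, since the abstract does not say how ECE is aggregated across labels in the multi-label setting. The bin count, the 0.5 decision threshold, and the toy inputs are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(probs: Sequence[float],
                               labels: Sequence[int],
                               n_bins: int = 10) -> float:
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        acc = sum(a for _, a in b) / len(b)
        ece += (len(b) / n) * abs(acc - mean_conf)
    return ece


# Four predictions at 0.9 confidence, three of them correct:
# accuracy 0.75 vs. confidence 0.9 gives an ECE of 0.15.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
ece_perfect = expected_calibration_error([1.0, 0.0], [1, 0])
```

An ECE near 0.3, as reported for the PET variants, means stated confidence and observed accuracy differ by about 30 points on average, which is why post-hoc calibration is flagged as a prerequisite for deployment.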
SVBench: Evaluation of Video Generation Models on Social Reasoning
Authors: Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
First: 2025-12-25T04:44:59+00:00 · Latest: 2025-12-25T04:44:59+00:00
Comments: 10pages
Abstract
Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
中文标题/摘要
标题:SVBench:视频生成模型在社会推理评估上的表现
近期的文本到视频生成模型在视觉真实感、运动保真度和文本视频对齐方面取得了显著进展,但它们在生成社会连贯行为方面仍然存在根本性的局限。与人类不同,人类可以从简短的视觉线索中轻松推断意图、信念、情感和社会规范,而当前的模型往往渲染字面场景,而未能捕捉到潜在的因果或心理逻辑。为了系统地评估这一差距,我们引入了第一个视频生成中的社会推理基准。该基准基于发展心理学和社会心理学的研究成果,将三十个经典的社会认知范式组织成七个核心维度,包括心理状态推断、目标导向行为、共同注意、社会协调、亲社会行为、社会规范和多智能体策略。为了实现这些范式的操作化,我们开发了一个完全无需训练的基于代理的流水线,该流水线包括:(i) 提炼每个实验的推理机制,(ii) 合成多种多样的视频准备场景,(iii) 通过基于线索的批评实现概念中立性和难度控制,以及(iv) 使用高容量VLM裁判员在五个可解释的社会推理维度上评估生成的视频。利用这一框架,我们在七个最先进的视频生成系统上进行了首次大规模研究。我们的结果显示了显著的性能差距:尽管现代模型在表面合理性方面表现出色,但在意图识别、信念推理、共同注意和亲社会推理方面却系统性地失败。
Summary / 总结
The research evaluates whether text-to-video generation models can produce socially coherent behavior, an area where they lag well behind human social reasoning. The study introduces SVBench, a benchmark for social reasoning in video generation that organizes thirty classic social cognition paradigms from developmental and social psychology into seven core dimensions. Across seven state-of-the-art video generation systems, modern models achieve surface-level plausibility but systematically fail at intention recognition, belief reasoning, joint attention, and prosocial inference.
研究旨在评估文本到视频生成模型的社会推理能力,这些模型在视觉真实性和对齐方面取得了显著进展,但在生成社会连贯行为方面却有所欠缺。研究引入了SVBench,该基准基于发展和社会心理学中的社会认知范式,涵盖了七个核心维度。方法涉及一个无需训练的基于代理的流水线,用于合成多样化的场景并使用高容量VLM评判器评估生成的视频。关键发现表明,虽然模型在表面合理性方面表现出色,但在意图识别、信念推理、共同注意和利他推理方面存在系统性缺陷。
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Authors: Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
First: 2025-12-19T04:09:24+00:00 · Latest: 2025-12-25T02:25:27+00:00
Abstract
While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
中文标题/摘要
标题:DAVE:为文档理解和网络代理设计的VLM视觉编码器
尽管视觉语言模型(VLMs)在多模态任务中表现出色,但它们选择的视觉编码器存在根本性弱点:低级特征缺乏文档理解和网络代理所需的稳健的结构和空间信息。为弥补这一差距,我们引入了DAVE,这是一种专为VLMs设计并针对这些任务定制的视觉编码器。我们的训练管道设计用于利用大量未标记数据,以绕过对文档和网络图像的大规模注释成本。我们首先在未标记图像上进行自我监督预训练阶段,然后在监督自回归预训练阶段,模型从有限的高质量数据中学习解析和定位等任务。在监督阶段内,我们采用两种策略来提高编码器与通用视觉知识和多样化文档及网络代理任务的对齐:(i) 我们引入了一种新的模型合并方案,将使用不同文本解码器训练的编码器结合在一起,以确保与不同网络代理架构的广泛兼容性。(ii) 我们使用集成训练将预训练通用编码器(例如SigLIP2)的特征与我们自己的文档和网络特定表示融合在一起。在经典文档任务、VQAs、网络定位和基于代理的基准测试中的广泛实验验证了我们方法的有效性,确立了DAVE作为文档和网络应用的强大视觉编码器的地位。
Summary / 总结
DAVE is a vision encoder built for vision-language models and tailored to document understanding and web agents, tasks that demand structural and spatial information that existing encoders' low-level features lack. Training proceeds in two stages: self-supervised pretraining on unlabeled images, then supervised autoregressive pretraining on limited, high-quality data. Within the supervised stage, a model-merging scheme combines encoders trained with different text decoders for compatibility across web agentic architectures, and ensemble training fuses features from pretrained generalist encoders (e.g., SigLIP2) with DAVE's document- and web-specific representations. Experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks support DAVE as a strong vision encoder for these applications.
研究旨在解决现有视觉-语言模型(VLMs)在捕捉文档理解和网页代理所需的空间和结构信息方面的局限性。为这些任务引入了专门的视觉编码器DAVE。它通过在未标注数据上的自我监督预训练阶段和有限高质量数据上的监督自回归阶段进行训练。采用模型合并和集成训练等策略,以增强与各种网页代理任务的兼容性。实验结果验证了该方法的有效性,确立了DAVE作为文档和网页应用的强大视觉编码器的地位。
RLLaVA: An RL-central Framework for Language and Vision Assistants
Authors: Lei Zhao, Zihao Ma, Boyu Lin, Yuhe Liu, Wenjun Wu, Lei Huang
First: 2025-12-25T00:09:02+00:00 · Latest: 2025-12-25T00:09:02+00:00
Comments: The code is available at https://github.com/TinyLoopX/RLLaVA
Abstract
We present an RL-central framework for Language and Vision Assistants (RLLaVA) with its formulation of Markov decision process (MDP). RLLaVA decouples RL algorithmic logic from model architecture and distributed execution, supporting researchers in implementing new RL algorithms with minimal code, and to plug in a broad family of RL methods and vision-language models (VLMs) while remaining agnostic to specific training and inference engines. RLLaVA makes resource-efficient training of 1B--7B models feasible on common GPUs; notably, 4B-scale models can be trained end-to-end with full-parameter updates on a single 24GB GPU. Experiments on multi-modal and agentic tasks demonstrate that RLLaVA has task extensibility, and the models trained with it consistently improve performance over base models, competitive with other specially engineered RL frameworks. The code is available at https://github.com/TinyLoopX/RLLaVA.
中文标题/摘要
标题:RLLaVA:一种以强化学习为中心的语言和视觉助手框架
我们提出了一种以强化学习为中心的语言和视觉助手框架(RLLaVA),并对其马尔可夫决策过程(MDP)进行了形式化描述。RLLaVA 将强化学习算法逻辑与模型架构及分布式执行分离,支持研究人员以最少的代码实现新的强化学习算法,并可插入广泛的强化学习方法和视觉-语言模型(VLMs),同时对特定的训练和推理引擎保持中立。RLLaVA 使 1B 到 7B 模型的资源高效训练在常见 GPU 上成为可能;值得注意的是,4B 规模的模型可以在单个 24GB GPU 上端到端地进行训练,且具有完整的参数更新。在多模态和自主任务上的实验表明,RLLaVA 具有任务扩展性,使用它训练的模型在性能上持续优于基线模型,且与其它专门设计的强化学习框架具有竞争力。代码可在 https://github.com/TinyLoopX/RLLaVA 获取。
Summary / 总结
RLLaVA is an RL-central framework for Language and Vision Assistants, formulated as a Markov decision process (MDP), that decouples RL algorithmic logic from model architecture and distributed execution. This lets researchers implement new RL algorithms with minimal code and plug in a broad range of RL methods and vision-language models while staying agnostic to specific training and inference engines. RLLaVA makes resource-efficient training of 1B to 7B models feasible on common GPUs, and 4B-scale models can be trained end-to-end with full-parameter updates on a single 24GB GPU. On multi-modal and agentic tasks, models trained with the framework consistently improve over their base models and are competitive with other specially engineered RL frameworks.
RLLaVA 是一个用于语言和视觉助手的 RL 中心框架,通过 MDP 形式化并解耦了 RL 算法逻辑与模型架构。这使得研究人员能够用最少的代码实现新的 RL 算法,并支持广泛的 RL 方法和视觉-语言模型。实验表明,RLLaVA 可以在常见 GPU 上高效训练从 1B 到 7B 参数的模型,4B 规模的模型可以在单个 24GB GPU 上端到端训练。该框架在多模态和代理任务上的一致性能提升超过了基线模型,与其它专门设计的 RL 框架相当。
A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding
Authors: Christina Liu, Alan Q. Wang, Joy Hsu, Jiajun Wu, Ehsan Adeli
First: 2025-12-24T20:30:01+00:00 · Latest: 2025-12-24T20:30:01+00:00
Abstract
Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models) which are composed to make a prediction. The dominant approach to composing tools is using text, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding called the Tool Bottleneck Framework (TBF), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, TBF leverages an off-the-shelf medical VLM to select tools from a toolbox that each extract clinically-relevant features. Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. We propose a simple and effective strategy for TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool-use in medical imaging contexts, but also yields more interpretable, clinically-grounded predictors. We evaluate TBF on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. Our code is available at https://github.com/christinaliu2020/tool-bottleneck-framework.
中文标题/摘要
标题:一种临床指导和可解释的医学影像理解工具瓶颈框架
基于视觉-语言模型(VLMs)的近期工具使用框架通过将模型预测与专业工具结合来提高影像理解能力。这些框架通常利用VLM和预定义的工具箱将预测任务分解为多个工具调用(通常为深度学习模型),并通过组合这些工具来做出预测。工具组合的主要方法是使用文本,通过嵌入在VLM生成代码或自然语言中的函数调用来实现。然而,这些方法在医学影像理解中表现不佳,因为显著信息编码为局部特征,仅通过文本难以组合或融合。为解决这一问题,我们提出了一种称为工具瓶颈框架(TBF)的医学影像理解工具使用框架,该框架使用学习到的工具瓶颈模型(TBM)来组合VLM选择的工具。对于给定的影像和任务,TBF利用现成的医学VLM从工具箱中选择工具,这些工具各自提取临床相关特征。这些工具不是通过文本组合,而是由TBM通过神经网络计算和融合工具输出,然后输出最终预测。我们提出了一种简单而有效的方法,使TBMs能够对任何任意VLM工具选择进行预测。总体而言,我们的框架不仅在医学成像环境中提高了工具使用,还产生了更可解释、临床导向的预测器。我们在组织病理学和皮肤科任务上评估了TBF,发现这些优势使我们的框架能够与基于深度学习的分类器、VLM和最先进的工具使用框架相媲美或更优,特别是在数据受限的情况下。我们的代码可在https://github.com/christinaliu2020/tool-bottleneck-framework/ 获取。
Summary / 总结
The research aims to improve medical image understanding with the Tool Bottleneck Framework (TBF), a tool-use framework that composes tools through a learned Tool Bottleneck Model (TBM) rather than through text. For a given image and task, an off-the-shelf medical VLM selects tools from a toolbox, each of which extracts clinically relevant features, and the TBM fuses the tool outputs with a neural network to produce the final prediction. Evaluated on histopathology and dermatology tasks, TBF performs on par with or better than deep learning classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes, while yielding more interpretable and clinically grounded predictions. The code is available at https://github.com/christinaliu2020/tool-bottleneck-framework.
本文提出了工具瓶颈框架(TBF),旨在增强医学图像理解。该框架受到基于文本的VLMs在医学图像中组合工具的局限性启发,使用学习到的工具瓶颈模型(TBM)来组合提取临床相关特征的工具。TBM通过神经网络融合工具输出,从而产生更可解释和临床相关的预测。在病理学和皮肤科任务上的实验表明,TBF在数据受限的情况下与深度学习分类器、VLMs和其他工具使用框架相比,表现相当或更优。
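One way a fusion network could accept an arbitrary subset of selected tools is sketched below. The abstract does not describe the authors' actual strategy, so this illustration is an assumption: unselected tools are zero-filled and a selection mask is appended to the input. The dimensions, the random untrained weights, and the function name `fuse` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOOLS, FEAT_DIM, HIDDEN, N_CLASSES = 3, 4, 8, 2

# Randomly initialized weights stand in for a trained fusion network.
W1 = rng.standard_normal((N_TOOLS * FEAT_DIM + N_TOOLS, HIDDEN)) * 0.1
W2 = rng.standard_normal((HIDDEN, N_CLASSES)) * 0.1


def fuse(tool_feats: np.ndarray, selected) -> np.ndarray:
    """Fuse per-tool features into class probabilities.

    tool_feats: array of shape (N_TOOLS, FEAT_DIM), one row per tool.
    selected:   boolean sequence of length N_TOOLS marking the chosen tools.
    """
    mask = np.asarray(selected, dtype=float)
    masked = tool_feats * mask[:, None]          # zero out unselected tools
    x = np.concatenate([masked.ravel(), mask])   # the mask tells the net what is missing
    hidden = np.maximum(x @ W1, 0.0)             # ReLU
    logits = hidden @ W2
    z = np.exp(logits - logits.max())            # numerically stable softmax
    return z / z.sum()


feats = rng.standard_normal((N_TOOLS, FEAT_DIM))
selection = [True, False, True]
probs = fuse(feats, selection)

# Changing an unselected tool's output must not change the prediction.
feats_changed = feats.copy()
feats_changed[1] += 100.0
probs_changed = fuse(feats_changed, selection)
```

Because the fused input is a fixed-size vector of named tool outputs, each input dimension can be traced back to a specific tool, which is the interpretability argument the abstract makes for a bottleneck over free-form text composition.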
Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation
Authors: Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi, Abhishek Chandwani, Ishaan Gupta, Pratik Narang, Dhruv Kumar
First: 2025-12-24T19:43:59+00:00 · Latest: 2025-12-24T19:43:59+00:00
Comments: Under Review
Abstract
Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.
中文标题/摘要
标题:理解病毒性传播:基于视觉语言模型的短格式教育娱乐内容评估框架
评估短格式视频内容需要超越表面质量指标,转向与人类价值观一致的多模态推理。虽然现有框架如VideoScore-2评估视觉和语义保真度,但它们未能捕捉特定视听属性如何驱动实际观众参与。在本文中,我们提出了一种数据驱动的评估框架,该框架使用视觉语言模型(VLMs)提取无监督的视听特征,将它们聚类为可解释的因素,并训练基于回归的评估器预测短格式教育娱乐视频的参与度。我们精心策划的YouTube Shorts数据集使我们能够系统地分析VLM提取的特征与人类参与行为之间的关系。实验结果显示,预测参与度与实际参与度之间存在强烈的相关性,证明了我们轻量级、基于特征的评估器相比传统指标(如SSIM、FID)提供了可解释且可扩展的评估。通过将评估根植于多模态特征重要性和以人为中心的参与信号,我们的方法朝着稳健且可解释的视频理解迈进。
Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00
Comments: Project page: https://sytwu.github.io/BeyondMemo/
Abstract
We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
中文标题/摘要
标题:超越记忆:多模态序数回归基准以揭示视觉语言模型中的流行度偏差
我们揭示了最先进的视觉语言模型(VLMs)中存在显著的流行度偏差,这些模型在著名建筑上的准确率比普通建筑高出34%,表明它们依赖于记忆而非可泛化的理解。为了系统地研究这一问题,我们引入了该任务上最大的开放基准数据集:YearGuessr数据集,包含来自157个国家的55,546张建筑图像,具有多模态属性,并附有其建设年份的连续序数标签(1001-2024)、GPS数据和页面浏览量作为流行度的代理。使用该数据集,我们将建筑年份预测任务框架化为序数回归,并引入了流行度感知的区间准确度指标来量化这种偏差。我们构建的包含30多个模型的基准,包括我们的YearCLIP模型,证实了VLMs在流行、记忆化的项目上表现出色,但在未识别的主题上却面临重大挑战,揭示了它们推理能力中的关键缺陷。项目页面:https://sytwu.github.io/BeyondMemo/
Summary / 总结
The paper addresses the significant popularity bias in state-of-the-art vision-language models (VLMs), which perform better on famous buildings than ordinary ones. To systematically investigate this, the authors introduce the YearGuessr dataset, a multi-modal benchmark with 55,546 building images from 157 countries, annotated with construction years, GPS data, and page-view counts. Using this dataset, they frame the task as ordinal regression and introduce new metrics to quantify the bias. The benchmark reveals that VLMs excel on popular items but struggle with unrecognized subjects, highlighting a critical flaw in their reasoning capabilities.
论文探讨了最先进的视觉-语言模型(VLMs)中存在的显著流行度偏差,这些模型在著名建筑上的表现优于普通建筑。为了系统地研究这一问题,作者引入了包含55,546张来自157个国家的建筑图像的YearGuessr数据集,这些图像被标注了建造年份、GPS数据和页面浏览量。使用这个数据集,他们将任务定义为序数回归,并引入新的指标来量化这种偏差。基准测试表明,VLMs在流行项目上表现出色,但在未识别的主题上却面临重大挑战,这揭示了它们推理能力的一个关键缺陷。
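The idea behind a popularity-aware interval accuracy can be sketched as follows. The abstract does not give the exact metric definitions, so this is an assumption: a prediction counts as correct when it falls within a fixed number of years of the true construction year, and the bias is read off as the accuracy gap between high and low page-view groups. The tolerance, the page-view threshold, and the toy data are hypothetical.

```python
from typing import Sequence, Tuple


def interval_accuracy(pred_years: Sequence[int],
                      true_years: Sequence[int],
                      tolerance: int) -> float:
    """Fraction of predictions within `tolerance` years of the true year."""
    hits = sum(1 for p, t in zip(pred_years, true_years) if abs(p - t) <= tolerance)
    return hits / len(true_years)


def popularity_gap(pred_years: Sequence[int],
                   true_years: Sequence[int],
                   page_views: Sequence[int],
                   tolerance: int,
                   view_threshold: int) -> Tuple[float, float, float]:
    """Interval accuracy on popular vs. ordinary buildings, plus their difference."""
    pop_pred, pop_true, ord_pred, ord_true = [], [], [], []
    for p, t, v in zip(pred_years, true_years, page_views):
        if v >= view_threshold:
            pop_pred.append(p)
            pop_true.append(t)
        else:
            ord_pred.append(p)
            ord_true.append(t)
    if not pop_true or not ord_true:
        raise ValueError("both popularity groups must be non-empty")
    acc_pop = interval_accuracy(pop_pred, pop_true, tolerance)
    acc_ord = interval_accuracy(ord_pred, ord_true, tolerance)
    return acc_pop, acc_ord, acc_pop - acc_ord


# Toy data: two heavily viewed buildings, two rarely viewed ones.
preds = [1889, 1900, 1750, 1990]
truth = [1889, 1930, 1800, 1995]
views = [100000, 50, 80, 90000]

acc_pop, acc_ord, gap = popularity_gap(preds, truth, views,
                                       tolerance=5, view_threshold=1000)
```

A large positive gap is the signature the benchmark looks for: accurate dating of well-known buildings alongside poor dating of visually comparable but obscure ones points to memorization rather than architectural reasoning.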