TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
Authors: Iñigo Alonso, Imanol Miranda, Eneko Agirre, Mirella Lapata
First: 2025-09-25T14:14:27+00:00 · Latest: 2025-11-05T16:33:45+00:00
Abstract
While table understanding increasingly relies on pixel-only settings where
tables are processed as visual representations, current benchmarks
predominantly use synthetic renderings that lack the complexity and visual
diversity of real-world tables. Additionally, existing visual table
understanding (VTU) datasets offer fixed examples with single visualizations
and pre-defined instructions, providing no access to underlying serialized data
for reformulation. We introduce TABLET, a large-scale VTU dataset with 4
million examples across 20 tasks, grounded in 2 million unique tables where 88%
preserve original visualizations. Each example includes paired image-HTML
representations, comprehensive metadata, and provenance information linking
back to the source datasets. Fine-tuning vision-language models like
Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while
increasing robustness on real-world table visualizations. By preserving
original visualizations and maintaining example traceability in a unified
large-scale collection, TABLET establishes a foundation for robust training and
extensible evaluation of future VTU models.
中文标题/摘要
标题:TABLET:大规模视觉表格理解数据集
尽管表格理解越来越多地依赖于仅基于像素的设置,其中表格被视为视觉表示,但当前的基准测试主要使用缺乏现实世界表格复杂性和视觉多样性的合成渲染。此外,现有的视觉表格理解(VTU)数据集提供固定示例和单一可视化,并预定义指令,不提供访问底层序列化数据以重新表述的机会。我们引入了TABLET,这是一个包含400万示例的大型VTU数据集,覆盖20个任务,基于200万张独特表格,其中88%保留了原始可视化。每个示例包括配对的图像-HTML表示、全面的元数据以及链接回源数据集的来源信息。在TABLET上微调如Qwen2.5-VL-7B这样的视觉语言模型可以提高已见和未见VTU任务的性能,同时增强对现实世界表格可视化的鲁棒性。通过保留原始可视化并在统一的大规模集合中保持示例可追溯性,TABLET为未来的VTU模型的稳健训练和扩展评估奠定了基础。
Summary / 总结
The research aims to address the limitations of current VTU benchmarks by introducing TABLET, a large-scale dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables. Each example includes paired image-HTML representations and comprehensive metadata. Fine-tuning vision-language models on TABLET improves performance on both seen and unseen VTU tasks and enhances robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability, TABLET provides a robust training and evaluation foundation for future VTU models.
研究旨在通过引入包含400万示例、覆盖20个任务的TABLET数据集来解决当前VTU基准的局限性,该数据集基于200万张独特的表格。每个示例包括配对的图像-HTML表示和全面的元数据。在TABLET上微调视觉-语言模型可以提高对已见和未见VTU任务的性能,并增强对真实世界表格可视化的效果。通过保留原始可视化并保持示例可追溯性,TABLET为未来的VTU模型提供了稳健的训练和扩展评估基础。
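The abstract stresses that each TABLET example pairs the table image with its HTML, so the serialized data can be reformulated. As a minimal sketch of what such a reformulation might look like, the snippet below re-serializes an HTML table as Markdown using only the Python standard library. The class name `TableRows`, the function `html_table_to_markdown`, and the example table are illustrative assumptions, not part of the TABLET release, and the sketch ignores `rowspan`/`colspan` and nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_markdown(html):
    """Hypothetical helper: treat the first row as the header row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, body = parser.rows[0], parser.rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up example table, not taken from the dataset.
example_html = (
    "<table><tr><th>City</th><th>Pop.</th></tr>"
    "<tr><td>Bilbao</td><td>345k</td></tr></table>"
)
markdown = html_table_to_markdown(example_html)
print(markdown)
```

The same parsed rows could feed other serializations (CSV, JSON) or new instruction templates, which is the kind of reuse a fixed image-only example would not allow.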
Text-guided Fine-Grained Video Anomaly Detection
Authors: Jihao Gu, Kun Li, He Wang, Kaan Akşit
First: 2025-11-01T11:59:23+00:00 · Latest: 2025-11-05T15:46:07+00:00
Abstract
Video Anomaly Detection (VAD) aims to identify anomalous events within video
segments. In scenarios such as surveillance or industrial process monitoring,
anomaly detection is of critical importance. While existing approaches are
semi-automated, requiring human assessment for anomaly detection, traditional
VADs offer limited output as either normal or anomalous. We propose Text-guided
Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon Large
Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD)
that performs pixel-wise visual-textual feature alignment to generate
fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly
Encoder (RAE) that transforms the heatmaps into learnable textual embeddings,
guiding the LVLM to accurately identify and localize anomalous events in
videos. This significantly enhances both the granularity and interactivity of
anomaly detection. The proposed method achieves SOTA performance, reaching
94.8% Area Under the Curve (AUC, specifically micro-AUC) and
67.8%/76.7% accuracy in anomaly heatmaps (RBDC/TBDC) on the UBnormal dataset.
Its textual descriptions were also subjectively judged preferable on the
ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories;
Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for
targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
中文标题/摘要
标题:文本引导的细粒度视频异常检测
视频异常检测(VAD)旨在识别视频片段中的异常事件。在监控或工业过程监控等场景中,异常检测至关重要。尽管现有方法是半自动化,需要人工评估异常检测,但传统VADs的输出仅限于正常或异常。我们提出了文本引导的细粒度视频异常检测(T-VAD),该框架基于大型视觉-语言模型(LVLM)。T-VAD引入了异常热图解码器(AHD),通过像素级的视觉-文本特征对齐生成细粒度的异常热图。此外,我们设计了区域感知异常编码器(RAE),将热图转换为可学习的文本嵌入,引导LVLM准确识别和定位视频中的异常事件。这显著提高了异常检测的粒度和互动性。所提出的方法在UBnormal数据集上实现了SOTA性能,AUC(特别是微AUC)达到94.8%,异常热图(RBDC/TBDC)准确率为67.8%/76.7%,并在ShanghaiTech基于的数据集上主观验证了更优的文本描述(BLEU-4:目标62.67,轨迹88.84;是/否准确率:97.67%),以及UBnormal数据集上(BLEU-4:目标50.32,轨迹78.10;是/否准确率:89.73%)。
Summary / 总结
The research aims to improve the accuracy and granularity of video anomaly detection by integrating textual guidance into the detection process. The proposed Text-guided Fine-Grained Video Anomaly Detection (T-VAD) framework uses a Large Vision-Language Model (LVLM) with an Anomaly Heatmap Decoder (AHD) and a Region-aware Anomaly Encoder (RAE) to generate fine-grained anomaly heatmaps and guide the LVLM for precise anomaly localization. The method achieves state-of-the-art performance with 94.8% micro-AUC and 67.8%/76.7% accuracy in anomaly heatmaps on the UBnormal dataset, and shows superior textual description quality with BLEU-4 scores of 62.67 and 88.84 on the ShanghaiTech-based dataset.
研究旨在通过结合文本指导来提高视频异常检测的精细度和互动性。方法Text-guided Fine-Grained Video Anomaly Detection (T-VAD) 使用大型视觉语言模型(LVLM),结合Anomaly Heatmap Decoder (AHD) 和Region-aware Anomaly Encoder (RAE),生成精细的异常热图并引导LVLM进行准确的异常定位。该方法在UBnormal数据集上达到最先进的性能,微AUC为94.8%,热图准确率为67.8%/76.7%,并在ShanghaiTech数据集上展示了更优的文本描述质量,BLEU-4得分为62.67和88.84。
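The abstract describes the Anomaly Heatmap Decoder as performing pixel-wise visual-textual feature alignment. The paper's decoder is learned and its details are not given here; as a rough illustration of the general idea only, the sketch below scores every pixel feature by its cosine similarity to a single text embedding and rescales the result to [0, 1]. The function names, the 2-D toy features, and the rescaling are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def anomaly_heatmap(feature_map, text_embedding):
    """feature_map: H x W grid of D-dim vectors. Returns H x W scores in [0, 1]."""
    return [
        [(cosine(pixel, text_embedding) + 1.0) / 2.0 for pixel in row]
        for row in feature_map
    ]


# Toy 2 x 2 feature map with 2-dimensional features (illustrative values only).
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]
text = [1.0, 0.0]  # stand-in for the embedding of an anomaly description
heatmap = anomaly_heatmap(features, text)
```

Pixels whose features point in the same direction as the text embedding score near 1, and opposite directions score near 0. A real system would operate on learned feature tensors rather than nested lists.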
Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Authors: Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu
First: 2025-10-09T17:06:42+00:00 · Latest: 2025-11-05T15:19:13+00:00
Abstract
Real-world clinical decision-making requires integrating heterogeneous data,
including medical text, 2D images, 3D volumes, and videos, while existing AI
systems fail to unify all these signals, limiting their utility. In this paper,
we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model
(VLM) designed to unify language-only, 2D/3D vision-language, and video
understanding within a single architecture. Hulu-Med is trained on a curated
corpus of 16.7 million samples, comprising exclusively public or synthetic
data, spanning 12 major anatomical systems and 14 medical imaging modalities.
Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant
visual tokens, achieving up to a 55% reduction for 3D and video inputs,
improving cross-modal efficiency, and enabling training at 7B-32B parameter
scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and
out-of-domain medical benchmarks (covering text reasoning, visual question
answering, report generation, multilingual dialogue, video understanding, and
rare disease diagnosis), Hulu-Med surpasses existing open-source models on 27 of
30 benchmarks and outperforms proprietary systems such as GPT-4o on 16
benchmarks. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1
on the text-only HealthBench. For the first time in the community, we provide a
fully transparent, reproducible and cost-effective pipeline for holistic
medical vision-language understanding by releasing our end-to-end data
curation, training procedures, and model parameters. Code and models are
available at https://github.com/ZJUI-AI4H/Hulu-Med.
中文标题/摘要
标题:Hulu-Med:面向全面医疗视图语言理解的透明通用模型
现实世界中的临床决策需要整合异构数据,包括医学文本、2D图像、3D体积和视频,而现有的AI系统无法统一所有这些信号,限制了它们的实用性。在本文中,我们介绍了Hulu-Med,这是一种透明的通用医疗视图语言模型(VLM),旨在在一个架构中统一语言理解、2D/3D视图语言理解和视频理解。Hulu-Med基于1670万样本的精心策划的语料库进行训练,这些样本仅包含公开或合成数据,涵盖了12个主要的解剖系统和14种医学成像模态。Hulu-Med采用了一种医学意识的标记减少策略,去除冗余的视觉标记,对于3D和视频输入,最多可减少55%的标记,提高了跨模态效率,并能够在约4000-40000个GPU小时的训练中实现7B-32B参数规模的训练。在涵盖文本推理、视觉问答、报告生成、多语言对话、视频理解和罕见疾病诊断的30个公开领域内和领域外医学基准测试中,Hulu-Med在27个基准测试中超越了现有的开源模型,并在16个基准测试中超越了如GPT-4o等专有系统。尽管是VLM,Hulu-Med在仅文本的HealthBench基准测试中也超越了GPT-4o,并与GPT-o1持平。我们首次为社区提供了全面透明、可重复和成本效益高的医疗视图语言理解管道,通过发布我们的端到端数据策划、训练流程和模型参数。代码和模型可在https://github.com/ZJUI-AI4H/Hulu-Med/获取。
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Jiahang Cao, Yijie Guo, Ning Liu, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
First: 2024-09-20T03:02:05+00:00 · Latest: 2025-11-05T14:57:42+00:00
Abstract
Recently, driven by advancements in Multimodal Large Language Models (MLLMs),
Vision Language Action Models (VLAMs) are being proposed to achieve better
performance in open-vocabulary scenarios for robotic manipulation tasks. Since
manipulation tasks involve direct interaction with the physical world, ensuring
robustness and safety during their execution is a
critical issue. In this paper, by synthesizing current safety research on MLLMs
and the specific application scenarios of the manipulation task in the physical
world, we comprehensively evaluate VLAMs in the face of potential physical
threats. Specifically, we propose the Physical Vulnerability Evaluating
Pipeline (PVEP) that can incorporate as many visual modal physical threats as
possible for evaluating the physical robustness of VLAMs. The physical threats
in PVEP specifically include Out-of-Distribution, Typography-based Visual
Prompt, and Adversarial Patch Attacks. By comparing the performance
fluctuations of VLAMs before and after being attacked, we provide generalizable
analyses of how VLAMs respond to different physical threats.
中文标题/摘要
标题:面对威胁的操纵:评估端到端视觉语言动作模型的物理脆弱性
近年来,随着多模态大型语言模型(MLLMs)的进步,视觉语言动作模型(VLAMs)被提出以在机器人操纵任务的开放词汇场景中实现更好的性能。由于操纵任务涉及直接与物理世界交互,因此在执行此任务时确保其鲁棒性和安全性始终是一个非常关键的问题。在本文中,通过综合当前MLLMs的安全研究以及操纵任务在物理世界中的具体应用场景,我们全面评估了VLAMs在面对潜在物理威胁时的表现。具体而言,我们提出了物理脆弱性评估管道(PVEP),它可以尽可能多地纳入视觉模态的物理威胁,以评估VLAMs的物理鲁棒性。PVEP中的物理威胁具体包括离分布、基于字体的视觉提示和对抗性补丁攻击。通过比较VLAMs在攻击前后性能的变化,我们提供了关于VLAMs如何应对不同物理威胁的可泛化的分析。
Summary / 总结
This paper evaluates the physical robustness of Vision Language Action Models (VLAMs) by proposing the Physical Vulnerability Evaluating Pipeline (PVEP), which includes out-of-distribution, typography-based visual prompt, and adversarial patch attacks. The study finds that VLAMs exhibit varying degrees of vulnerability to these physical threats, providing insights into their performance fluctuations under different attack scenarios.
本文通过提出物理脆弱性评估管道(PVEP),评估Vision Language Action Models(VLAMs)在面对分布外、基于字体的视觉提示和对抗性补丁攻击等物理威胁时的物理鲁棒性。研究发现,当VLAMs受到这些攻击时,其性能会显著波动,表明它们在机器人操作任务中对物理威胁的脆弱性。
Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
First: 2025-03-14T15:42:42+00:00 · Latest: 2025-11-05T14:45:59+00:00
Comments: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection,
Localization, and Interpretability as Best Student Paper
Abstract
Current Cross-Modality Generation Models (GMs) demonstrate remarkable
capabilities in various generative tasks. Given the ubiquity and information
richness of vision modality inputs in real-world scenarios, Cross-Vision tasks,
encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have
attracted significant attention. Large Vision Language Models (LVLMs) and I2I
Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively.
Previous research indicates that printing typographic words into input images
significantly induces LVLMs and I2I GMs to produce disruptive outputs that are
semantically aligned with those words. Additionally, visual prompts, as a more
sophisticated form of typography, are also revealed to pose security risks to
various applications of cross-vision tasks. However, the specific
characteristics of the threats posed by visual prompts remain underexplored. In
this paper, to comprehensively investigate the performance impact induced by
Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we
propose the Typographic Visual Prompts Injection Dataset and thoroughly
evaluate the TVPI security risks on various open-source and closed-source LVLMs
and I2I GMs under visual prompts with different target semantics, deepening the
understanding of TVPI threats.
中文标题/摘要
标题:探索跨模态生成模型中的字体视觉提示注入威胁
当前的跨模态生成模型(GMs)在各种生成任务中表现出显著的能力。鉴于现实世界场景中视觉模态输入的普遍性和信息丰富性,包括视觉语言感知(VLP)和图像到图像(I2I)在内的跨视觉任务引起了广泛关注。大型视觉语言模型(LVLMs)和I2I生成模型(GMs)分别用于处理VLP和I2I任务。先前的研究表明,在输入图像中印刷字体文字会显著诱导LVLMs和I2I GMs生成与这些文字语义一致的破坏性输出。此外,视觉提示作为一种更复杂的字体形式,也被发现对跨视觉任务的各种应用构成了安全风险。然而,视觉提示所造成的威胁的具体特征仍待进一步探索。在本文中,为了全面调查字体视觉提示注入(TVPI)在各种LVLMs和I2I GMs中的性能影响,我们提出了字体视觉提示注入数据集,并在具有不同目标语义的视觉提示下对各种开源和闭源LVLMs和I2I GMs进行了彻底的安全风险评估,加深了对TVPI威胁的理解。
Summary / 总结
This paper explores the security threats posed by typographic visual prompts in cross-modality generation models. It introduces a dataset for evaluating the impact of typographic visual prompt injection (TVPI) and thoroughly assesses the security risks on various LVLMs and I2I GMs. The study reveals that visual prompts can induce models to produce disruptive outputs aligned with the prompts, highlighting the need for better security measures in cross-vision tasks.
本文探讨了图文提示注入对跨模态生成模型的安全威胁。引入了一个图文提示注入数据集,并在不同目标语义下评估了其对各种大型视觉语言模型和图像到图像生成模型的影响,揭示了跨视觉任务中的重大安全风险。
Revisiting Multimodal Positional Encoding in Vision-Language Models
Authors: Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
First: 2025-10-27T08:00:46+00:00 · Latest: 2025-11-05T14:25:38+00:00
Comments: 16 pages
Abstract
Multimodal position encoding is essential for vision-language models, yet
it has received little systematic investigation. We conduct a
comprehensive analysis of multimodal Rotary Positional
Embedding (RoPE) by examining its two core components: position design and
frequency allocation. Through extensive experiments, we identify three key
guidelines: positional coherence, full frequency utilization, and preservation
of textual priors, ensuring unambiguous layout, rich representation, and
faithful transfer from the pre-trained LLM. Based on these insights, we propose
Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and
plug-and-play variants that require no architectural changes. Our methods
consistently outperform existing approaches across diverse benchmarks, with
significant improvements in both general and fine-grained multimodal
understanding. Code will be available at
https://github.com/JJJYmmm/Multimodal-RoPEs.
中文标题/摘要
标题:重新审视视觉-语言模型中的多模态位置编码
多模态位置编码对于视觉-语言模型至关重要,但对多模态位置编码的系统性研究却很少。我们对多模态旋转位置嵌入(RoPE)进行了全面分析,考察了其两个核心组成部分:位置设计和频率分配。通过大量实验,我们确定了三个关键指导原则:位置一致性、充分利用频率以及保留文本先验,以确保布局明确、表示丰富以及从预训练的大语言模型中忠实转移。基于这些见解,我们提出了多头RoPE(MHRoPE)和MRoPE-交错(MRoPE-I)两种简单且即插即用的变体,无需进行架构更改。我们的方法在多种基准测试中始终优于现有方法,显著提高了通用和细粒度多模态理解。代码将在https://github.com/JJJYmmm/Multimodal-RoPEs上提供。
Summary / 总结
The paper revisits the role of multimodal position encoding in vision-language models, focusing on Rotary Positional Embedding (RoPE). By analyzing the position design and frequency allocation, the authors propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), which enhance multimodal understanding. These methods improve performance across various benchmarks, particularly in handling general and fine-grained multimodal tasks, without requiring architectural changes.
论文重新审视了多模态位置编码在视觉-语言模型中的作用,重点关注旋转位置嵌入(RoPE)。通过对位置设计和频率分配的分析,作者提出了多头RoPE(MHRoPE)和MRoPE-交错(MRoPE-I),这些方法增强了多模态理解。这些方法在各种基准测试中表现出色,特别是在处理通用和细粒度的多模态任务方面,无需进行架构更改。
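The abstract names frequency allocation as one of the two core components of multimodal RoPE. As a hedged sketch, the code below contrasts two ways of assigning rotary frequency pairs to the temporal, height, and width axes: contiguous chunks per axis, and a round-robin interleave in which every axis sees both high and low frequencies. This is an illustration of the general idea under our own assumptions (function names, equal-sized chunks, base 10000); it is not the authors' implementation of MHRoPE or MRoPE-I.

```python
def rope_frequencies(num_pairs, base=10000.0):
    """Standard RoPE frequencies: base ** (-i / num_pairs) for each rotary pair i."""
    return [base ** (-i / num_pairs) for i in range(num_pairs)]


def chunked_allocation(num_pairs, axes=("t", "h", "w")):
    """Assign contiguous blocks of frequency pairs to each axis (assumes equal blocks)."""
    assert num_pairs % len(axes) == 0, "sketch assumes an even split"
    size = num_pairs // len(axes)
    return [axes[i // size] for i in range(num_pairs)]


def interleaved_allocation(num_pairs, axes=("t", "h", "w")):
    """Assign frequency pairs round-robin, so each axis spans the whole spectrum."""
    return [axes[i % len(axes)] for i in range(num_pairs)]


def rotation_angles(position, allocation, freqs):
    """position maps axis name -> index; returns one rotation angle per rotary pair."""
    return [position[axis] * freq for axis, freq in zip(allocation, freqs)]


freqs = rope_frequencies(6)
chunked = chunked_allocation(6)          # ['t', 't', 'h', 'h', 'w', 'w']
interleaved = interleaved_allocation(6)  # ['t', 'h', 'w', 't', 'h', 'w']

# A text token uses the same index on every axis, so either allocation
# reduces to ordinary 1-D RoPE angles for it.
text_position = {"t": 7, "h": 7, "w": 7}
text_angles = rotation_angles(text_position, interleaved, freqs)
```

With chunked allocation, the temporal axis only receives the highest frequencies and the width axis only the lowest. The interleaved variant spreads each axis across the full range, which is one plausible reading of the "full frequency utilization" guideline.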
ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs
Authors: Ben Zhang, LuLu Yu, Lei Gao, QuanJiang Guo, Jing Liu, Hui Gao
First: 2025-08-06T08:31:11+00:00 · Latest: 2025-11-05T13:58:18+00:00
Abstract
During reasoning in vision-language models (VLMs), false positive (FP)
reasoning occurs when a model produces the correct answer but follows an
incorrect reasoning path, which undermines reasoning reliability.
Existing approaches mainly rely on prompt engineering, knowledge distillation
or reinforcement learning to improve reasoning reliability, all of which
require large amounts of high-quality data and thus limit practical
applicability. Few approaches have focused on directly detecting and correcting
FPs. To address these issues, we propose ViFP, a framework for Visual False
Positive Detection to Enhance Reasoning Reliability in VLMs. ViFP builds
effective reasoning paths through multi-turn QA and dynamically analyzes the
consistency of the reasoning path to identify potential FPs. It also introduces
a targeted reasoning chain correction mechanism to modify FP reasoning, thereby
improving logical consistency and accuracy. Finally, we introduce a reliability
evaluation metric, VoC, which integrates answer accuracy and the FP rate,
providing a quantitative tool to assess whether a VLM not only answers
correctly but also reasons reliably. Our experiments on closed-source VLMs show
that ViFP consistently improves performance across three datasets: A-OKVQA,
OK-VQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing
the previous state-of-the-art by 4.3%, and significantly reduces the number of
FPs, validating its benefits in enhancing reasoning reliability.
中文标题/摘要
标题:ViFP:视觉假阳性检测框架以增强VLMs推理可靠性
在视觉语言模型(VLMs)的推理过程中,当模型给出正确答案但遵循错误的推理路径时,会发生假阳性(FP)推理,从而削弱推理可靠性。现有方法主要依赖于提示工程、知识蒸馏或强化学习来提高推理可靠性,但这些方法需要大量高质量的数据,从而限制了其实用性。很少有方法专注于直接检测和纠正FPs。为了解决这些问题,我们提出了ViFP,一种用于视觉假阳性检测以增强VLMs推理可靠性的框架。ViFP通过多轮问答构建有效的推理路径,并动态分析推理路径的一致性以识别潜在的FPs。它还引入了针对性的推理链修正机制来修改FP推理,从而提高逻辑一致性和准确性。最后,我们引入了可靠性评估指标VoC,该指标结合了答案准确率和FP率,提供了一种定量工具来评估VLM不仅回答正确,还能可靠地推理。我们在闭源VLMs上的实验表明,ViFP在三个数据集A-OKVQA、OK-VQA和FVQA上均能持续提高性能。在A-OKVQA上,ViFP将准确率提高了最多5.4%,超越了之前的最佳方法4.3%,并显著减少了FP的数量,验证了其在增强推理可靠性方面的益处。
Summary / 总结
The paper introduces ViFP, a framework for detecting and correcting visual false positives in vision-language models to enhance reasoning reliability. It uses multi-turn QA and dynamically analyzes reasoning paths to identify potential false positives, then corrects the reasoning chain to improve logical consistency and accuracy. Experiments show that ViFP improves accuracy by up to 5.4% on A-OKVQA, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of false positives.
论文提出了ViFP框架,用于检测和纠正视觉假阳性,以增强视觉语言模型的推理可靠性。ViFP使用多轮问答并动态分析推理路径的一致性,引入了针对推理链的修正机制。实验表明,ViFP在A-OKVQA上将准确率提高至多5.4%,超越了之前的最佳方法4.3%,并且显著减少了假阳性数量,验证了其在增强推理可靠性方面的优势。
Decoupling Augmentation Bias in Prompt Learning for Vision-Language Models
Authors: Gahyeon Kim, Sohee Kim, Seokju Lee
First: 2025-11-05T11:15:16+00:00 · Latest: 2025-11-05T11:15:16+00:00
Comments: Accepted in Pattern Recognition
Abstract
Recent advances in large-scale vision and language models have led to
significant progress in zero-shot learning tasks. Methods such as CoOp and
CoCoOp have shown that replacing handcrafted prompts with learnable vectors,
known as prompt learning, can result in improved performance. However, these
models often struggle to generalize to entirely unseen categories. While
traditional zero-shot learning techniques benefit from various data
augmentation strategies, prompt learning has primarily focused on text-based
modifications, leaving the potential of image-based augmentation largely
unexplored. In this work, we explore how image-level augmentations,
particularly those that introduce attribute-specific variations, can support
and enhance prompt learning. Our analysis examines the interaction between
these augmentations and soft prompt frameworks, revealing their potential to
improve generalization. We also identify a limitation in existing methods, such
as CoCoOp, which do not provide explicit guidance for learning prompts that
focus on semantically meaningful visual features. To address this, we propose
Adding Attributes to Prompt Learning, AAPL, a novel method that introduces
adversarial token embeddings to decouple superficial visual variations
introduced by augmentation from class-relevant semantic representations. This
decoupling enables the learned prompts to concentrate on visually
discriminative features that align with the target categories. We conduct
comprehensive experiments on eleven benchmark datasets, and AAPL consistently
outperforms existing methods across few-shot, zero-shot, cross-dataset, and
domain generalization settings. Our source code is publicly available at:
https://github.com/Gahyeonkim09/AAPL
中文标题/摘要
标题:分离视觉语言模型提示学习中增强偏差
大规模视觉和语言模型的最新进展在零样本学习任务中取得了显著进展。CoOp和CoCoOp等方法表明,用可学习向量替换手工设计的提示,即提示学习,可以提高性能。然而,这些模型往往难以泛化到完全未见过的类别。虽然传统的零样本学习技术受益于各种数据增强策略,但提示学习主要集中在文本修改上,图像增强的潜力尚未得到充分探索。在本文中,我们探讨了图像级增强,特别是那些引入属性特定变化的增强,如何支持和增强提示学习。我们的分析研究了这些增强与软提示框架之间的相互作用,揭示了它们提高泛化能力的潜力。我们还指出了现有方法(如CoCoOp)的一个局限性,即它们没有提供明确的指导来学习专注于语义有意义的视觉特征的提示。为了解决这个问题,我们提出了添加属性到提示学习(AAPL),这是一种新颖的方法,通过引入对抗性标记嵌入来分离由增强引入的表面视觉变化与类别相关的语义表示。这种分离使学习到的提示能够集中于与目标类别对齐的视觉区分特征。我们在11个基准数据集上进行了全面实验,AAPL在少量样本、零样本、跨数据集和领域泛化设置中均优于现有方法。我们的源代码可在以下网址获取:https://github.com/Gahyeonkim09/AAPL
Summary / 总结
This paper addresses the challenge of improving the generalization of prompt learning models in vision-language tasks, particularly in handling unseen categories. It proposes AAPL, a method that uses adversarial token embeddings to decouple superficial visual variations from class-relevant semantic representations. Experimental results show that AAPL outperforms existing methods across various settings including few-shot, zero-shot, cross-dataset, and domain generalization.
本文探讨了通过在提示学习中使用图像级增强来解决视觉-语言模型在面对未见过的类别时的泛化问题。作者提出了一种名为AAPL的新方法,该方法通过引入对抗性令牌嵌入来解耦由增强引入的表面视觉变化与类别相关的语义表示。实验结果表明,AAPL在少量样本、零样本、跨数据集和领域泛化等设置中均优于现有方法。
Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
Authors: Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang
First: 2025-11-05T10:01:31+00:00 · Latest: 2025-11-05T10:01:31+00:00
Abstract
In this report, we present our solution to the MOT25-Spatiotemporal Action
Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately
localize and track multiple objects that match specific and free-form language
queries, using video data of complex real-world scenes as input. We model the
underlying task as a video retrieval problem and present a two-stage, zero-shot
approach, combining the advantages of the SOTA tracking model FastTracker and
Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our
method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively,
placing second in the challenge.
中文标题/摘要
标题:使用LLaVA-Video的多目标跟踪检索:MOT25-StAG挑战的无需训练解决方案
在本报告中,我们提出了对MOT25-时空动作定位(MOT25-StAG)挑战的解决方案。该挑战的目标是使用复杂现实场景的视频数据作为输入,准确地定位和跟踪与特定和自由形式的语言查询匹配的多个对象。我们将基础任务建模为视频检索问题,并提出了一种两阶段、零样本的方法,结合了最先进的跟踪模型FastTracker和多模态大型语言模型LLaVA-Video的优势。在MOT25-StAG测试集上,我们的方法分别获得了m-HIoU和HOTA分数20.68和10.73,赢得了挑战的第二名。
Summary / 总结
The research aims to accurately localize and track multiple objects in complex real-world scenes based on specific language queries. The method combines FastTracker for tracking and LLaVA-Video for multi-modal understanding, achieving m-HIoU and HOTA scores of 20.68 and 10.73 respectively on the MOT25-StAG test set, securing second place in the challenge.
研究旨在基于特定语言查询,在复杂现实场景中准确地定位和跟踪多个物体。方法结合了FastTracker进行跟踪和LLaVA-Video进行多模态理解,在MOT25-StAG测试集上取得了m-HIoU和HOTA分数分别为20.68和10.73的成绩,获得挑战赛第二名。
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Authors: Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
First: 2025-10-30T08:21:50+00:00 · Latest: 2025-11-05T05:49:17+00:00
Comments: 10 pages
Abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT), that is, whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
中文标题/摘要
标题:时间流动的方向如何?基于心理物理学的视觉-语言模型评估
现代视觉-语言模型(VLMs)在许多多模态任务中表现出色,但在视频中的时间信息掌握方面仍然较弱且未得到充分评估。我们通过一个看似简单但揭示性强的挑战——判断时间箭头(AoT)——即判断短片段是正向播放还是反向播放,来探索这一差距。我们引入了AoT-PsyPhyBENCH,这是一个经过心理物理学验证的基准测试,测试VLMs是否能在自然视频中推断时间方向,使用与人类相同的刺激和行为基线。我们对开放权重和专有、推理和非推理VLMs的全面评估显示,大多数模型的表现接近随机猜测,甚至最好的模型在物理不可逆过程(如自由落体、扩散/爆炸)和因果手动动作(如分割/加法)上的表现也远远落后于人类的准确性,这些过程人类几乎可以瞬间识别。这些结果突显了当前多模态系统中的一个根本性差距:虽然它们捕捉了丰富的视觉-语义关联,但缺乏用于时间连续性和因果理解的归纳偏置。我们发布了AoT-PsyPhyBENCH的代码和数据,以鼓励进一步提高VLMs在物理和时间推理能力方面的发展。
Summary / 总结
This study evaluates the temporal understanding of vision-language models (VLMs) by introducing AoT-PsyPhyBENCH, a benchmark that assesses models' ability to determine the direction of time in videos. Despite VLMs' success in many multimodal tasks, most models perform poorly, even on simple tasks like identifying whether a video is played forward or backward, with human accuracy far exceeding model performance. This highlights a critical gap in VLMs' ability to understand temporal continuity and causal relationships.
研究通过引入基于心理物理验证的AoT-PsyPhyBENCH基准,评估了视觉语言模型(VLMs)的时间理解能力。测试VLMs在自然视频中推断时间方向的能力,结果显示大多数模型表现不佳,尤其是在不可逆过程和因果动作方面,远远落后于人类的准确度。这突显了VLMs在时间连续性和因果理解方面存在显著差距。
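The abstract reports that most models perform near chance on the forward/backward judgment. As a small sketch of how such a claim can be checked for one model, the code below computes accuracy on binary arrow-of-time labels and a one-sided exact binomial p-value against 50% guessing. The labels and predictions are made-up toy data, and the function names are assumptions. The benchmark's own scoring code may differ.

```python
from math import comb


def accuracy(predictions, labels):
    """Fraction of clips whose predicted direction matches the true direction."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def p_value_vs_chance(correct, total, chance=0.5):
    """One-sided exact binomial tail: P(X >= correct) if the model were guessing."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Toy data for illustration only (8 clips, 5 judged correctly).
labels = ["forward", "backward", "forward", "backward",
          "forward", "backward", "forward", "backward"]
predictions = ["forward", "forward", "forward", "backward",
               "forward", "forward", "backward", "backward"]

acc = accuracy(predictions, labels)
n_correct = sum(p == y for p, y in zip(predictions, labels))
p_value = p_value_vs_chance(n_correct, len(labels))
```

A large p-value means the observed accuracy is compatible with guessing. With only eight clips the test has little power, so a real evaluation would use the full benchmark.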
CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
Authors: Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding
First: 2025-10-23T22:52:00+00:00 · Latest: 2025-11-05T02:10:35+00:00
Abstract
Developing efficient CUDA kernels is increasingly critical for AI
applications such as large-scale LLM training. However, manual kernel design is
both costly and time-consuming, motivating automatic approaches that leverage
LLMs for code generation. Existing methods for automatic kernel generation,
however, often produce low-efficiency kernels, incur high computational
overhead, and fail to generalize across settings. In this work, we propose
CudaForge, a training-free multi-agent workflow for CUDA kernel generation and
optimization. Our workflow is inspired by the iterative workflow of human
experts, which contains steps such as developing initial kernels, testing
correctness, analyzing hardware feedback, and iterative improvement. More
specifically, CudaForge employs two LLM agents: a Coder and a Judge, that
iteratively generate, correct, and optimize CUDA kernels, while integrating
hardware feedback such as Nsight Compute (NCU) metrics. In extensive
evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3,
achieves 97.6% correctness of generated kernels and an average 1.68×
speedup over PyTorch baselines, substantially surpassing state-of-the-art
models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed,
CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090,
3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4,
QwQ-32B), while maintaining high efficiency. In particular, generating an
optimized kernel takes about 26.5 minutes on one RTX6000 and incurs about
$0.3 API cost, which is significantly cheaper than existing agentic work that
costs 6 H100 hours and $5 API cost per kernel. Our results highlight that
multi-agent, training-free workflows can enable cost-effective, generalizable,
and high-performance CUDA kernel optimization. Code available at
https://github.com/OptimAI-Lab/CudaForge
中文标题/摘要
标题:CudaForge:一种带有硬件反馈的CUDA内核优化智能代理框架
开发高效的CUDA内核对于大规模LLM训练等AI应用越来越关键。然而,手动设计内核既昂贵又耗时,因此推动了利用LLM进行代码生成的自动方法。然而,现有的自动内核生成方法往往生成低效的内核,产生高计算开销,并且无法在不同场景下泛化。在本工作中,我们提出了一种名为CudaForge的无需训练的多智能体工作流,用于CUDA内核的生成和优化。我们的工作流灵感来源于人类专家的迭代工作流,包括开发初始内核、测试正确性、分析硬件反馈和迭代改进等步骤。具体而言,CudaForge使用两个LLM智能体:一个编码器和一个裁判,它们迭代生成、修正和优化CUDA内核,同时整合硬件反馈,如Nsight Compute (NCU)指标。在广泛的评估中,我们展示了CudaForge通过利用如OpenAI-o3等基础模型,生成内核的正确性达到97.6%,平均比PyTorch基线快1.68倍,显著超越包括OpenAI-o3和Kevin在内的最新模型在KernelBench上的表现。除了准确性和速度,CudaForge在不同GPU(A100、RTX 6000、4090、3090)和基础模型(OpenAI-o3、GPT-5、gpt-oss-120B、Claude-Sonnet-4、QwQ-32B)上表现出强大的泛化能力,同时保持高效率。特别是,生成一个优化的内核在一台RTX6000上大约需要26.5分钟,API成本约为$0.3,这比现有代理工作每内核6个H100小时和$5 API成本要便宜得多。我们的结果表明,无需训练的多智能体工作流可以实现成本效益高、可泛化和高性能的CUDA内核优化。代码可在https://github.com/OptimAI-Lab/CudaForge获取
Summary / 总结
CudaForge is a training-free multi-agent framework for CUDA kernel generation and optimization, inspired by the iterative process of human experts. It uses two agents, a Coder and a Judge, to iteratively generate, correct, and optimize CUDA kernels while integrating hardware feedback such as Nsight Compute metrics. CudaForge achieves 97.6% correctness of generated kernels and an average 1.68x speedup over PyTorch baselines, surpassing state-of-the-art models on KernelBench. It demonstrates strong generalization across different GPUs and base models while maintaining high efficiency and low cost compared to existing approaches.
CudaForge 是一个无需训练的多代理框架,用于 CUDA 内核的生成和优化,灵感来源于人类专家的迭代过程。它使用两个代理,一个编码器和一个裁判,来迭代生成、纠正和优化 CUDA 内核,并整合 Nsight Compute 等硬件反馈。CudaForge 实现了生成内核 97.6% 的正确性,并且在 PyTorch 基线上的平均加速比为 1.68 倍,超过了 KernelBench 上的最新模型。它在不同 GPU 和基础模型上表现出强大的泛化能力,同时保持了高效率和低成本,优于现有方法。
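The abstract outlines an iterative loop in which a Coder proposes a kernel, correctness is tested, hardware feedback is collected, and a Judge turns the results into guidance for the next round. The sketch below shows one way such a control loop could be organized. All names (`optimize_kernel`, the callable signatures, the `runtime_ms` key) are hypothetical. The toy stand-ins replace the LLM agents, the compiler, and the Nsight Compute profiler, none of which are called here. It is not the CudaForge code.

```python
def optimize_kernel(coder, judge, run_tests, profile, max_rounds=5):
    """Generate, test, profile, and refine; return the fastest correct candidate."""
    kernel = coder(None, None)  # initial kernel: no feedback, no previous version
    best = None                 # (kernel, metrics) of the fastest correct kernel
    for round_index in range(max_rounds):
        ok, error = run_tests(kernel)
        metrics = profile(kernel) if ok else None
        if ok and (best is None or metrics["runtime_ms"] < best[1]["runtime_ms"]):
            best = (kernel, metrics)
        if round_index == max_rounds - 1:
            break  # no point requesting a revision that would never be tested
        feedback = judge(kernel, error, metrics)  # correction or optimization hint
        kernel = coder(feedback, kernel)
    return best


# Toy stand-ins: a "kernel" is just a version number for this illustration.
def toy_coder(feedback, previous):
    return 0 if previous is None else previous + 1


def toy_run_tests(kernel):
    return (kernel > 0, None if kernel > 0 else "wrong output")


def toy_profile(kernel):
    return {"runtime_ms": 10.0 / kernel}  # pretend later versions are faster


def toy_judge(kernel, error, metrics):
    return f"fix: {error}" if error else f"optimize: {metrics}"


best = optimize_kernel(toy_coder, toy_judge, toy_run_tests, toy_profile, max_rounds=4)
```

Keeping the best correct candidate separately from the current one means a later regression cannot discard an earlier working kernel.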
ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Authors: Anurag Ghosh, Shen Zheng, Robert Tamburo, Khiem Vuong, Juan Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, Srinivasa G. Narasimhan
Venue: ICCV 2025
First: 2024-06-11T19:06:41+00:00 · Latest: 2025-11-04T23:16:33+00:00
Comments: ICCV 2025 Accepted Paper
Abstract
Perceiving and autonomously navigating through work zones is a challenging
and underexplored problem. Open datasets for this long-tailed scenario are
scarce. We propose the ROADWork dataset to learn to recognize, observe,
analyze, and drive through work zones. State-of-the-art foundation models fail
when applied to work zones. Fine-tuning models on our dataset significantly
improves perception and navigation in work zones. With the ROADWork dataset, we
discover new work zone images with higher precision (+32.5%) at a much higher
rate (12.8×) around the world. Open-vocabulary methods fail too, whereas
fine-tuned detectors improve performance (+32.2 AP). Vision-Language Models
(VLMs) struggle to describe work zones, but fine-tuning substantially improves
performance (+36.7 SPICE).
Beyond fine-tuning, we show the value of simple techniques. Video label
propagation provides additional gains (+2.6 AP) for instance segmentation.
When reading work zone signs, composing a detector and text spotter via
crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections
to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We
predict navigational goals and compute drivable paths from work zone videos.
Incorporating road work semantics ensures that 53.6% of goals have angular error (AE) <
0.5 (+9.9%) and 75.3% of pathways have AE < 0.5 (+8.1%).
中文标题/摘要
标题:ROADWork:学习识别、观察、分析和通过工作区的数据集和基准
感知并自主导航通过工作区是一个具有挑战性和未充分探索的问题。针对这一长尾场景的公开数据集稀缺。我们提出了ROADWork数据集,以学习识别、观察、分析和通过工作区。最先进的基础模型在应用于工作区时表现不佳。在我们的数据集上微调模型显著提高了工作区的感知和导航性能。通过ROADWork数据集,我们发现具有更高精度(+32.5%)的新工作区图像,发现率提高了12.8倍。开放词汇方法也失败了,而微调检测器提高了性能(+32.2 AP)。视觉-语言模型(VLMs)难以描述工作区,但微调显著提高了性能(+36.7 SPICE)。除了微调,我们展示了简单技术的价值。视频标签传播为实例分割提供了额外收益(+2.6 AP)。在阅读工作区标志时,通过裁剪缩放组合检测器和文本识别器提高了性能(+14.2% 1-NED)。组合工作区检测提供上下文进一步减少了VLMs中的幻觉(+3.9 SPICE)。我们预测导航目标并从工作区视频中计算可行驶路径。结合道路工作语义确保了53.6%的目标具有角度误差(AE)<0.5(+9.9%),75.3%的路径具有AE < 0.5(+8.1%)。
Summary / 总结
The paper addresses the challenge of perceiving and navigating through work zones, which are underexplored. It introduces the ROADWork dataset to train models for recognizing, observing, analyzing, and driving through these zones. Fine-tuning models on this dataset significantly enhances perception and navigation. The study also explores the limitations of open-vocabulary methods and vision-language models, showing that fine-tuning improves performance. Simple techniques like video label propagation and crop-scaling also contribute to better performance. Incorporating road work semantics in drivable path computation ensures more accurate navigational goals and pathways.
研究旨在解决工作区感知和导航的挑战,这是一个尚未充分探索的领域。提出了ROADWork数据集以促进该场景的学习。在该数据集上微调模型显著提升了工作区的感知和导航能力。研究发现,微调检测器和视觉语言模型表现出更好的性能,而开放词汇方法则失败。简单的技术如视频标签传播和上下文组合也提高了性能。通过融入道路工作语义,导航目标和可行驶路径的精度得到提升,减少了角度误差。
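The sign-reading gain above is reported in 1-NED. As a minimal sketch, assuming NED means Levenshtein distance divided by the length of the longer string (a common convention in text spotting; the exact normalization used by the paper is not stated in the abstract), the score for one predicted string could be computed as follows. The function names and the example strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def one_minus_ned(pred: str, target: str) -> float:
    """1 - NED, assuming normalization by the longer string's length."""
    longest = max(len(pred), len(target))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(pred, target) / longest


if __name__ == "__main__":
    # Hypothetical sign text with a single misread character.
    print(one_minus_ned("ROAD W0RK AHEAD", "ROAD WORK AHEAD"))
```

A higher value means the recognized text is closer to the ground truth, with 1.0 for an exact match.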
SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics
Authors: Ailar Mahdizadeh, Puria Azadi Moghadam, Xiangteng He, Shahriar Mirabbasi, Panos Nasiopoulos, Leonid Sigal
First: 2025-11-04T21:03:17+00:00 · Latest: 2025-11-04T21:03:17+00:00
Abstract
Vision-language models (VLMs) have demonstrated strong cross-modal
capabilities, yet most work remains limited to 2D data and assumes binary
supervision (i.e., positive vs. negative pairs), overlooking the continuous and
structured dependencies present in volumetric data such as CT. Existing
approaches often treat volumetric scans as independent 2D slices, compromising
spatial coherence and underutilizing rich clinical semantics. We propose
SCALE-VLP, a soft-weighted contrastive vision-language pre-training framework
that integrates (i) volumetric spatial semantics to preserve anatomical
structure and (ii) domain-aware, knowledge-infused semantics (e.g.,
radiological ontologies) to guide alignment. This yields structurally
consistent and semantically grounded representations under limited supervision,
demonstrating strong cross-task transferability (retrieval, report generation,
and classification), and cross-domain generalizability with consistent gains
without further fine-tuning. In particular, compared to the previous state of
the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval,
improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and
BERT-F1 0.89 for report generation. Further, in zero-shot evaluation on an
out-of-domain external dataset, we observe consistent gains, indicating the
cross-task and cross-domain generalization ability of SCALE-VLP.
中文标题/摘要
标题:SCALE-VLP:软加权对比体视语言预训练框架,结合空间知识语义
视觉语言模型(VLMs)展示了强大的跨模态能力,但大多数工作仍局限于2D数据,并假设二元监督(即正负配对),忽视了CT等体数据中存在的连续和结构化依赖关系。现有方法通常将体扫描视为独立的2D切片,这损害了空间连贯性并未能充分利用丰富的临床语义。我们提出SCALE-VLP,这是一种结合(i)体视空间语义以保留解剖结构和(ii)领域感知、知识注入语义(例如放射学本体)的软加权对比视觉语言预训练框架。这在有限监督下产生了结构上一致且语义上合理的表示,展示了强大的跨任务迁移能力(检索、报告生成和分类),以及跨领域泛化能力,无需进一步微调即可获得一致的改进。特别是,与之前的最新技术相比,SCALE-VLP在CT报告检索上的top-1得分提高了4.3倍,异常分类提高了10个点,并且在报告生成上达到了ROUGE-L 0.44和BERT-F1 0.89。此外,在一个领域外的外部数据集上的零样本评估中,我们观察到一致的改进,表明了SCALE-VLP的跨任务和跨领域泛化能力。
Summary / 总结
SCALE-VLP is a pre-training framework for vision-language models that integrates volumetric spatial semantics and domain-specific knowledge to improve cross-modal capabilities. It achieves up to 4.3 times higher top-1 CT-report retrieval, 10-point improvement in abnormality classification, and ROUGE-L 0.44 and BERT-F1 0.89 for report generation. SCALE-VLP demonstrates strong cross-task and cross-domain generalizability without further fine-tuning.
SCALE-VLP 是一种预训练框架,结合了体积空间语义和领域相关的知识,以增强针对如 CT 扫描等 3D 数据的视觉语言模型。它使用软加权对比学习来保留解剖结构和临床语义,从而在跨任务和跨领域性能上表现出色。与之前的方法相比,SCALE-VLP 在 top-1 CT 报告检索上提高了高达 4.3 倍,异常分类提高了 10 个点,并且在报告生成中实现了 ROUGE-L 0.44 和 BERT-F1 0.89。此外,在外部数据集上的零样本评估中也观察到一致的性能提升,表明了 SCALE-VLP 的跨任务和跨领域泛化能力。
Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Authors: Rongxin Liao, Feng Li, Yanyan Wei, Zenglin Shi, Le Zhang, Huihui Bai, Meng Wang
First: 2025-03-12T03:03:06+00:00 · Latest: 2025-11-04T15:59:18+00:00
Abstract
Universal adverse weather removal (UAWR) seeks to address various weather
degradations within a unified framework. Recent methods are inspired by prompt
learning using pre-trained vision-language models (e.g., CLIP), leveraging
degradation-aware prompts to facilitate weather-free image restoration,
yielding significant improvements. In this work, we propose CyclicPrompt, an
innovative cyclic prompt approach designed to enhance the effectiveness,
adaptability, and generalizability of UAWR. CyclicPrompt comprises two key
components: 1) a composite context prompt that integrates weather-related
information and context-aware representations into the network to guide
restoration. This prompt differs from previous methods by marrying learnable
input-conditional vectors with weather-specific knowledge, thereby improving
adaptability across various degradations. 2) The erase-and-paste mechanism,
after the initial guided restoration, substitutes weather-specific knowledge
with constrained restoration priors, inducing high-quality weather-free
concepts into the composite prompt to further fine-tune the restoration
process. Therefore, we can form a cyclic "Prompt-Restore-Prompt" pipeline that
adeptly harnesses weather-specific knowledge, textual contexts, and reliable
textures. Extensive experiments on synthetic and real-world datasets validate
the superior performance of CyclicPrompt. The code is available at:
https://github.com/RongxinL/CyclicPrompt.
中文标题/摘要
标题:从提示恢复,恢复到提示:循环提示在通用不良天气去除中的应用
通用不良天气去除(UAWR)旨在在一个统一框架内解决各种天气退化问题。最近的方法受到预训练视觉-语言模型(如CLIP)提示学习的启发,利用退化感知提示来促进无天气图像恢复,取得了显著的改进。在本文中,我们提出了一种名为CyclicPrompt的创新循环提示方法,旨在增强UAWR的有效性、适应性和泛化能力。CyclicPrompt包含两个关键组件:1) 综合上下文提示,将与天气相关的信息和上下文感知表示整合到网络中,以指导恢复。这种提示与以往方法不同,通过结合可学习的输入条件向量和特定天气知识,提高了在各种退化中的适应性。2) 在初始引导恢复之后,擦除并粘贴机制用受限的恢复先验替换特定天气知识,将高质量的无天气概念引入综合提示中,进一步微调恢复过程。因此,我们可以形成一个循环的“提示-恢复-提示”管道,巧妙地利用特定天气知识、文本上下文和可靠的纹理。在合成和真实世界数据集上的大量实验验证了CyclicPrompt的优越性能。代码可在以下链接获取:https://github.com/RongxinL/CyclicPrompt.
Summary / 总结
CyclicPrompt is a novel cyclic prompting method for universal adverse weather removal (UAWR), which enhances the adaptability and generalizability of UAWR by integrating weather-related information and context-aware representations. It uses a composite context prompt and an erase-and-paste mechanism to refine the restoration process, forming a cyclic pipeline that effectively leverages weather-specific knowledge and textual contexts. Experiments show that CyclicPrompt outperforms existing methods on both synthetic and real-world datasets.
CyclicPrompt 是一种用于统一不良天气去除(UAWR)的新型循环提示方法,通过整合天气相关信息和上下文感知表示来增强 UAWR 的适应性和通用性。它使用复合上下文提示和擦除-粘贴机制来细化恢复过程,形成一个循环管道,有效地利用了天气特定知识和文本上下文。实验表明,CyclicPrompt 在合成和真实世界数据集上的表现优于现有方法。
TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Authors: Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
First: 2025-11-04T13:56:39+00:00 · Latest: 2025-11-04T13:56:39+00:00
Comments: 13 pages, 8 figures, 3 tables. The first two authors contributed
equally. Project Page: https://iyatomilab.github.io/TAUE
Abstract
Despite the remarkable success of text-to-image diffusion models, their
output of a single, flattened image remains a critical bottleneck for
professional applications requiring layer-wise control. Existing solutions
either rely on fine-tuning with large, inaccessible datasets or are
training-free yet limited to generating isolated foreground elements, failing
to produce a complete and coherent scene. To address this, we introduce the
Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a
novel framework for zero-shot, layer-wise image generation. Our core technique,
Noise Transplantation and Cultivation (NTC), extracts intermediate latent
representations from both foreground and composite generation processes,
transplanting them into the initial noise for subsequent layers. This ensures
semantic and structural coherence across foreground, background, and composite
layers, enabling consistent, multi-layered outputs without requiring
fine-tuning or auxiliary datasets. Extensive experiments show that our
training-free method achieves performance comparable to fine-tuned methods,
enhancing layer-wise consistency while maintaining high image quality and
fidelity. TAUE not only eliminates costly training and dataset requirements but
also unlocks novel downstream applications, such as complex compositional
editing, paving the way for more accessible and controllable generative
workflows.
中文标题/摘要
标题:TAUE:无需训练的噪声移植与培养扩散模型
尽管文本到图像扩散模型取得了显著的成功,但它们输出单一、扁平图像的局限性仍然是专业应用中逐层控制的瓶颈。现有解决方案要么依赖于大规模、难以获取的数据集的微调,要么是无需训练但只能生成孤立的前景元素,无法生成完整且连贯的场景。为了解决这一问题,我们提出了无需训练的噪声移植与培养扩散模型(TAUE),这是一种用于零样本、逐层图像生成的新框架。我们的核心技术,噪声移植与培养(NTC),从前景和合成生成过程中提取中间的潜在表示,并将其移植到初始噪声中,以供后续层使用。这确保了前景、背景和合成层之间的语义和结构一致性,从而在无需微调或辅助数据集的情况下实现一致的多层输出。大量实验表明,我们的无需训练方法在性能上与微调方法相当,增强了逐层一致性,同时保持了高质量和高保真度的图像。TAUE不仅消除了昂贵的训练和数据集需求,还解锁了新的下游应用,如复杂的组合编辑,为更易于访问和可控的生成工作流程铺平了道路。
Summary / 总结
TAUE is a training-free noise transplantation and cultivation diffusion model designed to generate multi-layered images with semantic and structural coherence. It extracts latent representations from both foreground and composite generation processes and transplants them into initial noise for subsequent layers, achieving performance comparable to fine-tuned methods without requiring additional datasets or training. Extensive experiments demonstrate that TAUE maintains high image quality and consistency across layers, enabling complex compositional editing and unlocking new applications.
TAUE 是一种无需微调和大型数据集的噪声移植和培养扩散模型,旨在生成具有语义和结构一致性的多层图像。它从前景和合成生成过程中提取中间的潜在表示,并将其移植到后续层的初始噪声中。实验表明,TAUE 的性能与微调方法相当,增强了层间的一致性并保持了高质量和高保真的图像。
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Authors: Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
First: 2025-10-19T15:38:06+00:00 · Latest: 2025-11-04T13:15:36+00:00
Abstract
Instruction-based image editing has achieved remarkable progress; however,
models solely trained via supervised fine-tuning often overfit to annotated
patterns, hindering their ability to explore and generalize beyond training
distributions. To this end, we introduce Edit-R1, a novel post-training
framework for instruction-based image editing based on policy optimization.
Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a
likelihood-free policy optimization method consistent with the flow matching
forward process, thereby enabling the use of higher-order samplers and more
efficient training. Another key challenge here is the absence of a universal
reward model, resulting from the diverse nature of editing instructions and
tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM)
as a unified, training-free reward model, leveraging its output logits to
provide fine-grained feedback. Furthermore, we carefully design a low-variance
group filtering mechanism to reduce MLLM scoring noise and stabilize
optimization. UniWorld-V2, trained with this framework, achieves
state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks,
scoring 4.49 and 7.83, respectively. Crucially, our framework is
model-agnostic, delivering substantial performance gains when applied to
diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its
wide applicability. Code and models are publicly available to support further
research.
中文标题/摘要
标题:Uniworld-V2: 使用扩散负样本感知微调和MLLM隐式反馈强化图像编辑
基于指令的图像编辑已经取得了显著进展;然而,仅通过监督微调训练的模型往往会过度拟合标注模式,限制了它们探索和泛化的能力。为了解决这一问题,我们提出了Edit-R1,一种基于策略优化的新型后训练框架。具体而言,我们利用了一种与流匹配前向过程一致的无似然性策略优化方法——扩散负样本感知微调(DiffusionNFT),这使得可以使用高阶采样器并进行更高效的训练。另一个关键挑战是没有通用的奖励模型,这源于编辑指令和任务的多样性。为了解决这一问题,我们采用了一种多模态大型语言模型(MLLM)作为统一的、无需训练的奖励模型,并利用其输出logits提供细粒度反馈。此外,我们精心设计了一种低方差组筛选机制,以减少MLLM评分噪声并稳定优化。使用此框架训练的UniWorld-V2在ImgEdit和GEdit-Bench基准测试中分别取得了4.49和7.83的优异结果。重要的是,我们的框架是模型无关的,当应用于诸如Qwen-Image-Edit和FLUX-Kontext等不同基础模型时,能够显著提高性能,展示了其广泛的应用性。代码和模型已公开,以支持进一步的研究。
Summary / 总结
The research aims to improve instruction-based image editing by addressing overfitting issues in supervised fine-tuning. It introduces Edit-R1, a post-training framework using Diffusion Negative-aware Finetuning and a Multimodal Large Language Model (MLLM) for reward modeling. The framework achieves state-of-the-art results on ImgEdit and GEdit-Bench benchmarks with scores of 4.49 and 7.83, respectively. The method is model-agnostic and enhances various base models like Qwen-Image-Edit and FLUX-Kontext.
该论文提出了基于策略优化的Edit-R1框架,用于指令驱动的图像编辑。该框架采用Diffusion Negative-aware Finetuning (DiffusionNFT) 和Multimodal Large Language Model (MLLM) 作为奖励模型,提供细粒度反馈。该框架在ImgEdit和GEdit-Bench基准测试中分别取得了4.49和7.83的得分,表现出色,并且是模型无关的,在不同基础模型上显示出显著的性能提升。
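The abstract says the MLLM's output logits provide fine-grained feedback and that a low-variance group filtering mechanism reduces scoring noise, but it gives no formulas. The sketch below is one plausible reading, not the authors' implementation: it assumes the reward is the softmax-weighted expectation over a few discrete score tokens, and that a group of sampled edits is dropped when its rewards barely differ. The function names, the score scale, and the `min_std` threshold are all hypothetical.

```python
import math
from statistics import pstdev


def expected_score(logits_by_score: dict) -> float:
    """Softmax-weighted mean score from logits of the score tokens.

    Assumption: keys are numeric score values (e.g. 1..5) and values are
    the MLLM logits for the corresponding score tokens.
    """
    peak = max(logits_by_score.values())  # subtract max for numerical stability
    weights = {s: math.exp(l - peak) for s, l in logits_by_score.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total


def keep_group(rewards: list, min_std: float = 0.05) -> bool:
    """Assumed filter: keep a group only if its rewards are spread out enough."""
    return pstdev(rewards) >= min_std


if __name__ == "__main__":
    # Hypothetical logits for score tokens 1..5 on a single edited image.
    print(expected_score({1: -2.0, 2: -1.0, 3: 0.5, 4: 2.0, 5: 1.0}))
    print(keep_group([0.50, 0.51, 0.50]))
```

Using an expectation rather than the argmax token keeps the reward continuous, which is one way logits could give finer-grained feedback than a single sampled score.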
Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
Authors: Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang
First: 2025-11-04T11:43:05+00:00 · Latest: 2025-11-04T11:43:05+00:00
Abstract
The automation of workflows in advanced microscopy is a key goal where
foundation models like Large Language Models (LLMs) and Vision-Language Models (VLMs)
show great potential. However, adapting these general-purpose models for
specialized scientific tasks is critical, and the optimal domain adaptation
strategy is often unclear. To address this, we introduce PtychoBench, a new
multi-modal, multi-task benchmark for ptychographic analysis. Using this
benchmark, we systematically compare two specialization strategies: Supervised
Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies
on a visual artifact detection task with VLMs and a textual parameter
recommendation task with LLMs in a data-scarce regime. Our findings reveal that
the optimal specialization pathway is task-dependent. For the visual task, SFT
and ICL are highly complementary, with a fine-tuned model guided by
context-aware examples achieving the highest mean performance (Micro-F1 of
0.728). Conversely, for the textual task, ICL on a large base model is the
superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a
powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm
the superiority of context-aware prompting and identify a consistent contextual
interference phenomenon in fine-tuned models. These results, benchmarked
against strong baselines including GPT-4o and a DINOv3-based classifier, offer
key observations for AI in science: the optimal specialization path in our
benchmark is dependent on the task modality, offering a clear framework for
developing more effective science-based agentic systems.
中文标题/摘要
标题:将通用基础模型适应于低数据量条件下的X射线 Ptychography
在先进显微镜工作流程的自动化方面,基础模型如语言模型(LLMs)和视觉-语言模型(VLMs)显示出巨大的潜力。然而,将这些通用模型适应于特定的科学任务至关重要,而最优的领域适应策略往往不明确。为解决这一问题,我们引入了PtychoBench,这是一种新的多模态、多任务基准,用于衍射分析。利用这一基准,我们系统地比较了两种专业化策略:监督微调(SFT)和上下文学习(ICL)。我们在数据稀缺的条件下,使用VLMs进行视觉伪影检测任务,使用LLMs进行文本参数推荐任务,评估这些策略。我们的研究发现,最优的专业化路径取决于任务。对于视觉任务,SFT和ICL高度互补,微调模型在上下文感知示例的引导下,达到最高的平均性能(Micro-F1为0.728)。相反,对于文本任务,大型基础模型上的ICL是更优策略,达到峰值Micro-F1为0.847,优于强大的“超级专家”SFT模型(零样本Micro-F1为0.839)。我们还确认了上下文感知提示的优越性,并在微调模型中发现了一致的上下文干扰现象。这些结果,与包括GPT-4o和基于DINOv3的分类器在内的强基线进行基准测试,为科学中的AI提供了关键观察:在我们的基准中,最优的专业化路径取决于任务模态,为开发更有效的基于科学的代理系统提供了清晰的框架。
Summary / 总结
The study aims to evaluate the effectiveness of two specialization strategies, Supervised Fine-Tuning (SFT) and In-Context Learning (ICL), for adapting general-purpose models to specialized scientific tasks in a data-scarce regime. Using PtychoBench, a new benchmark for ptychographic analysis, the research compares these strategies on visual artifact detection and textual parameter recommendation tasks. The findings indicate that the optimal pathway depends on the task: SFT and ICL are complementary for the visual task, with a fine-tuned model guided by context-aware examples achieving the highest performance, while ICL on a large base model outperforms SFT for the textual task, reaching a peak performance of 0.847 Micro-F1. These results highlight the task-dependent nature of the optimal specialization strategy in scientific applications.
研究旨在探索如何将通用模型如VLM和LLM适应于特定的科学任务,特别是ptychographic分析。使用PtychoBench这一新基准,比较了两种策略:监督微调(SFT)和上下文学习(ICL)。对于视觉特征检测任务,SFT和ICL表现出高度互补性,最佳性能由一个由上下文感知示例引导的微调模型实现。对于文本参数推荐任务,大型基模型上的ICL优于SFT,达到了最高性能。这些发现表明,最佳的适应路径取决于任务的模态。
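Both headline results above are Micro-F1 scores. A minimal sketch of the metric for multi-label predictions follows: micro-averaging pools true positives, false positives, and false negatives over all examples and labels before computing F1 = 2TP / (2TP + FP + FN). The label names in the example are hypothetical and are not taken from PtychoBench.

```python
def micro_f1(y_true, y_pred) -> float:
    """Micro-averaged F1 over per-example label sets."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        t, p = set(true_labels), set(pred_labels)
        tp += len(t & p)  # labels predicted and present
        fp += len(p - t)  # labels predicted but absent
        fn += len(t - p)  # labels present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical artifact labels for three images.
    truth = [{"blur", "grid"}, {"noise"}, set()]
    preds = [{"blur"}, {"noise", "grid"}, set()]
    print(micro_f1(truth, preds))
```

Because counts are pooled, frequent labels weigh more than rare ones, which matters when comparing scores such as 0.847 and 0.839 on an imbalanced label set.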
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Authors: Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi
Venue: NeurIPS 2025
First: 2025-03-13T18:59:12+00:00 · Latest: 2025-11-04T10:25:46+00:00
Comments: NeurIPS 2025
Abstract
Recent advances in operating system (OS) agents have enabled vision-language
models (VLMs) to directly control a user's computer. Unlike conventional VLMs
that passively output text, OS agents autonomously perform computer-based tasks
in response to a single user prompt. OS agents do so by capturing, parsing, and
analysing screenshots and executing low-level actions via application
programming interfaces (APIs), such as mouse clicks and keyboard inputs. This
direct interaction with the OS significantly raises the stakes, as failures or
manipulations can have immediate and tangible consequences. In this work, we
uncover a novel attack vector against these OS agents: Malicious Image Patches
(MIPs), adversarially perturbed screen regions that, when captured by an OS
agent, induce it to perform harmful actions by exploiting specific APIs. For
instance, a MIP can be embedded in a desktop wallpaper or shared on social
media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs
generalise across user prompts and screen configurations, and that they can
hijack multiple OS agents even during the execution of benign instructions.
These findings expose critical security vulnerabilities in OS agents that have
to be carefully addressed before their widespread deployment.
中文标题/摘要
标题:MIP对抗代理:恶意图像补丁劫持多模态OS代理
近年来操作系统(OS)代理的进步使视觉语言模型(VLMs)能够直接控制用户的计算机。与传统的被动输出文本的VLMs不同,OS代理能够自主执行基于计算机的任务,仅需一个用户指令。OS代理通过捕获、解析和分析屏幕截图,并通过应用程序编程接口(APIs)执行低级操作(如鼠标点击和键盘输入)来实现这一目标。这种直接与OS的交互显著提高了风险,因为失败或操纵可能会立即产生实际后果。在本研究中,我们发现了一种针对这些OS代理的新攻击向量:恶意图像补丁(MIPs),这些对抗性扰动的屏幕区域在被OS代理捕获时,会通过利用特定的APIs诱导其执行有害操作。例如,MIP可以嵌入在桌面壁纸中或在社交媒体上分享,以导致OS代理泄露敏感用户数据。我们展示了MIPs在用户指令和屏幕配置方面具有泛化能力,并且即使在执行良性指令期间也能劫持多个OS代理。这些发现揭示了OS代理中关键的安全漏洞,这些漏洞在广泛部署之前必须仔细解决。
From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Authors: Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf
First: 2025-11-04T09:58:29+00:00 · Latest: 2025-11-04T09:58:29+00:00
Comments: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI
International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18
DECEMBER 2025
Abstract
Video Understanding, Scene Interpretation and Commonsense Reasoning are
highly challenging tasks enabling the interpretation of visual information,
allowing agents to perceive, interact with and make rational decisions in their
environment. Large Language Models (LLMs) and Visual Language Models (VLMs)
have shown remarkable advancements in these areas in recent years, enabling
domain-specific applications as well as zero-shot open vocabulary tasks,
combining multiple domains. However, the required computational complexity
poses challenges for their application on edge devices and in the context of
Mobile Robotics, especially considering the trade-off between accuracy and
inference time. In this paper, we investigate the capabilities of
state-of-the-art VLMs for the task of Scene Interpretation and Action
Recognition, with special regard to small VLMs capable of being deployed to
edge devices in the context of Mobile Robotics. The proposed pipeline is
evaluated on a diverse dataset consisting of various real-world cityscape,
on-campus and indoor scenarios. The experimental evaluation discusses the
potential of these small models on edge devices, with particular emphasis on
challenges, weaknesses, inherent model biases and the application of the gained
information. Supplementary material is provided via the following repository:
https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
中文标题/摘要
标题:从实验室到实际应用:在移动机器人领域评估边缘设备上的零样本场景解释
视频理解、场景解释和常识推理是高度挑战性的任务,能够解释视觉信息,使智能体能够感知、交互并理性地做出决策。近年来,大型语言模型(LLMs)和视觉语言模型(VLMs)在这些领域取得了显著进展,不仅能够实现特定领域的应用,还能完成零样本开放式词汇任务,结合多个领域。然而,所需的计算复杂性为它们在边缘设备上的应用以及在移动机器人领域的应用带来了挑战,特别是在准确性和推理时间之间的权衡。本文研究了最先进的VLMs在场景解释和动作识别任务中的能力,特别关注适用于移动机器人领域边缘设备的小型VLMs。提出的管道在包含各种真实城市景观、校园和室内场景的多样数据集上进行了评估。实验评估讨论了这些小型模型在边缘设备上的潜力,特别是挑战、弱点、固有的模型偏差以及获得的信息的应用。补充材料可通过以下存储库提供:https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
Summary / 总结
This paper evaluates the capabilities of state-of-the-art Visual Language Models (VLMs) for zero-shot scene interpretation and action recognition on edge devices in mobile robotics. The study focuses on small VLMs suitable for deployment on edge devices, using a diverse dataset of real-world scenarios. Key findings highlight the potential of these models on edge devices, though they also reveal challenges and inherent biases, emphasizing the need for further optimization and understanding of model limitations.
本文评估了最先进的视觉语言模型(VLMs)在移动机器人领域的零样本场景解释和动作识别能力,重点关注其在边缘设备上的部署。研究使用了多样化的数据集,并强调了小型VLMs的潜力,同时讨论了面临的挑战和模型固有的偏差。
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Authors: Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Junyuan Gao, Yubin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He
First: 2025-11-04T09:08:44+00:00 · Latest: 2025-11-04T09:08:44+00:00
Abstract
Large-scale chemical reaction datasets are crucial for AI research in
chemistry. However, existing chemical reaction data often exist as images
within papers, making them not machine-readable and unusable for training
machine learning models. In response to this challenge, we propose the
RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP).
Our framework reformulates the traditional coordinate prediction driven parsing
process into an image captioning problem, which Large Vision-Language Models
(LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as
Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector,
MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the
input image. This turns the downstream parsing into a natural-language
description problem. Extensive experiments show that the BIVP strategy
significantly improves structural extraction quality while simplifying model
design. We further construct the RxnCaption-11k dataset, an order of magnitude
larger than prior real-world literature benchmarks, with a balanced test subset
across four layout archetypes. Experiments demonstrate that RxnCaption-VL
achieves state-of-the-art performance on multiple metrics. We believe our
method, dataset, and models will advance structured information extraction from
chemical literature and catalyze broader AI applications in chemistry. We will
release data, models, and code on GitHub.
中文标题/摘要
标题:RxnCaption: 将化学反应图解析重新定义为视觉提示引导的描述
大规模的化学反应数据集对于化学领域的AI研究至关重要。然而,现有的化学反应数据通常以论文中的图像形式存在,使其无法被机器读取和用于训练机器学习模型。为应对这一挑战,我们提出了RxnCaption框架,用于化学反应图解析(RxnDP)任务。我们的框架将传统的基于坐标的解析过程重新定义为图像描述问题,这是大型视觉语言模型(LVLMs)自然处理的问题。我们引入了一种称为“边界框和索引作为视觉提示”(BIVP)的策略,使用我们最先进的分子检测器MolYOLO在输入图像上预先绘制分子边界框和索引。这将下游解析转化为自然语言描述问题。广泛的实验表明,BIVP策略显著提高了结构提取质量,简化了模型设计。我们进一步构建了RxnCaption-11k数据集,其规模比之前的实际文献基准数据集大一个数量级,并且具有平衡的测试子集,跨越四种布局架构。实验表明,RxnCaption-VL在多个指标上达到了最先进的性能。我们认为,我们的方法、数据集和模型将促进化学文献中结构化信息的提取,并推动更广泛的化学领域的AI应用。我们将在GitHub上发布数据、模型和代码。
Summary / 总结
The paper proposes RxnCaption, a framework that reformulates chemical Reaction Diagram Parsing as a visual prompt guided captioning task, using a BBox and Index as Visual Prompt (BIVP) strategy to improve structural extraction quality. The framework leverages Large Vision-Language Models and a state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices on input images, simplifying model design. Experiments show that RxnCaption-VL outperforms existing methods on multiple metrics and the constructed RxnCaption-11k dataset is significantly larger than previous benchmarks, facilitating more robust training and evaluation. This work advances structured information extraction from chemical literature and enhances AI applications in chemistry.
该论文提出了RxnCaption框架,将化学反应图解析重新表述为视觉提示引导的图像描述任务,使用BBox和Index作为视觉提示(BIVP)策略来提高结构提取质量。该框架利用大型视觉-语言模型和最先进的分子检测器MolYOLO,在输入图像上预先绘制分子边界框和索引,简化了模型设计。实验表明,RxnCaption-VL在多个指标上优于现有方法,并构建了RxnCaption-11k数据集,比之前的基准数据集大得多,促进了更稳健的训练和评估。这项工作推进了化学文献中的结构化信息提取,并增强了化学领域的AI应用。
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Authors: Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Venue: NeurIPS 2025
First: 2025-06-03T17:24:55+00:00 · Latest: 2025-11-04T09:06:34+00:00
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS
2025)
Abstract
Motivated by the hypothesis that neural network representations encode
abstract, interpretable features as linearly accessible, approximately
orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in
interpretability. However, recent work has demonstrated phenomenology of model
representations that lies outside the scope of this hypothesis, showing
signatures of hierarchical, nonlinear, and multi-dimensional features. This
raises the question: do SAEs represent features that possess structure at odds
with their motivating hypothesis? If not, does avoiding this mismatch help
identify said features and gain further insights into neural network
representations? To answer these questions, we take a construction-based
approach and re-contextualize the popular matching pursuits (MP) algorithm from
sparse coding to design MP-SAE -- an SAE that unrolls its encoder into a
sequence of residual-guided steps, allowing it to capture hierarchical and
nonlinearly accessible features. Comparing this architecture with existing SAEs
on a mixture of synthetic and natural data settings, we show: (i) hierarchical
concepts induce conditionally orthogonal features, which existing SAEs are
unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE
recovers highly meaningful features, helping us unravel shared structure in the
seemingly dichotomous representation spaces of different modalities in a
vision-language model, hence demonstrating the assumption that useful features
are solely linearly accessible is insufficient. We also show that the
sequential encoder principle of MP-SAE affords an additional benefit of
adaptive sparsity at inference time, which may be of independent interest.
Overall, we argue our results provide credence to the idea that
interpretability should begin with the phenomenology of representations, with
methods emerging from assumptions that fit it.
中文标题/摘要
标题:从平面到层次结构:使用匹配追求提取稀疏表示
受神经网络表示可以编码为线性可访问的、近似正交的方向这一假设的启发,稀疏自编码器(SAEs)已成为可解释性的一个流行工具。然而,最近的工作展示了模型表示的现象学,这超出了这一假设的范围,显示出层次化、非线性和多维特征的特征。这提出了一个问题:SAEs是否表示了与其假设相矛盾的结构特征?如果不是,避免这种不匹配是否有助于识别这些特征并进一步了解神经网络表示?为了回答这些问题,我们采取了一种基于构造的方法,并重新将流行的匹配追求(MP)算法从稀疏编码重新定位,设计了MP-SAE——一种将编码器展开为残差引导步骤的SAE,使其能够捕捉层次化和非线性可访问的特征。在合成和自然数据设置的比较中,我们展示了:(i) 层次概念诱导条件正交特征,现有的SAEs无法忠实捕捉,(ii) MP-SAE的非线性编码步骤恢复了高度有意义的特征,帮助我们揭示了不同模态在视觉语言模型中看似二元表示空间中的共享结构,从而证明假设有用的特征仅线性可访问是不足的。我们还展示了MP-SAE的顺序编码原理在推理时提供了额外的自适应稀疏性益处,这可能具有独立的兴趣。总体而言,我们认为我们的结果支持了可解释性应从表示的现象学开始,方法应从适应其假设中产生这一观点。
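The abstract describes an encoder unrolled into a sequence of residual-guided steps. The sketch below shows the classical matching pursuit loop from sparse coding that this idea builds on. It is not MP-SAE itself: it assumes a fixed dictionary of unit-norm atoms, whereas MP-SAE learns its dictionary, and the names `atoms` and `n_steps` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def matching_pursuit(x, atoms, n_steps):
    """Greedy sparse code of x over unit-norm atoms.

    Each step picks the atom most correlated with the current residual,
    adds its coefficient to the code, and subtracts its contribution.
    """
    residual = list(x)
    codes = [0.0] * len(atoms)
    for _ in range(n_steps):
        scores = [dot(residual, d) for d in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        if abs(scores[k]) < 1e-12:
            break  # residual is orthogonal to every atom; nothing left to explain
        codes[k] += scores[k]
        residual = [r - scores[k] * d for r, d in zip(residual, atoms[k])]
    return codes, residual


if __name__ == "__main__":
    atoms = [[1.0, 0.0], [0.0, 1.0]]  # toy orthonormal dictionary
    print(matching_pursuit([3.0, -2.0], atoms, n_steps=1))
    print(matching_pursuit([3.0, -2.0], atoms, n_steps=2))
```

Because each step depends on the residual left by the previous ones, the number of steps can be chosen at inference time, which is in the spirit of the adaptive sparsity the abstract mentions.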
AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models
Authors: Aashray Reddy, Andrew Zagula, Nicholas Saban
First: 2025-11-04T08:56:28+00:00 · Latest: 2025-11-04T08:56:28+00:00
Abstract
Large Language Models (LLMs) remain vulnerable to jailbreaking attacks where
adversarial prompts elicit harmful outputs, yet most evaluations focus on
single-turn interactions while real-world attacks unfold through adaptive
multi-turn conversations. We present AutoAdv, a training-free framework for
automated multi-turn jailbreaking that achieves up to 95% attack success rate
on Llama-3.1-8B within six turns, a 24 percent improvement over single-turn
baselines. AutoAdv uniquely combines three adaptive mechanisms: a pattern
manager that learns from successful attacks to enhance future prompts, a
temperature manager that dynamically adjusts sampling parameters based on
failure modes, and a two-phase rewriting strategy that disguises harmful
requests then iteratively refines them. Extensive evaluation across commercial
and open-source models (GPT-4o-mini, Qwen3-235B, Mistral-7B) reveals persistent
vulnerabilities in current safety mechanisms, with multi-turn attacks
consistently outperforming single-turn approaches. These findings demonstrate
that alignment strategies optimized for single-turn interactions fail to
maintain robustness across extended conversations, highlighting an urgent need
for multi-turn-aware defenses.
中文标题/摘要
标题:AutoAdv:自动化对抗提示以实现大型语言模型的多轮脱笼攻击
大型语言模型(LLMs)仍然容易受到脱笼攻击的影响,其中对抗提示会引发有害输出,但大多数评估主要集中在单轮交互上,而实际攻击则通过适应性的多轮对话展开。我们提出了AutoAdv,这是一种无需训练的框架,用于实现自动化多轮脱笼攻击,在六轮内对Llama-3.1-8B的攻击成功率高达95%,比单轮基线提高了24个百分点。AutoAdv独特地结合了三种适应性机制:一个模式管理器,可以从成功的攻击中学习以增强未来的提示;一个温度管理器,根据失败模式动态调整采样参数;以及一个两阶段重写策略,先隐藏有害请求,然后逐步优化它们。广泛的评估表明,当前的安全机制存在持续的漏洞,多轮攻击始终优于单轮方法。这些发现表明,针对单轮交互优化的对齐策略无法在长时间对话中保持鲁棒性,突显了对多轮攻击意识的防御措施的迫切需求。
Summary / 总结
The research aims to address the vulnerability of Large Language Models (LLMs) to multi-turn jailbreaking attacks, which are more realistic than single-turn evaluations. AutoAdv, a training-free framework, uses adaptive mechanisms such as a pattern manager, a temperature manager, and a two-phase rewriting strategy to achieve a 95% attack success rate within six turns, significantly improving upon single-turn baselines. The study demonstrates persistent vulnerabilities in current safety mechanisms and highlights the need for multi-turn-aware defenses.
AutoAdv 是一个无需训练的框架,用于自动化 LLM 的多轮脱管攻击,六轮内成功率可达 95%,比单轮基线高出 24%。它结合了三种适应性机制:模式管理器、温度管理器和两阶段重写策略,以增强未来提示、动态调整采样参数并掩饰有害请求。广泛评估表明,多轮攻击优于单轮攻击,这表明需要多轮意识的防御策略。
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
First: 2025-03-23T13:18:17+00:00 · Latest: 2025-11-04T08:39:38+00:00
Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems
Abstract
Data scarcity is a long-standing challenge in the Vision-Language Navigation
(VLN) field, which severely hinders the generalization of agents to unseen
environments. Previous works primarily rely on additional simulator data or
web-collected images/videos to improve the generalization. However, the
simulator environments still face limited diversity, and the web-collected data
often requires extensive labor to remove the noise. In this paper, we propose a
Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates
the unseen observation-instruction pairs via rewriting human-annotated training
data. Benefiting from our rewriting mechanism, new observation-instruction
pairs can be obtained in both simulator-free and labor-saving manners to
promote generalization. Specifically, we first introduce Object-Enriched
Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large
Language Models (LLMs) to derive rewritten object-enriched scene descriptions,
enabling observation synthesis with diverse objects and spatial layouts via
Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast
Instruction Rewriting, which generates observation-aligned rewritten
instructions by requiring LLMs to reason about the difference between original and
new observations. We further develop a mixing-then-focusing training strategy
with a random observation cropping scheme, effectively enhancing data
distribution diversity while suppressing augmentation data noise during
training. Experiments on both the discrete environments (R2R, REVERIE, and R4R
datasets) and continuous environments (R2R-CE dataset) show the superior
performance and impressive generalization ability of our method. Code is
available at https://github.com/SaDil13/VLN-RAM.
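Title / Abstract (translated from Chinese)
Title: Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Vision-Language Navigation (VLN) has long faced the challenge of data scarcity, which severely hinders agents' generalization to unseen environments. Previous work mainly relies on additional simulator data or web-collected images/videos to improve generalization. However, simulator environments still offer limited diversity, and web-collected data often requires extensive labor to remove noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm that directly generates unseen observation-instruction pairs by rewriting human-annotated training data. Thanks to this rewriting mechanism, new observation-instruction pairs can be obtained without a simulator and with little manual labor, promoting generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, which combines Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten, object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). We then propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between the original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, which effectively increases the diversity of the data distribution while suppressing augmentation noise during training. Experiments in discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM/.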
Summary
This paper addresses data scarcity in Vision-Language Navigation (VLN) by proposing a Rewriting-driven AugMentation (RAM) paradigm. RAM rewrites human-annotated training data to generate unseen observation-instruction pairs, improving generalization without relying on additional simulator data or web-collected images/videos. The method comprises Object-Enriched Observation Rewriting and Observation-Contrast Instruction Rewriting, and employs a mixing-then-focusing training strategy with random observation cropping to increase data diversity while suppressing augmentation noise. Experiments in discrete environments (R2R, REVERIE, R4R) and continuous environments (R2R-CE) demonstrate the method's superior performance and strong generalization ability.
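The paper addresses data scarcity in Vision-Language Navigation (VLN) by proposing a Rewriting-driven AugMentation (RAM) paradigm. The method generates unseen observation-instruction pairs from existing data, improving generalization without relying on additional simulator data or web-collected images/videos. It comprises Object-Enriched Observation Rewriting and Observation-Contrast Instruction Rewriting, and adopts a mixing-then-focusing training strategy to increase data diversity and reduce noise. Experiments on multiple datasets show superior performance and strong generalization.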
CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Authors: Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
First: 2025-11-04T08:28:46+00:00 · Latest: 2025-11-04T08:28:46+00:00
Abstract
In human cognition, there exist numerous thought processes that are tacit and
beyond verbal expression, enabling us to understand and interact with the world
in multiple ways. However, contemporary Vision-Language Models (VLMs) remain
constrained to reasoning within the discrete and rigid space of linguistic
tokens, thereby bottlenecking the rich, high-dimensional nature of visual
perception. To bridge this gap, we propose CoCoVa (Chain of Continuous
Vision-Language Thought), a novel framework for vision-language models that
leverages continuous cross-modal reasoning for diverse vision-language tasks.
The core of CoCoVa is an iterative reasoning cycle, where a novel Latent
Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a
chain of latent thought vectors through cross-modal fusion. To focus this
process, a token selection mechanism dynamically identifies salient visual
regions, mimicking attentional focus. To ensure these latent thoughts remain
grounded, we train the model with a multi-task objective that combines
contrastive learning and diffusion-based reconstruction, enforcing alignment
between latent representations and both visual and textual modalities.
Evaluations show CoCoVa improves accuracy and token efficiency over strong
baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B
models on almost all benchmarks. When scaled to 7B LLM backbones, it remains
competitive with state-of-the-art models. Qualitative analysis validates that
the learned latent space captures interpretable and structured reasoning patterns,
highlighting the potential of CoCoVa to bridge the representational gap between
discrete language processing and the continuous nature of visual understanding.
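Title / Abstract (translated from Chinese)
Title: CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
Human cognition involves many thought processes that are tacit and beyond verbal expression, allowing us to understand and interact with the world in multiple ways. However, current Vision-Language Models (VLMs) remain confined to reasoning within the discrete and rigid space of linguistic tokens, which limits the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a new framework that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle in which a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To keep these latent thoughts grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between the latent representations and both the visual and textual modalities. Evaluations show that CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it matches or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis confirms that the learned latent space captures interpretable and structured reasoning patterns, highlighting CoCoVa's potential to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.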
Summary
CoCoVa is a novel framework that enhances vision-language models with continuous cross-modal reasoning, iteratively refining a chain of latent thought vectors through cross-modal fusion. It uses a Latent Q-Former as a dynamic reasoning engine and a token selection mechanism to focus on salient visual regions. CoCoVa improves accuracy and token efficiency over strong baselines; with a 1.5B backbone it competes with or surpasses larger 7B-9B models on almost all benchmarks. Qualitative analysis shows that the learned latent space captures interpretable reasoning patterns, helping bridge the gap between discrete language processing and continuous visual understanding.
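CoCoVa is a new framework that enhances vision-language models through continuous cross-modal reasoning, iteratively refining latent thought vectors via cross-modal fusion. It uses a Latent Q-Former as a dynamic reasoning engine and a token selection mechanism to focus on salient visual regions. CoCoVa improves accuracy and token efficiency across benchmarks and is competitive with larger models. Qualitative analysis shows that the learned latent space captures interpretable reasoning patterns, narrowing the representational gap between discrete language processing and continuous visual understanding.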
Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
Authors: Jiawei Wang, Dingyou Wang, Jiaming Hu, Qixuan Zhang, Jingyi Yu, Lan Xu
First: 2025-11-03T07:21:42+00:00 · Latest: 2025-11-04T07:22:41+00:00
Comments: project page: https://sites.google.com/deemos.com/kinematify
Abstract
A deep understanding of kinematic structures and movable components is
essential for enabling robots to manipulate objects and model their own
articulated forms. Such understanding is captured through articulated objects,
which are central to tasks such as physical simulation, motion planning, and
policy learning. However, creating these models, particularly for objects with
high degrees of freedom (DoF), remains a significant challenge. Existing
methods typically rely on motion sequences or strong assumptions from
hand-curated datasets, which hinders scalability. In this paper, we introduce
Kinematify, an automated framework that synthesizes articulated objects
directly from arbitrary RGB images or textual descriptions. Our method
addresses two core challenges: (i) inferring kinematic topologies for high-DoF
objects and (ii) estimating joint parameters from static geometry. To achieve
this, we combine Monte Carlo Tree Search (MCTS) for structural inference with geometry-driven
optimization for joint reasoning, producing physically consistent and
functionally valid descriptions. We evaluate Kinematify on diverse inputs from
both synthetic and real-world environments, demonstrating improvements in
registration and kinematic topology accuracy over prior work.
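Title / Abstract (translated from Chinese)
Title: Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
A deep understanding of kinematic structures and movable components is essential for enabling robots to manipulate objects and model their own articulated forms. This understanding is captured through articulated objects, which are central to tasks such as physical simulation, motion planning, and policy learning. However, creating these models, particularly for objects with high degrees of freedom (DoF), remains a significant challenge. Existing methods typically rely on motion sequences or on strong assumptions drawn from hand-curated datasets, which limits their scalability. In this paper, we introduce Kinematify, an automated framework that synthesizes articulated objects directly from arbitrary RGB images or textual descriptions. Our method addresses two core challenges: (i) inferring kinematic topologies for high-DoF objects and (ii) estimating joint parameters from static geometry. To this end, we combine MCTS-based structural inference with geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. We evaluate Kinematify on diverse inputs from both synthetic and real-world environments, demonstrating improvements in registration and kinematic topology accuracy over prior work.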
Summary
Kinematify is an automated framework that synthesizes articulated objects from arbitrary RGB images or textual descriptions, addressing the challenge of creating models for objects with high degrees of freedom. It uses MCTS for structural inference and geometry-driven optimization for joint reasoning, producing physically consistent and functionally valid descriptions. Experiments on synthetic and real-world inputs show improvements in registration and kinematic topology accuracy over prior work.
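Kinematify is an automated framework that synthesizes articulated objects from RGB images or textual descriptions, addressing the challenge of building high-DoF models. It uses MCTS for structural inference and geometry-driven optimization for joint reasoning, generating physically consistent and functionally valid descriptions. Experimental results show better registration and kinematic topology accuracy than prior methods.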