arXiv Paper Digest

2025-11-27 03:27
Snapshot: 20251127_0327
MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
Authors: Tooba Tehreem Sheikh, Jean Lahoud, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal
First: 2025-11-25T18:59:53+00:00 · Latest: 2025-11-25T18:59:53+00:00
Abstract
Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at https://github.com/toobatehreem/MedROV.
中文标题/摘要
标题:MedROV:跨多种医学成像模态的实时开放词汇检测
传统医学影像中的对象检测模型在封闭集范式下运行,限制了它们检测新标签对象的能力。开放词汇对象检测(OVOD)解决了这一限制,但由于数据集稀缺和文本-图像对齐较弱,医学影像领域对此研究较少。为弥合这一差距,我们引入了MedROV,这是首个用于医学影像的实时开放词汇检测模型。为了实现开放词汇学习,我们构建了一个大规模数据集Omnis,包含九种成像模态下的60万检测样本,并引入了一种伪标签策略来处理多源数据集中缺失的注释。此外,我们通过引入大型预训练基础模型的知识来增强泛化能力。通过利用对比学习和跨模态表示,MedROV 有效地检测了已知和新型结构。实验结果表明,MedROV 在医学图像检测中优于之前的最先进的基础模型,平均绝对改进率为40 mAP50,并且在超过封闭集检测器3 mAP50的同时,运行速度达到70 FPS,为医学检测设立了新的基准。我们的源代码、数据集和训练模型可在 https://github.com/toobatehreem/MedROV 获取。
Summary / 总结
MedROV is a real-time open-vocabulary detection model for medical imaging that addresses the limitations of traditional closed-set models. It introduces a large-scale dataset, Omnis, and a pseudo-labeling strategy to handle missing annotations, and leverages a large pre-trained foundation model to enhance generalization. MedROV outperforms the previous state-of-the-art foundation model for medical image detection by an average of 40 mAP50 (absolute) and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS.
MedROV 是一种针对医学影像的实时开放词汇检测模型,解决了传统封闭集模型的局限性。它引入了一个包含60万样本的大规模数据集Omnis,覆盖九种成像模态,并使用伪标签策略处理多源数据中的缺失注释。通过结合大型预训练模型的知识和对比学习,MedROV 在平均绝对改进40 mAP50方面超过了之前的最先进的模型,并且在闭集检测器上超过3 mAP50,运行速度为70 FPS。
Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
First: 2025-11-25T18:59:46+00:00 · Latest: 2025-11-25T18:59:46+00:00
Comments: Project Page: https://infinity-rope.github.io/
Abstract
Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.
中文标题/摘要
标题:Infinity-RoPE:动作可控的无限视频生成源自自回归自我展开
当前的自回归视频扩散模型受到三个核心瓶颈的限制:(i) 基本模型3D旋转位置嵌入(3D-RoPE)施加的有限时间窗口;(ii) 在长时间展开过程中保持精细动作控制的缓慢响应;(iii) 无法在单个生成流中实现不连续的电影转换。我们引入了$\infty$-RoPE,这是一种统一的推理时框架,通过三个相互关联的组件解决了所有三个限制:块相对RoPE、KV刷新和RoPE剪切。块相对RoPE将时间编码重新表述为移动的局部参考框架,其中每个新生成的潜在块相对于基模型的最大帧窗口旋转,而较早的块则向后旋转以保持相对时间几何。这种相对表述消除了固定的时间位置,使视频生成能够远远超出基位置限制。为了在不重新编码的情况下获得精细的动作控制,KV刷新通过保留全局汇和最后一个生成的潜在帧来更新KV缓存,从而确保即时的提示响应。最后,RoPE剪切引入了时间RoPE坐标中的受控不连续性,使单个连续展开中能够实现多剪辑场景过渡。这些组件共同确立了$\infty$-RoPE作为无限时间、可控和电影风格视频扩散的无训练基础。全面的实验表明,$\infty$-RoPE在总体VBench评分上始终优于之前的自回归模型。
Summary / 总结
The paper introduces $\infty$-RoPE, a method addressing three core limitations of current autoregressive video diffusion models: finite temporal horizon, slow prompt responsiveness, and inability to achieve discontinuous transitions. It uses Block-Relativistic RoPE, KV Flush, and RoPE Cut to enable continuous video generation beyond the base model's temporal limits, maintain fine-grained action control, and allow for cinematic scene transitions. Experiments show $\infty$-RoPE outperforms previous models in overall VBench scores.
研究通过引入$\infty$-RoPE解决了当前自回归视频扩散模型的限制,克服了有限的时间范围、缓慢的提示响应以及无法实现断点过渡的问题。该方法包括Block-Relativistic RoPE、KV Flush和RoPE Cut,使视频生成连续、具有精细的动作控制和无缝场景过渡。实验表明,$\infty$-RoPE在整体VBench评分上优于之前的模型。
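The KV Flush step described in the $\infty$-RoPE abstract above (keep only the global sink frame and the last generated latent frame) can be pictured with a toy cache. This is a minimal sketch under stated assumptions, not the authors' implementation: `CachedFrame`, `is_sink`, and `kv_flush` are hypothetical names, the real cache holds key/value tensors per attention layer, and the fallback to the first frame as the sink is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedFrame:
    """Stand-in for one latent frame's cached keys/values (tensor payload omitted)."""
    frame_id: int
    is_sink: bool = False  # hypothetical flag marking the global sink frame


def kv_flush(cache: List[CachedFrame]) -> List[CachedFrame]:
    """Return a cache that retains only the sink frame and the most recent frame."""
    if not cache:
        return []
    # Assumption: if no frame is flagged as the sink, treat the first frame as the sink.
    sink = next((f for f in cache if f.is_sink), cache[0])
    last = cache[-1]
    # Avoid duplicating the entry when the cache holds a single frame.
    return [sink] if last is sink else [sink, last]


cache = [CachedFrame(0, is_sink=True), CachedFrame(1), CachedFrame(2), CachedFrame(3)]
flushed = kv_flush(cache)
print([f.frame_id for f in flushed])
```

After the flush, a new prompt conditions generation on two frames only, which is one way to read the abstract's claim of immediate prompt responsiveness.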
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Authors: Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
First: 2025-11-25T18:59:45+00:00 · Latest: 2025-11-25T18:59:45+00:00
Comments: Tech report. Project page: https://nvlabs.github.io/LocateAnything3D/
Abstract
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
中文标题/摘要
标题:LocateAnything3D:基于视线链的视觉-语言3D检测
为了在世界中行动,模型必须命名它所看到的并知道其在3D中的位置。今天的视觉-语言模型(VLM)在开放的2D描述和语义定位方面表现出色,但多对象3D检测仍然缺乏于VLM工具箱中。我们提出了LocateAnything3D,这是一种VLM原生的方法,将3D检测视为下一个标记预测问题。关键在于一个简短明确的视线链(CoS)序列,这反映了人类如何从图像中推理:先在2D中找到一个物体,然后推断其距离、大小和姿态。解码器首先以视觉链的方式发出2D检测,然后在容易到困难的课程中预测3D框:在对象之间,从近到远的顺序减少了早期的不确定性并匹配了以自我为中心的实用性;在每个对象内部,从相机中心、尺寸和旋转的分解按稳定性和可学习性排列信息。这种VLM原生的接口保留了开放词汇和视觉提示的能力,而无需专门的头部。在具有挑战性的Omni3D基准测试中,我们的模型达到了最先进的结果,3D AP得分为49.89,即使基线模型获得了真实2D框,绝对改进也超过了前最佳模型15.51。它还以强大的鲁棒性在零样本情况下推广到未见过的类别。通过将3D检测转化为一个有纪律的下一个标记问题,LocateAnything3D为模型提供了一个感知3D的实用基础。
Summary / 总结
LocateAnything3D addresses the challenge of multi-object 3D detection in vision-language models by framing it as a next-token prediction task. It introduces a Chain-of-Sight (CoS) sequence that helps the model reason from 2D to 3D, predicting 3D boxes in a curriculum that starts with nearby objects and then moves to more distant ones. The method achieves state-of-the-art results on the Omni3D benchmark with 49.89 AP_3D, surpassing the previous best by 15.51 points, and generalizes zero-shot to held-out categories.
LocateAnything3D通过将多对象3D检测问题转化为下一个标记预测任务来解决视觉语言模型中的挑战。它引入了一个链路视线(CoS)序列,帮助模型从2D到3D进行推理,按照从近到远的顺序预测3D框。该方法在Omni3D基准测试中取得了最先进的结果,超越了之前的最佳成绩15.51 AP_3D点,并且在新类别上具有强大的零样本泛化能力。
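The ordering described in the LocateAnything3D abstract above (near-to-far across objects; center, then dimensions, then rotation within each object) can be illustrated with a small serializer. This is a rough sketch under stated assumptions: `Box3D`, `serialize_near_to_far`, the token strings, the single yaw angle, and the use of Euclidean distance from the camera origin are all hypothetical choices, and the 2D detection step that precedes the 3D boxes is omitted.

```python
import math
from typing import List, NamedTuple, Tuple


class Box3D(NamedTuple):
    label: str
    center: Tuple[float, float, float]  # (x, y, z) in camera coordinates
    dims: Tuple[float, float, float]    # box extents, order is an assumption
    yaw: float                          # rotation reduced to one angle for brevity


def serialize_near_to_far(boxes: List[Box3D]) -> List[str]:
    """Flatten boxes into a token list, nearest object first."""
    origin = (0.0, 0.0, 0.0)
    ordered = sorted(boxes, key=lambda b: math.dist(origin, b.center))
    tokens: List[str] = []
    for b in ordered:
        # Within each object: center first, then dimensions, then rotation.
        tokens.append(f"<{b.label}>")
        tokens.append("center=" + ",".join(f"{v:.2f}" for v in b.center))
        tokens.append("dims=" + ",".join(f"{v:.2f}" for v in b.dims))
        tokens.append(f"yaw={b.yaw:.2f}")
    return tokens


boxes = [
    Box3D("car", (0.0, 0.0, 10.0), (2.0, 1.5, 4.0), 0.1),
    Box3D("chair", (1.0, 0.0, 2.0), (0.5, 1.0, 0.5), 0.0),
]
tokens = serialize_near_to_far(boxes)
print(tokens)
```

In this toy input the chair is closer to the camera than the car, so its tokens are emitted first.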
Vision-Language Memory for Spatial Reasoning
Authors: Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang
First: 2025-11-25T18:59:02+00:00 · Latest: 2025-11-25T18:59:02+00:00
Abstract
Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
中文标题/摘要
标题:视觉-语言记忆在空间推理中的应用
空间推理是智能机器人的一项关键能力,但当前的视觉-语言模型(VLMs)在基于视频的空间推理方面仍无法达到人类水平的性能。这一差距主要源于两个挑战:语义-几何的不匹配,导致无法实现一致的三维理解,以及缺乏持久记忆来保留三维表示和理解。为了解决这些限制,我们提出了VLM$^2$,这是一种具有持久记忆的视觉-语言模型,用于从二维视频中获得一致的三维感知的空间推理。具体来说,为了增强长时推理,我们引入了一个双记忆模块,包括一个工作记忆,它作为一个滑动窗口来关注即时上下文,以及一个情景记忆,用于整合和存储关键的长期信息。这种设计使得空间推理可以在固定计算成本下高效且具有长时性。在多个基准上的广泛实验表明,VLM$^2$在仅视频模型中达到了最先进的性能,显著推进了视觉-空间智能的前沿。
Summary / 总结
The research aims to improve vision-language models (VLMs) for spatial reasoning in robots by addressing semantic-geometric misalignment and the lack of persistent memory. VLM$^2$ incorporates a dual-memory module, including a working memory and an episodic memory, to enable efficient long-horizon reasoning. Experiments demonstrate that VLM$^2$ outperforms existing video-only models and significantly advances visual-spatial intelligence capabilities.
研究旨在改进视觉-语言模型(VLMs)以提高机器人在空间推理方面的表现。为了解决当前模型的局限性,提出了VLM$^2$,该模型包含一个双记忆模块,用于持久记忆和视图一致的3D感知表示。该模型使用工作记忆处理即时上下文,并使用情景记忆存储长期信息。实验表明,VLM$^2$在视频-only模型中表现出色,显著推进了视觉-空间智能的边界。
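The dual-memory design in the VLM$^2$ abstract above (a sliding-window working memory plus an episodic memory, with fixed computational cost) can be sketched with two bounded containers. This is a toy under stated assumptions: the abstract does not say how information is consolidated, so the per-frame importance score and the rule "evict the least important entry when the episodic store is full" are hypothetical, as are the names `DualMemory`, `add`, and `context`.

```python
from collections import deque
from typing import Deque, List, Tuple

Entry = Tuple[float, int]  # (importance score, frame index); the score is hypothetical


class DualMemory:
    def __init__(self, window: int, capacity: int) -> None:
        self.window = window              # size of the sliding working memory
        self.capacity = capacity          # size of the episodic memory
        self.working: Deque[Entry] = deque()
        self.episodic: List[Entry] = []

    def add(self, frame_idx: int, importance: float) -> None:
        """Insert a frame; frames leaving the window are consolidated."""
        self.working.append((importance, frame_idx))
        if len(self.working) > self.window:
            self._consolidate(self.working.popleft())

    def _consolidate(self, entry: Entry) -> None:
        # Assumed policy: keep the highest-importance entries seen so far.
        self.episodic.append(entry)
        if len(self.episodic) > self.capacity:
            self.episodic.remove(min(self.episodic))

    def context(self) -> List[int]:
        """Frame indices visible to the model: long-term entries, then recent ones."""
        long_term = sorted(idx for _, idx in self.episodic)
        recent = [idx for _, idx in self.working]
        return long_term + recent


mem = DualMemory(window=2, capacity=2)
for i, score in enumerate([0.9, 0.1, 0.5, 0.8, 0.2, 0.3]):
    mem.add(i, score)
print(mem.context())
```

However long the video, `context()` never returns more than `window + capacity` entries, which is the sense in which the cost stays fixed in this sketch.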
Concept-Aware Batch Sampling Improves Language-Image Pretraining
Authors: Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge
First: 2025-11-25T18:58:07+00:00 · Latest: 2025-11-25T18:58:07+00:00
Comments: Tech Report
Abstract
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
中文标题/摘要
标题:概念感知批量采样提高语言-图像预训练
视觉-语言模型应该训练在什么数据上?为回答这个问题,许多数据整理努力集中在数据集的质量上。然而,这些现有方法大多(i)离线的,即从一组预定义的过滤标准生成静态数据集,(ii)概念无关的,即使用模型基础过滤器引入额外的数据偏差。在本文中,我们超越了这些离线、概念无关的方法,提倡更灵活、任务适应的概念基础在线整理。我们的第一个贡献是DataConcept,一个包含1.28亿个从网络抓取的图像-文本对的集合,这些对被细粒度地标记有关其概念组成的信息。基于DataConcept,我们引入了概念感知批量采样(CABS),这是一种简单而有效的批量采样框架,可以根据特定的目标分布实时构建批量。我们提出了两种变体:(i)多样性最大化(CABS-DM)以整理覆盖广泛可用概念的批量,(ii)频率最大化(CABS-FM)以整理具有高对象多重性的批量。通过在28个基准上的广泛评估,我们证明了我们的CABS方法显著提高了CLIP/SigLIP模型类,并产生了高性能的模型。总体而言,CABS代表了一个强大的开源替代品,用于专有的在线数据整理算法,使实践者能够定义优化特定下游任务的概念分布。
Summary / 总结
This work addresses the question of what data a vision-language model should be trained on. It first contributes DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, the authors introduce Concept-Aware Batch Sampling (CABS), which constructs batches on the fly according to specific target distributions. Two variants are proposed: Diversity Maximization (CABS-DM) for broad concept coverage and Frequency Maximization (CABS-FM) for high object multiplicity. Evaluations across 28 benchmarks show that CABS significantly benefits CLIP/SigLIP model classes.
该研究通过提出名为DataConcept的新方法,其中包括1.28亿带有详细概念标注的图像-文本对,来回答视觉-语言模型应训练于何种数据的问题。作者引入了概念感知批采样(CABS)框架,能够根据特定目标分布动态构建批次。提出了两种变体CABS-DM和CABS-FM,分别用于最大化概念多样性和频率。通过对28个基准的广泛评估,CABS显著提升了CLIP/SigLIP模型的性能,表明其在增强模型能力方面的有效性。
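The two CABS variants in the abstract above can be illustrated on a toy pool of concept-annotated samples. This is a hedged sketch, not the paper's algorithm: the greedy coverage rule standing in for Diversity Maximization, the "most annotated concepts first" rule standing in for Frequency Maximization, and the names `sample_diverse_batch` and `sample_dense_batch` are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def sample_diverse_batch(pool: Dict[str, Set[str]], batch_size: int) -> List[str]:
    """Greedy stand-in for CABS-DM: each pick adds the most not-yet-covered concepts."""
    covered: Set[str] = set()
    remaining = dict(pool)
    batch: List[str] = []
    while remaining and len(batch) < batch_size:
        # sorted() makes tie-breaking deterministic (alphabetical by sample id).
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        batch.append(best)
        covered |= remaining.pop(best)
    return batch


def sample_dense_batch(pool: Dict[str, Set[str]], batch_size: int) -> List[str]:
    """Stand-in for CABS-FM: prefer samples annotated with the most concepts."""
    ranked = sorted(pool, key=lambda sid: (-len(pool[sid]), sid))
    return ranked[:batch_size]


# Hypothetical pool: sample id -> set of annotated concepts.
pool = {
    "a": {"dog", "ball"},
    "b": {"dog"},
    "c": {"cat", "tree", "car"},
    "d": {"cat", "tree", "car", "bus"},
}
diverse = sample_diverse_batch(pool, batch_size=2)
dense = sample_dense_batch(pool, batch_size=2)
print(diverse, dense)
```

On this pool the two rules disagree: the diversity rule skips sample `c` because `d` already covers its concepts, while the density rule keeps it.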
Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition
Authors: Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, Min-Ling Zhang
First: 2025-11-25T18:57:28+00:00 · Latest: 2025-11-25T18:57:28+00:00
Abstract
Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
中文标题/摘要
标题:释放视觉语言模型在长尾多标签视觉识别中的潜力
长尾多标签视觉识别提出了重大挑战,因为图像通常包含多个具有高度不平衡类分布的标签,导致模型偏向于头部类而对尾部类表现不佳。最近的努力利用了预训练的视觉语言模型,如CLIP,结合长尾学习技术,利用丰富的视觉文本先验以提高性能。然而,现有方法通常直接从不平衡数据集中推导出语义类间关系,由于数据稀缺,这导致尾部类的不可靠相关性。此外,CLIP的零样本范式优化了单标签图像文本匹配,使其在多标签任务中表现不佳。为了解决这些问题,我们提出了相关性适应提示网络(CAPNET),这是一种新颖的端到端框架,明确从CLIP的文本编码器中建模标签相关性。该框架结合了图卷积网络进行标签感知传播,并使用可学习的软提示进行细化嵌入。它利用分布平衡的Focal损失和类感知重权进行优化训练。此外,它通过测试时集成提高泛化能力,并通过参数高效微调重新对齐视觉文本模态,以避免在不牺牲头部类性能的情况下过度拟合尾部类。在包括VOC-LT、COCO-LT和NUS-WIDE在内的基准测试上的广泛实验和消融研究表明,CAPNET在最先进的方法上取得了显著的改进,验证了其在实际长尾多标签视觉识别中的有效性。
Summary / 总结
The paper addresses the challenge of long-tailed multi-label visual recognition by proposing CAPNET, which uses a graph convolutional network and learnable soft prompts to model label correlations from CLIP's textual encoder. It employs a distribution-balanced Focal loss with class-aware re-weighting and test-time ensembling to improve performance under class imbalance. Experiments show that CAPNET outperforms existing methods on VOC-LT, COCO-LT, and NUS-WIDE benchmarks, demonstrating its effectiveness in real-world applications.
论文提出CAPNET,通过图卷积网络和可学习的软提示来从CLIP的文本编码器中建模标签间的关联。它使用分布平衡的Focal损失和类感知重权以及测试时的集成来提高性能。实验表明,CAPNET在VOC-LT、COCO-LT和NUS-WIDE等基准上的表现优于现有方法,证明了其在处理类别分布不平衡问题上的有效性。
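To make the loss term in the CAPNET abstract above more concrete, here is a plain class-weighted sigmoid focal loss for a single multi-label example. It is a simplified stand-in, not the distribution-balanced Focal loss the paper uses: the inverse-frequency weighting scheme, the default `gamma`, and both function names are assumptions.

```python
import math
from typing import List, Sequence


def inverse_frequency_weights(class_counts: Sequence[int]) -> List[float]:
    """Weights proportional to 1/count, normalised to mean 1 (an assumed scheme)."""
    inv = [1.0 / c for c in class_counts]
    mean = sum(inv) / len(inv)
    return [w / mean for w in inv]


def weighted_focal_loss(
    logits: Sequence[float],
    targets: Sequence[int],
    weights: Sequence[float],
    gamma: float = 2.0,
) -> float:
    """Mean over classes of -w * (1 - p_t)^gamma * log(p_t) for one example."""
    total = 0.0
    for z, y, w in zip(logits, targets, weights):
        p = 1.0 / (1.0 + math.exp(-z))         # sigmoid probability of the label
        p_t = p if y == 1 else 1.0 - p          # probability assigned to the true state
        p_t = max(p_t, 1e-12)                   # guard against log(0)
        total += -w * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(logits)


# Hypothetical counts: one head class (1000 images) and one tail class (10 images).
weights = inverse_frequency_weights([1000, 10])
loss = weighted_focal_loss(logits=[2.0, -1.0], targets=[1, 1], weights=weights)
print(weights, loss)
```

The `(1 - p_t)^gamma` factor shrinks the contribution of labels the model already predicts well, and the class weights raise the contribution of tail classes; with `gamma=0` the function reduces to weighted binary cross-entropy.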
Latent Collaboration in Multi-Agent Systems
Authors: Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
First: 2025-11-25T18:56:57+00:00 · Latest: 2025-11-25T18:56:57+00:00
Comments: Project: https://github.com/Gen-Verse/LatentMAS
Abstract
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
中文标题/摘要
标题:多智能体系统中的潜在协作
多智能体系统(MAS)将大型语言模型(LLMs)从独立的单模型推理扩展到协调的系统级智能。虽然现有的LLM智能体依赖于基于文本的中介进行推理和通信,但我们通过使模型能够在连续的潜在空间中直接协作,向前迈进了一步。我们引入了LatentMAS,这是一种端到端无需训练的框架,使LLM智能体之间能够纯粹地进行潜在协作。在LatentMAS中,每个智能体首先通过最后一层隐藏嵌入进行自回归潜在思维生成。共享的潜在工作记忆则保存并转移每个智能体的内部表示,确保无损信息交换。我们提供了理论分析,证明LatentMAS在表达能力和无损信息保存方面比传统的基于文本的MAS具有更高的效率和更低的复杂度。此外,在涵盖数学和科学推理、常识理解和代码生成的9个全面基准测试中,LatentMAS在所有基准测试中都优于强大的单模型和基于文本的MAS基线,准确率提高了14.6%,输出令牌使用量减少了70.8%-83.7%,端到端推理速度提高了4倍至4.3倍。这些结果表明,我们的新潜在协作框架在提高系统级推理质量的同时,还提供了显著的效率提升,而无需额外的训练。代码和数据已完全开源,可在https://github.com/Gen-Verse/LatentMAS获取。
Fighting AI with AI: Leveraging Foundation Models for Assuring AI-Enabled Safety-Critical Systems
Authors: Anastasia Mavridou, Divya Gopinath, Corina S. Păsăreanu
First: 2025-11-25T18:48:19+00:00 · Latest: 2025-11-25T18:48:19+00:00
Abstract
The integration of AI components, particularly Deep Neural Networks (DNNs), into safety-critical systems such as aerospace and autonomous vehicles presents fundamental challenges for assurance. The opacity of AI systems, combined with the semantic gap between high-level requirements and low-level network representations, creates barriers to traditional verification approaches. These AI-specific challenges are amplified by longstanding issues in Requirements Engineering, including ambiguity in natural language specifications and scalability bottlenecks in formalization. We propose an approach that leverages AI itself to address these challenges through two complementary components. REACT (Requirements Engineering with AI for Consistency and Testing) employs Large Language Models (LLMs) to bridge the gap between informal natural language requirements and formal specifications, enabling early verification and validation. SemaLens (Semantic Analysis of Visual Perception using large Multi-modal models) utilizes Vision Language Models (VLMs) to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. Together, these components provide a comprehensive pipeline from informal requirements to validated implementations.
中文标题/摘要
标题:用AI对抗AI:利用基础模型确保AI驱动的安全关键系统
将AI组件,尤其是深度神经网络(DNNs),集成到航空和自动驾驶车辆等安全关键系统中,提出了确保方面的根本性挑战。AI系统的不透明性与高层要求和低层网络表示之间的语义差距,阻碍了传统验证方法的应用。这些特定于AI的挑战被长期存在的需求工程问题放大,包括自然语言规范的模糊性和形式化中的可扩展性瓶颈。我们提出了一种方法,通过两种互补组件利用AI本身来应对这些挑战。REACT(需求工程中的AI一致性与测试)利用大型语言模型(LLMs)在非正式自然语言需求与形式规范之间架起桥梁,实现早期验证和验证。SemaLens(视觉感知的语义分析使用大型多模态模型)利用视觉语言模型(VLMs)使用人类可理解的概念来推理、测试和监控基于DNN的感知系统。这些组件共同提供了一个从非正式需求到验证实现的全面管道。
Summary / 总结
This paper addresses the challenges of assuring AI-enabled safety-critical systems by proposing an approach that leverages AI itself. REACT uses Large Language Models to translate informal natural language requirements into formal specifications, facilitating early verification and validation. SemaLens employs Vision Language Models to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. Together, the two components are intended to form a pipeline from informal requirements to validated implementations for systems such as aerospace and autonomous vehicles.
本文提出了一种通过自身AI来解决AI使能的安全系统保障挑战的方法。REACT 使用大型语言模型将非正式的要求转化为正式规范,促进早期验证。SemaLens 利用视觉语言模型通过人类可理解的概念来分析和监控基于DNN的感知系统。主要发现表明,这种方法有效地弥合了高层要求与低层网络表示之间的差距,增强了安全关键系统的保障。
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Authors: Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang
Venue: NeurIPS 2025
First: 2025-09-21T17:53:30+00:00 · Latest: 2025-11-25T17:49:27+00:00
Comments: Project homepage: https://flageval-baai.github.io/LRM-Eval/ This work will also be presented at NeurIPS 2025 Workshop on Foundations of Reasoning in Language Models (FoRLM); update with trials on Gemini 3 Pro
Abstract
We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
中文标题/摘要
标题:FlagEval 发现报告:对自动可验证文本和视觉问题上当前大型推理模型的初步评估
我们进行了一项中等规模的无污染评估,对当前的大型推理模型(LRMs)进行了初步发现。我们还发布了 ROME,这是一个用于视觉语言模型的评估基准,旨在测试从视觉线索中进行推理的能力。更多基准、评估数据和其他更新请访问:https://flageval-baai.github.io/LRM-Eval/
AutoFocus-IL: VLM-based Saliency Maps for Data-Efficient Visual Imitation Learning without Extra Human Annotations
Authors: Litian Gong, Fatemeh Bahrani, Yutai Zhou, Amin Banayeeanzade, Jiachen Li, Erdem Bıyık
First: 2025-11-23T21:21:10+00:00 · Latest: 2025-11-25T17:43:27+00:00
Comments: 8 pages, 6 figures. Code and datasets available at http://autofocus-il.github.io/
Abstract
AutoFocus-IL is a simple yet effective method to improve data efficiency and generalization in visual imitation learning by guiding policies to attend to task-relevant features rather than distractors and spurious correlations. Although saliency regularization has emerged as a promising way to achieve this, existing approaches typically require costly supervision such as human gaze data or manual saliency annotations. In contrast, AutoFocus-IL leverages vision-language models (VLMs) to automatically identify and track key objects in demonstrations, generating temporal saliency maps that highlight causal visual signals while suppressing distractors. These maps are then used to regularize behavior cloning policies, yielding stronger alignment between visual attention and task-relevant cues. Experiments in both the CARLA simulator and real-robot manipulation tasks demonstrate that AutoFocus-IL not only outperforms standard behavior cloning but also surpasses state-of-the-art baselines that assume privileged access to human supervision, such as gaze data. Code, datasets, and trained policy videos are available at https://AutoFocus-IL.github.io/.
中文标题/摘要
标题:AutoFocus-IL:基于VLM的数据高效视觉模仿学习无额外人工注释
AutoFocus-IL 是一种简单而有效的方法,通过引导策略关注任务相关特征而非干扰物和虚假相关性来提高视觉模仿学习的数据效率和泛化能力。尽管显著性正则化已被证明是一种有前途的方法,但现有方法通常需要昂贵的监督,如人类注视数据或手动显著性注释。相比之下,AutoFocus-IL 利用视觉语言模型(VLMs)自动识别和跟踪演示中的关键对象,生成突出因果视觉信号并抑制干扰物的时序显著性图。然后使用这些图来正则化行为克隆策略,从而在视觉注意力与任务相关线索之间实现更强的对齐。在CARLA模拟器和真实机器人操作任务中的实验表明,AutoFocus-IL 不仅优于标准的行为克隆,还超越了假设拥有特权人类监督(如注视数据)的最新基线方法。有关代码、数据集和训练策略视频,请访问 https://AutoFocus-IL.github.io/。
Summary / 总结
AutoFocus-IL is a method that enhances data efficiency and generalization in visual imitation learning by using vision-language models to generate saliency maps that guide policies to focus on task-relevant features. This approach avoids the need for costly human annotations or gaze data. Instead, it automatically identifies key objects in demonstrations, creating saliency maps that highlight relevant visual cues. Experiments show that AutoFocus-IL outperforms standard behavior cloning and state-of-the-art methods that rely on human supervision, demonstrating its effectiveness in both simulated and real-world tasks.
AutoFocus-IL 是一种通过引导策略关注任务相关特征来提高视觉模仿学习的数据效率和泛化能力的方法。它利用视觉语言模型自动生成突出关键对象并抑制干扰物的注意图,而无需昂贵的人类注释。实验表明,AutoFocus-IL 在模拟和真实世界任务中均优于标准的行为克隆以及假设拥有人类监督特权的先进基线方法。
Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models
Authors: Shamima Hossain
Venue: ICML poster
First: 2025-11-25T17:34:32+00:00 · Latest: 2025-11-25T17:34:32+00:00
Comments: Accepted as poster at NewInML Workshop ICML, 2025
Abstract
Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs due to a lack of robust reasoning capabilities. While extensive research has been conducted on integrating external knowledge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seamlessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification and using an image-captioning task to illustrate the framework. Our approach enables systematic reasoning across multiple steps, including visual entity recognition, knowledge graph traversal, and fact-based caption refinement. We evaluate the framework using hierarchical, triple-based, and bullet-point-based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference. Empirical results show that our approach improves factual accuracy by approximately 31% in preliminary experiments on a curated dataset mixing Google Landmarks v2, Conceptual Captions, and COCO Captions, revealing key insights into reasoning patterns and failure modes. This work demonstrates the potential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgeable multimodal systems.
中文标题/摘要
标题:超越生成:视觉语言模型中的多跳推理以提高事实准确性
视觉语言模型(VLMs)是强大的生成工具,但由于缺乏稳健的推理能力,常常产生事实不准确的输出。尽管在大型语言模型(LLMs)中整合外部知识进行推理的研究已经非常广泛,但在VLMs中这样的努力仍然很少见,因为需要无缝地跨越多种模态。这项工作引入了一个知识引导推理的框架,利用结构化的知识图谱进行多跳验证,并通过图像配对任务来说明我们的框架。我们的方法使系统性推理跨越多个步骤成为可能,包括视觉实体识别、知识图谱遍历和基于事实的配对优化。我们使用分层、三元组和项目符号知识表示来评估框架,分析它们在事实准确性和逻辑推理方面的有效性。实验证明,我们的方法在初步实验中将事实准确性提高了约31%,这些实验基于Google Landmarks v2、Conceptual Captions和Coco Captions的混合数据集,揭示了推理模式和失败模式的关键见解。这项工作展示了整合外部知识以提高VLMs推理能力的潜力,为更可靠和知识丰富的多模态系统铺平了道路。
Summary / 总结
This work addresses the issue of factual inaccuracy in Visual Language Models (VLMs) by introducing a knowledge-guided reasoning framework. The method leverages structured knowledge graphs for multi-hop verification through an image-captioning task, enabling systematic reasoning across multiple steps. In preliminary experiments on a curated dataset, the approach improves factual accuracy by approximately 31%, highlighting its potential for advancing reasoning in VLMs and improving the reliability of multimodal systems.
该研究通过引入知识引导的推理框架解决了视觉语言模型(VLMs)事实不准确的问题。方法利用结构化的知识图谱进行多跳验证,应用于图像-描述任务。该方法增强了跨多个步骤的系统推理,包括视觉实体识别、知识图谱遍历和基于事实的描述精炼。实验证明,在一个精心策划的数据集上,事实准确性的提高达到了31%,突显了将外部知识集成到VLMs推理中的潜力。
ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Authors: Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu
Venue: WACV 2026
First: 2025-03-18T16:55:07+00:00 · Latest: 2025-11-25T17:20:44+00:00
Comments: Accepted at WACV 2026
Abstract
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, so they must rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
中文标题/摘要
标题:ExDDV:视频可解释深度伪造检测的新数据集
生成的视频越来越逼真,使得人类越来越难以识别深度伪造内容,他们不得不越来越多地依赖自动深度伪造检测器。然而,这些检测器也容易出错,其决策不可解释,使人类容易受到基于深度伪造的欺诈和错误信息的影响。为此,我们介绍了ExDDV,这是第一个用于视频可解释深度伪造检测的数据集和基准。ExDDV包含约5400个真实和深度伪造视频,并且手动标注了文本描述(解释伪影)和点击(指出伪影)。我们对ExDDV上的多个视觉-语言模型进行了评估,使用了各种微调和上下文学习策略进行了实验。我们的结果显示,文本和点击监督都对于开发针对深度伪造视频的稳健可解释模型是必要的,这些模型能够定位并描述观察到的伪影。我们的新型数据集和可复现结果的代码可在https://github.com/vladhondru25/ExDDV获取。
Summary / 总结
ExDDV is a new dataset for explainable deepfake detection in video, comprising around 5.4K real and deepfake videos annotated with text descriptions and clicks to highlight artifacts. The study evaluates vision-language models with different fine-tuning and in-context learning strategies, demonstrating that both text and click supervision are necessary for developing robust explainable models that can localize and describe deepfake artifacts. The dataset and code are available at https://github.com/vladhondru25/ExDDV.
ExDDV 是一个用于视频中可解释的深伪检测的新数据集,包含5,400个真实和深伪视频,并进行了文本描述和点击标注以突出显示伪影。研究使用不同的微调和上下文学习策略评估了视觉-语言模型,结果显示,文本和点击监督对于开发能够定位和描述深伪伪影的稳健可解释模型都是必要的。数据集和代码可在 https://github.com/vladhondru25/ExDDV 获取。
Adam Simplified: Bias Correction Simplified
Authors: Sam Laing, Antonio Orvieto
First: 2025-11-25T17:20:40+00:00 · Latest: 2025-11-25T17:20:40+00:00
Abstract
The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused investigation into the role of bias-correction, a feature whose contribution remains poorly understood. Through a series of systematic ablations on vision and language modelling tasks, we demonstrate that the conventional wisdom surrounding bias correction is misleading. In particular, we demonstrate that in the optimal hyper-parameter configuration, the inclusion of bias correction leads to no improvement in final test performance. Moreover, unless appropriate learning rate scheduling is implemented, the inclusion of bias correction can sometimes be detrimental to performance. We further reinterpret bias correction as a form of implicit learning rate scheduling whose behaviour is strongly dependent on the choice of smoothing hyper-parameters $β_1, β_2 \in [0,1)$. Our findings challenge the universal inclusion of this component.
Summary / 总结
This paper investigates the role of bias-correction in the Adam optimizer, a key component of modern deep learning. Through systematic ablations on vision and language modeling tasks, the authors show that bias correction does not improve final test performance in the optimal hyper-parameter configuration and can even harm performance if appropriate learning rate scheduling is not used. The study reinterprets bias correction as a form of implicit learning rate scheduling, highlighting its dependence on smoothing hyper-parameters. These findings suggest that bias correction may not be universally beneficial and its inclusion should be reconsidered.
该论文研究了Adam优化器中偏差校正的作用,这是现代深度学习的关键组件。通过在视觉和语言建模任务上的系统消融实验,作者表明,在最优超参数配置下,偏差校正不会提升测试性能,甚至在没有适当的学习率调度时还会损害性能。研究还将偏差校正重新解释为一种隐式的学习率调度形式,其行为强烈依赖于平滑超参数$β_1, β_2 \in [0,1)$的选择。这些发现表明,偏差校正并非普遍有益,应根据具体任务需求谨慎考虑。
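The implicit-schedule reading of bias correction can be made concrete with the standard Adam update rule; this is textbook algebra, not code from the paper. Ignoring the epsilon term, dividing the first moment by 1 - β1^t and the second moment by 1 - β2^t is equivalent to multiplying the base learning rate by sqrt(1 - β2^t) / (1 - β1^t). The sketch below prints that multiplier over training steps. The function name and the defaults β1 = 0.9, β2 = 0.999 are assumptions chosen for illustration.

```python
import math


def bias_correction_factor(t: int, beta1: float = 0.9, beta2: float = 0.999) -> float:
    """Learning-rate multiplier implied by Adam's bias correction at step t >= 1.

    Derived from the standard update m_hat / sqrt(v_hat) with the epsilon
    term ignored. The default betas are common choices, used here only as
    an illustrative assumption.
    """
    return math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)


if __name__ == "__main__":
    for step in (1, 10, 100, 1000, 10000):
        print(step, round(bias_correction_factor(step), 4))
```

With these illustrative defaults the multiplier stays well below 1 for the first several hundred steps and then approaches 1, which is why it behaves like a warmup-style schedule whose shape depends on β1 and β2.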
NVIDIA Nemotron Parse 1.1
Authors: Kateryna Chumachenko, Amala Sanjay Deshmukh, Jarno Seppanen, Ilia Karmanov, Chia-Chih Chen, Lukas Voegtle, Philipp Fischer, Marek Wawrzos, Saeid Motiian, Roman Ageev, Kedi Wu, Alexandre Milesi, Maryam Moosaei, Krzysztof Pawelec, Padmavathy Subramanian, Mehrzad Samadi, Xin Yu, Celina Dear, Sarah Stoddard, Jenna Diamond, Jesse Oliver, Leanna Chraghchian, Patrick Skelly, Tom Balough, Yao Xu, Jane Polak Scowcroft, Daniel Korzekwa, Darragh Hanley, Sandip Bhaskar, Timo Roman, Karan Sapra, Andrew Tao, Bryan Catanzaro
First: 2025-11-25T16:41:25+00:00 · Latest: 2025-11-25T16:41:25+00:00
Abstract
We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks making it a strong lightweight OCR solution. We release the model weights publicly on Huggingface, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.
中文标题/摘要
标题:NVIDIA Nemotron Parse 1.1
我们介绍了Nemotron-Parse-1.1,这是一个轻量级的文档解析和OCR模型,其功能比前一代Nemoretriever-Parse-1.0有所提升。Nemotron-Parse-1.1在通用OCR、Markdown格式化、结构化表格解析和从图片、图表和图表中提取文本方面表现出改进的能力。它还支持更长的输出序列长度,以处理视觉密集型文档。与前一代产品一样,它提取文本段的边界框以及相应的语义类别。Nemotron-Parse-1.1采用编码器-解码器架构,包含885M参数,其中包括一个紧凑的256M参数语言解码器。它在公共基准测试中取得了竞争力的准确性,使其成为强大的轻量级OCR解决方案。我们已在Huggingface上公开发布了模型权重,以及优化的NIM容器和部分训练数据,作为更广泛的Nemotron-VLM-v2数据集的一部分。此外,我们还发布了Nemotron-Parse-1.1-TC,该版本在视觉标记长度减少的情况下,提供了20%的速度提升,同时质量略有下降。
Summary / 总结
NVIDIA Nemotron-Parse-1.1 is a lightweight document parsing and OCR model that improves on its predecessor, Nemoretriever-Parse-1.0, in general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It supports longer output sequences for visually dense documents and uses an 885M-parameter encoder-decoder architecture that includes a compact 256M-parameter language decoder. The model achieves competitive accuracy on public benchmarks, and its weights are publicly released together with an optimized NIM container and a subset of the training data. Nemotron-Parse-1.1-TC offers a 20% speed improvement with minimal quality loss by reducing vision token length.
NVIDIA 推出了 Nemotron-Parse-1.1,一种增强的文档解析和 OCR 模型,改进了 OCR 准确性、markdown 格式化、表格解析和从图片、图表和图表中提取文本的能力。该模型具有更长的输出序列长度以处理密集文档,并且具有紧凑的 256M 参数语言解码器。该模型在公共基准测试中表现出竞争性的准确性,并提供了模型权重、优化的 NIM 容器以及部分训练数据。Nemotron-Parse-1.1-TC 通过减少视觉标记长度实现了 20% 的速度提升,同时保持了较低的质量损失。
When to Think and When to Look: Uncertainty-Guided Lookback
Authors: Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yolo Y. Tang, Luchuan Song, Susan Liang, Zhongfei Mark Zhang, Jason J. Corso, Chenliang Xu
First: 2025-11-19T17:01:02+00:00 · Latest: 2025-11-25T16:38:28+00:00
Abstract
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.
中文标题/摘要
标题:何时思考何时查看:基于不确定性回顾
测试时的思考(即生成明确的中间推理链)已被证明能提升大型语言模型的性能,并且最近在大型视觉语言模型(LVLM)中也显示出强大的增益。然而,尽管取得了这些有希望的结果,仍然没有系统分析思考如何影响视觉推理。我们提供了首个此类分析,通过大规模、受控的比较,评估了来自InternVL3.5和Qwen3-VL家族的十个变体在MMMU-val上的表现,使用宽松的令牌预算和多轮解码。我们展示了更多的思考并不总是更好的;长链往往导致错误的轨迹,忽视了图像,并且表现不如标准指令模式运行的相同模型。更深入的分析表明,某些短回顾短语,明确地回溯到图像,强烈地丰富了成功的轨迹,并与更好的视觉定位相关。基于这一洞察,我们提出了基于不确定性回顾的解码策略,该策略结合了不确定性信号、自适应回顾提示和广度搜索。我们的方法在整体MMMU性能上有所提升,在标准思考较弱的类别中带来了最大的增益,并优于几个强大的解码基线,设定了固定模型家族和令牌预算下的新最佳性能。我们进一步展示了该解码策略的泛化能力,在五个额外的基准上产生了一致的改进,包括两个广泛的多模态套件和数学聚焦的视觉推理数据集。
Summary / 总结
This study investigates the impact of test-time thinking on visual reasoning in large vision language models (LVLMs). By comparing ten variants from InternVL3.5 and Qwen3-VL families, the research finds that more thinking is not always beneficial, as long chains often lead to incorrect reasoning. The authors propose an uncertainty-guided lookback strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search, which improves overall performance and outperforms several strong baselines, setting a new state-of-the-art on fixed model families and token budgets. This method also generalizes well, showing consistent improvements across five additional benchmarks.
研究探讨了测试时思考对大型视觉语言模型(LVLMs)视觉推理的影响。比较了InternVL3.5和Qwen3-VL模型的十个变体,发现过度思考可能导致错误的推理。研究引入了不确定性引导回溯,这一解码策略增强了视觉定位,并在标准思考较弱的类别中表现出色,固定模型和令牌预算下达到新的最佳性能。
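The idea of combining an uncertainty signal with a lookback prompt can be sketched as follows. The abstract does not say which uncertainty signal, threshold, or phrasing the authors use, so everything here is an assumption: next-token entropy stands in for the uncertainty signal, `threshold` is an arbitrary value, and `LOOKBACK_PROMPT` is a hypothetical phrase. The breadth search component is not shown.

```python
import math
from typing import Sequence

# Hypothetical lookback phrase; the paper's actual phrases are not given in the abstract.
LOOKBACK_PROMPT = "Let me look at the image again."


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_look_back(probs: Sequence[float], threshold: float = 1.0) -> bool:
    """Return True when the model looks uncertain (assumed rule: entropy above threshold)."""
    return token_entropy(probs) > threshold


def maybe_inject_lookback(text_so_far: str, probs: Sequence[float]) -> str:
    """Append the lookback phrase to the reasoning chain only when uncertainty is high."""
    if should_look_back(probs):
        return text_so_far + " " + LOOKBACK_PROMPT
    return text_so_far
```

A decoder loop would call `maybe_inject_lookback` at chosen points in the chain, so confident stretches of reasoning are left untouched and only uncertain ones are steered back to the image.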
Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search
Authors: Yunqi Zhou, Chengjie Jiang, Chun Yuan, Jing Li
First: 2025-11-25T16:25:54+00:00 · Latest: 2025-11-25T16:25:54+00:00
Comments: 17 pages, 8 figures
Abstract
With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model to look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples 'where to look' from 'how to answer' for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20% to 44% in speed.
中文标题/摘要
标题:看哪里都重要:基于自适应缩放搜索的无需训练超高分辨率遥感VQA
随着卫星星座、传感器技术和成像流水线的进步,超高分辨率(Ultra-HR)遥感图像变得越来越普遍。然而,当前的遥感基础模型对这种输入并不适合:全图像编码会耗尽标记和内存预算,而基于重采样的预处理会丢失关键的细节信息。在此背景下,指导模型在预测前看哪里都重要变得至关重要。因此,我们提出了ZoomSearch,这是一种无需训练、即插即用的管道,将“看哪里”与“如何回答”解耦,适用于超高分辨率遥感视觉问答(RS-VQA)。ZoomSearch 结合了自适应多分支缩放搜索,该方法在图像块上进行分层搜索以定位查询相关区域,以及布局感知块重组,将选定的块重新组织成紧凑且布局忠实的画布。我们在超高分辨率RS-VQA基准MME-RealWorld-RS和LRS-VQA上进行了全面实验,与(i)强大的通用基础模型,(ii)遥感基础模型,(iii)超高分辨率RS-VQA方法,以及(iv)基于搜索的视觉问答方法进行了比较。当与LLaVA-ov结合使用时,ZoomSearch在各种任务中达到了最先进的准确性,在LRS-VQA上提高了LLaVA-ov基线26.3%,在MME-RealWorld-RS上提高了114.8%。同时,它实现了更高的推理效率,在速度上比之前的基于搜索的方法快20%~44%。
Summary / 总结
The paper addresses the challenge of processing ultra-high-resolution remote sensing imagery for visual question answering (RS-VQA) by introducing ZoomSearch, a training-free method that decouples 'where to look' from 'how to answer'. It uses Adaptive Multi-Branch Zoom Search to search for query-relevant regions and Layout-Aware Patch Reassembly to reorganize selected patches. Experiments on Ultra-HR RS-VQA benchmarks show that ZoomSearch outperforms existing methods, improving the LLaVA-ov baseline by 26.3% and 114.8% on LRS-VQA and MME-RealWorld-RS respectively, while maintaining higher inference efficiency.
研究针对处理超高清遥感图像的视觉问答(RS-VQA)模型面临的挑战,提出了一种无需训练的方法ZoomSearch,将“看哪里”与“如何回答”分离。通过使用自适应多分支缩放搜索聚焦于相关图像区域,并使用布局感知的块重组重新组织这些区域,ZoomSearch提升了RS-VQA模型的准确性。实验表明,ZoomSearch在超高清RS-VQA基准测试中优于现有方法,将基线准确性提高多达114.8%,同时实现了更高的推理效率。
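A hierarchical search over image patches, as described for ZoomSearch, can be illustrated with a simple quadtree-style sketch. This is not the authors' Adaptive Multi-Branch Zoom Search: the fixed `branches` count, the 2x2 split, the `min_size` stopping rule, and the caller-supplied `score` function (standing in for a query-relevance model) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def split(box: Box) -> List[Box]:
    """Split a box into four quadrants."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my), (x0, my, mx, y1), (mx, my, x1, y1)]


def zoom_search(root: Box, score: Callable[[Box], float],
                branches: int = 2, min_size: int = 256) -> List[Box]:
    """Descend into the highest-scoring quadrants until patches are small enough.

    `score` is a hypothetical relevance function (higher means more relevant
    to the question); a real system would query a vision-language model here.
    """
    frontier, leaves = [root], []
    while frontier:
        box = frontier.pop()
        if min(box[2] - box[0], box[3] - box[1]) <= min_size:
            leaves.append(box)  # small enough: keep as a selected patch
            continue
        children = sorted(split(box), key=score, reverse=True)
        frontier.extend(children[:branches])  # follow only the best branches
    return leaves
```

The selected leaf patches would then be handed to a reassembly step and to the answering model, so the model never has to encode the full-resolution image.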
Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks
Authors: Shreevanth Krishnaa Gopalakrishnan, Stephen Hailes
First: 2025-11-25T16:24:29+00:00 · Latest: 2025-11-25T16:24:29+00:00
Comments: 19 pages, 8 figures, 7 tables
Abstract
Machine learning has become integral to Channel State Information (CSI)-based human sensing systems and is expected to power applications such as device-free activity recognition and identity detection in future cellular and Wi-Fi generations. However, these systems rely on models whose decisions can be subtly perturbed, raising concerns for security and reliability in ubiquitous sensing. Quantifying and understanding the robustness of such models, defined as their ability to maintain accurate predictions under adversarial perturbations, is therefore critical before wireless sensing can be safely deployed in real-world environments. This work presents a systematic evaluation of the robustness of CSI deep learning models under diverse threat models (white-box, black-box/transfer, and universal perturbations) and varying degrees of attack realism. We establish a framework to compare compact temporal autoencoder models with larger deep architectures across three public datasets, quantifying how model scale, training regime, and physical constraints influence robustness. Our experiments show that smaller models, while efficient and equally performant on clean data, are markedly less robust. We further confirm that physically realizable signal-space perturbations, designed to be feasible in real wireless channels, significantly reduce attack success compared to unconstrained feature-space attacks. Adversarial training mitigates these vulnerabilities, improving mean robust accuracy with only moderate degradation in clean performance across both model classes. As wireless sensing advances towards reliable, cross-domain operation, these findings provide quantitative baselines for robustness estimation and inform design principles for secure and trustworthy human-centered sensing systems.
中文标题/摘要
标题:迈向可信赖的Wi-Fi传感:深度学习模型对抗攻击鲁棒性系统评估
机器学习已成为基于信道状态信息(CSI)的人体传感系统的核心,并有望在未来蜂窝和Wi-Fi时代推动诸如无设备活动识别和身份检测等应用。然而,这些系统依赖于其决策可能被微妙扰动的模型,这引发了在广泛传感中安全性和可靠性的担忧。因此,在无线传感可以安全部署到现实环境之前,量化和理解这些模型的鲁棒性(即在对抗扰动下保持准确预测的能力)是至关重要的。 本研究对CSI深度学习模型在不同威胁模型(白盒、黑盒/转移和通用扰动)下的鲁棒性进行了系统评估,并在不同攻击现实程度下进行了评估。我们建立了一个框架,比较了紧凑的时域自编码模型与更大规模的深度架构在三个公开数据集上的表现,量化了模型规模、训练方式和物理约束如何影响鲁棒性。我们的实验表明,虽然较小的模型在干净数据上高效且性能相当,但在鲁棒性方面明显较差。我们进一步证实,设计为在实际无线信道中可行的物理可实现信号空间扰动,与不受约束的特征空间攻击相比,显著降低了攻击成功率。对抗训练减轻了这些漏洞,提高了两种模型类别的平均鲁棒准确率,同时仅在干净性能上略有退化。随着无线传感向可靠、跨域操作发展,这些发现为鲁棒性估计提供了定量基准,并为安全和可信赖的人体中心传感系统设计原则提供了指导。
Summary / 总结
This study evaluates the robustness of deep learning models used in CSI-based Wi-Fi human sensing systems against adversarial attacks. It assesses the impact of model scale, training regime, and physical constraints on robustness under white-box, black-box/transfer, and universal perturbation threat models across three public datasets. The research finds that smaller models, though efficient and equally performant on clean data, are markedly less robust than larger models. Constraining perturbations to be physically realizable in signal space significantly reduces attack success compared with unconstrained feature-space attacks. Adversarial training improves mean robust accuracy with only moderate degradation in clean performance.
该研究评估了用于Wi-Fi人体感知系统的深度学习模型在对抗攻击下的鲁棒性。研究比较了紧凑的时域自编码模型与更大架构在三个数据集上的表现,发现较小的模型对扰动的鲁棒性较差。研究还发现,物理上可实现的信号空间扰动比无约束攻击更有效,而对抗训练可以在最小影响干净性能的情况下提高鲁棒性。
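To make the white-box threat model concrete, the sketch below applies the Fast Gradient Sign Method (FGSM), one standard white-box attack, to a tiny logistic-regression classifier. The abstract does not state which attacks the paper uses, so FGSM, the toy model, and all weights and inputs are assumptions for illustration only. This is an unconstrained feature-space attack and does not model the physical realizability constraints discussed in the paper.

```python
import math
from typing import List


def predict_prob(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One FGSM step: move x by eps in the direction that increases the loss.

    For binary cross-entropy with a logistic model, the gradient of the loss
    with respect to the input is (p - y) * w.
    """
    p = predict_prob(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]
```

For example, with toy weights `w = [2.0, -1.0]`, `b = 0.0`, input `x = [1.0, 1.0]`, and true label 1, a perturbation budget of `eps = 0.5` is enough to push the predicted probability below 0.5.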
Object-Centric Vision Token Pruning for Vision Language Models
Authors: Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen
First: 2025-11-25T16:12:32+00:00 · Latest: 2025-11-25T16:12:32+00:00
Abstract
In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.
中文标题/摘要
标题:面向对象的视觉语言模型视觉标记剪枝
在视觉语言模型(VLMs)中,视觉标记的数量庞大但信息分散,与语言标记相比消耗过多不必要的计算。为了提高VLM推理效率,剪枝冗余的视觉标记一直受到研究,但现有方法都依赖于间接且无法保证的方式。我们提出了一种直接且可保证的方法OC-VTP,用于选择最具代表性的视觉标记,以实现高效且保持准确性的VLM推理。我们的OC-VTP仅需要对一个小的对象中心视觉标记剪枝器进行轻量级预训练,然后可以插入到现有的VLM中,无需对任何模型进行任何数据集的微调。通过最小化从选定标记重建原始未剪枝标记的误差,可以保证保留最具代表性的视觉标记。无论任何视觉剪枝比例,即推理效率,我们的OC-VTP都能帮助主流VLM保持最高的推理准确性。我们的剪枝方法还展示了有趣的可解释性。我们的代码可在https://github.com/GarryLarry010131/OC-VTP获取。
Summary / 总结
The research aims to improve the efficiency of Vision Language Models (VLMs) by pruning redundant vision tokens, which are numerous but informationally dispersed. The proposed method, OC-VTP, directly selects the most representative vision tokens using a small object-centric pruner that needs only light-weight pre-training and can be inserted into existing VLMs without fine-tuning. The retained tokens are chosen by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across pruning ratios, the method consistently helps mainstream VLMs preserve inference accuracy, and the pruning shows interesting interpretability properties.
研究旨在通过修剪冗余的视觉标记来提高视觉语言模型(VLMs)的效率,这些视觉标记数量众多但信息分散。提出的OC-VTP方法通过轻量级预训练一个小的对象中心标记修剪器直接选择最具代表性的视觉标记,无需任何微调。该方法确保修剪后的标记在不同修剪比例下保持最高的推理准确性,并展示了有趣可解释性特征。
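The selection criterion, keeping the tokens from which the full set can best be reconstructed, can be illustrated with a deliberately simplified stand-in. OC-VTP uses a learned object-centric pruner; the sketch below instead uses a greedy search with a nearest-kept-token "reconstruction", which is an assumption made only to show the objective, not the authors' method. All function names are hypothetical.

```python
from typing import List, Sequence

Token = Sequence[float]


def sq_dist(a: Token, b: Token) -> float:
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_error(tokens: List[Token], kept: List[int]) -> float:
    """Error when every token is approximated by its nearest kept token."""
    return sum(min(sq_dist(t, tokens[k]) for k in kept) for t in tokens)


def greedy_select(tokens: List[Token], budget: int) -> List[int]:
    """Greedily add the token index that most reduces the reconstruction error."""
    kept: List[int] = []
    for _ in range(min(budget, len(tokens))):
        candidates = [i for i in range(len(tokens)) if i not in kept]
        best = min(candidates, key=lambda i: reconstruction_error(tokens, kept + [i]))
        kept.append(best)
    return kept
```

On tokens that form a few tight clusters, this picks roughly one representative per cluster, which matches the intuition that redundant vision tokens can be dropped with little reconstruction loss.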
Block Cascading: Training Free Acceleration of Block-Causal Video Models
Authors: Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song, Varun Jampani
First: 2025-11-25T15:52:58+00:00 · Latest: 2025-11-25T15:52:58+00:00
Abstract
Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
中文标题/摘要
标题:块级递进:无需训练的块因视频模型加速
块因视频生成面临速度与质量的严峻权衡:小规模1.3B模型只能达到16 FPS,而大规模14B模型则仅能以4.5 FPS运行,迫使用户在响应性和质量之间做出选择。块级递进显著通过无训练并行化缓解了这一权衡。我们的关键洞察:未来视频块不需要完全去噪的当前块即可开始生成。通过使用来自前驱的半去噪上下文启动块生成,我们将顺序管道转换为并行级联,在这里多个块可以同时去噪。使用5块GPU利用时间并行性,我们在所有模型规模上实现了约2倍的加速:1.3B模型从16 FPS加速到30 FPS,14B模型从4.5 FPS加速到12.5 FPS。除了推理速度之外,块级递进还消除了上下文切换期间从KV缓存中恢复的开销(约200ms)。针对多个块因管道的广泛评估表明,在从块因管道切换到块级递进管道进行推理时,生成质量没有显著损失。项目页面:https://hmrishavbandy.github.io/block_cascading_page/
Summary / 总结
Block Cascading addresses the speed-quality trade-off in block-causal video generation through training-free parallelization. The method starts generating each block from the partially denoised context of its predecessors, transforming sequential pipelines into parallel cascades in which multiple blocks denoise simultaneously. With 5 GPUs, it achieves roughly 2x acceleration, raising throughput from 16 to 30 FPS for 1.3B models and from 4.5 to 12.5 FPS for 14B models. It also eliminates the KV-recaching overhead of about 200 ms during context switches in interactive generation, with no significant loss in generation quality.
Block Cascading通过启用并行块生成来解决块因果视频生成中的速度与质量权衡问题,从前一个块的半去噪上下文中开始生成。该方法使1.3B和14B模型的FPS分别从16提升到30和从4.5提升到12.5,而不会影响生成质量。此外,它还减少了上下文切换时的开销,使其适用于交互式生成场景。
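The "~2x" figure can be checked directly from the frame rates reported in the abstract. The short sketch below only does that arithmetic; the helper name `speedup` is an assumption, and the numbers are the reported ones, not new measurements.

```python
def speedup(fps_before: float, fps_after: float) -> float:
    """Throughput ratio between the accelerated and baseline pipelines."""
    return fps_after / fps_before


# Frame rates as reported in the abstract (baseline FPS, Block Cascading FPS with 5 GPUs).
reported = {"1.3B": (16.0, 30.0), "14B": (4.5, 12.5)}

if __name__ == "__main__":
    for name, (before, after) in reported.items():
        print(f"{name}: {speedup(before, after):.2f}x")
```

The ratios come out to roughly 1.9x for the 1.3B model and roughly 2.8x for the 14B model, so the larger model benefits more than the headline "~2x" suggests.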
VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning
Authors: Bo Pang, Chenxi Xu, Jierui Ren, Guoping Wang, Sheng Li
First: 2025-11-25T15:48:49+00:00 · Latest: 2025-11-25T15:48:49+00:00
Abstract
Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object's geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry -> physical attributes -> modal parameters -> acoustic signals. Each 3D model has explicit physical properties (density, Young's modulus, Poisson's ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object's physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
中文标题/摘要
标题:VibraVerse:一个大规模几何-声学对齐数据集,用于物理一致的多模态学习
理解物理世界需要基于物理定律的感知模型,而不是仅仅依赖统计相关性。然而,现有的多模态学习框架,专注于视觉和语言,缺乏物理一致性,并且忽视了物体几何、材料、振动模式及其产生的声音之间的内在因果关系。我们引入了VibraVerse,一个大规模的几何-声学对齐数据集,明确地将因果链从3D几何 -> 物理属性 -> 模态参数 -> 声音信号联系起来。每个3D模型具有明确的物理属性(密度、杨氏模量、泊松比)和体素几何,从中可以计算出在受控激励下的冲击声合成的模态固有频率和特征向量。为了建立这种一致性,我们引入了CLASP,一种用于跨模态对齐的对比学习框架,它保留了物体物理结构与其声学响应之间的因果对应关系。该框架在各模态之间强制执行物理一致的对齐,确保每个样本都是连贯的,可追溯到支配方程,并嵌入到跨越形状、图像和声音的统一表示空间中。基于VibraVerse,我们定义了一组基准任务,用于几何到声音预测、声音引导的形状重建以及跨模态表示学习。在这些任务上的广泛验证表明,使用VibraVerse训练的模型在各模态上表现出更高的准确率、可解释性和泛化能力。这些结果确立了VibraVerse作为物理一致且因果可解释的多模态学习基准的地位,为声音引导的体感感知提供了基础,并加深了对物理世界的理解。该数据集将开源。
Summary / 总结
VibraVerse is a large-scale dataset that aligns 3D geometry with acoustic signals, bridging the causal chain from geometry to sound. It includes explicit physical properties and computed modal parameters for each 3D model. CLASP, a contrastive learning framework, ensures physically consistent cross-modal alignment. Models trained on VibraVerse show superior accuracy and generalization across modalities, validating its effectiveness for multimodal learning.
VibraVerse 是一个大规模的数据集,将 3D 几何与声学信号对齐,建立了从几何到声音的因果链。它包含显式的物理属性和体积几何,用于计算模态参数。CLASP 是一个对比学习框架,确保跨模态的一致对齐。基于 VibraVerse 训练的模型展示了更高的准确性和泛化能力,使其成为物理一致的多模态学习的基准。
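The causal chain from geometry and material to sound can be written out with standard linear modal analysis. The abstract does not give the paper's exact formulation, so the equations below are a textbook sketch under stated assumptions: K and M denote stiffness and mass matrices from a finite-element discretization of the volumetric geometry (they depend on density, Young's modulus, and Poisson's ratio), and the per-mode amplitudes a_i and damping rates d_i are illustrative symbols not named in the source.

```latex
% Generalized eigenproblem: modal shapes \phi_i and angular frequencies \omega_i
K \phi_i = \omega_i^{2} M \phi_i, \qquad f_i = \frac{\omega_i}{2\pi}

% Impact sound as a sum of damped sinusoids (damping-induced frequency shift ignored)
s(t) = \sum_{i} a_i \, e^{-d_i t} \sin\!\left(2\pi f_i t\right)
```

Here the eigenvectors determine how strongly an impact at a given surface point excites each mode, which is what ties the synthesized sound back to the object's shape and material.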
A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control
Authors: Jiawei Lin, Guanlong Jiao, Jianjin Xu
First: 2025-11-25T15:28:10+00:00 · Latest: 2025-11-25T15:28:10+00:00
Abstract
Multi-ID customization is an interesting topic in computer vision and has attracted considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability. The generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since existing single-ID customization models suffer less from the copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable to or even better than that of training-based multi-ID customization methods.
中文标题/摘要
标题:一种无需训练的多ID定制方法通过注意力调整和空间控制
多ID定制是计算机视觉中的一个有趣话题,最近引起了相当大的关注。给定多个个体的身份图像,其目的是生成一个无缝融合它们且保留各自身份的定制图像。与单ID定制相比,多ID定制要困难得多,并且面临两大挑战。首先,由于多ID定制模型在训练时是从裁剪的人体区域重建图像,因此在推理过程中常常遇到复制粘贴问题,导致生成的图像质量较低。其次,模型还遭受文本控制能力较差的问题。生成的结果只是简单地将多个人员合并到一张图像中,而不考虑输入文本的对齐情况。在本文中,我们提出了一种名为MultiID的方法,以无需训练的方式解决这一具有挑战性的任务。由于现有的单ID定制模型较少存在复制粘贴问题,我们的关键思想是将这些模型适应以实现多ID定制。为此,我们提出了一种ID解耦的交叉注意力机制,将不同的ID嵌入注入相应的图像区域,从而生成多ID输出。为了增强生成的可控性,我们引入了三种关键策略,即局部提示、深度引导的空间控制和扩展自我注意力,使结果更符合文本提示和ID图像。我们还精心构建了一个名为IDBench的基准用于评估。广泛的定性和定量结果表明,MultiID在解决上述两个挑战方面具有有效性。其性能与基于训练的多ID定制方法相当甚至更好。
Summary / 总结
This paper addresses two challenges of multi-ID customization in computer vision, the copy-paste issue and weak text controllability, by proposing MultiID, a training-free approach. MultiID adapts existing single-ID customization models through an ID-decoupled cross-attention mechanism that injects distinct ID embeddings into the corresponding image regions, and improves controllability with local prompts, depth-guided spatial control, and extended self-attention. Evaluated on the newly built IDBench benchmark, MultiID integrates multiple individuals into one image while preserving each identity and following the text prompt, performing comparably to or better than training-based methods.
论文针对计算机视觉中的多ID定制挑战,重点关注复制粘贴问题和文本可控性差。提出了一种无需训练的方法MultiID,通过ID解耦交叉注意力机制适应现有的单ID定制模型,并引入局部提示、深度引导的空间控制和扩展自我注意力来增强生成的可控性。该方法能够有效整合多个身份,同时与文本提示对齐,通过基准IDBench和与训练基线方法相比更优的表现得到了验证。
Target-aware Image Editing via Cycle-consistent Constraints
Authors: Yanghao Wang, Zhen Wang, Long Chen
First: 2025-10-23T04:58:29+00:00 · Latest: 2025-11-25T15:15:11+00:00
Abstract
Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an "intermediate state" and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
中文标题/摘要
标题:基于目标感知的循环一致约束图像编辑
预训练的文本到图像生成模型的最新进展使基于文本的图像编辑取得了显著进展。主流方法总是采用破坏-然后-恢复的范式,即源图像首先被破坏成一个“中间状态”,然后在提示引导下恢复为目标图像。然而,当前的方法以目标无关的方式构建这种中间状态,即它们主要关注实现源图像重建,而忽视了与特定编辑目标之间的语义差距。这种设计导致了在所需修改与源图像有较大偏离时,编辑效果有限或不一致。在本文中,我们主张中间状态应该是目标感知的,即有选择地破坏与编辑相关的内容,同时保留与编辑无关的内容。为此,我们提出了FlowCycle,这是一种新颖的无反演且基于生成的编辑框架,通过可学习的噪声参数化破坏,并通过循环一致的过程优化它们。通过迭代地将源图像编辑为目标图像并恢复回源图像,同时满足双重一致性约束,FlowCycle 学会生成目标感知的中间状态,从而实现忠实的修改并保持源图像的一致性。广泛的消融实验表明,FlowCycle 在编辑质量和一致性方面优于最先进的方法。
Summary / 总结
This paper addresses the limitations of current text-to-image editing methods by proposing FlowCycle, a novel framework that aims to produce a target-aware intermediate state. FlowCycle uses a cycle-consistent process to iteratively edit the source image to the target and recover it back, learning to selectively corrupt editing-relevant contents while preserving irrelevant ones. The method achieves superior editing quality and consistency compared to existing approaches.
本文提出FlowCycle框架,通过生成目标感知的中间状态来解决当前文本到图像编辑方法的局限性。FlowCycle使用循环一致的过程,迭代地将源图像编辑为目标并恢复回源图像,学习选择性地破坏与编辑相关的内容,同时保留无关的内容。这种方法在编辑质量和一致性方面优于现有方法。
CLIP-IT: CLIP-based Pairing for Histology Images Classification
Authors: Banafsheh Karimian, Giulia Avanzato, Soufian Belharbi, Alexis Guichemerre, Luke McCaffrey, Mohammadhadi Shateri, Eric Granger
First: 2025-04-22T18:14:43+00:00 · Latest: 2025-11-25T15:13:39+00:00
Abstract
Multimodal learning has shown promise in medical imaging, combining complementary modalities like images and text. Vision-language models (VLMs) capture rich diagnostic cues but often require large paired datasets and prompt- or text-based inference, limiting their practicality due to annotation cost, privacy, and compute demands. Crucially, available free unpaired external text, like pathology reports, can still provide complementary diagnostic cues if semantically relevant content is retrievable per image. To address this, we introduce CLIP-IT, a novel framework that relies on rich unpaired text reports. Specifically, CLIP-IT uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant unpaired textual report for each image in the downstream unimodal dataset. These reports, sourced from the same disease domain and tissue type, form pseudo-pairs that reflect shared clinical semantics rather than exact alignment. Knowledge from these texts is distilled into the vision model during training, while LoRA-based adaptation mitigates the semantic gap between unaligned modalities. At inference, only the vision model is used, keeping overhead low while still benefiting from multimodal training without requiring paired data in the downstream dataset. Experiments on histology image datasets confirm that CLIP-IT consistently improves classification accuracy over both unimodal and multimodal CLIP-based baselines in most cases, without the burden of per-dataset paired annotation or inference-time complexity.
中文标题/摘要
标题:CLIP-IT:基于CLIP的配对方法用于组织学图像分类
多模态学习在医学成像中显示出潜力,结合了如图像和文本等互补模态。视觉语言模型(VLMs)捕获丰富的诊断线索,但通常需要大型配对数据集和基于提示或文本的推理,由于注释成本、隐私和计算需求限制了其实用性。关键的是,可用的免费未配对的外部文本,如病理报告,如果能检索到与图像语义相关的相关内容,仍能提供补充的诊断线索。为了解决这个问题,我们引入了CLIP-IT,这是一种新颖的框架,依赖于丰富的未配对文本报告。具体来说,CLIP-IT 使用一个在另一个数据集的组织学图像-文本配对上预训练的CLIP模型,为下游单模态数据集中的每张图像检索最相关的未配对文本报告。这些报告来自相同的疾病领域和组织类型,形成伪配对,反映共享的临床语义而非精确对齐。这些文本中的知识在训练过程中被提炼到视觉模型中,而基于LoRA的适应性缓解了未对齐模态之间的语义差距。在推理时,仅使用视觉模型,保持开销低,同时仍能从多模态训练中受益,而无需在下游数据集中使用配对数据。在组织学图像数据集上的实验表明,CLIP-IT 在大多数情况下都能在单模态和多模态CLIP基线之上一致地提高分类准确性,而无需每个数据集的配对注释或推理时的复杂性。
Summary / 总结
CLIP-IT is a framework that leverages unpaired pathology reports to enhance histology image classification. It uses a CLIP model pre-trained on histology image-text pairs from a separate dataset to retrieve the most relevant textual report for each image, forming pseudo-pairs that reflect shared clinical semantics rather than exact alignment. During training, knowledge from these texts is distilled into the vision model, and LoRA-based adaptation helps bridge the semantic gap between unaligned modalities. At inference, only the vision model is used, keeping overhead low. Experiments show that CLIP-IT improves classification accuracy over unimodal and multimodal CLIP-based baselines in most cases, without the need for paired data in the downstream dataset or added inference-time complexity.
CLIP-IT 是一个框架,利用未配对的病理报告来增强组织切片图像分类。它使用预训练的 CLIP 模型为每张图像检索相关文本报告,形成反映共享临床语义的伪配对。在训练过程中,这些文本的知识被注入到视觉模型中,而基于 LoRA 的适应性调整有助于弥合未对齐模态之间的语义差距。在推理时,仅使用视觉模型,减少了开销。实验表明,CLIP-IT 在大多数情况下比单模态和多模态基线提高了分类准确性,而无需配对数据或复杂的推理过程。
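The retrieval step described for CLIP-IT (pick, for each image, the most relevant unpaired report) can be illustrated with a minimal sketch. It assumes image and report embeddings have already been produced by some CLIP-style encoder; the function names, the toy vectors, and the use of plain cosine similarity with top-1 selection are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_pseudo_pairs(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
) -> List[int]:
    """For each image embedding, return the index of the most similar report."""
    pairs = []
    for img in image_embs:
        scores = [cosine(img, txt) for txt in text_embs]
        pairs.append(max(range(len(scores)), key=scores.__getitem__))
    return pairs


# Toy embeddings (hypothetical values, standing in for encoder outputs).
image_embs = [[1.0, 0.1], [0.0, 2.0]]
text_embs = [[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]]
pairs = retrieve_pseudo_pairs(image_embs, text_embs)
print(pairs)  # index of the retrieved report for each image
```

The resulting image-report pseudo-pairs would then feed the distillation stage; that stage and the LoRA adaptation are not sketched here.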
Harnessing Vision-Language Models for Time Series Anomaly Detection
Authors: Zelin He, Sarah Alnegheimish, Matthew Reimherr
Venue: AAAI 2026 Oral
First: 2025-06-07T15:27:30+00:00 · Latest: 2025-11-25T14:56:02+00:00
Comments: Accepted at AAAI 2026 (Oral)
Abstract
Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and sensor-based condition monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal understanding capacity that human experts have to identify contextual anomalies. To fill this gap, we explore a solution based on vision language models (VLMs). Recent studies have shown the ability of VLMs for visual understanding tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution, with (1) ViT4TS, a vision-screening stage built on a relatively lightweight pre-trained vision encoder, which leverages 2D time series representations to accurately localize candidate anomalies; (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM's visual understanding capacity to refine the detection upon the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pre-trained and from-scratch baselines in most cases, yielding a 24.6% improvement in F1-max score over the best baseline. Moreover, VLM4TS also consistently outperforms existing language model-based TSAD methods and is on average 36x more efficient in token usage.
中文标题/摘要
标题:利用视觉语言模型进行时间序列异常检测
时间序列异常检测(TSAD)在医疗保健、金融和基于传感器的状态监测等多个领域中发挥了重要作用。以往的方法主要集中在训练针对数值数据的领域特定模型上,缺乏人类专家所具有的视觉-时间理解能力,以识别上下文异常。为弥补这一不足,我们探索了一种基于视觉语言模型(VLMs)的解决方案。最近的研究表明,VLMs在视觉理解任务中具有能力,但将其直接应用于时间序列数据上在准确性和效率上都存在不足。为了利用VLMs的力量进行TSAD,我们提出了一种两阶段的解决方案,包括(1)基于相对轻量级预训练视觉编码器的ViT4TS视觉筛选阶段,利用2D时间序列表示准确定位候选异常;(2)基于VLM的时间序列阶段VLM4TS,该阶段整合了全局时间上下文和VLM的视觉理解能力,以改进ViT4TS提供的候选异常的检测。结果显示,无需任何时间序列训练,VLM4TS在大多数情况下都优于时间序列预训练和从零开始的基线模型,F1-max分数提高了24.6%。此外,VLM4TS在语言模型基线TSAD方法中也表现出色,并且在标记使用上平均提高了36倍。
Summary / 总结
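This paper addresses time-series anomaly detection (TSAD) with vision-language models (VLMs). It proposes a two-stage solution: ViT4TS, a lightweight vision-screening stage that localizes candidate anomalies on 2D time-series representations, and VLM4TS, a VLM-based stage that refines those candidates using global temporal context. Without any time-series training, VLM4TS outperforms time-series pre-trained and from-scratch baselines in most cases, with a 24.6% improvement in F1-max score over the best baseline, and it is on average 36x more efficient in token usage than existing language model-based TSAD methods.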
本文通过利用视觉语言模型(VLMs)来解决时间序列异常检测(TSAD)的挑战,提出了一种两阶段解决方案:ViT4TS进行视觉筛选和VLM4TS进行上下文感知的细化。在无需时间序列训练的情况下,VLM4TS在F1-max分数上比现有方法提高了24.6%,并且在标记使用上平均效率提高了36倍。
RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness
Authors: Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang
Venue: NeurIPS 2025 Spotlight
First: 2025-02-24T13:52:05+00:00 · Latest: 2025-11-25T14:36:57+00:00
Comments: NeurIPS 2025 (Spotlight) Fix some typos
Abstract
Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.
中文标题/摘要
标题:RobustMerge:参数高效模型合并方法,具有方向鲁棒性
使用自定义数据微调预训练模型会产生众多针对特定任务的专家模型。将模型合并为一个通用模型以增强多任务能力,避免数据泄露,已变得流行。随着数据和模型规模的扩大,参数高效微调已成为高效获取任务特定模型的常见做法。然而,很少有方法专注于高效合并,现有的为全面微调合并设计的方法在高效合并中效果不佳。为解决这一问题,我们从低秩分解的角度进行分析,揭示了合并高效模块时方向鲁棒性的重要性。我们进一步发现,补偿显著奇异值之间的差距有助于方向鲁棒性。因此,我们提出了RobustMerge,这是一种无需训练的参数高效合并方法,具有互补的参数适应性以保持方向鲁棒性。具体而言,我们(1)从参数间关系中剪枝参数并调整系数以保持远离任务干扰的方向稳定性,(2)执行跨任务归一化以增强对未见过任务的泛化能力。我们在包含多种模态任务的基准上进行了实验,以证明我们方法的出色性能和泛化能力。进一步的研究和详尽的分析进一步展示了其有效性。代码可在https://github.com/AuroraZengfh/RobustMerge获取。
Summary / 总结
RobustMerge is a training-free, parameter-efficient merging method for multimodal large language models (MLLMs) that maintains direction robustness when merging parameter-efficient modules. It prunes parameters and scales coefficients to maintain direction stability and performs cross-task normalization to enhance generalization to unseen tasks. Experiments on a diverse multimodal task benchmark demonstrate the method's strong performance and generalizability.
RobustMerge 是一种无需训练、参数高效的多模态大语言模型(MLLM)合并方法,旨在解决在高效合并过程中保持方向稳健性的挑战。它通过修剪和缩放参数来保持方向稳定性,并执行跨任务归一化以增强泛化能力。在多样化的多模态任务基准上的实验展示了其出色的性能和泛化能力。
LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
Authors: Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini
First: 2025-10-13T15:19:07+00:00 · Latest: 2025-11-25T14:24:21+00:00
Comments: 22 pages, 9 figures
Abstract
Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To this end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning over four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
中文标题/摘要
标题:LikePhys:通过似然偏好评估视频扩散模型的直观物理理解
视频扩散模型中的直观物理理解在构建通用的物理合理世界模拟器中起着重要作用,但由于难以区分生成中的物理正确性和视觉外观,准确评估这种能力仍然是一个具有挑战性的任务。为此,我们引入了LikePhys,一种无需训练的方法,通过使用去噪目标作为基于ELBO的似然近似来区分有效和不可能的视频,从而评估视频扩散模型中的直观物理理解。通过在我们构建的包含十二种场景的基准测试中测试,这些场景跨越了四个物理领域,我们展示了我们的评估指标,可信赖性偏好误差(PPE),与人类偏好高度一致,优于最先进的评估基准。我们随后系统地评估了当前视频扩散模型的直观物理理解。我们的研究进一步分析了模型设计和推理设置如何影响直观物理理解,并突显了不同物理定律下的领域特定能力差异。实验证据表明,尽管当前模型在复杂和混沌动力学方面存在困难,但随着模型容量和推理设置的增加,物理理解能力有明显的提升趋势。
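The idea behind a likelihood-preference metric such as PPE can be sketched as follows. The abstract does not give the exact definition, so this sketch assumes PPE is the fraction of valid-invalid pairs in which the model does not assign a lower denoising loss (that is, a higher likelihood surrogate) to the physically valid video. The function name, the toy loss values, and the choice to count ties as errors are assumptions.

```python
from typing import Sequence


def plausibility_preference_error(
    valid_losses: Sequence[float],
    invalid_losses: Sequence[float],
) -> float:
    """Fraction of pairs where the valid video is NOT preferred.

    A lower denoising loss is read as a higher likelihood, so a pair counts
    as an error when the valid video's loss is greater than or equal to the
    invalid video's loss (ties counted as errors, by assumption).
    """
    if len(valid_losses) != len(invalid_losses):
        raise ValueError("each valid video needs exactly one invalid counterpart")
    if not valid_losses:
        raise ValueError("at least one pair is required")
    errors = sum(1 for v, i in zip(valid_losses, invalid_losses) if v >= i)
    return errors / len(valid_losses)


# Hypothetical per-video denoising losses for four valid-invalid pairs.
valid = [0.10, 0.30, 0.25, 0.40]
invalid = [0.20, 0.25, 0.50, 0.40]
ppe = plausibility_preference_error(valid, invalid)
print(ppe)  # 0.5: the second pair and the tied fourth pair count as errors
```

Under this reading, a lower value means the model more often prefers the physically valid video.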
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
First: 2025-06-05T07:26:34+00:00 · Latest: 2025-11-25T14:16:10+00:00
Comments: Project page: https://youngwanlee.github.io/holisafe
Abstract
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, the HoliSafe-Bench itself reveals critical vulnerabilities in existing VLM models. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
中文标题/摘要
标题:HoliSafe:视觉语言模型的全面安全基准和建模
尽管已经出现了增强视觉语言模型(VLMs)安全性的努力,但当前的方法面临两个主要不足。1)现有的安全调优数据集和基准仅部分考虑了图像-文本交互可能导致有害内容的情况,经常忽视看似无害的配对所引发的上下文不安全结果。这种狭窄的覆盖范围使VLMs在未见配置中容易受到越狱攻击。2)先前的方法主要依赖于数据驱动的调优,缺乏对内在增强安全性的架构创新。我们通过引入一个全面的安全数据集和基准——HoliSafe,跨越所有五种安全/不安全的图像-文本组合,为训练和评估提供了更坚实的基础(HoliSafe-Bench)。我们进一步提出了一种新颖的模块化框架,通过视觉防护模块(VGM)增强VLM的安全性,该模块旨在评估输入图像对VLM的有害性。该模块赋予VLMs双重功能:它们不仅学习生成更安全的响应,还可以提供可解释的有害性分类,以证明其拒绝决策的合理性。这种方法的一个重要优势是其模块化;VGM被设计为插件组件,可以无缝集成到各种规模的预训练VLMs中。实验表明,配备VGM并在HoliSafe上训练的Safe-VLM在多个VLM基准测试中实现了最先进的安全性能。此外,HoliSafe-Bench本身揭示了现有VLM模型中的关键漏洞。我们希望HoliSafe和VGM能够激发对稳健和可解释的VLM安全性的进一步研究,扩展未来多模态对齐的途径。
Summary / 总结
HoliSafe addresses the limitations of existing safety benchmarks for Vision-Language Models (VLMs) by introducing a comprehensive dataset and benchmark that covers all five safe/unsafe image-text combinations. The proposed Visual Guard Module (VGM), a plug-in component, assesses the harmfulness of input images, so that VLMs both generate safer responses and provide interpretable harmfulness classifications to justify refusals. Experiments show that Safe-VLM with VGM, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks, and HoliSafe-Bench reveals critical vulnerabilities in existing VLMs.
HoliSafe通过引入一个全面的安全数据集和基准,涵盖了所有安全和不安全的图像-文本组合,解决了现有VLM安全调优数据集和基准的局限性。此外,还提出了一种新型模块化框架,视觉守护模块(VGM),通过评估输入图像的有害性来增强VLM的安全性。实验表明,使用HoliSafe和VGM训练的VLM在多个VLM基准上实现了最先进的安全性能,并提供了可解释的有害性分类。
NNGPT: Rethinking AutoML with Large Language Models
Authors: Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte
First: 2025-11-25T14:10:44+00:00 · Latest: 2025-11-25T14:10:44+00:00
Abstract
Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on a closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from the results. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.
中文标题/摘要
标题:NNGPT:以大型语言模型重新思考自动机器学习
构建自我改进的人工智能系统仍然是人工智能领域的一项基本挑战。我们提出了NNGPT,这是一个开源框架,将大型语言模型(LLM)转变为用于神经网络开发的自我改进自动机器学习引擎,主要针对计算机视觉领域。与之前的框架不同,NNGPT通过生成新模型扩展神经网络的数据集,基于生成、评估和自我改进的闭环系统对LLM进行持续微调。它在一个统一的工作流中集成了五个协同的基于LLM的管道:零样本架构合成、超参数优化(HPO)、代码感知的准确度/提前停止预测、检索增强的封闭领域PyTorch模块合成(NN-RAG)以及强化学习。基于经过审计的LEMUR数据集作为可重复度量的语料库,NNGPT从单个提示开始,验证网络架构、预处理代码和超参数,执行整个流程,并从结果中学习。PyTorch适配器使NNGPT框架无关,使其具有强大的性能:NN-RAG在1,289个目标上实现了73%的可执行性,3次提示提升在常见数据集上的准确性,基于哈希的去重节省了数百次运行。一击预测与基于搜索的自动机器学习相当,减少了多次试验的需要。在LEMUR上的超参数优化实现了RMSE 0.60,优于Optuna(0.64),而代码感知预测达到了RMSE 0.14,皮尔逊相关系数为0.78。该系统已经生成了超过5,000个验证模型,证明NNGPT是一个自主的自动机器学习引擎。一旦被接受,代码、提示和检查点将对公众开放,以促进可重复性和社区使用。
Summary / 总结
NNGPT is an open-source framework that uses a large language model to create a self-improving AutoML engine for neural network development, particularly in computer vision. It generates new models and continuously fine-tunes LLMs through a closed-loop system of generation, assessment, and self-improvement. Key findings include 73% executability for NN-RAG on 1,289 targets, 3-shot prompting boosting accuracy on common datasets, and HPO reaching RMSE 0.60 versus 0.64 for Optuna. The system has generated over 5,000 validated models, demonstrating its effectiveness as an autonomous AutoML engine.
NNGPT 是一个开源框架,将大型语言模型转变为用于计算机视觉的神经网络开发的自改进 AutoML 引擎。它通过生成、评估和自我改进的闭环系统生成新的神经网络模型并持续微调 LLM。关键发现包括 NN-RAG 的执行率为 73%,三轮提示增强准确性,以及 HPO 的 RMSE 为 0.60,优于 Optuna。该系统已生成超过 5,000 个验证模型,证明其作为自主 AutoML 引擎的有效性。
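The hash-based deduplication mentioned in the NNGPT abstract can be illustrated with a small sketch: fingerprint each generated model's source code and skip any candidate whose fingerprint has already been seen, so it is not trained again. The normalization rule (strip trailing whitespace, drop blank lines), the choice of SHA-256, and all names here are assumptions for illustration; the abstract does not say how NNGPT computes its hashes.

```python
import hashlib
from typing import Iterable, List


def code_fingerprint(source: str) -> str:
    """Hash source code after a light normalization.

    Trailing whitespace and blank lines are ignored so that trivially
    reformatted copies of the same code map to the same fingerprint.
    """
    lines = [line.rstrip() for line in source.splitlines()]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(candidates: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct fingerprint, in order."""
    seen = set()
    unique = []
    for source in candidates:
        fingerprint = code_fingerprint(source)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(source)
    return unique


# Hypothetical generated snippets: the second differs only in whitespace.
candidates = ["x = 1\ny = 2\n", "x = 1   \n\ny = 2", "x = 1\ny = 3\n"]
unique = deduplicate(candidates)
print(len(unique))  # 2
```

A textual hash only catches near-verbatim repeats; two architectures that are equivalent but written differently would still be treated as distinct.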
Solving Heterogeneous Agent Models with Physics-informed Neural Networks
Authors: Marta Grzeskiewicz
First: 2025-11-25T13:11:03+00:00 · Latest: 2025-11-25T13:11:03+00:00
Abstract
Understanding household behaviour is essential for modelling macroeconomic dynamics and designing effective policy. While heterogeneous agent models offer a more realistic alternative to representative agent frameworks, their implementation poses significant computational challenges, particularly in continuous time. The Aiyagari-Bewley-Huggett (ABH) framework, recast as a system of partial differential equations, typically relies on grid-based solvers that suffer from the curse of dimensionality, high computational cost, and numerical inaccuracies. This paper introduces the ABH-PINN solver, an approach based on Physics-Informed Neural Networks (PINNs), which embeds the Hamilton-Jacobi-Bellman and Kolmogorov Forward equations directly into the neural network training objective. By replacing grid-based approximation with mesh-free, differentiable function learning, the ABH-PINN solver benefits from the advantages of PINNs of improved scalability, smoother solutions, and computational efficiency. Preliminary results show that the PINN-based approach is able to obtain economically valid results matching the established finite-difference solvers.
中文标题/摘要
标题:使用物理信息神经网络求解异质代理人模型
理解家庭行为对于建模宏观经济动态和设计有效政策至关重要。虽然异质代理人模型提供了比代表代理人框架更现实的选择,但其实施面临显著的计算挑战,特别是在连续时间中。Aiyagari-Bewley-Huggett (ABH) 框架重新表述为偏微分方程系统,通常依赖于网格基解算器,这些解算器遭受维数灾难、高计算成本和数值不准确性的困扰。本文引入了ABH-PINN解算器,这是一种基于物理信息神经网络(PINNs)的方法,将哈密尔顿-雅可比-贝尔曼和科莫戈罗夫前向方程直接嵌入到神经网络训练目标中。通过用无网格、可微分函数学习替代网格基近似,ABH-PINN解算器受益于PINNs的改进可扩展性、更平滑的解和计算效率。初步结果显示,基于PINN的方法能够获得与已建立的差分求解器一致的经济上有效的结果。
Summary / 总结
This paper addresses the computational challenges in solving heterogeneous agent models by introducing the ABH-PINN solver, which uses Physics-Informed Neural Networks to embed the Hamilton-Jacobi-Bellman and Kolmogorov Forward equations directly into the training objective. The method offers improved scalability, smoother solutions, and computational efficiency compared to grid-based solvers. Initial results demonstrate that the PINN-based approach can achieve economically valid results comparable to established finite-difference solvers.
本文通过引入基于物理感知神经网络的ABH-PINN解算器来解决异质代理模型的计算挑战,该方法将哈密尔顿-雅可比-贝尔曼和科莫戈罗夫前向方程直接嵌入到神经网络训练目标中。与基于网格的求解器相比,该方法具有更好的可扩展性、更平滑的解和更高的计算效率。初步结果表明,基于PINN的方法可以达到与现有有限差分求解器相当的经济有效结果。
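As a hedged illustration of what embedding the two equations in the training objective can look like, the block below writes the stationary Hamilton-Jacobi-Bellman and Kolmogorov Forward equations in the standard two-income-state continuous-time formulation, followed by a residual loss built from them. The notation (value functions v_j, densities g_j, saving policy s_j, switching intensities \lambda_j, collocation points a_i, equal weighting of the two residuals) is an assumption taken from the textbook setup, not from the paper, which may use a different state space, weighting, or boundary treatment.

```latex
% Stationary HJB equation for income state j in {1, 2}, wealth a
\rho\, v_j(a) = \max_{c}\Big\{ u(c) + v_j'(a)\,\big(z_j + r a - c\big) \Big\}
              + \lambda_j \big( v_{-j}(a) - v_j(a) \big)

% Saving policy implied by the first-order condition u'(c_j(a)) = v_j'(a)
c_j(a) = (u')^{-1}\big(v_j'(a)\big), \qquad s_j(a) = z_j + r a - c_j(a)

% Stationary Kolmogorov Forward equation for the wealth density g_j
0 = -\frac{d}{da}\big[ s_j(a)\, g_j(a) \big] - \lambda_j\, g_j(a) + \lambda_{-j}\, g_{-j}(a)

% PINN-style objective: mean squared residuals at N collocation points,
% where each residual R is (left-hand side minus right-hand side) of the
% corresponding equation evaluated with the network outputs v_j, g_j
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{2}
  \Big( R^{\mathrm{HJB}}_j(a_i;\theta)^2 + R^{\mathrm{KF}}_j(a_i;\theta)^2 \Big)
```

In practice such an objective would also need terms for the borrowing constraint and for normalizing the densities to integrate to one; these are omitted from the sketch.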
Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement
Authors: Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang
Venue: ICCV 2025
First: 2025-11-25T13:09:03+00:00 · Latest: 2025-11-25T13:09:03+00:00
Comments: ICCV 2025 Physics-IQ Challenge Third Place Solution
Abstract
Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
中文标题/摘要
标题:通过VLM引导的迭代自我精炼实现物理原理指导下的视频生成
近期视频生成技术取得了显著的视觉效果进步,但当前模型仍然难以生成符合现实物理原理的结果。为此,我们提出了一种迭代自我精炼框架,利用大型语言模型和视觉-语言模型提供物理感知的指导以改进视频生成。具体而言,我们引入了一种多模态链式思考(MM-CoT)过程,基于物理不一致性的反馈逐步精炼提示,从而逐步提升生成质量。该方法无需训练且即插即用,使其能够广泛应用于各种视频生成模型。在PhyIQ基准测试上的实验表明,我们的方法将Physics-IQ得分从56.31提高到62.38。我们希望这项工作能够作为物理一致视频生成的初步探索,并为未来研究提供参考。
Summary / 总结
The research aims to improve video generation models to better align with real-world physical principles. It proposes an iterative self-refinement framework using large language models and vision-language models to provide physics-aware guidance. Experiments show that this method enhances the Physics-IQ score from 56.31 to 62.38 on the PhyIQ benchmark, demonstrating improved alignment with physical principles in video generation.
研究旨在使生成的视频更符合现实世界的物理原理。提出了一种迭代自我完善框架,利用大型语言模型和视觉-语言模型提供物理意识的指导。实验表明,该方法在PhyIQ基准上的Physics-IQ分数从56.31提高到62.38。该方法无需训练,可以轻松集成到各种视频生成模型中。
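The iterative self-refinement loop described above can be sketched in a model-agnostic way. The three callables (`generate`, `critique`, `rewrite`) are hypothetical stand-ins for the video generator, the VLM-based physics check, and the LLM-based prompt rewriter; their names, signatures, the stopping rule, and the toy stubs are all assumptions, not the authors' interface.

```python
from typing import Callable, List, Tuple


def refine_prompt(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], List[str]],
    rewrite: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, List[str]]]]:
    """Generate, collect physical inconsistencies, rewrite the prompt; repeat.

    Stops early when the critique reports no issues. Returns the final prompt
    and the per-round history of (prompt, issues).
    """
    history = []
    for _ in range(max_rounds):
        video = generate(prompt)
        issues = critique(video, prompt)
        history.append((prompt, issues))
        if not issues:
            break
        prompt = rewrite(prompt, issues)
    return prompt, history


# Toy stubs: the "video" is just the prompt text, and the critic flags a
# missing mention of falling. Real components would call actual models.
def fake_generate(p: str) -> str:
    return p


def fake_critique(video: str, p: str) -> List[str]:
    return [] if "falls" in video else ["gravity"]


def fake_rewrite(p: str, issues: List[str]) -> str:
    return p + " The ball falls under gravity."


final_prompt, history = refine_prompt(
    "A ball is dropped.", fake_generate, fake_critique, fake_rewrite
)
print(final_prompt)
print(len(history))  # 2 rounds: one with an issue, one clean
```

Because the loop only touches the prompt, it needs no training and can wrap any generator that accepts text prompts, which matches the plug-and-play property the abstract claims.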