FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
Authors: Guangzhao Li, Yanming Yang, Chenxi Song, Chi Zhang
First: 2025-06-05T13:54:40+00:00 · Latest: 2025-12-12T15:24:58+00:00
Comments: Project Page is https://flowdirector-edit.github.io
Abstract
Text-driven video editing aims to modify video content based on natural language instructions. While recent training-free methods have leveraged pretrained diffusion models, they often rely on an inversion-editing paradigm. This paradigm maps the video to a latent space before editing. However, the inversion process is not perfectly accurate, often compromising appearance fidelity and motion consistency. To address this, we introduce FlowDirector, a novel training-free and inversion-free video editing framework. Our framework models the editing process as a direct evolution in the data space. It guides the video to transition smoothly along its inherent spatio-temporal manifold using an ordinary differential equation (ODE), thereby avoiding the inaccurate inversion step. From this foundation, we introduce three flow correction strategies for appearance, motion, and stability: 1) Direction-aware flow correction amplifies components that oppose the source direction and removes irrelevant terms, breaking conservative streamlines and enabling stronger structural and textural changes. 2) Motion-appearance decoupling optimizes motion agreement as an energy term at each timestep, significantly improving consistency and motion transfer. 3) Differential averaging guidance strategy leverages differences among multiple candidate flows to approximate a low variance regime at low cost, suppressing artifacts and stabilizing the trajectory. Extensive experiments across various editing tasks and benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction following, temporal consistency, and background preservation, establishing an efficient new paradigm for coherent video editing without inversion.
中文标题/摘要
标题:FlowDirector:无需训练的精确文本到视频编辑流控制
基于文本的视频编辑旨在根据自然语言指令修改视频内容。虽然最近的无需训练方法利用了预训练的扩散模型,但它们通常依赖于反演编辑范式。该范式将视频映射到潜在空间,然后再进行编辑。然而,反演过程并不完全准确,经常损害外观保真度和运动一致性。为了解决这个问题,我们引入了FlowDirector,这是一种全新的无需训练和无需反演的视频编辑框架。我们的框架将编辑过程建模为数据空间中的直接演化。它使用常微分方程(ODE)引导视频沿着其固有的时空流形平滑过渡,从而避免了不准确的反演步骤。在此基础上,我们提出了三种流校正策略,分别针对外观、运动和稳定性:1)方向感知流校正增强与源方向相反的成分并移除无关项,打破保守的流线,使结构和纹理变化更加强烈。2)运动-外观解耦在每个时间步优化运动一致性作为能量项,显著提高一致性和运动转移。3)微分平均引导策略利用多个候选流之间的差异,在低成本下近似低方差区域,抑制伪影并稳定轨迹。在各种编辑任务和基准上的广泛实验表明,FlowDirector在指令遵循、时间一致性以及背景保留方面达到了最先进的性能,建立了高效的新范式,无需反演。
Summary / 总结
FlowDirector is a training-free and inversion-free video editing framework that models the editing process as a direct evolution in the data space using an ODE. It introduces three flow correction strategies for appearance, motion, and stability, which enhance structural changes, motion consistency, and trajectory stability. Experiments show that FlowDirector outperforms existing methods in instruction following, temporal consistency, and background preservation, setting a new standard for coherent video editing.
FlowDirector 是一种无需训练且无需反演的视频编辑框架,通过常微分方程在数据空间中直接演化编辑过程。它引入了针对外观、运动和稳定性的三种流校正策略,提高了指令跟随、时间一致性和背景保留。实验表明,FlowDirector 在各种编辑任务和基准测试中优于现有方法。
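The FlowDirector abstract describes editing as an ODE that evolves directly in data space, stabilised by averaging differences among several candidate flows. A minimal toy sketch of that idea follows. It is one possible reading of the abstract, not the authors' implementation: `v_src` and `v_tgt` are hypothetical stand-ins for a pretrained flow model conditioned on the source and target prompts, and the noise level, candidate count, and explicit Euler integrator are all assumptions.

```python
import random

def edit_velocity(x, t, v_src, v_tgt, n_candidates=4, noise=0.1, rng=None):
    """Average the target-minus-source velocity difference over noisy candidates."""
    rng = rng or random.Random(0)
    total = [0.0] * len(x)
    for _ in range(n_candidates):
        # Perturb the current state; both branches see the same perturbed input.
        x_noisy = [xi + rng.gauss(0.0, noise) for xi in x]
        vs = v_src(x_noisy, t)
        vt = v_tgt(x_noisy, t)
        for i in range(len(x)):
            total[i] += vt[i] - vs[i]
    return [s / n_candidates for s in total]

def euler_edit(x0, v_src, v_tgt, steps=50):
    """Integrate dx/dt = averaged velocity difference from t=0 to t=1 (explicit Euler)."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = edit_velocity(x, k * dt, v_src, v_tgt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical toy fields: each pulls the state toward a different point.
v_src = lambda x, t: [0.0 - xi for xi in x]
v_tgt = lambda x, t: [1.0 - xi for xi in x]
edited = euler_edit([0.0, 0.0], v_src, v_tgt)
```

In this toy setting the two fields differ by a constant, so the state simply drifts from the source attractor to the target attractor without ever being mapped back to noise.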
3D-LATTE: Latent Space 3D Editing from Textual Instructions
Authors: Maria Parelli, Michael Oechsle, Michael Niemeyer, Federico Tombari, Andreas Geiger
First: 2025-08-29T22:51:59+00:00 · Latest: 2025-12-12T14:39:26+00:00
Abstract
Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lags surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain, and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods, enabling high-fidelity and precise edits across a wide range of shapes and semantic manipulations. Our project webpage is https://mparelli.github.io/3d-latte
中文标题/摘要
标题:3D-LATTE:基于文本指令的3D空间编辑
尽管多视角扩散模型在基于文本/图像的3D资产生成方面取得了近期的成功,但基于指令的3D资产编辑的质量远远落后于生成模型。主要原因在于,最近使用2D先验的方法遭受了视角不一致的编辑信号。超越2D先验提取方法和多视角编辑策略,我们提出了一种无需训练的编辑方法,在原生3D扩散模型的潜在空间中操作,使我们能够直接操控3D几何。我们通过将生成的3D注意力图与源对象混合来引导编辑合成。结合几何感知正则化引导、Fourier域中的频谱调制策略和3D增强的细化步骤,我们的方法在多种形状和语义操控下实现了高保真度和精确的编辑,超越了之前的3D编辑方法。我们的项目网页是https://mparelli.github.io/3d-latte
Summary / 总结
The research aims to improve 3D asset editing from textual instructions by addressing the limitations of current 2D-based methods. It introduces a training-free approach that operates directly in the latent space of a 3D diffusion model, using 3D attention maps and geometry-aware regularization to achieve high-fidelity and precise edits. The method outperforms previous techniques in handling various shapes and semantic manipulations.
研究旨在通过解决当前基于2D的方法的局限性,提高从文本指令编辑3D资产的能力。它提出了一种无需训练的方法,直接在3D扩散模型的潜在空间中操作,使用3D注意力图和几何感知正则化来实现高保真度和精确的编辑。该方法在处理各种形状和语义操作方面优于先前的技术。
SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2
Authors: Zhendi Gong, Xin Chen
Venue: MICCAI 2025
First: 2025-12-12T13:33:38+00:00 · Latest: 2025-12-12T13:33:38+00:00
Comments: Accepted by MICCAI 2025 CARE Challenge, waiting for publication
Abstract
Despite the success of deep-learning-based models in medical image segmentation, most state-of-the-art (SOTA) methods perform fully-supervised learning, which commonly relies on large-scale annotated training datasets. However, medical image annotation is highly time-consuming, hindering its clinical applications. Semi-supervised learning (SSL) has emerged as an appealing strategy for training with limited annotations, largely reducing the labelling cost. We propose a novel SSL framework, SSL-MedSAM2, which contains a training-free few-shot learning branch, TFFS-MedSAM2, based on the pretrained large foundation model Segment Anything Model 2 (SAM2) for pseudo label generation, and an iterative fully-supervised learning branch, FSL-nnUNet, based on nnUNet for pseudo label refinement. The results on the MICCAI2025 challenge CARE-LiSeg (Liver Segmentation) demonstrate outstanding performance of SSL-MedSAM2 compared with other methods. The average Dice scores on the test set in GED4 and T1 MRI are 0.9710 and 0.9648 respectively, and the Hausdorff distances are 20.07 and 21.97 respectively. The code is available via https://github.com/naisops/SSL-MedSAM2/tree/main.
中文标题/摘要
标题:SSL-MedSAM2:基于SAM2少样本学习的半监督医学图像分割框架
尽管基于深度学习的模型在医学图像分割方面取得了成功,但大多数最先进的(SOTA)方法都采用全监督学习,通常依赖于大规模标注的训练数据集。然而,医学图像标注耗时且成本高,阻碍了其临床应用。半监督学习(SSL)作为一种在有限标注下训练的有吸引力策略出现,大大降低了标注成本。我们提出了一种新的SSL框架SSL-MedSAM2,该框架包含一个基于预训练的大规模基础模型Segment Anything Model 2 (SAM2) 的无训练少样本学习分支TFFS-MedSAM2,用于伪标签生成,以及一个基于nnUNet的迭代全监督学习分支FSL-nnUNet,用于伪标签精炼。在MICCAI2025挑战CARE-LiSeg(肝脏分割)上的结果表明,SSL-MedSAM2在其他方法中表现出色。在GED4和T1 MRI测试集上的平均dice分数分别为0.9710和0.9648,Hausdorff距离分别为20.07和21.97。代码可通过https://github.com/naisops/SSL-MedSAM2/tree/main/获取。
Summary / 总结
The research aims to address the high cost of annotation in medical image segmentation by proposing SSL-MedSAM2, a semi-supervised learning framework. It combines a training-free few-shot learning branch using the pretrained SAM2 model for pseudo label generation and an iterative fully-supervised learning branch using nnUNet for pseudo label refinement. The framework achieves outstanding performance, with average Dice scores of 0.9710 and 0.9648 on GED4 and T1 MRI, respectively, and Hausdorff distances of 20.07 and 21.97. The code is available on GitHub.
研究旨在通过提出SSL-MedSAM2半监督学习框架来解决医学图像分割中的标注成本问题。该框架结合了用于伪标签生成的少量样本学习分支和用于细化的完全监督学习分支。方法在GED4和T1 MRI数据集上的平均Dice分数分别为0.9710和0.9648,Hausdorff距离分别为20.07和21.97。结果在MICCAI2025挑战CARE-LiSeg中优于其他方法。代码可在GitHub上获得。
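The SSL-MedSAM2 results are reported as Dice scores and Hausdorff distances. For readers unfamiliar with these metrics, the sketch below computes both from their standard definitions on tiny 2D masks. It is illustrative only: the challenge's exact evaluation protocol (for example 3D volumes, physical spacing, or a percentile variant of the Hausdorff distance) is not stated in the abstract and is not assumed here.

```python
import math

def dice(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two sets of foreground pixel coordinates."""
    if not pred and not truth:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets (brute force)."""
    def directed(p, q):
        # Largest distance from any point in p to its nearest point in q.
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))

# Toy masks given as sets of (row, col) foreground pixels.
pred = {(0, 0), (0, 1), (1, 1)}
truth = {(0, 1), (1, 1), (2, 1)}
dice_score = dice(pred, truth)        # 2 * 2 / (3 + 3)
hd = hausdorff(pred, truth)
```

Dice rewards overlap and is bounded in [0, 1], while the Hausdorff distance penalises the single worst boundary error, which is why the two are usually reported together.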
VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
Authors: Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg
First: 2025-12-12T11:39:35+00:00 · Latest: 2025-12-12T11:39:35+00:00
Comments: 21 pages, 7 figures, under review
Abstract
Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves $\textbf{26.6%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
中文标题/摘要
标题:VLM2GeoVec:迈向遥感领域的通用多模态嵌入
卫星图像与自然图像在本质上存在根本差异:其空中视角、极高分辨率、多样化的尺度变化以及众多的小物体,要求同时进行区域级别的空间推理和整体场景理解。当前的遥感方法在双编码检索模型和生成助手之间存在碎片化:前者擅长大规模跨模态搜索,但无法交错模态;后者支持区域级别的解释,但缺乏可扩展的检索能力。我们提出了一种名为$\textbf{VLM2GeoVec}$的指令遵循、单编码视觉-语言模型,该模型通过对比学习在统一的向量空间中嵌入交错输入(图像、文本、边界框和地理坐标)。我们的单编码器将所有输入交织成一个联合嵌入,并通过对比损失进行训练,消除了多阶段管道和特定任务模块。为了评估其通用性,我们引入了$\textbf{RSMEB}$,这是一个新的基准,涵盖了关键的遥感嵌入应用:场景分类;跨模态搜索;组合检索;视觉问答;视觉定位和区域级别推理;以及语义地理空间检索。在RSMEB上,它在区域-描述检索中达到了$\textbf{26.6%}$的P@1(比双编码基线高出25个百分点),在引用表达检索中达到了$\textbf{32.5%}$的P@1(比双编码基线高出19个百分点),在语义地理定位检索中达到了$\textbf{17.8%}$的P@1(超过先前最佳结果的3倍),同时在场景分类和跨模态检索等传统任务上与专门的基线相当或超越。VLM2GeoVec将可扩展的检索与区域级别的空间推理统一起来,使遥感领域的多模态分析得以连贯进行。在文章被接受后,我们将公开发布代码、检查点和数据。
Summary / 总结
VLM2GeoVec addresses the unique challenges of satellite imagery with an instruction-following vision-language model that supports both region-level spatial reasoning and holistic scene understanding. It uses a single encoder, trained contrastively, to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. On the RSMEB benchmark, VLM2GeoVec outperforms dual-encoder baselines in region-caption retrieval, referring-expression retrieval, and semantic geo-localization retrieval, while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. The model unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing.
VLM2GeoVec 旨在解决卫星图像的独特挑战,通过集成能够处理区域级空间推理和整体场景理解的视觉语言模型。它采用单编码器方法,在统一的向量空间中训练交错输入(图像、文本、边界框和地理坐标)。在 RSMEB 基准测试中,VLM2GeoVec 在区域描述符检索、引用表达检索和语义地理定位检索方面优于双编码器基线,同时在场景分类和跨模态检索等传统任务上与专门的基线相当或超越。该模型将可扩展的检索与区域级空间推理统一起来,增强了遥感中的多模态分析。
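The VLM2GeoVec retrieval results are reported as P@1, the fraction of queries whose top-ranked candidate is the correct one. A minimal sketch of that metric over cosine similarity is shown below. The two-dimensional vectors are made-up stand-ins for model embeddings, and cosine similarity is an assumption, since the abstract does not name the similarity function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def precision_at_1(queries, candidates, gold):
    """Fraction of queries whose highest-scoring candidate index equals the gold index."""
    hits = 0
    for q, g in zip(queries, gold):
        best = max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        hits += best == g
    return hits / len(queries)

# Hypothetical embeddings: the third query is deliberately far from its gold candidate.
queries = [[1.0, 0.1], [0.2, 1.0], [1.0, 1.0]]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
gold = [0, 1, 2]
p1 = precision_at_1(queries, candidates, gold)
```

Here two of the three queries retrieve their gold candidate first, so P@1 is 2/3.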
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Authors: Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang
First: 2025-12-11T18:00:21+00:00 · Latest: 2025-12-12T11:08:08+00:00
Abstract
This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
中文标题/摘要
标题:从宏观到微观:通过视觉语言模型评估分子微观空间智能
本文介绍了微观空间智能(MiSI)的概念,即感知和推理看不见的微观实体的空间关系的能力,这是科学研究的基础。为了评估视觉语言模型(VLMs)在这一领域的潜力,我们提出了一种系统性的基准框架MiSI-Bench。该框架包含超过163,000个问答对和587,000张图像,源自约4,000个分子结构,涵盖了九个互补任务,评估能力从基本的空间变换到复杂的关联识别。实验结果表明,当前最先进的VLMs在这一基准上的表现远低于人类水平。然而,微调后的7B模型显示出巨大的潜力,甚至在空间变换任务上超过了人类,而其在氢键识别等基于科学的任务上的表现不佳,突显了整合显式领域知识以实现科学AGI的必要性。数据集可在https://huggingface.co/datasets/zongzhao/MiSI-bench 获取。
Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
Authors: Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun, Qiuhong Ke
First: 2025-12-12T10:53:51+00:00 · Latest: 2025-12-12T10:53:51+00:00
Abstract
We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at https://github.com/Alchemist0754/Skeleton-Cache.
中文标题/摘要
标题:基于骨架的零样本动作识别的无训练测试时自适应增强
我们介绍了Skeleton-Cache,这是第一个用于基于骨架的零样本动作识别(SZAR)的无训练测试时自适应框架,旨在提高模型在推理过程中对未见过的动作的泛化能力。Skeleton-Cache将推理重新定义为对非参数缓存中的结构化骨架表示进行轻量级检索的过程,该缓存同时存储全局和细粒度局部描述符。为了指导描述符级别的预测融合,我们利用大型语言模型(LLMs)的语义推理能力为每个类别分配特定的重要性权重。通过将这些结构化描述符与LLM引导的语义先验相结合,Skeleton-Cache能够在不进行额外训练或访问训练数据的情况下动态适应未见过的动作。在NTU RGB+D 60/120和PKU-MMD II上的广泛实验表明,Skeleton-Cache在零样本和泛化零样本设置下能够一致地提升各种SZAR骨干网络的性能。代码已公开发布在https://github.com/Alchemist0754/Skeleton-Cache。
Summary / 总结
The research introduces Skeleton-Cache, a training-free test-time adaptation framework for skeleton-based zero-shot action recognition, which enhances model generalization to unseen actions. It reformulates inference as a lightweight retrieval process using a non-parametric cache of structured skeleton representations, integrating global and local descriptors. Semantic importance weights are assigned using large language models to guide descriptor fusion. Experiments on NTU RGB+D 60/120 and PKU-MMD II show that Skeleton-Cache consistently improves the performance of various SZAR backbones under zero-shot and generalized zero-shot settings.
论文提出了一个训练-free 测试时自适应框架 Skeleton-Cache,用于骨架基于的零样本动作识别,以提高模型对未见过动作的泛化能力。它将推理过程重新定义为一个轻量级的结构骨架表示检索过程,并利用大型语言模型分配类别特定的重要性权重。在 NTU RGB+D 60/120 和 PKU-MMD II 上的实验表明,Skeleton-Cache 在零样本和泛化零样本设置下提高了各种 SZAR 后端的性能,无需额外训练或访问训练数据。
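The Skeleton-Cache abstract frames inference as retrieval over a non-parametric cache of global and local descriptors, fused with class-specific importance weights. The toy sketch below illustrates that flow under loud assumptions: the descriptor names, the cosine scoring, the weighted-sum fusion, and the hand-written weights (standing in for LLM-assigned priors) are all hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cache_predict(query, cache, weights):
    """Score each cached entry by a class-weighted sum of descriptor similarities.

    query:   {descriptor_name: vector}
    cache:   list of ({descriptor_name: vector}, label)
    weights: {label: {descriptor_name: weight}}  (hypothetical LLM-assigned priors)
    """
    scores = {}
    for descriptors, label in cache:
        s = sum(weights[label][name] * cosine(query[name], vec)
                for name, vec in descriptors.items())
        # Keep the best-matching cached entry per class.
        scores[label] = max(scores.get(label, float("-inf")), s)
    return max(scores, key=scores.get)

cache = [
    ({"global": [1.0, 0.0], "arm": [1.0, 0.0], "leg": [0.0, 1.0]}, "wave"),
    ({"global": [1.0, 0.0], "arm": [0.0, 1.0], "leg": [1.0, 0.0]}, "kick"),
]
weights = {
    "wave": {"global": 0.2, "arm": 0.7, "leg": 0.1},
    "kick": {"global": 0.2, "arm": 0.1, "leg": 0.7},
}
query = {"global": [1.0, 0.0], "arm": [0.9, 0.1], "leg": [0.1, 0.9]}
predicted = cache_predict(query, cache, weights)
```

Both classes share the same global descriptor in this example, so the decision is driven entirely by the local descriptors that each class weights most heavily.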
Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering
Authors: Jun Li, Hongjian Dou, Zhenyu Zhang, Kai Li, Shaoguo Liu, Tingting Gao
First: 2025-08-15T07:10:10+00:00 · Latest: 2025-12-12T09:33:35+00:00
Abstract
Composed Image Retrieval (CIR) presents a significant challenge as it requires jointly understanding a reference image and a textual modification instruction to find relevant target images. Some existing methods use a two-stage approach to further refine retrieval results. However, this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application in CIR tasks remains limited: existing uses either compress visual information into text or rely on elaborate prompt designs. Besides, existing works only utilize CoT for zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a simple but effective module called Pyramid Patcher, we enhance the Pyramid Matching Model's understanding of visual information at different granularities. Inspired by representation engineering, we extract representations from CoT data and inject them into large vision-language models (LVLMs). This approach allows us to obtain refined retrieval scores in the Training-Free Refinement paradigm without relying on explicit textual reasoning, further enhancing performance. Extensive experiments on CIR benchmarks demonstrate that PMTFR surpasses state-of-the-art methods in supervised CIR tasks. The code will be made public.
中文标题/摘要
标题:通过推理增强表示工程提高监督组合图像检索
组合图像检索(CIR)面临重大挑战,因为它需要同时理解参考图像和修改后的文本指令,以找到相关的目标图像。一些现有方法尝试使用两阶段方法进一步细化检索结果。然而,这通常需要额外训练排名模型。尽管链式思考(CoT)技术在减少语言模型训练成本方面取得了成功,但在CIR任务中的应用仍然有限——将视觉信息压缩为文本或依赖复杂的提示设计。此外,现有工作仅将其用于零样本CIR,因为即使使用训练良好的模型,在监督CIR任务中获得满意的结果也具有挑战性。在本文中,我们提出了一种包括训练免费精炼金字塔匹配模型(PMTFR)的框架来解决这些挑战。通过一个简单但有效的模块——金字塔补丁,我们增强了金字塔匹配模型对不同粒度视觉信息的理解。受表示工程的启发,我们从COT数据中提取表示并注入到LVLMs中。这种方法使我们能够在训练免费精炼范式中获得细化的检索分数,而无需依赖显式的文本推理,进一步提高了性能。在CIR基准上的广泛实验表明,PMTFR在监督CIR任务中超越了最先进的方法。代码将公开。
Summary / 总结
This work addresses the challenge of Composed Image Retrieval by proposing a framework called PMTFR, which includes a Pyramid Matching Model with Training-Free Refinement. It enhances the model's understanding of visual information through a Pyramid Patcher module and uses representation engineering to improve retrieval scores without explicit textual reasoning. Experiments show that PMTFR outperforms existing methods in supervised CIR tasks.
本文提出了一种名为PMTFR的框架,该框架包含一个带有训练免费精炼的金字塔匹配模型。通过金字塔补丁模块增强模型对视觉信息的理解,并使用表示工程从COT数据中提取表示并注入LVLMs,从而在无需显式文本推理的情况下提高检索分数。实验结果表明,PMTFR在监督CIR任务中优于现有方法。
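The PMTFR abstract says representations are extracted from CoT data and injected into the LVLM. A common representation-engineering recipe is to take the difference of mean hidden states between two conditions and add it back as a steering direction. The sketch below shows that recipe on plain lists. It is an assumption about what the extraction and injection could look like, not the paper's procedure, and `alpha`, the layer choice, and the toy vectors are all hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(with_cot, without_cot):
    """Difference of mean hidden states between CoT and non-CoT runs."""
    mu_pos, mu_neg = mean_vector(with_cot), mean_vector(without_cot)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def inject(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction (alpha is a hypothetical strength)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Made-up hidden states standing in for activations collected from a model.
with_cot = [[1.0, 2.0], [3.0, 4.0]]
without_cot = [[0.0, 1.0], [2.0, 1.0]]
direction = steering_vector(with_cot, without_cot)
steered = inject([0.5, 0.5], direction, alpha=0.5)
```

Because the direction is computed once and added at inference time, no textual reasoning needs to be generated when scoring candidates, which matches the training-free framing in the abstract.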
Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
First: 2025-12-12T09:19:45+00:00 · Latest: 2025-12-12T09:19:45+00:00
Abstract
Vision-Language Models (VLMs) are able to process increasingly long videos. Yet important visual information is easily lost across the full context and missed by VLMs. It is also important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
中文标题/摘要
标题:最少片段,最大显著性:通过关键时刻提取进行长视频摘要
视觉-语言模型(VLMs)能够处理越来越长的视频。然而,重要视觉信息很容易在整个上下文中丢失并被VLMs忽略。此外,设计能够经济有效地分析长视频内容的工具也很重要。在本文中,我们提出了一种片段选择方法,旨在选择应包含在多模态摘要中的关键视频时刻。我们将视频划分为短片段,并使用轻量级视频描述模型生成每个片段的紧凑视觉描述。然后将这些描述传递给大型语言模型(LLM),该模型选择包含最多相关视觉信息的K个片段以构建多模态摘要。我们在MovieSum数据集中的人类标注的屏幕剧和摘要的参考片段上评估了我们的方法。我们进一步表明,这些参考片段(不到电影的6%)足以构建MovieSum中电影的完整多模态摘要。使用我们的片段选择方法,我们实现了与这些参考片段相当的摘要性能,同时捕获了比随机片段选择多得多的相关视频信息。重要的是,我们通过依赖轻量级描述模型保持了较低的计算成本。
Summary / 总结
This paper proposes a method for summarizing long videos by selecting key moments for a multimodal summary. The video is divided into short clips, each described using a lightweight video captioning model. A large language model then selects the K most relevant clips. The approach achieves summarization performance close to that of reference clips, which are automatically derived from full human-annotated screenplays and summaries, while capturing more relevant video information than random selection. The method maintains low computational cost by using a lightweight captioning model.
本文提出了一种通过Vision-Language模型选择关键时刻来总结长视频的方法。视频被分割成短片段,每个片段使用轻量级视频描述模型进行描述。然后,大型语言模型选择最相关的片段以构建多模态摘要。该方法的总结性能接近于从完整的人工标注剧本和摘要中自动提取的参考片段,同时比随机选择片段捕获了更多的相关视频信息。这种方法通过使用轻量级描述模型保持了较低的计算成本。
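The clip selection pipeline in this entry has three steps: split the video into short clips, caption each clip, and let a language model pick K clips. The skeleton below sketches that control flow. The captioner and the selector are passed in as plain callables because the abstract does not name concrete models; the stub functions used in the example are hypothetical placeholders, not real model calls.

```python
def split_into_clips(duration, clip_len):
    """Return (start, end) windows that cover [0, duration] without overlap."""
    clips, start = [], 0.0
    while start < duration:
        clips.append((start, min(start + clip_len, duration)))
        start += clip_len
    return clips

def select_key_clips(duration, clip_len, caption_fn, select_fn, k):
    """Caption every clip, ask the selector for k indices, return clips in time order."""
    clips = split_into_clips(duration, clip_len)
    captions = [caption_fn(c) for c in clips]
    chosen = select_fn(captions, k)  # indices into `clips`
    return [clips[i] for i in sorted(chosen)]

# Hypothetical stubs standing in for a captioning model and an LLM.
def stub_caption(clip):
    return "explosion" if clip[0] == 20.0 else "talking"

def stub_select(captions, k):
    # Prefer clips whose caption mentions the salient event; stable for ties.
    order = sorted(range(len(captions)), key=lambda i: captions[i] != "explosion")
    return order[:k]

key_clips = select_key_clips(50.0, 10.0, stub_caption, stub_select, k=2)
```

Keeping the captioner and selector behind simple function interfaces reflects the cost argument in the abstract: only short text descriptions, not raw frames, reach the larger model.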
The N-Body Problem: Parallel Execution from Single-Person Egocentric Video
Authors: Zhifan Zhu, Yifei Huang, Yoichi Sato, Dima Damen
First: 2025-12-12T09:07:21+00:00 · Latest: 2025-12-12T09:07:21+00:00
Comments: project webpage: https://zhifanzhu.github.io/ego-nbody
Abstract
Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously reducing collision rates, object conflicts, and causal conflicts by 55%, 45%, and 55%, respectively.
中文标题/摘要
标题:N体问题:单人主观视频的并行执行
人类可以直观地并行化复杂活动,但模型能否仅通过观察单个人来学习这一点?给定一个主观视频,我们引入N体问题:如何让N个人在假设中执行视频中观察到的同一组任务。目标是最大化加速比,但简单的视频片段分配给个人往往违反现实世界的约束,导致物理上不可能的场景,如两个人使用同一个物体或占据相同的空间。为了解决这个问题,我们形式化了N体问题,并提出了一套评估标准,包括性能(加速比、任务覆盖)和可行性(空间碰撞、物体冲突和因果约束)。然后,我们引入了一种结构化的提示策略,指导视觉-语言模型(VLM)推理3D环境、物体使用和时间依赖性,以生成可行的并行执行。在来自EPIC-Kitchens和HD-EPIC的100个视频上,我们的方法在N=2时,相较于Gemini 2.5 Pro的基本提示,动作覆盖提高了45%,同时将碰撞率、物体冲突和因果冲突分别降低了55%、45%和55%。
Summary / 总结
The research aims to enable a model to parallelize complex activities observed in a single egocentric video. The N-Body Problem is formalized to maximize speed-up while respecting real-world constraints. The method uses a structured prompting strategy to guide a Vision-Language Model to reason about the 3D environment, object usage, and temporal dependencies. For N = 2, it yields a 45% boost in action coverage and reduces collision rates, object conflicts, and causal conflicts by 55%, 45%, and 55%, respectively, compared to a baseline prompt for Gemini 2.5 Pro.
研究旨在让模型能够从单个第一人称视频中并行化观察到的复杂活动,解决N-Body问题。方法包括形式化问题并使用结构化提示策略来引导视觉-语言模型推理3D环境、物体使用和时间依赖性。关键发现表明,对于N=2的情况,所提出的方法将动作覆盖率提高了45%,并减少了碰撞率、物体冲突和因果冲突分别55%、45%和55%。与Gemini 2.5 Pro的基本提示相比。
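The N-Body entry evaluates a parallel schedule on performance (speed-up) and feasibility (for example, object conflicts). The sketch below gives one plausible way to compute two such quantities for a hand-written schedule. The definitions are illustrative assumptions: speed-up is taken as total work divided by makespan, and an object conflict is counted whenever two different people use a shared object in overlapping time windows. The paper's formal metrics may differ.

```python
def speed_up(schedule):
    """Total task time divided by the makespan of the parallel schedule."""
    total = sum(end - start for _, start, end, _ in schedule)
    makespan = max(end for _, _, end, _ in schedule) - min(s for _, s, _, _ in schedule)
    return total / makespan

def object_conflicts(schedule):
    """Count pairs of segments by different people that overlap in time and share an object."""
    conflicts = 0
    for i, (p1, s1, e1, o1) in enumerate(schedule):
        for p2, s2, e2, o2 in schedule[i + 1:]:
            overlap = s1 < e2 and s2 < e1
            if p1 != p2 and overlap and o1 & o2:
                conflicts += 1
    return conflicts

# Hypothetical schedule entries: (person, start, end, objects used).
schedule = [
    (0, 0, 10, {"knife"}),
    (1, 5, 12, {"knife", "pan"}),
    (1, 12, 20, {"sink"}),
    (0, 10, 18, {"pan"}),
]
ratio = speed_up(schedule)
clashes = object_conflicts(schedule)
```

This schedule is faster than sequential execution but infeasible as written, since the knife and the pan are each claimed by both people at once; that tension between speed and feasibility is what the entry's metrics are meant to expose.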
Zero-shot 3D Map Generation with LLM Agents: A Dual-Agent Architecture for Procedural Content Generation
Authors: Lim Chien Her, Ming Yan, Yunshu Bai, Ruihao Li, Hao Zhang
First: 2025-12-11T10:22:02+00:00 · Latest: 2025-12-12T08:48:44+00:00
Comments: 12 pages, 6 figures
Abstract
Procedural Content Generation (PCG) offers scalable methods for algorithmically creating complex, customizable worlds. However, controlling these pipelines requires the precise configuration of opaque technical parameters. We propose a training-free architecture that utilizes LLM agents for zero-shot PCG parameter configuration. While Large Language Models (LLMs) promise a natural language interface for PCG tools, off-the-shelf models often fail to bridge the semantic gap between abstract user instructions and strict parameter specifications. Our system pairs an Actor agent with a Critic agent, enabling an iterative workflow where the system autonomously reasons over tool parameters and refines configurations to progressively align with human design preferences. We validate this approach on the generation of various 3D maps, establishing a new benchmark for instruction-following in PCG. Experiments demonstrate that our approach outperforms single-agent baselines, producing diverse and structurally valid environments from natural language descriptions. These results demonstrate that off-the-shelf LLMs can be effectively repurposed as generalized agents for arbitrary PCG tools. By shifting the burden from model training to architectural reasoning, our method offers a scalable framework for mastering complex software without task-specific fine-tuning.
中文标题/摘要
标题:零样本3D地图生成:双重代理架构的LLM代理程序内容生成
程序化内容生成(PCG)提供了通过算法创建复杂可定制世界的可扩展方法。然而,控制这些管道需要精确配置不透明的技术参数。我们提出了一种无需训练的架构,利用LLM代理程序进行零样本PCG参数配置。虽然大型语言模型(LLMs)承诺为PCG工具提供自然语言界面,但现成的模型往往无法弥合抽象用户指令与严格参数规范之间的语义差距。我们的系统将一个执行者代理与一个评论者代理配对,使系统能够自主推理工具参数并逐步优化配置以与人类设计偏好相一致。我们在生成各种3D地图上验证了这种方法,建立了PCG指令遵循的新基准。实验表明,我们的方法优于单代理基线,能够从自然语言描述中生成多样且结构有效的环境。这些结果表明,现成的LLM可以有效重新利用为任意PCG工具的一般代理程序。通过将负担从模型训练转移到架构推理,我们的方法提供了一种无需特定任务微调的复杂软件掌握的可扩展框架。
Summary / 总结
The research aims to address the challenge of controlling procedural content generation (PCG) pipelines by proposing a training-free architecture using LLM agents. The method involves an Actor and a Critic agent working iteratively to configure PCG parameters based on natural language instructions. Experiments show that this dual-agent approach outperforms single-agent baselines, generating diverse and structurally valid 3D maps from natural language descriptions, thus demonstrating the potential of off-the-shelf LLMs for PCG tasks.
研究旨在通过使用LLM代理提出一种无需训练的架构,解决控制程序化内容生成(PCG)管道的问题。方法包括一个Actor代理和一个Critic代理,它们通过迭代工作来根据自然语言指令配置PCG参数。实验表明,这种双代理方法在生成来自自然语言描述的多样化和结构上有效的3D地图方面优于单代理基线,从而展示了现成的LLM在PCG任务中的潜力。
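The dual-agent workflow in this entry pairs an Actor that proposes tool parameters with a Critic that reviews them, iterating until the configuration is acceptable. A minimal sketch of that loop follows. The `actor` and `critic` callables, the `tree_count` parameter, and the stopping rule are hypothetical stand-ins for LLM calls and a real PCG tool; nothing here is taken from the paper's prompts or code.

```python
def refine(instruction, actor, critic, max_rounds=5):
    """Alternate Actor proposals and Critic reviews until accepted or out of rounds."""
    params, feedback = None, None
    for _ in range(max_rounds):
        params = actor(instruction, params, feedback)
        accepted, feedback = critic(instruction, params)
        if accepted:
            break
    return params

# Hypothetical stubs: a real system would call an LLM in both roles.
def stub_actor(instruction, params, feedback):
    if params is None:
        return {"tree_count": 10}  # first guess
    if feedback == "increase":
        return {"tree_count": params["tree_count"] + 20}
    return params

def stub_critic(instruction, params):
    # Accept once the map is dense enough for a "forest" request.
    if params["tree_count"] >= 50:
        return True, None
    return False, "increase"

final_params = refine("a dense forest map", stub_actor, stub_critic)
```

Capping the number of rounds keeps the loop bounded even when the Critic never accepts, which matters when each round costs a model call.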
Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture
Authors: Tanu Singh, Pranamesh Chakraborty, Long T. Truong
First: 2025-12-12T07:57:36+00:00 · Latest: 2025-12-12T07:57:36+00:00
Abstract
Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.
中文标题/摘要
标题:基于监视视频的交通事故检测的变压器架构
道路交通事故是全球导致死亡的主要原因之一,由于人口增长、城市化和机动化,事故率正在上升。事故率的上升引发了对交通监控有效性的担忧。传统的事故检测计算机视觉方法在时空理解有限和跨域泛化能力差方面存在困难。最近的变压器架构在建模全局空间-时间依赖性和并行计算方面表现出色。然而,将这些模型应用于自动交通事故检测受到小规模、非多样数据集的限制,阻碍了稳健、可泛化的系统的开发。为了解决这一差距,我们整理了一个全面且平衡的数据集,涵盖了广泛的交通环境、事故类型和上下文变化。利用整理的数据集,我们提出了一种基于变压器架构的事故检测模型,该模型使用预提取的空间视频特征。该架构采用卷积层提取帧内多种模式的局部相关性,同时利用变压器捕捉检索特征之间的序列-时间依赖性。此外,大多数现有研究忽略了运动线索的整合,这对于理解动态场景,尤其是在事故期间,是必不可少的。这些方法通常依赖于静态特征或粗略的时间信息。在本研究中,评估了多种整合运动线索的方法,以确定最有效的策略。在测试的输入方法中,将RGB特征与光流特征进行拼接实现了最高的准确率88.3%。结果还与视觉语言模型(VLM)如GPT、Gemini和LLaVA-NeXT-Video进行了比较,以评估所提方法的有效性。
Summary / 总结
This study addresses the challenge of detecting traffic accidents in surveillance videos by proposing a model based on a transformer architecture. Motivated by the rising global incidence of traffic accidents, the research aims to improve traffic surveillance effectiveness. The method uses a curated, balanced dataset that covers various traffic environments and accident types, and combines convolutional layers with transformers to capture both local correlations within a frame and sequential-temporal dependencies. Among the tested ways of incorporating motion cues, concatenating RGB features with optical flow achieves the highest accuracy, 88.3%. The results are also compared with vision language models such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.
该研究旨在通过基于变压器架构的模型解决交通视频中事故检测的挑战。研究动机源于全球交通事故率的上升以及传统计算机视觉方法的局限性。该方法使用了一个包含各种交通环境和事故类型的定制数据集,并结合卷积层进行局部特征提取和变压器捕捉时间依赖性。研究还评估了不同的运动线索集成方法,发现将RGB特征与光流结合使用达到了最高的准确率88.3%。结果表明,该方法在与视觉语言模型(如GPT、Gemini和LLaVA-NeXT-Video)的比较中表现出有效性。
Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models
Authors: Minghao Yin, Yukang Cao, Kai Han
First: 2025-11-27T13:03:57+00:00 · Latest: 2025-12-12T06:34:34+00:00
Abstract
We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
中文标题/摘要
标题:悟空的72变:基于流模型的无训练高保真纹理3D变形
我们提出了一种名为WUKONG的新型无训练框架,用于高保真纹理3D变形,该框架以一对源和目标提示(图像或文本)作为输入。与传统的依赖于手动对应匹配和变形轨迹估计的方法(限制了泛化能力并需要昂贵的预处理)不同,WUKONG利用基于流的生成器的生成先验来生成具有丰富纹理细节的高保真3D过渡。为了确保形状过渡的平滑性,我们利用基于流的生成过程的内在连续性,并将变形问题形式化为最优传输重心问题。我们进一步引入了一种顺序初始化策略,以防止突然的几何失真并保持身份一致性。为了忠实保留纹理,我们提出了一种基于相似性的语义一致性机制,该机制选择性地保留高频细节并允许对混合动力学进行精确控制。这避免了常见的过度平滑等伪影,同时保持语义保真度。广泛的定量和定性评估表明,WUKONG显著优于最先进的方法,在各种几何和纹理变化中取得了更好的结果。
Summary / 总结
WUKONG is a training-free framework for high-fidelity textured 3D morphing that uses a pair of source and target prompts as input. Unlike conventional methods, WUKONG utilizes flow-based transformers to generate smooth and detailed 3D transitions without manual correspondence matching. It formulates morphing as an optimal transport barycenter problem and introduces a sequential initialization strategy and a similarity-guided semantic consistency mechanism to ensure smooth shape and texture transitions. Experimental results show that WUKONG outperforms existing methods in handling diverse geometry and texture variations.
WUKONG 是一个无需训练的框架,用于通过图像或文本提示进行高保真纹理 3D 变形。它利用流基变换器生成平滑且细节丰富的 3D 过渡,无需手动对应匹配。WUKONG 将变形问题表述为最优传输重心问题,并引入了顺序初始化策略和相似性引导的语义一致性机制,以保持身份和纹理细节。实验结果表明,WUKONG 在处理各种几何和纹理变化方面优于现有方法。
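As a reading aid, the "optimal transport barycenter" framing mentioned in the abstract can be written in its standard two-measure form. This is a sketch under assumptions: the symbols $\mu_{\text{src}}$, $\mu_{\text{tgt}}$, the interpolation weight $\alpha$, and the choice of the 2-Wasserstein distance are illustrative and not taken from the paper, whose exact objective may differ.

```latex
% Weighted Wasserstein barycenter between a source and a target distribution.
% alpha = 0 recovers the source, alpha = 1 recovers the target.
\mu_\alpha \;=\; \arg\min_{\mu}\;
  (1-\alpha)\, W_2^2(\mu, \mu_{\text{src}})
  \;+\; \alpha\, W_2^2(\mu, \mu_{\text{tgt}}),
  \qquad \alpha \in [0, 1].

% When an optimal transport map T from mu_src to mu_tgt exists, the minimizer
% is the displacement interpolation (pushforward of the source):
\mu_\alpha \;=\; \bigl((1-\alpha)\,\mathrm{id} + \alpha\, T\bigr)_{\#}\, \mu_{\text{src}}.
```

Sweeping $\alpha$ from 0 to 1 gives a continuous family of intermediate distributions, which is one way to see why a barycenter formulation yields smooth transitions.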
Benchmarking the Generality of Vision-Language-Action Models
Authors: Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, Mridul Sharma, Helen Lu, Sean Rivera, Aryan Khurana, Hangliang Ren, Yangyue Wang
First: 2025-12-12T06:31:52+00:00 · Latest: 2025-12-12T06:31:52+00:00
Comments: 23 pages, 7 figures, and 1 table
Abstract
Generalist multimodal agents are expected to unify perception, language, and control, operating robustly across diverse real-world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross-domain generality of vision-language models (VLMs) and vision-language-action models (VLAs) across six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi-agent coordination, and continuous robot control. Evaluating GPT 5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross-domain task shifts despite strong performance within their training distributions. These failures manifest as modality misalignment, output format instability, and catastrophic knowledge degradation under domain transfer. Our findings reveal a persistent gap between the aspiration of generalist intelligence and the actual capabilities of current foundation models. MultiNet v1.0 provides a standardized evaluation substrate for diagnosing these gaps and guiding the development of future generalist agents. Code, data, and leaderboards are publicly available.
中文标题/摘要
标题:视觉-语言-行动模型通用性的基准测试
通用型多模态代理被期望能够统一感知、语言和控制能力,在多种现实世界领域中稳健地运行。然而,当前的评估实践仍然分散在孤立的基准上,使得难以评估今天的基础模型是否真正超越了它们的训练分布。我们引入了MultiNet v1.0,这是一个统一的基准,用于衡量视觉语言模型(VLMs)和视觉语言行动模型(VLAs)在六个基础能力领域的跨域通用性。视觉定位、空间推理、工具使用、物理常识、多智能体协调和连续机器人控制。评估GPT 5、Pi0和Magma,我们发现没有一个模型表现出一致的通用性。所有模型在未见过的领域、不熟悉的模态或跨域任务转移中都表现出显著的退化。这些失败表现为模态对齐不良、输出格式不稳定以及领域迁移下的知识灾难性退化。我们的研究结果揭示了通用智能的期望与当前基础模型实际能力之间持续存在的差距。MultiNet v1.0 提供了一个标准化的评估平台,用于诊断这些差距并指导未来通用智能代理的发展。代码、数据和排行榜已公开。
Summary / 总结
The study aims to assess the generalization capabilities of vision-language-action models across diverse domains by introducing MultiNet v1.0, a unified benchmark. The evaluation of GPT 5, Pi0, and Magma shows that none of these models consistently generalize well, experiencing significant performance drops in unseen domains or when tasks shift across domains. The findings highlight the need for improving the robustness of these models to handle unfamiliar modalities and cross-domain task shifts.
研究旨在评估视觉-语言-行动模型在多种现实世界领域的泛化能力。引入了MultiNet v1.0统一基准,涵盖六个能力领域。评估GPT 5、Pi0和Magma等模型后,研究发现这些模型均未表现出一致的泛化能力,在未见过的领域或不熟悉的模态中表现出显著的性能下降。研究结果揭示了通用智能的期望与当前模型能力之间的持续差距,强调了该领域进一步发展的必要性。
The Finer the Better: Towards Granular-aware Open-set Domain Generalization
Authors: Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen
First: 2025-11-21T06:19:19+00:00 · Latest: 2025-12-12T06:30:23+00:00
Comments: 9 pages, 3 figures, AAAI 2026
Abstract
Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still face a dilemma between the structural risk of known classes and the open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
中文标题/摘要
标题:精益求精:面向细粒度的开放集领域泛化
开放集领域泛化(OSDG)处理部署模型遇到领域转移和新对象类别的情况。尽管视觉语言模型如CLIP取得了显著进展,现有方法仍然在已知类别结构风险和未知类别开放空间风险之间陷入困境,并且容易表现出过度自信,尤其是在区分与已知类别具有细粒度视觉相似性的“硬未知”时。为此,我们提出了一种语义增强CLIP(SeeCLIP)框架,通过细粒度语义增强明确解决这一困境。在SeeCLIP中,我们提出了一种语义感知提示增强模块,将图像分解为具有区分性的语义标记,使视觉-语言对齐超越粗略类别标签。为了有效定位未知提示,我们引入了互补目标的双对比学习,即排斥以保持与已知类别的可分性,凝聚力以保持语义邻近性。此外,我们的语义引导扩散模块通过扰动提取的语义标记合成伪未知样本,生成与已知类别视觉相似但具有关键局部差异的具有挑战性的样本。这些难负面对模型学习更精细的决策边界具有推动作用。在五个基准上的广泛实验表明,与最先进的方法相比,准确率和H分数分别提高了3%和5%。
Summary / 总结
This paper addresses the challenge of Open-Set Domain Generalization (OSDG) by proposing a Semantic-enhanced CLIP (SeeCLIP) framework. The method introduces a semantic-aware prompt enhancement module and duplex contrastive learning to handle fine-grained visual similarities between known and unknown classes. The SeeCLIP framework also includes a semantic-guided diffusion module that generates hard negatives, which forces the model to learn finer decision boundaries. Experiments show consistent improvements of 3% accuracy and 5% H-score over existing methods across five benchmarks.
论文提出了一个语义增强的CLIP(SeeCLIP)框架来解决开放集域适应问题。SeeCLIP引入了一个语义感知的提示增强模块,将图像分解为具有区分性的语义标记,并采用双重对比学习机制有效定位未知提示。此外,语义引导的扩散模块生成具有挑战性的伪未知样本,促使模型学习更精细的决策边界。实验结果显示,在五个基准测试中,SeeCLIP相比现有方法在准确率和H分数上分别提高了3%和5%。
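The abstract reports an H-score without defining it. In open-set evaluation this is commonly the harmonic mean of accuracy on known classes and accuracy on detecting unknowns; the sketch below assumes that definition (the paper may compute it differently), and the function name `h_score` is illustrative.

```python
def h_score(known_acc: float, unknown_acc: float) -> float:
    """Harmonic mean of known-class accuracy and unknown-class accuracy.

    Assumed definition: a model only scores well if it is good at BOTH
    classifying known classes and rejecting unknown ones.
    """
    total = known_acc + unknown_acc
    if total == 0:
        return 0.0
    return 2 * known_acc * unknown_acc / total


# A balanced model versus one that is over-confident on unknowns.
balanced = h_score(0.80, 0.80)
lopsided = h_score(0.95, 0.20)
print(f"balanced={balanced:.3f} lopsided={lopsided:.3f}")
```

The harmonic mean penalizes the lopsided case far more than an arithmetic mean would, which is why it is a natural metric when over-confidence on "hard unknowns" is the failure mode of interest.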
Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference
Authors: Adilet Metinov, Gulida M. Kudakeeva, Bolotbek uulu Nursultan, Gulnara D. Kabaeva
First: 2025-12-12T02:02:02+00:00 · Latest: 2025-12-12T02:02:02+00:00
Comments: 6 pages, 3 tables, 1 figure
Abstract
We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.
中文标题/摘要
标题:自适应软滚动KV冻结与熵引导恢复:高效大语言模型推理的次线性内存增长
我们提出了自适应软滚动KV冻结与熵引导恢复(ASR-KF-EGR),这是一种无需训练的推理时框架,用于高效的大语言模型生成。我们的方法引入了一种可逆的软冻结机制,该机制在滑动注意力窗口内识别出低重要性标记时暂时暂停关键值(KV)更新。与基于驱逐的方法不同,ASR-KF-EGR 保留所有标记在离GPU存储中,并在需要时恢复它们。我们扩展了该框架,引入了次线性冻结调度,其中冻结时长随着低重要性检测的重复而次线性增长,防止过度压缩。初步实验表明,ASR-KF-EGR 在LLaMA-3 8B 上实现了55-67% 的活动KV缓存大小减少,同时保持生成质量并通过了针扎干草堆检索测试。该方法架构无关,无需微调,并为长上下文大语言模型的内存受限部署提供了实用的解决方案。
Summary / 总结
The paper introduces ASR-KF-EGR, a training-free framework for efficient large language model inference. It uses a reversible soft-freeze mechanism to temporarily suspend updates for low-importance tokens within a sliding window, preserving all tokens in off-GPU storage. Experiments show a 55-67% reduction in active KV cache size while maintaining generation quality. The method is architecture-agnostic and does not require fine-tuning, offering a practical solution for memory-constrained LLM deployments.
ASR-KF-EGR 是一种无需训练的推理时框架,能在保持生成质量的同时,将活跃的 KV 缓存大小减少 55-67%,并通过检索测试。该方法引入了一种可逆的软冻结机制,暂时冻结低重要性标记,并通过次线性冻结调度防止过度压缩。该方法不依赖于架构,无需微调,为内存受限的 LLM 部署提供了一个实用的解决方案。
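The reversible soft-freeze idea with a sublinear schedule can be illustrated with a toy cache. This is a minimal sketch, not the authors' implementation: the square-root schedule, the `base_steps` parameter, and the `SoftFreezeCache` class are all assumptions chosen for illustration, and the entropy-guided recovery trigger is reduced to an explicit `restore` call.

```python
import math


def freeze_duration(low_importance_hits: int, base_steps: int = 4) -> int:
    """Steps to keep a token frozen; grows like sqrt(hits), i.e. sublinearly.

    The sqrt form is an assumed example of a sublinear schedule.
    """
    if low_importance_hits <= 0:
        return 0
    return math.ceil(base_steps * math.sqrt(low_importance_hits))


class SoftFreezeCache:
    """Toy KV cache: frozen entries are moved aside, never discarded."""

    def __init__(self):
        self.active = {}  # token_id -> kv payload (stands in for on-GPU cache)
        self.frozen = {}  # token_id -> (kv payload, step at which it thaws)
        self.hits = {}    # token_id -> count of low-importance detections

    def add(self, token_id, kv):
        self.active[token_id] = kv

    def mark_low_importance(self, token_id, step):
        """Freeze an active token for a duration based on its hit count."""
        if token_id not in self.active:
            return
        self.hits[token_id] = self.hits.get(token_id, 0) + 1
        kv = self.active.pop(token_id)
        thaw_at = step + freeze_duration(self.hits[token_id])
        self.frozen[token_id] = (kv, thaw_at)

    def thaw_due(self, step):
        """Return tokens whose freeze period has elapsed to the active set."""
        due = [t for t, (_, until) in self.frozen.items() if until <= step]
        for token_id in due:
            kv, _ = self.frozen.pop(token_id)
            self.active[token_id] = kv

    def restore(self, token_id):
        """On-demand recovery of a frozen token (trigger logic not modeled)."""
        if token_id in self.frozen:
            kv, _ = self.frozen.pop(token_id)
            self.active[token_id] = kv


cache = SoftFreezeCache()
cache.add("tok0", "kv0")
cache.mark_low_importance("tok0", step=0)  # frozen until step 4
cache.thaw_due(step=3)                     # still frozen
still_frozen = "tok0" in cache.frozen
cache.thaw_due(step=4)                     # back in the active set
```

The contrast with eviction is that `frozen` retains the payload, so a token that later becomes relevant can be brought back rather than being lost.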
Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
Authors: Kechun Xu, Zhenjie Zhu, Anzhe Chen, Shuqi Zhao, Qing Huang, Yifei Yang, Haojian Lu, Rong Xiong, Masayoshi Tomizuka, Yue Wang
First: 2025-12-12T01:59:23+00:00 · Latest: 2025-12-12T01:59:23+00:00
Abstract
The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires experienced tuning and data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. Information-theoretic analysis formally validates our effectiveness in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page is available at: https://xukechun.github.io/papers/BayesVLA.
中文标题/摘要
标题:观行合一,提示精化:视觉语言动作政策的贝叶斯分解
视觉语言动作(VLA)模型的离分布泛化追求常因视觉语言模型(VLM)主干在微调过程中灾难性遗忘而受阻。虽然与外部推理数据的联合训练有所帮助,但需要经验丰富的调优和数据相关开销。超越这些外部依赖,我们发现VLA数据集内部存在一种根本原因:模态失衡,其中语言多样性远低于视觉和动作多样性。这种失衡使模型偏向视觉捷径和语言遗忘。为解决这一问题,我们引入了BayesVLA,这是一种贝叶斯分解,将政策分解为支持观行合一的视觉-动作先验和语言条件下的似然性,从而实现提示精化。这本身保留了泛化能力并促进了指令遵循。我们进一步引入了接触前和接触后阶段,以更好地利用预训练基础模型。信息论分析正式验证了我们在减轻捷径学习方面的有效性。大量实验表明,与现有方法相比,我们在未见过的指令、物体和环境中具有更优的泛化能力。项目页面可访问:https://xukechun.github.io/papers/BayesVLA.
Summary / 总结
The research aims to improve out-of-distribution generalization in Vision-Language-Action models by addressing the issue of modality imbalance in datasets, which leads to language forgetting and reliance on visual shortcuts. BayesVLA, a Bayesian factorization method, decomposes the policy into a visual-action prior and a language-conditioned likelihood, enhancing generalization and instruction following. Experiments demonstrate superior performance in handling unseen instructions, objects, and environments compared to existing methods.
论文通过引入BayesVLA解决了Vision-Language-Action (VLA) 模型的分布外泛化问题,该方法将策略分解为视觉-动作先验和语言条件下的似然性。这有助于减少模型依赖视觉捷径和忘记语言指令的趋势。实验表明,BayesVLA 在处理未见过的指令、对象和环境方面优于现有方法,从而提高了泛化能力。信息论分析支持该方法在减少捷径学习方面的有效性。
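The factorization described in the abstract follows from Bayes' rule applied to the language instruction. The notation below is an assumption made for illustration ($a$ for action, $v$ for visual observation, $\ell$ for the language instruction); the paper's own symbols and any additional conditioning may differ.

```latex
% Policy over actions given vision v and language l, factored with Bayes' rule.
\pi(a \mid v, \ell)
  \;=\; \frac{p(a \mid v)\; p(\ell \mid a, v)}{p(\ell \mid v)}
  \;\propto\;
  \underbrace{p(a \mid v)}_{\text{visual-action prior (seeing to act)}}
  \;\cdot\;
  \underbrace{p(\ell \mid a, v)}_{\text{language-conditioned likelihood (prompting to specify)}}
```

Read this way, the prior proposes actions that are plausible for the scene regardless of the instruction, and the likelihood reweights them by how well they match the instruction, so the language term cannot simply be bypassed by a visual shortcut.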
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling
Authors: Bryan Wong, Jong Woo Kim, Huazhu Fu, Mun Yong Yi
Venue: NeurIPS 2025
First: 2025-05-23T14:48:32+00:00 · Latest: 2025-12-12T01:52:09+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at https://github.com/bryanwong17/HiVE-MIL.
中文标题/摘要
标题:通过分层视觉-语言对齐和建模从吉格像素图像进行少量样本学习
视觉-语言模型(VLMs)已集成到多个实例学习(MIL)框架中,以解决全切片图像(WSIs)少量样本、弱监督分类的挑战。一个关键趋势是利用多尺度信息以更好地表示分层组织结构。然而,现有方法通常面临两个关键限制:(1)在同一模态跨尺度(例如,5x和20x)内的交互建模不足,(2)同一尺度上视觉和文本模态之间的对齐不足。为了解决这些差距,我们提出了HiVE-MIL,这是一种分层视觉-语言框架,构建了一个统一图,包括(1)粗(5x)和细(20x)视觉/文本节点之间的父节点-子节点链接,以捕捉分层关系,以及(2)同一尺度上视觉和文本节点之间的异质内尺度边。为了进一步增强语义一致性,HiVE-MIL引入了一种两阶段、文本引导的动态过滤机制,去除弱相关的小块-文本对,并引入了分层对比损失以在不同尺度上对齐文本语义。在TCGA乳腺、肺和肾癌数据集上的广泛实验表明,HiVE-MIL在16样本设置下的一致性宏F1得分上优于传统MIL和最近的基于VLM的MIL方法,提高了高达4.1%。我们的结果表明,联合建模分层结构和多模态对齐对于从有限的病理数据中高效和可扩展地学习具有价值。代码可在https://github.com/bryanwong17/HiVE-MIL/ 获取。
Summary / 总结
The research aims to improve few-shot, weakly supervised classification of whole slide images (WSIs) using vision-language models (VLMs) integrated into multiple instance learning (MIL) frameworks. The proposed HiVE-MIL framework addresses two limitations, insufficient modeling of cross-scale interactions within each modality and inadequate alignment between visual and textual modalities on the same scale, by constructing a unified graph with parent-child links across scales and heterogeneous intra-scale edges. It also includes a two-stage text-guided dynamic filtering mechanism and a hierarchical contrastive loss. Experiments on TCGA datasets show that HiVE-MIL outperforms traditional MIL and VLM-based MIL approaches, achieving up to 4.1% improvement in macro F1 under 16-shot settings.
研究旨在利用视觉-语言模型(VLMs)结合多实例学习(MIL)框架,提高全切片图像(WSIs)的少样本、弱监督分类。HiVE-MIL是一种层次视觉-语言框架,通过构建包含父节点-子节点和同尺度内边的统一图,并引入两阶段文本引导动态过滤机制和层次对比损失,来解决交互和对齐的限制。实验结果显示,HiVE-MIL在TCGA数据集上优于传统MIL和最近的VLM-MIL方法,16-shot设置下宏观F1得分提高了最多4.1%。
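The headline result is reported in macro F1, which weights every class equally regardless of how many slides it has. A small pure-Python sketch of the standard metric follows; the function name and the toy labels are illustrative and not drawn from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy two-class example: one slide of class 1 is misclassified as class 0.
score = macro_f1([0, 0, 1, 1], [0, 0, 1, 0])
print(round(score, 4))
```

Because each class contributes equally, a gain in macro F1 cannot come solely from doing better on the majority subtype, which matters for few-shot pathology settings with imbalanced classes.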
Noise Matters: Optimizing Matching Noise for Diffusion Classifiers
Authors: Yanghao Wang, Long Chen
First: 2025-08-15T09:01:03+00:00 · Latest: 2025-12-12T01:40:40+00:00
Abstract
Although today's pretrained discriminative vision-language models (e.g., CLIP) have demonstrated strong perception abilities, such as zero-shot image classification, they also suffer from the bag-of-words problem and spurious bias. To mitigate these problems, some pioneering studies leverage powerful generative models (e.g., pretrained diffusion models) to realize generalizable image classification, dubbed Diffusion Classifier (DC). Specifically, by randomly sampling a Gaussian noise, DC utilizes the differences in denoising effects under different category conditions to classify categories. Unfortunately, an inherent and notorious weakness of existing DCs is noise instability: different randomly sampled noises lead to significant performance changes. To achieve stable classification performance, existing DCs always ensemble the results of hundreds of sampled noises, which significantly reduces the classification speed. To this end, we first explore the role of noise in DC and conclude that there are some "good noises" that can relieve the instability. Meanwhile, we argue that these good noises should meet two principles: Frequency Matching and Spatial Matching. Based on both principles, we propose a novel Noise Optimization method to learn matching (i.e., good) noise for DCs: NoOp. For Frequency Matching, NoOp first optimizes a dataset-specific noise: given a dataset and a timestep t, it optimizes one randomly initialized parameterized noise. For Spatial Matching, NoOp trains a Meta-Network that takes an image as input and outputs an image-specific noise offset. The sum of the optimized noise and the noise offset is used in DC to replace random noise. Extensive ablations on various datasets demonstrate the effectiveness of NoOp.
中文标题/摘要
标题:噪声很重要:优化扩散分类器中的匹配噪声
尽管今天预训练的视觉-语言辨别模型(例如CLIP)在零样本图像分类等感知能力方面表现出色,但它们也面临着词汇袋问题和虚假偏见。为了解决这些问题,一些开创性研究利用强大的生成模型(例如预训练的扩散模型)实现通用图像分类,称为扩散分类器(DC)。具体来说,通过随机采样高斯噪声,DC利用不同类别条件下的去噪效果差异来进行分类。不幸的是,现有DC的一个固有且众所周知的弱点是噪声不稳定:不同的随机采样噪声会导致显著的性能变化。为了实现稳定的分类性能,现有的DC总是将数百次采样噪声的结果进行集成,这显著降低了分类速度。为此,我们首先探索了噪声在DC中的作用,并得出结论:存在一些“好的噪声”可以缓解这种不稳定性。同时,我们认为这些好的噪声应该满足两个原则:频率匹配和空间匹配。针对这两个原则,我们提出了一种新的噪声优化方法来学习适合DC的匹配(即好的)噪声:NoOp。对于频率匹配,NoOp首先优化一个数据集特定的噪声:给定一个数据集和一个时间步t,优化一个随机初始化的参数化噪声。对于空间匹配,NoOp训练一个元网络,该网络以图像为输入并输出图像特定的噪声偏移。优化后的噪声和噪声偏移的总和将在DC中使用以替代随机噪声。在各种数据集上的广泛消融实验证明了NoOp的有效性。
Summary / 总结
This paper addresses the issue of noise instability in Diffusion Classifiers (DCs), which suffer from significant performance fluctuations due to different sampled noises. To stabilize classification performance, the authors propose NoOp, a novel Noise Optimization method that learns 'good noises' through Frequency Matching and Spatial Matching. Frequency Matching optimizes a dataset-specific noise, while Spatial Matching trains a Meta-Network to output image-specific noise offsets. Experimental results on various datasets show the effectiveness of NoOp in achieving stable classification performance without the need for noise ensembling.
本文探讨了扩散分类器(DC)中的噪声问题,并提出了一种名为NoOp的噪声优化方法,旨在找到可以稳定分类性能的“好噪声”。NoOp包括两个原则:频率匹配和空间匹配。频率匹配优化了特定数据集的噪声,而空间匹配则训练了一个元网络,输入图像并输出图像特定的噪声偏移。在各种数据集上的实验证明,NoOp能够实现稳定的分类性能,无需进行噪声集成,从而提高分类速度。
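To make the role of the noise concrete, the sketch below shows the usual diffusion-classifier decision rule (pick the class whose conditional noise prediction best matches the injected noise) with the random noise replaced by the sum of a dataset-level noise and an image-level offset, as the abstract describes. Everything else is a toy assumption: `toy_eps_predictor`, `PROTOTYPES`, and `ALPHA_BAR` stand in for a real pretrained diffusion model and its schedule, and the noise values are made up.

```python
import math

ALPHA_BAR = 0.5  # assumed cumulative signal level at the chosen timestep
PROTOTYPES = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}  # toy class "images"


def matched_noise(dataset_noise, image_offset):
    """Noise used for classification: dataset-specific noise + per-image offset."""
    return [d + o for d, o in zip(dataset_noise, image_offset)]


def toy_eps_predictor(x_t, label):
    """Stand-in for a conditional denoiser: assumes the clean image is the prototype."""
    a, b = math.sqrt(ALPHA_BAR), math.sqrt(1 - ALPHA_BAR)
    return [(xt - a * m) / b for xt, m in zip(x_t, PROTOTYPES[label])]


def diffusion_classify(x, noise, alpha_bar, eps_predictor, classes):
    """Return the class whose predicted noise is closest to the injected noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1 - alpha_bar)
    x_t = [a * xi + b * ni for xi, ni in zip(x, noise)]  # forward-noised input

    def error(label):
        pred = eps_predictor(x_t, label)
        return sum((p - n) ** 2 for p, n in zip(pred, noise))

    return min(classes, key=error)


noise = matched_noise(dataset_noise=[0.3, -0.2], image_offset=[0.05, 0.1])
label = diffusion_classify([0.9, 0.1], noise, ALPHA_BAR, toy_eps_predictor, list(PROTOTYPES))
print(label)
```

With a single fixed noise vector the classifier needs one denoising pass per class, whereas ensembling over hundreds of random noises multiplies that cost; that is the speed argument made in the abstract.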
GoalLadder: Incremental Goal Discovery with Vision-Language Models
Authors: Alexey Zakharov, Shimon Whiteson
Venue: NeurIPS 2025
First: 2025-06-19T15:28:27+00:00 · Latest: 2025-12-12T00:33:49+00:00
Comments: NeurIPS 2025
Abstract
Natural language can offer a concise and human-interpretable means of specifying reinforcement learning (RL) tasks. The ability to extract rewards from a language instruction can enable the development of robotic systems that can learn from human guidance; however, it remains a challenging problem, especially in visual environments. Existing approaches that employ large, pretrained language models either rely on non-visual environment representations, require prohibitively large amounts of feedback, or generate noisy, ill-shaped reward functions. In this paper, we propose a novel method, GoalLadder, that leverages vision-language models (VLMs) to train RL agents from a single language instruction in visual environments. GoalLadder works by incrementally discovering states that bring the agent closer to completing a task specified in natural language. To do so, it queries a VLM to identify states that represent an improvement in the agent's task progress and to rank them using pairwise comparisons. Unlike prior work, GoalLadder does not trust the VLM's feedback completely; instead, it uses it to rank potential goal states using an ELO-based rating system, thus reducing the detrimental effects of noisy VLM feedback. Over the course of training, the agent is tasked with minimising the distance to the top-ranked goal in a learned embedding space, which is trained on unlabelled visual data. This key feature allows us to bypass the need for abundant and accurate feedback typically required to train a well-shaped reward function. We demonstrate that GoalLadder outperforms existing related methods on classic control and robotic manipulation environments, with an average final success rate of ~95% compared to only ~45% for the best competitor.
中文标题/摘要
标题:GoalLadder:利用视觉语言模型实现逐步目标发现的强化学习
自然语言可以提供一种简洁且易于人类理解的方式来指定强化学习(RL)任务。从语言指令中提取奖励的能力可以促进能够从人类指导中学习的机器人系统的开发;然而,在视觉环境中,这仍然是一个具有挑战性的问题。现有的方法要么依赖于非视觉环境表示,要么需要大量的反馈,要么生成嘈杂且形状不规则的奖励函数。在本文中,我们提出了一种名为GoalLadder的新方法,该方法利用视觉语言模型(VLMs)从单个语言指令中训练视觉环境中的RL代理。GoalLadder通过逐步发现将代理带向完成自然语言指定任务的步骤来工作。它通过查询VLM来识别表示代理任务进展改进的状态,并使用成对比较对其进行排名。与先前的工作不同,GoalLadder不会完全信任VLM的反馈,而是使用它来通过基于ELO的评分系统对潜在的目标状态进行排名,从而减少嘈杂VLM反馈的负面影响。在训练过程中,代理被要求在学习的未标记视觉数据嵌入空间中最小化与排名最高的目标之间的距离。这一关键特性使我们能够绕过通常需要大量准确反馈来训练良好形状的奖励函数的需求。我们证明,与现有的相关方法相比,GoalLadder在经典控制和机器人操作环境中表现出色,平均最终成功率约为95%,而最佳竞争对手的成功率仅为约45%。
Summary / 总结
GoalLadder is a method that uses vision-language models to train reinforcement learning agents from a single language instruction in visual environments. It incrementally discovers states that improve task progress and ranks them using an ELO-based system to mitigate noisy feedback. GoalLadder outperforms existing methods with an average final success rate of about 95% in classic control and robotic manipulation environments, compared to only about 45% for the best competitor.
GoalLadder 是一种使用视觉语言模型从单个语言指令中训练强化学习代理的方法,适用于视觉环境。它通过查询 VLM 并使用 ELO 等级系统对潜在目标状态进行排名,逐步发现使代理更接近完成任务的状态。GoalLadder 的表现优于现有方法,在经典控制和机器人操作环境中,其最终成功率约为 95%,而最佳竞争对手的成功率为约 45%。
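The ELO-based ranking of candidate goal states can be sketched with the standard Elo update rule. This is an illustration under assumptions: the K-factor of 32, the 400-point scale, the starting rating of 1000, and the state names are conventional or made-up values, not settings from the paper, and the pairwise outcomes stand in for VLM judgments.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta


# Candidate goal states, all starting from the same rating.
ratings = {"s0": 1000.0, "s1": 1000.0, "s2": 1000.0}

# (winner, loser) pairs as a VLM might report them; the fourth is a noisy
# judgment that contradicts the others.
comparisons = [("s2", "s1"), ("s2", "s0"), ("s1", "s0"), ("s0", "s2"), ("s2", "s1")]

for winner, loser in comparisons:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], a_wins=True)

top_goal = max(ratings, key=ratings.get)
print(top_goal, {k: round(v, 1) for k, v in ratings.items()})
```

A single contradictory comparison only shifts ratings by a bounded amount, so the top-ranked goal is determined by the weight of evidence rather than by any one VLM answer.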
Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context
Authors: Anatole Jacquin de Margerie, Alexis Roger, Irina Rish
Venue: AAAI 2025
First: 2025-12-11T23:17:38+00:00 · Latest: 2025-12-11T23:17:38+00:00
Comments: Accepted in AAAI 2025 Workshop on Reproducible AI
Abstract
Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work by investigating the effect of including the global context, which provides practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.
中文标题/摘要
标题:高分辨率推理中的图像镶嵌:局部细节与全局上下文的平衡
可再现性仍然是科学进步的基石,但复杂的多模态模型往往缺乏透明的实现细节和可访问的训练基础设施。在本研究中,我们详细再现并批判性分析了CVPR24上发表的猴子视觉-语言模型(VLM)(Li等,2023b),这是一种通过图像镶嵌进行高分辨率图像理解的近期方法。原始论文提出将大图像分割成块以恢复细粒度的视觉细节,同时保持计算效率。我们的研究使用开放检查点复制了这一策略,并重新实现了训练管道。我们确认了原始猴子VLM工作的关键发现,即镶嵌有效地恢复了局部细节。然后,我们进一步扩展了这项工作,通过研究全局上下文的纳入效果,为未来的高分辨率多模态建模提供了实用见解。然而,我们也报告了结果中的偏差,这些效果的大小在很大程度上取决于任务类型和块粒度。
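The tiling-plus-global-context setup studied here can be illustrated by computing which image regions would be passed to the encoder. This is a generic sketch, not the Monkey pipeline: the non-overlapping grid, the `tile` size of 448, and the function names are assumptions, and resizing of each view is omitted.

```python
def tile_boxes(width: int, height: int, tile: int):
    """Non-overlapping (left, top, right, bottom) boxes covering the image.

    Edge tiles are clipped to the image bounds.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def model_inputs(width: int, height: int, tile: int, include_global: bool = True):
    """List of (kind, box) views: an optional full-image view plus local tiles."""
    views = [("tile", box) for box in tile_boxes(width, height, tile)]
    if include_global:
        views.insert(0, ("global", (0, 0, width, height)))
    return views


views = model_inputs(1000, 500, tile=448)
for kind, box in views:
    print(kind, box)
```

Smaller tiles give more local detail but more views to encode, and dropping the global view removes the only input that sees the whole scene, which is the trade-off between tile granularity and global context that the abstract refers to.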
Limits and Gains of Test-Time Scaling in Vision-Language Reasoning
Authors: Mohammadjavad Ahmadpour, Amirmahdi Meighani, Payam Taebi, Omid Ghahroodi, Amirmohammad Izadi, Mahdieh Soleymani Baghshah
First: 2025-12-11T20:48:54+00:00 · Latest: 2025-12-11T20:48:54+00:00
Comments: Mohammadjavad Ahmadpour and Amirmahdi Meighani contributed equally to this work
Abstract
Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference time reasoning methods applied across both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.
中文标题/摘要
标题:测试时缩放在视觉语言推理中的局限与增益
测试时缩放(TTS)已成为通过在推理时分配额外计算来提高大型语言模型(LLMs)推理能力的强大范式,但其在视觉语言模型(VLMs)等多模态系统中的应用仍被广泛探索。在本研究中,我们对推理时间的推理方法进行了系统性的实证研究,应用在不同基准上的开源和闭源VLMs上。我们的结果表明,闭源模型始终从结构化推理和迭代自我完善中受益,而开源VLMs则表现出不一致的行为:外部验证提供了最可靠的增益,而迭代完善往往降低性能。我们还发现,TTS的有效性取决于数据集,它在多步推理任务上提供了明显的改进,但在感知导向的基准上仅提供有限的增益。这些发现表明,TTS并非万能解决方案,必须根据模型能力和任务特征进行定制,这激励了未来自适应TTS策略和多模态奖励模型的研究。
Summary / 总结
This study investigates the effectiveness of test-time scaling (TTS) in enhancing the reasoning abilities of Vision-Language Models (VLMs). It finds that closed-source models benefit from structured reasoning and iterative self-refinement, while open-source VLMs show inconsistent results, with external verification providing the most reliable gains. TTS improves performance on multi-step reasoning tasks but offers limited benefits on perception-focused benchmarks. These findings suggest that TTS is not a universal solution and should be tailored to specific models and tasks.
这项研究探讨了测试时缩放(TTS)在提升视觉语言模型(VLM)推理能力方面的有效性。研究发现,闭源模型从结构化推理和迭代自我完善中受益,而开源VLMs表现出不一致的结果,外部验证提供了最可靠的改进。TTS在多步推理任务中提高了性能,但在感知导向的基准测试中仅提供了有限的增益。这些发现表明,TTS并不是一种通用的解决方案,应该根据特定的模型和任务进行调整。
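The two families of test-time scaling contrasted in this entry differ in where the extra computation goes. The sketch below shows only the control flow; `verifier`, `refine_step`, and the toy arithmetic example are hypothetical placeholders, not components or results from the study.

```python
def best_of_n(candidates, verifier):
    """External verification: sample several answers, keep the best-scored one."""
    return max(candidates, key=verifier)


def self_refine(answer, refine_step, rounds: int):
    """Iterative self-refinement: the model repeatedly rewrites its own answer."""
    for _ in range(rounds):
        answer = refine_step(answer)
    return answer


# Toy question "17 + 25": candidates as a model might sample them.
candidates = ["41", "42", "52"]


def arithmetic_verifier(candidate: str) -> float:
    """Hypothetical verifier that scores a candidate by checking the sum."""
    return 1.0 if int(candidate) == 17 + 25 else 0.0


chosen = best_of_n(candidates, arithmetic_verifier)
print(chosen)
```

In `best_of_n` the quality signal comes from outside the generator, whereas in `self_refine` each round depends on the model's own judgment of its previous output, which is one plausible reading of why the two behave differently across model families.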
Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description
Authors: Nazanin Mahjourian, Vinh Nguyen
First: 2025-12-11T20:20:38+00:00 · Latest: 2025-12-11T20:20:38+00:00
Abstract
Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.
中文标题/摘要
标题:红外工业传感在增材制造场景描述中的视觉-语言模型
许多制造环境在低光条件下运行或在封闭的机器内,传统视觉系统在这种环境中表现不佳。红外相机在这种环境中提供了互补的优势。同时,监督AI系统需要大量标注的数据集,因此零样本学习框架在包括红外相机的应用中更为实际。最近在视觉-语言基础模型(VLMs)方面的进展为从配对的图像-文本表示进行零样本预测提供了新的途径。然而,当前的VLMs无法理解红外相机数据,因为它们是基于RGB数据进行训练的。本研究引入了VLM-IRIS(视觉-语言模型用于红外工业传感),这是一种零样本框架,通过将FLIR Boson传感器捕获的红外图像预处理成RGB兼容的输入,适用于基于CLIP的编码器。我们展示了在3D打印机床面上的工作件存在检测的零样本工作,其中构建板和工作件之间的温度差异使得热成像非常适合此任务。VLM-IRIS将红外图像转换为magma表示,并使用CLIP ViT-B/32编码器进行质心提示集成,无需任何模型重新训练即可在红外图像上实现高精度。这些发现表明,对VLMs的提议改进可以有效地扩展到热成像应用中,实现无标签监控。
Summary / 总结
This work addresses the challenge of using infrared cameras in low-light manufacturing environments by adapting vision-language models (VLMs) to infrared data. The proposed VLM-IRIS framework preprocesses infrared images into RGB-compatible inputs for CLIP-based encoders, enabling zero-shot workpiece presence detection on a 3D printer bed. The method achieves high accuracy without retraining the model, demonstrating the potential of VLMs for thermal applications in label-free monitoring.
该研究通过引入VLM-IRIS框架,将视觉语言模型适应红外数据,解决低光制造环境中的红外摄像头应用难题。通过将红外图像预处理为RGB兼容输入,VLM-IRIS在3D打印机床上实现了高精度的工作件存在检测,无需任何模型重新训练。这展示了视觉语言模型在无标签监控中的潜在应用价值。
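Centroid prompt ensembling can be sketched as averaging the text embeddings of several prompts per class and comparing the image embedding with each class centroid by cosine similarity. This is a minimal illustration under assumptions: the 3-dimensional vectors are made-up stand-ins for CLIP ViT-B/32 embeddings, the class names "present" and "absent" are illustrative, and the infrared-to-magma colormap step is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(embeddings):
    """Unit-length mean of unit-normalized prompt embeddings for one class."""
    unit = [normalize(e) for e in embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def classify(image_embedding, class_centroids):
    """Return (best label, cosine similarity per label)."""
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, c))
        for label, c in class_centroids.items()
    }
    return max(scores, key=scores.get), scores


# Made-up embeddings for two prompts per class.
centroids = {
    "present": centroid([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "absent": centroid([[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]),
}
label, scores = classify([0.7, 0.2, 0.1], centroids)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Averaging over several phrasings reduces sensitivity to any single prompt wording, and because only embeddings are compared, no model weights are updated.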
MoCA-Video: Motion-Aware Concept Alignment for Consistent Video Editing
Authors: Tong Zhang, Juan C Leon Alcazar, Victor Escorcia, Bernard Ghanem
First: 2025-06-01T13:28:04+00:00 · Latest: 2025-12-11T19:59:43+00:00
Abstract
We present MoCA-Video, a training-free framework for semantic mixing in videos. Operating in the latent space of a frozen video diffusion model, MoCA-Video utilizes class-agnostic segmentation with a diagonal denoising scheduler to localize and track the target object across frames. To ensure temporal stability under semantic shifts, we introduce momentum-based correction to approximate novel hybrid distributions beyond the trained data distribution, alongside a light gamma residual module that smooths out visual artifacts. We evaluate the model's performance using SSIM, LPIPS, and a proposed metric, \metricnameabbr, which quantifies semantic alignment between reference and output. Extensive evaluation demonstrates that our model consistently outperforms both training-free and trained baselines, achieving superior semantic mixing and temporal coherence without retraining. Results establish that structured manipulation of diffusion noise trajectories enables controllable and high-quality video editing under semantic shifts.
中文标题/摘要
标题:MoCA-Video:基于运动感知的概念对齐以实现一致的视频编辑
我们提出了MoCA-Video,一种无需训练的视频语义混合框架。MoCA-Video 在冻结的视频扩散模型的潜在空间中运行,利用无类别的分割和对角去噪调度器来定位和跟踪目标对象。为了在语义变化下确保时间稳定性,我们引入了基于动量的校正来近似新型混合分布,超出训练数据分布,并且还引入了轻量级的伽马残差模块来平滑视觉伪影。我们使用SSIM、LPIPS以及一个提出的度量\metricnameabbr来评估模型性能,该度量量化了参考和输出之间的语义对齐。广泛的评估表明,我们的模型在无需重新训练的情况下始终优于训练和非训练的基线,实现了更优的语义混合和时间连贯性。结果表明,结构化的扩散噪声轨迹操纵能够实现可控且高质量的语义变化下的视频编辑。
TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Authors: Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi
First: 2025-12-05T18:40:18+00:00 · Latest: 2025-12-11T19:26:00+00:00
Abstract
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
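Chinese Title / Abstract (English translation)
Title: TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Reasoning reliably about mathematics and science is still an open challenge for large vision-language models. Standard final-answer evaluation often hides reasoning errors, so silent failures persist. To address this, we introduce TRACE, a Transparent Reasoning And Consistency Evaluation framework that diagnoses the reasoning trajectory instead of looking only at the final result. At its core, TRACE uses Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps with consistency-based metrics, and reveal failures that standard evaluation misses. Our experiments show that consistency across ARS correlates with final-answer correctness and helps locate the reasoning steps where failures appear, giving actionable signals for improving the model. TRACE also defines confidence regions that separate reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.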
Summary
The research aims to improve the reliability of mathematical and scientific reasoning in large vision-language models by addressing the limitations of standard final-answer evaluation. TRACE, a framework for Transparent Reasoning And Consistency Evaluation, diagnoses reasoning trajectories using Auxiliary Reasoning Sets, which decompose complex problems and evaluate intermediate steps. Experiments show that consistency across these sets correlates with final-answer correctness and helps identify specific reasoning steps where failures occur, providing actionable signals for model improvement. Additionally, TRACE defines confidence regions to distinguish reliable from unreliable reasoning paths, aiding in effective filtering and debugging.
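In translation from the Chinese summary: the research aims to make large vision-language models more reliable at mathematical and scientific reasoning by addressing the limits of standard final-answer evaluation. TRACE is a Transparent Reasoning And Consistency Evaluation framework that uses Auxiliary Reasoning Sets to decompose complex problems and evaluate intermediate steps. Experiments show that consistency across these sets correlates with final-answer correctness and helps identify the specific steps at which reasoning fails, providing actionable signals for model improvement. TRACE also defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model improvement.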
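The abstract does not say how TRACE computes consistency, so the following is only an illustrative sketch of one plausible reading: sample several answers per sub-question and measure how often they agree with the majority answer. The function names, the majority-agreement score, and the sample data are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter

def agreement(answers):
    """Fraction of sampled answers that match the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def ars_consistency(samples_per_subquestion):
    """Per-sub-question agreement plus the mean over all sub-questions.

    `samples_per_subquestion` maps a sub-question to a list of answers
    sampled from a model (hypothetical data layout).
    """
    scores = {q: agreement(a) for q, a in samples_per_subquestion.items()}
    mean = sum(scores.values()) / len(scores)
    return scores, mean

# Made-up samples for a prism-volume problem split into three sub-questions.
samples = {
    "area of base?": ["12", "12", "12", "12"],
    "height?": ["5", "5", "4", "5"],
    "volume formula?": ["V=Bh", "V=Bh/3", "V=Bh", "V=Bh/3"],
}
scores, mean = ars_consistency(samples)
weakest = min(scores, key=scores.get)  # sub-question with the least agreement
print(scores, mean, weakest)
```

Under this reading, the sub-question with the lowest agreement is a candidate for the step where reasoning breaks down, and the mean score is the kind of quantity one could threshold when filtering reasoning paths.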
VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation
Authors: Felix O'Mahony, Roberto Cipolla, Ayush Tewari
First: 2025-12-11T19:21:47+00:00 · Latest: 2025-12-11T19:21:47+00:00
Comments: Website: https://felixomahony.github.io/vdaworld/
Abstract
Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.
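Chinese Title / Abstract (English translation)
Title: VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation
Generative video models, a leading approach to world modeling, face fundamental limitations. They frequently violate physical and logical rules, lack interactivity, and behave as opaque black boxes that are poorly suited to building structured, queryable worlds. To overcome these challenges, we propose a new paradigm that distills an image-caption pair into a tractable abstract representation optimized for simulation. We introduce the VDAWorld framework, in which a Vision-Language Model (VLM) acts as an intelligent agent that coordinates this process. The VLM autonomously builds a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, then chooses a compatible physics simulator (e.g., rigid body, fluid) to act on it. VDAWorld can infer latent dynamics from the static scene to predict plausible future states. Our experiments show that combining intelligent abstraction with adaptive simulation yields a versatile world model that produces high-quality simulations across a wide range of dynamic scenarios.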
Summary
The paper addresses the limitations of generative video models in world modeling by proposing VDAWorld, a framework that uses a Vision-Language Model to create an abstract representation of a scene. The VLM selects appropriate vision tools and physics simulators to construct and simulate the scene, allowing for the prediction of future states. Experiments demonstrate that VDAWorld can generate high-quality simulations across various dynamic scenarios.
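In translation from the Chinese summary: the paper proposes a new framework, VDAWorld, which uses a vision-language model to create a tractable abstract representation of a scene and so overcome the limitations of generative video models in world modeling. The model selects suitable vision tools and physics simulators to construct and simulate the scene, which lets it infer latent dynamics and predict future states. Experiments show that VDAWorld generates high-quality simulations across a variety of dynamic scenarios.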
Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning
Authors: Chenjun Li, Cheng Wan, Laurin Lux, Alexander Berger, Richard B. Rosen, Martin J. Menten, Johannes C. Paetzold
First: 2025-12-11T19:19:39+00:00 · Latest: 2025-12-11T19:19:39+00:00
Comments: 23 pages, 8 figures, 6 tables. Full paper under review for MIDL 2026 (Medical Imaging with Deep Learning)
Abstract
Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask about clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, for example in reading Optical Coherence Tomography Angiography (OCTA) images, such precise text with grounded descriptions of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text. Specifically, it synthesizes realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity) while automatically generating granular reasoning texts. Based on this, we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation, we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.
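Chinese Title / Abstract (English translation)
Title: Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning
Vision-Language Models (VLMs) offer a promising route to interpretable medical diagnosis, because users can ask for clinical explanations alongside predictions and across modalities. Training VLMs for detailed reasoning, however, requires large-scale image-text datasets. In many specialized domains, such as reading Optical Coherence Tomography Angiography (OCTA) images, precise text with grounded descriptions of pathologies is scarce or does not exist. To remove this bottleneck, we introduce the Synthetic Vasculature Reasoning (SVR) framework, which controllably synthesizes images and their corresponding text: realistic retinal vasculature with Diabetic Retinopathy (DR) features (capillary dropout, microaneurysms, neovascularization, and tortuosity), together with automatically generated detailed reasoning text. On this basis we build OCTA-100K-SVR, an OCTA image-reasoning dataset of 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on this dataset reaches a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Human expert evaluation further shows significantly better explanation quality and pathology localization on clinical data.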
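For readers unfamiliar with the metric quoted above, balanced accuracy is the mean of per-class recall, so each class counts equally regardless of how many samples it has. The sketch below illustrates that standard definition in plain Python; the labels are made up for illustration and are not data from the paper.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical, imbalanced toy labels (4 "DR" cases, 2 "healthy" cases).
y_true = ["DR", "DR", "DR", "DR", "healthy", "healthy"]
y_pred = ["DR", "DR", "DR", "healthy", "healthy", "DR"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

Here recall is 3/4 for "DR" and 1/2 for "healthy", so balanced accuracy is 0.625 while plain accuracy is 4/6; the gap is why balanced accuracy is often preferred when class sizes differ.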
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Authors: Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou
First: 2025-06-26T17:59:12+00:00 · Latest: 2025-12-11T19:06:30+00:00
Comments: Project webpage: https://plan-lab.github.io/hallusegbench/
Abstract
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
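Chinese Title / Abstract (English translation)
Title: Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
Segmentation Vision-Language Models (VLMs) have made significant progress in grounded visual understanding, but they remain prone to pixel-grounding hallucinations: they produce masks for the wrong objects or for objects that are not present at all. Existing evaluations rely almost entirely on text- or label-based perturbations and check only whether the predicted mask matches the queried label. Such evaluations ignore the spatial footprint and severity of the hallucination and therefore cannot reveal vision-driven hallucinations, which are both harder and more common. To close this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), in which a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support the task, we build HalluSegBench, the first large-scale benchmark that uses controlled visual counterfactuals to diagnose hallucinations in referring and reasoning expression segmentation, and we introduce new evaluation metrics that measure hallucination severity and separate vision-driven from language-driven failure modes. We also introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) so that it learns when to segment and when to abstain. Experimental results show that RobustSeg reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).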
Summary
The research addresses the issue of pixel-grounding hallucinations in Segmentation Vision-Language Models (VLMs) by introducing Counterfactual Segmentation Reasoning (CSR) and a new benchmark called HalluSegBench. The method involves training models to abstain in counterfactual images while correctly segmenting the referenced object in factual images. Experimental results show that the proposed RobustSeg model reduces hallucinations by 30% and improves segmentation performance on FP-RefCOCO(+/g).
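In translation from the Chinese summary: the paper addresses pixel-grounding hallucination in Segmentation Vision-Language Models (VLMs) by introducing Counterfactual Segmentation Reasoning (CSR) and a new benchmark, HalluSegBench. The approach includes building a dataset of controlled visual counterfactuals and developing new evaluation metrics that assess hallucination severity. The key finding is that a model trained with counterfactual fine-tuning (RobustSeg) reduces hallucinations by 30% while improving segmentation performance on FP-RefCOCO(+/g).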
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov
First: 2025-12-11T18:59:56+00:00 · Latest: 2025-12-11T18:59:56+00:00
Comments: Project page: https://snap-research.github.io/omni-attribute
Abstract
Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
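Chinese Title / Abstract (English translation)
Title: Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Visual concept personalization aims to transfer specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. Existing methods, however, rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it hard to isolate a single attribute. This often causes information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach designs the data and the model jointly: (i) we collect semantically linked image pairs annotated with positive and negative attributes to teach the encoder explicitly what to preserve and what to suppress; (ii) we adopt a dual-objective training paradigm that balances generative fidelity against contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, and reach state-of-the-art performance on multiple benchmarks.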
Summary
Omni-Attribute is an open-vocabulary image attribute encoder designed to isolate and transfer specific visual attributes like identity, expression, lighting, and style. It uses semantically linked image pairs and a dual-objective training approach to learn high-fidelity, attribute-specific representations, improving attribute retrieval, personalization, and compositional generation. The method outperforms existing techniques across multiple benchmarks.
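In translation from the Chinese summary: Omni-Attribute is an open-vocabulary image attribute encoder that learns high-fidelity, attribute-specific representations for visual concept personalization. It addresses the limitations of existing methods by curating semantically linked image pairs and adopting a dual-objective training paradigm. The method achieves state-of-the-art performance in open-vocabulary attribute retrieval, personalization, and compositional generation across multiple benchmarks.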
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
First: 2025-12-11T18:59:22+00:00 · Latest: 2025-12-11T18:59:22+00:00
Abstract
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2), despite having only 1.6B parameters.
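Chinese Title / Abstract (English translation)
Title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
We introduce VL-JEPA, a vision-language model based on a Joint Embedding Predictive Architecture (JEPA). Unlike classical vision-language models (VLMs), which generate tokens one at a time, VL-JEPA predicts continuous embeddings of the target text. Because it learns in an abstract representation space, the model concentrates on task-relevant semantics and abstracts away surface-level linguistic variation. In a strictly controlled comparison with standard token-space VLM training that uses the same vision encoder and training data, VL-JEPA achieves stronger performance with 50% fewer trainable parameters. At inference time, a lightweight text decoder is called only when needed to turn the embeddings predicted by VL-JEPA into text. We show that VL-JEPA natively supports selective decoding, which cuts the number of decoding operations by 2.85x while keeping performance close to that of non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA with no architectural changes. Across eight video classification datasets and eight video retrieval datasets, VL-JEPA's average performance exceeds that of CLIP, SigLIP2, and Perception Encoder. With only 1.6B parameters, the model also performs comparably to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2).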
Summary
VL-JEPA is a vision-language model that uses a Joint Embedding Predictive Architecture to predict continuous embeddings of target texts, rather than generating tokens autoregressively. This approach focuses on task-relevant semantics and reduces surface-level linguistic variability. Compared to standard token-space VLMs with the same vision encoder and training data, VL-JEPA achieves better performance with 50% fewer parameters and supports selective decoding that reduces the number of decoding operations by 2.85x. The model excels in video classification and retrieval tasks and performs comparably to larger VLMs on VQA tasks with fewer parameters.
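In translation from the Chinese summary: VL-JEPA is a vision-language model that uses a Joint Embedding Predictive Architecture to predict continuous embeddings of the target text instead of generating tokens autoregressively. This lets the model achieve stronger performance with fewer parameters and supports selective decoding, which reduces the number of decoding operations by 2.85x while maintaining similar performance. VL-JEPA performs well on multiple video classification and retrieval tasks and reaches comparable performance on VQA tasks with significantly fewer parameters.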
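The abstract does not describe how selective decoding decides when to call the text decoder. As a rough sketch of the general idea only, the snippet below assumes a simple rule: decode when the predicted embedding has drifted far enough (by cosine similarity) from the last embedding that was decoded. The threshold, the function names, and the toy 2-D embeddings are assumptions for illustration and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_decode_steps(embeddings, threshold=0.9):
    """Return the indices at which a (hypothetical) text decoder would run.

    The decoder is triggered for the first embedding and whenever similarity
    to the last decoded embedding falls below `threshold`.
    """
    steps = []
    last = None
    for i, emb in enumerate(embeddings):
        if last is None or cosine(last, emb) < threshold:
            steps.append(i)
            last = emb
    return steps

# Toy stream: three similar embeddings, then a semantic change, then a near-repeat.
stream = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.98, 0.1],
    [0.0, 1.0],
    [0.05, 0.99],
]
steps = select_decode_steps(stream)
print(steps, len(stream) / len(steps))
```

In this toy stream the decoder runs at two of the five steps, whereas uniform decoding would run at every step; the ratio printed at the end is the kind of reduction factor the abstract reports, computed here on made-up data.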
BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong
First: 2025-12-11T18:57:05+00:00 · Latest: 2025-12-11T18:57:05+00:00
Abstract
Early childhood developmental trajectories set a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
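Chinese Title / Abstract (English translation)
Title: BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
The developmental trajectories of young children provide a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that substantially improves on BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, the DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, producing video-utterance, image-utterance, and multi-turn conversational data that reflect infant experience. DevCV Toolbox turns all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks covering spatial reasoning, memory, and vocabulary understanding, in line with young children's capabilities. Experimental results show that a compact model pretrained from scratch can reach competitive performance on DevCV Toolbox and outperforms GPT-4o on some tasks. We hope the BabyVLM-V2 framework will advance research on developmentally plausible pretraining of vision foundation models.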
Summary
BabyVLM-V2 is designed to pretrain vision foundation models in a developmentally grounded manner, focusing on infant experiences. It uses a longitudinal, multifaceted pretraining set and a DevCV Toolbox for cognitive evaluation. The model achieves competitive performance on a benchmark suite of ten multimodal tasks, outperforming GPT-4o on some tasks, demonstrating its effectiveness in sample-efficient pretraining for vision foundation models.
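In translation from the Chinese summary: BabyVLM-V2 is a developmentally grounded framework for sample-efficient pretraining of vision foundation models, built on a longitudinal, multifaceted pretraining set and a cognitive evaluation suite called DevCV Toolbox. The pretraining set contains video-utterance, image-utterance, and multi-turn conversational data that reflect infant experience. DevCV Toolbox adapts the measures of the NIH Baby Toolbox into a benchmark suite of ten multimodal tasks covering spatial reasoning, memory, and vocabulary understanding. Experimental results show that a compact model pretrained from scratch reaches competitive performance on these tasks and even outperforms GPT-4o on some of them. The framework is intended to accelerate research on developmentally plausible pretraining of vision foundation models.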