Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
Authors: Hao Sun, Zi-Jun Ding, Da-Wei Zhou
First: 2026-05-13T17:56:23+00:00 · Latest: 2026-05-13T17:56:23+00:00
Abstract
Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current work primarily focuses on aligning global image embeddings (i.e., [CLS] token) with their corresponding text prompts (i.e., [EOS] token). Despite their good performance, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on the above observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align selected patch tokens with semantic tokens from class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
Summary
The paper proposes SPA (Semantic-guided Patch-level Alignment) to enhance CLIP-based class-incremental learning by leveraging patch-level semantic information. It uses GPT-5 to generate class-wise semantic descriptions, uses them to select discriminative patch tokens, and aligns the selected patches with the description tokens via optimal transport. Task-specific projectors and pseudo-features sampled from stored class-wise Gaussian statistics are also introduced to mitigate catastrophic forgetting. The authors report state-of-the-art performance in extensive experiments.
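The abstract names optimal transport as the tool for aligning selected patch tokens with semantic tokens but does not give the formulation. The following is a minimal sketch of one common choice, entropic OT solved with Sinkhorn iterations over a cosine-distance cost. The uniform marginals, the `eps` value, the random stand-in features, and the function names `sinkhorn_plan` and `ot_alignment_score` are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan for a cost matrix of shape (n, m), uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # mass on each patch token
    b = np.full(m, 1.0 / m)      # mass on each semantic token
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def ot_alignment_score(patches, tokens, eps=0.1):
    """Transport-weighted cosine similarity between two token sets."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = p @ t.T                # cosine similarity, shape (n, m)
    plan = sinkhorn_plan(1.0 - sim, eps)
    return float((plan * sim).sum()), plan

# Random stand-ins for 6 selected patch tokens and 4 description tokens.
rng = np.random.default_rng(0)
patches = rng.normal(size=(6, 16))
tokens = rng.normal(size=(4, 16))
score, plan = ot_alignment_score(patches, tokens)
```

In a classifier, a score of this kind could be computed per class and combined with the global [CLS]-to-[EOS] similarity; how SPA combines them is not stated in the abstract.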
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Authors: Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song
First: 2026-05-13T17:52:53+00:00 · Latest: 2026-05-13T17:52:53+00:00
Comments: work in progress
Abstract
Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
Summary
This work investigates long-context modeling in large vision-language models (LVLMs) by extending a 7B model from 32K to 128K context. Key findings include the effectiveness of long-document VQA over OCR transcription, the importance of length-balanced data for generalizable long-context ability, and the benefit of retrieval-heavy mixtures. The study introduces MMProLong, which improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts, beyond its 128K training window, without additional training. It also generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding.
VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence
Authors: Guney Tombak, Ertunc Erdil, Ender Konukoglu
First: 2026-05-13T17:20:26+00:00 · Latest: 2026-05-13T17:20:26+00:00
Abstract
Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit-transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR-CT and inter-subject HCP T2w-T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at https://github.com/guneytombak/VoxCor.
Summary
VoxCor is a training-free method that uses frozen 2D Vision Transformer (ViT) foundation models to generate volumetric feature representations for multimodal voxel correspondence. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact WPLS projection to select modality-stable anatomical directions. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and achieves registration performance competitive with handcrafted descriptors and learned 3D features.
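The transform-time path described above (linear projection followed by nearest-neighbor lookup) can be sketched in a few lines. This is an illustration only: the projection matrix `W` is a random stand-in for the fitted WPLS projection, the per-voxel features are synthetic rather than triplanar ViT outputs, and the function names are hypothetical rather than taken from the VoxCor repository.

```python
import numpy as np

def project_features(feats, W):
    """Apply a linear projection and L2-normalize each voxel descriptor."""
    z = feats @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def nearest_neighbor_correspondence(query, target):
    """For each query voxel, return the index of the most similar target voxel."""
    sim = query @ target.T       # cosine similarity, since rows are unit norm
    return sim.argmax(axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8))     # stand-in for the fitted projection (assumption)

# Synthetic "moving" volume: the fixed volume's voxels, shuffled and slightly perturbed.
fixed_feats = rng.normal(size=(20, 32))
perm = rng.permutation(20)
moving_feats = fixed_feats[perm] + 1e-6 * rng.normal(size=(20, 32))

matches = nearest_neighbor_correspondence(
    project_features(moving_feats, W),
    project_features(fixed_feats, W),
)
```

For full volumes a brute-force similarity matrix would be too large, so an approximate or tree-based nearest-neighbor index would likely replace the dense `argmax`.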
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
Authors: Harold Haodong Chen, Sirui Chen, Yingjie Xu, Wenhang Ge, Ying-Cong Chen
First: 2026-05-13T16:54:36+00:00 · Latest: 2026-05-13T16:54:36+00:00
Comments: On-going work
Abstract
The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds, a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.
Summary
RoboEvolve is a framework that co-evolves a vision-language model planner and a video generation model simulator to address the scarcity of task-aligned physical interaction data in robotic manipulation. It uses a dual-phase mechanism, daytime exploration and nighttime consolidation, to discover physically grounded behaviors and stabilize policy optimization, respectively. RoboEvolve improves base planners by 30 absolute points, raises simulator success by 48% on average, and surpasses fully supervised baselines with only 500 unlabeled seeds, a 50x reduction. It also demonstrates robust continual learning without catastrophic forgetting.
LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction
Authors: Christina Kassab, Hyeonjae Gil, Matías Mattamala, Ayoung Kim, Maurice Fallon
First: 2026-05-13T16:19:02+00:00 · Latest: 2026-05-13T16:19:02+00:00
Abstract
Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction to when each room is fully observed, which enables scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D dataset and on self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.
Summary
LEXI-SG is a monocular 3D scene graph mapping system that partitions scenes into rooms and performs feed-forward reconstruction once each room is fully observed, enabling dense mapping without sliding-window scale inconsistencies. It uses semantic priors from open-vocabulary foundation models for the room partitioning and supports open-vocabulary object segmentation and tracking within each room. Experiments on indoor scenes show improved trajectory estimation and dense reconstruction compared to existing feed-forward SLAM methods, and competitive performance in open-vocabulary segmentation. LEXI-SG demonstrates that accurate 3D scene graphs can be achieved using only RGB camera input.
Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
Authors: George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Denis Kuznedelev, Alina Shutova, Max Ryabinin
First: 2025-12-11T18:57:02+00:00 · Latest: 2026-05-13T16:04:41+00:00
Comments: Preprint, work in progress
Abstract
Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of positional embeddings to enable LLMs built for sequential generation to simultaneously think, listen, and write outputs. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking-augmented answers while reducing time to first non-thinking token from minutes to $\le 5$ s and the overall delays by up to $12\times$.
Summary
This work addresses a limitation of state-of-the-art reasoning LLMs: they must finish thinking before responding, which makes them less interactive. The authors propose a training-free method that lets LLMs think, listen, and write outputs simultaneously by leveraging the properties of positional embeddings. The approach is evaluated on math, commonsense, and safety reasoning tasks, showing that it can generate accurate answers with significantly reduced delays: time to first non-thinking token drops from minutes to at most 5 seconds, and overall delays are reduced by up to 12 times.
Prototype-Based Test-Time Adaptation of Vision-Language Models
Authors: Zhaohong Huang, Yuxin Zhang, Wenjing Liu, Fei Chao, Rongrong Ji
First: 2026-04-23T07:20:56+00:00 · Latest: 2026-05-13T15:56:21+00:00
Abstract
Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
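The abstract describes prototypes that absorb each test sample's visual features with a weight derived from its zero-shot class confidence, but it does not give the update rule. The sketch below assumes one plausible form, a confidence-weighted running mean applied to the predicted class only. The class name `PrototypeBank` and the rule itself are hypothetical and are not taken from the paper.

```python
import numpy as np

class PrototypeBank:
    """One running prototype per class; memory does not grow with the test stream."""

    def __init__(self, num_classes, dim):
        self.protos = np.zeros((num_classes, dim))
        self.weights = np.zeros(num_classes)   # accumulated confidence per class

    def update(self, feat, probs):
        """Fold a test feature into the prototype of its predicted class.

        Assumed rule: confidence-weighted running mean, where the weight is the
        zero-shot probability of the predicted class.
        """
        c = int(np.argmax(probs))
        w = float(probs[c])
        total = self.weights[c] + w
        self.protos[c] = (self.weights[c] * self.protos[c] + w * feat) / total
        self.weights[c] = total
        return c

bank = PrototypeBank(num_classes=3, dim=2)
bank.update(np.array([1.0, 0.0]), np.array([0.8, 0.1, 0.1]))
bank.update(np.array([0.0, 1.0]), np.array([0.4, 0.3, 0.3]))
```

Because only one vector and one scalar are kept per class, there is no cache to populate or search at inference time, which is the efficiency argument the abstract makes against cache-based designs.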
ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
Authors: Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang
First: 2026-04-24T13:36:41+00:00 · Latest: 2026-05-13T15:42:25+00:00
Comments: Some errors in the experimental sections
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
Summary
ReLIC-SGG addresses the issue of incomplete scene graph annotations in open-vocabulary SGG by treating unannotated relations as latent variables rather than definite negatives. It builds a semantic relation lattice over predicates and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. Experiments show improved recognition of rare and unseen predicates and better recovery of missing relations.
Sampling from Flow Language Models via Marginal-Conditioned Bridges
Authors: Iskander Azangulov, Leo Zhang
First: 2026-05-13T15:38:48+00:00 · Latest: 2026-05-13T15:38:48+00:00
Abstract
Flow Language Models (FLMs) are a recently introduced class of language models which adapt continuous flow matching for one-hot encoded token sequences. Their denoisers have a special structure absent from generic continuous diffusion models: each block of the denoising mean is a posterior marginal distribution over the clean token at that position. Standard DDPM-style samplers collapse these marginals to a single conditional-mean endpoint and bridge toward this simplex-valued point, which is generally not a valid one-hot sequence. We argue that the natural sampler for an FLM is instead posterior-predictive. At each reverse step, we sample a clean one-hot endpoint from the factorized posterior defined by the FLM token marginals, and then sample the next continuous state from the analytic Ornstein-Uhlenbeck bridge conditioned on that endpoint. The method is training-free, uses the same model evaluations as standard sampling, and gives a principled interface for token-level decoding controls such as temperature scaling and nucleus truncation. We show that, under exact posterior marginals, the endpoint approximation error is exactly the conditional multi-information among token positions. The induced one-step bridge kernel preserves all token-wise posterior-predictive marginals and loses only the residual cross-position dependence. Finally, we prove a Girsanov path-space comparison showing that the marginal-conditioned bridge has a no-larger denoising-error term than the frozen conditional-mean bridge, with strict improvement whenever intermediate coordinate-wise bridge observations reveal additional information about the clean token. Experiments with FLMs show that the sampler improves the quality-diversity tradeoff. Code is available at: github.com/imbirik/mcb.
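The first half of the reverse step described in the abstract, drawing a clean one-hot endpoint from the factorized token marginals with temperature and nucleus controls, can be sketched as follows. The function name `sample_one_hot_endpoint` and the toy marginals are assumptions for illustration. The second half of the step, sampling from the Ornstein-Uhlenbeck bridge conditioned on that endpoint, is not shown because the abstract does not give its closed form.

```python
import numpy as np

def sample_one_hot_endpoint(marginals, rng, temperature=1.0, top_p=1.0):
    """Sample one token per position from per-position marginals.

    marginals: array of shape (L, V), each row a distribution over V tokens.
    Returns an (L, V) one-hot array.
    """
    L, V = marginals.shape
    logp = np.log(np.clip(marginals, 1e-12, None)) / temperature
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)           # temperature-scaled marginals
    out = np.zeros((L, V))
    for i in range(L):
        order = np.argsort(-p[i])               # tokens by decreasing probability
        csum = np.cumsum(p[i][order])
        n_keep = int(np.searchsorted(csum, top_p)) + 1
        keep = order[:n_keep]                   # smallest set with mass >= top_p
        q = p[i][keep] / p[i][keep].sum()
        out[i, rng.choice(keep, p=q)] = 1.0
    return out

rng = np.random.default_rng(0)
marginals = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])

mean_endpoint = marginals                       # conditional mean: not one-hot
sampled = sample_one_hot_endpoint(marginals, rng)
greedy_like = sample_one_hot_endpoint(marginals, rng, top_p=0.5)
```

The contrast with the DDPM-style sampler is visible in `mean_endpoint`, which lies inside the simplex, whereas every sampled endpoint is a valid one-hot sequence.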
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
Authors: Vladislav Makarov, Mark Gizetdinov, Dmitry Yudin
First: 2026-05-13T15:27:41+00:00 · Latest: 2026-05-13T15:27:41+00:00
Abstract
Scene graph generation provides a compact structured representation for visual perception, but accurate and fast graph prediction from images and videos remains challenging. Recent VLM-based methods can generate scene graphs end-to-end as structured text, yet often produce long outputs with irrelevant objects and relations. We present SceneGraphVLM, a compact method for image and video scene graph generation with small visual language models. SceneGraphVLM serializes graphs in a token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards that balance relation coverage and precision while penalizing unsupported objects and relations. For videos, the model can optionally condition each frame on the previously generated graph, providing lightweight short-term context without tracking or post-processing. We evaluate SceneGraphVLM on PSG, PVSG, and Action Genome. With compact VLMs and vLLM-accelerated decoding, SceneGraphVLM achieves a strong quality-speed trade-off, improves precision-oriented SGG metrics while preserving reasonable recall, and generates complete scene graphs with approximately one-second latency. Code and implementation details are available at: https://github.com/markus0440/SceneGraphVLM.git.
Summary
SceneGraphVLM is a compact method for generating scene graphs from images and videos using small visual language models. It serializes graphs in the token-efficient TOON format and trains the model in two stages: supervised fine-tuning followed by reinforcement learning with hallucination-aware rewards. SceneGraphVLM achieves a strong quality-speed trade-off, improving precision-oriented SGG metrics while maintaining reasonable recall and generating complete scene graphs with approximately one-second latency.
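The abstract says the reinforcement-learning reward balances relation coverage and precision while penalizing unsupported objects and relations, without giving a formula. One way such a reward could look is an F1 score over triplets minus a penalty per predicted object that is absent from the reference. The function name, the F1 form, and the `penalty` value below are assumptions, not the paper's definition.

```python
def hallucination_aware_reward(pred_triplets, gold_triplets, gold_objects, penalty=0.1):
    """Toy reward: triplet F1 minus a penalty per unsupported predicted object."""
    pred = set(pred_triplets)
    gold = set(gold_triplets)
    if not pred:
        return 0.0
    tp = len(pred & gold)
    precision = tp / len(pred)                  # penalizes unsupported relations
    recall = tp / len(gold) if gold else 0.0    # rewards relation coverage
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    pred_objects = {name for subj, _, obj in pred for name in (subj, obj)}
    unsupported = len(pred_objects - set(gold_objects))
    return f1 - penalty * unsupported

gold = [("person", "holding", "cup"), ("cup", "on", "table")]
gold_objects = {"person", "cup", "table"}

reward_exact = hallucination_aware_reward(gold, gold, gold_objects)
reward_halluc = hallucination_aware_reward(
    [("person", "holding", "cup"), ("dog", "on", "table")], gold, gold_objects
)
```

Here the second prediction keeps one correct triplet but introduces an object ("dog") that the reference does not support, so it scores lower than the exact match.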
NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating
Authors: Zhongju Yuan, Geraint Wiggins, Dick Botteldooren
Venue: ICML 2026
First: 2026-05-13T15:09:47+00:00 · Latest: 2026-05-13T15:09:47+00:00
Comments: Accepted as a regular paper by ICML 2026
Abstract
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-level ALM reasoning only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting
Authors: Lovre Antonio Budimir, Yushi Guan, Steve Ryhner, Sven Lončarić, Nandita Vijaykumar
First: 2026-05-13T14:35:31+00:00 · Latest: 2026-05-13T14:35:31+00:00
Comments: 18 pages (9 pages main paper), 10 figures, preprint
Abstract
3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to $400\times$ training speedup while being $3\times$ more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
Summary
SCOUP addresses the challenge of efficiently associating high-dimensional vision-language embeddings with 3D Gaussians for semantic reconstruction. It learns a sparse codebook-based representation from 2D image regions, uplifts the coefficients to 3D Gaussians using weighted sparse aggregation, and keeps only the top-K coefficients per Gaussian. The method achieves up to a 400 times training speedup and is 3 times more memory efficient during training than the state-of-the-art method in rendering speed, while matching or outperforming existing methods in open-vocabulary querying accuracy across multiple benchmarks.
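The uplifting step, where each Gaussian accumulates codebook coefficients over views and then keeps only its top-K atoms, can be illustrated with a small dense example. The shapes, the unnormalized weighted sum, and the function names are assumptions made for this sketch; the paper's actual weighting and data layout may differ.

```python
import numpy as np

def uplift_sparse_codes(assoc, region_codes, top_k):
    """Aggregate region-level codebook coefficients onto Gaussians, then keep top-K.

    assoc:        (G, R) non-negative Gaussian-to-region weights pooled over views
    region_codes: (R, A) codebook coefficients per 2D region (mostly zeros)
    Returns atom indices and coefficients, each of shape (G, top_k).
    """
    acc = assoc @ region_codes                      # (G, A) weighted accumulation
    idx = np.argsort(-acc, axis=1)[:, :top_k]       # dominant atoms per Gaussian
    val = np.take_along_axis(acc, idx, axis=1)
    return idx, val

def decode_features(idx, val, codebook):
    """Reconstruct per-Gaussian features from sparse codes; codebook is (A, D)."""
    return (val[:, :, None] * codebook[idx]).sum(axis=1)

assoc = np.array([[1.0, 0.0],
                  [0.5, 0.5]])
region_codes = np.array([[2.0, 0.0, 1.0],
                         [0.0, 4.0, 0.0]])
codebook = np.eye(3)                                # toy codebook, 3 atoms of dim 3

idx, val = uplift_sparse_codes(assoc, region_codes, top_k=2)
feats = decode_features(idx, val, codebook)
```

Storing `top_k` index-value pairs per Gaussian instead of a dense embedding is what makes the storage and rendering cost independent of the embedding dimension.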
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
Authors: Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun
Venue: ICLR 2026
First: 2026-04-03T05:56:29+00:00 · Latest: 2026-05-13T14:33:13+00:00
Comments: Accepted at ICLR 2026
Abstract
Open-vocabulary object detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
PreFIQs: Face Image Quality Is What Survives Pruning
Authors: Jan Niklas Kolf, Guray Ozgur, Andrea Atzori, Žiga Babnik, Vitomir Štruc, Naser Damer, Fadi Boutros
Venue: CVPR 2026
First: 2026-05-13T11:53:19+00:00 · Latest: 2026-05-13T11:53:19+00:00
Comments: Accepted at CVPR 2026 Workshops
Abstract
Face Image Quality Assessment (FIQA) evaluates the utility of a face image for automated face recognition (FR) systems. In this work, we propose PreFIQs, an unsupervised and training-free FIQA framework grounded in the Pruning Identified Exemplar (PIE) hypothesis. We hypothesize that low-utility face images rely disproportionately on fragile network parameters, resulting in larger geometric displacement of their embeddings under model sparsification. Accordingly, PreFIQs quantifies image utility as the Euclidean distance between L2-normalized embeddings extracted from a pre-trained FR model and its pruned counterpart. We provide a first-order theoretical justification via a Jacobian-vector product analysis, demonstrating that this empirical drift serves as a computationally efficient approximation of the exact geometric sensitivity of the latent embedding manifold. Extensive experiments across eight benchmarks and four FR models demonstrate that PreFIQs achieves competitive or superior performance compared to state-of-the-art FIQA methods, including establishing new state-of-the-art results on several benchmarks, without any training or supervision. These results validate parameter sparsification as a principled and practically efficient signal for face image utility, and demonstrate that quality is, in essence, what survives pruning.
Summary / 总结
PreFIQs is an unsupervised FIQA framework that evaluates face image quality by measuring the Euclidean distance between embeddings from a pre-trained FR model and its pruned version. This method is grounded in the hypothesis that low-utility face images rely more on fragile network parameters, leading to larger geometric displacement under model sparsification. Experiments across eight benchmarks show that PreFIQs outperforms or matches state-of-the-art methods without requiring training or supervision, validating the use of parameter sparsification as a signal for face image quality.
PreFIQs 是一个无监督的 FIQA 框架,通过计算预训练模型和其稀疏化版本的嵌入之间的欧几里得距离来评估人脸图像的质量,以适应自动人脸识别系统。该方法基于低效用图像具有更脆弱参数的假设,这些参数在模型稀疏化时会导致更大的位移。实验表明,PreFIQs 在八个基准测试中优于或匹配了最先进的方法,且无需训练,验证了参数稀疏化作为人脸图像效用信号的有效性。
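The scoring rule described for PreFIQs can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: a random linear map stands in for the pre-trained FR model, global magnitude pruning stands in for whichever sparsification the paper actually uses, and the names `magnitude_prune`, `embed`, and `prefiqs_drift` are hypothetical.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries (assumed pruning scheme)."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def embed(weights, images):
    """Toy 'FR model': a linear map followed by L2 normalization."""
    z = images @ weights.T
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def prefiqs_drift(weights, images, sparsity=0.5):
    """Euclidean distance between dense and pruned L2-normalized embeddings."""
    dense = embed(weights, images)
    sparse = embed(magnitude_prune(weights, sparsity), images)
    return np.linalg.norm(dense - sparse, axis=1)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))   # stand-in for FR model parameters
x = rng.normal(size=(5, 64))    # stand-in for five face images
drift = prefiqs_drift(W, x)     # one drift value per image
```

Under the paper's hypothesis, a larger drift would indicate a lower-utility image. How the drift is mapped to a final quality score is not specified here and is left open.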
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
Authors: Mingzhe Huang, Weijun Wang, Xin Ding, Liang Mi, Hao Wen, Yuanchun Li, Lichen Pang, Shansong Yang, Yunxin Liu, Ting Cao
First: 2026-05-13T11:32:03+00:00 · Latest: 2026-05-13T11:32:03+00:00
Comments: 10 pages, 11 figures
Abstract
In Vision-Language Models (VLMs), processing a massive number of visual tokens incurs prohibitive computational overhead. While recent training-aware pruning methods attempt to selectively discard redundant tokens, they largely rely on continuous-gradient relaxations. However, visual token pruning is inherently a discrete, non-convex combinatorial problem; consequently, these continuous approximations frequently trap the optimization in sub-optimal local minima, especially under aggressive compression budgets. To overcome this fundamental bottleneck, we propose GRIP-VLM, a Group-Relative Importance Pruning framework driven by Reinforcement Learning. Rather than relying on smooth-gradient assumptions, GRIP-VLM formulates pruning as a Markov Decision Process, employing a Group Relative Policy Optimization (GRPO) paradigm anchored by supervised warm-up to directly explore the discrete selection space. Integrated with a budget-aware scorer, our lightweight agent dynamically evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Extensive experiments across diverse multimodal benchmarks demonstrate that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and delivering up to a 15% inference speedup at equal accuracy.
Summary / 总结
GRIP-VLM is a reinforcement-learning-based framework for pruning redundant visual tokens in Vision-Language Models (VLMs). Rather than relying on continuous-gradient relaxations, it formulates pruning as a Markov Decision Process and uses Group Relative Policy Optimization (GRPO), anchored by a supervised warm-up, to explore the discrete selection space directly. Integrated with a budget-aware scorer, its lightweight agent evaluates per-token importance and adapts to arbitrary compression ratios without retraining. Experiments across diverse multimodal benchmarks show that GRIP-VLM consistently outperforms heuristic and supervised-learning baselines, achieving a superior Pareto frontier and up to a 15% inference speedup at equal accuracy.
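To make the GRPO ingredient concrete, the sketch below shows the group-relative advantage that gives the method its name: several keep/drop masks are sampled for the same input, and each mask's reward is normalized against the group's mean and standard deviation. The reward function, the importance scores, and the helper names are invented for illustration; the paper's actual reward and policy are not described in the abstract.

```python
import random
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (GRPO-style baseline)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def sample_mask(keep_probs, rng):
    """Sample a discrete keep/drop decision for every visual token."""
    return [1 if rng.random() < p else 0 for p in keep_probs]

def toy_reward(mask, importance, budget):
    """Hypothetical reward: kept importance minus a penalty for exceeding the budget."""
    kept = sum(m * s for m, s in zip(mask, importance))
    overflow = max(0, sum(mask) - budget)
    return kept - 1.0 * overflow

rng = random.Random(0)
importance = [0.9, 0.1, 0.7, 0.2, 0.8, 0.05]  # made-up per-token scores
keep_probs = [0.5] * len(importance)           # untrained policy
group = [sample_mask(keep_probs, rng) for _ in range(8)]
rewards = [toy_reward(m, importance, budget=3) for m in group]
advantages = group_relative_advantages(rewards)
```

Masks with above-average reward receive positive advantages and would be reinforced by a policy-gradient update; no learned value network is needed, which is the usual motivation for the group-relative baseline.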
GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding
Authors: Mayank Nautiyal, Li Ju, Andreas Hellander, Ekta Vats, Prashant Singh
First: 2026-05-13T11:12:18+00:00 · Latest: 2026-05-13T11:12:18+00:00
Abstract
Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through $\ell_2$ normalization typically expose neither aleatoric uncertainty (cross-modal ambiguity) nor epistemic uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose GeoFlowVLM as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalized dual-encoder VLM embeddings on the product hypersphere $\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$ via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.
中文标题/摘要
标题:GeoFlowVLM:几何感知联合不确定性在冻结视觉-语言嵌入中的应用
标准的双编码器视觉-语言模型通过$\ell_2$归一化将图像和文本映射到共享单位超球面上的确定性点,通常不暴露\emph{ aleatoric}不确定性(跨模态歧义)或\emph{epistemic}不确定性(训练分布支持不足)。现有的后处理方法要么仅恢复这两种不确定性成分之一,要么忽略了这些模型嵌入的超球面几何结构。我们提出\textbf{GeoFlowVLM}作为后处理适配器,通过单个掩码速度场的黎曼流匹配在乘积超球面$\mathbb{S}^{d-1} \times \mathbb{S}^{d-1}$上学习配对的$\ell_2$归一化双编码器VLM嵌入的联合分布。一致性结果表明,在总体极限下,训练后的网络暴露了联合流和跨模态条件流作为其各自领域上的有效黎曼流匹配速度场。从该单一模型中,我们推导出两个量:一个条件检索熵,通过Fano型界以决策理论解释量化跨模态歧义;以及一个边际典型性epistemic评分,通过联合NLL的确切链式分解进行验证。该分解隔离了一个跨模态点间互信息项,其结构上具有区分性而非epistemic性,并且在实验中是唯一一致无信息的独立成分。实验中,熵在三个检索基准中的双向中与理想单调校准接近跟踪Recall@1,而边际典型性总和在四个零样本分类基准中提供了一致校准的选择性准确性。
Summary / 总结
GeoFlowVLM is a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalized dual-encoder VLM embeddings on the product hypersphere via Riemannian flow matching. It exposes both aleatoric and epistemic uncertainties by deriving a conditional retrieval entropy and a marginal-typicality epistemic score. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration, and the marginal-typicality sum yields consistently calibrated selective accuracy.
GeoFlowVLM 是一个后置适配器,通过黎曼流匹配在乘积超球体上学习配对的 $\ell_2$ 归一化双编码器 VLM 向量的联合分布。它通过推导条件检索熵和边缘典型性表征不确定性。实验上,熵与 Recall@1 在三个检索基准中的双向近理想单调校准相关,而边缘典型性总和在四个零样本分类基准中表现出一致的校准选择准确性。
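The chain-rule decomposition of the joint NLL mentioned in the abstract can be written out explicitly. The notation below ($x$ for the image embedding, $y$ for the text embedding, $p$ for the learned densities) is assumed here for illustration and may differ from the paper's own; the identity itself is standard.

```latex
\begin{aligned}
-\log p(x, y)
  &= -\log p(x) - \log p(y \mid x) \\
  &= \underbrace{\bigl[-\log p(x) - \log p(y)\bigr]}_{\text{sum of marginal NLLs}}
     \;-\; \underbrace{\log \frac{p(x, y)}{p(x)\,p(y)}}_{\text{pointwise mutual information}}
\end{aligned}
```

Read against the abstract, the first bracket is presumably what the authors call the marginal-typicality sum, while the second term is the cross-modal pointwise-mutual-information component that they describe as discriminative rather than epistemic.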
Block-wise Adaptive Caching for Accelerating Diffusion Policy
Authors: Kangye Ji, Yuan Meng, Hanyun Cui, Ye Li, Jianbo Zhou, Shengjia Hua, Lei Chen, Zhi Wang
First: 2025-06-16T13:14:58+00:00 · Latest: 2026-05-13T10:49:00+00:00
Abstract
Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose Block-wise Adaptive Caching (BAC), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities exhibit non-uniform temporal dynamics and distinct block-specific patterns. To operationalize this insight, we first design an Adaptive Caching Scheduler to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to significant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with significant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3x inference speedup for free. Project page: https://block-wise-adaptive-caching.github.io.
中文标题/摘要
标题:块级自适应缓存加速扩散策略
扩散策略展示了强大的视知觉建模能力,但由于其高昂的计算成本,使其在实时机器人控制中不切实际。尽管在重复去噪步骤中存在巨大的冗余,现有的扩散加速技术由于基本架构和数据差异的原因,无法适用于扩散策略。在本文中,我们提出了一种名为块级自适应缓存(BAC)的方法,通过缓存中间动作特征来加速扩散策略。BAC 通过在块级别上自适应地更新和重用缓存特征,实现了无损的动作生成加速,基于一个关键观察,即特征相似性表现出非均匀的时间动态和特定于块的模式。为了实现这一洞察,我们首先设计了一个自适应缓存调度器,通过最大化缓存和跳过的特征之间的全局特征相似性来确定最优的更新时间步。然而,对每个块应用此调度器会导致由于缓存错误在块间传播而引起的显著误差激增,特别是在前馈网络(FFN)块中。为了解决这个问题,我们开发了冒泡联合算法,该算法通过在下游 FFN 之前更新具有显著缓存错误的上游块来截断这些错误。作为无需训练的插件,BAC 可以轻松集成到现有的基于变压器的扩散策略和视觉-语言-动作模型中。在多个机器人基准上的广泛实验表明,BAC 可以免费实现高达 3 倍的推理速度提升。项目页面:https://block-wise-adaptive-caching.github.io/
Summary / 总结
The research aims to address the high computational cost of Diffusion Policy, which limits its practical application in real-time robotic control. The proposed Block-wise Adaptive Caching (BAC) method accelerates the policy by caching intermediate action features and updating them adaptively at the block level. BAC includes an Adaptive Caching Scheduler to optimize update times and a Bubbling Union Algorithm to mitigate errors from inter-block propagation. Experiments show that BAC can achieve up to 3x inference speedup without compromising action generation quality.
研究旨在解决Diffusion Policy的高计算成本问题,这限制了其在实时机器人控制中的应用。提出的Block-wise Adaptive Caching (BAC)方法通过缓存中间动作特征并按块级适应性更新来加速策略。BAC包括一个自适应缓存调度器以优化更新时间,并使用Bubbling Union算法来减轻因块间传播导致的错误。实验表明,BAC可以在不牺牲动作生成质量的情况下实现最高3倍的推理加速。
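The basic mechanism behind block-level caching can be sketched as follows. This is a deliberately simplified illustration: the per-block update schedule is supplied by hand rather than found by the Adaptive Caching Scheduler, the Bubbling Union Algorithm is not implemented, and the scalar "blocks" and the `0.9` denoising update are placeholders. The function name `run_with_block_cache` is hypothetical.

```python
def run_with_block_cache(blocks, x, num_steps, update_steps):
    """Run `blocks` for `num_steps`, recomputing block i only at steps in update_steps[i]."""
    cache = [None] * len(blocks)
    recomputed = 0
    for t in range(num_steps):
        h = x
        for i, block in enumerate(blocks):
            if cache[i] is None or t in update_steps[i]:
                cache[i] = block(h, t)   # refresh the cached feature
                recomputed += 1
            h = h + cache[i]             # residual connection reuses the cache
        x = h * 0.9                      # stand-in for one denoising update
    return x, recomputed

# Two toy blocks operating on a scalar "feature".
blocks = [lambda h, t: 0.1 * h, lambda h, t: 0.05 * h + 0.01 * t]
schedule = [{0, 5}, {0, 2, 4, 6, 8}]     # block-specific update timesteps
_, cached_calls = run_with_block_cache(blocks, 1.0, 10, schedule)
_, full_calls = run_with_block_cache(blocks, 1.0, 10, [set(range(10))] * 2)
```

With this schedule the cached run evaluates 7 block calls instead of 20. Each block having its own update set mirrors the abstract's observation that feature similarity follows block-specific patterns.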
Agentic AI for Remote Sensing: Technical Challenges and Research Directions
Authors: Muhammad Akhtar Munir, Muhammad Umer Sheikh, Akashah Shabbir, Muhammad Haris Khan, Fahad Khan, Xiao Xiang Zhu, Begum Demir, Salman Khan
First: 2026-04-27T18:59:49+00:00 · Latest: 2026-05-13T10:44:17+00:00
Comments: 31 pages. Position Paper
Abstract
Earth Observation (EO) is moving beyond static prediction toward multi-step analytical workflows that require coordinated reasoning over data, tools, and geospatial state. While foundation models and vision-language models have advanced representation learning and language-grounded interaction in remote sensing, and agentic AI has shown strong potential for long-horizon reasoning and tool use, EO is not a straightforward extension of generic agentic AI. EO workflows operate on georeferenced, multi-modal, and temporally structured data, where operations such as reprojection, resampling, compositing, and aggregation transform the underlying state and can constrain later analysis. As a result, errors may propagate silently across steps, and correctness depends not only on internal coherence but also on geospatial consistency, temporally valid comparisons, and physical validity. This position paper argues that these challenges are structural rather than incidental. We examine the assumptions commonly made in generic agentic systems, analyze how they break in geospatial workflows, and characterize failure modes in multi-step EO pipelines. We then outline design principles for EO-native agents centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning and evaluation. Building reliable geospatial agents, therefore, requires rethinking agent design around the physical, geospatial, and workflow constraints that govern EO analysis.
Summary / 总结
This paper addresses the challenges in applying agentic AI to Earth Observation (EO) workflows, which involve complex, geospatially structured data and multi-step analytical processes. The authors analyze how generic agentic AI systems fail in EO contexts due to the need for geospatial consistency, temporal validity, and physical validity. They propose design principles for EO-native agents that include structured geospatial state, tool-aware reasoning, verifier-guided execution, and validity-aware learning and evaluation to ensure reliability in EO analysis.
本文探讨了将代理AI应用于地球观测(EO)工作流中所面临的挑战,这些工作流涉及复杂的、地理空间结构化数据和多步骤分析过程。作者分析了通用代理AI系统在EO上下文中的失败之处,因为需要地理空间一致性、时间有效性以及物理有效性。他们提出了针对EO原生代理的设计原则,包括结构化的地理空间状态、工具感知推理、验证者引导执行以及有效性感知的学习和评估,以确保EO分析的可靠性。
KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models
Authors: Richard Sproat, Stefano Peluchetti
First: 2026-05-13T10:35:07+00:00 · Latest: 2026-05-13T10:35:07+00:00
Comments: Preprint
Abstract
Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a formal kamon description language - "kamon yōgo" - description, a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.
Summary / 总结
KamonBench is a dataset for evaluating compositional factor recovery in vision-language models. It consists of 20,000 synthetic composite crests generated from a grammar, each paired with a formal kamon description, a segmented Japanese analysis, an English translation, and a non-linguistic program code. The dataset supports evaluation beyond caption-level accuracy, including direct program-code factor metrics and controlled factor-pair recombination splits. Baseline results are provided for a ViT encoder/Transformer decoder and two VGG n-gram decoders, positioning KamonBench as a controlled testbed for sparse compositional visual recognition.
KamonBench 是一个用于评估视觉-语言模型中组合因素恢复的数据集,包含 20,000 个从语法生成的合成纹章,每个纹章配有不同的正式描述和分段分析。该数据集支持超出标题级准确性的评估,包括直接程序代码因素度量和可控因素对重组分割。提供了多种模型的基本结果,展示了 KamonBench 在稀疏组合视觉识别中的应用价值。
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Authors: Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang
Venue: ACL 2026
First: 2026-05-13T09:54:31+00:00 · Latest: 2026-05-13T09:54:31+00:00
Comments: Accepted to ACL 2026
Abstract
Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
Summary / 总结
The paper addresses the issue of visual evidence selection in multimodal retrieval-augmented generation (RAG), where existing methods often focus on semantic relevance rather than the actual utility for downstream reasoning. It formulates evidence utility using information gain and introduces a latent helpfulness measure to rank evidence efficiently. The proposed framework, which is training-free and surrogate-accelerated, estimates evidence utility using lightweight models and shows consistent performance improvements over state-of-the-art RAG baselines with reduced computational cost.
论文针对多模态检索增强生成(RAG)中的视觉证据选择问题,现有方法通常关注语义相关性而非实际的下游推理有用性。它通过信息增益来定义证据的有用性,并引入了一个潜在的帮助性度量来高效地排名证据。提出的框架无需训练且加速了近似计算,使用轻量级的多模态模型估计证据的有用性,并在多个模型家族上展示了相对于最先进的RAG基线的一致性能改进和计算成本降低。
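One way to read "information gain induced on a model's output distribution" is as the KL divergence between the answer distribution with and without a piece of evidence. The sketch below ranks candidate images under that reading. It is an assumption made for illustration: the paper's exact definition, its latent helpfulness variable, and its surrogate models are not reproduced, and the distributions are made up.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answers."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def rank_by_information_gain(prior, posteriors):
    """Order evidence by how far it moves the answer distribution from the prior."""
    gains = {name: kl_divergence(post, prior) for name, post in posteriors.items()}
    return sorted(gains, key=gains.get, reverse=True), gains

prior = [0.25, 0.25, 0.25, 0.25]             # model output without evidence
posteriors = {                               # model output given each image
    "img_a": [0.85, 0.05, 0.05, 0.05],
    "img_b": [0.30, 0.30, 0.20, 0.20],
    "img_c": [0.25, 0.25, 0.25, 0.25],
}
ranking, gains = rank_by_information_gain(prior, posteriors)
```

Here `img_a` sharpens the answer distribution the most and `img_c` leaves it unchanged, so a selector keeping the top-ranked evidence would prefer `img_a` regardless of how semantically similar each image looks to the query.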
Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation
Authors: Yijing Wang, Ruonan Li, Qilin Wang, Rongqiang Zhao, Jie Liu
First: 2026-05-08T09:39:59+00:00 · Latest: 2026-05-13T09:48:13+00:00
Abstract
3D semantic scene understanding is essential for digital twins, autonomous driving, smart agriculture, and embodied perception, yet dense point-wise annotation for point clouds remains expensive and difficult to scale. Existing annotation-free methods often face a trade-off between semantic recognition and structural efficiency: open-vocabulary and foundation-model-driven methods provide strong semantic priors, but often come with substantial computational costs, while structure-oriented methods based on superpoints, clustering, and graph reasoning are lightweight but often produce category-agnostic regions. We propose DDS, a resource-efficient structure-oriented framework for region-consistent and semanticized annotation-free 3D scene understanding. DDS preserves the lightweight superpoint-based organization paradigm while incorporating visual semantic cues from projected features and segmentation-derived masks. It first performs multi-granularity distillation to guide the 3D backbone at the point, mask-prototype, and inter-prototype levels, then applies graph diffusion over superpoints to propagate semantic information directly in 3D, producing coherent region representations without costly spectral decomposition or dense open-vocabulary 3D feature fields. Finally, DDS uses segmentation-cluster association to assign interpretable semantic names to category-agnostic 3D clusters. Experiments on real-world datasets show that DDS achieves the best performance among representative structure-oriented annotation-free baselines, improving oAcc, mAcc, and mIoU by up to 5.9%, 8.1%, and 2.4%, respectively. These results demonstrate that DDS improves region consistency and lightweight semantic recognition, providing a scalable and interpretable solution for annotation-free 3D scene understanding.
Summary / 总结
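The paper proposes DDS, a resource-efficient framework for 3D scene understanding without dense annotations. It combines multi-granularity distillation and graph diffusion over superpoints to guide a 3D backbone and propagate semantic information, producing coherent region representations, and then assigns semantic names to the resulting clusters through segmentation-cluster association. Experiments on real-world datasets show that DDS achieves the best performance among representative structure-oriented annotation-free baselines, improving oAcc, mAcc, and mIoU by up to 5.9%, 8.1%, and 2.4%, respectively.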
论文提出了一种资源高效的无标注3D场景理解框架DDS,结合多粒度蒸馏和图扩散来引导3D骨干并传播语义信息,生成连贯的区域表示。实验表明,DDS在oAcc、mAcc和mIoU上分别提高了最多5.9%、8.1%和2.4%。
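Graph diffusion over superpoints can be illustrated with a generic iterative scheme that needs no spectral decomposition. The update rule below (a convex mix of the initial features and the neighbour average) is a common textbook form chosen for illustration; the abstract does not specify the exact rule DDS uses, and the chain graph and features are toy data.

```python
import numpy as np

def diffuse(features, adjacency, alpha=0.5, steps=10):
    """Iteratively mix each node's features with the mean of its neighbours."""
    degree = adjacency.sum(axis=1, keepdims=True)
    transition = adjacency / np.maximum(degree, 1e-12)  # row-normalized graph
    out = features.copy()
    for _ in range(steps):
        out = (1 - alpha) * features + alpha * (transition @ out)
    return out

# Four superpoints on a chain 0-1-2-3; only the end nodes carry a semantic cue.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F0 = np.array([[1.0, 0.0],
               [0.0, 0.0],
               [0.0, 0.0],
               [0.0, 1.0]])
F = diffuse(F0, A)
```

After diffusion, the unlabeled superpoint 1 leans toward the class of its neighbour 0, and superpoint 2 toward the class of neighbour 3, which is the region-consistency effect the abstract describes.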
Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms
Authors: Heehwan Wang, Joonwoo Kwon, Sooyoung Kim, Jungwoo Seo, Shinjae Yoo, Yuewei Lin, Jiook Cha
First: 2024-11-24T16:53:34+00:00 · Latest: 2026-05-13T09:39:21+00:00
Comments: Accepted by ICIP 2026
Abstract
Music style transfer blends source structure with reference style to enable personalized music creation. However, existing zero-shot methods often struggle to capture fine-grained audio nuances, relying on coarse text descriptions or requiring expensive task-specific training. We propose Stylus, a training-free framework that repurposes pretrained image diffusion models for music style transfer in the Mel-spectrogram domain. By treating audio as structured time-frequency images, Stylus manipulates self-attention by injecting style keys and values while preserving source structural queries. To ensure high fidelity, we introduce a phase-preserving reconstruction strategy to mitigate spectrogram inversion artifacts, alongside a classifier-free-guidance-inspired control for adjustable stylization. Extensive evaluations including 2,925 human ratings demonstrate that Stylus outperforms state-of-the-art baselines, achieving 34.1% higher content preservation and 25.7% better perceptual quality. Our work validates that generic image priors can be effectively leveraged for the training-free transformation of structured Mel-spectrograms. Code and materials are available at https://github.com/Sooyyoungg/Stylus.git.
Summary / 总结
Stylus is a training-free framework that uses pretrained image diffusion models to perform music style transfer on Mel-spectrograms. By treating audio as structured time-frequency images and manipulating self-attention, Stylus achieves high fidelity and adjustable stylization. Evaluations show that Stylus outperforms existing methods, with 34.1% higher content preservation and 25.7% better perceptual quality.
研究旨在解决现有零样本音乐风格转移方法难以捕捉细微的音频特征且需要昂贵的特定任务训练的问题。Stylus 提出了一种无需训练的框架,利用预训练的图像扩散模型来操作梅尔频谱图。通过注入风格键和值同时保留源结构查询,Stylus 实现了高保真度和可调风格化。通过 2,925 人的主观评价,Stylus 的表现优于最先进的方法,内容保真度提高了 34.1%,感知质量提高了 25.7%。
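The attention manipulation described for Stylus, keeping the source queries while injecting style keys and values, amounts to a cross-attention between the two spectrograms. The sketch below shows only that single operation on random features. The real method operates inside a pretrained diffusion model's self-attention layers, and the shapes and names here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def style_injected_attention(q_source, k_style, v_style):
    """Scaled dot-product attention: source queries attend to style keys/values."""
    scale = 1.0 / np.sqrt(q_source.shape[-1])
    weights = softmax((q_source @ k_style.T) * scale)
    return weights @ v_style, weights

rng = np.random.default_rng(0)
q_src = rng.normal(size=(6, 8))     # queries from the source spectrogram patches
k_sty = rng.normal(size=(10, 8))    # keys from the style spectrogram patches
v_sty = rng.normal(size=(10, 8))    # values from the style spectrogram patches
out, weights = style_injected_attention(q_src, k_sty, v_sty)
```

Because the output has one row per source query, the source layout (which time-frequency patch goes where) is preserved, while the content of each row is assembled from style values.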
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Authors: Camile Lendering, Erkut Akdag, Egor Bondarev
Venue: CVPR 2026
First: 2026-02-26T13:52:57+00:00 · Latest: 2026-05-13T09:15:01+00:00
Comments: Accepted to CVPR 2026. Revised version with corrected AU-PRO evaluation and recomputed metrics
Abstract
Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 97.1% and 97.5% on the MVTec-AD dataset, and 93.2% and 98.2% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
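The two stages described above can be sketched with plain NumPy. Random vectors stand in for DINOv2 patch features, and the choices of mean-centering, squared residual as the score, and three components are assumptions for illustration; the released code linked above is the authoritative reference.

```python
import numpy as np

def fit_subspace(normal_patches, num_components):
    """Fit PCA to patch features of normal images via SVD."""
    mean = normal_patches.mean(axis=0)
    _, _, vt = np.linalg.svd(normal_patches - mean, full_matrices=False)
    return mean, vt[:num_components]   # rows are principal directions

def anomaly_scores(patches, mean, components):
    """Squared reconstruction residual of each patch w.r.t. the normal subspace."""
    centered = patches - mean
    reconstruction = centered @ components.T @ components
    return np.sum((centered - reconstruction) ** 2, axis=1)

rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 32))   # normal data lies near a 3-D subspace
normal = rng.normal(size=(200, 3)) @ basis + 0.01 * rng.normal(size=(200, 32))
mean, comps = fit_subspace(normal, num_components=3)

test_normal = rng.normal(size=(10, 3)) @ basis
test_anomalous = test_normal + rng.normal(size=(10, 32))  # off-subspace perturbation
scores_normal = anomaly_scores(test_normal, mean, comps)
scores_anomalous = anomaly_scores(test_anomalous, mean, comps)
```

Patches that lie in the normal subspace reconstruct almost perfectly and score near zero, while perturbed patches leave a large residual. Reshaping per-patch scores back to the patch grid would give a pixel-level anomaly map.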
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
Authors: Hongli Liu, Yu Wang, Shengjie Zhao
First: 2026-05-13T08:54:38+00:00 · Latest: 2026-05-13T08:54:38+00:00
Comments: Accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)
Abstract
Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences, and from inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework, consisting of a semantic-alignment component and a temporal-aware component, effectively bridging the semantic and temporal gaps and transferring the sequence modeling capability of Mamba into the FSAR. The semantic alignment module introduces a Temporal Semantic Attention (TSA) mechanism, which performs frame-level cross-modal alignment with textual cues, ensuring fine-grained semantic-temporal consistency. The temporal-aware module incorporates a Semantic Temporal Prototype Refiner (STPR) that integrates semantic-guided Mamba blocks with multi-frequency temporal sampling and bidirectional state-space refinement, yielding semantically aligned prototypes with enhanced discriminative fidelity and temporal consistency. Furthermore, temporally dependent class descriptors derived from large language models (LLMs) provide long-range semantic guidance. Extensive experiments on five FSAR benchmarks demonstrate the consistent superiority of STAR over state-of-the-art methods. For instance, STAR achieves up to 8.1% and 6.7% gains on the SSv2-Full and SSv2-Small datasets under the 1-shot setting, and 7.3% on HMDB51, validating its effectiveness under limited supervision. The code is available at https://github.com/HongliLiu1/STAR-main.
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
Authors: Sangin Lee, Yukyung Choi
First: 2026-05-13T08:40:40+00:00 · Latest: 2026-05-13T08:40:40+00:00
Comments: 18 pages, 8 figures
Abstract
In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational overhead. To address this, recent studies have explored pruning redundant or less informative visual tokens for image understanding tasks. However, these methods struggle with pixel grounding tasks, where token importance is highly contingent on the input text. Through an in-depth analysis of CLIP, we observe that visual tokens located within referent regions often exhibit low similarity to the textual representation. Motivated by this insight, we introduce LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. By reversing the ranking of CLIP's visual-text similarity, LiteLVLM effectively retains visual tokens covering the referent regions, while recovering context tokens to enable clear foreground-background separation. Extensive experiments demonstrate that LiteLVLM significantly outperforms existing methods by over 5% across diverse token budgets. Without any training or fine-tuning, LiteLVLM maintains 90% of the original performance with a 22% speedup and a 2.3x memory reduction. Our code is available at https://github.com/sejong-rcv/LiteLVLM.
Summary / 总结
The paper addresses the computational overhead caused by visual tokens in large vision-language models. Existing token pruning methods work for image understanding tasks but struggle with pixel grounding, where token importance depends on the input text. Motivated by the observation that visual tokens within referent regions often have low similarity to the textual representation in CLIP, the authors propose LiteLVLM, a training-free, text-guided token pruning method. LiteLVLM reverses CLIP's visual-text similarity ranking to retain the tokens covering the referent regions and recovers context tokens for clear foreground-background separation. Experiments show that LiteLVLM outperforms existing methods by over 5% across various token budgets and, without training or fine-tuning, maintains 90% of the original performance with a 22% speedup and a 2.3x memory reduction.
该论文针对大型视觉-语言模型中由视觉令牌引起的计算开销问题,这些令牌对于图像理解任务至关重要但对像素定位任务构成挑战。受CLIP中视觉-文本相似度排名观察到的视觉令牌在参照区域与文本表示低相似性的启发,作者提出了一种无需训练的令牌剪枝方法LiteLVLM。LiteLVLM通过反转CLIP的视觉-文本相似度排名来保留相关视觉令牌并恢复上下文令牌,从而改善前景与背景的分离。实验表明,LiteLVLM在各种令牌预算下比现有方法高出超过5%,实现了22%的加速和2.3倍的内存减少,且无需训练或微调。
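The reversed-ranking idea can be shown on a handful of similarity scores. In this sketch, the tokens least similar to the text are kept first, and a few of the most similar tokens are added back as context. The context-recovery rule is a hypothetical stand-in, since the abstract does not say how LiteLVLM selects context tokens; the scores are made up.

```python
def prune_tokens_reversed(similarities, keep, context=0):
    """Keep the `keep` tokens least similar to the text, plus `context` of the most similar."""
    order = sorted(range(len(similarities)), key=lambda i: similarities[i])
    selected = order[:keep]
    if context > 0:
        # Hypothetical stand-in for the paper's context-token recovery step.
        selected += [i for i in reversed(order) if i not in selected][:context]
    return sorted(selected)

# Made-up CLIP visual-text similarities for eight visual tokens.
sims = [0.31, 0.05, 0.28, 0.02, 0.30, 0.07, 0.25, 0.04]
referent_only = prune_tokens_reversed(sims, keep=3)
with_context = prune_tokens_reversed(sims, keep=3, context=1)
```

A conventional pruner would keep the highest-similarity tokens (indices 0, 4, 2 here); the reversed ranking keeps indices 1, 3, 7 instead, which under the paper's observation are the ones more likely to cover the referent region.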
Coupling-Informed Transport Maps for Bayesian Filtering in Nonlinear Dynamical Systems
Authors: Dengfei Zeng, Lijian Jiang, Shuyu Sun, Dunhui Xiao
First: 2026-05-13T08:36:49+00:00 · Latest: 2026-05-13T08:36:49+00:00
Comments: 29 pages, 14 figures
Abstract
A likelihood-free transport filtering method is proposed based on the couplings between state and observation variables. By exploiting a block-triangular structure in the transport map, the analysis step of filtering is reformulated as the minimization of the maximum mean discrepancy (MMD) between the true joint measure and its transport-based approximation. To circumvent the non-convexity in the MMD optimization, we introduce a training-free transport filter method via gradient flows, which leads to an analytic computation for the transport map that implies the steepest descent direction of the MMD. The proposed approach accurately approximates non-Gaussian filtering posteriors and avoids particle collapse. We provide a convergence analysis for the expectation of the MMD between the approximated posterior and the truth posterior. Finally, we extend the method to high-dimensional problems through domain localization. Numerical examples demonstrate the superior performance of our approach over conventional filtering methods in nonlinear, non-Gaussian scenarios.
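The objective at the core of this method, the maximum mean discrepancy, has a simple sample-based estimator. The sketch below computes the biased (V-statistic) estimate of squared MMD with a Gaussian kernel; the kernel choice and bandwidth are assumptions, and the block-triangular transport map and gradient-flow update from the paper are not reproduced.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix between the rows of a and b."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_same = rng.normal(size=(200, 2))          # same distribution as x
y_shift = rng.normal(size=(200, 2)) + 2.0   # shifted distribution
mmd_same = mmd_squared(x, y_same)
mmd_shift = mmd_squared(x, y_shift)
```

A filter of the kind described would move particles so as to decrease this quantity between the transported samples and the joint state-observation samples; here the estimator is only evaluated, not minimized.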
LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters
Authors: Beomjin Ahn, Jungmin Kwon, Chanyong Jung, Jaewook Chung
First: 2026-05-13T08:27:23+00:00 · Latest: 2026-05-13T08:27:23+00:00
Comments: Accepted to ICIP 2026
Abstract
Foundation models and low-rank adapters enable efficient on-device generative AI but raise risks such as intellectual property leakage and model recovery attacks. Existing defenses are often impractical because they require retraining or access to the original dataset. We propose LoREnc, a training-free framework that secures both FMs and adapters via spectral truncation and compensation. LoREnc suppresses dominant low-rank components of FM weights, compensates for the missing information in authorized adapters, and further applies orthogonal reparameterization to obscure structural fingerprints of the protected adapter. Unauthorized users produce structurally collapsed outputs, while authorized users recover exact performance. Experiments demonstrate that LoREnc provides strong protection against model recovery with under 1% computational overhead.
中文标题/摘要
标题:LoREnc:低秩加密用于保护基础模型和LoRA适配器
基础模型和低秩适配器能够实现高效的设备端生成式AI,但也会带来知识产权泄露和模型恢复攻击的风险。现有防御措施通常不实用,因为它们需要重新训练或访问原始数据集。我们提出了一种无需训练的框架LoREnc,通过谱截断和补偿来同时保护基础模型和适配器。LoREnc抑制了基础模型权重中的主要低秩成分,补偿授权适配器中缺失的信息,并进一步应用正交重参数化以掩盖受保护适配器的结构特征。未经授权的用户会产生结构上坍缩的输出,而授权用户可以恢复精确性能。实验表明,LoREnc在不到1%的计算开销下提供了强大的模型恢复保护。
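One way to picture the "spectral truncation and compensation" idea in the LoREnc abstract is the sketch below. It is an illustrative reading under stated assumptions, not the authors' implementation: the function name, the use of a plain SVD, and the choice to express the compensation as a pair of low-rank factors are all hypothetical. The random orthogonal matrix `q` illustrates how an orthogonal reparameterization can change the factors without changing their product.

```python
import numpy as np


def truncate_and_compensate(weight, rank, seed=0):
    """Split a weight matrix into a truncated part and low-rank compensation factors.

    Returns (protected, b, a) such that protected + b @ a reconstructs `weight`.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Dominant low-rank component, written as a product of two thin factors.
    b = u[:, :rank] * s[:rank]  # shape (out_dim, rank)
    a = vt[:rank, :]            # shape (rank, in_dim)
    # Protected weights: the dominant spectral directions are removed.
    protected = weight - b @ a
    # Random orthogonal matrix from a QR decomposition (assumed mixing step).
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((rank, rank)))
    # Since q @ q.T = I, (b @ q) @ (q.T @ a) equals b @ a, but the factors differ.
    return protected, b @ q, q.T @ a
```

In this toy setting, a holder of both factors recovers the original matrix as `protected + b @ a`, while `protected` alone is missing its largest singular directions.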
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
Authors: Yiyun Zhou, Zhonghua Jiang, Wenkang Han, Kunxi Li, Mingjing Xu, Chang Yao, Jingyuan Chen
Venue: IJCAI 2026
First: 2026-05-13T08:24:55+00:00 · Latest: 2026-05-13T08:24:55+00:00
Comments: Accepted by IJCAI 2026
Abstract
Efficient transfer learning methods for large-scale vision-language models (e.g., CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A$_3$B$_2$, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A$_3$B$_2$ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A$_3$B$_2$ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A$_3$B$_2$ consistently outperforms 11 competitive prompt- and adapter-based baselines.
Summary / 总结
The paper addresses the Branch Bias issue in vision-language image classification, where adapting the image encoder does not always improve performance under out-of-distribution settings. It proposes A$_3$B$_2$, an Adaptive Asymmetric Adapter, which introduces Uncertainty-Aware Adapter Dampening to automatically suppress image-branch adaptation when prediction uncertainty is high. Experiments show that A$_3$B$_2$ outperforms 11 competitive prompt- and adapter-based baselines across three few-shot image classification tasks on 11 datasets.
论文解决了视觉-语言图像分类中的分支偏差问题,即现有方法假设图像和文本分支具有同等重要性。它提出了A$_3$B$_2$,一种自适应不对称适配器,使用不确定性感知适配器抑制,在预测不确定性高时抑制图像分支的适应。实验表明,A$_3$B$_2$在三个少样本图像分类任务的11个数据集上优于11个竞争性基线。
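The abstract above says that image-branch adaptation is suppressed when prediction uncertainty is high, but it does not give the formula. The sketch below shows one plausible gating scheme under loud assumptions: uncertainty is measured as the normalized entropy of the softmax over class logits, and the gate scales a residual adapter update. The function names and the entropy-based gate are hypothetical and should not be read as the UAAD definition.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def dampening_gate(logits):
    """Gate in [0, 1]: near 1 for confident predictions, near 0 for uncertain ones."""
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    # Dividing by log(num_classes) normalizes the entropy to the range [0, 1].
    return 1.0 - entropy / np.log(logits.shape[-1])


def dampened_features(image_features, adapter_delta, logits):
    """Add the adapter update to the image features, scaled by the confidence gate."""
    gate = dampening_gate(logits)[..., None]
    return image_features + gate * adapter_delta
```

With uniform logits the gate is close to zero and the frozen image features pass through almost unchanged; with a sharply peaked prediction the adapter update is applied nearly in full.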
Dual-Pathway Circuits of Object Hallucination in Vision-Language Models
Authors: Jiaxin Liu, Ding Zhong, Yue Wang, Zhidong Yang, Zhaolu Kang, Guangyuan Dong, Qishi Zhan, Pengcheng Fang, Aofan Liu
First: 2026-05-13T08:20:01+00:00 · Latest: 2026-05-13T08:20:01+00:00
Abstract
Vision-language models (VLMs) have demonstrated remarkable capabilities in bridging visual perception and natural language understanding, enabling a wide range of multimodal reasoning tasks. However, they often produce object hallucinations, describing content absent from the input image, which limits their reliability and interpretability. To address this limitation, we propose Dual-Pathway Circuit Analysis, a framework that identifies and characterizes hallucination-related circuits in VLMs for mechanistic understanding and causal probing. We first apply activation patching across five architecturally diverse VLMs to identify a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. We then introduce Conditional Pathway Analysis (CPA) to characterize pathway-level interactions, revealing that grounding components remain strongly redundant in both correct and hallucinating samples but undergo a consistent polarity flip, shifting from supporting the ground truth on correct samples to aligning with the hallucinated answer on erroneous ones. We further perform targeted suppression of hallucination-pathway components, showing that scaling these components reduces object hallucination by up to 76% with minimal accuracy cost, and validate that the same circuit selectively transfers to relational but not attribute hallucination. Evaluations on POPE-adversarial and AMBER show that the identified circuits are consistent across architectures, support causal intervention, and transfer selectively across hallucination types.
Summary / 总结
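This study addresses object hallucination in vision-language models (VLMs) by proposing a Dual-Pathway Circuit Analysis framework that identifies and characterizes hallucination-related circuits. Applying activation patching across five architecturally diverse VLMs, it identifies a visual grounding pathway that supports correct predictions and a hallucination pathway that drives erroneous outputs. Conditional Pathway Analysis (CPA) reveals that grounding components remain strongly redundant in both correct and hallucinating samples but flip polarity, supporting the ground truth on correct samples and aligning with the hallucinated answer on erroneous ones. Targeted suppression of hallucination-pathway components reduces object hallucination by up to 76% with minimal accuracy cost, and the same circuit transfers selectively to relational but not attribute hallucination.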
研究旨在通过提出双路径电路分析框架来解决视觉语言模型(VLMs)中的物体幻觉问题。该框架识别并表征了VLMs中的幻觉相关电路,发现了一个支持正确预测的视觉定位路径和一个导致错误输出的幻觉路径。研究发现,定位组件在正确和幻觉样本中都保持冗余,但在正确样本中支持真实信息,在错误样本中则与幻觉答案对齐。通过针对性地抑制幻觉路径组件,可以将物体幻觉减少高达76%,且对准确性的影响很小。同样的电路在关系性幻觉中选择性地转移,但不适用于属性幻觉,显示出跨架构和幻觉类型的稳定性。
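The suppression step described in the abstract above (scaling hallucination-pathway components) can be illustrated on a toy additive model. This is a sketch under strong assumptions, not the authors' procedure: the model is reduced to per-component output vectors that sum into a residual vector, and `unembed`, `true_id`, and `halluc_id` are hypothetical names for a readout matrix and two answer indices.

```python
import numpy as np


def scale_components(component_outputs, component_ids, scale):
    """Return a copy of the per-component outputs with selected components rescaled."""
    scaled = np.asarray(component_outputs, dtype=float).copy()
    scaled[list(component_ids)] *= scale
    return scaled


def answer_logit_gap(component_outputs, unembed, true_id, halluc_id):
    """Logit of the true answer minus logit of the hallucinated answer."""
    # Toy assumption: component outputs add up to the final residual vector.
    residual = np.asarray(component_outputs, dtype=float).sum(axis=0)
    logits = residual @ unembed
    return logits[true_id] - logits[halluc_id]
```

Comparing `answer_logit_gap` before and after `scale_components` with `scale < 1` on the suspected components gives a simple measure of how much those components push the model toward the hallucinated answer.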
Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation
Authors: Nikitas Chatzis, Marios Loizou, Evangelos Kalogerakis
First: 2026-05-13T07:55:29+00:00 · Latest: 2026-05-13T07:55:29+00:00
Abstract
Recent 3D generative models can synthesize high-quality assets, but their outputs are typically static: they lack the skeletal rigs, joint hierarchies, and skinning weights required for animation. This limits their use in games, film, simulation, virtual agents, and embodied AI, where assets must not only look plausible but also move plausibly. We introduce Rigel3D, a generative method for animation-ready 3D assets represented as rigged meshes. Unlike post-hoc auto-rigging methods that attach rigs to completed shapes, our method jointly models geometry and rig structure through coupled surface and skeleton structured latent representations. A rig-aware autoencoder decodes these representations into mesh geometry, skeleton topology, joint coordinates, and skinning weights, while a two-stage latent generative model synthesizes both surface and skeleton representations for image-conditioned generation. To support downstream animation workflows, we further introduce an open-vocabulary joint labeling module that embeds generated joints into a shared vision-language space, enabling correspondence to arbitrary retargeting templates. Experiments on large-scale rigged asset datasets demonstrate that our method generates diverse, high-quality animation-ready assets and outperforms existing rigging baselines across multiple metrics.
中文标题/摘要
标题:Rigel3D:动画准备就绪的3D资产生成中的骨骼感知潜在表示
最近的3D生成模型可以合成高质量的资产,但它们的输出通常是静态的:缺乏用于动画所需的骨骼绑定、关节层次结构和蒙皮权重。这限制了它们在游戏、电影、模拟、虚拟代理和具身AI中的应用,其中资产不仅需要看起来逼真,还需要移动得逼真。我们提出了Rigel3D,这是一种生成方法,用于表示为绑定网格的动画准备就绪的3D资产。与在完成形状后附加绑定的后处理自动绑定方法不同,我们的方法通过耦合表面和骨骼结构化的潜在表示联合建模几何和绑定结构。一种骨骼感知自编码器将这些表示解码为网格几何、骨架拓扑、关节坐标和蒙皮权重,而两阶段的潜在生成模型则合成表面和骨架表示以进行图像条件生成。为了支持下游动画工作流,我们进一步引入了一个开放词汇量的关节标注模块,将生成的关节嵌入到共享的视觉-语言空间中,从而能够与任意目标模板建立对应关系。在大规模绑定资产数据集上的实验表明,我们的方法生成了多样且高质量的动画准备就绪的资产,并在多个指标上优于现有的绑定基线。
Summary / 总结
Rigel3D is a generative method that creates 3D assets with integrated skeletal rigs, enabling animation. It uses coupled latent representations for geometry and rig structure, and a rig-aware autoencoder to decode these into mesh geometry, skeleton topology, joint coordinates, and skinning weights. Experiments show that Rigel3D generates diverse, high-quality assets suitable for animation and outperforms existing rigging methods across multiple metrics.
Rigel3D 是一种生成方法,能够创建带有集成骨骼框架的 3D 资产,便于动画制作。它使用耦合的潜在表示来表示几何结构和骨架结构,并使用骨架感知自动编码器将这些表示解码为网格几何、骨架拓扑、关节坐标和蒙皮权重。实验表明,Rigel3D 生成了多样且高质量的适合动画的资产,并在多个指标上优于现有绑定方法。
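The abstract above lists mesh geometry, joints, and skinning weights as outputs. To show how such outputs are typically consumed downstream, here is a minimal sketch of standard linear blend skinning. It is not part of Rigel3D itself; the array layouts (`vertices` of shape `(V, 3)`, `skin_weights` of shape `(V, J)` with rows summing to 1, `joint_transforms` of shape `(J, 4, 4)`) are assumptions chosen for illustration.

```python
import numpy as np


def linear_blend_skinning(vertices, skin_weights, joint_transforms):
    """Deform vertices as a weighted blend of per-joint rigid transforms."""
    # Homogeneous coordinates: append a 1 to every vertex, giving shape (V, 4).
    ones = np.ones((len(vertices), 1))
    homogeneous = np.concatenate([vertices, ones], axis=1)
    # Apply every joint transform to every vertex, giving shape (V, J, 4).
    per_joint = np.einsum("jab,vb->vja", joint_transforms, homogeneous)
    # Blend the transformed positions with the skinning weights, shape (V, 4).
    blended = np.einsum("vj,vja->va", skin_weights, per_joint)
    return blended[:, :3]
```

A vertex bound entirely to one joint follows that joint's transform exactly, while a vertex with mixed weights lands at the weighted average of the transformed positions.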