ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Authors: Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
First: 2026-04-21T17:47:26+00:00 · Latest: 2026-04-21T17:47:26+00:00
Abstract
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
中文标题/摘要
标题:ReImagine: 通过图像优先合成重新思考高质量人类视频生成
由于在有限的多视角数据下难以同时建模人类外观、运动和摄像机视角,人类视频生成仍然具有挑战性。现有方法通常分别处理这些因素,导致控制力有限或视觉质量降低。我们从图像优先的角度重新审视这一问题,通过图像生成学习高质量的人类外观,并将其作为视频合成的先验,将外观建模与时间一致性解耦。我们提出了一种结合预训练图像主干和基于SMPL-X的运动指导的可控制姿态和视角的流水线,并基于预训练的视频扩散模型引入了一个无需训练的时间细化阶段。我们的方法在不同姿态和视角下生成高质量、时间一致的视频。我们还发布了标准人类数据集和辅助模型,用于合成人类图像。代码和数据可在https://github.com/Taited/ReImagine公开获取。
Summary / 总结
The research aims to improve the quality and controllability of human video generation by addressing the challenges of modeling human appearance, motion, and camera viewpoint. The method proposes an image-first synthesis approach, using a pretrained image backbone for high-quality appearance and SMPL-X for motion guidance, with a training-free temporal refinement stage. Key findings include the generation of high-quality, temporally consistent videos under various poses and viewpoints, and the release of a canonical human dataset and an auxiliary model for compositional human image synthesis.
研究旨在通过解决人类外观、运动和摄像机视角建模的挑战来提高人类视频生成的质量和可控性。方法采用图像优先合成策略,使用预训练的图像骨干网络进行高质量外观建模,并结合SMPL-X进行运动指导,同时采用基于预训练视频扩散模型的无训练阶段进行时间上的细化。主要发现包括在各种姿态和视角下生成高质量、时间一致的视频,并发布了标准的人类数据集和辅助模型以用于合成的人类图像生成。
InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
Authors: Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll
First: 2026-04-21T16:53:18+00:00 · Latest: 2026-04-21T16:53:18+00:00
Abstract
Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
中文标题/摘要
标题:InHabit:利用图像基础模型实现可扩展的3D人体放置
训练具身智能体理解3D场景需要大量的人类有意义地与多种环境互动的数据,但此类数据稀缺。现实世界中的动作捕捉成本高昂且局限于受控环境,而现有的合成数据集依赖于简单的几何启发式方法,忽略了丰富的场景上下文。相比之下,基于互联网规模数据训练的2D基础模型已经隐式地获得了人类-环境交互的常识知识。为了将这些知识转移到3D场景中,我们提出了InHabit,一种全自动且可扩展的数据生成器,用于填充3D场景中的人类。InHabit遵循渲染-生成-提升原则:给定一个渲染的3D场景,一个视觉-语言模型提出上下文相关的行为,一个图像编辑模型插入一个人,然后通过优化过程将编辑结果提升为与场景几何形状对齐的物理上合理的SMPL-X身体。应用于Habitat-Matterport3D,InHabit生成了首个大规模的逼真3D人体-场景交互数据集,包含78,000个样本,覆盖800个建筑规模场景,具有完整的3D几何形状、SMPL-X身体和RGB图像。将我们的样本与标准训练数据结合使用可以提高基于RGB的3D人体-场景重建和接触估计性能,并在感知用户研究中,我们的数据在78%的情况下被偏好于当前最先进的方法。
Summary / 总结
InHabit leverages image foundation models to generate large-scale 3D human-scene interaction data, addressing the scarcity of such data. Following a render-generate-lift principle, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization step lifts the result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, this yields 78K photorealistic samples across 800 scenes. Augmenting standard training data with these samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study the data is preferred over the state of the art in 78% of cases.
InHabit 利用图像基础模型生成大规模的3D人体-场景交互数据,解决此类数据稀缺的问题。通过渲染-生成-提升的原则,它提出上下文相关的行为,插入人体,并优化为物理上合理的身体。这产生了78K的逼真样本,覆盖800个场景,改善了3D重建和接触估计,并在感知用户研究中,有78%的情况下被偏好于现有方法。
CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation
Authors: Yanhui Chen, Baoyao Yang, Siqi Liu, Jingchao Wang
First: 2026-04-21T16:37:18+00:00 · Latest: 2026-04-21T16:37:18+00:00
Abstract
SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.
Summary / 总结
CoCo-SAM3 improves multi-class open-vocabulary semantic segmentation by decoupling inference into two steps: intra-class enhancement and inter-class competition. The method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency, then performs inter-class competition on a unified, comparable scale, enabling direct pixel-wise comparison among all candidate classes. This stabilizes multi-class inference and mitigates inter-class conflicts. Without any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.
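The two-step inference described in the CoCo-SAM3 abstract can be illustrated with a toy sketch. This is not the authors' implementation: averaging the synonym score maps, the `threshold` parameter, and the function names `aggregate_synonyms` and `compete` are all assumptions made for illustration, since the abstract does not specify how evidence is aligned or how the unified scale is built.

```python
from typing import Dict, List, Optional

ScoreMap = List[List[float]]  # per-pixel scores in [0, 1], indexed [row][col]


def aggregate_synonyms(maps: List[ScoreMap]) -> ScoreMap:
    """Intra-class step (assumed): average the score maps of synonymous prompts."""
    n = len(maps)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / n for x in range(w)] for y in range(h)]


def compete(class_maps: Dict[str, ScoreMap],
            threshold: float = 0.5) -> List[List[Optional[str]]]:
    """Inter-class step: each pixel goes to the highest-scoring class.

    Pixels whose best score is below `threshold` (a hypothetical cutoff)
    are left unlabeled (None).
    """
    names = list(class_maps)
    first = class_maps[names[0]]
    h, w = len(first), len(first[0])
    labels: List[List[Optional[str]]] = []
    for y in range(h):
        row: List[Optional[str]] = []
        for x in range(w):
            best = max(names, key=lambda c: class_maps[c][y][x])
            row.append(best if class_maps[best][y][x] >= threshold else None)
        labels.append(row)
    return labels


# Toy 1x2 image: two synonymous prompts for "cat", one prompt for "dog".
cat = aggregate_synonyms([[[0.9, 0.2]], [[0.7, 0.4]]])
dog = aggregate_synonyms([[[0.6, 0.9]]])
labels = compete({"cat": cat, "dog": dog})
```

Because every class is scored on the same scale before the per-pixel maximum is taken, no pixel can be claimed by two classes at once, which is the overlap problem the abstract attributes to independently generated masks.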
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
Authors: Weidong Chen, Dexiang Hong, Zhendong Mao, Yutao Cheng, Xinyan Liu, Lei Zhang, Yongdong Zhang
First: 2026-04-21T16:20:19+00:00 · Latest: 2026-04-21T16:20:19+00:00
Abstract
Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, i.e., the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, e.g., achieving an overall average improvement of 23.7% across all metrics.
SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets
Authors: Inhyeok Choi, Hyuncheol Park
First: 2026-04-21T16:11:56+00:00 · Latest: 2026-04-21T16:11:56+00:00
Comments: 11 pages, 9 figures
Abstract
Edge-cloud hybrid inference offloads difficult inputs to a powerful remote model, but the uplink channel imposes hard per-request constraints on the number of bits that can be transmitted. We show that selecting transmitted content based solely on attention-based importance, the standard approach in collaborative inference, is inherently limited under hard budgets. Two findings support this claim. First, replacing high-importance units with low-importance but complementary ones improves server accuracy. This shows that what matters is not individual importance but how well the transmitted set covers diverse aspects of the input. Second, spatially uniform selection without any content information achieves competitive accuracy at moderate budgets. This confirms that spatial coverage alone carries independent value. Based on this analysis, we propose SAGE (Semantic Attention-Guided Evidence), a principled, training-free method that combines importance filtering with embedding-diversity sampling. SAGE achieves 93% of the server ceiling in offloaded accuracy while transmitting fewer than half of the available evidence units on ImageNet-1K, substantially outperforming importance-only composition.
Summary / 总结
The paper addresses edge-cloud inference under hard per-request limits on the number of bits that can be sent over the uplink channel. It finds that selecting content by attention-based importance alone is inherently limited under such budgets, because coverage of diverse aspects of the input matters more than the importance of individual units. SAGE, a training-free method, therefore combines importance filtering with embedding-diversity sampling. On ImageNet-1K it achieves 93% of the server ceiling in offloaded accuracy while transmitting fewer than half of the available evidence units, substantially outperforming importance-only composition.
论文针对边缘-云推理中受限上行链路通道的数据传输限制问题,发现仅基于注意力选择高重要性内容在严格预算下是不够的。提出了一种无训练的SAGE方法,结合重要性过滤和嵌入多样性采样。SAGE在传输更少单元的情况下实现了接近最优的服务器准确率,显示出显著的效率和效果提升。
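The combination of importance filtering and embedding-diversity sampling described in the SAGE abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `keep_ratio` parameter, the greedy farthest-point rule used for diversity, and the function name `select_units` are hypothetical choices.

```python
import math
from typing import List, Sequence


def select_units(importance: Sequence[float],
                 embeddings: Sequence[Sequence[float]],
                 budget: int,
                 keep_ratio: float = 0.5) -> List[int]:
    """Return indices of up to `budget` units to transmit.

    Step 1 (importance filtering): keep the top `keep_ratio` fraction of
    units by importance as candidates (at least `budget` of them).
    Step 2 (diversity sampling, assumed rule): greedily add the candidate
    farthest, in embedding space, from everything chosen so far.
    """
    n = len(importance)
    n_keep = max(budget, math.ceil(n * keep_ratio))
    candidates = sorted(range(n), key=lambda i: importance[i], reverse=True)[:n_keep]

    chosen = [candidates[0]]  # seed with the most important unit
    while len(chosen) < min(budget, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        best = max(
            remaining,
            key=lambda c: min(math.dist(embeddings[c], embeddings[s]) for s in chosen),
        )
        chosen.append(best)
    return chosen


# Units 0 and 1 are both important but nearly identical in embedding space.
importance = [0.9, 0.8, 0.7, 0.1]
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 9.0]]
picked = select_units(importance, embeddings, budget=2, keep_ratio=0.75)
```

In this toy case an importance-only rule would send units 0 and 1, which are redundant, whereas the sketch sends units 0 and 2. That mirrors the abstract's first finding that complementary units can be worth more than individually important ones.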
Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Authors: Boammani Aser Lompo, Marc Haraoui
Venue: NeurIPS 2025
First: 2025-09-09T17:52:26+00:00 · Latest: 2026-04-21T15:48:38+00:00
Comments: Accepted at the First Workshop on Foundations of Reasoning in Language Models, NeurIPS 2025. Available at: https://openreview.net/forum?id=fvJRsGwhPf
Abstract
Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
Authors: Chuou Xu, Liya Ji, Qifeng Chen
First: 2026-04-21T15:19:49+00:00 · Latest: 2026-04-21T15:19:49+00:00
Abstract
Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy "king"-"man"+"woman" = "queen" illustrates relational reasoning, yet replacing text with images of "king" and "man" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
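The abstract mentions post-training with a verifiable function and Group Relative Policy Optimization (GRPO). The sketch below shows only the group-relative advantage step that GRPO is generally described as using, with a hypothetical exact-match reward. The reward definition, the use of the population standard deviation, and the `eps` constant are assumptions and are not taken from the paper.

```python
import statistics
from typing import List


def verifiable_reward(prediction: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 on a case-insensitive exact match."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to "powder -> cake: which relation?" with gold "is made of".
answers = ["is made of", "is next to", "Is Made Of ", "contains"]
rewards = [verifiable_reward(a, "is made of") for a in answers]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative advantages. If every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.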
CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
Authors: Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee
Venue: CVPR 2026
First: 2026-02-23T23:17:12+00:00 · Latest: 2026-04-21T14:05:24+00:00
Comments: Accepted at CVPR 2026
Abstract
Recent vision-language models (VLMs) such as CLIP demonstrate impressive cross-modal reasoning, extending beyond images to 3D perception. Yet, these models remain fragile under domain shifts, especially when adapting from synthetic to real-world point clouds. Conventional 3D domain adaptation approaches rely on heavy trainable encoders, yielding strong accuracy but at the cost of efficiency. We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. Our approach projects 3D samples into multiple depth maps and exploits the frozen CLIP backbone, refined through a knowledge-driven prompt tuning scheme that integrates high-level language priors with geometric cues from a lightweight 3D encoder. To adapt task-specific features effectively, we apply parameter-efficient fine-tuning to CLIP's encoders and design an entropy-guided view sampling strategy for selecting confident projections. Furthermore, an optimal transport-based alignment loss and an uncertainty-aware prototype alignment loss collaboratively bridge source-target distribution gaps while maintaining class separability. Extensive experiments on PointDA-10 and GraspNetPC-10 benchmarks show that CLIPoint3D achieves consistent 3-16% accuracy gains over both CLIP-based and conventional encoder-based baselines. Project page: https://sarthakm320.github.io/CLIPoint3D.
中文标题/摘要
标题:CLIPoint3D:基于语言的少量样本无监督3D点云领域适应
近期的视觉-语言模型(VLMs)如CLIP展示了跨模态推理的惊人能力,不仅限于图像,还扩展到了3D感知。然而,这些模型在领域转换下仍然脆弱,尤其是在从合成点云到真实世界点云的适应过程中。传统的3D领域适应方法依赖于大量的可训练编码器,这虽然能获得较高的准确性,但代价是效率低下。我们提出了CLIPoint3D,这是第一个基于CLIP构建的少量样本无监督3D点云领域适应框架。我们的方法将3D样本投影到多个深度图中,并利用冻结的CLIP主干,通过一种知识驱动的提示调优方案进行精炼,该方案结合了高级语言先验和轻量级3D编码器提供的几何线索。为了有效适应任务特定特征,我们对CLIP的编码器进行了参数高效的微调,并设计了一种基于熵的视角采样策略来选择自信的投影。此外,基于最优传输的对齐损失和基于不确定性感知的原型对齐损失共同缩小了源-目标分布差距,同时保持了类别可分性。在PointDA-10和GraspNetPC-10基准测试上的广泛实验表明,CLIPoint3D在CLIP基线和传统编码器基线之上实现了3-16%的一致性准确性提升。项目页面:https://sarthakm320.github.io/CLIPoint3D/
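The entropy-guided view sampling mentioned in the CLIPoint3D abstract can be illustrated with a small sketch. It assumes each projected depth-map view already has a class-probability vector (for example from CLIP zero-shot logits after a softmax) and keeps the `k` views with the lowest prediction entropy. The function names and the fixed-`k` rule are hypothetical, since the abstract does not give the exact selection criterion.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_confident_views(view_probs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k views with the lowest prediction entropy."""
    order = sorted(range(len(view_probs)), key=lambda i: entropy(view_probs[i]))
    return sorted(order[:k])


# Three views of one point cloud; view 0 is maximally uncertain.
view_probs = [[0.5, 0.5], [0.9, 0.1], [0.99, 0.01]]
kept = select_confident_views(view_probs, k=2)
```

Low entropy means the prediction for that projection is concentrated on few classes, so discarding high-entropy views removes ambiguous projections before they influence adaptation.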
TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
Authors: Hongyu Zhang, Yufan Deng, Zilin Pan, Peng-Tao Jiang, Bo Li, Qibin Hou, Zhiyang Dou, Zhen Dong, Daquan Zhou
Venue: ICLR 2026
First: 2026-04-21T13:56:36+00:00 · Latest: 2026-04-21T13:56:36+00:00
Comments: ICLR 2026, code available at: https://github.com/Hong-yu-Zhang/TS-Attn
Abstract
Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.
中文标题/摘要
标题:TS-Attn:多事件视频生成中的时间分离注意力机制
从包含多个连续动作的复杂时间描述生成高质量视频是一个关键未解决问题。现有方法受到固有的权衡限制:使用多个短提示按顺序输入模型可以提高动作准确性,但会牺牲时间一致性,而单个复杂的提示则保持一致性,但牺牲了提示跟随能力。我们将这一问题归因于两个主要原因:1)视频内容与提示之间的时间对齐不准确,2)与运动相关的视觉对象及其相关文本条件之间的注意力耦合冲突。为了解决这些挑战,我们提出了一种新的、无需训练的注意力机制——时间分离注意力(TS-Attn),该机制动态重新排列注意力分布,以确保在多事件场景中具有时间意识和全局一致性。TS-Attn 可以无缝集成到各种预训练的文本到视频模型中,仅增加2%的推理时间,即可在 Wan2.1-T2V-14B 和 Wan2.2-T2V-A14B 上将 StoryEval-Bench 分数提高33.5%和16.4%。它还支持在模型之间进行多事件图像到视频生成的即插即用使用。源代码和项目页面可在 https://github.com/Hong-yu-Zhang/TS-Attn 获取。
Summary / 总结
This work addresses the generation of high-quality videos from complex temporal descriptions containing multiple sequential actions. It proposes Temporal-wise Separable Attention (TS-Attn), a training-free attention mechanism that dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn improves StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B, respectively, with only a 2% increase in inference time.
ReefNet: A Large-Scale Dataset and Benchmark for Fine-Grained Coral Reef Recognition
Authors: Abdulwahab Felemban, Yahia Battach, Faizan Farooq Khan, Yuqian Fu, Xuhui Liu, Yesmeen M. Khattab, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny
First: 2025-10-19T13:18:44+00:00 · Latest: 2026-04-21T13:18:32+00:00
Abstract
Coral reefs are rapidly declining under anthropogenic pressures (e.g., climate change), creating an urgent need for scalable and automated monitoring. Progress in data-driven coral analysis, however, is constrained by the scarcity of large-scale datasets with fine-grained labels that are taxonomically consistent across sites and studies. To address this gap, we introduce ReefNet, a large-scale public coral reef image dataset with point-level annotations mapped to the World Register of Marine Species (WoRMS) taxonomy. ReefNet aggregates imagery from 76 curated CoralNet sources and an additional reef site from Al-Wajh (Red Sea), totaling approximately 925K genus-level hard coral annotations. Through expert-driven verification and targeted filtering, we derive a high-confidence benchmark subset with 92% expert agreement over 39 hard-coral label classes, enabling reliable evaluation under realistic label noise and strong class imbalance. Beyond dataset construction, we establish a comprehensive benchmark spanning zero-shot, cross-domain few-shot adaptation, within-source evaluation, and cross-source transfer to the Al-Wajh dataset. Experiments with state-of-the-art vision-language models (VLMs), multimodal large language models (MLLMs), and vision-only backbones reveal substantial degradation in zero-shot and extremely few-shot regimes, while adaptation with in-domain supervision yields large gains yet still leaves a persistent gap under cross-source shift and on long-tail genera. These results highlight fundamental challenges in applying general-purpose multimodal models to biodiversity monitoring and underscore the importance of large-scale, taxonomically grounded, high-quality datasets. ReefNet serves as both a benchmark and a training resource for advancing fine-grained coral reef understanding.
中文标题/摘要
标题:ReefNet:大规模数据集和细粒度珊瑚礁识别基准
珊瑚礁在人为压力(如气候变化)下迅速衰退,迫切需要可扩展和自动化的监测手段。然而,数据驱动的珊瑚分析进展受限于缺乏大规模且细粒度标签一致的数据库。为解决这一问题,我们引入了ReefNet,这是一个包含世界海洋物种注册数据库(WoRMS)分类学点级注释的大型公共珊瑚礁图像数据集。ReefNet整合了76个精选的CoralNet来源和来自阿勒瓦赫(红海)的一个额外的珊瑚礁站点,总计约92.5万种属级别的硬珊瑚注释。通过专家驱动的验证和目标筛选,我们提取了一个高置信度基准子集,其中92%的专家对39种硬珊瑚标签类别的意见一致,能够在现实标签噪声和强类别不平衡条件下实现可靠的评估。除了数据集构建,我们还建立了涵盖零样本、跨域少量样本适应、来源内评估和跨源转移至阿勒瓦赫数据集的全面基准。实验表明,最先进的视觉-语言模型(VLMs)、多模态大型语言模型(MLLMs)和视觉专用骨干网络在零样本和极度少量样本条件下表现显著下降,而通过领域内监督进行适应虽然取得了显著进步,但在跨源转移和长尾种属上仍存在差距。这些结果突显了在生物多样性监测中应用通用多模态模型的基本挑战,并强调了大规模、分类学基础的高质量数据集的重要性。ReefNet既作为基准又作为训练资源,有助于推进细粒度珊瑚礁的理解。
Summary / 总结
The paper introduces ReefNet, a large-scale dataset with fine-grained point-level annotations for coral reef recognition, addressing the scarcity of such datasets. The dataset includes 925K genus-level hard coral annotations from 76 curated sources and an additional reef site, mapped to the WoRMS taxonomy. Key findings show that state-of-the-art models perform poorly in zero-shot and extremely few-shot regimes, while in-domain adaptation improves performance but still leaves gaps under cross-source transfer and for long-tail genera. This highlights the challenges in applying general-purpose models to biodiversity monitoring and emphasizes the need for high-quality, taxonomically grounded datasets.
论文介绍了ReefNet,这是一个包含精细粒度点级注释的大型珊瑚礁图像数据集,解决了此类数据集稀缺的问题。该数据集包含来自76个精选来源和一个额外的礁址的约925K个属级硬珊瑚注释,能够在现实标签噪声和强类别不平衡条件下实现可靠的评估。关键发现表明,最先进的视觉-语言模型和多模态大型语言模型在零样本和极端少样本情况下表现严重下降,而领域内适应可以提高性能,但在跨源转移和对长尾类群上仍存在差距。
MESA: A Training-Free Multi-Exemplar Deep Framework for Restoring Ancient Inscription Textures
Authors: Vasileios Toulatzis, Sofia Theodoridou, Ioannis Fudos
First: 2026-04-19T11:38:03+00:00 · Latest: 2026-04-21T13:03:32+00:00
Abstract
Ancient inscriptions frequently suffer from missing or corrupted regions caused by fragmentation, erosion, or other damage, hindering reading and analysis. We review prior image restoration methods and their applicability to inscription image recovery, then introduce MESA (Multi-Exemplar, Style-Aware), an image-level restoration method that uses well-preserved exemplar inscriptions (from the same epigraphic monument, material, or similar letterforms) to guide reconstruction of damaged text. MESA encodes VGG19 convolutional features as Gram matrices to capture exemplar texture, style, and stroke structure; for each neural network layer it selects the exemplar minimizing Mean-Squared Displacement (MSD) to the damaged input. Layer-wise contribution weights are derived from Optical Character Recognition-estimated character widths in the exemplar set to bias filters toward scales matching letter geometry, and a training mask preserves intact regions so synthesis is restricted to damaged areas. We also summarize prior network architectures and exemplar-based and single-image synthesis, inpainting, and Generative Adversarial Network (GAN) approaches, highlighting limitations that MESA addresses. Comparative experiments demonstrate the advantages of MESA. Finally, we provide a practical roadmap for choosing restoration strategies given available exemplars and metadata.
中文标题/摘要
标题:MESA:一种无需训练的多例深学习框架用于恢复古代铭文纹理
古代铭文经常因碎片化、侵蚀或其他损坏而出现缺失或损坏区域,这妨碍了阅读和分析。我们回顾了先前的图像恢复方法及其对铭文图像恢复的适用性,然后介绍了MESA(多例,风格感知)——一种图像级恢复方法,使用保存良好的铭文实例(来自同一铭文纪念碑、材料或类似字母形状)来指导损坏文本的重建。MESA 将 VGG19 卷积特征编码为格拉姆矩阵以捕捉实例纹理、风格和笔画结构;对于每个神经网络层,它选择使均方位移(MSD)最小的实例来指导损坏输入的重建。层贡献权重从光学字符识别估计的实例字符宽度中导出,以偏向匹配字母几何形状的尺度滤波器,并且训练掩码保留完整区域,使合成仅限于损坏区域。我们还总结了先前的网络架构和实例及单图像合成、 inpainting 和生成对抗网络(GAN)方法,强调 MESA 解决的局限性。比较实验展示了 MESA 的优势。最后,我们提供了一条实用的路线图,以选择给定可用实例和元数据的恢复策略。
Summary / 总结
MESA is a training-free method for restoring ancient inscriptions by leveraging well-preserved exemplar inscriptions. The framework encodes texture, style, and stroke structure as Gram matrices of VGG19 convolutional features and, for each network layer, selects the exemplar that minimizes the Mean-Squared Displacement (MSD) to the damaged input. Layer-wise contribution weights are derived from OCR-estimated character widths, and a mask preserves intact regions so that synthesis is restricted to damaged areas. Comparative experiments demonstrate the advantages of MESA over other approaches.
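The per-layer exemplar selection described in the MESA abstract can be sketched on toy data. The sketch assumes feature maps have already been extracted (the VGG19 forward pass is omitted) and uses the mean squared difference between Gram matrices as a stand-in for the paper's MSD criterion, whose exact definition the abstract does not give. All function names are hypothetical.

```python
from typing import List

Features = List[List[float]]  # C channels, each a flat list of N activations


def gram(features: Features) -> List[List[float]]:
    """C x C Gram matrix of channel correlations, normalized by N."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def mean_sq_diff(g1: List[List[float]], g2: List[List[float]]) -> float:
    """Mean squared difference between two Gram matrices (assumed distance)."""
    c = len(g1)
    return sum((g1[i][j] - g2[i][j]) ** 2 for i in range(c) for j in range(c)) / (c * c)


def pick_exemplar_per_layer(damaged: List[Features],
                            exemplars: List[List[Features]]) -> List[int]:
    """For each layer, return the index of the exemplar closest to the damaged input.

    `damaged[l]` holds the damaged image's features at layer l;
    `exemplars[e][l]` holds exemplar e's features at layer l.
    """
    picks = []
    for layer, feats in enumerate(damaged):
        target = gram(feats)
        scores = [mean_sq_diff(target, gram(ex[layer])) for ex in exemplars]
        picks.append(scores.index(min(scores)))
    return picks


# One layer, two channels, two spatial positions.
damaged = [[[1.0, 0.0], [0.0, 1.0]]]
exemplar_a = [[[1.0, 1.0], [1.0, 1.0]]]  # strongly correlated channels
exemplar_b = [[[0.0, 1.0], [1.0, 0.0]]]  # uncorrelated channels, like the input
picks = pick_exemplar_per_layer(damaged, [exemplar_a, exemplar_b])
```

Gram matrices discard spatial layout and keep only channel co-activation statistics, which is why exemplar B matches the input here even though its activations sit at different positions.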
VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
Authors: Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu, Pan Zhou, Hongfei Wang, Zhaofan Zou, Hao Sun, Xuelong Li
Venue: ICASSP 2026
First: 2026-04-21T12:40:07+00:00 · Latest: 2026-04-21T12:40:07+00:00
Comments: ICASSP 2026
Abstract
Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model's response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model's activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate their influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model's original computational efficiency.
Summary / 总结
The paper addresses the issue of Object Hallucination in Large Vision-Language Models (LVLMs) by proposing Visual Contrastive Editing (VCE), a zero-cost method that mitigates hallucination through visual contrastive perturbations. VCE uses Singular Value Decomposition to identify and suppress hallucinatory tendencies, thereby reducing object hallucination without requiring fine-tuning or labeled data. Experiments show that VCE effectively reduces hallucination across various benchmarks while preserving computational efficiency.
论文提出了一种名为Visual Contrastive Editing (VCE)的方法,通过视觉对比扰动来解决大型视觉-语言模型(LVLM)中的幻觉问题。VCE利用奇异值分解来识别并抑制幻觉倾向,从而在不需要微调或标注数据的情况下减少幻觉现象。实验结果表明,VCE能够有效降低各种基准上的幻觉现象,同时保持计算效率。
Learning Evolution via Optimization Knowledge Adaptation
Authors: Chao Wang, Lingling Li, Licheng Jiao, Jiaxuan Zhao, Fang Liu, Shuyuan Yang
First: 2025-01-04T05:35:21+00:00 · Latest: 2026-04-21T12:31:45+00:00
Comments: This work has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract
The iterative search process of evolutionary algorithms (EAs) encapsulates optimization knowledge within historical populations and fitness evaluations. Effective utilization of this knowledge is crucial for facilitating knowledge transfer and online adaptation. However, current research typically addresses these goals in isolation and faces distinct limitations: evolutionary sequential transfer optimization often suffers from incomplete utilization of prior knowledge, while adaptive strategies, utilizing real-time knowledge, are limited to tailoring specific evolutionary operators. To simultaneously achieve these two capabilities, we introduce the Optimization Knowledge Adaptation Evolutionary Model (OKAEM), a unified learnable evolutionary framework capable of adaptively updating parameters based on available optimization knowledge. By parameterizing evolutionary operators via attention mechanisms, OKAEM enables learnable update rules that facilitate the utilization of optimization knowledge via two phases: pre-training to integrate extensive prior knowledge for efficient transfer, and adaptive optimization to dynamically update parameters based on real-time knowledge. Experimental results confirm that OKAEM significantly outperforms state-of-the-art sequential transfer methods across 12 transfer scenarios via pre-training, and surpasses advanced learnable EAs solely through its self-tuning mechanism in prior-free settings. Beyond demonstrating practical utility in prompt tuning for vision-language models, ablation studies validate the necessity of the learnable components, while visualization analyses reveal the model's capacity to autonomously discover interpretable evolutionary principles. The code can be accessed at https://gitee.com/Anonymity_Paper/code-of-okaem.
中文标题/摘要
标题:通过优化知识适应学习进化
进化算法(EAs)的迭代搜索过程将优化知识封装在历史种群和适应度评估中。有效利用这些知识对于促进知识转移和在线适应至关重要。然而,当前研究通常将这些目标孤立地解决,面临不同的局限性:进化序列转移优化往往未能充分利用先验知识,而适应策略利用实时知识来调整特定的进化算子也受到限制。为了同时实现这两种能力,我们提出了优化知识适应进化模型(OKAEM),这是一种统一的学习可调进化框架,能够根据可用的优化知识自适应更新参数。通过使用注意力机制参数化进化算子,OKAEM 允许学习可调更新规则,通过两个阶段促进优化知识的利用:预训练阶段整合大量先验知识以实现高效转移,自适应优化阶段根据实时知识动态更新参数。实验结果证实,OKAEM 在 12 种转移场景中显著优于最先进的序列转移方法,通过预训练显著优于无先验设置下的高级可调进化算法。除了在视觉语言模型的提示调优中展示其实用性外,消融研究验证了可调组件的必要性,而可视化分析揭示了模型自主发现可解释进化原理的能力。代码可在 https://gitee.com/Anonymity_Paper/code-of-okaem 获取。
HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition
Authors: Xiaoqi Zhuang, Jefersson A. Dos Santos, Jungong Han
Venue: CVPR 2026
First: 2026-04-21T12:18:15+00:00 · Latest: 2026-04-21T12:18:15+00:00
Comments: 8 pages, 6 figures, CVPR 2026 findings. Code is available at https://github.com/XiaoqiZhuang/HarmoniDiff-RS
Abstract
Satellite image composition plays a critical role in remote sensing applications such as data augmentation, disaster simulation, and urban planning. We propose HarmoniDiff-RS, a training-free diffusion-based framework for harmonizing composite satellite images under diverse domain conditions. Our method aligns the source and target domains through a Latent Mean Shift operation that transfers radiometric characteristics between them. To balance harmonization and content preservation, we introduce a Timestep-wise Latent Fusion strategy that leverages early inverted latents for high harmonization and late latents for semantic consistency to generate a set of composite candidates. A lightweight harmony classifier is trained to automatically select the most coherent result among them. We also construct RSIC-H, a benchmark dataset for satellite image harmonization derived from fMoW, providing 500 paired composition samples. Experiments demonstrate that our method effectively performs satellite image composition, showing strong potential for scalable remote-sensing synthesis and simulation tasks. Code is available at: https://github.com/XiaoqiZhuang/HarmoniDiff-RS.
中文标题/摘要
标题:HarmoniDiff-RS:无需训练的卫星图像合成扩散调和框架
卫星图像合成在遥感应用中起着关键作用,如数据增强、灾害模拟和城市规划。我们提出了一种无需训练的基于扩散的框架——HarmoniDiff-RS,用于在多种领域条件下对合成卫星图像进行调和。我们的方法通过一种潜在均值偏移操作将辐射特性在源域和目标域之间转移。为了平衡调和和内容保留,我们引入了一种时间步长潜在融合策略,通过利用早期反转的潜在变量进行高调和,以及晚期潜在变量进行语义一致性,生成一组合成候选。我们还训练了一个轻量级的和谐分类器,以进一步自动选择其中最一致的结果。我们还构建了RSIC-H基准数据集,该数据集源自fMoW,提供了500对合成样本。实验表明,我们的方法有效地执行了卫星图像合成,显示出在可扩展的遥感合成和模拟任务中的强大潜力。代码可在:https://github.com/XiaoqiZhuang/HarmoniDiff-RS 获取。
Summary / 总结
HarmoniDiff-RS is a training-free framework that harmonizes composite satellite images using a diffusion-based approach. It aligns source and target domains through a Latent Mean Shift operation and introduces a Timestep-wise Latent Fusion strategy to balance harmonization and content preservation. Experiments show that HarmoniDiff-RS effectively performs satellite image composition, demonstrating potential for remote-sensing synthesis and simulation tasks.
HarmoniDiff-RS 是一个无需训练的框架,使用扩散方法来谐调合成的卫星图像。它通过 Latent Mean Shift 操作对齐源域和目标域,并引入 Timestep-wise Latent Fusion 策略来平衡谐调和内容保留。实验表明,HarmoniDiff-RS 能够有效进行卫星图像合成,展示出在遥感合成和模拟任务中的潜力。
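Illustrative sketch (not the authors' code): the abstract describes Latent Mean Shift only as an operation that transfers radiometric characteristics between source and target domains. One simple reading, assumed here, is a per-channel mean alignment of the pasted region's latent to the background's latent. The function name `latent_mean_shift`, the list-of-channels layout, and the choice of per-channel statistics are all assumptions for illustration; the paper's actual operation may use masks, other statistics, or a specific diffusion timestep.

```python
def latent_mean_shift(source, target):
    """Shift each channel of `source` so its mean matches the same channel of `target`.

    Both arguments are lists of channels; each channel is a flat list of floats.
    This layout is a simplification chosen for the sketch.
    """
    shifted = []
    for src_ch, tgt_ch in zip(source, target):
        src_mean = sum(src_ch) / len(src_ch)
        tgt_mean = sum(tgt_ch) / len(tgt_ch)
        offset = tgt_mean - src_mean  # per-channel shift, assumed form
        shifted.append([v + offset for v in src_ch])
    return shifted
```

Only the mean moves; the spread of values within each channel is untouched, which is one way local content could be preserved while overall tone follows the target.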
See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
Authors: Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, Takeo Igarashi
First: 2026-02-02T13:03:48+00:00 · Latest: 2026-04-21T12:15:10+00:00
Comments: Accepted to ACL 2026
Abstract
Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language model (VLM) perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived appropriateness of the action, and this feedback is used to iteratively revise the designer's outputs, enabling systematic refinement without human supervision. We evaluate our framework across three eHMI modalities (lightbar, eyes, and arm) and multiple LLM model sizes. Across settings, our framework consistently outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations. Results further indicate that the improvements generalize across modalities and that VLM evaluations are well aligned with human preferences, supporting the robustness and effectiveness of See2Refine for scalable action design.
Summary / 总结
See2Refine is a framework that uses vision-language models to provide automated visual feedback to improve an LLM-based eHMI action designer. It evaluates the perceived appropriateness of eHMI actions and iteratively refines the designer's outputs. Across three eHMI modalities and multiple LLM model sizes, See2Refine outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations, indicating its robustness and effectiveness for scalable action design.
研究旨在通过使用视觉语言模型(VLM)进行自动视觉反馈来改进自动车辆外部人机界面(eHMI)的设计,利用VLM评估候选eHMI动作并迭代优化设计师的输出。研究结果表明,在不同eHMI模态和LLM模型规模下,See2Refine在基于VLM的指标和人类受试者评估中均优于仅使用提示的LLM设计师和手动指定的基线,证明了其在可扩展动作设计中的稳健性和有效性。
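Illustrative sketch (not the authors' code): the closed loop described above, in which a designer proposes an eHMI action, an evaluator scores its perceived appropriateness, and the feedback drives a revision, can be outlined as follows. The abstract does not say whether revision happens at inference time or through training; this sketch assumes an inference-time loop. `refine_action`, `toy_designer`, `toy_evaluator`, the score scale, and the threshold are hypothetical stand-ins; in the paper the two roles are played by an LLM and a VLM.

```python
def refine_action(context, propose, evaluate, max_rounds=3, threshold=0.8):
    """Propose, score, and revise an action until the evaluator is satisfied."""
    feedback = None
    best_action, best_score = None, float("-inf")
    for _ in range(max_rounds):
        action = propose(context, feedback)          # designer (an LLM in the paper)
        score, feedback = evaluate(context, action)  # evaluator (a VLM in the paper)
        if score > best_score:
            best_action, best_score = action, score
        if score >= threshold:
            break
    return best_action, best_score


# Toy stand-ins so the loop runs without any model.
def toy_designer(context, feedback):
    return "steady light" if feedback is None else "sweep light toward pedestrian"


def toy_evaluator(context, action):
    if "toward pedestrian" in action:
        return 0.9, "clear yielding cue"
    return 0.4, "intent is ambiguous; indicate who may go first"
```

The loop keeps the best-scoring action seen so far, so a later revision that scores worse does not replace an earlier, better one.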
LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
Authors: Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang
First: 2026-04-09T17:14:00+00:00 · Latest: 2026-04-21T11:35:00+00:00
Abstract
Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
中文标题/摘要
标题:LAMP: 将图像编辑提升为开放世界操纵的一般3D先验
在开放世界中实现类人的泛化仍然是机器人操纵中的一个基本挑战。现有的基于学习的方法,包括强化学习、模仿学习和视觉-语言-动作模型(VLAs),往往难以应对新的任务和未见过的环境。另一个有前景的方向是探索能够捕捉开放世界操纵中精细的空间和几何关系的一般化表示。虽然大型语言模型(LLMs)和视觉-语言模型(VLMs)提供了基于语言或标注的2D表示的强语义推理,但它们有限的3D意识限制了它们在精细操纵中的应用。为了解决这个问题,我们提出了LAMP,它将图像编辑提升为3D先验,以提取物体间的3D变换作为连续的、几何感知的表示。我们的核心见解是,图像编辑本质上编码了丰富的2D空间线索,将这些隐含的线索提升到3D变换中为开放世界操纵提供了精细和准确的指导。广泛的实验表明,LAMP 提供了精确的3D变换,并在开放世界操纵中实现了强大的零样本泛化。项目页面:https://zju3dv.github.io/LAMP/
Summary / 总结
The paper addresses the challenge of human-like generalization in open-world robotic manipulation. It proposes LAMP, which uses image-editing techniques to extract 3D transformations as continuous, geometry-aware representations. Experiments show that LAMP provides precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation tasks.
论文旨在解决开放世界中类人的机器人操作泛化问题。提出LAMP,利用图像编辑作为3D先验提取连续的几何感知的物体间3D变换表示。实验表明,LAMP能够提供精确的3D变换并在开放世界操作任务中实现强大的零样本泛化。
PLaMo 2.1-VL Technical Report
Authors: Tommi Kerola, Yuya Masuda, Takashi Masuko, Toshiki Nakanishi, Daisuke Nishino, Kuniyuki Takahashi, Hanqin Wang, Yoshihiro Yamada
First: 2026-04-21T10:46:42+00:00 · Latest: 2026-04-21T10:46:42+00:00
Comments: 35 pages, 9 figures
Abstract
We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.
Summary / 总结
PLaMo 2.1-VL is a lightweight Vision Language Model designed for autonomous devices, focusing on VQA and Visual Grounding. It outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4, and reaches 53.9% zero-shot accuracy on factory task analysis. Fine-tuning on power plant data improves the anomaly detection bbox + label F1-score from 39.7 to 64.9, a gain of 25.2 points.
PLaMo 2.1-VL 是一种轻量级的视觉语言模型,适用于自主设备,专注于 VQA 和视觉定位。它在日语和英语基准测试中优于同类开源模型,在 JA-VG-VQA-500 上达到 61.5 的 ROUGE-L,在日语 Ref-L4 上达到 85.2% 的准确率,并在工厂任务分析中取得 53.9% 的零样本准确率。对发电厂数据的微调将异常检测的 bbox + 标签 F1 分数从 39.7 提高到 64.9,提升 25.2 分。
Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
Authors: Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, Xianglong Liu
First: 2024-11-27T12:09:43+00:00 · Latest: 2026-04-21T10:31:24+00:00
Comments: Accepted by Machine Intelligence Research
Abstract
Vision-language models (VLMs) have significantly advanced autonomous driving (AD) by enhancing reasoning capabilities. However, these models remain highly vulnerable to adversarial attacks. While existing research has primarily focused on general VLM attacks, the development of attacks tailored to the safety-critical AD context has been largely overlooked. In this paper, we take the first step toward designing adversarial attacks specifically targeting VLMs in AD, exposing the substantial risks these attacks pose within this critical domain. We identify two unique challenges for effective adversarial attacks on AD VLMs: the variability of textual instructions and the time-series nature of visual scenarios. To this end, we propose ADvLM, the first visual adversarial attack framework specifically designed for VLMs in AD. Our framework introduces Semantic-Invariant Induction, which uses a large language model to create a diverse prompt library of textual instructions with consistent semantic content, guided by semantic entropy. Building on this, we introduce Scenario-Associated Enhancement, an approach where attention mechanisms select key frames and perspectives within driving scenarios to optimize adversarial perturbations that generalize across the entire scenario. Extensive experiments on several AD VLMs over multiple benchmarks show that ADvLM achieves state-of-the-art attack effectiveness. Moreover, real-world attack studies further validate its applicability and potential in practice.
Summary / 总结
This paper addresses the vulnerability of vision-language models (VLMs) in autonomous driving (AD) to adversarial attacks, which have been largely ignored in existing research. The authors propose ADvLM, a novel adversarial attack framework tailored for AD VLMs, which includes Semantic-Invariant Induction and Scenario-Associated Enhancement. ADvLM demonstrates state-of-the-art attack effectiveness across multiple benchmarks and real-world scenarios, highlighting the significant risks posed by adversarial attacks in AD VLMs.
本文探讨了视觉语言模型(VLMs)在自动驾驶(AD)中对抗攻击的脆弱性,这一问题在现有研究中被忽视。作者提出了一种名为ADvLM的新型对抗攻击框架,专门针对AD VLMs,该框架包括语义不变诱导和场景关联增强。ADvLM在多个基准测试和实际场景中展示了最先进的攻击效果,突显了对抗攻击对AD VLMs的重大风险。
RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
Authors: Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu
First: 2026-04-21T10:29:42+00:00 · Latest: 2026-04-21T10:29:42+00:00
Abstract
Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.
中文标题/摘要
标题:RDP LoRA:基于几何驱动的参数高效适应识别
尽管存在参数高效方法如低秩适应(LoRA),大型语言模型(LLMs)的微调仍然在结构上存在不确定性,因为内部表示的层特定作用尚不明确,导致在何处进行适应的决策多为经验性的。我们建模隐藏状态的演变作为高维几何轨迹,并提出使用Ramer-Douglas-Peucker(RDP)算法,这是一种无参数且无需训练的多边形简化方法,能够保留全局结构转换的同时消除局部冗余变化,以识别表示路径中的关键断点。关键的是,我们不仅将这些几何支点用于分析,还直接作为决定哪些层在参数高效微调期间应进行适应的决策信号。通过将这种几何感知的层选择策略整合到Qwen3-8B-Base的LoRA微调中,我们仅使用13个RDP选择的层(81.67%)在MMLU-Math上实现了更优性能,显著优于36层全适应(79.32%)和随机选择13层(75.56%),以及基线Qwen3-8B-Base模型(74.25%)。这些结果表明,利用表示轨迹的内在几何特性为优化模型适应期间的层选择提供了稳健、可解释且无需训练的信号。
Summary / 总结
This study addresses the structural uncertainty in fine-tuning Large Language Models (LLMs) by proposing a geometry-driven method called RDP LoRA. It uses the Ramer-Douglas-Peucker (RDP) algorithm to identify critical breakpoints in the hidden state trajectories, which are then used to select layers for parameter-efficient fine-tuning. With LoRA applied to only 13 RDP-selected layers of Qwen3-8B-Base, the method reaches 81.67% on MMLU-Math, compared with 79.32% for full 36-layer adaptation, 75.56% for random 13-layer selection, and 74.25% for the base model, demonstrating the effectiveness of leveraging geometric insights for layer selection in LLMs.
该研究通过提出一种基于几何的方法RDP LoRA,解决了大型语言模型(LLMs)在微调时的结构不确定性问题。该方法使用Ramer-Douglas-Peucker(RDP)算法来识别隐藏状态轨迹中的关键断点,并据此选择需要微调的层。该方法在MMLU-Math基准测试中显著优于全层微调和随机层选择,证明了利用表示轨迹的内在几何洞察来优化微调过程中层选择的有效性。
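Illustrative sketch (not the authors' code): the classic Ramer-Douglas-Peucker simplification, applied to a sequence of points in any dimension, returns the indices of the points it keeps. Treating each point as a per-layer hidden-state summary and reading the kept indices as candidate layers is an assumption made for this example. The classic formulation takes a tolerance `epsilon`; how the paper sets it, or how it arrives at exactly 13 layers, is not stated in the abstract. The names `rdp_indices` and `_dist_to_chord` are hypothetical.

```python
import math


def _dist_to_chord(p, a, b):
    """Euclidean distance from point p to the segment a-b, in any dimension."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(x * x for x in ab)
    if denom == 0.0:
        return math.sqrt(sum(x * x for x in ap))
    # Projection parameter, clamped so the closest point stays on the segment.
    t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    return math.sqrt(sum((x - t * y) ** 2 for x, y in zip(ap, ab)))


def rdp_indices(points, epsilon):
    """Return sorted indices of the points kept by Ramer-Douglas-Peucker."""
    keep = {0, len(points) - 1}
    stack = [(0, len(points) - 1)]
    while stack:
        lo, hi = stack.pop()
        best_d, best_i = 0.0, None
        for i in range(lo + 1, hi):
            d = _dist_to_chord(points[i], points[lo], points[hi])
            if d > best_d:
                best_d, best_i = d, i
        # Keep the farthest point only if it deviates more than epsilon,
        # then simplify the two halves independently.
        if best_i is not None and best_d > epsilon:
            keep.add(best_i)
            stack.append((lo, best_i))
            stack.append((best_i, hi))
    return sorted(keep)
```

On a toy trajectory that runs straight along one axis and then turns onto another, the only interior index kept is the turning point, which matches the idea of preserving global structural transitions while dropping locally redundant steps.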
Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
Authors: Enyi Shi, Pengyang Shao, Yanxin Zhang, Chenhang Cui, Jiayi Lyu, Xiaobo Xia, Fei Shen, Tat-Seng Chua
First: 2026-01-30T09:18:13+00:00 · Latest: 2026-04-21T10:06:36+00:00
Abstract
The robust safety of Vision-Language Large Models (VLLMs) against joint multilingual and multimodal threats remains severely underexplored. Current benchmarks typically isolate these dimensions, being either multilingual but text-only, or multimodal but monolingual. While recent red-teaming efforts attempt to bridge this gap by rendering harmful prompts as images, their overreliance on typography-style visuals and lack of semantically grounded image-text pairs fail to capture realistic cross-modal interactions under multilingual and multimodal conditions. To address this, we introduce Lingua-SafetyBench, a comprehensive benchmark of 100,440 harmful image-text pairs spanning 10 languages. Crucially, Lingua-SafetyBench explicitly partitions data into image-dominant and text-dominant subsets to precisely disentangle sources of risk. Extensive evaluations reveal that current VLLMs retain non-negligible vulnerabilities under these joint inputs. Linguistically, requests in Non-High-Resource Languages (Non-HRLs) and non-Latin scripts generally pose greater threats. Furthermore, analyzing modality-language interactions uncovers a striking asymmetry: in High-Resource Languages (HRLs), models are most vulnerable to image-dominant risks, whereas in Non-HRLs, text-dominant risks severely degrade safety performance. Finally, a controlled study on the Qwen series demonstrates that while model scaling and iterative upgrades improve overall safety, they disproportionately benefit HRLs. This exacerbates the safety disparity between HRLs and Non-HRLs under text-dominant risks, highlighting that achieving robust safety requires dedicated language- and modality-aware alignment strategies beyond mere scaling. The code and dataset will be available at https://github.com/zsxr15/Lingua-SafetyBench. Warning: this paper contains examples with unsafe content.
Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing
Authors: Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki
First: 2026-04-21T08:05:21+00:00 · Latest: 2026-04-21T08:05:21+00:00
Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Abstract
Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .
Summary / 总结
This paper benchmarks 15 pre-trained vision foundation models for face anti-spoofing, focusing on self-supervised ViTs like DINOv2 with Registers. The study shows that these models, combined with data augmentation techniques, achieve state-of-the-art performance in the MICO protocol and outperform existing methods under data-constrained scenarios while maintaining efficiency. This work provides a robust vision-only baseline for FAS, highlighting the potential of optimized self-supervised vision transformers for both vision-only and multimodal FAS systems.
该研究对15个预训练的视觉基础模型进行了面部防欺骗基准测试,重点关注带有Registers的自监督ViTs,如DINOv2。研究表明,结合数据增强技术后,这些模型在MICO协议中达到了最先进的性能,并在数据受限的情况下优于现有方法,同时保持了高效性。这项工作为面部防欺骗提供了稳健的视觉单一基线,突显了优化的自监督视觉变压器在单一视觉和未来多模态面部防欺骗系统中的潜力。
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning
Authors: Haoyang Chen, Yi Liu, Jianzhi Shao, Tao Zhang, Chengfu Huo, Wei Hu
Venue: ACL 2026
First: 2026-04-21T06:55:17+00:00 · Latest: 2026-04-21T06:55:17+00:00
Comments: Accepted in the Findings of ACL 2026
Abstract
Thinking LLMs produce reasoning traces before answering. Prior activation steering work mainly targets shaping these traces. It remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze the answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit diffuse and irregular attention patterns. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. Following this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores combining geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain and disorganized reading. Experiments show that our method yields consistent accuracy gains.
Summary / 总结
This study examines how answer tokens read and integrate reasoning traces in Thinking LLMs for quantitative reasoning. It finds that correct answers show a forward drift and persistent concentration on key semantic anchors, while incorrect answers exhibit diffuse and irregular attention. The authors propose a training-free steering method using Self-Reading Quality (SRQ) scores to guide inference towards benign self-reading, resulting in consistent accuracy gains.
论文研究了Thinking LLMs在进行定量推理时,答案令牌如何阅读和整合推理痕迹。研究发现,正确答案的注意力会向前漂移,并持续集中在关键语义锚点上,而错误答案则表现出分散且不规则的注意力模式。作者提出了一种基于Self-Reading Quality (SRQ)评分的无训练引导方法,通过结合几何度量和语义度量来控制过程并监控内容,引导推理向良性自我阅读模式发展,从而实现一致的准确率提升。
ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
Authors: Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao
First: 2026-04-21T06:51:08+00:00 · Latest: 2026-04-21T06:51:08+00:00
Comments: 18 pages, 4 figures
Abstract
Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
中文标题/摘要
标题:ST-Prune:无需训练的空间-时间令牌剪枝方法在自动驾驶中的视觉-语言模型
视觉-语言模型(VLMs)已成为自动驾驶系统的核心,但其部署受到多视点相机和多帧视频输入的巨大计算开销的严重限制。现有的令牌剪枝方法主要针对单张图像输入设计,将每个帧或视图孤立处理,未能利用驾驶场景中的固有空间-时间冗余性。为解决这一问题,我们提出了ST-Prune,这是一种无需训练、即插即用的框架,包含两个互补模块:运动感知的时间剪枝(MTP)和环视空间剪枝(RSP)。MTP通过在多样性选择目标中引入运动波动性和时间近期性作为软约束,优先处理动态轨迹和当前帧内容,而不是静态的历史背景。RSP进一步通过利用环视相机几何结构来惩罚双边跨视图相似性,消除时间剪枝无法抑制的重复投影和残余背景。这两个模块共同构成了完整的空间-时间剪枝过程,在严格压缩下保留关键场景信息。ST-Prune在四个涵盖感知、预测和规划的基准测试中进行了验证,建立了无需训练的令牌剪枝的新基准。值得注意的是,即使在90%的令牌减少下,ST-Prune仍能实现近乎无损的性能,某些指标甚至超过全模型基线,同时保持与现有剪枝方法相当的推理速度。
Summary / 总结
ST-Prune is a training-free framework for spatio-temporal token pruning in vision-language models for autonomous driving, addressing the computational overhead of multi-view and multi-frame inputs. It consists of Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP) modules, which respectively handle temporal and spatial redundancies by encoding motion volatility and penalizing bilateral cross-view similarity. Experiments across four benchmarks show that ST-Prune achieves near-lossless performance with 90% token reduction and maintains inference speeds comparable to existing methods.
ST-Prune 是一个无需训练的时空令牌剪枝框架,用于自动驾驶中的视觉-语言模型,旨在解决多视图和多帧输入带来的计算瓶颈。该框架包含运动感知时域剪枝(MTP)和环视空间剪枝(RSP)模块,分别处理时域和空间冗余,通过编码运动波动性和惩罚双边跨视图相似性来实现这一目标。实验结果显示,即使在90%的令牌减少下,ST-Prune 也能实现近乎无损的性能,并保持与现有方法相当的推理速度。
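Illustrative sketch (not the authors' code): the abstract says MTP encodes motion volatility and temporal recency as soft constraints within a diversity selection objective, without giving the objective. The stand-in below is a greedy farthest-point-style selection whose gain adds weighted motion and recency bonuses to a diversity term. The function `select_tokens`, the token dictionary keys, the additive form, and the weights are all assumptions for illustration.

```python
import math


def select_tokens(tokens, budget, w_motion=1.0, w_recency=1.0):
    """Greedily pick `budget` token indices.

    Each token is a dict with:
      'feat'    : list of floats (feature vector)
      'motion'  : float, higher means more motion (assumed scale)
      'recency' : float in [0, 1], 1 for the current frame (assumed scale)
    """
    selected = []
    remaining = list(range(len(tokens)))

    def gain(i):
        # Diversity: distance to the closest already-selected token.
        if selected:
            diversity = min(
                math.dist(tokens[i]["feat"], tokens[j]["feat"]) for j in selected
            )
        else:
            diversity = 0.0
        # Soft preferences for moving and recent content.
        return diversity + w_motion * tokens[i]["motion"] + w_recency * tokens[i]["recency"]

    while remaining and len(selected) < budget:
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the bonuses are additive rather than hard filters, a static but visually distinct token can still be kept when it adds enough diversity, while near-duplicate static background is the first to be dropped.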
Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval
Authors: Hang Cheng, Fanhe Dong, Long Zeng
First: 2026-04-21T06:32:34+00:00 · Latest: 2026-04-21T06:32:34+00:00
Abstract
This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing the ability of semantic context capture and concentrating on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.
中文标题/摘要
标题:Diff-SBSR:学习多模态特征增强扩散模型进行零样本草图基于3D形状检索
本文首次探讨了文本到图像扩散模型在零样本草图基于3D形状检索(ZS-SBSR)中的应用。现有的基于草图的3D形状检索方法在零样本设置中表现不佳,因为缺乏类别监督和草图输入的极端稀疏性。我们的关键见解是,大规模预训练的扩散模型本质上具有开放词汇能力和强烈的形状偏见,使其非常适合零样本视觉检索。我们利用冻结的Stable Diffusion主干从中间U-Net层中提取和聚合来自草图和渲染3D视图的判别性表示。由于其极端的抽象性和稀疏性,扩散模型在处理草图时遇到困难,且与自然图像之间存在显著的领域差距。为了解决这一限制,而无需昂贵的重新训练,我们引入了一种多模态特征增强策略,通过从CLIP获取补充的视觉和文本线索来条件化冻结的扩散主干,从而增强语义上下文捕获能力并集中于草图轮廓。具体而言,我们注入了来自预训练CLIP视觉编码器的全局和局部视觉特征,并通过结合可学习的软提示和BLIP生成的硬文本描述来增强丰富的文本指导。此外,我们采用Circle-T损失动态增强正样本对的吸引力,一旦负样本充分分离,从而适应草图噪声并实现更有效的草图-3D对齐。在两个公开基准上的广泛实验表明,我们的方法在ZS-SBSR中始终优于最先进的方法。
Summary / 总结
This paper introduces Diff-SBSR, a method for zero-shot sketch-based 3D shape retrieval using text-to-image diffusion models. The approach leverages a pretrained Stable Diffusion backbone and multimodal feature enhancement with CLIP and BLIP to improve semantic context capture and sketch contour concentration. Experiments show that Diff-SBSR outperforms existing methods on two public benchmarks, addressing the challenges of zero-shot settings and sparse sketch inputs.
该论文提出了一种名为Diff-SBSR的方法,利用文本到图像的扩散模型进行零样本草图基于的3D形状检索。该方法利用预训练的Stable Diffusion主干,并结合CLIP和BLIP的多模态特征增强,以提高语义上下文捕获和草图轮廓集中。实验表明,Diff-SBSR在两个公开基准上优于现有方法,解决了零样本设置和稀疏草图输入的挑战。
EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation
Authors: Ruibing Hou, Mingyue Zhou, Yuwei Gui, Mingshuang Luo, Bingpeng Ma, Hong Chang, Shiguang Shan, Xilin Chen
First: 2026-04-21T05:31:06+00:00 · Latest: 2026-04-21T05:31:06+00:00
Comments: 12 pages, 3 figures
Abstract
Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical reasoning-generation entanglement challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework, EgoMotion. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
PhysMem: Scaling Test-time Physical Memory for Robot Manipulation
Authors: Haoyang Li, Yang You, Hao Su, Leonidas Guibas
First: 2026-02-23T20:18:35+00:00 · Latest: 2026-04-21T04:12:49+00:00
Abstract
Reliable object manipulation requires understanding physical properties that vary across objects and environments. Vision-language model (VLM) planners can reason about friction and stability in general terms; however, they often cannot predict how a specific ball will roll on a particular surface or which stone will provide a stable foundation without direct experience. We present PhysMem, a memory framework that enables VLM robot planners to learn physical principles from interaction at test time, without updating model parameters. The system records experiences, generates candidate hypotheses, and verifies them through targeted interaction before promoting validated knowledge to guide future decisions. A central design choice is verification before application: the system tests hypotheses against new observations rather than applying retrieved experience directly, reducing rigid reliance on prior experience when physical conditions change. We evaluate PhysMem on three real-world manipulation tasks and simulation benchmarks across four VLM backbones. On a controlled brick insertion task, principled abstraction achieves 76% success compared to 23% for direct experience retrieval, and real-world experiments show consistent improvement over 30-minute deployment sessions.
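Illustrative sketch (not the authors' code): the record, hypothesize, verify, and promote cycle described above can be outlined with a small in-memory store. The class `PhysicalMemory`, its method names, and the promotion rule (a minimum number of trials and a minimum success rate) are hypothetical; the abstract does not specify how hypotheses are generated or what counts as validated. The point of the sketch is only the design choice of verification before application: a hypothesis guides planning only after it has held up against new observations.

```python
class PhysicalMemory:
    """Toy test-time memory: hypotheses are promoted only after verification."""

    def __init__(self, min_trials=3, min_success_rate=0.7):
        self.experiences = []  # raw interaction records
        self.candidates = {}   # hypothesis -> [successes, trials]
        self.principles = []   # verified hypotheses available to the planner
        self.min_trials = min_trials              # assumed threshold
        self.min_success_rate = min_success_rate  # assumed threshold

    def record(self, experience):
        self.experiences.append(experience)

    def propose(self, hypothesis):
        self.candidates.setdefault(hypothesis, [0, 0])

    def verify(self, hypothesis, prediction_held):
        """Update a hypothesis with one targeted test; return True once promoted."""
        stats = self.candidates[hypothesis]
        stats[0] += int(prediction_held)
        stats[1] += 1
        successes, trials = stats
        if (
            trials >= self.min_trials
            and successes / trials >= self.min_success_rate
            and hypothesis not in self.principles
        ):
            self.principles.append(hypothesis)
        return hypothesis in self.principles
```

No model parameters change anywhere in this loop; all adaptation lives in the stored records and the promoted list, which mirrors the test-time setting described in the abstract.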
Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
Authors: Zihao Ye, Yung Hsiang Lu, Xiao Hu, Shuai Zhang, Taotao Jing, Xin Li, Zhen Yao, Bo Lang, Zhihao Zheng, Seungmin Oh, Hankyul Kang, Seunghun Kang, Jongbin Ryu, Kexin Chen, Yuan Qi, George K Thiruvathukal, Mooi Choo Chuah
First: 2026-04-21T04:00:55+00:00 · Latest: 2026-04-21T04:00:55+00:00
Comments: 11 pages, 8 figures, 4 tables
Abstract
The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.
Summary / 总结
The IEEE Low-Power Computer Vision Challenge (LPCVC) 2025 aimed to develop efficient vision models for edge devices by balancing accuracy with constraints like latency, memory, and energy. The challenge included three tracks: image classification, open-vocabulary segmentation, and monocular depth estimation. The paper describes the competition structure, evaluation framework using Qualcomm AI Hub, and highlights the top solutions from each track, noting key trends and observations. It concludes with recommendations for future competitions.
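The IEEE Low-Power Computer Vision Challenge (LPCVC) 2025 aims to develop efficient vision models for edge devices by balancing accuracy against constraints such as latency, memory, and energy consumption. The challenge comprises three tracks: image classification, open-vocabulary segmentation, and monocular depth estimation. The paper describes the competition structure and the evaluation framework built on the Qualcomm AI Hub, and it outlines the top solutions from each track, pointing out key trends and observations. It closes with suggestions for future competitions.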
BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
Authors: Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu
First: 2026-04-15T09:17:38+00:00 · Latest: 2026-04-21T03:52:24+00:00
Abstract
Autoregressive vision-language models (VLMs) deliver strong multimodal capability, but their token-by-token decoding imposes a fundamental inference bottleneck. Diffusion VLMs offer a more parallel decoding paradigm, yet directly converting a pretrained autoregressive VLM into a large-block diffusion VLM (dVLM) often leads to substantial quality degradation. In this work, we present BARD, a simple and effective bridging framework that converts a pretrained autoregressive VLM into a same-architecture, decoding-efficient dVLM. Our approach combines progressive supervised block merging, which gradually enlarges the decoding block size, with stage-wise intra-dVLM distillation from a fixed small-block diffusion anchor to recover performance lost at larger blocks. We further incorporate a mixed noise scheduler to improve robustness and token revision during denoising, and memory-friendly training to enable efficient training on long multimodal sequences. A key empirical finding is that direct autoregressive-to-diffusion distillation is poorly aligned and can even hurt performance, whereas distillation within the diffusion regime is consistently effective. Experimental results show that, with $\leq 4.4M$ data, BARD-VL transfers strong multimodal capability from Qwen3-VL to a large-block dVLM. Remarkably, BARD-VL establishes a new SOTA among comparable-scale open dVLMs on our evaluation suite at both 4B and 8B scales. At the same time, BARD-VL achieves up to \textbf{3$\times$} decoding throughput speedup compared to the source model.
Summary / 总结
BARD is a framework that converts a pretrained autoregressive vision-language model into a decoding-efficient diffusion model by combining progressive block merging and stage-wise distillation. Key findings include the ineffectiveness of direct autoregressive-to-diffusion distillation and the effectiveness of intra-diffusion distillation. BARD-VL achieves strong multimodal performance and up to 3x decoding throughput speedup.
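BARD is a framework that converts a pretrained autoregressive vision-language model into a decoding-efficient diffusion model by combining progressive block merging with stage-wise intra-diffusion distillation. Key findings include that direct autoregressive-to-diffusion distillation is ineffective, whereas distillation within the diffusion regime is effective. BARD-VL achieves strong multimodal performance and up to a 3x decoding speedup over the source model.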
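The two ingredients named in the BARD abstract, a gradually growing decoding block size and distillation from a fixed small-block anchor, can be sketched as follows. This is only an illustration under stated assumptions: the doubling factor, the function names, and the use of a KL divergence as the distillation loss are hypothetical choices, since the abstract does not give the actual schedule or loss.

```python
import math


def block_size_schedule(start: int, target: int, factor: int = 2) -> list[int]:
    """Return block sizes from `start` up to `target`, growing by `factor` (assumed)."""
    if start < 1 or target < start or factor < 2:
        raise ValueError("need 1 <= start <= target and factor >= 2")
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(min(sizes[-1] * factor, target))
    return sizes


def distillation_stages(sizes: list[int]) -> list[tuple[int, int]]:
    """Pair the fixed small-block anchor with each larger student block size."""
    anchor = sizes[0]
    return [(anchor, student) for student in sizes[1:]]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(anchor_logits: list[float], student_logits: list[float]) -> float:
    """KL(anchor || student) for one token position (assumed loss form)."""
    p = softmax(anchor_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For example, `block_size_schedule(4, 32)` yields four stages, and `distillation_stages` keeps the teacher at the smallest block size in every stage, which mirrors the abstract's point that distillation stays inside the diffusion regime rather than coming from the autoregressive source model.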
Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents
Authors: Xu Chen, Shichao Xie, Zhining Gu, Lu Jia, Minghua Luo, Fei Liu, Zedong Chu, Yanfen Shen, Xiaolong Wu, Mu Xu
First: 2026-04-21T03:35:31+00:00 · Latest: 2026-04-21T03:35:31+00:00
Abstract
Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent's movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.
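Chinese title/abstract (translated)
Title: Explore Like Humans: Autonomous Exploration Based on Online SG-Memo Construction
Building structured spatial memory is essential for long-horizon reasoning in complex embodied navigation tasks. Current memory construction relies mainly on a decoupled two-stage paradigm: the agent first gathers environmental data through exploration and then reconstructs spatial memory offline. However, this post-hoc, geometry-centric approach prevents the agent from using high-level semantic intelligence and often causes it to overlook navigationally critical landmarks (for example, doorways and staircases) that serve as basic semantic anchors in human cognitive maps. To close this gap, we propose ABot-Explorer, a new active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer uses large vision-language models (VLMs) to extract Semantic Navigational Affordances (SNA), which act as cognitively aligned anchors that guide the agent's movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mimics human exploration logic by prioritizing structural transit nodes, which yields efficient coverage. To support this framework, we contribute a large-scale dataset that extends InteriorGS with SNA and SG-Memo annotations. Experimental results show that ABot-Explorer significantly outperforms current state-of-the-art methods in exploration efficiency and environment coverage, and that the resulting SG-Memo effectively supports a variety of downstream tasks.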
Summary / 总结
The paper aims to enhance autonomous exploration for embodied agents by integrating memory construction and exploration into an online process. ABot-Explorer uses Large Vision-Language Models to extract Semantic Navigational Affordances, which guide the agent's movement and prioritize structural transit nodes. The results show that ABot-Explorer outperforms existing methods in exploration efficiency and environment coverage, and the generated SG-Memo supports various downstream tasks.
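The paper aims to improve autonomous exploration by unifying memory construction and exploration into a single online process. ABot-Explorer uses large vision-language models to extract Semantic Navigational Affordances, which guide the agent's movement and prioritize structural transit nodes. Experimental results show that ABot-Explorer outperforms existing methods in exploration efficiency and environment coverage, and that the generated SG-Memo supports a range of downstream tasks.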
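The idea of a hierarchical memory that ranks structural transit nodes ahead of other landmarks can be sketched with a small priority queue. The class name `SGMemo`, its fields, the two-level room-to-node hierarchy, and the binary priority rule are all hypothetical; only the example transit labels (doorways and staircases) come from the abstract.

```python
import heapq
from dataclasses import dataclass, field

# Transit landmarks named as examples in the abstract.
TRANSIT_LABELS = {"doorway", "staircase"}


@dataclass
class SGMemo:
    """Toy two-level memory: rooms contain SNA nodes (hypothetical layout)."""
    rooms: dict = field(default_factory=dict)   # room name -> list of node ids
    labels: dict = field(default_factory=dict)  # node id -> semantic label
    _frontier: list = field(default_factory=list)
    _counter: int = 0

    def add_sna(self, room: str, node_id: str, label: str) -> None:
        """Insert a newly observed affordance and queue it for exploration."""
        self.rooms.setdefault(room, []).append(node_id)
        self.labels[node_id] = label
        # Assumed rule: transit nodes get priority 0, everything else 1.
        priority = 0 if label in TRANSIT_LABELS else 1
        # The counter breaks ties so equal-priority nodes keep insertion order.
        heapq.heappush(self._frontier, (priority, self._counter, node_id))
        self._counter += 1

    def next_target(self):
        """Pop the next node to visit, or None when the frontier is empty."""
        if not self._frontier:
            return None
        return heapq.heappop(self._frontier)[2]
```

Because the memory is updated as each affordance is observed, the exploration target can change online, in contrast to the offline two-stage reconstruction the abstract argues against.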
FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
Authors: Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
First: 2026-04-21T03:06:24+00:00 · Latest: 2026-04-21T03:06:24+00:00
Abstract
Federated fine-tuning of Large Language Models (LLMs) is obstructed by a trilemma of challenges: protecting the LLM's intellectual property (IP), ensuring client privacy, and mitigating performance loss on heterogeneous data. Existing methods like Offsite-Tuning (OT) secure the LLM's IP by having clients train only lightweight adapters, yet our analysis reveals they suffer from a fundamental performance bottleneck, leaving a significant gap compared to centralized training. To bridge this gap, we introduce FedProxy, a new federated adaptation framework. FedProxy replaces weak adapters with a unified, powerful Proxy Small Language Model (SLM), compressed from the proprietary LLM, to serve as a high-fidelity surrogate for collaborative fine-tuning. Our framework systematically resolves the trilemma through a three-stage architecture: (i) Efficient Representation via server-guided compression to create a resource-friendly proxy; (ii) Robust Optimization through an interference-mitigating aggregation strategy to handle data heterogeneity; and (iii) Effortless Fusion via a training-free "plug-in" mechanism to integrate learned knowledge back into the LLM. Experiments show FedProxy significantly outperforms OT methods and approaches centralized performance, establishing a new benchmark for secure and high-performance federated LLM adaptation.
Summary / 总结
FedProxy is a federated adaptation framework designed to address the challenges of protecting intellectual property, ensuring client privacy, and maintaining performance in fine-tuning Large Language Models (LLMs). It introduces a unified Proxy Small Language Model (SLM) to replace weak adapters and uses a three-stage architecture: efficient representation, robust optimization, and effortless fusion. Experiments demonstrate that FedProxy outperforms existing methods like Offsite-Tuning and approaches the performance of centralized training.
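FedProxy is a federated adaptation framework that addresses the challenges of intellectual property protection, client privacy, and performance loss when fine-tuning large language models (LLMs). It introduces a Proxy Small Language Model (SLM), compressed from the proprietary LLM, to replace weak adapters. FedProxy comprises three stages (efficient representation, robust optimization, and effortless fusion) that address these challenges. Experiments show that FedProxy significantly outperforms existing methods such as Offsite-Tuning and approaches the performance of centralized training.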
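The FedProxy abstract does not say how its interference-mitigating aggregation works, so the sketch below substitutes a simple sign-agreement rule to show what such a strategy can look like on heterogeneous client updates. The function name and the rule itself are assumptions made for illustration, not the method from the paper.

```python
def aggregate_updates(client_updates: list[list[float]]) -> list[float]:
    """Merge per-client proxy updates coordinate by coordinate.

    Stand-in rule (an assumption): for each coordinate, take the sign of the
    summed updates as the majority direction, then average only the client
    values that agree with it, so opposing updates do not cancel each other.
    """
    if not client_updates:
        raise ValueError("need at least one client update")
    width = len(client_updates[0])
    if any(len(update) != width for update in client_updates):
        raise ValueError("all client updates must have the same length")

    merged = []
    for i in range(width):
        values = [update[i] for update in client_updates]
        total = sum(values)
        if total == 0:
            # No majority direction: leave this coordinate unchanged.
            merged.append(0.0)
            continue
        sign = 1 if total > 0 else -1
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing))
    return merged
```

Plain averaging of the second coordinate in `[[1.0, -2.0], [3.0, 2.0], [2.0, -4.0]]` would give about -1.33, whereas this rule keeps only the two agreeing clients and returns -3.0, which is the kind of interference a heterogeneity-aware aggregator is meant to avoid.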