arXiv 论文速递

2026-03-16 03:43
Snapshot: 20260316_0343
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
Authors: Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata
First: 2026-03-12T17:59:48+00:00 · Latest: 2026-03-12T17:59:48+00:00
Comments: Preprint
Abstract
Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at https://github.com/ExplainableML/LCS.
中文标题/摘要
标题:潜在颜色子空间:高维混沌中的涌现秩序
文本到图像生成模型取得了快速进展,但对生成图像实现精细控制仍然困难,主要原因是对语义信息如何编码的理解有限。我们对 FLUX.1 [Dev] 变分自编码器潜在空间中的颜色表示提出了一种解释,揭示出一种反映色调、饱和度和亮度的结构。我们通过证明潜在颜色子空间 (LCS) 解释既能预测又能显式控制颜色来验证该解释,并在 FLUX 中引入了一种仅基于闭式潜在空间操作、完全无需训练的方法。代码可在 https://github.com/ExplainableML/LCS 获取。
BiGain: Unified Token Compression for Joint Generation and Classification
Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen
Venue: CVPR 2026
First: 2026-03-12T17:55:53+00:00 · Latest: 2026-03-12T17:55:53+00:00
Comments: CVPR 2026. Code: https://github.com/Greenoso/BiGain
Abstract
Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
中文标题/摘要
标题:BiGain:统一的标记压缩方法以实现联合生成和分类
扩散模型的加速方法(例如标记合并或下采样)通常在减少计算量的同时优化合成质量,但往往忽视了判别能力。我们以联合目标重新审视了标记压缩,并提出了一种无需训练、即插即用的框架BiGain,该框架在加速扩散模型中保持生成质量的同时提高分类能力。我们的核心见解是频率分离:将特征空间信号映射到频率感知的表示中,将细节与全局语义分离,从而实现既尊重生成保真度又兼顾判别有用性的压缩。BiGain 通过两种频率感知的操作体现了这一原则:(1)拉普拉斯门控标记合并,鼓励在频谱平滑标记之间进行合并,同时抑制高对比度标记的合并,从而保留边缘和纹理;(2)插值-外推 KV 下采样,通过在最近邻池化和平均池化之间进行可控的插值-外推来对键/值下采样,同时保持查询不变,从而保留注意力精度。在基于 DiT 和 U-Net 的骨干网络以及 ImageNet-1K、ImageNet-100、Oxford-IIIT Pets 和 COCO-2017 上,我们的操作在基于扩散的分类中始终改善了速度-准确性的权衡,同时在相近加速条件下保持或提升了生成质量。例如,在 ImageNet-1K 上,对 Stable Diffusion 2.0 使用 70% 的标记合并时,BiGain 将分类准确率提高了 7.15%,同时将 FID 改善了 0.34(1.85%)。我们的分析表明,平衡的频谱保留,即同时保留高频细节和低/中频语义,是扩散模型中标记压缩的可靠设计规则。据我们所知,BiGain 是第一个在加速扩散中同时研究和推进生成和分类的框架,支持低成本部署。
Summary / 总结
BiGain is a training-free framework that improves the speed-accuracy trade-off for diffusion models by preserving generation quality while enhancing classification. It uses frequency separation, with Laplacian-gated token merging and Interpolate-Extrapolate KV Downsampling, to balance spectral retention. Experiments on various datasets show consistent improvements in classification accuracy and FID scores under comparable acceleration, with significant gains on ImageNet-1K.
BiGain 是一个无需训练的框架,通过保留生成质量同时提高分类性能来优化扩散模型的速度-准确度权衡。它使用频率分离来分离细节点和全局语义,并采用拉普拉斯门控的标记合并和内插-外推 KV 下采样。实验表明,BiGain 在多个数据集上提高了分类准确率,同时保持或提升了生成质量。例如,在 ImageNet-1K 上,通过 70% 的标记合并,它将分类准确率提高了 7.15%,同时将 FID 改进了 0.34 (1.85%)。
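The Interpolate-Extrapolate KV Downsampling described in the abstract blends nearest and average pooling with a controllable coefficient. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes a 1-D token sequence, a fixed pooling window, and a hypothetical blend parameter `alpha` (all names here are illustrative). Under these assumptions `alpha = 0` reproduces nearest pooling, `alpha = 1` reproduces average pooling, and values outside `[0, 1]` extrapolate.

```python
from typing import List

Vector = List[float]


def nearest_pool(tokens: List[Vector], window: int) -> List[Vector]:
    """Keep the first token of each window (nearest-style downsampling)."""
    return [list(tokens[i]) for i in range(0, len(tokens), window)]


def average_pool(tokens: List[Vector], window: int) -> List[Vector]:
    """Replace each window of tokens by its element-wise mean."""
    pooled = []
    for i in range(0, len(tokens), window):
        group = tokens[i:i + window]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def interp_extrap_pool(tokens: List[Vector], window: int, alpha: float) -> List[Vector]:
    """Blend nearest and average pooling.

    alpha is a hypothetical knob: 0 -> nearest, 1 -> average,
    outside [0, 1] -> extrapolation past either endpoint.
    """
    near = nearest_pool(tokens, window)
    avg = average_pool(tokens, window)
    return [
        [(1.0 - alpha) * n + alpha * a for n, a in zip(near_vec, avg_vec)]
        for near_vec, avg_vec in zip(near, avg)
    ]
```

In an attention layer this pooling would be applied to keys and values only, leaving queries at full resolution, as the abstract states.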
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng
First: 2026-03-12T17:55:07+00:00 · Latest: 2026-03-12T17:55:07+00:00
Comments: Code: https://github.com/ROUJINN/SceneAssistant
Abstract
Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant.
中文标题/摘要
标题:SceneAssistant:一种用于开放词汇3D场景生成的视觉反馈代理
从自然语言生成3D场景(文本到3D)对于数字内容创作极具价值。然而,现有方法大多局限于特定领域或依赖预定义的空间关系,限制了它们在不受限制、开放词汇3D场景合成方面的能力。在本文中,我们介绍了SceneAssistant,一种用于开放词汇3D场景生成的视觉反馈驱动代理。我们的框架利用了现代3D对象生成模型以及视觉语言模型(VLM)的空间推理和规划能力。为了实现开放词汇场景组合,我们为VLM提供了全面的原子操作集(例如,Scale、Rotate、FocusOn)。在每次交互步骤中,VLM接收渲染的视觉反馈并相应地采取行动,逐步细化场景以实现更连贯的空间布局并更好地与输入文本对齐。实验结果表明,我们的方法可以生成多样、开放词汇且高质量的3D场景。定性分析和定量的人工评估都证明了我们的方法优于现有方法。此外,我们的方法允许用户根据自然语言命令指示代理编辑现有场景。我们的代码可在 https://github.com/ROUJINN/SceneAssistant 获取。
Summary / 总结
SceneAssistant is a visual-feedback-driven agent for open-vocabulary 3D scene generation, which uses a 3D object generation model and Vision-Language Models to iteratively refine scenes based on natural language input. The method provides VLMs with atomic operations and visual feedback at each step, leading to more coherent and aligned scenes. Experimental results show that SceneAssistant can generate diverse and high-quality 3D scenes, outperforming existing methods both qualitatively and quantitatively, and supports editing existing scenes with natural language instructions.
SceneAssistant 是一种通过自然语言描述生成开放词汇3D场景的方法,利用视觉反馈驱动的方式,结合3D对象生成模型和视觉语言模型(VLMs)逐步细化场景。该方法能够生成多样且高质量的3D场景,与输入文本高度匹配,并在定性和定量评估中优于现有方法。此外,它还支持使用自然语言指令编辑现有场景。
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
First: 2026-03-12T17:30:49+00:00 · Latest: 2026-03-12T17:30:49+00:00
Abstract
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves $2.97\times$ speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
中文标题/摘要
标题:ForensicZip:更多的标记更好但并非必要——在法医视觉语言模型中的应用
多模态大型语言模型(MLLMs)通过生成伪造检测的文本解释来实现多媒体可解释性取证。然而,处理密集的视觉序列会带来高昂的计算成本,特别是对于高分辨率图像和视频。视觉标记剪枝是一种实用的加速策略,但现有方法主要基于语义驱动,保留显著对象,而丢弃包含伪造痕迹(如高频异常和时间抖动)的背景区域。为了解决这一问题,我们引入了ForensicZip,这是一种无需训练的框架,从伪造驱动的角度重新定义了标记压缩。ForensicZip将时间标记演变建模为具有松弛虚拟节点的出生-死亡最优传输问题,量化表明瞬态生成伪迹的物理不连续性。法医评分进一步将传输基础的新颖性与高频先验相结合,在大比例压缩下分离法医证据和语义内容。在深度伪造和AIGC基准测试中,即使在10%的标记保留率下,ForensicZip也实现了2.97倍的加速和超过90%的FLOPs减少,同时保持了最先进的检测性能。
Summary / 总结
The research aims to improve the efficiency of forensic vision-language models by addressing the high computational costs associated with processing dense visual sequences. ForensicZip is introduced as a training-free framework that reformulates token compression from a forgery-driven perspective, using a Birth-Death Optimal Transport problem to quantify physical discontinuities. At 10% token retention, ForensicZip achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance on deepfake and AIGC benchmarks.
ForensicZip 是一个无需训练的框架,从伪造驱动的角度重新定义了 token 压缩问题,使用 Birth-Death Optimal Transport 问题来量化视觉序列中的物理不连续性。在 10% token 保留的情况下,ForensicZip 实现了 2.97 倍的加速和超过 90% 的 FLOPs 减少,同时在深伪和 AIGC 基准测试中保持了最先进的检测性能。
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
First: 2026-03-12T17:27:21+00:00 · Latest: 2026-03-12T17:27:21+00:00
Abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
中文标题/摘要
标题:IndexCache:通过跨层索引重用加速稀疏注意
长上下文代理工作流已成为大型语言模型的关键使用案例,使得注意效率对于推理速度和提供成本至关重要。稀疏注意有效地应对了这一挑战,DeepSeek 稀疏注意(DSA)是一种代表性的生产级解决方案:一个轻量级的闪电索引器选择每个查询的最相关的 top-k 个令牌,将核心注意从 $O(L^2)$ 减少到 $O(Lk)$。然而,索引器本身保持 $O(L^2)$ 复杂性,并且必须在每一层独立运行,尽管连续层的结果 top-k 选择高度相似。我们提出了 IndexCache,通过将层划分为运行自己索引器的小型全层集和简单重用最近全层 top-k 索引的大多数共享层来利用这种跨层冗余。我们提出了两种互补的方法来确定和优化此配置。无需训练的 IndexCache 使用贪婪搜索算法直接在校准集上最小化语言建模损失来选择保留索引器的层,无需权重更新。基于训练的 IndexCache 引入了一种多层蒸馏损失,训练每个保留的索引器与它服务的所有层的平均注意分布进行对比,即使简单的交错模式也能达到全索引器的准确性。在 30B DSA 模型上的实验结果显示,IndexCache 可以去除 75% 的索引器计算,质量下降可以忽略不计,相比标准 DSA 实现了高达 1.82$\times$ 前填速度提升和 1.48$\times$ 解码速度提升。初步实验进一步证实了我们在生产规模 GLM-5 模型上的这些积极结果(图 1)。
Summary / 总结
IndexCache accelerates sparse attention by reusing top-k indices across layers: a small set of Full layers run their own indexers, and the remaining Shared layers reuse the nearest Full layer's selection. This removes 75% of indexer computations with negligible quality degradation. It offers two approaches: a training-free method that greedily chooses which layers keep their indexers by minimizing language modeling loss on a calibration set, and a training-aware method that uses a multi-layer distillation loss. On a 30B DSA model, IndexCache achieves up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA.
IndexCache 通过跨层重用 top-k 索引来加速稀疏注意力,在质量下降可以忽略的情况下去除 75% 的索引器计算。它使用贪心搜索算法实现无训练的IndexCache,并使用多层蒸馏损失实现有训练的IndexCache,相比标准DSA可实现最高1.82倍的预填充加速和1.48倍的解码加速。
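The Full/Shared layer split described in the abstract can be sketched as a small cache over layers. This is an illustrative sketch under stated assumptions, not the authors' code: `select_indices_per_layer`, `top_k_indices`, and the `indexer` callback are hypothetical names, and "nearest Full layer" is assumed here to mean the nearest preceding Full layer, since that is the only selection available during a forward pass.

```python
from typing import Callable, Dict, List, Set


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_indices_per_layer(
    num_layers: int,
    full_layers: Set[int],
    indexer: Callable[[int], List[float]],
    k: int,
) -> Dict[int, List[int]]:
    """Run the indexer only on Full layers; Shared layers reuse the cached top-k.

    Assumption: a Shared layer reuses the most recent preceding Full layer.
    """
    if 0 not in full_layers:
        raise ValueError("layer 0 must be a Full layer so there is something to reuse")
    selected: Dict[int, List[int]] = {}
    cached: List[int] = []
    for layer in range(num_layers):
        if layer in full_layers:
            # Full layer: pay for the indexer and refresh the cache.
            cached = top_k_indices(indexer(layer), k)
        # Shared layer: no indexer call, reuse the cached indices as-is.
        selected[layer] = cached
    return selected
```

With 8 layers and Full layers `{0, 4}`, the indexer runs twice instead of eight times, which is the kind of 75% reduction in indexer calls the abstract reports. Which layers to keep as Full is exactly what the paper's greedy search or distillation training decides; the sketch takes that set as given.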
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
Authors: Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu
First: 2026-03-12T17:09:20+00:00 · Latest: 2026-03-12T17:09:20+00:00
Abstract
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
中文标题/摘要
标题:BehaviorVLM:统一的无需微调的行为理解与视觉-语言推理
理解自由移动的动物行为是神经科学的核心,其中姿态估计和行为理解构成了将神经活动与自然动作联系起来的基础。然而,这两个任务仍然严重依赖于人工注释或不稳定的无监督管道,限制了其可扩展性和可重复性。我们提出了BehaviorVLM,这是一种统一的视觉-语言框架,用于姿态估计和行为理解,无需特定任务的微调和最少的人工标注,通过引导预训练的视觉-语言模型(VLMs)进行详细的、明确的和可验证的推理步骤。对于姿态估计,我们利用量子点标记的行为数据,并提出了一种多阶段管道,结合了时间、空间和跨视图推理。这种设计大大减少了人工标注的工作量,通过几何检查如重投影误差暴露了低置信度的标签,并生成了可以稍后过滤、修正或用于微调下游姿态模型的标签。对于行为理解,我们提出了一种管道,结合了深度嵌入聚类以发现过度分割的行为,基于VLM的每段视频字幕,以及基于LLM的推理以合并和语义标注行为片段。行为管道可以直接从视觉信息运行,不需要关键点来分割行为。这些组件共同实现了多动物行为的大规模、可解释和轻标注分析。
Summary / 总结
The research aims to improve the scalability and reproducibility of understanding animal behavior in neuroscience by leveraging a unified vision-language framework called BehaviorVLM. This framework uses pretrained models and detailed reasoning steps to achieve pose estimation and behavioral understanding without task-specific fine-tuning. Key findings include a multi-stage pose estimation pipeline that reduces human annotation effort and an integrated behavioral understanding pipeline that discovers and labels behaviors directly from visual information, enhancing the interpretability and scalability of multi-animal behavior analysis.
研究旨在提高理解自由移动动物行为的可扩展性和可重复性。它引入了BehaviorVLM,一个统一的视觉语言框架,使用预训练模型和详细的推理步骤进行姿态估计和行为理解,无需特定任务的微调。该框架减少了人工标注的工作量,并生成可以过滤或修正的标签。对于行为理解,它结合了深度嵌入聚类、基于VLM的视频字幕生成和基于LLM的推理,直接从视觉信息中发现和标注行为。
GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
Authors: Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang
First: 2026-03-12T16:53:06+00:00 · Latest: 2026-03-12T16:53:06+00:00
Abstract
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.
中文标题/摘要
标题:GlyphBanana:通过自主工作流提升精确文本渲染
尽管生成模型的最新进展在文本渲染方面取得了显著进步,但准确生成复杂文本和数学公式仍然是一项艰巨的挑战。这一困难主要源于当前模型在遇到分布外提示时有限的指令遵循能力。为了解决这一问题,我们引入了GlyphBanana,并设计了一个相应的基准,专门用于渲染复杂字符和公式。GlyphBanana采用了一种自主工作流,将辅助工具集成到潜在空间和注意力图中,以注入字形模板,促进生成图像的迭代优化。值得注意的是,我们的无训练方法可以无缝应用于各种文本到图像(T2I)模型,相比现有基线实现了更高的精度。大量实验表明了我们提出的工作流的有效性。相关代码已公开发布在https://github.com/yuriYanZeXuan/GlyphBanana。
Summary / 总结
The research aims to improve the precision of text rendering, especially for complex characters and formulas, by addressing the limitations of current generative models in handling out-of-distribution prompts. GlyphBanana uses an agentic workflow that integrates auxiliary tools to inject glyph templates into the latent space and attention maps, allowing for iterative refinement. Experiments show that this approach achieves superior precision compared to existing methods without requiring training.
研究旨在通过解决当前生成模型在处理分布外提示时的局限性,提高复杂字符和公式的文本渲染精度。GlyphBanana采用一种代理工作流,借助辅助工具将字形模板注入潜在空间和注意力图中,实现迭代优化。实验表明,这种方法在不需训练的情况下能获得比现有方法更高的精度。
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
Authors: Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang
First: 2026-03-12T16:45:42+00:00 · Latest: 2026-03-12T16:45:42+00:00
Comments: The source code will be made publicly available at https://github.com/MengfeiD/O3N
Abstract
Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.
中文标题/摘要
标题:O3N:全方位开放式词汇占用预测
通过全方位感知理解并重建3D世界是自主代理和具身智能发展中不可避免的趋势。然而,现有的3D占用预测方法受限于有限视角输入和预定义的训练分布,难以应用于需要全面和安全场景感知的具身代理。为解决这一问题,我们提出了O3N,这是首个纯视觉、端到端的全方位开放式词汇占用预测框架。O3N通过Polar-spiral Mamba (PsM) 模块嵌入全方位体素,以极螺旋拓扑结构实现连续的空间表示和360°范围内的长程上下文建模。Occupancy Cost Aggregation (OCA) 模块引入了一种原理性的机制,用于在体素空间内统一几何和语义监督,确保重建几何与底层语义结构的一致性。此外,Natural Modality Alignment (NMA) 建立了一种无梯度对齐路径,协调视觉特征、体素嵌入和文本语义,形成一致的“像素-体素-文本”表示三元组。在多个模型上的广泛实验表明,我们的方法不仅在QuadOcc和Human360Occ基准测试中达到了最先进的性能,还展示了出色的跨场景泛化能力和语义可扩展性,为通用3D世界建模铺平了道路。源代码将在https://github.com/MengfeiD/O3N公开。
Summary / 总结
O3N is an end-to-end framework for omnidirectional open-vocabulary occupancy prediction, addressing limitations of existing methods by incorporating continuous spatial representation and long-range context modeling. It uses a Polar-spiral Mamba module to embed omnidirectional voxels and an Occupancy Cost Aggregation module to unify geometric and semantic supervision. The Natural Modality Alignment module aligns visual features, voxel embeddings, and text semantics. Experiments show O3N outperforms existing methods on QuadOcc and Human360Occ benchmarks and demonstrates strong cross-scene generalization and semantic scalability.
O3N 是一种端到端的全景开放词汇占用预测框架,通过 Polar-spiral Mamba (PsM) 模块嵌入全景体素并使用 Occupancy Cost Aggregation (OCA) 模块统一几何和语义监督,以及通过 Natural Modality Alignment (NMA) 模块对齐视觉特征、体素嵌入和文本语义。实验表明,O3N 在 QuadOcc 和 Human360Occ 基准上优于现有方法,并展示了强大的跨场景泛化能力和语义可扩展性。
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
Authors: Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen
Venue: CVPR 2026
First: 2026-03-12T16:40:59+00:00 · Latest: 2026-03-12T16:40:59+00:00
Comments: Accepted by CVPR 2026
Abstract
Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.
中文标题/摘要
标题:HATS:面向GUI代理的硬度感知轨迹合成
由大规模视觉-语言模型(VLMs)驱动的图形用户界面(GUI)代理在自动化数字任务方面展现了显著潜力,突显了高质量轨迹数据对于有效代理训练的必要性。然而,现有的轨迹合成管道往往生成的代理无法超越简单的交互进行泛化。我们发现这一局限源于对语义含糊动作的忽视,这些动作的意义依赖于上下文、序列或视觉上的含糊性。这些动作对于现实世界的鲁棒性至关重要,但在当前数据集中却严重不足且处理不佳,导致任务指令与执行之间存在语义不匹配。为解决这些问题,我们提出了HATS,一种硬度感知轨迹合成框架,旨在减轻语义含糊性的影响。我们将硬度定义为与动作相关的语义含糊程度,并开发了两个互补模块:(1)硬度驱动探索,引导数据收集向含糊但有信息价值的交互;(2)对齐引导精炼,迭代验证和修复指令执行对齐。两个模块在一个闭环中运行:探索为精炼提供具有挑战性的轨迹,而精炼反馈更新硬度信号以指导未来的探索。广泛的实验表明,使用HATS训练的代理在基准GUI环境中始终优于最先进的基线。
Summary / 总结
The research aims to improve the generalization ability of GUI agents by addressing the issue of semantic ambiguity in trajectory synthesis. The proposed HATS framework introduces hardness-driven exploration and alignment-guided refinement to collect and refine trajectories, thereby enhancing the robustness of GUI agents. Experiments demonstrate that HATS-trained agents outperform existing methods in various GUI environments.
研究旨在通过解决轨迹合成中的语义模糊问题,提高GUI代理的泛化能力。提出了HATS框架,将硬度定义为动作的语义模糊程度,并引入了两个模块:硬度驱动的探索和对齐引导的精炼。实验表明,使用HATS训练的代理在各种GUI环境中表现优于现有方法。
LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
Authors: Zhaoyang Jiang, Zhizhong Fu, David McAllister, Yunsoo Kim, Honghan Wu
First: 2026-03-12T15:40:59+00:00 · Latest: 2026-03-12T15:40:59+00:00
Abstract
Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risk of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA), and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.
中文标题/摘要
标题:LoV3D:通过区域体积评估为纵向3D脑MRI中的认知预后推理提供依据
纵向脑MRI对于表征阿尔茨海默病等神经系统疾病的进展至关重要。然而,当前的深度学习工具将此过程割裂:分类器将扫描简化为标签,体积管道生成未经解释的测量值,而视觉-语言模型(VLMs)可能会生成流畅但可能是幻觉的结论。我们提出了LoV3D,这是一种用于训练3D视觉-语言模型的管道,该管道读取纵向T1加权脑MRI,生成区域级别的解剖评估,与先前扫描进行纵向比较,最后输出三类诊断(认知正常、轻度认知障碍或痴呆)以及合成的诊断摘要。分步管道通过强制执行标签一致性、纵向连贯性和生物学合理性来为最终诊断提供依据,从而降低幻觉的风险。训练过程引入了一个临床加权的验证器,依据由标准化体积指标得出的常模参考值自动为候选输出评分,从而驱动直接偏好优化,无需任何人工标注。在按受试者划分的ADNI保留测试集(479个扫描,258个受试者)上,LoV3D在三类诊断准确性上达到93.7%(比无依据基线高34.8%),在两类诊断准确性上达到97.2%(比SOTA高4%),在区域级别的解剖分类准确性上达到82.6%(比VLM基线高33.1%)。零样本迁移在MIRIAD上达到95.4%(痴呆召回率100%),在AIBL上达到82.9%的三类准确性,证实了其在不同站点、扫描仪和人群中的高泛化能力。代码可在 https://github.com/Anonymous-TEVC/LoV-3D 获取。
Summary / 总结
The research aims to improve the accuracy and reliability of diagnosing neurological diseases using longitudinal 3D brain MRI. LoV3D is a pipeline that uses 3D vision-language models to analyze MRI scans, providing a region-level anatomical assessment and a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia). The pipeline ensures label consistency, longitudinal coherence, and biological plausibility, achieving 93.7% three-class diagnostic accuracy and 97.2% two-class accuracy on the ADNI test set, outperforming existing methods. Zero-shot transfer confirms its generalizability across different datasets and populations.
研究旨在通过纵向3D脑MRI提高神经疾病诊断的准确性和可靠性。LoV3D管道使用3D视觉语言模型分析MRI扫描,提供区域级别的解剖评估和三类诊断(认知正常、轻度认知障碍或痴呆)。该管道确保标签一致性、纵向连贯性和生物学合理性,ADNI测试集上的三类诊断准确率达到93.7%,二类准确率达到97.2%,优于现有方法。零样本迁移验证了其在不同数据集和人群中的普适性。
Coarse-Guided Visual Generation via Weighted h-Transform Sampling
Authors: Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen
First: 2026-03-12T15:26:19+00:00 · Latest: 2026-03-12T15:26:19+00:00
Abstract
Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and by restricted generalization due to paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance against synthesis quality. To address these challenges, we propose a novel guidance method based on the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the drift term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
中文标题/摘要
标题:基于加权h-变换采样的粗粒度视觉生成
粗粒度视觉生成是从退化或低保真度的粗略参考中合成精细视觉样本的技术,对于各种实际应用至关重要。虽然基于训练的方法很有效,但它们受到高训练成本以及配对数据收集所导致的泛化受限的内在局限。因此,最近的无训练方法提出利用预训练的扩散模型,并在采样过程中引入指导。然而,这些无训练方法要么需要知道正向(精细到粗略)变换算子,例如双三次下采样,要么难以在指导和合成质量之间取得平衡。为了解决这些挑战,我们提出了一种新颖的指导方法,使用h-变换,这是一种可以在所需条件下约束随机过程(例如采样过程)的工具。具体来说,我们通过在原始微分方程中添加一个漂移函数来修改每个采样时间步的转移概率,从而近似地引导生成趋向理想的精细样本。为了解决不可避免的近似误差,我们引入了一种噪声级别感知的调度方案,随着误差增加逐渐减少该漂移项的权重,从而确保对指导的遵循和高质量的合成。广泛的实验表明,我们的方法在各种图像和视频生成任务中具有有效性和泛化能力。
Summary / 总结
The paper addresses the challenge of synthesizing fine visual samples from degraded coarse references, which is crucial for various applications. It proposes a novel coarse-guided visual generation method using the h-transform, which modifies the sampling process to steer towards the ideal fine sample. The method introduces a noise-level-aware schedule to balance guidance and synthetic quality, improving both adherence to the reference and the quality of the generated samples. Experiments show the method's effectiveness and generalization across different tasks.
论文解决了从退化的粗略参考中合成精细视觉样本的问题,这对于各种应用至关重要。提出了一种使用h-变换的新型引导方法,通过在每个采样时间步添加一个漂移函数来修改转移概率,以引导生成向理想的精细样本靠拢。引入了一个噪声级别感知的调度方案来缓解近似误差,确保生成过程既遵循指导又保持高质量。实验表明该方法在不同任务中的有效性和普适性。
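The drift modification described in the abstract can be written out as a hedged sketch. For a generic diffusion $dX_t = b(X_t, t)\,dt + \sigma(t)\,dW_t$, Doob's h-transform conditions the process on a terminal event by adding a drift term built from $h(x, t)$, the probability of reaching that event from state $x$ at time $t$. The weight $w(t)$ below is a hypothetical stand-in for the paper's noise-level-aware schedule; its exact form, the sign and time conventions of the sampler, and the approximation of $h$ used in practice are not given in the abstract.

$$
dX_t = \left[\, b(X_t, t) + w(t)\,\sigma(t)^2\,\nabla_x \log h(X_t, t) \,\right] dt + \sigma(t)\,dW_t, \qquad 0 \le w(t) \le 1 .
$$

With $w(t) \equiv 1$ this is the standard h-transform. Shrinking $w(t)$ where the approximation of $h$ is less reliable trades guidance strength for sample quality, which is the balance the abstract describes.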
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
Authors: Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li
First: 2026-03-12T15:25:53+00:00 · Latest: 2026-03-12T15:25:53+00:00
Comments: 14 pages, 11 figures, under review
Abstract
Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.
中文标题/摘要
标题:通过语义-几何保留实现预训练视觉-语言模型的持续学习
预训练视觉-语言模型(VLMs)的持续学习容易发生灾难性遗忘,当前方法在适应新任务时并未明确保留从预训练和先前阶段继承的跨模态语义几何结构,导致新任务监督诱导几何失真。我们观察到,最明显的漂移倾向于集中在旧新语义界面附近的脆弱区域,在这些区域中,共享的视觉模式容易被新的文本语义重新解释。为了解决这一问题,在无示例约束下,我们提出了语义几何保留的持续学习(SeGP-CL)。SeGP-CL 首先通过双重目标投影梯度下降(DPGD)构建紧凑的对抗锚集来探测易漂移区域,驱动选定的新任务种子向旧类语义靠拢,同时在原始视觉空间中保持忠实。在训练过程中,通过锚引导的跨模态几何蒸馏(ACGD)保留跨模态结构,并通过轻量级文本语义-几何正则化(TSGR)在任务间稳定文本参考框架。训练后,我们估计锚引起的原始空间漂移,转移旧视觉原型,并通过融合跨模态和视觉线索进行双路径推理。在五个持续学习基准上的广泛实验表明,SeGP-CL 一致地提高了稳定性和前向迁移,同时更好地保留了 VLMs 的语义几何结构,达到最先进的性能。
Summary / 总结
The research aims to address catastrophic forgetting in continual learning of vision-language models by preserving semantic geometry. The method involves constructing adversarial anchors using DPGD to guide new-task seeds towards old-class semantics while maintaining visual fidelity. During training, cross-modal structure is preserved through ACGD and the textual reference frame is stabilized with TSGR. The approach improves stability and forward transfer, achieving state-of-the-art performance on five benchmarks while better preserving semantic geometry of VLMs.
论文通过提出SeGP-CL,即通过对抗锚点构造和跨模态几何蒸馏来保持语义几何结构,解决了视觉语言模型(VLMs)在持续学习中的灾难性遗忘问题。实验表明,SeGP-CL 提高了稳定性和前向迁移,同时保持了VLMs的语义几何结构,达到了最先进的性能。
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan
First: 2026-03-12T15:14:48+00:00 · Latest: 2026-03-12T15:14:48+00:00
Abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately 1.6× to 14.4× higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
中文标题/摘要
标题:慢速-快速推理:基于句内支持稳定性的无训练推理加速
长上下文自回归解码仍然很昂贵,因为每次解码步骤都必须反复处理不断增长的历史记录。我们在解码过程中观察到一个一致的模式:在一个句子内,更广泛地说在一个短的语义连贯的片段内,主导的注意力支持通常保持相对稳定。受此观察的启发,我们提出了慢速-快速推理(SFI),这是一种无训练的解码框架,将生成过程解耦为频繁的低成本快速步骤和偶尔的密集注意力慢速步骤。快速步骤重用紧凑的稀疏记忆以实现高效的解码。慢速步骤在语义边界附近被触发。在慢速步骤中,模型回顾更广泛的上下文,并使用选择器刷新选定的记忆,以供后续快速步骤使用。在评估的不同上下文长度下,SFI 大约提供了 1.6 倍至 14.4 倍的更高解码吞吐量,同时在长上下文和长链推理设置中通常保持与全键值基线相当的质量。由于 SFI 是无训练的,并且可以直接应用于现有的检查点,因此它为减少当前自回归推理模型在长上下文、长展望和代理工作负载中的推理成本提供了一条实用的道路。
Summary / 总结
The paper addresses the inefficiency of long-context autoregressive decoding by proposing Slow-Fast Inference (SFI), a training-free method that decouples the decoding process into fast and slow steps. Fast steps use a compact memory for efficient decoding, while slow steps, triggered near semantic boundaries, refresh the memory. SFI achieves approximately 1.6 to 14.4 times higher decoding throughput while maintaining quality comparable to the full-KV baseline in long-context and long-chain-of-thought settings.
论文提出了一种名为Slow-Fast Inference (SFI)的无训练解码框架,通过将解码过程拆分为频繁的快速步骤和偶尔的慢速步骤来加速长上下文自回归解码。快速步骤利用紧凑的稀疏记忆进行高效解码,而慢速步骤在语义边界附近触发,刷新选定的记忆。SFI在长上下文和长CoT设置中实现了约1.6到14.4倍的更高解码吞吐量,同时保持与全KV基线相当的质量。
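The slow/fast alternation described in the SFI entry above can be sketched as a scheduling loop. This is a toy illustration under stated assumptions, not the authors' implementation: the boundary rule (sentence-ending punctuation), the `score_fn` callback standing in for attention scores, and the top-k rule standing in for the Selector are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: sentence-ending punctuation marks a semantic boundary.
BOUNDARY_TOKENS = {".", "!", "?"}


def top_k_positions(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def slow_fast_schedule(
    tokens: Sequence[str],
    score_fn: Callable[[int, int], float],
    k: int,
) -> List[Tuple[str, List[int]]]:
    """Label each step 'slow' or 'fast' and list the positions it attends to.

    score_fn(t, j) is a hypothetical stand-in for the attention score that
    step t assigns to earlier position j.
    """
    plan: List[Tuple[str, List[int]]] = []
    memory: List[int] = []   # positions kept by the last slow step
    last_refresh = 0         # index of the last slow step
    slow_next = True         # the first step is dense: nothing is selected yet
    for t, tok in enumerate(tokens):
        if slow_next:
            # Slow step: attend to the full history, then refresh the memory.
            history = list(range(t))
            scores = [score_fn(t, j) for j in history]
            memory = top_k_positions(scores, k)
            last_refresh = t
            plan.append(("slow", history))
        else:
            # Fast step: reuse the selected memory plus positions produced
            # since the last refresh.
            recent = range(last_refresh, t)
            plan.append(("fast", sorted(set(memory) | set(recent))))
        # Schedule a slow step right after a boundary token.
        slow_next = tok in BOUNDARY_TOKENS
    return plan


plan = slow_fast_schedule(["A", "b", ".", "C", "d"], lambda t, j: -abs(t - j), k=2)
```

In this sketch the fast steps touch only a bounded set of positions between boundaries, which is where the throughput gain would come from; the real system's trigger and Selector may differ substantially.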
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas
Authors: Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach
First: 2026-03-06T11:22:14+00:00 · Latest: 2026-03-12T14:27:22+00:00
Abstract
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
中文标题/摘要
标题:JOPP-3D:联合开放词汇语义分割点云和全景图
跨视觉模态(如3D点云和全景图像)的语义分割仍然是一个具有挑战性的任务,主要是由于标注数据的稀缺性和固定标签模型的有限适应性。在本文中,我们提出了JOPP-3D,这是一种联合利用全景图和点云数据的开放词汇语义分割框架,以实现基于语言的场景理解。我们将RGB-D全景图像转换为其相应的切线视角图像和3D点云,然后使用这些模态来提取和对齐基础的视觉-语言特征。这使得自然语言查询能够在输入的两种模态上生成语义掩码。在斯坦福-2D-3D-s和ToF-360数据集上的实验评估表明,JOPP-3D能够在全景和3D领域生成连贯且语义上有意义的分割。我们提出的方法在开放词汇和封闭词汇的2D和3D语义分割中取得了显著的改进。
Summary / 总结
The research aims to address the challenge of semantic segmentation across 3D point clouds and panoramic images by developing JOPP-3D, an open-vocabulary semantic segmentation framework. The method converts RGB-D panoramic images into tangential perspective images and 3D point clouds, then extracts and aligns vision-language features to enable natural language querying for semantic segmentation. Experiments on Stanford-2D-3D-s and ToF-360 datasets show that JOPP-3D produces coherent and semantically meaningful segmentations, outperforming the state-of-the-art in both open and closed vocabulary 2D and 3D semantic segmentation.
研究旨在通过开发JOPP-3D开放词汇语义分割框架来解决3D点云和全景图像之间的语义分割挑战。该框架联合处理全景图像和点云以提取和对齐视觉-语言特征,使自然语言查询能够生成语义掩码。实验在斯坦福-2D-3D-s和ToF-360数据集上显示,JOPP-3D能够生成连贯且语义上有意义的分割结果,并在开放和封闭词汇设置中优于现有最佳方法。
HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu
First: 2026-03-12T14:25:44+00:00 · Latest: 2026-03-12T14:25:44+00:00
Abstract
The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
Summary / 总结
HomeSafe-Bench is a benchmark for evaluating Vision-Language Models on detecting unsafe actions in household scenarios, addressing the limitations of current safety evaluations that are often restricted to static images, text, or general hazards. It uses a hybrid pipeline combining physical simulation and video generation to create 438 diverse cases across six functional areas with fine-grained annotations. The study also introduces HD-Guard, a hierarchical streaming architecture that balances inference efficiency and detection accuracy by pairing a lightweight FastBrain for continuous screening with an asynchronous SlowBrain for deep multimodal reasoning. Evaluations show that HD-Guard achieves a superior trade-off between latency and performance.
HomeSafe-Bench 是一个基准,用于评估视觉-语言模型在家庭场景中检测不安全行为的能力,解决了当前安全评估的局限性。它通过结合物理模拟和视频生成,包含438个具有精细注释的多样化案例。提出的 HD-Guard 架构是一个分层流式系统,通过使用轻量级的 FastBrain 进行连续筛查和使用 SlowBrain 进行深度多模态推理来平衡推理效率和检测准确性,展示了在实时安全监控中的优越性能。
SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
Authors: Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li
First: 2025-06-16T17:27:47+00:00 · Latest: 2026-03-12T12:26:23+00:00
Abstract
Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.
中文标题/摘要
标题:SOTA:自适应最优运输在多基础模型零样本分类中的应用
基础模型因其强大的零样本分类能力而在各个领域引起了广泛关注。本文受到两个关键观察的启发:(1)视觉-语言模型(VLMs),如CLIP,往往过度依赖类别级别的文本先验,难以捕捉细微的视觉线索,而视觉基础模型(VFMs),如DINO,则提供了丰富的区分性视觉特征但缺乏语义对齐;(2)不同VLMs在不同数据集上的性能差异很大,这归因于预训练的不同。为了解决这些挑战,我们提出了SOTA(自适应最优运输),这是一种无需训练的集成框架,通过学习自适应运输计划来整合多个基础模型(VFMs或VLMs)的输出。值得注意的是,SOTA 是无先验的,并且能够自动平衡模型的贡献。在包括自然图像、医学病理和遥感在内的多个领域的广泛实验中,验证了SOTA 的普适性。结果一致表明,它有效地利用了不同基础模型的互补优势,并在单个模型上取得了显著的改进。代码实现可在:https://github.com/Afleve/self-adaptive-Optimal-Transport 获取。
Summary / 总结
This work addresses the limitations of vision-language models (VLMs) and vision-only foundation models (VFMs) in zero-shot classification by proposing SOTA, a training-free ensemble framework. SOTA integrates the outputs of multiple foundation models by learning a self-adaptive transport plan, which automatically balances model contributions without relying on prior information. Experiments across various domains demonstrate that SOTA effectively leverages the complementary strengths of different foundation models and achieves significant performance improvements over individual models.
该研究通过提出SOTA,一种无需训练的集成框架,解决了视觉语言模型(VLMs)和视觉基础模型(VFMs)在零样本分类中的局限性。SOTA 通过学习一个自适应的运输计划来整合多个基础模型,无需依赖先验知识即可自动平衡它们的贡献。实验结果表明,SOTA 能够有效利用不同基础模型的互补优势,显著优于单一模型的表现。
MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis
Authors: Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, Jianxin Lin
Venue: AAAI 2026
First: 2025-11-27T01:47:43+00:00 · Latest: 2026-03-12T11:42:07+00:00
Comments: AAAI 2026, Medical Chain-of-Thought (CoT), Reinforcement Learning with Verifiable Rewards (RLVR), Multimodal Grounded Reasoning
Abstract
Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5pp across multiple medical VQA benchmarks, validating MedEyes's potential in building trustworthy medical AI systems. Code is available at https://github.com/zhcz328/MedEyes.
中文标题/摘要
标题:MedEyes: 学习医疗渐进诊断的动态视觉聚焦
准确的医疗诊断通常涉及渐进的视觉聚焦和迭代推理,这是临床工作流程中常见的特征。虽然最近的视觉-语言模型通过强化学习和可验证奖励(RLVR)展示了链式推理(CoT)的能力,但它们的纯策略学习范式往往会强化表面上连贯但临床不准确的推理路径。我们提出MedEyes,这是一种新颖的强化学习框架,通过逐步关注和解释相关的医学图像区域,动态建模临床医生风格的诊断推理。通过结合离策略专家指导,MedEyes将专家的视觉搜索轨迹转化为结构化的外部行为信号,引导模型向临床对齐的视觉推理方向发展。我们设计了注视引导推理导航器(GRN),通过双模式探索策略模拟诊断过程,扫描系统异常定位并深入分析详细区域。为了平衡专家模仿和自主发现,我们引入了置信值采样器(CVS),它使用核采样和自适应终止来创建多样且可信的探索路径。最后,双流GRPO优化框架将策略学习信号和离策略学习信号解耦,缓解奖励同化和熵崩溃。实验表明,MedEyes在多个医学VQA基准测试中平均性能提高了8.5个百分点,验证了MedEyes在构建可信赖的医疗AI系统方面的潜力。代码可在https://github.com/zhcz328/MedEyes/ 获取。
Summary / 总结
MedEyes is a reinforcement learning framework designed to improve medical diagnosis by dynamically modeling the visual focus and reasoning process of clinicians. It uses off-policy expert guidance and a dual-stream optimization framework to balance imitation and discovery. Experimental results show that MedEyes improves performance by an average of 8.5 percentage points across multiple medical VQA benchmarks.
MedEyes 是一种强化学习框架,旨在通过动态模拟临床医生的视觉聚焦和推理过程来提升医学诊断。它利用离策训练专家指导和双模式探索策略来平衡模仿和发现,从而更好地与临床推理相一致。实验表明,MedEyes 在多个医学 VQA 基准测试中的性能提高了 8.5 个百分点,展示了其在构建可信医学 AI 系统方面的潜力。
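The MedEyes abstract above mentions that the Confidence Value Sampler employs nucleus sampling. For readers unfamiliar with the term, here is a minimal sketch of standard nucleus (top-p) sampling over a discrete distribution. It illustrates the general technique only; how CVS applies it to exploration paths, and its adaptive termination rule, are not described in the abstract and are not modeled. The function names are hypothetical.

```python
import random
from typing import Dict, Sequence


def nucleus_filter(probs: Sequence[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of highest-probability indices whose total mass
    reaches top_p, and renormalize the kept probabilities to sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return {i: probs[i] / total for i in kept}


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Draw one index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


rng = random.Random(0)
candidate_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical scores for 4 candidates
choice = nucleus_sample(candidate_probs, top_p=0.75, rng=rng)
```

Truncating the low-probability tail keeps sampled candidates plausible while still allowing variation among the top options, which matches the abstract's stated goal of "diverse yet credible" paths.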
Evaluating Generative Models via One-Dimensional Code Distributions
Authors: Zexi Jia, Pengcheng Luo, Yijia Zhong, Jinchao Zhang, Jie Zhou
First: 2026-03-09T07:57:56+00:00 · Latest: 2026-03-12T11:19:47+00:00
Abstract
Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of discrete visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments. We will release all code and datasets to facilitate future research, with the code publicly available at https://github.com/zexiJia/1d-Distance.
中文标题/摘要
标题:通过一维代码分布评估生成模型
大多数生成模型的评估依赖于特征分布度量,如FID,这些度量在连续的识别特征上运行,这些特征明确训练为对外观变化不变,因此丢弃了对感知质量至关重要的线索。相反,我们将在离散视觉标记的空间中评估模型,现代1D图像标记器紧凑地编码了语义和感知信息,质量表现为可预测的标记统计。我们引入了代码本直方图距离(CHD),这是一种无需训练的标记空间分布度量,以及代码混合模型评分(CMMS),这是一种从标记序列的合成退化中学习到的无参考质量度量。为了在广泛的分布偏移下测试度量,我们进一步提出了VisForm基准,包含210K张图像和62种视觉形式以及12种生成模型的专家注释。在AGIQA、HPDv2/3和VisForm中,我们的基于标记的度量与人类判断的相关性达到最新水平。我们将发布所有代码和数据集以促进未来的研究,代码可在https://github.com/zexiJia/1d-Distance公开获取。
Summary / 总结
This paper evaluates generative models by focusing on one-dimensional code distributions, which capture both semantic and perceptual information. It introduces two metrics: Codebook Histogram Distance (CHD) and Code Mixture Model Score (CMMS). The authors also propose VisForm, a benchmark with 210K images and expert annotations, to test the robustness of these metrics. The results show that the token-based metrics correlate well with human judgments across different benchmarks and datasets.
该论文通过关注一维代码分布来评估生成模型,这些分布能够捕捉语义和感知信息。作者引入了两个指标:码本直方图距离(CHD)和代码混合模型评分(CMMS)。此外,他们还提出了VisForm基准,包含210K张图像和专家注释,以测试这些指标的鲁棒性。结果显示,基于代码的指标与人类判断在不同基准和数据集上具有良好的相关性。
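The idea behind a codebook histogram metric, as described in the entry above, can be sketched as follows: count how often each codebook index appears in the tokenized real and generated image sets, normalize to distributions, and compare them. The abstract does not state which distance CHD uses, so the total variation distance below is an assumption, as are the function names and the toy token sequences.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def codebook_histogram(token_sequences: Iterable[Sequence[int]], codebook_size: int) -> List[float]:
    """Normalized frequency of each codebook index over a set of token sequences."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens to build a histogram from")
    return [counts.get(i, 0) / total for i in range(codebook_size)]


def histogram_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two histograms (assumed choice of distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical 1D token sequences from a tokenizer with a 4-entry codebook.
real_tokens = [[0, 1, 1, 2], [2, 2, 3, 3]]
generated_tokens = [[0, 0, 0, 0], [1, 1, 2, 3]]
distance = histogram_distance(
    codebook_histogram(real_tokens, 4),
    codebook_histogram(generated_tokens, 4),
)
```

No model is trained at any point, which is consistent with the abstract's description of CHD as training-free; a tokenizer is still needed to produce the token sequences.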
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset
Authors: Yongzhong Wang, Keyu Zhu, Yong Zhong, Liqiong Wang, Jinyu Yang, Feng Zheng
Venue: IROS
First: 2026-03-12T11:18:52+00:00 · Latest: 2026-03-12T11:18:52+00:00
Comments: 8 pages, 4 figures. Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Abstract
The acquisition of large-scale physical interaction data, a critical prerequisite for modern robot learning, is severely bottlenecked by the prohibitive cost and scalability limits of human-in-the-loop collection paradigms. To break this barrier, we introduce Robust Autonomous Data Acquisition for Robotics (RADAR), a fully autonomous, closed-loop data generation engine that completely removes human intervention from the collection cycle. RADAR elegantly divides the cognitive load into a four-module pipeline. Anchored by 2-5 3D human demonstrations as geometric priors, a Vision-Language Model first orchestrates scene-relevant task generation via precise semantic object grounding and skill retrieval. Next, a Graph Neural Network policy translates these subtasks into physical actions via in-context imitation learning. Following execution, the VLM performs automated success evaluation using a structured Visual Question Answering pipeline. Finally, to shatter the bottleneck of manual resets, a Finite State Machine orchestrates an autonomous environment reset and asymmetric data routing mechanism. Driven by simultaneous forward-reverse planning with a strict Last-In, First-Out causal sequence, the system seamlessly restores unstructured workspaces and robustly recovers from execution failures. This continuous brain-cerebellum synergy transforms data collection into a self-sustaining process. Extensive evaluations highlight RADAR's exceptional versatility. In simulation, our framework achieves up to 90% success rates on complex, long-horizon tasks, effortlessly solving challenges where traditional baselines plummet to near-zero performance. In real-world deployments, the system reliably executes diverse, contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation without domain-specific fine-tuning, providing a highly scalable paradigm for robotic data acquisition.
中文标题/摘要
标题:RADAR:通过语义规划和自主因果环境重置的闭环机器人数据生成
现代机器人学习的关键先决条件——大规模物理交互数据的获取,受到人类在环数据收集范式中高昂成本和可扩展性的限制。为突破这一障碍,我们引入了Robust Autonomous Data Acquisition for Robotics (RADAR),一个完全自主的闭环数据生成引擎,完全消除了数据收集周期中的人工干预。RADAR将认知负荷优雅地分为四个模块。以2-5个3D人类演示作为几何先验,视觉-语言模型首先通过精确的语义对象定位和技能检索来协调场景相关任务生成。接着,图神经网络策略通过上下文模仿学习将这些子任务转化为物理动作。执行后,VLM 使用结构化的视觉问答流水线进行自动成功评估。最后,为了打破手动重置的瓶颈,有限状态机协调自主环境重置和非对称数据路由机制。系统通过同时进行正向和反向规划,并严格遵循后进先出的因果序列,无缝恢复无序的工作空间并从执行失败中稳健恢复。这种持续的大脑-小脑协同作用将数据收集转变为自我维持的过程。广泛的评估突显了RADAR的卓越灵活性。在仿真中,我们的框架在复杂、长时程任务上的成功率高达90%,轻松解决了传统基线在这些任务上几乎失效的挑战。在实际部署中,系统可靠地执行各种接触丰富的技能(例如,可变形物体操作)并通过少量示例适应,提供了一种高度可扩展的机器人数据获取范式。
Summary / 总结
RADAR is a fully autonomous data generation system for robotics that eliminates human intervention in the data collection process. It uses a four-module pipeline: a Vision-Language Model for task generation, a Graph Neural Network policy for action execution, a Visual Question Answering pipeline for success evaluation, and a Finite State Machine for autonomous environment reset. In simulation, RADAR reaches up to 90% success on complex, long-horizon tasks; in real-world deployments, it reliably executes contact-rich skills via few-shot adaptation without domain-specific fine-tuning.
RADAR 是一个完全自主的数据生成系统,用于机器人学习,消除了数据收集过程中的手动干预。它采用四模块流水线:视觉语言模型进行任务生成、图神经网络进行动作执行、视觉问答管道进行成功评估以及有限状态机进行自主环境重置。RADAR 在模拟中展示了在复杂任务中的高成功率,并且能够通过少量示例适应执行多样化的接触丰富技能,在现实世界部署中表现出色。
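The Last-In, First-Out reset order mentioned in the RADAR abstract above can be illustrated with a small stack-based sketch: every executed move is pushed onto a history, and a reset pops moves and applies their inverses, so the most recent change is undone first. The `Move` and `Workspace` types and the object and location names are hypothetical; the actual system plans resets with a Finite State Machine and real robot skills.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Move:
    """Relocate an object from one named location to another."""
    obj: str
    source: str
    target: str

    def inverse(self) -> "Move":
        return Move(self.obj, self.target, self.source)


class Workspace:
    def __init__(self, locations: Dict[str, str]) -> None:
        self.locations = dict(locations)
        self.history: List[Move] = []  # stack of executed forward moves

    def apply(self, move: Move, record: bool = True) -> None:
        if self.locations[move.obj] != move.source:
            raise ValueError(f"{move.obj} is not at {move.source}")
        self.locations[move.obj] = move.target
        if record:
            self.history.append(move)

    def reset(self) -> List[Move]:
        """Undo all recorded moves in Last-In, First-Out order."""
        undone: List[Move] = []
        while self.history:
            reverse_move = self.history.pop().inverse()
            self.apply(reverse_move, record=False)
            undone.append(reverse_move)
        return undone


initial = {"cup": "table", "lid": "shelf"}
ws = Workspace(initial)
ws.apply(Move("cup", "table", "tray"))
ws.apply(Move("lid", "shelf", "on_cup"))
reset_plan = ws.reset()  # lid comes off the cup before the cup is moved back
```

Reversing in stack order matters whenever later moves depend on earlier ones: here the lid must leave the cup before the cup returns to the table.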
OSM-based Domain Adaptation for Remote Sensing VLMs
Authors: Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel
First: 2026-03-12T11:08:30+00:00 · Latest: 2026-03-12T11:08:30+00:00
Abstract
Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
中文标题/摘要
标题:基于OSM的数据域适应遥感VLM
视觉语言模型(VLMs)适应遥感依赖于特定领域的图像-文本监督,但高质量的卫星和航空影像标注稀缺且昂贵。现有的伪标签管道通过从大型前沿模型中提取知识来解决这一问题,但对大型教师模型的依赖性使得成本高昂、可扩展性受限,并且性能受限于教师模型的天花板。我们提出OSMDA:一种自包含的数据域适应框架,消除了对大型教师模型的依赖。我们的核心见解是,一个能力强的基础VLM可以作为自己的标注引擎:通过将航空图像与渲染的OpenStreetMap (OSM) 地图瓦片配对,利用模型的光学字符识别和图表理解能力生成富含OSM大量辅助元数据的描述。然后,模型仅使用卫星影像对该语料库进行微调,生成OSMDA-VLM,这是一种无需人工标注且无需更强外部模型的数据域适应VLM。我们进行了涵盖10个基准的详尽评估,包括图像-文本到文本任务,并与9个竞争基线进行比较。当与真实数据等量混合时,我们的方法达到了最先进的结果,而训练成本远低于依赖教师模型的替代方案。这些结果表明,给定一个强大的基础模型,与众包地理数据对齐是一种实用且可扩展的遥感数据域适应路径。数据集和模型权重将公开提供。
Summary / 总结
This paper addresses the challenge of domain adaptation for Vision-Language Models (VLMs) in remote sensing, where high-quality annotations are scarce. The authors propose OSMDA, a self-contained framework in which a base VLM captions aerial images paired with rendered OpenStreetMap (OSM) tiles, eliminating the need for a stronger external teacher model. The model is then fine-tuned on the resulting corpus using satellite imagery alone, yielding OSMDA-VLM. Across 10 benchmarks and 9 baselines, the method achieves state-of-the-art results when equally mixed with real data, while being substantially cheaper to train than teacher-dependent alternatives. This suggests that alignment with crowd-sourced geographic data can be a practical and scalable approach for remote sensing domain adaptation.
该论文旨在解决遥感领域中视觉-语言模型(VLM)的领域适应问题,其中高质量的标注数据稀缺。作者提出了一种名为OSMDA的自包含框架,通过将卫星图像与OpenStreetMap(OSM)图块配对,利用基础VLM生成带有OSM丰富辅助元数据的描述,从而无需手动标注或更强的外部模型。实验结果显示,当与真实数据混合时,OSMDA-VLM在10个基准测试中达到了最先进的效果,并且训练成本远低于依赖教师模型的方法。
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
Venue: CVPR 2026
First: 2025-10-21T13:36:58+00:00 · Latest: 2026-03-12T11:07:39+00:00
Comments: 25 pages, 17 figures
Abstract
Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, as humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by the VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code is available at https://github.com/zhangquanchen/3DThinker.
中文标题/摘要
标题:三维思考:基于有限视角的几何想象与空间推理
尽管近期视觉-语言模型(VLMs)在多种跨模态任务中取得了显著进展,但从有限视角理解三维空间关系仍然是一个重大挑战。以往的推理方法通常依赖纯文本(例如拓扑认知图)或二维视觉线索。然而,它们有限的表示能力阻碍了在需要三维空间想象的任务中的表现。为了解决这一限制,我们提出了3DThinker框架,该框架能够在推理过程中有效利用图像中嵌入的丰富几何信息,类似于人类的思考方式。我们的框架是第一个能够在推理过程中启用三维思考而无需任何三维先验输入的框架,并且在训练过程中不依赖于明确标注的三维数据。具体而言,我们的训练分为两个阶段。首先,我们进行监督训练,以使VLM在推理过程中生成的3D潜在表示与3D基础模型(例如VGGT)生成的3D潜在表示对齐。然后,我们仅基于结果信号优化整个推理轨迹,从而细化底层的三维思考。在多个基准测试中的广泛实验表明,3DThinker在多个基准测试中始终优于强基线,并为将三维表示统一到跨模态推理中提供了新的视角。我们的代码可在https://github.com/zhangquanchen/3DThinker获取。
Summary / 总结
The research aims to improve the ability of vision-language models to understand 3D spatial relationships from limited views, which is challenging for existing methods that rely on text or 2D visual cues. The proposed 3DThinker framework exploits geometric information within images during reasoning, enabling 3D spatial imagination without any 3D prior input and without explicitly labeled 3D training data. Training has two stages: supervised alignment of the VLM's 3D latent with that of a 3D foundation model (e.g., VGGT), followed by optimization of the whole reasoning trajectory from outcome signals alone. Experiments show that 3DThinker outperforms strong baselines across multiple benchmarks.
论文针对从有限视角理解三维空间关系这一挑战,提出了一种名为3DThinker的框架,该框架利用图像中的几何信息进行三维推理。框架分为两个阶段:监督训练以对齐3D潜在空间,以及基于结果信号优化推理轨迹。实验表明,3DThinker在多个基准测试中优于强基线,并提供了一种将3D表示整合到多模态推理中的新方法。
MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
Authors: Animesh Jain, Alexandros Stergiou
First: 2025-08-11T10:36:58+00:00 · Latest: 2026-03-12T10:48:14+00:00
Comments: Project page: https://anaekin.github.io/MIMIC
Abstract
Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
中文标题/摘要
标题:MIMIC:多模态反演以实现模型解释与概念化
视觉语言模型(VLMs)将多模态输入编码到大型、复杂且难以解释的架构中,这限制了透明度和信任度。我们提出了一种多模态反演以实现模型解释与概念化(MIMIC)框架,该框架可以反演VLM的内部编码。MIMIC使用基于VLM的联合反演和特征对齐目标来考虑VLM的自回归处理。此外,它还包括一个用于空间对齐、自然图像平滑性和语义现实性的三重正则化器。我们通过反演不同长度的自由形式VLM输出中的视觉概念,从定量和定性两个方面评估MIMIC。报告的结果包括标准的视觉质量指标和语义文本指标。据我们所知,这是第一个针对VLM概念视觉解释的模型反演方法。
Summary / 总结
The research aims to enhance the transparency and trust in Vision Language Models (VLMs) by developing a framework called MIMIC for multimodal inversion. MIMIC inverts the internal encodings of VLMs using a joint VLM-based inversion and feature alignment objective, along with regularizers for spatial alignment, natural image smoothness, and semantic realism. The framework is evaluated both quantitatively and qualitatively by inverting visual concepts across free-form VLM outputs of varying length, with results reported on standard visual quality metrics and semantic text-based metrics.
研究旨在通过开发名为MIMIC的多模态反演框架来增强视觉语言模型(VLM)的透明度和信任度。MIMIC使用联合反演和特征对齐目标,并包含空间对齐、自然图像平滑性和语义现实性的正则化项。研究通过在各种VLM输出中反演视觉概念,从定量和定性两个方面进行评估,报告了视觉质量和语义文本指标。这是第一个专注于VLM概念视觉解释的模型反演方法。
Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
Authors: Tong Wang, Yunhan Zhao, Shu Kong
First: 2026-01-31T16:42:55+00:00 · Latest: 2026-03-12T09:13:49+00:00
Abstract
Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Particularly, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses LMM to construct a "paracosm", where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
中文标题/摘要
标题:生成平行宇宙以实现无需训练的零样本组合图像检索
组合图像检索(CIR)是指使用包含参考图像和修改文本的多模态查询从数据库中检索目标图像的任务。文本说明如何修改参考图像以形成“心理图像”,基于此,CIR 应在数据库中找到目标图像。CIR 的基本挑战在于这种“心理图像”是不可物理获取的,仅由查询隐式定义。当代文献追求零样本方法,并使用大型多模态模型(LMM)生成给定多模态查询的文本描述,然后使用视觉语言模型(VLM)进行文本-视觉匹配以搜索目标图像。相反,我们从第一原理出发,直接生成“心理图像”以实现更准确的匹配。特别地,我们提示 LMM 生成给定多模态查询的“心理图像”,并建议使用此“心理图像”来搜索目标图像。由于“心理图像”与真实图像之间存在合成到现实的领域差距,我们还为数据库中的每个真实图像生成一个合成对应物以促进匹配。因此,我们的方法使用 LMM 构建一个“平行宇宙”,其中它匹配多模态查询和数据库图像。因此,我们称此方法为“平行宇宙”。值得注意的是,平行宇宙是一种无需训练的零样本 CIR 方法。它在具有挑战性的基准测试中显著优于现有零样本方法,实现了零样本 CIR 的最佳性能。
Summary / 总结
The paper addresses Composed Image Retrieval (CIR) by directly generating a 'mental image' using a Large Multimodal Model (LMM) for a given multimodal query, and then searching for the target image in a database. To bridge the synthetic-to-real domain gap, the authors generate synthetic counterparts for each real image in the database. This approach, named Paracosm, is training-free and zero-shot, and it outperforms existing methods on challenging benchmarks for zero-shot CIR.
论文通过使用大型多模态模型(LMM)直接生成给定多模态查询的‘心理图像’,然后在数据库中查找目标图像来解决组合图像检索(CIR)的挑战。这种方法称为Paracosm,通过为数据库中的每个真实图像生成一个合成的对应物来构建一个合成的‘平行宇宙’,这有助于更准确的匹配。该方法是无需训练和零样本的,并在具有挑战性的基准测试中显著优于现有零样本方法,实现了零样本CIR的最佳性能。
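The matching step that the Paracosm abstract describes can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_mental_image`, `generate_synthetic_counterpart`, and `embed` are hypothetical stand-ins for the LMM image generator and the image encoder, replaced here by toy functions on three-element vectors so the example runs on its own. `SYNTH_SHIFT` is an invented constant that mimics the synthetic-to-real domain gap.

```python
import math

# Toy stand-in for the synthetic-to-real domain gap (assumption, not from the paper).
SYNTH_SHIFT = [0.5, 0.5, 0.5]

def generate_mental_image(reference, modification):
    # Hypothetical LMM call: apply the edit; the output lives in the synthetic domain.
    return [r + m + s for r, m, s in zip(reference, modification, SYNTH_SHIFT)]

def generate_synthetic_counterpart(real_image):
    # Hypothetical LMM call: re-render a database image in the synthetic domain.
    return [x + s for x, s in zip(real_image, SYNTH_SHIFT)]

def embed(image):
    # Hypothetical image encoder; the identity map is enough for this toy example.
    return list(image)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(reference, modification, database):
    # Both sides of the comparison are synthetic, so the domain gap cancels out.
    query = embed(generate_mental_image(reference, modification))
    synthetic_db = [embed(generate_synthetic_counterpart(img)) for img in database]
    scores = [cosine(query, e) for e in synthetic_db]
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranking, scores

database = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reference = [1.0, 0.0, 0.0]       # the query's reference image
modification = [-1.0, 1.0, 0.0]   # toy encoding of the modification text
ranking, scores = retrieve(reference, modification, database)
print(ranking[0])  # index of the best-matching database image
```

In this toy setup the mental image coincides with the synthetic counterpart of the second database image, so that image ranks first. Matching the query against the raw database vectors instead would compare across domains, which is the mismatch the synthetic counterparts are meant to avoid.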
GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Authors: Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng
Venue: ICLR 2026
First: 2025-10-09T05:09:27+00:00 · Latest: 2026-03-12T08:33:55+00:00
Comments: ICLR 2026, 31 pages, 20 figures
Abstract
Recently, the spatial-temporal intelligence of Vision-Language Models (VLMs) has attracted much attention due to its importance for autonomous driving, embodied AI and general AI. Existing spatial-temporal benchmarks mainly focus on egocentric (first-person) perspective reasoning using images/video contexts, or geographic reasoning with graphical context (e.g., maps), and thus fail to assess VLMs' geographic spatial-temporal intelligence that requires integrating both images/video and graphical context, which is crucial for real-world scenarios such as traffic management and emergency response. To address these gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench show that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three major deficiencies of current models for geo-temporal reasoning. (1) VLMs exhibit imbalanced utilization of spatial and temporal context during reasoning. (2) They show weak temporal forecasting ability, leading to poorer performance on temporally focused tasks. (3) They lack the capability to effectively align and integrate map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
中文标题/摘要
标题:GTR-Bench:评估视觉-语言模型的地理时空推理能力
近年来,视觉-语言模型(VLMs)的时空智能引起了广泛关注,这对于自动驾驶、具身人工智能和通用人工智能至关重要。现有的时空基准主要集中在使用图像/视频上下文进行以自我为中心(第一人称)视角的推理,或使用图形上下文(例如,地图)进行地理推理,因此无法评估VLMs所需的地理时空智能,这需要结合图像/视频和图形上下文,这对于交通管理和应急响应等现实场景至关重要。为了解决这些差距,我们引入了Geo-Temporal Reasoning基准(GTR-Bench),这是一个新的挑战,用于大规模摄像网络中移动目标的地理时空推理。GTR-Bench更具挑战性,因为它需要在地图和视频之间进行多次视角切换,跨多个具有非重叠视场的视频进行联合推理,并对任何视频上下文都无法观察到的时空区域进行推理。对超过10个流行的VLMs在GTR-Bench上的评估显示,即使是最先进的专有模型Gemini-2.5-Pro(34.9%),在地理时空推理方面也远远落后于人类表现(78.61%)。此外,我们对GTR-Bench的全面分析揭示了当前模型在地理时空推理方面的三大缺陷。(1)VLMs在推理过程中对空间和时间上下文的利用不平衡。(2)它们在时间预测方面表现较弱,导致在时间导向任务上的表现较差。(3)它们缺乏有效对齐和整合地图数据与多视角视频输入的能力。我们相信GTR-Bench提供了宝贵的见解,并为时空智能的研究和应用开辟了新的机会。基准和代码将在https://github.com/X-Luffy/GTR-Bench发布。
Summary / 总结
GTR-Bench evaluates the geo-temporal reasoning capabilities of VLMs by introducing a new benchmark for geographic temporal reasoning in a large-scale camera network. It addresses the limitations of existing benchmarks that focus on egocentric or geographic reasoning alone. The evaluation of more than 10 VLMs shows that even the best proprietary model, Gemini-2.5-Pro (34.9%), falls far short of human performance (78.61%). The analysis reveals three major deficiencies: imbalanced use of spatial and temporal context, weak temporal forecasting ability, and poor alignment of map data with multi-view video inputs.
GTR-Bench 评估了 VLM 的地理时空推理能力,解决了现有主要关注第一人称或地理推理的基准的局限性。该基准引入了一个新的挑战,涉及多视角切换和跨非重叠视频视场的联合推理。对超过 10 种流行 VLM 的评估显示,即使是最佳模型 Gemini-2.5-Pro 的表现也远低于人类水平。分析指出三个主要缺陷:时空上下文利用不平衡、时间预测能力弱以及难以有效整合地图数据与多视角视频输入。
BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder
Authors: Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi
First: 2026-03-12T08:32:19+00:00 · Latest: 2026-03-12T08:32:19+00:00
Comments: 17 pages, 10 figures, 6 tables
Abstract
Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor sample detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
中文标题/摘要
标题:BackdoorIDS:预训练视觉编码器的零样本后门检测
自我监督和多模态视觉编码器学习强大的视觉表示,广泛应用于下游视觉任务和大型视觉-语言模型(LVLMs)。然而,下游用户经常依赖来源不明的第三方预训练编码器,使其面临后门攻击的风险。在本工作中,我们提出了一种名为BackdoorIDS的简单而有效的零样本、推理时后门样本检测方法,用于预训练视觉编码器。BackdoorIDS受到两个观察结果的启发:注意力劫持和恢复。在渐进输入遮罩下,后门图像最初将注意力集中在恶意触发特征上。一旦遮罩比例超过触发的鲁棒性阈值,触发器被禁用,注意力迅速转向良性内容。这种转变导致图像嵌入产生显著变化,而干净图像的嵌入则在遮罩过程中更平滑地演变。BackdoorIDS通过沿遮罩轨迹提取嵌入序列并应用基于密度的聚类(如DBSCAN)来实现这一信号。如果输入的嵌入序列形成多个聚类,则将其标记为后门。大量实验表明,BackdoorIDS在各种攻击类型、数据集和模型家族中始终优于现有防御措施。值得注意的是,它是一种即插即用的方法,无需重新训练,并在推理时完全零样本运行,使其与各种编码器架构兼容,包括CNN、ViT、CLIP和LLaVA-1.5。
Summary / 总结
BackdoorIDS is a zero-shot backdoor detection method for pretrained vision encoders, motivated by the observations of Attention Hijacking and Restoration. It detects backdoored inputs by tracking how the image embedding changes under progressive input masking and applying density-based clustering to the resulting embedding sequence. Experiments show that BackdoorIDS outperforms existing defenses across various attack types, datasets, and model families. It requires no retraining and operates fully zero-shot at inference time.
BackdoorIDS 是一种无需重新训练即可检测预训练视觉编码器后门攻击的零样本方法,通过利用注意力劫持和恢复来识别后门样本。通过在渐进输入遮罩过程中监控图像嵌入的变化,BackdoorIDS 可在不重新训练的情况下检测后门样本。实验表明,它在各种攻击类型、数据集和模型家族中表现出色,是一种易于集成的保护预训练编码器免受后门攻击的解决方案。
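The decision rule in the BackdoorIDS abstract (cluster the embedding sequence along the masking trajectory, flag the input if more than one cluster forms) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code. The embedding sequences are toy 2-D points rather than outputs of a real encoder on progressively masked images. The small `dbscan` function is a textbook re-implementation written for this example. The `eps` and `min_samples` values, and the choice to ignore noise points when counting clusters, are assumptions.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal textbook DBSCAN. Returns one label per point (-1 means noise)."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_samples:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue             # already assigned; do not expand again
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_samples:
                queue.extend(nb_j)   # core point: keep growing the cluster
    return labels

def is_backdoored(embedding_sequence, eps=0.5, min_samples=2):
    # Flag the input if its embeddings split into more than one cluster.
    labels = dbscan(embedding_sequence, eps, min_samples)
    return len({label for label in labels if label != -1}) > 1

# Toy trajectories: one embedding per masking ratio, in increasing order.
clean_seq = [(0.1 * k, 0.0) for k in range(8)]            # smooth drift
backdoor_seq = ([(0.05 * k, 0.0) for k in range(4)]       # trigger active
                + [(5.0 + 0.05 * k, 5.0) for k in range(4)])  # trigger masked out

clean_flag = is_backdoored(clean_seq)
backdoor_flag = is_backdoored(backdoor_seq)
print(clean_flag, backdoor_flag)
```

The clean trajectory drifts in small steps and stays one cluster, while the abrupt jump in the second trajectory splits it into two. In practice the threshold `eps` would have to be chosen relative to the scale of the encoder's embedding space.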
Partially Recentralization Softmax Loss for Vision-Language Models Robustness
Authors: Hao Wang, Jinzhe Jiang, Xin Zhang, Chen Li
First: 2024-02-06T01:44:38+00:00 · Latest: 2026-03-12T08:30:30+00:00
Comments: The study described in Section 4 was conducted without required institutional review board approval. The paper is withdrawn pending completion of the approval process
Abstract
As Large Language Models achieve breakthroughs in natural language processing (NLP) tasks, multimodal techniques have become extremely popular. However, it has been shown that multimodal NLP models are vulnerable to adversarial attacks, where the outputs of a model can be dramatically changed by a perturbation to the input. While several defense techniques have been proposed for both computer vision and NLP models, the multimodal robustness of models has not been fully explored. In this paper, we study the adversarial robustness provided by modifying the loss function of pre-trained multimodal models, by restricting the top K softmax outputs. Based on the evaluation and scoring, our experiments show that after fine-tuning, the adversarial robustness of pre-trained models can be significantly improved against popular attacks. Further research is needed on aspects such as output diversity, generalization, and the robustness-performance trade-off of this kind of loss function. Our code will be available after this paper is accepted.
中文标题/摘要
标题:部分重新中央化softmax损失对视觉-语言模型鲁棒性的研究
随着大型语言模型在自然语言处理任务(NLP)中取得突破,多模态技术变得极其流行。然而,已经表明,多模态NLP模型对对抗攻击非常脆弱,输入的微小扰动可以使模型的输出发生巨大变化。虽然在计算机视觉和NLP模型中已经提出了多种防御技术,但多模态模型的鲁棒性尚未得到充分探索。在本文中,我们通过限制softmax输出的前K项来研究修改预训练多模态模型损失函数提供的对抗鲁棒性。根据评估和评分,我们的实验表明,在微调后,预训练模型的对抗鲁棒性可以显著提高,对抗流行的攻击。进一步的研究应该包括输出多样性、泛化以及这种损失函数的鲁棒性-性能权衡。我们的代码将在论文被接受后提供
Summary / 总结
This paper aims to enhance the adversarial robustness of pre-trained vision-language models by modifying the loss function. The method involves partially recentralizing the softmax outputs to restrict the top K outputs. Experiments show that fine-tuning with this approach significantly improves the models' resistance to popular adversarial attacks. Further research is needed to explore output diversity, generalization, and the trade-off between robustness and performance.
本文通过提出部分重新中心化softmax损失函数来应对多模态模型对对抗攻击的脆弱性。该方法在微调过程中限制softmax输出的前K项,以增强对抗鲁棒性。实验表明,这种方法显著提高了预训练模型对各种攻击的鲁棒性,但仍需进一步研究输出多样性以及鲁棒性与性能之间的权衡。
Generalizing Vision-Language Models with Dedicated Prompt Guidance
Authors: Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li
First: 2025-12-02T05:06:17+00:00 · Latest: 2026-03-12T08:24:09+00:00
Comments: Accepted to AAAI26
Abstract
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
中文标题/摘要
标题:视觉-语言模型的专用提示指导泛化
对大规模预训练视觉-语言模型(VLMs)进行微调已成为下游适应的主流范式,但面临着领域特异性与领域泛化(DG)能力之间的关键权衡。当前方法通常将通用模型在完整数据集上进行微调,这可能会损害其对未见领域的泛化能力。为填补这一空白,我们提供了视觉语言模型微调泛化能力的理论理解,揭示了在分区源领域上训练多个参数高效的专家模型比在通用模型上进行微调具有更好的泛化能力。受此发现的启发,我们提出了一种两步领域专家指导泛化(GuiDG)框架。GuiDG 首先使用提示调优获得源领域专家,然后通过自适应专家集成引导视觉编码器的微调,引入跨模态注意力模块。为了更好地评估少量样本的泛化能力,我们从ImageNet及其变体构建了ImageNet-DG。在标准泛化基准和ImageNet-DG上的广泛实验表明,GuiDG 在保持效率的同时优于最先进的微调方法。
Summary / 总结
The paper addresses the challenge of domain generalization in fine-tuning large pretrained vision-language models, proposing a two-step framework called GuiDG. This framework first uses prompt tuning to create domain-specific experts and then guides the fine-tuning of the vision encoder through adaptive expert integration. Experiments show that GuiDG outperforms existing methods on standard domain generalization benchmarks and a newly constructed ImageNet-DG dataset while maintaining efficiency.
论文针对大规模预训练视觉-语言模型在下游适应中的域泛化问题,提出了一种两步框架GuiDG。GuiDG首先通过提示调优生成域特定专家,然后通过跨模态注意力模块指导视觉编码器的微调。实验表明,GuiDG在标准域泛化基准和新构建的ImageNet-DG数据集上优于现有微调方法,同时保持高效。
MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
Authors: Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, Hongbin Zha
First: 2026-03-12T07:53:35+00:00 · Latest: 2026-03-12T07:53:35+00:00
Abstract
Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies, attention-entropy weighting and visibility weighting, that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.
中文标题/摘要
标题:MV-SAM3D:适应性多视角融合以实现布局感知的3D生成
近期统一的3D生成模型在从单张图像生成高质量3D资产方面取得了显著进展。特别是,布局感知的方法如SAM3D可以在保持多个物体的空间布局的同时重建这些物体,为场景级别的3D生成打开了大门。然而,当前的方法仅限于单视角输入,无法利用互补的多视角观测,而独立估计的物体姿态往往会导致物理上不可行的布局,如穿插和漂浮的伪影。 我们提出了MV-SAM3D,这是一种无需训练的框架,它通过多视角一致性与物理可行性扩展了布局感知的3D生成。我们将多视角融合形式化为3D潜在空间中的多扩散过程,并提出了两种自适应加权策略——注意力-熵加权和可见性加权,以实现置信度感知的融合,确保每个视角根据其局部观测可靠性贡献。对于多物体组合,我们引入了物理感知优化,该优化在生成过程中和生成后注入碰撞和接触约束,从而产生物理上可行的物体布局。在标准基准和真实世界的多物体场景上的实验表明,在无需额外训练的情况下,重建精度和布局的物理可行性都有了显著提高。代码可在https://github.com/devinli123/MV-SAM3D 获取。
Summary / 总结
MV-SAM3D is a training-free framework that enhances layout-aware 3D generation by incorporating multi-view consistency and physical plausibility. It uses a Multi-Diffusion process in 3D latent space and two adaptive weighting strategies to ensure each viewpoint contributes based on its local observation reliability. Additionally, it introduces physics-aware optimization to inject collision and contact constraints, resulting in physically plausible object arrangements. Experiments show significant improvements in reconstruction fidelity and layout plausibility without additional training.
MV-SAM3D 是一个无需训练的框架,通过引入多视图一致性和物理合理性来增强布局感知的 3D 生成。它使用 3D 潜空间中的多扩散过程,并采用两种自适应加权策略来融合多视图观测,确保每个视点根据其可靠性贡献。此外,它引入了物理感知优化,注入碰撞和接触约束,从而实现物理合理的物体布局。实验结果显示,在重建保真度和布局合理性方面取得了显著改进,无需额外训练。
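The confidence-aware fusion idea in the MV-SAM3D abstract can be illustrated with one fusion step over per-view latents. The abstract does not give the weighting formulas, so everything concrete here is an assumption made for illustration: the entropy weight is taken as `exp(-entropy)` of a per-location attention distribution, the visibility weight is a score in [0, 1], the two are multiplied, and the result is normalized across views. The function names and toy data are hypothetical.

```python
import math

def entropy(p):
    # Shannon entropy of a discrete distribution (natural log).
    return -sum(x * math.log(x) for x in p if x > 0)

def fuse_views(latents, attention, visibility):
    """Fuse per-view latents location by location.

    latents[v][i]    : latent value predicted by view v at location i
    attention[v][i]  : attention distribution of view v at location i (sums to 1)
    visibility[v][i] : visibility score of location i from view v, in [0, 1]
    """
    n_views, n_loc = len(latents), len(latents[0])
    fused = []
    for i in range(n_loc):
        # Assumed weighting: peaked attention (low entropy) and high visibility
        # both raise a view's confidence at this location.
        w = [math.exp(-entropy(attention[v][i])) * visibility[v][i]
             for v in range(n_views)]
        total = sum(w)
        if total == 0.0:
            w, total = [1.0] * n_views, float(n_views)  # fall back to a plain average
        fused.append(sum(w[v] * latents[v][i] for v in range(n_views)) / total)
    return fused

latents = [[1.0, 1.0], [3.0, 3.0]]                       # two views, two locations
attention = [[[1.0, 0.0], [0.5, 0.5]],                   # view 0: peaked at location 0
             [[0.5, 0.5], [0.5, 0.5]]]                   # view 1: uniform everywhere
visibility = [[1.0, 0.0],                                # view 0 cannot see location 1
              [1.0, 1.0]]
fused = fuse_views(latents, attention, visibility)
print(fused)
```

At location 0 the view with peaked attention gets twice the weight of the uniform one, and at location 1 the occluded view is ignored entirely. In a Multi-Diffusion style process, a step like this would be repeated at every denoising iteration rather than applied once.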
VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought
Authors: Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim
First: 2026-03-12T07:47:46+00:00 · Latest: 2026-03-12T07:47:46+00:00
Comments: 30 pages, 21 figures, EACL 2026 Findings
Abstract
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
中文标题/摘要
标题:VisDoT : 通过人类似解题分解和视觉感知接地增强视觉推理
大型视觉-语言模型(LVLMs)在检测图表中的视觉原语并将其与语义表示对齐方面难以可靠地进行,这严重限制了它们在复杂视觉推理方面的性能。这种感知接地的缺乏构成了基于图表推理的主要瓶颈。我们提出了VisDoT框架,通过人类似解题接地来增强视觉推理。我们基于图形感知理论形式化了四个感知任务,包括位置和长度。在此基础上,我们引入了解题分解(DoT)提示,该提示将问题按顺序分解为视觉感知子问题和逻辑子问题。使用VisDoT微调InternVL在ChartQA上实现了11.2%的改进,并在更具挑战性的ChartQAPro基准上超过了GPT-4o。在新引入的VisDoTQA基准上,模型提高了33.2%。此外,一致的零样本增益在各种开放域VQA基准上证实了感知-逻辑分离策略在视觉问答中的普适性。VisDoT利用人类似感知来增强视觉接地,实现了最先进的图表理解和可解释的视觉推理。
Summary / 总结
The paper addresses the challenge of LVLMs in reliably detecting visual primitives in charts and aligning them with semantic representations, which limits their performance on complex visual reasoning. VisDoT is proposed to enhance visual reasoning through human-like interpretation grounding and Decomposition-of-Thought (DoT) prompting, which separates questions into visual perception and logic sub-questions. Fine-tuning InternVL with VisDoT improves performance on ChartQA by 11.2% and surpasses GPT-4o on ChartQAPro, with a +33.2% improvement on the VisDoTQA benchmark. The study also shows consistent zero-shot gains on various open-domain VQA benchmarks, confirming the strategy's generalizability.
论文解决了大型视觉-语言模型在检测图表中的视觉基本元素并将其与语义表示对齐时可靠性不足的问题,这限制了它们在复杂视觉推理中的性能。它提出了VisDoT框架,通过类似人类的感知接地和分解思维(DoT)提示来增强视觉推理。VisDoT将问题按顺序分解为视觉感知和逻辑子问题,导致在ChartQA上的改进达到+11.2%,并在ChartQAPro基准上超越了GPT-4o。在新引入的VisDoTQA基准上,模型的改进达到+33.2%,并且在各种VQA基准上的零样本增益表明该策略的通用性。
StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
Authors: Boyu He, Yunfan Ye, Chang Liu, Weishang Wu, Fang Liu, Zhiping Cai
First: 2026-03-11T03:05:02+00:00 · Latest: 2026-03-12T07:44:05+00:00
Comments: 18 pages, 23 figures, Conference on Computer Vision and Pattern Recognition 2026
Abstract
Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.
中文标题/摘要
标题:StyleGallery:无需训练且具有语义意识的个性化风格迁移,从任意图像参考中
尽管基于扩散的图像风格迁移取得了进展,但现有方法通常受限于1)语义差距:风格参考可能缺少适当的内容语义,导致不可控的风格化;2)依赖额外约束(例如,语义掩码)限制了适用性;3)刚性特征关联缺乏适应性的全局-局部对齐,无法平衡精细风格化和全局内容保留。这些限制,尤其是无法灵活利用风格输入,从根本上限制了风格迁移在个性化、准确性和适应性方面的表现。为了解决这些问题,我们提出了StyleGallery,这是一种无需训练且具有语义意识的框架,支持任意参考图像作为输入,并能够实现有效的个性化定制。它包括三个核心阶段:语义区域分割(在潜在扩散特征上进行自适应聚类,无需额外输入以划分区域);聚类区域匹配(在提取的特征上进行块过滤,以实现精确对齐);以及风格迁移优化(基于能量函数的扩散采样与区域风格损失相结合,以优化风格化)。在我们引入的基准测试上进行的实验表明,StyleGallery在内容结构保留、区域风格化、可解释性和个性化定制方面优于现有最先进的方法,特别是在利用多个风格参考时。
Summary / 总结
The paper addresses limitations in existing image style transfer methods, such as semantic gaps and reliance on extra constraints, by proposing StyleGallery, a training-free and semantic-aware framework. It uses semantic region segmentation, clustered region matching, and style transfer optimization to enable effective personalized customization with arbitrary reference images. Experiments show that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization.
论文针对现有基于扩散的图像风格迁移方法存在的语义差距和需要额外约束的问题,提出了一个无需训练且语义感知的框架StyleGallery,可以使用任意参考图像进行个性化风格转移。该框架包含三个阶段:语义区域分割、聚类区域匹配和风格转移优化。实验结果显示,StyleGallery在保留内容结构、区域风格化、可解释性和个性化定制方面优于现有最先进的方法。