arXiv 论文速递

2026-03-15 03:41
Snapshot: 20260315_0341
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
Authors: Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata
First: 2026-03-12T17:59:48+00:00 · Latest: 2026-03-12T17:59:48+00:00
Comments: Preprint
Abstract
Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at https://github.com/ExplainableML/LCS.
中文标题/摘要
标题:潜在颜色子空间:高维混沌中的涌现秩序
文本到图像生成模型取得了快速进展,但对生成图像实现精细控制仍然困难,主要原因是对语义信息如何编码的理解有限。我们对 FLUX.1 [Dev] 变分自编码器潜在空间中的颜色表示提出了一种解释,揭示出一种反映色调、饱和度和亮度的结构。我们通过证明潜在颜色子空间(LCS)既能预测又能显式控制颜色来验证这一解释,并在 FLUX 中引入了一种完全无需训练、仅基于闭式潜在空间操作的方法。代码可在 https://github.com/ExplainableML/LCS 获取。
Summary / 总结
The research aims to improve fine-grained control over text-to-image generation by understanding how color is encoded in the Variational Autoencoder latent space of FLUX.1 [Dev]. The authors identify a structure in this color representation, termed the Latent Color Subspace (LCS), which reflects Hue, Saturation, and Lightness. They validate the interpretation by showing that the LCS can both predict and explicitly control color in FLUX through a training-free method based solely on closed-form latent-space manipulation.
研究旨在通过理解FLUX.1的潜空间中语义信息的编码方式,提高文本生成图像的精细控制。研究揭示了潜空间中的颜色表示结构,解释为反映色调、饱和度和亮度。这一解释使得可以通过基于闭形式潜空间操作的无训练方法,同时预测和控制生成图像中的颜色。
BiGain: Unified Token Compression for Joint Generation and Classification
Authors: Jiacheng Liu, Shengkun Tang, Jiacheng Cui, Dongkuan Xu, Zhiqiang Shen
Venue: CVPR 2026
First: 2026-03-12T17:55:53+00:00 · Latest: 2026-03-12T17:55:53+00:00
Comments: CVPR 2026. Code: https://github.com/Greenoso/BiGain
Abstract
Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain reflects this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interextrapolation between nearest and average pooling while keeping queries intact, thereby conserving attention precision. Across DiT- and U-Net-based backbones and ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification, while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K, with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%). Our analyses indicate that balanced spectral retention, preserving high-frequency detail and low/mid-frequency semantics, is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
中文标题/摘要
标题:BiGain:统一的标记压缩方法以实现联合生成和分类
扩散模型的加速方法(例如标记合并或下采样)通常在减少计算量的情况下优化合成质量,但往往忽视了判别能力。我们重新审视了联合目标下的标记压缩,并提出了BiGain,这是一种无需训练、即插即用的框架,可以在加速的扩散模型中保持生成质量的同时提高分类能力。我们的核心见解是频率分离:将特征空间信号映射到频率感知的表示中,可以将精细细节与全局语义分离,从而实现既尊重生成保真度又兼顾判别效用的压缩。BiGain 通过两种频率感知的操作体现了这一原则:(1)拉普拉斯门控标记合并,鼓励在频谱平滑的标记之间进行合并,同时抑制高对比度标记的合并,从而保留边缘和纹理;(2)插值-外推 KV 下采样,通过在最近邻池化和平均池化之间进行可控的插值-外推来对键/值下采样,同时保持查询不变,从而保持注意力精度。在基于 DiT 和 U-Net 的骨干网络以及 ImageNet-1K、ImageNet-100、Oxford-IIIT Pets 和 COCO-2017 上,我们的操作在基于扩散的分类中始终改善了速度-准确性的权衡,同时在相似的加速条件下保持或提高了生成质量。例如,在 ImageNet-1K 上,在 Stable Diffusion 2.0 上使用 70% 的标记合并时,BiGain 将分类准确率提高了 7.15%,同时将 FID 改善了 0.34(1.85%)。我们的分析表明,平衡的频谱保留,即同时保留高频细节和低/中频语义,是扩散模型中标记压缩的可靠设计规则。据我们所知,BiGain 是第一个在加速扩散中同时研究和推进生成和分类的框架,支持低成本部署。
Summary / 总结
BiGain is a training-free, plug-and-play framework that improves the speed-accuracy trade-off of diffusion-based classification while maintaining or enhancing generation quality under comparable acceleration. It uses two frequency-aware operators, Laplacian-gated token merging and Interpolate-Extrapolate KV Downsampling, to respect both generative fidelity and discriminative utility. For example, on ImageNet-1K with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% while improving FID by 0.34 (1.85%).
BiGain 是一个无需训练的框架,通过保留生成质量同时增强分类来改善扩散模型的速度-准确率权衡。它使用频率分离来分离精细细节和全局语义,并采用拉普拉斯门控的标记合并和插值-外推 KV 下采样。实验表明,BiGain 在各种数据集上一致地提高了分类准确率,同时保持或增强了生成质量,例如在 ImageNet-1K 上通过 70% 的标记合并将分类准确率提高了 7.15%,并将 FID 改善了 0.34 (1.85%)。
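The "controllable interextrapolation between nearest and average pooling" described in the BiGain abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `interp_extrap_pool`, the parameter `alpha`, the 1D windowing, and the choice of the first token in each window as the "nearest" sample are all assumptions made for illustration.

```python
from typing import List


def interp_extrap_pool(tokens: List[List[float]], window: int, alpha: float) -> List[List[float]]:
    """Downsample a token sequence by blending nearest and average pooling.

    Hypothetical parameterization (an assumption, not taken from the paper):
      alpha = 0            -> nearest pooling (first token of each window)
      alpha = 1            -> average pooling
      alpha outside [0, 1] -> extrapolation beyond either endpoint
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        nearest = group[0]
        dim = len(nearest)
        # Per-dimension mean over the tokens in this window.
        average = [sum(tok[d] for tok in group) / len(group) for d in range(dim)]
        # Linear blend: nearest + alpha * (average - nearest).
        pooled.append([n + alpha * (a - n) for n, a in zip(nearest, average)])
    return pooled


# Toy keys: four tokens with two features each, pooled in windows of two.
keys = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0], [8.0, 4.0]]
nearest_only = interp_extrap_pool(keys, window=2, alpha=0.0)
average_only = interp_extrap_pool(keys, window=2, alpha=1.0)
extrapolated = interp_extrap_pool(keys, window=2, alpha=1.5)
```

In this sketch only keys and values would be passed through the pooling function; queries are left at full length, matching the abstract's statement that queries are kept intact.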
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation
Authors: Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng
First: 2026-03-12T17:55:07+00:00 · Latest: 2026-03-12T17:55:07+00:00
Comments: Code: https://github.com/ROUJINN/SceneAssistant
Abstract
Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages a modern 3D object generation model along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results demonstrate that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant.
中文标题/摘要
标题:SceneAssistant:一种用于开放词汇3D场景生成的视觉反馈代理
根据自然语言进行文本到3D场景生成,对于数字内容创作具有很高的价值。然而,现有方法大多局限于特定领域或依赖预定义的空间关系,限制了它们在不受限制、开放词汇3D场景合成方面的能力。在本文中,我们介绍了SceneAssistant,一种基于视觉反馈的代理,用于开放词汇3D场景生成。我们的框架利用了现代3D对象生成模型以及视觉语言模型(VLM)的空间推理和规划能力。为了实现开放词汇场景组合,我们为VLM提供了全面的原子操作集(例如,缩放、旋转、聚焦)。在每次交互步骤中,VLM接收渲染的视觉反馈并相应地采取行动,逐步细化场景,以实现更连贯的空间布局并更好地与输入文本对齐。实验结果表明,我们的方法可以生成多样、开放词汇且高质量的3D场景。定性分析和定量的人工评估都证明了我们的方法优于现有方法。此外,我们的方法允许用户通过自然语言命令指示代理编辑现有场景。我们的代码可在 https://github.com/ROUJINN/SceneAssistant 获取。
Summary / 总结
SceneAssistant is a visual-feedback-driven agent for open-vocabulary 3D scene generation, using a combination of 3D object generation models and Vision-Language Models (VLMs) with atomic operations like Scale and Rotate. The VLMs receive visual feedback and iteratively refine the scene, leading to more coherent and high-quality 3D scenes compared to existing methods. Qualitative and quantitative evaluations show its superiority in generating diverse and open-vocabulary 3D scenes. Users can also instruct the agent to edit existing scenes based on natural language commands.
SceneAssistant 是一个基于视觉反馈的开放词汇3D场景生成代理,结合了3D对象生成模型和Vision-Language模型(VLMs)。它为VLMs提供了如缩放、旋转和聚焦等原子操作,以实现开放词汇场景的组合。代理基于视觉反馈迭代优化场景,生成多样且高质量的3D场景。实验结果表明,SceneAssistant在生成连贯且与输入文本对齐的3D场景方面优于现有方法,还支持使用自然语言命令编辑现有场景。
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
Authors: Yingxin Lai, Zitong Yu, Jun Wang, Linlin Shen, Yong Xu, Xiaochun Cao
First: 2026-03-12T17:30:49+00:00 · Latest: 2026-03-12T17:30:49+00:00
Abstract
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational costs, particularly for high-resolution images and videos. Visual token pruning is a practical acceleration strategy, yet existing methods are largely semantic-driven, retaining salient objects while discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters often reside. To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities indicating transient generative artifacts. The forensic scoring further integrates transport-based novelty with high-frequency priors to separate forensic evidence from semantic content under large-ratio compression. Experiments on deepfake and AIGC benchmarks show that at 10% token retention, ForensicZip achieves a 2.97× speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance.
中文标题/摘要
标题:ForensicZip:更多的标记更好但并非必要——在法医视觉-语言模型中的应用
多模态大型语言模型(MLLMs)通过生成伪造检测的文本解释来实现多媒体的可解释性法医分析。然而,处理密集的视觉序列会带来高昂的计算成本,特别是对于高分辨率的图像和视频。视觉标记剪枝是一种实用的加速策略,但现有方法主要基于语义驱动,保留显著的对象,而丢弃包含伪造痕迹(如高频异常和时间抖动)的背景区域。为了解决这一问题,我们引入了ForensicZip,这是一种无需训练的框架,从伪造驱动的角度重新定义了标记压缩。ForensicZip将时间标记的演变建模为具有松弛虚拟节点的出生-死亡最优传输问题,量化物理不连续性以指示瞬态生成伪影。法医评分进一步将传输基础的新颖性与高频先验相结合,在大比例压缩下分离法医证据和语义内容。在深度伪造和AIGC基准测试中,即使在保留10%的标记下,ForensicZip也实现了2.97倍的加速和超过90%的FLOPs减少,同时保持了最先进的检测性能。
Summary / 总结
The research aims to improve the efficiency of forensic vision-language models by addressing the high computational costs associated with processing dense visual sequences. ForensicZip, a training-free framework, reformulates token compression from a forgery-driven perspective, focusing on quantifying physical discontinuities to detect transient generative artifacts. Experiments show that at 10% token retention, ForensicZip achieves a 2.97x speedup and over 90% FLOPs reduction while maintaining state-of-the-art detection performance on deepfake and AIGC benchmarks.
这项工作的动机是解决在处理高分辨率图像和视频中的密集视觉序列时,取证视觉语言模型的高计算成本问题。主要方法是引入ForensicZip,这是一种无需训练的框架,从伪造驱动的角度重新定义了标记压缩,重点关注量化物理不连续性以检测瞬态生成的伪迹。关键实验发现表明,在10%的标记保留率下,ForensicZip实现了2.97倍的加速和超过90%的FLOPs减少,同时保持了最先进的检测性能。
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
First: 2026-03-12T17:27:21+00:00 · Latest: 2026-03-12T17:27:21+00:00
Abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
中文标题/摘要
标题:IndexCache:通过跨层索引重用加速稀疏注意力
长上下文代理工作流已成为大型语言模型的关键使用案例,使得注意力效率对于推理速度和服务成本至关重要。稀疏注意力有效应对了这一挑战,DeepSeek 稀疏注意力(DSA)是一个代表性的生产级解决方案:一个轻量级的闪电索引器为每个查询选择最相关的 top-k 个令牌,将核心注意力从 $O(L^2)$ 降低到 $O(Lk)$。然而,索引器本身仍保持 $O(L^2)$ 复杂度,并且必须在每一层独立运行,尽管连续层得到的 top-k 选择高度相似。我们提出了 IndexCache,它将层划分为少量运行自身索引器的 Full 层和大多数直接重用最近 Full 层 top-k 索引的 Shared 层,从而利用这种跨层冗余。我们提出了两种互补的方法来确定和优化这种配置。无需训练的 IndexCache 使用贪婪搜索算法,通过直接在校准集上最小化语言建模损失来选择保留索引器的层,无需权重更新。基于训练的 IndexCache 引入了一种多层蒸馏损失,用每个保留的索引器所服务的所有层的平均注意力分布来训练该索引器,使得即使简单的交错模式也能达到全索引器的准确性。在 30B DSA 模型上的实验结果显示,IndexCache 可以去除 75% 的索引器计算,质量下降可以忽略不计,相比标准 DSA 实现了高达 1.82$\times$ 的预填充加速和 1.48$\times$ 的解码加速。我们在生产规模 GLM-5 模型上的初步实验进一步证实了这些积极结果(图 1)。
Summary / 总结
IndexCache accelerates sparse attention by reusing top-k indices across layers: a small set of Full layers run their own indexers, and the remaining Shared layers reuse the nearest Full layer's selections, removing 75% of indexer computations with negligible quality degradation. It offers two approaches: a training-free greedy search that chooses which layers keep their indexers by minimizing language modeling loss on a calibration set, and a training-aware variant that trains each retained indexer with a multi-layer distillation loss. On a 30B DSA model, IndexCache achieves up to 1.82x prefill speedup and 1.48x decode speedup compared to standard DSA.
IndexCache 通过跨层重用索引器来加速稀疏注意力,将索引器计算量减少75%,同时保持较低的质量损失。它使用两种方法:无训练的方案通过最小化语言建模损失来选择保留索引器的层,以及有训练的方案通过多层蒸馏损失来训练保留的索引器。在30B DSA模型上,IndexCache 达到了最多1.82倍的预填充加速和1.48倍的解码加速,相比标准DSA。
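The Full/Shared layer split described in the IndexCache abstract can be sketched as a simple caching loop. This is an illustrative sketch under stated assumptions, not the paper's code: `toy_indexer`, `run_with_index_reuse`, and the choice of the nearest *preceding* Full layer as the reuse source are hypothetical, and the real indexer scores tokens per query rather than once per layer.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple


def topk_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def run_with_index_reuse(
    num_layers: int,
    full_layers: Set[int],
    indexer: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[Dict[int, List[int]], int]:
    """Select top-k token indices per layer, running the indexer only on Full layers.

    Shared layers reuse the selection of the nearest preceding Full layer
    (an assumption; the abstract only says "nearest Full layer").
    """
    if 0 not in full_layers:
        raise ValueError("layer 0 must be a Full layer so there is something to reuse")
    selected: Dict[int, List[int]] = {}
    indexer_calls = 0
    cached: List[int] = []
    for layer in range(num_layers):
        if layer in full_layers:
            cached = topk_indices(indexer(layer), k)  # Full layer: recompute indices
            indexer_calls += 1
        selected[layer] = list(cached)  # Shared layer: reuse cached indices
    return selected, indexer_calls


def toy_indexer(layer: int) -> List[int]:
    """Hypothetical stand-in for an indexer: deterministic scores for 6 tokens."""
    return [((layer + 1) * (t + 1)) % 7 for t in range(6)]


selected, calls = run_with_index_reuse(num_layers=8, full_layers={0, 4}, indexer=toy_indexer, k=2)
removed_fraction = 1 - calls / 8
```

With 2 of 8 layers kept as Full layers, the toy configuration skips 75% of indexer calls, the same ratio the abstract reports for the 30B DSA model; which layers to keep is what the paper's greedy search or distillation training would decide.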
BehaviorVLM: Unified Finetuning-Free Behavioral Understanding with Vision-Language Reasoning
Authors: Jingyang Ke, Weihan Li, Amartya Pradhan, Jeffrey Markowitz, Anqi Wu
First: 2026-03-12T17:09:20+00:00 · Latest: 2026-03-12T17:09:20+00:00
Abstract
Understanding freely moving animal behavior is central to neuroscience, where pose estimation and behavioral understanding form the foundation for linking neural activity to natural actions. Yet both tasks still depend heavily on human annotation or unstable unsupervised pipelines, limiting scalability and reproducibility. We present BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding that requires no task-specific finetuning and minimal human labeling by guiding pretrained Vision-Language Models (VLMs) through detailed, explicit, and verifiable reasoning steps. For pose estimation, we leverage quantum-dot-grounded behavioral data and propose a multi-stage pipeline that integrates temporal, spatial, and cross-view reasoning. This design greatly reduces human annotation effort, exposes low-confidence labels through geometric checks such as reprojection error, and produces labels that can later be filtered, corrected, or used to fine-tune downstream pose models. For behavioral understanding, we propose a pipeline that integrates deep embedded clustering for over-segmented behavior discovery, VLM-based per-clip video captioning, and LLM-based reasoning to merge and semantically label behavioral segments. The behavioral pipeline can operate directly from visual information and does not require keypoints to segment behavior. Together, these components enable scalable, interpretable, and label-light analysis of multi-animal behavior.
中文标题/摘要
标题:BehaviorVLM:基于视觉语言推理的无需微调统一行为理解
理解自由移动的动物行为是神经科学的核心,其中姿态估计和行为理解构成了将神经活动与自然动作联系起来的基础。然而,这两个任务仍然严重依赖于人工注释或不稳定的无监督管道,限制了其可扩展性和可重复性。我们提出了BehaviorVLM,这是一种无需特定任务微调且只需少量人工标注的统一视觉语言框架,通过引导预训练的视觉语言模型(VLMs)进行详细的、明确的和可验证的推理步骤来进行姿态估计和行为理解。对于姿态估计,我们利用量子点标注的行为数据,并提出了一种多阶段管道,该管道结合了时间、空间和跨视图推理。这种设计大大减少了人工标注的工作量,通过几何检查如重投影误差暴露了低置信度的标签,并生成了可以稍后过滤、修正或用于下游姿态模型微调的标签。对于行为理解,我们提出了一种管道,该管道结合了深度嵌入聚类以发现过度分割的行为,基于VLM的每段视频字幕生成,以及基于LLM的推理以合并和语义标注行为片段。行为管道可以直接从视觉信息中运行,不需要关键点来分割行为。这些组件共同实现了多动物行为的大规模、可解释和轻标注分析。
Summary / 总结
The paper introduces BehaviorVLM, a unified vision-language framework for pose estimation and behavioral understanding in neuroscience. It leverages pretrained Vision-Language Models (VLMs) and detailed reasoning steps to achieve this without requiring task-specific fine-tuning or extensive human labeling. For pose estimation, it uses a multi-stage pipeline with temporal, spatial, and cross-view reasoning to reduce annotation effort and improve label quality. For behavioral understanding, it integrates deep clustering, VLM-based video captioning, and LLM-based reasoning to discover and label behaviors directly from visual information. The framework significantly reduces human annotation and enhances the scalability and interpretability of behavioral analysis in multi-animal studies.
BehaviorVLM 是一个无需特定任务微调的统一视觉-语言框架,用于姿态估计和行为理解。它利用预训练模型并通过详细的推理步骤进行引导,并使用多阶段管道进行姿态估计,该管道结合了时间、空间和跨视图推理。对于行为理解,它提出了一种包括深度嵌入聚类、基于VLM的视频字幕生成和基于LLM的推理以合并和标注行为片段的管道。该框架显著减少了人工标注的工作量,并能够实现多动物行为的大规模、可解释和轻标注分析。
GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows
Authors: Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang
First: 2026-03-12T16:53:06+00:00 · Latest: 2026-03-12T16:53:06+00:00
Abstract
Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.
中文标题/摘要
标题:GlyphBanana:通过自主工作流提升精确文本渲染
尽管生成模型的最新进展在文本渲染方面取得了显著进步,但准确生成复杂文本和数学公式仍然是一项艰巨的挑战。这一困难主要源于当前模型在遇到分布外提示时有限的指令遵循能力。为了解决这一问题,我们引入了GlyphBanana,并设计了一个专门用于渲染复杂字符和公式的基准测试。GlyphBanana采用了一种自主工作流,将辅助工具集成到潜在空间和注意力图中,以注入字形模板,促进生成图像的迭代优化。值得注意的是,我们的无训练方法可以无缝应用于各种文本到图像(T2I)模型,与现有基线相比,实现了更高的精度。大量实验表明了我们提出的工作流的有效性。相关代码可在https://github.com/yuriYanZeXuan/GlyphBanana上公开获取。
Summary / 总结
The research aims to improve the precision of text rendering, especially for complex characters and formulas, by addressing the limited instruction-following ability of current generative models on out-of-distribution prompts. GlyphBanana introduces an agentic workflow that uses auxiliary tools to inject glyph templates into the latent space and attention maps, enabling iterative refinement. Experiments show that this training-free approach outperforms existing baselines in generating precise text and formulas. The code is publicly available.
研究旨在解决使用生成模型准确生成复杂文本和数学公式时遇到的难题,这些模型在处理分布外提示时常常表现不佳。研究引入了GlyphBanana,一种结合辅助工具将字形模板注入潜在空间和注意力图中的自主工作流,以实现迭代优化。实验表明,这种无需训练的方法在文本和公式渲染的精确度上优于现有基线。
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction
Authors: Mengfei Duan, Hao Shi, Fei Teng, Guoqiang Zhao, Yuheng Zhang, Zhiyong Li, Kailun Yang
First: 2026-03-12T16:45:42+00:00 · Latest: 2026-03-12T16:45:42+00:00
Comments: The source code will be made publicly available at https://github.com/MengfeiD/O3N
Abstract
Understanding and reconstructing the 3D world through omnidirectional perception is an inevitable trend in the development of autonomous agents and embodied intelligence. However, existing 3D occupancy prediction methods are constrained by limited perspective inputs and predefined training distribution, making them difficult to apply to embodied agents that require comprehensive and safe perception of scenes in open world exploration. To address this, we present O3N, the first purely visual, end-to-end Omnidirectional Open-vocabulary Occupancy predictioN framework. O3N embeds omnidirectional voxels in a polar-spiral topology via the Polar-spiral Mamba (PsM) module, enabling continuous spatial representation and long-range context modeling across 360°. The Occupancy Cost Aggregation (OCA) module introduces a principled mechanism for unifying geometric and semantic supervision within the voxel space, ensuring consistency between the reconstructed geometry and the underlying semantic structure. Moreover, Natural Modality Alignment (NMA) establishes a gradient-free alignment pathway that harmonizes visual features, voxel embeddings, and text semantics, forming a consistent "pixel-voxel-text" representation triad. Extensive experiments on multiple models demonstrate that our method not only achieves state-of-the-art performance on QuadOcc and Human360Occ benchmarks but also exhibits remarkable cross-scene generalization and semantic scalability, paving the way toward universal 3D world modeling. The source code will be made publicly available at https://github.com/MengfeiD/O3N.
中文标题/摘要
标题:O3N:全方位开放式词汇占用预测
通过全方位感知理解并重建3D世界是自主代理和具身智能发展中不可避免的趋势。然而,现有的3D占用预测方法受限于有限视角输入和预定义的训练分布,难以应用于需要全面和安全场景感知的具身代理。为解决这一问题,我们提出了O3N,这是首个纯视觉、端到端的全方位开放式词汇占用预测框架。O3N通过Polar-spiral Mamba (PsM) 模块嵌入全方位体素,以极螺旋拓扑结构实现连续的空间表示和360°范围内的长程上下文建模。Occupancy Cost Aggregation (OCA) 模块引入了一种原理性的机制,用于在体素空间内统一几何和语义监督,确保重建几何与底层语义结构的一致性。此外,Natural Modality Alignment (NMA) 建立了一种无梯度对齐路径,协调视觉特征、体素嵌入和文本语义,形成一致的“像素-体素-文本”表示三元组。在多个模型上的广泛实验表明,我们的方法不仅在QuadOcc和Human360Occ基准测试中达到了最先进的性能,还展示了出色的跨场景泛化能力和语义可扩展性,为通用3D世界建模铺平了道路。源代码将在https://github.com/MengfeiD/O3N公开。
Summary / 总结
O3N is an end-to-end framework for omnidirectional open-vocabulary occupancy prediction, addressing limitations of existing methods by using a polar-spiral topology and introducing modules like OCA and NMA for better geometric and semantic consistency. Experiments show O3N outperforms previous methods on QuadOcc and Human360Occ benchmarks and demonstrates strong cross-scene generalization and semantic scalability.
O3N 是一个端到端的全景开放词汇占用预测框架,通过使用极螺旋拓扑并引入 OCA 和 NMA 模块来提高几何和语义一致性,解决了现有方法的局限性。实验表明,O3N 在 QuadOcc 和 Human360Occ 基准上优于先前的方法,并且展示了强大的跨场景泛化能力和语义扩展性。
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents
Authors: Rui Shao, Ruize Gao, Bin Xie, Yixing Li, Kaiwen Zhou, Shuai Wang, Weili Guan, Gongwei Chen
Venue: CVPR 2026
First: 2026-03-12T16:40:59+00:00 · Latest: 2026-03-12T16:40:59+00:00
Comments: Accepted by CVPR 2026
Abstract
Graphical user interface (GUI) agents powered by large vision-language models (VLMs) have shown remarkable potential in automating digital tasks, highlighting the need for high-quality trajectory data to support effective agent training. Yet existing trajectory synthesis pipelines often yield agents that fail to generalize beyond simple interactions. We identify this limitation as stemming from the neglect of semantically ambiguous actions, whose meanings are context-dependent, sequentially dependent, or visually ambiguous. Such actions are crucial for real-world robustness but are under-represented and poorly processed in current datasets, leading to semantic misalignment between task instructions and execution. To address these issues, we propose HATS, a Hardness-Aware Trajectory Synthesis framework designed to mitigate the impact of semantic ambiguity. We define hardness as the degree of semantic ambiguity associated with an action and develop two complementary modules: (1) hardness-driven exploration, which guides data collection toward ambiguous yet informative interactions, and (2) alignment-guided refinement, which iteratively validates and repairs instruction-execution alignment. The two modules operate in a closed loop: exploration supplies refinement with challenging trajectories, while refinement feedback updates the hardness signal to guide future exploration. Extensive experiments show that agents trained with HATS consistently outperform state-of-the-art baselines across benchmark GUI environments.
中文标题/摘要
标题:HATS:面向GUI代理的硬度感知轨迹合成
由大规模视觉-语言模型(VLMs)驱动的图形用户界面(GUI)代理在自动化数字任务方面展现了显著潜力,突显了高质量轨迹数据对于有效代理训练的必要性。然而,现有的轨迹合成管道往往生成的代理无法超越简单的交互进行泛化。我们发现这一局限源于对语义含糊动作的忽视,这些动作的意义依赖于上下文、序列或视觉上的含糊性。这些动作对于现实世界的鲁棒性至关重要,但在当前数据集中却严重不足且处理不佳,导致任务指令与执行之间存在语义不匹配。为解决这些问题,我们提出了HATS,一种硬度感知轨迹合成框架,旨在减轻语义含糊性的影响。我们将硬度定义为与动作相关的语义含糊程度,并开发了两个互补模块:(1)硬度驱动探索,引导数据收集向含糊但有信息价值的交互;(2)对齐引导精炼,迭代验证和修复指令执行对齐。两个模块在一个闭环中运行:探索为精炼提供具有挑战性的轨迹,而精炼反馈更新硬度信号以指导未来的探索。广泛的实验表明,使用HATS训练的代理在基准GUI环境中始终优于最先进的基线。
Summary / 总结
The research aims to improve the generalization ability of GUI agents powered by VLMs by addressing the issue of semantic ambiguity in trajectory synthesis. HATS, a Hardness-Aware Trajectory Synthesis framework, is proposed to tackle this problem by defining hardness as the degree of semantic ambiguity and incorporating two modules: hardness-driven exploration and alignment-guided refinement. The framework iteratively collects and refines trajectories to better align task instructions with execution. Experiments demonstrate that agents trained with HATS outperform existing methods in various GUI environments.
研究旨在通过解决轨迹合成中的语义模糊问题,提高GUI代理的泛化能力。提出的HATS框架引入了硬度驱动的探索和对齐引导的精炼,以收集和精炼轨迹,确保任务指令与执行之间的更好对齐。实验表明,使用HATS训练的代理在各种GUI环境中表现优于现有方法。
LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
Authors: Zhaoyang Jiang, Zhizhong Fu, David McAllister, Yunsoo Kim, Honghan Wu
First: 2026-03-12T15:40:59+00:00 · Latest: 2026-03-12T15:40:59+00:00
Abstract
Longitudinal brain MRI is essential for characterizing the progression of neurological diseases such as Alzheimer's disease. However, current deep-learning tools fragment this process: classifiers reduce a scan to a label, volumetric pipelines produce uninterpreted measurements, and vision-language models (VLMs) may generate fluent but potentially hallucinated conclusions. We present LoV3D, a pipeline for training 3D vision-language models, which reads longitudinal T1-weighted brain MRI, produces a region-level anatomical assessment, conducts longitudinal comparison with the prior scan, and finally outputs a three-class diagnosis (Cognitively Normal, Mild Cognitive Impairment, or Dementia) along with a synthesized diagnostic summary. The stepped pipeline grounds the final diagnosis by enforcing label consistency, longitudinal coherence, and biological plausibility, thereby reducing the risk of hallucinations. The training process introduces a clinically-weighted Verifier that scores candidate outputs automatically against normative references derived from standardized volume metrics, driving Direct Preference Optimization without a single human annotation. On a subject-level held-out ADNI test set (479 scans, 258 subjects), LoV3D achieves 93.7% three-class diagnostic accuracy (+34.8% over the no-grounding baseline), 97.2% two-class diagnostic accuracy (+4% over the SOTA), and 82.6% region-level anatomical classification accuracy (+33.1% over VLM baselines). Zero-shot transfer yields 95.4% on MIRIAD (100% Dementia recall) and 82.9% three-class accuracy on AIBL, confirming high generalizability across sites, scanners, and populations. Code is available at https://github.com/Anonymous-TEVC/LoV-3D.
中文标题/摘要
标题:LoV3D:通过区域体积评估在纵向3D脑MRI中接地的认知预后推理
纵向脑MRI对于表征神经退行性疾病(如阿尔茨海默病)的进展至关重要。然而,当前的深度学习工具将此过程分割:分类器将扫描简化为标签,体积管道生成未解释的测量值,而视觉-语言模型(VLMs)可能会生成流畅但可能虚假的结论。我们提出了LoV3D,这是一种用于训练3D视觉-语言模型的管道,该管道读取纵向T1加权脑MRI,生成区域级别的解剖评估,进行与先前扫描的纵向比较,最终输出三类诊断(认知正常、轻度认知障碍或痴呆)以及合成的诊断摘要。分步管道通过强制执行标签一致性、纵向连贯性和生物学可行性来接地最终诊断,从而降低幻觉的风险。训练过程引入了一个临床加权的验证器,自动将候选输出与标准化体积指标衍生的参考值评分,实现直接偏好优化,无需单个人类注释。在ADNI主题级保留测试集(479个扫描,258个受试者)上,LoV3D在三类诊断准确性上达到93.7%(比无接地基线高34.8%),在两类诊断准确性上达到97.2%(比SOTA高4%),在区域级别解剖分类准确性上达到82.6%(比VLM基线高33.1%)。零样本迁移在MIRIAD上达到95.4%(痴呆召回率100%),在AIBL上达到82.9%的三类准确性,证实了其在不同站点、扫描仪和人群中的高泛化能力。代码可在https://github.com/Anonymous-TEVC/LoV-3D获取。
Coarse-Guided Visual Generation via Weighted h-Transform Sampling
Authors: Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen
First: 2026-03-12T15:26:19+00:00 · Latest: 2026-03-12T15:26:19+00:00
Abstract
Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and by restricted generalization due to their reliance on paired data collection. Accordingly, recent training-free works propose to leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require knowing the forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance against synthesis quality. To address these challenges, we propose a novel guidance method using the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
中文标题/摘要
标题:基于加权h-变换采样的粗略引导视觉生成
粗略引导视觉生成是从退化或低保真度的粗略参考中合成精细视觉样本的关键技术,对于各种实际应用至关重要。虽然基于训练的方法很有效,但它们受到高训练成本和配对数据收集限制的内在局限。因此,最近的无训练方法提出利用预训练的扩散模型,并在采样过程中引入引导。然而,这些无训练方法要么需要知道正向(精细到粗略)变换算子,例如双立方下采样,要么难以在引导和合成质量之间取得平衡。为了解决这些挑战,我们提出了一种新颖的引导方法,使用h-变换,这是一种可以在期望条件下约束随机过程(例如采样过程)的工具。具体来说,我们通过在原始微分方程中添加一个漂移函数来修改每个采样时间步的转换概率,这大约会引导生成向理想的精细样本。为了解决不可避免的近似误差,我们引入了一种噪声级别感知的时间表,随着误差增加逐渐减少该项的权重,从而确保引导的遵守和高质量的合成。广泛的实验表明,我们的方法在各种图像和视频生成任务中具有有效性和泛化能力。
Summary / 总结
The paper addresses the challenge of synthesizing fine visual samples from degraded coarse references, which is crucial for various applications. It proposes a novel method using the h-transform to guide the sampling process without requiring paired data or knowledge of the forward transformation. By modifying the transition probability and introducing a noise-level-aware schedule, the method ensures both effective guidance and high-quality synthesis. Experiments show its effectiveness across different tasks.
论文解决了从退化或低保真度的粗略参考生成精细视觉样本的挑战。提出了一种使用h-transform引导采样过程的新方法,在每个采样时间步长中通过添加漂移函数修改转移概率,以引导生成向理想精细样本靠拢。引入了噪声级别感知的调度机制来缓解近似误差,确保生成过程既遵循指导又保持高质量。实验表明该方法在各种图像和视频生成任务中的有效性和通用性。
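The drift modification described in the abstract can be written out as a worked equation. The sketch uses the standard Doob h-transform for an Itô diffusion; the notation is assumed rather than taken from the paper: f is the original drift, g the diffusion coefficient, h(x, t) the probability that the desired condition is eventually met given the current state, \hat{h} an approximation of h, and w(t) a hypothetical weight standing in for the noise-level-aware schedule. The paper's actual approximation and schedule are not given in the abstract.

```latex
% Assumed notation: f = original drift, g = diffusion coefficient, W_t = Wiener process.
% w(t) and \hat{h} are illustrative symbols, not taken from the paper.
\begin{align}
  \text{original sampler:}\quad
    & \mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t \\
  \text{Doob } h\text{-transform:}\quad
    & \mathrm{d}x_t = \bigl[ f(x_t, t) + g(t)^2\,\nabla_{x_t} \log h(x_t, t) \bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}W_t \\
  \text{weighted variant:}\quad
    & \mathrm{d}x_t = \bigl[ f(x_t, t) + w(t)\,g(t)^2\,\nabla_{x_t} \log \hat{h}(x_t, t) \bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}W_t
\end{align}
```

Setting w(t) = 1 with an exact h recovers the conditioned process, while w(t) = 0 falls back to the unguided sampler; a schedule that lowers w(t) where \hat{h} is least reliable matches the abstract's description of de-weighting the term as the approximation error increases.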
Continual Learning with Vision-Language Models via Semantic-Geometry Preservation
Authors: Chiyuan He, Zihuan Qiu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li
First: 2026-03-12T15:25:53+00:00 · Latest: 2026-03-12T15:25:53+00:00
Comments: 14 pages, 11 figures, under review
Abstract
Continual learning of pretrained vision-language models (VLMs) is prone to catastrophic forgetting, yet current approaches adapt to new tasks without explicitly preserving the cross-modal semantic geometry inherited from pretraining and previous stages, allowing new-task supervision to induce geometric distortion. We observe that the most pronounced drift tends to concentrate in vulnerable neighborhoods near the old-new semantic interface, where shared visual patterns are easily re-explained by new textual semantics. To address this under an exemplar-free constraint, we propose Semantic Geometry Preservation for Continual Learning (SeGP-CL). SeGP-CL first probes the drift-prone region by constructing a compact set of adversarial anchors with dual-targeted projected gradient descent (DPGD), which drives selected new-task seeds toward old-class semantics while remaining faithful in raw visual space. During training, we preserve cross-modal structure by anchor-guided cross-modal geometry distillation (ACGD), and stabilize the textual reference frame across tasks via a lightweight text semantic-geometry regularization (TSGR). After training, we estimate anchor-induced raw-space drift to transfer old visual prototypes and perform dual-path inference by fusing cross-modal and visual cues. Extensive experiments on five continual learning benchmarks demonstrate that SeGP-CL consistently improves stability and forward transfer, achieving state-of-the-art performance while better preserving semantic geometry of VLMs.
中文标题/摘要
标题:通过语义-几何保留实现预训练视觉-语言模型的持续学习
预训练视觉-语言模型(VLMs)的持续学习容易发生灾难性遗忘,当前方法在适应新任务时并未显式地保留从预训练和先前阶段继承的跨模态语义几何结构,导致新任务监督会引发几何失真。我们观察到,最显著的漂移往往集中在旧新语义界面附近的脆弱区域,在这些区域中,共享的视觉模式容易被新的文本语义重新解释。为了解决这一问题,在不依赖示例的情况下,我们提出了语义几何保留的持续学习(SeGP-CL)。SeGP-CL 首先通过构建具有双重目标投影梯度下降(DPGD)的紧凑集对抗锚点来探测易漂移区域,这会将选定的新任务种子引导向旧类语义,同时在原始视觉空间中保持忠实。在训练过程中,我们通过锚点引导的跨模态几何蒸馏(ACGD)保留跨模态结构,并通过轻量级文本语义-几何正则化(TSGR)在任务之间稳定文本参考框架。训练后,我们估计锚点引起的原始空间漂移以转移旧视觉原型,并通过融合跨模态和视觉线索进行双路径推理。在五个持续学习基准上的广泛实验表明,SeGP-CL 一致地提高了稳定性和前向迁移,同时更好地保留了 VLMs 的语义几何结构,达到了最先进的性能。
Summary / 总结
The paper addresses the issue of catastrophic forgetting in continual learning of vision-language models (VLMs) by proposing SeGP-CL, which preserves semantic geometry through adversarial anchor construction and cross-modal geometry distillation. Key findings show that SeGP-CL enhances stability and forward transfer, achieving state-of-the-art performance across five benchmarks while better preserving the semantic geometry of VLMs.
论文通过提出SeGP-CL方法,利用对抗锚点构造和跨模态几何蒸馏来解决视觉-语言模型(VLMs)在持续学习中的灾难性遗忘问题。该方法提高了稳定性和前向迁移,实现了五个基准上的最先进性能,同时更好地保留了VLMs的语义几何结构。
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
Authors: Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh, Shuicheng Yan
First: 2026-03-12T15:14:48+00:00 · Latest: 2026-03-12T15:14:48+00:00
Abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately 1.6× to 14.4× higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
中文标题/摘要
标题:慢速-快速推理:基于句内支持稳定性的无训练推理加速
长上下文自回归解码仍然很昂贵,因为每次解码步骤都必须反复处理不断增长的历史记录。我们在解码过程中观察到一个一致的模式:在一个句子内,更广泛地说,在一个短的语义连贯的片段内,主导的注意力支持通常保持相对稳定。受此观察的启发,我们提出了慢速-快速推理(SFI),这是一种无训练的解码框架,将生成过程分解为频繁的低成本快速步骤和偶尔的密集注意力慢速步骤。快速步骤重用紧凑的稀疏记忆以实现高效的解码。慢速步骤在语义边界附近被触发。在慢速步骤中,模型回顾更广泛的上下文,并使用选择器刷新选定的记忆,以供后续快速步骤使用。在评估的不同上下文长度下,SFI 大约提供了 1.6 倍至 14.4 倍的更高解码吞吐量,同时在长上下文和长链推理设置中通常保持与全键值基线相当的质量。由于 SFI 是无训练的,并且可以直接应用于现有的检查点,因此它为减少当前自回归推理模型在长上下文、长展望和代理任务中的推理成本提供了一条实用的道路。
Summary / 总结
The paper proposes Slow-Fast Inference (SFI), a training-free decoding framework that accelerates long-context autoregressive decoding by decoupling the process into frequent fast steps and occasional slow steps. Fast steps reuse a compact sparse memory for efficient decoding, while slow steps, triggered near semantic boundaries, refresh the selected memory. SFI achieves approximately 1.6 to 14.4 times higher decoding throughput while maintaining quality comparable to the full-KV baseline in long-context and long-CoT settings.
论文提出了慢速-快速推理(SFI)框架,通过将解码过程拆分为频繁的快速步骤和偶尔的慢速步骤来加速长上下文自回归解码。快速步骤利用紧凑的稀疏记忆进行高效解码,而慢速步骤在语义边界附近触发,刷新选定的记忆。SFI 在长上下文和长链推理设置中实现了约 1.6 到 14.4 倍的解码吞吐量提升,同时保持与全 KV(full-KV)基线相当的质量。
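The slow/fast alternation described above can be sketched as a small scheduling routine. The details are assumptions made for illustration: the punctuation-based boundary trigger and the top-k `select_support` rule are hypothetical stand-ins for the paper's trigger and Selector, which the abstract does not specify.

```python
def select_support(scores, k):
    """Return the k highest-scoring cache positions, in position order.

    `scores` is a hypothetical per-position attention mass from a dense slow step.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def slow_fast_schedule(tokens, boundary_tokens=(".", "!", "?")):
    """Label each decoding step 'slow' (dense attention) or 'fast' (sparse memory).

    The first step and every step right after a boundary token are slow;
    all other steps reuse the memory chosen at the last slow step.
    """
    labels = []
    refresh = True
    for tok in tokens:
        labels.append("slow" if refresh else "fast")
        refresh = tok in boundary_tokens
    return labels
```

For the token stream `["The", "cat", "sat", ".", "It", "slept", "."]` this yields one slow step at the start of each sentence and fast steps elsewhere, which is where the throughput gain would come from if the attention support is indeed stable within a sentence.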
JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas
Authors: Sandeep Inuganti, Hideaki Kanayama, Kanta Shimizu, Mahdi Chamseddine, Soichiro Yokota, Didier Stricker, Jason Rambach
First: 2026-03-06T11:22:14+00:00 · Latest: 2026-03-12T14:27:22+00:00
Abstract
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
中文标题/摘要
标题:JOPP-3D:联合开放词汇语义分割点云和全景图
跨视觉模态(如3D点云和全景图像)的语义分割仍然是一个具有挑战性的任务,主要是由于标注数据的稀缺性和固定标签模型的有限适应性。在本文中,我们提出了JOPP-3D,这是一种联合利用全景图和点云数据的开放词汇语义分割框架,以实现基于语言的场景理解。我们将RGB-D全景图像转换为其相应的切线视角图像和3D点云,然后使用这些模态来提取和对齐基础的视觉-语言特征。这使得自然语言查询能够在输入的两种模态上生成语义掩码。在斯坦福-2D-3D-s和ToF-360数据集上的实验评估表明,JOPP-3D能够在全景和3D领域生成连贯且语义上有意义的分割。我们提出的方法在开放词汇和封闭词汇的2D和3D语义分割中取得了显著的改进。
Summary / 总结
The research aims to address the challenge of semantic segmentation across 3D point clouds and panoramic images by developing JOPP-3D, an open-vocabulary semantic segmentation framework. It jointly utilizes panoramic and point cloud data to extract and align vision-language features, enabling natural language querying to generate semantic masks. Experiments on Stanford-2D-3D-s and ToF-360 datasets show that JOPP-3D produces coherent and semantically meaningful segmentations, outperforming the state-of-the-art in both open and closed vocabulary 2D and 3D semantic segmentation.
JOPP-3D 是一种用于 3D 点云和全景图像联合开放词汇语义分割的框架,解决了标注数据稀缺和固定标签模型适应性差的挑战。它将 RGB-D 全景图像转换为相应的切线视角图像和 3D 点云,然后提取并对齐视觉-语言特征,以实现自然语言查询进行语义分割。实验在斯坦福-2D-3D-s 和 ToF-360 数据集上表明,JOPP-3D 产生了连贯且语义意义明确的分割结果,并在开放和封闭词汇的 2D 和 3D 语义分割中优于当前最佳方法。
HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
Authors: Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu
First: 2026-03-12T14:25:44+00:00 · Latest: 2026-03-12T14:25:44+00:00
Abstract
The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
中文标题/摘要
标题:HomeSafe-Bench:评估视觉-语言模型在家庭场景中对危险动作检测的安全性
随着实体代理的迅速发展,家用机器人在现实环境中的部署速度加快。然而,与结构化的工业环境不同,家庭空间引入了不可预测的安全风险,系统限制如感知延迟和缺乏常识知识可能导致危险错误。当前的安全评估通常局限于静态图像、文本或一般性危害,未能充分评估这些特定情境下的动态危险动作检测。为弥补这一差距,我们引入了**HomeSafe-Bench**,这是一个具有挑战性的基准,旨在评估视觉-语言模型(VLMs)在家庭场景中的危险动作检测能力。HomeSafe-Bench 通过结合物理模拟和高级视频生成构建,包含六个功能区域的438个多样化案例,并具有精细的多维度注释。除了基准测试,我们还提出了**家庭安全的分层双脑守护(HD-Guard)**,这是一种分层流式架构,用于实时安全监控。HD-Guard 协调一个轻量级的 FastBrain 进行连续的高频筛查,并通过异步的大规模 SlowBrain 进行深度多模态推理,有效平衡推理效率与检测准确性。评估表明,HD-Guard 在延迟和性能之间实现了更优的权衡,而我们的分析指出了当前基于VLM的安全检测中的关键瓶颈。
Summary / 总结
HomeSafe-Bench is a benchmark designed to evaluate Vision-Language Models on detecting unsafe actions in household scenarios, addressing the limitations of current safety evaluations. It uses a hybrid pipeline combining physical simulation and video generation to create 438 diverse cases with fine-grained annotations. The proposed HD-Guard architecture, a hierarchical streaming system, supports real-time safety monitoring by pairing a lightweight FastBrain for continuous screening with an asynchronous large-scale SlowBrain for deep reasoning. Evaluations show that HD-Guard achieves a better trade-off between latency and detection performance, and the analysis highlights critical bottlenecks in current VLM-based safety detection.
HomeSafe-Bench 是一个基准,旨在评估 Vision-Language 模型在家庭场景中检测危险动作的能力,弥补了当前静态安全评估的不足。它通过结合物理模拟和视频生成的混合管道,包含438个具有精细注释的多样化案例。研究提出了 HD-Guard,这是一种分层流式架构,能够在保持推理效率的同时提高检测准确性,实现实时安全监控的优越性能,优于当前基于 VLM 的方法。
SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
Authors: Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li
First: 2025-06-16T17:27:47+00:00 · Latest: 2026-03-12T12:26:23+00:00
Abstract
Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.
中文标题/摘要
标题:SOTA:自适应最优运输在多基础模型零样本分类中的应用
基础模型因其强大的零样本分类能力而在各个领域引起了广泛关注。本文受到两个关键观察的启发:(1)视觉-语言模型(VLMs),如CLIP,往往过度依赖于类别级别的文本先验,难以捕捉细微的视觉线索,而视觉基础模型(VFMs),如DINO,则提供了丰富的区分性视觉特征,但缺乏语义对齐;(2)不同VLMs在不同数据集上的性能差异很大,这归因于预训练的不同。为了解决这些挑战,我们提出了SOTA(自适应最优运输),这是一种无需训练的集成框架,通过学习自适应运输计划来整合多个基础模型(VFMs或VLMs)的输出。值得注意的是,SOTA 是无先验的,并且能够自动平衡模型的贡献。在包括自然图像、医学病理和遥感在内的多个领域进行的广泛实验验证了SOTA的普适性。结果一致表明,它有效地利用了不同基础模型的互补优势,并在单个模型上取得了显著的改进。代码可在以下链接获取:https://github.com/Afleve/self-adaptive-Optimal-Transport.
Summary / 总结
This work proposes SOTA, a training-free ensemble framework for zero-shot classification that integrates outputs from multiple foundation models (VFMs or VLMs) using a self-adaptive transport plan. It is motivated by two observations: VLMs over-rely on class-level textual priors and miss fine-grained visual cues, while VFMs offer discriminative visual features but lack semantic alignment. Experiments across natural images, medical pathology, and remote sensing show that SOTA effectively leverages the complementary strengths of different foundation models, achieving substantial improvements over individual models.
该研究提出了SOTA,一种无需训练的集成框架,通过学习自适应传输计划来整合多个基础模型。受视觉语言模型和视觉基础模型局限性的启发,SOTA 解决了对文本先验的过度依赖和语义对齐不足的问题。在多个领域的实验中,SOTA 有效利用了不同基础模型的互补优势,显著提升了性能。
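For readers unfamiliar with transport plans, the sketch below shows generic entropic optimal transport (Sinkhorn iterations) applied to the averaged predictions of several models. It is not the paper's self-adaptive, prior-free plan: it fixes uniform sample and class marginals, which is exactly the kind of prior SOTA claims to avoid, and `ensemble_cost`, `sinkhorn`, and `predict` are hypothetical helper names.

```python
import math


def ensemble_cost(prob_tables):
    """Average negative log-probability over models.

    Each table is a list of rows (samples) of class probabilities.
    """
    n, m = len(prob_tables[0]), len(prob_tables[0][0])
    return [[-sum(math.log(t[i][j]) for t in prob_tables) / len(prob_tables)
             for j in range(m)] for i in range(n)]


def sinkhorn(cost, row_marginal, col_marginal, reg=1.0, iters=500):
    """Entropic OT: alternately rescale rows and columns of exp(-cost / reg)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [row_marginal[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginal[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def predict(plan):
    """Assign each sample (row) to the class (column) receiving the most mass."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in plan]
```

Because the plan must satisfy the class marginal, a borderline sample can be reassigned to an under-represented class even when the per-sample average slightly favors another one; a method that adapts the marginals would behave differently.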
MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis
Authors: Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, Jianxin Lin
Venue: AAAI 2026
First: 2025-11-27T01:47:43+00:00 · Latest: 2026-03-12T11:42:07+00:00
Comments: AAAI 2026, Medical Chain-of-Thought (CoT), Reinforcement Learning with Verifiable Rewards (RLVR), Multimodal Grounded Reasoning
Abstract
Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5pp across multiple medical VQA benchmarks, validating MedEyes's potential in building trustworthy medical AI systems. Code is available at https://github.com/zhcz328/MedEyes.
中文标题/摘要
标题:MedEyes: 学习医疗渐进诊断的动态视觉聚焦
准确的医疗诊断通常涉及渐进的视觉聚焦和迭代推理,这是临床工作流程中常见的特征。虽然最近的视觉-语言模型通过可验证奖励的强化学习(RLVR)展示了有希望的链式思考(CoT)推理能力,但它们的纯在线策略(on-policy)学习范式往往会强化表面上连贯但临床不准确的推理路径。我们提出MedEyes,这是一种新颖的强化学习框架,通过逐步关注和解释相关的医学图像区域,动态建模临床医生风格的诊断推理。通过结合离策略专家指导,MedEyes将专家的视觉搜索轨迹转化为结构化的外部行为信号,引导模型向临床对齐的视觉推理。我们设计了注视引导推理导航器(GRN),通过双模式探索策略模拟诊断过程,扫描系统异常定位并深入分析详细区域。为了平衡专家模仿和自主发现,我们引入了置信值采样器(CVS),它使用核采样和自适应终止来创建多样且可信的探索路径。最后,双流GRPO优化框架解耦了在线策略与离策略学习信号,减轻了奖励同化和熵崩溃。实验表明,MedEyes在多个医学VQA基准测试中平均性能提高了8.5个百分点,验证了MedEyes在构建可信赖的医疗AI系统方面的潜力。代码可在https://github.com/zhcz328/MedEyes 获取。
Summary / 总结
MedEyes is a reinforcement learning framework designed to enhance medical diagnosis by dynamically focusing on relevant image regions and incorporating expert guidance. It uses a Gaze-guided Reasoning Navigator (GRN) for systematic abnormality localization and detailed analysis, and a Confidence Value Sampler (CVS) for diverse exploration. The dual-stream GRPO optimization framework balances expert imitation and autonomous discovery. Experiments show MedEyes improves performance by 8.5 percentage points across multiple medical VQA benchmarks, demonstrating its potential for trustworthy medical AI systems.
MedEyes 是一个强化学习框架,旨在通过动态聚焦相关图像区域并结合专家指导来提高医学诊断的准确性。它使用 Gaze-guided Reasoning Navigator (GRN) 进行系统异常定位和详细分析,并使用 Confidence Value Sampler (CVS) 来平衡模仿和发现。MedEyes 在多个医学 VQA 基准测试中表现出色,平均提高了 8.5 个百分点。
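The abstract states that the Confidence Value Sampler employs nucleus sampling. The following is a minimal sketch of standard nucleus (top-p) sampling only; how CVS scores confidence or decides adaptive termination is not described here, and `nucleus_filter` and `nucleus_sample` are hypothetical helper names.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability entries whose mass reaches top_p.

    Returns {index: renormalized probability}.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def nucleus_sample(probs, top_p, rng):
    """Draw one index from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

Truncating the low-probability tail keeps sampled exploration paths varied while excluding very unlikely choices, which matches the stated goal of diverse yet credible paths.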
Evaluating Generative Models via One-Dimensional Code Distributions
Authors: Zexi Jia, Pengcheng Luo, Yijia Zhong, Jinchao Zhang, Jie Zhou
First: 2026-03-09T07:57:56+00:00 · Latest: 2026-03-12T11:19:47+00:00
Abstract
Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of discrete visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce Codebook Histogram Distance (CHD), a training-free distribution metric in token space, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose VisForm, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments. We will release all code and datasets to facilitate future research, with the code publicly available at https://github.com/zexiJia/1d-Distance.
中文标题/摘要
标题:通过一维代码分布评估生成模型
大多数生成模型的评估依赖于特征分布指标,如FID,这些指标作用于连续的识别特征,这些特征明确训练为对外观变化不变,因此会丢弃对感知质量至关重要的线索。相反,我们将在离散视觉标记的空间中评估模型,现代1D图像标记器紧凑地编码了语义和感知信息,质量表现为可预测的标记统计。我们引入了代码本直方图距离(CHD),这是一种无需训练的标记空间分布指标,以及代码混合模型评分(CMMS),这是一种从标记序列的合成退化中学习到的无参考质量指标。为了在广泛的分布偏移下测试指标,我们进一步提出了VisForm基准,包含210K张图像,跨越62种视觉形式和12种生成模型,并附有专家注释。在AGIQA、HPDv2/3和VisForm中,我们的基于标记的指标与人类判断的相关性达到最新水平。我们将发布所有代码和数据集以促进未来研究,代码可在https://github.com/zexiJia/1d-Distance公开获取。
Summary / 总结
This paper evaluates generative models by focusing on one-dimensional code distributions, which capture both semantic and perceptual information. It introduces two metrics: Codebook Histogram Distance (CHD) and Code Mixture Model Score (CMMS). The authors also propose VisForm, a benchmark with 210K images and expert annotations, to test the robustness of these metrics under various distribution shifts. The token-based metrics show strong correlation with human judgments across different benchmarks, outperforming existing methods.
该论文通过关注一维代码分布来评估生成模型,这些分布能够捕捉语义和感知信息。作者引入了两种度量标准:码本直方图距离(CHD)和代码混合模型评分(CMMS)。此外,他们还提出了VisForm基准,包含210K张图像和专家注释,以测试这些度量标准在各种分布变化下的鲁棒性。基于代码的度量标准在不同基准上的表现与人类判断高度相关,优于现有方法。
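A distribution metric over discrete tokens can be sketched by comparing codebook usage histograms of two image sets. The abstract does not give the exact distance used by CHD, so the total variation distance below is an assumption, and `codebook_histogram` and `histogram_distance` are hypothetical names; the token sequences would come from a 1D image tokenizer, which is not modeled here.

```python
from collections import Counter


def codebook_histogram(token_sequences, codebook_size):
    """Normalized frequency of each codebook index over a set of token sequences."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens to histogram")
    return [counts.get(k, 0) / total for k in range(codebook_size)]


def histogram_distance(p, q):
    """Total variation distance between two histograms (assumed choice of distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

No training is involved: once both sets are tokenized, the score is a direct comparison of token statistics, with 0 meaning identical codebook usage and 1 meaning disjoint usage.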
RADAR: Closed-Loop Robotic Data Generation via Semantic Planning and Autonomous Causal Environment Reset
Authors: Yongzhong Wang, Keyu Zhu, Yong Zhong, Liqiong Wang, Jinyu Yang, Feng Zheng
Venue: IROS
First: 2026-03-12T11:18:52+00:00 · Latest: 2026-03-12T11:18:52+00:00
Comments: 8 pages, 4 figures. Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Abstract
The acquisition of large-scale physical interaction data, a critical prerequisite for modern robot learning, is severely bottlenecked by the prohibitive cost and scalability limits of human-in-the-loop collection paradigms. To break this barrier, we introduce Robust Autonomous Data Acquisition for Robotics (RADAR), a fully autonomous, closed-loop data generation engine that completely removes human intervention from the collection cycle. RADAR elegantly divides the cognitive load into a four-module pipeline. Anchored by 2-5 3D human demonstrations as geometric priors, a Vision-Language Model first orchestrates scene-relevant task generation via precise semantic object grounding and skill retrieval. Next, a Graph Neural Network policy translates these subtasks into physical actions via in-context imitation learning. Following execution, the VLM performs automated success evaluation using a structured Visual Question Answering pipeline. Finally, to shatter the bottleneck of manual resets, a Finite State Machine orchestrates an autonomous environment reset and asymmetric data routing mechanism. Driven by simultaneous forward-reverse planning with a strict Last-In, First-Out causal sequence, the system seamlessly restores unstructured workspaces and robustly recovers from execution failures. This continuous brain-cerebellum synergy transforms data collection into a self-sustaining process. Extensive evaluations highlight RADAR's exceptional versatility. In simulation, our framework achieves up to 90% success rates on complex, long-horizon tasks, effortlessly solving challenges where traditional baselines plummet to near-zero performance. In real-world deployments, the system reliably executes diverse, contact-rich skills (e.g., deformable object manipulation) via few-shot adaptation without domain-specific fine-tuning, providing a highly scalable paradigm for robotic data acquisition.
中文标题/摘要
标题:RADAR:通过语义规划和自主因果环境重置的闭环机器人数据生成
大规模物理交互数据的获取是现代机器人学习的关键前提,但受到人工在环数据收集范式高昂成本和可扩展性的限制。为突破这一瓶颈,我们提出了Robust Autonomous Data Acquisition for Robotics (RADAR),一种完全自主的闭环数据生成引擎,完全消除了数据收集周期中的人工干预。RADAR将认知负荷优雅地分为四个模块。以2-5个3D人类演示作为几何先验,视觉-语言模型首先通过精确的语义对象定位和技能检索来协调场景相关任务生成。接着,图神经网络策略通过上下文模仿学习将这些子任务转化为物理动作。执行后,VLM 使用结构化的视觉问答流水线进行自动成功评估。最后,为了打破手动重置的瓶颈,有限状态机协调自主环境重置和非对称数据路由机制。系统通过同时进行正向和反向规划,并严格遵循后进先出的因果序列,无缝恢复无序的工作空间并从执行失败中稳健恢复。这种持续的大脑-小脑协同作用将数据收集转变为自我维持的过程。广泛的评估突显了RADAR的卓越灵活性。在仿真中,我们的框架在复杂、长时序任务上的成功率高达90%,轻松解决了传统基线在这些任务上几乎失效的挑战。在实际部署中,系统可靠地执行了多种接触丰富的技能(例如,可变形物体操作)并通过少量示例适应,提供了机器人数据获取的高可扩展范式。
Summary / 总结
RADAR is a fully autonomous data generation system for robotics that eliminates human intervention in the data collection process. It uses a four-module pipeline: a Vision-Language Model for task generation, a Graph Neural Network for action execution, a Visual Question Answering pipeline for success evaluation, and a Finite State Machine for autonomous environment reset. RADAR demonstrates high success rates in complex, long-horizon tasks in simulation and can execute diverse, contact-rich skills in real-world settings with few-shot adaptation.
RADAR 是一个完全自主的数据生成系统,用于机器人学习,消除了数据收集过程中的人工干预。它使用四个模块:视觉语言模型进行任务生成、图神经网络进行动作转换、视觉问答管道进行成功评估以及有限状态机进行自主环境重置。RADAR 在模拟中实现了高达 90% 的复杂任务成功率,并且可以通过少量示例进行适应,在现实世界中执行多种接触丰富的技能。
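The Last-In, First-Out reset order can be illustrated with a toy scene in which each executed action is undone in reverse. This is a sketch under simplifying assumptions: the scene is a dictionary from object to location, every action is a hypothetical `(object, source, destination)` move, and each move is assumed to be exactly invertible, which real manipulation does not guarantee.

```python
def apply_action(scene, action):
    """Move an object from its source to its destination, checking the precondition."""
    obj, src, dst = action
    if scene[obj] != src:
        raise ValueError(f"{obj} is at {scene[obj]}, expected {src}")
    scene[obj] = dst


def invert(action):
    """Reverse a move by swapping its source and destination."""
    obj, src, dst = action
    return (obj, dst, src)


def reset_plan(executed):
    """Undo executed actions in Last-In, First-Out order."""
    return [invert(a) for a in reversed(executed)]
```

Undoing in reverse order matters causally: if a lid was placed on a cup after the cup was moved, the lid has to come off before the cup can return to its original position.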
OSM-based Domain Adaptation for Remote Sensing VLMs
Authors: Stefan Maria Ailuro, Mario Markov, Mohammad Mahdi, Delyan Boychev, Luc Van Gool, Danda Pani Paudel
First: 2026-03-12T11:08:30+00:00 · Latest: 2026-03-12T11:08:30+00:00
Abstract
Vision-Language Models (VLMs) adapted to remote sensing rely heavily on domain-specific image-text supervision, yet high-quality annotations for satellite and aerial imagery remain scarce and expensive to produce. Prevailing pseudo-labeling pipelines address this gap by distilling knowledge from large frontier models, but this dependence on large teachers is costly, limits scalability, and caps achievable performance at the ceiling of the teacher. We propose OSMDA: a self-contained domain adaptation framework that eliminates this dependency. Our key insight is that a capable base VLM can serve as its own annotation engine: by pairing aerial images with rendered OpenStreetMap (OSM) tiles, we leverage optical character recognition and chart comprehension capabilities of the model to generate captions enriched by OSM's vast auxiliary metadata. The model is then fine-tuned on the resulting corpus with satellite imagery alone, yielding OSMDA-VLM, a domain-adapted VLM that requires no manual labeling and no stronger external model. We conduct exhaustive evaluations spanning 10 benchmarks across image-text-to-text tasks and comparing against 9 competitive baselines. When equally mixed with real data, our method achieves state-of-the-art results, while being substantially cheaper to train than teacher-dependent alternatives. These results suggest that, given a strong foundation model, alignment with crowd-sourced geographic data is a practical and scalable path towards remote sensing domain adaptation. Dataset and model weights will be made publicly available.
中文标题/摘要
标题:基于OSM的数据域适应遥感VLM
视觉-语言模型(VLMs)适应遥感依赖于特定领域的图像-文本监督,但高质量的卫星和航空影像标注稀缺且昂贵。现有的伪标签管道通过从大型前沿模型中提取知识来弥补这一缺口,但对大型教师模型的依赖性使得成本高昂、可扩展性受限,并且性能受限于教师模型的天花板。我们提出OSMDA:一个自包含的数据域适应框架,消除了对大型教师模型的依赖。我们的核心见解是,一个能力强的基础VLM可以作为自己的标注引擎:通过将航空图像与渲染的OpenStreetMap (OSM) 地图瓦片配对,利用模型的光学字符识别和图表理解能力生成富含OSM大量辅助元数据的描述。然后,模型仅使用卫星图像对该语料库进行微调,生成OSMDA-VLM,这是一种无需人工标注和更强外部模型的数据域适应VLM。我们进行了涵盖10个基准的详尽评估,比较了9个竞争基线。当与真实数据等量混合时,我们的方法达到了最先进的结果,而训练成本远低于依赖教师模型的替代方案。这些结果表明,给定一个强大的基础模型,与众包地理数据对齐是一种实用且可扩展的遥感数据域适应路径。数据集和模型权重将公开提供。
Summary / 总结
The research aims to address the scarcity of high-quality annotations for remote sensing images by proposing OSMDA, a self-contained domain adaptation framework. It leverages a base Vision-Language Model (VLM) to generate captions enriched with OpenStreetMap (OSM) metadata, which are then used to fine-tune the model on satellite imagery alone. The method outperforms existing teacher-dependent approaches in 10 benchmarks, achieving state-of-the-art results while being more cost-effective to train. This suggests that alignment with crowd-sourced geographic data can be a practical and scalable solution for remote sensing domain adaptation.
该论文针对遥感领域中视觉-语言模型(VLM)的领域适应问题,解决了高质量标注稀缺的问题。作者提出了OSMDA,这是一种自包含的框架,利用基础VLM将卫星图像与OpenStreetMap(OSM)图块配对生成描述,然后仅使用卫星图像进行微调,无需人工标注或更强的外部模型。实验结果显示,当与真实数据混合时,OSMDA-VLM在10个基准测试中达到最先进的效果,并且比依赖教师模型的方法训练成本更低。
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Xiang An, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
Venue: CVPR 2026
First: 2025-10-21T13:36:58+00:00 · Latest: 2026-03-12T11:07:39+00:00
Comments: 25 pages, 17 figures
Abstract
Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that effectively exploits the rich geometric information embedded within images while reasoning, as humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by the VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code is available at https://github.com/zhangquanchen/3DThinker.
中文标题/摘要
标题:三维思考:基于有限视角的几何想象与空间推理
尽管近期视觉-语言模型(VLMs)在多种跨模态任务中取得了显著进展,但从有限视角理解三维空间关系仍然是一个重大挑战。以往的推理方法通常依赖纯文本(例如拓扑认知地图)或二维视觉线索。然而,它们有限的表示能力阻碍了在需要三维空间想象的任务中的表现。为了解决这一限制,我们提出了3DThinker框架,该框架能够在推理过程中有效利用图像中嵌入的丰富几何信息,类似于人类的思考方式。我们的框架是首个在推理过程中启用三维思考而无需任何三维先验输入的框架,并且在训练过程中不依赖于明确标注的三维数据。具体而言,我们的训练分为两个阶段。首先,我们进行监督训练,以使VLM在推理过程中生成的三维潜在表示与三维基础模型(例如VGGT)生成的三维潜在表示对齐。然后,我们仅基于结果信号优化整个推理过程,从而细化底层的三维思考。在多个基准测试中的广泛实验表明,3DThinker在多个基准测试中始终优于强基线,并为将三维表示统一到跨模态推理中提供了新的视角。我们的代码可在https://github.com/zhangquanchen/3DThinker获取。
Summary / 总结
The research aims to improve the ability of vision-language models to understand 3D spatial relationships from limited views, which is challenging for existing methods. 3DThinker, a novel framework, is proposed to exploit geometric information in images for 3D reasoning. It consists of two stages: supervised training to align 3D latent spaces and optimization based on outcome signals to refine 3D mentaling. Experiments show that 3DThinker outperforms strong baselines and provides a new approach for unifying 3D representations in multimodal reasoning.
研究旨在提高视觉-语言模型从有限视角理解3D空间关系的能力,这是当前方法的挑战。提出的3DThinker框架通过利用图像中的几何信息来增强推理,无需明确的3D数据。该框架分为两个阶段:监督训练以对齐3D潜在表示,并基于结果优化推理轨迹。实验表明,3DThinker在多个基准测试中优于强基线,为统一3D表示在多模态推理中的应用提供了新视角。
MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization
Authors: Animesh Jain, Alexandros Stergiou
First: 2025-08-11T10:36:58+00:00 · Latest: 2026-03-12T10:48:14+00:00
Comments: Project page: https://anaekin.github.io/MIMIC
Abstract
Vision Language Models (VLMs) encode multimodal inputs over large, complex, and difficult-to-interpret architectures, which limit transparency and trust. We propose a Multimodal Inversion for Model Interpretation and Conceptualization (MIMIC) framework that inverts the internal encodings of VLMs. MIMIC uses a joint VLM-based inversion and a feature alignment objective to account for VLM's autoregressive processing. It additionally includes a triplet of regularizers for spatial alignment, natural image smoothness, and semantic realism. We evaluate MIMIC both quantitatively and qualitatively by inverting visual concepts across a range of free-form VLM outputs of varying length. Reported results include both standard visual quality metrics and semantic text-based metrics. To the best of our knowledge, this is the first model inversion approach addressing visual interpretations of VLM concepts.
中文标题/摘要
标题:MIMIC:多模态反演以实现模型解释与概念化
视觉语言模型(VLMs)将多模态输入编码到大型、复杂且难以解释的架构中,这限制了透明度和信任度。我们提出了一种多模态反演以实现模型解释与概念化(MIMIC)框架,该框架可以反演VLM的内部编码。MIMIC使用基于VLM的联合反演和特征对齐目标来考虑VLM的自回归处理。此外,它还包括一个用于空间对齐、自然图像平滑性和语义现实性的三重正则化器。我们通过反演不同长度的自由形式VLM输出中的视觉概念,从定量和定性两个方面评估MIMIC。报告的结果包括标准的视觉质量指标和语义文本指标。据我们所知,这是第一个针对VLM概念视觉解释的模型反演方法。
Summary / 总结
The research aims to enhance the transparency and trust in Vision Language Models (VLMs) by developing a framework called MIMIC for multimodal inversion. MIMIC inverts the internal encodings of VLMs using a joint VLM-based inversion and a feature alignment objective, while incorporating regularizers for spatial alignment, natural image smoothness, and semantic realism. The evaluation shows that MIMIC can effectively invert visual concepts across various VLM outputs, providing both visual quality and semantic text-based metrics, and is the first model inversion approach for visual interpretations of VLM concepts.
研究旨在通过提出MIMIC框架增加视觉语言模型(VLM)的透明度和信任度,该框架通过联合VLM基的反向转换和特征对齐目标来解释VLM的内部编码,并包含用于空间对齐、自然图像平滑性和语义现实性的正则化项。该框架通过在各种VLM输出中反向转换视觉概念进行定量和定性评估,结果包括视觉质量和语义文本基线指标。这是首次针对VLM概念的视觉解释的模型反向转换方法。
Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
Authors: Tong Wang, Yunhan Zhao, Shu Kong
First: 2026-01-31T16:42:55+00:00 · Latest: 2026-03-12T09:13:49+00:00
Abstract
Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. In particular, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a "paracosm", where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
中文标题/摘要
标题:生成平行宇宙以实现无需训练的零样本组合图像检索
组合图像检索(CIR)是指使用包含参考图像和修改文本的多模态查询从数据库中检索目标图像的任务。文本说明如何修改参考图像以形成“心理图像”,基于此,CIR 应在数据库中找到目标图像。CIR 的基本挑战在于这种“心理图像”是不可物理获取的,仅由查询隐式定义。当代文献追求零样本方法,并使用大型多模态模型(LMM)生成给定多模态查询的文本描述,然后使用视觉语言模型(VLM)进行文本-视觉匹配以搜索目标图像。相比之下,我们从第一性原理出发,直接生成“心理图像”以实现更准确的匹配。特别地,我们提示 LMM 生成给定多模态查询的“心理图像”,并提议使用此“心理图像”来搜索目标图像。由于“心理图像”与真实图像之间存在合成到现实的领域差距,我们还为数据库中的每个真实图像生成一个合成对应物以促进匹配。因此,我们的方法使用 LMM 构建一个“平行宇宙”,其中它匹配多模态查询和数据库图像。因此,我们称此方法为平行宇宙。值得注意的是,平行宇宙是一种无需训练的零样本 CIR 方法。它在具有挑战性的基准测试中显著优于现有零样本方法,实现了零样本 CIR 的最佳性能。
Summary / 总结
The paper addresses Composed Image Retrieval (CIR) with Paracosm, a training-free zero-shot method that prompts a Large Multimodal Model (LMM) to generate the "mental image" implied by the query instead of a textual description. Because the generated image has a synthetic-to-real domain gap with real images, the method also generates a synthetic counterpart for each real image in the database, so that query and database images are matched inside the same "paracosm". The approach significantly outperforms existing zero-shot methods on challenging CIR benchmarks.
该论文通过直接使用大型多模态模型(LMM)生成‘心理图像’来解决组合图像检索(CIR)问题,以提高准确性,而现有方法依赖于文本描述。该方法名为Paracosm,为数据库中的每个真实图像生成一个合成对应物,以弥合合成到真实图像的领域差距,从而实现更准确的匹配。Paracosm在具有挑战性的基准测试中取得了最先进的性能,优于现有零样本方法的CIR。
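The matching step described above can be illustrated with a minimal sketch. It assumes that the generated "mental image" and the synthetic counterparts of the database images have already been embedded by some vision encoder, and that cosine similarity is used for ranking; the abstract does not state the similarity measure, so that choice, the function names `cosine` and `rank_database`, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_database(query_embedding, synthetic_db_embeddings):
    """Return database indices sorted from best to worst match.

    Assumption: index i of synthetic_db_embeddings is the synthetic
    counterpart of real database image i, so the ranking carries over.
    """
    scores = [cosine(query_embedding, e) for e in synthetic_db_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 3-dimensional embeddings (made up for illustration only).
mental_image = [0.9, 0.1, 0.0]
synthetic_db = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_database(mental_image, synthetic_db)
print(ranking)
```

Both sides of the comparison are synthetic here, which is the point of generating counterparts for the database: the query and the candidates live in the same domain before any similarity is computed.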
GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models
Authors: Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng
Venue: ICLR 2026
First: 2025-10-09T05:09:27+00:00 · Latest: 2026-03-12T08:33:55+00:00
Comments: ICLR 2026, 31 pages, 20 figures
Abstract
Recently, the spatial-temporal intelligence of Visual-Language Models (VLMs) has attracted much attention due to its importance for autonomous driving, embodied AI and general AI. Existing spatial-temporal benchmarks mainly focus on egocentric (first-person) perspective reasoning using images/video contexts, or geographic reasoning with graphical context (e.g., maps), and thus fail to assess VLMs' geographic spatial-temporal intelligence, which requires integrating both images/video and graphical context and is crucial for real-world scenarios such as traffic management and emergency response. To address these gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench show that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three major deficiencies of current models for geo-temporal reasoning. (1) VLMs exhibit imbalanced utilization of spatial and temporal context during reasoning. (2) They show weak temporal forecasting ability, leading to poorer performance on temporally focused tasks. (3) They lack the capability to effectively align and integrate map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.
中文标题/摘要
标题:GTR-Bench:评估视觉-语言模型的地理时空推理能力
近年来,视觉-语言模型(VLMs)的时空智能引起了广泛关注,这对于自动驾驶、具身人工智能和通用人工智能至关重要。现有的时空基准主要集中在使用图像/视频上下文进行以自我为中心(第一人称)视角的推理,或使用图形上下文(例如,地图)进行地理推理,因此无法评估VLMs所需的地理时空智能,这需要结合图像/视频和图形上下文,这对于交通管理和应急响应等现实场景至关重要。为了解决这些差距,我们引入了Geo-Temporal Reasoning基准(GTR-Bench),这是一个新的挑战,用于大规模摄像网络中移动目标的地理时空推理。GTR-Bench更具挑战性,因为它需要在地图和视频之间进行多次视角切换,跨多个具有非重叠视场的视频进行联合推理,并对任何视频上下文都无法观察到的时空区域进行推理。对超过10个流行的VLMs在GTR-Bench上的评估显示,即使是最先进的专有模型Gemini-2.5-Pro(34.9%),也远远落后于人类在地理时空推理上的表现(78.61%)。此外,我们对GTR-Bench的全面分析揭示了当前模型在地理时空推理方面的三大缺陷。(1)VLMs在推理过程中对空间和时间上下文的利用不平衡。(2)它们在时间预测方面表现较弱,导致在时间导向任务上的表现较差。(3)它们缺乏有效对齐和整合地图数据与多视角视频输入的能力。我们相信GTR-Bench提供了宝贵的见解,并为时空智能的研究和应用开辟了新的机会。基准和代码将在https://github.com/X-Luffy/GTR-Bench发布。
Summary / 总结
GTR-Bench evaluates the geo-temporal reasoning capabilities of VLMs, addressing the limitation that existing benchmarks cover either egocentric reasoning over images/video or geographic reasoning over maps, but not both together. The benchmark poses geographic temporal reasoning tasks about moving targets in a large-scale camera network, requiring multiple perspective switches between maps and videos, joint reasoning across videos with non-overlapping fields of view, and inference over regions that no video observes. Evaluations of more than 10 VLMs show that even the best proprietary model, Gemini-2.5-Pro (34.9%), lags far behind human performance (78.61%). The study identifies three major deficiencies: imbalanced use of spatial and temporal context, weak temporal forecasting, and poor alignment of map data with multi-view video inputs.
GTR-Bench 评估了 VLM 的地理时空推理能力,解决了现有基准主要关注第一人称或地理推理的局限性。该基准引入了一个新的挑战,涉及地图和视频之间的多视角切换、跨非重叠视域的联合推理以及对未被视频观测到的时空区域的推理。对 10 余种流行 VLM 的评估显示,即使最好的模型 Gemini-2.5-Pro 也远逊于人类表现。分析指出三个主要缺陷:时空上下文利用不平衡、时间预测能力弱以及无法有效对齐和整合地图数据与多视角视频输入。
BackdoorIDS: Zero-shot Backdoor Detection for Pretrained Vision Encoder
Authors: Siquan Huang, Yijiang Li, Ningzhi Gao, Xingfu Yan, Leyu Shi
First: 2026-03-12T08:32:19+00:00 · Latest: 2026-03-12T08:32:19+00:00
Comments: 17 pages, 10 figures, 6 tables
Abstract
Self-supervised and multimodal vision encoders learn strong visual representations that are widely adopted in downstream vision tasks and large vision-language models (LVLMs). However, downstream users often rely on third-party pretrained encoders with uncertain provenance, exposing them to backdoor attacks. In this work, we propose BackdoorIDS, a simple yet effective zero-shot, inference-time backdoor samples detection method for pretrained vision encoders. BackdoorIDS is motivated by two observations: Attention Hijacking and Restoration. Under progressive input masking, a backdoored image initially concentrates attention on malicious trigger features. Once the masking ratio exceeds the trigger's robustness threshold, the trigger is deactivated, and attention rapidly shifts to benign content. This transition induces a pronounced change in the image embedding, whereas embeddings of clean images evolve more smoothly across masking progress. BackdoorIDS operationalizes this signal by extracting an embedding sequence along the masking trajectory and applying density-based clustering such as DBSCAN. An input is flagged as backdoored if its embedding sequence forms more than one cluster. Extensive experiments show that BackdoorIDS consistently outperforms existing defenses across diverse attack types, datasets, and model families. Notably, it is a plug-and-play approach that requires no retraining and operates fully zero-shot at inference time, making it compatible with a wide range of encoder architectures, including CNNs, ViTs, CLIP, and LLaVA-1.5.
中文标题/摘要
标题:BackdoorIDS:预训练视觉编码器的零样本后门检测
自监督和多模态视觉编码器学习强大的视觉表示,广泛应用于下游视觉任务和大型视觉-语言模型(LVLMs)。然而,下游用户经常依赖来源不明的第三方预训练编码器,使其面临后门攻击的风险。在本工作中,我们提出了一种名为BackdoorIDS的简单而有效的零样本、推理时后门样本检测方法,用于预训练视觉编码器。BackdoorIDS受到两个观察结果的启发:注意力劫持和恢复。在渐进输入遮罩下,后门图像最初将注意力集中在恶意触发特征上。一旦遮罩比例超过触发的鲁棒性阈值,触发器被禁用,注意力迅速转向良性内容。这种转变导致图像嵌入产生显著变化,而干净图像的嵌入则在遮罩过程中更平滑地演变。BackdoorIDS通过沿遮罩轨迹提取嵌入序列并应用基于密度的聚类(如DBSCAN)来实现这一信号。如果输入的嵌入序列形成多个聚类,则将其标记为后门。广泛的实验表明,BackdoorIDS在多种攻击类型、数据集和模型家族中始终优于现有防御措施。值得注意的是,它是一种即插即用的方法,无需重新训练,并在推理时完全零样本运行,使其与各种编码器架构兼容,包括CNN、ViT、CLIP和LLaVA-1.5。
Summary / 总结
BackdoorIDS is a zero-shot, inference-time method for detecting backdoor samples fed to pretrained vision encoders. Motivated by two observations, Attention Hijacking and Restoration, it tracks how an image's embedding changes under progressive input masking: it extracts an embedding sequence along the masking trajectory, applies density-based clustering such as DBSCAN, and flags the input as backdoored if the sequence forms more than one cluster. Experiments show that BackdoorIDS outperforms existing defenses across attack types, datasets, and model families, and that it works with different encoder architectures without retraining.
BackdoorIDS 是一种针对预训练视觉编码器的零样本后门检测方法,基于注意力劫持和恢复的观察。它通过分析在逐级输入遮罩下的图像嵌入变化来检测后门攻击。大量实验表明,BackdoorIDS 在各种攻击类型、数据集和模型家族中均优于现有防御方法,并且可以轻松集成到不同的编码器架构中而无需重新训练。
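The decision rule in the abstract (cluster the embedding sequence, flag the input if more than one cluster appears) can be sketched as follows. This is a toy illustration, not the authors' implementation: the embeddings are synthetic 2-D points standing in for encoder outputs at increasing masking ratios, the compact `dbscan` function is a simplified standard-library version of the algorithm, and `eps`, `min_samples`, and the name `is_flagged` are assumptions chosen for the example.

```python
import math

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point (-1 means noise)."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_samples:
                queue.extend(nbj)  # core point: keep growing the cluster
    return labels

def is_flagged(embedding_sequence, eps=0.5, min_samples=2):
    """Flag an input if its embedding sequence forms more than one cluster."""
    labels = dbscan(embedding_sequence, eps, min_samples)
    return len({label for label in labels if label != -1}) > 1

# Synthetic stand-ins for embeddings at 10 increasing masking ratios.
clean_seq = [(0.1 * k, 0.0) for k in range(10)]            # smooth drift
backdoored_seq = ([(0.05 * k, 0.0) for k in range(5)]      # trigger active
                  + [(5.0 + 0.05 * k, 0.0) for k in range(5)])  # after the jump

print(is_flagged(clean_seq), is_flagged(backdoored_seq))
```

In the smooth sequence every point is within `eps` of its neighbor, so a single cluster forms; the abrupt jump in the second sequence leaves two dense groups separated by a gap, which is the signal the abstract describes.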
Partially Recentralization Softmax Loss for Vision-Language Models Robustness
Authors: Hao Wang, Jinzhe Jiang, Xin Zhang, Chen Li
First: 2024-02-06T01:44:38+00:00 · Latest: 2026-03-12T08:30:30+00:00
Comments: The study described in Section 4 was conducted without required institutional review board approval. The paper is withdrawn pending completion of the approval process
Abstract
As Large Language Models achieve breakthroughs in natural language processing (NLP) tasks, multimodal techniques have become extremely popular. However, it has been shown that multimodal NLP models are vulnerable to adversarial attacks, where the outputs of a model can be dramatically changed by a perturbation to the input. While several defense techniques have been proposed for both computer vision and NLP models, the multimodal robustness of models has not been fully explored. In this paper, we study the adversarial robustness provided by modifying the loss function of pre-trained multimodal models, by restricting the top K softmax outputs. Based on the evaluation and scoring, our experiments show that after fine-tuning, the adversarial robustness of pre-trained models can be significantly improved against popular attacks. Further research is needed on aspects such as output diversity, generalization, and the robustness-performance trade-off of this kind of loss function. Our code will be available after this paper is accepted.
中文标题/摘要
标题:部分重新中央化softmax损失函数对视觉-语言模型鲁棒性的研究
随着大型语言模型在自然语言处理任务(NLP)中取得突破,多模态技术变得极其流行。然而,已经表明,多模态NLP模型对对抗攻击非常脆弱,输入的微小扰动可以使模型的输出发生巨大变化。虽然已经在计算机视觉和NLP模型中提出了多种防御技术,但多模态模型的鲁棒性尚未得到充分探索。在本文中,我们通过限制softmax输出的前K项来研究修改预训练多模态模型损失函数提供的对抗鲁棒性。根据评估和评分,我们的实验表明,在微调后,预训练模型的对抗鲁棒性可以显著提高,对抗流行的攻击。进一步的研究应该包括输出多样性、泛化以及这种损失函数的鲁棒性-性能权衡。在论文被接受后,我们的代码将可供使用
Summary / 总结
This paper investigates the adversarial robustness of pre-trained vision-language models by modifying their loss functions. Specifically, the authors introduce a partially recentralization softmax loss to restrict the top K softmax outputs. Experimental results show that fine-tuning with this loss function significantly improves the adversarial robustness of the models against popular attacks. Further research is needed to explore output diversity, generalization, and the robustness-performance trade-off of this approach.
本文提出了一种部分重新中心化softmax损失函数,以增强多模态模型对对抗攻击的鲁棒性。该方法通过在微调过程中限制softmax输出的前K个值来实现。实验表明,这种方法显著提高了预训练模型对各种攻击的鲁棒性,但仍需进一步研究输出多样性及鲁棒性与性能之间的权衡。
Generalizing Vision-Language Models with Dedicated Prompt Guidance
Authors: Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li
First: 2025-12-02T05:06:17+00:00 · Latest: 2026-03-12T08:24:09+00:00
Comments: Accepted to AAAI26
Abstract
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
中文标题/摘要
标题:视觉-语言模型的专用提示指导泛化
对大规模预训练视觉-语言模型(VLMs)进行微调已成为下游适应的主流范式,但面临着领域特异性与领域泛化(DG)能力之间的关键权衡。当前方法通常将通用模型在完整数据集上进行微调,这可能会损害其对未见领域的泛化能力。为填补这一空白,我们提供了视觉语言模型微调泛化能力的理论理解,揭示了在分区源领域上训练多个参数高效的专家模型比在通用模型上进行微调具有更好的泛化能力。受这一发现的启发,我们提出了一种两步领域专家指导泛化(GuiDG)框架。GuiDG 首先使用提示调优获得源领域专家,然后通过自适应专家集成引导视觉编码器的微调,借助跨模态注意力模块。为了更好地评估少量样本泛化,我们从ImageNet及其变体构建了ImageNet-DG。在标准泛化基准和ImageNet-DG上的广泛实验表明,GuiDG 在保持效率的同时优于最先进的微调方法。
Summary / 总结
The research addresses the challenge of balancing domain specificity and generalization in fine-tuning vision-language models. It proposes a two-step framework, GuiDG, which first uses prompt tuning to create domain-specific experts and then guides the fine-tuning of the vision encoder through adaptive expert integration. Experiments show that GuiDG outperforms existing methods on standard generalization benchmarks and a newly constructed ImageNet-DG dataset while maintaining efficiency.
论文针对大规模预训练视觉-语言模型在下游适应中的领域泛化问题,提出了一种两步框架GuiDG。GuiDG首先通过提示调优生成领域特定专家,然后通过跨模态注意力模块引导视觉编码器的调优。实验表明,GuiDG在标准领域泛化基准和新构建的ImageNet-DG数据集上优于现有调优方法,同时保持了高效性。
MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
Authors: Baicheng Li, Dong Wu, Jun Li, Shunkai Zhou, Zecui Zeng, Lusong Li, Hongbin Zha
First: 2026-03-12T07:53:35+00:00 · Latest: 2026-03-12T07:53:35+00:00
Abstract
Recent unified 3D generation models have made remarkable progress in producing high-quality 3D assets from a single image. Notably, layout-aware approaches such as SAM3D can reconstruct multiple objects while preserving their spatial arrangement, opening the door to practical scene-level 3D generation. However, current methods are limited to single-view input and cannot leverage complementary multi-view observations, while independently estimated object poses often lead to physically implausible layouts such as interpenetration and floating artifacts. We present MV-SAM3D, a training-free framework that extends layout-aware 3D generation with multi-view consistency and physical plausibility. We formulate multi-view fusion as a Multi-Diffusion process in 3D latent space and propose two adaptive weighting strategies -- attention-entropy weighting and visibility weighting -- that enable confidence-aware fusion, ensuring each viewpoint contributes according to its local observation reliability. For multi-object composition, we introduce physics-aware optimization that injects collision and contact constraints both during and after generation, yielding physically plausible object arrangements. Experiments on standard benchmarks and real-world multi-object scenes demonstrate significant improvements in reconstruction fidelity and layout plausibility, all without any additional training. Code is available at https://github.com/devinli123/MV-SAM3D.
中文标题/摘要
标题:MV-SAM3D:适应性多视角融合以实现布局感知的3D生成
近期统一的3D生成模型在从单张图像生成高质量3D资产方面取得了显著进展。特别是布局感知的方法如SAM3D可以重建多个对象并保持其空间布局,为场景级别的3D生成打开了大门。然而,当前的方法仅限于单视角输入,无法利用互补的多视角观测,而独立估计的对象姿态往往会导致物理上不可行的布局,如穿插和漂浮的伪影。 我们提出了MV-SAM3D,这是一种无需训练的框架,它将布局感知的3D生成扩展到多视角一致性与物理可行性。我们将多视角融合形式化为3D潜在空间中的多扩散过程,并提出了两种自适应加权策略——注意力-熵加权和可见性加权,以实现置信度感知的融合,确保每个视角根据其局部观测可靠性贡献。对于多对象组合,我们引入了物理感知优化,该优化在生成过程中和生成后注入碰撞和接触约束,从而产生物理上可行的对象布局。在标准基准和真实世界的多对象场景上的实验表明,在无需额外训练的情况下,重建精度和布局可行性都有显著提高。代码可在https://github.com/devinli123/MV-SAM3D获取。
Summary / 总结
MV-SAM3D is a training-free framework that enhances layout-aware 3D generation by incorporating multi-view consistency and physical plausibility. It uses a Multi-Diffusion process in 3D latent space and two adaptive weighting strategies to fuse multi-view observations, ensuring each viewpoint contributes based on its reliability. Additionally, it introduces physics-aware optimization to enforce collision and contact constraints, leading to physically plausible object arrangements. Experiments show significant improvements in reconstruction fidelity and layout plausibility without additional training.
MV-SAM3D 是一个无需训练的框架,通过引入多视图一致性和物理合理性来增强布局感知的 3D 生成。它使用 3D 潜空间中的多扩散过程,并采用两种自适应加权策略来融合多视图观察,确保每个视点根据其局部观察可靠性进行贡献。此外,它引入了物理感知优化,注入碰撞和接触约束,从而实现物理合理的物体排列。实验结果显示,在重建保真度和布局合理性方面取得了显著改进,无需额外训练。
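The idea of confidence-aware fusion, where each viewpoint contributes according to how reliable its observation is, can be illustrated with a small sketch. The abstract names attention-entropy weighting but gives no formula, so the mapping used here (weight proportional to `exp(-entropy)`, then normalized), the single weighted average in place of a full Multi-Diffusion process, and the names `fusion_weights` and `fuse` are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def fusion_weights(attention_rows):
    """One weight per view: lower attention entropy gives a higher weight.

    Assumption: exp(-entropy) is used as an unnormalized confidence score.
    """
    raw = [math.exp(-entropy(row)) for row in attention_rows]
    total = sum(raw)
    return [r / total for r in raw]

def fuse(view_latents, weights):
    """Weighted average of per-view latent vectors of equal length."""
    dim = len(view_latents[0])
    return [sum(w * z[d] for w, z in zip(weights, view_latents))
            for d in range(dim)]

# Two toy views: the first attends sharply, the second is maximally diffuse.
attention = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
latents = [[1.0, 0.0], [0.0, 1.0]]

weights = fusion_weights(attention)
fused = fuse(latents, weights)
print(weights, fused)
```

The sharply attending view receives the larger weight, so the fused latent sits closer to that view's prediction. A visibility term, which the abstract also mentions, could be multiplied into the raw scores before normalization under the same scheme.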
VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought
Authors: Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim
First: 2026-03-12T07:47:46+00:00 · Latest: 2026-03-12T07:47:46+00:00
Comments: 30 pages, 21 figures, EACL 2026 Findings
Abstract
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model improves by +33.2%. Furthermore, consistent zero-shot gains on diverse open-domain VQA benchmarks confirm the generalizability of the perception-logic separation strategy for visual question answering. VisDoT leverages human-like perception to enhance visual grounding, achieving state-of-the-art chart understanding and interpretable visual reasoning.
中文标题/摘要
标题:VisDoT : 通过人类似解题接地和思维分解增强视觉推理
大型视觉-语言模型(LVLMs)在检测图表中的视觉原语并将其与语义表示对齐方面难以可靠地进行,这严重限制了它们在复杂视觉推理方面的性能。这种感知接地的缺乏构成了基于图表推理的主要瓶颈。我们提出了VisDoT框架,通过人类似解题接地来增强视觉推理。我们基于图形感知理论形式化了四个感知任务,包括位置和长度。在此基础上,我们引入了思维分解(DoT)提示,该提示将问题依次分解为视觉感知子问题和逻辑子问题。使用VisDoT微调InternVL在ChartQA上实现了11.2%的改进,并在更具挑战性的ChartQAPro基准上超越了GPT-4o。在新引入的VisDoTQA基准上,模型提高了33.2%。此外,一致的零样本增益在多种开放域VQA基准上证实了感知-逻辑分离策略在视觉问答中的普适性。VisDoT利用人类似感知来增强视觉接地,实现了最先进的图表理解和可解释的视觉推理。
Summary / 总结
This paper addresses the challenge of LVLMs in reliably detecting visual primitives in charts and aligning them with semantic representations, which limits their performance in complex visual reasoning. The authors propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding and introduces Decomposition-of-Thought (DoT) prompting to separate questions into visual perception and logic sub-questions. Fine-tuning InternVL with VisDoT improves performance on ChartQA by 11.2%, surpasses GPT-4o on ChartQAPro, and achieves a 33.2% improvement on the newly introduced VisDoTQA benchmark. The study also confirms the generalizability of the perception-logic separation strategy on various open-domain VQA benchmarks.
论文针对大型视觉-语言模型(LVLM)在图表中可靠检测视觉基本元素并与语义表示对齐的困难,这限制了它们在复杂视觉推理中的表现。它提出了VisDoT框架,通过类人的解释接地和分解思维(DoT)提示来增强视觉推理,该提示将问题分解为视觉感知和逻辑子问题。通过VisDoT微调InternVL在ChartQA上的性能提高了11.2%,在更具挑战性的ChartQAPro基准上超越了GPT-4o,并在新引入的VisDoTQA基准上提高了33.2%。该模型还在各种开放域VQA基准上表现出一致的零样本增益,表明感知-逻辑分离策略的普适性。
StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References
Authors: Boyu He, Yunfan Ye, Chang Liu, Weishang Wu, Fang Liu, Zhiping Cai
First: 2026-03-11T03:05:02+00:00 · Latest: 2026-03-12T07:44:05+00:00
Comments: 18 pages, 23 figures, Conference on Computer Vision and Pattern Recognition 2026
Abstract
Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.
中文标题/摘要
标题:StyleGallery:无需训练且具有语义意识的个性化风格迁移,从任意图像参考中
尽管在基于扩散的图像风格迁移方面取得了进展,但现有方法通常受限于1)语义差距:风格参考可能缺少适当的内容语义,导致不可控的风格化;2)依赖额外约束(例如,语义掩码)限制了适用性;3)刚性特征关联缺乏适应性的全局-局部对齐,无法平衡精细风格化和全局内容保留。这些限制,尤其是无法灵活利用风格输入,从根本上限制了风格迁移在个性化、准确性和适应性方面的应用。为了解决这些问题,我们提出了StyleGallery,这是一种无需训练且具有语义意识的框架,支持任意参考图像作为输入,并能够实现有效的个性化定制。它包括三个核心阶段:语义区域分割(在潜在扩散特征上进行自适应聚类,无需额外输入以划分区域);聚类区域匹配(在提取特征上进行块过滤,以实现精确对齐);以及风格迁移优化(基于能量函数的扩散采样与区域风格损失相结合,以优化风格化)。在我们引入的基准测试上进行的实验表明,StyleGallery在内容结构保留、区域风格化、可解释性和个性化定制方面优于现有最先进的方法,尤其是在利用多个风格参考时。
Summary / 总结
The research addresses limitations in existing diffusion-based image style transfer methods, such as semantic gaps, reliance on extra constraints, and rigid feature associations. It introduces StyleGallery, a training-free and semantic-aware framework that uses arbitrary reference images for personalized style transfer. The framework consists of semantic region segmentation, clustered region matching, and style transfer optimization stages. Experiments show that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, especially when using multiple style references.
StyleGallery 是一个无需训练且具备语义感知的框架,用于图像风格转移,解决了现有方法中的语义差距和刚性特征关联等问题。它通过语义区域分割、聚类区域匹配和风格转移优化,支持任意参考图像输入并实现个性化定制。实验表明,StyleGallery 在内容结构保留、区域风格化、可解释性和个性化定制方面优于现有最先进的方法,特别是在使用多个风格参考时表现更佳。