arXiv Paper Digest

Snapshot: 20260331_0407

Make Geometry Matter for Spatial Reasoning

Authors: Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang

First: 2026-03-27T17:45:12+00:00 · Latest: 2026-03-27T17:45:12+00:00

Abstract

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.

中文标题/摘要

标题：让几何学在空间推理中发挥作用

得益于大规模训练，视觉-语言模型（VLMs）在图像和视频理解方面表现出色，但在处理静态场景和动态视频的空间推理方面仍存在局限性。近期的研究试图通过将预训练的3D基础模型中的几何标记注入VLMs来解决这一局限性。然而，我们观察到，在这一领域的研究中，简单的标记融合加上标准的微调往往未能充分利用这些几何线索进行空间推理，因为VLMs倾向于依赖2D视觉线索。在本文中，我们提出了GeoSR框架，旨在通过鼓励VLMs积极与几何标记进行推理来让几何学发挥作用。GeoSR引入了两个关键组件：（1）几何释放掩码，在训练过程中战略性地屏蔽2D视觉标记的部分内容，以削弱非几何捷径并迫使模型在空间推理时咨询几何标记；（2）几何引导融合，这是一种门控路由机制，能够自适应地放大几何标记在关键几何证据区域的贡献。这些设计共同释放了几何标记在空间推理任务中的潜力。在静态和动态空间推理基准上的广泛实验表明，GeoSR始终优于先前的方法，并通过有效利用几何信息建立了新的最佳性能。项目页面可在https://suhzhang.github.io/GeoSR/获取。

Summary / 总结

This paper addresses the limitation of vision-language models in performing spatial reasoning by proposing GeoSR, a framework that encourages models to actively reason with geometry tokens. GeoSR includes Geometry-Unleashing Masking to weaken non-geometric shortcuts and Geometry-Guided Fusion to adaptively amplify geometry token contributions. Experiments show that GeoSR outperforms previous methods and sets new state-of-the-art performance on both static and dynamic spatial reasoning benchmarks.

本文提出GeoSR框架以解决视觉语言模型(VLMs)在空间推理中的局限性，该框架鼓励VLMs利用几何线索。GeoSR包括几何释放掩码以削弱非几何捷径，并使用几何引导融合机制在关键区域适当地放大几何令牌的贡献。实验表明，GeoSR在静态和动态空间推理基准上均优于先前的方法，并建立了新的最佳性能。
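
The two components described in this entry can be illustrated with a toy sketch. It is not the authors' implementation: the uniform random masking policy, the sigmoid gate, and the names `mask_vision_tokens`, `gated_fusion`, and `gate_weights` are assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import math
import random


def mask_vision_tokens(vision_tokens, mask_ratio, rng):
    """Zero out a random fraction of 2D vision tokens (training time only).

    Assumption: uniform random masking. The abstract calls the masking
    strategic but does not specify the actual policy.
    """
    n_tokens = len(vision_tokens)
    n_masked = int(round(mask_ratio * n_tokens))
    masked = set(rng.sample(range(n_tokens), n_masked))
    return [
        [0.0] * len(tok) if i in masked else list(tok)
        for i, tok in enumerate(vision_tokens)
    ]


def gated_fusion(vision_tokens, geometry_tokens, gate_weights, gate_bias=0.0):
    """Add geometry to each vision token, scaled by a per-token gate in (0, 1)."""
    fused, gates = [], []
    for vis, geo in zip(vision_tokens, geometry_tokens):
        features = list(vis) + list(geo)  # concatenate both views of the token
        logit = sum(w * x for w, x in zip(gate_weights, features)) + gate_bias
        gate = 1.0 / (1.0 + math.exp(-logit))  # hypothetical sigmoid gate
        gates.append(gate)
        fused.append([v + gate * g for v, g in zip(vis, geo)])
    return fused, gates
```

In this sketch, masking removes 2D evidence so that a downstream model would have to rely on the geometry stream, while the gate decides per token how strongly geometry is mixed in.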

INSIGHT: Enhancing Autonomous Driving Safety through Vision-Language Models on Context-Aware Hazard Detection and Edge Case Evaluation

Authors: Dianwei Chen, Zifan Zhang, Lei Cheng, Yuchen Liu, Xianfeng Terry Yang

First: 2025-02-01T01:43:53+00:00 · Latest: 2026-03-27T17:40:45+00:00

Abstract

Autonomous driving systems face significant challenges in handling unpredictable edge-case scenarios, such as adversarial pedestrian movements, dangerous vehicle maneuvers, and sudden environmental changes. Current end-to-end driving models struggle with generalization to these rare events due to limitations in traditional detection and prediction approaches. To address this, we propose INSIGHT (Integration of Semantic and Visual Inputs for Generalized Hazard Tracking), a hierarchical vision-language model (VLM) framework designed to enhance hazard detection and edge-case evaluation. By using multimodal data fusion, our approach integrates semantic and visual representations, enabling precise interpretation of driving scenarios and accurate forecasting of potential dangers. Through supervised fine-tuning of VLMs, we optimize spatial hazard localization using attention-based mechanisms and coordinate regression techniques. Experimental results on the BDD100K dataset demonstrate a substantial improvement in hazard prediction straightforwardness and accuracy over existing models, achieving a notable increase in generalization performance. This advancement enhances the robustness and safety of autonomous driving systems, ensuring improved situational awareness and potential decision-making in complex real-world scenarios.

中文标题/摘要

标题：INSIGHT：通过上下文感知危害检测和边缘案例评估中的视觉-语言模型增强自动驾驶安全性

自动驾驶系统在处理不可预测的边缘案例场景时面临重大挑战，如对抗性行人的运动、危险的车辆操作以及突然的环境变化。当前的端到端驾驶模型由于传统检测和预测方法的局限性，在这些罕见事件上的泛化能力有限。为了解决这一问题，我们提出了INSIGHT（语义和视觉输入的综合用于泛化危害跟踪），这是一种分层的视觉-语言模型（VLM）框架，旨在增强危害检测和边缘案例评估。通过多模态数据融合，我们的方法将语义和视觉表示相结合，使驾驶场景的精确解释和潜在危险的准确预测成为可能。通过监督微调VLMs，我们使用基于注意力的机制和坐标回归技术优化了空间危害定位。在BDD100K数据集上的实验结果表明，与现有模型相比，我们的方法在危害预测的清晰度和准确性上有了显著提高，泛化性能也得到了显著提升。这一进步增强了自动驾驶系统的稳健性和安全性，确保了在复杂现实场景中的情况感知和潜在决策的改进。

Summary / 总结

The paper proposes INSIGHT, a hierarchical vision-language model framework to improve autonomous driving safety by enhancing hazard detection and edge-case evaluation. It integrates semantic and visual data to predict potential dangers more accurately and uses attention-based mechanisms and coordinate regression for spatial hazard localization. Experiments on the BDD100K dataset show a significant improvement in hazard prediction accuracy and generalization performance compared to existing models.

论文提出了一种名为INSIGHT的层次视觉-语言模型框架，以应对自动驾驶系统在处理不可预测的边缘案例场景时的挑战。INSIGHT通过融合语义和视觉表示，增强危险检测和边缘案例评估，使用多模态数据融合和注意力机制。实验在BDD100K数据集上显示了显著的危险预测准确性的提升，展示了比现有模型更好的泛化能力和鲁棒性。

Large Language Models Can Perform Automatic Modulation Classification via Discretized Self-supervised Candidate Retrieval

Authors: Mohammad Rostami, Atik Faysal, Reihaneh Gh. Roshan, Huaxia Wang, Nikhil Muralidhar, Yu-Dong Yao

First: 2025-09-30T22:20:57+00:00 · Latest: 2026-03-27T17:33:28+00:00

Abstract

Identifying wireless modulation schemes is essential for cognitive radio, but standard supervised models often degrade under distribution shift, and training domain-specific wireless foundation models from scratch is computationally prohibitive. Large Language Models (LLMs) offer a promising training-free alternative via in-context learning, yet feeding raw floating-point signal statistics into LLMs overwhelms models with numerical noise and exhausts token budgets. We introduce DiSC-AMC, a framework that reformulates Automatic Modulation Classification (AMC) as an LLM reasoning task by combining aggressive feature discretization with nearest-neighbor retrieval over self-supervised embeddings. By mapping continuous features to coarse symbolic tokens, DiSC-AMC aligns abstract signal patterns with LLM reasoning capabilities and reduces prompt length by over 50%. Simultaneously, utilizing a DINOv2 visual encoder to retrieve the k_NN most similar labeled exemplars provides highly relevant, query-specific context rather than generic class averages. On a 10-class benchmark, a fine-tuned 7B-parameter LLM using DiSC-AMC achieves 83.0% in-distribution accuracy (-10 to +10 dB) and 82.50% out-of-distribution (OOD) accuracy (-11 to -15 dB), outperforming supervised baselines. Comprehensive ablations on vanilla LLMs demonstrate the token efficiency of DiSC-AMC. A training-free 7B LLM achieves 71% accuracy using only a 0.5K-token prompt, surpassing a 200B-parameter baseline that relies on a 2.9K-token prompt. Furthermore, similarity-based exemplar retrieval outperforms naive class-average selection by over 20%. Finally, we identify a fundamental limitation of this pipeline. At extreme OOD noise levels (-30 dB), the underlying self-supervised representations collapse, degrading retrieval quality and reducing classification to random chance.

中文标题/摘要

标题：大型语言模型可以通过离散化自监督候选检索自动执行调制分类

识别无线调制方案对于认知无线电至关重要，但标准的监督模型在分布偏移时往往会退化，从头训练特定领域的无线基础模型在计算上是不可行的。大型语言模型（LLMs）通过上下文学习提供了一种无训练的替代方案，但将原始浮点信号统计直接输入LLMs会使模型受到数值噪声的困扰，并耗尽令牌预算。我们提出了DiSC-AMC框架，通过结合激进的特征离散化和基于自监督嵌入的最近邻检索，将自动调制分类（AMC）重新表述为LLM推理任务。通过将连续特征映射为粗粒度的符号令牌，DiSC-AMC将抽象的信号模式与LLM的推理能力对齐，并将提示长度减少了超过50%。同时，利用DINOv2视觉编码器检索最相似的标记示例，提供了高度相关且查询特定的上下文，而不是通用的类别平均值。在10类基准测试中，使用DiSC-AMC微调的7B参数LLM实现了83.0%的分布内准确率（-10至+10 dB）和82.50%的分布外准确率（-11至-15 dB），超过了监督基线。对标准LLM的全面消融实验表明了DiSC-AMC的令牌效率。一个无训练的7B LLM仅使用0.5 K令牌提示实现了71%的准确率，超过了依赖2.9 K令牌提示的200B参数基线。此外，基于相似性的示例检索比简单的类别平均选择高出超过20%。最后，我们确定了该管道的一个基本局限性。在极端的分布外噪声水平（-30 dB）下，底层的自监督表示会崩溃，降低检索质量，导致分类退化为随机猜测。

Summary / 总结

The paper addresses automatic modulation classification (AMC) for cognitive radio, where standard supervised models often degrade under distribution shift. It proposes DiSC-AMC, a framework that recasts AMC as an LLM reasoning task by discretizing continuous signal features into coarse symbolic tokens and retrieving the nearest labeled exemplars over self-supervised embeddings. This reduces prompt length by over 50%, and a fine-tuned 7B-parameter LLM reaches 83.0% in-distribution and 82.5% out-of-distribution accuracy on a 10-class benchmark, outperforming supervised baselines. Ablations on vanilla LLMs show the token efficiency of DiSC-AMC: a training-free 7B LLM achieves 71% accuracy with a 0.5K-token prompt, surpassing a 200B-parameter baseline that relies on a 2.9K-token prompt. At extreme OOD noise levels (-30 dB), however, the self-supervised representations collapse, degrading retrieval quality and reducing classification to random chance.

论文针对认知无线电中的自动调制分类挑战，标准监督模型在分布偏移下往往表现不佳。它引入了DiSC-AMC框架，将AMC重新表述为LLM推理任务，使用激进的特征离散化和基于自监督嵌入的最近邻检索。这种方法将提示长度减少了超过50%，并在10类基准测试中实现了83.0%的分布内准确率和82.5%的分布外准确率，优于监督基线。全面的消融实验显示了DiSC-AMC的令牌效率，一个7B参数的LLM仅使用0.5K令牌提示就达到了71%的准确率，超过了依赖2.9K令牌提示的200B参数基线。然而，在极端分布外噪声水平下，自监督表示会退化，影响检索质量。
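
The discretize-then-retrieve idea in this entry can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DiSC-AMC pipeline: the bin edges, the symbol names, the prompt wording, and the helpers `discretize`, `retrieve_top_k`, and `build_prompt` are hypothetical, and short hand-written vectors stand in for DINOv2 embeddings.

```python
import math


def discretize(value, edges, symbols):
    """Map a continuous value to a coarse symbol; needs len(symbols) == len(edges) + 1."""
    for edge, symbol in zip(edges, symbols):
        if value < edge:
            return symbol
    return symbols[-1]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, bank, k):
    """Return labels of the k bank entries most similar to the query.

    bank is a list of (embedding, label) pairs.
    """
    ranked = sorted(bank, key=lambda item: cosine(query, item[0]), reverse=True)
    return [label for _, label in ranked[:k]]


def build_prompt(features, edges, symbols, neighbor_labels):
    """Assemble a short symbolic prompt (hypothetical wording)."""
    tokens = " ".join(
        f"{name}={discretize(value, edges, symbols)}"
        for name, value in features.items()
    )
    examples = ", ".join(neighbor_labels)
    return f"Features: {tokens}\nNearest labeled examples: {examples}\nModulation:"
```

Replacing each floating-point statistic with one symbol is what shortens the prompt, and retrieving neighbors per query is what makes the in-context examples specific to the signal at hand.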

When to Think and When to Look: Uncertainty-Guided Lookback

Authors: Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yolo Y. Tang, Luchuan Song, Susan Liang, Zhongfei Mark Zhang, Jason J. Corso, Chenliang Xu

Venue: CVPR 2026

First: 2025-11-19T17:01:02+00:00 · Latest: 2026-03-27T17:10:24+00:00

Comments: Accepted to CVPR 2026

Abstract

Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty guided lookback, a training free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math focused visual reasoning datasets.

中文标题/摘要

标题：何时思考何时查看：基于不确定性回溯

测试时的思考（即生成明确的中间推理链）已被证明能提升大型语言模型的性能，并且最近在大型视觉语言模型（LVLMs）中也显示出强大的增益。然而，尽管取得了这些有希望的结果，仍然没有系统分析思考如何影响视觉推理。我们提供了首个此类分析，通过大规模、受控的比较，评估了来自InternVL3.5和Qwen3-VL家族的十个变体在MMMU-val上的表现，使用宽松的标记预算和多轮解码。我们展示了更多的思考并不总是更好的；长链往往导致错误的轨迹，忽视了图像，且表现不如标准指令模式运行的相同模型。更深入的分析表明，某些短回溯短语，明确地回溯到图像，强烈富集于成功的轨迹中，并与更好的视觉定位相关。基于这一洞察，我们提出了基于不确定性回溯的解码策略，该策略结合了不确定性信号、自适应回溯提示和广度搜索。我们的方法在整体MMMU性能上有所提升，在标准思考较弱的类别中取得最大增益，并优于几个强大的解码基线，固定模型家族和标记预算下达到新的最佳水平。我们进一步展示了该解码策略的泛化能力，在五个额外的基准上取得一致的改进，包括两个广泛的多模态套件和数学聚焦的视觉推理数据集。

Summary / 总结

The study analyzes how test-time thinking affects visual reasoning in large vision language models (LVLMs), comparing ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val. It finds that more thinking is not always better: long chains often produce long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. Short lookback phrases that explicitly refer back to the image are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this, the authors propose uncertainty-guided lookback, a training-free decoding strategy that improves overall MMMU performance, outperforms several strong decoding baselines, and sets a new state of the art under fixed model families and token budgets. The strategy also generalizes, yielding consistent improvements on five additional benchmarks.

研究通过比较InternVL3.5和Qwen3-VL家族的十种变体，探讨了测试时思考对大型视觉语言模型（LVLMs）视觉推理的影响。研究发现，更多的思考并不总是有益的，长的推理链往往会导致错误的推理。研究引入了基于不确定性指导的回溯解码策略，该策略通过结合不确定性信号和自适应回溯提示来增强视觉定位，从而在多种基准测试中取得了更好的性能，并达到了新的最先进水平。
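
The trigger logic of an uncertainty-guided lookback can be sketched as follows. The abstract does not specify the uncertainty signal or the prompt text, so the use of next-token entropy, the threshold, and the phrase in `LOOKBACK_PROMPT` are all assumptions for illustration. The breadth search mentioned in the abstract is left out.

```python
import math

# Hypothetical lookback phrase; the phrases used in the paper are not listed here.
LOOKBACK_PROMPT = " Let me look at the image again."


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def maybe_insert_lookback(text_so_far, next_token_probs, threshold):
    """Append a lookback phrase when the model looks uncertain.

    Returns the (possibly extended) text and whether a lookback was inserted.
    """
    if token_entropy(next_token_probs) > threshold:
        return text_so_far + LOOKBACK_PROMPT, True
    return text_so_far, False
```

A decoder would call `maybe_insert_lookback` between generation steps, so that confident stretches of reasoning proceed untouched and only uncertain steps are steered back toward the image.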

The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

Authors: Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene

First: 2026-03-27T16:52:46+00:00 · Latest: 2026-03-27T16:52:46+00:00

Comments: 7 figures, 5 tables

Abstract

What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.

中文标题/摘要

标题：从图片和文本学习的局限性：视觉语言模型与具身场景理解

什么样的信息足以学习人类场景理解的全部丰富性？分布假设认为，语言和图像的统计共现捕捉了视觉认知的概念知识。视觉语言模型（VLMs）在大规模配对的文本-图像语料库上进行训练，但缺乏具身经验，使其成为检验分布假设的理想工具。我们报告了两项实验，比较了18个VLMs生成的描述与超过2000名人类观察者在15项高层场景理解任务中的描述，这些任务涵盖了常识、功能、感官体验、情感反应和未来预测。由于许多任务缺乏真实答案，我们开发了一种基于人类校准余弦距离（HCD）度量，衡量VLM输出与人类反应分布的相似性，按人类内部变异性缩放。在实验1中，VLMs在常识任务上接近人类水平的表现，但在功能任务上表现出明显的缺陷，这些任务抵抗了提示工程且在新模型版本中没有改善。在实验2中，我们测试了六个解释这种功能差距的机制假设，发现缺陷是结构性的而非风格性的，并且通过提供显式空间信息也无法解决。语料库分析表明，图像字幕数据集中包含稀疏的针对代理的功能语言，这与格赖斯关于为什么具身知识可能系统性地在语言中被低估的解释一致。这些发现共同表明，从图像和文本中进行分布学习不足以进行基于功能的场景理解，暗示人类视觉认知的一些维度可能需要像照片或字幕无法编码的以代理为中心的三维体验。

Summary / 总结

The study investigates whether vision-language models (VLMs) can achieve human-level scene understanding, particularly in tasks related to affordances. Using a Human-Calibrated Cosine Distance metric, the research compares VLM outputs to human responses across 15 tasks. Experiment 1 shows VLMs perform well on general knowledge tasks but struggle with affordance tasks, which do not improve even with newer model versions. Experiment 2 tests six hypotheses and finds the deficit is structural, not stylistic, and is not resolved by providing spatial information, suggesting distributional learning is insufficient for affordance understanding.

研究探讨了视觉语言模型（VLMs）在场景理解任务中，特别是在与功能相关任务上的表现是否能达到人类水平。使用人类校准余弦距离度量，研究将VLM输出与人类在15个任务上的响应进行比较。实验1显示，VLM在一般知识任务上表现良好，但在功能任务上存在明显不足，即使使用更新的模型版本也无法改善。实验2测试了六个假设，发现不足是结构性的而非风格性的，并且通过提供空间信息也无法解决，表明基于分布的学习不足以理解功能相关的场景理解。
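
One way to read the Human-Calibrated Cosine Distance is as a ratio: how far a model response sits from human responses, divided by how far human responses sit from each other. The sketch below implements that reading. The exact formula is an assumption, since the abstract only says the distance is scaled by within-human variability, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 minus cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def human_calibrated_distance(model_vec, human_vecs):
    """Mean model-to-human distance divided by mean human-to-human distance.

    Assumed reading: values near 1.0 mean the model response is about as far
    from humans as humans are from one another.
    """
    model_to_human = sum(cosine_distance(model_vec, h) for h in human_vecs) / len(human_vecs)
    pairs = list(combinations(human_vecs, 2))
    within_human = sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
    if within_human == 0.0:
        raise ValueError("human responses are identical; cannot calibrate")
    return model_to_human / within_human
```

Normalizing by within-human spread keeps tasks with naturally diverse answers (for example affective responses) from being scored as harder than tasks where people agree closely.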

OVI-MAP: Open-Vocabulary Instance-Semantic Mapping

Authors: Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath

First: 2026-03-27T15:50:59+00:00 · Latest: 2026-03-27T15:50:59+00:00

Abstract

Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on the closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

中文标题/摘要

标题：OVI-MAP：开放词汇项实例语义映射

增量开放词汇项3D实例语义映射对于在复杂日常环境中操作的自主代理至关重要。然而，由于需要稳健的实例分割、实时处理和灵活的开放集推理，这仍然具有挑战性。现有方法通常依赖于封闭集假设或密集的逐像素语言融合，这限制了可扩展性和时间一致性。我们提出了OVI-MAP，将实例重建与语义推理解耦。我们提出构建一个类无差别的3D实例图，该图从RGB-D输入中增量构建，而语义特征仅从自动选择的少量视图中使用视觉-语言模型提取。这种设计使得在线探索过程中实例跟踪和零样本语义标注保持稳定。我们的系统实时运行，并在标准基准上优于现有的开放词汇项映射基线。

Summary / 总结

The research aims to develop robust and scalable 3D instance-semantic mapping for autonomous agents in complex environments. The method decouples instance reconstruction from semantic inference, using a class-agnostic 3D instance map built from RGB-D inputs and semantic features extracted from selected views using vision-language models. Key findings show that the system achieves real-time performance and outperforms existing open-vocabulary mapping techniques on standard benchmarks.

研究旨在为复杂环境中的自主代理开发稳健且可扩展的3D实例语义映射。方法将实例重建与语义推理分离，使用从RGB-D输入构建的类无感知3D实例图，并仅从自动选择的视图中使用视觉-语言模型提取语义特征。关键发现表明，该系统提供了稳定的实例跟踪和零样本语义标注，并在标准基准上优于现有开放词汇映射技术。
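
The decoupling in this entry means semantic features are computed for only a few views per instance. A minimal sketch of such a view selection step follows. The selection criterion (largest visible pixel count) and the name `select_views` are assumptions, since the abstract says only that views are selected automatically.

```python
import heapq


def select_views(visibility, k):
    """Pick, for each instance, the k views in which it is most visible.

    visibility maps instance_id -> {view_id: visible_pixel_count}.
    Assumption: visible area is the ranking criterion.
    """
    return {
        instance: [
            view
            for view, _ in heapq.nlargest(k, views.items(), key=lambda kv: kv[1])
        ]
        for instance, views in visibility.items()
    }
```

Only the selected views would then be passed to a vision-language model, which keeps the expensive semantic step independent of how many frames the geometric mapping consumes.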

ORION: ORthonormal Text Encoding for Universal VLM AdaptatION

Authors: Omprakash Chakraborty, Jose Dolz, Ismail Ben Ayed

Venue: CVPR

First: 2026-02-23T05:47:28+00:00 · Latest: 2026-03-27T15:33:37+00:00

Abstract

Vision language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task specific discriminability. We introduce ORION, a text encoder fine tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as plug and play module on top of various state of the art methods, and across different prediction settings (zero shot, few shot and test time adaptation), ORION improves the performance consistently and significantly.

中文标题/摘要

标题：ORION：通用VLM适应的正交文本编码

视觉语言模型（VLMs）在多种任务上展示了出色的泛化能力，但其性能仍受限于用于表示类别的文本原型的质量和几何结构。标准的零样本分类器，源自冻结的文本编码器和手工制作的提示，可能会产生相关或弱分离的嵌入，从而限制了任务特定的可区分性。我们提出了ORION，一种仅使用类别名称来改进预训练VLMs的文本编码微调框架。我们的方法通过低秩适应优化了一种新颖的损失函数，该函数整合了两个项，一个促进给定任务类别文本表示之间的成对正交性，另一个惩罚与初始类别原型的偏差。此外，我们提供了我们正交性惩罚的概率解释，并通过惠更斯定理将其与一般最大似然估计（MLE）原则联系起来。我们在11个基准和三个大型VLM骨干网络上进行了广泛的实验，表明优化后的文本嵌入为标准CLIP原型提供了强大的替代方案。作为各种最先进的方法的即插即用模块，并在不同的预测设置（零样本、少量样本和测试时适应）中，ORION能够一致且显著地提高性能。

Summary / 总结

ORION is a text encoder fine-tuning framework that enhances pretrained vision language models using only class names. It optimizes textual representations by promoting orthogonality and minimizing deviations from initial prototypes, leading to improved performance across 11 benchmarks and three large VLM backbones. ORION consistently and significantly boosts performance in zero shot, few shot, and test time adaptation settings.

ORION 是一种仅使用类别名称来增强预训练 VLM 的文本编码微调框架，通过促进正交性和最小化与初始原型的偏差来优化文本表示。在 11 个基准上的广泛实验中，使用三个大型 VLM 后端表明，ORION 在零样本、少量样本和测试时适应设置中均能显著提高任务特定的可区分性，并优于标准 CLIP 原型。
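
The two-term loss described in this entry can be written down in a toy form. The sketch assumes a squared-cosine orthogonality penalty and a squared-Euclidean anchor term. The abstract does not give the exact expressions, so these forms, the weight `anchor_weight`, and the function names are illustrative only. Low-rank adaptation of the text encoder is not shown.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def orthogonality_penalty(prototypes):
    """Sum of squared cosine similarities over all distinct class pairs."""
    unit = [normalize(p) for p in prototypes]
    total = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            dot = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += dot * dot
    return total


def anchor_penalty(prototypes, initial_prototypes):
    """Squared Euclidean deviation from the initial class prototypes."""
    return sum(
        (a - b) ** 2
        for proto, init in zip(prototypes, initial_prototypes)
        for a, b in zip(proto, init)
    )


def orion_style_loss(prototypes, initial_prototypes, anchor_weight):
    """Orthogonality term plus weighted anchor term (assumed combination)."""
    return orthogonality_penalty(prototypes) + anchor_weight * anchor_penalty(
        prototypes, initial_prototypes
    )
```

The first term pushes class prototypes apart, and the second keeps them from drifting away from what the pretrained text encoder originally produced.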

CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation

Authors: Jesse Barkley, Rumi Loghmani, Amir Barati Farimani

First: 2026-03-27T15:23:05+00:00 · Latest: 2026-03-27T15:23:05+00:00

Comments: 8 pages, 6 figures

Abstract

Existing methods for text-to-CAD generation either operate in a single pass with no geometric verification or rely on lossy visual feedback that cannot resolve dimensional errors. We present CADSmith, a multi-agent pipeline that generates CadQuery code from natural language. It then undergoes an iterative refinement process through two nested correction loops: an inner loop that resolves execution errors and an outer loop grounded in programmatic geometric validation. The outer loop combines exact measurements from the OpenCASCADE kernel (bounding box dimensions, volume, solid validity) with holistic visual assessment from an independent vision-language model Judge. This provides both the numerical precision and the high-level shape awareness needed to converge on the correct geometry. The system uses retrieval-augmented generation over API documentation rather than fine-tuning, maintaining a current database as the underlying CAD library evolves. We evaluate on a custom benchmark of 100 prompts in three difficulty tiers (T1 through T3) with three ablation configurations. Against a zero-shot baseline, CADSmith achieves a 100% execution rate (up from 95%), improves the median F1 score from 0.9707 to 0.9846, the median IoU from 0.8085 to 0.9629, and reduces the mean Chamfer Distance from 28.37 to 0.74, demonstrating that closed-loop refinement with programmatic geometric feedback substantially improves the quality and reliability of LLM-generated CAD models.

中文标题/摘要

标题：CADSmith：基于程序几何验证的多智能体CAD生成

现有的文本到CAD生成方法要么在单次操作中没有几何验证，要么依赖于有损的视觉反馈，无法解决尺寸错误。我们提出了CADSmith，这是一种多智能体流水线，能够从自然语言生成CadQuery代码。然后通过两个嵌套的修正循环进行迭代优化：内循环解决执行错误，外循环基于程序几何验证。外循环结合了OpenCASCADE内核的精确测量（边界框尺寸、体积、实体有效性）以及独立视觉语言模型Judge的整体视觉评估。这提供了所需的数值精度和高层次的形状意识，以收敛到正确的几何形状。该系统使用API文档增强的检索生成，而不是微调，保持了一个随CAD库演变的当前数据库。我们在一个包含100个提示的自定义基准上进行了评估，分为三个难度级别（T1至T3），并进行了三种消融配置。与零样本基线相比，CADSmith的执行率达到了100%（从95%提高），中位F1分数从0.9707提高到0.9846，中位IoU从0.8085提高到0.9629，平均Chamfer距离从28.37降低到0.74，表明闭环修正与程序几何反馈显著提高了LLM生成的CAD模型的质量和可靠性。

Summary / 总结

CADSmith is a multi-agent system that generates CadQuery code from natural language inputs and refines the generated CAD models through iterative correction loops. The system combines numerical precision from OpenCASCADE measurements with holistic visual assessment from a vision-language model, achieving a 100% execution rate and significant improvements in F1 score, IoU, and Chamfer Distance compared to a zero-shot baseline.

CADSmith 是一个多代理系统，从自然语言描述生成 CadQuery 代码，并通过迭代校正循环改进几何形状。它使用程序化的几何验证，结合精确测量和整体视觉评估来提高 CAD 模型的质量和可靠性。与零样本基线相比，CADSmith 显著提高了执行率、改进了 F1 分数、IoU 并减少了平均 Chamfer 距离。
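
The nested correction loops in this entry follow a simple control-flow pattern, sketched below. The callables `generate`, `execute`, and `validate` are hypothetical stand-ins for the LLM agent, the CadQuery execution step, and the geometric validation step. No real CadQuery or OpenCASCADE calls are used, and the iteration limits are assumptions.

```python
def refine(prompt, generate, execute, validate, max_outer=3, max_inner=3):
    """Run an inner loop for execution errors inside an outer validation loop.

    generate(prompt, feedback) -> code string
    execute(code) -> (ok, result_or_error_message)
    validate(result) -> (passed, report)
    Returns the final code and whether it passed validation.
    """
    code = generate(prompt, None)
    for _ in range(max_outer):
        # Inner loop: regenerate until the code executes or attempts run out.
        for _ in range(max_inner):
            ok, result = execute(code)
            if ok:
                break
            code = generate(prompt, result)
        else:
            return code, False  # never produced executable code

        # Outer loop: check the executed geometry against the requirements.
        passed, report = validate(result)
        if passed:
            return code, True
        code = generate(prompt, report)
    return code, False
```

Keeping the two loops separate means cheap execution failures are fixed before the more expensive geometric and visual checks are spent on a candidate.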

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Authors: Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Venue: CVPR 2026

First: 2026-02-21T14:22:49+00:00 · Latest: 2026-03-27T15:03:54+00:00

Comments: 15 Pages, 8 figures, 15 tables, CVPR 2026; Code: https://github.com/AMD-AGI/DUET-VLM

Abstract

Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual tokens or drop them progressively in language backbone, often trading accuracy for speed. In this work, we propose DUET-VLM, a versatile plug-and-play dual compression framework that consists of (a) vision-only redundancy aware compression of vision encoder's output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. This coordinated token management enables aggressive compression while retaining critical semantics. On LLaVA-1.5-7B, our approach maintains over 99% of baseline accuracy with 67% fewer tokens, and still retains >97% even at 89% reduction. With this dual-stage compression during training, it achieves 99.7% accuracy at 67% and 97.6% at 89%, surpassing prior SoTA visual token reduction methods across multiple benchmarks. When integrated into Video-LLaVA-7B, it even surpasses the baseline -- achieving >100% accuracy with a substantial 53.1% token reduction and retaining 97.6% accuracy under an extreme 93.4% setting. These results highlight end-to-end training with DUET-VLM, enabling robust adaptation to reduced visual (image/video) input without sacrificing accuracy, producing compact yet semantically rich representations within the same computational budget. Our code is available at https://github.com/AMD-AGI/DUET-VLM.

中文标题/摘要

标题：DUET-VLM：双阶段统一高效视觉标记缩减框架用于VLM训练与推理

视觉语言模型（VLMs）在多模态理解和推理方面取得了显著进展，但由于密集的视觉标记化，计算成本仍然很高。现有效率方法要么合并冗余的视觉标记，要么在语言骨干中逐层丢弃它们，通常是在速度和准确性之间进行权衡。本文提出了一种名为DUET-VLM的多功能即插即用双压缩框架，该框架包括（a）仅针对视觉编码器输出的视觉冗余感知压缩，将其压缩为保留信息的标记，然后（b）逐层、基于文本的视觉标记逐层丢弃，以逐层去除不相关信息的标记。这种协调的标记管理能够在保留关键语义的同时实现激进的压缩。在LLaVA-1.5-7B上，我们的方法在67%更少的标记下保持了基线超过99%的准确性，并且即使在89%的减少下仍保持超过97%的准确性。通过在训练期间采用这种双阶段压缩，它在67%的准确率达到99.7%，在89%的准确率达到97.6%，在多个基准测试中超越了先前的最先进视觉标记缩减方法。当集成到Video-LLaVA-7B中时，它甚至超过了基线，实现了超过100%的准确性，同时减少了53.1%的标记，并在极端93.4%的设置下保持97.6%的准确性。这些结果突显了DUET-VLM端到端训练的优势，能够在不牺牲准确性的前提下，实现对减少的视觉（图像/视频）输入的稳健适应，产生紧凑且语义丰富的表示，同时保持相同的计算预算。我们的代码可在https://github.com/AMD-AGI/DUET-VLM获取。

Summary / 总结

DUET-VLM proposes a dual-stage token reduction framework for VLMs: it first compresses the vision encoder's output into information-preserving tokens, then progressively drops less informative visual tokens within the language backbone using text guidance. On LLaVA-1.5-7B, the approach maintains over 99% of baseline accuracy with 67% fewer tokens and surpasses prior state-of-the-art token reduction methods across multiple benchmarks. Integrated into Video-LLaVA-7B, it exceeds the baseline (>100% accuracy) at a 53.1% token reduction and retains 97.6% accuracy under an extreme 93.4% reduction.

研究旨在通过提出DUET-VLM双阶段压缩框架来降低视觉语言模型（VLMs）的计算成本。该框架首先压缩视觉编码器的输出，然后在语言骨干中逐层地丢弃不那么信息丰富的视觉令牌。这种方法在保持超过99%的基本准确率的同时减少了67%的令牌数量，并在各种基准测试中优于先前的最先进方法。当集成到Video-LLaVA-7B中时，它甚至在显著减少令牌数量的同时超越了基线，并在极端条件下保持了高准确率。
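
To make the reduction ratios in this entry concrete, the sketch below computes token budgets and applies a score-based drop. It assumes 576 visual tokens per image, the usual count for LLaVA-1.5 at 336-pixel input with 14-pixel patches. The salience scores and the helper names `remaining_tokens` and `drop_tokens` are hypothetical and do not reproduce the DUET-VLM scoring.

```python
def remaining_tokens(total_tokens, reduction):
    """Number of visual tokens left after removing a given fraction."""
    return int(round(total_tokens * (1.0 - reduction)))


def drop_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    scores would come from text-to-visual salience in a real system;
    here they are plain numbers supplied by the caller.
    """
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]
```

Under the 576-token assumption, a 67% reduction leaves about 190 visual tokens and an 89% reduction leaves about 63, which gives a sense of how aggressive the reported settings are.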

BeetleFlow: An Integrative Deep Learning Pipeline for Beetle Image Processing

Authors: Fangxun Liu, S M Rayeed, Samuel Stevens, Alyson East, Cheng Hsuan Chiang, Colin Lee, Daniel Yi, Junke Yang, Tejas Naik, Ziyi Wang, Connor Kilrain, Elijah H Buckwalter, Jiacheng Hou, Saul Ibaven Bueno, Shuheng Wang, Xinyue Ma, Yifan Liu, Zhiyuan Tao, Ziheng Zhang, Eric Sokol, Michael Belitz, Sydne Record, Charles V. Stewart, Wei-Lun Chao

Venue: NeurIPS 2025

First: 2025-10-31T20:55:33+00:00 · Latest: 2026-03-27T15:02:34+00:00

Comments: 4 pages, NeurIPS 2025 Workshop Imageomics

Abstract

In entomology and ecology research, biologists often need to collect a large number of insects, among which beetles are the most common species. A common practice for biologists to organize beetles is to place them on trays and take a picture of each tray. Given the images of thousands of such trays, it is important to have an automated pipeline to process the large-scale data for further research. Therefore, we develop a 3-stage pipeline to detect all the beetles on each tray, sort and crop the image of each beetle, and do morphological segmentation on the cropped beetles. For detection, we design an iterative process utilizing a transformer-based open-vocabulary object detector and a vision-language model. For segmentation, we manually labeled 670 beetle images and fine-tuned two variants of a transformer-based segmentation model to achieve fine-grained segmentation of beetles with relatively high accuracy. The pipeline integrates multiple deep learning methods and is specialized for beetle image processing, which can greatly improve the efficiency to process large-scale beetle data and accelerate biological research.

中文标题/摘要

标题：BeetleFlow：一种集成深度学习流水线用于甲虫图像处理

在昆虫学和生态学研究中，生物学家通常需要收集大量的昆虫，其中甲虫是最常见的种类。生物学家整理甲虫的常见做法是将它们放在托盘上并为每个托盘拍摄一张照片。给定成千上万张这样的托盘照片，重要的是要有一个自动化的流水线来处理大规模数据以供进一步研究。因此，我们开发了一个三阶段流水线来检测每个托盘上的所有甲虫，对每个甲虫进行排序和裁剪，并对裁剪后的甲虫进行形态学分割。在检测方面，我们设计了一个迭代过程，利用基于变换器的开放式词汇对象检测器和视觉-语言模型。在分割方面，我们手动标注了670张甲虫图像，并对两种基于变换器的分割模型进行了微调，以实现相对较高的精度的甲虫细粒度分割。该流水线集成了多种深度学习方法，专门用于甲虫图像处理，可以大大提高处理大规模甲虫数据的效率并加速生物研究。

Summary / 总结

The research aims to automate the processing of beetle tray images for entomology and ecology studies. It develops a 3-stage pipeline that detects all beetles on each tray, sorts and crops the image of each beetle, and performs morphological segmentation on the crops. Detection uses an iterative process combining a transformer-based open-vocabulary object detector with a vision-language model; segmentation fine-tunes two variants of a transformer-based segmentation model on 670 manually labeled beetle images. The authors report that the pipeline can greatly improve the efficiency of processing large-scale beetle data.

研究旨在自动化处理生态学和昆虫学研究中收集的甲虫图像。开发了一个3阶段流水线，用于检测、裁剪和分割甲虫。流水线使用基于变换器的目标检测器和视觉-语言模型进行检测，以及对基于变换器的分割模型进行微调以实现高精度分割。主要发现包括在处理大规模甲虫数据方面提高了效率，并增强了生物研究的能力。
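
The middle stage of the pipeline (sort and crop) can be sketched independently of any model. The code below orders detected bounding boxes row by row and computes padded crop rectangles. The reading-order convention, the `row_tolerance` parameter, and the function names are assumptions, since the abstract does not say how beetles are sorted.

```python
def sort_reading_order(boxes, row_tolerance):
    """Order boxes top-to-bottom by row, then left-to-right within a row.

    boxes is a list of (x_min, y_min, x_max, y_max) tuples.
    """
    def center_y(box):
        return (box[1] + box[3]) / 2

    rows = []
    for box in sorted(boxes, key=center_y):
        # Same row if the box is vertically close to the row's first box.
        if rows and center_y(box) - center_y(rows[-1][0]) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def crop_regions(boxes, padding, image_width, image_height):
    """Pad each box and clamp it to the image bounds."""
    return [
        (
            max(0, x0 - padding),
            max(0, y0 - padding),
            min(image_width, x1 + padding),
            min(image_height, y1 + padding),
        )
        for x0, y0, x1, y1 in boxes
    ]
```

A stable ordering matters when crops must be matched back to specimen records laid out in the same order on the tray.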

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Authors: Eyal Hadad, Mordechai Guri

First: 2026-03-26T12:53:49+00:00 · Latest: 2026-03-27T15:01:28+00:00

Comments: 13 pages, 8 figures

Abstract

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.

中文标题/摘要

标题：形状与实质：面向本地视觉-语言模型的双层侧信道攻击

设备端视觉-语言模型（VLMs）通过本地执行来保障数据隐私。然而，我们表明，向动态高分辨率预处理（例如AnyRes）的架构转变引入了一个固有的算法侧信道。与静态模型不同，动态预处理会根据图像的长宽比将图像分解为不同数量的块，从而产生工作负载依赖的输入。我们展示了一种针对本地VLMs的双层攻击框架。在第一层中，非特权攻击者可以借助标准的非特权操作系统指标，利用显著的执行时间差异来可靠地识别输入的几何形状。在第二层中，通过分析最后一级缓存（LLC）争用情况，攻击者可以消除相同几何形状内的语义歧义，区分视觉密集（例如医学X光片）和稀疏（例如文本文档）的内容。通过评估LLaVA-NeXT和Qwen2-VL等最先进的模型，我们表明结合这些信号可以可靠地推断出隐私敏感的上下文。最后，我们分析了缓解这一漏洞的安全工程权衡，揭示了使用恒定工作量填充带来的显著性能开销，并提出了安全边缘AI部署的实用设计建议。

Summary / 总结

The research identifies a side-channel vulnerability in on-device Vision-Language Models (VLMs): dynamic high-resolution preprocessing splits an image into a variable number of patches depending on its aspect ratio, so the workload depends on the input. The study proposes a dual-layer attack framework. Tier 1 exploits execution-time variations, observed through standard unprivileged OS metrics, to fingerprint the input's geometry, and Tier 2 profiles Last-Level Cache (LLC) contention to distinguish visually dense from sparse content within the same geometry. Evaluations on LLaVA-NeXT and Qwen2-VL show that combining these signals can reliably infer privacy-sensitive contexts. The research also reports that constant-work padding, as a mitigation, incurs substantial performance overhead, and it proposes practical design recommendations for secure Edge AI deployments.

该论文探讨了由于动态预处理而导致的在设备上运行的视觉-语言模型（VLM）的安全风险。它引入了一种双层攻击框架，其中非特权攻击者首先利用执行时间变化来识别输入的几何形状，然后通过分析最后一级缓存（LLC）争用来消除语义上的歧义。研究评估了如LLaVA-NeXT和Qwen2-VL等最先进的模型，并表明结合这些信号可以推断出敏感的隐私上下文。该研究还讨论了缓解这些漏洞的性能开销，并提出了针对安全边缘AI部署的实用设计建议。
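
The link between aspect ratio and workload described in the abstract can be illustrated with a small sketch. It is not the preprocessing code of LLaVA-NeXT or Qwen2-VL: the candidate grid list, the `TOKENS_PER_TILE` value, and the function names `pick_grid` and `visual_token_count` are all assumptions chosen for illustration. It only shows why an AnyRes-style tiler yields a token count, and therefore an amount of compute, that depends on image geometry.

```python
import math

# Hypothetical tiling grids as (columns, rows); real models define their own sets.
CANDIDATE_GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1)]
# Assumed number of visual tokens produced per tile (e.g. a 24 x 24 patch grid).
TOKENS_PER_TILE = 576


def pick_grid(width, height):
    """Return the candidate grid whose aspect ratio is closest to the image's."""
    target = math.log(width / height)
    return min(CANDIDATE_GRIDS, key=lambda g: abs(math.log(g[0] / g[1]) - target))


def visual_token_count(width, height):
    """Tokens for every tile plus one downscaled global view of the image."""
    cols, rows = pick_grid(width, height)
    return (cols * rows + 1) * TOKENS_PER_TILE


if __name__ == "__main__":
    for size in [(1000, 1000), (1000, 2000), (3000, 1000)]:
        print(size, pick_grid(*size), visual_token_count(*size))
```

Under these assumptions a square image maps to 1,152 tokens and a 3:1 panorama to 2,304, so anything that correlates with token count, such as execution time, can reveal the input's shape. Constant-work padding would force every input to the largest count, which is the source of the overhead the abstract mentions.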

MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing

Authors: Yimin Wei, Aoran Xiao, Hongruixuan Chen, Junshi Xia, Naoto Yokoya

First: 2026-03-18T09:34:23+00:00 · Latest: 2026-03-27T14:52:22+00:00

Comments: CVPR2026

Abstract

Open-vocabulary segmentation enables pixel-level recognition from an open set of textual categories, allowing generalization beyond fixed classes. Despite great potential in remote sensing, progress in this area remains largely limited to clear-sky optical data and struggles under cloudy or haze-contaminated conditions. We present MM-OVSeg, a multimodal Optical-SAR fusion framework for resilient open-vocabulary segmentation under adverse weather conditions. MM-OVSeg leverages the complementary strengths of the two modalities--optical imagery provides rich spectral semantics, while synthetic aperture radar (SAR) offers cloud-penetrating structural cues. To address the cross-modal domain gap and the limited dense prediction capability of current vision-language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. Extensive experiments demonstrate that MM-OVSeg achieves superior robustness and generalization across diverse cloud conditions. The source dataset and code are available at https://github.com/Jimmyxichen/MM-OVSeg.

中文标题/摘要

标题：MM-OVSeg：遥感领域多模态光学-SAR融合的开放词汇分割

开放词汇分割能够在开放类别文本标签集上实现像素级识别，允许超越固定类别的泛化。尽管在遥感领域具有巨大潜力，但该领域的进展仍然主要局限于晴天光学数据，并且在多云或有雾霾污染的情况下表现不佳。我们提出了MM-OVSeg，这是一种在恶劣天气条件下具有弹性的多模态光学-SAR融合框架，用于开放词汇分割。MM-OVSeg 利用了两种模态的互补优势——光学图像提供了丰富的光谱语义，而合成孔径雷达（SAR）则提供了穿透云层的结构线索。为了解决跨模态领域差距以及当前视觉语言模型的有限密集预测能力，我们提出了两种关键设计：一种跨模态统一过程，用于多传感器表示对齐，以及一种双编码器融合模块，该模块结合了来自多个视觉基础模型的多级特征，以实现文本对齐的多模态分割。广泛的实验表明，MM-OVSeg 在多种云条件下的鲁棒性和泛化能力均优于现有方法。源数据集和代码可在 https://github.com/Jimmyxichen/MM-OVSeg 获取。

Summary / 总结

MM-OVSeg is a multimodal Optical-SAR fusion framework designed for open-vocabulary segmentation in remote sensing under adverse weather conditions. It combines the rich spectral semantics of optical imagery with the cloud-penetrating structural cues from SAR to address the domain gap and improve robustness. Key experimental results show that MM-OVSeg outperforms existing methods in handling diverse cloud conditions and achieving superior generalization.

MM-OVSeg 是一种多模态光学-SAR 融合框架，旨在恶劣天气条件下进行开放词汇分割。该框架结合了光学图像丰富的光谱语义和合成孔径雷达穿透云层的结构线索，以解决现有模型的领域差距和密集预测能力有限的问题。框架包括跨模态统一过程和双编码器融合模块，增强鲁棒性和泛化能力。实验表明，MM-OVSeg 在各种云条件下优于现有方法。

ClipTTT: CLIP-Guided Test-Time Training Helps LVLMs See Better

Authors: Mriganka Nath, Anurag Das, Jiahao Xie, Bernt Schiele

First: 2026-03-27T14:47:35+00:00 · Latest: 2026-03-27T14:47:35+00:00

Comments: 30 pages, 12 figures

Abstract

Large vision-language models (LVLMs) tend to hallucinate, especially when visual inputs are corrupted at test time. We show that such corruptions act as additional distribution shifts, significantly amplifying hallucination rates in real-world applications. To address this, we propose CLIP-guided Test-Time Training (ClipTTT), a method to adapt LVLMs under degraded conditions on the fly with a single test sample. Specifically, we leverage the image-text alignment strength of a pre-trained CLIP model as a stable guidance signal to identify reliable self-supervision targets, enabling rapid adaptation without altering the base LVLMs. Extensive experiments on standard hallucination benchmarks, with 15 common corruptions, demonstrate that ClipTTT effectively mitigates hallucinations and improves descriptive faithfulness under visual corruptions.

中文标题/摘要

标题：ClipTTT：CLIP引导的测试时训练有助于LVLM更好地识别

大型视觉-语言模型（LVLMs）倾向于产生幻觉，尤其是在测试时视觉输入被破坏时。我们表明，这种破坏实际上会作为额外的数据分布变化，显著放大了实际应用中的幻觉率。为了解决这个问题，我们提出了CLIP引导的测试时训练（ClipTTT），这是一种在单一测试样本下实时适应LVLMs的方法，尤其是在条件恶劣的情况下。具体来说，我们利用预训练CLIP模型的图像-文本对齐强度作为稳定的指导信号，以识别可靠的自监督目标，从而实现快速适应而不改变基础LVLMs。在标准幻觉基准测试中，使用15种常见的破坏，实验表明ClipTTT有效地减轻了幻觉并提高了描述的真实性。

Summary / 总结

The research addresses the issue of hallucinations in large vision-language models (LVLMs) when visual inputs are corrupted during testing. It proposes CLIP-guided Test-Time Training (ClipTTT) to adapt LVLMs in real-time using a single test sample. By leveraging the image-text alignment strength of a pre-trained CLIP model, ClipTTT identifies reliable self-supervision targets, allowing for rapid adaptation without modifying the base LVLMs. Experiments show that ClipTTT reduces hallucinations and enhances descriptive faithfulness under various visual corruptions.

研究解决了大型视觉语言模型（LVLMs）在测试时视觉输入被破坏时出现幻觉的问题。提出了一种名为CLIP引导的测试时训练（ClipTTT）的方法，利用单个测试样本对LVLMs进行即时适应。该方法利用预训练CLIP模型的图像-文本对齐强度来识别可靠的自监督目标，从而在不改变基础LVLMs的情况下实现快速适应。实验表明，ClipTTT能够有效减少幻觉并提高描述的真实性。

HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

Authors: MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek

Venue: CVPR 2026

First: 2026-03-27T12:42:26+00:00 · Latest: 2026-03-27T12:42:26+00:00

Comments: Accepted in CVPR 2026; Project page, code, and dataset: https://kcsayem.github.io/handvqa/

Abstract

Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both base and fine-tuned settings, using lightweight fine-tuning via LoRA. Our findings reveal systematic limitations in current models, including hallucinated finger parts, incorrect geometric interpretations, and poor generalization. HandVQA not only exposes these critical reasoning gaps but also provides a validated path to improvement. We demonstrate that the 3D-grounded spatial knowledge learned from our benchmark transfers in a zero-shot setting, significantly improving model accuracy on novel downstream tasks like hand gesture recognition (+10.33%) and hand-object interaction (+2.63%).

中文标题/摘要

标题：HandVQA：诊断和提升视觉-语言模型关于手部精细空间推理能力

在高风险场景如机器人辅助手术、芯片制造和基于AR/VR的人机交互中，理解人类手部的精细结构至关重要。尽管当前视觉-语言模型（VLMs）在通用视觉-语言基准测试中已接近人类水平，但在精细空间推理方面仍存在困难，尤其是在解释复杂的手部姿态时。我们提出了HandVQA，这是一个大规模诊断基准，旨在通过视觉问答评估VLMs对手部解剖结构的详细理解。该基准基于高质量的3D手部数据集（FreiHAND、InterHand2.6M、FPHA），包含超过160万道控制的多项选择题，用于探究手关节之间的空间关系，如角度、距离和相对位置。我们使用轻量级的LoRA微调方法评估了几种最先进的VLMs（LLaVA、DeepSeek和Qwen-VL），在基础和微调设置下进行评估。我们的研究发现当前模型存在系统性局限，包括虚构的手指部分、错误的几何解释和差的泛化能力。HandVQA不仅揭示了这些关键的推理缺陷，还提供了一条改进的有效途径。我们证明，从该基准中学到的3D空间知识在零样本设置下可以显著提高模型在手部手势识别（+10.33%）和手部与物体交互（+2.63%）等新下游任务上的准确性。

Summary / 总结

The research aims to improve fine-grained spatial reasoning about human hands in vision-language models, crucial for high-stakes applications. HandVQA, a large-scale diagnostic benchmark, evaluates models using over 1.6M questions on hand anatomy. State-of-the-art models like LLaVA, DeepSeek, and Qwen-VL show limitations in spatial reasoning, such as incorrect geometric interpretations. The study reveals that fine-tuning with HandVQA improves performance on downstream tasks, enhancing hand gesture recognition by 10.33% and hand-object interaction by 2.63% in a zero-shot setting.

研究旨在诊断并提升视觉语言模型在手部精细空间推理方面的表现，这对于高风险应用场景至关重要。HandVQA 是一个大规模诊断基准，通过视觉问答评估模型对手部解剖结构的理解。研究发现，当前模型在处理复杂手部姿势时存在幻觉和错误的几何解释问题。然而，通过HandVQA学习到的3D空间知识在下游任务如手部手势识别和手物交互中分别提高了10.33%和2.63%的准确性，展示了该基准在模型改进中的有效性。
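
The abstract says the benchmark's questions probe angles, distances, and relative positions between hand joints taken from 3D annotations. The sketch below shows one way such a question could be derived from three 3D keypoints. The question wording, the angle thresholds, and the names `joint_angle_deg` and `make_question` are hypothetical and are not taken from the HandVQA release.

```python
import math


def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b, formed by the 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def make_question(finger, a, b, c):
    """Build a multiple-choice item; the thresholds are illustrative assumptions."""
    angle = joint_angle_deg(a, b, c)
    options = ["bent", "slightly bent", "straight"]
    if angle < 120.0:
        answer = options[0]
    elif angle < 160.0:
        answer = options[1]
    else:
        answer = options[2]
    return {
        "question": f"How is the middle joint of the {finger} finger posed?",
        "options": options,
        "answer": answer,
        "angle_deg": angle,
    }


if __name__ == "__main__":
    print(make_question("index", (0, 0, 0), (1, 0, 0), (1, 1, 0)))
```

Because the answer is computed from ground-truth 3D joints rather than annotated by hand, the same template could in principle be applied across a whole dataset, which is consistent with the benchmark's stated scale of more than 1.6M questions.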

Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

Authors: Yiming Ren, Yujiu Yang, Junjie Wang

First: 2026-03-27T11:47:39+00:00 · Latest: 2026-03-27T11:47:39+00:00

Abstract

Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning. Building on this observation, we propose Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning with only 0.14M additional parameters, with the strongest gains appearing in parameter-efficient low-rank settings.

中文标题/摘要

标题：视觉-语言微调中输入自适应深度聚合减轻推理税

在视觉指令数据上的监督微调（SFT）通常会提高视觉-语言模型（VLMs）的感知能力，同时降低推理性能，导致训练后持续存在推理税。我们研究这种退化是否与跨深度（逐层）表示的访问受到破坏有关，并发现即使固定的跨深度聚合也能显著恢复推理能力，表明保持跨深度访问是VLM微调中一个重要的缺失因素。基于这一观察，我们提出了一种轻量级机制：输入自适应深度聚合（IADA），该机制使跨深度检索具有输入自适应性和模态感知性，并通过低秩瓶颈高效参数化。在Qwen3-VL-2B上，与仅使用LoRA的微调相比，IADA仅增加0.14M参数，平均推理得分提高9.5分，平均感知得分提高3.3分，最大的增益出现在参数高效的低秩设置中。

Summary / 总结

The study addresses the degradation of reasoning performance in vision-language models (VLMs) after supervised fine-tuning (SFT) on visual instruction data. It proposes Input-Adaptive Depth Aggregation (IADA), a lightweight mechanism that makes cross-depth retrieval input-adaptive and modality-aware through a low-rank bottleneck. On Qwen3-VL-2B, IADA improves the average reasoning score by 9.5 points and the average perception score by 3.3 points over LoRA-only fine-tuning while adding only 0.14M parameters.

研究旨在解决视觉语言模型在微调后出现的推理税问题，即感知性能提升的同时伴随着推理性能的下降。研究提出了一种输入自适应深度聚合（IADA）机制，使其在跨深度检索中具有输入自适应性和模态感知性。在Qwen3-VL-2B上，IADA将推理得分提高了9.5分，感知得分提高了3.3分，仅增加了0.14M的额外参数，显示出在参数高效设置中的显著提升。
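
The abstract does not give IADA's exact formulation, so the following is only a generic sketch of input-dependent aggregation across layer depths with a low-rank bottleneck. The function `aggregate_depths`, the projection matrices `down` and `up`, and the softmax weighting are assumptions made for illustration, and the modality-aware part of the method is omitted entirely.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_depths(layer_states, query, down, up):
    """Mix per-layer hidden states with weights computed from the input.

    layer_states: list of L vectors of size d, one hidden state per layer
    query: input-dependent vector of size d (e.g. the current token state)
    down: r x d matrix and up: L x r matrix, with r much smaller than d
    """
    scores = matvec(up, matvec(down, query))  # d -> r -> L
    weights = softmax(scores)
    dim = len(layer_states[0])
    mixed = [
        sum(w * state[i] for w, state in zip(weights, layer_states))
        for i in range(dim)
    ]
    return mixed, weights


if __name__ == "__main__":
    states = [[1.0, 0.0], [3.0, 4.0]]
    mixed, weights = aggregate_depths(states, [1.0, 1.0], [[0.5, 0.5]], [[0.0], [10.0]])
    print(mixed, weights)
```

In this sketch the added parameters amount to `r * d + L * r`, which stays small when the rank `r` is small. That is the general reason a low-rank bottleneck keeps such a mechanism lightweight; the 0.14M figure in the abstract refers to the authors' own design, not to this code.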

ExtrinSplat: Decoupling Geometry and Semantics for Open-Vocabulary Understanding in 3D Gaussian Splatting

Authors: Jiayu Ding, Xinpeng Liu, Zhiyi Pan, Shiqiang Long, Ge Li

Venue: CVPR 2026

First: 2025-09-26T11:38:05+00:00 · Latest: 2026-03-27T11:41:23+00:00

Comments: Accepted to CVPR 2026

Abstract

Lifting 2D open-vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry-semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi-granularity, overlapping 3D object groups. A Vision-Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open-vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding-based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.

中文标题/摘要

标题：ExtrinSplat：在3D高斯点云渲染中解耦几何与语义以实现开放词汇理解

将2D开放词汇理解提升到3D高斯点云渲染（3DGS）场景是一个关键挑战。主流方法基于嵌入范式，存在三个主要问题：（i）几何语义不一致，点而非对象作为语义基础，限制了语义保真度；（ii）语义膨胀，将千兆字节的特征数据注入几何结构；（iii）语义僵化，一个高斯特征难以捕捉丰富的多义性。为克服这些限制，我们提出了ExtrinSplat框架，该框架基于外在范式，解耦几何与语义。ExtrinSplat不嵌入特征，而是将高斯点聚类为多粒度、重叠的3D对象组。视觉-语言模型（VLM）随后解释这些组以生成轻量级文本假设，创建一个外在索引层，原生支持复杂的多义性。通过用轻量级索引替换昂贵的特征嵌入，ExtrinSplat将场景适应时间从数小时缩短到几分钟，并将存储开销降低几个数量级。在开放词汇3D对象选择和语义分割基准任务中，ExtrinSplat优于现有嵌入范式框架，验证了所提外在范式的有效性和效率。

Summary / 总结

ExtrinSplat addresses the limitations of embedding-based methods in 3D Gaussian Splatting by decoupling geometry and semantics. It clusters Gaussians into multi-granularity, overlapping 3D object groups and uses a Vision-Language Model to generate lightweight textual hypotheses, reducing scene adaptation time and storage overhead. Experiments show that ExtrinSplat outperforms existing embedding-based frameworks on open-vocabulary 3D object selection and semantic segmentation tasks.

ExtrinSplat通过将几何与语义解耦，解决了基于嵌入方法在3D高斯点绘中的局限性。它将高斯点聚类成多粒度、重叠的3D对象组，并使用视觉语言模型生成轻量级的文本假设，从而减少场景适应时间和存储开销。实验表明，ExtrinSplat在开放词汇3D对象选择和语义分割任务中优于现有框架。

TimeSenCLIP: A Time Series Vision-Language Model for Remote Sensing

Authors: Pallavi Jain, Diego Marcos, Dino Ienco, Roberto Interdonato, Tristan Berchoux

First: 2025-08-16T05:44:33+00:00 · Latest: 2026-03-27T10:59:31+00:00

Comments: Accepted (ISPRS Journal of Photogrammetry and Remote Sensing)

Abstract

Vision-language models (VLMs) have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face several key challenges, such as the dependence on caption-based supervision, which is often not available or very limited in terms of the covered semantics, and the fact of being adapted from generic VLM architectures that are suitable for very high resolution images. Consequently, these models tend to prioritize spatial context over spectral and temporal information, limiting their effectiveness for medium-resolution remote sensing imagery. In this work, we present TimeSenCLIP, a lightweight VLM for remote sensing time series, using a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery, without requiring textual annotations. Unlike prior VLMs, TimeSenCLIP emphasizes temporal and spectral signals over spatial context, investigating whether single-pixel time series contain sufficient information for solving a variety of tasks.

中文标题/摘要

标题：TimeSenCLIP：用于遥感的时序视觉-语言模型

视觉-语言模型（VLMs）在遥感应用中显示出显著的潜力，特别是在通过零样本分类和检索进行土地利用和土地覆盖（LULC）制图方面。然而，当前的方法面临几个关键挑战，如依赖基于描述词的监督，这种监督往往不可用或在覆盖的语义方面非常有限，以及从适用于非常高分辨率图像的通用VLM架构中适应而来。因此，这些模型倾向于优先考虑空间上下文，而忽视光谱和时间信息，限制了其对中分辨率遥感图像的有效性。在本文中，我们提出了TimeSenCLIP，这是一种用于遥感时序的轻量级VLM，使用跨视图时间对比框架将多光谱Sentinel-2时序与地理标记的地面图像对齐，而无需文本注释。与之前的VLMs不同，TimeSenCLIP强调时间信号和光谱信号而非空间上下文，探讨单像素时序是否包含解决各种任务所需的信息。

Summary / 总结

TimeSenCLIP is a lightweight vision-language model for remote sensing time series that emphasizes temporal and spectral signals over spatial context. It uses a cross-view temporal contrastive framework to align multispectral Sentinel-2 time series with geo-tagged ground-level imagery, without requiring textual annotations. The work investigates whether single-pixel time series contain sufficient information to solve a variety of tasks, targeting medium-resolution imagery where existing VLMs adapted from very-high-resolution settings are less effective.

研究旨在提高视觉-语言模型（VLMs）在遥感应用中的有效性，特别是用于土地利用和土地覆盖制图。TimeSenCLIP是一种轻量级VLM，使用跨视图时间对比框架将多光谱Sentinel-2时间序列与地理标记的地面图像对齐，无需文本注释，并强调时间和光谱信号而非空间上下文。该工作探讨单像素时间序列是否包含足够的信息来解决各种任务，以提升模型对中分辨率遥感图像的适用性。

GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation

Authors: Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, Qing Li

First: 2026-03-27T10:33:08+00:00 · Latest: 2026-03-27T10:33:08+00:00

Comments: 28 pages, 8 figures, 7 tables

Abstract

Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias - they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, limiting their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis, performing progressive three-stage retrieval - domain classification, topic extraction, and relevance matching - to identify task-relevant tutorial videos. Second, a fully automated annotation pipeline built on an inverse dynamics paradigm feeds consecutive keyframes enhanced with UI element detection into VLMs, inferring the required planning and grounding knowledge that are injected into the agent's corresponding modules to address both manifestations of domain bias. Extensive experiments on OSWorld demonstrate GUIDE's generality as a plug-and-play component for both multi-agent systems and single-model agents. It consistently yields over 5% improvements and reduces execution steps - without modifying any model parameters or architecture - validating GUIDE as an architecture-agnostic enhancement to bridge GUI agent domain bias.

中文标题/摘要

标题：GUIDE：通过实时网络视频检索和即插即用标注解决GUI代理的领域偏差

大型多模态模型赋予GUI代理强大的界面理解和交互能力。然而，由于在训练过程中缺乏对特定软件操作数据的充分接触，这些代理表现出显著的领域偏差——它们对特定应用程序的具体操作流程（规划）和UI元素布局（定位）缺乏熟悉，限制了其在实际任务中的表现。本文提出GUIDE（GUI领域偏差通过指令视频驱动的专业知识解决），这是一种无需训练的即插即用框架，通过检索增强的自动化标注流水线自主从网络教程视频中获取领域特定的专业知识以解决GUI代理的领域偏差。GUIDE引入了两项关键创新。首先，一种基于字幕的视频-RAG流水线通过字幕分析解锁视频语义，进行逐步的三阶段检索——领域分类、主题提取和相关性匹配，以识别与任务相关的教程视频。其次，基于逆动力学范式的完全自动化标注流水线将连续的关键帧与UI元素检测增强后输入到多模态模型中，推断出所需的规划和定位知识，注入到代理相应的模块中以解决领域偏差的两种表现形式。在OSWorld上的广泛实验表明，GUIDE作为多代理系统和单模型代理的即插即用组件具有通用性。它在不修改任何模型参数或架构的情况下，始终提供超过5%的改进并减少执行步骤，验证了GUIDE作为一种架构无关的增强方法来解决GUI代理领域偏差的有效性。

Summary / 总结

This paper addresses the domain bias issue in GUI agents by proposing GUIDE, a training-free framework that acquires domain-specific expertise from web tutorial videos. It uses a subtitle-driven retrieval-augmented generation pipeline to identify relevant videos and an automated annotation pipeline to infer planning and grounding knowledge. Experiments on OSWorld show that GUIDE improves task performance by over 5% and reduces execution steps without altering model parameters or architecture, demonstrating its effectiveness as a plug-and-play solution for GUI agents.

本文提出GUIDE框架，通过利用网络教程视频解决GUI代理的领域偏差问题，该框架无需训练即可插件化使用。GUIDE使用基于字幕的Video-RAG管道检索相关视频，并使用自动化注释管道从UI元素中推断出规划和定位知识。OSWorld上的实验表明，GUIDE在不修改模型参数或架构的情况下，能够提高任务性能超过5%，并减少执行步骤，证明了其作为GUI代理领域偏差解决方案的有效性。

GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation

Authors: Xujing Tao, Chuxin Wang, Yubo Ai, Zhixin Cheng, Zhuoyuan Li, Liangsheng Liu, Yujia Chen, Xinjun Li, Qiao Li, Wenfei Yang, Tianzhu Zhang

Venue: CVPR 2026

First: 2026-03-27T10:29:19+00:00 · Latest: 2026-03-27T10:29:19+00:00

Comments: Accepted to CVPR 2026

Abstract

Open-vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open-vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry-semantic consistency for open-vocabulary 3D segmentation. Specifically, we introduce an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features for estimating per-point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance-level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter-Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross-instance consistency for same-category objects, mitigating viewpoint-induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.

中文标题/摘要

标题：GeoGuide：层次几何指导下的开放词汇3D语义分割

开放词汇3D语义分割旨在分割训练集之外的任意类别。现有方法主要依赖于从2D开放词汇模型中提炼知识。然而，将3D特征对齐到2D表示空间限制了内在的3D几何学习，并继承了2D预测中的错误。为了解决这些限制，我们提出了一种名为GeoGuide的新框架，该框架利用预训练的3D模型来整合层次几何语义一致性以进行开放词汇3D分割。具体而言，我们引入了一种基于不确定性超点蒸馏模块来融合几何和语义特征以估计每个点的不确定性，自适应加权超点内的2D特征以抑制噪声同时保留区分性信息以增强局部语义一致性。此外，我们的实例级掩码重建模块利用几何先验来通过重建完整的实例掩码来在实例内部强制语义一致性。另外，我们的跨实例关系一致性模块对齐几何和语义相似性矩阵以校准同一类别对象之间的跨实例一致性，减轻视角引起的语义漂移。在ScanNet v2、Matterport3D和nuScenes上的广泛实验表明GeoGuide的优越性能。

Summary / 总结

GeoGuide is a novel framework for open-vocabulary 3D semantic segmentation that integrates hierarchical geometry-semantic consistency. It uses an Uncertainty-based Superpoint Distillation module to fuse geometric and semantic features, an Instance-level Mask Reconstruction module to enforce semantic consistency within instances, and an Inter-Instance Relation Consistency module to align geometric and semantic similarity matrices. Experiments on ScanNet v2, Matterport3D, and nuScenes show that GeoGuide outperforms existing methods in open-vocabulary 3D semantic segmentation.

GeoGuide 是一种用于开放词汇3D语义分割的新型框架，通过集成层次几何语义一致性。它使用基于不确定性超点蒸馏模块融合几何和语义特征，使用实例级掩码重构模块在实例内强制语义一致性，并使用跨实例关系一致性模块对几何和语义相似性矩阵进行对齐。在ScanNet v2、Matterport3D和nuScenes上的实验表明，GeoGuide 在开放词汇3D语义分割中的性能优于现有方法。
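
The abstract describes weighting 2D features inside each superpoint by per-point uncertainty, so that noisy points contribute less. A minimal sketch of that idea follows. The `exp(-u)` weighting and the name `pool_superpoint` are assumptions; the paper's Uncertainty-based Superpoint Distillation module may estimate and apply uncertainty quite differently.

```python
import math


def pool_superpoint(features, uncertainties):
    """Pool per-point features in one superpoint, down-weighting uncertain points.

    features: list of N feature vectors (one per point), all the same length
    uncertainties: list of N non-negative scores; higher means less reliable
    """
    weights = [math.exp(-u) for u in uncertainties]  # assumed weighting rule
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print(pool_superpoint(feats, [0.0, 0.0]))  # equal trust in both points
    print(pool_superpoint(feats, [0.0, 5.0]))  # second point is mostly ignored
```

With equal uncertainties the pooling reduces to a plain average, while a large uncertainty on one point pushes the pooled feature toward the others. That is the sense in which such weighting can suppress noise while preserving discriminative information.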

Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

Authors: Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh

Venue: CVPR 2026

First: 2026-03-27T09:34:07+00:00 · Latest: 2026-03-27T09:34:07+00:00

Comments: Accepted to CVPR 2026

Abstract

Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong performance in multimodal reasoning, offering bidirectional attention, parallel token generation, and iterative refinement. However, their potential for GUI grounding remains unexplored. In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. We adapt LLaDA-V for single-turn action and bounding-box prediction, framing the task as text generation from multimodal input. To better capture the hierarchical structure of bounding-box geometry, we propose a hybrid masking schedule that combines linear and deterministic masking, improving grounding accuracy by up to 6.1 points in Step Success Rate (SSR) over the GUI-adapted LLaDA-V trained with linear masking. Evaluations on four datasets spanning web, desktop, and mobile interfaces show that the adapted diffusion model with hybrid masking consistently outperforms the linear-masked variant and performs competitively with autoregressive counterparts despite limited pretraining. Systematic ablations reveal that increasing diffusion steps, generation length, and block length improves accuracy but also increases latency, with accuracy plateauing beyond a certain number of diffusion steps. Expanding the training data with diverse GUI domains further reduces latency by about 1.3 seconds and improves grounding accuracy by an average of 20 points across benchmarks. These results demonstrate that discrete DVLMs are a promising modeling framework for GUI grounding and represent an important step toward diffusion-based GUI agents.

中文标题/摘要

标题：迈向GUI代理：GUI定位的视觉-语言扩散模型

自回归（AR）视觉-语言模型（VLMs）长期以来主导着多模态理解、推理和图形用户界面（GUI）定位。最近，离散扩散视觉-语言模型（DVLMs）在多模态推理方面表现出色，提供双向注意力、并行标记生成和迭代细化。然而，它们在GUI定位方面的潜力尚未被探索。在本文中，我们评估离散DVLMs是否可以作为AR模型的可行替代品用于GUI定位。我们为单轮操作和边界框预测调整了LLaDA-V，将任务框架为从多模态输入生成文本。为了更好地捕捉边界框几何结构的层次结构，我们提出了一种混合掩码计划，结合了线性和确定性掩码，与仅使用线性掩码训练的GUI适应型LLaDA-V相比，在步骤成功率（SSR）上提高了6.1个百分点。在四个涵盖网络、桌面和移动界面的数据集上的评估表明，带有混合掩码的调整扩散模型始终优于线性掩码变体，并且在有限预训练的情况下与自回归对应物竞争。系统性消融表明，增加扩散步骤、生成长度和块长度可以提高准确性，但也增加了延迟，准确性在一定数量的扩散步骤后趋于平稳。通过扩展训练数据以涵盖更多样的GUI领域，延迟进一步减少了约1.3秒，并且在基准测试中平均提高了20个百分点的定位准确性。这些结果表明，离散DVLMs是GUI定位的有前途的建模框架，并代表了基于扩散的GUI代理的重要一步。

Summary / 总结

This work explores the use of discrete diffusion vision-language models (DVLMs) for GUI grounding, an area previously dominated by autoregressive (AR) models. The authors adapt LLaDA-V for single-turn action and bounding-box prediction, and propose a hybrid masking schedule to better capture bounding-box geometry. The model with hybrid masking outperforms the linear-masked variant and performs competitively with AR counterparts. Systematic ablations show that increasing diffusion steps and generation length improves accuracy but increases latency, while expanding training data reduces latency and improves accuracy across benchmarks.

这项研究探索了离散扩散视觉-语言模型（DVLMs）在GUI定位中的应用，这是之前由自回归（AR）模型主导的领域。作者将LLaDA-V适配于单轮动作和边界框预测，并提出了一种混合掩码调度以更好地捕捉边界框几何结构。尽管预训练有限，采用混合掩码的模型仍优于线性掩码变体，并与AR模型表现相当。系统性消融表明，增加扩散步骤和生成长度可以提高准确性但会增加延迟，而扩展训练数据可以减少延迟并提高准确性。

IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping

Authors: Tingyang Xiao, Liu Liu, Wei Feng, Zhengyu Zou, Xiaolin Zhou, Wei Sui, Hao Li, Dingwen Zhang, Zhizhong Su

First: 2026-02-21T03:57:01+00:00 · Latest: 2026-03-27T09:23:03+00:00

Abstract

Geometry foundation models have significantly advanced dense geometric SLAM, yet existing systems often lack deep semantic understanding and robust loop closure capabilities. Meanwhile, contemporary semantic mapping approaches are frequently hindered by decoupled architectures and fragile data association. We propose IRIS-SLAM, a novel RGB semantic SLAM system that leverages unified geometric-instance representations derived from an instance-extended foundation model. By extending a geometry foundation model to concurrently predict dense geometry and cross-view consistent instance embeddings, we enable a semantic-synergized association mechanism and instance-guided loop closure detection. Our approach effectively utilizes viewpoint-agnostic semantic anchors to bridge the gap between geometric reconstruction and open-vocabulary mapping. Experimental results demonstrate that IRIS-SLAM significantly outperforms state-of-the-art methods, particularly in map consistency and wide-baseline loop closure reliability.

中文标题/摘要

标题：IRIS-SLAM：统一几何实例表示的鲁棒语义定位与建图

几何基础模型显著推进了密集几何SLAM的发展，但现有系统往往缺乏深入的语义理解和鲁棒的回环闭合能力。同时，当前的语义建图方法经常受到解耦架构和脆弱数据关联的阻碍。我们提出IRIS-SLAM，这是一种新颖的RGB语义SLAM系统，利用从实例扩展基础模型中派生的统一几何实例表示。通过将几何基础模型扩展以同时预测密集几何和跨视图一致的实例嵌入，我们实现了语义协同关联机制和实例引导的回环闭合检测。我们的方法有效利用了视角无关的语义锚点，以弥合几何重建与开放词汇建图之间的差距。实验结果表明，IRIS-SLAM在地图一致性及宽基线回环闭合可靠性方面显著优于现有方法。

Summary / 总结

IRIS-SLAM is a novel RGB semantic SLAM system that integrates geometric and semantic information through unified geometric-instance representations. It extends a geometry foundation model to predict dense geometry and instance embeddings, enabling better semantic association and loop closure detection. Experimental results show that IRIS-SLAM outperforms existing methods in terms of map consistency and loop closure reliability, especially for wide-baseline scenarios.

IRIS-SLAM 是一种结合几何和语义信息的新型 RGB 语义 SLAM 系统，通过统一的实例表示进行融合。通过扩展几何基础模型来同时预测密集几何和跨视图一致的实例嵌入，IRIS-SLAM 提升了语义关联和回环检测能力。实验结果表明，IRIS-SLAM 在地图一致性及宽基线回环检测可靠性方面显著优于现有方法。

See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay

Authors: Ashish Baghel, Paras Chopra

Venue: AAAI 2026

First: 2026-03-12T06:48:57+00:00 · Latest: 2026-03-27T09:16:31+00:00

Comments: 11 pages, 13 figures. Accepted to LMReasoning Workshop at AAAI 2026

Abstract

Vision-Language Models (VLMs) excel at describing visual scenes, yet struggle to translate perception into precise, grounded actions. We investigate whether providing VLMs with both the visual frame and the symbolic representation of the scene can improve their performance in interactive environments. We evaluate three state-of-the-art VLMs across Atari games, VizDoom, and AI2-THOR, comparing frame-only, frame with self-extracted symbols, frame with ground-truth symbols, and symbol-only pipelines. Our results indicate that all models benefit when the symbolic information is accurate. However, when VLMs extract symbols themselves, performance becomes dependent on model capability and scene complexity. We further investigate how accurately VLMs can extract symbolic information from visual inputs and how noise in these symbols affects decision-making and gameplay performance. Our findings reveal that symbolic grounding is beneficial in VLMs only when symbol extraction is reliable, and highlight perception quality as a central bottleneck for future VLM-based agents.

中文标题/摘要

标题：看，符号化，行动：通过空间表示为VLMs提供落地依据以提升游戏表现

视觉-语言模型（VLMs）在描述视觉场景方面表现出色，但在将感知转化为精确、可落地的动作方面存在困难。我们研究同时向VLMs提供视觉帧和场景的符号表示是否能提高其在交互环境中的表现。我们评估了三个最先进的VLMs在Atari游戏、VizDoom和AI2-THOR中的表现，比较了仅帧、帧加自提取符号、帧加真实符号以及仅符号的流水线。我们的结果显示，当符号信息准确时，所有模型都能受益。然而，当VLMs自己提取符号时，性能变得依赖于模型能力和场景复杂度。我们进一步研究了VLMs从视觉输入中提取符号信息的准确程度，以及这些符号中的噪声如何影响决策和游戏表现。我们的研究发现，只有在符号提取可靠时，符号接地才对VLMs有益，并突显感知质量是未来基于VLM的代理的核心瓶颈。

Summary / 总结

This study investigates whether giving Vision-Language Models (VLMs) a symbolic representation of the scene alongside the visual frame improves their performance in interactive environments. Three state-of-the-art VLMs were evaluated on Atari games, VizDoom, and AI2-THOR under four pipelines: frame-only, frame with self-extracted symbols, frame with ground-truth symbols, and symbol-only. All models benefit when the symbolic information is accurate, but when VLMs extract symbols themselves, performance depends on model capability and scene complexity. The research identifies perception quality as a central bottleneck for VLM-based agents.

研究探讨在提供视觉帧的同时向视觉语言模型（VLMs）提供场景的符号表示，是否能提升其在交互环境中的性能。研究在Atari游戏、VizDoom和AI2-THOR上评估了三种最先进的VLMs，结果显示准确的符号信息可以提高所有模型的性能。然而，当VLMs自己提取符号时，其表现取决于模型能力和场景复杂性。研究还指出，感知质量是基于VLM的代理的核心瓶颈。

Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning

Authors: Bozhao Li, Shaocong Wu, Tong Shao, Senqiao Yang, Qiben Shan, Zhuotao Tian, Jingyong Su

First: 2026-03-27T08:50:11+00:00 · Latest: 2026-03-27T08:50:11+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in open-vocabulary object detection focus primarily on two aspects: scaling up datasets and leveraging contrastive learning to align language and vision modalities. However, these approaches often neglect internal consistency within a single modality, particularly when background or environmental changes occur. This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG functions as a data generation mechanism, producing images that contain the same objects across diverse backgrounds. This is essential because existing datasets alone do not support our CCL framework. The CCLoss further enforces the invariance of object features despite environmental changes, thereby improving the model's robustness in different scenes. These strategies collectively form a unified framework for ensuring contextual consistency within the same modality. Our method achieves state-of-the-art performance, surpassing previous approaches by +16.3 AP on OmniLabel and +14.9 AP on D3. These results demonstrate the importance of enforcing intra-modal consistency, significantly enhancing model generalization in diverse environments. Our code is publicly available at: https://github.com/bozhao-li/CCL.

中文标题/摘要

标题：超越对比：通过上下文一致性学习增强开放词汇目标检测的鲁棒性

开放词汇目标检测的最新进展主要集中在两个方面：扩大数据集和利用对比学习来对齐语言和视觉模态。然而，这些方法往往忽视了单一模态内部的一致性，尤其是在背景或环境变化时。这种一致性缺失导致性能下降，因为模型在不同场景中难以检测同一对象，揭示了鲁棒性差距。为解决这一问题，我们引入了上下文一致性学习（CCL），这是一种新颖的框架，结合了两种关键策略：上下文自举数据生成（CBDG）和上下文一致性损失（CCLoss）。CBDG作为数据生成机制，生成包含相同对象但在不同背景下的图像。这很重要，因为现有的数据集本身不支持我们的CCL框架。CCLoss进一步确保了在环境变化下对象特征的不变性，从而提高了模型在不同场景中的鲁棒性。这些策略共同形成了一种确保同一模态内上下文一致性的统一框架。我们的方法在OmniLabel上达到了最先进的性能，比之前的方法高出+16.3 AP，在D3上高出+14.9 AP。这些结果表明，强制执行模态内一致性的重要性，显著增强了模型在不同环境中的泛化能力。我们的代码已公开发布于：https://github.com/bozhao-li/CCL。

Summary / 总结

The paper addresses the robustness gap in open-vocabulary object detection by introducing Contextual Consistency Learning (CCL), which enhances model performance through Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). CBDG generates images with consistent object appearances across different backgrounds, while CCLoss ensures object features remain invariant under environmental changes. The method achieves state-of-the-art performance, improving AP scores by +16.3 on OmniLabel and +14.9 on D3, highlighting the importance of intra-modal consistency for better generalization in diverse environments.

论文通过引入Contextual Consistency Learning (CCL)框架，包括Contextual Bootstrapped Data Generation (CBDG)和Contextual Consistency Loss (CCLoss)，解决了开放词汇对象检测的鲁棒性问题。CBDG生成具有相同物体但在不同背景下的图像，而CCLoss确保在环境变化下物体特征的不变性。该方法在OmniLabel和D3上分别实现了+16.3和+14.9的AP分数提升，强调了在多样化环境中增强模型泛化能力的重要性。
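
The idea behind a consistency loss, keeping an object's features invariant when only the background changes, can be sketched as follows. This is a minimal illustration and not the paper's CCLoss: the abstract does not give the loss formula, so the cosine-distance form, the function names, and the assumption of non-zero feature vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consistency_loss(
    features_a: List[Sequence[float]],
    features_b: List[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) over paired object features.

    features_a[i] and features_b[i] are assumed to describe the same
    object rendered on two different backgrounds (hypothetical setup).
    """
    if not features_a or len(features_a) != len(features_b):
        raise ValueError("expected two non-empty lists of equal length")
    total = sum(
        1.0 - cosine_similarity(fa, fb)
        for fa, fb in zip(features_a, features_b)
    )
    return total / len(features_a)
```

The loss is zero when paired features point in the same direction and grows as a background change rotates the object feature, which is the behavior the abstract attributes to enforcing intra-modal consistency.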

Compositional Image Synthesis with Inference-Time Scaling

Authors: Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

First: 2025-10-28T07:16:21+00:00 · Latest: 2026-03-27T08:35:54+00:00

Comments: project page: https://github.com/gcl-inha/ReFocus

Abs · PDF · Code1 · Code2 · Code3

Abstract

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.

中文标题/摘要

标题：基于推理时缩放的组成性图像合成

尽管现代文本到图像模型具有惊人的逼真度，但在组成性方面仍然存在挑战，经常无法准确渲染对象数量、属性和空间关系。为了解决这一挑战，我们提出了一种无需训练的框架，该框架结合了以对象为中心的方法与自我精炼，以提高布局忠实度并保持美学质量。具体而言，我们利用大型语言模型（LLMs）从输入提示中合成明确的布局，并将这些布局注入到图像生成过程中，其中以对象为中心的视觉语言模型（VLM）评估重新排序多个候选方案，以迭代选择最符合提示的输出。通过将明确的布局接地与基于自我精炼的推理时缩放统一起来，我们的框架在场景对齐方面比最近的文本到图像模型表现更佳。代码可在https://github.com/gcl-inha/ReFocus获取。

Summary / 总结

The research aims to improve the compositionality of text-to-image models, which often render object counts, attributes, and spatial relations inaccurately. The training-free method uses large language models to generate explicit layouts from input prompts and injects them into the image generation process. An object-centric vision-language model (VLM) judge then iteratively reranks multiple candidates to select the most prompt-aligned result. The framework achieves stronger scene alignment with prompts than recent text-to-image models while preserving aesthetic quality. The code is available at https://github.com/gcl-inha/ReFocus.

研究旨在通过解决文本到图像模型在准确的对象数量、属性和空间关系方面的不足，提高其组成性。方法结合了以对象为中心的方法和自我完善，使用大型语言模型从输入提示生成明确的布局，然后将这些布局注入到图像生成过程中。对象中心的视觉语言模型（VLM）会迭代地重新排名并选择最符合提示的结果。实验结果显示，该框架在场景与提示的对齐方面优于最近的文本到图像模型。
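
The layout-then-rerank loop described in this entry can be sketched as a best-of-N search. This is a rough illustration under stated assumptions, not the ReFocus implementation: `plan_layout`, `generate`, and `judge` are hypothetical callables standing in for the LLM layout planner, the image generator, and the VLM judge, and the abstract does not say how one round feeds the next, so this sketch only keeps the running best candidate.

```python
from typing import Any, Callable, Tuple


def refine_with_judge(
    prompt: str,
    plan_layout: Callable[[str], Any],
    generate: Callable[[str, Any, int], Any],
    judge: Callable[[str, Any], float],
    num_candidates: int = 4,
    num_rounds: int = 2,
) -> Tuple[Any, float]:
    """Return the highest-scoring candidate and its judge score."""
    # Hypothetical step 1: an LLM turns the prompt into an explicit layout.
    layout = plan_layout(prompt)
    best_image, best_score = None, float("-inf")
    for round_idx in range(num_rounds):
        for i in range(num_candidates):
            # Use a distinct seed per candidate so samples differ.
            seed = round_idx * num_candidates + i
            image = generate(prompt, layout, seed)
            # Hypothetical step 2: a VLM judge scores prompt alignment.
            score = judge(prompt, image)
            if score > best_score:
                best_image, best_score = image, score
    return best_image, best_score
```

Spending more rounds or candidates trades extra inference compute for a better chance of a well-aligned image, which is the inference-time scaling the title refers to.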

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Authors: Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

Venue: CVPR

First: 2026-03-27T07:22:04+00:00 · Latest: 2026-03-27T07:22:04+00:00

Comments: Computer Vision and Pattern Recognition (CVPR) 2026

Abs · PDF · Code1 · Code2

Abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.

中文标题/摘要

标题：在自监督变换器中发现分布式对象中心属性

自监督视觉变换器（ViTs）如DINO在最终层的[CLS]标记注意图中表现出发现对象的潜在能力，但这些图通常包含虚假激活，导致对象定位不佳。这是因为[CLS]标记在图像级目标上进行训练，总结了整个图像的信息，而不是专注于对象。这种聚合稀释了存在于局部、块级交互中的对象中心信息。我们通过计算所有层中块级注意组件（查询、键和值）之间的相似性来分析这一点。我们发现：(1) 对象中心的属性编码在所有三个组件（$q, k, v$）的相似性图中，而先前的工作仅使用键特征或[CLS]标记。(2) 这种对象中心的信息分布在整张网络中，而不仅仅是局限于最终层。基于这些见解，我们提出了一种无需训练的方法Object-DINO，该方法根据其块的相似性在所有层中聚类注意头，并自动识别对应所有对象的对象中心簇。我们通过两个应用展示了Object-DINO的有效性：增强无监督对象发现（+3.6到+12.4 CorLoc增益）和通过提供视觉定位减轻多模态大型语言模型中的对象幻觉。我们的结果表明，使用这种分布式对象中心信息可以改善下游任务，而无需额外的训练。

Summary / 总结

The research aims to improve object localization in self-supervised Vision Transformers by analyzing patch-level attention components. The method computes inter-patch similarities from query, key, and value features across all layers and finds that object-centric properties are distributed throughout the network, not just in the final layer. The resulting training-free method, Object-DINO, improves unsupervised object discovery (+3.6 to +12.4 CorLoc) and mitigates object hallucination in multimodal large language models by providing visual grounding.

研究旨在通过利用分布在各层中的对象中心信息来提高自监督视觉变换器的对象定位能力。方法利用查询、键和值组件计算各层的块间相似性，发现对象中心特性编码在这些相似性图中。实验表明，无需训练的Object-DINO方法可以增强无监督对象发现，并在多模态大语言模型中减轻对象幻觉，实现CorLoc增益+3.6到+12.4，并提供视觉定位。
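
The inter-patch similarity analysis mentioned above can be illustrated for a single attention head. This is only a sketch under assumptions: the abstract says similarity maps are derived from the query, key, and value components, but does not say how they are combined, so averaging the three cosine-similarity maps is an assumption, as are the function names. Head clustering across layers is not shown.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _unit(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def patch_similarity(features: Sequence[Sequence[float]]) -> Matrix:
    """Cosine similarity between every pair of patch feature vectors."""
    unit = [_unit(v) for v in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def combined_similarity(
    q: Sequence[Sequence[float]],
    k: Sequence[Sequence[float]],
    v: Sequence[Sequence[float]],
) -> Matrix:
    """Average the q, k, and v similarity maps (assumed combination rule)."""
    maps = [patch_similarity(m) for m in (q, k, v)]
    n = len(q)
    return [
        [sum(m[i][j] for m in maps) / 3.0 for j in range(n)]
        for i in range(n)
    ]
```

Patches that belong to the same object should score close to 1 against each other and lower against background patches, which is the kind of map the entry describes as object-centric.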

ACD-CLIP: Decoupling Representation and Dynamic Fusion for Zero-Shot Anomaly Detection

Authors: Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Zhen Dai, Yueyi Luo

First: 2025-08-11T10:03:45+00:00 · Latest: 2026-03-27T07:03:50+00:00

Comments: 4 pages, 1 reference, 3 figures

Abs · PDF · Code1 · Code2 · Code3

Abstract

Pre-trained Vision-Language Models (VLMs) struggle with Zero-Shot Anomaly Detection (ZSAD) due to a critical adaptation gap: they lack the local inductive biases required for dense prediction and employ inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks. The source code is available at https://github.com/cockmake/ACD-CLIP.

中文标题/摘要

标题：ACD-CLIP：解耦表示和动态融合以实现零样本异常检测

预训练的跨模态模型（VLMs）在零样本异常检测（ZSAD）中面临挑战，因为它们存在一个关键的适应差距：缺乏用于密集预测的局部归纳偏置，并采用刚性特征融合范式。我们通过一个架构协同设计框架来解决这些限制，该框架联合优化特征表示和跨模态融合。我们的方法提出了一种参数高效的卷积低秩适应（Conv-LoRA）适配器，以注入局部归纳偏置，用于细粒度表示，并引入了一种动态融合网关（DFG），利用视觉上下文自适应地调节文本提示，实现强大的双向融合。在多种工业和医学基准上的广泛实验表明，该协同设计具有更高的准确性和鲁棒性，验证了这种协同设计对于稳健地适应基础模型到密集感知任务的重要性。源代码可在https://github.com/cockmake/ACD-CLIP获取。

Summary / 总结

The research aims to improve zero-shot anomaly detection using pre-trained Vision-Language Models (VLMs) by addressing their limitations in local inductive biases and feature fusion. The method introduces Conv-LoRA for efficient local bias injection and DFG for adaptive text prompt modulation, enabling robust bidirectional fusion. Experiments show superior accuracy and robustness on various benchmarks, highlighting the importance of joint feature refinement and fusion for dense perception tasks.

研究通过提出ACD-CLIP，联合优化特征表示和跨模态融合，解决了预训练视觉-语言模型在零样本异常检测中的局限性。该方法引入了参数高效的Convolutional Low-Rank Adaptation (Conv-LoRA)适配器以注入局部归纳偏差，并引入了动态融合网关（DFG）以根据视觉上下文自适应地调节文本提示。实验结果表明，ACD-CLIP在各种工业和医疗基准测试中优于现有方法，突显了这种协同设计方法对于密集感知任务中稳健适应的重要性。

SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection

Authors: Jiaming Liang, Yifeng Zhan, Chunlin Liu, Weihua Zheng, Bingye Peng, Qiwei Liang, Boyang Cai, Xiaochun Mai, Qiang Nie

First: 2026-03-27T06:37:32+00:00 · Latest: 2026-03-27T06:37:32+00:00

Comments: Accepted by CVPR2026

Abs · PDF · Code1 · Code2

Abstract

Open-vocabulary object detection (OVOD) aims to detect known and unknown objects in the open world by leveraging text prompts. Benefiting from the emergence of large-scale vision-language pre-trained models, OVOD has demonstrated strong zero-shot generalization capabilities. However, when dealing with camouflaged objects, the detector often fails to distinguish and localize objects because the visual features of the objects and the background are highly similar. To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. In the specificity-aware sub-descriptions generated by multimodal large models, there still exist confusing and overly decorative modifiers. To mitigate such interference, we design a sub-description principal component contrastive fusion strategy that reduces noisy textual components. Furthermore, to address the challenge that the visual features of camouflaged objects are highly similar to those of their surrounding environment, we propose a specificity-guided regional weak alignment and dynamic focusing method, which aims to strengthen the detector's ability to discriminate camouflaged objects from background. Under the open-set evaluation setting, the proposed method achieves an AP of 56.4 on the OVCOD-D benchmark.

中文标题/摘要

标题：SDDF：特定性驱动的动态聚焦在开放词汇伪装目标检测中的应用

开放词汇目标检测（OVOD）旨在通过利用文本提示在开放世界中检测已知和未知的目标。得益于大规模视觉-语言预训练模型的出现，OVOD展示了强大的零样本泛化能力。然而，在处理伪装目标时，检测器往往难以区分和定位目标，因为目标的视觉特征与背景高度相似。为解决这一问题，我们通过增加精心选择的伪装目标图像的细粒度文本描述，构建了一个名为OVCOD-D的基准。由于可用的伪装目标数据集规模有限，我们采用在大规模目标检测数据集上预训练的检测器作为基线方法，因为它们具有更强的零样本泛化能力。在多模态大模型生成的特定性感知子描述中，仍然存在混淆和过度装饰的修饰词。为了减轻这种干扰，我们设计了一种子描述主成分对比融合策略，以减少噪声文本成分。此外，为了解决伪装目标的视觉特征与周围环境高度相似的挑战，我们提出了一种特定性引导的区域弱对齐和动态聚焦方法，旨在增强检测器从背景中区分伪装目标的能力。在开放集评估设置下，所提出的方法在OVCOD-D基准上的AP值为56.4。

Summary / 总结

The paper addresses the challenge of detecting camouflaged objects in open-vocabulary object detection (OVOD) by proposing a benchmark named OVCOD-D and a specificity-driven dynamic focusing method. The method enhances the detector's ability to distinguish camouflaged objects from their background by reducing noisy textual components and dynamically focusing on specific regions. The proposed approach achieves an average precision (AP) of 56.4 on the OVCOD-D benchmark under an open-set evaluation setting.

论文提出了一种特定性驱动的动态聚焦方法SDDF，以解决开放词汇物体检测中伪装物体的检测难题。通过构建包含精细文本描述的OVCOD-D基准，增强伪装物体的检测能力。该方法减少了噪声文本成分，并动态聚焦于区域以区分伪装物体与背景，最终在OVCOD-D基准上实现了56.4的AP。

Binary Verification for Zero-Shot Vision

Authors: Rongbin Hu, Jeffrey Liu

First: 2025-11-14T06:05:43+00:00 · Latest: 2026-03-27T04:08:31+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. We further integrate the proposed REC workflow into a real-world video processing and editing system, and present the system architecture and end-to-end pipeline in the paper. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today's VLMs.

中文标题/摘要

标题：零样本视觉的二进制验证

我们提出了一种无需训练、基于二进制验证的工作流，用于利用现成的VLM进行零样本视觉。该工作流包含两个步骤：(i) 量化，将开放性查询转换为具有少量明确候选者的多项选择题(MCQ)；(ii) 二进制化，针对每个候选者提出一个真/假问题，并确定性地解决：如果恰好有一个为真，则选择它；否则，重新使用剩余的可能候选者进行MCQ。我们在引用表达式定位(REC)、空间推理(Spatial-Map、Spatial-Grid、Spatial-Maze)和BLINK-Jigsaw上评估了该工作流。与直接回答开放性查询相比，量化为MCQ带来了显著收益，而真/假二进制化提供了持续的额外提升。在所有任务中，相同的流程均产生了显著改进，表明其通用性。我们进一步将提出的REC工作流集成到一个实际的视频处理和编辑系统中，并在论文中展示了该系统的架构和端到端管道。这些组件共同提供了一个简单且统一的工作流，强调推理时的设计而非特定任务的训练。它为使用当今的VLM实现更强的零样本视觉提供了一种实用的即插即用路径。

Summary / 总结

The paper proposes a training-free workflow for zero-shot vision using off-the-shelf Vision-Language Models (VLMs). It involves quantization to convert open-ended queries into multiple-choice questions and binarization to resolve answers through True/False questions. Evaluations on tasks such as referring expression grounding, spatial reasoning, and BLINK-Jigsaw show significant improvements over direct open-ended query answering, with quantization providing large gains and binarization offering consistent additional benefits. The workflow is integrated into a real-world video processing system, demonstrating its practicality and general applicability.

论文提出了一种无需训练的零样本视觉工作流，使用现成的视觉-语言模型（VLMs）。该方法包括量化将开放式查询转换为多项选择题，以及二值化为每个候选者提出真/假问题。与直接回答开放式查询相比，该方法在诸如参照表达定位、空间推理和BLINK-拼图等任务上表现出显著的性能提升。该工作流在不同任务上表现出通用性，并被集成到一个实际的视频处理系统中，强调推理时的设计而非特定任务的训练。
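
The binarization step and its deterministic resolution rule can be sketched in a few lines. This is a minimal illustration, not the authors' code: `ask_true_false` and `ask_mcq` are hypothetical callables wrapping a VLM, and reading "remaining plausible candidates" as "the candidates judged True, or all candidates if none were" is an assumption.

```python
from typing import Callable, List, Sequence


def binary_verify(
    question: str,
    candidates: Sequence[str],
    ask_true_false: Callable[[str, str], bool],
    ask_mcq: Callable[[str, List[str]], str],
) -> str:
    """Pick one candidate using per-candidate True/False checks."""
    # Ask one True/False question per candidate.
    accepted = [c for c in candidates if ask_true_false(question, c)]
    # Exactly one True: select it directly.
    if len(accepted) == 1:
        return accepted[0]
    # Otherwise fall back to a multiple-choice question over the
    # remaining plausible candidates (assumed: the accepted ones,
    # or every candidate when none was accepted).
    remaining = accepted if accepted else list(candidates)
    return ask_mcq(question, remaining)
```

The candidate list itself would come from the quantization step, which turns the open-ended query into a small set of explicit options before this function is called.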

Making Training-Free Diffusion Segmentors Scale with the Generative Power

Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang

Venue: CVPR 2026

First: 2026-03-06T11:35:37+00:00 · Latest: 2026-03-27T03:50:08+00:00

Comments: Accepted to CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation. (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at https://github.com/Darkbblue/goca.

中文标题/摘要

标题：利用生成能力使训练-free 扩散分割器扩展

作为强大的生成模型，文本到图像的扩散模型最近被探索用于判别任务。一系列研究致力于在无需进一步训练的情况下，将预训练的扩散模型适应到语义分割，从而产生了训练-free 扩散分割器。这些方法通常依赖于模型注意力层的交叉注意力图，这些图被认为捕捉了图像像素和文本标记之间的语义关系。理想情况下，此类方法应受益于更强大的扩散模型，即更强的生成能力应导致更好的分割。然而，我们观察到现有方法往往无法相应地扩展。为了理解这一问题，我们识别了两个潜在的差距：(i) 交叉注意力在多个头和层之间计算，但这些单独的注意力图与统一的全局表示之间存在差异。(ii) 即使有全局图，它也不直接转化为分割的准确语义相关性，因为不同文本标记之间的评分不平衡。为了弥合这些差距，我们提出了两种技术：自动聚合和逐像素重新缩放，这两者共同使训练-free 分割能够更好地利用生成能力。我们在标准语义分割基准上评估了我们的方法，并进一步将其集成到生成技术中，展示了更好的性能和更广泛的适用性。代码在 https://github.com/Darkbblue/goca.

Summary / 总结

This paper addresses the challenge of scaling training-free diffusion segmentors with the generative power of diffusion models. It identifies two key issues: discrepancies between individual attention maps and a unified global representation, and score imbalances among text tokens. To address these, the authors propose auto aggregation and per-pixel rescaling techniques. Evaluations on standard benchmarks show improved performance, and the method is integrated into a generative technique, enhancing its broad applicability.

本文旨在解决训练-free 扩散分割器随扩散模型生成能力增强而难以扩展的问题。研究识别了两个关键问题：个体注意力图与统一全局表示之间的差异，以及文本标记之间的评分不平衡。为解决这些问题，作者提出了自动聚合和逐像素重新缩放技术。该方法在标准语义分割基准上进行了评估，并展示了改进的性能和更广泛的适用性。代码可在 https://github.com/Darkbblue/goca 获取。

Any4D: Open-Prompt 4D Generation from Natural Language and Images

Authors: Hao Li, Qiao Sun

First: 2025-11-24T04:17:26+00:00 · Latest: 2026-03-27T03:15:52+00:00

Comments: The authors identified issues in the 4D generation pipeline and evaluation that affect result validity. To ensure scientific accuracy, we will revise the methodology and experiments thoroughly before resubmitting. This version should not be cited or relied upon

Abs · PDF · Code1 · Code2

Abstract

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. There is a naive observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose Primitive Embodied World Models (PEWM), which restrict video generation to fixed, shorter horizons. Our approach (1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, (2) reduces learning complexity, (3) improves data efficiency in embodied data collection, and (4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

中文标题/摘要

标题：Any4D：基于自然语言和图像的4D生成开放提示

虽然基于视频生成的具身世界模型越来越受到关注，但它们对大规模具身交互数据的依赖仍然是一个关键瓶颈。具身数据的稀缺性、收集难度和高维度性从根本上限制了语言与动作之间的对齐精度，并加剧了长时序视频生成的挑战，阻碍生成模型在具身领域实现“GPT时刻”。一个简单的观察是：具身数据的多样性远远超过了可能的基本运动空间。基于这一洞察，我们提出了基本具身世界模型（PEWM），该模型将视频生成限制在固定的较短时间范围内。我们的方法 1）实现了语言概念与机器人动作的视觉表示之间的精细对齐，2）降低了学习复杂性，3）提高了具身数据收集的数据效率，4）减少了推理延迟。通过配备模块化视觉语言模型（VLM）规划器和起始-目标热图引导机制（SGG），PEWM进一步实现了灵活的闭环控制，并支持在扩展的复杂任务中对基本级策略进行组合泛化。我们的框架利用视频模型中的时空视觉先验和VLM的语义意识，弥合了精细物理交互与高层次推理之间的差距，为可扩展、可解释和通用的具身智能铺平了道路。

Summary / 总结

The paper addresses a key bottleneck of video-generation-based embodied world models: their reliance on scarce, hard-to-collect embodied interaction data. It introduces Primitive Embodied World Models (PEWM), which restrict video generation to fixed, shorter horizons, enabling fine-grained alignment between language and actions, reducing learning complexity, improving data efficiency, and decreasing inference latency. PEWM uses a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG) to enable closed-loop control and support compositional generalization of primitive-level policies over complex tasks, bridging the gap between physical interaction and high-level reasoning.

论文提出了Primitive Embodied World Models (PEWM)，以解决生成与语言和动作高度对齐的体感数据的挑战。PEWM将视频生成限制在较短的时段内，从而实现细粒度对齐、降低学习复杂性、提高数据效率并减少推理延迟。该框架使用模块化的视觉-语言模型规划器和起始-目标热图引导机制，支持在复杂任务中对原始级策略的组合泛化，弥合物理交互与高层次推理之间的差距。
