arXiv Paper Digest

Snapshot: 20260422_0424

Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement

Authors: Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal

Venue: ACL 2026

First: 2024-11-22T18:31:47+00:00 · Latest: 2026-04-20T17:59:36+00:00

Comments: Accepted to ACL 2026 Findings. Project page: https://video-repair.github.io

Abstract

Recent text-to-video (T2V) diffusion models have made remarkable progress in generating high-quality videos. However, they often struggle to align with complex text prompts, particularly when multiple objects, attributes, or spatial relations are specified. We introduce VideoRepair, the first self-correcting, training-free, and model-agnostic video refinement framework that automatically detects fine-grained text-video misalignments and performs targeted, localized corrections. Our key insight is that even misaligned videos usually contain correctly generated regions that should be preserved rather than regenerated. Building on this observation, VideoRepair proposes a novel region-preserving refinement strategy with three stages: (i) misalignment detection, where MLLM-based evaluation with automatically generated evaluation questions identifies misaligned regions; (ii) refinement planning, which preserves correctly generated entities, segments their regions across frames, and constructs targeted prompts for misaligned areas; and (iii) localized refinement, which selectively regenerates problematic regions while preserving faithful content through joint optimization of preserved and newly generated areas. On two benchmarks, EvalCrafter and T2V-CompBench with four recent T2V backbones, VideoRepair achieves substantial improvements over recent baselines across diverse alignment metrics. Comprehensive ablations further demonstrate the efficiency, robustness, and interpretability of our framework.
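
The region-preserving idea in stage (iii) can be illustrated with a deliberately simplified sketch: given a per-frame mask of correctly generated pixels, keep those pixels and take everything else from a regenerated frame. This is plain mask compositing, not the joint optimization the abstract describes, and the names `composite_frame`, `composite_video`, and `keep_mask` are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame as rows of pixel values
Mask = List[List[int]]     # 1 = keep the original pixel, 0 = use the regenerated pixel


def composite_frame(original: Frame, regenerated: Frame, keep_mask: Mask) -> Frame:
    """Keep masked pixels from the original frame; fill the rest from the regenerated one."""
    return [
        [o if k == 1 else r for o, r, k in zip(o_row, r_row, k_row)]
        for o_row, r_row, k_row in zip(original, regenerated, keep_mask)
    ]


def composite_video(original: List[Frame], regenerated: List[Frame],
                    keep_masks: List[Mask]) -> List[Frame]:
    """Apply the per-frame compositing to every frame of a video."""
    return [composite_frame(o, r, m) for o, r, m in zip(original, regenerated, keep_masks)]


# Illustrative 2x2 frame: the diagonal is "correct" and is preserved.
original = [[1.0, 2.0], [3.0, 4.0]]
regenerated = [[9.0, 9.0], [9.0, 9.0]]
keep_mask = [[1, 0], [0, 1]]
print(composite_frame(original, regenerated, keep_mask))
```

In a diffusion setting the same masking would more plausibly be applied to latents during sampling rather than to final pixels; that detail is an assumption and is not stated in the abstract.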

Summary

The paper introduces VideoRepair, a self-correcting framework for refining text-to-video generation. It detects misalignments between text prompts and generated videos and performs localized corrections. VideoRepair uses a three-stage process: misalignment detection, refinement planning, and localized refinement. It achieves significant improvements over existing methods on two benchmarks, EvalCrafter and T2V-CompBench, across various alignment metrics.

T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

Authors: Savya Khosla, Sethuraman T, Aryan Chadha, Alex Schwing, Derek Hoiem

First: 2026-04-20T17:57:02+00:00 · Latest: 2026-04-20T17:57:02+00:00

Abstract

Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmentation; and (2) high token counts for fine-grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T-REN (Text-aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (or region tokens). T-REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch-level representations within each semantic region into region tokens and align them with region-level text annotations. With only 3.7% additional parameters compared to the vision-language backbone, this design yields substantially stronger dense cross-modal understanding while reducing the token count by orders of magnitude. Specifically, T-REN delivers +5.9 mIoU on ADE20K open-vocabulary segmentation, +18.4% recall on COCO object-level text-image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch-based vision-language backbone. The code and model are available at https://github.com/savya08/T-REN.
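
To make the token-count reduction concrete, here is a minimal sketch of pooling patch-level features into one token per semantic region. It uses plain mean pooling as a stand-in; T-REN itself uses a trained lightweight network for this step, and the function name `pool_region_tokens` and the `region_ids` input are assumptions for illustration.

```python
from typing import Dict, List


def pool_region_tokens(patch_features: List[List[float]],
                       region_ids: List[int]) -> Dict[int, List[float]]:
    """Average the patch features that share a region id into one token per region."""
    if len(patch_features) != len(region_ids):
        raise ValueError("each patch needs exactly one region id")
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feat, rid in zip(patch_features, region_ids):
        if rid not in sums:
            sums[rid] = [0.0] * len(feat)
            counts[rid] = 0
        sums[rid] = [s + f for s, f in zip(sums[rid], feat)]
        counts[rid] += 1
    return {rid: [s / counts[rid] for s in total] for rid, total in sums.items()}


# Three patches, two regions: 3 patch tokens become 2 region tokens.
patches = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
regions = [0, 0, 1]
tokens = pool_region_tokens(patches, regions)
print(len(patches), "->", len(tokens), tokens)
```

The reduction factor is simply the number of patches divided by the number of regions, which is why images with few semantic regions compress so strongly.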

Summary

This work addresses the limitations of vision-language encoders by proposing T-REN, which maps visual data to text-aligned region tokens through a lightweight network. T-REN improves dense cross-modal understanding and reduces token counts, achieving significant performance gains on tasks such as open-vocabulary semantic segmentation, object-level text-image retrieval, video object localization, and video scene parsing, while using only 3.7% additional parameters compared to the vision-language backbone.

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Authors: Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, Kaicheng Yu

First: 2025-03-18T14:56:46+00:00 · Latest: 2026-04-20T17:55:12+00:00

Abstract

The differing representation spaces required for visual understanding and generation pose a challenge in unifying them within the autoregressive paradigm of large language models. A vision tokenizer trained for reconstruction excels at capturing low-level visual appearance, making it well-suited for visual generation but lacking high-level semantic representations for understanding tasks. Conversely, a vision encoder trained via contrastive learning aligns well with language but struggles to decode back into the pixel space for generation tasks. To bridge this gap, we propose DualToken, a method that unifies representations for both understanding and generation within a single tokenizer. However, directly integrating reconstruction and semantic objectives creates conflicts, leading to degraded performance in both reconstruction fidelity and semantic accuracy. Instead of forcing a single codebook to capture both visual appearance and semantics, DualToken disentangles them by introducing separate codebooks for high-level semantics and low-level visual details. As a result, DualToken achieves 0.25 rFID and 82.0% zero-shot accuracy on ImageNet, and demonstrates strong effectiveness in downstream MLLM tasks for both understanding and generation. Specifically, our method surpasses VILA-U by 5.8 points on average across ten visual understanding benchmarks and delivers a 13% improvement on GenAI-Bench. Notably, incorporating dual visual tokens outperforms using a single token type on both understanding and generation tasks. We hope our research offers a new perspective on leveraging dual visual vocabularies for building unified vision-language models. Project page is available at https://songweii.github.io/dualtoken-project-page.
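
The idea of separate codebooks can be sketched with ordinary nearest-neighbour vector quantization: one codebook indexes a high-level semantic feature, the other indexes a low-level appearance feature. The abstract does not say how the two features are obtained or how the codebooks are trained, so the inputs and the names `nearest_code` and `dual_tokenize` below are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def nearest_code(vector: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest in squared Euclidean distance."""
    best_index, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def dual_tokenize(semantic_feat: Vector, pixel_feat: Vector,
                  semantic_codebook: List[Vector],
                  pixel_codebook: List[Vector]) -> Tuple[int, int]:
    """Quantize each feature against its own codebook; return (semantic id, pixel id)."""
    return (nearest_code(semantic_feat, semantic_codebook),
            nearest_code(pixel_feat, pixel_codebook))


semantic_codebook = [[0.0, 0.0], [1.0, 1.0]]
pixel_codebook = [[5.0, 5.0], [0.0, 1.0], [2.0, 2.0]]
print(dual_tokenize([0.9, 0.8], [1.8, 2.1], semantic_codebook, pixel_codebook))
```

Because each codebook is queried independently, neither has to trade reconstruction detail against semantic structure, which is the conflict the abstract attributes to a single shared codebook.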

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

Authors: Nitish Shukla, Surgan Jandial, Arun Ross

Venue: ACL 2026

First: 2026-04-20T17:06:20+00:00 · Latest: 2026-04-20T17:06:20+00:00

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and..."), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly enhances multi-image reasoning performance, yielding significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while simultaneously strengthening multi-image understanding capabilities, thus advancing the state of the art for holistic visual preference alignment.
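
The abstract does not spell out the training objective, but the title points to Direct Preference Optimization (DPO). As a hedged illustration of how a chosen/rejected pair is turned into a loss, the sketch below implements the standard DPO loss for a single pair. It is not the paper's hardness-aware variant, and the function names and the default `beta` are assumptions.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# The loss falls below log(2) once the policy prefers the chosen answer
# more strongly than the reference model does.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

With identical policy and reference log-probabilities the margin is zero and the loss equals log 2, which is a convenient sanity check when wiring up a training loop.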

Summary

The paper introduces S2H-DPO, a hardness-aware preference optimization framework for Vision-Language Models (VLMs) to improve multi-image reasoning. It constructs multi-image preference data across three reasoning levels: single-image localized reasoning, multi-image localized comparison, and global visual search. Evaluations on LLaVA and Qwen-VL models demonstrate that S2H-DPO significantly enhances multi-image reasoning performance while maintaining strong single-image reasoning, advancing the state of the art in visual preference alignment.

GeoRC: A Benchmark for Geolocation Reasoning Chains

Authors: Mohit Talreja, Joshua Diao, Jim Thannikary James, Radu Casapu, Tejas Santanam, Ethan Mendes, Alan Ritter, Wei Xu, James Hays

Venue: ACL 2026

First: 2026-01-29T05:18:40+00:00 · Latest: 2026-04-20T16:58:56+00:00

Comments: Accepted to ACL 2026

Abstract

Vision Language Models (VLMs) are good at recognizing the global location of a photograph: their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. In this paper, we introduce GeoRC, the first benchmark for geolocation reasoning chains sourced directly from Champion-tier GeoGuessr experts, including the reigning world champion. This benchmark consists of 800 "ground truth" reasoning chains across 500 query scenes from GeoGuessr maps, with expert chains addressing hundreds of different discriminative attributes, such as soil properties, architecture, and license plate shapes. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human-expert scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Small open-weight VLMs such as Llama and Qwen catastrophically fail on our benchmark: they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high resolution images. We open source our benchmark for the community to use.

Summary

The paper introduces GeoRC, a benchmark for evaluating geolocation reasoning chains, addressing the gap in VLMs' ability to explain their geolocation predictions. It consists of 800 ground truth reasoning chains from 500 query scenes, sourced from GeoGuessr experts. Evaluations show that Qwen 3 as a judge correlates best with human-expert scoring. Large closed-source VLMs like Gemini and GPT 5 perform well in geolocation prediction but lag in producing auditable reasoning chains, while small open-weight VLMs like Llama and Qwen fail significantly. This highlights VLMs' challenges in extracting fine-grained visual attributes from high-resolution images.

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Authors: Jiang Zhang, Shijie Zhou, Bangya Liu, Achuta Kadambi, Zhiwen Fan

Venue: CVPR 2026

First: 2026-03-28T22:49:40+00:00 · Latest: 2026-04-20T16:53:57+00:00

Comments: CVPR 2026, Project Website: https://spatial-stack.github.io/

Abstract

Large vision-language models (VLMs) still struggle with reliable 3D spatial reasoning, a core capability for embodied and physical AI systems. This limitation arises from their inability to capture fine-grained 3D geometry and spatial relationships. While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. Moving beyond conventional late-stage vision-geometry fusion, SpatialStack stacks and synchronizes multi-level geometric features with the language backbone, enabling the model to capture both local geometric precision and global contextual semantics. Building upon this framework, we develop VLM-SpatialStack, a model that achieves state-of-the-art performance on multiple 3D spatial reasoning benchmarks. Extensive experiments and ablations demonstrate that our multi-level fusion strategy consistently enhances 3D understanding and generalizes robustly across diverse spatial reasoning tasks, establishing SpatialStack as an effective and extensible design paradigm for vision-language-geometry integration in next-generation multimodal physical AI systems.

Summary

The research aims to improve 3D spatial reasoning in vision-language models (VLMs) for embodied AI systems. To address the limitation of capturing fine-grained 3D geometry and spatial relationships, the authors propose SpatialStack, a hierarchical fusion framework that progressively aligns vision, geometry, and language representations. Experiments show that VLM-SpatialStack outperforms existing models on multiple 3D spatial reasoning benchmarks, demonstrating enhanced 3D understanding and robust generalization across tasks.

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Authors: Kangan Qian, ChuChu Xie, Yang Zhong, Jingrui Pang, Siwen Jiao, Sicong Jiang, Zilin Huang, Yunlong Wang, Kun Jiang, Mengmeng Yang, Hao Ye, Guanghao Zhang, Hangjun Ye, Guang Chen, Long Chen, Diange Yang

First: 2026-04-20T16:37:16+00:00 · Latest: 2026-04-20T16:37:16+00:00

Comments: 15 pages, 5 figures

Abstract

Vision-Language-Action (VLA) models drive next-generation autonomous systems, but training them requires scalable, high-quality annotations from complex environments. Current cloud pipelines rely on generic vision-language models (VLMs) that lack geometric reasoning and domain semantics due to their 2D image-text pretraining. To address this mismatch, we propose XEmbodied, a cloud-side foundation model that endows VLMs with intrinsic 3D geometric awareness and interaction with physical cues (e.g., occupancy grids, 3D boxes). Instead of treating geometry as auxiliary input, XEmbodied integrates geometric representations via a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. Through progressive domain curriculum and reinforcement learning post-training, XEmbodied preserves general capabilities while demonstrating robust performance across 18 public benchmarks. It significantly improves spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.

Summary

XEmbodied is a cloud-side foundation model designed to enhance vision-language-action models with 3D geometric awareness and physical cues, addressing the limitations of generic vision-language models in geometric reasoning and domain semantics. It integrates geometric representations through a structured 3D Adapter and distills physical signals into context tokens using an Efficient Image-Embodied Adapter. After progressive domain curriculum and reinforcement learning, XEmbodied shows robust performance across 18 public benchmarks, improving spatial reasoning, traffic semantics, embodied affordance, and out-of-distribution generalization for large-scale scenario mining and embodied VQA.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Authors: Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan

First: 2025-05-26T17:56:30+00:00 · Latest: 2026-04-20T16:32:17+00:00

Comments: Project Page: https://vlm-3r.github.io/

Abstract

The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.

ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

Authors: Clayton Fields, Casey Kennington

First: 2026-04-20T16:10:28+00:00 · Latest: 2026-04-20T16:10:28+00:00

Abstract

Vision-language modeling is rapidly increasing in popularity with an ever expanding list of available models. In most cases, these vision-language models have parameters in the tens of billions, which is necessary for some needs, but in many cases smaller models are necessary (e.g., on edge devices or independent robotic platforms). Unfortunately, there is little research in producing light-weight models or in training them with small datasets. Inspired by the language learning progression and data sparsity in child development, in this paper, we address both of these goals in a systematic fashion. We show that two-tower encoder models are superior to one-tower encoders in low-resource settings for discriminative English tasks. We show also that incorporating traditional convolutional networks into the two-tower transformer architecture can help produce parameter efficient vision-language models. Finally, we show that the cross-modal fusion module of two-tower encoders can vary significantly in shape and size while producing the same results. In addition, we present ESsEN, a compact vision-language model that can be trained end-to-end with relatively few resources that performs as well on several tasks with only a fraction of the parameters compared to other models. The experimental results and the tools we present here make vision-language modeling more accessible to a wider variety of researchers.

Summary

This paper addresses the challenge of building compact, discriminative vision-language models in low-resource settings. It shows that two-tower encoder models outperform one-tower encoders in such settings on discriminative English tasks, that incorporating traditional convolutional networks into the two-tower transformer architecture improves parameter efficiency, and that the cross-modal fusion module can vary significantly in shape and size while producing the same results. The resulting model, ESsEN, can be trained end-to-end with relatively few resources and performs as well as other models on several tasks with only a fraction of their parameters, making vision-language modeling more accessible to researchers with limited resources.

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

Authors: Florian Kittler, Sheethal Bhat, Andreas Maier

First: 2026-04-20T16:01:44+00:00 · Latest: 2026-04-20T16:01:44+00:00

Abstract

Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.
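
Since the headline results are reported as AUC, a short sketch of how that metric is computed may help when reading the numbers. The function below uses the pairwise (Mann-Whitney) definition: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The labels and scores are made-up illustrative values, not data from the paper.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative scores for a single finding; 3 of the 4 pairs are ranked correctly.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(roc_auc(labels, scores))  # 0.75
```

An improvement of 2-10 percentage points therefore means that 2-10% more positive/negative pairs are ranked in the right order for a given finding.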

Summary

ProtoCLIP is a refinement strategy for CLIP-style vision-language models to enhance zero-shot classification of chest X-rays. It reduces label co-occurrence bias through curated negative samples and introduces a representation-preserving distillation objective to stabilize adaptation. Evaluated on VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline and achieves a state-of-the-art AUC of 0.94 for pneumothorax. This shows that anchor-guided refinement and controlled adaptation can mitigate common zero-shot transfer issues in medical VLMs without extensive retraining.

BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections

Authors: Subin Varghese, Joshua Gao, Asad Ur Rahman, Vedhus Hoskere

First: 2025-11-16T16:30:38+00:00 · Latest: 2026-04-20T15:58:25+00:00

Abstract

Deploying embodied agents that can answer questions about their surroundings in realistic real-world settings remains difficult, partly due to the scarcity of benchmarks for episodic memory Embodied Question Answering (EQA). Inspired by the challenges of infrastructure inspections, we propose Inspection EQA as a compelling problem class for advancing episodic memory EQA. It demands multi-scale reasoning and long-range spatial understanding, while offering standardized evaluation, professional inspection reports as grounding, and egocentric imagery. We introduce BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs (in the style of OpenEQA) grounded in professional inspection reports across 200 real-world bridge scenes with 47.93 images on average per scene. We further propose a new EQA metric Image Citation Relevance to evaluate the ability of a model to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps. To address this, we propose Embodied Memory Visual Reasoning (EMVR), which formulates the inspection EQA task as a Markov decision process. EMVR shows strong performance over the baselines. Code and dataset are available at https://drags99.github.io/bridge-eqa/

Summary

The research targets embodied agents that can answer questions about their surroundings in real-world settings, using bridge inspection as the test case. It introduces BridgeEQA, a benchmark of 2,200 open-vocabulary question-answer pairs grounded in professional inspection reports across 200 real-world bridge scenes, and proposes the Image Citation Relevance metric to evaluate a model's ability to cite relevant images. Evaluations of state-of-the-art vision-language models reveal substantial performance gaps. To address them, the authors propose Embodied Memory Visual Reasoning (EMVR), which formulates inspection EQA as a Markov decision process and shows strong performance over the baselines. Code and dataset are available at https://drags99.github.io/bridge-eqa/.

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Authors: Xiangbo Gao, Sicong Jiang, Bangya Liu, Xinghao Chen, Minglai Yang, Siyuan Yang, Mingyang Wu, Jiongze Yu, Qi Zheng, Haozhi Wang, Jiayi Zhang, Jie Yang, Zihan Wang, Qing Yin, Zhengzhong Tu

First: 2026-04-17T17:28:24+00:00 · Latest: 2026-04-20T15:50:04+00:00

Abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models. Our project page is https://xiangbogaobarry.github.io/VEFX-Bench/.
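
The abstract says VEFX-Reward predicts per-dimension scores via ordinal regression but gives no formulation. One common choice is a cumulative-link model, sketched below under that assumption: a scalar latent score is compared against ordered thresholds, and the differences of the resulting cumulative probabilities give a distribution over ordered quality levels. The thresholds, latent values, and function names are hypothetical.

```python
import math
from typing import Dict, List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ordinal_probs(latent: float, thresholds: List[float]) -> List[float]:
    """Probabilities over len(thresholds) + 1 ordered levels (cumulative-link model)."""
    exceed = [sigmoid(latent - t) for t in thresholds]  # P(level > k) for each cut
    bounds = [1.0] + exceed + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(thresholds) + 1)]


def predict_level(latent: float, thresholds: List[float]) -> int:
    """Most probable ordered level, from 0 (worst) upward."""
    probs = ordinal_probs(latent, thresholds)
    return max(range(len(probs)), key=probs.__getitem__)


# One latent score per labeled dimension; the values are illustrative only.
thresholds = [-2.0, 0.0, 2.0]
latents: Dict[str, float] = {
    "instruction_following": 1.0,
    "rendering_quality": -5.0,
    "edit_exclusivity": 4.0,
}
print({name: predict_level(z, thresholds) for name, z in latents.items()})
```

Compared with plain regression, this keeps the ordering of quality levels without assuming that adjacent levels are equally spaced.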

Summary / 总结

The research introduces VEFX-Dataset, a large-scale human-annotated dataset for video editing, and VEFX-Reward, a reward model for evaluating video editing quality. The dataset includes 5,049 examples across 9 editing categories, each labeled for instruction following, rendering quality, and edit exclusivity. VEFX-Bench, a benchmark of 300 curated video-prompt pairs, is used to compare video editing systems. Experiments show that VEFX-Reward better aligns with human judgments than existing methods and highlights gaps in current models regarding visual plausibility, instruction following, and edit locality.

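The study introduces VEFX-Dataset, a large-scale human-annotated video editing dataset with 5,049 examples, and VEFX-Reward, a reward model for assessing video editing quality. Each example is labeled along three dimensions: instruction following, rendering quality, and edit exclusivity. VEFX-Bench, a benchmark built from 300 curated video-prompt pairs, is used to compare video editing systems. Experiments show that VEFX-Reward agrees with human judgments more closely than existing methods, and they reveal a persistent gap among visual plausibility, instruction following, and edit locality in current models.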

Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models

Authors: Yakoub Bazi, Mohamad M. Al Rahhal, Mansour Zuair, Faroun Mohamed

First: 2026-04-20T15:47:52+00:00 · Latest: 2026-04-20T15:47:52+00:00

Abstract

Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.

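Title and abstract (translated from Chinese)

Title: Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models

Change visual question answering (Change VQA) concerns answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark with recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines single-stage alignment with a hybrid decoder backbone. Results on the official CDVQA test splits show that recent VLMs outperform earlier specialized baselines. They also show that performance does not increase monotonically with model size, and that native multimodal models handle this task more effectively than structured vision-language pipelines. These findings indicate that tightly integrated multimodal backbones contribute more to language-driven semantic change reasoning than scale or explicit multi-depth visual conditioning.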

Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation

Authors: Chao Yuan, Yujian Zhao, Haoxuan Xu, Guanglin Niu

First: 2026-04-20T15:03:01+00:00 · Latest: 2026-04-20T15:03:01+00:00

Abstract

In text-to-image person retrieval, the diversity of natural language expressions and the implicitness of visual semantics often lead to Expression Drift: semantically equivalent texts exhibit significant feature discrepancies in the embedding space because of phrasing variations, which degrades the robustness of image-text alignment. This paper proposes MVR, a semantic compensation framework driven by Large Language Models (LLMs) that enhances cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The method comprises three components: (i) Multi-View Reformulation (MVR), a dual-branch prompting strategy that combines key feature guidance (extracting visually critical components via feature similarity) with diversity-aware rewriting to generate semantically equivalent yet distributionally diverse textual variants; (ii) Textual Feature Robustness Enhancement, a training-free latent space compensation mechanism that suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing "Semantic Echoes"; and (iii) Visual Semantic Compensation, in which a VLM generates multi-perspective image descriptions that are further enhanced through shared text reformulation to address visual semantic gaps. Experiments demonstrate that our method improves the accuracy of the original model without training and achieves state-of-the-art performance on three text-to-image person retrieval datasets.
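
The textual compensation step (mean-pooling over multi-view features plus a residual connection) can be pictured with a small sketch. The mixing weight `alpha`, the final L2 normalization, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import math


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean over the embeddings of the reformulated texts."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(vector: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def compensate(original: list[float], variants: list[list[float]], alpha: float = 0.5) -> list[float]:
    """Residual fusion: keep the original embedding and add the pooled
    multi-view embedding scaled by `alpha` (hypothetical weight)."""
    pooled = mean_pool(variants)
    fused = [o + alpha * p for o, p in zip(original, pooled)]
    return l2_normalize(fused)


# Toy 2-D embeddings: one query and two LLM-generated rewrites of it.
query = [1.0, 0.0]
rewrites = [[0.8, 0.6], [0.6, 0.8]]
robust_query = compensate(query, rewrites)
```

Averaging the variants cancels phrasing-specific noise, while the residual term keeps the fused feature anchored to the original query embedding.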

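Title and abstract (translated from Chinese)

Title: Towards Robust Text-to-Image Person Retrieval: Multi-View Reformulation for Semantic Compensation

In text-to-image person retrieval, the diversity of natural language expressions and the implicitness of visual semantics often cause Expression Drift: phrasing variations make semantically equivalent texts show significant feature differences in the embedding space, which reduces the robustness of image-text alignment. This paper proposes MVR, a semantic compensation framework driven by Large Language Models (LLMs) that strengthens cross-modal representation consistency through multi-view semantic reformulation and feature compensation. The method has three components. Multi-View Reformulation (MVR) is a dual-branch prompting strategy that combines key feature guidance (extracting visually critical components via feature similarity) with diversity-aware rewriting to generate semantically equivalent but distributionally diverse textual variants. Textual Feature Robustness Enhancement is a training-free latent space compensation mechanism that suppresses noise interference through multi-view feature mean-pooling and residual connections, effectively capturing "Semantic Echoes". Visual Semantic Compensation has a VLM generate multi-perspective image descriptions, which are further enhanced through shared text reformulation to close visual semantic gaps. Experiments show that our method improves the accuracy of the original model without training and achieves state-of-the-art performance on three text-to-image person retrieval datasets.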

Summary / 总结

This paper addresses the issue of Expression Drift in text-to-image person retrieval by proposing a semantic compensation framework (MVR) that uses Large Language Models. The method includes multi-view semantic reformulation and feature compensation to enhance cross-modal representation consistency. Key findings show that the proposed approach improves the accuracy of the original model without requiring training and achieves state-of-the-art performance on three text-to-image person retrieval datasets.

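The paper tackles Expression Drift in text-to-image person retrieval by proposing MVR, a semantic compensation framework based on Large Language Models. The method uses multi-view semantic reformulation and feature compensation to strengthen cross-modal representation consistency. Experimental results show that it improves the accuracy of the original model without training and reaches state-of-the-art performance on three text-to-image person retrieval datasets.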

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

Authors: Haoyue Tan, Shengnan Wang, Yulin Qiao, Juncheng Zhang, Youhui Bai, Ping Gong, Zewen Jin, Cheng Li

Venue: CVPR 2026 poster

First: 2026-04-20T14:43:36+00:00 · Latest: 2026-04-20T14:43:36+00:00

Comments: CVPR 2026 poster

Abstract

Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates DiT generation while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering method to query vectors for higher compression, and designs a Euclidean-similarity-preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. Experiments with CogVideoX-2B, HunyuanVideo, and Wan-2.1 on one A40 GPU demonstrate speedups ranging from 1.67x to 4.31x with negligible quality degradation.
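
To make the idea of threshold-wise, angle-based clustering concrete, here is a minimal greedy sketch: a query joins the most similar existing cluster when its cosine similarity clears a threshold, and otherwise opens a new cluster, so the cluster count adapts to the data. This is a generic illustration, not AdaCluster's algorithm; the threshold value, the use of the first member as the cluster representative, and the function names are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def threshold_cluster(vectors: list[list[float]], threshold: float = 0.9) -> list[int]:
    """Greedy angle-based clustering; returns one cluster label per vector."""
    representatives: list[list[float]] = []  # first member of each cluster
    labels: list[int] = []
    for vector in vectors:
        best_label, best_sim = -1, threshold
        for label, rep in enumerate(representatives):
            sim = cosine(vector, rep)
            if sim >= best_sim:
                best_label, best_sim = label, sim
        if best_label == -1:
            # No existing cluster is close enough in angle: open a new one.
            representatives.append(vector)
            best_label = len(representatives) - 1
        labels.append(best_label)
    return labels


queries = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
labels = threshold_cluster(queries, threshold=0.9)
```

Cosine similarity ignores vector length, which is why `[1.0, 0.0]` and `[2.0, 0.1]` land in the same cluster here; a Euclidean criterion, as the abstract describes for keys, would treat them as farther apart.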

Summary / 总结

AdaCluster is a training-free adaptive clustering framework that accelerates video diffusion transformers (DiTs) by reducing inference latency. It applies angle-similarity-preserving clustering to query vectors and Euclidean-similarity-preserving clustering to keys, together with cluster number assignment, threshold-wise adaptive clustering, and critical cluster selection. Experiments with CogVideoX-2B, HunyuanVideo, and Wan-2.1 show up to 4.31x speedup with negligible quality degradation.

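AdaCluster addresses the high inference latency of video diffusion transformers (DiTs) by adaptively clustering query and key vectors while preserving semantic similarity and adapting to heterogeneous token distributions across layers. The framework achieves up to 4.31x speedup on the CogVideoX-2B, HunyuanVideo, and Wan-2.1 models with little impact on model quality.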

Multilingual Training and Evaluation Resources for Vision-Language Models

Authors: Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini

First: 2026-04-20T14:42:47+00:00 · Latest: 2026-04-20T14:42:47+00:00

Abstract

Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development remains heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k) with permissively licensed models. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMBench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, compared with English-only data, in VLM training. Experiments with three different models show that training VLMs on multilingual, multimodal examples is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

Summary / 总结

This work addresses the limitations of Vision-Language Models (VLMs) being predominantly trained on English data by introducing a comprehensive suite of multilingual resources for five European languages. The authors use a regeneration-translation paradigm to create high-quality cross-lingual datasets, including Multi-PixMo for training and multilingual benchmarks for evaluation. Experiments with three different models show that using multilingual, multimodal examples during training improves performance on non-English benchmarks and provides positive transfer to English as well.

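The study addresses the limitation that Vision-Language Models (VLMs) are trained mainly on English data by introducing a suite of multilingual resources covering five European languages. The authors use a regeneration-translation approach to create high-quality cross-lingual datasets, including Multi-PixMo for training and a set of multilingual benchmarks. Experiments show that training with multilingual, multimodal data improves results on non-English benchmarks and also transfers positively to English.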

Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation

Authors: Jiamin Zheng, Jingwen Yu, Guangcheng Chen, Hong Zhang

First: 2026-04-20T14:35:31+00:00 · Latest: 2026-04-20T14:35:31+00:00

Comments: 9 pages, 8 figures

Abstract

Indoor robot navigation is often compromised by glass surfaces, which severely corrupt depth sensor measurements. While foundation models like Depth Anything 3 provide excellent geometric priors, they lack an absolute metric scale. We propose a training-free framework that leverages depth foundation models as a structural prior, employing a robust local RANSAC-based alignment to fuse it with raw sensor depth. This naturally avoids contamination from erroneous glass measurements and recovers an accurate metric scale. Furthermore, we introduce GlassRecon, a novel RGB-D dataset with geometrically derived ground truth for glass regions. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art baselines, especially under severe sensor depth corruption. The dataset and related code will be released at https://github.com/jarvisyjw/GlassRecon.
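
The core of RANSAC-based alignment is fitting a scale and shift that map relative depth to metric depth while ignoring corrupted sensor readings. The sketch below shows that idea on a handful of samples. It fits one global model for brevity, whereas the abstract describes a local alignment; the linear scale-and-shift model, the tolerance, the iteration count, and the function names are assumptions.

```python
import random


def fit_scale_shift(p1: tuple[float, float], p2: tuple[float, float]) -> tuple[float, float]:
    """Solve metric = scale * relative + shift exactly from two samples."""
    (x1, y1), (x2, y2) = p1, p2
    scale = (y2 - y1) / (x2 - x1)
    return scale, y1 - scale * x1


def ransac_align(relative, metric, iters=200, tol=0.05, seed=0):
    """Return the (scale, shift) supported by the most inliers.

    `relative` holds depth from a foundation model (no metric scale);
    `metric` holds raw sensor depth, possibly corrupted on glass.
    """
    rng = random.Random(seed)
    pairs = list(zip(relative, metric))
    best_model, best_inliers = None, -1
    for _ in range(iters):
        p1, p2 = rng.sample(pairs, 2)
        if abs(p2[0] - p1[0]) < 1e-9:
            continue  # degenerate sample: scale is undefined
        scale, shift = fit_scale_shift(p1, p2)
        inliers = sum(abs(scale * x + shift - y) <= tol for x, y in pairs)
        if inliers > best_inliers:
            best_model, best_inliers = (scale, shift), inliers
    return best_model


# Synthetic example: the true mapping is metric = 2.0 * relative + 0.5,
# and the last three sensor readings are corrupted (e.g. seen through glass).
relative_depth = [0.1 * i for i in range(1, 11)]
sensor_depth = [2.0 * x + 0.5 for x in relative_depth]
sensor_depth[7:] = [9.0, 9.5, 8.7]
model = ransac_align(relative_depth, sensor_depth)
```

Because the model is scored by inlier count rather than squared error, the corrupted readings cannot pull the fit toward themselves, which is the property that makes this family of estimators a natural match for glass regions.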

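Title and abstract (translated from Chinese)

Title: Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation

Indoor robot navigation is often impaired by glass surfaces, which severely disturb depth sensor measurements. Foundation models such as Depth Anything 3 provide excellent geometric priors but lack an absolute metric scale. We propose a training-free framework that uses a depth foundation model as a structural prior and fuses it with raw sensor depth through a robust local RANSAC alignment. This naturally avoids contamination from erroneous glass measurements and recovers an accurate metric scale. We also introduce GlassRecon, a new RGB-D dataset with geometrically derived ground truth for glass regions. Extensive experiments show that our method consistently outperforms state-of-the-art baselines, especially under severe sensor depth corruption. The dataset and related code will be released at https://github.com/jarvisyjw/GlassRecon.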

Summary / 总结

The research aims to improve indoor robot navigation by addressing the corruption of depth sensor measurements by glass surfaces. The method is a training-free framework that uses a depth foundation model as a structural prior and fuses it with raw sensor depth through a robust local RANSAC-based alignment, thereby avoiding erroneous glass measurements and recovering an accurate metric scale. Experiments show that the proposed approach outperforms existing methods, particularly in scenarios with severe depth sensor corruption.

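The research aims to improve indoor robot navigation by addressing the interference of glass surfaces with depth sensor measurements. The method is a training-free framework that combines a depth foundation model with a robust local RANSAC alignment to fuse the prior with raw sensor depth, which avoids erroneous glass measurements and recovers an accurate metric scale. Experiments show that the method outperforms existing approaches when the depth sensor is severely corrupted.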

Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection

Authors: Sa Zhu, Wanqian Zhang, Lin Wang, Jinchao Zhang, Cong Wang, Bo Li

Venue: SIGIR 2026

First: 2026-04-20T14:18:27+00:00 · Latest: 2026-04-20T14:18:27+00:00

Comments: Accepted by SIGIR 2026

Abstract

Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge that guides action-video alignment. Following a "conditioning, denoising, and aligning" paradigm, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the Background-Suppress Denoising (BSD) module generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge serves as an effective intermediate semantic anchor between video and text representations, mitigating the semantic gap and enhancing the discriminability of action-relevant segments. Furthermore, we introduce the Foreground-Prompt Alignment (FPA) module to inject the extracted foreground knowledge as prompt tokens into text representations, guiding the model's attention towards action-relevant segments and enabling precise cross-modal alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code repository is available at https://anonymous.4open.science/r/Code-2114/.

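Title and abstract (translated from Chinese)

Title: Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection

Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories, where effective alignment between action semantics and video representations is essential for accurate detection. Existing methods, however, struggle to ease the semantic imbalance between short, abstract action labels and rich, complex video content, which inevitably introduces semantic noise and misleads cross-modal alignment. To address this challenge, we propose DFAlign, a framework that uses diffusion-based denoising to generate foreground knowledge for action-video alignment. Following a "conditioning, denoising, and aligning" paradigm, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. The Background-Suppress Denoising (BSD) module then generates foreground knowledge by progressively removing background redundancy from videos through the denoising process. This foreground knowledge acts as an effective intermediate semantic anchor between video and text representations, narrowing the semantic gap and making action-relevant segments more discriminable. In addition, we introduce the Foreground-Prompt Alignment (FPA) module, which injects the extracted foreground knowledge into text representations as prompt tokens, steering the model's attention to action-relevant segments and enabling precise cross-modal alignment. Extensive experiments show that our method achieves state-of-the-art performance on two OV-TAD benchmarks. The code repository is available at https://anonymous.4open.science/r/Code-2114/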

Summary / 总结

DFAlign is a novel framework for Open-Vocabulary Temporal Action Detection (OV-TAD) that uses diffusion-based denoising to generate foreground knowledge for aligning action semantics with video representations. It introduces a Semantic-Unify Conditioning (SUC) module to unify action semantics and a Background-Suppress Denoising (BSD) module to remove background redundancy, creating effective intermediate semantic anchors. The Foreground-Prompt Alignment (FPA) module then injects this foreground knowledge into text representations to guide model attention towards action-relevant segments. Experiments show that DFAlign outperforms existing methods on two OV-TAD benchmarks.

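DFAlign is a new framework for Open-Vocabulary Temporal Action Detection (OV-TAD) that uses diffusion-based denoising to generate foreground knowledge, which serves as an effective intermediate semantic anchor between video and text representations. It introduces a Semantic-Unify Conditioning (SUC) module to unify action semantics and a Background-Suppress Denoising (BSD) module to remove background redundancy. The Foreground-Prompt Alignment (FPA) module then injects this foreground knowledge into text representations to guide the model's attention. Experiments show that DFAlign outperforms existing methods on two OV-TAD benchmarks.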

LoRA on the Go: Instance-level Dynamic LoRA Selection and Merging

Authors: Seungeon Lee, Soumi Das, Manish Gupta, Krishna P. Gummadi

Venue: ACL 2026

First: 2025-11-10T14:13:10+00:00 · Latest: 2026-04-20T13:05:33+00:00

Comments: Accepted as a main conference paper in ACL 2026

Abstract

Low-Rank Adaptation (LoRA) has emerged as a parameter-efficient approach for fine-tuning large language models. However, conventional LoRA adapters are typically trained for a single task, limiting their applicability in real-world settings where inputs may span diverse and unpredictable domains. At inference time, existing approaches combine multiple LoRAs to improve performance on diverse tasks, but they usually require labeled data or additional task-specific training, which is expensive at scale. In this work, we introduce LoRA on the Go (LoGo), a training-free framework that dynamically selects and merges adapters at the instance level without any additional requirements. LoGo leverages signals extracted from a single forward pass through the LoRA adapters to identify the most relevant adapters and determine their contributions on the fly. Across 5 NLP benchmarks, 27 datasets, and 3 model families, LoGo outperforms training-based baselines on some tasks by a margin of up to 3.6% while remaining competitive on other tasks and maintaining inference throughput, highlighting its effectiveness and practicality.
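
A small sketch may help picture instance-level selection and merging. Each adapter is a low-rank pair (A, B) whose update for an input x is B(Ax). The sketch scores every adapter by the norm of its update on the current input, keeps the top-k, and merges them with normalized weights. Using the update norm as the signal, the top-k rule, and all names here are assumptions; the abstract says only that the signals come from a single forward pass through the adapters.

```python
import math


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_delta(adapter, x: list[float]) -> list[float]:
    """Low-rank update B(Ax); A has shape (r, d_in), B has shape (d_out, r)."""
    a, b = adapter
    return matvec(b, matvec(a, x))


def select_and_merge(adapters, x: list[float], top_k: int = 2):
    """Score adapters on this input, keep the top-k, merge by normalized score."""
    deltas = [lora_delta(adapter, x) for adapter in adapters]
    signals = [math.sqrt(sum(v * v for v in delta)) for delta in deltas]
    top = sorted(range(len(adapters)), key=lambda i: signals[i], reverse=True)[:top_k]
    total = sum(signals[i] for i in top)
    weights = {i: (signals[i] / total if total > 0 else 1.0 / len(top)) for i in top}
    merged = [
        sum(weights[i] * deltas[i][j] for i in top) for j in range(len(deltas[0]))
    ]
    return merged, weights


# Three rank-1 adapters over a 2-D hidden state (toy values).
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
    ([[0.0, 0.0]], [[1.0], [1.0]]),
]
merged_update, weights = select_and_merge(adapters, [3.0, 1.0], top_k=2)
```

The weights are recomputed for every input, so two prompts from different domains can draw on different adapters without any labeled data or extra training.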

Summary / 总结

LoRA on the Go (LoGo) is a training-free framework that dynamically selects and merges LoRA adapters at the instance level, improving performance on diverse tasks without additional labeled data or task-specific training. Across various benchmarks and datasets, LoGo outperforms training-based baselines by up to 3.6% on some tasks while maintaining competitive performance on others and preserving inference speed.

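LoRA on the Go (LoGo) is a training-free framework that dynamically selects and merges LoRA adapters at the instance level, improving performance across diverse tasks without additional labeled data or task-specific training. Across a range of benchmarks and datasets, LoGo beats training-based baselines by up to 3.6% on some tasks, stays competitive on the others, and preserves inference speed.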

A Control Architecture for Training-Free Memory Use

Authors: Yanzhen Lu, Muchen Jiang, Zhicheng Qian, Xingyu Zhou

First: 2026-04-20T12:55:27+00:00 · Latest: 2026-04-20T12:55:27+00:00

Abstract

Prompt-injected memory can improve reasoning without updating model weights, but it also creates a control problem: retrieved content helps only when it is applied in the right state. We study this problem in a strict training-free setting and formulate it as applicability control: when to trigger a memory-assisted second pass, when to trust it, and how to maintain the memory bank over time. Our method combines uncertainty-based routing, confidence-based selective acceptance, bank selection across rule and exemplar memory, and evidence-based governance of the memory bank over time. Under a locked training-free protocol with compute-matched controls, it improves two core arithmetic benchmarks by +7.0 points on SVAMP and +7.67 points on ASDiv over baseline. The same architecture also transfers to QA and agent benchmarks with smaller positive effects and shows the same positive direction on a second checkpoint for the main arithmetic tasks. On arithmetic, the main empirical pattern is that the control architecture, rather than raw memory exposure, drives the improvements on SVAMP and ASDiv. Mechanistically, confidence separates helpful from harmful rule-bank interventions, and under fixed retrieval the repair-versus-corrupt difference localizes to rows whose retrieved set actually contains the edited entries.
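
The first two control decisions named in the abstract (when to trigger a memory-assisted second pass, and when to trust it) can be sketched as a small routing function. The thresholds, the confidence scale, and the callable interfaces below are assumptions for illustration; bank selection and memory governance are omitted.

```python
from typing import Callable, Tuple

Pass = Tuple[str, float]  # (answer, confidence in [0, 1])


def answer_with_memory_control(
    question: str,
    first_pass: Callable[[str], Pass],
    retrieve: Callable[[str], str],
    second_pass: Callable[[str, str], Pass],
    trigger_below: float = 0.6,
    accept_margin: float = 0.1,
) -> Tuple[str, str]:
    """Return (answer, decision) under a simple applicability-control policy."""
    answer, confidence = first_pass(question)
    # Uncertainty-based routing: only uncertain cases get a second pass.
    if confidence >= trigger_below:
        return answer, "kept_first_pass"
    memory = retrieve(question)
    new_answer, new_confidence = second_pass(question, memory)
    # Confidence-based selective acceptance: trust the memory-assisted
    # answer only if it is clearly more confident than the first pass.
    if new_confidence >= confidence + accept_margin:
        return new_answer, "accepted_memory"
    return answer, "rejected_memory"


# Stub model calls standing in for real LLM passes.
result = answer_with_memory_control(
    "3 + 4 * 2 = ?",
    first_pass=lambda q: ("14", 0.4),
    retrieve=lambda q: "Rule: multiply before adding.",
    second_pass=lambda q, m: ("11", 0.9),
)
```

The point of the gate is that retrieved content is never applied unconditionally: a confident first pass skips memory entirely, and a second pass that fails to raise confidence is discarded.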

MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

Authors: Yang Shi, Yifeng Xie, Minzhe Guo, Liangsi Lu, Mingxuan Huang, Jingchao Wang, Zhihong Zhu, Boyan Xu, Zhiqi Huang

Venue: ACL 2026

First: 2026-01-06T17:45:26+00:00 · Latest: 2026-04-20T12:50:29+00:00

Comments: Accepted by ACL 2026 Main

Abstract

Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 1997 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 12 representative VLMs, and even the best model, Gemini-3-Pro-Preview, classifies the error correctly in only 66.65% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal models. Project Page: https://mmerror-benchmark.github.io

Summary / 总结

MMErroR is a benchmark for evaluating Vision-Language Models (VLMs) in detecting erroneous reasoning, consisting of 1997 samples with single coherent reasoning errors across 24 subdomains. Unlike existing benchmarks, MMErroR focuses on process-level evaluation, requiring models to identify and classify errors in both visual and linguistic contexts. Even the best model, Gemini-3-Pro-Preview, correctly classifies errors in only 66.65% of cases, highlighting the difficulty of detecting erroneous reasoning. This benchmark provides insights into the capabilities of multi-modal models in understanding and correcting reasoning errors.

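MMErroR is a benchmark for evaluating how well Vision-Language Models (VLMs) detect erroneous reasoning. It contains 1997 samples, each with a single reasoning error, covering 24 subdomains. Unlike existing benchmarks, MMErroR focuses on process-level evaluation and requires models to identify and classify errors in visual and linguistic contexts. Even the best model, Gemini-3-Pro-Preview, classifies only 66.65% of errors correctly, which highlights how hard it is to detect erroneous reasoning. The benchmark offers insight into the ability of multi-modal models to understand and correct reasoning errors.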

Understanding Counting Mechanisms in Large Language and Vision-Language Models

Authors: Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah

Venue: CVPR 2026

First: 2025-11-21T18:48:22+00:00 · Latest: 2026-04-20T12:27:33+00:00

Comments: Accepted to CVPR 2026

Abstract

Counting is one of the fundamental abilities of large language models (LLMs) and large vision-language models (LVLMs). This paper examines how these foundation models represent and compute numerical information in counting tasks. We use controlled experiments with repeated textual and visual items and analyze counting in LLMs and LVLMs through a set of behavioral, observational, and causal mediation analyses. To this end, we design a specialized tool, CountScope, for the mechanistic interpretability of numerical content. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. Layerwise analyses reveal a progressive emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item, stored mainly in the final token or region. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition. We further reveal that models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and strongly influence the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.

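Title and abstract (translated from Chinese)

Title: Understanding Counting Mechanisms in Large Language and Vision-Language Models

Counting is one of the fundamental abilities of large language models (LLMs) and large vision-language models (LVLMs). This paper examines how these foundation models represent and compute numerical information in counting tasks. We run controlled experiments with repeated textual and visual items and analyze counting in LLMs and LVLMs through a series of behavioral, observational, and causal mediation analyses. For this purpose we design a dedicated tool, CountScope, for the mechanistic interpretability of numerical content. The results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred to other contexts. Layerwise analyses reveal a gradual emergence of numerical representations, with lower layers encoding small counts and higher layers representing larger ones. We identify an internal counter mechanism that updates with each item and is stored mainly in the final token or region. In LVLMs, numerical information also appears in visual embeddings and shifts between background and foreground regions depending on spatial composition. We further show that models rely on structural cues such as separators in text, which act as shortcuts for tracking item counts and strongly affect the accuracy of numerical predictions. Overall, counting emerges as a structured, layerwise process in LLMs and follows the same general pattern in LVLMs, shaped by the properties of the vision encoder.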

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

Authors: Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang

First: 2026-02-05T08:45:08+00:00 · Latest: 2026-04-20T12:18:51+00:00

Comments: 18 pages, 8 figures; CVPR 2026 paper

Abstract

While diffusion models have achieved great success in video generation, this progress is accompanied by a rapidly escalating computational burden. Among existing acceleration methods, feature caching is popular because it is training-free and delivers considerable speedup, but it inevitably loses semantics and detail under further compression. Another widely adopted method, training-aware step distillation, is successful in image generation but degrades drastically in few-step video generation. Furthermore, the quality loss becomes more severe when training-free feature caching is simply applied to step-distilled models, because of the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. With these contributions, we push the acceleration boundary to 11.8x while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. Code has been made publicly available: https://github.com/Tencent-Hunyuan/DisCa

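Title and abstract (translated from Chinese)

Title: DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

Although diffusion models have been highly successful in video generation, this progress comes with a rapidly growing computational burden. Among existing acceleration methods, feature caching is popular because it needs no training and gives substantial speedup, but it inevitably loses semantics and detail under further compression. Another widely used method, training-aware step distillation, works well for image generation but suffers severe degradation in video generation, and the quality loss is even worse when feature caching is applied to step-distilled models because the sampling steps are sparse. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We replace traditional training-free heuristics with a lightweight learnable neural predictor that captures the high-dimensional feature evolution process more accurately. We also examine the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach for more stable and lossless distillation. With these efforts, we push the acceleration boundary to 11.8x while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. Code is publicly available: https://github.com/Tencent-Hunyuan/DisCa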

Summary / 总结

This paper addresses the computational challenges in video generation by introducing a novel distillation-compatible learnable feature caching mechanism called DisCa. It uses a lightweight learnable neural predictor to capture feature evolution more accurately, and proposes a conservative Restricted MeanFlow approach for stable and lossless distillation. The method achieves a speedup of 11.8 times while maintaining generation quality, outperforming existing methods in both feature caching and step-distillation for video generation.

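The paper addresses the computational cost of video generation with diffusion models by introducing a new distillation-compatible learnable feature caching mechanism. The method uses a lightweight learnable neural predictor to capture the high-dimensional feature evolution process more accurately, overcoming the loss of semantics and detail in traditional feature caching. The proposed Restricted MeanFlow approach provides more stable, lossless distillation and achieves 11.8x acceleration without sacrificing generation quality. Extensive experiments validate the method's effectiveness.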

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

Authors: Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang

First: 2026-04-20T11:20:03+00:00 · Latest: 2026-04-20T11:20:03+00:00

Abstract

Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.
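
One way to picture selective halting is a per-layer check of how much self-attention still changes each token. In the sketch below, a token whose attention update is small relative to its hidden state is treated as stabilized and dropped from the active set for later layers. The relative-norm criterion, the threshold, and the function names are assumptions; the abstract does not give DASH's actual rule.

```python
import math


def l2_norm(vector: list[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


def update_active_tokens(
    hidden: list[list[float]],
    attn_update: list[list[float]],
    active: list[int],
    threshold: float = 0.05,
) -> list[int]:
    """Return the token indices that should keep being processed.

    hidden[i] is token i's state entering the layer and attn_update[i] is
    the change contributed by self-attention at this layer.
    """
    still_active = []
    for index in active:
        hidden_norm = l2_norm(hidden[index])
        relative_change = l2_norm(attn_update[index]) / hidden_norm if hidden_norm > 0 else 0.0
        if relative_change >= threshold:
            still_active.append(index)  # still evolving: keep computing
        # otherwise the token is considered stabilized and is halted
    return still_active


hidden_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
attention_updates = [[0.5, 0.0], [0.001, 0.0], [0.0, 0.01]]
active = update_active_tokens(hidden_states, attention_updates, [0, 1, 2])
```

Halting by index leaves the surviving tokens as a shorter dense sequence, which is one plausible reason such a policy can stay compatible with fused attention kernels.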

Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos

Authors: Haodong Chen, Qiang Huang, Jiaqi Zhao, Qiuping Jiang, Xiaojun Chang, Jun Yu

First: 2026-01-11T14:35:06+00:00 · Latest: 2026-04-20T10:53:42+00:00

Comments: 18 pages, 18 figures, and 3 tables

Abstract

Vision-Language Models (VLMs) are increasingly deployed in socially consequential settings, raising concerns about social bias driven by demographic cues. A central challenge in measuring such social bias is attribution under visual confounding: real-world images entangle race and gender with correlated factors such as background and clothing, obscuring attribution. We propose a face-only counterfactual evaluation paradigm that isolates demographic effects while preserving real-image realism. Starting from real photographs, we generate counterfactual variants by editing only facial attributes related to race and gender, keeping all other visual factors fixed. Based on this paradigm, we construct FOCUS, a dataset of 480 scene-matched counterfactual images across six occupations and ten demographic groups, and propose REFLECT, a benchmark comprising three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation. Experiments on five state-of-the-art VLMs reveal that demographic disparities persist under strict visual control and vary substantially across task formulations. These findings underscore the necessity of controlled, counterfactual audits and highlight task design as a critical factor in evaluating social bias in multimodal models.
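
Because every counterfactual set shares one scene, any difference in a model's output within a set can be attributed to the edited facial attributes. A simple way to summarize this for the numeric salary task is the spread of recommendations within each scene, averaged over scenes. This max-minus-min gap is an illustrative statistic chosen here, not necessarily the metric used in REFLECT, and the scene and group names are made up.

```python
def mean_counterfactual_gap(recommendations: dict[str, dict[str, float]]) -> float:
    """Average, over scenes, of the max-min spread across demographic groups.

    `recommendations` maps scene id -> {group name: model's numeric output}.
    A value of 0 means the model's output never changed with the face edit.
    """
    gaps = [
        max(by_group.values()) - min(by_group.values())
        for by_group in recommendations.values()
        if by_group
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Hypothetical salary recommendations (in thousands) for two scenes.
salary_outputs = {
    "scene_01": {"group_a": 50.0, "group_b": 60.0},
    "scene_02": {"group_a": 70.0, "group_b": 70.0},
}
gap = mean_counterfactual_gap(salary_outputs)
```

Computing the gap per scene before averaging keeps scene-level factors such as occupation from leaking into the disparity estimate.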

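Title and abstract (translated from Chinese)

Title: Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos

Vision-Language Models (VLMs) are deployed in a growing number of socially consequential settings, raising concerns about social bias driven by demographic cues. The central challenge in measuring such bias is attribution under visual confounding: real-world images entangle race and gender with correlated factors such as background and clothing, which obscures attribution. We propose a face-only counterfactual evaluation paradigm that isolates demographic effects while preserving the realism of real images. Starting from real photographs, we generate counterfactual variants by editing only facial attributes related to race and gender and keeping all other visual factors fixed. Based on this paradigm, we construct the FOCUS dataset, which contains 480 scene-matched counterfactual images across six occupations and ten demographic groups, and propose the REFLECT benchmark, which comprises three decision-oriented tasks: two-alternative forced choice, multiple-choice socioeconomic inference, and numeric salary recommendation. Experiments on five state-of-the-art VLMs show that demographic disparities persist under strict visual control and vary substantially across task formulations. These findings underline the need for controlled counterfactual audits and highlight task design as a key factor in evaluating social bias in multimodal models.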

Weakly-Supervised Lung Nodule Segmentation via Training-Free Guidance of 3D Rectified Flow

Authors: Richard Petersen, Fredrik Kahl, Jennifer Alvén

Venue: MICCAI 2026

First: 2026-04-09T14:46:14+00:00 · Latest: 2026-04-20T10:51:56+00:00

Comments: Submitted to MICCAI 2026. Added references for Section 2. Added acknowledgment.

Abstract

Dense annotations, such as segmentation masks, are expensive and time-consuming to obtain, especially for 3D medical images where expert voxel-wise labeling is required. Weakly supervised approaches aim to address this limitation, but often rely on attribution-based methods that struggle to accurately capture small structures such as lung nodules. In this paper, we propose a weakly supervised segmentation method for lung nodules that combines pretrained state-of-the-art rectified flow and predictor models in a plug-and-play manner. Our approach uses training-free guidance of a 3D rectified flow model, requiring only fine-tuning of the predictor using image-level labels and no retraining of the generative model. The proposed method produces improved-quality segmentations for two separate predictors, consistently detecting lung nodules of varying sizes and shapes. Experiments on LUNA16 demonstrate improvements over baseline methods, highlighting the potential of generative foundation models as tools for weakly supervised 3D medical image segmentation.

Enhancing Continual Learning of Vision-Language Models via Dynamic Prefix Weighting

Authors: Hyeonseo Jang, Hyuk Kwon, Kibok Lee

Venue: CVPR 2026

First: 2026-04-20T10:46:09+00:00 · Latest: 2026-04-20T10:46:09+00:00

Comments: CVPR 2026; revised text and figures for improved readability

Abs · PDF · Code1 · Code2 · Code3

Abstract

We investigate recently introduced domain-class incremental learning scenarios for vision-language models (VLMs). Recent works address this challenge using parameter-efficient methods, such as prefix-tuning or adapters, which facilitate model adaptation to downstream tasks by incorporating task-specific information into input tokens through additive vectors. However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. DPW consists of 1) a gating module that adjusts the weights of each prefix based on the importance of the corresponding input token, and 2) a weighting mechanism that derives adapter output weights as a residual of prefix-tuning weights, ensuring that adapters are utilized only when necessary. Experimental results demonstrate that our method achieves state-of-the-art performance in domain-class incremental learning scenarios for VLMs. The code is available at: https://github.com/YonseiML/dpw.

中文标题/摘要

标题：通过动态前缀权重增强视觉-语言模型的持续学习

我们研究了最近引入的领域类增量学习场景对视觉-语言模型（VLMs）的影响。近期工作使用参数高效的方法，如前缀调谐或适配器，通过添加向量将任务特定信息纳入输入标记，从而促进模型对下游任务的适应。然而，先前的方法通常会归一化这些向量的权重，忽略了不同输入标记需要不同程度调整的事实。为了解决这一问题，我们提出了动态前缀权重（DPW）框架，该框架动态地为前缀分配权重，并结合了适配器。DPW 包括 1）一个门控模块，根据相应输入标记的重要性调整每个前缀的权重，以及 2）一种权重机制，通过前缀调谐权重的残差推导适配器输出权重，确保仅在必要时使用适配器。实验结果表明，我们的方法在 VLMs 的领域类增量学习场景中达到了最先进的性能。代码可在 https://github.com/YonseiML/dpw 获取。
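
The weighting idea in the DPW abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the sigmoid gate, the scalar per-token weight, the additive combination, and the names `gate` and `mix` are hypothetical and are not taken from the paper or its repository. The sketch only shows a gate assigning a weight g to the prefix and the adapter output receiving the residual weight 1 - g.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate(token: Vector, gate_w: Vector, gate_b: float) -> float:
    """Hypothetical gating module: maps one token to a scalar weight in (0, 1)."""
    return sigmoid(sum(w * x for w, x in zip(gate_w, token)) + gate_b)


def mix(token: Vector, prefix: Vector, adapter_out: Vector, g: float) -> Vector:
    """Add the prefix with weight g and the adapter output with the residual 1 - g."""
    return [x + g * p + (1.0 - g) * a for x, p, a in zip(token, prefix, adapter_out)]


# Toy values, chosen only to exercise the functions.
token = [1.0, -1.0]
prefix = [0.5, 0.5]
adapter_out = [0.2, -0.2]

g = gate(token, gate_w=[2.0, 0.0], gate_b=0.0)
out = mix(token, prefix, adapter_out, g)
```

Under this assumed form, a gate value near 1 means the token is adjusted almost entirely by the prefix and the adapter contributes little, which is one way to read "adapters are utilized only when necessary".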

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

Authors: Pei-An Chen, Yong-Ching Liang, Jia-Fong Yeh, Hung-Ting Su, Yi-Ting Chen, Min Sun, Winston Hsu

First: 2026-04-16T11:46:30+00:00 · Latest: 2026-04-20T10:36:24+00:00

Abs · PDF · Code1 · Code2

Abstract

Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may change over time and are not specified in the instruction. DynAfford requires agents to perceive object states, infer implicit preconditions, and adapt their actions accordingly. To enable this capability, we introduce ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments demonstrate that incorporating ADAPT significantly improves robustness and task success across both seen and unseen environments. We also show that a domain-adapted, LoRA-finetuned vision-language model used as the affordance inference backend outperforms a commercial LLM (GPT-4o), highlighting the importance of task-aligned affordance grounding.

中文标题/摘要

标题：ADAPT：在未指定操作条件下的常识规划基准测试

智能具身代理不应仅仅遵循指令，因为现实环境往往包含意外情况和例外。然而，现有方法通常专注于直接执行指令，而不考虑目标对象是否可以实际操作，这意味着它们无法评估可用的操作条件。为解决这一局限，我们引入了DynAfford，这是一个基准测试，用于评估在动态环境中操作对象的具身代理，其中对象的操作条件可能会随时间变化且未在指令中指定。DynAfford要求代理感知对象状态、推断隐含的前提条件，并相应地调整其行为。为了实现这一能力，我们引入了ADAPT，这是一个即插即用模块，可以增强现有规划器的操作条件推理能力。实验表明，整合ADAPT显著提高了在已见和未见环境中的鲁棒性和任务成功率。我们还展示了，作为操作条件推理后端使用的领域适应、LoRA微调的视觉语言模型优于商业LLM（GPT-4o），突显了任务对齐的操作条件接地的重要性。

Summary / 总结

The paper targets embodied agents operating in dynamic environments where object affordances can change over time and are not specified in the instruction. It introduces DynAfford, a benchmark for this setting, and ADAPT, a plug-and-play module that augments existing planners with explicit affordance reasoning. Experiments show that ADAPT improves robustness and task success in both seen and unseen environments. In addition, a domain-adapted, LoRA-finetuned vision-language model outperforms GPT-4o as the affordance inference backend, emphasizing the importance of task-aligned affordance grounding.

研究旨在提高智能实体代理在指令未指定物体操作条件的动态环境中的适应性。引入了ADAPT模块，增强现有规划器的显式操作条件推理能力。实验表明，ADAPT在已见和未见环境中均显著提高了鲁棒性和任务成功率。此外，一个领域适应的、LoRA微调的视觉语言模型在操作条件推理方面优于GPT-4o，突显了任务对齐的操作条件接地的重要性。

Video Panels for Long Video Understanding

Authors: Lars Doorenbos, Federico Spurio, Juergen Gall

Venue: CVPR 2026

First: 2025-09-28T08:05:55+00:00 · Latest: 2026-04-20T10:17:08+00:00

Comments: CVPR 2026

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. For the TimeScope (Long) dataset, which has the longest videos, the accuracy for video question answering is improved by up to 19.4%. Overall, our method raises the bar for long video understanding models. The code is available at https://fedespu.github.io/Video-Panels.

中文标题/摘要

标题：长视频理解的视频面板

近期的视频-语言模型（VLMs）在长视频理解任务上取得了令人鼓舞的结果，但其性能仍落后于涉及图像或短视频的任务。这导致了对改进VLMs的长上下文建模的兴趣，通过引入新颖模块和额外复杂性。本文采取了不同的方法：而不是用有限的数据微调VLMs，我们试图最大化现有模型的性能。为此，我们提出了一种专为长视频理解设计的新型视觉提示策略。通过将多个帧组合成一个图像面板，我们有效地在空间细节和时间分辨率之间进行权衡。我们的方法是无需训练的、无需参数的、模型无关的，并且可以无缝集成到现有的VLMs中。在五个广泛使用的基准上的大量实验，涵盖了各种模型架构、大小和上下文窗口，证实了我们方法的一致性。对于TimeScope（长）数据集，该数据集包含最长的视频，视频问答的准确性提高了高达19.4%。总体而言，我们的方法提高了长视频理解模型的标准。代码可在https://fedespu.github.io/Video-Panels/获取。

Summary / 总结

This paper addresses long-video understanding with a visual prompting strategy called Video Panels. Rather than fine-tuning Video-Language Models (VLMs), the authors combine multiple frames as panels in a single image, trading spatial detail for temporal resolution. The approach is training-free, parameter-free, and model-agnostic, and it improves video question answering accuracy by up to 19.4% on the TimeScope (Long) dataset. Experiments on five benchmarks show consistent results across model architectures, sizes, and context windows.

本文提出了一种名为Video Panels的新型视觉提示策略，以解决长视频理解的挑战。该方法不依赖于现有视频语言模型（VLMs）的微调，而是将多个帧组合成面板以增强时间分辨率。该方法无需训练和参数，且适用于多种模型架构，并在TimeScope（长）数据集上将视频问答准确率提高了高达19.4%，证明了其在不同模型架构和上下文窗口下的有效性。
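
The panel idea described above, several frames placed side by side in one image, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: frames are plain nested lists of grayscale values, downsampling is simple striding, and the helper names `downsample` and `make_panel` are hypothetical.

```python
from typing import List, Sequence

Frame = List[List[int]]  # H x W grayscale image as nested lists (illustrative only)


def downsample(frame: Frame, factor: int) -> Frame:
    """Keep every `factor`-th pixel along both axes (nearest-neighbour striding)."""
    return [row[::factor] for row in frame[::factor]]


def make_panel(frames: Sequence[Frame], rows: int, cols: int) -> Frame:
    """Tile rows * cols frames, in temporal order, into a single image."""
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    # Shrink each frame so the grid stays close to the size of one input frame.
    small = [downsample(f, max(rows, cols)) for f in frames]
    panel: Frame = []
    for r in range(rows):
        tiles = small[r * cols:(r + 1) * cols]
        # Concatenate the tiles of this grid row, scanline by scanline.
        for y in range(len(tiles[0])):
            panel.append([px for tile in tiles for px in tile[y]])
    return panel


# Four 4x4 frames, each filled with its own time index.
frames = [[[t] * 4 for _ in range(4)] for t in range(4)]
panel = make_panel(frames, rows=2, cols=2)
```

Here the 2x2 panel has the same 4x4 size as one input frame but shows four time steps, each at half the resolution along both axes, which is the spatial-for-temporal trade the abstract refers to.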

Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion

Authors: Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim

Venue: ACL 2026

First: 2026-01-06T12:47:32+00:00 · Latest: 2026-04-20T09:59:57+00:00

Comments: ACL 2026 Findings

Abs · PDF · Code1 · Code2

Abstract

Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and has recently been adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines, while maintaining computational cost during inference.

Summary / 总结

The paper addresses the generalization challenge in prompt tuning for audio-language models (ALMs) by proposing Semantically Expanded Prompt Tuning (SEPT). This method enhances the semantic structure of the embedding space through a novel semantic expansion loss, which promotes intra-class compactness and inter-class separability. Comprehensive evaluations show that SEPT improves generalization performance across various prompt tuning baselines without increasing computational cost during inference.

论文提出了一种名为语义扩展提示调优（SEPT）的方法，以解决音频语言模型（ALMs）中提示调优的泛化能力问题。该方法通过引入一种新的语义扩展损失来增强嵌入空间的语义结构，促进类内紧凑性和类间可分性。全面的评估表明，SEPT在各种提示调优基线中提高了泛化性能，同时在推理过程中没有增加计算成本。
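
To make "margin constraints that promote intra-class compactness and inter-class separability" concrete, here is a generic hinge-style margin loss over cosine similarities. It is a stand-in chosen for illustration, not the SEPT loss: the exact formulation is not given in the abstract, and the function names, the use of the hardest positive and negative, and the margin value 0.2 are all assumptions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def margin_loss(prompt: Vector,
                own_neighbors: Sequence[Vector],
                other_neighbors: Sequence[Vector],
                margin: float = 0.2) -> float:
    """Generic hinge loss (assumed form): the prompt should be closer to its
    least similar own-class neighbor than to its most similar other-class
    neighbor, by at least `margin`."""
    worst_pos = min(cosine(prompt, n) for n in own_neighbors)
    hardest_neg = max(cosine(prompt, n) for n in other_neighbors)
    return max(0.0, margin + hardest_neg - worst_pos)


prompt = [1.0, 0.0]
# Own-class neighbors are close and the other class is orthogonal: margin satisfied.
satisfied = margin_loss(prompt, [[1.0, 0.0], [0.8, 0.6]], [[0.0, 1.0]])
# Own-class neighbor is farther than the other-class one: margin violated.
violated = margin_loss(prompt, [[0.6, 0.8]], [[0.8, 0.6]])
```

In this toy setup the loss is zero when the class structure already holds and positive when a neighbor from another class sits closer to the prompt than its own semantic neighbors.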
