arXiv Paper Digest

Snapshot: 20260304_0348

3D Field of Junctions: A Noise-Robust, Training-Free Structural Prior for Volumetric Inverse Problems

Authors: Namhoon Kim, Narges Moeini, Justin Romberg, Sara Fridovich-Keil

First: 2026-03-02T18:11:59+00:00 · Latest: 2026-03-02T18:11:59+00:00

Comments: Code will be released soon

Abstract

Volume denoising is a foundational problem in computational imaging, as many 3D imaging inverse problems face high levels of measurement noise. Inspired by the strong 2D image denoising properties of Field of Junctions (ICCV 2021), we propose a novel, fully volumetric 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges that best explain each 3D patch of a full volume, while encouraging consistency between overlapping patches. In addition to direct volume denoising, we leverage our 3D FoJ representation as a structural prior that: (i) requires no training data, and thus precludes the risk of hallucination, (ii) preserves and enhances sharp edge and corner structures in 3D, even under low signal to noise ratio (SNR), and (iii) can be used as a drop-in denoising representation via projected or proximal gradient descent for any volumetric inverse problem with low SNR. We demonstrate successful volume reconstruction and denoising with 3D FoJ across three diverse 3D imaging tasks with low-SNR measurements: low-dose X-ray computed tomography (CT), cryogenic electron tomography (cryo-ET), and denoising point clouds such as those from lidar in adverse weather. Across these challenging low-SNR volumetric imaging problems, 3D FoJ outperforms a mixture of classical and neural methods.
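
The abstract describes using the representation inside projected or proximal gradient descent. The outer loop of such a scheme can be sketched as below, assuming a linear forward operator `A` and a least-squares data term. The `clip_to_unit_interval` function is a hypothetical placeholder prior (projection onto a box), not the 3D FoJ optimization itself, whose code has not been released; all names and sizes are illustrative.

```python
import numpy as np

def projected_gradient_descent(A, y, project, num_iters=200):
    """Minimize 0.5 * ||A x - y||^2, applying `project` after each gradient step."""
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)
    x = project(np.zeros(A.shape[1]))
    for _ in range(num_iters):
        grad = A.T @ (A @ x - y)        # gradient of the data-fidelity term
        x = project(x - step * grad)    # the structural prior enters here
    return x

def clip_to_unit_interval(x):
    # Hypothetical stand-in prior: projection onto the box [0, 1]^n.
    return np.clip(x, 0.0, 1.0)

rng = np.random.default_rng(0)
A = rng.normal(size=(40, 20))                # toy forward operator
x_true = rng.uniform(0.0, 1.0, size=20)
y = A @ x_true + 0.5 * rng.normal(size=40)   # noisy measurements
x_hat = projected_gradient_descent(A, y, clip_to_unit_interval)
```

Swapping `clip_to_unit_interval` for a denoising step is what would make this a drop-in use of a learned or hand-crafted prior; the loop itself stays unchanged.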

Summary

This paper addresses volume denoising in computational imaging under high measurement noise. It introduces a 3D Field of Junctions (3D FoJ) representation that optimizes a junction of 3D wedges to explain each 3D patch of a volume while encouraging consistency between overlapping patches. Used as a structural prior for volumetric inverse problems, 3D FoJ requires no training data, which the authors argue removes the risk of hallucination, and it preserves sharp edges and corners even at low signal-to-noise ratio. In experiments on low-dose X-ray CT, cryogenic electron tomography (cryo-ET), and lidar point clouds captured in adverse weather, 3D FoJ outperforms a mixture of classical and neural methods.

OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens

Authors: Yiying Yang, Wei Cheng, Sijin Chen, Honghao Fu, Xianfang Zeng, Yujun Cai, Gang Yu, Xingjun Ma

Venue: CVPR 2026

First: 2026-03-02T17:59:05+00:00 · Latest: 2026-03-02T17:59:05+00:00

Comments: Accepted by CVPR 2026. Project Page: https://openvglab.github.io/OmniLottie/

Abstract

OmniLottie is a versatile framework that generates high-quality vector animations from multimodal instructions. For flexible control over motion and visual content, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision-language models to follow interleaved multimodal instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multimodal human instructions.

FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding

Authors: Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, Zuxuan Wu

Venue: CVPR 2026

First: 2026-03-02T17:16:47+00:00 · Latest: 2026-03-02T17:16:47+00:00

Comments: Accepted at CVPR 2026. Project page: https://yiwengxie.com/FluxMem/

Abstract

This paper presents FluxMem, a training-free framework for efficient streaming video understanding. FluxMem adaptively compresses redundant visual memory through a hierarchical, two-stage design: (1) a Temporal Adjacency Selection (TAS) module removes redundant visual tokens across adjacent frames, and (2) a Spatial Domain Consolidation (SDC) module further merges spatially repetitive regions within each frame into compact representations. To adapt effectively to dynamic scenes, we introduce a self-adaptive token compression mechanism in both TAS and SDC, which automatically determines the compression rate based on intrinsic scene statistics rather than manual tuning. Extensive experiments demonstrate that FluxMem achieves new state-of-the-art results on existing online video benchmarks, reaching 76.4 on StreamingBench and 67.2 on OVO-Bench under real-time settings, while reducing latency by 69.9% and peak GPU memory by 34.5% on OVO-Bench. Furthermore, it maintains strong offline performance, achieving 73.1 on MLVU while using 65% fewer visual tokens.
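
To make the idea of removing redundant tokens across adjacent frames concrete, here is a minimal sketch. It is not the TAS module: the rule used (drop a token when its cosine similarity to the same-position token in the previous frame is at or above the mean similarity for that frame pair) is an assumption chosen only to illustrate a threshold derived from scene statistics rather than set by hand. Function and variable names are hypothetical.

```python
import numpy as np

def temporal_keep_mask(prev_tokens, curr_tokens):
    """Return a boolean mask over curr_tokens; True means the token is kept."""
    # Cosine similarity between tokens at the same position in adjacent frames.
    prev_n = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    curr_n = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    sim = np.sum(prev_n * curr_n, axis=1)
    # Assumed adaptive rule: the threshold is the mean similarity of this
    # frame pair, so no compression rate is tuned manually.
    threshold = sim.mean()
    return sim < threshold

rng = np.random.default_rng(0)
prev = rng.normal(size=(8, 16))   # 8 tokens with 16-dim features (toy sizes)
curr = prev.copy()
curr[:3] = -prev[:3]              # first 3 tokens change (sign flip as a clear toy signal)
keep = temporal_keep_mask(prev, curr)
compact = curr[keep]              # tokens that would be passed on to later stages
```

In this toy case only the three changed tokens survive, while the five tokens that repeat the previous frame are dropped.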

From Pixels to Patches: Pooling Strategies for Earth Embeddings

Authors: Isaac Corley, Caleb Robinson, Inbal Becker-Reshef, Juan M. Lavista Ferres

First: 2026-03-02T17:03:37+00:00 · Latest: 2026-03-02T17:03:37+00:00

Abstract

As geospatial foundation models shift from patch-level to pixel-level embeddings, practitioners must aggregate thousands of pixel vectors into patch representations that preserve class-discriminative signal while matching downstream label resolution. The default choice, mean pooling, discards within-patch variability and can drop accuracy by more than 10% under spatial shift. To evaluate this effect, we introduce EuroSAT-Embed: 81,000 embedding GeoTIFFs derived from three foundation models: AlphaEarth, OlmoEarth, and Tessera. We benchmark 11 training-free and 2 parametric pooling methods under both random and geographically disjoint test splits. Our results show that richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and increase accuracy by up to 5% on spatial splits. We recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling: it improves accuracy without increasing embedding dimensionality. For maximum accuracy, Stats pooling (concatenation of min/max/mean/std pooling) performs best at 4x the embedding size. We further find that pooling effectiveness varies across embedding sources and that higher-dimensional embeddings benefit most from distributional statistics.
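
The three pooling schemes named in the abstract can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' benchmark code: the exponent `p = 3`, the clipping of negative values before the power, and the toy patch size are all choices made here for the example.

```python
import numpy as np

def mean_pool(pixels):
    """Average over pixels; output has the same size as one embedding."""
    return pixels.mean(axis=0)

def gem_pool(pixels, p=3.0, eps=1e-6):
    """Generalized mean over pixels; p = 1 recovers mean pooling for positive inputs."""
    clipped = np.clip(pixels, eps, None)   # assumption: treat activations as non-negative
    return np.mean(clipped ** p, axis=0) ** (1.0 / p)

def stats_pool(pixels):
    """Concatenate min, max, mean, and std; output is 4x the embedding size."""
    return np.concatenate([
        pixels.min(axis=0), pixels.max(axis=0),
        pixels.mean(axis=0), pixels.std(axis=0),
    ])

rng = np.random.default_rng(0)
patch = rng.uniform(0.0, 1.0, size=(64 * 64, 128))   # pixels x embedding dims (toy sizes)
```

GeM keeps the output at 128 dimensions here, while Stats pooling returns 512, which matches the 4x size trade-off the abstract describes.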

Summary

The paper studies how to aggregate pixel-level embeddings into patch representations for geospatial tasks. It introduces EuroSAT-Embed, a dataset of 81,000 embedding GeoTIFFs derived from three foundation models, and evaluates 13 pooling methods (11 training-free and 2 parametric). Richer pooling schemes reduce the geographic generalization gap by up to 40% relative to mean pooling and raise accuracy by up to 5% on spatial splits. The authors recommend Generalized Mean Pooling (GeM) as a drop-in replacement for mean pooling, while Stats pooling gives the highest accuracy at 4x the embedding size.

Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation

Authors: Peiyuan Jing, Kinhei Lee, Zhenxuan Zhang, Huichi Zhou, Zhengqing Yuan, Zhifan Gao, Lei Zhu, Giorgos Papanastasiou, Yingying Fang, Guang Yang

First: 2025-04-25T16:05:06+00:00 · Latest: 2026-03-02T16:27:49+00:00

Abstract

Radiology report generation is critical for efficiency but current models lack the structured reasoning of experts, hindering clinical trust and explainability by failing to link visual findings to precise anatomical locations. This paper introduces BoxMed-RL, a groundbreaking unified training framework for generating spatially verifiable and explainable radiology reports. Built on a large vision-language model, BoxMed-RL revolutionizes report generation through two integrated phases: (1) In the Pretraining Phase, we refine the model via medical concept learning, using Chain-of-Thought supervision to internalize the radiologist-like workflow, followed by spatially verifiable reinforcement, which applies reinforcement learning to align medical findings with bounding boxes. (2) In the Downstream Adapter Phase, we freeze the pretrained weights and train a downstream adapter to ensure fluent and clinically credible reports. This framework precisely mimics radiologists' workflow, compelling the model to connect high-level medical concepts with definitive anatomical evidence. Extensive experiments on public datasets demonstrate that BoxMed-RL achieves an average 7% improvement in both METEOR and ROUGE-L metrics compared to state-of-the-art methods. An average 5% improvement in large language model-based metrics further underscores BoxMed-RL's robustness in generating high-quality radiology reports.

Summary

This paper targets the lack of structured reasoning in radiology report generation, which limits clinical trust and explainability. It introduces BoxMed-RL, a unified training framework with two phases. The pretraining phase combines medical concept learning under Chain-of-Thought supervision with spatially verifiable reinforcement learning that aligns findings with bounding boxes. The downstream adapter phase freezes the pretrained weights and trains an adapter to produce fluent, clinically credible reports. On public datasets, BoxMed-RL improves METEOR and ROUGE-L by an average of 7% and large language model-based metrics by an average of 5% over state-of-the-art methods.

Learning to Read Where to Look: Disease-Aware Vision-Language Pretraining for 3D CT

Authors: Simon Ging, Philipp Arnold, Sebastian Walter, Hani Alnahas, Hannah Bast, Elmar Kotter, Jiancheng Yang, Behzad Bozorgtabar, Thomas Brox

First: 2026-03-02T16:10:17+00:00 · Latest: 2026-03-02T16:10:17+00:00

Abstract

Recent 3D CT vision-language models align volumes with reports via contrastive pretraining, but typically rely on limited public data and provide only coarse global supervision. We train a 3D CT vision-language model on 98k report-volume pairs (50k patients) collected at a single hospital, combined with public datasets, using SigLIP-style contrastive pretraining together with prompt-based disease supervision in the shared vision-text embedding space. On CT-RATE, our model achieves state-of-the-art text-to-image retrieval (R@10 31.5 vs. 22.2) and competitive disease classification (AUC 83.8 vs. 83.8), with consistent results on Rad-ChestCT (AUC 77.0 vs. 77.3). We further observe that radiologists routinely reference specific images within their reports (e.g., "series X, image Y"), linking textual descriptions to precise axial locations. We automatically mine 262k such snippet-slice pairs and introduce the task of intra-scan snippet localization (predicting the axial depth referred to by a text snippet), reducing mean absolute error to 36.3 mm at 12 mm feature resolution, compared with 67.0 mm for the best baseline. Adding this localization objective leaves retrieval and classification broadly unchanged within confidence bounds, yielding a single unified model for retrieval, classification, and intra-scan grounding.

Closed-Loop Action Chunks with Dynamic Corrections for Training-Free Diffusion Policy

Authors: Pengyuan Wu, Pingrui Zhang, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li

First: 2026-03-02T15:04:18+00:00 · Latest: 2026-03-02T15:04:18+00:00

Comments: Accepted by ICRA2026

Abstract

Diffusion-based policies have achieved remarkable results in robotic manipulation but often struggle to adapt rapidly in dynamic scenarios, leading to delayed responses or task failures. We present DCDP, a Dynamic Closed-Loop Diffusion Policy framework that integrates chunk-based action generation with real-time correction. DCDP integrates a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder to inject environmental dynamics before action execution, achieving real-time closed-loop action correction and enhancing the system's adaptability in dynamic scenarios. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining while requiring only 5% additional computation. Its modular design enables plug-and-play integration, achieving both temporal coherence and real-time responsiveness in dynamic robotic scenarios, including real-world manipulation tasks. The project page is at: https://github.com/wupengyuan/dcdp

Summary

This work aims to improve the adaptability of diffusion-based policies in dynamic robotic manipulation. It proposes DCDP, a Dynamic Closed-Loop Diffusion Policy framework that combines chunk-based action generation with real-time correction, using a self-supervised dynamic feature encoder, cross-attention fusion, and an asymmetric action encoder-decoder. In dynamic PushT simulations, DCDP improves adaptability by 19% without retraining and with only 5% additional computation. The authors also report temporal coherence and real-time responsiveness in dynamic robotic tasks.

Semantic Similarity is a Spurious Measure of Comic Understanding: Lessons Learned from Hallucinations in a Benchmarking Experiment

Authors: Christopher Driggers-Ellis, Nachiketh Tibrewal, Rohit Bogulla, Harsh Khanna, Sangpil Youm, Christan Grant, Bonnie Dorr

First: 2026-03-02T15:03:57+00:00 · Latest: 2026-03-02T15:03:57+00:00

Comments: 8 pages, 2 figures, 3 tables. Includes link to code

Abstract

A system that enables blind or visually impaired users to access comics/manga would introduce a new medium of storytelling to this community. However, no such system currently exists. Generative vision-language models (VLMs) have shown promise in describing images and understanding comics, but most research on comic understanding is limited to panel-level analysis. To fully support blind and visually impaired users, greater attention must be paid to page-level understanding and interpretation. In this work, we present a preliminary benchmark of VLM performance on comic interpretation tasks. We identify and categorize hallucinations that emerge during this process, organizing them into generalized object-hallucination taxonomies. We conclude with guidance on future research, emphasizing hallucination mitigation and improved data curation for comic interpretation.

Summary

This study evaluates how well generative vision-language models (VLMs) understand comics, particularly at the page level, with the goal of supporting blind and visually impaired users. The researchers present a preliminary benchmark of VLM performance on comic interpretation tasks, then identify the hallucinations in the models' outputs and organize them into generalized object-hallucination taxonomies. They conclude that future research needs better data curation and hallucination mitigation.

ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta

First: 2025-09-30T14:00:41+00:00 · Latest: 2026-03-02T14:56:22+00:00

Abstract

Most existing approaches formulate action quality assessment and skill proficiency estimation as discriminative prediction tasks, typically producing discrete labels or scores without explicitly modeling the reasoning process underlying the assessment. We instead reformulate the problem as generative vision-language modeling, introducing ProfVLM, a parameter-efficient vision-language model that jointly predicts proficiency levels and generates expert-like natural language feedback from multi-view videos. ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores. Central to our method is an AttentiveGatedProjector that dynamically fuses and projects multi-view egocentric and exocentric features from a frozen TimeSformer backbone into a language model fine-tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60% compared to existing classification-based methods. By providing natural language critiques aligned with performance levels, this work shows that generative vision-language modeling offers a powerful and efficient paradigm shift for interpretable action quality assessment.

Pseudo Contrastive Learning for Diagram Comprehension in Multimodal Models

Authors: Hiroshi Sasaki

First: 2026-02-27T01:39:53+00:00 · Latest: 2026-03-02T13:34:57+00:00

Comments: 9 pages, 3 figures

Abstract

Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance, such as diagram understanding, remain challenging due to the models' limited sensitivity to fine-grained structural variations. We propose a new training paradigm designed to enhance diagram comprehension in vision-language models. Our approach introduces pseudo contrastive samples generated by a diagram renderer that creates synthetic diagrams using randomly picked text elements. These samples highlight structural differences in diagrammatic imagery without requiring any modification or editing of the original data. By incorporating these pseudo contrastive samples into the training objective, the model learns to capture more precise and semantically consistent diagram structures. Empirical evaluations on a benchmark dataset of flowcharts demonstrate substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering tasks. The results underscore the value of domain-specific training strategies and contribute to advancing diagrammatic understanding within the broader context of vision-language learning.

Summary

The research aims to improve diagram comprehension in multimodal models such as CLIP, which have limited sensitivity to fine-grained structural differences. The proposed pseudo contrastive learning uses a diagram renderer to generate synthetic diagrams from randomly picked text elements; these samples highlight structural differences without altering the original data. Experiments on a flowchart dataset show substantial improvements over standard CLIP and hard-negative CLIP training in both image-text matching and visual question answering.

FireRed-OCR Technical Report

Authors: Hao Wu, Haoran Lou, Xinyue Li, Zuodong Zhong, Zhaojun Sun, Phellon Chen, Xuanhe Zhou, Kai Zuo, Yibo Chen, Xu Tang, Yao Hu, Boxiang Zhou, Jian Wu, Yongji Wu, Wenxin Yu, Yingmiao Liu, Yuhao Huang, Manjie Xu, Gang Liu, Yidong Ma, Zhichao Sun, Changhao Qiao

First: 2026-03-02T13:19:23+00:00 · Latest: 2026-03-02T13:19:23+00:00

Abstract

We present FireRed-OCR, a systematic framework to specialize general VLMs into high-performance OCR models. Large Vision-Language Models (VLMs) have demonstrated impressive general capabilities but frequently suffer from "structural hallucination" when processing complex documents, limiting their utility in industrial OCR applications. In this paper, we introduce FireRed-OCR, a novel framework designed to transform general-purpose VLMs (based on Qwen3-VL) into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. Unlike traditional random sampling, our pipeline leverages geometric feature clustering and multi-dimensional tagging to synthesize and curate a highly balanced dataset, effectively handling long-tail layouts and rare document types. Furthermore, we propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation. This curriculum includes: (1) Multi-task Pre-alignment to ground the model's understanding of document structure; (2) Specialized SFT for standardizing full-image Markdown output; and (3) Format-Constrained Group Relative Policy Optimization (GRPO), which utilizes reinforcement learning to enforce strict syntactic validity and structural integrity (e.g., table closure, formula syntax). Extensive evaluations on OmniDocBench v1.5 demonstrate that FireRed-OCR achieves state-of-the-art performance with an overall score of 92.94%, significantly outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. We open-source our code and model weights to facilitate the "General VLM to Specialized Structural Expert" paradigm.

Summary

FireRed-OCR is a framework that specializes general Vision-Language Models (VLMs) into high-performance OCR models by addressing structural hallucination. It uses a "Geometry + Semantics" Data Factory to synthesize and curate a balanced dataset, and a Three-Stage Progressive Training strategy consisting of multi-task pre-alignment, specialized SFT, and format-constrained GRPO. On OmniDocBench v1.5 it reaches an overall score of 92.94%, outperforming strong baselines such as DeepSeek-OCR 2 and OCRVerse across text, formula, table, and reading order metrics. The code and model weights are open-sourced.

Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Authors: Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang

Venue: CVPR 2026

First: 2025-08-18T18:47:26+00:00 · Latest: 2026-03-02T13:11:04+00:00

Comments: Accepted by CVPR 2026

Abstract

Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), providing a unified framework for perception and decision-making. However, their real-world deployment is hindered by significant computational overhead when processing high-resolution, multi-view images. This complexity stems from the massive number of visual tokens, which increases inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in AD. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism that prioritizes semantic and spatial coverage across views, and (ii) a view-adaptive pruning controller that automatically learns optimal pruning ratios based on camera importance to downstream tasks. Unlike prior methods, Prune2Drive requires no model retraining or access to attention maps, ensuring compatibility with modern efficient attention implementations. Extensive experiments on the DriveLM and DriveLMM-o1 benchmarks demonstrate that Prune2Drive achieves significant speedups and memory savings with minimal performance impact. When retaining only 10% of visual tokens, our method achieves a 6.40x speedup in the prefilling phase and consumes only 13.4% of the original FLOPs, with a mere 3% average performance drop on the DriveLM benchmark. Code is available at: https://github.com/MinhaoXiong/Prune2Drive.git
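
One way to picture selecting tokens for coverage without touching attention maps is greedy farthest-point selection in feature space. The sketch below is an assumption made for illustration: the abstract does not say which selection algorithm Prune2Drive uses, and the function name, token count, and feature size are hypothetical.

```python
import numpy as np

def farthest_point_select(tokens, num_keep):
    """Greedily pick tokens that are far apart in feature space."""
    selected = [0]                                   # start from the first token
    # Distance from every token to its nearest already-selected token.
    min_dist = np.linalg.norm(tokens - tokens[0], axis=1)
    while len(selected) < num_keep:
        idx = int(np.argmax(min_dist))               # least-covered token so far
        selected.append(idx)
        dist = np.linalg.norm(tokens - tokens[idx], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return np.array(selected)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(100, 32))                  # 100 visual tokens, 32-dim (toy sizes)
kept = farthest_point_select(tokens, num_keep=10)    # retain 10% of the tokens
pruned = tokens[kept]
```

Because the selection reads only token features, a scheme of this kind needs no access to attention weights, which is the compatibility property the abstract emphasizes.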

Summary

Prune2Drive is a plug-and-play framework that prunes visual tokens in multi-view Vision-Language Models (VLMs) for autonomous driving to reduce computational overhead. It introduces a diversity-aware token selection mechanism and a view-adaptive pruning controller, and it requires no model retraining. When retaining only 10% of visual tokens, Prune2Drive achieves a 6.40x speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with a 3% average performance drop on the DriveLM benchmark.

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Authors: Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

First: 2025-03-14T19:52:08+00:00 · Latest: 2026-03-02T12:13:05+00:00

Abstract

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the "safety mirage", where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we present machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

中文标题/摘要

标题：安全幻象：虚假相关性如何削弱VLM安全微调并可通过机器遗忘加以缓解

近期的视觉语言模型（VLMs）在多模态输入（尤其是文本和图像）的生成建模方面取得了显著进展。然而，当暴露于不安全查询时，它们生成有害内容的脆弱性引发了重要的安全问题。尽管当前的对齐策略主要依赖于监督安全微调和精心策划的数据集，但我们发现了一个根本性的局限性，我们称之为“安全幻象”，即监督微调无意中强化了表面文本模式与安全响应之间的虚假相关性，而不是培养深层次、内在的有害行为缓解。我们展示了这些虚假相关性使微调后的VLMs即使在简单的基于一个词替换的攻击中也容易受到攻击，其中用一个诱导虚假相关性的替代词替换文本查询中的一个词可以有效地绕过防护措施。此外，这些相关性导致过度谨慎，使微调后的VLMs无故拒绝良性查询。为了解决这些问题，我们展示了机器遗忘（MU）作为监督安全微调的强大替代方案，因为它避免了有偏的特征-标签映射，并直接从VLMs中移除有害知识，同时保留其一般能力。广泛的跨安全基准评估表明，基于MU的对齐将攻击成功率降低高达60.27%，并减少了超过84.20%的无谓拒绝。注意：存在可能具有冒犯性的AI生成内容。

Summary / 总结

The paper addresses spurious correlations in vision language models (VLMs) that undermine safety fine-tuning, leaving models both vulnerable to attack and prone to refusing benign queries. It introduces the concept of the 'safety mirage', where supervised fine-tuning inadvertently reinforces correlations between superficial textual patterns and safety responses. The study demonstrates that such correlations make VLMs vulnerable to simple one-word modification attacks and cause unnecessary rejections of benign queries. To mitigate these issues, the authors propose machine unlearning (MU) as an alternative to supervised fine-tuning, showing that it can reduce the attack success rate by up to 60.27% and decrease unnecessary rejections by over 84.20%.

研究关注视觉语言模型（VLMs）的安全问题，识别出一种称为‘安全幻象’的现象，即监督微调可能会无意中强化虚假关联，使VLMs容易受到简单攻击并过于谨慎。研究提出机器遗忘（MU）作为替代方法，并展示了其有效性，通过MU方法将攻击成功率降低高达60.27%，并将不必要的拒绝率降低超过84.20%。

GAM-RAG: Gain-Adaptive Memory for Evolving Retrieval in Retrieval-Augmented Generation

Authors: Yifan Wang, Mingxuan Jiang, Zhihao Sun, Yixin Cao, Yicun Liu, Keyang Chen, Guangnan Ye, Hongfeng Chai

First: 2026-03-02T12:09:17+00:00 · Latest: 2026-03-02T12:09:17+00:00

Abstract

Retrieval-Augmented Generation (RAG) grounds large language models with external evidence, but many implementations rely on pre-built indices that remain static after construction. Related queries therefore repeat similar multi-hop traversal, increasing latency and compute. Motivated by schema-based learning in cognitive neuroscience, we propose GAM-RAG, a training-free framework that accumulates retrieval experience from recurring or related queries and updates retrieval memory over time. GAM-RAG builds a lightweight, relation-free hierarchical index whose links capture potential co-occurrence rather than fixed semantic relations. During inference, successful retrieval episodes provide sentence-level feedback, updating sentence memories so evidence useful for similar reasoning types becomes easier to activate later. To balance stability and adaptability under noisy feedback, we introduce an uncertainty-aware, Kalman-inspired gain rule that jointly updates memory states and perplexity-based uncertainty estimates. It applies fast updates for reliable novel signals and conservative refinement for stable or noisy memories. We provide a theoretical analysis of the update dynamics, and empirically show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%. Our code and datasets are available at: https://anonymous.4open.science/r/GAM_RAG-2EF6.

中文标题/摘要

标题：GAM-RAG：增益自适应记忆以适应演化检索的检索增强生成

检索增强生成（RAG）通过外部证据使大型语言模型得以扎根，但许多实现依赖于在构建后保持静态的预构建索引。因此，相关的查询会重复进行类似的多跳遍历，增加延迟和计算量。受认知神经科学中基于模式的学习启发，我们提出了一种无需训练的框架GAM-RAG，该框架从反复出现或相关的查询中积累检索经验，并随着时间更新检索记忆。GAM-RAG 构建了一个轻量级、无关系的分层索引，其链接捕捉潜在共现关系而非固定的语义关系。在推理过程中，成功的检索事件提供句子级别的反馈，更新句子记忆，使对类似推理类型有用的证据更容易激活。为了在嘈杂反馈下平衡稳定性和适应性，我们引入了一种基于不确定性的、卡尔曼滤波启发的增益规则，该规则同时更新记忆状态和基于困惑度的不确定性估计。它对可靠的新信号进行快速更新，并对稳定或嘈杂的记忆进行保守的细化。我们对更新动力学进行了理论分析，并通过实验表明，与最强基线相比，GAM-RAG 的平均性能提高了 3.95%，使用 5 轮记忆时提高了 8.19%，同时将推理成本降低了 61%。我们的代码和数据集可在 https://anonymous.4open.science/r/GAM_RAG-2EF6 获取。

Summary / 总结

GAM-RAG addresses the limitations of static retrieval indices in RAG by updating retrieval memory over time. It builds a lightweight, relation-free hierarchical index and updates sentence memories based on feedback from successful retrieval episodes. An uncertainty-aware, Kalman-inspired gain rule balances stability and adaptability, applying fast updates for reliable novel signals and conservative refinement for stable or noisy memories. Experiments show that GAM-RAG improves average performance by 3.95% over the strongest baseline and by 8.19% with 5-turn memory, while reducing inference cost by 61%.

GAM-RAG 旨在解决 RAG 中静态检索索引的局限性，通过动态更新检索记忆来改进。它使用了一个轻量级、无关系的层次索引，并根据成功的检索经历更新句子记忆。引入了一种不确定性感知的增益规则，以平衡可靠信号的快速更新和稳定或嘈杂记忆的保守修正。实验结果表明，GAM-RAG 在最强基线上的性能提高了 3.95%，使用 5 轮记忆时提高了 8.19%，同时减少了 61% 的推理成本。
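
To make the gain rule described above more concrete, the following is a minimal sketch of a textbook scalar Kalman update applied to a per-sentence memory score. It is not GAM-RAG's actual rule: the names `SentenceMemory` and `gain_update`, and the use of a plain `feedback_noise` variance in place of the paper's perplexity-based uncertainty estimate, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SentenceMemory:
    score: float     # current estimate of how useful the sentence is
    variance: float  # uncertainty attached to that estimate


def gain_update(mem: SentenceMemory, feedback: float, feedback_noise: float):
    """Scalar Kalman-style update (illustrative, not the paper's rule).

    feedback_noise is a hypothetical variance describing how noisy the
    retrieval feedback is; it must be positive.
    """
    # High memory uncertainty relative to feedback noise gives a gain near 1.
    gain = mem.variance / (mem.variance + feedback_noise)
    new_score = mem.score + gain * (feedback - mem.score)
    # Each update shrinks the uncertainty of the memory.
    new_variance = (1.0 - gain) * mem.variance
    return SentenceMemory(new_score, new_variance), gain


# A novel, uncertain memory receiving reliable feedback moves quickly.
novel, novel_gain = gain_update(SentenceMemory(0.0, 1.0), 1.0, 0.1)
# A stable memory receiving noisy feedback is refined conservatively.
stable, stable_gain = gain_update(SentenceMemory(0.5, 0.01), 1.0, 1.0)
```

In this sketch the same formula yields both behaviours named in the abstract: the gain is large when the memory is uncertain and the signal is reliable, and small when the memory is already stable or the feedback is noisy.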

StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models

Authors: Keli Liu, Zhendong Wang, Wengang Zhou, Houqiang Li

First: 2026-03-02T11:35:05+00:00 · Latest: 2026-03-02T11:35:05+00:00

Abstract

Visual AutoRegressive (VAR) models based on next-scale prediction enable efficient hierarchical generation, yet the inference cost grows quadratically at high resolutions. We observe that the computationally intensive later scales predominantly refine high-frequency textures and exhibit substantial spatial redundancy, in contrast to earlier scales that determine the global structural layout. Existing pruning methods primarily focus on high-frequency detection for token selection, often overlooking structural coherence and consequently degrading global semantics. To address this limitation, we propose StepVAR, a training-free token pruning framework that accelerates VAR inference by jointly considering structural and textural importance. Specifically, we employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. This dual-criterion design enables the model to retain tokens critical for both fine-grained fidelity and overall composition. To maintain valid next-scale prediction under sparse tokens, we further introduce a nearest neighbor feature propagation strategy to reconstruct dense feature maps from pruned representations. Extensive experiments on state-of-the-art text-to-image and text-to-video VAR models demonstrate that StepVAR achieves substantial inference speedups while maintaining generation quality. Quantitative and qualitative evaluations consistently show that our method outperforms existing acceleration approaches, validating its effectiveness and general applicability across diverse VAR architectures.

中文标题/摘要

标题：StepVAR：面向视觉自回归模型的结构-纹理引导剪枝

基于下一尺度预测的视觉自回归（VAR）模型能够实现高效的分层生成，但高分辨率下的推理成本呈平方级增长。我们观察到，计算密集型的后期尺度主要细化高频纹理并表现出显著的空间冗余，而早期尺度则决定了全局结构布局。现有的剪枝方法主要集中在高频检测以选择标记，往往忽视了结构一致性，从而损害了全局语义。为解决这一局限，我们提出了StepVAR，这是一种无需训练的标记剪枝框架，通过同时考虑结构和纹理的重要性来加速VAR推理。具体而言，我们采用轻量级的高通滤波器来捕捉局部纹理细节，同时利用主成分分析（PCA）来保留全局结构信息。这种双重标准设计使模型能够保留对于精细细节保真度和整体构图都至关重要的标记。为了在稀疏标记下保持有效的下一尺度预测，我们进一步引入了最近邻特征传播策略，从剪枝表示中重建密集特征图。在最先进的文本到图像和文本到视频VAR模型上的广泛实验表明，StepVAR在保持生成质量的同时实现了显著的推理加速。定量和定性的评估一致表明，我们的方法优于现有的加速方法，验证了其有效性和广泛的适用性，跨越了多种不同的VAR架构。

Summary / 总结

StepVAR is a training-free token pruning framework for Visual AutoRegressive (VAR) models that accelerates inference by considering both structural and textural importance. It uses a lightweight high-pass filter to capture textures and PCA to preserve structure. StepVAR improves inference speed while maintaining generation quality, outperforming existing methods across various VAR architectures.

StepVAR 是一种无需训练的 token 剪枝框架，旨在通过同时考虑结构和纹理重要性来加速 Visual AutoRegressive (VAR) 模型。它使用轻量级的高通滤波器来捕捉纹理，并使用主成分分析 (PCA) 来保留结构信息，从而实现高效推理同时保持生成质量。实验表明，StepVAR 在不降低质量的情况下显著加快了推理速度，并且在各种 VAR 架构中优于现有方法。
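
As a rough illustration of two of the ingredients mentioned above, the sketch below scores tokens with a simple high-pass filter, keeps the top-scoring ones, and rebuilds a dense sequence by nearest-neighbor propagation. It is a 1-D toy under stated assumptions: the PCA-based structural criterion is deliberately omitted, tokens are scalars rather than feature vectors, and the function names are hypothetical rather than taken from StepVAR.

```python
def high_pass_scores(tokens):
    """Texture score per token: deviation from the mean of its two neighbors."""
    n = len(tokens)
    scores = []
    for i in range(n):
        left = tokens[max(i - 1, 0)]
        right = tokens[min(i + 1, n - 1)]
        scores.append(abs(tokens[i] - 0.5 * (left + right)))
    return scores


def prune_and_propagate(tokens, keep):
    """Keep the `keep` highest-scoring tokens, then fill every position
    with the value of its nearest kept token (nearest-neighbor propagation)."""
    scores = high_pass_scores(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    dense = [tokens[min(kept, key=lambda j: abs(j - i))] for i in range(len(tokens))]
    return kept, dense


tokens = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
kept, dense = prune_and_propagate(tokens, keep=3)
```

On this toy input the flat regions are pruned and the sharp feature survives, so the propagated sequence matches the original even though only three of seven tokens were kept.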

Spotlight on Token Perception for Multimodal Reinforcement Learning

Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng

Venue: ICLR 2026

First: 2025-10-10T11:25:33+00:00 · Latest: 2026-03-02T11:25:20+00:00

Comments: Accepted by ICLR 2026, project page: https://github.com/huaixuheqing/VPPO-RL

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.

中文标题/摘要

标题：聚焦多模态强化学习中的 token 感知

虽然可验证奖励的强化学习（RLVR）已提升大型视觉-语言模型（LVLM）的推理能力，但大多数现有的多模态推理方法在RLVR优化过程中忽视了视觉感知的关键作用。本文通过 token 感知的新视角，对多模态RLVR进行开创性的探索，token 感知衡量每个生成 token 的视觉依赖性。通过对思维链（CoT）过程的细致分析，我们发现了两个关键见解：首先，在 rollout 轨迹中，token 感知是稀疏分布的，只有少量 token 在基于视觉的推理中具有高视觉依赖性；其次，不同轨迹在整体视觉依赖性上表现出显著差异。基于这些观察，我们提出了视觉感知策略优化（VPPO），这是一种新颖的策略梯度算法，明确利用 token 感知来细化学习信号。具体而言，VPPO 通过双重机制实现这一点：按轨迹的整体视觉依赖性对其优势重新加权，并将策略更新仅集中于感知关键的 token。在由八个感知和推理基准组成的综合测试集上，VPPO 相比领先的开源RL调优模型表现出显著的改进，其有效性在7B和32B模型规模上得到了一致验证。我们的研究不仅为分析多模态RLVR建立了新的 token 级感知视角，还提出了一种新颖且有效的优化策略，显著增强了LVLM的多模态推理能力。

Summary / 总结

This paper explores multimodal reinforcement learning with verifiable rewards (RLVR) through the lens of token perception, which measures the visual dependency of each generated token. By analyzing the Chain-of-Thought processes, the authors found that only a small fraction of tokens have high visual dependency, and different trajectories exhibit significant divergence in overall visual dependency. They propose Visually-Perceptive Policy Optimization (VPPO), which reweights trajectories based on their visual dependency and focuses updates on perceptually pivotal tokens, leading to substantial gains over existing models on eight perception and reasoning benchmarks.

本文通过关注每个生成词的视觉依赖性，即token感知，探索了多模态强化学习与可验证奖励（RLVR）的方法。研究发现，在一个rollout轨迹中，只有少量词具有高视觉依赖性，不同轨迹在整体视觉依赖性上表现出显著差异。基于这些发现，作者提出了Visually-Perceptive Policy Optimization (VPPO)，这是一种策略梯度算法，通过根据视觉依赖性重新加权轨迹，并仅在感知关键词上更新策略，来改进学习信号。VPPO在八个感知和推理基准测试中展示了显著的改进，且这种效果在不同模型规模下得到了验证。
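
The dual mechanism described above can be sketched in a few lines. This is only an illustration under assumptions: the trajectory weight is taken to be the mean token dependency, pivotal tokens are taken to be a fixed top fraction, and the name `shaped_token_advantages` and the parameter `keep_fraction` are hypothetical. The paper's actual definitions may differ.

```python
def shaped_token_advantages(advantage, token_dependency, keep_fraction=0.4):
    """Return one advantage value per token.

    advantage:        scalar advantage of the whole trajectory
    token_dependency: per-token visual dependency scores (assumed given)
    keep_fraction:    hypothetical share of tokens treated as pivotal
    """
    n = len(token_dependency)
    # Mechanism 1: reweight the trajectory by its overall visual dependency.
    trajectory_weight = sum(token_dependency) / n
    # Mechanism 2: restrict the update to the most visually dependent tokens.
    k = max(1, round(keep_fraction * n))
    ranked = sorted(range(n), key=lambda i: token_dependency[i], reverse=True)
    pivotal = set(ranked[:k])
    return [advantage * trajectory_weight if i in pivotal else 0.0 for i in range(n)]


per_token = shaped_token_advantages(1.0, [0.1, 0.9, 0.2, 0.8, 0.0])
```

Tokens outside the pivotal set receive a zero advantage, so they contribute nothing to the policy gradient in this sketch.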

DAWA: Dynamic Ambiguity-Wise Adaptation for Real-Time Domain Adaptive Semantic Segmentation

Authors: Taorong Liu, Zhen Zhang, Liang Liao, Jing Xiao, Chia-Wen Lin

First: 2024-09-02T08:53:08+00:00 · Latest: 2026-03-02T11:06:28+00:00

Comments: PRCV 2025

Abstract

Test-time domain adaptation (TTDA) for semantic segmentation aims to adapt a segmentation model trained on a source domain to a target domain for on-the-fly inference, where both efficiency and effectiveness are critical. However, existing TTDA methods either rely on costly frame-wise optimization or assume unrealistic domain shifts, resulting in poor adaptation efficiency and continuous semantic ambiguities. To address these challenges, we propose a real-time framework for TTDA semantic segmentation, called Dynamic Ambiguity-Wise Adaptation (DAWA), which adaptively detects domain shifts and dynamically adjusts the learning strategies to mitigate continuous ambiguities at test time. Specifically, we introduce the Dynamic Ambiguous Patch Mask (DAP Mask) strategy, which dynamically identifies and masks highly disturbed regions to prevent error accumulation in ambiguous classes. Furthermore, we present the Dynamic Ambiguous Class Mix (DAC Mix) strategy that leverages vision-language models to group semantically similar classes and augment the target domain with a meta-ambiguous class buffer. Extensive experiments on widely used TTDA benchmarks demonstrate that DAWA consistently outperforms state-of-the-art methods, while maintaining real-time inference speeds of approximately 40 FPS.

中文标题/摘要

标题：DAWA：实时领域自适应语义分割的动态歧义适应

语义分割的测试时领域自适应（TTDA）旨在将训练于源域的分割模型适应目标域，以便实时推理，其中效率和效果至关重要。然而，现有的TTDA方法要么依赖于昂贵的帧级优化，要么假设不切实际的领域偏移，导致适应效率低下和持续的语义歧义。为了解决这些挑战，我们提出了一种实时的TTDA语义分割框架，称为动态歧义适应（DAWA），该框架在测试时自适应地检测领域偏移并动态调整学习策略以缓解持续的歧义。具体而言，我们引入了动态模糊区域掩码（DAP Mask）策略，该策略动态地识别并屏蔽高度干扰的区域，以防止在模糊类别中累积错误。此外，我们提出了动态模糊类别混合（DAC Mix）策略，该策略利用视觉-语言模型将语义相似的类别分组，并用元模糊类别缓冲区增强目标域。在广泛使用的TTDA基准上的大量实验表明，DAWA在保持约40 FPS的实时推理速度的同时，始终优于最先进的方法。

Summary / 总结

DAWA is a real-time framework for test-time domain adaptation in semantic segmentation, addressing the challenges of efficiency and effectiveness. It introduces DAP Mask and DAC Mix strategies to dynamically detect and mitigate domain shifts and ambiguities. Experiments show that DAWA outperforms existing methods while maintaining high inference speeds of about 40 FPS.

DAWA 是一种实时的测试时域适应框架，用于语义分割，旨在解决效率和效果的问题。它引入了 DAP Mask 和 DAC Mix 策略，以动态检测和缓解域偏移和持续的语义模糊性。实验表明，DAWA 在保持约 40 FPS 的快速推理速度的同时，优于现有方法。

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Authors: Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang

Venue: CVPR 2026

First: 2026-03-02T10:24:41+00:00 · Latest: 2026-03-02T10:24:41+00:00

Comments: Accepted by CVPR 2026

Abstract

Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, the LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. In particular, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B. The code will be released when the paper is accepted.

中文标题/摘要

标题：跨模态身份映射：通过强化学习在模态转换中最小化信息损失

大型视觉-语言模型（LVLMs）在生成图像描述时经常忽略或错误表示关键的视觉内容。最小化这种信息损失将促使LVLMs关注图像细节以生成精确的描述。然而，在模态转换过程中测量信息损失本质上是具有挑战性的，因为视觉内容和文本输出之间存在模态差距。在本文中，我们提出，图像描述的质量与其通过该描述进行文本搜索检索到的图像之间的相似性呈正相关。基于这一洞察，我们进一步提出了跨模态身份映射（CIM），这是一种无需额外注释即可增强图像描述的强化学习框架。具体而言，该方法从两个角度定量评估信息损失：画廊表示一致性与查询-画廊图像相关性。在这些指标的监督下，LVLM最小化信息损失并旨在实现从图像到描述的身份映射。实验结果表明，即使与监督微调相比，我们的方法在图像描述方面也表现出更优的性能。特别是在COCO-LN500基准上，CIM在Qwen2.5-VL-7B上的关系推理方面取得了20%的改进。论文被接受后将发布代码。

Summary / 总结

This paper addresses the issue of information loss in image caption generation by LVLMs, proposing a reinforcement learning framework called Cross-modal Identity Mapping (CIM). CIM evaluates information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance, and trains LVLMs to minimize this loss. Experiments show that CIM outperforms even Supervised Fine-Tuning on image captioning; on the COCO-LN500 benchmark, it achieves a 20% improvement in relation reasoning with Qwen2.5-VL-7B.

本文提出了一种名为跨模态身份映射（CIM）的强化学习框架，以解决大型视觉语言模型（LVLM）在图像描述生成中信息丢失的问题。CIM从画廊表示一致性与查询-画廊图像相关性两个角度评估信息丢失，并训练LVLM减少信息丢失。实验结果显示，CIM在图像描述任务上甚至优于监督微调；在COCO-LN500基准上，CIM使Qwen2.5-VL-7B的关系推理性能提升了20%。

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Authors: Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, Qing Li

Venue: ICLR 2026

First: 2026-03-02T10:24:04+00:00 · Latest: 2026-03-02T10:24:04+00:00

Comments: ICLR 2026

Abstract

Reward design is of great importance for solving complex tasks with reinforcement learning. Recent studies have explored using image-text similarity produced by vision-language models (VLMs) to augment rewards of a task with visual feedback. A common practice linearly adds VLM scores to task or success rewards without explicit shaping, potentially altering the optimal policy. Moreover, such approaches, often relying on single static images, struggle with tasks whose desired behavior involves complex, dynamic motions spanning multiple visually different states. Furthermore, single viewpoints can occlude critical aspects of an agent's behavior. To address these issues, this paper presents Multi-View Video Reward Shaping (MVR), a framework that models the relevance of states regarding the target task using videos captured from multiple viewpoints. MVR leverages video-text similarity from a frozen pre-trained VLM to learn a state relevance function that mitigates the bias towards specific static poses inherent in image-based methods. Additionally, we introduce a state-dependent reward shaping formulation that integrates task-specific rewards and VLM-based guidance, automatically reducing the influence of VLM guidance once the desired motion pattern is achieved. We confirm the efficacy of the proposed framework with extensive experiments on challenging humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, verifying the design choices through ablation studies.

中文标题/摘要

标题：MVR：多视角视频奖励塑造在强化学习中的应用

奖励设计对于使用强化学习解决复杂任务至关重要。近期研究探索使用视觉语言模型（VLMs）生成的图像-文本相似度来增强任务的视觉反馈奖励。一种常见做法是线性地将VLM分数添加到任务或成功奖励中，而没有明确的塑造，这可能会改变最优策略。此外，这些方法通常依赖于单张静态图像，难以处理涉及复杂动态动作的任务，这些动作跨越多个视觉上不同的状态。此外，单一视角可能会遮挡代理行为的关键方面。为解决这些问题，本文提出了多视角视频奖励塑造（MVR）框架，该框架使用从多个视角捕获的视频来建模状态与目标任务的相关性。MVR利用冻结预训练的VLM的视频-文本相似度来学习一个状态相关性函数，以减轻基于图像方法固有的偏向特定静态姿态的偏差。此外，我们引入了一种状态依赖的奖励塑造公式，该公式结合了任务特定奖励和VLM指导，一旦实现期望的运动模式，自动减少VLM指导的影响。通过在HumanoidBench的复杂人形运动任务和MetaWorld的操纵任务上的广泛实验，我们验证了所提出框架的有效性，并通过消融研究验证了设计选择。

Summary / 总结

The paper addresses reward design for complex reinforcement learning tasks by proposing Multi-View Video Reward Shaping (MVR). MVR uses videos captured from multiple viewpoints and video-text similarity from a frozen pre-trained VLM to learn a state relevance function, which mitigates the bias toward specific static poses inherent in image-based methods. A state-dependent reward shaping formulation combines task-specific rewards with VLM-based guidance and automatically reduces the influence of that guidance once the desired motion pattern is achieved. Experiments on humanoid locomotion tasks from HumanoidBench and manipulation tasks from MetaWorld, together with ablation studies, support the design.

本文提出了一种多视角视频奖励塑形方法（MVR），以解决复杂任务中强化学习中的奖励设计难题。MVR 利用多视角视频来建模状态与任务的相关性，通过冻结的预训练视觉-语言模型学习状态相关函数，从而减轻对特定静态姿态的偏见，并将任务特定奖励与基于视觉-语言模型的指导相结合，在达到期望的运动模式后自动减少后者的影响。在人形机器人运动和操作任务上的实验验证了MVR的有效性。
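
The abstract notes that linearly adding VLM scores to task rewards can alter the optimal policy. For context, the sketch below shows classic potential-based reward shaping, a standard technique known to leave the optimal policy unchanged. It is not MVR's state-dependent formulation, which the abstract does not spell out. The potential values `phis` stand in for a hypothetical state relevance score, and all names are illustrative.

```python
def shaped_reward(task_reward, phi_s, phi_next, gamma):
    """Potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return task_reward + gamma * phi_next - phi_s


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


gamma = 0.9
phis = [0.1, 0.4, 0.9, 1.0]    # hypothetical relevance of states s_0..s_3
task_rewards = [0.0, 0.0, 1.0]  # task reward for each of the 3 transitions

shaped = [
    shaped_reward(r, phis[t], phis[t + 1], gamma)
    for t, r in enumerate(task_rewards)
]

# The shaping terms telescope: the return changes only by a quantity that
# depends on the first and last state, not on the actions taken in between.
return_gap = discounted_return(shaped, gamma) - discounted_return(task_rewards, gamma)
```

Because the gap equals `gamma**3 * phis[3] - phis[0]` for any path between the same endpoints, the extra guidance speeds up learning without changing which policy is optimal.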

QIME: Constructing Interpretable Medical Text Embeddings via Ontology-Grounded Questions

Authors: Yixuan Tang, Zhenghong Lin, Yandong Sun, Anthony K. H. Tung

First: 2026-03-02T10:18:06+00:00 · Latest: 2026-03-02T10:18:06+00:00

Abstract

While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.

中文标题/摘要

标题：QIME：基于本体导向问题构建可解释的医学文本嵌入

虽然密集的生物医学嵌入表现出色，但其黑盒性质限制了其在临床决策中的应用。最近基于问题的可解释嵌入将文本表示为自然语言问题的二元答案，但这些方法往往依赖于启发式或表面级对比信号，忽略了专门领域的知识。我们提出QIME，一种基于本体的框架，用于构建可解释的医学文本嵌入，其中每个维度对应一个临床有意义的“是/否”问题。通过条件化特定簇的医学概念签名，QIME 生成语义原子问题，捕捉生物医学文本中的细微差别。此外，QIME 支持无需训练的嵌入构建策略，消除每个问题分类器的训练，进一步提高性能。在生物医学语义相似性、聚类和检索基准测试中，QIME 一致优于先前的可解释嵌入方法，并显著缩小了与强大的黑盒生物医学编码器之间的差距，同时提供简洁且临床相关的解释。

Summary / 总结

QIME is an ontology-grounded framework that constructs interpretable medical text embeddings in which each dimension is the binary answer to a clinically meaningful yes/no question. It conditions on cluster-specific medical concept signatures to generate semantically atomic questions that capture fine-grained distinctions in biomedical text, and it supports a training-free construction strategy that removes per-question classifier training. Across biomedical semantic similarity, clustering, and retrieval benchmarks, QIME outperforms previous interpretable embedding methods and substantially narrows the gap to strong black-box encoders, while offering concise and clinically informative explanations.

QIME 是一个基于本体的框架，用于构建可解释的医学文本嵌入，其中每个维度对应一个具有临床意义的“是/否”问题。它利用特定簇的医学概念签名生成语义原子问题，从而捕捉生物医学文本中的细微差别。QIME 的性能优于之前的可解释嵌入方法，并显著缩小了与强大的黑盒编码器之间的差距，同时提供简洁且临床相关的解释。
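
The idea of an embedding whose dimensions are yes/no answers can be illustrated with a toy example. Everything below is an assumption made for illustration: the three questions, the keyword lists, and the keyword-matching answerer are hypothetical stand-ins, and the sketch does not attempt QIME's ontology-grounded question generation or its training-free answering strategy.

```python
from typing import Dict, List

# Hypothetical questions, each paired with keywords used by the toy answerer.
QUESTIONS: Dict[str, List[str]] = {
    "Does the text mention a cardiovascular condition?": ["myocardial", "hypertension", "cardiac"],
    "Does the text mention a drug treatment?": ["aspirin", "metformin", "statin"],
    "Does the text mention diabetes?": ["diabetes", "diabetic"],
}


def keyword_answer(text: str, keywords: List[str]) -> int:
    """Stand-in answerer: 1 if any keyword occurs in the text, else 0."""
    lowered = text.lower()
    return int(any(k in lowered for k in keywords))


def embed(text: str) -> List[int]:
    """One binary dimension per question, in the order of QUESTIONS."""
    return [keyword_answer(text, kws) for kws in QUESTIONS.values()]


def explain_match(a: str, b: str) -> List[str]:
    """Questions answered 'yes' for both texts, i.e. why they look similar."""
    return [q for q, x, y in zip(QUESTIONS, embed(a), embed(b)) if x == 1 and y == 1]


note_a = "Patient with hypertension started on aspirin."
note_b = "Type 2 diabetes managed with metformin."
```

Because every dimension is tied to a readable question, the overlap between two embeddings can be reported directly as the list of questions both texts answer with yes.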

Exploring Cross-Modal Flows for Few-Shot Learning

Authors: Ziqi Jiang, Yanghao Wang, Long Chen

First: 2025-10-16T10:32:48+00:00 · Latest: 2026-03-02T09:18:55+00:00

Comments: 17 pages

Abstract

Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. This is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.

中文标题/摘要

标题：探索少样本学习中的跨模态流动

不同模态特征的对齐是跨模态任务中最基本的挑战之一。尽管预训练的视觉-语言模型可以在图像和文本之间实现一般对齐，但它们通常需要参数高效微调（PEFT）进行进一步调整。当前的PEFT方法（例如提示微调、LoRA基或适配器基）总是选择性地微调一部分参数，这可以轻微调整视觉或文本特征，避免过拟合。在本文中，我们首次指出，所有现有的PEFT方法都是一步调整。对于特征高度纠缠的复杂（或困难）数据集来说是不够的。为此，我们提出了第一个模型无关的多步调整方法，通过学习跨模态速度场：流动匹配对齐（FMA）。具体来说，为了在训练过程中确保类别的对应关系，我们首先使用固定耦合策略。然后，我们提出了一种噪声增强策略来缓解数据稀缺问题。最后，我们设计了一个早期停止求解器，该求解器在更早的阶段终止变换过程，提高效率和准确性。与一步PEFT方法相比，FMA具有多步校正能力，可以实现更精确和稳健的对齐。广泛的结果表明，FMA可以在各种基准和骨干网络上一致地获得显著的性能提升，特别是在具有挑战性的数据集上。

Summary / 总结

This paper addresses the challenge of aligning features from different modalities in cross-modal tasks, particularly in few-shot learning scenarios. It introduces Flow Matching Alignment (FMA), a multi-step adjustment approach that learns a cross-modal velocity field, overcoming the limitations of one-step parameter-efficient fine-tuning methods. FMA uses a fixed coupling strategy, noise augmentation, and an early-stopping solver to achieve more precise and robust alignment, leading to significant performance gains across various benchmarks and backbones, especially on challenging datasets.

本文针对跨模态任务中不同模态特征对齐的挑战，特别是在少样本学习场景中，提出了一种多步调整方法 Flow Matching Alignment (FMA)。该方法通过学习跨模态速度场，实现比单步参数高效微调方法更精确、更鲁棒的对齐。该方法包括固定耦合策略、用于缓解数据稀缺的噪声增强策略，以及用于提高效率和准确性的早期停止求解器。实验表明，FMA 在各种基准和骨干网络上均取得了显著的性能提升，特别是在具有挑战性的数据集上。
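
A multi-step adjustment with early stopping can be pictured as Euler integration of a velocity field that is cut short before the final step. The sketch below is a toy under assumptions: the velocity field is a hand-written constant rather than a learned network, and `euler_transport`, `num_steps`, and `stop_after` are hypothetical names, not FMA's API.

```python
def euler_transport(x, velocity, num_steps=10, stop_after=None):
    """Move feature x along velocity(x, t) with fixed-size Euler steps.

    stop_after: if given, only the first `stop_after` steps are taken,
                which mimics an early-stopping solver.
    """
    steps = num_steps if stop_after is None else min(stop_after, num_steps)
    dt = 1.0 / num_steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t):
    """Constant field pointing from the source feature to the target
    feature [1.0, 2.0]; a real model would predict this from (x, t)."""
    return [1.0, 2.0]


full = euler_transport([0.0, 0.0], toy_velocity, num_steps=10)
early = euler_transport([0.0, 0.0], toy_velocity, num_steps=10, stop_after=5)
```

With the full schedule the feature reaches the target, while stopping after five of ten steps leaves it halfway. A one-step method corresponds to `num_steps=1`, which has no intermediate points at which to correct the path.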

Learning Structured Reasoning via Tractable Trajectory Control

Authors: Po-Nien Kung, Zhen Yang, Jeffrey Luo, Cheng-Fu Yang, Haikang Deng, Zi-Yi Dou, Yinfei Yang, Nanyun Peng, Zhe Gan, Kai-Wei Chang

First: 2026-03-02T09:18:19+00:00 · Latest: 2026-03-02T09:18:19+00:00

Abstract

Large language models can exhibit emergent reasoning behaviors, often manifested as recurring lexical patterns (e.g., "wait," indicating verification). However, complex reasoning trajectories remain sparse in unconstrained sampling, and standard RL often fails to guarantee the acquisition of diverse reasoning behaviors. We propose a systematic discovery and reinforcement of diverse reasoning patterns through structured reasoning, a paradigm that requires targeted exploration of specific reasoning patterns during the RL process. To this end, we propose Ctrl-R, a framework for learning structured reasoning via tractable trajectory control that actively guides the rollout process, incentivizing the exploration of diverse reasoning patterns that are critical for complex problem-solving. The resulting behavior policy enables accurate importance-sampling estimation, supporting unbiased on-policy optimization. We further introduce a power-scaling factor on the importance-sampling weights, allowing the policy to selectively learn from exploratory, out-of-distribution trajectories while maintaining stable optimization. Experiments demonstrate that Ctrl-R enables effective exploration and internalization of previously unattainable reasoning patterns, yielding consistent improvements across language and vision-language models on mathematical reasoning tasks.

中文标题/摘要

标题：通过可处理轨迹控制学习结构化推理

大型语言模型可以表现出涌现的推理行为，通常表现为反复出现的词汇模式（例如，“wait”，表示验证）。然而，在不受约束的采样中，复杂的推理轨迹仍然稀少，标准的强化学习往往无法保证获得多样化的推理行为。我们提出通过结构化推理系统地发现和强化多样化的推理模式，这一范式要求在强化学习过程中有针对性地探索特定的推理模式。为此，我们提出了Ctrl-R框架，通过可处理的轨迹控制学习结构化推理，该框架主动引导 rollout 过程，激励探索对于复杂问题解决至关重要的多样化推理模式。由此产生的行为策略能够实现精确的重要性采样估计，支持无偏的同策略（on-policy）优化。我们还在重要性采样权重上引入了幂缩放因子，使策略能够有选择地从探索性的分布外轨迹中学习，同时保持优化的稳定性。实验表明，Ctrl-R能够有效探索和内化以前无法获得的推理模式，在数学推理任务上为语言模型和视觉-语言模型带来了一致的改进。

Summary / 总结

The research aims to enhance the reasoning capabilities of large language models by addressing the limitations of standard reinforcement learning (RL) in acquiring diverse reasoning behaviors. The proposed method, Ctrl-R, introduces structured reasoning through tractable trajectory control, guiding the RL process to explore and internalize complex reasoning patterns. Key experimental findings show that Ctrl-R effectively enables the discovery and reinforcement of diverse reasoning patterns, leading to consistent improvements in mathematical reasoning tasks for both language and vision-language models.

研究旨在通过解决标准强化学习（RL）在获取多样化推理行为方面的局限性，来提升大型语言模型的推理能力。提出的Ctrl-R方法通过可处理轨迹控制引入结构化推理，引导RL过程探索和内化复杂的推理模式。实验结果表明，Ctrl-R能够有效发现和强化多样化的推理模式，从而在语言和跨模态模型的数学推理任务中取得一致的改进。
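
One common way to temper importance-sampling weights is to raise the probability ratio to a power between 0 and 1. The sketch below shows that generic idea. Whether Ctrl-R's power-scaling factor takes exactly this form is an assumption, and the function name and the parameter `alpha` are hypothetical.

```python
import math


def power_scaled_weight(logp_target, logp_behavior, alpha):
    """Importance weight (pi_target / pi_behavior) ** alpha, in log space.

    alpha = 1.0 gives the ordinary importance weight;
    alpha = 0.0 ignores the mismatch between the two policies entirely.
    """
    return math.exp(alpha * (logp_target - logp_behavior))


# A guided trajectory that the behavior policy samples with probability 0.25
# but the current policy would only produce with probability 0.01.
logp_target = math.log(0.01)
logp_behavior = math.log(0.25)

plain = power_scaled_weight(logp_target, logp_behavior, alpha=1.0)
tempered = power_scaled_weight(logp_target, logp_behavior, alpha=0.5)
```

With the ordinary weight, an out-of-distribution trajectory like this one contributes very little to the update. An exponent below 1 pulls the weight toward 1, so such exploratory samples carry more signal, at the cost of some bias.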

Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

Authors: Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank

First: 2026-03-02T08:59:39+00:00 · Latest: 2026-03-02T08:59:39+00:00

Comments: This is an extended version of a manuscript currently under review

Abstract

Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.

中文标题/摘要

标题：测量VLMs未提及的内容：验证指标隐藏放射学报告生成中的临床术语消除

在放射学中可靠部署视觉-语言模型（VLMs）需要超越表面级文本相似性的验证指标，以确保临床准确性和人口统计公平性。本文探讨了当前模型评估中的一个关键盲点：使用导致高总体令牌重叠分数的解码策略，尽管这些模型陷入了模板坍缩，只生成重复的安全通用文本，而忽略了临床术语。如果不解决这一盲点，可能会导致指标游戏，即在基准测试中表现良好的模型在临床信息方面可能是无用的。相反，我们提倡使用词汇多样性指标来检查模型生成的临床特异性。我们引入了临床关联位移（CAD），这是一种词汇层面的框架，用于量化生成报告中基于人口统计的词关联的变化。加权关联消除（WAE）汇总这些变化，以衡量不同人口统计群体中的临床信号损失。我们表明，确定性解码会产生高水平的语义消除，而随机采样则生成多样化的输出，但存在引入新偏见的风险，这促使我们重新思考“最佳”报告的定义。

Summary / 总结

This paper addresses the need for more robust validation metrics in the deployment of Vision-Language Models (VLMs) in radiology, focusing on the issue of template collapse where models generate repetitive, generic text. The authors introduce Clinical Association Displacement (CAD) and Weighted Association Erasure (WAE) to measure shifts in clinical specificity and demographic bias in generated reports. They find that deterministic decoding leads to significant semantic erasure, while stochastic sampling can introduce new biases, suggesting a need to redefine what constitutes optimal reporting in clinical contexts.

本文探讨了在放射学中部署Vision-Language模型时需要更可靠的验证指标，重点关注模板坍缩问题，即模型生成重复的通用文本。作者引入了临床关联位移(CAD)和加权关联消除(WAE)来衡量生成报告中临床特异性及人口统计学偏差的变化。研究发现，确定性解码会导致显著的语义消除，而随机采样可能会引入新的偏差，这表明需要重新定义临床环境中“最佳”报告的标准。
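
To illustrate why lexical diversity can expose template collapse, the sketch below computes two simple vocabulary statistics over a set of reports. These are generic measures chosen for illustration, not the paper's CAD or WAE metrics, and the example reports are invented.

```python
def type_token_ratio(reports):
    """Distinct words divided by total words across all reports."""
    tokens = [w for r in reports for w in r.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def vocabulary_retention(reference_reports, generated_reports):
    """Share of the reference vocabulary that also appears in the output."""
    ref = {w for r in reference_reports for w in r.lower().split()}
    gen = {w for r in generated_reports for w in r.lower().split()}
    return len(ref & gen) / len(ref) if ref else 1.0


# Invented examples: varied references versus a collapsed, templated output.
references = [
    "no acute findings",
    "small left pleural effusion",
    "mild cardiomegaly noted",
]
collapsed = ["no acute findings"] * 3

collapsed_ttr = type_token_ratio(collapsed)
reference_ttr = type_token_ratio(references)
retention = vocabulary_retention(references, collapsed)
```

A templated generator can still score well on token overlap whenever the template resembles many references, while both statistics above drop sharply, which is the kind of gap the paper argues aggregate overlap metrics hide.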

Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Authors: Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon

Venue: CVPR 2026

First: 2026-03-02T08:59:11+00:00 · Latest: 2026-03-02T08:59:11+00:00

Comments: CVPR 2026

Abstract

Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.

中文标题/摘要

标题：自适应光谱特征预测在扩散采样加速中的应用

扩散模型已成为高保真图像和视频生成的主要工具，但由于扩散变换器需要多次迭代，其推理速度成为瓶颈。为了减少计算量，最近的研究工作采用了特征缓存和重用方案，在选定的扩散步骤中跳过网络评估，使用前一步骤中的缓存特征。然而，它们的初步设计仅依赖于局部近似，导致误差在大跳步时迅速增长，从而在高速度提升时降低样本质量。在本文中，我们提出了一种无需训练的光谱扩散特征预测器（Spectrum），它能够实现全局、长距离的特征重用，并且误差得到严格控制。特别地，我们将去噪器的潜在特征视为时间上的函数，并使用切比雪夫多项式进行近似。具体来说，我们通过岭回归拟合每个基的系数，然后利用这些系数预测多个未来扩散步骤的特征。我们从理论上揭示了我们的方法在长预测期具有更优的行为，并且误差不会随着步长的增加而累积。在各种最先进的图像和视频扩散模型上的广泛实验一致验证了我们方法的优越性。值得注意的是，我们在FLUX.1上实现了高达4.79倍的加速，在Wan2.1-14B上实现了4.67倍的加速，同时保持了比基线更高的样本质量。

Summary / 总结

This work addresses the slow inference speed of diffusion models by proposing a spectral diffusion feature forecaster (Spectrum) that enables global, long-range feature reuse with controlled error. The method approximates latent features with Chebyshev polynomials and uses ridge regression to forecast features at multiple future steps, leading to up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B with higher sample quality compared to baselines.

该研究通过提出一种光谱扩散特征预测器（Spectrum），实现全局特征重用并控制误差，以解决扩散模型的推理速度慢问题。它使用切比雪夫多项式近似潜特征，并通过岭回归拟合系数以预测多个未来步骤的特征。实验表明，Spectrum 在 FLUX.1 上可实现高达 4.79 倍的加速，在 Wan2.1-14B 上可实现 4.67 倍的加速，同时保持更高的样本质量，优于基线方法。
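
The forecasting idea in the Spectrum entry above (treat each denoiser feature as a function of time, fit Chebyshev coefficients by ridge regression, then evaluate the fit at future steps) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the polynomial degree, the ridge weight `lam`, the mapping of step indices to [-1, 1], and the toy two-channel "features" are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander


def to_unit_interval(steps, num_steps):
    """Map step indices 0..num_steps-1 to [-1, 1], the Chebyshev domain."""
    return 2.0 * np.asarray(steps, dtype=float) / (num_steps - 1) - 1.0


def fit_chebyshev_ridge(t, feats, degree=3, lam=1e-3):
    """Ridge-regress Chebyshev coefficients for every feature channel.

    t:     (n,) times in [-1, 1] at which features were actually computed
    feats: (n, d) cached features (hypothetical stand-in for denoiser features)
    Returns coefficients of shape (degree + 1, d).
    """
    phi = chebvander(t, degree)                    # (n, degree + 1) basis matrix
    gram = phi.T @ phi + lam * np.eye(degree + 1)  # ridge-regularized normal equations
    return np.linalg.solve(gram, phi.T @ feats)


def forecast(coef, t_future):
    """Evaluate the fitted polynomials at future times (no network call)."""
    phi = chebvander(np.atleast_1d(t_future), coef.shape[0] - 1)
    return phi @ coef


def toy(t):
    """Smooth two-channel toy features, standing in for real activations."""
    return np.stack([t**2, 1.0 - t], axis=1)


# Toy run: 50 sampling steps, features "computed" at every fifth step up to 25.
num_steps = 50
seen = to_unit_interval(np.arange(0, 26, 5), num_steps)
# Tiny ridge weight because the toy data is noise-free.
coef = fit_chebyshev_ridge(seen, toy(seen), degree=3, lam=1e-10)
future = to_unit_interval([30, 35], num_steps)
pred = forecast(coef, future)                      # forecasts for two skipped steps
```

In this toy setting the features are exact low-degree polynomials, so the forecasts track the true values closely; real denoiser features would need a degree and ridge weight chosen empirically.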

StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

Authors: Zanxi Ruan, Songqun Gao, Qiuyu Kong, Yiming Wang, Marco Cristani

Venue: CVPR 2026

First: 2026-02-23T17:57:37+00:00 · Latest: 2026-03-02T08:46:07+00:00

Comments: Accepted by CVPR 2026

Abstract

Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.

中文标题/摘要

标题：StructXLIP：通过多模态结构线索增强视觉语言模型

基于边缘的表示是视觉理解的基本线索，这一原则根植于早期视觉研究，并且至今仍然至关重要。我们将其原则扩展到视觉语言对齐中，表明隔离和对齐跨模态的结构线索可以极大地提高对长、细节丰富的描述进行微调的效果，特别是改善跨模态检索。我们引入了StructXLIP，这是一种微调对齐范式，提取边缘图（例如Canny），将其视为图像视觉结构的代理，并过滤相应的描述以强调结构线索，使其“结构为中心”。微调通过在标准对齐损失中增加三种结构中心损失来增强：（i）将边缘图与结构文本对齐，（ii）匹配局部边缘区域与文本片段，（iii）将边缘图与颜色图像连接以防止表示漂移。从理论角度来看，虽然标准CLIP最大化视觉和文本嵌入之间的互信息，但StructXLIP还最大化了多模态结构表示之间的互信息。这种辅助优化本质上更难，引导模型向更稳健和语义稳定的极小值发展，增强视觉语言对齐。除了在通用和专门领域中均优于当前竞争对手的跨模态检索之外，我们的方法还提供了一种通用的增强配方，可以以即插即用的方式集成到未来的方案中。代码和预训练模型可在：https://github.com/intelligolabs/StructXLIP公开获取。

Summary / 总结

The paper introduces StructXLIP, a method that enhances vision-language models by aligning structural cues across modalities. It focuses on improving cross-modal retrieval for long, detailed captions by extracting edge maps and filtering captions to emphasize structural information. The method includes three structure-centric losses that align edge maps with text, match local edge regions to textual chunks, and connect edge maps to color images to prevent representation drift. StructXLIP outperforms current competitors on cross-modal retrieval and can be integrated into future approaches as a general boosting recipe.

研究旨在通过引入边缘图中的结构线索来增强视觉-语言模型，这些线索对于视觉理解至关重要。方法StructXLIP引入了一种微调对齐范式，将边缘图与结构文本对齐，匹配局部边缘区域与文本片段，并将边缘图连接到彩色图像以防止表示漂移。关键实验发现表明，StructXLIP在通用和专门领域中的跨模态检索任务中均优于当前竞争对手，并可作为通用增强配方以即插即用的方式集成到未来方法中。
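
As a rough illustration of how the StructXLIP entry above combines a standard alignment loss with three structure-centric terms, the sketch below sums four symmetric contrastive (CLIP-style) losses over precomputed embeddings. The exact form of each term, the weights, the temperature, and all function and argument names are assumptions for illustration; the abstract does not specify them, and the released code may differ.

```python
import numpy as np


def info_nce(a, b, temperature=0.07):
    """Symmetric CLIP-style contrastive loss between two batches of embeddings.

    Row i of `a` and row i of `b` are treated as a matching pair.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)                 # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))                   # targets on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def structure_augmented_loss(img, txt, edge, struct_txt, edge_regions, txt_chunks,
                             weights=(1.0, 1.0, 1.0)):
    """Standard alignment loss plus three structure-centric terms (hypothetical forms)."""
    w1, w2, w3 = weights
    base = info_nce(img, txt)                          # standard image-text alignment
    global_struct = info_nce(edge, struct_txt)         # (i) edge maps vs. structural text
    local_struct = info_nce(edge_regions, txt_chunks)  # (ii) edge regions vs. text chunks
    anchor = info_nce(edge, img)                       # (iii) edge maps vs. color images
    return base + w1 * global_struct + w2 * local_struct + w3 * anchor
```

Each argument is a `(batch, dim)` array of embeddings from whatever encoders are in use; term (iii) ties the edge-map embedding back to the color-image embedding, which is the role the abstract describes as preventing representation drift.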

Sparse View Distractor-Free Gaussian Splatting

Authors: Yi Gu, Zhaorui Wang, Jiahang Cao, Jiaxu Wang, Mingle Zhao, Dongjun Ye, Renjing Xu

First: 2026-03-02T08:32:32+00:00 · Latest: 2026-03-02T08:32:32+00:00

Abstract

3D Gaussian Splatting (3DGS) enables efficient training and fast novel view synthesis in static environments. To address challenges posed by transient objects, distractor-free 3DGS methods have emerged and shown promising results when dense image captures are available. However, their performance degrades significantly under sparse input conditions. This limitation primarily stems from the reliance on the color residual heuristics to guide the training, which becomes unreliable with limited observations. In this work, we propose a framework to enhance distractor-free 3DGS under sparse-view conditions by incorporating rich prior information. Specifically, we first adopt the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points. Then, we harness the attention maps from VGGT for efficient and accurate semantic entity matching. Additionally, we utilize Vision-Language Models (VLMs) to further identify and preserve the large static regions in the scene. We also demonstrate how these priors can be seamlessly integrated into existing distractor-free 3DGS methods. Extensive experiments confirm the effectiveness and robustness of our approach in mitigating transient distractors for sparse-view 3DGS training.

中文标题/摘要

标题：稀疏视角无干扰高斯点云渲染

3D高斯点云渲染（3DGS）能够在静态环境中实现高效的训练和快速的新视角合成。为了解决瞬态物体带来的挑战，已经出现了无干扰的3DGS方法，并在密集图像捕获可用时展示了有希望的结果。然而，在稀疏输入条件下，它们的性能显著下降。这一限制主要源于对颜色残差启发式的依赖，该启发式在有限的观察下变得不可靠。在本文中，我们提出了一种框架，以增强在稀疏视角条件下的无干扰3DGS，通过引入丰富的先验信息。具体来说，我们首先采用几何基础模型VGGT来估计相机参数并生成密集的初始3D点集。然后，我们利用VGGT的注意力图进行高效的语义实体匹配。此外，我们利用视觉语言模型（VLMs）进一步识别并保留场景中的大面积静态区域。我们还展示了这些先验如何无缝集成到现有的无干扰3DGS方法中。广泛的实验验证了我们方法在稀疏视角3DGS训练中减轻瞬态干扰的有效性和鲁棒性。

Summary / 总结

The research aims to improve distractor-free 3D Gaussian Splatting (3DGS) under sparse-view conditions, where the color-residual heuristics used to detect transient objects become unreliable. The method uses the geometry foundation model VGGT to estimate camera parameters and generate a dense set of initial 3D points, reuses VGGT attention maps for semantic entity matching, and uses Vision-Language Models to identify and preserve large static regions. Experiments show that the approach mitigates transient distractors in sparse-view 3DGS training.

该研究针对稀疏视图条件下3D高斯点绘制（3DGS）中的瞬态物体问题，提出了一种框架，结合几何和语义先验信息。方法使用VGGT估计相机参数并生成密集的3D点，利用注意力图和视觉语言模型进行语义实体匹配并保留静态区域。实验表明，该方法有效减少了瞬态干扰物的影响，提高了3DGS在稀疏输入条件下的鲁棒性。

Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

Authors: Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

Venue: ICLR 2026

First: 2025-10-22T09:20:09+00:00 · Latest: 2026-03-02T08:22:31+00:00

Comments: Accepted to ICLR 2026. Camera-ready version. Project page: https://aaronfengzy.github.io/MV-RoboBench-Webpage/

Abstract

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.

中文标题/摘要

标题：跨越视角：视觉语言模型在机器人场景中空间推理能力基准测试

视觉语言模型（VLMs）是具身智能（Embodied AI）的关键，使机器人能够在复杂环境中感知、推理和行动。它们也是最近视觉-语言-动作（VLA）模型的基础。然而，大多数对VLMs的评估集中在单视角设置上，而对其整合多视角信息的能力则研究不足。同时，多摄像头设置在机器人平台上越来越普遍，因为它们提供了互补的视角，以减轻遮挡和深度模糊。因此，VLMs能否有效利用此类多视角输入进行机器人推理仍是一个开放问题。为了弥合这一差距，我们引入了MV-RoboBench，这是一个专门设计来评估VLMs在机器人操作中多视角空间推理能力的基准测试。MV-RoboBench 包含约1700个人工整理的问答项，涵盖八个子任务，分为两大类：空间理解与机器人执行。我们评估了多种现有的VLMs，包括开源和闭源模型，以及结合了CoT启发技术的增强版本。结果显示，最先进的模型仍远低于人类表现，突显了VLMs在多视角机器人感知方面面临的巨大挑战。此外，我们的分析揭示了两个关键发现：（i）在多视角机器人场景中，空间智能与机器人任务执行正相关；（ii）在现有通用单视角空间理解基准测试中表现出色，并不能可靠地转化为在我们基准所评估的机器人空间任务上的成功。我们以开放资源的形式发布MV-RoboBench，以促进具备空间感知能力的VLMs和VLAs的发展，不仅提供数据，还提供多视角具身推理的标准评估协议。

Summary / 总结

This paper introduces MV-RoboBench, a benchmark to evaluate the multi-view spatial reasoning capabilities of vision-language models (VLMs) in robotic manipulation. The study focuses on the ability of VLMs to integrate information from multiple views, which is crucial for robotic perception in complex environments. The evaluation includes a diverse set of existing VLMs and reveals that state-of-the-art models perform significantly below human levels, highlighting the challenges in multi-view robotic perception. Key findings include the positive correlation between spatial intelligence and robotic task execution, and the lack of reliable transfer from general-purpose single-view benchmarks to robotic spatial tasks.

论文介绍了MV-RoboBench，这是一个用于评估视觉语言模型（VLM）在机器人操作中多视角空间推理能力的基准。该基准包含1,700个问答项，涵盖八个子任务，重点是空间理解和机器人执行。评估结果显示，最先进的模型在多视角机器人感知方面的表现远低于人类水平，突显了这一领域的挑战。研究还发现，在多视角场景中，空间智能与机器人任务执行之间存在正相关关系，并且在通用单视角基准上的表现并不能可靠地转化为机器人任务的成功。

FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems

Authors: Minwoo Kim, Seunghyeok Shin, Hongki Lim

Venue: ICLR 2026

First: 2026-03-02T08:17:26+00:00 · Latest: 2026-03-02T08:17:26+00:00

Abstract

Training-free diffusion priors enable inverse-problem solvers without retraining, but for nonlinear forward operators data consistency often relies on repeated derivatives or inner optimization/MCMC loops with conservative step sizes, incurring many iterations and denoiser/score evaluations. We propose a training-free solver that replaces these inner loops with a hard measurement-space feasibility constraint (closed-form projection) and an analytic, model-optimal step size, enabling a small, fixed compute budget per noise level. Anchored at the denoiser prediction, the correction is approximated via an adjoint-free, ADMM-style splitting with projection and a few steepest-descent updates, using one VJP and either one JVP or a forward-difference probe, followed by backtracking and decoupled re-annealing. We prove local model optimality and descent under backtracking for the step-size rule, and derive an explicit KL bound for mode-substitution re-annealing under a local Gaussian conditional surrogate. We also develop a latent variant and a one-parameter pixel→latent hybrid schedule. Experiments achieve competitive PSNR/SSIM/LPIPS with up to 19.5× speedup, without hand-coded adjoints or inner MCMC.

中文标题/摘要

标题：FAST-DIPS：面向扩散先验逆问题的无伴随解析步长与硬约束似然校正

无需训练的扩散先验能够使逆问题求解器无需重新训练，但在非线性前向算子的情况下，数据一致性往往依赖于重复的导数或采用保守步长的内层优化/MCMC 循环，需要许多迭代和降噪器/分数评估。我们提出了一种无需训练的求解器，用硬测量空间可行性约束（闭式投影）和解析的、模型最优的步长替换这些内层循环，使每个噪声级别下的计算预算保持较小且固定。以降噪器预测为锚点，校正通过无伴随的ADMM风格分裂方法近似，该方法包括投影和少量最速下降更新，使用一次VJP以及一次JVP或一次前向差分探针，随后进行回溯和解耦重新退火。我们证明了步长规则的局部模型最优性和回溯下的下降性，并在局部高斯条件近似下推导了模式替代重新退火的显式KL界。我们还开发了一种潜变量变体和一种单参数的像素→潜变量混合调度。实验结果显示，在无需手工编写伴随算子或内层MCMC的情况下，PSNR/SSIM/LPIPS达到有竞争力的水平，速度提升高达19.5倍。

Summary / 总结

The research aims to improve the efficiency of training-free inverse-problem solvers that use diffusion priors. The method replaces inner optimization/MCMC loops with a hard measurement-space feasibility constraint (a closed-form projection) and an analytic step size, giving a small, fixed compute budget per noise level. Experiments report competitive PSNR/SSIM/LPIPS with up to 19.5× speedup, without hand-coded adjoints or inner MCMC.

论文针对无需训练的扩散先验在求解逆问题时计算效率低的问题，尤其是非线性前向算子下需要反复求导和内层优化/MCMC 循环的情况。提出的方法使用硬测量空间可行性约束和解析的模型最优步长，使每个噪声级别的计算预算保持较小且固定。实验结果表明，该方法在 PSNR/SSIM/LPIPS 方面具有竞争力，速度提升高达 19.5 倍，且无需手工编写伴随算子或内层 MCMC。
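
To make the phrase "analytic step size with one VJP and a forward-difference probe" in the FAST-DIPS entry above more concrete, the sketch below takes one steepest-descent step on a least-squares data term, using the step size that exactly minimizes the linearized residual model, followed by simple backtracking. This is a generic textbook construction under stated assumptions, not the authors' algorithm: the projection, ADMM-style splitting, and re-annealing are omitted, the function names and `eps` are hypothetical, and the toy operator's hand-written transpose merely stands in for an autodiff VJP.

```python
import numpy as np


def analytic_descent_step(forward, vjp, x, y, eps=1e-4, max_backtracks=10):
    """One steepest-descent step on 0.5 * ||forward(x) - y||^2.

    forward: callable x -> measurements (may be nonlinear)
    vjp:     callable (x, r) -> J(x)^T r; in practice this would come from
             autodiff, so no hand-written adjoint of the operator is needed
    """
    r = forward(x) - y                               # measurement residual
    g = vjp(x, r)                                    # gradient direction (one VJP)
    jg = (forward(x + eps * g) - forward(x)) / eps   # forward-difference probe of J g
    denom = float(jg @ jg)
    if denom == 0.0:
        return x                                     # stationary (or flat) point
    # Minimizer over alpha of the linearized model 0.5 * ||r - alpha * J g||^2.
    alpha = float(g @ g) / denom
    f0 = 0.5 * float(r @ r)
    for _ in range(max_backtracks):
        x_new = x - alpha * g
        r_new = forward(x_new) - y
        if 0.5 * float(r_new @ r_new) < f0:          # accept only if the loss decreases
            return x_new
        alpha *= 0.5                                 # backtrack
    return x


# Toy linear operator; its transpose plays the role of the autodiff VJP here.
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
forward = lambda x: W @ x
vjp = lambda x, r: W.T @ r
y = forward(np.array([1.0, -2.0, 0.5]))
x0 = np.zeros(3)
x1 = analytic_descent_step(forward, vjp, x0, y)
```

The step size follows from setting the derivative of the linearized objective to zero: with g = Jᵀr, the optimum is α = ⟨r, Jg⟩ / ‖Jg‖² = ‖g‖² / ‖Jg‖², which needs only the two products the abstract mentions.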

Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory

Authors: Zhengtong Zhu, Jiaqing Fan, Zhixuan Liu, Fanzhang Li

First: 2026-03-02T07:15:41+00:00 · Latest: 2026-03-02T07:15:41+00:00

Comments: Accepted by AAAI 2026

Abstract

Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods are coupled in the processing of spatio-temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.

中文标题/摘要

标题：无需训练的空间-时间解耦推理视频分割与自适应对象记忆

推理视频对象分割（ReasonVOS）是一项具有挑战性的任务，需要使用隐式的复杂文本输入在视频序列中实现稳定的对象分割。先前的方法通过微调多模态大型语言模型（MLLMs）来生成分割输出，这需要大量的资源。此外，一些现有方法在处理空间-时间信息时是耦合的，这在一定程度上影响了模型的时间稳定性。为了解决这些问题，我们提出了一种无需训练的空间-时间解耦推理视频分割与自适应对象记忆（SDAM）方法。我们旨在设计一种无需训练的推理视频分割框架，该框架仅使用预训练模型就能超越需要微调的现有方法。同时，我们提出了一个自适应对象记忆模块，该模块根据不同视频序列中的运动线索选择并记忆关键对象。最后，我们提出了空间-时间解耦，以实现稳定的时间传播。在空间域中，我们实现了目标对象的精确定位和分割，而在时间域中，我们利用关键对象的时间信息来驱动稳定的跨帧传播。我们的方法在包括Ref-YouTubeVOS、Ref-DAVIS17、MeViS、ReasonVOS和ReVOS在内的五个基准数据集上取得了优异的结果。

Summary / 总结

The paper addresses Reasoning Video Object Segmentation (ReasonVOS) by proposing SDAM, a training-free framework that uses only pre-trained models. SDAM introduces an Adaptive Object Memory module that selects and memorizes key objects based on motion cues, and Spatio-temporal Decoupling for stable temporal propagation. The method achieves excellent results on five benchmark datasets: Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.

论文提出了一种名为SDAM的无需训练的框架，仅利用预训练模型解决推理视频对象分割问题。该框架引入了根据运动线索选择并记忆关键对象的自适应对象记忆模块，以及用于稳定时间传播的时空解耦。SDAM在Ref-YouTubeVOS、Ref-DAVIS17、MeViS、ReasonVOS和ReVOS五个基准数据集上取得了优异的结果。

Benchmarking Semantic Segmentation Models via Appearance and Geometry Attribute Editing

Authors: Zijin Yin, Bing Li, Kongming Liang, Hao Sun, Zhongjiang He, Zhanyu Ma, Jun Guo

First: 2026-03-02T07:05:37+00:00 · Latest: 2026-03-02T07:05:37+00:00

Comments: Submitted to IEEE TPAMI, under review

Abstract

Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behavior in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline, Gen4Seg, to stress-test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image-level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal-EA and COCO-EA. We benchmark a wide variety of semantic segmentation models, spanning from closed-set models to open-vocabulary large models. We have several key findings: 1) advanced open-vocabulary models do not exhibit greater robustness compared to closed-set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in-distribution and out-of-distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.

中文标题/摘要

标题：基于外观和几何属性编辑的语义分割模型基准测试

语义分割在自动驾驶和医学图像分析等众多应用中扮演着关键角色。在实际部署分割模型时，提前测试其在各种复杂场景中的行为至关重要。本文构建了一个自动数据生成管道Gen4Seg，通过生成具有不同属性变化的各种具有挑战性的样本来对语义分割模型进行压力测试。除了之前仅关注全局天气和风格迁移的评估范式外，我们还研究了在对象和图像级别上外观和几何属性的变化。这些变化包括对象颜色、材质、大小、位置，以及图像级别的变化如天气和风格。为了实现这一点，我们提出了一种使用扩散模型对现有真实图像进行视觉属性编辑的方法，同时对结构信息进行精确控制。这样，现有的分割标签可以被重用到编辑后的图像中，大大降低了劳动成本。使用我们的管道，我们构建了两个新的基准Pascal-EA和COCO-EA。我们对从封闭集模型到开放词汇大型模型的广泛语义分割模型进行了基准测试。我们有几个关键发现：1) 先进的开放词汇模型在几何变化下的鲁棒性并不优于封闭集方法；2) 数据增强技术，如CutOut和CutMix，在增强对外观变化的鲁棒性方面效果有限；3) 我们的管道也可以用作数据增强工具，提高分布内和分布外性能。我们的工作表明生成模型作为自动分析分割模型的有效工具的潜力，并希望我们的发现能帮助从业者和研究人员开发出更鲁棒和可靠的分割模型。

Summary / 总结

This paper introduces Gen4Seg, an automatic data generation pipeline for benchmarking semantic segmentation models. It evaluates models' robustness to changes in appearance and geometry attributes, including object color, material, size, and position, as well as image-level variations like weather and style. Key findings include that advanced open-vocabulary models do not show greater robustness under geometric variations, data augmentation techniques are limited in enhancing robustness against appearance variations, and the pipeline can improve both in-distribution and out-of-distribution performances.

本文旨在通过引入名为Gen4Seg的新数据生成管道来评估语义分割模型在复杂场景中的鲁棒性。该方法涉及使用扩散模型对现有真实图像进行精确的结构信息控制以编辑外观和几何属性。主要发现包括先进的开放词汇模型在几何变化下并不表现出更大的鲁棒性，数据增强技术在增强对抗外观变化的鲁棒性方面有限，而该管道可以提高分布内和分布外的性能。
