arXiv Paper Digest

2026-05-12 05:05
Snapshot: 20260512_0505
Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Authors: Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng
Venue: CVPR 2026
First: 2026-05-08T17:50:47+00:00 · Latest: 2026-05-08T17:50:47+00:00
Comments: Accepted by CVPR 2026. Project page: https://wzzheng.net/Proxy3D
Abstract
Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
Summary
Proxy3D builds compact 3D proxy representations for vision-language models through semantic clustering and alignment. Given only video frames, it extracts scene features with semantic and geometric encoders, clusters them into a set of 3D proxies, and aligns these proxies with the VLM through multi-stage training on the curated SpaceSpan dataset. While using shorter vision sequences, it achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.
Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization
Authors: Hanchao Liu, Fang-Lue Zhang, Shining Zhang, Tai-Jiang Mu, Shi-Min Hu
First: 2026-05-08T17:43:29+00:00 · Latest: 2026-05-08T17:43:29+00:00
Comments: Accepted to CVPR 2026
Abstract
Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging LLM for relational task parsing, the whole framework is further enabled to automatically reason for what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.
Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
Authors: Kaidi Jia, Yujie Lin, Chengyi Yang, Jiayao Ma, Jinsong Su
First: 2026-05-08T17:19:53+00:00 · Latest: 2026-05-08T17:19:53+00:00
Abstract
Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods. Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.
SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere
Authors: Chao Huang, Penfei Wei, Wei Wang, Jie Wen, Zhihua Wang, Li Shen, Wenqi Ren, Xiaochun Cao
First: 2026-05-08T16:57:38+00:00 · Latest: 2026-05-08T16:57:38+00:00
Comments: 48 pages, 25 figures
Abstract
Video anomaly detection (VAD) aims to automatically identify events that deviate from normal patterns in untrimmed surveillance videos. Existing methods universally depend on large-scale annotations or task-specific training procedures, severely limiting their rapid deployment to novel scenes. We observe that intermediate-layer features of pre-trained multimodal large language models (MLLMs) already encode rich anomaly semantics, yet existing approaches rely on the language output pathway and fail to exploit the geometric discriminability latent in these representations. Based on this finding, we propose SphereVAD, a fully training-free, zero-shot VAD framework that recasts anomaly discrimination as von Mises-Fisher (vMF) likelihood-ratio geodesic inference on the unit hypersphere, unleashing latent discriminability through principled geometric reasoning rather than learning new representations. Specifically, SphereVAD first applies Frechet mean centering to unfold feature distributions and eliminate domain biases, then employs Holistic Scene Attention (HSA) to reinforce feature consistency using cross-video priors, and finally performs vMF-guided Spherical Geodesic Pulling (SGP) to align ambiguous segments with directional prototypes on the spherical manifold. This training-free pipeline requires only minimal synthetic images for calibration. SphereVAD establishes new state-of-the-art results among training-free approaches on three major benchmarks and remains competitive with fully supervised baselines. Code will be available upon acceptance.
Summary
SphereVAD is a training-free, zero-shot video anomaly detection framework that exploits the geometric discriminability of intermediate-layer features from pre-trained multimodal large language models. It recasts anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere, with no additional training and only a small set of synthetic images for calibration. Key steps are Fréchet mean centering, Holistic Scene Attention, and vMF-guided Spherical Geodesic Pulling. SphereVAD sets new state-of-the-art results among training-free methods on three major benchmarks and remains competitive with fully supervised baselines.
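As a rough illustration of the vMF likelihood-ratio idea in the SphereVAD abstract, the sketch below scores unit-norm features against two directional prototypes. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the two prototypes, and the shared concentration `kappa` are all hypothetical. With a shared concentration the vMF normalising constants cancel, so the log-likelihood ratio reduces to a scaled difference of cosine similarities.

```python
import numpy as np

def normalize(v):
    """Project vectors onto the unit hypersphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def vmf_log_likelihood_ratio(x, mu_anom, mu_norm, kappa=10.0):
    """log p(x | anomalous) - log p(x | normal) for two vMF densities.

    Assumes both densities share the same concentration kappa, so the
    normalising constants cancel and only the directional term remains.
    """
    x, mu_anom, mu_norm = normalize(x), normalize(mu_anom), normalize(mu_norm)
    return kappa * (x @ mu_anom - x @ mu_norm)

def geodesic_distance(x, mu):
    """Great-circle distance (in radians) between unit vectors."""
    x, mu = normalize(x), normalize(mu)
    return np.arccos(np.clip(x @ mu, -1.0, 1.0))

# Hypothetical prototypes and features, for illustration only.
mu_norm = np.array([1.0, 0.0, 0.0])
mu_anom = np.array([0.0, 1.0, 0.0])
features = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
scores = vmf_log_likelihood_ratio(features, mu_anom, mu_norm)
```

A positive score means the feature lies closer, along the sphere, to the anomalous prototype; a feature equidistant from both prototypes scores zero.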
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng
First: 2026-05-01T17:54:37+00:00 · Latest: 2026-05-08T16:52:48+00:00
Abstract
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
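The "Visual Signal Dilution" effect described in the PVM abstract can be illustrated with a toy softmax calculation. This is a simplified sketch, assuming every visual key shares one logit and every text key shares another (the function name and the token counts are hypothetical); under that assumption the attention mass on visual tokens is n_visual / (n_visual + n_text) when the logits are equal, so it shrinks as generated text accumulates.

```python
import numpy as np

def visual_attention_mass(n_visual, n_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of softmax attention that lands on visual tokens.

    Toy assumption: all visual keys share one logit, all text keys another.
    """
    logits = np.concatenate([
        np.full(n_visual, visual_logit),
        np.full(n_text, text_logit),
    ])
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return float(weights[:n_visual].sum())

# The visual share falls as the text history grows (256 visual tokens assumed).
masses = [visual_attention_mass(256, n_text) for n_text in (0, 256, 768, 3840)]
```

Each extra text key enlarges the softmax denominator (the partition function) while the visual numerator stays fixed, which is the decay the abstract refers to.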
HEART: Hyperspherical Embedding Alignment via Kent-Representation Traversal in Diffusion Models
Authors: Arani Roy, Shristi Das Biswas, Kaushik Roy
First: 2026-05-08T16:32:44+00:00 · Latest: 2026-05-08T16:32:44+00:00
Abstract
Text-to-image diffusion models can generate visually stunning images, yet controlling what appears, and how it appears, remains surprisingly difficult, especially when operating solely within the constraints of the text-conditioning space. For example, changing a subject or adjusting an attribute often leads to unintended side effects, such as altered backgrounds or distorted details. This is because most existing text-based control methods treat the embedding space as Euclidean and apply simple linear transformations, which do not reflect how semantic concepts are actually organized. In this work, we take a step back and ask: what is the true geometry of these embeddings? We find that text encoder representations lie on a hypersphere, where concepts are not linear directions but structured, anisotropic distributions better captured by Kent distributions. Building on this insight, we propose HEART, a training-free framework that performs Kent-aware geodesic transformations directly on the hypersphere. By respecting the underlying geometry, HEART enables intuitive and precise edits, such as consistent subject replacement and fine-grained attribute control, while preserving the original scene. Importantly, HEART requires no finetuning, inversion, or optimization, and generalizes across diffusion model architectures. Our results show that a simple shift in perspective, from linear to spherical, can unlock fast and controllable image generation.
Summary
The research aims to improve the control over text-to-image generation in diffusion models by addressing the limitations of existing linear transformations. The method involves recognizing the text encoder representations as lying on a hypersphere and using Kent distributions to capture the anisotropic nature of semantic concepts. The proposed HEART framework performs Kent-aware geodesic transformations directly on the hypersphere, enabling intuitive and precise edits without requiring fine-tuning or optimization. Key findings include consistent subject replacement and fine-grained attribute control while preserving the original scene, and the framework's ability to generalize across different diffusion model architectures.
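To make the "linear versus spherical" contrast in the HEART abstract concrete, the sketch below compares straight-line interpolation with geodesic interpolation (slerp) between two unit embeddings. It shows plain spherical interpolation only, not the Kent-aware transformation the paper proposes; the function name and the 2D vectors are hypothetical.

```python
import numpy as np

def slerp(a, b, t):
    """Geodesic interpolation between two vectors on the unit hypersphere."""
    a = np.asarray(a, dtype=float) / np.linalg.norm(a)
    b = np.asarray(b, dtype=float) / np.linalg.norm(b)
    omega = np.arccos(np.clip(a @ b, -1.0, 1.0))  # angle between a and b
    if omega < 1e-8:
        return a  # endpoints coincide, nothing to interpolate
    if np.isclose(omega, np.pi):
        raise ValueError("antipodal points have no unique geodesic")
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
linear_mid = 0.5 * a + 0.5 * b  # leaves the sphere: norm is about 0.707
geodesic_mid = slerp(a, b, 0.5)  # stays on the sphere: norm is 1
```

The linear midpoint shrinks in norm and so leaves the manifold the embeddings live on, whereas the geodesic midpoint stays on it.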
Slowly Annealed Langevin Dynamics: Theory and Applications to Training-Free Guided Generation
Authors: Atsushi Nitanda, Dake Bu, Yueming Lyu, Tanya Veeravalli
First: 2026-05-08T16:17:34+00:00 · Latest: 2026-05-08T16:17:34+00:00
Abstract
We study Slowly Annealed Langevin Dynamics (SALD), a sampler for tracking a path of moving target distributions and approximating the terminal target through time slowdown. We establish non-asymptotic convergence guarantees via a KL differential inequality, showing that slowdown improves tracking through contraction of intermediate targets and the complexity of the path. Motivated by training-free guided generation with pretrained score-based generative models, we further introduce Velocity-Aware SALD (VA-SALD), which explicitly incorporates the underlying marginal distributions of the pretrained model and uses slowdown to correct the additional deviation induced by guidance. This yields a principled framework for training-free guided generation for diffusion-based and related generative model families, together with convergence guarantees that clarify the roles of intermediate functional inequalities and guidance bias. Code is available at https://github.com/anitan0925/sald.
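The effect of time slowdown on tracking a moving target can be seen in a one-dimensional toy problem. The sketch below is an assumption-laden illustration, not the SALD or VA-SALD algorithm from the paper: it runs unadjusted Langevin steps against a Gaussian target N(m, 1) whose mean m moves linearly from 0 to a final value, and compares a fast schedule with a slow one. The function name, step size, and particle count are hypothetical.

```python
import numpy as np

def annealed_langevin(n_steps, step_size=0.05, final_mean=4.0,
                      n_particles=2000, seed=0):
    """Track a path of targets N(m_k, 1), with m_k moving from 0 to final_mean."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_particles)  # samples from the initial target N(0, 1)
    for k in range(n_steps):
        target_mean = final_mean * (k + 1) / n_steps
        score = -(x - target_mean)  # gradient of log N(target_mean, 1)
        noise = rng.standard_normal(n_particles)
        x = x + step_size * score + np.sqrt(2 * step_size) * noise
    return x

fast = annealed_langevin(n_steps=10)    # target moves quickly
slow = annealed_langevin(n_steps=200)   # same path, traversed 20x more slowly
fast_error = abs(fast.mean() - 4.0)
slow_error = abs(slow.mean() - 4.0)
```

In this toy setting the particle mean lags the terminal target by roughly 3.0 under the fast schedule and roughly 0.4 under the slow one, which matches the intuition that slowing the path lets the sampler keep up with the moving distribution.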
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
Authors: Hanqi Jiang, Junhao Chen, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li
First: 2026-05-08T15:55:30+00:00 · Latest: 2026-05-08T15:55:30+00:00
Abstract
Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
Authors: Hang Wu, Sherin Mary Mathews, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang
First: 2026-05-08T15:40:40+00:00 · Latest: 2026-05-08T15:40:40+00:00
Abstract
Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
Summary
SAVEMem is a training-free framework designed for streaming video understanding, addressing the challenge of memory management by incorporating semantic signals into memory generation and adapting retrieval scope per query. It consists of two stages: Stage 1 builds a three-tier streaming memory under a constant budget, using a fixed pseudo-question bank to shape long-term retention based on semantic salience. Stage 2 performs query-aware retrieval, with an anchor-conditioned recency gate adjusting the retrieval scope. Applied to Qwen2.5-VL, SAVEMem improves the OVO-Bench score and yields consistent gains on StreamingBench and ODV-Bench while reducing peak GPU memory by 48% at 128 frames.
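The SAVEMem abstract says that late interaction between query and memory tokens selects candidate frames. One common form of late interaction is ColBERT-style MaxSim scoring, sketched below; whether SAVEMem uses exactly this scoring rule is an assumption, and the function names, shapes, and toy vectors are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def late_interaction_scores(query_tokens, frame_tokens):
    """MaxSim score per frame.

    query_tokens: (Q, D) query token embeddings.
    frame_tokens: (F, T, D) memory token embeddings for F frames.
    For each query token, take its best cosine match within a frame,
    then sum those maxima over the query tokens.
    """
    q = l2_normalize(np.asarray(query_tokens, dtype=float))
    m = l2_normalize(np.asarray(frame_tokens, dtype=float))
    sim = np.einsum("qd,ftd->fqt", q, m)  # (F, Q, T) cosine similarities
    return sim.max(axis=2).sum(axis=1)

def select_frames(query_tokens, frame_tokens, k):
    """Indices of the k highest-scoring frames, best first."""
    scores = late_interaction_scores(query_tokens, frame_tokens)
    return np.argsort(-scores)[:k]

query = np.array([[1.0, 0.0], [0.0, 1.0]])
frames = np.array([
    [[1.0, 0.0], [0.0, 1.0]],    # matches both query tokens
    [[1.0, 0.0], [1.0, 0.0]],    # matches only the first query token
    [[-1.0, 0.0], [0.0, -1.0]],  # matches neither
])
frame_scores = late_interaction_scores(query, frames)
top_two = select_frames(query, frames, k=2)
```

Because the maximum is taken per query token, a frame scores highly only when it covers every part of the query, not just one dominant term.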
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Authors: Xiaomin Yu, Yi Xin, Yuhui Zhang, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Chen Liu, Xiaoxing Hu, Ziyue Qiao, Hao Tang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan
First: 2026-02-02T13:59:39+00:00 · Latest: 2026-05-08T15:04:41+00:00
Abstract
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
Summary
This paper addresses the Modality Gap in multimodal contrastive learning, where embeddings from different modalities expressing the same semantics are systematically offset. It proposes the Fixed-frame Modality Gap Theory to decompose the gap into stable biases and anisotropic residuals. The ReAlign strategy, which uses unpaired data statistics, aligns text representations into the image representation distribution through three steps: Anchor, Trace, and Centroid Alignment. ReVision, a scalable training paradigm for Multimodal Large Language Models, integrates ReAlign during pretraining to learn visual representation distributions from unpaired text data, reducing the need for expensive image-text pairs. This approach effectively scales MLLMs using statistically aligned unpaired data.
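The "systematically offset regions" in the modality gap description can be measured, in the simplest case, as the difference between modality centroids. The sketch below shows only that naive mean-shift view, which corresponds to the kind of simplified baseline the abstract contrasts against; it is not ReAlign, and the function names and synthetic embeddings are hypothetical.

```python
import numpy as np

def modality_gap(image_emb, text_emb):
    """Centroid offset between two sets of embeddings, shape (D,)."""
    return image_emb.mean(axis=0) - text_emb.mean(axis=0)

def shift_text_to_image(image_emb, text_emb):
    """Translate text embeddings so both centroids coincide.

    This removes only a constant bias; it does nothing about
    anisotropic residuals, which the paper models separately.
    """
    return text_emb + modality_gap(image_emb, text_emb)

# Synthetic, unpaired embeddings with a constant offset (illustration only).
rng = np.random.default_rng(0)
image_emb = rng.standard_normal((1000, 16)) + 0.5
text_emb = rng.standard_normal((800, 16)) - 0.5
gap_before = np.linalg.norm(modality_gap(image_emb, text_emb))
gap_after = np.linalg.norm(modality_gap(image_emb, shift_text_to_image(image_emb, text_emb)))
```

The two sets need not be paired or equally sized, since only their summary statistics are used.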
Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation
Authors: Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, Andreas Maier
Venue: MICCAI 2026
First: 2026-05-07T12:54:53+00:00 · Latest: 2026-05-08T14:55:12+00:00
Comments: 10 pages, 5 figures. Submitted to MICCAI 2026
Abstract
Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.438 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.
Summary
Retina-RAG is a modular framework that jointly performs diabetic retinopathy severity grading, macular edema detection, and report generation. It uses a high-performance retinal classifier and a parameter-efficient vision-language model adapted via Low-Rank Adaptation, with a retrieval-augmented generation module to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves high F1-scores of 0.731 for DR grading and 0.948 for ME detection, and outperforms other models in report generation with ROUGE-L 0.438 and SBERT similarity 0.884.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Authors: Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti
First: 2026-05-08T14:49:10+00:00 · Latest: 2026-05-08T14:49:10+00:00
Abstract
Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens (<LOOK>), GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.
Summary
GazeVLM is a multimodal architecture that internalizes metacognitive control over its own attention, enabling top-down, goal-directed focus. By autonomously generating gaze tokens (<LOOK>), the model applies a continuous suppression bias to its causal attention mask that dampens irrelevant visual features, then lifts the bias to restore the global view once local reasoning concludes. It needs neither external cropping tools nor additional visual tokens. On HRBench-4k and HRBench-8k, the 4B-parameter GazeVLM outperforms state-of-the-art VLMs in its parameter class by nearly 4% and agentic "thinking with images" pipelines by more than 5%.
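The suppression bias described in the GazeVLM abstract can be pictured as an additive term on attention logits. The sketch below is a minimal, hypothetical rendering of that idea for a single query position: the function names, the boolean masks, and the constant `suppression` value are assumptions, and the real model learns when and where to apply the bias.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_gaze(logits, is_visual, in_focus, suppression=4.0):
    """Attention weights with out-of-focus visual keys damped.

    logits:    (n_keys,) raw attention logits for one query.
    is_visual: (n_keys,) True for visual tokens.
    in_focus:  (n_keys,) True for visual tokens inside the gaze region.
    Text tokens and in-focus visual tokens are left untouched.
    """
    bias = np.where(is_visual & ~in_focus, -suppression, 0.0)
    return softmax(logits + bias)

logits = np.zeros(6)
is_visual = np.array([True, True, True, True, False, False])
in_focus = np.array([True, True, False, False, False, False])
focused = attention_with_gaze(logits, is_visual, in_focus)
restored = attention_with_gaze(logits, is_visual, in_focus, suppression=0.0)
```

Setting the suppression to zero recovers ordinary attention, which mirrors the abstract's description of the bias lifting to restore the global view without adding or removing any tokens.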
APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
Authors: Caterina Gallegati, Monica Bianchini, Franco Scarselli, Vittorio Murino, Barbara Toniella Corradini
First: 2026-05-08T14:21:51+00:00 · Latest: 2026-05-08T14:21:51+00:00
Abstract
As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidence. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
Summary
APEX is a novel evaluation framework for image quality assessment that uses the Sliced Wasserstein Distance as an assumption-free similarity measure, addressing the limitations of traditional feature-distribution metrics. It leverages open-vocabulary foundation models CLIP and DINOv2 for feature extraction and demonstrates superior robustness to visual degradations and intra- and cross-dataset stability. Theoretical and empirical evidence supports APEX's effective scalability to high-dimensional spaces.
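The Sliced Wasserstein Distance at the core of APEX has a short Monte Carlo estimator: project both feature sets onto random unit directions, then compare the sorted one-dimensional projections. The sketch below is a generic estimator under the assumption of equally sized samples; it is not the APEX implementation, and the function name, projection count, and synthetic features (standing in for CLIP or DINOv2 embeddings) are hypothetical.

```python
import numpy as np

def sliced_wasserstein_distance(x, y, n_projections=256, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    x, y: arrays of shape (n, d) with the same n (equal-size samples assumed).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized samples")
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((n_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # In 1D, optimal transport between equal-size samples matches sorted values.
    x_proj = np.sort(x @ directions.T, axis=0)
    y_proj = np.sort(y @ directions.T, axis=0)
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))

rng = np.random.default_rng(1)
reference = rng.standard_normal((500, 8))
same_dist = rng.standard_normal((500, 8))
shifted = same_dist + 2.0
d_same = sliced_wasserstein_distance(reference, same_dist)
d_shifted = sliced_wasserstein_distance(reference, shifted)
```

No Gaussian fit is involved, in contrast to FID, and the cost per projection is a sort, which is what keeps the estimator cheap in high dimensions.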
Pre-trained Tabular Foundation Models as Versatile Summary Networks for Neural Posterior Estimation
Authors: Elliot Pickens, Chiraag Gohel, Sidharth Satya
First: 2026-05-08T14:07:10+00:00 · Latest: 2026-05-08T14:07:10+00:00
Abstract
In this work, we study TabPFN as a training-free, modular summary network for simulation-based Bayesian inference (SBI). Tabular foundation models such as TabPFN are pretrained on broad families of synthetic tabular data-generating processes and adapt at test time through in-context learning, making them natural candidates for SBI, where posterior estimation often depends on learning informative summaries of simulated observations. We propose PFN-NPE: a general recipe that uses a pretrained TabPFN encoder as a fixed summary network for simulator outputs, then pairs the resulting summaries with a downstream inference head chosen for the problem. With normalizing flows as the default inference head, PFN-NPE matches established posterior approximation methods and sometimes outperforms them. More importantly, diagnostic probes show that the TabPFN-derived summaries often preserve useful posterior location and marginal information. These analyses also reveal a limitation in that TabPFN-derived summaries may struggle to represent the joint posterior structure even when the marginals are well recovered. Still, our experiments show that TabPFN can serve as an effective summary network across a diverse set of SBI settings, with the inference network left modular and task-dependent.
中文标题/摘要
标题:预训练表格基础模型作为灵活的神经后验估计总结网络
在本工作中,我们研究TabPFN作为一种无需训练、模块化的总结网络,用于基于模拟的贝叶斯推断(SBI)。表格基础模型如TabPFN在广泛的合成表格数据生成过程中进行预训练,并在测试时通过上下文学习进行适应,使它们成为SBI的自然候选者,在SBI中,后验估计通常依赖于学习模拟观测的有用总结。我们提出了PFN-NPE:一种通用方法,使用预训练的TabPFN编码器作为固定的总结网络,对模拟器输出进行总结,然后将生成的总结与为问题选择的下游推理头配对。使用归一化流作为默认推理头,PFN-NPE与现有的后验近似方法相当,有时甚至优于它们。更重要的是,诊断探针表明,由TabPFN生成的总结通常保留了有用的后验位置和边际信息。这些分析还揭示了一个限制,即即使边际恢复良好,TabPFN生成的总结也可能难以表示联合后验结构。然而,我们的实验表明,TabPFN可以在各种不同的SBI设置中作为有效的总结网络发挥作用,推理网络保持模块化和任务依赖性。
Summary / 总结
This study explores TabPFN as a training-free summary network for simulation-based Bayesian inference (SBI), leveraging its ability to adapt through in-context learning. The proposed PFN-NPE method uses a pretrained TabPFN encoder to generate summaries of simulator outputs, which are then processed by a downstream inference head tailored to the specific problem. Experiments show that PFN-NPE can match or outperform established posterior approximation methods, and diagnostic probes indicate that the summaries often preserve useful posterior information, though they may struggle with joint posterior structure representation.
这项研究探讨了TabPFN作为无训练的总结网络在基于模拟的贝叶斯推理(SBI)中的应用,利用其通过上下文学习进行适应的能力。提出的PFN-NPE方法使用预训练的TabPFN编码器生成模拟器输出的总结,然后由针对特定问题定制的下游推理头进行处理。实验表明,PFN-NPE可以匹配或超越现有的后验近似方法,而诊断探针显示,这些总结通常保留了有用的后验信息,尽管它们在表示联合后验结构方面可能存在困难。
OphEdit: Training-Free Text-Guided Editing of Ophthalmic Surgical Videos
Authors: Ritul Jangir, Arkya Jyoti Bagchi, Aiman Farooq, Mangalton Okram, Saurabh Seetaram Korgaonkar, Deepak Mishra
First: 2026-05-08T13:05:02+00:00 · Latest: 2026-05-08T13:05:02+00:00
Abstract
High-fidelity surgical video generation can greatly improve medical training and the development of AI, but adapting these generative models for precise video editing remains a formidable challenge. Modifying surgical attributes, such as instrument-tissue interactions or procedural phases, is challenging due to the strict anatomical and temporal constraints. In this paper, we propose OphEdit, a novel training-free framework for the text-guided editing of ophthalmic surgical videos. Our approach leverages a deterministic second-order ODE inversion pipeline to capture Attention Value (V) tensors from the original video. By selectively injecting these stored tensors into the conditional Classifier-Free Guidance (CFG) branch during the denoising phase, OphEdit rigorously preserves the intricate anatomical geometry of the eye while seamlessly mapping text-driven semantic modifications onto the video stream. Clinical evaluations demonstrate that OphEdit effectively handles complex surgical transformations, such as instrument swaps and procedural variations, with superior structural fidelity and temporal consistency compared to natural-domain video editors. Our work represents the first application of training-free video editing in the ophthalmic surgical domain, offering a scalable solution for generating diverse, annotated medical datasets without the need for exhaustive manual recording or costly model fine-tuning. The code and prompts can be accessed at https://github.com/ophedit/OphEdit
中文标题/摘要
标题:OphEdit:无需训练的文本引导眼科手术视频编辑
高保真手术视频生成可以极大地提高医疗培训和人工智能的发展,但将这些生成模型适应精确的视频编辑仍然是一个艰巨的挑战。修改手术属性,如器械组织交互或程序阶段,由于严格的解剖学和时间约束而具有挑战性。在本文中,我们提出了一种名为OphEdit的新颖的无需训练框架,用于眼科手术视频的文本引导编辑。我们的方法利用确定性的二阶ODE反向管道从原始视频中捕获注意力值(V)张量。通过在去噪阶段选择性地将这些存储的张量注入条件Classifier-Free Guidance (CFG) 分支,OphEdit严格地保留了眼睛复杂的解剖几何结构,同时无缝地将文本驱动的语义修改映射到视频流中。临床评估表明,OphEdit能够有效地处理复杂的手术变换,如器械交换和程序变化,与自然域视频编辑器相比,具有更高的结构保真度和时间一致性。我们的工作代表了在眼科手术领域首次应用无需训练的视频编辑,提供了一种无需大量手动记录或昂贵模型微调即可生成多样化的注释医疗数据集的可扩展解决方案。代码和提示可在https://github.com/ophedit/OphEdit 获取
Summary / 总结
OphEdit is a training-free framework for text-guided editing of ophthalmic surgical videos. It uses a deterministic second-order ODE inversion pipeline to capture Attention Value tensors from the original video and injects them into the conditional Classifier-Free Guidance branch during the denoising phase. This approach preserves the intricate anatomical geometry of the eye while allowing for text-driven semantic modifications. Clinical evaluations show that OphEdit can handle complex surgical transformations with superior structural fidelity and temporal consistency compared to natural-domain video editors.
OphEdit 是一个无需训练的框架,用于眼科手术视频的文本引导编辑。它使用确定性的二阶 ODE 反向管道从原始视频中捕获 Attention Value 张量,并在去噪阶段将其注入条件 Classifier-Free Guidance 分支。这种方法可以保留眼睛的复杂解剖几何结构,同时允许进行文本驱动的语义修改。临床评估表明,OphEdit 可以处理复杂的手术变换,具有更高的结构保真度和时间一致性,优于自然域视频编辑器。
Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
Authors: Berkehan Ünal, Dierend Hauke, Fazlija Dren, Plachetka Christopher
First: 2026-05-08T12:17:56+00:00 · Latest: 2026-05-08T12:17:56+00:00
Comments: 8 pages, 4 figures
Abstract
Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Authors: Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng
First: 2026-05-07T10:04:39+00:00 · Latest: 2026-05-08T12:09:34+00:00
Abstract
LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode correction suppression and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19% to 90%, with four models exceeding 80%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveals that suppression is not a knowledge failure: the model registers the error internally but task context diverts early-layer attention from the false claim as output intent crystallizes toward compliance at middle layers. We characterize this as knowing but not correcting -- suppression occurs at response selection rather than knowledge encoding. Guided by this mechanism, we propose two training-free interventions. Correction Direction Steering (CDS) estimates a correction-compliance direction from matched pairs and injects it at middle layers before output intent crystallizes. Dynamic Payload Amplification (DPA) localizes payload tokens via attention divergence between early and late layers and amplifies their representation at the final layer, requiring no calibration data. Experiments on Qwen3.5-9B and LLaMA3.1-8B show both methods substantially improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B (0% to 58.2%). DPA is the only method that preserves or improves reasoning capability on both models. These findings introduce factual strictness -- the willingness to uphold accuracy against contextual pressures -- as a new dimension of model reliability.
Summary / 总结
The paper investigates the phenomenon of 'correction suppression' in LLMs, where models comply with false claims embedded in task-oriented requests rather than correcting them. A benchmark of 300 false premises was created to evaluate this across eight models, revealing suppression rates ranging from 19% to 90%, with four models exceeding 80%. The authors propose two training-free interventions: Correction Direction Steering (CDS) and Dynamic Payload Amplification (DPA), which improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B, while DPA preserves reasoning capability on both Qwen3.5-9B and LLaMA3.1-8B models. These findings highlight the importance of factual strictness as a new dimension of model reliability.
该研究探讨了LLM在任务导向请求中对错误断言的‘纠正抑制’现象,即模型在这些情况下会遵从错误而不进行纠正。研究构建了一个包含300个错误前提的基准来评估这一现象,发现抑制率从19%到90%不等,其中四个模型的抑制率超过80%。作者提出了两种无需训练的干预措施:纠正方向引导(CDS)和动态负载放大(DPA),这两种方法均提高了事实准确性。CDS在Qwen3.5-9B上实现了最高的纠正率,而DPA在Qwen3.5-9B和LLaMA3.1-8B上均保持了推理能力。这些发现强调了事实准确性作为模型可靠性新维度的重要性。
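The abstract describes Correction Direction Steering as estimating a correction-compliance direction from matched pairs and injecting it at middle layers. A minimal sketch of that idea follows, assuming a difference-of-means estimate over plain Python lists standing in for hidden states. The function names, the unit normalisation, and the `alpha` scale are hypothetical; the paper's actual estimator and injection point may differ.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def correction_direction(correcting_states, complying_states):
    """Unit vector from the mean 'comply' activation to the mean 'correct'
    activation (assumed difference-of-means estimator)."""
    mu_correct = mean_vector(correcting_states)
    mu_comply = mean_vector(complying_states)
    diff = [a - b for a, b in zip(mu_correct, mu_comply)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two sets of states have identical means")
    return [d / norm for d in diff]


def steer(hidden_state, direction, alpha=1.0):
    """Shift one hidden state along the direction; alpha is a hypothetical
    strength hyperparameter."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]
```

In a real model, `steer` would be applied to the residual stream at a chosen middle layer during the forward pass, for example through a forward hook.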
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
Authors: Jun Wang, Fengpeng Li, Hang Dong, Tianjin Huang, Wei Han
First: 2026-05-08T12:07:26+00:00 · Latest: 2026-05-08T12:07:26+00:00
Abstract
Remote sensing lithology interpretation is fundamental to geological surveys, mineral exploration, and regional geological mapping. Unlike general land-cover recognition, lithology interpretation is a knowledge-intensive task that requires experts to infer rock types from various features, e.g., subtle visual, spectral, textural, geomorphological, and contextual cues, making reliable automated interpretation highly challenging. Geological knowledge-guided large multimodal models offer new opportunities, yet their evaluation remains constrained by the lack of benchmarks that capture lithological annotations, multi-level geological semantics, and expert-informed assessment. Here, we propose LithoBench, a multi-level benchmark for evaluating geological semantic understanding in remote sensing lithology interpretation. LithoBench contains 10,000 expert-annotated interpretation instances across 12 representative lithological categories, including 4,000 multiple-choice and 6,000 open-ended tasks organized into five cognitive levels: Identification and Description, Comparative Analysis, Mechanism Explanation, Practical Application, and Comprehensive Reasoning. We further develop an expert-in-the-loop, knowledge-grounded semi-automated construction pipeline, coupling multiple sub-processes, e.g., structured geological image descriptions, to enhance geological validity and evaluation reliability. Experiments with multiple large vision-language models reveal substantial limitations in geological semantic understanding, particularly on higher-order explanation, application, and reasoning tasks.
Summary / 总结
The research aims to evaluate large multimodal models for remote-sensing lithology interpretation, a task that requires inferring rock types from various cues. LithoBench, a multi-level benchmark, was developed to address the lack of evaluation benchmarks in this field. It includes 10,000 expert-annotated tasks across five cognitive levels, and experiments show that current models struggle with higher-order tasks such as explanation, application, and reasoning.
研究旨在评估大型多模态模型在遥感岩性解释中的应用,该任务需要从各种线索推断岩石类型。LithoBench 是一个多级基准,旨在解决该领域缺乏评估基准的问题。它包含10,000个专家注释的任务,涵盖五个认知级别,并且实验表明当前模型在解释、应用和推理等高级任务上存在局限性。
Structure Over Scale: Learning Visual Reasoning from Pedagogical Video
Authors: Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas
First: 2026-01-30T18:20:23+00:00 · Latest: 2026-05-08T11:40:25+00:00
Abstract
State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a preschooler solves easily. We hypothesize that the explicit pedagogical structure, specifically the context-question-pause-answer cycles embedded in children's educational video, provides naturally co-aligned reasoning traces: temporally synchronized visual cues, questions, and answers that emerge only from deliberate pedagogical authoring and cannot be practically reconstructed through manual annotation at scale. To test this, we introduce SoSVQA (Structure over Scale Visual Question Answering), a unified benchmark of 10K question-answer pairs automatically extracted from Dora the Explorer (DoraVQA) and Mickey Mouse Clubhouse (ClubHVQA) with precise timestamp alignment, and fine-tune Qwen2-VL and Qwen3-VL using Group Relative Policy Optimization (GRPO) to leverage the clear correctness signals and structured reasoning traces inherent in educational content. Despite training on just 10K QA pairs from 78 hours of children's television, orders of magnitude less data than GPT and Gemini, our approach delivers generalizable performance gains for Qwen-based VLMs, yielding consistent improvements on NExT-QA (+19.7), Video-MME (+10.6), and MotionBench (+4.9), matching the performance of leading proprietary systems and demonstrating that content structure can compensate for content scale.
中文标题/摘要
标题:结构胜于规模:从教育视频中学习视觉推理
最先进的视觉-语言模型(VLMs)在视频基准测试中表现出色,但在涉及空间关系、导航和物体选择的基本视觉推理任务上却表现不佳,这些任务连学龄前儿童都能轻松解决。我们假设显式的教育结构,特别是儿童教育视频中嵌入的上下文-问题-暂停-答案循环,提供了自然对齐的推理线索:时间上同步的视觉提示、问题和答案,这些只能通过刻意的教育作者才能产生,而无法通过大规模的手动注释来实际重建。为了验证这一点,我们引入了SoSVQA(结构胜于规模视觉问答),这是一个包含10000个问题-答案对的统一基准,这些对是从《朵拉探险记》(DoraVQA)和《米奇俱乐部屋》(ClubHVQA)中自动提取并精确时间戳对齐的,我们使用组相对策略优化(GRPO)微调Qwen2-VL和Qwen3-VL,以利用教育内容中固有的明确正确信号和结构化推理线索。尽管仅在78小时的儿童电视节目中训练了10000个问题-答案对,数据量远远少于GPT和Gemini,我们的方法仍为Qwen基视觉语言模型带来了可泛化的性能提升,使其在NExT-QA (+19.7)、Video-MME (+10.6) 和 MotionBench (+4.9) 上取得一致改进,达到了领先专有系统的性能水平,证明了内容结构可以弥补内容规模的不足。
Summary / 总结
The research aims to improve visual reasoning in vision-language models by leveraging the structured pedagogical content found in children's educational videos. The study introduces SoSVQA, a benchmark of 10K question-answer pairs from Dora the Explorer and Mickey Mouse Clubhouse, with precise timestamp alignment. By fine-tuning Qwen2-VL and Qwen3-VL with Group Relative Policy Optimization (GRPO), the model benefits from clear correctness signals and structured reasoning traces, achieving consistent improvements on NExT-QA, Video-MME, and MotionBench benchmarks, despite using significantly less data than large-scale models like GPT and Gemini.
研究旨在通过利用儿童教育视频中的结构化教学内容来提升视觉语言模型的推理能力。研究引入了SoSVQA基准,包含来自《朵拉探险记》和《米奇俱乐部屋》的10K问题-答案对,并带有精确的时间戳对齐。通过使用Group Relative Policy Optimization (GRPO)对Qwen2-VL和Qwen3-VL进行微调,模型受益于清晰的正确信号和结构化的推理线索,实现了在NExT-QA、Video-MME和MotionBench基准上的持续改进,尽管使用的数据量远少于GPT和Gemini等大规模模型。
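The fine-tuning described above relies on Group Relative Policy Optimization, whose central step is to score each sampled answer relative to the other answers drawn for the same question. The sketch below shows only that normalisation step, assuming binary correctness rewards. The function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the clipped policy-gradient update and KL penalty of full GRPO are omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise the rewards of one group of sampled answers to the same
    question so each advantage is relative to the group mean."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    # eps avoids division by zero when every answer gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

With question-answer pairs that have a single correct answer, a reward of 1.0 for a match and 0.0 otherwise is enough to drive this step, which is one reason clear correctness signals suit this training scheme.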
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
Authors: Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato, Zhanyu Ma
First: 2026-05-08T10:43:54+00:00 · Latest: 2026-05-08T10:43:54+00:00
Comments: 23 pages, 12 figures, including appendices
Abstract
Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
Authors: Song Zhang, Yanlong Chen, Yilin Li, Yining Chen, Zili Yi, Xiaowei Zhang, Yawei Li
First: 2026-05-08T10:35:11+00:00 · Latest: 2026-05-08T10:35:11+00:00
Comments: Under review. 30 pages, 16 figures, 7 tables
Abstract
Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
Summary / 总结
The research addresses the challenge of varying ground sampling distances (GSD) in remote sensing vision-language models (RS-VLMs) by introducing ScaleEarth, a parameter-efficient fine-tuning framework. It uses CS-HLoRA to treat GSD as a continuous conditioning variable, dynamically routing computation based on physical scale. ScaleEarth also includes SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. The method is evaluated on a new GeoScale-VQA dataset and achieves state-of-the-art results on remote-sensing benchmarks like XLRS-Bench and OmniEarth-Bench.
研究通过引入ScaleEarth框架解决了遥感视觉语言模型(RS-VLMs)在不同地面采样距离(GSD)下的挑战。该框架使用CS-HLoRA将GSD视为连续的调节变量,根据物理尺度动态路由计算。ScaleEarth还包含SSE-U,用于从视觉特征预测GSD及其不确定性,并使用GeoScale-VQA数据集提供监督,该数据集根据物理尺度进行条件化。ScaleEarth在涵盖多种地球系统任务的遥感基准测试中取得了最先进的结果。
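The abstract says SSE-U predicts GSD together with its uncertainty but does not state the training objective. One common way to train such a heteroscedastic head is a Gaussian negative log-likelihood over a predicted mean and log-variance. The sketch below shows that loss for a single scalar target. Treating it as SSE-U's loss is purely an assumption, as are the function and parameter names.

```python
import math


def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).
    Predicting the log-variance keeps the variance positive."""
    return 0.5 * (
        log_var
        + (target - mean) ** 2 / math.exp(log_var)
        + math.log(2.0 * math.pi)
    )
```

The loss lets the head reduce the penalty on a large error by reporting a larger variance, while the `log_var` term discourages inflating the variance everywhere. Since GSD spans several orders of magnitude, `target` would plausibly be a log-scaled GSD, though that too is an assumption.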
From Pixels to Prompts: Vision-Language Models
Authors: Khang Hoang Nhat Vo
First: 2026-05-08T10:17:44+00:00 · Latest: 2026-05-08T10:17:44+00:00
Abstract
When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.
中文标题/摘要
标题:从像素到提示:视觉语言模型
当你阅读一篇关于新视觉语言模型的论文时,很容易忘记这个想法不久之前是多么奇怪。教会机器看东西已经很困难了。教会它们读取和生成语言同样困难。要求它们同时做这两件事——并且还要进行推理、回答问题、遵循指令,甚至有时还能给我们带来惊喜——仍然带着一丝科幻小说的气息,即使它已经成为常态。这本书源于一个简单的感受:\emph{很容易迷失}。该领域发展迅速,不断出现新的模型名称,从“知道行话”到“真正理解这是如何运作的”之间的差距可能会让人感到不安。我多次感受到这种差距。如果你手中拿着这本书,你可能也有同样的感受。我的目标不是提供一个详尽的每项数据集、基准和新模型变体的目录。相反,我希望提供一些更谦逊的东西——我希望是更持久的东西:视觉语言模型的清晰心智地图。足够的结构,使你能自信地阅读新论文;足够的直觉,使你能设计自己的系统,而不觉得是在盲目地拼装乐高积木。
Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It
Authors: Wenxin Tai, Yaqian Liu, Ting Zhong, Fan Zhou
First: 2026-05-08T09:57:22+00:00 · Latest: 2026-05-08T09:57:22+00:00
Abstract
Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when the model is reapplied to its own explanatory graph subset, it may produce a different explanation. However, why self-inconsistency arises remains poorly understood. In this work, we first identify re-explanation-induced context perturbation as the direct cause of score variation. We then introduce a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyze how conciseness regularization affects latent signal assignment. Given that self-inconsistent edges do not provide stable evidence for the model's prediction, we propose Self-Denoising (SD), a model-agnostic and training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support our hypothesis and show that SD consistently improves explanation quality while adding only about 4-6% computational overhead in practice.
Summary / 总结
This study investigates why explanations generated by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent. It identifies re-explanation-induced context perturbation as the direct cause of score variation, proposes a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyzes how conciseness regularization affects latent signal assignment. It then introduces Self-Denoising (SD), a model-agnostic, training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support the hypothesis and show that SD consistently improves explanation quality while adding only about 4-6% computational overhead.
Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models
Authors: Mengxin Qin, Xiang Zhang, Kun Wei, Xu Yang, Cheng Deng
First: 2026-05-08T09:42:05+00:00 · Latest: 2026-05-08T09:42:05+00:00
Abstract
Class-incremental learning aims to continuously acquire new knowledge while preserving previously learned information, thereby mitigating catastrophic forgetting. Existing methods primarily restrict parameter updates but often overlook their structural properties in high-dimensional spaces. From a subspace perspective, updates induced by different tasks tend to lie in multiple overlapping low-rank subspaces, leading to cross-task subspace interference and severe forgetting. To address this issue, we propose HDSD, a Hierarchical Dual-Subspace Decoupling framework for continual learning in vision-language models. Specifically, we introduce a lightweight Feature Modulation Module (FMM) that explicitly decomposes the parameter space into general and task-specific subspaces. Building on this design, we develop two complementary components. First, a General Fusion Module (GFM) evaluates relative parameter changes across tasks and uses an adaptive threshold to capture stable and transferable knowledge. Second, a Hierarchical Learning Module (HLM) performs structured parameter decomposition via Singular Value Decomposition (SVD) and uses a scaling mechanism to constrain updates within distinct subspace scales. Together, these designs reduce subspace interference and parameter drift. Extensive experiments on conventional benchmarks show that HDSD achieves state-of-the-art results.
Summary / 总结
The paper addresses the challenge of catastrophic forgetting in class-incremental learning by proposing HDSD, a Hierarchical Dual-Subspace Decoupling framework. It introduces a Feature Modulation Module to decompose the parameter space into general and task-specific subspaces, and develops a General Fusion Module to capture stable knowledge and a Hierarchical Learning Module to constrain parameter updates within distinct subspace scales. Experiments demonstrate that HDSD outperforms existing methods on conventional benchmarks.
论文提出了一种层次双子空间解耦框架HDSD,以解决增量学习中的灾难性遗忘问题。该框架通过特征调制模块将参数空间分解为通用和任务特定子空间,并开发了通用融合模块来捕捉稳定的知识,以及层次学习模块通过奇异值分解来约束参数更新在不同的子空间尺度内。实验表明,HDSD在传统基准测试中优于现有方法。
DIMoE-Adapters: Dynamic Expert Evolution for Continual Learning in Vision-Language Models
Authors: Mengxin Qin, Xiang Zhang, Xi Wang, Kun Wei, Xu Yang, Cheng Deng
First: 2026-05-08T09:32:05+00:00 · Latest: 2026-05-08T09:32:05+00:00
Abstract
Continual learning enables vision-language models to accumulate knowledge and adapt to evolving tasks without retraining from scratch. However, in multi-domain task-incremental learning, large domain shifts intensify the stability-plasticity dilemma. Most existing methods rely on fixed architectures with statically allocated parameters, which limits adaptation to new domains and aggravates catastrophic forgetting. To address these challenges, we propose DIMoE-Adapters, a Dynamic Incremental Mixture-of-Experts Adapters framework that introduces a dynamic expert evolution paradigm to balance stability and plasticity. This paradigm is implemented through two collaborative components: Self-Calibrated Expert Evolution (SCEE) and Prototype-Guided Expert Selection (PGES). SCEE constructs and evolves a sparse expert pool through expert optimization dynamics, improving plasticity while reducing redundant capacity. PGES controls expert utilization based on the pool shaped by SCEE, improving stability across both previously encountered and unseen tasks. Extensive experiments show that DIMoE-Adapters outperforms previous state-of-the-art methods across various settings.
中文标题/摘要
标题:DIMoE-适配器:视觉-语言模型连续学习中的动态专家进化
连续学习使视觉-语言模型能够积累知识并适应不断变化的任务,而无需从头开始重新训练。然而,在多领域任务增量学习中,大规模领域转换加剧了稳定性和可塑性之间的困境。大多数现有方法依赖于固定架构和静态分配的参数,这限制了对新领域的适应性并加剧了灾难性遗忘。为了解决这些挑战,我们提出了一种动态增量混合专家适配器框架DIMoE-Adapters,该框架引入了一种动态专家进化的范式来平衡稳定性和可塑性。该范式通过两个协作组件实现:自我校准专家进化(SCEE)和原型引导专家选择(PGES)。SCEE通过专家优化动力学构建和进化一个稀疏专家池,提高可塑性同时减少冗余容量。PGES根据SCEE塑造的池子控制专家的利用,提高对先前遇到的任务和未见过的任务的稳定性。广泛的实验表明,DIMoE-Adapters在各种设置中优于先前的最先进方法。
Summary / 总结
DIMoE-Adapters is designed to enhance continual learning in vision-language models by addressing the stability-plasticity dilemma in multi-domain task-incremental learning. It introduces a dynamic expert evolution framework with two components: Self-Calibrated Expert Evolution (SCEE) and Prototype-Guided Expert Selection (PGES). SCEE dynamically constructs and evolves a sparse expert pool, enhancing adaptability while reducing redundancy. PGES manages expert usage based on the pool shaped by SCEE, improving stability. Experiments demonstrate that DIMoE-Adapters outperforms existing methods in various settings.
DIMoE-Adapters旨在通过解决多域任务增量学习中的稳定性和可塑性难题来增强视觉-语言模型的持续学习能力。它引入了一个动态专家进化框架,包含两个组件:自我校准专家进化(SCEE)和原型引导专家选择(PGES)。SCEE动态构建并进化一个稀疏专家池,增强适应性同时减少冗余。PGES根据SCEE塑造的池子管理专家使用,提高稳定性。实验表明,DIMoE-Adapters在各种设置中优于现有方法。
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
Authors: Zitong Xu, Huiyu Duan, Yifei Nie, Mingda Du, Sijing Wu, Xiongkuo Min, Tianyi Zheng, Jian Zhang, Shusong Xu, Jinwei Chen, Bo Li, Guangtao Zhai
First: 2026-05-08T09:05:08+00:00 · Latest: 2026-05-08T09:05:08+00:00
Abstract
Recent text-guided image editing (TIE) models have made remarkable progress, yet edited images still frequently suffer from fine-grained issues such as unnatural objects, lighting mismatch, and unexpected changes. Existing refinement approaches either rely on costly iterative regeneration or employ vision-language models (VLMs) with weak spatial grounding, often resulting in semantic drift and unreliable local corrections. To address these limitations, we first construct EditFHF-15K, a dataset of fine-grained human feedback for edited images, comprising (1) 15K images from 12 TIE models spanning 43 editing tasks, (2) 60K annotated artifact regions and 80K editing failure regions, each accompanied by textual reasoning, and (3) 45K mean opinion scores (MOSs) assessing perceptual quality, instruction following, and visual consistency. Based on EditFHF-15K, we propose EditRefiner, a hierarchical, interpretable, and human-aligned agentic framework that reformulates post-editing correction as a human-like perception-reasoning-action-evaluation loop. Specifically, we introduce: (1) a perception agent that detects contextual saliency maps of artifacts and editing failures, (2) a reasoning agent that interprets these perceptual cues to perform human-aligned diagnostic inference, (3) an action agent that uses the reasoning output to plan and execute localized re-editing, and (4) an evaluation agent that assesses the re-edited image and guides the action agent on whether further refinements are required. Extensive experiments demonstrate that EditRefiner consistently outperforms state-of-the-art methods in distortion localization, diagnostic accuracy, and human perception alignment, establishing a new paradigm for self-corrective and perceptually reliable image editing. The code is available at https://github.com/IntMeGroup/EditRefiner.
Summary / 总结
The research aims to improve the quality of text-guided image editing by addressing fine-grained issues such as unnatural objects and lighting mismatches. It introduces EditRefiner, a framework that uses a perception-reasoning-action-evaluation loop to refine edited images. Key findings show that EditRefiner outperforms existing methods in localizing distortions, diagnostic accuracy, and aligning with human perception, setting a new standard for self-corrective and perceptually reliable image editing.
研究旨在通过解决细粒度问题(如不自然的对象和光照不匹配)来提升文本引导图像编辑的质量。提出了EditRefiner框架,该框架采用感知-推理-行动-评估循环来精修编辑后的图像。主要发现表明,EditRefiner在定位失真、诊断准确性和与人类感知的对齐方面优于现有方法,确立了自我修正和感知可靠的图像编辑的新标准。
Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs
Authors: Hao Wang, Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh, Daisuke Kawahara
First: 2026-05-08T08:53:17+00:00 · Latest: 2026-05-08T08:53:17+00:00
Abstract
Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain highly vulnerable to adversarial attacks, leaving downstream applications exposed to significant risks. In this work, we propose a novel and lightweight adversarial attack detection framework based on sparse autoencoders (SAEs), termed SAEgis. By inserting an SAE module into a pretrained VLM and training it with standard reconstruction objectives, we find that the learned sparse latent features naturally capture attack-relevant signals. These features enable reliable classification of whether an input image has been adversarially perturbed, even for previously unseen samples. Extensive experiments show that SAEgis achieves strong performance across in-domain, cross-domain, and cross-attack settings, with particularly large improvements in cross-domain generalization compared to existing baselines. In addition, combining signals from multiple layers further improves robustness and stability. To the best of our knowledge, this is the first work to explore SAE as a plug-and-play mechanism for adversarial attack detection in VLMs. Our method requires no additional adversarial training, introduces minimal overhead, and provides a practical approach for improving the safety of real-world VLM systems.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
Authors: Yue Ma, Ziyuan Yang, Yi Zhang
First: 2026-05-03T07:38:42+00:00 · Latest: 2026-05-08T08:43:35+00:00
Comments: 14 pages
Abstract
Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreaks, in which compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, whereas infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
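Translated title / abstract
Title: Foresight-Guided Defense: Preventing the Spread of Infection in Multi-Agent Systems
Multi-agent systems (MASs) built on large multimodal models solve complex problems collaboratively through specialized agents. However, MASs are susceptible to infectious jailbreak attacks, in which compromising one agent can propagate to other agents and lead to widespread compromise. Existing defenses respond by training a more contagious cure factor, so that agents preferentially retrieve it instead of virus adversarial examples (VirAEs). This homogenizes agent responses, however, and yields only superficial suppression rather than real recovery. We re-examine these defenses: they operate globally through a shared cure factor, while infectious jailbreak attacks originate in local interaction behaviors. This mismatch limits their effectiveness. To resolve it, we propose a training-free Foresight-Guided Local Purification (FLP) framework in which each agent reasons about behavioral evolution from future interactions and eliminates infections. Specifically, each agent simulates future behavioral trajectories over the following chat rounds. To reflect the diversity of MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction scenarios. We then use response diversity as a diagnostic signal, detecting infection by analyzing the consistency of persona-based predictions at the retrieval-result and semantic levels. For infected agents we apply local purification: recent infections are mitigated by an immediate album rollback, while long-term infections are handled with Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to locate and eliminate VirAEs. Experiments show that FLP lowers the maximum cumulative infection rate from over 95% to below 5.47%. In addition, retrieval and semantic metrics closely match benign baselines, indicating that interaction diversity is effectively preserved.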
Summary / 总结
This paper addresses the vulnerability of Multi-Agent Systems (MASs) to infectious jailbreaks, in which compromising a single agent can spread the attack to others. The authors propose a training-free Foresight-Guided Local Purification (FLP) framework, in which each agent simulates future interactions to detect and eliminate infections. Experimental results show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47% while preserving interaction diversity.
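To address infectious jailbreak vulnerabilities in multi-agent systems (MASs), where one compromised agent can spread the attack to other agents, the paper proposes a training-free Foresight-Guided Local Purification (FLP) framework in which each agent simulates future interactions to detect and mitigate infections. The framework uses a multi-persona simulation strategy and treats response diversity as a diagnostic signal. Experiments show that FLP lowers the maximum cumulative infection rate from over 95% to below 5.47% while preserving interaction diversity.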
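The FLP abstract above describes Recursive Binary Diagnosis only at a high level: the image album is recursively partitioned and the same diagnosis is applied to each part until the VirAEs are localized. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the representation of an album as a list of strings, and the `diagnose` callback (which stands in for the paper's persona-based infection check) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the paper's diagnosis step: returns True if the
# given subset of the album appears to contain at least one infected image.
Diagnosis = Callable[[Sequence[str]], bool]


def recursive_binary_diagnosis(album: Sequence[str], diagnose: Diagnosis) -> List[str]:
    """Localize suspicious images by recursively halving the album."""
    # A clean (or empty) partition needs no further inspection.
    if not album or not diagnose(album):
        return []
    # A flagged partition of size one is the localized item.
    if len(album) == 1:
        return [album[0]]
    mid = len(album) // 2
    return (recursive_binary_diagnosis(album[:mid], diagnose)
            + recursive_binary_diagnosis(album[mid:], diagnose))


def purify(album: Sequence[str], diagnose: Diagnosis) -> List[str]:
    """Return the album with every localized item removed."""
    flagged_items = set(recursive_binary_diagnosis(album, diagnose))
    return [image for image in album if image not in flagged_items]


# Toy usage: the "infection" is a planted item that the oracle can detect.
album = [f"img{i}" for i in range(8)]
planted = {"img5"}


def diagnose(subset: Sequence[str]) -> bool:
    return any(image in planted for image in subset)


flagged = recursive_binary_diagnosis(album, diagnose)
print(flagged)  # ['img5']
```

Because clean halves are discarded after a single check, the number of diagnosis calls in this sketch grows roughly logarithmically with album size when few items are infected, which is the usual motivation for a bisection-style search.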
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
Authors: Yongxian Wei, Yilin Zhao, Zixuan Hu, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Chun Yuan, Dian Li
First: 2025-11-13T03:08:51+00:00 · Latest: 2026-05-08T08:34:38+00:00
Abstract
Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data are used to bootstrap problem-design strategies in the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our proposed framework achieves a cumulative average improvement of 3.4%, demonstrating robust generalization across both language and vision-language models.
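The abstract above says the solver's feedback on synthetic problems serves as a reward that steers the generator toward problems "near the edge of the solver's competence", but it does not give the reward's form. One possible shape, shown here purely as an assumption, is a reward that peaks when the solver succeeds about half the time and falls to zero for problems it always or never solves. The function name `difficulty_reward` and the triangular form are hypothetical.

```python
def difficulty_reward(solver_successes: int, attempts: int) -> float:
    """Assumed reward: highest when the solver's success rate is near 0.5."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= solver_successes <= attempts:
        raise ValueError("solver_successes must lie in [0, attempts]")
    success_rate = solver_successes / attempts
    # Triangular shape: 0 at success rates 0 and 1, and 1 at success rate 0.5.
    return 1.0 - abs(2.0 * success_rate - 1.0)


# Problems the solver always or never solves earn no reward in this sketch.
print(difficulty_reward(5, 10))   # 1.0
print(difficulty_reward(10, 10))  # 0.0
print(difficulty_reward(1, 4))    # 0.5
```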
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Authors: Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique
First: 2025-11-19T07:52:20+00:00 · Latest: 2026-05-08T07:59:26+00:00
Abstract
Current state-of-the-art measures such as BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore often fail to capture semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. To overcome these limitations, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping and vision-language models. The architecture comprises three main stages: (1) extraction of spatial and semantic information as multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models to enforce structural and relational constraints (e.g., alignment, position, consistency).
Summary / 总结
This paper addresses the limitations of current metrics like BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore in capturing semantic and structural accuracy, especially in domain-specific or context-dependent scenarios. It introduces the Physics-Constrained Multimodal Data Evaluation (PCMDE) metric, which combines large language models with reasoning, knowledge-based mapping, and vision-language models. The PCMDE metric consists of three stages: feature extraction, confidence-weighted component fusion, and physics-guided reasoning to enforce structural and relational constraints. Key findings show that PCMDE outperforms existing metrics in evaluating the accuracy of multimodal synthetic images.
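This paper targets the limitations of current metrics such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore in capturing semantic and structural accuracy, particularly in domain-specific or context-dependent scenarios. It proposes the Physics-Constrained Multimodal Data Evaluation (PCMDE) metric, which combines large language models, reasoning, knowledge mapping, and vision-language models. PCMDE has three stages: feature extraction, confidence-weighted component fusion, and physics-guided reasoning that enforces structural and relational constraints. The key finding reported is that PCMDE outperforms existing metrics at evaluating the accuracy of multimodal synthetic images.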