InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
Authors: Sarah Rastegar, Violeta Chatalbasheva, Sieger Falkena, Anuj Singh, Yanbo Wang, Tejas Gokhale, Hamid Palangi, Hadi Jamali-Rad
First: 2025-12-19T17:52:43+00:00 · Latest: 2025-12-19T17:52:43+00:00
Abstract
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
中文标题/摘要
标题:InfSplign: 从文本到图像扩散模型的推理时空间对齐
从文本到图像(T2I)扩散模型能够生成高质量的图像,但往往无法捕捉到文本提示中指定的空间关系。这一限制可以追溯到两个因素:训练数据中缺乏精细的空间监督以及文本嵌入无法编码空间语义。我们提出了一种无需训练的推理时方法InfSplign,通过在每个去噪步骤中通过复合损失调整噪声来改善空间对齐。所提出的损失利用从主干解码器提取的不同级别的交叉注意力图来强制执行准确的对象放置和采样期间的对象平衡。该方法轻量级、即插即用,并且与任何扩散主干兼容。我们在VISOR和T2I-CompBench上的全面评估表明,InfSplign达到了我们所知的最佳状态(到目前为止),在最强的现有推理时基线方法上实现了显著的性能提升,并且甚至优于基于微调的方法。代码库可在GitHub上获得。
Summary / 总结
InfSplign is a training-free method that enhances the spatial alignment of text-to-image generation by adjusting noise at each denoising step. It uses a compound loss based on cross-attention maps to ensure accurate object placement and balanced object presence. Experiments on VISOR and T2I-CompBench demonstrate that InfSplign outperforms existing inference-time baselines and even surpasses fine-tuning methods, setting a new state-of-the-art performance in spatial alignment for text-to-image generation.
InfSplign 是一种无需训练的方法,通过在每个去噪步骤中调整噪声来增强文本到图像生成的空间对齐。它使用基于交叉注意力图的复合损失来确保准确的对象放置和对象存在的平衡。实验结果表明,InfSplign 在 VISOR 和 T2I-CompBench 上的表现优于现有推理时的基线方法,并且甚至超过了基于微调的方法,达到了空间对齐的新最佳性能。
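The InfSplign abstract does not spell out the terms of its compound loss, so the following is only a minimal sketch of one plausible ingredient: a centroid-based penalty on two objects' cross-attention maps for a "left of" relation. The function names, the hinge form, and the `margin` parameter are all assumptions made for illustration. In an inference-time guidance method the gradient of such a loss would be used to adjust the noise; this sketch only computes the loss value.

```python
def centroid(attn):
    """Return the attention-weighted (x, y) centroid of a 2D map given as a list of rows."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        raise ValueError("attention map has no mass")
    cx = sum(x * v for row in attn for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(attn) for v in row) / total
    return cx, cy


def left_of_loss(attn_a, attn_b, margin=1.0):
    """Hypothetical hinge penalty: zero once A's centroid is at least `margin` columns left of B's."""
    ax, _ = centroid(attn_a)
    bx, _ = centroid(attn_b)
    return max(0.0, ax - bx + margin)


# Toy 2x3 attention maps: object A attends to the left column, object B to the right column.
attn_a = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
attn_b = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
satisfied = left_of_loss(attn_a, attn_b)   # relation "A left of B" holds
violated = left_of_loss(attn_b, attn_a)    # relation "B left of A" is violated
```

A differentiable version of the same idea would compute the centroids with tensor operations so the penalty can be backpropagated to the latent at each denoising step.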
Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
Authors: Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald
First: 2025-12-19T17:22:35+00:00 · Latest: 2025-12-19T17:22:35+00:00
Abstract
While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure.
We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant that uses only the Gaussians' centers, colors, and estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point-cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
中文标题/摘要
标题:合唱:全方位3D高斯场景编码的多教师预训练
虽然3DGS已成为一种高保真场景表示,但直接从其原语中编码丰富的通用特征仍然未被充分探索。我们通过引入合唱,一种多教师预训练框架,通过从2D基础模型中提取互补信号来学习一个全方位的3D高斯点绘(3DGS)场景编码器,来解决这一问题。合唱使用共享的3D编码器和教师特定的投影器,从语言对齐、通用和对象感知的教师中学习,鼓励一个共享的嵌入空间,该空间捕捉从高层语义到细粒度结构的信号。我们评估合唱在一系列任务上:开放词汇语义和实例分割、线性探针和解码器探针,以及数据高效监督。除了3DGS,我们还测试合唱在仅支持点云的几个基准上,通过预训练一个仅使用高斯中心、颜色、估计法线作为输入的变体。有趣的是,这个编码器表现出强大的迁移性能,并在使用39.9倍少的训练场景时优于点云基线。最后,我们提出了一种渲染和提取适应方法,以促进域外微调。我们的代码和模型将在发表后发布。
Summary / 总结
Chorus is a multi-teacher pretraining framework for encoding rich, general-purpose features directly from 3D Gaussian Splatting primitives, an area that remains under-explored. It uses a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, capturing signals from high-level semantics to fine-grained structure. On benchmarks that only support point clouds, a variant pretrained on the Gaussians' centers, colors, and estimated normals outperforms the point-cloud baseline while using 39.9 times fewer training scenes. A render-and-distill adaptation further facilitates out-of-domain finetuning.
Chorus 是一个多教师预训练框架,旨在直接从 3D 高斯泼溅(3DGS)的基元中编码丰富的通用特征。它使用共享的 3D 编码器和教师特定的投影器,从语言对齐、通用和对象感知的教师中学习,捕捉从高层语义到精细结构的信号。在仅支持点云的多个基准测试中,仅以高斯中心、颜色和估计法线为输入的变体在使用少 39.9 倍训练场景的情况下超过了点云基线;此外,渲染-蒸馏适应方法有助于域外微调。
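The Chorus entry describes a shared 3D encoder with one projector per teacher. As a rough illustration of that layout, the sketch below sums a per-teacher distillation term over projected copies of a single shared feature. The linear projectors, the cosine-distance objective, and the teacher names are assumptions chosen for the example; the abstract does not state which loss or projector form the authors use.

```python
import math


def project(feature, weights):
    """Apply a teacher-specific linear projector (list of weight rows) to the shared feature."""
    return [sum(w * f for w, f in zip(row, feature)) for row in weights]


def cosine_distance(u, v):
    """Return 1 - cosine similarity; raises if either vector has zero norm."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm vector")
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def multi_teacher_loss(shared_feature, projectors, teacher_targets):
    """Sum the distillation terms over all teachers (assumed unweighted)."""
    return sum(
        cosine_distance(project(shared_feature, projectors[name]), target)
        for name, target in teacher_targets.items()
    )


# Toy example with two hypothetical teachers and a 2-D shared feature.
shared = [1.0, 0.0]
projectors = {
    "language": [[1.0, 0.0], [0.0, 1.0]],
    "object": [[0.0, 1.0], [1.0, 0.0]],
}
aligned = multi_teacher_loss(shared, projectors, {"language": [2.0, 0.0], "object": [0.0, 3.0]})
misaligned = multi_teacher_loss(shared, projectors, {"language": [0.0, 1.0], "object": [0.0, 3.0]})
```

Keeping the projectors separate means each teacher's target space can differ in dimension or semantics while the encoder feature stays shared.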
On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness
Authors: Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Alexandra Gómez-Villa, Jorge Vila-Tomás, Valero Laparra, Jesus Malo
First: 2025-08-13T13:47:34+00:00 · Latest: 2025-12-19T16:47:41+00:00
Abstract
Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remain poorly understood. Most existing analyses characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noise robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong texture bias, elevated alignment with low-level human perceptual measures, and increased sensitivity to Gaussian noise perturbations. As training progresses, this texture bias gradually diminishes in favor of more shape-based representations, coinciding with improved robustness to noise and a decline in low-level perceptual alignment. Importantly, these dynamics are consistently observed across multiple CLIP model scales, indicating that the phenomenon is not specific to a particular architecture size. Our findings provide an empirical characterization of how perceptual alignment, feature bias, and robustness co-evolve during multimodal model training. This work reveals a systematic trade-off between early low-level perceptual alignment and later robustness, offering new insights into the representational dynamics of vision-language models and their relationship to human visual processing.
中文标题/摘要
标题:CLIP 纹理-形状偏见的动态演变及其与人类对齐和模型鲁棒性的关系
对比语言-图像模型如CLIP展示了卓越的泛化能力。然而,它们在训练过程中内部视觉表示如何演变以及这种演变如何与人类感知相关的问题仍然知之甚少。现有大多数分析集中在完全训练好的模型上,而表征偏见和感知对齐的动力学则很少被探索。在本研究中,我们对CLIP模型在整个训练过程中进行了逐轮(epoch)分析,重点关注纹理-形状偏见的演变、与人类感知判断的对齐以及对图像噪声的敏感性。通过涵盖低级图像质量评估、中级感知相似性、显著性对应和噪声鲁棒性的多个感知基准,我们识别出一种训练阶段依赖的一致性表征过渡。早期训练阶段表现出强烈的纹理偏见、与低级人类感知度量的增强对齐以及对高斯噪声扰动的增加敏感性。随着训练的进行,这种纹理偏见逐渐减少,有利于更多基于形状的表示,同时噪声鲁棒性提高,低级感知对齐下降。重要的是,这些动态在多个CLIP模型规模中一致观察到,表明该现象不局限于特定的架构大小。我们的研究结果提供了关于感知对齐、特征偏见和鲁棒性在多模态模型训练过程中如何共同演变的实证描述。这项工作揭示了早期低级感知对齐与后期鲁棒性之间的系统性权衡,为视觉-语言模型的表征动力学及其与人类视觉处理的关系提供了新的见解。
Summary / 总结
This study analyzes the evolution of texture-shape bias in CLIP models during training, focusing on alignment with human perception and robustness to image noise. The research finds that early training stages show strong texture bias and high alignment with low-level perceptual measures, but as training progresses, the models shift towards shape-based representations, improving robustness to noise and reducing low-level perceptual alignment. This transition is consistent across different model scales, indicating a general phenomenon in multimodal model training.
本研究分析了CLIP模型在训练过程中纹理-形状偏见和知觉对齐的变化,使用了多个知觉基准。早期训练阶段表现出强烈的纹理偏见、与低级人类知觉测量的高对齐以及对噪声的高敏感性,随着训练的进行,这些特性逐渐减弱,偏好基于形状的表示,并提高了对噪声的鲁棒性。这些动态在不同模型规模中是一致的,表明早期的知觉对齐与后期的鲁棒性之间存在普遍的权衡关系。
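The CLIP entry tracks texture-shape bias across training epochs but does not say how the bias is scored. One common convention, assumed here, uses cue-conflict images (shape from one class, texture from another) and reports the fraction of shape decisions among all decisions that match either cue. The sketch below implements that convention; the tuple layout and labels are hypothetical.

```python
def shape_bias(decisions):
    """Fraction of shape-consistent predictions among predictions matching either cue.

    `decisions` is a list of (predicted, shape_label, texture_label) tuples for
    cue-conflict images, where shape_label and texture_label are assumed to differ.
    """
    shape_hits = sum(1 for pred, shape, _ in decisions if pred == shape)
    texture_hits = sum(1 for pred, _, texture in decisions if pred == texture)
    matched = shape_hits + texture_hits
    if matched == 0:
        raise ValueError("no prediction matched the shape or texture label")
    return shape_hits / matched


# Toy checkpoint evaluation: two shape decisions, one texture decision, one miss.
decisions = [
    ("cat", "cat", "dog"),
    ("dog", "cat", "dog"),
    ("cat", "cat", "dog"),
    ("bird", "cat", "dog"),
]
bias = shape_bias(decisions)
```

Evaluating this score at every saved checkpoint would give the kind of epoch-by-epoch curve the entry describes, with low values early (texture-dominated) and higher values later.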
AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection
Authors: Yichen Jiang, Mohammed Talha Alam, Sohail Ahmed Khan, Duc-Tien Dang-Nguyen, Fakhri Karray
First: 2025-12-19T16:06:03+00:00 · Latest: 2025-12-19T16:06:03+00:00
Comments: Under Review
Abstract
Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.
中文标题/摘要
标题:AdaptPrompt:VLMs的参数高效适应以实现通用的深度假信息检测
近期在图像生成方面的进展导致了高度逼真合成媒体的广泛可用性,增加了可靠检测深度假信息的难度。一个关键挑战是泛化能力,因为针对特定生成器训练的检测器在面对未见过的模型时往往会失效。在本文中,我们通过利用大型视觉语言模型,特别是CLIP,来识别不同生成技术下的合成内容,来应对泛化检测的迫切需求。首先,我们引入了Diff-Gen,这是一个大规模基准数据集,包含100,000个扩散生成的假信息样本,这些样本捕捉到了与传统GAN数据集不同的广泛频谱特征。在Diff-Gen上训练的模型在跨域泛化方面表现出更强的能力,特别是在面对以前未见过的图像生成器时。其次,我们提出了AdaptPrompt,这是一种参数高效的迁移学习框架,该框架联合学习任务特定的文本提示和视觉适配器,同时冻结CLIP主干。我们进一步通过层消融实验表明,剪枝视觉编码器的最终变压器块可以更好地保留高频生成特征,显著提高检测准确性。我们的评估覆盖了25个具有挑战性的测试集,涵盖了由GAN、扩散模型和商业工具生成的合成内容,建立了标准和跨域场景下的新最佳性能。我们还通过少量样本泛化(仅使用320张图像)和源归因进一步展示了该框架的灵活性,使其能够在封闭集设置中精确识别生成器架构。
Summary / 总结
This work addresses the challenge of generalizing deepfake detection across different generative models by leveraging large vision-language models. It introduces Diff-Gen, a benchmark dataset of 100k diffusion-generated fakes, and proposes AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters. The study demonstrates that pruning the final transformer block of the vision encoder improves detection accuracy, and the framework achieves state-of-the-art results across 25 test sets, including GANs, diffusion models, and commercial tools.
该研究通过利用大型视觉-语言模型CLIP,解决跨不同生成技术的一般化深伪检测难题。引入了包含100k扩散生成假象的Diff-Gen大规模基准数据集,并提出了参数高效迁移学习框架AdaptPrompt,该框架联合学习任务特定的文本提示和视觉适配器。研究显示,剪枝视觉编码器的最终变压器块可以提高检测准确性,并且该框架在25个具有挑战性的测试集上实现了最先进的结果,包括GAN、扩散模型和商业工具生成的内容。
Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs
Authors: Zhaolin Cai, Huiyu Duan, Zitong Xu, Fan Li, Zhi Liu, Jing Liu, Wei Shen, Xiongkuo Min, Guangtao Zhai
First: 2025-12-19T14:41:50+00:00 · Latest: 2025-12-19T14:41:50+00:00
Abstract
Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, which struggles to generalize to the long tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose GRASP-HO, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from a closed-set classification task into an open-vocabulary generation problem. To bridge vision and cognition, we first extract hybrid interaction representations, then design a lightweight learnable cognitive steering conduit (CSC) module to inject the fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that couples the language modeling loss with an auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
中文标题/摘要
标题:通过可微认知引导的多模态大语言模型生成人类-物体交互检测
人类-物体交互(HOI)检测旨在定位人类-物体对及其之间的交互。现有方法在封闭世界假设下运行,将任务视为对小规模预定义动词集的分类问题,难以泛化到真实场景中未见过或模糊的交互。虽然最近的多模态大语言模型(MLLMs)具备丰富的世界知识,用于开放词汇理解,但它们仍与现有的HOI检测器脱钩,因为对其进行微调在计算上是不可行的。为了解决这些限制,我们提出了一种新颖的生成推理和可引导感知框架GRASP-HO,将HOI检测从封闭集分类任务重新定义为开放词汇生成问题。为了弥合视觉与认知之间的鸿沟,我们首先提取混合交互表示,然后设计一个轻量级可学习的认知引导导管(CSC)模块,将细粒度的视觉证据注入冻结的MLLM以实现有效的推理。为了解决基于分类的HOI数据集与开放词汇生成模型之间的监督不匹配,我们引入了一种混合指导策略,结合语言建模损失和辅助分类损失,实现区分性定位而不牺牲生成灵活性。实验表明,该方法在封闭集性能上达到最新水平,并且具有强大的零样本泛化能力,实现了区分性感知与生成推理的无缝结合,用于开放世界HOI检测。
Summary / 总结
The research aims to improve HOI detection by addressing the limitations of existing closed-world methods, which struggle with unseen interactions. The proposed GRASP-HO framework reformulates HOI detection as an open-vocabulary generation problem, using a hybrid interaction representation and a cognitive steering conduit module to inject visual evidence into a frozen MLLM. The hybrid guidance strategy combines language modeling and auxiliary classification losses to enhance discriminative grounding while maintaining generative flexibility. Experiments show superior closed-set performance and strong zero-shot generalization capabilities.
论文针对开放世界中未见或模糊的人-物交互(HOI)检测难题,提出了一种名为GRASP-HO的新框架,将HOI检测转化为开放词汇生成问题。该方法利用混合交互表示和轻量级的认知引导模块将视觉证据注入冻结的多模态大型语言模型。通过结合语言建模损失和辅助分类损失的混合指导策略,提高了性能。实验结果显示了出色的封闭集性能和强大的零样本泛化能力。
SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing
Authors: Chaolei Wang, Yang Luo, Jing Du, Siyu Chen, Yiping Chen, Ting Han
First: 2025-09-05T14:37:31+00:00 · Latest: 2025-12-19T14:32:19+00:00
Abstract
Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D-to-3D lifting approaches struggles to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic masks for high-fidelity 3D instance segmentation (SGS-3D), a novel "split-then-grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available at https://github.com/wangchaolei7/SGS-3D.
中文标题/摘要
标题:SGS-3D:通过可靠的语义掩码分割与生长实现高保真3D实例分割
准确的3D实例分割对于3D视觉领域高质量场景理解至关重要。然而,基于2D到3D提升的方法在提升过程中由于语义指导模糊和深度约束不足而引入的累积误差,难以产生精确的实例级分割。为应对这些挑战,我们提出了高保真3D实例分割(SGS-3D),一种新颖的“分割-然后生长”框架,首先使用几何原语净化和分割模糊的提升掩码,然后在场景中将其生长为完整的实例。与依赖原始提升掩码并牺牲分割精度的现有方法不同,SGS-3D作为一种无需训练的细化方法,联合融合语义和几何信息,使两个表示层次之间能够有效合作。具体而言,对于语义指导,我们引入了一种掩码过滤策略,利用3D几何原语的共现性来识别并移除模糊的掩码,从而确保与3D对象实例更可靠的语义一致性。对于几何细化,我们通过利用空间连续性和高层特征构建精细的物体实例,特别是在不同物体之间语义模糊的情况下。在ScanNet200、ScanNet++和KITTI-360上的实验结果表明,SGS-3D显著提高了分割精度,并且能够抵抗预训练模型产生的不准确掩码的影响,生成高保真的物体实例,同时在多种室内外环境中保持强大的泛化能力。代码可在https://github.com/wangchaolei7/SGS-3D/ 获取。
Summary / 总结
The research aims to improve the accuracy of 3D instance segmentation by addressing the issues of accumulated errors in 2D-to-3D lifting approaches. SGS-3D proposes a 'split-then-grow' framework that first purifies and splits ambiguous lifted masks using geometric primitives and then grows them into complete instances. This method enhances segmentation accuracy and robustness, especially in environments with semantic ambiguity, and demonstrates strong generalization across various indoor and outdoor settings.
SGS-3D 提出了一种新颖的 '分割-然后生长' 框架,用于高保真 3D 实例分割,通过使用几何原语净化和分割模糊的掩码,然后将其生长为完整的实例。该方法结合语义和几何信息,显著提高了分割精度和鲁棒性,特别是在具有语义模糊性的环境中,如 ScanNet200、ScanNet++ 和 KITTI-360 数据集上的实验结果所示。
PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology
Authors: Fengchun Liu, Songhan Jiang, Linghan Cai, Ziyue Wang, Yongbing Zhang
First: 2025-12-19T14:26:50+00:00 · Latest: 2025-12-19T14:26:50+00:00
Abstract
While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.
中文标题/摘要
标题:PathFLIP:细粒度语言-图像预训练在通用病理学计算中的应用
尽管视觉-语言模型(VLMs)在病理学计算(CPath)方面取得了显著进展,但全视野图像(WSIs)的 gigapixel 规模和空间异质性继续对多模态理解构成挑战。现有的对齐方法难以捕捉来自数千个切片片段的文本描述与视觉线索之间的细粒度对应关系,从而影响其在下游任务上的性能。在本文中,我们提出了一种名为 PathFLIP(病理学细粒度语言-图像预训练)的新颖框架,用于整体 WSI 解释。PathFLIP 将切片级的描述分解为区域级的子描述,并生成条件文本区域嵌入,以促进精确的视觉-语言定位。通过利用大型语言模型(LLMs),PathFLIP 可以无缝地遵循各种临床指令并适应不同的诊断环境。此外,它在多个范式中表现出色,高效地处理切片级分类和检索、细粒度病灶定位以及指令遵循。广泛的实验表明,PathFLIP 在四个代表性基准上优于现有的大规模病理学 VLMs,同时需要显著较少的训练数据,为临床实践中细粒度、指令感知的 WSI 解释铺平了道路。
Summary / 总结
PathFLIP is a novel framework for fine-grained language-image pretraining in computational pathology, addressing the challenges of gigapixel-scale Whole Slide Images (WSIs). It decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings, enabling precise visual-language grounding. PathFLIP outperforms existing large-scale VLMs on four benchmarks with less training data and demonstrates versatile capabilities in slide-level classification, retrieval, lesion localization, and instruction following.
PathFLIP 是一种新型框架,旨在改进计算病理学中的细粒度语言-图像预训练。它将滑块级别的描述分解为区域级别的子描述,并生成文本条件化的区域嵌入,以更好地将文本描述与视觉线索对齐。PathFLIP 在四个基准测试中表现出色,所需训练数据量较少,展示了其在滑块级别分类、检索、细粒度病灶定位和指令遵循方面的有效性。
Replace, Don't Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly
Authors: Moshe Lahmy, Roi Yozevitch
First: 2025-12-11T16:31:29+00:00 · Latest: 2025-12-19T12:41:35+00:00
Comments: 24 pages, 2 figures
Abstract
Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-$k$, typically address this by adding more context or pruning existing lists. However, simply expanding the context window often leads to context dilution, where distractors crowd out relevant information. We propose SEAL-RAG, a training-free controller that adopts a "replace, don't expand" strategy to fight context dilution under a fixed retrieval depth $k$. SEAL executes a (Search → Extract → Assess → Loop) cycle: it performs on-the-fly, entity-anchored extraction to build a live gap specification (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-$k$ in a shared environment on HotpotQA and 2WikiMultiHopQA. On HotpotQA ($k=3$), SEAL improves answer correctness by +3 to +13 pp and evidence precision by +12 to +18 pp over Self-RAG. On 2WikiMultiHopQA ($k=5$), it outperforms Adaptive-$k$ by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. These gains are statistically significant ($p<0.001$). By enforcing fixed-$k$ replacement, SEAL yields a predictable cost profile while ensuring the top-$k$ slots are optimized for precision rather than mere breadth. We release our code and data at https://github.com/mosherino/SEAL-RAG.
中文标题/摘要
标题:替换,不要扩展:通过固定预算证据组装在多跳RAG中缓解上下文稀释
检索增强生成(RAG)系统在处理多跳查询时经常失败,因为初始检索遗漏了桥梁事实。之前的纠正方法,如Self-RAG、CRAG和Adaptive-$k$,通常通过增加更多上下文或修剪现有列表来解决这一问题。然而,简单地扩展上下文窗口往往会引发上下文稀释,即干扰信息挤占了相关信息。我们提出了SEAL-RAG,这是一种无需训练的控制器,采用“替换,不要扩展”的策略,在固定检索深度$k$下对抗上下文稀释。SEAL 执行一个(S搜索 → E提取 → A评估 → L循环)循环:它进行实时、实体锚定的提取,构建一个动态的“缺口规范”(缺失的实体/关系),触发有针对性的微查询,并使用实体优先排序主动替换干扰信息以获取缺口闭合证据。我们在共享环境中使用Basic RAG、CRAG、Self-RAG和Adaptive-$k$的忠实重实现对SEAL-RAG进行了评估,评估数据集为HotpotQA和2WikiMultiHopQA。在HotpotQA($k=3$)上,SEAL将答案正确性提高了3-13个百分点,证据精确度提高了12-18个百分点,超过Self-RAG。在2WikiMultiHopQA($k=5$)上,它在准确性上比Adaptive-$k$提高了8.0个百分点,并且保持了96%的证据精确度,而CRAG仅为22%。这些增益在统计上具有显著性($p<0.001$)。通过强制执行固定-$k$替换,SEAL提供了可预测的成本模型,同时确保前-$k$槽位优化的是精确度而非简单的广度。我们已在https://github.com/mosherino/SEAL-RAG/发布了我们的代码和数据。
Summary / 总结
The paper addresses context dilution in multi-hop Retrieval-Augmented Generation (RAG) systems, where the initial retrieval may miss bridge facts. It introduces SEAL-RAG, which employs a 'replace, don't expand' strategy to mitigate this problem. SEAL-RAG uses a search-extract-assess-loop cycle to build a live gap specification, trigger targeted micro-queries, and use entity-first ranking to swap out distractors for relevant evidence. Experiments on HotpotQA and 2WikiMultiHopQA show that SEAL-RAG improves answer correctness and evidence precision compared to other methods like Self-RAG and Adaptive-$k$. The gains are statistically significant, and evidence precision stays high even though the retrieval depth is fixed.
论文针对多跳检索增强生成(RAG)系统中初始检索可能遗漏桥接事实导致的上下文稀释问题,提出了一种‘替换,不要扩展’策略的SEAL-RAG方法。SEAL-RAG通过搜索-提取-评估-循环周期构建实时的缺失信息规格,触发有针对性的微查询,并使用实体优先排序来替换掉无关信息以获取相关证据。在HotpotQA和2WikiMultiHopQA上的实验表明,SEAL-RAG在答案正确性和证据精度上优于Self-RAG和Adaptive-$k$等其他方法,且统计显著,并能保持高证据精度,即使在固定检索深度下。
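The "replace, don't expand" idea in the SEAL-RAG entry can be illustrated with a fixed-budget slot update: a newly retrieved passage enters the evidence set only by displacing the weakest current slot, so the set never grows past k. The sketch below is an assumption-laden toy, not the released implementation; `entity_score` is a hypothetical stand-in for entity-first ranking (it simply counts mentioned target entities), and the passages are invented.

```python
def entity_score(passage, entities):
    """Count how many target entities a passage mentions (hypothetical ranking signal)."""
    text = passage.lower()
    return sum(1 for entity in entities if entity.lower() in text)


def replace_dont_expand(evidence, candidates, entities, k):
    """Return at most k passages; a candidate enters only by displacing the weakest slot."""
    def score(passage):
        return entity_score(passage, entities)

    slots = sorted(evidence, key=score, reverse=True)[:k]
    for cand in candidates:
        if cand in slots:
            continue
        if len(slots) < k:
            slots.append(cand)
            continue
        weakest = min(slots, key=score)
        if score(cand) > score(weakest):
            slots[slots.index(weakest)] = cand  # swap, never grow
    return slots


evidence = [
    "Alice was born in Lyon.",
    "Bananas are yellow.",
    "Lyon is a city in France.",
]
candidates = ["The capital of France is Paris."]
result = replace_dont_expand(evidence, candidates, ["Alice", "France", "Paris"], k=3)
```

Because the budget is fixed, the prompt length and downstream generation cost stay constant across loop iterations, which matches the predictable cost profile the entry highlights.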
Xiaomi MiMo-VL-Miloco Technical Report
Authors: Jiaze Li, Jingyang Chen, Yuxun Qu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Shijie Xu, Zhenru Lin, Junyou Zhu, Boshen Xu, Wenhui Tan, Pei Fu
First: 2025-12-19T10:43:37+00:00 · Latest: 2025-12-19T10:43:37+00:00
Abstract
We open-source MiMo-VL-Miloco-7B and its quantized variant MiMo-VL-Miloco-7B-GGUF, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco to support research and deployment in real-world smart-home applications.
中文标题/摘要
标题:小米MiMo-VL-Miloco技术报告
我们开源了MiMo-VL-Miloco-7B及其量化变体MiMo-VL-Miloco-7B-GGUF,这是一个面向家庭的视觉-语言模型对,能够在家庭场景理解和通用多模态推理方面取得优异表现。基于MiMo-VL-7B骨干网络,MiMo-VL-Miloco-7B专门针对智能家居环境,实现了手势识别和常见家庭场景理解的领先F1分数,并在视频基准测试(如Video-MME、Video-MMMU和Charades-STA)以及语言理解基准测试(如MMMU-Pro和MMLU-Pro)中也取得了持续的改进。在我们的实验中,MiMo-VL-Miloco-7B在家庭场景理解和多个多模态推理基准测试中均优于强大的闭源和开源基线。为了平衡专业化和通用性,我们设计了一种两阶段训练管道,结合监督微调和基于组相对策略优化的强化学习,利用高效的多域数据。我们进一步引入了思维链监督和基于令牌预算的推理,使模型能够在数据高效学习的同时高效推理。我们的分析表明,针对家庭场景的训练不仅增强了活动和手势理解,还提高了文本推理能力,仅对文档中心任务有适度的权衡。模型检查点、量化GGUF权重以及我们的家庭场景评估工具包可在https://github.com/XiaoMi/xiaomi-mimo-vl-miloco公开获取,以支持在实际智能家居应用中的研究和部署。
Summary / 总结
The research aims to develop home-centric vision-language models for smart-home applications. The method involves building MiMo-VL-Miloco-7B on the MiMo-VL-7B backbone and using a two-stage training pipeline combining supervised fine-tuning and reinforcement learning. Key findings show that MiMo-VL-Miloco-7B outperforms strong baselines on home-scenario understanding and multimodal reasoning benchmarks, while also improving text-only reasoning with minimal impact on document-centric tasks.
研究旨在开发适用于智能家居的应用的视觉-语言模型。MiMo-VL-Miloco-7B基于MiMo-VL-7B构建,专为智能家居环境设计,并在手势识别和多种多模态推理基准测试中超越了强大的基线模型。该模型采用结合监督微调和基于组相对策略优化的强化学习的两阶段训练管道,并结合了链式思考监督和令牌预算感知推理,以提高数据效率和推理性能。实验结果表明,该模型在智能家居场景理解和多模态推理方面表现出色,同时在文档中心任务上保持良好的性能。
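The MiMo-VL-Miloco entry mentions reinforcement learning based on Group Relative Policy Optimization (GRPO). As commonly described, GRPO scores each sampled response relative to the other responses drawn for the same prompt rather than using a learned value baseline. The sketch below shows only that group-relative normalization step; the use of population standard deviation and the `eps` stabilizer are assumptions, and the report's actual reward design is not given in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption for this sketch
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four sampled responses to one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses above the group mean receive positive advantages and those below receive negative ones, so the policy update needs no separate critic network.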
RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
Authors: Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette
First: 2025-12-19T09:47:54+00:00 · Latest: 2025-12-19T09:47:54+00:00
Comments: Preprint, 23 pages, 12 figures, 7 tables
Abstract
In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
中文标题/摘要
标题:RadImageNet-VQA:用于CT和MRI影像放射学视觉问答的大规模数据集
在本工作中,我们介绍了RadImageNet-VQA,这是一个为CT和MRI影像上的放射学视觉问答(VQA)设计的大规模数据集。现有的医学VQA数据集在规模上有限,主要以X光成像或生物医学插图为主,并且经常依赖于文本捷径。RadImageNet-VQA基于专家标注构建,提供了75万张图像配对750万组问题-答案样本。它涵盖了三个关键任务——异常检测、解剖结构识别和病理识别,覆盖了八个解剖区域和97个病理类别,并支持开放式、封闭式和多项选择式问题。广泛的实验表明,最先进的视觉-语言模型在细粒度病理识别方面仍然存在困难,特别是在开放式设置中,即使经过微调也是如此。仅基于文本的分析进一步表明,没有图像输入时,模型性能会崩溃到近乎随机,这证实了RadImageNet-VQA没有语言捷径。整个数据集和基准可以在https://huggingface.co/datasets/raidium/RadImageNet-VQA上公开获取。
Summary / 总结
RadImageNet-VQA is a large-scale dataset for radiologic VQA on CT and MRI exams, containing 750K images with 7.5M question-answer pairs covering three key tasks: abnormality detection, anatomy recognition, and pathology identification. Experiments show that state-of-the-art models struggle with fine-grained pathology identification, especially in open-ended settings, and text-only analysis confirms the necessity of image inputs. The dataset is publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
该研究引入了RadImageNet-VQA数据集,用于CT和MRI影像的放射学视觉问答,解决了现有医疗VQA数据集的局限性。该数据集包含75万张图像和750万对问题-答案,涵盖三个任务和八个解剖区域。实验表明,最先进的模型在细粒度的病理识别上表现不佳,尤其是在开放式问题上,纯文本分析进一步证实了图像输入的必要性。该数据集已公开供研究使用。
Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
Authors: Zabir Al Nazi, G M Shahariar, Abrar Hossain, Wei Peng
First: 2025-12-19T09:47:38+00:00 · Latest: 2025-12-19T09:47:38+00:00
Abstract
Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assists in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
中文标题/摘要
标题:视觉语言模型是跨文化理论思维推理者吗?
理论思维(ToM)——赋予他人信念、欲望和情感的能力——是人类社会智能的基础,但仍然是人工代理的重大挑战。现有的视觉-语言模型(VLMs)在社会导向任务中的应用越来越多,但它们在跨文化ToM推理方面的能力尚未得到充分探索。在本工作中,我们引入了CulturalToM-VQA,这是一个新的评估基准,包含5095个问题,旨在通过视觉问答来探究跨文化情境下的ToM推理。该数据集捕捉到了文化根基的线索,如仪式、服饰、手势和人际动态,使ToM推理的评估超越了以西方为中心的基准。该数据集通过一个VLM辅助的人机交互流程构建,首先由人类专家策划跨传统、仪式和社会互动的文化丰富的图像;然后VLM辅助生成结构化的ToM焦点场景描述,这些描述被精炼成跨越六个ToM任务分类和四个复杂度等级的问题-答案对。最终数据集涵盖了诸如心理状态归因、虚假信念推理、非字面交流、社会规范违反、视角协调和多代理推理等多样化的理论思维方面。
Summary / 总结
This study investigates whether Vision-Language Models (VLMs) can reason about cross-cultural Theory of Mind (ToM), an essential aspect of human social intelligence. The researchers developed CulturalToM-VQA, a new benchmark with 5095 questions designed to evaluate ToM reasoning in diverse cultural contexts. The dataset includes culturally grounded cues and is structured to cover various ToM tasks and complexities. Experimental results show that current VLMs struggle with cross-cultural ToM reasoning, highlighting the need for improved models capable of understanding and reasoning about diverse cultural contexts.
该研究探讨了视觉语言模型(VLMs)是否能够进行跨文化的心智理论(ToM)推理,引入了包含5095个问题的新基准CulturalToM-VQA,旨在评估在多元文化背景下的ToM推理能力。数据集包含文化背景下的线索,并通过VLM辅助的人工循环过程构建,涵盖了如心理状态归因和错误信念推理等方面。主要发现表明,当前的VLMs在跨文化ToM推理方面存在困难,强调了需要改进模型的需求。
Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
Authors: Rheeya Uppaal, Phu Mon Htut, Min Bai, Nikolaos Pappas, Zheng Qi, Sandesh Swamy
First: 2025-12-13T07:04:42+00:00 · Latest: 2025-12-19T08:16:24+00:00
Comments: Preprint
Abstract
Reasoning-augmented vision language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
中文标题/摘要
标题:旅程先于目的地:视觉忠实性在慢思考中的重要性
增强推理的视觉语言模型(VLMs)生成明确的思维链,承诺更大的能力和透明度,但也引入了新的失败模式:模型可能通过视觉不忠实的中间步骤得出正确答案,或者忠实推理但在最终预测上失败。仅衡量最终答案准确性的标准评估无法区分这些行为。我们引入推理链的视觉忠实性作为独立的评估维度,关注推理链的感知步骤是否基于图像。我们提出了一种无需训练的框架,将链分解为感知与推理步骤,并使用现成的VLM评判者进行步骤级忠实性评估,还通过人工元评估验证了这种方法。基于此指标,我们提出了一种轻量级的自我反思程序,无需训练即可检测并局部再生不忠实的感知步骤。在多种推理训练的VLMs和感知密集型基准测试中,我们的方法降低了不忠实感知率,同时保持最终答案的准确性,提高了多模态推理的可靠性。
Summary / 总结
The research aims to address the issue of visual unfaithfulness in reasoning-augmented vision language models (VLMs) that can lead to correct answers through inaccurate intermediate steps. The study introduces a new evaluation metric focusing on visual faithfulness and proposes a training- and reference-free framework to detect and regenerate unfaithful perception steps. The method improves the reliability of multimodal reasoning without compromising final-answer accuracy across various VLMs and benchmarks.
研究旨在解决视觉语言模型(VLMs)在推理过程中可能出现的视觉不忠实问题,这可能导致通过不准确的中间步骤得出正确答案。研究引入了一个新的评估指标,专注于视觉忠实性,并提出了一种无需训练和参考的框架来检测和再生不忠实的感知步骤。该方法在多种VLMs和感知密集型基准测试中提高了多模态推理的可靠性,同时不牺牲最终答案的准确性。
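A minimal sketch of how a step-level metric such as the Unfaithful Perception Rate could be computed. The abstract does not give the formula, so this assumes the rate is the fraction of perception steps that a judge marks as not grounded in the image. The `Step` type, its labels, and the judge verdicts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str       # "perception" or "reasoning" (hypothetical labels)
    faithful: bool  # judge verdict; only used for perception steps


def unfaithful_perception_rate(chain: List[Step]) -> float:
    """Fraction of perception steps judged unfaithful (assumed definition)."""
    perception = [s for s in chain if s.kind == "perception"]
    if not perception:
        return 0.0  # no perception steps, nothing to be unfaithful about
    return sum(not s.faithful for s in perception) / len(perception)


# Toy chain: four perception steps, one of them judged unfaithful.
chain = [
    Step("perception", True),
    Step("perception", False),
    Step("reasoning", True),
    Step("perception", True),
    Step("perception", True),
]
rate = unfaithful_perception_rate(chain)
print(rate)
```

In a self-reflection loop of the kind the abstract describes, the steps flagged as unfaithful would be the ones selected for local regeneration; that part is not sketched here.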
Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
Authors: Sander Moonemans, Sebastiaan Ram, Frédérique Meeuwsen, Carlijn Lems, Jeroen van der Laak, Geert Litjens, Francesco Ciompi
First: 2025-12-19T08:14:58+00:00 · Latest: 2025-12-19T08:14:58+00:00
Comments: 10 pages, 4 figures
Abstract
Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-α, a VLM capable of visual-question answering (VQA). We show that ANTONI-α outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-α trained with different amounts of data. All methods, data, and code are publicly available.
中文标题/摘要
标题:病理学副驾的民主化:一种开放的管道和数据集用于全视野视觉-语言建模
视觉-语言模型(VLMs)有可能成为病理学家的副驾。然而,大多数VLMs要么专注于全视野图像中的小区域,要么仅提供静态的全视野级输出,或者依赖于非公开的数据,限制了可重复性。此外,包含全视野图像(WSIs)与详细临床报告配对的训练数据稀缺,限制了透明和普适的VLMs的发展。我们通过三个主要贡献来解决这些限制。首先,我们引入了Polysome,这是一种标准化的合成指令生成工具。其次,我们应用Polysome到公共HISTAI数据集上,生成了HISTAI-Instruct,这是一个包含24,259张全视野图像和超过110万指令-响应对的大规模全视野指令调优数据集。最后,我们使用HISTAI-Instruct训练了ANTONI-α,这是一种能够进行视觉-问答(VQA)的VLM。我们展示了ANTONI-α在组织识别、肿瘤检测和鉴别诊断等全视野级VQA任务上优于MedGemma。我们还比较了使用不同数据量训练的ANTONI-α的不同版本的性能。所有方法、数据和代码都是公开的。
Summary / 总结
This paper aims to democratize the use of vision-language models (VLMs) as co-pilots for pathologists by addressing limitations in existing models. The authors introduce Polysome, a tool for generating synthetic instructions, and apply it to the HISTAI dataset to create HISTAI-Instruct, a large instruction-tuning dataset. Using this dataset, they train ANTONI-α, a VLM that excels in visual-question answering tasks such as tissue identification, neoplasm detection, and differential diagnosis, outperforming MedGemma. The methods, data, and code are publicly available.
研究旨在通过解决现有模型的局限性,使视觉-语言模型(VLM)能够作为病理学家的辅助工具。方法包括创建Polysome工具生成合成指令,并使用该工具创建HISTAI-Instruct,这是一个包含大量整张切片指令和响应的大数据集。关键发现是,基于此数据集训练的ANTONI-α在组织识别、肿瘤检测和鉴别诊断等任务上优于MedGemma。此外,研究还表明,更多的数据可以提高ANTONI-α不同版本的性能。
A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
Authors: Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, Yang Gao
First: 2025-12-19T08:07:51+00:00 · Latest: 2025-12-19T08:07:51+00:00
Abstract
Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: https://github.com/Yunkaidang/RSHR
中文标题/摘要
标题:超高清遥感MLLM基准
多模态大型语言模型(MLLMs)在现有遥感(RS)基准上展示了强大的感知和推理性能。然而,大多数先前的基准依赖于低分辨率图像,而一些高分辨率基准则遭受了推理任务设计的缺陷。我们表明,仅文本的LLM在不访问图像的情况下,能够与多模态视觉语言模型在RS推理任务上竞争,揭示了当前基准与视觉理解评估之间的重要不匹配。为了实现忠实评估,我们引入了RSHR-Bench,这是一个超高清基准,用于RS视觉理解和推理。RSHR-Bench 包含5,329张全场景图像,长边至少为4,000像素,每张图像最多约3 x 10^8像素,来自广泛使用的RS语料库和无人机采集的数据。我们设计了四个任务家族:多项选择VQA、开放式VQA、图像描述和单图像评估。这些任务涵盖了九个感知类别和四种推理类型,支持多轮和多图像对话。为了减少对语言先验的依赖,我们应用了强LLM的对抗过滤,随后进行了严格的真人验证。总体而言,我们构建了3,864个VQA任务、3,913个图像描述任务和500个完全由真人撰写或验证的单图像评估VQA对。跨开源、闭源和RS特定的VLM的评估揭示了在超高清场景中的持续性能差距。代码:https://github.com/Yunkaidang/RSHR
Summary / 总结
The research aims to address the limitations of existing remote sensing benchmarks, which predominantly use low-resolution images and flawed reasoning tasks. The authors introduce RSHR-Bench, a new benchmark for high-resolution remote sensing visual understanding and reasoning, containing 5,329 full-scene images with a long side of at least 4,000 pixels. The benchmark includes four task families: multiple-choice and open-ended VQA, image captioning, and single-image evaluation, covering various perception and reasoning categories. Evaluations show persistent performance gaps in super-high-resolution scenarios across different models.
研究旨在解决现有遥感基准的局限性,这些基准主要使用低分辨率图像和有缺陷的推理任务。作者引入了RSHR-Bench,这是一个新的高分辨率遥感视觉理解和推理基准,包含5,329张全场景图像,长边至少为4,000像素。基准包括四个任务家族:多项选择和开放式VQA、图像字幕和单图像评估,涵盖了各种感知和推理类别。评估结果显示,在超高分辨率场景中,不同模型普遍存在性能差距。
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
Authors: Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim
First: 2025-12-13T11:02:04+00:00 · Latest: 2025-12-19T08:02:44+00:00
Comments: 14 pages, 20 figures, conference
Abstract
Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches grow substantially with continuous streaming video input. Handling this input requires an iterative prefill stage, which is unique to streaming video LLMs and brings significant limitations, including extensive computation, substantial data transfer, and degradation in accuracy. Crucially, these issues are exacerbated in edge deployment, which is the primary target for these models.
In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units. V-Rex achieves unprecedented real-time throughput of 3.9-8.3 FPS and energy-efficient streaming video LLM inference in edge deployment with negligible accuracy loss. While the DRE accounts for only 2.2% of power and 2.0% of area, the system delivers 1.9-19.7x speedup and 3.1-18.5x energy efficiency improvements over an AGX Orin GPU. This work is the first to comprehensively tackle KV cache retrieval across algorithms and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices.
中文标题/摘要
标题:V-Rex:通过动态KV缓存检索实现实时流式视频LLM加速
流式视频大型语言模型(LLMs)越来越多地用于实时多模态任务,如视频字幕、问答、对话代理和增强现实。然而,这些模型面临着根本性的内存和计算挑战,因为它们的键值(KV)缓存会随着连续的流式视频输入而大幅增长。这一过程需要一个迭代预填充阶段,这是流式视频LLMs的独特特征。由于其迭代预填充阶段,它遭受了显著的限制,包括大量的计算、大量的数据传输和准确性的下降。至关重要的是,这个问题在边缘部署中被进一步加剧,这是这些模型的主要目标。
在这项工作中,我们提出了V-Rex,这是第一个软件硬件协同设计的加速器,全面解决了流式视频LLM推理中的算法和硬件瓶颈。V-Rex的核心是引入了ReSV,这是一种无需训练的动态KV缓存检索算法。ReSV利用基于时间空间相似性的令牌聚类来减少视频帧间的冗余KV缓存内存。为了充分利用这些算法优势,V-Rex提供了一个紧凑、低延迟的硬件加速器,其中包括一个动态KV缓存检索引擎(DRE),具有位级和早期退出计算单元。V-Rex在边缘部署中实现了前所未有的实时性能(3.9-8.3 FPS)和高效的流式视频LLM推理,几乎无准确度损失。虽然DRE仅占2.2%的功耗和2.0%的面积,但该系统相比AGX Orin GPU实现了1.9-19.7倍的加速和3.1-18.5倍的能效提升。这项工作首次全面解决了算法和硬件中的KV缓存检索问题,使流式视频LLM推理能够在资源受限的边缘设备上实时进行。
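To illustrate the general idea of similarity-based token clustering mentioned in the V-Rex abstract, here is a minimal sketch. It is not ReSV: the abstract does not specify the algorithm, so this uses a simple greedy scheme with a cosine-similarity threshold as a stand-in. The function names, the threshold value, and the toy key vectors are all assumptions made for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_tokens(keys: List[List[float]], threshold: float = 0.9) -> List[List[int]]:
    """Greedy clustering: a token joins the first cluster whose representative
    is similar enough, otherwise it starts a new cluster."""
    representatives: List[List[float]] = []
    clusters: List[List[int]] = []
    for i, key in enumerate(keys):
        for c, rep in enumerate(representatives):
            if cosine(key, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            representatives.append(key)
            clusters.append([i])
    return clusters


# Toy key vectors for five tokens; tokens 0, 1, 4 point one way, 2 and 3 another.
keys = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 0.98], [1.0, 0.01]]
clusters = cluster_tokens(keys)
print(clusters)
```

With clusters like these, a cache could keep one representative entry per cluster and retrieve the members only when needed, which is the kind of memory saving the abstract alludes to.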
Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model
Authors: SuBeen Lee, GilHan Park, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo
First: 2025-12-19T07:52:25+00:00 · Latest: 2025-12-19T07:52:25+00:00
Abstract
Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that enriches text representations without compromising inference efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. In addition, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
中文标题/摘要
标题:视觉语言模型少样本适应的辅助描述性知识
尽管视觉语言模型(VLMs)在零样本能力方面表现出色,但在预训练数据分布与下游任务分布不一致的情况下,它们往往难以应对。少样本适应(FSA-VLM)已成为关键解决方案,通常使用参数高效微调(PEFT)方法,以最少的数据适应模型。然而,这些PEFT方法受限于其对固定的手工制作提示的依赖,这些提示往往不足以理解类别的语义。虽然一些研究提出了利用图像诱导提示来为分类提供额外线索的方法,但它们在推理时引入了巨大的计算开销。因此,我们引入了辅助描述性知识(ADK),这是一种新颖的框架,能够高效地丰富文本表示而不牺牲效率。ADK首先利用大型语言模型为每个类别离线生成丰富的描述性提示。这些预计算特征然后以两种方式部署:(1)组合知识,这是一种平均表示,提供了丰富的语义,特别是在类别名称对VLM来说模糊或不熟悉时特别有益;(2)实例特定知识,其中一种轻量级的非参数注意力机制动态选择与给定图像最相关的描述。这种方法为手工制作提示提供了两种额外的知识类型,从而在各种领域促进类别区分。此外,ADK作为无参数、即插即用的组件,增强了现有的PEFT方法。广泛的实验表明,ADK能够一致地提升多个PEFT基线的性能,在各种场景中设置新的最佳水平。
Summary / 总结
The paper addresses the challenge of distribution shifts in Vision-Language Models (VLMs) by introducing Auxiliary Descriptive Knowledge (ADK), which enriches text representations through pre-computed descriptive prompts generated by a Large Language Model. ADK enhances PEFT methods by providing Compositional Knowledge for ambiguous classes and Instance-Specific Knowledge via a lightweight attention mechanism, leading to improved performance across various scenarios and setting a new state-of-the-art.
论文通过引入辅助描述性知识(ADK),利用大型语言模型生成描述性提示来丰富文本表示,以解决Vision-Language模型在分布变化中的挑战。ADK通过提供针对模糊类别的组合知识和通过轻量级注意力机制提供的实例特定知识来增强PEFT方法,从而在各种场景中提高了性能,并设立了新的最佳水平。
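The two uses of pre-computed description features in the ADK abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Compositional Knowledge is taken to be a plain mean of description embeddings, and Instance-Specific Knowledge is taken to be a softmax-weighted sum with dot-product scores and no learned parameters. The function names, the temperature value, and the toy 2-D embeddings are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compositional_knowledge(desc_embs: List[Vector]) -> Vector:
    """Average the description embeddings of one class (assumed form)."""
    n = len(desc_embs)
    return [sum(col) / n for col in zip(*desc_embs)]


def instance_specific_knowledge(
    image_emb: Vector, desc_embs: List[Vector], temperature: float = 0.1
) -> Tuple[Vector, List[float]]:
    """Non-parametric attention: weight descriptions by similarity to the image."""
    weights = softmax([dot(image_emb, d) / temperature for d in desc_embs])
    dim = len(image_emb)
    pooled = [sum(w * d[j] for w, d in zip(weights, desc_embs)) for j in range(dim)]
    return pooled, weights


# Toy embeddings for three descriptions of one class, plus one image embedding.
descriptions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
image = [0.0, 1.0]

comp = compositional_knowledge(descriptions)
inst, weights = instance_specific_knowledge(image, descriptions)
print(comp, inst, weights)
```

Because the description embeddings are computed offline, the only per-image work in this sketch is one dot product per description and a softmax, which is consistent with the abstract's claim of low inference overhead.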
Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
Authors: Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, Lijun Zhang
First: 2025-12-19T07:44:43+00:00 · Latest: 2025-12-19T07:44:43+00:00
Abstract
Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
中文标题/摘要
标题:深入而可靠:提升图像思维的多轮推理能力
近年来,大型视觉-语言模型(VLMs)在通过图像思维解决复杂视觉任务时展现了强大的推理能力,这得益于它们在链式思维(CoT)中主动调用工具分析视觉输入,而不仅仅是感知它们。然而,现有模型在尝试错误的推理轨迹时往往难以自我反思和纠正。为解决这一局限,我们提出了DRIM模型,该模型能够在多模态CoT中进行深入而可靠的多轮推理。我们的管道包括三个阶段:数据构建、冷启动微调(SFT)和强化学习(RL)。基于高分辨率图像数据集,我们构建了高难度且可验证的视觉问答对,其中解决每个任务需要多轮工具调用来达到正确答案。在SFT阶段,我们收集工具轨迹作为冷启动数据,引导多轮推理模式。在RL阶段,我们引入了冗余惩罚策略优化,激励模型发展自我反思的推理模式。基本思想是对推理轨迹进行判断,并惩罚那些在未进行充分多尺度探索的情况下产生错误答案的轨迹。大量实验表明,DRIM在视觉理解基准测试中取得了优越的性能。
Summary / 总结
The research aims to enhance the reasoning capabilities of Vision-Language Models (VLMs) by enabling deep but reliable multi-turn reasoning when thinking with images. The method involves a three-stage pipeline: data construction, cold-start SFT, and RL. DRIM constructs high-difficulty visual question-answer pairs requiring multi-turn tool calls. During SFT, tool trajectories are collected to guide multi-turn reasoning. In the RL stage, a redundancy-penalized policy optimization is introduced to encourage self-reflective reasoning. Experiments show that DRIM outperforms existing models on visual understanding benchmarks.
研究旨在通过开发DRIM模型来增强视觉语言模型在复杂视觉任务中的推理能力,使其能够进行深入但可靠的多轮推理。方法包括构建高难度的视觉问答对、收集工具轨迹,并使用冗余惩罚的策略优化来促进自我反思的推理。关键发现表明,DRIM在视觉理解基准测试中优于现有模型。
EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance
Authors: Ankit Yadav, Ta Duc Huy, Lingqiao Liu
First: 2025-12-19T07:36:07+00:00 · Latest: 2025-12-19T07:36:07+00:00
Comments: 26 pages
Abstract
In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. Classifier-free guidance (CFG) is the de facto choice in modern systems and achieves this by contrasting conditional and unconditional samples. Recent work explores contrasting negative samples at inference using a weaker model, via strong/weak model pairs, attention-based masking, stochastic block dropping, or perturbations to the self-attention energy landscape. While these strategies refine the generation quality, they still lack reliable control over the granularity or difficulty of the negative samples, and target-layer selection is often fixed. We propose Exponential Moving Average Guidance (EMAG), a training-free mechanism that modifies attention at inference time in diffusion transformers, with a statistics-based, adaptive layer-selection rule. Unlike prior methods, EMAG produces harder, semantically faithful negatives (fine-grained degradations) that surface difficult failure modes and enable the denoiser to refine subtle artifacts, improving quality and raising the human preference score (HPS) by 0.46 over CFG. We further demonstrate that EMAG naturally composes with advanced guidance techniques, such as APG and CADS, further improving HPS.
中文标题/摘要
标题:EMAG:指数移动平均指导下的自我校正扩散采样
在扩散和流匹配生成模型中,指导技术被广泛用于提高样本质量和一致性。无分类指导(CFG)是现代系统中的不二选择,通过对比条件样本和无条件样本来实现这一点。最近的工作探索了使用较弱模型在推理时对比负样本的方法,通过强弱模型对、注意力掩码、随机块删除或自注意力能量景观扰动。虽然这些策略可以细化生成质量,但它们仍然缺乏对负样本的粒度或难度的可靠控制,且目标层选择通常是固定的。我们提出了指数移动平均指导(EMAG),这是一种无需训练的机制,在扩散变换器推理时修改注意力,采用基于统计的自适应层选择规则。与先前方法不同,EMAG 生成更难、语义忠实的负样本(细粒度退化),揭示了难以处理的失败模式,使去噪器能够细化细微的伪影,使质量得分和人类偏好得分(HPS)提高 0.46。我们进一步证明,EMAG 可以自然地与高级指导技术(如 APG 和 CADS)结合使用,进一步提高 HPS。
Summary / 总结
The research aims to enhance the sample quality and consistency in diffusion and flow-matching generative models by proposing Exponential Moving Average Guidance (EMAG). EMAG modifies attention at inference time in diffusion transformers using an adaptive layer-selection rule based on statistics. This method generates harder, semantically faithful negative samples, which help refine subtle artifacts and improve the human preference score by +0.46 compared to classifier-free guidance (CFG). EMAG also complements other advanced guidance techniques, further boosting the human preference score.
论文提出了一种名为EMAG的方法,通过在推理时调整注意力并使用自适应层选择规则来提高扩散模型的样本质量。与以往技术不同,EMAG生成更难、语义上更忠实的负样本,有助于细化细微的伪影,使质量和人类偏好评分(HPS)提高了0.46。EMAG还能与APG和CADS等高级引导技术自然结合,进一步提升性能。
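For reference, the classifier-free guidance baseline that EMAG is compared against combines a conditional and an unconditional noise prediction by extrapolating from the unconditional one. The sketch below shows that standard combination on plain Python lists. It does not implement EMAG: how EMAG builds its negative prediction (the attention modification and layer-selection rule) is not specified in the abstract. The function and variable names are illustrative.

```python
from typing import List


def classifier_free_guidance(
    eps_cond: List[float], eps_uncond: List[float], scale: float
) -> List[float]:
    """Standard CFG: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions for a two-element latent.
eps_cond = [1.0, 2.0]
eps_uncond = [0.5, 1.0]

guided = classifier_free_guidance(eps_cond, eps_uncond, scale=3.0)
unguided = classifier_free_guidance(eps_cond, eps_uncond, scale=1.0)
print(guided, unguided)
```

Methods that contrast against a weaker or degraded "negative" prediction typically keep this extrapolation form and swap what plays the role of the second term; whether EMAG does exactly that is an assumption here.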
ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
Authors: Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo, Cen Chen
Venue: AAAI 2026 poster
First: 2025-12-19T07:27:19+00:00 · Latest: 2025-12-19T07:27:19+00:00
Comments: Accepted for poster presentation at AAAI 2026
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model's temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
中文标题/摘要
标题:ProCache:基于约束的特征缓存与选择性计算以加速扩散变换器
扩散变换器(DiTs)在生成建模中取得了最先进的性能,但其高昂的计算成本阻碍了实时部署。虽然特征缓存通过利用时间冗余提供了一种无训练的加速解决方案,但现有方法存在两个关键局限性:(1)均匀的缓存间隔无法与DiT的时间非均匀动态对齐;(2)使用过大的缓存间隔进行简单的特征重用会导致严重的误差累积。在本文中,我们分析了去噪过程中DiT特征的演变,发现特征变化和误差传播在时间和深度上都高度变化。受此启发,我们提出了ProCache,这是一种基于约束的动态特征缓存框架,通过两个核心组件解决了这些问题:(i)一种约束感知的缓存模式搜索模块,通过离线约束采样生成非均匀的激活时间表,以适应模型的时间特性;(ii)一种选择性计算模块,在深层块和高重要性标记中选择性地计算缓存段,以最小化误差累积,同时减少开销。在PixArt-alpha和DiT上的广泛实验表明,ProCache在几乎不降低质量的情况下实现了高达1.96倍和2.90倍的加速,显著优于先前的基于缓存的方法。
Summary / 总结
ProCache is a training-free dynamic feature caching framework designed to accelerate Diffusion Transformers (DiTs) by addressing the limitations of uniform caching intervals and naive feature reuse. It uses a constraint-aware caching pattern search module to generate non-uniform activation schedules and a selective computation module to minimize error accumulation. Experiments show that ProCache can achieve up to 1.96x and 2.90x acceleration with negligible quality degradation compared to prior methods.
ProCache 是一种无训练的动态特征缓存框架,旨在通过解决均匀缓存间隔和过度误差累积的问题来加速扩散变换器(DiTs)。它使用一种约束感知的缓存模式搜索模块生成非均匀激活调度,并使用选择性计算模块来最小化误差传播。实验表明,ProCache 可以实现高达 1.96 倍和 2.90 倍的加速,同时保持质量基本不变,显著优于之前的基于缓存的方法。
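A minimal sketch of feature caching with a non-uniform activation schedule, the mechanism the ProCache abstract builds on. The schedule here is hand-picked for illustration, whereas the paper obtains it through an offline constrained search. The selective computation module is omitted, and `run_with_cache` and the toy `block` function are hypothetical names.

```python
from typing import Callable, List, Set, Tuple


def run_with_cache(
    num_steps: int, compute_steps: Set[int], block: Callable[[int], int]
) -> Tuple[List[int], int]:
    """Run `block` only on scheduled steps; reuse the cached output otherwise."""
    cache = None
    outputs: List[int] = []
    computed = 0
    for t in range(num_steps):
        if t in compute_steps or cache is None:
            cache = block(t)  # full computation, refreshes the cache
            computed += 1
        outputs.append(cache)  # cached steps reuse the last computed features
    return outputs, computed


# Non-uniform schedule: dense early, sparse later (an arbitrary example).
schedule = {0, 1, 2, 4, 7}
outputs, computed = run_with_cache(10, schedule, lambda t: t * t)
speedup = 10 / computed
print(outputs, computed, speedup)
```

The gap between a cached output and the value a full computation would have produced at that step is the error that accumulates with large caching intervals, which is what the abstract's selective computation is meant to reduce.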
Vision-Language Model Guided Image Restoration
Authors: Cuixin Yang, Rongkang Dong, Kin-Man Lam
First: 2025-12-19T07:16:07+00:00 · Latest: 2025-12-19T07:16:07+00:00
Abstract
Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
中文标题/摘要
标题:视觉-语言模型引导的图像恢复
许多图像恢复(IR)任务需要在像素级保真度和高层次语义理解之间取得平衡,以恢复具有精细细节的逼真照片。然而,之前的许多方法往往难以有效地利用视觉和语言知识。最近的努力尝试将视觉-语言模型(VLMs)纳入通用IR中,这些模型擅长对齐视觉和文本特征。然而,这些方法在恢复过程中未能利用语言先验以确保语义一致性。为了解决这一问题,本文提出了一种视觉-语言模型引导的图像恢复(VLMIR)框架,该框架利用VLMs,如CLIP,丰富的视觉-语言先验,通过增强视觉感知和语义理解来提高IR性能。我们的方法分为两个阶段:基于VLM的特征提取和基于扩散的图像恢复。在第一阶段,我们通过VLMs浓缩视觉感知和高层次语义先验,提取输入图像的互补视觉和语言表示。具体来说,我们使用余弦相似度损失和LoRA微调对低质量图像和高质量图像的描述符进行对齐,并使用退化预测器将退化和清洁图像内容嵌入分解。这些互补的视觉和文本嵌入通过交叉注意力机制整合到基于扩散的模型中,以增强恢复效果。广泛的实验和消融研究证明,VLMIR在通用和退化特定的IR任务中均表现出优越的性能,突显了VLMs整合的视觉和语言知识在提高图像恢复能力方面的重要作用。
Summary / 总结
The paper addresses the challenge of image restoration by proposing VLMIR, which integrates Vision-Language Models (VLMs) to enhance both visual perception and semantic understanding. The method consists of two stages: VLM-based feature extraction and diffusion-based image restoration. By aligning visual and textual embeddings and integrating them via cross-attention mechanisms, VLMIR improves restoration quality. Experiments show that VLMIR outperforms previous methods in both universal and degradation-specific tasks, highlighting the importance of combined visual and linguistic knowledge in image restoration.
本文提出VLMIR框架,通过整合Vision-Language模型(VLMs)来增强视觉感知和语义理解,以解决图像恢复的挑战。该方法包括两个阶段:基于VLM的特征提取和基于扩散的图像恢复。通过对齐视觉和文本特征,并通过交叉注意力机制进行整合,VLMIR提高了恢复质量,在通用和降解特定的图像恢复任务中均表现出优越性能。
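The VLMIR abstract mentions aligning caption embeddings of low-quality and high-quality images with a cosine similarity loss. A minimal sketch of such a loss is given below, assuming the common form 1 - cos(a, b); the exact formulation used in the paper is not stated in the abstract. The toy vectors stand in for caption embeddings and are hypothetical, and the LoRA fine-tuning that would minimize this loss is not shown.

```python
import math
from typing import List


def cosine_similarity_loss(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity: 0 when the vectors are aligned, 1 when orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


# Parallel vectors (same direction, different scale) versus orthogonal ones.
loss_aligned = cosine_similarity_loss([1.0, 2.0], [2.0, 4.0])
loss_orthogonal = cosine_similarity_loss([1.0, 0.0], [0.0, 1.0])
print(loss_aligned, loss_orthogonal)
```

The loss ignores vector magnitude, so it only pulls the direction of the low-quality caption embedding toward that of the high-quality one.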
HOLODECK 2.0: Vision-Language-Guided 3D World Generation with Editing
Authors: Zixuan Bian, Ruohan Ren, Yue Yang, Chris Callison-Burch
First: 2025-08-07T23:23:07+00:00 · Latest: 2025-12-19T05:39:09+00:00
Abstract
3D scene generation plays a crucial role in gaming, artistic creation, virtual reality, and many other domains. However, current 3D scene design still relies heavily on extensive manual effort from creators, and existing automated methods struggle to generate open-domain scenes or support flexible editing. To address those challenges, we introduce HOLODECK 2.0, an advanced vision-language-guided framework for 3D world generation with support for interactive scene editing based on human feedback. HOLODECK 2.0 can generate diverse and stylistically rich 3D scenes (e.g., realistic, cartoon, anime, and cyberpunk styles) that exhibit high semantic fidelity to fine-grained input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models (VLMs) to identify and parse the objects required in a scene and generates corresponding high-quality assets via state-of-the-art 3D generative models. Then, HOLODECK 2.0 iteratively applies spatial constraints derived from the VLMs to achieve semantically coherent and physically plausible layouts. Both human and model evaluations demonstrate that HOLODECK 2.0 effectively generates high-quality scenes closely aligned with detailed textual descriptions, consistently outperforming baselines across indoor and open-domain scenarios. Additionally, HOLODECK 2.0 provides editing capabilities that flexibly adapt to human feedback, supporting layout refinement and style-consistent object edits. Finally, we present a practical application of HOLODECK 2.0 in procedural game modeling to generate visually rich and immersive environments that can boost efficiency in game design.
中文标题/摘要
标题:HOLODECK 2.0:基于视觉语言的3D世界生成与编辑
3D场景生成在游戏、艺术创作、虚拟现实等领域中起着关键作用。然而,当前的3D场景设计仍然主要依赖于创作者的大量手工努力,现有的自动化方法难以生成开放领域的场景或支持灵活的编辑。为了解决这些挑战,我们引入了HOLODECK 2.0,这是一种基于视觉语言的先进框架,用于3D世界生成,并支持基于人类反馈的交互式场景编辑。HOLODECK 2.0可以根据细粒度的输入描述生成多样且风格丰富的3D场景(例如,现实主义、卡通、动漫和赛博朋克风格),适用于室内和开放领域环境。HOLODECK 2.0利用视觉语言模型(VLMs)识别和解析场景所需的物体,并通过最先进的3D生成模型生成相应的高质量资产。然后,HOLODECK 2.0迭代应用来自VLMs的空间约束,以实现语义一致且物理上合理的布局。人类和模型评估均表明,HOLODECK 2.0能够有效生成与详细文本描述高度一致的高质量场景,在室内和开放领域场景中始终优于基线方法。此外,HOLODECK 2.0提供了灵活适应人类反馈的编辑功能,支持布局细化和风格一致的对象编辑。最后,我们展示了HOLODECK 2.0在程序化游戏建模中的实际应用,以生成视觉丰富且沉浸式的环境,从而提高游戏设计的效率。
Summary / 总结
HOLODECK 2.0 is an advanced vision-language-guided framework for 3D world generation that supports interactive scene editing based on human feedback. It generates diverse and stylistically rich 3D scenes with high semantic fidelity to input descriptions, suitable for both indoor and open-domain environments. HOLODECK 2.0 leverages vision-language models to identify and parse objects and uses state-of-the-art 3D generative models to create high-quality assets, iteratively applying spatial constraints to achieve coherent and plausible layouts. Evaluations show that HOLODECK 2.0 outperforms baselines and provides editing capabilities for layout refinement and style-consistent object edits, enhancing game design efficiency.
HOLODECK 2.0 是一个基于视觉语言的先进框架,用于生成支持基于人类反馈的交互式场景编辑的3D世界。它生成具有高语义一致性的多样化和风格丰富的3D场景,适用于室内和开放域环境。HOLODECK 2.0 利用视觉语言模型识别和解析对象,并使用最先进的3D生成模型创建高质量的资产,通过迭代应用空间约束来实现连贯且物理上合理的布局。评估显示HOLODECK 2.0 在基准之上表现出色,并提供布局细化和风格一致的对象编辑功能,提高游戏设计效率。
Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Authors: Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura
Venue: WACV 2026
First: 2025-09-11T02:53:58+00:00 · Latest: 2025-12-19T05:20:46+00:00
Comments: WACV 2026 Accepted
Abstract
Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, which extracts leaf instances, and a vision-language model, which reasons about plants' structures, to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
中文标题/摘要
标题:基于基础分割模型和文本到图像注意力的零样本分层植物分割
基础分割模型能够在无需训练的情况下(即零样本)从顶部视角的作物图像中实现合理的叶片实例提取。然而,分割由多个重叠叶片组成的整个植物个体仍然具有挑战性。这个问题被称为分层分割任务,通常需要标注训练数据集,这些数据集往往是特定于物种的,并需要大量的人工劳动。为了解决这个问题,我们引入了ZeroPlantSeg,这是一种从顶部视角图像中对莲座状植物个体进行零样本分割的方法。我们结合了基础分割模型来提取叶片实例,并使用视觉-语言模型来推理植物结构,从而在无需额外训练的情况下提取植物个体。在包含多种植物物种、生长阶段和拍摄环境的数据集上的评估表明,我们的方法超越了现有的零样本方法,并在跨域性能上优于监督方法。相关实现可在https://github.com/JunhaoXing/ZeroPlantSeg获取。
Summary / 总结
The research aims to address the challenge of hierarchical plant segmentation, particularly for rosette-shaped plant individuals from top-view images, which is difficult due to overlapping leaves and the need for species-specific annotated datasets. The method combines a foundation segmentation model for leaf instance extraction and a vision-language model for reasoning about plant structures. Experiments show that the proposed ZeroPlantSeg method outperforms existing zero-shot methods and achieves better cross-domain performance than supervised methods.
研究旨在解决从顶部视角图像中分割整个植物个体的挑战,特别是当植物由多个重叠的叶子组成时。方法结合了基础分割模型进行叶子实例提取和视觉-语言模型进行植物结构推理。实验表明,提出的ZeroPlantSeg方法在多种数据集上优于现有零样本方法,并且在跨域性能上优于监督方法。
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Authors: Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
First: 2025-12-19T04:09:24+00:00 · Latest: 2025-12-19T04:09:24+00:00
Abstract
While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
中文标题/摘要
标题:DAVE:一种用于文档理解和网络代理的VLM视觉编码器
尽管视觉语言模型(VLMs)在多模态任务中表现出色,但它们所选择的视觉编码器存在根本性弱点:其低级特征缺乏文档理解和网络代理所需的稳健的结构和空间信息。为弥补这一差距,我们提出了DAVE,一种专为VLMs设计并针对这些任务定制的视觉编码器。我们的训练管道旨在利用大量未标注数据,以绕过对文档和网络图像的大规模注释成本。我们首先在未标注图像上进行自我监督预训练,然后在监督自回归预训练阶段,模型从有限的高质量数据中学习解析和定位等任务。在监督阶段内,我们采用了两种策略来提高编码器与通用视觉知识和多样化文档及网络代理任务的对齐:(i) 我们引入了一种新的模型合并方案,将使用不同文本解码器训练的编码器结合在一起,以确保与不同网络代理架构的广泛兼容性。(ii) 我们使用集成训练,将预训练的一般编码器(如SigLIP2)的特征与我们自己的文档和网络特定表示融合。在经典文档任务、VQAs、网络定位和基于代理的基准测试中的广泛实验验证了我们方法的有效性,确立了DAVE作为文档和网络应用的强大视觉编码器的地位。
Summary / 总结
DAVE is a vision encoder designed for VLMs to improve document understanding and web agent tasks. Its training pipeline pairs self-supervised pretraining on abundant unlabeled images with supervised autoregressive pretraining on limited, high-quality data. Within the supervised stage, it merges encoders trained with different text decoders and uses ensemble training to fuse features from generalist encoders such as SigLIP2. Experiments on classic document tasks, VQA, web localization, and agent-based benchmarks validate the approach.
DAVE 是一种专门为 Vision-language 模型设计的视觉编码器,旨在增强文档理解和网页代理任务。它通过在未标注数据和高质量数据上的自我监督和监督预训练来实现。DAVE 结合了模型合并方案和集成训练,以提高其与各种网页代理架构和文档特定任务的兼容性。实验结果表明,DAVE 在经典文档任务、VQAs、网页定位和基于代理的基准测试中表现出色,使其成为这些应用中的强大视觉编码器。
CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency
Authors: Xiao Liang, Yuxuan An, Di Wang, Jiawei Hu, Zhicheng Jiao, Bin Jing, Quan Wang
First: 2025-12-19T03:50:42+00:00 · Latest: 2025-12-19T03:50:42+00:00
Abstract
Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: https://github.com/ecoxial2007/CheX-Phi4MM.
中文标题/摘要
标题:CheXPO-v2:基于知识图谱一致性的胸部X光VLM偏好优化
医学视觉-语言模型(VLMs)容易产生幻觉,影响临床可靠性。虽然强化学习方法如组相对策略优化(GRPO)提供了一种低成本的对齐解决方案,但它们依赖于稀疏的结果导向奖励,无意中促使模型“过度思考”——生成冗长、复杂且不可验证的推理链来证明答案。这种对结果的关注掩盖了事实错误,并带来了重大的安全风险。为了解决这一问题,我们提出了CheXPO-v2,这是一种新颖的对齐框架,从结果监督转向过程监督。我们的核心创新是基于实体-关系匹配的知识图谱一致性奖励机制。通过明确将推理步骤解析为结构化的“疾病、关系、解剖结构”三元组,我们提供了细粒度的监督,从原子层面惩罚不一致的逻辑和幻觉。结合这一机制与困难样本挖掘策略,我们的方法在MIMIC-CXR-VQA等基准测试上显著优于GRPO和最先进的模型。至关重要的是,CheXPO-v2仅使用5000个样本就达到了新的最佳准确率,展示了出色的数据效率,同时生成了临床适用且可验证的推理。该项目源代码可在以下网址公开获取:https://github.com/ecoxial2007/CheX-Phi4MM.
Summary / 总结
CheXPO-v2 is a novel framework for aligning medical vision-language models to reduce hallucinations and improve clinical reliability. It uses a Knowledge Graph Consistency Reward mechanism to supervise the reasoning process, penalizing incoherent logic and hallucinations. By integrating this with a hard-example mining strategy, CheXPO-v2 outperforms existing methods like GRPO and achieves state-of-the-art accuracy on benchmarks such as MIMIC-CXR-VQA with only 5k samples, showcasing high data efficiency and clinical soundness.
研究旨在通过解决医学视觉语言模型生成幻觉的问题,提高其可靠性。CheXPO-v2是一种新颖的对齐框架,从基于结果的监督转向基于过程的监督,使用知识图谱一致性奖励机制。该方法将推理步骤明确解析为结构化的三元组,惩罚不连贯的逻辑和幻觉。实验表明,CheXPO-v2在基准测试上优于现有方法如GRPO和最先进的模型,仅使用5k样本就达到了新的最佳准确性,显示出高效的数据利用和临床适用性。
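The triplet-based process supervision described in the CheXPO-v2 entry above can be illustrated with a small sketch. The abstract does not give the reward formula, so the F1-style overlap below is an assumption chosen for illustration, and the names `normalize` and `consistency_reward` are hypothetical rather than taken from the project's code.

```python
from typing import Iterable, Set, Tuple

# (disease, relation, anatomy), following the triplet format named in the abstract.
Triplet = Tuple[str, str, str]


def normalize(triplet: Triplet) -> Triplet:
    """Lowercase and trim each field so trivial formatting differences still match."""
    disease, relation, anatomy = triplet
    return (disease.strip().lower(), relation.strip().lower(), anatomy.strip().lower())


def consistency_reward(predicted: Iterable[Triplet], reference: Iterable[Triplet]) -> float:
    """Hypothetical reward: F1 overlap between predicted and reference triplets.

    Assumption: the real Knowledge Graph Consistency Reward may be defined differently.
    """
    pred: Set[Triplet] = {normalize(t) for t in predicted}
    ref: Set[Triplet] = {normalize(t) for t in reference}
    if not pred and not ref:
        return 1.0  # nothing claimed and nothing expected
    matched = len(pred & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)  # drops when unsupported triplets are asserted
    recall = matched / len(ref)      # drops when expected findings are missed
    return 2 * precision * recall / (precision + recall)


# Usage: one supported finding plus one unsupported (hallucinated) finding.
reasoning = [
    ("Pneumonia", "located_in", "Left lower lobe"),
    ("Effusion", "located_in", "right pleura"),
]
knowledge_graph = [("pneumonia", "located_in", "left lower lobe")]
print(consistency_reward(reasoning, knowledge_graph))
```

Scoring at the triplet level means an unsupported claim lowers the reward even when the final answer is correct, which is the outcome-versus-process distinction the abstract draws.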
Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
Authors: Rujiao Long, Yang Li, Xingyao Zhang, Weixun Wang, Tianqianjin Lin, Xi Zhao, Yuchi Xu, Wenbo Su, Junchi Yan, Bo Zheng
First: 2025-12-19T03:32:53+00:00 · Latest: 2025-12-19T03:32:53+00:00
Abstract
Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection of diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable steering of the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
中文标题/摘要
标题:推理调色板:通过潜在上下文化调节推理以实现可控探索的(V)LMs
探索能力既影响大型(视觉-)语言模型的推理时表现,也影响强化学习(RL)训练,因为随机采样通常会产生冗余的推理路径,缺乏高层多样性。本文提出了一种名为推理调色板的新颖潜在调节框架,该框架赋予模型一个用于策略性上下文化的随机潜在变量,在生成标记之前引导其内部规划。该潜在上下文通过变分自编码器(VAE)从问题-答案对的均值池化嵌入中推断出来,其中每个采样的潜在变量可能编码不同的推理上下文。在推理过程中,采样的潜在变量被解码为可学习的标记前缀,并附加到输入提示中,调节模型的内部推理轨迹。通过这种方式,模型在输出生成之前对推理策略进行内部采样,从而塑造整个响应序列的风格和结构。简短的监督微调(SFT)预热阶段使模型能够适应这种潜在条件。在RL优化中,推理调色板通过允许按需注入多种推理模式来促进结构化探索,显著提高了探索效率和持续学习能力。在多个推理基准测试中,我们的方法使模型能够实现可解释且可控的战略行为,从而在标准RL方法上取得一致的性能提升。
Summary / 总结
The paper introduces Reasoning Palette, a latent-modulation framework designed to enhance the strategic reasoning of large language models. It uses a variational autoencoder to infer a latent variable from question-answer pairs, which is then decoded into token prefixes to guide the model's reasoning. This approach allows for interpretable and controllable exploration, leading to improved performance in various reasoning benchmarks compared to standard reinforcement learning methods. Within reinforcement learning, Reasoning Palette enables more efficient exploration and sustained learning by facilitating diverse reasoning modes on demand.
论文提出了一种名为Reasoning Palette的潜在调制框架,以解决大型(视觉-)语言模型中冗余推理路径的问题。该框架通过变分自编码器从问题-答案对中推断出一个随机的潜在变量,引导模型在生成标记前的内部规划。推理时,该潜在上下文被解码成可学习的标记前缀并添加到输入提示中,调节推理轨迹。实验表明,该方法提高了探索效率和学习能力,实现了对模型战略行为的可解释和可控控制,从而在多个推理基准上取得了稳定性能提升。
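The sample-then-decode step described in the Reasoning Palette entry above can be sketched in a few lines. This is a toy illustration under stated assumptions: the latent is sampled with the standard VAE reparameterization, and the decoder is a single hypothetical linear map (`weights`). The paper's actual decoder architecture and dimensions are not given in the abstract.

```python
import math
import random
from typing import List


def sample_latent(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def decode_prefix(z: List[float], weights: List[List[float]], num_tokens: int) -> List[List[float]]:
    """Map a latent to `num_tokens` prefix embeddings with a linear decoder.

    Assumption: `weights` has num_tokens * embed_dim rows and len(z) columns.
    """
    if num_tokens <= 0 or len(weights) % num_tokens != 0:
        raise ValueError("weights rows must be a positive multiple of num_tokens")
    flat = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
    embed_dim = len(flat) // num_tokens
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(num_tokens)]


# Usage: each fresh sample yields a different prefix, i.e. a different reasoning context.
rng = random.Random(0)
z = sample_latent([0.0, 0.0], [0.0, 0.0], rng)
prefix = decode_prefix(z, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]], num_tokens=2)
print(len(prefix), len(prefix[0]))
```

In a full system the prefix embeddings would be prepended to the prompt embeddings before generation; that step is omitted here because it depends on the host model.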
Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
Authors: Xiao Liang, Chenxi Liu, Zhi Ma, Di Wang, Bin Jing, Quan Wang, Yuanyuan Shi
First: 2025-12-19T03:11:20+00:00 · Latest: 2025-12-19T03:11:20+00:00
Abstract
Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
中文标题/摘要
标题:解剖区域引导对比解码:一种缓解医疗VLM幻觉的即插即用策略
医疗视觉-语言模型(MedVLMs)在临床应用中展现出巨大的潜力。然而,它们的可靠性受到幻觉的阻碍,模型往往未能从视觉证据中得出答案,而是依赖于学习到的文本先验。现有针对MedVLMs的缓解策略各有局限性:基于训练的方法依赖于昂贵的专家注释,限制了其可扩展性,而无需训练的干预措施如对比解码,尽管数据高效,但其在复杂现实临床环境中的效果可能不可靠。为解决这些挑战,我们引入了解剖区域引导对比解码(ARCD),这是一种即插即用策略,通过提供针对性的区域特定指导来缓解幻觉。我们的模块利用了解剖学掩码来引导三级对比解码过程。通过动态重新加权在标记、注意力和logits层面,它可验证地将模型的注意力引导到指定区域,强化解剖学理解并抑制事实错误的输出。在包括胸部X光、CT、脑MRI和眼部超声在内的多种数据集上的广泛实验表明,我们的方法在提高区域理解、减少幻觉和提高整体诊断准确性方面具有有效性。
Summary / 总结
The research aims to address the issue of hallucinations in Medical Vision-Language Models (MedVLMs) by introducing Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy. ARCD uses an anatomical mask to guide a three-tiered contrastive decoding process, dynamically re-weighting at the token, attention, and logits levels to focus on specified regions. Experiments across various medical imaging datasets show that ARCD improves regional understanding, reduces hallucinations, and enhances diagnostic accuracy.
研究通过引入Anatomical Region-Guided Contrastive Decoding (ARCD)策略来解决Medical Vision-Language Models (MedVLMs)中的幻觉问题。ARCD 使用解剖学掩码引导一个三级对比解码过程,动态调整在标记、注意力和概率层的权重,以引导模型关注特定区域。在多种医学影像数据集上的实验表明,ARCD 有效减少了幻觉,提高了区域理解能力,并提升了诊断准确性。
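For the logits level of the ARCD entry above, a minimal sketch of contrastive decoding may help. It uses a commonly seen contrastive form, `(1 + alpha) * guided - alpha * plain`, which is an assumption here and not necessarily ARCD's exact re-weighting. The `guided` logits are assumed to come from a forward pass with the anatomical mask applied and the `plain` logits from a pass without it.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(guided: List[float], plain: List[float], alpha: float = 1.0) -> List[float]:
    """Amplify what the region-guided pass prefers relative to the plain pass.

    Assumption: this generic form stands in for the paper's logits-level step.
    """
    if len(guided) != len(plain):
        raise ValueError("logit vectors must have the same vocabulary size")
    return [(1.0 + alpha) * g - alpha * p for g, p in zip(guided, plain)]


# Usage: token 1 gains probability because only the guided pass supports it.
guided = [2.0, 1.0, 0.0]
plain = [2.0, 0.0, 0.0]
print(softmax(contrastive_logits(guided, plain, alpha=1.0)))
```

Setting `alpha` to 0 recovers the guided logits unchanged, so the parameter controls how strongly the text-prior-driven prediction is subtracted.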
ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
Authors: Qi Zhang, Yuxu Chen, Lei Deng, Lili Shen
First: 2025-12-19T02:36:51+00:00 · Latest: 2025-12-19T02:36:51+00:00
Comments: 10 pages, 8 figures
Abstract
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
中文标题/摘要
标题:ABE-CLIP:无需训练的属性绑定增强方法以提高组合图像-文本匹配
对比语言-图像预训练(CLIP)在多种多模态任务中取得了显著的性能。然而,它仍然在组合图像-文本匹配方面存在困难,特别是在准确关联对象与其相应的属性方面,因为其固有的全局表示往往忽略了属性绑定中的细粒度语义。现有方法通常需要额外的训练或大量的硬负样本,但它们经常在对新颖的组合概念进行泛化时表现有限,并未能从根本上解决全局表示的缺点。在本文中,我们提出了一种名为ABE-CLIP的新颖的无需训练的属性绑定增强方法,旨在增强CLIP类模型中的属性-对象绑定。具体而言,我们采用语义精炼机制来精炼文本中对象和属性短语的标记嵌入,从而减轻属性混淆并提高语义精度。我们还引入了一种局部标记-补丁对齐策略,该策略计算精炼文本标记与其最相关的图像补丁之间的相似性得分。通过聚合局部相似性得分,ABE-CLIP计算最终的图像-文本相似性。在多个数据集上的实验表明,ABE-CLIP显著提高了属性-对象绑定性能,甚至超过了需要大量训练的方法。
Summary / 总结
ABE-CLIP is a training-free method that enhances attribute-object binding in CLIP-like models for compositional image-text matching. It uses a Semantic Refinement Mechanism to improve token embeddings and a Local Token-Patch Alignment strategy to compute similarity scores between text and image patches. Experiments show that ABE-CLIP outperforms existing methods and achieves better attribute-object binding performance across multiple datasets.
ABE-CLIP 是一种无需训练的方法,通过细化文本中对象和属性短语的嵌入并使局部文本令牌与图像补丁对齐,来增强组成图像-文本匹配中的属性绑定。这种方法提高了语义精度,并克服了 CLIP 类模型中全局表示的局限性。实验结果表明,ABE-CLIP 在多个数据集上的属性-对象绑定任务中优于现有方法,甚至优于那些需要大量训练的方法。
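The Local Token-Patch Alignment idea in the ABE-CLIP entry above can be sketched as follows. The abstract only says that refined tokens are matched to their most relevant patches and that localized scores are aggregated, so taking the maximum cosine similarity per token and averaging over tokens are assumptions made for illustration. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def local_alignment_score(token_embs: List[Sequence[float]],
                          patch_embs: List[Sequence[float]]) -> float:
    """Match each text token to its best patch, then average the matches.

    Assumption: max-over-patches and mean-over-tokens stand in for the
    paper's selection and aggregation rules.
    """
    if not token_embs or not patch_embs:
        raise ValueError("need at least one token and one patch embedding")
    best_per_token = [max(cosine(t, p) for p in patch_embs) for t in token_embs]
    return sum(best_per_token) / len(best_per_token)


# Usage with toy 2-D embeddings for two tokens and two patches.
tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [1.0, 1.0]]
print(local_alignment_score(tokens, patches))
```

Unlike a single global image-text similarity, this score drops when any one phrase has no well-matching region, which is the failure mode the abstract attributes to global representations.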
Can Synthetic Images Serve as Effective and Efficient Class Prototypes?
Authors: Dianxing Shi, Dingjie Fu, Yuqiao Liu, Jun Wang
First: 2025-12-19T01:39:43+00:00 · Latest: 2025-12-19T01:39:43+00:00
Abstract
Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning visual and textual modalities. This dependency makes preparing high-quality datasets costly and imposes strict accuracy requirements on the annotations. Processing two modalities also requires dual-tower encoders in most models, which prevents them from being lightweight. To address these limitations, we introduce the "Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)" framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. These generated images then serve as visual prototypes: the visual features of real images are extracted and compared with those of the prototypes to make predictions. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input throughout the experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating strong performance in zero-shot classification tasks and establishing a novel paradigm for classification.
中文标题/摘要
标题:合成图像能否作为有效的和高效的类别原型?
视觉-语言模型(VLMs)在零样本图像分类任务中表现出强大的性能。然而,现有的方法,包括对比语言-图像预训练(CLIP),都依赖于标注的图文对来对齐视觉和文本模态。这种依赖性在准备高质量数据集时引入了巨大的成本和准确度要求。同时,处理两种模式的数据还需要大多数模型使用双塔编码器,这也阻碍了它们的轻量化。为了解决这些限制,我们引入了“基于大型语言模型生成的对比语言-图像预训练(LGCLIP)”框架。LGCLIP 利用大型语言模型(LLM)生成类特定的提示,引导扩散模型合成参考图像。之后,这些生成的图像作为视觉原型,从真实图像中提取的视觉特征与这些原型的视觉特征进行比较,以实现相对预测。通过通过 LLM 优化提示生成并仅使用视觉编码器,LGCLIP 保持了轻量化和高效性。至关重要的是,我们的框架在整个实验过程中只需要类别标签作为输入,消除了手动标注图文对和额外预处理的需要。实验结果验证了 LGCLIP 的可行性和高效性,展示了其在零样本分类任务中的出色性能,并建立了分类的新范式。
Summary / 总结
The research aims to address the high cost and accuracy requirements in preparing annotated text-to-image pairs for Vision-Language Models (VLMs) and the need for dual-tower encoders, which hinder their efficiency. The proposed LGCLIP framework uses a Large Language Model to generate class-specific prompts for a diffusion model to synthesize reference images, which serve as visual prototypes. These prototypes are then used to compare with real images to achieve classification. The study shows that LGCLIP is lightweight and efficient, requiring only class labels, and performs well in zero-shot classification tasks.
该研究旨在解决准备带有标注的文本-图像对以对齐视觉和文本模态所需的成本高和准确性要求高的问题。研究引入了LGCLIP框架,该框架利用大型语言模型生成类特定的提示,指导扩散模型合成参考图像。这些图像作为视觉原型,用于比较真实图像与这些原型的视觉特征以实现分类。LGCLIP仅使用视觉编码器,并且只需要类标签作为输入,从而消除了手动标注的图像-文本对和额外预处理的需要。实验结果表明,LGCLIP在零样本分类任务中表现出色,并且建立了一个新的分类范式。
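The prototype comparison described in the LGCLIP entry above amounts to nearest-prototype classification in visual feature space. The sketch below assumes features have already been extracted by some visual encoder and, as a simplification, averages the synthetic-image features of each class into one prototype. The abstract does not say whether prototypes are averaged or kept per image, so that choice is an assumption.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized feature vectors."""
    if not vectors:
        raise ValueError("need at least one feature vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_prototypes(synthetic_features: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """One prototype per class from features of generated reference images (assumed averaging)."""
    return {label: mean_vector(feats) for label, feats in synthetic_features.items()}


def classify(real_feature: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Predict the class whose prototype is most similar to the real image feature."""
    return max(prototypes, key=lambda label: cosine(real_feature, prototypes[label]))


# Usage with toy 2-D features standing in for encoder outputs.
prototypes = build_prototypes({
    "cat": [[1.0, 0.0], [0.8, 0.2]],
    "dog": [[0.0, 1.0], [0.2, 0.8]],
})
print(classify([0.7, 0.3], prototypes))
```

Because both sides of the comparison are image features, no text encoder is needed at prediction time, which matches the single-encoder design the abstract emphasizes.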
Text-Conditioned Background Generation for Editable Multi-Layer Documents
Authors: Taewon Kang, Joseph K J, Chris Tensmeyer, Jihyung Kil, Wanrong Zhu, Ming C. Lin, Vlad I. Morariu
First: 2025-12-19T01:10:24+00:00 · Latest: 2025-12-19T01:10:24+00:00
Abstract
We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a latent masking formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce Automated Readability Optimization (ARO), which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.
中文标题/摘要
标题:基于文本条件的多层文档背景生成
我们提出了一种面向文档的多页编辑和主题连续性的背景生成框架。为了确保文本区域的可读性,我们采用了一种称为潜在遮罩的公式,该公式在扩散空间中柔和地衰减更新,灵感来源于物理学和数值优化中的平滑障碍函数。此外,我们引入了自动可读性优化(ARO),它会自动在文本区域后放置半透明的圆角支撑形状。ARO 确定满足感知对比标准(WCAG 2.2)所需的最小透明度,以确保可读性同时保持美学和谐,无需人工干预。通过一个总结和指令过程来维护多页一致性,每一页被提炼成一个紧凑的表示,递归地指导后续生成。这一设计反映了人类如何通过保留先前的上下文来构建连续性,确保整个文档中的视觉主题能够一致地演变。我们的方法进一步将文档视为一种结构化的组合,在这种组合中,文本、图表和背景可以作为单独的层被保存或再生,从而允许有针对性的背景编辑而不影响可读性。最后,用户提供的提示允许在颜色和纹理方面进行风格调整,平衡了自动一致性与灵活定制。我们的无训练框架生成了视觉上连贯、保留文本并主题一致的文档,将生成建模与自然设计工作流程相结合。
Summary / 总结
The research aims to generate a coherent background for multi-page documents while preserving text readability and thematic continuity. The method uses latent masking to softly attenuate updates in the diffusion space and introduces Automated Readability Optimization (ARO) to automatically place semi-transparent backing shapes behind text regions, ensuring readability and aesthetic harmony. Key findings include the ability to maintain multi-page consistency through a summarization-and-instruction process and the capability to treat documents as structured compositions, allowing targeted background editing without compromising readability. Users can also make stylistic adjustments through prompts, balancing automated consistency with customization. The framework produces visually coherent and thematically aligned documents.
论文提出了一种多页编辑和主题连续性的文档中心背景生成框架。它使用了潜空间掩码形式来柔和地衰减更新,并引入了自动可读性优化(ARO)自动在文本区域后放置半透明的背景形状,以确保可读性同时保持美学和谐。该方法通过总结和指令过程来保持多页一致性,并将文档视为结构化的组成,允许在不损害可读性的情况下进行有针对性的背景编辑。用户提供的提示可以调整颜色和纹理,平衡自动化一致性与灵活定制。该框架生成了视觉连贯、文本保留且主题一致的文档。
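The minimal-opacity search that the ARO description above refers to can be sketched with the WCAG relative-luminance and contrast-ratio formulas. Several simplifications here are assumptions, not the paper's method: the background is reduced to one representative color, compositing is a plain sRGB-space blend, the target is the 4.5:1 ratio WCAG sets for normal-size text at level AA, and the search is a linear scan in 1% steps.

```python
from typing import Optional, Tuple

RGB = Tuple[float, float, float]  # sRGB channels on a 0..255 scale


def _linearize(channel: float) -> float:
    """Convert one sRGB channel to linear light (WCAG relative-luminance definition).

    Older WCAG text used 0.03928 as the threshold; for 8-bit colors the result is the same.
    """
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(color: RGB) -> float:
    r, g, b = color
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(a: RGB, b: RGB) -> float:
    """WCAG contrast ratio, ranging from 1 (identical) to 21 (black on white)."""
    la, lb = relative_luminance(a), relative_luminance(b)
    return (max(la, lb) + 0.05) / (min(la, lb) + 0.05)


def blend(backing: RGB, background: RGB, alpha: float) -> RGB:
    """Composite a backing shape of opacity `alpha` over the background (assumed sRGB blend)."""
    return (
        alpha * backing[0] + (1.0 - alpha) * background[0],
        alpha * backing[1] + (1.0 - alpha) * background[1],
        alpha * backing[2] + (1.0 - alpha) * background[2],
    )


def minimal_opacity(text: RGB, background: RGB, backing: RGB,
                    target: float = 4.5, steps: int = 100) -> Optional[float]:
    """Smallest opacity (on a 1/steps grid) at which text meets the target contrast.

    Returns None if even a fully opaque backing shape is not enough.
    """
    for i in range(steps + 1):
        alpha = i / steps
        if contrast_ratio(text, blend(backing, background, alpha)) >= target:
            return alpha
    return None


# Usage: black text on a black background needs a partly opaque white backing shape.
print(minimal_opacity((0, 0, 0), (0, 0, 0), (255, 255, 255)))
```

A linear scan is used instead of bisection because contrast is not guaranteed to be monotone in opacity: if the blended luminance crosses the text luminance, the ratio first falls and then rises.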
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Authors: Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer
First: 2025-05-21T19:08:38+00:00 · Latest: 2025-12-18T23:33:44+00:00
Comments: Project website with code and data: https://asgaardlab.github.io/videogameqa-bench/
Abstract
With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: https://asgaardlab.github.io/videogameqa-bench/
中文标题/摘要
标题:VideoGameQA-Bench:评估视觉语言模型在视频游戏质量保证中的性能
随着视频游戏现在成为娱乐行业中收入最高的领域,优化游戏开发工作流程已成为该领域持续增长的关键。最近在视觉语言模型(VLMs)方面的进展为自动化和增强游戏开发的各个方面提供了巨大潜力,特别是在质量保证(QA)方面,这是行业中最具劳动密集型且自动化选项有限的过程之一。为了准确评估VLMs在视频游戏QA任务中的性能并确定其在处理实际场景中的有效性,迫切需要标准化基准,而现有基准不足以满足该领域的特定需求。为弥补这一差距,我们引入了VideoGameQA-Bench,这是一个全面的基准,涵盖了广泛的游戏中QA活动,包括视觉单元测试、视觉回归测试、大海捞针任务、错误检测和图像和视频的错误报告生成。代码和数据可在:https://asgaardlab.github.io/videogameqa-bench/获取。