arXiv 论文速递

2026-01-16 03:39
Snapshot: 20260116_0339
SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
Authors: Ruiqi Shen, Chang Liu, Henghui Ding
First: 2026-01-14T18:52:14+00:00 · Latest: 2026-01-14T18:52:14+00:00
Comments: Code: https://github.com/FudanCVL/SAM3-DMS
Abstract
Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group-level collective memory selection is suboptimal for complex multi-object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3-DMS, a training-free decoupled strategy that utilizes fine-grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
中文标题/摘要
标题:SAM3-DMS:多目标视频分割中的解耦记忆选择
Segment Anything 3 (SAM3) 已经建立了一个强大的基础,能够稳健地检测、分割和跟踪视频中的指定目标。然而,在其原始实现中,其组级集体记忆选择对于复杂的多对象场景来说并不理想,因为它基于所有并发目标的平均性能进行同步决策,经常忽视个体可靠性。为此,我们提出了SAM3-DMS,这是一种无需训练的解耦策略,对每个对象单独进行细粒度记忆选择。实验表明,我们的方法实现了稳健的身份保持和跟踪稳定性。值得注意的是,随着目标密度的增加,我们的优势更加明显,为真实开放场景中的同时多目标视频分割奠定了坚实的基础。
Summary / 总结
The research aims to improve the multi-target video segmentation performance of SAM3 by addressing the limitations of its original memory selection mechanism. SAM3-DMS introduces a decoupled memory selection strategy that focuses on individual objects, leading to better identity preservation and tracking stability, especially in scenarios with high target density.
研究旨在通过改进多目标视频分割中的记忆选择机制来解决复杂场景下的问题。SAM3-DMS 提出了一个解耦的记忆选择策略,为每个对象单独选择记忆,从而在高密度目标环境中更好地保持身份一致性和跟踪稳定性。
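The contrast between a synchronized group-level decision and a decoupled per-object decision can be illustrated with a minimal sketch. It is not SAM3's or SAM3-DMS's actual code: the function names, the score dictionary, and the 0.5 threshold are hypothetical, and the abstract does not specify how per-object reliability is scored.

```python
from statistics import mean

def group_level_selection(scores, threshold=0.5):
    """Synchronized decision: all objects share one verdict based on the mean score."""
    keep = mean(scores.values()) >= threshold
    return {obj_id: keep for obj_id in scores}

def decoupled_selection(scores, threshold=0.5):
    """Decoupled decision: each object is judged on its own score."""
    return {obj_id: score >= threshold for obj_id, score in scores.items()}

# Hypothetical per-object reliability scores for one frame.
frame_scores = {"obj_1": 0.9, "obj_2": 0.8, "obj_3": 0.1}

group = group_level_selection(frame_scores)    # the mean hides the weak obj_3
decoupled = decoupled_selection(frame_scores)  # obj_3 is rejected on its own
```

In this toy frame the group mean is 0.6, so the synchronized rule would write all three objects to memory, including the unreliable one. The decoupled rule keeps only the two reliable objects.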
Self-Supervised Animal Identification for Long Videos
Authors: Xuyang Fang, Sion Hannuna, Edwin Simpson, Neill Campbell
First: 2026-01-14T17:53:59+00:00 · Latest: 2026-01-14T17:53:59+00:00
Comments: 11 pages, 1 figure
Abstract
Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy (>97%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available at https://huggingface.co/datasets/tonyFang04/8-calves.
中文标题/摘要
标题:自我监督的长视频动物识别
在长时间视频中识别个体动物对于行为生态学、野生动物监测和畜牧管理至关重要。传统方法需要大量手动标注,而现有的自我监督方法计算量大且不适用于长序列,因为存在内存限制和时间误差传播问题。我们提出了一种高效且自我监督的方法,将动物识别重新定义为全局聚类任务,而不是顺序跟踪问题。我们的方法假设视频中个体数量已知且固定,仅需边界框检测和总数。通过采样帧对、使用冻结的预训练骨干网络,并利用匈牙利算法进行内部批处理伪标签分配,我们的方法在没有身份标签的情况下学习判别特征。我们从视觉-语言模型中适应二元交叉熵损失,使准确率达到97%以上,同时每个批次消耗的GPU内存少于1 GB——比标准对比方法少一个数量级。在具有挑战性的实际数据集(3D-POP鸽子和8头牛喂食视频)上评估,我们的框架匹配或超越了在超过1000个标注帧上训练的监督基线,有效地消除了手动标注瓶颈。这项工作使在消费级硬件上实现高精度动物识别成为可能,具有在资源受限的研究环境中广泛应用的潜力。本文所有代码可在https://huggingface.co/datasets/tonyFang04/8-calves 获取。
Summary / 总结
The paper addresses the challenge of identifying individual animals in long-duration videos, which is crucial for applications such as behavioral ecology and wildlife monitoring. It introduces a self-supervised method that reframes the task as a global clustering problem and requires only bounding box detections and a known, fixed number of individuals per video. The method employs a self-bootstrapping mechanism with the Hungarian algorithm and a Binary Cross Entropy loss, reaching over 97% accuracy while requiring less than 1 GB of GPU memory per batch. On real-world datasets, it matches or surpasses supervised baselines trained on over 1,000 labeled frames, making high-accuracy animal identification practical on consumer-grade hardware.
研究旨在开发一种高效自监督方法,用于识别长视频中的个体动物,解决手动标注和现有方法计算需求高的问题。该方法将动物识别重新定义为全局聚类任务,使用边界框检测和匈牙利算法进行自举标注机制来学习具有区分性的特征,而无需身份标签。该方法在具有挑战性的数据集上达到超过97%的准确率,同时每批消耗不到1 GB的GPU内存,超越了基于超过1000个标注帧的监督基线。这项工作使在消费级硬件上实现高精度动物识别成为可能,具有在资源受限的研究环境中广泛应用的潜力。
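The in-batch pseudo-label assignment can be pictured as a one-to-one matching between detections and identity slots. The sketch below is only an illustration: it swaps in a brute-force search over permutations for the Hungarian algorithm (both return an optimal assignment, but brute force is only feasible for a handful of individuals), and the function name and similarity matrix are hypothetical.

```python
from itertools import permutations

def assign_pseudo_labels(similarity):
    """Return the one-to-one detection-to-identity assignment with maximal total similarity.

    similarity[i][k] is the score between detection i and identity slot k.
    Brute force is O(n!) and stands in for the Hungarian algorithm on tiny inputs.
    """
    n = len(similarity)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(similarity[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)

# Hypothetical similarities for three detections in one frame.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.2, 0.3, 0.6],
]
labels = assign_pseudo_labels(similarity)
```

A per-row argmax would give detections 0 and 1 the same identity here. The one-to-one constraint is what enforces the known, fixed number of individuals.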
LiteEmbed: Adapting CLIP to Rare Classes
Authors: Aishwarya Agarwal, Srikrishna Karanam, Vineet Gandhi
First: 2026-01-14T17:53:11+00:00 · Latest: 2026-01-14T17:53:11+00:00
Comments: 14 pages, 12 figures
Abstract
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
中文标题/摘要
标题:LiteEmbed:将CLIP适应稀有类别
大规模的跨模态模型如CLIP在零样本识别方面表现出色,但在预训练期间很少见到的类别上存在困难,包括新出现的实体和文化特定类别。我们引入了LiteEmbed,这是一种轻量级框架,用于CLIP的少量样本个性化,使得无需重新训练其编码器即可添加新类别。LiteEmbed在CLIP词汇表内对文本嵌入进行子空间引导优化,利用基于PCA的分解来分离粗粒度语义方向和细粒度变化。两个互补的目标,粗粒度对齐和细粒度分离,共同保持全局语义一致性,同时增强视觉相似类别之间的可区分性。一旦优化,嵌入可以即插即用,无缝替代CLIP的原始文本特征,应用于分类、检索、分割和检测任务。广泛的实验表明,与先前方法相比,LiteEmbed在适应未充分代表、稀有或未见过的类别方面取得了显著的改进。
Summary / 总结
The research aims to address the challenge of CLIP's poor performance on rare classes, especially newly emerging entities and culturally specific categories. LiteEmbed is a lightweight framework that optimizes text embeddings within CLIP's vocabulary to enable few-shot personalization without retraining the encoders. The method uses PCA-based decomposition to disentangle coarse semantic directions from fine-grained variations, achieving both global semantic consistency and discriminability. Experiments show significant improvements over previous methods, making LiteEmbed a viable solution for adapting CLIP to underrepresented classes.
研究旨在解决CLIP在罕见类别上的表现不佳问题,特别是新兴实体和文化特定类别。LiteEmbed是一种轻量级框架,通过子空间引导优化CLIP词汇中的文本嵌入,利用PCA将粗粒度语义方向与细粒度变化分离,无需重新训练编码器即可实现少样本个性化。实验表明,该方法明显优于先前的方法,优化后的嵌入可直接用于分类、检索、分割和检测任务。
Image2Garment: Simulation-ready Garment Generation from a Single Image
Authors: Selim Emir Can, Jan Ackermann, Kiyohiro Nakayama, Ruofan Liu, Tong Wu, Yang Zheng, Hugo Bertiche, Menglei Chai, Thabo Beeler, Gordon Wetzstein
First: 2026-01-14T17:47:33+00:00 · Latest: 2026-01-14T17:47:33+00:00
Abstract
Estimating physically accurate, simulation-ready garments from a single image is challenging due to the absence of image-to-physics datasets and the ill-posed nature of this problem. Prior methods either require multi-view capture and expensive differentiable simulation or predict only garment geometry without the material properties required for realistic simulation. We propose a feed-forward framework that sidesteps these limitations by first fine-tuning a vision-language model to infer material composition and fabric attributes from real images, and then training a lightweight predictor that maps these attributes to the corresponding physical fabric parameters using a small dataset of material-physics measurements. Our approach introduces two new datasets (FTAG and T2P) and delivers simulation-ready garments from a single image without iterative optimization. Experiments show that our estimator achieves superior accuracy in material composition estimation and fabric attribute prediction, and by passing them through our physics parameter estimator, we further achieve higher-fidelity simulations compared to state-of-the-art image-to-garment methods.
中文标题/摘要
标题:Image2Garment:从单张图像生成准备用于模拟的服装
从单张图像估计物理准确且准备用于模拟的服装极具挑战性,因为缺乏图像到物理的数据集,且该问题本身是病态的。先前的方法要么需要多视角捕捉和昂贵的可微模拟,要么仅预测服装几何形状而没有用于真实模拟所需的材料属性。我们提出了一种无需这些限制的前馈框架,首先通过微调视觉-语言模型从真实图像中推断材料组成和织物属性,然后使用少量材料-物理测量数据集训练一个轻量级预测器,将这些属性映射到相应的物理织物参数。我们的方法引入了两个新数据集(FTAG和T2P),并在无需迭代优化的情况下从单张图像生成准备用于模拟的服装。实验表明,我们的估计器在材料组成估计和织物属性预测方面具有更高的准确性,并通过我们的物理参数估计器进一步实现了比最先进的图像到服装方法更高的保真度模拟。
Summary / 总结
The research aims to generate simulation-ready garments from a single image, addressing the challenges of dataset scarcity and the ill-posed nature of the problem. The method involves fine-tuning a vision-language model to infer material composition and fabric attributes from images, followed by training a lightweight predictor to map these attributes to physical fabric parameters. Experiments demonstrate superior accuracy in material composition and fabric attribute prediction, leading to higher-fidelity simulations compared to existing methods.
研究旨在从单张图像生成物理上准确、可用于模拟的服装,解决图像到物理数据集和问题的不明确性带来的挑战。方法包括微调一个视觉-语言模型从真实图像中推断材料组成和织物属性,然后使用一个小的数据集训练一个轻量级预测器将这些属性映射到相应的物理织物参数。该方法引入了两个新数据集,并在材料组成估计和织物属性预测方面取得了优于现有方法的准确性,从而实现了更高保真度的模拟。
Head Pursuit: Probing Attention Specialization in Multimodal Transformers
Authors: Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
Venue: NeurIPS 2025 spotlight
First: 2025-10-24T14:41:47+00:00 · Latest: 2026-01-14T15:47:59+00:00
Comments: Accepted at NeurIPS 2025 (spotlight)
Abstract
Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
中文标题/摘要
标题:Head Pursuit:探究多模态Transformer中的注意力头专门化
语言和视觉-语言模型在广泛的任务中表现出色,但其内部机制仍只被部分理解。本研究探讨了文本生成模型中单个注意力头如何专门处理特定的语义或视觉属性。基于已有的可解释性方法,我们从信号处理的视角重新解释了用最终解码层探测中间激活的做法。这使我们能够以有原则的方式分析多个样本,并根据注意力头与目标概念的相关性对其进行排序。我们的结果显示,在单模态和多模态Transformer中,注意力头层面的专门化模式是一致的。值得注意的是,我们发现只需编辑用我们的方法选出的1%的注意力头,就可以可靠地抑制或增强模型输出中的目标概念。我们在问答和毒性缓解等语言任务,以及图像分类和图像描述等视觉-语言任务上验证了我们的方法。我们的发现突显了注意力层中可解释且可控的结构,为理解和编辑大规模生成模型提供了简单的工具。
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Authors: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, Cheng Sun
First: 2026-01-14T15:45:57+00:00 · Latest: 2026-01-14T15:45:57+00:00
Comments: project page: https://peterjohnsonhuang.github.io/openvoxel-pages/
Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open.
中文标题/摘要
标题:OpenVoxel:无需训练的开放词汇3D场景理解中稀疏体素的分组与标注算法
我们提出了OpenVoxel,一种无需训练的算法,用于对开放词汇3D场景理解任务中的稀疏体素进行分组和标注。给定从3D场景多视角图像获得的稀疏体素栅格化(SVR)模型,我们的OpenVoxel能够生成描述场景中不同物体的有意义的分组。此外,通过利用强大的视觉语言模型(VLMs)和多模态大型语言模型(MLLMs),我们的OpenVoxel能够通过为每个分组进行标注来构建一个信息丰富的场景图,从而支持进一步的3D场景理解任务,如开放词汇分割(OVS)或引用表达分割(RES)。与以往方法不同,我们的方法无需训练,也不引入来自CLIP/BERT文本编码器的嵌入。相反,我们直接使用MLLMs进行文本到文本的搜索。通过广泛的实验,我们的方法在复杂引用表达分割(RES)任务中表现出优于近期研究的性能。代码将开源。
Summary / 总结
OpenVoxel is a training-free algorithm designed for grouping and captioning sparse voxels in open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization model from multi-view images, OpenVoxel generates meaningful groups and captions each group using Vision Language Models and Multi-modal Large Language Models, facilitating tasks like open-vocabulary segmentation and referring expression segmentation. Experiments show that OpenVoxel outperforms recent methods, especially in complex referring expression segmentation tasks.
OpenVoxel 是一种无需训练的算法,用于在开放词汇 3D 场景理解任务中对稀疏体素进行分组和描述。给定由多视角图像得到的稀疏体素栅格化(SVR)模型,OpenVoxel 生成有意义的分组并通过视觉语言模型和多模态大型语言模型对其进行描述,从而支持开放词汇分割和指代表达分割等任务。实验表明,OpenVoxel 优于近期方法,在复杂的指代表达分割任务中尤为明显。
Beyond Uniform SVD: Dual-Level Optimization across Columns and Modules for LLM Compression
Authors: Lin Xv, Xian Gao, Ting Li, Yuzhuo Fu
First: 2025-10-22T09:02:37+00:00 · Latest: 2026-01-14T15:28:04+00:00
Abstract
Low-rank decomposition, particularly Singular Value Decomposition (SVD), is a pivotal technique for mitigating the storage and computational demands of Large Language Models (LLMs). However, prevalent SVD-based approaches overlook the critical phenomenon that decomposition errors exhibit significant disparity across different components of the parameter matrix, often leading to suboptimal approximation. Furthermore, existing methods lack a direct metric to evaluate the importance of individual weight matrices. To address these limitations, we propose Duo-SVD (Dual-level Optimization SVD), a novel training-free framework that synergizes optimization at both the column and the module levels. First, Duo-SVD incorporates a Column-Preserving Strategy that explicitly retains columns exhibiting high decomposition errors, while applying low-rank approximation solely to those with lower errors. Second, at the module level, we employ a Module-Adaptive Allocation Strategy that formulates ratio allocation as a global constrained optimization problem based on perturbation-induced model deviation. Extensive experiments demonstrate that Duo-SVD consistently outperforms state-of-the-art SVD-based baselines and structured pruning methods, establishing it as a superior paradigm for efficient LLM compression.
中文标题/摘要
标题:超越均匀SVD:跨列和模块的双层优化以压缩LLM
低秩分解,特别是奇异值分解(SVD),是减轻大型语言模型(LLMs)存储和计算需求的关键技术。然而,现有的SVD基方法忽视了分解误差在参数矩阵的不同组件之间存在显著差异的现象,通常导致次优近似。此外,现有方法缺乏直接评估单个权重矩阵重要性的度量。为解决这些局限性,我们提出了Duo-SVD(双层优化SVD)这一新的无训练框架,该框架在列和模块两个层面协同优化。首先,Duo-SVD引入了一种列保留策略,该策略明确保留具有高分解误差的列,而仅对具有较低误差的列进行低秩近似。其次,在模块层面,我们采用了一种模块自适应分配策略,将比例分配建模为基于扰动引起的模型偏差的全局约束优化问题。广泛的实验表明,Duo-SVD在性能上始终优于最先进的SVD基线和结构化剪枝方法,确立了其作为高效LLM压缩的优越范式。
Summary / 总结
The paper addresses the limitations of uniform Singular Value Decomposition (SVD) in compressing Large Language Models (LLMs) and proposes Duo-SVD, a training-free dual-level optimization framework. At the column level, Duo-SVD retains the columns with high decomposition error and applies low-rank approximation only to the low-error columns. At the module level, it allocates compression ratios by solving a global constrained optimization problem based on perturbation-induced model deviation. Experiments show that Duo-SVD consistently outperforms existing SVD-based baselines and structured pruning methods.
论文针对统一SVD在压缩大型语言模型(LLM)中的局限性,提出了Duo-SVD双层优化框架。Duo-SVD在列和模块两个层面进行优化,保留高误差的列,对低误差的列进行低秩近似。在模块层面,它使用基于扰动引起的模型偏差的全局约束优化方法进行资源分配。实验表明,Duo-SVD在压缩LLM方面优于现有的SVD基线方法和结构化剪枝技术,是更优的压缩方法。
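A small sketch may help picture the column-preserving idea: measure the reconstruction error of each column, keep the worst columns exactly, and use the low-rank approximation for the rest. This is a simplification under stated assumptions, not Duo-SVD's implementation. The helper names are hypothetical, the low-rank approximation `W_hat` is supplied by hand instead of being computed by SVD, and the remaining columns are not re-decomposed after the high-error columns are removed.

```python
def column_errors(W, W_hat):
    """Squared reconstruction error of each column of W under the approximation W_hat."""
    rows, cols = len(W), len(W[0])
    return [sum((W[r][c] - W_hat[r][c]) ** 2 for r in range(rows)) for c in range(cols)]

def preserve_columns(W, W_hat, num_keep):
    """Keep the num_keep highest-error columns exactly; approximate the others."""
    errs = column_errors(W, W_hat)
    ranked = sorted(range(len(errs)), key=lambda c: errs[c], reverse=True)
    keep = sorted(ranked[:num_keep])
    merged = [
        [W[r][c] if c in keep else W_hat[r][c] for c in range(len(W[0]))]
        for r in range(len(W))
    ]
    return keep, merged

# Toy weight matrix and a hand-built rank-1 approximation (outer product of [1, 2] and [1, 2, 0]).
W = [[1.0, 2.0, 5.0], [2.0, 4.0, -3.0]]
W_hat = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0]]

kept_columns, compressed = preserve_columns(W, W_hat, num_keep=1)
```

Here the third column carries all of the approximation error, so preserving that single column recovers the matrix exactly while the other two columns stay rank-1.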
ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models
Authors: Sibo Dong, Ismail Shaheen, Maggie Shen, Rupayan Mallick, Sarah Adel Bargal
Venue: WACV 2026
First: 2025-06-13T19:57:40+00:00 · Latest: 2026-01-14T15:15:58+00:00
Comments: Accepted to WACV 2026
Abstract
Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, ViSTA. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ a Visual Question Answering-based metric TIFA to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model is not only consistent across different frames, but also well-aligned with the narrative text descriptions.
中文标题/摘要
标题:ViSTA:使用多模态适配器的文本到图像扩散模型视觉叙事
文本到图像的扩散模型已经取得了显著的成功,然而生成连贯的图像序列进行视觉叙事仍然具有挑战性。一个关键挑战是如何有效地利用所有先前的文本-图像对,即历史文本-图像对,它们提供了上下文信息,以保持帧间的一致性。现有的自回归方法依赖于所有过去的图像-文本对,但需要大量的训练,而无需训练的主题特定方法可以确保一致性,但缺乏对叙述提示的适应性。为了解决这些限制,我们提出了一种用于文本到图像扩散模型的多模态历史适配器——ViSTA。它包括(1)一个多模态历史融合模块,用于提取相关的历史特征,以及(2)一个历史适配器,用于根据提取的相关特征进行生成。我们还在推理过程中引入了一种显著的历史选择策略,其中选择最显著的历史文本-图像对,从而提高条件生成的质量。此外,我们提出使用基于视觉问答的度量TIFA来评估视觉叙事中的文本-图像对齐,从而提供更针对性和可解释的生成图像评估。在StorySalon和FlintStonesSV数据集上评估,我们提出的ViSTA模型不仅在不同帧之间保持一致,而且与叙述文本描述也很好地对齐。
Summary / 总结
The paper addresses the challenge of generating coherent image sequences for visual storytelling using text-to-image diffusion models. It introduces ViSTA, which uses a multi-modal history adapter to condition image generation on relevant history features, and a salient history selection strategy to improve consistency. The model is evaluated on StorySalon and FlintStonesSV datasets, showing consistent and narrative-aligned image sequences.
研究旨在通过文本到图像的扩散模型提高视觉叙事中图像序列的连贯性。提出了一个多模态历史适配器ViSTA,通过多模态历史融合模块和历史适配器有效利用历史文本-图像对。模型还引入了显著历史选择策略和基于视觉问答的评估指标TIFA,以更好地与叙事文本描述对齐。实验结果表明,ViSTA生成的图像既连贯又与叙事文本描述高度一致。
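Salient history selection at inference time amounts to picking one history text-image pair to condition on. The abstract does not say how salience is scored, so the sketch below assumes, purely for illustration, cosine similarity between an embedding of the current prompt and an embedding of each history pair. All names and vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_salient_history(current_embedding, history_embeddings):
    """Return the index of the history pair most similar to the current prompt."""
    return max(
        range(len(history_embeddings)),
        key=lambda i: cosine(current_embedding, history_embeddings[i]),
    )

# Hypothetical 2-D embeddings: the current prompt and three earlier story frames.
current = [1.0, 0.0]
history = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
best_index = select_salient_history(current, history)
```

Only the selected pair would then be passed to the history adapter. The other frames are ignored for this generation step.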
PrivLEX: Detecting legal concepts in images through Vision-Language Models
Authors: Darya Baranouskaya, Andrea Cavallaro
First: 2026-01-14T12:51:48+00:00 · Latest: 2026-01-14T12:51:48+00:00
Abstract
We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX's ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.
中文标题/摘要
标题:PrivLEX:通过视觉语言模型检测图像中的法律概念
我们提出了PrivLEX,这是一种新颖的图像隐私分类器,其决策基于法律定义的个人数据概念。PrivLEX是首个与法律概念对齐的可解释隐私分类器,利用视觉语言模型(VLM)的识别能力。PrivLEX依赖于零样本VLM概念检测,通过无标签的概念瓶颈模型提供可解释的分类,无需在训练期间使用显式的概念标签。我们展示了PrivLEX能够识别图像中存在的个人数据概念的能力。我们进一步分析了这些概念在图像隐私数据集的人工标注者看来的敏感性。
Summary / 总结
PrivLEX is a novel image privacy classifier that uses Vision-Language Models to detect legally defined personal data concepts, providing interpretable classifications without needing explicit concept labels during training. It demonstrates the ability to identify personal data concepts in images and analyze their sensitivity as perceived by human annotators of image privacy datasets.
PrivLEX 是一种新颖的图像隐私分类器,利用视觉-语言模型检测法律定义的个人数据概念,提供了一种无需在训练过程中使用显式概念标签的可解释方法。它展示了在图像中识别个人数据概念的能力,并分析了这些概念在图像隐私数据集的人类标注者看来的敏感性。
Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models
Authors: Junjie Li, Ziao Wang, Jianghong Ma, Xiaofeng Zhang
First: 2025-09-27T02:57:37+00:00 · Latest: 2026-01-14T12:33:16+00:00
Abstract
Large vision-language models (VLMs) achieve strong benchmark performance, but controlling their behavior through instruction tuning remains difficult. Reducing the size of the instruction-tuning dataset often causes regressions, as heuristic strategies treat models as black boxes and overlook the latent capabilities that govern learning. We introduce Capability-Attributed Data Curation (CADC), a framework that shifts curation from task-specific heuristics to intrinsic capability analysis. CADC discovers intrinsic capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to these capabilities via influence estimation, and curates capability-aware curricula through balanced selection and staged sequencing. This transforms black-box instruction tuning into a controllable, capability-driven process. With as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks. These results validate intrinsic capabilities as the fundamental building blocks of model learning and establish CADC as a principled paradigm for instruction data curation.
中文标题/摘要
标题:揭示固有能力:视觉-语言模型数据整理的新范式
大型视觉-语言模型(VLMs)在基准测试中表现出色,但通过指令调优控制其行为仍然困难。减少指令调优数据集的预算通常会导致性能下降,因为启发式策略将模型视为黑箱,并忽略了控制学习的潜在能力。我们提出了能力归因数据整理(CADC),这是一种将整理从任务特定启发式方法转向固有能力分析的框架。CADC 通过基于梯度的学习轨迹以无监督方式发现固有能力,通过影响估计将训练数据归因于这些能力,并通过平衡选择和分阶段排序整理能力感知课程。这将黑箱指令调优转变为可控、能力驱动的过程。仅使用原始数据的5%,CADC 在多模态基准测试中超过了全数据训练。这些结果验证了固有能力是模型学习的基本构建块,并确立了CADC 作为指令数据整理原则范式的地位。
Summary / 总结
The research aims to make instruction tuning of large vision-language models more controllable by curating data according to the models' intrinsic capabilities rather than task-specific heuristics. The Capability-Attributed Data Curation (CADC) framework discovers these capabilities in an unsupervised manner from gradient-based learning trajectories, attributes training data to them via influence estimation, and builds capability-aware curricula through balanced selection and staged sequencing. Experiments show that with as little as 5% of the original data, CADC surpasses full-data training on multimodal benchmarks, supporting intrinsic capabilities as a basis for instruction data curation.
研究旨在通过解决指令调优的局限性,提高大型视觉-语言模型(VLMs)的可控性。提出的Capability-Attributed Data Curation (CADC)框架通过无监督分析学习轨迹来识别模型的内在能力,将训练数据归因于这些能力,并构建一个能力导向的课程。实验表明,仅使用原始数据的5%,CADC在多模态基准测试中就超越了全量数据训练,这证明了内在能力在模型学习中的重要性,并验证了CADC框架的有效性。
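Balanced selection can be pictured as drawing from each capability in turn instead of taking the globally highest-scoring examples. The sketch below is a guess at the general shape, not CADC's procedure: the capability labels, influence scores, and round-robin rule are all hypothetical.

```python
from collections import defaultdict

def balanced_select(examples, budget):
    """Pick up to `budget` example ids, cycling over capabilities, highest influence first.

    examples: list of (example_id, capability, influence) tuples.
    """
    by_capability = defaultdict(list)
    for example_id, capability, influence in examples:
        by_capability[capability].append((influence, example_id))
    for pool in by_capability.values():
        pool.sort(reverse=True)  # highest influence at the front of each pool

    selected = []
    capabilities = sorted(by_capability)
    while len(selected) < budget and any(by_capability[c] for c in capabilities):
        for capability in capabilities:
            if by_capability[capability] and len(selected) < budget:
                selected.append(by_capability[capability].pop(0)[1])
    return selected

# Hypothetical attributions: three capabilities, one of them over-represented.
examples = [
    ("a", "counting", 0.9),
    ("b", "counting", 0.8),
    ("c", "counting", 0.7),
    ("d", "ocr", 0.2),
    ("e", "spatial", 0.5),
]
subset = balanced_select(examples, budget=3)
```

A plain top-3 by influence would return only counting examples here. The round-robin pass returns one example per capability instead.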
ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
First: 2025-09-30T14:00:41+00:00 · Latest: 2026-01-14T12:30:42+00:00
Abstract
Existing approaches treat action quality assessment and skill proficiency estimation as classification problems, outputting discrete labels without interpretable reasoning. We reformulate this task as generative vision language modeling, introducing ProfVLM, a compact model that jointly predicts proficiency levels and generates expert-like natural language feedback from multi-view videos. ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores. Central to our method is an AttentiveGatedProjector that dynamically fuses and projects multi-view egocentric and exocentric features from a frozen TimeSformer backbone into a language model fine-tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60% compared to existing classification-based methods. By providing natural language critiques aligned with performance levels, this work shows that generative vision-language modeling offers a powerful and efficient paradigm shift for interpretable action quality assessment.
中文标题/摘要
标题:ProfVLM:一种轻量级多视角熟练度评估视频语言模型
现有方法将动作质量评估和技能熟练度估计视为分类问题,输出离散标签而没有可解释的推理。我们将其任务重新定义为生成视觉语言建模,引入了ProfVLM,这是一种紧凑的模型,可以联合预测熟练度水平并从多视角视频中生成类似专家的自然语言反馈。ProfVLM 利用条件语言生成提供可操作的见解以及定量评估分数。我们方法的核心是注意门控投影器,它可以动态地将冻结的TimeSformer主干中的多视角第一人称和第三人称特征融合并投影到一个为反馈生成微调的语言模型中。在EgoExo4D和专家评论数据集上训练,ProfVLM 在参数量减少高达20倍和训练时间减少高达60%的情况下,超越了现有的基于分类的方法。通过提供与表现水平对齐的自然语言批评,这项工作表明生成视觉语言建模为可解释的动作质量评估提供了强大的和高效的范式转变。
Frequency Is What You Need: Considering Word Frequency When Text Masking Benefits Vision-Language Model Pre-training
Authors: Mingliang Liang, Martha Larson
Venue: WACV 2026
First: 2024-12-20T18:51:41+00:00 · Latest: 2026-01-14T11:07:46+00:00
Comments: Accepted by WACV 2026
Abstract
Vision Language Models (VLMs) can be trained more efficiently if training sets can be reduced in size. Recent work has shown the benefits of masking text during VLM training using a variety of strategies (truncation, random masking, block masking and syntax masking) and has reported syntax masking as the top performer. In this paper, we analyze the impact of different text masking strategies on the word frequency in the training data, and show that this impact is connected to model success. This finding motivates Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF), our proposed masking approach, which directly leverages word frequency. Extensive experiments demonstrate the advantages of CLIPF over syntax masking and other existing approaches, particularly when the number of input tokens decreases. We show that not only CLIPF, but also other existing masking strategies, outperform syntax masking when enough epochs are used during training, a finding of practical importance for selecting a text masking method for VLM training. Our code is available online.
中文标题/摘要
标题:频率是你所需要的:考虑文本掩码对视觉语言模型预训练的益处时的词频
视觉语言模型(VLMs)可以通过减小训练集的规模来更高效地进行训练。近期的研究表明,在VLM训练过程中使用各种策略(截断、随机掩码、块掩码和语法掩码)进行文本掩码可以带来益处,并且报告了语法掩码为最佳表现。在本文中,我们分析了不同文本掩码策略对训练数据中词频的影响,并表明这种影响与模型的成功相关。这一发现促使我们提出了我们的掩码方法——对比语言-图像预训练与词频掩码(CLIPF),该方法直接利用词频。广泛的实验表明,CLIPF在语法掩码和其他现有方法上具有优势,尤其是在输入标记数量减少时。我们展示了不仅CLIPF,其他现有的掩码策略在使用足够多的训练周期时也优于语法掩码,这一发现对于选择VLM训练的文本掩码方法具有实际意义。我们的代码已在线发布。
Summary / 总结
This paper analyzes how different text masking strategies change word frequency in the training data of Vision Language Models (VLMs) and shows that this change is connected to model success. The authors propose Contrastive Language-Image Pre-training with Word Frequency Masking (CLIPF), which directly leverages word frequency and outperforms syntax masking and other existing approaches, especially when the number of input tokens decreases. They also find that, given enough training epochs, other existing masking strategies outperform syntax masking as well, which matters in practice when choosing a text masking method for VLM training.
本文研究了不同文本遮掩策略对视觉语言模型(VLM)训练数据中词频的影响,发现词频对模型成功至关重要。作者提出了基于词频的对比语言-图像预训练方法(CLIPF),该方法直接利用词频,并展示了它在输入令牌数量减少时比语法遮掩和其他现有方法更优。广泛的实验表明CLIPF的优势,这表明在训练足够多的周期时,它是选择VLM训练文本遮掩方法的实用选择。
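Frequency-based masking can be sketched as giving each word a keep probability that shrinks as the word becomes more common. The abstract does not state CLIPF's exact rule, so this sketch assumes a word2vec-style subsampling probability, `min(1, sqrt(t / f(w)))`, purely as an illustrative stand-in. The threshold `t`, the function names, and the toy captions are hypothetical.

```python
import math
import random
from collections import Counter

def keep_probabilities(captions, t=0.1):
    """Map each word to a keep probability that decreases with its relative frequency."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    total = sum(counts.values())
    return {word: min(1.0, math.sqrt(t / (n / total))) for word, n in counts.items()}

def mask_caption(caption, probs, rng):
    """Drop each word independently according to its keep probability."""
    kept = [w for w in caption.lower().split() if rng.random() < probs.get(w, 1.0)]
    return " ".join(kept)

captions = ["a photo of a dog", "a photo of a cat", "a red car"]
probs = keep_probabilities(captions)
masked = mask_caption("a photo of a dog", probs, random.Random(0))
```

With these toy captions, the article "a" is kept about half the time, "photo" about 80% of the time, and rare content words such as "dog" are always kept. Shorter inputs therefore lose filler words first.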
Prototypical Contrastive Learning-based CLIP Fine-tuning for Object Re-identification
Authors: Jiachen Li, Xiaojin Gong
First: 2023-10-26T08:12:53+00:00 · Latest: 2026-01-14T09:17:39+00:00
Abstract
This work aims to adapt large-scale pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to enhance the performance of object re-identification (Re-ID) across various supervision settings. Although prompt learning has enabled a recent work named CLIP-ReID to achieve promising performance, the underlying mechanisms and the necessity of prompt learning remain unclear due to the absence of semantic labels in Re-ID tasks. In this work, we first analyze the role of prompt learning in CLIP-ReID and identify its limitations. Based on our investigations, we propose a simple yet effective approach to adapt CLIP for supervised object Re-ID. Our approach directly fine-tunes the image encoder of CLIP using a prototypical contrastive learning (PCL) loss, eliminating the need for prompt learning. Experimental results on both person and vehicle Re-ID datasets demonstrate the competitiveness of our method compared to CLIP-ReID. Furthermore, we extend our PCL-based CLIP fine-tuning approach to unsupervised scenarios, where we achieve state-of-the-art performance. Code is available at https://github.com/RikoLi/PCL-CLIP.
中文标题/摘要
标题:基于原型对比学习的CLIP微调在物体重识别中的应用
本工作旨在将大规模预训练的跨模态模型,如对比语言-图像预训练(CLIP),适应到物体重识别(Re-ID)任务中,以提高其在各种监督设置下的性能。尽管提示学习使最近的一项名为CLIP-ReID的工作取得了令人鼓舞的性能,但由于ReID任务中缺乏语义标签,提示学习的内在机制及其必要性仍不清楚。在本工作中,我们首先分析了CLIP-ReID中提示学习的作用,并识别其局限性。基于我们的研究,我们提出了一种简单而有效的方法,以适应CLIP进行监督物体Re-ID。我们的方法直接使用原型对比学习(PCL)损失微调CLIP的图像编码器,从而消除了提示学习的需要。在人员和车辆Re-ID数据集上的实验结果表明,我们的方法在与CLIP-ReID的比较中具有竞争力。此外,我们将基于PCL的CLIP微调方法扩展到无监督场景,在这些场景中,我们达到了最先进的性能。代码可在https://github.com/RikoLi/PCL-CLIP获取。
Summary / 总结
This work aims to improve object re-identification (Re-ID) using large-scale pre-trained vision-language models like CLIP. It proposes a method that directly fine-tunes the image encoder of CLIP with a prototypical contrastive learning (PCL) loss, avoiding the need for prompt learning. The approach is competitive with CLIP-ReID on supervised person and vehicle Re-ID datasets and achieves state-of-the-art performance when extended to unsupervised scenarios. Code is available on GitHub.
本文旨在使用如CLIP这样的大规模预训练视觉-语言模型来提高物体再识别(Re-ID)的性能。提出了一种直接使用原型对比学习(PCL)损失微调CLIP图像编码器的方法,避免了使用提示学习。该方法在监督和无监督Re-ID数据集上均表现出竞争力,无监督场景下超越了CLIP-ReID。代码可在https://github.com/RikoLi/PCL-CLIP获得。
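A prototypical contrastive loss is commonly written as a cross-entropy over similarities between an image feature and one prototype per identity. The sketch below follows that common formulation and is an assumption, not code from the PCL-CLIP repository: prototypes are normalized class means, similarity is cosine, and the temperature 0.1 and all function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_prototypes(features, labels):
    """One unit-length prototype per identity: the normalized sum of its normalized features."""
    sums = {}
    for feature, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feature))
        for i, x in enumerate(normalize(feature)):
            acc[i] += x
    return {label: normalize(acc) for label, acc in sums.items()}

def pcl_loss(feature, label, prototypes, temperature=0.1):
    """Negative log-softmax of the similarity to the feature's own prototype."""
    f = normalize(feature)
    logits = {
        y: sum(a * b for a, b in zip(f, proto)) / temperature
        for y, proto in prototypes.items()
    }
    top = max(logits.values())
    log_partition = top + math.log(sum(math.exp(v - top) for v in logits.values()))
    return log_partition - logits[label]

# Toy image features for two identities.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
protos = class_prototypes(features, labels)

loss_correct = pcl_loss([1.0, 0.0], 0, protos)  # feature paired with its own identity
loss_wrong = pcl_loss([1.0, 0.0], 1, protos)    # same feature paired with the other identity
```

The loss is small when a feature sits near its own prototype and large otherwise. That gradient signal is what would fine-tune the image encoder without any text prompts.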
PhyRPR: Training-Free Physics-Constrained Video Generation
Authors: Yibo Zhao, Hengjia Li, Xiaofei He, Boxi Wu
First: 2026-01-14T07:41:56+00:00 · Latest: 2026-01-14T07:41:56+00:00
Abstract
Recent diffusion-based video generation models can synthesize visually plausible videos, yet they often struggle to satisfy physical constraints. A key reason is that most existing approaches remain single-stage: they entangle high-level physical understanding with low-level visual synthesis, making it hard to generate content that requires explicit physical reasoning. To address this limitation, we propose a training-free three-stage pipeline, PhyRPR: PhyReason-PhyPlan-PhyRefine, which decouples physical understanding from visual synthesis. Specifically, PhyReason uses a large multimodal model for physical state reasoning and an image generator for keyframe synthesis; PhyPlan deterministically synthesizes a controllable coarse motion scaffold; and PhyRefine injects this scaffold into diffusion sampling via a latent fusion strategy to refine appearance while preserving the planned dynamics. This staged design enables explicit physical control during generation. Extensive experiments under physics constraints show that our method consistently improves physical plausibility and motion controllability.
中文标题/摘要
标题:PhyRPR:无需训练的物理约束视频生成
基于扩散的视频生成模型可以合成视觉上可信的视频,但它们往往难以满足物理约束。主要原因在于大多数现有方法仍为单阶段:它们将高层次的物理理解与低层次的视觉合成交织在一起,使得难以生成需要显式物理推理的内容。为解决这一局限,我们提出了一种无需训练的三阶段流水线——PhyRPR:PhyReason—PhyPlan—PhyRefine,该流水线将物理理解与视觉合成分离。具体而言,PhyReason 使用大型多模态模型进行物理状态推理,并使用图像生成器进行关键帧合成;PhyPlan 确定性地合成可控的粗略运动框架;PhyRefine 通过潜在融合策略将此框架注入扩散采样中,以改进外观同时保留计划的动力学。这种分阶段设计使得在生成过程中可以实现显式的物理控制。在物理约束下的大量实验表明,我们的方法在物理可信度和运动可控性方面始终表现出改进。
Summary / 总结
The motivation for this work is to improve the physical plausibility of generated videos by addressing the limitations of single-stage diffusion-based models. The proposed method, PhyRPR, consists of three stages: PhyReason, PhyPlan, and PhyRefine. PhyReason reasons about physical states and synthesizes keyframes, PhyPlan deterministically creates a coarse motion scaffold, and PhyRefine refines the appearance while preserving the planned dynamics. Experiments show that PhyRPR enhances physical plausibility and motion controllability under physical constraints.
研究旨在通过解决将物理理解与视觉合成纠缠在一起的单阶段模型的局限性,提高视频生成的物理合理性。提出的PhyRPR方法包含三个阶段:PhyReason进行物理状态推理和关键帧合成,PhyPlan生成可控的粗略运动框架,PhyRefine通过潜空间融合策略注入此框架以细化外观并保留计划的动力学。实验表明,PhyRPR在物理约束下提高了物理合理性和运动可控性。
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
Authors: DatologyAI: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
First: 2026-01-05T18:07:51+00:00 · Latest: 2026-01-14T06:10:23+00:00
Abstract
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
中文标题/摘要
标题:DatBench:区分性、忠实性和高效性的VLM评估
经验性评估是指导基础模型研究进展的主要指南。尽管有大量的工作集中在训练前沿的视觉-语言模型(VLMs)上,但对其评估的方法仍处于初级阶段。为了促进其成熟,我们提出了评估应满足的三个标准:(1)忠实于模态和应用,(2)能够区分不同质量的模型,(3)计算效率。通过这一视角,我们识别出一些关键的失败模式,这些模式违反了忠实性和区分性,歪曲了模型的能力:(i)多项选择题奖励猜测,不能很好地反映下游使用场景,并且随着模型的改进而过早饱和;(ii)可以不使用图像直接回答的问题在某些评估中占比高达70%;(iii)错误标记或模棱两可的样本在某些数据集中影响了多达42%的样例。关于效率,评估前沿模型的计算负担已经变得难以承受:据一些说法,近20%的开发计算资源被用于评估本身。我们没有抛弃现有的基准,而是通过转换和筛选来优化它们,以最大化忠实性和区分性。我们发现,将多项选择题转换为生成任务可以揭示出高达35%的能力下降。此外,过滤掉可以不使用图像直接回答的问题和错误标记的样本可以提高区分能力,同时降低计算成本。我们发布了DatBench-Full,这是一个包含33个数据集的清理评估套件,涵盖了九种VLM能力,以及DatBench,这是一个区分性子集,实现了13倍的平均加速(最高可达50倍),同时与原始数据集的区分能力非常接近。我们的工作勾勒出一条路径:随着VLMs规模持续扩大,评估实践仍能既严谨又可持续。
Summary / 总结
The paper proposes DatBench, a new evaluation suite for vision-language models (VLMs) that addresses the need for faithfulness, discriminability, and efficiency in model evaluations. It identifies critical issues in existing benchmarks, such as multiple-choice formats that encourage guessing and mislabeled samples that compromise model differentiation. The authors convert multiple-choice questions to generative tasks and filter out blindly solvable and mislabeled samples to improve the discriminative power and reduce computational costs. The resulting DatBench-Full suite includes 33 datasets, while the DatBench subset offers a 13x average speedup with minimal loss in discriminative power.
论文提出了DatBench,一个新的视觉-语言模型(VLM)评估套件,旨在解决模型评估中忠实性、可区分性和效率的需求。作者指出了现有基准中的关键问题,如鼓励猜测的多项选择题格式和影响模型区分能力的错误标签样本。作者将多项选择题转换为生成任务,并过滤掉盲目可解和错误标签的样本,以提高可区分性和降低计算成本。DatBench-Full套件包含33个数据集,而DatBench子集提供了13倍的平均加速,同时保持了与原始数据集相当的可区分能力。
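The filtering of blindly solvable questions described above can be illustrated with a minimal sketch. It assumes a hypothetical text-only answer function (`answer_without_image`) and an exact-match criterion over a few trials; the abstract does not state the criterion DatBench actually uses, so the names and the rule below are illustrative only.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def filter_blind_solvable(
    samples: List[Sample],
    answer_without_image: Callable[[str], str],  # hypothetical text-only model call
    trials: int = 3,
) -> List[Sample]:
    """Keep only samples that the text-only model never answers correctly."""
    kept = []
    for sample in samples:
        blind_hits = sum(
            normalize(answer_without_image(sample["question"])) == normalize(sample["answer"])
            for _ in range(trials)
        )
        if blind_hits == 0:  # assumption: any blind hit marks the sample as blindly solvable
            kept.append(sample)
    return kept


def text_only_stub(question: str) -> str:
    """Toy stand-in for a model that sees the question but not the image."""
    return "Paris" if "France" in question else "unknown"


samples = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What color is the car in the image?", "answer": "red"},
]
kept = filter_blind_solvable(samples, text_only_stub)
print([s["question"] for s in kept])
```

In this toy run the first question is dropped because it can be answered from world knowledge alone, which is the failure mode the abstract calls "blindly solvable".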
Weakly Supervised Concept Learning with Class-Level Priors for Interpretable Medical Diagnosis
Authors: Md Nahiduzzaman, Steven Korevaar, Alireza Bab-Hadiashar, Ruwan Tennakoon
First: 2025-11-03T00:43:04+00:00 · Latest: 2026-01-14T05:13:20+00:00
Abstract
Human-interpretable predictions are essential for deploying AI in medical imaging, yet most interpretable-by-design (IBD) frameworks require concept annotations for training data, which are costly and impractical to obtain in clinical contexts. Recent attempts to bypass annotation, such as zero-shot vision-language models or concept-generation frameworks, struggle to capture domain-specific medical features, leading to poor reliability. In this paper, we propose a novel Prior-guided Concept Predictor (PCP), a weakly supervised framework that enables concept answer prediction without explicit supervision or reliance on language models. PCP leverages class-level concept priors as weak supervision and incorporates a refinement mechanism with KL divergence and entropy regularization to align predictions with clinical reasoning. Experiments on PH2 (dermoscopy) and WBCatt (hematology) show that PCP improves concept-level F1-score by over 33% compared to zero-shot baselines, while delivering competitive classification performance on four medical datasets (PH2, WBCatt, HAM10000, and CXR4) relative to fully supervised concept bottleneck models (CBMs) and V-IP.
中文标题/摘要
标题:基于类先验的弱监督概念学习以实现可解释的医学诊断
人类可解释的预测对于在医学成像中部署AI至关重要,但大多数可解释设计(IBD)框架需要对训练数据进行概念注释,这在临床环境中成本高昂且不切实际。最近试图绕过注释的方法,如零样本视觉-语言模型或概念生成框架,难以捕捉特定领域的医学特征,导致可靠性较差。在本文中,我们提出了一种新颖的先验引导概念预测器(PCP),这是一种弱监督框架,能够在没有显式监督或依赖语言模型的情况下进行概念答案预测。PCP 利用类级概念先验作为弱监督,并结合KL散度和熵正则化机制来使预测与临床推理保持一致。在PH2(皮肤镜检查)和WBCatt(血液学)上的实验表明,与零样本基线相比,PCP 的概念级F1分数提高了超过33%,同时在四个医学数据集(PH2、WBCatt、HAM10000和CXR4)上的分类性能与完全监督的概念瓶颈模型(CBMs)和V-IP相当。
Summary / 总结
This paper addresses the challenge of generating human-interpretable predictions in medical imaging by proposing a weakly supervised framework called Prior-guided Concept Predictor (PCP). PCP uses class-level concept priors as weak supervision and includes a refinement mechanism to align predictions with clinical reasoning. Experiments on dermoscopy and hematology datasets show that PCP significantly improves concept-level F1-score by over 33% compared to zero-shot baselines while maintaining competitive classification performance on multiple medical datasets.
论文提出了一种弱监督框架Prior-guided Concept Predictor (PCP),通过使用类级概念先验作为弱监督,并结合一种精炼机制来使预测与临床推理保持一致,以解决在医学成像中生成可解释预测的挑战。实验结果显示,PCP在PH2(皮肤镜)和WBCatt(血液学)数据集上的概念级F1分数相比零样本基线提高了超过33%,同时在四个医学数据集(PH2、WBCatt、HAM10000和CXR4)上的分类性能与完全监督的概念瓶颈模型相当。
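One way to read "class-level concept priors with KL divergence and entropy regularization" is sketched below. This is an assumption-heavy illustration, not PCP's published loss: it treats each concept as a binary variable, compares the class prior with the mean prediction over images of that class, and adds a per-image entropy penalty. The function names, the KL direction, and `entropy_weight` are all hypothetical.

```python
import math
from typing import List


def _clamp(p: float, eps: float = 1e-8) -> float:
    """Keep probabilities away from 0 and 1 so the logarithms stay finite."""
    return min(max(p, eps), 1.0 - eps)


def bernoulli_kl(p: float, q: float) -> float:
    """KL(Bernoulli(p) || Bernoulli(q))."""
    p, q = _clamp(p), _clamp(q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def binary_entropy(p: float) -> float:
    """Entropy of a Bernoulli(p) variable in nats."""
    p = _clamp(p)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def prior_guided_loss(
    pred_probs: List[List[float]],  # one row of concept probabilities per image of a single class
    prior: List[float],             # class-level prior probability of each concept
    entropy_weight: float = 0.1,    # hypothetical trade-off weight
) -> float:
    """KL between the class prior and the mean prediction, plus mean per-image entropy."""
    n, k = len(pred_probs), len(prior)
    mean_pred = [sum(row[c] for row in pred_probs) / n for c in range(k)]
    kl = sum(bernoulli_kl(prior[c], mean_pred[c]) for c in range(k))
    entropy = sum(binary_entropy(p) for row in pred_probs for p in row) / (n * k)
    return kl + entropy_weight * entropy


prior = [0.9, 0.1]
confident_and_aligned = [[0.95, 0.05], [0.85, 0.15]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
loss_aligned = prior_guided_loss(confident_and_aligned, prior)
loss_uninformative = prior_guided_loss(uninformative, prior)
print(loss_aligned < loss_uninformative)
```

The KL term pulls the average prediction for a class toward its prior, while the entropy term discourages every image from sitting at an undecided 0.5.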
Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification
Authors: Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, Claudia Pérez-D'Arpino
First: 2025-10-18T00:38:45+00:00 · Latest: 2026-01-14T05:03:39+00:00
Abstract
Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language models. Yet even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution (OOD) scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment. Given a reasoning VLA's intermediate textual plan, our framework samples multiple candidate action sequences from the same model, predicts their outcomes via simulation, and uses a pre-trained Vision-Language Model (VLM) to select the sequence whose outcome best aligns with the VLA's own textual plan. Only executing action sequences that align with the textual reasoning turns our base VLA's natural action diversity from a source of error into a strength, boosting robustness to semantic and visual OOD perturbations and enabling novel behavior composition without costly re-training. We also contribute a reasoning-annotated extension of LIBERO-100 and environment variations tailored for OOD evaluation, and we demonstrate up to a 15% performance gain over prior work on behavior composition tasks, showing that the method scales with compute and data diversity. Project Website at: https://yilin-wu98.github.io/steering-reasoning-vla/
中文标题/摘要
标题:说到做到:通过运行时推理-行动一致性验证引导视觉-语言-行动模型
推理视觉语言行动(VLA)模型通过在低级行动之前生成逐步的文本计划来提高机器人的指令遵循能力,这种方法受到语言模型中链式思考(CoT)推理的启发。然而,即使有了正确的文本计划,生成的行动也可能无法达到计划中的预期结果,尤其是在分布外(OOD)场景中。我们将这一现象形式化为一种缺乏具身CoT忠实性的问题,并引入了一种无需训练的运行时策略引导方法,以实现推理-行动的一致性。给定一个推理VLA的中间文本计划,我们的框架从同一模型中采样多个候选行动序列,通过模拟预测它们的结果,并使用预训练的视觉-语言模型(VLM)选择与VLA自身文本计划最一致的结果序列。只有执行与文本推理一致的行动序列,才能将基础VLA的自然行动多样性从错误的来源转变为优势,增强对语义和视觉OOD扰动的鲁棒性,并在无需昂贵重训练的情况下实现新的行为组合。我们还贡献了一个推理注释扩展的LIBERO-100,环境变化专门用于OOD评估,并在行为组合任务上展示了比先前工作高达15%的性能提升,且随计算能力和数据多样性而扩展。项目网站:https://yilin-wu98.github.io/steering-reasoning-vla/
Summary / 总结
This paper addresses the issue of embodied CoT faithfulness in Vision-Language-Action (VLA) models, which can lead to incorrect actions even with a correct textual plan. The authors propose a runtime policy steering method that aligns reasoning and action by sampling multiple action sequences, predicting their outcomes via simulation, and selecting the sequence that best aligns with the textual plan. This method enhances the model's robustness to out-of-distribution scenarios and enables better behavior composition without re-training. The study also introduces a reasoning-annotated extension of LIBERO-100 for OOD evaluation, showing up to a 15% performance gain over previous work on behavior composition tasks.
该论文解决了Vision-Language-Action (VLA) 模型在执行过程中出现的体化CoT忠实性问题,即使文本计划正确,也可能导致错误的动作。作者提出了一种运行时引导方法,该方法从模型中采样多个动作序列,通过模拟预测它们的结果,并选择与VLA文本计划最一致的序列。该方法增强了模型对分布外场景的鲁棒性,并且无需重新训练即可实现新的行为组合。研究结果表明,在行为组合任务上的性能提高了最多15%。
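The sample, simulate, and select loop described in the abstract can be sketched as a best-of-N procedure. The three callables below (`sample_actions`, `simulate`, `score_alignment`) are hypothetical stand-ins for the VLA policy, the simulator, and the VLM verifier; the toy versions only make the control flow concrete.

```python
import itertools
from typing import Callable, List, Set, Tuple

Actions = List[str]


def steer_actions(
    plan: str,
    sample_actions: Callable[[str], Actions],      # stand-in for sampling from the VLA
    simulate: Callable[[Actions], Set[str]],       # stand-in for outcome prediction
    score_alignment: Callable[[str, Set[str]], float],  # stand-in for the VLM verifier
    num_candidates: int = 8,
) -> Tuple[Actions, float]:
    """Sample candidates, score each predicted outcome against the plan, return the best."""
    candidates = [sample_actions(plan) for _ in range(num_candidates)]
    scores = [score_alignment(plan, simulate(actions)) for actions in candidates]
    best_index = max(range(num_candidates), key=lambda i: scores[i])  # first maximum wins ties
    return candidates[best_index], scores[best_index]


# Toy stand-ins so the sketch runs end to end.
_toy_pool = itertools.cycle([["move", "drop"], ["move", "grasp", "lift"], ["wait"]])


def toy_sample(plan: str) -> Actions:
    return next(_toy_pool)


def toy_simulate(actions: Actions) -> Set[str]:
    return set(actions)  # pretend the outcome is just the set of executed skills


def toy_score(plan: str, outcome: Set[str]) -> float:
    return float(len(set(plan.split()) & outcome))  # count plan words realized in the outcome


best_actions, best_score = steer_actions(
    "grasp then lift the mug", toy_sample, toy_simulate, toy_score, num_candidates=3
)
print(best_actions, best_score)
```

Because selection happens at runtime over samples from the same policy, nothing in the loop requires re-training, which matches the training-free framing in the abstract.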
SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection
Authors: Chenhao Fu, Han Fang, Xiuzheng Zheng, Wenbo Wei, Yonghua Li, Hao Sun, Xuelong Li
First: 2026-01-14T04:42:19+00:00 · Latest: 2026-01-14T04:42:19+00:00
Abstract
Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
中文标题/摘要
标题:SSVP:协同语义视觉提示在工业零样本异常检测中的应用
零样本异常检测(ZSAD)利用视觉语言模型(VLMs)实现无监督的工业检测。然而,现有的ZSAD范式受限于单一的视觉骨干,难以在全局语义泛化与细粒度结构可区分性之间取得平衡。为解决这一问题,我们提出了协同语义视觉提示(SSVP),该方法高效地融合了多种视觉编码,以提升模型的细粒度感知能力。具体而言,SSVP 引入了层次语义视觉协同(HSVS)机制,将 DINOv3 的多尺度结构先验深度整合到 CLIP 语义空间中。随后,视觉条件提示生成器(VCPG)利用跨模态注意力引导动态提示生成,使语言查询能够精确锚定到特定的异常模式。此外,为解决全局评分与局部证据之间的差异,视觉文本异常映射器(VTAM)建立了双门控校准范式。在七个工业基准上的广泛评估验证了我们方法的鲁棒性;SSVP 在 MVTec-AD 上实现了 93.0% 的 Image-AUROC 和 92.2% 的 Pixel-AUROC,显著优于现有的零样本方法。
Summary / 总结
The research aims to enhance zero-shot anomaly detection in industrial settings using Vision-Language Models (VLMs) by addressing the limitations of single visual backbones. The proposed Synergistic Semantic-Visual Prompting (SSVP) method integrates multi-scale structural priors from DINOv3 into the CLIP semantic space and uses a Vision-Conditioned Prompt Generator to dynamically generate prompts based on cross-modal attention. The Visual-Text Anomaly Mapper establishes a dual-gated calibration paradigm to bridge the gap between global scoring and local evidence. Experimental results on seven industrial benchmarks show that SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, outperforming existing zero-shot approaches.
论文提出了SSVP方法,通过整合多种视觉编码来增强零样本异常检测中Vision-Language模型的细粒度感知能力。它引入了HSVS机制,将DINOv3的多尺度结构先验与CLIP的语义空间结合,并使用VCPG进行跨模态注意力驱动的动态提示生成。VTAM解决了全局评分与局部证据之间的差异。在七个工业基准上的实验表明,SSVP表现出色,达到了93.0%的Image-AUROC和92.2%的Pixel-AUROC,显著优于现有零样本方法。
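For readers unfamiliar with the two reported metrics, the sketch below computes AUROC as the probability that a randomly chosen anomalous item scores higher than a randomly chosen normal one. Image-AUROC applies this to one score per image, and Pixel-AUROC applies it to flattened per-pixel scores. The helper names are illustrative, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pixel_auroc(masks: List[List[int]], score_maps: List[List[float]]) -> float:
    """Flatten ground-truth masks and anomaly maps, then compute AUROC over all pixels."""
    flat_labels = [y for row in masks for y in row]
    flat_scores = [s for row in score_maps for s in row]
    return auroc(flat_labels, flat_scores)


# One anomaly score per image: 1 = anomalous, 0 = normal.
image_level = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# A single 2x2 image with one anomalous pixel.
pixel_level = pixel_auroc([[0, 1], [0, 0]], [[0.2, 0.9], [0.1, 0.3]])
print(image_level, pixel_level)
```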
SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
Authors: Lijun Liu, Linwei Chen, Zhishou Zhang, Meng Tian, Hengfu Cui, Ruiyang Li, Zhaocheng Liu, Qiang Ju, Qianxi Li, Hong-Yu Zhou
First: 2026-01-14T04:21:07+00:00 · Latest: 2026-01-14T04:21:07+00:00
Abstract
General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.
中文标题/摘要
标题:SkinFlow:通过动态视觉编码和分阶段强化学习实现开放性皮肤病诊断的高效信息传输
通用大型视觉-语言模型(LVLM),尽管规模庞大,但在皮肤病学中往往因“弥散注意力”——无法从背景噪声中分离出细微的病理病变——而表现不佳。本文挑战了参数扩展是实现医学精确性的唯一途径这一假设。我们引入了SkinFlow框架,将诊断视为视觉信息传输效率的优化。我们的方法利用虚拟宽度动态视觉编码器(DVE)“展开”复杂的病理结构而不进行物理参数扩展,并结合两阶段强化学习策略。该策略首先对明确的医学描述进行对齐(阶段I),然后在受限的语义空间内重建隐含的诊断纹理(阶段II)。此外,我们提出了一种基于临床的评估协议,优先考虑诊断安全性和层次相关性而非严格的标签匹配。实验证据令人信服:我们的7B模型在Fitzpatrick17k基准测试中建立了新的最佳状态,相对于大规模通用模型(如Qwen3VL-235B和GPT-5.2),Top-1准确率提高了12.06%,Top-6准确率提高了28.57%。这些发现表明,优化几何容量和信息流比单纯的参数扩展更能实现优越的诊断推理。
Summary / 总结
The research aims to improve dermatological diagnosis using large vision-language models by addressing the issue of 'diffuse attention'. SkinFlow introduces a framework that optimizes visual information transmission efficiency through a Virtual-Width Dynamic Vision Encoder and a two-stage reinforcement learning strategy. The model achieves a new state-of-the-art on the Fitzpatrick17k benchmark, with significant improvements in Top-1 and Top-6 accuracy over large general-purpose models.
研究旨在通过解决‘弥散注意力’问题,利用大型视觉-语言模型提高皮肤诊断的准确性。SkinFlow提出了一种框架,通过虚拟宽度动态视觉编码器和两阶段强化学习策略来优化视觉信息传输效率。该模型在Fitzpatrick17k基准测试中达到了新的最佳状态,Top-1和Top-6准确性显著优于大型通用模型。
LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models
Authors: Haoyan Gong, Hongbin Liu
First: 2026-01-14T03:32:55+00:00 · Latest: 2026-01-14T03:32:55+00:00
Abstract
Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition, leading to artifact interference and error accumulation. While Vision-Language Models (VLMs) have demonstrated powerful general capabilities, they lack explicit structural modeling for license plate character sequences (e.g., fixed length, specific order). To address this, we propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL. The core innovation lies in the Character-Aware Multimodal Reasoning Module (CMRM), which introduces a set of learnable Character Slot Queries. Through a cross-attention mechanism, these queries actively retrieve fine-grained evidence corresponding to character positions from visual features. Subsequently, we inject these character-aware representations back into the visual tokens via residual modulation, enabling the language model to perform autoregressive generation based on explicit structural priors. Furthermore, combined with the LoRA parameter-efficient fine-tuning strategy, the model achieves domain adaptation while retaining the generalization capabilities of the large model. Extensive experiments on both synthetic and real-world severely degraded datasets demonstrate that our method significantly outperforms existing restoration-recognition combinations and general VLMs, validating the superiority of incorporating structured reasoning into large models for low-quality text recognition tasks.
中文标题/摘要
标题:LP-LLM:通过大型多模态模型实现端到端的现实世界退化车牌文本识别
现实世界的车牌识别(LPR)面临着严重的挑战,如严重的运动模糊、低分辨率和复杂的光照条件。当前占主导地位的“先修复后识别”的两阶段范式存在根本缺陷:图像修复模型的像素级优化目标与字符识别的语义目标不一致,导致伪影干扰和错误累积。尽管视觉语言模型(VLMs)展示了强大的通用能力,但它们缺乏对车牌字符序列的显式结构建模(例如,固定长度、特定顺序)。为了解决这个问题,我们提出了一种基于Qwen3-VL的端到端结构感知多模态推理框架。核心创新在于字符感知多模态推理模块(CMRM),它引入了一组可学习的字符槽查询。通过交叉注意力机制,这些查询能够主动从视觉特征中检索与字符位置对应的细粒度证据。随后,我们通过残差调制将这些字符感知表示注入视觉标记,使语言模型能够在显式结构先验的基础上进行自回归生成。此外,结合LoRA参数高效微调策略,该模型实现了领域适应,同时保留了大型模型的一般化能力。在合成和现实世界严重退化数据集上的广泛实验表明,我们的方法显著优于现有的修复-识别组合和通用VLMs,验证了将结构推理整合到大型模型中对于低质量文本识别任务的优越性。
Summary / 总结
The paper addresses the challenges of real-world license plate text recognition due to degradations like motion blur and low resolution. It proposes LP-LLM, an end-to-end structure-aware multimodal reasoning framework using Qwen3-VL. The framework includes a Character-Aware Multimodal Reasoning Module that uses learnable Character Slot Queries to retrieve fine-grained visual evidence for each character position and injects these representations back into the visual tokens. Experiments show that LP-LLM outperforms existing methods on both synthetic and real-world severely degraded datasets, highlighting the benefits of incorporating structured reasoning into large models for low-quality text recognition.
论文提出了一种端到端的结构感知多模态推理框架LP-LLM,以应对现实世界车牌识别中的挑战,如运动模糊、低分辨率和复杂光照。该框架使用Character-Aware多模态推理模块(CMRM)从视觉特征中检索字符位置的细粒度证据,并将这些表示注入视觉标记中。结合LoRA参数高效微调策略,该模型在合成和真实世界严重退化数据集上的表现优于现有方法,验证了将结构化推理融入大型模型对于低质量文本识别任务的优越性。
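The slot-query mechanism can be pictured with a small pure-Python sketch: each character slot query attends over the visual tokens, and the resulting slot features are added back to the tokens as a residual. How LP-LLM routes slot features back to tokens is not specified in the abstract, so the attention-weighted routing and the `gate` factor below are assumptions, and the queries are fixed toy vectors rather than learned parameters.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def slot_cross_attention(
    slot_queries: List[Vector], visual_tokens: List[Vector]
) -> Tuple[List[Vector], List[Vector]]:
    """Each slot query attends over all visual tokens (scaled dot-product attention)."""
    dim = len(visual_tokens[0])
    scale = math.sqrt(dim)
    attention = [softmax([dot(q, v) / scale for v in visual_tokens]) for q in slot_queries]
    slot_features = [
        [sum(w * v[d] for w, v in zip(weights, visual_tokens)) for d in range(dim)]
        for weights in attention
    ]
    return slot_features, attention


def residual_modulation(
    visual_tokens: List[Vector],
    slot_features: List[Vector],
    attention: List[Vector],
    gate: float = 0.5,  # hypothetical residual strength
) -> List[Vector]:
    """Add slot features back to each token, weighted by how strongly each slot attended to it."""
    modulated = []
    for j, token in enumerate(visual_tokens):
        update = [
            sum(attention[s][j] * slot_features[s][d] for s in range(len(slot_features)))
            for d in range(len(token))
        ]
        modulated.append([t + gate * u for t, u in zip(token, update)])
    return modulated


slot_queries = [[1.0, 0.0], [0.0, 1.0]]                  # two character slots
visual_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]     # three visual tokens
slot_features, attention = slot_cross_attention(slot_queries, visual_tokens)
modulated = residual_modulation(visual_tokens, slot_features, attention)
print(attention)
```

In this toy setup slot 0 attends mostly to token 0 and slot 1 to token 1, which mirrors the idea of each slot retrieving evidence for one character position.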
Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models
Authors: Ruofan Wang, Xin Wang, Yang Yao, Juncheng Li, Xuan Tong, Xingjun Ma
First: 2025-08-03T12:51:47+00:00 · Latest: 2026-01-14T02:53:24+00:00
Abstract
The widespread practice of fine-tuning open-source Vision-Language Models (VLMs) raises a critical security concern: jailbreak vulnerabilities in base models may persist in downstream variants, enabling transferable attacks across fine-tuned systems. To investigate this risk, we propose the Simulated Ensemble Attack (SEA), a grey-box jailbreak framework that assumes full access to the base VLM but no knowledge of the fine-tuned target. SEA enhances transferability via Fine-tuning Trajectory Simulation (FTS), which models bounded parameter variations in the vision encoder, and Targeted Prompt Guidance (TPG), which stabilizes adversarial optimization through auxiliary textual guidance. Experiments on the Qwen2-VL family demonstrate that SEA achieves consistently high transfer success and toxicity rates across diverse fine-tuned variants, including safety-enhanced models, while standard PGD-based image jailbreaks exhibit negligible transferability. Further analysis reveals that fine-tuning primarily induces localized parameter shifts around the base model, explaining why attacks optimized over a simulated neighborhood transfer effectively. We also show that SEA generalizes across different base generations (e.g., Qwen2.5/3-VL), indicating that its effectiveness arises from shared fine-tuning-induced behaviors rather than architecture- or initialization-specific factors.
中文标题/摘要
标题:模拟集成攻击:跨微调视觉语言模型转移破解
开源视觉语言模型(VLMs)的广泛微调引发了严重的安全问题:基础模型中的破解漏洞可能在下游变体中持续存在,从而在微调系统之间实现可转移的攻击。为研究这一风险,我们提出了模拟集成攻击(SEA),这是一种灰盒破解框架,假设完全访问基础VLM但不了解微调目标。SEA通过两个组件增强可转移性:微调轨迹模拟(FTS)对视觉编码器中有界的参数变化进行建模;目标提示引导(TPG)通过辅助文本引导来稳定对抗优化。在Qwen2-VL家族上的实验表明,SEA在各种微调变体中,包括增强安全性的模型中,实现了一致的高转移成功率和毒性率,而标准的基于PGD的图像破解几乎没有可转移性。进一步的分析表明,微调主要导致了基础模型周围局部参数的变化,解释了为什么在模拟邻域上优化的攻击能够有效转移。我们还展示了SEA可以泛化到不同代的基础模型(例如Qwen2.5/3-VL),表明其有效性源于共享的微调诱导行为,而不是特定于架构或初始化的因素。
Summary / 总结
The research investigates the security risk of jailbreak vulnerabilities in fine-tuned Vision-Language Models (VLMs) by proposing the Simulated Ensemble Attack (SEA), a grey-box jailbreak framework. SEA uses Fine-tuning Trajectory Simulation (FTS) and Targeted Prompt Guidance (TPG) to enhance transferability. Experiments show that SEA achieves high transfer success and toxicity rates across various fine-tuned variants, including safety-enhanced models, while standard PGD-based image jailbreaks have negligible transferability.
论文通过提出Simulated Ensemble Attack (SEA) 来研究微调Vision-Language Models (VLMs) 中的 jailbreak 漏洞风险,SEA 使用 Fine-tuning Trajectory Simulation (FTS) 和 Targeted Prompt Guidance (TPG) 来增强攻击的可转移性。实验表明,SEA 在各种微调变体中实现了高转移成功率和毒性率,而标准的基于 PGD 的图像 jailbreak 几乎没有可转移性。分析表明,微调主要导致了基础模型周围的局部参数偏移,解释了 SEA 在转移攻击方面的有效性。
Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation
Authors: Xuetao Li, Wenke Huang, Mang Ye, Jifeng Xuan, Bo Du, Sheng Liu, Miao Li
First: 2026-01-13T23:36:30+00:00 · Latest: 2026-01-13T23:36:30+00:00
Abstract
Humanoid robot manipulation is a crucial research area for executing diverse human-level tasks, involving high-level semantic reasoning and low-level action generation. However, precise scene understanding and sample-efficient learning from human demonstrations remain critical challenges, severely hindering the applicability and generalizability of existing frameworks. This paper presents a novel RGMP-S, Recurrent Geometric-prior Multimodal Policy with Spiking features, facilitating both high-level skill reasoning and data-efficient motion synthesis. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases to enable precise 3D scene understanding within the vision-language model. Specifically, we construct a Long-horizon Geometric Prior Skill Selector that effectively aligns the semantic instructions with spatial constraints, ultimately achieving robust generalization in unseen environments. For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network. We parameterize robot-object interactions via recursive spiking for spatiotemporal consistency, fully distilling long-horizon dynamic features while mitigating the overfitting issue in sparse demonstration scenarios. Extensive experiments are conducted across the Maniskill simulation benchmark and three heterogeneous real-world robotic systems, encompassing a custom-developed humanoid, a desktop manipulator, and a commercial robotic platform. Empirical results substantiate the superiority of our method over state-of-the-art baselines and validate the efficacy of the proposed modules in diverse generalization scenarios. To facilitate reproducibility, the source code and video demonstrations are publicly available at https://github.com/xtli12/RGMP-S.git.
中文标题/摘要
标题:通用几何先验和递归脉冲特征学习在类人机器人操作中的应用
类人机器人操作是执行多样化的人类级任务的关键研究领域,涉及高层语义推理和低层动作生成。然而,精确的场景理解和从人类示范中高效学习仍然是关键挑战,严重阻碍了现有框架的应用性和泛化能力。本文提出了一种新颖的RGMP-S,递归几何先验多模态策略,结合了脉冲特征,促进了高层技能推理和数据高效运动合成。为了将高层推理与物理现实相结合,我们利用轻量级的2D几何归纳偏置,在视觉语言模型中实现精确的3D场景理解。具体而言,我们构建了一个长时距几何先验技能选择器,有效地将语义指令与空间约束对齐,最终在未见过的环境中实现稳健的泛化。为了解决机器人动作生成中的数据效率问题,我们引入了递归自适应脉冲网络。我们通过递归脉冲参数化机器人-物体交互,以实现时空一致性,全面提取长时距动态特征,同时在稀疏示范场景中缓解过拟合问题。我们在Maniskill仿真基准和三个异构的现实世界机器人系统中进行了广泛的实验,包括一个自定义开发的类人机器人、一个桌面操作器和一个商用机器人平台。实验证明了我们方法在多种泛化场景中的优越性,并验证了所提模块的有效性。为了便于可重复性,源代码和视频演示已公开发布在https://github.com/xtli12/RGMP-S.git。
Summary / 总结
This paper addresses the challenges of humanoid robot manipulation by proposing RGMP-S, a Recurrent Geometric-prior Multimodal Policy with Spiking features. It leverages 2D geometric inductive biases to enable precise 3D scene understanding and introduces a Long-horizon Geometric Prior Skill Selector for aligning semantic instructions with spatial constraints. Additionally, it uses a Recursive Adaptive Spiking Network to enhance data efficiency in robotic action generation. Experiments on various robotic systems demonstrate the method's superior performance compared to existing approaches and validate its effectiveness in diverse scenarios.
本文提出了一种名为RGMP-S的递归几何先验多模态策略,结合了轻量级的2D几何归纳偏置以实现精确的3D场景理解,并引入了长时几何先验技能选择器来对齐语义指令与空间约束。此外,还引入了递归自适应脉冲网络以提高机器人动作生成的数据效率。在多种机器人系统上的实验表明,该方法在性能和泛化能力方面优于现有方法。
Privacy-Preserving in Connected and Autonomous Vehicles Through Vision to Text Transformation
Authors: Abdolazim Rezaei, Mehdi Sookhak, Ahmad Patooghy, Shahab S. Band, Amir Mosavi
First: 2025-06-18T20:02:24+00:00 · Latest: 2026-01-13T20:26:51+00:00
Abstract
Intelligent Transportation Systems (ITS) rely on a variety of devices that frequently process privacy-sensitive data. Roadside units are important because they use AI-equipped (AIE) cameras to detect traffic violations in Connected and Autonomous Vehicles (CAV). However, although the interior of a vehicle is generally considered a private space, the privacy risks associated with captured imagery remain a major concern, as such data can be misused for identity theft, profiling, or unauthorized commercial purposes. Methods like face blurring reduce privacy risks; however, individuals' privacy can still be compromised. This paper introduces a novel privacy-preserving framework that leverages feedback-based reinforcement learning (RL) and vision-language models (VLMs) to protect sensitive visual information captured by AIE cameras. The proposed approach transforms images into textual descriptions that preserve the main scene details while protecting privacy. A hierarchical RL strategy is employed to iteratively refine the generated text, enhancing both semantic accuracy and privacy. Unlike prior captioning-based methods, our model incorporates an iterative reinforcement-learning cycle with external knowledge feedback that progressively refines privacy-aware text. In addition to qualitative textual metric evaluations, the privacy-based metrics demonstrate significant improvements in privacy preservation: the SSIM, PSNR, MSE, and SRRA values obtained with the proposed method on two different datasets outperform those of other methods.
中文标题/摘要
标题:通过视觉到文本转换在连接和自主车辆中保护隐私
智能交通系统(ITS)依赖多种设备,这些设备经常处理隐私敏感数据。路边单元非常重要,因为它们使用配备有AI的摄像头来检测连接和自主车辆(CAV)中的交通违规行为。然而,尽管车辆内部通常被视为私人空间,但捕获的图像数据与隐私相关的风险仍然是一个重大问题,因为这些数据可能被滥用以进行身份盗窃、画像或未经授权的商业用途。诸如面部模糊化等方法可以降低隐私风险,但个人的隐私仍然可能被侵犯。本文提出了一种新颖的隐私保护框架,该框架利用基于反馈的强化学习(RL)和视觉语言模型(VLMs)来保护由AIE摄像头捕获的敏感视觉信息。所提出的想法通过一种创新的方法将图像转换为文本描述,同时保留主要场景细节并保护隐私。采用分层RL策略逐步细化生成的文本,提高语义准确性和隐私性。与基于描述的方法不同,我们的模型结合了具有外部知识反馈的迭代强化学习循环,逐步细化隐私意识文本。除了定性的文本度量评估外,基于隐私的度量指标在隐私保护方面也显示出显著改进,其中使用所提出的方法在两个不同数据集上获得的SSIM、PSNR、MSE和SRRA值优于其他方法。
Summary / 总结
This paper addresses the privacy concerns in Intelligent Transportation Systems, particularly in the context of Connected and Autonomous Vehicles. It proposes a novel privacy-preserving framework that uses feedback-based reinforcement learning and vision-language models to transform images into text descriptions while preserving the main scene details. The hierarchical RL strategy iteratively refines the text to enhance both semantic accuracy and privacy. Experimental results show significant improvements in privacy preservation metrics compared to other methods, as evidenced by SSIM, PSNR, MSE, and SRRA values on two different datasets.
本文提出了一种新颖的框架,利用基于反馈的强化学习和视觉语言模型将AI摄像头捕获的图像转换为文本描述,同时保留主要场景细节,以解决智能交通系统中的隐私问题。该方法采用分层RL策略迭代细化文本,以提高语义准确性和隐私性。实验结果表明,与其它技术相比,所提出的方法在两个不同数据集上的SSIM、PSNR、MSE和SRRA等隐私保护指标上表现更优。
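Two of the metrics named above, MSE and PSNR, have standard definitions that are easy to state in code: MSE is the mean squared pixel difference, and PSNR is 10·log10(MAX² / MSE). The sketch below uses flat lists of pixel intensities with an assumed 8-bit range; it does not cover SSIM or SRRA, and it says nothing about how the paper pairs images for comparison.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a: Sequence[float], b: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the inputs are identical."""
    error = mse(a, b)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


black = [0.0, 0.0, 0.0, 0.0]
white = [255.0, 255.0, 255.0, 255.0]
worst_case_mse = mse(black, white)
worst_case_psnr = psnr(black, white)
print(worst_case_mse, worst_case_psnr)
```

Higher PSNR and lower MSE both mean the two images are more similar at the pixel level.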
Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
Authors: Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li
First: 2026-01-13T19:49:58+00:00 · Latest: 2026-01-13T19:49:58+00:00
Abstract
Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi-step "imagined" trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks.
中文标题/摘要
标题:想象然后规划:基于世界模型的自适应前瞻学习
世界模型的最新进展表明,它们有望模拟环境状态的未来动态,使代理能够在无需访问真实环境的情况下进行推理和行动。当前的方法主要执行单步或固定时距的展开,而其对复杂任务规划的潜力尚未得到充分利用。我们提出了想象然后规划(ITP),这是一种通过前瞻想象进行代理学习的统一框架,其中代理的策略模型与学习到的世界模型交互,生成多步“想象”轨迹。由于想象的时距可能因任务和阶段而异,我们引入了一种新颖的自适应前瞻机制,在最终目标和任务进展之间进行权衡。由此产生的想象轨迹提供了关于未来后果的丰富信号,如实现的进展和潜在的冲突,这些信号与当前观察结果融合,形成部分可观测和可想象的马尔可夫决策过程,以指导策略学习。我们以无需训练和强化训练两种变体实例化了ITP。在代表性代理基准上的广泛实验表明,ITP显著优于竞争性基线。进一步的分析验证了我们的自适应前瞻机制大大增强了代理的推理能力,为解决更广泛、更复杂的任务提供了有价值的见解。
Summary / 总结
The paper introduces Imagine-then-Plan (ITP), a framework that uses world models to enable agents to reason and plan for complex tasks. The method involves an adaptive lookahead mechanism that allows for variable imagination horizons, providing rich signals about future outcomes. Experiments show that ITP outperforms existing methods across various benchmarks, indicating improved reasoning capabilities in agents.
论文提出了Imagine-then-Plan (ITP) 框架,该框架利用世界模型进行多步前瞻想象,使代理能够通过与学习到的世界模型交互来推理复杂任务,并生成关于未来结果的丰富信号。实验表明,ITP 在各种基准测试中显著优于现有方法,而自适应前瞻机制显著增强了代理处理复杂任务的推理能力。
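The imagination loop described in this entry (a policy acting inside a learned world model, with a horizon that adapts to task progress) can be sketched in Python as follows. This is a toy illustration, not the authors' method: the stopping rule (stop at the goal, or when the progress gain falls below `min_gain`), the function names, and the 1-D task are all assumptions made for the example.

```python
from typing import Any, Callable, List, Tuple

# (action, imagined next state, progress estimate in [0, 1])
Step = Tuple[Any, Any, float]


def imagine(
    state: Any,
    policy: Callable[[Any], Any],
    world_model: Callable[[Any, Any], Any],
    progress: Callable[[Any], float],
    max_horizon: int = 8,
    min_gain: float = 0.05,
) -> List[Step]:
    """Roll the policy forward inside the world model (no real environment).

    Hypothetical adaptive horizon: stop once the goal is reached or once
    a step no longer improves the progress estimate by at least min_gain.
    """
    trajectory: List[Step] = []
    previous = progress(state)
    for _ in range(max_horizon):
        action = policy(state)
        state = world_model(state, action)  # imagined transition
        current = progress(state)
        trajectory.append((action, state, current))
        if current >= 1.0 or current - previous < min_gain:
            break
        previous = current
    return trajectory


# Toy 1-D task (hypothetical): walk from position 0 to GOAL.
GOAL = 3


def toy_policy(state: int) -> int:
    return 1  # always step right


def toy_world_model(state: int, action: int) -> int:
    return state + action


def toy_progress(state: int) -> float:
    return min(state / GOAL, 1.0)


rollout = imagine(0, toy_policy, toy_world_model, toy_progress)
print(len(rollout))  # 3 imagined steps, ending at the goal
```

In this sketch the imagined trajectory would then be combined with the current observation before the policy chooses a real action; how that fusion is done is not specified in the abstract and is left out here.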
Parallel Context-of-Experts Decoding for Retrieval Augmented Generation
Authors: Giulio Corallo, Paolo Papotti
First: 2026-01-13T15:46:59+00:00 · Latest: 2026-01-13T15:46:59+00:00
Abstract
Retrieval Augmented Generation faces a trade-off: concatenating documents in a long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately offers speed but breaks cross-document interaction. We propose Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding. Pced treats retrieved documents as isolated "experts", synchronizing their predictions via a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model prior. This approach recovers cross-document reasoning capabilities without constructing a shared attention across documents.
中文标题/摘要
标题:面向检索增强生成的并行专家上下文解码
检索增强生成面临权衡:在长提示中拼接文档可以实现多文档推理,但会造成预填充瓶颈,而单独编码文档KV缓存则可提高速度但会破坏跨文档交互。我们提出了一种无需训练的并行专家上下文解码(Pced)框架,该框架将证据聚合从注意力机制转移到解码阶段。Pced 将检索到的文档视为孤立的“专家”,通过一种新颖的检索感知对比解码规则同步它们的预测,该规则将专家的 logits 与模型先验进行权衡。这种方法可以在不构建跨文档共享注意力的情况下恢复跨文档推理能力。
Summary / 总结
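The paper addresses a trade-off in Retrieval Augmented Generation (RAG): concatenating documents into one long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding per-document KV caches separately is fast but breaks cross-document interaction. It proposes Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding process. Pced treats retrieved documents as isolated experts and synchronizes their predictions using a retrieval-aware contrastive decoding rule that weighs expert logits against the model prior, thereby recovering cross-document reasoning capabilities without constructing shared attention across documents.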
论文提出了一种名为Parallel Context-of-Experts Decoding (Pced)的方法,以解决Retrieval Augmented Generation (RAG)中的挑战。Pced将检索到的文档视为独立的专家,并通过一种检索感知的对比解码规则同步它们的预测,从而在无需构建跨文档共享注意力机制的情况下实现跨文档推理。
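One way to read the decoding rule in this entry is sketched below in Python. It is an illustration under stated assumptions, not the paper's formula: the abstract only says that expert logits are weighed against the model prior in a retrieval-aware way, so the scoring expression, the `alpha` weight, the max-over-experts aggregation, and every function name here are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_expert_step(
    expert_logits: Sequence[Sequence[float]],
    prior_logits: Sequence[float],
    retrieval_scores: Sequence[float],
    alpha: float = 1.0,
) -> int:
    """Pick the next token id from per-document "experts".

    Each expert is the model conditioned on ONE retrieved document; the
    prior is the model with no document. Assumed scoring rule:
        (1 + alpha) * log p_expert - alpha * log p_prior + log retrieval weight
    followed by a max over experts for every token.
    """
    prior_lp = log_softmax(prior_logits)
    retrieval_lw = log_softmax(retrieval_scores)  # normalised log-weights
    vocab = len(prior_logits)
    best = [-math.inf] * vocab
    for logits, lw in zip(expert_logits, retrieval_lw):
        lp = log_softmax(logits)
        for v in range(vocab):
            score = (1.0 + alpha) * lp[v] - alpha * prior_lp[v] + lw
            best[v] = max(best[v], score)
    return max(range(vocab), key=best.__getitem__)


# Toy vocabulary of 3 tokens. The prior prefers token 0; expert B has
# read a document that supports token 1, expert A adds nothing new.
prior = [2.0, 0.0, 0.0]
expert_a = [2.0, 0.0, 0.0]
expert_b = [1.0, 3.0, 0.0]
token = contrastive_expert_step([expert_a, expert_b], prior, [0.0, 0.0])
print(token)  # 1
```

The point of the sketch is that each document is encoded on its own, so no attention is shared across documents; evidence is combined only at the logit level, one decoding step at a time.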
SoC: Semantic Orthogonal Calibration for Test-Time Prompt Tuning
Authors: Leo Fillioux, Omprakash Chakraborty, Ismail Ben Ayed, Paul-Henry Cournède, Stergios Christodoulidis, Maria Vakalopoulou, Jose Dolz
First: 2026-01-13T15:00:03+00:00 · Latest: 2026-01-13T15:00:03+00:00
Abstract
With the increasing adoption of vision-language models (VLMs) in critical decision-making systems such as healthcare or autonomous driving, the calibration of their uncertainty estimates becomes paramount. Yet, this dimension has been largely underexplored in the VLM test-time prompt-tuning (TPT) literature, which has predominantly focused on improving their discriminative performance. Recent state-of-the-art advocates for enforcing full orthogonality over pairs of text prompt embeddings to enhance separability, and therefore calibration. Nevertheless, as we theoretically show in this work, the inherent gradients from fully orthogonal constraints will strongly push semantically related classes away, ultimately making the model overconfident. Based on our findings, we propose Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enforces smooth prototype separation while preserving semantic proximity, thereby improving calibration compared to prior orthogonality-based approaches. Across a comprehensive empirical validation, we demonstrate that SoC consistently improves calibration performance, while also maintaining competitive discriminative capabilities.
中文标题/摘要
标题:SoC:测试时提示调优的语义正交校准
随着视觉-语言模型(VLMs)在医疗保健或自动驾驶等关键决策系统中的广泛应用,其不确定性估计的校准变得至关重要。然而,在VLM测试时提示调优(TPT)文献中,这一维度尚未得到充分探索,该文献主要集中在提高其判别性能上。最近的先进方法提倡对文本提示嵌入成对施加完全正交约束以增强可分性,从而提高校准。然而,如我们在本文中理论上所证明的,完全正交约束的固有梯度将强烈地将语义相关类推得更远,最终使模型过于自信。基于我们的发现,我们提出了语义正交校准(SoC),这是一种基于Huber的正则化器,它在保持语义邻近性的同时强制平滑原型分离,从而相比先前基于正交性的方法提高校准性能。在全面的实证验证中,我们证明SoC在持续提高校准性能的同时,也保持了有竞争力的判别能力。
Summary / 总结
This paper addresses the calibration of uncertainty estimates in vision-language models (VLMs) for critical applications like healthcare and autonomous driving. It introduces Semantic Orthogonal Calibration (SoC), a Huber-based regularizer that enhances calibration by preserving semantic proximity while improving prototype separation. Experiments show that SoC consistently improves calibration performance without sacrificing discriminative capabilities compared to previous orthogonality-based methods.
研究旨在提高用于医疗和自动驾驶等关键系统的视觉语言模型(VLMs)的不确定性估计的校准。研究引入了语义正交校准(SoC),这是一种基于Huber的正则化器,通过保持语义接近性同时促进平滑原型分离来提高校准,解决了全正交约束的局限性。实验表明,SoC持续提高了校准性能,同时保持了有竞争力的判别能力。
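The contrast between a full-orthogonality penalty and a Huber-style one can be illustrated with a short Python sketch. It is not the SoC loss itself: the abstract does not give the formula, so applying the Huber function to pairwise cosine similarities, the `delta` value, and all names below are assumptions. What the sketch does show is a standard property of the Huber function: its gradient is capped at `delta`, whereas the gradient of a squared penalty grows with the similarity, so the most similar (semantically related) pairs are pushed apart hardest.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def huber(x: float, delta: float) -> float:
    """Quadratic for |x| <= delta, linear beyond it."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)


def huber_grad(x: float, delta: float) -> float:
    """Derivative of huber(): equals x inside the band, clipped to +/-delta outside."""
    return x if abs(x) <= delta else math.copysign(delta, x)


def pairwise_penalties(
    prototypes: List[Sequence[float]], delta: float = 0.3
) -> Tuple[float, float]:
    """Return (squared penalty, Huber penalty) summed over prototype pairs."""
    full, smooth = 0.0, 0.0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            s = cosine(prototypes[i], prototypes[j])
            full += 0.5 * s * s       # gradient w.r.t. s is s itself
            smooth += huber(s, delta)  # gradient w.r.t. s is at most delta
    return full, smooth


# A closely related pair (cosine 0.9) versus a weakly related one (0.2):
print(huber_grad(0.9, 0.3))  # 0.3, while the squared penalty's gradient is 0.9
print(huber_grad(0.2, 0.3))  # 0.2, identical to the squared penalty here
```

Under these assumptions, weakly related classes are treated the same by both penalties, while strongly related classes feel a bounded push instead of one proportional to their similarity.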
Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification
Authors: Tayyab Rehman, Giovanni De Gasperis, Aly Shmahell
First: 2026-01-08T11:31:47+00:00 · Latest: 2026-01-13T14:40:15+00:00
Comments: Author email changed, Acknowledgement changes
Abstract
Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
中文标题/摘要
标题:通过视觉语言模型和基于嵌入的分类实现监控系统中的级联多代理异常检测
在动态视觉环境中实现智能异常检测需要在实时性能与语义可解释性之间取得平衡。传统方法仅解决这一挑战的部分方面。基于重建的模型捕捉低级偏差但缺乏上下文推理,目标检测器提供速度但语义有限,而大型视觉语言系统则以高昂的计算成本提供可解释性。本研究引入了一种级联多代理框架,将这些互补的范式统一成一个连贯且可解释的架构。早期模块执行重建门控过滤和对象级评估,而更高层次的推理代理则根据需要选择性地被调用来解释语义含糊的事件。该系统采用自适应升级阈值和发布-订阅通信架构,实现异步协调和在异构硬件上的可扩展部署。在大规模监控数据上的广泛评估表明,所提出的级联架构与直接视觉语言推理相比,延迟降低至约三分之一,同时保持了高感知保真度(PSNR = 38.3 dB,SSIM = 0.965)和一致的语义标签。该框架超越了传统的检测流水线,结合了早期退出的效率、自适应多代理推理和可解释的异常归因,为可扩展的智能视觉监控奠定了可复现和节能的基础。
Summary / 总结
This work addresses the challenge of intelligent anomaly detection in dynamic visual environments by introducing a cascading multi-agent framework that combines reconstruction-gated filtering, object-level assessment, and selective high-level reasoning. The system achieves a threefold reduction in latency compared to direct vision-language inference while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. It employs adaptive escalation thresholds and a publish-subscribe communication backbone for asynchronous coordination and scalable deployment across heterogeneous hardware. The framework integrates early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, providing a reproducible and energy-efficient foundation for intelligent visual monitoring.
本文通过引入一个级联多代理框架,将基于重建的模型、物体检测器和大型视觉语言系统结合起来,解决了动态视觉环境中的智能异常检测挑战。早期模块执行过滤和物体级评估,而高级代理解释模糊事件。与直接视觉语言推理相比,该系统将延迟降低至约三分之一,同时保持了高感知保真度和一致的语义标签。该框架结合了早期退出的效率、自适应多代理推理和可解释的异常归因,为智能视觉监控提供了可扩展且能效高的基础。
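The early-exit cascade described in this entry can be sketched in Python as below. This is a minimal illustration, not the paper's system: the three stage callables, the fixed threshold values, and the `Decision` type are hypothetical, and the adaptive threshold updates and publish-subscribe messaging mentioned in the abstract are left out.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class Decision:
    label: str  # "normal", "anomaly", or a description from the VLM stage
    stage: str  # which stage produced the decision


def cascade(
    frame: Any,
    recon_error: Callable[[Any], float],
    detector_score: Callable[[Any], float],
    vlm_describe: Callable[[Any], str],
    recon_threshold: float = 0.1,
    ambiguous_band: Tuple[float, float] = (0.3, 0.7),
) -> Decision:
    """Escalate a frame only as far as needed.

    Stage 1: a cheap reconstruction gate lets clearly normal frames exit early.
    Stage 2: an object-level score settles confident cases.
    Stage 3: the expensive vision-language model is called only for frames
             whose detector score falls inside the ambiguous band.
    """
    if recon_error(frame) < recon_threshold:
        return Decision("normal", "reconstruction")
    score = detector_score(frame)
    low, high = ambiguous_band
    if score >= high:
        return Decision("anomaly", "detector")
    if score <= low:
        return Decision("normal", "detector")
    return Decision(vlm_describe(frame), "vlm")


# Stub stages reading precomputed values from toy frames (hypothetical data).
vlm_calls = []


def stub_vlm(frame: dict) -> str:
    vlm_calls.append(frame)
    return "anomaly: unattended object"


frames = [
    {"err": 0.02, "score": 0.9},  # exits at stage 1
    {"err": 0.50, "score": 0.9},  # settled by the detector
    {"err": 0.50, "score": 0.5},  # ambiguous, escalated to the VLM
]
results = [
    cascade(f, lambda x: x["err"], lambda x: x["score"], stub_vlm) for f in frames
]
print([r.stage for r in results])  # ['reconstruction', 'detector', 'vlm']
```

Because only the ambiguous frame reaches the most expensive stage, the average cost per frame drops, which is the mechanism behind the latency reduction the abstract reports.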
Latent Reconstruction from Generated Data for Multimodal Misinformation Detection
Authors: Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
First: 2025-04-08T13:16:48+00:00 · Latest: 2026-01-13T14:25:49+00:00
Abstract
Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. Due to the scarcity of large-scale annotated datasets for multimodal misinformation detection (MMD), recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic, unrealistic examples, which limits their utility as training examples. To address this, we introduce "MisCaption This!", a framework for generating high-fidelity synthetic miscaptioned datasets through Adversarial Prompting of Vision-Language Models (VLMs). Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a Transformer-based network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" data generalize better to real-world misinformation, while LAMAR achieves new state-of-the-art on NewsCLIPpings, VERITE, and the newly introduced VERITE 24/25 benchmark; highlighting the efficacy of VLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
中文标题/摘要
标题:基于生成数据的潜在重建用于多模态虚假信息检测
多模态虚假信息,如错误配文图像,其中的说明文歪曲了图像的来源、背景或意义,在数字时代构成了日益严峻的挑战。由于缺乏大规模标注的多模态虚假信息检测 (MMD) 数据集,最近的方法依赖于通过脱离上下文的配对或命名实体操作(例如更改名称、日期或地点)生成的合成训练数据。然而,这些方法往往产生简单且不现实的例子,限制了它们作为训练示例的实用性。为了解决这一问题,我们引入了“MisCaption This!”框架,通过对抗提示视觉-语言模型 (VLM) 生成高保真度的合成错误配文数据集。此外,我们还引入了“潜在多模态重建”(LAMAR),这是一种基于 Transformer 的网络,训练其重建真实说明文的嵌入,提供强大的辅助信号以指导检测。我们探索了各种训练策略(端到端 vs. 大规模预训练)和集成机制(直接、掩码、门控和注意力)。广泛的实验表明,使用“MisCaption This!”数据训练的模型能更好地泛化到真实世界的虚假信息,而 LAMAR 在 NewsCLIPpings、VERITE 和新引入的 VERITE 24/25 基准测试中达到了新的最佳水平;突显了 VLM 生成数据和基于重建的网络在推进 MMD 方面的有效性。我们的代码可在 https://github.com/stevejpapad/miscaptioned-image-reconstruction 获取
Summary / 总结
This paper addresses the challenge of detecting multimodal misinformation, such as miscaptioned images, by introducing 'MisCaption This!', a framework for generating high-fidelity synthetic datasets through adversarial prompting of vision-language models. The authors also propose 'Latent Multimodal Reconstruction' (LAMAR), a Transformer-based network trained to reconstruct embeddings of truthful captions, which serves as a strong auxiliary signal for misinformation detection. Experimental results demonstrate that models trained on 'MisCaption This!' data generalize better to real-world misinformation, and LAMAR achieves state-of-the-art performance on several benchmarks, highlighting the effectiveness of VLM-generated data and reconstruction-based networks for multimodal misinformation detection.
该论文通过引入‘MisCaption This!’框架,利用视觉-语言模型的对抗提示生成高保真度的合成数据集,以应对多模态虚假信息(如错误配文图像)的检测挑战。作者还提出了‘Latent Multimodal Reconstruction’(LAMAR)网络,这是一种基于 Transformer 的网络,训练其重建真实说明文的嵌入,作为虚假信息检测的强辅助信号。实验结果表明,使用‘MisCaption This!’数据训练的模型在真实世界虚假信息上有更好的泛化能力,而 LAMAR 在多个基准测试中达到了最先进的性能,突显了 VLM 生成数据和基于重建的网络在多模态虚假信息检测中的有效性。
VideoHEDGE: Entropy-Based Hallucination Detection for Video-VLMs via Semantic Clustering and Spatiotemporal Perturbations
Authors: Sushant Gautam, Cise Midoglu, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen
First: 2026-01-13T13:42:05+00:00 · Latest: 2026-01-13T13:42:05+00:00
Abstract
Hallucinations in video-capable vision-language models (Video-VLMs) remain frequent and high-confidence, while existing uncertainty metrics often fail to align with correctness. We introduce VideoHEDGE, a modular framework for hallucination detection in video question answering that extends entropy-based reliability estimation from images to temporally structured inputs. Given a video-question pair, VideoHEDGE draws a baseline answer and multiple high-temperature generations from both clean clips and photometrically and spatiotemporally perturbed variants, then clusters the resulting textual outputs into semantic hypotheses using either Natural Language Inference (NLI)-based or embedding-based methods. Cluster-level probability masses yield three reliability scores: Semantic Entropy (SE), RadFlag, and Vision-Amplified Semantic Entropy (VASE). We evaluate VideoHEDGE on the SoccerChat benchmark using an LLM-as-a-judge to obtain binary hallucination labels. Across three 7B Video-VLMs (Qwen2-VL, Qwen2.5-VL, and a SoccerChat-finetuned model), VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often operate near chance. We further show that embedding-based clustering matches NLI-based clustering in detection performance at substantially lower computational cost, and that domain fine-tuning reduces hallucination frequency but yields only modest improvements in calibration. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE#videohedge .
中文标题/摘要
标题:VideoHEDGE:通过语义聚类和时空扰动实现基于熵的 Video-VLM 幻觉检测
视频-视觉语言模型(Video-VLMs)中的幻觉仍然频繁且置信度高,而现有的不确定性度量往往无法与正确性对齐。我们引入了VideoHEDGE,这是一种模块化框架,用于视频问答中的幻觉检测,将基于熵的可靠性估计从图像扩展到时间结构化的输入。给定一个视频-问题对,VideoHEDGE 从干净片段及其光度和时空扰动的变体中生成一个基线答案和多个高温度采样答案,然后使用基于自然语言推理(NLI)或基于嵌入的方法将生成的文本输出聚类成语义假设。聚类级别的概率质量产生三个可靠性分数:语义熵(SE)、RadFlag 和 视觉增强语义熵(VASE)。我们在SoccerChat基准上评估VideoHEDGE,并以LLM作为评判者(LLM-as-a-judge)获得二元幻觉标签。在三个7B Video-VLMs(Qwen2-VL、Qwen2.5-VL和SoccerChat微调模型)中,VASE始终获得最高的ROC-AUC,在较大的失真预算下尤为明显,而SE和RadFlag通常接近随机水平。我们进一步表明,基于嵌入的聚类在计算成本显著降低的情况下,在检测性能上与基于NLI的聚类相当,并且领域微调减少了幻觉频率,但仅在校准方面带来了适度的改进。hedge-bench PyPI库使基准测试可复现和可扩展,完整的代码和实验资源可在https://github.com/Simula/HEDGE#videohedge 获取。
Summary / 总结
VideoHEDGE is a framework for detecting hallucinations in video-capable vision-language models by extending entropy-based reliability estimation to temporally structured inputs. It generates baseline and high-temperature answers from both clean and perturbed video clips, clusters these answers into semantic hypotheses, and calculates reliability scores. Across three 7B Video-VLMs, VASE consistently achieves the highest ROC-AUC, especially at larger distortion budgets, while SE and RadFlag often perform near chance. Embedding-based clustering matches NLI-based clustering in performance but at lower computational cost. Domain fine-tuning reduces hallucination frequency but only modestly improves calibration.
VideoHEDGE 是一个用于检测视频问答中幻觉的框架,它将基于熵的可靠性估计从图像扩展到视频输入。它从干净和扰动的视频片段中生成基线答案和高温度采样答案,然后对这些答案进行聚类以得出可靠性评分。在三个 Video-VLM 上,VASE 的表现始终优于 SE 和 RadFlag,尤其是在较大的扰动预算下。基于嵌入的聚类比基于 NLI 的聚类计算成本更低,同时达到相当的检测性能。领域微调可以减少幻觉,但对校准的改进有限。
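The cluster-level entropy behind the Semantic Entropy (SE) score in this entry can be sketched in Python as follows. The sketch assumes the sampled answers have already been assigned to semantic clusters (by NLI or embeddings, which is not shown) and uses the standard Shannon entropy over cluster probability masses; the function name and the optional `weights` argument are hypothetical, and RadFlag and VASE are not covered.

```python
import math
from collections import defaultdict
from typing import Hashable, Optional, Sequence


def semantic_entropy(
    cluster_ids: Sequence[Hashable],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Shannon entropy (in nats) of the probability mass per semantic cluster.

    cluster_ids[i] is the cluster of the i-th sampled answer. weights, if
    given, are non-negative per-answer masses (for example sequence
    likelihoods); otherwise every sample counts equally.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    if weights is None:
        weights = [1.0] * len(cluster_ids)
    mass = defaultdict(float)
    for cid, w in zip(cluster_ids, weights):
        mass[cid] += w
    total = sum(mass.values())
    probs = [m / total for m in mass.values()]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Four sampled answers: three agree semantically, one disagrees.
print(round(semantic_entropy(["goal", "goal", "goal", "offside"]), 4))  # 0.5623
# All samples in one cluster: no semantic disagreement.
print(semantic_entropy(["goal"] * 4) == 0.0)  # True
```

Low entropy means the sampled answers collapse into one meaning; high entropy means the model's answers scatter across incompatible hypotheses, which is the signal used to flag a likely hallucination.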
Sketch-Based Facade Renovation With Generative AI: A Streamlined Framework for Bypassing As-Built Modelling in Industrial Adaptive Reuse
Authors: Warissara Booranamaitree, Xusheng Du, Yushu Cai, Zhengyang Wang, Ye Zhang, Haoran Xie
First: 2026-01-13T13:17:09+00:00 · Latest: 2026-01-13T13:17:09+00:00
Comments: 10 pages, 9 figures, Proceedings of CAADRIA 2026
Abstract
Facade renovation offers a more sustainable alternative to full demolition, yet producing design proposals that preserve existing structures while expressing new intent remains challenging. Current workflows typically require detailed as-built modelling before design, which is time-consuming, labour-intensive, and often involves repeated revisions. To solve this issue, we propose a three-stage framework combining generative artificial intelligence (AI) and vision-language models (VLM) that directly processes rough structural sketch and textual descriptions to produce consistent renovation proposals. First, the input sketch is used by a fine-tuned VLM model to predict bounding boxes specifying where modifications are needed and which components should be added. Next, a stable diffusion model generates detailed sketches of new elements, which are merged with the original outline through a generative inpainting pipeline. Finally, ControlNet is employed to refine the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the proposed framework can generate renovation proposals that preserve the original structure while improving facade detail quality. This approach effectively bypasses the need for detailed as-built modelling, enabling architects to rapidly explore design alternatives, iterate on early-stage concepts, and communicate renovation intentions with greater clarity.
中文标题/摘要
标题:基于草图的立面翻新与生成式AI:工业适应性再利用中绕过竣工建模的简化框架
立面翻新提供了比全面拆除更可持续的选择,但生成既保留现有结构又表达新意图的设计提案仍然具有挑战性。当前的工作流程通常需要在设计之前进行详细的竣工建模,这既耗时又劳动密集,经常需要反复修改。为了解决这个问题,我们提出了一种结合生成式人工智能(AI)和视觉-语言模型(VLM)的三阶段框架,可以直接处理粗糙的结构草图和文本描述以生成一致的翻新提案。首先,输入的草图由微调后的VLM模型用于预测边界框,指明需要修改的位置和应添加的组件。其次,稳定扩散模型生成新元素的详细草图,并通过生成式修复(inpainting)流程与原始轮廓合并。最后,使用ControlNet将结果细化为照片级真实感图像。在数据集和实际工业建筑上的实验表明,所提出的框架可以生成既保留原始结构又能提高立面细节质量的翻新提案。这种方法有效地绕过了详细的竣工建模需求,使建筑师能够快速探索设计替代方案,迭代早期概念,并以更清晰的方式传达翻新意图。
Summary / 总结
The paper addresses the challenge of producing facade renovation proposals that preserve existing structures while expressing new design intent. It introduces a three-stage framework combining generative AI and vision-language models that directly processes rough structural sketches and textual descriptions, bypassing the need for detailed as-built modelling. A fine-tuned VLM predicts bounding boxes for where modifications are needed and which components to add, a stable diffusion model generates detailed sketches of the new elements that are merged with the original outline through generative inpainting, and ControlNet refines the result into a photorealistic image. Experiments on datasets and real industrial buildings indicate that the framework preserves the original structure while improving facade detail quality.
论文旨在解决在保留现有结构的同时进行可持续的立面翻新设计的挑战。提出了一种包含三个阶段的框架,使用生成式AI和视觉语言模型直接处理粗略的草图和文本描述,生成一致的翻新提案,无需详细的竣工建模。该框架首先使用微调后的VLM模型预测需要修改的区域和组件,然后使用稳定扩散模型生成详细的草图,将它们与原始轮廓合并,并最终使用ControlNet将结果细化为照片级真实感图像。实验表明,所提出的方法可以有效绕过详细的竣工建模,使建筑师能够快速探索设计选项,更清晰地传达翻新意图。