arXiv Paper Digest

Modular Embedding Recomposition for Incremental Learning

Authors: Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara

First: 2025-08-22T15:25:40+00:00 · Latest: 2025-08-22T15:25:40+00:00

Comments: Accepted to the 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK

Abstract

The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
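
The retrieve-and-compose step described in the abstract can be illustrated with a small Python sketch. Everything here is an assumption made for illustration, not the MoDER implementation: the `ExpertHub` class, the choice to model each expert as an additive offset to a text embedding, cosine-similarity retrieval, and softmax weighting are all hypothetical details that the abstract does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


class ExpertHub:
    """Hypothetical hub holding one textual expert per seen class."""

    def __init__(self):
        self.entries = []  # (class_name, key_embedding, expert_offset)

    def add(self, name, key, offset):
        self.entries.append((name, key, offset))

    def query(self, text_embedding, k=2):
        """Return the k experts whose keys are most similar to the query."""
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(text_embedding, entry[1]),
            reverse=True,
        )
        return ranked[:k]


def refined_prototype(hub, zero_shot_embedding, k=2):
    """Compose retrieved experts into a prototype for an unseen class."""
    retrieved = hub.query(zero_shot_embedding, k)
    if not retrieved:
        # Nothing to compose: fall back to the plain zero-shot embedding.
        return normalize(zero_shot_embedding)
    sims = [cosine(zero_shot_embedding, key) for _, key, _ in retrieved]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    weights = [e / sum(exps) for e in exps]  # softmax over similarities
    composed = list(zero_shot_embedding)
    for weight, (_, _, offset) in zip(weights, retrieved):
        composed = [c + weight * o for c, o in zip(composed, offset)]
    return normalize(composed)


# Toy usage with made-up 3-dimensional embeddings.
hub = ExpertHub()
hub.add("cat", [1.0, 0.0, 0.0], [0.0, 0.2, 0.0])
hub.add("dog", [0.9, 0.1, 0.0], [0.0, 0.1, 0.1])
hub.add("car", [0.0, 0.0, 1.0], [0.3, 0.0, 0.0])
prototype = refined_prototype(hub, [0.95, 0.05, 0.0], k=2)
```

In this toy setup the query embedding sits close to the "cat" and "dog" keys, so those two experts are retrieved and blended, while the unrelated "car" expert is ignored.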

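Title and Abstract (translated from Chinese)

Title: Application of Modular Embedding Recomposition in Incremental Learning

The emergence of pre-trained Vision-Language Models (VLMs) has significantly changed the field of Continual Learning (CL), mainly thanks to their zero-shot classification ability. This ability makes VLMs well suited to real-world applications, since they maintain strong performance on unseen classes without adaptation. However, fine-tuning remains indispensable when the downstream task differs substantially from the pre-training domain. Existing CL methods mainly focus on preserving the zero-shot ability of VLMs during incremental fine-tuning on a downstream task. We go further and propose a method that turns this preservation into enhancement: Modular Embedding Recomposition (MoDER). The method trains multiple textual expert modules, each specialized in one seen class, and stores them in a foundational hub. At inference time, for each unseen class, it queries the hub and composes the retrieved experts to synthesize a refined prototype that improves classification. We validate the method on two zero-shot incremental protocols, Class-IL and MTIL, covering 14 datasets in total. The codebase is available at https://github.com/aimagelab/mammoth.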

Summary / 总结

The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities.

Structuring GUI Elements through Vision Language Models: Towards Action Space Generation

Authors: Yi Xu, Yesheng Zhang, Jiajia Liu, Jingdong Chen

First: 2025-08-22T10:14:15+00:00 · Latest: 2025-08-22T10:14:15+00:00

Comments: 10pageV0

Abstract

Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
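
A minimal Python sketch of IoU-based coordinate sampling around a ground-truth box follows. It is only an illustration under stated assumptions: the uniform integer jitter, the `min_iou` threshold, and the function names `iou` and `sample_boxes` are hypothetical, and the abstract does not say how the sampled coordinates enter the IAML objective.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_boxes(gt, num_samples=8, max_shift=10, min_iou=0.5,
                 rng=None, max_tries=1000):
    """Jitter the ground-truth box and keep candidates that stay close to it.

    Returns a list of (box, iou_with_ground_truth) pairs. The IoU score could
    serve as a per-sample weight, which is an assumption of this sketch.
    """
    rng = rng or random.Random(0)
    samples = []
    tries = 0
    while len(samples) < num_samples and tries < max_tries:
        tries += 1
        cand = tuple(c + rng.randint(-max_shift, max_shift) for c in gt)
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # skip degenerate boxes
        score = iou(gt, cand)
        if score >= min_iou:
            samples.append((cand, score))
    return samples


# Toy usage: augment one ground-truth UI element box.
augmented = sample_boxes((100, 100, 200, 200), num_samples=4)
```

Filtering by IoU rather than by raw coordinate distance keeps the acceptance criterion scale-aware: a 10-pixel shift matters far more for a small icon than for a full-width panel.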

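Title and Abstract (translated from Chinese)

Title: Structuring GUI Elements with Vision-Language Models: Towards Action Space Generation

Multimodal large language models (MLLMs) have become key tools for enhancing human-computer interaction. This paper focuses on applying MLLMs to structuring graphical user interface (GUI) elements, where they help process user instructions by analyzing screen contents. Although MLLMs show promise, their ability to precisely generate UI element coordinates, a core part of GUI understanding, is limited by the next-token prediction training mechanism. This challenge stems from the lack of semantics for numerical coordinates in language representation spaces, which calls for large and diverse datasets to strengthen the visual module. To this end, we propose the IoU-Augmented Maximum Likelihood (IAML) training paradigm. We first build a new IoU-based coordinate sampling pipeline to augment the training data, which takes proximity to the ground-truth coordinates into account. We then use this data augmentation strategy to fine-tune MLLMs under the IAML paradigm, mitigating the exposure bias inherent in traditional maximum likelihood estimation. Extensive experiments show that the IAML training approach outperforms traditional training paradigms.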

Summary / 总结

Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction.

RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution

Authors: Haodong He, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu, Gui-Song Xia

First: 2025-08-22T07:28:34+00:00 · Latest: 2025-08-22T07:28:34+00:00

Abstract

The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models' insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. Specifically, RAGSR localizes object regions in an image and assigns a fine-grained caption to each region; these are formatted as region-text pairs that serve as textual priors for T2I models. Regional guided attention is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.
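
One way to picture the idea of keeping region-text pairs from interacting is a cross-attention mask, sketched below in plain Python. This is an illustrative assumption rather than the RAGSR mechanism itself: the function names `build_region_mask` and `masked_softmax`, the rectangular `(x1, y1, x2, y2)` regions, and the rule that pixels outside every region receive no text attention are all hypothetical.

```python
import math


def build_region_mask(height, width, regions, tokens_per_region):
    """Build mask[pixel][token], True where a pixel may attend to a text token.

    regions: list of (x1, y1, x2, y2) boxes with exclusive upper bounds.
    tokens_per_region: number of caption tokens for each region; tokens are
    laid out consecutively in region order.
    """
    num_tokens = sum(tokens_per_region)
    mask = [[False] * num_tokens for _ in range(height * width)]
    start = 0
    for (x1, y1, x2, y2), count in zip(regions, tokens_per_region):
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                row = mask[y * width + x]
                for t in range(start, start + count):
                    row[t] = True  # pixel sees only its own region's caption
        start += count
    return mask


def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: a 2x2 image split into a left and a right region.
mask = build_region_mask(2, 2, [(0, 0, 1, 2), (1, 0, 2, 2)], [2, 1])
weights = masked_softmax([1.0, 1.0, 5.0], mask[0])
```

Here the top-left pixel belongs to the left region, so the high score on the right region's token is ignored and all attention weight goes to the two tokens of its own caption.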

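Title and Abstract (translated from Chinese)

Title: RAGSR: A Regional Attention Guided Diffusion Method for Image Super-Resolution

Combining the rich textual information of large vision-language models (VLMs) with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved remarkable results in single-image super-resolution (SISR). However, existing methods still face major challenges in generating clear and accurate regional details, especially in scenes with multiple objects. This mainly stems from the lack of fine-grained regional descriptions and the models' insufficient ability to capture complex prompts. To address these limitations, we propose the Regional Attention Guided Super-Resolution (RAGSR) method, which explicitly extracts localized fine-grained information and encodes it effectively through a novel regional attention mechanism, enhancing detail while preserving overall visual coherence. Specifically, RAGSR localizes object regions in the image and assigns a fine-grained description to each region, formatting them as region-text pairs that serve as textual priors for the T2I model. Regional guided attention is then used to ensure that each region-text pair is handled appropriately in the attention process while preventing unwanted interactions between unrelated region-text pairs. With this attention mechanism, our method controls the fusion of text and image information more finely, effectively overcoming the limitations of traditional SISR techniques. Experimental results on benchmark datasets show that, compared with existing methods, our method delivers superior performance in generating perceptually realistic visual details while maintaining contextual consistency.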

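Summary / 总结

The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR).
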
Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection

Authors: Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen

First: 2025-08-22T07:26:56+00:00 · Latest: 2025-08-22T07:26:56+00:00

Abstract

Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose Adaptive Prompt Tuning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework that overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.
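
The idea of self-generated anomaly samples built from noise perturbations can be sketched in a few lines of Python. The details are assumptions for illustration only and are not taken from the paper: the function name `make_synthetic_anomaly`, the square patch, Gaussian noise, and clipping to the range [0, 1] are all hypothetical choices.

```python
import random


def make_synthetic_anomaly(image, patch_size=2, noise_std=0.5, rng=None):
    """Perturb one random patch of a normal image with Gaussian noise.

    image: non-empty 2D list of floats in [0, 1] (a grayscale image).
    Returns (noisy_image, mask), where mask marks perturbed pixels with 1
    and can act as a pixel-wise pseudo label.
    """
    rng = rng or random.Random(0)
    height, width = len(image), len(image[0])
    patch_h, patch_w = min(patch_size, height), min(patch_size, width)
    top = rng.randint(0, height - patch_h)
    left = rng.randint(0, width - patch_w)

    noisy = [row[:] for row in image]  # copy so the input stays untouched
    mask = [[0] * width for _ in range(height)]
    for y in range(top, top + patch_h):
        for x in range(left, left + patch_w):
            value = noisy[y][x] + rng.gauss(0.0, noise_std)
            noisy[y][x] = min(1.0, max(0.0, value))  # keep a valid intensity
            mask[y][x] = 1
    return noisy, mask


# Toy usage: a flat gray 4x4 "normal" image.
normal = [[0.5] * 4 for _ in range(4)]
anomalous, anomaly_mask = make_synthetic_anomaly(normal, patch_size=2)
```

Returning the mask alongside the perturbed image gives a pixel-level target for free, which matches the abstract's emphasis on pixel-wise anomaly detection without real anomaly samples.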

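Title and Abstract (translated from Chinese)

Title: Beyond Human Prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection

Pre-trained Vision-Language Models (VLMs) have recently shown promise in anomaly detection. However, existing methods are fundamentally limited by their reliance on human-designed prompt templates and the lack of available anomaly samples, which leaves significant gaps in understanding scenario-specific anomalies. This paper proposes Adaptive Prompt Tuning with semantic alignment (APT), a groundbreaking few-shot framework that requires no prior knowledge and overcomes the limitations of traditional prompt-based methods. APT trains learnable prompts on self-generated anomaly samples with noise perturbations, capturing context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose the Self-Optimizing Meta-prompt Guiding Scheme (SMGS), which iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-level anomaly detection but also achieves state-of-the-art performance on multiple benchmark datasets without the prior knowledge that prompt engineering requires, providing a robust and versatile solution for real-world anomaly detection.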

Summary / 总结

Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies.