arXiv Paper Digest

Modular Embedding Recomposition for Incremental Learning

Authors: Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara

First: 2025-08-22T15:25:40+00:00 · Latest: 2025-08-22T15:25:40+00:00

Comments: Accepted to the 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK

Abstract

The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.

Summary

This work targets continual learning with pre-trained Vision-Language Models (VLMs), where incremental fine-tuning on a downstream task tends to erode zero-shot capabilities. The proposed method, MoDular Embedding Recomposition (MoDER), trains one textual expert per seen class and stores the experts in a hub. At inference time, it retrieves and composes these experts to synthesize a refined prototype for each unseen class, aiming to enhance zero-shot classification rather than merely preserve it. The authors report experiments on 14 datasets under the Class-IL and MTIL protocols showing that MoDER improves zero-shot performance.
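
The retrieve-and-compose step described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify how the hub is queried or how experts are combined, so the cosine-similarity retrieval, the softmax weighting, and the names `hub`, `k`, `temperature`, and `alpha` are all hypothetical and are not taken from the MoDER codebase.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def compose_prototype(query, hub, k=2, temperature=0.1, alpha=0.5):
    """Blend a zero-shot text embedding with its k most similar stored experts.

    query: text embedding of an unseen class (assumed given).
    hub:   dict mapping seen-class name -> expert embedding (hypothetical layout).
    """
    # Rank stored experts by similarity to the unseen-class embedding.
    scored = sorted(((cosine(query, e), name) for name, e in hub.items()),
                    reverse=True)[:k]
    # Softmax-style weights over the retrieved experts.
    weights = [math.exp(score / temperature) for score, _ in scored]
    total = sum(weights)
    mixed = [0.0] * len(query)
    for w, (_, name) in zip(weights, scored):
        for i, value in enumerate(hub[name]):
            mixed[i] += (w / total) * value
    # Interpolate between the original embedding and the composed experts.
    proto = [(1 - alpha) * q + alpha * m for q, m in zip(query, mixed)]
    norm = math.sqrt(sum(p * p for p in proto))
    return [p / norm for p in proto], [name for _, name in scored]
```

The returned prototype is unit-normalized so it can be compared with image embeddings by dot product, as is common in CLIP-style classification.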

Structuring GUI Elements through Vision Language Models: Towards Action Space Generation

Authors: Yi Xu, Yesheng Zhang, Jiajia Liu, Jingdong Chen

First: 2025-08-22T10:14:15+00:00 · Latest: 2025-08-22T10:14:15+00:00

Comments: 10pageV0

Abstract

Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.

Summary

This work studies multimodal large language models (MLLMs) for graphical user interface (GUI) element structuring, focusing on the precise generation of UI element coordinates. Next-token prediction handles numerical coordinates poorly because they carry little semantic meaning in language representation spaces, so a large and diverse dataset would otherwise be needed. The authors propose an IoU-Augmented Maximum Likelihood (IAML) training paradigm: an IoU-based coordinate sampling pipeline augments the training data according to proximity to the ground truth, and MLLMs are then fine-tuned under IAML to mitigate exposure bias. Experiments show that IAML outperforms traditional training paradigms.
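
To make the idea of IoU-based coordinate sampling concrete, here is a minimal sketch. The IoU formula itself is standard; everything else (uniform jitter, the `min_iou` cutoff, and the function name `sample_boxes`) is an assumption for illustration, since the abstract does not describe the sampling procedure in detail.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_boxes(gt, n=5, jitter=10.0, min_iou=0.5, seed=0, max_tries=1000):
    """Draw perturbed boxes near the ground truth and keep those close enough.

    Returns (box, iou) pairs; the IoU could serve as a per-sample weight.
    All parameters here are hypothetical.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(max_tries):
        if len(samples) == n:
            break
        cand = tuple(c + rng.uniform(-jitter, jitter) for c in gt)
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # skip degenerate boxes
        score = iou(gt, cand)
        if score >= min_iou:
            samples.append((cand, score))
    return samples
```

Each retained box comes with its IoU to the ground truth, which is one plausible way to express "proximity to ground truth coordinates" as a training signal.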

RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution

Authors: Haodong He, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu, Gui-Song Xia

First: 2025-08-22T07:28:34+00:00 · Latest: 2025-08-22T07:28:34+00:00

Abstract

The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models' insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. Specifically, RAGSR localizes object regions in an image and assigns a fine-grained caption to each region; the resulting region-text pairs serve as textual priors for T2I models. A region-guided attention mechanism is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.

Summary

Existing super-resolution methods that combine vision-language models with diffusion priors struggle to generate clear, accurate regional details in multi-object scenes, mainly because they lack fine-grained regional descriptions and handle complex prompts poorly. RAGSR localizes object regions, assigns a fine-grained caption to each, and formats the results as region-text pairs that act as textual priors. A regional attention mechanism ensures each pair is considered during attention while blocking interference between unrelated pairs. Experiments on benchmark datasets show that RAGSR outperforms existing methods in producing perceptually authentic details while maintaining contextual consistency.
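
One common way to keep unrelated region-text pairs from interacting is a boolean cross-attention mask. The sketch below assumes image tokens laid out on a row-major grid and rectangular regions; the function name and layout are hypothetical, and the paper's actual mechanism may differ.

```python
def regional_attention_mask(height, width, regions, caption_lengths):
    """Build mask[i][j]: may image token i attend to text token j?

    regions:         list of (x0, y0, x1, y1) boxes in token coordinates,
                     with x1 and y1 exclusive.
    caption_lengths: number of text tokens in each region's caption.
    """
    # Map every text token to the region whose caption it belongs to.
    owner = [r for r, n in enumerate(caption_lengths) for _ in range(n)]
    mask = []
    for y in range(height):
        for x in range(width):
            row = []
            for r in owner:
                x0, y0, x1, y1 = regions[r]
                # Allow attention only if the image token lies inside the region.
                row.append(x0 <= x < x1 and y0 <= y < y1)
            mask.append(row)
    return mask
```

In a real attention layer, positions where the mask is False would typically receive a large negative bias before the softmax.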

Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection

Authors: Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen

First: 2025-08-22T07:26:56+00:00 · Latest: 2025-08-22T07:26:56+00:00

Abstract

Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose Adaptive Prompt Tuning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework that overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.

Summary

Existing VLM-based anomaly detection relies on human-designed prompts and lacks accessible anomaly samples, which limits context-specific anomaly understanding. Adaptive Prompt Tuning (APT) is a few-shot framework that needs no prior knowledge: it generates synthetic anomaly samples through noise perturbations and uses them to train learnable prompts that capture context-dependent anomalies. A Self-Optimizing Meta-prompt Guiding Scheme (SMGS) iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies, which guards against overfitting to the synthetic noise. The authors report state-of-the-art performance on multiple benchmark datasets, including pixel-wise anomaly detection, without hand-crafted prompts.
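
The following sketch shows one simple way to self-generate an anomaly sample by noise perturbation, together with the pixel-wise mask that would supervise training. The patch shape, Gaussian noise, and all parameter names are assumptions; the abstract does not say which perturbations APT uses.

```python
import random

def make_synthetic_anomaly(image, patch=2, sigma=0.5, seed=0):
    """Add Gaussian noise to one random square patch of a grayscale image.

    image: list of rows with pixel values in [0, 1].
    Returns (noisy_image, mask), where mask marks the perturbed pixels with 1.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    top = rng.randrange(h - patch + 1)
    left = rng.randrange(w - patch + 1)
    noisy = [row[:] for row in image]          # copy so the input stays intact
    mask = [[0] * w for _ in range(h)]
    for y in range(top, top + patch):
        for x in range(left, left + patch):
            value = image[y][x] + rng.gauss(0.0, sigma)
            noisy[y][x] = min(1.0, max(0.0, value))  # clamp to valid range
            mask[y][x] = 1
    return noisy, mask
```

Pairs of `(noisy_image, mask)` could then stand in for real anomalies when only normal images are available.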

Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation

Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud

Venue: ICCV 2025

First: 2025-08-22T04:11:28+00:00 · Latest: 2025-08-22T04:11:28+00:00

Comments: CV4A11y@ICCV 2025

Abstract

Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.

Summary

Many sign languages remain under-resourced in AI. This work focuses on Sign Language Instruction Generation (SLIG), which produces step-by-step textual instructions that help non-signers learn gestures and support two-way communication. The authors introduce BdSLIG, the first Bengali SLIG dataset, to evaluate Vision Language Models on low-resource SLIG tasks and on long-tail visual concepts, since Bengali sign language is unlikely to appear in VLM pre-training data. They also propose Sign Parameter-Infused (SPI) prompting, which integrates standard sign parameters such as hand shape, motion, and orientation directly into the textual prompt to enhance zero-shot performance. According to the authors, SPI prompting makes instructions more structured and reproducible than the free-form text produced by vanilla prompting.
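
A prompt that embeds sign parameters might be assembled as in the sketch below. The wording of the template, the function name `build_spi_prompt`, and the parameter hints are invented for illustration; only the parameter names (hand shape, motion, orientation) come from the abstract.

```python
def build_spi_prompt(word, parameters):
    """Compose a prompt asking a VLM to describe a sign parameter by parameter.

    word:       gloss of the sign shown in the image (hypothetical input).
    parameters: dict mapping a sign parameter name to a short hint.
    """
    lines = [
        f"Describe step by step how to perform the sign for '{word}'.",
        "Structure each step using these sign parameters:",
    ]
    # One bullet per parameter keeps the model's answer consistently structured.
    lines += [f"- {name}: {hint}" for name, hint in parameters.items()]
    return "\n".join(lines)


example = build_spi_prompt(
    "thank you",
    {
        "hand shape": "which fingers are extended or bent",
        "motion": "direction and repetition of movement",
        "orientation": "where the palm faces",
    },
)
```

The resulting string would be sent to the VLM together with the image of the sign.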

Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants

Authors: Chongyang Li, Yuan Zhiqiang, Jiapei Zhang, Ying Deng, Hanbo Bi, Zexi Jia, Xiaoyue Duan, Peixiang Luo, Jinchao Zhang

First: 2025-08-22T03:56:30+00:00 · Latest: 2025-08-22T03:56:30+00:00

Abstract

Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs applied to the walking assistance task often produce outputs that contain considerable redundancy and extraneous details, adversely affecting users' ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, allowing WalkVLM-LR to assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and reduced temporal redundancy.

Summary

This work addresses output redundancy and temporal redundancy in vision language models (VLMs) used as walking assistants, both of which impair the ability of visually impaired users to judge their surroundings. The proposed WalkVLM-LR reduces output redundancy by adding four human-preference-based reward functions to a GRPO-based reasoning framework, optimizing conciseness, fluency, keyword density, and accuracy. It reduces temporal redundancy with an environment awareness discriminator that shares the visual encoder with the VLM, which cuts redundant computation and lets the model assess scene risk and issue reminders only when needed. Experiments show state-of-the-art performance across all evaluation metrics, particularly in output conciseness and reduced temporal redundancy.
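
Two of the four reward dimensions, conciseness and keyword density, lend themselves to simple rule-based scoring. The sketch below is an assumption about what such rewards could look like; the formulas, the `max_words` budget, and the equal weighting are hypothetical, and the fluency and accuracy rewards are omitted because they would need a language model or reference answers.

```python
def conciseness_reward(text, max_words=20):
    """Full reward up to max_words, then decay in proportion to the excess."""
    n = len(text.split())
    return 1.0 if n <= max_words else max_words / n

def keyword_density_reward(text, keywords):
    """Fraction of words that are safety-relevant keywords."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in keywords)
    return hits / len(words)

def combined_reward(text, keywords, weights=(0.5, 0.5)):
    """Weighted sum of the two rewards (weights are illustrative)."""
    return (weights[0] * conciseness_reward(text)
            + weights[1] * keyword_density_reward(text, keywords))
```

In a GRPO-style setup, a scalar such as `combined_reward` would be computed for each sampled response and compared within the sampled group.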

Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification

Authors: Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng

First: 2025-08-21T21:05:44+00:00 · Latest: 2025-08-21T21:05:44+00:00

Abstract

Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes. The subtle morphological variations among these subtypes, combined with the difficulty of aligning visual patterns with precise clinical terminology, make automated diagnosis in renal pathology particularly challenging. In this work, we explore how large pretrained VLMs can be effectively adapted to perform fine-grained glomerular classification, even in scenarios where only a small number of labeled examples are available. To this end, we introduce Glo-VLMs, a systematic framework designed to explore the adaptation of VLMs to fine-grained glomerular classification in data-constrained settings. Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning for nuanced renal pathology subtypes. By assessing various VLM architectures and adaptation strategies under a few-shot learning paradigm, we explore how both the choice of method and the amount of labeled data impact model performance in clinically relevant scenarios. To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications. As a result, fine-tuning the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class, demonstrating that even with highly limited supervision, foundation models can be effectively adapted for fine-grained medical image classification.

Summary

This work addresses fine-grained classification of diseased glomeruli in renal pathology, where subtle morphological variations and the difficulty of aligning visual patterns with clinical terminology limit the effectiveness of vision-language models. The authors introduce Glo-VLMs, a framework that pairs curated pathology images with clinical text prompts and adapts large pretrained vision-language models through joint image-text representation learning under few-shot conditions. With only 8 shots per class, fine-tuned models reached 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score, indicating that foundation models can be adapted to specialized medical image classification even with highly limited supervision.
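
The reported accuracy and F1-score measure different things, and a small worked example shows how they can diverge on imbalanced labels. The sketch assumes macro-averaged F1; the abstract says only "F1-score", so the averaging mode is an assumption, and the toy labels are invented and do not explain the paper's numbers.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 8 samples of class 0, 2 of class 1, and a model that always says 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(accuracy(y_true, y_pred))   # 0.8
print(macro_f1(y_true, y_pred))   # about 0.444
```

Because macro averaging gives every class equal weight, a class that is never predicted pulls the score down sharply even when overall accuracy looks acceptable.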

Semantic-Aware Ship Detection with Vision-Language Integration

Authors: Jiahao Li, Jiancheng Pan, Yuze Sun, Xiaomeng Huang

First: 2025-08-21T19:24:52+00:00 · Latest: 2025-08-21T19:24:52+00:00

Comments: 5 pages

Abstract

Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized Vision-Language dataset designed to capture fine-grained ship attributes. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.

Summary

Ship detection in remote sensing imagery matters for applications such as maritime monitoring and shipping logistics, but existing methods often fail to capture fine-grained semantic information in complex scenes. The authors propose a detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy, and they introduce ShipSem-VL, a vision-language dataset designed to capture fine-grained ship attributes in support of Semantic-Aware Ship Detection (SASD). The framework is evaluated on three well-defined tasks, which provide a performance analysis from multiple perspectives.
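
A multi-scale sliding window over a large remote sensing image can be generated as sketched below. This version uses fixed window sizes and a fixed overlap; how the paper makes the strategy "adaptive" is not stated in the abstract, so the sizes, overlap, and function name are assumptions.

```python
def sliding_windows(width, height, sizes=(256, 512), overlap=0.5):
    """Return (x1, y1, x2, y2) crops covering the image at several scales."""
    windows = []
    for size in sizes:
        stride = max(1, int(size * (1 - overlap)))
        xs = list(range(0, max(width - size, 0) + 1, stride))
        ys = list(range(0, max(height - size, 0) + 1, stride))
        # Add a final window flush with the right and bottom borders if needed.
        if xs[-1] + size < width:
            xs.append(width - size)
        if ys[-1] + size < height:
            ys.append(height - size)
        for y in ys:
            for x in xs:
                windows.append((x, y, min(x + size, width), min(y + size, height)))
    return windows
```

Each crop would be passed to the VLM separately, and detections from overlapping crops would then need to be merged, for example with non-maximum suppression.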

VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos

Authors: Kaining Li, Shuwei He, Zihan Xu

First: 2025-08-21T18:03:16+00:00 · Latest: 2025-08-21T18:03:16+00:00

Abstract

Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent "visual event sequences" through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization with an event coherence bias. These visual event sequences are then fed into an LVLM-based Action Reasoning module, specifically a frozen LLaVA-1.5 model, adapted using parameter-efficient Prompt Tuning (P-Tuning v2) for action classification. Comprehensive evaluations on the NTU RGB+D and NTU RGB+D 120 datasets demonstrate that VT-LVLM-AR consistently achieves state-of-the-art performance, surpassing existing methods (e.g., 94.1% accuracy on NTU RGB+D X-Sub). Ablation studies confirm the critical contributions of VTEM's components and the efficacy of Prompt Tuning, while human evaluations underscore the interpretability of our visual event representations. This work highlights the immense potential of leveraging LVLMs for robust and interpretable video action understanding through effective video-to-language translation and efficient model adaptation.

Summary

This work addresses fine-grained action recognition in long-term videos, where complex backgrounds and subtle action differences challenge traditional models because of computational overhead, difficulty capturing long-range temporal dependencies, and limited semantic understanding. The proposed VT-LVLM-AR framework first uses a Video-to-Event Mapper (VTEM) to convert raw video into compact, semantically rich visual event sequences through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization. An action reasoning module based on a frozen LLaVA-1.5 model, adapted with parameter-efficient Prompt Tuning (P-Tuning v2), then classifies the action. On NTU RGB+D and NTU RGB+D 120, the method reaches state-of-the-art performance, including 94.1% accuracy on NTU RGB+D X-Sub. Ablation studies confirm the contribution of the VTEM components and of Prompt Tuning, and human evaluations highlight the interpretability of the visual event representations.
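
Adaptive temporal pooling can be pictured as merging runs of similar consecutive frames into single event features. The sketch below uses a cosine-similarity threshold against the running segment mean; this rule and the `threshold` parameter are assumptions chosen for illustration, not the VTEM design.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def adaptive_temporal_pool(frames, threshold=0.9):
    """Group consecutive frame features into segments and average each one.

    A new segment starts when a frame is no longer similar to the mean of
    the current segment, so static stretches collapse to a single vector.
    """
    segments, current = [], [frames[0]]
    for feature in frames[1:]:
        if cosine(mean_vector(current), feature) >= threshold:
            current.append(feature)
        else:
            segments.append(current)
            current = [feature]
    segments.append(current)
    return [mean_vector(seg) for seg in segments]
```

The output length depends on how much the video content changes, which is what makes the pooling adaptive rather than fixed-rate.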

LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions

Authors: Yongju Jia, Jiarui Ma, Xiangxian Li, Baiqiao Zhang, Xianhui Cao, Juan Liu, Yulong Bian

Venue: EMNLP 2025

First: 2025-08-21T16:12:06+00:00 · Latest: 2025-08-21T16:12:06+00:00

Comments: accepted by EMNLP 2025

Abstract

Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive capability in visual tasks, but their fine-tuning often suffers from bias in class-imbalanced scenarios. Recent works have introduced large language models (LLMs) to enhance VLM fine-tuning by supplementing semantic information. However, they often overlook inherent class imbalance in VLMs' pre-training, which may lead to bias accumulation in downstream tasks. To address this problem, this paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework. MDPR constructs a comprehensive knowledge base for classes, spanning five visual-semantic dimensions. During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, yielding stable predictions through logits fusion. Extensive experiments on long-tailed benchmarks, including CIFAR-LT, ImageNet-LT, and Places-LT, demonstrate that MDPR achieves results comparable to current SOTA methods. Ablation studies further confirm the effectiveness of our semantic library for tail classes, and show that our dynamic routing incurs minimal computational overhead, making MDPR a flexible and efficient enhancement for VLM fine-tuning under data imbalance.

Summary

This work addresses bias accumulation when fine-tuning vision-language models (VLMs) such as CLIP under long-tailed data distributions, where class imbalance inherited from pre-training can further degrade downstream performance. The proposed Multi-dimensional Dynamic Prompt Routing (MDPR) framework uses large language models to build a class knowledge base spanning five visual-semantic dimensions. During fine-tuning, a dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, and logits fusion produces stable predictions. Experiments on CIFAR-LT, ImageNet-LT, and Places-LT show results comparable to current state-of-the-art methods, and ablations confirm the method's effectiveness for tail classes and its low computational overhead.
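
Logits fusion, the last step mentioned above, can be illustrated as a weighted average of per-prompt logits followed by a softmax. The weighting scheme and function names here are assumptions; the abstract does not state how MDPR combines its logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_logits(logit_sets, weights):
    """Weighted average of several logit vectors, returned as probabilities.

    logit_sets: one logit vector per prompt or semantic dimension.
    weights:    non-negative routing weights, one per logit vector.
    """
    total = sum(weights)
    n_classes = len(logit_sets[0])
    fused = [
        sum(w * logits[i] for w, logits in zip(weights, logit_sets)) / total
        for i in range(n_classes)
    ]
    return softmax(fused)
```

With equal weights, two prompts that disagree symmetrically cancel out, while unequal routing weights shift the prediction toward the more trusted prompt.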