Zero-shot large vision-language model prompting for automated bone identification in paleoradiology x-ray archives
Authors: Owen Dong, Lily Gao, Manish Kota, Bennett A. Landman, Jelena Bekvalac, Gaynor Western, Katherine D. Van Schaik
First: 2026-02-03T17:14:23+00:00 · Latest: 2026-02-03T17:14:23+00:00
Abstract
Paleoradiology, the use of modern imaging technologies to study archaeological and anthropological remains, offers new windows on millennial scale patterns of human health. Unfortunately, the radiographs collected during field campaigns are heterogeneous: bones are disarticulated, positioning is ad hoc, and laterality markers are often absent. Additionally, factors such as age at death, age of bone, sex, and imaging equipment introduce high variability. Thus, content navigation, such as identifying a subset of images with a specific projection view, can be time consuming and difficult, making efficient triaging a bottleneck for expert analysis. We report a zero shot prompting strategy that leverages a state of the art Large Vision Language Model (LVLM) to automatically identify the main bone, projection view, and laterality in such images. Our pipeline converts raw DICOM files to bone windowed PNGs, submits them to the LVLM with a carefully engineered prompt, and receives structured JSON outputs, which are extracted and formatted onto a spreadsheet in preparation for validation. On a random sample of 100 images reviewed by an expert board certified paleoradiologist, the system achieved 92% main bone accuracy, 80% projection view accuracy, and 100% laterality accuracy, with low or medium confidence flags for ambiguous cases. These results suggest that LVLMs can substantially accelerate code word development for large paleoradiology datasets, allowing for efficient content navigation in future anthropology workflows.
中文标题/摘要
标题:零样本大型视觉语言模型提示在古放射学X光档案中自动识别骨骼
古放射学,使用现代成像技术研究考古和人类学遗骸,为研究千年尺度的人类健康模式提供了新的窗口。不幸的是,收集到的放射照相图是异质的:骨骼是分离的,定位是随意的,而且经常缺少左右标记。此外,诸如死亡年龄、骨骼年龄、性别和成像设备等因素引入了高变异性。因此,内容导航,如识别具有特定投影视图的图像子集,可能耗时且困难,使得高效筛选成为专家分析的瓶颈。我们报告了一种零样本提示策略,利用最先进的大型视觉语言模型(LVLM)自动识别此类图像中的主要骨骼、投影视图和左右方向。我们的流水线将原始DICOM文件转换为骨窗PNG,将它们提交给LVLM并附上精心设计的提示,接收结构化的JSON输出,提取并格式化到电子表格中,以备验证。在由专家放射学认证的古放射学家审查的100张随机图像样本中,该系统在主要骨骼识别上的准确率为92%,在投影视图识别上的准确率为80%,在左右方向识别上的准确率为100%,对于有疑问的情况,标记为低或中等置信度。这些结果表明,LVLM可以显著加速大型古放射学数据集的代码词开发,允许在未来的人类学工作流程中高效导航内容。
Summary / 总结
The study addresses content navigation in paleoradiology by automating the identification of the main bone, projection view, and laterality in heterogeneous X-ray images. A pipeline converts raw DICOM files to bone-windowed PNGs and submits them to a Large Vision Language Model (LVLM) with a zero-shot prompt, which returns structured JSON outputs. On a random sample of 100 images reviewed by an expert paleoradiologist, the system achieved 92% accuracy for the main bone, 80% for projection view, and 100% for laterality, with low or medium confidence flags for ambiguous cases, suggesting that LVLMs can accelerate triage of large paleoradiology datasets.
研究旨在通过使用大型视觉语言模型(LVLM)和零样本提示策略,自动识别异质性古放射学X光图像中的骨骼及其视角。方法包括将DICOM文件转换为骨骼窗口PNG,并提交给LVLM带有特定提示,返回结构化的JSON输出。系统在识别主要骨骼方面达到了92%的准确率,在投影视角方面达到了80%的准确率,在侧向性方面达到了100%的准确率,对于模糊情况标记为低或中等置信度,显著加速了古放射学工作流程中的内容导航。
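The preprocessing and parsing steps described in this entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the window center and width, the JSON field names (`main_bone`, `projection_view`, `laterality`, `confidence`), and both function names are hypothetical, and DICOM decoding is left out so that the sketch runs on plain lists of pixel intensities.

```python
import json

# Assumed field names; the abstract only says the LVLM returns structured JSON.
FIELDS = ("main_bone", "projection_view", "laterality", "confidence")

def apply_window(pixels, center, width):
    """Clip raw intensities to a display window and rescale them to 0-255."""
    low = center - width / 2.0
    high = center + width / 2.0
    windowed = []
    for row in pixels:
        new_row = []
        for value in row:
            clipped = min(max(value, low), high)
            new_row.append(round(255 * (clipped - low) / (high - low)))
        windowed.append(new_row)
    return windowed

def parse_reply(reply_text):
    """Turn one model reply into a spreadsheet row; flag anything doubtful."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        data = {}
    row = {name: data.get(name) for name in FIELDS}
    # Missing fields or a confidence other than "high" go to manual review.
    missing = any(row[name] is None for name in FIELDS)
    row["needs_review"] = missing or row["confidence"] != "high"
    return row
```

Routing every low- or medium-confidence reply to manual review mirrors the confidence flags mentioned in the abstract; the exact threshold is a design choice of this sketch.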
Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment
Authors: Johny J. Lopez, Md Meftahul Ferdaus, Mahdi Abdelguerfi
First: 2026-02-03T17:03:46+00:00 · Latest: 2026-02-03T17:03:46+00:00
Abstract
Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.
中文标题/摘要
标题:优化边缘的视觉-语言模型用于地下基础设施评估
对地下基础设施(如污水和涵洞系统)的自主检查对于公共安全和城市可持续性至关重要。尽管配备视觉传感器的机器人平台可以高效地检测结构缺陷,但从这些检测中自动生成人类可读的摘要仍然是一项重大挑战,特别是在资源受限的边缘设备上。本文提出了一种新颖的两阶段端到端缺陷摘要管道,结合了我们轻量级的RAPID-SCAN分割模型和在边缘计算平台上部署的微调视觉-语言模型(VLM)。第一阶段使用RAPID-SCAN(资源感知管道检查和缺陷分割紧凑自适应网络),仅使用0.64M参数实现了0.834的F1分数,以实现高效的缺陷分割。第二阶段利用微调的Phi-3.5 VLM从分割输出生成简洁的、领域特定的自然语言摘要。我们引入了一个包含手动验证描述的检查图像数据集,用于VLM的微调和评估。为了实现实时性能,我们采用后训练量化并进行硬件特定优化,显著减少了模型大小和推理延迟,而不影响摘要质量。我们在移动机器人平台上部署并评估了完整的管道,证明了其在实际检查场景中的有效性。我们的结果表明,可部署于边缘的集成AI系统有可能弥合自动缺陷检测与基础设施维护可操作见解之间的差距,为更可扩展和自主的检查解决方案铺平了道路。
Summary / 总结
This paper addresses the challenge of generating human-readable summaries from automated inspections of underground infrastructure using a two-stage pipeline. The first stage uses the RAPID-SCAN model for efficient defect segmentation, achieving an F1-score of 0.834 with minimal parameters. The second stage employs a fine-tuned Phi-3.5 Vision-Language Model to generate concise summaries. The system is optimized for edge devices through post-training quantization and hardware-specific optimization, maintaining summarization quality while reducing model size and inference latency. Real-world deployment on a mobile robotic platform demonstrates the system's effectiveness in practical inspection scenarios.
本文解决地下基础设施检测中从自动化缺陷检测生成人类可读总结的挑战。提出了一种两阶段流水线,使用RAPID-SCAN进行高效缺陷分割,并使用微调后的Phi-3.5视觉语言模型生成简洁的总结。该流水线针对边缘设备进行了优化,实现了实时性能,同时减少了模型大小和推理延迟。实验结果表明,该流水线在实际场景中的有效性,展示了其在可扩展和自主检测解决方案中的潜力。
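To make the post-training quantization step concrete, here is a minimal sketch of symmetric per-tensor int8 quantization. The abstract does not state which scheme or hardware toolchain the authors used, so the scheme, the function names, and the 127-level range are assumptions chosen for illustration.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]
```

Each weight is stored in one byte instead of four, and the reconstruction error per weight is at most half the scale, which is the kind of size-versus-accuracy trade-off the entry refers to.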
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning
Authors: Wenfang Sun, Hao Chen, Yingjun Du, Yefeng Zheng, Cees G. M. Snoek
Venue: ICLR 2026
First: 2026-02-03T16:52:16+00:00 · Latest: 2026-02-03T16:52:16+00:00
Comments: Accepted by ICLR 2026
Abstract
Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
中文标题/摘要
标题:RegionReasoner:基于区域的多轮视觉推理
大型视觉-语言模型在视觉推理方面取得了显著进展,但大多数现有系统依赖于单步或仅基于文本的推理,限制了它们在多个视觉上下文中逐步深化理解的能力。为解决这一限制,我们引入了一个新的多轮视觉推理基准,该基准的训练集和测试集涵盖了检测和分割任务,从而在迭代推理场景下进行系统的评估。我们进一步提出了RegionReasoner,这是一种强化学习框架,通过要求每个推理轨迹明确引用相应的参考边界框来强制执行基于区域的推理,同时通过全局-局部一致性奖励保持语义连贯性。该奖励从全局场景描述和区域级描述中提取关键对象和名词,并将其与推理轨迹对齐,以确保推理步骤之间的连贯性。RegionReasoner 通过结合定位准确性和全局-局部语义对齐的结构化奖励进行优化。在检测和分割任务上的实验表明,RegionReasoner-7B 与我们新引入的基准RegionDial-Bench 一起,显著提高了多轮推理准确性、空间定位精度和全局-局部一致性,为这一新兴研究方向建立了强大的基线。
Summary / 总结
The research aims to enhance visual reasoning capabilities by addressing the limitations of single-step or text-only reasoning in existing systems. The method involves a new multi-round visual reasoning benchmark and a reinforcement learning framework called RegionReasoner, which requires explicit citation of bounding boxes and maintains semantic coherence. Key findings include improved multi-round reasoning accuracy, spatial grounding precision, and global-local consistency on detection and segmentation tasks, setting a strong baseline for this field.
研究旨在通过解决单一步骤或仅文本推理方法的局限性,提升视觉推理能力。它引入了一个新的多轮视觉推理基准,并提出了RegionReasoner,这是一种强化学习框架,要求明确引用边界框并保持语义一致性。实验结果显示,在检测和分割任务上,该方法在多轮推理准确性、空间定位精度和全局-局部一致性方面取得了显著改进,为这一新兴研究方向奠定了坚实的基础。
Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
Authors: Yi Ding, Lijun Li, Bing Cao, Jing Shao
Venue: ICLR 2026
First: 2025-01-30T17:59:45+00:00 · Latest: 2026-02-03T16:20:24+00:00
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks requiring safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Specifically, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.
中文标题/摘要
标题:重新思考视觉语言模型安全微调中的瓶颈
大型视觉-语言模型(VLMs)在广泛的任务中取得了显著的性能。然而,在安全关键领域部署它们提出了重大挑战。现有的安全微调方法,专注于文本或多模态内容,无法有效解决复杂情况或在有益性和无害性之间保持平衡。我们的评估揭示了一个安全推理缺口:这些方法缺乏安全视觉推理能力,导致了这些瓶颈。为了解决这一限制并增强在安全关键环境中的视觉感知和推理能力,我们提出了一种新的数据集,该数据集将多图像输入与安全思维链(CoT)标签结合,作为细粒度的推理逻辑以提高模型性能。具体来说,我们引入了多图像安全(MIS)数据集,这是一个针对多图像安全场景的指令遵循数据集,包含训练和测试分割。我们的实验表明,使用MIS微调InternVL2.5-8B在具有安全相关视觉推理要求的复杂多图像任务中显著优于强大的开源模型和API基模型。这种方法不仅提供了卓越的安全性能,而且在没有任何权衡的情况下保留了通用能力。具体来说,使用MIS微调增加了五个通用基准的平均准确性0.83%,并在多个安全基准上大幅降低了攻击成功率(ASR)。
Summary / 总结
This paper addresses the limitations of existing safety fine-tuning methods for Vision-Language Models (VLMs) in safety-critical domains. It proposes the Multi-Image Safety (MIS) dataset, which integrates multi-image inputs with safety Chain-of-Thought labels, to enhance safety-related visual reasoning. Experiments show that fine-tuning InternVL2.5-8B with MIS reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin while raising average accuracy by 0.83% across five general benchmarks, so general capabilities are preserved.
本文通过引入包含多张图像输入和安全链式思考(CoT)标签的新型Multi-Image Safety (MIS)数据集,解决了现有视觉语言模型(VLM)安全微调方法的局限性。实验表明,使用MIS微调InternVL2.5-8B在复杂多张图像任务中的安全性能得到提升,准确率在基准测试中提高了0.83%,并且在多个安全基准测试中的攻击成功率显著降低。
Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning
Authors: Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang
First: 2025-05-26T17:51:47+00:00 · Latest: 2026-02-03T16:15:45+00:00
Abstract
Large Vision-Language Models (LVLMs) have become powerful general-purpose assistants, yet their predictions often lack reliability and interpretability due to insufficient grounding in visual evidence. The emerging thinking-with-images paradigm seeks to address this issue by explicitly anchoring reasoning to image regions. However, we empirically find that most existing methods suffer from a systematic scale-driven bias in optimization, where training rewards are dominated by large visual regions, suppressing learning from small but semantically critical evidence and leading to spurious grounding at inference time. To address this limitation, we propose Ground-R1, a de-biased thinking-with-images framework trained via a novel Scale Relative Policy Optimization (SRPO) objective that replaces standard GRPO. Specifically, our SRPO recalibrates reward learning across evidence regions of different sizes through scale-aware binning and intra-/inter-bin comparisons, enabling balanced credit assignment during training. Experimental results on general LVLM, high-resolution, and visual grounding benchmarks validate the effectiveness of Ground-R1 and show that SRPO yields consistent gains over standard GRPO in both response accuracy and evidence grounding.
中文标题/摘要
标题:Ground-R1: 通过强化学习激励基于视觉的推理
大型视觉-语言模型(LVLMs)已成为强大的通用助手,但它们的预测往往缺乏可靠性和可解释性,因为它们在视觉证据方面的接地不足。新兴的图像思维范式旨在通过明确将推理锚定到图像区域来解决这一问题。然而,我们实证发现,大多数现有方法在优化中存在系统性的尺度驱动偏差,其中训练奖励主要由大视觉区域主导,抑制了对小但语义上关键证据的学习,导致推理时出现虚假的接地。为解决这一局限,我们提出了Ground-R1,这是一种通过新颖的尺度相对策略优化(SRPO)目标进行训练的去偏见图像思维框架,替代了标准的GRPO。具体而言,我们的SRPO通过尺度感知的分箱和箱内/箱间比较重新校准不同大小证据区域的奖励学习,使训练期间的奖励分配更加平衡。实验结果在通用LVLM、高分辨率和视觉接地基准上验证了Ground-R1的有效性,并表明SRPO在响应准确性和证据接地方面相对于标准GRPO提供了持续的改进。
Summary / 总结
The research aims to improve the reliability and interpretability of large vision-language models by addressing the scale-driven bias in existing methods. Ground-R1 proposes a novel Scale Relative Policy Optimization (SRPO) objective to recalibrate rewards across different-sized evidence regions, ensuring balanced credit assignment during training. Experiments on various benchmarks demonstrate that Ground-R1 outperforms standard methods in both response accuracy and evidence grounding, mitigating spurious grounding issues at inference time.
研究旨在通过解决现有方法中的规模驱动偏差,提高大型视觉语言模型的可靠性和可解释性。Ground-R1 是一种去偏差的思考-图像框架,使用新型的尺度相对策略优化(SRPO)目标来跨不同大小的证据区域重新校准奖励,促进训练中的平衡信用分配。实验结果表明,Ground-R1 在响应准确性和证据定位方面均优于标准方法,有效减少了虚假定位问题。
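The scale-driven bias can be illustrated with a toy comparison. In standard GRPO, each sampled response's reward is normalized against the mean and standard deviation of its whole group. The sketch below contrasts that with normalization inside bins of evidence-region size. It is not the SRPO objective from the paper: the bin edges, the function names, and the use of intra-bin normalization alone (the inter-bin comparison is omitted) are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards over the whole group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def binned_advantages(rewards, area_ratios, edges=(0.05, 0.25), eps=1e-6):
    """Normalize rewards only against samples whose evidence region
    falls in the same size bin (region area / image area)."""
    def bin_of(ratio):
        # Number of (hypothetical) edges the ratio reaches: 0, 1, or 2.
        return sum(ratio >= edge for edge in edges)

    bins = {}
    for index, ratio in enumerate(area_ratios):
        bins.setdefault(bin_of(ratio), []).append(index)

    advantages = [0.0] * len(rewards)
    for members in bins.values():
        local = group_advantages([rewards[i] for i in members], eps)
        for i, value in zip(members, local):
            advantages[i] = value
    return advantages
```

With rewards `[0.1, 0.2, 0.8, 0.9]` and area ratios `[0.01, 0.02, 0.5, 0.6]`, group-level normalization gives both small-region samples negative advantages, while the binned version rewards the better of the two small-region samples.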
MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment
Authors: Eunkyu Park, Wesley Hanwen Deng, Cheyon Jin, Matheus Kunzler Maldaner, Jordan Wheeler, Jason I. Hong, Hong Shen, Adam Perer, Ken Holstein, Motahhare Eslami, Gunhee Kim
First: 2026-02-03T15:48:00+00:00 · Latest: 2026-02-03T15:48:00+00:00
Abstract
Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior works typically rely on binary or pairwise supervision, which often fail to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans using an interface we tailored for data collection, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.
中文标题/摘要
标题:MM-SCALE:通过标量判断和列表对齐的多模态道德推理
视觉-语言模型(VLMs)在多模态和社交含糊情境中继续难以做出道德相关的判断。先前的工作通常依赖二元或成对的监督,这往往无法捕捉到人类道德推理的连续性和多元性。我们提出了MM-SCALE(多模态道德标量),这是一个大规模数据集,通过5点标量评分和明确的模态定位,将VLMs与人类的道德偏好对齐。每对图像-场景配对都由人类使用我们为数据收集定制的界面进行道德可接受性评分和定位推理标签的标注,从而在排序场景集中实现列表偏好优化。通过从离散监督转向标量监督,我们的框架提供了更丰富的对齐信号和更精细的多模态道德推理校准。实验表明,使用MM-SCALE微调的VLMs在排名准确性和更稳定的安全性校准方面优于使用二元信号训练的模型。
Summary / 总结
The research aims to improve VLMs' ability to make morally salient judgments in ambiguous contexts by moving from binary or pairwise supervision to scalar supervision. The method involves creating MM-SCALE, a large dataset with 5-point scalar ratings and explicit modality grounding for moral acceptability scores. The key findings show that VLMs fine-tuned on MM-SCALE perform better in ranking fidelity and safety calibration compared to those trained with binary signals.
研究旨在通过从二元或成对监督转向5点标度评分来提高VLMs在模糊情境下进行道德判断的能力。方法是创建MM-SCALE数据集,包含道德可接受性的5点标度评分和明确的模态标注,以实现排序场景集的列表偏好优化。实验表明,使用MM-SCALE进行微调的VLMs在排名准确性和安全性校准方面表现更好,优于使用二元信号进行训练的模型。
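Listwise optimization over a ranked scenario set can be sketched with a Plackett-Luce (ListMLE-style) loss. The abstract does not name the loss the authors use, so this is a swapped-in, well-known listwise objective; the function name and the tie handling (ties in human ratings are broken by list position) are assumptions of this sketch.

```python
import math

def listmle_loss(scores, ratings):
    """Negative log-likelihood of the human ranking under a
    Plackett-Luce model parameterized by the model's scores."""
    # Order scenarios from most to least acceptable by human rating.
    order = sorted(range(len(scores)), key=lambda i: ratings[i], reverse=True)
    loss = 0.0
    for position in range(len(order)):
        remaining = [scores[i] for i in order[position:]]
        # Numerically stable log-sum-exp over the scenarios not yet placed.
        top = max(remaining)
        log_z = top + math.log(sum(math.exp(s - top) for s in remaining))
        loss += log_z - scores[order[position]]
    return loss
```

The loss is small when the model's scores sort the scenarios in the same order as the 5-point human ratings and grows as the two orderings diverge, which is the signal a pairwise or binary label cannot provide for the middle of the list.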
Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models
Authors: Mohamad Al Mdfaa, Raghad Salameh, Geesara Kulathunga, Sergey Zagoruyko, Gonzalo Ferrer
Venue: Robotics, vol. 15, no. 2, article 31, 2026
First: 2024-05-03T15:08:39+00:00 · Latest: 2026-02-03T15:20:28+00:00
Abstract
Panoptic maps enable robots to reason about both geometry and semantics. However, open-vocabulary models repeatedly produce closely related labels that split panoptic entities and degrade volumetric consistency. The proposed UPPM advances open-world scene understanding by leveraging foundation models to introduce a panoptic Dynamic Descriptor that reconciles open-vocabulary labels with unified category structure and geometric size priors. The fusion for such dynamic descriptors is performed within a multi-resolution multi-TSDF map using language-guided open-vocabulary panoptic segmentation and semantic retrieval, resulting in a persistent and promptable panoptic map without additional model training. Based on our evaluation experiments, UPPM shows the best overall performance in terms of the map reconstruction accuracy and the panoptic segmentation quality. The ablation study investigates the contribution for each component of UPPM (custom NMS, blurry-frame filtering, and unified semantics) to the overall system performance. Consequently, UPPM preserves open-vocabulary interpretability while delivering strong geometric and panoptic accuracy.
中文标题/摘要
标题:映射无形之物:利用基础模型的统一可调全景映射与动态标注
全景地图使机器人能够推理几何和语义。然而,开放词汇模型反复生成紧密相关的标签,将全景实体分割并降低体素一致性。所提出的UPPM通过利用基础模型引入全景动态描述符,结合开放词汇标签与统一类别结构和几何尺寸先验,推进开放世界场景理解。此类动态描述符的融合在多分辨率多TSDF地图中使用语言引导的开放词汇全景分割和语义检索进行,从而生成持久且可调的全景地图,无需额外模型训练。基于我们的评估实验,UPPM在地图重建精度和全景分割质量方面表现出最佳的整体性能。消融研究调查了UPPM(自定义NMS、模糊帧过滤和统一语义)每个组件对整体系统性能的贡献。因此,UPPM在保持开放词汇可解释性的同时,提供了强大的几何和全景准确性。
Summary / 总结
The research aims to improve panoptic maps for robots by addressing the issue of label splitting and inconsistency. UPPM uses foundation models to introduce a panoptic Dynamic Descriptor that aligns open-vocabulary labels with a unified category structure and geometric size priors. Experiments show that UPPM outperforms other methods in map reconstruction accuracy and panoptic segmentation quality. Ablation studies confirm the effectiveness of each component in the system.
UPPM通过引入结合几何和语义先验的全景动态描述符来解决开放词汇模型在全景图中产生不一致标签的问题。该方法利用基础模型进行语言引导的分割和语义检索,从而生成持久且可提示的全景图而不需额外训练。实验结果显示,UPPM在地图重建精度和全景分割质量方面优于其他方法。
KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs
Authors: Baiyang Song, Jun Peng, Yuxin Zhang, Guangyao Chen, Feidiao Yang, Jianyuan Guo
First: 2026-02-03T15:08:30+00:00 · Latest: 2026-02-03T15:08:30+00:00
Abstract
Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose KTV, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, e.g., only 504 visual tokens for a 60-min video with 10800 frames, achieving 44.8% accuracy on the MLVU-Test benchmark. In particular, KTV also exceeds several training-based approaches on certain benchmarks.
中文标题/摘要
标题:KTV:高效无训练视频LLMs的关键帧和关键视觉标记选择
无训练视频理解通过将视频视为静态帧序列,利用预训练的视觉语言模型(VLMs)的强大图像理解能力,从而避免了昂贵的视频特定训练的需要。然而,这种范式通常会遭受严重的视觉冗余和高计算开销,尤其是在处理长视频时。至关重要的是,现有的关键帧选择策略,尤其是基于CLIP相似性的策略,容易产生偏差,并且可能会无意中忽略关键帧,导致视频理解效果不佳。为了解决这些重大挑战,我们提出了一种新颖的两阶段框架KTV,用于高效且有效的无训练视频理解。在第一阶段,KTV通过聚类帧级视觉特征进行无问题的关键帧选择,生成一个紧凑、多样且具有代表性的帧子集,从而减轻时间冗余。在第二阶段,KTV应用关键视觉标记选择,根据标记的重要性及其冗余性从每个选定的关键帧中修剪冗余或信息量较少的标记,这显著减少了输入LLM的标记数量。在Multiple-Choice VideoQA任务上的广泛实验表明,KTV在使用显著较少的视觉标记的同时,优于最先进的无训练基线,例如,对于一个包含10800帧的60分钟视频,仅使用504个视觉标记,MLVU-Test基准测试的准确率达到44.8%。特别地,KTV在某些基准测试上也超过了几个基于训练的方法。
Summary / 总结
KTV is a two-stage framework for efficient training-free video understanding that addresses visual redundancy and computational overhead. It first selects keyframes by clustering frame-level visual features, then prunes redundant or less informative tokens from these keyframes. Experiments show that KTV outperforms state-of-the-art training-free baselines with significantly fewer visual tokens: for example, it uses only 504 visual tokens for a 60-minute video with 10800 frames and reaches 44.8% accuracy on the MLVU-Test benchmark.
KTV 是一个两阶段框架,用于高效的无训练视频理解,解决视觉冗余和计算开销问题。它首先通过聚类帧级视觉特征选择关键帧,然后从这些关键帧中修剪冗余的标记。实验表明,KTV 在使用更少的视觉标记时优于最先进的无训练方法,在 MLVU-Test 基准上达到 44.8% 的准确率,用于一个包含 10800 帧的 60 分钟视频。
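The first stage, question-agnostic keyframe selection by clustering, can be sketched as below. The abstract does not say which clustering algorithm KTV uses, so plain k-means with evenly spaced initial centroids, and picking the frame nearest each centroid, are assumptions of this sketch; `select_keyframes` is a hypothetical name and the features are assumed to be precomputed per-frame vectors.

```python
def select_keyframes(features, k, iters=10):
    """Cluster per-frame feature vectors and return, in temporal order,
    the index of the frame closest to each cluster centroid."""
    n = len(features)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic initialization: evenly spaced frames as centroids.
    centroids = [list(features[(i * n) // k]) for i in range(k)]
    assign = [0] * n
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(f, centroids[c]))
                  for f in features]
        for c in range(k):
            members = [features[i] for i in range(n) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    keyframes = []
    for c in range(k):
        indices = [i for i in range(n) if assign[i] == c]
        if indices:
            keyframes.append(
                min(indices, key=lambda i: dist2(features[i], centroids[c])))
    return sorted(keyframes)
```

Because the selection depends only on the visual features, the same keyframes can be reused for every question about a video, which is what "question-agnostic" implies here.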
TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection
Authors: Alireza Salehi, Ehsan Karami, Sepehr Noey, Sahand Noey, Makoto Yamada, Reshad Hosseini, Mohammad Sabokrou
Venue: ICASSP
First: 2026-02-03T14:48:11+00:00 · Latest: 2026-02-03T14:48:11+00:00
Comments: This is the extended version of the paper accepted in ICASSP'26, which will be publicly available in May. Authors' contributions may vary among the versions
Abstract
Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior work compensates with complex auxiliary modules yet largely overlooks the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection and learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.
中文标题/摘要
标题:TIPS 过于技巧:简单提示实现有效的零样本异常检测
异常检测在安全关键环境中识别预期行为的偏差。当目标域正常数据不可用时,零样本异常检测(ZSAD)利用视觉语言模型(VLMs)。然而,CLIP 的粗略图像-文本对齐限制了定位和检测,由于(i)空间对齐不准确和(ii)对细微异常的敏感性较弱;先前的工作通过复杂的辅助模块进行补偿,但很大程度上忽视了骨干网络的选择。我们重新审视了骨干网络,并使用 TIPS-VLM,该模型通过空间感知目标进行训练。虽然 TIPS 缓解了 CLIP 的问题,但它暴露了全局特征和局部特征之间的分布差距。我们通过分离的提示(固定用于图像级检测,可学习用于像素级定位)并注入局部证据到全局评分中来解决这一问题。在没有针对 CLIP 的特定技巧的情况下,基于 TIPS 的管道在七个工业数据集上分别提高了图像级性能 1.1-3.9% 和像素级性能 1.5-6.9%,实现了强大的泛化能力,同时保持了简洁的架构。代码可在 github.com/AlirezaSalehy/Tipsomaly 获取。
Summary / 总结
The paper addresses the challenge of zero-shot anomaly detection in safety-critical settings where target-domain normal data are unavailable. It proposes using TIPS, a vision-language model trained with spatially aware objectives, to improve both localization and detection. By employing decoupled prompts and injecting local evidence into the global score, the method enhances performance across seven industrial datasets, achieving improvements of 1.1-3.9% in image-level detection and 1.5-6.9% in pixel-level localization without relying on CLIP-specific tricks. The lean architecture shows strong generalization capabilities.
论文针对目标领域正常数据不可用时的零样本异常检测挑战,提出使用具有空间感知目标训练的TIPS视觉语言模型来改进异常检测。方法采用分离的提示进行图像级检测和可学习的提示进行像素级定位,这提高了定位和检测的准确性。TIPS基线方法在七个工业数据集上优于先前方法,图像级性能提高了1.1-3.9%,像素级性能提高了1.5-6.9%,同时保持了强大的泛化能力及简洁的架构。
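One simple way to inject local evidence into a global score is to blend the image-level score with the mean of the strongest pixel-level responses. The abstract does not give the fusion rule, so the weighted average, the top-k pooling, and the parameter names `weight` and `top_k` below are assumptions for illustration only.

```python
def fused_image_score(global_score, pixel_scores, weight=0.5, top_k=3):
    """Blend an image-level anomaly score with the mean of the
    top_k highest pixel-level anomaly scores."""
    flat = sorted((p for row in pixel_scores for p in row), reverse=True)
    top = flat[:top_k]
    local = sum(top) / len(top)
    return (1 - weight) * global_score + weight * local
```

Pooling only the highest pixel scores lets a small defect raise the image-level score even when the global feature barely reacts to it.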
Optimization and Generation in Aerodynamics Inverse Design
Authors: Huaguan Chen, Ning Lin, Luxi Chen, Rui Zhang, Wenbing Huang, Chongxuan Li, Hao Sun
First: 2026-02-03T14:32:26+00:00 · Latest: 2026-02-03T14:32:26+00:00
Abstract
Inverse design with physics-based objectives is challenging because it couples high-dimensional geometry with expensive simulations, as exemplified by aerodynamic shape optimization for drag reduction. We revisit inverse design through two canonical solutions, the optimal design point and the optimal design distribution, and relate them to optimization and guided generation. Building on this view, we propose a new training loss for cost predictors and a density-gradient optimization method that improves objectives while preserving plausible shapes. We further unify existing training-free guided generation methods. To address their inability to approximate conditional covariance in high dimensions, we develop a time- and memory-efficient algorithm for approximate covariance estimation. Experiments on a controlled 2D study and high-fidelity 3D aerodynamic benchmarks (car and aircraft), validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes, demonstrate consistent gains in both optimization and guided generation. Additional offline RL results further support the generality of our approach.
中文标题/摘要
标题:气动逆设计中的优化与生成
基于物理目标的逆设计具有挑战性,因为它将高维几何与昂贵的模拟耦合在一起,例如气动形状优化以减少阻力。我们通过两个典型的解决方案——最优设计点和最优设计分布——重新审视逆设计,并将它们与优化和引导生成联系起来。在此基础上,我们提出了一种新的成本预测器训练损失,并提出了一种密度梯度优化方法,该方法在提高目标的同时保持了合理的形状。我们进一步统一了现有的无训练引导生成方法。为了解决它们在高维空间中无法近似条件协方差的问题,我们开发了一种时间效率和内存效率高的近似协方差估计算法。在受控的2D研究和高保真的3D气动基准测试(汽车和飞机)上的实验,通过OpenFOAM模拟和使用3D打印原型的小型风洞测试,证明了在优化和引导生成方面的一致性改进。额外的离线强化学习结果进一步支持了我们方法的通用性。
Summary / 总结
The research aims to optimize aerodynamic shapes for drag reduction by addressing the challenges of inverse design with physics-based objectives. The authors propose a new training loss for cost predictors and a density-gradient optimization method, which improves objectives while maintaining plausible shapes. Experiments on 2D and 3D aerodynamic benchmarks, validated by simulations and wind-tunnel tests, show consistent improvements in both optimization and guided generation.
研究旨在通过解决基于物理目标的逆向设计挑战,优化和生成降低阻力的气动形状。作者提出了一种新的训练损失函数和密度梯度优化方法,该方法在保持合理形状的同时提高了目标值。通过2D和3D气动基准测试的实验,以及由模拟和风洞测试验证,展示了在优化和引导生成方面的持续改进。
Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
Authors: Tong Wang, Yunhan Zhao, Shu Kong
First: 2026-01-31T16:42:55+00:00 · Latest: 2026-02-03T14:05:53+00:00
Abstract
Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Particularly, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses LMM to construct a "paracosm", where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
中文标题/摘要
标题:生成平行宇宙以实现无需训练的零样本组合图像检索
组合图像检索(CIR)是指使用包含参考图像和修改文本的多模态查询从数据库中检索目标图像的任务。文本说明如何修改参考图像以形成“心理图像”,基于此,CIR 应在数据库中找到目标图像。CIR 的基本挑战在于这种“心理图像”是不可物理获取的,仅由查询隐式定义。当代文献追求零样本方法,使用大型多模态模型(LMM)生成给定多模态查询的文本描述,然后使用视觉语言模型(VLM)进行文本-视觉匹配以搜索目标图像。相比之下,我们从第一原理出发,直接生成“心理图像”以实现更准确的匹配。特别地,我们提示 LMM 生成给定多模态查询的“心理图像”,并提议使用此“心理图像”来搜索目标图像。由于“心理图像”与真实图像之间存在合成到现实的领域差距,我们还为数据库中的每个真实图像生成一个合成对应物以促进匹配。因此,我们的方法使用 LMM 构建一个“平行宇宙”,其中它匹配多模态查询和数据库图像。因此,我们称此方法为平行宇宙。值得注意的是,平行宇宙是一种无需训练的零样本 CIR 方法。它在四个具有挑战性的基准测试中显著优于现有零样本方法,实现了零样本 CIR 的最佳性能。
Summary / 总结
The paper addresses the challenge of Composed Image Retrieval (CIR) by directly generating a 'mental image' using a Large Multimodal Model (LMM) for a given multimodal query, and then searching for the target image in the database. This approach, named Paracosm, constructs a synthetic 'paracosm' to facilitate more accurate matching. Experiments show that Paracosm outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
论文通过使用大型多模态模型(LMM)直接生成给定多模态查询的‘心理图像’,然后在数据库中查找目标图像来解决组成图像检索(CIR)的挑战。这种方法称为Paracosm,构建了一个合成的‘平行宇宙’来弥合‘心理图像’与真实图像之间的差距。该方法无需训练,并在四个基准测试中显著优于现有零样本方法,实现了CIR的最先进性能。
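The matching step can be sketched as nearest-neighbor search between the embedding of the generated "mental image" and the embeddings of the synthetic counterparts of the database images. Image generation and encoding are not shown. The sketch assumes the embeddings were precomputed by some image encoder, and both function names and the use of cosine similarity are assumptions rather than details taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_database(query_embedding, synthetic_embeddings):
    """Return database image ids, best match first.

    synthetic_embeddings maps each real image's id to the embedding of
    its synthetic counterpart, so that query and candidates are compared
    on the same (synthetic) side of the domain gap."""
    scored = [(cosine(query_embedding, embedding), image_id)
              for image_id, embedding in synthetic_embeddings.items()]
    return [image_id for _, image_id in sorted(scored, reverse=True)]
```

Comparing synthetic to synthetic is the point of the design: the query never has to be matched directly against real photographs.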
Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning
Authors: Xufei Zhang, Xinjiao Zhou, Ziling Deng, Dongdong Geng, Jianxiong Wang
First: 2026-02-03T13:48:09+00:00 · Latest: 2026-02-03T13:48:09+00:00
Comments: 6 pages, 6 figures
Abstract
Logical anomalies are violations of predefined constraints on object quantity, spatial layout, and compositional relationships in industrial images. While prior work largely treats anomaly detection as a binary decision, such formulations cannot indicate which logical rule is broken and therefore offer limited value for quality assurance. We introduce Logical Anomaly Classification (LAC), a task that unifies anomaly detection and fine-grained violation classification in a single inference step. To tackle LAC, we propose LogiCls, a vision-language framework that decomposes complex logical constraints into a sequence of verifiable subqueries. We further present a data-centric instruction synthesis pipeline that generates chain-of-thought (CoT) supervision for these subqueries, coupling precise grounding annotations with diverse image-text augmentations to adapt vision language models (VLMs) to logic-sensitive reasoning. Training is stabilized by a difficulty-aware resampling strategy that emphasizes challenging subqueries and long tail constraint types. Extensive experiments demonstrate that LogiCls delivers robust, interpretable, and accurate industrial logical anomaly classification, providing both the predicted violation categories and their evidence trails.
中文标题/摘要
标题:通过约束分解和指令微调实现可解释的逻辑异常分类
逻辑异常是工业图像中对象数量、空间布局和组成关系预定义约束的违反。尽管先前的工作主要将异常检测视为二元决策,但这种形式无法指出哪个逻辑规则被违反,因此在质量保证方面提供的价值有限。我们引入了逻辑异常分类(LAC),这是一个将异常检测和细粒度违规分类统一到单步推理的任务。为了解决LAC,我们提出了LogiCls,这是一种视觉-语言框架,将复杂的逻辑约束分解为一系列可验证的子查询。我们还提出了一种以数据为中心的指令合成流水线,为这些子查询生成带有推理链(CoT)监督的精确语义标注,结合多样化的图像-文本增强来适应视觉语言模型(VLMs)的逻辑敏感推理。通过一种难度感知的重采样策略来稳定训练,该策略强调具有挑战性的子查询和长尾约束类型。广泛的实验表明,LogiCls 提供了稳健、可解释和准确的工业逻辑异常分类,同时提供了预测的违规类别及其证据轨迹。
Summary / 总结
The paper addresses the challenge of logical anomalies in industrial images by introducing Logical Anomaly Classification (LAC), which combines anomaly detection with detailed violation classification. The proposed LogiCls framework decomposes complex logical constraints into verifiable subqueries and uses a data-centric instruction synthesis pipeline to generate chain-of-thought supervision. Training is enhanced with a difficulty-aware resampling strategy. Experiments show that LogiCls provides robust, interpretable, and accurate classification of logical anomalies, offering both violation categories and evidence trails.
论文通过引入逻辑异常分类(LAC),将异常检测与详细的违规分类结合,解决了工业图像中的逻辑异常问题。提出的LogiCls框架将复杂的逻辑约束分解为可验证的子查询,并使用数据为中心的指令合成流水线为这些子查询生成监督。通过困难感知的重采样策略增强训练。实验表明,LogiCls提供了稳健、可解释和准确的逻辑异常分类,同时提供了预测的违规类别及其证据轨迹。
Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance
Authors: Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang
First: 2026-02-03T13:08:31+00:00 · Latest: 2026-02-03T13:08:31+00:00
Abstract
Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how can LVLMs be adapted to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures.
中文标题/摘要
标题:解耦骨架与肉身:具有解耦对齐和结构感知指导的高效多模态表格推理
对于大型视觉-语言模型(LVLM),表格图像的推理仍然具有挑战性,因为存在复杂的布局和紧密耦合的结构-内容信息。现有解决方案往往依赖昂贵的监督训练、强化学习或外部工具,限制了效率和可扩展性。本工作解决了一个关键问题:如何在最少标注和不使用外部工具的情况下使LVLM适应表格推理?具体而言,我们首先引入了DiSCo,这是一种解耦结构-内容对齐框架,在多模态对齐过程中明确分离结构抽象和语义接地,高效地使LVLM适应表格结构。在此基础上,我们进一步提出了Table-GLS,这是一种全局到局部结构引导推理框架,通过结构化探索和证据导向的推理进行表格推理。广泛的实验表明,我们的框架有效地增强了LVLM的表格理解和推理能力,特别是在泛化到未见过的表格结构方面表现出色。
Summary / 总结
This work addresses the challenge of table reasoning for Large Vision-Language Models (LVLMs) by introducing DiSCo, a disentangled structure-content alignment framework, and Table-GLS, a global-to-local structure-guided reasoning framework. These methods help LVLMs efficiently adapt to table structures with minimal annotation and no external tools, enhancing their understanding and reasoning capabilities, especially for unseen table structures. Extensive experiments across various benchmarks show the effectiveness of the proposed frameworks.
该研究通过引入DiSCo(分离结构内容对齐框架)和Table-GLS(全局到局部结构引导推理框架),解决了大型视觉-语言模型(LVLM)在表格推理方面的挑战。这些方法帮助LVLM在最少标注和无需外部工具的情况下高效适应表格结构,显著提升了其在各种基准上的理解和推理能力,特别是在处理未见过的表格结构方面。
PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers
Authors: Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, Zeke Xie
First: 2026-02-01T07:47:06+00:00 · Latest: 2026-02-03T13:02:26+00:00
Comments: 17 pages
Abstract
Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only to critical key-value blocks, it suffers from degradation at high sparsity because it discards context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, a property that is essential for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drops the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91 times and 2.57 times speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality. Code is available at: https://github.com/xie-lab-ml/piecewise-sparse-attention.
中文标题/摘要
标题:PISA:分段稀疏注意机制使扩散变换器更高效
扩散变换器对于视频和图像生成至关重要,但其效率受到注意力机制二次复杂度的限制。虽然块稀疏注意力通过仅关注关键的键值块来加速计算,但在高稀疏度下会因丢弃上下文而导致性能下降。在本研究中,我们发现非关键块的注意分数表现出分布稳定性,允许它们被准确且高效地近似,而不是被丢弃,这对于稀疏注意力机制的设计至关重要。受这一关键洞察的启发,我们提出了PISA,一种无需训练的分段稀疏注意力机制,具有亚二次复杂度。与传统的保留或丢弃非关键块信息的模式不同,PISA 引入了一种精确或近似的新型策略:它对关键块进行精确计算,同时通过块级泰勒展开高效近似其余部分。这种设计使PISA能够作为全注意力的忠实代理,有效地弥合了速度与质量之间的差距。实验结果表明,PISA 在 Wan2.1-14B 和 Hunyuan-Video 上分别实现了1.91倍和2.57倍的加速,同时在稀疏注意力方法中保持了最高的质量。值得注意的是,即使在 FLUX 上进行图像生成,PISA 也实现了1.2倍的加速,而不会牺牲视觉质量。代码可在 https://github.com/xie-lab-ml/piecewise-sparse-attention/ 获取。
Summary / 总结
PISA (Piecewise Sparse Attention) addresses the efficiency bottleneck in diffusion transformers by exploiting the distributional stability of non-critical blocks' attention scores. It introduces an exact-or-approximate strategy, maintaining exact computation for critical blocks and approximating the remainder through block-wise Taylor expansion. Experiments show speedups of 1.91 times on Wan2.1-14B and 2.57 times on Hunyuan-Video while maintaining the highest quality among sparse attention methods, and a 1.2 times acceleration for image generation on FLUX without compromising visual quality.
本文通过提出PISA(Piecewise Sparse Attention)机制来解决扩散变压器的效率瓶颈。受非关键块注意力分数分布稳定性的启发,PISA 有效地近似这些分数而不是直接丢弃。该方法对关键块进行精确计算,并使用块级泰勒展开近似其余部分,从而实现亚二次复杂度。实验表明,PISA 在 Wan2.1-14B 和 Hunyuan-Video 上分别提供 1.91 到 2.57 倍的加速,且保持了高质量,并且在 FLUX 上进行图像生成时也实现了 1.2 倍的加速而不影响视觉质量。
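The exact-or-approximate idea in the PISA entry above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the block-selection rule (ranking blocks by the score of their mean key), the first-order expansion around the block mean, and the names `piecewise_attention`, `block_size`, and `num_exact` are all assumptions made for illustration. For a non-critical block with mean key k̄, it uses exp(q·k) ≈ exp(q·k̄)(1 + q·(k − k̄)), so the block contributes through query-independent statistics instead of per-key scores.

```python
import numpy as np

def full_attention(q, K, V):
    """Reference single-query softmax attention."""
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w @ V) / w.sum()

def piecewise_attention(q, K, V, block_size, num_exact):
    """Exact attention on the top `num_exact` blocks, first-order Taylor elsewhere.

    Hypothetical sketch: len(K) must be divisible by block_size.
    """
    d, dv = q.shape[0], V.shape[1]
    scale = 1.0 / np.sqrt(d)
    Kb = K.reshape(-1, block_size, d)      # (B, n, d)
    Vb = V.reshape(-1, block_size, dv)     # (B, n, dv)
    centers = Kb.mean(axis=1)              # (B, d) mean key per block
    center_scores = centers @ q * scale    # (B,)

    # Assumed selection rule: blocks whose mean key scores highest are "critical".
    exact = np.zeros(len(Kb), dtype=bool)
    exact[np.argsort(center_scores)[::-1][:num_exact]] = True

    # Exact part: per-key scores only for critical blocks.
    se = Kb[exact] @ q * scale             # (E, n)
    shift = center_scores.max() if se.size == 0 else max(center_scores.max(), se.max())
    we = np.exp(se - shift)
    num = np.einsum("en,enk->k", we, Vb[exact])
    den = we.sum()

    # Approximate part: expand exp() around each block's mean-key score.
    Ka, Va = Kb[~exact], Vb[~exact]
    base = np.exp(center_scores[~exact] - shift)       # (A,)
    dK = Ka - centers[~exact][:, None, :]              # key deviations from the mean
    M = np.einsum("and,ank->adk", dK, Va)              # query-independent statistic
    first = scale * np.einsum("d,adk->ak", q, M)       # first-order correction
    num = num + (base[:, None] * (Va.sum(axis=1) + first)).sum(axis=0)
    # Deviations sum to zero within a block, so the normalizer needs no correction.
    den = den + (base * block_size).sum()
    return num / den
```

In this sketch the per-block quantities `Va.sum(axis=1)` and `M` do not depend on the query, so they could be cached and reused across queries; the approximation is only accurate when keys within a block stay close to their mean.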
Contextualized Visual Personalization in Vision-Language Models
Authors: Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon
First: 2026-02-03T12:21:26+00:00 · Latest: 2026-02-03T12:21:26+00:00
Comments: Project Page: https://github.com/oyt9306/CoViP
Abstract
Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.
中文标题/摘要
标题:基于上下文的视觉个性化在视觉语言模型中的应用
尽管在视觉语言模型(VLMs)方面取得了近期进展,但现有方法往往无法根据用户的特定经历生成个性化的回应,因为它们缺乏将视觉输入与用户积累的视觉文本上下文关联起来的能力。我们首次将这一挑战形式化为基于上下文的视觉个性化,这要求VLMs在解释新图像时能够识别视觉内容并检索个性化的视觉体验。为了解决这一问题,我们提出了CoViP,这是一种统一框架,将个性化图像描述视为基于上下文的视觉个性化的核心任务,并通过基于强化学习的后训练和描述增强生成来提高这一能力。我们还引入了诊断性评估,明确排除了文本捷径解决方案,并验证VLMs是否真正利用了视觉上下文。广泛的实验表明,现有的开源和专有VLMs存在显著局限性,而CoViP不仅提高了个性化图像描述的能力,还在下游个性化任务中实现了整体改进。这些结果突显了CoViP作为实现稳健和泛化的基于上下文的视觉个性化的关键阶段。
Summary / 总结
The research addresses the limitation of existing vision-language models (VLMs) in generating personalized responses based on users' specific experiences. It proposes CoViP, a unified framework that enhances personalized image captioning through reinforcement learning and caption-augmented generation. Experiments show that CoViP significantly improves personalized image captioning and benefits other personalization tasks, highlighting its importance for robust contextualized visual personalization.
研究解决了现有视觉语言模型在生成基于用户特定经历的个性化响应方面的局限性。提出了CoViP统一框架,通过强化学习和带有图 caption 的生成增强个性化图像描述。实验结果表明,CoViP在个性化图像描述方面取得了显著改进,并且对其他个性化任务也有益处,优于现有模型。
Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility
Authors: Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
First: 2026-02-03T11:26:05+00:00 · Latest: 2026-02-03T11:26:05+00:00
Abstract
Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.
中文标题/摘要
标题:风险意识注入:在不牺牲实用性的情况下为视觉语言模型校准安全性
视觉语言模型(VLMs)将大型语言模型(LLMs)的推理能力扩展到跨模态设置中,但仍高度易受多模态脱缰攻击。现有防御措施主要依赖于安全性微调或激进的标记操作,导致高昂的训练成本或显著降低实用性。最近的研究表明,LLMs 本身能够识别文本中的不安全内容,而将视觉输入纳入 VLMs 中通常会稀释与风险相关的信息。受此启发,我们提出了一种轻量级且无需训练的框架——风险意识注入(RAI),通过放大 VLMs 中的不安全信号来恢复 LLM 类似的风险识别能力。具体而言,RAI 从语言嵌入中构建一个不安全原型子空间,并对选定的高风险视觉标记进行有针对性的调节,明确激活跨模态特征空间中的安全关键信号。这种调节恢复了模型从视觉输入中检测不安全内容的 LLM 类似能力,同时保持原始标记的语义完整性以进行跨模态推理。在多个脱缰和实用性基准上的广泛实验表明,RAI 在不牺牲任务性能的情况下显著降低了攻击成功率。
Summary / 总结
The paper addresses the vulnerability of vision-language models (VLMs) to multimodal jailbreak attacks by proposing Risk Awareness Injection (RAI), a lightweight and training-free framework. RAI amplifies unsafe signals in VLMs by constructing an Unsafe Prototype Subspace from language embeddings and modulating high-risk visual tokens, thereby restoring the model's ability to detect unsafe content from visual inputs without degrading task performance. Experiments show that RAI significantly reduces attack success rates across various benchmarks while maintaining utility.
论文提出了一种名为Risk Awareness Injection (RAI)的轻量级框架,通过放大视觉语言模型(VLMs)中的不安全信号来应对多模态脱缰攻击,而无需额外训练。RAI从语言嵌入中构建一个不安全原型子空间,并对选定的高风险视觉标记进行调制,以恢复模型从视觉输入中检测不安全内容的能力,同时保持跨模态推理中的语义完整性。实验表明,RAI显著降低了攻击成功率,而不损害任务性能。
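The two steps named in the RAI entry above, building an unsafe subspace from language embeddings and modulating selected high-risk visual tokens, can be sketched as follows. The abstract does not give the construction, so everything concrete here is an assumption: the subspace is taken as the top right-singular vectors of the unsafe text embeddings, risk is scored as the fraction of a token's norm lying in that subspace, and the names `unsafe_subspace`, `inject_risk_awareness`, `top_k`, and `alpha` are hypothetical.

```python
import numpy as np

def unsafe_subspace(unsafe_text_embs, rank):
    """Return an orthonormal basis (rank, d) spanning the dominant unsafe directions."""
    _, _, vt = np.linalg.svd(unsafe_text_embs, full_matrices=False)
    return vt[:rank]

def risk_scores(tokens, basis):
    """Fraction of each token's norm that lies inside the unsafe subspace."""
    proj = tokens @ basis.T @ basis
    return np.linalg.norm(proj, axis=1) / np.linalg.norm(tokens, axis=1)

def inject_risk_awareness(visual_tokens, basis, top_k, alpha):
    """Amplify the unsafe-subspace component of the top_k riskiest visual tokens."""
    proj = visual_tokens @ basis.T @ basis
    selected = np.argsort(risk_scores(visual_tokens, basis))[::-1][:top_k]
    out = visual_tokens.copy()
    out[selected] += alpha * proj[selected]   # all other tokens stay untouched
    return out, selected
```

Only the selected tokens change, and only along the subspace directions, which mirrors the abstract's aim of preserving the semantic content of the remaining tokens.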
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Authors: Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Lei Zhang, Heung-Yeung Shum
First: 2025-08-12T17:57:04+00:00 · Latest: 2026-02-03T11:16:25+00:00
Comments: https://zxyin.github.io/ColorCtrl
Abstract
Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
中文标题/摘要
标题:基于多模态扩散变换器的无训练文本引导颜色编辑
图像和视频中的文本引导颜色编辑是一个基本但尚未解决的问题,需要精细地操控颜色属性,包括反射率、光源颜色和环境照明,同时保持几何、材料属性和光物质相互作用的物理一致性。现有的无训练方法在编辑任务中具有广泛的适用性,但难以实现精确的颜色控制,并且常常在编辑和未编辑区域引入视觉不一致性。在本工作中,我们提出了ColorCtrl,这是一种利用现代多模态扩散变换器(MM-DiT)的注意力机制的无训练颜色编辑方法。通过针对注意力图和值令牌进行目标化的操控,我们的方法能够实现准确且一致的颜色编辑,并且能够对属性强度进行词级控制。我们的方法仅修改由提示指定的预期区域,而不影响无关区域。在SD3和FLUX.1-dev上的大量实验表明,ColorCtrl在编辑质量和一致性方面均优于现有的无训练方法,并达到了最先进的性能。此外,我们的方法在一致性方面超越了强大的商业模型,如FLUX.1 Kontext Max和GPT-4o图像生成。当扩展到视频模型如CogVideoX时,我们的方法表现出更大的优势,特别是在保持时间连贯性和编辑稳定性方面。最后,我们的方法还能够应用于基于指令的编辑扩散模型,如Step1X-Edit和FLUX.1 Kontext dev,进一步证明了其多功能性。
Summary / 总结
ColorCtrl is a training-free color editing method that uses Multi-Modal Diffusion Transformers to disentangle structure and color through targeted manipulation of attention maps and value tokens. It enables accurate and consistent color editing with word-level control of attribute intensity, leaving unrelated areas untouched. Experiments on SD3 and FLUX.1-dev show that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performance in both edit quality and consistency, and that it surpasses commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in consistency. When extended to video models like CogVideoX, it shows greater advantages in temporal coherence and editing stability.
研究旨在解决精确且一致的文字引导颜色编辑问题,这对于在保持物理一致性的同时进行精细的颜色属性操作至关重要。方法ColorCtrl利用多模态扩散变换器通过目标操纵注意力图和值令牌来分离结构和颜色,从而实现准确且一致的颜色编辑,并具有词级的属性强度控制。实验表明,ColorCtrl在编辑质量和一致性方面均优于现有训练免费的方法,并且在保持时间连贯性和编辑稳定性方面也超越了商业模型。
GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer
Authors: Junmo Cho, Suhan Kim, Sangjune An, Minsu Kim, Dong Bok Lee, Heejun Lee, Sung Ju Hwang, Hae Beom Lee
First: 2026-02-03T10:30:03+00:00 · Latest: 2026-02-03T10:30:03+00:00
Abstract
Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, and rewards are sparse due to expensive target-LM evaluation. Yet, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.
中文标题/摘要
标题:GFlowPO:生成流网络作为语言模型提示优化器
找到有效提示对于语言模型(LMs)至关重要但极其困难:提示空间是组合性的,奖励稀疏,因为目标LM评估昂贵。然而,现有的基于强化学习的提示优化器通常依赖于基于策略的更新和从固定分布中采样的元提示,导致样本效率低下。我们提出GFlowPO,这是一种概率提示优化框架,将提示搜索问题表述为由元提示引导的参考LM先验正则化的潜在提示后验推断问题。第一步,我们使用基于回放的训练策略和生成流网络(GFlowNet)目标对轻量级提示LM进行微调,该策略重用过去的提示评估以实现高效的探索。第二步,我们引入动态记忆更新(DMU),这是一种无需训练的机制,通过注入(i)回放缓冲区中的多样化提示和(ii)优先队列中的高绩效提示来更新元提示,从而逐步将搜索过程集中在高奖励区域。在少量样本文本分类、指令归纳基准测试和问答任务中,GFlowPO始终优于最近的离散提示优化基线。
Summary / 总结
GFlowPO is a probabilistic prompt optimization framework that addresses the challenge of finding effective prompts for language models by casting prompt search as a posterior inference problem. It uses a fine-tuned lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective and a Dynamic Memory Update (DMU) mechanism to update the meta-prompt. GFlowPO outperforms recent discrete prompt optimization baselines across various tasks including few-shot text classification, instruction induction, and question answering.
GFlowPO 是一种将提示搜索视为后验推断问题的概率框架,使用带有离策略生成流网络目标的轻量级提示-LM 和 动态内存更新机制来更新元提示。GFlowPO 在各种任务中表现出色,优于最近的离散提示优化方法。
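The Dynamic Memory Update step in the GFlowPO entry above combines diverse prompts from a replay buffer with top-performing prompts from a small priority queue. A minimal sketch of that bookkeeping is shown below. The class `PromptMemory`, the function `build_meta_prompt`, uniform sampling as the notion of "diverse", and the bullet-list meta-prompt layout are all assumptions; the abstract does not specify any of them.

```python
import heapq
import random

class PromptMemory:
    """Replay buffer of all evaluated prompts plus a bounded queue of the best ones."""

    def __init__(self, queue_size):
        self.queue_size = queue_size
        self.replay_buffer = []   # every (reward, prompt) seen so far
        self.top_queue = []       # min-heap holding the queue_size best entries

    def add(self, prompt, reward):
        self.replay_buffer.append((reward, prompt))
        if len(self.top_queue) < self.queue_size:
            heapq.heappush(self.top_queue, (reward, prompt))
        else:
            # Push the new entry, then drop the current worst one.
            heapq.heappushpop(self.top_queue, (reward, prompt))

    def top_prompts(self):
        return [p for _, p in sorted(self.top_queue, reverse=True)]

def build_meta_prompt(memory, instruction, num_diverse, rng):
    """Assemble a meta-prompt from top prompts plus uniformly sampled buffer prompts."""
    top = memory.top_prompts()
    candidates = [p for _, p in memory.replay_buffer if p not in top]
    diverse = rng.sample(candidates, min(num_diverse, len(candidates)))
    return "\n".join([instruction] + [f"- {p}" for p in top + diverse])
```

A caller would invoke `memory.add(prompt, reward)` after each target-LM evaluation and rebuild the meta-prompt periodically, for example with `build_meta_prompt(memory, "Example instructions:", 2, random.Random(0))`.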
V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Authors: Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Xikun Bao, Lin Chen, Zehui Chen, Qing Miao, Chenxi Liu, Jie Zhao, Feng Zhao
First: 2025-03-22T11:30:46+00:00 · Latest: 2026-02-03T10:18:20+00:00
Comments: Project Page: https://vlm-reasoning.github.io/V2P-Bench/
Abstract
Large Vision-Language Models (LVLMs) have recently made significant strides in video understanding. Nevertheless, existing video benchmarks predominantly rely on text prompts for evaluation, which often require complex referential language and in turn diminish both the accuracy and efficiency of human-model interaction. To address this limitation, we propose V2P-Bench, a robust and comprehensive benchmark for evaluating the ability of LVLMs to understand Video Visual Prompts in human-model interaction scenarios. V2P-Bench consists of 980 videos and 1172 well-structured high-quality QA pairs, each paired with manually annotated visual prompt frames. The benchmark spans three main tasks and twelve categories, thereby enabling fine-grained, instance-level evaluation. Through an in-depth analysis of current LVLMs, we identify several key findings: 1) Visual prompts are both more model-friendly and user-friendly in interactive scenarios than text prompts, leading to significantly improved model performance and enhanced user experience. 2) Models are reasonably capable of zero-shot understanding of visual prompts, but struggle with spatiotemporal understanding. Even o1 achieves only 71.8%, far below the human expert score of 88.3%, while most open-source models perform below 60%. 3) LVLMs exhibit pervasive Hack Phenomena in video question answering tasks, which become more pronounced as video length increases and frame sampling density decreases, thereby inflating performance scores artificially. We anticipate that V2P-Bench will not only shed light on these challenges but also serve as a foundational tool for advancing human-model interaction and improving the evaluation of video understanding.
中文标题/摘要
标题:V2P-Bench:通过视觉提示评估视频-语言理解以改善人-模型交互
大型视觉-语言模型(LVLMs)在视频理解领域取得了显著进展。然而,现有的视频基准主要依赖于文本提示进行评估,这往往需要复杂的参照语言,从而降低了人-模型交互的准确性和效率。为解决这一局限性,我们提出了V2P-Bench,这是一个用于评估LVLMs在人-模型交互场景中理解视频视觉提示能力的稳健且全面的基准。V2P-Bench包含980个视频和1172个高质量的结构化问答对,每个问答对都与手动标注的视觉提示帧配对。基准涵盖了三个主要任务和十二个类别,从而实现细粒度、实例级的评估。通过对当前LVLMs的深入分析,我们发现了几个关键发现:1)在交互场景中,视觉提示比文本提示更友好,对模型和用户都更友好,从而显著提高了模型性能并增强了用户体验。2)模型在零样本理解视觉提示方面表现合理,但在时空理解方面存在困难。即使o1也只能达到71.8%,远低于人类专家的88.3%的得分,而大多数开源模型的得分低于60%。3)LVLMs在视频问答任务中普遍存在黑客现象,随着视频长度的增加和帧采样密度的降低,这种现象变得更加明显,从而人为地提高了性能得分。我们预计V2P-Bench不仅会揭示这些挑战,还将作为推动人-模型交互和改进视频理解评估的基础工具。
Summary / 总结
V2P-Bench is a new benchmark designed to evaluate the ability of large vision-language models (LVLMs) to understand video visual prompts in human-model interaction scenarios. It consists of 980 videos and 1172 QA pairs with manually annotated visual prompt frames. The benchmark reveals that visual prompts are more model-friendly and user-friendly than text prompts, leading to improved model performance and user experience. However, models struggle with spatiotemporal understanding and exhibit Hack Phenomena, especially in longer videos with sparser frame sampling, which can artificially inflate performance scores. The findings highlight the need for more robust evaluation methods in video understanding tasks.
V2P-Bench 是一个新基准,用于评估大型视觉语言模型(LVLMs)在人类模型交互场景中理解视频视觉提示的能力。它包含980个视频和1172个带有手动标注视觉提示帧的QA对。该基准揭示视觉提示比文本提示更友好,有助于提高模型性能和用户体验。然而,模型在时空理解方面存在困难,并且在较长视频和较少帧采样时表现出Hack现象,这可能会人为地提高性能分数。研究结果强调了需要更好的评估方法和模型改进以提高视频理解任务的效果。
No time to train! Training-Free Reference-Based Instance Segmentation
Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
First: 2025-07-03T16:59:01+00:00 · Latest: 2026-02-03T10:17:35+00:00
Comments: Preprint
Abstract
The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
中文标题/摘要
标题:没有时间训练!基于参考的实例分割无需训练
图像分割模型的历史性能一直受到大规模标注数据收集高成本的限制。Segment Anything Model (SAM) 通过可提示的、语义无关的分割范式缓解了这一原始问题,但仍需要手动视觉提示或复杂的领域特定提示生成规则来处理新图像。为了减少这种新的负担,我们的工作研究了仅提供少量参考图像时的对象分割任务。我们的关键洞察是利用基础模型学习到的强语义先验,在参考图像和目标图像之间识别对应的区域。我们发现对应关系能够自动生成实例级分割掩码以供下游任务使用,并通过一个无需训练的多阶段方法实现我们的想法,该方法包括(1)记忆库构建;(2)表示聚合;(3)语义感知特征匹配。我们的实验显示在分割指标上取得了显著改进,达到了COCO FSOD(36.8% nAP)、PASCAL VOC 少量样本(71.2% nAP50)的最佳性能,并在跨域少量样本分割基准上优于现有无需训练的方法(22.4% nAP)。
Summary / 总结
This work addresses the challenge of image segmentation by leveraging a small set of reference images to automatically generate instance-level segmentation masks without the need for training. The method uses strong semantic priors from foundation models to identify correspondences between reference and target images, and incorporates a multi-stage, training-free approach including memory bank construction, representation aggregation, and semantic-aware feature matching. Experiments demonstrate significant improvements in segmentation metrics, achieving state-of-the-art performance on COCO FSOD and PASCAL VOC Few-Shot benchmarks, and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark.
该研究通过利用预训练模型自动生成实例级分割掩码,解决了图像分割的挑战,无需进行训练。方法依赖于少量的参考图像,并采用多阶段、无需训练的方法,包括记忆库构建、表示聚合和语义感知特征匹配。实验结果显示在分割指标上取得了显著改进,达到了COCO FSOD和PASCAL VOC Few-Shot基准的最先进性能,并在跨域FSOD基准上优于现有无需训练的方法。
MemeLens: Multilingual Multitask VLMs for Memes
Authors: Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Abul Hasnat, Dimitar Dimitrov, Giovanni Da San Martino, Preslav Nakov, Firoj Alam
First: 2026-01-18T19:01:03+00:00 · Latest: 2026-02-03T09:57:57+00:00
Comments: disinformation, misinformation, factuality, harmfulness, fake news, propaganda, hateful meme, multimodality, text, images
Abstract
Memes are a dominant medium for online communication and manipulation because meaning emerges from interactions between embedded text, imagery, and cultural context. Existing meme research is distributed across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap we propose MemeLens, a unified multilingual and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filter and map dataset-specific labels into a shared taxonomy of 20 tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We will make the experimental resources and datasets publicly available for the community.
中文标题/摘要
标题:MemeLens:多语言多任务VLMs用于表情包理解
表情包是在线交流和操控的主要媒介,其含义源自嵌入的文字、图像和文化背景之间的互动。现有的表情包研究分布在不同的任务(仇恨、性别歧视、宣传、情感、幽默)和语言中,这限制了跨域泛化能力。为解决这一问题,我们提出MemeLens,一种统一的多语言和多任务解释增强视觉语言模型(VLM),用于表情包理解。我们整合了38个公开的表情包数据集,过滤并映射数据集特定的标签到一个包含20个任务的共享分类体系,涵盖伤害、目标、比喻/语用意图和情感。我们进行了全面的建模范式、任务类别和数据集的实证分析。我们的研究结果表明,稳健的表情包理解需要多模态训练,表现出在语义类别间存在显著差异,并且在针对单个数据集进行微调时仍对过度专业化敏感。我们将公开实验资源和数据集供社区使用。
Summary / 总结
MemeLens is a unified multilingual and multitask VLM designed to enhance meme understanding by integrating 38 public meme datasets into a shared taxonomy of 20 tasks. The model demonstrates the necessity of multimodal training and highlights the variation in meme understanding across different semantic categories. Fine-tuning on individual datasets leads to over-specialization, while unified training improves robustness. The research aims to bridge the gap in cross-domain generalization of meme research across languages and tasks.
MemeLens 是一个统一的多语言和多任务 VLM,通过将 38 个公共 meme 数据集整合到一个包含 20 个任务的共享分类中来提升 meme 理解能力。实验表明,稳健的 meme 理解需要多模态训练,并且在针对单一数据集进行微调时容易过度专业化。跨不同建模范式和任务类别的全面实证分析强调了采用统一方法来应对 meme 通信和操纵多样性的必要性。实验资源和数据集将对社区公开以供进一步研究。
RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization
Authors: Songming Liu, Bangguo Li, Kai Ma, Lingxuan Wu, Hengkai Tan, Xiao Ouyang, Hang Su, Jun Zhu
First: 2026-02-03T09:38:23+00:00 · Latest: 2026-02-03T09:38:23+00:00
Abstract
Vision-Language-Action (VLA) models hold promise for generalist robotics but currently struggle with data scarcity, architectural inefficiencies, and the inability to generalize across different hardware platforms. We introduce RDT2, a robotic foundation model built upon a 7B parameter VLM designed to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. To achieve this, we collected one of the largest open-source robotic datasets--over 10,000 hours of demonstrations in diverse families--using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference. Consequently, RDT2 becomes one of the first models that simultaneously zero-shot generalizes to unseen objects, scenes, instructions, and even robotic platforms. Besides, it outperforms state-of-the-art baselines in dexterous, long-horizon, and dynamic downstream tasks like playing table tennis. See https://rdt-robotics.github.io/rdt2/ for more information.
中文标题/摘要
标题:RDT2:探索UMI数据的缩放极限以实现零样本跨体态泛化
视觉-语言-行动(VLA)模型有望为通用机器人技术提供支持,但目前面临数据稀缺性、架构效率低下以及无法在不同硬件平台上泛化的问题。我们提出了RDT2,这是一种基于7B参数视觉语言模型的机器人基础模型,旨在实现对新体态的零样本部署以完成开放词汇任务。为此,我们收集了目前最大的开源机器人数据集——超过10,000小时的演示数据,涵盖了多种不同的家庭环境,并使用增强的、体态无关的通用操作接口(UMI)。我们的方法采用了一种新颖的三阶段训练方案,通过残差向量量化(RVQ)、流匹配和蒸馏将离散的语言知识与连续控制对齐,以实现实时推理。因此,RDT2 成为了第一个能够零样本泛化到未见过的物体、场景、指令甚至机器人平台的模型。此外,它在乒乓球等灵巧、长时序和动态下游任务中也优于最先进的基线模型。更多信息请参见 https://rdt-robotics.github.io/rdt2/。
Summary / 总结
RDT2 is a robotic foundation model designed to enable zero-shot cross-embodiment generalization for open-vocabulary tasks. It uses a 7B parameter vision-language-model and a novel three-stage training recipe involving Residual Vector Quantization, flow-matching, and distillation. The model was trained on an extensive dataset of over 10,000 hours of demonstrations using an enhanced Universal Manipulation Interface. Key findings include RDT2's ability to generalize to unseen objects, scenes, and robotic platforms, and its superior performance in dexterous tasks such as playing table tennis compared to state-of-the-art baselines.
研究旨在解决视觉-语言-动作模型在机器人领域面临的数据稀缺性和架构效率问题,重点在于实现跨体态的零样本泛化。RDT2 是一个基于7B参数的机器人基础模型,通过一种新颖的三阶段训练方法,结合残差向量量化、流匹配和蒸馏,实现了对未见过的对象、场景和机器人平台的零样本泛化,超越了现有基线模型在如乒乓球等灵巧、长时序和动态下游任务中的表现。该模型使用增强的通用操作接口训练了超过10,000小时的演示数据。更多信息请参见https://rdt-robotics.github.io/rdt2/。
UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval
Authors: Hongyu Guo, Xiangzhao Hao, Jiarui Guo, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua
First: 2025-08-06T07:02:39+00:00 · Latest: 2026-02-03T09:30:40+00:00
Abstract
Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly fine-tune pre-trained vision-language models to achieve performance gains, yet suffer from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it, we can convert each image into an image-description pair, enabling more comprehensive feature representation, and construct the multimodal category templates using few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLM-based approaches.
中文标题/摘要
标题:UniFGVC: 基于属性感知多模态检索的通用无训练少样本细粒度视觉分类
少样本细粒度视觉分类(FGVC)旨在利用有限的数据使模型能够区分细微不同的类别。近期工作主要通过微调预训练的视觉语言模型来实现性能提升,但容易导致过拟合和泛化能力弱。为解决这一问题,我们提出了UniFGVC,这是一种通用的无训练框架,将少样本FGVC重新定义为多模态检索。首先,我们提出了类别区分视觉描述生成器(CDV-描述生成器)来利用多模态大型语言模型(MLLMs)的开放世界知识,生成结构化的文本描述,捕捉区分密切相关类别的细粒度属性特征。CDV-描述生成器使用链式思考提示和视觉相似的参考图像来减少幻觉并增强生成描述的区分度。使用它,我们可以将每张图像转换为图像-描述对,从而实现更全面的特征表示,并使用少量样本构建多模态类别模板,以供后续检索管道使用。然后,现成的视觉和文本编码器嵌入查询和模板对,FGVC通过在联合空间中检索最近的模板来完成。UniFGVC确保与多种MLLMs和编码器具有广泛的兼容性,提供可靠的泛化能力和适应性,适用于少样本FGVC场景。在12个FGVC基准上的广泛实验表明,它在少样本CLIP基线方法和几种完全监督的MLLMs基线方法中表现出一致的优越性。
Summary / 总结
UniFGVC is a training-free framework for few-shot fine-grained visual classification that leverages attribute-aware multimodal retrieval. It introduces the Category-Discriminative Visual Captioner (CDV-Captioner) to generate structured text descriptions that capture fine-grained attribute features, using chain-of-thought prompting and visually similar reference images. This enables the conversion of images into image-description pairs and the construction of multimodal category templates. The framework uses off-the-shelf vision and text encoders to embed query and template pairs and classifies each query by retrieving its nearest template. Experiments on 12 benchmarks show that UniFGVC outperforms previous few-shot CLIP-based methods and some fully supervised MLLM-based approaches.
UniFGVC 是一个无需训练的框架,用于少量样本细粒度视觉分类,利用多模态检索并结合细粒度属性感知。它引入了类别区分视觉描述器(CDV-Captioner),生成包含细粒度属性的结构化文本描述,使用链式思考提示和视觉相似参考图像。这将图像转换为图像-描述对,并构建多模态类别模板。该框架使用现成的视觉和文本编码器嵌入和检索最近的模板,跨12个细粒度视觉分类基准测试表现出优于先前的少量样本CLIP基方法和一些完全监督的多模态基方法的性能。
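The retrieval step described above (embed query and template pairs, then pick the nearest template in the joint space) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say how the joint space is built or which distance is used, so this sketch concatenates an image embedding with a text embedding and ranks templates by cosine similarity. The function names and the toy vectors are hypothetical and stand in for the outputs of real encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_by_retrieval(query_img, query_txt, templates):
    """Return (label, score) of the template nearest to the query.

    templates: list of (label, image_embedding, text_embedding).
    Assumption: the joint representation is a plain concatenation.
    """
    query = list(query_img) + list(query_txt)
    best_label, best_score = None, -2.0  # cosine is always >= -1
    for label, img_emb, txt_emb in templates:
        score = cosine(query, list(img_emb) + list(txt_emb))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Toy few-shot templates with made-up 2-D embeddings per modality.
templates = [
    ("sparrow", [1.0, 0.0], [0.9, 0.1]),
    ("finch", [0.0, 1.0], [0.1, 0.9]),
]
label, score = classify_by_retrieval([0.8, 0.2], [0.7, 0.3], templates)
print(label, round(score, 3))
```

Because no parameters are updated, adding a new class in this sketch only means appending another template tuple, which is the property the training-free framing relies on.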
POP: Prefill-Only Pruning for Efficient Large Model Inference
Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li
First: 2026-02-03T09:22:26+00:00 · Latest: 2026-02-03T09:22:26+00:00
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
中文标题/摘要
标题:POP:仅预填充剪枝以提高大型模型推理效率
大型语言模型(LLMs)和视觉-语言模型(VLMs)展现了显著的能力。然而,它们的部署受到显著计算成本的阻碍。现有的结构化剪枝方法虽然在硬件效率方面表现出色,但往往会导致显著的准确率下降。在本文中,我们提出这种失败源于一种不考虑预填充和解码阶段之间不对称作用的阶段无关剪枝方法。通过引入虚拟门机制,我们的重要性分析表明,深层层对于下一个标记的预测(解码)至关重要,但在上下文编码(预填充)中则几乎冗余。利用这一洞察,我们提出了仅预填充剪枝(POP),这是一种阶段感知的推理策略,在计算密集的预填充阶段安全地省略深层层,而在敏感的解码阶段保留完整模型。为了实现阶段之间的过渡,我们引入了独立的键-值(KV)投影以保持缓存的完整性,并采用边界处理策略以确保生成的第一个标记的准确性。在Llama-3.1、Qwen3-VL和Gemma-3等不同模态上的广泛实验表明,POP在预填充延迟上实现了高达1.37倍的加速,同时保持了最小的性能损失,有效地克服了现有结构化剪枝方法的准确率-效率权衡限制。
Summary / 总结
This paper addresses the computational challenges of deploying large language models and vision-language models by proposing Prefill-Only Pruning (POP), a stage-aware inference strategy. POP leverages the asymmetric roles of the prefill and decode stages to safely omit deep layers during the prefill stage while retaining the full model for the decode stage. The method introduces independent Key-Value projections and a boundary handling strategy to maintain cache integrity and ensure accuracy. Experiments show that POP achieves up to 1.37 times speedup in prefill latency with minimal performance loss.
本文提出了一种阶段感知的推理策略Prefill-Only Pruning (POP),以解决部署大型语言模型(LLMs)和视觉-语言模型(VLMs)的计算挑战。POP 利用prefill 和decode阶段的不对称作用,在prefill阶段安全地省略深层,而在decode阶段保留完整模型,并通过独立的Key-Value投影和边界处理策略保持缓存完整性并确保首个生成标记的准确性。实验表明,POP 可以在最小的性能损失下实现prefill延迟最多1.37倍的加速。
Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models
Authors: Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Yongshuai Hou, Weili Guan, Jun Yu, Min Zhang
First: 2025-09-26T07:07:03+00:00 · Latest: 2026-02-03T08:40:04+00:00
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we conduct a systematic study of the spatial bias of LVLMs, examining how models respond when identical key visual information is placed at different locations within an image. Through controlled probing experiments, we observe that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a clear spatial bias in their semantic understanding. Further analysis indicates that this bias does not stem from the vision encoder, but rather from a mismatch in attention mechanisms between the vision encoder and the large language model, which disrupts the global information flow. Motivated by this insight, we propose Adaptive Global Context Injection (AGCI), a lightweight mechanism that dynamically injects shared global visual context into each image token. AGCI works without architectural modifications, mitigating spatial bias by enhancing the semantic accessibility of image tokens while preserving the model's intrinsic capabilities. Extensive experiments demonstrate that AGCI not only enhances the spatial robustness of LVLMs, but also achieves strong performance on various downstream tasks and hallucination benchmarks.
中文标题/摘要
标题:超越视觉编码器:识别和缓解大型视觉-语言模型的空间偏差
大型视觉-语言模型(LVLMs)在多种跨模态任务中取得了显著的成功,但它们对空间变化的鲁棒性仍然不够理解。在本文中,我们系统地研究了LVLMs的空间偏差,考察了当相同的关键视觉信息在图像中不同位置放置时,模型如何响应。通过受控的探针实验,我们观察到当前的LVLMs在这样的空间位移下经常产生不一致的输出,揭示了它们在语义理解中的明显空间偏差。进一步的分析表明,这种偏差并非来自视觉编码器,而是视觉编码器与大型语言模型之间的注意力机制不匹配,这破坏了全局信息流。受此见解的启发,我们提出了自适应全局上下文注入(AGCI),这是一种轻量级机制,能够动态地将共享的全局视觉上下文注入到每个图像标记中。AGCI 不需要架构修改,通过增强图像标记的语义可访问性来缓解空间偏差,同时保留模型的固有能力。广泛的实验表明,AGCI 不仅增强了LVLMs的空间鲁棒性,还在各种下游任务和幻觉基准测试中取得了优异的性能。
Summary / 总结
This study investigates the spatial bias in Large Vision-Language Models (LVLMs) by examining their response to identical visual information placed at different locations within an image. The research reveals that current LVLMs often produce inconsistent outputs under such spatial shifts, and further analysis traces this bias not to the vision encoder but to a mismatch in attention mechanisms between the vision encoder and the language model. To address this, the authors propose Adaptive Global Context Injection (AGCI), a lightweight mechanism that injects shared global visual context into each image token, enhancing the semantic accessibility of image tokens without altering the model architecture. Experiments show that AGCI improves the spatial robustness of LVLMs and performs well on various downstream tasks and hallucination benchmarks.
研究通过在图像中不同位置放置相同的视觉信息来考察大型视觉-语言模型(LVLMs)的空间偏差。研究发现,当前的LVLMs在这样的空间变化下会产生不一致的输出,表明视觉编码器和语言模型之间的注意力机制存在不匹配。为了解决这一问题,作者提出了一种轻量级机制——自适应全局上下文注入(AGCI),该机制通过在每个图像标记中动态注入共享的全局视觉上下文来增强语义访问性,而不进行架构修改。实验表明,AGCI不仅提高了LVLMs的空间鲁棒性,还在各种下游任务和幻觉基准测试中表现出色。
LaVPR: Benchmarking Language and Vision for Place Recognition
Authors: Ofer Idan, Dan Badur, Yosi Keller, Yoli Shavit
First: 2026-02-03T08:38:38+00:00 · Latest: 2026-02-03T08:38:38+00:00
Abstract
Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at https://github.com/oferidan1/LaVPR.
中文标题/摘要
标题:LaVPR:语言与视觉在地点识别中的基准测试
视觉地点识别(VPR)在极端环境变化和知觉混淆下经常失效。此外,标准系统无法仅从口头描述进行“盲”定位,这种能力对于应急响应等应用至关重要。为应对这些挑战,我们引入了LaVPR,这是一个大规模基准,扩展了现有的VPR数据集,包含超过650,000个丰富的自然语言描述。使用LaVPR,我们研究了两种范式:多模态融合以增强鲁棒性以及跨模态检索以实现基于语言的定位。我们的结果显示,语言描述在视觉退化条件下提供了持续的增益,对较小的骨干网络影响最大。值得注意的是,添加语言使紧凑模型能够与更大的纯视觉架构相媲美。对于跨模态检索,我们使用低秩适应(LoRA)和多相似性损失建立了基线,这在视觉-语言模型中显著优于标准对比方法。最终,LaVPR 使一类新的定位系统成为可能,这些系统既能够抵御现实世界的随机性,又能够在资源受限的部署中实现。我们的数据集和代码可在 https://github.com/oferidan1/LaVPR 获取。
Summary / 总结
LaVPR is a large-scale benchmark for language and vision in place recognition, targeting VPR failures under extreme environmental changes and perceptual aliasing. It extends existing VPR datasets with over 650,000 rich natural-language descriptions. Two paradigms are explored, Multi-Modal Fusion and Cross-Modal Retrieval, showing that language descriptions improve performance in visually degraded conditions, especially for smaller backbones. A cross-modal retrieval baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss substantially outperforms standard contrastive methods, enabling resilient and resource-efficient localization systems.
LaVPR 是一个用于地点识别的语言和视觉基准,旨在解决极端环境变化和知觉混叠带来的挑战。它引入了超过650,000个丰富的自然语言描述以增强鲁棒性。研究探讨了多模态融合和跨模态检索,结果显示语言描述在视觉退化条件下提高了性能,尤其是对于较小的模型。跨模态检索使用低秩适应和多相似性损失方法优于标准对比方法,能够实现鲁棒且资源高效的定位系统。
CountZES: Counting via Zero-Shot Exemplar Selection
Authors: Muhammad Ibraheem Siddiqui, Muhammad Haris Khan
First: 2025-12-18T11:12:50+00:00 · Latest: 2026-02-03T08:33:49+00:00
Abstract
Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and frequent multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among zero-shot object counting (ZOC) methods, while generalizing effectively across domains.
中文标题/摘要
标题:CountZES:通过零样本示例选择进行计数
在零样本(ZS)设置中,复杂场景中的物体计数尤其具有挑战性,其中使用仅类别名称来计数未见类别的实例。现有的ZS计数方法通过文本推断示例通常依赖现成的开放词汇检测器(OVD),在密集场景中会遭受语义噪声、外观变化和频繁的多实例建议。或者,采用随机图像块采样,但无法准确划分物体实例。为了解决这些问题,我们提出CountZES,这是一种仅用于推断的物体计数方法,通过ZS示例选择。CountZES 通过三个协同阶段发现多样化的示例:检测锚定示例(DAE)、密度引导示例(DGE)和特征共识示例(FCE)。DAE 对OVD检测进行细化以隔离精确的单实例示例。DGE 引入了一种基于密度的自我监督范式,以识别统计上一致且语义紧凑的示例,而FCE 通过特征空间聚类增强视觉一致性。这些阶段共同产生了一组互补的示例,平衡了文本基础、计数一致性和特征代表性。在多种数据集上的实验表明,CountZES 在ZOC方法中表现出优越的性能,并且在不同领域中具有良好的泛化能力。
Summary / 总结
CountZES is an inference-only approach for zero-shot object counting in complex scenes. It addresses the limitations of existing methods by using three stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines detections to isolate precise single-instance exemplars, DGE identifies statistically consistent and semantically compact exemplars, and FCE reinforces visual coherence. Experiments show CountZES outperforms other zero-shot object counting methods and generalizes well across different domains.
论文针对仅使用类别名称来计数未见类别的复杂场景中的零样本(ZS)计数挑战。提出了一种仅推理的方法CountZES,通过三个阶段选择示例:检测锚定示例(DAE)、密度引导示例(DGE)和特征共识示例(FCE)。DAE细化检测以隔离精确的单实例示例,DGE识别统计上一致且语义紧凑的示例,FCE通过特征空间聚类增强视觉一致性。实验表明,CountZES在零样本对象计数方法中表现出色,并且在不同领域中具有良好的泛化能力。
Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration
Authors: Wei Dai, Haoyu Wang, Honghao Chang, Lijun He, Fan Li, Jian Sun, Haixia Bi
First: 2026-02-03T06:06:35+00:00 · Latest: 2026-02-03T06:06:35+00:00
Comments: 12 pages
Abstract
Vision Language Models (VLMs) typically assume complete modality input during inference. However, their effectiveness drops sharply when certain modalities are unavailable or incomplete. Current research primarily faces two dilemmas: prompt-based methods struggle to restore missing yet indispensable features and impair the generalization of VLMs, while imputation-based approaches, lacking effective guidance, are prone to generating semantically irrelevant noise. Restoring precise semantics while sustaining VLM generalization remains challenging. Therefore, we propose a general missing modality restoration strategy in this paper. We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) a Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment. Zero-shot evaluations across benchmark datasets demonstrate that our approach outperforms existing baseline methods. Extensive experiments and ablation studies confirm our model as a robust and scalable extension for VLMs in missing modality scenarios, ensuring reliability across diverse missing rates and environments. Our code and models will be publicly available.
中文标题/摘要
标题:增强基础VLM对缺失模态的鲁棒性:双向特征恢复的可扩展扩散模型
视觉语言模型(VLMs)通常假设推理时输入完整模态。然而,当某些模态不可用或不完整时,其效果会急剧下降。当前研究主要面临两个难题:基于提示的方法难以恢复缺失但至关重要的特征,影响VLMs的泛化能力;基于插补的方法缺乏有效指导,容易生成语义无关的噪声。恢复精确语义并保持VLMs的泛化能力仍然具有挑战性。因此,我们在本文中提出了一种通用的缺失模态恢复策略。我们引入了一种增强的扩散模型作为可插拔的中间阶段训练模块,以有效恢复缺失特征。我们的策略引入了两个关键创新:(I)动态模态门控,根据条件特征自适应地引导生成语义一致的特征;(II)跨模态互学习机制,通过连接双编码器的语义空间实现双向对齐。零样本评估表明,我们的方法优于现有基线方法。广泛的实验和消融研究证实,我们的模型是VLMs在缺失模态场景下的一个稳健且可扩展的扩展,确保在各种缺失率和环境中具有可靠性。我们的代码和模型将公开。
Summary / 总结
This paper addresses the challenge of restoring missing modalities in Vision Language Models (VLMs) to enhance their robustness. It proposes an enhanced diffusion model with two key innovations: Dynamic Modality Gating and Cross-Modal Mutual Learning. The model effectively restores missing features and maintains VLM generalization. Zero-shot evaluations show superior performance compared to existing methods, and extensive experiments confirm its robustness and scalability across various missing rates and environments.
研究解决了视觉语言模型(VLMs)在某些模态缺失时效果下降的问题,提出了一种增强的扩散模型,包含动态模态门控和跨模态互学习两大创新,以恢复缺失特征并保持VLM的一般性。实验表明,该方法优于现有方法,并且在各种缺失率和环境中表现出高度的鲁棒性。
FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion
Authors: Chen-Bin Feng, Youyang Sha, Longfei Liu, Yongjun Yu, Chi Man Vong, Xuanlong Yu, Xi Shen
Venue: ICLR 2026
First: 2026-02-03T05:45:22+00:00 · Latest: 2026-02-03T05:45:22+00:00
Comments: Accepted by ICLR 2026. Code is available at: https://intellindust-ai-lab.github.io/projects/FSOD-VFM
Abstract
In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network. This reweighting process refines the scores of proposals, assigning higher confidence to whole objects and lower confidence to local, fragmented parts. This strategy improves detection granularity and effectively reduces the occurrence of false-positive bounding box proposals. Through extensive experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training. Notably, on the challenging CD-FSOD dataset, which spans multiple datasets and domains, our FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. Code is available at: https://intellindust-ai-lab.github.io/projects/FSOD-VFM.
中文标题/摘要
标题:FSOD-VFM:基于视觉基础模型和图扩散的少样本目标检测
在本文中,我们提出了FSOD-VFM:基于视觉基础模型的少样本目标检测器框架,该框架利用视觉基础模型解决少样本目标检测的挑战。FSOD-VFM 结合了三个关键组件:通用提议网络(UPN)用于生成类别无关的边界框,SAM2 用于准确的掩码提取,以及 DINOv2 特征用于高效地适应新目标类别。尽管基础模型具有强大的泛化能力,但 UPN 生成的边界框经常出现过度碎片化的问题,仅覆盖对象的部分区域,导致产生大量小的假阳性提议,而非准确、完整的目标检测。为了解决这一问题,我们引入了一种新颖的基于图的置信度重新加权方法。在我们的方法中,预测的边界框被建模为有向图中的节点,通过图扩散操作在图中传播置信度分数。这一重新加权过程细化了提议的分数,将更高的置信度赋予整个对象,将较低的置信度赋予局部、碎片化的部分。这种策略提高了检测的粒度,并有效地减少了假阳性边界框提议的出现。通过在 Pascal-5$^i$、COCO-20$^i$ 和 CD-FSOD 数据集上的广泛实验,我们证明了我们的方法显著优于现有方法,无需额外训练即可实现更优的性能。值得注意的是,在涵盖多个数据集和领域的挑战性 CD-FSOD 数据集上,我们的 FSOD-VFM 在 10-shot 设置中达到了 31.6 AP,远超之前仅达到 21.4 AP 的无训练方法。代码可在:https://intellindust-ai-lab.github.io/projects/FSOD-VFM 获取。
Summary / 总结
FSOD-VFM is a framework for few-shot object detection that uses vision foundation models. It includes a universal proposal network, SAM2 for mask extraction, and DINOv2 features for adaptation. To address overfragmentation issues, a graph-based confidence reweighting method is introduced, which refines proposal scores and reduces false positives. Experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD show that FSOD-VFM outperforms existing methods, achieving 31.6 AP in the 10-shot setting on CD-FSOD.
FSOD-VFM 是一种利用视觉基础模型的少样本目标检测框架,包含通用提议网络、SAM2 用于掩码提取和 DINOv2 特征。为了解决边界框过度碎片化的问题,引入了一种基于图的置信度重新加权方法,提高了检测精度并减少了假阳性。实验表明,FSOD-VFM 在 Pascal-5$^i$、COCO-20$^i$ 和 CD-FSOD 数据集上优于现有方法,在 CD-FSOD 的 10-shot 设置中达到 31.6 AP。
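The graph-based confidence reweighting in the FSOD-VFM entry (boxes as nodes of a directed graph, confidence propagated by diffusion so that whole objects gain and fragments lose) can be sketched as follows. This is not the authors' formulation: the abstract gives neither the edge definition nor the update rule, so the sketch assumes an edge from a box to every larger box that contains at least 90% of it, and a personalized-PageRank-style update with restart weight `1 - alpha`. All names, thresholds, and the toy boxes are hypothetical, and the resulting scores are unnormalized.

```python
def area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def containment(inner, outer):
    """Fraction of `inner` that lies inside `outer`."""
    overlap = (max(inner[0], outer[0]), max(inner[1], outer[1]),
               min(inner[2], outer[2]), min(inner[3], outer[3]))
    inner_area = area(inner)
    return area(overlap) / inner_area if inner_area > 0 else 0.0


def diffuse_confidence(boxes, scores, alpha=0.5, thresh=0.9, steps=20):
    """Push confidence from fragment boxes to the larger boxes containing them."""
    n = len(boxes)
    # Assumed edge rule: i -> j when j is larger and covers most of i.
    out_edges = [
        [j for j in range(n)
         if j != i and area(boxes[j]) > area(boxes[i])
         and containment(boxes[i], boxes[j]) >= thresh]
        for i in range(n)
    ]
    current = list(scores)
    for _ in range(steps):
        # Restart term keeps part of each box's original score.
        updated = [(1 - alpha) * s for s in scores]
        for i in range(n):
            # Boxes with no container keep their own mass (self-loop).
            targets = out_edges[i] or [i]
            share = alpha * current[i] / len(targets)
            for j in targets:
                updated[j] += share
        current = updated
    return current


# One whole-object box and two fragments that initially score higher.
boxes = [(0, 0, 10, 10), (0, 0, 4, 4), (6, 6, 10, 10)]
scores = [0.50, 0.60, 0.55]
reweighted = diffuse_confidence(boxes, scores)
print([round(s, 3) for s in reweighted])
```

In this toy case the enclosing box ends up ranked above both fragments even though it started with the lowest score, which is the qualitative effect the entry describes.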
SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass
Authors: Chen Qian, Xinran Yu, Danyang Li, Guoxuan Chi, Zheng Yang, Qiang Ma, Xin Miao
First: 2026-02-03T05:42:51+00:00 · Latest: 2026-02-03T05:42:51+00:00
Abstract
Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.
中文标题/摘要
标题:SwiftVLM:通过跨层令牌旁路实现高效的视觉-语言模型推理
视觉令牌剪枝是降低视觉-语言模型(VLMs)计算成本的一种有前途的方法,现有方法通常依赖于早期的剪枝决策来提高效率。虽然在粗粒度的推理任务上有效,但在需要细粒度视觉细节的任务上会遭受显著的性能下降。通过逐层分析,我们揭示了视觉令牌重要性在各层之间存在显著差异,表明在浅层被认为不重要的令牌在后续的文本条件推理中可能变得非常重要。为了避免由于过早剪枝而导致不可逆的关键信息丢失,我们引入了一种新的剪枝范式,称为旁路,该范式保留未选择的视觉令牌,并将其传递到后续的剪枝阶段进行重新评估。基于这一范式,我们提出了一种简单且无需训练的方法SwiftVLM,在模型特定的层上进行剪枝,具有强大的视觉令牌选择能力,同时允许各层之间独立的剪枝决策。在多个VLMs和基准测试中的实验表明,SwiftVLM 一致地优于现有的剪枝策略,实现了更优的准确性和效率权衡,并具有更忠实的视觉令牌选择行为。
Summary / 总结
The research aims to improve the efficiency of vision-language models (VLMs) by addressing the limitations of existing visual token pruning methods, which often suffer from performance degradation on tasks requiring fine-grained visual details. The method introduces a new bypass paradigm that preserves unselected visual tokens for re-evaluation in subsequent layers, allowing for more accurate and efficient pruning. Experiments show that SwiftVLM, a training-free method, outperforms existing strategies in terms of accuracy-efficiency trade-offs and visual token selection behavior.
研究旨在通过解决早期剪枝方法在处理需要精细视觉细节的任务时性能下降的问题,提高视觉语言模型(VLMs)的效率。提出的SwiftVLM方法引入了一种跨层令牌旁路范式,保留未选择的视觉令牌并在后续层重新评估它们,从而实现比现有策略更好的准确性和效率权衡以及更忠实的视觉令牌选择行为。
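The difference between conventional early pruning and the bypass paradigm described for SwiftVLM can be shown with a small sketch. It is only an illustration of the idea stated in the abstract (unselected tokens are preserved and re-evaluated at later pruning stages), not the authors' implementation: the per-layer importance scores are made-up numbers, the function names are hypothetical, and how bypassed tokens are represented at later layers is not modeled.

```python
def top_k(scores, candidates, k):
    """Indices of the k highest-scoring candidates, in ascending index order."""
    ranked = sorted(candidates, key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


def cascade_prune(layer_scores, num_tokens, k):
    """Conventional pruning: a token dropped at one layer is gone for good."""
    active = list(range(num_tokens))
    history = []
    for scores in layer_scores:
        active = top_k(scores, active, k)
        history.append(active)
    return history


def bypass_prune(layer_scores, num_tokens, k):
    """Bypass pruning: unselected tokens stay available for re-evaluation,
    so each pruning layer makes an independent choice over all tokens."""
    all_tokens = list(range(num_tokens))
    return [top_k(scores, all_tokens, k) for scores in layer_scores]


# Tokens 2 and 3 look unimportant at the shallow layer but matter later.
layer_scores = [
    [0.9, 0.8, 0.1, 0.2],  # shallow pruning layer
    [0.1, 0.2, 0.9, 0.8],  # deeper pruning layer
]
cascade = cascade_prune(layer_scores, num_tokens=4, k=2)
bypass = bypass_prune(layer_scores, num_tokens=4, k=2)
print("cascade:", cascade)
print("bypass: ", bypass)
```

With cascaded pruning the deeper layer can only choose among tokens 0 and 1, whereas the bypass variant recovers tokens 2 and 3 once they become relevant.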