arXiv 论文速递

2025-10-23 03:29
Snapshot: 20251023_0329
DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Authors: Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao
First: 2025-10-21T17:59:36+00:00 · Latest: 2025-10-21T17:59:36+00:00
Abstract
Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench provides valuable findings and insights about the future development of general and expertise models with dynamic spatial intelligence.
中文标题/摘要
标题:DSI-Bench:动态空间智能基准
关于动态空间关系的推理至关重要,因为观察者和物体经常同时移动。尽管视觉语言模型(VLMs)和视觉专门模型在二维任务和静态场景中表现出色,但它们理解动态三维场景的能力仍然有限。我们引入了动态空间智能,并提出了DSI-Bench基准,该基准包含近1000个动态视频和超过1700个手动标注的问题,涵盖了观察者和物体的九种解耦运动模式。空间和时间对称的设计减少了偏差,使模型对自身运动和物体运动的推理进行系统的评估成为可能。我们对14个VLMs和专家模型的评估揭示了关键的局限性:模型经常混淆观察者和物体的运动,表现出语义偏差,并且在动态场景中无法准确推断相对关系。我们的DSI-Bench提供了关于未来开发具有动态空间智能的一般和专门模型的重要发现和见解。
Summary / 总结
The research aims to evaluate models' ability to understand dynamic spatial relationships, which are crucial in scenarios where both observers and objects are moving. The study introduces DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 questions, covering nine motion patterns. Evaluating 14 vision-language models and expert models, the study finds that these models often confuse observer and object motions, show semantic biases, and struggle to infer accurate relationships in dynamic scenarios. DSI-Bench provides insights for improving models' dynamic spatial intelligence.
研究旨在评估模型在观察者和物体同时移动的动态空间关系理解能力。研究引入了DSI-Bench基准,包含近1,000个动态视频和超过1,700个问题,覆盖九种运动模式。评估14个视觉语言模型和专家模型后发现,这些模型常常混淆观察者和物体的运动,表现出语义偏见,并且在动态场景中难以准确推断关系。DSI-Bench为提高模型的动态空间智能提供了有价值的见解。
FedDEAP: Adaptive Dual-Prompt Tuning for Multi-Domain Federated Learning
Authors: Yubin Zheng, Pak-Hei Yeung, Jing Xia, Tianjie Ju, Peng Tang, Weidong Qiu, Jagath C. Rajapakse
Venue: MM 2025
First: 2025-10-21T17:32:44+00:00 · Latest: 2025-10-21T17:32:44+00:00
Comments: Accepted at MM 2025
Abstract
Federated learning (FL) enables multiple clients to collaboratively train machine learning models without exposing local data, balancing performance and privacy. However, domain shift and label heterogeneity across clients often hinder the generalization of the aggregated global model. Recently, large-scale vision-language models like CLIP have shown strong zero-shot classification capabilities, raising the question of how to effectively fine-tune CLIP across domains in a federated setting. In this work, we propose an adaptive federated prompt tuning framework, FedDEAP, to enhance CLIP's generalization in multi-domain scenarios. Our method includes the following three key components: (1) To mitigate the loss of domain-specific information caused by label-supervised tuning, we disentangle semantic and domain-specific features in images by using semantic and domain transformation networks with unbiased mappings; (2) To preserve domain-specific knowledge during global prompt aggregation, we introduce a dual-prompt design with a global semantic prompt and a local domain prompt to balance shared and personalized information; (3) To maximize the inclusion of semantic and domain information from images in the generated text features, we align textual and visual representations under the two learned transformations to preserve semantic and domain consistency. Theoretical analysis and extensive experiments on four datasets demonstrate the effectiveness of our method in enhancing the generalization of CLIP for federated image recognition across multiple domains.
中文标题/摘要
标题:FedDEAP:多域联邦学习中的自适应双提示调优
联邦学习(FL)使多个客户端能够在不暴露本地数据的情况下协作训练机器学习模型,平衡性能和隐私。然而,客户端之间存在的领域偏移和标签异质性往往阻碍了聚合全局模型的泛化能力。最近,大规模的跨模态模型如CLIP展示了强大的零样本分类能力,引发了如何在联邦设置中有效微调CLIP跨领域的问题。在本文中,我们提出了一种自适应联邦提示调优框架FedDEAP,以增强CLIP在多域场景中的泛化能力。我们的方法包括以下三个关键组件:(1) 为减轻标签监督调优导致的领域特定信息损失,我们通过使用具有无偏映射的语义和领域变换网络,将语义和领域特定特征分离;(2) 为在全局提示聚合过程中保留领域特定知识,我们引入了一种双提示设计,包含一个全局语义提示和一个局部领域提示,以平衡共享和个性化信息;(3) 为最大限度地将图像中的语义和领域信息包含在生成的文本特征中,我们在两种学习变换下对文本和视觉表示进行对齐,以保持语义和领域一致性。在四个数据集上的理论分析和大量实验表明,我们的方法在跨多个领域增强CLIP的联邦图像识别泛化能力方面的有效性。
Summary / 总结
FedDEAP is an adaptive federated prompt tuning framework designed to improve CLIP's generalization in multi-domain federated learning. It uses semantic and domain transformation networks to disentangle features, introduces a dual-prompt design to balance shared and personalized information, and aligns textual and visual representations to preserve consistency. Experiments on four datasets show that FedDEAP enhances CLIP's performance in federated image recognition across multiple domains.
FedDEAP 是一种自适应联邦提示调优框架,旨在提高 CLIP 在多域联邦学习中的泛化能力。它使用语义和领域变换网络来分离特征,引入双提示设计以平衡共享和个性化信息,并对文本和视觉表示进行对齐以保持一致性。在四个数据集上的实验表明,FedDEAP 能够增强 CLIP 在多域图像识别中的泛化性能。
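The dual-prompt idea in component (2) can be illustrated with a minimal sketch: the global semantic prompt is aggregated across clients, while each local domain prompt never leaves its client. The client names, the two-dimensional prompt vectors, and the size-weighted average (FedAvg-style) are assumptions made for illustration; the abstract does not specify FedDEAP's actual aggregation rule.

```python
# Hypothetical sketch of dual-prompt aggregation. The weighted average below
# is an assumed FedAvg-style rule, not necessarily the one FedDEAP uses.

def aggregate_global_prompt(client_prompts, client_sizes):
    """Average per-client global prompts, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(p[i] * n for p, n in zip(client_prompts, client_sizes)) / total
        for i in range(dim)
    ]

# Toy clients from two domains (all values are made up).
clients = {
    "photo":  {"global": [1.0, 0.0], "local": [0.2, 0.9], "n": 300},
    "sketch": {"global": [0.0, 1.0], "local": [0.7, 0.1], "n": 100},
}

new_global = aggregate_global_prompt(
    [c["global"] for c in clients.values()],
    [c["n"] for c in clients.values()],
)

for c in clients.values():
    c["global"] = list(new_global)  # shared semantic prompt is synchronized
    # c["local"] is deliberately untouched: the domain prompt stays personal
```

Only the `global` entries change during a round; keeping `local` out of the aggregation is what lets each client retain domain-specific knowledge.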
Glyph: Scaling Context Windows via Visual-Text Compression
Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
First: 2025-10-20T17:58:56+00:00 · Latest: 2025-10-21T17:12:48+00:00
Abstract
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
中文标题/摘要
标题:Glyph:通过视觉-文本压缩扩展上下文窗口
大型语言模型(LLMs)越来越多地依赖于长上下文建模,用于文档理解、代码分析和多步推理等任务。然而,将上下文窗口扩展到百万词元级别带来了巨大的计算和内存成本,限制了长上下文LLMs的实际应用。在本工作中,我们从视觉上下文扩展的角度出发,应对这一挑战。我们不扩展基于词元的序列,而是提出了一种名为Glyph的框架,将长文本渲染为图像,并使用视觉-语言模型(VLMs)处理这些图像。这种方法在大幅压缩文本输入的同时保留了语义信息,并进一步设计了一种由LLM驱动的遗传搜索,以识别平衡准确性和压缩的最佳视觉渲染配置。通过广泛的实验,我们证明了我们的方法在各种长上下文基准测试中实现了3-4倍的词元压缩,同时保持与领先LLM(如Qwen3-8B)相当的准确性。这种压缩还使预填充和解码速度提高了约4倍,SFT训练速度提高了约2倍。此外,在极端压缩下,一个128K上下文的VLM可以扩展处理百万词元级别的文本任务。此外,渲染的文本数据也有助于实际的多模态任务,如文档理解。我们的代码和模型已发布在https://github.com/thu-coai/Glyph。
Summary / 总结
This paper addresses the challenge of scaling context windows in large language models (LLMs) to handle long documents by proposing Glyph, a framework that converts long texts into images and processes them with vision-language models (VLMs). The method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs. It also results in faster prefilling, decoding, and SFT training, and allows VLMs with a 128K-context to handle 1M-token-level tasks under extreme compression. The rendered text data also benefits multimodal tasks like document understanding.
研究旨在解决使用大型语言模型(LLMs)处理长上下文时的计算和内存挑战。引入了Glyph框架,将长文本转换为图像以减少token数量同时保留语义信息。该方法实现了3-4倍的token压缩,并在各种基准测试中保持与Qwen3-8B等领先LLM相当的准确性。它还通过约4倍和2倍的速度提升,加快了预填充、解码和SFT训练。在极端压缩下,128K上下文的VLM可以处理1M-token级别的文本任务,且渲染的文本数据有助于文档理解等多模态任务。
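The compression figures can be related with a little arithmetic. The sketch below assumes that "128K" means 128 × 1024 visual tokens and "1M" means 1024 × 1024 text tokens, and that compression is a single uniform ratio; the function names are hypothetical and nothing here comes from the Glyph codebase.

```python
# Back-of-the-envelope arithmetic only; a uniform compression ratio is assumed.

def effective_text_tokens(visual_context_tokens, compression_ratio):
    """Text tokens representable when each visual token covers `ratio` text tokens."""
    return int(visual_context_tokens * compression_ratio)

def required_ratio(text_tokens, visual_context_tokens):
    """Compression ratio needed to fit `text_tokens` into the visual context."""
    return text_tokens / visual_context_tokens

CONTEXT = 128 * 1024  # assumed meaning of "128K"

low = effective_text_tokens(CONTEXT, 3)             # reach at 3x compression
high = effective_text_tokens(CONTEXT, 4)            # reach at 4x compression
needed_for_1m = required_ratio(1024 * 1024, CONTEXT)
```

Under these assumptions, the reported 3-4x range covers roughly 384K to 512K text tokens, so reaching the 1M-token level implies a ratio of about 8x, which is consistent with the abstract describing that regime as "extreme compression".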
Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
Authors: Yuanli Wu, Long Zhang, Yue Du, Bin Li
First: 2025-10-20T12:54:32+00:00 · Latest: 2025-10-21T17:06:29+00:00
Abstract
With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization. We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter tuning. Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
中文标题/摘要
标题:基于上下文感知的伪标签评分方法在零样本视频摘要中的应用
随着视频在社交媒体、监控和教育中的爆炸式增长,压缩长视频为简洁而忠实的替代品至关重要。监督方法通过密集标签学习帧/镜头的重要性,在领域内表现出色,但成本高且跨数据集时表现脆弱;无监督方法避免使用标签,但往往遗漏高层语义和叙事线索。最近的零样本流水线使用LLM进行免训练的摘要,但仍对手工制作的提示和数据集特定的归一化敏感。我们提出了一种基于评分标准的伪标签提示框架。一小部分人类注释被转换为高置信度的伪标签,并聚合为结构化、数据集适应性的评分标准,以实现可解释的场景评估。在推理时,边界场景(首尾)从其自身描述中评分,而中间场景包括相邻段落的简要摘要以评估进展和冗余,使LLM能够在不调整参数的情况下平衡局部显著性和全局一致性。在三个基准测试中,我们的方法始终有效。在SumMe和TVSum上,它分别实现了F1值57.58和63.05,分别超越零样本基线(56.73,62.21)0.85和0.84,并接近监督性能。在查询导向的QFVS基准测试中,它达到了53.79的F1值,比基线的53.42高出0.37,并在各验证视频上保持稳定。这些结果表明,基于评分标准的伪标签标注与上下文提示相结合,稳定了基于LLM的评分,并为通用和查询导向的视频摘要提供了一种通用且可解释的零样本范式。
Summary / 总结
The paper addresses the challenge of compressing long videos into concise summaries without relying on dense labels, which are costly and dataset-specific. It proposes a rubric-guided pseudo-labeled prompting framework that uses a small set of human annotations to create high-confidence pseudo labels and structured scoring rubrics. This method enables the LLM to balance local salience with global coherence during inference. The method outperforms a zero-shot baseline on SumMe and TVSum benchmarks, achieving F1 scores of 57.58 and 63.05, respectively, and shows stability on the query-focused QFVS benchmark with an F1 score of 53.79.
论文旨在解决无需依赖密集标签压缩长视频为简洁摘要的问题,因为密集标签成本高且针对特定数据集。提出了一种基于评分准则的伪标签提示框架,利用少量的人工注释生成高置信度的伪标签和结构化的评分准则。该方法在推理时能够平衡局部显著性和全局一致性。结果表明,该方法在SumMe和TVSum基准上的F1分数分别为57.58和63.05,优于零样本基线,并且在查询导向的QFVS基准上以53.79的F1分数表现稳定。
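The inference-time rule described in the abstract (boundary scenes scored from their own description, intermediate scenes given neighbor context) can be sketched as follows. The function name, the prompt wording, and the use of simple truncation as a stand-in for "brief summaries" are all assumptions; the paper's actual prompts and rubrics are not given in the abstract.

```python
# Hypothetical sketch of context-aware prompt assembly for scene scoring.

def build_scene_context(descriptions, index, max_chars=80):
    """Return the text an LLM scorer would see for the scene at `index`."""
    last = len(descriptions) - 1
    parts = [f"Scene: {descriptions[index]}"]
    if 0 < index < last:
        # Intermediate scene: add short neighbor context so the scorer can
        # judge progression and redundancy. Truncation is a crude placeholder
        # for a real summary.
        parts.insert(0, f"Previous: {descriptions[index - 1][:max_chars]}")
        parts.append(f"Next: {descriptions[index + 1][:max_chars]}")
    return "\n".join(parts)

scenes = ["A chef enters the kitchen.", "She chops onions.", "The dish is served."]
first = build_scene_context(scenes, 0)   # boundary: own description only
middle = build_scene_context(scenes, 1)  # intermediate: includes neighbors
```

A dataset-adaptive rubric would then be prepended to each of these strings before the LLM is asked for a score.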
Rebellious Student: A Complementary Learning Framework for Background Feature Enhancement in Hyperspectral Anomaly Detection
Authors: Wenping Jin, Yuyang Tang, Li Zhu, Fei Guo
First: 2025-10-21T16:31:56+00:00 · Latest: 2025-10-21T16:31:56+00:00
Abstract
A recent class of hyperspectral anomaly detection methods that can be trained once on background datasets and then universally deployed -- without per-scene retraining or parameter tuning -- has demonstrated remarkable efficiency and robustness. Building upon this paradigm, we focus on the integration of spectral and spatial cues and introduce a novel "Rebellious Student" framework for complementary feature learning. Unlike conventional teacher-student paradigms driven by imitation, our method intentionally trains the spatial branch to diverge from the spectral teacher, thereby learning complementary spatial patterns that the teacher fails to capture. A two-stage learning strategy is adopted: (1) a spectral enhancement network is first trained via reverse distillation to obtain robust background spectral representations; and (2) a spatial network -- the rebellious student -- is subsequently optimized using decorrelation losses that enforce feature orthogonality while maintaining reconstruction fidelity to avoid irrelevant noise. Once trained, the framework enhances both spectral and spatial background features, enabling parameter-free and training-free anomaly detection when paired with conventional detectors. Extensive experiments on the HAD100 benchmark show substantial improvements over several established baselines with minimal computational overhead, confirming the effectiveness and generality of the proposed complementary learning paradigm. Our code is publicly available at https://github.com/xjpp2016/FERS.
中文标题/摘要
标题:叛逆学生:用于高光谱异常检测背景特征增强的互补学习框架
一类可以一次在背景数据集上训练并随后在所有场景中通用部署——无需每场景重新训练或参数调整——的高光谱异常检测方法已经展示了显著的效率和鲁棒性。在此基础上,我们专注于光谱和空间线索的整合,并引入了一种新颖的“叛逆学生”框架,用于互补特征学习。与传统的模仿驱动的教师-学生范式不同,我们的方法故意训练空间分支以偏离光谱教师,从而学习教师未能捕捉到的互补空间模式。采用两阶段学习策略:(1)首先通过反向蒸馏训练光谱增强网络以获得稳健的背景光谱表示;(2)随后使用去相关损失优化空间网络——叛逆学生——以确保特征正交性同时保持重建保真度,以避免无关噪声。训练完成后,该框架增强光谱和空间背景特征,与传统检测器结合时实现无参数和无训练的异常检测。在HAD100基准上的广泛实验表明,与几个现有基线相比,该框架在计算开销最小的情况下取得了显著改进,证实了所提出的互补学习范式的有效性和普适性。我们的代码已公开发布于https://github.com/xjpp2016/FERS。
Summary / 总结
The paper introduces a 'Rebellious Student' framework for enhancing background features in hyperspectral anomaly detection. Motivated by the need for efficient and robust methods that do not require scene-specific retraining, the authors propose a two-stage learning strategy. First, a spectral enhancement network is trained using reverse distillation to capture robust background spectral representations. Then, a spatial network is optimized to diverge from the spectral teacher, learning complementary spatial patterns. Experiments on the HAD100 benchmark demonstrate significant improvements over existing methods with minimal computational cost, validating the effectiveness of the complementary learning approach. The framework enables parameter-free and training-free anomaly detection when paired with conventional detectors.
该论文提出了一种名为“叛逆学生”的框架,用于增强高光谱异常检测中的背景特征。该框架旨在提高检测效率和鲁棒性,通过结合光谱和空间线索。采用两阶段学习策略:首先使用逆蒸馏训练光谱增强网络,然后优化空间网络以与光谱教师相异,学习教师未能捕捉到的互补空间模式。在HAD100基准上的实验表明,与现有方法相比,该方法具有显著改进且计算成本较低,验证了互补学习方法的有效性和普适性。代码已公开。
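The second stage relies on decorrelation losses that push the spatial student's features away from the spectral teacher's. A minimal sketch of one such term is the mean squared cross-correlation between standardized student and teacher feature dimensions. This particular form is an assumption chosen for illustration; the abstract does not give the paper's exact loss, and the reconstruction-fidelity term is omitted here.

```python
# Hypothetical decorrelation term: zero when every student dimension is
# uncorrelated with every teacher dimension, one when they are identical.

def _standardized_columns(rows):
    """Return feature columns with zero mean and unit variance."""
    n = len(rows)
    columns = []
    for d in range(len(rows[0])):
        col = [r[d] for r in rows]
        mean = sum(col) / n
        std = (sum((x - mean) ** 2 for x in col) / n) ** 0.5 or 1.0  # guard constant columns
        columns.append([(x - mean) / std for x in col])
    return columns

def decorrelation_loss(student, teacher):
    """Mean squared cross-correlation between student and teacher features.

    Both arguments are lists of per-sample feature vectors of equal length.
    """
    n = len(student)
    s_cols = _standardized_columns(student)
    t_cols = _standardized_columns(teacher)
    total = 0.0
    for s in s_cols:
        for t in t_cols:
            corr = sum(a * b for a, b in zip(s, t)) / n
            total += corr ** 2
    return total / (len(s_cols) * len(t_cols))

teacher = [[1.0], [2.0], [3.0], [4.0]]
imitating_student = [[1.0], [2.0], [3.0], [4.0]]
rebellious_student = [[1.0], [-1.0], [-1.0], [1.0]]
```

Minimizing this term rewards the "rebellious" feature pattern over the imitating one, which is the opposite of what a conventional distillation loss would do.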
VideoVerse: How Far is Your T2V Generator from a World Model?
Authors: Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang
First: 2025-10-09T16:18:20+00:00 · Latest: 2025-10-21T16:28:13+00:00
Comments: 24 Pages, 8 Figures, 11 Tables
Abstract
The recent rapid advancement of Text-to-Video (T2V) generation technologies, which are critical to building "world models", makes the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, which not only distinguishes video from other modalities but also constitutes a crucial component of world models, is severely underexplored in existing benchmarks. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that focuses on evaluating whether a T2V model can understand complex temporal causality and world knowledge in the real world. We collect representative videos across diverse domains (e.g., natural landscapes, sports, indoor scenes, science fiction, chemical and physical experiments) and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design a suite of binary evaluation questions from the perspective of dynamic and static properties, with a total of ten carefully defined evaluation dimensions. In total, our VideoVerse comprises 300 carefully curated prompts, involving 815 events and 793 binary evaluation questions. Consequently, a human-preference-aligned, QA-based evaluation pipeline is developed using modern vision-language models. Finally, we perform a systematic evaluation of state-of-the-art open-source and closed-source T2V models on VideoVerse, providing in-depth analysis of how far current T2V generators are from world models.
中文标题/摘要
标题:VideoVerse: 你的T2V生成器距离世界模型还有多远?
近期文本到视频(T2V)生成技术的迅速发展,这些技术对于构建“世界模型”至关重要,使得现有的基准越来越不足以评估最先进的T2V模型。首先,当前的评估维度,如每帧的美学质量和时间一致性,已不再能够区分最先进的T2V模型。其次,事件级的时间因果关系,不仅能够区分视频与其他模态,也是世界模型的关键组成部分,但在现有基准中严重缺乏探索。第三,现有的基准缺乏对世界知识的系统评估,这是构建世界模型所需的重要能力。为了解决这些问题,我们引入了VideoVerse,这是一个全面的基准,旨在评估T2V模型是否能够理解现实世界中的复杂时间因果关系和世界知识。我们收集了跨多个领域(如自然景观、体育、室内场景、科幻、化学和物理实验)的代表性视频,并提取了具有内在时间因果关系的事件级描述,这些描述随后由独立注释者重写为文本到视频提示。对于每个提示,我们从动态和静态属性的角度设计了一系列二元评估问题,总共定义了十个精心设计的评估维度。总共,我们的VideoVerse包含300个精心策划的提示,涉及815个事件和793个二元评估问题。因此,我们通过使用现代视觉语言模型开发了一种与人类偏好对齐的问答式评估流水线。最后,我们在VideoVerse上系统地评估了最先进的开源和闭源T2V模型,深入分析了当前T2V生成器与世界模型之间的差距。
Summary / 总结
The paper introduces VideoVerse, a new benchmark to evaluate Text-to-Video (T2V) models, addressing the limitations of existing benchmarks by focusing on complex temporal causality and world knowledge. The method involves collecting diverse videos, rewriting their descriptions into prompts, and designing a suite of evaluation questions. Key findings show that current T2V models struggle with understanding real-world temporal causality and world knowledge, indicating significant gaps from world models.
论文提出了VideoVerse,这是一个新的基准,用于评估Text-to-Video (T2V)模型,通过关注复杂的时间因果关系和世界知识来弥补现有基准的不足。方法包括收集多样化的视频并创建包含二元评估问题的文本到视频提示。该基准包含300个提示、815个事件和793个问题,并使用人类偏好对齐的问答流水线进行评估,揭示了当前T2V生成器与世界模型之间的差距。
Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation
Authors: Patterson Hsieh, Jerry Yeh, Mao-Chi He, Wen-Han Hsieh, Elvis Hsieh
First: 2025-10-21T15:59:00+00:00 · Latest: 2025-10-21T15:59:00+00:00
Abstract
Climate change is intensifying the occurrence of harmful algal blooms (HAB), particularly those caused by cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes a vision-language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.
中文标题/摘要
标题:Seg the HAB:语言引导的地理空间水华推理与分割
气候变化加剧了有害藻华(HAB)的发生,特别是蓝细菌,它们通过耗氧、毒素释放和破坏海洋生物多样性威胁到水生生态系统和人类健康。传统的监测方法,如手工水样采集,仍然劳动密集型且在空间和时间覆盖方面有限。最近在遥感领域的视觉-语言模型(VLMs)取得了进展,显示出可扩展的人工智能驱动解决方案的潜力,但在图像推理和量化水华严重程度方面仍面临挑战。在本研究中,我们引入了ALGae Observation and Segmentation(ALGOS),这是一种结合遥感图像理解和严重程度估计的分割和推理系统。我们的方法结合了GeoSAM辅助的人类评估以获得高质量的分割掩码,并在NASA提供的蓝细菌聚合手动标签(CAML)上微调视觉语言模型以进行严重程度预测。实验表明,ALGOS在分割和严重程度估计方面均表现出稳健的性能,为实用和自动化的蓝细菌监测系统铺平了道路。
Summary / 总结
This study addresses the intensifying issue of harmful algal blooms (HAB), particularly cyanobacteria, which pose threats to aquatic ecosystems and human health. Traditional monitoring methods are labor-intensive and have limited spatial and temporal coverage. The research introduces ALGae Observation and Segmentation (ALGOS), a system that combines remote sensing image understanding with severity estimation. ALGOS uses GeoSAM-assisted human evaluation for high-quality segmentation masks and fine-tunes a vision-language model with NASA's Cyanobacteria Aggregated Manual Labels (CAML). Experiments show that ALGOS performs robustly in both segmentation and severity estimation, advancing the development of practical and automated cyanobacterial monitoring systems.
本研究针对有害藻华(HAB),特别是蓝细菌对水生生态系统和人类健康的威胁,解决了传统监测方法劳动密集且空间和时间覆盖有限的问题。研究引入了ALGae Observation and Segmentation (ALGOS) 系统,结合遥感图像理解和严重程度估计。ALGOS 使用 GeoSAM 辅助的人类评估进行高质量分割掩码的校准,并使用 NASA 的蓝细菌聚合手动标签 (CAML) 对视觉语言模型进行微调。实验表明,ALGOS 在分割和严重程度估计方面表现出色,推动了实用和自动化的蓝细菌监测系统的开发。
Increasing the Utility of Synthetic Images through Chamfer Guidance
Authors: Nicola Dall'Asen, Xiaofeng Zhang, Reyhane Askari Hemmat, Melissa Hall, Jakob Verbeek, Adriana Romero-Soriano, Michal Drozdzal
Venue: NeurIPS 2025
First: 2025-08-14T13:31:24+00:00 · Latest: 2025-10-21T15:14:25+00:00
Comments: Accepted to NeurIPS 2025
Abstract
Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as little as 2 exemplar real images, obtaining 96.4% in terms of precision, and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving accuracy boost of up to 15% for in-distribution over the baselines, and up to 16% in out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.
中文标题/摘要
标题:通过Chamfer引导提高合成图像的实用性
条件图像生成模型具有生成无限量合成训练数据的巨大潜力。然而,生成质量的最新进展是以牺牲生成多样性为代价的,限制了这些模型作为合成训练数据来源的实用性。尽管已经提出了基于引导的方法来通过关注质量或多样性来提高生成数据的实用性,但这些(显式或隐式的)实用性函数往往忽略了合成数据和真实数据之间潜在的分布偏移。在本文中,我们引入了Chamfer引导(Chamfer Guidance):一种无需训练的引导方法,利用少量真实示例图像来表征合成数据的质量和多样性。我们展示了通过利用提出的Chamfer引导,可以提高生成数据相对于真实图像数据集的多样性,同时在ImageNet-1k和标准地理多样性基准上保持或提高生成质量。我们的方法仅使用2张真实示例图像即可实现最先进的少样本性能,精度达到96.4%,分布覆盖度达到86.4%,使用32张真实图像时分别提高到97.5%和92.7%。我们通过在合成数据上训练下游图像分类器展示了Chamfer引导生成的好处,相对于基线,在分布内实现了高达15%的准确率提升,在分布外实现了高达16%的提升。此外,我们的方法不需要使用无条件模型,因此在采样时相对于基于无分类器引导(classifier-free guidance)的方法减少了31%的FLOPs。
Summary / 总结
This work addresses the issue of reduced diversity in synthetic images generated by conditional image generative models, despite improvements in quality. It introduces Chamfer Guidance, a training-free approach that uses a few real images to guide the generation process, enhancing diversity while maintaining or improving quality. The method achieves state-of-the-art few-shot performance, with significant accuracy boosts for both in-distribution and out-of-distribution tasks, and reduces computational cost by 31% compared to classifier-free-guidance-based approaches.
本文针对条件图像生成模型生成的合成图像多样性不足的问题,提出了Chamfer Guidance,这是一种无需训练的引导方法,利用少量真实图像来引导生成过程,在保持或提高质量的同时提升多样性。实验表明,仅使用2张真实图像即可取得最先进的少样本性能,精度和分布覆盖度均较高。该方法还提升了基于合成数据训练的下游分类器的准确率,在分布内最高提升15%,在分布外最高提升16%,且与基于无分类器引导的方法相比,采样时的FLOPs减少了31%。
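The method's name refers to the Chamfer distance between two point sets. As a minimal sketch, the symmetric Chamfer distance between generated-image features and real exemplar features can be computed as below. The Euclidean metric, the toy two-dimensional features, and reading the two directional terms as a quality term and a coverage term are assumptions; the abstract does not describe how the paper turns this quantity into a guidance signal.

```python
import math

# Hypothetical sketch: symmetric Chamfer distance between two feature sets.

def chamfer_distance(generated, real):
    """Sum of the two mean nearest-neighbor distances between the sets."""
    def nearest(point, others):
        return min(math.dist(point, other) for other in others)

    # Generated samples far from every real exemplar raise this term
    # (read here as a quality or precision signal).
    gen_to_real = sum(nearest(g, real) for g in generated) / len(generated)
    # Real exemplars with no nearby generated sample raise this term
    # (read here as a coverage or diversity signal).
    real_to_gen = sum(nearest(r, generated) for r in real) / len(real)
    return gen_to_real + real_to_gen

real_exemplars = [(0.0, 0.0), (10.0, 0.0)]
collapsed_batch = [(0.0, 0.0), (0.0, 0.0)]  # high quality, no diversity
diverse_batch = [(0.0, 0.0), (10.0, 0.0)]   # covers both exemplars
```

The collapsed batch is penalized only through the second term, which shows why a distance of this shape can reward diversity relative to a handful of real images without sacrificing fidelity.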
A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI
Authors: Kazusato Oko, Licong Lin, Yuhang Cai, Song Mei
First: 2025-01-08T17:47:06+00:00 · Latest: 2025-10-21T15:12:00+00:00
Abstract
Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous theoretical understanding of the contrastive pre-training framework remains limited. This paper develops a theoretical framework to explain the success of contrastive pre-training in downstream tasks, such as zero-shot classification, conditional diffusion models, and vision-language models. We introduce the concept of approximate sufficient statistics, a generalization of the classical sufficient statistics, and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to diverse downstream tasks. We further propose the Joint Generative Hierarchical Model for the joint distribution of images and text, showing that transformers can efficiently approximate relevant functions within this model via belief propagation. Building on this framework, we derive sample complexity guarantees for multi-modal learning based on contrastive pre-trained representations. Numerical simulations validate these theoretical findings, demonstrating the strong generalization performance of contrastively pre-trained transformers in various multi-modal tasks.
中文标题/摘要
标题:对比预训练和多模态生成AI的统计理论
多模态生成AI系统,如结合视觉和语言的系统,依赖对比预训练来学习不同模态的表示。尽管它们的实际益处得到了广泛认可,但对比预训练框架的严格理论理解仍然有限。本文发展了一个理论框架来解释对比预训练在下游任务,如零样本分类、条件扩散模型和视觉语言模型中的成功。我们引入了近似充分统计量的概念,这是经典充分统计量的一种推广,并证明了对比预训练损失的近似最小化器是近似充分的,使它们能够适应各种下游任务。我们进一步提出了联合生成分层模型,用于图像和文本的联合分布,表明在该模型中,变换器可以通过信念传播有效地近似相关函数。基于此框架,我们推导了基于对比预训练表示的多模态学习的样本复杂性保证。数值模拟验证了这些理论发现,展示了对比预训练变换器在各种多模态任务中的强大泛化性能。
Summary / 总结
This paper aims to provide a theoretical understanding of contrastive pre-training in multi-modal generative AI systems, which combine vision and language. The authors introduce the concept of approximate sufficient statistics and show that near-minimizers of the contrastive pre-training loss are approximately sufficient, making them adaptable to various downstream tasks. They also propose the Joint Generative Hierarchical Model and derive sample complexity guarantees for multi-modal learning based on contrastive pre-trained representations, supported by numerical simulations that validate the strong generalization performance of these models.
该论文发展了一个理论框架来解释对比预训练在多模态生成AI系统(如视觉和语言模型)中的成功。它引入了近似充分统计量的概念,并表明对比预训练损失的近似最小化器是近似充分的,使其能够适应各种下游任务。论文推导了多模态学习中样本复杂性的理论保证,并通过数值模拟验证了这些发现,突显了对比预训练Transformer在多模态任务中的强大泛化性能。
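For readers unfamiliar with the objective being analyzed, the sketch below computes a symmetric CLIP-style InfoNCE loss from an image-text similarity matrix. This is the commonly used form and is an assumption here; the paper's exact loss definition may differ, and the function name and temperature default are illustrative.

```python
import math

# Hypothetical sketch of a symmetric contrastive (InfoNCE) loss.
# sim[i][j] is the similarity between image i and text j; matched pairs
# lie on the diagonal.

def clip_contrastive_loss(sim, temperature=1.0):
    n = len(sim)

    def cross_entropy(rows):
        # Mean negative log-softmax of the diagonal entry in each row.
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]
        return loss / n

    columns = [list(col) for col in zip(*sim)]  # text-to-image direction
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))

uninformative = [[0.0, 0.0], [0.0, 0.0]]  # encoder cannot tell pairs apart
aligned = [[10.0, 0.0], [0.0, 10.0]]      # matched pairs score highest
```

An uninformative encoder scores log(n) on a batch of n pairs, and the loss approaches zero as matched pairs dominate; the paper's results concern what near-minimizers of such a loss retain about the joint distribution.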
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Authors: Yiqi Lin, Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Mike Zheng Shou
First: 2025-10-21T14:59:29+00:00 · Latest: 2025-10-21T14:59:29+00:00
Comments: Project page: https://linyq17.github.io/VC2L/
Abstract
Contrastive vision-language models such as CLIP have demonstrated strong performance across a wide range of multimodal tasks by learning from aligned image-text pairs. However, their ability to handle complex, real-world web documents remains limited, particularly in scenarios where text and images are interleaved, loosely aligned, or embedded in visual form. To address these challenges, we propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer. VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images, thus eliminating the need for OCR, text tokenization, or modality fusion strategy. To capture complex cross-modal relationships in multimodal web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments, leveraging the inherent coherence of documents without requiring explicitly paired image-text data. To assess the effectiveness of this approach, we introduce three retrieval benchmarks, AnyCIR, SeqCIR, and CSR, designed to evaluate cross-modal retrieval, fine-grained sequential understanding, and generalization to unseen data, respectively. Empirical results show that VC2L achieves competitive or superior performance compared to CLIP-style models on both the proposed benchmarks and established datasets such as M-BEIR and MTEB. These findings underscore the potential of multimodal web data as a valuable training resource for contrastive learning and illustrate the scalability of a unified, vision-centric approach for multimodal representation learning. Code and models are available at: https://github.com/showlab/VC2L.
中文标题/摘要
标题:探索统一的视觉中心对比替代方法以处理多模态网络文档
对比视觉-语言模型如CLIP通过学习对齐的图像-文本对,在多种多模态任务中表现出强大的性能。然而,它们在处理复杂的现实世界网络文档方面的能力有限,尤其是在文本和图像交织、松散对齐或嵌入视觉形式的情况下。为了解决这些挑战,我们提出了视觉中心对比学习(VC2L),这是一种统一框架,使用单一的视觉变换器建模文本、图像及其组合。VC2L完全在像素空间中运行,将所有输入,无论是文本、视觉还是组合,都渲染为图像,从而消除了对OCR、文本分词或模态融合策略的需要。为了捕捉多模态网络文档中的复杂跨模态关系,VC2L采用片段级对比学习目标,通过利用文档的内在连贯性对齐连续的多模态段,而无需显式配对的图像-文本数据。为了评估该方法的有效性,我们引入了三个检索基准,AnyCIR、SeqCIR和CSR,分别用于评估跨模态检索、细粒度序列理解和对未见过的数据的泛化能力。实验证明,VC2L在提出的基准和现有的M-BEIR和MTEB数据集上,与CLIP风格的模型相比,取得了竞争力或更优的性能。这些结果强调了多模态网络数据作为对比学习训练资源的价值,并展示了统一的视觉中心方法在多模态表示学习中的可扩展性。代码和模型可在以下链接获取:https://github.com/showlab/VC2L。
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Authors: Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
First: 2025-06-10T17:59:44+00:00 · Latest: 2025-10-21T14:14:49+00:00
Comments: Project page: https://faceong.github.io/VIKI-R/
Abstract
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
中文标题/摘要
标题:VIKI-R:通过强化学习协调具身多智能体合作
在动态环境中协调多个具身智能体仍然是人工智能的核心挑战,需要感知驱动的推理和可扩展的合作策略。虽然最近的工作利用了大型语言模型(LLMs)进行多智能体规划,但有少数开始探索视觉语言模型(VLMs)进行视觉推理。然而,这些基于VLM的方法在支持多种具身类型方面仍然有限。在本文中,我们介绍了VIKI-Bench,这是第一个针对具身多智能体合作的分层基准,包含三个结构化层次:智能体激活、任务规划和轨迹感知。VIKI-Bench 包括多种机器人具身、多视角视觉观察和结构化的监督信号,以评估基于视觉输入的推理。为了展示VIKI-Bench 的实用性,我们提出了VIKI-R,这是一种两阶段框架,首先使用带有Chain-of-Thought注释的演示对预训练的视觉语言模型(VLM)进行微调,然后在多级奖励信号下进行强化学习。我们的大量实验表明,VIKI-R 在所有任务级别上显著优于基线方法。此外,我们展示了强化学习使异构智能体之间出现组合合作模式。结合VIKI-Bench 和VIKI-R,它们为推进具身AI系统中的多智能体、视觉驱动的合作提供了一个统一的测试平台和方法。
Summary / 总结
This work addresses the challenge of coordinating multiple embodied agents in dynamic environments by introducing VIKI-Bench, a hierarchical benchmark for embodied multi-agent cooperation. VIKI-R, a two-stage framework, fine-tunes a pretrained vision-language model using Chain-of-Thought annotated demonstrations and then applies reinforcement learning with multi-level reward signals. The experiments demonstrate that VIKI-R outperforms baseline methods across all task levels and enables compositional cooperation among heterogeneous agents.
该研究通过引入VIKI-Bench,一个面向具身多智能体合作的分层基准,解决了在动态环境中协调多个具身智能体的挑战。VIKI-R是一个两阶段框架,首先使用带有思维链注释的演示数据微调预训练的视觉语言模型,然后在多级奖励信号下应用强化学习。实验表明,VIKI-R在所有任务级别上都优于基线方法,并且能够促进异构智能体之间的组合性合作。
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
Authors: Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xufang Luo, Mingze Sun, Zihao Pan, Yan Feng, Peng Pei, Xunliang Cai, Ruqi Huang
First: 2025-10-21T13:36:58+00:00 · Latest: 2025-10-21T13:36:58+00:00
Comments: 12 pages, 4 figures
Abstract
Though recent advances in vision-language models (VLMs) have achieved remarkable progress across a wide range of multimodal tasks, understanding 3D spatial relationships from limited views remains a significant challenge. Previous reasoning methods typically rely on pure text (e.g., topological cognitive maps) or on 2D visual cues. However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. Our framework is the first to enable 3D mentaling during reasoning without any 3D prior input, and it does not rely on explicitly labeled 3D data for training. Specifically, our training consists of two stages. First, we perform supervised training to align the 3D latent generated by VLM while reasoning with that of a 3D foundation model (e.g., VGGT). Then, we optimize the entire reasoning trajectory solely based on outcome signals, thereby refining the underlying 3D mentaling. Extensive experiments across multiple benchmarks show that 3DThinker consistently outperforms strong baselines and offers a new perspective toward unifying 3D representations into multimodal reasoning. Our code will be available at https://github.com/zhangquanchen/3DThinker.
中文标题/摘要
标题:三维思考:基于图像几何信息的空间推理
尽管近期视觉-语言模型(VLMs)在多种跨模态任务中取得了显著进展,但从有限视角理解三维空间关系仍然是一个重大挑战。以往的推理方法通常依赖纯文本(例如拓扑认知图)或二维视觉线索。然而,它们有限的表示能力阻碍了在需要三维空间想象的任务中的表现。为解决这一限制,我们提出了3DThinker框架,该框架能够在推理过程中有效利用图像中嵌入的丰富几何信息,类似于人类的思考方式。我们的框架是首个在推理过程中无需任何三维先验输入即可进行三维思考的框架,并且在训练过程中不依赖明确标注的三维数据。具体而言,我们的训练分为两个阶段。首先,我们进行监督训练,使VLM在推理过程中生成的三维潜在表示与三维基础模型(例如VGGT)生成的三维潜在表示对齐。然后,我们仅基于结果信号优化整个推理过程,从而细化底层的三维思考。在多个基准测试中的广泛实验表明,3DThinker在多个基准测试中始终优于强基线,并为将三维表示统一到跨模态推理中提供了新的视角。我们的代码将在https://github.com/zhangquanchen/3DThinker上提供。
Summary / 总结
The research aims to improve the ability of vision-language models to understand 3D spatial relationships from limited views. The proposed 3DThinker framework enhances reasoning by leveraging geometric information in images, simulating human 3D imagination without any 3D prior input or explicitly labeled 3D training data. Experiments demonstrate that 3DThinker consistently outperforms strong baselines on multiple benchmarks, offering a new perspective on unifying 3D representations with multimodal reasoning.
研究旨在提高视觉-语言模型从有限视角理解3D空间关系的能力。提出的3DThinker框架通过利用图像中的几何信息增强推理,模拟人类的3D想象,无需任何3D先验输入,训练也不依赖显式标注的3D数据。实验表明,3DThinker在多个基准测试中优于强基线,为统一3D表示在多模态推理中的应用提供了新视角。
CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Authors: Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun
First: 2025-10-21T12:53:40+00:00 · Latest: 2025-10-21T12:53:40+00:00
Comments: 24 pages, 6 figures
Abstract
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
中文标题/摘要
标题:CUARewardBench:评估计算机使用代理奖励模型基准
计算机使用代理(CUAs)通过与操作系统和软件界面的自然交互来完成任务。虽然基于脚本的验证器被广泛采用,但它们在可扩展性和逐步评估方面存在局限性。奖励模型提供了有前景的替代方案,但它们在CUA评估中的有效性仍很大程度上未被探索。为解决这一差距,我们提出了CUARewardBench,包括四个关键贡献:(1)首个全面的CUA奖励基准:我们引入了首个用于评估CUA任务的输出奖励模型(ORM)和过程奖励模型(PRM)的基准,实现了从轨迹级和步骤级的系统评估。(2)多样、实用且可靠的数据集:CUARewardBench 包含来自10个软件类别和7种代理架构的轨迹,性能水平各异(25.9%-50.8%的成功率)。所有轨迹均通过精心设计的协议由专家注释,并通过严格的质量控制确保可靠性和实用性。(3)全面的分析和见解:通过在7种视觉语言模型和3种提示模板上进行广泛的实验,我们揭示了当前CUA RMs的关键局限,包括不足的视觉推理能力、知识缺陷,以及通用VLMs在奖励评估中优于专门的CUA模型。(4)一致提示集合(UPE):基于我们全面分析的见解,我们提出了UPE,这是一种新的集成方法,通过严格的统一投票和策略性的提示模板配置,显著提高了奖励模型的可靠性。UPE在ORM上的精确度为89.8%,NPV为93.3%,在PRM上的精确度为81.7%,NPV为85.1%,显著优于单一VLM和传统集成方法。
Summary / 总结
CUARewardBench is a benchmark for evaluating reward models on computer-using agents (CUAs), addressing the limitations of script-based verifiers. It includes a comprehensive dataset with trajectories from 10 software categories and 7 agent architectures, and extensive experiments across 7 vision-language models and 3 prompt templates reveal the limitations of current reward models. The study proposes a unanimous prompt ensemble (UPE) method, which significantly improves the reliability of reward models, achieving high precision and negative predictive values for both outcome and process reward models.
研究旨在通过引入CUARewardBench基准来评估计算机使用代理(CUA)的奖励模型。方法包括创建一个包含各种软件类别和代理架构轨迹的多样化数据集,并进行广泛的实验,使用视觉语言模型和提示模板。关键发现包括当前奖励模型的局限性,如视觉推理不足和知识缺陷,以及新颖的一致提示集合(UPE)方法的有效性,该方法显著提高了奖励模型的可靠性。
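The unanimous-voting idea behind UPE and the two reported metrics (precision and NPV) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, and abstaining when the prompt templates disagree is an assumption about how "strict unanimous voting" might be applied.

```python
from typing import List, Optional, Sequence, Tuple


def unanimous_vote(votes: Sequence[bool]) -> Optional[bool]:
    """Return the shared verdict if every vote agrees, else None (abstain).

    Abstaining on disagreement is an assumption made for this sketch.
    """
    if not votes:
        return None
    if all(votes):
        return True
    if not any(votes):
        return False
    return None


def precision_and_npv(
    preds: Sequence[Optional[bool]], labels: Sequence[bool]
) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); NPV = TN / (TN + FN). Abstentions are skipped."""
    tp = fp = tn = fn = 0
    for pred, label in zip(preds, labels):
        if pred is None:
            continue
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and not label:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return precision, npv


# Hypothetical verdicts from three prompt templates on four trajectories.
template_votes: List[List[bool]] = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [True, True, True],
]
ground_truth = [True, True, False, False]
ensemble_preds = [unanimous_vote(v) for v in template_votes]
precision, npv = precision_and_npv(ensemble_preds, ground_truth)
```

Under this reading, the ensemble trades coverage for reliability: it only commits to a verdict when all templates agree, which is one way high precision and NPV could be obtained.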
CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
Authors: Yongmin Lee, Hye Won Chung
Venue: NeurIPS 2025
First: 2025-10-21T12:36:25+00:00 · Latest: 2025-10-21T12:36:25+00:00
Comments: NeurIPS 2025
Abstract
Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to stronger cross-modal alignment and improved performance. Evaluated on Flickr30K and COCO, CovMatch outperforms state-of-the-art multimodal distillation methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
中文标题/摘要
标题:CovMatch:跨协方差引导的多模态数据集精简与可训练文本编码器
多模态数据集精简旨在合成一小套图像-文本对,以实现大规模视觉-语言模型的高效训练。尽管数据集精简在单模态任务中显示出潜力,但将其扩展到多模态对比学习提出了关键挑战:跨模态对齐的学习和大型编码器的高计算成本管理。先前的方法通过冻结文本编码器并仅更新图像编码器和文本投影层来解决可扩展性问题。然而,我们发现这严重限制了语义对齐并成为性能扩展的瓶颈。我们提出了CovMatch,这是一种可扩展的数据集精简框架,通过正则化每种模态内的特征分布来对齐真实和合成特征的跨协方差。与先前的方法不同,CovMatch允许两个编码器的联合优化,从而实现更强的跨模态对齐和更好的性能。在Flickr30K和COCO上评估,CovMatch优于最先进的多模态数据集精简方法,并且仅使用500个合成对即可实现高达6.8%的绝对检索准确率提升。
Summary / 总结
CovMatch is a scalable dataset distillation framework that enhances cross-modal alignment by aligning cross-covariance of real and synthetic features while regularizing feature distributions within each modality. This approach allows joint optimization of both encoders, leading to improved performance in multimodal contrastive learning. On Flickr30K and COCO, CovMatch outperforms existing methods and achieves up to 6.8% absolute gains in retrieval accuracy using only 500 synthetic pairs.
CovMatch 是一种可扩展的多模态数据集蒸馏框架,它对真实和合成特征的交叉协方差进行对齐,并在每个模态内正则化特征分布。与之前冻结文本编码器的方法不同,CovMatch 联合优化两个编码器,从而实现更好的跨模态对齐和改进的性能。在 Flickr30K 和 COCO 上,CovMatch 超过了最先进的方法,仅使用 500 个合成样本,实现了高达 6.8% 的绝对检索准确率提升。
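As a rough illustration of what "aligning the cross-covariance of real and synthetic features" could mean, the sketch below computes the image-text cross-covariance matrix for a real batch and a synthetic batch and penalizes their squared Frobenius distance. The function names, the biased (divide-by-n) estimator, and the omission of the per-modality regularizer are all assumptions for this example; the abstract does not specify the exact loss.

```python
from typing import List

Matrix = List[List[float]]


def cross_covariance(img_feats: Matrix, txt_feats: Matrix) -> Matrix:
    """Cross-covariance between paired image and text features.

    Row k of each matrix is the feature vector of pair k.
    Uses the biased estimator (divide by n), an assumption of this sketch.
    """
    n = len(img_feats)
    d_img, d_txt = len(img_feats[0]), len(txt_feats[0])
    mean_img = [sum(row[i] for row in img_feats) / n for i in range(d_img)]
    mean_txt = [sum(row[j] for row in txt_feats) / n for j in range(d_txt)]
    return [
        [
            sum(
                (img_feats[k][i] - mean_img[i]) * (txt_feats[k][j] - mean_txt[j])
                for k in range(n)
            )
            / n
            for j in range(d_txt)
        ]
        for i in range(d_img)
    ]


def cross_cov_matching_loss(
    real_img: Matrix, real_txt: Matrix, syn_img: Matrix, syn_txt: Matrix
) -> float:
    """Squared Frobenius distance between real and synthetic cross-covariances."""
    c_real = cross_covariance(real_img, real_txt)
    c_syn = cross_covariance(syn_img, syn_txt)
    return sum(
        (a - b) ** 2
        for row_real, row_syn in zip(c_real, c_syn)
        for a, b in zip(row_real, row_syn)
    )


# Toy 1-D features: two pairs each.
real_img, real_txt = [[1.0], [3.0]], [[2.0], [6.0]]
syn_img, syn_txt = [[1.0], [3.0]], [[2.0], [6.0]]
toy_cov = cross_covariance(real_img, real_txt)
toy_loss = cross_cov_matching_loss(real_img, real_txt, syn_img, syn_txt)
```

In a real distillation loop the synthetic pairs would be the optimization variables and the loss would be minimized by gradient descent; this pure-Python version only shows the quantity being matched.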
Zero-Shot Vehicle Model Recognition via Text-Based Retrieval-Augmented Generation
Authors: Wei-Chia Chang, Yan-Ann Chen
First: 2025-10-21T10:39:39+00:00 · Latest: 2025-10-21T10:39:39+00:00
Comments: Accepted by The 38th Conference of Open Innovations Association FRUCT, 2025
Abstract
Vehicle make and model recognition (VMMR) is an important task in intelligent transportation systems, but existing approaches struggle to adapt to newly released models. Contrastive Language-Image Pretraining (CLIP) provides strong visual-text alignment, yet its fixed pretrained weights limit performance without costly image-specific finetuning. We propose a pipeline that integrates vision language models (VLMs) with Retrieval-Augmented Generation (RAG) to support zero-shot recognition through text-based reasoning. A VLM converts vehicle images into descriptive attributes, which are compared against a database of textual features. Relevant entries are retrieved and combined with the description to form a prompt, and a language model (LM) infers the make and model. This design avoids large-scale retraining and enables rapid updates by adding textual descriptions of new vehicles. Experiments show that the proposed method improves recognition by nearly 20% over the CLIP baseline, demonstrating the potential of RAG-enhanced LM reasoning for scalable VMMR in smart-city applications.
中文标题/摘要
标题:基于文本检索增强生成的零样本车辆型号识别
车辆品牌和型号识别(VMMR)是智能交通系统中的一个重要任务,但现有方法难以适应新发布的车型。对比语言-图像预训练(CLIP)提供了强大的视觉-文本对齐能力,但其固定预训练权重在不进行昂贵的图像特定微调的情况下限制了性能。我们提出了一种将视觉语言模型(VLMs)与检索增强生成(RAG)结合的管道,通过基于文本的推理支持零样本识别。VLM将车辆图像转换为描述性属性,与文本特征数据库进行比较。相关条目被检索并结合描述以形成提示,语言模型(LM)推断品牌和型号。此设计避免了大规模重新训练,并通过添加新车辆的文本描述实现快速更新。实验表明,所提出的方法在识别上比CLIP基线提高了近20%,展示了RAG增强LM推理在智能城市应用中可扩展VMMR的潜力。
Summary / 总结
The research aims to address the challenge of recognizing newly released vehicle models in intelligent transportation systems. The proposed method integrates vision language models with Retrieval-Augmented Generation to enable zero-shot recognition through text-based reasoning. By converting vehicle images into descriptive attributes and retrieving relevant textual features, the system infers the make and model without the need for large-scale retraining. Experiments show an improvement of nearly 20% in recognition over the CLIP baseline, highlighting the potential of this approach for scalable vehicle make and model recognition in smart-city applications.
研究旨在提高智能交通系统中的车辆品牌和型号识别(VMMR),特别是对于新发布的车型。提出的方法将视觉语言模型(VLMs)与检索增强生成(RAG)相结合,通过基于文本的推理实现零样本识别。通过将车辆图像转换为描述性属性并与文本特征数据库进行比较,系统检索相关条目并将其与描述结合形成提示,供语言模型推断品牌和型号。实验结果显示,该方法相比CLIP基线提高了近20%的识别率,突显了RAG增强的LM推理在智能城市应用中实现可扩展VMMR的潜力。
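The retrieve-then-prompt flow described in this entry can be sketched in a few lines of Python. Everything concrete here is a placeholder: the database entries, the Jaccard token-overlap scorer, and the prompt wording are assumptions for illustration, and the VLM and LM calls are left out entirely since the abstract does not name specific models or similarity measures.

```python
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (a stand-in for a real text encoder)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(
    description: str, database: Dict[str, str], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k database entries whose text best matches the description."""
    query = tokenize(description)
    scored = [(name, jaccard(query, tokenize(text))) for name, text in database.items()]
    # Sort by descending score, then by name for deterministic ties.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


def build_prompt(description: str, database: Dict[str, str], names: List[str]) -> str:
    """Combine the image description with retrieved entries into an LM prompt."""
    lines = ["Vehicle description: " + description, "Candidate vehicles:"]
    lines += ["- " + name + ": " + database[name] for name in names]
    lines.append("Which make and model best matches the description?")
    return "\n".join(lines)


# Hypothetical textual feature database; adding a new vehicle is one more entry.
vehicle_db = {
    "Make A Model 1": "compact hatchback with round headlights and black grille",
    "Make B Model 2": "large pickup truck with square headlights and chrome grille",
    "Make C Model 3": "sedan with narrow led headlights and sloping roof",
}
# In the described pipeline this description would come from a VLM.
vlm_description = "large truck with chrome grille and square headlights"
top_matches = retrieve(vlm_description, vehicle_db, k=2)
prompt = build_prompt(vlm_description, vehicle_db, [name for name, _ in top_matches])
```

The design point the abstract emphasizes is visible here: supporting a newly released vehicle means appending one text entry to the database, with no model retraining.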
VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching
Authors: Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu
Venue: NeurIPS 2025
First: 2025-02-04T09:48:14+00:00 · Latest: 2025-10-21T10:33:29+00:00
Comments: Accepted to NeurIPS 2025
Abstract
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity in robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby circumventing redundant computations. Additionally, to maintain action precision, VLA-Cache selectively re-computes task-relevant tokens that are environmentally sensitive, ensuring the fidelity of critical visual information. To further optimize efficiency, we introduce a layer adaptive token reusing strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss on task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
中文标题/摘要
标题:VLA-Cache:通过自适应令牌缓存实现高效的视觉-语言-动作操作
视觉-语言-动作(VLA)模型展示了强大的多模态推理能力,能够以端到端的方式直接从视觉感知和语言指令生成动作。然而,它们巨大的计算成本对实时机器人控制构成了挑战,因为快速决策至关重要。本文介绍了一种无需训练的推理加速方法VLA-Cache,通过在帧间适配性地缓存和重用静态视觉令牌来减少计算开销。利用机器人操作中的时间连续性,VLA-Cache识别相邻帧中变化最小的令牌,并重用它们的缓存键值表示,从而避免冗余计算。此外,为了保持动作精度,VLA-Cache选择性地重新计算对环境敏感的任务相关令牌,确保关键视觉信息的保真度。为了进一步优化效率,我们引入了一种层自适应令牌重用策略,该策略根据解码器层间的注意力集中程度动态调整重用比例,优先对关键令牌进行重新计算。在两个仿真平台(LIBERO和SIMPLER)和一个实际机器人系统上的广泛实验表明,VLA-Cache在CUDA延迟上实现了高达1.7倍的加速,并且控制频率提高了15%,同时任务成功率几乎没有损失。代码和视频可以在我们的项目页面找到:https://vla-cache.github.io。
Summary / 总结
VLA-Cache is a training-free inference acceleration method for Vision-Language-Action models that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. It selectively re-computes task-relevant tokens to maintain action precision and introduces a layer adaptive token reusing strategy to optimize efficiency. Experiments show VLA-Cache achieves up to 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss in task success rate.
VLA-Cache 是一种用于 Vision-Language-Action 模型的无需训练的推理加速方法,通过在帧间自适应地缓存和重用静态视觉令牌来减少计算开销。它选择性地重新计算任务相关的令牌,并使用层自适应令牌重用策略来优化效率。实验表明,VLA-Cache 可以实现高达 1.7 倍的 CUDA 延迟加速和 15% 的控制频率提升,同时任务成功率几乎没有损失。
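A minimal sketch of the frame-to-frame reuse idea: tokens whose patch features barely change between adjacent frames keep their cached key-value entries, while changed or task-relevant tokens are recomputed. The cosine-similarity test, the threshold value, and the `compute_kv` callback are assumptions for illustration; the layer-adaptive reuse ratio described in the abstract is not modeled.

```python
import math
from typing import Callable, FrozenSet, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_reusable(
    prev_patches: Sequence[Vector],
    curr_patches: Sequence[Vector],
    threshold: float = 0.99,
    task_relevant: FrozenSet[int] = frozenset(),
) -> List[int]:
    """Indices of tokens that are nearly unchanged and not task-relevant."""
    reusable = []
    for i, (prev, curr) in enumerate(zip(prev_patches, curr_patches)):
        if i in task_relevant:
            continue  # always recompute tokens flagged as task-relevant
        if cosine(prev, curr) >= threshold:
            reusable.append(i)
    return reusable


def update_cache(
    cache: list,
    curr_patches: Sequence[Vector],
    reusable: Sequence[int],
    compute_kv: Callable[[Vector], object],
) -> list:
    """Reuse cached entries for static tokens; recompute the rest."""
    reuse = set(reusable)
    return [
        cache[i] if i in reuse else compute_kv(patch)
        for i, patch in enumerate(curr_patches)
    ]


# Toy example: token 1 changed, token 2 is flagged as task-relevant.
prev_frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_frame = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
static_ids = select_reusable(prev_frame, curr_frame, task_relevant=frozenset({2}))
new_cache = update_cache(["k0", "k1", "k2"], curr_frame, static_ids, lambda p: "new")
```

Only token 0 is reused in the toy case: token 1 moved, and token 2 is recomputed despite being visually static because it is marked task-relevant.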
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Authors: Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
First: 2025-10-19T15:38:06+00:00 · Latest: 2025-10-21T10:27:04+00:00
Abstract
Instruction-based image editing has achieved remarkable progress; however, models solely trained via supervised fine-tuning often overfit to annotated patterns, hindering their ability to explore and generalize beyond training distributions. To this end, we introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a likelihood-free policy optimization method consistent with the flow matching forward process, thereby enabling the use of higher-order samplers and more efficient training. Another key challenge here is the absence of a universal reward model, resulting from the diverse nature of editing instructions and tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Furthermore, we carefully design a low-variance group filtering mechanism to reduce MLLM scoring noise and stabilize optimization. UniWorld-V2, trained with this framework, achieves state-of-the-art results on the ImgEdit and GEdit-Bench benchmarks, scoring 4.49 and 7.83, respectively. Crucially, our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its wide applicability. Code and models are publicly available at https://github.com/PKU-YuanGroup/UniWorld-V2.
中文标题/摘要
标题:Uniworld-V2: 使用扩散负样本感知微调和MLLM隐式反馈强化图像编辑
基于指令的图像编辑取得了显著进展;然而,仅通过监督微调训练的模型往往会过度拟合标注模式,限制了它们探索和泛化的能力。为此,我们提出了Edit-R1,一种基于策略优化的新型后训练框架。具体而言,我们利用了与流匹配前向过程一致的无似然性策略优化方法——扩散负样本感知微调(DiffusionNFT),从而能够使用高阶采样器并进行更高效的训练。另一个关键挑战是没有通用的奖励模型,这源于编辑指令和任务的多样性。为了解决这一问题,我们采用多模态大型语言模型(MLLM)作为统一的、无需训练的奖励模型,利用其输出logits提供细粒度反馈。此外,我们精心设计了一种低方差组过滤机制,以减少MLLM评分噪声并稳定优化。UniWorld-V2在ImgEdit和GEdit-Bench基准测试中分别取得了4.49和7.83的优异结果,证明了其广泛适用性。我们的框架具有模型无关性,在Qwen-Image-Edit和FLUX-Kontext等不同基础模型上应用时,能够显著提升性能。代码和模型已公开发布于https://github.com/PKU-YuanGroup/UniWorld-V2。
Summary / 总结
The research aims to improve instruction-based image editing by addressing overfitting issues. It introduces Edit-R1, a post-training framework using Diffusion Negative-aware Finetuning and a Multimodal Large Language Model (MLLM) for reward modeling. The framework achieves state-of-the-art results on ImgEdit and GEdit-Bench benchmarks with scores of 4.49 and 7.83, respectively, and is model-agnostic, enhancing various base models like Qwen-Image-Edit and FLUX-Kontext.
研究旨在通过解决过拟合问题来提升基于指令的图像编辑。引入了Edit-R1框架,结合使用Diffusion Negative-aware Finetuning和Multimodal Large Language Model (MLLM)进行奖励建模。该框架在ImgEdit和GEdit-Bench基准测试中分别取得了4.49和7.83的最优成绩,并且是模型无关的,能够显著提升诸如Qwen-Image-Edit和FLUX-Kontext等不同基础模型的性能。
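Two ideas from this entry lend themselves to a short sketch: turning an MLLM's output logits over candidate score tokens into a continuous reward, and discarding sample groups whose rewards barely differ. Both functions below are one plausible reading, not the paper's implementation: the softmax-expectation formula, the `min_std` threshold, and the rule of dropping (rather than reweighting) low-variance groups are assumptions.

```python
import math
from typing import Dict, List


def expected_score(score_logits: Dict[int, float]) -> float:
    """Softmax over the logits of candidate score tokens, then take the expectation.

    score_logits maps a numeric score (e.g. 1..5) to the MLLM's logit for that
    score token. Using the expectation as the reward is an assumption here.
    """
    max_logit = max(score_logits.values())
    exps = {s: math.exp(l - max_logit) for s, l in score_logits.items()}
    total = sum(exps.values())
    return sum(s * e for s, e in exps.items()) / total


def filter_low_variance_groups(
    groups: List[List[float]], min_std: float = 0.05
) -> List[List[float]]:
    """Keep only groups whose reward standard deviation reaches min_std.

    The intuition assumed here: when all samples in a group score almost the
    same, normalizing within the group would mostly amplify scoring noise.
    """
    kept = []
    for rewards in groups:
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        if math.sqrt(var) >= min_std:
            kept.append(rewards)
    return kept


# Equal logits on scores 1 and 5 give an expected score of 3.
toy_reward = expected_score({1: 0.0, 5: 0.0})
toy_groups = filter_low_variance_groups([[0.5, 0.5, 0.5], [0.1, 0.9]])
```

Compared with sampling a single discrete score from the MLLM, a logit-based expectation yields a finer-grained signal, which matches the abstract's stated motivation for using logits.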
MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Authors: Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan
First: 2025-10-09T17:59:54+00:00 · Latest: 2025-10-21T10:22:02+00:00
Comments: We have come across a recent approach that has not been properly attributed at the time of submission and compared in a fair setting. Therefore, we would like to withdraw the paper to address these concerns
Abstract
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
中文标题/摘要
标题:MATRIX:多模态智能体调优以实现稳健的工具使用推理
视觉语言模型(VLMs)越来越多地被用作控制器,具有访问外部工具的能力,用于复杂的推理和决策,但其有效性受限于高质量多模态轨迹的稀缺性和手动注释的成本。我们通过一种以视觉为中心的智能体调优框架解决了这一挑战,该框架自动合成多模态轨迹、生成逐步偏好对,并训练一个VLM控制器以实现稳健的工具使用推理。我们的流水线首先构建了M-TRACE,这是一个包含28500个多模态任务和177000个验证轨迹的大规模数据集,使基于模仿的轨迹调优成为可能。在此基础上,我们开发了MATRIX智能体,该智能体在M-TRACE上进行了逐步工具推理的微调。为了实现更精细的对齐,我们进一步引入了Pref-X,这是一个包含11000个自动生成的偏好对的集合,并通过逐步偏好学习对MATRIX进行了优化。在三个基准测试Agent-X、GTA和GAIA上,MATRIX始终超越了开源和闭源的VLMs,展示了可扩展且有效的多模态工具使用。我们的数据和代码可在https://github.com/mbzuai-oryx/MATRIX/获得。
StarBench: A Turn-Based RPG Benchmark for Agentic Multimodal Decision-Making and Information Seeking
Authors: Haoran Zhang, Chenhao Zhu, Sicong Guo, Hanzhe Guo, Haiming Li, Donglin Yu
First: 2025-10-21T10:02:59+00:00 · Latest: 2025-10-21T10:02:59+00:00
Abstract
Human players do more than press buttons: they ground what they see on screen into precise keyboard-mouse actions and, when stuck, they seek information before trying again. We ask whether current vision-language models (VLMs) can do the same. Despite encouraging results under simplified control or tool scaffolds, human-like play in a real client - mapping raw screenshots to temporally coherent low-level actions while deciding when to ask for guidance - remains an open challenge. We introduce StarBench, a turn-based RPG benchmark derived from Honkai: Star Rail that targets these two human-like competencies: multimodal decision-making from pixels to actions and agentic information seeking. StarBench standardizes evaluation across eight combat tasks and two regimes with shared tasks and metrics: (i) direct control, where agents receive only screenshots and must emit low-level primitives (click and keypress) with no semantic hints; and (ii) tool-assisted control, where higher-level intents can be mapped to primitives by detectors and OCR outputs provide optional textualized observations to ease UI grounding. To mirror human practice, StarBench also includes an ask-or-act diagnostic that measures whether and when agents choose to request brief guidance before proceeding, and how that choice affects subsequent performance. We report reference baselines for contemporary VLMs and a human reference. Results expose sizable gaps in perception-to-control fidelity in the direct regime, while showing that judicious information seeking correlates with improved success, establishing StarBench as a reproducible yardstick for agentic information seeking and multimodal decision-making in real-client play.
中文标题/摘要
标题:StarBench:基于回合制RPG的代理多模态决策与信息搜索基准
人类玩家所做的远不止按按钮:他们将屏幕上的内容转化为精确的键盘鼠标操作,并在受阻时寻求信息再尝试。我们询问当前的视觉-语言模型(VLMs)是否能做到这一点。尽管在简化控制或工具辅助下取得了一些令人鼓舞的结果,但在实际客户端中,从原始截图映射到时间上连贯的低级动作并决定何时寻求指导的人类级游戏仍是一个开放的挑战。我们引入了StarBench,一个源自《崩坏:星轨》的基于回合制的RPG基准,旨在针对这两种人类级能力:从像素到动作的多模态决策和主动的信息搜索。StarBench 在八个战斗任务和两种具有共享任务和指标的制度中标准化了评估:(i)直接控制,其中代理仅接收截图并必须发出低级原语(点击和按键)而没有任何语义提示;(ii)工具辅助控制,其中高级意图可以由检测器映射到原语,OCR输出提供可选的文本化观察以简化UI定位。为了反映人类实践,StarBench 还包括一个询问或行动诊断,衡量代理在继续之前选择请求简短指导的时机及其对后续表现的影响。我们报告了当前VLMs的参考基线和人类参考。结果揭示了直接控制制度中感知到控制的准确性差距,同时表明审慎的信息搜索与成功率提高相关,确立了StarBench作为代理信息搜索和多模态决策在实际客户端游戏中的可重复度量标准。
Summary / 总结
StarBench is a benchmark for evaluating vision-language models (VLMs) in agentic multimodal decision-making and information seeking in a turn-based RPG. It standardizes evaluation across eight combat tasks and two regimes: direct control and tool-assisted control. The study finds significant gaps in perception-to-control fidelity in the direct regime, but shows that judicious information seeking correlates with improved success, establishing StarBench as a valuable benchmark for these competencies.
StarBench 是一个用于评估视觉-语言模型(VLMs)在回合制RPG中的自主信息寻求和多模态决策能力的基准。它涉及将原始截图映射到低级动作,并决定何时寻求指导。基准包括直接控制和工具辅助控制两种模式,其中包含一个询问或行动诊断,衡量寻求指导的影响。结果表明,在直接模式下感知到控制的准确性存在显著差距,但恰当地寻求信息可以提高成功率,验证了StarBench作为衡量这些能力的可靠标准的有效性。
ImageGem: In-the-wild Generative Image Interaction Dataset for Generative Model Personalization
Authors: Yuanhe Guo, Linxi Xie, Zhuoran Chen, Kangrui Yu, Ryan Po, Guandao Yang, Gordon Wetzstein, Hongyi Wen
First: 2025-10-21T09:08:01+00:00 · Latest: 2025-10-21T09:08:01+00:00
Abstract
We introduce ImageGem, a dataset for studying generative models that understand fine-grained individual preferences. We posit that a key challenge hindering the development of such a generative model is the lack of in-the-wild and fine-grained user preference annotations. Our dataset features real-world interaction data from 57K users, who collectively have built 242K customized LoRAs, written 3M text prompts, and created 5M generated images. With user preference annotations from our dataset, we were able to train better preference alignment models. In addition, leveraging individual user preference, we investigated the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Finally, we propose an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences. Our results demonstrate that the ImageGem dataset enables, for the first time, a new paradigm for generative model personalization.
中文标题/摘要
标题:ImageGem:用于生成模型个性化的大规模在野生成图像交互数据集
我们介绍了ImageGem,一个用于研究能够理解细微个体偏好的生成模型的数据集。我们认为,阻碍此类生成模型发展的关键挑战是缺乏大规模和细微的用户偏好注释。我们的数据集包含来自57,000名用户的实际交互数据,他们共同构建了242,000个定制的LoRAs,编写了300万条文本提示,并生成了500万张生成图像。借助我们数据集中的用户偏好注释,我们能够训练出更好的偏好对齐模型。此外,利用个别用户的偏好,我们研究了检索模型和视觉语言模型在个性化图像检索和生成模型推荐方面的性能。最后,我们提出了一种端到端框架,用于在潜在权重空间中编辑定制的扩散模型,以与个别用户偏好对齐。我们的结果表明,ImageGem数据集首次使生成模型个性化成为可能。
Summary / 总结
The research introduces ImageGem, a dataset aimed at developing generative models that understand individual preferences. It addresses the challenge of lack of fine-grained user preference annotations by collecting real-world interaction data from 57,000 users, who created 242,000 customized LoRAs, wrote 3 million text prompts, and generated 5 million images. The dataset was used to train better preference alignment models and to investigate the performance of retrieval models and a vision-language model on personalized image retrieval and generative model recommendation. Additionally, an end-to-end framework for editing customized diffusion models in a latent weight space to align with individual user preferences was proposed, demonstrating the dataset's potential for generative model personalization.
研究介绍了ImageGem数据集,旨在开发能够理解个体偏好的生成模型。该数据集通过收集57,000名用户的真实交互数据,解决了缺乏细粒度用户偏好注释的问题,这些用户创建了242,000个定制的LoRAs,撰写了300万条文本提示,并生成了500万张图像。该数据集被用于训练更好的偏好对齐模型,并研究了检索模型和视觉-语言模型在个性化图像检索和生成模型推荐方面的性能。此外,还提出了一种端到端框架,用于在潜在权重空间中编辑定制的扩散模型,以与个体用户偏好对齐,展示了该数据集在生成模型个性化方面的潜力。
Beyond Single Models: Mitigating Multimodal Hallucinations via Adaptive Token Ensemble Decoding
Authors: Jinlin Li, Yuran Wang, Yifei Yuan, Xiao Zhou, Yingying Zhang, Xixian Yong, Yefeng Zheng, Xian Wu
First: 2025-10-21T06:11:24+00:00 · Latest: 2025-10-21T06:11:24+00:00
Abstract
Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to object hallucination -- generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this via auxiliary training objectives or external modules, but challenges remain in terms of scalability, adaptability, and model independence. To address these limitations, we propose Adaptive Token Ensemble Decoding (ATED), a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting their reliability at each decoding step. It also integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes applications. The code is available at https://github.com/jinlin2021/ATED.
中文标题/摘要
标题:超越单一模型:通过自适应令牌ensemble解码减轻多模态幻觉
大型视觉-语言模型(LVLMs)在图像字幕和视觉问答等多模态任务中取得了令人印象深刻的成果。然而,它们仍然容易出现对象幻觉——生成不存在或误识别对象的描述。先前的工作部分通过辅助训练目标或外部模块减轻了这一问题,但在可扩展性、适应性和模型独立性方面仍存在挑战。为了解决这些限制,我们提出了自适应令牌ensemble解码(ATED),这是一种无需训练的令牌级ensemble框架,在推理过程中通过聚合多个LVLM的预测来减轻幻觉。ATED动态计算每个模型在每个解码步骤的不确定性权重,反映其可靠性。它还整合了多种解码路径以提高上下文关联性和语义一致性。在标准幻觉检测基准上的实验表明,ATED显著优于现有最佳方法,在减少幻觉的同时不牺牲流畅性和相关性。我们的研究结果突显了自适应ensemble的优势,并指出了提高LVLM在高风险应用中鲁棒性的有希望的方向。代码可在https://github.com/jinlin2021/ATED/ 获取。
Summary / 总结
The research aims to address the issue of object hallucination in large vision-language models (LVLMs) by proposing Adaptive Token Ensemble Decoding (ATED), a training-free method that aggregates predictions from multiple LVLMs during inference. ATED dynamically assigns weights to each model based on their reliability at each decoding step and integrates diverse decoding paths to enhance contextual grounding and semantic consistency. Experiments show that ATED outperforms existing methods in reducing hallucination while maintaining fluency and relevance.
论文提出了一种名为Adaptive Token Ensemble Decoding (ATED)的方法,通过在推理过程中聚合多个大型视觉语言模型(LVLM)的预测来解决对象幻觉问题。ATED动态地根据每个模型在每个解码步骤的可靠性为其分配权重,并整合多种解码路径以增强上下文关联性和语义一致性。实验表明,ATED在减少幻觉的同时保持流畅性和相关性,优于现有方法。
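One decoding step of an uncertainty-weighted token ensemble might look like the sketch below. The abstract says only that weights are "uncertainty-based"; using exp(-entropy) of each model's next-token distribution as the weight is an assumption made for this example, as are the function names. The models are assumed to share a vocabulary.

```python
import math
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_step(
    distributions: Sequence[Sequence[float]],
) -> Tuple[int, List[float], List[float]]:
    """Mix next-token distributions from several models for one decoding step.

    Each model's weight is proportional to exp(-entropy), so more confident
    (lower-entropy) models count more. Returns (token id, weights, mixture).
    """
    raw = [math.exp(-entropy(p)) for p in distributions]
    total = sum(raw)
    weights = [w / total for w in raw]
    vocab_size = len(distributions[0])
    mixture = [
        sum(w * p[i] for w, p in zip(weights, distributions))
        for i in range(vocab_size)
    ]
    token = max(range(vocab_size), key=mixture.__getitem__)
    return token, weights, mixture


# Two hypothetical models over a 3-token vocabulary: one confident, one unsure.
confident = [0.9, 0.05, 0.05]
unsure = [0.3, 0.4, 0.3]
chosen, model_weights, mixed = ensemble_step([confident, unsure])
```

Because the weights are recomputed at every step, a model that is reliable early in a caption but uncertain later would lose influence exactly where it becomes unreliable, which is the adaptive behavior the abstract describes.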
Foundation Cures Personalization: Improving Personalized Models' Prompt Consistency via Hidden Foundation Knowledge
Authors: Yiyang Cai, Zhengkai Jiang, Yulong Liu, Chunyang Jiang, Wei Xue, Yike Guo, Wenhan Luo
Venue: NeurIPS 2025
First: 2024-11-22T15:21:38+00:00 · Latest: 2025-10-21T05:52:43+00:00
Comments: Accepted to NeurIPS 2025
Abstract
Facial personalization faces challenges to maintain identity fidelity without disrupting the foundation model's prompt consistency. The mainstream personalization models employ identity embedding to integrate identity information within the attention mechanisms. However, our preliminary findings reveal that identity embeddings compromise the effectiveness of other tokens in the prompt, thereby limiting high prompt consistency and attribute-level controllability. Moreover, by deactivating identity embedding, personalization models still demonstrate the underlying foundation models' ability to control facial attributes precisely. It suggests that such foundation models' knowledge can be leveraged to cure the ill-aligned prompt consistency of personalization models. Building upon these insights, we propose FreeCure, a framework that improves the prompt consistency of personalization models with their latent foundation models' knowledge. First, by setting a dual inference paradigm with/without identity embedding, we identify attributes (e.g., hair, accessories, etc.) for enhancements. Second, we introduce a novel foundation-aware self-attention module, coupled with an inversion-based process to bring well-aligned attribute information to the personalization process. Our approach is training-free, and can effectively enhance a wide array of facial attributes; and it can be seamlessly integrated into existing popular personalization models based on both Stable Diffusion and FLUX. FreeCure has consistently shown significant improvements in prompt consistency across these facial personalization models while maintaining the integrity of their original identity fidelity.
中文标题/摘要
标题:基础模型治愈个性化:通过隐藏的基础知识提高个性化模型的提示一致性
面部个性化在保持身份真实性的同时,面临着维护提示一致性方面的挑战。主流的个性化模型通过身份嵌入将身份信息整合到注意力机制中。然而,我们的初步研究发现,身份嵌入会削弱其他提示词的有效性,从而限制了高提示一致性和属性级可控性。此外,通过禁用身份嵌入,个性化模型仍然能够精确控制面部属性,这表明基础模型的知识可以被利用来治愈个性化模型提示不一致的问题。基于这些见解,我们提出了FreeCure框架,利用基础模型的潜在知识来提高个性化模型的提示一致性。首先,通过设置带有/不带有身份嵌入的双重推理范式,我们识别出需要增强的属性(如发型、配饰等)。其次,我们引入了一种新颖的基础模型感知自注意力模块,并结合基于反演的过程,将对齐良好的属性信息引入个性化过程。我们的方法无需训练,可以有效增强各种面部属性;并且可以无缝集成到基于Stable Diffusion和FLUX的现有流行个性化模型中。FreeCure在这些面部个性化模型中始终显示出显著的提示一致性改进,同时保持其原始身份真实性的完整性。
Summary / 总结
The paper addresses the challenge of maintaining identity fidelity while improving prompt consistency in facial personalization models. It proposes FreeCure, a framework that leverages hidden foundation knowledge to enhance prompt consistency without compromising identity fidelity. The method involves a dual inference paradigm and a foundation-aware self-attention module, which effectively improves various facial attributes across different personalization models while preserving identity integrity.
本文旨在解决面部个性化模型在保持身份保真度的同时提高提示一致性的问题。提出了一种名为FreeCure的框架,利用基础模型的知识来增强提示一致性,而不损害身份保真度。通过使用双重推理范式和基础感知自注意力模块,FreeCure能够识别并增强面部特征,展示了在多种个性化模型中显著提高提示一致性的效果,同时保持原始的身份完整性。
Exploring Cross-Modal Flows for Few-Shot Learning
Authors: Ziqi Jiang, Yanghao Wang, Long Chen
First: 2025-10-16T10:32:48+00:00 · Latest: 2025-10-21T05:40:11+00:00
Comments: 13 pages, 6 figures
Abstract
Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. This is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
中文标题/摘要
标题:探索少样本学习中的跨模态流动
不同模态特征的对齐是跨模态任务中最基本的挑战之一。尽管预训练的视觉-语言模型可以在图像和文本之间实现一般对齐,但它们通常需要参数高效微调(PEFT)进行进一步调整。当前的PEFT方法(例如提示调优、LoRA基的或适配器基的)总是选择性地微调一部分参数,这可以轻微调整视觉或文本特征之一,并避免过拟合。在本文中,我们首次指出所有现有的PEFT方法都是一步调整。对于特征高度纠缠的复杂(或困难)数据集来说,这是不够的。为此,我们提出了第一个模型无关的多步调整方法,通过学习跨模态速度场:流动匹配对齐(FMA)。具体来说,为了在训练过程中确保类别的对应性,我们首先使用固定耦合策略。然后,我们提出了一种噪声增强策略来缓解数据稀缺问题。最后,我们设计了一个早期停止求解器,该求解器在更早终止变换过程,从而提高效率和准确性。与一步PEFT方法相比,FMA具有多步校正能力,可以实现更精确和稳健的对齐。广泛的结果表明,FMA可以在各种基准和骨干网络上一致地获得显著的性能提升,特别是在具有挑战性的数据集上。
Summary / 总结
This paper addresses the challenge of aligning features from different modalities in cross-modal tasks, particularly in few-shot learning scenarios. It introduces a novel multi-step adjustment approach called Flow Matching Alignment (FMA) that learns a cross-modal velocity field. FMA improves upon existing parameter-efficient fine-tuning methods by performing multi-step adjustments, which better aligns features across complex datasets. Experimental results show that FMA consistently outperforms one-step fine-tuning methods across various benchmarks and backbones, especially on challenging datasets.
本文针对跨模态任务中不同模态特征对齐的挑战,提出了一种多步调整方法Flow Matching Alignment (FMA)。FMA通过学习跨模态速度场来实现更精确和稳健的对齐,特别适用于复杂数据集。实验结果表明,FMA在各种基准和骨干网络上的一致性表现优于现有的单步参数高效微调方法,特别是在具有挑战性的数据集上表现更佳。
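Illustration (not from the paper): the abstract describes moving features along a learned cross-modal velocity field in several steps, with a solver that can stop early. The sketch below shows only that general mechanism, using plain Euler integration. The names `euler_flow` and `make_toy_velocity` are hypothetical, and the hand-written straight-line velocity field stands in for the learned network, which the abstract does not specify.

```python
from typing import Callable, List, Optional

Vector = List[float]


def euler_flow(
    x0: Vector,
    velocity: Callable[[Vector, float], Vector],
    num_steps: int = 10,
    stop_step: Optional[int] = None,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 with Euler steps.

    If stop_step is given, integration ends after that many steps,
    which mimics an early-stopping solver.
    """
    steps = num_steps if stop_step is None else min(stop_step, num_steps)
    h = 1.0 / num_steps  # step size over the unit time interval
    x = list(x0)
    for k in range(steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy field (assumption): points straight at `target` and reaches it at t = 1."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Usage: a 2-D "image feature" is moved toward a 2-D "text feature".
field = make_toy_velocity([1.0, 2.0])
full = euler_flow([0.0, 0.0], field, num_steps=10)
half = euler_flow([0.0, 0.0], field, num_steps=10, stop_step=5)
```

With this toy field the full run ends at the target and the early-stopped run ends halfway along the straight line, which makes the effect of the stopping step easy to see. Whether stopping early helps accuracy depends on the learned field and is a claim of the paper, not something this sketch demonstrates.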
When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models
Authors: Xianzheng Ma, Brandon Smart, Yash Bhalgat, Shuai Chen, Xinghui Li, Jian Ding, Jindong Gu, Dave Zhenyu Chen, Songyou Peng, Jia-Wang Bian, Philip H Torr, Marc Pollefeys, Matthias Nießner, Ian D Reid, Angel X. Chang, Iro Laina, Victor Adrian Prisacariu
First: 2024-05-16T16:59:58+00:00 · Latest: 2025-10-21T05:00:34+00:00
Comments: Second version, updated to June 2025
Abstract
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.
中文标题/摘要
标题:当LLMs踏入3D世界:多模态大型语言模型在3D任务上的综述与元分析
随着大型语言模型(LLMs)的发展,它们与3D空间数据(3D-LLMs)的整合取得了快速进步,为理解和与物理空间互动提供了前所未有的能力。本文综述了使LLMs处理、理解和生成3D数据的方法。强调了LLMs的独特优势,如上下文学习、逐步推理、开放式词汇能力和广泛的世界知识,突显了它们在增强体态人工智能(AI)系统中的空间理解和互动方面的潜力。本文调查了从点云到神经辐射场(NeRF)的各种3D数据表示,并探讨了它们与LLMs的整合,用于3D场景理解、描述、问答和对话任务,以及基于LLMs的代理进行空间推理、规划和导航。本文还简要回顾了其他整合3D和语言的方法。本文中的元分析揭示了显著的进步,但也强调了需要新的方法来充分利用3D-LLMs的潜力。因此,本文旨在为未来研究探索和扩展3D-LLMs在理解和与复杂3D世界互动方面的能力指明方向。为了支持本文的综述,我们建立了一个项目页面,将与我们主题相关的论文组织和列出:https://github.com/ActiveVisionLab/Awesome-LLM-3D。
Summary / 总结
This paper surveys and analyzes the integration of large language models (LLMs) with 3D spatial data, highlighting their unique advantages like in-context learning and extensive world knowledge. It covers various 3D data representations and their applications in tasks such as 3D scene understanding, captioning, and dialogue. The meta-analysis indicates significant progress but also identifies the need for new approaches to fully leverage 3D-LLMs. The study aims to guide future research in enhancing spatial comprehension and interaction within embodied AI systems.
这篇综述探讨了大型语言模型(LLMs)如何处理、理解和生成3D数据,强调了它们的优势,如上下文学习和广泛的世界知识。它涵盖了各种3D数据表示及其与LLMs的集成,用于3D场景理解、导航等任务。元分析显示了显著的进步,但也指出了需要新的方法来充分利用3D-LLMs的潜力。该研究旨在指导这一领域的未来研究。
Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Authors: Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang
First: 2025-09-04T05:35:32+00:00 · Latest: 2025-10-21T04:49:38+00:00
Comments: Agent4Science 2025 Spotlight
Abstract
Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield stimuli that are "visible but unreadable" for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on the compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
中文标题/摘要
标题:可见但不可读:视觉语言模型在不同书写系统中的一个系统性盲点
书写是一种普遍的文化技术,利用视觉进行象征性交流。人类表现出惊人的适应性:即使字符被分割、融合或部分遮挡,我们也能轻易识别出单词。本文探讨先进视觉语言模型(VLMs)是否也具备这种适应性。我们构建了跨不同书写系统的两个心理物理学启发式基准,通过拼接、重组和叠加字符,生成对模型可见但对人类可读的“可见但不可读”刺激,尽管清洁文本上的表现良好,但当代VLMs在这些扰动下表现出严重的性能下降,经常产生不相关或不连贯的输出。这一模式表明,模型过度依赖通用的视觉不变性,而对构成先验的依赖不足,这些先验对于稳健的识字能力至关重要。我们发布了刺激生成代码、提示和评估协议,以促进透明的复制和后续工作。我们的发现促使了能够编码符号分割、组合和跨书写系统绑定的架构和训练策略的发展,并指出了在教育、无障碍、文化遗产和安全领域部署多模态系统时的具体挑战。
Summary / 总结
This paper explores whether advanced vision language models (VLMs) can recognize fragmented or occluded text, a skill humans exhibit. By creating 'visible but unreadable' stimuli through psychophysics methods across Chinese and English writing systems, the study finds that VLMs perform poorly under these conditions, often producing unrelated or incoherent outputs. This indicates a structural limitation in VLMs, which rely on generic visual invariances but lack compositional priors necessary for robust literacy. The authors release the stimuli generation code and protocols to encourage further research and development in this area.
本文研究了先进视觉语言模型(VLMs)是否能够识别碎片化或被遮挡的文本,这是人类的一项技能。通过在中文和英文书写系统中使用心理物理学方法创建“可见但不可读”的刺激,研究发现,VLMs在这些条件下表现不佳,经常产生不相关或不连贯的输出。这表明VLMs在结构上存在局限性,它们依赖于通用的视觉不变性,但缺乏用于稳健读写的组成先验。作者发布了刺激生成代码和评估协议,以促进进一步的研究和发展。
GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation
Authors: Tuan Pham, Thanh-Tung Le, Xiaohui Xie, Stephan Mandt
Venue: ICCV
First: 2025-10-21T04:47:36+00:00 · Latest: 2025-10-21T04:47:36+00:00
Comments: Accepted to ICCV Findings 2025. The first two authors contributed equally. The last two authors share co-corresponding authorship
Abstract
We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.
中文标题/摘要
标题:GeoDiff:几何引导扩散模型的度量深度估计
我们提出了一种新的框架,通过立体视觉指导增强预训练的单目深度估计(DB-MDE)模型,以提高其性能。尽管现有的DB-MDE方法在预测相对深度方面表现出色,但在单图像场景中由于尺度不确定性,估计绝对度量深度仍然具有挑战性。为了解决这个问题,我们将深度估计重新定义为一个逆问题,利用预训练的潜扩散模型(LDMs)在RGB图像条件下的条件,结合基于立体的几何约束,学习尺度和偏移,以实现准确的深度恢复。我们的无需训练的解决方案可以无缝集成到现有的DB-MDE框架中,并在室内外及复杂环境中泛化。广泛的实验表明,我们的方法在挑战性场景中,特别是在涉及透明和镜面表面的情况下,能够匹配或超越最先进的方法,而无需重新训练。
Summary / 总结
The research introduces GeoDiff, a framework that enhances pretrained diffusion-based monocular depth estimation models with stereo vision guidance to address the challenge of estimating absolute metric depth. By leveraging pretrained latent diffusion models and stereo-based geometric constraints, the method learns scale and shift for accurate depth recovery. Experiments show that GeoDiff matches or surpasses state-of-the-art methods, especially in scenarios with translucent and specular surfaces, without needing retraining.
该论文提出了GeoDiff框架,通过结合预训练的扩散模型和立体视觉几何约束,增强单目深度估计模型,以解决绝对度量深度估计的挑战。通过利用预训练的潜在扩散模型和立体几何约束,该方法学习尺度和偏移以实现准确的深度恢复。实验表明,GeoDiff在涉及透明和镜面表面的复杂场景中,能够达到或超越现有最佳方法的效果,且无需重新训练。
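Illustration (not from the paper): the abstract says that scale and shift are learned so that relative depth becomes metric depth under stereo-based constraints. The sketch below shows what those two parameters mean by fitting them in closed form with ordinary least squares against a few metric depths from a rectified stereo pair. This swaps in a plain regression for the paper's diffusion-based inverse-problem procedure, and the function names are hypothetical.

```python
from typing import List, Tuple


def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Standard rectified-stereo relation: depth = focal length * baseline / disparity."""
    return focal_px * baseline_m / disparity_px


def fit_scale_shift(relative: List[float], metric: List[float]) -> Tuple[float, float]:
    """Least-squares fit of metric ~= scale * relative + shift."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


# Usage: three pixels with relative depth from a monocular model and
# metric depth taken from stereo disparities (all values are made up).
relative_depth = [0.1, 0.2, 0.4]
metric_depth = [1.5, 2.0, 3.0]
scale, shift = fit_scale_shift(relative_depth, metric_depth)
metric_prediction = [scale * r + shift for r in relative_depth]
```

The fit needs at least two pixels with different relative depths. In practice stereo depths are noisy, so a robust variant of the fit would be a natural next step, but that goes beyond what the abstract states.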
StreamingTOM: Streaming Token Compression for Efficient Video Understanding
Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang
First: 2025-10-21T03:39:41+00:00 · Latest: 2025-10-21T03:39:41+00:00
Abstract
Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.
中文标题/摘要
标题:StreamingTOM:流式视频视觉语言模型的高效令牌压缩
与离线处理不同,流式视频视觉语言模型面临两个根本约束:因果性和累积性。因果性阻止了对离线方法利用的未来帧的访问,而累积性导致令牌无限制增长,造成效率瓶颈。然而,现有方法仅调节后LLM kv缓存,而未改变昂贵的前LLM预填充。我们提出了一种无需训练、即插即用的两阶段框架StreamingTOM,该框架通过可预测的延迟同时解决前LLM和后LLM瓶颈。因果时间缩减对每个帧施加固定预算,并基于相邻帧变化和令牌显著性选择令牌,大幅减少每个帧的预填充成本,仅处理每帧的紧凑子集视觉令牌而不是所有视觉令牌。在线量化内存以4位格式存储令牌,在需要时检索相关组并去量化,保持活跃kv缓存的大小不受流长度的影响。实验表明,我们的方法相比之前最佳方案实现了15.7倍的kv缓存压缩、1.2倍的峰值内存降低和2倍的TTFT加速。StreamingTOM在无需训练的方法中保持了最先进的准确率,在离线基准测试中平均为63.8%,在RVS中为55.8%/3.7。这些结果突显了我们两阶段方法在具有受限增长的高效流式视频理解中的实际优势。
Summary / 总结
StreamingTOM is a training-free framework that addresses the challenges of causality and token accumulation in streaming video vision-language models. It introduces Causal Temporal Reduction to impose a fixed per-frame budget and select relevant tokens, and Online Quantized Memory to store and retrieve tokens efficiently. Experiments show that StreamingTOM achieves 15.7 times kv-cache compression, 1.2 times lower peak memory, and 2 times faster TTFT compared to previous state-of-the-art methods, while maintaining competitive accuracy on offline and RVS benchmarks.
StreamingTOM 是一个无需训练的框架,旨在解决流式视频视觉-语言模型中的因果性和标记累积问题。它引入了因果时间缩减来固定每帧的预算并选择相关标记,以及在线量化内存来高效存储和检索标记。实验表明,StreamingTOM 实现了15.7倍的kv缓存压缩、1.2倍更低的峰值内存和2倍更快的TTFT,同时在离线和RVS基准测试中保持了竞争力的准确性。
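Illustration (not from the paper): the abstract names two ingredients, a fixed per-frame token budget and token storage in 4-bit format with dequantization on retrieval. The sketch below shows one simple way each could look. The scores are taken as given (the paper derives them from adjacent-frame changes and token saliency, in a way the abstract does not detail), and the per-group min/max quantizer and all function names are assumptions.

```python
from typing import List, Tuple


def select_tokens(scores: List[float], budget: int) -> List[int]:
    """Keep the indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def quantize_4bit(values: List[float]) -> Tuple[bytes, float, float, int]:
    """Map values to 16 levels and pack two 4-bit codes into each byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 2):
        high = codes[i]
        low = codes[i + 1] if i + 1 < len(codes) else 0  # pad an odd tail
        packed.append((high << 4) | low)
    return bytes(packed), lo, scale, len(values)


def dequantize_4bit(packed: bytes, lo: float, scale: float, count: int) -> List[float]:
    """Unpack the 4-bit codes and map them back to approximate values."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte >> 4)
        codes.append(byte & 0x0F)
    return [lo + c * scale for c in codes[:count]]


# Usage: keep 2 of 4 tokens, then store one token's 5 values in 3 bytes.
kept = select_tokens([0.1, 0.9, 0.5, 0.7], budget=2)
values = [0.0, 1.0, 2.0, 3.0, 1.5]
packed, lo, scale, count = quantize_4bit(values)
restored = dequantize_4bit(packed, lo, scale, count)
```

Each restored value differs from the original by at most half a quantization step, which is the accuracy cost paid for storing two values per byte.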
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
Authors: Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng
First: 2025-06-24T02:37:59+00:00 · Latest: 2025-10-21T03:37:17+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. However, this advancement also introduces novel safety risks, as these models become increasingly vulnerable to harmful multimodal prompts that can trigger unethical or unsafe behaviors. Existing safety alignment approaches, primarily designed for unimodal language models, fall short in addressing the complex and nuanced threats posed by multimodal inputs. Moreover, current safety datasets lack the fine-grained, policy-grounded reasoning required to robustly align reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities. Our data generation pipeline emphasizes multimodal diversity, policy-grounded reasoning, and rigorous quality filtering using strong multimodal judges. Extensive experiments demonstrate that fine-tuning VLMs on MSR-Align substantially improves robustness against both textual and vision-language jailbreak attacks, while preserving or enhancing general reasoning performance. MSR-Align provides a scalable and effective foundation for advancing the safety alignment of reasoning-capable VLMs. Our dataset is made publicly available at https://huggingface.co/datasets/Leigest/MSR-Align.
中文标题/摘要
标题:MSR-Align:基于政策的多模态对齐以实现安全推理的视觉-语言模型
视觉-语言模型(VLMs)通过增强的链式思考能力在多模态推理任务中取得了显著进展。然而,这种进步也带来了新的安全风险,因为这些模型越来越容易受到有害的多模态提示的影响,这些提示可能会触发不道德或不安全的行为。现有的安全对齐方法主要是为单模态语言模型设计的,无法有效应对多模态输入带来的复杂和微妙的威胁。此外,当前的安全数据集缺乏细粒度的、基于政策的推理,这使得对具备推理能力的VLMs进行稳健的对齐变得困难。在本文中,我们提出了MSR-Align,这是一个高质量的多模态安全推理数据集,旨在弥合这一差距。MSR-Align支持在视觉和文本模态之间进行细粒度、审慎的基于政策的推理。我们的数据生成管道强调多模态多样性、基于政策的推理以及严格的质量筛选,使用强大的多模态评判者。广泛的实验表明,对MSR-Align进行微调可以显著提高VLMs对文本和视觉-语言脱缰攻击的鲁棒性,同时保持或提升一般推理性能。MSR-Align为推进具备推理能力的VLMs的安全对齐提供了一个可扩展且有效的基础。我们的数据集已公开发布于https://huggingface.co/datasets/Leigest/MSR-Align。
Summary / 总结
The research aims to address the safety risks in Vision-Language Models (VLMs) by introducing MSR-Align, a multimodal safety reasoning dataset. The method involves fine-grained, policy-grounded reasoning across both vision and text modalities, with a focus on multimodal diversity and rigorous quality filtering. Key findings show that fine-tuning VLMs on MSR-Align enhances robustness against textual and vision-language jailbreak attacks while maintaining or improving general reasoning performance.
研究旨在通过引入MSR-Align多模态安全推理数据集来解决视觉-语言模型(VLM)的安全风险。方法包括跨视觉和文本模态进行细粒度、基于政策的推理,重点是多模态多样性以及严格的质量筛选。关键发现表明,通过MSR-Align对VLM进行微调可以增强其对文本和视觉-语言突破攻击的鲁棒性,同时保持或提升一般推理性能。
UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding
Authors: Da Zhang, Chenggang Rong, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
First: 2025-10-21T03:32:15+00:00 · Latest: 2025-10-21T03:32:15+00:00
Comments: We have released V1, which only reports the test results. Our work is still ongoing, and the next version will be coming soon
Abstract
Large vision-language models (VLMs) have achieved remarkable success in natural scene understanding, yet their application to underwater environments remains largely unexplored. Underwater imagery presents unique challenges including severe light attenuation, color distortion, and suspended particle scattering, while requiring specialized knowledge of marine ecosystems and organism taxonomy. To bridge this gap, we introduce UWBench, a comprehensive benchmark specifically designed for underwater vision-language understanding. UWBench comprises 15,003 high-resolution underwater images captured across diverse aquatic environments, encompassing oceans, coral reefs, and deep-sea habitats. Each image is enriched with human-verified annotations including 15,281 object referring expressions that precisely describe marine organisms and underwater structures, and 124,983 question-answer pairs covering diverse reasoning capabilities from object recognition to ecological relationship understanding. The dataset captures rich variations in visibility, lighting conditions, and water turbidity, providing a realistic testbed for model evaluation. Based on UWBench, we establish three comprehensive benchmarks: detailed image captioning for generating ecologically informed scene descriptions, visual grounding for precise localization of marine organisms, and visual question answering for multimodal reasoning about underwater environments. Extensive experiments on state-of-the-art VLMs demonstrate that underwater understanding remains challenging, with substantial room for improvement. Our benchmark provides essential resources for advancing vision-language research in underwater contexts and supporting applications in marine science, ecological monitoring, and autonomous underwater exploration. Our code and benchmark will be available.
中文标题/摘要
标题:UWBench:一种全面的水下视觉-语言基准
大型视觉-语言模型(VLMs)在自然场景理解方面取得了显著成功,但在水下环境中的应用仍处于起步阶段。水下图像面临着严重的光线衰减、颜色失真和悬浮颗粒散射等独特挑战,同时还需要对海洋生态系统和生物分类学的专业知识。为解决这一问题,我们引入了UWBench,一种专门设计用于水下视觉-语言理解的全面基准。UWBench 包含了15,003张高分辨率的水下图像,涵盖了海洋、珊瑚礁和深海等多种水下环境。每张图像都附有人工验证的注释,包括15,281个描述海洋生物和水下结构的物体引用表达式,以及124,983个涵盖从物体识别到生态关系理解等多种推理能力的问题-答案对。该数据集捕捉了丰富的能见度、光照条件和水体浑浊度的变化,为模型评估提供了现实的测试平台。基于UWBench,我们建立了三个全面的基准:详细的图像描述以生成生态信息丰富的场景描述,视觉定位以精确定位海洋生物,以及视觉问答以进行多模态的水下环境推理。对最先进的VLMs的广泛实验表明,水下理解仍然具有挑战性,有很大的改进空间。我们的基准为推进水下视觉-语言研究提供了必要的资源,并支持海洋科学、生态监测和自主水下探索的应用。我们的代码和基准将可供下载。
Summary / 总结
UWBench is a comprehensive benchmark for underwater vision-language understanding, addressing the unique challenges of underwater imagery. It includes 15,003 high-resolution images with detailed annotations and 124,983 question-answer pairs. Experiments show that current large vision-language models struggle with underwater understanding, highlighting the need for improvement. The benchmark aims to support research in marine science and autonomous underwater exploration.
UWBench 是一个全面的水下视觉-语言理解基准,旨在解决水下图像的独特挑战。它包含15,003张高分辨率图像和详细的注释,以及124,983个问答对。实验表明,当前的VLMs在水下理解方面存在困难,显示出显著的改进空间。该基准旨在支持海洋科学和自主水下探索的研究。
OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion
Authors: Tianyu Huang, Runnan Chen, Dongting Hu, Fengming Huang, Mingming Gong, Tongliang Liu
First: 2025-10-21T03:24:12+00:00 · Latest: 2025-10-21T03:24:12+00:00
Abstract
Understanding 3D scenes is pivotal for autonomous driving, robotics, and augmented reality. Recent semantic Gaussian Splatting approaches leverage large-scale 2D vision models to project 2D semantic features onto 3D scenes. However, they suffer from two major limitations: (1) insufficient contextual cues for individual masks during preprocessing and (2) inconsistencies and missing details when fusing multi-view features from these 2D models. In this paper, we introduce OpenInsGaussian, an Open-vocabulary Instance Gaussian segmentation framework with Context-aware Cross-view Fusion. Our method consists of two modules: Context-Aware Feature Extraction, which augments each mask with rich semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to mitigate alignment errors and incompleteness. Through extensive experiments on benchmark datasets, OpenInsGaussian achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines by a large margin. These findings underscore the robustness and generality of our proposed approach, marking a significant step forward in 3D scene understanding and its practical deployment across diverse real-world scenarios.
中文标题/摘要
标题:OpenInsGaussian: 开放词汇实例高斯分割与上下文感知跨视图融合
理解3D场景对于自动驾驶、机器人技术和增强现实至关重要。近期的语义高斯点云方法利用大规模2D视觉模型将2D语义特征投影到3D场景上。然而,它们存在两个主要局限性:(1) 预处理过程中个体掩码缺乏足够的上下文线索;(2) 从这些2D模型融合多视图特征时出现不一致性和细节缺失。本文提出了一种名为OpenInsGaussian的框架,这是一种具有上下文感知跨视图融合的开放词汇实例高斯分割方法。该方法由两个模块组成:上下文感知特征提取,该模块为每个掩码增加丰富的语义上下文;以及注意力驱动特征聚合,该模块选择性地融合多视图特征以减轻对齐错误和不完整性。通过在基准数据集上的广泛实验,OpenInsGaussian在开放词汇3D高斯分割中取得了最先进的结果,大幅超越现有基线方法。这些发现突显了我们提出方法的稳健性和通用性,标志着3D场景理解及其在各种实际场景中的部署取得了重要进展。
Summary / 总结
OpenInsGaussian is an open-vocabulary instance Gaussian segmentation framework that addresses the limitations of previous methods by incorporating context-aware cross-view fusion. It consists of two modules: Context-Aware Feature Extraction, which enriches each mask with semantic context, and Attention-Driven Feature Aggregation, which selectively fuses multi-view features to improve alignment and completeness. The method achieves state-of-the-art results in open-vocabulary 3D Gaussian segmentation, outperforming existing baselines significantly.
OpenInsGaussian 是一种开放词汇实例高斯分割框架,通过引入上下文感知的多视图融合来解决先前方法的局限性。它包含两个模块:上下文感知特征提取,该模块为每个掩码添加丰富的语义上下文,以及注意力驱动特征聚合,该模块选择性地融合多视图特征以提高对齐和完整性。该方法在开放词汇3D高斯分割中取得了最先进的结果,显著优于现有基线。
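Illustration (not from the paper): the abstract says that Attention-Driven Feature Aggregation selectively fuses multi-view features so that misaligned views matter less. The sketch below shows one generic way to do that, weighting each view by a softmax over its similarity to the mean feature. The query choice, the `temperature` parameter, and the name `fuse_views` are assumptions, since the abstract does not describe the module's internals.

```python
import math
from typing import List, Tuple


def fuse_views(features: List[List[float]], temperature: float = 1.0) -> Tuple[List[float], List[float]]:
    """Fuse per-view feature vectors with softmax attention weights.

    Each view is scored by its dot product with the mean feature, so a
    view that disagrees with the consensus receives a smaller weight.
    """
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    logits = [sum(a * b for a, b in zip(f, mean)) / temperature for f in features]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


# Usage: two views agree and one view is an outlier.
views = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
fused, weights = fuse_views(views)
```

Lowering `temperature` sharpens the weights and suppresses the outlier view more strongly, while a very high temperature approaches plain averaging.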