arXiv 论文速递

2025-11-09 03:26
Snapshot: 20251109_0326
TextRegion: Text-Aligned Region Tokens from Frozen Image-Text Models
Authors: Yao Xiao, Qiqian Fu, Heyi Tao, Yuqun Wu, Zhen Zhu, Derek Hoiem
Venue: Transactions on Machine Learning Research, 2025
First: 2025-05-29T17:59:59+00:00 · Latest: 2025-11-06T18:59:57+00:00
Comments: Published in TMLR, with a J2C Certification
Abstract
Image-text models excel at image-level tasks but struggle with detailed visual understanding. While these models provide strong visual-language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training-free framework that combines the strengths of image-text models and SAM2 to generate powerful text-aligned region tokens. These tokens enable detailed visual understanding while preserving open-vocabulary capabilities. They can be directly applied to various downstream tasks, including open-world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state-of-the-art training-free methods. Additionally, our framework is compatible with many image-text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.
中文标题/摘要
标题:TextRegion: 冻结图像-文本模型的文本对齐区域标记
图像-文本模型在图像级任务上表现出色,但在详细的视觉理解方面存在困难。尽管这些模型提供了强大的视觉-语言对齐,但分割模型如SAM2能够提供精确的空间边界。为此,我们提出了一种简单、有效且无需训练的TextRegion框架,该框架结合了图像-文本模型和SAM2的优点,生成强大的文本对齐区域标记。这些标记能够实现详细的视觉理解,同时保留开放词汇的能力。它们可以直接应用于各种下游任务,包括开放世界语义分割、指示表达理解以及语义定位。我们进行了广泛的评估,并且在与最先进的无需训练方法的比较中,始终取得了优越或竞争力的表现。此外,我们的框架与许多图像-文本模型兼容,使其非常实用且易于扩展,随着更强的模型出现。代码可在:https://github.com/avaxiao/TextRegion 获取。
Summary / 总结
The research aims to enhance detailed visual understanding by combining the visual-language alignment of image-text models with the precise spatial boundaries provided by SAM2. The proposed TextRegion framework generates text-aligned region tokens without additional training, achieving superior or competitive performance in various downstream tasks such as open-world semantic segmentation and referring expression comprehension. This framework is compatible with multiple image-text models, making it practical and extensible for future advancements.
研究旨在通过结合图像-文本模型的视觉-语言对齐能力和SAM2的精确分割能力,提升详细的视觉理解。提出的TextRegion框架生成了文本对齐的区域令牌,这些令牌在开放世界语义分割、指示表达理解和定位等任务中表现优异。评估结果显示,该方法在现有无训练方法中表现出一致的优越或竞争力。
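A rough sketch of how text-aligned region tokens could be used for open-vocabulary labeling. The abstract does not say how region tokens are computed, so the mask-average pooling below is an assumption, and the names `region_token`, `label_regions`, and the toy 2-D features are hypothetical. Segmentation masks (for example from SAM2) and patch features from a frozen image-text model are taken as given inputs.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def region_token(patch_feats, mask):
    """Average the patch features selected by a binary mask (assumed pooling)."""
    picked = [f for f, m in zip(patch_feats, mask) if m]
    if not picked:
        raise ValueError("mask selects no patches")
    dim = len(picked[0])
    return [sum(f[d] for f in picked) / len(picked) for d in range(dim)]

def label_regions(patch_feats, masks, text_embs):
    """Assign each region the class name whose text embedding is most similar."""
    labels = []
    for mask in masks:
        token = region_token(patch_feats, mask)
        labels.append(max(text_embs, key=lambda name: cosine(token, text_embs[name])))
    return labels

# Toy data: four patches, two masks, two class-name embeddings.
patch_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
text_embs = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}
print(label_regions(patch_feats, masks, text_embs))
```

Because the label set is just a dictionary of text embeddings, swapping in new class names needs no retraining, which is the open-vocabulary property the abstract emphasizes.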
DashCLIP: Leveraging multimodal models for generating semantic embeddings for DoorDash
Authors: Omkar Gurjar, Kin Sum Liu, Praveen Kolli, Utsaw Kumar, Mandar Rahurkar
First: 2025-03-18T20:38:31+00:00 · Latest: 2025-11-06T18:08:18+00:00
Abstract
Despite the success of vision-language models in various generative tasks, obtaining high-quality semantic representations for products and user intents is still challenging due to the inability of off-the-shelf models to capture nuanced relationships between the entities. In this paper, we introduce a joint training framework for product and user queries by aligning uni-modal and multi-modal encoders through contrastive learning on image-text data. Our novel approach trains a query encoder with an LLM-curated relevance dataset, eliminating the reliance on engagement history. These embeddings demonstrate strong generalization capabilities and improve performance across applications, including product categorization and relevance prediction. For personalized ads recommendation, a significant uplift in the click-through rate and conversion rate after the deployment further confirms the impact on key business metrics. We believe that the flexibility of our framework makes it a promising solution toward enriching the user experience across the e-commerce landscape.
中文标题/摘要
标题:DashCLIP:利用多模态模型为DoorDash生成语义嵌入
尽管视觉-语言模型在各种生成任务中取得了成功,但由于现成模型无法捕捉实体之间的细微关系,获得高质量的语义表示仍然具有挑战性。在本文中,我们通过对比学习图像-文本数据,引入了一种联合训练产品和用户查询的框架,通过将单模态和多模态编码器对齐。我们的新颖方法使用LLM整理的相关性数据集训练查询编码器,消除了对互动历史的依赖。这些嵌入展示了强大的泛化能力,并在包括产品分类和相关性预测在内的多个应用中提高了性能。对于个性化广告推荐,在部署后点击率和转化率的显著提升进一步证实了对关键业务指标的影响。我们认为,我们框架的灵活性使其成为丰富电子商务领域用户体验的有前途的解决方案。
Summary / 总结
The research aims to generate high-quality semantic embeddings for products and user intents by leveraging multimodal models. The method involves a joint training framework that aligns uni-modal and multi-modal encoders through contrastive learning on image-text data. Key findings include strong generalization capabilities of the embeddings and improved performance in applications such as product categorization and relevance prediction. The approach also enhances click-through rate and conversion rate in personalized ad recommendations, demonstrating its impact on business metrics.
研究旨在通过利用多模态模型生成产品和用户意图的高质量语义嵌入。方法是通过对比学习对图像-文本数据进行联合训练,以对齐单模态和多模态编码器。关键发现包括嵌入的强泛化能力以及在产品分类和相关性预测等应用中的改进性能。该方法还提高了个性化广告推荐的点击率和转化率,证明了其对业务指标的影响。
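The abstract names contrastive learning as the alignment objective but gives no formula. A common choice is an InfoNCE-style loss, sketched below under that assumption; the function name `info_nce`, the temperature value, and the toy embeddings are hypothetical and not taken from the paper.

```python
import math

def info_nce(query_embs, product_embs, temperature=0.1):
    """Mean cross-entropy where query i is expected to match product i."""
    total = 0.0
    for i, q in enumerate(query_embs):
        logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in product_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(query_embs)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # product i matches query i
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # product i mismatches query i
print(info_nce(queries, aligned) < info_nce(queries, shuffled))
```

The loss is small when each query is closest to its own product and large otherwise, which is what pulls the query encoder and the product encoder into a shared space.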
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
First: 2025-11-06T17:25:23+00:00 · Latest: 2025-11-06T17:25:23+00:00
Comments: 36 pages, 14 figures
Abstract
The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings demonstrate that the video generation model is a potential unified multimodal understanding and generation model, positioning "thinking with video" as a unified multimodal reasoning paradigm.
中文标题/摘要
标题:视频思维:视频生成作为有前景的多模态推理范式
"文本思维"和"图像思维"范式显著提高了大型语言模型(LLMs)和视觉语言模型(VLMs)的推理能力。然而,这些范式存在固有的局限性。首先,图像只能捕捉单一时刻,无法表示动态过程或连续变化;其次,文本和视觉作为独立模态的分离,阻碍了统一的多模态理解和生成。为克服这些局限,我们引入了“视频思维”这一新范式,利用视频生成模型(如Sora-2)在统一的时间框架内结合视觉和文本推理。为支持这一探索,我们开发了视频思维基准(VideoThinkBench)。VideoThinkBench 包含两类任务:(1)视觉中心任务(如眼力谜题),(2)文本中心任务(如GSM8K和MMMU的子集)。我们的评估表明Sora-2是一个有效的推理者。在视觉中心任务中,Sora-2通常与最先进的视觉语言模型(SOTA)相当,甚至在某些任务(如眼力游戏)上超过了VLMs。在文本中心任务中,Sora-2在MATH上的准确率为92%,在MMMU上的准确率为75.53%。此外,我们系统地分析了这些能力的来源。我们还发现,自我一致性与上下文学习可以提高Sora-2的性能。总之,我们的研究结果表明,视频生成模型可能是统一的多模态理解和生成模型,将“视频思维”定位为统一的多模态推理范式。
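The abstract reports that self-consistency improves performance. In its usual form, self-consistency samples several independent answers and keeps the most frequent one. The sketch below assumes the final answers have already been extracted from each sampled generation; how that extraction works for video outputs is not described in the abstract, and `self_consistent_answer` is a hypothetical helper name.

```python
from collections import Counter

def self_consistent_answer(answers):
    """Return the most frequent non-empty answer, or None if there is none."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Five sampled generations; one produced no parseable answer.
samples = ["42", "41", "42", None, "42"]
print(self_consistent_answer(samples))
```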
Evo-1: Lightweight Vision-Language-Action Model with Preserved Semantic Alignment
Authors: Tao Lin, Yilei Zhong, Yuxin Du, Jingjing Zhang, Jiting Liu, Yinxinyu Chen, Encheng Gu, Ziyan Liu, Hongyi Cai, Yanwen Zou, Lixing Zou, Zhaoye Zhou, Gen Li, Bo Zhao
First: 2025-11-06T17:07:49+00:00 · Latest: 2025-11-06T17:07:49+00:00
Comments: Github: https://github.com/MINT-SJTU/Evo-1
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful framework that unifies perception, language, and control, enabling robots to perform diverse tasks through multimodal understanding. However, current VLA models typically contain massive parameters and rely heavily on large-scale robot data pretraining, leading to high computational costs during training, as well as limited deployability for real-time inference. Moreover, most training paradigms often degrade the perceptual representations of the vision-language backbone, resulting in overfitting and poor generalization to downstream tasks. In this work, we present Evo-1, a lightweight VLA model that reduces computation and improves deployment efficiency, while maintaining strong performance without pretraining on robot data. Evo-1 builds on a native multimodal Vision-Language model (VLM), incorporating a novel cross-modulated diffusion transformer along with an optimized integration module, together forming an effective architecture. We further introduce a two-stage training paradigm that progressively aligns action with perception, preserving the representations of the VLM. Notably, with only 0.77 billion parameters, Evo-1 achieves state-of-the-art results on the Meta-World and RoboTwin suite, surpassing the previous best models by 12.4% and 6.9%, respectively, and also attains a competitive result of 94.8% on LIBERO. In real-world evaluations, Evo-1 attains a 78% success rate with high inference frequency and low memory overhead, outperforming all baseline methods. We release code, data, and model weights to facilitate future research on lightweight and efficient VLA models.
中文标题/摘要
标题:Evo-1:轻量级视觉-语言-行动模型,保留语义对齐
视觉-语言-行动(VLA)模型已成为一种强大的框架,统一了感知、语言和控制,使机器人能够通过多模态理解执行多种任务。然而,当前的VLA模型通常包含大量参数,并且依赖大规模机器人数据的预训练,导致训练时计算成本高,且实时推理部署能力有限。此外,大多数训练范式往往会降低视觉-语言主干的感知表示,导致过拟合和下游任务泛化能力差。在本研究中,我们提出了Evo-1,这是一种轻量级的VLA模型,减少了计算量并提高了部署效率,同时保持了强大的性能,无需使用机器人数据进行预训练。Evo-1基于原生多模态视觉-语言模型(VLM),结合了一种新颖的交叉调制扩散变换器以及优化的集成模块,共同形成了有效的架构。我们还引入了一种两阶段训练范式,逐步将行动与感知对齐,保留了VLM的表示。值得注意的是,仅有7.7亿(0.77B)参数的Evo-1在Meta-World和RoboTwin套件上取得了最先进的结果,分别超越了之前最佳模型12.4%和6.9%,并在LIBERO上也取得了有竞争力的结果,达到94.8%。在实际评估中,Evo-1以高推理频率和低内存开销实现了78%的成功率,超越了所有基线方法。我们发布了代码、数据和模型权重,以促进轻量级和高效VLA模型的未来研究。
HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model
Authors: Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilcahe Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
First: 2025-06-05T07:26:34+00:00 · Latest: 2025-11-06T15:28:19+00:00
Comments: Project page: https://youngwanlee.github.io/holisafe
Abstract
Despite emerging efforts to enhance the safety of Vision-Language Models (VLMs), current approaches face two main shortcomings. 1) Existing safety-tuning datasets and benchmarks only partially consider how image-text interactions can yield harmful content, often overlooking contextually unsafe outcomes from seemingly benign pairs. This narrow coverage leaves VLMs vulnerable to jailbreak attacks in unseen configurations. 2) Prior methods rely primarily on data-centric tuning, with limited architectural innovations to intrinsically strengthen safety. We address these gaps by introducing a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations, providing a more robust basis for both training and evaluation (HoliSafe-Bench). We further propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images for VLMs. This module endows VLMs with a dual functionality: they not only learn to generate safer responses but can also provide an interpretable harmfulness classification to justify their refusal decisions. A significant advantage of this approach is its modularity; the VGM is designed as a plug-in component, allowing for seamless integration with diverse pre-trained VLMs across various scales. Experiments show that Safe-VLM with VGM, trained on HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks. Additionally, HoliSafe-Bench itself reveals critical vulnerabilities in existing VLMs. We hope that HoliSafe and VGM will spur further research into robust and interpretable VLM safety, expanding future avenues for multimodal alignment.
中文标题/摘要
标题:HoliSafe:视觉语言模型的全面安全基准和建模
尽管已经出现了增强视觉语言模型(VLMs)安全性的努力,但当前的方法存在两个主要不足。1)现有的安全调优数据集和基准仅部分考虑了图像-文本交互可能导致有害内容的问题,经常忽视看似无害的配对所引发的上下文不安全结果。这种狭窄的覆盖范围使VLMs在未见配置中容易受到脱狱攻击。2)先前的方法主要依赖于数据驱动的调优,缺乏对内在增强安全性的架构创新。我们通过引入一个全面的安全数据集和基准——HoliSafe,涵盖了所有五种安全/不安全的图像-文本组合,为训练和评估提供了更坚实的基础(HoliSafe-Bench)。我们还提出了一种新的模块化框架,通过视觉守护模块(VGM)增强VLM的安全性,该模块旨在评估输入图像对VLM的有害性。该模块赋予VLMs双重功能:它们不仅学习生成更安全的响应,还可以提供可解释的有害性分类,以证明其拒绝决策的合理性。这种方法的一个显著优势是其模块化;VGM被设计为插件组件,可以无缝集成到各种规模的预训练VLMs中。实验表明,使用VGM训练的Safe-VLM在多个VLM基准测试中实现了最先进的安全性能。此外,HoliSafe-Bench本身揭示了现有VLM模型中的关键漏洞。我们希望HoliSafe和VGM能够激发更多关于稳健和可解释的VLM安全性的研究,扩展未来多模态对齐的途径。
Evaluating LLM-Contaminated Crowdsourcing Data Without Ground Truth
Authors: Yichi Zhang, Jinlong Pang, Zhaowei Zhu, Yang Liu
First: 2025-06-08T04:38:39+00:00 · Latest: 2025-11-06T15:24:22+00:00
Comments: 32 pages, 7 figures
Abstract
The recent success of generative AI highlights the crucial role of high-quality human feedback in building trustworthy AI systems. However, the increasing use of large language models (LLMs) by crowdsourcing workers poses a significant challenge: datasets intended to reflect human input may be compromised by LLM-generated responses. Existing LLM detection approaches often rely on high-dimensional training data such as text, making them unsuitable for annotation tasks like multiple-choice labeling. In this work, we investigate the potential of peer prediction -- a mechanism that evaluates the information within workers' responses without using ground truth -- to mitigate LLM-assisted cheating in crowdsourcing with a focus on annotation tasks. Our approach quantifies the correlations between worker answers while conditioning on (a subset of) LLM-generated labels available to the requester. Building on prior research, we propose a training-free scoring mechanism with theoretical guarantees under a crowdsourcing model that accounts for LLM collusion. We establish conditions under which our method is effective and empirically demonstrate its robustness in detecting low-effort cheating on real-world crowdsourcing datasets.
中文标题/摘要
标题:无需真实标签的LLM污染众包数据评估
生成式AI的近期成功突显了高质量人类反馈在构建可信赖AI系统中的关键作用。然而,众包工作者越来越多地使用大型语言模型(LLM)带来了重大挑战:旨在反映人类输入的数据可能受到LLM生成响应的污染。现有的LLM检测方法通常依赖于高维训练数据(如文本),使其不适合用于如多项选择标注等注释任务。在本文中,我们研究了同伴预测——一种机制,该机制可以在不使用地面真实性的前提下评估工人响应中的信息——在众包注释任务中对抗LLM辅助作弊的潜力。我们的方法在条件(LLM生成标签的一部分)下量化了工人答案之间的相关性。基于先前的研究,我们提出了一种无需训练的评分机制,并在考虑LLM合谋的众包模型下提供了理论保证。我们确定了该方法有效性的条件,并通过在真实世界众包数据集上进行实证研究,证明了其在检测低努力作弊方面的鲁棒性。
Summary / 总结
This paper addresses the challenge of detecting large language model (LLM)-contaminated data in crowdsourced annotation tasks. It proposes using peer prediction to evaluate worker responses without ground truth, leveraging available LLM-generated labels. The method quantifies answer correlations and is theoretically guaranteed to work under certain conditions. Empirical results show its effectiveness in detecting low-effort cheating on real-world datasets.
该研究利用同伴预测来检测众包数据中由大型语言模型(LLM)造成的污染,无需真实标签即可评估工人的回答。该方法以LLM生成的标签为条件,量化工人答案之间的相关性,从而识别低努力作弊,并在真实世界的众包数据集上展示了其鲁棒性。
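To make the idea of "correlation between worker answers, conditioned on LLM labels" concrete, here is a toy score. It is not the paper's mechanism and carries none of its theoretical guarantees: it simply measures how much two workers agree beyond chance within groups of items that share the same LLM label. The function name `conditioned_agreement` and the data are hypothetical.

```python
from collections import Counter, defaultdict

def conditioned_agreement(worker_a, worker_b, llm_labels):
    """Agreement beyond chance between two workers, within each LLM-label group."""
    groups = defaultdict(list)
    for a, b, l in zip(worker_a, worker_b, llm_labels):
        groups[l].append((a, b))
    n = len(llm_labels)
    score = 0.0
    for pairs in groups.values():
        k = len(pairs)
        observed = sum(a == b for a, b in pairs) / k
        count_a = Counter(a for a, _ in pairs)
        count_b = Counter(b for _, b in pairs)
        # Agreement expected if the two workers answered independently.
        expected = sum(count_a[v] * count_b[v] for v in count_a) / (k * k)
        score += (k / n) * (observed - expected)
    return score

llm = ["x", "x", "x", "y", "y", "y"]
truth = ["x", "y", "x", "y", "x", "y"]
print(conditioned_agreement(llm, llm, llm))      # two workers who copy the LLM
print(conditioned_agreement(truth, truth, llm))  # two workers who report the truth
```

Workers who copy the LLM agree perfectly, but that agreement is fully explained by the LLM label, so their score is zero. Workers who carry information beyond the LLM label keep a positive score.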
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
Authors: Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
First: 2025-11-06T12:19:02+00:00 · Latest: 2025-11-06T12:19:02+00:00
Abstract
We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and are constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks (GUI grounding, screen parsing, and action prediction) and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision-language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset has been made public on https://huggingface.co/datasets/vyokky/GUI-360.
中文标题/摘要
标题:GUI-360:计算机使用代理的综合数据集和基准测试
我们介绍了GUI-360°,这是一个大规模、综合性的数据集和基准测试套件,旨在推动计算机使用代理(CUAs)的发展。CUAs面临独特的挑战,并受到三个持续存在的缺口的限制:现实世界CUA任务的稀缺性、多模态轨迹的自动化收集和注解管道的缺乏,以及缺乏一个统一的基准来联合评估GUI定位、屏幕解析和动作预测。GUI-360°通过一个增强的LLM辅助、主要自动化的查询来源、环境模板构建、任务实例化、批量执行和LLM驱动的质量过滤管道来解决这些缺口。发布的语料库包含超过120万执行的动作步骤,跨越数千个轨迹,涵盖了流行的Windows办公应用程序,并包括全分辨率截图、可用时的无障碍元数据、实例化的目标、中间推理轨迹以及成功和失败的动作轨迹。该数据集支持三个经典任务:GUI定位、屏幕解析和动作预测,以及反映现代代理设计的GUI+API动作空间。在GUI-360°上对最先进的视觉-语言模型进行基准测试揭示了在定位和动作预测方面存在显著的开箱即用的不足;监督微调和强化学习取得了显著的改进,但并未完全弥补到人类水平的可靠性差距。我们发布了GUI-360°及其配套代码,以促进可重复研究并加速对稳健桌面CUAs的进展。完整的数据集已公开发布在https://huggingface.co/datasets/vyokky/GUI-360。
Summary / 总结
GUI-360 addresses three gaps facing computer-using agents (CUAs): a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. The dataset includes over 1.2 million executed action steps from thousands of trajectories in popular Windows office applications, with full-resolution screenshots, accessibility metadata when available, and intermediate reasoning traces. Benchmarking shows that state-of-the-art vision-language models have substantial shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. The dataset is publicly available at https://huggingface.co/datasets/vyokky/GUI-360.
GUI-360通过提供一个全面的数据集和基准套件来解决计算机使用代理所面临的挑战,克服了缺乏真实世界任务、自动化多模态轨迹管道和统一基准的问题,用于评估GUI定位、屏幕解析和动作预测。该数据集包含来自流行Windows办公应用的超过120万的动作步骤,包括全分辨率截图、无障碍元数据和推理轨迹。基准测试显示,最先进的模型在定位和动作预测方面存在显著缺陷,但监督微调和强化学习可以显著提高性能。数据集已公开发布在https://huggingface.co/datasets/vyokky/GUI-360。
TowerVision: Understanding and Improving Multilinguality in Vision-Language Models
Authors: André G. Viveiros, Patrick Fernandes, Saul Santos, Sonal Sannigrahi, Emmanouil Zaranis, Nuno M. Guerreiro, Amin Farajian, Pierre Colombo, Graham Neubig, André F. T. Martins
First: 2025-10-22T17:02:48+00:00 · Latest: 2025-11-06T11:09:11+00:00
Comments: 15 pages, 7 figures, submitted to arXiv October 2025. All models, datasets, and training code will be released at https://huggingface.co/collections/utter-project/towervision
Abstract
Despite significant advances in vision-language models (VLMs), most existing work follows an English-centric design process, limiting their effectiveness in multilingual settings. In this work, we provide a comprehensive empirical study analyzing the impact of several multilingual design choices, such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs for both image-text and video-text tasks, built upon the multilingual text-only model Tower+. TowerVision achieves competitive performance on multiple multimodal multilingual benchmarks and shows particular strength in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, our models surpass existing approaches trained on substantially larger datasets, as demonstrated on ALM-Bench and Multi30K (image tasks) and ViMUL-Bench (video tasks). Alongside the models, we release VisionBlocks, a high-quality, curated vision-language dataset. Our findings highlight that multilingual vision-language training data substantially improves cross-lingual generalization -- both from high-resource to underrepresented languages and vice versa -- and that instruction-tuned LLMs are not always the optimal initialization point. To support further research, we publicly release all models, data, and training recipes.
中文标题/摘要
标题:TowerVision:理解并改进视觉语言模型中的多语言性
尽管在视觉语言模型(VLMs)方面取得了显著进展,但大多数现有工作都遵循以英语为中心的设计过程,限制了它们在多语言环境中的有效性。在本研究中,我们提供了一项全面的经验性研究,分析了多种多语言设计选择的影响,如训练数据组成、编码器选择和文本骨干。结果是TowerVision,一个基于多语言文本模型Tower+的多语言VLM家族,适用于图像文本和视频文本任务。TowerVision在多个跨模态多语言基准测试中取得了竞争力的表现,并在文化背景任务和跨模态翻译方面表现出特别的优势。通过在微调过程中结合视觉和文化背景,我们的模型在ALM-Bench和Multi30K(图像任务)以及ViMUL-Bench(视频任务)上超过了现有在更大数据集上训练的方法。除了模型外,我们还发布了VisionBlocks,一个高质量、精选的视觉语言数据集。我们的研究结果表明,多语言视觉语言训练数据显著提高了跨语言泛化能力——无论是从高资源语言到未充分代表的语言,还是反之亦然——并且指令调优的大规模语言模型并不总是最佳的初始化点。为了支持进一步的研究,我们将在https://huggingface.co/collections/utter-project/towervision上公开发布所有模型、数据和训练配方。
Summary / 总结
The research aims to address the limitations of vision-language models (VLMs) that are predominantly designed for English, thereby affecting their performance in multilingual settings. The study comprehensively analyzes the impact of multilingual design choices such as training data composition, encoder selection, and text backbones. The result is TowerVision, a family of open multilingual VLMs that achieve competitive performance on various multimodal multilingual benchmarks and excel in culturally grounded tasks and multimodal translation. By incorporating visual and cultural context during fine-tuning, TowerVision surpasses existing approaches on ALM-Bench, Multi30K, and ViMUL-Bench. The study also releases VisionBlocks, a curated vision-language dataset, and publicly shares all models, data, and training recipes to support further research.
研究旨在解决现有的视觉语言模型(VLMs)主要针对英语设计的问题,从而影响其在多语言环境中的表现。研究全面分析了多语言设计选择的影响,如训练数据组成、编码器选择和文本骨干。研究结果是TowerVision,一个开放的多语言VLM家族,能够在多种跨模态多语言基准测试中取得竞争力的表现,并在文化基础任务和跨模态翻译方面表现出色。通过在微调过程中结合视觉和文化背景,TowerVision在ALM-Bench、Multi30K和ViMUL-Bench上超越了现有方法。研究还发布了VisionBlocks,一个高质量的视觉语言数据集,并公开分享了所有模型、数据和训练方法以支持进一步的研究。
On the Brittleness of CLIP Text Encoders
Authors: Allie Tran, Luca Rossetto
First: 2025-11-06T10:33:55+00:00 · Latest: 2025-11-06T10:33:55+00:00
Comments: Accepted for publication at MMM'26
Abstract
Multimodal co-embedding models, especially CLIP, have advanced the state of the art in zero-shot classification and multimedia information retrieval in recent years by aligning images and text in a shared representation space. However, such models trained on a contrastive alignment can lack stability towards small input perturbations. Especially when dealing with manually expressed queries, minor variations in the query can cause large differences in the ranking of the best-matching results. In this paper, we present a systematic analysis of the effect of multiple classes of non-semantic query perturbations in a multimedia information retrieval scenario. We evaluate a diverse set of lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. Across models, we find that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. Our results highlight robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.
中文标题/摘要
标题:CLIP文本编码器的脆弱性研究
多模态联合嵌入模型,尤其是CLIP,在近年来通过在共享表示空间中对齐图像和文本方面推动了零样本分类和多媒体信息检索的最新进展。然而,这些在对比对齐上训练的模态可能对小输入扰动缺乏稳定性。特别是在处理手动表达的查询时,查询中的细微变化会导致最佳匹配结果排名的巨大差异。在本文中,我们系统分析了在多媒体信息检索场景中多种非语义查询扰动的影响。我们使用TRECVID即兴视频搜索查询和V3C1视频集合,对多种CLIP变体进行了多种词法、句法和语义扰动的评估。在不同模型中,我们发现句法和语义扰动导致了最大的不稳定性,而脆弱性集中在诸如标点符号和大小写这样的琐碎表面编辑上。我们的结果强调了在基准准确度之外,鲁棒性是评估视觉语言模型的一个关键维度。
Summary / 总结
This paper investigates the brittleness of CLIP text encoders in multimedia information retrieval, focusing on how small, non-semantic query perturbations can lead to significant changes in ranking results. The authors analyze lexical, syntactic, and semantic perturbations across multiple CLIP variants using the TRECVID Ad-Hoc Video Search queries and the V3C1 video collection. They report that syntactic and semantic perturbations drive the largest instabilities, while brittleness is concentrated in trivial surface edits such as punctuation and case. The study emphasizes robustness as a critical dimension for evaluating vision-language models beyond benchmark accuracy.
本文研究了CLIP文本编码器在多媒体信息检索中的脆弱性,重点关注查询中的小变化如何导致排名结果的巨大变化。作者系统分析了不同CLIP变体在TRECVID查询和V3C1视频集合上的各种非语义查询扰动,包括词汇、语法和语义变化。研究发现,语法和语义扰动导致的不稳定性最大,而脆弱性集中在标点符号和大小写等琐碎的表面编辑上。该研究强调,在基准准确度之外,鲁棒性是评估视觉语言模型的一个关键维度。
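A minimal sketch of how ranking stability under surface edits could be measured. The perturbation set and the Jaccard-at-k overlap metric are illustrative assumptions, not the paper's protocol, and the rankings are made-up video IDs rather than output from a CLIP model.

```python
import string

def surface_perturbations(query):
    """Generate a few trivial surface edits of a query (illustrative set)."""
    return {
        "lowercase": query.lower(),
        "no_punctuation": query.translate(str.maketrans("", "", string.punctuation)),
        "trailing_period": query.rstrip(".") + ".",
    }

def jaccard_at_k(ranking_a, ranking_b, k):
    """Overlap of the top-k items of two rankings, ignoring order within the top k."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0

print(surface_perturbations("A person riding a bike."))
# Hypothetical rankings for the original and a perturbed query.
original = ["v1", "v2", "v3"]
perturbed = ["v2", "v1", "v4"]
print(jaccard_at_k(original, perturbed, 3))
```

An overlap close to 1 means the edit left the top results intact; a low overlap for an edit that does not change meaning is the kind of brittleness the paper studies.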
RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability
Authors: Jonggwon Park, Byungmu Yoon, Soobum Kim, Kyoyun Choi
Venue: NeurIPS 2025
First: 2025-04-10T03:14:17+00:00 · Latest: 2025-11-06T09:22:17+00:00
Comments: NeurIPS 2025
Abstract
Recent advancements in multimodal models have significantly improved vision-language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in chest X-ray with zero-shot multi-task capability. A key component of our approach is VL-CABS (Vision-Language Cross-Attention Based on Similarity), which aligns text embeddings with local image features for interpretable, fine-grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi-positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre-trained vision encoder with additional trainable Transformer layers, allowing efficient high-resolution image processing. By computing similarity between text embeddings and local image patch features, VL-CABS enables zero-shot inference with similarity probability for classification, and pixel-level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL-CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open-vocabulary semantic segmentation, further validating its effectiveness in medical imaging. Code is available at https://github.com/deepnoid-ai/RadZero.
中文标题/摘要
标题:RadZero:基于相似性的跨注意力在胸部X光中的可解释视觉-语言对齐,具备零样本多任务能力
近年来,多模态模型在放射学中的视觉-语言(VL)对齐方面取得了显著进步。然而,现有方法难以有效利用复杂的放射学报告进行学习,并且通过注意力概率可视化提供的可解释性有限。为了解决这些挑战,我们提出了RadZero,一种具备零样本多任务能力的新型胸部X光VL对齐框架。我们方法的关键组件是VL-CABS(基于相似性的视觉-语言跨注意力),它通过将文本嵌入与局部图像特征对齐,实现可解释的细粒度VL推理。RadZero 利用大型语言模型从放射学报告中提取简洁的语义句子,并采用多正样本对比训练来有效捕捉图像与多个相关文本描述之间的关系。它使用预训练的视觉编码器和额外的可训练Transformer层,实现高效的高分辨率图像处理。通过计算文本嵌入与局部图像块特征之间的相似性,VL-CABS 能够利用相似性概率进行零样本分类推理,并生成像素级的VL相似性图用于定位和分割。在公共胸部X光基准测试上的实验结果表明,RadZero 在零样本分类、定位和分割方面优于现有最先进的方法。此外,VL相似性图分析突显了VL-CABS在提高VL对齐可解释性方面的潜力。定性评估还展示了RadZero 在开放词汇语义分割中的能力,进一步验证了其在医学成像中的有效性。代码可在 https://github.com/deepnoid-ai/RadZero 获取。
Summary / 总结
RadZero is a novel framework for vision-language alignment in chest X-ray with zero-shot multi-task capability. It introduces VL-CABS, a similarity-based cross-attention mechanism that aligns text embeddings with local image features for interpretable reasoning. RadZero leverages large language models to extract semantic sentences from radiology reports and uses multi-positive contrastive training to capture image-text relationships. Experimental results show that RadZero outperforms state-of-the-art methods in zero-shot classification, grounding, and segmentation, and VL similarity map analysis highlights the potential of VL-CABS for improving explainability in vision-language alignment.
RadZero 是一种用于胸部 X 光片的零样本多任务视觉-语言对齐框架,引入了基于相似性的视觉-语言交叉注意力机制 VL-CABS,用于可解释的细粒度视觉-语言推理。RadZero 利用大型语言模型从放射学报告中提取语义句子,并使用多正样本对比训练来捕捉图像与文本描述之间的关系。实验结果表明,RadZero 在零样本分类、定位和分割方面优于现有最佳方法,并突出了 VL-CABS 在视觉-语言对齐中提高可解释性的潜力。
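The abstract mentions multi-positive contrastive training, where one image can match several sentences extracted from its report. One common way to write such a loss, assumed here and not confirmed by the abstract, averages the softmax cross-entropy over all positive sentences. The function name `multi_positive_loss`, the temperature, and the similarity values are hypothetical.

```python
import math

def multi_positive_loss(sim_row, positive_idx, temperature=0.1):
    """Average negative log-softmax over all positive texts for one image."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # stable log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return sum(log_denom - logits[j] for j in positive_idx) / len(positive_idx)

# Similarities of one image to three sentences; the first two are its positives.
good = multi_positive_loss([0.9, 0.8, 0.1], [0, 1])
bad = multi_positive_loss([0.1, 0.2, 0.9], [0, 1])
print(good < bad)
```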
Text to Sketch Generation with Multi-Styles
Authors: Tengjie Li, Shikui Tu, Lei Xu
Venue: NeurIPS 2025
First: 2025-11-06T07:13:56+00:00 · Latest: 2025-11-06T07:13:56+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.
中文标题/摘要
标题:基于多风格的文本到草图生成
视觉语言模型的最新进展促进了草图生成的进步。然而,现有的专门方法主要侧重于通用合成,缺乏对草图风格的精确控制机制。在此工作中,我们提出了一种基于扩散模型的无需训练框架,通过文本提示和参考风格草图实现显式的风格指导。与之前的方法不同,我们通过线性平滑将参考特征作为辅助信息纳入,并利用风格-内容指导机制。这种设计有效地减少了参考草图中的内容泄露,提高了合成质量,尤其是在参考草图与目标草图结构相似度低的情况下。此外,我们通过结合多个参考草图的特征并由联合AdaIN模块协调,将框架扩展以支持可控的多风格生成。大量实验表明,我们的方法实现了高质量的草图生成,具有准确的风格对齐和增强的风格控制灵活性。M3S的官方实现可在https://github.com/CMACH508/M3S获得。
Summary / 总结
This work addresses the limitation of existing methods in controlling sketch styles precisely by proposing a training-free framework based on diffusion models. The framework uses textual prompts and referenced style sketches for explicit style guidance, and incorporates reference features with linear smoothing to enhance synthesis quality. The method also supports multi-style generation by integrating features from multiple reference sketches. Experiments show that the approach generates high-quality sketches with accurate style alignment and improved style control flexibility.
该研究针对现有方法在草图生成中的局限性,提出了一种基于扩散模型的无需训练框架。该框架通过文本提示和参考草图实现显式的风格指导,特别是在参考草图与目标草图结构相似度低的情况下,提高了合成质量。该框架进一步扩展以支持多风格生成,通过联合AdaIN模块整合多个参考草图的特征。实验表明,该方法能够实现高质量的草图生成,具有准确的风格对齐和增强的风格控制灵活性。
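For readers unfamiliar with AdaIN, the sketch below shows standard adaptive instance normalization on a single feature channel: content features are normalized and then rescaled to the style's mean and standard deviation. The abstract does not describe how the joint AdaIN module combines several references, so `joint_adain`, which blends style statistics with user-chosen weights, is purely an illustrative assumption.

```python
import math

def mean_std(xs, eps=1e-5):
    """Mean and (epsilon-stabilized) standard deviation of a feature channel."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var + eps)

def adain(content, style):
    """Standard AdaIN: match the content channel's statistics to the style's."""
    mu_c, sd_c = mean_std(content)
    mu_s, sd_s = mean_std(style)
    return [sd_s * (x - mu_c) / sd_c + mu_s for x in content]

def joint_adain(content, styles, weights):
    """Hypothetical multi-style variant: blend style statistics by weight."""
    stats = [mean_std(s) for s in styles]
    mu_s = sum(w * mu for w, (mu, _) in zip(weights, stats))
    sd_s = sum(w * sd for w, (_, sd) in zip(weights, stats))
    mu_c, sd_c = mean_std(content)
    return [sd_s * (x - mu_c) / sd_c + mu_s for x in content]

out = adain([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
mixed = joint_adain([1.0, 2.0, 3.0], [[10.0, 20.0, 30.0], [0.0, 0.0, 0.0]], [0.5, 0.5])
print(sum(out) / len(out), sum(mixed) / len(mixed))
```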
Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration
Authors: Yunghee Lee, Byeonghyun Pak, Junwha Hong, Hoseong Kim
Venue: NeurIPS 2025
First: 2025-11-06T07:08:58+00:00 · Latest: 2025-11-06T07:08:58+00:00
Comments: 21 pages, 8 figures. NeurIPS 2025. Project page: https://yhlee-add.github.io/THG
Abstract
In this paper, we propose Tortoise and Hare Guidance (THG), a training-free strategy that accelerates diffusion sampling while maintaining high-fidelity generation. We demonstrate that the noise estimate and the additional guidance term exhibit markedly different sensitivity to numerical error by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. Our error-bound analysis shows that the additional guidance branch is more robust to approximation, revealing substantial redundancy that conventional solvers fail to exploit. Building on this insight, THG significantly reduces the computation of the additional guidance: the noise estimate is integrated with the tortoise equation on the original, fine-grained timestep grid, while the additional guidance is integrated with the hare equation only on a coarse grid. We also introduce (i) an error-bound-aware timestep sampler that adaptively selects step sizes and (ii) a guidance-scale scheduler that stabilizes large extrapolation spans. THG reduces the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward $\leq$ 0.032) and outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets. Our findings highlight the potential of multirate formulations for diffusion solvers, paving the way for real-time high-quality image synthesis without any model retraining. The source code is available at https://github.com/yhlee-add/THG.
Summary / 总结
Tortoise and Hare Guidance (THG) is a training-free method that accelerates diffusion model inference by reformulating the classifier-free guidance (CFG) ODE as a multirate system of ODEs. THG integrates the noise estimate on the original fine-grained timestep grid and the additional guidance term only on a coarse grid, reducing the number of function evaluations (NFE) by up to 30% with virtually no loss in generation fidelity ($\Delta$ImageReward $\leq$ 0.032). The method also includes an error-bound-aware timestep sampler that adaptively selects step sizes and a guidance-scale scheduler that stabilizes large extrapolation spans. THG outperforms state-of-the-art CFG-based training-free accelerators under identical computation budgets.
Tortoise and Hare Guidance (THG) 是一种无需训练的方法,能够在保持高保真生成的同时加速扩散模型的推理。通过将分类器自由引导(CFG)的 ODE 重新公式化为多速率系统,THG 在细网格上积分噪声估计,在粗网格上积分附加引导,最多可减少 30% 的函数评估次数,而不牺牲生成质量。该方法还包括一个自适应时间步长采样器和一个引导尺度调度器,进一步稳定过程。THG 在相同的计算预算下优于现有基于 CFG 的加速器。
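The multirate idea in the THG entry above can be illustrated with a toy sampler. This is only a sketch under stated assumptions, not the authors' solver: the Euler update, the choice of the conditional estimate as the branch evaluated at every step, and the names `multirate_cfg_sample`, `toy_eps_cond`, `toy_eps_uncond`, and `coarse_every` are all hypothetical. It shows how refreshing the guidance term only on a coarse grid lowers the number of function evaluations (NFE).

```python
def multirate_cfg_sample(x, timesteps, eps_cond, eps_uncond, scale, coarse_every):
    """Toy Euler loop: the guidance term is refreshed only every `coarse_every` steps."""
    nfe = 0          # number of model evaluations
    guidance = 0.0   # cached guidance term (conditional minus unconditional)
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        e_cond = eps_cond(x, t)          # evaluated on the fine grid (every step)
        nfe += 1
        if i % coarse_every == 0:        # coarse grid: recompute the guidance term
            guidance = e_cond - eps_uncond(x, t)
            nfe += 1
        eps = e_cond + (scale - 1.0) * guidance  # standard CFG combination
        x = x + (t_next - t) * eps       # simplified Euler step (assumption)
    return x, nfe


# Hypothetical stand-ins for a denoiser; a real model would be a neural network.
def toy_eps_cond(x, t):
    return -x + 0.1


def toy_eps_uncond(x, t):
    return -x


timesteps = [1.0 - i / 10 for i in range(11)]  # 10 sampling steps
x_full, nfe_full = multirate_cfg_sample(1.0, timesteps, toy_eps_cond, toy_eps_uncond, 3.0, 1)
x_multi, nfe_multi = multirate_cfg_sample(1.0, timesteps, toy_eps_cond, toy_eps_uncond, 3.0, 2)
print(nfe_full, nfe_multi)
```

In this toy setting the guidance term is constant, so reusing it loses nothing. The paper's error-bound-aware timestep sampler and guidance-scale scheduler, which address the realistic case, are not modeled here.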
FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models
Authors: Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao
Venue: NeurIPS 2025
First: 2025-10-13T09:22:12+00:00 · Latest: 2025-11-06T06:08:08+00:00
Comments: 19 pages, 11 figures. Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs' adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping a model's associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.
中文标题/摘要
标题:FlexAC:向多模态大型语言模型中灵活控制关联推理的方向
多模态大型语言模型(MLLMs)在忠实性和创造性之间存在固有的权衡,因为不同任务需要不同程度的关联推理。然而,现有方法缺乏调节这种推理强度的灵活性,限制了MLLMs在事实性和创造性场景中的适应性。为了解决这一问题,我们提出为MLLMs配备机制,使其能够灵活控制关联推理。我们首先研究了MLLMs中关联行为的内部机制,并发现:(1) 中间层在塑造模型的关联倾向中起着关键作用,(2) 修改这些层中的表示可以有效地调节关联推理强度,(3) 可以利用幻觉来推导出引导这种调节的引导向量。基于这些发现,我们引入了灵活关联控制(FlexAC),这是一种轻量级且无需训练的框架,用于调节MLLMs中的关联行为。FlexAC 首先通过幻觉引导的中间表示来编码关联方向。然后,它选择高关联实例来构建有效的关联引导向量,其强度会根据创造性指导与输出稳定性之间的平衡进行自适应校准。最后,考虑到关联推理的多维性质,FlexAC 结合了从少量目标领域样本前向传递中提取的任务特定关联向量,使模型能够遵循多种关联方向,更好地适应创造性任务。值得注意的是,我们的方法在Creation-MMBench上的创造性提高了5.8倍,在CHAIR上的幻觉率降低了29%,超过了现有基线,证明了其在MLLMs中实现灵活控制关联推理的有效性。我们的代码可在https://github.com/ylhz/FlexAC/获取。
Summary / 总结
The paper addresses the challenge of controlling associative reasoning in multimodal large language models (MLLMs) to balance faithfulness and creativity. It proposes FlexAC, a lightweight framework that modulates associative reasoning by inducing hallucination-guided intermediate representations and adaptively calibrating associative steering vectors. FlexAC improves creativity by up to 5.8x on Creation-MMBench and reduces hallucination rate by 29% on CHAIR, outperforming existing methods in enabling flexible control over associative reasoning in MLLMs.
论文旨在解决在多模态大型语言模型(MLLMs)中控制联想推理以平衡忠实性和创造力的挑战。它提出了FlexAC框架,通过诱导幻觉引导的中间表示并适配地校准联想引导向量来调节联想推理。FlexAC在Creation-MMBench上将创造力提高至最多5.8倍,在CHAIR上将幻觉率降低29%,优于现有方法,展示了其在MLLMs中实现联想推理灵活控制的有效性。
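The steering-vector idea in the FlexAC entry above can be sketched as follows. This is a minimal illustration under assumptions, not the FlexAC implementation: the mean-difference construction is a common steering recipe that the abstract does not confirm, the fixed `alpha` stands in for the adaptive calibration the authors describe, and all function names are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(associative_states, faithful_states):
    """Assumed recipe: difference between mean hidden states of the two groups."""
    mean_assoc = mean_vector(associative_states)
    mean_faith = mean_vector(faithful_states)
    return [a - f for a, f in zip(mean_assoc, mean_faith)]


def steer(hidden, vector, alpha):
    """Shift a middle-layer hidden state along the steering direction.

    alpha > 0 pushes toward associative behavior, alpha < 0 toward faithfulness.
    """
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional hidden states (hypothetical values).
vec = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
more_creative = steer([0.0, 0.0], vec, 0.5)
more_faithful = steer([0.0, 0.0], vec, -0.5)
print(vec, more_creative, more_faithful)
```

A single signed scalar is the simplest way to express the faithfulness/creativity dial; the task-specific associative vectors mentioned in the abstract would add further directions on top of this.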
Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
Authors: Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull
First: 2025-06-12T19:14:00+00:00 · Latest: 2025-11-06T05:07:51+00:00
Abstract
Maintaining good driving behavior in out-of-distribution scenarios remains a critical challenge in autonomous driving. A promising direction is to leverage the generalist knowledge and reasoning capabilities of large-language models by treating unusual driving scenarios as a logical reasoning task. In this work, we present Poutine, a method that uses an off-the-shelf 3B-parameter vision-language model (VLM) - without any additional components - to achieve robust end-to-end autonomous driving via a simple and scalable training recipe. To learn strong base driving capabilities, we first train Poutine-Base using self-supervised next-token prediction over vision, language, and trajectory (VLT) tokens, leveraging both nominal and long-tail driving data. In the second stage, we fine-tune Poutine-Base using Group Relative Policy Optimization (GRPO) with a small set of human preference-labeled examples. We evaluated our approach on the Waymo end-to-end driving benchmark curated for long-tail scenarios. The final Poutine model achieves an RFS of 7.99 on the test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. Our results suggest that handcrafted tokenizers or custom architectural components added to base VLMs in prior work are not necessary to achieve strong driving performance. Instead, this work highlights the potential of scalable VLT pretraining combined with lightweight RL fine-tuning to enable robust and generalizable autonomous driving.
中文标题/摘要
标题:普廷:视觉-语言-轨迹预训练和强化学习后训练实现稳健的端到端自动驾驶
在异常驾驶场景中保持良好的驾驶行为仍然是自动驾驶领域的关键挑战。一种有前景的方向是利用大型语言模型的通用知识和推理能力,将异常驾驶场景视为逻辑推理任务。在本工作中,我们提出了普廷方法,该方法仅使用一个现成的30亿参数视觉-语言模型(VLM),无需任何额外组件,通过简单的可扩展训练食谱实现稳健的端到端自动驾驶。为了学习强大的基础驾驶能力,我们首先使用自我监督的下一个标记预测方法对普廷基础模型进行预训练,利用标准和长尾驾驶数据。在第二阶段,我们使用组相对策略优化(GRPO)对普廷基础模型进行微调,使用少量的人类偏好标注示例。我们在为长尾场景定制的Waymo端到端驾驶基准上评估了我们的方法。最终的普廷模型在测试集上的RFS为7.99,在2025年Waymo基于视觉的端到端驾驶挑战赛中以显著优势获得第一名。我们的结果表明,在先前的工作中,添加手工制作的分词器或自定义架构组件以增强基础VLM并不是实现强大驾驶性能所必需的。相反,本工作强调了可扩展的视觉-语言-轨迹预训练与轻量级的强化学习微调相结合的潜力,以实现稳健和泛化的自动驾驶。
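The second training stage of the Poutine entry above relies on Group Relative Policy Optimization (GRPO), whose core step scores each sampled output relative to the other outputs in its group. The sketch below shows only that advantage computation in its commonly described form (reward minus group mean, divided by group standard deviation). The reward values are hypothetical, and whether Poutine uses exactly this normalization is not stated in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical preference-based rewards for three sampled trajectories.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat = group_relative_advantages([5.0, 5.0, 5.0])  # identical rewards give zero signal
print(advantages, flat)
```

Because the baseline comes from the group itself, no separate value network is needed, which is consistent with the "lightweight RL fine-tuning" the abstract describes.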
Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding
Authors: Zhuoming Li, Aitong Liu, Mengxi Jia, Yubi Lu, Tengxiang Zhang, Changzhi Sun, Dell Zhang, Xuelong Li
First: 2025-10-21T14:46:48+00:00 · Latest: 2025-11-06T01:45:32+00:00
Comments: IMWUT 2025
Abstract
Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution GestureGPT suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different styles, we introduce a Landmark Processing Module that compensates for LVLMs' lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic understanding and significantly enhancing the model's ability to interpret ambiguous or unconventional gestures. Together, these components allow Gestura to achieve robust and adaptable free-form gesture comprehension. Additionally, we have developed the first open-source dataset for free-form gesture intention reasoning and understanding with over 300,000 annotated QA pairs.
中文标题/摘要
标题:Gestura:一个基于LVLM的实时自由手势理解系统
自由手势理解对于人机交互具有高度吸引力,因为它使用户摆脱了预定义手势类别的限制。然而,现有的唯一解决方案GestureGPT在识别准确性和响应速度方面存在局限。本文提出了一种端到端的自由手势理解系统Gestura。Gestura利用预训练的大规模视觉-语言模型(LVLM)将自由手势的高度动态和多样化模式与高层次语义概念对齐。为了更好地捕捉不同风格下的细微手部动作,我们引入了一种地标处理模块,通过嵌入解剖手部先验知识来弥补LVLM在细粒度领域知识方面的不足。此外,一种逐步推理策略(CoT)使模型能够逐步进行语义推理,将浅层知识转化为深层语义理解,显著增强了模型对模糊或非传统手势的解释能力。这些组件共同使Gestura能够实现稳健且适应性强的自由手势理解。此外,我们还开发了首个用于自由手势意图推理和理解的开源数据集,包含超过30万条标注的问答对。
Summary / 总结
Gestura is an end-to-end system for free-form gesture understanding that leverages a pre-trained Large Vision-Language Model (LVLM) to align dynamic gestures with high-level semantics. It includes a Landmark Processing Module that embeds anatomical hand priors to capture fine-grained hand movements and a Chain-of-Thought reasoning strategy for step-by-step semantic inference. The system targets the limited recognition accuracy and slow response times of GestureGPT, the sole existing solution. The authors also release the first open-source dataset for free-form gesture intention reasoning and understanding, with over 300,000 annotated QA pairs.
Gestura 是一个端到端的自由手势理解系统,利用大型视觉语言模型(LVLM)将动态手势与高层次语义对齐。它包含一个关键点处理模块以增强对手部细微动作的捕捉,并采用逐步推理策略进行深入语义推理。该系统在识别准确性和响应时间方面优于现有解决方案如 GestureGPT。主要发现包括稳健的手势理解以及开发了一个包含超过30万标注问答对的开源数据集,用于自由手势意图推理和理解。
Seg the HAB: Language-Guided Geospatial Algae Bloom Reasoning and Segmentation
Authors: Patterson Hsieh, Jerry Yeh, Mao-Chi He, Wen-Han Hsieh, Elvis Hsieh
First: 2025-10-21T15:59:00+00:00 · Latest: 2025-11-05T22:17:59+00:00
Abstract
Climate change is intensifying the occurrence of harmful algal blooms (HABs), particularly cyanobacteria, which threaten aquatic ecosystems and human health through oxygen depletion, toxin release, and disruption of marine biodiversity. Traditional monitoring approaches, such as manual water sampling, remain labor-intensive and limited in spatial and temporal coverage. Recent advances in vision-language models (VLMs) for remote sensing have shown potential for scalable AI-driven solutions, yet challenges remain in reasoning over imagery and quantifying bloom severity. In this work, we introduce ALGae Observation and Segmentation (ALGOS), a segmentation-and-reasoning system for HAB monitoring that combines remote sensing image understanding with severity estimation. Our approach integrates GeoSAM-assisted human evaluation for high-quality segmentation mask curation and fine-tunes a vision-language model on severity prediction using the Cyanobacteria Aggregated Manual Labels (CAML) from NASA. Experiments demonstrate that ALGOS achieves robust performance on both segmentation and severity-level estimation, paving the way toward practical and automated cyanobacterial monitoring systems.
中文标题/摘要
标题:Seg the HAB:语言引导的地理空间水华推理与分割
气候变化加剧了有害藻华(HAB)的发生,特别是蓝细菌,它们通过耗氧、毒素释放和破坏海洋生物多样性威胁到水生生态系统和人类健康。传统的监测方法,如手工水样采集,仍然劳动密集型且在空间和时间覆盖方面有限。最近在遥感领域的视觉-语言模型(VLMs)取得了进展,显示出可扩展的人工智能驱动解决方案的潜力,但在图像推理和量化水华严重程度方面仍面临挑战。在本研究中,我们引入了藻华观测与分割(ALGOS)系统,这是一种结合遥感图像理解和严重程度估计的分割和推理系统。我们的方法结合了GeoSAM辅助的人类评估以获得高质量的分割掩码,并在NASA提供的蓝细菌聚合手动标签(CAML)上微调视觉语言模型以进行严重程度预测。实验表明,ALGOS在分割和严重程度估计方面均表现出稳健的性能,为实用和自动化的蓝细菌监测系统铺平了道路。
Summary / 总结
This study addresses the intensifying issue of harmful algal blooms (HAB), particularly cyanobacteria, which pose threats to aquatic ecosystems and human health. Traditional monitoring methods are labor-intensive and have limited coverage. The research introduces ALGae Observation and Segmentation (ALGOS), a system that combines remote sensing image understanding with severity estimation. ALGOS uses GeoSAM-assisted human evaluation for high-quality segmentation and fine-tunes a vision-language model with NASA's Cyanobacteria Aggregated Manual Labels (CAML). The experiments show that ALGOS performs robustly in both segmentation and severity estimation, advancing the development of practical and automated cyanobacterial monitoring systems.
研究针对气候变暖导致的有害藻华(HAB)加剧,威胁水生生态系统和人类健康的问题。该研究引入了ALGae Observation and Segmentation(ALGOS)系统,结合遥感图像理解和严重程度估计。ALGOS利用GeoSAM辅助的人类评估进行高质量分割,并使用NASA的Cyanobacteria Aggregated Manual Labels(CAML)对视觉语言模型进行微调。实验表明,ALGOS在分割和严重程度估计方面表现出色,推动了蓝细菌监测系统的自动化发展。
DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination
Authors: Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei
First: 2024-10-06T15:12:09+00:00 · Latest: 2025-11-05T20:57:25+00:00
Comments: Accepted by EMNLP 2024 (Main Conference), add GitHub link
Abstract
Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize the redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that $D$ives into the $A$ttention $M$echanism of LVLM to $R$educe $O$bject Hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code is released at https://github.com/coder-gx/DAMRO.
中文标题/摘要
标题:DAMRO:深入LVLM的注意力机制以减少物体幻觉
尽管大型视觉-语言模型(LVLM)取得了巨大的成功,但它们不可避免地会遭受幻觉问题。我们知道,LVLM中的视觉编码器和大型语言模型(LLM)解码器都是基于Transformer的,允许模型通过注意力机制提取视觉信息并生成文本输出。我们发现,LLM解码器对图像标记的注意力分布与视觉编码器高度一致,两者都倾向于关注特定的背景标记而不是图像中的目标物体。我们将这种意外的注意力分布归因于视觉编码器本身固有的缺陷,这误导了LLM过度强调冗余信息并生成物体幻觉。为了解决这一问题,我们提出了一种名为DAMRO的新型无训练策略,即通过深入LVLM的注意力机制来减少物体幻觉。具体而言,我们的方法利用ViT的分类标记(CLS)来过滤出散布在背景中的高注意力异常标记,然后在解码阶段消除它们的影响。我们使用POPE、CHAIR、MME和GPT-4V辅助评估等基准对包括LLaVA-1.5、LLaVA-NeXT和InstructBLIP在内的LVLM进行了评估。结果表明,我们的方法显著减少了这些异常标记的影响,从而有效缓解了LVLM的幻觉问题。代码已发布于https://github.com/coder-gx/DAMRO。
Summary / 总结
The research aims to address object hallucination in Large Vision-Language Models (LVLMs) by analyzing the attention mechanisms within the models. The authors observe that the LLM decoder's attention over image tokens mirrors that of the visual encoder, with both focusing on particular background tokens rather than the referred objects. Their training-free method, DAMRO, uses the ViT classification (CLS) token to filter out high-attention outlier tokens in the background and then eliminates their influence during the decoding stage. Experimental results show that DAMRO effectively alleviates object hallucination in LVLMs such as LLaVA-1.5, LLaVA-NeXT, and InstructBLIP, as evaluated on benchmarks including POPE, CHAIR, MME, and GPT-4V Aided Evaluation.
研究旨在通过分析大型视觉语言模型(LVLM)中的注意力机制来解决对象幻觉问题。方法DAMRO关注LLM解码器的注意力分布,并利用分类标记过滤背景中的高注意力异常标记,从而在解码阶段减少这些标记的影响。实验结果表明,DAMRO有效缓解了LLaVA-1.5、LLaVA-NeXT和InstructBLIP等LVLM中的对象幻觉问题,这些评估基于POPE、CHAIR、MME和GPT-4V辅助评估等基准。
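The filtering step in the DAMRO entry above can be illustrated with a toy example. This is a sketch under assumptions rather than the released method: the top-k selection rule, the zero-and-renormalize masking, and the names `outlier_token_indices` and `mask_tokens` are hypothetical, since the abstract does not say how outliers are chosen or how their influence is removed during decoding.

```python
def outlier_token_indices(cls_attention, top_k):
    """Indices of the image tokens that receive the most CLS attention (assumed rule)."""
    ranked = sorted(range(len(cls_attention)), key=lambda i: cls_attention[i], reverse=True)
    return sorted(ranked[:top_k])


def mask_tokens(attn_weights, outliers):
    """Zero out the flagged tokens and renormalize the remaining weights."""
    kept = [0.0 if i in outliers else w for i, w in enumerate(attn_weights)]
    total = sum(kept)
    return [w / total for w in kept] if total > 0 else kept


# Hypothetical CLS-to-patch attention from the visual encoder (5 image tokens).
cls_attention = [0.05, 0.40, 0.05, 0.30, 0.20]
outliers = outlier_token_indices(cls_attention, top_k=2)

# Hypothetical decoder attention over the same image tokens.
masked = mask_tokens([0.1, 0.4, 0.1, 0.3, 0.1], outliers)
print(outliers, masked)
```

The point of the sketch is only the data flow: encoder-side CLS attention identifies suspicious tokens, and that list is then applied on the decoder side.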
Comparing Computational Pathology Foundation Models using Representational Similarity Analysis
Authors: Vaibhav Mishra, William Lotter
First: 2025-09-18T23:01:13+00:00 · Latest: 2025-11-05T20:38:54+00:00
Comments: Proceedings of the 5th Machine Learning for Health (ML4H) Symposium
Abstract
Foundation models are increasingly developed in computational pathology (CPath) given their promise in facilitating many downstream tasks. While recent studies have evaluated task performance across models, less is known about the structure and variability of their learned representations. Here, we systematically analyze the representational spaces of six CPath foundation models using techniques popularized in computational neuroscience. The models analyzed span vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI (v2), Virchow (v2), Prov-GigaPath) approaches. Through representational similarity analysis using H&E image patches from TCGA, we find that UNI2 and Virchow2 have the most distinct representational structures, whereas Prov-GigaPath has the highest average similarity across models. Having the same training paradigm (vision-only vs. vision-language) did not guarantee higher representational similarity. The representations of all models showed a high slide-dependence, but relatively low disease-dependence. Stain normalization decreased slide-dependence for all models by a range of 5.5% (CONCH) to 20.5% (PLIP). In terms of intrinsic dimensionality, vision-language models demonstrated relatively compact representations, compared to the more distributed representations of vision-only models. These findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations. Our framework is extendable across medical imaging domains, where probing the internal representations of foundation models can support their effective development and deployment.
中文标题/摘要
标题:使用表征相似性分析比较计算病理学基础模型
计算病理学(CPath)中越来越多地开发基础模型,因其在促进许多下游任务方面具有潜力。尽管最近的研究已经评估了模型在任务性能上的表现,但关于它们学习的表征结构和变异性了解较少。在这里,我们使用计算神经科学中流行的技巧系统地分析了六个CPath基础模型的表征空间。分析的模型涵盖了视觉-语言对比学习(如CONCH、PLIP、KEEP)和自我蒸馏(如UNI (v2)、Virchow (v2)、Prov-GigaPath)的方法。通过使用TCGA的HE图像片段进行表征相似性分析,我们发现UNI2和Virchow2具有最不同的表征结构,而Prov-Gigapath在模型间具有最高的平均相似性。相同的训练范式(视觉仅限 vs. 视觉-语言)并不保证更高的表征相似性。所有模型的表征都显示出较高的玻片依赖性,但相对较低的疾病依赖性。染色标准化在不同模型中降低了玻片依赖性,范围从CONCH的5.5%到PLIP的20.5%。在固有维度方面,视觉-语言模型展示了相对紧凑的表征,而视觉仅限模型则具有更分散的表征。这些发现突显了提高对玻片特定特征鲁棒性的机会,指导了模型集成策略,并提供了关于训练范式如何塑造模型表征的见解。我们的框架可以在医学成像领域扩展,其中探索基础模型的内部表征可以支持其有效开发和部署。
Summary / 总结
This study analyzes the representational spaces of six computational pathology foundation models using representational similarity analysis on H&E image patches from TCGA. The models include vision-language contrastive learning approaches (CONCH, PLIP, KEEP) and self-distillation approaches (UNI (v2), Virchow (v2), Prov-GigaPath). UNI2 and Virchow2 have the most distinct representational structures, while Prov-GigaPath has the highest average similarity across models; sharing a training paradigm did not guarantee higher similarity. Slide-dependence was high and disease-dependence relatively low for all models, and stain normalization reduced slide-dependence by 5.5% (CONCH) to 20.5% (PLIP). Vision-language models had more compact representations than vision-only models, indicating that training paradigms shape model representations differently.
该研究使用表示相似性分析评估了六个计算病理学基础模型的表示空间。这些模型包括视觉-语言对比学习方法(CONCH、PLIP、KEEP)和自我蒸馏方法(UNI (v2)、Virchow (v2)、Prov-GigaPath)。关键发现表明,UNI2和Virchow2具有不同的表示结构,而Prov-GigaPath在模型间具有最高的平均相似性。这些表示高度依赖于切片,但对疾病依赖性较低。染色归一化可以减少切片依赖性,视觉-语言模型的表示比视觉仅模型更紧凑。这些结果为提高模型的鲁棒性并指导模型集成策略提供了见解。
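Representational similarity analysis, as used in the entry above, compares two models by how similarly they arrange the same stimuli rather than by their raw embeddings. The sketch below is one common formulation and not necessarily the paper's exact pipeline: correlation distance for the dissimilarity matrices and Pearson correlation between them are assumptions, and the embeddings are made-up numbers.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def rdm_upper(embeddings):
    """Upper triangle of the dissimilarity matrix (1 - correlation) over stimuli."""
    n = len(embeddings)
    return [1.0 - pearson(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]


def rsa_similarity(emb_a, emb_b):
    """Second-order similarity: correlate the two models' dissimilarity patterns."""
    return pearson(rdm_upper(emb_a), rdm_upper(emb_b))


# Hypothetical embeddings of the same 4 image patches from two models.
model_a = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 3.0, 2.0], [2.0, 1.0, 3.0]]
# model_b is an affine rescaling of model_a, so its geometry is unchanged.
model_b = [[2.0 * v + 1.0 for v in row] for row in model_a]
similarity = rsa_similarity(model_a, model_b)
print(similarity)
```

Because only pairwise dissimilarities are compared, the two models may have different embedding dimensions, which is what makes the technique usable across heterogeneous foundation models.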
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
Authors: Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, Yong Jae Lee
First: 2025-11-05T18:59:52+00:00 · Latest: 2025-11-05T18:59:52+00:00
Abstract
Recent advances in Vision-Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior works have proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple realistic contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset will be released publicly.
中文标题/摘要
标题:使用多模态语义扰动进行VLM污染检测
近期视觉-语言模型(VLMs)在众多基准任务上取得了最先进的性能。然而,使用规模庞大的互联网数据,通常为专有数据进行预训练,引发了从业者和用户的重要关切:由于测试集泄露导致的性能夸大。尽管先前的工作提出了诸如预训练数据去污染和LLM基准重设计等缓解策略,但开发用于检测污染VLM的方法这一互补方向仍被忽视。为解决这一缺口,我们故意在流行基准上污染开源VLM,并展示现有检测方法要么完全失效,要么表现不一致。然后,我们提出了一种基于多模态语义扰动的新型简单而有效的检测方法,证明污染模型在受控扰动下无法泛化。最后,我们在多种实际污染策略下验证了我们的方法,证实了其稳健性和有效性。代码和扰动数据将公开发布。
Summary / 总结
This paper addresses inflated performance in Vision-Language Models (VLMs) caused by test-set leakage, a concern given the use of internet-scale, often proprietary, pretraining corpora. The authors deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or behave inconsistently. They then propose a detection method based on multi-modal semantic perturbation, which identifies contaminated models because such models fail to generalize under controlled perturbations. The method is validated across multiple realistic contamination strategies, confirming its robustness and effectiveness.
本文针对由于测试集泄露导致的视觉-语言模型(VLMs)性能膨胀问题进行了研究,鉴于大规模专有预训练数据集的使用,这是一个值得关注的问题。作者提出了一种基于多模态语义扰动的新检测方法来识别被污染的VLMs。他们表明现有的检测方法不够有效,并且他们提出的方法通过在受控扰动下无法泛化来有效识别被污染的模型。该方法在多种污染策略下进行了验证,证实了其稳健性和有效性。
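The detection logic in the entry above can be reduced to a comparison of accuracy before and after perturbation. This is a deliberately simplified sketch: the fixed `max_drop` threshold, the decision rule, and the function names are hypothetical, and the abstract does not describe how the perturbed items are generated or how the drop is judged.

```python
def accuracy(correct_flags):
    """Fraction of benchmark items answered correctly (flags are 0 or 1)."""
    return sum(correct_flags) / len(correct_flags)


def looks_contaminated(original_flags, perturbed_flags, max_drop=0.2):
    """Flag a model whose accuracy falls sharply on semantically perturbed items.

    `max_drop` is an illustrative threshold, not a value from the paper.
    """
    drop = accuracy(original_flags) - accuracy(perturbed_flags)
    return drop > max_drop, drop


# Hypothetical per-item results on original vs. perturbed versions of 4 questions.
suspect = looks_contaminated([1, 1, 1, 1], [1, 0, 0, 0])
clean = looks_contaminated([1, 1, 0, 1], [1, 1, 0, 1])
print(suspect, clean)
```

The intuition is that a model which memorized benchmark items scores well on the originals but cannot carry that performance over to perturbed variants, whereas a model that actually solves the task is largely unaffected.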
TABLET: A Large-Scale Dataset for Robust Visual Table Understanding
Authors: Iñigo Alonso, Imanol Miranda, Eneko Agirre, Mirella Lapata
First: 2025-09-25T14:14:27+00:00 · Latest: 2025-11-05T16:33:45+00:00
Abstract
While table understanding increasingly relies on pixel-only settings where tables are processed as visual representations, current benchmarks predominantly use synthetic renderings that lack the complexity and visual diversity of real-world tables. Additionally, existing visual table understanding (VTU) datasets offer fixed examples with single visualizations and pre-defined instructions, providing no access to underlying serialized data for reformulation. We introduce TABLET, a large-scale VTU dataset with 4 million examples across 20 tasks, grounded in 2 million unique tables where 88% preserve original visualizations. Each example includes paired image-HTML representations, comprehensive metadata, and provenance information linking back to the source datasets. Fine-tuning vision-language models like Qwen2.5-VL-7B on TABLET improves performance on seen and unseen VTU tasks while increasing robustness on real-world table visualizations. By preserving original visualizations and maintaining example traceability in a unified large-scale collection, TABLET establishes a foundation for robust training and extensible evaluation of future VTU models.
中文标题/摘要
标题:TABLET:大规模视觉表格理解数据集
尽管表格理解越来越多地依赖于基于像素的设置,其中表格被视为视觉表示,但当前的基准测试主要使用缺乏现实世界表格复杂性和视觉多样性的合成渲染。此外,现有的视觉表格理解(VTU)数据集提供固定示例和单一可视化,并预定义指令,不提供访问底层序列化数据以重新表述的机会。我们引入了TABLET,这是一个包含400万示例的大型VTU数据集,覆盖20个任务,基于200万张独特表格,其中88%保留了原始可视化。每个示例包括配对的图像-HTML表示、全面的元数据以及链接回源数据集的来源信息。在TABLET上微调如Qwen2.5-VL-7B的视觉语言模型可以提高已见和未见VTU任务的性能,同时增强对现实世界表格可视化的鲁棒性。通过保留原始可视化并在统一的大规模集合中保持示例可追溯性,TABLET为未来的VTU模型的稳健训练和扩展评估奠定了基础。
Summary / 总结
The research motivation is to address the limitations of current VTU benchmarks which lack real-world complexity and visual diversity. TABLET, a large-scale dataset with 4 million examples across 20 tasks, is introduced. Each example includes image-HTML representations and comprehensive metadata. Fine-tuning vision-language models on TABLET improves performance and robustness on VTU tasks. The dataset preserves original visualizations and maintains example traceability, providing a robust training and evaluation foundation for future VTU models.
研究动机是解决当前VTU基准数据集缺乏真实世界复杂性和无法访问底层数据的问题。引入了包含400万个示例、覆盖20个任务的TABLET数据集。每个示例包括图像-HTML表示和元数据。在TABLET上微调视觉语言模型可以提高对已见和未见任务的性能,并增强对真实世界表格可视化结果的鲁棒性。
Text-guided Fine-Grained Video Anomaly Detection
Authors: Jihao Gu, Kun Li, He Wang, Kaan Akşit
First: 2025-11-01T11:59:23+00:00 · Latest: 2025-11-05T15:46:07+00:00
Abstract
Video Anomaly Detection (VAD) aims to identify anomalous events within video segments. In scenarios such as surveillance or industrial process monitoring, anomaly detection is of critical importance. While existing approaches are semi-automated, requiring human assessment for anomaly detection, traditional VADs offer limited output as either normal or anomalous. We propose Text-guided Fine-Grained Video Anomaly Detection (T-VAD), a framework built upon a Large Vision-Language Model (LVLM). T-VAD introduces an Anomaly Heatmap Decoder (AHD) that performs pixel-wise visual-textual feature alignment to generate fine-grained anomaly heatmaps. Furthermore, we design a Region-aware Anomaly Encoder (RAE) that transforms the heatmaps into learnable textual embeddings, guiding the LVLM to accurately identify and localize anomalous events in videos. This significantly enhances both the granularity and interactivity of anomaly detection. The proposed method achieves SOTA performance, with 94.8% Area Under the Curve (AUC, specifically micro-AUC) and 67.8%/76.7% accuracy in anomaly heatmaps (RBDC/TBDC) on the UBnormal dataset, and produces textual descriptions that were subjectively verified as preferable on the ShanghaiTech-based dataset (BLEU-4: 62.67 for targets, 88.84 for trajectories; Yes/No accuracy: 97.67%) and on the UBnormal dataset (BLEU-4: 50.32 for targets, 78.10 for trajectories; Yes/No accuracy: 89.73%).
中文标题/摘要
标题:文本引导的细粒度视频异常检测
视频异常检测(VAD)旨在识别视频片段中的异常事件。在监控或工业过程监控等场景中,异常检测至关重要。尽管现有方法是半自动化,需要人工评估异常检测,但传统VADs的输出仅限于正常或异常。我们提出了文本引导的细粒度视频异常检测(T-VAD),该框架基于大型视觉-语言模型(LVLM)。T-VAD引入了异常热图解码器(AHD),通过像素级的视觉-文本特征对齐生成细粒度的异常热图。此外,我们设计了区域感知异常编码器(RAE),将热图转换为可学习的文本嵌入,引导LVLM准确识别和定位视频中的异常事件。这显著提高了异常检测的粒度和交互性。所提出的方法在UBnormal数据集上实现了SOTA性能,AUC(特别是微AUC)达到94.8%,异常热图(RBDC/TBDC)准确率为67.8%/76.7%,并在ShanghaiTech基于的数据集上主观验证了更优的文本描述(BLEU-4:目标62.67,轨迹88.84;是/否准确率:97.67%),以及UBnormal数据集上(BLEU-4:目标50.32,轨迹78.10;是/否准确率:89.73%)。
Summary / 总结
The research aims to improve the granularity and interactivity of video anomaly detection by integrating textual guidance. The method, Text-guided Fine-Grained Video Anomaly Detection (T-VAD), builds on a Large Vision-Language Model (LVLM): an Anomaly Heatmap Decoder (AHD) generates fine-grained anomaly heatmaps through pixel-wise visual-textual alignment, and a Region-aware Anomaly Encoder (RAE) converts the heatmaps into learnable textual embeddings that guide the LVLM toward precise anomaly localization. The proposed method achieves state-of-the-art performance with 94.8% micro-AUC and 67.8%/76.7% heatmap accuracy (RBDC/TBDC) on the UBnormal dataset. On the ShanghaiTech-based dataset its textual descriptions reach BLEU-4 scores of 62.67 (targets) and 88.84 (trajectories) with 97.67% Yes/No accuracy.
研究旨在通过引入T-VAD来提高视频异常检测的精细度和互动性,T-VAD利用大型视觉语言模型。T-VAD包括一个异常热图解码器进行像素级特征对齐,以及一个区域感知异常编码器生成可学习的文本嵌入,增强异常定位。该方法在UBnormal数据集上达到最先进的性能,微AUC为94.8%,异常热图准确率为67.8%/76.7%,并在ShanghaiTech数据集上具有更优的文本描述,BLEU-4分数和Yes/No准确率更高。
What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes
Authors: Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim
Venue: NeurIPS 2025 (Datasets & Benchmarks)
First: 2025-11-05T15:37:50+00:00 · Latest: 2025-11-05T15:37:50+00:00
Comments: 10 pages, 6 figures. Accepted to NeurIPS Datasets & Benchmarks 2025
Abstract
Multimodal language models possess a remarkable ability to handle an open-vocabulary's worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking "what's in common?". We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Despite saturating many leaderboards focusing on perception, the best performing model only achieves 35% on Common-O -- and on Common-O Complex, consisting of more complex scenes, the best model achieves only 1%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrence seen during training. Among the models we evaluated, we found scale can provide modest improvements while models explicitly trained with multi-image inputs show bigger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.
中文标题/摘要
标题:共同点是什么?多模态模型在跨场景推理时会胡言乱语
多模态语言模型具备处理开放词汇表中物体的能力。然而,最好的模型在处理现实世界中的场景推理时仍然会出现胡言乱语的情况,这揭示了它们在现有感知基准测试上的强大表现与在现实世界中的推理能力之间的差距。为了解决这一差距,我们构建了一个名为Common-O的新基准,该基准包含超过10500个实例,使用完全新的图像,避免了网络训练数据的污染。Common-O不仅涉及感知,还借鉴了人类的认知测试,通过询问“共同点是什么?”来探究跨场景的推理。我们评估了领先的多模态语言模型,包括专门训练进行链式推理的模型。我们发现,在单张图像中感知物体对于大多数模型来说是可处理的,但在跨场景推理方面,即使是最好的模型也面临巨大挑战,包括推理模型。尽管在专注于感知的排行榜上已经饱和,表现最好的模型在Common-O上的得分仅为35%,而在Common-O复杂场景中,最好的模型得分仅为1%。有趣的是,我们发现当场景中存在相似物体时,模型更容易胡言乱语,这表明模型可能依赖于在训练期间看到的物体共现。在我们评估的模型中,我们发现规模可以提供适度的改进,而明确使用多张图像输入训练的模型则显示出更大的改进,这表明多张图像训练可能具有前景。我们公开发布了该基准,以促进跨场景推理时胡言乱语挑战的研究。
Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding
Authors: Songtao Jiang, Yuan Wang, Sibo Song, Tianxiang Hu, Chenyi Zhou, Bin Pu, Yan Zhang, Zhibo Yang, Yang Feng, Joey Tianyi Zhou, Jin Hao, Zijian Chen, Ruijia Wu, Tao Tang, Junhui Lv, Hongxia Xu, Hongwei Wang, Jun Xiao, Bin Feng, Fudong Zhu, Kenli Li, Weidi Xie, Jimeng Sun, Jian Wu, Zuozhu Liu
First: 2025-10-09T17:06:42+00:00 · Latest: 2025-11-05T15:19:13+00:00
Abstract
Real-world clinical decision-making requires integrating heterogeneous data, including medical text, 2D images, 3D volumes, and videos, while existing AI systems fail to unify all these signals, limiting their utility. In this paper, we introduce Hulu-Med, a transparent, generalist medical Vision-Language Model (VLM) designed to unify language-only, 2D/3D vision-language, and video understanding within a single architecture. Hulu-Med is trained on a curated corpus of 16.7 million samples, comprising exclusively public or synthetic data, spanning 12 major anatomical systems and 14 medical imaging modalities. Hulu-Med employs a medical-aware token-reduction strategy that prunes redundant visual tokens, achieving up to a 55% reduction for 3D and video inputs, improving cross-modal efficiency, and enabling training at 7B-32B parameter scales in approximately 4,000-40,000 GPU hours. Across 30 public in-domain and out-of-domain medical benchmarks (covering text reasoning, visual question answering, report generation, multilingual dialogue, video understanding, and rare disease diagnosis), Hulu-Med surpasses existing open-source models on 27 of 30 benchmarks and outperforms proprietary systems such as GPT-4o on 16 benchmarks. Despite being a VLM, Hulu-Med outperforms GPT-4o and matches GPT-o1 on the text-only HealthBench. For the first time in the community, we provide a fully transparent, reproducible and cost-effective pipeline for holistic medical vision-language understanding by releasing our end-to-end data curation, training procedures, and model parameters. Code and models are available at https://github.com/ZJUI-AI4H/Hulu-Med.
中文标题/摘要
标题:Hulu-Med:面向全面医疗视觉语言理解的透明通用模型
现实世界中的临床决策需要整合异构数据,包括医学文本、2D图像、3D体积和视频,而现有的AI系统无法统一所有这些信号,限制了它们的实用性。在本文中,我们介绍了Hulu-Med,这是一种透明的通用医疗视觉语言模型(VLM),旨在在一个架构中统一纯语言理解、2D/3D视觉语言理解和视频理解。Hulu-Med基于1670万样本的精心策划的语料库进行训练,这些样本仅包含公开或合成数据,涵盖了12个主要的解剖系统和14种医学成像模态。Hulu-Med采用了一种医学感知的标记减少策略,去除冗余的视觉标记,对于3D和视频输入,最多可减少55%的标记,提高跨模态效率,并在约4000-40000个GPU小时的训练中支持7B-32B参数规模。在涵盖文本推理、视觉问答、报告生成、多语言对话、视频理解和罕见疾病诊断的30个公开领域内和领域外医学基准测试中,Hulu-Med在27个基准测试中超越了现有的开源模型,并在16个基准测试中超越了如GPT-4o等专有系统。尽管是VLM,Hulu-Med在仅文本的HealthBench上也超越了GPT-4o,并与GPT-o1持平。我们首次为社区提供了全面的医疗视觉语言理解的完全透明、可重复和成本效益高的流程,通过发布我们的端到端数据策划、训练流程和模型参数。代码和模型可在 https://github.com/ZJUI-AI4H/Hulu-Med 获取。
Summary / 总结
Hulu-Med is a transparent, generalist medical Vision-Language Model designed to unify language-only, 2D/3D vision-language, and video understanding in a single architecture. It is trained on 16.7 million public or synthetic samples spanning 12 major anatomical systems and 14 medical imaging modalities, and its token-reduction strategy prunes redundant visual tokens, achieving up to a 55% reduction for 3D and video inputs. Across 30 medical benchmarks, Hulu-Med surpasses existing open-source models on 27 and outperforms proprietary systems such as GPT-4o on 16; on the text-only HealthBench it outperforms GPT-4o and matches GPT-o1. The authors release their data curation, training procedures, and model parameters to provide a transparent, reproducible, and cost-effective pipeline for holistic medical vision-language understanding.
Hulu-Med 是一个透明的通用医疗视觉-语言模型,旨在统一语言、2D/3D视觉-语言和视频理解。它基于12个主要解剖系统和14种医学成像模态的1670万样本进行训练,对于3D和视频输入可减少高达55%的冗余视觉标记,提高跨模态效率。在30个医疗基准测试中,Hulu-Med 在27个基准测试中超越现有模型,在16个基准测试中超越如GPT-4o等专有系统,同时在文本仅有的任务中与GPT-o1持平。该模型通过开源代码和模型参数提供了一个透明、可复现且成本效益高的整体医疗视觉-语言理解管道。
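The abstract above does not say how the medical-aware token reduction decides which visual tokens are redundant. As a rough illustration only, the sketch below assumes one common heuristic for 3D and video inputs: drop a token when it is nearly identical (by cosine similarity) to the token at the same position in the previous slice or frame. The function names, the threshold, and the heuristic itself are assumptions for illustration, not Hulu-Med's actual method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_redundant_tokens(frames, threshold=0.95):
    """Return (frame_index, token_index) pairs for the tokens that are kept.

    frames: list of frames (or 3D slices), each a list of token vectors,
    with the same number of tokens per frame.
    Hypothetical rule: keep every token of the first frame; in later frames,
    keep a token only if its similarity to the same-position token in the
    previous frame is below `threshold`.
    """
    kept = []
    for t, frame in enumerate(frames):
        for i, token in enumerate(frame):
            if t == 0 or cosine(token, frames[t - 1][i]) < threshold:
                kept.append((t, i))
    return kept


# Two frames with two tokens each; only the second token changes between frames.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0]],
]
kept = prune_redundant_tokens(frames)
total = sum(len(frame) for frame in frames)
reduction = 1 - len(kept) / total  # fraction of tokens removed
```

In this toy input one of four tokens is dropped, so `reduction` is 0.25. Real systems typically compare learned embeddings and may merge tokens rather than discard them.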
Manipulation Facing Threats: Evaluating Physical Vulnerabilities in End-to-End Vision Language Action Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Chengyuan Yu, Mengshu Sun, Qiang Zhang, Jiahang Cao, Yijie Guo, Ning Liu, Kaidi Xu, Jize Zhang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
First: 2024-09-20T03:02:05+00:00 · Latest: 2025-11-05T14:57:42+00:00
Abstract
Recently, driven by advancements in Multimodal Large Language Models (MLLMs), Vision Language Action Models (VLAMs) are being proposed to achieve better performance in open-vocabulary scenarios for robotic manipulation tasks. Since manipulation tasks involve direct interaction with the physical world, ensuring robustness and safety during the execution of this task is always a very critical issue. In this paper, by synthesizing current safety research on MLLMs and the specific application scenarios of the manipulation task in the physical world, we comprehensively evaluate VLAMs in the face of potential physical threats. Specifically, we propose the Physical Vulnerability Evaluating Pipeline (PVEP) that can incorporate as many visual modal physical threats as possible for evaluating the physical robustness of VLAMs. The physical threats in PVEP specifically include Out-of-Distribution, Typography-based Visual Prompt, and Adversarial Patch Attacks. By comparing the performance fluctuations of VLAMs before and after being attacked, we provide generalizable analyses of how VLAMs respond to different physical threats.
中文标题/摘要
标题:面对威胁的操纵:评估端到端视觉语言动作模型的物理脆弱性
近年来,随着多模态大型语言模型(MLLMs)的发展,视觉语言动作模型(VLAMs)被提出以在机器人操纵任务的开放词汇场景中实现更好的性能。由于操纵任务涉及直接与物理世界互动,确保执行此任务时的鲁棒性和安全性始终是一个非常关键的问题。在本文中,通过综合当前MLLMs的安全研究以及操纵任务在物理世界中的具体应用场景,我们全面评估了VLAMs在面对潜在物理威胁时的表现。具体而言,我们提出了物理脆弱性评估管道(PVEP),它可以尽可能多地纳入视觉模态的物理威胁,以评估VLAMs的物理鲁棒性。PVEP中的物理威胁具体包括离分布、基于字体的视觉提示和对抗性补丁攻击。通过比较VLAMs在攻击前后性能的变化,我们提供了关于VLAMs如何应对不同物理威胁的可泛化的分析。
Summary / 总结
This paper evaluates the physical robustness of Vision Language Action Models (VLAMs) in robotic manipulation tasks, motivated by the need for safety and robustness in direct physical interaction. The authors propose the Physical Vulnerability Evaluating Pipeline (PVEP) to assess VLAMs against visual-modality physical threats, including out-of-distribution inputs, typography-based visual prompts, and adversarial patch attacks. By comparing VLAM performance before and after each attack, they provide generalizable analyses of how VLAMs respond to different physical threats.
本文通过提出物理脆弱性评估管道(PVEP),包括离分布、基于字体的视觉提示和对抗性补丁攻击,来评估Vision Language Action Models(VLAMs)在机器人操作任务中的物理鲁棒性。研究发现,当VLAMs受到这些物理威胁时,其性能会表现出不同的波动,提供了对其不同类型攻击的脆弱性的见解。
Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models
Authors: Hao Cheng, Erjia Xiao, Yichi Wang, Lingfeng Zhang, Qiang Zhang, Jiahang Cao, Kaidi Xu, Mengshu Sun, Xiaoshuai Hao, Jindong Gu, Renjing Xu
First: 2025-03-14T15:42:42+00:00 · Latest: 2025-11-05T14:45:59+00:00
Comments: This paper is accepted by IJCAI2025 Workshop on Deepfake Detection, Localization, and Interpretability as Best Student Paper
Abstract
Current Cross-Modality Generation Models (GMs) demonstrate remarkable capabilities in various generative tasks. Given the ubiquity and information richness of vision modality inputs in real-world scenarios, Cross-Vision tasks, encompassing Vision-Language Perception (VLP) and Image-to-Image (I2I), have attracted significant attention. Large Vision Language Models (LVLMs) and I2I Generation Models (GMs) are employed to handle VLP and I2I tasks, respectively. Previous research indicates that printing typographic words into input images significantly induces LVLMs and I2I GMs to produce disruptive outputs that are semantically aligned with those words. Additionally, visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of cross-vision tasks. However, the specific characteristics of the threats posed by visual prompts remain underexplored. In this paper, to comprehensively investigate the performance impact induced by Typographic Visual Prompt Injection (TVPI) in various LVLMs and I2I GMs, we propose the Typographic Visual Prompts Injection Dataset and thoroughly evaluate the TVPI security risks on various open-source and closed-source LVLMs and I2I GMs under visual prompts with different target semantics, deepening the understanding of TVPI threats.
中文标题/摘要
标题:探索跨模态生成模型中的图型视觉提示注入威胁
当前的跨模态生成模型(GMs)在各种生成任务中表现出显著的能力。鉴于现实世界场景中视觉模态输入的普遍性和信息丰富性,包括视觉语言感知(VLP)和图像到图像(I2I)在内的跨视觉任务引起了广泛关注。大型视觉语言模型(LVLMs)和I2I生成模型分别用于处理VLP和I2I任务。先前的研究表明,在输入图像中印刷图型文字会显著诱导LVLMs和I2I GMs生成与这些文字语义一致的破坏性输出。此外,视觉提示作为一种更复杂的图型形式,也被发现对跨视觉任务的各种应用构成了安全风险。然而,视觉提示所造成的具体威胁特征仍待进一步探索。在本文中,为了全面调查图型视觉提示注入(TVPI)在各种LVLMs和I2I GMs中的性能影响,我们提出了图型视觉提示注入数据集,并在具有不同目标语义的视觉提示下对各种开源和闭源LVLMs和I2I GMs进行了彻底的安全风险评估,加深了对TVPI威胁的理解。
Summary / 总结
This paper investigates the security threats posed by typographic visual prompts in cross-modality generation models. It proposes the Typographic Visual Prompts Injection Dataset to evaluate the impact of typographic visual prompt injection (TVPI) on large vision language models (LVLMs) and image-to-image (I2I) generation models. Building on earlier findings that typographic words printed into input images induce disruptive outputs semantically aligned with those words, the study evaluates open-source and closed-source models under visual prompts with different target semantics. This work deepens the understanding of TVPI threats in cross-vision tasks.
本文探讨了图文提示注入(TVPI)在跨模态生成模型中的安全威胁。研究引入了一个数据集来评估图文提示注入对各种大型视觉语言模型(LVLMs)和图像到图像(I2I)生成模型的影响。研究发现,图文提示可以显著影响模型输出,导致语义上一致但具有破坏性的结果,并强调了更好地理解和缓解这些安全风险的必要性。
Revisiting Multimodal Positional Encoding in Vision-Language Models
Authors: Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
First: 2025-10-27T08:00:46+00:00 · Latest: 2025-11-05T14:25:38+00:00
Comments: 16 pages
Abstract
Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, which ensure unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM, respectively. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.
中文标题/摘要
标题:重新审视视觉-语言模型中的多模态位置编码
多模态位置编码对于视觉-语言模型至关重要,但对多模态位置编码的系统性研究却很少。我们对多模态旋转位置嵌入(RoPE)进行了全面分析,考察了其两个核心组成部分:位置设计和频率分配。通过大量实验,我们确定了三个关键指导原则:位置一致性、充分利用频率以及保留文本先验,以确保布局明确、表示丰富以及从预训练的大语言模型中忠实转移。基于这些见解,我们提出了多头RoPE(MHRoPE)和MRoPE-交错(MRoPE-I)两种简单且即插即用的变体,无需进行架构更改。我们的方法在多种基准测试中始终优于现有方法,显著提高了通用和细粒度多模态理解。代码将在https://github.com/JJJYmmm/Multimodal-RoPEs上提供。
Summary / 总结
This study revisits multimodal position encoding in vision-language models and provides a comprehensive analysis of Rotary Positional Embedding (RoPE), focusing on position design and frequency allocation. Through extensive experiments, the authors identify three guidelines for effective multimodal position encoding: positional coherence, full frequency utilization, and preservation of textual priors. Based on these insights, they propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), which outperform existing methods across various benchmarks, enhancing both general and fine-grained multimodal understanding without requiring architectural changes. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.
该研究重新审视了视觉-语言模型中的多模态位置编码,并对旋转位置嵌入(RoPE)进行了全面分析,重点关注位置设计和频率分配。通过大量实验,作者确定了有效多模态位置编码的关键准则,包括位置一致性、频率充分利用和文本先验的保留。基于这些见解,他们提出了多头RoPE(MHRoPE)和MRoPE-交错(MRoPE-I),这些方法在各种基准测试中均优于现有方法,增强了通用和细粒度的多模态理解,且无需更改架构。代码可在https://github.com/JJJYmmm/Multimodal-RoPEs 获取。
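To make "frequency allocation" in the entry above concrete, the sketch below contrasts two ways of assigning rotary frequency channels to the temporal, height, and width axes: contiguous chunks per axis, and a round-robin interleave that spreads every axis across the whole frequency range. The round-robin rule, the function names, and the chunk sizes are illustrative assumptions; the paper's MHRoPE and MRoPE-I may differ in detail.

```python
def chunked_axes(num_freqs, sections):
    """Give each axis (t, h, w) a contiguous block of frequency channels."""
    axes = []
    for axis, count in zip("thw", sections):
        axes.extend(axis * count)  # e.g. "t" * 2 contributes ["t", "t"]
    assert len(axes) == num_freqs, "sections must sum to num_freqs"
    return axes


def interleaved_axes(num_freqs):
    """Assign channels to t, h, w in round-robin order (assumed interleave rule)."""
    return ["thw"[i % 3] for i in range(num_freqs)]


def rope_angles(position, axes, base=10000.0):
    """Rotation angle per frequency channel.

    position: dict mapping axis name ("t", "h", "w") to its index.
    Channel i uses the standard RoPE frequency base ** (-i / num_freqs),
    multiplied by the index of whichever axis owns that channel.
    """
    n = len(axes)
    return [position[axis] * base ** (-i / n) for i, axis in enumerate(axes)]


chunked = chunked_axes(6, (2, 2, 2))   # t, t, h, h, w, w
interleaved = interleaved_axes(6)      # t, h, w, t, h, w

# An image patch at frame 1, row 2, column 3.
patch = {"t": 1, "h": 2, "w": 3}
patch_chunked = rope_angles(patch, chunked)
patch_interleaved = rope_angles(patch, interleaved)

# A text token uses the same index on all three axes, so both layouts
# reduce to ordinary 1D RoPE and produce identical angles.
text = {"t": 5, "h": 5, "w": 5}
text_chunked = rope_angles(text, chunked)
text_interleaved = rope_angles(text, interleaved)
```

With chunking, the temporal axis only ever sees the highest-frequency channels and the width axis only the lowest; interleaving gives each axis both high and low frequencies, which is one reading of "full frequency utilization". The text-token case illustrates why giving text identical indices on all axes keeps it compatible with the 1D RoPE of a pre-trained LLM.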
ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs
Authors: Ben Zhang, LuLu Yu, Lei Gao, QuanJiang Guo, Jing Liu, Hui Gao
First: 2025-08-06T08:31:11+00:00 · Latest: 2025-11-05T13:58:18+00:00
Abstract
During reasoning in vision-language models (VLMs), false positive (FP) reasoning occurs when a model produces the correct answer but follows an incorrect reasoning path, which undermines reasoning reliability. Existing approaches mainly rely on prompt engineering, knowledge distillation or reinforcement learning to improve reasoning reliability, all of which require large amounts of high-quality data and thus limit practical applicability. Few approaches have focused on directly detecting and correcting FPs. To address these issues, we propose ViFP, a framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs. ViFP builds effective reasoning paths through multi-turn QA and dynamically analyzes the consistency of the reasoning path to identify potential FPs. It also introduces a targeted reasoning chain correction mechanism to modify FP reasoning, thereby improving logical consistency and accuracy. Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OK-VQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.
中文标题/摘要
标题:ViFP:视觉假阳性检测框架以增强VLMs推理可靠性
在视觉语言模型(VLMs)的推理过程中,当模型给出正确答案但遵循错误的推理路径时,会发生假阳性(FP)推理,从而削弱推理可靠性。现有方法主要依赖于提示工程、知识蒸馏或强化学习来提高推理可靠性,但这些方法需要大量高质量的数据,从而限制了其实用性。很少有方法专注于直接检测和纠正FPs。为了解决这些问题,我们提出了ViFP,一种用于增强VLMs推理可靠性的视觉假阳性检测框架。ViFP通过多轮问答构建有效的推理路径,并动态分析推理路径的一致性以识别潜在的FPs。它还引入了针对性的推理链修正机制来修改FP推理,从而提高逻辑一致性和准确性。最后,我们引入了可靠性评估指标VoC,该指标结合了答案准确率和FP率,提供了一种定量工具来评估VLM不仅回答正确,还能可靠地推理。我们在闭源VLMs上的实验表明,ViFP在三个数据集A-OKVQA、OK-VQA和FVQA上均能持续提高性能。在A-OKVQA上,ViFP将准确率提高了最多5.4%,超越了之前的最佳方法4.3%,并显著减少了FP的数量,验证了其在增强推理可靠性方面的益处。
Summary / 总结
The paper proposes ViFP, a framework for detecting and correcting visual false positives (correct answers reached through incorrect reasoning) in vision-language models to enhance reasoning reliability. It builds reasoning paths through multi-turn QA, dynamically analyzes their consistency to identify potential false positives, and applies a targeted reasoning chain correction mechanism. It also introduces the VoC metric, which integrates answer accuracy and the false positive rate. Experiments on closed-source VLMs show consistent gains across three datasets (A-OKVQA, OK-VQA, and FVQA); on A-OKVQA, ViFP improves accuracy by up to 5.4% and significantly reduces the number of false positives.
论文提出ViFP框架,用于检测和纠正视觉假阳性,以提高视觉语言模型的推理可靠性。ViFP使用多轮问答并动态分析推理路径的一致性,引入了针对性的推理链修正机制。还引入了VoC指标来评估可靠性。实验结果显示,ViFP在A-OKVQA、OK-VQA和FVQA三个数据集上将准确率提高至多5.4%,并显著减少了假阳性,验证了其在提高推理可靠性方面的优势。
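The entry above says VoC integrates answer accuracy and the FP rate but does not give its formula, so it is not reproduced here. As a minimal sketch of the two ingredients only, the code below computes accuracy and an FP rate from per-sample judgments. The record format and the choice to normalize the FP rate by the number of correct answers (rather than by all samples) are assumptions for illustration.

```python
def accuracy_and_fp_rate(records):
    """Compute answer accuracy and false-positive rate.

    records: list of (answer_correct, reasoning_sound) boolean pairs.
    A false positive is a correct answer reached through unsound reasoning.
    Assumption: the FP rate is the share of correct answers that are FPs.
    """
    total = len(records)
    correct = sum(1 for answer_ok, _ in records if answer_ok)
    false_positives = sum(
        1 for answer_ok, reasoning_ok in records if answer_ok and not reasoning_ok
    )
    accuracy = correct / total if total else 0.0
    fp_rate = false_positives / correct if correct else 0.0
    return accuracy, fp_rate


# Four hypothetical samples: three correct answers, one of them a false positive.
records = [(True, True), (True, False), (False, False), (True, True)]
accuracy, fp_rate = accuracy_and_fp_rate(records)
```

Here accuracy is 0.75 while a third of the correct answers rest on flawed reasoning, which is the kind of gap a reliability metric is meant to expose.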
Decoupling Augmentation Bias in Prompt Learning for Vision-Language Models
Authors: Gahyeon Kim, Sohee Kim, Seokju Lee
First: 2025-11-05T11:15:16+00:00 · Latest: 2025-11-05T11:15:16+00:00
Comments: Accepted in Pattern Recognition
Abstract
Recent advances in large-scale vision and language models have led to significant progress in zero-shot learning tasks. Methods such as CoOp and CoCoOp have shown that replacing handcrafted prompts with learnable vectors, known as prompt learning, can result in improved performance. However, these models often struggle to generalize to entirely unseen categories. While traditional zero-shot learning techniques benefit from various data augmentation strategies, prompt learning has primarily focused on text-based modifications, leaving the potential of image-based augmentation largely unexplored. In this work, we explore how image-level augmentations, particularly those that introduce attribute-specific variations, can support and enhance prompt learning. Our analysis examines the interaction between these augmentations and soft prompt frameworks, revealing their potential to improve generalization. We also identify a limitation in existing methods, such as CoCoOp, which do not provide explicit guidance for learning prompts that focus on semantically meaningful visual features. To address this, we propose Adding Attributes to Prompt Learning, AAPL, a novel method that introduces adversarial token embeddings to decouple superficial visual variations introduced by augmentation from class-relevant semantic representations. This decoupling enables the learned prompts to concentrate on visually discriminative features that align with the target categories. We conduct comprehensive experiments on eleven benchmark datasets, and AAPL consistently outperforms existing methods across few-shot, zero-shot, cross-dataset, and domain generalization settings. Our source code is publicly available at: https://github.com/Gahyeonkim09/AAPL
中文标题/摘要
标题:分离视觉语言模型提示学习中增强偏差
大规模视觉和语言模型的最新进展在零样本学习任务中取得了显著进展。CoOp和CoCoOp等方法表明,用可学习向量替换手工设计的提示,即提示学习,可以提高性能。然而,这些模型往往难以泛化到完全未见过的类别。虽然传统的零样本学习技术受益于各种数据增强策略,但提示学习主要集中在文本修改上,图像级别的增强潜力尚未得到充分探索。在本文中,我们探讨了图像级增强,特别是那些引入属性特定变化的增强,如何支持和增强提示学习。我们的分析研究了这些增强与软提示框架之间的相互作用,揭示了它们提高泛化能力的潜力。我们还指出了现有方法(如CoCoOp)的一个局限性,即它们没有提供明确的指导来学习专注于语义有意义的视觉特征的提示。为了解决这个问题,我们提出了添加属性到提示学习(AAPL),这是一种新颖的方法,通过引入对抗性标记嵌入来分离由增强引入的表面视觉变化与类别相关语义表示。这种分离使学习到的提示能够集中于与目标类别对齐的视觉区分特征。我们在11个基准数据集上进行了全面实验,AAPL在少量样本、零样本、跨数据集和领域泛化设置中均优于现有方法。我们的源代码可在以下网址获取:https://github.com/Gahyeonkim09/AAPL
Summary / 总结
This work addresses the challenge of generalizing to unseen categories in prompt learning for vision-language models. It introduces AAPL, a method that uses adversarial token embeddings to decouple superficial visual variations from semantic representations, thereby improving generalization. Comprehensive experiments on eleven benchmark datasets show that AAPL outperforms existing methods in few-shot, zero-shot, cross-dataset, and domain generalization settings.
本文旨在解决视觉-语言模型中提示学习在处理未见类别时的泛化能力问题。提出了AAPL方法,通过引入对抗性标记嵌入来解耦由增强引入的表面视觉变化与类别相关的语义表示,从而提高泛化能力。在11个基准数据集上的全面实验表明,AAPL 在少量样本、零样本、跨数据集和领域泛化设置中均优于现有方法。
Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
Authors: Yi Yang, Yiming Xu, Timo Kaiser, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang
First: 2025-11-05T10:01:31+00:00 · Latest: 2025-11-05T10:01:31+00:00
Abstract
In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and the Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, placing second in the challenge.
中文标题/摘要
标题:使用LLaVA-Video的多目标跟踪检索:MOT25-StAG挑战的无训练解决方案
在本报告中,我们提出了对MOT25-时空动作定位(MOT25-StAG)挑战的解决方案。该挑战的目标是使用复杂现实场景的视频数据作为输入,准确地定位和跟踪与特定和自由形式语言查询匹配的多个对象。我们将基础任务建模为视频检索问题,并提出了一种两阶段、零样本的方法,结合了最先进的跟踪模型FastTracker和多模态大型语言模型LLaVA-Video的优势。在MOT25-StAG测试集上,我们的方法分别获得了20.68的m-HIoU和10.73的HOTA分数,在挑战中获得第二名。
Summary / 总结
The research aims to accurately localize and track multiple objects based on specific language queries in complex real-world scenes. The method combines FastTracker for tracking and LLaVA-Video for multi-modal understanding, achieving m-HIoU and HOTA scores of 20.68 and 10.73 respectively on the MOT25-StAG test set, placing second in the challenge.
研究旨在基于特定语言查询在复杂现实场景中准确地定位和跟踪多个物体。方法结合了FastTracker进行跟踪和LLaVA-Video进行多模态理解,在MOT25-StAG测试集上取得了m-HIoU和HOTA分数分别为20.68和10.73的成绩,获得了挑战的第二名。
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Authors: Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
First: 2025-10-30T08:21:50+00:00 · Latest: 2025-11-05T05:49:17+00:00
Comments: 10 pages
Abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
中文标题/摘要
标题:时间流动的方向如何?基于心理物理学的视觉-语言模型评估
现代视觉-语言模型(VLMs)在许多多模态任务中表现出色,但在视频中的时间信息理解方面仍然薄弱且未得到充分评估。我们通过一个看似简单但揭示性强的挑战——判断时间箭头(AoT)——即判断短片段是正向播放还是反向播放,来探索这一差距。我们引入了AoT-PsyPhyBENCH,这是一个经心理物理学验证的基准测试,测试VLMs是否能在自然视频中推断出时间方向,使用与人类相同的刺激和行为基线。我们对开放权重和专有、推理和非推理VLMs的全面评估显示,大多数模型的表现接近随机猜测,甚至最好的模型在物理不可逆过程(如自由落体、扩散/爆炸)和因果手动动作(如分割/加法)上的人类识别能力方面也远远落后。这些结果突显了当前多模态系统中的一个基本差距:虽然它们捕捉到了丰富的视觉-语义关联,但缺乏用于时间连续性和因果理解的归纳偏置。我们发布了AoT-PsyPhyBENCH的代码和数据,以鼓励进一步提高VLMs在物理和时间推理能力方面的发展。
Summary / 总结
This study evaluates the temporal understanding of vision-language models (VLMs) with AoT-PsyPhyBENCH, a psychophysically validated benchmark that asks whether a short video clip is played forward or backward, using the same stimuli and behavioral baselines established for humans. Most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions that humans recognize almost instantly. The results suggest that while VLMs capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding.
该研究通过引入基于心理物理验证的AoT-PsyPhyBENCH基准,评估了视觉语言模型(VLMs)的时序理解能力。模型被测试其判断短视频片段时间方向的能力。大多数模型表现不佳,即使在简单的不可逆过程和因果动作上也是如此,表明它们在时序推理方面存在显著差距。结果表明,虽然VLMs在视觉语义关联方面表现出色,但它们缺乏理解时序连续性和因果性的必要偏置。
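Because the arrow-of-time task is a two-way choice, "near chance" means accuracy close to 50%. One simple way to check whether a model's score is distinguishable from guessing is an exact two-sided binomial test, sketched below. The clip counts in the example are made up for illustration and are not results from the benchmark.

```python
from math import comb


def binom_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value against success probability p.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null hypothesis (here, random guessing).
    """
    pmf = [
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # A small relative tolerance keeps ties from being lost to rounding.
    tail = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical model: 56 of 100 forward/backward judgments correct.
p_near_chance = binom_two_sided_p(56, 100)

# Hypothetical model: 90 of 100 correct.
p_clearly_above = binom_two_sided_p(90, 100)
```

A large p-value, as in the first case, means the score cannot be told apart from coin flipping at that sample size; the second case gives a p-value far below conventional thresholds.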