Think Visually, Reason Textually: Vision-Language Synergy in ARC
Authors: Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang
First: 2025-11-19T18:59:04+00:00 · Latest: 2025-11-19T18:59:04+00:00
Abstract
Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code will be released soon.
中文标题/摘要
标题:视觉思考,文本推理:在ARC中的视觉-语言协同作用
从少量示例中进行抽象推理仍然是前沿基础模型如GPT-5和Grok 4的核心未解问题。这些模型仍然无法从少量示例中推断出结构化的转换规则,这是人类智能的关键标志之一。人工通用智能抽象和推理语料库(ARC-AGI)为这种能力提供了一个严格的测试平台,要求概念规则归纳和向新任务的迁移。大多数现有方法将ARC-AGI视为纯粹的文本推理任务,忽视了人类在解决此类谜题时高度依赖视觉抽象的事实。然而,我们的初步实验揭示了一个悖论:简单地将ARC-AGI网格转换为图像会因规则执行不精确而导致性能下降。这导致我们的核心假设是视觉和语言在不同的推理阶段具有互补的优势:视觉支持全局模式的抽象和验证,而语言则专门用于符号规则的制定和精确执行。基于这一见解,我们提出了两种协同策略:(1)视觉-语言协同推理(VLSR),将ARC-AGI分解为模态对齐的子任务;(2)模态切换自我校正(MSSC),利用视觉验证文本推理以进行内在错误校正。广泛的实验表明,我们的方法在多种旗舰模型和多个ARC-AGI任务上比纯文本基线高出4.33%。我们的研究结果表明,将视觉抽象与语言推理统一起来是未来基础模型实现可泛化的、类人的智能的关键一步。源代码将很快发布。
Summary / 总结
The paper addresses abstract reasoning from minimal examples, a hallmark of human intelligence, using the ARC-AGI testbed. It proposes a vision-language synergy approach that decomposes the task into modality-aligned subtasks and uses visual verification to correct text-based reasoning errors. Experiments show up to a 4.33% improvement over text-only baselines across multiple flagship models and ARC-AGI tasks.
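The paper targets the challenge of abstract reasoning from a few examples, a key capability for advanced AI models. It proposes a vision-language synergy approach that splits the task into visual and textual subtasks and uses visual verification to correct textual errors. Experiments show a performance gain of up to 4.33% over text-only methods across several models and tasks, underscoring the importance of combining visual and textual reasoning for more human-like AI capabilities.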
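The modality split described in this entry can be illustrated with a small sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the `PALETTE` colors, the helper names, and the `propose` and `verify` callables, which stand in for a text-based solver and a vision-based checker. It shows one grid in its two views and a loop in which text proposes and vision verifies.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
RGB = Tuple[int, int, int]

# Hypothetical palette: the color assigned to each cell value is an assumption.
PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54), 3: (46, 204, 64)}


def grid_to_text(grid: Grid) -> str:
    """Textual view: one row per line, cell values separated by spaces."""
    return "\n".join(" ".join(str(c) for c in row) for row in grid)


def grid_to_pixels(grid: Grid, cell: int = 4) -> List[List[RGB]]:
    """Visual view: each cell becomes a cell x cell square of a single color."""
    pixels: List[List[RGB]] = []
    for row in grid:
        pixel_row = [PALETTE[c] for c in row for _ in range(cell)]
        pixels.extend([list(pixel_row) for _ in range(cell)])
    return pixels


def modality_switch_loop(
    propose: Callable[[Optional[str]], Grid],
    verify: Callable[[List[List[RGB]]], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Grid:
    """Text proposes a grid, vision checks the rendered grid, repeat on failure."""
    feedback: Optional[str] = None
    candidate: Grid = []
    for _ in range(max_rounds):
        candidate = propose(feedback)          # text-side reasoning (stub)
        ok, feedback = verify(grid_to_pixels(candidate))  # vision-side check (stub)
        if ok:
            break
    return candidate
```

In a real system both callables would be model calls; the loop only fixes the order in which the two modalities are consulted.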
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Authors: Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong, Jinyang Guo, Xianglong Liu, Jun Zhang
First: 2025-11-19T18:48:27+00:00 · Latest: 2025-11-19T18:48:27+00:00
Comments: Code will be released upon acceptance
Abstract
Mixture-of-Experts (MoE) multimodal large language models (MLLMs) excel at vision-language tasks, but they are computationally inefficient. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and the modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method, which processes tokens from each modality separately, is then applied to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments on 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of the experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, speeding up prefilling by 2.16$\times$ and decoding by 1.26$\times$.
中文标题/摘要
标题:MoDES:通过动态专家跳过加速Mixture-of-Experts多模态大型语言模型
Mixture-of-Experts (MoE) 多模态大型语言模型(MLLMs)在视觉-语言任务中表现出色,但存在高计算效率问题。为了减少推理开销,已经提出了专家跳过方法,根据当前输入令牌来停用冗余专家。然而,我们发现将这些方法(最初设计用于单模态大型语言模型(LLMs))应用于MLLMs会导致显著的性能下降。这主要是因为这些方法未能考虑MoE层中专家的异质贡献以及这些层中令牌的模态特定行为。受这些发现的启发,我们提出了MoDES,这是第一个无需训练的框架,能够自适应地跳过专家以实现高效且准确的MoE MLLM推理。它结合了全局调制局部门控(GMLG)机制,将全局层间重要性整合到局部路由概率中,以准确估计每个令牌的专家重要性。然后应用了一种双模态阈值化(DMT)方法,分别处理每个模态的令牌,以推导跳过计划。为了设置最优阈值,我们引入了一种前沿搜索算法,利用单调性特性,将收敛时间从几天缩短到几小时。针对13个基准的3个模型系列的广泛实验表明,MoDES远优于先前的方法。例如,当跳过Qwen3-VL-MoE-30B-A3B-Instruct的88%专家时,性能提升高达10.67%(97.33% vs. 86.66%)。此外,MoDES显著提高了推理速度,将预填充时间提高了2.16倍,解码时间提高了1.26倍。
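A minimal sketch of how the two mechanisms named in this entry could fit together. The abstract does not state the formulas, so the combination rule (layer weight times routing probability), the function names, and the parameters `tau_text` and `tau_vision` are all assumptions for illustration, not the authors' definitions.

```python
from typing import List


def routed_experts(routing_probs: List[float], top_k: int) -> List[int]:
    """Indices of the top_k experts the router selects for one token."""
    order = sorted(range(len(routing_probs)), key=lambda e: -routing_probs[e])
    return order[:top_k]


def experts_to_run(
    routing_probs: List[float],
    layer_weight: float,
    is_vision_token: bool,
    tau_text: float,
    tau_vision: float,
    top_k: int = 2,
) -> List[int]:
    """Keep a routed expert only if its modulated importance clears the
    threshold for the token's modality; all other routed experts are skipped."""
    tau = tau_vision if is_vision_token else tau_text
    kept = []
    for e in routed_experts(routing_probs, top_k):
        # Assumed rule: global layer importance scales the local routing probability.
        importance = layer_weight * routing_probs[e]
        if importance >= tau:
            kept.append(e)
    return kept
```

With a lower threshold for vision tokens than for text tokens, the same routing probabilities keep more experts for a vision token, which is the kind of modality-specific behavior the abstract describes.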
Walrus: A Cross-Domain Foundation Model for Continuum Dynamics
Authors: Michael McCabe, Payel Mukhopadhyay, Tanya Marwah, Bruno Regaldo-Saint Blancard, Francois Rozet, Cristiana Diaconu, Lucas Meyer, Kaze W. K. Wong, Hadi Sotoudeh, Alberto Bietti, Irina Espejo, Rio Fear, Siavash Golkar, Tom Hehir, Keiya Hirashima, Geraud Krawezik, Francois Lanusse, Rudy Morel, Ruben Ohana, Liam Parker, Mariel Pettee, Jeff Shen, Kyunghyun Cho, Miles Cranmer, Shirley Ho
First: 2025-11-19T18:36:03+00:00 · Latest: 2025-11-19T18:36:03+00:00
Abstract
Foundation models have transformed machine learning for language and vision, but achieving comparable impact in physical simulation remains a challenge. Data heterogeneity and unstable long-term dynamics inhibit learning from sufficiently diverse dynamics, while varying resolutions and dimensionalities challenge efficient training on modern hardware. Through empirical and theoretical analysis, we incorporate new approaches to mitigate these obstacles, including a harmonic-analysis-based stabilization method, load-balanced distributed 2D and 3D training strategies, and compute-adaptive tokenization. Using these tools, we build Walrus, a transformer-based foundation model designed primarily for fluid-like continuum dynamics. Walrus is pretrained on nineteen diverse scenarios spanning astrophysics, geoscience, rheology, plasma physics, acoustics, and classical fluids. Experiments show that Walrus outperforms prior foundation models on both short- and long-term prediction horizons on downstream tasks and across the breadth of pretraining data, while ablation studies confirm the value of our contributions to forecast stability, training throughput, and transfer performance over conventional approaches. Code and weights are released for community use.
中文标题/摘要
标题:Walrus:一种用于连续动力学的跨域基础模型
基础模型已经改变了语言和视觉领域的机器学习,但在物理模拟中实现类似的影响仍然具有挑战性。数据异质性和长期动力学的不稳定阻碍了从足够多样的动力学中学习,而不同的分辨率和维度挑战了在现代硬件上高效训练。通过实证和理论分析,我们引入了新的方法来克服这些障碍,包括基于谐波分析的稳定方法、负载平衡的分布式2D和3D训练策略以及计算自适应的标记化。利用这些工具,我们开发了Walrus,一种主要用于流体样连续动力学的基础模型。Walrus在天体物理学、地球科学、流变学、等离子体物理学、声学和经典流体等十九种不同场景下进行预训练。实验表明,Walrus在下游任务和预训练数据的整个范围内,在短期和长期预测方面都优于先前的基础模型,而消融研究证实了我们对预测稳定性、训练吞吐量和转移性能的贡献优于传统方法。代码和权重已向社区开放使用。
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
First: 2025-10-20T17:52:06+00:00 · Latest: 2025-11-19T17:57:07+00:00
Comments: 29 pages, 9 tables, 6 figures
Abstract
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation at both training time and test time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks. As an inference-time reranker, FARE-20B achieves near-oracle performance on MATH. As a verifier in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
中文标题/摘要
标题:基础自动评估器:扩展多任务生成评估器训练以适应以推理为中心的领域
针对训练和测试期间不断增加的可扩展评估需求,微调专门的生成评估器已成为一个流行的范式。然而,近期的工作主要集中在使用新的方法,如强化学习(RL),来训练评估器,而避免大规模的数据驱动开发。在本工作中,我们专注于数据扩展,收集了涵盖五个独特评估任务(成对、步骤级、无参考和有参考验证、单评级)以及多个以推理评估为中心的领域的250万样本。利用我们的数据,我们训练了基础自动推理评估器(FARE),这是一个包含80亿和200亿参数(其中36亿活跃参数)的评估器家族,使用简单的迭代拒绝采样监督微调(SFT)方法。FARE-8B挑战了更大的专门RL训练评估器,而FARE-20B则成为开源评估器的新标准,超越了专门的700亿+评估器。除了静态基准,我们还在实际任务中评估了FARE:作为推理时间的重排序器,FARE-20B在MATH上达到了接近完美的性能。作为RL训练中的验证器,FARE提高了下游RL训练模型的性能,最高可达14.1%优于字符串匹配验证器。从FARE初始化的持续微调FARE-Code在评估测试案例质量方面比gpt-oss-20B高出65%。
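The "near-oracle" reranking claim in this entry is easier to read with the two quantities spelled out. The sketch below is illustrative only: the data layout and the `score` callable (standing in for an evaluator such as FARE) are assumptions. The oracle counts a problem as solved if any sampled candidate is correct, which is the upper bound for a reranker.

```python
from typing import Callable, Dict, List


def pick_best(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate the evaluator scores highest (first one wins ties)."""
    return max(candidates, key=score)


def reranked_accuracy(problems: List[Dict], score: Callable[[str], float]) -> float:
    """Fraction of problems where the top-scored candidate equals the answer."""
    hits = sum(pick_best(p["candidates"], score) == p["answer"] for p in problems)
    return hits / len(problems)


def oracle_accuracy(problems: List[Dict]) -> float:
    """Upper bound: fraction of problems with at least one correct candidate."""
    hits = sum(p["answer"] in p["candidates"] for p in problems)
    return hits / len(problems)
```

A reranker is near-oracle when `reranked_accuracy` approaches `oracle_accuracy` on the same candidate pool.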
VisPlay: Self-Evolving Vision-Language Models from Images
Authors: Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
First: 2025-11-19T17:55:15+00:00 · Latest: 2025-11-19T17:55:15+00:00
Abstract
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
中文标题/摘要
标题:VisPlay:从图像中自我进化的视觉-语言模型
强化学习(RL)为提高视觉-语言模型(VLMs)在复杂推理任务上的表现提供了一个原则性的框架。然而,现有的RL方法通常依赖于人工标注的标签或特定任务的启发式方法来定义可验证的奖励,这两种方法都成本高昂且难以扩展。我们引入了VisPlay,这是一种自我进化的RL框架,使VLMs能够利用大量未标注的图像数据自主提高其推理能力。从一个基础VLM开始,VisPlay将模型分配为两个相互作用的角色:一个图像条件下的提问者,负责提出具有挑战性但可回答的视觉问题;一个跨模态推理器,生成银级答案。这些角色通过组相对策略优化(GRPO)联合训练,该方法结合了多样性和难度奖励,以平衡生成问题的复杂性和银级答案的质量。VisPlay在两个模型家族中高效扩展。当在Qwen2.5-VL和MiMo-VL上训练时,VisPlay在八个基准测试中,包括MM-Vet和MMMU,实现了视觉推理、组合泛化和幻觉减少的一致改进,展示了自我进化的跨模态智能的可扩展路径。项目页面可在https://bruno686.github.io/VisPlay/获取
Summary / 总结
VisPlay is a self-evolving reinforcement learning framework that enhances Vision-Language Models (VLMs) using large amounts of unlabeled image data. It assigns the model two roles: an Image-Conditioned Questioner that formulates challenging visual questions, and a Multimodal Reasoner that generates silver responses. These roles are trained with Group Relative Policy Optimization (GRPO), which balances question complexity and answer quality. VisPlay improves visual reasoning and compositional generalization and reduces hallucination across eight benchmarks, showing a scalable path to self-evolving multimodal intelligence.
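VisPlay is a self-evolving reinforcement learning framework for Vision-Language Models (VLMs) that uses large amounts of unlabeled image data to improve reasoning autonomously. It assigns the model two roles, an Image-Conditioned Questioner and a Multimodal Reasoner, which are trained with Group Relative Policy Optimization (GRPO) to generate challenging questions and silver responses. Across multiple benchmarks, VisPlay consistently improves visual reasoning and compositional generalization and reduces hallucination, demonstrating a scalable path toward self-evolving multimodal intelligence.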
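GRPO, the optimizer named in this entry, scores each sampled response relative to the other responses in its group rather than against a learned value function. The sketch below shows only that group-relative normalization. The use of the population standard deviation and the `eps` constant are assumptions, and VisPlay's diversity and difficulty rewards are not modeled.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason question difficulty matters in self-play setups of this kind.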
Hierarchical Semantic Tree Anchoring for CLIP-Based Class-Incremental Learning
Authors: Tao Hu, Lan Li, Zhen-Hao Xie, Da-Wei Zhou
First: 2025-11-19T17:14:47+00:00 · Latest: 2025-11-19T17:14:47+00:00
Abstract
Class-Incremental Learning (CIL) enables models to learn new classes continually while preserving past knowledge. Recently, vision-language models like CLIP have offered transferable features via multi-modal pre-training, making them well-suited for CIL. However, real-world visual and linguistic concepts are inherently hierarchical: a textual concept like "dog" subsumes fine-grained categories such as "Labrador" and "Golden Retriever," and each category entails its images. Existing CLIP-based CIL methods fail to explicitly capture this hierarchy, causing fine-grained class features to drift during incremental updates and ultimately leading to catastrophic forgetting. To address this challenge, we propose HASTEN (Hierarchical Semantic Tree Anchoring), which anchors hierarchical information into CIL to reduce catastrophic forgetting. First, we employ an external knowledge graph as supervision to embed visual and textual features in hyperbolic space, effectively preserving hierarchical structure as data evolves. Second, to mitigate catastrophic forgetting, we project gradients onto the null space of the shared hyperbolic mapper, preventing interference with prior tasks. These two steps work synergistically to enable the model to resist forgetting by maintaining hierarchical relationships. Extensive experiments show that HASTEN consistently outperforms existing methods while providing a unified structured representation.
中文标题/摘要
标题:基于CLIP的类增量学习层次语义树锚定
类增量学习(CIL)使模型能够在不断学习新类别的同时保留过去的知识。最近,像CLIP这样的跨模态模型通过多模态预训练提供了可转移的特征,使它们非常适合CIL。然而,现实世界的视觉和语言概念本质上是层次化的:一个文本概念如“狗”涵盖了诸如“拉布拉多”和“金毛寻回犬”等细粒度类别,每个类别又包含其自身的图像。但现有的基于CLIP的CIL方法未能明确捕捉到这种固有的层次结构,导致在增量更新过程中细粒度类特征漂移,并最终导致灾难性遗忘。为了解决这一挑战,我们提出了HASTEN(层次语义树锚定),将层次信息锚定到CIL中以减少灾难性遗忘。首先,我们利用外部知识图谱作为监督,将视觉和文本特征嵌入双曲空间,有效地在数据演变过程中保留层次结构。其次,为了减轻灾难性遗忘,我们将梯度投影到共享双曲映射的零空间,防止干扰先前的任务。这两步协同工作,使模型能够通过保持层次关系来抵抗遗忘。大量实验表明,HASTEN在保持统一结构表示的同时,始终优于现有方法。
Summary / 总结
The paper addresses the challenge of catastrophic forgetting in class-incremental learning (CIL) by proposing HASTEN, which captures the hierarchical nature of visual and textual concepts. HASTEN uses an external knowledge graph to embed features in hyperbolic space, preserving the hierarchical structure, and projects gradients onto the null space of the shared hyperbolic mapper to prevent interference with prior tasks. Experiments demonstrate that HASTEN outperforms existing methods in CIL while maintaining a unified structured representation.
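The study aims to improve class-incremental learning (CIL) for models such as CLIP by addressing drift in hierarchical concepts. It proposes HASTEN (Hierarchical Semantic Tree Anchoring), which uses an external knowledge graph to embed features in hyperbolic space and projects gradients onto the null space of the shared mapper to prevent forgetting. Experiments show that HASTEN outperforms existing methods while maintaining a unified structured representation.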
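To make "hyperbolic space" concrete, here is a small sketch of distance in the Poincaré ball, one common model of hyperbolic geometry. The abstract does not say which model HASTEN uses, so the choice of the Poincaré ball and the function name are assumptions. The formula is the standard one: d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))).

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

Distances grow rapidly near the boundary of the ball, so a coarse concept such as "dog" can sit near the origin while its fine-grained children spread out toward the boundary and stay well separated from each other.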
The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification
Authors: Dante Francisco Wasmuht, Otto Brookes, Maximillian Schall, Pablo Palencia, Chris Beirne, Tilo Burghardt, Majid Mirmehdi, Hjalmar Kühl, Mimi Arandjelovic, Sam Pottie, Peter Bermant, Brandon Asheim, Yi Jin Toh, Adam Elzinga, Jason Holmberg, Andrew Whitworth, Eleanor Flatt, Laura Gustafson, Chaitanya Ryali, Yuan-Ting Hu, Baishan Guo, Andrew Westbury, Kate Saenko, Didac Suris
First: 2025-11-19T17:07:08+00:00 · Latest: 2025-11-19T17:07:08+00:00
Abstract
Automated video analysis is critical for wildlife conservation. A foundational task in this domain is multi-animal tracking (MAT), which underpins applications such as individual re-identification and behavior recognition. However, existing datasets are limited in scale, constrained to a few species, or lack sufficient temporal and geographical diversity, leaving no suitable benchmark for training general-purpose MAT models applicable across wild animal populations. To address this, we introduce SA-FARI, the largest open-source MAT dataset for wild animals. It comprises 11,609 camera trap videos collected over approximately 10 years (2014-2024) from 741 locations across 4 continents, spanning 99 species categories. Each video is exhaustively annotated, culminating in ~46 hours of densely annotated footage containing 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels. Alongside the task-specific annotations, we publish anonymized camera trap locations for each video. Finally, we present comprehensive benchmarks on SA-FARI using state-of-the-art vision-language models for detection and tracking, including SAM 3, evaluated with both species-specific and generic animal prompts. We also compare against vision-only methods developed specifically for wildlife analysis. SA-FARI is the first large-scale dataset to combine high species diversity, multi-region coverage, and high-quality spatio-temporal annotations, offering a new foundation for advancing generalizable multi-animal tracking in the wild. The dataset is available at https://www.conservationxlabs.com/sa-fari.
中文标题/摘要
标题:SA-FARI数据集:在动物视频中进行任何动物的分割识别
自动视频分析对于野生动物保护至关重要。该领域的一个基础任务是多动物跟踪(MAT),它支撑着个体再识别和行为识别等应用。然而,现有数据集在规模、物种限制或时空多样性方面存在局限,没有适合训练适用于野生动物种群的通用MAT模型的基准。为解决这一问题,我们引入了SA-FARI,这是最大的开源野生动物多动物跟踪数据集。它包含从四大洲741个地点收集的约10年(2014-2024)的11,609个相机陷阱视频,覆盖99个物种类别。每个视频都进行了详尽标注,最终包含约46小时密集标注的视频片段,包含16,224个掩码身份和942,702个个体边界框、分割掩码和物种标签。除了特定任务的标注,我们还发布了每个视频的匿名相机陷阱位置。最后,我们使用最先进的视觉-语言模型在SA-FARI上进行了检测和跟踪基准测试,包括SAM 3,使用了物种特定和通用动物提示进行评估。我们还与专门为野生动物分析开发的仅视觉方法进行了比较。SA-FARI是第一个结合高物种多样性、多区域覆盖和高质量时空标注的大规模数据集,为在野外实现可泛化的多动物跟踪提供了新的基础。数据集可在$\href{https://www.conservationxlabs.com/sa-fari}{\text{conservationxlabs.com/SA-FARI}}$获取。
Summary / 总结
The research aims to improve automated video analysis for wildlife conservation by addressing the limitations of existing datasets. The study introduces SA-FARI, a large-scale dataset for multi-animal tracking of wild animals, containing 11,609 camera trap videos from 741 locations across 4 continents and spanning 99 species categories. The dataset is exhaustively annotated, with 16,224 masklet identities and 942,702 individual bounding boxes, segmentation masks, and species labels, and it is used to benchmark state-of-the-art vision-language models for detection and tracking. The dataset offers a new foundation for advancing generalizable multi-animal tracking in the wild.
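The study introduces SA-FARI, a large-scale dataset for multi-animal tracking in wildlife footage that addresses the limits of existing datasets in scale, species diversity, and temporal diversity. It contains 11,609 camera trap videos from 741 locations across 4 continents, covers 99 species categories, and is annotated in detail. The dataset is used to evaluate state-of-the-art vision-language models and vision-only methods on detection and tracking, demonstrating its value for multi-animal tracking in the wild. The dataset is available at conservationxlabs.com/SA-FARI.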
When to Think and When to Look: Uncertainty-Guided Lookback
Authors: Jing Bi, Filippos Bellos, Junjia Guo, Yayuan Li, Chao Huang, Yunlong Tang, Luchuan Song, Susan Liang, Zhongfei Zhang, Jason J. Corso, Chenliang Xu
First: 2025-11-19T17:01:02+00:00 · Latest: 2025-11-19T17:01:02+00:00
Abstract
Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in successful trajectories and correlate with better visual grounding. Building on this insight, we propose uncertainty-guided lookback, a training-free decoding strategy that combines an uncertainty signal with adaptive lookback prompts and breadth search. Our method improves overall MMMU performance, delivers the largest gains in categories where standard thinking is weak, and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. We further show that this decoding strategy generalizes, yielding consistent improvements on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.
中文标题/摘要
标题:何时思考何时查看:基于不确定性回溯
测试时的思考(即生成明确的中间推理链)已被证明能提升大型语言模型的性能,并且最近在大型视觉语言模型(LVLMs)中也显示出强大的增益。然而,尽管取得了这些有希望的结果,仍然没有系统分析思考如何影响视觉推理。我们提供了首次此类分析,通过大规模、受控的比较思考对LVLMs的影响,评估了InternVL3.5和Qwen3-VL家族中的十个变体在MMMU-val上的表现,使用宽松的令牌预算和多轮解码。我们展示了更多的思考并不总是更好的;长链往往导致忽略图像的长期错误轨迹,并且表现不如标准指令模式运行的相同模型。更深入的分析表明,某些短回溯短语,明确地回溯到图像,强烈富集于成功的轨迹中,并与更好的视觉定位相关。基于这一洞察,我们提出了基于不确定性回溯的解码策略,该策略结合了不确定性信号和自适应回溯提示及广度搜索。我们的方法在整体MMMU性能上有所提升,在标准思考较弱的类别中取得最大的增益,并优于几个强大的解码基线,固定模型家族和令牌预算下达到新的最佳水平。我们进一步展示了该解码策略的泛化能力,在五个额外的基准上取得一致的改进,包括两个广泛的多模态套件和数学聚焦的视觉推理数据集。
Summary / 总结
The study investigates how test-time thinking affects visual reasoning in large vision language models (LVLMs), comparing ten variants from the InternVL3.5 and Qwen3-VL families. It finds that more thinking is not always beneficial: long chains often produce long wrong trajectories that ignore the image. In contrast, short lookback phrases that refer back to the image are strongly enriched in successful trajectories. Based on this, the authors propose an uncertainty-guided lookback strategy that improves overall MMMU performance and outperforms several strong decoding baselines, setting a new state of the art under fixed model families and token budgets. The strategy also generalizes, improving performance on five additional benchmarks.
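The study examines how test-time thinking affects visual reasoning in large vision language models (LVLMs), comparing ten variants from the InternVL3.5 and Qwen3-VL families. It finds that more thinking is not always helpful, because long reasoning chains often lead to wrong reasoning. By contrast, short lookback phrases that explicitly refer back to the image are enriched in successful trajectories. Building on this, the authors propose an uncertainty-guided lookback strategy that improves overall performance and outperforms several strong baselines, setting a new state of the art under fixed model families and token budgets. The decoding strategy also generalizes well, delivering consistent gains on five additional benchmarks, including two broad multimodal suites and math-focused visual reasoning datasets.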
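One way to picture the decoding rule in this entry is a trigger that inserts a lookback phrase whenever the model is uncertain about the next token. The abstract does not define the uncertainty signal, so using next-token entropy, the `threshold` parameter, and the wording of `LOOKBACK_PHRASE` are all assumptions for illustration. The breadth search component is not modeled.

```python
import math
from typing import Sequence

# Hypothetical phrase: the paper's actual lookback prompts are not given in the abstract.
LOOKBACK_PHRASE = "Looking back at the image, "


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def lookback_prefix(probs: Sequence[float], threshold: float) -> str:
    """Return the lookback phrase when entropy exceeds the threshold, else nothing."""
    return LOOKBACK_PHRASE if token_entropy(probs) > threshold else ""
```

A decoder would append the returned prefix to the running chain of thought before sampling the next tokens, steering the continuation back toward the visual input only when it is needed.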
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Authors: Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, Xipeng Qiu
First: 2025-11-19T16:52:23+00:00 · Latest: 2025-11-19T16:52:23+00:00
Abstract
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations, leading to demonstration bias and limiting performance. Reinforcement learning (RL) is a vital post-training strategy to overcome these limits, yet current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. SRPO eliminates the need for external demonstrations or manual reward engineering by leveraging the model's own successful trajectories, generated within the current training batch, as a self-reference. This allows us to assign a progress-wise reward to failed attempts. A core innovation is the use of latent world representations to measure behavioral progress robustly. Instead of relying on raw pixels or requiring domain-specific fine-tuning, we utilize the compressed, transferable encodings from a world model's latent space. These representations naturally capture progress patterns across environments, enabling accurate, generalized trajectory comparison. Empirical evaluations on the LIBERO benchmark demonstrate SRPO's efficiency and effectiveness. Starting from a supervised baseline with 48.9% success, SRPO achieves a new state-of-the-art success rate of 99.2% in just 200 RL steps, representing a 103% relative improvement without any extra supervision. Furthermore, SRPO shows substantial robustness, achieving a 167% performance improvement on the LIBERO-Plus benchmark.
中文标题/摘要
标题:SRPO:自参照策略优化在视觉-语言-行动模型中的应用
视觉-语言-行动(VLA)模型在机器人操作方面表现出色,但受到对专家演示的严重依赖的限制,导致演示偏差并限制了性能。强化学习(RL)是克服这些限制的重要后训练策略,然而当前的VLA-RL方法,包括基于群体的优化方法,受到严重奖励稀疏性的困扰。依赖于二元成功指标浪费了失败轨迹中的宝贵信息,导致训练效率低下。为了解决这个问题,我们提出了自参照策略优化(SRPO),这是一种新颖的VLA-RL框架。SRPO通过利用模型自身在当前训练批次中生成的成功轨迹作为自参照,消除了对外部演示或手动奖励工程的需求。这使我们能够为失败尝试分配按进度的奖励。核心创新是使用潜在世界表示来稳健地衡量行为进步。我们利用世界模型潜在空间中的压缩、可转移编码,而不是依赖原始像素或需要领域特定微调,这些表示自然地捕捉了跨环境的进步模式,使准确、泛化的轨迹比较成为可能。在LIBERO基准上的实证评估表明SRPO的高效性和有效性。从监督基线的48.9%成功率开始,SRPO仅在200个RL步骤后达到新的最佳成功率99.2%,相对改进了103%且无需额外监督。此外,SRPO表现出显著的鲁棒性,在LIBERO-Plus基准上实现了167%的性能改进。
Summary / 总结
The research aims to address the limitations of Vision-Language-Action (VLA) models in robotic manipulation, particularly their reliance on expert demonstrations and the sparsity of rewards in reinforcement learning (RL). The proposed Self-Referential Policy Optimization (SRPO) framework uses the model's own successful trajectories to assign progress-wise rewards to failed attempts, avoiding the need for external demonstrations or manual reward engineering. SRPO leverages latent world representations from a world model's latent space to measure behavioral progress, enabling accurate trajectory comparison across environments. Experiments on the LIBERO benchmark show that SRPO significantly improves performance, achieving a 99.2% success rate in 200 RL steps, a 103% relative improvement over a supervised baseline, and demonstrating robustness with a 167% performance improvement on the LIBERO-Plus benchmark.
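The study aims to address the limitations of Vision-Language-Action (VLA) models in robotic manipulation, in particular their reliance on expert demonstrations and the reward sparsity of reinforcement learning (RL). It proposes SRPO, a new VLA-RL framework that uses the model's own successful trajectories as a self-reference to assign progress-wise rewards to failed attempts. SRPO measures behavioral progress with latent world representations from a world model's latent space, enabling accurate trajectory comparison across environments. Experiments on the LIBERO benchmark show that SRPO substantially improves performance, reaching a 99.2% success rate within 200 RL steps, a 103% relative improvement over the supervised baseline, and demonstrates robustness on the LIBERO-Plus benchmark with a 167% performance improvement.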
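The 103% figure in this entry is a relative improvement, not a difference in percentage points. A short check of the arithmetic, using only the two success rates quoted in the abstract (the function name is illustrative):

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, expressed in percent."""
    return (improved - baseline) / baseline * 100.0


# Success rates quoted in the abstract: 48.9% (supervised baseline) and 99.2% (SRPO).
gain = relative_improvement(48.9, 99.2)
print(f"{gain:.1f}% relative, {99.2 - 48.9:.1f} points absolute")
```

The absolute gain is 50.3 percentage points; dividing by the 48.9% baseline gives roughly 102.9%, which rounds to the reported 103%.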
US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery
Authors: Miruna-Alexandra Gafencu, Yordanka Velikova, Nassir Navab, Mohammad Farid Azampour
Venue: MICCAI 2025
First: 2025-11-19T16:45:04+00:00 · Latest: 2025-11-19T16:45:04+00:00
Comments: Accepted at the Workshop on Shape in Medical Imaging at MICCAI 2025
Abstract
Ultrasound offers a radiation-free, cost-effective solution for real-time visualization of spinal landmarks, paraspinal soft tissues and neurovascular structures, making it valuable for intraoperative guidance during spinal procedures. However, ultrasound suffers from inherent limitations in visualizing complete vertebral anatomy, in particular vertebral bodies, due to acoustic shadowing effects caused by bone. In this work, we present a novel multi-modal deep learning method for completing occluded anatomical structures in 3D ultrasound by leveraging complementary information from a single X-ray image. To enable training, we generate paired training data consisting of: (1) 2D lateral vertebral views that simulate X-ray scans, and (2) 3D partial vertebrae representations that mimic the limited visibility and occlusions encountered during ultrasound spine imaging. Our method integrates morphological information from both imaging modalities and demonstrates significant improvements in vertebral reconstruction (p < 0.001) compared to the state of the art in 3D ultrasound vertebral completion. We perform phantom studies as an initial step toward future clinical translation, and achieve a more accurate, complete volumetric lumbar spine visualization overlaid on the ultrasound scan without the need for registration with preoperative modalities such as computed tomography. This demonstrates that integrating a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality. Code and data can be found at https://github.com/miruna20/US-X-Complete
中文标题/摘要
标题:US-X 完整:一种多模态的三维解剖形状恢复方法
超声波提供了一种无辐射、成本效益高的实时可视化脊柱解剖标志、脊旁软组织和神经血管结构的方法,使其在脊柱手术中的术中指导中具有重要价值。然而,超声波由于骨骼引起的声影效应,在可视化完整的椎体解剖结构方面存在固有的局限性,特别是椎体本身。在本研究中,我们提出了一种新颖的多模态深度学习方法,通过利用单张X射线图像的互补信息来完成3D超声中的遮挡解剖结构。为了进行训练,我们生成了配对的训练数据,包括:(1) 模拟X射线扫描的2D侧位椎体视图,(2) 模拟超声脊柱成像中遇到的有限可见性和遮挡的3D部分椎体表示。我们的方法结合了两种成像模态的形态信息,并在椎体重建方面显著优于现有的3D超声椎体完成技术(p < 0.001)。我们进行了体模研究作为未来临床转化的初步步骤,并在超声扫描上实现了更准确、完整的椎体体积可视化,无需与术前成像模态(如计算机断层扫描)进行配准。这表明,结合单张X射线投影可以缓解超声波的关键局限性,同时保留其作为主要成像模态的优势。代码和数据可在 https://github.com/miruna20/US-X-Complete 获取。
Summary / 总结
This work addresses the limitation of ultrasound in visualizing complete vertebral anatomy by proposing a multi-modal deep learning method that combines ultrasound and X-ray images. The method generates paired training data from 2D lateral vertebral views and 3D partial vertebrae representations to improve vertebral reconstruction. The results show significant improvements (p < 0.001) over state-of-the-art techniques in 3D ultrasound vertebral completion, enabling more accurate and complete volumetric spine visualization without the need for registration with preoperative modalities. This demonstrates the potential of integrating X-ray information to enhance ultrasound's capabilities in spinal procedures.
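The study proposes a multi-modal deep learning method that combines ultrasound and X-ray images to address ultrasound's limited ability to visualize complete vertebral anatomy. The method generates paired training data from 2D lateral vertebral views and 3D partial vertebrae representations to improve vertebral reconstruction. Results show a significant improvement in vertebral reconstruction (p < 0.001) over existing 3D ultrasound vertebral completion techniques, enabling more accurate and complete volumetric visualization of the vertebrae without registration to preoperative imaging such as computed tomography. This indicates that adding a single X-ray projection mitigates ultrasound's key limitation while preserving its strengths as the primary imaging modality.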
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
Authors: Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar
First: 2025-11-19T16:09:38+00:00 · Latest: 2025-11-19T16:09:38+00:00
Comments: Accepted at the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)
Abstract
With the increasing prevalence of video content, effectively understanding and answering questions about long-form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR's effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
中文标题/摘要
标题:AVATAAR:通过时间自适应对齐与推理的代理视频回答
随着视频内容的日益普及,有效理解和回答长视频的问题已成为众多应用的必要条件。尽管大型视觉语言模型(LVLM)提高了性能,但在处理需要全面理解和详细分析的复杂查询时仍面临挑战。为克服这些障碍,我们引入了AVATAAR,这是一种模块化且可解释的框架,结合了全局和局部视频上下文,以及一个预检索思考代理和重思模块。AVATAAR创建了一个持久的全局摘要,并在重思模块和预检索思考代理之间建立了反馈循环,使系统能够根据部分答案细化检索策略,并模仿人类的迭代推理。在CinePile基准测试中,AVATAAR在时间推理、技术查询、主题问题和叙事理解方面分别取得了+5.6%、+5%、+8%和+8.2%的相对改进。我们的实验表明,每个模块都对整体性能做出了积极贡献,反馈循环对于适应性至关重要。这些发现突显了AVATAAR在增强视频理解能力方面的有效性。最终,AVATAAR提供了一种可扩展的长视频问答解决方案,结合了准确度、可解释性和可扩展性。
Summary / 总结
AVATAAR is a modular framework designed to improve the understanding and answering of questions about long-form videos. It combines global and local video context with a Pre Retrieval Thinking Agent and a Rethink Module to create a persistent summary and refine retrieval strategies iteratively. On the CinePile benchmark, AVATAAR shows significant improvements, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. The feedback loop between the modules is crucial for adaptability and performance enhancement.
AVATAAR 是一个模块化框架,旨在提高对长视频的理解和回答问题的能力。它结合了全局和局部视频上下文以及预检索思考代理和反思模块,创建持久的摘要并迭代地细化检索策略。在 CinePile 基准测试中,AVATAAR 显示出显著的改进,分别在时间推理、技术查询、主题问题和叙事理解方面取得了 +5.6%、+5%、+8% 和 +8.2% 的相对增益。模块之间的反馈循环对于适应性和性能提升至关重要。
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Venue: ICML
First: 2025-02-25T12:02:17+00:00 · Latest: 2025-11-19T14:34:24+00:00
Comments: @inproceedings{zhang2025spargeattn, title={Spargeattn: Accurate sparse attention accelerating any model inference}, author={Zhang, Jintao and Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun and Chen, Jianfei}, booktitle={International Conference on Machine Learning (ICML)}, year={2025} }
Abstract
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.
中文标题/摘要
标题:SpargeAttention: 准确且无需训练的稀疏注意力加速任意模型推理
由于注意力机制的时间复杂度为二次方,高效的注意力实现对于大型模型至关重要。幸运的是,注意力通常表现出稀疏性,即注意力图中的许多值接近零,允许省略相应的计算。许多研究利用稀疏模式加速注意力。然而,大多数现有工作集中在通过利用注意力图的特定稀疏模式在特定模型中优化注意力。一种同时保证各种模型的加速和端到端性能的通用稀疏注意力仍然难以实现。在本文中,我们提出了一种名为SpargeAttn的通用稀疏和量化注意力,适用于任何模型。我们的方法使用两阶段在线过滤器:在第一阶段,我们快速准确地预测注意力图,从而省略一些矩阵乘法。在第二阶段,我们设计了一种在线softmax感知过滤器,不会产生额外开销,并进一步省略一些矩阵乘法。实验表明,我们的方法显著加速了包括语言、图像和视频生成在内的各种模型,而不会牺牲端到端指标。代码可在https://github.com/thu-ml/SpargeAttn获取。
Summary / 总结
The research aims to address the quadratic time complexity of attention mechanisms in large models by leveraging the inherent sparsity of attention maps. SpargeAttention proposes a two-stage online filter to predict and filter out unnecessary computations, accelerating various models including language, image, and video generation, without compromising end-to-end performance. Experiments demonstrate significant speedup across different models.
SpargeAttention 通过利用注意力图中的固有稀疏性来加速大型模型中的注意力机制。它引入了两阶段的在线过滤器来预测和细化注意力图,从而跳过不必要的矩阵乘法。实验表明,SpargeAttention 可以显著加速包括语言、图像和视频生成在内的各种模型,同时不会降低性能指标。代码可在 https://github.com/thu-ml/SpargeAttn 获取。
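The first-stage idea in the SpargeAttn abstract above (predict which parts of the attention map matter, then skip the matrix multiplications for the rest) can be illustrated with a toy block-sparse attention. This is only a sketch under stated assumptions: the mean-pooled block predictor, the `keep_mass` threshold, and the function names are hypothetical and are not taken from the paper or from the thu-ml/SpargeAttn code. The second-stage softmax-aware filter and the quantization are not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, k, v):
    # Reference implementation: full softmax(QK^T / sqrt(d)) V.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def block_sparse_attention(q, k, v, block=4, keep_mass=0.95):
    """Toy block-sparse attention (illustrative assumption, not the paper's method).

    Each query block keeps only the key blocks whose predicted block-level
    attention mass, estimated from mean-pooled blocks, sums to `keep_mass`.
    """
    n, d = q.shape
    m = k.shape[0]
    assert n % block == 0 and m % block == 0
    nq, nk = n // block, m // block
    # Cheap prediction of the attention map at block granularity.
    q_pool = q.reshape(nq, block, d).mean(axis=1)
    k_pool = k.reshape(nk, block, d).mean(axis=1)
    block_probs = softmax(q_pool @ k_pool.T / np.sqrt(d))
    out = np.zeros((n, v.shape[-1]))
    for i in range(nq):
        order = np.argsort(-block_probs[i])
        cum = np.cumsum(block_probs[i][order])
        n_keep = min(int(np.searchsorted(cum, keep_mass)) + 1, nk)
        kept = np.sort(order[:n_keep])
        cols = np.concatenate([np.arange(j * block, (j + 1) * block) for j in kept])
        qi = q[i * block:(i + 1) * block]
        # Only the kept key/value blocks take part in the matmuls.
        scores = qi @ k[cols].T / np.sqrt(d)
        out[i * block:(i + 1) * block] = softmax(scores) @ v[cols]
    return out
```

Setting `keep_mass` above 1 keeps every block, so the result matches `dense_attention`; lower values trade accuracy for skipped computation.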
SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome
Authors: Dabin Jeong, Amirhossein Vahidi, Ciro Ramírez-Suástegui, Marie Moullet, Kevin Ly, Mohammad Vali Sanian, Sebastian Birk, Yinshui Chang, Adam Boxall, Daniyal Jafree, Lloyd Steele, Vijaya Baskar MS, Muzlifah Haniffa, Mohammad Lotfollahi
First: 2025-11-19T14:22:23+00:00 · Latest: 2025-11-19T14:22:23+00:00
Abstract
Recent advances in computational pathology have leveraged vision-language models to learn joint representations of Hematoxylin and Eosin (HE) images with spatial transcriptomic (ST) profiles. However, existing approaches typically align HE tiles with their corresponding ST profiles at a single scale, overlooking fine-grained cellular structures and their spatial organization. To address this, we propose Sigmma, a multi-modal contrastive alignment framework for learning hierarchical representations of HE images and spatial transcriptome profiles across multiple scales. Sigmma introduces multi-scale contrastive alignment, ensuring that representations learned at different scales remain coherent across modalities. Furthermore, by representing cell interactions as a graph and integrating inter- and intra-subgraph relationships, our approach effectively captures cell-cell interactions, ranging from fine to coarse, within the tissue microenvironment. We demonstrate that Sigmma learns representations that better capture cross-modal correspondences, leading to an average improvement of 9.78% in the gene-expression prediction task and 26.93% in the cross-modal retrieval task across datasets. We further show that it learns meaningful multi-tissue organization in downstream analyses.
中文标题/摘要
标题:SIGMMA:基于图的多层次多模态对比对齐框架,用于组织病理图像和空间转录组
计算病理学的最新进展利用视觉-语言模型学习Hematoxylin和Eosin (HE) 图像与空间转录组 (ST) 轮廓的联合表示。然而,现有方法通常在单个尺度上对HE切片与其相应的ST轮廓进行对齐,忽视了细微的细胞结构及其空间组织。为了解决这个问题,我们提出了Sigmma,这是一种多模态对比对齐框架,用于在多个尺度上学习HE图像和空间转录组轮廓的层次表示。Sigmma 引入了多尺度对比对齐,确保不同尺度下学习的表示在模态间保持一致性。此外,通过将细胞相互作用表示为图,并整合跨子图和内子图关系,我们的方法有效地捕捉了组织微环境中从精细到粗略的细胞-细胞相互作用。我们证明Sigmma 学习的表示更好地捕捉了跨模态对应关系,在基因表达预测任务中平均提高了9.78%,在跨模态检索任务中平均提高了26.93%。我们进一步表明,它在下游分析中学习了有意义的多组织组织。
Summary / 总结
The research aims to improve the alignment of histopathology images and spatial transcriptomic profiles by addressing the limitations of existing single-scale approaches. SIGMMA, a multi-modal contrastive alignment framework, learns hierarchical representations across multiple scales, ensuring coherent representations at different scales. The method uses a graph-based approach to capture cell-cell interactions from fine to coarse scales, leading to better cross-modal correspondences. Experiments show that SIGMMA improves gene-expression prediction by 9.78% and cross-modal retrieval by 26.93% compared to existing methods across various datasets.
研究旨在通过解决现有单尺度方法的局限性,改进病理图像和空间转录组学资料的对齐。SIGMMA 多模态对比对齐框架引入了多尺度对比对齐,并将细胞相互作用表示为图,以捕捉从精细到粗略的细胞-细胞相互作用。该方法显著提高了跨模态对应关系,分别在基因表达预测和跨模态检索任务中提高了9.78%和26.93%,并在下游分析中学习到了有意义的多组织组织结构。
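As a rough illustration of what "multi-scale contrastive alignment" in the SIGMMA entry above could look like, the sketch below sums a symmetric InfoNCE loss over several scales. It is a minimal sketch under assumptions: the loss form, the temperature, the per-scale weights, and the function names are hypothetical, and the graph-based modelling of cell interactions is not shown.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(img, st, temperature=0.07):
    """Symmetric InfoNCE between paired image and ST embeddings (row i matches row i)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    st = st / np.linalg.norm(st, axis=1, keepdims=True)
    logits = img @ st.T / temperature
    idx = np.arange(len(img))
    loss_img_to_st = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_st_to_img = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_img_to_st + loss_st_to_img)

def multi_scale_loss(pairs, weights=None):
    """Weighted sum of per-scale losses; `pairs` holds one (img, st) tuple per scale."""
    if weights is None:
        weights = [1.0] * len(pairs)
    return sum(w * info_nce(a, b) for w, (a, b) in zip(weights, pairs))
```

Here each tuple in `pairs` would hold embeddings at one granularity (for example tile level and cell-neighbourhood level), which is an assumed choice of scales.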
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Authors: Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
First: 2025-09-29T08:49:21+00:00 · Latest: 2025-11-19T14:22:05+00:00
Abstract
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3B to 72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), encouraging the models to identify shapes, count and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on Euclid30K, the mean VSI-Bench accuracy rose from 36.6% to 41.8% (+5.2%), and the mean MindCube accuracy rose from 31.4% to 38.1% (+6.7%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer broadly transferable spatial skills on vision-language models. Code and the Euclid30K dataset are available at https://zgca-ai4edu.github.io/Euclids_Gift.
中文标题/摘要
标题:欧几里得的礼物:通过几何代理任务增强视觉-语言模型的空间感知与推理能力
空间智能涵盖了丰富的能力,包括可视化和变换形状、心理旋转物体、判断相对位置和包含关系,以及估算数量。然而,这仍然是多模态大型语言模型(MLLMs)中的一个关键未解决的挑战。为了填补这一空白,我们建议将欧几里得几何问题解决作为代理任务。具体来说,我们精心构建了一个多模态数据集,称为Euclid30K,包含约30000个平面和立体几何问题。此外,为了使模型能够从这些几何问题中学习和应用欧几里得原理,我们使用组相对策略优化(GRPO)对Qwen2.5VL、Qwen3VL和RoboBrain2.0家族中的七个模型变体(参数规模为30亿至720亿)进行了微调,激励模型识别形状、计数和关联实体,并使用欧几里得原理进行多步演绎推理。我们的实验表明,经过微调的模型在四个空间推理基准(Super-CLEVR、Omni3DBench、VSI-Bench和MindCube)上实现了显著的零样本增益,而无需任何特定任务的调整。值得注意的是,经过Euclid30K训练后,VSI-Bench的平均准确率从36.6%提高到41.8%(+5.2%),MindCube的平均准确率从31.4%提高到38.1%(+6.7%)。据我们所知,这是首次系统研究证明几何导向的微调可以赋予视觉-语言模型广泛转移的空间技能。代码和Euclid30K数据集可以在https://zgca-ai4edu.github.io/Euclids_Gift找到。
Summary / 总结
This study addresses the challenge of enhancing spatial intelligence in Multimodal Large Language Models (MLLMs) by treating Euclidean geometry problem-solving as a surrogate task. The researchers created a dataset called Euclid30K with 30K geometry problems and fine-tuned seven model variants using Group Relative Policy Optimization (GRPO). The models showed significant improvements on four spatial reasoning benchmarks, with accuracy gains of 5.2% on VSI-Bench and 6.7% on MindCube after training on Euclid30K. This is the first systematic study demonstrating that geometry-centric fine-tuning can impart broadly transferable spatial skills to vision-language models.
该研究通过将欧几里得几何问题解决作为代理任务来提升多模态大型语言模型(MLLMs)的空间智能。研究人员创建了一个名为Euclid30K的数据集,包含30K几何问题,并使用组相对策略优化(GRPO)对七个模型变体进行了微调。模型在四个空间推理基准测试中表现出显著改进,VSI-Bench的准确率提高了5.2%,MindCube的准确率提高了6.7%。这是首次系统性研究证明,几何中心的微调可以赋予视觉语言模型广泛迁移的空间技能。
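The Euclid's Gift entry above relies on Group Relative Policy Optimization (GRPO). The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, so no learned value model is needed. The sketch below shows only that advantage computation; the reward definition (here 1 for a correct final answer, 0 otherwise) is an assumption for illustration and is not stated in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses to the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled answers to one geometry problem; two are correct (assumed 0/1 reward).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages; when every response in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.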
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Authors: Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
First: 2025-09-04T07:26:20+00:00 · Latest: 2025-11-19T13:23:53+00:00
Abstract
The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we cache images likely to be OOD samples from the historical test images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we cache the subset of ID classes that are visually similar to historical test images and then leverage MLLM reasoning to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD), making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 3.1%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
中文标题/摘要
标题:ANTS:通过测试时的多模态大语言模型理解和推理构建自适应负文本空间进行OOD检测
引入负标签(NLs)已被证明能有效提升Out-of-Distribution(OOD)检测。然而,现有方法往往缺乏对OOD图像的理解,难以构建准确的负空间。此外,缺乏与ID标签语义相似的负标签限制了其在近OOD检测中的能力。为解决这些问题,我们提出通过利用多模态大语言模型(MLLMs)的理解和推理能力来塑造自适应负文本空间(ANTS)。具体而言,我们从历史测试图像中缓存可能为OOD样本的图像,并提示MLLM描述这些图像,生成能够精确刻画OOD分布并增强远OOD检测的表达性负句子。对于近OOD设置,其中OOD样本类似于分布内(ID)子集,我们缓存与历史测试图像视觉相似的ID类子集,并利用MLLM推理生成针对该子集的视觉相似负标签,有效减少假阴性并提高近OOD检测。为了平衡这两种类型的负文本空间,我们设计了一种自适应加权得分,使方法能够处理不同的OOD任务设置(近OOD和远OOD),使其在开放环境中具有高度适应性。在ImageNet基准测试上,我们的ANTS显著降低了FPR95 3.1%,建立了新的最佳水平。此外,我们的方法无需训练且零样本,具有高可扩展性。
Summary / 总结
The paper proposes ANTS, an adaptive negative textual space shaping method for enhancing Out-of-Distribution (OOD) detection. It leverages MLLMs to generate expressive negative sentences for far-OOD detection and visually similar negative labels for near-OOD detection. Experiments on ImageNet show that ANTS reduces FPR95 by 3.1%, setting a new state-of-the-art and demonstrating high scalability without requiring training data or fine-tuning.
该研究提出了一种名为ANTs的方法,通过利用多模态大型语言模型(MLLMs)生成适应性的负文本空间来增强Out-of-Distribution (OOD)检测。该方法通过使用MLLMs描述OOD图像并生成精确的负句子,以提高远OOD检测效果。对于近OOD检测,ANTs生成视觉相似的负标签以减少误检。ANTs使用自适应加权分数来平衡远OOD和近OOD检测,实现了在ImageNet基准上的最新性能,FPR95降低了3.1%。该方法无需训练且为零样本,具有高度的可扩展性。
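The ANTS entry above reports its headline result as FPR95, the false positive rate on OOD samples at the threshold where 95% of in-distribution samples are accepted. A minimal sketch of that metric follows, assuming the convention that a higher score means "more in-distribution"; the function name and the sample scores are illustrative, not from the paper.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when 95% of ID samples are accepted."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold at the 5th percentile of ID scores: 95% of ID samples lie at or above it.
    threshold = np.percentile(id_scores, 5)
    return float(np.mean(ood_scores >= threshold))

id_scores = np.arange(1, 101)                      # toy ID scores 1..100
ood_scores = [0, 1, 2, 3, 50, 60, 5, 4, 10, 2]     # toy OOD scores
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

Lower is better: a detector that separates the two score distributions perfectly has an FPR95 of 0.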
D4C: Data-free Quantization for Contrastive Language-Image Pre-training Models
Authors: Wenlun Zhang, Yunshan Zhong, Zihao Ding, Xinyu Li, Kentaro Yoshioka
First: 2025-11-19T13:08:25+00:00 · Latest: 2025-11-19T13:08:25+00:00
Abstract
Data-Free Quantization (DFQ) offers a practical solution for model compression without requiring access to real data, making it particularly attractive in privacy-sensitive scenarios. While DFQ has shown promise for unimodal models, its extension to Vision-Language Models such as Contrastive Language-Image Pre-training (CLIP) models remains underexplored. In this work, we reveal that directly applying existing DFQ techniques to CLIP results in substantial performance degradation due to two key limitations: insufficient semantic content and low intra-image diversity in synthesized samples. To tackle these challenges, we propose D4C, the first DFQ framework tailored for CLIP. D4C synthesizes semantically rich and structurally diverse pseudo images through three key components: (1) Prompt-Guided Semantic Injection aligns generated images with real-world semantics using text prompts; (2) Structural Contrastive Generation reproduces compositional structures of natural images by leveraging foreground-background contrastive synthesis; and (3) Perturbation-Aware Enhancement applies controlled perturbations to improve sample diversity and robustness. These components jointly empower D4C to synthesize images that are both semantically informative and structurally diverse, effectively bridging the performance gap of DFQ on CLIP. Extensive experiments validate the effectiveness of D4C, showing significant performance improvements on various bit-widths and models. For example, under the W4A8 setting with CLIP ResNet-50 and ViT-B/32, D4C achieves Top-1 accuracy improvement of 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K in zero-shot classification, respectively.
中文标题/摘要
标题:D4C:对比语言-图像预训练模型的无数据量化
无数据量化(DFQ)提供了一种在无需访问真实数据的情况下进行模型压缩的实用解决方案,特别是在涉及隐私的场景中尤为吸引人。尽管DFQ在单模态模型中显示出潜力,但将其扩展到如对比语言-图像预训练(CLIP)模型等视觉-语言模型方面仍处于探索阶段。在本文中,我们揭示了直接将现有DFQ技术应用于CLIP会导致显著性能下降,原因在于两个关键限制:语义内容不足和合成样本内的低图像多样性。为应对这些挑战,我们提出了D4C,这是第一个针对CLIP的DFQ框架。D4C通过三个关键组件生成语义丰富且结构多样的伪图像:(1)提示引导的语义注入使用文本提示将生成的图像与现实世界的语义对齐;(2)结构对比生成利用前景-背景对比合成来重现自然图像的组成结构;(3)扰动感知增强通过施加受控扰动来提高样本的多样性和鲁棒性。这些组件共同赋予D4C生成既语义丰富又结构多样的图像的能力,有效地弥合了DFQ在CLIP上的性能差距。广泛的实验验证了D4C的有效性,展示了在各种位宽和模型上显著的性能提升。例如,在W4A8设置下,使用CLIP ResNet-50和ViT-B/32,D4C在CIFAR-10上的Top-1精度提高了12.4%,在CIFAR-100上的Top-1精度提高了6.8%,在ImageNet-1K上的Top-1精度提高了1.4%。
Summary / 总结
The research addresses the largely unexplored problem of applying Data-Free Quantization (DFQ) to Vision-Language Models such as CLIP. The authors propose D4C, a DFQ framework that synthesizes semantically rich and structurally diverse pseudo images through three key components: Prompt-Guided Semantic Injection, Structural Contrastive Generation, and Perturbation-Aware Enhancement. Under the W4A8 setting in zero-shot classification, D4C improves Top-1 accuracy with CLIP ResNet-50 and ViT-B/32 by 12.4% and 18.9% on CIFAR-10, 6.8% and 19.7% on CIFAR-100, and 1.4% and 5.7% on ImageNet-1K, respectively.
本文解决了将无数据量化(DFQ)应用于对比语言-图像预训练(CLIP)模型的问题:直接套用现有DFQ方法时,合成样本存在语义内容不足和图像内多样性低的问题,导致性能下降。为了解决这些问题,作者提出了D4C,这是一种新颖的DFQ框架,通过三种组件生成语义丰富且结构多样的伪图像:Prompt-Guided Semantic Injection、Structural Contrastive Generation和Perturbation-Aware Enhancement。实验结果表明,D4C显著提高了性能:在W4A8设置下的零样本分类中,使用CLIP ResNet-50和ViT-B/32时,CIFAR-10的Top-1准确率分别提高12.4%和18.9%,CIFAR-100分别提高6.8%和19.7%,ImageNet-1K分别提高1.4%和5.7%。
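The "W4A8" setting in the D4C entry above denotes 4-bit weights and 8-bit activations. The sketch below shows uniform fake quantization (quantize, then dequantize) for a single linear layer. It is a minimal sketch assuming per-tensor scales, symmetric weights, and asymmetric activations; these choices and the function names are illustrative and are not taken from the paper. The data-free part, synthesizing the calibration images, is not shown.

```python
import numpy as np

def fake_quantize(x, bits, symmetric=True):
    """Uniformly quantize `x` to `bits` bits and map it back to floats."""
    x = np.asarray(x, dtype=np.float64)
    if symmetric:
        qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4 bits
        scale = np.abs(x).max() / qmax
        if scale == 0:
            return x.copy()
        q = np.clip(np.round(x / scale), -qmax, qmax)
        return q * scale
    qmax = 2 ** bits - 1                        # e.g. 255 for 8 bits
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax
    if scale == 0:
        return x.copy()
    q = np.clip(np.round((x - lo) / scale), 0, qmax)
    return q * scale + lo

def w4a8_linear(x, w):
    """Linear layer with 8-bit activations and 4-bit weights (simulated in floats)."""
    return fake_quantize(x, 8, symmetric=False) @ fake_quantize(w, 4).T
```

With symmetric 4-bit weights there are at most 15 distinct levels, and the rounding error per weight is bounded by half the scale.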
Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection
Authors: Fanxiao Li, Jiaying Wu, Tingchao Fu, Yunyun Dong, Bingbing Song, Wei Zhou
First: 2025-08-18T08:19:43+00:00 · Latest: 2025-11-19T12:36:58+00:00
Abstract
The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model's internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.
中文标题/摘要
标题:远离真相:由GenAI驱动的新闻多样性挑战LVLM基的误信息检测
多模态误信息的泛滥对公共话语和社会信任构成了日益增长的威胁。虽然大型视觉语言模型(LVLM)在多模态误信息检测(MMD)方面取得了近期进展,但生成型人工智能(GenAI)工具的兴起引入了一个新的挑战:由GenAI驱动的新闻多样性,其特征是内容高度多样化和复杂化。我们表明,这种多样性导致了多级漂移,包括(1)模型级感知漂移,其中风格变化干扰了模型的内部推理,以及(2)证据级漂移,其中表达多样性降低了检索外部证据的质量或相关性。这些漂移显著削弱了当前基于LVLM的MMD系统的稳健性。为了系统地研究这一问题,我们引入了DriftBench,这是一个包含16,000个新闻实例的大规模基准,涵盖了六个多样化的类别。我们设计了三个评估任务:(1)在多级漂移下的事实验证稳健性;(2)对抗性证据污染的易感性,这些证据由GenAI生成;以及(3)对多样输入推理一致性的分析。六种最先进的基于LVLM的检测器的实验表明,性能下降显著(平均F1 -14.8%),推理轨迹越来越不稳定,在对抗性证据注入下失败更加严重。我们的研究揭示了现有MMD系统的基本脆弱性,并建议在GenAI时代迫切需要更稳健的方法。
Summary / 总结
The paper addresses the challenge of GenAI-driven news diversity in multimodal misinformation detection, which induces model-level and evidence-level drifts, degrading the robustness of current LVLM-based systems. To study this, the authors introduce DriftBench, a large-scale benchmark, and evaluate six state-of-the-art detectors, showing significant performance drops and unstable reasoning traces under diverse inputs and adversarial evidence. This highlights the need for more resilient MMD approaches in the GenAI era.
论文探讨了由GenAI驱动的新闻多样性在多模态虚假信息检测中带来的多级漂移问题,影响了模型级感知漂移和证据级质量。为了研究这一问题,作者创建了包含16,000个新闻实例的DriftBench基准,并评估了六种最先进的LVLM基检测器,发现这些漂移导致了显著的性能下降和不稳定的推理轨迹,尤其是在对抗性证据注入的情况下。这表明在GenAI时代需要更 robust 的MMD系统。
Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
Authors: Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang
First: 2025-11-19T12:11:36+00:00 · Latest: 2025-11-19T12:11:36+00:00
Abstract
Understanding complex human activities demands the ability to decompose motion into fine-grained, semantic-aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI and virtual reality. Yet, most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and enforcing inter-segment separation, all without altering the pretrained encoder. Experiments on three motion-language datasets demonstrate state-of-the-art motion grounding effectiveness and efficiency, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark. Significant improvements also appear in downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
中文标题/摘要
标题:零样本开放词汇人体动作定位与测试时训练
理解复杂的human activities需要将动作分解为细粒度的、语义对齐的子动作。这一动作定位过程对于行为分析、具身人工智能和虚拟现实至关重要。然而,现有的大多数方法依赖于密集的监督和预定义的动作类别,这在开放词汇的真实世界环境中是不可行的。在本文中,我们提出了一种名为ZOMG的零样本、开放词汇框架,该框架可以在无需任何注释或微调的情况下将动作序列分割为具有语义意义的子动作。技术上,ZOMG结合了(1)语言语义分割,利用大型语言模型将指令分解为有序的子动作单元,以及(2)软遮罩优化,学习实例特定的时间掩码以聚焦于对子动作至关重要的帧,同时保持内部段的连续性并强制不同段之间的分离,而不改变预训练编码器。在三个动作-语言数据集上的实验表明,ZOMG在动作定位性能上达到了最先进的效果和效率,在HumanML3D基准测试上mAP提高了8.7%。同时,在下游检索中也取得了显著的改进,建立了无注释动作理解的新范式。
Summary / 总结
The paper proposes ZOMG, a zero-shot open-vocabulary framework for human motion grounding that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. It uses language semantic partition to decompose instructions and soft masking optimization to learn instance-specific temporal masks. Experiments show that ZOMG outperforms prior methods by +8.7% mAP on the HumanML3D benchmark and improves downstream retrieval, establishing a new paradigm for annotation-free motion understanding.
论文提出了ZOMG,这是一种零样本开放词汇框架,用于将人体运动分割成具有语义意义的子动作,无需任何注释或微调。它使用语言语义分割来分解指令,并使用软遮罩优化来学习实例特定的时间掩码。实验表明,ZOMG在HumanML3D基准上的mAP性能比先前的方法高出+8.7%,并且在下游检索中也取得了显著改进,建立了无注释运动理解的新范式。
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
Authors: Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim
First: 2024-11-28T11:27:56+00:00 · Latest: 2025-11-19T11:56:48+00:00
Comments: Accepted to TMLR 2025. First two authors contributed equally
Abstract
Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques such as data augmentation remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that conventional image augmentations fall short for RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
中文标题/摘要
标题:MaskRIS:面向引用图像分割的语义失真感知数据增强
引用图像分割(RIS)是一项先进的视觉-语言任务,涉及根据自由形式的文本描述识别和分割图像中的对象。尽管先前的研究集中在对齐视觉和语言特征上,但探索训练技术,如数据增强,仍然未被充分研究。在本文中,我们探索了RIS的有效数据增强,并提出了一种名为Masked Referring Image Segmentation(MaskRIS)的新颖训练框架。我们观察到,传统的图像增强方法对RIS来说效果不佳,导致性能下降,而简单的随机遮掩显著提高了RIS的性能。MaskRIS使用图像和文本遮掩,随后通过失真感知上下文学习(DCL)充分利用遮掩策略的优势。这种方法可以提高模型对遮挡、不完整信息和各种语言复杂性的鲁棒性,从而显著提高性能。实验表明,MaskRIS可以轻松应用于各种RIS模型,在完全监督和弱监督设置中均优于现有方法。最后,MaskRIS在RefCOCO、RefCOCO+和RefCOCOg数据集上达到了新的最佳性能。代码可在https://github.com/naver-ai/maskris/获取。
Summary / 总结
MaskRIS is a novel training framework for Referring Image Segmentation (RIS) that uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL), to enhance model robustness. This approach significantly improves performance in handling occlusions and linguistic complexities, outperforming existing methods in both fully and weakly supervised settings and achieving new state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg datasets.
MaskRIS 是一种用于 Referring Image Segmentation (RIS) 的新颖数据增强框架,结合了图像和文本掩码,并采用 Distortion-aware Contextual Learning (DCL)。这种方法提高了模型对遮挡和语言复杂性的鲁棒性,导致在完全监督和弱监督设置中取得了显著的性能提升。MaskRIS 在 RefCOCO、RefCOCO+ 和 RefCOCOg 数据集上实现了新的最佳性能。
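The MaskRIS entry above describes combining random image masking with random text masking. The sketch below illustrates the two augmentations in isolation. The patch size, masking ratios, mask token, and function names are assumptions chosen for illustration and do not come from the paper or the naver-ai/maskris repository; Distortion-aware Contextual Learning (DCL) is not shown.

```python
import numpy as np

def mask_image_patches(image, patch=16, ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    grid_h, grid_w = h // patch, w // patch
    n_patches = grid_h * grid_w
    n_mask = int(round(ratio * n_patches))
    chosen = rng.choice(n_patches, size=n_mask, replace=False)
    out = image.copy()
    for idx in chosen:
        r, c = divmod(int(idx), grid_w)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return out

def mask_text_tokens(tokens, ratio=0.15, mask_token="[MASK]", rng=None):
    """Replace a random subset of tokens with a mask token."""
    if rng is None:
        rng = np.random.default_rng()
    n_mask = int(round(ratio * len(tokens)))
    chosen = set(rng.choice(len(tokens), size=n_mask, replace=False).tolist())
    return [mask_token if i in chosen else t for i, t in enumerate(tokens)]
```

A training loop would typically apply both functions to each (image, expression) pair while leaving the target segmentation mask unchanged, which is also an assumption here.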
IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
Authors: Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Yunfei Zhao, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
First: 2025-09-29T12:38:06+00:00 · Latest: 2025-11-19T11:16:00+00:00
Abstract
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/SIGMME/IWR-Bench.
中文标题/摘要
标题:IWR-Bench:大型视觉语言模型能否从用户交互视频中重建交互网页?
网页到代码的任务要求模型理解网页的视觉表示并生成相应的代码。然而,现有的基准主要集中在静态屏幕截图到代码的任务上,从而忽视了真实世界网页应用程序中至关重要的动态交互。为了解决这一局限性,本文引入了IWR-Bench,这是一个新的基准,用于评估大型视觉语言模型(LVLM)从视频中重建交互网页的能力。IWR-Bench 包含来自100个真实网站的113个精心策划的任务,涉及1001个动作,具有多样化的交互复杂性(例如,网页游戏)、视觉风格和领域。每个任务不仅包括用户交互视频,还包括所有抓取的静态资产(例如,图片、视频)。该基准评估模型在两个基本挑战上的表现:综合多模态推理以从视频和资产中推断交互逻辑,以及高级代码生成以将这种逻辑转化为功能代码。一个代理作为裁判的框架结合了全面的度量系统,自动评估生成网页的功能正确性和视觉保真度。在28个LVLM上的广泛实验揭示了一个显著的挑战:最佳模型的整体得分为36.35%,功能正确性(24.39% IFS)远远落后于视觉保真度(64.25% VFS)。这些结果突显了当前模型在推理时间动态和合成事件驱动逻辑方面的重要局限性,确立了IWR-Bench作为视觉语言研究具有挑战性的前沿领域。基准和评估代码将在https://github.com/SIGMME/IWR-Bench/公开。
Summary / 总结
The paper introduces IWR-Bench, a new benchmark for evaluating Large Vision-Language Models (LVLMs) in reconstructing interactive webpages from user interaction videos. It consists of 113 tasks from 100 real-world websites, focusing on dynamic interactions and diverse visual styles. Experiments on 28 LVLMs show that the best model achieves only 36.35% overall score, with functional correctness lagging behind visual fidelity. This highlights the need for better temporal reasoning and event-driven logic synthesis in LVLMs.
论文提出了IWR-Bench,这是一个新的基准,用于评估大型视觉-语言模型(LVLM)从用户交互视频重建交互式网页的能力。该基准包括来自100个真实网站的113个任务,具有多样化的交互复杂性和视觉风格。对28个LVLM的实验显示,最佳模型的总体得分为36.35%,功能正确性明显落后于视觉保真度。这表明当前模型在处理时间动态和合成事件驱动逻辑方面存在不足,需要进一步改进。
What Your Features Reveal: Data-Efficient Black-Box Feature Inversion Attack for Split DNNs
Authors: Zhihan Ren, Lijun He, Jiaxi Liang, Xinzhu Fu, Haixia Bi, Fan Li
First: 2025-11-19T10:30:38+00:00 · Latest: 2025-11-19T10:30:38+00:00
Abstract
Split DNNs enable edge devices by offloading intensive computation to a cloud server, but this paradigm exposes privacy vulnerabilities, as the intermediate features can be exploited to reconstruct the private inputs via Feature Inversion Attack (FIA). Existing FIA methods often produce limited reconstruction quality, making it difficult to assess the true extent of privacy leakage. To reveal the privacy risk of the leaked features, we introduce FIA-Flow, a black-box FIA framework that achieves high-fidelity image reconstruction from intermediate features. To exploit the semantic information within intermediate features, we design a Latent Feature Space Alignment Module (LFSAM) to bridge the semantic gap between the intermediate feature space and the latent space. Furthermore, to rectify distributional mismatch, we develop Deterministic Inversion Flow Matching (DIFM), which projects off-manifold features onto the target manifold with one-step inference. This decoupled design simplifies learning and enables effective training with few image-feature pairs. To quantify privacy leakage from a human perspective, we also propose two metrics based on a large vision-language model. Experiments show that FIA-Flow achieves more faithful and semantically aligned feature inversion across various models (AlexNet, ResNet, Swin Transformer, DINO, and YOLO11) and layers, revealing a more severe privacy threat in Split DNNs than previously recognized.
中文标题/摘要
标题:你的特征透露什么:面向Split DNN的高效黑盒特征反转攻击
Split DNNs通过将密集计算卸载到云服务器来使边缘设备受益,但这种模式暴露了隐私漏洞,因为中间特征可以通过特征反转攻击(FIA)被利用以重建私有输入。现有的FIA方法通常重建质量有限,使得难以评估隐私泄露的真实程度。为了揭示泄露特征的隐私风险,我们引入了FIA-Flow,这是一种黑盒FIA框架,可以从中间特征实现高保真图像重建。为了利用中间特征中的语义信息,我们设计了隐特征空间对齐模块(LFSAM)以弥合中间特征空间与隐空间之间的语义差距。此外,为了纠正分布不匹配,我们开发了确定性反转流匹配(DIFM),它通过一次推理将非流形特征投影到目标流形上。这种解耦设计简化了学习并使有效训练仅需少量图像-特征对成为可能。为了从人类视角量化隐私泄露,我们还提出了基于大型视觉-语言模型的两个度量标准。实验表明,FIA-Flow在各种模型(AlexNet、ResNet、Swin Transformer、DINO和YOLO11)和层上实现了更忠实且语义对齐的特征反转,揭示了Split DNNs中的隐私威胁比之前认为的更为严重。
Summary / 总结
The research aims to address privacy vulnerabilities in Split DNNs by developing FIA-Flow, a black-box Feature Inversion Attack framework that achieves high-fidelity image reconstruction from intermediate features. It introduces a Latent Feature Space Alignment Module to align semantic information and a Deterministic Inversion Flow Matching to rectify distributional mismatch. Experiments demonstrate that FIA-Flow provides more faithful and semantically aligned reconstructions across different models and layers, highlighting a more severe privacy threat in Split DNNs than previously thought.
研究旨在通过开发FIA-Flow,一种黑盒特征反转攻击框架,来解决Split DNNs中中间特征带来的隐私风险。该研究引入了Latent Feature Space Alignment Module (LFSAM) 来对齐语义信息,并开发了Deterministic Inversion Flow Matching (DIFM) 来纠正分布不匹配。实验表明,FIA-Flow 能够实现高保真度的图像重建,并揭示了Split DNNs 中比现有方法更严重的隐私威胁,提供了更准确的隐私泄漏评估。
When Words Change the Model: Sensitivity of LLMs for Constraint Programming Modelling
Authors: Alessio Pellegrino, Jacopo Mauro
First: 2025-11-18T10:40:32+00:00 · Latest: 2025-11-19T10:26:11+00:00
Abstract
One of the long-standing goals in optimisation and constraint programming is to describe a problem in natural language and automatically obtain an executable, efficient model. Large language models appear to bring this vision closer, showing impressive results in automatically generating models for classical benchmarks. However, much of this apparent success may derive from data contamination rather than genuine reasoning: many standard CP problems are likely included in the training data of these models. To examine this hypothesis, we systematically rephrased and perturbed a set of well-known CSPLib problems to preserve their structure while modifying their context and introducing misleading elements. We then compared the models produced by three representative LLMs across original and modified descriptions. Our qualitative analysis shows that while LLMs can produce syntactically valid and semantically plausible models, their performance drops sharply under contextual and linguistic variation, revealing shallow understanding and sensitivity to wording.
中文标题/摘要
标题:当词语改变模型:大规模语言模型在约束编程建模中的敏感性
优化和约束编程领域的一个长期目标是用自然语言描述一个问题,并自动获得一个可执行且高效的模型。大规模语言模型似乎使这一愿景更接近现实,展示了在经典基准问题上自动生成模型的惊人成果。然而,这种表面上的成功可能更多来自于数据污染而非真正的推理:许多标准的CP问题很可能包含在这些模型的训练数据中。为了检验这一假设,我们系统地重新表述并扰动了一组著名的CSPLib问题,以保持其结构同时修改其上下文并引入误导性元素。然后,我们比较了三种代表性的大规模语言模型在原始描述和修改描述下生成的模型。我们的定性分析表明,虽然大规模语言模型可以生成语法正确且语义合理的模型,但在上下文和语言变化下其性能急剧下降,揭示了浅薄的理解和对措辞的敏感性。
Summary / 总结
The study investigates the sensitivity of large language models (LLMs) to natural language descriptions of constraint programming (CP) problems. Motivated by the goal of automatically generating efficient models from natural language, the research systematically modified well-known CSPLib problems to test the models' robustness. Key findings indicate that while LLMs can generate valid models, their performance significantly decreases under contextual and linguistic changes, suggesting shallow understanding and high sensitivity to wording.
研究旨在通过系统地修改CSPLib中的经典问题来考察大型语言模型(LLMs)对自然语言描述的敏感性。研究比较了三种LLM在原始描述和修改描述下生成的模型,结果显示尽管LLM能够生成有效的模型,但在上下文和语言变化下其性能显著下降,表明其理解浅薄且对措辞高度敏感。
Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models
Authors: Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An, Morteza Saberi
Venue: AAAI 2026
First: 2025-11-19T10:22:22+00:00 · Latest: 2025-11-19T10:22:22+00:00
Comments: Accepted by AAAI 2026. 7 pages, 4 figures
Abstract
3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs.
中文标题/摘要
标题:随走随适应:无需训练的在线3D视觉-语言基础模型测试时适应
3D视觉-语言基础模型(VLFMs)在开放世界点云处理任务中展示了强大的泛化能力和零样本识别能力。然而,在实际场景中,由于数据噪声、不完整或与训练数据分布不同,这些模型往往表现不佳。为解决这一问题,我们提出了一种基于动态原型学习的无需训练的在线测试时适应(TTA)策略——Uni-Adapter。我们定义了一个3D缓存来存储类特定的聚类中心作为原型,并不断更新以捕捉异质数据分布中的类内变异性。这些动态原型作为基于缓存的逻辑计算的锚点,通过相似性评分进行计算。同时,基于图的标签平滑模块捕捉原型之间的相似性,以确保相似原型之间的标签一致性。最后,我们使用熵加权聚合统一原始3D VLFM和精炼的3D缓存的预测,以实现可靠的适应。无需重新训练,Uni-Adapter 有效缓解了分布偏移,实现了在不同3D VLFMs上的多种3D基准测试中的最佳性能,分别提高了ModelNet-40C 10.55%,ScanObjectNN-C 8.26%,和ShapeNet-C 4.49%。
Summary / 总结
The paper addresses the underperformance of 3D Vision-Language Foundation Models (VLFMs) in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. It introduces Uni-Adapter, a training-free online test-time adaptation strategy based on dynamic prototype learning. Uni-Adapter uses a 3D cache to store and continuously update class-specific cluster centers as prototypes, which anchor cache-based logit computation; a graph-based label smoothing module enforces label consistency among similar prototypes, and entropy-weighted aggregation unifies the cache predictions with those of the original model. Experiments show improvements of 10.55%, 8.26%, and 4.49% on ModelNet-40C, ScanObjectNN-C, and ShapeNet-C, respectively, over the source 3D VLFMs.
研究旨在提高3D Vision-Language Foundation Models (VLFMs)在具有噪声或不同数据分布的实际场景中的性能。提出了一种名为Uni-Adapter的训练-free在线测试时自适应策略,该策略使用动态原型学习来更新存储在3D缓存中的类特定聚类中心。该方法在各种3D基准测试中提高了模型性能,实现了显著的提升,如在ModelNet-40C上提高了10.55%,在ScanObjectNN-C上提高了8.26%,在ShapeNet-C上提高了4.49%,超过了原始VLFMs。
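The entropy-weighted aggregation mentioned in the Uni-Adapter abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the `exp(-entropy)` weights and the function names below are assumptions chosen for illustration, not the authors' implementation. The idea shown is only that the lower-entropy (more confident) prediction source receives the larger weight.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weighted_fusion(model_logits, cache_logits):
    # Hypothetical weighting (an assumption): each source is weighted by
    # exp(-H), where H is the entropy of its softmax distribution.
    w_model = math.exp(-entropy(softmax(model_logits)))
    w_cache = math.exp(-entropy(softmax(cache_logits)))
    z = w_model + w_cache
    return [(w_model * a + w_cache * b) / z
            for a, b in zip(model_logits, cache_logits)]

# The uniform (uncertain) model logits are outweighed by the confident cache.
fused = entropy_weighted_fusion([1.0, 1.0, 1.0], [5.0, 0.0, 0.0])
```

In this toy call the fused logits lean toward the cache prediction because its distribution is far more peaked than the uniform one.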
Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception
Authors: Jiashu Yang, Yifan Han, Yucheng Xie, Ning Guo, Wenzhao Lian
First: 2025-11-19T09:42:08+00:00 · Latest: 2025-11-19T09:42:08+00:00
Abstract
In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using the 2D bounding box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision language action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently performs instructed scenes in real-world environments and actively acquires more accurate visual information through instruction-driven actions of rotation and zoom, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed and spatially rich, large-scale embodied data, and actively acquires highly informative visual observations for downstream embodied tasks.
中文标题/摘要
标题:看,放大,理解:用于具身感知的机器人眼球
在具身AI感知系统中,视觉感知应该是主动的:目标不是被动处理静态图像,而是在像素和空间预算限制内积极获取更具信息量的数据。现有的视觉模型和固定RGB-D相机系统根本无法在广泛覆盖与精细细节获取之间取得平衡,严重限制了它们在开放世界机器人应用中的有效性。为了解决这一问题,我们提出了EyeVLA,一种用于主动视觉感知的机器人眼球,可以根据指令采取主动行动,实现对精细目标物体的清晰观察和广泛空间范围内的详细信息。EyeVLA 将行动行为离散化为行动令牌,并将其与具备强大开放世界理解能力的视觉语言模型(VLMs)集成,从而在单一自回归序列中实现视觉、语言和行动的联合建模。通过使用2D边界框坐标引导推理链,并应用强化学习来优化视角选择策略,我们仅使用少量真实世界数据将VLM的开放世界场景理解能力转移到视觉语言行动(VLA)策略上。实验表明,我们的系统在真实世界环境中高效执行指令场景,并通过指令驱动的旋转和放大动作主动获取更准确的视觉信息,从而实现强大的环境感知能力。EyeVLA 引入了一种新的机器人视觉系统,利用详细且空间丰富的大规模具身数据,并主动获取高度信息丰富的视觉观察,以支持下游具身任务。
Summary / 总结
The research aims to develop an active visual perception system for embodied AI, addressing the limitations of existing vision models and fixed cameras in acquiring both wide-area coverage and fine-grained details. EyeVLA, a robotic eyeball, uses action tokens and vision-language models to enable proactive observation and detailed scene understanding. Experiments demonstrate that EyeVLA can efficiently perform instructed tasks in real-world environments and acquire more accurate visual information through rotation and zoom actions, enhancing environmental perception capabilities.
研究提出了一种名为EyeVLA的机器人眼球系统,能够根据指令采取主动行动。它将动作令牌与视觉语言模型集成,以实现视觉、语言和动作的联合建模。实验表明,EyeVLA能够高效地在真实环境中执行指令任务,并通过旋转和放大等指令驱动的动作获取更准确的视觉信息,从而增强环境感知能力。
Reinforcement Learning in Queue-Reactive Models: Application to Optimal Execution
Authors: Tomas Espana, Yadh Hafsi, Fabrizio Lillo, Edoardo Vittori
First: 2025-11-19T09:26:23+00:00 · Latest: 2025-11-19T09:26:23+00:00
Abstract
We investigate the use of Reinforcement Learning for the optimal execution of meta-orders, where the objective is to execute incrementally large orders while minimizing implementation shortfall and market impact over an extended period of time. Departing from traditional parametric approaches to price dynamics and impact modeling, we adopt a model-free, data-driven framework. Since policy optimization requires counterfactual feedback that historical data cannot provide, we employ the Queue-Reactive Model to generate realistic and tractable limit order book simulations that encompass transient price impact, and nonlinear and dynamic order flow responses. Methodologically, we train a Double Deep Q-Network agent on a state space comprising time, inventory, price, and depth variables, and evaluate its performance against established benchmarks. Numerical simulation results show that the agent learns a policy that is both strategic and tactical, adapting effectively to order book conditions and outperforming standard approaches across multiple training configurations. These findings provide strong evidence that model-free Reinforcement Learning can yield adaptive and robust solutions to the optimal execution problem.
中文标题/摘要
标题:强化学习在队列反应模型中的应用:最优执行
我们研究了强化学习在元订单最优执行中的应用,目标是在较长时间内逐步执行大订单,同时最小化实现亏损和市场影响。不同于传统的参数化价格动态和影响建模方法,我们采用了一种无模型的数据驱动框架。由于策略优化需要历史数据无法提供的反事实反馈,我们使用队列反应模型生成包含瞬时价格影响和非线性动态订单流响应的现实且可处理的限价订单簿模拟。从方法论上讲,我们在包含时间、库存、价格和深度变量的状态空间中训练了一个双深Q网络代理,并将其性能与现有基准进行比较。数值模拟结果表明,该代理学会了既能战略性又能战术性执行的策略,能够有效适应订单簿条件,并在多种训练配置中优于标准方法。这些发现提供了强有力的证据,表明无模型的强化学习可以为最优执行问题提供适应性和鲁棒性的解决方案。
Summary / 总结
The research aims to apply Reinforcement Learning to the optimal execution of large orders, minimizing implementation shortfall and market impact. A model-free approach using the Queue-Reactive Model generates realistic limit order book simulations for training a Double Deep Q-Network agent. The agent learns a strategic and tactical policy, outperforming benchmarks in various training scenarios, indicating the effectiveness of model-free Reinforcement Learning in solving the optimal execution problem.
研究旨在利用强化学习进行大额订单的最优执行,以最小化实施亏损和市场影响。采用无模型方法和队列反应模型生成真实的限价订单簿模拟。双深Q网络代理基于时间、库存、价格和深度变量的状态空间进行训练。实验结果表明,代理学会了有效的策略,在各种训练配置中均优于标准基准,表明无模型强化学习在最优执行问题上具有适应性和稳健性的潜力。
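For readers unfamiliar with the Double Deep Q-Network named in the abstract above, the sketch below shows the standard Double DQN bootstrap target: the online network selects the next action, and the target network evaluates it. The function name, the list-based Q-values, and the default discount factor are illustrative assumptions; the paper's state features (time, inventory, price, depth) and reward definition are not reproduced here.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Return the Double DQN regression target for one transition.

    next_q_online: Q-values of the next state from the online network.
    next_q_target: Q-values of the next state from the target network.
    """
    if done:
        # Terminal transition: no bootstrapping.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ... while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]

# Online net prefers action 1, so the target net's value for action 1 is used.
y = double_dqn_target(1.0, [0.2, 0.9, 0.1], [5.0, 2.0, 7.0], gamma=0.5)
```

Decoupling selection from evaluation is what distinguishes this target from the plain DQN target, which would take the maximum over the target network's own values.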
Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery
Authors: Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral
First: 2025-06-12T13:36:27+00:00 · Latest: 2025-11-19T09:18:27+00:00
Abstract
Accurate automatic screening of minors in unconstrained images requires models robust to distribution shift and resilient to the under-representation of children in public datasets. To address these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). This design focuses on the legally critical age range while keeping the backbone frozen. Class imbalance is mitigated through an $α$-reweighted focal loss and age-balanced mini-batch sampling, while an age gap removes ambiguous samples near thresholds.
Evaluation is conducted on our new Overall Underage Benchmark (303k cleaned training images, 110k test images), defining both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild-shifts test "ASWIFT-20k" of 20k-images, stressing extreme poses ($>$45°), expressions, and low image quality to emulate real-world shifts.
Trained on the cleaned overall set with resampling and age gap, our multiage model "F" reduces the mean absolute error on ASORES-39k from 4.175 y (age-only baseline) to 4.068 y and improves under-18 detection from F2 score of 0.801 to 0.857 at 1% false-adult rate. Under the ASWIFT-20k, the same configuration nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, demonstrating robustness to domain shift.
中文标题/摘要
标题:基于多任务和多年龄段方法的未成年人检测在不受约束图像中的应用
准确地在不受约束的图像中自动筛查未成年人需要模型对分布偏移具有鲁棒性,并且能够应对公共数据集中儿童样本不足的问题。为了解决这些问题,我们提出了一种多任务架构,基于冻结的FaRL视觉-语言骨干,结合一个紧凑的两层MLP,该MLP在一种年龄回归头和四个二元未成年头(12岁、15岁、18岁和21岁)之间共享特征。此设计专注于法律上关键的年龄范围,同时保持骨干冻结。通过$α$加权焦点损失和年龄平衡的小批量采样来缓解类别不平衡,年龄间隔还消除了接近阈值的模糊样本。评估在我们新的整体未成年人基准(303,000张清洗后的训练图像,110,000张测试图像)上进行,定义了“ASORES-39k”受限的整体测试,该测试去除了最嘈杂的领域,以及20,000张图像的年龄估计野外偏移测试“ASWIFT-20k”,该测试强调极端姿势(>45°)、表情和低图像质量,以模拟现实世界的偏移。在清洗后的整体数据集上进行训练,并通过重采样和年龄间隔,我们的多年龄段模型“F”将ASORES-39k上的平均绝对误差从年龄基线的4.175岁降低到4.068岁,并将18岁以下的检测F2分数从0.801提高到1%假成人率下的0.857。在ASWIFT-20k下,相同的配置几乎保持了0.99的召回率,F2分数从0.742提高到0.833,展示了对领域偏移的鲁棒性。
Summary / 总结
The research aims to develop a robust model for screening minors in unconstrained images, addressing distribution shift and the under-representation of children in public datasets. It proposes a multi-task architecture that joins a frozen FaRL vision-language backbone with a compact two-layer MLP shared across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). On ASORES-39k, the model reduces mean absolute error from 4.175 to 4.068 years and improves under-18 detection from an F2 score of 0.801 to 0.857 at a 1% false-adult rate. On ASWIFT-20k, it nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, indicating robustness to domain shift.
研究旨在开发一种在非受限图像中检测未成年人的稳健模型,解决分布偏移和数据不足的问题。提出了一种基于冻结FaRL视觉-语言骨干和紧凑的两层MLP的多任务架构,用于年龄回归和未成年分类任务。该模型在ASORES-39k上的平均绝对误差从4.175降低到4.068年,并且在1%假成年率下将18岁以下检测的F2分数从0.801提高到0.857。此外,该模型在ASWIFT-20k上表现出对领域偏移的鲁棒性,召回率达到0.99。
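The $α$-reweighted focal loss mentioned in the abstract above can be sketched for a single binary underage head as follows. This is the generic binary focal loss form, $-α_t (1 - p_t)^γ \log p_t$; the default values of `alpha` and `gamma` are common choices from the focal loss literature and are assumptions here, since the abstract does not state the values the authors used.

```python
import math

def alpha_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class (e.g. "under 18").
    y: ground-truth label, 1 for positive and 0 for negative.
    alpha, gamma: assumed hyperparameters, not taken from the paper.
    """
    # Probability assigned to the true class, and its class weight.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, eps), 1.0)
    # The (1 - p_t)^gamma factor down-weights easy, well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = alpha_focal_loss(0.9, 1)   # confident and correct: small loss
hard = alpha_focal_loss(0.1, 1)   # confident and wrong: large loss
```

With `gamma=0` the expression reduces to an $α$-weighted cross-entropy, which makes the role of the focusing term easy to check in isolation.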
SkinGPT-R1: Adapter-Only Dual Distillation for Efficient Dermatology Reasoning
Authors: Yuhao Shen, Jiahe Qian, Zhangtianyi Chen, Yuanhao He, Juexiao Zhou
First: 2025-11-19T08:55:23+00:00 · Latest: 2025-11-19T08:55:23+00:00
Abstract
We present SkinGPT-R1, a dermatology focused vision language model that makes diagnostic chain of thought reasoning explicit, step by step, and verifiable. To support skin specific reasoning, we build DermCoT, a corpus of standardized dermatologic chain of thought narratives that combines 10,000 DermEval filtered training cases with 3,000 dermatologist scored certified cases, and we define DermEval as a physician aligned six dimensional evaluator and DermBench as the corresponding benchmark for dermatologic chain of thought quality. On DermBench, across 14 general, reasoning, and medical vision language models, SkinGPT-R1 achieves an average score of 4.031 out of 5 over the six clinician defined dimensions, ranks 1st among all systems, and improves the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, SkinGPT-R1 delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision language models. Ablation results further show that DermCoT based chain of thought supervision provides substantial improvements over the base model and that adding dermatology aware visual distillation yields consistent additional gains in both narrative quality and recognition.
中文标题/摘要
标题:SkinGPT-R1:仅适配器双重精炼以实现高效的皮肤科推理
我们提出了SkinGPT-R1,这是一种专注于皮肤科的视觉语言模型,使诊断推理过程明确、逐步和可验证。为了支持皮肤特定的推理,我们构建了DermCoT,这是一个标准化的皮肤科推理叙述语料库,结合了10,000个经过DermEval筛选的训练案例和3,000个皮肤科医生评分的认证案例,并定义了DermEval为与医生对齐的六维评估器和DermBench作为相应的基准,用于评估皮肤科推理质量。在DermBench上,SkinGPT-R1在14个通用、推理和医学视觉语言模型中,在六个临床定义维度上的平均得分为4.031/5,排名第一,并且比Vision-R1的平均得分提高了约41%。在三个皮肤科分类基准上,SkinGPT-R1在准确度上相对于Vision-R1有所提升,并且在强大的视觉语言模型中保持竞争力。进一步的消融实验表明,基于DermCoT的推理监督对基础模型有显著改进,且增加皮肤科意识的视觉精炼在叙述质量和识别方面提供了持续的额外收益。
Summary / 总结
SkinGPT-R1 is a dermatology-focused vision language model that makes diagnostic chain-of-thought reasoning explicit, step by step, and verifiable. It is supported by DermCoT, a corpus combining 10,000 DermEval-filtered training cases with 3,000 dermatologist-scored certified cases, and by DermEval, a physician-aligned six-dimensional evaluator. On DermBench, SkinGPT-R1 scores 4.031 out of 5 on average, ranking first among 14 models and improving the average score over Vision-R1 by about 41%. On three dermatology classification benchmarks, it delivers stable accuracy gains over Vision-R1 and remains competitive among strong vision language models. Ablation studies indicate that both DermCoT-based supervision and dermatology-aware visual distillation improve narrative quality and recognition.
SkinGPT-R1 是一个专注于皮肤病的视觉语言模型,它明确地进行诊断推理。它使用了结合了10,000个过滤训练案例和3,000个认证案例的DermCoT语料库,以及DermEval作为六维评估工具。SkinGPT-R1 在DermBench上的评分为4.031/5,排名第一,并将基线模型的得分提高了41%。在皮肤病分类基准测试中,它优于Vision-R1,并保持了与强大视觉语言模型的竞争力。消融研究进一步证实了DermCoT和皮肤病意识视觉蒸馏在叙述质量和识别方面的益处。
Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling
Authors: Jiale Liu, Haoming Zhou, Yishu Zhu, Bingzhi Chen, Yuncheng Jiang
Venue: AAAI 2026
First: 2025-11-11T00:28:11+00:00 · Latest: 2025-11-19T08:39:44+00:00
Comments: 10 pages, 6 figures, accepted by AAAI 2026
Abstract
Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.
中文标题/摘要
标题:基于粒度感知和区域不确定性建模的跨模态细粒度对齐
细粒度的图像-文本对齐是多模态学习中的一个关键挑战,支撑着视觉问答、图像字幕和视觉语言导航等重要应用。与全局对齐不同,细粒度对齐需要精确对应局部视觉区域和文本标记,常常受到嘈杂注意力机制和跨模态关系简化建模的阻碍。在本文中,我们识别了现有方法的两个基本局限性:缺乏稳健的模态内机制来评估视觉和文本标记的重要性,导致在复杂场景中泛化能力差;以及缺乏细粒度的不确定性建模,无法捕捉区域-词对应关系的一对多和多对一性质。为了解决这些问题,我们提出了一种统一的方法,结合了重要性感知和粒度感知建模以及区域级不确定性建模。我们的方法利用模态特定的偏差来识别显著特征,而不依赖于脆弱的跨模态注意力,并将区域特征表示为高斯分布的混合,以捕捉细粒度的不确定性。在Flickr30K和MS-COCO上的广泛实验表明,我们的方法在各种骨干架构上达到了最先进的性能,显著增强了细粒度图像-文本对齐的鲁棒性和可解释性。
Summary / 总结
This paper addresses the challenge of fine-grained image-text alignment in multimodal learning, focusing on precise correspondence between visual regions and textual tokens. It introduces a unified approach that incorporates significance-aware and granularity-aware modeling, as well as region-level uncertainty modeling, to overcome limitations of existing methods. Experiments on Flickr30K and MS-COCO show that this approach outperforms previous methods, enhancing robustness and interpretability in fine-grained alignment tasks.
该论文针对多模态学习中的细粒度图像-文本对齐挑战,关注局部视觉区域与文本标记之间的精确对应。它提出了一种统一的方法,包括显著性感知和粒度感知建模,以及区域级不确定性建模,以克服现有方法的局限性。实验结果表明,该方法在Flickr30K和MS-COCO上优于先前的方法,并增强了鲁棒性和可解释性。
Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation
Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Weitao Ma, Xiachong Feng
First: 2025-11-08T08:37:26+00:00 · Latest: 2025-11-19T08:00:50+00:00
Comments: AAAI2026 Oral
Abstract
Despite the remarkable advancements of Large Vision-Language Models (LVLMs), the mechanistic interpretability remains underexplored. Existing analyses are insufficiently comprehensive and lack examination covering visual and textual tokens, model components, and the full range of layers. This limitation restricts actionable insights to improve the faithfulness of model output and the development of downstream tasks, such as hallucination mitigation. To address this limitation, we introduce Fine-grained Cross-modal Causal Tracing (FCCT) framework, which systematically quantifies the causal effects on visual object perception. FCCT conducts fine-grained analysis covering the full range of visual and textual tokens, three core model components including multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states, across all decoder layers. Our analysis is the first to demonstrate that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs exhibit a three-stage hierarchical progression for the storage and transfer of visual object representations. Building on these insights, we propose Intermediate Representation Injection (IRI), a training-free inference-time technique that reinforces visual object information flow by precisely intervening on cross-modal representations at specific components and layers, thereby enhancing perception and mitigating hallucination. Consistent improvements across five widely used benchmarks and LVLMs demonstrate IRI achieves state-of-the-art performance, while preserving inference speed and other foundational performance.
中文标题/摘要
标题:大型视觉语言模型中物体表示因果追踪:机制可解释性和幻觉缓解
尽管大型视觉-语言模型(LVLMs)取得了显著进展,但其机制可解释性仍被严重忽视。现有分析不够全面,缺乏对视觉和文本标记、模型组件以及所有层级的全面检查。这一限制限制了对模型输出忠实度的改进和下游任务(如幻觉缓解)的发展。为解决这一限制,我们引入了细粒度跨模态因果追踪(FCCT)框架,该框架系统地量化了对视觉物体感知的因果影响。FCCT 对视觉和文本标记的整个范围、三个核心模型组件(多头自注意力(MHSA)、前馈网络(FFNs)和隐藏状态)以及所有解码器层进行了精细分析。我们的分析首次表明,中间层最后一个标记的MHSA在跨模态信息聚合中起着关键作用,而FFNs则表现出三级层次结构,用于视觉物体表示的存储和转移。基于这些见解,我们提出了中间表示注入(IRI),这是一种无需训练的推理时技术,通过在特定组件和层上精确干预跨模态表示来增强视觉物体信息流,从而提高感知并缓解幻觉。在五个广泛使用的基准和LVLMs上的一致改进表明,IRI 达到了最先进的性能,同时保持了推理速度和其他基础性能。
Summary / 总结
This study addresses the lack of mechanistic interpretability in Large Vision-Language Models (LVLMs) by introducing the Fine-grained Cross-modal Causal Tracing (FCCT) framework. FCCT quantifies causal effects on visual object perception across visual and textual tokens, three model components (MHSA, FFNs, and hidden states), and all decoder layers. The analysis shows that MHSAs of the last token in middle layers play a critical role in aggregating cross-modal information, while FFNs follow a three-stage hierarchical progression in storing and transferring visual object representations. Based on these insights, the study proposes Intermediate Representation Injection (IRI), a training-free inference-time technique that enhances perception and mitigates hallucination by intervening on cross-modal representations at specific components and layers. Experiments across five benchmarks show that IRI improves performance while preserving inference speed and other foundational performance.
该研究通过引入细粒度跨模态因果追踪(FCCT)框架,解决了大型视觉-语言模型(LVLM)的机制可解释性不足问题。FCCT系统地分析了各层和组件对视觉对象感知的影响。研究发现,中间层的MHSAs和FFNs在视觉对象表示中起着关键作用。基于这些发现,研究提出了一种名为中间表示注入(IRI)的推理时技术,该技术通过精确干预特定组件和层的跨模态表示来增强感知并减轻幻觉。实验表明,IRI在五个基准测试中表现出色,同时保持了推理速度和其他基础性能。
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Authors: Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique
First: 2025-11-19T07:52:20+00:00 · Latest: 2025-11-19T07:52:20+00:00
Abstract
Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.
中文标题/摘要
标题:基于物理的多模态合成图像基准度量
当前最先进的度量标准如BLEU、CIDEr、VQA分数、SigLIP-2和CLIPScore往往无法捕捉到语义或结构准确性,尤其是在特定领域或上下文依赖的场景中。为了解决这些问题,本文提出了一种结合大型语言模型、推理、知识基于映射和视觉语言模型的物理约束多模态数据评估(PCMDE)度量标准。该架构包括三个主要阶段:(1)通过对象检测和VLMs提取空间和语义信息的多模态特征;(2)基于置信度加权组件融合进行自适应组件级验证;(3)使用大型语言模型进行基于物理的推理,以执行结构和关系约束(例如,对齐、位置、一致性)的强制执行。
Summary / 总结
This paper addresses the limitations of current metrics such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore in capturing semantic or structural accuracy, especially in domain-specific or context-dependent scenarios. It introduces a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric that combines large language models with reasoning, knowledge-based mapping, and vision-language models. The PCMDE architecture consists of three stages: extraction of spatial and semantic features through object detection and VLMs, Confidence-Weighted Component Fusion for adaptive component-level validation, and physics-guided reasoning with large language models to enforce structural and relational constraints such as alignment, position, and consistency.
本文针对BLEU、CIDEr、VQA分数、SigLIP-2和CLIPScore等当前指标在捕捉语义和结构准确性方面的局限性,尤其是在特定领域或上下文依赖场景中的局限性。提出了一种结合大型语言模型、推理、知识映射和视觉语言模型的物理约束多模态数据评估(PCMDE)指标。该方法包括三个阶段:特征提取、置信加权组件融合和基于物理的推理,以确保结构和关系约束。关键发现表明,PCMDE在评估多模态合成图像的准确性方面优于现有指标。