Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
Authors: Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, Seungryong Kim
First: 2025-12-09T18:56:54+00:00 · Latest: 2025-12-09T18:56:54+00:00
Abstract
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
中文标题/摘要
标题:统一扩散变换器用于高保真文本感知图像恢复
文本感知图像恢复(TAIR)旨在从包含退化文本内容的低质量输入中恢复高质量图像。虽然扩散模型为通用图像恢复提供了强大的生成先验知识,但在以文本为中心的任务中,它们往往会由于缺乏显式的语言知识而产生文本幻觉。为了解决这个问题,我们提出了一种统一的文本恢复框架UniT,该框架以迭代方式结合了扩散变换器(DiT)、视觉语言模型(VLM)和文本检测模块(TSM)。在UniT中,VLM从退化图像中提取文本内容,提供显式的文本指导。同时,TSM在扩散特征上进行训练,在每个去噪步骤中生成中间OCR预测,使VLM能够在去噪过程中逐步细化其指导。最后,DiT骨干利用其强大的表征能力,利用这些线索恢复细粒度的文本内容,同时有效抑制文本幻觉。在SA-Text和Real-Text基准测试上的实验表明,UniT能够忠实恢复退化文本,显著减少幻觉,并在TAIR任务中实现最先进的端到端F1分数性能。
Summary / 总结
The paper proposes UniT, a unified text restoration framework that integrates a Diffusion Transformer, a Vision-Language Model, and a Text Spotting Module for high-fidelity text restoration. The VLM extracts textual content from degraded images, while the TSM generates intermediate OCR predictions to refine the VLM's guidance. The DiT backbone then recovers fine-grained textual content while suppressing text hallucinations. Experiments show that UniT effectively reconstructs degraded text and reduces hallucinations, achieving state-of-the-art performance in the TAIR task.
论文提出了一种名为UniT的统一文本恢复框架,结合了扩散变换器、视觉语言模型和文本检测模块。该框架通过从退化图像中提取文本指导信息和生成中间OCR预测来逐步细化文本内容,有助于抑制文本幻觉。实验表明,UniT在忠实文本重建和减少幻觉方面优于现有方法,并实现了最先进的F1分数性能。
Self-Evolving 3D Scene Generation from a Single Image
Authors: Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He, Xin Eric Wang
First: 2025-12-09T18:44:21+00:00 · Latest: 2025-12-09T18:44:21+00:00
Abstract
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages--Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation--EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
中文标题/摘要
标题:从单张图像自演化生成3D场景
从单张图像生成高质量、纹理化的3D场景仍然是视觉和图形学中的一个基本挑战。最近的图像到3D生成器可以从单个视角恢复合理的几何结构,但它们以对象为中心的训练限制了其在复杂、大规模场景中的泛化能力,这些场景需要忠实的结构和纹理。我们提出了一种自演化、无需训练的框架EvoScene,该框架逐步从单张图像中重建完整的3D场景。关键思想是结合现有模型的互补优势:3D生成模型的几何推理能力和视频生成模型的视觉知识。通过三个迭代阶段——空间先验初始化、视觉引导的3D场景网格生成和空间引导的新视角生成——EvoScene 在2D和3D领域之间交替,逐步提高结构和外观。在多种场景上的实验表明,与强大的基线相比,EvoScene 实现了更好的几何稳定性、视图一致的纹理以及未见区域的补全,生成了可以直接用于实际应用的3D网格。
Summary / 总结
The research aims to generate high-quality, textured 3D scenes from a single image, addressing the limitations of existing object-centric models in handling complex scenes. EvoScene, a self-evolving framework, progressively reconstructs complete 3D scenes through three stages: Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation. Experiments show that EvoScene outperforms strong baselines in terms of geometric stability, view-consistent textures, and unseen-region completion, making it suitable for practical applications.
研究旨在从单张图像生成高质量的3D场景,解决现有以对象为中心的模型在处理复杂场景时的局限性。EvoScene是一种自我进化的框架,结合了3D生成模型的几何推理和视频生成模型的视觉知识,逐步重建完整的3D场景。该框架包括三个阶段:空间先验初始化、视觉引导的3D场景网格生成和空间引导的新视角生成,逐次提高结构和外观。实验表明,EvoScene在几何稳定性、视图一致的纹理和未见区域的完成方面优于强基线,生成可用于实际应用的3D网格。
TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Authors: Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi
First: 2025-12-05T18:40:18+00:00 · Latest: 2025-12-09T18:19:52+00:00
Abstract
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
中文标题/摘要
标题:TRACE:分析和增强视觉语言模型逐步推理的框架
可靠地进行数学和科学推理仍然是大型视觉语言模型面临的开放挑战。标准的最终答案评估往往掩盖了推理错误,允许无声失败持续存在。为了解决这一问题,我们引入了TRACE,一种透明推理和一致性评估框架,它诊断推理轨迹而非仅仅评估最终结果。TRACE的核心在于利用辅助推理集,这是一种分解复杂问题的紧凑子问题答案对,通过基于一致性的度量评估中间步骤,并揭示标准评估中忽略的失败。我们的实验表明,辅助推理集(ARS)的一致性与最终答案的正确性相关,并有助于定位失败出现的推理步骤,提供模型改进的可操作信号。此外,TRACE定义了置信区域,区分可靠和不可靠的推理路径,支持有效的过滤、调试和模型优化。
Summary / 总结
The research aims to improve the reliability of mathematical and scientific reasoning in large vision-language models by addressing the limitations of standard final-answer evaluation. TRACE, a framework for Transparent Reasoning And Consistency Evaluation, diagnoses reasoning trajectories using Auxiliary Reasoning Sets, which decompose complex problems and evaluate intermediate steps. Experiments show that consistency across these sets correlates with final-answer correctness and helps identify specific reasoning steps where failures occur, providing actionable signals for model improvement. Additionally, TRACE defines confidence regions to distinguish reliable from unreliable reasoning paths, aiding in effective filtering, debugging, and model refinement.
研究旨在通过解决标准最终答案评估的局限性,提高大型视觉-语言模型在数学和科学推理方面的可靠性。TRACE,一种透明推理和一致性评估框架,使用辅助推理集将问题分解为子问题,并评估中间步骤。实验表明,这些集中的一致性与正确的最终答案相关,并有助于识别推理步骤中的失败点,提供模型改进的行动信号。TRACE 还定义了置信区域,以区分可靠和不可靠的推理路径,支持有效的过滤、调试和模型优化。
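Editor's sketch: the idea of scoring consistency over sub-question answers and then bucketing the score into confidence regions can be illustrated as follows. This is a minimal sketch under stated assumptions, not TRACE's actual metric: the input format, the majority-agreement measure, the function names, and the 0.8 / 0.5 thresholds are all hypothetical.

```python
from collections import Counter


def ars_consistency(sampled_answers):
    """Mean majority-agreement rate over sub-questions.

    sampled_answers maps each sub-question id to the list of answers a
    model gave across repeated runs (hypothetical input format).
    """
    rates = []
    for answers in sampled_answers.values():
        # Share of runs that agree with the most common answer.
        top_count = Counter(answers).most_common(1)[0][1]
        rates.append(top_count / len(answers))
    return sum(rates) / len(rates)


def confidence_region(score, reliable=0.8, unreliable=0.5):
    """Bucket a consistency score; both thresholds are illustrative."""
    if score >= reliable:
        return "reliable"
    if score < unreliable:
        return "unreliable"
    return "uncertain"
```

Sub-questions with a low agreement rate are the natural candidates for the reasoning steps where a failure arises.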
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Authors: Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng
First: 2025-12-09T18:15:43+00:00 · Latest: 2025-12-09T18:15:43+00:00
Abstract
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
中文标题/摘要
标题:SATGround:一种针对遥感领域视觉定位的空间感知方法
视觉语言模型(VLMs)正在成为遥感领域强大的通用工具,能够跨多种任务整合信息,并通过聊天界面实现灵活的指令式交互。在本文中,我们通过提出一种新颖的结构化定位机制,增强了基于VLM的卫星图像视觉定位。我们的方法包括在多样化的指令遵循任务上微调预训练的VLM,并通过专门的控制标记接口连接一个专用的定位模块。该方法促进了语言和空间信息的联合推理,显著增强了模型在复杂卫星场景中精确定位物体的能力。我们在几个遥感基准测试上评估了我们的框架,始终优于现有方法,包括在视觉定位上相对于先前方法的24.8%的相对改进。我们的结果突显了将结构化空间推理集成到VLM中的好处,为更可靠的现实世界卫星数据分析铺平了道路。
Summary / 总结
This paper introduces SATGround, a method that enhances visual grounding in satellite imagery using a spatially-aware approach. The approach involves fine-tuning a pretrained vision-language model on various instruction-following tasks and integrating a specialized grounding module through control tokens. This method improves the model's ability to precisely locate objects in complex satellite scenes, achieving a 24.8% relative improvement over previous methods on visual grounding benchmarks.
研究旨在通过视觉语言模型(VLMs)提高卫星图像中的视觉定位能力。通过在各种指令跟随任务上微调预训练的VLM,并集成一个专门的定位模块,该方法增强了模型在复杂场景中精确定位物体的能力,相比之前的方法在视觉定位基准测试中取得了24.8%的相对改进。
The Missing Point in Vision Transformers for Universal Image Segmentation
Authors: Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
First: 2025-05-26T10:29:13+00:00 · Latest: 2025-12-09T17:56:45+00:00
Abstract
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P.
中文标题/摘要
标题:视觉变换器在通用图像分割中的缺失点
图像分割仍然是计算机视觉中的一个挑战性任务,需要稳健的掩码生成和精确的分类。基于掩码的方法通过捕捉全局上下文来生成高质量的掩码。然而,在存在模糊边界和类别分布不平衡的情况下,准确地分类这些掩码仍然是一个开放的挑战。在本文中,我们引入了ViT-P,这是一种新颖的两阶段分割框架,将掩码生成与分类分离。第一阶段使用提案生成器生成无类别的掩码提案,而第二阶段则利用基于视觉变换器(ViT)的点分类模型通过关注掩码中心点来细化预测。ViT-P 作为一种无需预训练的适配器,允许将各种预训练的视觉变换器无缝集成到其架构中,确保其对密集预测任务的适应性。此外,我们证明粗略和边界框注释可以有效提高分类性能,而无需在精细注释数据集上进行额外训练,从而降低注释成本并保持强大的性能。在COCO、ADE20K和Cityscapes数据集上的广泛实验验证了ViT-P的有效性,分别在ADE20K全景分割中达到54.0 PQ,在Cityscapes语义分割中达到87.4 mIoU,在ADE20K语义分割中达到63.6 mIoU。代码和预训练模型可在:https://github.com/sajjad-sh33/ViT-P 获取。
Summary / 总结
This paper addresses the challenge of accurate mask classification in image segmentation by proposing ViT-P, a two-stage framework that decouples mask generation from classification. The first stage generates class-agnostic mask proposals, and the second stage uses a point-based classification model based on Vision Transformer to refine predictions by focusing on mask central points. ViT-P achieves state-of-the-art results on ADE20K panoptic segmentation, Cityscapes semantic segmentation, and ADE20K semantic segmentation with 54.0 PQ, 87.4 mIoU, and 63.6 mIoU, respectively, without requiring additional pre-training. Coarse and bounding box annotations enhance classification without the need for fine annotation datasets, reducing annotation costs.
该研究提出了一种两阶段框架ViT-P,以解决图像分割中准确的掩膜分类问题,该框架将掩膜生成与分类分离。第一阶段生成无类别的掩膜提案,第二阶段使用基于Vision Transformer的点分类模型进行预测细化。ViT-P无需预训练,可以集成各种预训练的视觉变压器。实验结果表明,ViT-P在COCO、ADE20K和Cityscapes数据集上达到了最先进的性能,分别在ADE20K全景分割中获得54.0 PQ,在Cityscapes语义分割中获得87.4 mIoU,在ADE20K语义分割中获得63.6 mIoU。
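Editor's sketch: the second stage classifies each class-agnostic proposal from a point at the center of its mask. The abstract does not say how that point is chosen, so the snippet below assumes the simplest option, the centroid of a binary mask stored as nested lists; `mask_center` is a hypothetical helper, not part of the released code.

```python
def mask_center(mask):
    """Return the (row, col) centroid of a binary mask given as nested lists."""
    row_sum = col_sum = count = 0
    for r, line in enumerate(mask):
        for c, value in enumerate(line):
            if value:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        raise ValueError("empty mask has no center")
    return row_sum / count, col_sum / count
```

For non-convex masks the centroid can land outside the mask, so a practical pipeline would likely snap it to the nearest foreground pixel; that step is omitted here.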
Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference
Authors: Amit Bendkhale
Venue: AAAI 2026
First: 2025-12-09T17:52:57+00:00 · Latest: 2025-12-09T17:52:57+00:00
Comments: 6 pages, 3 figures. Code and data: https://github.com/Amiton7/Tri-Bench. Accepted to the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Abstract
Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
中文标题/摘要
标题:Tri-Bench:在相机倾斜和物体干扰下空间推理的VLM可靠性压力测试
可验证的几何推理是值得信赖和可控的代理AI的关键组成部分。尽管具有令人印象深刻的性能,但在现实场景变化下,视觉-语言模型(VLMs)经常失败。我们提出了Tri-Bench,这是一个紧凑的基准测试,专注于平面三角形问题,以隔离相对几何推理,同时强调两个关键部署因素:相机姿态(平面 vs. 倾斜)和通过物体干扰(10种日常物体)的场景上下文。为了测试可验证性和可控性,我们使用一个单一的固定提示来评估四个最近的VLMs,该提示的护栏明确描述了一个周围的正方形边界,从而可以通过齐次变换获得正确答案。我们评估了六个简单的任务,涉及二进制和连续目标,观察到相对于3D真实值的整体准确性较低,平均约为69%(最佳约为75%,最差约为64%)。同样的响应在图像平面的2D投影中更加一致,平均准确性约为72%。所有四个VLMs在识别少数形状类别(等边、等腰、直角三角形)方面表现一致不佳,准确率降至约0%。此外,总体VLM准确性在相机倾斜下下降了约4.1%。这表明模型未能正确利用提示中提供的明确框架参考提示,而是默认使用2D图像平面线索。最后,我们发现物体干扰对VLM准确性没有显著影响。
Summary / 总结
The research evaluates the reliability of Vision-Language Models (VLMs) in geometric reasoning under realistic scene changes, focusing on camera pose and object interference. Tri-Bench, a benchmark of planar triangle problems, is used to test four recent VLMs with a single fixed prompt whose guardrail describes a surrounding square border. The results show modest accuracy, around 69% on average against 3D ground truth, and accuracy near 0% on minority shape classes (equilateral, isosceles, right-angled triangles). Camera tilt lowers overall accuracy by about 4.1%, while object interference has no significant effect.
研究旨在评估Vision-Language模型在现实场景变化下的几何推理可靠性,重点关注相机姿态和物体干扰。使用了Tri-Bench,一个平面三角形问题基准,测试了四个近期的VLM模型,并使用了一个固定提示,其中包含描述方形边界的护栏。结果显示,VLMs的整体准确率平均约为69%,对于少数形状类如等边和等腰三角形的识别准确率几乎降至0%。此外,相机倾斜导致准确率下降约4.1%,表明模型倾向于依赖2D图像线索而非提供的3D框架参考提示。
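Editor's sketch: the abstract notes that the square border makes correct answers reachable via a homography. The snippet below shows one way that could work in principle: fit the image-to-plane homography from the four border corners, map the triangle's vertices back to the plane, and measure side lengths there. It is a plain-Python illustration, not the benchmark's code; the function names and the unit-square convention are assumptions.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(src, dst):
    """Fit the 8 homography parameters (h33 fixed to 1) from 4 point pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def apply_homography(h, point):
    """Map a 2D point through the homography with parameters h."""
    x, y = point
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


def side_lengths(triangle):
    """Return the three side lengths of a triangle given as 3 points."""
    a, b, c = triangle
    return (math.dist(a, b), math.dist(b, c), math.dist(c, a))
```

Side lengths measured on the raw image points differ from those measured after rectification whenever the camera is tilted, which is the gap between the 2D-projection and 3D ground-truth accuracies reported above.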
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
First: 2025-12-09T17:18:32+00:00 · Latest: 2025-12-09T17:18:32+00:00
Comments: 16 pages, 8 figures
Abstract
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
中文标题/摘要
标题:InfiniteVL:结合线性与稀疏注意机制以实现高效且无输入限制的跨模态视觉-语言模型
窗口注意力和线性注意力是缓解视觉-语言模型(VLM)中二次复杂性和不断增长的KV缓存的两种主要策略。然而,我们发现基于窗口的VLM在序列长度超过窗口大小时会性能下降,而线性注意力在诸如OCR和文档理解等信息密集型任务中表现不佳。为克服这些限制,我们提出了InfiniteVL,这是一种结合滑动窗口注意力(SWA)与Gated DeltaNet的线性复杂度VLM架构。为了在资源受限的情况下实现竞争性的跨模态性能,我们设计了三阶段训练策略,包括蒸馏预训练、指令调优和长序列SFT。令人惊讶的是,使用比领先VLM少于2%的训练数据,InfiniteVL不仅大幅优于之前的线性复杂度VLM,还与基于Transformer的领先VLM性能相当,同时展示了有效的长期记忆保留。与通过FlashAttention-2加速的类似规模的Transformer-based VLM相比,InfiniteVL实现了超过3.6倍的推理速度提升,同时保持了恒定的延迟和内存占用。在流式视频理解场景中,它能够保持稳定的每秒24帧实时预填充速度,同时保留长期记忆缓存。代码和模型可在https://github.com/hustvl/InfiniteVL获取。
Summary / 总结
InfiniteVL is a linear-complexity Vision-Language Model that combines sliding window attention with Gated DeltaNet to address the limitations of window-based and linear attention methods. It uses a three-stage training strategy and achieves performance comparable to leading Transformer-based models with less than 2% of the training data. InfiniteVL outperforms previous linear-complexity models and demonstrates a 3.6x inference speedup while maintaining constant latency and memory footprint. It also supports real-time processing in streaming video understanding scenarios with stable 24 FPS prefill speed and long-term memory cache retention.
InfiniteVL 是一种结合滑动窗口注意力和 Gated DeltaNet 的线性复杂度视觉-语言模型,旨在克服窗口基模型和线性注意力方法的局限性。它采用三阶段训练策略,并使用不到 2% 的训练数据达到了与领先 Transformer 基模型相当的性能。InfiniteVL 在推理速度上比类似规模的 Transformer 基模型快 3.6 倍,同时保持了恒定的延迟和内存占用。它还支持在流式视频理解场景中以稳定的 24 FPS 前填充速度进行实时处理,并保留长期记忆缓存。
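Editor's sketch: the sliding-window half of the design can be illustrated with a causal attention mask in which each query sees only the most recent `window` keys, so per-token cost and cache size stay bounded as the sequence grows. This shows generic sliding window attention only; the Gated DeltaNet layers are not modeled, and the helper names and window sizes are assumptions.

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: mask[i][j] is True if query i may attend to key j."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def max_keys_per_query(mask):
    """Largest number of keys any single query attends to."""
    return max(sum(row) for row in mask)
```

Because no query ever attends to more than `window` keys, anything older than the window is invisible to these layers, which is consistent with the degradation the abstract reports for window-only VLMs on long sequences.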
Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning
Authors: Yi Zhang, Chun-Wun Cheng, Junyi He, Ke Yu, Yushun Tang, Carola-Bibiane Schönlieb, Zhihai He, Angelica I. Aviles-Rivero
First: 2025-12-09T17:12:22+00:00 · Latest: 2025-12-09T17:12:22+00:00
Comments: Accepted in IEEE Transactions on Multimedia (TMM)
Abstract
Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.
中文标题/摘要
标题:无需训练的双超曲面适配器以实现更好的跨模态推理
近期在视觉-语言模型(VLMs)方面的研究显著提升了我们在跨模态推理方面的能力。然而,现有方法在领域变化时性能会下降,或者需要大量的计算资源进行新领域的微调。为了解决这一问题,我们开发了一种新的大型视觉-语言模型的适应方法,称为“无需训练的双超曲面适配器”(T-DHA)。我们以超曲面空间而非传统的欧几里得空间来表征视觉-语言之间的语义概念关系,这种关系通常具有层次树结构。超曲面空间的体积随半径呈指数增长,而欧几里得空间则呈多项式增长。我们发现,这种独特性质特别适合使用Poincaré球模型嵌入层次数据结构,从而显著提高了表示能力和区分能力。结合负学习,它可以在更少的特征维度下提供更准确和稳健的分类。我们在各种数据集上的广泛实验结果表明,T-DHA方法在少量样本图像识别和领域泛化任务中显著优于现有最先进的方法。
Summary / 总结
The research aims to improve the performance of Vision-Language Models (VLMs) in cross-modal reasoning, especially in new domains without requiring fine-tuning. It introduces Training-free Dual Hyperbolic Adapters (T-DHA), which embeds hierarchical data structures in hyperbolic space, offering better representation and discrimination power compared to traditional Euclidean space. Experiments show that T-DHA outperforms existing methods in few-shot image recognition and domain generalization tasks.
研究旨在通过解决领域变化导致的性能下降和减少大量计算资源的需求,提高Vision-Language Models (VLMs)的跨模态推理能力。提出了一种名为Training-free Dual Hyperbolic Adapters (T-DHA)的训练-free方法,该方法在双曲空间中嵌入层次数据结构,相比传统的欧几里得空间,能够提供更好的表示能力和区分能力。实验结果显示,T-DHA在少量样本图像识别和领域泛化任务中优于现有最先进的方法。
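Editor's sketch: the Poincaré ball model mentioned in the abstract has a closed-form geodesic distance, shown below for the unit ball with curvature -1. This is the standard textbook formula, not code from T-DHA; how the adapters use the distance is not covered here, and `poincare_distance` is a hypothetical helper name.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

Distances blow up as points approach the boundary: a point at Euclidean radius 0.5 lies ln 3 from the origin, while one at radius 0.9 lies ln 19 away. That stretching is what leaves room for the exponentially many leaves of a tree-like hierarchy.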
SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
Authors: Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng, Xiangyong Cao
First: 2025-12-09T15:42:28+00:00 · Latest: 2025-12-09T15:42:28+00:00
Abstract
Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.
中文标题/摘要
标题:SegEarth-OV3:探索SAM 3在遥感图像中的开放词汇语义分割
大多数现有的训练免费开放词汇语义分割(OVSS)方法基于CLIP。尽管这些方法取得了进展,但在精确定位方面仍面临挑战,或者需要复杂的管道将单独的模块结合起来,特别是在存在大量密集和小型目标的遥感场景中。最近,提出了分割任何事物模型3(SAM 3),在一个可提示框架中统一了分割和识别。在本文中,我们初步探索了在没有任何训练的情况下将SAM 3应用于遥感OVSS任务。首先,我们实现了一种掩码融合策略,将SAM 3的语义分割头和Transformer解码器(实例头)的输出结合起来。这使我们能够利用两个头的优点,以更好地覆盖土地。其次,我们利用存在头的出现分数来过滤掉场景中不存在的类别,减少由于地理空间场景中的大规模词汇和像素级处理引起的假阳性。我们在广泛的遥感数据集上评估了我们的方法。实验表明,这种简单的适应取得了令人鼓舞的性能,展示了SAM 3在遥感OVSS中的潜力。我们的代码发布在https://github.com/earth-insights/SegEarth-OV-3。
Summary / 总结
This paper explores the application of Segment Anything Model 3 (SAM 3) for training-free Open-Vocabulary Semantic Segmentation (OVSS) in remote sensing images. It introduces a mask fusion strategy that combines outputs from SAM 3's semantic segmentation head and Transformer decoder, and uses the presence score to filter out non-existent categories, reducing false positives. Experiments on various remote sensing datasets show promising performance, highlighting SAM 3's potential in this domain.
本文探索了将Segment Anything Model 3 (SAM 3) 应用于遥感图像的训练-free Open-Vocabulary Semantic Segmentation (OVSS)。作者实现了一种掩码融合策略,结合SAM 3的语义分割头和Transformer解码器以提高地表覆盖,并利用存在分数过滤掉不存在的类别,减少假阳性。在多种遥感数据集上的实验显示了良好的性能,突显了SAM 3在该领域的潜力。
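Editor's sketch: the two steps described above, fusing per-category score maps from two heads and dropping categories with a low presence score, can be illustrated on toy data. The fusion rule (element-wise maximum), the 0.5 threshold, and the dictionary layout are assumptions made for illustration; this does not call the SAM 3 API and may differ from the paper's strategy.

```python
def fuse_and_filter(semantic, instance, presence, threshold=0.5):
    """Label each pixel with the best surviving category.

    semantic / instance: {category: 2D list of scores} from the two heads.
    presence: {category: score}; categories below threshold are dropped.
    """
    kept = [cat for cat in semantic if presence.get(cat, 0.0) >= threshold]
    if not kept:
        raise ValueError("no category passed the presence filter")
    height = len(semantic[kept[0]])
    width = len(semantic[kept[0]][0])
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            # Assumed fusion: take the stronger of the two heads per pixel.
            best = max(kept, key=lambda cat: max(semantic[cat][r][c],
                                                 instance[cat][r][c]))
            row.append(best)
        labels.append(row)
    return labels
```

In this layout a category with a confident but spurious pixel score is still suppressed if its scene-level presence score is low, which is the false-positive reduction the abstract describes.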
Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang
First: 2025-10-23T17:59:21+00:00 · Latest: 2025-12-09T15:09:56+00:00
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
中文标题/摘要
标题:小草图,大裁决:基于推测的信息密集型视觉推理
大型视觉-语言模型(VLMs)在多模态理解方面取得了显著进展,但在处理密集交织了文本注释和细粒度图形元素的信息密集型图像时,它们面临挑战。主要挑战在于在密集布局中精确定位关键线索以及进行多跳推理以整合分散的证据。我们提出了一种名为推测裁决(SV)的无训练框架,该框架受到推测解码的启发,结合了多个轻量级草图专家和一个大型裁决模型。在草图阶段,小型VLMs作为草图专家生成提供多样化定位候选的推理路径;在裁决阶段,强大的VLM综合这些路径生成最终答案,同时降低计算成本并恢复正确答案。为了进一步提高效率和准确性,SV引入了一种共识专家选择机制,仅将高一致性的推理路径转发到裁决阶段。实验证明,SV在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等具有挑战性的信息密集型和高分辨率视觉问答基准测试中取得了持续的改进。通过综合多个部分准确推理路径中的正确见解,SV在错误纠正和成本效率方面优于大型专有模型或训练管道。代码可在https://github.com/Tinaliu0123/speculative-verdict/ 获取。
Summary / 总结
The research addresses visual reasoning over information-intensive images, where large VLMs struggle to localize cues in dense layouts and to perform multi-hop reasoning. Speculative Verdict (SV) is a training-free framework in which lightweight draft experts generate reasoning paths with diverse localization candidates and a strong VLM synthesizes them into the final answer, reducing computational cost while maintaining accuracy. SV shows consistent improvements on benchmarks such as InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K, demonstrating both error correction and cost-efficiency.
本文针对信息密集型图像中的视觉推理难题,其中大型VLM因密集布局和多跳推理而难以应对。提出了一种名为Speculative Verdict (SV)的无训练框架,该框架使用多个轻量级的草稿专家生成多样化的推理路径,然后由强大的VLM在判决阶段综合这些路径,以降低成本同时保持准确性。此外,SV还引入了一种共识专家选择机制,仅转发高一致性的推理路径。实验结果显示,SV在InfographicVQA、ChartMuseum、ChartQAPro和HR-Bench 4K等基准测试中表现出一致的改进,证明了其在错误修正和成本效率方面的优势,优于大型专有模型或训练管道。
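Editor's sketch: a consensus-based selection step of the kind described above can be approximated by ranking each draft by how many other drafts reach the same answer and forwarding only the top few to the verdict model. The tuple layout, the exact-match agreement score, and the `keep` parameter are assumptions; SV's actual selection criterion may differ.

```python
def select_consensus_paths(drafts, keep=2):
    """Keep the drafts whose answers agree most with the other drafts.

    drafts: list of (expert_name, answer, reasoning) tuples (hypothetical format).
    """
    def agreement(index):
        answer = drafts[index][1]
        others = [d[1] for i, d in enumerate(drafts) if i != index]
        if not others:
            return 0.0
        return sum(a == answer for a in others) / len(others)

    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(drafts)), key=agreement, reverse=True)
    return [drafts[i] for i in ranked[:keep]]
```

Forwarding fewer paths shortens the verdict model's input, which is where the cost saving would come from in this sketch.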
Trajectory Densification and Depth from Perspective-based Blur
Authors: Tianchen Qiu, Qirun Zhang, Jiajian He, Zhengyue Zhuge, Jiahui Xu, Yueting Chen
First: 2025-12-09T14:11:43+00:00 · Latest: 2025-12-09T14:11:43+00:00
Abstract
In the absence of a mechanical stabilizer, the camera undergoes inevitable rotational dynamics during capture, which induces perspective-based blur, especially under long-exposure scenarios. From an optical standpoint, perspective-based blur is depth-position-dependent: objects residing at distinct spatial locations incur different blur levels even under the same imaging settings. Inspired by this, we propose a novel method that estimates metric depth by examining the blur pattern of a video stream and dense trajectory via a joint optical design algorithm. Specifically, we employ an off-the-shelf vision encoder and point tracker to extract video information. Then, we estimate the depth map via windowed embedding and multi-window aggregation, and densify the sparse trajectory from the optical algorithm using a vision-language model. Evaluations on multiple depth datasets demonstrate that our method attains strong performance over a large depth range, while maintaining favorable generalization. Relative to the real trajectory in handheld shooting settings, our optical algorithm achieves superior precision and the dense reconstruction maintains strong accuracy.
中文标题/摘要
标题:基于视角模糊的轨迹稠化与深度估计
在缺乏机械稳定器的情况下,相机在拍摄过程中不可避免地会经历旋转动态,这会在长时间曝光场景中引起视角模糊。从光学角度来看,视角模糊是深度位置相关的:即使在相同的成像设置下,处于不同空间位置的物体也会产生不同的模糊程度。受此启发,我们提出了一种新颖的方法,通过联合光学设计算法检查视频流和稠密轨迹的模糊模式来估计米制深度。具体来说,我们使用现成的视觉编码器和点跟踪器来提取视频信息。然后,我们通过窗口嵌入和多窗口聚合估计深度图,并使用视觉语言模型稠密化光学算法得到的稀疏轨迹。在多个深度数据集上的评估表明,我们的方法在大深度范围内表现出强大的性能,同时保持了良好的泛化能力。与手持拍摄设置中的真实轨迹相比,我们的光学算法在精度上表现出优越性,而稠密重建保持了很强的准确性。
Summary / 总结
The research aims to estimate metric depth and densify trajectories in videos captured without mechanical stabilizers, leveraging perspective-based blur. The method uses an off-the-shelf vision encoder and point tracker to extract video information, then estimates the depth map through windowed embedding and multi-window aggregation. The sparse trajectory from the optical algorithm is then densified using a vision-language model. Experiments show strong performance over a large depth range with good generalization, and, relative to the real trajectories in handheld shooting settings, superior precision for the optical algorithm and strong accuracy for the dense reconstruction.
论文针对无机械稳定器情况下长曝光场景中的视角模糊问题,提出了一种通过分析视频流中的模糊模式和密集轨迹来估计米尺度深度的方法,采用联合光学设计算法。该方法利用视觉编码器和点跟踪器提取视频信息,并通过窗口嵌入和多窗口聚合估计深度图。视觉语言模型进一步细化了密集轨迹。在多个深度数据集上的实验表明,所提出的方法在大深度范围内表现良好,具有较强的泛化能力。光学算法在手持拍摄设置的真实轨迹精度上表现更优,且密集重建保持了高准确性。
OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics
Authors: Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park
First: 2025-12-09T14:10:23+00:00 · Latest: 2025-12-09T14:10:23+00:00
Comments: 8 pages, 4 figures
Abstract
Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.
中文标题/摘要
标题:OpenMonoGS-SLAM:具有开放集语义的单目高斯点云SLAM
同时定位与建图(SLAM)是机器人技术、AR/VR和自主系统中的基础组件。近年来,随着对空间AI的关注增加,将SLAM与语义理解相结合变得越来越重要,以实现智能感知和交互。最近的研究已经探索了这种整合,但它们通常依赖于深度传感器或封闭集语义模型,限制了其在开放环境中的可扩展性和适应性。在本文中,我们提出了OpenMonoGS-SLAM,这是第一个将3D高斯点云(3DGS)与开放集语义理解统一的单目SLAM框架。为了实现这一目标,我们利用了视觉基础模型(VFMs)的最新进展,包括MASt3R用于视觉几何和SAM及CLIP用于开放词汇语义。这些模型在多种任务中提供了稳健的泛化能力,使我们能够实现准确的单目相机跟踪和建图,以及对开放环境中的语义有丰富的理解。我们的方法不依赖任何深度输入或3D语义真值,仅依赖于自我监督学习目标。此外,我们提出了一种专门设计的记忆机制,用于管理高维语义特征,有效地构建了高斯语义特征图,从而实现了强大的整体性能。实验结果表明,我们的方法在封闭集和开放集分割任务中的性能与现有基线相当或更优,且不依赖于额外的传感器,如深度图或语义注释。
Summary / 总结
OpenMonoGS-SLAM is a monocular SLAM framework that integrates 3D Gaussian Splatting with open-set semantic understanding, leveraging Visual Foundation Models for robust visual geometry and open-vocabulary semantics. It achieves accurate monocular camera tracking and mapping without depth input or 3D semantic ground truth, and demonstrates performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks.
OpenMonoGS-SLAM 是一种结合了 3D 高斯泼溅(3D Gaussian Splatting)和开放集语义理解的单目 SLAM 框架,利用了视觉基础模型如 MASt3R、SAM 和 CLIP。该方法能够实现准确的单目相机跟踪和建图,以及在开放世界环境中的鲁棒语义理解,无需使用深度传感器或 3D 语义标注数据。实验结果表明,该方法在闭集和开集分割任务中达到了与现有基线相当或更优的性能。
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Authors: Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong
First: 2025-10-28T13:22:39+00:00 · Latest: 2025-12-09T13:57:50+00:00
Comments: work in progress
Abstract
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents. Our code and data are available at https://github.com/OS-Copilot/OS-Sentinel.
中文标题/摘要
标题:OS-Sentinel:通过混合验证在现实工作流中提升移动GUI代理的安全性
由视觉-语言模型(VLMs)驱动的计算机使用代理在操作如移动平台等数字环境方面展示了类似人类的能力。尽管这些代理在推进数字自动化方面具有巨大潜力,但它们进行不安全操作的可能性,如系统破坏和隐私泄露,引起了广泛关注。在移动环境复杂且庞大的操作空间中检测这些安全问题是一个艰巨的挑战,目前仍严重未被探索。为建立移动代理安全性研究的基础,我们引入了MobileRisk-Live,一个动态沙盒环境,附带一个包含现实轨迹和细粒度注释的安全检测基准。在此基础上,我们提出了OS-Sentinel,一种新颖的混合安全检测框架,该框架结合了形式验证器以检测明确的系统级违规行为,以及基于VLM的上下文评估器以评估上下文风险和代理行为。实验结果显示,OS-Sentinel在多个指标上比现有方法提高了10%-30%。进一步的分析提供了关键见解,促进了更安全、更可靠的自主移动代理的发展。我们的代码和数据可在https://github.com/OS-Copilot/OS-Sentinel获取。
Summary / 总结
The research aims to enhance the safety of mobile GUI agents powered by Vision-Language Models (VLMs) by addressing the risk of unsafe operations such as system compromise and privacy leakage. The study introduces OS-Sentinel, a hybrid safety detection framework that combines a Formal Verifier for explicit system-level violations and a VLM-based Contextual Judge for contextual risk assessment. Experiments demonstrate that OS-Sentinel outperforms existing methods by 10%-30% across various metrics, providing valuable insights for developing safer mobile agents.
研究旨在通过解决不安全操作的风险,增强由视觉-语言模型(VLMs)驱动的移动GUI代理的安全性。提出了OS-Sentinel,这是一种结合形式验证器和基于VLM的上下文评估器的混合安全检测框架。实验表明,OS-Sentinel在多个指标上比现有方法提高了10%-30%,有助于开发更安全和可靠的移动代理。
Mind to Hand: Purposeful Robotic Control via Embodied Reasoning
Authors: Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, Jianan Wang
First: 2025-12-09T13:19:37+00:00 · Latest: 2025-12-09T13:19:37+00:00
Comments: 49 pages, 25 figures
Abstract
Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
中文标题/摘要
标题:心手相连:通过具身推理实现目的性机器人控制
人类在行动中具有上下文和意图,并且推理起着核心作用。尽管互联网规模的数据使AI系统具备了广泛的推理能力,但在物理行动中实现这些能力仍然是一个重大挑战。我们引入了Lumo-1,这是一种统一了机器人推理(“心”)与机器人行动(“手”)的通用视觉-语言-行动(VLA)模型。我们的方法基于预训练的多模态推理能力,逐步扩展到具身推理和动作预测,并最终实现结构化推理和推理-行动对齐。这导致了一个三阶段的预训练管道:(1)在精选的视觉-语言数据上继续预训练VLM,以增强具身推理技能,如规划、空间理解和轨迹预测;(2)与视觉-语言数据一起进行跨具身机器人数据的协同训练;(3)在收集于具有类人灵巧性和敏捷性的双臂移动操作器Astribot S1上的轨迹上进行动作训练,结合推理过程。最后,我们整合强化学习以进一步提高推理-行动一致性,并在语义推理和运动控制之间形成闭环。广泛的实验表明,Lumo-1在具身视觉-语言推理方面取得了显著的性能提升,这是通用机器人控制的关键组成部分。实际评估进一步表明,Lumo-1在一系列具有挑战性的机器人任务中超越了强大的基线,具有强大的泛化能力,特别是在长时任务和需要推理策略、概念和空间的人类自然指令方面表现出色。
Summary / 总结
The research aims to enable robots to act with context and intention, similar to humans, by integrating reasoning and action. Lumo-1, a vision-language-action model, is introduced to unify robot reasoning with action. The model undergoes a three-stage pre-training process: enhancing embodied reasoning skills, co-training with robot data, and action training with reasoning on a bimanual mobile manipulator. Extensive experiments show significant improvements in embodied vision-language reasoning and strong generalization to new tasks and environments, surpassing strong baselines in challenging robotic tasks.
研究旨在通过整合推理和行动,使机器人能够像人类一样以情境和意图行动。引入了Lumo-1,这是一种视觉-语言-行动模型,将机器人推理与行动统一起来。该模型经过三个阶段的预训练:增强具身推理技能、与机器人数据协同训练以及在双臂移动操作器上进行带有推理的行动训练。大量实验显示,在具身视觉-语言推理方面取得了显著改进,并且在新物体和新环境中的泛化能力强,特别是在长时任务和需要策略、概念和空间推理的人类自然指令中表现出色,超越了强大的基线模型。
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Authors: Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen
First: 2025-12-08T07:05:18+00:00 · Latest: 2025-12-09T12:48:18+00:00
Abstract
To address the trade-off between robustness and performance in robust VLMs, we observe that function words could make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate the impact of function words. Similar to differential amplifiers, our FDA calculates the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more aligned and robust VLMs. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, our FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We demonstrate the scalability, generalization, and zero-shot performance of FDA experimentally, and provide in-depth ablation studies and analysis. Code will be made publicly available at https://github.com/michaeltian108/FDA.
中文标题/摘要
标题:减少对功能词的关注以提高视觉-语言模型的鲁棒性
为了解决鲁棒性和性能之间的权衡问题,我们观察到功能词可能会使视觉-语言模型在跨模态对抗攻击中变得脆弱,并提出功能词去注意力(FDA)来减轻功能词的影响。类似于差分放大器,我们的FDA在注意力头内计算原始的和功能词的交叉注意力,并从前者中差分地减去后者,以获得更对齐和鲁棒的视觉-语言模型。全面的实验涵盖2个SOTA基线、6种不同攻击、2个下游任务、3个数据集和3个模型。总体而言,我们的FDA在3个测试模型上的检索任务中平均导致18/13/53%的ASR下降,同时仅导致0.2/0.3/0.6%的性能下降;在视觉定位任务中,ASR下降90%,但性能提高了0.3%。我们通过实验展示了FDA的可扩展性、泛化能力和零样本性能,并进行了深入的消融研究和分析。代码将在https://github.com/michaeltian108/FDA公开。
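The differential subtraction described in the abstract can be illustrated with a minimal single-query sketch. Everything below is an assumption made for illustration: the helper names (`attend`, `function_word_de_attention`), the scaling factor `lam`, and the choice to renormalize the softmax over the function-word tokens only are hypothetical and are not taken from the FDA code.

```python
from math import exp

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, mask=None):
    """Single-query dot-product attention; tokens with mask[i] == False are dropped."""
    idx = [i for i in range(len(keys)) if mask is None or mask[i]]
    dim = len(values[0])
    if not idx:
        return [0.0] * dim  # nothing to attend to
    scores = [sum(q * k for q, k in zip(query, keys[i])) for i in idx]
    weights = softmax(scores)
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]

def function_word_de_attention(query, keys, values, is_function_word, lam=0.5):
    """Subtract lam times the function-word attention output from the full output."""
    full = attend(query, keys, values)
    func = attend(query, keys, values, mask=is_function_word)
    return [f - lam * g for f, g in zip(full, func)]

# Toy example: token 0 is a content word, token 1 is a function word.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
full = attend(query, keys, values)
robust = function_word_de_attention(query, keys, values, [False, True])
```

In this toy case only the output dimension carried by the function-word token is reduced, while the content-word dimension is left unchanged.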
MVP: Multiple View Prediction Improves GUI Grounding
Authors: Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu
First: 2025-12-09T12:19:00+00:00 · Latest: 2025-12-09T12:19:00+00:00
Abstract
GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for high-resolution samples with small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1) Attention-Guided View Proposal, which derives diverse views guided by instruction-to-image attention scores, and (2) Multi-Coordinates Clustering, which ensembles predictions by selecting the centroid of the densest spatial cluster. Extensive experiments demonstrate MVP's effectiveness across various models and benchmarks. Notably, on ScreenSpot-Pro, MVP boosts UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%. The code is available at https://github.com/ZJUSCL/MVP.
中文标题/摘要
标题:MVP:多视图预测提高GUI定位
GUI定位,即将自然语言指令转化为精确的像素坐标,对于开发实用的GUI代理至关重要。然而,我们发现现有的定位模型在坐标预测方面存在显著的不稳定性,轻微的视觉扰动(例如裁剪几个像素)可以大幅改变预测结果,导致结果在正确和错误之间翻转。这种不稳定性严重削弱了模型性能,尤其是在高分辨率和小UI元素的样本中。为了解决这一问题,我们提出了多视图预测(MVP),这是一种无需训练的框架,通过多视图推理来提升定位性能。我们的核心见解是,虽然单视图预测可能不稳定,但通过聚合多个精心裁剪视图的预测结果,可以有效区分正确的坐标和异常值。MVP 包含两个组件:(1)注意力引导的视图提案,该组件根据指令到图像的注意力分数生成多样化的视图;(2)多坐标聚类,该组件通过选择最密集空间簇的质心来聚合预测结果。广泛的实验表明,MVP 在各种模型和基准测试中都表现出有效性。值得注意的是,在ScreenSpot-Pro上,MVP 将UI-TARS-1.5-7B 提升至56.1%,GTA1-7B 提升至61.7%,Qwen3VL-8B-Instruct 提升至65.3%,Qwen3VL-32B-Instruct 提升至74.0%。代码可在 https://github.com/ZJUSCL/MVP 获取。
Summary / 总结
The paper addresses the instability of GUI grounding models, where minor visual perturbations can drastically change predicted coordinates. To tackle this, the authors propose Multi-View Prediction (MVP), a training-free framework that uses multi-view inference to improve grounding performance. MVP consists of two components: Attention-Guided View Proposal and Multi-Coordinates Clustering. Experiments show that MVP improves a range of models; on ScreenSpot-Pro it raises UI-TARS-1.5-7B to 56.1%, GTA1-7B to 61.7%, Qwen3VL-8B-Instruct to 65.3%, and Qwen3VL-32B-Instruct to 74.0%.
研究针对GUI定位模型中微小视觉扰动会导致预测显著变化的问题,提出了一个无需训练的多视图预测(MVP)框架,通过多视图推理来提升定位性能。MVP 包含两个组件:注意力引导的视图提案和多坐标聚类。实验表明,MVP 在多种模型和基准上均提升了性能;在 ScreenSpot-Pro 上,UI-TARS-1.5-7B、GTA1-7B、Qwen3VL-8B-Instruct 和 Qwen3VL-32B-Instruct 的准确率分别提升至 56.1%、61.7%、65.3% 和 74.0%。
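The Multi-Coordinates Clustering step (selecting the centroid of the densest spatial cluster) can be sketched roughly as follows. This is a simplified stand-in, not the MVP implementation: the fixed-radius neighbourhood rule, the `radius` parameter, and the function name `densest_cluster_centroid` are assumptions made for illustration.

```python
from math import hypot

def densest_cluster_centroid(points, radius=20.0):
    """Return the centroid of the largest fixed-radius neighbourhood.

    points: list of (x, y) predictions, one per cropped view (hypothetical input).
    radius: assumed pixel distance within which predictions count as agreeing.
    """
    if not points:
        raise ValueError("need at least one predicted coordinate")
    best_members = []
    for cx, cy in points:
        # Collect every prediction that falls within `radius` of this candidate.
        members = [(x, y) for x, y in points if hypot(x - cx, y - cy) <= radius]
        if len(members) > len(best_members):
            best_members = members
    n = len(best_members)
    return (sum(x for x, _ in best_members) / n,
            sum(y for _, y in best_members) / n)

# Four views agree on roughly the same spot; one view is an outlier.
preds = [(100, 200), (102, 198), (101, 203), (640, 35), (98, 201)]
center = densest_cluster_centroid(preds)
```

The outlier at (640, 35) has no neighbours within the radius, so it does not influence the returned coordinate.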
Beyond Real Weights: Hypercomplex Representations for Stable Quantization
Authors: Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman
Venue: WACV
First: 2025-12-09T12:10:57+00:00 · Latest: 2025-12-09T12:10:57+00:00
Comments: Accepted in Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract
Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.
中文标题/摘要
标题:超越真实权重:超复表示法在稳定量化中的应用
多模态语言模型(MLLMs)需要大量的参数容量来对齐高维视觉特征与语言表示,使其在计算上非常沉重且难以高效部署。我们提出了一种渐进重参数化策略,通过逐步用参数化超复乘法(PHM)层替换密集的前馈网络块来压缩这些模型。残差插值调度,结合轻量级重构和知识蒸馏损失,确保在训练过程中PHM模块继承其密集对应模块的功能行为。这种转变在保持强大的多模态对齐的同时,实现了参数和FLOP的大幅减少,同时保持了快速推理而不降低输出质量。我们在多个视觉语言模型(VLMs)上进行了评估。我们的方法在保持与基础模型相当的性能的同时,实现了显著的模型大小和推理延迟减少。渐进的PHM替换因此提供了一种架构兼容的路径,以实现更高效的多模态推理,并补充现有的低比特量化技术。
Summary / 总结
The research aims to address the computational challenges of multimodal language models (MLLMs) by introducing a progressive reparameterization strategy that replaces dense feed-forward network blocks with Parameterized Hypercomplex Multiplication (PHM) layers. This method uses a residual interpolation schedule and lightweight losses to ensure functional equivalence during training, resulting in substantial parameter and FLOP reductions while maintaining strong multimodal alignment. Evaluations on various vision-language models show that the approach preserves performance while significantly reducing model size and inference latency.
研究旨在通过引入一种渐进重参数化策略,将密集的前馈网络块逐步替换为参数化超复数乘法(PHM)层,以减轻多模态语言模型(MLLMs)的计算负担。该方法使用残差插值调度和轻量级损失来确保训练期间的功能一致性,从而实现参数和FLOP的大幅减少,同时保持强大的多模态对齐和性能。在各种视觉语言模型上的评估表明,该方法在保持与基模型相当的性能的同时,实现了显著的模型大小和推理延迟减少。
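A PHM layer builds its weight matrix as a sum of Kronecker products, which is where the parameter savings come from. The sketch below uses plain Python lists; the blending rule in `blended_forward` (a linear mix controlled by `alpha`) is only an assumed reading of the "residual interpolation schedule", and all function names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a_val * b_val for a_val in a_row for b_val in b_row]
            for a_row in a for b_row in b]

def phm_weight(a_mats, s_mats):
    """Assemble W = sum_i kron(A_i, S_i) from n small matrix pairs."""
    total = None
    for a, s in zip(a_mats, s_mats):
        term = kron(a, s)
        if total is None:
            total = term
        else:
            total = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(total, term)]
    return total

def matvec(w, x):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def blended_forward(x, dense_w, phm_w, alpha):
    """Assumed interpolation: alpha=0 uses the dense block, alpha=1 the PHM block."""
    d, p = matvec(dense_w, x), matvec(phm_w, x)
    return [(1 - alpha) * di + alpha * pi for di, pi in zip(d, p)]

def param_counts(d_in, d_out, n):
    """Dense parameters vs. PHM parameters (n matrices A_i of n*n, n matrices S_i)."""
    return d_in * d_out, n ** 3 + (d_in * d_out) // n

# n = 2: two 2x2 matrices A_i and two 1x2 matrices S_i give a 2x4 weight.
a_mats = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
s_mats = [[[1, 2]], [[3, 4]]]
w_phm = phm_weight(a_mats, s_mats)
dense_count, phm_count = param_counts(8, 4, 2)
```

For large layers the `n ** 3` term is negligible, so the PHM layer needs roughly 1/n of the dense parameters.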
Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models
Authors: Minghao Yin, Yukang Cao, Kai Han
First: 2025-11-27T13:03:57+00:00 · Latest: 2025-12-09T11:45:51+00:00
Abstract
We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
中文标题/摘要
标题:悟空的72变:基于流模型的无训练高保真纹理3D变形
我们提出了一种名为WUKONG的新型无训练框架,用于高保真纹理3D变形,该框架以一对源和目标提示(图像或文本)作为输入。与传统方法不同,后者依赖于手动对应匹配和变形轨迹估计(限制了泛化能力并需要昂贵的预处理),WUKONG利用基于流的生成先验来生成具有丰富纹理细节的高保真3D过渡。为了确保形状过渡的平滑性,我们利用基于流的生成过程的内在连续性,将变形问题形式化为最优传输重心问题。我们还引入了一种顺序初始化策略,以防止突然的几何失真并保持身份一致性。为了忠实保留纹理,我们提出了一种基于相似性的语义一致性机制,该机制选择性地保留高频细节并允许对混合动力学进行精确控制。这避免了常见的过度平滑现象,同时保持了语义保真度。广泛的定量和定性评估表明,WUKONG显著优于现有方法,在各种几何和纹理变化中取得了更优的结果。
Summary / 总结
WUKONG is a training-free framework for high-fidelity textured 3D morphing that uses image or text prompts. It leverages flow-based transformers to produce smooth and detailed 3D transitions without manual correspondence matching. Key findings show that WUKONG outperforms existing methods in handling various geometry and texture variations, avoiding common artifacts while maintaining semantic fidelity.
WUKONG 是一个无需训练的框架,用于高保真纹理 3D 变形,通过输入一对源和目标提示实现。它利用基于流的变换器生成平滑且细节丰富的 3D 过渡,避免了手动对应匹配和昂贵的预处理需求。关键发现表明,WUKONG 在处理多样化的几何和纹理变化时表现更优,定量和定性评估结果均优于现有方法。
Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions
Authors: Ada Gorgun, Fawaz Sammani, Nikos Deligiannis, Bernt Schiele, Jonas Fischer
First: 2025-12-09T11:05:08+00:00 · Latest: 2025-12-09T11:05:08+00:00
Comments: Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
Abstract
Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that balance semantic accuracy and content preservation better than strong baselines. Code is available at: https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions
中文标题/摘要
标题:通过提示条件干预研究扩散模型中的时间概念动态
扩散模型通常通过其最终输出进行评估,逐步将随机噪声去噪为有意义的图像。然而,生成过程沿着一条轨迹展开,分析这一动态过程对于理解这些模型在成功/失败模式方面的可控性、可靠性和可预测性至关重要。在本文中,我们提出的问题是:何时噪声会转变为特定的概念(例如年龄),并锁定去噪轨迹?我们提出了PCI(提示条件干预)来研究这一问题。PCI 是一个无需训练且模型无关的框架,用于通过扩散时间分析概念动态。核心思想是概念插入成功率(CIS)的分析,定义为在给定时间步插入的概念在最终图像中被保留和反映的概率,提供了一种表征概念形成时间动态的方法。将PCI应用于多个最先进的文本到图像扩散模型和广泛的概念分类,揭示了扩散模型中概念的多样时间行为,在同一概念类型内,轨迹的某些阶段对特定概念更为有利。这些发现还为文本驱动的图像编辑提供了可操作的见解,强调了在无需访问模型内部结构或训练的情况下,干预何时最有效,并且生成的编辑在语义准确性和内容保留之间取得了比强基线更强的定量效果。代码可在:https://github.com/adagorgun/PCI-Prompt-Controlled-Interventions 获取
Summary / 总结
This study investigates the temporal dynamics of concept formation in diffusion models by proposing PCI (Prompt-Conditioned Intervention), a training-free and model-agnostic framework. PCI analyzes the probability of a concept being preserved and reflected in the final image at different timesteps, termed Concept Insertion Success (CIS). The research reveals diverse temporal behaviors across different diffusion models, indicating certain phases are more favorable for specific concepts. These insights offer actionable guidance for text-driven image editing, enabling stronger and more balanced edits than strong baselines without requiring model internals or training.
该研究通过引入无训练且模型无关的框架PCI(Prompt-Conditioned Intervention)来探究扩散模型中概念形成的时序动态。PCI通过分析不同时间步长的概念插入成功率(CIS),揭示了不同模型在时间上的多样化行为。研究发现,特定阶段的去噪轨迹更有利于特定概念的形成,为文本驱动的图像编辑提供了实用的见解,无需访问模型内部或进行训练,且在语义准确性和内容保留方面优于强基线。
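Since CIS is defined as a probability per insertion timestep, it can be estimated empirically as a success rate over repeated trials. The sketch below assumes each trial has already been judged (for example by some concept detector, which is hypothetical here and not part of the source); the function name and data layout are illustrative only.

```python
from collections import defaultdict

def concept_insertion_success(trials):
    """Estimate CIS per timestep as the fraction of successful insertions.

    trials: iterable of (timestep, concept_present) pairs, where concept_present
    is True if the inserted concept was judged visible in the final image.
    """
    counts = defaultdict(lambda: [0, 0])  # timestep -> [successes, total]
    for timestep, present in trials:
        counts[timestep][0] += int(present)
        counts[timestep][1] += 1
    return {t: hits / total for t, (hits, total) in sorted(counts.items())}

# Made-up outcomes for two insertion timesteps (illustrative data, not results).
trials = [(10, True), (10, True), (10, False), (10, True),
          (40, False), (40, True)]
cis = concept_insertion_success(trials)
```

Plotting such a per-timestep estimate against diffusion time would give the kind of temporal profile the abstract describes.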
SAM-Body4D: Training-Free 4D Human Body Mesh Recovery from Videos
Authors: Mingqi Gao, Yunqi Miao, Jungong Han
First: 2025-12-09T09:37:31+00:00 · Latest: 2025-12-09T09:37:31+00:00
Abstract
Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human-centric understanding in real-world scenarios. While recent image-based HMR methods such as SAM 3D Body achieve strong robustness on in-the-wild images, they rely on per-frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM-Body4D, a training-free framework for temporally consistent and occlusion-robust HMR from videos. We first generate identity-consistent masklets using a promptable video segmentation model, then refine them with an Occlusion-Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full-body mesh trajectories, while a padding-based parallel strategy enables efficient multi-human inference. Experimental results demonstrate that SAM-Body4D achieves improved temporal stability and robustness in challenging in-the-wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam-body4d.
中文标题/摘要
标题:SAM-Body4D:无需训练的视频中4D人体网格恢复
人体网格恢复(HMR)旨在从2D观察中重建3D人体姿态和形状,是现实场景中以人体为中心理解的基础。尽管最近基于图像的HMR方法如SAM 3D Body在野外图像上表现出强大的鲁棒性,但在应用于视频时依赖于逐帧推理,导致时间不一致并在遮挡下性能下降。我们通过利用视频中固有的人体连续性来解决这些问题,而无需额外训练。我们提出了SAM-Body4D,这是一种无需训练的框架,用于从视频中实现时间一致性和遮挡鲁棒性的HMR。我们首先使用可提示的视频分割模型生成身份一致的掩模,然后使用遮挡感知模块细化它们以恢复缺失区域。细化后的掩模引导SAM 3D Body生成一致的全身网格轨迹,而基于填充的并行策略使多人体推理变得高效。实验结果表明,SAM-Body4D在具有挑战性的野外视频中实现了更好的时间稳定性和鲁棒性,无需任何重新训练。我们的代码和演示可在:https://github.com/gaomingqi/sam-body4d 获取。
Summary / 总结
The research addresses the limitations of existing image-based Human Mesh Recovery (HMR) methods, such as temporal inconsistency and performance degradation under occlusions, when applied to videos. SAM-Body4D is a training-free framework that generates identity-consistent masklets using a promptable video segmentation model, which are then refined with an Occlusion-Aware module to recover missing regions. This approach ensures temporally consistent and occlusion-robust HMR, demonstrating improved performance in challenging in-the-wild videos without any retraining.
研究解决了现有基于图像的人体网格恢复(HMR)方法在应用于视频时存在的时间不一致性和在遮挡情况下的性能下降问题。SAM-Body4D 是一个无需训练的框架,使用可提示的视频分割模型生成身份一致的掩膜,然后使用遮挡感知模块进行细化,引导 SAM 3D Body 生成时间上一致的全身网格轨迹。该方法在具有挑战性的野外视频中展示了更好的时间稳定性和鲁棒性,无需重新训练。
Enabling Validation for Robust Few-Shot Recognition
Authors: Hanxin Wang, Tian Liu, Shu Kong
First: 2025-06-05T07:37:15+00:00 · Latest: 2025-12-09T09:22:13+00:00
Comments: Project website: https://hannawang09.github.io/projects/vest/
Abstract
Few-Shot Recognition (FSR) tackles classification tasks by training with minimal task-specific labeled data. Prevailing methods adapt or finetune a pretrained Vision-Language Model (VLM) and augment the scarce training data by retrieving task-relevant but noisy samples from open data sources. The finetuned VLM generalizes reasonably well to the task-specific in-distribution (ID) test data but struggles with out-of-distribution (OOD) test data. This motivates our study of robust FSR with VLM finetuning. The core challenge of FSR is data scarcity, extending beyond limited training data to a complete lack of validation data. We identify a key paradox as a potential solution: repurposing the retrieved open data for validation. As such retrieved data are inherently OOD compared with the task-specific ID training data, finetuned VLMs yield degraded performance on the retrieved data. This causes the validation logic to favor the pretrained model without any finetuning, hindering improvements with respect to generalization. To resolve this dilemma, we introduce a novel validation strategy that harmonizes performance gain and degradation on the few-shot ID data and the retrieved data, respectively. Our validation enables parameter selection for partial finetuning and checkpoint selection, mitigating overfitting and improving test-data generalization. We unify this strategy with robust learning into a cohesive framework: Validation-Enabled Stage-wise Tuning (VEST). Extensive experiments on the established ImageNet OOD benchmarks show that VEST significantly outperforms existing VLM adaptation methods, achieving state-of-the-art FSR performance on both ID and OOD data.
中文标题/摘要
标题:通过验证增强鲁棒的少样本识别
少样本识别(FSR)通过使用少量任务特定标记数据进行训练来解决分类任务。现有方法通过调整或微调预训练的视觉-语言模型(VLM),并从开放数据源中检索相关但噪声较大的样本来扩充稀缺的训练数据。微调后的VLM在任务特定的内部分布(ID)测试数据上表现良好,但在外部分布(OOD)测试数据上表现不佳。这促使我们研究基于VLM微调的鲁棒FSR。FSR的核心挑战是数据稀缺,不仅限于有限的训练数据,还包括完全缺乏验证数据。我们发现一个关键悖论可能是潜在的解决方案:重新利用检索到的开放数据进行验证。由于这些检索到的数据与任务特定的ID训练数据相比是固有的OOD,微调后的VLM在这些数据上的表现较差。这导致验证逻辑倾向于选择未微调的预训练模型,阻碍了泛化能力的提升。为了解决这一困境,我们提出了一种新的验证策略,该策略在少样本ID数据和检索数据上的性能提升和下降之间取得平衡。我们的验证策略能够选择部分微调的参数和检查点,减轻过拟合并提高测试数据的泛化能力。我们将这种策略与鲁棒学习统一到一个综合框架中:验证增强阶段调整(VEST)。在建立的ImageNet OOD基准测试上的广泛实验表明,VEST显著优于现有的VLM适应方法,在ID和OOD数据上均实现了最先进的FSR性能。
Summary / 总结
This paper addresses the challenge of robust few-shot recognition (FSR) by proposing a novel validation strategy within a framework called Validation-Enabled Stage-wise Tuning (VEST). The motivation is that FSR lacks validation data altogether, not just training data. The method repurposes the retrieved open data for validation, even though such data are inherently out-of-distribution (OOD) compared with the in-distribution (ID) training data, and balances the performance gain on few-shot ID data against the degradation on retrieved data. Experiments on ImageNet OOD benchmarks show that VEST outperforms existing vision-language model adaptation methods, achieving state-of-the-art performance on both ID and OOD data by mitigating overfitting and improving generalization.
本文提出了一种名为Validation-Enabled Stage-wise Tuning (VEST)的新颖验证策略,以解决少样本识别(FSR)中的鲁棒性问题。动机来自于由于数据稀缺难以获得验证数据。该方法将检索到的开放数据重新用于验证,这些数据本质上是离分布(OOD)的。这种方法有助于选择参数和检查点,从而减轻过拟合并提高在ID和OOD测试数据上的泛化能力。在ImageNet OOD基准测试上的实验表明,VEST显著优于现有方法,实现了最先进的FSR性能。
CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics
Authors: Dahyeon Kye, Jeahun Sung, Mingyu Jeon, Jihyong Oh
First: 2025-12-08T04:39:12+00:00 · Latest: 2025-12-09T07:35:43+00:00
Comments: Please visit our project page at https://cmlab-korea.github.io/CHIMERA/
Abstract
Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down-, mid-, and up-block features from both inputs during DDIM inversion and re-injects them adaptively during denoising, which enables spatial and semantic alignment in a depth- and time-adaptive manner as well as natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.
中文标题/摘要
标题:CHIMERA:自适应缓存注入与语义锚点提示的零样本图像形态变换及其形态导向度量
扩散模型展示了卓越的生成能力,但在实现平滑且语义一致的图像形态变换方面仍面临挑战。现有方法往往由于缺乏自适应的结构和语义对齐而产生突兀的过渡或过度饱和的外观。我们提出CHIMERA,一种基于扩散的零样本框架,将形态变换形式化为缓存反演引导的去噪过程。为处理大规模的语义和外观差异,我们提出了自适应缓存注入和语义锚点提示。自适应缓存注入(ACI)在DDIM反演过程中缓存输入的低、中、高层特征,并在去噪过程中适配性地重新注入,从而在深度和时间自适应的方式下实现空间和语义对齐,并实现自然特征融合和平滑过渡。语义锚点提示(SAP)利用视觉-语言模型生成共享的锚点提示,作为语义锚点,连接不相似的输入,并引导去噪过程向一致的结果发展。最后,我们引入全局-局部一致性评分(GLCS),这是一种形态导向度量,同时评估两个输入的全局和谐性和局部形态变换的平滑度。广泛的实验和用户研究显示,CHIMERA实现了比现有方法更平滑且更语义对齐的过渡,建立了图像形态变换的新基准。代码和项目页面将公开发布。
Summary / 总结
CHIMERA is a zero-shot diffusion-based framework for image morphing that addresses the challenges of abrupt transitions and over-saturated appearances by introducing Adaptive Cache Injection and Semantic Anchor Prompting. ACI caches features from both inputs during inversion and re-injects them adaptively during denoising, facilitating spatial and semantic alignment. SAP uses a vision-language model to generate a semantic anchor, guiding the denoising process towards coherent results. GLCS, a morphing-oriented metric, evaluates global harmonization and local smoothness. Experiments show CHIMERA outperforms existing methods in achieving smoother and more semantically aligned transitions, setting a new state-of-the-art in image morphing.
CHIMERA 是一种基于扩散模型的零样本框架,通过将图像变形视为缓存反演引导的去噪过程来实现平滑且语义一致的图像变形。它引入了自适应缓存注入(ACI)和语义锚点提示(SAP)来处理大范围的语义和外观差异。ACI 在 DDIM 反演过程中缓存来自两个输入的特征,并在去噪过程中自适应地重新注入,而 SAP 使用视觉语言模型生成共享的锚点提示。实验表明,CHIMERA 在实现更平滑且语义对齐的过渡方面优于现有方法,确立了图像变形的新的最先进水平。GLCS 是一种变形导向的度量标准,同时评估全局和谐性和局部过渡的平滑性。
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Authors: Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang
First: 2025-12-09T06:49:33+00:00 · Latest: 2025-12-09T06:49:33+00:00
Abstract
Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
中文标题/摘要
标题:OpenSubject:利用视频衍生的身份和多样性先验进行主题驱动的图像生成和操作
尽管在主题驱动的图像生成方面取得了令人鼓舞的进展,但当前的模型往往偏离参考身份,并且在包含多个主题的复杂场景中难以应对。为了解决这一挑战,我们引入了OpenSubject,这是一个包含250万样本和435万张图像的视频衍生大规模数据集,用于主题驱动的生成和操作。该数据集通过一个四阶段管道构建,利用跨帧身份先验。 (i) 视频编目。我们应用分辨率和美学过滤以获得高质量的片段。 (ii) 跨帧主题挖掘和配对。我们利用基于视觉-语言模型(VLM)的类别共识、局部定位和多样性意识配对来选择图像对。 (iii) 身份保留参考图像合成。我们引入分割图引导的出画填充以合成用于主题驱动生成的输入图像,并通过边界引导的填充生成用于主题驱动操作的输入图像,同时包括几何感知增强和不规则边界侵蚀。 (iv) 验证和注释。我们利用VLM验证合成样本,基于阶段(iii)重新合成失败样本,然后构建短和长注释。此外,我们引入了一个涵盖主题驱动生成和操作的基准,并使用VLM裁判评估身份保真度、提示一致性、操作一致性以及背景一致性。大量实验表明,使用OpenSubject进行训练可以提高生成和操作性能,特别是在复杂场景中。
Summary / 总结
OpenSubject is a video-derived, large-scale corpus with 2.5 million samples and 4.35 million images for subject-driven image generation and manipulation. It is built with a four-stage pipeline: video curation, cross-frame subject mining and pairing, identity-preserving reference image synthesis, and verification and captioning. The authors also introduce a benchmark in which a VLM judge scores identity fidelity, prompt adherence, manipulation consistency, and background consistency. Training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
OpenSubject 是一个包含 2.5M 个样本和 4.35M 张图像的大规模数据集,旨在提高基于主体的图像生成和操作。它利用四阶段管道:视频筛选、跨帧主体挖掘、身份保留的参考图像合成以及验证。该数据集通过提高身份保真度和操作一致性,特别是在复杂场景中,增强了模型性能。基准测试评估了数据集的有效性,显示了在生成和操作任务中的显著改进。
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Authors: Md Selim Sarowar, Sungho Kim
First: 2025-12-08T06:54:16+00:00 · Latest: 2025-12-09T06:40:52+00:00
Abstract
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.
中文标题/摘要
标题:VFM-VLM:基于视觉基础模型和视觉语言模型的3D姿态估计中的视觉比较
视觉基础模型(VFMs)和视觉语言模型(VLMs)通过提供丰富的语义和几何表示,已经革新了计算机视觉领域。本文介绍了CLIP基和DINOv2基方法在手部物体抓取场景中3D姿态估计的全面视觉比较。我们在6D物体姿态估计任务上评估了这两种模型,并展示了它们的互补优势:CLIP在通过语言定位实现语义理解方面表现出色,而DINOv2提供了更优的密集几何特征。通过在基准数据集上的大量实验,我们表明基于CLIP的方法在语义一致性方面表现更好,而基于DINOv2的方法在几何精度方面表现出竞争性性能。我们的分析为选择合适的视觉模型用于机器人操作和抓取提供了见解。
Summary / 总结
This paper explores the use of Vision Foundation Models (VFMs) and Vision Language Models (VLMs) for 3D pose estimation in hand-object grasping scenarios. It compares CLIP-based and DINOv2-based approaches on 6D object pose estimation, showing that CLIP excels in semantic understanding while DINOv2 provides better dense geometric features. Experiments on benchmark datasets reveal that CLIP-based methods achieve better semantic consistency, whereas DINOv2-based methods offer competitive performance with enhanced geometric precision. These findings provide guidance for selecting appropriate vision models for robotic manipulation tasks.
该论文探讨了Vision Foundation Models (VFMs)和Vision Language Models (VLMs)在手部抓取物体的3D姿态估计中的应用。研究对比了基于CLIP和DINOv2的方法,显示CLIP在语义理解方面表现出色,而DINOv2在几何特征方面更优。实验表明,基于CLIP的方法在语义一致性方面表现更好,而基于DINOv2的方法在几何精度方面表现出色。这些发现为选择适用于机器人操作和抓取任务的视觉模型提供了参考。
PAVAS: Physics-Aware Video-to-Audio Synthesis
Authors: Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji
First: 2025-12-09T06:28:50+00:00 · Latest: 2025-12-09T06:28:50+00:00
Abstract
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the moving-object mass and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory for velocity computation. These physical cues enable the model to synthesize sounds that reflect underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Visit https://physics-aware-video-to-audio-synthesis.github.io for demo videos.
中文标题/摘要
标题:PAVAS:物理感知的视频到音频合成
近期视频到音频(V2A)生成技术在感知质量和时间同步方面取得了显著进展,但大多数模型仍以外观为导向,捕捉视觉与听觉之间的关联而不考虑塑造现实世界声音的物理因素。我们提出了物理感知的视频到音频合成(PAVAS),这是一种通过物理驱动音频适配器(Phy-Adapter)将物理推理融入到潜在扩散基础V2A生成中的方法。适配器接收物理参数估计器(PPE)估计的对象级物理参数,PPE使用视觉语言模型(VLM)推断移动物体的质量,并使用基于分割的动态3D重建模块恢复其运动轨迹以计算速度。这些物理线索使模型能够合成反映潜在物理因素的声音。为了评估物理现实性,我们创建了VGG-Impact基准,专注于物体-物体交互,并引入了音频-物理相关系数(APCC),这是一种衡量物理和听觉属性之间一致性的评估指标。全面的实验表明,PAVAS生成的音频在物理上合理且感知上一致,优于现有V2A模型的定量和定性评估。访问https://physics-aware-video-to-audio-synthesis.github.io 查看演示视频。
Summary / 总结
PAVAS integrates physical reasoning into latent diffusion-based Video-to-Audio synthesis through a Physics-Driven Audio Adapter (Phy-Adapter), which receives object-level physical parameters from a Physical Parameter Estimator (PPE). The PPE uses a Vision-Language Model to infer the mass of the moving object and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory, from which velocity is computed. The authors also curate the VGG-Impact benchmark and introduce the APCC metric, which measures consistency between physical and auditory attributes. PAVAS outperforms existing V2A models in both quantitative and qualitative evaluations.
PAVAS 是一种将物理推理集成到基于潜扩散模型的视频到音频合成方法中,使用了一个物理驱动的音频适配器(Phy-Adapter)。Phy-Adapter 接收由物理参数估计器(PPE)估计的物理参数,PPE 利用视觉语言模型和基于分割的动态 3D 重建模块。这种方法增强了对物理因素的声学合成。实验结果表明,PAVAS 生成了物理上合理且感知上一致的音频,超越了现有的视频到音频模型在定量和定性评估中的表现。
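The PAVAS abstract mentions recovering a motion trajectory "for velocity computation" but does not spell out how. As a minimal sketch of one plausible reading, the snippet below estimates per-frame speed from 3D positions by finite differences. The function name `speeds_from_trajectory`, the frame rate, and the sample trajectory are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def speeds_from_trajectory(positions, fps):
    """Estimate speed (distance units per second) between consecutive frames.

    positions: array-like of shape (T, 3), one 3D point per frame (hypothetical input).
    fps: frames per second of the source video (assumed to be known).
    """
    positions = np.asarray(positions, dtype=float)
    # Displacement between consecutive frames, shape (T - 1, 3).
    deltas = np.diff(positions, axis=0)
    # Distance travelled per frame, scaled by the frame rate.
    return np.linalg.norm(deltas, axis=1) * fps


# Hypothetical trajectory: an object moving 0.1 units per frame along x.
trajectory = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.2, 0.0, 0.0]]
speeds = speeds_from_trajectory(trajectory, fps=30)
```

A real pipeline would likely smooth the trajectory before differentiating, since finite differences amplify reconstruction noise.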
Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection
Authors: Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang
First: 2025-06-06T08:41:09+00:00 · Latest: 2025-12-09T06:15:26+00:00
Abstract
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.
中文标题/摘要
标题:域-RAG:检索引导的跨域少量样本对象检测合成图像生成
跨域少量样本对象检测(CD-FSOD)旨在仅使用以前未见过的领域中少量标记样本来检测新型对象。尽管数据增强和生成方法在少量样本学习中显示出前景,但它们在CD-FSOD中的有效性仍不清楚,因为需要同时实现视觉真实性和领域对齐。现有策略,如粘贴增强和文本到图像生成,往往无法保留正确的对象类别或产生与目标领域一致的背景,这使得它们直接应用于CD-FSOD具有挑战性。为了解决这些挑战,我们提出了一种无需训练、检索引导的合成图像生成框架Domain-RAG,专门用于CD-FSOD。Domain-RAG包括三个阶段:领域感知背景检索、领域引导背景生成和前景背景合成。具体而言,输入图像首先被分解为前景和背景区域。然后,检索语义和风格上相似的图像以引导生成模型在同时考虑原始和检索上下文的情况下合成新的背景。最后,保留的前景与新生成的领域对齐背景组合以形成生成图像。无需任何额外的监督或训练,Domain-RAG在包括CD-FSOD、遥感少量样本对象检测和伪装少量样本对象检测在内的多种任务中生成高质量、领域一致的样本。大量实验表明,Domain-RAG在强基线之上表现出一致的改进,并建立了新的最佳结果。代码将在接受后发布。
Summary / 总结
Domain-RAG is a training-free, retrieval-guided compositional image generation framework for Cross-Domain Few-Shot Object Detection (CD-FSOD). It decomposes the input image into foreground and background, retrieves semantically and stylistically similar images to guide generation of a new background, and composes the preserved foreground with that domain-aligned background. Experiments show consistent improvements over strong baselines and new state-of-the-art results on CD-FSOD, remote sensing FSOD, and camouflaged FSOD.
Domain-RAG 是一种用于跨域少量样本目标检测(CD-FSOD)的检索引导式组合图像生成框架。它将输入图像分解为前景和背景,检索语义和风格相似的图像以生成背景,并将生成的背景与前景组合以确保域一致性。实验结果表明,Domain-RAG 在 CD-FSOD、遥感少量样本目标检测和伪装少量样本目标检测中均能持续超越强基线,并且在没有额外监督或训练的情况下建立了新的最先进成果。
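The final Domain-RAG stage, foreground-background composition, can be pictured as mask-based blending. The sketch below is an assumption about the simplest form such a step could take (plain alpha compositing with a foreground mask); the abstract does not describe the actual composition procedure, and the function name `compose` and the toy arrays are hypothetical.

```python
import numpy as np


def compose(foreground, background, mask):
    """Blend foreground pixels onto a background of the same size.

    foreground, background: arrays of shape (H, W, 3).
    mask: array of shape (H, W) with values in [0, 1]; 1 marks foreground.
    """
    fg = np.asarray(foreground, dtype=float)
    bg = np.asarray(background, dtype=float)
    # Add a channel axis so the mask broadcasts over the three colour channels.
    alpha = np.asarray(mask, dtype=float)[..., None]
    return alpha * fg + (1.0 - alpha) * bg


# Toy 2x2 example: white foreground, black background, diagonal mask.
fg = np.full((2, 2, 3), 255.0)
bg = np.zeros((2, 2, 3))
mask = np.array([[1, 0], [0, 1]])
composite = compose(fg, bg, mask)
```

Keeping the original foreground pixels untouched is what lets the object category and box annotation carry over to the generated sample.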
CATP: Contextually Adaptive Token Pruning for Efficient and Enhanced Multimodal In-Context Learning
Authors: Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, Ruixiang Tang
First: 2025-08-11T11:41:51+00:00 · Latest: 2025-12-09T06:05:33+00:00
Comments: 14 pages, 12 figures, 6 tables
Abstract
Modern large vision-language models (LVLMs) convert each input image into a large set of tokens that far outnumber the text tokens. Although this improves visual perception, it also introduces severe image token redundancy. Because image tokens contain sparse information, many contribute little to reasoning but greatly increase inference cost. Recent image token pruning methods address this issue by identifying important tokens and removing the rest. These methods improve efficiency with only small performance drops. However, most of them focus on single-image tasks and overlook multimodal in-context learning (ICL), where redundancy is higher and efficiency is more important. Redundant tokens weaken the advantage of multimodal ICL for rapid domain adaptation and lead to unstable performance. When existing pruning methods are applied in this setting, they cause large accuracy drops, which exposes a clear gap and the need for new approaches. To address this, we propose Contextually Adaptive Token Pruning (CATP), a training-free pruning method designed for multimodal ICL. CATP uses two stages of progressive pruning that fully reflect the complex cross-modal interactions in the input sequence. After removing 77.8% of the image tokens, CATP achieves an average performance gain of 0.6% over the vanilla model on four LVLMs and eight benchmarks, clearly outperforming all baselines. At the same time, it improves efficiency by reducing inference latency by an average of 10.78%. CATP strengthens the practical value of multimodal ICL and lays the foundation for future progress in interleaved image-text settings.
中文标题/摘要
标题:CATP:上下文自适应的标记剪枝以实现高效和增强的多模态即席学习
现代大型视觉-语言模型(LVLMs)将每个输入图像转换为一个大型标记集,其数量远超过文本标记。虽然这提高了视觉感知,但也引入了严重的图像标记冗余。因为图像标记包含稀疏信息,许多标记对推理贡献甚微但大大增加了推理成本。最近的图像标记剪枝方法通过识别重要标记并移除其余标记来解决这一问题。这些方法在仅小幅性能下降的情况下提高了效率。然而,大多数方法专注于单图像任务,而忽略了多模态即席学习(ICL),其中冗余更高,效率更为重要。冗余标记削弱了多模态ICL在快速领域适应中的优势,导致性能不稳定。当现有剪枝方法应用于这种场景时,它们会导致较大的准确率下降,这暴露了一个明显的差距和对新方法的需求。为了解决这一问题,我们提出了上下文自适应的标记剪枝(CATP),这是一种为多模态ICL设计的无需训练的剪枝方法。CATP使用两个阶段的渐进剪枝,完全反映了输入序列中的复杂跨模态交互。在移除77.8%的图像标记后,CATP在四个LVLMs和八个基准上相对于vanilla模型平均提高了0.6%的性能,明显优于所有基线。同时,它通过平均减少10.78%的推理延迟提高了效率。CATP增强了多模态ICL的实际价值,并为未来交错图像-文本设置中的进展奠定了基础。
Summary / 总结
The paper addresses the issue of redundant image tokens in large vision-language models (LVLMs) that hinder efficient and enhanced multimodal in-context learning (ICL). It introduces Contextually Adaptive Token Pruning (CATP), a training-free method that prunes image tokens based on complex cross-modal interactions. After pruning 77.8% of image tokens, CATP improves performance by an average of 0.6% across four LVLMs and eight benchmarks, while also reducing inference latency by 10.78%. This method fills a gap in existing pruning techniques and enhances the practical value of multimodal ICL.
论文提出了CATP,一种无需训练的令牌剪枝方法,用于提高多模态在上下文学习的效率和增强效果。CATP旨在减少大型视觉-语言模型(LVLM)中的冗余,并在不显著损失性能的情况下提高效率。实验表明,通过剪枝77.8%的图像令牌,CATP在四个LVLM和八个基准上实现了平均0.6%的性能提升,同时将推理延迟降低了10.78%。该方法显著优于现有基线,增强了多模态在上下文学习的实际价值。
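To make the pruning ratio in the CATP entry concrete, the sketch below keeps only the highest-scoring image tokens while preserving their order in the sequence. It is a generic top-k pruning example, not CATP's two-stage procedure: the importance scores here are placeholders, and the name `prune_tokens` is hypothetical.

```python
import numpy as np


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens: array of shape (N, D); scores: array of shape (N,).
    keep_ratio: fraction of tokens to keep, in (0, 1].
    """
    tokens = np.asarray(tokens)
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    # Indices of the n_keep largest scores, re-sorted into sequence order.
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return tokens[keep], keep


rng = np.random.default_rng(0)
tokens = rng.normal(size=(9, 4))      # 9 hypothetical image tokens of dimension 4
scores = np.arange(9, dtype=float)    # placeholder importance scores
# Removing 77.8% of the tokens leaves roughly 2 of 9.
kept, idx = prune_tokens(tokens, scores, keep_ratio=1 - 0.778)
```

In a method like the one described, the scores would come from cross-modal interactions in the input sequence rather than from a fixed array.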
HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
Authors: Jusheng Zhang, Xiaoyang Guo, Kaitong Cai, Qinhan Lv, Yijia Fan, Wenhao Chai, Jian Wang, Keze Wang
First: 2025-12-09T04:48:38+00:00 · Latest: 2025-12-09T04:48:38+00:00
Abstract
Vision-language models (VLMs) have transformed multimodal reasoning, but feeding hundreds of visual patch tokens into LLMs incurs quadratic computational costs, straining memory and context windows. Traditional approaches face a trade-off: continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures. We introduce HTC-VLM, a hybrid framework that disentangles semantics and appearance through dual channels, i.e., a continuous pathway for fine-grained details via ViT patches and a discrete pathway for symbolic anchors using MGVQ quantization projected to four tokens. These are fused into a 580-token hybrid sequence and compressed into a single voco token via a disentanglement attention mask and bottleneck, ensuring efficient and grounded representations. HTC-VLM achieves an average performance retention of 87.2 percent across seven benchmarks (GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, ScienceQA-Image), outperforming the leading continuous baseline at 81.0 percent with a 580-to-1 compression ratio. Attention analyses show that the compressed token prioritizes the discrete anchor, validating its semantic guidance. Our work demonstrates that a minimalist hybrid design can resolve the efficiency-fidelity dilemma and advance scalable VLMs.
中文标题/摘要
标题:HybridToken-VLM:视觉语言模型的混合令牌压缩
视觉语言模型(VLMs)已经改变了多模态推理,但将数百个视觉块令牌输入LLMs会带来二次计算成本,对内存和上下文窗口造成压力。传统方法面临权衡:连续压缩会稀释高层次语义,如对象身份,而离散量化会丢失细微的细节,如纹理。我们引入了HTC-VLM,这是一种通过双通道分离语义和外观的混合框架,即通过ViT块的连续路径传递细微细节,并通过MGVQ量化投影到四个令牌的符号锚点的离散路径。这些内容被融合成一个580个令牌的混合序列,并通过解耦注意力掩码和瓶颈压缩成一个voco令牌,确保高效且具地基的表示。HTC-VLM 在七个基准测试(GQA、VQAv2、MMBench、MME、POPE、SEED-Bench、ScienceQA-Image)中实现了87.2%的平均性能保留,压缩比为580比1,优于领先的连续基线81.0%。注意力分析表明,压缩令牌优先处理离散锚点,验证了其语义指导。我们的工作表明,简约的混合设计可以解决效率与准确性的矛盾,并推动可扩展的VLMs。
Summary / 总结
The research addresses the computational and memory cost of feeding hundreds of visual patch tokens into large language models. The proposed HTC-VLM combines a continuous pathway for fine-grained details with a discrete pathway for symbolic anchors, fuses them into a 580-token hybrid sequence, and compresses that sequence into a single token. At this 580-to-1 compression ratio it retains 87.2 percent of performance on average across seven benchmarks, compared with 81.0 percent for the leading continuous baseline.
研究旨在解决将视觉补丁令牌输入大型语言模型进行视觉语言任务时的计算和内存挑战。提出的HybridToken-VLM使用了具有双通道的混合框架:一个连续路径用于精细细节,一个离散路径用于符号锚点。该方法在七个基准测试中实现了87.2%的平均性能保留,比领先的连续基线高出6.2个百分点,并且具有580比1的压缩比。
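The figures quoted in the HTC-VLM entry can be checked with a few lines of arithmetic. The split of the 580-token sequence into continuous and discrete parts is an assumption: it holds only if the hybrid sequence contains nothing besides the ViT patch tokens and the four discrete tokens, which the abstract implies but does not state outright.

```python
hybrid_tokens = 580      # length of the hybrid sequence (from the abstract)
discrete_tokens = 4      # MGVQ tokens in the discrete pathway (from the abstract)

# Assumption: the remainder of the hybrid sequence is the continuous ViT pathway.
continuous_tokens = hybrid_tokens - discrete_tokens

# Gap in average performance retention, in percentage points.
retention_gap = round(87.2 - 81.0, 1)

# Pairwise interactions if self-attention ran over the uncompressed sequence;
# this is the quadratic cost that compressing to one token avoids.
pairwise_interactions = hybrid_tokens ** 2
```

Under that assumption the continuous pathway contributes 576 tokens, and the retention gap over the continuous baseline is 6.2 percentage points.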
MM-CoT: A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models
Authors: Jusheng Zhang, Kaitong Cai, Xiaoyang Guo, Sidi Liu, Qinhan Lv, Ruiqi Chen, Jing Yang, Yijia Fan, Xiaofei Sun, Jian Wang, Ziliang Chen, Liang Lin, Keze Wang
First: 2025-12-09T04:13:31+00:00 · Latest: 2025-12-09T04:13:31+00:00
Abstract
The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Yet a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. To fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. We evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.
中文标题/摘要
标题:MM-CoT:多模态模型视觉链式推理基准
进行链式推理(CoT)的能力是多模态模型(MMs)的一个重要里程碑,使它们能够解决复杂的视觉推理问题。然而,一个关键问题仍然存在:这种推理是否真正基于视觉证据并且逻辑上连贯?现有的基准主要强调生成,而忽视了验证,即评估推理链是否在视觉上一致且逻辑上有效的能力。为了填补这一空白,我们引入了MM-CoT,这是一种专门设计的诊断基准,用于探测MMs中视觉接地和逻辑连贯性的CoT推理。模型必须选择唯一满足两个正交约束的事件链:(i)视觉一致性,确保每一步都基于可观察的证据,(ii)逻辑连贯性,确保因果和常识的有效性。对抗性干扰项被设计成违反其中一个约束,从而暴露不同的推理失败。我们对MM-CoT进行了领先视觉语言模型的评估,并发现即使是最先进的系统也难以应对,揭示了生成流畅性和真实推理准确性的显著差距。MM-CoT与现有基准的相关性较低,证实了它衡量的是视觉接地和逻辑推理的独特组合。该基准为开发能够在视觉世界中不仅合理地推理,而且忠实和连贯地推理的未来模型奠定了基础。
Summary / 总结
MM-CoT is a benchmark designed to evaluate the visual grounding and logical coherence of Chain-of-Thought reasoning in multimodal models. Models must select a single event chain that is both visually consistent and logically coherent, with adversarial distractors to test these constraints. The evaluation shows that even advanced vision-language models struggle with this task, highlighting a gap between generative fluency and true reasoning fidelity. MM-CoT demonstrates low correlation with existing benchmarks, indicating its unique focus on visual grounding and logical reasoning.
论文提出了MM-CoT基准,用于测试多模态模型在视觉推理和逻辑连贯性方面的表现。模型需要选择既符合视觉一致性和逻辑连贯性的正确事件链,同时设计的干扰项会违反这些约束。评估结果显示,即使是先进的模型也难以应对,突显了生成流畅性和真实推理准确性的差距。MM-CoT揭示了一种视觉接地和逻辑推理的独特组合,而现有基准无法捕捉到这一点。
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models
Authors: Yongxian Wei, Yilin Zhao, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Gang Liu, Jiahong Yan, Chun Yuan, Dian Li
First: 2025-11-13T03:08:51+00:00 · Latest: 2025-12-09T02:09:47+00:00
Abstract
Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available here.
中文标题/摘要
标题:学习提出问题:基于推理驱动和求解器自适应的数据合成方法
为大型推理模型训练的数据合成提供了比有限的人工精选数据集更具扩展性的替代方案,能够生成高质量的数据。然而,现有方法面临几个挑战:(i)无差别的生成忽略了解题器的能力,导致生成低价值的问题,或者依赖复杂的数据管道来平衡问题难度;和(ii)生成问题缺乏推理,导致浅层的问题变体。在本文中,我们开发了一个问题生成器,该生成器在合成之前明确地进行推理以规划问题方向,并根据求解器的能力调整难度。具体来说,我们构建了相关的问题对,并通过推理模型生成中间的问题设计推理(CoT)进行增强。这些数据为生成器提供了问题设计策略的启动。然后,我们将求解器对合成问题的反馈作为奖励信号,使生成器能够校准难度并生成接近求解器能力边缘的互补问题。在10个数学和通用推理基准上的广泛实验表明,我们的方法平均提高了2.5%,并能够泛化到语言和视觉语言模型。此外,使用合成数据训练的求解器为生成器的持续训练提供了更好的奖励,促进了协同进化,进一步提高了0.7%的性能。我们的代码将在此公开发布。
Summary / 总结
This paper addresses the challenges of data synthesis for training large reasoning models, such as indiscriminate generation and a lack of reasoning in problem generation. The authors propose a method that involves reasoning explicitly to plan problem directions and adapting difficulty to the solver's ability. By constructing related problem pairs and using intermediate problem-design CoT from a reasoning model, the generator can produce high-quality data. The solver's feedback is used as a reward signal to calibrate difficulty and generate complementary problems near the edge of the solver's competence. Experiments on 10 benchmarks show an average improvement of 2.5% and generalization to both language and vision-language models, with an additional 0.7% performance gain through co-evolution.
本文解决了大规模推理模型训练中数据合成的挑战,如无差别的生成和生成问题缺乏推理。作者提出了一种方法,该方法明确地推理以规划问题方向,并根据求解器的能力调整难度。通过构建相关问题对,并使用推理模型生成的中间问题设计推理,生成器生成高质量的数据。求解器的反馈被用作奖励信号来调整难度并生成接近求解器能力边缘的互补问题。在10个基准测试上的实验显示平均改进了2.5%,并通过求解器和生成器的协同进化进一步获得了0.7%的性能提升。
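The abstract of the last entry says the solver's feedback is used as a reward so that the generator produces problems "near the edge of the solver's competence", without giving the reward's form. The sketch below shows one hypothetical shape such a reward could take: it peaks when the solver succeeds on about half of its attempts and falls to zero for problems that are always or never solved. The function name `difficulty_reward` and the target rate of 0.5 are assumptions, not the paper's definition.

```python
def difficulty_reward(solver_correct, target_rate=0.5):
    """Reward in [0, 1], highest when the empirical solve rate equals target_rate.

    solver_correct: list of booleans, one per solver attempt on the same problem.
    target_rate: solve rate treated as the edge of competence (assumed value).
    """
    if not solver_correct:
        raise ValueError("need at least one solver attempt")
    solve_rate = sum(solver_correct) / len(solver_correct)
    # Normalise by the largest possible distance from the target.
    max_dist = max(target_rate, 1.0 - target_rate)
    return 1.0 - abs(solve_rate - target_rate) / max_dist


too_easy = difficulty_reward([True] * 8)                  # solved every time
borderline = difficulty_reward([True] * 4 + [False] * 4)  # solved half the time
too_hard = difficulty_reward([False] * 8)                 # never solved
```

A reward of this shape discourages both trivial and unsolvable problems, which matches the stated goal of calibrating difficulty to the solver.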