arXiv Paper Digest / arXiv 论文速递

2025-12-12 03:33
Snapshot: 20251212_0333
ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning
Authors: Xinyu Liu, Hangjie Yuan, Yujie Wei, Jiazheng Xing, Yujin Han, Jiahao Pan, Yanbiao Ma, Chi-Min Chan, Kang Zhao, Shiwei Zhang, Wenhan Luo, Yike Guo
First: 2025-12-10T18:57:09+00:00 · Latest: 2025-12-10T18:57:09+00:00
Abstract
Video unified models exhibit strong capabilities in understanding and generation, yet they struggle with reason-informed visual editing even when equipped with powerful internal vision-language models (VLMs). We attribute this gap to two factors: 1) existing datasets are inadequate for training and evaluating reasoning-aware video editing, and 2) an inherent disconnect between the models' reasoning and editing capabilities, which prevents the rich understanding from effectively instructing the editing process. Bridging this gap requires an integrated framework that connects reasoning with visual transformation. To this end, we introduce the Reason-Informed Video Editing (RVE) task, which requires reasoning about physical plausibility and causal dynamics during editing. To support systematic evaluation, we construct RVE-Bench, a comprehensive benchmark with two complementary subsets: Reasoning-Informed Video Editing and In-Context Video Generation. These subsets cover diverse reasoning dimensions and real-world editing scenarios. Building upon this foundation, we propose ReViSE, a Self-Reflective Reasoning (SRF) framework that unifies generation and evaluation within a single architecture. The model's internal VLM provides intrinsic feedback by assessing whether the edited video logically satisfies the given instruction. This differential feedback refines the generator's reasoning behavior during training. Extensive experiments on RVE-Bench demonstrate that ReViSE significantly enhances editing accuracy and visual fidelity, achieving a 32% improvement of the Overall score in the reasoning-informed video editing subset over state-of-the-art methods.
中文标题/摘要
标题:ReViSE:统一模型中基于推理的视频编辑
视频统一模型在理解和生成方面表现出强大的能力,但在配备了强大内部视觉-语言模型(VLM)的情况下,它们在基于推理的视觉编辑方面仍然存在困难。我们将其差距归因于两个因素:1)现有的数据集不足以用于训练和评估推理感知的视频编辑,2)模型推理能力和编辑能力之间固有的脱节,这阻碍了丰富的理解有效地指导编辑过程。弥合这一差距需要一个将推理与视觉转换集成在一起的框架。为了解决这一差距,我们提出了基于推理的视频编辑(RVE)任务,该任务要求在编辑过程中考虑物理合理性与因果动态。为了支持系统的评估,我们构建了RVE-Bench,这是一个全面的基准,包含两个互补的子集:基于推理的视频编辑和上下文视频生成。这些子集涵盖了多种推理维度和现实世界的编辑场景。在此基础上,我们提出了ReViSE,一种自我反思推理(SRF)框架,将生成和评估统一在一个架构中。模型内部的VLM通过评估编辑后的视频是否逻辑上满足给定的指令,提供内在反馈。差异反馈在训练过程中细化生成器的推理行为。在RVE-Bench上的广泛实验表明,ReViSE 显著提高了编辑准确性和视觉保真度,在基于推理的视频编辑子集上相对于最先进的方法实现了32%的整体分数提升。
Summary / 总结
The paper addresses the challenge of reason-informed video editing by introducing the RVE task and RVE-Bench, which evaluate the ability to reason about physical plausibility and causal dynamics. The proposed ReViSE framework uses a Self-Reflective Reasoning (SRF) approach to unify generation and evaluation, providing intrinsic feedback to refine the model's reasoning. Experiments show that ReViSE improves editing accuracy and visual fidelity, with a 32% improvement in the Overall score on the reasoning-informed video editing subset over state-of-the-art methods.
论文通过引入RVE任务和RVE-Bench来评估物理合理性及因果动态的推理能力。提出的ReViSE框架采用自我反思推理(SRF)方法统一生成和评估,并提供内在反馈以改进模型的推理行为。实验表明,ReViSE在推理驱动的视频编辑子集中的整体得分比最先进的方法提高了32%,显著提升了编辑准确性和视觉保真度。
VisualActBench: Can VLMs See and Act like a Human?
Authors: Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo
First: 2025-12-10T18:36:18+00:00 · Latest: 2025-12-10T18:36:18+00:00
Abstract
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
中文标题/摘要
标题:VisualActBench:VLMs能否像人类一样观察和行动?
视觉-语言模型(VLMs)在感知和描述视觉环境方面取得了显著进展。然而,它们在没有明确文本提示的情况下仅凭视觉输入主动推理和行动的能力仍未得到充分探索。我们引入了一个新的任务,即视觉行动推理,并提出了一个包含1,074个视频和3,733个人工标注动作的大规模基准VisualActBench,覆盖四个真实场景。每个动作都标注了行动优先级水平(APL)和主动-反应类型,以评估模型的人类对齐推理和价值敏感性。我们在VisualActBench上评估了29个VLMs,并发现尽管前沿模型如GPT4o表现出相对较强的能力,但与人类水平的推理相比,特别是在生成主动、高优先级动作方面,仍存在显著差距。我们的结果突显了当前VLMs在解释复杂背景、预测结果和与人类决策框架对齐方面的局限性。VisualActBench为评估和提高主动视觉中心AI代理的现实世界准备性奠定了全面的基础。
Summary / 总结
The paper introduces Visual Action Reasoning as a new task and VisualActBench, a benchmark with 1,074 videos and 3,733 human-annotated actions, to evaluate VLMs' ability to proactively reason and act based on visual inputs. Evaluating 29 VLMs, including GPT4o, the study finds that while these models show some capability, they still fall short of human-level reasoning, especially in generating proactive, high-priority actions. This highlights the need for better context interpretation and alignment with human decision-making in VLMs.
论文提出了VisualActBench,这是一个用于评估VLMs基于视觉输入进行主动推理和行动能力的新基准,包含1,074个视频和3,733个人工标注的动作,覆盖四个场景,并对每个动作进行优先级和主动/反应类型标注。评估29个VLMs后,研究发现虽然如GPT4o等模型表现出一定能力,但在生成高优先级的主动行动方面仍远不及人类水平,这表明VLMs在理解复杂背景和预测结果方面仍存在不足。
Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
Authors: Pius Horn, Janis Keuper
First: 2025-12-10T18:01:50+00:00 · Latest: 2025-12-10T18:01:50+00:00
Abstract
Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: https://github.com/phorn1/pdf-parse-bench
中文标题/摘要
标题:PDF中数学公式解析基准测试
正确解析PDF中的数学公式对于训练大型语言模型和从学术文献中构建科学知识库至关重要,但现有基准要么完全排除公式,要么缺乏语义感知的评估指标。我们引入了一种新的基准测试框架,其核心是带有精确LaTeX真值的合成PDF,从而能够系统地控制布局、公式和内容特征。一个关键的方法贡献是首次使用LLM作为裁判进行语义公式评估,并结合一个稳健的两阶段匹配管道来处理解析器输出的不一致问题。通过对250个公式对(30名评估者共750次评分)的人工验证,我们证明基于LLM的评估与人工判断的相关性(皮尔逊r=0.78)远高于CDM(r=0.34)和文本相似度(r≈0)。评估20多种当代PDF解析器(包括专门的OCR模型、视觉-语言模型和基于规则的方法)在100个合成文档中的2000多个公式上显示出了显著的性能差异。我们的研究结果为选择适用于下游应用的解析器的实践者提供了宝贵的见解,并建立了稳健、可扩展的方法,以实现PDF公式提取质量的可重复评估。代码和基准数据:https://github.com/phorn1/pdf-parse-bench
Summary / 总结
The paper introduces a new benchmark for evaluating document parsers in extracting mathematical formulas from PDFs, focusing on synthetically generated PDFs with precise LaTeX ground truth. It uses a two-stage matching pipeline and LLM-as-a-judge for semantic formula assessment, achieving higher correlation with human judgment (Pearson r=0.78) compared to other methods. Evaluating 20+ parsers across 100 synthetic documents, it highlights significant performance differences, providing insights for practitioners and establishing a robust methodology for reproducible evaluation.
论文引入了一个新的基准,用于评估文档解析器从PDF中提取数学公式的性能,重点是带有精确LaTeX地面真值的合成PDF。它使用两阶段匹配管道和LLM作为评判者进行语义公式评估,与人类判断的相关性(皮尔逊r=0.78)高于其他方法。评估了20多种解析器在100个合成文档中的表现,突显了显著的性能差异,为从业者提供了宝贵的见解,并建立了可重复的评估方法。
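The reported comparison (r=0.78 for the LLM judge versus r=0.34 for CDM) rests on the Pearson correlation between an automatic score and human ratings. A minimal sketch of that computation follows; the function name `pearson_r` and all data values are hypothetical and are not taken from the paper or its repository.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: mean human rating per formula pair and two automatic scores.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_a = [1.2, 1.9, 3.2, 3.8, 5.1]  # tracks the human ratings closely
metric_b = [3.0, 1.0, 4.0, 2.0, 3.0]  # largely unrelated to the human ratings

print(round(pearson_r(human, metric_a), 3))
print(round(pearson_r(human, metric_b), 3))
```

A metric whose scores rise and fall with the human ratings yields a coefficient near 1, while an unrelated metric yields a value near 0, which is how the abstract's three figures should be read.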
Constrained Discrete Diffusion
Authors: Michael Cardei, Jacob K Christopher, Thomas Hartvigsen, Bhavya Kailkhura, Ferdinando Fioretto
Venue: NeurIPS 2025
First: 2025-03-12T19:48:12+00:00 · Latest: 2025-12-10T17:52:23+00:00
Comments: Published at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Discrete diffusion models are a class of generative models that construct sequences by progressively denoising samples from a categorical noise distribution. Beyond their rapidly growing ability to generate coherent natural language, these models present a new and important opportunity to enforce sequence-level constraints, a capability that current autoregressive models cannot natively provide. This paper capitalizes on this opportunity by introducing Constrained Discrete Diffusion (CDD), a novel integration of differentiable constraint optimization within the diffusion process to ensure adherence to constraints, logic rules, or safety requirements for generated sequences. Unlike conventional text generators that often rely on post-hoc filtering or model retraining for controllable generation, CDD directly imposes constraints into the discrete diffusion sampling process, resulting in a training-free and effective approach. Experiments in toxicity-controlled text generation, property-constrained molecule design, and instruction-constrained text completion demonstrate that CDD achieves zero constraint violations in a diverse array of tasks while preserving fluency, novelty, and coherence, and that it outperforms autoregressive and existing discrete diffusion approaches.
中文标题/摘要
标题:约束离散扩散
离散扩散模型是一类生成模型,通过逐步去除来自分类噪声分布的样本中的噪声来构建序列。除了其迅速增长的生成连贯自然语言的能力之外,这些模型还提供了一种新的重要机会,即在序列级别上施加约束,这是当前自回归模型无法原生提供的能力。本文利用这一机会,引入了约束离散扩散(CDD),这是一种在扩散过程中结合可微约束优化的新方法,以确保生成序列遵守约束、逻辑规则或安全要求。与依赖于事后过滤或模型重新训练的常规文本生成器不同,CDD 直接将约束施加到离散扩散采样过程中,从而实现无训练且有效的生成方法。在毒性控制文本生成、属性约束分子设计和指令约束文本完成等任务中的实验表明,CDD 在多种任务中实现了零约束违反,同时保持流畅性、新颖性和连贯性,并优于自回归和现有离散扩散方法。
Summary / 总结
Constrained Discrete Diffusion (CDD) is introduced to enforce sequence-level constraints in generative models, addressing a limitation of autoregressive models. CDD integrates differentiable constraint optimization into the discrete diffusion process, ensuring generated sequences meet specific requirements without post-hoc filtering. Experiments show CDD outperforms autoregressive and existing discrete diffusion approaches in tasks such as toxicity-controlled text generation, property-constrained molecule design, and instruction-constrained text completion, achieving zero constraint violations while maintaining fluency and coherence.
Constrained Discrete Diffusion (CDD) 通过在离散扩散框架中整合可微约束优化来确保生成过程中遵守序列级约束。这种方法无需训练或后处理过滤即可确保遵守约束、逻辑规则或安全要求。实验表明,CDD 在毒性控制文本生成、属性约束分子设计和指令约束文本完成等任务中优于自回归和现有离散扩散方法,实现了零约束违反,同时保持了流畅性、新颖性和连贯性。
Human Motion Unlearning
Authors: Edoardo De Matteis, Matteo Migliarini, Alessio Sampieri, Indro Spinelli, Fabio Galasso
First: 2025-03-24T13:46:27+00:00 · Latest: 2025-12-10T16:23:59+00:00
Abstract
We introduce Human Motion Unlearning and motivate it through the concrete task of preventing violent 3D motion synthesis, an important safety requirement given that popular text-to-motion datasets (HumanML3D and Motion-X) contain from 7% to 15% violent sequences spanning both atomic gestures (e.g., a single punch) and highly compositional actions (e.g., loading and swinging a leg to kick). By focusing on violence unlearning, we demonstrate how removing a challenging, multifaceted concept can serve as a proxy for the broader capability of motion "forgetting." To enable systematic evaluation of Human Motion Unlearning, we establish the first motion unlearning benchmark by automatically filtering HumanML3D and Motion-X datasets to create distinct forget sets (violent motions) and retain sets (safe motions). We introduce evaluation metrics tailored to sequential unlearning, measuring both suppression efficacy and the preservation of realism and smooth transitions. We adapt two state-of-the-art, training-free image unlearning methods (UCE and RECE) to leading text-to-motion architectures (MoMask and BAMM), and propose Latent Code Replacement (LCR), a novel, training-free approach that identifies violent codes in a discrete codebook representation and substitutes them with safe alternatives. Our experiments show that unlearning violent motions is indeed feasible and that acting on latent codes strikes the best trade-off between violence suppression and preserving overall motion quality. This work establishes a foundation for advancing safe motion synthesis across diverse applications. Website: https://www.pinlab.org/hmu.
中文标题/摘要
标题:人类运动反学习
我们介绍了人类运动反学习,并通过具体任务防止暴力3D运动合成来激发其动机,鉴于流行的文本到运动数据集(HumanML3D和Motion-X)包含7%到15%的暴力序列,涵盖从原子手势(例如,单次挥拳)到高度组合的动作(例如,加载并挥动腿部踢打)。通过专注于暴力反学习,我们展示了如何通过移除一个复杂概念作为更广泛运动“遗忘”能力的代理。为了系统评估人类运动反学习,我们通过自动过滤HumanML3D和Motion-X数据集建立了第一个运动反学习基准,创建了不同的遗忘集(暴力动作)和保留集(安全动作)。我们引入了针对顺序反学习的评估指标,衡量抑制效果以及保持真实感和平滑过渡。我们将两种最先进的无需训练的图像反学习方法(UCE和RECE)适应领先的文本到运动架构(MoMask和BAMM),并提出了潜码替换(LCR)这一新颖的无需训练方法,该方法在离散潜码表示中识别暴力潜码,并用安全替代品替换它们。我们的实验表明,反学习暴力动作是可行的,且在暴力抑制和保持整体运动质量之间采取潜码操作达到了最佳权衡。这项工作为在各种应用中推进安全运动合成奠定了基础。网站:https://www.pinlab.org/hmu.
Summary / 总结
The research introduces Human Motion Unlearning to address the safety concern of violent 3D motion synthesis by models trained on text-to-motion datasets. By focusing on violence, the study uses a challenging, multifaceted concept as a proxy for the broader capability of motion forgetting. A benchmark is established by automatically filtering the HumanML3D and Motion-X datasets into forget sets (violent motions) and retain sets (safe motions). The authors introduce evaluation metrics for sequential unlearning and adapt existing image unlearning methods (UCE and RECE) to text-to-motion architectures, proposing a novel Latent Code Replacement approach. Experiments show that unlearning violent motions is feasible, with Latent Code Replacement achieving the best balance between violence suppression and motion quality preservation.
Human Motion Unlearning旨在通过从流行的文字到动作数据集中移除暴力序列来防止合成暴力3D动作。作者通过过滤HumanML3D和Motion-X数据集中的暴力动作,建立了首个动作遗忘基准。他们评估了包括UCE、RECE和LCR在内的多种遗忘方法,并发现作用于潜在代码可以最好地在抑制暴力和保持动作质量之间取得平衡。这项工作为在各种应用中实现更安全的动作合成奠定了基础。
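The abstract describes Latent Code Replacement only at a high level: violent codes in a discrete codebook are identified and substituted with safe alternatives. The sketch below illustrates the substitution step under loud assumptions: the set of violent code indices is taken as given, and "safe alternative" is interpreted as the nearest non-violent codebook entry by squared Euclidean distance. Both choices, and the names `nearest_safe_code` and `replace_codes`, are hypothetical and may differ from the authors' method.

```python
def nearest_safe_code(code, codebook, forbidden):
    """Index of the non-forbidden codebook entry closest to codebook[code]."""
    target = codebook[code]
    safe = [i for i in range(len(codebook)) if i not in forbidden]
    if not safe:
        raise ValueError("every code is forbidden; nothing to substitute")
    # Assumption: closeness is squared Euclidean distance between code vectors.
    return min(
        safe,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], target)),
    )


def replace_codes(tokens, codebook, forbidden):
    """Map every forbidden token to its nearest safe code; keep the rest."""
    mapping = {c: nearest_safe_code(c, codebook, forbidden) for c in forbidden}
    return [mapping.get(t, t) for t in tokens]


# Hypothetical 2-D codebook with four entries; code 2 is flagged as violent.
codebook = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
forbidden = {2}
tokens = [0, 2, 1, 2, 3]

print(replace_codes(tokens, codebook, forbidden))
```

Because the substitution is a lookup over token indices, it needs no gradient updates, which is consistent with the training-free framing in the abstract.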
RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations
Authors: Savya Khosla, Sethuraman T, Alexander Schwing, Derek Hoiem
First: 2024-12-02T18:59:53+00:00 · Latest: 2025-12-10T16:00:41+00:00
Abstract
We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
中文标题/摘要
标题:RELOCATE:一种基于区域表示的简单无训练基线视觉查询定位
我们提出了RELOCATE,一种简单的无训练基线,旨在执行在长视频中进行视觉查询定位这一具有挑战性的任务。为了消除任务特定训练的需要并高效处理长视频,RELOCATE 利用了从预训练视觉模型中提取的区域表示。总体而言,它遵循经典的物体定位方法:(1) 在每个视频帧中识别所有物体,(2) 将物体与给定的查询进行比较并选择最相似的物体,(3) 进行双向跟踪以获得时空响应。然而,我们提出了一些关键增强措施以处理小物体、杂乱场景、部分可见性和变化的外观。值得注意的是,我们对选定的物体进行了细化以实现准确的定位,并生成额外的视觉查询以捕捉视觉变化。我们在具有挑战性的Ego4D视觉查询2D定位数据集上评估了RELOCATE,建立了新的基线,相比之前的任务特定方法在时空平均精度上提高了49%(相对改进)。
Summary / 总结
RELOCATE is a training-free baseline for visual query localization in long videos, using region-based representations from pretrained vision models. It identifies objects in each frame, compares them with the query, and tracks them bidirectionally to generate a spatio-temporal response. Key enhancements include refining selected objects and generating additional queries to handle small objects, cluttered scenes, partial visibility, and varying appearances. RELOCATE outperforms previous methods by 49% in spatio-temporal average precision on the Ego4D Visual Query 2D Localization dataset.
RELOCATE 是一个无需训练的基线方法,用于长视频中的视觉查询定位,利用预训练模型的区域表示。它在每一帧中识别物体,将它们与查询进行比较,并进行双向跟踪以生成时空响应。关键改进包括精确细化选定的物体和生成额外的查询以处理小物体、杂乱场景和外观变化。RELOCATE 在 Ego4D 数据集上的时空平均精度上比之前的方法提高了 49%。
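Step (2) of the pipeline above, comparing detected objects with the query and selecting the most similar ones, can be sketched as a nearest-neighbour search over region embeddings. The code assumes that each region and the query are already represented as feature vectors and that cosine similarity is the comparison measure; the abstract does not specify either detail, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(query, regions_per_frame):
    """Return (frame_index, region_index, similarity) of the best region.

    regions_per_frame is a list of frames, each a list of region embeddings.
    Returns None when no frame contains any region.
    """
    best = None
    for frame_idx, regions in enumerate(regions_per_frame):
        for region_idx, embedding in enumerate(regions):
            score = cosine(query, embedding)
            if best is None or score > best[2]:
                best = (frame_idx, region_idx, score)
    return best


# Hypothetical 2-D embeddings: two frames with two regions each.
query = [1.0, 0.0]
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [1.0, 1.0]],
]
print(best_match(query, frames))
```

In the described method the selected region would then be refined and used to seed bidirectional tracking; those steps are outside the scope of this sketch.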
An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence
Authors: Gil Weissman, Amir Ivry, Israel Cohen
First: 2025-12-10T14:14:08+00:00 · Latest: 2025-12-10T14:14:08+00:00
Comments: Under review at IEEE Transactions on Geoscience and Remote Sensing (TGRS). 13 pages, 8 figures
Abstract
The proliferation of satellite constellations, coupled with reduced tasking latency and diverse sensor capabilities, has expanded the opportunities for automated Earth observation. This paper introduces a fully automated Tip-and-Cue framework designed for satellite imaging tasking and scheduling. In this context, tips are generated from external data sources or analyses of prior satellite imagery, identifying spatiotemporal targets and prioritizing them for downstream planning. Corresponding cues are the imaging tasks formulated in response, which incorporate sensor constraints, timing requirements, and utility functions. The system autonomously generates candidate tasks, optimizes their scheduling across multiple satellites using continuous utility functions that reflect the expected value of each observation, and processes the resulting imagery using artificial-intelligence-based models, including object detectors and vision-language models. Structured visual reports are generated to support both interpretability and the identification of new insights for downstream tasking. The efficacy of the framework is demonstrated through a maritime vessel tracking scenario, utilizing Automatic Identification System (AIS) data for trajectory prediction, targeted observations, and the generation of actionable outputs. Maritime vessel tracking is a widely researched application, often used to benchmark novel approaches to satellite tasking, forecasting, and analysis. The system is extensible to broader applications such as smart-city monitoring and disaster response, where timely tasking and automated analysis are critical.
中文标题/摘要
标题:面向优化卫星任务规划与视觉智能的自动化Tip-and-Cue框架
卫星星座的普及,加上任务延迟的降低和传感器能力的多样化,扩展了自动地球观测的机会。本文介绍了一种用于卫星成像任务规划和调度的完全自动化Tip-and-Cue框架。在此上下文中,提示(tip)是从外部数据源或先前卫星图像的分析中生成的,用于识别时空目标并为下游规划确定其优先级。相应的引导(cue)是为响应提示而制定的成像任务,其中包含传感器约束、时间要求和效用函数。该系统自主生成候选任务,使用反映每次观测预期价值的连续效用函数优化其在多颗卫星之间的调度,并使用基于人工智能的模型处理生成的图像,包括物体检测器和视觉语言模型。结构化的视觉报告被生成以支持可解释性和识别新的下游任务见解。通过利用自动识别系统(AIS)数据进行轨迹预测、目标观测和生成可操作输出的海上船舶跟踪场景,展示了该框架的有效性。海上船舶跟踪是一个被广泛研究的应用,常用于测试新的卫星任务规划、预测和分析方法。该系统可以扩展到更广泛的应用,如智慧城市监控和灾害响应,其中及时的任务分配和自动化分析至关重要。
Summary / 总结
This paper presents an automated Tip-and-Cue framework for optimizing satellite imaging tasks and visual intelligence. Tips are generated from external data or prior satellite imagery to identify and prioritize targets, while cues are the imaging tasks formulated based on sensor constraints and utility functions. The system autonomously generates and schedules tasks across multiple satellites, optimizing their utility and processing the imagery using AI models. The framework is validated through a maritime vessel tracking scenario using AIS data, demonstrating its effectiveness in generating actionable outputs. The system is also extensible to other applications like smart-city monitoring and disaster response.
本文提出了一种自动化的Tip-and-Cue框架,用于优化卫星成像任务和调度。提示(tip)从外部数据或先前的卫星图像中生成,以识别目标并确定其优先级,而引导(cue)则是基于传感器约束和效用函数制定的具体成像任务。该系统自主生成并调度任务,优化其效用,并使用AI模型处理图像。该框架通过使用AIS数据进行海上船只跟踪场景验证,展示了其生成可操作输出的有效性。该系统还可扩展应用于智能城市监控和灾害响应等领域。
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
Authors: Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan
First: 2025-12-10T14:01:02+00:00 · Latest: 2025-12-10T14:01:02+00:00
Abstract
Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.
中文标题/摘要
标题:IF-Bench:基于生成视觉提示增强红外图像理解的基准测试与提升
近年来,多模态大型语言模型(MLLMs)在各种基准测试中取得了令人印象深刻的进展。然而,它们在理解红外图像方面的能力尚未得到探索。为了解决这一差距,我们引入了IF-Bench,这是第一个专门用于评估红外图像多模态理解的高质量基准。IF-Bench 包含来自23个红外数据集的499张图像和680个精心策划的视觉问题-答案对,涵盖了图像理解的10个关键维度。基于此基准,我们系统地评估了超过40个开源和闭源的MLLMs,采用循环评估、双语评估和混合判断策略以提高结果的可靠性。我们的分析揭示了模型规模、架构和推理范式如何影响红外图像的理解,为该领域提供了宝贵的见解。此外,我们提出了一种无需训练的生成视觉提示(GenViP)方法,该方法利用先进的图像编辑模型将红外图像转换为语义和空间上对齐的RGB图像,从而缓解领域分布偏移。广泛的实验表明,我们的方法在多种MLLMs中一致地实现了显著的性能提升。基准测试和代码可在https://github.com/casiatao/IF-Bench/ 获取。
Summary / 总结
The research aims to evaluate and enhance the understanding of infrared images by multimodal large language models (MLLMs). IF-Bench, a new benchmark, includes 499 infrared images and 680 question-answer pairs, covering 10 dimensions of image understanding. The study evaluates over 40 MLLMs using cyclic evaluation, bilingual assessment, and hybrid judgment. The authors propose a training-free generative visual prompting (GenViP) method to improve MLLM performance on infrared images, showing consistent performance gains across different models. Key findings include the impact of model scale, architecture, and inference paradigms on infrared image comprehension and the effectiveness of GenViP in mitigating domain shifts.
研究旨在评估多模态大型语言模型(MLLMs)在理解红外图像方面的能力,这一直未被充分探索。研究引入了IF-Bench,该基准包含499张红外图像和680个问题-答案对,使用循环评估、双语评估和混合判断策略系统地评估了超过40个MLLMs。分析揭示了模型规模、架构和推理范式对红外图像理解的影响。此外,提出了一种无需训练的生成视觉提示(GenViP)方法,通过将红外图像转换为语义和空间上对齐的RGB图像来提高MLLM性能,从而在各种MLLMs上取得了显著的性能提升。
Mind to Hand: Purposeful Robotic Control via Embodied Reasoning
Authors: Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, Jianan Wang
First: 2025-12-09T13:19:37+00:00 · Latest: 2025-12-10T12:05:30+00:00
Comments: 49 pages, 25 figures
Abstract
Humans act with context and intention, with reasoning playing a central role. While internet-scale data has enabled broad reasoning capabilities in AI systems, grounding these abilities in physical action remains a major challenge. We introduce Lumo-1, a generalist vision-language-action (VLA) model that unifies robot reasoning ("mind") with robot action ("hand"). Our approach builds upon the general multi-modal reasoning capabilities of pre-trained vision-language models (VLMs), progressively extending them to embodied reasoning and action prediction, and ultimately towards structured reasoning and reasoning-action alignment. This results in a three-stage pre-training pipeline: (1) Continued VLM pre-training on curated vision-language data to enhance embodied reasoning skills such as planning, spatial understanding, and trajectory prediction; (2) Co-training on cross-embodiment robot data alongside vision-language data; and (3) Action training with reasoning process on trajectories collected on Astribot S1, a bimanual mobile manipulator with human-like dexterity and agility. Finally, we integrate reinforcement learning to further refine reasoning-action consistency and close the loop between semantic inference and motor control. Extensive experiments demonstrate that Lumo-1 achieves significant performance improvements in embodied vision-language reasoning, a critical component for generalist robotic control. Real-world evaluations further show that Lumo-1 surpasses strong baselines across a wide range of challenging robotic tasks, with strong generalization to novel objects and environments, excelling particularly in long-horizon tasks and responding to human-natural instructions that require reasoning over strategy, concepts and space.
中文标题/摘要
标题:心手相连:通过具身推理实现目的性机器人控制
人类在行动中具有上下文和意图,并且推理起着核心作用。尽管互联网规模的数据使AI系统具备了广泛的推理能力,但在物理行动中实现这些能力仍然是一个重大挑战。我们引入了Lumo-1,这是一种统一了机器人推理(“心”)与机器人行动(“手”)的通用视觉-语言-行动(VLA)模型。我们的方法基于预训练的多模态推理能力,逐步扩展到具身推理和动作预测,并最终实现结构化推理和推理-行动对齐。这导致了一个三阶段的预训练管道:(1)在精选的视觉-语言数据上继续预训练VLM,以增强具身推理技能,如规划、空间理解和轨迹预测;(2)与视觉-语言数据一起进行跨具身机器人数据的协同训练;(3)在收集于具有类人灵巧性和敏捷性的双臂移动操作器Astribot S1上的轨迹上进行动作训练,结合推理过程。最后,我们整合强化学习以进一步提高推理-行动一致性,并在语义推理和运动控制之间形成闭环。广泛的实验表明,Lumo-1在具身视觉-语言推理方面取得了显著的性能提升,这是通用机器人控制的关键组成部分。实际应用评估进一步表明,Lumo-1在一系列具有挑战性的机器人任务中超越了强大的基线,具有强大的泛化能力,特别是在长时任务和需要推理策略、概念和空间的人类自然指令方面表现出色。
Building Reasonable Inference for Vision-Language Models in Blind Image Quality Assessment
Authors: Yuan Li, Zitang Sun, Yen-ju Chen, Shin'ya Nishida
Venue: In: Taniguchi, T., et al. Neural Information Processing. ICONIP 2025. Lecture Notes in Computer Science, vol 16310. Springer, Singapore
First: 2025-12-10T11:50:42+00:00 · Latest: 2025-12-10T11:50:42+00:00
Comments: Accepted to the ICONIP (International Conference on Neural Information Processing), 2025
Abstract
Recent progress in blind image quality assessment (BIQA) has been driven by vision-language models (VLMs), whose semantic reasoning abilities suggest that they might extract visual features, generate descriptive text, and infer quality in a human-like manner. However, these models often produce textual descriptions that contradict their final quality predictions, and the predicted scores can change unstably during inference, behaviors that are not aligned with human reasoning. To understand these issues, we analyze the factors that cause contradictory assessments and instability. We first estimate the relationship between the final quality predictions and the generated visual features, finding that the predictions are not fully grounded in the features and that the logical connection between them is weak. Moreover, decoding intermediate VLM layers shows that the model frequently relies on a limited set of candidate tokens, which contributes to prediction instability. To encourage more human-like reasoning, we introduce a two-stage tuning method that explicitly separates visual perception from quality inference. In the first stage, the model learns visual features; in the second, it infers quality solely from these features. Experiments on SPAQ and KONIQ demonstrate that our approach reduces prediction instability from 22.00% to 12.39% and achieves average gains of 0.3124/0.3507 in SRCC/PLCC across LIVE, CSIQ, SPAQ, and KONIQ compared to the baseline. Further analyses show that our method improves both stability and the reliability of the inference process.
中文标题/摘要
标题:在盲图像质量评估中为视觉-语言模型构建合理的推理
BIQA 的近期进展得益于 VLMs,其语义推理能力表明它们可能以类似人类的方式提取视觉特征、生成描述性文本并进行质量推理。然而,这些模型经常产生与其最终质量预测相矛盾的文本描述,并且在推理过程中预测分数会不稳定地变化,这些行为不符合人类推理。为了理解这些问题,我们分析了导致矛盾评估和不稳定性的因素。我们首先估计最终质量预测与生成的视觉特征之间的关系,发现预测并未完全基于特征,且两者之间的逻辑联系较弱。此外,解码中间的 VLM 层显示,模型经常依赖于一组有限的候选词,这导致了预测的不稳定性。为了促进更类似人类的推理,我们引入了一种两阶段调优方法,明确地将视觉感知与质量推理分离。在第一阶段,模型学习视觉特征;在第二阶段,它仅从这些特征中推断质量。在 SPAQ 和 KONIQ 上的实验表明,我们的方法将预测不稳定性从 22.00% 降低到 12.39%,并且在 LIVE、CSIQ、SPAQ 和 KONIQ 上相对于基线的 SRCC/PLCC 平均提高了 0.3124/0.3507。进一步的分析表明,我们的方法提高了推理过程的稳定性和可靠性。
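The gains above are reported in SRCC and PLCC, the Spearman rank and Pearson linear correlation coefficients between predicted and reference quality scores. The sketch below shows how the two differ: SRCC is PLCC computed on ranks, so it rewards any monotonic relationship, while PLCC rewards only a linear one. All function names and data are hypothetical illustrations, not the paper's evaluation code.

```python
import math


def plcc(xs, ys):
    """Pearson linear correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(xs), average_ranks(ys))


# Hypothetical scores: predictions are monotonic in the reference but not linear.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]

print(round(srcc(reference, predicted), 4))
print(round(plcc(reference, predicted), 4))
```

Reporting both coefficients, as the abstract does, separates ranking quality from how well the predicted scores fit the reference scale.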
Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Authors: Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising
Venue: IEEE Open Journal of Vehicular Technology, vol. 7, pp. 54-72, 2026
First: 2025-10-09T15:38:41+00:00 · Latest: 2025-12-10T11:35:59+00:00
Comments: Published in IEEE Open Journal of Vehicular Technology. Final version available at: https://ieeexplore.ieee.org/document/11230063
Abstract
Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
中文标题/摘要
标题:评估小型视觉-语言模型在距离依赖交通感知中的表现
视觉-语言模型(VLMs)变得越来越强大,展示了在需要视觉和文本理解的各种任务中表现出色的能力。它们强大的泛化能力使它们成为自动驾驶系统的一个有前途的组成部分,这些系统必须处理意外的边缘情况。然而,要在这种安全关键的应用中获得信任,一个模型首先必须具备可靠的感知系统。此外,由于交通场景中的关键对象和代理通常处于远处,我们要求系统不是“近视”的,即在近距离(20米以内)和远距离(30米以上)都具有强大的感知能力。为此,我们引入了距离标注交通感知问答(DTPQA),这是第一个专注于交通场景中基于感知的问题的视觉问答(VQA)基准,其中包含距离标注。通过排除需要推理的问题,我们确保模型性能反映的是感知能力。由于自动驾驶硬件的处理能力有限,无法支持大型VLMs,我们的研究集中在较小的VLMs上。具体来说,我们在DTPQA上评估了几种最先进的(SOTA)小型VLMs,结果显示,尽管问题很简单,但这些模型的表现远逊于人类(最佳小型VLM的平均准确率约为60%,而人类的准确率约为85%)。然而,值得注意的是,人类样本量相对较小,这带来了统计上的限制。我们还确定了一些特定的感知任务,例如区分左右,这些任务对这些模型来说仍然特别具有挑战性。
Summary / 总结
This study evaluates small vision-language models on a new benchmark called DTPQA, which focuses on traffic perception tasks with distance annotations. The research aims to assess the models' ability to handle both close and long-range perception tasks, crucial for automated driving systems. Key findings show that state-of-the-art small VLMs perform significantly worse than humans on these tasks, with an average accuracy of about 60% compared to 85% for humans. However, the small human sample size introduces some statistical limitations.
研究评估了小型视觉-语言模型(VLMs)在距离标注交通感知问答(DTPQA)基准上的表现,重点关注它们在近距离和远距离的感知能力。尽管问题相对简单,但小型VLMs的表现远逊于人类,最佳模型的平均准确率为约60%,而人类的准确率为约85%。研究还指出,这些模型在区分交通场景中的左右方向时存在特别大的挑战,表明需要在特定感知任务上进行改进。
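The close-range versus long-range comparison described for DTPQA can be illustrated with a minimal sketch. It assumes each evaluated question is a hypothetical `(distance_m, is_correct)` pair; the `range_bin` and `accuracy_by_range` helpers and the "mid" bucket for the 20 to 30 meter gap are illustrative assumptions, not the benchmark's actual tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def range_bin(distance_m: float) -> str:
    """Map an annotated distance to a range label.

    "close" (up to 20 m) and "long" (30+ m) follow the abstract;
    the "mid" bucket for the gap in between is an assumption.
    """
    if distance_m <= 20:
        return "close"
    if distance_m >= 30:
        return "long"
    return "mid"


def accuracy_by_range(records: Iterable[Tuple[float, bool]]) -> Dict[str, float]:
    """Compute accuracy separately for each distance range."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for distance_m, is_correct in records:
        label = range_bin(distance_m)
        total[label] += 1
        correct[label] += int(is_correct)
    return {label: correct[label] / total[label] for label in total}


# Hypothetical answers: (object distance in meters, model answered correctly)
records = [(5, True), (18, False), (35, True), (50, True), (25, False)]
scores = accuracy_by_range(records)
```

Reporting accuracy per range rather than as one pooled number is what exposes a "shortsighted" model, whose close-range score would be clearly higher than its long-range score.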
Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation
Authors: Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab
First: 2025-12-10T09:19:17+00:00 · Latest: 2025-12-10T09:19:17+00:00
Abstract
Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as the specific type of anomaly (e.g., "hole", "cut", or "scratch"), that could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of "abnormal" with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
中文标题/摘要
标题:基于渐进调优的缺陷感知混合提示优化以实现零样本多类型异常检测与分割
近期的视觉语言模型(VLMs)如CLIP通过利用文本提示中的高层语义信息,在显著分布偏移下展示了令人印象深刻的异常检测性能。然而,这些模型往往忽视了细微的细节,例如哪些类型的异常,如“孔洞”、“切割”、“划痕”等,这些可以提供更具体的异常本质洞察。我们认为,识别细微的异常类型1)以结构化的语义丰富了“异常”的表示,缩小了粗略异常信号与细微缺陷类别的差距;2)使制造商能够理解异常的根本原因并迅速实施更针对性和适当的纠正措施。虽然整合这种详细语义信息至关重要,但为每种缺陷类型设计手工制作的提示既耗时又容易受到人为偏见的影响。因此,我们提出了DAPO,一种基于渐进调优的缺陷感知提示优化方法,用于零样本多类型和二元异常检测与分割下的分布偏移。我们的方法通过学习混合缺陷感知提示,结合固定文本锚点和可学习的标记嵌入,将异常相关的图像特征与其相应的文本语义对齐。我们在公共基准(MPDD、VisA、MVTec-AD、MAD和Real-IAD)和内部数据集上进行了实验。结果表明,与基线模型相比,DAPO在分布偏移下的图像级AUROC和平均精度指标上平均提高了3.7%,在零样本设置下定位新型异常类型时平均提高了6.5%。
Summary / 总结
The research aims to enhance zero-shot multi-type anomaly detection and segmentation by incorporating fine-grained defect information through a novel approach called DAPO, which uses progressive tuning to optimize hybrid defect-aware prompts. The method aligns anomaly-relevant image features with text semantics, combining fixed textual anchors and learnable token embeddings. Experiments on public benchmarks and an internal dataset suggest that, compared to baseline models, DAPO achieves a 3.7% average improvement in image-level AUROC and average precision under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
研究旨在通过引入细粒度缺陷类型来增强零样本多类型异常检测和分割,提出了一种名为DAPO的新方法,该方法通过渐进调优优化提示。该方法使用混合缺陷感知提示对齐图像特征与文本语义。实验结果显示,与基线模型相比,DAPO在分布偏移下的图像级AUROC和平均精度指标上平均提高了3.7%,并在零样本设置下定位新型异常类型时平均提高了6.5%。
Representation Calibration and Uncertainty Guidance for Class-Incremental Learning based on Vision Language Model
Authors: Jiantao Tan, Peixian Ma, Tong Yu, Wentao Zhang, Ruixuan Wang
First: 2025-12-10T09:09:23+00:00 · Latest: 2025-12-10T09:09:23+00:00
Abstract
Class-incremental learning requires a learning system to continually learn knowledge of new classes while preserving previously learned knowledge of old classes. Current state-of-the-art methods based on Vision-Language Models (VLMs) still struggle to differentiate classes across learning tasks. Here, a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to help better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
中文标题/摘要
标题:基于视觉语言模型的类别增量学习的表示校准与不确定性指导
类别增量学习要求学习系统不断学习新类别的知识,同时尽量保留之前学习的老类别的知识。当前基于视觉-语言模型(VLMs)的最先进的方法仍然难以区分不同学习任务中的类别。为此,提出了一种基于VLM的图像分类连续学习框架。在该框架中,向预训练并冻结的图像编码器中添加了任务特定的适配器以学习新知识,并提出了一种基于轻量级投影混合的跨任务表示校准策略,以帮助在统一特征空间中更好地区分所有已学习的类别,缓解任务间的类别混淆。此外,还开发了一种由预测不确定性指导的推理策略,以更准确地选择最适合的图像特征进行类别预测。在多种数据集的不同设置下进行的大量实验表明,与现有方法相比,我们的方法具有优越的性能。
Summary / 总结
The research aims to improve class-incremental learning in Vision-Language Models (VLMs) by addressing the challenge of differentiating classes across learning tasks. The proposed framework includes task-specific adapters for learning new classes and a cross-task representation calibration strategy to better separate all learned classes in a unified feature space. Additionally, an inference strategy guided by prediction uncertainty is developed to enhance class prediction accuracy. Experiments show that the method outperforms existing approaches in various settings.
研究旨在通过解决任务间类别混淆问题来改进视觉语言模型的类别增量学习。方法包括在预训练的图像编码器中添加任务特定适配器,并使用跨任务表示校准策略以更好地在统一特征空间中区分已学习的类别。此外,还开发了一种由预测不确定性指导的推理策略,以提高类别预测的准确性。实验表明,该方法在多个数据集和设置下优于现有方法。
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Authors: Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
First: 2025-11-27T22:23:08+00:00 · Latest: 2025-12-10T06:59:02+00:00
Abstract
In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning using a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
中文标题/摘要
标题:世界于一框:理解文化混融作为视觉-语言模型的新挑战
在全球化世界中,来自不同起源的文化元素经常出现在同一幅视觉场景中。我们将这些称为文化混融场景,但大型视觉-语言模型(LVLMs)如何感知它们仍鲜有研究。我们探讨了文化混融作为LVLMs的关键挑战,并考察了当来自多个地区的文化物品同时出现时,当前模型的行为。为了系统地分析这些行为,我们构建了CultureMix,一个包含23000张扩散生成、人工验证的文化混融图像的食品视觉问答(VQA)基准,涵盖四个子任务:(1)仅食品,(2)食品+食品,(3)食品+背景,(4)食品+食品+背景。评估10个LVLMs,我们发现模型在混融环境中一致地未能保留个体文化身份。模型表现出强烈的背景依赖性,当将文化背景添加到仅食品基线中时,准确率下降14%,并且它们在不同背景下对相同食品产生不一致的预测。为解决这些局限性,我们探索了三种鲁棒性策略。我们发现使用多样文化混融数据集的监督微调显著提高了模型的一致性并降低了背景敏感性。我们呼吁增加对文化混融场景的关注,这是开发能够在文化多样化的现实环境中可靠运行的LVLMs的关键步骤。
Summary / 总结
The paper explores how large vision-language models (LVLMs) handle culture mixing scenarios, where elements from different cultures appear together in a single visual scene. To systematically analyze this, the authors created CultureMix, a benchmark with 23k images across four subtasks. They found that LVLMs often fail to preserve individual cultural identities in mixed settings, showing strong reliance on the background and producing inconsistent predictions. Supervised fine-tuning with diverse culture mixing data was found to improve model consistency and reduce background sensitivity, suggesting a path forward for better performance in culturally diverse environments.
研究探讨了大型视觉-语言模型(LVLMs)如何处理不同文化元素出现在同一视觉场景中的文化混合场景。为了系统地分析这一问题,创建了一个名为CultureMix的新基准,包含23,000张图像,分为四个子任务。评估结果显示,这些模型在混合设置中往往无法保留个体文化身份,表现出对背景的强烈依赖,并在不同背景下对相同食物产生不一致的预测。研究建议,使用多样化的文化混合数据集进行监督微调可以提高模型的一致性并减少对背景的敏感性。
TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment
Authors: Kanghyun Baek, Sangyub Lee, Jin Young Choi, Jaewoo Song, Daemin Park, Jooyoung Choi, Chaehun Shin, Bohyung Han, Sungroh Yoon
First: 2025-12-10T06:18:30+00:00 · Latest: 2025-12-10T06:18:30+00:00
Abstract
Despite recent advances, diffusion-based text-to-image models still struggle with accurate text rendering. Several studies have proposed fine-tuning or training-free refinement methods for accurate text rendering. However, the critical issue of text omission, where the desired text is partially or entirely missing, remains largely overlooked. In this work, we propose TextGuider, a novel training-free method that encourages accurate and complete text appearance by aligning textual content tokens and text regions in the image. Specifically, we analyze attention patterns in MM-DiT models, particularly for text-related tokens intended to be rendered in the image. Leveraging this observation, we apply latent guidance during the early stage of denoising steps based on two loss functions that we introduce. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
中文标题/摘要
标题:TextGuider: 无需训练的注意力对齐文本渲染指导
尽管取得了近期进展,基于扩散的文本到图像模型在准确的文本渲染方面仍然存在问题。多项研究提出了微调或无需训练的细化方法以实现准确的文本渲染。然而,文本遗漏的问题,即所需文本部分或完全缺失,仍然被忽视。在本文中,我们提出了一种名为TextGuider的新颖的无需训练方法,通过对齐图像中的文本内容标记和文本区域来鼓励准确且完整的文本显示。具体而言,我们分析了MM-DiT模型中的注意力模式,特别是那些旨在在图像中呈现的文本相关标记。基于我们引入的两个损失函数,在去噪步骤的早期阶段应用潜在指导。我们的方法在测试时的文本渲染方面达到了最先进的性能,显著提高了召回率,并在OCR准确性和CLIP分数方面取得了优异结果。
Summary / 总结
TextGuider is a training-free method that enhances text rendering in text-to-image models by aligning textual content tokens with their corresponding regions in the image. It analyzes attention patterns in MM-DiT models and applies latent guidance during the early denoising steps using two introduced loss functions. The method improves recall and OCR accuracy, achieving state-of-the-art performance in text rendering.
TextGuider 是一种无需训练的方法,通过在图像中对齐文本内容令牌和文本区域来提高文本到图像模型中的文本渲染效果。它分析了 MM-DiT 模型中的注意力模式,并在去噪步骤的早期阶段应用了两种引入的损失函数进行潜在指导。该方法在文本召回率、OCR 准确性和 CLIP 分数方面达到了最先进的性能。
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Authors: Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang
First: 2025-12-09T06:49:33+00:00 · Latest: 2025-12-10T05:44:15+00:00
Abstract
Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
中文标题/摘要
标题:OpenSubject:利用视频衍生的身份和多样性先验进行主题驱动的图像生成和操作
尽管在主题驱动的图像生成方面取得了令人鼓舞的进展,但当前的模型往往偏离参考身份,并且在包含多个主题的复杂场景中难以应对。为了解决这一挑战,我们引入了OpenSubject,这是一个包含250万样本和435万张图像的视频衍生大规模数据集,用于主题驱动的生成和操作。该数据集通过一个四阶段管道构建,利用跨帧身份先验。 (i) 视频编目。我们应用分辨率和审美过滤以获得高质量的片段。 (ii) 跨帧主题挖掘和配对。我们利用基于视觉-语言模型(VLM)的类别共识、局部定位和多样性意识配对来选择图像对。 (iii) 身份保留参考图像合成。我们引入分割图引导的出画填充以合成用于主题驱动生成的输入图像,并通过边界引导的填充生成用于主题驱动操作的输入图像,同时包括几何感知增强和不规则边界侵蚀。 (iv) 验证和注释。我们利用VLM验证合成样本,基于阶段(iii)重新合成失败样本,然后构建短和长注释。此外,我们引入了一个涵盖主题驱动生成和操作的基准,并使用VLM评判员评估身份保真度、提示一致性、操作一致性以及背景一致性。大量实验表明,使用OpenSubject进行训练可以提高生成和操作性能,特别是在复杂场景中。
Summary / 总结
OpenSubject is a large-scale video-derived dataset for subject-driven image generation and manipulation, containing 2.5 million samples and 4.35 million images. It is created through a four-stage pipeline that includes video curation, cross-frame subject mining and pairing, identity-preserving reference image synthesis, and verification and captioning. An accompanying benchmark uses a VLM judge to evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency. Experiments show that training with OpenSubject improves generation and manipulation performance, especially in complex scenes.
OpenSubject 是一个包含 2.5M 个样本和 4.35M 张图像的大规模数据集,通过四阶段管道创建,旨在解决当前主体驱动图像生成模型的局限性。该数据集利用跨帧身份先验,并包括身份保留的参考图像合成和几何增强。实验表明,使用 OpenSubject 训练可以提高生成和操作性能,尤其是在复杂场景中。
ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning
Authors: MD Thamed Bin Zaman Chowdhury, Moazzem Hossain
First: 2025-11-09T10:44:26+00:00 · Latest: 2025-12-10T04:19:10+00:00
Abstract
Reliable geospatial information on road accidents is vital for safety analysis and infrastructure planning, yet most low- and middle-income countries continue to face a critical shortage of accurate, location-specific crash data. Existing text-based geocoding tools perform poorly in multilingual and unstructured news environments, where incomplete place descriptions and mixed-language (e.g., Bangla-English) scripts obscure spatial context. To address these limitations, this study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework that emulates human spatial reasoning to infer accident location coordinates directly from available textual and map-based cues. ALIGN integrates large language and vision-language model mechanisms within a multi-stage pipeline that performs optical character recognition, linguistic reasoning, and map-level verification through grid-based spatial scanning. The framework systematically evaluates each predicted location against contextual and visual evidence, ensuring interpretable, fine-grained geolocation outcomes without requiring model retraining. Applied to a Bangla-language news data source, ALIGN demonstrates consistent improvements over traditional geoparsing methods, accurately identifying district- and sub-district-level crash sites. Beyond its technical contribution, the framework establishes a high-accuracy foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the broader integration of multimodal artificial intelligence in transportation analytics.
中文标题/摘要
标题:ALIGN:一种通过地理空间神经推理进行高精度事故位置推断的跨模态框架
可靠的道路事故地理空间信息对于安全分析和基础设施规划至关重要,但大多数低收入和中等收入国家仍然面临准确的、位置特定的碰撞数据严重短缺的问题。现有的基于文本的地理编码工具在多语言和非结构化新闻环境中表现不佳,其中不完整的地点描述和混合语言(例如孟加拉语-英语)脚本模糊了空间上下文。为了解决这些限制,本研究引入了ALIGN(通过地理空间神经推理进行事故位置推断),这是一种跨模态框架,模仿人类的空间推理能力,直接从可用的文本和地图线索中推断事故位置坐标。ALIGN 在一个多阶段管道中整合了大型语言模型和跨模态模型机制,该管道执行光学字符识别、语言推理和基于网格的空间扫描的地图级验证。该框架系统地评估每个预测位置的上下文和视觉证据,确保可解释的、细粒度的地理定位结果,而无需重新训练模型。将该框架应用于孟加拉语新闻数据源,ALIGN 在传统地理解析方法上表现出一致的改进,准确地识别出区级和次区级的碰撞地点。除了技术贡献外,该框架为数据稀缺地区自动碰撞地图绘制奠定了高精度基础,支持基于证据的道路安全政策制定,并促进多模态人工智能在交通分析中的更广泛集成。
Summary / 总结
The research aims to improve the accuracy of accident location inference in regions with limited crash data by developing ALIGN, a vision-language framework. ALIGN uses a multi-stage pipeline combining optical character recognition, linguistic reasoning, and map-level verification to infer accident locations from textual and map-based cues. The framework shows consistent improvements over traditional geoparsing methods, accurately identifying district- and sub-district-level crash sites in Bangla-language news data, supporting evidence-based road-safety policymaking in data-scarce regions.
研究旨在提高低收入和中等收入国家事故位置推断的准确性,这些国家缺乏地理空间数据。ALIGN 是一个视觉-语言框架,使用包括光学字符识别、语言推理和地图级验证的多阶段管道,从文本和地图线索中直接推断事故位置。该框架在识别孟加拉语新闻数据中的区级和次区级事故地点方面优于传统地理解析方法,为数据稀缺地区的自动事故地图绘制提供了坚实的基础。
Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning
Authors: Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Ligong Han, Hongyang He, Zhengtao Yao, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, Ruixiang Tang
First: 2025-05-21T04:25:23+00:00 · Latest: 2025-12-10T03:53:23+00:00
Comments: 14 pages, 8 figures, 5 tables
Abstract
Multimodal in-context learning (ICL) is becoming a key capability that allows large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, which expands their usefulness in many real-world applications. However, ICL performance remains unstable even when the in-context demonstrations (ICDs) are well matched, showing that LVLMs still struggle to make full use of the provided context. While existing work mainly focuses on prompt engineering or post-hoc logit calibration, we study the attention mechanisms inside LVLMs to address their inherent limitations. We identify two important weaknesses in their self-attention that hinder effective ICL. To address these weaknesses, we propose Context-Aware Modulated Attention (CAMA), a training-free and plug-and-play method that dynamically adjusts attention logits based on the input in-context sequence. CAMA uses a two-stage modulation process that strengthens attention to semantically important tokens, especially visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, showing clear effectiveness and generalization. It can also activate the intended benefits of prompt engineering methods and remains robust across different sequence configurations. Therefore, CAMA opens up new directions for improving multimodal reasoning through a deeper understanding of attention dynamics.
中文标题/摘要
标题:使LVLMs聚焦:基于上下文的注意力调制以提高多模态即席学习效果
多模态即席学习(ICL)已成为一种关键能力,使大型视觉语言模型(LVLMs)能够在不更新参数的情况下适应新任务,从而在许多实际应用中扩展其用途。然而,即使上下文示例(ICDs)匹配良好,ICL性能仍然不稳定,表明LVLMs仍然难以充分利用提供的上下文。虽然现有工作主要集中在提示工程或事后logit校准上,我们研究了LVLMs内部的注意力机制以解决其固有的局限性。我们识别出其自我注意力中的两个重要弱点,这些弱点阻碍了有效的ICL。为了解决这些弱点,我们提出了基于上下文的调制注意力(CAMA),这是一种无需训练且即插即用的方法,可以根据输入的即席序列动态调整注意力logits。CAMA采用两阶段调制过程,增强对语义重要标记,尤其是视觉标记的注意力。在四个LVLMs和七个基准测试中,CAMA始终优于vanilla模型和基线,显示出明显的有效性和泛化能力。它还可以激活提示工程方法的预期益处,并在不同的序列配置下保持稳健。因此,CAMA为通过更深入理解注意力动态来提高多模态推理开辟了新的方向。
Summary / 总结
The study aims to enhance the stability of multimodal in-context learning (ICL) in large vision-language models (LVLMs) by addressing their inherent limitations in attention mechanisms. The proposed Context-Aware Modulated Attention (CAMA) dynamically adjusts attention logits based on input context without requiring training. Experiments across four LVLMs and seven benchmarks demonstrate that CAMA consistently improves ICL performance, activates prompt engineering benefits, and maintains robustness under various sequence configurations.
研究旨在通过解决大型视觉语言模型(LVLM)内在的注意力机制限制,提高多模态上下文学习(ICL)的稳定性。提出的Context-Aware Modulated Attention (CAMA) 在不进行训练的情况下,根据输入上下文动态调整注意力权重,有效加强了对语义重要标记的关注。实验结果显示,CAMA 在四个 LVLM 和七个基准测试中均优于基础模型,表现出明显的有效性和泛化能力,并且能够增强提示工程方法的效果,同时在不同序列配置下保持鲁棒性。
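The general idea of adjusting attention logits before the softmax, as described for CAMA, can be sketched as follows. This is only an illustration under stated assumptions: the additive bias, the `importance` scores, and the `strength` parameter are hypothetical, and the sketch does not reproduce CAMA's actual two-stage modulation rule, which the abstract does not specify.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_attention(
    logits: List[float], importance: List[float], strength: float = 1.0
) -> List[float]:
    """Add a per-token bias to the attention logits, then normalize.

    `importance` is a hypothetical per-token score (for example, higher
    for visual tokens judged semantically important); `strength` scales
    how strongly it shifts attention. Both are assumptions.
    """
    adjusted = [l + strength * w for l, w in zip(logits, importance)]
    return softmax(adjusted)


# Three tokens with equal raw logits; the second is marked as important.
weights = modulated_attention([0.0, 0.0, 0.0], [0.0, 1.0, 0.0])
uniform = modulated_attention([0.0, 0.0, 0.0], [0.0, 1.0, 0.0], strength=0.0)
```

Because the bias is applied at inference time to logits the model already computes, a scheme of this kind needs no parameter updates, which is consistent with the training-free, plug-and-play framing in the abstract.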
Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Authors: Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
First: 2025-10-09T11:08:07+00:00 · Latest: 2025-12-10T03:27:37+00:00
Comments: The paper has been withdrawn because it will undergo a major revision. The revised version will differ substantially from the current one, making replacement inappropriate
Abstract
Video-to-Audio (V2A) generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination (IH) and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction (PFC), a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
中文标题/摘要
标题:视频到音频生成中检测和缓解插入幻觉
视频到音频生成在自动合成视频声音方面取得了显著进展。然而,现有的评估指标主要关注语义和时间对齐,忽视了一个关键的失败模式:模型经常生成声学事件,特别是语音和音乐,这些事件在视频中没有对应的视觉来源。我们称这种现象为插入幻觉,并将其识别为由数据集偏差驱动的系统性风险,当前的指标完全无法检测到这种风险。为应对这一挑战,我们首先开发了一种系统性的评估框架,该框架采用多个声学事件检测器的多数投票集成。我们还引入了两个新的度量标准来量化这一问题的普遍性和严重性:IH@vid(带有幻觉的视频比例)和IH@dur(幻觉持续时间的比例)。在此基础上,我们提出了后验特征校正(PFC),这是一种无需训练的推理时方法,可以缓解插入幻觉。PFC采用两步过程:首先生成初始音频输出以检测幻觉段落,然后在这些时间戳处遮蔽相应的视频特征并重新生成音频。在几个主流的V2A基准测试上进行的实验首次揭示了最先进的模型遭受严重的插入幻觉。相比之下,我们的PFC方法平均将幻觉的普遍性和持续时间减少了超过50%,且在不损害传统音频质量和时间同步指标的情况下,在某些情况下甚至提高了这些指标。我们的工作首次正式定义、系统性测量并有效缓解了插入幻觉,为更可靠和忠实的V2A模型铺平了道路。
Summary / 总结
The paper addresses the issue of Insertion Hallucination in Video-to-Audio generation, where models generate sounds that do not correspond to visual sources. It introduces a new evaluation framework using a majority-voting ensemble of audio event detectors and two metrics, IH@vid and IH@dur, to quantify the problem. Additionally, it proposes Posterior Feature Correction (PFC), a training-free method that reduces hallucinations by over 50% on average without degrading conventional audio quality metrics. This work is the first to formally define and mitigate this issue, enhancing the reliability of V2A models. However, the paper has been withdrawn for a major revision.
论文解决了视频到音频生成中的插入幻觉问题,即模型生成与视觉来源不对应的音频。它引入了一个新的评估框架,使用多个音频事件检测器的多数投票集合,并提出了两个指标IH@vid和IH@dur来量化问题。此外,它还提出了后验特征校正(PFC)方法,这是一种无需训练的推理时方法,可以将幻觉减少超过50%,同时不损害传统的音频质量指标。这项工作是首次正式定义、系统测量并有效缓解幻觉问题,提高了V2A模型的可靠性。然而,该论文已被撤回,将进行重大修订。
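The two metrics defined in the abstract, IH@vid and IH@dur, together with the majority-voting step, can be illustrated with a minimal sketch. It assumes each detector emits one boolean per fixed-length frame marking a hallucinated event, and that IH@dur pools frames over all videos; the function names, the frame-level representation, and the pooling choice are all assumptions rather than the authors' implementation.

```python
from typing import List, Tuple


def majority_vote(frame_votes: List[List[bool]]) -> List[bool]:
    """Combine per-frame decisions from several detectors.

    frame_votes[d][t] is detector d's decision for frame t. A frame is
    flagged only when a strict majority of detectors flag it.
    """
    n_detectors = len(frame_votes)
    n_frames = len(frame_votes[0])
    return [
        2 * sum(votes[t] for votes in frame_votes) > n_detectors
        for t in range(n_frames)
    ]


def ih_metrics(flags_per_video: List[List[bool]]) -> Tuple[float, float]:
    """Return (IH@vid, IH@dur) from per-frame hallucination flags.

    IH@vid: fraction of videos with at least one flagged frame.
    IH@dur: fraction of flagged frames, pooled over all videos
    (the pooling is an assumption).
    """
    videos_with_ih = sum(1 for flags in flags_per_video if any(flags))
    total_frames = sum(len(flags) for flags in flags_per_video)
    flagged_frames = sum(sum(flags) for flags in flags_per_video)
    return videos_with_ih / len(flags_per_video), flagged_frames / total_frames


# Hypothetical output of three detectors on a four-frame video.
votes = [
    [True, False, True, False],
    [True, True, False, False],
    [False, True, True, False],
]
video_a = majority_vote(votes)
video_b = [False, False, False, False]  # a video with no hallucination
ih_vid, ih_dur = ih_metrics([video_a, video_b])
```

The two numbers answer different questions: IH@vid says how often hallucination occurs at all, while IH@dur says how much of the generated audio it occupies.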
View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
Authors: Yuanyuan Liu, Haiyang Mei, Dongyang Zhan, Jiayue Zhao, Dongsheng Zhou, Bo Dong, Xin Yang
First: 2025-12-10T00:59:17+00:00 · Latest: 2025-12-10T00:59:17+00:00
Abstract
3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
中文标题/摘要
标题:View-on-Graph:基于场景图上视觉-语言推理的零样本3D视觉定位
3D视觉定位(3DVG)是从语言描述中识别3D场景中的物体。现有的零样本方法通过将3D空间信息(SI)转换为适合视觉-语言模型(VLM)处理的形式,通常作为指定视图渲染或带有对象标记的视频序列,利用2D视觉-语言模型(VLM)。然而,这种VLM + SI范式产生了纠缠的视觉表示,迫使VLM处理整个杂乱的线索,使其难以有效利用空间语义关系。在本文中,我们提出了一种新的VLM x SI范式,将3D SI外部化为一种形式,使VLM能够在推理过程中逐步检索所需的内容。我们通过一种新颖的View-on-Graph(VoG)方法实现这一范式,将场景组织成多模态、多层场景图,使VLM能够作为主动代理,在遍历场景时选择性地访问必要的线索。此设计具有两个内在优势:(i)通过将3D上下文结构化为空间和语义上一致的场景图,而不是用密集纠缠的视觉输入使VLM困惑,降低了VLM的推理难度;(ii)通过主动探索和推理场景图,自然地生成可解释的3DVG的透明、逐步痕迹。大量实验表明,VoG在零样本性能上达到了最先进的水平,确立了结构化场景探索是推进零样本3DVG的一种有前途的策略。
Summary / 总结
This paper proposes View-on-Graph (VoG), a novel method for zero-shot 3D visual grounding that externalizes 3D spatial information into a structured scene graph, enabling a vision-language model to selectively access necessary cues during reasoning. VoG outperforms existing approaches by reducing the complexity of spatial reasoning and providing transparent, step-by-step traces for interpretable 3D visual grounding, achieving state-of-the-art performance in zero-shot 3DVG tasks.
该研究提出了一种名为View-on-Graph (VoG)的方法,将场景组织成多模态、多层的场景图,使视觉-语言模型在推理过程中能够选择性地访问必要的线索。这种方法降低了空间推理的复杂性,并自然地生成了可解释的结果,实现了零样本3D视觉定位的最新性能。
Prompt-Based Continual Compositional Zero-Shot Learning
Authors: Sauda Maryam, Sara Nadeem, Faisal Qureshi, Mohsen Ali
First: 2025-12-09T22:36:31+00:00 · Latest: 2025-12-09T22:36:31+00:00
Abstract
We tackle continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL), while preventing forgetting of prior knowledge. Unlike classical continual learning where classes are disjoint, continual CZSL (CCZSL) is more complex as attributes and objects may reoccur across sessions while compositions remain unique. Built on a frozen VLM backbone, we propose the first Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework that retains prior knowledge through recency-weighted multi-teacher distillation. It employs session-aware compositional prompts to fuse multimodal features for new compositions, while attribute and object prompts are learned through session-agnostic fusion to maintain global semantic consistency, which is further stabilized by a Cosine Anchor Loss (CAL) to preserve prior knowledge. To enhance adaptation in the current session, an Orthogonal Projection Loss (OPL) ensures that new attribute and object embeddings remain distinct from previous ones, preventing overlap, while an Intra-Session Diversity Loss (IDL) promotes variation among current-session embeddings for richer, more discriminative representations. We also introduce a comprehensive protocol that jointly measures catastrophic forgetting and compositional generalization. Extensive experiments on UT-Zappos and C-GQA benchmarks demonstrate that PromptCCZSL achieves substantial improvements over prior VLM-based and non-VLM baselines, setting a new benchmark for CCZSL in closed-world settings.
中文标题/摘要
标题:基于提示的持续组合零样本学习
我们针对视觉-语言模型在Compositional Zero-Shot Learning (CZSL) 中对新属性、对象及其组合的持续适应问题,同时防止遗忘先前知识。不同于传统持续学习中类别的互斥性,CCZSL 更加复杂,因为属性和对象可能在不同会话中重复出现,而组合则保持唯一性。基于冻结的VLM主干,我们提出了第一个基于提示的持续组合零样本学习(PromptCCZSL)框架,通过最近性加权多教师蒸馏保留先前知识。该框架使用会话感知的组合提示融合多模态特征以生成新的组合,而属性和对象提示通过会话无关的融合学习以保持全局语义一致性,进一步通过余弦锚点损失(CAL)稳定以保留先前知识。为了增强当前会话的适应性,正交投影损失(OPL)确保新属性和对象嵌入与先前的嵌入保持独特性,防止重叠,而会话内多样性损失(IDL)促进当前会话嵌入之间的变化,以获得更丰富、更具区分性的表示。我们还引入了一个综合协议,联合衡量灾难性遗忘和组合泛化。在UT-Zappos和C-GQA基准上的广泛实验表明,PromptCCZSL 在持续组合零样本学习中显著优于基于VLM和非VLM的基线方法,为闭世界设置中的CCZSL 设定了新基准。
Summary / 总结
The research aims to address the challenge of continual adaptation of vision-language models to new attributes, objects, and their compositions in Compositional Zero-Shot Learning (CZSL) without forgetting previous knowledge. The proposed Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) framework retains prior knowledge through recency-weighted multi-teacher distillation, uses session-aware compositional prompts to fuse multimodal features for new compositions, and learns attribute and object prompts through session-agnostic fusion, stabilized by a Cosine Anchor Loss (CAL), to maintain global semantic consistency. It also employs an Orthogonal Projection Loss (OPL) to keep new embeddings distinct from previous ones and an Intra-Session Diversity Loss (IDL) to promote variation among current-session embeddings. Experiments on UT-Zappos and C-GQA benchmarks show that PromptCCZSL outperforms previous VLM-based and non-VLM methods, setting a new benchmark for continual CZSL in closed-world settings.
研究旨在解决视觉-语言模型在Compositional Zero-Shot Learning (CZSL) 中对新属性、对象及其组合进行持续适应的问题,同时不忘记先前的知识。提出的Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) 框架使用冻结的VLM主干,并通过近期加权多教师蒸馏保留先前知识。该框架使用会话感知的组合提示生成新组合,并使用会话无关的属性和对象提示保持全局语义一致性,同时还引入了额外的损失来增强适应性并防止重叠。在UT-Zappos和C-GQA基准上的实验表明,PromptCCZSL在闭合世界设置中的CCZSL基准上取得了显著的改进。
Hard Work Does Not Always Pay Off: Poisoning Attacks on Neural Architecture Search
Authors: Zachary Coalson, Huazheng Wang, Qingyun Wu, Sanghyun Hong
First: 2024-05-09T19:55:07+00:00 · Latest: 2025-12-09T19:45:12+00:00
Comments: Accepted at TMLR 2025.12
Abstract
We study the robustness of data-centric methods to find neural network architectures, known as neural architecture search (NAS), against data poisoning. To audit this robustness, we design a poisoning framework that enables the systematic evaluation of the ability of NAS to produce architectures under data corruption. Our framework examines four off-the-shelf NAS algorithms, representing different approaches to architecture discovery, against four data poisoning attacks, including one we tailor specifically for NAS. In our evaluation with the CIFAR-10 and CIFAR-100 benchmarks, we show that NAS is seemingly robust to data poisoning, showing marginal accuracy drops even under large poisoning budgets. However, we demonstrate that when considering NAS algorithms designed to achieve a few percentage points of accuracy gain, this expected improvement can be substantially diminished under data poisoning. We also show that the reduction varies across NAS algorithms and analyze the factors contributing to their robustness. Our findings are: (1) Training-based NAS algorithms are the least robust due to their reliance on data. (2) Training-free NAS approaches are the most robust but produce architectures that perform similarly to random selections from the search space. (3) NAS algorithms can produce architectures with improved accuracy, even when using out-of-distribution data like MNIST. We lastly discuss potential countermeasures. Our code is available at: https://github.com/ztcoalson/NAS-Robustness-to-Data-Poisoning
中文标题/摘要
标题:辛勤工作并不总是有回报:针对神经架构搜索的数据投毒攻击
我们研究了以数据为中心的神经网络架构发现方法(即神经架构搜索,NAS)在面对数据投毒时的鲁棒性。为了评估这种鲁棒性,我们设计了一种投毒框架,用于系统地评估NAS在数据损坏情况下生成架构的能力。我们的框架考察了四种现成的NAS算法(代表不同的架构发现方法)在四种数据投毒攻击下的表现,其中一种攻击是我们专门针对NAS定制的。在CIFAR-10和CIFAR-100基准上的评估表明,NAS对数据投毒看似鲁棒,即使在较大的投毒预算下,其准确率也仅略有下降。然而,我们证明,对于旨在实现几个百分点准确率提升的NAS算法,这种预期的提升在数据投毒下可能会大幅缩减。我们还表明,这种缩减在不同NAS算法之间存在差异,并分析了影响其鲁棒性的因素。我们的发现是:(1)基于训练的NAS算法最不鲁棒,因为它们依赖于数据。(2)无需训练的NAS方法最为鲁棒,但其生成的架构性能与从搜索空间中随机选择的架构相近。(3)即使使用MNIST等分布外数据,NAS算法也能生成准确率有所提升的架构。最后,我们讨论了潜在的应对措施。我们的代码可在 https://github.com/ztcoalson/NAS-Robustness-to-Data-Poisoning 获取。
Summary / 总结
This study investigates the robustness of neural architecture search (NAS) against data poisoning. The authors developed a poisoning framework to evaluate four NAS algorithms under various data corruption scenarios. Despite showing minor accuracy drops under large poisoning budgets, NAS algorithms designed for small accuracy improvements are significantly affected by data poisoning. The study finds that training-based NAS algorithms are the least robust, while training-free approaches are more robust but produce architectures similar to random selections. The research also demonstrates that NAS can generate accurate architectures using out-of-distribution data. The findings suggest that NAS is not always robust to data poisoning and highlight the need for further research into countermeasures.
研究探讨了神经架构搜索(NAS)在面对数据投毒攻击时的鲁棒性。通过设计一个系统性的投毒框架,研究人员评估了四种NAS算法在四种数据投毒攻击下的表现。尽管NAS显示出轻微的准确率下降,但似乎具有鲁棒性。然而,研究发现某些NAS算法预期的准确率提升在数据投毒下会显著减少。研究结果表明,基于训练的NAS算法较为脆弱,而无需训练的NAS方法较为鲁棒,但生成的架构类似于搜索空间中的随机选择。研究还展示了NAS可以使用异类数据生成改进的架构。研究最后提出了对抗数据投毒的潜在对策。
ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Authors: Liming Kuang, Yordanka Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam
First: 2025-12-09T19:16:28+00:00 · Latest: 2025-12-09T19:16:28+00:00
Abstract
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, significantly outperforming existing methods by over 62% in ADD(-S) score, including those that utilize extensive dataset-specific training.
中文标题/摘要
标题:ConceptPose:无需训练的零样本物体姿态估计
物体姿态估计是计算机视觉和机器人技术中的基本任务,但大多数方法都需要大量的、特定于数据集的训练。同时,大规模的视觉语言模型展示了显著的零样本能力。在本文中,我们通过引入ConceptPose框架,将这两个领域结合起来,ConceptPose是一个既无需训练也无需特定模型的物体姿态估计框架。ConceptPose利用视觉语言模型(VLM)创建开放词汇的3D概念图,其中每个点都标记有一个从注意图中提取的概念向量。通过在概念图之间建立稳健的3D-3D对应关系,我们的方法可以实现精确的6自由度相对姿态估计。在没有任何特定于物体或数据集的训练的情况下,我们的方法在常见的零样本相对姿态估计基准测试中达到了最先进的结果,ADD(-S)得分比现有方法高出62%以上,包括那些利用大量特定于数据集训练的方法。
Summary / 总结
ConceptPose is a training-free and model-free framework for object pose estimation that leverages a vision-language model to create open-vocabulary 3D concept maps. By establishing robust 3D-3D correspondences across these maps, ConceptPose achieves precise 6DoF relative pose estimation. It outperforms existing methods, including those that require extensive training, by over 62% in ADD(-S) score on common zero-shot relative pose estimation benchmarks.
ConceptPose 是一个无需训练且无需特定模型的框架,用于零样本物体姿态估计,它利用视觉语言模型创建开放词汇的3D概念图。通过在这些图之间建立稳健的3D-3D对应关系,ConceptPose 实现了精确的6DoF相对姿态估计。在常见的零样本相对姿态估计基准测试中,它在ADD(-S)分数上比现有方法高出超过62%,证明了其有效性,无需任何物体或数据集特定的训练。
Unified Diffusion Transformer for High-fidelity Text-Aware Image Restoration
Authors: Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, Seungryong Kim
First: 2025-12-09T18:56:54+00:00 · Latest: 2025-12-09T18:56:54+00:00
Abstract
Text-Aware Image Restoration (TAIR) aims to recover high-quality images from low-quality inputs containing degraded textual content. While diffusion models provide strong generative priors for general image restoration, they often produce text hallucinations in text-centric tasks due to the absence of explicit linguistic knowledge. To address this, we propose UniT, a unified text restoration framework that integrates a Diffusion Transformer (DiT), a Vision-Language Model (VLM), and a Text Spotting Module (TSM) in an iterative fashion for high-fidelity text restoration. In UniT, the VLM extracts textual content from degraded images to provide explicit textual guidance. Simultaneously, the TSM, trained on diffusion features, generates intermediate OCR predictions at each denoising step, enabling the VLM to iteratively refine its guidance during the denoising process. Finally, the DiT backbone, leveraging its strong representational power, exploits these cues to recover fine-grained textual content while effectively suppressing text hallucinations. Experiments on the SA-Text and Real-Text benchmarks demonstrate that UniT faithfully reconstructs degraded text, substantially reduces hallucinations, and achieves state-of-the-art end-to-end F1-score performance on the TAIR task.
中文标题/摘要
标题:统一扩散变换器用于高保真文本感知图像恢复
文本感知图像恢复(TAIR)旨在从包含退化文本内容的低质量输入中恢复高质量图像。虽然扩散模型为通用图像恢复提供了强大的生成先验知识,但在以文本为中心的任务中,它们往往会由于缺乏显式的语言知识而产生文本幻觉。为了解决这个问题,我们提出了一种统一的文本恢复框架UniT,该框架以迭代方式结合了扩散变换器(DiT)、视觉语言模型(VLM)和文本检测模块(TSM),以实现高保真文本恢复。在UniT中,VLM从退化图像中提取文本内容,提供显式的文本指导。同时,TSM在每个去噪步骤中生成中间OCR预测,使VLM能够在去噪过程中逐步细化其指导。最后,DiT骨干利用其强大的表征能力,利用这些线索恢复细粒度的文本内容,同时有效抑制文本幻觉。在SA-Text和Real-Text基准测试上的实验表明,UniT能够忠实恢复退化文本,显著减少幻觉,并在TAIR任务中实现最先进的端到端F1分数性能。
Summary / 总结
The paper proposes UniT, a unified framework for text-aware image restoration that integrates a Diffusion Transformer, a Vision-Language Model, and a Text Spotting Module. This approach iteratively refines text content in degraded images by extracting textual information and generating intermediate OCR predictions, which helps in reducing text hallucinations. Experiments show that UniT outperforms existing methods in terms of faithful text reconstruction and reducing hallucinations, achieving state-of-the-art F1-score performance.
该论文提出了一种名为UniT的统一框架,结合了扩散变换器、视觉语言模型和文本检测模块,通过迭代提取文本信息和生成中间OCR预测来细化退化图像中的文本内容,从而抑制文本幻觉。实验表明,UniT在忠实文本重建和减少幻觉方面优于现有方法,并实现了最先进的F1分数性能。
Self-Evolving 3D Scene Generation from a Single Image
Authors: Kaizhi Zheng, Yue Fan, Jing Gu, Zishuo Xu, Xuehai He, Xin Eric Wang
First: 2025-12-09T18:44:21+00:00 · Latest: 2025-12-09T18:44:21+00:00
Abstract
Generating high-quality, textured 3D scenes from a single image remains a fundamental challenge in vision and graphics. Recent image-to-3D generators recover reasonable geometry from single views, but their object-centric training limits generalization to complex, large-scale scenes with faithful structure and texture. We present EvoScene, a self-evolving, training-free framework that progressively reconstructs complete 3D scenes from single images. The key idea is combining the complementary strengths of existing models: geometric reasoning from 3D generation models and visual knowledge from video generation models. Through three iterative stages (Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation), EvoScene alternates between 2D and 3D domains, gradually improving both structure and appearance. Experiments on diverse scenes demonstrate that EvoScene achieves superior geometric stability, view-consistent textures, and unseen-region completion compared to strong baselines, producing ready-to-use 3D meshes for practical applications.
中文标题/摘要
标题:从单张图像自演化生成3D场景
从单张图像生成高质量、纹理化的3D场景仍然是视觉和图形学中的一个基本挑战。最近的图像到3D生成器可以从单个视角恢复合理的几何结构,但它们以对象为中心的训练限制了其在复杂、大规模场景中的泛化能力,这些场景需要忠实的结构和纹理。我们提出了EvoScene,这是一种无需训练的自演化框架,可以逐步从单张图像中重建完整的3D场景。关键思想是结合现有模型的互补优势:3D生成模型的几何推理能力和视频生成模型的视觉知识。通过三个迭代阶段——空间先验初始化、视觉引导的3D场景网格生成和空间引导的新视角生成——EvoScene 在2D和3D领域之间交替,逐步提高结构和外观。在多种场景上的实验表明,与强大的基线相比,EvoScene 实现了更好的几何稳定性、视图一致的纹理以及未见区域的完成,生成了可以直接用于实际应用的3D网格。
Summary / 总结
The research aims to generate high-quality 3D scenes from single images, addressing the limitations of existing object-centric models in handling complex scenes. EvoScene, a self-evolving framework, progressively reconstructs complete 3D scenes through three stages: Spatial Prior Initialization, Visual-guided 3D Scene Mesh Generation, and Spatial-guided Novel View Generation. The key finding is that EvoScene outperforms strong baselines in terms of geometric stability, view-consistent textures, and unseen-region completion, producing ready-to-use 3D meshes for practical applications.
研究旨在从单张图像生成高质量的3D场景,解决现有基于对象的模型在处理复杂场景时的局限性。EvoScene 是一个自我进化的框架,结合了3D生成模型的几何推理能力和视频生成模型的视觉知识。通过三个阶段,它逐步重建完整的3D场景,提高结构和外观。实验表明,EvoScene 在几何稳定性、视图一致的纹理以及未见区域的完成方面优于强基线,生成可用于实际应用的3D网格。
TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models
Authors: Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi
First: 2025-12-05T18:40:18+00:00 · Latest: 2025-12-09T18:19:52+00:00
Abstract
Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.
中文标题/摘要
标题:TRACE:分析和增强视觉语言模型逐步推理的框架
可靠地进行数学和科学推理仍然是大型视觉语言模型面临的开放挑战。标准的最终答案评估往往掩盖了推理错误,允许无声失败持续存在。为了解决这一问题,我们引入了TRACE,一种透明推理和一致性评估框架,该框架诊断推理轨迹而非仅关注最终结果。TRACE的核心在于利用辅助推理集,这是一种分解复杂问题的紧凑子问题答案对,通过基于一致性的度量评估中间步骤,并揭示标准评估中忽略的失败。我们的实验表明,辅助推理集(ARS)的一致性与最终答案的正确性相关,并有助于定位失败出现的推理步骤,提供可操作的信号以改进模型。此外,TRACE定义了置信区域,区分可靠和不可靠的推理路径,支持有效的过滤、调试和模型优化。
Summary / 总结
The research aims to improve the reliability of mathematical and scientific reasoning in large vision-language models by addressing the limitations of standard final-answer evaluation. TRACE, a framework for Transparent Reasoning And Consistency Evaluation, introduces Auxiliary Reasoning Sets to decompose complex problems and evaluate intermediate steps through consistency metrics. Experiments demonstrate that consistency across these sets correlates with final-answer correctness and helps identify specific reasoning steps where failures occur, providing actionable insights for model improvement. Additionally, TRACE defines confidence regions to distinguish between reliable and unreliable reasoning paths, aiding in effective filtering and debugging.
研究旨在通过解决标准最终答案评估的局限性,提高大型视觉-语言模型在数学和科学推理方面的可靠性。TRACE框架引入了辅助推理集来分解复杂问题并评估中间步骤。实验表明,这些集中的一致性与最终答案的正确性相关,并有助于识别出推理步骤中的失败点,提供改进模型的行动指南。此外,TRACE定义了置信区间,以区分可靠的和不可靠的推理路径,支持有效的调试和模型优化。
SATGround: A Spatially-Aware Approach for Visual Grounding in Remote Sensing
Authors: Aysim Toker, Andreea-Maria Oncescu, Roy Miles, Ismail Elezi, Jiankang Deng
First: 2025-12-09T18:15:43+00:00 · Latest: 2025-12-09T18:15:43+00:00
Abstract
Vision-language models (VLMs) are emerging as powerful generalist tools for remote sensing, capable of integrating information across diverse tasks and enabling flexible, instruction-based interactions via a chat interface. In this work, we enhance VLM-based visual grounding in satellite imagery by proposing a novel structured localization mechanism. Our approach involves finetuning a pretrained VLM on a diverse set of instruction-following tasks, while interfacing a dedicated grounding module through specialized control tokens for localization. This method facilitates joint reasoning over both language and spatial information, significantly enhancing the model's ability to precisely localize objects in complex satellite scenes. We evaluate our framework on several remote sensing benchmarks, consistently improving the state-of-the-art, including a 24.8% relative improvement over previous methods on visual grounding. Our results highlight the benefits of integrating structured spatial reasoning into VLMs, paving the way for more reliable real-world satellite data analysis.
中文标题/摘要
标题:SATGround:一种针对遥感领域视觉定位的空间感知方法
视觉语言模型(VLMs)正在成为遥感领域强大的通用工具,能够跨多种任务整合信息,并通过聊天界面实现灵活的指令式交互。在本文中,我们通过提出一种新颖的结构化定位机制,增强了基于VLM的卫星图像视觉定位。我们的方法包括在多样化的指令遵循任务上微调预训练的VLM,并通过专门的控制标记接口连接一个专用的定位模块。该方法促进了语言和空间信息的联合推理,显著增强了模型在复杂卫星场景中精确定位物体的能力。我们在几个遥感基准上评估了我们的框架,始终优于现有方法,包括在视觉定位上的24.8%的相对改进。我们的结果突显了将结构化空间推理集成到VLM中的好处,为更可靠的遥感数据分析铺平了道路。
Summary / 总结
The research aims to enhance visual grounding in satellite imagery using vision-language models (VLMs) by introducing a structured localization mechanism. The method involves fine-tuning a pretrained VLM on various instruction-following tasks and integrating a grounding module with control tokens for spatial localization. This approach improves joint reasoning over language and spatial information, leading to a 24.8% relative improvement in visual grounding performance on remote sensing benchmarks compared to previous methods.
本文提出了SATGround方法,通过一种空间感知的方法增强卫星图像中的视觉定位。该方法涉及对预训练的视觉-语言模型进行微调,使其能够执行各种指令跟随任务,并通过专用的控制标记集成一个专门的定位模块。这种方法提高了模型在复杂卫星场景中精确定位物体的能力,相比之前的方法在视觉定位基准测试中取得了24.8%的相对改进。
The Missing Point in Vision Transformers for Universal Image Segmentation
Authors: Sajjad Shahabodini, Mobina Mansoori, Farnoush Bayatmakou, Jamshid Abouei, Konstantinos N. Plataniotis, Arash Mohammadi
First: 2025-05-26T10:29:13+00:00 · Latest: 2025-12-09T17:56:45+00:00
Abstract
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask-based approaches yield high-quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT-P, a novel two-stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class-agnostic mask proposals, while the second stage utilizes a point-based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT-P serves as a pre-training-free adapter, allowing the integration of various pre-trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT-P, achieving state-of-the-art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad-sh33/ViT-P
中文标题/摘要
标题:视觉变换器在通用图像分割中的缺失点
图像分割仍然是计算机视觉中的一个挑战性任务,需要稳健的掩码生成和精确的分类。基于掩码的方法通过捕捉全局上下文来生成高质量的掩码。然而,在存在模糊边界和类别分布不平衡的情况下,准确地对这些掩码进行分类仍然是一个开放的挑战。在本文中,我们引入了ViT-P,这是一种新颖的两阶段分割框架,将掩码生成与分类脱钩。第一阶段使用提案生成器生成无类别的掩码提案,而第二阶段则利用基于视觉变换器(ViT)的点分类模型,通过关注掩码中心点来细化预测。ViT-P 作为一种无需预训练的适配器,允许将各种预训练的视觉变换器无缝集成到其架构中,确保其对密集预测任务的适应性。此外,我们证明粗略和边界框注释可以有效提高分类性能,而无需在精细注释数据集上进行额外训练,从而降低注释成本并保持强大的性能。在COCO、ADE20K和Cityscapes数据集上的广泛实验验证了ViT-P的有效性,分别在ADE20K全景分割中达到54.0 PQ,在Cityscapes语义分割中达到87.4 mIoU,在ADE20K语义分割中达到63.6 mIoU。代码和预训练模型可在:https://github.com/sajjad-sh33/ViT-P 获取。
Summary / 总结
This work addresses the challenge of image segmentation by proposing ViT-P, a two-stage framework that separates mask generation from classification. It uses a proposal generator for mask proposals and a point-based classification model based on Vision Transformer for refinement. ViT-P achieves state-of-the-art results on ADE20K, Cityscapes, and COCO datasets without additional pre-training, demonstrating its effectiveness and adaptability to dense prediction tasks.
该研究提出了一种两阶段框架ViT-P,通过分离掩码生成和分类来解决图像分割中准确掩码分类的挑战。第一阶段生成无类别的掩码提案,第二阶段使用基于Vision Transformer的点分类模型来细化预测。ViT-P在ADE20K、Cityscapes和COCO数据集上取得了最先进的结果,无需额外预训练,展示了其在密集预测任务中的有效性和适应性。粗略和边界框注释可以增强分类,无需额外的精细注释数据集,从而降低注释成本。
Tri-Bench: Stress-Testing VLM Reliability on Spatial Reasoning under Camera Tilt and Object Interference
Authors: Amit Bendkhale
Venue: AAAI 2026
First: 2025-12-09T17:52:57+00:00 · Latest: 2025-12-09T17:52:57+00:00
Comments: 6 pages, 3 figures. Code and data: https://github.com/Amiton7/Tri-Bench. Accepted to the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Abstract
Verifiable geometric reasoning is a critical component for trustworthy and controllable agentic AI. Despite impressive capabilities, Vision-Language Models (VLMs) often fail under realistic scene changes. We present Tri-Bench, a compact benchmark of planar triangle problems that isolates relative geometric reasoning while stressing two deployment-critical factors: camera pose (planar vs. tilted) and scene context via object interference (10 everyday objects). To test verifiability and control, we evaluate four recent VLMs using a single, fixed prompt whose guardrail explicitly describes a surrounding square border, enabling correct answers via homography. We evaluate six simple tasks over binary and continuous targets, and observe that the overall accuracy with respect to 3D ground truth is modest, ~69% on average (best ~75%, worst ~64%). The same responses align even more closely with 2D projections in the image plane, where mean accuracy is ~72%. All four VLMs consistently fail, with accuracy falling to ~0%, on recognizing minority shape classes (equilateral, isosceles, right-angled triangles). Additionally, overall VLM accuracy degrades by ~4.1% under camera tilt. This demonstrates that models fail to correctly utilize the explicit frame-of-reference hint provided in the prompt and default to 2D image plane cues. Finally, we find that object interference has no significant effect on VLM accuracy.
中文标题/摘要
标题:Tri-Bench:在相机倾斜和物体干扰下对视觉-语言模型在空间推理可靠性的压力测试
可验证的几何推理是值得信赖和可控的代理人工智能的关键组成部分。尽管具有令人印象深刻的性能,但在现实场景变化下,视觉-语言模型(VLMs)经常失败。我们提出了Tri-Bench,这是一个紧凑的基准测试,专注于平面三角形问题,以隔离相对几何推理,同时强调两个关键部署因素:相机姿态(平面 vs. 倾斜)和通过物体干扰(10种日常生活中的物体)的场景上下文。为了测试可验证性和可控性,我们使用一个单一的固定提示来评估四个最近的VLMs,该提示的护栏明确描述了一个周围的正方形边界,从而可以通过齐次变换获得正确答案。我们评估了六个简单的任务,涉及二进制和连续目标,观察到相对于3D地面真实值的整体准确性较低,平均约为69%(最佳约为75%,最差约为64%)。同样的响应在图像平面的二维投影中与2D投影的准确性更加一致,平均准确性约为72%。所有四个VLMs在识别少数形状类别(等边、等腰、直角三角形)时都表现一致不佳,准确率降至约0%。此外,总体VLM准确性在相机倾斜下下降了约4.1%。这表明模型未能正确利用提示中提供的明确框架参考提示,而是默认使用2D图像平面线索。最后,我们发现物体干扰对VLM准确性没有显著影响。
Summary / 总结
The research aims to evaluate the reliability of Vision-Language Models (VLMs) in geometric reasoning under realistic scene changes. Tri-Bench, a benchmark of planar triangle problems, tests VLMs under varying camera pose (planar vs. tilted) and object interference. Four recent VLMs were evaluated using a fixed prompt with a guardrail describing a surrounding square border, yielding an average accuracy of ~69% with respect to 3D ground truth. Accuracy falls to ~0% on minority shape classes (equilateral, isosceles, right-angled triangles), and overall accuracy degrades by ~4.1% under camera tilt. Object interference had no significant impact on accuracy.
研究旨在评估视觉-语言模型(VLMs)在现实场景变化下的几何推理可靠性。Tri-Bench 是一个平面三角形问题基准,考察模型在相机姿态(平面 vs. 倾斜)和物体干扰下的表现。研究使用带有正方形边界描述的固定提示评估了四个近期的VLMs,相对于3D真实值的平均准确率约为69%;在识别少数形状类别(等边、等腰、直角三角形)时准确率降至约0%,相机倾斜使总体准确率下降约4.1%。物体干扰对准确率没有显著影响。
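The Tri-Bench abstract notes that the square border described in the prompt makes correct answers recoverable via homography. A minimal sketch of that idea follows, in plain Python: it fits the homography that maps the four imaged border corners to a unit square, then measures a triangle in the rectified plane. All coordinates and the function names `solve_linear`, `fit_homography`, and `apply_homography` are hypothetical illustrations, not taken from the Tri-Bench code or data.

```python
import math

def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_homography(src, dst):
    """Fit H (row-major, h33 fixed to 1) from four point pairs src -> dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs) + [1.0]

def apply_homography(h, pt):
    """Map an image point into the rectified plane."""
    x, y = pt
    w = h[6] * x + h[7] * y + h[8]
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)

# Hypothetical pixel coordinates of the square border under camera tilt.
border_px = [(100.0, 120.0), (420.0, 90.0), (460.0, 380.0), (80.0, 400.0)]
unit_square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
H = fit_homography(border_px, unit_square)

# Hypothetical triangle vertices in the same image.
triangle_px = [(180.0, 200.0), (350.0, 180.0), (260.0, 330.0)]
rectified = [apply_homography(H, p) for p in triangle_px]
sides = [math.dist(rectified[i], rectified[(i + 1) % 3]) for i in range(3)]
print(sides)  # side lengths in units of the border's side length
```

Ratios of the rectified side lengths are what a relative-geometry question would compare. Measuring the same triangle directly in pixel coordinates corresponds to the 2D image-plane cues that the abstract says the models default to.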
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
First: 2025-12-09T17:18:32+00:00 · Latest: 2025-12-09T17:18:32+00:00
Comments: 16 pages, 8 figures
Abstract
Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6x inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
中文标题/摘要
标题:InfiniteVL:结合线性和稀疏注意机制以实现高效且无限制输入的视觉-语言模型
窗口注意和线性注意是缓解视觉-语言模型(VLM)中二次复杂性和不断增长的KV缓存的两种主要策略。然而,我们发现基于窗口的VLM在序列长度超过窗口大小时会性能下降,而线性注意在OCR和文档理解等信息密集型任务中表现不佳。为克服这些限制,我们提出了InfiniteVL,这是一种结合滑动窗口注意(SWA)和门控DeltaNet的线性复杂度VLM架构。为了在资源受限的情况下实现竞争性的多模态性能,我们设计了三阶段训练策略,包括蒸馏预训练、指令调优和长序列SFT。令人惊讶的是,使用比领先VLM少于2%的训练数据,InfiniteVL不仅大幅优于之前的线性复杂度VLM,还与基于Transformer的领先VLM性能相当,同时展示了有效的长期记忆保留。与通过FlashAttention-2加速的类似规模的Transformer-based VLM相比,InfiniteVL实现了超过3.6倍的推理加速,同时保持了恒定的延迟和内存占用。在流式视频理解场景中,它能够保持稳定的每秒24帧实时预填充速度,同时保留长期记忆缓存。代码和模型可在https://github.com/hustvl/InfiniteVL获取。
Summary / 总结
InfiniteVL is a linear-complexity Vision-Language Model that combines sliding window attention with Gated DeltaNet to address the limitations of window-based and linear attention approaches. It employs a three-stage training strategy and, using less than 2% of the training data required by leading VLMs, matches the performance of leading Transformer-based models. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, it achieves over 3.6x inference speedup with constant latency and memory footprint. It also sustains a stable 24 FPS real-time prefill speed in streaming video understanding scenarios while retaining long-term memory.
InfiniteVL 是一种结合滑动窗口注意力和 Gated DeltaNet 的线性复杂度视觉-语言模型,旨在克服基于窗口的模型和线性注意力方法的局限性。它采用三阶段训练策略,仅使用不到领先 VLM 所需训练数据的 2%,就达到了与领先的基于 Transformer 的 VLM 相当的性能。与使用 FlashAttention-2 加速的类似规模 Transformer 模型相比,InfiniteVL 实现了超过 3.6 倍的推理加速,同时保持恒定的延迟和内存占用。在流式视频理解场景中,它能够维持稳定的 24 FPS 实时预填充速度,并保留长期记忆缓存。
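Sliding window attention, one of the two components the InfiniteVL abstract names, bounds the KV cache because each query position may only attend to a fixed number of recent positions. The sketch below builds a generic causal sliding-window mask in plain Python to make that bound concrete. It is not InfiniteVL's implementation, and the names `sliding_window_mask` and `max_keys_attended` are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query q may attend to key k.

    Causal sliding window: key k is visible to query q only if it is
    not in the future (k <= q) and lies within the last `window`
    positions (q - k < window).
    """
    return [[0 <= q - k < window for k in range(seq_len)]
            for q in range(seq_len)]

def max_keys_attended(mask):
    """Largest number of keys any single query attends to."""
    return max(sum(row) for row in mask)

mask = sliding_window_mask(seq_len=8, window=3)
for row in mask:
    print("".join("x" if visible else "." for visible in row))

# However long the sequence grows, no query sees more than `window`
# keys, so only that many key/value pairs need to stay cached.
print(max_keys_attended(mask))
```

The same bound is why, per the abstract, window-only models degrade once the sequence exceeds the window size: tokens that fall out of the window are no longer visible, which is the gap the linear-attention component is meant to cover.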