arXiv Paper Digest / arXiv 论文速递

2026-01-28 03:41
Snapshot: 20260128_0341
Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
Authors: Mumin Jia, Jairo Diaz-Rodriguez
First: 2026-01-26T18:54:34+00:00 · Latest: 2026-01-26T18:54:34+00:00
Comments: arXiv admin note: substantial text overlap with arXiv:2510.03437
Abstract
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
中文标题/摘要
标题:基于句子嵌入核变化点检测的无监督文本分段
无监督文本分段至关重要,因为边界标签成本高、主观性强且难以在不同领域和粒度选择间转移。我们提出了一种无需训练的方法Embed-KCPD,将句子表示为嵌入向量,并通过最小化惩罚核变化点检测目标来估计边界。除了算法实现,我们还开发了关于$m$依赖序列的核变化点检测的第一种依赖意识理论,这是一种语言中常见的短程依赖的有限记忆抽象。我们证明了总体惩罚风险的oracle不等式,并证明了每个真实变化点在相对于段长度较小的窗口内被恢复。为了将理论与实践连接起来,我们引入了一种基于LLM的模拟框架,生成具有可控有限记忆依赖和已知边界的合成文档,验证了预测的缩放行为。在标准分段基准测试中,Embed-KCPD经常优于强大的无监督基线。对泰勒·斯威夫特的推文进行的案例研究表明,Embed-KCPD结合了强大的理论保证、模拟可靠性以及文本分段的实际有效性。
Summary / 总结
The paper addresses unsupervised text segmentation, motivated by the high cost and subjectivity of boundary labels. It introduces Embed-KCPD, a training-free method that represents sentences as embeddings and estimates boundaries by minimizing a penalized kernel change-point detection objective. Theoretical guarantees under $m$-dependent sequences include an oracle inequality and a localization guarantee. An LLM-based simulation framework with known boundaries validates the predicted scaling behavior, and on standard segmentation benchmarks Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates the method in practice.
论文提出了一种名为Embed-KCPD的无监督文本分割方法,该方法无需训练,利用句子嵌入和带惩罚的核变化点检测来估计边界。在$m$依赖序列下的理论保证包括一个oracle不等式和一个定位保证。基于大语言模型的模拟框架验证了理论预测的缩放行为;在标准分段基准上,该方法经常优于强无监督基线。对泰勒·斯威夫特推文的案例研究展示了该方法的实际效果。
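The abstract describes boundary estimation as minimizing a penalized kernel change-point objective over sentence embeddings. A minimal pure-Python sketch of that general idea follows; it is not the authors' implementation. The RBF kernel, the `gamma` bandwidth, the constant per-boundary `penalty`, and the exhaustive O(n^2) dynamic program are all assumptions chosen for illustration, and `points` stands in for sentence embedding vectors.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kcpd(points, penalty, gamma=1.0):
    """Return change-point indices minimizing
    sum of within-segment kernel scatter + penalty * (number of boundaries)."""
    n = len(points)
    K = [[rbf(p, q, gamma) for q in points] for p in points]

    # 2D prefix sums of the Gram matrix and 1D prefix sums of its diagonal,
    # so each segment cost is O(1).
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    D = [0.0] * (n + 1)
    for i in range(n):
        D[i + 1] = D[i] + K[i][i]
        for j in range(n):
            S[i + 1][j + 1] = K[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]

    def cost(s, e):
        # Kernel scatter of points[s:e]: sum k(x_i, x_i) - mean-embedding term.
        block = S[e][e] - S[s][e] - S[e][s] + S[s][s]
        return (D[e] - D[s]) - block / (e - s)

    best = [0.0] + [math.inf] * n   # best[e]: optimal objective for points[:e]
    prev = [0] * (n + 1)            # prev[e]: start of the last segment
    for e in range(1, n + 1):
        for s in range(e):
            cand = best[s] + cost(s, e) + (penalty if s > 0 else 0.0)
            if cand < best[e]:
                best[e], prev[e] = cand, s

    # Backtrack through segment starts to recover the boundaries.
    boundaries = []
    e = n
    while e > 0:
        s = prev[e]
        if s > 0:
            boundaries.append(s)
        e = s
    return sorted(boundaries)
```

A larger `penalty` yields fewer boundaries; in this toy form, a penalty above the total scatter of the sequence yields no boundaries at all.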
HiCache: A Plug-in Scaled-Hermite Upgrade for Taylor-Style Cache-then-Forecast Diffusion Acceleration
Authors: Liang Feng, Shikang Zheng, Jiacheng Liu, Yuqi Lin, Qinming Zhou, Peiliang Cai, Xinyu Wang, Junjie Chen, Chang Zou, Yue Ma, Linfeng Zhang
First: 2025-08-23T10:35:16+00:00 · Latest: 2026-01-26T18:39:41+00:00
Abstract
Diffusion models have achieved remarkable success in content generation but often incur prohibitive computational costs due to iterative sampling. Recent feature caching methods accelerate inference via temporal extrapolation, yet can suffer quality degradation from inaccurate modeling of the complex dynamics of feature evolution. We propose HiCache (Hermite Polynomial-based Feature Cache), a training-free acceleration framework that improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature-derivative approximations in diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials as a potentially optimal basis for Gaussian-correlated processes. We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy, and is also effective when applied standalone or integrated with TaylorSeer. Extensive experiments demonstrate HiCache's superiority, achieving 5.55x speedup on FLUX.1-dev while matching or exceeding baseline quality, and maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Moreover, HiCache can be naturally added to previous caching methods to enhance their performance, e.g., improving ClusCa from 0.9480 to 0.9840 in terms of image rewards. Code: https://github.com/fenglang918/HiCache
中文标题/摘要
标题:HiCache:面向Taylor式“先缓存后预测”扩散加速的即插即用Scaled-Hermite升级
扩散模型在内容生成方面取得了显著成功,但由于迭代采样,往往会产生高昂的计算成本。最近的特征缓存方法通过时间外推加速推理,但可能会因对特征演变复杂动力学的不准确建模而降低质量。我们提出了HiCache(基于Hermite多项式的特征缓存),这是一种无需训练的加速框架,通过将数学工具与经验特性对齐来提高特征预测能力。我们的核心见解是,扩散Transformer中的特征导数近似表现出多元高斯特性,这促使我们使用Hermite多项式作为高斯相关过程的潜在最优基。我们还引入了一种双重缩放机制,以确保数值稳定性同时保持预测准确性,并且在单独使用或与TaylorSeer集成时也有效。广泛的实验表明HiCache的优越性,在FLUX.1-dev上实现了5.55倍的加速,同时匹配或超过了基线质量,并在文本到图像、视频生成和超分辨率任务中保持了强大的性能。此外,HiCache可以自然地添加到先前的缓存方法中以增强其性能,例如,将ClusCa的图像奖励从0.9480提高到0.9840。代码:https://github.com/fenglang918/HiCache
Summary / 总结
HiCache is a training-free acceleration framework for diffusion models that uses Hermite polynomials to improve feature prediction, reducing computational costs while maintaining quality. It introduces a dual-scaling mechanism for numerical stability and can be integrated with existing methods like TaylorSeer. Experiments show HiCache achieves a 5.55x speedup on FLUX.1-dev with comparable or better quality, and enhances other caching methods like ClusCa in image generation tasks.
HiCache 是一种无需训练的加速框架,利用 Hermite 多项式改进特征预测,降低计算成本同时保持质量。它引入了双重缩放机制以确保数值稳定性,并且可以与 TaylorSeer 等现有方法集成。实验结果显示,HiCache 在 FLUX.1-dev 上实现了 5.55 倍的加速,质量与基线相当或更好,并且可以增强其他缓存方法如 ClusCa 的图像生成效果。
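To make the "Hermite basis with scaling" idea concrete, here is a small sketch that evaluates Hermite polynomials by their standard three-term recurrence and applies an input scale before evaluation. Which Hermite convention HiCache uses and how its dual-scaling mechanism is defined are not stated in the abstract, so the physicists' convention and the single `sigma` factor below are assumptions for illustration only.

```python
def hermite_basis(x, order):
    """Physicists' Hermite polynomials H_0..H_order at x, using
    H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x)."""
    values = [1.0]
    if order >= 1:
        values.append(2.0 * x)
    for n in range(1, order):
        values.append(2.0 * x * values[n] - 2.0 * n * values[n - 1])
    return values


def scaled_hermite_basis(x, order, sigma):
    """Evaluate the basis at sigma * x (sigma is a hypothetical scale factor).
    Contracting the input keeps high-order terms from growing too quickly."""
    return hermite_basis(sigma * x, order)
```

For example, `hermite_basis(1.0, 3)` gives the values of 1, 2x, 4x^2 - 2, and 8x^3 - 12x at x = 1.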
Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems
Authors: Jusheng Zhang, Yijia Fan, Kaitong Cai, Jing Yang, Jiawei Yao, Jian Wang, Guanlong Qu, Ziliang Chen, Keze Wang
Venue: ICLR 2026
First: 2026-01-26T17:58:53+00:00 · Latest: 2026-01-26T17:58:53+00:00
Comments: Accepted to ICLR 2026
Abstract
Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often spirals costs. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability-driven trading among agents based on rational economic rules. A market-aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3x. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.
中文标题/摘要
标题:为何将疑虑藏在心中?在多智能体 bandit 系统中交易视觉不确定性
视觉-语言模型(VLMs)能够实现强大的多智能体系统,但将其扩展在经济上是不可持续的:在信息不对称的情况下协调异构智能体往往会导致成本螺旋上升。现有的范式,如混合智能体和知识路由器,依赖于忽略成本的启发式代理,导致不确定性结构的坍塌,从而导致可证明的次优协调。我们提出了Agora框架,将协调重新定义为不确定性的一种分散市场。Agora将知识论不确定性结构化为可交易的资产(感知、语义、推理),并基于理性经济规则在智能体之间实施基于盈利能力的交易。市场意识经纪人扩展了Thompson抽样,启动合作并引导系统向成本效率均衡发展。在五个跨模态基准(MMMU、MMBench、MathVision、InfoVQA、CC-OCR)上的实验表明,Agora在性能上优于强大的VLMs和启发式多智能体策略,例如在MMMU上比最佳基线高出8.5%的准确率,同时成本降低超过3倍。这些结果确立了基于市场的协调作为一种原理上可行且可扩展的范式,用于构建经济上可行的多智能体视觉智能系统。
Summary / 总结
The paper addresses the cost of coordinating heterogeneous agents under information asymmetry in multi-agent systems built on Vision-Language Models (VLMs). It proposes Agora, a framework that recasts coordination as a decentralized market in which structured epistemic uncertainty (perceptual, semantic, inferential) is traded among agents under profitability-driven rules, with a market-aware broker that extends Thompson Sampling initiating collaboration. Experiments on five multimodal benchmarks show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., +8.5% accuracy over the best baseline on MMMU at over 3x lower cost.
论文针对使用视觉语言模型(VLMs)协调异构代理时的经济效率问题,提出了一种名为Agora的框架,将知识不确定性转化为可交易资产,并促使代理基于盈利驱动进行交易。实验结果显示,Agora在五个跨模态基准上优于强VLMs和启发式策略,实现了更高的准确率并降低了超过3倍的成本。
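The abstract says Agora's broker extends Thompson Sampling. As background, the sketch below shows plain Beta-Bernoulli Thompson Sampling with a simple cost term subtracted from each sampled value. The cost term, `cost_weight`, and the function names are hypothetical illustrations and do not reproduce Agora's market-aware broker.

```python
import random


def thompson_select(successes, failures, costs, cost_weight, rng):
    """Sample a success rate per arm from its Beta posterior, subtract a
    weighted cost (an assumed scoring rule), and pick the best score."""
    scores = []
    for s, f, c in zip(successes, failures, costs):
        sample = rng.betavariate(s + 1, f + 1)  # Beta(1, 1) prior
        scores.append(sample - cost_weight * c)
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(true_rates, costs, rounds, cost_weight=0.1, seed=0):
    """Simulate Bernoulli arms and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes, failures, pulls = [0] * k, [0] * k, [0] * k
    for _ in range(rounds):
        arm = thompson_select(successes, failures, costs, cost_weight, rng)
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls
```

With equal costs, the arm with the higher true success rate ends up pulled far more often as its posterior sharpens.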
Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge
Authors: Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen, Ziye Wang, Ximeng Meng, Stone Tao, Yiran Qin, Xiaohong Liu, Ruimao Zhang, Lei Bai, Yilun Du, Hao Su, Philip Torr, Zhenfei Yin, Ruihao Gong, Yejun Zeng, Fengjun Zhong, Shenghao Jin, Jinyang Guo, Xianglong Liu, Xiaojun Jia, Tianqi Shan, Wenqi Ren, Simeng Qin, Jialing Yang, Xiaoyu Ma, Tianxing Chen, Zixuan Li, Zijian Cai, Yan Qin, Yusen Qin, Qiangyu Chen, Kaixuan Wang, Zhaoming Han, Yao Mu, Ping Luo, Yuanqi Yao, Haoming Song, Jan-Nico Zaech, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Venue: NeurIPS 2025
First: 2026-01-26T17:56:19+00:00 · Latest: 2026-01-26T17:56:19+00:00
Comments: MARS Challenge @ NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI. Challenge page: https://mars-eai.github.io/MARS-Challenge-Webpage/
Abstract
Recent advancements in multimodal large language models and vision-language-action models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.
中文标题/摘要
标题:多智能体机器人系统(MARS)挑战的进展与创新
近期多模态大型语言模型和视觉-语言-动作模型的进展显著推动了具身智能的发展。随着领域向更复杂的任务场景过渡,多智能体系统框架变得必不可少,以实现可扩展、高效和协作的解决方案。这一转变由三个主要因素推动:增强的智能体能力、通过任务委派提高系统效率以及实现高级的人机交互。为应对多智能体协作带来的挑战,我们提出了多智能体机器人系统(MARS)挑战,该挑战在NeurIPS 2025 SpaVLE研讨会(视觉、语言与具身智能中的空间)上举办。竞赛集中在两个关键领域:规划与控制,参赛者利用视觉-语言模型(VLMs)进行多智能体具身规划以协调任务,并通过策略执行在动态环境中完成机器人操作。通过评估参赛者提交的解决方案,挑战提供了有关具身多智能体系统设计和协调的宝贵见解,为先进协作人工智能系统的未来发展做出了贡献。
Summary / 总结
This challenge report aims to advance Embodied AI through multi-agent systems, addressing the need for scalable and efficient solutions in complex task scenarios. The MARS Challenge, held at the NeurIPS 2025 Workshop on SpaVLE, covers two areas: planning, where vision-language models coordinate multi-agent embodied tasks, and control, where policies execute robotic manipulation in dynamic environments. Evaluating the submitted solutions yields insights into the design and coordination of embodied multi-agent systems.
该挑战赛报告旨在通过多智能体系统推进具身智能,解决复杂任务场景下的可扩展性和高效性需求。MARS挑战在NeurIPS 2025 SpaVLE研讨会上举办,涵盖两个方面:规划,即利用视觉-语言模型协调多智能体具身任务;控制,即通过策略执行在动态环境中完成机器人操作。对参赛方案的评估为具身多智能体系统的设计和协调提供了见解。
MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning
Authors: Peng Xia, Jinglu Wang, Yibo Peng, Kaide Zeng, Zihan Dong, Xian Wu, Xiangru Tang, Hongtu Zhu, Yun Li, Linjun Zhang, Shujie Liu, Yan Lu, Huaxiu Yao
Venue: ICLR 2026
First: 2025-05-31T13:22:55+00:00 · Latest: 2026-01-26T17:15:26+00:00
Comments: ICLR 2026
Abstract
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal diagnostic tasks. However, existing single-agent models struggle to generalize across diverse medical specialties, limiting their performance. Recent efforts introduce multi-agent collaboration frameworks inspired by clinical workflows, where general practitioners (GPs) and specialists interact in a fixed sequence. Despite improvements, these static pipelines lack flexibility and adaptability in reasoning. To address this, we propose MMedAgent-RL, a reinforcement learning (RL)-based multi-agent framework that enables dynamic, optimized collaboration among medical agents. Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists and its own knowledge to make final decisions. To address the inconsistency in specialist outputs, we introduce a curriculum learning (CL)-guided RL strategy with dynamic entropy regulation, progressively teaching the attending physician to balance between imitating specialists and correcting their mistakes. Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL outperforms both open-source and proprietary Med-LVLMs. Notably, it achieves an average performance gain of 23.6% over strong baselines.
中文标题/摘要
标题:MMedAgent-RL:多模态医疗推理中多智能体协作的优化
医疗大型视觉-语言模型(Med-LVLMs)在多模态诊断任务中显示出强大的潜力。然而,现有的单智能体模型难以在多种医学专科之间泛化,限制了其性能。最近的努力引入了受临床工作流程启发的多智能体协作框架,其中全科医生(GPs)和专科医生按固定顺序交互。尽管有所改进,但这些静态管道缺乏推理的灵活性和适应性。为了解决这个问题,我们提出了一种基于强化学习(RL)的多智能体框架MMedAgent-RL,该框架能够实现医疗智能体之间的动态、优化协作。具体来说,我们通过RL训练了两个基于Qwen2.5-VL的全科医生智能体:分诊医生学习将患者分配到合适的专科,而主治医生则整合多专科医生的判断和自身知识来做出最终决定。为了解决专科医生输出不一致的问题,我们引入了一种带有动态熵调节的课程学习(CL)引导的RL策略,逐步教导主治医生在模仿专科医生和纠正其错误之间取得平衡。在五个医疗VQA基准上的实验表明,MMedAgent-RL优于开源和专有Med-LVLMs。值得注意的是,它在强基线上的平均性能提高了23.6%。
Summary / 总结
MMedAgent-RL is a reinforcement learning-based multi-agent framework designed to optimize collaboration among medical agents for improved diagnostic performance. It trains two general practitioner agents to triage patients and integrate specialist judgments, respectively. A curriculum learning strategy with dynamic entropy regulation is used to enhance the attending physician's ability to balance imitation and correction. Experiments show MMedAgent-RL outperforms existing models on five medical VQA benchmarks, achieving a 23.6% average performance gain over strong baselines.
MMedAgent-RL 是一个基于强化学习的多智能体框架,旨在优化医疗智能体之间的协作以提高多模态诊断任务的表现。该框架训练两个全科医生智能体进行患者分诊和综合专科判断。通过动态熵调节引导的课程学习策略,增强主治医生平衡模仿和纠正的能力。实验表明,MMedAgent-RL 在五个医学 VQA 基准测试中优于现有模型,平均性能提升 23.6%。
Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge
Authors: Xiao Liu, Jiawei Zhang
First: 2026-01-26T17:14:57+00:00 · Latest: 2026-01-26T17:14:57+00:00
Comments: Work in progress
Abstract
Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.
中文标题/摘要
标题:视频生成模型在地理上公平吗?基于景点吸引力的全球视觉知识评估
近期文本到视频生成技术取得了令人信服的视觉成果,但尚不清楚这些模型是否编码了地理上公平的视觉知识。本文通过基于景点吸引力的评估,研究了文本到视频模型的地理公平性和地理上扎根的视觉知识。我们引入了Geo-Attraction Landmark Probing (GAP),这是一种系统框架,用于评估模型如何忠实合成来自不同地区的旅游景点,构建了包含500个全球分布的景点的GEOATTRACTION-500基准,这些景点覆盖了不同的地区和受欢迎程度。GAP 结合了互补的指标,将整体视频质量与景点特定知识分离,包括全球结构对齐、细粒度关键点对齐以及视觉-语言模型判断,所有这些都经过了人类评估的验证。将GAP 应用于最先进的文本到视频模型Sora 2,我们发现,与常见的地理偏见假设相反,该模型在不同地区、发展水平和文化群体中表现出相对均匀的地理扎根视觉知识水平,对景点受欢迎程度的依赖性较弱。这些结果表明,当前的文本到视频模型比预期更均匀地表达了全球视觉知识,既突显了其在全球部署应用中的潜力,也强调了随着此类系统的发展需要继续进行评估。
Summary / 总结
This work evaluates the geographic fairness of text-to-video generation models by introducing Geo-Attraction Landmark Probing (GAP), a systematic framework that assesses the models' ability to synthesize tourist attractions from diverse regions. Applying GAP to the state-of-the-art model Sora 2, the study finds that the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions and cultural groupings, with only weak dependence on attraction popularity, challenging the common assumption of strong geographic bias.
这项工作通过引入Geo-Attraction Landmark Probing (GAP)系统框架,评估文本到视频生成模型在合成来自不同地区的旅游景点方面的地理公平性。将GAP应用于最先进的模型Sora 2后,研究发现该模型在不同地区、发展水平和文化群体中的地理定位视觉知识表现出相对均匀的水平,对景点的受欢迎程度依赖性较弱,这挑战了对强地理偏见的常见假设。
A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Authors: Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan
Venue: ICLR 2026
First: 2025-10-30T12:45:24+00:00 · Latest: 2026-01-26T17:12:54+00:00
Comments: Accepted at ICLR 2026
Abstract
Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
中文标题/摘要
标题:A-TPT:视觉语言模型测试时提示调谐的角多样性校准特性
测试时提示调谐(TPT)已成为一种有前途的技术,用于在无需依赖标记数据的情况下将大型视觉语言模型(VLMs)适应未见过的任务。然而,文本特征之间的缺乏分散性会损害校准性能,这引起了人们对VLMs的可靠性和安全性方面的担忧。当前的TPT方法主要集中在通过最大化平均文本特征分散度或施加正交约束来鼓励角度分离,从而提高提示校准。然而,这些方法可能无法始终在类别间文本特征之间实现最优的角度分离,这意味着忽视了角多样性的关键作用。为了解决这个问题,我们提出了一种新颖的TPT框架A-TPT,该框架引入了角多样性,以鼓励由相应可学习提示诱导的归一化文本特征的分布均匀性。这种均匀性是通过最大化单位超球面上特征之间的最小成对角度距离来实现的。我们通过在不同数据集上使用各种骨干网络进行广泛实验,展示了我们的方法在降低累积平均校准误差方面始终优于最先进的TPT方法,同时保持了相当的准确性。值得注意的是,我们的方法在自然分布转移的零样本校准性能方面表现出色,并且在医学数据集上具有良好的泛化能力。我们提供了广泛的分析,包括理论方面,以建立A-TPT的基础。这些结果突显了促进角多样性以实现分散良好的文本特征的潜力,显著提高了VLM在测试时适应过程中的校准。我们的代码将公开发布。
Summary / 总结
The paper introduces A-TPT, a novel test-time prompt tuning framework that enhances the angular diversity of textual features to improve the calibration of vision-language models. By maximizing the minimum pairwise angular distance between features, A-TPT consistently outperforms existing methods in reducing calibration error while maintaining accuracy across various backbones and datasets. It particularly excels in zero-shot calibration for natural distribution shifts and generalizes well to medical datasets.
论文针对测试时提示调优(TPT)中视觉-语言模型(VLM)文本特征缺乏角度多样性的问题,可能导致校准性能下降。为此,提出了A-TPT框架,通过最大化特征之间的最小角度距离来确保分布均匀。大量实验表明,A-TPT在减少校准误差、保持准确性方面优于现有方法,特别是在零样本设置和医疗数据集上表现更佳。
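The quantity the abstract describes maximizing, the minimum pairwise angular distance between normalized textual features on the unit hypersphere, can be computed as in the sketch below. How A-TPT turns this quantity into a training loss and combines it with other terms is not given in the abstract; the function names are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def min_pairwise_angle(features):
    """Smallest angle (radians) between any two normalized feature vectors.
    A diversity objective would push this value up, e.g. by minimizing
    its negative (that loss form is an assumption)."""
    unit = [normalize(f) for f in features]
    smallest = math.pi
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            cos = max(-1.0, min(1.0, cos))  # guard acos against rounding
            smallest = min(smallest, math.acos(cos))
    return smallest
```

Only the closest pair determines the value, which is what distinguishes this objective from maximizing the average dispersion.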
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
Authors: Peiran Wu, Zhuorui Yu, Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen
Venue: ICLR 2026
First: 2025-10-09T08:07:19+00:00 · Latest: 2026-01-26T16:58:19+00:00
Comments: Accepted at ICLR 2026
Abstract
The rapid progress of large language models (LLMs) has laid the foundation for multimodal models. However, visual language models (VLMs) still face heavy computational costs when extended from images to videos due to high frame rates and long durations. Token compression is a promising solution, yet most existing training-free methods cause information loss and performance degradation. To overcome this, we propose Memory-Augmented Reinforcement Learning-based Token Compression (MARC), which integrates structured retrieval and RL-based distillation. MARC adopts a retrieve-then-compress strategy using a Visual Memory Retriever (VMR) to select key clips and a Compression Group Relative Policy Optimization (C-GRPO) framework to distil reasoning ability from a teacher to a student model. Experiments on six video benchmarks show that MARC achieves near-baseline accuracy using only one frame's tokens, reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This demonstrates its potential for efficient, real-time video understanding in resource-constrained settings such as video QA, surveillance, and autonomous driving.
Summary / 总结
MARC is a method for compressing visual tokens in video understanding models to reduce computational costs. It uses a Visual Memory Retriever to select key clips and a Compression Group Relative Policy Optimization framework to distill reasoning ability from a teacher to a student model. Experiments show MARC can achieve near-baseline accuracy while reducing visual tokens by 95%, GPU memory by 72%, and latency by 23.9%. This makes it suitable for real-time applications like video QA and autonomous driving.
MARC 是一种用于压缩视频理解模型中视觉标记的方法,旨在解决将模型从图像扩展到视频时的高计算成本问题。它使用视觉记忆检索器选择关键片段,并使用压缩组相对策略优化框架从教师模型向学生模型传递推理能力。实验结果显示,MARC 可以将视觉标记减少 95%,GPU 内存减少 72%,延迟减少 23.9%,同时保持接近基线的准确性。
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
First: 2026-01-21T07:26:15+00:00 · Latest: 2026-01-26T15:57:42+00:00
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic attention investigation, we conceptualize KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computations upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10$\times$ faster time-to-first-token (TTFT) than prior SOTA. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with up to 11.4% gains on streaming datasets.
中文标题/摘要
标题:HERMES: KV缓存作为层次化内存以提高流式视频理解效率
近期多模态大型语言模型(MLLMs)在离线视频理解方面取得了显著进步。然而,将这些能力扩展到流式视频输入仍然具有挑战性,因为现有模型难以同时保持稳定的理解性能、实时响应和低GPU内存开销。为了解决这一挑战,我们提出了一种名为HERMES的新型无训练架构,用于实时和准确地理解视频流。基于机制性注意力调查,我们将KV缓存概念化为一种层次化内存框架,以跨多个粒度封装视频信息。在推理过程中,HERMES重用紧凑的KV缓存,能够在资源受限的情况下实现高效的流式理解。值得注意的是,HERMES在用户查询到达时不需要额外的辅助计算,从而保证了连续视频流交互的实时响应,TTFT比之前的最佳方案快10倍。即使与均匀采样相比,将视频令牌减少高达68%,HERMES在所有基准测试中仍能实现优于或可比的准确性,流式数据集上最高可获得11.4%的提升。
Summary / 总结
The research aims to improve real-time streaming video understanding by addressing the challenges of maintaining performance, real-time responses, and low GPU memory usage. HERMES, a training-free architecture, uses a hierarchical memory framework based on a KV cache to efficiently process video streams. During inference, HERMES reuses a compact KV cache, achieving 10 times faster time-to-first-token compared to previous state-of-the-art models. It maintains or improves accuracy even with up to 68% fewer video tokens, demonstrating its effectiveness in resource-constrained environments.
研究旨在通过解决保持性能、实时响应和低GPU内存使用率的挑战,提高实时流视频理解能力。HERMES是一种无需训练的架构,采用基于KV缓存的分层内存框架来高效处理视频流。在推理过程中,HERMES重用紧凑的KV缓存,相比之前的最佳模型,实现了10倍更快的时间到第一个令牌。即使减少高达68%的视频标记,它也能保持或提高准确率,证明其在资源受限环境中的有效性。
SMooGPT: Stylized Motion Generation using Large Language Models
Authors: Lei Zhong, Yi Yang, Changjian Li
First: 2025-09-04T09:41:18+00:00 · Latest: 2026-01-26T15:51:20+00:00
Abstract
Stylized motion generation is actively studied in computer graphics, especially benefiting from the rapid advances in diffusion models. The goal of this task is to produce a novel motion respecting both the motion content and the desired motion style, e.g., "walking in a loop like a Monkey". Existing research attempts to address this problem via motion style transfer or conditional motion generation. These approaches typically embed the motion style into a latent space and guide the motion implicitly in that latent space as well. Despite the progress, they suffer from low interpretability and control, generalize poorly to new styles, and fail to produce motions other than "walking" due to the strong bias in the public stylization dataset. In this paper, we propose to solve the stylized motion generation problem from a new perspective of reasoning-composition-generation, based on our observations: i) human motion can often be effectively described using natural language in a body-part centric manner, ii) LLMs exhibit a strong ability to understand and reason about human motion, and iii) human motion has an inherently compositional nature, which facilitates generating new motion content or styles through recomposition. We thus propose utilizing body-part text space as an intermediate representation, and present SMooGPT, a fine-tuned LLM, acting as a reasoner, composer, and generator when generating the desired stylized motion. Our method executes in the body-part text space with much higher interpretability, enabling fine-grained motion control, effectively resolving potential conflicts between motion content and style, and generalizes well to new styles thanks to the open-vocabulary ability of LLMs. Comprehensive experiments and evaluations, and a user perceptual study, demonstrate the effectiveness of our approach, especially for purely text-driven stylized motion generation.
中文标题/摘要
标题:SMooGPT:使用大型语言模型进行风格化运动生成
风格化运动生成在计算机图形学中得到了积极的研究,尤其是得益于扩散模型的迅速发展。该任务的目标是生成既尊重运动内容又符合所需运动风格的新运动,例如“像猴子一样环形行走”。现有研究试图通过运动风格转换或条件运动生成来解决这一问题。它们通常将运动风格嵌入到潜在空间中,并在潜在空间中隐式地引导运动。尽管取得了进展,但它们的方法在可解释性和控制性方面较低,对新风格的泛化能力有限,并且由于公共风格化数据集中的强烈偏见,无法生成除“行走”之外的运动。在本文中,我们从推理-组合-生成的新视角出发,解决风格化运动生成问题,基于我们的观察:i) 人体运动往往可以用自然语言在以身体部位为中心的方式进行有效描述,ii) 大型语言模型在理解和推理人体运动方面表现出很强的能力,iii) 人体运动具有固有的组合性质,有助于通过有效的重组生成新的运动内容或风格。因此,我们提出利用身体部位文本空间作为中间表示,并提出SMooGPT,这是一种微调后的大型语言模型,在生成所需风格化运动时充当推理者、组合者和生成者。我们的方法在身体部位文本空间中执行,具有更高的可解释性,能够实现精细的运动控制,有效解决运动内容和风格之间的潜在冲突,并由于大型语言模型的开放式词汇能力,能够很好地泛化到新风格。全面的实验和评估以及用户感知研究证明了我们方法的有效性,特别是在纯文本驱动的风格化运动生成方面。
Summary / 总结
The paper aims to enhance stylized motion generation by leveraging large language models (LLMs) to address limitations in existing methods such as low interpretability and poor generalization to new styles. The proposed method, SMooGPT, uses a reasoning-composition-generation approach, where body-part text space serves as an intermediate representation. SMooGPT outperforms existing methods by providing higher interpretability, enabling fine-grained control, and effectively generating motions that respect both content and style, even for new styles not seen in the training data.
该论文旨在通过利用大型语言模型(LLMs)来改进风格化运动生成,解决现有方法中存在的低可解释性和难以泛化到新风格的问题。所提出的SMooGPT方法采用推理-合成-生成的新视角,其中身体部位文本空间作为中间表示。SMooGPT能够有效生成具有所需风格和内容的运动,提供精细的控制并更好地泛化到新风格,优于以往的方法。
CLIP's Visual Embedding Projector is a Few-shot Cornucopia
Authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
Venue: WACV 2026
First: 2024-10-07T17:59:59+00:00 · Latest: 2026-01-26T14:50:34+00:00
Comments: WACV 2026
Abstract
We introduce ProLIP, a simple and architecture-agnostic method for adapting contrastively pretrained vision-language models, such as CLIP, to few-shot classification. ProLIP fine-tunes the vision encoder's projection matrix with Frobenius norm regularization on its deviation from the pretrained weights. It achieves state-of-the-art performance on 11 few-shot classification benchmarks under both "few-shot validation" and "validation-free" settings. Moreover, by rethinking the non-linear CLIP-Adapter through ProLIP's lens, we design a Regularized Linear Adapter (RLA) that performs better, requires no hyperparameter tuning, is less sensitive to learning rate values, and offers an alternative to ProLIP in black-box scenarios where model weights are inaccessible. Beyond few-shot classification, ProLIP excels in cross-dataset transfer, domain generalization, base-to-new class generalization, and test-time adaptation, where it outperforms prompt tuning while being an order of magnitude faster to train. Code is available at https://github.com/astra-vision/ProLIP .
中文标题/摘要
标题:CLIP的视觉嵌入投影器是少样本大宝库
我们介绍了ProLIP,这是一种简单且架构无关的方法,用于使对比预训练的视觉-语言模型(如CLIP)适应少样本分类。ProLIP对视觉编码器的投影矩阵进行微调,并以弗罗贝尼乌斯范数对其与预训练权重的偏差进行正则化。它在“少样本验证”和“无验证”两种设置下,均在11个少样本分类基准上实现了最先进的性能。此外,通过从ProLIP的角度重新思考非线性CLIP-Adapter,我们设计了一种正则化线性适配器(RLA),其性能更好,无需超参数调整,对学习率值的敏感性较低,并在模型权重不可访问的黑盒场景中提供了ProLIP的替代方案。除了少样本分类,ProLIP在跨数据集迁移、领域泛化、基类到新类泛化和测试时适应方面表现出色;在测试时适应中,它优于提示调优,同时训练速度快一个数量级。代码可在https://github.com/astra-vision/ProLIP 获取。
Summary / 总结
ProLIP is a method for adapting contrastively pretrained vision-language models like CLIP to few-shot classification by fine-tuning the vision encoder's projection matrix with Frobenius norm regularization. It achieves state-of-the-art performance on 11 benchmarks and excels in various scenarios such as cross-dataset transfer and test-time adaptation, outperforming prompt tuning and being much faster to train.
ProLIP 对视觉编码器的投影矩阵进行带有弗罗贝尼乌斯范数正则化的微调,使 CLIP 等对比预训练的视觉-语言模型适应少样本分类。它在 11 个基准测试中达到了最先进的性能,并在跨数据集迁移、测试时适应等场景中表现出色,优于提示调优,且训练速度快得多。此外,基于 ProLIP 的视角还设计出了正则化线性适配器 (RLA),该适配器表现更好,无需超参数调整,对学习率取值不那么敏感。代码可在 https://github.com/astra-vision/ProLIP 获取。
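The regularizer described in the abstract penalizes how far the fine-tuned projection matrix drifts from its pretrained value. A minimal sketch of such an objective is shown below; the squared form of the Frobenius norm, the weight `lam`, and the function names are assumptions, and `task_loss` stands in for whatever few-shot classification loss is used.

```python
def frobenius_penalty(W, W0):
    """Squared Frobenius norm of (W - W0) for matrices given as nested lists."""
    return sum(
        (a - b) ** 2
        for row, row0 in zip(W, W0)
        for a, b in zip(row, row0)
    )


def regularized_loss(task_loss, W, W0, lam):
    """Task loss plus a weighted penalty on deviation from pretrained weights."""
    return task_loss + lam * frobenius_penalty(W, W0)
```

With `lam = 0` this reduces to unconstrained fine-tuning of the projection matrix; larger values keep the matrix closer to the pretrained weights.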
Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
Authors: Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi
First: 2026-01-26T14:16:51+00:00 · Latest: 2026-01-26T14:16:51+00:00
Abstract
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
中文标题/摘要
标题:即时强化学习:无需梯度更新的LLM代理连续学习
虽然大型语言模型(LLM)代理在通用任务上表现出色,但在部署后权重冻结的情况下,它们本质上难以进行持续适应。传统的强化学习(RL)提供了一种解决方案,但会带来巨大的计算成本和灾难性遗忘的风险。我们提出了即时强化学习(JitRL),这是一种无需训练的框架,能够在测试时进行策略优化而无需任何梯度更新。JitRL 维护一个动态的非参数经验记忆,并在需要时检索相关轨迹以实时估计动作优势。这些估计值随后用于直接调节LLM的输出logits。我们理论上证明,这种增量更新规则是KL约束策略优化目标的精确闭式解。在WebArena和Jericho上的广泛实验表明,JitRL 在无需训练的方法中达到了新的最佳水平。至关重要的是,JitRL 在性能上超过了计算成本高昂的微调方法(例如WebRL),同时将成本降低了30多倍,为持续学习代理提供了可扩展的路径。代码可在https://github.com/liushiliushi/JitRL/ 获取。
Summary / 总结
The research aims to address the challenge of continual adaptation in Large Language Model (LLM) agents without updating gradients. Just-In-Time Reinforcement Learning (JitRL) is introduced as a training-free framework that allows for policy optimization at test time without gradient updates. JitRL uses a dynamic, non-parametric memory to retrieve relevant trajectories and estimate action advantages, which are then used to directly modulate the LLM's output logits. Experiments on WebArena and Jericho show that JitRL outperforms computationally expensive fine-tuning methods while significantly reducing monetary costs, establishing a new state-of-the-art in training-free methods for continual learning agents.
论文提出了即时强化学习(JitRL)框架,该框架允许LLM代理在无需梯度更新的情况下适应新任务。JitRL 使用动态的非参数记忆来检索相关轨迹并估计动作优势,然后直接调节LLM的输出logits。在WebArena和Jericho上的实验表明,JitRL 在性能上优于计算成本高昂的微调方法,并且将费用成本降低了超过30倍,使其成为持续学习LLM代理的可扩展解决方案。
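The additive logit update mentioned in this entry can be sketched as follows. For the standard KL-regularized objective (maximize expected advantage minus `beta` times the KL divergence from the base policy), the optimum is the base policy reweighted by `exp(advantage / beta)`, which in logit space is an addition of `advantage / beta`. The temperature `beta`, the function names, and the toy numbers are assumptions of this sketch; how JitRL estimates advantages from retrieved trajectories and scales them is not specified in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_logits(logits, advantages, beta):
    """Add advantage / beta to each base logit.

    `beta` is a hypothetical KL-penalty temperature: a larger value keeps
    the updated policy closer to the base policy.
    """
    return [z + a / beta for z, a in zip(logits, advantages)]


# Toy usage: three candidate actions with equal base logits.
base_logits = [0.0, 0.0, 0.0]
estimated_advantages = [1.0, 0.0, -1.0]  # hypothetical memory-based estimates
new_probs = softmax(modulate_logits(base_logits, estimated_advantages, beta=1.0))
```

No model weights change in this update; only the output distribution at decision time is shifted toward actions with higher estimated advantage.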
DisasterInsight: A Multimodal Benchmark for Function-Aware and Grounded Disaster Assessment
Authors: Sara Tehrani, Yonghao Xu, Leif Haglund, Amanda Berg, Michael Felsberg
First: 2026-01-26T13:48:11+00:00 · Latest: 2026-01-26T13:48:11+00:00
Comments: Under review at ICPR 2026
Abstract
Timely interpretation of satellite imagery is critical for disaster response, yet existing vision-language benchmarks for remote sensing largely focus on coarse labels and image-level recognition, overlooking the functional understanding and instruction robustness required in real humanitarian workflows. We introduce DisasterInsight, a multimodal benchmark designed to evaluate vision-language models (VLMs) on realistic disaster analysis tasks. DisasterInsight restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across multiple tasks, including building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. To establish domain-adapted baselines, we propose DI-Chat, obtained by fine-tuning existing VLM backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA). Extensive experiments on state-of-the-art generic and remote-sensing VLMs reveal substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level and disaster-type classification as well as report generation quality, while building-function classification remains challenging for all evaluated models. DisasterInsight provides a unified benchmark for studying grounded multimodal reasoning in disaster imagery.
中文标题/摘要
标题:DisasterInsight:面向功能感知与有据灾害评估的多模态基准
及时解释卫星图像对于灾害响应至关重要,但现有的遥感视觉-语言基准主要集中在粗略标签和图像级识别上,忽视了实际人道主义工作流程中所需的功能理解和指令鲁棒性。我们引入了DisasterInsight,这是一种多模态基准,旨在评估视觉-语言模型(VLMs)在现实灾害分析任务中的表现。DisasterInsight将xBD数据集重新结构化为约112,000个以建筑物为中心的实例,并支持跨多个任务的指令多样评估,包括建筑物功能分类、损坏程度和灾害类型分类、计数以及与人道主义评估指南一致的结构化报告生成。为了建立领域适应基线,我们提出了DI-Chat,这是通过在灾害特定指令数据上对现有VLM主干进行参数高效低秩适应(LoRA)微调得到的。在最先进的通用和遥感VLM上的广泛实验揭示了各任务之间存在显著的性能差距,尤其是在损坏理解与结构化报告生成方面。DI-Chat在损坏程度和灾害类型分类以及报告生成质量方面取得了显著改进,而建筑物功能分类对所有评估模型来说仍然是一个挑战。DisasterInsight为研究灾害图像中有据可依的(grounded)多模态推理提供了一个统一基准。
Summary / 总结
DisasterInsight is a multimodal benchmark designed to evaluate vision-language models on realistic disaster analysis tasks, addressing the need for functional understanding and instruction robustness in disaster response. It restructures the xBD dataset into building-centered instances and supports diverse instruction evaluation across multiple tasks. Experiments show significant performance gaps, especially in damage understanding and structured report generation, with DI-Chat, a domain-adapted baseline, improving these tasks but not building-function classification. DisasterInsight provides a unified benchmark for grounded multimodal reasoning in disaster imagery.
DisasterInsight 是一个多模态基准,旨在评估视觉-语言模型在真实灾难分析任务中的表现,重点关注功能理解和指令鲁棒性。它将 xBD 数据集重构为约 112K 个以建筑物为中心的实例,并支持包括建筑物功能分类、损坏程度和灾难类型分类、计数和结构化报告生成在内的多种任务。实验显示,在损坏理解等任务上存在显著性能差距,而通过细调现有模型得到的 DI-Chat 在损坏程度和灾难类型分类以及报告生成质量上取得了显著改进,但建筑物功能分类对所有评估模型来说仍然具有挑战性。
Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone
Authors: Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth
First: 2025-08-26T12:41:35+00:00 · Latest: 2026-01-26T12:18:24+00:00
Comments: Accepted to the Findings of EACL 2026
Abstract
As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.
中文标题/摘要
标题:用不同方式再问我:GRAS用于衡量视觉语言模型在性别、种族、年龄和肤色方面的偏差
随着视觉语言模型(VLMs)在实际应用中变得越来越重要,理解它们的人口统计学偏差至关重要。我们引入了GRAS,这是一个基准,用于揭示VLMs在性别、种族、年龄和肤色方面的人口统计学偏差,提供了迄今为止最多样化的覆盖范围。我们还提出了GRAS偏差评分,这是一个可解释的指标,用于量化偏差。我们对五种最先进的VLM进行了基准测试,并揭示了令人担忧的偏差水平,偏差最小的模型的GRAS偏差评分也仅为100分中的2分。我们的研究结果还揭示了一个方法论上的见解:使用视觉问答(VQA)评估VLMs的偏差需要考虑问题的多种表述形式。我们的代码、数据和评估结果已公开。
Summary / 总结
The study aims to understand demographic biases in Vision Language Models (VLMs) by introducing GRAS, a benchmark for measuring biases across gender, race, age, and skin tone. The researchers propose the GRAS Bias Score, an interpretable metric, and benchmark five state-of-the-art VLMs, finding that the least biased model still has a GRAS Bias Score of 2 out of 100. The study also highlights the importance of considering multiple question formulations in VQA for bias evaluation. The code, data, and evaluation results are publicly available.
研究旨在理解并衡量视觉语言模型(VLMs)在性别、种族、年龄和肤色方面的人口统计学偏见。作者引入了GRAS基准来评估这些偏见,并提出GRAS偏见分数作为可解释的度量标准。他们对五个最先进的VLMs进行了基准测试,发现即使是偏见最小的模型,其GRAS偏见分数也只有100分中的2分,表明仍有很大的改进空间。研究还强调,在视觉问答(VQA)中考虑多种问题表述对于全面的偏见评估至关重要。
ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks
Authors: Gabriel Lee Jun Rong, Christos Korgialas, Dion Jia Xu Ho, Pai Chet Ng, Xiaoxiao Miao, Konstantinos N. Plataniotis
First: 2026-01-26T11:36:34+00:00 · Latest: 2026-01-26T11:36:34+00:00
Abstract
Existing automated attack suites operate as static ensembles with fixed sequences, lacking strategic adaptation and semantic awareness. This paper introduces the Agentic Reasoning for Methods Orchestration and Reparameterization (ARMOR) framework to address these limitations. ARMOR orchestrates three canonical adversarial primitives, Carlini-Wagner (CW), Jacobian-based Saliency Map Attack (JSMA), and Spatially Transformed Attacks (STA), via Vision Language Model (VLM)-guided agents that collaboratively generate and synthesize perturbations through a shared "Mixing Desk". Large Language Models (LLMs) adaptively tune and reparameterize parallel attack agents in a real-time, closed-loop system that exploits image-specific semantic vulnerabilities. On standard benchmarks, ARMOR achieves improved cross-architecture transfer and reliably fools targets in both the blind and white-box settings, delivering a blended output for blind targets and selecting the best attack or blended attacks for white-box targets using a confidence-and-SSIM score.
中文标题/摘要
标题:ARMOR: 代理推理方法编排与重构以实现稳健的对抗攻击
现有的自动化攻击套件作为静态组合体运行,具有固定的序列,缺乏战略适应性和语义意识。本文提出了代理推理方法编排与重参数化(ARMOR)框架以解决这些问题。ARMOR 通过视觉语言模型(VLM)引导的代理来编排三种经典对抗原语,即 Carlini-Wagner(CW)、基于雅可比的显著图攻击(JSMA)和空间变换攻击(STA),这些代理通过共享的“混合台”协作生成和合成扰动。大型语言模型(LLMs)在实时闭环系统中自适应地调整和重参数化并行攻击代理,利用图像特定的语义漏洞。在标准基准测试中,ARMOR 实现了更好的跨架构迁移,并在盲目标和白盒两种设置下都能可靠地欺骗目标模型:对盲目标提供混合输出,对白盒目标则使用置信度和SSIM分数选择最佳攻击或混合攻击。
Summary / 总结
The motivation for this research is to enhance the strategic adaptation and semantic awareness of automated attack suites. The main method involves the ARMOR framework, which uses Vision Language Models to guide agents in orchestrating and reparameterizing three adversarial primitives (CW, JSMA, and STA) through a shared 'Mixing Desk'. The framework employs Large Language Models to adaptively tune these agents in real-time, exploiting image-specific semantic vulnerabilities. Key experimental findings show that ARMOR improves cross-architecture transfer and reliably fools both blind and white-box targets, delivering either a blended output or the best attack based on confidence and SSIM scores.
该研究的动机是增强自动化攻击套件的战略适应性和语义意识。主要方法是ARMOR框架,它使用视觉语言模型引导代理,通过共享的“混合台”编排和重参数化三种对抗性原语(CW、JSMA和STA)。框架使用大型语言模型在实时闭环系统中自适应地调整这些代理,利用图像特定的语义漏洞。关键实验发现表明,ARMOR改进了跨架构迁移,并可靠地欺骗了盲目标和白盒目标,根据置信度和SSIM分数提供混合输出或最佳攻击。
Making medical vision-language models think causally across modalities with retrieval-augmented cross-modal reasoning
Authors: Weiqin Yang, Haowen Xue, Qingyi Peng, Hexuan Hu, Qian Huang, Tingbo Zhang
First: 2026-01-26T11:03:00+00:00 · Latest: 2026-01-26T11:03:00+00:00
Abstract
Medical vision-language models (VLMs) achieve strong performance in diagnostic reporting and image-text alignment, yet their underlying reasoning mechanisms remain fundamentally correlational, exhibiting reliance on superficial statistical associations that fail to capture the causal pathophysiological mechanisms central to clinical decision-making. This limitation makes them fragile, prone to hallucinations, and sensitive to dataset biases. Retrieval-augmented generation (RAG) offers a partial remedy by grounding predictions in external knowledge. However, conventional RAG depends on semantic similarity, introducing new spurious correlations. We propose Multimodal Causal Retrieval-Augmented Generation, a framework that integrates causal inference principles with multimodal retrieval. It retrieves clinically relevant exemplars and causal graphs from external sources, conditioning model reasoning on counterfactual and interventional evidence rather than correlations alone. Applied to radiology report generation, diagnosis prediction, and visual question answering, it improves factual accuracy, robustness to distribution shifts, and interpretability. Our results highlight causal retrieval as a scalable path toward medical VLMs that think beyond pattern matching, enabling trustworthy multimodal reasoning in high-stakes clinical settings.
中文标题/摘要
标题:利用检索增强跨模态推理使医疗视觉-语言模型在不同模态间进行因果思考
医疗视觉-语言模型(VLMs)在诊断报告和图像-文本对齐方面表现出色,但其内部推理机制本质上是相关性的,依赖于表面的统计关联,无法捕捉到临床决策中至关重要的因果病理机制。这一局限性使它们变得脆弱,容易产生幻觉,并对数据集偏差敏感。检索增强生成(RAG)提供了一部分补救措施,通过外部知识来支撑预测。然而,传统的RAG依赖于语义相似性,引入了新的虚假关联。我们提出了一种多模态因果检索增强生成框架,该框架将因果推理原则与多模态检索相结合。它从外部来源检索出临床相关的范例和因果图,并基于反事实和干预性证据来条件化模型的推理,而不是仅仅依赖于关联。应用于放射学报告生成、诊断预测和视觉问答,它提高了事实准确性、对分布偏移的鲁棒性和可解释性。我们的结果突显了因果检索作为医疗VLMs超越模式匹配思考的可扩展路径,使其能够在高风险临床环境中实现可信的多模态推理。
Summary / 总结
The research aims to enhance medical vision-language models by incorporating causal reasoning to improve their robustness and interpretability. The method involves using a framework called Multimodal Causal Retrieval-Augmented Generation, which integrates causal inference principles with multimodal retrieval to condition model reasoning on counterfactual and interventional evidence. Key findings show improvements in factual accuracy, robustness to distribution shifts, and interpretability in tasks such as radiology report generation, diagnosis prediction, and visual question answering.
研究旨在通过引入因果推理来提升医疗视觉语言模型的稳健性和可解释性。方法是使用一种名为Multimodal Causal Retrieval-Augmented Generation的框架,将因果推理原则与多模态检索相结合,使模型的推理基于反事实和干预性证据。关键发现表明,在放射学报告生成、诊断预测和视觉问答等任务中,模型在事实准确性、分布变化鲁棒性和可解释性方面有所提升。
Beyond Rigid: Benchmarking Non-Rigid Video Editing
Authors: Bingzheng Qu, Kehai Chen, Xuefeng Bai, Jun Yu, Min Zhang
First: 2026-01-26T10:28:09+00:00 · Latest: 2026-01-26T10:28:09+00:00
Abstract
Despite the remarkable progress in text-driven video editing, generating coherent non-rigid deformations remains a critical challenge, often plagued by physical distortion and temporal flicker. To bridge this gap, we propose NRVBench, the first dedicated and comprehensive benchmark designed to evaluate non-rigid video editing. First, we curate a high-quality dataset consisting of 180 non-rigid motion videos from six physics-based categories, equipped with 2,340 fine-grained task instructions and 360 multiple-choice questions. Second, we propose NRVE-Acc, a novel evaluation metric based on Vision-Language Models that can rigorously assess physical compliance, temporal consistency, and instruction alignment, overcoming the limitations of general metrics in capturing complex dynamics. Third, we introduce a training-free baseline, VM-Edit, which utilizes a dual-region denoising mechanism to achieve structure-aware control, balancing structural preservation and dynamic deformation. Extensive experiments demonstrate that while current methods have shortcomings in maintaining physical plausibility, our method achieves excellent performance across both standard and proposed metrics. We believe the benchmark could serve as a standard testing platform for advancing physics-aware video editing.
中文标题/摘要
标题:超越刚性:非刚性视频编辑基准测试
尽管文本驱动的视频编辑取得了显著进展,但生成连贯的非刚性变形仍然是一个关键挑战,常常受到物理失真和时间闪烁的困扰。为解决这一问题,我们提出了NRVBench,这是第一个专门且全面的基准测试,用于评估非刚性视频编辑。首先,我们精心策划了一个高质量的数据集,包含来自六个基于物理的类别的180个非刚性运动视频,附带2,340个细粒度的任务指令和360个多项选择题。其次,我们提出了基于视觉-语言模型的NRVE-Acc新型评估指标,可以严格评估物理合规性、时间一致性和指令对齐,克服了通用指标在捕捉复杂动态方面的局限性。第三,我们引入了一个无需训练的基础模型VM-Edit,利用双区域去噪机制实现结构感知控制,平衡结构保存和动态变形。大量实验表明,尽管当前方法在保持物理合理性方面存在不足,但我们的方法在标准和提出的指标上均表现出色。我们认为,该基准测试可以作为物理感知视频编辑的标准测试平台。
Summary / 总结
The research aims to address the challenge of generating coherent non-rigid deformations in video editing, which is often affected by physical distortion and temporal flicker. The authors propose NRVBench, a benchmark that includes a curated dataset of 180 non-rigid motion videos and a novel evaluation metric, NRVE-Acc, based on Vision-Language Models. They also introduce a training-free baseline, VM-Edit, which uses a dual-region denoising mechanism to balance structural preservation and dynamic deformation. The experiments show that current methods struggle with physical plausibility, but VM-Edit performs well across both standard and proposed metrics.
研究旨在解决视频编辑中非刚性变形的连贯性问题,通常受到物理失真和时间闪烁的影响。作者提出了NRVBench,该基准包括一个由180个非刚性运动视频组成的精心策划的数据集和一个基于Vision-Language模型的新颖评估指标NRVE-Acc。他们还引入了一个无需训练的基线VM-Edit,该基线使用双区域去噪机制来平衡结构保留和动态变形。实验表明,当前方法在物理合理性方面存在不足,但VM-Edit在标准和提出的指标上表现良好。
Coding the Visual World: From Image to Simulation Using Vision Language Models
Authors: Sagi Eppel
First: 2026-01-08T19:49:05+00:00 · Latest: 2026-01-26T10:11:31+00:00
Abstract
The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) have the ability to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.
中文标题/摘要
标题:编码可视世界:使用视觉语言模型从图像到模拟
构建世界的心理模型是理解的核心方面。同样,视觉理解可以被视为构建图像中所描绘系统代表模型的能力。这项工作探讨了视觉语言模型(VLMs)使用Im2Sim方法识别和模拟图像中所示系统和机制的能力。给定一个真实世界的系统的自然图像(例如,城市、云、植被),VLM被要求描述该系统并编写模拟和生成它的代码。然后执行生成的代码以产生合成图像,并将其与原始图像进行比较。这种方法在各种复杂的涌现系统上进行了测试,从物理系统(波、光、云)到植被、城市、材料和地质构造。通过对VLM生成的模型和图像的分析,我们研究了它们对图像中系统的理解。结果表明,领先的VLM(GPT、Gemini)具有跨多个抽象层次和多个领域理解并建模复杂多组件系统的能力。同时,VLMs在复制图像中的细部和低级模式排列方面表现出有限的能力。这些发现揭示了一个有趣的不对称性:VLMs结合了对图像的高层次、深入的视觉理解,但对细部感知有限。
Summary / 总结
This work investigates the capability of Vision Language Models (VLMs) to simulate real-world systems depicted in images using the Im2Sim methodology. The VLMs are given natural images and asked to describe and code the system, which is then executed to generate a synthetic image. The study tests this approach on various complex systems, showing that leading VLMs like GPT and Gemini can understand and model complex, multi-component systems across different domains but struggle with fine details. This reveals an interesting asymmetry in VLMs' visual understanding capabilities.
这项研究探讨了视觉语言模型(VLMs)使用Im2Sim方法识别和模拟图像中复杂系统的能力。VLMs被给定自然图像并要求描述系统并编写模拟代码,然后执行生成合成图像。研究测试了这种方法在物理和自然现象等多种系统上的应用。结果表明,领先的VLMs如GPT和Gemini能够理解并跨不同领域建模复杂的多组件系统,但在复制细节点上存在局限性。这揭示了VLMs在视觉理解上的不对称性,即它们在高层次理解方面表现出色,但在细节感知方面有限。
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis
Authors: Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, Korbinian Moller, Roberto Brusnicki, Baha Zarrouki, Alessio Gambi, Jan Frederik Totz, Kai Storms, Steven Peters, Andrea Stocco, Bassam Alrifaee, Marco Pavone, Johannes Betz
First: 2025-06-13T07:25:59+00:00 · Latest: 2026-01-26T09:55:34+00:00
Comments: Final version (Accepted by the IEEE Open Journal of Intelligent Transportation Systems)
Abstract
For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis.
中文标题/摘要
标题:自动驾驶中的基础模型:场景生成与分析综述
对于自动驾驶车辆而言,安全导航依赖于处理各种多样且罕见的驾驶场景。仿真和基于场景的测试已成为开发和验证自动驾驶系统的关键方法。传统场景生成依赖于基于规则的系统、知识驱动模型和数据驱动合成,通常生成的场景多样性有限且不现实。随着基础模型的出现,这些代表新一代预训练通用人工智能模型,开发者可以处理异构输入(例如自然语言、传感器数据、高清地图和控制动作),从而生成和解释复杂的驾驶场景。本文综述了截至2025年5月基础模型在自动驾驶场景生成与分析中的应用。综述中提出了一种统一的分类体系,包括大型语言模型、视觉-语言模型、多模态大型语言模型、扩散模型和世界模型,用于生成和分析自动驾驶场景。此外,综述还回顾了方法论、开源数据集、仿真平台和基准挑战,并审查了针对场景生成与分析的评估指标。最后,综述总结了开放挑战和研究问题,并概述了有前景的未来研究方向。所有审阅的论文均列于一个持续维护的仓库中,该仓库包含补充材料,可访问 https://github.com/TUM-AVS/FM-for-Scenario-Generation-Analysis。
Summary / 总结
This paper surveys the application of foundation models in autonomous driving, focusing on scenario generation and analysis. Motivated by the limited diversity and realism of traditional scenario generation, the study explores how foundation models can process heterogeneous inputs to synthesize and interpret complex scenarios. It presents a unified taxonomy covering large language models, vision-language models, multimodal large language models, diffusion models, and world models, and reviews the methodologies, open-source datasets, simulation platforms, benchmark challenges, and evaluation metrics tailored to scenario generation and analysis. The survey also highlights open challenges and future research directions in this area.
本文综述了基础模型在自动驾驶中的应用,重点关注场景生成和分析。它指出了传统方法的局限性,并介绍了可以处理多种输入以生成和解释复杂驾驶场景的基础模型。关键发现包括使用大型语言模型、视觉-语言模型和其他多模态模型来增强场景的多样性和真实性,并回顾了场景生成和分析的方法、数据集和评估指标。
Vision-Language-Model-Guided Differentiable Ray Tracing for Fast and Accurate Multi-Material RF Parameter Estimation
Authors: Zerui Kang, Yishen Lim, Zhouyou Gu, Seung-Woo Ko, Tony Q. S. Quek, Jihong Park
First: 2026-01-26T07:54:53+00:00 · Latest: 2026-01-26T07:54:53+00:00
Abstract
Accurate radio-frequency (RF) material parameters are essential for electromagnetic digital twins in 6G systems, yet gradient-based inverse ray tracing (RT) remains sensitive to initialization and costly under limited measurements. This paper proposes a vision-language-model (VLM) guided framework that accelerates and stabilizes multi-material parameter estimation in a differentiable RT (DRT) engine. A VLM parses scene images to infer material categories and maps them to quantitative priors via an ITU-R material table, yielding informed conductivity initializations. The VLM further selects informative transmitter/receiver placements that promote diverse, material-discriminative paths. Starting from these priors, the DRT performs gradient-based refinement using measured received signal strengths. Experiments in NVIDIA Sionna on indoor scenes show 2-4$\times$ faster convergence and 10-100$\times$ lower final parameter error compared with uniform or random initialization and random placement baselines, achieving sub-0.1% mean relative error with only a few receivers. Complexity analyses indicate per-iteration time scales near-linearly with the number of materials and measurement setups, while VLM-guided placement reduces the measurements required for accurate recovery. Ablations over RT depth and ray counts confirm further accuracy gains without significant per-iteration overhead. Results demonstrate that semantic priors from VLMs effectively guide physics-based optimization for fast and reliable RF material estimation.
中文标题/摘要
标题:基于视觉-语言模型的可微射线追踪框架用于快速准确的多材料射频参数估计
准确的射频(RF)材料参数对于6G系统中的电磁数字孪生至关重要,但基于梯度的逆射线追踪(RT)方法对初始化敏感且在有限测量下成本高昂。本文提出了一种基于视觉-语言模型(VLM)的框架,以加速并稳定多材料参数估计的可微射线追踪(DRT)引擎。VLM 解析场景图像以推断材料类别,并通过国际电信联盟(ITU-R)材料表将它们映射到定量先验,从而提供有信息量的导电性初始化。VLM 进一步选择具有信息性的发射器/接收器位置,以促进多样且材料区分的路径。从这些先验开始,DRT 使用测量的接收信号强度进行基于梯度的细化。在NVIDIA Sionna上的室内场景实验表明,与均匀或随机初始化和随机放置基线相比,该方法的收敛速度加快了2-4倍,最终参数误差降低了10-100倍,仅使用少数几个接收器即可达到小于0.1%的平均相对误差。复杂性分析表明,每次迭代的时间与材料数量和测量设置数量几乎线性相关,而VLM引导的位置选择减少了准确恢复所需的测量次数。对射线追踪深度和射线计数的消融实验进一步证实了在不显著增加每次迭代开销的情况下,可以获得更高的准确性。结果表明,VLM提供的语义先验有效地引导了基于物理的优化,以实现快速可靠的射频材料估计。
Summary / 总结
This paper addresses the challenge of accurately estimating radio-frequency material parameters using inverse ray tracing, which is sensitive to initialization and costly. It proposes a vision-language-model (VLM) guided framework that uses a VLM to infer material categories and provide informed initializations, and selects transmitter/receiver placements to promote diverse paths. The differentiable ray tracing (DRT) engine then refines these initializations using measured signal strengths. Experiments show that this approach converges 2-4 times faster and achieves 10-100 times lower final parameter error compared to baselines, with sub-0.1% mean relative error using only a few receivers.
本文解决了使用逆射线追踪准确估计射频材料参数的挑战,该方法对初始化敏感且成本高。它提出了一种基于视觉语言模型(VLM)的框架,使用可微射线追踪(DRT)引擎来加速和稳定估计过程。VLM 从场景图像中推断材料类别并映射到定量先验,提供初始值。它还选择能够促进多样性和材料区分路径的发射器/接收器位置。实验表明,这种方法比均匀或随机初始化和随机位置基线快2-4倍,并且最终参数误差低10-100倍,仅使用少量接收器即可达到低于0.1%的平均相对误差。
V-Loop: Visual Logical Loop Verification for Hallucination Detection in Medical Visual Question Answering
Authors: Mengyuan Jin, Zehui Liao, Yong Xia
First: 2026-01-26T07:46:41+00:00 · Latest: 2026-01-26T07:46:41+00:00
Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable capability in assisting disease diagnosis in medical visual question answering (VQA). However, their outputs remain vulnerable to hallucinations (i.e., responses that contradict visual facts), posing significant risks in high-stakes medical scenarios. Recent introspective detection methods, particularly uncertainty-based approaches, offer computational efficiency but are fundamentally indirect, as they estimate predictive uncertainty for an image-question pair rather than verifying the factual correctness of a specific answer. To address this limitation, we propose Visual Logical Loop Verification (V-Loop), a training-free and plug-and-play framework for hallucination detection in medical VQA. V-Loop introduces a bidirectional reasoning process that forms a visually grounded logical loop to verify factual correctness. Given an input, the MLLM produces an answer for the primary input pair. V-Loop extracts semantic units from the primary QA pair, generates a verification question by conditioning on the answer unit to re-query the question unit, and enforces visual attention consistency to ensure that answering both the primary question and the verification question relies on the same image evidence. If the verification answer matches the expected semantic content, the logical loop closes, indicating factual grounding; otherwise, the primary answer is flagged as hallucinated. Extensive experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods, remains highly efficient, and further boosts uncertainty-based approaches when used in combination.
中文标题/摘要
标题:V-Loop:医疗视觉问答中幻觉检测的视觉逻辑循环验证
多模态大型语言模型(MLLMs)在医疗视觉问答(VQA)中协助疾病诊断方面展现了显著的能力。然而,它们的输出仍然容易出现幻觉(即与视觉事实相矛盾的响应),在高风险医疗场景中存在重大风险。最近的内省检测方法,特别是基于不确定性的方法,提供了计算效率,但本质上是间接的,它们估计图像-问题对的预测不确定性,而不是验证特定答案的事实正确性。为了解决这一局限性,我们提出了一种无需训练且即插即用的框架——视觉逻辑循环验证(V-Loop),用于医疗VQA中的幻觉检测。V-Loop引入了一种双向推理过程,形成一个视觉支撑的逻辑循环来验证事实正确性。给定输入,MLLM为原始输入对生成一个答案。V-Loop从原始QA对中提取语义单元,根据答案单元生成验证问题以重新查询问题单元,并确保视觉注意力一致性,以确保回答原始问题和验证问题都依赖于相同的图像证据。如果验证答案与预期的语义内容匹配,则逻辑循环闭合,表明事实支撑;否则,原始答案被标记为幻觉。在多个医疗VQA基准和MLLM上的广泛实验表明,V-Loop在性能上始终优于现有内省方法,保持了高度的效率,并且在与基于不确定性的方法结合使用时进一步提升了其性能。
Summary / 总结
The research aims to address the vulnerability of Multimodal Large Language Models (MLLMs) to hallucinations in medical visual question answering (VQA) by proposing V-Loop, a training-free framework that verifies the factual correctness of answers through a bidirectional reasoning process. V-Loop forms a logical loop by extracting semantic units from the primary QA pair, generating a verification question, and ensuring visual attention consistency. The experiments demonstrate that V-Loop outperforms existing introspective methods and enhances the performance of uncertainty-based approaches.
研究旨在通过提出V-Loop框架来解决多模态大型语言模型在医学视觉问答(VQA)中对幻觉的脆弱性问题,V-Loop通过双向推理过程验证答案的正确性。V-Loop通过从主QA对中提取语义单元,基于答案生成验证问题,并确保视觉注意力一致性。实验表明,V-Loop在多个医学VQA基准和多模态大型语言模型上均优于现有方法,并能进一步提升基于不确定性的方法的性能。
GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation
Authors: Shasha Guo, Liang Pang, Xi Wang, Yanling Wang, Huawei Shen, Jing Zhang
First: 2025-10-13T05:33:51+00:00 · Latest: 2026-01-26T07:35:43+00:00
Comments: 19 pages
Abstract
Auxiliary lines are essential for solving complex geometric problems but remain challenging for large vision-language models (LVLMs). Recent attempts construct auxiliary lines via code-driven rendering, a strategy that relies on accurate and executable code generation to produce visual renderings of the auxiliary lines for subsequent reasoning. However, in complex solid geometry settings, such a strong dependence on precise specifications substantially restricts the robustness of this strategy. Alternatively, we turn to a simpler and more stable solution, representing auxiliary-line constructions as structured textual descriptions. To bridge the gap between textual descriptions and spatial structure, we propose a reinforcement learning framework that enhances diagram-text alignment. The core is a cross-modal reward model that evaluates how well the generated auxiliary-line description matches the ground-truth auxiliary-line diagram. The reward signal drives a GRPO-based RL stage to yield informative auxiliary-line descriptions for the reasoning. To support the training and evaluation, we develop a scalable data pipeline and construct AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields. Based on this framework, we derive GeoVLMath, an LVLM for solving complex solid geometry.
中文标题/摘要
标题:GeoVLMath:通过跨模态奖励辅助线创建增强视觉语言模型中的几何推理
辅助线对于解决复杂的几何问题至关重要,但对大型视觉语言模型(LVLM)来说仍然具有挑战性。最近的尝试通过代码驱动的渲染来构建辅助线,这种方法依赖于准确且可执行的代码生成,以生成辅助线的视觉渲染,供后续推理使用。然而,在复杂的立体几何设置中,这种对精确规范的强烈依赖极大地限制了该策略的鲁棒性。相反,我们转向了一个更简单且更稳定的解决方案,将辅助线的构建表示为结构化的文本描述。为了弥合文本描述与空间结构之间的差距,我们提出了一种强化学习框架,以增强图表-文本对齐。核心是一个跨模态奖励模型,该模型评估生成的辅助线描述与真实辅助线图表的匹配程度。奖励信号驱动基于GRPO的强化学习阶段,以生成有助于推理的辅助线描述。为了支持训练和评估,我们开发了一个可扩展的数据管道,并构建了包含3,018个实际考试几何问题的数据集AuxSolidMath,这些问题是带有配对图表和对齐文本字段的。基于此框架,我们推导出了GeoVLMath,这是一种用于解决复杂立体几何问题的LVLM。
Summary / 总结
The research aims to improve geometry reasoning in large vision-language models (LVLMs), for which constructing auxiliary lines in complex geometric problems remains difficult. Instead of code-driven rendering, the method represents auxiliary-line constructions as structured textual descriptions and trains with a GRPO-based reinforcement learning stage, using a cross-modal reward model that scores how well a generated description matches the ground-truth auxiliary-line diagram. The authors also build AuxSolidMath, a dataset of 3,018 real-exam geometry problems with paired diagrams and aligned textual fields, and derive GeoVLMath, an LVLM for complex solid geometry.
研究旨在通过解决复杂几何问题中的辅助线创建挑战,提升视觉语言模型的几何推理能力。方法包括一个强化学习框架,结合跨模态奖励模型,将文本描述与图示表示对齐。关键实验发现表明,所提出的GeoVLMath模型能够有效生成有信息量的辅助线描述,增强LVLM在几何问题解决中的鲁棒性。
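To make the GRPO-based stage in this entry more concrete, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several candidate outputs are sampled for the same problem, each receives a scalar reward, and rewards are standardized within the group. The reward values stand in for hypothetical cross-modal reward scores; the use of the population standard deviation and the `eps` term are assumptions of this sketch, and the paper's exact training setup is not given in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    that score above the group average get a positive learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: four auxiliary-line descriptions sampled for one problem,
# scored by a hypothetical reward model in [0, 1].
rewards = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed; a group in which every sample gets the same reward yields zero advantages and therefore no update signal.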
The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs
Authors: Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Parvez
First: 2025-09-16T08:17:39+00:00 · Latest: 2026-01-26T07:17:20+00:00
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 18 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. For closed-source models lacking token-level logprob access, we develop and validate instruction-guided likelihood proxies. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don't know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
中文标题/摘要
标题:说“可能”的艺术:一种用于VLMs不确定性基准测试的共形视角
视觉-语言模型(VLMs)在跨科学和推理任务的复杂视觉理解方面取得了显著进展。尽管性能基准测试已加深了我们对这些能力的理解,但不确定性量化这一关键维度却未得到充分关注。因此,不同于以往专注于有限场景的共形预测研究,我们进行了全面的不确定性基准测试研究,评估了18个最先进的VLMs(开源和闭源)在6个多模态数据集上的表现,使用了3种不同的评分函数。对于缺乏标记级别logprob访问的闭源模型,我们开发并验证了基于指令的似然度代理。我们的研究结果表明,更大的模型在不确定性量化方面表现更优;知道得越多的模型也更清楚自己不知道什么。更确定的模型具有更高的准确性,而数学和推理任务在所有模型中的不确定性表现均劣于其他领域。本研究为多模态系统的可靠不确定性评估奠定了基础。
Summary / 总结
The study aims to address the underexplored area of uncertainty quantification in Vision-Language Models (VLMs) by benchmarking 18 state-of-the-art VLMs across 6 multimodal datasets using 3 scoring functions. For closed-source models, instruction-guided likelihood proxies were developed and validated. Key findings include that larger models better quantify uncertainty, and more certain models achieve higher accuracy. Mathematical and reasoning tasks show poorer uncertainty performance compared to other domains.
研究旨在通过全面的基准测试来解决视觉语言模型(VLMs)中未充分探索的不确定性量化问题。研究评估了18个最先进的VLMs在六个多模态数据集上的表现,使用了三种评分函数。对于没有token级logprob访问权限的闭源模型,作者开发并验证了基于指令的似然度代理。主要发现包括:更大的模型能够更好地量化不确定性,知道得越多的模型也越清楚自己不知道什么。更确定的模型能够获得更高的准确性,但与其他领域相比,数学和推理任务在所有模型上的不确定性表现较差。
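To make the conformal prediction setting behind this benchmark concrete, here is a minimal sketch of split conformal prediction for a multiple-choice task. The nonconformity score used here (one minus the probability assigned to the true option), the function names, and the toy numbers are assumptions for illustration only; the abstract does not name the three scoring functions the paper evaluates.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the split-conformal threshold from a calibration set.

    cal_probs: list of per-option probability lists (one list per example).
    cal_labels: index of the correct option for each example.
    alpha: target miscoverage rate (0.1 aims for about 90% coverage).
    """
    n = len(cal_labels)
    # Assumed score: 1 - probability of the true option (higher = worse fit).
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    # Finite-sample corrected rank of the quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: include every option.
        return float("inf")
    return scores[rank - 1]

def prediction_set(probs, qhat):
    """Return the indices of all options whose score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy calibration data (hypothetical): four binary questions, answer index 0.
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
cal_labels = [0, 0, 0, 0]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
answer_set = prediction_set([0.75, 0.2, 0.05], qhat)
```

Under this kind of setup, a smaller average prediction set at the same coverage level is typically read as better uncertainty quantification.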
GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models
Authors: Shaokang Wang, Pei Fu, Ruoceng Zhang, Shaojie Zhang, Xiuwen Xi, Jiahui Yang, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
First: 2026-01-26T06:29:41+00:00 · Latest: 2026-01-26T06:29:41+00:00
Abstract
While Large Vision-Language Models (LVLMs) have significantly advanced GUI agents' capabilities in parsing textual instructions, interpreting screen content, and executing tasks, a critical challenge persists: the irreversibility of agent operations, where a single erroneous action can trigger catastrophic deviations. To address this, we propose the GUI Action Critic's Data Flywheel System (GAIA), a training framework that enables the models to have iterative critic capabilities, which are used to improve the Test-Time Scaling (TTS) of basic GUI agents' performance. Specifically, we train an Intuitive Critic Model (ICM) using positive and negative action examples from a base agent first. This critic evaluates the immediate correctness of the agent's intended actions, thereby selecting operations with higher success probability. Then, the initial critic guides agent actions to collect refined positive/negative samples, initiating the self-improving cycle. The augmented data then trains a second-round critic with enhanced discernment capability. We conduct experiments on various datasets and demonstrate that the proposed ICM can improve the test-time performance of various closed-source and open-source models, and the performance can be gradually improved as the data is recycled. The code and dataset will be publicly released.
中文标题/摘要
标题:GAIA:一种用于训练GUI测试时缩放批评模型的数据飞轮系统
尽管大型视觉-语言模型(LVLMs)在GUI代理解析文本指令、解释屏幕内容和执行任务方面取得了显著进展,但仍然存在一个关键挑战:代理操作的不可逆性,一个错误的操作可能会导致灾难性的偏差。为了解决这个问题,我们提出了GUI操作批评家的数据飞轮系统(GAIA),这是一种训练框架,使模型能够具备迭代批评能力,用于提高基本GUI代理测试时缩放(TTS)的性能。具体来说,我们首先使用基础代理的正负行动示例训练直觉批评模型(ICM)。该批评家评估代理意图行动的即时正确性,从而选择成功率更高的操作。然后,初始批评家引导代理行动收集精炼的正负样本,启动自我改进循环。增强的数据随后用于训练具有更强辨别能力的第二轮批评家。我们在多个数据集上进行了实验,并证明所提出的ICM可以提高各种闭源和开源模型的测试时性能,并且随着数据的循环利用,性能可以逐步提高。代码和数据集将公开发布。
Summary / 总结
The research aims to address the challenge of irreversibility in GUI agent operations, where a single mistake can lead to significant errors. GAIA, a training framework, is proposed to enable iterative critic capabilities, improving the Test-Time Scaling (TTS) of GUI agents. It trains an Intuitive Critic Model (ICM) using positive and negative action examples to evaluate the correctness of agent actions and guide the collection of refined samples, leading to a self-improving cycle. Experiments show that this approach enhances the performance of various models, with improvements increasing with data recycling.
论文提出了GAIA,一种用于GUI测试时缩放批评模型的训练框架。GAIA旨在提高GUI代理的可靠性,通过使用正负动作示例训练直觉批评模型(ICM)来评估和精炼代理动作。ICM指导更精细数据的收集,这些数据随后用于训练具有更强辨别能力的第二个批评模型。实验表明,这种方法可以提高各种GUI代理的性能,随着数据的循环使用,性能逐步提升。
QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding
Authors: Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Kaiwei Zhang, Jun Jia, Dandan Zhu, Guangtao Zhai, Xiongkuo Min
First: 2026-01-26T06:27:03+00:00 · Latest: 2026-01-26T06:27:03+00:00
Abstract
Visual quality assessment (VQA) is increasingly shifting from scalar score prediction toward interpretable quality understanding, a paradigm that demands fine-grained spatiotemporal perception and auxiliary contextual information. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which involve labor-intensive annotation and are prone to dataset-specific biases. To address these challenges, we propose QualiRAG, a training-free Retrieval-Augmented Generation (RAG) framework that systematically leverages the latent perceptual knowledge of large multimodal models (LMMs) for visual quality perception. Unlike conventional RAG that retrieves from static corpora, QualiRAG dynamically generates auxiliary knowledge by decomposing questions into structured requests and constructing four complementary knowledge sources: visual metadata, subject localization, global quality summaries, and local quality descriptions, followed by relevance-aware retrieval for evidence-grounded reasoning. Extensive experiments show that QualiRAG achieves substantial improvements over open-source general-purpose LMMs and VQA-finetuned LMMs on visual quality understanding tasks, and delivers competitive performance on visual quality comparison tasks, demonstrating robust quality assessment capabilities without any task-specific training. The code will be publicly available at https://github.com/clh124/QualiRAG.
中文标题/摘要
标题:QualiRAG:视觉质量理解的检索增强生成
视觉质量评估(VQA)正越来越多地从预测标量分数转向可解释的质量理解,这一范式要求具备精细的时空感知和辅助上下文信息。当前的方法依赖于在精心策划的指令数据集上进行监督微调或强化学习,这涉及劳动密集型注释,并且容易受到数据集特定偏差的影响。为了解决这些挑战,我们提出了QualiRAG,这是一种无需训练的检索增强生成(Retrieval-Augmented Generation,RAG)框架,该框架系统地利用了大型多模态模型(LMM)的潜在感知知识来进行视觉质量感知。与传统的RAG从静态语料库检索不同,QualiRAG通过将问题分解为结构化请求并构建四种互补的知识来源:视觉元数据、主体定位、全局质量总结和局部质量描述,动态生成辅助知识,然后通过相关性感知检索进行基于证据的推理。广泛的实验表明,QualiRAG在视觉质量理解任务中显著优于开源通用LMM和VQA微调LMM,并在视觉质量比较任务中表现出竞争力,展示了无需任何特定任务训练的稳健的质量评估能力。代码将在https://github.com/clh124/QualiRAG公开发布。
Summary / 总结
QualiRAG is a training-free Retrieval-Augmented Generation framework designed for visual quality understanding, which leverages the latent perceptual knowledge of large multimodal models. It dynamically generates auxiliary knowledge by decomposing questions into structured requests and constructs four complementary knowledge sources: visual metadata, subject localization, global quality summaries, and local quality descriptions. Experiments show that QualiRAG outperforms open-source general-purpose large multimodal models and VQA-finetuned models on visual quality understanding tasks and delivers competitive performance on visual quality comparison tasks without any task-specific training.
QualiRAG 是一个无需训练的检索增强生成框架,旨在进行视觉质量理解,通过大型多模态模型的潜在感知知识来工作。它通过将问题分解为结构化请求并构建四种互补的知识来源(视觉元数据、主体定位、全局质量总结和局部质量描述)来动态生成辅助知识。实验结果表明,QualiRAG 在视觉质量理解任务上显著优于通用的大型多模态模型和 VQA 微调模型,并在视觉质量比较任务上表现出竞争力,无需任何特定任务的训练。
Spatial-Conditioned Reasoning in Long-Egocentric Videos
Authors: James Tribble, Hao Wang, Si-En Hong, Chaoyi Zhou, Ashish Bastola, Siyu Huang, Abolfazl Razi
First: 2026-01-26T03:21:35+00:00 · Latest: 2026-01-26T03:21:35+00:00
Abstract
Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models perform well on image and short-video reasoning, their spatial reasoning capability in long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate their impact on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded representations can improve performance on safety-critical tasks such as pedestrian and obstruction detection.
中文标题/摘要
标题:长视角中心视点视频中的空间条件推理
长时程中心视点视频由于视角漂移和缺乏持久的几何上下文,为视觉导航带来了重大挑战。尽管最近的视觉-语言模型在图像和短视频推理方面表现良好,但在长中心视点序列中的空间推理能力仍然有限。在本文中,我们研究了显式空间信号如何影响基于视觉-语言模型的视频理解,而不修改模型架构或推理过程。我们引入了Sanpo-D,这是对Google Sanpo数据集的细粒度重新注释,并在导航导向的空间查询上对多个视觉-语言模型进行基准测试。为进一步检验输入级归纳偏见,我们还将深度图与RGB帧融合,并评估其对空间推理的影响。我们的结果揭示了通用准确性和空间专业化之间的权衡,表明深度感知和空间定位的表示可以提高行人和障碍物检测等关键任务上的性能。
Summary / 总结
This study addresses the challenges of visual navigation in long-horizon egocentric videos by leveraging spatial signals to enhance visual language models (VLMs). The researchers introduced Sanpo-D, a detailed re-annotation of the Google Sanpo dataset, and evaluated multiple VLMs on navigation-related spatial queries. By fusing depth maps with RGB frames, they found a trade-off between general accuracy and spatial specialization, demonstrating that depth-aware and spatially grounded representations can enhance performance in safety-critical tasks like pedestrian and obstruction detection.
研究针对长时第一人称视角视频中的视觉导航挑战,其中视角漂移和缺乏持久的几何上下文构成了重大障碍。研究引入了Sanpo-D,即对Google Sanpo数据集进行精细重新注释,以评估视觉语言模型(VLMs)的空间推理能力,无需修改模型架构或推理过程。通过将深度图与RGB帧融合,研究显示深度感知和空间定位表示可以提高行人和障碍物检测等关键任务的性能,揭示了一般准确性和空间专业化之间的权衡。
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
Authors: Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryu, Donghyun Kim, Michael S. Ryoo
First: 2024-06-13T17:59:16+00:00 · Latest: 2026-01-26T00:29:47+00:00
Abstract
Long-form videos that span wide temporal intervals are highly redundant in information and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature leverages large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is inefficient and mostly redundant. Motivated by this inefficiency, we propose LVNet, a modular and training-free framework featuring a novel Hierarchical Keyframe Selector (HKS) that efficiently selects a minimal set of informative frames tailored to each question. LVNet's modularity allows easy integration with existing approaches for more efficient LVQA. We achieve state-of-the-art performance among similarly configured models across four benchmark LVQA datasets: EgoSchema, NExT-QA, IntentQA, VideoMME. The code can be found at https://github.com/jongwoopark7978/LVNet
中文标题/摘要
标题:太多帧,非全有用:长视频问答的高效策略
跨越广泛时间间隔的长视频高度信息冗余,并包含多个往往松散关联的独立事件或实体。因此,在进行长视频问答(LVQA)时,生成正确回答所需的所有信息往往可以包含在一小部分帧中。最近的研究利用大型语言模型(LLMs)在LVQA基准测试中取得了出色的表现,依赖视觉语言模型(VLMs)将视频中的所有视觉内容转换为自然语言。这些VLMs通常会独立地对长视频中均匀采样的大量帧进行字幕处理,这既不高效,也往往是冗余的。受此不效率的启发,我们提出了LVNet,这是一种模块化且无需训练的框架,配备了一种新颖的分层关键帧选择器(HKS),能够高效地选择针对每个问题的最小信息性帧集。LVNet的模块化允许其与现有方法轻松集成,以实现更高效的LVQA。我们在四个基准LVQA数据集EgoSchema、NExT-QA、IntentQA、VideoMME上实现了同类配置模型中的最佳性能。代码可在https://github.com/jongwoopark7978/LVNet找到
Summary / 总结
The paper addresses the inefficiency of using all frames in long-form video question answering (LVQA) by proposing LVNet, a modular framework with a Hierarchical Keyframe Selector (HKS) that selects a minimal set of informative frames. This approach reduces redundancy and improves efficiency. Experiments on four benchmark datasets show that LVNet achieves state-of-the-art performance without requiring training.
论文提出了一种模块化框架LVNet,结合了层次关键帧选择器(HKS),能够高效地选择少量关键帧以回答长视频问题。这种方法减少了冗余,提高了效率。LVNet在四个基准数据集EgoSchema、NExT-QA、IntentQA和VideoMME上达到了最先进的性能。
Prefill-Guided Thinking for zero-shot detection of AI-generated images
Authors: Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer
First: 2025-05-20T22:44:04+00:00 · Latest: 2026-01-25T23:38:23+00:00
Abstract
Traditional supervised methods for detecting AI-generated images depend on large, curated datasets for training and fail to generalize to novel, out-of-domain image generators. As an alternative, we explore pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. We evaluate VLM performance on three diverse benchmarks encompassing synthetic images of human faces, objects, and animals produced by 16 different state-of-the-art image generators. While off-the-shelf VLMs perform poorly on these datasets, we find that prefilling responses effectively guides their reasoning -- a method we call Prefill-Guided Thinking (PGT). In particular, prefilling a VLM response with the phrase "Let's examine the style and the synthesis artifacts" improves the Macro F1 scores of three widely used open-source VLMs by up to 24%. We analyze this improvement in detection by tracking answer confidence during response generation. For some models, prefills counteract early overconfidence -- akin to mitigating the Dunning-Kruger effect -- leading to better detection performance.
中文标题/摘要
标题:预填充引导思考在零样本检测AI生成图像中的应用
传统的监督方法依赖于大型、策划的数据集进行训练,无法泛化到新的、域外的图像生成器。作为替代方案,我们探索预训练的视觉-语言模型(VLMs)在零样本检测AI生成图像中的应用。我们评估了VLM在三个多样基准上的性能,这些基准涵盖了由16种不同最先进的图像生成器生成的人脸、物体和动物的合成图像。尽管即用型VLM在这些数据集上表现不佳,但我们发现预填充响应能够有效引导其推理——我们称之为预填充引导思考(PGT)的方法。特别是,用短语“让我们检查风格和合成伪影”预填充VLM的响应,可以将三种广泛使用的开源VLM的宏F1分数提高多达24%。我们通过跟踪生成答案时的信心来分析这种检测改进。对于某些模型,预填充可以抵消早期的过度自信——类似于缓解邓宁-克鲁格效应——从而提高检测性能。
Summary / 总结
The paper addresses zero-shot detection of AI-generated images, motivated by the fact that traditional supervised detectors need large, curated training datasets and fail to generalize to novel image generators. It evaluates pre-trained Vision-Language Models (VLMs) on three diverse benchmarks covering images from 16 generators. The authors introduce Prefill-Guided Thinking (PGT), in which prefilling the VLM response with a phrase such as "Let's examine the style and the synthesis artifacts" improves the Macro F1 scores of three open-source VLMs by up to 24%. For some models, prefills counteract early overconfidence, leading to better detection performance.
论文探讨了零样本检测AI生成图像的问题:该方法依赖预训练的视觉-语言模型(VLMs),而不需要大型训练数据集。它引入了预填充引导思考(PGT)方法,即通过预填充响应来引导VLM的推理,使三种广泛使用的开源VLM的宏F1分数提高多达24%。这种方法有助于缓解模型早期的过度自信,从而提高检测性能。
Mitigating the Modality Gap: Few-Shot Out-of-Distribution Detection with Multi-modal Prototypes and Image Bias Estimation
Authors: Yimu Wang, Evelien Riddell, Adrian Chow, Sean Sedwards, Krzysztof Czarnecki
Venue: WACV 2026
First: 2025-02-02T04:30:51+00:00 · Latest: 2026-01-25T22:18:42+00:00
Comments: WACV 2026
Abstract
Existing vision-language model (VLM)-based methods for out-of-distribution (OOD) detection typically rely on similarity scores between input images and in-distribution (ID) text prototypes. However, the modality gap between image and text often results in high false positive rates, as OOD samples can exhibit high similarity to ID text prototypes. To mitigate the impact of this modality gap, we propose incorporating ID image prototypes along with ID text prototypes. We present theoretical analysis and empirical evidence indicating that this approach enhances VLM-based OOD detection performance without any additional training. To further reduce the gap between image and text, we introduce a novel few-shot tuning framework, SUPREME, comprising biased prompts generation (BPG) and image-text consistency (ITC) modules. BPG enhances image-text fusion and improves generalization by conditioning ID text prototypes on the Gaussian-based estimated image domain bias; ITC reduces the modality gap by minimizing intra- and inter-modal distances. Moreover, inspired by our theoretical and empirical findings, we introduce a novel OOD score $S_{\textit{GMP}}$, leveraging uni- and cross-modal similarities. Finally, we present extensive experiments to demonstrate that SUPREME consistently outperforms existing VLM-based OOD detection methods.
中文标题/摘要
标题:缓解模态差距:基于多模态原型和图像偏置估计的少量样本离分布检测
现有的基于视觉-语言模型(VLM)的分布外(OOD)检测方法通常依赖于输入图像与分布内(ID)文本原型之间的相似性分数。然而,图像和文本之间的模态差距往往导致高误报率,因为OOD样本可能与ID文本原型表现出高度相似性。为了缓解这种模态差距的影响,我们提出结合ID图像原型和ID文本原型。我们提供了理论分析和实验证据,表明这种方法在不进行额外训练的情况下提高了基于VLM的OOD检测性能。为了进一步缩小图像和文本之间的差距,我们引入了一种新颖的少样本调优框架SUPREME,包括偏置提示生成(BPG)模块和图像-文本一致性(ITC)模块。BPG通过以基于高斯估计的图像域偏置作为条件来生成ID文本原型,从而增强图像-文本融合并提高泛化能力;ITC通过最小化模态内和模态间距离来减少模态差距。此外,受到我们的理论和实验发现的启发,我们引入了一种新颖的OOD分数$S_{\textit{GMP}}$,利用单模态和跨模态相似性。最后,我们进行了广泛的实验,证明SUPREME始终优于现有的基于VLM的OOD检测方法。
Summary / 总结
The paper addresses out-of-distribution (OOD) detection with vision-language models, where the modality gap between image and text causes high false positive rates. It proposes using ID image prototypes alongside ID text prototypes, which improves detection without additional training. To narrow the gap further, it introduces SUPREME, a few-shot tuning framework with biased prompts generation (BPG) and image-text consistency (ITC) modules, together with a new OOD score $S_{\textit{GMP}}$ that combines uni- and cross-modal similarities. Extensive experiments show that SUPREME outperforms existing VLM-based OOD detection methods.
该论文针对视觉-语言模型中的分布外(OOD)检测问题,提出结合图像原型和文本原型的方法,以减轻模态差距。所提出的SUPREME方法包括一个偏置提示生成模块,用于根据估计的图像域偏置对文本原型进行条件化,以及一个图像-文本一致性模块,用于减少模态差距。作者引入了一种新的OOD得分$S_{\textit{GMP}}$,利用单模态和跨模态相似性。大量实验表明,SUPREME在OOD检测方面优于现有的基于VLM的方法。
RemEdit: Efficient Diffusion Editing with Riemannian Geometry
Authors: Eashan Adhikarla, Brian D. Davison
Venue: WACV 2026
First: 2026-01-25T17:58:57+00:00 · Latest: 2026-01-25T17:58:57+00:00
Abstract
Controllable image generation is fundamental to the success of modern generative AI, yet it faces a critical trade-off between semantic fidelity and inference speed. The RemEdit diffusion-based framework addresses this trade-off with two synergistic innovations. First, for editing fidelity, we navigate the latent space as a Riemannian manifold. A mamba-based module efficiently learns the manifold's structure, enabling direct and accurate geodesic path computation for smooth semantic edits. This control is further refined by a dual-SLERP blending technique and a goal-aware prompt enrichment pass from a Vision-Language Model. Second, for additional acceleration, we introduce a novel task-specific attention pruning mechanism. A lightweight pruning head learns to retain tokens essential to the edit, enabling effective optimization without the semantic degradation common in content-agnostic approaches. RemEdit surpasses prior state-of-the-art editing frameworks while maintaining real-time performance under 50% pruning. Consequently, RemEdit establishes a new benchmark for practical and powerful image editing. Source code: https://www.github.com/eashanadhikarla/RemEdit.
中文标题/摘要
标题:RemEdit:基于黎曼几何的高效扩散编辑
可控图像生成是现代生成AI成功的关键,但面临着语义保真度和推理速度之间的关键权衡。RemEdit基于扩散的框架通过两项协同创新解决了这一权衡。首先,为了编辑保真度,我们将潜在空间视为黎曼流形进行导航。基于Mamba的模块高效地学习流形结构,从而可以直接而准确地计算测地线路径,实现平滑的语义编辑。这种控制进一步通过双SLERP混合技术和来自视觉语言模型的目标感知提示增强过程进行细化。其次,为了进一步加速,我们引入了一种新的任务特定注意力剪枝机制。一个轻量级的剪枝头学习保留对编辑至关重要的标记,从而实现有效的优化,并避免内容无关方法中常见的语义退化。RemEdit在50%剪枝下仍保持实时性能,超越了先前的最先进的编辑框架。因此,RemEdit为实用且强大的图像编辑设定了新基准。源代码:https://www.github.com/eashanadhikarla/RemEdit.
Summary / 总结
RemEdit addresses the trade-off between semantic fidelity and inference speed in image generation by navigating the latent space as a Riemannian manifold and using a dual-SLERP blending technique. It also introduces a task-specific attention pruning mechanism to accelerate the process. The results show that RemEdit outperforms previous state-of-the-art editing frameworks while maintaining real-time performance even with 50% pruning.
RemEdit通过将潜在空间视为黎曼流形并使用mamba模块进行高效的测地线路径计算,解决了图像生成中语义保真度和推理速度之间的权衡问题。它还引入了一种任务特定的注意力剪枝机制以加速过程。结果表明,RemEdit在保持实时性能的同时,即使在剪枝50%的情况下也超越了之前的最先进的编辑框架。
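For readers unfamiliar with SLERP, the following is a minimal sketch of standard spherical linear interpolation between two latent vectors. It illustrates only the generic building block; the paper's dual-SLERP blending and its learned geodesic computation are not specified in the abstract, so nothing here should be read as the authors' implementation. The function name and the toy vectors are assumptions.

```python
import math

def slerp(v0, v1, t, eps=1e-6):
    """Spherical linear interpolation between vectors v0 and v1 at fraction t.

    The angle is measured between the normalized vectors; the interpolation
    weights are then applied to the original vectors. Inputs must be nonzero.
    """
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the two directions, clamped for safety.
    dot = sum((a / n0) * (b / n1) for a, b in zip(v0, v1))
    dot = max(-1.0, min(1.0, dot))
    omega = math.acos(dot)
    if omega < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]

# Toy example: halfway between two orthogonal unit vectors.
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike linear interpolation, the midpoint of two unit vectors stays on the unit sphere, which is why SLERP is a common choice for blending diffusion latents.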