Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
Authors: Valentin Noël
First: 2026-01-02T18:49:37+00:00 · Latest: 2026-01-02T18:49:37+00:00
Comments: 58 pages, 19 figures, Under Review
Abstract
We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0--95.6\% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93--95\% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
中文标题/摘要
标题:理性几何:有效数学推理的光谱特征
我们提出了一种无需训练的方法,通过注意力模式的光谱分析来检测大型语言模型中的有效数学推理。通过将注意力矩阵视为动态图的邻接矩阵,我们提取了四个可解释的光谱诊断指标:Fiedler值(代数连通性)、高频能量比(HFER)、图信号平滑性和光谱熵,这些指标在有效和无效数学证明之间表现出统计学上的显著差异。在四个独立架构家族(Meta Llama、阿里巴巴 Qwen、微软 Phi 和 Mistral AI)的七个变压器模型上进行的实验表明,这种光谱特征产生的效应大小高达 Cohen's $d = 3.30$ ($p < 10^{-116}$),在严格的评估下可实现 85.0–95.6% 的分类准确率,完整数据集上的校准阈值达到 93–95%。该方法无需训练数据、微调或学习分类器:单一的光谱指标阈值足以实现高准确率。通过系统性的标签修正,我们发现光谱方法检测的是逻辑连贯性而非编译器接受,识别出形式验证器因技术故障而拒绝的数学上有效的证明。我们还发现一种架构依赖性:Mistral-7B 的滑动窗口注意力将区分信号从 HFER 转移到晚期层平滑性($d = 2.09$,$p_{\text{MW}} = 1.16 \times 10^{-48}$),揭示了注意力机制设计影响哪些光谱特征捕捉推理有效性。这些发现确立了光谱图分析作为推理验证的原理性框架,并立即应用于幻觉检测和 AI 安全监控。
Summary / 总结
The study introduces a training-free method for identifying valid mathematical reasoning in large language models using spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, the authors extract four spectral diagnostics: the Fiedler value, high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy. Across seven transformer models from four architectural families, these diagnostics differ significantly between valid and invalid proofs (effect sizes up to Cohen's d = 3.30), giving 85.0-95.6% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93-95% on the full dataset. The study also reports an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer smoothness, indicating that attention mechanism design affects which spectral features capture reasoning validity.
研究提出了一种无需训练的方法,通过分析注意力模式的谱特征来检测大型语言模型中的有效数学推理。通过将注意力矩阵视为动态图的邻接矩阵,提取了四个谱诊断指标——Fiedler值、高频率能量比、图信号平滑性和谱熵,显示出有效和无效证明之间的显著差异。实验表明,七个来自四个架构家族的变压器模型在有效证明分类上的准确率高达85.0–95.6%,且校准阈值为93–95%,并发现架构依赖性会影响某些谱特征的区分信号。
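As a rough illustration of one of the four diagnostics named in the abstract, the sketch below computes a graph signal smoothness score from a single attention matrix. The symmetrization `(A + A^T) / 2`, the combinatorial Laplacian `L = D - W`, and the normalization by signal energy are assumptions made here for illustration. The abstract does not give the paper's exact definitions, and `smoothness` is a hypothetical helper name.

```python
def smoothness(attention, signal):
    """Normalized Laplacian quadratic form x^T L x / x^T x (lower = smoother).

    attention: n x n list of lists of attention weights over tokens.
    signal:    length-n list of scalars, one per token (assumed here to be
               something like one coordinate of the hidden states).
    """
    n = len(attention)
    # Assumption: symmetrize the directed attention graph into undirected weights.
    w = [[0.5 * (attention[i][j] + attention[j][i]) for j in range(n)]
         for i in range(n)]
    # For L = D - W with symmetric W, x^T L x equals the sum over
    # unordered pairs (i < j) of w_ij * (x_i - x_j)^2.
    dirichlet = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dirichlet += w[i][j] * (signal[i] - signal[j]) ** 2
    energy = sum(x * x for x in signal)
    return dirichlet / energy if energy > 0 else 0.0


uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
constant_score = smoothness(uniform, [1.0, 1.0, 1.0])      # no variation across tokens
oscillating_score = smoothness(uniform, [1.0, -1.0, 1.0])  # sign flips between tokens
```

A constant signal scores zero and an oscillating one scores higher, so a single threshold on such a score is the kind of decision rule the abstract describes. Which layer, head, and signal to use is not specified in the text above.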
Categorical Reparameterization with Denoising Diffusion models
Authors: Samson Gourevitch, Alain Durmus, Eric Moulines, Jimmy Olsson, Yazid Janati
First: 2026-01-02T18:30:05+00:00 · Latest: 2026-01-02T18:30:05+00:00
Comments: working paper
Abstract
Gradient-based optimization with categorical variables typically relies on score-function estimators, which are unbiased but noisy, or on continuous relaxations that replace the discrete distribution with a smooth surrogate admitting a pathwise (reparameterized) gradient, at the cost of optimizing a biased, temperature-dependent objective. In this paper, we extend this family of relaxations by introducing a diffusion-based soft reparameterization for categorical distributions. For these distributions, the denoiser under a Gaussian noising process admits a closed form and can be computed efficiently, yielding a training-free diffusion sampler through which we can backpropagate. Our experiments show that the proposed reparameterization trick yields competitive or improved optimization performance on various benchmarks.
中文标题/摘要
标题:带噪扩散模型中的分类重参数化
基于梯度的优化通常依赖于评分函数估计器,这些估计器虽然无偏但噪声较大,或者依赖于连续松弛,用平滑的替代分布替换离散分布,从而允许路径(重参数化)梯度,但代价是优化一个带有温度依赖性的有偏目标。在本文中,我们通过引入基于扩散的软重参数化扩展了这一类松弛方法,对于这些分布,高斯噪声过程下的去噪器具有闭式解且可以高效计算,从而通过训练免费的扩散采样器进行反向传播。我们的实验表明,所提出的重参数化技巧在各种基准测试中提供了竞争力或改进的优化性能。
Summary / 总结
This paper addresses gradient-based optimization over categorical variables with a reparameterization technique based on denoising diffusion. Rather than replacing existing continuous relaxations, it extends that family with a diffusion-based soft reparameterization: for categorical distributions, the denoiser under a Gaussian noising process has a closed form and is cheap to compute, which yields a training-free diffusion sampler that can be backpropagated through. The experiments show competitive or improved optimization performance across various benchmarks.
本文通过引入基于扩散的软重参数化方法,解决了梯度优化中使用类别变量的挑战。与传统的评分函数估计器或连续松弛方法不同,该方法提供了一种无偏且高效的优化类别分布的方式。实验结果表明,所提出的技术在各种基准上实现了竞争性或改进的性能。
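The closed-form denoiser mentioned in the abstract can be illustrated with a short sketch. It assumes a one-hot variable `x_0` with class probabilities `probs` and a noising step `x_t = alpha * x_0 + sigma * eps` with standard Gaussian `eps`. The `alpha`/`sigma` parameterization and the function name are assumptions, not taken from the paper. Under that model, Bayes' rule gives the posterior over classes as a softmax of `log probs + alpha * x_t / sigma^2`, because the terms of the Gaussian log-density that do not depend on the class cancel.

```python
import math


def categorical_denoiser(x_t, log_probs, alpha, sigma):
    """Posterior mean E[x_0 | x_t] for one-hot x_0, with x_t = alpha * x_0 + sigma * eps."""
    # Class-dependent part of the log posterior: log pi_k + alpha * x_t[k] / sigma^2.
    scores = [lp + alpha * x / sigma ** 2 for lp, x in zip(log_probs, x_t)]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


prior = [0.2, 0.3, 0.5]
log_prior = [math.log(p) for p in prior]
at_origin = categorical_denoiser([0.0, 0.0, 0.0], log_prior, alpha=1.0, sigma=1.0)
near_class0 = categorical_denoiser([5.0, 0.0, 0.0], log_prior, alpha=1.0, sigma=0.5)
```

Because the output is a smooth function of `log_probs`, gradients can flow through it in an autodiff framework. How the paper chains such steps into a full sampler is not described in the abstract.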
Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
Authors: Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
First: 2025-12-28T21:57:42+00:00 · Latest: 2026-01-02T18:25:09+00:00
Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
中文标题/摘要
标题:基准成功,临床失败:当强化学习优化基准而非患者
近期用于大型语言模型(LLMs)的强化学习(RL)进展在推理任务上取得了改进,但其在医疗成像领域的资源受限应用仍被严重忽视。我们引入了ChexReason,这是一种通过R1风格方法(SFT后跟GRPO)训练的视觉-语言模型,仅使用了2,000个SFT样本、1,000个RL样本和一个A100 GPU。在CheXpert和NIH基准上的评估揭示了一个根本性的矛盾:GRPO恢复了分布内性能(在CheXpert上提高了23%,宏F1分数为0.346),但降低了跨数据集的可迁移性(在NIH上下降了19%)。这与高资源模型如NV-Reason-CXR-3B的情况相似,表明问题源自RL范式而非规模。我们发现了一种泛化悖论,即SFT检查点在优化前对NIH的性能有所提升,表明教师引导的推理捕捉到了更多机构无关的特征。此外,跨模型比较显示,结构化推理框架对通用视觉语言模型有益,但对医学预训练模型的增益有限。因此,精心策划的监督微调可能在需要跨多样人群稳健性的临床部署中优于激进的RL方法。
Summary / 总结
The study investigates resource-constrained Reinforcement Learning (RL) for medical imaging using ChexReason, a vision-language model trained with SFT followed by GRPO on only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. GRPO improves in-distribution performance on CheXpert by 23% (macro-F1 = 0.346) but reduces cross-dataset transferability, with a 19% drop on NIH, highlighting a fundamental tension between benchmark success and clinical applicability. Because the high-resource NV-Reason-CXR-3B shows the same pattern, the research suggests that the issue arises from the RL paradigm itself rather than model scale. It also identifies a generalization paradox in which the SFT checkpoint uniquely improves on NIH before RL optimization. The findings imply that curated supervised fine-tuning may be more suitable than aggressive RL for clinical deployment requiring robustness across diverse populations.
论文探讨了在医疗影像中应用强化学习(RL)的情况,使用了仅用少量资源训练的视觉-语言模型ChexReason。尽管在CheXpert和NIH基准测试上提高了内部性能,但RL优化却降低了跨数据集的迁移能力,揭示了基准成功与临床应用之间的根本矛盾。研究指出,问题可能出在RL范式本身,而非模型规模,因此精心策划的监督微调可能更适合需要跨多样人群稳健性的临床部署。
Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models
Authors: Shambhavi Mishra, Julio Silva-Rodriguez, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
First: 2024-11-26T00:15:37+00:00 · Latest: 2026-01-02T18:18:27+00:00
Comments: Added additional figures to communicate the algorithm
Abstract
Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we investigate how to efficiently utilize class text information to mitigate distribution drifts encountered by VLMs during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. Specifically, to maintain the regular structure of the dataset properly, we formulate the problem as a batch-wise label assignment, which is efficiently solved using Optimal Transport. Our method, Semantic Anchor Transport (SAT), utilizes such pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT further leverages heterogeneous textual clues, with a multi-template distillation approach that replicates multi-view contrastive learning strategies in unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of SAT, achieving consistent performance gains over recent state-of-the-art methods, yet being computationally efficient.
中文标题/摘要
标题:语义锚点传输:视觉语言模型测试时鲁棒适应
大型预训练视觉语言模型(VLMs),如CLIP,在广泛的任务中展示了前所未有的零样本性能。然而,这些模型在分布变化下可能不可靠,其性能会显著下降。在本文中,我们研究了如何高效地利用类别文本信息来缓解VLMs在推理过程中遇到的分布漂移。特别是,我们提出通过将视觉嵌入与可靠的、基于文本的语义锚点对齐来生成噪声测试样本的伪标签。具体而言,为了保持数据集的正常结构,我们将问题形式化为批量标签分配问题,该问题可以使用最优传输高效求解。我们的方法,语义锚点传输(SAT),利用这些伪标签作为测试时适应的监督信号,提供了一种原理性的跨模态对齐解决方案。此外,SAT 进一步利用了异构文本线索,通过多模板蒸馏方法复制无监督表示学习中的多视图对比学习策略,而不会增加额外的计算复杂度。在多个流行的测试时适应基准上的广泛实验中,SAT 以不同的复杂性展示了其优越性,相对于最近的先进方法实现了持续的性能提升,同时计算效率高。
Summary / 总结
This work addresses the issue of distributional shifts in large pre-trained vision-language models (VLMs) like CLIP, proposing Semantic Anchor Transport (SAT) to mitigate performance degradation. SAT generates pseudo-labels by aligning visual embeddings with reliable text-based semantic anchors using Optimal Transport, and uses these labels for test-time adaptation. The method also incorporates a multi-template distillation approach to enhance cross-modal alignment. Experiments on various benchmarks demonstrate SAT's effectiveness in improving performance over state-of-the-art methods while maintaining computational efficiency.
本文解决了大型预训练视觉-语言模型(VLMs)如CLIP在推理过程中因分布变化而导致性能下降的问题。作者提出了语义锚点传输(SAT),通过使用最优传输将视觉嵌入与可靠的文本语义锚点对齐来生成测试样本的伪标签。SAT然后使用这些伪标签进行测试时的适应,实现了相对于最近的先进方法的一致性能提升,同时保持了计算效率。在多种基准测试上的广泛实验表明,SAT在缓解分布漂移方面具有有效性。
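The batch-wise label assignment described in the abstract can be sketched with entropic Optimal Transport. The use of Sinkhorn iterations, uniform marginals over samples and classes, and the `eps` value are assumptions made for illustration. The abstract only states that the assignment is solved with Optimal Transport, and both function names are hypothetical.

```python
import math


def sinkhorn_assign(similarity, eps=0.1, n_iters=200):
    """Soft assignment of n samples to K classes via entropic OT.

    similarity: n x K list of lists, e.g. cosine similarities between image
                embeddings and text-based anchors.
    Returns an n x K plan whose columns sum to 1/K and whose rows sum to
    approximately 1/n once the iterations have converged.
    """
    n, k = len(similarity), len(similarity[0])
    kernel = [[math.exp(s / eps) for s in row] for row in similarity]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def pseudo_labels(plan):
    """Hard pseudo-label per sample: the class receiving the most mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


sim = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.6]]
plan = sinkhorn_assign(sim)
labels = pseudo_labels(plan)
```

The column constraint is what makes the assignment batch-wise: it spreads mass across classes instead of letting every sample pick its nearest anchor independently.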
Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
First: 2026-01-02T18:17:22+00:00 · Latest: 2026-01-02T18:17:22+00:00
Comments: Accepted at IJCB 2025
Abstract
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we explore the potential of MLLMs for audio deepfake detection. We combine audio inputs with a range of text prompts as queries to assess whether MLLMs can learn robust representations across modalities for this task. To this end, we explore text-aware, context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments suggest that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. The models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
中文标题/摘要
标题:探究在音频换音检测中使用多模态大型语言模型的可行性
尽管视觉-语言模型(VLMs)和多模态大型语言模型(MLLMs)在检测图像和视频换音方面表现出强大的泛化能力,但它们在音频换音检测中的应用尚未得到充分探索。本文旨在探索MLLMs在音频换音检测中的潜力。通过结合音频输入和一系列文本提示作为查询,以确定MLLMs在跨模态学习鲁棒表示方面的能力,特别是针对音频换音检测。因此,我们尝试使用文本感知和语境丰富的问答式提示,并进行二元决策。我们假设这种特征引导的推理将有助于促进更深层次的多模态理解,并使音频换音检测中的特征学习更加稳健。我们评估了两种MLLMs,Qwen2-Audio-7B-Instruct和SALMONN,在两种评估模式下的性能:(a)零样本和(b)微调。我们的实验表明,结合音频与多提示方法可能是音频换音检测的一个可行方向。我们的实验显示,这些模型在没有特定任务训练的情况下表现不佳,并且难以泛化到域外数据。然而,它们在少量监督下对域内数据表现出良好的性能,表明音频换音检测具有良好的潜力。
Summary / 总结
This study investigates the use of Multi-modal Large Language Models (MLLMs) for audio deepfake detection, focusing on text-aware and context-rich prompts. The research evaluates two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in zero-shot and fine-tuned modes. Experiments show that while these models perform poorly without task-specific training, they achieve good performance on in-domain data with minimal supervision, indicating potential for audio deepfake detection.
本研究探讨了多模态大型语言模型(MLLMs)在检测音频深伪方面的应用。通过将音频输入与各种文本提示结合,研究探索了模型在跨模态学习稳健表示方面的能力。研究评估了两种MLLMs,Qwen2-Audio-7B-Instruct和SALMONN,在零样本和微调模式下的表现。结果显示,尽管在没有特定任务训练的情况下模型表现不佳,但在少量监督下它们在领域内数据上表现出良好的性能,表明了在音频深伪检测方面的潜在应用价值。
QUITE: A Query Rewrite System Beyond Rules with LLM Agents
Authors: Yuyang Song, Hanxu Yan, Jiale Lao, Yibo Wang, Yufei Li, Yuanchun Zhou, Jianguo Wang, Mingjie Tang
First: 2025-06-09T11:51:27+00:00 · Latest: 2026-01-02T16:51:25+00:00
Abstract
Query rewrite transforms SQL queries into semantically equivalent forms that run more efficiently. Existing approaches mainly rely on predefined rewrite rules, but they handle a limited subset of queries and can cause performance regressions. This limitation stems from three challenges of rule-based query rewrite: (1) it is hard to discover and verify new rules, (2) fixed rewrite rules do not generalize to new query patterns, and (3) some rewrite techniques cannot be expressed as fixed rules. Motivated by the fact that human experts exhibit significantly better rewrite ability but suffer from scalability, and Large Language Models (LLMs) have demonstrated nearly human-level semantic and reasoning abilities, we propose a new approach of using LLMs to rewrite SQL queries beyond rules. Due to the hallucination problems in LLMs, directly applying LLMs often leads to nonequivalent and suboptimal queries. To address this issue, we propose QUITE (query rewrite), a training-free and feedback-aware system based on LLM agents that rewrites SQL queries into semantically equivalent forms with significantly better performance, covering a broader range of query patterns and rewrite strategies compared to rule-based methods. Firstly, we design a multi-agent framework controlled by a finite state machine (FSM) to equip LLMs with the ability to use external tools and enhance the rewrite process with real-time database feedback. Secondly, we develop a rewrite middleware to enhance the ability of LLMs to generate optimized query equivalents. Finally, we employ a novel hint injection technique to improve execution plans for rewritten queries. Extensive experiments show that QUITE reduces query execution time by up to 35.8% over state-of-the-art approaches and produces 24.1% more rewrites than prior methods, covering query cases that earlier systems did not handle.
中文标题/摘要
标题:QUITE:超越规则的LLM代理查询重写系统
查询重写将SQL查询转换为语义等价的形式,以更高效地运行。现有方法主要依赖预定义的重写规则,但它们只能处理查询的有限子集,并可能导致性能倒退。这种限制源于基于规则的查询重写三个挑战:(1)难以发现和验证新规则,(2)固定的重写规则不能泛化到新的查询模式,(3)一些重写技术无法用固定规则表达。鉴于人类专家在重写方面表现出显著的能力,但存在可扩展性问题,以及大型语言模型(LLMs)展示了接近人类的语义和推理能力,我们提出了一种新的方法,利用LLMs超越规则重写SQL查询。由于LLMs存在幻觉问题,直接应用LLMs通常会导致非等价和次优查询。为解决这一问题,我们提出了QUITE(查询重写),一种基于LLM代理的无需训练且反馈感知的系统,能够将SQL查询转换为语义等价的形式,性能显著提高,覆盖比基于规则方法更广泛的查询模式和重写策略。首先,我们设计了一个由有限状态机(FSM)控制的多代理框架,使LLMs能够使用外部工具,并通过实时数据库反馈增强重写过程。其次,我们开发了一种重写中间件,以增强LLMs生成优化查询等价物的能力。最后,我们采用了一种新颖的提示注入技术,以改进重写查询的执行计划。广泛实验表明,QUITE将查询执行时间减少了高达35.8%,并比最先进的方法多生成了24.1%的重写,涵盖了早期系统未处理的查询案例。
Summary / 总结
The paper proposes QUITE, a system that uses LLM agents to rewrite SQL queries beyond predefined rules, addressing the limitations of rule-based methods. QUITE employs a multi-agent framework with real-time database feedback and a rewrite middleware to enhance query optimization. Experimental results show that QUITE reduces query execution time by up to 35.8% and generates 24.1% more rewrites than previous methods, covering cases that earlier systems could not handle.
论文提出了QUITE系统,该系统使用LLM代理超越预定义规则来重写SQL查询,解决了基于规则的方法的局限性。QUITE采用具有实时数据库反馈的多代理框架和重写中间件来增强查询优化,并引入了一种提示注入技术以改进重写查询的执行计划。实验结果表明,QUITE将查询执行时间最多减少了35.8%,并且生成的重写数量比先前的方法多24.1%,涵盖了早期系统无法处理的查询案例。
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Authors: Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Venue: NeurIPS Spotlight
First: 2025-12-28T12:25:43+00:00 · Latest: 2026-01-02T15:48:28+00:00
Comments: Accepted by NeurIPS as a Spotlight paper. Code: https://github.com/JavisVerse/JavisGPT
Abstract
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
中文标题/摘要
标题:JavisGPT:统一多模态大语言模型用于音视频理解和生成
本文介绍了JavisGPT,这是首个用于联合音视频(JAV)理解和生成的统一多模态大语言模型(MLLM)。JavisGPT具有简洁的编码器-大语言模型-解码器架构,包含一个同步融合模块(SyncFusion)用于时空音视频融合和同步感知可学习查询,以连接预训练的JAV-DiT生成器。此设计使多模态指令下的音视频理解和生成具有时间一致性。我们设计了一个有效的三阶段训练管道,包括多模态预训练、音视频微调和大规模指令微调,逐步从现有的视觉语言模型构建多模态理解和生成。在指令微调方面,我们构建了JavisInst-Omni,这是一个高质量的指令数据集,包含超过20万GPT-4o筛选的音视频文本对话,涵盖了多样性和多层次的理解与生成场景。在音视频理解和生成基准测试中,我们的实验表明JavisGPT在复杂和时间同步的设置中优于现有MLLM。
Summary / 总结
JavisGPT is the first unified multimodal large language model for joint audio-video comprehension and generation. It uses a concise encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal fusion and synchrony-aware queries. The model is trained through a three-stage pipeline, including multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning. Experiments show that JavisGPT outperforms existing models, especially in complex and temporally synchronized settings.
JavisGPT 是首个用于联合音频-视频理解和生成的统一多模态大型语言模型。它采用编码器-LLM-解码器架构,并包含一个 SyncFusion 模块进行音频-视频融合和同步感知查询。JavisGPT 通过包括多模态预训练、音频-视频微调和指令微调的三阶段训练管道进行训练。该模型在 JAV 理解和生成基准测试中优于现有模型,尤其是在复杂和时间同步的设置中。
Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
Authors: Hao Guan, Li Zhou
First: 2026-01-02T15:12:06+00:00 · Latest: 2026-01-02T15:12:06+00:00
Comments: 8 pages, 6 figures
Abstract
Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
中文标题/摘要
标题:病理视觉语言模型在数据偏移下性能退化的检测
视觉语言模型在医学图像分析和疾病诊断中展现了强大的潜力。然而,在部署后,当输入数据分布从开发期间的变化中转移时,它们的性能可能会下降。检测这种性能退化对于临床可靠性至关重要,但对大型预训练VLMs在无标签数据下运行时仍具有挑战性。在本研究中,我们探讨了在先进病理VLM中数据偏移下性能退化的检测。我们研究了输入级数据偏移和输出级预测行为,以了解它们在监控模型可靠性中的各自作用。为了便于系统分析输入数据偏移,我们开发了DomainSAT,一个轻量级的图形界面工具箱,集成了代表性的偏移检测算法,使数据偏移的直观探索成为可能。我们的分析表明,虽然输入数据偏移检测在识别分布变化和提供早期诊断信号方面是有效的,但它并不总是与实际性能退化相对应。基于这一观察,我们进一步研究了基于输出的监控,并引入了一个无标签、基于置信度的退化指标,直接捕捉模型预测置信度的变化。我们发现,该指标与性能退化之间存在密切关系,并且作为输入偏移检测的有效补充。在大规模病理数据集上的肿瘤分类实验表明,结合输入数据偏移检测和基于输出置信度的指标,可以更可靠地检测和解释VLMs在数据偏移下的性能退化。这些发现为监测数字病理学中基础模型的可靠性提供了一个实用且互补的框架。
Summary / 总结
This study investigates performance degradation in a state-of-the-art pathology vision-language model under data shift. It develops DomainSAT, a lightweight toolbox for analyzing input-level data shift and introduces a label-free, confidence-based degradation indicator for output-level monitoring. The research finds that combining input shift detection and output confidence-based indicators enhances the reliability of detecting and interpreting performance degradation in VLMs under data shift, providing a practical framework for monitoring model reliability in digital pathology.
研究探讨了病理视觉语言模型在数据偏移下的性能退化问题,开发了DomainSAT轻量级工具箱来分析输入数据偏移,并引入了无标签、基于置信度的退化指标。研究发现,虽然输入数据偏移检测可以识别分布变化,但并不总是与性能退化相关。结合输入偏移检测与输出置信度基指标,可以提高在数据偏移下检测和解释性能退化的可靠性。这为数字病理学中监控模型可靠性提供了实用框架。
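A label-free, confidence-based indicator of the kind described above could look like the following sketch. It assumes the indicator is the drop in mean top-class softmax probability between a reference batch and a deployment batch. The abstract does not give the paper's exact definition, and both function names are hypothetical.

```python
import math


def mean_max_confidence(logits_batch):
    """Average top-class softmax probability over a batch of logit vectors."""
    total = 0.0
    for logits in logits_batch:
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]  # stable softmax numerators
        total += max(exps) / sum(exps)
    return total / len(logits_batch)


def confidence_drop(reference_logits, deployed_logits):
    """Positive values mean the model is less confident on deployment data."""
    return mean_max_confidence(reference_logits) - mean_max_confidence(deployed_logits)


reference = [[4.0, 0.0], [0.0, 4.0]]  # confident predictions at development time
deployed = [[0.5, 0.0], [0.0, 0.2]]   # near-uniform predictions after a shift
drop = confidence_drop(reference, deployed)
```

No labels are needed on the deployment side, which is why such a score can complement input-level shift detection when monitoring a deployed model.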
Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?
Authors: Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
First: 2025-03-21T12:54:18+00:00 · Latest: 2026-01-02T14:05:54+00:00
Comments: Published in TMLR (12/2025) | OpenReview: https://openreview.net/forum?id=E7HDtLCoT6 | Project page: https://visinf.github.io/beyond-accuracy/
Abstract
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal various new insights such that (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
中文标题/摘要
标题:超越准确性:设计良好行为的图像分类模型有哪些重要方面?
深度学习已成为计算机视觉不可或缺的一部分,深度神经网络(DNNs)在预测性能方面表现出色。然而,它们在其他关键质量维度,如鲁棒性、校准或公平性方面往往表现不佳。虽然现有研究集中在这些质量维度的一部分,但没有研究更广泛形式的“良好行为”。通过这项工作,我们填补了这一空白,同时研究了图像分类中的九种不同质量维度。通过大规模研究,我们通过分析326个骨干模型及其不同训练范式和模型架构对这些质量维度的影响,提供了宏观视角。我们揭示了各种新的见解,例如:(i)视觉-语言模型在ImageNet-1k分类中表现出高类别平衡,并且对领域变化具有很强的鲁棒性;(ii)使用自监督学习获得的权重初始化模型是一种有效策略,可以提高大多数考虑的质量维度;(iii)训练数据集大小是大多数质量维度的主要驱动因素。我们通过引入QUBA评分(超越准确性理解的质量),一种多维度质量的新型度量标准,总结了我们的研究,该度量标准可以根据特定用户需求提供定制化建议。
Summary / 总结
This study addresses the limitations of deep neural networks in computer vision by exploring nine quality dimensions beyond accuracy, such as robustness and fairness. Through a large-scale analysis of 326 backbone models, the research reveals that vision-language models have high class balance and strong robustness, and self-supervised learning can improve most quality dimensions. The study also finds that the size of the training dataset significantly impacts most quality dimensions. A novel metric, QUBA score, is introduced to rank models across multiple quality dimensions, providing tailored recommendations based on specific user needs.
研究通过分析9个质量维度,填补了评估图像分类中深度神经网络良好行为的空白。通过对326个骨干模型的大规模分析,研究发现视觉语言模型在类别平衡和鲁棒性方面表现出色,并且使用自我监督学习初始化可以提高大多数质量维度。研究还发现,训练数据集的大小对大多数质量维度有显著影响。研究引入了QUBA评分(质量理解超越准确性),这是一种新型指标,用于在多个质量维度上对模型进行排名,从而根据特定需求提供定制化建议。
Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image
Authors: Po-Chih Wu
First: 2025-12-28T06:18:22+00:00 · Latest: 2026-01-02T12:29:39+00:00
Abstract
Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we aim to evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. For this purpose, we introduce a new dataset that simulates low-quality images in the real world. In our evaluation experiment, we find that although open-vocabulary object detection models exhibited no significant decrease in mAP scores under low-level image degradation, the performance of all models dropped sharply under high-level image degradation. OWLv2 models consistently performed better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines. We will release our dataset and codes to facilitate future studies.
中文标题/摘要
标题:低质量图像中开放词汇对象检测性能评估
开放词汇对象检测使模型能够定位和识别超出预定义类别集的对象,并期望实现与人类相当的识别能力。在本研究中,我们旨在评估现有模型在低质量图像条件下的开放词汇对象检测任务性能。为此,我们引入了一个新的数据集,模拟了现实世界中的低质量图像。在我们的评估实验中,我们发现尽管开放词汇对象检测模型在低级别图像退化下没有显著降低mAP分数,但在高级别图像退化下所有模型的性能急剧下降。OWLv2模型在不同类型的退化中始终表现更好,而OWL-ViT、GroundingDINO和Detic则显示出显著的性能下降。我们将发布我们的数据集和代码,以促进未来的研究。
Summary / 总结
This study evaluates the performance of open-vocabulary object detection models under low-quality image conditions. A new dataset simulating real-world low-quality images was introduced. The results show that while models maintained similar mAP scores under low-level image degradation, their performance dropped significantly under high-level degradation. OWLv2 models performed consistently better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed substantial declines in performance.
本研究评估了开放词汇对象检测模型在低质量图像条件下的性能。创建了一个模拟真实世界低质量图像的新数据集。结果显示,在低级图像退化下,模型的mAP分数保持相似,但在高级图像退化下性能显著下降。OWLv2模型表现更为一致,而OWL-ViT、GroundingDINO和Detic则显示出显著的性能下降。
CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models
Authors: Neeraj Anand, Samyak Jha, Udbhav Bamba, Rahul Rahaman
First: 2026-01-02T11:39:00+00:00 · Latest: 2026-01-02T11:39:00+00:00
Comments: Accepted at TMLR 2026
Abstract
Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
中文标题/摘要
标题:CRoPS:一种无需训练的视觉语言模型幻觉抑制框架
尽管大型视觉语言模型(LVLMs)取得了快速的成功,但它们生成幻觉内容的倾向一直是一个持续的挑战,这在实际应用中削弱了其可靠性。现有的无需训练的方法虽然可以解决幻觉问题,但存在两个局限性:(i) 它们依赖于幻觉来源的狭窄假设,(ii) 它们在生成过程中后期效果下降,而幻觉最有可能在此时发生。一种常见的策略是通过完全或部分移除视觉标记并将其与原始模型进行对比来构建幻觉模型。然而,这本身是不够的,因为视觉信息仍然会传递到生成的文本中。基于这一洞察,我们提出了一种新的幻觉模型,通过选择性地移除关键文本标记来捕捉幻觉效果。我们进一步引入了广义对比解码,将多个幻觉模型整合在一起以表示多种幻觉来源。这些想法共同构成了CRoPS,一种无需训练的幻觉抑制框架,该框架在CHAIR得分上提高了20%,并在六个基准和三个LVLM家族中实现了持续的收益,超越了最先进的无需训练方法。
Summary / 总结
The research addresses hallucinations in Large Vision-Language Models (LVLMs) by proposing CRoPS, a training-free framework. CRoPS builds a hallucinated model by selectively removing key text tokens and uses Generalized Contrastive Decoding to integrate multiple hallucinated models that represent diverse hallucination sources. This approach improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
研究旨在通过提出CRoPS框架来解决大型视觉语言模型(LVLM)中的幻觉问题。CRoPS通过选择性地移除关键文本令牌并整合多个幻觉模型来表示各种幻觉来源来减轻幻觉。该框架在六个基准和三种LVLM家族中显著提高了20%的CHAIR得分,并且在所有现有最先进的训练免费方法中表现出色。
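The idea of contrasting the original model against several hallucinated models can be sketched as below. The scoring rule follows the common contrastive-decoding form `(1 + sum w_k) * log p - sum w_k * log q_k`. This form, the weights, and the function names are assumptions, since the abstract does not state how CRoPS combines the hallucinated models.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def generalized_contrastive_scores(original_logits, hallucinated_logits_list, weights):
    """Per-token score: (1 + sum_k w_k) * log p(i) - sum_k w_k * log q_k(i)."""
    log_p = log_softmax(original_logits)
    log_qs = [log_softmax(q) for q in hallucinated_logits_list]
    total_w = sum(weights)
    return [
        (1.0 + total_w) * log_p[i] - sum(w * lq[i] for w, lq in zip(weights, log_qs))
        for i in range(len(log_p))
    ]


original = [2.0, 1.0, 0.0]        # next-token logits from the full model
hallucinated = [[0.0, 3.0, 0.0]]  # a degraded model that strongly prefers token 1
scores = generalized_contrastive_scores(original, hallucinated, weights=[1.0])
```

Tokens that the hallucinated models favour are pushed down relative to the original distribution. With an empty list of hallucinated models, the scores reduce to the original log-probabilities.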
DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
Authors: Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He
First: 2026-01-02T09:41:54+00:00 · Latest: 2026-01-02T09:41:54+00:00
Comments: Accepted by TMLR
Abstract
Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance. To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision--language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting. This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA-DPO/.
中文标题/摘要
标题:DA-DPO:成本高效的难度感知偏好优化方法以减少MLLM幻觉
直接偏好优化(DPO)在减轻多模态大型语言模型(MLLMs)幻觉方面显示出强大的潜力。然而,现有的多模态DPO方法往往由于偏好数据中的难度不平衡而容易过拟合。我们的分析表明,MLLMs倾向于过度强调易于区分的偏好对,这阻碍了细粒度幻觉抑制并降低了整体性能。为了解决这一问题,我们提出了难度感知直接偏好优化(DA-DPO),这是一种成本效益高的框架,旨在平衡学习过程。DA-DPO包括两个主要组成部分:(1)难度估计利用预训练的视觉-语言模型,结合生成和对比目标,通过分布感知投票策略整合输出,生成稳健的难度评分,无需额外训练;(2)难度感知训练根据估计的难度重新加权偏好对,降低简单样本的权重,强调更难的样本以缓解过拟合。该框架通过优先处理具有挑战性的示例,使偏好优化更加有效,而无需新数据或额外的微调阶段。广泛的实验表明,DA-DPO在多模态偏好优化中始终表现出色,增强了对幻觉的鲁棒性,并在标准基准测试中实现了更好的泛化能力,同时保持了计算效率。项目页面可在https://artanic30.github.io/project_pages/DA-DPO/访问。
Summary / 总结
The research aims to address the overfitting issue in Direct Preference Optimization (DPO) for Multimodal Large Language Models (MLLMs) by proposing Difficulty-Aware Direct Preference Optimization (DA-DPO). DA-DPO includes a Difficulty Estimation component that uses pre-trained vision-language models to estimate the difficulty of preference pairs, and a Difficulty-Aware Training component that reweights these pairs to focus on harder examples. Experiments show that DA-DPO enhances robustness to hallucinations and improves generalization without additional data or fine-tuning, making it a cost-effective solution.
研究旨在通过提出一种难度感知的直接偏好优化(DA-DPO)框架来解决多模态大型语言模型(MLLMs)中直接偏好优化(DPO)的过拟合问题。DA-DPO 包括一个难度估计组件,利用预训练的视觉-语言模型来估计偏好对的难度,以及一个难度感知训练组件,根据估计的难度重新加权这些偏好对,优先处理更难的例子。实验表明,DA-DPO 提高了对幻觉的鲁棒性并改善了泛化能力,同时不需要额外的数据或微调,保持了计算效率。
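As a rough illustration of the Difficulty-Aware Training idea in the DA-DPO entry above, the sketch below scales a standard per-pair DPO loss by a weight that grows with an estimated difficulty score. The abstract does not state the weighting rule, so the linear weight, the `floor` parameter, the assumption that `difficulty` scores lie in [0, 1], and all function names are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); guard against overflow.
    if margin < -30.0:
        return -margin
    return math.log1p(math.exp(-margin))


def difficulty_weighted_loss(pairs, difficulty, floor=0.2):
    """Weighted average of per-pair DPO losses (hypothetical weighting rule).

    pairs: tuples (pi_chosen, pi_rejected, ref_chosen, ref_rejected).
    difficulty: scores in [0, 1], higher = harder pair (assumed to be given).
    floor: minimum weight, so easy pairs are down-weighted rather than dropped.
    """
    weights = [floor + (1.0 - floor) * d for d in difficulty]
    losses = [dpo_pair_loss(*p) for p in pairs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With equal difficulty scores the function reduces to the plain mean DPO loss; raising the score of a pair shifts the average toward that pair's loss.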
NeedleChain: Measuring Intact Context Comprehension Capability of Large Language Models
Authors: Hyeonseok Moon, Heuiseok Lim
First: 2025-07-30T06:29:50+00:00 · Latest: 2026-01-02T08:24:43+00:00
Comments: 13 pages
Abstract
Recent reports suggest that LLMs can handle increasingly long contexts. However, many existing benchmarks for context understanding embed substantial query-irrelevant content, which shifts evaluation toward retrieving relevant snippets rather than fully integrating all provided information. We argue that, under this setting, current benchmarks can overestimate the true context-understanding ability of LLMs. In particular, we demonstrate that when the context consists entirely of query-relevant text, even advanced models such as GPT-4o fail to reliably integrate inputs as short as 200 tokens. To evaluate this capability more rigorously, we introduce NeedleChain, a benchmark designed to test whether models can faithfully incorporate all given evidence. NeedleChain includes three variants that differ in the required order of comprehension, along with a parallel benchmark based on the needle-in-a-haystack (NIAH) paradigm. By comparing these variants, NeedleChain enables a more comprehensive assessment of context understanding. We further propose ROPE contraction, a training-free strategy that encourages models to reflect all available information, highlighting the importance of full-context integration and pointing to new directions for improving reliable reasoning over context.
中文标题/摘要
标题:NeedleChain:测量大型语言模型完整上下文理解能力
最近的报告表明,LLM可以处理越来越长的上下文。然而,许多现有的上下文理解基准嵌入了大量的与查询无关的内容,这使得评估偏向于检索相关片段,而不是全面整合所有提供的信息。在这种情况下,我们认为当前的基准可能会高估LLM的真实上下文理解能力。特别是,我们证明当上下文完全由查询相关文本组成时,即使是像GPT-4o这样的先进模型也无法可靠地整合长度仅为200个标记的输入。为了更严格地评估这种能力,我们引入了NeedleChain,一个旨在测试模型是否能够忠实整合所有给定证据的基准。NeedleChain包括三个不同理解顺序要求的变体,以及基于针在干草堆(NIAH)范式的平行基准。通过比较这些变体,NeedleChain能够提供更全面的上下文理解评估。我们进一步提出了一种无需训练的策略,鼓励模型反映所有可用信息,ROPE收缩,强调全面上下文整合的重要性,并指出了提高基于上下文可靠推理的新方向。
Summary / 总结
The research aims to evaluate the intact context comprehension capability of large language models (LLMs) by addressing the limitations of existing benchmarks. The study introduces NeedleChain, a benchmark that tests models' ability to integrate all provided information without embedding irrelevant content. Key findings show that even advanced models like GPT-4o struggle to reliably integrate inputs as short as 200 tokens when the context is entirely query-relevant. The benchmark includes variants that differ in the required order of comprehension, allowing for a more comprehensive assessment of context understanding. The study also proposes ROPE contraction as a training-free strategy to encourage full-context integration, highlighting the importance of this capability for reliable reasoning over context.
研究旨在通过解决现有基准的局限性,衡量大型语言模型(LLMs)的完整上下文理解能力。研究引入了NeedleChain基准,用于测试模型在完全查询相关上下文下整合所有提供信息的能力。关键发现表明,即使是如GPT-4o这样的先进模型,在处理200个词左右的输入时也难以可靠地进行整合。该基准包括三个变体和一个基于“针在草堆中”(NIAH)范式的并行基准,以全面评估上下文理解能力。研究还提出了一种无需训练的策略ROPE收缩,以鼓励模型反映所有可用信息,强调了全面上下文整合的重要性,并指出了改进可靠上下文推理的新方向。
ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation
Authors: Shin Seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh
First: 2025-12-29T07:06:57+00:00 · Latest: 2026-01-02T08:13:56+00:00
Abstract
Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemConsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist
中文标题/摘要
标题:ASemConsist: 自适应语义特征控制以实现无需训练的身份一致生成
近期的文本到图像扩散模型显著提升了视觉质量和文本对齐度。然而,在多样场景描述下生成一系列图像并保持一致的人物身份仍然是一个具有挑战性的任务。现有方法往往在保持身份一致性与确保单张图像提示对齐之间存在权衡。在本文中,我们提出了一种新颖的框架ASemconsist,通过选择性地修改文本嵌入,实现对人物身份的显式语义控制,同时不牺牲提示对齐。此外,基于对FLUX中填充嵌入分析,我们提出了一种语义控制策略,将填充嵌入重新利用为语义容器。我们还引入了一种自适应特征共享策略,自动评估文本的模糊性,并仅对模糊身份提示施加约束。最后,我们提出了一种统一的评估协议,一致性质量分数(CQS),将身份保留和单张图像文本对齐整合为一个综合指标,明确捕捉两个指标之间的性能失衡。我们的框架实现了最先进的性能,有效克服了先前的权衡。项目页面:https://minjung-s.github.io/asemconsist
Summary / 总结
The research motivation is to address the challenge of generating a sequence of images with consistent character identity across diverse scene descriptions while maintaining prompt alignment. The main method involves a novel framework, ASemConsist, which uses selective text embedding modification to enable explicit semantic control over character identity. Key experimental findings show that ASemConsist achieves state-of-the-art performance, overcoming previous trade-offs between identity consistency and prompt alignment. The framework also introduces a unified evaluation protocol, the Consistency Quality Score (CQS), which comprehensively captures performance in identity preservation and per-image text alignment.
研究旨在解决在不同场景描述下生成一系列图像时保持角色身份一致性的挑战。ASemConsist框架通过选择性地修改文本嵌入来实现对角色身份的显式语义控制,同时不牺牲提示对齐。关键发现包括达到最先进的性能,并引入了一致性质量评分(CQS)作为统一的评估指标,综合考虑身份保持和每张图像的文本对齐。该框架有效地克服了身份一致性与提示对齐之间的先前权衡。
AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
Authors: Binhe Yu, Zhen Wang, Kexin Li, Yuqian Yuan, Wenqiao Zhang, Long Chen, Juncheng Li, Jun Xiao, Yueting Zhuang
First: 2025-12-29T15:26:25+00:00 · Latest: 2026-01-02T06:21:26+00:00
Abstract
Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as missing or conflicting subjects, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
中文标题/摘要
标题:AnyMS:基于布局引导和无需训练的多主题定制中的自底向上注意力解耦
多主题定制旨在将多个用户指定的主题合成到一个连贯的图像中。为了解决主题缺失或冲突等问题,最近的工作引入了布局指导以提供明确的空间约束。然而,现有方法仍然难以平衡文本对齐、主题身份保留和布局控制这三个关键目标,而对额外训练的依赖进一步限制了其可扩展性和效率。在本文中,我们提出了AnyMS,这是一种新颖的无需训练的布局引导多主题定制框架。AnyMS 利用三种输入条件:文本提示、主题图像和布局约束,并引入了一种自底向上的双层注意力解耦机制,以在生成过程中协调它们的整合。具体而言,全局解耦将文本和视觉条件之间的跨注意力分离,以确保文本对齐。局部解耦将每个主题的注意力限制在其指定区域内,从而防止主题冲突,从而保证身份保留和布局控制。此外,AnyMS 使用预训练的图像适配器来提取与扩散模型对齐的主题特定特征,从而去除主题学习或适配器调优的需要。广泛的实验表明,AnyMS 达到了最先进的性能,支持复杂的组合并扩展到更多的主题。
Summary / 总结
AnyMS is a training-free framework for layout-guided multi-subject customization, addressing the challenges of text alignment, subject identity preservation, and layout control. It uses a bottom-up dual-level attention decoupling mechanism to integrate text prompts, subject images, and layout constraints. Global decoupling ensures text alignment, while local decoupling prevents subject conflicts, preserving identity and layout. Pre-trained image adapters are used to extract subject-specific features, eliminating the need for subject learning or adapter tuning. Experiments show that AnyMS outperforms existing methods in supporting complex compositions and handling a larger number of subjects.
AnyMS 是一个无需训练的多主题定制框架,整合了文本提示、主体图像和布局约束。它采用自底向上的双层注意力解耦机制,确保文本对齐、防止主体冲突并保持身份和布局控制。实验结果表明,AnyMS 在处理复杂组合和扩展到多个主体方面优于现有方法,达到了最先进的性能。
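To make the local decoupling step of the AnyMS entry above more concrete, the following sketch builds one boolean mask per subject that is true only inside that subject's layout box, so attention to a subject's features can be restricted to its designated area. The box format `(x0, y0, x1, y1)` with half-open ranges on a latent grid and the function name are assumptions for illustration; the abstract does not specify how the masks are built or applied.

```python
def local_attention_masks(height, width, boxes):
    """Return masks[k][y][x], True where subject k may receive attention.

    boxes: one (x0, y0, x1, y1) tuple per subject, in grid coordinates,
    covering columns x0..x1-1 and rows y0..y1-1 (hypothetical convention).
    """
    return [
        [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
         for y in range(height)]
        for (x0, y0, x1, y1) in boxes
    ]
```

In an attention layer, positions where a subject's mask is false would typically have their attention logits toward that subject set to a large negative value before the softmax.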
GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
Authors: Mingyu Jeon, Sunjae Yoon, Jonghee Kim, Junyeoung Kim
Venue: AAAI 2026
First: 2026-01-02T06:04:58+00:00 · Latest: 2026-01-02T06:04:58+00:00
Comments: Accepted to AAAI 2026
Abstract
Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
中文标题/摘要
标题:GranAlign:面向零样本视频时刻检索的粒度感知对齐框架
零样本视频时刻检索(ZVMR)是指使用自然语言查询在未剪辑的视频中定位时间片段,而不依赖于特定任务的训练数据。在这种设置中,主要挑战在于文本查询和视觉内容在语义粒度上的不匹配。ZVMR领域的先前研究试图通过利用高质量的预训练知识来实现对齐,这些知识在联合空间中表示视频和语言。然而,这些方法未能平衡每个模态在给定场景中提供的预训练知识之间的语义粒度。因此,尽管每个模态的表示质量很高,但粒度的不匹配导致检索不准确。在本文中,我们提出了一种无需训练的框架,称为粒度感知对齐(GranAlign),以弥合粗粒度和细粒度语义表示之间的差距。我们的方法引入了两种互补的技术:基于粒度的查询重写以生成不同的语义粒度,以及查询感知的字幕生成以将查询意图嵌入到视频内容中。通过将多级查询与查询无关和查询感知的字幕配对,我们有效地解决了语义不匹配问题。因此,我们的方法在所有三个主要基准(QVHighlights、Charades-STA、ActivityNet-Captions)上均达到了新的最佳性能,在具有挑战性的QVHighlights数据集上mAP@avg提高了3.23%。
Summary / 总结
The paper addresses the challenge of zero-shot video moment retrieval by proposing GranAlign, a granularity-aware alignment framework. It introduces query rewriting and query-aware caption generation to balance semantic granularity between textual queries and visual content, improving retrieval accuracy. The method achieves state-of-the-art performance across three benchmarks, particularly showing a 3.23% mAP@avg improvement on QVHighlights.
该研究通过提出GranAlign,一种粒度感知对齐框架,解决了零样本视频片段检索的挑战。它引入了粒度基于的查询重写和查询感知的字幕生成技术,以对齐粗粒度和细粒度的语义表示。该方法在三个基准测试中达到了最先进的性能,在具有挑战性的QVHighlights数据集上实现了3.23%的mAP@avg改进。
OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot
Authors: Junhan Zhu, Hesong Wang, Mingluo Su, Zefang Wang, Huan Wang
First: 2025-10-08T08:19:15+00:00 · Latest: 2026-01-02T06:03:39+00:00
Abstract
Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
中文标题/摘要
标题:OBS-Diff:针对扩散模型的一次性精确剪枝
大规模文本到图像扩散模型虽然强大,但计算成本高昂。现有的一次性网络剪枝方法由于扩散模型的迭代去噪特性,难以直接应用于它们。为解决这一问题,本文提出了一种名为OBS-Diff的新颖一次性剪枝框架,能够实现大规模文本到图像扩散模型的无训练精确压缩。具体而言,(i) OBS-Diff 重新激活经典的 Optimal Brain Surgeon (OBS),将其适应现代扩散模型的复杂架构,并支持多种剪枝粒度,包括无结构、N:M半结构以及结构(MHA头和FFN神经元)稀疏性;(ii) 为使剪枝标准与扩散过程的迭代动态相一致,从误差累积的角度出发,我们提出了一种新颖的时间步感知海森矩阵构建方法,引入了对数递减加权方案,赋予早期时间步更高的重要性,以减轻潜在的误差累积;(iii) 同时,提出了一种计算高效的分组顺序剪枝策略,以摊销昂贵的校准过程。大量实验表明,OBS-Diff 在扩散模型中实现了最先进的一次性剪枝效果,同时在视觉质量上几乎没有退化。
Summary / 总结
This paper addresses the high computational cost of large-scale text-to-image diffusion models by presenting OBS-Diff, a novel one-shot pruning framework. OBS-Diff adapts the Optimal Brain Surgeon method to modern diffusion model architectures and supports unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity. It introduces a timestep-aware Hessian construction with a logarithmic-decrease weighting scheme to align pruning with the iterative dynamics of diffusion models, and proposes an efficient group-wise sequential pruning strategy to amortize the calibration cost. Experiments demonstrate that OBS-Diff achieves state-of-the-art one-shot pruning, providing inference acceleration with minimal degradation in visual quality.
本文针对大规模文本到图像扩散模型的高计算成本,提出了一种新颖的一次性剪枝框架OBS-Diff。OBS-Diff将经典的Optimal Brain Surgeon (OBS)适应到现代扩散模型架构中,支持多种稀疏性类型。它引入了一种时间步感知的Hessian构造方法,并采用对数递减权重方案来使剪枝标准与扩散模型的迭代动态相一致。此外,还提出了一种计算高效的分组顺序剪枝策略以减少校准成本。实验表明,OBS-Diff实现了扩散模型的一次性剪枝的最新成果,提供了显著的推理加速且对视觉质量影响最小。
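The timestep-aware Hessian construction in the OBS-Diff entry above can be sketched as a weighted sum of per-timestep input Gram matrices, with weights that fall logarithmically so that earlier timesteps count more. The abstract does not give the weighting formula; the schedule below, its `w_max` and `w_min` parameters, and the convention that index 0 is the earliest step in sampling order are all assumptions for illustration.

```python
import math


def log_decrease_weights(num_steps, w_max=1.0, w_min=0.1):
    """Weights falling logarithmically from w_max (step 0) to w_min (last step)."""
    if num_steps < 2:
        return [w_max] * num_steps
    return [
        w_max - (w_max - w_min) * math.log(1 + t) / math.log(num_steps)
        for t in range(num_steps)
    ]


def timestep_weighted_hessian(activations_per_step):
    """Accumulate H = sum_t w_t * sum_x (x x^T) over calibration inputs.

    activations_per_step[t] is a list of input vectors seen by one layer
    at denoising step t (index 0 = earliest step, by assumption).
    """
    weights = log_decrease_weights(len(activations_per_step))
    dim = len(activations_per_step[0][0])
    hessian = [[0.0] * dim for _ in range(dim)]
    for w, batch in zip(weights, activations_per_step):
        for x in batch:
            for i in range(dim):
                for j in range(dim):
                    hessian[i][j] += w * x[i] * x[j]
    return hessian
```

The resulting matrix plays the role of the layer-wise Hessian that an OBS-style pruner would invert to score and compensate removed weights.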
FreeText: Training-Free Text Rendering in Diffusion Transformers via Attention Localization and Spectral Glyph Injection
Authors: Ruiqiang Zhang, Hengyi Wang, Chang Liu, Guanjie Wang, Zehua Ma, Weiming Zhang
First: 2026-01-02T02:36:48+00:00 · Latest: 2026-01-02T02:36:48+00:00
Abstract
Large-scale text-to-image (T2I) diffusion models excel at open-domain synthesis but still struggle with precise text rendering, especially for multi-line layouts, dense typography, and long-tailed scripts such as Chinese. Prior solutions typically require costly retraining or rigid external layout constraints, which can degrade aesthetics and limit flexibility. We propose FreeText, a training-free, plug-and-play framework that improves text rendering by exploiting intrinsic mechanisms of Diffusion Transformer (DiT) models. FreeText decomposes the problem into "where to write" and "what to write". For "where to write", we localize writing regions by reading token-wise spatial attribution from endogenous image-to-text attention, using sink-like tokens as stable spatial anchors and topology-aware refinement to produce high-confidence masks. For "what to write", we introduce Spectral-Modulated Glyph Injection (SGMI), which injects a noise-aligned glyph prior with frequency-domain band-pass modulation to strengthen glyph structure and suppress semantic leakage (rendering the concept instead of the word). Extensive experiments on Qwen-Image, FLUX.1-dev, and SD3 variants across longText-Benchmark, CVTG, and our CLT-Bench show consistent gains in text readability while largely preserving semantic alignment and aesthetic quality, with modest inference overhead.
中文标题/摘要
标题:FreeText:通过注意力定位和频谱字符注入在扩散变换器中的无训练文本渲染
大规模文本到图像(T2I)扩散模型在开放域合成方面表现出色,但在精确文本渲染方面仍然存在困难,尤其是在多行布局、密集排版和长尾字体(如中文)方面。先前的解决方案通常需要昂贵的重新训练或刚性外部布局约束,这可能会降低美学效果并限制灵活性。我们提出了一种名为FreeText的无训练、即插即用框架,通过利用扩散变换器(DiT)模型的内在机制来提高文本渲染效果。FreeText将问题分解为在哪里写和写什么。对于在哪里写,我们通过读取端生图像到文本注意力的逐词空间注意力来定位书写区域,使用汇流型标记作为稳定的空间锚点,并使用拓扑感知细化来生成高置信度的掩码。对于写什么,我们引入了频谱调制字符注入(SGMI),它通过频域带通调制注入与噪声对齐的字符先验,以增强字符结构并抑制语义泄漏(渲染概念而非单词)。在Qwen-Image、FLUX.1-dev和SD3变体的长文本基准、CVTG和我们自己的CLT-Bench上进行的大量实验显示,在很大程度上保持语义对齐和美学质量的同时,一致地提高了文本可读性,且推理开销较小。
Summary / 总结
FreeText is a training-free framework that enhances text rendering in diffusion transformers by localizing writing regions and injecting spectral-modulated glyph priors. It decomposes the problem into determining where to write and what to write. For where to write, it uses token-wise spatial attribution and sink-like tokens to produce high-confidence masks. For what to write, it introduces Spectral-Modulated Glyph Injection (SGMI) to strengthen glyph structure and suppress semantic leakage. Experiments show consistent improvements in text readability while maintaining semantic alignment and aesthetic quality with minimal inference overhead.
FreeText 是一个无需训练的框架,通过利用扩散变换器的内在机制来增强文本渲染。它解决了多行布局和长尾字体的精确文本渲染难题。FreeText 将问题分解为确定在哪里写和写什么。在哪里写的部分使用标记间的空间注意力来定位书写区域,并使用sink-like标记作为稳定的空间锚点。写什么的部分引入了频域带通调制的光栅注入(SGMI),以增强光栅结构并减少语义泄漏。实验表明,在保持语义对齐和美学质量的同时,文本可读性得到了一致的提升。
FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Authors: Zebin Yao, Lei Ren, Huixing Jiang, Wei Chen, Xiaojie Wang, Ruifan Li, Fangxiang Feng
First: 2025-04-22T14:55:23+00:00 · Latest: 2026-01-02T01:08:59+00:00
Comments: Code: https://github.com/Nihukat/FreeGraftor
Abstract
Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
中文标题/摘要
标题:FreeGraftor:无需训练的跨图像特征嫁接以实现主题驱动的文本到图像生成
主题驱动的图像生成旨在从参考图像中合成新颖的场景,同时忠实保留主题身份并遵循文本指导。然而,现有方法在保真度和效率之间面临关键权衡。基于调优的方法依赖于耗时且资源密集的主题特定优化,而零样本方法往往无法保持足够的主题一致性。在本文中,我们提出了一种无需训练的FreeGraftor框架,通过跨图像特征嫁接来解决这些限制。具体而言,FreeGraftor利用语义匹配和位置约束注意力融合将参考主题的视觉细节转移到生成图像中。此外,我们的框架引入了一种新颖的噪声初始化策略,以保留参考主题的几何先验,促进稳健的特征匹配。广泛的定性和定量实验表明,我们的方法能够实现精确的主题身份转移,同时保持文本对齐的场景合成。无需进行模型微调或额外训练,FreeGraftor在主题保真度和文本对齐方面显著优于现有零样本和无需训练的方法。此外,我们的框架可以无缝扩展到多主题生成,使其适用于实际部署。我们的代码可在https://github.com/Nihukat/FreeGraftor获取。
Summary / 总结
FreeGraftor is a training-free framework for subject-driven text-to-image generation that uses cross-image feature grafting to transfer visual details from reference subjects to generated images. It employs semantic matching and position-constrained attention fusion to maintain subject identity while adhering to textual guidance. Experimental results show that FreeGraftor outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment without requiring model fine-tuning or additional training.
FreeGraftor 是一个无需训练的框架,用于从参考图像中提取视觉细节并转移到生成图像中,以实现主题驱动的文字到图像生成。它使用语义匹配和位置约束注意力融合来保持主题身份并遵循文本指导。实验结果表明,FreeGraftor 在主题保真度和文本对齐方面优于现有零样本和无需训练的方法,且无需进行模型微调。此外,它还可以处理多主题生成,使其适用于实际应用。
CPPO: Contrastive Perception for Vision Language Policy Optimization
Authors: Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari
First: 2026-01-01T22:48:26+00:00 · Latest: 2026-01-01T22:48:26+00:00
Abstract
We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods while avoiding extra models, making training more efficient and scalable.
中文标题/摘要
标题:CPPO:对比感知的视觉语言策略优化
我们介绍了CPPO,一种用于微调视觉语言模型(VLMs)的对比感知策略优化方法。尽管强化学习(RL)在语言模型中推进了推理能力,但将其扩展到多模态推理需要同时提高感知和推理方面的能力。先前的工作主要通过显式的感知奖励来应对这一挑战,但分离感知令牌和推理令牌是困难的,需要额外的LLM、真实数据、通过策略模型强制分离感知与推理,或对所有输出令牌不分青红皂白地应用奖励。CPPO 通过检测在扰动输入图像下模型输出的熵移动生成感知令牌来解决这一问题。CPPO 然后通过对比感知损失(CPL)扩展了RL目标函数,该损失在信息保持扰动下强制一致性,在信息去除扰动下增强敏感性。实验表明,CPPO 超越了之前的感知奖励方法,同时避免了额外模型的使用,使训练更加高效和可扩展。
Summary / 总结
CPPO is a Contrastive Perception Policy Optimization method designed to fine-tune vision-language models. It addresses the challenge of multimodal reasoning by improving both perception and reasoning aspects, without relying on explicit perception rewards. Instead, CPPO detects perception tokens through entropy shifts in model outputs under perturbed images and uses a Contrastive Perception Loss to enforce consistency and sensitivity. Experiments demonstrate that CPPO outperforms previous methods while avoiding the need for additional models, making training more efficient and scalable.
CPPO 是一种对比感知策略优化方法,用于微调视觉语言模型。它通过检测扰动图像下模型输出的熵变化来识别感知令牌,并通过对比感知损失增强一致性与敏感性,从而解决感知和推理令牌难以分离的问题。实验表明,CPPO 在不需要额外模型的情况下超越了之前的感知奖励方法,使得训练更加高效和可扩展。
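The perception-token detection step in the CPPO entry above can be illustrated with a small sketch: compute the entropy of each next-token distribution with the original image and with a perturbed image, then flag positions whose entropy changes strongly. The abstract only says "entropy shifts", so the direction of the shift (an increase under perturbation), the fixed `threshold`, and the function names are assumptions rather than the authors' procedure.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def perception_token_indices(clean_dists, perturbed_dists, threshold=0.5):
    """Positions whose entropy rises by more than `threshold` nats when
    the input image is perturbed (hypothetical selection rule).

    clean_dists / perturbed_dists: per-position probability lists for the
    same output tokens, scored with the clean and the perturbed image.
    """
    shifts = [entropy(q) - entropy(p)
              for p, q in zip(clean_dists, perturbed_dists)]
    return [i for i, s in enumerate(shifts) if s > threshold]
```

Tokens flagged this way are the ones whose prediction depends on visual content, which is where a perception-specific loss term would then be applied.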
Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-training
Authors: Yi Liu, Sukai Wang, Dafeng Wei, Xiaowei Cai, Linqing Zhong, Jiange Yang, Guanghui Ren, Jinyu Zhang, Maoqing Yao, Chuankang Li, Xindong He, Liliang Chen, Jianlan Luo
First: 2025-12-30T10:18:42+00:00 · Latest: 2026-01-01T17:42:44+00:00
Abstract
General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/
中文标题/摘要
标题:通过自回归离散预训练实现统一的具身VLM推理与机器人动作
通用的机器人系统在开放世界环境中运行时,必须实现广泛的泛化和高精度的动作执行,而现有视觉-语言-动作(VLA)模型难以同时实现这一目标。虽然大型视觉-语言模型(VLMs)提高了语义泛化能力,但缺乏具身推理会导致行为脆弱,反之亦然,仅强推理而不具备精确控制也是不够的。为了从定量角度评估这一瓶颈,我们引入了具身推理智能商(ERIQ),这是一个大规模的机器人操作具身推理基准,包含四个推理维度的6000多对问题-答案。通过将推理与执行分离,ERIQ 使系统性评估成为可能,并揭示了具身推理能力和端到端VLA泛化之间的强烈正相关性。为了弥合理论推理与精确执行之间的差距,我们提出了FACT,一种基于流匹配的动作分词器,将连续控制转换为离散序列,同时保持高保真轨迹重建。由此产生的GenieReasoner 在统一空间中同时优化推理和动作,优于现有的连续动作和先前的离散动作基线。ERIQ 和 FACT 一起提供了一个诊断和克服推理-精度权衡的原理框架,推动了稳健的通用机器人操作的发展。项目页面:https://geniereasoner.github.io/GenieReasoner/
Summary / 总结
The research aims to address the challenge of achieving both broad generalization and high-precision action execution in robotic systems. To evaluate embodied reasoning, the authors introduce ERIQ, a benchmark with 6K+ question-answer pairs. They propose FACT, a flow-matching-based action tokenizer, to bridge the gap between reasoning and precise execution, leading to better performance in real-world tasks compared to previous methods. The GenieReasoner, which jointly optimizes reasoning and action, outperforms both continuous-action and discrete-action baselines.
研究旨在解决机器人系统在开放环境中实现广泛泛化和高精度动作执行的挑战。为了评估体态推理能力,作者引入了ERIQ基准,包含6K多对问题-答案。他们提出了基于流匹配的动作分词器FACT,将连续控制转换为离散序列,从而在统一空间中同时优化推理和动作。GenieReasoner模型在实际任务中优于连续动作和先前的离散动作基线,推动了稳健的通用机器人操作的发展。
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Authors: GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Haochen Li, Jiale Zhu, Jiali Chen, Jiaxing Xu, Jiazheng Xu, Jing Chen, Jinghao Lin, Jinhao Chen, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Ruiliang Lyu, Shangqin Tu, Sheng Yang, Shengbiao Meng, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wei Jia, Wenkai Li, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyu Zhang, Xinyue Fan, Xuancheng Huang, Yadong Xue, Yanfeng Wang, Yanling Wang, Yanzi Wang, Yifan An, Yifan Du, Yiheng Huang, Yilin Niu, Yiming Shi, Yu Wang, Yuan Wang, Yuanchang Yue, Yuchen Li, Yusen Liu, Yutao Zhang, Yuting Wang, Yuxuan Zhang, Zhao Xue, Zhengxiao Du, Zhenyu Hou, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang
First: 2025-07-01T17:55:04+00:00 · Latest: 2026-01-01T13:07:25+00:00
Abstract
We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. A brief overview is available at https://z.ai/blog/glm-4.6v. Code, models and more information are released at https://github.com/zai-org/GLM-V.
中文标题/摘要
标题:GLM-4.5V和GLM-4.1V-Thinking:迈向具有可扩展强化学习的多功能多模态推理
我们介绍了GLM-4.1V-Thinking、GLM-4.5V和GLM-4.6V,这是一个旨在推进通用多模态理解和推理的视觉-语言模型(VLM)家族。在本报告中,我们分享了我们在以推理为中心的训练框架开发中的关键发现。我们首先通过大规模预训练开发了一个强大的视觉基础模型,这可能为最终性能设定了上限。然后,我们提出了基于课程采样的强化学习(RLCS),以充分发挥模型的潜力,从而在包括STEM问题解决、视频理解、内容识别、编程、语义分割、基于GUI的代理和长文档解释等多种任务中实现全面的能力提升。在对42个公开基准的全面评估中,GLM-4.5V在几乎所有任务中均实现了开源模型中的最佳性能,并在包括编程和GUI代理在内的挑战性任务中表现出与闭源模型Gemini-2.5-Flash相当甚至更优的结果。同时,较小的GLM-4.1V-9B-Thinking在29个基准中仍保持高度竞争力,其结果优于更大的Qwen2.5-VL-72B。我们开源了GLM-4.1V-9B-Thinking和GLM-4.5V。我们还介绍了GLM-4.6V系列,这是一个具有原生工具使用能力和128K上下文窗口的开源多模态模型。有关概述,请参阅https://z.ai/blog/glm-4.6v。代码、模型和更多信息可在https://github.com/zai-org/GLM-V上发布。
ReMA: A Training-Free Plug-and-Play Mixing Augmentation for Video Behavior Recognition
Authors: Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Jinglong Guo, Qifan Cai, Xin Yan, Zhi Liu
First: 2026-01-01T11:20:19+00:00 · Latest: 2026-01-01T11:20:19+00:00
Abstract
Video behavior recognition demands stable and discriminative representations under complex spatiotemporal variations. However, prevailing data augmentation strategies for videos remain largely perturbation-driven, often introducing uncontrolled variations that amplify non-discriminative factors, which ultimately weakens intra-class distributional structure and causes representation drift, with inconsistent gains across temporal scales. To address these problems, we propose Representation-aware Mixing Augmentation (ReMA), a plug-and-play augmentation strategy that formulates mixing as a controlled replacement process to expand representations while preserving class-conditional stability. ReMA integrates two complementary mechanisms. First, the Representation Alignment Mechanism (RAM) performs structured intra-class mixing under distributional alignment constraints, suppressing irrelevant intra-class drift while enhancing statistical reliability. Second, the Dynamic Selection Mechanism (DSM) generates motion-aware spatiotemporal masks to localize perturbations, guiding them away from discrimination-sensitive regions and promoting temporal coherence. By jointly controlling how and where mixing is applied, ReMA improves representation robustness without additional supervision or trainable parameters. Extensive experiments on diverse video behavior benchmarks demonstrate that ReMA consistently enhances generalization and robustness across different spatiotemporal granularities.
中文标题/摘要
标题:ReMA:一种无需训练的即插即用混合增强方法用于视频行为识别
视频行为识别需要在复杂的空间-时间变化下保持稳定和区分性的表示。然而,现有的视频数据增强策略主要依赖于扰动驱动,常常引入无法控制的变化,放大非区分性因素,最终削弱类内分布结构并导致表示漂移,且在不同时间尺度上的增益不一致。为了解决这些问题,我们提出了 Representation-aware Mixing Augmentation (ReMA),这是一种即插即用的增强策略,将混合表述为一种受控的替换过程,以扩展表示同时保持类条件稳定性。ReMA 结合了两种互补机制。首先,Representation Alignment Mechanism (RAM) 在分布对齐约束下执行结构化的类内混合,抑制无关的类内漂移并增强统计可靠性。然后,Dynamic Selection Mechanism (DSM) 生成运动感知的空间-时间掩码来定位扰动,引导它们远离区分敏感区域并促进时间一致性。通过联合控制混合的方式与位置,ReMA 在无需额外监督或可训练参数的情况下提高了表示的鲁棒性。在多种视频行为基准上的广泛实验表明,ReMA 在不同空间-时间粒度下一致地增强了泛化能力和鲁棒性。
Summary / 总结
ReMA is a training-free video behavior recognition augmentation method that addresses the issue of uncontrolled variations in existing data augmentation strategies. It introduces Representation Alignment Mechanism (RAM) for structured intra-class mixing and Dynamic Selection Mechanism (DSM) for motion-aware spatiotemporal perturbation localization. Experiments show that ReMA improves representation robustness and generalization across various spatiotemporal scales without additional supervision or parameters.
论文提出了一种名为ReMA的无训练视频行为识别增强方法,解决了现有数据增强技术中无法控制的变异问题。ReMA采用插件式方法,包含两种机制:Representation Alignment Mechanism (RAM) 进行结构化的类内混合和Dynamic Selection Mechanism (DSM) 进行运动感知的时空扰动定位。实验表明,ReMA在不同时空粒度下提高了表示的鲁棒性和泛化能力,且无需额外的监督或可训练参数。
FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering
Authors: Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu
First: 2026-01-01T09:19:39+00:00 · Latest: 2026-01-01T09:19:39+00:00
Comments: 14 pages, 9 figures, 5 tables
Abstract
Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
中文标题/摘要
标题:FaithSCAN:基于模型驱动的一次通过幻觉检测方法以实现忠实的视觉问答
视觉问答中的忠实性幻觉发生在视觉语言模型生成流畅但与视觉现实脱节的答案时,严重削弱了其在安全关键应用中的可靠性。现有检测方法主要分为两类:依赖辅助模型或知识库的外部验证方法,以及使用重复采样或不确定性估计的不确定性驱动方法。前者面临高计算开销的问题,并受限于外部资源质量,而后者仅捕捉模型不确定性的一小部分,并未能充分探索与多种失败模式相关的丰富内部信号。这两种范式在效率、鲁棒性和检测性能方面都存在固有的局限性。为应对这些挑战,我们提出FaithSCAN:一种轻量级网络,通过利用视觉语言模型的丰富内部信号(包括标记级解码不确定性、中间视觉表示和跨模态对齐特征)来检测幻觉。这些信号通过分支级证据编码和不确定性感知注意力进行融合。我们还将LLM作为裁判的范式扩展到视觉问答幻觉,并提出了一种低成本策略,自动生成模型依赖的监督信号,从而在无需昂贵的人工标签的情况下进行监督训练,同时保持高检测准确性。在多个视觉问答基准上的实验表明,FaithSCAN在效果和效率上均显著优于现有方法。深入分析表明,幻觉源于视觉感知、跨模态推理和语言解码中的系统性内部状态变化。不同的内部信号提供了互补的诊断线索,而幻觉模式在不同的视觉语言模型架构之间有所不同,为多模态幻觉的根本原因提供了新的见解。
Summary / 总结
FaithSCAN is a lightweight network designed to detect faithfulness hallucinations in Visual Question Answering (VQA) by leveraging rich internal signals from vision-language models, such as token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. It uses branch-wise evidence encoding and uncertainty-aware attention to fuse these signals. FaithSCAN also extends the LLM-as-a-Judge paradigm to automatically generate model-dependent supervision signals, enabling supervised training without human labels. Experiments show that FaithSCAN outperforms existing methods in both effectiveness and efficiency, with hallucinations arising from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding.
FaithSCAN 是一个轻量级网络,通过利用视觉语言模型中的丰富内部信号(如 token 级解码不确定性、中间视觉表示和跨模态对齐特征)来检测视觉问答中的忠实性幻觉。它使用分支级证据编码和不确定性感知注意力来融合这些信号。此外,FaithSCAN 还将 LLM 作为裁判的范式扩展到视觉问答幻觉检测,并提出了一种低成本策略来自动生成模型依赖的监督信号,从而实现高效监督训练。实验表明,FaithSCAN 在效果和效率上都优于现有方法,幻觉来源于视觉感知、跨模态推理和语言解码中的系统内部状态变化,不同的内部信号提供了互补的诊断线索。
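One common way to quantify the token-level decoding uncertainty mentioned above is the Shannon entropy of each next-token distribution. The sketch below is illustrative only: the abstract does not say which uncertainty measure FaithSCAN uses, and the function names `token_entropy` and `sequence_uncertainty` and the toy probabilities are assumptions.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total  # renormalise in case of rounding error
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


def sequence_uncertainty(step_probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-step entropies for a decoded answer (hypothetical helper)."""
    return [token_entropy(p) for p in step_probs]


# Toy answer: three decoding steps over a 4-token vocabulary (made-up numbers).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.10, 0.10, 0.10],  # in between
]
entropies = sequence_uncertainty(steps)
print([round(h, 3) for h in entropies])
```

A detector in this style would treat such per-token values as one input feature among several, alongside the visual and cross-modal signals the abstract lists.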
ActErase: A Training-Free Paradigm for Precise Concept Erasure via Activation Patching
Authors: Yi Sun, Xinhao Zhong, Hongyan Li, Yimin Zhou, Junhao Li, Bin Chen, Xuan Wang
First: 2026-01-01T09:11:09+00:00 · Latest: 2026-01-01T09:11:09+00:00
Abstract
Recent advances in text-to-image diffusion models have demonstrated remarkable generation capabilities, yet they raise significant concerns regarding safety, copyright, and ethical implications. Existing concept erasure methods address these risks by removing sensitive concepts from pre-trained models, but most of them rely on data-intensive and computationally expensive fine-tuning, which poses a critical limitation. To overcome these challenges, inspired by the observation that the model's activations are predominantly composed of generic concepts, with only a minimal component representing the target concept, we propose a novel training-free method (ActErase) for efficient concept erasure. Specifically, the proposed method operates by identifying activation difference regions via prompt-pair analysis, extracting target activations and dynamically replacing input activations during forward passes. Comprehensive evaluations across three critical erasure tasks (nudity, artistic style, and object removal) demonstrate that our training-free method achieves state-of-the-art (SOTA) erasure performance, while effectively preserving the model's overall generative capability. Our approach also exhibits strong robustness against adversarial attacks, establishing a new plug-and-play paradigm for lightweight yet effective concept manipulation in diffusion models.
中文标题/摘要
标题:ActErase: 一种基于激活补丁的无需训练框架以精确消除概念
文本到图像扩散模型的最新进展展示了卓越的生成能力,但同时也引发了关于安全、版权和伦理问题的重大关切。现有的概念消除方法通过从预训练模型中移除敏感概念来应对这些风险,但大多数方法依赖于数据密集型和计算成本高昂的微调,这构成了一个关键限制。为克服这些挑战,受模型激活主要由通用概念组成,仅有一小部分代表目标概念这一观察的启发,我们提出了一种新颖的无需训练方法(ActErase)以高效地消除概念。具体而言,该方法通过提示对分析识别激活差异区域,在前向传递过程中提取目标激活并动态替换输入激活。在三个关键消除任务(裸体、艺术风格和对象移除)上的全面评估表明,我们的无需训练方法实现了最先进的(SOTA)消除性能,同时有效地保留了模型的整体生成能力。我们的方法还表现出强大的对抗攻击鲁棒性,确立了轻量级且有效的扩散模型概念操控的新即插即用范式。
Summary / 总结
The paper introduces ActErase, a training-free method for concept erasure in text-to-image diffusion models. It addresses safety, copyright, and ethical concerns by removing sensitive concepts without fine-tuning, using activation patching. ActErase identifies activation difference regions via prompt-pair analysis, extracts target activations, and dynamically replaces input activations during forward passes. Across nudity, artistic style, and object removal tasks, it achieves state-of-the-art erasure performance while preserving generative capability and remaining robust to adversarial attacks.
论文提出了ActErase,一种无需训练的精确概念擦除方法,利用激活区域替换来移除敏感概念而不进行微调。该方法在三个关键任务(裸体、艺术风格和物体移除)中表现出色,同时保持模型的整体生成能力。该方法还对对抗攻击具有很强的鲁棒性,提供了一种轻量级且有效的概念操控方案,适用于扩散模型。
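The activation-patching idea described above can be sketched on toy 1-D activations. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the thresholding rule or which activations are used as the replacement, so the `threshold` parameter, the choice of the concept-free prompt's activations as the replacement source, and the helper names are all hypothetical.

```python
from typing import List, Sequence


def difference_mask(
    concept_act: Sequence[float],
    neutral_act: Sequence[float],
    threshold: float,
) -> List[bool]:
    """Mark positions where the prompt pair's activations differ strongly."""
    if len(concept_act) != len(neutral_act):
        raise ValueError("activation vectors must have the same length")
    return [abs(c - n) > threshold for c, n in zip(concept_act, neutral_act)]


def patch_activations(
    input_act: Sequence[float],
    replacement_act: Sequence[float],
    mask: Sequence[bool],
) -> List[float]:
    """Overwrite masked positions of input_act with replacement_act."""
    return [r if m else x for x, r, m in zip(input_act, replacement_act, mask)]


# Made-up activations for a prompt pair that differs only in the target concept.
concept_act = [0.2, 1.9, 0.1, -1.4, 0.3]   # prompt containing the concept
neutral_act = [0.2, 0.1, 0.1, 0.2, 0.3]    # same prompt without the concept
mask = difference_mask(concept_act, neutral_act, threshold=0.5)

# Activations observed during a new forward pass (also made up).
input_act = [0.5, 2.1, -0.3, -1.2, 0.0]
patched = patch_activations(input_act, neutral_act, mask)
print(mask)
print(patched)
```

Only the positions flagged by the prompt pair are touched; every other activation passes through unchanged, which is the intuition behind preserving overall generative capability.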
TotalFM: An Organ-Separated Framework for 3D-CT Vision Foundation Models
Authors: Kohei Yamamoto, Tomohiro Kikuchi
First: 2026-01-01T08:27:01+00:00 · Latest: 2026-01-01T08:27:01+00:00
Abstract
While foundation models in radiology are expected to be applied to various clinical tasks, computational cost constraints remain a major challenge when training on 3D-CT volumetric data. In this study, we propose TotalFM, a radiological foundation model that efficiently learns the correspondence between 3D-CT images and linguistic expressions based on the concept of organ separation, utilizing a large-scale dataset of 140,000 series. By automating the creation of organ volume and finding-sentence pairs through segmentation techniques and Large Language Model (LLM)-based radiology report processing, and by combining self-supervised pre-training via VideoMAE with contrastive learning using volume-text pairs, we aimed to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification tasks, the proposed model achieved higher F1 scores in 83% (5/6) of organs compared to CT-CLIP and 64% (9/14) of organs compared to Merlin. These results suggest that the proposed model exhibits high generalization performance in a clinical evaluation setting using actual radiology report sentences. Furthermore, in zero-shot finding-wise lesion classification tasks, our model achieved a higher AUROC in 83% (25/30) of finding categories compared to Merlin. We also confirmed performance comparable to existing Vision-Language Models (VLMs) in radiology report generation tasks. Our results demonstrate that the organ-separated learning framework can serve as a realistic and effective design guideline for the practical implementation of 3D-CT foundation models.
中文标题/摘要
标题:TotalFM:一种基于器官分离的3D-CT视觉基础模型框架
尽管放射学中的基础模型预期可以应用于各种临床任务,但在3D-CT体数据上进行训练时,计算成本限制仍然是一个主要挑战。在本研究中,我们提出了一种基于器官分离概念的放射学基础模型TotalFM,该模型能够高效地学习3D-CT图像与语言表达之间的对应关系,利用包含14万个系列的大规模数据集。通过分割技术和基于大型语言模型(LLM)的放射学报告处理,自动构建器官体数据与所见语句的配对,并结合使用VideoMAE进行自监督预训练和使用体积-文本对进行对比学习,我们旨在平衡计算效率和表示能力。在零样本器官特异性病变分类任务中,所提出模型在83%(5/6)的器官中实现了比CT-CLIP更高的F1分数,在64%(9/14)的器官中实现了比Merlin更高的F1分数。这些结果表明,所提出模型在实际放射学报告句子的临床评估环境中表现出较高的泛化性能。此外,在零样本发现特异性病变分类任务中,我们的模型在83%(25/30)的发现类别中实现了比Merlin更高的AUROC。我们还确认了在放射学报告生成任务中,模型的性能与现有的视觉-语言模型(VLMs)相当。我们的结果表明,基于器官分离的学习框架可以作为3D-CT基础模型实际实施的现实且有效的设计指南。
Summary / 总结
TotalFM is a radiological foundation model that learns the correspondence between 3D-CT images and linguistic expressions by separating organs and training on a large-scale dataset of 140,000 series. It combines self-supervised pre-training via VideoMAE with contrastive learning on volume-text pairs to balance computational efficiency and representation capability. In zero-shot organ-wise lesion classification, it achieved higher F1 scores than CT-CLIP in 5 of 6 organs and than Merlin in 9 of 14 organs; in zero-shot finding-wise classification, it achieved a higher AUROC than Merlin in 25 of 30 finding categories. It also performed comparably to existing VLMs in radiology report generation.
TotalFM 是一种放射学基础模型,通过器官分离和大规模数据集来应对在 3D-CT 体数据上训练时的计算成本问题。它结合自监督预训练和对比学习:在零样本器官级病变分类中,其F1分数在83%(5/6)的器官上高于CT-CLIP,在64%(9/14)的器官上高于Merlin;在零样本发现级病变分类中,其AUROC在83%(25/30)的发现类别上高于Merlin,表明其在临床环境中的泛化性能很强。
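The percentages quoted in the abstract follow directly from the reported counts. The short arithmetic check below recomputes them; `win_rate` is an illustrative helper name, not something from the paper.

```python
def win_rate(wins: int, total: int) -> int:
    """Share of categories won, as a whole-number percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * wins / total)


# (wins, total) counts as reported in the abstract.
comparisons = {
    "F1 vs CT-CLIP (organs)": (5, 6),
    "F1 vs Merlin (organs)": (9, 14),
    "AUROC vs Merlin (findings)": (25, 30),
}

rates = {name: win_rate(w, t) for name, (w, t) in comparisons.items()}
for name, pct in rates.items():
    print(f"{name}: {pct}%")
```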
Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
Authors: Hongseok Oh, Wonseok Hwang
First: 2025-02-27T12:20:02+00:00 · Latest: 2026-01-01T07:42:35+00:00
Comments: Transactions on Machine Learning Research
Abstract
Recently, Large Vision-Language Models (LVLMs) have shown remarkable performance across various domains. However, these models suffer from object hallucination. In this work, we study object hallucination primarily in a discriminative, retrieval-style evaluation setting (OHD-Caps), rather than in free-form caption generation. This study revisits the previous claim that the cause of such hallucinations lies in the limited representational capacity of the vision encoder. Our analysis implies that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination. Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level. Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training. We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLMs (4.9% in POPE accuracy after alignment pretraining). Our code is publicly available at https://github.com/abzb1/f-clip
中文标题/摘要
标题:视觉编码器真的能解释物体幻觉吗?:通过简单的细粒度CLIPScore缓解物体幻觉
近年来,大型视觉-语言模型(LVLMs)在各个领域表现出色。然而,这些模型存在物体幻觉的问题。在本研究中,我们主要在区分性检索式评估设置(OHD-Caps)中研究物体幻觉,而不是在自由形式的描述生成中研究。本研究重新审视了物体幻觉的原因在于视觉编码器的有限表示能力的说法。我们的分析表明,视觉编码器的能力未必是检测物体幻觉的主要限制因素。基于这一见解,我们提出了细粒度CLIPScore(F-CLIPScore),这是一种简单而有效的评估指标,通过在名词级别引入文本嵌入来增强物体级别的粒度。在OHD-Caps基准上的评估表明,F-CLIPScore在准确性上显著优于传统的CLIPScore,差距高达39.6%,且无需额外训练。我们进一步证明,基于F-CLIPScore的数据过滤可以减少LVLM中的物体幻觉(对齐预训练后POPE准确率提升4.9%)。我们的代码已公开发布在https://github.com/abzb1/f-clip
Summary / 总结
This study investigates object hallucination in Large Vision-Language Models (LVLMs) in a discriminative, retrieval-style evaluation setting (OHD-Caps). It challenges the claim that limited vision encoder capacity is the primary cause of hallucinations, and proposes Fine-grained CLIPScore (F-CLIPScore), which incorporates noun-level text embeddings to improve object-level granularity. On OHD-Caps, F-CLIPScore outperforms conventional CLIPScore by 39.6% in accuracy without additional training, and F-CLIPScore-based data filtering reduces object hallucination in LVLMs (4.9% in POPE accuracy after alignment pretraining).
该研究在区分性的检索式评估设置(OHD-Caps)中研究大型视觉-语言模型(LVLM)的物体幻觉问题,质疑了此前认为视觉编码器表示能力有限是主要原因的观点。研究提出了细粒度CLIPScore(F-CLIPScore),通过引入名词级别的文本嵌入来增强物体级别的粒度,在无需额外训练的情况下,其准确率比传统CLIPScore高出39.6%。研究还表明,基于F-CLIPScore的数据过滤可以减少LVLM中的物体幻觉(对齐预训练后POPE准确率提升4.9%)。
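To illustrate why adding noun-level text embeddings can sharpen object-level sensitivity, the sketch below averages the image-caption similarity with the image-noun similarities. This is an assumption about the general shape of such a metric, not the paper's exact formula: the abstract only says that noun-level embeddings are incorporated, raw cosine similarity stands in for the scaled CLIPScore, and the 2-D embeddings are made-up stand-ins for real CLIP outputs.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def fine_grained_score(
    image_emb: Vector, caption_emb: Vector, noun_embs: Sequence[Vector]
) -> float:
    """Mean of the caption-level and noun-level similarities (assumed form)."""
    sims = [cosine(image_emb, caption_emb)]
    sims.extend(cosine(image_emb, noun) for noun in noun_embs)
    return sum(sims) / len(sims)


image = [1.0, 0.0]

# Faithful caption: its single noun matches the image.
faithful = fine_grained_score(image, [1.0, 0.0], [[1.0, 0.0]])

# Hallucinated caption: the sentence embedding is still fairly close,
# but the second noun is unrelated to the image.
hallucinated_sentence_sim = cosine(image, [0.8, 0.6])
hallucinated = fine_grained_score(image, [0.8, 0.6], [[1.0, 0.0], [0.0, 1.0]])

print(round(faithful, 3), round(hallucinated_sentence_sim, 3), round(hallucinated, 3))
```

In this toy case the sentence-level similarity separates the two captions by 0.2, while the noun-aware score separates them by 0.4, because the unrelated noun is scored on its own.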
MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Authors: Xiaokun Sun, Zeyu Cai, Hao Tang, Ying Tai, Jian Yang, Zhenyu Zhang
First: 2026-01-01T04:42:59+00:00 · Latest: 2026-01-01T04:42:59+00:00
Comments: Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/
Abstract
3D morphing remains challenging due to the difficulty of generating semantically consistent and temporally smooth deformations, especially across categories. We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. Our key insight is that intelligently blending source and target SLAT features within the attention mechanisms of 3D generators naturally produces plausible morphing sequences. To this end, we introduce Morphing Cross-Attention (MCA), which fuses source and target information for structural coherence, and Temporal-Fused Self-Attention (TFSA), which enhances temporal consistency by incorporating features from preceding frames. An orientation correction strategy further mitigates the pose ambiguity within the morphing steps. Extensive experiments show that our method generates state-of-the-art morphing sequences, even for challenging cross-category cases. MorphAny3D further supports advanced applications such as decoupled morphing and 3D style transfer, and can be generalized to other SLAT-based generative models. Project page: https://xiaokunsun.github.io/MorphAny3D.github.io/.
中文标题/摘要
标题:MorphAny3D:利用结构化潜在表示在3D变形中释放潜力
由于难以生成语义一致且时间上平滑的变形,3D变形仍然具有挑战性,尤其是在不同类别之间。我们提出了MorphAny3D,这是一种无需训练的框架,利用结构化潜在表示(SLAT)进行高质量的3D变形。我们的关键见解是,在3D生成器的注意力机制中智能地混合源和目标SLAT特征,自然会产生合理的变形序列。为此,我们引入了变形交叉注意力(MCA),它融合了源和目标信息以实现结构一致性,并引入了时间融合自注意力(TFSA),通过结合前一帧的特征来增强时间一致性。进一步的朝向校正策略还减轻了变形步骤中的姿态歧义。大量实验表明,我们的方法即使在具有挑战性的跨类别情况下也能生成最先进的变形序列。MorphAny3D还支持解耦变形和3D风格迁移等高级应用,并可以泛化到其他基于SLAT的生成模型。项目页面:https://xiaokunsun.github.io/MorphAny3D.github.io/
Summary / 总结
The research addresses the challenge of generating semantically consistent and temporally smooth 3D morphing sequences, especially across different categories. MorphAny3D uses a training-free framework with Structured Latent (SLAT) representations and introduces Morphing Cross-Attention (MCA) and Temporal-Fused Self-Attention (TFSA) to enhance structural coherence and temporal consistency, respectively. The method produces state-of-the-art morphing sequences, even for cross-category cases, and supports advanced applications like decoupled morphing and 3D style transfer.
MorphAny3D通过利用结构化潜空间(SLAT)表示来解决生成语义一致且时间上平滑的3D变形序列的挑战,尤其是在跨类别情况下。它引入了变形交叉注意力(MCA)来融合源和目标信息,并引入了时间融合自我注意力(TFSA)来增强时间一致性。该方法还包括一个姿态校正策略来缓解姿态歧义。大量实验表明,MorphAny3D能够生成最先进的变形序列,甚至在跨类别情况下也能做到,并支持诸如解耦变形和3D风格迁移等高级应用。
AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
Authors: Qihang Zhou, Guansong Pang, Yu Tian, Shibo He, Jiming Chen
Venue: ICLR 2024
First: 2023-10-29T10:03:49+00:00 · Latest: 2026-01-01T03:03:39+00:00
Comments: Accepted by ICLR 2024
Abstract
Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, e.g., data privacy, yet it is challenging since the models need to generalize to anomalies across different domains where the appearance of foreground objects, abnormal regions, and background features, such as defects/tumors on different products/organs, can vary significantly. Recently, large pre-trained vision-language models (VLMs), such as CLIP, have demonstrated strong zero-shot recognition ability in various vision tasks, including anomaly detection. However, their ZSAD performance is weak since the VLMs focus more on modeling the class semantics of the foreground objects rather than the abnormality/normality in the images. In this paper, we introduce a novel approach, namely AnomalyCLIP, to adapt CLIP for accurate ZSAD across different domains. The key insight of AnomalyCLIP is to learn object-agnostic text prompts that capture generic normality and abnormality in an image regardless of its foreground objects. This allows our model to focus on the abnormal image regions rather than the object semantics, enabling generalized normality and abnormality recognition on diverse types of objects. Large-scale experiments on 17 real-world anomaly detection datasets show that AnomalyCLIP achieves superior zero-shot performance of detecting and segmenting anomalies in datasets of highly diverse class semantics from various defect inspection and medical imaging domains. Code will be made available at https://github.com/zqhang/AnomalyCLIP.
中文标题/摘要
标题:AnomalyCLIP:面向零样本异常检测的无对象提示学习
零样本异常检测(ZSAD)需要使用辅助数据训练检测模型,以便在目标数据集中没有训练样本的情况下检测异常。当由于各种原因(如数据隐私)无法访问训练数据时,这是一个关键任务,但挑战在于模型需要泛化到不同领域中的异常,而这些领域的前景对象、异常区域和背景特征(如不同产品/器官上的缺陷/肿瘤)可能会有很大差异。最近,大型预训练视觉-语言模型(VLMs),如CLIP,在各种视觉任务中,包括异常检测中,展示了强大的零样本识别能力。然而,它们的ZSAD性能较弱,因为VLMs更侧重于建模前景对象的类别语义,而不是图像中的异常/正常性。本文介绍了一种新的方法,即AnomalyCLIP,以适应CLIP在不同领域中进行准确的ZSAD。AnomalyCLIP的关键洞察是学习无对象的文本提示,以捕捉图像中的通用正常性和异常性,而不考虑其前景对象。这使我们的模型能够关注异常图像区域,而不是对象语义,从而在不同类型的对象上实现通用的正常性和异常性识别。大规模实验表明,AnomalyCLIP在17个真实世界的异常检测数据集上实现了优越的零样本性能,能够检测和分割来自各种缺陷检查和医学成像领域的具有高度不同类别语义的数据集中的异常。代码将在https://github.com/zqhang/AnomalyCLIP上公开。
Summary / 总结
AnomalyCLIP is a method for zero-shot anomaly detection that leverages CLIP to learn object-agnostic text prompts for capturing generic normal and abnormal image regions. This approach enables the model to focus on abnormal regions rather than object semantics, achieving superior performance across diverse datasets in defect inspection and medical imaging domains. Experiments on 17 real-world datasets demonstrate AnomalyCLIP's effectiveness in detecting and segmenting anomalies without any training samples from the target dataset.
AnomalyCLIP 是一种新颖的零样本异常检测方法,利用 CLIP 学习通用的文本提示,捕捉图像中的普遍正常性和异常性。这使得模型能够关注异常区域而非对象语义,从而在缺陷检测和医学成像等多个领域实现优异的性能。在 17 个真实世界数据集上的实验表明,AnomalyCLIP 能够在没有目标数据集训练样本的情况下检测和分割异常。
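Once object-agnostic "normal" and "abnormal" text prompts are available, a CLIP-style detector can score an image by comparing its embedding against both prompt embeddings. The sketch below shows that scoring step only, under stated assumptions: the prompt learning itself is not shown, the softmax-over-similarities form and the `temperature` value of 0.07 are conventional CLIP-style choices rather than details from the abstract, and the 2-D embeddings are made up.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def anomaly_probability(
    image_emb: Vector,
    normal_prompt_emb: Vector,
    abnormal_prompt_emb: Vector,
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Softmax probability that the image matches the 'abnormal' prompt."""
    s_normal = cosine(image_emb, normal_prompt_emb) / temperature
    s_abnormal = cosine(image_emb, abnormal_prompt_emb) / temperature
    peak = max(s_normal, s_abnormal)  # subtract the max for numerical stability
    e_normal = math.exp(s_normal - peak)
    e_abnormal = math.exp(s_abnormal - peak)
    return e_abnormal / (e_normal + e_abnormal)


normal_prompt = [1.0, 0.0]    # stand-in for a learned "normal" prompt embedding
abnormal_prompt = [0.0, 1.0]  # stand-in for a learned "abnormal" prompt embedding

defective = anomaly_probability([0.6, 0.8], normal_prompt, abnormal_prompt)
ambiguous = anomaly_probability([1.0, 1.0], normal_prompt, abnormal_prompt)
print(round(defective, 3), round(ambiguous, 3))
```

Applying the same comparison to patch-level embeddings instead of a single image embedding would give a per-region anomaly map, which is the usual route from detection to segmentation in CLIP-based approaches.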
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
Authors: Shuo Xing, Hongyuan Hua, Xiangbo Gao, Shenzhe Zhu, Renjie Li, Kexin Tian, Xiaopeng Li, Heng Huang, Tianbao Yang, Zhangyang Wang, Yang Zhou, Huaxiu Yao, Zhengzhong Tu
First: 2024-12-19T18:59:33+00:00 · Latest: 2026-01-01T02:42:57+00:00
Comments: Published at TMLR 2025
Abstract
Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs -- a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives -- including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs -- an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. We release all code and datasets at https://github.com/taco-group/AutoTrust.
中文标题/摘要
标题:AutoTrust:自动驾驶领域大型视觉语言模型可信度基准测试
近年来,针对自动驾驶(AD)的大型视觉语言模型(VLMs)取得了显著进展,展示了强大的场景理解和推理能力,使其成为端到端驾驶系统的有力候选者。然而,关于DriveVLMs可信度的研究工作较少,这是直接影响公共交通安全的关键因素。本文介绍了AutoTrust,一个全面的自动驾驶领域大型视觉语言模型(DriveVLMs)可信度基准,考虑了多种视角,包括真实性、安全性、鲁棒性、隐私和公平性。我们构建了最大的视觉问答数据集,用于研究驾驶场景中的可信度问题,包含超过10000个独特场景和18000个查询。我们评估了六种公开可用的VLMs,从通用模型到专门模型,从开源模型到商业模型。我们的全面评估揭示了DriveVLMs在可信度威胁方面以前未被发现的漏洞。具体来说,我们发现通用VLMs如LLaVA-v1.6和GPT-4o-mini在整体可信度方面出人意料地优于专门针对驾驶进行微调的模型。像DriveLM-Agent这样的DriveVLMs特别容易泄露敏感信息。此外,无论是通用模型还是专门模型,都容易受到对抗性攻击的影响,难以确保在各种环境和人群中做出无偏见的决策。我们的研究结果呼吁立即采取果断行动解决DriveVLMs的可信度问题——这对公共安全和依赖自动驾驶交通系统的所有公民的福祉至关重要。我们已在https://github.com/taco-group/AutoTrust发布了所有代码和数据集。
Summary / 总结
AutoTrust benchmarks the trustworthiness of large vision-language models for autonomous driving, considering factors like trustfulness, safety, robustness, privacy, and fairness. It evaluates six VLMs, revealing that generalist models like LLaVA-v1.6 and GPT-4o-mini outperform specialized models in overall trustworthiness, while DriveLM-Agent is particularly vulnerable to disclosing sensitive information. Both generalist and specialist models are susceptible to adversarial attacks and biased decision-making. This work highlights the urgent need for addressing trustworthiness issues in DriveVLMs for public safety.
AutoTrust 对于自动驾驶的大型视觉语言模型(VLMs)进行了信任度基准测试,关注信任度、安全性、鲁棒性、隐私和公平性等方面。研究评估了六种VLM,发现通用模型如LLaVA-v1.6和GPT-4o-mini在整体信任度上优于专门针对驾驶的模型。自动驾驶模型在隐私威胁和对抗攻击方面存在漏洞,并且难以实现跨不同环境和人群的公平决策。研究强调了解决这些信任问题以确保自动驾驶系统的公共安全的重要性。