arXiv Paper Digest

2026-01-07 03:30
Snapshot: 20260107_0330
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Authors: Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye
First: 2026-01-05T18:56:34+00:00 · Latest: 2026-01-05T18:56:34+00:00
Comments: Project page: https://sotamak1r.github.io/VINO-web/
Abstract
We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
中文标题/摘要
标题:VINO:统一视觉生成器,融合全方位模态上下文
我们提出了VINO,一个统一的视觉生成器,能够在单一框架中进行图像和视频的生成与编辑。VINO 不依赖于特定任务的模型或独立的模块,而是使用一个共享的扩散骨干网络,该网络能够根据文本、图像和视频进行条件化,从而在一个模型中实现广泛的视觉创作和编辑任务。具体来说,VINO 将一个视觉语言模型(VLM)与多模态扩散变换器(MMDiT)相结合,其中多模态输入被编码为交错的条件化令牌,然后用于引导扩散过程。这种设计支持多参考定位、长格式指令跟随以及在静态和动态内容中保持一致的身份,同时避免了特定模态的架构组件。为了训练这样一个统一系统,我们引入了一个多阶段训练管道,逐步扩展一个视频生成基础模型,使其成为一个能够处理图像和视频输入输出的统一、多任务生成器。在各种生成和编辑基准测试中,VINO 展现了强大的视觉质量、忠实的指令跟随、改进的参考和属性保留以及更可控的多身份编辑。我们的结果突显了可扩展统一视觉生成的实用路径,并展示了交错的上下文计算作为通用视觉创作基础的潜力。
Summary / 总结
VINO is a unified visual generator that integrates image and video generation and editing within a single framework by using a shared diffusion backbone conditioned on text, images, and videos. It employs a Multimodal Diffusion Transformer (MMDiT) coupled with a vision-language model to support multi-reference grounding, long-form instruction following, and coherent identity preservation. VINO shows strong visual quality, faithful instruction following, and improved reference and attribute preservation across various benchmarks, demonstrating a practical approach to scalable unified visual generation.
VINO 是一个统一的视觉生成器,将图像和视频生成与编辑整合在一个框架中。它使用一个共享的扩散骨干网络,并结合了多模态扩散变换器(MMDiT)和视觉语言模型(VLM),该设计支持多种视觉任务,包括多参考定位和一致的身份保留。VINO 在不同基准测试中展示了强大的视觉质量和改进的参考和属性保留,显示出统一视觉生成的可扩展性潜力。
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
Authors: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
First: 2026-01-05T18:07:51+00:00 · Latest: 2026-01-05T18:07:51+00:00
Abstract
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
中文标题/摘要
标题:DatBench:区分性、忠实性和高效性的VLM评估
实证评估是指导基础模型研究进展的主要指南。尽管有大量的工作集中在训练前沿的视觉-语言模型(VLMs)上,但对其评估的方法仍处于初级阶段。为了促进其成熟,我们提出了评估应满足的三个标准:(1)忠实于模态和应用,(2)能够区分不同质量的模型,(3)计算效率。从这个角度来看,我们识别出了一些关键的失败模式,这些模式违反了忠实性和区分性,错误地代表了模型的能力:(i)多项选择题奖励猜测,不能很好地反映下游使用场景,并且随着模型的改进而饱和;(ii)一些可以不使用图像就能回答的问题占到了某些评估的70%以上;(iii)错误标记或模糊的样本在某些数据集中占到了42%。关于效率,评估前沿模型的计算负担已经变得难以承受:据一些说法,近20%的开发计算资源被用于评估本身。我们没有抛弃现有的基准,而是通过转化和筛选来优化它们,以最大化忠实性和区分性。我们发现,将多项选择题转换为生成任务可以揭示出高达35%的能力下降。此外,过滤掉可以不使用图像就能回答的问题和错误标记的样本可以提高区分能力,同时降低计算成本。我们发布了DatBench-Full,这是一个包含33个数据集的清理评估套件,涵盖了九种VLM能力,以及DatBench,这是一个区分性子集,实现了13倍的平均加速(最高可达50倍),同时与原始数据集的区分能力非常接近。我们的工作概述了一条通向评估实践的道路,这些实践既严格又可持续,随着VLMs的不断扩展。
Summary / 总结
The paper proposes DatBench to address shortcomings in the evaluation of vision-language models (VLMs), organized around three desiderata: faithfulness, discriminability, and efficiency. It identifies failure modes such as multiple-choice formats that reward guessing, blindly solvable questions that can be answered without the image, and mislabeled or ambiguous samples. Rather than discarding existing benchmarks, the authors transform and filter them to improve fidelity and discriminability. Key findings include capability drops of up to 35% when multiple-choice questions are converted to generative tasks, and a 13x average speedup (up to 50x) with DatBench while closely matching the discriminative power of the original datasets. The work aims to guide VLM evaluation toward more rigorous and sustainable practices.
论文提出了DatBench,一个用于视觉语言模型(VLM)的新评估套件,解决了诸如忠实性、区分能力和效率等方面的关键问题。作者指出了现有评估方法的问题,例如鼓励猜测的多项选择题和不需要图像的盲目可解问题。研究发现,将多项选择题转换为生成任务,并过滤掉盲目可解和错误标记的样本,可以显著提高评估的区分能力并减少计算成本。DatBench-Full套件包括33个数据集,而DatBench子集则提供了13倍的平均加速,最多可达50倍,同时保持相似的区分能力。
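The blind-solvability filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DatBench implementation: the sample schema, the exact-match criterion, and the two toy "text-only models" are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, str]          # hypothetical schema: {"question": ..., "answer": ...}
TextOnlyModel = Callable[[str], str]  # answers from the question text alone, no image


def is_blindly_solvable(sample: Sample, models: Sequence[TextOnlyModel]) -> bool:
    """Return True if any text-only model matches the gold answer without the image.

    Exact match after lowercasing is an assumption made for this sketch.
    """
    gold = sample["answer"].strip().lower()
    return any(m(sample["question"]).strip().lower() == gold for m in models)


def filter_blind(samples: Sequence[Sample], models: Sequence[TextOnlyModel]) -> List[Sample]:
    """Keep only the samples that no text-only model can answer."""
    return [s for s in samples if not is_blindly_solvable(s, models)]


# Toy stand-ins for text-only models (hypothetical).
def always_paris(question: str) -> str:
    return "Paris"


def echo_last_word(question: str) -> str:
    return question.rstrip("?").split()[-1]


samples = [
    {"question": "What is the capital of France?", "answer": "paris"},
    {"question": "What color is the car in the image?", "answer": "red"},
]
kept = filter_blind(samples, [always_paris, echo_last_word])
print([s["question"] for s in kept])
```

In this toy run the first question is dropped because a model answers it without any image, while the second survives because it genuinely depends on visual input.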
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
Authors: Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang
First: 2026-01-05T17:11:00+00:00 · Latest: 2026-01-05T17:11:00+00:00
Abstract
The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
中文标题/摘要
标题:InfiniteVGGT:视觉几何导向变换器,用于无尽流
持久的大规模3D视觉几何理解的宏伟愿景受到可扩展性和长期稳定性不可调和需求的束缚。虽然离线模型如VGGT实现了令人鼓舞的几何能力,但它们基于批次的性质使它们对实时系统无关紧要。流式架构虽然是为实时操作设计的解决方案,但已被证明是不充分的。现有方法要么无法支持真正无限的输入,要么在长时间序列中遭受灾难性漂移。我们通过InfiniteVGGT打破了这一长期困境,这是一种因果视觉几何变换器,通过有界但适应性强且持续表达的KV缓存实现滚动记忆的概念化。利用这一点,我们提出了一种无需训练、不依赖注意力的剪枝策略,能够智能地丢弃过时信息,有效地“滚动”记忆向前推进每一帧。InfiniteVGGT完全兼容FlashAttention,最终解决了这一妥协,实现了无限时长的流式传输,同时在长期稳定性方面优于现有流式方法。对于此类系统而言,最终的考验是其在真正无限时长上的性能,由于缺乏长期连续基准,这种能力一直难以严格验证。为解决这一关键缺口,我们引入了Long3D基准,这是首次能够对序列长度约10,000帧的连续3D几何估计进行严格评估的基准。这为未来在长期3D几何理解方面的研究提供了决定性的评估平台。代码可在:https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT 获取
Summary / 总结
InfiniteVGGT addresses the challenge of persistent 3D visual geometry understanding by introducing a causal visual geometry transformer with a bounded yet adaptive KV cache, enabling infinite-horizon streaming. It employs a training-free, attention-agnostic pruning strategy that discards obsolete information and rolls the memory forward with each new frame, and it remains compatible with FlashAttention. The authors report that InfiniteVGGT outperforms existing streaming methods in long-term stability. They also introduce Long3D, a benchmark for evaluating continuous 3D geometry estimation on sequences of about 10,000 frames.
InfiniteVGGT通过引入一个有界但自适应的KV缓存的因果视觉几何变换器,解决了持续的3D视觉几何理解挑战,支持无限时长的流式处理。它采用了一种无需训练、不依赖注意力机制的剪枝策略,以智能地丢弃过时信息,确保长期稳定性。实验结果表明,InfiniteVGGT在Long3D基准测试上优于现有方法,在10,000帧序列上实现了更好的长期稳定性和性能,标志着在实时系统中3D几何理解方面取得了重要进展。
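The idea of a bounded, rolling KV cache with attention-agnostic pruning can be sketched in a few lines. This is a hedged toy, not the InfiniteVGGT algorithm: the class name `RollingKVCache`, the per-frame granularity, and the eviction rule (drop the cached key most similar to the mean key, using keys only and no attention scores) are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


class RollingKVCache:
    """Bounded cache of (frame_id, key, value) entries (hypothetical sketch)."""

    def __init__(self, budget: int) -> None:
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.entries: List[Tuple[int, Vector, Vector]] = []

    def append(self, frame_id: int, key: Vector, value: Vector) -> None:
        """Add the newest frame, then prune until the budget is respected."""
        self.entries.append((frame_id, key, value))
        while len(self.entries) > self.budget:
            self._prune_one()

    def _prune_one(self) -> None:
        # Assumed criterion: evict the older entry whose key is most similar
        # to the mean key, treating it as the most redundant. Only keys are
        # inspected, so no attention matrix is required.
        n = len(self.entries)
        dim = len(self.entries[0][1])
        mean = [sum(e[1][i] for e in self.entries) / n for i in range(dim)]
        older = range(n - 1)  # never evict the newest entry
        victim = max(older, key=lambda i: cosine(self.entries[i][1], mean))
        del self.entries[victim]


cache = RollingKVCache(budget=3)
for frame_id in range(10):
    angle = 0.7 * frame_id
    cache.append(frame_id, [math.cos(angle), math.sin(angle)], [float(frame_id)])
kept_frames = [frame_id for frame_id, _, _ in cache.entries]
print(kept_frames)
```

Whatever the eviction rule, the property that matters for streaming is that memory stays constant as the number of frames grows, while the newest frame is always retained.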
Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion
Authors: Binglei Li, Mengping Yang, Zhiyu Tan, Junping Zhang, Hao Li
First: 2026-01-05T15:32:53+00:00 · Latest: 2026-01-05T15:32:53+00:00
Comments: 11 pages
Abstract
Recent breakthroughs of transformer-based diffusion models, particularly with Multimodal Diffusion Transformers (MMDiT) driven models like FLUX and Qwen Image, have facilitated thrilling experiences in text-to-image generation and editing. To understand the internal mechanism of MMDiT-based models, existing methods tried to analyze the effect of specific components like positional encoding and attention layers. Yet, a comprehensive understanding of how different blocks and their interactions with textual conditions contribute to the synthesis process remains elusive. In this paper, we first develop a systematic pipeline to comprehensively investigate each block's functionality by removing, disabling and enhancing textual hidden-states at corresponding blocks. Our analysis reveals that 1) semantic information appears in earlier blocks and finer details are rendered in later blocks, 2) removing specific blocks is usually less disruptive than disabling text conditions, and 3) enhancing textual conditions in selective blocks improves semantic attributes. Building on these observations, we further propose novel training-free strategies for improved text alignment, precise editing, and acceleration. Extensive experiments demonstrated that our method outperforms various baselines and remains flexible across text-to-image generation, image editing, and inference acceleration. Our method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5, without sacrificing synthesis quality. These results advance understanding of MMDiT models and provide valuable insights to unlock new possibilities for further improvements.
中文标题/摘要
标题:解析MMDiT模块:无需训练的文本条件扩散分析与增强
基于变压器的扩散模型的最新突破,特别是由多模态扩散变换器(MMDiT)驱动的模型如FLUX和Qwen Image,极大地促进了文本到图像生成和编辑的激动人心的体验。为了理解MMDiT基模型的内部机制,现有方法尝试分析特定组件如位置编码和注意力层的效果。然而,不同模块及其与文本条件的交互如何共同作用于合成过程的全面理解仍然难以捉摸。在本文中,我们首先开发了一个系统的工作流程,通过在相应模块中移除、禁用和增强文本隐藏状态来全面调查每个模块的功能。我们的分析揭示了以下几点:1)语义信息出现在较早的模块中,而更精细的细节则在较晚的模块中呈现;2)移除特定模块通常比禁用文本条件的影响小;3)在选择性模块中增强文本条件可以提高语义属性。基于这些观察,我们进一步提出了新的无需训练的策略,以提高文本对齐、精确编辑和加速。广泛的实验表明,我们的方法优于各种基线,并且在文本到图像生成、图像编辑和推理加速方面保持灵活性。我们的方法将T2I-Combench++从56.92%提高到63.00%,GenEval从66.42%提高到71.63%,在SD3.5上没有牺牲合成质量。这些结果推进了对MMDiT模型的理解,并提供了有价值的见解,以解锁进一步改进的新可能性。
Summary / 总结
This paper aims to understand the internal mechanisms of MMDiT-based models by analyzing the effects of different blocks and their interactions with textual conditions. The authors develop a systematic pipeline that removes blocks, disables text conditions, and enhances textual hidden states at each block. Key findings are that semantic information appears in earlier blocks while finer details are rendered in later blocks, that removing specific blocks is usually less disruptive than disabling text conditions, and that enhancing textual conditions in selected blocks improves semantic attributes. Based on these insights, the authors propose training-free strategies for better text alignment, precise editing, and acceleration, which outperform various baselines in text-to-image generation, image editing, and inference acceleration. On SD3.5, the method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% without compromising synthesis quality.
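The paper studies the internal mechanisms of MMDiT models in text-to-image generation and editing. The authors develop a systematic procedure for analyzing individual blocks and their interactions with textual conditions. Key findings: semantic information appears in earlier blocks while fine details are rendered in later blocks; removing specific blocks is usually less disruptive than disabling text conditions; and enhancing textual conditions in selected blocks improves semantic attributes. Building on these insights, the authors propose training-free strategies for better text alignment, precise editing, and acceleration. Experiments show that the method outperforms baselines and raises T2I-Combench++ and GenEval scores on SD3.5 without sacrificing synthesis quality.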
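The three block-level interventions named in the abstract (removing a block, disabling its text condition, enhancing its text condition) can be illustrated with a toy harness. This is only a sketch under strong assumptions: the blocks here are hypothetical scalar functions that update image states from a fixed text stream, whereas real MMDiT blocks update both streams through joint attention.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence

# Hypothetical toy block: maps (image_states, text_states) to new image_states.
Block = Callable[[List[float], List[float]], List[float]]


def toy_block(image: List[float], text: List[float]) -> List[float]:
    """Add the mean text activation to every image element."""
    mean_text = sum(text) / len(text)
    return [x + mean_text for x in image]


def run_blocks(
    blocks: Sequence[Block],
    image: List[float],
    text: List[float],
    remove: FrozenSet[int] = frozenset(),
    disable_text: FrozenSet[int] = frozenset(),
    enhance: Optional[Dict[int, float]] = None,
) -> List[float]:
    """Run the stack while applying per-block interventions.

    remove: block indices that are skipped entirely.
    disable_text: block indices that receive all-zero text states.
    enhance: block index -> scale factor applied to the text states.
    """
    enhance = enhance or {}
    for i, block in enumerate(blocks):
        if i in remove:
            continue
        if i in disable_text:
            text_in = [0.0] * len(text)
        elif i in enhance:
            text_in = [enhance[i] * t for t in text]
        else:
            text_in = text
        image = block(image, text_in)
    return image


blocks = [toy_block, toy_block, toy_block]
image0, text0 = [0.0], [1.0, 3.0]
baseline = run_blocks(blocks, image0, text0)
removed = run_blocks(blocks, image0, text0, remove=frozenset({1}))
disabled = run_blocks(blocks, image0, text0, disable_text=frozenset({1}))
enhanced = run_blocks(blocks, image0, text0, enhance={1: 1.5})
print(baseline, removed, disabled, enhanced)
```

In this toy the removed and disabled runs coincide because the block does nothing except inject text; the paper's finding that the two differ in practice reflects what real blocks do beyond text injection.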
Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models
Authors: Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert
First: 2025-12-30T21:20:41+00:00 · Latest: 2026-01-05T14:30:28+00:00
Comments: 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/
Abstract
The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy
中文标题/摘要
标题:基础模型在桥梁上的应用:基于视觉语言模型的海上自主航行中的语义风险检测与安全操作
国际海事组织(IMO)的MASS代码草案要求自主和远程监督的海上船舶能够检测到偏离其操作设计域的情况,进入预定义的后备模式通知操作员,允许立即的人工干预,并在未经批准的情况下不得更改航程计划。在警报到接管的窗口内满足这些义务需要一种短时间范围、可人工干预的后备操作。传统的海上自主系统在需要理解意义的情况下(例如,潜水员标志意味着水中有人员,火意味着危险)难以应对。我们认为(i)视觉语言模型(VLMs)为这些分布外情况提供了语义意识,(ii)快速-慢速异常检测流水线与短时间范围、可人工干预的后备操作使这一操作在交接窗口内变得可行。我们引入了语义瞭望,这是一种仅使用摄像头、候选受限的VLM后备操作选择器,它在连续的人类监督下从水有效、世界锚定的轨迹中选择一个谨慎的操作(或保持位置)。在40个港口场景中,我们测量了每次呼叫的场景理解能力和延迟,与人类共识的对齐(模型三票多数投票),火灾危险场景下的短时间范围风险缓解,以及水上警报->后备操作->操作员交接。亚10秒的模型保留了大多数先进模型的大部分意识。后备操作选择器优于仅几何的基线,并在火灾场景中增加了安全距离。现场运行验证了端到端操作。这些结果支持VLMs作为与IMO MASS代码草案兼容的语义后备操作选择器,符合实际的延迟预算,并激励未来的工作,即领域适应的混合自主,将基础模型语义与多传感器鸟瞰感知和短时间范围重新规划相结合。
Summary / 总结
This paper addresses the challenge of detecting semantic hazards and enabling safe maneuvers in autonomous maritime operations. It proposes using vision-language models (VLMs) for semantic awareness and a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver. The authors introduce Semantic Lookout, a camera-only, candidate-constrained selector that picks one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. Experiments on 40 harbor scenes show that sub-10 s models retain most of the awareness of slower state-of-the-art models, and that the fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. The authors argue that the results support VLMs as fallback maneuver selectors compatible with the draft IMO MASS Code and motivate future hybrid autonomy systems.
该研究旨在通过使用视觉-语言模型提出一种语义后备机动方案,解决自主海上船舶在警报到接管窗口内检测和应对危险的需求。方法包括一个快速-慢速异常管道,带有短期机动和人类可覆盖的后备机动。关键发现包括子10秒模型保留了较慢的先进模型大部分的感知能力,后备机动选择器优于几何基线,增加了火灾场景的退避距离。Semantic Lookout系统,一种仅使用摄像头、候选受限的VLM,在持续的人类监督下选择谨慎动作或保持原位,展示了在港口场景中的实际操作能力。
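A candidate-constrained selection with majority-of-three voting can be sketched as below. This is a hedged illustration, not the Semantic Lookout implementation: the action labels, the `select_fallback` name, and the choice to default to station-keeping when there is no strict majority are assumptions made for the example.

```python
from collections import Counter
from typing import Sequence

STATION_KEEPING = "station_keeping"  # hypothetical label for the safe default


def select_fallback(votes: Sequence[str], candidates: Sequence[str]) -> str:
    """Pick one action from the allowed candidates by strict majority vote.

    Votes outside the candidate set are discarded (candidate constraint).
    Without a strict majority among all votes cast, fall back to
    station-keeping (an assumption of this sketch).
    """
    valid = [v for v in votes if v in candidates]
    if not valid:
        return STATION_KEEPING
    action, count = Counter(valid).most_common(1)[0]
    return action if 2 * count > len(votes) else STATION_KEEPING


candidates = ["turn_port", "turn_starboard", "slow_ahead"]
agreed = select_fallback(["turn_port", "turn_port", "slow_ahead"], candidates)
split = select_fallback(["turn_port", "turn_starboard", "slow_ahead"], candidates)
invalid = select_fallback(["full_astern", "full_astern", "turn_port"], candidates)
print(agreed, split, invalid)
```

Constraining the output to a precomputed candidate set means the language model never emits a trajectory directly; it can only choose among options that a separate geometric check has already validated.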
BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models
Authors: Sunny Gupta, Shounak Das, Amit Sethi
Venue: AAAI 2026 Workshop (AIR-FM)
First: 2026-01-05T14:22:20+00:00 · Latest: 2026-01-05T14:22:20+00:00
Comments: Accepted at the AAAI 2026 Workshop AIR-FM, Assessing and Improving Reliability of Foundation Models in the Real World
Abstract
Vision-language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality, either visual or textual, leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly minimize conditional mutual information between spurious cues and predictions, steering the model toward causal, domain-invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracies over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.
中文标题/摘要
标题:BiPrompt:视觉和文本双模态去偏优化框架
视觉语言基础模型如CLIP在零样本泛化方面表现出色,但在视觉和文本模态之间仍易受虚假相关性的影响。现有去偏方法通常只针对单一模态,无论是视觉还是文本,这导致了部分鲁棒性和在分布变化下的不稳定适应。我们提出了一种双模态提示优化框架(BiPrompt),该框架在测试时同时减轻了两个模态中非因果特征依赖性。在视觉方面,它使用结构化注意力引导消除来抑制背景激活,并强制因果区域和虚假区域之间的预测一致性。在文本方面,它引入了平衡提示归一化,这是一种可学习的重新对齐机制,将类别嵌入对齐到等向性的语义空间。这些模块共同最小化了虚假线索与预测之间的条件互信息,引导模型向因果、领域不变的推理方向发展,而无需重新训练或领域监督。在现实世界和合成偏见基准上的广泛评估表明,与先前的测试时去偏方法相比,该方法在平均准确性和最差群体准确率上都取得了持续改进,为可信且因果导向的视觉语言适应指明了一条轻量级且有效的路径。
Summary / 总结
The research aims to improve the robustness of vision-language models such as CLIP by addressing spurious correlations in both visual and textual modalities. BiPrompt optimizes prompts bilaterally at test time to reduce reliance on non-causal features. On the visual side it uses structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions; on the textual side it uses balanced prompt normalization, a learnable re-centering of class embeddings. Experiments on real-world and synthetic bias benchmarks show consistent improvements in both average and worst-group accuracy over previous test-time debiasing methods, without retraining or domain supervision.
研究旨在通过同时解决视觉和文本模态中的虚假相关性,提高如CLIP等视觉语言模型的鲁棒性。BiPrompt通过在视觉上抑制背景激活并强制因果和虚假区域之间的一致性,在文本上引入可学习的重新中心机制将类别嵌入对齐到均匀的语义空间,减少虚假线索与预测之间的条件互信息,从而实现更好的因果推理,并在平均准确率和最差群体准确率上优于先前的方法。
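The re-centering idea behind balanced prompt normalization can be illustrated with a fixed, non-learned variant. This is a sketch under stated assumptions: BiPrompt describes a learnable mechanism, whereas the function below simply subtracts the mean class embedding and rescales to unit length; the function name and the toy vectors are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def recenter_and_normalize(embeddings: List[Vector]) -> List[Vector]:
    """Subtract the mean class embedding, then L2-normalize each vector.

    A fixed stand-in for a learnable re-centering offset (assumption).
    Vectors that become zero after centering are returned unchanged.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    result = []
    for e in embeddings:
        centered = [e[i] - mean[i] for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in centered))
        result.append([x / norm for x in centered] if norm > 0 else centered)
    return result


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Two class embeddings that share a large common component.
class_embeddings = [[3.0, 1.0], [1.0, 3.0]]
before = cosine(class_embeddings[0], class_embeddings[1])
balanced = recenter_and_normalize(class_embeddings)
after = cosine(balanced[0], balanced[1])
print(round(before, 3), round(after, 3))
```

Removing the shared component spreads the class embeddings apart, which is the intuition behind moving toward a more isotropic embedding space.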
DeCode: Decoupling Content and Delivery for Medical QA
Authors: Po-Jen Ko, Chen-Han Tsai, Yu-Shao Peng
First: 2026-01-05T13:54:38+00:00 · Latest: 2026-01-05T13:54:38+00:00
Comments: Preprint
Abstract
Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients' needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from 28.4% to 49.8%, corresponding to a 75% relative improvement. Experimental results suggest the effectiveness of DeCode in improving clinical question answering of LLMs.
中文标题/摘要
标题:DeCode: 解耦内容与交付以实现医疗QA
大型语言模型(LLMs)表现出强大的医学知识,并能生成事实准确的回答。然而,现有模型往往未能考虑个体患者的背景,导致答案在临床上正确但与患者需求严重脱节。在本工作中,我们引入了DeCode,这是一种无需训练、模型通用的框架,能够将现有的LLMs适应于在临床环境中生成上下文化的回答。我们使用OpenAI HealthBench对DeCode进行了评估,这是一个全面且具有挑战性的基准,旨在评估LLM回答的临床相关性和有效性。DeCode将先前的最佳性能从28.4%提高到49.8%,相当于75%的相对改进。实验结果表明,DeCode在提高LLMs的临床问题回答效果方面的有效性。
Summary / 总结
DeCode is a training-free, model-agnostic framework that enhances the clinical relevance of large language models (LLMs) by adapting them to individual patient contexts. Evaluated on OpenAI HealthBench, DeCode improves on the previous state of the art from 28.4% to 49.8%, a 75% relative improvement.
DeCode 是一个无需训练、适用于多种模型的框架,旨在使现有的大型语言模型能够生成具有临床相关性的医疗答案。它在 OpenAI HealthBench 上进行评估,并将准确率从 28.4% 提高到 49.8%,提高了 75%。这表明 DeCode 在增强 LLM 响应的临床相关性方面非常有效。
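The 75% figure is a relative improvement, not an absolute one. As a quick check using only the two scores reported in the abstract, the absolute gain is 21.4 percentage points, and dividing by the previous score gives the relative gain:

```latex
\[
\frac{49.8 - 28.4}{28.4} \;=\; \frac{21.4}{28.4} \;\approx\; 0.754 \;\approx\; 75\%
\]
```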
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Authors: Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
Venue: WACV 2026
First: 2024-05-29T05:20:02+00:00 · Latest: 2026-01-05T13:34:30+00:00
Comments: WACV 2026 Accepted. Code available at https://github.com/CyberAgentAILab/multimodal-adversarial-training
Abstract
Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift -- conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives. Our code is publicly available at https://github.com/CyberAgentAILab/multimodal-adversarial-training.
中文标题/摘要
标题:利用一对多关系的多模态对抗防御方法研究
预训练的视觉-语言(VL)模型对对抗攻击极为敏感。然而,现有的防御方法主要集中在图像分类上,忽视了VL任务中的两个关键方面:多模态攻击,其中图像和文本都可以被扰动,以及图像和文本之间的一对多关系,即一个图像可以对应多个文本描述,反之亦然(1:N和N:1)。本工作是首次探索VL任务中对抗多模态攻击的防御策略,而之前的VL防御方法主要关注视觉鲁棒性。我们提出了多模态对抗训练(MAT),在训练过程中同时在图像和文本模态中引入对抗扰动,显著优于现有的单模态防御方法。此外,我们发现MAT受限于VL训练数据中确定的一对一(1:1)图像-文本对。为了解决这一问题,我们对利用一对多关系增强鲁棒性进行了全面研究,探讨了多种增强技术。我们的分析表明,为了更有效的防御,增强的图像-文本对应该对齐良好、多样化,但要避免分布偏移——这是之前研究中被忽视的条件。本工作开创了对抗多模态攻击的防御策略,从优化和数据两个角度提供了构建鲁棒VL模型的见解。我们的代码已公开发布在https://github.com/CyberAgentAILab/multimodal-adversarial-training。
Summary / 总结
This work addresses the vulnerability of pre-trained vision-language models to adversarial attacks by proposing multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities. The method significantly outperforms existing unimodal defenses. The study also highlights the limitations of deterministic one-to-one image-text pairs and explores the use of one-to-many relationships to enhance robustness, suggesting that augmented pairs should be well-aligned, diverse, and avoid distribution shift. This research provides new insights for building robust vision-language models.
该研究针对预训练的视觉语言模型对抗攻击的脆弱性,提出了多模态对抗训练(MAT)方法,该方法在图像和文本模态中都引入了对抗扰动,显著优于现有的单模态防御方法。研究还指出了确定性一对一图像-文本对的局限性,并探索利用一对多关系来增强鲁棒性,建议增强的图像-文本对应是良好对齐、多样化的,并避免分布偏移。这是首次探索视觉语言任务中多模态攻击的防御策略,提供了从优化和数据两个角度的见解。
Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows
Authors: Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen
First: 2026-01-05T12:57:33+00:00 · Latest: 2026-01-05T12:57:33+00:00
Abstract
Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding confidence and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. This design enables effective bidirectional information flow within the decoding window without sacrificing efficiency. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.
中文标题/摘要
标题:延迟承诺解码:带有信心感知滑动窗口的扩散语言模型
扩散语言模型(DLMs)最近作为一种强大的替代自回归模型出现,通过实现并行文本生成。为了提高推理效率和KV缓存兼容性,先前的工作通常采用基于块的扩散,逐块解码令牌。然而,这种范式遭受了一个结构性限制,我们称之为边界诱导上下文截断(BICT):接近块边界的未解码令牌被迫在无法访问附近未来上下文的情况下做出承诺,即使这种上下文可以显著减少不确定性。这一限制降低了解码信心和生成质量,特别是在需要精确推理的任务中,如数学问题求解和代码生成。我们提出了一种名为延迟承诺解码(DCD)的新型、无需训练的解码策略,以缓解这一问题。DCD 维护一个信心感知的滑动窗口覆盖在掩码令牌上,早期解决低不确定性令牌,直到有足够的上下文证据为止再推迟高不确定性令牌。这种设计在解码窗口内实现了有效的双向信息流,同时保持了效率。在多个扩散语言模型、基准和缓存配置的广泛实验中显示,与固定块基扩散方法相比,DCD 在平均时间相同的情况下提高了生成准确性 1.39%,最高改善幅度达到 9.0%。这些结果表明,基于不确定性推迟令牌承诺是提高扩散语言模型解码质量和效率的一个简单而有效的原则。
Summary / 总结
The paper addresses Boundary-Induced Context Truncation (BICT) in block-based diffusion language models, where undecoded tokens near block boundaries must commit without access to nearby future context, which lowers decoding confidence and generation quality. It introduces Deferred Commitment Decoding (DCD), a training-free method that maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early and deferring high-uncertainty tokens until more context is available. Experiments show that DCD improves generation accuracy by 1.39% on average over fixed block-based methods at comparable decoding time, with gains of up to 9.0% in the best case.
论文针对块基扩散语言模型中的边界诱导上下文截断(BICT)问题,该问题限制了解码的信心和生成质量。提出了一种名为延迟承诺解码(DCD)的无训练方法,通过使用一个基于置信度的滑动窗口,提前解决低不确定性标记,并在获得足够上下文证据后延迟高不确定性标记,从而实现解码窗口内的双向信息流动。实验结果显示,DCD相比固定块基方法平均提高了1.39%的生成准确性,最高可达9.0%的提升。
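The deferral principle can be sketched with a toy decoder. This is a minimal illustration, not the DCD algorithm from the paper: the window size, the fixed confidence threshold, the rule that forces at least one commitment per step, and the `toy_predict` function standing in for a diffusion language model are all assumptions.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # marker for a still-masked position
Tokens = List[Optional[str]]
Predictor = Callable[[Tokens, int], Tuple[str, float]]  # -> (token, confidence)


def deferred_commitment_decode(
    length: int, predict: Predictor, window: int = 4, threshold: float = 0.8
) -> Tuple[Tokens, List[List[int]]]:
    """Decode with a sliding window that starts at the first masked position.

    Each step commits every in-window token whose confidence reaches the
    threshold and defers the rest. If none qualifies, the single most
    confident token is committed so that decoding always makes progress.
    Returns the tokens and the positions committed at each step.
    """
    tokens: Tokens = [MASK] * length
    steps: List[List[int]] = []
    start = 0
    while start < length:
        positions = [
            i for i in range(start, min(start + window, length)) if tokens[i] is MASK
        ]
        proposals = {i: predict(tokens, i) for i in positions}
        commit = [i for i in positions if proposals[i][1] >= threshold]
        if not commit:
            commit = [max(positions, key=lambda i: proposals[i][1])]
        for i in commit:
            tokens[i] = proposals[i][0]
        steps.append(commit)
        while start < length and tokens[start] is not MASK:
            start += 1  # slide the window past resolved tokens
    return tokens, steps


TARGET = list("abcdef")


def toy_predict(tokens: Tokens, i: int) -> Tuple[str, float]:
    """Hypothetical model: odd positions are uncertain until neighbors are known."""
    neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
    has_context = all(tokens[j] is not MASK for j in neighbors)
    confidence = 0.9 if (i % 2 == 0 or has_context) else 0.3
    return TARGET[i], confidence


decoded, steps = deferred_commitment_decode(len(TARGET), toy_predict)
print("".join(decoded), steps)
```

Position 1 is deferred in the first step and committed in the second, once both of its neighbors are resolved; a fixed block boundary placed between positions 1 and 2 would have forced it to commit without that right-hand context.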
Agentic Retoucher for Text-To-Image Generation
Authors: Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai
First: 2026-01-05T12:06:43+00:00 · Latest: 2026-01-05T12:06:43+00:00
Abstract
Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.
中文标题/摘要
标题:代理修图师:用于文本到图像生成
文本到图像(T2I)扩散模型如SDXL和FLUX已经实现了令人印象深刻的写实效果,但在肢体、面部、文本等方面仍然存在普遍的小规模失真。现有的精修方法要么进行昂贵的迭代重新生成,要么依赖于具有较弱空间定位能力的视觉语言模型(VLMs),导致语义漂移和不可靠的局部编辑。为了解决这一问题,我们提出了一种名为代理修图师的分层决策驱动框架,将后生成修正重新构想为类似人类感知-推理-行动的循环。具体来说,我们设计了(1)一个感知代理,学习在文本-图像一致性线索下的细粒度失真定位的上下文显著性;(2)一个推理代理,通过逐步偏好对齐进行符合人类的推断诊断;(3)一个行动代理,根据用户偏好自适应地计划局部修复。该设计将感知证据、语言推理和可控修正整合到一个统一的、自我修正的决策过程中。为了实现细粒度的监督和定量评估,我们进一步构建了包含6000张T2I图像和27000个注释缺陷区域的GenBlemish-27K数据集。广泛的实验表明,代理修图师在感知质量、失真定位和人类偏好对齐方面始终优于最先进的方法,建立了自修正和感知可靠的T2I生成的新范式。
Summary / 总结
Agentic Retoucher is a hierarchical framework that addresses small-scale distortions in text-to-image generation by reformulating post-generation correction as a perception-reasoning-action loop. It includes a perception agent for fine-grained distortion localization, a reasoning agent for human-aligned inferential diagnosis, and an action agent for adaptive localized inpainting. The framework integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified process. Experiments show that Agentic Retoucher outperforms existing methods in perceptual quality, distortion localization, and human preference alignment, setting a new standard for self-corrective and perceptually reliable T2I generation.
研究旨在解决SDXL和FLUX等文本到图像生成模型中存在的小尺度失真问题。提出了一个分层框架Agentic Retoucher,将其后生成修正过程重新构想为感知-推理-行动循环。该框架包括用于精细失真定位的感知代理、用于人类对齐的推理诊断代理以及用于适应性局部修复的动作代理。实验表明,Agentic Retoucher在感知质量、失真定位和人类偏好对齐方面优于现有方法,为自纠正和感知可靠的T2I生成设定了新标准。
Leveraging 2D-VLM for Label-Free 3D Segmentation in Large-Scale Outdoor Scene Understanding
Authors: Toshihiko Nishimura, Hirofumi Abe, Kazuhiko Murasaki, Taiga Yoshida, Ryuichi Tanida
Venue: 19th International Conference on Machine Vision Applications (MVA2025), IEICE Transactions on Information and Systems letter
First: 2026-01-05T11:42:49+00:00 · Latest: 2026-01-05T11:42:49+00:00
Comments: 19
Abstract
This paper presents a novel 3D semantic segmentation method for large-scale point cloud data that does not require annotated 3D training data or paired RGB images. The proposed approach projects 3D point clouds onto 2D images using virtual cameras and performs semantic segmentation via a foundation 2D model guided by natural language prompts. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. Our method outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods. Moreover, it supports open-vocabulary recognition, enabling users to detect objects using arbitrary text queries, thus overcoming the limitations of traditional supervised approaches.
中文标题/摘要
标题:利用2D-VLM进行大规模室外场景无标注3D分割
本文提出了一种新颖的3D语义分割方法,适用于大规模点云数据,无需标注的3D训练数据或配对的RGB图像。所提出的方法使用虚拟相机将3D点云投影到2D图像上,并通过自然语言提示引导的基础2D模型进行语义分割。通过多视角加权投票聚合预测,实现3D分割。该方法优于现有的无需训练的方法,并且分割精度与监督方法相当。此外,它支持开放词汇识别,使用户能够使用任意文本查询检测对象,从而克服传统监督方法的局限。
Summary / 总结
This paper introduces a novel 3D semantic segmentation method for large-scale point cloud data without the need for annotated 3D training data or paired RGB images. The approach projects 3D point clouds onto 2D images using virtual cameras and uses a foundation 2D model guided by natural language prompts for semantic segmentation. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. The method outperforms existing training-free approaches and achieves segmentation accuracy comparable to supervised methods, supporting open-vocabulary recognition with arbitrary text queries.
该论文提出了一种面向大规模点云数据的3D语义分割方法,无需标注的3D训练数据或配对的RGB图像。该方法使用虚拟相机将3D点云投影到2D图像上,并借助由自然语言提示引导的2D基础模型进行语义分割。3D分割通过对多个视角的预测进行加权投票来实现。该方法优于现有的免训练方法,分割精度与监督方法相当,同时支持基于任意文本查询的开放词汇识别。
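The multi-view aggregation step described above can be illustrated with a minimal sketch. It assumes each virtual view yields `(point_index, label, weight)` tuples for the points it sees; the function name `aggregate_votes` and the per-view weights are hypothetical, since the abstract does not say how the voting weights are computed.

```python
from collections import defaultdict

def aggregate_votes(view_predictions, num_points):
    """Fuse per-view 2D labels into one label per 3D point by weighted voting.

    view_predictions: list of views; each view is a list of
    (point_index, label, weight) tuples for the points visible in that view.
    The weight is a hypothetical per-view confidence, not the paper's formula.
    """
    scores = [defaultdict(float) for _ in range(num_points)]
    for view in view_predictions:
        for point_index, label, weight in view:
            scores[point_index][label] += weight
    # Points never seen by any virtual camera stay unlabeled (None).
    return [max(s, key=s.get) if s else None for s in scores]

views = [
    [(0, "road", 0.9), (1, "tree", 0.4)],
    [(0, "road", 0.5), (1, "building", 0.7)],
    [(1, "tree", 0.5)],
]
labels = aggregate_votes(views, num_points=3)
print(labels)  # ['road', 'tree', None]
```

Point 1 shows why the weights matter: "building" wins the single strongest vote, but "tree" accumulates more total weight across views.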
CountCluster: Training-Free Object Quantity Guidance with Cross-Attention Map Clustering for Text-to-Image Generation
Authors: Joohyeon Lee, Jin-Seop Lee, Jee-Hyong Lee
First: 2025-08-14T14:53:53+00:00 · Latest: 2026-01-05T11:17:43+00:00
Comments: Under review
Abstract
Diffusion-based text-to-image generation models have demonstrated strong performance in terms of image quality and diversity. However, they still struggle to generate images that accurately reflect the number of objects specified in the input prompt. Several approaches have been proposed that rely on either external counting modules for iterative refinement or quantity representations derived from learned tokens or latent features. However, they still have limitations in accurately reflecting the specified number of objects and overlook an important structural characteristic: the number of object instances in the generated image is largely determined in the early timesteps of the denoising process. To correctly reflect the object quantity for image generation, the highly activated regions in the object cross-attention map at the early timesteps should match the input object quantity, while each region should be clearly separated. To address this issue, we propose CountCluster, a method that guides the object cross-attention map to be clustered according to the specified object count in the input, without relying on any external tools or additional training. The proposed method partitions the object cross-attention map into k clusters at inference time based on attention scores, defines an ideal distribution in which each cluster is spatially well-separated, and optimizes the latent to align with this target distribution. Our method achieves an average improvement of 18.5 percentage points in object count accuracy compared to existing methods, and demonstrates superior quantity control performance across a variety of prompts. Code will be released at: https://github.com/JoohyeonL22/CountCluster
中文标题/摘要
标题:CountCluster:无需训练的对象数量指导方法,基于跨注意力图聚类的文本到图像生成
基于扩散的文本到图像生成模型在图像质量和多样性方面表现出色。然而,它们仍然难以生成准确反映输入提示中指定对象数量的图像。已经提出了几种方法,依赖于外部计数模块的迭代细化或从学习的令牌或潜在特征中推导出的数量表示。然而,这些方法仍然难以准确反映指定的对象数量,并且忽略了一个重要结构特征:生成图像中的对象实例数量在很大程度上由去噪过程的早期时间步决定。为了正确反映图像生成中的对象数量,早期时间步中的对象跨注意力图的高激活区域应与输入对象数量匹配,同时每个区域应清晰分离。为了解决这一问题,我们提出了一种名为CountCluster的方法,该方法根据输入中的指定对象数量指导对象跨注意力图聚类,而无需依赖任何外部工具或额外训练。该方法在推理时根据注意力分数将对象跨注意力图划分为k个聚类,定义了一个理想分布,其中每个聚类在空间上清晰分离,并优化潜在变量以与该目标分布对齐。与现有方法相比,我们的方法在对象数量准确性上平均提高了18.5个百分点,并在各种提示下展示了优越的数量控制性能。代码将在https://github.com/JoohyeonL22/CountCluster发布。
Summary / 总结
CountCluster is a method for guiding the object cross-attention map in text-to-image generation models to accurately reflect the specified number of objects in the input prompt. It clusters the object cross-attention map based on the input object count without external tools or additional training. The method improves object count accuracy by 18.5 percentage points on average compared to existing methods and shows superior quantity control performance across various prompts.
CountCluster 是一种方法,通过输入指定的对象数量来引导对象交叉注意力图聚类,无需外部工具或额外训练。该方法根据注意力分数将对象交叉注意力图划分为 k 个簇,并优化潜在变量以与每个簇在空间上良好分离的理想分布对齐。与现有方法相比,该方法的对象数量准确性平均提高了 18.5 个百分点,并在各种提示下展示了优越的数量控制性能。
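As a rough illustration of the clustering step only, the sketch below groups the highly activated cells of a toy attention map into k clusters. It uses plain k-means as a stand-in, because the abstract does not name the clustering algorithm; the threshold, the initialization, and the function name `cluster_attention` are assumptions. The target distribution and the latent optimization are not shown.

```python
def cluster_attention(attn, k, threshold=0.5, iters=10):
    """Group highly activated cells of a 2D attention map into k clusters.

    Plain k-means with a deterministic init; a stand-in, not the paper's method.
    """
    points = sorted(
        (r, c)
        for r in range(len(attn))
        for c in range(len(attn[0]))
        if attn[r][c] >= threshold
    )
    if len(points) < k:
        raise ValueError("fewer activated cells than clusters")
    # Deterministic init: spread the initial centers over the sorted cells.
    centers = [points[i * (len(points) - 1) // max(k - 1, 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

# Toy 6x6 map with two activated 2x2 blobs in opposite corners.
attn = [[0.0] * 6 for _ in range(6)]
for r, c in [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4), (4, 5), (5, 4), (5, 5)]:
    attn[r][c] = 1.0

centers, groups = cluster_attention(attn, k=2)
print(centers)  # [(0.5, 0.5), (4.5, 4.5)]
```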
Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
Authors: Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian
First: 2026-01-05T10:38:26+00:00 · Latest: 2026-01-05T10:38:26+00:00
Comments: Preprint. Under review
Abstract
Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to elicit basic reasoning skills; (2) blueprint-aware rewards in reinforcement learning to encourage the blueprint to include an appropriate number of objects and to align final answers with this causal reasoning; and (3) anti-shortcut data augmentation that applies targeted perturbations to images and questions, discouraging reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.
中文标题/摘要
标题:使用蓝图思考:通过结构化对象表示辅助视觉语言模型的空间推理
空间推理——感知和推理空间中关系的能力——使视觉语言模型(VLMs)从视觉感知向空间语义理解迈进。现有方法要么重新审视局部图像块,提高细粒度感知但削弱全局空间意识,要么标记孤立坐标,捕捉对象位置但忽视其整体组织。在本工作中,我们将认知概念中的对象为中心的蓝图整合到VLMs中,以增强空间推理。给定一张图片和一个问题,模型首先构建一个JSON风格的蓝图,记录相关对象的位置、大小和属性,然后基于这个结构化表示进行推理以生成最终答案。为此,我们引入了三种关键技术:(1)蓝图嵌入推理跟踪用于监督微调以激发基本推理技能;(2)蓝图感知奖励在强化学习中鼓励蓝图包含适当数量的对象,并使最终答案与这种因果推理保持一致;(3)反捷径数据增强,对图像和问题应用有针对性的扰动,以防止依赖于表面视觉或语言线索。实验表明,我们的方法在所有现有VLMs和专门的空间推理模型中表现更优。
Summary / 总结
This work aims to enhance vision-language models' spatial reasoning capabilities by integrating an object-centric blueprint concept. The method involves constructing a structured representation of objects in an image and reasoning over this blueprint to answer questions. Key techniques include blueprint-embedded reasoning traces for supervised fine-tuning, blueprint-aware rewards in reinforcement learning, and anti-shortcut data augmentation. Experiments demonstrate that this approach outperforms existing vision-language models and specialized spatial reasoning models.
本文旨在通过引入对象中心的蓝图来增强视觉语言模型的空间推理能力。模型构建图像中对象的结构化表示,并基于此进行推理以回答问题。关键技术包括蓝图嵌入的推理跟踪、蓝图感知的奖励以及反捷径数据增强。实验表明,所提出的方法在空间推理任务中优于现有的视觉语言模型和专门的空间推理模型。
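To make the idea of a JSON-style blueprint concrete, here is a small sketch of what such a structure might look like and how a spatial question could be answered from it. The field names (`name`, `bbox`, `attributes`) and the helper `is_left_of` are hypothetical; the abstract only states that the blueprint records positions, sizes, and attributes of relevant objects.

```python
import json

# Hypothetical blueprint a model might emit; bbox is [x_min, y_min, x_max, y_max] in pixels.
blueprint_json = """
{"objects": [
  {"name": "cup", "bbox": [40, 120, 90, 180], "attributes": ["red"]},
  {"name": "laptop", "bbox": [150, 100, 320, 220], "attributes": ["open"]}
]}
"""

def center_x(obj):
    """Horizontal center of an object's bounding box."""
    x_min, _, x_max, _ = obj["bbox"]
    return (x_min + x_max) / 2

def is_left_of(blueprint, a, b):
    """Answer 'is a left of b?' from the structured representation alone."""
    objects = {o["name"]: o for o in blueprint["objects"]}
    return center_x(objects[a]) < center_x(objects[b])

blueprint = json.loads(blueprint_json)
answer = is_left_of(blueprint, "cup", "laptop")
print(answer)  # True
```

Once the blueprint exists, the spatial answer follows from explicit coordinates and can be checked independently of the image.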
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
First: 2025-06-10T17:57:50+00:00 · Latest: 2026-01-05T10:14:19+00:00
Abstract
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
中文标题/摘要
标题:自回归语义视觉重建有助于提升VLMs的理解能力
典型的大型视觉-语言模型(LVLMs)仅对文本序列应用自回归监督,而未充分将视觉模态纳入学习过程。这导致了三个关键限制:(1)无法利用没有配套文字描述的图像,(2)文字描述可能遗漏关键视觉细节,(3)某些视觉中心的内容难以通过文本充分传达。因此,当前的LVLMs往往优先实现视觉到语言的对齐,而可能忽视了精细的视觉信息。尽管一些先前的工作探索了自回归图像生成,但有效利用自回归视觉监督来增强图像理解仍是一个开放的挑战。在本文中,我们引入了自回归语义视觉重建(ASVR),它能够在统一的自回归框架中联合学习视觉和文本模态。我们展示了自回归重建图像的原始视觉外观并不能提升,甚至可能损害多模态理解。相反,自回归重建图像的语义表示能够一致地提高理解能力。值得注意的是,我们发现即使模型接收到连续的图像特征输入,它们也能有效地重建离散的语义标记,从而在多种多模态理解基准测试中实现稳定且一致的改进。我们的方法在不同数据规模(55.6万-200万)和不同类型的LLM主干模型上均取得了显著的性能提升。具体而言,ASVR使LLaVA-1.5在14个多模态基准测试中的平均得分提高了5%。代码可在https://github.com/AlenjandroWang/ASVR获取。
Summary / 总结
This paper addresses the limitations of typical large vision-language models (LVLMs) by introducing Autoregressive Semantic Visual Reconstruction (ASVR), which enhances multimodal understanding. ASVR applies autoregressive supervision to both textual and visual modalities, reconstructing the semantic representation of images rather than their raw visual appearance. The key finding is that ASVR improves comprehension across a wide range of multimodal understanding benchmarks, raising the average score of LLaVA-1.5 by 5% across 14 benchmarks.
本文通过引入自回归语义视觉重构(ASVR)来解决典型大型视觉-语言模型(LVLM)的局限性,旨在增强多模态理解。ASVR将自回归监督应用于文本和视觉模态,重点在于重构图像的语义表示而非原始视觉外观。关键发现是,ASVR在各种基准测试中提高了多模态理解能力,LLaVA-1.5的平均得分提高了5%。该方法在不同数据规模和LLM骨干类型下均表现出一致的增强效果。
AFTER: Mitigating the Object Hallucination of LVLM via Adaptive Factual-Guided Activation Editing
Authors: Tianbo Wang, Yuqing Ma, Kewei Liao, Zhange Zhang, Simin Li, Jinyang Guo, Xianglong Liu
First: 2026-01-05T10:02:22+00:00 · Latest: 2026-01-05T10:02:22+00:00
Abstract
Large Vision-Language Models (LVLMs) have achieved substantial progress in cross-modal tasks. However, due to language bias, LVLMs are susceptible to object hallucination, which can be primarily divided into category, attribute, and relation hallucination, significantly impeding trustworthy AI applications. Editing the internal activations of LVLMs has shown promising effectiveness in mitigating hallucinations with minimal cost. However, previous editing approaches neglect the effective guidance offered by factual textual semantics, thereby struggling to explicitly mitigate language bias. To address these issues, we propose Adaptive Factual-guided Visual-Textual Editing for hallucination mitigation (AFTER), which comprises Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO), to adaptively guide the original biased activations towards factual semantics. Specifically, FAS is proposed to provide factual and general guidance for activation editing, thereby explicitly modeling the precise visual-textual associations. Subsequently, QAO introduces a query-aware offset estimator to establish query-specific editing from the general steering vector, enhancing the diversity and granularity of editing. Extensive experiments on standard hallucination benchmarks across three widely adopted LVLMs validate the efficacy of the proposed AFTER, notably achieving up to a 16.3% reduction of hallucination over baseline on the AMBER benchmark. Our code and data will be released for reproducibility.
中文标题/摘要
标题:AFTER: 通过自适应事实导向激活编辑减轻LVLM的对象幻觉
大型视觉-语言模型(LVLMs)在跨模态任务中取得了显著进展。然而,由于语言偏见,LVLMs 易于产生对象幻觉,这主要可以分为类别、属性和关系幻觉,严重阻碍了可信AI应用。通过编辑LVLMs的内部激活来减轻幻觉已显示出良好的效果,且成本较低。然而,先前的编辑方法忽视了事实文本语义提供的有效指导,因此难以明确减轻语言偏见。为解决这些问题,我们提出了自适应事实导向视觉-文本编辑(AFTER),它包括事实增强激活导向(FAS)和查询自适应偏移优化(QAO),以自适应地引导原始有偏的激活向事实语义。具体而言,FAS 提出了一种为激活编辑提供事实和通用指导的方法,从而明确建模视觉-文本关联。随后,QAO 引入了一个查询感知偏移估计器,以从通用导向向量中建立查询特定的编辑,增强编辑的多样性和精细度。在三个广泛采用的LVLMs上的标准幻觉基准上的广泛实验验证了所提出的AFTER的有效性,尤其是在AMBER基准上相比基线最多将幻觉减少了16.3%。我们的代码和数据将发布以确保可再现性。
Summary / 总结
The paper addresses the issue of object hallucination in Large Vision-Language Models (LVLMs) by proposing AFTER, which uses Factual-Augmented Activation Steering (FAS) and Query-Adaptive Offset Optimization (QAO) to guide activations towards factual semantics. Extensive experiments show a significant reduction in hallucination, with up to a 16.3% decrease on the AMBER benchmark compared to baseline methods.
研究旨在通过提出AFTER方法,即自适应事实导向的视觉-文本编辑,解决大型视觉-语言模型(LVLM)中的物体幻觉问题。该方法包括提供事实和一般指导的Factual-Augmented Activation Steering (FAS),以及用于查询特定编辑的Query-Adaptive Offset Optimization (QAO)。实验表明,AFTER能够有效减少幻觉,相比基线方法在AMBER基准上实现了高达16.3%的幻觉减少。
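The interplay between a general steering vector and a query-specific offset can be sketched as follows. This assumes the common additive form of activation steering, where the edited activation is the original plus a scaled direction; the abstract does not give the actual update rule, so the function `steer`, the `strength` parameter, and the toy vectors are all hypothetical.

```python
def steer(activation, general_vector, query_offset, strength=1.0):
    """Shift a hidden activation along (general steering vector + query offset).

    Assumed additive form: h' = h + strength * (v + o). Not the paper's exact rule.
    """
    return [
        h + strength * (v + o)
        for h, v, o in zip(activation, general_vector, query_offset)
    ]

activation = [1.0, 2.0]       # toy hidden state
general_vector = [0.5, -0.5]  # stands in for the FAS direction
query_offset = [0.25, 0.0]    # stands in for the QAO estimate for this query

edited = steer(activation, general_vector, query_offset, strength=2.0)
print(edited)  # [2.5, 1.0]
```

Setting `query_offset` to zeros recovers a purely general edit, which is the case the query-aware offset is meant to refine.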
RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models
Authors: Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu
First: 2025-12-29T06:44:06+00:00 · Latest: 2026-01-05T09:01:02+00:00
Abstract
Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
中文标题/摘要
标题:RS-Prune: 面向高效遥感扩散基础模型的高比例免训练数据剪枝
基于扩散的遥感(RS)生成基础模型对于下游任务至关重要。然而,这些模型依赖于大量全球代表性数据,这些数据通常包含冗余、噪声和类别不平衡,降低了训练效率并阻碍了收敛。现有的 RS 扩散基础模型通常聚合多个分类数据集或应用简单的去重方法,忽视了生成建模的分布要求以及 RS 图像的异质性。为了解决这些限制,我们提出了一种无需训练的两阶段数据剪枝方法,该方法能够在高剪枝比例下快速选择高质量子集,使初步基础模型能够快速收敛,并作为生成、下游微调和其他应用的多功能骨干。我们的方法同时考虑了局部信息内容与全局场景级的多样性和代表性。首先,基于熵准则高效移除低信息量样本。接着,利用 RS 场景分类数据集作为参考基准,我们进行场景感知聚类并采用分层抽样以提高聚类效果并减少大规模未标记数据上的计算成本。最后,通过平衡聚类级均匀性和样本代表性,该方法能够在高剪枝比例下实现精细选择,同时保持整体多样性和代表性。实验表明,即使剪枝了 85% 的训练数据,我们的方法也能显著提高收敛性和生成质量。此外,使用我们方法训练的扩散基础模型在包括超分辨率和语义图像合成在内的下游任务中始终实现了最先进的性能。这种数据剪枝范式为开发 RS 生成基础模型提供了实用指导。
Summary / 总结
The paper proposes RS-Prune, a training-free data pruning method for remote sensing (RS) diffusion foundation models. It aims to address the issues of redundancy and class imbalance in large RS datasets by pruning data under high ratios, ensuring efficient training and model convergence. The method uses an entropy-based criterion to remove low-information samples and performs scene-aware clustering with stratified sampling to maintain diversity and representativeness. Experiments show that even after pruning 85% of the training data, the model converges faster and generates higher quality images, achieving state-of-the-art performance in downstream tasks such as super-resolution and semantic image synthesis.
论文提出了一种名为RS-Prune的无训练数据剪枝方法,用于基于扩散的遥感(RS)生成基础模型。该方法通过使用熵基准则和基于场景的聚类与分层采样来解决大数据集中的冗余和类别不平衡问题。实验表明,即使剪枝掉85%的训练数据,该方法仍能提高收敛性和生成质量,并且在超分辨率和语义图像合成等下游任务中,使用此方法训练的模型优于现有方法。
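The first pruning stage, an entropy-based filter, might look roughly like the sketch below. It assumes Shannon entropy over the pixel-intensity histogram as the information measure; the abstract does not specify the exact criterion, and the threshold, image names, and function names are hypothetical. The clustering and sampling stage is not shown.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy (in bits) of the pixel-intensity histogram."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def drop_low_information(images, min_entropy):
    """Keep only images whose intensity entropy reaches the threshold."""
    return [name for name, pixels in images.items() if shannon_entropy(pixels) >= min_entropy]

# Toy "images" as flat intensity lists: a uniform patch and a more varied one.
images = {
    "flat_sea": [0] * 16,
    "harbor": [0, 64, 128, 255] * 4,
}
kept = drop_low_information(images, min_entropy=1.0)
print(kept)  # ['harbor']
```

A uniform patch has zero entropy and is dropped, while the patch with four equally frequent intensities scores 2 bits and is kept.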
TalkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing
Authors: Yujie Hu, Zecheng Tang, Xu Jiang, Weiqi Li, Jian Zhang
First: 2026-01-05T09:00:32+00:00 · Latest: 2026-01-05T09:00:32+00:00
Comments: a Conversational Assistant for Intelligent Image Editing
Abstract
Thanks to the powerful language comprehension capabilities of Large Language Models (LLMs), existing instruction-based image editing methods have introduced Multimodal Large Language Models (MLLMs) to promote information exchange between instructions and images, ensuring the controllability and flexibility of image editing. However, these frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks, which is not only time-consuming and labor-intensive but also fails to achieve satisfactory results. In this paper, we present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction. We instruct the open-source LLM with a specially designed prompt template to analyze user needs after receiving instructions and hierarchically invoke existing advanced editing methods, all without additional training. Moreover, we implement a plug-and-play and efficient invocation of image editing methods, allowing complex and unseen editing tasks to be integrated into the current framework, achieving stable and high-quality editing results. Extensive experiments demonstrate that our method not only provides more accurate invocation with lower token consumption but also achieves higher editing quality across various image editing tasks.
中文标题/摘要
标题:TalkPhoto:一种无需训练的多功能智能图像编辑对话助手
得益于大型语言模型(LLMs)强大的语言理解能力,现有的基于指令的图像编辑方法引入了多模态大型语言模型(MLLMs),以促进指令与图像之间的信息交流,确保图像编辑的可控性和灵活性。然而,这些框架通常需要构建一个多指令数据集来训练模型以处理多种编辑任务,这不仅耗时费力,而且难以达到满意的效果。在本文中,我们提出了TalkPhoto,一种无需训练的多功能图像编辑框架,通过对话交互实现精确的图像操作。我们使用一个特别设计的提示模板对开源LLM进行指令,接收指令后分析用户需求,并分层调用现有的高级编辑方法,无需额外训练。此外,我们实现了图像编辑方法的即插即用和高效调用,使复杂的未见过的编辑任务能够集成到当前框架中,实现稳定且高质量的编辑效果。广泛的实验表明,我们的方法不仅提供了更准确的调用且消耗更少的令牌,还在各种图像编辑任务中实现了更高的编辑质量。
Summary / 总结
TalkPhoto is a training-free conversational assistant for image editing that uses a specially designed prompt template to analyze user instructions and invoke existing advanced editing methods hierarchically. This approach avoids the need for extensive training and dataset creation, leading to more accurate and efficient image manipulation with high-quality results across various tasks. Extensive experiments show that TalkPhoto consumes fewer tokens and achieves better editing quality compared to existing methods.
TalkPhoto 是一个无需训练的图像编辑对话助手,通过专门设计的提示模板分析用户指令并分层调用现有的高级编辑方法。这种方法避免了额外训练的需要,能够实现高效且精确的图像操作。实验表明,TalkPhoto 在消耗更少令牌的同时实现了高质量的编辑效果,并且能够处理复杂的和未见过的编辑任务。
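One way to picture the plug-and-play, hierarchical invocation is a two-level registry of editing methods, sketched below. Everything here is hypothetical: the category and method names, the `register` decorator, and especially `plan`, which is a keyword-based stand-in for the prompted LLM that the abstract describes. The editing functions return strings instead of touching images.

```python
REGISTRY = {}

def register(category, name):
    """Decorator that plugs an editing method into a category of the registry."""
    def wrap(fn):
        REGISTRY.setdefault(category, {})[name] = fn
        return fn
    return wrap

@register("global", "style_transfer")
def style_transfer(image, instruction):
    return f"{image} -> style_transfer({instruction})"

@register("local", "object_removal")
def object_removal(image, instruction):
    return f"{image} -> object_removal({instruction})"

def plan(instruction):
    """Stand-in for the prompted LLM: pick (category, method) from keywords."""
    if "remove" in instruction.lower():
        return "local", "object_removal"
    return "global", "style_transfer"

def edit(image, instruction):
    """Resolve the category first, then the method, then invoke it."""
    category, method = plan(instruction)
    return REGISTRY[category][method](image, instruction)

result = edit("photo.png", "Remove the lamp")
print(result)  # photo.png -> object_removal(Remove the lamp)
```

Adding a new editing method only requires another `@register(...)` function, which is the sense in which such a design would be plug-and-play.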
MMP-A*: Multimodal Perception Enhanced Incremental Heuristic Search on Path Planning
Authors: Minh Hieu Ha, Khanh Ly Ta, Hung Phan, Tung Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
First: 2026-01-05T08:55:27+00:00 · Latest: 2026-01-05T08:55:27+00:00
Abstract
Autonomous path planning requires a synergy between global reasoning and geometric precision, especially in complex or cluttered environments. While classical A* is valued for its optimality, it incurs prohibitive computational and memory costs in large-scale scenarios. Recent attempts to mitigate these limitations by using Large Language Models for waypoint guidance remain insufficient, as they rely only on text-based reasoning without spatial grounding. As a result, such models often produce incorrect waypoints in topologically complex environments with dead ends, and lack the perceptual capacity to interpret ambiguous physical boundaries. These inconsistencies lead to costly corrective expansions and undermine the intended computational efficiency. We introduce MMP-A*, a multimodal framework that integrates the spatial grounding capabilities of vision-language models with a novel adaptive decay mechanism. By anchoring high-level reasoning in physical geometry, the framework produces coherent waypoint guidance that addresses the limitations of text-only planners. The adaptive decay mechanism dynamically regulates the influence of uncertain waypoints within the heuristic, ensuring geometric validity while substantially reducing memory overhead. To evaluate robustness, we test the framework in challenging environments characterized by severe clutter and topological complexity. Experimental results show that MMP-A* achieves near-optimal trajectories with significantly reduced operational costs, demonstrating its potential as a perception-grounded and computationally efficient paradigm for autonomous navigation.
中文标题/摘要
标题:MMP-A*: 多模态感知增强增量启发式搜索路径规划
自主路径规划需要在全局推理和几何精度之间实现协同作用,尤其是在复杂或拥挤的环境中。虽然经典的A*因其最优性而受到重视,但在大规模场景中会带来巨大的计算和内存成本。最近通过使用大型语言模型进行航点指导来缓解这些限制的努力仍然不足,因为它们仅依赖于基于文本的推理而缺乏空间定位能力。因此,这些模型在拓扑复杂且有死胡同的环境中经常生成错误的航点,并缺乏感知能力来解释模糊的物理边界。这些不一致导致昂贵的修正扩展,并削弱了预期的计算效率。我们引入了MMP-A*,这是一种结合了视觉语言模型的空间定位能力和新颖的自适应衰减机制的多模态框架。通过将高层次推理锚定在物理几何上,该框架生成连贯的航点指导,解决了纯文本规划器的局限性。自适应衰减机制动态调节启发式中不确定航点的影响,确保几何有效性同时大幅减少内存开销。为了评估鲁棒性,我们在严重拥挤和拓扑复杂性的环境中测试了该框架。实验结果表明,MMP-A*在显著降低操作成本的同时实现了接近最优的轨迹,展示了其作为感知导向和计算高效的自主导航范式的潜力。
Summary / 总结
MMP-A* is a multimodal framework that combines the spatial grounding of vision-language models with an adaptive decay mechanism to enhance path planning in complex environments. It addresses the limitations of classical A* and text-only planners by producing coherent waypoint guidance and ensuring geometric validity. Experimental results show that MMP-A* achieves near-optimal trajectories with reduced operational costs, making it a promising approach for autonomous navigation.
MMP-A* 是一种结合了视觉语言模型的空间接地能力和自适应衰减机制的多模态框架,以增强在复杂环境中的路径规划。该框架通过生成连贯的航点指导并确保几何有效性来克服纯文本规划器的局限性。实验结果表明,MMP-A* 能够实现接近最优的轨迹并显著降低操作成本,使其成为自主导航中一种感知导向且计算高效的范式。
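A minimal sketch of waypoint-guided A* with a decaying waypoint term follows. It assumes a 4-connected grid, a Manhattan heuristic, and an exponential decay of the waypoint influence with the number of expansions; the abstract does not describe the actual decay rule, so `guided_astar`, `weight`, and `decay` are hypothetical. Because the extra waypoint term can overestimate the remaining cost, this variant does not guarantee optimal paths in general.

```python
import heapq

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def guided_astar(grid, start, goal, waypoint, weight=1.0, decay=0.9):
    """A* on a 4-connected grid; the waypoint term fades as more nodes are expanded."""
    open_heap = [(manhattan(start, goal), 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    expansions = 0
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if cost > best_cost[node]:
            continue  # stale heap entry
        expansions += 1
        bias = weight * decay ** expansions  # hypothetical decay schedule
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue  # obstacle
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = node
                priority = new_cost + manhattan(nxt, goal) + bias * manhattan(nxt, waypoint)
                heapq.heappush(open_heap, (priority, new_cost, nxt))
    return None

# 1 marks a wall; the only route from top-left to bottom-left passes through (1, 2).
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = guided_astar(grid, start=(0, 0), goal=(2, 0), waypoint=(1, 2))
print(path)
```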
Toward Auditable Neuro-Symbolic Reasoning in Pathology: SQL as an Explicit Trace of Evidence
Authors: Kewen Cao, Jianxu Chen, Yongbing Zhang, Ye Zhang, Hongxiao Wang
First: 2026-01-05T08:02:49+00:00 · Latest: 2026-01-05T08:02:49+00:00
Abstract
Automated pathology image analysis is central to clinical diagnosis, but clinicians still ask which slide features drive a model's decision and why. Vision-language models can produce natural language explanations, but these are often correlational and lack verifiable evidence. In this paper, we introduce an SQL-centered agentic framework that enables both feature measurement and reasoning to be auditable. Specifically, after extracting human-interpretable cellular features, Feature Reasoning Agents compose and execute SQL queries over feature tables to aggregate visual evidence into quantitative findings. A Knowledge Comparison Agent then evaluates these findings against established pathological knowledge, mirroring how pathologists justify diagnoses from measurable observations. Extensive experiments evaluated on two pathology visual question answering datasets demonstrate our method improves interpretability and decision traceability while producing executable SQL traces that link cellular measurements to diagnostic conclusions.
中文标题/摘要
标题:迈向可审计的神经符号推理在病理学中的应用:SQL作为明确的证据踪迹
自动化病理图像分析是临床诊断的核心,但临床医生仍然想知道哪些切片特征驱动了模型的决策以及为什么。视觉-语言模型可以生成自然语言解释,但这些解释往往是相关性的,缺乏可验证的证据。在本文中,我们提出了一种以SQL为中心的代理框架,使特征测量和推理都可以进行审计。具体来说,在提取出可解释的细胞特征后,特征推理代理会组成并执行SQL查询,对特征表进行聚合,将视觉证据转化为定量发现。知识比较代理随后会将这些发现与已建立的病理知识进行评估,类似于病理学家如何根据可测量的观察结果来证明诊断。在两个病理视觉问答数据集上的广泛实验表明,我们的方法提高了可解释性和决策可追溯性,同时生成了可执行的SQL踪迹,将细胞测量与诊断结论联系起来。
Summary / 总结
The research aims to enhance the interpretability and traceability of automated pathology image analysis by introducing an SQL-centered framework. This framework allows for the composition and execution of SQL queries to aggregate visual evidence into quantitative findings, which are then evaluated against established pathological knowledge. Experiments on two pathology datasets show that this method improves interpretability and decision traceability, linking cellular measurements to diagnostic conclusions.
本文旨在解决自动化病理图像分析中透明和可验证解释的需要。它提出了一种以SQL为中心的框架,其中特征推理代理提取并推理细胞特征,知识比较代理则将这些发现与已建立的知识进行评估。该方法通过生成将细胞测量与诊断结论联系起来的可执行SQL痕迹,提高了解释性和决策可追溯性,并在两个病理数据集上的实验中显示出这些改进。
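The idea of SQL as an auditable trace can be illustrated with Python's built-in `sqlite3` module. The table schema, column names, and values below are invented for illustration and do not come from the paper; the point is only that the query itself documents how a quantitative finding was derived from per-cell measurements.

```python
import sqlite3

# Hypothetical per-cell feature table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (cell_id INTEGER, cell_type TEXT, nucleus_area REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(1, "tumor", 82.0), (2, "tumor", 95.0), (3, "lymphocyte", 30.0), (4, "tumor", 78.0)],
)

# The query is the trace: it states exactly which rows and aggregates back the finding.
query = """
    SELECT cell_type, COUNT(*) AS n, AVG(nucleus_area) AS mean_area
    FROM cells
    GROUP BY cell_type
    ORDER BY cell_type
"""
findings = conn.execute(query).fetchall()
print(findings)  # [('lymphocyte', 1, 30.0), ('tumor', 3, 85.0)]
conn.close()
```

A reviewer can rerun the same query against the same table and obtain the same numbers, which a free-text explanation does not allow.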
Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models
Authors: Minghao Yin, Yukang Cao, Kai Han
First: 2025-11-27T13:03:57+00:00 · Latest: 2026-01-05T08:01:23+00:00
Abstract
We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This empowers WUKONG to support both global texture transitions and identity-preserving texture morphing, catering to diverse generation needs. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.
中文标题/摘要
标题:悟空的72变:基于流模型的无训练高保真纹理3D变形
我们提出了一种名为WUKONG的新型无训练框架,用于高保真纹理3D变形,该框架以一对源和目标提示(图像或文本)作为输入。与传统的依赖于手动对应匹配和变形轨迹估计的方法(限制了泛化能力并需要昂贵的预处理)不同,WUKONG利用基于流的生成器的生成先验来生成具有丰富纹理细节的高保真3D过渡。为了确保形状过渡的平滑性,我们利用基于流的生成过程的内在连续性,并将变形问题形式化为最优传输重心问题。我们进一步引入了一种顺序初始化策略,以防止突然的几何失真并保持身份一致性。为了忠实保留纹理,我们提出了一种基于相似性的语义一致性机制,该机制选择性地保留高频细节并允许对混合动力学进行精确控制。这使WUKONG能够支持全局纹理过渡和身份保留的纹理变形,以满足各种生成需求。广泛的定量和定性评估表明,WUKONG显著优于现有方法,在各种几何和纹理变化中取得了更优的结果。
Summary / 总结
WUKONG is a training-free framework for high-fidelity textured 3D morphing that uses a pair of source and target prompts as input. Unlike conventional methods, WUKONG leverages flow-based transformers to produce smooth and detailed 3D transitions. It formulates morphing as an optimal transport barycenter problem and introduces a sequential initialization strategy and a similarity-guided semantic consistency mechanism to prevent geometric distortions and preserve texture details. Experimental results show that WUKONG outperforms existing methods in handling diverse geometry and texture variations.
WUKONG 是一个无需训练的框架,利用流基变换器生成高保真且细节丰富的 3D 转换,从源和目标提示中生成。它将形态学问题表述为最优传输重心问题,并引入了顺序初始化策略和相似性引导的语义一致性机制,以防止几何失真并保留纹理细节。实验结果表明,WUKONG 在处理各种几何和纹理变化方面优于现有方法。
Entity-Guided Multi-Task Learning for Infrared and Visible Image Fusion
Authors: Wenyu Shao, Hongbo Liu, Yunchuan Ma, Ruili Wang
First: 2026-01-05T08:00:03+00:00 · Latest: 2026-01-05T08:00:03+00:00
Comments: Accepted by IEEE Transactions on Multimedia
Abstract
Existing text-driven infrared and visible image fusion approaches often rely on textual information at the sentence level, which can lead to semantic noise from redundant text and fail to fully exploit the deeper semantic value of textual information. To address these issues, we propose a novel fusion approach named Entity-Guided Multi-Task learning for infrared and visible image fusion (EGMT). Our approach includes three key innovative components: (i) A principled method is proposed to extract entity-level textual information from image captions generated by large vision-language models, eliminating semantic noise from raw text while preserving critical semantic information; (ii) A parallel multi-task learning architecture is constructed, which integrates image fusion with a multi-label classification task. By using entities as pseudo-labels, the multi-label classification task provides semantic supervision, enabling the model to achieve a deeper understanding of image content and significantly improving the quality and semantic density of the fused image; (iii) An entity-guided cross-modal interactive module is also developed to facilitate the fine-grained interaction between visual and entity-level textual features, which enhances feature representation by capturing cross-modal dependencies at both inter-visual and visual-entity levels. To promote the wide application of the entity-guided image fusion framework, we release the entity-annotated version of four public datasets (i.e., TNO, RoadScene, M3FD, and MSRS). Extensive experiments demonstrate that EGMT achieves superior performance in preserving salient targets, texture details, and semantic consistency, compared to the state-of-the-art methods. The code and dataset will be publicly available at https://github.com/wyshao-01/EGMT.
中文标题/摘要
标题:红外和可见光图像融合的实体引导多任务学习
现有的文本驱动的红外和可见光图像融合方法通常依赖于句子级别的文本信息,这可能导致冗余文本引起的语义噪声,并未能充分利用文本信息的深层语义价值。为了解决这些问题,我们提出了一种新的融合方法,名为红外和可见光图像融合的实体引导多任务学习(EGMT)。该方法包括三个关键创新组件:(i) 提出了一种方法从大型视觉-语言模型生成的图像描述中提取实体级别的文本信息,消除原始文本中的语义噪声,同时保留关键的语义信息;(ii) 构建了一个并行的多任务学习架构,将图像融合与多标签分类任务相结合。通过使用实体作为伪标签,多标签分类任务提供了语义监督,使模型能够更深入地理解图像内容,显著提高融合图像的质量和语义密度;(iii) 还开发了一个实体引导的跨模态交互模块,以促进视觉和实体级别文本特征的细粒度交互,通过捕获跨模态依赖关系,增强特征表示。为了促进实体引导图像融合框架的广泛应用,我们发布了四个公开数据集的实体标注版本(即TNO、RoadScene、M3FD和MSRS)。广泛的实验表明,EGMT在保留显著目标、纹理细节和语义一致性方面优于最先进的方法。代码和数据集将在https://github.com/wyshao-01/EGMT公开。
Summary / 总结
The paper proposes EGMT, a novel entity-guided multi-task learning approach for infrared and visible image fusion. It extracts entity-level textual information from image captions to reduce semantic noise and integrates image fusion with a multi-label classification task for deeper semantic understanding. Experiments show that EGMT outperforms existing methods in preserving salient targets, texture details, and semantic consistency.
该研究提出了一种新颖的实体引导多任务学习方法EGMT,用于红外和可见光图像融合。它通过从图像说明中提取实体级文本信息并用于语义监督来解决现有文本驱动方法的局限性。EGMT 包括一个并行多任务学习架构和一个实体引导的跨模态交互模块,这些模块增强了特征表示并提高了融合图像的质量和语义密度。实验表明,EGMT 在保留显著目标、纹理细节和语义一致性方面优于最先进的方法。
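The multi-task idea in this entry (image fusion trained jointly with multi-label classification, using caption entities as pseudo-labels) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the entity vocabulary `VOCAB`, the loss weight `lam`, and all function names are assumptions, and the fusion loss is passed in as a plain number.

```python
import math

# Hypothetical entity vocabulary; the paper's actual entity set is not listed here.
VOCAB = ["person", "car", "road", "tree"]

def multi_hot(entities, vocab=VOCAB):
    """Turn the entities extracted from a caption into a multi-hot pseudo-label."""
    present = set(entities)
    return [1.0 if name in present else 0.0 for name in vocab]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels (numerically stable form)."""
    total = 0.0
    for z, t in zip(logits, targets):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def total_loss(fusion_loss, logits, targets, lam=0.5):
    """Assumed combination: fusion loss plus a weighted classification loss."""
    return fusion_loss + lam * bce_with_logits(logits, targets)

labels = multi_hot(["person", "road"])
loss = total_loss(1.0, [0.0, 0.0, 0.0, 0.0], labels)
```

With all logits at zero, every label contributes log 2 to the classification term, so the classifier starts from an uninformative baseline and the pseudo-labels supply the semantic supervision.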
SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection
Authors: Joongwon Chae, Zhenyu Wang, Peiwu Qin
First: 2024-12-03T16:53:58+00:00 · Latest: 2026-01-05T07:34:56+00:00
Comments: A flaw was discovered in the experimental setup. Therefore, we are retracting the paper.
Abstract
Despite significant advances in vision-language understanding, implementing image segmentation within multimodal architectures remains a fundamental challenge in modern artificial intelligence systems. Existing vision-language models, which primarily rely on backbone architectures or CLIP-based embedding learning, demonstrate inherent limitations in fine-grained spatial localization and operational capabilities. This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection, a framework that leverages spatial coordinate understanding to bridge vision-language interaction and precise segmentation, enabling accurate target identification through natural language instructions. The framework presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. By utilizing normalized coordinate detection for bounding boxes and transforming them into actionable segmentation outputs, we establish a connection between spatial and language representations in multimodal architectures. Experimental results demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC. Testing on a single NVIDIA RTX 3090 GPU with 512x512 resolution images yields an average inference time of 7 seconds per image, demonstrating the framework's effectiveness in both accuracy and practical deployability. The project code is available at https://github.com/jw-chae/SJTU
中文标题/摘要
标题:SJTU:多模态模型中的空间判断——通过坐标检测实现统一分割
尽管在视觉-语言理解方面取得了显著进展,但在现代人工智能系统中,将图像分割融入多模态架构仍然是一个基本挑战。现有的视觉-语言模型主要依赖于骨干架构或基于CLIP的嵌入学习,存在精细的空间定位和操作能力的固有限制。本文介绍了SJTU:多模态模型中的空间判断——通过坐标检测实现统一分割,该框架利用空间坐标理解来弥合视觉-语言交互和精确分割之间的差距,通过自然语言指令实现准确的目标识别。该框架提出了一种通过多模态空间中的空间推理将分割技术与视觉-语言模型集成的方法。通过利用归一化的坐标检测边界框并将其转换为可操作的分割输出,我们建立了多模态架构中空间和语言表示之间的联系。实验结果表明,该框架在基准数据集上表现出色,在COCO 2017上实现了0.5958的IoU分数,在Pascal VOC上实现了0.6758的IoU分数。在单个NVIDIA RTX 3090 GPU上以512x512分辨率图像进行测试,每张图像的平均推理时间为7秒,证明了该框架在准确性和实际部署性方面的有效性。项目代码可在https://github.com/jw-chae/SJTU获取
Summary / 总结
This paper addresses the challenge of implementing image segmentation in multimodal models by introducing SJTU, which uses spatial coordinate understanding to enhance vision-language interaction and precise segmentation. The framework integrates segmentation techniques with vision-language models through spatial inference, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC. However, an experimental setup flaw led to the retraction of the paper.
本文旨在解决在多模态架构中实现图像分割的挑战,提出了一种名为SJTU的框架,该框架通过空间坐标理解增强视觉-语言交互,实现精确分割。通过将分割技术与视觉-语言模型结合,该框架提高了准确性和实际部署性,分别在COCO 2017和Pascal VOC上实现了IoU分数0.5958和0.6758。然而,实验设置中发现了一个问题,导致撤回了该论文。
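Two pieces of arithmetic mentioned in this entry, normalized bounding-box coordinates and IoU, can be sketched as below. The function names are illustrative assumptions, and the IoU shown is the box-level version; the scores quoted in the abstract may be computed on masks, which the entry does not specify.

```python
def denormalize_box(box, width, height):
    """Map a normalized (x1, y1, x2, y2) box in [0, 1] to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

pixel_box = denormalize_box((0.25, 0.25, 0.75, 0.75), 512, 512)
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```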
VerLM: Explaining Face Verification Using Natural Language
Authors: Syed Abdul Hannan, Hazim Bukhari, Thomas Cantalapiedra, Eman Ansar, Massa Baali, Rita Singh, Bhiksha Raj
First: 2026-01-05T05:16:07+00:00 · Latest: 2026-01-05T05:16:07+00:00
Abstract
Face verification systems have seen substantial advancements; however, they often lack transparency in their decision-making processes. In this paper, we introduce an innovative Vision-Language Model (VLM) for Face Verification, which not only accurately determines if two face images depict the same individual but also explicitly explains the rationale behind its decisions. Our model is uniquely trained using two complementary explanation styles: (1) concise explanations that summarize the key factors influencing its decision, and (2) comprehensive explanations detailing the specific differences observed between the images. We adapt and enhance a state-of-the-art modeling approach originally designed for audio-based differentiation to suit visual inputs effectively. This cross-modal transfer significantly improves our model's accuracy and interpretability. The proposed VLM integrates sophisticated feature extraction techniques with advanced reasoning capabilities, enabling clear articulation of its verification process. Our approach demonstrates superior performance, surpassing baseline methods and existing models. These findings highlight the immense potential of vision-language models in face verification setups, contributing to more transparent, reliable, and explainable face verification systems.
中文标题/摘要
标题:VerLM:使用自然语言解释面部验证
面部验证系统取得了显著的进步,但它们在决策过程中的透明度往往不足。本文介绍了一种创新的视觉-语言模型(VLM)用于面部验证,不仅能够准确判断两张面部图像是否属于同一个人,还能明确解释其决策的依据。我们的模型通过两种互补的解释风格进行独特训练:(1)简洁的解释,总结影响其决策的关键因素;(2)详尽的解释,详细说明图像之间的具体差异。我们借鉴并改进了最初为基于音频的差异设计的最先进的建模方法,使其能够有效处理视觉输入。这种跨模态的转移显著提高了模型的准确性和可解释性。所提出的VLM结合了先进的特征提取技术和推理能力,使验证过程的阐述更加清晰。我们的方法展示了优越的性能,超越了基线方法和现有模型。这些发现突显了视觉语言模型在面部验证设置中的巨大潜力,有助于构建更透明、可靠和可解释的面部验证系统。
Summary / 总结
This paper addresses the lack of transparency in face verification systems by introducing VerLM, a Vision-Language Model that decides whether two face images depict the same person and explains its decision in either a concise or a comprehensive style. The model adapts a modeling approach originally designed for audio-based differentiation to visual inputs, which improves both accuracy and interpretability. VerLM outperforms baseline methods and existing models, contributing to more transparent and reliable face verification systems.
本文通过引入一种提供简洁和详尽解释的Vision-Language模型(VerLM),解决了面部验证系统缺乏透明度的问题。该模型通过从音频区分模型的跨模态转移训练,提高了其可解释性和准确性。VerLM在性能上超越了基线方法和现有模型,展示了更优秀的性能,并为更透明、可靠和可解释的面部验证系统做出了贡献。
OVSeg3R: Learn Open-vocabulary Instance Segmentation from 2D via 3D Reconstruction
Authors: Hongyang Li, Jinyuan Qu, Lei Zhang
First: 2025-09-28T00:41:22+00:00 · Latest: 2026-01-05T04:49:09+00:00
Abstract
In this paper, we propose a training scheme called OVSeg3R to learn open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts reconstructed scenes from 2D videos as input, avoiding costly manual adjustment while aligning input with real-world applications. By exploiting the 2D to 3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub-scene. To avoid incorrectly introduced false positives as supervision due to partial annotations from 2D to 3D, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative super-points based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from violating instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to open-vocabulary, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.
中文标题/摘要
标题:OVSeg3R: 通过3D重建从2D学习开放词汇实例分割
在本文中,我们提出了一种名为OVSeg3R的训练方案,利用3D重建辅助,从成熟的2D感知模型中学习开放词汇的3D实例分割。OVSeg3R直接采用2D视频中的重建场景作为输入,避免了昂贵的手动调整,同时使输入与实际应用对齐。通过利用3D重建模型提供的2D到3D对应关系,OVSeg3R将每个视角的2D实例掩码预测投影到3D,生成对应子场景的注释。为了避免由于2D到3D部分注释引入的错误正例作为监督,我们提出了视角实例分区算法,将预测分配给各自的视角进行监督,稳定训练过程。此外,由于3D重建模型倾向于过度平滑几何细节,基于几何学将重建点聚类为代表性超点,可能会忽略几何上不显著的对象。因此,我们引入了2D实例边界感知超点,利用2D掩码约束超点聚类,防止超点违反实例边界。通过这些设计,OVSeg3R不仅将最先进的封闭词汇3D实例分割模型扩展到开放词汇,还显著缩小了尾部和头部类别的性能差距,最终在ScanNet200基准上总体提高了+2.3 mAP。此外,在标准开放词汇设置下,OVSeg3R在新类别上的表现比之前的方法高出约+7.1 mAP,进一步验证了其有效性。
Summary / 总结
OVSeg3R is a training scheme for learning open-vocabulary 3D instance segmentation from 2D perception models with the aid of 3D reconstruction. It projects 2D instance masks onto reconstructed 3D scenes and uses a View-wise Instance Partition algorithm to avoid supervising with false positives caused by partial annotations. It also introduces a 2D Instance Boundary-aware Superpoint that prevents superpoints from crossing instance boundaries. Together, these designs yield an overall improvement of +2.3 mAP on ScanNet200 and, under the standard open-vocabulary setting, surpass previous methods by about +7.1 mAP on novel classes.
OVSeg3R 提出了一种训练方案,利用 2D 感知模型和 3D 重建来学习开放词汇的 3D 实例分割。它使用 2D 视频重建的场景作为输入,避免手动调整并符合实际应用。该方法将 2D 实例掩码投影到 3D 场景上,并引入视图实例分区算法以稳定训练过程。此外,它利用 2D 实例边界来约束超点聚类,最终在 ScanNet200 基准上整体提高了 +2.3 mAP,并在开放词汇设置下的新类别上超过了之前的方法约 +7.1 mAP。
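The step of lifting a view's 2D instance masks to 3D through 2D-to-3D correspondences can be sketched as a simple voting scheme. This is an assumption-laden illustration rather than the paper's algorithm: `pixel_to_point` stands in for the correspondences a reconstruction model would provide, background is assumed to be id 0, and only a single view is handled.

```python
from collections import Counter, defaultdict

def lift_mask_to_points(pixel_to_point, mask):
    """Assign each reconstructed 3D point the instance id most of its pixels vote for.

    pixel_to_point: dict mapping (row, col) to a 3D point index (hypothetical format).
    mask: 2D list of instance ids for one view, with 0 meaning background.
    Points seen only on background pixels receive no label.
    """
    votes = defaultdict(Counter)
    for (row, col), point_id in pixel_to_point.items():
        instance = mask[row][col]
        if instance != 0:
            votes[point_id][instance] += 1
    return {pid: counter.most_common(1)[0][0] for pid, counter in votes.items()}

mask = [[1, 1],
        [0, 2]]
pixel_to_point = {(0, 0): 10, (0, 1): 10, (1, 0): 11, (1, 1): 12}
point_labels = lift_mask_to_points(pixel_to_point, mask)
```

Because a single view only labels the points it observes, annotations produced this way are partial, which is the situation the View-wise Instance Partition described in the abstract is meant to handle.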
AdaptInfer: Adaptive Token Pruning for Vision-Language Model Inference with Dynamical Text Guidance
Authors: Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, Yunhao Liu
First: 2025-08-08T07:27:26+00:00 · Latest: 2026-01-05T04:04:30+00:00
Abstract
Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering and image captioning, but their inference cost remains a significant challenge due to the large number of vision tokens processed during the prefill stage. Existing pruning methods often rely on directly using the attention patterns or static text prompt guidance, failing to exploit the dynamic internal signals generated during inference. To address these issues, we propose AdaptInfer, a plug-and-play framework for adaptive vision token pruning in VLMs. First, we introduce a fine-grained, dynamic text-guided pruning mechanism that reuses layer-wise text-to-text attention maps to construct soft priors over text-token importance, allowing more informed scoring of vision tokens at each stage. Second, we perform an offline analysis of cross-modal attention shifts and identify consistent inflection locations in inference, which inspire us to propose a more principled and efficient pruning schedule. Our method is lightweight, plug-and-play, and generalizable across multimodal tasks. Experimental results verify the effectiveness of the proposed method. For example, it reduces CUDA latency by 61.3% while maintaining an average accuracy of 93.1% on vanilla LLaVA-1.5-7B. Under the same token budget, AdaptInfer surpasses SOTA in accuracy.
中文标题/摘要
标题:AdaptInfer:视觉-语言模型推理中的自适应令牌剪枝,带有动态文本指导
视觉-语言模型(VLMs)在视觉问答、图像字幕等多模态推理任务中取得了令人印象深刻的性能,但由于预填充阶段处理大量视觉令牌,其推理成本仍然是一个重大挑战。现有的剪枝方法通常依赖于直接使用注意力模式或静态文本提示指导,未能利用推理过程中生成的动态内部信号。为了解决这些问题,我们提出了一种名为AdaptInfer的即插即用框架,用于VLMs中的自适应视觉令牌剪枝。首先,我们引入了一种细粒度的、动态文本指导的剪枝机制,利用层间文本到文本的注意力图来构建文本令牌重要性的软先验,允许在每个阶段对视觉令牌进行更明智的评分。其次,我们进行了跨模态注意力转移的离线分析,并确定了推理过程中的一致拐点位置,这启发我们提出了一种更符合原理且高效的剪枝计划。我们的方法轻量级且即插即用,也适用于多种多模态任务。实验结果验证了该方法的有效性。例如,它将CUDA延迟降低了61.3%,同时保持了平均93.1%的vanilla LLaVA-1.5-7B准确率。在相同的令牌预算下,AdaptInfer超越了SOTA的准确率。
Summary / 总结
AdaptInfer is a framework for adaptive vision token pruning in vision-language models (VLMs) that uses dynamic text guidance to improve inference efficiency. It introduces a fine-grained pruning mechanism based on layer-wise text-to-text attention maps and identifies consistent inflection points in cross-modal attention shifts to optimize the pruning schedule. Experimental results show that AdaptInfer reduces CUDA latency by 61.3% while maintaining 93.1% accuracy on vanilla LLaVA-1.5-7B, outperforming state-of-the-art methods under the same token budget.
AdaptInfer 是一种用于视觉语言模型 (VLM) 视觉标记剪枝的框架,通过动态文本指导来提高推理效率。它引入了一种基于层间文本到文本注意力图的精细剪枝机制,并通过识别跨模态注意力转移中的一致拐点来优化剪枝计划。实验结果表明,AdaptInfer 在保持 vanilla LLaVA-1.5-7B 93.1% 准确率的同时,将 CUDA 延迟减少了 61.3%,在相同的标记预算下超越了现有最佳方法。
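The general shape of text-guided vision token pruning can be sketched as follows: weight text-to-vision attention by a soft prior over text tokens, then keep the highest-scoring vision tokens. This is a simplified illustration under stated assumptions; the scoring rule, the input layout, and the function name are hypothetical and not taken from AdaptInfer.

```python
def prune_vision_tokens(cross_attn, text_importance, keep):
    """Return the indices of the `keep` highest-scoring vision tokens, in order.

    cross_attn[t][v]: attention from text token t to vision token v.
    text_importance[t]: soft prior over text tokens, e.g. derived from
    text-to-text attention (assumed to be precomputed).
    """
    num_vision = len(cross_attn[0])
    scores = [
        sum(text_importance[t] * cross_attn[t][v] for t in range(len(cross_attn)))
        for v in range(num_vision)
    ]
    ranked = sorted(range(num_vision), key=lambda v: scores[v], reverse=True)
    return sorted(ranked[:keep])

cross_attn = [[0.1, 0.7, 0.2],   # text token 0 attends mostly to vision token 1
              [0.6, 0.1, 0.3]]   # text token 1 attends mostly to vision token 0
text_importance = [0.9, 0.1]     # text token 0 is judged far more important
kept = prune_vision_tokens(cross_attn, text_importance, keep=2)
```

In this toy case vision token 0 is dropped even though one text token attends to it strongly, because that text token carries little weight in the prior.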
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Authors: Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
First: 2025-06-26T18:00:00+00:00 · Latest: 2026-01-05T03:33:59+00:00
Abstract
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCOT) reasoning trajectories. In addition, we propose a fine-grained Direct Preference Optimization (fDPO) method that introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves relative performance gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SpatialRGPT-Bench, outperforming the strongest baseline by 9.4% in average accuracy, while maintaining competitive performance on general vision-language tasks.
中文标题/摘要
标题:精细粒度的偏好优化改善了VLM的空间推理能力
当前的视觉-语言模型(VLMs)在精细的空间推理方面存在困难,尤其是在需要多步逻辑和精确的空间对齐时。在本工作中,我们引入了SpatialReasoner-R1,这是一种旨在解决这些限制的视觉-语言推理模型。为了构建高质量的空间推理监督,我们设计了一种多模型蒙特卡洛树搜索(M3CTS)方法,该方法生成多样且逻辑一致的长链推理轨迹(LongCOT)。此外,我们提出了一种精细粒度的直接偏好优化(fDPO)方法,该方法通过空间奖励机制对候选响应进行评估,以引入段落特定的偏好粒度,指导描述性定位和逻辑推理。实验结果表明,fDPO在空间定性任务和定量任务上分别相对于标准DPO取得了4.1%和9.0%的相对性能提升。使用fDPO训练的SpatialReasoner-R1在SpatialRGPT-Bench上达到了新的最佳性能,平均准确率比最强基线高出9.4%,同时在通用视觉-语言任务上保持了竞争力。
Summary / 总结
This work addresses fine-grained spatial reasoning in Vision-Language Models (VLMs) by introducing SpatialReasoner-R1. A Multi-Model Monte Carlo Tree Search (M3CTS) method generates diverse, logically consistent reasoning trajectories for supervision, and a fine-grained Direct Preference Optimization (fDPO) method introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism. Experiments show that fDPO achieves relative gains of 4.1% and 9.0% over standard DPO on spatial qualitative and quantitative tasks, respectively. SpatialReasoner-R1 outperforms the strongest baseline by 9.4% in average accuracy on SpatialRGPT-Bench while maintaining competitive performance on general vision-language tasks.
本文通过引入SpatialReasoner-R1,使用Multi-Model Monte Carlo Tree Search (M3CTS)生成多样化的推理轨迹,并结合细粒度的Direct Preference Optimization (fDPO)方法提高描述性定位和逻辑推理能力。fDPO方法通过空间奖励机制引入了段落级别的偏好粒度。实验表明,fDPO在空间定性任务和定量任务上的相对性能分别提高了4.1%和9.0%,而SpatialReasoner-R1在SpatialRGPT-Bench上取得了新的最佳性能,平均准确率提高了9.4%,同时在通用视觉语言任务上保持了竞争力。
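For context, the standard DPO objective that fDPO is compared against can be written for a single preference pair as below. This sketch covers only standard DPO; the segment-specific granularity of fDPO is not reproduced, and the default `beta` is an arbitrary illustrative value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    Computes -log(sigmoid(beta * margin)), where the margin is how much more the
    policy prefers the chosen response than the reference model does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)    # policy matches the reference
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)   # policy favors the chosen response
```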
MergeRec: Model Merging for Data-Isolated Cross-Domain Sequential Recommendation
Authors: Hyunsoo Kim, Jaewan Moon, Seongmin Park, Jongwuk Lee
Venue: KDD 2026
First: 2026-01-05T03:14:23+00:00 · Latest: 2026-01-05T03:14:23+00:00
Comments: Accepted by KDD 2026
Abstract
Modern recommender systems trained on domain-specific data often struggle to generalize across multiple domains. Cross-domain sequential recommendation has emerged as a promising research direction to address this challenge; however, existing approaches face fundamental limitations, such as reliance on overlapping users or items across domains, or unrealistic assumptions that ignore privacy constraints. In this work, we propose a new framework, MergeRec, based on model merging under a new and realistic problem setting termed data-isolated cross-domain sequential recommendation, where raw user interaction data cannot be shared across domains. MergeRec consists of three key components: (1) merging initialization, (2) pseudo-user data construction, and (3) collaborative merging optimization. First, we initialize a merged model using training-free merging techniques. Next, we construct pseudo-user data by treating each item as a virtual sequence in each domain, enabling the synthesis of meaningful training samples without relying on real user interactions. Finally, we optimize domain-specific merging weights through a joint objective that combines a recommendation loss, which encourages the merged model to identify relevant items, and a distillation loss, which transfers collaborative filtering signals from the fine-tuned source models. Extensive experiments demonstrate that MergeRec not only preserves the strengths of the original models but also significantly enhances generalizability to unseen domains. Compared to conventional model merging methods, MergeRec consistently achieves superior performance, with average improvements of up to 17.21% in Recall@10, highlighting the potential of model merging as a scalable and effective approach for building universal recommender systems. The source code is available at https://github.com/DIALLab-SKKU/MergeRec.
中文标题/摘要
标题:MergeRec:数据隔离跨域序列推荐的模型合并
现代在特定领域数据上训练的推荐系统往往难以在多个领域之间泛化。跨域序列推荐已成为解决这一挑战的有前途的研究方向;然而,现有方法面临根本性限制,如依赖于不同领域之间的重叠用户或项目,或忽视隐私约束的不切实际假设。在本文中,我们提出了一种新的框架MergeRec,基于一种新的且现实的问题设置——数据隔离跨域序列推荐,其中原始用户交互数据不能在不同领域之间共享。MergeRec 包含三个关键组件:(1) 合并初始化,(2) 虚拟用户数据构建,(3) 协同合并优化。首先,我们使用训练无监督的合并技术初始化合并模型。接下来,我们通过将每个项目视为每个领域中的虚拟序列来构建虚拟用户数据,从而合成有意义的训练样本,而无需依赖真实用户交互。最后,我们通过结合推荐损失和蒸馏损失的联合目标来优化特定领域的合并权重,推荐损失鼓励合并模型识别相关项目,而蒸馏损失则从微调的源模型中转移协同过滤信号。广泛的实验表明,MergeRec 不仅保留了原始模型的优点,还显著增强了对未见过领域的泛化能力。与传统的模型合并方法相比,MergeRec 一致地实现了更好的性能,平均召回率@10提高了17.21%,突显了模型合并作为构建通用推荐系统可扩展且有效的方法的潜力。源代码可在 https://github.com/DIALLab-SKKU/MergeRec/ 获取。
Summary / 总结
MergeRec is a framework designed to address the challenge of cross-domain sequential recommendation by merging models trained on isolated data. It consists of three components: merging initialization, pseudo-user data construction, and collaborative merging optimization. MergeRec initializes a merged model using training-free techniques, constructs pseudo-user data by treating items as virtual sequences, and optimizes merging weights through a joint objective that combines recommendation and distillation losses. Experiments show that MergeRec outperforms conventional methods, achieving up to 17.21% improvement in Recall@10 and enhancing generalizability to unseen domains.
该研究提出了MergeRec框架,旨在解决现有方法依赖重叠用户/项目或不切实际假设的问题,以实现数据隔离的跨域序列推荐。MergeRec 包括模型初始化、伪用户数据构建和协作合并优化三个关键步骤。实验结果显示,MergeRec 在通用性方面表现出显著提升,Recall@10 的平均提升高达 17.21%,证明了模型合并作为构建通用推荐系统的一种可扩展且有效的方法的潜力。
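The basic arithmetic of merging several domain-specific models with per-model weights can be sketched as a weighted parameter average. This is only an illustration of training-free weighted merging with fixed weights: parameters are represented as flat lists in plain dicts (an assumption for self-containment), and the joint optimization of merging weights described in the abstract is not shown.

```python
def merge_models(state_dicts, weights):
    """Weighted average of parameter dicts that share the same keys and sizes."""
    total = sum(weights)
    merged = {}
    for name in state_dicts[0]:
        size = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts)) / total
            for i in range(size)
        ]
    return merged

# Two toy "domain models" with a single parameter vector each.
books = {"item_emb": [1.0, 2.0]}
movies = {"item_emb": [3.0, 6.0]}

uniform = merge_models([books, movies], [1.0, 1.0])
skewed = merge_models([books, movies], [3.0, 1.0])
```

Uniform weights correspond to plain parameter averaging; making the weights learnable per domain is the part the paper optimizes with its recommendation and distillation losses.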
Crafting Adversarial Inputs for Large Vision-Language Models Using Black-Box Optimization
Authors: Jiwei Guan, Haibo Jin, Haohan Wang
First: 2026-01-05T02:49:33+00:00 · Latest: 2026-01-05T02:49:33+00:00
Comments: EACL
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have shown groundbreaking capabilities across diverse multimodal tasks. However, these models remain vulnerable to adversarial jailbreak attacks, where adversaries craft subtle perturbations to bypass safety mechanisms and trigger harmful outputs. Existing white-box attack methods require full model accessibility, suffer from high computing costs and exhibit insufficient adversarial transferability, making them impractical for real-world, black-box settings. To address these limitations, we propose a black-box jailbreak attack on LVLMs via Zeroth-Order optimization using Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). ZO-SPSA provides three key advantages: (i) gradient-free approximation by input-output interactions without requiring model knowledge, (ii) model-agnostic optimization without the surrogate model and (iii) lower resource requirements with reduced GPU memory consumption. We evaluate ZO-SPSA on three LVLMs, including InstructBLIP, LLaVA and MiniGPT-4, achieving the highest jailbreak success rate of 83.0% on InstructBLIP, while maintaining imperceptible perturbations comparable to white-box methods. Moreover, adversarial examples generated from MiniGPT-4 exhibit strong transferability to other LVLMs, with ASR reaching 64.18%. These findings underscore the real-world feasibility of black-box jailbreaks and expose critical weaknesses in the safety mechanisms of current LVLMs.
中文标题/摘要
标题:使用黑盒优化构建针对大型视觉-语言模型的对抗输入
大型视觉-语言模型(LVLMs)在多种跨模态任务中展现了突破性的能力。然而,这些模型仍然容易受到对抗性脱管攻击的影响,攻击者通过施加微妙的扰动来绕过安全机制并触发有害输出。现有的白盒攻击方法需要完全访问模型,计算成本高且对抗性转移性不足,使其在实际的黑盒环境中不切实际。为了解决这些限制,我们提出了一种使用零阶优化和同时扰动随机近似(ZO-SPSA)对LVLMs进行黑盒脱管攻击的方法。ZO-SPSA提供了三个关键优势:(i) 无需模型知识的输入-输出交互的无梯度近似,(ii) 不依赖于代理模型的模型无关优化,(iii) 降低资源需求,减少GPU内存消耗。我们在三个LVLMs上评估了ZO-SPSA,包括InstructBLIP、LLaVA和MiniGPT-4,在InstructBLIP上实现了最高的脱管成功率83.0%,同时保持与白盒方法相当的不可感知扰动。此外,从MiniGPT-4生成的对抗性示例在其他LVLMs上表现出强大的转移性,ASR达到64.18%。这些发现强调了黑盒脱管攻击在实际环境中的可行性,并揭示了当前LVLMs安全机制中的关键弱点
Summary / 总结
This study examines the vulnerability of Large Vision-Language Models (LVLMs) to adversarial jailbreak attacks by proposing a black-box attack based on Zeroth-Order optimization with Simultaneous Perturbation Stochastic Approximation (ZO-SPSA). The method requires no model knowledge or surrogate model and has lower resource requirements than white-box attacks. Across InstructBLIP, LLaVA, and MiniGPT-4, the highest jailbreak success rate is 83.0% (on InstructBLIP), and adversarial examples generated from MiniGPT-4 transfer to other LVLMs with an attack success rate reaching 64.18%, highlighting the need for improved safety mechanisms in LVLMs.
该论文通过提出使用零阶优化与同时扰动随机近似(ZO-SPSA)方法来解决大型视觉-语言模型(LVLMs)对黑盒攻击的脆弱性问题。该方法无需模型知识,具有模型无关性,并且减少了资源消耗。实验表明,在InstructBLIP、LLaVA和MiniGPT-4上的破解成功率高达83.0%,并且生成的对抗样本在MiniGPT-4上具有较强的迁移性,这突显了黑盒攻击在现实世界中的可行性,并揭示了LVLMs安全机制中的关键弱点。
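The optimization primitive named in this entry, Simultaneous Perturbation Stochastic Approximation, is a generic gradient-free method and can be illustrated on a harmless toy objective. The sketch below minimizes a quadratic using only function evaluations; the step sizes, iteration count, and function names are assumptions, and nothing here reflects the paper's attack setup.

```python
import random

def spsa_gradient(f, x, c, rng):
    """Estimate the gradient of f at x from two function evaluations."""
    delta = [rng.choice((-1.0, 1.0)) for _ in x]          # random +/-1 directions
    x_plus = [xi + c * di for xi, di in zip(x, delta)]
    x_minus = [xi - c * di for xi, di in zip(x, delta)]
    diff = (f(x_plus) - f(x_minus)) / (2.0 * c)
    return [diff / di for di in delta]

def spsa_minimize(f, x0, steps=200, lr=0.05, c=0.01, seed=0):
    """Plain gradient descent driven by SPSA estimates (fixed step sizes)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = spsa_gradient(f, x, c, rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

def quadratic(x):
    """Toy objective with its minimum at (1, 1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)

start = [0.0, 0.0, 0.0]
result = spsa_minimize(quadratic, start)
```

Each estimate costs two evaluations of `f` regardless of the input dimension, which is why this family of methods suits settings where only input-output access is available.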
Training-Free Adaptive Quantization for Variable Rate Image Coding for Machines
Authors: Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
First: 2025-11-08T04:05:24+00:00 · Latest: 2026-01-05T02:42:50+00:00
Comments: Accepted to IEEE 44th International Conference on Consumer Electronics (ICCE 2026)
Abstract
Image Coding for Machines (ICM) has become increasingly important with the rapid integration of computer vision technology into real-world applications. However, most neural network-based ICM frameworks operate at a fixed rate, thus requiring individual training for each target bitrate. This limitation may restrict their practical usage. Existing variable rate image compression approaches mitigate this issue but often rely on additional training, which increases computational costs and complicates deployment. Moreover, variable rate control has not been thoroughly explored for ICM. To address these challenges, we propose a training-free framework for quantization strength control which enables flexible bitrate adjustment. By exploiting the scale parameter predicted by the hyperprior network, the proposed method adaptively modulates quantization step sizes across both channel and spatial dimensions. This allows the model to preserve semantically important regions while coarsely quantizing less critical areas. Our architectural design further enables continuous bitrate control through a single parameter. Experimental results demonstrate the effectiveness of our proposed method, achieving up to 11.07% BD-rate savings over the non-adaptive variable rate baseline. The code is available at https://github.com/qwert-top/AQVR-ICM.
中文标题/摘要
标题:无需训练自适应量化方法在机器图像编码中的可变比特率应用
机器图像编码(ICM)随着计算机视觉技术在实际应用中的快速集成变得越来越重要。然而,大多数基于神经网络的ICM框架以固定比特率运行,因此需要为每个目标比特率单独训练。这一限制可能限制了它们的实际应用。现有的可变比特率图像压缩方法缓解了这一问题,但通常依赖额外的训练,增加了计算成本并复杂化了部署。此外,ICM中的可变比特率控制尚未得到充分探索。为了解决这些挑战,我们提出了一种无需训练的量化强度控制框架,以实现灵活的比特率调整。通过利用超先验网络预测的尺度参数,所提出的方法在通道和空间维度上自适应地调节量化步长。这使得模型能够保留语义上重要的区域,同时粗略量化不太关键的区域。我们的架构设计还通过单一参数实现了连续的比特率控制。实验结果表明,所提出的方法的有效性,相对于非自适应的可变比特率基线,实现了高达11.07%的BD率节省。代码可在https://github.com/qwert-top/AQVR-ICM/ 获取。
Summary / 总结
The paper addresses the challenge of fixed-rate operation in neural network-based Image Coding for Machines (ICM) by proposing a training-free adaptive quantization method. This method uses the scale parameter predicted by the hyperprior network to adjust quantization step sizes, enabling flexible bitrate control without additional training. Experimental results show up to 11.07% BD-rate savings compared to non-adaptive variable rate baselines.
论文提出了一种无需训练的自适应量化方法,以解决基于神经网络的图像编码技术(ICM)固定比特率操作的限制。该方法利用超先验网络预测的尺度参数来调整量化步长,实现灵活的比特率控制。实验结果表明,该方法相比非自适应的变比特率基线,可实现高达11.07%的BD率节省。
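The idea of modulating quantization step sizes per element, with one parameter controlling the overall bitrate, can be sketched as follows. The mapping from predicted scale to step size is a hypothetical rule chosen for illustration (coarser steps where the predicted scale is small); the paper's actual rule is not given in this entry, and `q` and `alpha` are assumed names.

```python
def step_sizes(scales, q=1.0, alpha=0.5):
    """Hypothetical rule: step = q * (1 + alpha / (1 + scale)).

    q is the single global knob (larger q means coarser quantization and a lower
    bitrate); alpha sets how strongly the step depends on the predicted scale.
    """
    return [q * (1.0 + alpha / (1.0 + s)) for s in scales]

def quantize(latents, steps):
    """Uniform quantization of each latent with its own step size."""
    return [round(y / d) * d for y, d in zip(latents, steps)]

scales = [0.0, 1.0]                       # stand-ins for hyperprior scale outputs
steps = step_sizes(scales, q=1.0, alpha=1.0)
coded = quantize([2.9, 2.9], steps)
```

Because the rule only rescales step sizes at inference time, changing `q` adjusts the rate without retraining, which is the property the abstract emphasizes.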
VisualActBench: Can VLMs See and Act like a Human?
Authors: Daoan Zhang, Pai Liu, Xiaofei Zhou, Yuan Ge, Guangchen Lan, Jing Bi, Christopher Brinton, Ehsan Hoque, Jiebo Luo
First: 2025-12-10T18:36:18+00:00 · Latest: 2026-01-04T23:12:23+00:00
Abstract
Vision-Language Models (VLMs) have achieved impressive progress in perceiving and describing visual environments. However, their ability to proactively reason and act based solely on visual inputs, without explicit textual prompts, remains underexplored. We introduce a new task, Visual Action Reasoning, and propose VisualActBench, a large-scale benchmark comprising 1,074 videos and 3,733 human-annotated actions across four real-world scenarios. Each action is labeled with an Action Prioritization Level (APL) and a proactive-reactive type to assess models' human-aligned reasoning and value sensitivity. We evaluate 29 VLMs on VisualActBench and find that while frontier models like GPT4o demonstrate relatively strong performance, a significant gap remains compared to human-level reasoning, particularly in generating proactive, high-priority actions. Our results highlight limitations in current VLMs' ability to interpret complex context, anticipate outcomes, and align with human decision-making frameworks. VisualActBench establishes a comprehensive foundation for assessing and improving the real-world readiness of proactive, vision-centric AI agents.
中文标题/摘要
标题:VisualActBench:VLMs能否像人类一样观察和行动?
视觉-语言模型(VLMs)在感知和描述视觉环境方面取得了显著进展。然而,它们基于视觉输入进行主动推理和行动的能力,而无需明确的文本提示,仍处于探索阶段。我们引入了一个新的任务——视觉行动推理,并提出了一个包含1,074个视频和3,733个人类标注动作的大规模基准VisualActBench,覆盖四个真实场景。每个动作都标注了行动优先级水平(APL)和主动-反应类型,以评估模型的人类对齐推理和价值敏感性。我们在VisualActBench上评估了29个VLMs,并发现尽管前沿模型如GPT4o表现出相对较强的表现,但在生成主动、高优先级行动方面与人类水平的推理仍存在显著差距。我们的结果突显了当前VLMs在解释复杂背景、预测结果和与人类决策框架对齐方面的局限性。VisualActBench为评估和提高主动视觉中心AI代理的现实世界准备性奠定了全面的基础。
Summary / 总结
The study introduces Visual Action Reasoning as a new task and VisualActBench, a benchmark with 1,074 videos and 3,733 human-annotated actions across four real-world scenarios, to evaluate VLMs' ability to reason and act proactively based on visual inputs alone. An evaluation of 29 VLMs finds that frontier models such as GPT4o perform relatively well but still fall short of human-level reasoning, especially in generating proactive, high-priority actions. This indicates that current VLMs struggle to interpret complex context, anticipate outcomes, and align with human decision-making.
研究引入了VisualActionReasoning和VisualActBench,用于评估VLMs基于视觉输入进行主动推理和行动的能力。基准包括1,074个视频和3,733个人标注的动作,覆盖四个场景。评估29个VLMs后,研究发现虽然如GPT4o等模型表现出一定的能力,但它们在生成主动、高优先级动作方面仍远不及人类水平,这表明VLMs在理解复杂背景和预测结果方面存在不足。