SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Authors: Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin
First: 2026-05-12T17:59:58+00:00 · Latest: 2026-05-12T17:59:58+00:00
Comments: Project page: https://github.com/OpenSenseNova/SenseNova-U1
Abstract
Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
中文标题/摘要
标题:SenseNova-U1:NEO-unify架构下的多模态理解和生成统一框架
近期的大规模视觉-语言模型(VLMs)仍然受到理解与生成之间根本性二分法的限制:理解与生成被视为独立的问题,导致分段架构、级联流水线和不一致的表示空间。我们认为这种二分法不仅是工程上的产物,更是结构上的限制,阻碍了原生多模态智能的出现。因此,我们提出了SenseNova-U1,一种基于NEO-unify的原生统一多模态范式,在其中理解和生成作为单一底层过程的协同视角而演变。我们推出了两种原生统一变体,SenseNova-U1-8B-MoT和SenseNova-U1-A3B-MoT,分别基于密集(8B)和混合专家(30B-A3B)理解基线。从第一原理设计,它们在文本理解、视觉-语言感知、知识推理、代理决策和空间智能方面与顶级的仅理解VLMs相媲美。同时,它们在语义一致性和视觉保真度方面表现出色,在常规或知识密集型的任何到图像(X2I)合成、复杂图文生成和交错的视觉-语言生成方面表现出色,有或没有思考模式。除了性能,我们详细介绍了模型设计、数据预处理、预/后训练和推理策略,以支持社区研究。最后但同样重要的是,初步证据表明,我们的模型不仅限于感知和生成,还在视觉-语言-行动(VLA)和世界模型(WM)场景中表现出色。这表明了一个更广泛的路线图,即模型不仅在模态之间进行转换,而是在原生方式下思考和行动。多模态AI不再关于连接独立的系统,而是关于构建一个统一的系统,并信任必要的能力从内部涌现。
Summary / 总结
The research introduces SenseNova-U1, a native unified multimodal paradigm built on the NEO-unify architecture, which treats multimodal understanding and generation as views of a single process. It comes in two variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. The design aims to overcome the traditional dichotomy between understanding and generation, delivering semantic consistency and visual fidelity and excelling in any-to-image (X2I) synthesis and complex infographic generation. Preliminary results also show strong performance in vision-language-action (VLA) and world model (WM) scenarios, pointing toward a broader roadmap in which models think and act across modalities in a unified manner rather than translating between them.
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
Authors: Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez
First: 2026-05-12T17:53:47+00:00 · Latest: 2026-05-12T17:53:47+00:00
Comments: 12 pages, 3 figures, 6 tables
Abstract
We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
中文标题/摘要
标题:KV-Fold:一步式KV缓存递归用于长上下文推理
我们引入了KV-Fold,这是一种简单的、无需训练的长上下文推理协议,将键值(KV)缓存视为序列片段左折叠的累加器。在每一步中,模型在累积缓存的基础上处理下一个片段,添加新生成的键和值,并将扩大的缓存传递下去;这种一步更新会反复应用,类似于函数编程中的foldl。基于为潜在多智能体通信引入的KV缓存连接原语,我们将其重新用于长上下文推理的片段到片段递归。在处理片段t时,模型将来自早期片段的KV缓存作为前缀进行关注,无需修改或重新训练模型即可在段落之间重用其内部状态。尽管其简单,但诱导的递归是稳定的:每步漂移短暂上升后饱和到一个平坦的平台,该平台在深层链中持续存在。该平台对数值精度10,000倍的变化、不同片段大小和不同模型家族都具有鲁棒性。在任务层面,KV-Fold在长距离上保持了精确信息。在“针扎干草堆”基准测试中,它在Llama-3.1-8B上实现了152次试验中100%的精确匹配检索,覆盖从16K到128K的上下文和链深最多511,同时保持在单个40GB GPU的内存限制内。与牺牲精度以换取有限内存的流式方法相比,KV-Fold在保持长距离检索的同时,作为一系列可处理的前向传递操作。总体而言,我们的结果表明,冻结的预训练变换器已经支持了一种稳定的KV缓存递归形式,为长上下文推理提供了一条无需架构更改或训练的实用途径。
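The left-fold structure described in the KV-Fold abstract can be sketched as follows. This is a toy illustration, not the authors' implementation: the "cache" is a tuple of `(position, token)` pairs standing in for real keys and values, and `fold_step`, `split_chunks`, and `kv_fold` are hypothetical names introduced here. In this toy the result is identical for every chunk size by construction; with a real transformer the abstract reports a small, saturating per-step drift instead.

```python
from functools import reduce
from typing import List, Sequence, Tuple

# Toy stand-in for a KV cache: one (position, token) entry per processed token.
KVCache = Tuple[Tuple[int, int], ...]


def fold_step(cache: KVCache, chunk: Sequence[int]) -> KVCache:
    """One-step update: process `chunk` with `cache` as prefix, then append."""
    start = len(cache)  # the carried cache acts as the prefix for this chunk
    new_entries = tuple((start + i, tok) for i, tok in enumerate(chunk))
    return cache + new_entries  # pass the enlarged cache forward


def split_chunks(tokens: Sequence[int], chunk_size: int) -> List[Sequence[int]]:
    """Cut the sequence into consecutive chunks of at most `chunk_size` tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def kv_fold(tokens: Sequence[int], chunk_size: int) -> KVCache:
    """Left fold over chunks with the cache as accumulator (foldl-style)."""
    return reduce(fold_step, split_chunks(tokens, chunk_size), ())
```

Calling `kv_fold(tokens, chunk_size)` applies the same `fold_step` once per chunk, so the chain depth equals the number of chunks.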
Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection
Authors: Shreen Gul, Mohamed Elmahallawy, Ardhendu Tripathy, Sanjay Madria
First: 2026-03-24T19:32:13+00:00 · Latest: 2026-05-12T17:52:40+00:00
Abstract
Deep learning models are increasingly deployed in safety-critical applications, where reliable out-of-distribution (OOD) detection is essential to ensure robustness. Existing methods predominantly rely on the penultimate-layer activations of neural networks, assuming they encapsulate the most informative in-distribution (ID) representations. In this work, we revisit this assumption to show that intermediate layers encode equally rich and discriminative information for OOD detection. Based on this observation, we propose a simple yet effective model-agnostic approach that leverages internal representations across multiple layers. Our scheme aggregates features from successive convolutional blocks, computes class-wise mean embeddings, and applies L_2 normalization to form compact ID prototypes capturing class semantics. During inference, cosine similarity between test features and these prototypes serves as an OOD score--ID samples exhibit strong affinity to at least one prototype, whereas OOD samples remain uniformly distant. Extensive experiments on state-of-the-art OOD benchmarks across diverse architectures demonstrate that our approach delivers robust, architecture-agnostic performance and strong generalization for image classification. Notably, it improves AUROC by up to 4.41% and reduces FPR by 13.58%, highlighting multi-layer feature aggregation as a powerful yet underexplored signal for OOD detection, challenging the dominance of penultimate-layer-based methods. Our code is available at: https://github.com/sgchr273/cosine-layers.git.
中文标题/摘要
标题:原型融合:一种无需训练的多层OOD检测方法
深度学习模型在安全关键应用中越来越广泛部署,可靠的离分布(OOD)检测对于确保其鲁棒性至关重要。现有方法主要依赖于神经网络的倒数第二层激活,假设它们包含了最具有信息量的在分布(ID)表示。在本文中,我们重新审视了这一假设,表明中间层同样编码了丰富的和区分性的信息用于OOD检测。基于这一观察,我们提出了一种简单而有效的模型无关方法,利用多层内部表示。我们的方案从连续的卷积块中聚合特征,计算类别级均值嵌入,并应用L_2归一化形成紧凑的ID原型,捕捉类别语义。在推理过程中,测试特征与这些原型之间的余弦相似度作为OOD分数——ID样本对至少一个原型表现出强烈的亲和力,而OOD样本则保持均匀的距离。在多种架构的先进OOD基准上的广泛实验表明,我们的方法提供了鲁棒且架构无关的性能,并且在图像分类中具有强大的泛化能力。值得注意的是,它将AUROC提高了最多4.41%,并将FPR降低了13.58%,突显了多层特征聚合作为强大的但尚未充分探索的OOD检测信号的重要性,挑战了基于倒数第二层的方法的主导地位。我们的代码可在:https://github.com/sgchr273/cosine-layers.git 获取。
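The prototype scoring described in the Prototype Fusion abstract can be illustrated with a minimal sketch. It assumes that multi-layer features have already been pooled and concatenated into one vector per sample (the aggregation itself is not specified in the abstract), and the function names `l2_normalize`, `build_prototypes`, and `ood_score` are hypothetical. Higher scores indicate a sample that looks more in-distribution.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def build_prototypes(features: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> Dict[int, List[float]]:
    """Class-wise mean embedding followed by L2 normalization."""
    grouped = defaultdict(list)
    for feat, label in zip(features, labels):
        grouped[label].append(feat)
    prototypes = {}
    for label, feats in grouped.items():
        mean = [sum(col) / len(feats) for col in zip(*feats)]
        prototypes[label] = l2_normalize(mean)
    return prototypes


def ood_score(feature: Sequence[float],
              prototypes: Dict[int, List[float]]) -> float:
    """Maximum cosine similarity between the test feature and any prototype."""
    unit = l2_normalize(feature)
    return max(sum(a * b for a, b in zip(unit, proto))
               for proto in prototypes.values())
```

A sample would then be flagged as OOD when `ood_score` falls below a threshold chosen on held-out in-distribution data; that thresholding rule is an assumption of this sketch.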
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
Authors: Huiran Duan, Qian Zhou, Zhongliang Guo, Junhao Dong, Yuqi Li, Guoying Zhao, Yingli Tian
First: 2026-05-12T17:27:56+00:00 · Latest: 2026-05-12T17:27:56+00:00
Comments: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)
Abstract
Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.
中文标题/摘要
标题:GaitProtector:基于模仿驱动的无训练扩散潜空间优化步态去标识化
传统的步态去标识化方法通常会遇到一个固有的权衡:它们要么提供不足的身份抑制,要么引入时空失真,阻碍结构敏感的下游应用。我们提出了一种基于模仿驱动的步态去标识化框架,将隐私保护统一为两个紧密耦合组件的目标:(i)混淆,使受保护的步态远离源身份;(ii)模仿,使其接近选定的目标身份。目标身份作为语义锚点,偏向于在预训练的扩散先验下优化结构上合理的步态模式,有助于保持主导的身体形状和运动动态。我们通过一个无训练的扩散潜空间优化流水线实现这一理念。我们不是为每个数据集重新训练生成器,而是将每个输入轮廓序列反转为预训练的3D视频扩散模型的潜在轨迹,并通过可微对抗目标迭代优化潜在代码以合成受保护的步态。在CASIA-B数据集上的实验表明,GaitProtector在黑盒步态识别下的模仿成功率达到了56.7%,将Rank-1识别准确率从89.6%降低到15.0%,同时保持了良好的视觉和时间质量。我们还在Scoliosis1K数据集上进一步评估了下游应用,诊断准确率仅从91.4%下降到74.2%。据我们所知,这是首次以无训练的方式利用预训练的3D扩散先验进行基于轮廓的步态去标识化。
Summary / 总结
GaitProtector is an impersonation-driven gait de-identification framework that combines obfuscation and impersonation to protect gait privacy while preserving structural plausibility. It uses a training-free diffusion latent optimization pipeline to synthesize protected gaits by inverting input silhouettes into a pretrained 3D video diffusion model and iteratively optimizing latent codes. On the CASIA-B dataset, GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining good visual and temporal quality. On the Scoliosis1K dataset, downstream diagnostic accuracy decreases from 91.4% to 74.2%. According to the authors, this work is the first to use pretrained 3D diffusion priors for silhouette-based gait de-identification in a training-free manner.
GaitProtector 是一个结合了混淆和模仿的隐私保护框架,旨在保护步态隐私同时保持结构合理性。它使用一个训练免费的扩散潜变量优化管道,通过将输入轮廓反转到预训练的 3D 视频扩散模型中并迭代优化潜变量代码来合成保护步态。实验结果显示,GaitProtector 在 CASIA-B 数据集上的模仿成功率达到了 56.7%,并将 Rank-1 识别准确率降低到 15.0%,同时保持了良好的视觉和时间质量。它还保留了在 Scoliosis1K 数据集上的诊断准确性。这是首次使用预训练的 3D 扩散先验进行轮廓基步态去标识化的工作。
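The two coupled terms of the GaitProtector objective (repel from the source identity, attract toward the target identity) can be written as a small sketch. The abstract does not give the exact loss, so the cosine-similarity form, the `weight` parameter, and the names `cosine` and `deid_objective` are assumptions made purely for illustration; in the paper the objective is optimized over diffusion latents, which this sketch does not model.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deid_objective(protected: Sequence[float],
                   source: Sequence[float],
                   target: Sequence[float],
                   weight: float = 1.0) -> float:
    """Hypothetical loss to minimize: lower means better protection."""
    obfuscation = cosine(protected, source)      # push away from the source identity
    impersonation = -cosine(protected, target)   # pull toward the target identity
    return obfuscation + weight * impersonation
```

With orthogonal source and target embeddings, a protected embedding equal to the target yields the minimum value of -1.0, while one equal to the source yields 1.0.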
Reinforcing VLAs in Task-Agnostic World Models
Authors: Yucen Wang, Rui Yu, Fengming Zhang, Junjie Lu, Xinyao Qin, Tianxiang Zhang, Kaixin Wang, Li Zhao
First: 2026-05-12T16:16:15+00:00 · Latest: 2026-05-12T16:16:15+00:00
Abstract
Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.
Summary / 总结
The research aims to improve the adaptability of Vision-Language-Action (VLA) models by using reinforcement learning in learned world models, reducing the need for task-specific data. RAW-Dream, a new approach, pre-trains a world model on diverse task-free behaviors and uses an off-the-shelf Vision-Language Model for reward generation, enabling zero-shot inference for new tasks. Experiments show consistent performance gains, indicating that generalized physical priors can effectively replace task-dependent data, enhancing scalability.
研究旨在通过在学习的世界模型中使用强化学习来提高Vision-Language-Action (VLA)模型的适应性,减少对任务特定数据的依赖。RAW-Dream是一种新方法,它在多样化的任务无关行为上预训练世界模型,并使用现成的Vision-Language模型生成奖励,从而实现对新任务的零样本推理。实验表明,一致的性能提升表明,泛化的物理先验可以有效地替代任务特定的数据,增强可扩展性。
Towards Automated Air Traffic Safety Assessment Around Non-Towered Airports Using Large Language Models
Authors: Torsten Darrell, Mahyar Ghazanfari, Jordan Kam, Alexandre Bayen, Amin Tabrizian, Peng Wei
First: 2026-05-12T16:15:15+00:00 · Latest: 2026-05-12T16:15:15+00:00
Comments: 25 pages, 17 figures, 5 tables, Accepted to AIAA 2026
Abstract
We investigate frameworks for post-flight safety analysis at non-towered airports using large language models (LLMs). Non-towered airports rely on the Common Traffic Advisory Frequency (CTAF) for air traffic coordination and experience frequent near mid-air collisions due to the pilot self-announcement communication protocol. We propose a general vision-language model (VLM) approach to analyze the transcribed CTAF radio communications in natural language, METeorological Aerodrome Report (METAR) weather data, Automatic Dependent Surveillance-Broadcast (ADS-B) flight trajectories, and Visual Flight Rules sectional charts of the airfield. We provide a preliminary study at Half Moon Bay Airport, with a qualitative real-world case study and a quantitative evaluation using a new synthetic dataset of communications and weather modalities. We qualitatively evaluate our framework on real flight data using Gemini 2.5 Pro, demonstrating accurate identification of a right-of-way violation. The synthetic dataset is derived from real examples, includes a 12-category hazard taxonomy, and is used to benchmark three open-source (Qwen 2.5-7B, Mistral-7B, Gemma-2-9B) and three closed-source (GPT-4o, GPT-5.4, Claude Sonnet 4.6) LLMs on the subset of inputs related to CTAF and METAR. Even limited to CTAF and METAR inputs and open-source LLMs, instances of our framework typically achieve a macro F1 score above 0.85 on a binary nominal/danger classification task. Future work includes a quantitative evaluation across all modalities and a larger number of real-world examples. Taken together, our results suggest that VLM analysis of safety at non-towered airports may be a valuable future capability.
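For readers unfamiliar with the metric, the macro F1 score on the binary nominal/danger task is the unweighted mean of the per-class F1 scores. The sketch below shows the standard computation; the label strings and the function names `f1_for_class` and `macro_f1` are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> float:
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             classes: Sequence[str] = ("nominal", "danger")) -> float:
    """Unweighted mean of per-class F1, so the rarer class counts equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
```

Because each class contributes equally, a classifier that ignores the rarer "danger" class is penalized even when its overall accuracy is high.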
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
Authors: Hao Zhu, Shuo Jin, Wenbin Liao, Jiayu Xiao, Yan Zhu, Siyue Yu, Feng Dai
First: 2026-05-12T16:08:18+00:00 · Latest: 2026-05-12T16:08:18+00:00
Comments: Accepted by ICML2026
Abstract
Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce VIsual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: (1) surpasses the top-leading methods by 1.4% to 8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires marginal inference time and memory overhead. Our code is publicly available on GitHub: https://github.com/MiSsU-HH/VIP.
中文标题/摘要
标题:VIP:视觉引导的提示进化以实现高效的密集视觉语言推理
由于CLIP根深蒂固的空间偏见,追求无需训练的开放词汇语义分割在高效和泛化方面仍然具有挑战性。为克服现有解决方案的局限性,这项工作超越了基于CLIP的范式,利用最近的空间感知dino.txt框架来促进更高效的高质量密集预测。尽管dino.txt表现出强大的空间感知能力,但我们发现文本查询的语义模糊性在其密集跨模态交互中引发了严重的不匹配。为了解决这个问题,我们引入了视觉引导的提示进化(VIP)来纠正dino.txt中文本查询的语义表达性,释放其对细粒度对象感知的潜力。为此,VIP结合了别名扩展和视觉引导的蒸馏机制来挖掘有价值的语义线索,这些线索以注意感知的方式稳健聚合,以产生高保真预测。广泛的评估表明:VIP:①超越了顶级方法的平均mIoU为1.4%至8.4%;②在多种具有挑战性的领域中表现出良好的泛化能力;③所需的推理时间和内存开销微乎其微。我们的代码已公开发布在GitHub上:https://github.com/MiSsU-HH/VIP。
Summary / 总结
This work addresses the challenge of training-free open-vocabulary semantic segmentation by introducing VIP, which integrates alias expansion and visual-guided distillation to enhance the semantic expressiveness of text queries in the dino.txt framework. Experiments show that VIP outperforms leading methods by 1.4% to 8.4% in average mIoU, generalizes well across various domains, and incurs minimal inference time and memory overhead.
该研究通过将别名扩展与视觉引导蒸馏集成到dino.txt框架中,增强文本查询的语义表达性,以解决无训练开放词汇语义分割的挑战。实验表明,VIP在平均mIoU上优于领先方法1.4%到8.4%,在各种领域中表现出良好的泛化能力,并且对推理时间和内存占用影响较小。
CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
First: 2026-04-24T06:34:45+00:00 · Latest: 2026-05-12T16:00:28+00:00
Comments: This manuscript has been withdrawn by the authors because we found a methodological flaw in the formulation and evaluation of the proposed approach. The issue affects the reliability of the experimental results and the conclusions drawn from them. Therefore, the authors consider the current version unsuitable for citation or further use
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
中文标题/摘要
标题:CAGE-SGG:反事实主动图证据支持的开放词汇场景图生成
开放词汇场景图生成(SGG)旨在使用灵活和精细的关联短语来描述视觉场景,超越固定谓词词汇。虽然最近的视觉-语言模型大大扩展了SGG的语义覆盖范围,但也引入了一个关键的可靠性问题:预测的关联可能是由语言先验或对象共现驱动的,而不是基于视觉证据。本文提出了一种基于反事实关系验证的证据全面的开放词汇SGG框架。我们的方法不直接接受可能的关系提案,而是验证每个候选关系是否得到了特定关系的视觉、几何和上下文证据的支持。具体来说,我们首先使用视觉-语言提案生成开放词汇关系候选,然后将谓词短语分解为支持、接触、包含、深度和状态等软证据基础。关系条件下的证据编码器提取与谓词相关的线索,而反事实验证器测试在移除必要证据时关系得分是否降低,并在无关扰动下是否保持稳定。我们进一步引入了矛盾感知谓词学习和图级偏好优化,以提高细粒度的区分能力和全局图的一致性。在传统、开放词汇和泛光SGG基准上的实验表明,我们的方法在标准召回度量、未见过的谓词泛化和反事实定位质量方面都表现出一致的改进。这些结果表明,从关系生成转向关系验证可以产生更可靠、可解释和基于证据的场景图。
OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices
Authors: Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei
First: 2025-10-28T07:38:15+00:00 · Latest: 2026-05-12T16:00:09+00:00
Abstract
Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real-world industrial settings.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
Authors: Junxian Li, Kai Liu, Zizhong Ding, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang
First: 2026-05-12T15:56:22+00:00 · Latest: 2026-05-12T15:56:22+00:00
Comments: Code is at: https://github.com/lijunxian111/G2TR
Abstract
The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on signals such as attention scores and text-image similarity, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with the VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks.
Summary / 总结
This paper addresses the issue of high inference cost in separate-encoder Unified Multimodal Models (UMMs) due to dense visual token processing. It introduces G$^2$TR, a generation-guided visual token reduction framework. G$^2$TR identifies understanding-side visual tokens based on their consistency with VAE latent and merges redundant tokens to reduce information loss. Experiments show that G$^2$TR reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming existing methods on various benchmarks.
本文针对单独编码器统一多模态模型(UMMs)因密集视觉标记处理而导致的高推理成本问题,提出了生成导向的视觉标记减少框架G$^2$TR。G$^2$TR根据视觉标记与VAE潜在空间的一致性来识别理解侧的视觉标记,并合并冗余标记以减少信息损失。实验表明,G$^2$TR将视觉标记和预填充计算减少了1.94倍,同时保持了推理准确性和编辑质量,并在各种基准测试中优于现有方法。
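The select-then-merge step described in the G$^2$TR abstract can be sketched roughly as below. This is a simplified illustration under several assumptions: importance scores are taken as given (the paper derives them from consistency with the VAE latent), the balanced selection is replaced by a plain top-k, and redundant tokens are merged by averaging into their most similar retained token. The names `cosine` and `reduce_tokens` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two token vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reduce_tokens(tokens: Sequence[Sequence[float]],
                  scores: Sequence[float],
                  keep: int) -> List[List[float]]:
    """Keep the `keep` highest-scoring tokens and merge the rest into them."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])  # preserve the original token order
    groups = {i: [tokens[i]] for i in kept}
    for j in order[keep:]:
        # Assign each dropped token to its most similar retained representative.
        nearest = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[nearest].append(tokens[j])
    # Each retained token becomes the mean of itself and its merged neighbours.
    return [[sum(col) / len(groups[i]) for col in zip(*groups[i])] for i in kept]
```

Merging instead of discarding is what lets the reduced sequence retain some information from dropped tokens, which matches the motivation given in the abstract.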
Large-Small Model Collaboration for Farmland Semantic Change Detection
Authors: Xinjia Li, Rui Wang, Qiurong Peng, Lingfei Ye, Dengrong Zhang, Haoyu Zhang
First: 2026-05-12T15:40:19+00:00 · Latest: 2026-05-12T15:40:19+00:00
Abstract
Farmland Semantic Change Detection (SCD) is essential for cultivated land protection, yet existing benchmarks and models remain insufficient for fine-grained farmland conversion monitoring. Current datasets often lack dedicated "from-to" annotations, while visual change detection models are easily disturbed by phenology-induced pseudo-changes caused by crop rotation, seasonal variation, and illumination differences. To address these challenges, we construct HZNU-FCD, a large-scale fine-grained farmland SCD benchmark with a unified five-class farmland-to-non-farmland annotation protocol. It contains 4,588 bitemporal image pairs with pixel-level labels for practical farmland protection. Based on this benchmark, we propose a large-small collaborative SCD framework that integrates a task-driven small visual model with a frozen large vision-language model. The small model, Fine-grained Difference-aware Mamba (FD-Mamba), learns dense change representations for boundary preservation and small-region localization. The large-model pathway, Cross-modal Logical Arbitration (CMLA), introduces CLIP-based textual priors for prompt-guided semantic arbitration and pseudo-change suppression. To enable effective collaboration, we design a hard-region co-training strategy that supervises the CMLA semantic score map only on low-confidence pixels. Experiments show that our method achieves 97.63% F1, 96.32% IoU, and 96.35% SCD_IoU_mean on HZNU-FCD with only 6.65M trainable parameters. Compared with the multimodal ChangeCLIP-ViT, which leverages vision-language information for change detection, our method improves F1 by 10.19 percentage points on HZNU-FCD. It also achieves 91.43% F1 and 84.21% IoU on LEVIR-CD, and 93.85% F1 and 88.41% IoU on WHU-CD, demonstrating strong robustness and generalization. The code is available at https://github.com/Lovelymili/FD-Mamba.
Summary / 总结
The research aims to improve farmland semantic change detection for cultivated land protection by addressing the limitations of existing datasets and models. It introduces HZNU-FCD, a large-scale benchmark with unified five-class annotations, and proposes a large-small collaborative framework integrating a fine-grained difference-aware small model (FD-Mamba) and a large vision-language model (CMLA) with CLIP-based textual priors. Experiments show that this method outperforms existing models on HZNU-FCD and other datasets, achieving high F1 and IoU scores with fewer parameters.
研究旨在通过解决现有数据集和模型的局限性,提高农田语义变化检测,以保护耕地。提出了HZNU-FCD,一个具有统一五类标注的大规模基准,并提出了一种大型-小型协作框架,结合了细粒度差异感知的小模型(FD-Mamba)和带有CLIP文本先验的大视觉语言模型(CMLA)。实验表明,该方法在HZNU-FCD和其他数据集上优于现有模型,实现了高F1和IoU分数,参数量较少。
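The hard-region co-training idea, supervising the CMLA score map only on low-confidence pixels, can be illustrated with a masked loss. The sketch assumes that confidence is measured as `max(p, 1 - p)` on the small model's change probability, that the threshold is 0.8, and that the loss is binary cross-entropy; none of these specifics appear in the abstract, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def hard_region_mask(small_probs: Sequence[float],
                     threshold: float = 0.8) -> List[bool]:
    """Mark pixels where the small model's confidence is below `threshold`."""
    return [max(p, 1.0 - p) < threshold for p in small_probs]


def masked_bce(large_probs: Sequence[float],
               targets: Sequence[int],
               mask: Sequence[bool],
               eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over masked (hard) pixels only."""
    total, count = 0.0, 0
    for prob, target, is_hard in zip(large_probs, targets, mask):
        if not is_hard:
            continue  # confident pixels do not supervise the large-model pathway
        prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= target * math.log(prob) + (1 - target) * math.log(1.0 - prob)
        count += 1
    return total / count if count else 0.0
```

Restricting supervision to hard pixels focuses the large-model pathway on the ambiguous regions where the small model is most likely to confuse pseudo-changes with real ones.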
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
Authors: Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, Raymond H. Chan
First: 2026-05-12T15:35:34+00:00 · Latest: 2026-05-12T15:35:34+00:00
Comments: Project Page: https://yaofang-liu.github.io/V2V_Web
Abstract
Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space.
On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.
Summary / 总结
The research aims to address the limitations of text-based visual generation by proposing a visual-to-visual (V2V) generation paradigm, where users provide visual specifications instead of text prompts. V2V-Zero, a training-free framework, conditions existing VLM-conditioned generators on final-layer hidden states extracted from visual pages, reaching 0.85 on GenEval, close to the backbone's optimized text-to-image performance. Simple-V2V Bench evaluates V2V across seven tasks and seven models, showing that attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems.
研究旨在通过提出视觉到视觉(V2V)生成范式来解决基于文本的视觉生成的局限性,用户可以提供视觉规范而非文本提示。V2V-Zero 是一个无需训练的框架,利用视觉页面的最终层隐藏状态来条件化现有的视觉语言模型,在GenEval上接近优化的文本到图像模型的性能。Simple-V2V Bench 在七个任务和七个模型上评估了V2V,结果显示属性绑定较强,但内容生成和结构控制仍然具有挑战性,即使是商用系统也是如此。
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
Authors: Shuo Ni, Tong Wang, Jing Zhang, He Chen, Haonan Guo, Ning Zhang, Bo Du
First: 2026-05-12T15:07:30+00:00 · Latest: 2026-05-12T15:07:30+00:00
Abstract
Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.
Summary / 总结
The research addresses the challenge of VLMs' vulnerability to scale mismatch in ultra-high-resolution Earth observation imagery, introducing UHR-Micro, a benchmark with 11,253 instructions on 1,212 images, to evaluate VLMs at micro-scale. Experiments show significant failures in spatial grounding and evidence parsing despite high-resolution inputs. The study proposes Micro-evidence Active Perception (MAP) to actively seek and use task-relevant micro-evidence, improving micro-level perception. UHR-Micro and MAP-Agent together provide a diagnostic platform for high-resolution reasoning in Earth observation VLMs.
研究针对VLMs在超高清地球观测图像中面临的尺度不匹配问题,引入了包含11,253个指令的UHR-Micro基准,评估VLMs在微尺度的表现。实验显示,即使有高分辨率输入,VLMs在空间定位和证据解析方面仍存在显著失败。研究提出了Micro-evidence Active Perception (MAP) 方法,主动寻找并利用相关微证据,提升微尺度感知。UHR-Micro和MAP-Agent共同提供了一个诊断平台,用于评估、理解和推进地球观测VLMs中的高分辨率推理。
SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks
Authors: Zi-Yang Bo, Wei Lu, Hongruixuan Chen, Si-Bao Chen, Bin Luo
First: 2026-04-28T09:38:02+00:00 · Latest: 2026-05-12T15:06:36+00:00
Comments: Accepted by ISPRS
Abstract
Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow-Aware and Removal Unified (SARU) Framework, a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N$^2$SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments on the AISD and RSISD datasets demonstrate that SARU achieves SOTA shadow detection performance. For shadow removal, our training-free N$^2$SGSR algorithm attains an average processing speed of approximately $1.3$s, which is over $10$ times faster than the SOTA MAOSD, while maintaining an SRI value close to 0.9 on both the AISD and SiSRB datasets, a level comparable to the advanced RS-GSSR method. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The code and datasets are publicly available at: https://github.com/AeroVILab-AHU/SARU
中文标题/摘要
标题:SARU:一种统一的阴影感知与去除框架及其新的基准
阴影是遥感图像(RSI)中常见的问题,会降低视觉质量并严重限制诸如目标检测和语义分割等下游任务的性能。大多数先前的工作将阴影检测和去除视为分离的、级联的任务,这可能导致繁琐的过程和误差累积。此外,许多深度学习方法依赖于配对的阴影和非阴影图像进行训练,而在实践中这些图像往往不可用。为了解决这些挑战,我们提出了阴影感知与去除统一(SARU)框架,这是一种综合的两阶段框架。首先,其双分支检测模块(DBCSF-Net)融合多色彩空间和语义特征以生成高保真的阴影掩码,有效地区分阴影和暗物体。然后,利用这些掩码,提出了一种新的无需训练的物理算法(N²SGSR),通过在单张输入图像内转移相邻非阴影区域的属性来恢复光照。为了促进严格的评估并促进未来的工作,我们还引入了两个新的基准数据集:RSI阴影检测(RSISD)数据集和单图像阴影去除基准(SiSRB)。在AISD和RSISD数据集上的广泛实验表明,SARU实现了最先进的阴影检测性能。对于阴影去除,我们的无需训练的N²SGSR算法的平均处理速度约为1.3秒,比最先进的MAOSD快约10倍,同时在AISD和SiSRB数据集上的SRI值接近0.9,与先进的RS-GSSR方法相当。通过整体整合阴影检测和去除以减轻误差传播并消除对配对训练数据的依赖,SARU为实际应用中的RSI分析建立了稳健的实用框架。代码和数据集可在:https://github.com/AeroVILab-AHU/SARU公开获取。
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
First: 2025-01-06T11:57:38+00:00 · Latest: 2026-05-12T15:02:48+00:00
Comments: 20 pages
Abstract
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .
Summary / 总结
MotionBench is a benchmark designed to evaluate the fine-grained motion comprehension of vision language models (VLMs). It includes six categories of motion-oriented question types and diverse video content. Experimental results show that current VLMs struggle with fine-grained motion understanding. To improve this, the authors propose a Through-Encoder (TE) Fusion method; experiments show that both higher frame rate inputs and TE Fusion improve motion understanding. However, there is still significant room for improvement.
MotionBench 是一个用于评估视觉语言模型(VLMs)细粒度运动理解能力的基准。它包含六类运动导向的问题类型和多样化的视频内容。实验结果显示,当前的 VLMs 在细粒度运动理解方面表现不佳。为了改进这一点,作者提出了一种 Through-Encoder (TE) 融合方法,该方法通过使用更高帧率的输入来增强运动理解。然而,仍有很大的改进空间。
LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer
Authors: Song Fei, Tian Ye, Lujia Wang, Lei Zhu
First: 2025-09-26T14:39:08+00:00 · Latest: 2026-05-12T14:25:09+00:00
Comments: Project Page: https://w2genai-lab.github.io/LucidFlux
Abstract
Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. Our LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, our LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free image restoration in the wild.
中文标题/摘要
标题:LucidFlux:无需描述的高保真图像恢复大型扩散变换器
图像恢复(IR)旨在恢复被未知混合物降级的图像,同时保留语义。在某些条件下,判别恢复器和基于UNet的扩散先验往往会过度平滑、虚构或漂移。我们提出了LucidFlux,这是一种无需描述的IR框架,它适应了一个大型扩散变换器(Flux.1),而无需使用图像描述。我们的LucidFlux引入了一个轻量级的双分支条件器,分别从降级输入和轻度恢复的代理中注入信号,以分别锚定几何结构和抑制伪影。然后,设计了一种时间步长和层自适应调制调度,以在骨干网络层次结构中路由这些线索,从而实现从粗到细和上下文感知的更新,以保护全局结构并恢复纹理。之后,为了避免文本提示或视觉语言模型(VLM)描述的延迟和不稳定,我们通过从代理中提取的SigLIP特征强制执行无描述的语义对齐。一个可扩展的策展管道进一步筛选大规模数据以提供结构丰富的监督。在合成和野外基准测试中,我们的LucidFlux始终优于强大的开源和商用基线,消融研究验证了每个组件的必要性。LucidFlux表明,对于大型DiTs,何时、何地以及如何进行条件控制,而不是增加参数或依赖于文本提示,是野外稳健且无需描述的图像恢复的关键杠杆。
Summary / 总结
LucidFlux is a caption-free image restoration framework that uses a large-scale diffusion transformer to restore images degraded by unknown factors while preserving semantics. It introduces a lightweight dual-branch conditioner and a timestep- and layer-adaptive modulation schedule to protect global structure and recover texture. LucidFlux avoids the use of text prompts by enforcing semantic alignment via SigLIP features and uses a scalable curation pipeline for supervision. Experiments show that LucidFlux outperforms existing open-source and commercial baselines across synthetic and real-world benchmarks, and ablation studies confirm the necessity of each component.
LucidFlux 是一个无需文字描述的图像恢复框架,利用大型扩散变换器恢复被未知因素破坏的图像,同时保留语义。它引入了轻量级的双分支条件器和时间步长和层自适应调制计划,以保护全局结构并恢复纹理。LucidFlux 避免使用文字提示或视觉语言模型的描述,而是使用代理的 SigLIP 特征进行语义对齐。实验表明,LucidFlux 在合成和真实世界基准测试中均优于开源和商用基线,并且消融研究证实了每个组件的必要性。
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
Authors: Jueon Park, Wonjune Jang, Jiwoo Lee, Yein Park, Jaewoo Kang
First: 2026-05-12T14:24:52+00:00 · Latest: 2026-05-12T14:24:52+00:00
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) have recently shown promising capabilities in various scientific domains. In particular, these advances have opened new opportunities in drug discovery, where the ability to understand and modify molecular structures is critical for optimizing drug properties such as efficacy and toxicity. However, existing models and benchmarks often overlook toxicity-related challenges, focusing primarily on general property optimization without adequately addressing safety concerns. In addition, even existing toxicity repair benchmarks suffer from limited data diversity, low structural validity of generated molecules, and heavy reliance on proxy models for toxicity assessment. To address these limitations, we propose MolDeTox, a novel benchmark for molecular detoxification, designed to enable fine-grained and reliable evaluation of toxicity-aware molecular optimization across stepwise tasks. We evaluate a wide range of general-purpose LLMs and VLMs under diverse settings, and demonstrate that understanding and generating molecules at the fragment level improves structural validity and enhances the quality of generated molecules. Moreover, through detailed task-level performance analysis, MolDeTox provides an interpretable benchmark that enables a deeper understanding of the detoxification process. Our dataset is available at: https://huggingface.co/datasets/MolDeTox/MolDeTox
中文标题/摘要
标题:MolDeTox:评估语言模型逐步片段编辑在分子解毒中的能力
大型语言模型(LLMs)和视觉语言模型(VLMs)在各个科学领域中已经显示出有希望的能力。特别是在药物发现领域,理解并修改分子结构的能力对于优化药物的效力和毒性至关重要。然而,现有的模型和基准往往忽视了毒性相关的问题,主要集中在一般性质的优化上,而没有充分解决安全问题。此外,现有的毒性修复基准数据多样性有限,生成的分子结构有效性低,并且严重依赖代理模型进行毒性评估。为了解决这些局限性,我们提出了MolDeTox,一种新的分子解毒基准,旨在使毒性意识的分子优化在逐步任务中实现精细和可靠的评估。我们在多种设置下评估了广泛的通用LLMs和VLMs,并证明在片段级理解和生成分子可以提高结构的有效性并增强生成分子的质量。此外,通过详细的任务级性能分析,MolDeTox提供了一个可解释的基准,有助于更深入地理解解毒过程。我们的数据集可在:https://huggingface.co/datasets/MolDeTox/MolDeTox 获取
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
Authors: Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha
First: 2026-05-12T14:13:08+00:00 · Latest: 2026-05-12T14:13:08+00:00
Comments: 17 pages, 6 figures
Abstract
In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.
中文标题/摘要
标题:自我一致潜在推理:视觉语言模型中的长潜在序列推理
在语言推理中,更长的思维链路始终能获得更好的性能,这自然表明视觉潜在推理也可能从更长的潜在序列中受益。然而,我们发现一个反直觉的现象:现有的潜在视觉推理方法在潜在序列变长时系统地表现下降。我们揭示了根本原因:信息增益崩溃——自回归生成使得每一步高度依赖于先前的输出,因此后续的标记几乎无法引入新的信息。我们进一步发现,高度池化(≥128倍)的图像嵌入作为监督目标,提供的信号与无意义的占位符没有区别。受这些见解的启发,我们提出了SCOLAR(自我一致潜在推理),它引入了一个轻量级的去变压器,利用LLM的全序列隐藏状态一次性生成辅助视觉标记,每个标记独立锚定到原始的视觉空间。结合三阶段SFT和ALPO强化学习,SCOLAR将可接受的潜在CoT长度扩展了超过30倍,在开源模型中实现了最先进的现实世界推理基准性能(+14.12%优于骨干模型),并展示了强大的离分布泛化。
Summary / 总结
The research aims to improve visual latent reasoning by addressing the issue of information gain collapse in existing methods. It proposes SCOLAR, which uses a lightweight detransformer to generate auxiliary visual tokens independently, and combines this with three-stage SFT and ALPO reinforcement learning. SCOLAR significantly extends the acceptable latent CoT length and achieves state-of-the-art performance on real-world reasoning benchmarks, outperforming the backbone model by 14.12%.
研究旨在通过解决现有方法中的信息增益坍塌问题,改进视觉潜在推理。提出了SCOLAR,使用轻量级detransformer一次性生成独立锚定于原始视觉空间的辅助视觉标记,并结合三阶段SFT和ALPO强化学习。SCOLAR显著延长了可接受的潜在CoT长度,并在实际推理基准测试中达到了最先进的性能,比基础模型高出14.12%。
Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
Authors: Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi
Venue: ACL 2026
First: 2025-08-11T05:50:30+00:00 · Latest: 2026-05-12T13:20:48+00:00
Comments: Accepted by ACL 2026 Main Conference
Abstract
Vision-and-Language Navigation (VLN) poses significant challenges for agents to interpret natural language instructions and navigate complex 3D environments. While recent progress has been driven by large-scale pre-training and data augmentation, current methods still struggle to generalize to unseen scenarios, particularly when complex spatial and temporal reasoning is required. In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. Our method decomposes navigation into a set of interpretable atomic skills (e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each handled by a specialized agent. To support targeted skill training without manual data annotation, we construct a synthetic dataset pipeline that generates diverse, linguistically natural, skill-specific instruction-trajectory pairs. We then introduce a novel training-free Vision-Language Model (VLM)-based router, which dynamically selects the most suitable agent at each time step by aligning sub-goals with visual observations and historical actions. SkillNav obtains competitive results on commonly used benchmarks and establishes state-of-the-art generalization on GSA-R2R, a benchmark with novel instruction styles and unseen environments.
Summary / 总结
The research aims to improve the ability of agents to navigate complex 3D environments using natural language instructions. SkillNav, a modular framework, introduces structured skill-based reasoning into Transformer-based VLN agents, decomposing navigation into interpretable atomic skills. The method uses a synthetic dataset pipeline to generate diverse skill-specific instruction-trajectory pairs and a VLM-based router to dynamically select the most suitable agent. SkillNav achieves competitive results on benchmarks and demonstrates superior generalization to unseen environments and instruction styles.
研究旨在提高使用自然语言指令在复杂3D环境中导航的能力。SkillNav模块化框架将结构化的技能推理引入到基于Transformer的VLN代理中,将导航分解为可解释的基本技能。该方法使用合成数据集管道生成多样化的技能特定指令-轨迹对,并使用基于VLM的路由器在每个时间步骤动态选择最合适的代理。SkillNav在基准测试中取得了竞争力的结果,并在具有新颖指令风格和未见过的环境的GSA-R2R基准测试中表现出色。
FlowLPS: Langevin-Proximal Sampling for Flow-based Inverse Problem Solvers
Authors: Jonghyun Park, Jong Chul Ye
First: 2025-12-08T04:18:13+00:00 · Latest: 2026-05-12T12:43:07+00:00
Abstract
Deep generative models are powerful priors for imaging inverse problems, but training-free solvers for latent flow models face a practical finite-step trade-off. Optimization-heavy methods quickly improve measurement consistency, but in highly nonlinear latent spaces, their results can depend strongly on where local refinement is initialized, often degrading perceptual realism. In contrast, stochastic sampling methods better preserve posterior exploration, but often require many iterations to obtain sharp, measurement-consistent reconstructions. To address this trade-off, we propose FlowLPS, a training-free latent flow inverse solver based on Langevin-Proximal Sampling. At each reverse step, FlowLPS uses a few Langevin updates to perturb the model-predicted clean estimate in posterior-oriented directions, providing stochastic initializations for local refinement. It then applies local MAP-style proximal refinement to rapidly improve measurement consistency from the Langevin-updated estimate. We additionally use controlled pCN-style re-noising to stabilize the reverse trajectory while retaining trajectory coherence. Experiments on FFHQ and DIV2K across five linear inverse problems show that FlowLPS achieves a strong balance between measurement fidelity and perceptual quality, with additional experiments on pixel-space inverse problems and phase retrieval.
中文标题/摘要
标题:FlowLPS:基于 Langevin-Proximal 采样的流基逆问题求解器
深度生成模型是成像逆问题的强大先验,但无训练的潜流模型求解器面临实际的有限步长权衡。重优化方法可以迅速提高测量一致性,但在高度非线性的潜空间中,其结果往往高度依赖于局部细化的初始化位置,通常会降低感知现实感。相比之下,随机采样方法更好地保留了后验探索,但通常需要多次迭代才能获得清晰的、测量一致的重构。为了解决这种权衡,我们提出了基于 Langevin-Proximal 采样的无训练潜流逆问题求解器 FlowLPS。在每个反向步骤中,FlowLPS 使用几次 Langevin 更新来扰动模型预测的干净估计值,使其在后验导向的方向上,为局部细化提供随机初始化。然后,它应用局部 MAP 样式的邻近细化,以快速提高从 Langevin 更新估计值的测量一致性。我们还使用受控的 pCN 样式的重新噪声来稳定反向轨迹,同时保持轨迹连贯性。在 FFHQ 和 DIV2K 上针对五种线性逆问题的实验表明,FlowLPS 在测量保真度和感知质量之间实现了良好的平衡,还对像素空间逆问题和相位检索进行了额外实验。
Summary / 总结
FlowLPS is a training-free latent flow inverse solver based on Langevin-Proximal Sampling that balances measurement consistency and perceptual quality. At each reverse step, it uses a few Langevin updates to perturb the model-predicted clean estimate in posterior-oriented directions, followed by local MAP-style proximal refinement to enhance measurement consistency. Experiments on FFHQ and DIV2K show that FlowLPS achieves a good balance between measurement fidelity and perceptual quality across various inverse problems.
FlowLPS 是一种基于 Langevin-Proximal Sampling 的无训练逆向求解器,结合了扰动模型预测的干净估计值并在后验方向上的少量 Langevin 更新,随后进行局部 MAP 样式的邻近细化以提高测量一致性。实验表明,FlowLPS 在各种逆向问题上实现了测量保真度和感知质量的良好平衡。
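The two-part step described in the FlowLPS entry (a few Langevin updates, then a MAP-style proximal refinement) can be illustrated on a toy problem. This is a minimal sketch under strong assumptions, not the authors' solver: the measurement operator is a hypothetical diagonal linear map `a`, the Langevin drift uses only the Gaussian measurement likelihood rather than a full posterior, there is no flow model or pCN re-noising, and all function names and parameters (`step`, `rho`, `sigma_y`) are illustrative.

```python
import math
import random


def langevin_updates(x, y, a, sigma_y, step, n_steps, rng):
    """Unadjusted Langevin steps on the Gaussian likelihood log p(y | x).

    Toy setting: y[i] = a[i] * x[i] + noise, so the gradient is elementwise.
    """
    x = list(x)
    for _ in range(n_steps):
        for i in range(len(x)):
            grad = a[i] * (y[i] - a[i] * x[i]) / sigma_y ** 2
            x[i] += step * grad + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x


def proximal_refine(x_init, y, a, rho):
    """Closed-form minimizer of 0.5*||y - a*x||^2 + 0.5*rho*||x - x_init||^2."""
    return [(a[i] * y[i] + rho * x_init[i]) / (a[i] ** 2 + rho)
            for i in range(len(x_init))]


def residual(x, y, a):
    """Euclidean norm of the measurement residual y - a*x."""
    return math.sqrt(sum((y[i] - a[i] * x[i]) ** 2 for i in range(len(x))))


rng = random.Random(0)
a = [1.0, 0.5, 2.0]          # hypothetical diagonal measurement operator
y = [1.0, 1.0, 1.0]          # hypothetical measurements
x_hat = [0.0, 0.0, 0.0]      # stand-in for the model-predicted clean estimate

x_langevin = langevin_updates(x_hat, y, a, sigma_y=1.0, step=0.01, n_steps=3, rng=rng)
x_refined = proximal_refine(x_langevin, y, a, rho=0.1)
```

In this toy setting the proximal step can never increase the measurement residual, because each coordinate's residual is scaled by `rho / (a[i]**2 + rho)`, which is at most 1. The Langevin noise is what makes the starting point of that refinement stochastic.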
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
Authors: Yuchen Deng, Zidang Cai, Hai-Tao Zheng, Jie Wang, Feidiao Yang, Yuxing Han
First: 2026-05-12T12:42:44+00:00 · Latest: 2026-05-12T12:42:44+00:00
Abstract
Omnimodal large language models (Omni-LLMs) show strong capability in audio-video understanding, but their practical deployment remains limited by the high inference cost of long video streams and dense audio sequences. Despite recent progress, existing compression methods for Omni-LLMs typically rely on fixed or native compression units, which can disrupt cross-modal correspondence and the complementary information required for audio-video reasoning, making it difficult to improve inference efficiency while stably preserving performance. To address this, we propose OmniRefine, a training-free two-stage framework for efficient audio-visual token compression in Omni-LLMs. First, Correspondence-Preserving Chunk Refinement refines native chunk boundaries into cross-modally aligned compression units through frame-audio similarity and dynamic programming. Second, Modality-Aware Cooperative Compression jointly compresses video and audio tokens within each refined unit to reduce redundancy while preserving critical evidence. Extensive experiments show that OmniRefine achieves a better efficiency-performance trade-off than strong baselines and maintains stable performance under lower compression ratios. On WorldSense, it still reaches 46.7% accuracy at a 44% token retention ratio, nearly matching the full-token baseline. The code and interface will be released to facilitate further research.
中文标题/摘要
标题:OmniRefine:基于对齐感知的合作压缩以提高高效全模态大型语言模型性能
全模态大型语言模型(Omni-LLMs)在音频视频理解方面表现出强大的能力,但其实际部署受限于长视频流和密集音频序列的高推理成本。尽管取得了进展,现有的Omni-LLMs压缩方法通常依赖于固定的或原生的压缩单元,这可能会破坏跨模态对应关系和音频视频推理所需的互补信息,使得在稳定保持性能的同时提高推理效率变得困难。为了解决这个问题,我们提出了一种无需训练的两阶段框架OmniRefine,用于在Omni-LLMs中高效压缩音频-视觉标记。首先,通过帧-音频相似性和动态规划,Correspondence-Preserving Chunk Refinement将原生的块边界优化为跨模态对齐的压缩单元。其次,Modality-Aware Cooperative Compression在每个优化单元内联合压缩视频和音频标记,以减少冗余并保留关键证据。广泛的实验表明,OmniRefine在效率-性能权衡方面优于强大的基线,并且在较低的压缩比下仍能保持稳定的性能。在WorldSense上,即使在44%的标记保留率下,它仍能达到46.7%的准确率,几乎与全标记基线相当。代码和界面将被发布以促进进一步的研究。
4DVGGT-D: 4D Visual Geometry Transformer with Improved Dynamic Depth Estimation
Authors: Ying Zang, Xuanyi Liu, Yidong Han, Deyi Ji, Chaotao Ding, Yuanqi Hu, Qi Zhu, Xuanfu Li, Jin Ma, Lingyun Sun, Tianrun Chen, Lanyun Zhu
First: 2026-05-12T12:13:36+00:00 · Latest: 2026-05-12T12:13:36+00:00
Abstract
Reconstructing dynamic 4D scenes from monocular videos is a fundamental yet challenging task. While recent 3D foundation models provide strong geometric priors, their performance significantly degrades in dynamic environments. This degradation stems from a fundamental tension: the inherent coupling of camera ego-motion and object motion within global attention mechanisms. In this paper, we propose a novel, training-free progressive decoupling framework that disentangles dynamics from statics in a principled, coarse-to-fine manner. Our core insight is to resolve the tension by first stabilizing the camera pose, followed by geometric refinement. Specifically, our approach consists of three synergistic components: (1) a Dynamic-Mask-Guided Pose Decoupling module that isolates pose estimation from dynamic interference, yielding a stable motion-free reference frame; (2) a Topological Subspace Surgery mechanism that orthogonally decomposes the depth manifold, safely preserving dynamic objects while injecting refined, mask-aware geometry into static regions; and (3) an Information-Theoretic Confidence-Aware Fusion strategy that formulates depth integration as a heteroscedastic Bayesian inference problem, adaptively blending multi-pass predictions via inverse-variance weighting. Extensive experiments on standard 4D reconstruction benchmarks demonstrate that our method achieves consistent and substantial improvements across principal point-cloud metrics. Notably, our approach shows competitive performance in robust 4D scene reconstruction without requiring fine-tuning, suggesting the potential of mathematically grounded dynamic-static disentanglement.
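The inverse-variance weighting named in component (3) of the 4DVGGT-D abstract is a standard rule for fusing independent Gaussian estimates. The sketch below shows that rule for a single depth value. It assumes each pass supplies a mean and a variance; the function name and the example numbers are illustrative, and how the paper actually obtains per-pixel variances is not stated in the abstract.

```python
from typing import Sequence, Tuple


def inverse_variance_fuse(
    means: Sequence[float],
    variances: Sequence[float],
    eps: float = 1e-12,
) -> Tuple[float, float]:
    """Fuse independent Gaussian estimates of one quantity.

    Each estimate is weighted by 1 / variance, so confident passes dominate.
    Returns (fused_mean, fused_variance).
    """
    if len(means) != len(variances) or not means:
        raise ValueError("means and variances must be non-empty and equally long")
    weights = [1.0 / (v + eps) for v in variances]
    total = sum(weights)
    fused_mean = sum(w * m for w, m in zip(weights, means)) / total
    fused_variance = 1.0 / total
    return fused_mean, fused_variance


# Hypothetical example: two passes predict the depth of one pixel.
# Pass 1 is confident (2.0 m, variance 0.01); pass 2 is not (3.0 m, variance 0.09).
depth, var = inverse_variance_fuse([2.0, 3.0], [0.01, 0.09])
```

With these numbers the fused depth is about 2.1 m, much closer to the confident pass, and the fused variance (about 0.009) is smaller than either input variance.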
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
Authors: Berk Çiçek, Mert K. Er, Ozgur S. Oguz
Venue: RSS
First: 2026-05-04T13:49:19+00:00 · Latest: 2026-05-12T11:25:37+00:00
Comments: 22 pages, 9 figures, 3 tables. Accepted to Robotics: Science and Systems (RSS) 2026. Updated to camera-ready version with appendix and text/formatting revisions
Abstract
While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.
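CoRAL pairs LLM-designed cost functions with an MPPI planner. The sketch below shows a single generic MPPI update, to make the sampling-and-reweighting idea concrete. It is not CoRAL's planner: the 1-D single-integrator dynamics and the quadratic goal cost are hypothetical stand-ins for whatever cost an LLM would synthesize, and `sigma`, `lam`, and `num_samples` are illustrative values.

```python
import math
import random


def rollout_cost(x0, controls, goal, dt=0.1):
    """Cost of a control sequence for toy dynamics x <- x + u * dt."""
    x, cost = x0, 0.0
    for u in controls:
        x += u * dt
        cost += (x - goal) ** 2  # hypothetical stand-in for a synthesized cost
    return cost


def mppi_step(x0, nominal, cost_fn, num_samples=256, sigma=1.0, lam=1.0, rng=None):
    """One MPPI update: sample noisy control sequences, weight by exp(-cost / lam)."""
    rng = rng or random.Random(0)
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(num_samples)]
    costs = [cost_fn(x0, [u + e for u, e in zip(nominal, eps)]) for eps in noises]
    best = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # New nominal control = old control + importance-weighted average perturbation.
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
        for t, u in enumerate(nominal)
    ]


goal = 1.0
nominal = [0.0] * 10
updated = mppi_step(0.0, nominal, lambda x0, us: rollout_cost(x0, us, goal))
```

Swapping the cost function changes the behavior without touching the planner, which is the separation the abstract describes between high-level cost design and low-level control.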
Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models
Authors: Maham Nazir, Muhammad Aqeel, Richong Zhang, Francesco Setti
First: 2026-05-12T11:11:36+00:00 · Latest: 2026-05-12T11:11:36+00:00
Comments: Accepted to ICPR 2026
Abstract
Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent visual concepts as discrete categories not aligned with natural language. We propose ClipSum, a framework that leverages frozen CLIP vision-language features with explicit temporal modeling and dimension-adaptive fusion for instructional video summarization. CLIP's contrastive pre-training on 400M image-text pairs yields visual features semantically aligned with the linguistic concepts that text decoders generate, bridging the vision-language gap at the representation level. On YouCook2, ClipSum achieves 33.0% ROUGE-1 versus 30.5% for ResNet-152 with 4x lower dimensionality (512 vs. 2048), demonstrating that semantic alignment matters more than feature capacity. Frozen CLIP (33.0%) surpasses fine-tuned CLIP (32.3%), showing that preserving pre-trained alignment is more valuable than task-specific adaptation. https://github.com/aqeeelmirza/clipsum
中文标题/摘要
标题:使用视觉语言模型的指令视频多模态摘要
多模态视频摘要需要与语言生成语义对齐的视觉特征。传统方法依赖于用于对象分类的CNN特征,这些特征将视觉概念表示为不与自然语言对齐的离散类别。我们提出了一种名为ClipSum的框架,该框架利用冻结的CLIP视觉语言特征,并结合显式的时序建模和维度自适应融合,用于指令视频摘要。CLIP在400万图像-文本对上的对比预训练产生了与文本解码器生成的语义概念语义对齐的视觉特征,从表示层面弥合了视觉语言差距。在YouCook2上,ClipSum的ROUGE-1得分为33.0%,而ResNet-152的得分为30.5%,且维度仅为ResNet-152的四分之一(512 vs. 2048),表明语义对齐比特征容量更重要。冻结的CLIP(33.0%)优于微调的CLIP(32.3%),表明保持预训练对齐比任务特定适应更有价值。https://github.com/aqeeelmirza/clipsum
Summary / 总结
The research aims to improve multimodal video summarization by aligning visual and linguistic concepts. ClipSum uses frozen CLIP vision-language features with temporal modeling and dimension-adaptive fusion for instructional video summarization. On YouCook2, ClipSum outperforms ResNet-152 with 4x lower dimensionality, showing that semantic alignment is more important than feature capacity. Preserving pre-trained CLIP alignment is more beneficial than fine-tuning for specific tasks.
研究旨在通过使视觉和语言概念对齐来改进多模态视频摘要。ClipSum 使用冻结的 CLIP 视觉-语言特征以及时间建模和维度自适应融合来进行教学视频摘要。在 YouCook2 上,ClipSum 的表现优于 ResNet-152,且维度仅为后者四分之一,表明语义对齐比特征容量更重要。冻结的 CLIP 比微调的 CLIP 更有优势,说明保持预训练对齐比特定任务的适应更有价值。
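The ClipSum results above are reported in ROUGE-1, which measures unigram overlap between a generated summary and a reference. The sketch below is a simplified version for illustration: it assumes lowercase whitespace tokenization with no stemming, so its scores will not exactly match an official ROUGE implementation, and the abstract does not say whether recall or F1 is reported. The example sentences are invented.

```python
from collections import Counter
from typing import Tuple


def rouge1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = rouge1(
    "add the chopped onions to the pan",
    "add onions to the hot pan",
)
```

Here five unigrams match (the repeated "the" is counted once), giving precision 5/7, recall 5/6, and F1 of 10/13.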
DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes
Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
First: 2025-12-31T17:31:29+00:00 · Latest: 2026-05-12T10:54:09+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. We demonstrate the utility of DarkQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models, and systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io
Summary
DarkQA is a benchmark for evaluating vision-language models (VLMs) under low-light conditions, addressing the underexplored challenge of robust perception in dark environments. It comprises 9.4K deterministically generated, verifiable question-image pairs at controlled degradation levels, synthesized in linear RAW space by simulating illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. VLMs degrade consistently under low illumination and sensor noise, while low-light image enhancement (LLIE) methods provide severity-dependent but unstable recovery. The benchmark thus exposes the limitations of VLMs under these challenging visual conditions and the need for better low-light performance.
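The abstract describes the degradation model only at a high level (illumination drop and sensor noise in linear RAW space, then an ISP-inspired rendering step). The following is a minimal sketch of that general idea, not DarkQA's actual pipeline: it works on grayscale values in [0, 1], uses a Gaussian approximation to shot noise, and stands in for the ISP with a plain sRGB gamma curve. The function names and the parameters `light_scale`, `full_well`, and `read_noise_std` are illustrative assumptions.

```python
import random


def srgb_to_linear(v):
    """Invert the sRGB transfer curve for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def linear_to_srgb(v):
    """Apply the sRGB transfer curve to a linear value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055


def simulate_low_light(pixels, light_scale=0.1, full_well=1000.0,
                       read_noise_std=2.0, seed=0):
    """Darken sRGB pixel values and add sensor-like noise (hypothetical model).

    light_scale:    fraction of the original illumination that remains
    full_well:      electrons corresponding to a linear value of 1.0
    read_noise_std: standard deviation of additive read noise, in electrons
    """
    rng = random.Random(seed)
    out = []
    for v in pixels:
        # Work in linear space, scaled to an electron count.
        electrons = srgb_to_linear(v) * light_scale * full_well
        # Gaussian approximation to Poisson shot noise: variance equals mean.
        shot = rng.gauss(0.0, electrons ** 0.5)
        read = rng.gauss(0.0, read_noise_std)
        noisy = (electrons + shot + read) / full_well
        # Clip, then re-encode; a real ISP would do far more than a gamma curve.
        out.append(linear_to_srgb(min(max(noisy, 0.0), 1.0)))
    return out


dark = simulate_low_light([0.8] * 500, light_scale=0.1, seed=1)
print(sum(dark) / len(dark))  # mean brightness after degradation
```

Because noise is added before the gamma encoding, it is signal-dependent in the rendered image, which is the property that a purely post-hoc darkening of sRGB pixels would miss.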
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
Authors: Boyang Guo, Liang Li, Lin Peng, Yuhan Gao, Xichun Sheng, Chenggang Yan
First: 2026-05-12T10:50:43+00:00 · Latest: 2026-05-12T10:50:43+00:00
Abstract
Prompt learning has emerged as an efficient alternative to fine-tuning pre-trained vision-language models (VLMs). Despite its promise, current methods still struggle to maintain tail-class discriminability when adapting to class-imbalanced datasets. In this work, we propose cluster-aware neural collapse prompt tuning (CPT), which enhances the discriminability of tail classes in prompt-tuned VLMs without sacrificing their overall generalization. First, we design a cluster-invariant space by mining semantic assignments from the pre-trained VLM and mapping them to prompt-tuned features. This computes cluster-level boundaries and restricts the constraints to local neighborhoods, which reduces interference with the global semantic structure of the pre-trained VLM. Second, we introduce neural-collapse-driven discriminability optimization with three losses: textual Equiangular Tight Frame (ETF) separation loss, class-wise convergence loss, and rotation stabilization loss. These losses work together to shape intra-cluster geometry for better inter-class separation and intra-class alignment. Extensive experiments on 11 diverse datasets demonstrate that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
Summary
This study introduces cluster-aware neural collapse prompt tuning (CPT), a prompt-tuning method for vision-language models (VLMs) that improves tail-class discriminability on class-imbalanced data while maintaining overall generalization. CPT builds a cluster-invariant space from semantic assignments mined from the pre-trained VLM, then applies neural-collapse-driven discriminability optimization with three losses: a textual ETF separation loss, a class-wise convergence loss, and a rotation stabilization loss. Together, these losses shape intra-cluster geometry for better inter-class separation and intra-class alignment. Experiments on 11 diverse datasets show that CPT outperforms SOTA methods, with stronger performance on long-tail classes and good generalization to unseen classes.
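The abstract names a textual Equiangular Tight Frame (ETF) separation loss without defining it. As background only, the sketch below builds the standard simplex ETF used in the neural collapse literature and checks its defining property: K unit vectors whose pairwise cosine similarity is exactly -1/(K-1). How CPT turns this geometry into a loss is not stated in the abstract, so nothing here should be read as the authors' formulation; the function names are illustrative.

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K forming a simplex ETF.

    Row i is sqrt(K / (K - 1)) * (e_i - (1 / K) * ones), i.e. the scaled,
    mean-centred standard basis vector.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


etf = simplex_etf(5)
# Every distinct pair is separated by the same angle: cosine = -1 / (K - 1).
print(cosine(etf[0], etf[1]), -1 / (5 - 1))
```

The value -1/(K-1) is the most negative cosine that K vectors can share pairwise, which is why this configuration is a common target for maximally separated class prototypes.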
From Noise to Diversity: Random Embedding Injection in LLM Reasoning
Authors: Heejun Kim, Seungpil Lee, Jewon Yeom, Jaewon Sok, Seonghyeon Park, Jeongjae Park, Taesup Kim, Sundong Kim
First: 2026-05-12T10:47:20+00:00 · Latest: 2026-05-12T10:47:20+00:00
Comments: 30 pages, 5 figures, 6 tables. Under review
Abstract
Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt -- training-free, freshly resampled -- providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
Summary
This paper asks whether soft-prompt gains in LLM reasoning come from learned content or from the act of injection itself. It studies Random Soft Prompts (RSPs), which append freshly sampled random embedding vectors to the LLM input without any training. In several settings, RSPs reach accuracy comparable to optimized soft prompts on math reasoning benchmarks. The mechanism has two stages: the distribution over the first few generated tokens flattens and reasoning trajectories branch, then the influence dilutes as generation continues and the response commits to a single completion. At inference, RSPs raise early-stage token diversity and, combined with temperature sampling, improve Pass@N. The same effect carries over to DAPO training with practical gains. Contributions include isolating the simplest form of soft prompt, validating the underlying mechanism theoretically and empirically, and extending the effect from inference to training.
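The sampling step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the embedding table is a plain list of rows, "entrywise mean and variance" is read as a single scalar mean and variance pooled over all table entries, and the helper names are hypothetical.

```python
import random


def fit_entrywise_stats(table):
    """Scalar mean and variance pooled over every entry of the embedding table."""
    entries = [x for row in table for x in row]
    mean = sum(entries) / len(entries)
    var = sum((x - mean) ** 2 for x in entries) / len(entries)
    return mean, var


def sample_rsp(table, num_vectors, rng):
    """Draw num_vectors random embeddings from the fitted isotropic Gaussian."""
    mean, var = fit_entrywise_stats(table)
    std = var ** 0.5
    dim = len(table[0])
    return [[rng.gauss(mean, std) for _ in range(dim)]
            for _ in range(num_vectors)]


def append_rsp(input_embeds, table, num_vectors, rng):
    """Return a new sequence: the input embeddings followed by a fresh RSP."""
    return input_embeds + sample_rsp(table, num_vectors, rng)


rng = random.Random(0)
table = [[0.0, 2.0], [2.0, 0.0]]   # toy 2-token, 2-dimensional embedding table
prompt = [[0.5, 0.5]]              # toy embedded input of length 1
print(len(append_rsp(prompt, table, 3, rng)))  # 4: one input vector + three RSP vectors
```

Calling `append_rsp` again with the same `rng` yields a different random suffix, matching the abstract's description of the prompt as freshly drawn for each input rather than fixed.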
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
Authors: Qi Zhao, Jun Chen, Ivor Tsang, Guang Dai
First: 2026-05-12T10:39:45+00:00 · Latest: 2026-05-12T10:39:45+00:00
Comments: CVPR2026
Abstract
While modern diffusion models excel at generating diverse single images, extending this to sequential generation reveals a fundamental challenge: balancing narrative dynamism with multi-character coherence. Existing methods often falter at this trade-off, leading to artifacts where characters lose their identity or the story stagnates. To resolve this critical tension, we introduce RealDiffusion, a unified framework designed to reconcile robust coherence with narrative dynamism. Heat diffusion serves as a dissipative prior that averages neighboring features along the sequence and removes high-frequency noise within the subject region. This suppresses attribute drift and stabilizes identity across frames. A region-aware stochastic process then introduces small perturbations that explore nearby modes and prevent collapse so the story maintains pose change and scene evolution. We thus introduce a lightweight, training-free Physics-informed Attention mechanism that injects controllable physical priors into the self-attention layers during inference. By modeling feature evolution as a configurable physical system, our method regularizes spatio-temporal relationships without suppressing intentional, prompt-driven changes. Extensive experiments demonstrate that RealDiffusion achieves substantial gains in character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches. Code is available at https://github.com/ShmilyQi-CN/RealDiffusion.
Summary
RealDiffusion addresses the challenge of generating coherent multi-character storybook sequences with a lightweight, training-free physics-informed attention mechanism applied at inference. Heat diffusion acts as a dissipative prior that stabilizes character identity across frames, while a region-aware stochastic process adds small perturbations that preserve pose change and scene evolution. Experiments show that RealDiffusion substantially improves character coherence while preserving narrative dynamism, outperforming state-of-the-art approaches.
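The two priors in the abstract (a heat-diffusion step that averages neighboring features along the sequence inside the subject region, then small region-aware perturbations) can be illustrated with a toy example. This is a sketch under assumptions, not RealDiffusion's implementation: features are plain per-frame lists rather than self-attention activations, the subject region is a boolean mask over feature positions, and `alpha` and `sigma` are hypothetical knobs.

```python
import random


def heat_step(frames, mask, alpha=0.25):
    """One explicit heat-equation step along the frame axis, inside the mask only.

    frames: list of per-frame feature lists, all of the same length
    mask:   one boolean per feature position; True marks the subject region
    alpha:  diffusion strength; the explicit scheme is stable for alpha <= 0.5
    """
    n = len(frames)
    out = [list(f) for f in frames]
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]        # replicate the first frame at the boundary
        next_f = frames[min(t + 1, n - 1)]    # replicate the last frame at the boundary
        for i, inside in enumerate(mask):
            if inside:
                laplacian = prev_f[i] - 2 * frames[t][i] + next_f[i]
                out[t][i] = frames[t][i] + alpha * laplacian
    return out


def perturb(frames, mask, sigma, rng):
    """Add small Gaussian noise inside the masked region only."""
    return [[x + rng.gauss(0.0, sigma) if inside else x
             for x, inside in zip(f, mask)]
            for f in frames]


frames = [[0.0, 5.0], [1.0, 5.0], [0.0, 5.0]]   # 3 frames, 2 feature positions
mask = [True, False]                            # only position 0 is "subject"
smoothed = heat_step(frames, mask)
print([f[0] for f in smoothed])  # [0.25, 0.5, 0.25]
print(perturb(smoothed, mask, 0.01, random.Random(0)))
```

The smoothing pulls the outlier frame toward its neighbors while conserving the total over frames, and position 1 outside the mask is left untouched, so prompt-driven changes elsewhere are not averaged away.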
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
Authors: Yuzhu Wang, Xi Ye, Duo Su, Yangyang Xu, Jun Zhu
First: 2026-05-12T09:51:04+00:00 · Latest: 2026-05-12T09:51:04+00:00
Abstract
Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and heuristic guidance-strength tuning lacks robustness. We propose $h$-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
Authors: Jian Tang, Jiawei Fan, Qingbin Liu, Zheng Wei
First: 2026-05-12T09:49:36+00:00 · Latest: 2026-05-12T09:49:36+00:00
Abstract
While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11-2.41× speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.
Summary
FIS-DiT reduces the per-step inference latency of Video Diffusion Transformers (DiTs) by exploiting frame-wise sparsity and structural consistency in the latent frame dimension rather than redundancy across the denoising trajectory. This training-free, operator-agnostic framework achieves a consistent 2.11-2.41× speedup on Wan 2.2 and HunyuanVideo 1.5 with negligible degradation on VBench-Q and CLIP metrics, supporting real-time high-definition video generation.
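The abstract does not specify how frame subsets are chosen at each level of the model. The toy sketch below illustrates one possible reading of "interleaved" sparsity, a round-robin assignment in which each block computes only every `stride`-th frame and consecutive blocks cover the remaining frames. It is an assumption for illustration, not the FIS-DiT schedule; the blocks here act on frames independently, whereas real DiT blocks attend across frames.

```python
def interleaved_schedule(num_frames, num_blocks, stride=2):
    """Frame indices computed by each block under a round-robin interleave."""
    return [[f for f in range(num_frames) if f % stride == b % stride]
            for b in range(num_blocks)]


def run_blocks(frames, blocks, stride=2):
    """Apply each block only to its scheduled frames; other frames pass through.

    Returns the updated frames and the number of per-frame block evaluations.
    """
    frames = list(frames)
    calls = 0
    schedule = interleaved_schedule(len(frames), len(blocks), stride)
    for active, block in zip(schedule, blocks):
        for f in active:
            frames[f] = block(frames[f])
            calls += 1
    return frames, calls


blocks = [lambda x: x + 1] * 4           # four toy "blocks" that each add 1
out, calls = run_blocks([0, 0, 0, 0], blocks, stride=2)
print(out, calls)  # [2, 2, 2, 2] 8
```

With four frames, four blocks, and `stride=2`, every frame position is refreshed by two blocks while the stack performs 8 per-frame evaluations instead of the dense 16, which is the kind of trade-off the abstract describes.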