Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training
Authors: Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai, Jian Yang, Zhenyu Zhang
First: 2026-01-06T18:59:57+00:00 · Latest: 2026-01-06T18:59:57+00:00
Comments: Project page: https://luhexiao.github.io/Muses.github.io/
Abstract
We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and show its potential for flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.
中文标题/摘要
标题:缪斯:无需训练设计、编排和生成不存在的幻想3D生物
我们提出了缪斯,这是首个无需训练的前馈范式下生成幻想3D生物的方法。以往依赖于部分感知优化、手动组装或2D图像生成的方法,由于精细部分操作的复杂性和跨域生成能力的限制,往往会产生不现实或不连贯的3D资产。相比之下,缪斯利用了3D骨架,这是生物形态的基本表示,以明确和理性的方式编排多样元素。这种骨骼基础将3D内容创作形式化为一种结构感知的设计、编排和生成流水线。缪斯首先通过图约束推理构建一个创意编排的3D骨架,具有连贯的布局和比例。然后,该骨架指导在结构化潜在空间内的体素组装过程,整合来自不同对象的区域。最后,在骨骼条件下应用图像引导的外观建模,以生成与组装形状风格一致且和谐的纹理。大量实验表明,缪斯在视觉保真度和与文本描述的一致性方面达到了最先进的性能,并且在灵活的3D对象编辑方面具有潜力。项目页面:https://luhexiao.github.io/Muses.github.io/
Summary / 总结
Muses is a training-free method for generating 3D fantasy creatures in a feed-forward manner. It uses a 3D skeleton to compose and generate diverse elements, addressing the limitations of previous methods that often produce unrealistic 3D assets. Muses constructs a coherent 3D skeleton through graph-constrained reasoning, guides a voxel-based assembly process, and applies image-guided appearance modeling to generate a harmonious texture. Experiments show that Muses outperforms previous methods in visual fidelity and alignment with textual descriptions, and demonstrates potential for flexible 3D object editing.
Muses 是一种无需训练的方法,使用前馈方式生成奇幻 3D 生物。不同于依赖部分感知优化或手动组装的方法,Muses 使用 3D 骨架理性地组合和生成多种元素。它首先通过图约束推理构建一个连贯的 3D 骨架,然后在结构化的潜在空间内组装体素,最后应用图像引导的外观建模生成和谐的纹理。实验表明,Muses 在视觉保真度和与文本描述的对齐方面优于现有方法,并且支持灵活的 3D 对象编辑。
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Authors: Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Zhang Bo, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, KinHei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun
First: 2025-11-24T06:40:38+00:00 · Latest: 2026-01-06T16:25:52+00:00
Abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative Question-Answering pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. To facilitate further research, we publicly release MSU-Bench and all associated resources.
中文标题/摘要
标题:音乐谱理解基准:评估大型语言模型对完整音乐谱的理解能力
理解完整的音乐谱需要综合推理音高、节奏、和声和大尺度结构,然而大型语言模型和视觉-语言模型对完整音乐记谱符号的解释能力仍缺乏充分的考察。我们引入了音乐谱理解基准(MSU-Bench),这是首个大规模、人工策画的谱级音乐理解基准,涵盖文本(ABC 符号)和视觉(PDF)模态。MSU-Bench 包含来自巴赫、贝多芬、肖邦、德彪西等作曲家的1,800个生成性问答对,按难度分为四个级别,从起始信息到织体和结构。超过十五个最先进的模型在零样本和微调设置下的评估显示了模态差距、不稳定级别的表现以及多级正确性的维护挑战。微调在所有模态中显著提高了结果,同时保留了通用知识,使MSU-Bench 成为未来多模态推理研究的坚实基础。为了促进进一步研究,我们公开发布了MSU-Bench及其所有相关资源。
Summary / 总结
The research aims to evaluate Large Language Models and Vision-Language Models in understanding complete musical scores, which require integrated reasoning over pitch, rhythm, harmony, and large-scale structure. The Musical Score Understanding Benchmark (MSU-Bench) was introduced as the first large-scale, human-curated benchmark for score-level musical understanding across textual and visual modalities. Evaluations of over fifteen state-of-the-art models showed pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning models significantly improved results across modalities while preserving general knowledge, highlighting the potential of MSU-Bench for future research in multimodal reasoning.
研究旨在评估大型语言模型和视觉-语言模型在理解完整音乐谱方面的能力,这需要综合推理音高、节奏、和声和大尺度结构。引入了音乐谱理解基准(MSU-Bench),作为首个大规模、人工编纂的跨文本和视觉模态的乐谱级音乐理解基准。对超过十五种最先进的模型的评估显示了明显的模态差距、不稳定级别的性能以及在保持多级正确性方面的挑战。微调模型显著提高了跨模态的结果,同时保留了通用知识,突显了MSU-Bench在多模态推理未来研究中的潜力。
DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations
Authors: Kailin Tan, Jincheng Dai, Sixian Wang, Guo Lu, Shuo Shao, Kai Niu, Wenjun Zhang, Ping Zhang
First: 2026-01-06T15:42:45+00:00 · Latest: 2026-01-06T15:42:45+00:00
Comments: 14pages, 14figures, 2tables
Abstract
Generative joint source-channel coding (GJSCC) has emerged as a new Deep JSCC paradigm for achieving high-fidelity and robust image transmission under extreme wireless channel conditions, such as ultra-low bandwidth and low signal-to-noise ratio. Recent studies commonly adopt diffusion models as generative decoders, but they frequently produce visually realistic results with limited semantic consistency. This limitation stems from a fundamental mismatch between reconstruction-oriented JSCC encoders and generative decoders, as the former lack explicit semantic discriminability and fail to provide reliable conditional cues. In this paper, we propose DiT-JSCC, a novel GJSCC backbone that jointly learns a semantics-prioritized representation encoder and a diffusion transformer (DiT) based generative decoder. Our open-source project aims to promote future research in GJSCC. Specifically, we design a semantics-detail dual-branch encoder that aligns naturally with a coarse-to-fine conditional DiT decoder, prioritizing semantic consistency under extreme channel conditions. Moreover, a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity is introduced to further improve transmission efficiency, thereby redefining the notion of information value in the era of generative decoding. Extensive experiments demonstrate that DiT-JSCC consistently outperforms existing JSCC methods in both semantic consistency and visual quality, particularly in extreme regimes.
中文标题/摘要
标题:DiT-JSCC:基于扩散变换器和语义表示的重新思考的深度JSCC
生成联合源信道编码(GJSCC)已成为在极端无线信道条件下(如超低带宽和低信噪比)实现高保真和鲁棒图像传输的新深度JSCC范式。近期研究通常采用扩散模型作为生成解码器,但它们经常产生视觉上逼真但语义一致性有限的结果。这一局限性源于重建导向的JSCC编码器与生成解码器之间的根本性不匹配,前者缺乏显式的语义可区分性,无法提供可靠的条件线索。本文提出了一种新颖的GJSCC骨干DiT-JSCC,能够联合学习语义优先的表示编码器和基于扩散变换器(DiT)的生成解码器,我们的开源项目旨在促进GJSCC的未来研究。具体而言,我们设计了一种语义-细节双分支编码器,自然地与粗到细条件DiT解码器对齐,在极端信道条件下优先考虑语义一致性。此外,还引入了一种基于kolmogorov复杂性的无训练自适应带宽分配策略,进一步提高传输效率,从而确实重新定义了生成解码时代的信息价值。大量实验表明,DiT-JSCC在语义一致性和视觉质量方面始终优于现有JSCC方法,特别是在极端条件下。
Summary / 总结
The paper introduces DiT-JSCC, a generative joint source-channel coding backbone that jointly learns a semantics-prioritized representation encoder and a diffusion transformer-based generative decoder. It targets the limited semantic consistency of earlier diffusion-based decoders by pairing a semantics-detail dual-branch encoder with a coarse-to-fine conditional DiT decoder, and adds a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity. Experimental results show that DiT-JSCC outperforms existing JSCC methods in both semantic consistency and visual quality, particularly under extreme channel conditions.
研究旨在通过解决现有扩散模型的局限性,改进在极端无线条件下图像传输的联合源信道编码(JSCC)。DiT-JSCC提出了一种新的GJSCC架构,包括一种语义优先的编码器和基于扩散变换器的生成解码器。该方法通过细粒度条件解码器和基于柯尔莫哥洛夫复杂性的自适应带宽分配策略,实现了更好的语义一致性和视觉质量,特别是在极端条件下。大量实验表明,DiT-JSCC在语义一致性和视觉质量方面均优于现有JSCC方法。
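The abstract does not spell out the bandwidth allocation rule. Purely as an illustration, the sketch below uses zlib-compressed length as a computable stand-in for Kolmogorov complexity (which is itself uncomputable) and splits a symbol budget in proportion to it. The function names, the compression proxy, and the proportional rule are all assumptions made for this example, not the authors' method.

```python
import zlib


def complexity_proxy(data: bytes) -> int:
    """Compressed length in bytes, a crude proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def allocate_bandwidth(blocks: list[bytes], total_symbols: int) -> list[int]:
    """Split total_symbols across blocks in proportion to the proxy (hypothetical rule)."""
    scores = [complexity_proxy(b) for b in blocks]
    total = sum(scores)
    shares = [total_symbols * s // total for s in scores]
    # Give any rounding remainder to the most complex block so the budget is used exactly.
    shares[scores.index(max(scores))] += total_symbols - sum(shares)
    return shares


flat = bytes(1024)               # constant block, highly compressible
varied = bytes(range(256)) * 4   # more varied content, compresses less well
shares = allocate_bandwidth([flat, varied], total_symbols=100)
```

Under this proxy the more varied block receives the larger share of the budget, which is the intuition behind complexity-aware allocation; a learned or perceptual measure could be substituted without changing the structure.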
ReCCur: A Recursive Corner-Case Curation Framework for Robust Vision-Language Understanding in Open and Edge Scenarios
Authors: Yihan Wei, Shenghai Yuan, Tianchen Deng, Boyang Lou, Enwen Hu
First: 2026-01-06T13:36:43+00:00 · Latest: 2026-01-06T13:36:43+00:00
Abstract
Corner cases are rare or extreme scenarios that drive real-world failures, but they are difficult to curate at scale: web data are noisy, labels are brittle, and edge deployments preclude large retraining. We present ReCCur (Recursive Corner-Case Curation), a low-compute framework that converts noisy web imagery into auditable fine-grained labels via a multi-agent recursive pipeline. First, large-scale data acquisition and filtering expands a domain vocabulary with a vision-language model (VLM), crawls the web, and enforces tri-modal (image, description, keyword) consistency with light human spot checks to yield refined candidates. Next, mixture-of-experts knowledge distillation uses complementary encoders (e.g., CLIP, DINOv2, BEiT) for kNN voting with dual-confidence activation and uncertainty sampling, converging to a high-precision set. Finally, region-evidence VLM adversarial labeling pairs a proposer (multi-granularity regions and semantic cues) with a validator (global and local chained consistency) to produce explainable labels and close the loop. On realistic corner-case scenarios (e.g., flooded-car inspection), ReCCur runs on consumer-grade GPUs, steadily improves purity and separability, and requires minimal human supervision, providing a practical substrate for downstream training and evaluation under resource constraints. Code and dataset will be released.
中文标题/摘要
标题:ReCCur:一种递归边缘案例策展框架,用于开放和边缘场景下的稳健视觉-语言理解
边缘案例是驱动现实世界失败的罕见或极端情况,但难以大规模策展:网络数据嘈杂,标签脆弱,边缘部署禁止大规模重新训练。我们提出了ReCCur(递归边缘案例策展),一种低计算量框架,通过多智能体递归管道将嘈杂的网络图像转换为可审计的细粒度标签。首先,大规模数据获取和过滤扩展了领域词汇表,使用视觉-语言模型(VLM)爬取网络,并通过轻量级的人工抽查确保三模态(图像、描述、关键词)一致性,从而产生精炼的候选者。其次,混合专家知识蒸馏使用互补编码器(例如,CLIP、DINOv2、BEiT)进行kNN投票,结合双重置信激活和不确定性采样,最终收敛到高精度集合。最后,区域证据VLM对抗标签将提案者(多粒度区域和语义线索)与验证者(全局和局部链式一致性)配对,生成可解释的标签并完成循环。在现实世界的边缘案例场景(例如,水淹车辆检查)中,ReCCur在消费级GPU上运行,逐步提高纯度和可分性,并需要最少的人工监督,为资源受限条件下的下游训练和评估提供实用的基础。代码和数据集将被发布。
Summary / 总结
ReCCur is a low-compute framework that curates corner-case scenarios for robust vision-language understanding. It uses a multi-agent recursive pipeline to acquire and filter web imagery, enforce tri-modal consistency, and distill knowledge from multiple encoders. The final step involves adversarial labeling by a proposer and validator to produce explainable labels. On realistic corner-case scenarios, ReCCur improves purity and separability with minimal human supervision and runs on consumer-grade GPUs, making it practical for resource-constrained environments.
ReCCur 是一个低计算框架,通过网络数据获取和过滤扩展领域词汇,使用多代理递归管道进行数据收集。然后使用互补编码器的知识蒸馏进行高精度标签生成,并采用区域证据 VLM 对抗标签对生成可解释的标签。在现实的边缘案例场景中,ReCCur 在消费级 GPU 上运行,逐步提高数据纯度和可分性,并需要少量的人工监督,使其适用于资源受限的环境。
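The kNN voting stage can be pictured with a small sketch. The abstract does not define "dual-confidence activation", so the two thresholds below (a per-encoder vote share and a cross-encoder agreement rate) are hypothetical stand-ins, as are the function names and the toy neighbor labels. Embedding extraction with CLIP, DINOv2, or BEiT is assumed to have happened already.

```python
from collections import Counter


def knn_vote(neighbor_labels: list[str]) -> tuple[str, float]:
    """Return the majority label among the k nearest neighbors and its vote share."""
    label, count = Counter(neighbor_labels).most_common(1)[0]
    return label, count / len(neighbor_labels)


def ensemble_label(per_encoder_neighbors: dict[str, list[str]],
                   min_share: float = 0.6,
                   min_agreement: float = 2 / 3):
    """Accept a label only if enough encoders vote for it confidently.

    Both thresholds are hypothetical. Returns the accepted label, or None
    to flag the sample as uncertain (a candidate for uncertainty sampling).
    """
    confident = [label
                 for label, share in map(knn_vote, per_encoder_neighbors.values())
                 if share >= min_share]
    if not confident:
        return None
    label, count = Counter(confident).most_common(1)[0]
    agreement = count / len(per_encoder_neighbors)
    return label if agreement >= min_agreement else None


# Toy neighbor labels for one image, one list per encoder (illustrative only).
sample = {
    "clip":   ["flooded", "flooded", "flooded", "dry", "flooded"],
    "dinov2": ["flooded", "flooded", "dry", "flooded", "flooded"],
    "beit":   ["dry", "flooded", "dry", "dry", "flooded"],
}
result = ensemble_label(sample)
```

Here two of the three encoders confidently vote "flooded", so the sample is accepted with that label. Samples that return None would be the ones routed to human spot checks or to the later VLM labeling stage.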
Batch-of-Thought: Cross-Instance Learning for Enhanced LLM Reasoning
Authors: Xuan Yang, Furong Jia, Roy Xie, Xiong Xi, Hengwei Bian, Jian Li, Monica Agrawal
First: 2026-01-06T11:47:45+00:00 · Latest: 2026-01-06T11:47:45+00:00
Abstract
Current Large Language Model reasoning systems process queries independently, discarding valuable cross-instance signals such as shared reasoning patterns and consistency constraints. We introduce Batch-of-Thought (BoT), a training-free method that processes related queries jointly to enable cross-instance learning. By performing comparative analysis across batches, BoT identifies high-quality reasoning templates, detects errors through consistency checks, and amortizes computational costs. We instantiate BoT within a multi-agent reflection architecture (BoT-R), where a Reflector performs joint evaluation to unlock mutual information gain unavailable in isolated processing. Experiments across three model families and six benchmarks demonstrate that BoT-R consistently improves accuracy and confidence calibration while reducing inference costs by up to 61%. Our theoretical and experimental analysis reveals when and why batch-aware reasoning benefits LLM systems.
中文标题/摘要
标题:Batch-of-Thought:增强LLM推理的跨实例学习
当前大型语言模型推理系统独立处理查询,丢弃了诸如共享推理模式和一致性约束等有价值的跨实例信号。我们提出了Batch-of-Thought (BoT),一种无需训练的方法,通过联合处理相关查询来实现跨实例学习。通过在批次之间进行比较分析,BoT 识别高质量的推理模板,通过一致性检查检测错误,并摊销计算成本。我们将在多智能体反思架构(BoT-R)中实例化BoT,其中Reflector执行联合评估以解锁孤立处理中不可用的互信息增益。在三个模型家族和六个基准上的实验表明,BoT-R 一致地提高了准确性和置信度校准,同时将推理成本降低了高达61%。我们的理论和实验分析揭示了何时以及为什么批次感知推理对LLM系统有益。
Summary / 总结
The paper introduces Batch-of-Thought (BoT), a method that processes related queries jointly to leverage cross-instance signals for enhanced reasoning. BoT identifies high-quality reasoning templates, detects errors through consistency checks, and reduces computational costs. The multi-agent reflection architecture BoT-R, which includes a Reflector for joint evaluation, further enhances these benefits. Experiments show that BoT-R improves accuracy and confidence calibration while reducing inference costs by up to 61%. The analysis reveals the conditions under which batch-aware reasoning is beneficial for LLM systems.
论文提出了Batch-of-Thought (BoT) 方法,该方法联合处理相关查询以利用跨实例信号来增强推理。BoT 识别高质量的推理模板,通过一致性检查检测错误,并减少计算成本。多代理反思架构 BoT-R 包括一个用于联合评估的 Reflector,进一步增强了这些益处。实验表明,BoT-R 提高了准确性和置信度校准,同时将推理成本降低了高达 61%。分析揭示了在什么条件下批量感知推理对 LLM 系统有益。
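One way to picture joint evaluation is a single review prompt that covers a whole batch, so the reflection step runs once per batch rather than once per query. The sketch below is a minimal illustration under that assumption. The prompt wording and both function names are hypothetical, and no model is called.

```python
import math


def build_reflection_prompt(batch: list[tuple[str, str]]) -> str:
    """Pack (query, draft answer) pairs into one prompt for a single joint review."""
    lines = [
        "Review the following answers together. Flag any answer whose reasoning",
        "is inconsistent with the reasoning used in the others.",
        "",
    ]
    for i, (query, answer) in enumerate(batch, start=1):
        lines.append(f"[{i}] Q: {query}")
        lines.append(f"[{i}] A: {answer}")
    return "\n".join(lines)


def reflector_calls(num_queries: int, batch_size: int) -> int:
    """Number of reflection calls when each call reviews one batch."""
    return math.ceil(num_queries / batch_size)


prompt = build_reflection_prompt([
    ("Is 91 prime?", "No, 91 = 7 * 13."),
    ("Is 97 prime?", "Yes, no prime up to 9 divides it."),
])
calls_batched = reflector_calls(num_queries=10, batch_size=4)
calls_isolated = reflector_calls(num_queries=10, batch_size=1)
```

With ten queries and a batch size of four, the reflection step runs three times instead of ten. This only illustrates how batching amortizes calls; the cost reduction of up to 61% reported in the abstract comes from the authors' experiments, not from this arithmetic.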
Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning
Authors: Guoqiang Liang, Jianyi Wang, Zhonghua Wu, Shangchen Zhou
First: 2026-01-06T11:00:17+00:00 · Latest: 2026-01-06T11:00:17+00:00
Comments: Project Page: https://ethanliang99.github.io/ZOOMIQA-Projectpage
Abstract
Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or provide low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA, enabling joint generation of quality descriptions and scores. However, we notice that existing VLM-based IQA methods tend to exhibit unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions; and 2) reinforcement learning (RL) for dynamic policy exploration, primarily stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, and supported by a Progressive Re-sampling Strategy to mitigate annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.
中文标题/摘要
标题:Zoom-IQA:基于可靠区域感知推理的图像质量评估
图像质量评估(IQA)是计算机视觉中的一个长期问题。以往的方法通常侧重于预测数值评分而没有解释,或者提供低级描述而缺乏精确的评分。最近的基于视觉语言模型(VLM)的推理方法在IQA方面显示出强大的潜力,能够同时生成质量描述和评分。然而,我们注意到现有的基于VLM的IQA方法往往由于其整合视觉和文本线索能力有限而表现出不可靠的推理。在本文中,我们引入了Zoom-IQA,这是一种基于VLM的IQA模型,旨在明确模拟关键的认知行为:不确定性意识、区域推理和迭代细化。具体而言,我们提出了一种两阶段训练管道:1)在我们的Grounded-Rationale-IQA(GR-IQA)数据集上进行监督微调(SFT),以教导模型将其评估扎根于关键区域;2)通过强化学习(RL)进行动态策略探索,主要通过我们的KL-Coverage正则化器来防止推理和评分多样性崩溃,并通过渐进重采样策略来减轻注释偏差。广泛的实验表明,Zoom-IQA在鲁棒性、可解释性和泛化能力方面有所提升。将其应用于下游任务,如图像恢复,进一步证明了Zoom-IQA的有效性。
Summary / 总结
Zoom-IQA is a VLM-based IQA model that addresses the limitations of existing methods by focusing on uncertainty awareness, region reasoning, and iterative refinement. It uses a two-stage training pipeline: supervised fine-tuning on a Grounded-Rationale-IQA dataset and reinforcement learning with a KL-Coverage regularizer and Progressive Re-sampling Strategy. The model demonstrates improved robustness, explainability, and generalization in IQA tasks and shows effectiveness in downstream applications like image restoration.
Zoom-IQA 是一种基于 VLM 的图像质量评估模型,通过区域感知推理和迭代细化来提高可靠性。它采用两阶段训练管道:在 Grounded-Rationale-IQA 数据集上进行监督微调和带有 KL-Coverage 正则化器和渐进重采样策略的强化学习。该模型在图像质量评估任务中展示了增强的鲁棒性、可解释性和泛化能力,并且其有效性在图像恢复等下游任务中得到了验证。
LOST-3DSG: Lightweight Open-Vocabulary 3D Scene Graphs with Semantic Tracking in Dynamic Environments
Authors: Sara Micol Ferraina, Michele Brienza, Francesco Argenziano, Emanuele Musumeci, Vincenzo Suriani, Domenico D. Bloisi, Daniele Nardi
First: 2026-01-06T10:44:19+00:00 · Latest: 2026-01-06T10:44:19+00:00
Abstract
Tracking objects that move within dynamic environments is a core challenge in robotics. Recent research has advanced this topic significantly; however, many existing approaches remain inefficient due to their reliance on heavy foundation models. To address this limitation, we propose LOST-3DSG, a lightweight open-vocabulary 3D scene graph designed to track dynamic objects in real-world environments. Our method adopts a semantic approach to entity tracking based on word2vec and sentence embeddings, enabling an open-vocabulary representation while avoiding the necessity of storing dense CLIP visual features. As a result, LOST-3DSG achieves superior performance compared to approaches that rely on high-dimensional visual embeddings. We evaluate our method through qualitative and quantitative experiments conducted in a real 3D environment using a TIAGo robot. The results demonstrate the effectiveness and efficiency of LOST-3DSG in dynamic object tracking. Code and supplementary material are publicly available on the project website at https://lab-rococo-sapienza.github.io/lost-3dsg/.
中文标题/摘要
标题:LOST-3DSG:动态环境中的轻量级开放词汇3D场景图及其语义跟踪
在动态环境中跟踪移动对象是机器人技术中的核心挑战。近期研究在这一领域取得了显著进展,但许多现有方法仍因依赖重模型而效率低下。为解决这一限制,我们提出LOST-3DSG,一种轻量级开放词汇3D场景图,旨在实现实时环境中的动态对象跟踪。我们的方法基于word2vec和句子嵌入采用语义实体跟踪,实现开放词汇表示,同时避免存储密集的CLIP视觉特征的必要性。因此,LOST-3DSG在性能上优于依赖高维视觉嵌入的方法。我们通过在真实3D环境中使用TIAGo机器人进行定性和定量实验来评估我们的方法。结果表明,LOST-3DSG在动态对象跟踪中的有效性和效率。代码和补充材料可在项目网站https://lab-rococo-sapienza.github.io/lost-3dsg/上公开获取。
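Tracking by semantic similarity can be sketched as nearest-neighbor matching over text embeddings. The example below assumes each scene-graph node stores a small embedding of its label or description and that a new observation is matched by cosine similarity. The three-dimensional toy vectors, the threshold, and the function names are illustrative assumptions, not values from the paper.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_node(observation: list[float],
               nodes: dict[str, list[float]],
               threshold: float = 0.8):
    """Return the best-matching node name, or None to treat it as a new object."""
    best_name, best_score = None, threshold
    for name, embedding in nodes.items():
        score = cosine(observation, embedding)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy embeddings standing in for word2vec or sentence vectors.
nodes = {"mug": [0.9, 0.1, 0.0], "chair": [0.0, 0.2, 0.9]}
moved = match_node([0.8, 0.2, 0.1], nodes)
unseen = match_node([0.0, 1.0, 0.0], nodes)
```

The first observation is matched to the existing "mug" node, so its position can be updated in place; the second falls below the threshold for every node and would be added as a new entity. Storing one short text vector per node is what lets this kind of design avoid keeping dense visual features.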
CaTS-Bench: Can Language Models Describe Time Series?
Authors: Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu
First: 2025-09-25T07:10:03+00:00 · Latest: 2026-01-06T10:33:53+00:00
Comments: 8 pages, 6 figures, 3 tables in the main paper. Many more in the appendix
Abstract
Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce CaTS-Bench, a comprehensive benchmark for Context-aware Time Series reasoning across 11 diverse domains, centered on a gold-standard evaluation set of 1746 human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of 910 multiple-choice questions and tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal language generation in numeric domains.
中文标题/摘要
标题:CaTS-Bench:语言模型能否描述时间序列?
时间序列描述,即用自然语言描述时间序列的任务,需要数值和时间推理、趋势解释以及上下文理解。然而,现有的基准测试往往依赖于完全合成或通用的描述,通常忽略了元数据和视觉表示。我们引入了**CaTS-Bench**,这是一个针对11个不同领域的上下文感知时间序列推理的综合基准测试,围绕一个包含1746个人类重写描述的标准评估集,该集衡量模型如何将数值趋势转化为即时可理解的叙述。为了解决人类标注数据稀缺的问题,我们还提出了一种可扩展的生成高质量合成描述的管道,并验证了其质量。我们在基准测试上评估了领先的空间语言模型,发现即使是专有模型也难以捕捉时间描述中的数值细微差别,而使用合成数据微调开源模型则能显著提高性能。最后,发布了一个包含910个选择题和定制化数值指标的诊断套件,以评估时间序列特定的推理能力,使CaTS-Bench成为可靠的基础,用于数值领域中的接地多模态语言生成。
RPIQ: Residual-Projected Multi-Collaboration Closed-Loop and Single Instance Quantization for Visually Impaired Assistance
Authors: Xuanyu Wang, Haisen Su, Jingtao Zhang, Xiangxiang Wang, Yongbin Yu, Manping Fan, Bo Gong, Siqi Chen, Mingsheng Cao, Liyong Ren
First: 2026-01-06T10:22:34+00:00 · Latest: 2026-01-06T10:22:34+00:00
Abstract
Visually impaired users face significant challenges in daily information access and real-time environmental perception, and there is an urgent need for intelligent assistive systems with accurate recognition capabilities. Although large-scale models provide effective solutions for perception and reasoning, their practical deployment on assistive devices is severely constrained by excessive memory consumption and high inference costs. Moreover, existing quantization strategies often ignore inter-block error accumulation, leading to degraded model stability. To address these challenges, this study proposes a novel quantization framework -- Residual-Projected Multi-Collaboration Closed-Loop and Single Instance Quantization (RPIQ), whose quantization process adopts a multi-collaborative closed-loop compensation scheme based on Single Instance Calibration and Gauss-Seidel Iterative Quantization. Experiments on various types of large-scale models, including language models such as OPT, Qwen, and LLaMA, as well as vision-language models such as CogVLM2, demonstrate that RPIQ can compress models to 4-bit representation while significantly reducing peak memory consumption (approximately 60%-75% reduction compared to original full-precision models). The method maintains performance very close to that of full-precision models across multiple language and visual tasks, and exhibits excellent recognition and reasoning capabilities in key applications such as text understanding and visual question answering in complex scenarios. While verifying the effectiveness of RPIQ for deployment in real assistive systems, this study also advances the computational efficiency and reliability of large models, enabling them to provide visually impaired users with the required information accurately and rapidly.
中文标题/摘要
标题:RPIQ:残差投影多协作闭环和单实例量化在视障辅助中的应用
视障用户在日常信息获取和实时环境感知方面面临巨大挑战,迫切需要具有准确识别能力的智能辅助系统。尽管大规模模型为感知和推理提供了有效的解决方案,但它们在辅助设备上的实际部署受到过度内存消耗和高推理成本的严重限制。此外,现有的量化策略往往忽视了块间误差累积,导致模型稳定性下降。为解决这些挑战,本研究提出了一种新的量化框架——残差投影多协作闭环和单实例量化(RPIQ),其量化过程采用基于单实例校准和高斯-赛德尔迭代量化的多协作闭环补偿方案。实验表明,RPIQ可以在保持模型压缩至4位表示的同时,显著降低峰值内存消耗(与原始全精度模型相比,约减少60%-75%)。该方法在多种语言和视觉任务中保持了与全精度模型高度接近的性能,并在文本理解和复杂场景中的视觉问答等关键应用中表现出卓越的识别和推理能力。本研究不仅验证了RPIQ在实际辅助系统部署中的有效性,还提高了大型模型的计算效率和可靠性,使其能够为视障用户提供准确快速的信息。
Summary / 总结
This study addresses the challenges faced by visually impaired users in accessing information and perceiving their environment. It proposes RPIQ, a novel quantization framework that uses a multi-collaborative closed-loop compensation scheme, based on Single Instance Calibration and Gauss-Seidel Iterative Quantization, to compress large-scale models to 4-bit representation. Experiments show that RPIQ reduces peak memory consumption by approximately 60%-75% relative to full-precision models while keeping performance close to theirs and retaining strong recognition and reasoning capabilities across language and visual tasks, making it suitable for deployment in assistive systems for visually impaired users.
该研究针对视障用户在获取日常信息和实时环境感知方面面临的挑战,提出了一种名为RPIQ的新型量化框架,该框架采用多协作闭环补偿方案将大规模模型压缩到4位表示,相比全精度模型显著减少了60%-75%的峰值内存消耗。该方法在各种任务中保持了接近全精度模型的性能,并在复杂场景下的文本理解和视觉问答等关键应用中展示了出色的识别和推理能力,使其适用于视障用户的辅助系统部署。
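A quick back-of-the-envelope calculation shows where a reduction in the 60%-75% range could come from. The abstract says "full-precision" without giving a bit width, so the 16-bit baseline and the 7-billion-parameter model size below are assumptions chosen only to make the arithmetic concrete.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def relative_reduction(full_bits: int, quant_bits: int) -> float:
    """Fraction of weight storage saved by lowering the bit width."""
    return 1 - quant_bits / full_bits


# Hypothetical 7-billion-parameter model with a 16-bit baseline (both assumptions).
fp16_gb = weight_memory_gb(7e9, 16)
int4_gb = weight_memory_gb(7e9, 4)
saving = relative_reduction(16, 4)
```

Under these assumptions the weights shrink from 14 GB to 3.5 GB, a 75% saving that coincides with the upper end of the reported range. Peak memory also includes items such as activations and quantization metadata, which may explain why measured reductions can fall below that figure.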
CogCanvas: Verbatim-Grounded Artifact Extraction for Long LLM Conversations
Authors: Tao An
Venue: ACL
First: 2025-12-23T16:45:15+00:00 · Latest: 2026-01-06T09:48:29+00:00
Comments: 15 pages, 5 figures. Submitted to ACL Rolling Review January 2026
Abstract
Conversation summarization loses nuanced details: when asked about coding preferences after 40 turns, summarization recalls "use type hints" but drops the critical constraint "everywhere" (19.0% exact match vs. 93.0% for our approach).
We present CogCanvas, a training-free framework inspired by how teams use whiteboards to anchor shared memory. Rather than compressing conversation history, CogCanvas extracts verbatim-grounded artifacts (decisions, facts, reminders) and retrieves them via a temporal-aware graph.
On the LoCoMo benchmark (all 10 conversations from the ACL 2024 release), CogCanvas achieves the highest overall accuracy among training-free methods (32.4%), outperforming RAG (24.6%) by +7.8pp, with decisive advantages on complex reasoning tasks: +20.6pp on temporal reasoning (32.7% vs. 12.1% RAG) and +1.1pp on multi-hop questions (41.7% vs. 40.6% RAG). CogCanvas also leads on single-hop retrieval (26.6% vs. 24.6% RAG). Ablation studies reveal that BGE reranking contributes +7.7pp, making it the largest contributor to CogCanvas's performance.
While heavily optimized approaches achieve higher absolute scores through dedicated training (EverMemOS: ~92%), our training-free approach provides practitioners with an immediately deployable alternative that significantly outperforms standard baselines. Code and data: https://github.com/tao-hpu/cog-canvas
中文标题/摘要
标题:CogCanvas:基于逐字接地的长LLM对话提取
对话总结丢失了细微的细节:在询问编码偏好时,总结只回忆起“使用类型提示”,但忽略了关键约束“处处”(精确匹配度为19.0%,而我们方法为93.0%)。
我们提出了CogCanvas,一种无需训练的框架,灵感来源于团队如何使用白板来锚定共享记忆。CogCanvas 不压缩对话历史,而是提取逐字接地的制品(决策、事实、提醒),并通过时间感知图进行检索。
在LoCoMo基准测试(ACL 2024发布的所有10个对话)中,CogCanvas 在无需训练的方法中整体准确率最高(32.4%),比RAG(24.6%)高出7.8个百分点,特别是在复杂推理任务上:时间推理上高出20.6个百分点(32.7% vs. 12.1% RAG),多跳问题上高出1.1个百分点(41.7% vs. 40.6% RAG)。CogCanvas 在单跳检索上也领先(26.6% vs. 24.6% RAG)。消融研究显示,BGE重新排序贡献了7.7个百分点,是CogCanvas性能提升的最大因素。
尽管高度优化的方法通过专门训练获得更高的绝对分数(EverMemOS:约92%),但我们的无需训练方法为实践者提供了立即可部署的替代方案,显著优于标准基线。代码和数据:https://github.com/tao-hpu/cog-canvas
Summary / 总结
CogCanvas is a training-free framework that, rather than summarizing long LLM conversations, extracts verbatim-grounded artifacts (decisions, facts, reminders) and retrieves them via a temporal-aware graph. On the LoCoMo benchmark it outperforms RAG by 7.8 percentage points in overall accuracy (32.4% vs. 24.6%), with its largest gain on temporal reasoning (+20.6pp), and it also leads in single-hop retrieval. Ablation studies attribute +7.7 percentage points to BGE reranking, the largest single contributor. While heavily optimized, trained approaches achieve higher scores, CogCanvas offers practitioners a practical, immediately deployable alternative.
CogCanvas 是一个无需训练的框架,从长 LLM 对话中提取字面量锚定的片段以提高总结和检索任务的准确性。它在 LoCoMo 基准测试中比 RAG 高出 7.8 个百分点,特别是在复杂推理任务上表现更优,并在单跳检索中领先。消融研究显示,BGE 重新排序是其性能提升的最大贡献者。
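The percentage-point gains quoted in the abstract can be checked directly from the reported accuracies. The snippet below only recomputes the differences. The dictionary keys are labels chosen for this example.

```python
# (CogCanvas accuracy %, RAG accuracy %) as reported in the abstract.
reported = {
    "overall":    (32.4, 24.6),
    "temporal":   (32.7, 12.1),
    "multi_hop":  (41.7, 40.6),
    "single_hop": (26.6, 24.6),
}

# Gain in percentage points, rounded to one decimal to absorb float noise.
gains = {task: round(ours - rag, 1) for task, (ours, rag) in reported.items()}
```

The recomputed gains are 7.8, 20.6, 1.1, and 2.0 percentage points, matching the figures stated for overall, temporal, and multi-hop accuracy. The multi-hop margin is the narrowest of the four, and the single-hop lead, which the abstract gives only as raw accuracies, works out to 2.0 points.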
Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance
Authors: Xiangxiang Wang, Xuanyu Wang, YiJia Luo, Yongbin Yu, Manping Fan, Jingtao Zhang, Liyong Ren
First: 2025-08-25T16:32:32+00:00 · Latest: 2026-01-06T09:46:22+00:00
Comments: 28 pages,9 figures
Abstract
Visually impaired individuals face significant challenges in environmental perception. Traditional assistive technologies often lack adaptive intelligence, focusing on individual components rather than integrated systems. While Vision-Language Models (VLMs) offer a promising path to richer, integrated understanding, their deployment is severely limited by substantial computational requirements, demanding dozens of gigabytes of memory. To address these gaps in computational efficiency and integrated design, this study proposes a dual technological innovation framework: a cross-modal differentiated quantization framework for VLMs and a scene-aware vectorized memory multi-agent system. The quantization framework implements differentiated strategies, reducing memory from 38GB to 11.3GB. The multi-agent system uses vectorized memory and perception-memory-reasoning workflows to provide environmental information beyond the current view, achieving 2.83-3.52s latency to initial speech output. Experiments show the quantized 19B-parameter model only experiences a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory. This research advances computational efficiency and assistive technology, offering comprehensive assistance in scene perception, text recognition, and navigation.
中文标题/摘要
标题:面向视障辅助的场景感知向量记忆多智能体框架:跨模态差异化量化VLMs
视障人士在环境感知方面面临重大挑战。传统辅助技术往往缺乏适应性智能,侧重于个体组件而非集成系统。尽管视觉语言模型(VLMs)为更丰富的集成理解提供了前景,但其部署受到巨大计算需求的限制,需要数十吉字节的内存。为解决这些计算效率和集成设计的差距,本研究提出了一种双技术创新框架:跨模态差异化量化框架和场景感知向量记忆多智能体系统。量化框架实施差异化策略,将内存从38GB减少到11.3GB。多智能体系统使用向量记忆和感知-记忆-推理工作流提供超出当前视域的环境信息,实现从初始语音输出2.83-3.52秒的延迟。实验表明,量化后的19B参数模型在MMBench上的性能下降仅为2.05%,在OCR-VQA上的准确率为63.7%(原为64.9%),优于等内存的小型模型。该研究推进了计算效率和辅助技术的发展,提供了场景感知、文本识别和导航的全面辅助。
Summary / 总结
This study addresses the challenges faced by visually impaired individuals in environmental perception by proposing a dual technological framework: a cross-modal differentiated quantization framework for Vision-Language Models (VLMs) and a scene-aware vectorized memory multi-agent system. The quantization framework reduces memory usage from 38GB to 11.3GB, while the multi-agent system provides environmental information beyond the current view with a latency of 2.83-3.52s to initial speech output. Experiments show that the quantized 19B-parameter model experiences only a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA, outperforming smaller models with equivalent memory.
该研究针对视障人士在环境感知方面面临的挑战,提出了一种双技术框架:用于视觉-语言模型(VLMs)的跨模态差异化量化框架和场景感知向量化记忆多代理系统。量化框架将内存使用量从38GB减少到11.3GB,而多代理系统能够在2.83-3.52秒的延迟下提供超出当前视野的环境信息。实验表明,量化后的19B参数模型在OCR-VQA上的准确率为63.7%,并优于具有同等内存的较小模型,展示了在场景感知、文本识别和导航方面改进的计算效率和辅助技术。
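The reported memory and accuracy figures translate into relative changes as follows. The helper name is chosen for this example, and the inputs are taken from the abstract as stated.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from before to after, in percent."""
    return (before - after) / before * 100


# Memory footprint: 38 GB before quantization, 11.3 GB after.
memory_saving = round(percent_reduction(38.0, 11.3), 1)

# OCR-VQA accuracy: 64.9 for the original model, 63.7 after quantization.
ocr_vqa_drop_points = round(64.9 - 63.7, 1)
ocr_vqa_drop_relative = round(percent_reduction(64.9, 63.7), 1)
```

This works out to roughly a 70.3% smaller memory footprint, at a cost of 1.2 accuracy points on OCR-VQA (about 1.8% relative).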
FCMBench: A Comprehensive Financial Credit Multimodal Benchmark for Real-world Applications
Authors: Yehui Yang, Dalu Yang, Wenshuo Zhou, Fangxin Shang, Yifan Liu, Jie Ren, Haojun Fei, Qing Yang, Yanwu Xu, Tao Chen
First: 2026-01-01T00:42:54+00:00 · Latest: 2026-01-06T08:08:49+00:00
Abstract
As multimodal AI becomes widely used for credit risk assessment and document review, a domain-specific benchmark is urgently needed that (1) reflects documents and workflows specific to financial credit applications, (2) includes credit-specific understanding and real-world robustness, and (3) preserves privacy compliance without sacrificing practical utility. Here, we introduce FCMBench-V1.0 -- a large-scale financial credit multimodal benchmark for real-world applications, covering 18 core certificate types, with 4,043 privacy-compliant images and 8,446 QA samples. The FCMBench evaluation framework consists of three dimensions: Perception, Reasoning, and Robustness, including 3 foundational perception tasks, 4 credit-specific reasoning tasks that require decision-oriented understanding of visual evidence, and 10 real-world acquisition artifact types for robustness stress testing. To reconcile compliance with realism, we construct all samples via a closed synthesis-capture pipeline: we manually synthesize document templates with virtual content and capture scenario-aware images in-house. This design also mitigates pre-training data leakage by avoiding web-sourced or publicly released images. FCMBench can effectively discriminate performance disparities and robustness across modern vision-language models. Extensive experiments were conducted on 23 state-of-the-art vision-language models (VLMs) from 14 top AI companies and research institutes. Among them, Gemini 3 Pro achieves the best F1 (%) score as a commercial model (64.61), Qwen3-VL-235B achieves the best score as an open-source baseline (57.27), and our financial credit-specific model, Qfin-VL-Instruct, achieves the top overall score (64.92). Robustness evaluations show that even top-performing models suffer noticeable performance drops under acquisition artifacts.
中文标题/摘要
标题:FCMBench:全面的金融信贷多模态基准,适用于实际应用
随着多模态AI在信用风险评估和文件审查中的广泛应用,迫切需要一个特定领域的基准,该基准(1)反映金融信贷应用中的文档和工作流程,(2)包含信用特定的理解和现实世界的鲁棒性,(3)在不牺牲实用性的前提下保护隐私合规。在此,我们介绍了FCMBench-V1.0——一个大规模的金融信贷多模态基准,适用于实际应用,涵盖18种核心证书类型,包含4,043张隐私合规图像和8,446个问答样本。FCMBench评估框架包括三个维度:感知、推理和鲁棒性,包括3个基础感知任务、4个需要对视觉证据进行决策导向理解的信用特定推理任务,以及10种实际世界获取的缺陷类型,用于鲁棒性压力测试。为了平衡合规性和现实性,我们通过一个封闭的合成-捕获管道构建所有样本:我们手动合成带有虚拟内容的文档模板,并在内部拍摄场景感知图像。此设计还通过避免使用网络来源或公开发布的图像来减轻预训练数据泄露的风险。FCMBench能够有效区分现代视觉-语言模型之间的性能差异和鲁棒性。我们在14家顶级AI公司和研究机构的23种最先进的视觉-语言模型(VLMs)上进行了广泛的实验。其中,Gemini 3 Pro作为商用模型获得最佳F1(%)分数(64.61),Qwen3-VL-235B作为开源基线获得最佳分数(57.27),而我们专门针对金融信贷的模型Qfin-VL-Instruct获得最高总体分数(64.92)。鲁棒性评估表明,即使表现最佳的模型在获取缺陷下也会出现明显的性能下降。
Summary / 总结
FCMBench is a comprehensive benchmark for financial credit applications, addressing the need for a domain-specific dataset that includes privacy compliance and real-world robustness. It consists of 4,043 privacy-compliant images and 8,446 QA samples covering 18 core certificate types. The evaluation framework includes perception, reasoning, and robustness tasks, and extensive experiments on 23 state-of-the-art vision-language models show that Qfin-VL-Instruct achieves the top overall score of 64.92, while Gemini 3 Pro and Qwen3-VL-235B achieve 64.61 and 57.27 respectively. Robustness evaluations indicate that even top-performing models experience significant performance drops under real-world acquisition artifacts.
FCMBench 是一个全面的金融信用多模态基准,旨在评估实际应用,涵盖18种证书类型,包含4,043张隐私合规图像和8,446个问答样本。它包括感知、推理和鲁棒性三个评估维度,任务针对金融信用应用定制。广泛实验显示,Qfin-VL-Instruct 在整体表现上得分最高,而 Gemini 3 Pro 和 Qwen3-VL-235B 分别在商用和开源类别中表现最佳。鲁棒性评估表明,即使顶级模型在实际获取的图像下也表现出显著的性能下降。
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats
Authors: Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang
Venue: NeurIPS 2025
First: 2025-11-21T13:57:38+00:00 · Latest: 2026-01-06T08:06:33+00:00
Comments: Accepted to NeurIPS 2025, Project Page: https://github.com/SooLab/AllPath
Abstract
Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.
中文标题/摘要
标题:Intervene-All-Paths: 统一跨对齐格式抑制LVLM幻觉
尽管大型视觉-语言模型(LVLMs)在广泛的任务中表现出色,但它们仍然容易产生幻觉。在本研究中,我们提出了一种与LVLM中Transformer因果架构相一致的综合干预框架,整合了不同干预路径对幻觉的影响。我们发现LVLM中的幻觉并非源自单一因果路径,而是来自图像到输入文本、图像到输出文本以及文本到文本路径之间的相互作用。我们还首次发现,LVLM会根据问题-答案对齐格式依赖不同的路径。基于这些见解,我们提出了简单而有效的方法来识别并干预每个路径中的关键幻觉注意力头,并针对判别式和生成式格式进行了定制。在多个基准测试中的实验表明,我们的方法能够一致地减少不同对齐类型中的幻觉。
Summary / 总结
This study addresses the issue of hallucinations in Large Vision-Language Models (LVLMs) by proposing a unified intervention framework that considers the interplay among different causal pathways. The research identifies that hallucinations arise from the interaction of image-to-input-text, image-to-output-text, and text-to-text pathways, and that LVLMs rely on different pathways depending on the question-answer alignment format. The proposed methods effectively reduce hallucinations across various alignment types, demonstrating consistent improvements across multiple benchmarks.
该研究通过提出一个综合的干预框架来解决大型视觉语言模型(LVLM)中的幻觉问题,该框架考虑了不同因果路径之间的相互作用。研究发现,幻觉是由图像到输入文本、图像到输出文本以及文本到文本路径之间的交互引起的,并且LVLM在不同问题-答案对齐格式下依赖于不同的路径。提出的干预方法在多种对齐类型下有效减少了幻觉,展示了在多个基准测试中的持续改进。
Towards Zero-Shot Point Cloud Registration Across Diverse Scales, Scenes, and Sensor Setups
Authors: Hyungtae Lim, Minkyun Seo, Luca Carlone, Jaesik Park
Venue: ICCV 2025 highlight
First: 2026-01-06T06:51:24+00:00 · Latest: 2026-01-06T06:51:24+00:00
Comments: 18 pages, 15 figures. Extended version of our ICCV 2025 highlight paper [arXiv:2503.07940]. arXiv admin note: substantial text overlap with arXiv:2503.07940
Abstract
Some deep learning-based point cloud registration methods struggle with zero-shot generalization, often requiring dataset-specific hyperparameter tuning or retraining for new environments. We identify three critical limitations: (a) fixed user-defined parameters (e.g., voxel size, search radius) that fail to generalize across varying scales, (b) learned keypoint detectors that exhibit poor cross-domain transferability, and (c) absolute coordinates that amplify scale mismatches between datasets. To address these three issues, we present BUFFER-X, a training-free registration framework that achieves zero-shot generalization through: (a) geometric bootstrapping for automatic hyperparameter estimation, (b) distribution-aware farthest point sampling to replace learned detectors, and (c) patch-level coordinate normalization to ensure scale consistency. Our approach employs hierarchical multi-scale matching to extract correspondences across local, middle, and global receptive fields, enabling robust registration in diverse environments. For efficiency-critical applications, we introduce BUFFER-X-Lite, which reduces total computation time by 43% (relative to BUFFER-X) through early exit strategies and fast pose solvers while preserving accuracy. We evaluate on a comprehensive benchmark comprising 12 datasets spanning object-scale, indoor, and outdoor scenes, including cross-sensor registration between heterogeneous LiDAR configurations. Results demonstrate that our approach generalizes effectively without manual tuning or prior knowledge of test domains. Code: https://github.com/MIT-SPARK/BUFFER-X.
中文标题/摘要
标题:跨尺度、场景和传感器配置的零样本点云配准
一些基于深度学习的点云配准方法在零样本泛化方面存在困难,通常需要针对新环境进行数据集特定的超参数调整或重新训练。我们识别了三个关键限制:(a) 固定的用户定义参数(例如,体素大小、搜索半径),这些参数无法在不同尺度下泛化;(b) 学习到的关键点检测器在跨域迁移方面表现不佳;(c) 绝对坐标放大了数据集之间的尺度不匹配。为了解决这三个问题,我们提出了BUFFER-X,这是一种无需训练的配准框架,通过:(a) 几何自举实现自动超参数估计;(b) 分布感知的最远点采样来替代学习到的检测器;(c) 块级(patch-level)坐标归一化以确保尺度一致性。我们的方法采用分层多尺度匹配来提取局部、中间和全局感受野之间的对应关系,从而在各种环境中实现稳健的配准。对于效率关键的应用,我们引入了BUFFER-X-Lite,通过早期退出策略和快速姿态求解器将总计算时间减少了43%(相对于BUFFER-X),同时保持了准确性。我们在一个包含12个数据集的基准测试上进行了评估,这些数据集涵盖了对象尺度、室内和室外场景,包括异构LiDAR配置之间的跨传感器配准。结果表明,我们的方法在无需手动调整或了解测试域先验知识的情况下实现了有效的泛化。代码:https://github.com/MIT-SPARK/BUFFER-X.
Summary / 总结
This paper addresses the challenge of zero-shot generalization in point cloud registration by introducing BUFFER-X, a training-free framework that automatically estimates hyperparameters, uses distribution-aware sampling, and normalizes coordinates to ensure scale consistency. The approach achieves robust registration across diverse scales, scenes, and sensor setups without manual tuning. BUFFER-X-Lite further enhances efficiency by reducing computation time by 43% while maintaining accuracy. Evaluations on 12 datasets show effective zero-shot generalization.
该论文通过引入BUFFER-X,一种无需训练的框架,自动估计超参数、使用分布感知采样并归一化坐标以确保尺度一致性,来解决点云配准的零样本泛化问题。该方法在多种环境和传感器配置下实现稳健的配准,无需手动调整。BUFFER-X-Lite通过减少43%的计算时间进一步提高效率,同时保持准确性。实验结果表明,该方法在12个数据集上实现了有效的零样本泛化。
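To make two of the BUFFER-X ingredients above concrete, here is a minimal sketch of plain farthest point sampling and of patch-level coordinate normalization. It is an illustration under stated assumptions, not the authors' implementation: the sampling shown is the classic variant rather than the distribution-aware one named in the abstract, and the function names `farthest_point_sampling` and `normalize_patch` are hypothetical.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Classic FPS: repeatedly pick the point farthest from all points chosen so far."""
    chosen = [start]
    # dist[i] holds the distance from point i to its nearest already-chosen point.
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen

def normalize_patch(patch, center, radius):
    """Express patch points relative to the patch center, scaled by the patch radius,
    so that patches from differently scaled clouds become comparable."""
    return [tuple((c - o) / radius for c, o in zip(p, center)) for p in patch]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 2.0, 0.0)]
idx = farthest_point_sampling(points, 3)
print(idx)
```

Because normalized coordinates are unitless, the same descriptor network can in principle be applied to object-scale and outdoor clouds without changing a metric radius by hand.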
FLUID: Training-Free Face De-identification via Latent Identity Substitution
Authors: Jinhyeong Park, Shaheryar Muhammad, Seangmin Lee, Jong Taek Lee, Soon Ki Jung
First: 2025-11-21T07:18:37+00:00 · Latest: 2026-01-06T06:33:15+00:00
Abstract
Current face de-identification methods that replace identifiable cues in the face region with other content sacrifice utilities that contribute to realism, such as age and gender. To recover the lost realism, we present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a single-input face de-identification framework that directly replaces identity features in the latent space of a pretrained diffusion model without affecting the model's weights. We reinterpret face de-identification as an image editing task in the latent h-space of a pretrained unconditional diffusion model. Our framework estimates identity-editing directions through optimization guided by loss functions that encourage attribute preservation while suppressing identity signals. We further introduce both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experiments on CelebA-HQ and FFHQ show that FLUID achieves a superior balance between identity suppression and attribute preservation, outperforming existing de-identification approaches in both qualitative and quantitative evaluations.
中文标题/摘要
标题:FLUID:无需训练的面部去标识化通过潜在身份置换实现
当前的面部去标识化方法用其他内容替换面部区域中的可识别线索,却牺牲了年龄、性别等有助于真实感的效用属性。为了恢复受损的真实感,我们提出了FLUID(Face de-identification in the Latent space via Utility-preserving Identity Displacement),这是一种单输入面部去标识化框架,它直接在预训练扩散模型的潜在空间中替换身份特征,而不影响模型的权重。我们将面部去标识化重新解释为在预训练无条件扩散模型的潜在h空间中的图像编辑任务。我们的框架通过优化来估计身份编辑方向,所用损失函数在抑制身份信号的同时鼓励属性保留。我们还引入了线性和测地线(基于切线)编辑方案,以有效地导航潜在流形。在CelebA-HQ和FFHQ上的实验表明,FLUID在身份抑制和属性保留之间实现了更好的平衡,在定性和定量评估中均优于现有去标识化方法。
Summary / 总结
The research aims to develop a training-free face de-identification method that preserves realism by substituting identity features in the latent space of a pretrained diffusion model. The FLUID framework directly replaces identity features without altering the model's weights, using optimization to encourage attribute preservation while suppressing identity signals. Experiments on CelebA-HQ and FFHQ demonstrate that FLUID outperforms existing methods in balancing identity suppression and attribute preservation, both qualitatively and quantitatively.
研究旨在通过在预训练扩散模型的潜在空间中直接替换身份特征,开发一种无需训练的面部去标识化方法,以保持现实感。FLUID框架通过优化来保留属性并抑制身份信号,而不改变模型的权重。实验结果表明,FLUID在CelebA-HQ和FFHQ上在身份抑制和属性保留之间的平衡上优于现有方法,无论是定性还是定量评价都表现出色。
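The difference between the linear and geodesic (tangent-based) editing schemes mentioned in the FLUID abstract can be sketched as follows. This is only an illustration: it assumes the latent manifold is approximated by a hypersphere of radius ||h||, which the abstract does not state, and the names `linear_edit` and `geodesic_edit` are hypothetical.

```python
import math

def linear_edit(h, d, alpha):
    """Straight-line edit: move the latent h by alpha along direction d."""
    return [hi + alpha * di for hi, di in zip(h, d)]

def geodesic_edit(h, d, theta):
    """Great-circle edit (assumed spherical manifold): rotate h by angle theta
    toward the tangent component of d, which keeps the norm of h unchanged."""
    norm_h = math.sqrt(sum(x * x for x in h))
    if norm_h == 0.0:
        return list(h)
    # Remove the component of d that is parallel to h, leaving a tangent vector.
    coef = sum(di * hi for di, hi in zip(d, h)) / (norm_h ** 2)
    t = [di - coef * hi for di, hi in zip(d, h)]
    norm_t = math.sqrt(sum(x * x for x in t))
    if norm_t == 0.0:
        return list(h)
    return [math.cos(theta) * hi + math.sin(theta) * norm_h * ti / norm_t
            for hi, ti in zip(h, t)]

h, d = [1.0, 0.0], [1.0, 1.0]
print(linear_edit(h, d, 0.5))
print(geodesic_edit(h, d, math.pi / 2))
```

The linear edit changes the latent norm as alpha grows, while the spherical edit keeps it fixed, which is one common motivation for moving along the manifold rather than across it.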
RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Authors: Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
First: 2025-12-15T18:52:43+00:00 · Latest: 2026-01-06T06:06:57+00:00
Comments: Project page: https://zhoues.github.io/RoboTracer
Abstract
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes. See the project page at https://zhoues.github.io/RoboTracer.
中文标题/摘要
标题:RoboTracer:通过视觉语言模型中的推理掌握空间跟踪
空间跟踪是机器人基本的具身交互能力之一,由于需要多步度量导向的推理和复杂的空间指代以及现实世界的度量测量,因此本质上具有挑战性。然而,现有方法在处理这种组合任务时存在困难。为此,我们提出RoboTracer,这是一种3D感知的VLM,首先通过通用空间编码器和回归监督解码器实现3D空间指代和测量,增强监督微调(SFT)期间的尺度意识。此外,RoboTracer通过度量敏感的过程奖励进行强化微调(RFT),监督关键中间感知提示,以准确生成空间跟踪。为了支持SFT和RFT训练,我们引入了TraceSpatial,这是一个包含3000万QA对的大规模数据集,涵盖了户外/室内/桌面场景,并支持复杂的推理过程(多达9步)。我们还提出了TraceSpatial-Bench,这是一个具有挑战性的基准,填补了空间跟踪评估的空白。实验结果表明,RoboTracer在空间理解、测量和指代方面超越了基线,平均成功率达到了79.1%,并且在TraceSpatial-Bench上也以显著优势超越了Gemini-2.5-Pro,准确率高出36%。值得注意的是,RoboTracer可以与各种控制策略结合,执行跨不同机器人(UR5,G1人形机器人)的复杂场景中的长期动态任务。请访问项目页面:https://zhoues.github.io/RoboTracer/
Summary / 总结
RoboTracer is designed to address the challenges of spatial tracing in robotics by integrating 3D-aware vision-language models with multi-step metric-grounded reasoning. It uses a universal spatial encoder and regression-supervised decoder for 3D spatial referring and measuring, and reinforcement fine-tuning with metric-sensitive process rewards to enhance reasoning. RoboTracer outperforms existing methods with an average success rate of 79.1% and achieves state-of-the-art performance on the TraceSpatial-Bench, surpassing Gemini-2.5-Pro by 36% accuracy. It can be applied to various robots and tasks in complex real-world environments.
RoboTracer 是一种利用 3D 意识视觉语言模型来解决机器人领域中空间跟踪挑战的方法。它通过一个通用的空间编码器和回归监督解码器实现 3D 空间引用和测量,并采用强化微调来增强多步度量导向的推理。该模型在包含 3000 万 QA 对的 TraceSpatial 数据集上进行训练,涵盖复杂的推理过程。RoboTracer 的平均成功率达到了 79.1%,并在 TraceSpatial-Bench 基准测试中超越了 Gemini-2.5-Pro,准确率高出 36%。
DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
First: 2025-12-31T17:31:29+00:00 · Latest: 2026-01-06T05:24:09+00:00
Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)
Abstract
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Project website: https://darkeqa-benchmark.github.io/
中文标题/摘要
标题:DarkEQA:在低光室内环境中的视觉语言模型体态问答基准测试
视觉语言模型(VLMs)越来越多地被用作体态代理的核心推理模块。现有的基准测试在理想、光线充足的条件下评估其能力,但全天候24/7运行需要在各种视觉退化条件下表现出色,包括夜间或黑暗环境中的低光条件——这一核心需求被很大程度上忽视了。为应对这一未充分探索的挑战,我们提出了DarkEQA,这是一个开源基准测试,用于在多级低光条件下评估与体态问答相关的感知基本能力。DarkEQA通过在受控退化条件下评估从第一人称观察中进行问答来隔离感知瓶颈,从而实现可归因的鲁棒性分析。DarkEQA的一个关键设计特点是其物理保真度:视觉退化在线性RAW空间中建模,模拟基于物理的照明下降和传感器噪声,随后通过ISP启发式的渲染管道。我们通过评估一系列最先进的VLMs和低光图像增强(LLIE)模型来展示DarkEQA的实用性。我们的分析系统地揭示了这些视觉条件下的操作限制。项目网站:https://darkeqa-benchmark.github.io/
Summary / 总结
DarkEQA is a benchmark designed to evaluate the performance of Vision-Language Models (VLMs) under low-light conditions, which is crucial for 24/7 operation of embodied agents. It isolates the perception bottleneck by degrading egocentric observations and evaluating question answering capabilities. Key findings show that state-of-the-art VLMs struggle with low-light conditions, highlighting their limitations in real-world applications where lighting is poor.
DarkEQA 是一个基准,旨在评估 Vision-Language 模型在低光室内环境中的性能,解决其在 24/7 运行中的鲁棒性不足问题。它通过模拟低光条件下的视觉退化来评估模型的感知能力。研究结果表明,当前的 VLMs 在这些具有挑战性的视觉条件下进行问题回答时存在局限性,强调了低光性能改进的必要性。项目网站: https://darkeqa-benchmark.github.io/
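The degradation pipeline described in the DarkEQA abstract (illumination drop and sensor noise in linear space, then rendering back for display) can be approximated per pixel as below. This is a rough sketch under assumptions: a simple gamma of 2.2 stands in for the un-processing to RAW and for the ISP-inspired rendering, and the noise parameters `shot_gain` and `read_sigma` are hypothetical, not values from the benchmark.

```python
import random

def simulate_low_light(pixel, scale, shot_gain=0.0, read_sigma=0.0, rng=None, gamma=2.2):
    """Darken one display-space intensity in [0, 1] and add sensor-like noise."""
    rng = rng or random.Random(0)
    linear = pixel ** gamma                # undo display gamma (stand-in for RAW un-processing)
    dark = linear * scale                  # physics-style illumination drop in linear space
    # Signal-dependent (shot) variance plus constant (read) variance.
    sigma = (shot_gain * dark + read_sigma ** 2) ** 0.5
    noisy = dark + rng.gauss(0.0, sigma)
    noisy = min(max(noisy, 0.0), 1.0)      # clip to the valid sensor range
    return noisy ** (1.0 / gamma)          # re-apply gamma (stand-in for the ISP rendering)

print(simulate_low_light(1.0, 0.25))
print(simulate_low_light(0.5, 0.1, shot_gain=0.01, read_sigma=0.01, rng=random.Random(1)))
```

Applying the drop before the gamma curve, rather than simply scaling display values, is what makes the darkening behave like reduced scene illumination.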
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
Authors: Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, Vikram S. Adve
Venue: NeurIPS 2025
First: 2025-06-25T03:07:54+00:00 · Latest: 2026-01-06T04:02:48+00:00
Comments: Accepted to NeurIPS 2025
Abstract
We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning and decision-making in consultative interaction settings. Designed for the agriculture domain, MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context, offering a high-fidelity benchmark for evaluating models on grounded reasoning, clarification strategies, and long-form generation in a real-world, knowledge-intensive domain. Grounded in over 35,000 real user-expert interactions and curated through a carefully designed multi-step pipeline, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios. The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases, making it one of the most taxonomically diverse benchmarks available for vision-language models, grounded in the real world. Unlike existing benchmarks that rely on well-specified user inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich scenarios with open-world settings, requiring models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond. Project Page: https://mirage-benchmark.github.io
中文标题/摘要
标题:MIRAGE:农业专家引导对话中多模态信息查询与推理基准
我们介绍了MIRAGE,一个新的基准,用于农业领域咨询互动场景中的多模态专家级推理和决策。MIRAGE通过结合自然用户查询、专家撰写的回应以及基于图像的背景,捕捉了专家咨询的全部复杂性,为评估模型在真实世界、知识密集型领域中的基于视觉语言的推理、澄清策略和长文本生成提供了一个高保真基准。MIRAGE基于超过35,000个真实用户-专家互动,通过精心设计的多步骤管道进行筛选,涵盖了多种作物健康、病虫害诊断和作物管理场景。基准数据集包括超过7,000个独特的生物实体,涵盖植物种类、害虫和疾病,使其成为视觉语言模型中最具分类多样性的基准之一,基于真实世界。与依赖于明确用户输入和封闭分类体系的现有基准不同,MIRAGE包含未明确指定、背景丰富的场景,具有开放世界设置,要求模型推断潜在的知识空白,处理稀有实体,并主动引导对话或回应。
Summary / 总结
MIRAGE is a new benchmark for multimodal reasoning and decision-making in agricultural expert consultations, capturing the complexity of real user-expert interactions through natural queries, expert responses, and images. It is grounded in over 35,000 interactions and includes more than 7,000 unique biological entities, making it suitable for evaluating models on grounded reasoning, clarification strategies, and long-form generation. Unlike other benchmarks, MIRAGE features underspecified, open-world scenarios that require models to infer latent knowledge gaps, handle rare entities, and either proactively guide the interaction or respond.
MIRAGE 是一个新的多模态推理和决策基准,用于农业专家咨询,结合了自然用户查询、专家回应和图像背景。它涵盖了作物健康、病虫害诊断和作物管理等多种场景,包含超过7,000种独特的生物实体。与现有基准不同,MIRAGE 包含上下文丰富的未指定场景,要求模型处理开放世界设置并推断潜在的知识缺口。
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling?
Authors: Mingyuan Wu, Meitang Li, Jingcheng Yang, Jize Jiang, Kaizhuo Yan, Zhaoheng Li, Hanchao Yu, Minjia Zhang, Klara Nahrstedt
Venue: NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (Oral)
First: 2025-06-20T18:23:48+00:00 · Latest: 2026-01-06T03:04:42+00:00
Comments: NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop Oral. In submission
Abstract
Inference time techniques such as decoding time scaling and self refinement have been shown to substantially improve mathematical reasoning in large language models (LLMs), largely attributed to emergent self correction and self verification behaviors often elicited through reinforcement learning (RL). In this work, we ask whether the same recipe transfers to vision language models (VLMs), especially RL finetuned variants that claim strong visual mathematical reasoning.
Through extensive evaluation, we reach three main findings that differ markedly from text only models. First, generation time capability matters more than verification and refinement: simple majority voting consistently and substantially outperforms verification centric strategies such as best of N with self verification. Second, behaviors often associated with RL tuned models at inference time, such as the 'Aha moment,' do not yield reliable reasoning performance improvements. Third, visual information is not effectively integrated into the model's self verification process.
Overall, our analysis highlights a key limitation: current RL trained VLMs derive limited benefit from self verification in the visual modality, which constrains the effectiveness of inference time scaling for visual mathematical reasoning.
中文标题/摘要
标题:再探恍然大悟时刻:视觉语言模型(VLMs)真的能在推理时自我验证吗?
诸如解码时间缩放和自我精炼等推理时间技术已被证明能显著提高大型语言模型(LLMs)的数学推理能力,这主要归因于通过强化学习(RL)引发的自我纠正和自我验证行为。在本研究中,我们探讨这种配方是否适用于视觉语言模型(VLMs),特别是那些声称具有强大视觉数学推理能力的RL微调版本。
通过广泛的评估,我们得出三个主要发现,这些发现与仅文本模型的情况大不相同。首先,生成时间的能力比验证和精炼更重要:简单的多数投票始终且显著地优于以自我验证为中心的策略,如N次最佳选择。其次,与推理时的RL调优模型相关的行为,如“恍然大悟”时刻,并不能带来可靠的推理性能提升。第三,视觉信息并未有效整合到模型的自我验证过程中。
总体而言,我们的分析突显了一个关键限制:当前的RL训练VLMs在视觉模态中从自我验证中获得的益处有限,这限制了推理时缩放对视觉数学推理的有效性。
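The two inference-time strategies contrasted in this abstract, majority voting and best-of-N with self-verification, differ only in how one answer is picked from N samples. The sketch below shows that difference; the sampled answers and the self-verification scores are made-up illustrative values, not results from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the N sampled generations."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, verifier_scores):
    """Return the answer whose self-verification score is highest."""
    best = max(range(len(answers)), key=verifier_scores.__getitem__)
    return answers[best]

samples = ["12", "15", "12", "12", "15"]   # hypothetical final answers from 5 samples
scores = [0.4, 0.9, 0.5, 0.3, 0.6]         # hypothetical self-verification scores

print(majority_vote(samples))
print(best_of_n(samples, scores))
```

In this toy case the two strategies disagree: voting follows the generator's consensus, while best-of-N is only as good as the verifier, which is the component the abstract reports to be weak in the visual modality.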
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Authors: Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Rui Nie, Junyuan Gao, Jiaxing Sun, Yubin Wang, Lijun Wu, Zhenhua Huang, Jiang Wu, Qian Yu, Conghui He
First: 2025-11-04T09:08:44+00:00 · Latest: 2026-01-06T02:51:19+00:00
Abstract
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision Language Models (LVLMs) handle naturally. We introduce a strategy termed BBox and Index as Visual Prompt (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-15k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
中文标题/摘要
标题:RxnCaption: 将化学反应图解析重新定义为基于视觉提示的描述任务
大规模的化学反应数据集对于化学领域的AI研究至关重要。然而,现有的化学反应数据通常以论文中的图像形式存在,这使得它们无法被机器读取和用于训练机器学习模型。为应对这一挑战,我们提出了RxnCaption框架,用于化学反应图解析(RxnDP)任务。我们的框架将传统的基于坐标的解析过程重新定义为图像描述问题,这是大型视觉语言模型(LVLMs)能够自然处理的问题。我们引入了一种称为BBox和索引作为视觉提示(BIVP)的策略,使用我们最先进的分子检测器MolYOLO在输入图像上预先绘制分子边界框和索引。这将下游解析转化为自然语言描述问题。大量实验表明,BIVP策略显著提高了结构提取质量,同时简化了模型设计。我们进一步构建了包含15,000个样本的RxnCaption-15k数据集,其规模比之前的实际文献基准数据集大一个数量级,并且在四个布局原型上具有平衡的测试子集。实验表明,RxnCaption-VL在多个指标上达到了最先进的性能。我们相信,我们的方法、数据集和模型将推动化学文献中结构化信息的提取,并促进更广泛的化学领域AI应用。我们将通过GitHub发布数据、模型和代码。
Summary / 总结
The research aims to address the challenge of making chemical reaction images machine-readable for AI training. The RxnCaption framework reformulates chemical Reaction Diagram Parsing as an image captioning task, utilizing Large Vision Language Models. It introduces a BBox and Index as Visual Prompt (BIVP) strategy, which enhances structural extraction quality and simplifies model design. Experiments on the RxnCaption-15k dataset show that RxnCaption-VL achieves state-of-the-art performance on multiple metrics, advancing structured information extraction from chemical literature.
研究旨在解决将化学反应图像转换为机器可读格式以供AI训练的挑战。RxnCaption框架将化学反应图解析重新定义为图像描述任务,利用大型视觉语言模型。它引入了一种称为BBox和Index作为视觉提示(BIVP)的策略,该策略提高了结构提取的质量并简化了模型设计。实验表明,RxnCaption-VL在多个指标上达到了最先进的性能,促进了化学文献中结构化信息的提取。
Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
Authors: Yoonpyo Lee, Kazuma Kobayashi, Sai Puppala, Sajedul Talukder, Seid Koric, Souvik Chakraborty, Syed Bahauddin Alam
First: 2025-12-29T08:26:27+00:00 · Latest: 2026-01-06T02:29:00+00:00
Abstract
The prevailing paradigm in AI for physical systems, scaling general-purpose foundation models toward universal multimodal reasoning, confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation. Perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway toward domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. This induces a sharp phase transition absent in general-purpose models. Small-scale systems exhibit high-variance imitation with catastrophic tail risk, while large-scale models undergo variance collapse exceeding 500x reduction, stabilizing execution-level behavior. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70% of the training distribution and concentrates 95% of runtime execution on a single-bank strategy. Learned representations transfer across distinct physics and continuous input modalities without architectural modification.
中文标题/摘要
标题:面向核反应堆控制的专用领域基础模型的代理物理AI
当前AI在物理系统中的范式,将通用基础模型扩展到多模态通用推理,面临控制接口的基本障碍。最近的基准测试显示,即使是前沿的视觉-语言模型,在基本的定量物理任务上也只能达到50-53%的准确率,它们更像是近似猜测者,保持语义合理性的同时违反物理约束。这种输入不忠实不是扩展不足,而是结构性限制。感知为中心的架构优化参数空间模仿,而安全关键的控制则需要执行动作的结果空间保证。在这里,我们通过引入代理物理AI的紧凑语言模型,提出了一种不同的途径来构建专用领域基础模型,其中策略优化由基于物理的验证驱动,而不是感知推理。我们训练了一个3.6亿参数的模型,在合成的反应堆控制场景上,将数据集从10^3扩展到10^5个例子。这引发了通用模型中不存在的急剧相变。小规模系统表现出高方差模仿,伴随灾难性尾部风险,而大规模模型经历方差崩溃,降幅超过500倍,稳定了执行级行为。尽管对四类执行方式有均衡的暴露,模型自主拒绝了大约70%的训练分布,并将95%的运行时执行集中在single-bank(单组)策略上。学习到的表示可在不同的物理和连续输入模态之间迁移,无需架构修改。
Summary / 总结
This paper addresses the limitations of general-purpose AI models in physical systems control by introducing Agentic Physical AI, which uses physics-based validation for policy optimization. The authors trained a 360-million-parameter model on synthetic reactor control scenarios, observing a phase transition where large-scale models stabilize execution behavior with a significant reduction in variance. The model autonomously focuses on a single-bank strategy, rejecting 70% of the training distribution, and transfers learned representations across different physics and input modalities without modification.
本文通过引入以物理验证驱动策略优化的Agentic Physical AI,解决了通用AI模型在物理系统控制中的局限性。作者在合成的反应堆控制场景上训练了一个3.6亿参数的模型,观察到大规模模型在执行行为上趋于稳定,显著减少了方差。该模型自主聚焦于单一策略,拒绝了约70%的训练分布,并且所学表示在不同的物理和输入模态下无需修改架构即可迁移。
VINO: A Unified Visual Generator with Interleaved OmniModal Context
Authors: Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye
First: 2026-01-05T18:56:34+00:00 · Latest: 2026-01-05T18:56:34+00:00
Comments: Project page: https://sotamak1r.github.io/VINO-web/
Abstract
We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.
中文标题/摘要
标题:VINO:统一的视觉生成器,具有交错的全模态上下文
我们提出了VINO,一个统一的视觉生成器,可以在单一框架内进行图像和视频生成与编辑。VINO 不依赖于特定任务的模型或独立的模块,而是使用一个共享的扩散骨干网络,该网络可以条件化于文本、图像和视频,从而在一个模型中实现广泛的视觉创作和编辑任务。具体来说,VINO 将一个视觉语言模型(VLM)与一个多模态扩散变换器(MMDiT)耦合,其中多模态输入被编码为交错的条件令牌,然后用于引导扩散过程。这种设计支持多参考定位、长格式指令跟随以及在静态和动态内容中保持一致的身份,同时避免了特定模态的架构组件。为了训练这样一个统一系统,我们引入了一个多阶段训练管道,逐步扩展一个视频生成基础模型,使其成为一个能够处理图像和视频输入输出的统一、多任务生成器。在各种生成和编辑基准测试中,VINO 展现了强大的视觉质量、忠实的指令跟随、改进的参考和属性保留以及更可控的多身份编辑。我们的结果突显了可扩展的统一视觉生成的实用路径,并展示了交错的上下文计算作为通用视觉创作基础的潜力。
Summary / 总结
VINO is a unified visual generator that integrates image and video generation and editing within a single framework. It uses a shared diffusion backbone conditioned on text, images, and videos, coupled with a vision-language model and a Multimodal Diffusion Transformer. VINO supports various visual tasks, including multi-reference grounding, long-form instruction following, and coherent identity preservation. The model was trained using a multi-stage pipeline and showed strong visual quality, faithful instruction following, and improved reference and attribute preservation across different benchmarks.
VINO 是一个统一的视觉生成器,将图像和视频的生成与编辑整合在一个框架中。它使用一个共享的扩散骨干网络,并结合多模态扩散变换器(MMDiT),根据文本、图像和视频进行条件化,以支持各种视觉任务。VINO 在不同基准测试中展示了强大的视觉质量、忠实的指令跟随以及改进的参考和属性保留,展示了统一视觉生成的实用途径。
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations
Authors: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt
First: 2026-01-05T18:07:51+00:00 · Latest: 2026-01-05T18:07:51+00:00
Abstract
Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.
中文标题/摘要
标题:DatBench:区分性、忠实性和高效性的VLM评估
经验性评估是指导基础模型研究进展的主要指南。尽管有大量的工作集中在训练前沿的视觉-语言模型(VLMs)上,但对其评估的方法仍处于初级阶段。为了促进其成熟,我们提出了评估应满足的三个标准:(1)忠实于模态和应用,(2)能够区分不同质量的模型,(3)计算效率。通过这一视角,我们识别出一些关键的失败模式,这些模式违反了忠实性和区分性,错误地反映了模型的能力:(i)多项选择题奖励猜测,不能很好地反映下游使用场景,并且随着模型的改进而过早饱和;(ii)无需图像即可回答的“盲答可解”问题在某些评估中占比高达70%;(iii)错误标记或模棱两可的样本在某些数据集中占比高达42%。关于效率,评估前沿模型的计算负担已经变得难以承受:据一些说法,近20%的开发计算资源仅用于评估。我们没有抛弃现有的基准,而是通过转换和筛选来整理它们,以最大化忠实性和区分性。我们发现,将多项选择题转换为生成任务可以揭示出高达35%的能力下降。此外,过滤掉盲答可解和错误标记的样本可以提高区分能力,同时降低计算成本。我们发布了DatBench-Full,这是一个包含33个数据集的清理后评估套件,涵盖了九种VLM能力,以及DatBench,这是一个区分性子集,实现了13倍的平均加速(最高可达50倍),同时与原始数据集的区分能力非常接近。我们的工作为在VLMs不断扩展的过程中实现既严格又可持续的评估实践指明了一条路径。
Summary / 总结
The paper proposes three desiderata for evaluating vision-language models (VLMs): faithfulness, discriminability, and efficiency. It identifies failure modes in existing benchmarks: multiple-choice formats that reward guessing, blindly solvable questions (up to 70% of some evaluations), and mislabeled or ambiguous samples (up to 42% of examples in certain datasets). Converting multiple-choice questions to generative tasks reveals capability drops of up to 35%, and filtering out blindly solvable and mislabeled samples improves discriminative power while reducing computational cost. The authors release DatBench-Full, a cleaned suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves a 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets.
论文提出了DatBench以解决现有视觉-语言模型(VLM)评估中的问题。它指出了现有评估中存在的忠实性、区分能力和效率问题。关键发现包括将多项选择题转换为生成任务,揭示了显著的能力下降,以及过滤掉盲目可解和错误标注的样本,提高了区分能力并减少了计算成本。作者发布了DatBench-Full和DatBench,这些评估套件增强了VLM评估的忠实性和效率。
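The blind-solvability filter described above can be illustrated with a minimal Python sketch. It assumes a sample counts as blindly solvable when a model gives the reference answer with the image withheld; the function names, the sample fields, and the toy model are hypothetical and are not taken from the DatBench release.

```python
from typing import Callable, Dict, List, Optional


def filter_blindly_solvable(
    samples: List[Dict[str, str]],
    answer_fn: Callable[[str, Optional[bytes]], str],
) -> List[Dict[str, str]]:
    """Keep samples that the model does NOT answer correctly without the image.

    `answer_fn(question, image)` is a hypothetical model interface;
    passing image=None simulates withholding the image.
    """
    kept = []
    for sample in samples:
        blind_prediction = answer_fn(sample["question"], None)
        is_blindly_solved = (
            blind_prediction.strip().lower() == sample["answer"].strip().lower()
        )
        if not is_blindly_solved:
            kept.append(sample)
    return kept


def toy_blind_model(question: str, image: Optional[bytes]) -> str:
    """Toy stand-in for a VLM: answers from a text prior only."""
    return "blue" if "sky" in question.lower() else "unknown"


samples = [
    {"question": "What color is the sky?", "answer": "blue"},
    {"question": "How many cups are on the table?", "answer": "3"},
]
kept = filter_blindly_solvable(samples, toy_blind_model)
```

In this toy run only the cup-counting question survives, because the sky question is answered correctly from the text alone. Exact string matching is a simplification; a real pipeline would need a more tolerant answer comparison.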
InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams
Authors: Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang
First: 2026-01-05T17:11:00+00:00 · Latest: 2026-01-05T17:11:00+00:00
Abstract
The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively ``rolling'' the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT
中文标题/摘要
标题:InfiniteVGGT:视觉几何导向变换器,用于无尽流
持久的、大规模的3D视觉几何理解的宏伟愿景受到可扩展性和长期稳定性的不可调和需求的束缚。虽然离线模型如VGGT实现了令人鼓舞的几何能力,但它们基于批次的性质使它们对实时系统无关紧要。流式架构虽然是为实时操作设计的解决方案,但已被证明是不足的。现有方法要么无法支持真正无限的输入,要么在长时间序列中遭受灾难性漂移。我们通过InfiniteVGGT打破了这一长期困境,这是一种因果视觉几何变换器,通过有界但适应性强且持续表达的KV缓存实现滚动记忆的概念化。利用这一点,我们设计了一种无需训练、不依赖注意力的剪枝策略,能够智能地丢弃过时的信息,有效地“滚动”记忆向前推进每一帧。InfiniteVGGT完全兼容FlashAttention,最终解决了这一妥协,使无限时长的流式传输成为可能,同时在长期稳定性方面优于现有流式方法。对于这样一个系统来说,最终的考验是其在真正无限时长上的性能,由于缺乏长期连续基准,这种能力一直难以严格验证。为解决这一关键缺口,我们引入了Long3D基准,这是首次能够对序列长度约10,000帧的连续3D几何估计进行严格评估的基准。这为未来在长期3D几何理解方面的研究提供了决定性的评估平台。代码可在:https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT 获取
Summary / 总结
InfiniteVGGT addresses the tension between scalability and long-term stability in streaming 3D visual geometry understanding. It is a causal visual geometry transformer that keeps a bounded yet adaptive KV cache as a rolling memory, and it uses a training-free, attention-agnostic pruning strategy to discard obsolete information as each new frame arrives. The method is compatible with FlashAttention and is reported to outperform existing streaming methods in long-term stability. The authors also introduce the Long3D benchmark for evaluating continuous 3D geometry estimation on sequences of about 10,000 frames.
InfiniteVGGT通过引入具有有界但适应性强的KV缓存的因果Transformer,解决了持久3D视觉几何理解的挑战,实现了无限时长的流式处理并保持长期稳定性。它在长期稳定性上优于现有流式方法,并引入了Long3D基准,首次实现了对极长序列上3D几何估计的严格评估。
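To make the idea of a bounded rolling memory concrete, here is a minimal Python sketch of a fixed-budget cache that evicts its least useful entry whenever a new frame pushes it over budget. The abstract does not state the pruning criterion, so the per-entry `score` below is a placeholder assumption, and the class and field names are hypothetical rather than taken from the InfiniteVGGT code.

```python
from typing import Any, List, Tuple


class RollingCache:
    """Fixed-budget cache: memory stays bounded however long the stream runs."""

    def __init__(self, budget: int) -> None:
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        # Each entry is (frame_index, score, payload).
        self.entries: List[Tuple[int, float, Any]] = []

    def append(self, frame_index: int, score: float, payload: Any) -> None:
        """Add an entry, then evict until the cache fits the budget again."""
        self.entries.append((frame_index, score, payload))
        while len(self.entries) > self.budget:
            # Placeholder rule: drop the lowest score, oldest first on ties.
            victim = min(
                range(len(self.entries)),
                key=lambda i: (self.entries[i][1], self.entries[i][0]),
            )
            self.entries.pop(victim)


cache = RollingCache(budget=3)
for idx, score in enumerate([0.9, 0.1, 0.5, 0.7, 0.2]):
    cache.append(idx, score, payload=f"kv{idx}")
kept_frames = [entry[0] for entry in cache.entries]
```

The point of the sketch is only the invariant that the cache never exceeds its budget. How entries are scored without reading attention weights is the part the paper contributes and is not reproduced here.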
Unraveling MMDiT Blocks: Training-free Analysis and Enhancement of Text-conditioned Diffusion
Authors: Binglei Li, Mengping Yang, Zhiyu Tan, Junping Zhang, Hao Li
First: 2026-01-05T15:32:53+00:00 · Latest: 2026-01-05T15:32:53+00:00
Comments: 11 pages
Abstract
Recent breakthroughs of transformer-based diffusion models, particularly with Multimodal Diffusion Transformers (MMDiT) driven models like FLUX and Qwen Image, have facilitated thrilling experiences in text-to-image generation and editing. To understand the internal mechanism of MMDiT-based models, existing methods tried to analyze the effect of specific components like positional encoding and attention layers. Yet, a comprehensive understanding of how different blocks and their interactions with textual conditions contribute to the synthesis process remains elusive. In this paper, we first develop a systematic pipeline to comprehensively investigate each block's functionality by removing, disabling and enhancing textual hidden-states at corresponding blocks. Our analysis reveals that 1) semantic information appears in earlier blocks and finer details are rendered in later blocks, 2) removing specific blocks is usually less disruptive than disabling text conditions, and 3) enhancing textual conditions in selective blocks improves semantic attributes. Building on these observations, we further propose novel training-free strategies for improved text alignment, precise editing, and acceleration. Extensive experiments demonstrated that our method outperforms various baselines and remains flexible across text-to-image generation, image editing, and inference acceleration. Our method improves T2I-Combench++ from 56.92% to 63.00% and GenEval from 66.42% to 71.63% on SD3.5, without sacrificing synthesis quality. These results advance understanding of MMDiT models and provide valuable insights to unlock new possibilities for further improvements.
中文标题/摘要
标题:解析MMDiT模块:无需训练的文本条件扩散分析与增强
基于Transformer的扩散模型的最新突破,特别是由多模态扩散变换器(MMDiT)驱动的模型如FLUX和Qwen Image,极大地促进了文本到图像生成和编辑的激动人心的体验。为了理解基于MMDiT的模型的内部机制,现有方法试图分析特定组件如位置编码和注意力层的效果。然而,不同模块及其与文本条件的交互如何共同作用于合成过程的全面理解仍然难以捉摸。在本文中,我们首先开发了一种系统化的流程,通过在相应模块中移除、禁用和增强文本隐藏状态来全面调查每个模块的功能。我们的分析揭示了以下几点:1)语义信息出现在较早的模块中,而更精细的细节则在较晚的模块中呈现;2)移除特定模块通常比禁用文本条件的影响小;3)在选择性模块中增强文本条件可以提高语义属性。基于这些观察,我们进一步提出了新的无需训练的策略,以提高文本对齐、精确编辑和加速。广泛的实验表明,我们的方法优于各种基线,并且在文本到图像生成、图像编辑和推理加速方面保持灵活性。我们的方法在SD3.5上将T2I-Combench++从56.92%提高到63.00%,GenEval从66.42%提高到71.63%,且没有牺牲合成质量。这些结果推进了对MMDiT模型的理解,并提供了有价值的见解,以解锁进一步改进的新可能性。
Summary / 总结
This paper aims to understand the internal mechanisms of MMDiT-based models in text-to-image generation and editing. The authors develop a systematic pipeline to analyze the impact of different blocks and their interactions with textual conditions. Key findings include that semantic information appears in earlier blocks while finer details are rendered in later blocks, removing specific blocks is less disruptive than disabling text conditions, and enhancing textual conditions in selective blocks improves semantic attributes. Based on these insights, the authors propose training-free strategies for better text alignment, precise editing, and acceleration. Experiments show that their method outperforms baselines and improves T2I-Combench++ and GenEval scores on SD3.5 without compromising synthesis quality.
本文旨在理解MMDiT模型在文本到图像生成和编辑中的内部机制。作者开发了一种系统化的流程,通过操纵文本隐藏状态来分析每个模块的功能。关键发现包括:语义信息在较早的模块中出现,细节在较晚的模块中呈现;移除特定模块造成的破坏通常小于禁用文本条件;在选定模块中增强文本条件可以提升语义属性。基于这些见解,作者提出了无需训练的策略,用于改进文本对齐、精确编辑和推理加速,其效果优于各种基线,并在SD3.5上提高了基准分数,同时不牺牲合成质量。
Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models
Authors: Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert
First: 2025-12-30T21:20:41+00:00 · Latest: 2026-01-05T14:30:28+00:00
Comments: 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/
Abstract
The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy
中文标题/摘要
标题:基础模型在桥梁上的应用:基于视觉语言模型的海上自主航行中的语义风险检测与安全机动
国际海事组织(IMO)的MASS代码草案要求自主和远程监督的海上船舶能够检测偏离其操作设计域的情况,进入预定义的后备模式通知操作员,允许立即的人工干预,并在未经批准的情况下不得更改航程计划。在警报到接管的窗口内满足这些义务需要一种短时间范围、可人工干预的后备机动。传统的海上自主系统在需要理解意义的情况下(例如,潜水员标志意味着水中有人员,火意味着危险)难以应对。我们认为(i)视觉语言模型(VLMs)为这些分布外情况提供了语义意识,(ii)快速-慢速异常检测流水线与短时间范围、可人工干预的后备机动使这一操作在交接窗口内成为可能。我们引入了语义瞭望,这是一种仅使用摄像头、候选限制的VLM后备机动选择器,它在连续的人类授权下从水有效、世界锚定的轨迹中选择一个谨慎的动作(或保持位置)。在40个港口场景中,我们测量了每次呼叫的场景理解能力和延迟,与人类共识的对齐(模型三票多数投票),火灾危险场景下的短时间范围风险缓解,以及水上警报->后备机动->操作员交接。亚10秒的模型保留了大多数先进模型的大部分意识。后备机动选择器优于仅几何的基线,并在火灾场景中增加了安全距离。现场运行验证了端到端操作。这些结果支持VLMs作为与IMO MASS代码草案兼容的语义后备机动选择器,符合实际的延迟预算,并激励未来工作,即领域适应的混合自主,将基础模型语义与多传感器鸟瞰感知和短时间范围重新规划相结合。
Summary / 总结
This research addresses semantic hazard detection and safe fallback behavior for autonomous maritime vessels in the gap between an alert and operator takeover. The authors introduce Semantic Lookout, a camera-only, candidate-constrained selector that uses vision-language models to choose one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes, models that respond in under 10 s retain most of the scene awareness of slower state-of-the-art models, and the selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. The results support VLMs as fallback maneuver selectors compatible with the draft IMO MASS Code and motivate further work on domain-adapted, hybrid autonomy.
研究旨在利用视觉-语言模型提出一种语义后备机动方案,以满足自主海上船舶在警报到接管窗口内检测和应对危险的需求。方法包括一个快-慢异常检测流水线,并在持续的人类授权下从水域有效、世界锚定的轨迹中选择一个谨慎动作。关键发现包括:响应时间低于10秒的模型保留了较慢的先进模型的大部分场景感知能力;与仅几何基线相比,该选择器表现更优,并在火灾场景中增加了安全距离。Semantic Lookout系统在40个港口场景中的测试表明,该方案可以在符合IMO MASS代码草案要求的前提下实际应用。
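The abstract mentions "model majority-of-three voting" without further detail. One plausible reading is that three model outputs for the same scene are aggregated by simple majority, which the Python sketch below illustrates. The action labels and the choice to return `None` when all three votes differ are assumptions made for illustration, not details from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_of_three(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by at least two of three votes, else None."""
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


# Hypothetical action labels for a single harbor scene.
agreed = majority_of_three(["station_keeping", "station_keeping", "turn_port"])
split = majority_of_three(["station_keeping", "turn_port", "turn_starboard"])
```

Here `agreed` is `"station_keeping"` and `split` is `None`. A deployed system would need an explicit policy for the no-majority case, for example defaulting to station-keeping.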
BiPrompt: Bilateral Prompt Optimization for Visual and Textual Debiasing in Vision-Language Models
Authors: Sunny Gupta, Shounak Das, Amit Sethi
Venue: AAAI 2026 Workshop (AIR-FM)
First: 2026-01-05T14:22:20+00:00 · Latest: 2026-01-05T14:22:20+00:00
Comments: Accepted at the AAAI 2026 Workshop AIR-FM, Assessing and Improving Reliability of Foundation Models in the Real World
Abstract
Vision language foundation models such as CLIP exhibit impressive zero-shot generalization yet remain vulnerable to spurious correlations across visual and textual modalities. Existing debiasing approaches often address a single modality either visual or textual leading to partial robustness and unstable adaptation under distribution shifts. We propose a bilateral prompt optimization framework (BiPrompt) that simultaneously mitigates non-causal feature reliance in both modalities during test-time adaptation. On the visual side, it employs structured attention-guided erasure to suppress background activations and enforce orthogonal prediction consistency between causal and spurious regions. On the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism that aligns class embeddings toward an isotropic semantic space. Together, these modules jointly minimize conditional mutual information between spurious cues and predictions, steering the model toward causal, domain invariant reasoning without retraining or domain supervision. Extensive evaluations on real-world and synthetic bias benchmarks demonstrate consistent improvements in both average and worst-group accuracies over prior test-time debiasing methods, establishing a lightweight yet effective path toward trustworthy and causally grounded vision-language adaptation.
中文标题/摘要
标题:BiPrompt:视觉和文本双边提示优化以减轻视觉语言模型中的偏见
视觉语言基础模型如CLIP在零样本泛化方面表现出色,但在视觉和文本模态之间的虚假相关性方面仍然脆弱。现有的去偏方法通常只针对单一模态,无论是视觉还是文本,导致在分布变化下的部分鲁棒性和不稳定适应。我们提出了一种双边提示优化框架(BiPrompt),该框架在测试时同时减轻了两个模态中的非因果特征依赖性。在视觉方面,它使用结构化注意力引导消除来抑制背景激活,并强制因果区域和虚假区域之间的预测一致性。在文本方面,它引入了平衡提示归一化,这是一种可学习的重新对齐机制,将类别嵌入对齐到等向性的语义空间。这些模块共同最小化了虚假线索与预测之间的条件互信息,引导模型朝着因果、领域不变的推理方向发展,而无需重新训练或领域监督。在现实世界和合成偏见基准上的广泛评估表明,与先前的测试时去偏方法相比,该方法在平均准确性和最差群体准确率上都取得了持续的改进,为可信且因果导向的视觉语言适应指明了一条轻量级且有效的路径。
Summary / 总结
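The paper proposes BiPrompt, a bilateral prompt optimization framework that mitigates reliance on non-causal features in both the visual and textual modalities of vision-language models during test-time adaptation. On the visual side, it uses structured attention-guided erasure to suppress background activations; on the textual side, it introduces balanced prompt normalization, a learnable re-centering mechanism for class embeddings. Together, the modules minimize conditional mutual information between spurious cues and predictions without retraining or domain supervision. Evaluations on real-world and synthetic bias benchmarks show consistent improvements in average and worst-group accuracy over prior test-time debiasing methods.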
论文提出了BiPrompt,一种双边提示优化框架,旨在减轻视觉-语言模型中的非因果特征依赖。该方法同时处理视觉和文本模态,视觉侧使用结构化注意力引导消除来抑制背景激活,文本侧引入平衡提示归一化机制。这些模块共同减少了虚假线索与预测之间的条件互信息,使模型朝向因果、领域不变的推理方向发展,而无需重新训练或领域监督。实验结果显示,该方法在真实世界和合成偏见基准上的平均准确率和最差群体准确率均优于先前的测试时去偏方法。
DeCode: Decoupling Content and Delivery for Medical QA
Authors: Po-Jen Ko, Chen-Han Tsai, Yu-Shao Peng
First: 2026-01-05T13:54:38+00:00 · Latest: 2026-01-05T13:54:38+00:00
Comments: Preprint
Abstract
Large language models (LLMs) exhibit strong medical knowledge and can generate factually accurate responses. However, existing models often fail to account for individual patient contexts, producing answers that are clinically correct yet poorly aligned with patients' needs. In this work, we introduce DeCode, a training-free, model-agnostic framework that adapts existing LLMs to produce contextualized answers in clinical settings. We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses. DeCode improves the previous state of the art from $28.4\%$ to $49.8\%$, corresponding to a $75\%$ relative improvement. Experimental results suggest the effectiveness of DeCode in improving clinical question answering of LLMs.
中文标题/摘要
标题:DeCode: 解耦内容与交付以实现医疗QA
大型语言模型(LLMs)表现出强大的医学知识,并能生成事实准确的回答。然而,现有模型往往未能考虑个体患者的背景,导致答案在临床上正确但与患者需求严重脱节。在本工作中,我们引入了DeCode,这是一种无需训练、模型通用的框架,能够将现有LLMs适应于在临床环境中生成上下文化的回答。我们使用OpenAI HealthBench对DeCode进行了评估,这是一个全面且具有挑战性的基准,旨在评估LLM回答的临床相关性和有效性。DeCode将先前的最佳性能从28.4%提高到49.8%,相当于75%的相对改进。实验结果表明,DeCode在提高LLM的临床问题回答效果方面的有效性。
Summary / 总结
DeCode is a training-free, model-agnostic framework that adapts existing large language models to produce contextualized answers in clinical settings. On OpenAI HealthBench, DeCode raises the previous state of the art from 28.4% to 49.8%, a 75% relative improvement.
DeCode 是一个无需训练、适用于多种模型的框架,旨在使现有的大型语言模型能够生成具有临床相关性的医疗回答。在 OpenAI HealthBench 上进行评估后,DeCode 显著提高了 LLMs 在临床环境中的准确性,相比之前的最佳表现,相对改进了 75%。
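The relative improvement quoted for DeCode follows directly from the two reported scores. The short Python check below reproduces the arithmetic; the function name is only illustrative.

```python
def relative_improvement(old: float, new: float) -> float:
    """Relative gain of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


# Scores reported in the abstract (percent on OpenAI HealthBench).
gain = relative_improvement(28.4, 49.8)
print(f"{gain:.1%}")  # prints 75.4%
```

The absolute gain is 21.4 percentage points, and 21.4 / 28.4 is roughly 0.754, which the abstract rounds to 75%.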
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Authors: Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
Venue: WACV 2026
First: 2024-05-29T05:20:02+00:00 · Latest: 2026-01-05T13:34:30+00:00
Comments: WACV 2026 Accepted. Code available at https://github.com/CyberAgentAILab/multimodal-adversarial-training
Abstract
Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift -- conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives. Our code is publicly available at https://github.com/CyberAgentAILab/multimodal-adversarial-training.
中文标题/摘要
标题:利用一对多关系的多模态对抗防御方法研究
预训练的视觉-语言(VL)模型对对抗攻击极为敏感。然而,现有的防御方法主要集中在图像分类上,忽视了VL任务中的两个关键方面:多模态攻击,其中图像和文本都可以被扰动,以及一对多关系,即一个图像可以对应多个文本描述,反之亦然(1:N和N:1)。本工作是首次探索VL任务中对抗多模态攻击的防御策略,而之前的VL防御方法主要关注视觉鲁棒性。我们提出了多模态对抗训练(MAT),在训练过程中同时在图像和文本模态中引入对抗扰动,显著优于现有的单模态防御方法。此外,我们发现MAT受限于VL训练数据中确定的一对一(1:1)图像-文本对。为了解决这一问题,我们对利用一对多关系增强鲁棒性进行了全面研究,探讨了多种增强技术。我们的分析表明,为了更有效的防御,增强的图像-文本对应该对齐良好、多样化,但要避免分布偏移——这是先前研究中被忽视的条件。本工作开创了对抗多模态攻击的防御策略,从优化和数据两个角度提供了构建鲁棒VL模型的见解。我们的代码已公开发布在https://github.com/CyberAgentAILab/multimodal-adversarial-training。
Summary / 总结
This work addresses the vulnerability of pre-trained vision-language models to adversarial attacks by proposing multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities. The method significantly outperforms existing unimodal defenses. The study also highlights the limitations of deterministic one-to-one image-text pairs and explores the use of one-to-many relationships to enhance robustness, suggesting that augmented pairs should be well-aligned, diverse, and avoid distribution shift. This work provides new insights for building robust vision-language models.
该研究通过提出多模态对抗训练(MAT),在图像和文本模态中同时引入对抗扰动,显著优于现有的单模态防御方法。研究还强调了利用图像-文本对的一对多关系来增强鲁棒性的重要性,建议增强的对应该对齐良好、多样化且避免分布偏移。这项研究为构建鲁棒的视觉-语言模型提供了新的见解。
Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows
Authors: Yingte Shu, Yuchuan Tian, Chao Xu, Yunhe Wang, Hanting Chen
First: 2026-01-05T12:57:33+00:00 · Latest: 2026-01-05T12:57:33+00:00
Abstract
Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding confidence and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. This design enables effective bidirectional information flow within the decoding window without sacrificing efficiency. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.
中文标题/摘要
标题:延迟承诺解码:带有信心感知滑动窗口的扩散语言模型
扩散语言模型(DLMs)最近作为一种强大的替代自回归模型出现,通过实现并行文本生成。为了提高推理效率和KV缓存兼容性,先前的工作通常采用基于块的扩散,逐块解码令牌。然而,这种范式遭受了一个我们称之为边界诱导上下文截断(BICT)的结构性限制:接近块边界的未解码令牌被迫在无法访问附近未来上下文的情况下做出承诺,即使这种上下文可以显著减少不确定性。这一限制降低了解码信心和生成质量,特别是在需要精确推理的任务中,如数学问题求解和代码生成。我们提出了延迟承诺解码(DCD),这是一种无需训练的新颖解码策略,可以缓解这一问题。DCD 维护一个信心感知的滑动窗口覆盖在掩码令牌上,早期解决低不确定性令牌,直到有足够的上下文证据才推迟高不确定性令牌。这种设计在解码窗口内实现了有效的双向信息流,而不牺牲效率。在多个扩散语言模型、基准和缓存配置的广泛实验中显示,与固定块基扩散方法相比,DCD 在平均时间相同的情况下提高了生成准确性 1.39%,最高改善幅度达到 9.0%。这些结果表明,基于不确定性推迟令牌承诺是提高扩散语言模型解码质量和效率的一个简单而有效的原则。
Summary / 总结
The paper addresses the issue of Boundary-Induced Context Truncation (BICT) in block-based diffusion language models, which limits decoding confidence and generation quality. It introduces Deferred Commitment Decoding (DCD), a training-free method that uses a confidence-aware sliding window to resolve tokens with low uncertainty early and defer high-uncertainty tokens until sufficient context is available. Experiments show that DCD improves generation accuracy by 1.39% on average compared to fixed block-based methods, with the best improvement reaching 9.0%.
论文针对块基扩散语言模型中的边界诱导上下文截断(BICT)问题,该问题限制了接近块边界未解码令牌对未来的访问。为了解决这一问题,作者提出了延迟承诺解码(DCD),这是一种无需训练的方法,通过使用一个基于信心的滑动窗口来早期解决低不确定性令牌,并在获得足够上下文证据后延迟高不确定性令牌。实验结果显示,与固定块基方法相比,DCD 平均提高了 1.39% 的生成准确性,最佳改进达到 9.0%。
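The core idea of deferring uncertain tokens can be illustrated with a toy Python sketch of a single decoding step. This is not the authors' algorithm: the window placement, the fixed confidence threshold, and the fallback that commits the single most confident token to guarantee progress are all assumptions made for illustration.

```python
from typing import List, Set


def deferred_commit_step(
    confidences: List[float],
    masked: Set[int],
    window_size: int,
    threshold: float,
) -> List[int]:
    """Pick which masked positions to commit in one step.

    The window starts at the first still-masked position. Tokens at or above
    the threshold are committed now; the rest are deferred to later steps.
    """
    if not masked:
        return []
    start = min(masked)
    window = [p for p in sorted(masked) if p < start + window_size]
    committed = [p for p in window if confidences[p] >= threshold]
    if not committed:
        # Assumed fallback: commit the most confident token so decoding advances.
        committed = [max(window, key=lambda p: confidences[p])]
    return committed


# Hypothetical per-position confidences for six masked tokens.
confidences = [0.95, 0.40, 0.92, 0.30, 0.99, 0.85]
first = deferred_commit_step(confidences, {0, 1, 2, 3, 4, 5}, 4, 0.9)
second = deferred_commit_step(confidences, {1, 3, 4, 5}, 4, 0.9)
```

In the first step positions 0 and 2 are committed while 1 and 3 are deferred. The window then starts at position 1 and reaches position 4, which is committed in the second step. A real decoder would recompute confidences after every step, since newly committed tokens change the context.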