DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li
First: 2025-12-04T18:59:53+00:00 · Latest: 2025-12-04T18:59:53+00:00
Comments: Project Page: https://github.com/CaraJ7/DraCo
Abstract
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited, either treating the model merely as a standalone generator or relying on abstract textual planning. To this end, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual contents in CoT for better planning and verification. Our method first generates a low-resolution draft image as preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty in generating rare attribute combinations. To support training, we curate DraCo-240K, aiming to enhance three atomic capabilities spanning general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves a tremendous increase on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other generation methods empowered by CoT.
中文标题/摘要
标题:DraCo: 文本生成图像预览及稀有概念生成的草图作为推理
近期统一的多模态大型语言模型(MLLMs)展示了令人印象深刻的性能,通过推理链(CoT)增强文本生成图像的能力。然而,现有方法仍然有限,要么将模型仅视为独立生成器,要么依赖抽象的文本规划。为此,我们提出了Draft-as-CoT(DraCo),一种新颖的交替推理范式,充分利用文本和视觉内容在CoT中的作用,以更好地规划和验证。我们的方法首先生成低分辨率的草图图像作为预览,提供更具体的视觉规划和指导。然后,我们利用模型的内在理解能力验证草图与输入提示之间潜在的语义不一致,并通过选择性修正进行超分辨率细化。这样,我们的方法解决了两个基本挑战:文本规划的粗粒度性质和生成稀有属性组合的难度。为了支持训练,我们整理了DraCo-240K,旨在增强一般修正、实例操作和布局重组的三种原子能力。借助DraCo-CFG,一种专门的交替推理无分类器引导(CFG)策略,DraCo在GenEval上取得了8%的巨大提升,在Imagine-Bench上提升了0.91,在GenEval++上提升了3%,显著优于直接生成和其他基于CoT的生成方法。
Summary / 总结
DraCo proposes a novel interleaved reasoning paradigm called Draft-as-CoT to enhance text-to-image generation by leveraging both textual and visual contents in the chain-of-thought process. It generates a low-resolution draft image first, which serves as a visual planning guide, and then refines the image through selective corrections with super-resolution. DraCo significantly improves performance on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%) compared to direct generation and other CoT-empowered methods, addressing challenges of coarse-grained textual planning and rare attribute generation.
DraCo 提出了一种名为 Draft-as-CoT 的新型交错推理范式,通过在链式思考过程中利用文本和视觉内容,增强文本到图像的生成。它首先生成一个低分辨率的草图作为预览,提供具体的视觉指导并帮助验证语义不一致,从而通过超分辨率进行更好的细化。DraCo 在 GenEval (+8%)、Imagine-Bench (+0.91) 和 GenEval++ (+3%) 上的表现显著优于直接生成和其他基于 CoT 的生成方法。
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
First: 2025-12-04T18:59:52+00:00 · Latest: 2025-12-04T18:59:52+00:00
Abstract
Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.
中文标题/摘要
标题:ARM-Thinker:通过自主工具使用和视觉推理强化多模态生成奖励模型
奖励模型对于使视觉-语言系统与人类偏好保持一致至关重要,但当前的方法存在幻觉、视觉定位弱以及无法使用工具进行验证的问题,这限制了它们在复杂多模态推理任务中的可靠性。我们提出了ARM-Thinker,这是一种自主多模态奖励模型,能够自主调用外部工具(例如,图像裁剪、文档页面检索)以基于可验证的证据来支撑判断,取代了静态、非交互式的奖励评分。这使模型能够验证细微的视觉细节、交叉引用多页证据并验证推理声明,而这些能力在现有的奖励模型中是不存在的。我们通过多阶段强化学习训练ARM-Thinker,联合优化工具调用决策和判断准确性。为了评估自主奖励建模,我们引入了ARMBench-VL,包含三个基准测试,分别评估细微的视觉定位(图像级工具)、多页文档理解(检索工具)和指令遵循(文本级验证)。ARM-Thinker 在奖励模型基准测试中平均提高了16.2%,在工具使用任务中提高了9.6%,并在多模态数学和逻辑推理基准测试中优于基线模型。我们的结果表明,自主能力显著提高了奖励模型的准确性和可解释性。
Summary / 总结
ARM-Thinker is an agentic multimodal reward model that uses external tools for verification, improving visual grounding and reasoning accuracy. It is trained with multi-stage reinforcement learning and evaluated on ARMBench-VL, achieving significant improvements in reward modeling and tool-use tasks, and outperforming baselines in multimodal math and logical reasoning benchmarks.
ARM-Thinker 通过将自主工具使用和视觉推理引入奖励模型中,旨在提高视觉语言系统的可靠性。它使用多阶段强化学习来优化工具调用决策和判断准确性。ARM-Thinker 在奖励模型基准测试中的平均改进幅度为 16.2%,在工具使用任务中的改进幅度为 9.6%,并且在多模态数学和逻辑推理基准测试中表现出色。
TV2TV: A Unified Framework for Interleaved Language and Video Generation
Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
First: 2025-12-04T18:59:09+00:00 · Latest: 2025-12-04T18:59:09+00:00
Abstract
Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
中文标题/摘要
标题:TV2TV:一种统一的交错语言和视频生成框架
视频生成模型正在迅速发展,但仍可能在需要大量语义分支或反复进行下一步应该发生什么的高层次推理的复杂视频输出上遇到困难。在本文中,我们介绍了一种新的全能视频-文本模型类别,该模型结合了最近语言模型推理进展的想法,以应对这一挑战。具体来说,我们提出了TV2TV,这是一种统一的生成建模框架,将视频生成分解为交错的语言和视频生成过程。TV2TV 使用混合的变换器(MoT)架构联合学习语言建模(下一个标记预测)和视频流匹配(下一个帧预测)。在推理时,TV2TV 决定何时在生成文本和视频帧之间交替,使模型能够在“用词思考”后续内容之前“用像素行动”来生成帧。这种设计将决定下一步应该发生什么的责任大部分转移到了语言建模塔上,从而提高了生成视频的视觉质量和提示对齐。它还使用户能够在过程中任何时间通过文本干预来修改视频生成轨迹,实现细粒度的可控性。在对视频游戏数据的受控实验中,TV2TV 在视觉质量和可控性方面都表现出显著的改进。TV2TV 还扩展到自然视频,我们通过使用视觉-语言模型(VLMs)交替自然语言动作描述来增强体育视频,展示了这一点。在该语料库上训练 TV2TV 产生了强大的视觉质量和提示对齐,展示了该模型能够推理和生成复杂的现实世界动作序列的能力。这些结果共同突显了 TV2TV 是朝着具有开放文本推理和控制的视频生成迈出的有希望的一步。
Summary / 总结
TV2TV is a unified generative modeling framework that addresses the challenge of generating complex video outputs by integrating language and video generation processes. It uses a Mixture-of-Transformers architecture to jointly learn language modeling and video flow matching, allowing the model to decide when to generate text or video frames. Experiments on video game data show that TV2TV improves both visual quality and controllability, and it scales to natural videos by augmenting sports videos with interleaved natural language descriptions, demonstrating strong visual quality and prompt alignment.
TV2TV 是一种统一生成模型框架,将文本和视频生成交织进行,以应对复杂视频输出的挑战。它使用 Mixture-of-Transformers 架构联合学习语言建模和视频流匹配。实验表明,TV2TV 在视频生成中提高了视觉质量和可控性,特别是在视频游戏数据中,并且通过视觉语言模型(VLM)对自然视频进行动作描述,展示了模型在生成复杂现实动作序列方面的推理能力。
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
First: 2025-12-04T18:46:44+00:00 · Latest: 2025-12-04T18:46:44+00:00
Comments: Project Page: https://cvlab-kaist.github.io/DeepForcing/
Abstract
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that address this without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, almost maintaining overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
中文标题/摘要
标题:深度强迫:无需训练的长视频生成与深度下陷和参与式压缩
近期自回归视频扩散技术的进步使得实时帧流成为可能,但现有解决方案仍然存在时间重复、漂移和运动减速的问题。我们发现,简单地将类似于StreamingLLM的注意力下陷应用于视频扩散会导致保真度下降和运动停滞。为克服这一问题,我们引入了深度强迫,这是一种无需训练的机制,无需任何微调即可解决这一问题。具体来说,1) 深度下陷将滑动窗口的一半分配给持久下陷标记,并重新对齐它们的当前时间线的时空旋转相位,从而在长时间展开过程中稳定全局上下文。2) 参与式压缩执行重要性感知的KV缓存剪枝,仅保留最近参与注意力的活跃标记,同时安全地丢弃冗余和退化的历史记录,从而在生成超出分布长度时最小化误差累积。这些组件结合在一起,使生成能力提高了超过12倍(例如,5秒训练到60秒以上的生成),同时保持了更好的成像质量,更好的美学质量,几乎保持了整体一致性,并在动态程度上取得了显著进步,同时保持了实时生成。我们的结果表明,无需训练的KV缓存管理可以与基于训练的方法相匹配或超越自回归流式长视频生成。
Summary / 总结
Deep Forcing is a training-free method for long video generation that addresses the temporal repetition, drift, and motion deceleration seen in existing solutions. It introduces two mechanisms: Deep Sink, which stabilizes global context by dedicating half of the sliding window to persistent sink tokens and re-aligning their temporal RoPE phase, and Participative Compression, which prunes the KV cache to preserve only tokens actively participating in recent attention. Together these components enable over 12x extrapolation with better imaging quality than LongLive and better aesthetic quality than RollingForcing, while nearly preserving overall consistency, improving dynamic degree, and supporting real-time generation.
Deep Forcing 是一种无需训练的方法,用于解决现有解决方案中的时间重复和运动减速问题。它引入了两种机制:Deep Sink 通过重新对齐持久的 sink 标记来稳定全局上下文,而 Participative Compression 则通过仅保留积极参与的标记来修剪 KV 缓存。这些组件使生成能力提高了超过 12 倍,同时保持了更好的成像质量和美学质量,以及整体一致性和动态程度,同时支持实时生成。
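The cache layout described in the Deep Forcing abstract (half of the window reserved for persistent sink tokens, the rest filled by importance-aware pruning) can be illustrated with a small sketch. This is not the authors' implementation: the function name `prune_cache`, the `num_recent` parameter, the rule of always keeping the newest tokens, and the use of a single scalar importance score per token are all assumptions made for illustration, and the temporal RoPE re-alignment of the sinks is not modeled here.

```python
from typing import List, Sequence


def prune_cache(scores: Sequence[float], window: int, num_recent: int) -> List[int]:
    """Return the sorted indices of cached tokens to keep (hypothetical helper).

    scores[i] is an assumed importance score for cached token i, for example
    the attention mass it received from recent queries. Assumed layout:
      * the first window // 2 entries are persistent sink tokens,
      * the last num_recent entries are always kept,
      * the remaining budget goes to the highest-scoring tokens in between.
    """
    n = len(scores)
    if n <= window:
        return list(range(n))  # nothing to prune yet

    num_sink = window // 2
    if num_sink + num_recent > window:
        raise ValueError("sinks plus recent tokens exceed the window")

    sinks = list(range(num_sink))
    recent = list(range(n - num_recent, n))
    budget = window - num_sink - num_recent

    # Rank the tokens between the sinks and the recent block by importance.
    middle = range(num_sink, n - num_recent)
    top = sorted(middle, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(sinks + recent + top)


if __name__ == "__main__":
    demo_scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6, 0.4, 0.5]
    print(prune_cache(demo_scores, window=6, num_recent=2))
```

With a window of 6 and ten cached tokens, three slots go to sinks, two to the newest tokens, and the single remaining slot to the highest-scoring token in between.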
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
Authors: Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso
Venue: NeurIPS 2025
First: 2025-06-11T19:36:17+00:00 · Latest: 2025-12-04T18:28:33+00:00
Comments: Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025. 34 pages, 26 figures
Abstract
We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge; a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.
中文标题/摘要
标题:路径通道和计划扩展核:Sokoban RNN规划的机理描述
我们部分逆向工程了一个通过无模型强化学习训练的卷积递归神经网络(RNN),使其能够玩推箱子游戏Sokoban。我们发现,RNN将未来的动作(计划)存储在隐藏状态的特定通道中,我们称之为路径通道。特定位置的高激活意味着当箱子位于该位置时,它将被推到该通道指定的方向。我们检查了路径通道之间的卷积核,发现它们编码了每种可能动作导致的位置变化,从而代表了学习到的部分转移模型。RNN通过从箱子和目标开始构建计划。这些核将路径通道中的激活向前扩展到箱子,向后扩展到目标。在障碍物处放置负值会使得扩展核反向传播负值,从而修剪最后几步,让另一种计划浮现;一种形式的回溯。我们的工作表明,对计划表示的精确理解使我们能够直接用更熟悉的术语理解无模型训练中学习到的双向规划算法。
Summary / 总结
This study partially reverse-engineers a convolutional recurrent neural network (RNN) trained on Sokoban to reveal that the RNN stores future moves (plans) in specific channels of the hidden state, termed path channels. These channels indicate the direction a box will be pushed when in a particular location. The convolutional kernels between path channels encode the change in position for each action, representing a learned transition model. The RNN constructs plans by starting at the boxes and goals, using extension kernels to propagate activations forward from boxes and backward from the goal. Negative values at obstacles cause the extension kernels to prune the last few steps, allowing alternative plans to emerge through backtracking. This work demonstrates that understanding the plan representation helps in comprehending the bidirectional planning algorithm learned by the RNN during model-free training.
研究部分逆向工程了一个在Sokoban上训练的卷积循环神经网络(RNN),发现RNN将未来的移动(计划)存储在隐藏状态的特定通道中,称为路径通道。这些通道指示当箱子处于特定位置时将被推的方向。路径通道之间的卷积核编码每个动作的位置变化,代表了一个学习到的转移模型。RNN通过从箱子和目标开始构建计划,使用扩展核将激活向前从箱子传播并从目标向后传播。障碍物处的负值导致扩展核修剪最后几步,通过回溯允许替代计划的出现。这项工作表明,理解计划表示有助于理解模型自由训练期间学习到的双向规划算法。
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang
First: 2025-12-04T18:15:27+00:00 · Latest: 2025-12-04T18:15:27+00:00
Comments: Code: https://github.com/hustvl/4DLangVGGT, Webpage: https://hustvl.github.io/4DLangVGGT
Abstract
Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT
中文标题/摘要
标题:4DLangVGGT:四维语言-视觉几何接地变换器
构建四维语言场对于具身AI、增强/虚拟现实以及四维场景理解至关重要,因为它们提供了动态环境的丰富语义表示,并在复杂场景中支持开放词汇查询。然而,现有的四维语义场构建方法主要依赖于场景特定的高斯点积,这需要逐场景优化,表现出有限的泛化能力,并难以扩展到实际应用。为了解决这些限制,我们提出了4DLangVGGT,这是一种基于变换器的前馈统一框架,用于四维语言接地,该框架在单一架构中联合整合了几何感知和语言对齐。4DLangVGGT有两个关键组件:四维视觉几何变换器StreamVGGT,用于捕获动态场景的时空几何表示;以及语义桥梁解码器(SBD),将几何感知特征投影到语言对齐的语义空间,从而增强语义可解释性同时保持结构保真度。与依赖于昂贵的逐场景优化的先前方法不同,4DLangVGGT可以在多个动态场景上联合训练,并在推理时直接应用,实现了部署效率和强大的泛化能力。这种设计显著提高了大规模部署的实用性,并建立了开放词汇四维场景理解的新范式。在HyperNeRF和Neu3D数据集上的实验表明,我们的方法不仅泛化效果良好,还在逐场景训练和多场景训练下分别实现了高达2%和1%的性能提升。我们的代码发布在https://github.com/hustvl/4DLangVGGT
Towards a unified framework for guided diffusion models
Authors: Yuchen Jiao, Yuxin Chen, Gen Li
First: 2025-12-04T16:55:20+00:00 · Latest: 2025-12-04T16:55:20+00:00
Abstract
Guided or controlled data generation with diffusion models has become a cornerstone of modern generative modeling. Despite substantial advances in diffusion model theory, the theoretical understanding of guided diffusion samplers remains severely limited. We make progress by developing a unified algorithmic and theoretical framework that accommodates both diffusion guidance and reward-guided diffusion. Aimed at fine-tuning diffusion models to improve certain rewards, we propose injecting a reward guidance term -- constructed from the difference between the original and reward-reweighted scores -- into the backward diffusion process, and rigorously quantify the resulting reward improvement over the unguided counterpart. As a key application, our framework shows that classifier-free guidance (CFG) decreases the expected reciprocal of the classifier probability, providing the first theoretical characterization of the specific performance metric that CFG improves for general target distributions. When applied to reward-guided diffusion, our framework yields a new sampler that is easy to train and requires no full diffusion trajectories during training. Numerical experiments further corroborate our theoretical findings. Partial preliminary results of this work appeared in the International Conference on Machine Learning 2025 (li2025provable).
中文标题/摘要
标题:迈向统一的引导扩散模型框架
带有扩散模型的引导或控制数据生成已成为现代生成建模的基石。尽管在扩散模型理论方面取得了重大进展,但对引导扩散采样器的理论理解仍然非常有限。我们通过开发一个统一的算法和理论框架取得了进展,该框架可以同时容纳扩散引导和奖励引导扩散。旨在微调扩散模型以提高某些奖励,我们提出将奖励引导项——由原始分数和奖励加权分数之差构建——注入反向扩散过程,并严格量化与未引导的对应物相比的奖励改进。作为关键应用,我们的框架表明,无分类器引导(CFG)降低了分类器概率的期望倒数,首次为通用目标分布提供了CFG改进的具体性能指标的理论表征。当应用于奖励引导扩散时,我们的框架产生了一种新的采样器,该采样器易于训练,并且在训练过程中不需要完整的扩散轨迹。数值实验进一步证实了我们的理论发现。
Summary / 总结
The research aims to develop a unified framework for guided diffusion models to enhance theoretical understanding and practical applications. The method involves integrating a reward guidance term into the backward diffusion process to improve specific rewards. Key findings include the theoretical characterization of classifier-free guidance (CFG) and the development of a new reward-guided diffusion sampler that is easy to train and does not require full diffusion trajectories during training.
研究旨在开发统一框架以增强指导扩散模型的理论理解与实际应用。方法是将奖励指导项整合到反向扩散过程中以提高特定奖励。关键发现包括对无分类器指导(CFG)的理论表征以及开发了一种新的奖励导向扩散采样器,该采样器易于训练且在训练过程中不需要完整的扩散轨迹。
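For readers unfamiliar with the objects named in the guided-diffusion abstract above, the standard form of classifier-free guidance and the metric the abstract says it decreases can be written out as follows. The notation ($s_t$ for the score at time $t$, $c$ for the condition, $w \ge 0$ for the guidance weight, $p_t(c \mid x)$ for the classifier probability) is assumed here for illustration and may differ from the paper's own.

```latex
% Standard classifier-free guidance: extrapolate from the unconditional
% score toward the conditional one with weight w >= 0.
\[
  s_t^{w}(x \mid c) \;=\; s_t(x \mid c) + w\,\bigl(s_t(x \mid c) - s_t(x)\bigr)
  \;=\; (1+w)\,s_t(x \mid c) - w\,s_t(x).
\]
% By Bayes' rule, the difference of the two scores is the gradient of the
% log classifier probability, so CFG adds an implicit classifier gradient:
\[
  s_t(x \mid c) - s_t(x) \;=\; \nabla_x \log p_t(c \mid x).
\]
% Metric the abstract states CFG decreases (expected reciprocal of the
% classifier probability), for a generated sample X:
\[
  \mathbb{E}\!\left[\frac{1}{p(c \mid X)}\right].
\]
```

The first two identities are the usual textbook statement of CFG. The third line only restates the quantity named in the abstract; the precise theorem and its conditions are in the paper itself.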
Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models
Authors: NaHyeon Park, Namin An, Kunhee Kim, Soyeon Yoon, Jiahao Huo, Hyunjung Shim
First: 2025-12-04T16:52:45+00:00 · Latest: 2025-12-04T16:52:45+00:00
Comments: Project page: https://fairpro-t2i.github.io
Abstract
Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024-prompt benchmark spanning four levels of linguistic complexity and systematically evaluate demographic bias across multiple attributes. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
中文标题/摘要
标题:对齐但刻板?LVLM 基础文本到图像模型中社会偏见的隐秘影响
基于大型视觉语言模型(LVLM)的文本到图像(T2I)系统已成为图像生成的主导范式,但它们是否放大了社会偏见仍不够了解。在本文中,我们展示了基于LVLM的模型生成的社会偏见图像明显多于非LVLM基础模型。我们引入了一个包含四个语言复杂度级别的1024个提示基准,并以系统的方式评估了多个属性上的人口统计学偏见。我们的分析确定系统提示,即引导LVLM的预定义指令,是偏见行为的主要驱动因素。通过解码中间表示、标记概率诊断和嵌入关联分析,我们揭示了系统提示如何编码人口统计学先验并传播到图像合成中。为此,我们提出了FairPro,一种无需训练的元提示框架,使LVLM能够在测试时自我审计并构建公平意识的系统提示。在两个基于LVLM的T2I模型SANA和Qwen-Image上的实验表明,FairPro在保持文本图像对齐的同时显著减少了人口统计学偏见。我们认为我们的发现提供了对系统提示在偏见传播中核心作用的更深入见解,并提供了一种实用的、可部署的方法来构建更具社会责任感的T2I系统。
Summary / 总结
This paper investigates the social bias in text-to-image models based on large vision-language models (LVLMs) and finds that these models produce more biased images than non-LVLM-based models. By analyzing system prompts, which are predefined instructions guiding LVLMs, the authors identify them as a key factor in generating biased images. They propose FairPro, a framework that helps LVLMs self-audit and construct fairness-aware system prompts, reducing demographic bias without losing text-image alignment.
本文研究了大型视觉语言模型(LVLM)基于文本到图像(T2I)系统的社会偏见问题,发现这些模型生成的社会偏见图像比非LVLM模型更多。作者引入了一个包含1,024个提示的基准来系统地评估人口统计学偏见,并提出了一种无需训练的元提示框架FairPro,以减少偏见同时保持文本与图像的对齐。实验结果显示,FairPro能有效减少人口统计学偏见。
A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
Authors: X. Y. Han, Yuan Zhong
First: 2025-12-03T16:00:02+00:00 · Latest: 2025-12-04T16:34:28+00:00
Abstract
In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
中文标题/摘要
标题:无辅助损失的稀疏Mixture-of-Experts负载均衡的理论框架
在大规模AI训练中,稀疏Mixture-of-Experts (s-MoE)层通过每令牌激活一小部分专家来实现扩展。此设计中的一个操作挑战是负载均衡:通过最小化空闲专家的数量来路由令牌,这对于高效利用(昂贵的)GPU至关重要。我们提供了一个理论框架来分析由DeepSeek的Wang等人(2024年)提出的无辅助损失的负载均衡(ALF-LB)过程,将其视为每迭代一步的原始对偶方法来解决分配问题。首先,在一个简化的确定性环境中,我们的框架揭示了几个重要的结构特性:(i) 拉格朗日目标的单调改进,(ii) 一种偏好规则,将令牌从过载专家移动到欠载专家,以及(iii) 一种近似平衡保证。然后,我们通过广义在线优化形式来纳入AI训练的随机性和动态性。在在线设置中,我们推导出目标函数的强凸性性质,这在某些步长选择下导致了对数期望后悔界。此外,我们还展示了针对1B参数DeepSeekMoE模型的实际实验,以补充我们的理论发现。这些结果共同构建了一个分析AI模型中s-MoE的无辅助损失负载均衡的原理框架。
Summary / 总结
The paper provides a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure in Sparse Mixture-of-Experts (s-MoE) layers, which is crucial for efficient GPU utilization in large-scale AI training. The framework is first applied in a deterministic setting, revealing structural properties such as monotonic improvement of a Lagrangian objective and an approximate-balancing guarantee. It then extends to a stochastic and dynamic setting, deriving a logarithmic expected regret bound. Real experiments on DeepSeekMoE models validate the theoretical findings, offering a principled approach to s-MoE load balancing.
论文提供了一种分析Sparse Mixture-of-Experts (s-MoE) 层中无辅助损失负载均衡(ALF-LB)过程的理论框架,这对于大规模AI训练中的高效GPU利用至关重要。该框架被表述为一个分配问题的原始-对偶方法,并包括了拉格朗日目标单调改进、负载均衡的偏好规则以及近似平衡保证等洞察。研究还考虑了AI训练的随机性和动态性,并在某些步长选择下推导出对数期望遗憾界。实验证实在1B参数的DeepSeekMoE模型上支持了理论发现。
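The ALF-LB abstract above describes a procedure in which per-expert biases act like dual variables: tokens are routed by biased scores, then each bias takes one small step toward balance. A minimal sketch of that loop follows. The sign-based update, the step size, and the helper names `route_top_k` and `update_bias` are assumptions chosen for illustration, not the exact rule analyzed in the paper.

```python
from typing import List


def route_top_k(scores: List[List[float]], bias: List[float], k: int) -> List[List[int]]:
    """For each token, pick the k experts with the largest (score + bias)."""
    assignments = []
    for token_scores in scores:
        shifted = [s + b for s, b in zip(token_scores, bias)]
        order = sorted(range(len(shifted)), key=lambda e: shifted[e], reverse=True)
        assignments.append(order[:k])
    return assignments


def update_bias(bias: List[float], assignments: List[List[int]], step: float) -> List[float]:
    """One dual-style step: lower the bias of overloaded experts, raise underloaded ones."""
    load = [0] * len(bias)
    for chosen in assignments:
        for expert in chosen:
            load[expert] += 1
    mean_load = sum(load) / len(bias)

    new_bias = []
    for b, l in zip(bias, load):
        if l > mean_load:
            new_bias.append(b - step)  # overloaded: make this expert less attractive
        elif l < mean_load:
            new_bias.append(b + step)  # underloaded: make this expert more attractive
        else:
            new_bias.append(b)
    return new_bias


if __name__ == "__main__":
    # Four tokens, two experts; every token initially prefers expert 0.
    scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
    bias = [0.0, 0.0]
    first = route_top_k(scores, bias, k=1)
    bias = update_bias(bias, first, step=0.25)
    print(first, bias, route_top_k(scores, bias, k=1))
```

In this toy run a single bias update moves the two tokens with the weakest preference for expert 0 over to expert 1, which is the "overloaded to underloaded" behavior the abstract lists as property (ii).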
LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics
Authors: Weiye Shi, Zhaowei Zhang, Shaoheng Yan, Yaodong Yang
First: 2025-12-04T16:26:42+00:00 · Latest: 2025-12-04T16:26:42+00:00
Abstract
Large language models (LLMs) demonstrate remarkable potential across diverse language-related tasks, yet whether they capture deeper linguistic properties, such as syntactic structure, phonetic cues, and metrical patterns, from raw text remains unclear. To analyze whether LLMs can learn these features effectively and apply them to important natural-language tasks, we introduce a novel multilingual genre classification dataset derived from Project Gutenberg, a large-scale digital library offering free access to thousands of public domain literary works. The dataset comprises thousands of sentences per binary task (poetry vs. novel; drama vs. poetry; drama vs. novel) in six languages (English, French, German, Italian, Spanish, and Portuguese). We augment each with three explicit linguistic feature sets (syntactic tree structures, metaphor counts, and phonetic metrics) to evaluate their impact on classification performance. Experiments demonstrate that although LLM classifiers can learn latent linguistic structures either from raw text or from explicitly provided features, different features contribute unevenly across tasks, which underscores the importance of incorporating more complex linguistic signals during model training.
中文标题/摘要
标题:LLMs 知识远超文字:一种涉及句法、隐喻与音韵的体裁研究
大型语言模型(LLMs)在多种语言相关任务中展现出显著潜力,但它们是否能够捕捉到更深层次的语言特性,如句法结构、音素提示和韵律模式,仍然不清楚。为了分析LLMs是否能够有效学习这些特征并应用于重要的自然语言相关任务,我们引入了一个新的多语言体裁分类数据集,该数据集源自Project Gutenberg,这是一个提供数千篇公共领域文学作品的大型数字图书馆,包含六种语言(英语、法语、德语、意大利语、西班牙语和葡萄牙语)的数千个句子,每种二元任务(诗歌 vs. 小说;戏剧 vs. 隐喻;戏剧 vs. 小说)都有三个显式的语言特征集(句法树结构、隐喻计数和音韵指标)来评估其对分类性能的影响。实验表明,尽管LLM分类器可以从原始文本或明确提供的特征中学习潜在的语言结构,但不同特征在不同任务中的贡献不均,这突显了在模型训练过程中整合更复杂语言信号的重要性。
Summary / 总结
This study investigates whether large language models (LLMs) can learn and apply deeper linguistic properties such as syntax, metaphor, and phonetics for genre classification. A multilingual dataset from Project Gutenberg was created, including thousands of sentences in six languages for three binary genre classification tasks. The experiments show that LLMs can learn these features from both raw text and explicitly provided linguistic features, but the contribution of different features varies across tasks, highlighting the need for incorporating complex linguistic signals during training.
该研究探讨了大型语言模型(LLMs)是否能够学习和应用更深层次的语言特性,如句法、隐喻和音韵。从Project Gutenberg创建了一个多语言体裁分类数据集,包含六种语言数千个句子,用于三个二元分类任务。实验表明,LLM分类器可以从原始文本和明确提供的语言特征中学习,但不同任务的效果不同,强调了在模型训练中需要结合更复杂的语言信号的重要性。
FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via neural Action Tokenization
Authors: Yicheng Liu, Shiduo Zhang, Zibin Dong, Baijun Ye, Tianyuan Yuan, Xiaopeng Yu, Linqi Yin, Chenhao Lu, Junhao Shi, Luca Jiang-Tao Yu, Liangtao Zheng, Tao Jiang, Jingjing Gong, Xipeng Qiu, Hang Zhao
First: 2025-12-04T16:21:38+00:00 · Latest: 2025-12-04T16:21:38+00:00
Abstract
Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.
中文标题/摘要
标题:FASTer: 通过神经动作分词实现高效自回归视觉语言动作建模
自回归视觉-语言-动作(VLA)模型最近在机器人操作方面展示了强大的能力。然而,它们的核心动作分词过程通常会在重建保真度和推理效率之间进行权衡。我们引入了FASTer,这是一种统一的高效且可泛化的机器人学习框架,该框架结合了一个可学习的分词器和基于它的自回归策略。FASTerVQ 将动作片段编码为单通道图像,捕获全局时空依赖关系的同时保持高压缩比。FASTerVLA 在此基础上使用块状自回归解码和轻量级动作专家,实现更快的推理和更高的任务性能。广泛的实验表明,FASTerVQ 提供了卓越的重建质量、高分词利用率和强大的跨任务和跨载体泛化能力,而 FASTerVLA 进一步提高了整体能力,在推理速度和任务性能方面均超越了之前的最先进的 VLA 模型。
Summary / 总结
The research aims to improve the efficiency and generalizability of autoregressive vision-language-action models for robotic manipulation. The method is a unified framework called FASTer, which pairs a learnable tokenizer with an autoregressive policy built on it. FASTerVQ encodes action chunks as single-channel images, maintaining a high compression ratio while capturing global spatio-temporal dependencies. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, improving inference speed and task performance. Experiments show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA surpasses previous state-of-the-art VLA models in both inference speed and task performance.
研究旨在提高自回归视觉-语言-动作模型在机器人操作中的效率和泛化能力。方法包括一个名为FASTerVQ的可学习分词器,它高效地编码动作片段,同时保持高质量的重建效果。FASTerVLA在此分词器基础上使用块状自回归解码和轻量级动作专家,进一步提升推理速度和任务性能。实验表明,FASTerVQ在重建质量和分词利用率方面优于先前模型,而FASTerVLA在速度和性能上进一步超越了最先进的视觉-语言-动作模型,在各种基准测试中表现出色。
"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
Authors: Ziyi Zhang, Zhen Sun, Zongmin Zhang, Zifan Peng, Yuemeng Zhao, Zichun Wang, Zeren Luo, Ruiting Zuo, Xinlei He
First: 2025-05-07T15:03:16+00:00 · Latest: 2025-12-04T16:15:45+00:00
Comments: 17 pages
Abstract
The visually impaired population faces significant challenges in daily activities. While prior works employ vision language models for assistance, most focus on static content and cannot address real-time perception needs in complex environments. Recent VideoLLMs enable real-time vision and speech interaction, offering promising potential for assistive tasks. In this work, we conduct the first study evaluating their effectiveness in supporting daily life for visually impaired individuals. We first conduct a user survey with visually impaired participants to design the benchmark VisAssistDaily for daily life evaluation. Using VisAssistDaily, we evaluate popular VideoLLMs and find that GPT-4o achieves the highest task success rate. We further conduct a user study that reveals concerns about hazard perception. To address this, we propose SafeVid, an environment-awareness dataset, and fine-tune VITA-1.5, improving risk recognition accuracy from 25.00% to 76.00%. We hope this work provides valuable insights and inspiration for future research in this field.
中文标题/摘要
标题:"我能看到永远!": 评估实时视频LLM在辅助视觉障碍个体中的有效性
视觉障碍人群在日常活动中面临重大挑战。虽然先前的工作利用视觉语言模型进行辅助,但大多数都集中在静态内容上,无法解决复杂环境中实时感知的需求。最近的视频LLM能够实现实时视觉和语音交互,为辅助任务提供了巨大的潜力。在本研究中,我们首次评估了它们在支持视觉障碍个体日常生活的有效性。我们首先对视觉障碍参与者进行了用户调查,设计了用于日常生活的基准测试VisAssistDaily。使用VisAssistDaily,我们评估了流行的视频LLM,并发现GPT-4o的任务成功率最高。我们进一步进行了一项用户研究,揭示了对危险感知的担忧。为了解决这一问题,我们提出了SafeVid,一种环境感知数据集,并对VITA-1.5进行了微调,将风险识别准确性从25.00%提高到76.00%。我们希望这项工作为该领域的未来研究提供有价值的见解和灵感。
Multi-Agent Reinforcement Learning for Intraday Operating Rooms Scheduling under Uncertainty
Authors: Kailiang Liu, Ying Chen, Ralf Borndörfer, Thorsten Koch
First: 2025-12-04T15:47:08+00:00 · Latest: 2025-12-04T15:47:08+00:00
Abstract
Intraday surgical scheduling is a multi-objective decision problem under uncertainty, balancing elective throughput, urgent and emergency demand, delays, sequence-dependent setups, and overtime. We formulate the problem as a cooperative Markov game and propose a multi-agent reinforcement learning (MARL) framework in which each operating room (OR) is an agent trained with centralized training and decentralized execution. All agents share a policy trained via Proximal Policy Optimization (PPO), which maps rich system states to actions, while a within-epoch sequential assignment protocol constructs conflict-free joint schedules across ORs. A mixed-integer pre-schedule provides reference starting times for electives; we impose type-specific quadratic delay penalties relative to these references and a terminal overtime penalty, yielding a single reward that captures throughput, timeliness, and staff workload. In simulations reflecting a realistic hospital mix (six ORs, eight surgery types, random urgent and emergency arrivals), the learned policy outperforms six rule-based heuristics across seven metrics and three evaluation subsets, and, relative to an ex post MIP oracle, quantifies optimality gaps. Policy analytics reveal interpretable behavior: prioritizing emergencies, batching similar cases to reduce setups, and deferring lower-value electives. We also derive a suboptimality bound for the sequential decomposition under simplifying assumptions. We discuss limitations, including OR homogeneity and the omission of explicit staffing constraints, and outline extensions. Overall, the approach offers a practical, interpretable, and tunable data-driven complement to optimization for real-time OR scheduling.
中文标题/摘要
标题:不确定性条件下日内手术室调度的多智能体强化学习
日内手术调度是一个在不确定性条件下多目标决策问题,需要平衡择期手术量、紧急和急诊需求、延迟、顺序相关的设置以及加班。我们将问题形式化为合作马尔可夫博弈,并提出一个多智能体强化学习(MARL)框架,其中每个手术室(OR)是一个通过集中训练和分散执行训练的智能体。所有智能体共享一个通过近端策略优化(PPO)训练的策略,该策略将丰富的系统状态映射为动作,而每轮内的顺序分配协议构建了OR之间的无冲突联合调度。混合整数预调度提供择期手术的参考开始时间;我们对这些参考施加类型特定的二次延迟惩罚,并施加一个终端加班惩罚,产生一个综合了吞吐量、及时性和工作人员工作量的单一奖励。在反映现实医院情况(六个OR,八种手术类型,随机的紧急和急诊到达)的模拟中,学习到的策略在七个指标和三个评估子集上均优于六种基于规则的启发式方法,并且相对于事后MIP优化器,量化了最优性差距。策略分析揭示了可解释的行为-优先处理紧急情况、批量处理相似病例以减少设置以及推迟低价值的择期手术。我们还在简化假设下推导了顺序分解的次优性界。我们讨论了限制因素,包括OR同质性和未明确包含的人员配置约束,并概述了扩展。总体而言,该方法为实时手术室调度提供了实用、可解释且可调节的数据驱动补充,与优化方法相结合。
Summary / 总结
The paper addresses the challenge of intraday surgical scheduling under uncertainty by formulating the problem as a cooperative Markov game and proposing a multi-agent reinforcement learning (MARL) framework. Each operating room is an agent trained with centralized training and decentralized execution, using Proximal Policy Optimization (PPO) to map system states to actions. The approach outperforms six rule-based heuristics across seven metrics and three evaluation subsets in simulations, and provides interpretable behavior such as prioritizing emergencies and deferring lower-value electives. The method also offers a practical, interpretable, and tunable data-driven solution for real-time OR scheduling.
论文通过将内日手术排程问题表述为合作马尔可夫游戏,并提出了一种基于多智能体强化学习(MARL)的框架来解决在不确定性下的内日手术排程问题。每个手术室作为一个智能体,采用集中训练和分散执行的方式,并使用近端策略优化(PPO)来将系统状态映射到行动。该方法在模拟中在七个指标和三个评估子集上优于六种基于规则的启发式方法,并提供了可解释的行为,如优先处理紧急情况和推迟低价值的择期手术。该方法还提供了一种实用、可解释且可调的数据驱动解决方案,用于实时手术室排程。
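The reward described in the abstract above combines type-specific quadratic delay penalties (relative to pre-scheduled reference start times) with a terminal overtime penalty. The following is a minimal sketch of that idea only. The function name, the weight values, the linear overtime term, and the additive form are assumptions for illustration; the listing does not give the paper's exact formulation.

```python
def schedule_reward(starts, references, surgery_types,
                    delay_weights, overtime_min, overtime_weight):
    """Return a negative cost for one day's schedule (hypothetical form).

    starts / references: actual and reference start times in minutes.
    delay_weights: per-surgery-type weight on the squared delay (assumed values).
    overtime_min: minutes worked past the end of the regular shift.
    """
    delay_cost = 0.0
    for start, ref, s_type in zip(starts, references, surgery_types):
        delay = max(start - ref, 0.0)          # starting early is not penalized here
        delay_cost += delay_weights[s_type] * delay ** 2
    overtime_cost = overtime_weight * max(overtime_min, 0.0)
    return -(delay_cost + overtime_cost)


# Example: one elective starts 10 minutes late, one urgent case is on time,
# and the day ends with 30 minutes of overtime.
reward = schedule_reward(
    starts=[490.0, 600.0],
    references=[480.0, 600.0],
    surgery_types=["elective", "urgent"],
    delay_weights={"elective": 0.01, "urgent": 0.1},
    overtime_min=30.0,
    overtime_weight=0.5,
)
```

The quadratic term makes one long delay cost more than several short ones of the same total length, which is one plausible reason to prefer it over a linear penalty.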
Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment
Authors: Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen, Xubin Li, Tiezheng Ge, Limin Wang
Venue: AAAI 2026
First: 2025-11-27T11:35:08+00:00 · Latest: 2025-12-04T15:44:45+00:00
Comments: Accepted by AAAI 2026
Abstract
Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3x, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64x64 and 256x256. Our code is available at https://github.com/MCG-NJU/FlowBack.
中文标题/摘要
标题:逆向表示对齐:通过反向表示对齐改进流动模型
流动模型(NFs)是一类生成模型,以其数学可逆的架构为特征,其中前向传递将数据转换为潜在空间进行密度估计,而反向传递则从该空间生成新的样本。这一特性在表示学习和数据生成之间创造了内在的协同作用。然而,标准NFs的生成质量受限于从对数似然优化中获得的较差语义表示。为了解决这一问题,我们提出了一种新颖的对齐策略,创造性地利用了NFs的可逆性:而不是正则化前向传递,我们对生成(反向)传递的中间特征与强视觉基础模型的表示进行对齐,显示出比简单对齐更优越的效果。我们还引入了一种新的无需训练、测试时的优化算法,用于分类,这为NF嵌入的语义知识提供了更内在的评估。全面的实验表明,我们的方法不仅将NFs的训练加速了3.3倍以上,还在生成质量和分类准确性方面取得了显著的改进。在ImageNet 64×64和256×256上,我们建立了NFs的新最佳结果。我们的代码可在https://github.com/MCG-NJU/FlowBack获取。
Summary / 总结
This paper addresses the limited generative quality of standard Normalizing Flows (NFs), which stems from poor semantic representations. It introduces a novel alignment strategy that aligns the intermediate features of the reverse (generative) pass with those from a powerful vision foundation model, improving both generative quality and classification accuracy. It also introduces a training-free, test-time optimization algorithm for classification. Experiments show training accelerated by over 3.3x and new state-of-the-art results for NFs on ImageNet 64x64 and 256x256.
本文针对标准归一化流(NFs)因语义表示不佳而导致生成高质量数据能力有限的问题,提出了一种新颖的对齐策略,该策略将生成(逆向)过程中的中间特征与视觉基础模型的表示对齐,从而提高了生成质量和分类准确性。实验显示训练速度提高了3.3倍,并在ImageNet上取得了新的最佳结果。还提出了一种无需训练的优化算法,在测试时评估NFs的内在知识。
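For readers unfamiliar with the forward/reverse terminology in the entry above, the standard change-of-variables identity behind normalizing flows is sketched below. The symbols $f$ (the invertible forward map), $p_Z$ (the latent prior), and $p_X$ (the model density) are notation chosen here for illustration, not taken from the paper.

```latex
% Forward pass (density estimation): map data x to latent z = f(x)
\log p_X(x) = \log p_Z\bigl(f(x)\bigr)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|

% Reverse pass (generation): sample a latent and invert the map
z \sim p_Z, \qquad x = f^{-1}(z)
```

Log-likelihood training maximizes the first expression. The alignment described in the abstract acts instead on intermediate features of the second, generative direction.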
EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?
Authors: Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre
First: 2025-11-26T15:52:56+00:00 · Latest: 2025-12-04T15:22:57+00:00
Abstract
Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs. All codes and pretrained models are available at https://github.com/pierreadorni/EoS-FM.
中文标题/摘要
标题:EoS-FM:专家模型集合能否充当通用特征提取器?
基础模型在自然语言处理和计算机视觉等领域取得了巨大进展,类似的努力现在也在地球观测领域出现。这些模型旨在在有限监督的情况下泛化到各种任务,减少为每个任务单独训练模型的需要。然而,当前的策略主要集中在扩大模型规模和数据集规模上,这需要巨大的计算和数据资源,限制了其仅对少数大型机构的可用性。此外,这种不断扩大的模型范式与可持续和环境友好的人工智能原则背道而驰,因为它导致了巨大的碳足迹和资源低效。在本文中,我们提出了一种新颖且高效的替代方案:用于构建遥感基础模型(RSFM)的专家模型集合框架。我们的方法将训练过程分解为轻量级、任务特定的ConvNeXtV2专家,这些专家可以冻结并重用。这种模块化方法在效率、可解释性和可扩展性方面具有明显优势。此外,它自然支持联邦训练、剪枝和连续专家集成,使其特别适合协作和资源受限的环境。我们的框架为构建可扩展和高效的RSFM指明了新方向。所有代码和预训练模型均可在https://github.com/pierreadorni/EoS-FM获取。
Summary / 总结
This paper introduces EoS-FM, an Ensemble-of-Specialists framework for Remote Sensing Foundation Models (RSFMs) that addresses the limitations of current large-scale models by offering a more efficient and sustainable approach. The method uses lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused, providing advantages in efficiency, interpretability, and extensibility. The framework supports federated training, pruning, and continuous specialist integration, making it suitable for collaborative and resource-constrained settings. Code and pretrained models are publicly available.
本文提出了EoS-FM,一种用于遥感基础模型(RSFM)的专家集合框架,通过提供更高效和可持续的方法来解决当前大规模模型的局限性。该方法使用可重用的轻量级、任务特定的ConvNeXtV2专家,提供效率、可解释性和可扩展性的优势。该框架支持联邦训练和持续专家集成,特别适合资源受限的环境。实验表明,EoS-FM在遥感任务中有效,无需大规模模型。
Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems
Authors: M Zeeshan, Saud Satti
First: 2025-12-04T15:22:28+00:00 · Latest: 2025-12-04T15:22:28+00:00
Comments: 5 pages, 2 figures, IEEE Transactions on Dependable and Secure Computing
Abstract
Multimodal Artificial Intelligence (AI) systems, particularly Vision-Language Models (VLMs), have become integral to critical applications ranging from autonomous decision-making to automated document processing. As these systems scale, they rely heavily on preprocessing pipelines to handle diverse inputs efficiently. However, this dependency on standard preprocessing operations, specifically image downscaling, creates a significant yet often overlooked security vulnerability. While intended for computational optimization, scaling algorithms can be exploited to conceal malicious visual prompts that are invisible to human observers but become active semantic instructions once processed by the model. Current adversarial strategies remain largely static, failing to account for the dynamic nature of modern agentic workflows. To address this gap, we propose Chameleon, a novel, adaptive adversarial framework designed to expose and exploit scaling vulnerabilities in production VLMs. Unlike traditional static attacks, Chameleon employs an iterative, agent-based optimization mechanism that dynamically refines image perturbations based on the target model's real-time feedback. This allows the framework to craft highly robust adversarial examples that survive standard downscaling operations to hijack downstream execution. We evaluate Chameleon against the Gemini 2.5 Flash model. Our experiments demonstrate that Chameleon achieves an Attack Success Rate (ASR) of 84.5% across varying scaling factors, significantly outperforming static baseline attacks, which average only 32.1%. Furthermore, we show that these attacks effectively compromise agentic pipelines, reducing decision-making accuracy by over 45% in multi-step tasks. Finally, we discuss the implications of these vulnerabilities and propose multi-scale consistency checks as a necessary defense mechanism.
中文标题/摘要
标题:变色龙:基于缩放的视觉提示注入适应性对抗代理在多模态AI系统中的应用
多模态人工智能(AI)系统,特别是视觉-语言模型(VLMs),已成为从自主决策到自动化文档处理等关键应用的重要组成部分。随着这些系统的扩展,它们依赖于预处理管道来高效处理各种输入。然而,对标准预处理操作,特别是图像缩放的依赖,创造了一个重要的但经常被忽视的安全漏洞。虽然缩放算法旨在进行计算优化,但它们可以被利用来隐藏对人类观察者不可见但被模型处理后成为有效语义指令的恶意视觉提示。当前的对抗策略大多保持静态,未能考虑到现代代理工作流程的动态性。为了解决这一差距,我们提出了变色龙,这是一种新颖的、适应性的对抗框架,旨在揭示并利用生产VLMs中的缩放漏洞。与传统的静态攻击不同,变色龙采用了一种迭代的、基于代理的优化机制,根据目标模型的实时反馈动态细化图像扰动。这使得框架能够生成高度鲁棒的对抗样本,这些样本能够生存下来标准的缩放操作,从而劫持下游执行。我们使用Gemini 2.5 Flash模型对变色龙进行了评估。我们的实验表明,变色龙在不同缩放因子下的攻击成功率(ASR)为84.5%,远高于平均32.1%的静态基线攻击。此外,我们展示了这些攻击有效地破坏了代理管道,在多步骤任务中使决策准确性降低了超过45%。最后,我们讨论了这些漏洞的影响,并提出了多尺度一致性检查作为必要的防御机制。
Summary / 总结
Chameleon is an adaptive adversarial framework designed to exploit scaling vulnerabilities in Vision-Language Models (VLMs). Unlike static attacks, Chameleon uses an iterative, agent-based optimization mechanism to refine image perturbations based on real-time feedback from the target model. Experiments show that Chameleon achieves an Attack Success Rate of 84.5% across different scaling factors, significantly outperforming static attacks which average 32.1%. The attacks reduce decision-making accuracy by over 45% in multi-step tasks, highlighting the need for multi-scale consistency checks as a defense mechanism.
Chameleon 是一个适应性对抗框架,旨在利用视觉语言模型(VLMs)中的缩放漏洞。与静态攻击不同,Chameleon 会根据目标模型的实时反馈迭代优化图像扰动。实验表明,Chameleon 在不同缩放因子下的攻击成功率高达 84.5%,远超静态攻击的平均水平。这些攻击会破坏代理管道,使多步任务的决策准确性降低超过 45%。
You Only Train Once (YOTO): A Retraining-Free Object Detection Framework
Authors: Priyanto Hidayatullah, Nurjannah Syakrani, Yudi Widhiyasana, Muhammad Rizqi Sholahuddin, Refdinal Tubagus, Zahri Al Adzani Hidayat, Hanri Fajar Ramadhan, Dafa Alfarizki Pratama, Farhan Muhammad Yasin
First: 2025-12-04T15:15:43+00:00 · Latest: 2025-12-04T15:15:43+00:00
Comments: under review in the Elsevier Engineering Journal
Abstract
Object detection is a primary task in computer vision and is used in numerous domains. Nonetheless, it continues to suffer from catastrophic forgetting: the model must be retrained whenever new products are introduced, using not only the new-product dataset but also the entire previous dataset. The result is higher training cost and significant time consumption. In numerous sectors, particularly retail checkout, the frequent introduction of new products presents a great challenge. This study introduces You Only Train Once (YOTO), a methodology designed to address catastrophic forgetting by integrating YOLO11n for object localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning. For classification, we use cosine similarity between the embedding features of the target product and those in the Qdrant vector database. In a case study conducted in a retail store with 140 products, the experimental results demonstrate that our proposed framework achieves encouraging accuracy for detecting both new and existing products. Furthermore, because no retraining is needed, the difference in training duration is significant: we achieve almost 3 times the training time efficiency compared to classical object detection approaches. This efficiency increases as additional new products are added to the product database. The average inference time is 580 ms per image containing multiple products on an edge device, validating the proposed framework's feasibility for practical use.
中文标题/摘要
标题:你只需训练一次(YOTO):一种无需重新训练的目标检测框架
目标检测是计算机视觉领域的主要任务,被广泛应用于多个领域。然而,目标检测仍然面临灾难性遗忘的问题。每当引入新产品时,模型需要重新训练,不仅需要新产品的数据集,还需要整个之前的数据集。结果显而易见:增加了模型训练成本和大量时间消耗。在许多领域,尤其是零售结账领域,频繁引入新产品是一个巨大挑战。本研究引入了你只需训练一次(YOTO),一种通过结合YOLO11n进行目标定位、DeIT和Proxy Anchor Loss进行特征提取和度量学习的方法来解决灾难性遗忘问题。对于分类,我们使用目标产品嵌入特征与Qdrant向量数据库中特征的余弦相似度。在一家拥有140种产品的零售店进行的案例研究中,实验结果表明,我们提出的框架在检测新产品或现有产品时均取得了令人鼓舞的准确性。此外,无需重新训练,训练时间差异显著。我们实现了与经典目标检测方法相比几乎3倍的训练时间效率。随着产品数据库中新增产品的数量增加,这种效率会进一步提高。在边缘设备上,每张包含多个产品的图像平均推理时间为580毫秒,验证了所提框架在实际应用中的可行性。
Summary / 总结
This study addresses catastrophic forgetting in object detection by proposing You Only Train Once (YOTO), which integrates YOLO11n for localization with DeIT and Proxy Anchor Loss for feature extraction and metric learning, and classifies products by cosine similarity against embeddings stored in a Qdrant vector database. In a retail case study with 140 products, the framework achieves encouraging accuracy for both new and existing products, almost 3 times the training time efficiency of classical object detection approaches, and an average inference time of 580 ms per image on an edge device.
研究通过提出You Only Train Once (YOTO) 方法,整合YOLO11n、DeIT和Proxy Anchor Loss,解决了对象检测中的灾难性遗忘问题。该框架在包含140种产品的零售店中,对新旧产品都表现出高精度,相比传统方法,训练时间效率提高了近三倍。平均推理时间为每张包含多个产品的图像580毫秒,证明了该框架的实际可行性。
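The classification step in the YOTO entry above (cosine similarity between a detected product's embedding and stored embeddings) can be illustrated with a small sketch. This is not the authors' code: an in-memory dictionary stands in for the Qdrant vector database, the three-dimensional vectors stand in for real embeddings, and the `threshold` parameter is a hypothetical addition for rejecting unknown products.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(query, gallery, threshold=0.5):
    """Return (label, score) of the most similar stored embedding.

    gallery: dict mapping product label -> list of reference embeddings.
    Returns (None, score) when the best score is below the threshold.
    """
    best_label, best_score = None, -1.0
    for label, embeddings in gallery.items():
        for emb in embeddings:
            score = cosine_similarity(query, emb)
            if score > best_score:
                best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


gallery = {"cola": [[1.0, 0.0, 0.0]], "chips": [[0.0, 1.0, 0.0]]}
label_before, _ = classify([0.9, 0.1, 0.0], gallery)

# Adding a new product only requires storing its embedding; nothing is retrained.
gallery["gum"] = [[0.0, 0.0, 1.0]]
label_after, _ = classify([0.0, 0.1, 0.95], gallery)
```

In this sketch, registering a new product is a dictionary insert, which mirrors why a retrieval-based classifier avoids retraining when the catalogue grows.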
SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms
Authors: Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu
First: 2025-12-04T15:11:43+00:00 · Latest: 2025-12-04T15:11:43+00:00
Comments: https://github.com/Jeffry-wen/SDG-Track
Abstract
Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our code is publicly available at https://github.com/Jeffry-wen/SDG-Track.
中文标题/摘要
标题:SDG-Track:嵌入式平台高分辨率无人机跟踪的异构观察者-跟随框架
在边缘设备上实时跟踪小型无人机面临着分辨率与速度的基本冲突。将高分辨率图像下采样为标准检测输入大小会导致小目标特征低于可检测阈值。然而,在资源受限的平台上处理原生1080p帧会因吞吐量不足而无法实现平滑的云台控制。我们提出SDG-Track,一种稀疏检测引导跟踪器,采用观察者-跟随架构来解决这一冲突。观察者流在GPU上以低频率运行高容量检测器,从1920x1080帧中提供准确的位置锚点。跟随者流在CPU上通过区域约束稀疏光流进行高频率轨迹插值。为处理由光谱相似干扰物引起的遮挡或模型漂移导致的跟踪失败,我们引入了双空间恢复机制,这是一种无需训练的重新获取机制,结合了颜色直方图匹配与几何一致性约束。在地面到空中跟踪站上的实验表明,SDG-Track实现了35.1 FPS系统吞吐量,同时保留了97.2%的逐帧检测精度。该系统在NVIDIA Jetson Orin Nano上成功跟踪了现实世界操作条件下的敏捷FPV无人机。我们的论文代码已公开发布在https://github.com/Jeffry-wen/SDG-Track
Summary / 总结
SDG-Track addresses the challenge of real-time tracking of small UAVs on edge devices by proposing an Observer-Follower architecture. The Observer runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from high-resolution frames, while the Follower performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures, Dual-Space Recovery combines color histogram matching with geometric consistency constraints. Experiments show that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision, and that it successfully tracks agile FPV drones under real-world conditions on an NVIDIA Jetson Orin Nano.
SDG-Track通过使用Observer-Follower架构解决实时UAV跟踪中的分辨率-速度冲突。Observer在GPU上运行高容量检测器提供准确的位置锚点,而Follower在CPU上进行高频轨迹插值。系统引入了Dual-Space Recovery来处理跟踪失败,结合颜色直方图匹配和几何一致性约束。实验显示SDG-Track实现了35.1 FPS的吞吐量,保持了97.2%的帧间检测精度,并成功在真实世界条件下跟踪敏捷的FPV无人机。
Autoregressive Image Generation Needs Only a Few Lines of Cached Tokens
Authors: Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, Weiyao Lin
First: 2025-12-04T14:41:21+00:00 · Latest: 2025-12-04T14:41:21+00:00
Abstract
Autoregressive (AR) visual generation has emerged as a powerful paradigm for image and multimodal synthesis, owing to its scalability and generality. However, existing AR image generation suffers from severe memory bottlenecks due to the need to cache all previously generated visual tokens during decoding, leading to both high storage requirements and low throughput. In this paper, we introduce LineAR, a novel, training-free progressive key-value (KV) cache compression pipeline for autoregressive image generation. By fully exploiting the intrinsic characteristics of visual attention, LineAR manages the cache at the line level using a 2D view, preserving the visual dependency regions while progressively evicting less-informative tokens that are harmless for subsequent line generation, guided by inter-line attention. LineAR enables efficient AR image generation by utilizing only a few lines of cache, achieving both memory savings and throughput speedup, while maintaining or even improving generation quality. Extensive experiments across six autoregressive image generation models, including class-conditional and text-to-image generation, validate its effectiveness and generality. LineAR improves ImageNet FID from 2.77 to 2.68 and COCO FID from 23.85 to 22.86 on LlamaGen-XL and Janus-Pro-1B, while retaining only 1/6 of the KV cache. It also improves DPG on Lumina-mGPT-768 with just 1/8 of the KV cache. Additionally, LineAR achieves significant memory and throughput gains, including up to 67.61% memory reduction and 7.57x speedup on LlamaGen-XL, and 39.66% memory reduction and 5.62x speedup on Janus-Pro-7B.
中文标题/摘要
标题:自回归图像生成仅需几行缓存令牌
自回归(AR)视觉生成已成为图像和多模态合成的强大范式,得益于其可扩展性和通用性。然而,现有的AR图像生成由于解码过程中需要缓存所有之前生成的视觉令牌而遭受严重的内存瓶颈,导致高存储需求和低吞吐量。本文介绍了一种名为LineAR的新型、无需训练的渐进式键值(KV)缓存压缩管道,用于自回归图像生成。通过充分利用视觉注意力的内在特性,LineAR在二维视图中按行级管理缓存,保留视觉依赖区域的同时,逐步淘汰对后续行生成无害的、信息量较少的令牌,由行间注意力引导。LineAR通过仅使用几行缓存实现高效的自回归(AR)图像生成,同时实现内存节省和吞吐量提升,同时保持或甚至提高生成质量。在六个自回归图像生成模型中,包括类别条件和文本到图像生成的广泛实验验证了其有效性和通用性。LineAR在LlamaGen-XL和Janus-Pro-1B上将ImageNet FID从2.77提高到2.68,COCO FID从23.85提高到22.86,同时仅保留1/6的KV缓存。它还在Lumina-mGPT-768上仅使用1/8的KV缓存提高了DPG。此外,LineAR实现了显著的内存和吞吐量增益,包括在LlamaGen-XL上高达67.61%的内存减少和7.57倍的速度提升,在Janus-Pro-7B上则为39.66%的内存减少和5.62倍的速度提升。
Summary / 总结
This paper addresses the memory bottleneck in autoregressive (AR) image generation by introducing LineAR, a training-free method that compresses the key-value cache. LineAR uses a 2D view to manage cache at the line level, preserving visual dependencies while evicting less-informative tokens. This approach reduces memory usage and increases throughput without compromising generation quality, as demonstrated by improvements in FID scores and memory/throughput gains across various AR models.
本文通过引入LineAR,一种无需训练的缓存压缩方法,解决了自回归(AR)图像生成中的内存瓶颈问题。LineAR采用2D视图在行级别管理缓存,保留视觉依赖关系的同时逐级移除对后续行生成影响较小的无用令牌。该方法减少了内存使用并提高了吞吐量,同时保持或提升了生成质量,如在各种AR模型中所验证的FID分数和内存/吞吐量增益。
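To make the "few lines of cached tokens" idea in the entry above concrete, the sketch below shows a plain sliding window over image rows in raster-scan decoding. This is a deliberate simplification and not LineAR itself: the paper's eviction is progressive and guided by inter-line attention, whereas this example keeps tokens purely by recency. The function name and parameters are hypothetical.

```python
def retained_token_indices(num_generated, line_width, keep_lines):
    """Indices of cached tokens kept under a recency-only line window.

    Tokens are assumed to be generated in raster order, so token i belongs
    to image line i // line_width. Only the current (possibly partial) line
    and the keep_lines - 1 lines before it stay in the cache.
    """
    if num_generated <= 0 or keep_lines <= 0:
        return []
    current_line = (num_generated - 1) // line_width
    first_kept_line = max(0, current_line - keep_lines + 1)
    return list(range(first_kept_line * line_width, num_generated))


# With 4 tokens per line and a 2-line window, after 10 generated tokens
# the cache holds all of line 1 and the partial line 2.
kept = retained_token_indices(num_generated=10, line_width=4, keep_lines=2)
```

Under this scheme the cache size is bounded by `keep_lines * line_width` tokens regardless of image height, which is the source of the memory savings in any line-windowed approach.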
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Authors: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu
First: 2025-10-22T09:57:13+00:00 · Latest: 2025-12-04T14:28:04+00:00
Comments: https://gigabrain0.github.io/
Abstract
Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
中文标题/摘要
标题:GigaBrain-0:一种基于世界模型的视觉-语言-行动模型
训练通用机器人视觉-语言-行动(VLA)模型通常需要大量的真实世界机器人数据,这收集起来既昂贵又耗时。物理数据收集的低效严重限制了当前VLA系统的可扩展性和泛化能力。为了解决这一挑战,我们引入了GigaBrain-0,这是一种由世界模型生成数据(例如视频生成、真实到真实转移、人类转移、视角转移、模拟到真实转移数据)赋能的新型VLA基础模型。通过利用世界模型生成大规模的多样化数据,GigaBrain-0显著减少了对真实机器人数据的依赖,同时提高了跨任务的泛化能力。我们的方法进一步通过RGBD输入建模和具身思维链(CoT)监督,提高了策略的鲁棒性,使模型在执行任务时能够推理空间几何、物体状态和长时依赖关系。这在灵巧、长时依赖和移动操作任务上带来了显著的现实世界性能提升。大量实验表明,GigaBrain-0在外观变化(例如纹理、颜色)、物体摆放和摄像机视角等方面实现了卓越的泛化能力。此外,我们还介绍了GigaBrain-0-Small,这是一种优化的轻量级变体,旨在高效运行在NVIDIA Jetson AGX Orin等设备上。
Summary / 总结
GigaBrain-0 is a Vision-Language-Action foundation model that uses world model-generated data to reduce reliance on expensive real-world robot data while improving cross-task generalization. RGBD input modeling and embodied Chain-of-Thought supervision further improve policy robustness, yielding gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks and superior generalization across variations in appearance, object placement, and camera viewpoint. A lightweight variant, GigaBrain-0-Small, targets devices such as the NVIDIA Jetson AGX Orin.
GigaBrain-0 是一个视觉-语言-行动基础模型,通过使用世界模型生成的数据来减少对昂贵的真实世界机器人数据的依赖,增强跨任务泛化能力和策略鲁棒性。它通过 RGBD 输入建模和具身思维链监督,提高了在灵巧、长时序和移动操作任务中的实际性能,实现了在各种任务变化和摄像头视角下的优越泛化能力。
Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
Authors: Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li
First: 2025-05-21T12:18:15+00:00 · Latest: 2025-12-04T14:24:47+00:00
Comments: https://github.com/xtong-zhang/Chain-of-Focus
Abstract
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. However, the multimodal reasoning capability has not been fully explored in existing models. In this paper, we propose a Chain-of-Focus (CoF) method that allows VLMs to adaptively focus and zoom in on key image regions based on obtained visual cues and the given questions, achieving efficient multimodal reasoning. To enable this CoF capability, we present a two-stage training pipeline, including supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct the MM-CoF dataset, comprising 3K samples derived from a visual agent designed to adaptively identify key regions to solve visual tasks with different image resolutions and questions. We use MM-CoF to fine-tune the Qwen2.5-VL model for cold start. In the RL stage, we leverage the outcome accuracies and formats as rewards to update the Qwen2.5-VL model, further refining the model's search and reasoning strategy without human priors. Our model achieves significant improvements on multiple benchmarks. On the V* benchmark, which requires strong visual reasoning capability, our model outperforms existing VLMs by 5% across 8 image resolutions ranging from 224 to 4K, demonstrating the effectiveness of the proposed CoF method and facilitating more efficient deployment of VLMs in practical applications.
中文标题/摘要
标题:基于动态视觉搜索与缩放的自适应焦点链推理方法以提高高效VLMs
视觉语言模型(VLMs)在各种计算机视觉任务中取得了令人印象深刻的性能。然而,现有的模型尚未充分探索其多模态推理能力。本文提出了一种焦点链(CoF)方法,使VLMs能够根据获得的视觉线索和给定的问题,自适应地聚焦并放大关键图像区域,实现高效的多模态推理。为了使VLMs具备CoF能力,我们提出了一种两阶段训练管道,包括监督微调(SFT)和强化学习(RL)。在SFT阶段,我们构建了MM-CoF数据集,包含3000个样本,这些样本来自一个视觉代理,该代理能够自适应地识别关键区域以解决不同图像分辨率和问题的视觉任务。我们使用MM-CoF对Qwen2.5-VL模型进行冷启动微调。在RL阶段,我们利用结果准确性和格式作为奖励来更新Qwen2.5-VL模型,从而进一步优化模型的搜索和推理策略,无需人类先验知识。我们的模型在多个基准测试中取得了显著改进。在V*基准测试中,该基准测试要求强大的视觉推理能力,我们的模型在从224到4K的8种图像分辨率中优于现有VLMs 5%,证明了所提出的CoF方法的有效性,并促进了VLMs在实际应用中的更高效部署。
Summary / 总结
This paper introduces a Chain-of-Focus (CoF) method for VLMs to perform adaptive focusing and zooming on key image regions based on visual cues and questions. It uses a two-stage training pipeline, including supervised fine-tuning and reinforcement learning, to improve multimodal reasoning. The model shows significant improvements on multiple benchmarks, outperforming existing VLMs by 5% on the V* benchmark across various image resolutions, highlighting the effectiveness of the CoF method for efficient VLM deployment.
本文提出了一种Chain-of-Focus (CoF) 方法,使VLM能够在获得视觉线索和问题的基础上,对关键图像区域进行自适应聚焦和放大。该方法采用两阶段训练流程,包括监督微调和强化学习,以提高多模态推理能力。模型在多个基准测试中表现出显著改进,在V*基准测试中,该模型在从224到4K的8种不同图像分辨率下,比现有VLMs提高了5%,证明了CoF方法的有效性,并促进了VLM在实际应用中的高效部署。
FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis
Authors: Shijie Chen, Peixi Peng
First: 2025-12-04T14:14:21+00:00 · Latest: 2025-12-04T14:14:21+00:00
Comments: Novel View Synthesis, Driving Scene, Free Trajectory, Image Generation
Abstract
Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.
中文标题/摘要
标题:FreeGen:前馈重建-生成协同训练在自由视角驾驶场景合成中的应用
闭环模拟和可扩展预训练需要合成自由视角驾驶场景。然而,现有数据集和生成管道很少提供一致的离轨迹观测,限制了大规模评估和训练。尽管最近的生成模型展示了强大的视觉真实性,但在无需场景优化的情况下同时实现插值一致性和外推真实性方面仍存在困难。为了解决这个问题,我们提出FreeGen,一种用于自由视角驾驶场景合成的前馈重建-生成协同训练框架。重建模型提供稳定的几何表示以确保插值一致性,而生成模型则进行几何感知增强以提高在未见过视角的真实性。通过协同训练,生成先验知识被提炼到重建模型中以改善离轨迹渲染,而细化的几何结构反过来为生成提供了更强的结构指导。实验表明,FreeGen 在自由视角驾驶场景合成中达到了最先进的性能。
Summary / 总结
The research aims to address the limitations of existing datasets and generative pipelines in synthesizing consistent off-trajectory observations for autonomous driving. The proposed FreeGen framework uses a feed-forward reconstruction-generation co-training approach, where the reconstruction model ensures interpolation consistency and the generation model enhances realism. Experiments show that FreeGen outperforms existing methods in free-viewpoint driving scene synthesis, achieving state-of-the-art performance.
研究旨在解决现有数据集和生成管道在提供自动驾驶仿真和训练所需的离轨迹观察一致性方面的局限性。提出的FreeGen框架采用前向重建-生成联合训练方法,其中重建模型确保插值一致性,生成模型增强现实感。实验表明,FreeGen在自由视角驾驶场景合成方面优于现有方法,达到最先进的性能。
Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Authors: Futa Waseda, Antonio Tejero-de-Pablos, Isao Echizen
Venue: WACV 2026
First: 2024-05-29T05:20:02+00:00 · Latest: 2025-12-04T13:44:07+00:00
Comments: WACV 2026 Accepted. Code available at https://github.com/CyberAgentAI/multimodal-adversarial-training
Abstract
Pre-trained vision-language (VL) models are highly vulnerable to adversarial attacks. However, existing defense methods primarily focus on image classification, overlooking two key aspects of VL tasks: multimodal attacks, where both image and text can be perturbed, and the one-to-many relationship of images and texts, where a single image can correspond to multiple textual descriptions and vice versa (1:N and N:1). This work is the first to explore defense strategies against multimodal attacks in VL tasks, whereas prior VL defense methods focus on vision robustness. We propose multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities during training, significantly outperforming existing unimodal defenses. Furthermore, we discover that MAT is limited by deterministic one-to-one (1:1) image-text pairs in VL training data. To address this, we conduct a comprehensive study on leveraging one-to-many relationships to enhance robustness, investigating diverse augmentation techniques. Our analysis shows that, for a more effective defense, augmented image-text pairs should be well-aligned, diverse, yet avoid distribution shift -- conditions overlooked by prior research. This work pioneers defense strategies against multimodal attacks, providing insights for building robust VLMs from both optimization and data perspectives. Our code is publicly available at https://github.com/CyberAgentAI/multimodal-adversarial-training.
中文标题/摘要
标题:利用一对多关系的多模态对抗防御方法
预训练的视觉-语言(VL)模型对对抗攻击极为敏感。然而,现有的防御方法主要集中在图像分类上,忽视了VL任务中的两个关键方面:多模态攻击,其中图像和文本都可以被扰动,以及一对多关系,即一个图像可以对应多个文本描述,反之亦然(1:N和N:1)。这项工作是首次探索VL任务中对抗多模态攻击的防御策略,而之前的VL防御方法主要关注视觉鲁棒性。我们提出了多模态对抗训练(MAT),在训练过程中同时在图像和文本模态中引入对抗扰动,显著优于现有的单模态防御方法。此外,我们发现MAT受限于VL训练数据中确定的一对一(1:1)图像-文本对。为了解决这个问题,我们对利用一对多关系增强鲁棒性进行了全面研究,探讨了多种增强技术。我们的分析表明,为了更有效的防御,增强的图像-文本对应该很好地对齐,多样化,但要避免分布偏移——这是之前研究中忽略的条件。这项工作开创了对抗多模态攻击的防御策略,从优化和数据两个角度提供了构建鲁棒VL模型的见解。我们的代码可在https://github.com/CyberAgentAI/multimodal-adversarial-training获取。
Summary / 总结
This work addresses the vulnerability of pre-trained vision-language models to adversarial attacks by proposing multimodal adversarial training (MAT), which incorporates adversarial perturbations in both image and text modalities. The method significantly outperforms existing unimodal defenses. The study also highlights the limitations of deterministic one-to-one image-text pairs in VL training data and proposes leveraging one-to-many relationships to enhance robustness. The research provides insights for building more robust vision-language models from both optimization and data perspectives.
该研究针对预训练的视觉-语言模型对抗攻击的脆弱性,提出了多模态对抗训练(MAT)方法,该方法在图像和文本模态中都引入了对抗扰动,显著优于现有的单模态防御方法。研究还指出了确定性的一对一图像-文本对的局限性,并提出利用一对多关系来增强鲁棒性,建议增强的图像-文本对应是良好对齐、多样化的,并避免分布偏移。这是首次探索视觉-语言任务中多模态攻击的防御策略,为从优化和数据两个角度构建鲁棒的视觉-语言模型提供了见解。
ASTRIDE: A Security Threat Modeling Platform for Agentic-AI Applications
Authors: Eranga Bandara, Amin Hass, Ross Gore, Sachin Shetty, Ravi Mukkamala, Safdar H. Bouk, Xueping Liang, Ng Wee Keong, Kasun De Zoysa, Aruna Withanage, Nilaan Loganathan
First: 2025-12-04T13:32:40+00:00 · Latest: 2025-12-04T13:32:40+00:00
Abstract
AI agent-based systems are becoming increasingly integral to modern software architectures, enabling autonomous decision-making, dynamic task execution, and multimodal interactions through large language models (LLMs). However, these systems introduce novel and evolving security challenges, including prompt injection attacks, context poisoning, model manipulation, and opaque agent-to-agent communication, that are not effectively captured by traditional threat modeling frameworks. In this paper, we introduce ASTRIDE, an automated threat modeling platform purpose-built for AI agent-based systems. ASTRIDE extends the classical STRIDE framework by introducing a new threat category, A for AI Agent-Specific Attacks, which encompasses emerging vulnerabilities such as prompt injection, unsafe tool invocation, and reasoning subversion, unique to agent-based applications. To automate threat modeling, ASTRIDE combines a consortium of fine-tuned vision-language models (VLMs) with the OpenAI-gpt-oss reasoning LLM to perform end-to-end analysis directly from visual agent architecture diagrams, such as data flow diagrams (DFDs). LLM agents orchestrate the end-to-end threat modeling automation process by coordinating interactions between the VLM consortium and the reasoning LLM. Our evaluations demonstrate that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems. To the best of our knowledge, ASTRIDE is the first framework to both extend STRIDE with AI-specific threats and integrate fine-tuned VLMs with a reasoning LLM to fully automate diagram-driven threat modeling in AI agent-based applications.
中文标题/摘要
标题:ASTRIDE:面向代理AI应用的安全威胁建模平台
基于AI代理的系统正逐渐成为现代软件架构中的重要组成部分,通过大型语言模型(LLMs)实现自主决策、动态任务执行和多模态交互。然而,这些系统引入了新型且不断演变的安全挑战,包括提示注入攻击、上下文污染、模型操控和代理间不透明的通信,这些挑战未能被传统的威胁建模框架有效捕捉。在本文中,我们介绍了ASTRIDE,一个专为基于代理的AI系统设计的自动化威胁建模平台。ASTRIDE通过引入一个新的威胁类别A(针对AI代理的特定攻击),扩展了经典的STRIDE框架,该类别涵盖了诸如提示注入、不安全工具调用和推理篡改等新兴漏洞,这些漏洞是代理应用特有的。为了自动化威胁建模,ASTRIDE结合了一个由微调的视觉-语言模型(VLMs)组成的联盟和OpenAI-gpt-oss推理LLM,直接从视觉代理架构图(如数据流图DFDs)进行端到端分析。LLM代理协调整个威胁建模自动化过程,协调VLM联盟与推理LLM之间的交互。我们的评估表明,ASTRIDE能够为下一代智能系统提供准确、可扩展和可解释的威胁建模。据我们所知,ASTRIDE是第一个扩展STRIDE以包含AI特定威胁并结合微调的VLMs与推理LLM以完全自动化基于代理的AI应用中的图驱动威胁建模的框架。
Summary / 总结
ASTRIDE is an automated threat modeling platform designed for AI agent-based systems, extending the classical STRIDE framework to include a new category A for AI-specific attacks. It uses a consortium of fine-tuned vision-language models and the OpenAI-gpt-oss reasoning LLM to analyze visual agent architecture diagrams, automating the threat modeling process. Evaluations show that ASTRIDE provides accurate, scalable, and explainable threat modeling for next-generation intelligent systems, being the first framework to integrate fine-tuned VLMs with a reasoning LLM for this purpose.
ASTRIDE 是一个专为基于 AI 代理的系统设计的自动化威胁建模平台,它扩展了经典的 STRIDE 框架,新增了针对 AI 代理特定攻击的威胁类别 A。它使用由微调的视觉语言模型组成的联盟和 OpenAI-gpt-oss 推理 LLM 来分析可视化的代理架构图,从而自动化威胁建模过程。评估表明,ASTRIDE 能提供准确、可扩展且可解释的威胁建模,是第一个将微调的 VLM 与推理 LLM 集成、用于基于 AI 代理的应用中图驱动威胁建模的框架。
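The category extension described above can be pictured as a small data structure. The following is a minimal sketch, not ASTRIDE's actual code: the six STRIDE categories are the standard ones, while the class name `AstrideCategory`, the `EXAMPLE_THREATS` mapping, and the `acronym` helper are hypothetical names introduced only for illustration.

```python
from enum import Enum


class AstrideCategory(Enum):
    """STRIDE categories plus the added 'A' category (names are illustrative)."""
    AI_AGENT_SPECIFIC = "A"        # new category described in the abstract
    SPOOFING = "S"
    TAMPERING = "T"
    REPUDIATION = "R"
    INFORMATION_DISCLOSURE = "I"
    DENIAL_OF_SERVICE = "D"
    ELEVATION_OF_PRIVILEGE = "E"


# The three example vulnerabilities the abstract places under category A.
EXAMPLE_THREATS = {
    "prompt injection": AstrideCategory.AI_AGENT_SPECIFIC,
    "unsafe tool invocation": AstrideCategory.AI_AGENT_SPECIFIC,
    "reasoning subversion": AstrideCategory.AI_AGENT_SPECIFIC,
}


def acronym() -> str:
    """Join the category letters in definition order."""
    return "".join(category.value for category in AstrideCategory)
```

How the platform represents categories internally is not stated in the abstract; the enum is only a convenient way to see that the added category sits alongside the six classical ones.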
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Authors: Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza
First: 2025-10-08T09:10:31+00:00 · Latest: 2025-12-04T13:17:07+00:00
Abstract
Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
中文标题/摘要
标题:TTRV:视觉语言模型的测试时强化学习
现有的强化学习中提取奖励信号的方法通常依赖于标记数据和专门的训练分割,这与人类直接从环境学习的方式不同。在本工作中,我们提出了TTRV,通过在推理时使模型实时适应,从而增强视觉语言理解,无需任何标记数据。具体而言,我们通过基于基模型输出频率设计奖励,结合多次对每个测试样本进行推理,改进了Group Relative Policy Optimization (GRPO)框架。此外,我们还提出通过同时奖励模型以获得输出经验分布的低熵来控制模型输出的多样性。我们的方法在对象识别和视觉问答(VQA)中均取得了持续的改进,分别提高了52.4%和29.8%,并在16个数据集中平均提高了24.6%和10.0%。值得注意的是,在图像识别方面,TTRV应用于InternVL 8B在8个基准测试中平均优于GPT-4o 2.3%,同时在VQA方面保持高度竞争力,表明测试时的强化学习可以匹配或超越最强的专有模型。最后,我们发现测试时的RL对于VLMs有许多有趣的特性:例如,在极端数据受限的场景中,即使在单个随机选择的未标记测试样本上进行适应,TTRV仍能带来高达5.5%的识别任务改进。
Summary / 总结
TTRV proposes a test-time reinforcement learning approach that enhances vision language models by adapting the model at inference time without labeled data. It extends GRPO with rewards based on the frequency of the base model's outputs over multiple inferences on each test sample, and controls output diversity by additionally rewarding low entropy of the empirical output distribution. TTRV achieves consistent gains in object recognition and visual question answering, with improvements of up to 52.4% and 29.8% respectively, and average boosts of 24.6% and 10.0% across 16 datasets. On image recognition, TTRV applied to InternVL 8B outperforms GPT-4o by 2.3% on average across 8 benchmarks while remaining competitive on VQA.
TTRV通过在推理时调整模型而不使用标注数据,利用基模型输出的频率和控制输出多样性来增强视觉语言理解。它在物体识别和视觉问答中实现了持续的改进,分别提高了最多52.4%和29.8%,并在16个数据集上平均提升了24.6%和10.0%。在图像识别方面,TTRV在8个基准测试中平均超过GPT-4o 2.3%,同时在视觉问答中保持竞争力。即使在数据受限的情况下,TTRV在识别任务中的改进也达到了最多5.5%。
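The two label-free signals described in the TTRV abstract, output frequency and entropy of the empirical output distribution, can be illustrated with a short sketch. This is an assumption-laden toy, not the authors' implementation: the function names are hypothetical, answers are treated as plain strings, and the abstract does not say how the two terms are weighted or combined, so they are computed separately here.

```python
import math
from collections import Counter


def frequency_rewards(answers):
    """Reward each sampled answer by its empirical frequency in the sample group.

    Assumption: answers from repeated inference on one test sample are
    comparable as exact strings.
    """
    counts = Counter(answers)
    total = len(answers)
    return [counts[answer] / total for answer in answers]


def empirical_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution.

    A lower value means the samples agree more; the abstract describes
    rewarding low entropy to control output diversity.
    """
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

For example, with samples `["cat", "cat", "dog", "cat"]` the majority answer receives a frequency reward of 0.75 and the minority answer 0.25, while a group of identical answers has zero entropy.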
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems
Authors: Massimo Bini, Ondrej Bohdal, Umberto Michieli, Zeynep Akata, Mete Ozay, Taha Ceritli
First: 2025-12-04T12:56:30+00:00 · Latest: 2025-12-04T12:56:30+00:00
Abstract
Memory-augmented Large Language Models (LLMs) have demonstrated remarkable consistency during prolonged dialogues by storing relevant memories and incorporating them as context. Such memory-based personalization is also key in on-device settings that allow users to keep their conversations and data private. However, memory-augmented systems typically rely on LLMs that are too costly for local on-device deployment. Even though Small Language Models (SLMs) are more suitable for on-device inference than LLMs, they cannot achieve sufficient performance. Additionally, these LLM-based systems lack native visual capabilities, limiting their applicability in multimodal contexts. In this paper, we introduce (i) MemLoRA, a novel memory system that enables local deployment by equipping SLMs with specialized memory adapters, and (ii) its vision extension MemLoRA-V, which integrates small Vision-Language Models (SVLMs) into memory systems, enabling native visual understanding. Following knowledge distillation principles, each adapter is trained separately for a specific memory operation: knowledge extraction, memory update, or memory-augmented generation. Equipped with memory adapters, small models enable accurate on-device memory operations without cloud dependency. On text-only operations, MemLoRA outperforms 10× larger baseline models (e.g., Gemma2-27B) and achieves performance comparable to 60× larger models (e.g., GPT-OSS-120B) on the LoCoMo benchmark. To evaluate visual understanding operations, we extend LoCoMo with challenging Visual Question Answering tasks that require direct visual reasoning. On these tasks, our VLM-integrated MemLoRA-V shows massive improvements over caption-based approaches (81.3 vs. 23.7 accuracy) while keeping strong performance in text-based tasks, demonstrating the efficacy of our method in multimodal contexts.
中文标题/摘要
标题:MemLoRA:为本地内存系统配备专家适配器
增强内存的大型语言模型(LLMs)在长时间对话中表现出显著的一致性,通过存储相关记忆并将其作为上下文进行整合。这种基于记忆的个性化在允许用户保持对话和数据隐私的本地设备设置中也至关重要。然而,增强内存的系统通常依赖于成本过高的LLMs,不适合本地设备部署。尽管小型语言模型(SLMs)比LLMs更适合本地推理,但它们无法达到足够的性能。此外,这些基于LLM的系统缺乏原生的视觉能力,限制了它们在多模态环境中的应用。在本文中,我们介绍了(i) MemLoRA,一种新颖的内存系统,通过为SLMs配备专门的记忆适配器实现本地部署,以及(ii) 其视觉扩展MemLoRA-V,将小型视觉-语言模型(SVLMs)集成到内存系统中,实现原生的视觉理解。遵循知识蒸馏原则,每个适配器分别针对特定的记忆操作进行训练——知识提取、记忆更新和增强记忆的生成。配备记忆适配器的小型模型能够在没有云依赖的情况下实现准确的本地内存操作。在仅文本操作上,MemLoRA在LoCoMo基准测试中优于10倍更大的基线模型(例如,Gemma2-27B),并在性能上与60倍更大的模型(例如,GPT-OSS-120B)相当。为了评估视觉理解操作,我们扩展了LoCoMo,加入了具有直接视觉推理要求的挑战性视觉问答任务。在这些任务上,我们的VLM集成的MemLoRA-V在准确率上大幅优于基于字幕的方法(81.3 vs. 23.7),同时在基于文本的任务上保持了强大的性能,证明了我们方法在多模态环境中的有效性。
Jina-VLM: Small Multilingual Vision Language Model
Authors: Andreas Koukounas, Georgios Mastrapas, Florian Hönicke, Sedigheh Eslami, Guillaume Roncari, Scott Martens, Han Xiao
First: 2025-12-03T18:13:41+00:00 · Latest: 2025-12-04T12:45:29+00:00
Comments: 18 pages, 1-7 main content, 13-18 appendix for tables and dataset
Abstract
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .
中文标题/摘要
标题:Jina-VLM:小型多语言视觉语言模型
我们提出了Jina-VLM,这是一种参数量为24亿的视觉-语言模型,在开放的2B规模VLM中实现了最先进的多语言视觉问答效果。该模型通过一种注意力池化连接器将SigLIP2视觉编码器与Qwen3语言骨干网络耦合在一起,能够高效处理任意分辨率的图像。该模型在标准VQA基准测试和多语言评估中取得了领先结果,同时保持了竞争力的纯文本性能。模型权重和代码已公开发布在https://huggingface.co/jinaai/jina-vlm 。
Summary / 总结
Jina-VLM is a 2.4 billion parameter vision-language model designed to excel in multilingual visual question answering. It integrates a SigLIP2 vision encoder with a Qwen3 language model via an attention-pooling connector, allowing efficient processing of images of varying resolutions. The model demonstrates superior performance on standard VQA benchmarks and multilingual evaluations, maintaining competitive performance on text-only tasks. The model and code are publicly available.
Jina-VLM 是一个 24 亿参数的视觉语言模型,旨在实现多语言视觉问答任务,并在开放的 2B 级别模型中达到了最先进的结果。它结合了 SigLIP2 视觉编码器和 Qwen3 语言骨干,并通过注意力池化连接器实现对不同分辨率图像的高效处理。该模型在标准 VQA 基准测试和多语言评估中表现出色,同时保持了竞争力的纯文本任务性能。模型权重和代码已公开发布。
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
Authors: Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu
First: 2025-12-04T12:17:25+00:00 · Latest: 2025-12-04T12:17:25+00:00
Abstract
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
中文标题/摘要
标题:E3AD:一种面向人类的端到端情绪感知视觉-语言-行动模型
端到端自动驾驶(AD)系统越来越多地采用视觉-语言-行动(VLA)模型,但通常会忽略乘客的情绪状态,这在舒适性和AD接受度方面至关重要。我们提出了开放域端到端(OD-E2E)自动驾驶,其中自动驾驶车辆(AV)必须解释自由形式的自然语言命令、推断情绪并规划一个物理上可行的轨迹。我们提出了E3AD,这是一种情绪感知的VLA框架,通过两个认知启发式的组件增强了语义理解:一种连续的正价-唤醒-支配(VAD)情绪模型,用于捕捉语言中的语气和紧迫感,以及一种双路径空间推理模块,将第一人称和第三人称视角融合以实现类人的空间认知。一种以一致性为导向的训练方案,结合模态预训练与偏好对齐,进一步确保了情绪意图与驾驶行为之间的连贯性。在真实世界数据集上,E3AD提高了视觉定位和航点规划,并实现了情绪估计的最新技术水平(SOTA)VAD相关性。这些结果表明,将情绪注入VLA风格的驾驶中可以产生更符合人类的定位、规划和以人为本的反馈。
Summary / 总结
E3AD is an emotion-aware VLA framework for autonomous driving that incorporates a continuous Valence-Arousal-Dominance emotion model and a dual-pathway spatial reasoning module. It improves visual grounding and waypoint planning and achieves SOTA VAD correlation for emotion estimation, demonstrating that integrating emotion into VLA-style driving leads to more human-aligned outcomes.
研究旨在通过将情感意识融入视觉-语言-行动模型来提升端到端的自动驾驶系统。提出了E3AD情感感知的VLA框架,该框架包括连续的VAD情感模型和双路径空间推理模块。该模型在视觉定位和路径规划上有所改进,并实现了情感估计的SOTA VAD相关性,展示了更好的与人类行为和反馈的对齐。
Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild
Authors: Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu, Gencheng Liu, Xinhai Chen, Siqi Shen, Songzhu Mei, Jie Liu
First: 2025-12-04T12:13:18+00:00 · Latest: 2025-12-04T12:13:18+00:00
Abstract
Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been open-sourced.
中文标题/摘要
标题:测量未言说之事:一种心理分析去纠缠模型及基准
在自然对话中的生成性心理分析面临两大根本挑战:(1) 现有的视觉-语言模型(VLMs)无法解决发音-情感模糊性问题,即视觉语音模式模仿情感表达;(2) 缺乏可验证的评估指标阻碍了视觉定位和推理深度的评估。我们提出了一整套生态系统来应对这些挑战。首先,我们引入了多层次洞察网络去纠缠(MIND),这是一种新颖的分层视觉编码器,引入了状态判断模块,基于其时间特征方差算法性抑制模糊唇部特征,实现显式的视觉去纠缠。其次,我们构建了ConvoInsight-DB,这是一个新的大规模数据集,包含专家标注的微表情和深层次心理推断。第三,我们设计了心理推理洞察评级指标(PRISM),这是一种自动化的多维度框架,使用专家指导的大规模语言模型来衡量大型心理视觉模型的多维度性能。在我们的PRISM基准上,MIND显著优于所有基线,微表情检测的性能提高了86.95%。消融研究证实,我们的状态判断去纠缠模块是实现这一性能飞跃的关键组件。我们的代码已开源。
Summary / 总结
The paper addresses the challenges of analyzing in-the-wild conversations by proposing MIND, a novel hierarchical visual encoder that disentangles articulatory-affective ambiguity. It introduces a Status Judgment module to suppress ambiguous lip features and constructs a new dataset, ConvoInsight-DB, for micro-expressions and deep psychological inference. The authors also developed PRISM, an automated dimensional framework, to evaluate large mental vision models. MIND outperforms existing methods by 86.95% in micro-expression detection on the PRISM benchmark, with the Status Judgment module being the key component for this improvement.
本文通过提出MIND,一种分层视觉编码器来解决在野对话分析中的语义-情感歧义问题,并构建了ConvoInsight-DB数据集,用于微表情和心理推理。作者还引入了PRISM,一种自动化的评估心理推理性能的度量标准。MIND在微表情检测上的表现比现有模型高出86.95%,其中Status Judgment模块是这一性能提升的关键组成部分。
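The idea of suppressing features according to their temporal variance can be sketched in a few lines. This is a rough illustration only and should not be read as MIND's Status Judgment module: the abstract does not give the rule, so the hard 0/1 gate, the fixed `threshold` parameter, and the choice to suppress high-variance channels (on the guess that rapidly changing lip features reflect articulation) are all assumptions, as are the function names.

```python
from statistics import pvariance


def temporal_variance(features):
    """Per-channel variance over time.

    `features` is a list of T frames, each a list of D floats.
    """
    return [pvariance(channel) for channel in zip(*features)]


def suppress_ambiguous(features, threshold):
    """Zero out channels whose temporal variance exceeds `threshold`.

    Assumption: high temporal variance marks speech-driven (ambiguous)
    channels. The real module may use a different or learned criterion.
    """
    variances = temporal_variance(features)
    gate = [0.0 if v > threshold else 1.0 for v in variances]
    return [[x * g for x, g in zip(frame, gate)] for frame in features]
```

With frames `[[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]` and a threshold of 1.0, the first channel (variance about 2.67) is zeroed and the constant second channel passes through unchanged.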
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models
Authors: Haiyang Yu, Mengyang Zhao, Jinghui Lu, Ke Niu, Yanjie Wang, Weijie Yin, Weitao Jia, Teng Fu, Yang Liu, Jun Liu, Hong Chen
First: 2025-03-06T03:19:56+00:00 · Latest: 2025-12-04T11:50:09+00:00
Abstract
Video subtitles play a crucial role in short videos and movies, as they not only help models better understand video content but also support applications such as video translation and content retrieval. Existing video subtitle extraction methods typically rely on multi-stage frameworks, where errors accumulate across stages and temporal dependencies are underutilized due to frame-wise processing. Moreover, although some Large Vision-Language Models (LVLMs) possess strong OCR capabilities, predicting accurate timestamps for subtitle texts remains challenging. To this end, we propose an End-to-end Video subtitle Extraction framework based on LVLMs, named EVE, which can output subtitles and their timestamps simultaneously. Specifically, we introduce a dual-branch Spatiotemporal Subtitle-Salient (S³) Module that serves as an adapter for LVLMs, capable of representing subtitle-related content and considering inter-frame correlations using only a small number of tokens. Within this module, the Spatial Semantic Context Aggregate branch aggregates high-level global semantics to provide spatial visual contextual information, while the Temporal Subtitle Token Query branch explicitly queries subtitle-relevant tokens while considering temporal correlation across frames. The small number of tokens retained by the S³ module are fed to the language model, which then directly outputs the subtitle text along with its timestamps. Furthermore, we construct the first large-scale dataset dedicated to video subtitle extraction, ViSa, containing over 2.5M videos with timestamped and bilingual annotation, thereby providing the community with a well-organized training and evaluation benchmark.
中文标题/摘要
标题:EVE:基于视觉语言模型的端到端视频字幕提取
视频字幕在短视频和电影中起着关键作用,不仅有助于模型更好地理解视频内容,还支持视频翻译和内容检索等应用。现有的视频字幕提取方法通常依赖多阶段框架,各阶段的错误会累积,且由于逐帧处理,时间依赖性被严重低估。此外,尽管一些大型视觉语言模型(LVLMs)具有强大的OCR能力,但预测字幕文本的准确时间戳仍然具有挑战性。为此,我们提出了一种基于LVLMs的端到端视频字幕提取框架EVE,该框架可以同时输出字幕及其时间戳。具体而言,我们引入了一种双分支时空字幕显著性(S³)模块,作为LVLMs的适配器,仅使用少量令牌即可表示与字幕相关的内容并考虑帧间相关性。在该模块中,空间语义上下文聚合分支聚合高层次的全局语义以提供空间视觉上下文信息,而时间字幕令牌查询分支则明确查询与字幕相关的令牌并考虑帧间的时间相关性。S³模块保留的少量令牌被送入语言模型,该模型直接输出字幕文本及其时间戳。此外,我们构建了首个专注于视频字幕提取的大规模数据集ViSa,包含超过250万条带有时间戳和双语注释的视频,从而为社区提供了一个组织良好的训练和评估基准。
Summary / 总结
The paper proposes EVE, an end-to-end video subtitle extraction framework using Large Vision-Language Models (LVLMs) to predict both subtitles and their timestamps simultaneously. It introduces a dual-branch Spatiotemporal Subtitle-Salient (S³) Module that aggregates spatial and temporal information with a small number of tokens, improving subtitle extraction accuracy. The ViSa dataset, containing over 2.5 million videos with timestamped and bilingual annotations, is also introduced to support training and evaluation of video subtitle extraction models.
论文提出了一种基于大型视觉-语言模型(LVLM)的端到端视频字幕提取框架EVE。该框架通过使用双分支时空字幕显著(S³)模块来表示与字幕相关的内容并考虑帧间相关性,解决了多阶段方法的局限性。S³模块聚合了空间和时间信息,语言模型直接输出字幕及其时间戳。作者还构建了首个用于视频字幕提取的大规模数据集ViSa,包含超过250万条带有时间戳和双语标注的视频,为该领域的训练和评估提供了良好的基准。